
repair: Add ranges_parallelism option #4848

Closed
wants to merge 1 commit

Conversation

@asias (Contributor) commented Aug 14, 2019

In commit 131acc0 (repair: Adjust parallelism according to memory size), we choose the number of ranges to repair in parallel automatically, according to the memory size.

However, this automatic number might not be the optimal one. For example, a user may want repair to have minimal impact on CQL reads and writes, or may want repair to run at full speed, allowing it more CPU and memory resources.

This patch introduces a ranges_parallelism option that controls the number of ranges we repair in parallel, so that advanced users can control the parallelism.

Here is an example:

In a cluster with high network latency, e.g., multiple DCs connected by a high-latency link, the user can increase ranges_parallelism to compensate.

For example, in a two-DC cluster with 7 nodes, a 60 ms round-trip time, RF (dc1:3, dc2:4), and a keyspace with 3 empty tables:

2019-04-10 11:00:03.303549 Repair with ks ranges_parallelism=1 on node 1 started ...
2019-04-10 11:32:32.673043 Repair with ks on node 1 finished ...

2019-04-10 11:32:32.673292 Repair with ks ranges_parallelism=16 on node 1 started ...
2019-04-10 11:36:49.475977 Repair with ks on node 1 finished ...

2019-04-10 11:36:49.476145 Repair with ks ranges_parallelism=256 on node 1 started ...
2019-04-10 11:38:04.242553 Repair with ks on node 1 finished ...

That is 1949s vs 257s vs 75s to complete the repair respectively, a 7X and 25X difference.
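As a sanity check (not part of the patch), the durations and speedups can be recomputed from the timestamps above with a few lines of Python:

```python
from datetime import datetime

fmt = "%Y-%m-%d %H:%M:%S.%f"
# (start, end) timestamps copied from the log lines above, keyed by ranges_parallelism.
runs = {
    1:   ("2019-04-10 11:00:03.303549", "2019-04-10 11:32:32.673043"),
    16:  ("2019-04-10 11:32:32.673292", "2019-04-10 11:36:49.475977"),
    256: ("2019-04-10 11:36:49.476145", "2019-04-10 11:38:04.242553"),
}

durations = {
    p: (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).total_seconds()
    for p, (start, end) in runs.items()
}
for p, secs in durations.items():
    print(f"ranges_parallelism={p}: {secs:.0f}s ({durations[1] / secs:.1f}x vs parallelism=1)")
# -> about 1949s, 257s and 75s, i.e. the ~7x and ~25x speedups quoted above.
```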

Fixes #4847

@slivne requested a review from @avikivity on August 19, 2019 20:31
@slivne (Contributor) commented Aug 19, 2019

@avikivity this is one of the changes we agreed on for row-level repair, to make it work over high-latency links

@slivne (Contributor) commented Sep 2, 2019

@avikivity this is one of the changes we agreed on for row-level repair, to make it work over high-latency links

@asias (Contributor, Author) commented Oct 9, 2019

ping.

@asias force-pushed the repair_add_ranges_parallelism branch from a8dd933 to bc605cf on October 9, 2019 09:22
@asias (Contributor, Author) commented Oct 9, 2019

Rebased to latest master.

@asias (Contributor, Author) commented Nov 19, 2019

@avikivity Ping. This is needed by 3.2.

@slivne (Contributor) commented Dec 4, 2019

@avikivity Ping. This is needed by 3.2.

@glommer (Contributor) commented Dec 9, 2019

The last ping on this was 21 days ago.
There has been no maintainer action: no request for changes, no merge.

Patches need to either move or be closed.

I am tagging all 4 people with maintainer status according to the maintainers file: @avikivity @nyh @tgrabiec @penberg

This is an unrelated discussion, but I think we need more maintainers who can commit code.

We can't have a pull request with no action sitting here for 3 weeks.

@nyh (Contributor) left a review:

I don't see anything wrong with this patch's code, but I don't like the approach:

Exactly as demonstrated in the commit message, the meaning of the "repair parallelism" is very different on high-latency vs low-latency links. While on low-latency links a low parallelism can nevertheless take a big chunk of the machine's CPU, disk and network capacity, on a high-latency link low parallelism can result in underused network, disk and CPU. Moreover, what does the "ranges_parallelism" number actually mean? What is "a range"? Are there more layers of parallelism inside each range (including read-ahead, socket buffers, etc.) or not? Is this a master-side parameter, a slave-side parameter, or what? Besides changing the parallelism (and noticeably the speed) of the repair, what does it do to the CPU or disk percentage used up by repair? As a user I wouldn't have a clue how to use this parameter. Although if @slivne or @glommer request this specific implementation as an emergency tweak for problems in the field, I'll oblige and commit this patch as-is.

If I think about how to solve this problem ideally, not in the way done here, I think that if the user is to configure anything, it should be the fraction of CPU, disk bandwidth and/or network bandwidth that they want to devote to repair (on all machines involved!). That is our traditional "boring database"/"zero configuration" approach. We already support CPU and disk schedulers (and usually assumed the network split is derived from the CPU/disk available), but we need to make sure repair actually uses them. In other words, what I envision is repair adaptively controlling its own concurrency, increasing it while the CPU/disk quotas of the involved machines aren't fully used (I think it is quite fine to overshoot a bit and use a slightly too high concurrency). This probably won't be easy to do, so maybe we need this patch in the interim?

Moreover, orthogonal to this specific patch, there is another big problem in our repair approach on high-latency high-throughput links. Consider that we have a very high latency. You'll need very high parallelism to compensate, but this - with the current code, which keeps in memory every row read for this entire latency - will need huge amounts of memory. A really good solution for this problem will use less memory in this case, perhaps by not caching all the rows in memory and hoping that only a few end up needing to be reread. This is why in the past I asked you to benchmark also the case of high latency with almost no inconsistencies in the data - this sort of repair should ideally be able to proceed at disk and network bandwidth, unrelated to the amount of memory or network latency.
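For illustration only, a minimal Python sketch of the adaptive-concurrency idea described in this review; the scheduler_utilization and repair_one_range helpers are hypothetical stand-ins rather than Scylla APIs, and the AIMD policy is just one possible choice:

```python
import asyncio
import random

TARGET_UTILIZATION = 0.8  # assumed target share of the repair scheduling-group quota

async def repair_one_range(rng):
    # Hypothetical stand-in for repairing one token range against the peers.
    await asyncio.sleep(random.uniform(0.01, 0.05))

def scheduler_utilization():
    # Hypothetical stand-in for reading how much of the CPU/disk quota the
    # repair scheduling group is currently using (0.0 .. 1.0).
    return random.uniform(0.3, 1.0)

async def adaptive_repair(ranges, min_parallelism=1, max_parallelism=256):
    parallelism = min_parallelism
    pending = list(ranges)
    while pending:
        batch, pending = pending[:parallelism], pending[parallelism:]
        await asyncio.gather(*(repair_one_range(r) for r in batch))
        # Additive increase while the quota is underused, multiplicative
        # decrease when repair starts saturating it.
        if scheduler_utilization() < TARGET_UTILIZATION:
            parallelism = min(parallelism + 1, max_parallelism)
        else:
            parallelism = max(parallelism // 2, min_parallelism)

asyncio.run(adaptive_repair(range(64)))
```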

},
{
"name":"ranges_parallelism",
"description":"An integer specifying the number of ranges to repair in parallel.",
@nyh (Contributor) commented on this snippet:

This number is so opaque... What integer should a user use here? 3? 100? 1000000? What is the default if I don't specify this? As a user, I wouldn't know how to use this option (although I guess it might come in handy for our field engineers?).

@asias (Contributor, Author) replied:

The default used to be 16. Now it is configured according to the amount of memory allowed for repair; see 131acc0.
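A minimal sketch of what deriving the parallelism from the repair memory budget could look like; the 10% budget (mentioned later in this thread) and the 32 MiB per-range figure are assumptions for illustration, the actual formula lives in commit 131acc0:

```python
def auto_ranges_parallelism(total_memory_bytes,
                            repair_memory_fraction=0.1,
                            memory_per_parallel_range=32 * 1024 * 1024):
    # Both the 10% budget and the 32 MiB per-range estimate are illustrative
    # assumptions, not Scylla's actual constants.
    budget = total_memory_bytes * repair_memory_fraction
    return max(1, int(budget // memory_per_parallel_range))

print(auto_ranges_parallelism(2 * 1024**3))  # a 2 GiB shard -> 6 ranges in parallel
```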

@nyh (Contributor) commented Dec 9, 2019

@glommer I left a lengthy review of this patch. I don't like this approach myself, but I also don't see anything specifically wrong with this code, so if you or @slivne believe it is important for users right now and request it in - despite not being the ideal solution - I can commit it. Please let me know.

@glommer (Contributor) commented Dec 9, 2019

Thanks a lot Nadav.

I am not aware of any pressing need for a hack here.
This needs to be solved for repairs to work better over high-latency links, but if the approach is not good, @asias should engage in the search for a new one.

@glommer (Contributor) commented Dec 9, 2019

FWIW I also favor a solution without this tunable.

@asias (Contributor, Author) commented Dec 10, 2019

> (Quoted @nyh's full review comment from above.)

Nadav, thanks for the lengthy review.

First of all, I do not like the user to configure anything either. We already have the ranges parallelism auto-configured according to the memory size; see 131acc0 (repair: Adjust parallelism according to memory size (#4696)). If the user does not specify the "advanced" option, everything is zero configuration. The parameter introduced here is for the cases where the auto-configured value does not work. For example, why limit repair to 10% of memory if all I need to do at the moment is repair the cluster? The point is that the auto-configured parameter is not optimal for all use cases.

What does the ranges_parallelism option actually control? It controls the number of repair instances running in parallel per shard. Each repair instance runs the full row-level repair algorithm. It is a repair-master-side parameter; the follower side follows whatever the repair master chooses.

Regarding not caching all rows in memory and rereading: if the rows come from peers, rereading means retransmitting from the peers. I do not think this is something we want to pursue.
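A minimal, hypothetical sketch of the master-side semantics described above, with asyncio standing in for Scylla's fibers; repair_range_with_followers is a made-up placeholder, not a Scylla function:

```python
import asyncio

async def repair_range_with_followers(shard, rng):
    # Hypothetical stand-in for one repair instance running the full
    # row-level repair algorithm for a single token range.
    await asyncio.sleep(0.01)

async def repair_shard(shard, ranges, ranges_parallelism):
    # Master-side cap: at most ranges_parallelism repair instances run
    # concurrently on this shard; followers serve whatever the master asks.
    sem = asyncio.Semaphore(ranges_parallelism)

    async def run_one(rng):
        async with sem:
            await repair_range_with_followers(shard, rng)

    await asyncio.gather(*(run_one(r) for r in ranges))

asyncio.run(repair_shard(shard=0, ranges=range(100), ranges_parallelism=16))
```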

@asias (Contributor, Author) commented Dec 10, 2019

> Thanks a lot Nadav.
>
> I am not aware of any pressing need for a hack here.

Because nobody is running the repair in the 3.2 release, which is not released yet.

> This needs to be solved for repairs to work better over high-latency links, but if the approach is not good, @asias should engage in the search for a new one.

Increasing parallelism is the common way to compensate for latency. The 25X improvement mentioned in the commit log is a good example that it works well. I am open to new ideas, but let's proceed and gain more actual experience before arguing whether this works or not.

@glommer (Contributor) commented Dec 10, 2019

Hi @asias

I was mostly referring to the tunable.
I think we can do away with it.

On/off tunables are okay for features we are afraid may have issues - like repair-based streaming - but the kind of tunable you are introducing tends to become hard to remove later.

@asias (Contributor, Author) commented Dec 10, 2019

> Hi @asias
>
> I was mostly referring to the tunable.
> I think we can do away with it.

You are welcome to give suggestions on how to get rid of it. No matter how smart we are, the auto-configured number can go wrong in some cases.

> On/off tunables are okay for features we are afraid may have issues - like repair-based streaming - but the kind of tunable you are introducing tends to become hard to remove later.

Yes, it is hard to remove, but why not simply stop using it? The fact that the option is present does not mean we have to use it.

@glommer (Contributor) commented Dec 10, 2019

Users start depending on it, Asias.

My suggestion on how to remove it is by not adding it.

@asias (Contributor, Author) commented Dec 10, 2019

> Users start depending on it, Asias.

The option is not wired to nodetool repair, so a normal user won't use it. We can easily make Scylla Manager stop using it if we want. In case we really want to remove the option, we can return an error from the RESTful API when the option is provided.

> My suggestion on how to remove it is by not adding it.

In my view, doing nothing is not better than providing more sophisticated control. Also, offering such control does not prevent us from looking for a better automatic method.

@slivne (Contributor) commented Dec 10, 2019

@nyh - asias provided answers to your questions. Are we missing anything - can this go in?

@nyh (Contributor) commented Dec 10, 2019

@slivne as I said, it can go in; there is nothing really "wrong" with it. Whether it should go in, however, is a different question. This solution goes against our usual "zero configuration" philosophy. So I think you should say whether this should go in, considering what you know about the users or field engineers who want such a feature. I don't think there's any risk with this patch - it's just that it feels like taking a step in the wrong direction.

@slivne (Contributor) commented Dec 10, 2019

@nyh thanks - this was agreed by Avi in the review process. @avikivity - any input from you about this added configuration? It's not great, but we agreed that we will add it.

@penberg (Contributor) commented Dec 11, 2019

As discussed here, it's highly unusual that we add such tunables, because they violate the "no configuration" rule and are hard to get rid of later. However, if @avikivity approved the approach, he can certainly merge the pull request.

@avikivity (Member) commented:

I think we need to make this automatic. Allocate 10% of memory for repair and run as many ranges as will fit there.

Of course, running many ranges requires more compaction work later (or some other work to unify many small sstables).

Too bad we didn't go for rolling repair which could do this in a single stream.

@penberg (Contributor) commented Jul 17, 2020

@asias is this pull request still relevant or should we just close it?

@asias (Contributor, Author) commented Jul 21, 2020

I will close it for now.

@vladzcloudius (Contributor) commented:

> Thanks a lot Nadav.
>
> I am not aware of any pressing need for a hack here. This needs to be solved for repairs to work better over high-latency links, but if the approach is not good, @asias should engage in the search for a new one.

One "pressing need" is control of the repair parallelism (which is effectively defined by the number of ranges being repaired at the same time) from tools like nodetool.

I suggest reviving this PR and linking the corresponding API to one of nodetool's options, e.g. -j.

@benipeled (Contributor) commented:

CI state FAILURE - https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/2699/

The PR is missing the merge branch - we should try to rebase and fix the conflicts before retriggering the CI:

benipeled@fedora ~/repos/scylla-4848 $ git ls-remote git@github.com:scylladb/scylla.git 'refs/pull/4848/*'
bc605cf4227422783f52bc0d10c5dd9d78fa522c        refs/pull/4848/head

For comparison, a common PR contains two branches - head and merge - and the CI uses the merge branch:

benipeled@fedora ~/repos/scylla-4848 $ git ls-remote git@github.com:scylladb/scylla.git 'refs/pull/14835/*'
dc3475901608ca12f61e871e98048ed32dcee85f        refs/pull/14835/head
5b31c3396cf2c7e6cee9e26e78134609d91bc6da        refs/pull/14835/merge

asias added a commit to asias/scylla that referenced this pull request Jul 31, 2023
This patch adds the ranges_parallelism option to the repair RESTful API.

Users can use this option to optionally specify the number of ranges to repair in parallel per repair job, as a smaller number than the default max_repair_ranges_in_parallel calculated by the Scylla core.

Scylla Manager can also use this option to provide more ranges (>N) in a single repair job while repairing only ranges_parallelism=N ranges in parallel, instead of providing N ranges per repair job.

To make it safer, unlike PR scylladb#4848, this patch does not allow the user to exceed max_repair_ranges_in_parallel.

Fixes scylladb#4847
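A tiny sketch of the clamping behaviour described in this commit message; the helper name is made up, and only the min() against max_repair_ranges_in_parallel is the point:

```python
def effective_ranges_parallelism(user_ranges_parallelism, max_repair_ranges_in_parallel):
    # A user-supplied ranges_parallelism may lower the parallelism, but unlike
    # the original PR it is clamped so it can never exceed the core-calculated cap.
    if user_ranges_parallelism is None:
        return max_repair_ranges_in_parallel
    return min(user_ranges_parallelism, max_repair_ranges_in_parallel)

assert effective_ranges_parallelism(None, 16) == 16
assert effective_ranges_parallelism(4, 16) == 4
assert effective_ranges_parallelism(64, 16) == 16
```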
asias added two more commits with the same message to asias/scylla that referenced this pull request on Jul 31, 2023.
@asias (Contributor, Author) commented Aug 1, 2023
asias commented Aug 1, 2023

Replaced by #14886

@asias closed this Aug 1, 2023
denesb added a commit that referenced this pull request Aug 3, 2023
This patch adds the ranges_parallelism option to the repair RESTful API.

Users can use this option to optionally specify the number of ranges to repair in parallel per repair job, as a smaller number than the default max_repair_ranges_in_parallel calculated by the Scylla core.

Scylla Manager can also use this option to provide more ranges (>N) in a single repair job while repairing only ranges_parallelism=N ranges in parallel, instead of providing N ranges per repair job.

To make it safer, unlike PR #4848, this patch does not allow the user to exceed max_repair_ranges_in_parallel.

Fixes #4847

Closes #14886

* github.com:scylladb/scylladb:
  repair: Add ranges_parallelism option
  repair: Change to use coroutine in do_repair_ranges