repair: Add ranges_parallelism option #4848
Conversation
@avikivity this is one of the changes we agreed on for row-level repair to make it work with high-latency links
Ping.
In commit 131acc0 (repair: Adjust parallelism according to memory size), we choose the ranges to repair in parallel according to the memory size automatically. However, this automatic number might not be the optimal one. For example, the user wants repair to have minimal impact on the user CQL reads and writes, or the user wants to repair at full speed, allowing more CPU and memory resources.
This patch introduces a ranges_parallelism option that controls the number of ranges that we repair in parallel, so that the advanced user can control the parallelism.
Here is an example. In a cluster with high network latency, e.g., multiple DCs with a high-latency link, the user can increase ranges_parallelism to compensate. For example, in a two-DC cluster with 7 nodes + 60ms round trip time + RF (dc1:3, dc2:4) + a keyspace with 3 empty tables:
2019-04-10 11:00:03.303549 Repair with ks ranges_parallelism=1 on node 1 started ...
2019-04-10 11:32:32.673043 Repair with ks on node 1 finished ...
2019-04-10 11:32:32.673292 Repair with ks ranges_parallelism=16 on node 1 started ...
2019-04-10 11:36:49.475977 Repair with ks on node 1 finished ...
2019-04-10 11:36:49.476145 Repair with ks ranges_parallelism=256 on node 1 started ...
2019-04-10 11:38:04.242553 Repair with ks on node 1 finished ...
That is 1949s vs 257s vs 75s to complete repair, respectively, which gives a 7X and a 25X difference.
Fixes scylladb#4847
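The elapsed times and speedup factors quoted above can be rechecked from the log timestamps themselves. A quick consistency check in Python — only the timestamps come from the commit message; the script is not part of the patch:

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"
# start/end timestamps per ranges_parallelism value, copied from the log above
runs = {
    1:   ("2019-04-10 11:00:03.303549", "2019-04-10 11:32:32.673043"),
    16:  ("2019-04-10 11:32:32.673292", "2019-04-10 11:36:49.475977"),
    256: ("2019-04-10 11:36:49.476145", "2019-04-10 11:38:04.242553"),
}

elapsed = {
    p: (datetime.strptime(end, FMT) - datetime.strptime(start, FMT)).total_seconds()
    for p, (start, end) in runs.items()
}

for p, secs in elapsed.items():
    print(f"ranges_parallelism={p}: {secs:.0f}s "
          f"({elapsed[1] / secs:.1f}x vs parallelism=1)")
# → 1949s, 257s and 75s respectively, matching the commit message
```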
Force-pushed from a8dd933 to bc605cf
Rebased to latest master.
@avikivity Ping. This is needed by 3.2.
Last ping on this was 21 days ago. Patches need to either move or be closed. I am tagging all 4 people with maintainer status according to the maintainers file: @avikivity @nyh @tgrabiec @penberg. This is an unrelated discussion, but I think we need more maintainers that can commit code. We can't have a pull request with no action sitting here for 3 weeks.
I don't see anything wrong with this patch's code, but I don't like the approach:
Exactly as demonstrated in the commit message, the meaning of the "repair parallelism" is very different on high-latency vs low-latency links. While on low-latency links a low parallelism can nevertheless take a big chunk of the machine's CPU, disk and network capacity, on a high-latency link low parallelism can result in an underused network, disk and CPU. Moreover, what does the "ranges_parallelism" number actually mean? What is "a range"? Are there more layers of parallelism inside each range (including read-ahead, socket buffers, etc.) or not? Is this a master-side parameter, a slave-side parameter, or what? Besides changing the parallelism (and noticeably, the speed) of the repair, what does it do to the CPU or disk percentage used up by repair? As a user I wouldn't have a clue how to use this parameter. Although if @slivne or @glommer request this specific implementation as an emergency tweak for problems in the field, I'll oblige and commit this patch as-is.
If I think about how to solve this problem ideally, not in the way done here, I think that if the user is to configure anything - it should be the fraction of CPU, disk bandwidth and/or network bandwidth that they want to devote to repair (on all machines involved!). That is our traditional "boring database"/"zero configuration" approach. We already support CPU and disk schedulers (and usually assumed the network split is derived from the CPU/disk available), but we need to make repair actually use them. In other words, what I envision is repair adaptively controlling its own concurrency, increasing it while the CPU/disk quotas of the involved machines aren't fully used (I think it is quite fine to overshoot a bit, and use a slightly too high concurrency). This probably won't be easy to do, so maybe we need this patch in the interim?
Moreover, orthogonally to this specific patch there is another big problem in our repair approach on high-latency high-throughput links. Consider that we have a very high latency. You'll need very high parallelism to compensate, but this - with the current code which keeps in memory every row read for this entire latency - will need huge amounts of memory. A really good solution for this problem will use less memory in this case. Perhaps not caching all the rows in memory and hoping that only a few end up needing to be reread. This is why in the past I asked you to benchmark also the case of high latency with almost no inconsistencies in the data - this sort of repair should ideally be able to proceed at disk and network bandwidth, unrelated to the amount of memory or network latency.
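The adaptive scheme envisioned in the review above could look roughly like the following. This is a hypothetical sketch, not ScyllaDB code: the function name, the utilization sampling, and the AIMD (additive-increase/multiplicative-decrease) policy are all assumptions made for illustration.

```python
# Hypothetical sketch: grow repair concurrency while the repair scheduling
# group's measured utilization stays under its quota, and back off sharply
# when it overshoots (classic AIMD control).
def adjust_concurrency(current: int, utilization: float, quota: float,
                       max_concurrency: int = 1024) -> int:
    """Return the next number of ranges to repair in parallel.

    utilization and quota are fractions of CPU/disk capacity in [0, 1].
    """
    if utilization > quota:
        return max(1, current // 2)            # multiplicative decrease
    return min(max_concurrency, current + 1)   # additive increase

# The loop converges toward the quota instead of asking the user for a number:
level = 1
for sample in [0.02, 0.03, 0.05, 0.08, 0.12, 0.06]:
    level = adjust_concurrency(level, sample, quota=0.10)
print(level)  # → 3 (ramped to 5, halved on the 0.12 overshoot, then resumed)
```

The point of the sketch is that the user-facing knob becomes a resource fraction, matching the "zero configuration" philosophy, while the concurrency number stays internal.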
},
{
    "name":"ranges_parallelism",
    "description":"An integer specifying the number of ranges to repair in parallel.",
This number is so opaque... What integer should a user use here? 3? 100? 1000000? What is the default if I don't specify this? As a user, I wouldn't know how to use this option (although I guess it might come in handy for our field engineers?).
The default used to be 16. Now it is configured according to the amount of memory allowed for repair; see 131acc0.
@glommer I left a lengthy review of this patch. I don't like this approach myself, but I also don't see anything specifically wrong with this code, so if you or @slivne believe it is important for users right now and request it in - despite not being the ideal solution - I can commit it. Please let me know.
Thanks a lot Nadav. I am not aware of any pressing need for a hack here.
FWIW I also favor a solution without this tunable.
Nadav, thanks for the lengthy review. First of all, I do not like users having to configure anything either. We already have the ranges parallelism auto-configured according to the memory size. See 131acc0 (repair: Adjust parallelism according to memory size (#4696)). If the user does not specify the "advanced" option, everything is zero configuration. The parameter introduced is for the case where the auto-configured parameter does not work. For example, why limit repair to 10% of memory if all I need to do is repair the cluster at the moment? The point is that the auto-configured parameter is not optimal for all the use cases.
What does the ranges_parallelism option actually control? It controls the number of repair instances per shard running in parallel. Each repair instance runs a full row-level repair algorithm. It is a repair master side parameter. The follower side will follow whatever the repair master chooses.
Regarding not caching all rows in memory and rereading: if the rows are from peers, rereading means retransmitting from the peers. I do not think this is something we want to pursue.
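The semantics described above — at most ranges_parallelism repair instances in flight on the master at once — can be modeled with a counting semaphore. This is an illustrative Python model, not the actual implementation (which is C++/Seastar); `repair_one_range` is a stand-in for the row-level repair algorithm:

```python
import asyncio

async def repair_one_range(rng):
    # Stand-in for one full row-level repair instance on a token range.
    await asyncio.sleep(0)
    return rng

async def repair_ranges(ranges, ranges_parallelism):
    # The semaphore is what ranges_parallelism controls in this model:
    # at most that many repair instances run concurrently.
    sem = asyncio.Semaphore(ranges_parallelism)

    async def guarded(rng):
        async with sem:
            return await repair_one_range(rng)

    return await asyncio.gather(*(guarded(r) for r in ranges))

done = asyncio.run(repair_ranges(range(100), ranges_parallelism=16))
print(len(done))  # → 100: all ranges repaired, only 16 in flight at a time
```

This also illustrates why it is a master-side-only parameter in the model: followers simply answer whatever requests the in-flight instances issue.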
Because nobody is running repair in the 3.2 release, which is not released yet.
Increasing parallelism is the common way to compensate for latency. The 25X improvement mentioned in the commit log is a good example that it works well. I am open to new ideas, but let's proceed and gain more actual experience before arguing whether this works or not.
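The claim that parallelism compensates for latency follows from simple arithmetic: with P ranges in flight, R ranges that each need a fixed number of sequential round trips finish in about ceil(R/P) "waves" of network round trips. A back-of-the-envelope model — the range count and round trips per range are made-up numbers; only the 60ms RTT comes from the commit message:

```python
import math

# Toy model of repair wall time dominated by network latency: each range
# needs some sequential round trips, and only `parallelism` ranges
# overlap their waiting.
def repair_wall_time(num_ranges, parallelism, round_trips_per_range, rtt_s):
    waves = math.ceil(num_ranges / parallelism)
    return waves * round_trips_per_range * rtt_s

RTT = 0.060  # the 60 ms round trip time from the commit message
# Assuming 1000 ranges and 10 round trips per range (illustrative values):
for p in (1, 16, 256):
    print(p, round(repair_wall_time(1000, p, 10, RTT), 1))
# prints roughly 600.0, 37.8 and 2.4 seconds: the same order-of-magnitude
# spread as the measured 1949s / 257s / 75s
```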
Hi @asias I was mostly referring to the tunable. On/off tunables are okay for features we are afraid may have issues - like repair-based streaming - but these kinds of tunables that you are introducing tend to become hard to remove later.
You are welcome to give suggestions on how to get rid of it. No matter how smart we are, the auto-configured number could go wrong in some cases.
Yes, it is hard to remove, but why not just stop using it? The option being present does not mean we have to use it.
Users start depending on it, Asias. My suggestion on how to remove it is by not adding it.
The option is not wired to nodetool repair, so normal users won't use it. We can easily make Scylla Manager stop using it if we want. In case we really want to remove the option, we can return an error in the restful API when the option is provided.
In my view, doing nothing is not better than providing more sophisticated control. Also, offering such control does not prevent us from looking for a better automatic method.
@nyh - asias provided answers to your questions. Are we missing anything - can this go in?
@slivne as I said, it can go in; there is nothing really "wrong" with it. Whether it should go in, however, is a different question. This solution goes against our usual "zero configuration" philosophy. So I think you should say if this should go in - considering what you know about users or field engineers or whoever wants such a feature. I don't think there's any risk with this patch - it's just that it feels like taking a step in the wrong direction.
@nyh thanks - this was agreed by avi in the review process. @avikivity - any input from you about this added configuration? It's not great, but we agreed that we will add it.
As discussed here, it's highly unusual that we add such tunables, because they violate the "no configuration" rule and are hard to get rid of later. However, if @avikivity approved the approach, he can certainly merge the pull request.
I think we need to make this automatic. Allocate 10% of memory for repair and run as many ranges as will fit there. Of course, running many ranges requires more compaction work later (or some other work to unify many small sstables). Too bad we didn't go for rolling repair, which could do this in a single stream.
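The automatic sizing proposed above — a fixed memory fraction divided by per-range memory use — can be sketched in a few lines. The 10% fraction comes from the comment; the per-range figure and function name are assumptions for illustration, not values from 131acc0:

```python
# Sketch of memory-driven sizing: give repair a fixed fraction of shard
# memory and run as many ranges in parallel as fit in that budget.
def auto_ranges_parallelism(shard_memory_bytes,
                            memory_fraction=0.10,        # 10%, per the comment
                            per_range_bytes=32 << 20,    # assumed 32 MiB/range
                            minimum=1):
    budget = shard_memory_bytes * memory_fraction
    return max(minimum, int(budget // per_range_bytes))

# e.g. an 8 GiB shard with a 10% budget and the assumed 32 MiB per range:
print(auto_ranges_parallelism(8 << 30))  # → 25
```

A floor of 1 keeps repair making progress even on tiny shards, where the budget would otherwise round down to zero ranges.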
Force-pushed from 7127c5b to a46f235
@asias is this pull request still relevant or should we just close it?
I will close it for now.
One "pressing need" is our control of repair parallelism (which is effectively defined by the number of ranges being repaired at the same time) from tools like […]. I suggest reviving this PR and linking the corresponding API to one of the […].
CI state
The PR is missing the […]
For comparison, a common PR contains two branches - head and merge - and the CI uses the […]
This patch adds the ranges_parallelism option to the repair restful API. Users can use this option to optionally specify the number of ranges to repair in parallel per repair job, as a smaller number than the Scylla-core-calculated default max_repair_ranges_in_parallel. Scylla Manager can also use this option to provide more ranges (>N) in a single repair job while only repairing ranges_parallelism=N ranges in parallel, instead of providing N ranges in a repair job. To make it safer, unlike PR scylladb#4848, this patch does not allow the user to exceed max_repair_ranges_in_parallel. Fixes scylladb#4847
Replaced by #14886
This patch adds the ranges_parallelism option to the repair restful API. Users can use this option to optionally specify the number of ranges to repair in parallel per repair job, as a smaller number than the Scylla-core-calculated default max_repair_ranges_in_parallel. Scylla Manager can also use this option to provide more ranges (>N) in a single repair job while only repairing ranges_parallelism=N ranges in parallel, instead of providing N ranges in a repair job. To make it safer, unlike PR #4848, this patch does not allow the user to exceed max_repair_ranges_in_parallel.
Fixes #4847
Closes #14886
* github.com:scylladb/scylladb:
  repair: Add ranges_parallelism option
  repair: Change to use coroutine in do_repair_ranges
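The safety change in the merged version — the user-supplied value may only lower the core-calculated limit, never raise it — amounts to a clamp. A miniature illustration (the names come from the text above; the logic here is a sketch, not the actual C++):

```python
# User-supplied ranges_parallelism is only allowed to reduce parallelism
# below the core-calculated max_repair_ranges_in_parallel, never exceed it.
def effective_parallelism(max_repair_ranges_in_parallel,
                          user_ranges_parallelism=None):
    if user_ranges_parallelism is None:
        # No option given: zero-configuration default applies.
        return max_repair_ranges_in_parallel
    return max(1, min(user_ranges_parallelism, max_repair_ranges_in_parallel))

print(effective_parallelism(16))        # → 16 (default, option omitted)
print(effective_parallelism(16, 4))     # → 4  (user throttles repair)
print(effective_parallelism(16, 256))   # → 16 (request above the cap is clamped)
```

This is what distinguishes the merged patch from the closed PR #4848, where the option could push parallelism above the memory-derived limit.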
In commit 131acc0 (repair: Adjust
parallelism according to memory size), we choose the ranges to repair in
parallel according to the memory size automatically.
However, this automatic number might not be the optimal one. For example,
the user wants repair to have minimal impact on the user CQL reads and
writes, or the user wants to repair at full speed, allowing more CPU and
memory resources.
This patch introduces a ranges_parallelism option that controls the number
of ranges that we repair in parallel, so that the advanced
user can control the parallelism.
Here is an example:
In a cluster with high network latency, e.g., multiple DCs with a
high-latency link, the user can increase ranges_parallelism to compensate.
For example in two DCs cluster with 7 nodes + 60ms round trip time + RF
(dc1:3, dc2:4) + a keyspace with 3 empty tables:
2019-04-10 11:00:03.303549 Repair with ks ranges_parallelism=1 on node 1 started ...
2019-04-10 11:32:32.673043 Repair with ks on node 1 finished ...
2019-04-10 11:32:32.673292 Repair with ks ranges_parallelism=16 on node 1 started ...
2019-04-10 11:36:49.475977 Repair with ks on node 1 finished ...
2019-04-10 11:36:49.476145 Repair with ks ranges_parallelism=256 on node 1 started ...
2019-04-10 11:38:04.242553 Repair with ks on node 1 finished ...
That is 1949s vs 257s vs 75s to complete repair, respectively, which
gives a 7X and a 25X difference.
Fixes #4847