
repair: Add ranges_parallelism option #4848

Closed
wants to merge 1 commit

Conversation

@asias (Contributor) commented Aug 14, 2019

In commit 131acc0 (repair: Adjust parallelism according to memory size), we choose the number of ranges to repair in parallel automatically, according to the memory size.

However, this automatic number might not be the optimal one. For example, a user may want repair to have minimal impact on CQL reads and writes, or may want repair to run at full speed, allowing it more CPU and memory resources.

This patch introduces a ranges_parallelism option that controls the number of ranges we repair in parallel, so that advanced users can control the parallelism.

Here is an example:

In a cluster with high network latency, e.g., multiple DCs connected by a high-latency link, the user can increase ranges_parallelism to compensate.

For example, in a two-DC cluster with 7 nodes, a 60 ms round-trip time, RF (dc1:3, dc2:4), and a keyspace with 3 empty tables:

2019-04-10 11:00:03.303549 Repair with ks ranges_parallelism=1 on node 1 started ...
2019-04-10 11:32:32.673043 Repair with ks on node 1 finished ...

2019-04-10 11:32:32.673292 Repair with ks ranges_parallelism=16 on node 1 started ...
2019-04-10 11:36:49.475977 Repair with ks on node 1 finished ...

2019-04-10 11:36:49.476145 Repair with ks ranges_parallelism=256 on node 1 started ...
2019-04-10 11:38:04.242553 Repair with ks on node 1 finished ...

That is 1949s vs 257s vs 75s to complete the repair respectively, a 7X and 25X difference.
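As a sanity check (not part of the patch), the durations and speedups can be recomputed from the timestamps above with a few lines of Python:

```python
from datetime import datetime

fmt = "%Y-%m-%d %H:%M:%S.%f"
# (start, end) timestamps copied from the log lines above, keyed by ranges_parallelism.
runs = {
    1:   ("2019-04-10 11:00:03.303549", "2019-04-10 11:32:32.673043"),
    16:  ("2019-04-10 11:32:32.673292", "2019-04-10 11:36:49.475977"),
    256: ("2019-04-10 11:36:49.476145", "2019-04-10 11:38:04.242553"),
}

durations = {
    p: (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).total_seconds()
    for p, (start, end) in runs.items()
}
for p, secs in durations.items():
    print(f"ranges_parallelism={p}: {secs:.0f}s ({durations[1] / secs:.1f}x vs parallelism=1)")
# -> about 1949s, 257s and 75s, i.e. the ~7x and ~25x speedups quoted above.
```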

Fixes #4847

@slivne requested a review from @avikivity on August 19, 2019 20:31
@slivne (Contributor) commented Aug 19, 2019

@avikivity this is one of the changes we agreed on for row-level repair, to make it work over high-latency links

@slivne (Contributor) commented Sep 2, 2019

@avikivity this is one of the changes we agreed on for row-level repair, to make it work over high-latency links

@asias (Contributor, Author) commented Oct 9, 2019

ping.

@asias force-pushed the repair_add_ranges_parallelism branch from a8dd933 to bc605cf on October 9, 2019 09:22
@asias (Contributor, Author) commented Oct 9, 2019

Rebased to latest master.

@asias (Contributor, Author) commented Nov 19, 2019

@avikivity Ping. This is needed by 3.2.

@slivne (Contributor) commented Dec 4, 2019

@avikivity Ping. This is needed by 3.2.

@glommer (Contributor) commented Dec 9, 2019

The last ping on this was 21 days ago.
There has been no maintainer action: no request for changes, no merge.

Patches need to either move or be closed.

I am tagging all 4 people with maintainer status according to the maintainers file: @avikivity @nyh @tgrabiec @penberg

This is an unrelated discussion, but I think we need more maintainers who can commit code.

We can't have a pull request with no action sitting here for 3 weeks.

@nyh (Contributor) left a review:

I don't see anything wrong with this patch's code, but I don't like the approach:

Exactly as demonstrated in the commit message, the meaning of the "repair parallelism" is very different on high-latency vs low-latency links. While on low-latency links a low parallelism can nevertheless take a big chunk of the machine's CPU, disk and network capacity, on a high-latency link low parallelism can result in underused network, disk and CPU. Moreover, what does the "ranges_parallelism" number actually mean? What is "a range"? Are there more layers of parallelism inside each range (including read-ahead, socket buffers, etc.) or not? Is this a master-side parameter, a slave-side parameter, or what? Besides changing the parallelism (and noticeably the speed) of the repair, what does it do to the CPU or disk percentage used up by repair? As a user I wouldn't have a clue how to use this parameter. Although if @slivne or @glommer request this specific implementation as an emergency tweak for problems in the field, I'll oblige and commit this patch as-is.

If I think about how to solve this problem ideally, not in the way done here, I think that if the user is to configure anything, it should be the fraction of CPU, disk bandwidth and/or network bandwidth that they want to devote to repair (on all machines involved!). That is our traditional "boring database"/"zero configuration" approach. We already support CPU and disk schedulers (and usually assumed the network split is derived from the CPU/disk available), but we need to make sure repair actually uses them. In other words, what I envision is repair adaptively controlling its own concurrency, increasing it while the CPU/disk quotas of the involved machines aren't fully used (I think it is quite fine to overshoot a bit and use a slightly too high concurrency). This probably won't be easy to do, so maybe we need this patch in the interim?

Moreover, orthogonal to this specific patch, there is another big problem in our repair approach on high-latency high-throughput links. Consider that we have a very high latency. You'll need very high parallelism to compensate, but this - with the current code, which keeps in memory every row read for this entire latency - will need huge amounts of memory. A really good solution for this problem will use less memory in this case, perhaps by not caching all the rows in memory and hoping that only a few end up needing to be reread. This is why in the past I asked you to benchmark also the case of high latency with almost no inconsistencies in the data - this sort of repair should ideally be able to proceed at disk and network bandwidth, unrelated to the amount of memory or network latency.
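For illustration only, a minimal Python sketch of the adaptive-concurrency idea described in this review; the scheduler_utilization and repair_one_range helpers are hypothetical stand-ins rather than Scylla APIs, and the AIMD policy is just one possible choice:

```python
import asyncio
import random

TARGET_UTILIZATION = 0.8  # assumed target share of the repair scheduling-group quota

async def repair_one_range(rng):
    # Hypothetical stand-in for repairing one token range against the peers.
    await asyncio.sleep(random.uniform(0.01, 0.05))

def scheduler_utilization():
    # Hypothetical stand-in for reading how much of the CPU/disk quota the
    # repair scheduling group is currently using (0.0 .. 1.0).
    return random.uniform(0.3, 1.0)

async def adaptive_repair(ranges, min_parallelism=1, max_parallelism=256):
    parallelism = min_parallelism
    pending = list(ranges)
    while pending:
        batch, pending = pending[:parallelism], pending[parallelism:]
        await asyncio.gather(*(repair_one_range(r) for r in batch))
        # Additive increase while the quota is underused, multiplicative
        # decrease when repair starts saturating it.
        if scheduler_utilization() < TARGET_UTILIZATION:
            parallelism = min(parallelism + 1, max_parallelism)
        else:
            parallelism = max(parallelism // 2, min_parallelism)

asyncio.run(adaptive_repair(range(64)))
```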

},
{
"name":"ranges_parallelism",
"description":"An integer specifying the number of ranges to repair in parallel.",
@nyh (Contributor) commented on this snippet:

This number is so opaque... What integer should a user use here? 3? 100? 1000000? What is the default if I don't specify this? As a user, I wouldn't know how to use this option (although I guess it might come in handy for our field engineers?).

@asias (Contributor, Author) replied:

The default used to be 16. Now it is configured according to the amount of memory allowed for repair; see 131acc0.
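A minimal sketch of what deriving the parallelism from the repair memory budget could look like; the 10% budget (mentioned later in this thread) and the 32 MiB per-range figure are assumptions for illustration, the actual formula lives in commit 131acc0:

```python
def auto_ranges_parallelism(total_memory_bytes,
                            repair_memory_fraction=0.1,
                            memory_per_parallel_range=32 * 1024 * 1024):
    # Both the 10% budget and the 32 MiB per-range estimate are illustrative
    # assumptions, not Scylla's actual constants.
    budget = total_memory_bytes * repair_memory_fraction
    return max(1, int(budget // memory_per_parallel_range))

print(auto_ranges_parallelism(2 * 1024**3))  # a 2 GiB shard -> 6 ranges in parallel
```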

@nyh (Contributor) commented Dec 9, 2019

@glommer I left a lengthy review of this patch. I don't like this approach myself, but I also don't see anything specifically wrong with this code, so if you or @slivne believe it is important for users right now and request it in - despite not being the ideal solution - I can commit it. Please let me know.

@glommer (Contributor) commented Dec 9, 2019

Thanks a lot Nadav.

I am not aware of any pressing need for a hack here.
This needs to be solved for repairs to work better over high-latency links, but if the approach is not good, @asias should engage in the search for a new one.

@glommer (Contributor) commented Dec 9, 2019

FWIW I also favor a solution without this tunable.

@asias (Contributor, Author) commented Dec 10, 2019

> (Quoted @nyh's full review comment from above.)

Nadav, thanks for the lengthy review.

First of all, I do not like the user to configure anything either. We already have the ranges parallelism auto-configured according to the memory size; see 131acc0 (repair: Adjust parallelism according to memory size (#4696)). If the user does not specify the "advanced" option, everything is zero configuration. The parameter introduced here is for the cases where the auto-configured value does not work. For example, why limit repair to 10% of memory if all I need to do at the moment is repair the cluster? The point is that the auto-configured parameter is not optimal for all use cases.

What does the ranges_parallelism option actually control? It controls the number of repair instances running in parallel per shard. Each repair instance runs the full row-level repair algorithm. It is a repair-master-side parameter; the follower side follows whatever the repair master chooses.

Regarding not caching all rows in memory and rereading: if the rows come from peers, rereading means retransmitting from the peers. I do not think this is something we want to pursue.
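A minimal, hypothetical sketch of the master-side semantics described above, with asyncio standing in for Scylla's fibers; repair_range_with_followers is a made-up placeholder, not a Scylla function:

```python
import asyncio

async def repair_range_with_followers(shard, rng):
    # Hypothetical stand-in for one repair instance running the full
    # row-level repair algorithm for a single token range.
    await asyncio.sleep(0.01)

async def repair_shard(shard, ranges, ranges_parallelism):
    # Master-side cap: at most ranges_parallelism repair instances run
    # concurrently on this shard; followers serve whatever the master asks.
    sem = asyncio.Semaphore(ranges_parallelism)

    async def run_one(rng):
        async with sem:
            await repair_range_with_followers(shard, rng)

    await asyncio.gather(*(run_one(r) for r in ranges))

asyncio.run(repair_shard(shard=0, ranges=range(100), ranges_parallelism=16))
```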

@asias (Contributor, Author) commented Dec 10, 2019

> Thanks a lot Nadav.
>
> I am not aware of any pressing need for a hack here.

Because nobody is running the repair in the 3.2 release, which is not released yet.

> This needs to be solved for repairs to work better over high-latency links, but if the approach is not good, @asias should engage in the search for a new one.

Increasing parallelism is the common way to compensate for latency. The 25X improvement mentioned in the commit log is a good example that it works well. I am open to new ideas, but let's proceed and gain more actual experience before arguing whether this works or not.

@glommer (Contributor) commented Dec 10, 2019

Hi @asias

I was mostly referring to the tunable.
I think we can do away with it.

On/off tunables are okay for features we are afraid may have issues - like repair-based streaming - but the kind of tunable you are introducing tends to become hard to remove later.

@asias (Contributor, Author) commented Dec 10, 2019

> Hi @asias
>
> I was mostly referring to the tunable.
> I think we can do away with it.

You are welcome to give suggestions on how to get rid of it. No matter how smart we are, the auto-configured number can go wrong in some cases.

> On/off tunables are okay for features we are afraid may have issues - like repair-based streaming - but the kind of tunable you are introducing tends to become hard to remove later.

Yes, it is hard to remove, but why not simply stop using it? The fact that the option is present does not mean we have to use it.

@glommer (Contributor) commented Dec 10, 2019

Users start depending on it, Asias.

My suggestion on how to remove it is by not adding it.

@asias (Contributor, Author) commented Dec 10, 2019

> Users start depending on it, Asias.

The option is not wired to nodetool repair, so a normal user won't use it. We can easily make Scylla Manager stop using it if we want. In case we really want to remove the option, we can return an error from the RESTful API when the option is provided.

> My suggestion on how to remove it is by not adding it.

In my view, doing nothing is not better than providing more sophisticated control. Also, offering such control does not prevent us from looking for a better automatic method.

@slivne (Contributor) commented Dec 10, 2019

@nyh - asias provided answers to your questions. Are we missing anything - can this go in?

@nyh (Contributor) commented Dec 10, 2019

@slivne as I said, it can go in; there is nothing really "wrong" with it. Whether it should go in, however, is a different question. This solution goes against our usual "zero configuration" philosophy. So I think you should say whether this should go in, considering what you know about the users or field engineers who want such a feature. I don't think there's any risk with this patch - it's just that it feels like taking a step in the wrong direction.

@slivne (Contributor) commented Dec 10, 2019

@nyh thanks - this was agreed by Avi in the review process. @avikivity - any input from you about this added configuration? It's not great, but we agreed that we will add it.

@penberg (Contributor) commented Dec 11, 2019

As discussed here, it's highly unusual that we add such tunables, because they violate the "no configuration" rule and are hard to get rid of later. However, if @avikivity approved the approach, he can certainly merge the pull request.

@avikivity (Member) commented:

I think we need to make this automatic. Allocate 10% of memory for repair and run as many ranges as will fit there.

Of course, running many ranges requires more compaction work later (or some other work to unify many small sstables).

Too bad we didn't go for rolling repair which could do this in a single stream.

@penberg (Contributor) commented Jul 17, 2020

@asias is this pull request still relevant or should we just close it?

@asias (Contributor, Author) commented Jul 21, 2020

I will close it for now.

@vladzcloudius (Contributor) commented:

> Thanks a lot Nadav.
>
> I am not aware of any pressing need for a hack here. This needs to be solved for repairs to work better over high-latency links, but if the approach is not good, @asias should engage in the search for a new one.

One "pressing need" is control of the repair parallelism (which is effectively defined by the number of ranges being repaired at the same time) from tools like nodetool.

I suggest reviving this PR and linking the corresponding API to one of nodetool's options, e.g. -j.

@benipeled (Contributor) commented:

CI state FAILURE - https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/2699/

The PR is missing the merge branch - we should try to rebase and fix the conflicts before retriggering the CI:

benipeled@fedora ~/repos/scylla-4848 $ git ls-remote git@github.com:scylladb/scylla.git 'refs/pull/4848/*'
bc605cf4227422783f52bc0d10c5dd9d78fa522c        refs/pull/4848/head

For comparison, a common PR contains two branches - head and merge - and the CI uses the merge branch:

benipeled@fedora ~/repos/scylla-4848 $ git ls-remote git@github.com:scylladb/scylla.git 'refs/pull/14835/*'
dc3475901608ca12f61e871e98048ed32dcee85f        refs/pull/14835/head
5b31c3396cf2c7e6cee9e26e78134609d91bc6da        refs/pull/14835/merge

asias added a commit to asias/scylla that referenced this pull request Jul 31, 2023
This patch adds the ranges_parallelism option to the repair RESTful API.

Users can use this option to optionally specify the number of ranges to repair in parallel per repair job, as a smaller number than the default max_repair_ranges_in_parallel calculated by the Scylla core.

Scylla Manager can also use this option to provide more ranges (>N) in a single repair job while repairing only ranges_parallelism=N ranges in parallel, instead of providing N ranges per repair job.

To make it safer, unlike PR scylladb#4848, this patch does not allow the user to exceed max_repair_ranges_in_parallel.

Fixes scylladb#4847
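A tiny sketch of the clamping behaviour described in this commit message; the helper name is made up, and only the min() against max_repair_ranges_in_parallel is the point:

```python
def effective_ranges_parallelism(user_ranges_parallelism, max_repair_ranges_in_parallel):
    # A user-supplied ranges_parallelism may lower the parallelism, but unlike
    # the original PR it is clamped so it can never exceed the core-calculated cap.
    if user_ranges_parallelism is None:
        return max_repair_ranges_in_parallel
    return min(user_ranges_parallelism, max_repair_ranges_in_parallel)

assert effective_ranges_parallelism(None, 16) == 16
assert effective_ranges_parallelism(4, 16) == 4
assert effective_ranges_parallelism(64, 16) == 16
```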
asias added two more commits with the same message to asias/scylla that referenced this pull request on Jul 31, 2023.
@asias (Contributor, Author) commented Aug 1, 2023
asias commented Aug 1, 2023

Replaced by #14886

@asias closed this Aug 1, 2023
denesb added a commit that referenced this pull request Aug 3, 2023
This patch adds the ranges_parallelism option to the repair RESTful API.

Users can use this option to optionally specify the number of ranges to repair in parallel per repair job, as a smaller number than the default max_repair_ranges_in_parallel calculated by the Scylla core.

Scylla Manager can also use this option to provide more ranges (>N) in a single repair job while repairing only ranges_parallelism=N ranges in parallel, instead of providing N ranges per repair job.

To make it safer, unlike PR #4848, this patch does not allow the user to exceed max_repair_ranges_in_parallel.

Fixes #4847

Closes #14886

* github.com:scylladb/scylladb:
  repair: Add ranges_parallelism option
  repair: Change to use coroutine in do_repair_ranges