API: Add long polling to StorageServiceRepairAsyncByKeyspaceGet #6445

mmatczuk · 2020-05-13T09:07:37Z

This is needed to make repairs faster in terms of wall-clock time.

Scylla Manager asks Scylla to repair a given token range and then waits for the repair to finish by polling the /storage_service/repair_async/{keyspace} endpoint.

Having to poll for repair job results every X ms introduces a significant delay in the whole process as seen by the user.

Let's say we have:

- 9 nodes
- 5k tables
- 4 shards per node
- 16 ranges per shard
- RF=3

If we poll every 50 ms, then the mean wasted time per operation is 25 ms.
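The 25 ms figure is half the poll interval: a job can finish at any point between two polls, and the client only notices at the next poll. A minimal sketch that checks this numerically, assuming completion times are spread evenly over one interval (the function names are made up for illustration):

```python
import math


def wasted_ms(finish_ms: float, poll_ms: float) -> float:
    """Time between the job finishing and the next poll that notices it."""
    next_poll = math.ceil(finish_ms / poll_ms) * poll_ms
    return next_poll - finish_ms


def mean_wasted_ms(poll_ms: float, samples: int = 10_000) -> float:
    """Average wasted time for finish times spread evenly over one poll interval."""
    total = 0.0
    for i in range(samples):
        finish = (i + 0.5) * poll_ms / samples  # midpoint of each small slice
        total += wasted_ms(finish, poll_ms)
    return total / samples


print(round(mean_wasted_ms(50), 3))  # 25.0
```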

That adds up to 25 minutes of wall-clock time spent waiting that could be avoided: (9 * 256 * 5000) / (4 * 16) * 25 / 1000 / 3 / 60. The wasted CPU time is 3 times as much. It explodes with the number of tables, and it degrades further if you want to go slower.

Repairing 1 token range at a time results in over a DAY (1600 minutes) of waiting that could be avoided: (9 * 256 * 5000) * 25 / 1000 / 60 / 3.
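The two estimates above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the function and parameter names are made up here, and 256 is taken straight from the formulas (presumably the number of token ranges per node, which the post does not spell out).

```python
def wasted_minutes(nodes, ranges_per_node, tables, parallelism, poll_ms, rf):
    """Estimate wall-clock minutes lost to polling, following the formulas above."""
    jobs = nodes * ranges_per_node * tables   # repair jobs to wait for
    mean_wait_ms = poll_ms / 2                # on average half a poll interval is wasted
    total_ms = jobs / parallelism * mean_wait_ms
    return total_ms / 1000 / rf / 60


# Divisor from the first formula: 4 shards * 16 ranges per shard.
parallel = wasted_minutes(9, 256, 5000, parallelism=4 * 16, poll_ms=50, rf=3)
# One token range at a time, as in the second formula.
serial = wasted_minutes(9, 256, 5000, parallelism=1, poll_ms=50, rf=3)

print(f"parallel: {parallel:.0f} min, serial: {serial:.0f} min")
# prints "parallel: 25 min, serial: 1600 min"
```

Changing `tables` or `poll_ms` shows how quickly the waste grows with the number of tables or with a longer poll interval.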

To work around it we can:

- Increase the polling frequency, or spin-wait for the result. Expensive?
- Repair the whole keyspace. In that case progress UX and resume speed suffer: updates are slow and killed repairs have to be redone.

Suggested API changes:

Extend the /storage_service/repair_async/{keyspace} endpoint with an integer wait parameter that specifies how many seconds to wait for the repair result.

mmatczuk · 2020-05-13T09:09:42Z

Please give it a priority @tzach @eyalgutkind cc: @slivne

slivne · 2020-05-17T17:38:18Z

@mmatczuk In the case of row-level repair, I am not sure that repairing segment by segment makes sense. I think we have changed that, no?

mmatczuk · 2020-05-18T08:08:36Z

The API did not change; we repair a token range, @slivne.

slivne · 2020-05-18T09:27:49Z

For row_level repair we have https://github.com/scylladb/mermaid/issues/1736, no? That means that we will have "larger token ranges" (not per shard, split by msb_bit), or am I missing something?

mmatczuk · 2020-05-18T09:40:25Z

The Scylla API that Manager (or Nodetool) uses remains the same; what changes in the issue you linked is the way it's used.

For this issue it's completely irrelevant how you schedule the repair job in Scylla. What matters is how you wait for it to finish.

This new API blocks until the repair job has either finished or failed. E.g.,

    curl -X GET http://127.0.0.1:10000/storage_service/repair_status/?id=123

The current asynchronous API returns immediately even if the repair is in progress. E.g.,

    curl -X GET http://127.0.0.1:10000/storage_service/repair_async/ks?id=123

Users can use the new synchronous API to avoid repeatedly sending queries to check whether the repair job has finished.

Fixes scylladb#6445

asias · 2020-07-06T02:29:31Z

Patch posted: #6743

This new API blocks until the repair job has finished or failed, or until the timeout expires. E.g.,

- Without timeout

    curl -X GET http://127.0.0.1:10000/storage_service/repair_status/?id=123

- With timeout

    curl -X GET "http://127.0.0.1:10000/storage_service/repair_status/?id=123&timeout_ms=5000"

The timeout_ms value is in milliseconds.

The current asynchronous API returns immediately even if the repair is in progress. E.g.,

    curl -X GET http://127.0.0.1:10000/storage_service/repair_async/ks?id=123

Users can use the new synchronous API to avoid repeatedly sending queries to check whether the repair job has finished.

Fixes scylladb#6445

This new API blocks until the repair job has finished or failed, or until the timeout expires. E.g.,

- Without timeout

    curl -X GET http://127.0.0.1:10000/storage_service/repair_status/?id=123

- With timeout

    curl -X GET "http://127.0.0.1:10000/storage_service/repair_status/?id=123&timeout=5"

The timeout value is in seconds.

The current asynchronous API returns immediately even if the repair is in progress. E.g.,

    curl -X GET http://127.0.0.1:10000/storage_service/repair_async/ks?id=123

Users can use the new synchronous API to avoid repeatedly sending queries to check whether the repair job has finished.

Fixes scylladb#6445
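A client could drive the blocking endpoint from the commit messages above roughly as follows. This is a sketch under stated assumptions, not the Scylla Manager implementation: the helper names are hypothetical, and it assumes the response body is a JSON string such as "RUNNING", "SUCCESSFUL" or "FAILED", with "RUNNING" returned when the timeout expires before the job ends. Only the URL shape and the `id` and `timeout` parameters come from the commit message.

```python
import json
from typing import Callable, Optional
from urllib.parse import urlencode
from urllib.request import urlopen

BASE = "http://127.0.0.1:10000"  # address used in the curl examples


def repair_status_url(repair_id: int, timeout_s: Optional[int] = None,
                      base: str = BASE) -> str:
    """Build the repair_status URL, adding timeout (in seconds) if given."""
    params = {"id": repair_id}
    if timeout_s is not None:
        params["timeout"] = timeout_s
    return f"{base}/storage_service/repair_status/?{urlencode(params)}"


def http_fetch(url: str) -> str:
    """GET the URL and decode the body (assumed to be a JSON string)."""
    with urlopen(url) as resp:
        return json.loads(resp.read().decode("utf-8"))


def wait_for_repair(repair_id: int, timeout_s: int = 5,
                    fetch: Callable[[str], str] = http_fetch,
                    max_calls: int = 1000) -> str:
    """Long-poll until the job leaves the (assumed) RUNNING state."""
    for _ in range(max_calls):
        # Each call blocks on the server for up to timeout_s seconds,
        # so there is no client-side sleep between calls.
        status = fetch(repair_status_url(repair_id, timeout_s))
        if status != "RUNNING":
            return status
    raise TimeoutError(f"repair {repair_id} still running after {max_calls} calls")
```

With a bounded timeout the client gets control back periodically (for example to report progress or honour cancellation) while still learning about completion as soon as the server responds.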

mmatczuk · 2020-09-02T14:39:09Z

@tzach can we release that patch in a version?

avikivity · 2020-09-03T12:32:17Z

It will be in 4.4. You can use the nightly builds for testing.

slivne · 2020-09-03T13:40:45Z

Avi means 4.3 - and you can use the nightly build :)

mmatczuk · 2021-03-11T13:29:35Z

Any plans to port that to Enterprise?

mmatczuk · 2021-03-23T16:27:08Z

Ping.

slivne added the enhancement label Jun 4, 2020

slivne added this to the 4.x milestone Jun 4, 2020

asias mentioned this issue Jul 2, 2020

repair: Add synchronous API to query repair status #6743

Merged

asias self-assigned this Jul 2, 2020

scylladb-promoter closed this as completed in 271fac5 Jul 14, 2020

tzach modified the milestones: 4.x, 4.3 Nov 16, 2020
