API: Add long polling to StorageServiceRepairAsyncByKeyspaceGet #6445

Closed
mmatczuk opened this issue May 13, 2020 · 11 comments

@mmatczuk
Contributor

This is needed to make repairs faster as measured in walltime.

Scylla Manager asks Scylla to repair a given token range and then waits for Scylla to finish by polling the /storage_service/repair_async/{keyspace} endpoint.

Having to poll for repair job results every X ms introduces a significant delay in the whole process as seen by the user.

Let's say we have

  • 9 nodes
  • 5k tables
  • 4 shards each node
  • 16 ranges per shard
  • RF=3

If we poll every 50 ms, the mean wasted time per operation is 25 ms, since on average a job finishes halfway through a poll interval.

That adds up to 25 minutes of wall time spent waiting that could be avoided: (9 * 256 * 5000) / (4 * 16) * 25 / 1000 / 3 / 60. The wasted CPU time is 3 times as much. It grows with the number of tables, and it gets worse if you want to repair more slowly.

Repairing 1 token range at a time results in over a day (1600 minutes) of waiting that could be avoided: (9 * 256 * 5000) * 25 / 1000 / 60 / 3.
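
The two estimates above can be reproduced with a short calculation. This is only a sketch: the function and parameter names are made up for illustration, and reading 256 as the number of token ranges per node is an assumption (the issue gives the number without naming it).

```python
def wasted_wait_minutes(nodes, ranges_per_node, tables, rf,
                        parallel_jobs, poll_interval_ms):
    """Estimate wall time lost to polling, in minutes.

    Assumes a job finishes at a uniformly random point inside a poll
    interval, so the mean wasted time per job is half the interval.
    """
    mean_wait_ms = poll_interval_ms / 2
    # Each token range is shared by `rf` replicas, so it is repaired once, not rf times.
    jobs = nodes * ranges_per_node * tables / rf
    # Jobs that run in parallel wait in parallel, so only sequential rounds add up.
    sequential_rounds = jobs / parallel_jobs
    return sequential_rounds * mean_wait_ms / 1000 / 60


# 4 shards * 16 ranges per shard repaired in parallel
print(wasted_wait_minutes(9, 256, 5000, 3, parallel_jobs=4 * 16, poll_interval_ms=50))  # 25.0
# 1 token range at a time
print(wasted_wait_minutes(9, 256, 5000, 3, parallel_jobs=1, poll_interval_ms=50))  # 1600.0
```

The result scales linearly with the poll interval and with the number of tables, which is why shrinking the interval only trades waiting time for request load.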

To work around it we can:

  1. Poll more often or spin while waiting for the result - expensive?
  2. Repair the whole keyspace at once - in that case progress UX and resume speed suffer: progress updates are slow and killed repairs need to be redone

Suggested API changes:

Extend the /storage_service/repair_async/{keyspace} endpoint with an integer wait parameter that specifies how many seconds to wait for the repair result.
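
To illustrate why a blocking wait helps, here is a small self-contained simulation that contrasts short polling with long polling. It does not talk to Scylla: FakeRepairJob, status_wait and the status strings are hypothetical stand-ins for a node and for the proposed wait behaviour.

```python
import threading
import time


class FakeRepairJob:
    """Hypothetical stand-in for a repair job that finishes after duration_s."""

    def __init__(self, duration_s):
        self._done = threading.Event()
        timer = threading.Timer(duration_s, self._done.set)
        timer.daemon = True
        timer.start()

    def status(self):
        """Short-poll style: answer immediately."""
        return "SUCCESSFUL" if self._done.is_set() else "RUNNING"

    def status_wait(self, wait_s):
        """Long-poll style: block for up to wait_s seconds, then answer."""
        self._done.wait(wait_s)
        return self.status()


def short_poll(job, interval_s):
    # The caller only notices completion at the next poll tick.
    while job.status() == "RUNNING":
        time.sleep(interval_s)
    return job.status()


def long_poll(job, wait_s):
    # The call returns as soon as the job completes; re-issue it on timeout.
    while True:
        status = job.status_wait(wait_s)
        if status != "RUNNING":
            return status


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed seconds)."""
    start = time.monotonic()
    result = fn(*args)
    return result, time.monotonic() - start


if __name__ == "__main__":
    print(timed(short_poll, FakeRepairJob(0.05), 0.3))
    print(timed(long_poll, FakeRepairJob(0.05), 0.3))
```

With a job that takes 50 ms and a 300 ms interval, the short-polling caller notices completion only at the next tick, while the long-polling caller returns right after the job ends.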

@mmatczuk
Contributor Author

Please give it a priority @tzach @eyalgutkind cc: @slivne

@slivne
Contributor

slivne commented May 17, 2020

@mmatczuk In the case of row-level repair, I am not sure that repairing segment by segment makes sense. I think we have changed that, no?

@mmatczuk
Contributor Author

The API did not change; we repair a token range, @slivne.

@slivne
Contributor

slivne commented May 18, 2020

For row-level repair we have https://github.com/scylladb/mermaid/issues/1736, no? That means we will have "larger token ranges" (not per shard split by msb_bit). Or am I missing something?

@mmatczuk
Contributor Author

The Scylla API that Manager (or Nodetool) uses remains the same; what changes in the issue you linked is the way it is used.

For this issue it is irrelevant how you schedule a repair job in Scylla. What matters is how you wait for Scylla to finish.

@slivne slivne added this to the 4.x milestone Jun 4, 2020
asias added a commit to asias/scylla that referenced this issue Jul 2, 2020
This new API blocks until the repair job has either finished or failed.

E.g., curl -X GET 'http://127.0.0.1:10000/storage_service/repair_status/?id=123'

The current asynchronous API returns immediately, even if the repair is still in progress.

E.g., curl -X GET 'http://127.0.0.1:10000/storage_service/repair_async/ks?id=123'

Users can use the new synchronous API to avoid repeatedly sending queries to
poll whether the repair job has finished.

Fixes scylladb#6445
@asias asias self-assigned this Jul 2, 2020
@asias
Contributor

asias commented Jul 6, 2020

Patch posted: #6743

asias added a commit to asias/scylla that referenced this issue Jul 7, 2020
This new API blocks until the repair job has finished, failed, or timed out.

E.g.,

- Without timeout
curl -X GET 'http://127.0.0.1:10000/storage_service/repair_status/?id=123'

- With timeout
curl -X GET 'http://127.0.0.1:10000/storage_service/repair_status/?id=123&timeout_ms=5000'

The timeout_ms value is in milliseconds.

The current asynchronous API returns immediately, even if the repair is still in progress.

E.g., curl -X GET 'http://127.0.0.1:10000/storage_service/repair_async/ks?id=123'

Users can use the new synchronous API to avoid repeatedly sending queries to
poll whether the repair job has finished.

Fixes scylladb#6445
asias added a commit to asias/scylla that referenced this issue Jul 13, 2020
This new API blocks until the repair job has finished, failed, or timed out.

E.g.,

- Without timeout
curl -X GET 'http://127.0.0.1:10000/storage_service/repair_status/?id=123'

- With timeout
curl -X GET 'http://127.0.0.1:10000/storage_service/repair_status/?id=123&timeout=5'

The timeout value is in seconds.

The current asynchronous API returns immediately, even if the repair is still in progress.

E.g., curl -X GET 'http://127.0.0.1:10000/storage_service/repair_async/ks?id=123'

Users can use the new synchronous API to avoid repeatedly sending queries to
poll whether the repair job has finished.

Fixes scylladb#6445
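
A client of the endpoint described in the commit message above could look roughly like the following sketch. The URL path and the id and timeout parameters are taken from the curl examples. Everything else is an assumption made for illustration: the helper names, the idea that the response body is a JSON string, and the idea that "RUNNING" is returned when the timeout expires before the job ends.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def repair_status_url(base_url, repair_id, timeout_s=None):
    """Build the repair_status URL; timeout_s is in seconds, as in the commit message."""
    params = {"id": repair_id}
    if timeout_s is not None:
        params["timeout"] = timeout_s
    return f"{base_url}/storage_service/repair_status/?{urlencode(params)}"


def http_get_json(url):
    """Fetch a URL and decode the body as JSON (assumed response format)."""
    with urlopen(url) as resp:
        return json.loads(resp.read().decode("utf-8"))


def wait_for_repair(base_url, repair_id, timeout_s=5, fetch=http_get_json):
    """Re-issue the blocking call until the job leaves the assumed RUNNING state."""
    while True:
        status = fetch(repair_status_url(base_url, repair_id, timeout_s))
        if status != "RUNNING":  # assumption: any other value is a final state
            return status


if __name__ == "__main__":
    print(repair_status_url("http://127.0.0.1:10000", 123, timeout_s=5))
```

The fetch argument exists only so the loop can be exercised without a running node. Using a bounded timeout rather than an unbounded wait keeps each HTTP request short enough to survive idle-connection limits between client and node.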
@mmatczuk
Contributor Author

mmatczuk commented Sep 2, 2020

@tzach can we release that patch in a version?

@avikivity
Member

It will be in 4.4. You can use the nightly builds for testing.

@slivne
Contributor

slivne commented Sep 3, 2020 via email

@tzach tzach modified the milestones: 4.x, 4.3 Nov 16, 2020
@mmatczuk
Contributor Author

Any plans to port that to Enterprise?

@mmatczuk
Contributor Author

Ping.
