API: Add long polling to StorageServiceRepairAsyncByKeyspaceGet #6445
Please give it a priority @tzach @eyalgutkind cc: @slivne |
@mmatczuk In the case of row level repair - I am not sure that repairing segment by segment makes sense - I think we have changed that - no ? |
API did not change, we repair a token range @slivne. |
for row_level repair we have https://github.com/scylladb/mermaid/issues/1736 - no ? that means that we will have "larger token ranges" (not per shard split by msb_bit) - or am I missing something ? |
The Scylla API that Manager (or Nodetool) uses remains the same; the way it's used changes in the issue you linked. For this issue it's completely irrelevant how you schedule a repair job in Scylla; what matters is how you wait for it to finish. |
This new API blocks until the repair job is either finished or failed. E.g.:

```
curl -X GET http://127.0.0.1:10000/storage_service/repair_status/?id=123
```

The current asynchronous API returns immediately even if the repair is in progress. E.g.:

```
curl -X GET http://127.0.0.1:10000/storage_service/repair_async/ks?id=123
```

Users can use the new synchronous API to avoid repeatedly sending queries to poll whether the repair job is finished.

Fixes scylladb#6445
Patch posted: #6743 |
This new API blocks until the repair job is finished, failed, or timed out. E.g.:

- Without timeout:

```
curl -X GET http://127.0.0.1:10000/storage_service/repair_status/?id=123
```

- With timeout:

```
curl -X GET "http://127.0.0.1:10000/storage_service/repair_status/?id=123&timeout_ms=5000"
```

The `timeout_ms` is in milliseconds. The current asynchronous API returns immediately even if the repair is in progress. E.g.:

```
curl -X GET http://127.0.0.1:10000/storage_service/repair_async/ks?id=123
```

Users can use the new synchronous API to avoid repeatedly sending queries to poll whether the repair job is finished.

Fixes scylladb#6445
This new API blocks until the repair job is finished, failed, or timed out. E.g.:

- Without timeout:

```
curl -X GET http://127.0.0.1:10000/storage_service/repair_status/?id=123
```

- With timeout:

```
curl -X GET "http://127.0.0.1:10000/storage_service/repair_status/?id=123&timeout=5"
```

The `timeout` is in seconds. The current asynchronous API returns immediately even if the repair is in progress. E.g.:

```
curl -X GET http://127.0.0.1:10000/storage_service/repair_async/ks?id=123
```

Users can use the new synchronous API to avoid repeatedly sending queries to poll whether the repair job is finished.

Fixes scylladb#6445
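To make the difference between the two styles concrete, here is a minimal sketch of a client for each: a short-interval client-side poll loop against the asynchronous endpoint versus a single blocking call to the new status endpoint. The HTTP layer is stubbed out as a plain function (a real client would issue the `curl`-style GETs shown above); the status strings and call shapes are assumptions for illustration.

```python
import time

POLL_INTERVAL = 0.05  # 50 ms between client-side polls

# Stub transport standing in for
# GET /storage_service/repair_status/?id=<job_id>[&timeout=<seconds>].
# It reports RUNNING for the first two calls, then SUCCESSFUL, so the
# control flow below is runnable and deterministic.
_calls = {"n": 0}

def get_repair_status(job_id, timeout=None):
    _calls["n"] += 1
    return "SUCCESSFUL" if _calls["n"] >= 3 else "RUNNING"

def wait_polling(job_id):
    # Old style: repeatedly query status, burning up to POLL_INTERVAL
    # of wall time per check after the job has already finished.
    while True:
        status = get_repair_status(job_id)
        if status != "RUNNING":
            return status
        time.sleep(POLL_INTERVAL)

def wait_blocking(job_id, timeout=5):
    # New style: one call that blocks server-side until the job
    # finishes, fails, or the timeout (in seconds) expires.
    return get_repair_status(job_id, timeout=timeout)

print(wait_polling(123))   # → SUCCESSFUL
```

With the blocking endpoint, the poll loop collapses to a single request per job (plus retries on timeout), which is what removes the per-operation waiting discussed later in this issue.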
@tzach can we release that patch in a version? |
It will be in 4.4. You can use the nightly builds for testing. |
Avi means 4.3 - and you can use the nightly build :)
|
Any plans to port that to Enterprise? |
Ping. |
This is needed to make repairs faster as measured in walltime.
Scylla Manager asks Scylla to repair a given token range and then waits for it to finish using the `/storage_service/repair_async/{keyspace}` endpoint. Having to poll this API for repair job results every X ms introduces a significant delay in the whole process as seen by the user.
Let's say we have
If we poll every 50 ms, the mean wasted time per operation is 25 ms.
That makes 25 minutes of walltime spent waiting that could be avoided:

```
(9 * 256 * 5000) / (4 * 16) * 25 / 1000 / 3 / 60
```

The wasted CPU time is 3 times as much. It explodes with the number of tables, and if you want to repair more slowly it degrades as well. Repairing 1 token range at a time results in over a DAY (1600 minutes) of waiting that could be avoided:

```
(9 * 256 * 5000) * 25 / 1000 / 60 / 3
```
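As a sanity check, the two figures follow directly from the formulas given in the issue (25 ms mean waste per poll, the example operation count, 4 * 16 operations in flight in the parallel case):

```python
# Reproduce the walltime estimates from the formulas above.
operations = 9 * 256 * 5000   # total repair operations in the example
poll_waste_ms = 25            # mean wasted time per operation (half of 50 ms)

# Parallel case: 4 * 16 operations in flight at once.
parallel_minutes = operations / (4 * 16) * poll_waste_ms / 1000 / 3 / 60
print(round(parallel_minutes))   # → 25

# Serial case: one token range at a time.
serial_minutes = operations * poll_waste_ms / 1000 / 60 / 3
print(round(serial_minutes))     # → 1600
```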
To work around it we can:
Suggested API changes:
Extend the `/storage_service/repair_async/{keyspace}` endpoint with an `int wait` parameter that describes how many seconds to wait for the repair result.