repair: finish repair immediately on local keyspaces #12459
Conversation
@bhalevy review please
Please also add to the commit log that this fixes a bug, and describe the symptoms you saw when attempting to repair the system keyspace. It's worth adding a rest_api unit test that repairs the system keyspace and verifies this patch.
CI state
Force-pushed 3831079 to fc2e1a6
CI state
LGTM
@asias please review too
ping @asias
…ksandra Martyniuk:

System keyspace is a keyspace with local replication strategy and thus it does not need to be repaired. It is possible to invoke repair of this keyspace through the API, which leads to a runtime error since peer_events and scylla_table_schema_history have different sharding logic. For keyspaces with local replication strategy, repair_service::do_repair_start returns immediately.

Closes #12459

* github.com:scylladb/scylladb:
  test: rest_api: check if repair of system keyspace returns immediately
  repair: finish repair immediately on local keyspaces
I had to dequeue this PR; the test from the second patch failed the next promotion four times in a row:
resp = rest_api.send("GET", "task_manager/list_module_tasks/repair")
resp.raise_for_status()
assert not resp.json(), "Repair of system keyspace didn't finish immediately"
we should wait for the task to handle the race.
As for verifying the "immediately" part, we can measure wall clock and make sure it's fast enough, although I do hate the idea...
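Polling the task list until it drains, instead of asserting emptiness once, would close that race without resorting to wall-clock measurements. A minimal sketch, assuming `fetch_tasks` is a stand-in wrapper around the `rest_api.send("GET", "task_manager/list_module_tasks/repair")` call shown above:

```python
import time

def wait_until_empty(fetch_tasks, timeout=10.0, interval=0.1):
    """Poll fetch_tasks() until it returns an empty task list or the
    timeout expires. Returns True if the list drained in time."""
    deadline = time.monotonic() + timeout
    while True:
        if not fetch_tasks():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

This keeps the happy path fast (an already-empty list returns immediately) while tolerating a short window in which the task is still being unregistered.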
For local keyspace repairs a new repair_id is created, but a corresponding task is not. So maybe it should be "Repair task for keyspace with local replication strategy was created"?
We could get false positives though (if 10s is not enough). If we want to avoid that, we can set ttl to some large number. That would unfortunately force us to move the test to task_manager_test.py so it can be skipped in release mode, and I don't think that's where it belongs.
The reason why it keeps failing is that it is run multiple times on the same scylla instance, each time getting the next sequence number. Changing line 477 to
assert resp.json() > 0, "Repair got invalid sequence number"
will solve that.
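In sketch form, the idea is to validate the sequence number rather than pin it to a fixed value (the helper name below is illustrative; only the assertion itself comes from the comment above):

```python
def check_sequence_number(seq):
    # Repairs on the same scylla instance keep incrementing the sequence
    # counter, so an equality check (e.g. == 1) is flaky across reruns;
    # any positive number is a valid newly allocated sequence number.
    assert seq > 0, "Repair got invalid sequence number"
    return seq
```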
Force-pushed fc2e1a6 to 66b6a35
CI state
resp = rest_api.send("GET", "task_manager/list_module_tasks/repair")
resp.raise_for_status()
assert not resp.json(), "Repair task for keyspace with local replication strategy was created"
Hmm, can other (repair) tests run concurrently on the same node? If not, then this LGTM.
Otherwise, how about getting the particular task status and validating its task id isn't found? (or making sure it isn't in the list returned above, if it happens to be non-empty)
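The task-id check suggested here could look like the sketch below; the `task_id` field name in the task-list JSON is an assumption for illustration:

```python
def task_absent(tasks, task_id):
    """Return True if no entry in the module's task list carries task_id.

    Robust against other repair tasks running concurrently on the node:
    the list may be non-empty, as long as *our* task never appears in it."""
    return all(t.get("task_id") != task_id for t in tasks)
```

The test would then assert `task_absent(resp.json(), repair_task_id)` instead of requiring an empty list.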
Force-pushed 66b6a35 to 284b5c0
CI state
LGTM, thanks
System keyspace is a keyspace with local replication strategy and thus it does not need to be repaired. It is possible to invoke repair of this keyspace through the API, which leads to a runtime error since peer_events and scylla_table_schema_history have different sharding logic. For keyspaces with local replication strategy, repair_service::do_repair_start returns immediately.
…responding task is created
Force-pushed 284b5c0 to 8cb3190
CI state
@scylladb/scylla-maint please consider merging
Hmm... in Broadcast Tables we use a local replication strategy table underneath which we update through Raft commands, and we wanted to use repair as part of sending a snapshot from leader to follower :D (and if I recall correctly, it actually did manage to synchronize the data - @margdoc?) This will make it a bit harder for us... :( cc @margdoc
I don't think we wanted to use
For ad-hoc repair you should probably call the mid-level repair functions that node operations et al. use. Cc @asias