Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

repair: finish repair immediately on local keyspaces #12459

Merged
merged 2 commits into from Feb 10, 2023

Conversation

Deexie
Copy link
Contributor

@Deexie Deexie commented Jan 5, 2023

System keyspace is a keyspace with local replication strategy and thus
it does not need to be repaired. It is possible to invoke repair
of this keyspace through the api, which leads to runtime error since
peer_events and scylla_table_schema_history have different sharding logic.

For keyspaces with local replication strategy repair_service::do_repair_start
returns immediately.

@Deexie
Copy link
Contributor Author

Deexie commented Jan 5, 2023

@bhalevy review please

@bhalevy bhalevy requested a review from asias January 5, 2023 18:13
repair/repair.cc Outdated Show resolved Hide resolved
@bhalevy
Copy link
Member

bhalevy commented Jan 5, 2023

Please also add to the commit log that this fixes a bug and describe the symptoms you saw when attempting to repair the system keyspace.

It's worth adding a rest_api unit test to repair the system keyspace and verify this patch.

@scylladb-promoter
Copy link
Contributor

repair/repair.cc Outdated Show resolved Hide resolved
@Deexie
Copy link
Contributor Author

Deexie commented Jan 11, 2023

  • repair_service::do_repair_start returns repair id instead of 0
  • extended git message to describe encountered bug
  • added test

@scylladb-promoter
Copy link
Contributor

@scylladb-promoter
Copy link
Contributor

@scylladb-promoter
Copy link
Contributor

@scylladb-promoter
Copy link
Contributor

@Deexie Deexie requested review from bhalevy and removed request for tgrabiec and nyh January 18, 2023 09:30
Copy link
Member

@bhalevy bhalevy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@asias please review too

@Deexie Deexie requested a review from asias January 19, 2023 13:47
@bhalevy
Copy link
Member

bhalevy commented Jan 30, 2023

ping @asias

denesb added a commit that referenced this pull request Jan 30, 2023
…ksandra Martyniuk

System keyspace is a keyspace with local replication strategy and thus
it does not need to be repaired. It is possible to invoke repair
of this keyspace through the api, which leads to runtime error since
peer_events and scylla_table_schema_history have different sharding logic.

For keyspaces with local replication strategy repair_service::do_repair_start
returns immediately.

Closes #12459

* github.com:scylladb/scylladb:
  test: rest_api: check if repair of system keyspace returns immediately
  repair: finish repair immediately on local keyspaces
@xemul
Copy link
Contributor

xemul commented Feb 1, 2023

I had to dequeue this PR, the test from the second patch failed next promotion four times in a row:
https://jenkins.scylladb.com/job/scylla-master/job/next/5628
https://jenkins.scylladb.com/job/scylla-master/job/next/5627
https://jenkins.scylladb.com/job/scylla-master/job/next/5626
https://jenkins.scylladb.com/job/scylla-master/job/next/5625
Running manually passes 🤷‍♂️ . @Deexie , please, check


resp = rest_api.send("GET", "task_manager/list_module_tasks/repair")
resp.raise_for_status()
assert not resp.json(), "Repair of system keyspace didn't finish immediately"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should wait for the task to handle the race.
As for verifying the "immediately" part, we can measure wall clock and make sure it's fast enough, although I do hate the idea...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For local keyspace's repairs new repair_id is created, but corresponding task is not. So maybe it should be "Repair task for keyspace with local replication strategy was created"?

We could get false positives though (if 10s is not enugh). If we want to avoid that, then we can set ttl to some large number. That will unfortunately make us move the test to task_manager_test.py to skip the test in release mode. I think it is not where it belongs to, though.

The reason why it keeps failing is that it is run multiple times on the same scylla instance, each time getting the next sequence number. Changing line 477 to assert resp.json() > 0, "Repair got invalid sequence number" will solve that.

@Deexie
Copy link
Contributor Author

Deexie commented Feb 2, 2023

  • resp.json() == 1 -> resp.json() > 0
  • "Repair of system keyspace didn't finish immediately" -> "Repair task for keyspace with local replication strategy was created" (and changed commit message respectively)

@scylladb-promoter
Copy link
Contributor


resp = rest_api.send("GET", "task_manager/list_module_tasks/repair")
resp.raise_for_status()
assert not resp.json(), "Repair task for keyspace with local replication strategy was created"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, can other (repair) tests run concurrenrly on the same node? If not, than this LGTM.
Otherwise, how about getting the particular task status and validating its task id isn't found? (or making sure it isn't in the list returned above, if it happens to be non-empty)

@Deexie
Copy link
Contributor Author

Deexie commented Feb 3, 2023

  • filter tasks list to only contain tasks with given sequence number before checking if it's empty

@scylladb-promoter
Copy link
Contributor

Copy link
Member

@bhalevy bhalevy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks

System keyspace is a keyspace with local replication strategy and thus
it does not need to be repaired. It is possible to invoke repair
of this keyspace through the api, which leads to runtime error since
peer_events and scylla_table_schema_history have different sharding logic.

For keyspaces with local replication strategy repair_service::do_repair_start
returns immediately.
@Deexie
Copy link
Contributor Author

Deexie commented Feb 3, 2023

  • rebased on master

@scylladb-promoter
Copy link
Contributor

@scylladb-promoter
Copy link
Contributor

@bhalevy
Copy link
Member

bhalevy commented Feb 9, 2023

@scylladb/scylla-maint please consider merging

@kbr-scylla
Copy link
Contributor

Hmm... in Broadcast Tables we use a local replication strategy table underneath which we update through Raft commands, and we wanted to use repair as part of sending snapshot from leader to follower :D (and if I recall correctly, it actually did manage to synchronize the data - @margdoc?) This will make it a bit harder for us... :( nothing that another if cannot solve though! Or maybe we can use a different API. We'll see.

cc @margdoc

@kbr-scylla
Copy link
Contributor

I don't think we wanted to use do_repair_start anyway, as it works on keyspace granularity, we need something much more granular.

@bhalevy
Copy link
Member

bhalevy commented Feb 9, 2023

I don't think we wanted to use do_repair_start anyway, as it works on keyspace granularity, we need something much more granular.

For ad-hoc repair you should probably call the mid-level repair functions that node operations et. al use.

Cc @asias

@scylladb-promoter scylladb-promoter merged commit e2064f4 into scylladb:master Feb 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants