repair: finish repair immediately on local keyspaces #12459

Deexie · 2023-01-05T16:27:12Z

The system keyspace uses the local replication strategy and thus
does not need to be repaired. It is still possible to invoke repair
of this keyspace through the API, which leads to a runtime error since
peer_events and scylla_table_schema_history have different sharding logic.

For keyspaces with local replication strategy, repair_service::do_repair_start
now returns immediately.
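
As a rough illustration of the intended behavior, here is a toy Python model. The actual change lives in the C++ `repair_service::do_repair_start` (repair/repair.cc) and is not reproduced here. The class name, the `"local"`/`"simple"` strategy strings, and the in-memory task list are hypothetical stand-ins, and the model assumes a fresh repair id is still handed out on the early-return path.

```python
import itertools


class ToyRepairService:
    """Hypothetical stand-in for a repair service; not Scylla's C++ code."""

    def __init__(self, keyspace_strategies):
        # Maps keyspace name -> replication strategy name (illustrative strings).
        self._strategies = dict(keyspace_strategies)
        self._next_id = itertools.count(1)
        self.tasks = []  # repair tasks that were actually scheduled

    def do_repair_start(self, keyspace):
        repair_id = next(self._next_id)
        if self._strategies[keyspace] == "local":
            # Nothing to synchronize with other nodes: return right away
            # without scheduling any repair work.
            return repair_id
        self.tasks.append({"sequence_number": repair_id, "keyspace": keyspace})
        return repair_id


service = ToyRepairService({"system": "local", "ks1": "simple"})
system_id = service.do_repair_start("system")  # early return, no task
ks1_id = service.do_repair_start("ks1")        # schedules a task
```

In this model only the non-local keyspace ends up in `service.tasks`, while both calls still receive an id.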

Deexie · 2023-01-05T16:27:31Z

@bhalevy review please

repair/repair.cc

bhalevy · 2023-01-05T18:18:20Z

Please also add to the commit log that this fixes a bug and describe the symptoms you saw when attempting to repair the system keyspace.

It's worth adding a rest_api unit test to repair the system keyspace and verify this patch.

scylladb-promoter · 2023-01-05T20:56:01Z

CI state SUCCESS - https://jenkins.scylladb.com/job/releng/job/Scylla-CI/3725/

repair/repair.cc

Deexie · 2023-01-11T11:02:31Z

repair_service::do_repair_start returns repair id instead of 0
extended git message to describe encountered bug
added test

scylladb-promoter · 2023-01-11T11:10:37Z

CI state FAILURE - https://jenkins.scylladb.com/job/releng/job/Scylla-CI/3780/

scylladb-promoter · 2023-01-11T11:34:01Z

CI state FAILURE - https://jenkins.scylladb.com/job/releng/job/Scylla-CI/3781/

scylladb-promoter · 2023-01-12T20:54:12Z

CI state FAILURE - https://jenkins.scylladb.com/job/releng/job/Scylla-CI/3828/

scylladb-promoter · 2023-01-13T14:05:55Z

CI state SUCCESS - https://jenkins.scylladb.com/job/releng/job/Scylla-CI/3836/

bhalevy

LGTM

@asias please review too

bhalevy · 2023-01-30T06:25:51Z

ping @asias

…ksandra Martyniuk

System keyspace is a keyspace with local replication strategy and thus it does not need to be repaired. It is possible to invoke repair of this keyspace through the api, which leads to runtime error since peer_events and scylla_table_schema_history have different sharding logic.

For keyspaces with local replication strategy repair_service::do_repair_start returns immediately.

Closes #12459

* github.com:scylladb/scylladb:
  test: rest_api: check if repair of system keyspace returns immediately
  repair: finish repair immediately on local keyspaces

xemul · 2023-02-01T09:39:17Z

I had to dequeue this PR: the test from the second patch failed in next promotion four times in a row:
https://jenkins.scylladb.com/job/scylla-master/job/next/5628
https://jenkins.scylladb.com/job/scylla-master/job/next/5627
https://jenkins.scylladb.com/job/scylla-master/job/next/5626
https://jenkins.scylladb.com/job/scylla-master/job/next/5625
Running it manually passes 🤷‍♂️. @Deexie, please check.

bhalevy · 2023-02-01T09:46:01Z

test/rest_api/test_storage_service.py

+
+    resp = rest_api.send("GET", "task_manager/list_module_tasks/repair")
+    resp.raise_for_status()
+    assert not resp.json(), "Repair of system keyspace didn't finish immediately"


We should wait for the task, to handle the race.
As for verifying the "immediately" part, we can measure wall-clock time and make sure it's fast enough, although I do hate the idea...

For repairs of local keyspaces, a new repair_id is created but the corresponding task is not. So maybe the message should be "Repair task for keyspace with local replication strategy was created"?

We could get false positives though (if 10s is not enough). If we want to avoid that, we can set the ttl to some large number. Unfortunately, that would force us to move the test to task_manager_test.py so that it is skipped in release mode. I don't think that is where it belongs, though.

The reason why it keeps failing is that it is run multiple times on the same scylla instance, each time getting the next sequence number. Changing line 477 to assert resp.json() > 0, "Repair got invalid sequence number" will solve that.
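
To see why the equality check is fragile, consider a minimal sketch in which a hypothetical `FakeNode` stands in for a long-lived Scylla instance that hands out increasing sequence numbers. The class and function names are illustrative only; the real test talks to the REST API.

```python
class FakeNode:
    """Hypothetical stand-in for one Scylla instance shared by several test runs."""

    def __init__(self):
        self._last_sequence_number = 0

    def start_repair(self):
        # Each request on the same instance gets the next sequence number.
        self._last_sequence_number += 1
        return self._last_sequence_number


def strict_check(sequence_number):
    # Only holds for the very first repair request on a fresh instance.
    return sequence_number == 1


def relaxed_check(sequence_number):
    # Holds for any valid sequence number, however many repairs ran before.
    return sequence_number > 0


node = FakeNode()
runs = [node.start_repair() for _ in range(3)]  # same test run three times
strict_results = [strict_check(n) for n in runs]
relaxed_results = [relaxed_check(n) for n in runs]
```

Under these assumptions the strict check passes only on the first run, which matches a test that passes when run manually on a fresh node but fails on repeated runs.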

Deexie · 2023-02-02T12:06:27Z

resp.json() == 1 -> resp.json() > 0
"Repair of system keyspace didn't finish immediately" -> "Repair task for keyspace with local replication strategy was created" (and changed the commit message accordingly)

scylladb-promoter · 2023-02-02T16:11:09Z

CI state SUCCESS - https://jenkins.scylladb.com/job/releng/job/Scylla-CI/4137/

bhalevy · 2023-02-03T05:28:16Z

test/rest_api/test_storage_service.py

+
+    resp = rest_api.send("GET", "task_manager/list_module_tasks/repair")
+    resp.raise_for_status()
+    assert not resp.json(), "Repair task for keyspace with local replication strategy was created"


Hmm, can other (repair) tests run concurrently on the same node? If not, then this LGTM.
Otherwise, how about getting the particular task's status and validating that its task id isn't found? (Or making sure it isn't in the list returned above, if that list happens to be non-empty.)

Deexie · 2023-02-03T10:07:56Z

filter tasks list to only contain tasks with given sequence number before checking if it's empty
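
A minimal sketch of that filtering step, assuming each entry returned by `task_manager/list_module_tasks/repair` is a dict carrying a `sequence_number` field. The field name, the helper name, and the sample data are assumptions for illustration, not taken from the patch.

```python
def tasks_with_sequence_number(tasks, sequence_number):
    """Return only the task entries that belong to the given repair."""
    return [t for t in tasks if t.get("sequence_number") == sequence_number]


# Sample data: another repair (sequence number 6) is running on the same node.
listed_tasks = [{"task_id": "hypothetical-id", "sequence_number": 6}]
system_repair_sequence_number = 7

matching = tasks_with_sequence_number(listed_tasks, system_repair_sequence_number)
assert not matching, "Repair task for keyspace with local replication strategy was created"
```

Filtering first means an unrelated repair task listed on the same node does not trip the assertion.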

scylladb-promoter · 2023-02-03T11:32:21Z

CI state FAILURE - https://jenkins.scylladb.com/job/releng/job/Scylla-CI/4159/

bhalevy

LGTM, thanks


Deexie · 2023-02-03T12:51:41Z

rebased on master

scylladb-promoter · 2023-02-03T13:54:37Z

CI state FAILURE - https://jenkins.scylladb.com/job/releng/job/Scylla-CI/4164/

scylladb-promoter · 2023-02-09T14:36:25Z

CI state SUCCESS - https://jenkins.scylladb.com/job/releng/job/Scylla-CI/4276/

bhalevy · 2023-02-09T17:30:23Z

@scylladb/scylla-maint please consider merging

kbr-scylla · 2023-02-09T17:37:28Z

Hmm... in Broadcast Tables we use a local replication strategy table underneath which we update through Raft commands, and we wanted to use repair as part of sending snapshot from leader to follower :D (and if I recall correctly, it actually did manage to synchronize the data - @margdoc?) This will make it a bit harder for us... :( nothing that another if cannot solve though! Or maybe we can use a different API. We'll see.

cc @margdoc

kbr-scylla · 2023-02-09T17:40:54Z

I don't think we wanted to use do_repair_start anyway, as it works on keyspace granularity, we need something much more granular.

bhalevy · 2023-02-09T18:12:56Z

> I don't think we wanted to use do_repair_start anyway, as it works on keyspace granularity, we need something much more granular.

For ad-hoc repair you should probably call the mid-level repair functions that node operations et al. use.

Cc @asias

Deexie requested review from tgrabiec and nyh as code owners January 5, 2023 16:27

bhalevy requested a review from asias January 5, 2023 18:13

bhalevy reviewed Jan 5, 2023

repair/repair.cc

asias reviewed Jan 6, 2023

repair/repair.cc

Deexie force-pushed the local-keyspace-repair branch from 3831079 to fc2e1a6 January 11, 2023 10:59

Deexie requested review from bhalevy and removed request for tgrabiec and nyh January 18, 2023 09:30

bhalevy approved these changes Jan 18, 2023

Deexie requested a review from asias January 19, 2023 13:47

asias approved these changes Jan 30, 2023

bhalevy reviewed Feb 1, 2023

Deexie force-pushed the local-keyspace-repair branch from fc2e1a6 to 66b6a35 February 2, 2023 12:03

bhalevy reviewed Feb 3, 2023

Deexie force-pushed the local-keyspace-repair branch from 66b6a35 to 284b5c0 February 3, 2023 10:07

bhalevy approved these changes Feb 3, 2023

Deexie added 2 commits February 3, 2023 13:35

test: rest_api: check if repair of system keyspace returns before corresponding task is created

8cb3190

Deexie force-pushed the local-keyspace-repair branch from 284b5c0 to 8cb3190 February 3, 2023 12:51

scylladb-promoter merged commit e2064f4 into scylladb:master Feb 10, 2023
