repair_tracker run failed: data_dictionary::no_such_column_family #13045

Closed
bhalevy opened this issue Mar 1, 2023 · 6 comments · Fixed by #13068

@bhalevy
Member

bhalevy commented Mar 1, 2023

Seen in https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-daily-debug/153/testReport/repair_additional_test/TestRepairAdditional/Run_Dtest_Parallel_Cloud_Machines___FullDtest___full_split037___test_repair_while_table_is_dropped/
With scylla version b6e4275

ccmlib.node.ToolError: Subprocess /jenkins/workspace/scylla-master/dtest-daily-debug/scylla/.ccm/scylla-repository/b6e427551193fd9a06d12e2c1de832e51a26ac00/share/cassandra/bin/nodetool -h 127.0.4.2 -p 7199 -Dcom.sun.jndi.rmiURLParsing=legacy repair exited with non-zero status; exit status: 2; 
stdout: [2023-03-01 06:45:04,160] Starting repair command #1, repairing 1 ranges for keyspace system_distributed_everywhere (parallelism=SEQUENTIAL, full=true)
[2023-03-01 06:45:45,264] Repair session 1 
[2023-03-01 06:45:45,264] Repair session 1 finished
[2023-03-01 06:45:45,294] Starting repair command #2, repairing 1 ranges for keyspace system_traces (parallelism=SEQUENTIAL, full=true)
[2023-03-01 06:47:30,414] Repair session 2 
[2023-03-01 06:47:30,414] Repair session 2 finished
[2023-03-01 06:47:30,434] Starting repair command #3, repairing 1 ranges for keyspace ks (parallelism=SEQUENTIAL, full=true)
[2023-03-01 06:47:31,550] Repair session 3 failed
[2023-03-01 06:47:31,551] Repair session 3 finished
; 
stderr: error: Repair job has failed with the error message: [2023-03-01 06:47:31,550] Repair session 3 failed
-- StackTrace --
java.lang.RuntimeException: Repair job has failed with the error message: [2023-03-01 06:47:31,550] Repair session 3 failed
	at org.apache.cassandra.tools.RepairRunner.progress(RepairRunner.java:124)
	at org.apache.cassandra.utils.progress.jmx.JMXNotificationProgressListener.handleNotification(JMXNotificationProgressListener.java:77)
	at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.dispatchNotification(ClientNotifForwarder.java:583)
	at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.doRun(ClientNotifForwarder.java:533)
	at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.run(ClientNotifForwarder.java:452)
	at com.sun.jmx.remote.internal.ClientNotifForwarder$LinearExecutor$1.run(ClientNotifForwarder.java:108)

https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-daily-debug/153/artifact/logs-full.debug.037/1677653415823_repair_additional_test.py%3A%3ATestRepairAdditional%3A%3Atest_repair_while_table_is_dropped/node2.log

INFO  2023-03-01 06:47:30,423 [shard 0] repair - repair[10b337d9-8fb0-4980-b65e-96b6edd0695f]: starting user-requested repair for keyspace ks, repair id 3, options {{trace -> false}, {jobThreads -> 1}, {incremental -> false}, {parallelism -> parallel}, {primaryRange -> false}}
INFO  2023-03-01 06:47:30,582 [shard 0] schema_tables - Dropping ks.cf_del0 id=3fa2f840-b7fc-11ed-b7ae-fdbea18bf840 version=a5682afd-6629-3976-8596-6c2619d25988
INFO  2023-03-01 06:47:30,583 [shard 0] database - Dropping ks.cf_del0 with auto-snapshot
INFO  2023-03-01 06:47:30,585 [shard 0] database - Truncating ks.cf_del0 with auto-snapshot
INFO  2023-03-01 06:47:30,607 [shard 0] repair - repair[10b337d9-8fb0-4980-b65e-96b6edd0695f]: Skipped sending repair_flush_hints_batchlog to nodes=[127.0.4.3, 127.0.4.1, 127.0.4.2]
WARN  2023-03-01 06:47:30,609 [shard 0] repair - repair[10b337d9-8fb0-4980-b65e-96b6edd0695f]: repair_tracker run failed: data_dictionary::no_such_column_family (Can't find a column family cf_del0 in keyspace ks)

BTW, this should be logged as an error, not a warning, since it is returned to the API.

At first glance, I see

auto s = db.find_column_family(keyspace, table).schema();

which may hit this error. We should ignore the replica::no_such_column_family error there, but there could be other call sites as well.

There's also

throw replica::no_such_column_family(keyspace, table_ids[idx]);

called from
, sharder(get_sharder_for_tables(db, keyspace, table_ids_))

Note that this particular error the test failed on is data_dictionary::no_such_column_family, not replica::no_such_column_family.
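
To illustrate the "ignore the error and move on" behavior suggested above, here is a minimal sketch. It uses stand-in types only: `toy_database`, `repair_tables`, and the local `no_such_column_family` struct are hypothetical and are not Scylla's real classes or signatures. The point is just that a lookup which throws for a concurrently dropped table can be caught per table, so the remaining tables still get repaired.

```cpp
#include <cassert>
#include <iostream>
#include <set>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Stand-in exception type for illustration only; not Scylla's real class.
struct no_such_column_family : std::runtime_error {
    no_such_column_family(const std::string& ks, const std::string& cf)
        : std::runtime_error("Can't find a column family " + cf + " in keyspace " + ks) {}
};

// Hypothetical toy database holding (keyspace, table) pairs.
struct toy_database {
    std::set<std::pair<std::string, std::string>> tables;

    // Throws if the table does not exist (e.g. it was dropped concurrently).
    void find_column_family(const std::string& ks, const std::string& cf) const {
        if (tables.count(std::make_pair(ks, cf)) == 0) {
            throw no_such_column_family(ks, cf);
        }
    }
};

// Repairs each requested table, skipping any that no longer exist.
// Returns the names of the tables that were actually repaired.
std::vector<std::string> repair_tables(const toy_database& db,
                                       const std::string& ks,
                                       const std::vector<std::string>& requested) {
    std::vector<std::string> repaired;
    for (const auto& cf : requested) {
        try {
            db.find_column_family(ks, cf);
        } catch (const no_such_column_family& e) {
            // The table was dropped: log and continue with the remaining tables.
            std::cerr << "skipping dropped table: " << e.what() << '\n';
            continue;
        }
        repaired.push_back(cf); // placeholder for the real repair work
    }
    return repaired;
}
```

Calling `repair_tables(db, "ks", {"cf_del0", "cf_keep"})` on a database that only contains `ks.cf_keep` would skip `cf_del0` and return just `cf_keep`, instead of failing the whole run.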

Cc @asias @Deexie

@bhalevy
Member Author

bhalevy commented Mar 1, 2023

There are also a few places in streaming that need to ignore dropped tables, but it's likely out of scope for this issue:

, _cf(_db.local().find_column_family(_id))

auto op = _db.local().find_column_family(cf_id).stream_in_progress();

db.find_column_family(ks, cf);

db.find_column_family(cf_id);

mykaul added this to the 5.3 milestone Mar 1, 2023
denesb added a commit that referenced this issue Mar 17, 2023
While repair requested by user is performed, some tables
may be dropped. When the repair proceeds to these tables,
it should skip them and continue with others.

When no_such_column_family is thrown during user requested
repair, it is logged and swallowed. Then the repair continues with
the remaining tables.

Fixes: #13045

Closes #13068

* github.com:scylladb/scylladb:
  repair: fix indentation
  repair: continue user requested repair if no_such_column_family is thrown
  repair: add find_column_family_if_exists function
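
The commit list names a `find_column_family_if_exists` function but does not show its signature. As a rough sketch of what a non-throwing lookup of that kind could look like, the following uses a toy database and an assumed pointer-returning signature; `toy_table`, `toy_database`, and `repair_table_if_exists` are hypothetical names for illustration and are not taken from the Scylla source.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Toy table type for illustration only.
struct toy_table {
    std::string name;
};

// Hypothetical toy database; not Scylla's replica::database.
class toy_database {
    std::unordered_map<std::string, toy_table> _tables;
public:
    void add_table(const std::string& name) { _tables.emplace(name, toy_table{name}); }
    void drop_table(const std::string& name) { _tables.erase(name); }

    // Assumed signature: returns nullptr instead of throwing when the table is gone.
    const toy_table* find_column_family_if_exists(const std::string& name) const {
        auto it = _tables.find(name);
        return it == _tables.end() ? nullptr : &it->second;
    }
};

// Returns true if the table was found and "repaired", false if it was skipped.
bool repair_table_if_exists(const toy_database& db, const std::string& name) {
    const toy_table* t = db.find_column_family_if_exists(name);
    if (t == nullptr) {
        return false; // table dropped while repair was running: skip it
    }
    // ... the real repair work on *t would go here ...
    return true;
}
```

A caller can then treat a `nullptr` result as "table was dropped" and move on, without relying on exception handling for an expected condition.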
avikivity pushed a commit that referenced this issue Mar 19, 2023
@avikivity
Member

@bhalevy @Deexie should we backport this? how far?

@bhalevy
Member Author

bhalevy commented Apr 27, 2023

I think the issue has existed in this form, or close to it, since the inception of row-level repair, so a backport to all live branches is technically desirable. It was also hit in the field, outside of QA, by an atypical application that rapidly creates and drops tables, including during repair, so I think it would be a good idea to backport it to all 5.x branches.

avikivity pushed a commit that referenced this issue Aug 20, 2023

(cherry picked from commit 9859bae)
avikivity pushed a commit that referenced this issue Aug 20, 2023

(cherry picked from commit 9859bae)
@avikivity
Member

Backported to 5.1, 5.2

@mykaul
Contributor

mykaul commented Sep 3, 2023

> There are also a few places in streaming that need to ignore dropped tables, but it's likely out of scope for this issue:

@bhalevy - did you open an issue for the below items?

, _cf(_db.local().find_column_family(_id))

auto op = _db.local().find_column_family(cf_id).stream_in_progress();

db.find_column_family(ks, cf);

db.find_column_family(cf_id);

@bhalevy
Member Author

bhalevy commented Sep 3, 2023

> There are also a few places in streaming that need to ignore dropped tables, but it's likely out of scope for this issue:
>
> @bhalevy - did you open an issue for the below items?
>
> , _cf(_db.local().find_column_family(_id))
>
> auto op = _db.local().find_column_family(cf_id).stream_in_progress();
>
> db.find_column_family(ks, cf);
>
> db.find_column_family(cf_id);

Now there is.

#15257
