An index being dropped during bootstrap causes a node to fail to start (Startup failed: std::runtime_error ({shard 2: std::runtime_error (repair[433ec093-991c-46c1-939f-9a9585a706e9]: 2 out of 2331 ranges failed ) #15598
Comments
cc @fruch |
@mykaul is labeling with triage enough to get it assigned, or is anything else needed? |
It is sufficient. I'm not sure I understand when the node failed to start, and whether we attempted restarting it again. It failed to repair and did not start. And then? |
And that's it: we do not retry in those situations; we stop the test and collect the information. |
@Deexie please try to find out what went wrong. |
@asias do we support a table being dropped in the middle of repair? I think we will have to, because of cloud, but is there code currently in repair ensuring a table is kept alive while repair is ongoing? |
A dropped table isn't the direct reason behind the failing bootstrap. The two mentioned ranges failed due to:
I'm not familiar with the code that throws it, though, so I will need some time to figure out what exactly happens and why. |
This error means the failure happened on the other side. You will need to look into the log files of |
Oh, makes sense, thanks! |
reproduced in this week's run: node-17 failed to bootstrap:
Installation details
Kernel Version: 5.15.0-1045-aws
Cluster size: 12 nodes (i3en.2xlarge)
Scylla Nodes used in this run:
OS / Image:
Test:
Logs and commands
Logs:
|
This also happened in a different test case:
Installation details
Kernel Version: 5.15.0-1045-aws
Cluster size: 5 nodes (i4i.2xlarge)
Scylla Nodes used in this run:
OS / Image:
Test:
Logs and commands
Logs:
|
Dropping a table during repair is supposed to work. We do not try to keep the table alive; instead we ignore the "table not found" exception. There are multiple places we need to take care of, and probably some places were missed.
--
Asias
|
Any thoughts on how to flush those out? |
It should be assigned to a developer and reproduced locally. This doesn't require a full s-c-t run with 7,432 nodes and 8PB of data; it just requires running the scenario enough times to trigger the right timing. |
We should write a dedicated test for this. We can use the failure injection framework to time the drop table/drop index such that it happens in the middle of streaming. |
One more case of repair failing because of a dropped view:
and on node-8, we see the repair fails because it can't find the table anymore (or at least that's how it seems):
Installation details
Kernel Version: 5.15.0-1047-aws
Cluster size: 5 nodes (i4i.2xlarge)
Scylla Nodes used in this run:
OS / Image:
Test:
Logs and commands
Logs:
|
This is essentially #12373 |
Hmm, maybe not. The error is apparently coming from a different path. |
@k0machi we will not be investigating such issues in eventually consistent schema/topology mode. Well, unless @avikivity pushes really hard, but given we have a bunch of similar issues, the turn of this one will not come quickly. |
Makes sense. Consistent topology (and schema) changes can be considered a fix for this issue, if they indeed fix it. |
TBH they probably don't |
We had multiple places where we tried to apply filtering/demoting of view update errors, and they keep popping up in all kinds of cases:
* cases of parallel nemesis
* cases where our log reading slows down, and those pop up out of context, since the filter is gone
Because of those issues, and the fact that they aren't going to be fixed any time soon, we'll apply this filter globally until all of the view update issues are addressed.
Ref: scylladb/scylladb#16206 Ref: scylladb/scylladb#16259 Ref: scylladb/scylladb#15598
@Deexie - any updates? |
Reproduced again in the weekly tier1 runs:
Installation details
Kernel Version: 5.15.0-1051-aws
Cluster size: 12 nodes (i3en.2xlarge)
Scylla Nodes used in this run:
OS / Image:
Test:
Logs and commands
Logs:
|
Let's fix it already, it's wasting everyone's time. |
@gleb-cloudius is working on scheduling the repair through the topology coordinator. Even so, repair won't block DDL, so the error will still be there. Also, with tablets and file-based streaming, the issue will not be present; with row-based streaming we'll still need a fix. |
Also, the repair here is not really a nodetool repair operation, but streaming that uses repair during node bootstrap, which is already done by the topology coordinator. |
I am willing to help with the test scenario @Deexie |
So I understood what's going on (and recently added new logs confirm): A new node (A) starts repair and it comes to the step when it sends missing rows to the followers. In the meantime one of the tables is dropped. Node A still sends mutation fragments of that table to the follower (B).
Then on B we have: `repair_put_row_diff_with_rpc_stream_handler -> repair_put_row_diff_with_rpc_stream_process_op -> repair_meta::put_row_diff_handler -> repair_meta::apply_rows_on_follower -> repair_meta::do_apply_rows -> repair_writer::create_writer -> repair_writer_impl::create_writer -> database::find_column_family`.
Which throws, since B has already dropped the table, and so `repair_stream_cmd::error` is sinked. So I guess we should just send `repair_stream_cmd::put_rows_done` when `no_such_column_family` is thrown.
But I'm stuck writing a test for that (especially with provoking data loss on a node). The test should probably:
- disable consistent topology,
- enable rbno,
- support rpc stream,
- lose some data on node B,
- maybe some other options I'm not aware of.
@kostja (or others) could you, please, help? I would be thankful for any tips, examples of similar usages etc. |
Didn't we fix this already? 9859bae Maybe we just missed a few cases. |
* Aleksandra Martyniuk ***@***.***> [24/01/22 17:02]:
A new node (A) starts repair and it comes to the step when it sends missing rows to the followers. In the meantime one of the tables is dropped. Node A still sends mutation fragment of that table to the follower (B).
Then on B we have:
```
repair_put_row_diff_with_rpc_stream_handler ->
repair_put_row_diff_with_rpc_stream_process_op ->
repair_meta::put_row_diff_handler ->
repair_meta::apply_rows_on_follower ->
repair_meta::do_apply_rows ->
repair_writer::create_writer ->
repair_writer_impl::create_writer ->
database::find_column_family
```
Which throws since B's already dropped the table and so `repair_stream_cmd::error` is sinked.
So I guess we should just send `repair_stream_cmd::put_rows_done` when no_such_column_family is thrown.
But I'm stuck writing a test for that (especially with providing data loss on a node). The test should probably:
- disable consistent topology,
- enable rbno,
- support rpc stream,
- lose some data on node B,
- maybe some other options I'm not aware of.
@kostja (or others) could you, please, help? I would be thankful for any tips, examples of similiar usages etc.
I think you should make sure there is a steady stream of create table / insert a few semi-random ranges / drop table in one thread, for X seconds, and concurrently run nodetool repair with a range specification for one of the ranges above. You can have many nodetool repair invocations running as long as the ranges don't interfere. Sooner or later the repair will hit the issue at hand; I think it's rather sooner, and I guess this test can take 5-10 seconds.
--
Konstantin Osipov, Moscow, Russia
|
If a table is dropped during repair, the repair master may send a row of a dropped table to a follower. Currently, in this situation no_such_column_family is thrown on the follower node, which responds with repair_stream_cmd::error and then handles the exception at its side. When the master receives repair_stream_cmd::error, it assumes that the repair failed. To avoid that, add a table_dropped option to repair_stream_cmd and send it in this case. Handle table_dropped as if the range repair succeeded on the repair master. Fixes: scylladb#15598.
Issue reproduced with:
Packages
Scylla version:
Issue description
node25 failed to bootstrap, because the index was removed at that moment.
Impact
How frequently does it reproduce?
Installation details
Cluster size: 12 nodes (i3en.2xlarge)
Scylla Nodes used in this run:
OS / Image:
Test:
Logs and commands
Logs:
|
Hit a variant of this issue in https://jenkins.scylladb.com/job/scylla-master/job/dtest-release/494/testReport/repair_additional_test/TestRepairAdditional/Run_Dtest_Parallel_Cloud_Machines___FullDtest___full_split008___test_repair_while_table_is_dropped/
|
… from remote node' from Aleksandra Martyniuk
RPC calls lose information about the type of a returned exception. Thus, if a table is dropped on the receiver node, but it still exists on the sender node and the sender node streams the table's data, then the whole operation fails. To prevent that, add a method which synchronizes schema and then checks if the exception was caused by a table drop. If so, the exception is swallowed. Use the method in streaming and repair to continue them when the table is dropped in the meantime.
Fixes: #17028. Fixes: #15370. Fixes: #15598.
Closes #17525
* github.com:scylladb/scylladb:
repair: handle no_such_column_family from remote node gracefully
test: test drop table on receiver side during streaming
streaming: fix indentation
streaming: handle no_such_column_family from remote node gracefully
repair: add methods to skip dropped table
… from remote node' from Aleksandra Martyniuk
RPC calls lose information about the type of a returned exception. Thus, if a table is dropped on the receiver node, but it still exists on the sender node and the sender node streams the table's data, then the whole operation fails. To prevent that, add a method which synchronizes schema and then checks if the exception was caused by a table drop. If so, the exception is swallowed. Use the method in streaming and repair to continue them when the table is dropped in the meantime.
Fixes: #17028. Fixes: #15370. Fixes: #15598.
Closes #17528
* github.com:scylladb/scylladb:
repair: handle no_such_column_family from remote node gracefully
test: test drop table on receiver side during streaming
streaming: fix indentation
streaming: handle no_such_column_family from remote node gracefully
repair: add methods to skip dropped table
Issue description
A decommission on node-1 starts; at the same time, a bit later, another nemesis running in parallel starts to create an index:
We get standard warning messages about updating the column definitions at this time:
Then during bootstrap we get the following error during repair, once the index is being dropped:
Before that we get a lot of storage_proxy warnings about mutation updates, 11618 lines to be exact:
All of the updates come from
longevity-parallel-topology-schema--db-node-136349f0-6 (54.171.56.159 | 10.4.8.181) (shards: 7)
Impact
Node fails to start up.
How frequently does it reproduce?
Unknown, did not reproduce on the subsequent run.
Installation details
Kernel Version: 5.15.0-1045-aws
Scylla version (or git commit hash): 5.4.0~dev-20230921.a56a4b6226e6 with build-id 616f734e7c7fb5e3ee8898792b3c415d2574a132
Cluster size: 5 nodes (i4i.2xlarge)
Scylla Nodes used in this run:
OS / Image:
ami-00f051bf1c684c01a
(aws: undefined_region)
Test:
longevity-schema-topology-changes-12h-test
Test id:
136349f0-90e6-450a-a48b-61106861f0dd
Test name:
scylla-master/longevity/longevity-schema-topology-changes-12h-test
Test config file(s):
Logs and commands
$ hydra investigate show-monitor 136349f0-90e6-450a-a48b-61106861f0dd
$ hydra investigate show-logs 136349f0-90e6-450a-a48b-61106861f0dd
Logs:
Jenkins job URL
Argus