Error applying view update during rolling restart of a cluster [data_dictionary::no_such_column_family] #16206
Comments
@fruch - isn't this the same line we discussed earlier this morning, the one to ignore as a known issue?
I can't tell which is which, and as I said, there was no agreement to ignore it across the board. Someone would need to explicitly say so; until then it's an issue that fails tier1 jobs.
I say so. I thought I did previously, but now I do, again.
@eliransin / @nyh this seems to fall in your area.
Does this test do schema changes in parallel with the node restart?
No, this case runs the nemesis sequentially (in one thread). This is the code that was running: it changes the Scylla internode compression and restarts the nodes one by one:

```python
@decorate_with_context(ignore_ycsb_connection_refused)
def disrupt_rolling_config_change_internode_compression(self):
    def get_internode_compression_new_value_randomly(current_compression):
        self.log.debug(f"Current compression is {current_compression}")
        values = ['dc', 'all', 'none']
        values_to_toggle = list(filter(lambda value: value != current_compression, values))
        return random.choice(values_to_toggle)

    if self._is_it_on_kubernetes():
        # NOTE: on K8S update of 'scylla.yaml' and 'cassandra-rackdc.properties' files is done
        # via update of the single reused place and serial restart of Scylla pods.
        raise UnsupportedNemesis(
            "This logic will be covered by an operator functional test. Skipping.")

    with self.target_node.remote_scylla_yaml() as scylla_yaml:
        current = scylla_yaml.internode_compression
    new_value = get_internode_compression_new_value_randomly(current)
    for node in self.cluster.nodes:
        self.log.debug(f"Changing {node} inter node compression to {new_value}")
        with node.remote_scylla_yaml() as scylla_yaml:
            scylla_yaml.internode_compression = new_value
        self.log.info(f"Restarting node {node}")
        node.restart_scylla_server()
```
@fruch is it possible the nemesis was started right after creating the table and the views, and perhaps some node was killed before it got the new schema? We have seen this no_such_column_family in view updates a lot when one node sends view updates and the other node doesn't know the view exists - we saw it when creating a view (so one of the nodes didn't yet get the news that it was created) and when deleting a view (so one node continues to build view updates for a view that other nodes think was already deleted), but I don't think we saw it just in the middle of normal work. I suspected your test case is also one of these cases, but I don't know how to prove it, and if it's truly not the case (e.g., the view was already created 10 minutes before this nemesis started), I'm out of ideas what it can be.
We check the status of all nodes between nemeses; all of them were up before this nemesis.
But that was not my question - my question is when the view was created, and how sure we are that all the nodes already know of this view's existence. Is it possible it was created just a fraction of a second before the nemesis started?
No, as I said, between each nemesis we check node status, and that takes more than a fraction of a second. After taking a closer look, it seems I was missing a time difference at the beginning of the event: the error event is created 13 min after the timestamp in the print, which puts it outside the filter that was applied while the nemesis was running.
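For context, here is a minimal sketch (assuming the Python cassandra-driver; this is not the actual SCT code, and the contact point, keyspace, table, and view names are illustrative) of the kind of check that rules out the "view created a fraction of a second before the nemesis" race: create the view, then block until every live node reports the same schema version before any node is restarted.

```python
# Minimal sketch, assuming the Python cassandra-driver; not the actual SCT code.
# "node-1", "ks", "base", and "mv_by_v" are illustrative names.
from cassandra.cluster import Cluster

cluster = Cluster(["node-1"])
session = cluster.connect()

# Create the materialized view whose updates later fail with
# no_such_column_family if some replica has not yet learned about it.
session.execute("""
    CREATE MATERIALIZED VIEW IF NOT EXISTS ks.mv_by_v AS
    SELECT * FROM ks.base
    WHERE v IS NOT NULL AND pk IS NOT NULL
    PRIMARY KEY (v, pk)
""")

# Block until all live nodes report the same schema version, so a nemesis
# started right after this point cannot catch a node that never saw the view.
if not cluster.control_connection.wait_for_schema_agreement(wait_time=60):
    raise RuntimeError("schema did not propagate to all nodes in time")
```

As far as I know, the driver already waits for schema agreement after DDL statements by default (the `max_schema_agreement_wait` cluster setting), so an explicit check like this mainly matters when the view is created from a different session than the one driving the nemesis.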
We had multiple places where we tried to apply filtering/demoting of view update errors, and they keep popping up in all kinds of cases:
* cases of parallel nemesis
* cases where our log reading slows down, so the errors pop up out of context, since the filter is already gone

Because of those issues, and the fact that these aren't going to be fixed any time soon, we'll apply this filter globally until all of the view update issues are addressed.

Ref: scylladb/scylladb#16206
Ref: scylladb/scylladb#16259
Ref: scylladb/scylladb#15598
Because of the issues with identifying the actual places where those errors would be expected.
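For illustration, a generic sketch of what such a global severity-demotion filter amounts to (this is not SCT's actual event-filtering API; the regex and severity labels are assumptions based on the error in this issue's title):

```python
# Generic sketch of a global severity-demotion filter; not SCT's actual API.
import re

# Pattern assumed from the error in this issue's title.
VIEW_UPDATE_ERROR = re.compile(
    r"Error applying view update.*data_dictionary::no_such_column_family")

def classify_db_log_line(line: str) -> str:
    """Return the severity a database log line should be reported with."""
    if VIEW_UPDATE_ERROR.search(line):
        # Demoted globally until the underlying view-update issues are fixed.
        return "WARNING"
    if "ERROR" in line:
        return "ERROR"
    return "INFO"
```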
Issue description
During a rolling restart of the cluster (disrupt_rolling_config_change_internode_compression), we are seeing multiple errors like the one in the title coming from node-1.
Impact
No impact other than confusing errors in the log.
How frequently does it reproduce?
This has been seen multiple times.
Installation details
Kernel Version: 5.15.0-1050-aws
Scylla version (or git commit hash): 5.5.0~dev-20231122.65e42e4166ce with build-id 44721ca3535e729f9dd7dee34b519bfc91b8c564
Cluster size: 6 nodes (i4i.4xlarge)
Scylla Nodes used in this run:
OS / Image: ami-0a28a97891191910e (aws: undefined_region)
Test: longevity-50gb-3days-test
Test id: a9348823-953c-4c29-8829-a8e7b0d45b81
Test name: scylla-master/longevity/longevity-50gb-3days-test
Test config file(s):
Logs and commands
$ hydra investigate show-monitor a9348823-953c-4c29-8829-a8e7b0d45b81
$ hydra investigate show-logs a9348823-953c-4c29-8829-a8e7b0d45b81
Logs:
Jenkins job URL
Argus