
Error applying view update during rolling restart of a cluster [data_dictionary::no_such_column_family] #16206

Open
fruch opened this issue Nov 28, 2023 · 11 comments

@fruch
Contributor

fruch commented Nov 28, 2023

Issue description

  • This issue is a regression.
  • It is unknown if this issue is a regression.

During a rolling restart of the cluster (disrupt_rolling_config_change_internode_compression), we are seeing multiple
errors like the following from node-1:

2023-11-25 16:15:14.961 <2023-11-25 16:02:16.381>: (DatabaseLogEvent Severity.ERROR) period_type=one-time event_id=6dab20e6-cc0d-4a3c-b298-7276f1b01d29: type=DATABASE_ERROR regex=(^ERROR|!\s*?ERR).*\[shard.*\] line_number=225846 node=longevity-tls-50gb-3d-master-db-node-a9348823-1
2023-11-25T16:02:16.381+00:00 longevity-tls-50gb-3d-master-db-node-a9348823-1      !ERR | scylla[17388]:  [shard 10:stmt] view - Error applying view update to 10.4.9.222 (view: keyspace1.standard1_view, base token: -8756056605010619006, view token: 986997830309495339): data_dictionary::no_such_column_family (Can't find a column family with UUID c357cfd0-8b5f-11ee-9cb8-6f24733b598e)

Impact

No impact other than confusing errors in the log.

How frequently does it reproduce?

This has been seen multiple times.

Installation details

Kernel Version: 5.15.0-1050-aws
Scylla version (or git commit hash): 5.5.0~dev-20231122.65e42e4166ce with build-id 44721ca3535e729f9dd7dee34b519bfc91b8c564

Cluster size: 6 nodes (i4i.4xlarge)

Scylla Nodes used in this run:

  • longevity-tls-50gb-3d-master-db-node-a9348823-9 (3.250.90.57 | 10.4.8.242) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-a9348823-8 (34.244.226.200 | 10.4.11.54) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-a9348823-7 (34.245.208.67 | 10.4.10.181) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-a9348823-6 (54.170.239.41 | 10.4.9.153) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-a9348823-5 (34.244.91.38 | 10.4.9.200) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-a9348823-4 (3.250.36.122 | 10.4.10.69) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-a9348823-3 (3.252.73.251 | 10.4.9.10) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-a9348823-2 (54.74.142.142 | 10.4.11.40) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-a9348823-15 (54.76.99.218 | 10.4.10.108) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-a9348823-14 (34.244.144.56 | 10.4.10.27) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-a9348823-13 (34.244.22.141 | 10.4.11.122) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-a9348823-12 (3.254.132.70 | 10.4.11.41) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-a9348823-11 (18.202.35.250 | 10.4.9.247) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-a9348823-10 (34.245.54.32 | 10.4.11.132) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-a9348823-1 (3.252.149.198 | 10.4.9.222) (shards: 14)

OS / Image: ami-0a28a97891191910e (aws: undefined_region)

Test: longevity-50gb-3days-test
Test id: a9348823-953c-4c29-8829-a8e7b0d45b81
Test name: scylla-master/longevity/longevity-50gb-3days-test
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor a9348823-953c-4c29-8829-a8e7b0d45b81
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs a9348823-953c-4c29-8829-a8e7b0d45b81

Logs:

Jenkins job URL
Argus

@mykaul
Contributor

mykaul commented Nov 28, 2023

@fruch - isn't it the same line we discussed earlier this morning to ignore (as a known issue)?

@fruch
Contributor Author

fruch commented Nov 28, 2023

@fruch - isn't it the same line we discussed earlier this morning to ignore (as a known issue)?

I can't tell which is which, and as I said, there wasn't any agreement to ignore it across the board.

Someone would need to explicitly say so; until then, it's an issue that fails tier1 jobs.

@mykaul
Contributor

mykaul commented Nov 28, 2023

@fruch - isn't it the same line we discussed earlier this morning to ignore (as a known issue)?

I can't tell which is which, and as I said, there wasn't any agreement to ignore it across the board.

Someone would need to explicitly say so; until then, it's an issue that fails tier1 jobs.

I say so. I thought I did previously, but now I do, again.

@mykaul mykaul added this to the 6.1 milestone Nov 29, 2023
@mykaul mykaul removed the triage/master Looking for assignee label Nov 29, 2023
@bhalevy
Member

bhalevy commented Dec 5, 2023

@eliransin / @nyh this seems to fall in your area.
Any chance it is a dup?

@nyh
Contributor

nyh commented Dec 5, 2023

Does this test do schema changes in parallel with the node restart?
We have #15048 about delays in schema changes causing these errors.

@fruch
Contributor Author

fruch commented Dec 5, 2023

Does this test do schema changes in parallel with the node restart? We have #15048 about delays in schema changes causing these errors.

No, this case runs the nemeses sequentially (in one thread).

This is the code that was running, changing the Scylla internode compression and restarting the nodes one by one:

    @decorate_with_context(ignore_ycsb_connection_refused)
    def disrupt_rolling_config_change_internode_compression(self):
        def get_internode_compression_new_value_randomly(current_compression):
            self.log.debug(f"Current compression is {current_compression}")
            values = ['dc', 'all', 'none']
            values_to_toggle = list(filter(lambda value: value != current_compression, values))
            return random.choice(values_to_toggle)

        if self._is_it_on_kubernetes():
            # NOTE: on K8S update of 'scylla.yaml' and 'cassandra-rackdc.properties' files is done
            #       via update of the single reused place and serial restart of Scylla pods.
            raise UnsupportedNemesis(
                "This logic will be covered by an operator functional test. Skipping.")
        with self.target_node.remote_scylla_yaml() as scylla_yaml:
            current = scylla_yaml.internode_compression
        new_value = get_internode_compression_new_value_randomly(current)
        for node in self.cluster.nodes:
            self.log.debug(f"Changing {node} inter node compression to {new_value}")
            with node.remote_scylla_yaml() as scylla_yaml:
                scylla_yaml.internode_compression = new_value
            self.log.info(f"Restarting node {node}")
            node.restart_scylla_server()

@nyh
Contributor

nyh commented Dec 5, 2023

@fruch is it possible the nemesis was started right after creating the table and the views, and perhaps some node was killed before it got the new schema?

We have seen this no_such_column_family in view updates a lot when one node sends view updates and the other node doesn't know the view exists - we saw it when creating a view (so one of the nodes didn't yet get the news that it was created) and when deleting a view (so one node continues to build view updates for a view that other nodes think was already deleted), but I don't think we saw it just in the middle of normal work. I suspected your test case is also one of these cases, but I don't know how to prove it; and if it's truly not the case (e.g., the view was already created 10 minutes before this nemesis started), I'm out of ideas what it can be.
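
For illustration, a minimal sketch of the timing in question, assuming the Python cassandra-driver; the contact point and the view definition are placeholders modeled on the cassandra-stress schema, not the test's actual schema:

    # Hypothetical illustration of the race described above: a view is created
    # and the rolling restart begins before every node has learned the new schema.
    from cassandra.cluster import Cluster

    cluster = Cluster(["10.4.9.222"], max_schema_agreement_wait=10)  # placeholder contact point
    session = cluster.connect()

    session.execute("""
        CREATE MATERIALIZED VIEW IF NOT EXISTS keyspace1.standard1_view AS
        SELECT * FROM keyspace1.standard1
        WHERE "C0" IS NOT NULL AND key IS NOT NULL
        PRIMARY KEY ("C0", key)
    """)

    # If a node is restarted at this point, before schema agreement is reached,
    # writes to keyspace1.standard1 can generate view updates for a view UUID the
    # restarted replica does not know yet -> data_dictionary::no_such_column_family.
    agreed = cluster.control_connection.wait_for_schema_agreement()
    print("schema agreement before starting the nemesis:", agreed)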

@fruch
Contributor Author

fruch commented Dec 5, 2023

@fruch is it possible the nemesis was started right after creating the table and the views, and perhaps some node was killed before it got the new schema?

We have seen this no_such_column_family in view updates a lot when one node sends view updates and the other node doesn't know the view exists - we saw it when creating a view (so one of the nodes didn't yet get the news that it was created) and when deleting a view (so one node continues to build view updates for a view that other nodes think was already deleted), but I don't think we saw it just in the middle of normal work. I suspected your test case is also one of these cases, but I don't know how to prove it; and if it's truly not the case (e.g., the view was already created 10 minutes before this nemesis started), I'm out of ideas what it can be.

We check the status of all nodes between nemeses; all of them were up before this nemesis.

@nyh
Contributor

nyh commented Dec 5, 2023

We check the status of all nodes between nemeses; all of them were up before this nemesis.

But that was not my question - my question is when the view was created and how sure we are that all the nodes already knew of this view's existence. Is it possible it was created just a fraction of a second before the nemesis started?

@fruch
Contributor Author

fruch commented Dec 5, 2023

We check the status of all nodes between nemeses; all of them were up before this nemesis.

But that was not my question - my question is when the view was created and how sure we are that all the nodes already knew of this view's existence. Is it possible it was created just a fraction of a second before the nemesis started?

No, as I said, between each nemesis we check node status, and that takes more than a fraction of a second.

After taking a closer look, it seems I was missing the time difference at the beginning of the event.

The error event was created 13 minutes after the timestamp in the log line:
2023-11-25 16:15:14.961 <2023-11-25 16:02:16.381>

The filter that was applied during the create_remove_mv nemesis had long since expired: the log reading slowed down, and the error popped up during the next nemesis.
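
For illustration, a minimal, hypothetical sketch of why a time-bounded log filter misses lines that are read late; the class and names below are made up for this example and are not the actual SCT implementation:

    # Hypothetical sketch: a filter that suppresses matching errors only while it
    # is active.  If log reading lags (here, by ~13 minutes), the line is processed
    # after the filter expired and surfaces as an ERROR during the next nemesis.
    import re
    from datetime import datetime, timedelta

    class ExpiringLogFilter:
        """Suppress matching log lines, but only until `expires_at`."""

        def __init__(self, pattern: str, duration: timedelta):
            self.regex = re.compile(pattern)
            self.expires_at = datetime.utcnow() + duration

        def should_ignore(self, line: str, processed_at: datetime) -> bool:
            # The decision uses the time the line is *processed*, not the time it
            # was written, so a backlog in log reading defeats the filter.
            return processed_at < self.expires_at and bool(self.regex.search(line))

    # The create_remove_mv nemesis installs a filter for a few minutes...
    view_error_filter = ExpiringLogFilter(
        r"Error applying view update .* no_such_column_family",
        duration=timedelta(minutes=5),
    )

    # ...but a line written at 16:02:16 is only read around 16:15:14, long after
    # the filter expired, so it is reported and fails the next nemesis.
    late_line = "!ERR | scylla: ... Error applying view update ... no_such_column_family ..."
    print(view_error_filter.should_ignore(late_line, processed_at=datetime.utcnow() + timedelta(minutes=13)))
    # -> False: the error escapes the filter.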

fruch added a commit to fruch/scylla-cluster-tests that referenced this issue Dec 19, 2023
We had multiple places where we tried to apply filtering/demoting of view update errors, and they keep popping up in all kinds of cases:

* cases of parallel nemeses
* cases where our log reading slows down and the errors pop up out of context, since the filter is gone

Because of those issues, and the fact that they aren't going to be fixed any time soon, we'll apply this filter globally until all of the view update issues are addressed.

Ref: scylladb/scylladb#16206
Ref: scylladb/scylladb#16259
Ref: scylladb/scylladb#15598
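
For illustration only (this is not the actual change in scylla-cluster-tests#6965): a hypothetical sketch of what a global, non-expiring demotion rule looks like, in contrast to the per-nemesis filter; the names are made up:

    # Hypothetical sketch of a global ignore rule: unlike a per-nemesis filter it
    # has no expiry, so known view-update errors are demoted no matter how late
    # the log line is read.  Illustrative only, not the SCT implementation.
    import re

    GLOBAL_VIEW_UPDATE_PATTERNS = [
        # The error from this issue; further view-update patterns could be added.
        re.compile(r"Error applying view update .* no_such_column_family"),
    ]

    def classify_severity(line: str, default: str = "ERROR") -> str:
        """Demote known view-update errors to WARNING instead of failing the run."""
        if any(p.search(line) for p in GLOBAL_VIEW_UPDATE_PATTERNS):
            return "WARNING"
        return default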
@fruch
Contributor Author

fruch commented Dec 19, 2023

Because of the issues with identifying the actual places where those errors would be expected, we are ignoring all of the view update errors:
scylladb/scylla-cluster-tests#6965

fruch added a commit to scylladb/scylla-cluster-tests that referenced this issue Dec 19, 2023