
Error applying view update during rolling restart of a cluster [data_dictionary::no_such_column_family] #16206

Open
fruch opened this issue Nov 28, 2023 · 11 comments

@fruch
Contributor

fruch commented Nov 28, 2023

Issue description

  • This issue is a regression.
  • It is unknown if this issue is a regression.

During a rolling restart of the cluster (disrupt_rolling_config_change_internode_compression), we are seeing multiple
errors like the following from node-1:

2023-11-25 16:15:14.961 <2023-11-25 16:02:16.381>: (DatabaseLogEvent Severity.ERROR) period_type=one-time event_id=6dab20e6-cc0d-4a3c-b298-7276f1b01d29: type=DATABASE_ERROR regex=(^ERROR|!\s*?ERR).*\[shard.*\] line_number=225846 node=longevity-tls-50gb-3d-master-db-node-a9348823-1
2023-11-25T16:02:16.381+00:00 longevity-tls-50gb-3d-master-db-node-a9348823-1      !ERR | scylla[17388]:  [shard 10:stmt] view - Error applying view update to 10.4.9.222 (view: keyspace1.standard1_view, base token: -8756056605010619006, view token: 986997830309495339): data_dictionary::no_such_column_family (Can't find a column family with UUID c357cfd0-8b5f-11ee-9cb8-6f24733b598e)

Impact

No impact other than confusing errors in the log.

How frequently does it reproduce?

This has been seen multiple times.

Installation details

Kernel Version: 5.15.0-1050-aws
Scylla version (or git commit hash): 5.5.0~dev-20231122.65e42e4166ce with build-id 44721ca3535e729f9dd7dee34b519bfc91b8c564

Cluster size: 6 nodes (i4i.4xlarge)

Scylla Nodes used in this run:

  • longevity-tls-50gb-3d-master-db-node-a9348823-9 (3.250.90.57 | 10.4.8.242) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-a9348823-8 (34.244.226.200 | 10.4.11.54) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-a9348823-7 (34.245.208.67 | 10.4.10.181) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-a9348823-6 (54.170.239.41 | 10.4.9.153) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-a9348823-5 (34.244.91.38 | 10.4.9.200) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-a9348823-4 (3.250.36.122 | 10.4.10.69) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-a9348823-3 (3.252.73.251 | 10.4.9.10) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-a9348823-2 (54.74.142.142 | 10.4.11.40) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-a9348823-15 (54.76.99.218 | 10.4.10.108) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-a9348823-14 (34.244.144.56 | 10.4.10.27) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-a9348823-13 (34.244.22.141 | 10.4.11.122) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-a9348823-12 (3.254.132.70 | 10.4.11.41) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-a9348823-11 (18.202.35.250 | 10.4.9.247) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-a9348823-10 (34.245.54.32 | 10.4.11.132) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-a9348823-1 (3.252.149.198 | 10.4.9.222) (shards: 14)

OS / Image: ami-0a28a97891191910e (aws: undefined_region)

Test: longevity-50gb-3days-test
Test id: a9348823-953c-4c29-8829-a8e7b0d45b81
Test name: scylla-master/longevity/longevity-50gb-3days-test
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor a9348823-953c-4c29-8829-a8e7b0d45b81
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs a9348823-953c-4c29-8829-a8e7b0d45b81

Logs:

Jenkins job URL
Argus

@mykaul
Contributor

mykaul commented Nov 28, 2023

@fruch - isn't it the same line we discussed earlier this morning to ignore (as a known issue)?

@fruch
Contributor Author

fruch commented Nov 28, 2023

@fruch - isn't it the same line we discussed earlier this morning to ignore (as a known issue)?

I can't tell which is which, and as I said, there wasn't any agreement to ignore it across the board.

Someone would need to explicitly say so; until then, it's an issue that fails tier1 jobs.

@mykaul
Contributor

mykaul commented Nov 28, 2023

@fruch - isn't it the same line we discussed earlier this morning to ignore (as a known issue)?

I can't tell which is which, and as I said, there wasn't any agreement to ignore it across the board.

Someone would need to explicitly say so; until then, it's an issue that fails tier1 jobs.

I say so. I thought I did previously, but now I do, again.

@mykaul mykaul added this to the 6.1 milestone Nov 29, 2023
@mykaul mykaul removed the triage/master Looking for assignee label Nov 29, 2023
@bhalevy
Member

bhalevy commented Dec 5, 2023

@eliransin / @nyh this seems to fall in your area.
Any chance it is a dup?

@nyh
Contributor

nyh commented Dec 5, 2023

Does this test do schema changes in parallel with the node restart?
We have #15048 about delays in schema changes causing these errors.

@fruch
Contributor Author

fruch commented Dec 5, 2023

Does this test do schema changes in parallel with the node restart? We have #15048 about delays in schema changes causing these errors.

No, this case runs the nemeses sequentially (in one thread).

This is the code that was running, changing the Scylla internode compression and restarting the nodes one by one:

    @decorate_with_context(ignore_ycsb_connection_refused)
    def disrupt_rolling_config_change_internode_compression(self):
        def get_internode_compression_new_value_randomly(current_compression):
            self.log.debug(f"Current compression is {current_compression}")
            values = ['dc', 'all', 'none']
            values_to_toggle = list(filter(lambda value: value != current_compression, values))
            return random.choice(values_to_toggle)

        if self._is_it_on_kubernetes():
            # NOTE: on K8S update of 'scylla.yaml' and 'cassandra-rackdc.properties' files is done
            #       via update of the single reused place and serial restart of Scylla pods.
            raise UnsupportedNemesis(
                "This logic will be covered by an operator functional test. Skipping.")
        with self.target_node.remote_scylla_yaml() as scylla_yaml:
            current = scylla_yaml.internode_compression
        new_value = get_internode_compression_new_value_randomly(current)
        for node in self.cluster.nodes:
            self.log.debug(f"Changing {node} inter node compression to {new_value}")
            with node.remote_scylla_yaml() as scylla_yaml:
                scylla_yaml.internode_compression = new_value
            self.log.info(f"Restarting node {node}")
            node.restart_scylla_server()

@nyh
Contributor

nyh commented Dec 5, 2023

@fruch is it possible the nemesis was started right after creating the table and the views, and perhaps some node was killed before it got the new schema?

We have seen this no_such_column_family in view updates a lot when one node sends view updates and the other node doesn't know the view exists - we saw it when creating a view (so one of the nodes didn't yet get the news that it was created) and when deleting a view (so one node continues to build view updates for a view that other nodes think was already deleted), but I don't think we saw it just in the middle of normal work. I suspected your test case is also one of these cases, but I don't know how to prove it; and if it's truly not the case (e.g., the view was already created 10 minutes before this nemesis started), I'm out of ideas what it can be.
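
For illustration, a minimal sketch of the timing in question, assuming the Python cassandra-driver; the contact point and the view definition are placeholders modeled on the cassandra-stress schema, not the test's actual schema:

    # Hypothetical illustration of the race described above: a view is created
    # and the rolling restart begins before every node has learned the new schema.
    from cassandra.cluster import Cluster

    cluster = Cluster(["10.4.9.222"], max_schema_agreement_wait=10)  # placeholder contact point
    session = cluster.connect()

    session.execute("""
        CREATE MATERIALIZED VIEW IF NOT EXISTS keyspace1.standard1_view AS
        SELECT * FROM keyspace1.standard1
        WHERE "C0" IS NOT NULL AND key IS NOT NULL
        PRIMARY KEY ("C0", key)
    """)

    # If a node is restarted at this point, before schema agreement is reached,
    # writes to keyspace1.standard1 can generate view updates for a view UUID the
    # restarted replica does not know yet -> data_dictionary::no_such_column_family.
    agreed = cluster.control_connection.wait_for_schema_agreement()
    print("schema agreement before starting the nemesis:", agreed)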

@fruch
Contributor Author

fruch commented Dec 5, 2023

@fruch is it possible the nemesis was started right after creating the table and the views, and perhaps some node was killed before it got the new schema?

We have seen this no_such_column_family in view updates a lot when one node sends view updates and the other node doesn't know the view exists - we saw it when creating a view (so one of the nodes didn't yet get the news that it was created) and when deleting a view (so one node continues to build view updates for a view that other nodes think was already deleted), but I don't think we saw it just in the middle of normal work. I suspected your test case is also one of these cases, but I don't know how to prove it; and if it's truly not the case (e.g., the view was already created 10 minutes before this nemesis started), I'm out of ideas what it can be.

We check the status of all nodes between nemeses; all of them were up before this nemesis.

@nyh
Contributor

nyh commented Dec 5, 2023

We check the status of all nodes between nemeses; all of them were up before this nemesis.

But that was not my question - my question is when the view was created and how sure we are that all the nodes already knew of this view's existence. Is it possible it was created just a fraction of a second before the nemesis started?

@fruch
Contributor Author

fruch commented Dec 5, 2023

We check the status of all nodes between nemeses; all of them were up before this nemesis.

But that was not my question - my question is when the view was created and how sure we are that all the nodes already knew of this view's existence. Is it possible it was created just a fraction of a second before the nemesis started?

No, as I said, between each nemesis we check node status, and that takes more than a fraction of a second.

After taking a closer look, it seems I was missing the time difference at the beginning of the event.

The error event was created 13 minutes after the timestamp in the log line:
2023-11-25 16:15:14.961 <2023-11-25 16:02:16.381>

The filter that was applied during the create_remove_mv nemesis had long since expired: the log reading slowed down, and the error popped up during the next nemesis.
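
For illustration, a minimal, hypothetical sketch of why a time-bounded log filter misses lines that are read late; the class and names below are made up for this example and are not the actual SCT implementation:

    # Hypothetical sketch: a filter that suppresses matching errors only while it
    # is active.  If log reading lags (here, by ~13 minutes), the line is processed
    # after the filter expired and surfaces as an ERROR during the next nemesis.
    import re
    from datetime import datetime, timedelta

    class ExpiringLogFilter:
        """Suppress matching log lines, but only until `expires_at`."""

        def __init__(self, pattern: str, duration: timedelta):
            self.regex = re.compile(pattern)
            self.expires_at = datetime.utcnow() + duration

        def should_ignore(self, line: str, processed_at: datetime) -> bool:
            # The decision uses the time the line is *processed*, not the time it
            # was written, so a backlog in log reading defeats the filter.
            return processed_at < self.expires_at and bool(self.regex.search(line))

    # The create_remove_mv nemesis installs a filter for a few minutes...
    view_error_filter = ExpiringLogFilter(
        r"Error applying view update .* no_such_column_family",
        duration=timedelta(minutes=5),
    )

    # ...but a line written at 16:02:16 is only read around 16:15:14, long after
    # the filter expired, so it is reported and fails the next nemesis.
    late_line = "!ERR | scylla: ... Error applying view update ... no_such_column_family ..."
    print(view_error_filter.should_ignore(late_line, processed_at=datetime.utcnow() + timedelta(minutes=13)))
    # -> False: the error escapes the filter.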

fruch added a commit to fruch/scylla-cluster-tests that referenced this issue Dec 19, 2023
We had multiple places where we tried to apply filtering/demoting of view update errors, and they keep popping up in all kinds of cases:

* cases of parallel nemeses
* cases where our log reading slows down and the errors pop up out of context, since the filter is gone

Because of those issues, and the fact that they aren't going to be fixed any time soon, we'll apply this filter globally until all of the view update issues are addressed.

Ref: scylladb/scylladb#16206
Ref: scylladb/scylladb#16259
Ref: scylladb/scylladb#15598
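
For illustration only (this is not the actual change in scylla-cluster-tests#6965): a hypothetical sketch of what a global, non-expiring demotion rule looks like, in contrast to the per-nemesis filter; the names are made up:

    # Hypothetical sketch of a global ignore rule: unlike a per-nemesis filter it
    # has no expiry, so known view-update errors are demoted no matter how late
    # the log line is read.  Illustrative only, not the SCT implementation.
    import re

    GLOBAL_VIEW_UPDATE_PATTERNS = [
        # The error from this issue; further view-update patterns could be added.
        re.compile(r"Error applying view update .* no_such_column_family"),
    ]

    def classify_severity(line: str, default: str = "ERROR") -> str:
        """Demote known view-update errors to WARNING instead of failing the run."""
        if any(p.search(line) for p in GLOBAL_VIEW_UPDATE_PATTERNS):
            return "WARNING"
        return default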
@fruch
Contributor Author

fruch commented Dec 19, 2023

Because of the issues with identifying the actual places where those errors would be expected, we are ignoring all of the view update errors:
scylladb/scylla-cluster-tests#6965

fruch added a commit to scylladb/scylla-cluster-tests that referenced this issue Dec 19, 2023