Wrong schema version in system.peers when using raft #15078

Closed
sylwiaszunejko opened this issue Aug 17, 2023 · 12 comments

@sylwiaszunejko
Contributor

sylwiaszunejko commented Aug 17, 2023

I started a three-node cluster with --experimental-features consistent-topology-changes --experimental-features tablets, using the master version of Scylla (34c3688). Without any prior steps, I connected to the empty cluster using cqlsh 127.0.0.2 and executed the following commands:

SELECT schema_version FROM system.local;
SELECT peer, host_id, schema_version FROM system.peers;

The output looks like this:

 schema_version
--------------------------------------
 c1f44a7f-c4ab-3022-b607-66f0b8bdd582

 peer      | host_id                              | schema_version
-----------+--------------------------------------+--------------------------------------
 127.0.0.3 | 5a4581a5-389b-4075-88d9-2dd533804f9c | 59adb24e-f3cd-3e02-97f0-5b395827453f
 127.0.0.4 | e5fa0b2c-de94-45f9-b9aa-1387b56be28b | 59adb24e-f3cd-3e02-97f0-5b395827453f

I then connected to 127.0.0.3 and 127.0.0.4 in turn, ran the same commands, and got the following outputs:

 schema_version
--------------------------------------
 c1f44a7f-c4ab-3022-b607-66f0b8bdd582

 peer      | host_id                              | schema_version
-----------+--------------------------------------+--------------------------------------
 127.0.0.2 | 749d1300-474e-46cb-8dfc-870ff058a20a | c1f44a7f-c4ab-3022-b607-66f0b8bdd582
 127.0.0.4 | e5fa0b2c-de94-45f9-b9aa-1387b56be28b | 59adb24e-f3cd-3e02-97f0-5b395827453f

 schema_version
--------------------------------------
 c1f44a7f-c4ab-3022-b607-66f0b8bdd582

 peer      | host_id                              | schema_version
-----------+--------------------------------------+--------------------------------------
 127.0.0.3 | 5a4581a5-389b-4075-88d9-2dd533804f9c | 59adb24e-f3cd-3e02-97f0-5b395827453f
 127.0.0.2 | 749d1300-474e-46cb-8dfc-870ff058a20a | c1f44a7f-c4ab-3022-b607-66f0b8bdd582

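The schema versions in system.peers do not match the schema versions in system.local: every node reports c1f44a7f-c4ab-3022-b607-66f0b8bdd582 locally, yet 127.0.0.3 and 127.0.0.4 appear in their peers' tables with 59adb24e-f3cd-3e02-97f0-5b395827453f.
This causes problems, for example when running any command using gocql.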
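To make the mismatch easier to see, here is a minimal sketch that cross-checks the values pasted above. It does not connect to a cluster; the dictionaries simply restate the query outputs, and the names LOCAL, PEERS and find_mismatches are made up for this illustration.

```python
from typing import Dict, List, Tuple

V_LOCAL = "c1f44a7f-c4ab-3022-b607-66f0b8bdd582"
V_STALE = "59adb24e-f3cd-3e02-97f0-5b395827453f"

# schema_version each node reports for itself in system.local.
LOCAL: Dict[str, str] = {
    "127.0.0.2": V_LOCAL,
    "127.0.0.3": V_LOCAL,
    "127.0.0.4": V_LOCAL,
}

# schema_version each node reports for its peers in system.peers,
# keyed by the node that was queried.
PEERS: Dict[str, Dict[str, str]] = {
    "127.0.0.2": {"127.0.0.3": V_STALE, "127.0.0.4": V_STALE},
    "127.0.0.3": {"127.0.0.2": V_LOCAL, "127.0.0.4": V_STALE},
    "127.0.0.4": {"127.0.0.3": V_STALE, "127.0.0.2": V_LOCAL},
}


def find_mismatches(
    local: Dict[str, str], peers: Dict[str, Dict[str, str]]
) -> List[Tuple[str, str, str, str]]:
    """Return (observer, peer, seen, actual) for every system.peers row
    whose schema_version differs from that peer's own system.local."""
    mismatches = []
    for observer in sorted(peers):
        for peer in sorted(peers[observer]):
            seen = peers[observer][peer]
            actual = local[peer]
            if seen != actual:
                mismatches.append((observer, peer, seen, actual))
    return mismatches


if __name__ == "__main__":
    for observer, peer, seen, actual in find_mismatches(LOCAL, PEERS):
        print(f"{observer} sees {peer} at {seen}, but {peer} reports {actual}")
```

With the data above, the only rows that disagree are those describing 127.0.0.3 and 127.0.0.4; the rows for 127.0.0.2 are correct on both other nodes.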

I have noticed that performing an operation that changes the schema (creating a keyspace or table) fixes system.peers.

I tried to reproduce this on a cluster whose nodes run without --experimental-features consistent-topology-changes, and the bug does not occur.

I have also tested it with scylladb/scylla-nightly and the last properly working version is 5.4.0-dev-0.20230802.0239ba45272f-x86_64 (5.4.0-dev-0.20230803.39ca07c49b25-x86_64 has this bug).

I performed a rough bisect to find the commit that introduced the problem: it is 7c30954, and the last commit without the bug is 3c1ca12.

I attach logs from all three nodes.

node3_logs.txt
node2_logs.txt
node1_logs.txt

@sylwiaszunejko sylwiaszunejko changed the title Wrong schema version in system.peers Wrong schema version in system.peers when using raft Aug 17, 2023
@avelanarius
Member

avelanarius commented Aug 17, 2023

cc @kbr-scylla @kostja

@kbr-scylla
Contributor

cc @gleb-cloudius

@gleb-cloudius
Contributor

Strange. Gossiper has different schema versions as well.

@kostja
Contributor

kostja commented Aug 27, 2023

@kbr-scylla how do we test for this to stay in sync in the future? @gleb-cloudius can we make sure we don't update system.local through the observer on the raft command apply path?

@gleb-cloudius
Contributor

@gleb-cloudius can we make sure we don't update system.local through the observer on the raft command apply path?

We do not update system.local through the observer. We update gossiper state through it.

@kostja
Contributor

kostja commented Aug 27, 2023

But how does the gossiper manage to get the wrong state in the first place? The node is just starting, after all; it has got to be taking it from the system table.

@gleb-cloudius
Contributor

The node was starting. It was receiving a schema update through raft, but not updating its local version in the gossiper.

@kbr-scylla
Contributor

@kbr-scylla how do we test for this to stay in sync in the future?

I don't know. We could compare versions from system.peers in our system tests -- similarly to how we check group 0 and token ring consistency.

But with the bug present, the tests would be flaky: sometimes they would pass because the problem didn't always reproduce (?). But maybe a flaky test is better than no test...

@gleb-cloudius
Contributor

This bug was 100% reproducible. But since the peers table is updated as a node learns about other nodes' versions over the network, a test may see different versions for some time.
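Because a test may see different versions for a while, a check of system.peers would probably need to poll rather than assert once. A rough sketch of that idea follows; it is not taken from the Scylla test suite. wait_for_schema_agreement and fetch_peer_versions are hypothetical names, and fetch_peer_versions stands in for whatever the test uses to run SELECT peer, schema_version FROM system.peers on a node.

```python
import time
from typing import Callable, Dict


def wait_for_schema_agreement(
    fetch_peer_versions: Callable[[], Dict[str, str]],
    expected: str,
    timeout: float = 30.0,
    interval: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until every peer's schema_version equals `expected`.

    fetch_peer_versions is a hypothetical callback returning
    {peer_address: schema_version} as read from system.peers.
    Returns True on agreement, False if the timeout expires first.
    An empty mapping (no peers) counts as agreement.
    """
    deadline = time.monotonic() + timeout
    while True:
        versions = fetch_peer_versions()
        if all(v == expected for v in versions.values()):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

A test would pass the node's own system.local schema_version as expected. With the bug present the call should return False after the timeout instead of failing intermittently, since the stale entries only get corrected by a later schema change.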

raphaelsc pushed a commit to raphaelsc/scylla that referenced this issue Aug 29, 2023
…p0 and starting gossiper

The schema version is updated by group0, so if group0 starts before
schema version observer is registered some updates may be missed. Since
the observer is used to update node's gossiper state the gossiper may
contain wrong schema version.

Fix by registering the observer before starting group0 and even before
starting gossiper to avoid a theoretical case that something may pull
schema after start of gossiping and before the observer is registered.

Fixes: scylladb#15078

Message-Id: <ZOYZWhEh6Zyb+FaN@scylladb.com>
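
The ordering problem described in this commit message can be illustrated with a toy model. This is not Scylla code: ObservableValue, start_node and the version strings are invented for the example, and the only point is that an observer registered after the first update never sees that update.

```python
from typing import Callable, Dict, List


class ObservableValue:
    """Toy value that notifies observers on change (no replay of past updates)."""

    def __init__(self, value: str) -> None:
        self._value = value
        self._observers: List[Callable[[str], None]] = []

    def observe(self, callback: Callable[[str], None]) -> None:
        self._observers.append(callback)

    def set(self, value: str) -> None:
        self._value = value
        for callback in self._observers:
            callback(value)


def start_node(register_observer_first: bool) -> Dict[str, str]:
    """Simulate startup and return the resulting (toy) gossiper state."""
    schema_version = ObservableValue("initial")
    gossiper_state = {"schema_version": "initial"}

    def publish_to_gossiper(version: str) -> None:
        gossiper_state["schema_version"] = version

    def start_group0() -> None:
        # Stand-in for group0 applying a schema update during startup.
        schema_version.set("from-group0")

    if register_observer_first:
        schema_version.observe(publish_to_gossiper)
        start_group0()
    else:
        start_group0()
        schema_version.observe(publish_to_gossiper)
    return gossiper_state


if __name__ == "__main__":
    print("observer after group0: ", start_node(False))
    print("observer before group0:", start_node(True))
```

In the late-registration case the toy gossiper keeps advertising the initial version until some later schema change fires the observer, which matches the observation in the report that creating a keyspace or table fixes system.peers.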
@DoronArazii DoronArazii added this to the 5.4 milestone Aug 29, 2023
@avikivity
Member

@gleb-cloudius please evaluate for backport

@gleb-cloudius
Contributor

All versions with non-experimental schema over raft should get it.

kbr-scylla pushed a commit that referenced this issue Dec 20, 2023
…p0 and starting gossiper

The schema version is updated by group0, so if group0 starts before
schema version observer is registered some updates may be missed. Since
the observer is used to update node's gossiper state the gossiper may
contain wrong schema version.

Fix by registering the observer before starting group0 and even before
starting gossiper to avoid a theoretical case that something may pull
schema after start of gossiping and before the observer is registered.

Fixes: #15078

Message-Id: <ZOYZWhEh6Zyb+FaN@scylladb.com>
(cherry picked from commit d1654cc)
@kbr-scylla
Contributor

Backported to 5.2
