test_add_udt_to_another_data_types: Invalid tuple literal for entity_info: component 5 is not of type frozen<user_refs> #15530

Closed
bhalevy opened this issue Sep 23, 2023 · 60 comments
Labels: symptom/ci stability (Issues that failed in ScyllaDB CI - tests and framework)

@bhalevy (Member) commented Sep 23, 2023

Although the symptom looks identical to #5336, I'm opening a new issue since it seems we have a recent regression.

@kbr-scylla wrote:

I've just got a very similar failure on #15331:

https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/3711/testReport/junit/user_types_test/TestUserTypes/Tests___Sanity_Tests___test_add_udt_to_another_data_types/

cassandra.InvalidRequest: Error from server: code=2200 [Invalid query] message="Invalid tuple literal for entity_

@Deexie hit this in #15506

https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/3843/

I just saw this in #15480 (comment)

https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/3886/testReport/junit/user_types_test/TestUserTypes/Tests___Sanity_Tests___test_add_udt_to_another_data_types/
E cassandra.InvalidRequest: Error from server: code=2200 [Invalid query] message="Invalid tuple literal for entity_info: component 5 is not of type frozen<user_refs>"

@bhalevy added the triage/master (Looking for assignee) and symptom/ci stability (Issues that failed in ScyllaDB CI - tests and framework) labels Sep 23, 2023
@bhalevy (Member, Author) commented Sep 23, 2023

Let's first determine whether the issue is with the product or with the test.

@bhalevy (Member, Author) commented Sep 23, 2023

@nyh was there a recent change in this area on Scylla's side?

@nyh (Contributor) commented Sep 24, 2023

@avikivity and @cvybhu have been changing related areas of the CQL expression code. But if this failure isn't consistent (i.e., it's flaky; is it?), perhaps a more likely possibility is that it's a schema change race: the request is issued immediately after an ALTER TYPE and maybe it isn't aware of the change. It's not impossible that we had a regression somewhere; we could bisect. I can't think of any specific change that I suspect.

@bhalevy (Member, Author) commented Sep 24, 2023

@tgrabiec please confirm schema change sync issue

@nyh (Contributor) commented Sep 24, 2023

@tgrabiec please confirm schema change sync issue

@bhalevy is this test failure flaky, or did it recently start to fail consistently?

@bhalevy (Member, Author) commented Sep 24, 2023

@tgrabiec please confirm schema change sync issue

@bhalevy is this test failure flaky, or did it recently start to fail consistently?

It was flaky in the past and that issue was closed with no explicit fix (see #5336 (comment)), but it recently became flaky again, so I opened this issue rather than reopening #5336.

@cvybhu (Contributor) commented Sep 25, 2023

@cvybhu have been changing related area of the CQL expression code.

To me it looks like some kind of schema version race.

We have a 3-node cluster and the following schema (schema v1):

CREATE TABLE IF NOT EXISTS entity
(
    entity_id text,
    other_entity_id text,
    entity_type text,
    type text,
    service text,
    entity_info frozen < obs_entity >,
    import_timestamp timestamp,
    import_timestamp_day timestamp,
    primary key(entity_id)
);

CREATE TYPE IF NOT EXISTS user_refs
(
    id text,
    alt_name text,
    firstname text,
    lastname text,
    email text
);

CREATE TYPE IF NOT EXISTS obs_entity
(
    entity_id text,
    version int,
    entity_type text,
    type text,
    service text,
    user_refs frozen<user_refs>
);

We run the following query (query v1):

INSERT INTO ENTITY
(
    entity_id,
    entity_info,
    entity_type,
    import_timestamp,
    import_timestamp_day,
    other_entity_id,
    service,
    type
)
VALUES
(
    'text{i}',
    (
        'entity_id{i}',
        {i},
        'entity_type{i}',
        'type{i}',
        'service{i}', 
        (
            'id{i}',
            'alt_name{i}',
            'firstname{i}',
            'lastname{i}',
            'email{i}'
        )
    ), 
    'entity_type{i}',
    1234568979,
    45621313131,
    'other_entity_id{i}',
    'service{i}',
    'type{i}'
);

Then both UDTs are altered, a new field is added to both of them (schema v2):

CREATE TYPE IF NOT EXISTS user_refs
(
    id text,
    alt_name text,
    firstname text,
    lastname text,
    email text,
+  priority_refs frozen<priority_refs>
);

CREATE TYPE IF NOT EXISTS obs_entity
(
    entity_id text,
    version int,
    entity_type text,
    type text,
    service text,
    user_refs frozen<user_refs>,
+  priority_refs frozen<priority_refs>
);

And then we run the query again, but we also include data for the newly added UDT fields (query v2):

INSERT INTO ENTITY
(
    entity_id,
    entity_info,
    entity_type,
    import_timestamp,
    import_timestamp_day,
    other_entity_id,
    service,
    type
)
VALUES
(
    'text{i}',
    (
        'entity_id{i}',
        {i},
        'entity_type{i}',
        'type{i}',
        'service{i}', 
        (
            'id{i}',
            'alt_name{i}',
            'firstname{i}',
            'lastname{i}',
            'email{i}'
            {new_type}
        )
        {new_type}
    ), 
    'entity_type{i}',
    1234568979,
    45621313131,
    'other_entity_id{i}',
    'service{i}',
    'type{i}'
)

The error suggests that we are running query v2 on schema v1.
The error message isn't great, but basically it complains that the tuple ('id{i}', 'alt_name{i}', 'firstname{i}', 'lastname{i}', 'email{i}', {new_type}) can't be used to initialize the user_refs field of obs_entity.

As a test I removed the ALTER TYPE statements from the test, and I got the same error as in the issue description.

It looks like the schema hasn't been updated to schema v2 by the time we start running query v2.

@cvybhu (Contributor) commented Sep 25, 2023

Usually the driver should wait for schema agreement after any ALTER * statement. It should wait until all nodes report the same schema version. The Rust driver does this automatically, but I'm not sure about other drivers.

@avelanarius do you know if the Python driver automatically waits for schema agreement after ALTER queries? Or should we do it manually?
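
For illustration, a minimal sketch of what a manual wait could look like, polling the same schema_version columns the drivers compare. The helper name, timeout, and poll interval are made up, and `session` is assumed to be a connected driver Session; this is not the driver's internal implementation.

import time

def wait_for_schema_agreement(session, timeout=60.0, poll_interval=0.2):
    # Poll system.local and system.peers until the coordinator reports a single schema version.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        local = session.execute("SELECT schema_version FROM system.local")
        peers = session.execute("SELECT schema_version FROM system.peers")
        versions = {row.schema_version for row in local} | {row.schema_version for row in peers}
        if len(versions) == 1:
            return True   # every node known to this coordinator reports the same schema version
        time.sleep(poll_interval)
    return False          # no agreement within the timeout

# Hypothetical usage after an ALTER:
# session.execute("ALTER TYPE user_refs ADD priority_refs frozen<priority_refs>")
# assert wait_for_schema_agreement(session)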

@cvybhu (Contributor) commented Sep 25, 2023

It looks like the default timeout for schema agreement in the Python driver is 10 seconds. That might not be enough for slow clusters in dtest runners:

max_schema_agreement_wait:

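For reference, this timeout is exposed as a constructor parameter of the driver's Cluster object (dtest raises it, as noted further below). A minimal sketch, with a placeholder contact point:

from cassandra.cluster import Cluster

# Raise the schema-agreement wait from the 10 s default to 60 s (placeholder node address).
cluster = Cluster(contact_points=["127.0.0.1"], max_schema_agreement_wait=60)
session = cluster.connect()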

@kbr-scylla (Contributor):

We could modify the test to enable DEBUG logging in the Python driver and see if the schema agreement wait times out.

If it does, though, I don't believe it is a matter of increasing the timeout. Why would the previous timeout have sufficed until now but no longer? It could be that, for some reason, we never reach agreement at all.

We could of course increase the timeout and see if it helps.

@mykaul (Contributor) commented Sep 26, 2023

We could modify the test to enable DEBUG logging in the Python driver and see if the schema agreement wait times out.

Who should do this modification?

@kbr-scylla (Contributor):

Whoever gets assigned to this issue and starts working on it, I guess.

@cvybhu (Contributor) commented Sep 26, 2023

I had a look at Python driver's code, and I noticed two things:

  • I don't see any logic that would await schema agreement after ALTER queries automatically. Maybe we have to call refresh_metadata() manually after each ALTER? Or maybe the logic is there but I missed it.

  • The Python driver waits for schema agreement using only the control connection. It asks the control connection's shard whether the local schema version and the peers' schema versions are the same, but all of this information comes from one node. This could lead to races, for example:

    • Driver has control connection to node A
    • Driver sends an ALTER query to node B
    • Driver asks node A for schema versions, they are all the same because information about the ALTER query hasn't propagated to node A yet.
    • Driver thinks that schema agreement has been reached, but node A still doesn't know about the schema change

    The Rust driver does it differently: it queries the local schema version on all available connections and checks whether the local schema versions are the same on all shards. This makes such race conditions impossible.

I suspect that the issue is caused by the Python driver's handling of schema agreement, but I'm not familiar with this codebase. I think it would be good to get an opinion from @Lorak-mmk or @avelanarius here. Maybe there are some known quirks in this area.

Here's the Python driver's schema agreement code: https://github.com/scylladb/python-driver/blob/d12d2c13c8d646eb3e376c9e69fa014daba8fba6/cassandra/cluster.py#L4071
And here's the Rust driver's: https://github.com/scylladb/scylla-rust-driver/blob/1854c96099432d5db3a7e4228dd8f5d306a8c3ab/scylla/src/transport/session.rs#L1787

@avelanarius (Member):

  • I don't see any logic that would await schema agreement after ALTER queries automatically. Maybe we have to call refresh_metadata() manually after each ALTER? Or maybe the logic is there but I missed it.

It does it here: https://github.com/scylladb/python-driver/blob/d735957e3a7b7178f4fff26f6dbae588e58cf314/cassandra/cluster.py#L4728-L4735

@avelanarius (Member):

  • Python driver waits for schema agreement using only the control connection. It asks the control connection shard whether the local schema version and peers schema versions are the same, but all of this information comes from one node.

The query that fetches schema versions from system.local and system.peers is performed on the same connection on which we received RESULT_KIND_SCHEMA_CHANGE. I don't think the issue you described can happen.

@nyh (Contributor) commented Sep 26, 2023

It looks like the default timeout for schema agreement in the Python driver is 10 seconds. That might not be enough for slow clusters in dtest runners:

max_schema_agreement_wait:


Note: dtest sets `max_schema_agreement_wait=60` in dtest_setup.py. This should be (famous last words...) more than enough. I suspect we don't wait for schema agreement at all...

@nyh (Contributor) commented Sep 26, 2023

I had a look at Python driver's code, and I noticed two things:

* I don't see any logic that would await schema agreement after `ALTER` queries automatically. Maybe we have to call `refresh_metadata()` manually after each ALTER? Or maybe the logic is there but I missed it.

The way this is done automatically is as follows:

  1. In response to a schema-modifying request, Scylla replies with a special response code defined in the CQL protocol - "Schema_change" (0x5).
  2. Upon receiving a schema_change response ("event"), the driver should wait for schema agreement.

As far as I can tell, Scylla does return SCHEMA_CHANGE on "ALTER TABLE" - see cql3/statements/alter_table_statement.cc,
and the Python driver does have code to handle this. But I didn't look too carefully, and didn't try to enable logging to see what really happens in practice.

* Python driver waits for schema agreement using only the control connection. It asks the control connection shard whether the local schema version and peers schema versions are the same, but all of this information comes from one node. This could lead to races, for example:
  
  * Driver has control connection to node A
  * Driver sends an `ALTER` query to node B
  * Driver asks node A for schema versions, they are all the same because information about the `ALTER` query hasn't propagated to node A yet.

If this is what the driver does, it indeed looks like a bug :-( The driver knows that it just sent a request to node B on a specific connection, and node B told it on that connection that the schema changed. It should ask node B on the same connection - not anyone else - whether it thinks the schema is agreed. If that is not what the code does, it's a bug :-(

  * Driver thinks that schema agreement has been reached, but node A still doesn't know about the schema change
  
  Rust driver does it differently, it queries the local schema version on all available connections and checks whether the local schema versions are the same on all shards. This makes such race conditions impossible.

This makes more sense and probably converges faster, but it sounds like more work for the client. Does the Rust driver always open connections to all nodes in the cluster, right on the first request?

@cvybhu (Contributor) commented Sep 26, 2023

It does it here: https://github.com/scylladb/python-driver/blob/d735957e3a7b7178f4fff26f6dbae588e58cf314/cassandra/cluster.py#L4728-L4735
The query that fetches schema versions from system.local and system.peers is performed on the same connection on which we received RESULT_KIND_SCHEMA_CHANGE. I don't think the issue you described can happen.

Thanks, you're right, that refutes my complaint. The race condition won't happen if we query the same connection that the ALTER was sent on. It looks like the Python driver does the right thing. I got confused because the code is in the ControlConnection class, but it does indeed send the requests on another connection.

@cvybhu (Contributor) commented Sep 26, 2023

I enabled debug logging for the Python driver, ran the test, and collected the logs.

diff --git a/conftest.py b/conftest.py
index a2373541..28d26443 100644
--- a/conftest.py
+++ b/conftest.py
@@ -194,7 +194,7 @@ def fixture_logging_setup(request, record_property, split_info):
         error_logger.addHandler(logging_plugin.log_file_handler)
         log_per_process_data['error_logger'] = error_logger
 
-    logging.getLogger("cassandra").setLevel(logging.INFO)
+    logging.getLogger("cassandra").setLevel(logging.DEBUG)
     logging.getLogger("boto3").setLevel(logging.INFO)
     logging.getLogger("botocore").setLevel(logging.INFO)
     logging.getLogger("s3transfer").setLevel(logging.INFO)

Here are the logs from a successful run:

debug-level-logs-schema-change.txt

The Python driver does wait for schema agreement (all the lines with "Waiting for schema agreement").

Looks okay to me.

@cvybhu (Contributor) commented Sep 26, 2023

We could modify the test to enable DEBUG logging in the Python driver and see if the schema agreement wait times out.

This sounds like a good way forward.
But is it possible to enable DEBUG logging for cassandra in this single dtest test case? I was able to do it globally, but I don't see how to do this for a single test.
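
One possible way to scope this to a single test, using only the standard logging module; the fixture name is hypothetical, not an existing dtest fixture, and this is just a sketch:

import logging
import pytest

@pytest.fixture
def cassandra_debug_logs():
    # Enable verbose driver logs only for tests that request this fixture.
    logger = logging.getLogger("cassandra")
    previous_level = logger.level
    logger.setLevel(logging.DEBUG)
    yield
    logger.setLevel(previous_level)  # restore the previous level afterwards

A test would then request it as an argument, e.g. def test_add_udt_to_another_data_types(self, cassandra_debug_logs).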

@cvybhu (Contributor) commented Sep 26, 2023

Alright, I hacked something together and opened a PR. The next time CI fails, we should be able to see logs about schema agreement.

@cvybhu (Contributor) commented Oct 4, 2023

Okay, we got the same failure again, e.g. in https://jenkins.scylladb.com/job/scylla-master/job/gating-dtest-release/5021/

There is no schema agreement timeout, but it looks like the test executes the query while the driver is still waiting for schema agreement:

<cassandra.cluster.ControlConnection object at 0x7f53d2700090>>, (), (('target_type', 'TABLE'), ('change_type', 'UPDATED'), ('keyspace', 'abcinfo'), ('table', 'entity')))
10:51:49,910 787     cassandra.connection           DEBUG    libevreactor.py     :370  | test_add_udt_to_another_data_types: Message pushed from server: <EventMessage(event_type='SCHEMA_CHANGE', event_args={'target_type': 'TABLE', 'change_type': 'UPDATED', 'keyspace': 'abcinfo', 'table': 'entity_by_unique_id'}, stream_id=-1, trace_id=None)>
10:51:49,910 787     cassandra.cluster              DEBUG    libevreactor.py     :370  | test_add_udt_to_another_data_types: Ignoring schedule_unique for already-scheduled task: (<bound method ControlConnection.refresh_schema of <cassandra.cluster.ControlConnection object at 0x7f53d2700090>>, (), (('target_type', 'TABLE'), ('change_type', 'UPDATED'), ('keyspace', 'abcinfo'), ('table', 'entity_by_unique_id')))
10:51:49,911 787     cassandra.connection           DEBUG    libevreactor.py     :370  | test_add_udt_to_another_data_types: Message pushed from server: <EventMessage(event_type='SCHEMA_CHANGE', event_args={'target_type': 'TABLE', 'change_type': 'UPDATED', 'keyspace': 'abcinfo', 'table': 'entity_by_type'}, stream_id=-1, trace_id=None)>
10:51:49,911 787     cassandra.cluster              DEBUG    libevreactor.py     :370  | test_add_udt_to_another_data_types: Ignoring schedule_unique for already-scheduled task: (<bound method ControlConnection.refresh_schema of <cassandra.cluster.ControlConnection object at 0x7f53d2700090>>, (), (('target_type', 'TABLE'), ('change_type', 'UPDATED'), ('keyspace', 'abcinfo'), ('table', 'entity_by_type')))
10:51:49,987 787     cassandra.cluster              DEBUG    thread.py           :58   | test_add_udt_to_another_data_types: Refreshing schema in response to schema change. {'target_type': 'TYPE', 'change_type': 'UPDATED', 'keyspace': 'abcinfo', 'type': 'user_refs'}
10:51:49,988 787     cassandra.cluster              DEBUG    thread.py           :58   | test_add_udt_to_another_data_types: [control connection] Waiting for schema agreement
10:51:49,989 787     cassandra.cluster              DEBUG    thread.py           :58   | test_add_udt_to_another_data_types: [control connection] Schemas mismatched, trying again
10:51:50,190 787     cassandra.cluster              DEBUG    thread.py           :58   | test_add_udt_to_another_data_types: [control connection] Schemas mismatched, trying again
10:51:50,391 787     cassandra.cluster              DEBUG    thread.py           :58   | test_add_udt_to_another_data_types: [control connection] Schemas mismatched, trying again
10:51:50,592 787     cassandra.cluster              DEBUG    thread.py           :58   | test_add_udt_to_another_data_types: [control connection] Schemas mismatched, trying again
10:51:50,793 787     cassandra.cluster              DEBUG    thread.py           :58   | test_add_udt_to_another_data_types: [control connection] Schemas mismatched, trying again
10:51:50,994 787     cassandra.cluster              DEBUG    thread.py           :58   | test_add_udt_to_another_data_types: [control connection] Schemas match
10:51:50,994 787     cassandra.cluster              DEBUG    thread.py           :58   | test_add_udt_to_another_data_types: [control connection] Waiting for schema agreement
10:51:50,995 787     cassandra.cluster              DEBUG    thread.py           :58   | test_add_udt_to_another_data_types: [control connection] Schemas match
10:51:50,996 787     cassandra.cluster              DEBUG    thread.py           :58   | test_add_udt_to_another_data_types: [control connection] Waiting for schema agreement
10:51:51,093 787     errors                         ERROR    conftest.py         :225  | test_add_udt_to_another_data_types: test failed: 
self = <user_types_test.TestUserTypes object at 0x7f53d9754050>

    def test_add_udt_to_another_data_types(self):
        logging.getLogger("cassandra").setLevel(logging.DEBUG)
        try:
>           self.internal_add_udt_to_another_data_types()

user_types_test.py:1008: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
user_types_test.py:1124: in internal_add_udt_to_another_data_types
    insert_into_entity(rows=10, altered_type=True)
user_types_test.py:1083: in insert_into_entity
    session.execute(stmt)
cassandra/cluster.py:2700: in cassandra.cluster.Session.execute
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

>   ???
E   cassandra.InvalidRequest: Error from server: code=2200 [Invalid query] message="Invalid tuple literal for entity_info: component 5 is not of type frozen<user_refs>"

There's "Waiting for schema agreement" and immediately after it the failure happens. At first glance it looks like the driver starts the schema agreement process, but the query doesn't wait for it to finish.

Here's the full log: dtest-gw2.log

@nyh (Contributor) commented Oct 4, 2023

There's "Waiting for schema agreement" and immediately after it the failure happens. At first glance it looks like the driver starts the schema agreement process, but the query doesn't wait for it to finish.

Good catch, this confirms my guess in #15530 (comment) that the problem is not the amount of wait (which is already 60 seconds), but the fact that the driver doesn't wait for it to finish.

So it sounds like a driver bug, but there's nothing smart I can say about the details. I wonder if this test failure can be reproduced locally (even if at low probability) - it would help debug it and, more importantly, test a fix.

@fruch (Contributor) commented Oct 5, 2023

@Lorak-mmk (Contributor) commented Oct 9, 2023

I've just managed to reproduce it on a 1-node cluster (Scylla master, compiled in debug mode), without using the Python driver, so I think it's a problem in Scylla: namely, after Scylla returns the result of a schema-altering query, it's possible that not all shards have updated the schema yet.
Here's the code: https://github.com/Lorak-mmk/udt_schema_change_reproducer - you run it with one argument, the IP of the node.
And here are the logs from a faulty execution:

Bound socket to: 0.0.0.0:12036
Created connection to shard 0
Bound socket to: 0.0.0.0:12037
Created connection to shard 1
Bound socket to: 0.0.0.0:12038
Created connection to shard 2
Bound socket to: 0.0.0.0:12039
Created connection to shard 3
Bound socket to: 0.0.0.0:12040
Created connection to shard 4
Bound socket to: 0.0.0.0:12041
Created connection to shard 5
Bound socket to: 0.0.0.0:12042
Created connection to shard 6
Bound socket to: 0.0.0.0:12043
Created connection to shard 7
Bound socket to: 0.0.0.0:12044
Created connection to shard 8
Bound socket to: 0.0.0.0:12045
Created connection to shard 9
Bound socket to: 0.0.0.0:12046
Created connection to shard 10
Bound socket to: 0.0.0.0:12047
Created connection to shard 11
Drop keyspace
SchemaChange(SchemaChange { event: KeyspaceChange { change_type: Dropped, keyspace_name: "ks" } })
Create keyspace
SchemaChange(SchemaChange { event: KeyspaceChange { change_type: Created, keyspace_name: "ks" } })
Control connection - use keyspace
SetKeyspace(SetKeyspace { keyspace_name: "ks" })
Create user_refs type
SchemaChange(SchemaChange { event: TypeChange { change_type: Created, keyspace_name: "ks", type_name: "user_refs" } })
Create obs_entity type
SchemaChange(SchemaChange { event: TypeChange { change_type: Created, keyspace_name: "ks", type_name: "obs_entity" } })
Create other_entity type
SchemaChange(SchemaChange { event: TypeChange { change_type: Created, keyspace_name: "ks", type_name: "other_entity" } })
Create entity table
SchemaChange(SchemaChange { event: TableChange { change_type: Created, keyspace_name: "ks", object_name: "entity" } })
Create index on entity(service)
SchemaChange(SchemaChange { event: TableChange { change_type: Updated, keyspace_name: "ks", object_name: "entity" } })
Create index on entity(type)
SchemaChange(SchemaChange { event: TableChange { change_type: Updated, keyspace_name: "ks", object_name: "entity" } })
Create entity_by_type materialized view
SchemaChange(SchemaChange { event: TableChange { change_type: Created, keyspace_name: "ks", object_name: "entity_by_type" } })
Create entity_by_unique_id materialized view
SchemaChange(SchemaChange { event: TableChange { change_type: Created, keyspace_name: "ks", object_name: "entity_by_unique_id" } })
Create entity_rel table
SchemaChange(SchemaChange { event: TableChange { change_type: Created, keyspace_name: "ks", object_name: "entity_rel" } })
Create index on entity_rel(dest_entity_id)
SchemaChange(SchemaChange { event: TableChange { change_type: Updated, keyspace_name: "ks", object_name: "entity_rel" } })
Create index on entity_rel(service)
SchemaChange(SchemaChange { event: TableChange { change_type: Updated, keyspace_name: "ks", object_name: "entity_rel" } })
Create index on entity_rel(rel_type)
SchemaChange(SchemaChange { event: TableChange { change_type: Updated, keyspace_name: "ks", object_name: "entity_rel" } })
Inserted on shard 5
Inserted on shard 1
Inserted on shard 11
Inserted on shard 6
Inserted on shard 0
Inserted on shard 8
Inserted on shard 7
Inserted on shard 3
Inserted on shard 2
Inserted on shard 4
Inserted on shard 9
Inserted on shard 10
Inserted on shard 5
Inserted on shard 0
Inserted on shard 11
Inserted on shard 1
Inserted on shard 6
Inserted on shard 3
Inserted on shard 8
Inserted on shard 2
Inserted on shard 4
Inserted on shard 7
Inserted on shard 10
Inserted on shard 9
Create priority_refs type
SchemaChange(SchemaChange { event: TypeChange { change_type: Created, keyspace_name: "ks", type_name: "priority_refs" } })
Alter obs_entity type
SchemaChange(SchemaChange { event: TypeChange { change_type: Updated, keyspace_name: "ks", type_name: "obs_entity" } })
Alter user_refs type
SchemaChange(SchemaChange { event: TypeChange { change_type: Updated, keyspace_name: "ks", type_name: "user_refs" } })
thread 'main' panicked at 'Wrong response! Error(Error { error: Invalid, reason: "Invalid tuple literal for entity_info: component 5 is not of type frozen<user_refs>" })', src/main.rs:98:18
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

I managed to reproduce it once, after a lot of tries, so it's not yet a very good reproducer; I'll try to improve it. The main difference from dtest is that after getting a response from a schema-altering query, I send inserts to all shards ASAP, in the hope of catching one that is not yet updated.

@kbr-scylla (Contributor):

Hmm, if you suspect it's the schema change code -- could it be 3d4398d?
I'm not sure how, but it was a significant change to how schema changes are done (at least in Raft mode).

@kbr-scylla (Contributor):

A simple way to revert the effect of 3d4398d without reverting the entire thing would be:

diff --git a/service/migration_manager.cc b/service/migration_manager.cc
index e289ff7ce6..5b7f491139 100644
--- a/service/migration_manager.cc
+++ b/service/migration_manager.cc
@@ -1008,7 +1008,7 @@ static void add_committed_by_group0_flag(std::vector<mutation>& schema, const gr
 
 // Returns a future on the local application of the schema
 future<> migration_manager::announce(std::vector<mutation> schema, group0_guard guard, std::string_view description) {
-    if (_feat.group0_schema_versioning) {
+    if (false) {
         schema.push_back(make_group0_schema_version_mutation(_storage_proxy.data_dictionary(), guard));
         add_committed_by_group0_flag(schema, guard);
     }

@kbr-scylla (Contributor):

Looking at the schema change code -- it looks like it's updating types on all shards before returning.

static future<> do_merge_schema(distributed<service::storage_proxy>& proxy, sharded<db::system_keyspace>& sys_ks, std::vector<mutation> mutations, bool do_flush, bool reload)
{
...
    auto types_to_drop = co_await merge_types(proxy, std::move(old_types), std::move(new_types));
static future<user_types_to_drop> merge_types(distributed<service::storage_proxy>& proxy, schema_result before, schema_result after)
{
    auto diff = diff_rows(before, after);

    // Create and update user types before any tables/views are created that potentially
    // use those types. Similarly, defer dropping until after tables/views that may use
    // some of these user types are dropped.

    co_await proxy.local().get_db().invoke_on_all([&] (replica::database& db) -> future<> {
        for (auto&& user_type : create_types(db, diff.created)) {
            db.find_keyspace(user_type->_keyspace).add_user_type(user_type);
            co_await db.get_notifier().create_user_type(user_type);
        }
        for (auto&& user_type : create_types(db, diff.altered)) {
            db.find_keyspace(user_type->_keyspace).add_user_type(user_type);
            co_await db.get_notifier().update_user_type(user_type);
        }
    });

And do_merge_schema must finish locally before Scylla returns from the schema change request.

@avikivity (Member):

Kudos for getting a reproducer. If it reproduces 1 in 100 runs (say), just run it 1000 times (with concurrency). We can then bisect to look for the bad commit (we can try to prove that a commit one week older than the first report is okay, to have a good commit).

@kbr-scylla (Contributor):

I applied the following diff to the reproducer:

diff --git a/src/main.rs b/src/main.rs
index 1c405de..a8ca9d9 100644
--- a/src/main.rs
+++ b/src/main.rs
@@ -181,6 +181,7 @@ async fn main() {
     let mut args = std::env::args();
     let _name = args.next().unwrap();
     let ip = args.next().unwrap();
+    for iter in 0..1000 {
     let mut ctrl_conn = setup_connection(&ip, None).await;
     let mut shards_conns: Vec<Connection> = Vec::new();
     for shard in 0..ctrl_conn.shard_count {
@@ -366,6 +367,8 @@ async fn main() {
             .await
     );
 
+    println!("Iter {}", iter);
+
     println!("Alter user_refs type");
     println!(
         "{:?}",
@@ -382,4 +385,5 @@ async fn main() {
 
     // assert_row_count(session=session, table_name="entity_rel", expected=rows_num,
     //                     consistency_level=ConsistencyLevel.QUORUM)
+    }
 }

It reproduced after 50 iterations.

@avikivity (Member):

That's good. So 1000 iterations (with concurrency) should be a reliable pass/fail test (provided we get at least 10 fails from 1000, not 1).

@avikivity (Member):

E.g., don't stop after getting a failure.
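
A sketch of a harness along those lines, usable for example as a `git bisect run` script. The reproducer binary path, node IP, and thresholds are placeholders following the numbers suggested above, and concurrency is omitted for brevity:

import subprocess
import sys

ITERATIONS = 1000
FAILURE_THRESHOLD = 10   # treat the commit as bad only if failures are clearly non-zero

failures = 0
for i in range(ITERATIONS):
    result = subprocess.run(["./udt_schema_change_reproducer", "127.0.0.1"])
    if result.returncode != 0:
        failures += 1    # keep going; don't stop at the first failure

print(f"{failures} failures out of {ITERATIONS} iterations")
sys.exit(1 if failures >= FAILURE_THRESHOLD else 0)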

@kbr-scylla (Contributor):

More details about the error can be obtained by modifying Scylla a bit:

@@ -528,6 +529,7 @@ tuple_constructor_test_assignment(const tuple_constructor& tc, data_dictionary::
         tuple_constructor_validate_assignable_to(tc, db, keyspace, schema_opt, receiver);
         return assignment_testable::test_result::WEAKLY_ASSIGNABLE;
     } catch (exceptions::invalid_request_exception& e) {
+        expr_logger.warn("tuple not assignable: {}", e);
         return assignment_testable::test_result::NOT_ASSIGNABLE;
     }
 }

Now Scylla prints a warning before throwing:

WARN  2023-10-10 14:39:08,581 [shard 16:stat] cql_expression - tuple not assignable: exceptions::invalid_request_exception (Invalid tuple literal for entity_info[5]: too many elements. Type frozen<user_refs> expects 5 but got 6)

@Lorak-mmk (Contributor):

  • The issue probably doesn't happen with --consistent-cluster-management=false - despite a lot of tries, I have not been able to reproduce it there.
  • It's not a matter of some single shard being out of date. I modified the reproducer to always print the results from all shards, and when the failure happens, it happens on all of them:
Bound socket to: 0.0.0.0:12024
Created connection to shard 0
Bound socket to: 0.0.0.0:12025
Created connection to shard 1
Bound socket to: 0.0.0.0:12026
Created connection to shard 2
Bound socket to: 0.0.0.0:12027
Created connection to shard 3
Bound socket to: 0.0.0.0:12028
Created connection to shard 4
Bound socket to: 0.0.0.0:12029
Created connection to shard 5
Bound socket to: 0.0.0.0:12030
Created connection to shard 6
Bound socket to: 0.0.0.0:12031
Created connection to shard 7
Bound socket to: 0.0.0.0:12032
Created connection to shard 8
Bound socket to: 0.0.0.0:12033
Created connection to shard 9
Bound socket to: 0.0.0.0:12034
Created connection to shard 10
Bound socket to: 0.0.0.0:12035
Created connection to shard 11
Drop keyspace
Ok(SchemaChange(SchemaChange { event: KeyspaceChange { change_type: Dropped, keyspace_name: "ks" } }))
Create keyspace
Ok(SchemaChange(SchemaChange { event: KeyspaceChange { change_type: Created, keyspace_name: "ks" } }))
Control connection - use keyspace
Ok(SetKeyspace(SetKeyspace { keyspace_name: "ks" }))
Create user_refs type
Ok(SchemaChange(SchemaChange { event: TypeChange { change_type: Created, keyspace_name: "ks", type_name: "user_refs" } }))
Create obs_entity type
Ok(SchemaChange(SchemaChange { event: TypeChange { change_type: Created, keyspace_name: "ks", type_name: "obs_entity" } }))
Create other_entity type
Ok(SchemaChange(SchemaChange { event: TypeChange { change_type: Created, keyspace_name: "ks", type_name: "other_entity" } }))
Create entity table
Ok(SchemaChange(SchemaChange { event: TableChange { change_type: Created, keyspace_name: "ks", object_name: "entity" } }))
Create index on entity(service)
Ok(SchemaChange(SchemaChange { event: TableChange { change_type: Updated, keyspace_name: "ks", object_name: "entity" } }))
Create index on entity(type)
Ok(SchemaChange(SchemaChange { event: TableChange { change_type: Updated, keyspace_name: "ks", object_name: "entity" } }))
Create entity_by_type materialized view
Ok(SchemaChange(SchemaChange { event: TableChange { change_type: Created, keyspace_name: "ks", object_name: "entity_by_type" } }))
Create entity_by_unique_id materialized view
Ok(SchemaChange(SchemaChange { event: TableChange { change_type: Created, keyspace_name: "ks", object_name: "entity_by_unique_id" } }))
Create entity_rel table
Ok(SchemaChange(SchemaChange { event: TableChange { change_type: Created, keyspace_name: "ks", object_name: "entity_rel" } }))
Create index on entity_rel(dest_entity_id)
Ok(SchemaChange(SchemaChange { event: TableChange { change_type: Updated, keyspace_name: "ks", object_name: "entity_rel" } }))
Create index on entity_rel(service)
Ok(SchemaChange(SchemaChange { event: TableChange { change_type: Updated, keyspace_name: "ks", object_name: "entity_rel" } }))
Create index on entity_rel(rel_type)
Ok(SchemaChange(SchemaChange { event: TableChange { change_type: Updated, keyspace_name: "ks", object_name: "entity_rel" } }))
Inserted on shard 5
Inserted on shard 8
Inserted on shard 1
Inserted on shard 2
Inserted on shard 4
Inserted on shard 6
Inserted on shard 10
Inserted on shard 7
Inserted on shard 9
Inserted on shard 11
Inserted on shard 0
Inserted on shard 3
Inserted on shard 8
Inserted on shard 5
Inserted on shard 0
Inserted on shard 11
Inserted on shard 3
Inserted on shard 9
Inserted on shard 4
Inserted on shard 6
Inserted on shard 2
Inserted on shard 7
Inserted on shard 10
Inserted on shard 1
Create priority_refs type
Ok(SchemaChange(SchemaChange { event: TypeChange { change_type: Created, keyspace_name: "ks", type_name: "priority_refs" } }))
Alter obs_entity type
Ok(SchemaChange(SchemaChange { event: TypeChange { change_type: Updated, keyspace_name: "ks", type_name: "obs_entity" } }))
Alter user_refs type
Ok(SchemaChange(SchemaChange { event: TypeChange { change_type: Updated, keyspace_name: "ks", type_name: "user_refs" } }))
FAILED on shard 3: Error { error: Invalid, reason: "Invalid tuple literal for entity_info: component 5 is not of type frozen<user_refs>" }
FAILED on shard 5: Error { error: Invalid, reason: "Invalid tuple literal for entity_info: component 5 is not of type frozen<user_refs>" }
FAILED on shard 6: Error { error: Invalid, reason: "Invalid tuple literal for entity_info: component 5 is not of type frozen<user_refs>" }
FAILED on shard 1: Error { error: Invalid, reason: "Invalid tuple literal for entity_info: component 5 is not of type frozen<user_refs>" }
FAILED on shard 2: Error { error: Invalid, reason: "Invalid tuple literal for entity_info: component 5 is not of type frozen<user_refs>" }
FAILED on shard 4: Error { error: Invalid, reason: "Invalid tuple literal for entity_info: component 5 is not of type frozen<user_refs>" }
FAILED on shard 8: Error { error: Invalid, reason: "Invalid tuple literal for entity_info: component 5 is not of type frozen<user_refs>" }
FAILED on shard 7: Error { error: Invalid, reason: "Invalid tuple literal for entity_info: component 5 is not of type frozen<user_refs>" }
FAILED on shard 9: Error { error: Invalid, reason: "Invalid tuple literal for entity_info: component 5 is not of type frozen<user_refs>" }
FAILED on shard 10: Error { error: Invalid, reason: "Invalid tuple literal for entity_info: component 5 is not of type frozen<user_refs>" }
FAILED on shard 11: Error { error: Invalid, reason: "Invalid tuple literal for entity_info: component 5 is not of type frozen<user_refs>" }
FAILED on shard 0: Error { error: Invalid, reason: "Invalid tuple literal for entity_info: component 5 is not of type frozen<user_refs>" }

@kbr-scylla (Contributor):

I modified the reproducer: after it gets this error from Scylla, it sleeps 100 ms and retries the insert. If it's a race, the second insert should succeed.

However, the second insert also fails. So it looks like the schema enters a broken state that is not transient?

@kbr-scylla (Contributor) commented Oct 10, 2023

Also, running 1000 iterations with this diff:
#15530 (comment)

Passed.

So the most likely culprit is 3d4398d :(
I'm going to send a patch to disable this. (By disabling the cluster feature.)

@cvybhu assigned kbr-scylla and unassigned cvybhu Oct 10, 2023
kbr-scylla added a commit to kbr-scylla/scylladb that referenced this issue Oct 10, 2023
It turns out that 3d4398d introduced a
regression. The dtest `test_add_udt_to_another_data_types` started
failing in a flaky way.  The test is adding a field to the user-defined
type `user_refs` and then inserting into a table that references this
type. The insert fails with the following error: `Invalid tuple literal
for entity_info: component 5 is not of type frozen<user_refs>`. But
Scylla apparently keeps using the old version of this type, without the
new field. And this doesn't seem a race in schema propagation across
shards in Scylla -- once the first insert fails, sleeping and retrying
the insert fails in the same way, so it seems that schema enters a
broken state that is not transient.

The feature wasn't released in any version of Scylla yet, so we can
safely disable it for now. When the feature is disabled, Scylla uses the
old way of determining schema versions (calculating hashes). I verified
using @Lorak-mmk's reproducer that disabling the feature solves the
problem.

Fixes scylladb#15530
@kbr-scylla (Contributor):

I described the root cause here: #15687 (comment)

@mykaul added this to the 5.4 milestone Oct 19, 2023
kbr-scylla added a commit to kbr-scylla/scylladb that referenced this issue Nov 21, 2023
…hema in Raft mode

This logic of "including base mutations" when sending a view schema
change was implemented in case a node receiving the view schema
mutations somehow missed earlier base mutations that the view mutations
may depend on.

First, in Raft mode it's not necessary to do this. The base mutations
are guaranteed to already be present on any node that applies the view
update, which is ordered after any base update that it may depend on in
the Raft log.

Second, including base mutations, due to the way it's implemented, may
cause problems. Suppose we have a base table with a view attached. When
we update the schema of this base table, the view's schema is also
updated. The base update generates mutations for base, obviously. But
the view update also generates mutations for base -- the "include base
mutations" logic.  It turns out that base mutations from base update
conflict with base mutations from the view update. This conflict is
currently benign. But it manifested and caused a hard-to-debug failure
when we implemented group 0 schema versioning. It had to be reverted.

As a prerequisite to get group 0 schema versioning merged back, we need
to deal with this conflict. One way is to not include these base
mutations as part of view update (in Raft mode), which is what this
commit does.

Refs:
scylladb#15331 (group 0 schema versioning PR)
scylladb#15530 (issue caused by the conflicting mutations)
scylladb#15687 (comment)
(in this comment I describe in detail how the conflict works)
kbr-scylla added a commit to kbr-scylla/scylladb that referenced this issue Nov 22, 2023
… view update

When a view schema is changed, the schema change command also includes
mutations for the corresponding base table; these mutations don't modify
the base schema but are included in case the receiver of view
mutations somehow didn't receive base mutations yet (this may in theory
happen outside Raft mode).

There are situations where the schema change command contains both
mutations that describe the current state of the base table -- included
by a view update, as explained above -- and mutations that want to
modify the base table. Such a situation arises, for example, when we
update a user-defined type which is referenced by both a view and its
corresponding base table. This triggers a schema change of the view,
which generates mutations to modify the view and includes mutations of
the current base schema, and at the same time it triggers a schema
change of the base, which generates mutations to modify the base.

These two sets of mutations are conflicting with each other. One set
wants to preserve the current state of the base table while the other
wants to modify it. And the two sets of mutations are generated using
the same timestamp, which means that conflict resolution between them is
made on a per-mutation-cell basis, comparing the values in each cell and
taking the "larger" one (meaning of "larger" depends on the type of each
cell).

Fortunately, this conflict is currently benign -- or at least there is
no known situation where it causes problems.

Unfortunately, it started causing problems when I attempted to implement
group 0 schema versioning (PR scylladb#15331), where instead of
calculating table versions as hashes of schema mutations, we would send
versions as part of schema change command. These versions would be
stored inside the `system_schema.scylla_tables` table, `version` column,
and sent as part of schema change mutations.

And then the conflict showed. One set of mutations wanted to preserve
the old value of `version` column while the other wanted to update it.
It turned out that sometimes the old `version` prevailed, because the
`version` column in `system_schema.scylla_tables` uses UUID-based
comparison (not timeuuid-based comparison). This manifested as issue
scylladb#15530.

To prevent this, the idea in this commit is simple: when generating
mutations for the base table as part of corresponding view update, do
not use the provided timestamp directly -- instead, decrement it by one.
This way, if the schema change command contains mutations that want to
modify the base table, these modifying mutations will win all conflicts
based on the timestamp alone (they are using the same provided
timestamp, but not decremented).

One could argue that the choice of this timestamp is anyway arbitrary.
The original purpose of including base mutations during view update was
to ensure that a node which somehow missed the base mutations, gets them
when applying the view. But in that case, the "most correct" solution
should have been to use the *original* base mutations -- i.e. the ones
that we have on disk -- instead of generating new mutations for the base
with a refreshed timestamp. The base mutations that we have on disk have
smaller timestamps already (since these mutations are from the past,
when the base was last modified or created), so the conflict would also
not happen in this case.

But that solution would require doing a disk read, and we can avoid the
read while still fixing the conflict by using an intermediate solution:
regenerating the mutations but with `timestamp - 1`.

Ref: scylladb#15530
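
A toy illustration (not Scylla code) of the conflict resolution described in the commit message above: with equal timestamps the "larger" cell value decides, so a lexically larger old version string can beat the new one, while writing the view-included base mutation at timestamp - 1 lets the real update win. The values below are made up:

def reconcile(cell_a, cell_b):
    # cell = (timestamp, value): the higher timestamp wins, ties are broken by value comparison
    if cell_a[0] != cell_b[0]:
        return max(cell_a, cell_b, key=lambda c: c[0])
    return max(cell_a, cell_b, key=lambda c: c[1])

old_version = "f0000000-..."   # value carried by the view's "include base" mutation
new_version = "0aaaaaaa-..."   # value the real base update wants to write
ts = 100

# Same timestamp: the old version can win purely on value comparison (the bug).
assert reconcile((ts, old_version), (ts, new_version)) == (ts, old_version)

# View-included base mutation written at ts - 1: the modifying mutation always wins.
assert reconcile((ts - 1, old_version), (ts, new_version)) == (ts, new_version)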
@mykaul removed the triage/master (Looking for assignee) label Nov 22, 2023
@mykaul modified the milestones: 5.4, 6.0 Nov 22, 2023
kbr-scylla added a commit to kbr-scylla/scylladb that referenced this issue Nov 24, 2023
… view update

(The commit message is identical to the Nov 22 commit above.)
tgrabiec pushed a commit that referenced this issue Nov 28, 2023
… view update

(The commit message is identical to the Nov 22 commit above, with the issue references written as #15331, #15530, and #15687.)

Closes #16139