test_add_udt_to_another_data_types: Invalid tuple literal for entity_info: component 5 is not of type frozen<user_refs> #15530

Closed
bhalevy opened this issue Sep 23, 2023 · 60 comments
Labels: symptom/ci stability (Issues that failed in ScyllaDB CI - tests and framework)

@bhalevy (Member) commented Sep 23, 2023

Although the symptom looks identical to #5336, I'm opening a new issue since it seems we have a recent regression.

@kbr-scylla wrote:

I've just got a very similar failure on #15331:

https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/3711/testReport/junit/user_types_test/TestUserTypes/Tests___Sanity_Tests___test_add_udt_to_another_data_types/

cassandra.InvalidRequest: Error from server: code=2200 [Invalid query] message="Invalid tuple literal for entity_

@Deexie hit this in #15506

https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/3843/

I just saw this in #15480 (comment)

https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/3886/testReport/junit/user_types_test/TestUserTypes/Tests___Sanity_Tests___test_add_udt_to_another_data_types/
E cassandra.InvalidRequest: Error from server: code=2200 [Invalid query] message="Invalid tuple literal for entity_info: component 5 is not of type frozen<user_refs>"

@bhalevy added the triage/master (Looking for assignee) and symptom/ci stability (Issues that failed in ScyllaDB CI - tests and framework) labels Sep 23, 2023
@bhalevy (Member, Author) commented Sep 23, 2023

Let's first determine whether the issue is with the product or with the test.

@bhalevy (Member, Author) commented Sep 23, 2023

@nyh was there a recent change in this area on Scylla's side?

@nyh (Contributor) commented Sep 24, 2023

@avikivity and @cvybhu have been changing related areas of the CQL expression code. But if this failure isn't consistent (i.e., it's flaky; is it?), perhaps a more likely possibility is that it's a schema change race: the request is issued immediately after an ALTER TYPE and maybe it isn't aware of the change. It's not impossible that we had a regression somewhere; we could bisect. I can't think of any specific change that I suspect.

@bhalevy (Member, Author) commented Sep 24, 2023

@tgrabiec please confirm schema change sync issue

@nyh (Contributor) commented Sep 24, 2023

@tgrabiec please confirm schema change sync issue

@bhalevy is this test failure flaky, or did it recently start to fail consistently?

@bhalevy (Member, Author) commented Sep 24, 2023

@tgrabiec please confirm schema change sync issue

@bhalevy is this test failure flaky, or did it recently start to fail consistently?

It was flaky in the past and that issue was closed with no explicit fix (see #5336 (comment)), but it recently became flaky again, so I opened this issue rather than reopening #5336.

@cvybhu (Contributor) commented Sep 25, 2023

@cvybhu have been changing related area of the CQL expression code.

To me it looks like some kind of schema version race.

We have a 3-node cluster and the following schema (schema v1):

CREATE TABLE IF NOT EXISTS entity
(
    entity_id text,
    other_entity_id text,
    entity_type text,
    type text,
    service text,
    entity_info frozen < obs_entity >,
    import_timestamp timestamp,
    import_timestamp_day timestamp,
    primary key(entity_id)
);

CREATE TYPE IF NOT EXISTS user_refs
(
    id text,
    alt_name text,
    firstname text,
    lastname text,
    email text
);

CREATE TYPE IF NOT EXISTS obs_entity
(
    entity_id text,
    version int,
    entity_type text,
    type text,
    service text,
    user_refs frozen<user_refs>
);

We run the following query (query v1):

INSERT INTO ENTITY
(
    entity_id,
    entity_info,
    entity_type,
    import_timestamp,
    import_timestamp_day,
    other_entity_id,
    service,
    type
)
VALUES
(
    'text{i}',
    (
        'entity_id{i}',
        {i},
        'entity_type{i}',
        'type{i}',
        'service{i}', 
        (
            'id{i}',
            'alt_name{i}',
            'firstname{i}',
            'lastname{i}',
            'email{i}'
        )
    ), 
    'entity_type{i}',
    1234568979,
    45621313131,
    'other_entity_id{i}',
    'service{i}',
    'type{i}'
);

Then both UDTs are altered, a new field is added to both of them (schema v2):

CREATE TYPE IF NOT EXISTS user_refs
(
    id text,
    alt_name text,
    firstname text,
    lastname text,
    email text,
+  priority_refs frozen<priority_refs>
);

CREATE TYPE IF NOT EXISTS obs_entity
(
    entity_id text,
    version int,
    entity_type text,
    type text,
    service text,
    user_refs frozen<user_refs>,
+  priority_refs frozen<priority_refs>
);

And then we run the query again, but we also include data for the newly added UDT fields (query v2):

INSERT INTO ENTITY
(
    entity_id,
    entity_info,
    entity_type,
    import_timestamp,
    import_timestamp_day,
    other_entity_id,
    service,
    type
)
VALUES
(
    'text{i}',
    (
        'entity_id{i}',
        {i},
        'entity_type{i}',
        'type{i}',
        'service{i}', 
        (
            'id{i}',
            'alt_name{i}',
            'firstname{i}',
            'lastname{i}',
            'email{i}'
            {new_type}
        )
        {new_type}
    ), 
    'entity_type{i}',
    1234568979,
    45621313131,
    'other_entity_id{i}',
    'service{i}',
    'type{i}'
)

The error suggests that we are running query v2 on schema v1.
The error message isn't great, but basically it complains that the tuple ('id{i}', 'alt_name{i}', 'firstname{i}', 'lastname{i}', 'email{i}', {new_type}) can't be used to initialize the user_refs field of obs_entity.

As a test I removed the ALTER TYPE statements from the test, and I got the same error as in the issue description.

It looks like the schema hasn't been updated to schema v2 by the time we start running query v2.

@cvybhu (Contributor) commented Sep 25, 2023

Usually the driver should wait for schema agreement after any ALTER * statement. It should wait until all nodes report the same schema version. The Rust driver does this automatically, but I'm not sure about other drivers.

@avelanarius do you know if the Python driver automatically waits for schema agreement after ALTER queries? Or should we do it manually?
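
For illustration, a minimal sketch of what a manual wait could look like, polling the same schema_version columns the drivers compare. The helper name, timeout, and poll interval are made up, and `session` is assumed to be a connected driver Session; this is not the driver's internal implementation.

import time

def wait_for_schema_agreement(session, timeout=60.0, poll_interval=0.2):
    # Poll system.local and system.peers until the coordinator reports a single schema version.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        local = session.execute("SELECT schema_version FROM system.local")
        peers = session.execute("SELECT schema_version FROM system.peers")
        versions = {row.schema_version for row in local} | {row.schema_version for row in peers}
        if len(versions) == 1:
            return True   # every node known to this coordinator reports the same schema version
        time.sleep(poll_interval)
    return False          # no agreement within the timeout

# Hypothetical usage after an ALTER:
# session.execute("ALTER TYPE user_refs ADD priority_refs frozen<priority_refs>")
# assert wait_for_schema_agreement(session)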

@cvybhu (Contributor) commented Sep 25, 2023

It looks like the default timeout for schema agreement in the Python driver is 10 seconds. That might not be enough for slow clusters in dtest runners:

max_schema_agreement_wait:

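For reference, this timeout is exposed as a constructor parameter of the driver's Cluster object (dtest raises it, as noted further below). A minimal sketch, with a placeholder contact point:

from cassandra.cluster import Cluster

# Raise the schema-agreement wait from the 10 s default to 60 s (placeholder node address).
cluster = Cluster(contact_points=["127.0.0.1"], max_schema_agreement_wait=60)
session = cluster.connect()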

@kbr-scylla (Contributor):

We could modify the test to enable DEBUG logging in the Python driver and see if the schema agreement wait times out.

If it does, though, I don't believe it is a matter of increasing the timeout. Why would the previous timeout have sufficed until now but no longer? It could be that, for some reason, we never reach agreement at all.

We could of course increase the timeout and see if it helps.

@mykaul (Contributor) commented Sep 26, 2023

We could modify the test to enable DEBUG logging in the Python driver and see if the schema agreement wait times out.

Who should do this modification?

@kbr-scylla (Contributor):

Whoever gets assigned to this issue and starts working on it, I guess.

@cvybhu (Contributor) commented Sep 26, 2023

I had a look at Python driver's code, and I noticed two things:

  • I don't see any logic that would await schema agreement after ALTER queries automatically. Maybe we have to call refresh_metadata() manually after each ALTER? Or maybe the logic is there but I missed it.

  • The Python driver waits for schema agreement using only the control connection. It asks the control connection's shard whether the local schema version and the peers' schema versions are the same, but all of this information comes from one node. This could lead to races, for example:

    • Driver has control connection to node A
    • Driver sends an ALTER query to node B
    • Driver asks node A for schema versions, they are all the same because information about the ALTER query hasn't propagated to node A yet.
    • Driver thinks that schema agreement has been reached, but node A still doesn't know about the schema change

    The Rust driver does it differently: it queries the local schema version on all available connections and checks whether the local schema versions are the same on all shards. This makes such race conditions impossible.

I suspect that the issue is caused by the Python driver's handling of schema agreement, but I'm not familiar with this codebase. I think it would be good to get an opinion from @Lorak-mmk or @avelanarius here. Maybe there are some known quirks in this area.

Here's the Python driver's schema agreement code: https://github.com/scylladb/python-driver/blob/d12d2c13c8d646eb3e376c9e69fa014daba8fba6/cassandra/cluster.py#L4071
And here's the Rust driver's: https://github.com/scylladb/scylla-rust-driver/blob/1854c96099432d5db3a7e4228dd8f5d306a8c3ab/scylla/src/transport/session.rs#L1787

@avelanarius (Member):

  • I don't see any logic that would await schema agreement after ALTER queries automatically. Maybe we have to call refresh_metadata() manually after each ALTER? Or maybe the logic is there but I missed it.

It does it here: https://github.com/scylladb/python-driver/blob/d735957e3a7b7178f4fff26f6dbae588e58cf314/cassandra/cluster.py#L4728-L4735

@avelanarius (Member):

  • Python driver waits for schema agreement using only the control connection. It asks the control connection shard whether the local schema version and peers schema versions are the same, but all of this information comes from one node.

The query that fetches schema versions from system.local and system.peers is performed on the same connection on which we received RESULT_KIND_SCHEMA_CHANGE. I don't think the issue you described can happen.

@nyh (Contributor) commented Sep 26, 2023

It looks like the default timeout for schema agreement in the Python driver is 10 seconds. That might not be enough for slow clusters in dtest runners:

max_schema_agreement_wait:


Note: dtest sets `max_schema_agreement_wait=60` in dtest_setup.py. This should be (famous last words...) more than enough. I suspect we don't wait for schema agreement at all...

@nyh (Contributor) commented Sep 26, 2023

I had a look at Python driver's code, and I noticed two things:

* I don't see any logic that would await schema agreement after `ALTER` queries automatically. Maybe we have to call `refresh_metadata()` manually after each ALTER? Or maybe the logic is there but I missed it.

The way this is done automatically is as follows:

  1. In response to a schema-modifying request, Scylla replies with a special response code defined in the CQL protocol - "Schema_change" (0x5).
  2. Upon receiving a schema_change response ("event"), the driver should wait for schema agreement.

As far as I can tell, Scylla does return SCHEMA_CHANGE on "ALTER TABLE" - see cql3/statements/alter_table_statement.cc,
and the Python driver does have code to handle this. But I didn't look too carefully, and didn't try to enable logging to see what really happens in practice.

* Python driver waits for schema agreement using only the control connection. It asks the control connection shard whether the local schema version and peers schema versions are the same, but all of this information comes from one node. This could lead to races, for example:
  
  * Driver has control connection to node A
  * Driver sends an `ALTER` query to node B
  * Driver asks node A for schema versions, they are all the same because information about the `ALTER` query hasn't propagated to node A yet.

If this is what the driver does, it indeed looks like a bug :-( The driver knows that it just sent a request to node B on a specific connection, and node B told it on that connection that the schema changed. It should ask node B on the same connection - not anyone else - whether it thinks the schema is agreed. If that is not what the code does, it's a bug :-(

  * Driver thinks that schema agreement has been reached, but node A still doesn't know about the schema change
  
  Rust driver does it differently, it queries the local schema version on all available connections and checks whether the local schema versions are the same on all shards. This makes such race conditions impossible.

This makes more sense and probably converges faster, but it sounds like more work for the client. Does the Rust driver always open connections to all nodes in the cluster, right on the first request?

@cvybhu (Contributor) commented Sep 26, 2023

It does it here: https://github.com/scylladb/python-driver/blob/d735957e3a7b7178f4fff26f6dbae588e58cf314/cassandra/cluster.py#L4728-L4735
The query that fetches schema versions from system.local and system.peers is performed on the same connection on which we received RESULT_KIND_SCHEMA_CHANGE. I don't think the issue you described can happen.

Thanks, you're right, that refutes my complaint. The race condition won't happen if we query the same connection that the ALTER was sent on. It looks like the Python driver does the right thing. I got confused because the code is in the ControlConnection class, but it does indeed send the requests on another connection.

@cvybhu (Contributor) commented Sep 26, 2023

I enabled debug logging for the Python driver, ran the test, and collected the logs.

diff --git a/conftest.py b/conftest.py
index a2373541..28d26443 100644
--- a/conftest.py
+++ b/conftest.py
@@ -194,7 +194,7 @@ def fixture_logging_setup(request, record_property, split_info):
         error_logger.addHandler(logging_plugin.log_file_handler)
         log_per_process_data['error_logger'] = error_logger
 
-    logging.getLogger("cassandra").setLevel(logging.INFO)
+    logging.getLogger("cassandra").setLevel(logging.DEBUG)
     logging.getLogger("boto3").setLevel(logging.INFO)
     logging.getLogger("botocore").setLevel(logging.INFO)
     logging.getLogger("s3transfer").setLevel(logging.INFO)

Here are the logs from a successful run:

debug-level-logs-schema-change.txt

The Python driver does wait for schema agreement (all the lines with "Waiting for schema agreement").

Looks okay to me.

@cvybhu (Contributor) commented Sep 26, 2023

We could modify the test to enable DEBUG logging in the Python driver and see if the schema agreement wait times out.

This sounds like a good way forward.
But is it possible to enable DEBUG logging for cassandra in this single dtest test case? I was able to do it globally, but I don't see how to do this for a single test.
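
One possible way to scope this to a single test, using only the standard logging module; the fixture name is hypothetical, not an existing dtest fixture, and this is just a sketch:

import logging
import pytest

@pytest.fixture
def cassandra_debug_logs():
    # Enable verbose driver logs only for tests that request this fixture.
    logger = logging.getLogger("cassandra")
    previous_level = logger.level
    logger.setLevel(logging.DEBUG)
    yield
    logger.setLevel(previous_level)  # restore the previous level afterwards

A test would then request it as an argument, e.g. def test_add_udt_to_another_data_types(self, cassandra_debug_logs).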

@cvybhu (Contributor) commented Sep 26, 2023

Alright, I hacked something together and opened a PR. The next time CI fails, we should be able to see logs about schema agreement.

@cvybhu (Contributor) commented Oct 4, 2023

Okay, we got the same failure again, e.g. in https://jenkins.scylladb.com/job/scylla-master/job/gating-dtest-release/5021/

There is no schema agreement timeout, but it looks like the test executes the query while the driver is still waiting for schema agreement:

<cassandra.cluster.ControlConnection object at 0x7f53d2700090>>, (), (('target_type', 'TABLE'), ('change_type', 'UPDATED'), ('keyspace', 'abcinfo'), ('table', 'entity')))
10:51:49,910 787     cassandra.connection           DEBUG    libevreactor.py     :370  | test_add_udt_to_another_data_types: Message pushed from server: <EventMessage(event_type='SCHEMA_CHANGE', event_args={'target_type': 'TABLE', 'change_type': 'UPDATED', 'keyspace': 'abcinfo', 'table': 'entity_by_unique_id'}, stream_id=-1, trace_id=None)>
10:51:49,910 787     cassandra.cluster              DEBUG    libevreactor.py     :370  | test_add_udt_to_another_data_types: Ignoring schedule_unique for already-scheduled task: (<bound method ControlConnection.refresh_schema of <cassandra.cluster.ControlConnection object at 0x7f53d2700090>>, (), (('target_type', 'TABLE'), ('change_type', 'UPDATED'), ('keyspace', 'abcinfo'), ('table', 'entity_by_unique_id')))
10:51:49,911 787     cassandra.connection           DEBUG    libevreactor.py     :370  | test_add_udt_to_another_data_types: Message pushed from server: <EventMessage(event_type='SCHEMA_CHANGE', event_args={'target_type': 'TABLE', 'change_type': 'UPDATED', 'keyspace': 'abcinfo', 'table': 'entity_by_type'}, stream_id=-1, trace_id=None)>
10:51:49,911 787     cassandra.cluster              DEBUG    libevreactor.py     :370  | test_add_udt_to_another_data_types: Ignoring schedule_unique for already-scheduled task: (<bound method ControlConnection.refresh_schema of <cassandra.cluster.ControlConnection object at 0x7f53d2700090>>, (), (('target_type', 'TABLE'), ('change_type', 'UPDATED'), ('keyspace', 'abcinfo'), ('table', 'entity_by_type')))
10:51:49,987 787     cassandra.cluster              DEBUG    thread.py           :58   | test_add_udt_to_another_data_types: Refreshing schema in response to schema change. {'target_type': 'TYPE', 'change_type': 'UPDATED', 'keyspace': 'abcinfo', 'type': 'user_refs'}
10:51:49,988 787     cassandra.cluster              DEBUG    thread.py           :58   | test_add_udt_to_another_data_types: [control connection] Waiting for schema agreement
10:51:49,989 787     cassandra.cluster              DEBUG    thread.py           :58   | test_add_udt_to_another_data_types: [control connection] Schemas mismatched, trying again
10:51:50,190 787     cassandra.cluster              DEBUG    thread.py           :58   | test_add_udt_to_another_data_types: [control connection] Schemas mismatched, trying again
10:51:50,391 787     cassandra.cluster              DEBUG    thread.py           :58   | test_add_udt_to_another_data_types: [control connection] Schemas mismatched, trying again
10:51:50,592 787     cassandra.cluster              DEBUG    thread.py           :58   | test_add_udt_to_another_data_types: [control connection] Schemas mismatched, trying again
10:51:50,793 787     cassandra.cluster              DEBUG    thread.py           :58   | test_add_udt_to_another_data_types: [control connection] Schemas mismatched, trying again
10:51:50,994 787     cassandra.cluster              DEBUG    thread.py           :58   | test_add_udt_to_another_data_types: [control connection] Schemas match
10:51:50,994 787     cassandra.cluster              DEBUG    thread.py           :58   | test_add_udt_to_another_data_types: [control connection] Waiting for schema agreement
10:51:50,995 787     cassandra.cluster              DEBUG    thread.py           :58   | test_add_udt_to_another_data_types: [control connection] Schemas match
10:51:50,996 787     cassandra.cluster              DEBUG    thread.py           :58   | test_add_udt_to_another_data_types: [control connection] Waiting for schema agreement
10:51:51,093 787     errors                         ERROR    conftest.py         :225  | test_add_udt_to_another_data_types: test failed: 
self = <user_types_test.TestUserTypes object at 0x7f53d9754050>

    def test_add_udt_to_another_data_types(self):
        logging.getLogger("cassandra").setLevel(logging.DEBUG)
        try:
>           self.internal_add_udt_to_another_data_types()

user_types_test.py:1008: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
user_types_test.py:1124: in internal_add_udt_to_another_data_types
    insert_into_entity(rows=10, altered_type=True)
user_types_test.py:1083: in insert_into_entity
    session.execute(stmt)
cassandra/cluster.py:2700: in cassandra.cluster.Session.execute
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

>   ???
E   cassandra.InvalidRequest: Error from server: code=2200 [Invalid query] message="Invalid tuple literal for entity_info: component 5 is not of type frozen<user_refs>"

There's "Waiting for schema agreement" and immediately after it the failure happens. At first glance it looks like the driver starts the schema agreement process, but the query doesn't wait for it to finish.

Here's the full log: dtest-gw2.log

@nyh (Contributor) commented Oct 4, 2023

There's "Waiting for schema agreement" and immediately after it the failure happens. At first glance it looks like the driver starts the schema agreement process, but the query doesn't wait for it to finish.

Good catch, this confirms my guess in #15530 (comment) that the problem is not the amount of wait (which is already 60 seconds), but the fact that the driver doesn't wait for it to finish.

So it sounds like a driver bug, but there's nothing smart I can say about the details. I wonder if this test failure can be reproduced locally (even if at low probability) - it would help debug it and, more importantly, test a fix.

@fruch (Contributor) commented Oct 5, 2023

@Lorak-mmk (Contributor) commented Oct 9, 2023

I've just managed to reproduce it on a 1-node cluster (Scylla master, compiled in debug mode), without using the Python driver, so I think it's a problem in Scylla: namely, after Scylla returns the result of a schema-altering query, it's possible that not all shards have updated the schema yet.
Here's the code: https://github.com/Lorak-mmk/udt_schema_change_reproducer - you run it with one argument, the IP of the node.
And here are the logs from a faulty execution:

Bound socket to: 0.0.0.0:12036
Created connection to shard 0
Bound socket to: 0.0.0.0:12037
Created connection to shard 1
Bound socket to: 0.0.0.0:12038
Created connection to shard 2
Bound socket to: 0.0.0.0:12039
Created connection to shard 3
Bound socket to: 0.0.0.0:12040
Created connection to shard 4
Bound socket to: 0.0.0.0:12041
Created connection to shard 5
Bound socket to: 0.0.0.0:12042
Created connection to shard 6
Bound socket to: 0.0.0.0:12043
Created connection to shard 7
Bound socket to: 0.0.0.0:12044
Created connection to shard 8
Bound socket to: 0.0.0.0:12045
Created connection to shard 9
Bound socket to: 0.0.0.0:12046
Created connection to shard 10
Bound socket to: 0.0.0.0:12047
Created connection to shard 11
Drop keyspace
SchemaChange(SchemaChange { event: KeyspaceChange { change_type: Dropped, keyspace_name: "ks" } })
Create keyspace
SchemaChange(SchemaChange { event: KeyspaceChange { change_type: Created, keyspace_name: "ks" } })
Control connection - use keyspace
SetKeyspace(SetKeyspace { keyspace_name: "ks" })
Create user_refs type
SchemaChange(SchemaChange { event: TypeChange { change_type: Created, keyspace_name: "ks", type_name: "user_refs" } })
Create obs_entity type
SchemaChange(SchemaChange { event: TypeChange { change_type: Created, keyspace_name: "ks", type_name: "obs_entity" } })
Create other_entity type
SchemaChange(SchemaChange { event: TypeChange { change_type: Created, keyspace_name: "ks", type_name: "other_entity" } })
Create entity table
SchemaChange(SchemaChange { event: TableChange { change_type: Created, keyspace_name: "ks", object_name: "entity" } })
Create index on entity(service)
SchemaChange(SchemaChange { event: TableChange { change_type: Updated, keyspace_name: "ks", object_name: "entity" } })
Create index on entity(type)
SchemaChange(SchemaChange { event: TableChange { change_type: Updated, keyspace_name: "ks", object_name: "entity" } })
Create entity_by_type materialized view
SchemaChange(SchemaChange { event: TableChange { change_type: Created, keyspace_name: "ks", object_name: "entity_by_type" } })
Create entity_by_unique_id materialized view
SchemaChange(SchemaChange { event: TableChange { change_type: Created, keyspace_name: "ks", object_name: "entity_by_unique_id" } })
Create entity_rel table
SchemaChange(SchemaChange { event: TableChange { change_type: Created, keyspace_name: "ks", object_name: "entity_rel" } })
Create index on entity_rel(dest_entity_id)
SchemaChange(SchemaChange { event: TableChange { change_type: Updated, keyspace_name: "ks", object_name: "entity_rel" } })
Create index on entity_rel(service)
SchemaChange(SchemaChange { event: TableChange { change_type: Updated, keyspace_name: "ks", object_name: "entity_rel" } })
Create index on entity_rel(rel_type)
SchemaChange(SchemaChange { event: TableChange { change_type: Updated, keyspace_name: "ks", object_name: "entity_rel" } })
Inserted on shard 5
Inserted on shard 1
Inserted on shard 11
Inserted on shard 6
Inserted on shard 0
Inserted on shard 8
Inserted on shard 7
Inserted on shard 3
Inserted on shard 2
Inserted on shard 4
Inserted on shard 9
Inserted on shard 10
Inserted on shard 5
Inserted on shard 0
Inserted on shard 11
Inserted on shard 1
Inserted on shard 6
Inserted on shard 3
Inserted on shard 8
Inserted on shard 2
Inserted on shard 4
Inserted on shard 7
Inserted on shard 10
Inserted on shard 9
Create priority_refs type
SchemaChange(SchemaChange { event: TypeChange { change_type: Created, keyspace_name: "ks", type_name: "priority_refs" } })
Alter obs_entity type
SchemaChange(SchemaChange { event: TypeChange { change_type: Updated, keyspace_name: "ks", type_name: "obs_entity" } })
Alter user_refs type
SchemaChange(SchemaChange { event: TypeChange { change_type: Updated, keyspace_name: "ks", type_name: "user_refs" } })
thread 'main' panicked at 'Wrong response! Error(Error { error: Invalid, reason: "Invalid tuple literal for entity_info: component 5 is not of type frozen<user_refs>" })', src/main.rs:98:18
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

I managed to reproduce it once, after a lot of tries, so it's not yet a very good reproducer; I'll try to improve it. The main difference from dtest is that after getting a response from a schema-altering query, I send inserts to all shards ASAP, in the hope of catching one that is not yet updated.

@kbr-scylla (Contributor):

Hmm, if you suspect it's the schema change code -- could it be 3d4398d?
I'm not sure how, but it was a significant change to how schema changes are done (at least in Raft mode).

@kbr-scylla (Contributor):

A simple way to revert the effect of 3d4398d without reverting the entire thing would be:

diff --git a/service/migration_manager.cc b/service/migration_manager.cc
index e289ff7ce6..5b7f491139 100644
--- a/service/migration_manager.cc
+++ b/service/migration_manager.cc
@@ -1008,7 +1008,7 @@ static void add_committed_by_group0_flag(std::vector<mutation>& schema, const gr
 
 // Returns a future on the local application of the schema
 future<> migration_manager::announce(std::vector<mutation> schema, group0_guard guard, std::string_view description) {
-    if (_feat.group0_schema_versioning) {
+    if (false) {
         schema.push_back(make_group0_schema_version_mutation(_storage_proxy.data_dictionary(), guard));
         add_committed_by_group0_flag(schema, guard);
     }

@kbr-scylla (Contributor):

Looking at the schema change code -- it looks like it's updating types on all shards before returning.

static future<> do_merge_schema(distributed<service::storage_proxy>& proxy, sharded<db::system_keyspace>& sys_ks, std::vector<mutation> mutations, bool do_flush, bool reload)
{
...
    auto types_to_drop = co_await merge_types(proxy, std::move(old_types), std::move(new_types));
static future<user_types_to_drop> merge_types(distributed<service::storage_proxy>& proxy, schema_result before, schema_result after)
{
    auto diff = diff_rows(before, after);

    // Create and update user types before any tables/views are created that potentially
    // use those types. Similarly, defer dropping until after tables/views that may use
    // some of these user types are dropped.

    co_await proxy.local().get_db().invoke_on_all([&] (replica::database& db) -> future<> {
        for (auto&& user_type : create_types(db, diff.created)) {
            db.find_keyspace(user_type->_keyspace).add_user_type(user_type);
            co_await db.get_notifier().create_user_type(user_type);
        }
        for (auto&& user_type : create_types(db, diff.altered)) {
            db.find_keyspace(user_type->_keyspace).add_user_type(user_type);
            co_await db.get_notifier().update_user_type(user_type);
        }
    });

And do_merge_schema must finish locally before Scylla returns from the schema change request.

@avikivity (Member):

Kudos for getting a reproducer. If it reproduces 1 in 100 runs (say), just run it 1000 times (with concurrency). We can then bisect to look for the bad commit (we can try to prove that a commit one week older than the first report is okay, to have a good commit).

@kbr-scylla (Contributor):

I applied the following diff to the reproducer:

diff --git a/src/main.rs b/src/main.rs
index 1c405de..a8ca9d9 100644
--- a/src/main.rs
+++ b/src/main.rs
@@ -181,6 +181,7 @@ async fn main() {
     let mut args = std::env::args();
     let _name = args.next().unwrap();
     let ip = args.next().unwrap();
+    for iter in 0..1000 {
     let mut ctrl_conn = setup_connection(&ip, None).await;
     let mut shards_conns: Vec<Connection> = Vec::new();
     for shard in 0..ctrl_conn.shard_count {
@@ -366,6 +367,8 @@ async fn main() {
             .await
     );
 
+    println!("Iter {}", iter);
+
     println!("Alter user_refs type");
     println!(
         "{:?}",
@@ -382,4 +385,5 @@ async fn main() {
 
     // assert_row_count(session=session, table_name="entity_rel", expected=rows_num,
     //                     consistency_level=ConsistencyLevel.QUORUM)
+    }
 }

It reproduced after 50 iterations.

@avikivity (Member):

That's good. So 1000 iterations (with concurrency) should be a reliable pass/fail test (provided we get at least 10 fails from 1000, not 1).

@avikivity (Member):

E.g., don't stop after getting a failure.
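
A sketch of a harness along those lines, usable for example as a `git bisect run` script. The reproducer binary path, node IP, and thresholds are placeholders following the numbers suggested above, and concurrency is omitted for brevity:

import subprocess
import sys

ITERATIONS = 1000
FAILURE_THRESHOLD = 10   # treat the commit as bad only if failures are clearly non-zero

failures = 0
for i in range(ITERATIONS):
    result = subprocess.run(["./udt_schema_change_reproducer", "127.0.0.1"])
    if result.returncode != 0:
        failures += 1    # keep going; don't stop at the first failure

print(f"{failures} failures out of {ITERATIONS} iterations")
sys.exit(1 if failures >= FAILURE_THRESHOLD else 0)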

@kbr-scylla (Contributor):

More details about the error can be obtained by modifying Scylla a bit:

@@ -528,6 +529,7 @@ tuple_constructor_test_assignment(const tuple_constructor& tc, data_dictionary::
         tuple_constructor_validate_assignable_to(tc, db, keyspace, schema_opt, receiver);
         return assignment_testable::test_result::WEAKLY_ASSIGNABLE;
     } catch (exceptions::invalid_request_exception& e) {
+        expr_logger.warn("tuple not assignable: {}", e);
         return assignment_testable::test_result::NOT_ASSIGNABLE;
     }
 }

Now Scylla prints a warning before throwing:

WARN  2023-10-10 14:39:08,581 [shard 16:stat] cql_expression - tuple not assignable: exceptions::invalid_request_exception (Invalid tuple literal for entity_info[5]: too many elements. Type frozen<user_refs> expects 5 but got 6)

@Lorak-mmk (Contributor):

  • The issue probably doesn't happen with --consistent-cluster-management=false - despite a lot of tries, I have not been able to reproduce it there.
  • It's not a matter of some single shard being out of date. I modified the reproducer to always print the results from all shards, and when the failure happens, it happens on all of them:
Bound socket to: 0.0.0.0:12024
Created connection to shard 0
Bound socket to: 0.0.0.0:12025
Created connection to shard 1
Bound socket to: 0.0.0.0:12026
Created connection to shard 2
Bound socket to: 0.0.0.0:12027
Created connection to shard 3
Bound socket to: 0.0.0.0:12028
Created connection to shard 4
Bound socket to: 0.0.0.0:12029
Created connection to shard 5
Bound socket to: 0.0.0.0:12030
Created connection to shard 6
Bound socket to: 0.0.0.0:12031
Created connection to shard 7
Bound socket to: 0.0.0.0:12032
Created connection to shard 8
Bound socket to: 0.0.0.0:12033
Created connection to shard 9
Bound socket to: 0.0.0.0:12034
Created connection to shard 10
Bound socket to: 0.0.0.0:12035
Created connection to shard 11
Drop keyspace
Ok(SchemaChange(SchemaChange { event: KeyspaceChange { change_type: Dropped, keyspace_name: "ks" } }))
Create keyspace
Ok(SchemaChange(SchemaChange { event: KeyspaceChange { change_type: Created, keyspace_name: "ks" } }))
Control connection - use keyspace
Ok(SetKeyspace(SetKeyspace { keyspace_name: "ks" }))
Create user_refs type
Ok(SchemaChange(SchemaChange { event: TypeChange { change_type: Created, keyspace_name: "ks", type_name: "user_refs" } }))
Create obs_entity type
Ok(SchemaChange(SchemaChange { event: TypeChange { change_type: Created, keyspace_name: "ks", type_name: "obs_entity" } }))
Create other_entity type
Ok(SchemaChange(SchemaChange { event: TypeChange { change_type: Created, keyspace_name: "ks", type_name: "other_entity" } }))
Create entity table
Ok(SchemaChange(SchemaChange { event: TableChange { change_type: Created, keyspace_name: "ks", object_name: "entity" } }))
Create index on entity(service)
Ok(SchemaChange(SchemaChange { event: TableChange { change_type: Updated, keyspace_name: "ks", object_name: "entity" } }))
Create index on entity(type)
Ok(SchemaChange(SchemaChange { event: TableChange { change_type: Updated, keyspace_name: "ks", object_name: "entity" } }))
Create entity_by_type materialized view
Ok(SchemaChange(SchemaChange { event: TableChange { change_type: Created, keyspace_name: "ks", object_name: "entity_by_type" } }))
Create entity_by_unique_id materialized view
Ok(SchemaChange(SchemaChange { event: TableChange { change_type: Created, keyspace_name: "ks", object_name: "entity_by_unique_id" } }))
Create entity_rel table
Ok(SchemaChange(SchemaChange { event: TableChange { change_type: Created, keyspace_name: "ks", object_name: "entity_rel" } }))
Create index on entity_rel(dest_entity_id)
Ok(SchemaChange(SchemaChange { event: TableChange { change_type: Updated, keyspace_name: "ks", object_name: "entity_rel" } }))
Create index on entity_rel(service)
Ok(SchemaChange(SchemaChange { event: TableChange { change_type: Updated, keyspace_name: "ks", object_name: "entity_rel" } }))
Create index on entity_rel(rel_type)
Ok(SchemaChange(SchemaChange { event: TableChange { change_type: Updated, keyspace_name: "ks", object_name: "entity_rel" } }))
Inserted on shard 5
Inserted on shard 8
Inserted on shard 1
Inserted on shard 2
Inserted on shard 4
Inserted on shard 6
Inserted on shard 10
Inserted on shard 7
Inserted on shard 9
Inserted on shard 11
Inserted on shard 0
Inserted on shard 3
Inserted on shard 8
Inserted on shard 5
Inserted on shard 0
Inserted on shard 11
Inserted on shard 3
Inserted on shard 9
Inserted on shard 4
Inserted on shard 6
Inserted on shard 2
Inserted on shard 7
Inserted on shard 10
Inserted on shard 1
Create priority_refs type
Ok(SchemaChange(SchemaChange { event: TypeChange { change_type: Created, keyspace_name: "ks", type_name: "priority_refs" } }))
Alter obs_entity type
Ok(SchemaChange(SchemaChange { event: TypeChange { change_type: Updated, keyspace_name: "ks", type_name: "obs_entity" } }))
Alter user_refs type
Ok(SchemaChange(SchemaChange { event: TypeChange { change_type: Updated, keyspace_name: "ks", type_name: "user_refs" } }))
FAILED on shard 3: Error { error: Invalid, reason: "Invalid tuple literal for entity_info: component 5 is not of type frozen<user_refs>" }
FAILED on shard 5: Error { error: Invalid, reason: "Invalid tuple literal for entity_info: component 5 is not of type frozen<user_refs>" }
FAILED on shard 6: Error { error: Invalid, reason: "Invalid tuple literal for entity_info: component 5 is not of type frozen<user_refs>" }
FAILED on shard 1: Error { error: Invalid, reason: "Invalid tuple literal for entity_info: component 5 is not of type frozen<user_refs>" }
FAILED on shard 2: Error { error: Invalid, reason: "Invalid tuple literal for entity_info: component 5 is not of type frozen<user_refs>" }
FAILED on shard 4: Error { error: Invalid, reason: "Invalid tuple literal for entity_info: component 5 is not of type frozen<user_refs>" }
FAILED on shard 8: Error { error: Invalid, reason: "Invalid tuple literal for entity_info: component 5 is not of type frozen<user_refs>" }
FAILED on shard 7: Error { error: Invalid, reason: "Invalid tuple literal for entity_info: component 5 is not of type frozen<user_refs>" }
FAILED on shard 9: Error { error: Invalid, reason: "Invalid tuple literal for entity_info: component 5 is not of type frozen<user_refs>" }
FAILED on shard 10: Error { error: Invalid, reason: "Invalid tuple literal for entity_info: component 5 is not of type frozen<user_refs>" }
FAILED on shard 11: Error { error: Invalid, reason: "Invalid tuple literal for entity_info: component 5 is not of type frozen<user_refs>" }
FAILED on shard 0: Error { error: Invalid, reason: "Invalid tuple literal for entity_info: component 5 is not of type frozen<user_refs>" }

@kbr-scylla (Contributor):

I modified the reproducer: after it gets this error from Scylla, it sleeps 100 ms and retries the insert. If it's a race, the second insert should succeed.

However, the second insert also fails. So it looks like the schema enters a broken state that is not transient?

@kbr-scylla (Contributor) commented Oct 10, 2023

Also, running 1000 iterations with this diff:
#15530 (comment)

Passed.

So the most likely culprit is 3d4398d :(
I'm going to send a patch to disable this. (By disabling the cluster feature.)

@cvybhu assigned kbr-scylla and unassigned cvybhu Oct 10, 2023
kbr-scylla added a commit to kbr-scylla/scylladb that referenced this issue Oct 10, 2023
It turns out that 3d4398d introduced a
regression. The dtest `test_add_udt_to_another_data_types` started
failing in a flaky way.  The test is adding a field to the user-defined
type `user_refs` and then inserting into a table that references this
type. The insert fails with the following error: `Invalid tuple literal
for entity_info: component 5 is not of type frozen<user_refs>`. But
Scylla apparently keeps using the old version of this type, without the
new field. And this doesn't seem a race in schema propagation across
shards in Scylla -- once the first insert fails, sleeping and retrying
the insert fails in the same way, so it seems that schema enters a
broken state that is not transient.

The feature wasn't released in any version of Scylla yet, so we can
safely disable it for now. When the feature is disabled, Scylla uses the
old way of determining schema versions (calculating hashes). I verified
using @Lorak-mmk's reproducer that disabling the feature solves the
problem.

Fixes scylladb#15530
@kbr-scylla (Contributor):

I described the root cause here: #15687 (comment)

@mykaul added this to the 5.4 milestone Oct 19, 2023
kbr-scylla added a commit to kbr-scylla/scylladb that referenced this issue Nov 21, 2023
…hema in Raft mode

This logic of "including base mutations" when sending a view schema
change was implemented in case a node receiving the view schema
mutations somehow missed earlier base mutations that the view mutations
may depend on.

First, in Raft mode it's not necessary to do this. The base mutations
are guaranteed to already be present on any node that applies the view
update, which is ordered after any base update that it may depend on in
the Raft log.

Second, including base mutations, due to the way it's implemented, may
cause problems. Suppose we have a base table with a view attached. When
we update the schema of this base table, the view's schema is also
updated. The base update generates mutations for base, obviously. But
the view update also generates mutations for base -- the "include base
mutations" logic.  It turns out that base mutations from base update
conflict with base mutations from the view update. This conflict is
currently benign. But it manifested and caused a hard-to-debug failure
when we implemented group 0 schema versioning. It had to be reverted.

As a prerequisite to get group 0 schema versioning merged back, we need
to deal with this conflict. One way is to not include these base
mutations as part of view update (in Raft mode), which is what this
commit does.

Refs:
scylladb#15331 (group 0 schema versioning PR)
scylladb#15530 (issue caused by the conflicting mutations)
scylladb#15687 (comment)
(in this comment I describe in detail how the conflict works)
kbr-scylla added a commit to kbr-scylla/scylladb that referenced this issue Nov 22, 2023
… view update

When a view schema is changed, the schema change command also includes
mutations for the corresponding base table; these mutations don't modify
the base schema but are included in case the receiver of view
mutations somehow didn't receive base mutations yet (this may in theory
happen outside Raft mode).

There are situations where the schema change command contains both
mutations that describe the current state of the base table -- included
by a view update, as explained above -- and mutations that want to
modify the base table. Such a situation arises, for example, when we
update a user-defined type which is referenced by both a view and its
corresponding base table. This triggers a schema change of the view,
which generates mutations to modify the view and includes mutations of
the current base schema, and at the same time it triggers a schema
change of the base, which generates mutations to modify the base.

These two sets of mutations are conflicting with each other. One set
wants to preserve the current state of the base table while the other
wants to modify it. And the two sets of mutations are generated using
the same timestamp, which means that conflict resolution between them is
made on a per-mutation-cell basis, comparing the values in each cell and
taking the "larger" one (meaning of "larger" depends on the type of each
cell).

Fortunately, this conflict is currently benign -- or at least there is
no known situation where it causes problems.

Unfortunately, it started causing problems when I attempted to implement
group 0 schema versioning (PR scylladb#15331), where instead of
calculating table versions as hashes of schema mutations, we would send
versions as part of schema change command. These versions would be
stored inside the `system_schema.scylla_tables` table, `version` column,
and sent as part of schema change mutations.

And then the conflict showed. One set of mutations wanted to preserve
the old value of `version` column while the other wanted to update it.
It turned out that sometimes the old `version` prevailed, because the
`version` column in `system_schema.scylla_tables` uses UUID-based
comparison (not timeuuid-based comparison). This manifested as issue
scylladb#15530.

To prevent this, the idea in this commit is simple: when generating
mutations for the base table as part of corresponding view update, do
not use the provided timestamp directly -- instead, decrement it by one.
This way, if the schema change command contains mutations that want to
modify the base table, these modifying mutations will win all conflicts
based on the timestamp alone (they are using the same provided
timestamp, but not decremented).

One could argue that the choice of this timestamp is anyway arbitrary.
The original purpose of including base mutations during view update was
to ensure that a node which somehow missed the base mutations, gets them
when applying the view. But in that case, the "most correct" solution
should have been to use the *original* base mutations -- i.e. the ones
that we have on disk -- instead of generating new mutations for the base
with a refreshed timestamp. The base mutations that we have on disk have
smaller timestamps already (since these mutations are from the past,
when the base was last modified or created), so the conflict would also
not happen in this case.

But that solution would require doing a disk read, and we can avoid the
read while still fixing the conflict by using an intermediate solution:
regenerating the mutations but with `timestamp - 1`.

Ref: scylladb#15530
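
A toy illustration (not Scylla code) of the conflict resolution described in the commit message above: with equal timestamps the "larger" cell value decides, so a lexically larger old version string can beat the new one, while writing the view-included base mutation at timestamp - 1 lets the real update win. The values below are made up:

def reconcile(cell_a, cell_b):
    # cell = (timestamp, value): the higher timestamp wins, ties are broken by value comparison
    if cell_a[0] != cell_b[0]:
        return max(cell_a, cell_b, key=lambda c: c[0])
    return max(cell_a, cell_b, key=lambda c: c[1])

old_version = "f0000000-..."   # value carried by the view's "include base" mutation
new_version = "0aaaaaaa-..."   # value the real base update wants to write
ts = 100

# Same timestamp: the old version can win purely on value comparison (the bug).
assert reconcile((ts, old_version), (ts, new_version)) == (ts, old_version)

# View-included base mutation written at ts - 1: the modifying mutation always wins.
assert reconcile((ts - 1, old_version), (ts, new_version)) == (ts, new_version)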
@mykaul removed the triage/master (Looking for assignee) label Nov 22, 2023
@mykaul modified the milestones: 5.4, 6.0 Nov 22, 2023
kbr-scylla added a commit to kbr-scylla/scylladb that referenced this issue Nov 24, 2023
… view update

(The commit message is identical to the Nov 22 commit above.)
tgrabiec pushed a commit that referenced this issue Nov 28, 2023
… view update

(The commit message is identical to the Nov 22 commit above, with the issue references written as #15331, #15530, and #15687.)

Closes #16139