test_add_udt_to_another_data_types: Invalid tuple literal for entity_info: component 5 is not of type frozen<user_refs> #15530
Let's determine first whether the issue is with the product or with the test. |
@nyh was there a recent change in this area on Scylla's side? |
@avikivity and @cvybhu have been changing related areas of the CQL expression code. But if this failure isn't consistent (i.e., it's flaky - is it?), perhaps a more likely possibility is that it's a schema change race - the request is made immediately after an ALTER TYPE and maybe it isn't aware of the change. It's not impossible that we had a regression somewhere - we could bisect. I can't think of any specific change that I suspect. |
@tgrabiec please confirm schema change sync issue |
It was flaky in the past and that issue was closed with no explicit fix (See #5336 (comment)), but it became more flaky again recently so I opened this issue rather than reopening #5336 |
To me it looks like some kind of schema version race. We have a 3-node cluster and the following schema:

```cql
CREATE TABLE IF NOT EXISTS entity
(
    entity_id text,
    other_entity_id text,
    entity_type text,
    type text,
    service text,
    entity_info frozen<obs_entity>,
    import_timestamp timestamp,
    import_timestamp_day timestamp,
    primary key(entity_id)
);

CREATE TYPE IF NOT EXISTS user_refs
(
    id text,
    alt_name text,
    firstname text,
    lastname text,
    email text
);

CREATE TYPE IF NOT EXISTS obs_entity
(
    entity_id text,
    version int,
    entity_type text,
    type text,
    service text,
    user_refs frozen<user_refs>
);
```

We run the following query:

```cql
INSERT INTO entity
(
    entity_id,
    entity_info,
    entity_type,
    import_timestamp,
    import_timestamp_day,
    other_entity_id,
    service,
    type
)
VALUES
(
    'text{i}',
    (
        'entity_id{i}',
        {i},
        'entity_type{i}',
        'type{i}',
        'service{i}',
        (
            'id{i}',
            'alt_name{i}',
            'firstname{i}',
            'lastname{i}',
            'email{i}'
        )
    ),
    'entity_type{i}',
    1234568979,
    45621313131,
    'other_entity_id{i}',
    'service{i}',
    'type{i}'
);
```

Then both UDTs are altered; a new field (marked with `+`) is added to each of them:

```cql
CREATE TYPE IF NOT EXISTS user_refs
(
    id text,
    alt_name text,
    firstname text,
    lastname text,
    email text,
+   priority_refs frozen<priority_refs>
);

CREATE TYPE IF NOT EXISTS obs_entity
(
    entity_id text,
    version int,
    entity_type text,
    type text,
    service text,
    user_refs frozen<user_refs>,
+   priority_refs frozen<priority_refs>
);
```

And then we run the query again, but we also include data for the newly added UDT fields (`{new_type}` stands for the extra values):

```cql
INSERT INTO entity
(
    entity_id,
    entity_info,
    entity_type,
    import_timestamp,
    import_timestamp_day,
    other_entity_id,
    service,
    type
)
VALUES
(
    'text{i}',
    (
        'entity_id{i}',
        {i},
        'entity_type{i}',
        'type{i}',
        'service{i}',
        (
            'id{i}',
            'alt_name{i}',
            'firstname{i}',
            'lastname{i}',
            'email{i}'
            {new_type}
        )
        {new_type}
    ),
    'entity_type{i}',
    1234568979,
    45621313131,
    'other_entity_id{i}',
    'service{i}',
    'type{i}'
);
```

The error suggests that we are running the second INSERT against a node that still has the old schema. As a test I removed the `{new_type}` values from the second INSERT, and it succeeded. It looks like the schema hasn't been updated to include the new fields. |
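To illustrate why the error message names component 5: `entity_info` is a tuple whose component at index 5 must be a `frozen<user_refs>` value, and a node still holding the pre-ALTER, 5-field definition of `user_refs` rejects the 6-value nested literal. A toy sketch (assumed names and simplified logic, not Scylla's validation code):

```python
# Toy model: a tuple literal is assignable to a UDT only if it has no more
# components than the type has fields (the stale type has only 5 fields).
def assignable(literal, field_count):
    return len(literal) <= field_count

stale_user_refs_fields = 5  # pre-ALTER user_refs: id, alt_name, firstname, lastname, email

# entity_info literal sent after the ALTER: component 5 (0-based) is the
# nested user_refs tuple, which now carries a sixth value.
entity_info_literal = (
    "entity_id1", 1, "entity_type1", "type1", "service1",
    ("id1", "alt_name1", "firstname1", "lastname1", "email1", ("prio1",)),
)

nested = entity_info_literal[5]
if not assignable(nested, stale_user_refs_fields):
    # Mirrors the error seen in the test.
    print("Invalid tuple literal for entity_info: "
          "component 5 is not of type frozen<user_refs>")
```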
Usually the driver should wait for schema agreement after any schema-changing statement. @avelanarius do you know if the Python driver automatically waits for schema agreement after `ALTER TYPE`? |
We could modify the test to enable DEBUG logging in the Python driver and see if the schema agreement wait times out. If it does though, I don't believe it is a matter of increasing the timeout. Why did the previous timeout suffice until now, but no longer does? It could be that for some reason, we never reach this agreement. We could of course increase the timeout and see if it helps. |
Who should do this modification? |
Whoever gets assigned to this issue and starts working on it, I guess. |
I had a look at the Python driver's code, and I noticed two things:
I suspect that the issue is caused by the Python driver's handling of schema agreement, but I'm not familiar with this codebase. I think it would be good to get an opinion from @Lorak-mmk or @avelanarius here. Maybe there are some known quirks in this area. Here's the Python driver's schema agreement code: https://github.com/scylladb/python-driver/blob/d12d2c13c8d646eb3e376c9e69fa014daba8fba6/cassandra/cluster.py#L4071 |
It does it here: https://github.com/scylladb/python-driver/blob/d735957e3a7b7178f4fff26f6dbae588e58cf314/cassandra/cluster.py#L4728-L4735 |
A query to fetch schema versions from |
Note: dtest has `max_schema_agreement_wait=60` in dtest_setup.py. This should be (famous last words...) more than enough. I suspect we don't wait for schema agreement at all... |
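The driver-side wait being discussed boils down to a polling loop along these lines (a hedged sketch; `fetch_schema_versions` is an assumed stand-in for however the driver collects each host's reported schema version, not the real API):

```python
import time

def wait_for_schema_agreement(fetch_schema_versions, timeout=60.0, interval=0.2,
                              clock=time.monotonic, sleep=time.sleep):
    """Poll until every host reports the same schema version, or give up."""
    deadline = clock() + timeout
    while clock() < deadline:
        versions = set(fetch_schema_versions())  # one version id per reachable host
        if len(versions) == 1:
            return True   # cluster converged on a single schema version
        sleep(interval)
    return False          # cluster did not converge within the timeout

# Simulated 3-node cluster where the third node lags behind for two polls.
polls = iter([
    ["v2", "v2", "v1"],
    ["v2", "v2", "v1"],
    ["v2", "v2", "v2"],
])
print(wait_for_schema_agreement(lambda: next(polls), timeout=5.0, interval=0.0))  # True
```

The interesting failure mode in this thread is not the loop timing out, but the query being issued before the loop finishes at all.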
The way this is done automatically is as follows:
As far as I can tell, Scylla does return SCHEMA_CHANGE on "ALTER TABLE" - see cql3/statements/alter_table_statement.cc.
If this is what the driver does, it indeed looks like a bug :-( The driver knows that it just sent node B, on a specific connection, a request which node B answered on that same connection with a schema change. It should ask node B on the same connection - not anyone else - whether it thinks the schema is agreed. If that is not what the code does, it's a bug :-(
This makes more sense and probably converges faster, but sounds like more work for the client. Does the Rust driver always open connections to all nodes in the cluster, right on the first request? |
Thanks, you're right, that refutes my complaints. The race condition won't happen if we query the same connection where the ALTER was sent to. It looks like the Python driver does the right thing. I got confused because the code was in the |
I enabled debug logging for the Python driver, ran the test and collected the logs.

```diff
diff --git a/conftest.py b/conftest.py
index a2373541..28d26443 100644
--- a/conftest.py
+++ b/conftest.py
@@ -194,7 +194,7 @@ def fixture_logging_setup(request, record_property, split_info):
         error_logger.addHandler(logging_plugin.log_file_handler)
         log_per_process_data['error_logger'] = error_logger

-    logging.getLogger("cassandra").setLevel(logging.INFO)
+    logging.getLogger("cassandra").setLevel(logging.DEBUG)
     logging.getLogger("boto3").setLevel(logging.INFO)
     logging.getLogger("botocore").setLevel(logging.INFO)
     logging.getLogger("s3transfer").setLevel(logging.INFO)
```

Here are the logs from a successful run: debug-level-logs-schema-change.txt

The Python driver does wait for schema agreement. Looks okay to me. |
This sounds like a good way forward. |
Alright I hacked something together and opened a PR. The next time the CI fails we should be able to see logs about schema agreement. |
Okay, we got the same failure, e.g. in https://jenkins.scylladb.com/job/scylla-master/job/gating-dtest-release/5021/. There is no schema agreement timeout, but it looks like the test executes the query while the driver is still waiting for schema agreement.
Here's the full log: dtest-gw2.log |
Good catch, this confirms my guess in #15530 (comment) that the problem is not the amount of wait (which is already 60 seconds), but the fact that the driver doesn't wait for it to finish. So it sounds like a driver bug, but there's nothing smart I can say about any details. I wonder if this test failure can be reproduced locally (even if at low probability) - it will help debug it and, more importantly, test a fix. |
Elastic data shows it happening almost on a daily basis. |
I've just managed to reproduce it on a 1-node cluster (Scylla master, compiled in debug), without using the Python driver, so I think it's a problem in Scylla, namely that after Scylla returns a result from a schema-altering query, it's possible that not all shards have updated the schema yet.
I managed to reproduce it once, after a lot of tries, so it's not yet a very good reproducer; I'll try to improve it. The main difference from dtest is that after getting a response from a schema-altering query, I send inserts to all shards ASAP, in the hope of catching one that is not yet updated. |
Hmm, if you suspect it's schema change code -- could it be 3d4398d? |
A simple way to revert the effect of 3d4398d without reverting the entire thing would be:

```diff
diff --git a/service/migration_manager.cc b/service/migration_manager.cc
index e289ff7ce6..5b7f491139 100644
--- a/service/migration_manager.cc
+++ b/service/migration_manager.cc
@@ -1008,7 +1008,7 @@ static void add_committed_by_group0_flag(std::vector<mutation>& schema, const gr
 // Returns a future on the local application of the schema
 future<> migration_manager::announce(std::vector<mutation> schema, group0_guard guard, std::string_view description) {
-    if (_feat.group0_schema_versioning) {
+    if (false) {
         schema.push_back(make_group0_schema_version_mutation(_storage_proxy.data_dictionary(), guard));
         add_committed_by_group0_flag(schema, guard);
     }
``` |
Looking at the schema change code -- it looks like it's updating types on all shards before returning:

```cpp
static future<> do_merge_schema(distributed<service::storage_proxy>& proxy, sharded<db::system_keyspace>& sys_ks, std::vector<mutation> mutations, bool do_flush, bool reload)
{
    ...
    auto types_to_drop = co_await merge_types(proxy, std::move(old_types), std::move(new_types));
```

```cpp
static future<user_types_to_drop> merge_types(distributed<service::storage_proxy>& proxy, schema_result before, schema_result after)
{
    auto diff = diff_rows(before, after);

    // Create and update user types before any tables/views are created that potentially
    // use those types. Similarly, defer dropping until after tables/views that may use
    // some of these user types are dropped.
    co_await proxy.local().get_db().invoke_on_all([&] (replica::database& db) -> future<> {
        for (auto&& user_type : create_types(db, diff.created)) {
            db.find_keyspace(user_type->_keyspace).add_user_type(user_type);
            co_await db.get_notifier().create_user_type(user_type);
        }
        for (auto&& user_type : create_types(db, diff.altered)) {
            db.find_keyspace(user_type->_keyspace).add_user_type(user_type);
            co_await db.get_notifier().update_user_type(user_type);
        }
    });
``` |
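For intuition, the `invoke_on_all` pattern in the snippet above can be sketched with asyncio (an assumed shard model, not Seastar's actual API): the merge only completes after the per-shard update has run on every shard, so a later read on any shard should already see the new type.

```python
import asyncio

class Shard:
    """Stand-in for a per-core database replica holding its own type registry."""
    def __init__(self):
        self.user_types = {}

async def invoke_on_all(shards, func):
    # Run the update on every shard and wait for all of them before returning,
    # mirroring what invoke_on_all guarantees in the C++ code above.
    await asyncio.gather(*(func(s) for s in shards))

async def main():
    shards = [Shard() for _ in range(4)]

    async def add_field(shard):
        await asyncio.sleep(0)  # yield, as a real cross-shard call would
        shard.user_types["user_refs"] = ["id", "alt_name", "firstname",
                                         "lastname", "email", "priority_refs"]

    await invoke_on_all(shards, add_field)
    # After the await, no shard can still hold the old definition.
    assert all("priority_refs" in s.user_types["user_refs"] for s in shards)
    print("all shards updated")

asyncio.run(main())
```

Which is why the eventual root cause turned out not to be a cross-shard propagation race.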
Kudos for getting a reproducer. If it reproduces 1 in 100 runs (say), just run it 1000 times (with concurrency). We can then use bisect to look for the bad commit (we can try to prove that a commit 1 week older than the first report is okay, to have a good commit). |
I applied the following diff to the reproducer:

```diff
diff --git a/src/main.rs b/src/main.rs
index 1c405de..a8ca9d9 100644
--- a/src/main.rs
+++ b/src/main.rs
@@ -181,6 +181,7 @@ async fn main() {
     let mut args = std::env::args();
     let _name = args.next().unwrap();
     let ip = args.next().unwrap();
+    for iter in 0..1000 {
     let mut ctrl_conn = setup_connection(&ip, None).await;
     let mut shards_conns: Vec<Connection> = Vec::new();
     for shard in 0..ctrl_conn.shard_count {
@@ -366,6 +367,8 @@ async fn main() {
             .await
     );

+    println!("Iter {}", iter);
+
     println!("Alter user_refs type");
     println!(
         "{:?}",
@@ -382,4 +385,5 @@ async fn main() {
     // assert_row_count(session=session, table_name="entity_rel", expected=rows_num,
     //                  consistency_level=ConsistencyLevel.QUORUM)
+    }
 }
```

It reproduced after 50 iterations. |
That's good. So 1000 iterations (with concurrency) should be a reliable pass/fail test (provided we get at least 10 fails from 1000, not 1). |
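The arithmetic behind "at least 10 fails from 1000, not 1" can be checked quickly (illustrative Python, not part of the test suite): if one iteration fails with probability p, then 1000 iterations catch at least one failure with probability 1 - (1 - p)^1000, and the expected failure count is 1000·p.

```python
# Probability that N independent runs catch at least one failure,
# given a per-run failure probability p.
def detection_probability(p, runs=1000):
    return 1 - (1 - p) ** runs

for p in (0.001, 0.01, 0.02):
    print(f"p={p}: expect {1000 * p:.0f} failures, "
          f"detect with probability {detection_probability(p):.4f}")
```

With ~10 expected failures (p = 0.01), detection is essentially certain; with only 1 expected failure (p = 0.001) a clean 1000-run pass still happens about a third of the time, which is why a single failure out of 1000 would make bisection unreliable.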
e.g. don't stop after getting a failure |
More details about the error can be obtained by modifying Scylla a bit:

```diff
@@ -528,6 +529,7 @@ tuple_constructor_test_assignment(const tuple_constructor& tc, data_dictionary::
         tuple_constructor_validate_assignable_to(tc, db, keyspace, schema_opt, receiver);
         return assignment_testable::test_result::WEAKLY_ASSIGNABLE;
     } catch (exceptions::invalid_request_exception& e) {
+        expr_logger.warn("tuple not assignable: {}", e);
         return assignment_testable::test_result::NOT_ASSIGNABLE;
     }
 }
```

Now Scylla prints a warning before throwing. |
I modified the reproducer: after it gets this error from Scylla, it sleeps 100ms and retries the insert. If it's a race, the second insert should succeed. However, the second insert also fails. So it looks like the schema enters a broken state that is not transient? |
Also, running 1000 iterations with this diff applied passed. So the most likely culprit is 3d4398d :( |
It turns out that 3d4398d introduced a regression. The dtest `test_add_udt_to_another_data_types` started failing in a flaky way. The test adds a field to the user-defined type `user_refs` and then inserts into a table that references this type. The insert fails with the following error: `Invalid tuple literal for entity_info: component 5 is not of type frozen<user_refs>`. Scylla apparently keeps using the old version of this type, without the new field. And this doesn't seem to be a race in schema propagation across shards in Scylla -- once the first insert fails, sleeping and retrying the insert fails in the same way, so it seems that the schema enters a broken state that is not transient.

The feature wasn't released in any version of Scylla yet, so we can safely disable it for now. When the feature is disabled, Scylla uses the old way of determining schema versions (calculating hashes). I verified using @Lorak-mmk's reproducer that disabling the feature solves the problem.

Fixes scylladb#15530 |
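The "old way of determining schema versions (calculating hashes)" mentioned above can be sketched roughly (illustrative Python; the real implementation digests the schema mutations, and the details here are assumptions): any change to the stored schema rows necessarily produces a new derived version, with no separately-stored version cell that could lose a reconciliation conflict.

```python
import hashlib
import uuid

def version_from_mutations(mutation_blobs):
    """Derive a schema version as a digest over all (serialized) schema rows."""
    digest = hashlib.md5()
    for blob in sorted(mutation_blobs):  # order-independent digest
        digest.update(blob)
    return uuid.UUID(bytes=digest.digest())

v_old = version_from_mutations([b"user_refs:id,alt_name,firstname,lastname,email"])
v_new = version_from_mutations([b"user_refs:id,alt_name,firstname,lastname,email,priority_refs"])

# Altering the type changes the rows, so the derived version changes too.
assert v_old != v_new
print(v_old, v_new)
```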
I described the root cause here: #15687 (comment) |
…hema in Raft mode

This logic of "including base mutations" when sending a view schema change was implemented in case a node receiving the view schema mutations somehow missed earlier base mutations that the view mutations may depend on.

First, in Raft mode it's not necessary to do this. The base mutations are guaranteed to already be present on any node that applies the view update, which is ordered in the Raft log after any base update that it may depend on.

Second, including base mutations, due to the way it's implemented, may cause problems. Suppose we have a base table with a view attached. When we update the schema of this base table, the view's schema is also updated. The base update generates mutations for the base, obviously. But the view update also generates mutations for the base -- the "include base mutations" logic. It turns out that base mutations from the base update conflict with base mutations from the view update.

This conflict is currently benign. But it manifested and caused a hard-to-debug failure when we implemented group 0 schema versioning, which had to be reverted. As a prerequisite to get group 0 schema versioning merged back, we need to deal with this conflict. One way is to not include these base mutations as part of the view update (in Raft mode), which is what this commit does.

Refs:
scylladb#15331 (group 0 schema versioning PR)
scylladb#15530 (issue caused by the conflicting mutations)
scylladb#15687 (comment) (in this comment I describe in detail how the conflict works)
… view update

When a view schema is changed, the schema change command also includes mutations for the corresponding base table; these mutations don't modify the base schema but are included in case the receiver of the view mutations somehow didn't receive the base mutations yet (this may in theory happen outside Raft mode).

There are situations where the schema change command contains both mutations that describe the current state of the base table -- included by a view update, as explained above -- and mutations that want to modify the base table. Such a situation arises, for example, when we update a user-defined type which is referenced by both a view and its corresponding base table. This triggers a schema change of the view, which generates mutations to modify the view and includes mutations of the current base schema, and at the same time it triggers a schema change of the base, which generates mutations to modify the base.

These two sets of mutations conflict with each other. One set wants to preserve the current state of the base table while the other wants to modify it. And the two sets of mutations are generated using the same timestamp, which means that conflict resolution between them is made on a per-mutation-cell basis, comparing the values in each cell and taking the "larger" one (the meaning of "larger" depends on the type of each cell).

Fortunately, this conflict is currently benign -- or at least there is no known situation where it causes problems. Unfortunately, it started causing problems when I attempted to implement group 0 schema versioning (PR #15331), where instead of calculating table versions as hashes of schema mutations, we would send versions as part of the schema change command. These versions would be stored inside the `system_schema.scylla_tables` table, `version` column, and sent as part of schema change mutations. And then the conflict showed up. One set of mutations wanted to preserve the old value of the `version` column while the other wanted to update it. It turned out that sometimes the old `version` prevailed, because the `version` column in `system_schema.scylla_tables` uses UUID-based comparison (not timeuuid-based comparison). This manifested as issue #15530.

To prevent this, the idea in this commit is simple: when generating mutations for the base table as part of the corresponding view update, do not use the provided timestamp directly -- instead, decrement it by one. This way, if the schema change command contains mutations that want to modify the base table, these modifying mutations will win all conflicts based on the timestamp alone (they are using the same provided timestamp, but not decremented).

One could argue that the choice of this timestamp is arbitrary anyway. The original purpose of including base mutations during a view update was to ensure that a node which somehow missed the base mutations gets them when applying the view. But in that case, the "most correct" solution would have been to use the *original* base mutations -- i.e. the ones that we have on disk -- instead of generating new mutations for the base with a refreshed timestamp. The base mutations that we have on disk already have smaller timestamps (since these mutations are from the past, when the base was last modified or created), so the conflict would also not happen in this case. But that solution would require doing a disk read, and we can avoid the read while still fixing the conflict by using an intermediate solution: regenerating the mutations but with `timestamp - 1`.

Ref: #15530

Closes #16139
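The timestamp-tie behavior at the heart of this bug can be illustrated with a toy model (assumed, heavily simplified Python; not Scylla's actual cell-reconciliation code): when two cells carry the same timestamp, the winner is chosen by comparing the values themselves, so a plain value comparison can pick the stale schema version over the new one. Timestamps only decide the winner when they differ, which is exactly why writing the view-included base mutations with `timestamp - 1` makes the modifying mutation win deterministically.

```python
def reconcile(cell_a, cell_b):
    """Pick the winning (timestamp, value) cell, last-write-wins style."""
    if cell_a[0] != cell_b[0]:
        return max(cell_a, cell_b)                      # higher timestamp wins outright
    return max(cell_a, cell_b, key=lambda c: c[1])      # tie: "larger" value wins

old_version = (100, "f0e1d2c3-old")  # kept by the view's included base mutations
new_version = (100, "a1b2c3d4-new")  # written by the actual base-table update

# Same timestamp: the comparison falls through to the values, and the
# lexically larger -- here the *stale* -- version wins.
assert reconcile(old_version, new_version) == old_version

# With the fix, the included base mutations use timestamp - 1, so the
# modifying mutation wins on the timestamp alone.
assert reconcile((99, "f0e1d2c3-old"), new_version) == new_version
print("tie broken by value comparison; decremented timestamp restores time order")
```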
Although the symptom looks identical to #5336, I'm opening a new issue since it seems like we have a recent regression.
@kbr-scylla wrote:
@Deexie hit this in #15506
I just saw this in #15480 (comment)