test_interrupt_build_process dtest failed with schema_registry - Tried to build a global schema for view ks.t_by_v2 with an uninitialized base info #14011
Comments
Cc @eliransin |
I'm completely unfamiliar with this code, which @eliransin wrote, so I can't help triage it beyond saying that he wrote it. If Eliran asks me, I can go debug this issue (by reading the code) just as anyone else can. Unassigning it from myself until Eliran decides to assign it to me. |
This is an instance of: This means that we have some views with uninitialized base info. I will investigate further how this can happen.
This is something that we should investigate since views that don't have a base schema attached are considered to be |
Don't build in that... If scylladb/schema/schema_registry.cc Line 384 in c41f0eb
|
Obviously, this is a safeguard against incomplete view schemas ever being registered. I am not saying it is a great state to be in; however, it will only cause the view to be impossible to write to |
@cvybhu please have a look in parallel to me. This is a high priority issue. |
How come?
|
The resulting schema entry in the registry is missing the base info:
|
@bhalevy It is going to throw an exception, so how would we ever get to the line you are referring to? |
Sorry, you're right. |
This shows that the registry on shard 0 has a view with the base info but shard 24 doesn't. |
Let's see if the table even exists on shard 24 (
It does.
Let's check shard 0:
The view does exist on shard 0, so this is probably caused by a shard that still hasn't heard about the view. So the main problem is that we have at least one shard that hasn't heard about the view yet. |
So my first conclusion is that the internal error was doing its job, meaning, preventing corruption of the schema registry (or at least further corruption of the schema registry). |
Some shards know about the mv and some don't:
This is out of 32 shards. |
I don't understand why the line you mentioned will cause a segfault, but in any case, the whole point of on_internal_error() and why it exists is that in release mode, when it doesn't crash the entire Scylla, it throws an exception - so you can never get "a few lines down" in the code. |
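To illustrate the point about `on_internal_error()`, here is a minimal toy model in Python, not Scylla's C++ code. The names `InternalError`, `globalize_view` and the `abort_on_error` flag are assumptions made up for this sketch; only the error message text comes from the issue title.

```python
import os


class InternalError(RuntimeError):
    """Toy stand-in for the exception raised on an internal error."""


def on_internal_error(message, abort_on_error=False):
    # Hypothetical flag: abort the whole process if set, otherwise raise.
    if abort_on_error:
        os.abort()
    raise InternalError(message)


def globalize_view(view, abort_on_error=False):
    # Guard: refuse to proceed for a view that has no base info.
    if view.get("base_info") is None:
        on_internal_error(
            "Tried to build a global schema for view "
            f"{view['name']} with an uninitialized base info",
            abort_on_error,
        )
    # Reached only when the guard above did not fire.
    return {"name": view["name"], "base_info": view["base_info"]}
```

Either way the guard fires, control never reaches the code after it: the process is gone, or the exception unwinds out of the function.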
A guess: Imagine that we add a view to a pre-existing table. This sends a new version of both the base table schema (it now has an added view) and the view schema to all other nodes. It then soon starts to "build" the pre-existing data and send view updates. If for some reason one of the shards on some node hasn't yet heard about a specific view version in a mutation, then, as I think you said, it asks to "pull" this version. That gets it the view schema, which refers to the new version of the base table, but this shard doesn't have that version yet: in the pull it only received the view schema it asked for, not the base. I think it needs to pull the base schema as well? |
@eliransin - could the crash @ https://jenkins.scylladb.com/job/scylla-5.4/job/rolling-upgrade/job/rolling-upgrade-ami-test/3/ be due to this issue? |
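The guess above can be sketched as a toy model in Python. Everything here (the `Shard` class, the `with_base` option, the version strings) is hypothetical and only mirrors the scenario described in the comment; it is not how Scylla's schema pull is implemented.

```python
class Shard:
    def __init__(self):
        self.schemas = {}  # schema version -> schema description

    def pull(self, source, version, with_base=False):
        # Fetch exactly the requested version from the source node.
        schema = source[version]
        self.schemas[version] = schema
        # Optionally also fetch the base table version the view refers to.
        base_version = schema.get("base_version")
        if with_base and base_version is not None:
            self.schemas[base_version] = source[base_version]

    def base_info_for(self, view_version):
        # Returns None when the referenced base version is unknown here.
        base_version = self.schemas[view_version].get("base_version")
        return self.schemas.get(base_version)


# Schemas known to the node that created the view (made-up versions).
coordinator = {
    "base-v2": {"name": "ks.t"},
    "view-v1": {"name": "ks.t_by_v2", "base_version": "base-v2"},
}

lagging = Shard()
lagging.pull(coordinator, "view-v1")  # view only: base info stays unknown
fixed = Shard()
fixed.pull(coordinator, "view-v1", with_base=True)
```

In this model `lagging.base_info_for("view-v1")` is `None`, while `fixed.base_info_for("view-v1")` resolves to the base table schema.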
@mykaul @eliransin it seems so https://cloudius-jenkins-test.s3.amazonaws.com/b9cafe83-c77d-43f3-ae4a-b6185fff62f6/20231102_165854/db-cluster-b9cafe83.tar.gz
Decoded:
|
Yes, it's the same issue. |
Got the same error as #15235 but it was closed as a duplicate of this issue. So, report it here:
Installation details
Kernel Version: 5.15.0-1049-aws
Cluster size: 12 nodes (i3en.2xlarge)
Scylla Nodes used in this run:
OS / Image:
Test:
Logs and commands
Logs:
|
@eliransin if it's the same issue, please close as duplicate. |
…ews schemas that lacks base information' from Eliran Sinvani

This miniset addresses two potential conversions to `global_schema_ptr` of incomplete materialized view schemas. One of them was completely unnecessary and is also a "chicken and egg" problem, where in the sync schema procedure itself a view schema was converted to `global_schema_ptr` solely for the purposes of logging. This can create a "hiccup" in the materialized view updates if they are coming from a node with a different mv schema.

The reason why a synced schema can sometimes have no base info is the deactivation and reactivation of the schema inside the `schema_registry`, which doesn't restore the base information due to lack of context. When a schema is synced the problem becomes easy, since we can just use the latest base information from the database.

Fixes #14011
Closes #14861

* github.com:scylladb/scylladb:
  migration manager: fix incomplete mv schemas returned from get_schema_for_write
  migration_manager: do not globalize potentially incomplete schema

(cherry picked from commit 5752dc8)
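The second half of the commit message (fill in missing base info from the database once a schema is synced) can be pictured with a small Python sketch. This is a toy model under stated assumptions, not the code from the commit: `schema_for_write`, the dictionary layout and the `database` mapping are all hypothetical.

```python
def schema_for_write(entry, database):
    """Return a schema entry that is safe to write with.

    entry:    toy registry entry; views carry a "base_name" key
    database: toy mapping of table name -> latest base schema
    """
    # Not a view: nothing to restore.
    if entry.get("base_name") is None:
        return entry
    # View that already has its base info: use it as is.
    if entry.get("base_info") is not None:
        return entry
    # View that lost its base info (e.g. after being deactivated and
    # reactivated): take the latest base schema from the database.
    return {**entry, "base_info": database[entry["base_name"]]}


database = {"ks.t": {"name": "ks.t", "version": "latest"}}
incomplete_view = {"name": "ks.t_by_v2", "base_name": "ks.t", "base_info": None}
restored = schema_for_write(incomplete_view, database)
```

The sketch returns a new entry instead of mutating the one held by the registry, which keeps the example easy to reason about; the real fix may well do this differently.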
Backport to 5.4. queued. @eliransin does any other release need this? |
Checking... |
5.3 needs this too but it is not a clean backport, I will prepare a backport for this. |
We don't have a 5.3, do you mean 5.2? |
@eliransin ping backport |
@eliransin ping backport. |
@eliransin ping backport. |
Sorry for the long delay, 5.2 doesn't have this code so no backport needed. |
Seen in https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-daily-release/256/artifact/logs-full.release.016/1684893864334_materialized_views_test.py%3A%3ATestInterruptBuildProcess%3A%3Atest_interrupt_build_process_with_resharding_max_to_half_test/node2.log
Decoded:
After restart view building eventually succeeded: