During add_drop_column multiple crash due to on_internal_error() at schema::column_at() on all nodes #14110
Comments
@kostja I don't know if it's related to your team, but it would be helpful if you could provide an initial analysis. |
schema::column_at() can fail like this if memory is corrupted. We've seen this before with views whose definition was corrupted by DDL. This seems to be a similar case, since the add_drop_column nemesis does exactly that. I see that consistent_cluster_management: true is set in the yaml file, so Raft is on. This seems to be a regression in the current master; it needs to be bisected to the specific commit. |
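As a rough illustration of what bisecting to a specific commit involves, here is a minimal sketch of the binary search that tools such as `git bisect` perform. The commit labels (`c0` to `c7`) and the `is_bad` predicate are hypothetical stand-ins; in practice the predicate would be a test run (for example the longevity job) against a build of each commit. Python is used only for brevity.

```python
from typing import Callable, Sequence


def first_bad_commit(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the first bad commit.

    Assumes `commits` is ordered oldest to newest, commits[0] is known
    good, commits[-1] is known bad, and every commit after the first bad
    one is also bad.
    """
    lo, hi = 0, len(commits) - 1  # invariant: commits[lo] is good, commits[hi] is bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid  # the regression is at mid or earlier
        else:
            lo = mid  # the regression is after mid
    return commits[hi]


# Hypothetical history of 8 commits where the regression lands in "c5".
commits = [f"c{i}" for i in range(8)]
culprit = first_bad_commit(commits, lambda c: int(c[1:]) >= 5)
print(culprit)
```

Each step halves the candidate range, so a range of about 50 merges like the one listed in this thread needs roughly six test runs.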
Reproduced this week as well.
Installation details
Kernel Version: 5.15.0-1037-aws
Cluster size: 6 nodes (i3.4xlarge)
Scylla Nodes used in this run:
OS / Image: `` (aws: undefined_region)
Test:
Logs and commands
Logs:
|
$ git log --oneline 5fa459b..e4d6ed7 | grep -v test | grep -v docs | grep Merge
e4d6ed7a70 Merge 'Coroutinize utils::verify_owner_and_mode()' from Pavel Emelyanov
5a14c3311a Merge 'Break S3 upload 50Gb file limit' from Pavel Emelyanov
1339ae141a Merge 'Small improvements after pending_ranges, endpoints_for_reading -> erm PR' from Gusev Petr
51e3b9321b Merge ' mvcc: make schema upgrades gentle' from Michał Chojnowski
7cdee303cf Merge 'ks_prop_defs: disallow empty replication factor string in NTS' from Jan Ciołek
eb457b6104 Merge 'fixed broken links, added community forum link, university link, spelling and other mistakes' from Guy Shtub
57758ec3e1 Merge 'Put streaming sched group onto stream manager' from Pavel Emelyanov
2526b232f1 Merge 'Remove explicit default_priority_class() usage from sstable aux methods' from Pavel Emelyanov
3956e01640 Merge 'Clean index_reader API' from Pavel Emelyanov
88fd7f7111 Merge 'Docs: add feature store tutorial' from Attila Tóth
1c0e8c25ca Merge 'multishard_mutation_query: make reader_context::lookup_readers() exception safe' from Botond Dénes
a7c2c9f92b Merge ' message: match unknown tenants to the default tenant' from Botond Dénes
809ddd7f79 Merge 'Move pending_ranges and endpoints_for_reading from token_metadata to erm' from Gusev Petr
9d4bca26cc Merge 'raft topology: implement `check_and_repair_cdc_streams` API' from Kamil Braun
93e4671c83 Merge 'doc: add a cloud instance recommendations page' from Anna Stuchlik
e106f6714d Merge branch 'scylladb:master' into anna-cloud-recommendation-pages
3b424e391b Merge 'perform_cleanup: wait until all candidates are cleaned up' from Benny Halevy
c2aee26278 Merge 'Keep sstables garbage collection in sstable_directory' from Pavel Emelyanov
13df85ea11 Merge 'Cut feature_service -> system_keyspace dependency' from Pavel Emelyanov
5216dcb1b3 Merge 'db/system_keyspace: remove the dependency on storage_proxy' from Botond Dénes
88a2421961 Merge 'Generalize global table pointer' from Pavel Emelyanov
52e4edfd5e Merge 'cql: update permissions when creating/altering a function/keyspace' from Wojciech Mitros
d2d53fc1db Merge 'Do not yield while traversing the gossiper endpoint state map' from Benny Halevy
5a8e2153a0 Merge 'Fix heart_beat_state::force_highest_possible_version_unsafe' from Benny Halevy
407b40c430 s3/client: Merge do_flush() with upload_part()
0cff0ffa08 Merge 'alternator,config: make alternator_timeout_in_ms live-updateable' from Kefu Chai
1b04fc1425 Merge 'Use member initializer list for trace_state and related helper classes' from Pavel Emelyanov
20ff122a84 Merge 'Delete S3 sstables without the help of deletion log' from Pavel Emelyanov
31e820e5a1 Merge 'Allow tombstone GC in compaction to be disabled on user request' from Raphael "Raph" Carvalho
a91e83fad6 Merge "issue raft read barrier before pulling schema" from Gleb
97694d26c4 Merge 'reader_permit: minor improvements to resource consume/release safety' from Botond Dénes
5d6f31df8e Merge 'Coroutinize sstable::read_toc()' from Pavel Emelyanov
0a78995e2b Merge 'Share s3 clients between sstables' from Pavel Emelyanov
3d75158fda Merge 'Allow no owned token ranges in cleanup compaction' from Benny Halevy
f1cad230bb Merge 'cql: enable setting permissions on resources with quoted UDT names' from Wojciech Mitros
e57252092c Merge 'cql3: result_set, selector: change value type to managed_bytes_opt' from Avi Kivity
bb62038119 Merge 'Scrub compaction task' from Aleksandra Martyniuk
1a3545b13d Merge 'data_dictionary: define helpers in options and define == operator only' from Kefu Chai
e252dbcfb8 Merge ' readers,mutation: move mutation_fragment_stream_validator to mutation/' from Botond Dénes
7d9ab44e81 Merge 'token_metadata: read remapping for write_both_read_new' from Gusev Petr
550aa01242 Merge 'Restore raft::internal::tagged_uint64 type' from Benny Halevy
8c6229d229 Merge 'sstable: encode value using UUID' from Kefu Chai
996f717dfc Merge 'cql3/prepare_expr: force token() receiver name to be partition key token' from Jan Ciołek
287ccce1cc Merge 'sstables: extract storage out ' from Kefu Chai
41cac23aa4 Merge 'raft: verify RPC destination ID' from Mikołaj Grzebieluch
20f620feb9 Merge 'replica, sstable: replace generation_type::value() with generation_type::as_int()' from Kefu Chai
5f37d43ee6 Merge 'compaction: validate: validate the index too' from Botond Dénes
198738f2b1 Merge 'build: compile wasm udfs automatically' from Wojciech Mitros
ab5fd0f750 Merge 's3: Provide timestamps in the s3 file implementation' from Raphael "Raph" Carvalho |
@gusev-p any updates? it's an urgent regression (P1) ^^ |
Yes, I've begun looking into it, but got distracted. I'll come back to this today/tomorrow. |
`mvcc: make schema upgrades gentle` is the top suspect. @Mark-Gurevich can we check this? @michoecho FYI |
@gusev-p I'm sorry, but to be honest I have no idea how to check that. |
Yes, it is definitely the top suspect here, and looking at the backtraces, I think I know what the bug is. I'm going to take ownership of this issue, if that's okay. |
Run the Jenkins job with two commits, |
@Mark-Gurevich But since I think I know what the bug is, don't bother with the above now. |
Removing 'master/triage' - Dev picked this up. |
Update: I already have a reproducer and a fix. I'll submit them tomorrow. |
Since `mvcc: make schema upgrades gentle` (51e3b93), rows pointed to by the cursor can have different (older) schema than the schema of the cursor's snapshot. However, one place in the code wasn't updated accordingly, causing a row to be processed with the wrong schema in the right circumstances. This passed through unit testing because it requires a digest-computing cache read after a schema change, and no test exercised this. Fixes scylladb#14110
…hash' from Michał Chojnowski

Since `mvcc: make schema upgrades gentle` (51e3b93), rows pointed to by the cursor can have different (older) schema than the schema of the cursor's snapshot. However, one place in the code wasn't updated accordingly, causing a row to be processed with the wrong schema in the right circumstances.

This passed through unit testing because it requires a digest-computing cache read after a schema change, and no test exercised this.

This series fixes the bug and adds a unit test which reproduces the issue.

Fixes #14110
Closes #14305

* github.com:scylladb/scylladb:
  test: boost/row_cache_test: add a reproducer for #14110
  cache_flat_mutation_reader: use the correct schema in prepare_hash
  mutation: mutation_cleaner: add pause()
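To make the failure mode concrete, here is a toy model of a row being processed with the wrong schema. It is not ScyllaDB code: the `Schema`, `Row` and `digest_row` names are hypothetical, and Python is used only for brevity. The sketch assumes a row stores its cells by column index, so an index that is valid in the row's own (older) schema may not exist in the snapshot's newer schema after a column is dropped.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Schema:
    version: int
    columns: Tuple[str, ...]

    def column_at(self, idx: int) -> str:
        # Stand-in for the internal error raised on an out-of-range column id.
        if idx >= len(self.columns):
            raise RuntimeError(
                f"column_at: index {idx} out of range in schema v{self.version}"
            )
        return self.columns[idx]


@dataclass
class Row:
    schema: Schema           # schema the row was written with
    cells: Dict[int, bytes]  # column index -> value


def digest_row(row: Row, schema: Schema) -> List[Tuple[str, bytes]]:
    """Resolve each cell's column through `schema` (a stand-in for hashing)."""
    return [(schema.column_at(i), v) for i, v in sorted(row.cells.items())]


old_schema = Schema(1, ("a", "b", "c"))
new_schema = Schema(2, ("a", "b"))            # column "c" was dropped
row = Row(old_schema, {0: b"x", 2: b"z"})     # row not yet upgraded

print(digest_row(row, row.schema))            # correct: use the row's own schema
try:
    digest_row(row, new_schema)               # wrong: use the snapshot's schema
except RuntimeError as e:
    print("error:", e)
```

In this model the fix corresponds to always resolving cells through the schema the row actually carries, rather than the schema of the reader's snapshot.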
Removing |
Issue description
This coredump happened in test longevity-cdc-100gb-4h-test#392
At 2023-05-31 11:25:22 nemesis "add_drop_column" started.
Then coredumps happened on all nodes at the same time.
One of the coredumps from node-1.
Error:
Coredump:
Decoded:
Installation details
Kernel Version: 5.15.0-1036-aws
Scylla version (or git commit hash):
5.4.0~dev-20230527.e4d6ed7a7048
with build-id 6dc67925bbb00bf45be5af823788c65688817efb
Cluster size: 6 nodes (i3.4xlarge)
Scylla Nodes used in this run:
OS / Image: `` (aws: undefined_region)
Test:
longevity-cdc-100gb-4h-test
Test id:
4513b525-d6b3-42af-945e-7175d3d3abbb
Test name:
scylla-master/longevity/longevity-cdc-100gb-4h-test
Test config file(s):
Logs and commands
$ hydra investigate show-monitor 4513b525-d6b3-42af-945e-7175d3d3abbb
$ hydra investigate show-logs 4513b525-d6b3-42af-945e-7175d3d3abbb
Logs:
Jenkins job URL
Argus