scylla-sstable: crash due to unclosed reader in tools/schema_loader.cc #16519
Comments
Mystery solved. scylladb/readers/mutation_reader.cc Lines 89 to 91 in 9c7e5f6
Opened #16520 for the
The reader used to read the sstables was not closed. This could sometimes trigger an abort(), because the reader was destroyed without being closed first.
Why only sometimes? This is due to two factors:
* read_mutation_from_flat_mutation_reader(), the method used to extract a mutation from the reader, uses consume(), which does not trigger `set_close_is_required()` (scylladb#16520). Because of this, the top-level combined reader did not complain when destroyed without close.
* The combined reader closes underlying readers that have no more data for the current range. If the circumstances are just right, all underlying readers are closed before the combined reader is destroyed. It looks like this is what happens most of the time.
This bug was discovered in SCT testing. After fixing scylladb#16520, all invocations of `scylla-sstable` that use this code would trigger the abort without this patch, so no further testing is required.
Fixes: scylladb#16519
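The two factors in the commit message can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not ScyllaDB's reader code: `toy_reader`, `toy_combined_reader`, `fill_buffer()` and `unclosed_destructions()` are hypothetical names, and the destructor counts violations instead of calling abort() so that the behaviour can be observed.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for a reader that must be closed before destruction.
struct toy_reader {
    int* violations;             // counts "destroyed without close" events
    bool close_required = false; // set once a tracked read path was used
    bool closed = false;

    explicit toy_reader(int* v) : violations(v) {}

    // Models a read path that marks the reader as requiring close().
    void fill_buffer() { close_required = true; }

    // Models the consume() path from the commit message: it reads data
    // but never marks the reader as requiring close().
    void consume() {}

    void close() { closed = true; }

    ~toy_reader() {
        // The real code would abort here; the model only counts.
        if (close_required && !closed) { ++*violations; }
    }
};

// Hypothetical combined reader: one top-level reader over several children.
struct toy_combined_reader {
    toy_reader top;
    std::vector<std::unique_ptr<toy_reader>> children;

    toy_combined_reader(int* v, std::size_t n) : top(v) {
        for (std::size_t i = 0; i < n; ++i) {
            children.push_back(std::make_unique<toy_reader>(v));
        }
    }

    // The top level is read through consume(); children are read through
    // the tracked path. The first `exhausted` children ran out of data
    // and are closed eagerly.
    void consume(std::size_t exhausted) {
        assert(exhausted <= children.size());
        top.consume();
        for (auto& c : children) { c->fill_buffer(); }
        for (std::size_t i = 0; i < exhausted; ++i) { children[i]->close(); }
    }

    void close() {
        for (auto& c : children) { c->close(); }
        top.close();
    }
};

// Returns how many readers were destroyed while still requiring close().
int unclosed_destructions(std::size_t children, std::size_t exhausted,
                          bool close_before_destroy) {
    int violations = 0;
    {
        toy_combined_reader reader(&violations, children);
        reader.consume(exhausted);
        if (close_before_destroy) { reader.close(); }
    } // reader and its children are destroyed here
    return violations;
}
```

In this model, `unclosed_destructions(3, 3, false)` returns 0: every child happened to be exhausted and the top-level reader never noticed it needed closing, so the missing close() goes undetected. With only one exhausted child, `unclosed_destructions(3, 1, false)` returns 2, which corresponds to the occasional abort.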
@avikivity - please assist in the backport.
Cherry-picked commit with the same message as above and these trailers: Fixes: #16519 Closes #16521 (cherry picked from commit da03334)
Backported to 5.2, 5.4.
Issue description
Not a regression (the code has been like this since the introduction of automatic schema discovery and loading, present in 5.2+).
Crash happens here:
scylladb/tools/schema_loader.cc
Lines 441 to 443 in 9c7e5f6
This seems like a genuine bug: no one closes that reader. It should be crashing all the time, but somehow it doesn't.
I cannot reproduce this crash locally, and it seems like it doesn't happen all the time in SCT either. Not yet sure why.
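One general way to make sure a reader is closed on every exit path, including exceptions, is a scope guard. The following is a minimal sketch and not the actual patch: `demo_reader`, `close_guard` and `read_with_guard()` are hypothetical names, and close() is synchronous here, which is a simplification.

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical reader that records what happens to it in a log.
struct demo_reader {
    std::vector<std::string>* log;
    bool closed = false;

    explicit demo_reader(std::vector<std::string>* l) : log(l) {}

    void close() {
        closed = true;
        log->push_back("close");
    }

    ~demo_reader() {
        log->push_back(closed ? "destroy-ok" : "destroy-unclosed");
    }
};

// Calls close() on the guarded object when the guard leaves scope.
template <typename T>
class close_guard {
    T& _obj;
public:
    explicit close_guard(T& obj) : _obj(obj) {}
    close_guard(const close_guard&) = delete;
    close_guard& operator=(const close_guard&) = delete;
    ~close_guard() { _obj.close(); }
};

// Reads from a guarded reader and returns the recorded sequence of events.
std::vector<std::string> read_with_guard(bool fail) {
    std::vector<std::string> log;
    try {
        demo_reader reader(&log);
        // Declared after the reader, so it is destroyed (and closes the
        // reader) before the reader's own destructor runs.
        close_guard<demo_reader> guard(reader);
        if (fail) {
            throw std::runtime_error("read failed");
        }
        log.push_back("read");
    } catch (const std::runtime_error&) {
        log.push_back("error");
    }
    return log;
}
```

Both the normal path and the failing path end with the reader closed before it is destroyed, so "destroy-unclosed" is never recorded.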
Installation details
Kernel Version: 5.15.0-1051-aws
Scylla version (or git commit hash):
2024.1.0~rc2-20231217.f57117d9cfe3
with build-id 3a4d2dfe8ef4eef5454badb34d1710a5f36a859c
Cluster size: 6 nodes (i4i.4xlarge)
Scylla Nodes used in this run:
OS / Image:
ami-02a2eda4799995de1
(aws: undefined_region)
Test:
longevity-alternator-3h-test
Test id:
d3ef5cec-c5d4-4857-a1dc-5893417574e7
Test name:
enterprise-2024.1/longevity/longevity-alternator-3h-test
Test config file(s):
Logs and commands
$ hydra investigate show-monitor d3ef5cec-c5d4-4857-a1dc-5893417574e7
$ hydra investigate show-logs d3ef5cec-c5d4-4857-a1dc-5893417574e7
Logs:
Jenkins job URL
Argus