Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

scylla-sstable: crash due to unclosed reader in tools/schema_loader.cc #16519

Closed
denesb opened this issue Dec 22, 2023 · 4 comments
Closed

scylla-sstable: crash due to unclosed reader in tools/schema_loader.cc #16519

denesb opened this issue Dec 22, 2023 · 4 comments
Assignees
Milestone

Comments

@denesb
Copy link
Contributor

denesb commented Dec 22, 2023

Issue description

Not a regression (code has been like this since introduction of automatic schema discovery and load - present in 5.2+).

Crash happens here:

auto reader = make_combined_reader(schema_table_schema, permit, std::move(readers));
auto mut_opt = read_mutation_from_flat_mutation_reader(reader).get();

This seems like a genuine bug, no one closes that reader. This should be crashing all the time, but somehow it doesn't.
I cannot reproduce this crash locally, and it seems like it doesn't happen all the time in SCT either. Not yet sure why.

Installation details

Kernel Version: 5.15.0-1051-aws
Scylla version (or git commit hash): 2024.1.0~rc2-20231217.f57117d9cfe3 with build-id 3a4d2dfe8ef4eef5454badb34d1710a5f36a859c

Cluster size: 6 nodes (i4i.4xlarge)

Scylla Nodes used in this run:

  • alternator-3h-2024-1-db-node-d3ef5cec-7 (18.234.171.30 | 10.12.8.13) (shards: -1)
  • alternator-3h-2024-1-db-node-d3ef5cec-6 (54.226.242.5 | 10.12.10.50) (shards: 14)
  • alternator-3h-2024-1-db-node-d3ef5cec-5 (35.172.219.253 | 10.12.8.64) (shards: 14)
  • alternator-3h-2024-1-db-node-d3ef5cec-4 (3.80.141.243 | 10.12.9.211) (shards: 14)
  • alternator-3h-2024-1-db-node-d3ef5cec-3 (34.224.174.172 | 10.12.9.148) (shards: 14)
  • alternator-3h-2024-1-db-node-d3ef5cec-2 (54.159.81.125 | 10.12.11.208) (shards: 14)
  • alternator-3h-2024-1-db-node-d3ef5cec-1 (34.226.202.252 | 10.12.9.39) (shards: 14)

OS / Image: ami-02a2eda4799995de1 (aws: undefined_region)

Test: longevity-alternator-3h-test
Test id: d3ef5cec-c5d4-4857-a1dc-5893417574e7
Test name: enterprise-2024.1/longevity/longevity-alternator-3h-test
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor d3ef5cec-c5d4-4857-a1dc-5893417574e7
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs d3ef5cec-c5d4-4857-a1dc-5893417574e7

Logs:

Jenkins job URL
Argus

@denesb
Copy link
Contributor Author

denesb commented Dec 22, 2023

Mystery solved.

future<mutation_opt> read_mutation_from_flat_mutation_reader(flat_mutation_reader_v2& r) {
return r.consume(mutation_rebuilder_v2(r.schema()));
}

consume() operates on the _impl directly, so it will never trigger set_close_required(), and the top-level reader thinks it doesn't have to be closed.
The combined reader does use top-level flat_mutation_reader_v2 methods, so it will trigger set_close_required() on the underlying readers. But the combined reader also closes readers who have no more data for the current range. So sometimes, with just the right data-shape, some readers will be left unclosed and will trigger the crash.

@denesb denesb self-assigned this Dec 22, 2023
@denesb
Copy link
Contributor Author

denesb commented Dec 22, 2023

Opened #16520 for the consume()/close() bad interaction.

denesb added a commit to denesb/scylla that referenced this issue Dec 22, 2023
The reader used to read the sstables was not closed. This could
sometimes trigger an abort(), because the reader was destroyed, without
it being closed first.
Why only sometimes? This is due to two factors:
* read_mutation_from_flat_mutation_reader() - the method used to extract
  a mutation from the reader, uses consume(), which does not trigger
  `set_close_is_required()` (scylladb#16520). Due to this, the top-level
  combined reader did not complain when destroyed without close.
* The combined reader closes underlying readers who have no more data
  for the current range. If the circumstances are just right, all
  underlying readers are closed, before the combined reader is
  destoyed. Looks like this is what happens for the most time.

This bug was discovered in SCT testing. After fixing scylladb#16520, all
invokations of `scylla-sstable`, which use this code would trigger the
abort, without this patch. So no further testing is required.

Fixes: scylladb#16519
denesb added a commit to denesb/scylla that referenced this issue Dec 22, 2023
The reader used to read the sstables was not closed. This could
sometimes trigger an abort(), because the reader was destroyed, without
it being closed first.
Why only sometimes? This is due to two factors:
* read_mutation_from_flat_mutation_reader() - the method used to extract
  a mutation from the reader, uses consume(), which does not trigger
  `set_close_is_required()` (scylladb#16520). Due to this, the top-level
  combined reader did not complain when destroyed without close.
* The combined reader closes underlying readers who have no more data
  for the current range. If the circumstances are just right, all
  underlying readers are closed, before the combined reader is
  destoyed. Looks like this is what happens for the most time.

This bug was discovered in SCT testing. After fixing scylladb#16520, all
invokations of `scylla-sstable`, which use this code would trigger the
abort, without this patch. So no further testing is required.

Fixes: scylladb#16519
@denesb denesb added this to the 6.0 milestone Dec 22, 2023
@mykaul mykaul added the backport/5.4 Issues that should be backported to 5.4 branch once they'll be fixed label Dec 28, 2023
@mykaul
Copy link
Contributor

mykaul commented Dec 31, 2023

@avikivity - please assist in the backport.

avikivity pushed a commit that referenced this issue Dec 31, 2023
The reader used to read the sstables was not closed. This could
sometimes trigger an abort(), because the reader was destroyed, without
it being closed first.
Why only sometimes? This is due to two factors:
* read_mutation_from_flat_mutation_reader() - the method used to extract
  a mutation from the reader, uses consume(), which does not trigger
  `set_close_is_required()` (#16520). Due to this, the top-level
  combined reader did not complain when destroyed without close.
* The combined reader closes underlying readers who have no more data
  for the current range. If the circumstances are just right, all
  underlying readers are closed, before the combined reader is
  destoyed. Looks like this is what happens for the most time.

This bug was discovered in SCT testing. After fixing #16520, all
invokations of `scylla-sstable`, which use this code would trigger the
abort, without this patch. So no further testing is required.

Fixes: #16519

Closes #16521

(cherry picked from commit da03334)
avikivity pushed a commit that referenced this issue Dec 31, 2023
The reader used to read the sstables was not closed. This could
sometimes trigger an abort(), because the reader was destroyed, without
it being closed first.
Why only sometimes? This is due to two factors:
* read_mutation_from_flat_mutation_reader() - the method used to extract
  a mutation from the reader, uses consume(), which does not trigger
  `set_close_is_required()` (#16520). Due to this, the top-level
  combined reader did not complain when destroyed without close.
* The combined reader closes underlying readers who have no more data
  for the current range. If the circumstances are just right, all
  underlying readers are closed, before the combined reader is
  destoyed. Looks like this is what happens for the most time.

This bug was discovered in SCT testing. After fixing #16520, all
invokations of `scylla-sstable`, which use this code would trigger the
abort, without this patch. So no further testing is required.

Fixes: #16519

Closes #16521

(cherry picked from commit da03334)
@avikivity
Copy link
Member

Backported to 5.2, 5.4.

@denesb denesb removed the backport/5.4 Issues that should be backported to 5.4 branch once they'll be fixed label Jan 4, 2024
dgarcia360 pushed a commit to dgarcia360/scylla that referenced this issue Apr 30, 2024
The reader used to read the sstables was not closed. This could
sometimes trigger an abort(), because the reader was destroyed, without
it being closed first.
Why only sometimes? This is due to two factors:
* read_mutation_from_flat_mutation_reader() - the method used to extract
  a mutation from the reader, uses consume(), which does not trigger
  `set_close_is_required()` (scylladb#16520). Due to this, the top-level
  combined reader did not complain when destroyed without close.
* The combined reader closes underlying readers who have no more data
  for the current range. If the circumstances are just right, all
  underlying readers are closed, before the combined reader is
  destoyed. Looks like this is what happens for the most time.

This bug was discovered in SCT testing. After fixing scylladb#16520, all
invokations of `scylla-sstable`, which use this code would trigger the
abort, without this patch. So no further testing is required.

Fixes: scylladb#16519

Closes scylladb#16521
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants