scylla-sstable: crash due to unclosed reader in tools/schema_loader.cc #16519

denesb · 2023-12-22T09:25:34Z

Issue description

Not a regression (the code has been like this since the introduction of automatic schema discovery and loading; present in 5.2+).

Crash happens here:

Lines 441 to 443 in 9c7e5f6

    
    auto reader = make_combined_reader(schema_table_schema, permit, std::move(readers));

    auto mut_opt = read_mutation_from_flat_mutation_reader(reader).get();

This looks like a genuine bug: nothing closes that reader. It should crash every time, but somehow it doesn't.
I cannot reproduce the crash locally, and it does not seem to happen on every SCT run either. Not yet sure why.

Installation details

Kernel Version: 5.15.0-1051-aws
Scylla version (or git commit hash): 2024.1.0~rc2-20231217.f57117d9cfe3 with build-id 3a4d2dfe8ef4eef5454badb34d1710a5f36a859c

Cluster size: 6 nodes (i4i.4xlarge)

Scylla Nodes used in this run:

alternator-3h-2024-1-db-node-d3ef5cec-7 (18.234.171.30 | 10.12.8.13) (shards: -1)
alternator-3h-2024-1-db-node-d3ef5cec-6 (54.226.242.5 | 10.12.10.50) (shards: 14)
alternator-3h-2024-1-db-node-d3ef5cec-5 (35.172.219.253 | 10.12.8.64) (shards: 14)
alternator-3h-2024-1-db-node-d3ef5cec-4 (3.80.141.243 | 10.12.9.211) (shards: 14)
alternator-3h-2024-1-db-node-d3ef5cec-3 (34.224.174.172 | 10.12.9.148) (shards: 14)
alternator-3h-2024-1-db-node-d3ef5cec-2 (54.159.81.125 | 10.12.11.208) (shards: 14)
alternator-3h-2024-1-db-node-d3ef5cec-1 (34.226.202.252 | 10.12.9.39) (shards: 14)

OS / Image: ami-02a2eda4799995de1 (aws: undefined_region)

Test: longevity-alternator-3h-test
Test id: d3ef5cec-c5d4-4857-a1dc-5893417574e7
Test name: enterprise-2024.1/longevity/longevity-alternator-3h-test
Test config file(s):

longevity-alternator-3h.yaml

Logs and commands

Restore Monitor Stack command: $ hydra investigate show-monitor d3ef5cec-c5d4-4857-a1dc-5893417574e7
Restore monitor on AWS instance using Jenkins job
Show all stored logs command: $ hydra investigate show-logs d3ef5cec-c5d4-4857-a1dc-5893417574e7

Logs:

db-cluster-d3ef5cec.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/d3ef5cec-c5d4-4857-a1dc-5893417574e7/20231217_230452/db-cluster-d3ef5cec.tar.gz
sct-runner-events-d3ef5cec.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/d3ef5cec-c5d4-4857-a1dc-5893417574e7/20231217_230452/sct-runner-events-d3ef5cec.tar.gz
sct-d3ef5cec.log.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/d3ef5cec-c5d4-4857-a1dc-5893417574e7/20231217_230452/sct-d3ef5cec.log.tar.gz
loader-set-d3ef5cec.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/d3ef5cec-c5d4-4857-a1dc-5893417574e7/20231217_230452/loader-set-d3ef5cec.tar.gz
monitor-set-d3ef5cec.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/d3ef5cec-c5d4-4857-a1dc-5893417574e7/20231217_230452/monitor-set-d3ef5cec.tar.gz
parallel-timelines-report-d3ef5cec.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/d3ef5cec-c5d4-4857-a1dc-5893417574e7/20231217_230452/parallel-timelines-report-d3ef5cec.tar.gz

Jenkins job URL
Argus


denesb · 2023-12-22T09:38:01Z

Mystery solved.

scylladb/readers/mutation_reader.cc

Lines 89 to 91 in 9c7e5f6

    
    future<mutation_opt> read_mutation_from_flat_mutation_reader(flat_mutation_reader_v2& r) {
        return r.consume(mutation_rebuilder_v2(r.schema()));
    }

consume() operates on the _impl directly, so it never triggers set_close_required(), and the top-level reader thinks it doesn't have to be closed.
The combined reader does use the top-level flat_mutation_reader_v2 methods, so it triggers set_close_required() on the underlying readers. But the combined reader also closes readers that have no more data for the current range. So sometimes, with just the right data shape, some readers are left unclosed and trigger the crash.
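The interaction described above can be illustrated with a small toy model. This is only a sketch: `toy_reader`, `unclosed_children_after_read()` and `consume_path_hides_missing_close()` are hypothetical names made up for illustration, not Scylla classes or functions, and the "fragments per child" input is a simplified stand-in for the real data shape.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model only: none of these types are Scylla's real classes.
class toy_reader {
    int _remaining;               // fragments this reader can still produce
    bool _close_required = false; // set once a top-level read method is used
    bool _closed = false;
public:
    explicit toy_reader(int fragments) : _remaining(fragments) {}

    // Top-level read path: marks the reader as "must be closed".
    bool next() {
        _close_required = true;
        if (_remaining == 0) {
            return false;
        }
        --_remaining;
        return true;
    }

    // consume()-like path: drains the reader directly and never sets the flag.
    int consume_all() {
        int n = _remaining;
        _remaining = 0;
        return n;
    }

    bool exhausted() const { return _remaining == 0; }
    void close() { _closed = true; }

    // The condition a destructor check would complain about.
    bool would_abort_on_destroy() const { return _close_required && !_closed; }
};

// Toy combined reader: pulls one fragment from every child through the
// top-level path, then closes the children that have nothing left.
// Returns how many children would abort if destroyed at that point.
inline std::size_t unclosed_children_after_read(const std::vector<int>& fragments_per_child) {
    std::size_t unclosed = 0;
    for (int fragments : fragments_per_child) {
        toy_reader child(fragments);
        child.next();
        if (child.exhausted()) {
            child.close();
        }
        if (child.would_abort_on_destroy()) {
            ++unclosed;
        }
    }
    return unclosed;
}

// Draining through the consume()-like path leaves the flag unset, so the
// destructor check stays silent even though close() was never called.
inline bool consume_path_hides_missing_close() {
    toy_reader r(3);
    int consumed = r.consume_all();
    assert(consumed == 3);
    return !r.would_abort_on_destroy();
}
```

In this model, `unclosed_children_after_read({1, 1})` finds no unclosed child because every child is exhausted and therefore closed, while `unclosed_children_after_read({1, 2})` leaves one child open with its flag set. That mirrors why the crash depends on the data shape.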

denesb · 2023-12-22T09:45:55Z

Opened #16520 for the consume()/close() bad interaction.

The reader used to read the sstables was not closed. This could sometimes trigger an abort(), because the reader was destroyed without being closed first.

Why only sometimes? This is due to two factors:
* read_mutation_from_flat_mutation_reader(), the method used to extract a mutation from the reader, uses consume(), which does not trigger `set_close_required()` (scylladb#16520). Because of this, the top-level combined reader did not complain when destroyed without close.
* The combined reader closes underlying readers that have no more data for the current range. If the circumstances are just right, all underlying readers are closed before the combined reader is destroyed. It looks like this is what happens most of the time.

This bug was discovered in SCT testing. After fixing scylladb#16520, all invocations of `scylla-sstable` that use this code would trigger the abort without this patch, so no further testing is required.

Fixes: scylladb#16519
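One common way to guarantee that a reader is closed on every exit path is a scope guard. The following is a minimal, self-contained sketch of that idea, not the actual patch: `sketch_reader`, `close_on_exit` and `read_one()` are hypothetical names, and `close()` is synchronous here, whereas a real reader's close is asynchronous. Seastar provides a `deferred_close` helper for this purpose; whether the patch uses it is not stated in this thread.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical stand-in for a reader; not Scylla's flat_mutation_reader_v2.
struct sketch_reader {
    bool closed = false;
    int read() { return 7; }        // pretend to produce one mutation
    void close() { closed = true; } // synchronous here for simplicity
};

// Hypothetical scope guard: calls close() on the wrapped object at scope exit,
// whether the scope is left normally or by an exception.
template <typename T>
class close_on_exit {
    T& _obj;
public:
    explicit close_on_exit(T& obj) : _obj(obj) {}
    ~close_on_exit() { _obj.close(); }
    close_on_exit(const close_on_exit&) = delete;
    close_on_exit& operator=(const close_on_exit&) = delete;
};

// Reads one value and relies on the guard to close the reader afterwards.
inline int read_one(sketch_reader& reader, bool fail) {
    close_on_exit<sketch_reader> guard(reader);
    if (fail) {
        throw std::runtime_error("read failed");
    }
    return reader.read();
}
```

With this shape the caller cannot forget the close: after `read_one()` returns or throws, `reader.closed` is true in both cases.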

mykaul · 2023-12-31T12:15:49Z

@avikivity - please assist in the backport.


avikivity · 2023-12-31T16:13:43Z

Backported to 5.2, 5.4.


denesb self-assigned this Dec 22, 2023

denesb mentioned this issue Dec 22, 2023

tools/schema_loader: read_schema_table_mutation(): close the reader #16521

Closed

denesb added the area/scylla tools label Dec 22, 2023

denesb added this to the 6.0 milestone Dec 22, 2023

denesb added the type/bug label Dec 22, 2023

scylladb-promoter closed this as completed in da03334 Dec 24, 2023

scylladb-promoter added the Backport candidate label Dec 24, 2023

mykaul added the backport/5.4 Issues that should be backported to 5.4 branch once they'll be fixed label Dec 28, 2023

avikivity removed the Backport candidate label Dec 31, 2023

denesb removed the backport/5.4 Issues that should be backported to 5.4 branch once they'll be fixed label Jan 4, 2024
