staging sstables are not cleaned up in cleanup compaction potentially causing data resurrection #9559

bhalevy · 2021-11-01T05:31:04Z

Currently (scylla 4.6.dev 034f79c), when cleanup compaction is performed via compaction_manager::perform_cleanup
https://github.com/scylladb/scylla/blob/bd0b573a9252045bcc86bc48eb919ff807c740a4/compaction/compaction_manager.cc#L844

We select the sstables to compact with get_candidates
https://github.com/scylladb/scylla/blob/bd0b573a9252045bcc86bc48eb919ff807c740a4/compaction/compaction_manager.cc#L185

that calls table::in_strategy_sstables
https://github.com/scylladb/scylla/blob/bd0b573a9252045bcc86bc48eb919ff807c740a4/table.cc#L1140-L1146

that filters out staging sstables that are undergoing view building.

This means that when view building completes and the sstables are moved out of staging
https://github.com/scylladb/scylla/blob/bd0b573a9252045bcc86bc48eb919ff807c740a4/db/view/view_update_generator.cc#L128
calling https://github.com/scylladb/scylla/blob/bd0b573a9252045bcc86bc48eb919ff807c740a4/table.cc#L2214

They may contain tokens that are no longer owned by the current node, if a topology change occurred after the said sstables were written and before view build completed.

The text was updated successfully, but these errors were encountered:

bhalevy · 2021-11-01T05:39:51Z

I think that the right way to go here is not to fix the cleanup case specifically, but rather allow compaction of sstables in staging, and mark them for deletion when compaction is done, as we do in the delete_sstables_for_interrupted_compaction case:
https://github.com/scylladb/scylla/blob/bd0b573a9252045bcc86bc48eb919ff807c740a4/compaction/compaction.cc#L778-L786

Then, when view building is done, the sstable will be auto-deleted in close_files:
https://github.com/scylladb/scylla/blob/bd0b573a9252045bcc86bc48eb919ff807c740a4/sstables/sstables.cc#L2443-L2462

and move_sstables_from_staging can be skipped.

bhalevy · 2021-11-01T05:46:12Z

An important corner case that needs to be handled is keeping the sstable around until view building successfully completes.
For this we may want to introduce a new mark_for_deletion code:
https://github.com/scylladb/scylla/blob/bd0b573a9252045bcc86bc48eb919ff807c740a4/sstables/sstables.hh#L528-L532

e.g. tentative, that can be examined and turned into marked by the view building code, but will not cause auto-deletion, by itself, in sstable::close_files().

bhalevy · 2021-11-10T17:33:36Z

Cc @raphaelsc @psarna @nyh

bhalevy · 2022-01-11T17:13:16Z

Simplified (and untested yet) idea: create the sstables with hardlinkes both in main table directory and under staging.
Normal reads, compaction, snapshot, etc. will not consider the staging directory at all and will treat the staging sstables in the main directory as any other sstable. The hardlink in the staging directory will only function to mark the sstables that need view building as it does today and when vie building is done it will just unlink the sstable unconditionally from the staging dir.
Compaction (and kind) will happen orthogonally and independently on the hardlink in the main directory regardless of staging.

Truncate will need to remove all the hardlinks in staging after stopping the view build.

Snapshot will consider only sstables in the main directory.

staging sstables need to be compacted as any other sstable. In particular, they need to be subject to cleanup compaction and scrubbing. To achieve that, link them to the base dir when sealed. When moved out of staging, table::move_sstables_from_staging already unlinks the sstable from the staging subdirectory if its get_dir() is set to the base dir. Fix scylladb#9559 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

clone staging sstables so their content may be compacted while views are built. When done, the hard-linked copy in the staging subsirectory will be simply unlinked. Fixes scylladb#9559 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Before touching any of this code for #9559, that requires a change when loading sstables from the staging subdirectory, simplify it using coroutines. Closes #10609 * github.com:scylladb/scylla: replica: distributed_loader: reindent populate_keyspace replica: distributed_loader: coroutinize populate_keyspace replica: distributed_loader: reindent handle_sstables_pending_delete replica: distributed_loader: coroutinize handle_sstables_pending_delete replica: distributed_loader: reindent cleanup_column_family_temp_sst_dirs replica: distributed_loader: coroutinize cleanup_column_family_temp_sst_dirs replica: distributed_loader: reindent make_sstables_available replica: distributed_loader: coroutinize make_sstables_available sstable_directory: parallel_for_each_restricted: keep func alive across calls replica: distributed_loader: reindent reshape replica: distributed_loader: coroutinize reshape replica: distributed_loader: coroutinize reshard replica: distributed_loader: reindent run_resharding_jobs replica: distributed_loader: coroutinize run_resharding_jobs replica: distributed_loader: reindent distribute_reshard_jobs replica: distributed_loader: coroutinize distribute_reshard_jobs replica: distributed_loader: reindent collect_all_shared_sstables replica: distributed_loader: coroutinize collect_all_shared_sstables replica: distributed_loader: reindent process_sstable_dir replica: distributed_loader: coroutinize process_sstable_dir

clone staging sstables so their content may be compacted while views are built. When done, the hard-linked copy in the staging subsirectory will be simply unlinked. Fixes scylladb#9559 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

clone staging sstables so their content may be compacted while views are built. When done, the hard-linked copy in the staging subdirectory will be simply unlinked. Fixes scylladb#9559 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

This series decouples the staging sstables from the table's sstable set. The current behavior keeps the sstables in the staging directory until view building is done. They are readable as any other sstable, but fenced off from compaction, so they don't go away in the meanwhile. Currently, when views are built, the sstables are moved into the main table directory where they will then be compacted normally. The problem with this design is that the staging sstables are never compacted, in particular they won't get cleaned up or scrubbed. The cleanup scenario open a backdoor for data resurrection when the staging sstables are moved after view building while possibly containing stale partitions (#9559) which will not be cleaned up until next time cleanup compaction is performed. With this series, SSTables that are created in or moved to the staging sub-directory are "cloned" into the base table directory by hard-linking the components there and creating a new sstable object which loads the cloned files. The former, in the staging directory is used solely for view building and is not added to the table's sstable set, while the latter, its clone, behaves like any other sstable and is added either to the regular or maintenance set and is read and compacted normally. When view building is done, instead of moving the staging sstable into the table's base directory, it is simply unlinked. If its "clone" wasn't compacted away yet, then it will just remain where it is, exactly like it would be after it was moved there in the present state of things. If it was already compacted and no longer exists, then unlinking will then free its storage. Note that snapshot is based on the sstables listed by the table, which do not include the staging sstables with this change. But that shouldn't matter since even today, the sstables in the snapshot has no notion of "staging" directory and it is expected that the MV's are either updated view `nodetool refresh` if restoring sstables from snapshot using the uploads dir, or if restoring the whole table from backup - MV's are effectively expected to be rebuilt from scratch (they are not included in automatic snapshots anyway since we don't have snapshot-coherency across tables). A fundamental infrastructure change was done to achieve that which is to change the sstable_list which was a std::unordered_set<shared_sstable> into a std::unordered_map<generation_type, shared_sstable> that keeps the shared_sstable objects indexed by generation number (that must be unique). With this model, sstables are supposed to be searched by the generation number, not by their pointer, since when the staging sstable is clones, there will be 2 shared_sstable objects with the same generation (and different `dir()`) and we must distinguish between them. Special care was taken to throw a runtime_error exception if when looking up a shared sstable and finding another one with the same generation, since they must never exist in the same sstable_map. Fixes #9559 Closes #10657 * github.com:scylladb/scylla: table: clone staging sstables into table dir view_update_generator: discover_staging_sstables: reindent table: add get_staging_sstables view_update_generator: discover_staging_sstables: get shared table ptr earlier distributed_loader: populate table directory first sstables: time_series_sstable_set: insert: make exception safe sstables: move_to_new_dir: fix debug log message

…m Benny Halevy This series extends sstable cleanup to resharding and other (offstrategy, major, and regular) compaction types so to: * cleanup uploaded sstables (scylladb#11933) * cleanup staging sstables after they are moved back to the main directory and become eligible for compaction (scylladb#9559) When perform_cleanup is called, all sstables are scanned, and those that require cleanup are marked as such, and are added for tracking to table_state::cleanup_sstable_set. They are removed from that set once released by compaction. Along with that sstables set, we keep the owned_ranges_ptr used by cleanup in the table_state to allow other compaction types (offstrategy, major, or regular) to cleanup those sstables that are marked as require_cleanup and that were skipped by cleanup compaction for either being in the maintenance set (requiring offstrategy compaction) or in staging. Resharding is using a more straightforward mechanism of passing the owned token ranges when resharding uploaded sstables and using it to detect sstable that require cleanup, now done as piggybacked on resharding compaction. \Closes scylladb#12422 * github.com:scylladb/scylladb: table: discard_sstables: update_sstable_cleanup_state when deleting sstables compaction_manager: compact_sstables: retrieve owned ranges if required sstables: add a printer for shared_sstable compaction_manager: keep owned_ranges_ptr in compaction_state compaction_manager: perform_cleanup: keep sstables in compaction_state::sstables_requiring_cleanup compaction: refactor compaction_state out of compaction_manager compaction: refactor compaction_fwd.hh out of compaction_descriptor.hh compaction_manager: compacting_sstable_registration: keep a ref to the compaction_state compaction_manager: refactor get_candidates compaction_manager: get_candidates: mark as const table, compaction_manager: add requires_cleanup sstable_set: add for_each_sstable_until distributed_loader: reshard: update sstable cleanup state table, compaction_manager: add update_sstable_cleanup_state compaction_manager: needs_cleanup: delete unused schema param compaction_manager: perform_cleanup: disallow empty sorted_owened_ranges distributed_loader: reshard: consider sstables for cleanup distributed_loader: process_upload_dir: pass owned_ranges_ptr to reshard distributed_loader: reshard: add optional owned_ranges_ptr param distributed_loader: reshard: get a ref to table_state distributed_loader: reshard: capture creator by ref distributed_loader: reshard: reserve num_jobs buckets compaction: move owned ranges filtering to base class compaction: move owned_ranges into descriptor (cherry picked from commit f1bbf70)

…eaned up cleanup_compaction should resolve only after all sstables that require cleanup are cleaned up. Since it is possible that some of them are in staging and therefore cannot be cleaned up, retry once a second until they become eligible. Timeout if there is no progress within 5 minutes to prevent hanging due to view building bug. \Fixes scylladb#9559 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> (cherry picked from commit a5a8020)

…m Benny Halevy cleanup_compaction should resolve only after all sstables that require cleanup are cleaned up. Since it is possible that some of them are in staging and therefore cannot be cleaned up, retry once a second until they become eligible. Timeout if there is no progress within 5 minutes to prevent hanging due to view building bug. \Fixes scylladb#9559 \Closes scylladb#13812 * github.com:scylladb/scylladb: table: signal compaction_manager when staging sstables become eligible for cleanup compaction_manager: perform_cleanup: wait until all candidates are cleaned up compaction_manager: perform_cleanup: perform_offstrategy if needed compaction_manager: perform_cleanup: update_sstables_cleanup_state in advance sstable_set: add for_each_sstable_gently* helpers (cherry picked from commit 3b424e3)