nodetool snapshot race condition with compaction removing files. #4051

Closed
eyalgutkind opened this Issue Jan 4, 2019 · 22 comments

eyalgutkind commented Jan 4, 2019

Installation details
Scylla version (or git commit hash): Scylla 3.0rc3
OS (RHEL/CentOS/Ubuntu/AWS AMI):

When executing nodetool snapshot while a node is running compactions, we see the following error from the snapshot command. It looks like the snapshot command is trying to link a group of sstable files that is no longer there, probably because compaction removed it.
Because of the missing files, the snapshot command fails.

nodetool: Scylla API server HTTP POST to URL '/storage_service/snapshots' failed: filesystem error: link failed: No such file or directory [/data/scylla01/data/system/batchlog-0290003c977e397cac3efdfdc01d747e/la-727-big-TOC.txt] [/data/scylla01/data/system/batchlog-0290003c977e397cac3efdfdc01d747e/snapshots/backup/la-727-big-TOC.txt.tmp]

Looking into the backup directory, we cannot find the 727 group of files.

ls /data/scylla01/data/system/batchlog-0290003c977e397cac3efdfdc01d747e/snapshots/backup/
la-691-big-CompressionInfo.db  la-703-big-Digest.sha1         la-707-big-Index.db            la-711-big-Statistics.db       la-714-big-TOC.txt
la-691-big-Data.db             la-703-big-Filter.db           la-707-big-Scylla.db           la-711-big-Summary.db          la-715-big-CompressionInfo.db
la-691-big-Digest.sha1         la-703-big-Index.db            la-707-big-Statistics.db       la-711-big-TOC.txt             la-715-big-Data.db
la-691-big-Filter.db           la-703-big-Scylla.db           la-707-big-Summary.db          la-712-big-CompressionInfo.db  la-715-big-Digest.sha1
la-691-big-Index.db            la-703-big-Statistics.db       la-707-big-TOC.txt             la-712-big-Data.db             la-715-big-Filter.db
la-691-big-Scylla.db           la-703-big-Summary.db          la-708-big-CompressionInfo.db  la-712-big-Digest.sha1         la-715-big-Index.db
la-691-big-Statistics.db       la-703-big-TOC.txt             la-708-big-Data.db             la-712-big-Filter.db           la-715-big-Scylla.db
la-691-big-Summary.db          la-705-big-CompressionInfo.db  la-708-big-Digest.sha1         la-712-big-Index.db            la-715-big-Statistics.db
la-691-big-TOC.txt             la-705-big-Data.db             la-708-big-Filter.db           la-712-big-Scylla.db           la-715-big-Summary.db
la-699-big-CompressionInfo.db  la-705-big-Digest.sha1         la-708-big-Index.db            la-712-big-Statistics.db       la-715-big-TOC.txt
la-699-big-Data.db             la-705-big-Filter.db           la-708-big-Scylla.db           la-712-big-Summary.db          la-716-big-CompressionInfo.db
la-699-big-Digest.sha1         la-705-big-Index.db            la-708-big-Statistics.db       la-712-big-TOC.txt             la-716-big-Data.db
la-699-big-Filter.db           la-705-big-Scylla.db           la-708-big-Summary.db          la-713-big-CompressionInfo.db  la-716-big-Digest.sha1
la-699-big-Index.db            la-705-big-Statistics.db       la-708-big-TOC.txt             la-713-big-Data.db             la-716-big-Filter.db
la-699-big-Scylla.db           la-705-big-Summary.db          la-709-big-CompressionInfo.db  la-713-big-Digest.sha1         la-716-big-Index.db
la-699-big-Statistics.db       la-705-big-TOC.txt             la-709-big-Data.db             la-713-big-Filter.db           la-716-big-Scylla.db
la-699-big-Summary.db          la-706-big-CompressionInfo.db  la-709-big-Digest.sha1         la-713-big-Index.db            la-716-big-Statistics.db
la-699-big-TOC.txt             la-706-big-Data.db             la-709-big-Filter.db           la-713-big-Scylla.db           la-716-big-Summary.db
la-701-big-CompressionInfo.db  la-706-big-Digest.sha1         la-709-big-Index.db            la-713-big-Statistics.db       la-716-big-TOC.txt
la-701-big-Data.db             la-706-big-Filter.db           la-709-big-Scylla.db           la-713-big-Summary.db          la-717-big-CompressionInfo.db
la-701-big-Digest.sha1         la-706-big-Index.db            la-709-big-Statistics.db       la-713-big-TOC.txt             la-717-big-Data.db
la-701-big-Filter.db           la-706-big-Scylla.db           la-709-big-Summary.db          la-714-big-CompressionInfo.db  la-717-big-Digest.sha1
la-701-big-Index.db            la-706-big-Statistics.db       la-709-big-TOC.txt             la-714-big-Data.db             la-717-big-Filter.db
la-701-big-Scylla.db           la-706-big-Summary.db          la-711-big-CompressionInfo.db  la-714-big-Digest.sha1         la-717-big-Index.db
la-701-big-Statistics.db       la-706-big-TOC.txt             la-711-big-Data.db             la-714-big-Filter.db           la-717-big-Scylla.db
la-701-big-Summary.db          la-707-big-CompressionInfo.db  la-711-big-Digest.sha1         la-714-big-Index.db            la-717-big-Statistics.db
la-701-big-TOC.txt             la-707-big-Data.db             la-711-big-Filter.db           la-714-big-Scylla.db           la-717-big-Summary.db
la-703-big-CompressionInfo.db  la-707-big-Digest.sha1         la-711-big-Index.db            la-714-big-Statistics.db       la-717-big-TOC.txt
la-703-big-Data.db             la-707-big-Filter.db           la-711-big-Scylla.db           la-714-big-Summary.db          manifest.json

@eyalgutkind changed the title from "notetool snapshot race condition with compaction removing files." to "nodetool snapshot race condition with compaction removing files." Jan 4, 2019

@tzach tzach added the bug label Jan 4, 2019

@tzach tzach added this to the 3.0 milestone Jan 4, 2019

slivne commented Jan 4, 2019

@eyalgutkind can we get the logs from the node?

slivne commented Jan 5, 2019

True, I was able to reproduce this on docker as well.

I ran cassandra-stress and, in two sessions:

session 1: run nodetool flush
session 2: wait 1 sec; run nodetool snapshot

and the issue reproduces.

raphaelsc commented Jan 7, 2019

Snapshot gets the list of all current sstables and iterates through them, copying each one into the snapshot dir. The problem is that one of those sstables may have been compacted away before the iteration reaches it (see the sketch below). I thought about a few options (both respecting snapshot semantics):

  1. wait for compaction to finish, and prevent sstables being snapshotted from being compacted
  2. postpone deletion of sstables that were selected for snapshot, e.g. by scheduling a deferred job, or by deleting them only when the sstable is closed.

I prefer option 2 by far. Thoughts?
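
For illustration, a minimal, self-contained sketch of the failing pattern using plain std::filesystem (not the actual Scylla snapshot code, which links files through its own sstable machinery):

#include <filesystem>
#include <vector>

namespace fs = std::filesystem;

// Illustrative only: a point-in-time list of sstable files is collected first,
// then each file is hard-linked into the snapshot directory.  Nothing prevents
// compaction from unlinking one of the files between the two steps, so
// create_hard_link() can fail with "No such file or directory", which is
// exactly the error in the report above.
void take_snapshot(const fs::path& table_dir, const fs::path& snapshot_dir) {
    std::vector<fs::path> files;
    for (const auto& entry : fs::directory_iterator(table_dir)) {
        if (entry.is_regular_file()) {
            files.push_back(entry.path());        // point-in-time list
        }
    }
    fs::create_directories(snapshot_dir);
    for (const auto& f : files) {
        // Races with compaction: f may already be gone by the time we get here.
        fs::create_hard_link(f, snapshot_dir / f.filename());
    }
}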

bhalevy commented Jan 7, 2019

Option 2 looks much better. If snapshot increases the reference count on the sstable and compaction merely marks it for deletion, we can remove it once the last reference (snapshot or compaction) is released.

raphaelsc commented Jan 8, 2019

We have a list to keep track of sstables that have been compacted but not yet deleted, so that we do not GC tombstones in other sstables that could delete data present in these (still on-disk) sstables. That's preventing me from moving in the direction of deleting a compacted sstable in its destructor: the aforementioned list (needed for the GC issue) and deletion of the sstable in the destructor seem to be mutually exclusive. A solution I have in mind is to keep the sstables::delete_atomically() approach but make it proceed for a given sstable only after pending operations on it (like snapshot) are done, though that may be kind of ugly. I'll think more about alternative solutions.

tgrabiec commented Jan 9, 2019

A quick fix would be to run the snapshotting under cf.run_with_compaction_disabled()
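
For illustration, a sketch of what that quick fix could look like, assuming run_with_compaction_disabled() takes a callable returning a future and re-enables compaction once that future resolves; snapshot_table() and take_snapshot_on_table() are illustrative names, not the actual functions in database.cc:

// Sketch only: wrap the per-table snapshot work so compaction cannot delete
// any of the selected sstables while they are being linked into the snapshot
// directory.
future<> snapshot_table(column_family& cf, sstring snapshot_name) {
    return cf.run_with_compaction_disabled([&cf, snapshot_name] {
        return take_snapshot_on_table(cf, snapshot_name);   // existing snapshot path (illustrative name)
    });
}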

tgrabiec commented Jan 9, 2019

Here's an outline of a more complete solution.

Any given sstable within the Scylla process can be in the following states:

(1) Used, must not be removed from disk

(2) Unused, can now be removed from disk, but atomically with all sstables in the same atomic deletion group.

(3) Removed on disk

The transition is monotonic: (1) -> (2) -> (3).

In addition to having a list of active sstables in state (1) for a given table, compaction needs a list of sstables in states <= (2), which are not present in the main sstable list and are not yet deleted (aka compacted_undeleted).

=== Problem ===

The problem here is that sstables are deleted on disk by compaction while still in state (1), because compaction has no regard for active users.

As Raphael mentioned, we cannot fix this by changing compaction to just rely on shared_sstable refcounting for managing the transition from state (1) to (2) and let the sstable destructor delete the sstables, because of two problems:

(A) sstables in a single compaction group must be deleted atomically. We cannot let one of the compacted sstables be in state (2) while others are in state (3), because that may result in loss of deletions after the process is restarted.

(B) We cannot store shared_sstable in compacted_undeleted because this would prevent the sstable destructor from running and keep the sstables on disk indefinitely.

=== Solution ===

The list of compacted_undeleted sstables mentioned in (B) can be maintained by linking sstable objects into an intrusive list. They will be linked-in when compaction removes the sstable from the main sstable list. Sstable objects will be destroyed after being removed from disk, and their destructors will auto-unlink themselves from that list.
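
For illustration, a minimal sketch of such an auto-unlinking list using Boost.Intrusive (member and type names are made up; the real sstable class carries much more state):

#include <boost/intrusive/list.hpp>

namespace bi = boost::intrusive;

// auto_unlink hooks remove themselves from whatever list they are on in their
// destructor, which is exactly the behavior described above.  auto_unlink
// requires the list to be configured with constant_time_size<false>.
using undeleted_hook = bi::list_base_hook<bi::link_mode<bi::auto_unlink>>;

class sstable : public undeleted_hook {
    // ... sstable state ...
};

// Per-table list of sstables that were compacted (removed from the main
// sstable list) but whose files are not yet removed from disk.
using compacted_undeleted_list = bi::list<sstable, bi::constant_time_size<false>>;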

For solving (A), we need a new concept:

// Called when an sstable marked for deletion is no longer used,
// meaning no live shared_sstable is pointing to it.
class sstable_deleter {
public:
    virtual ~sstable_deleter() = default;  // polymorphic base, destroyed via pointer-to-base
    virtual void accept(std::unique_ptr<sstable>) noexcept = 0;
};

The sstable object destructor will no longer delete the sstable from disk; the sstable_deleter implementation will.

When marking an sstable for deletion, you pass a seastar::shared_ptr<sstable_deleter>, which decides what to do with a no longer used sstable. We can have a default impl which does what sstable::~sstable currently does.

shared_sstable will be, like now, a shared-pointer-like object which indicates that the sstable is used, which keeps the sstable in state (1). It will be a wrapper over shared_ptr<sstable>. Its destructor will call shared_ptr::release() to check if it's the last owner. If it was the last owner, the unique_ptr<sstable> it got from release() will be passed to the sstable_deleter associated with that sstable.

When compaction removes sstables from the main sstable list, it allocates one sstable_deleter and marks all sstables in the atomic deletion group using that sstable_deleter. The deleter defers actual removal from disk until it collects all the expected sstables via accept().
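
For illustration, a hypothetical deleter implementation along these lines; atomic_deletion_group_deleter and unlink_files() are made-up names for the sketch, not existing Scylla code:

#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// One instance is shared by all sstables in an atomic deletion group.  Files
// are removed only once every member of the group has become unused, so the
// group is deleted atomically (or not at all until the next restart).
class atomic_deletion_group_deleter : public sstable_deleter {
    size_t _expected;
    std::vector<std::unique_ptr<sstable>> _collected;
public:
    explicit atomic_deletion_group_deleter(size_t group_size) : _expected(group_size) {
        _collected.reserve(group_size);   // keeps accept() allocation-free, since it is noexcept
    }

    void accept(std::unique_ptr<sstable> sst) noexcept override {
        _collected.push_back(std::move(sst));
        if (_collected.size() != _expected) {
            return;   // still waiting for other members of the group
        }
        // Move the group into a local first: destroying the last sstable may
        // drop the last reference to this deleter, so members must not be
        // touched afterwards.
        auto group = std::move(_collected);
        for (auto& s : group) {
            s->unlink_files();   // illustrative stand-in for the real on-disk removal
        }
        // group goes out of scope here, destroying the sstable objects and
        // auto-unlinking them from the compacted_undeleted list.
    }
};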

tgrabiec commented Jan 9, 2019

@raphaelsc Can you think of a reason why this race became more likely on 3.0 than before?

avikivity commented Jan 9, 2019

@tgrabiec's plan is in line with my idea of having an sstable_manager to manage the lifetime of the sstable. The difference is that sstable_manager is the interface for both creation and destruction of the sstable. We can incrementally move to full life-cycle management after this bug is fixed.

avikivity commented Jan 10, 2019

We can use boost::intrusive_ptr as the implementation of shared_sstable. boost::intrusive_ptr calls intrusive_ptr_release() to drop the reference count, so intrusive_ptr_release() can move the sstable to the sstable_deleter when the reference count drops to 0.
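
For illustration, a sketch of that idea (the refcount member, the _deleter member, and the use of std::shared_ptr in place of seastar::shared_ptr are all assumptions made for the sketch):

#include <boost/intrusive_ptr.hpp>
#include <memory>

class sstable;

// Interface from the earlier comment.
class sstable_deleter {
public:
    virtual ~sstable_deleter() = default;
    virtual void accept(std::unique_ptr<sstable>) noexcept = 0;
};

class sstable {
public:
    long _ref_count = 0;                          // non-atomic is fine per shard
    std::shared_ptr<sstable_deleter> _deleter;    // set when marked for deletion
    // ... sstable state ...
};

using shared_sstable = boost::intrusive_ptr<sstable>;

inline void intrusive_ptr_add_ref(sstable* s) noexcept {
    ++s->_ref_count;
}

inline void intrusive_ptr_release(sstable* s) noexcept {
    if (--s->_ref_count == 0) {
        if (auto deleter = std::move(s->_deleter)) {
            // Marked for deletion: hand ownership to the deleter instead of
            // destroying the sstable (and its files) here.
            deleter->accept(std::unique_ptr<sstable>(s));
        } else {
            delete s;
        }
    }
}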

tgrabiec commented Jan 10, 2019

@raphaelsc atomic deletion manager was removed in 2.2, but we only noticed this now.

avikivity added a commit that referenced this issue Jan 10, 2019

database: Fix race condition in sstable snapshot
Race condition takes place when one of the sstables selected by snapshot
is deleted by compaction. Snapshot fails because it tries to link a
sstable that was previously unlinked by compaction's sstable deletion.

Fixes #4051.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20190110185726.22604-1-raphaelsc@scylladb.com>

avikivity added a commit that referenced this issue Jan 11, 2019

database: Fix race condition in sstable snapshot
Race condition takes place when one of the sstables selected by snapshot
is deleted by compaction. Snapshot fails because it tries to link a
sstable that was previously unlinked by compaction's sstable deletion.

Refs #4051.

(master commit 1b7cad3)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20190110194048.26051-1-raphaelsc@scylladb.com>

avikivity added a commit that referenced this issue Feb 19, 2019

database: Fix race condition in sstable snapshot
Race condition takes place when one of the sstables selected by snapshot
is deleted by compaction. Snapshot fails because it tries to link a
sstable that was previously unlinked by compaction's sstable deletion.

Refs #4051.

(master commit 1b7cad3)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20190110194048.26051-1-raphaelsc@scylladb.com>