New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
nodetool clearsnapshot can crash a node #4554
Comments
There are, in my opinion, many problems with allowing those operations to proceed in parallel. Even if we fix the code not to crash and return an error on directory non-empty, the moment they do any amount of work in parallel I think the result of the operation becomes undefined. Some files in the snapshot may have been deleted by Moreover, although we could lock at the granularity of a keyspace or column family, I think we should use a big hammer here and lock the entire snapshot creation/deletion to avoid surprises (for example, if a user requests creation of a snapshot for all keyspaces, and another process requests clear of a single keyspace) |
the error we get on not being able to remove a directory is ..." seastar - Exceptional future ignored: storage_io_error (Storage I/O error: 39: filesystem error: remove failed: Directory not empty [..." not disputing the "big hammer" approach above - lets split this out into separate issues nodetool clearsnapshot should not crash a node on not being able to remove a directory and should return an error |
yes, that exception should return gracefully to the caller. |
91b71a0 does not solve the issue of the exception bringing down the node |
Deferring backport decision until we have more mileage with this. |
We still need to fix
|
We saw a node crashing today with nodetool clearsnapshot being called. After investigation, the reason is that nodetool clearsnapshot ws called at the same time a new snapshot was created with the same tag. nodetool clearsnapshot can't delete all files in the directory, because new files had by then been created in that directory, and crashes on I/O error. There are, many problems with allowing those operations to proceed in parallel. Even if we fix the code not to crash and return an error on directory non-empty, the moment they do any amount of work in parallel the result of the operation becomes undefined. Some files in the snapshot may have been deleted by clear, for example, and a user may then not be able to properly restore from the backup if this snapshot was used to generate a backup. Moreover, although we could lock at the granularity of a keyspace or column family, I think we should use a big hammer here and lock the entire snapshot creation/deletion to avoid surprises (for example, if a user requests creation of a snapshot for all keyspaces, and another process requests clear of a single keyspace) Fixes scylladb#4554 Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <20190614174438.9002-1-glauber@scylladb.com> (cherry picked from commit 91b71a0)
closing this issue - we fixed the main paths - if there is still a problem - please reopen |
This is Scylla's bug tracker, to be used for reporting bugs only.
If you have a question about Scylla, and not a bug, please ask it in
our mailing-list at scylladb-dev@googlegroups.com or in our slack channel.
We saw a node crashing today with nodetool clearsnapshot being called. After investigation, the reason is that nodetool clearsnapshot ws called at the same time a new snapshot was created with the same tag.
nodetool clearsnapshot can't delete all files in the directory, because new files had by then been created in that directory, and crashes on I/O error.
The text was updated successfully, but these errors were encountered: