
nodetool clearsnapshot can crash a node #4554

Closed
1 task done
glommer opened this issue Jun 13, 2019 · 8 comments
Comments

@glommer
Contributor

glommer commented Jun 13, 2019

This is Scylla's bug tracker, to be used for reporting bugs only.
If you have a question about Scylla, and not a bug, please ask it in
our mailing-list at scylladb-dev@googlegroups.com or in our slack channel.

  • I have read the disclaimer above, and I am reporting a suspected malfunction in Scylla.

We saw a node crashing today while nodetool clearsnapshot was being called. After investigation, the reason is that nodetool clearsnapshot was called at the same time a new snapshot was created with the same tag.

nodetool clearsnapshot can't delete all files in the directory, because new files had by then been created in that directory, and crashes on I/O error.

@glommer
Contributor Author

glommer commented Jun 13, 2019

There are, in my opinion, many problems with allowing those operations to proceed in parallel. Even if we fix the code not to crash and return an error on directory non-empty, the moment they do any amount of work in parallel I think the result of the operation becomes undefined. Some files in the snapshot may have been deleted by clear, for example, and a user may then not be able to properly restore from the backup if this snapshot was used to generate a backup.

Moreover, although we could lock at the granularity of a keyspace or column family, I think we should use a big hammer here and lock the entire snapshot creation/deletion to avoid surprises (for example, if a user requests creation of a snapshot for all keyspaces, and another process requests clear of a single keyspace)

@slivne
Contributor

slivne commented Jun 14, 2019

The error we get on not being able to remove a directory is: "seastar - Exceptional future ignored: storage_io_error (Storage I/O error: 39: filesystem error: remove failed: Directory not empty [..."

Not disputing the "big hammer" approach above, but let's split this out into separate issues:

nodetool clearsnapshot should not crash a node when it cannot remove a directory; it should return an error instead.

@glommer
Contributor Author

glommer commented Jun 14, 2019

Yes, that exception should be returned gracefully to the caller.

@slivne
Contributor

slivne commented Jun 16, 2019

91b71a0 does not solve the issue of the exception bringing down the node

@avikivity
Member

Deferring backport decision until we have more mileage with this.

@slivne slivne assigned amnonh and unassigned glommer Jul 29, 2019
@slivne
Contributor

slivne commented Jul 29, 2019

We still need to fix

The error we get on not being able to remove a directory is: "seastar - Exceptional future ignored: storage_io_error (Storage I/O error: 39: filesystem error: remove failed: Directory not empty [..."

@slivne slivne modified the milestones: 3.2, 3.4 Feb 13, 2020
@slivne slivne modified the milestones: 4.0, 4.1 Mar 24, 2020
avikivity pushed a commit to avikivity/scylladb that referenced this issue Apr 28, 2020
We saw a node crashing today while nodetool clearsnapshot was being called.
After investigation, the reason is that nodetool clearsnapshot was called
at the same time a new snapshot was created with the same tag. nodetool
clearsnapshot can't delete all files in the directory, because new files
had by then been created in that directory, and crashes on I/O error.

There are many problems with allowing those operations to proceed in
parallel. Even if we fix the code not to crash and return an error on
directory non-empty, the moment they do any amount of work in parallel
the result of the operation becomes undefined. Some files in the
snapshot may have been deleted by clear, for example, and a user may
then not be able to properly restore from the backup if this snapshot
was used to generate a backup.

Moreover, although we could lock at the granularity of a keyspace or
column family, I think we should use a big hammer here and lock the
entire snapshot creation/deletion to avoid surprises (for example, if a
user requests creation of a snapshot for all keyspaces, and another
process requests clear of a single keyspace)

Fixes scylladb#4554

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20190614174438.9002-1-glauber@scylladb.com>
(cherry picked from commit 91b71a0)
@bhalevy
Member

bhalevy commented May 18, 2020

Related commits that help synchronize snapshot operations and prevent hitting ENOENT:
88d2486 sstables: Synchronize deletion of SSTables in resharding with other operations
682fb3a api: storage_service: serialize true_snapshot_size

@slivne slivne modified the milestones: 4.1, 4.3 Jun 1, 2020
@slivne
Contributor

slivne commented Jan 24, 2021

Closing this issue; we fixed the main paths. If there is still a problem, please reopen.

@slivne slivne closed this as completed Jan 24, 2021