
nodetool clearsnapshot can crash a node #4554

Closed
1 task done
glommer opened this issue Jun 13, 2019 · 8 comments
Comments

@glommer
Contributor

glommer commented Jun 13, 2019

This is Scylla's bug tracker, to be used for reporting bugs only.
If you have a question about Scylla, and not a bug, please ask it in
our mailing-list at scylladb-dev@googlegroups.com or in our slack channel.

  • I have read the disclaimer above, and I am reporting a suspected malfunction in Scylla.

We saw a node crashing today while nodetool clearsnapshot was being called. After investigation, the reason is that nodetool clearsnapshot was called at the same time a new snapshot was created with the same tag.

nodetool clearsnapshot can't delete all files in the directory, because new files had by then been created in that directory, and crashes on I/O error.

@glommer
Contributor Author

glommer commented Jun 13, 2019

There are, in my opinion, many problems with allowing those operations to proceed in parallel. Even if we fix the code not to crash and return an error on directory non-empty, the moment they do any amount of work in parallel I think the result of the operation becomes undefined. Some files in the snapshot may have been deleted by clear, for example, and a user may then not be able to properly restore from the backup if this snapshot was used to generate a backup.

Moreover, although we could lock at the granularity of a keyspace or column family, I think we should use a big hammer here and lock the entire snapshot creation/deletion to avoid surprises (for example, if a user requests creation of a snapshot for all keyspaces, and another process requests clear of a single keyspace)

@slivne
Contributor

slivne commented Jun 14, 2019

The error we get on not being able to remove a directory is: "seastar - Exceptional future ignored: storage_io_error (Storage I/O error: 39: filesystem error: remove failed: Directory not empty [..."

Not disputing the "big hammer" approach above, but let's split this out into separate issues:

nodetool clearsnapshot should not crash a node when it cannot remove a directory; it should return an error instead.

@glommer
Contributor Author

glommer commented Jun 14, 2019

Yes, that exception should be returned gracefully to the caller.

@slivne
Contributor

slivne commented Jun 16, 2019

91b71a0 does not solve the issue of the exception bringing down the node

@avikivity
Member

Deferring backport decision until we have more mileage with this.

@slivne slivne assigned amnonh and unassigned glommer Jul 29, 2019
@slivne
Contributor

slivne commented Jul 29, 2019

We still need to fix

The error we get on not being able to remove a directory is: "seastar - Exceptional future ignored: storage_io_error (Storage I/O error: 39: filesystem error: remove failed: Directory not empty [..."

@slivne slivne modified the milestones: 3.2, 3.4 Feb 13, 2020
@slivne slivne modified the milestones: 4.0, 4.1 Mar 24, 2020
avikivity pushed a commit to avikivity/scylladb that referenced this issue Apr 28, 2020
We saw a node crashing today while nodetool clearsnapshot was being called.
After investigation, the reason is that nodetool clearsnapshot was called
at the same time a new snapshot was created with the same tag. nodetool
clearsnapshot can't delete all files in the directory, because new files
had by then been created in that directory, and crashes on I/O error.

There are many problems with allowing those operations to proceed in
parallel. Even if we fix the code not to crash and return an error on
directory non-empty, the moment they do any amount of work in parallel
the result of the operation becomes undefined. Some files in the
snapshot may have been deleted by clear, for example, and a user may
then not be able to properly restore from the backup if this snapshot
was used to generate a backup.

Moreover, although we could lock at the granularity of a keyspace or
column family, I think we should use a big hammer here and lock the
entire snapshot creation/deletion to avoid surprises (for example, if a
user requests creation of a snapshot for all keyspaces, and another
process requests clear of a single keyspace)

Fixes scylladb#4554

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20190614174438.9002-1-glauber@scylladb.com>
(cherry picked from commit 91b71a0)
@bhalevy
Member

bhalevy commented May 18, 2020

Related commits that help synchronize snapshot operations and prevent hitting ENOENT:
88d2486 sstables: Synchronize deletion of SSTables in resharding with other operations
682fb3a api: storage_service: serialize true_snapshot_size

@slivne slivne modified the milestones: 4.1, 4.3 Jun 1, 2020
@slivne
Contributor

slivne commented Jan 24, 2021

Closing this issue; we fixed the main paths. If there is still a problem, please reopen.

@slivne slivne closed this as completed Jan 24, 2021