Handle EDQUOT error just like ENOSPC #12626

soyacz · 2023-01-24T11:14:55Z

When using XFS quota it filesystem may return EDQUOT error. Scylla should handle it properly.

mykaul · 2023-01-24T11:27:37Z

When using XFS quota it filesystem may return EDQUOT error. Scylla should handle it properly.

Do you have a reproduction / test that it doesn't?

soyacz · 2023-01-24T11:31:56Z

When using XFS quota it filesystem may return EDQUOT error. Scylla should handle it properly.

Do you have a reproduction / test that it doesn't?

Not yet, we can plan adding it - we start using https://chaos-mesh.org/docs/simulate-io-chaos-on-kubernetes/ so we'll be able to simulate it easily.

mykaul · 2023-01-24T11:46:34Z

When using XFS quota it filesystem may return EDQUOT error. Scylla should handle it properly.

Do you have a reproduction / test that it doesn't?

Not yet, we can plan adding it - we start using https://chaos-mesh.org/docs/simulate-io-chaos-on-kubernetes/ so we'll be able to simulate it easily.

I'm closing this item for the time being.
Let's re-open if indeed (1) XFS returns EDQUOT and (2) Scylla doesn't handle it properly.

As for simulating - just assign an XFS quota on a mount the Scylla K8S operator and fill that mount point, no need for anything artifical.

avikivity · 2023-01-24T11:51:36Z

When using XFS quota it filesystem may return EDQUOT error. Scylla should handle it properly.

Do you have a reproduction / test that it doesn't?

Not yet, we can plan adding it - we start using https://chaos-mesh.org/docs/simulate-io-chaos-on-kubernetes/ so we'll be able to simulate it easily.

I'm closing this item for the time being. Let's re-open if indeed (1) XFS returns EDQUOT and (2) Scylla doesn't handle it properly.

As for simulating - just assign an XFS quota on a mount the Scylla K8S operator and fill that mount point, no need for anything artifical.

XFS will surely return EDQUOT (just grep the source), and ScyllaDB won't handle it correctly (ditto).

mykaul · 2023-01-24T11:57:42Z

When using XFS quota it filesystem may return EDQUOT error. Scylla should handle it properly.

Do you have a reproduction / test that it doesn't?

Not yet, we can plan adding it - we start using https://chaos-mesh.org/docs/simulate-io-chaos-on-kubernetes/ so we'll be able to simulate it easily.

I'm closing this item for the time being. Let's re-open if indeed (1) XFS returns EDQUOT and (2) Scylla doesn't handle it properly.
As for simulating - just assign an XFS quota on a mount the Scylla K8S operator and fill that mount point, no need for anything artifical.

XFS will surely return EDQUOT (just grep the source), and ScyllaDB won't handle it correctly (ditto).

OK, re-opening. @bhalevy - I assume your team should handle this?

bhalevy · 2023-01-24T12:42:14Z

Why do we care about EDQUOT?

For the sake of the argument, if we handle it like an I/O error, ENOSPC in particular, and isolate the node, I'd expect the admin to fix the environment and remove any quota restrictions, as scylla is not designed to run as a service on a general purpose server.

bhalevy · 2023-01-24T12:46:59Z

When using XFS quota it filesystem may return EDQUOT error. Scylla should handle it properly.

I suggest that the proper handling would be to follow ENOSPC handling as effectively, running out of quota is equivalent to running out of storage space (just that it's tunable).

bhalevy · 2023-01-24T12:48:21Z

@tchaikov I'm assigning to you since it's a good opportunity to survey how we deal with ENOSPC, and you may even find holes in it while inspecting the existing code.

tchaikov · 2023-01-24T13:17:43Z

@bhalevy sure, will look into this issue once i am back from PTO.

avikivity · 2023-01-24T13:17:48Z

When using XFS quota it filesystem may return EDQUOT error. Scylla should handle it properly.

I suggest that the proper handling would be to follow ENOSPC handling as effectively, running out of quota is equivalent to running out of storage space (just that it's tunable).

Yes, I don't see any reason to play games.

Separately, the operator should monitor disk usage and increase quota ahead of time.

EDQUOT can be returned as the errno when the underlying filesystem is trying to reserve necessary resources from disk for performing i/o on behalf of the effective user, and the filesystem fails to acquire the necessary resources. it could be inode, volume space, or whatever resources for completing the i/o operation. but none of them is the consequence of scylla's fault. so we should not `abort()` at seeing this errno. instead, it's should be reported to the administrator. in this change, EDQUOT is also considered as an environmental failure just like EIO, EACCES and ENOSPC. they could be thrown when stopping an server. Fixes scylladb#12626 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

* use `defer_verbose_shutdown()` to shutdown compaction manager `EDQUOT` is quite similar as `ENOSPC`, in the sense that both of them are caused by environmental issues. before this change, `compaction_manager` filters the ENOSPC exceptions thrown by `compaction_manager::really_do_stop()`, so they are not propagated to caller when calling `compaction_manager::stop()` -- only a warning message is printed in the log. but `EDQUOT` is not handled. after this change, the exception raised by compaction manager's stop process is not filtered anymore and is handled by `defer_verbose_shutdown()` instead, which is able to check the type of exception, and print out error message in the log. so the `ENOSPC` and `EDQUOT` errors are taken care of, and more visible from user's perspective as they are printed as errors instead of warning. but they are not printed using the `compaction_manager` logger anymore. so if our testing or user's workflow depends on this behavior, the related setting should be updated accordingly. Fixes scylladb#12626 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

retry when memtable flush fails due to EDQUOT. there are chances that user exceeds the disk quota when scylla flushes memtable and user manages to free up the necessary resource before the next retry. before this change, we simply `abort()` in this case. after this change, we just keep on retrying until the service is shutdown. Fixes scylladb#12626 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

mykaul · 2023-01-31T11:10:23Z

Setting this to 5.3, but we may wish to have it backported later to 5.2.x.

EDQUOT can be returned as the errno when the underlying filesystem is trying to reserve necessary resources from disk for performing i/o on behalf of the effective user, and the filesystem fails to acquire the necessary resources. it could be inode, volume space, or whatever resources for completing the i/o operation. but none of them is the consequence of scylla's fault. so we should not `abort()` at seeing this errno. instead, it's should be reported to the administrator. in this change, EDQUOT is also considered as an environmental failure just like EIO, EACCES and ENOSPC. they could be thrown when stopping an server. Fixes scylladb#12626 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

* use `defer_verbose_shutdown()` to shutdown compaction manager `EDQUOT` is quite similar as `ENOSPC`, in the sense that both of them are caused by environmental issues. before this change, `compaction_manager` filters the ENOSPC exceptions thrown by `compaction_manager::really_do_stop()`, so they are not propagated to caller when calling `compaction_manager::stop()` -- only a warning message is printed in the log. but `EDQUOT` is not handled. after this change, the exception raised by compaction manager's stop process is not filtered anymore and is handled by `defer_verbose_shutdown()` instead, which is able to check the type of exception, and print out error message in the log. so the `ENOSPC` and `EDQUOT` errors are taken care of, and more visible from user's perspective as they are printed as errors instead of warning. but they are not printed using the `compaction_manager` logger anymore. so if our testing or user's workflow depends on this behavior, the related setting should be updated accordingly. Fixes scylladb#12626 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

retry when memtable flush fails due to EDQUOT. there are chances that user exceeds the disk quota when scylla flushes memtable and user manages to free up the necessary resource before the next retry. before this change, we simply `abort()` in this case. after this change, we just keep on retrying until the service is shutdown. Fixes scylladb#12626 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

* use `defer_verbose_shutdown()` to shutdown compaction manager `EDQUOT` is quite similar as `ENOSPC`, in the sense that both of them are caused by environmental issues. before this change, `compaction_manager` filters the ENOSPC exceptions thrown by `compaction_manager::really_do_stop()`, so they are not propagated to caller when calling `compaction_manager::stop()` -- only a warning message is printed in the log. but `EDQUOT` is not handled. after this change, the exception raised by compaction manager's stop process is not filtered anymore and is handled by `defer_verbose_shutdown()` instead, which is able to check the type of exception, and print out error message in the log. so the `ENOSPC` and `EDQUOT` errors are taken care of, and more visible from user's perspective as they are printed as errors instead of warning. but they are not printed using the `compaction_manager` logger anymore. so if our testing or user's workflow depends on this behavior, the related setting should be updated accordingly. Fixes scylladb#12626 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

retry when memtable flush fails due to EDQUOT. there are chances that user exceeds the disk quota when scylla flushes memtable and user manages to free up the necessary resource before the next retry. before this change, we simply `abort()` in this case. after this change, we just keep on retrying until the service is shutdown. Fixes scylladb#12626 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

- main: consider EDQUOT as environmental failure also - main: use defer_verbose_shutdown() to shutdown compaction manager - replica/table: extract should_retry() int with_retry - replica/table: retry on EDQUOT when flushing memtable Fixes #12626 Closes #12653 * github.com:scylladb/scylladb: replica/table: retry on EDQUOT when flushing memtable replica/table: extract should_retry() int with_retry main: use defer_verbose_shutdown() to shutdown compaction manager main: consider EDQUOT as environmental failure also

avikivity · 2023-04-16T19:31:20Z

Not a regression, not backporting. There's something to be said about this being a bug, but let's deal with it later.

mykaul closed this as not planned Won't fix, can't repro, duplicate, stale Jan 24, 2023

mykaul reopened this Jan 24, 2023

mykaul added the area/storage Storage space reduction and other storage enhancements label Jan 24, 2023

bhalevy added the n00b label Jan 24, 2023

bhalevy assigned tchaikov Jan 24, 2023

mykaul changed the title ~~Handle EDQUOT error~~ Handle EDQUOT error just like ENOSPC Jan 24, 2023

tchaikov mentioned this issue Jan 27, 2023

Handle EDQUOT error just like ENOSPC #12653

Merged

mykaul added this to the 5.3 milestone Jan 31, 2023

scylladb-promoter closed this as completed in c3ef353 Feb 10, 2023

scylladb-promoter added the Backport candidate label Feb 10, 2023

avikivity removed the Backport candidate label Apr 16, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Handle EDQUOT error just like ENOSPC #12626

Handle EDQUOT error just like ENOSPC #12626

soyacz commented Jan 24, 2023

mykaul commented Jan 24, 2023

soyacz commented Jan 24, 2023

mykaul commented Jan 24, 2023

avikivity commented Jan 24, 2023

mykaul commented Jan 24, 2023

bhalevy commented Jan 24, 2023 •

edited

bhalevy commented Jan 24, 2023

bhalevy commented Jan 24, 2023

tchaikov commented Jan 24, 2023

avikivity commented Jan 24, 2023

mykaul commented Jan 31, 2023

avikivity commented Apr 16, 2023

Handle EDQUOT error just like ENOSPC #12626

Handle EDQUOT error just like ENOSPC #12626

Comments

soyacz commented Jan 24, 2023

mykaul commented Jan 24, 2023

soyacz commented Jan 24, 2023

mykaul commented Jan 24, 2023

avikivity commented Jan 24, 2023

mykaul commented Jan 24, 2023

bhalevy commented Jan 24, 2023 • edited

bhalevy commented Jan 24, 2023

bhalevy commented Jan 24, 2023

tchaikov commented Jan 24, 2023

avikivity commented Jan 24, 2023

mykaul commented Jan 31, 2023

avikivity commented Apr 16, 2023

bhalevy commented Jan 24, 2023 •

edited