New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handle EDQUOT error just like ENOSPC #12626
Comments
Do you have a reproduction / test that it doesn't? |
Not yet, we can plan adding it - we start using https://chaos-mesh.org/docs/simulate-io-chaos-on-kubernetes/ so we'll be able to simulate it easily. |
I'm closing this item for the time being. As for simulating - just assign an XFS quota on a mount the Scylla K8S operator and fill that mount point, no need for anything artifical. |
XFS will surely return EDQUOT (just grep the source), and ScyllaDB won't handle it correctly (ditto). |
OK, re-opening. @bhalevy - I assume your team should handle this? |
Why do we care about EDQUOT? For the sake of the argument, if we handle it like an I/O error, ENOSPC in particular, and isolate the node, I'd expect the admin to fix the environment and remove any quota restrictions, as scylla is not designed to run as a service on a general purpose server. |
I suggest that the proper handling would be to follow ENOSPC handling as effectively, running out of quota is equivalent to running out of storage space (just that it's tunable). |
@tchaikov I'm assigning to you since it's a good opportunity to survey how we deal with ENOSPC, and you may even find holes in it while inspecting the existing code. |
@bhalevy sure, will look into this issue once i am back from PTO. |
Yes, I don't see any reason to play games. Separately, the operator should monitor disk usage and increase quota ahead of time. |
EDQUOT can be returned as the errno when the underlying filesystem is trying to reserve necessary resources from disk for performing i/o on behalf of the effective user, and the filesystem fails to acquire the necessary resources. it could be inode, volume space, or whatever resources for completing the i/o operation. but none of them is the consequence of scylla's fault. so we should not `abort()` at seeing this errno. instead, it's should be reported to the administrator. in this change, EDQUOT is also considered as an environmental failure just like EIO, EACCES and ENOSPC. they could be thrown when stopping an server. Fixes scylladb#12626 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
* use `defer_verbose_shutdown()` to shutdown compaction manager `EDQUOT` is quite similar as `ENOSPC`, in the sense that both of them are caused by environmental issues. before this change, `compaction_manager` filters the ENOSPC exceptions thrown by `compaction_manager::really_do_stop()`, so they are not propagated to caller when calling `compaction_manager::stop()` -- only a warning message is printed in the log. but `EDQUOT` is not handled. after this change, the exception raised by compaction manager's stop process is not filtered anymore and is handled by `defer_verbose_shutdown()` instead, which is able to check the type of exception, and print out error message in the log. so the `ENOSPC` and `EDQUOT` errors are taken care of, and more visible from user's perspective as they are printed as errors instead of warning. but they are not printed using the `compaction_manager` logger anymore. so if our testing or user's workflow depends on this behavior, the related setting should be updated accordingly. Fixes scylladb#12626 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
retry when memtable flush fails due to EDQUOT. there are chances that user exceeds the disk quota when scylla flushes memtable and user manages to free up the necessary resource before the next retry. before this change, we simply `abort()` in this case. after this change, we just keep on retrying until the service is shutdown. Fixes scylladb#12626 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
retry when memtable flush fails due to EDQUOT. there are chances that user exceeds the disk quota when scylla flushes memtable and user manages to free up the necessary resource before the next retry. before this change, we simply `abort()` in this case. after this change, we just keep on retrying until the service is shutdown. Fixes scylladb#12626 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Setting this to 5.3, but we may wish to have it backported later to 5.2.x. |
EDQUOT can be returned as the errno when the underlying filesystem is trying to reserve necessary resources from disk for performing i/o on behalf of the effective user, and the filesystem fails to acquire the necessary resources. it could be inode, volume space, or whatever resources for completing the i/o operation. but none of them is the consequence of scylla's fault. so we should not `abort()` at seeing this errno. instead, it's should be reported to the administrator. in this change, EDQUOT is also considered as an environmental failure just like EIO, EACCES and ENOSPC. they could be thrown when stopping an server. Fixes scylladb#12626 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
* use `defer_verbose_shutdown()` to shutdown compaction manager `EDQUOT` is quite similar as `ENOSPC`, in the sense that both of them are caused by environmental issues. before this change, `compaction_manager` filters the ENOSPC exceptions thrown by `compaction_manager::really_do_stop()`, so they are not propagated to caller when calling `compaction_manager::stop()` -- only a warning message is printed in the log. but `EDQUOT` is not handled. after this change, the exception raised by compaction manager's stop process is not filtered anymore and is handled by `defer_verbose_shutdown()` instead, which is able to check the type of exception, and print out error message in the log. so the `ENOSPC` and `EDQUOT` errors are taken care of, and more visible from user's perspective as they are printed as errors instead of warning. but they are not printed using the `compaction_manager` logger anymore. so if our testing or user's workflow depends on this behavior, the related setting should be updated accordingly. Fixes scylladb#12626 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
retry when memtable flush fails due to EDQUOT. there are chances that user exceeds the disk quota when scylla flushes memtable and user manages to free up the necessary resource before the next retry. before this change, we simply `abort()` in this case. after this change, we just keep on retrying until the service is shutdown. Fixes scylladb#12626 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
* use `defer_verbose_shutdown()` to shutdown compaction manager `EDQUOT` is quite similar as `ENOSPC`, in the sense that both of them are caused by environmental issues. before this change, `compaction_manager` filters the ENOSPC exceptions thrown by `compaction_manager::really_do_stop()`, so they are not propagated to caller when calling `compaction_manager::stop()` -- only a warning message is printed in the log. but `EDQUOT` is not handled. after this change, the exception raised by compaction manager's stop process is not filtered anymore and is handled by `defer_verbose_shutdown()` instead, which is able to check the type of exception, and print out error message in the log. so the `ENOSPC` and `EDQUOT` errors are taken care of, and more visible from user's perspective as they are printed as errors instead of warning. but they are not printed using the `compaction_manager` logger anymore. so if our testing or user's workflow depends on this behavior, the related setting should be updated accordingly. Fixes scylladb#12626 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
retry when memtable flush fails due to EDQUOT. there are chances that user exceeds the disk quota when scylla flushes memtable and user manages to free up the necessary resource before the next retry. before this change, we simply `abort()` in this case. after this change, we just keep on retrying until the service is shutdown. Fixes scylladb#12626 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
- main: consider EDQUOT as environmental failure also - main: use defer_verbose_shutdown() to shutdown compaction manager - replica/table: extract should_retry() int with_retry - replica/table: retry on EDQUOT when flushing memtable Fixes #12626 Closes #12653 * github.com:scylladb/scylladb: replica/table: retry on EDQUOT when flushing memtable replica/table: extract should_retry() int with_retry main: use defer_verbose_shutdown() to shutdown compaction manager main: consider EDQUOT as environmental failure also
Not a regression, not backporting. There's something to be said about this being a bug, but let's deal with it later. |
When using XFS quota it filesystem may return EDQUOT error. Scylla should handle it properly.
The text was updated successfully, but these errors were encountered: