Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle EDQUOT error just like ENOSPC #12626

Closed
soyacz opened this issue Jan 24, 2023 · 12 comments · Fixed by #12653
Closed

Handle EDQUOT error just like ENOSPC #12626

soyacz opened this issue Jan 24, 2023 · 12 comments · Fixed by #12653
Assignees
Labels
area/storage Storage space reduction and other storage enhancements
Milestone

Comments

@soyacz
Copy link
Contributor

soyacz commented Jan 24, 2023

When using XFS quota it filesystem may return EDQUOT error. Scylla should handle it properly.

@mykaul
Copy link
Contributor

mykaul commented Jan 24, 2023

When using XFS quota it filesystem may return EDQUOT error. Scylla should handle it properly.

Do you have a reproduction / test that it doesn't?

@soyacz
Copy link
Contributor Author

soyacz commented Jan 24, 2023

When using XFS quota it filesystem may return EDQUOT error. Scylla should handle it properly.

Do you have a reproduction / test that it doesn't?

Not yet, we can plan adding it - we start using https://chaos-mesh.org/docs/simulate-io-chaos-on-kubernetes/ so we'll be able to simulate it easily.

@mykaul
Copy link
Contributor

mykaul commented Jan 24, 2023

When using XFS quota it filesystem may return EDQUOT error. Scylla should handle it properly.

Do you have a reproduction / test that it doesn't?

Not yet, we can plan adding it - we start using https://chaos-mesh.org/docs/simulate-io-chaos-on-kubernetes/ so we'll be able to simulate it easily.

I'm closing this item for the time being.
Let's re-open if indeed (1) XFS returns EDQUOT and (2) Scylla doesn't handle it properly.

As for simulating - just assign an XFS quota on a mount the Scylla K8S operator and fill that mount point, no need for anything artifical.

@mykaul mykaul closed this as not planned Won't fix, can't repro, duplicate, stale Jan 24, 2023
@avikivity
Copy link
Member

When using XFS quota it filesystem may return EDQUOT error. Scylla should handle it properly.

Do you have a reproduction / test that it doesn't?

Not yet, we can plan adding it - we start using https://chaos-mesh.org/docs/simulate-io-chaos-on-kubernetes/ so we'll be able to simulate it easily.

I'm closing this item for the time being. Let's re-open if indeed (1) XFS returns EDQUOT and (2) Scylla doesn't handle it properly.

As for simulating - just assign an XFS quota on a mount the Scylla K8S operator and fill that mount point, no need for anything artifical.

XFS will surely return EDQUOT (just grep the source), and ScyllaDB won't handle it correctly (ditto).

@mykaul
Copy link
Contributor

mykaul commented Jan 24, 2023

When using XFS quota it filesystem may return EDQUOT error. Scylla should handle it properly.

Do you have a reproduction / test that it doesn't?

Not yet, we can plan adding it - we start using https://chaos-mesh.org/docs/simulate-io-chaos-on-kubernetes/ so we'll be able to simulate it easily.

I'm closing this item for the time being. Let's re-open if indeed (1) XFS returns EDQUOT and (2) Scylla doesn't handle it properly.
As for simulating - just assign an XFS quota on a mount the Scylla K8S operator and fill that mount point, no need for anything artifical.

XFS will surely return EDQUOT (just grep the source), and ScyllaDB won't handle it correctly (ditto).

OK, re-opening. @bhalevy - I assume your team should handle this?

@mykaul mykaul reopened this Jan 24, 2023
@mykaul mykaul added the area/storage Storage space reduction and other storage enhancements label Jan 24, 2023
@bhalevy
Copy link
Member

bhalevy commented Jan 24, 2023

Why do we care about EDQUOT?

For the sake of the argument, if we handle it like an I/O error, ENOSPC in particular, and isolate the node, I'd expect the admin to fix the environment and remove any quota restrictions, as scylla is not designed to run as a service on a general purpose server.

@bhalevy
Copy link
Member

bhalevy commented Jan 24, 2023

When using XFS quota it filesystem may return EDQUOT error. Scylla should handle it properly.

I suggest that the proper handling would be to follow ENOSPC handling as effectively, running out of quota is equivalent to running out of storage space (just that it's tunable).

@bhalevy
Copy link
Member

bhalevy commented Jan 24, 2023

@tchaikov I'm assigning to you since it's a good opportunity to survey how we deal with ENOSPC, and you may even find holes in it while inspecting the existing code.

@tchaikov
Copy link
Contributor

@bhalevy sure, will look into this issue once i am back from PTO.

@avikivity
Copy link
Member

When using XFS quota it filesystem may return EDQUOT error. Scylla should handle it properly.

I suggest that the proper handling would be to follow ENOSPC handling as effectively, running out of quota is equivalent to running out of storage space (just that it's tunable).

Yes, I don't see any reason to play games.

Separately, the operator should monitor disk usage and increase quota ahead of time.

@mykaul mykaul changed the title Handle EDQUOT error Handle EDQUOT error just like ENOSPC Jan 24, 2023
tchaikov added a commit to tchaikov/scylladb that referenced this issue Jan 27, 2023
EDQUOT can be returned as the errno when the underlying filesystem
is trying to reserve necessary resources from disk for performing
i/o on behalf of the effective user, and the filesystem fails to
acquire the necessary resources. it could be inode, volume space,
or whatever resources for completing the i/o operation. but none
of them is the consequence of scylla's fault. so we should not
`abort()` at seeing this errno. instead, it's should be reported
to the administrator.

in this change, EDQUOT is also considered as an environmental
failure just like EIO, EACCES and ENOSPC. they could be thrown
when stopping an server.

Fixes scylladb#12626
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
tchaikov added a commit to tchaikov/scylladb that referenced this issue Jan 27, 2023
* use `defer_verbose_shutdown()` to shutdown compaction manager

`EDQUOT` is quite similar as `ENOSPC`, in the sense that both of them
are caused by environmental issues.

before this change, `compaction_manager` filters the
ENOSPC exceptions thrown by `compaction_manager::really_do_stop()`,
so they are not propagated to caller when calling
`compaction_manager::stop()` -- only a warning message is printed
in the log. but `EDQUOT` is not handled.

after this change, the exception raised by compaction manager's
stop process is not filtered anymore and is handled by
`defer_verbose_shutdown()` instead, which is able to check the
type of exception, and print out error message in the log. so
the `ENOSPC` and `EDQUOT` errors are taken care of, and more
visible from user's perspective as they are printed as errors
instead of warning. but they are not printed using the
`compaction_manager` logger anymore. so if our testing or user's
workflow depends on this behavior, the related setting should be
updated accordingly.

Fixes scylladb#12626
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
tchaikov added a commit to tchaikov/scylladb that referenced this issue Jan 27, 2023
retry when memtable flush fails due to EDQUOT.

there are chances that user exceeds the disk quota when scylla flushes
memtable and user manages to free up the necessary resource before the
next retry.

before this change, we simply `abort()` in this case.

after this change, we just keep on retrying until the service is shutdown.

Fixes scylladb#12626
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
tchaikov added a commit to tchaikov/scylladb that referenced this issue Jan 27, 2023
retry when memtable flush fails due to EDQUOT.

there are chances that user exceeds the disk quota when scylla flushes
memtable and user manages to free up the necessary resource before the
next retry.

before this change, we simply `abort()` in this case.

after this change, we just keep on retrying until the service is shutdown.

Fixes scylladb#12626
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
@mykaul mykaul added this to the 5.3 milestone Jan 31, 2023
@mykaul
Copy link
Contributor

mykaul commented Jan 31, 2023

Setting this to 5.3, but we may wish to have it backported later to 5.2.x.

tchaikov added a commit to tchaikov/scylladb that referenced this issue Feb 6, 2023
EDQUOT can be returned as the errno when the underlying filesystem
is trying to reserve necessary resources from disk for performing
i/o on behalf of the effective user, and the filesystem fails to
acquire the necessary resources. it could be inode, volume space,
or whatever resources for completing the i/o operation. but none
of them is the consequence of scylla's fault. so we should not
`abort()` at seeing this errno. instead, it's should be reported
to the administrator.

in this change, EDQUOT is also considered as an environmental
failure just like EIO, EACCES and ENOSPC. they could be thrown
when stopping an server.

Fixes scylladb#12626
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
tchaikov added a commit to tchaikov/scylladb that referenced this issue Feb 6, 2023
* use `defer_verbose_shutdown()` to shutdown compaction manager

`EDQUOT` is quite similar as `ENOSPC`, in the sense that both of them
are caused by environmental issues.

before this change, `compaction_manager` filters the
ENOSPC exceptions thrown by `compaction_manager::really_do_stop()`,
so they are not propagated to caller when calling
`compaction_manager::stop()` -- only a warning message is printed
in the log. but `EDQUOT` is not handled.

after this change, the exception raised by compaction manager's
stop process is not filtered anymore and is handled by
`defer_verbose_shutdown()` instead, which is able to check the
type of exception, and print out error message in the log. so
the `ENOSPC` and `EDQUOT` errors are taken care of, and more
visible from user's perspective as they are printed as errors
instead of warning. but they are not printed using the
`compaction_manager` logger anymore. so if our testing or user's
workflow depends on this behavior, the related setting should be
updated accordingly.

Fixes scylladb#12626
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
tchaikov added a commit to tchaikov/scylladb that referenced this issue Feb 6, 2023
retry when memtable flush fails due to EDQUOT.

there are chances that user exceeds the disk quota when scylla flushes
memtable and user manages to free up the necessary resource before the
next retry.

before this change, we simply `abort()` in this case.

after this change, we just keep on retrying until the service is shutdown.

Fixes scylladb#12626
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
tchaikov added a commit to tchaikov/scylladb that referenced this issue Feb 7, 2023
* use `defer_verbose_shutdown()` to shutdown compaction manager

`EDQUOT` is quite similar as `ENOSPC`, in the sense that both of them
are caused by environmental issues.

before this change, `compaction_manager` filters the
ENOSPC exceptions thrown by `compaction_manager::really_do_stop()`,
so they are not propagated to caller when calling
`compaction_manager::stop()` -- only a warning message is printed
in the log. but `EDQUOT` is not handled.

after this change, the exception raised by compaction manager's
stop process is not filtered anymore and is handled by
`defer_verbose_shutdown()` instead, which is able to check the
type of exception, and print out error message in the log. so
the `ENOSPC` and `EDQUOT` errors are taken care of, and more
visible from user's perspective as they are printed as errors
instead of warning. but they are not printed using the
`compaction_manager` logger anymore. so if our testing or user's
workflow depends on this behavior, the related setting should be
updated accordingly.

Fixes scylladb#12626
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
tchaikov added a commit to tchaikov/scylladb that referenced this issue Feb 7, 2023
retry when memtable flush fails due to EDQUOT.

there are chances that user exceeds the disk quota when scylla flushes
memtable and user manages to free up the necessary resource before the
next retry.

before this change, we simply `abort()` in this case.

after this change, we just keep on retrying until the service is shutdown.

Fixes scylladb#12626
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
denesb added a commit that referenced this issue Feb 7, 2023
- main: consider EDQUOT as environmental failure also
- main: use defer_verbose_shutdown() to shutdown compaction manager
- replica/table: extract should_retry() int with_retry
- replica/table: retry on EDQUOT when flushing memtable

Fixes #12626

Closes #12653

* github.com:scylladb/scylladb:
  replica/table: retry on EDQUOT when flushing memtable
  replica/table: extract should_retry() int with_retry
  main: use defer_verbose_shutdown() to shutdown compaction manager
  main: consider EDQUOT as environmental failure also
@avikivity
Copy link
Member

Not a regression, not backporting. There's something to be said about this being a bug, but let's deal with it later.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/storage Storage space reduction and other storage enhancements
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants