Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Backport bug fixes for a v1.1.1 release #2160

Merged
merged 11 commits into from
Mar 25, 2022
Merged

Conversation

dominiklohmann
Copy link
Member

@dominiklohmann dominiklohmann commented Mar 22, 2022

In large-scale deployments we observed the disk monitor erase request to return a timeout error, cancelling the ongoing request. This is a two-fold bug fix: First, we must not use a timeout for process-internal actor communication, especially not for such complicated nested loops that when partially executed and then never resumed leave the actor in an undefined state. Second, we must treat erase requests with a high priority because they should never be upheld by queued requests on the read or write path.

If this does not fix the bug, then a release with these changes will at the very least help us track down the actual source of the issue because it's no longer being shadowed from the request timeout error.

📝 Checklist

  • All user-facing changes have changelog entries.
  • The changes are reflected on docs.tenzir.com/vast, if necessary.
  • The PR description contains instructions for the reviewer, if necessary.

🎯 Review Instructions

Run on our testbed.

@dominiklohmann dominiklohmann added the bug Incorrect behavior label Mar 22, 2022
@dominiklohmann dominiklohmann requested a review from lava March 22, 2022 11:11
@dominiklohmann dominiklohmann changed the base branch from master to v1.1.x March 22, 2022 11:14
@dominiklohmann dominiklohmann force-pushed the topic/disk-monitor-prio branch from 8269268 to 0b6e5ce Compare March 22, 2022 12:03
In large-scale deployments we observed the disk monitor erase request to
return a timeout error, cancelling the ongoing request. This is a
two-fold bug fix: First, we must not use a timeout for process-internal
actor communication, especially not for such complicated nested loops
that when partially executed and then never resumed leave the actor in
an undefined state. Second, we must treat erase requests with a high
priority because they should never be upheld by queued requests on the
read or write path.
@dominiklohmann dominiklohmann force-pushed the topic/disk-monitor-prio branch from 0b6e5ce to 7b40ec8 Compare March 22, 2022 12:28
@lava lava mentioned this pull request Mar 23, 2022
3 tasks
@dominiklohmann dominiklohmann force-pushed the topic/disk-monitor-prio branch from 554fc01 to d7d84c6 Compare March 23, 2022 13:50
@dominiklohmann dominiklohmann force-pushed the topic/disk-monitor-prio branch from d7d84c6 to 05b6c6e Compare March 23, 2022 14:03
@lava lava marked this pull request as ready for review March 23, 2022 16:15
Co-authored-by: Benno Evers <benno.evers@tenzir.com>
@dominiklohmann dominiklohmann force-pushed the topic/disk-monitor-prio branch from 16b66c0 to 22c3101 Compare March 24, 2022 08:54
Copy link
Member

@lava lava left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look good; there are not unit tests but we verified manually that they work and improve disk monitor behavior on the testbed.

@dominiklohmann dominiklohmann force-pushed the topic/disk-monitor-prio branch from ad33b0d to 5bf8f33 Compare March 24, 2022 10:00
@dominiklohmann dominiklohmann changed the title Give erase requests a high priority Backport bug fixes for a v1.1.1 release Mar 25, 2022
@dominiklohmann dominiklohmann force-pushed the topic/disk-monitor-prio branch from 7e28a00 to a9a1120 Compare March 25, 2022 16:04
@dominiklohmann dominiklohmann merged commit 7b99e63 into v1.1.x Mar 25, 2022
@dominiklohmann dominiklohmann deleted the topic/disk-monitor-prio branch March 25, 2022 16:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Incorrect behavior
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants