Backport bug fixes for a v1.1.1 release #2160

Merged
merged 11 commits into from Mar 25, 2022

Conversation

dominiklohmann
Member

@dominiklohmann dominiklohmann commented Mar 22, 2022

In large-scale deployments we observed the disk monitor's erase request returning a timeout error, which cancelled the ongoing request. This is a two-fold bug fix: First, we must not use a timeout for process-internal actor communication, especially not for such complicated nested loops that, when partially executed and never resumed, leave the actor in an undefined state. Second, we must treat erase requests with high priority because they should never be held up by queued requests on the read or write path.

If this does not fix the bug, a release with these changes will at the very least help us track down the actual source of the issue, because it is no longer masked by the request timeout error.
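
To illustrate the second point, here is a minimal, language-neutral sketch in Python of a mailbox in which a high-priority erase request overtakes read and write requests that were queued earlier. It is not the code from this PR: the `Mailbox` class, the `HIGH`/`NORMAL` constants, and the message names are all hypothetical, and the real change relies on the actor framework's own message priorities rather than a hand-written queue.

```python
import heapq
import itertools

# Hypothetical priority levels; the lower value is dequeued first.
HIGH, NORMAL = 0, 1


class Mailbox:
    """Hypothetical two-level priority mailbox, FIFO within each level."""

    def __init__(self):
        self._heap = []
        # Monotonic counter used as a tie-breaker to preserve arrival order.
        self._seq = itertools.count()

    def push(self, message, priority=NORMAL):
        heapq.heappush(self._heap, (priority, next(self._seq), message))

    def pop(self):
        _, _, message = heapq.heappop(self._heap)
        return message

    def __len__(self):
        return len(self._heap)


def drain(mailbox):
    """Return the messages in the order an actor would process them."""
    order = []
    while len(mailbox):
        order.append(mailbox.pop())
    return order


mailbox = Mailbox()
mailbox.push("query-1")                          # read path
mailbox.push("ingest-1")                         # write path
mailbox.push("erase-partition", priority=HIGH)   # disk monitor erase request
mailbox.push("query-2")                          # read path

processing_order = drain(mailbox)
print(processing_order)
# ['erase-partition', 'query-1', 'ingest-1', 'query-2']
```

The erase request arrives third but is processed first, while the normal-priority requests keep their arrival order. The sketch covers only the prioritization; removing the request timeout is a separate change that it does not model.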

📝 Checklist

  • All user-facing changes have changelog entries.
  • The changes are reflected on docs.tenzir.com/vast, if necessary.
  • The PR description contains instructions for the reviewer, if necessary.

🎯 Review Instructions

Run on our testbed.

@dominiklohmann dominiklohmann added the bug Incorrect behavior label Mar 22, 2022
@dominiklohmann dominiklohmann changed the base branch from master to v1.1.x March 22, 2022 11:14
@lava lava marked this pull request as ready for review March 23, 2022 16:15
Co-authored-by: Benno Evers <benno.evers@tenzir.com>
Member

@lava lava left a comment

Changes look good; there are no unit tests, but we verified manually on the testbed that they work and improve disk monitor behavior.

@dominiklohmann dominiklohmann changed the title from "Give erase requests a high priority" to "Backport bug fixes for a v1.1.1 release" Mar 25, 2022
@dominiklohmann dominiklohmann merged commit 7b99e63 into v1.1.x Mar 25, 2022
@dominiklohmann dominiklohmann deleted the topic/disk-monitor-prio branch March 25, 2022 16:45
Labels
bug Incorrect behavior
2 participants