Backport bug fixes for a v1.1.1 release #2160
Merged
dominiklohmann force-pushed the topic/disk-monitor-prio branch from 8269268 to 0b6e5ce (March 22, 2022 12:03)
In large-scale deployments, we observed the disk monitor's erase request returning a timeout error, which cancelled the ongoing request. This is a two-fold bug fix. First, we must not use a timeout for process-internal actor communication, especially not for complicated nested loops that leave the actor in an undefined state when they are partially executed and never resumed. Second, we must treat erase requests with high priority, because they should never be held up by queued requests on the read or write path.
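The priority argument can be illustrated with a toy mailbox. This is only a sketch under stated assumptions, not the project's actual actor code: `PriorityMailbox`, `HIGH`, `NORMAL`, and the message names are hypothetical and exist purely for illustration. It shows an erase request that arrives behind queued read and write requests being handled first.

```python
import heapq
import itertools

# Hypothetical priority levels; a lower value is dequeued first.
HIGH, NORMAL = 0, 1


class PriorityMailbox:
    """Toy mailbox in which high-priority messages overtake queued normal ones."""

    def __init__(self):
        self._heap = []
        # Monotonic counter as tie-breaker, so messages of equal priority stay FIFO.
        self._seq = itertools.count()

    def push(self, message, priority=NORMAL):
        heapq.heappush(self._heap, (priority, next(self._seq), message))

    def pop(self):
        _, _, message = heapq.heappop(self._heap)
        return message

    def __len__(self):
        return len(self._heap)


def drain(mailbox):
    """Pop every message and return them in handling order."""
    order = []
    while len(mailbox):
        order.append(mailbox.pop())
    return order


mailbox = PriorityMailbox()
mailbox.push("query-1")               # read path
mailbox.push("ingest-1")              # write path
mailbox.push("erase", priority=HIGH)  # arrives third, must not wait
mailbox.push("query-2")

handling_order = drain(mailbox)
print(handling_order)  # ['erase', 'query-1', 'ingest-1', 'query-2']
```

With a single FIFO queue, the erase request would have to wait for both earlier requests. Here it is handled first, and the remaining requests keep their arrival order.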
dominiklohmann force-pushed the topic/disk-monitor-prio branch from 0b6e5ce to 7b40ec8 (March 22, 2022 12:28)
dominiklohmann force-pushed the topic/disk-monitor-prio branch from 554fc01 to d7d84c6 (March 23, 2022 13:50)
dominiklohmann force-pushed the topic/disk-monitor-prio branch from d7d84c6 to 05b6c6e (March 23, 2022 14:03)
Co-authored-by: Benno Evers <benno.evers@tenzir.com>
dominiklohmann force-pushed the topic/disk-monitor-prio branch from 16b66c0 to 22c3101 (March 24, 2022 08:54)
lava approved these changes on Mar 24, 2022
Changes look good. There are no unit tests, but we verified manually that the changes work and improve disk monitor behavior on the testbed.
dominiklohmann force-pushed the topic/disk-monitor-prio branch from ad33b0d to 5bf8f33 (March 24, 2022 10:00)
dominiklohmann changed the title from "Give erase requests a high priority" to "Backport bug fixes for a v1.1.1 release" on Mar 25, 2022
dominiklohmann force-pushed the topic/disk-monitor-prio branch from 7e28a00 to a9a1120 (March 25, 2022 16:04)
If these changes do not fix the bug, a release that includes them will at the very least help us track down the actual source of the issue, because the request timeout error no longer shadows it.
📝 Checklist
🎯 Review Instructions
Run on our testbed.