Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ZTS: Fix zpool_reopen_001_pos #9680

Merged
merged 1 commit into from Dec 9, 2019

Conversation

behlendorf
Copy link
Contributor

@behlendorf behlendorf commented Dec 4, 2019

Motivation and Context

Debug CI failures of zpool_reopen_001_pos. I wasn't able to reproduce
this locally so I've updated the code to provide additional debugging.

http://build.zfsonlinux.org/builders/Ubuntu%2018.04%20x86_64%20%28TEST%29/builds/6326

There default timeout values have not been yet changed since I'd like to
reproduce this issue if possible.

note: Hitting the open timeout has only been observed with test that
use a scsi_debug device. Other occasional test failures appear to be
encountering the same issue.

Description

Update the vdev_disk_open() retry logic to use a specified number
of milliseconds to be more robust. Additionally, log both the
time waited and requested timeout to the internal log for debugging.

How Has This Been Tested?

Locally by running the zpool_reopen ZTS tests. Pending full CI run.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

@behlendorf behlendorf added Component: Test Suite Indicates an issue with the test framework or a test case Status: Work in Progress Not yet ready for general review labels Dec 4, 2019
Update the vdev_disk_open() retry logic to use a specified number
of milliseconds to be more robust.  Additionally, on failure log
both the time waited and requested timeout to the internal log.

The default maximum allowed open retry time has been increased
from 500ms to 1000ms.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
@behlendorf
Copy link
Contributor Author

With the debugging from the PR and the CI logs I was able to verify that the 500ms timeout was in fact reached. Given that, I've increased the timeout to 1000ms to see if the issue can still be reproduced.

@codecov
Copy link

codecov bot commented Dec 6, 2019

Codecov Report

Merging #9680 into master will increase coverage by <1%.
The diff coverage is 100%.

Impacted file tree graph

@@           Coverage Diff            @@
##           master    #9680    +/-   ##
========================================
+ Coverage      79%      79%   +<1%     
========================================
  Files         418      418            
  Lines      123572   123575     +3     
========================================
+ Hits        97956    98193   +237     
+ Misses      25616    25382   -234
Flag Coverage Δ
#kernel 80% <100%> (ø) ⬆️
#user 67% <ø> (+1%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f95704c...f63b930. Read the comment docs.

@behlendorf behlendorf marked this pull request as ready for review December 6, 2019 21:28
Copy link
Contributor

@Ornias1993 Ornias1993 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While we can't be 100% sure it solves the bug in all edgecases.
(can we ever be 100% certain?)

I think this is at least more robust so, LGTM.

@Ornias1993
Copy link
Contributor

@behlendorf If you mark it "not draft, ready for review as real PR" maybe remove the "WIP" tag and replace it with "review requested"?

@behlendorf behlendorf added Status: Code Review Needed Ready for review and testing and removed Status: Work in Progress Not yet ready for general review labels Dec 8, 2019
@behlendorf behlendorf changed the title ZTS: Fix zpool_reopen_001_pos (DEBUG) ZTS: Fix zpool_reopen_001_pos Dec 9, 2019
@behlendorf
Copy link
Contributor Author

Doubling the timeout appears to have resolved the issue. After running this PR through many times I haven't see this failure. At a minimum it does appear to reduce the frequency.

@behlendorf behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Dec 9, 2019
@behlendorf behlendorf merged commit a25861d into openzfs:master Dec 9, 2019
tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Dec 26, 2019
Update the vdev_disk_open() retry logic to use a specified number
of milliseconds to be more robust.  Additionally, on failure log
both the time waited and requested timeout to the internal log.

The default maximum allowed open retry time has been increased
from 500ms to 1000ms.

Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes openzfs#9680
Conflicts:
tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Dec 27, 2019
Update the vdev_disk_open() retry logic to use a specified number
of milliseconds to be more robust.  Additionally, on failure log
both the time waited and requested timeout to the internal log.

The default maximum allowed open retry time has been increased
from 500ms to 1000ms.

Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes openzfs#9680
Conflicts:
tonyhutter pushed a commit that referenced this pull request Jan 23, 2020
Update the vdev_disk_open() retry logic to use a specified number
of milliseconds to be more robust.  Additionally, on failure log
both the time waited and requested timeout to the internal log.

The default maximum allowed open retry time has been increased
from 500ms to 1000ms.

Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #9680
Conflicts:
shartse pushed a commit to shartse/zfs that referenced this pull request Jul 21, 2020
Update the vdev_disk_open() retry logic to use a specified number
of milliseconds to be more robust.  Additionally, on failure log
both the time waited and requested timeout to the internal log.

The default maximum allowed open retry time has been increased
from 500ms to 1000ms.

Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes openzfs#9680
shartse pushed a commit to shartse/zfs that referenced this pull request Aug 19, 2020
Update the vdev_disk_open() retry logic to use a specified number
of milliseconds to be more robust.  Additionally, on failure log
both the time waited and requested timeout to the internal log.

The default maximum allowed open retry time has been increased
from 500ms to 1000ms.

Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes openzfs#9680
@behlendorf behlendorf deleted the zpool_reopen_001_pos branch April 19, 2021 20:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Component: Test Suite Indicates an issue with the test framework or a test case Status: Accepted Ready to integrate (reviewed, tested)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants