-
Notifications
You must be signed in to change notification settings - Fork 25.6k
Preserve PyTest Cache across job runs #100522
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/100522
Note: Links to docs will display an error until the docs builds have been completed. ❗ 1 Active SEVsThere are 1 currently active SEVs. If your PR is affected, please view them below: ❌ 2 New FailuresAs of commit e36f78e: NEW FAILURES - The following jobs have failed:
This comment was automatically generated by Dr. CI and updates every 15 minutes. |
4f121e5
to
38332ff
Compare
.github/workflows/_rocm-test.yml
Outdated
github-token: ${{ secrets.GITHUB_TOKEN }} | ||
test-matrix: ${{ inputs.test-matrix }} | ||
|
||
- name: Download pytest cache |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a no-op for ROCm as I don't think they have access to S3. It's fine though as we have continue-on-error: true
here.
I don't see ROCm test is run inhttps://hud.pytorch.org/pr/100522, it's probably a good idea to do a rebase as it was restored from unstable since last week.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rocm does seem to have access to S3 (at least the read command appears to work as per the "Download pytest cache" step in this job), though I'm surprised that there isn't a .pytest_cache folder found to upload despite there being a failure
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh man, so all jobs map the $GITHUB_WORKSPACE
path to the /var/lib/jenkins/workspace
folder.
However, while in most linux jobs pytest uses that workspace folder as it's root folder, rocm uses /var/lib/jenkins/pytorch
as it's root folder for some reason. That docker folder isn't mapped to any file on the drive and is thus inaccessible outside the docker environment!
(One simple fix would be to vmap that folder as well...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the context. For now I'm removing rocm support (we can look into adding it back later, since rocm machines do appear to have S3 access) and mac support (since they definitely don't have S3 access).
For now let's focus on getting the basic Linux/Windows support
I'm adding |
return prefix | ||
|
||
|
||
def _merge_pytest_caches( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @clee2000 Do you think that this would work with the new stepcurrent plugin https://github.com/pytorch/pytorch/blob/main/test/conftest.py#L21? As this plugin saves the last test runs into the cache and resumes from that, I have a feeling that we don't need to persist this information across jobs (may be in the future, but not now)
AFAIK, there are at least the following entries in pytest cache:
- stepcurrent (or its official cousin stepwise)
- lastfailed
Do we need to persist anything from the cache besides lastfailed
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as of right now, the only file that needs to be saved is lastfailed. In the future nodeids might also be useful for running new tests first as well. Stepcurrent and stepwise related files and folders should definitely not be copied
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the context! Limiting caching to just the static support files (which pytest doesn't recreate if the .pytest_cache dir already exists) and the lastfailed file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a nice change. IMO, we could get a lot more from this in the future. Specifically, what if we have a generic pytest plugin pytest-awesome-s3-cache
or something that can be installed and used. It's for sure that not only PyTorch needs this feature :)
@@ -0,0 +1,101 @@ | |||
import json |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a thought, it's ok to have these functions here but some of them reminds me of the list of utilities functions in https://github.com/pytorch/pytorch/blob/main/tools/stats/upload_stats_lib.py. So I guess there are some duplications
@pytorchbot merge -f "Failures are unrelated" |
Merge startedYour change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
Makes the CI prioritize running any test files that had a failing test in a previous iteration of the given PR. A follow up to #100522 which makes the `.pytest_cache` available to use here A concrete example: 1. Person A pushes a new commit and creates a PR. 2. 2 hours later, test_im_now_broken.py fails 3. Person A attempts to fix the test, but the test is actually still broken 4. The CI, seeing that test_im_now_broken.py had failed on a previous run, will now prioritize running that test first. Instead of waiting another 2 hours to get a signal, Person A only needs to wait ~15 minutes (which is how long it takes for tests to start running) # Testing I modified a file to make the tests invoking it fail and triggered CI twice with this failure. First run: https://github.com/pytorch/pytorch/actions/runs/4963943209/jobs/8883800811 Test step took 1h 9m to run Second run: https://github.com/pytorch/pytorch/actions/runs/4965016776/jobs/8885657992 Test step failed within 2m 27s Pull Request resolved: #101123 Approved by: https://github.com/malfet, https://github.com/huydhn
Makes the CI prioritize running any test files that had a failing test in a previous iteration of the given PR. A follow up to #100522 which makes the `.pytest_cache` available to use here A concrete example: 1. Person A pushes a new commit and creates a PR. 2. 2 hours later, test_im_now_broken.py fails 3. Person A attempts to fix the test, but the test is actually still broken 4. The CI, seeing that test_im_now_broken.py had failed on a previous run, will now prioritize running that test first. Instead of waiting another 2 hours to get a signal, Person A only needs to wait ~15 minutes (which is how long it takes for tests to start running) # Testing I modified a file to make the tests invoking it fail and triggered CI twice with this failure. First run: https://github.com/pytorch/pytorch/actions/runs/4963943209/jobs/8883800811 Test step took 1h 9m to run Second run: https://github.com/pytorch/pytorch/actions/runs/4965016776/jobs/8885657992 Test step failed within 2m 27s Pull Request resolved: #101123 Approved by: https://github.com/malfet, https://github.com/huydhn
Preserves the PyTest cache from one job run to the next. In a later PR, this will be used to change the order in which we actually run those tests
The process is:
The S3 folder has a retention policy of 30 days, after which the uploaded cache files will get auto-deleted.