
Conversation

ZainRizvi
Contributor

@ZainRizvi ZainRizvi commented May 3, 2023

Preserves the PyTest cache from one job run to the next. In a later PR, this will be used to change the order in which we actually run those tests

The process is:

  1. Before running tests, check S3 to see if there are any uploaded caches from any shard of the current job
  2. If there are, download them all and merge their contents, putting the merged cache in the default .pytest_cache folder
  3. After running the tests, merge the now-current .pytest_cache folder with the cache previously downloaded for the current shard, so that the merged cache contains every test that has ever failed for the given PR in the current shard
  4. Upload the resulting cache file back to S3

The S3 folder has a retention policy of 30 days, after which the uploaded cache files will get auto-deleted.
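
For reference, here's a minimal sketch of steps 1-3 above, assuming a boto3-based helper. The bucket name, key layout, and function names are illustrative assumptions, not the actual tooling added in this PR:

```python
# Illustrative sketch only: bucket name, key layout, and helper names are assumptions.
import json
import zipfile
from pathlib import Path

import boto3  # assumed dependency for S3 access

BUCKET = "gha-artifacts"          # hypothetical bucket name
CACHE_DIR = Path(".pytest_cache")


def download_shard_caches(pr_number: str, job: str, dest: Path) -> list[Path]:
    """Download every uploaded cache archive for this PR/job from S3 (steps 1-2)."""
    dest.mkdir(parents=True, exist_ok=True)
    s3 = boto3.client("s3")
    prefix = f"pytest_cache/{pr_number}/{job}/"  # assumed key layout
    resp = s3.list_objects_v2(Bucket=BUCKET, Prefix=prefix)
    downloaded = []
    for obj in resp.get("Contents", []):
        local = dest / Path(obj["Key"]).name
        s3.download_file(BUCKET, obj["Key"], str(local))
        downloaded.append(local)
    return downloaded


def merge_lastfailed(cache_zips: list[Path], target: Path = CACHE_DIR) -> None:
    """Union the lastfailed entries from every shard into the local cache (steps 2-3)."""
    merged: dict[str, bool] = {}
    for zip_path in cache_zips:
        with zipfile.ZipFile(zip_path) as zf:
            if "v/cache/lastfailed" in zf.namelist():
                merged.update(json.loads(zf.read("v/cache/lastfailed")))
    out = target / "v" / "cache" / "lastfailed"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(merged, indent=2))
```

Step 4 would then be the reverse: zip up the merged .pytest_cache folder and push it back with s3.upload_file so the next run of the same shard can pick it up.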

@pytorch-bot

pytorch-bot bot commented May 3, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/100522

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

❌ 2 New Failures

As of commit e36f78e:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the topic: not user facing topic category label May 3, 2023
@ZainRizvi ZainRizvi force-pushed the zainr/pytest-cache branch from 4f121e5 to 38332ff on May 6, 2023 11:42
@ZainRizvi ZainRizvi changed the title Add Test Reordering Preserve PyTest Cache across job runs May 8, 2023
@ZainRizvi ZainRizvi marked this pull request as ready for review May 8, 2023 17:18
@ZainRizvi ZainRizvi requested a review from a team as a code owner May 8, 2023 17:18
@ZainRizvi ZainRizvi requested review from clee2000 and huydhn May 8, 2023 18:45
@huydhn huydhn added the ciflow/trunk Trigger trunk jobs on your pull request label May 9, 2023
github-token: ${{ secrets.GITHUB_TOKEN }}
test-matrix: ${{ inputs.test-matrix }}

- name: Download pytest cache
Contributor

@huydhn huydhn May 9, 2023


This is a no-op for ROCm as I don't think they have access to S3. It's fine though as we have continue-on-error: true here.

I don't see the ROCm test being run in https://hud.pytorch.org/pr/100522. It's probably a good idea to do a rebase, as it was restored from unstable last week.

Contributor Author


ROCm does seem to have access to S3 (at least the read command appears to work, per the "Download pytest cache" step in this job), though I'm surprised that no .pytest_cache folder was found to upload despite there being a failure.

Contributor Author

@ZainRizvi ZainRizvi May 9, 2023


Oh man, so all jobs map the $GITHUB_WORKSPACE path to the /var/lib/jenkins/workspace folder.

However, while pytest uses that workspace folder as its root folder in most Linux jobs, ROCm uses /var/lib/jenkins/pytorch as its root folder for some reason. That Docker folder isn't mapped to any file on the drive and is thus inaccessible outside the Docker environment!

(One simple fix would be to volume-map that folder as well...)

Contributor Author


Thanks for the context. For now I'm removing ROCm support (we can look into adding it back later, since the ROCm machines do appear to have S3 access) and Mac support (since those machines definitely don't have S3 access).

Let's focus on getting basic Linux/Windows support working first.

@huydhn
Contributor

huydhn commented May 9, 2023

I'm adding ciflow/unstable just to see if something fails and the cache is uploaded

@huydhn huydhn added the ciflow/unstable Run all experimental or flaky jobs on PyTorch unstable workflow label May 9, 2023
return prefix


def _merge_pytest_caches(
Contributor

@huydhn huydhn May 9, 2023


cc @clee2000 Do you think this would work with the new stepcurrent plugin https://github.com/pytorch/pytorch/blob/main/test/conftest.py#L21? As this plugin saves the last test runs into the cache and resumes from them, I have a feeling that we don't need to persist this information across jobs (maybe in the future, but not now).

AFAIK, there are at least the following entries in the pytest cache:

  • stepcurrent (or its official cousin stepwise)
  • lastfailed

Do we need to persist anything from the cache besides lastfailed?

Contributor


As of right now, the only file that needs to be saved is lastfailed. In the future, nodeids might also be useful for running new tests first. Stepcurrent- and stepwise-related files and folders should definitely not be copied.

Contributor Author


Thanks for the context! I'm limiting caching to just the static support files (which pytest doesn't recreate if the .pytest_cache dir already exists) and the lastfailed file.
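
For clarity, a rough sketch of that subset (the file names follow pytest's cache layout; the helper itself is hypothetical, not the code in this PR):

```python
import shutil
from pathlib import Path

# Static support files pytest won't recreate if .pytest_cache already exists.
STATIC_SUPPORT_FILES = ["CACHEDIR.TAG", ".gitignore", "README.md"]


def copy_cache_subset(src: Path, dest: Path) -> None:
    """Copy only lastfailed plus the static support files; skip stepcurrent/stepwise."""
    dest.mkdir(parents=True, exist_ok=True)
    for name in STATIC_SUPPORT_FILES:
        if (src / name).exists():
            shutil.copy2(src / name, dest / name)
    lastfailed = src / "v" / "cache" / "lastfailed"
    if lastfailed.exists():
        (dest / "v" / "cache").mkdir(parents=True, exist_ok=True)
        shutil.copy2(lastfailed, dest / "v" / "cache" / "lastfailed")
```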

@ZainRizvi ZainRizvi added the ciflow/periodic Trigger jobs ran periodically on master (periodic.yml) on the PR label May 9, 2023
@ZainRizvi ZainRizvi requested a review from huydhn May 9, 2023 22:03
Contributor

@huydhn huydhn left a comment


This is a nice change. IMO, we could get a lot more from this in the future. Specifically, what if we had a generic pytest plugin, pytest-awesome-s3-cache or something, that could be installed and reused? PyTorch is surely not the only project that needs this feature :)

@@ -0,0 +1,101 @@
import json
Contributor


Just a thought: it's ok to have these functions here, but some of them remind me of the utility functions in https://github.com/pytorch/pytorch/blob/main/tools/stats/upload_stats_lib.py. So I guess there is some duplication.

@ZainRizvi
Contributor Author

@pytorchbot merge -f "Failures are unrelated"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.

pytorchmergebot pushed a commit that referenced this pull request May 16, 2023
Makes the CI prioritize running any test files that had a failing test in a previous iteration of the given PR.

A follow-up to #100522, which makes the `.pytest_cache` available to use here.

A concrete example:
1. Person A pushes a new commit and creates a PR.
2. 2 hours later, test_im_now_broken.py fails
3. Person A attempts to fix the test, but the test is actually still broken
4. The CI, seeing that test_im_now_broken.py had failed on a previous run, will now prioritize running that test first. Instead of waiting another 2 hours to get a signal, Person A only needs to wait ~15 minutes (which is how long it takes for tests to start running)

# Testing
I modified a file to make the tests invoking it fail and triggered CI twice with this failure.

First run: https://github.com/pytorch/pytorch/actions/runs/4963943209/jobs/8883800811
Test step took 1h 9m to run

Second run: https://github.com/pytorch/pytorch/actions/runs/4965016776/jobs/8885657992
Test step failed within 2m 27s

Pull Request resolved: #101123
Approved by: https://github.com/malfet, https://github.com/huydhn
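
As a rough illustration of the reordering idea described in that commit message (not the actual implementation in #101123), test files containing a previously failed test can simply be moved to the front of the run order:

```python
import json
from pathlib import Path


def prioritize_previously_failed(test_files: list[str], cache_dir: Path) -> list[str]:
    """Move test files that had a failing test in a prior run to the front."""
    lastfailed_path = cache_dir / "v" / "cache" / "lastfailed"
    if not lastfailed_path.exists():
        return test_files
    lastfailed = json.loads(lastfailed_path.read_text())
    # lastfailed keys are node ids like "test_foo.py::TestClass::test_case"
    failed_files = {nodeid.split("::")[0] for nodeid in lastfailed}
    failed_first = [f for f in test_files if f in failed_files]
    rest = [f for f in test_files if f not in failed_files]
    return failed_first + rest
```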
jcaip pushed a commit that referenced this pull request May 23, 2023

Labels

  • ciflow/periodic Trigger jobs ran periodically on master (periodic.yml) on the PR
  • ciflow/trunk Trigger trunk jobs on your pull request
  • ciflow/unstable Run all experimental or flaky jobs on PyTorch unstable workflow
  • Merged
  • topic: not user facing topic category
