
Conversation

ZainRizvi
Contributor

@ZainRizvi ZainRizvi commented May 3, 2023

Preserves the PyTest cache from one job run to the next. In a later PR, this will be used to change the order in which we actually run those tests

The process is:

  1. Before running tests, check S3 to see if there are any uploaded caches from any shard of the current job
  2. If there are, download them all and merge their contents, putting the merged cache in the default .pytest_cache folder
  3. After running the tests, merge the now-current .pytest_cache folder with the cache previously downloaded for the current shard, so that the merged cache contains every test that has ever failed for the given PR in the current shard
  4. Upload the resulting cache file back to S3

The S3 folder has a retention policy of 30 days, after which the uploaded cache files will get auto-deleted.
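
For reference, here's a minimal sketch of steps 1-3 above, assuming a boto3-based helper. The bucket name, key layout, and function names are illustrative assumptions, not the actual tooling added in this PR:

```python
# Illustrative sketch only: bucket name, key layout, and helper names are assumptions.
import json
import zipfile
from pathlib import Path

import boto3  # assumed dependency for S3 access

BUCKET = "gha-artifacts"          # hypothetical bucket name
CACHE_DIR = Path(".pytest_cache")


def download_shard_caches(pr_number: str, job: str, dest: Path) -> list[Path]:
    """Download every uploaded cache archive for this PR/job from S3 (steps 1-2)."""
    dest.mkdir(parents=True, exist_ok=True)
    s3 = boto3.client("s3")
    prefix = f"pytest_cache/{pr_number}/{job}/"  # assumed key layout
    resp = s3.list_objects_v2(Bucket=BUCKET, Prefix=prefix)
    downloaded = []
    for obj in resp.get("Contents", []):
        local = dest / Path(obj["Key"]).name
        s3.download_file(BUCKET, obj["Key"], str(local))
        downloaded.append(local)
    return downloaded


def merge_lastfailed(cache_zips: list[Path], target: Path = CACHE_DIR) -> None:
    """Union the lastfailed entries from every shard into the local cache (steps 2-3)."""
    merged: dict[str, bool] = {}
    for zip_path in cache_zips:
        with zipfile.ZipFile(zip_path) as zf:
            if "v/cache/lastfailed" in zf.namelist():
                merged.update(json.loads(zf.read("v/cache/lastfailed")))
    out = target / "v" / "cache" / "lastfailed"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(merged, indent=2))
```

Step 4 would then be the reverse: zip up the merged .pytest_cache folder and push it back with s3.upload_file so the next run of the same shard can pick it up.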

@pytorch-bot

pytorch-bot bot commented May 3, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/100522

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

❌ 2 New Failures

As of commit e36f78e:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the topic: not user facing topic category label May 3, 2023
@ZainRizvi ZainRizvi force-pushed the zainr/pytest-cache branch from 4f121e5 to 38332ff on May 6, 2023 11:42
@ZainRizvi ZainRizvi changed the title Add Test Reordering Preserve PyTest Cache across job runs May 8, 2023
@ZainRizvi ZainRizvi marked this pull request as ready for review May 8, 2023 17:18
@ZainRizvi ZainRizvi requested a review from a team as a code owner May 8, 2023 17:18
@ZainRizvi ZainRizvi requested review from clee2000 and huydhn May 8, 2023 18:45
@huydhn huydhn added the ciflow/trunk Trigger trunk jobs on your pull request label May 9, 2023
github-token: ${{ secrets.GITHUB_TOKEN }}
test-matrix: ${{ inputs.test-matrix }}

- name: Download pytest cache
Contributor

@huydhn huydhn May 9, 2023


This is a no-op for ROCm as I don't think they have access to S3. It's fine though as we have continue-on-error: true here.

I don't see the ROCm test being run in https://hud.pytorch.org/pr/100522. It's probably a good idea to do a rebase, as it was restored from unstable last week.

Contributor Author


ROCm does seem to have access to S3 (at least the read command appears to work, per the "Download pytest cache" step in this job), though I'm surprised that no .pytest_cache folder was found to upload despite there being a failure.

Contributor Author

@ZainRizvi ZainRizvi May 9, 2023


Oh man, so all jobs map the $GITHUB_WORKSPACE path to the /var/lib/jenkins/workspace folder.

However, while pytest uses that workspace folder as its root folder in most Linux jobs, ROCm uses /var/lib/jenkins/pytorch as its root folder for some reason. That Docker folder isn't mapped to any file on the drive and is thus inaccessible outside the Docker environment!

(One simple fix would be to volume-map that folder as well...)

Contributor Author


Thanks for the context. For now I'm removing ROCm support (we can look into adding it back later, since the ROCm machines do appear to have S3 access) and Mac support (since those machines definitely don't have S3 access).

Let's focus on getting basic Linux/Windows support working first.

@huydhn
Contributor

huydhn commented May 9, 2023

I'm adding ciflow/unstable just to see if something fails and the cache is uploaded

@huydhn huydhn added the ciflow/unstable Run all experimental or flaky jobs on PyTorch unstable workflow label May 9, 2023
return prefix


def _merge_pytest_caches(
Contributor

@huydhn huydhn May 9, 2023


cc @clee2000 Do you think this would work with the new stepcurrent plugin https://github.com/pytorch/pytorch/blob/main/test/conftest.py#L21? As this plugin saves the last test runs into the cache and resumes from them, I have a feeling that we don't need to persist this information across jobs (maybe in the future, but not now).

AFAIK, there are at least the following entries in the pytest cache:

  • stepcurrent (or its official cousin stepwise)
  • lastfailed

Do we need to persist anything from the cache besides lastfailed?

Contributor


As of right now, the only file that needs to be saved is lastfailed. In the future, nodeids might also be useful for running new tests first. Stepcurrent- and stepwise-related files and folders should definitely not be copied.

Contributor Author


Thanks for the context! I'm limiting caching to just the static support files (which pytest doesn't recreate if the .pytest_cache dir already exists) and the lastfailed file.
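
For clarity, a rough sketch of that subset (the file names follow pytest's cache layout; the helper itself is hypothetical, not the code in this PR):

```python
import shutil
from pathlib import Path

# Static support files pytest won't recreate if .pytest_cache already exists.
STATIC_SUPPORT_FILES = ["CACHEDIR.TAG", ".gitignore", "README.md"]


def copy_cache_subset(src: Path, dest: Path) -> None:
    """Copy only lastfailed plus the static support files; skip stepcurrent/stepwise."""
    dest.mkdir(parents=True, exist_ok=True)
    for name in STATIC_SUPPORT_FILES:
        if (src / name).exists():
            shutil.copy2(src / name, dest / name)
    lastfailed = src / "v" / "cache" / "lastfailed"
    if lastfailed.exists():
        (dest / "v" / "cache").mkdir(parents=True, exist_ok=True)
        shutil.copy2(lastfailed, dest / "v" / "cache" / "lastfailed")
```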

@ZainRizvi ZainRizvi added the ciflow/periodic Trigger jobs ran periodically on master (periodic.yml) on the PR label May 9, 2023
@ZainRizvi ZainRizvi requested a review from huydhn May 9, 2023 22:03
Contributor

@huydhn huydhn left a comment


This is a nice change. IMO, we could get a lot more from this in the future. Specifically, what if we had a generic pytest plugin, pytest-awesome-s3-cache or something, that could be installed and reused? PyTorch is surely not the only project that needs this feature :)

@@ -0,0 +1,101 @@
import json
Contributor


Just a thought: it's ok to have these functions here, but some of them remind me of the utility functions in https://github.com/pytorch/pytorch/blob/main/tools/stats/upload_stats_lib.py. So I guess there is some duplication.

@ZainRizvi
Contributor Author

@pytorchbot merge -f "Failures are unrelated"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.

pytorchmergebot pushed a commit that referenced this pull request May 16, 2023
Makes the CI prioritize running any test files that had a failing test in a previous iteration of the given PR.

A follow-up to #100522, which makes the `.pytest_cache` available to use here.

A concrete example:
1. Person A pushes a new commit and creates a PR.
2. 2 hours later, test_im_now_broken.py fails
3. Person A attempts to fix the test, but the test is actually still broken
4. The CI, seeing that test_im_now_broken.py had failed on a previous run, will now prioritize running that test first. Instead of waiting another 2 hours to get a signal, Person A only needs to wait ~15 minutes (which is how long it takes for tests to start running)

# Testing
I modified a file to make the tests invoking it fail and triggered CI twice with this failure.

First run: https://github.com/pytorch/pytorch/actions/runs/4963943209/jobs/8883800811
Test step took 1h 9m to run

Second run: https://github.com/pytorch/pytorch/actions/runs/4965016776/jobs/8885657992
Test step failed within 2m 27s

Pull Request resolved: #101123
Approved by: https://github.com/malfet, https://github.com/huydhn
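
As a rough illustration of the reordering idea described in that commit message (not the actual implementation in #101123), test files containing a previously failed test can simply be moved to the front of the run order:

```python
import json
from pathlib import Path


def prioritize_previously_failed(test_files: list[str], cache_dir: Path) -> list[str]:
    """Move test files that had a failing test in a prior run to the front."""
    lastfailed_path = cache_dir / "v" / "cache" / "lastfailed"
    if not lastfailed_path.exists():
        return test_files
    lastfailed = json.loads(lastfailed_path.read_text())
    # lastfailed keys are node ids like "test_foo.py::TestClass::test_case"
    failed_files = {nodeid.split("::")[0] for nodeid in lastfailed}
    failed_first = [f for f in test_files if f in failed_files]
    rest = [f for f in test_files if f not in failed_files]
    return failed_first + rest
```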
jcaip pushed a commit that referenced this pull request May 23, 2023

Labels

  • ciflow/periodic Trigger jobs ran periodically on master (periodic.yml) on the PR
  • ciflow/trunk Trigger trunk jobs on your pull request
  • ciflow/unstable Run all experimental or flaky jobs on PyTorch unstable workflow
  • Merged
  • topic: not user facing topic category
