@zou3519 zou3519 commented Mar 9, 2023

Fixes #96347

This PR:

  • Makes the functorch tests run as a part of the "default" shards
  • Deletes the functorch CI shard from all CI job configurations (if it exists)
  • Increases the "default" shard count by 1 for each job, unless it was
    previously set to 1, to accommodate the new functorch tests and not
    regress time-to-signal.
  • Adds a bunch of skips for ROCM and torchdynamo configurations. We can
    investigate them later.
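
As a rough illustration of the shard changes listed above, here is a minimal Python sketch. It is not the script used for this PR (the PR edits the workflow files directly). The entry keys `config`, `shard`, `num_shards`, and `runner`, and the function name `fold_functorch_into_default`, are assumptions made for the example.

```python
def fold_functorch_into_default(entries):
    """Drop "functorch" shards and add one extra "default" shard.

    `entries` is a list of dicts describing one job's test matrix.
    The key names are assumptions for this sketch, not a confirmed schema.
    """
    # Copy each entry so the caller's matrix is left untouched.
    kept = [dict(e) for e in entries if e["config"] != "functorch"]
    defaults = [e for e in kept if e["config"] == "default"]

    # A job with a single default shard (or none) keeps its shard count.
    if len(defaults) <= 1:
        return kept

    new_total = len(defaults) + 1
    for e in defaults:
        e["num_shards"] = new_total

    # Clone the last default shard as the new, final shard.
    extra = dict(defaults[-1], shard=new_total)
    last_default = max(i for i, e in enumerate(kept) if e["config"] == "default")
    kept.insert(last_default + 1, extra)
    return kept


matrix = [
    {"config": "default", "shard": 1, "num_shards": 2, "runner": "example-runner"},
    {"config": "default", "shard": 2, "num_shards": 2, "runner": "example-runner"},
    {"config": "functorch", "shard": 1, "num_shards": 1, "runner": "example-runner"},
]
for entry in fold_functorch_into_default(matrix):
    print(entry)
```

The sketch takes the number of "default" entries as the current shard count, so it assumes the matrix is internally consistent.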

NB: I might go through some more iterations to figure out what other
skips need to be added, but this iteration of the PR seems to pass most
of the CI suite.
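
For readers unfamiliar with configuration-specific skips, the following is a minimal sketch using only the standard `unittest` module. It is not the set of decorators this PR uses. The environment variable names, the flags, and the test class are hypothetical and exist only for illustration.

```python
import os
import unittest

# Hypothetical flags; the real CI decides which configuration is active differently.
RUNNING_ON_ROCM = os.environ.get("EXAMPLE_TEST_WITH_ROCM") == "1"
RUNNING_WITH_DYNAMO = os.environ.get("EXAMPLE_TEST_WITH_DYNAMO") == "1"


class ExampleFunctorchTest(unittest.TestCase):
    @unittest.skipIf(RUNNING_ON_ROCM, "fails on ROCm; investigate later")
    def test_rocm_sensitive(self):
        self.assertEqual(sum([1, 2, 3]), 6)

    @unittest.skipIf(RUNNING_WITH_DYNAMO, "fails under torchdynamo; investigate later")
    def test_dynamo_sensitive(self):
        self.assertEqual(max(1, 2), 2)


def run_example():
    """Run the example tests and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ExampleFunctorchTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)


if __name__ == "__main__":
    run_example()
```

A skipped test is reported as skipped rather than failed, so the shard stays green while the skip reason records what still needs investigation.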

Test Plan:

  • wait for CI

@pytorch-bot pytorch-bot bot added the "release notes: releng" (release notes category) label Mar 9, 2023
pytorch-bot bot commented Mar 9, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/96464

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 Failures

As of commit 6dd2479:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

zou3519 added a commit that referenced this pull request Mar 9, 2023
Body to come soon

ghstack-source-id: 0a00907
Pull Request resolved: #96464

zou3519 commented Mar 9, 2023

Not ready for review yet

@zou3519 zou3519 added the "keep-going" (Don't stop on first failure, keep running tests until the end) and "ciflow/trunk" (Trigger trunk jobs on your pull request) labels Mar 9, 2023
…ecific shards"

Body to come soon

[ghstack-poisoned]
zou3519 added a commit that referenced this pull request Mar 10, 2023
Body to come soon

ghstack-source-id: c100787
Pull Request resolved: #96464
zou3519 added a commit that referenced this pull request Mar 13, 2023
Body to come soon

ghstack-source-id: 53f7c2f
Pull Request resolved: #96464
zou3519 added a commit that referenced this pull request Mar 15, 2023
Body to come soon

ghstack-source-id: bfce049
Pull Request resolved: #96464
zou3519 added a commit that referenced this pull request Mar 15, 2023
Body to come soon

ghstack-source-id: b462516
Pull Request resolved: #96464
zou3519 added a commit that referenced this pull request Mar 16, 2023
Body to come soon

ghstack-source-id: c2a9e0e
Pull Request resolved: #96464
@zou3519 zou3519 marked this pull request as ready for review March 16, 2023 16:31
@zou3519 zou3519 requested review from a team, Chillee, ezyang and kshitij12345 as code owners March 16, 2023 16:31
@zou3519 zou3519 requested a review from huydhn March 16, 2023 16:31

huydhn commented Mar 16, 2023

FYI, there is one remaining functorch shard for MacOS x86_64 in periodic https://github.com/pytorch/pytorch/blob/master/.github/workflows/periodic.yml#L313

@huydhn huydhn left a comment

LGTM! Let's also update the MacOS x86_64 shard and wait if all tests pass

zou3519 added a commit that referenced this pull request Mar 16, 2023
ghstack-source-id: 13a4c38
Pull Request resolved: #96464

zou3519 commented Mar 16, 2023

> FYI, there is one remaining functorch shard for MacOS x86_64 in periodic https://github.com/pytorch/pytorch/blob/master/.github/workflows/periodic.yml#L313

Good catch, I forgot about the jobs in periodic. I updated that shard and also increased the default shard count by 1 for the jobs in periodic.

zou3519 added a commit that referenced this pull request Mar 16, 2023
ghstack-source-id: ee52726
Pull Request resolved: #96464
zou3519 added a commit that referenced this pull request Mar 20, 2023
ghstack-source-id: ca719db
Pull Request resolved: #96464
zou3519 added a commit that referenced this pull request Mar 20, 2023
ghstack-source-id: a3b703a
Pull Request resolved: #96464

zou3519 commented Mar 21, 2023

@pytorchbot merge -f "test failure looks flaky"

@pytorchmergebot

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: Check the merge workflow status here

pytorchmergebot pushed a commit that referenced this pull request Mar 21, 2023
Before #96464, ROCm tests in trunk were already quite flaky https://hud.pytorch.org/reliability/pytorch/pytorch?jobName=trunk%20%2F%20linux-focal-rocm5.4.2-py3.8%20%2F%20test%20(default).

After #96464, there is a new group of flaky failures coming from functorch, so let's mark the test as flaky to monitor it without impacting trunk.

The two flaky tests currently seen in trunk are:

* #97256
* `functorch/test_memory_efficient_fusion.py` OOM

Pull Request resolved: #97259
Approved by: https://github.com/malfet, https://github.com/zou3519
cyyever pushed a commit to cyyever/pytorch_private that referenced this pull request Mar 23, 2023
…ds (#96464)

Pull Request resolved: pytorch/pytorch#96464
Approved by: https://github.com/huydhn
cyyever pushed a commit to cyyever/pytorch_private that referenced this pull request Mar 23, 2023
Pull Request resolved: pytorch/pytorch#97259
Approved by: https://github.com/malfet, https://github.com/zou3519
cyyever pushed a commit to cyyever/pytorch_private that referenced this pull request Mar 27, 2023
…ds (#96464)

Pull Request resolved: pytorch/pytorch#96464
Approved by: https://github.com/huydhn
cyyever pushed a commit to cyyever/pytorch_private that referenced this pull request Mar 27, 2023
Pull Request resolved: pytorch/pytorch#97259
Approved by: https://github.com/malfet, https://github.com/zou3519
@facebook-github-bot facebook-github-bot deleted the gh/zou3519/615/head branch June 8, 2023 19:34
Labels

- ciflow/trunk: Trigger trunk jobs on your pull request
- keep-going: Don't stop on first failure, keep running tests until the end
- Merged
- release notes: releng: release notes category
