clee2000 (Contributor) commented May 5, 2022

Fixes #ISSUE_NUMBER
Shards `win-vs2019-cuda11.3-py3 / test` into 5 shards, up from 2.
Helps with #76838.

Notes:

  • Average TTS (time to signal) for the past week, as of May 5, is 4.7 and 4.5 hours for the 1st and 2nd shard on master, and around 4 hours across all branches (though I don't think the changes from removing distributed tests and moving testing off of pull have taken effect yet)
  • High overhead
  • Hopefully TTS doesn't explode

Sharding spreadsheet: https://docs.google.com/spreadsheets/d/1BdtVsjRr0Is9LXMNilR02FEdPXNq7zEWl8AmR3ArsLQ/edit#gid=1153012347
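
To make the trade-off in the notes concrete, here is a minimal sketch of duration-based sharding with a fixed per-shard overhead. It is an illustration only, not the sharding code PyTorch CI actually runs: the function names, the greedy "longest test first" heuristic, the test durations, and the 20-minute overhead are all assumptions made up for this example and are not taken from the spreadsheet.

```python
from typing import Dict, List, Tuple

# Hypothetical per-test-file durations in minutes (illustrative values only).
DURATIONS: Dict[str, float] = {
    "test_torch": 60,
    "test_sparse": 45,
    "test_nn": 40,
    "test_ops": 30,
    "test_jit": 25,
}


def shard_by_duration(
    durations: Dict[str, float], num_shards: int
) -> List[Tuple[float, List[str]]]:
    """Greedily assign each test to the currently lightest shard.

    Tests are visited from longest to shortest (ties broken by name),
    which tends to keep the shard totals close to each other.
    """
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    totals = [0.0] * num_shards
    members: List[List[str]] = [[] for _ in range(num_shards)]
    for name, minutes in sorted(durations.items(), key=lambda kv: (-kv[1], kv[0])):
        lightest = totals.index(min(totals))  # first shard with the smallest total
        totals[lightest] += minutes
        members[lightest].append(name)
    return list(zip(totals, members))


def estimate(
    shards: List[Tuple[float, List[str]]], overhead: float
) -> Tuple[float, float]:
    """Return (wall_clock, machine_time) when every shard pays `overhead`.

    Wall clock is set by the slowest shard; machine time is the sum over
    all shards, so the overhead is paid once per shard.
    """
    totals = [total for total, _ in shards]
    wall_clock = max(totals) + overhead
    machine_time = sum(totals) + overhead * len(totals)
    return wall_clock, machine_time


if __name__ == "__main__":
    for n in (2, 5):
        wall, machine = estimate(shard_by_duration(DURATIONS, n), overhead=20)
        print(f"{n} shards: wall clock {wall} min, machine time {machine} min")
```

Under these made-up numbers, going from 2 to 5 shards shortens the slowest shard, but every extra shard pays the fixed overhead again, so total machine time goes up. That is the "high overhead" concern in the notes.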

facebook-github-bot (Contributor) commented May 5, 2022

❌ 2 New Failures

As of commit 6627217 (more details on the Dr. CI page):

  • 2/2 failures introduced in this PR

🕵️ 2 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages

See GitHub Actions build trunk / macos-11-py3-x86-64 / test (default, 1, 2, macos-11) (1/2)

Step: "Test"

2022-05-08T00:00:34.8704050Z Generating XML reports...
2022-05-08T00:00:34.8704600Z Generated XML report: test-reports/python-unittest/test_sparse/TEST-TestSparseCPU-20220508000011.xml
2022-05-08T00:00:34.8705250Z Generated XML report: test-reports/python-unittest/test_sparse/TEST-TestSparseMaskedReductionsCPU-20220508000011.xml
2022-05-08T00:00:34.8705920Z Generated XML report: test-reports/python-unittest/test_sparse/TEST-TestSparseUnaryUfuncsCPU-20220508000011.xml
2022-05-08T00:00:34.8706540Z Generated XML report: test-reports/python-unittest/test_sparse/TEST-TestSparseOneOff-20220508000011.xml
2022-05-08T00:00:35.1658430Z Traceback (most recent call last):
2022-05-08T00:00:35.1659340Z   File "test/run_test.py", line 1072, in <module>
2022-05-08T00:00:35.1660060Z     main()
2022-05-08T00:00:35.1660690Z   File "test/run_test.py", line 1050, in main
2022-05-08T00:00:35.1661370Z     raise RuntimeError(err_message)
2022-05-08T00:00:35.1662060Z RuntimeError: test_sparse failed!
2022-05-08T00:00:35.3762110Z + cleanup
2022-05-08T00:00:35.3762360Z + retcode=1
2022-05-08T00:00:35.3762550Z + set +x
2022-05-08T00:00:35.3787660Z ##[error]Process completed with exit code 1.
2022-05-08T00:00:35.3872070Z ##[group]Run pytorch/pytorch/.github/actions/get-workflow-job-id@master
2022-05-08T00:00:35.3872390Z with:
2022-05-08T00:00:35.3873240Z   github-token: ***
2022-05-08T00:00:35.3873460Z env:
2022-05-08T00:00:35.3873630Z   IN_CI: 1
2022-05-08T00:00:35.3873910Z   IS_GHA: 1

See GitHub Actions build trunk / win-vs2019-cuda11.3-py3 / test (default, 4, 5, windows.8xlarge.nvidia.gpu) (2/2)

Step: "Test"

2022-05-07T23:52:20.5450815Z Generated XML report: test-reports\python-unittest\test_torch\TEST-TestVitalSignsCudaCUDA-20220507234138.xml
2022-05-07T23:52:20.5451429Z Generated XML report: test-reports\python-unittest\test_torch\TEST-TestVitalSignsCudaCPU-20220507234138.xml
2022-05-07T23:52:40.2534466Z [TORCH_VITAL] CUDA.used		 true
2022-05-07T23:52:40.2535045Z [TORCH_VITAL] Dataloader.basic_unit_test		 TEST_VALUE_STRING
2022-05-07T23:52:40.2555684Z [TORCH_VITAL] Dataloader.enabled		 True
2022-05-07T23:52:40.8669095Z Traceback (most recent call last):
2022-05-07T23:52:40.8669659Z   File "run_test.py", line 1072, in <module>
2022-05-07T23:52:40.8669984Z     main()
2022-05-07T23:52:40.8694004Z   File "run_test.py", line 1050, in main
2022-05-07T23:52:40.8694476Z     raise RuntimeError(err_message)
2022-05-07T23:52:40.8694821Z RuntimeError: test_torch failed!
2022-05-07T23:52:44.4014247Z 
2022-05-07T23:52:44.4019407Z (base) C:\actions-runner\_work\pytorch\pytorch\test>if ERRORLEVEL 1 goto fail 
2022-05-07T23:52:44.4046408Z 
2022-05-07T23:52:44.4047099Z (base) C:\actions-runner\_work\pytorch\pytorch\test>exit /b 1 
2022-05-07T23:52:44.5595028Z + cleanup
2022-05-07T23:52:44.5621569Z + retcode=1
2022-05-07T23:52:44.5634392Z + set +x
2022-05-07T23:52:44.6974865Z ##[error]Process completed with exit code 1.
2022-05-07T23:52:45.3975201Z ##[group]Run pytorch/pytorch/.github/actions/get-workflow-job-id@master
2022-05-07T23:52:45.3975704Z with:

This comment was automatically generated by Dr. CI.

clee2000 force-pushed the clee2000/win-shard branch 2 times, most recently from 2cc96c7 to d9d6623, May 6, 2022 22:04
clee2000 added the ciflow/trunk label (Trigger trunk jobs on your pull request), May 6, 2022
clee2000 marked this pull request as ready for review, May 6, 2022 22:07
clee2000 requested a review from suo, May 6, 2022 22:07

suo (Member) left a comment

accepting, but let's eyeball the shards in CI before merging

clee2000 changed the title from "win shard" to "shard win-vs2019-cuda11.3-py3 / test from 2 shards to 5 shards", May 6, 2022
clee2000 force-pushed the clee2000/win-shard branch from d9d6623 to 6627217, May 7, 2022 21:00

janeyx99 (Contributor) left a comment

lannnnnnnnnd :D

janeyx99 (Contributor) commented May 9, 2022

Though, looking at CI:
[image not preserved]

Looks decently even, but not short enough

clee2000 (Contributor, Author) commented May 9, 2022

@pytorchbot merge this please

github-actions bot (Contributor) commented May 9, 2022

Hey @clee2000.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.

facebook-github-bot pushed a commit that referenced this pull request May 13, 2022
) (#76867)

Summary:
Fixes #ISSUE_NUMBER
shard `win-vs2019-cuda11.3-py3 / test` from 2 shards to 5 shards
helps w/ #76838

Notes:
- avg tts for the past week as of May 5 is 4.7 and 4.5 hours for 1st and 2nd shard on master, around 4 hours for all branches (but I don't think the changes from removing distributed tests + moving testing off of pull have come into effect yet)
- high overhead
- hope that tts doesn't explode

Sharding spreadsheet: https://docs.google.com/spreadsheets/d/1BdtVsjRr0Is9LXMNilR02FEdPXNq7zEWl8AmR3ArsLQ/edit#gid=1153012347

Pull Request resolved: #76867
Approved by: https://github.com/suo, https://github.com/janeyx99

Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/dc4f12d9cc15476c545e3a1bb7a74e23d5b0ddf5

Reviewed By: malfet

Differential Revision: D36250677

Pulled By: clee2000

fbshipit-source-id: 5aeeb617fba2483be83e5c1f0a7c7dda03f4294a
clee2000 deleted the clee2000/win-shard branch, May 16, 2022 14:54

Labels

ciflow/trunk (Trigger trunk jobs on your pull request), cla signed, Merged


5 participants