
Conversation

clee2000 (Contributor) commented Mar 1, 2024:

Fix round robin sharding when there are no test times and sort_by_time=False

- Adds more tests to test_test_selections for sort_by_time=False
- Adds more checks to test_split_shards_random for serial/parallel ordering and the ordering of tests
- Refactors duplicated code

Tested locally by running `python test/run_test.py --shard 3 5` with no test times downloaded and checked that the resulting shard wasn't an empty list.
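For context, the failure mode is that with no downloaded times every test reports a duration of 0, so a purely time-greedy assigner can dump everything on one shard. A minimal sketch of the round-robin fallback (hypothetical function name and shapes, not the actual run_test.py code):

```python
from typing import List

def round_robin_shard(tests: List[str], num_shards: int) -> List[List[str]]:
    # With no timing data, deal tests out to shards in turn so every
    # shard gets roughly the same number of tests.
    shards: List[List[str]] = [[] for _ in range(num_shards)]
    for i, test in enumerate(tests):
        shards[i % num_shards].append(test)
    return shards

# Mirrors the local check above: shard 3 of 5 must not be empty.
shards = round_robin_shard([f"test_{i}" for i in range(12)], num_shards=5)
assert shards[2], "shard 3 should not be an empty list"
```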

pytorch-bot added the topic: not user facing label Mar 1, 2024
pytorch-bot commented Mar 1, 2024:

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/121022

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 9b9a9fb with merge base f01a23d:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

clee2000 marked this pull request as ready for review March 1, 2024 18:08
clee2000 requested review from a team and huydhn March 1, 2024 18:12
clee2000 force-pushed the csl/fix_round_robin_sharding branch from e104f29 to 92b0ea7 on March 1, 2024 18:38
```python
def _get_min_sharded_job(
    test: ShardedTest, sharded_jobs: List[ShardJob]
) -> ShardJob:
    if test.get_time() == 0:
```
huydhn (Contributor) commented on the diff:

What do you think about defaulting the test time to 60s, the same as the slow test threshold? If that works out, this function could then be simplified using a min heap (https://docs.python.org/3/library/heapq.html) with get_total_time as the sort key.
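A minimal sketch of that heap idea (hypothetical helper name; assumes ShardJob exposes get_total_time() and a serial list, and ShardedTest exposes get_time(), as elsewhere in this PR):

```python
import heapq
from typing import List

def assign_with_heap(tests: List["ShardedTest"], jobs: List["ShardJob"]) -> None:
    # Keep (total_time, index) pairs on a heap; the index breaks ties so
    # heapq never has to compare ShardJob objects directly.
    heap = [(job.get_total_time(), i) for i, job in enumerate(jobs)]
    heapq.heapify(heap)
    for test in tests:
        total, i = heapq.heappop(heap)  # job with the smallest total so far
        jobs[i].serial.append(test)
        heapq.heappush(heap, (total + test.get_time(), i))
```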

clee2000 (Contributor, Author) replied Mar 1, 2024:

There would have to be a distinction between the assumed 60s and the actual value of None for the threshold setting in run_test, but that's easily solved by having a separate function for getting the assumed time.

What is the expected behavior if you want to sort by time and assume the time when it's unknown? Do the unknown times get put before or after the tests with <60s run time?

clee2000 (Contributor, Author) replied Mar 1, 2024:

Pushed a new commit doing this, putting the unknown tests behind the <60s tests, but didn't do the heap change since the current code's readability is alright imo. We only have 6 shards in CI right now, so it wouldn't be much faster.
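A small sketch (hypothetical data) of that ordering: unknown times are assumed to be the 60s slow-test threshold, which sorts them after every test known to take under 60s:

```python
from typing import List, Optional, Tuple

SLOW_TEST_THRESHOLD_SEC = 60.0  # assumed default for unknown times

tests: List[Tuple[str, Optional[float]]] = [
    ("test_fast", 5.0),
    ("test_unknown", None),  # no recorded time
    ("test_slow", 300.0),
]
tests.sort(key=lambda t: t[1] if t[1] is not None else SLOW_TEST_THRESHOLD_SEC)
print([name for name, _ in tests])  # ['test_fast', 'test_unknown', 'test_slow']
```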

clee2000 (Contributor, Author) replied:

Undid this; pretty sure it was causing uneven sharding because there are test files that get collected but never actually run.

huydhn (Contributor) left a review:

LGTM! I have one thought about the possibility of using the slow test threshold of 60s as the default when there is no available test time.

clee2000 requested review from a team and huydhn March 1, 2024 22:21
```python
self.parallel: List[ShardedTest] = []

def get_total_time(self) -> float:
def _get_time_helper(self, get_test_time: Callable[[ShardedTest], float]) -> float:
```
A member commented on the diff:

Fine for now, but this code organization seems a little unnecessarily complicated. This helper could just take an arg and use test.get_time or the get_assumed_time/_get_time logic inline based on that arg. And what is the difference between test.get_time() and test.time?

clee2000 (Contributor, Author) replied:

test.time can be None; test.get_time() will always return a float. Basically just a convenience thing because I don't want to type test.time or 0 everywhere.
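A minimal sketch of that distinction (hypothetical shape, not the actual class definition):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ShardedTest:
    name: str
    time: Optional[float] = None  # None when no test time was downloaded

    def get_time(self, default: float = 0.0) -> float:
        # Coalesce a missing time so callers never have to handle None.
        return self.time if self.time is not None else default

t = ShardedTest("test_nn")
assert t.time is None and t.get_time() == 0.0
```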

```python
known_tests = [
    x
    for x in tests
    if get_duration(x, test_file_times, test_class_times) is not None
]
```
A member commented on the diff:

Why do we need separate get_time and get_duration functions?

clee2000 (Contributor, Author) replied:

get_duration actually calculates the time and is used to populate the value that get_time returns; get_time just reads the value stored on the object. get_duration also returns an Optional[float], while get_time returns a plain float.
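A simplified sketch of the relationship (hypothetical; the real get_duration also consults per-class times, as the snippet above shows):

```python
from typing import Dict, Optional

def get_duration(test_file: str, test_file_times: Dict[str, float]) -> Optional[float]:
    # Returns the recorded duration, or None when nothing is known.
    return test_file_times.get(test_file)

# The Optional result is what splits tests into known/unknown buckets:
test_file_times = {"test_nn": 120.0}
tests = ["test_nn", "test_new_file"]
known = [t for t in tests if get_duration(t, test_file_times) is not None]
unknown = [t for t in tests if get_duration(t, test_file_times) is None]
assert known == ["test_nn"] and unknown == ["test_new_file"]
```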

clee2000 (Contributor, Author) commented Mar 8, 2024:

@pytorchbot merge -f "no trunk needed?"

pytorchmergebot (Collaborator) commented:
Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as a last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

clee2000 (Contributor, Author) commented Mar 8, 2024:

@pytorchbot revert -m "made sharding really uneven" -c weird

pytorchmergebot (Collaborator) commented:
@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

pytorchmergebot added a commit that referenced this pull request Mar 8, 2024
This reverts commit effdea5.

Reverted #121022 on behalf of https://github.com/clee2000 due to made sharding really uneven
pytorchmergebot (Collaborator) commented:
@clee2000 your PR has been successfully reverted.

pianpwk pushed a commit that referenced this pull request Mar 11, 2024
This reverts commit effdea5.

Reverted #121022 on behalf of https://github.com/clee2000 due to made sharding really uneven
clee2000 (Contributor, Author) commented:
@pytorchbot merge -f "no trunk needed?"

pytorchmergebot (Collaborator) commented:
Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as a last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

eqy (Collaborator) commented Mar 12, 2024:

@clee2000 this was affecting our CI, happy to see the fix! I'm surprised it took such a large amount of code though; I was thinking of simply changing the sharding lambda to

```python
min_sharded_job = min(new_sharded_jobs, key=lambda j: (j.get_total_time(), len(j.serial)))
```

to address the issue.

clee2000 (Contributor, Author) replied:

> @clee2000 this was affecting our CI, happy to see the fix! I'm surprised it took such a large amount of code though; I was thinking of simply changing the sharding lambda to `min_sharded_job = min(new_sharded_jobs, key=lambda j: (j.get_total_time(), len(j.serial)))` to address the issue.

Doesn't that end up with all the unknown tests on one shard if you have tests that do have test times? I guess in practice it's probably fine, since if you have test times you have them for everything.
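A quick sketch (hypothetical numbers) of that concern: with the tuple key alone, adding a 0s test never changes a shard's total, so every unknown-time test keeps landing on whichever shard already has the smallest known total:

```python
from typing import List, Tuple

# (total_time, serial_tests) per shard; each shard already has a known total.
shards: List[Tuple[float, List[str]]] = [(10.0, []), (20.0, []), (30.0, [])]

for i in range(5):  # five tests with no recorded time (0s each)
    total, serial = min(shards, key=lambda s: (s[0], len(s[1])))
    serial.append(f"test_unknown_{i}")  # shard's total stays 10.0 forever

print([len(serial) for _, serial in shards])  # [5, 0, 0] -- all on one shard
```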

github-actions bot deleted the csl/fix_round_robin_sharding branch April 12, 2024 01:52