-
Notifications
You must be signed in to change notification settings - Fork 25.6k
Fix round robin sharding #121022
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix round robin sharding #121022
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/121022
Note: Links to docs will display an error until the docs builds have been completed. ✅ No FailuresAs of commit 9b9a9fb with merge base f01a23d ( This comment was automatically generated by Dr. CI and updates every 15 minutes. |
e104f29
to
92b0ea7
Compare
tools/testing/test_selections.py
Outdated
def _get_min_sharded_job( | ||
test: ShardedTest, sharded_jobs: List[ShardJob] | ||
) -> ShardJob: | ||
if test.get_time() == 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think about default the test time to 60s, the same as the slow test threshold? If that works out, this function could then be simplified using a min heap https://docs.python.org/3/library/heapq.html with get_total_time
as the sorted value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There would have to be a distinction between the assumed 60s and the actual value of "None" for threshold setting in run_test, but that's easily solved by having a different function for getting the assumed time
What is the expected behavior if you want to sort by time and assume the time if its unknown? Do the unknown times get put before or after the tests with <60s run time?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pushed a new commit doing this, putting the unknown tests behind the <60s tests, but didn't do the heap thing since the current readability is alright imo. We only have 6 shards in CI rn so it wouldn't be much faster
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
undid this, pretty sure it was causing uneven sharding b/c there are test files that simply don't get run but do get collected
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! I have one thought about the possibility of using the slow test threshold of 60s as the default when there is no available test time
tools/testing/test_selections.py
Outdated
self.parallel: List[ShardedTest] = [] | ||
|
||
def get_total_time(self) -> float: | ||
def _get_time_helper(self, get_test_time: Callable[[ShardedTest], float]) -> float: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fine for now, but this code organization seems a little unnecessarily complicated. This. helper can just take an arg and use test.get_time
or the get_assumed_time._get_time
logic in-line based on that arg. And what is the different btwn test.get_time()
and test.time
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
test.time
can be none, test.get_time()
will always return a float. Basically just a connivence thing because I don't want to type time.time or 0
everywhere
known_tests = [ | ||
x | ||
for x in tests | ||
if get_duration(x, test_file_times, test_class_times) is not None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need separate get_time
and get_duration
functions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
get_duration
actually calculates the time and is used to populate get_time
, which just returns the value inside the object. get_duration
also returns an optional float while get_time
returns just a float
@pytorchbot merge -f "no trunk needed?" |
Merge startedYour change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
@pytorchbot revert -m "made sharding really uneven" -c weird |
@pytorchbot successfully started a revert job. Check the current status here. |
This reverts commit effdea5. Reverted #121022 on behalf of https://github.com/clee2000 due to made sharding really uneven ([comment](#121022 (comment)))
@clee2000 your PR has been successfully reverted. |
This reverts commit effdea5. Reverted #121022 on behalf of https://github.com/clee2000 due to made sharding really uneven ([comment](#121022 (comment)))
@pytorchbot merge -f "no trunk needed?" |
Merge startedYour change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
@clee2000 this was affecting our CI, happy to see the fix! I'm surprised it took such a large amount of code though---I was thinking of simply changing the sharding lambda to |
Doesn't that end up with all the unknown tests on one shard if you have tests that do have test times? I guess in practice its probably fine since if you have test times you have test times for everything |
Fix round robin sharding when there are no test times and sort_by_time=False
Adds more tests to test_test_selections for sort_by_time=False
Adds more checks to test_split_shards_random for serial/parallel ordering + ordering of tests
Refactoring of dup code
Tested locally by running
python test/run_test.py --shard 3 5
with no test times downloaded and checked that it wasn't an empty list.