
Conversation

clee2000 (Contributor) commented Mar 1, 2024:

Fix round robin sharding when there are no test times and sort_by_time=False

- Adds more tests to test_test_selections for sort_by_time=False
- Adds more checks to test_split_shards_random for serial/parallel ordering and the ordering of tests
- Refactors duplicated code

Tested locally by running `python test/run_test.py --shard 3 5` with no test times downloaded and checked that the resulting shard wasn't an empty list.
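For context, the failure mode is that with no downloaded times every test reports a duration of 0, so a purely time-greedy assigner can dump everything on one shard. A minimal sketch of the round-robin fallback (hypothetical function name and shapes, not the actual run_test.py code):

```python
from typing import List

def round_robin_shard(tests: List[str], num_shards: int) -> List[List[str]]:
    # With no timing data, deal tests out to shards in turn so every
    # shard gets roughly the same number of tests.
    shards: List[List[str]] = [[] for _ in range(num_shards)]
    for i, test in enumerate(tests):
        shards[i % num_shards].append(test)
    return shards

# Mirrors the local check above: shard 3 of 5 must not be empty.
shards = round_robin_shard([f"test_{i}" for i in range(12)], num_shards=5)
assert shards[2], "shard 3 should not be an empty list"
```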

pytorch-bot added the topic: not user facing label Mar 1, 2024
pytorch-bot commented Mar 1, 2024:

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/121022

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 9b9a9fb with merge base f01a23d:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

clee2000 marked this pull request as ready for review March 1, 2024 18:08
clee2000 requested review from a team and huydhn March 1, 2024 18:12
clee2000 force-pushed the csl/fix_round_robin_sharding branch from e104f29 to 92b0ea7 on March 1, 2024 18:38
```python
def _get_min_sharded_job(
    test: ShardedTest, sharded_jobs: List[ShardJob]
) -> ShardJob:
    if test.get_time() == 0:
```
huydhn (Contributor) commented on the diff:

What do you think about defaulting the test time to 60s, the same as the slow test threshold? If that works out, this function could then be simplified using a min heap (https://docs.python.org/3/library/heapq.html) with get_total_time as the sort key.
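A minimal sketch of that heap idea (hypothetical helper name; assumes ShardJob exposes get_total_time() and a serial list, and ShardedTest exposes get_time(), as elsewhere in this PR):

```python
import heapq
from typing import List

def assign_with_heap(tests: List["ShardedTest"], jobs: List["ShardJob"]) -> None:
    # Keep (total_time, index) pairs on a heap; the index breaks ties so
    # heapq never has to compare ShardJob objects directly.
    heap = [(job.get_total_time(), i) for i, job in enumerate(jobs)]
    heapq.heapify(heap)
    for test in tests:
        total, i = heapq.heappop(heap)  # job with the smallest total so far
        jobs[i].serial.append(test)
        heapq.heappush(heap, (total + test.get_time(), i))
```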

clee2000 (Contributor, Author) replied Mar 1, 2024:

There would have to be a distinction between the assumed 60s and the actual value of None for the threshold setting in run_test, but that's easily solved by having a separate function for getting the assumed time.

What is the expected behavior if you want to sort by time and assume the time when it's unknown? Do the unknown times get put before or after the tests with <60s run time?

clee2000 (Contributor, Author) replied Mar 1, 2024:

Pushed a new commit doing this, putting the unknown tests behind the <60s tests, but didn't do the heap change since the current code's readability is alright imo. We only have 6 shards in CI right now, so it wouldn't be much faster.
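A small sketch (hypothetical data) of that ordering: unknown times are assumed to be the 60s slow-test threshold, which sorts them after every test known to take under 60s:

```python
from typing import List, Optional, Tuple

SLOW_TEST_THRESHOLD_SEC = 60.0  # assumed default for unknown times

tests: List[Tuple[str, Optional[float]]] = [
    ("test_fast", 5.0),
    ("test_unknown", None),  # no recorded time
    ("test_slow", 300.0),
]
tests.sort(key=lambda t: t[1] if t[1] is not None else SLOW_TEST_THRESHOLD_SEC)
print([name for name, _ in tests])  # ['test_fast', 'test_unknown', 'test_slow']
```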

clee2000 (Contributor, Author) replied:

Undid this; pretty sure it was causing uneven sharding because there are test files that get collected but never actually run.

huydhn (Contributor) left a review:

LGTM! I have one thought about the possibility of using the slow test threshold of 60s as the default when there is no available test time.

clee2000 requested review from a team and huydhn March 1, 2024 22:21
```python
self.parallel: List[ShardedTest] = []

def get_total_time(self) -> float:
def _get_time_helper(self, get_test_time: Callable[[ShardedTest], float]) -> float:
```
A member commented on the diff:

Fine for now, but this code organization seems a little unnecessarily complicated. This helper could just take an arg and use test.get_time or the get_assumed_time/_get_time logic inline based on that arg. And what is the difference between test.get_time() and test.time?

clee2000 (Contributor, Author) replied:

test.time can be None; test.get_time() will always return a float. Basically just a convenience thing because I don't want to type test.time or 0 everywhere.
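A minimal sketch of that distinction (hypothetical shape, not the actual class definition):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ShardedTest:
    name: str
    time: Optional[float] = None  # None when no test time was downloaded

    def get_time(self, default: float = 0.0) -> float:
        # Coalesce a missing time so callers never have to handle None.
        return self.time if self.time is not None else default

t = ShardedTest("test_nn")
assert t.time is None and t.get_time() == 0.0
```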

```python
known_tests = [
    x
    for x in tests
    if get_duration(x, test_file_times, test_class_times) is not None
]
```
A member commented on the diff:

Why do we need separate get_time and get_duration functions?

clee2000 (Contributor, Author) replied:

get_duration actually calculates the time and is used to populate the value that get_time returns; get_time just reads the value stored on the object. get_duration also returns an Optional[float], while get_time returns a plain float.
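A simplified sketch of the relationship (hypothetical; the real get_duration also consults per-class times, as the snippet above shows):

```python
from typing import Dict, Optional

def get_duration(test_file: str, test_file_times: Dict[str, float]) -> Optional[float]:
    # Returns the recorded duration, or None when nothing is known.
    return test_file_times.get(test_file)

# The Optional result is what splits tests into known/unknown buckets:
test_file_times = {"test_nn": 120.0}
tests = ["test_nn", "test_new_file"]
known = [t for t in tests if get_duration(t, test_file_times) is not None]
unknown = [t for t in tests if get_duration(t, test_file_times) is None]
assert known == ["test_nn"] and unknown == ["test_new_file"]
```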

clee2000 (Contributor, Author) commented Mar 8, 2024:

@pytorchbot merge -f "no trunk needed?"

pytorchmergebot (Collaborator) commented:
Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as a last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

clee2000 (Contributor, Author) commented Mar 8, 2024:

@pytorchbot revert -m "made sharding really uneven" -c weird

pytorchmergebot (Collaborator) commented:
@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

pytorchmergebot added a commit that referenced this pull request Mar 8, 2024
This reverts commit effdea5.

Reverted #121022 on behalf of https://github.com/clee2000 due to made sharding really uneven
pytorchmergebot (Collaborator) commented:
@clee2000 your PR has been successfully reverted.

pianpwk pushed a commit that referenced this pull request Mar 11, 2024
This reverts commit effdea5.

Reverted #121022 on behalf of https://github.com/clee2000 due to made sharding really uneven
clee2000 (Contributor, Author) commented:
@pytorchbot merge -f "no trunk needed?"

pytorchmergebot (Collaborator) commented:
Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as a last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

eqy (Collaborator) commented Mar 12, 2024:

@clee2000 this was affecting our CI, happy to see the fix! I'm surprised it took such a large amount of code though; I was thinking of simply changing the sharding lambda to

```python
min_sharded_job = min(new_sharded_jobs, key=lambda j: (j.get_total_time(), len(j.serial)))
```

to address the issue.

clee2000 (Contributor, Author) replied:

> @clee2000 this was affecting our CI, happy to see the fix! I'm surprised it took such a large amount of code though; I was thinking of simply changing the sharding lambda to `min_sharded_job = min(new_sharded_jobs, key=lambda j: (j.get_total_time(), len(j.serial)))` to address the issue.

Doesn't that end up with all the unknown tests on one shard if you have tests that do have test times? I guess in practice it's probably fine, since if you have test times you have them for everything.
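A quick sketch (hypothetical numbers) of that concern: with the tuple key alone, adding a 0s test never changes a shard's total, so every unknown-time test keeps landing on whichever shard already has the smallest known total:

```python
from typing import List, Tuple

# (total_time, serial_tests) per shard; each shard already has a known total.
shards: List[Tuple[float, List[str]]] = [(10.0, []), (20.0, []), (30.0, [])]

for i in range(5):  # five tests with no recorded time (0s each)
    total, serial = min(shards, key=lambda s: (s[0], len(s[1])))
    serial.append(f"test_unknown_{i}")  # shard's total stays 10.0 forever

print([len(serial) for _, serial in shards])  # [5, 0, 0] -- all on one shard
```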

github-actions bot deleted the csl/fix_round_robin_sharding branch April 12, 2024 01:52