Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Round robin during spread scheduling #19968

Merged
merged 10 commits into from
Dec 23, 2021
Merged

Round robin during spread scheduling #19968

merged 10 commits into from
Dec 23, 2021

Conversation

jjyao
Copy link
Collaborator

@jjyao jjyao commented Nov 2, 2021

Why are these changes needed?

This PR does round robin over all the nodes instead of always starting from the beginning of the node list for spread scheduling.

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@rkooo567
Copy link
Contributor

rkooo567 commented Nov 2, 2021

Nice change!

@jjyao jjyao added the do-not-merge Do not merge this PR! label Nov 4, 2021
@jjyao jjyao changed the title [WIP] Random round robin during spread scheduling Round robin during spread scheduling Dec 9, 2021
@ericl ericl self-assigned this Dec 9, 2021
@jjyao
Copy link
Collaborator Author

jjyao commented Dec 10, 2021

pipelined_ingestion_1500_gb_15_windows succeeded without prefix hack: https://buildkite.com/ray-project/periodic-ci/builds/1902#_

@ericl
Copy link
Contributor

ericl commented Dec 10, 2021 via email

@jjyao
Copy link
Collaborator Author

jjyao commented Dec 10, 2021

Without prefix, without spread scheduling strategy: Forever (hang)
With prefix: ~33m
With spread scheduling strategy: ~47m

@scv119 can confirm this.

@ericl
Copy link
Contributor

ericl commented Dec 10, 2021 via email

@jjyao
Copy link
Collaborator Author

jjyao commented Dec 10, 2021

I'm trying to figure out the difference now. This is a good validation use case for dataset stats and scheduler observability projects.

@jjyao jjyao removed the do-not-merge Do not merge this PR! label Dec 10, 2021
Copy link
Contributor

@rkooo567 rkooo567 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM for cpp part. I will wait for Eric's approval since I didn't review the dataset part.

Only one nit comment

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Dec 14, 2021
Copy link
Contributor

@clarkzinzow clarkzinzow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Datasets side looks good, a few comments about the Ray side.

@jjyao
Copy link
Collaborator Author

jjyao commented Dec 21, 2021

Not sure if there is anything changed or last run was just noise: I ran pipelined_ingestion_1500_gb_15_windows three times with this PR and the time is almost the same now:
https://buildkite.com/ray-project/periodic-ci/builds/2059#_
https://buildkite.com/ray-project/periodic-ci/builds/2057
https://buildkite.com/ray-project/periodic-ci/builds/2054

@jjyao jjyao removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Dec 21, 2021
@jjyao
Copy link
Collaborator Author

jjyao commented Dec 22, 2021

It's ready to be merged. Perf number looks good: https://buildkite.com/ray-project/periodic-ci/builds/2086#_

@ericl ericl merged commit 60388b2 into ray-project:master Dec 23, 2021
@jjyao
Copy link
Collaborator Author

jjyao commented Dec 29, 2021

Oh, I think for latests runs, I ran the tests against master which still had the prefix code in pipelined_training.py. Let me revert the PR to fix pipelined_ingestion_1500_gb_15_windows.

@jjyao jjyao deleted the jjyao/rr branch December 29, 2021 19:51
jjyao added a commit to jjyao/ray that referenced this pull request Dec 29, 2021
rkooo567 pushed a commit that referenced this pull request Dec 30, 2021
@jjyao jjyao mentioned this pull request Dec 30, 2021
6 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants