[Core] Fix Inferentia job scheduling #2969

Merged · 3 commits into master from fix-inf · Jan 10, 2024

Conversation

@concretevitamin (Collaborator):

Fixes #2968. The fix: when submitting a job request for a non-GPU accelerator (Inferentia), do not request `GPU: <count>` resources.
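
For context, a rough sketch of the failure mode (hypothetical helper and names, not the actual SkyPilot code):

```python
# Hypothetical sketch: the job's resource demand included a 'GPU' entry
# even for non-GPU accelerators, so Ray could never place the job on an
# Inferentia node, which advertises no 'GPU' resource.
def get_task_demands(acc_name: str, acc_count: int) -> dict:
    demands = {f'accelerator:{acc_name}': acc_count}  # assumed key format
    demands['GPU'] = acc_count  # BUG: blocks scheduling on Inferentia
    return demands
```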

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below): pytest tests/test_smoke.py::test_inferentia --aws
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: bash tests/backward_comaptibility_tests.sh

@Michaelvll (Collaborator) left a comment:

Thanks for identifying the issue and the quick fix @concretevitamin! Left a question. : )

Comment on lines 326 to 329
```diff
  # gpu_dict should be empty when the accelerator is not GPU.
  # FIXME: This is a hack to make sure that we do not reserve
- # GPU when requesting TPU.
- if 'tpu' in acc_name.lower():
+ # GPU when requesting non-GPU accelerators (e.g., TPU/Inferentia).
+ if 'tpu' in acc_name.lower() or 'inferentia' in acc_name.lower():
```
@Michaelvll (Collaborator):

This does not seem very generalizable to cases where more "non-GPU" accelerators are supported, such as Trainium. Could we instead directly add a `--num-gpus` flag to the `ray start` command where we start the Ray cluster (at the following place), to force the Ray cluster to contain the missing GPU resource?
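
For concreteness, a rough sketch of this alternative (the device count and command construction are assumptions, not the actual SkyPilot code; `--num-gpus` is a real `ray start` flag):

```python
# Hypothetical sketch: advertise the node's accelerators as GPUs when
# starting Ray, so jobs that request `GPU: <count>` can still schedule.
num_accelerators = 4  # assumed: taken from the instance's catalog entry
ray_start_cmd = f'ray start --head --num-gpus={num_accelerators}'
```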

@concretevitamin (Collaborator, Author):

That seems reasonable. One concern is that calling every accelerator a "GPU" may be confusing to users (it is convenient only for us).

Ray in recent releases took a different approach: rather than calling all accelerators "GPU", it uses several distinct resource names:
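
For example (assuming Ray's AWS Neuron support, which exposes a `neuron_cores` resource; the exact names are version-dependent):

```python
import ray

# Illustration of Ray's per-accelerator resource names. Assumption:
# recent Ray exposes AWS Neuron devices under a dedicated resource
# name such as 'neuron_cores' instead of counting them as 'GPU'.
@ray.remote(resources={'neuron_cores': 1})
def infer():
    ...  # scheduled only on nodes that advertise 'neuron_cores'
```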

Wdyt?

@Michaelvll (Collaborator):

I think this would only change the representation in `ray status`, which should ideally not be exposed to users, so it would not affect the user side. That said, we can merge the current PR changes first to unblock our users, and do a further refactoring in the future to avoid the hardcoding here (either by using GPU or by further distinguishing the accelerators).

For the hardcoded check, we probably want to add `trainium` as well. : )
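
A minimal sketch of that extension (illustrative; the merged commit may structure this differently):

```python
# Accelerator-name substrings that should not reserve 'GPU' in Ray
# (an assumed constant, mirroring the hardcoded check in the diff above).
_NON_GPU_ACCELERATORS = ('tpu', 'inferentia', 'trainium')

if any(s in acc_name.lower() for s in _NON_GPU_ACCELERATORS):
    gpu_dict = {}  # do not reserve GPU for non-GPU accelerators
```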

@concretevitamin (Collaborator, Author):

Sounds good, PTAL.

@Michaelvll (Collaborator) left a comment:

Thanks for the refactoring @concretevitamin! LGTM.

@concretevitamin merged commit 5bf34c4 into master on Jan 10, 2024
19 checks passed
@concretevitamin deleted the fix-inf branch on January 10, 2024, 23:08
Development

Successfully merging this pull request may close these issues:

  • Job submission on Inferentia instances blocks (#2968)