[Core] Fix Inferentia job scheduling #2969

Merged · 3 commits into master from fix-inf · Jan 10, 2024

Conversation

@concretevitamin (Collaborator):

Fixes #2968. The fix: when submitting a job request for a non-GPU accelerator (Inferentia), do not request `GPU: <count>` resources.
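
For context, a rough sketch of the failure mode (hypothetical helper and names, not the actual SkyPilot code):

```python
# Hypothetical sketch: the job's resource demand included a 'GPU' entry
# even for non-GPU accelerators, so Ray could never place the job on an
# Inferentia node, which advertises no 'GPU' resource.
def get_task_demands(acc_name: str, acc_count: int) -> dict:
    demands = {f'accelerator:{acc_name}': acc_count}  # assumed key format
    demands['GPU'] = acc_count  # BUG: blocks scheduling on Inferentia
    return demands
```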

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below): pytest tests/test_smoke.py::test_inferentia --aws
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: bash tests/backward_comaptibility_tests.sh

@Michaelvll (Collaborator) left a comment:

Thanks for identifying the issue and the quick fix @concretevitamin! Left a question. : )

Comment on lines 326 to 329
```diff
  # gpu_dict should be empty when the accelerator is not GPU.
  # FIXME: This is a hack to make sure that we do not reserve
- # GPU when requesting TPU.
- if 'tpu' in acc_name.lower():
+ # GPU when requesting non-GPU accelerators (e.g., TPU/Inferentia).
+ if 'tpu' in acc_name.lower() or 'inferentia' in acc_name.lower():
```
@Michaelvll (Collaborator):

This does not seem very generalizable to cases where more "non-GPU" accelerators are supported, such as Trainium. Could we instead directly add a `--num-gpus` flag to the `ray start` command where we start the Ray cluster (at the following place), to force the Ray cluster to contain the missing GPU resource?
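
For concreteness, a rough sketch of this alternative (the device count and command construction are assumptions, not the actual SkyPilot code; `--num-gpus` is a real `ray start` flag):

```python
# Hypothetical sketch: advertise the node's accelerators as GPUs when
# starting Ray, so jobs that request `GPU: <count>` can still schedule.
num_accelerators = 4  # assumed: taken from the instance's catalog entry
ray_start_cmd = f'ray start --head --num-gpus={num_accelerators}'
```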

@concretevitamin (Collaborator, Author):

That seems reasonable. One concern is that calling every accelerator a "GPU" may be confusing to users (it is convenient only for us).

Ray in recent releases took a different approach: rather than calling all accelerators "GPU", it uses several distinct resource names:
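
For example (assuming Ray's AWS Neuron support, which exposes a `neuron_cores` resource; the exact names are version-dependent):

```python
import ray

# Illustration of Ray's per-accelerator resource names. Assumption:
# recent Ray exposes AWS Neuron devices under a dedicated resource
# name such as 'neuron_cores' instead of counting them as 'GPU'.
@ray.remote(resources={'neuron_cores': 1})
def infer():
    ...  # scheduled only on nodes that advertise 'neuron_cores'
```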

Wdyt?

@Michaelvll (Collaborator):

I think this would only change the representation in `ray status`, which should ideally not be exposed to users, so it would not affect the user side. That said, we can merge the current PR changes first to unblock our users, and do a further refactoring in the future to avoid the hardcoding here (either by using GPU or by further distinguishing the accelerators).

For the hardcoded check, we probably want to add `trainium` as well. : )
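
A minimal sketch of that extension (illustrative; the merged commit may structure this differently):

```python
# Accelerator-name substrings that should not reserve 'GPU' in Ray
# (an assumed constant, mirroring the hardcoded check in the diff above).
_NON_GPU_ACCELERATORS = ('tpu', 'inferentia', 'trainium')

if any(s in acc_name.lower() for s in _NON_GPU_ACCELERATORS):
    gpu_dict = {}  # do not reserve GPU for non-GPU accelerators
```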

@concretevitamin (Collaborator, Author):

Sounds good, PTAL.

@Michaelvll (Collaborator) left a comment:

Thanks for the refactoring @concretevitamin! LGTM.

@concretevitamin merged commit 5bf34c4 into master on Jan 10, 2024
19 checks passed
@concretevitamin deleted the fix-inf branch on January 10, 2024, 23:08
Development

Successfully merging this pull request may close these issues:

  • Job submission on Inferentia instances blocks (#2968)