[Core] Introduce AcceleratorManager interface #40286
Conversation
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
The number of Neuron cores if any were detected, otherwise 0.
"""
nc_count: int = 0
neuron_path = "/opt/aws/neuron/bin/"
Note: this path isn't guaranteed to exist in all environments (based on a recent Slack conversation: https://ray-distributed.slack.com/archives/C01DLHZHRBJ/p1696448156026509).
Expected: find the neuron-ls command if it exists, then run it to get the cores.
Also, we did ask the AWS Neuron SDK team about getting the core information via IMDS (API driven), but there are no plans to support this.
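A minimal sketch of that expectation, assuming neuron-ls is discoverable on the PATH and supports JSON output (the exact flag and output schema are assumptions here, not the final implementation):

```python
import json
import shutil
import subprocess


def _get_neuron_core_count() -> int:
    """Sketch: discover neuron-ls on PATH instead of hard-coding its path."""
    neuron_ls = shutil.which("neuron-ls")
    if neuron_ls is None:
        return 0  # SDK not installed in this environment
    try:
        # Assumes neuron-ls can emit JSON: one entry per device, each
        # reporting its core count under "nc_count".
        output = subprocess.check_output([neuron_ls, "--json-output"])
        devices = json.loads(output)
        return sum(device.get("nc_count", 0) for device in devices)
    except (OSError, subprocess.SubprocessError, ValueError):
        return 0
```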
I just copy-pasted from the existing implementation. Are you planning to fix it for other environments?
Can we add a try/catch and fix it in the current PR?
If not, it's ok to move it to an issue and I'll own it (tentative ETA: Q1 2024).
Also add a TODO maybe?
Created #40405 to track this.
Generally lgtm.
from ray._private.accelerators.neuron import NeuronAccelerator


def get_all_accelerators() -> Set[Accelerator]:
These are not DeveloperAPIs, right?
No, just implementation details.
@DeveloperAPI
class Accelerator(ABC):
Suggested change:
- class Accelerator(ABC):
+ class AcceleratorUtil(ABC):

? Since it seems like it is also all static methods.
Renamed to AcceleratorManager per ray-project/enhancements#46 (comment).
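For context, the shape under discussion is roughly an abstract base whose members are all static methods, with one subclass per accelerator type. An illustrative subset, not the full interface:

```python
from abc import ABC, abstractmethod


class AcceleratorManager(ABC):
    """Base class each accelerator type implements (illustrative subset)."""

    @staticmethod
    @abstractmethod
    def get_resource_name() -> str:
        """The Ray resource name for this accelerator, e.g. "GPU"."""

    @staticmethod
    @abstractmethod
    def get_current_node_num_accelerators() -> int:
        """Number of this accelerator type detected on the current node."""


class NeuronAcceleratorManager(AcceleratorManager):
    @staticmethod
    def get_resource_name() -> str:
        return "neuron_cores"

    @staticmethod
    def get_current_node_num_accelerators() -> int:
        return 0  # a real implementation would probe the hardware
```

Since no state is held, subclasses act as namespaced utility bundles rather than instantiable objects.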
"""Get the mapping from accelerator resource name | ||
to the visible ids.""" | ||
|
||
from ray._private.accelerators import ( |
why do we import here?
To avoid a circular dependency.
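That is, the import is deferred to call time so that two modules which reference each other can both finish loading. A generic illustration of the pattern (the enclosing function name is made up):

```python
# A module-level import would run while ray._private.accelerators is itself
# still importing this module, raising ImportError. Importing inside the
# function defers it until call time, when both modules are fully loaded.
def _detect_accelerators():  # hypothetical caller
    from ray._private.accelerators import get_all_accelerator_managers

    return get_all_accelerator_managers()
```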
constraint_name = f"{ray_constants.RESOURCE_CONSTRAINT_PREFIX}" f"{pretty_name}"
return constraint_name

if last_set_visible_accelerator_ids.get(resource_name, None) == accelerator_ids:
    continue  # optimization: already set
Is it new or did we have the same optimization before?
Also, is it really necessary? Aren't they just setting an env var? Maybe remove this for now?
Same old code.
I actually tried to remove it since I also think it might be unnecessary, but it uncovered a bug. I decided to fix the bug in a follow-up PR and then remove this optimization.
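For reference, the optimization in question amounts to a per-resource cache in front of the os.environ write, roughly as follows (the function signature is an assumption; the cache and the skip mirror the quoted snippet):

```python
import os

# Last ids written per resource; lets repeated task launches on the same
# worker skip a redundant os.environ write.
last_set_visible_accelerator_ids = {}


def set_visible_ids(resource_name: str, env_var: str, accelerator_ids: tuple):
    if last_set_visible_accelerator_ids.get(resource_name) == accelerator_ids:
        return  # optimization: already set
    os.environ[env_var] = ",".join(str(i) for i in accelerator_ids)
    last_set_visible_accelerator_ids[resource_name] = accelerator_ids
```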
}


def get_all_accelerator_resource_names() -> Set[str]:
Why don't we just use enums here instead of names directly?
What benefits do enums provide here?
I think it makes sense for the top-level API to accept strings, but for internal functions it's cleaner to pass enums around (otherwise we either rely on the implicit assumption that the input is always valid, or we have to do validation everywhere).
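A sketch of that suggestion (a hypothetical enum, not what the PR ships): the string is validated once at the public boundary, and internal code passes the enum member around:

```python
from enum import Enum


class AcceleratorResource(Enum):
    GPU = "GPU"
    NEURON_CORES = "neuron_cores"


def request_accelerator(name: str) -> AcceleratorResource:
    # Public boundary: validate the string exactly once...
    resource = AcceleratorResource(name)  # raises ValueError on unknown names
    # ...then internals receive a value that is valid by construction.
    return resource
```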
@@ -175,7 +175,7 @@ def testValidateDefaultConfigAWSMultiNodeTypes(self):
     "CPU": 4,
     "memory": 12025908428,
     "neuron_cores": 2,
-    "accelerator_type:aws-neuron-core": 2,
+    "accelerator_type:aws-neuron-core": 1,
why is it changed?
It's a bug from the previous PR. The total quantity of the special accelerator_type resource should only be 1.
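In other words, accelerator_type:&lt;name&gt; is a presence marker used for scheduling affinity rather than a device count, so its quantity stays 1 regardless of how many cores the node has. A sketch of the assumed node resource shape:

```python
# Assumed shape of the node's resource dict after the fix: device resources
# carry real counts, but the accelerator_type entry is only a marker.
node_resources = {
    "CPU": 4,
    "neuron_cores": 2,                      # two schedulable Neuron cores
    "accelerator_type:aws-neuron-core": 1,  # marker: the node has this type
}
```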
@@ -685,36 +684,6 @@ def get_neuron_core_ids(neuron_cores_per_worker):
     nc_f = ray.remote(resources={"neuron_cores": 2})(lambda: get_neuron_core_ids(2))
     assert ray.get(nc_f.remote()) == 2

-    with pytest.raises(ValueError):
why are we removing this?
Because we now allow specifying both GPU and Neuron Core resources for a single task.
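That is, after this PR a declaration like the following is accepted instead of raising ValueError (a sketch; the task body is a placeholder):

```python
import ray


@ray.remote(num_gpus=1, resources={"neuron_cores": 1})
def mixed_task():
    # Before this PR, requesting a GPU and a Neuron core together raised
    # ValueError; now both resources can be claimed by one task.
    return "scheduled"
```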
Returns:
    The resource name: e.g., the resource name for Nvidia GPUs is "GPU"
"""
I think AcceleratorManager should also provide a get_resource_type() interface: get_resource_name would provide the detailed resource name, such as NvidiaGPU or IntelGPU, while get_resource_type would provide the resource type, e.g. both NvidiaGPU and IntelGPU are of type GPU.
get_resource_name returns the Ray resource name, so it should be "GPU". We can add a get_accelerator_family that returns NvidiaGPU or IntelGPU in the future if needed.
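For clarity, that future split would look roughly like this (get_accelerator_family is hypothetical and not part of this PR):

```python
class NvidiaGPUAcceleratorManager:
    @staticmethod
    def get_resource_name() -> str:
        return "GPU"  # the Ray resource name used for scheduling

    @staticmethod
    def get_accelerator_family() -> str:
        # Hypothetical future addition discussed above, not in this PR.
        return "NvidiaGPU"
```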
from ray._private.accelerators.neuron import NeuronAcceleratorManager


def get_all_accelerator_managers() -> Set[AcceleratorManager]:
Can we also support getting accelerator managers from env vars? Some users may need to plug in accelerator manager modules that are not maintained in the Ray repo.
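A hedged sketch of one way that could work, reading module:class entries from a hypothetical RAY_EXTRA_ACCELERATOR_MANAGERS variable (no such variable exists in Ray today):

```python
import importlib
import os


def load_external_accelerator_managers() -> list:
    """Load manager classes named in a hypothetical env var, e.g.
    RAY_EXTRA_ACCELERATOR_MANAGERS="my_pkg.habana:HabanaAcceleratorManager".
    """
    managers = []
    spec = os.environ.get("RAY_EXTRA_ACCELERATOR_MANAGERS", "")
    for entry in filter(None, (e.strip() for e in spec.split(","))):
        module_name, _, class_name = entry.partition(":")
        module = importlib.import_module(module_name)
        managers.append(getattr(module, class_name))
    return managers
```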
…set (ray-project#43714)
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
ray-project#40286 accidentally changed ray.get_gpu_ids() to always return a list of int, while it should return a list of str when CUDA_VISIBLE_DEVICES is set before starting Ray. This PR reverts to the original behavior.
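The restored contract, illustrated (return types per the commit message above; the task body is only a sketch):

```python
import ray


@ray.remote(num_gpus=1)
def show_gpu_ids():
    ids = ray.get_gpu_ids()
    # If CUDA_VISIBLE_DEVICES was set before `ray start`, entries are str
    # (taken verbatim from the env var, which may hold UUIDs); otherwise
    # Ray assigns plain integer indices, so entries are int.
    return ids
```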
Why are these changes needed?
Introduce the AcceleratorManager interface so that support for each accelerator can be added by simply implementing a subclass.

Related issue number
#38504
Checks
- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- If I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.