
[Core] Support Intel GPU #38553

Merged (10 commits) on Oct 25, 2023
Conversation

harborn (Contributor) commented Aug 17, 2023

Why are these changes needed?

Intel also provides general-purpose computing GPUs.
Internal Intel benchmarks show that Intel GPUs perform well on LLM training and inference workloads.

This PR aims to support Intel GPUs in Ray.
We add two device types as GPUs: INTEL_MAX_1550 and INTEL_MAX_1100.

This upgrade allows users to use Intel GPUs almost seamlessly, just like Nvidia's different GPU devices.

Usage of different GPU types in a Ray cluster

To use different GPU types in a Ray cluster:

  1. If the cluster has only one GPU type, you don't have to specify it on the task/actor. If accelerator_type is not set in the task/actor options, Ray automatically uses the single detected GPU type.
  2. If the cluster has more than one GPU type and the task/actor does not provide accelerator_type in its options, Ray raises a ValueError, because it cannot decide which GPU type to run the task/actor on.

For example:

import ray
from ray.cluster_utils import Cluster
from ray.util.accelerators import NVIDIA_TESLA_V100, INTEL_MAX_1550

# start a local multi-node cluster for illustration
cluster = Cluster()

# add a node with an Nvidia GPU to the cluster
cluster.add_node(num_cpus=1, num_gpus=8, resources={f"accelerator_type:{NVIDIA_TESLA_V100}": 1})

# add a node with an Intel GPU to the cluster
cluster.add_node(num_cpus=1, num_gpus=8, resources={f"accelerator_type:{INTEL_MAX_1550}": 1})

ray.init(address=cluster.address)

# use an Nvidia GPU to train
@ray.remote(num_gpus=1, accelerator_type=NVIDIA_TESLA_V100)
def train(data):
    return "This function was run on a node with an Nvidia Tesla V100 GPU"

ray.get(train.remote(1))

# use an Intel GPU to infer
@ray.remote(num_gpus=1, accelerator_type=INTEL_MAX_1550)
def infer(data):
    return "This function was run on a node with an Intel Max 1550 GPU"

ray.get(infer.remote(1))

The changes include two parts:

  1. upgrades to the GPU detection process in ray.init
  2. upgrades to GPU resource usage in Ray tasks and actors

Upgrades to the GPU detection process in ray.init

ray.init will autodetect all supported kinds of GPUs, currently including:

  • Nvidia GPU
  • Intel GPU

The GPU info is detected during ray.init() and stored in the node's resources field.
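A minimal sketch of what per-vendor autodetection could look like (illustrative only: the function names are hypothetical, and the dpctl/pynvml call patterns are assumptions rather than the exact code in this PR):

from typing import Dict


def _detect_intel_gpus() -> int:
    """Count Intel GPUs on this node via dpctl, if it is installed."""
    try:
        import dpctl  # SYCL runtime bindings shipped with Intel oneAPI
    except ImportError:
        return 0
    # Only Level Zero GPU devices are counted; other backends are ignored.
    return len(dpctl.get_devices(backend="level_zero", device_type="gpu"))


def _detect_nvidia_gpus() -> int:
    """Count Nvidia GPUs on this node via pynvml, if it is installed."""
    try:
        import pynvml
        pynvml.nvmlInit()
        count = pynvml.nvmlDeviceGetCount()
        pynvml.nvmlShutdown()
        return count
    except Exception:
        return 0


def detect_gpus() -> Dict[str, int]:
    """Return per-vendor GPU counts that ray.init() could fold into resources."""
    return {"nvidia": _detect_nvidia_gpus(), "intel": _detect_intel_gpus()}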

Upgrades to Ray tasks and actors

Only one accelerator type in the current Ray cluster:

# only NVIDIA_TESLA_V100 is detected, so it is used by default
@ray.remote(num_gpus=1)
def func():
    pass

@ray.remote(num_gpus=1, accelerator_type=NVIDIA_TESLA_V100)
def func():
    pass

@ray.remote(num_gpus=1, resources={f"accelerator_type:{NVIDIA_TESLA_V100}": 1})
def func():
    pass

Multiple accelerator types in the current Ray cluster:

Accelerator type not specified:

@ray.remote(num_gpus=1)
def func():
    pass
# raises ValueError, because the cluster has multiple GPU types and Ray cannot choose one automatically

Accelerator type specified:

# specify the accelerator type, e.g. INTEL_MAX_1550
@ray.remote(num_gpus=1, accelerator_type=INTEL_MAX_1550)
def func():
    pass

@ray.remote(num_gpus=1, resources={f"accelerator_type:{INTEL_MAX_1550}": 1})
def func():
    pass

Related issue number

#36493 previous implementation
#37998 auto detect aws accelerator

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

harborn changed the title from "Support Intel GPU" to "[Core] Support Intel GPU" on Aug 17, 2023
harborn (Contributor, Author) commented Aug 17, 2023

Please check this PR instead of https://github.com/ray-project/ray/pull/36493.
Sorry for some commit problems.
@abhilash1910 @cadedaniel @scv119

harborn (Contributor, Author) commented Aug 17, 2023

@xwu99
Please check here, Thanks.

harborn (Contributor, Author) commented Aug 17, 2023

Also updated based on the previous comments:

  1. Added 3 unit tests: test_xpu_ids, test_local_mode_xpus, test_disable_xpu_devices.
  2. Changed RAY_ACCELERATOR to RAY_EXPERIMENTAL_ACCELERATOR_TYPE.
  3. Unified the two environment variables into one: ONEAPI_DEVICE_SELECTOR, which plays a role similar to CUDA_VISIBLE_DEVICES (see the sketch after this list). Removed XPU_VISIBLE_DEVICES, which is not actually used in IPEX 1.13 and 2.0.
  4. Added some comments in the code.
  5. Only one type of accelerator can be used in the Ray cluster, even though there may be more than two accelerator types in the cluster.
  6. @xwu99 has added some documentation updates.
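A minimal sketch of how ONEAPI_DEVICE_SELECTOR could be used for per-worker device isolation, analogous to CUDA_VISIBLE_DEVICES (the helper name is hypothetical and the "level_zero" backend choice is an assumption, not necessarily what Ray writes):

import os
from typing import List


def set_visible_intel_gpus(device_ids: List[int]) -> None:
    # oneAPI selector syntax is "<backend>:<device>[,<device>...]",
    # e.g. "level_zero:0,2" exposes only devices 0 and 2 to this process.
    selector = "level_zero:" + ",".join(str(i) for i in device_ids)
    os.environ["ONEAPI_DEVICE_SELECTOR"] = selector


# Equivalent in spirit to setting CUDA_VISIBLE_DEVICES="0,2" for Nvidia GPUs.
set_visible_intel_gpus([0, 2])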

@cadedaniel

xwu99 mentioned this pull request on Aug 17, 2023
@@ -540,6 +542,25 @@ def check():
)


def test_disable_xpu_devices():
    script = """
import ray


Maybe indent the quoted script:

script= """
            import ray .....

LGTM otherwise

abhilash1910 left a comment

LGTM- ! Thanks

harborn (Contributor, Author) commented Aug 18, 2023

Previous comments are in https://github.com/ray-project/ray/pull/36493

harborn (Contributor, Author) commented Oct 7, 2023

@harborn

Is there a separate channel for discussions related to further integration/development (such as slack/discord etc?)

Could you reach out to me on Ray slack? We should set up a collaboration channel.

OK, I'll reach out to you on Slack.

object_store_memory,
resources,
redis_max_memory,
num_cpus, num_gpus, memory, object_store_memory, resources, redis_max_memory

It's better not to change the original format.

def _detect_gpus(num_gpus: Optional[int], resources: dict) -> int:
    """Detect GPUs by rules of 'Hardware Accelerators (GPUs) on Ray', link:
    https://github.com/ray-project/ray/blob/master/python/ray/util/accelerators/accelerators.md
    The mainly rule is Homogenous within a node:

Better to rephrase it like: The GPU type within the same node should be the same, but different nodes can have different types of GPUs.
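A minimal sketch of what that per-node homogeneity rule could look like as a check (illustrative only; the helper name is hypothetical and the actual validation in _detect_gpus may differ):

from typing import List


def check_node_gpu_homogeneity(detected_types: List[str]) -> None:
    """All GPUs within one node must share a single type; different nodes in
    the same cluster may still use different GPU types."""
    unique_types = set(detected_types)
    if len(unique_types) > 1:
        raise ValueError(
            f"A node must have a single GPU type, but detected: {sorted(unique_types)}"
        )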

if gpu_ids is not None:
    num_xpus = min(num_xpus, len(gpu_ids))

# resources.update({ray_constants.XPU: num_xpus})

Remove the redundant comment.

environment variable. CUDA_VISIBLE_DEVICES for Nvidia GPU,
ONEAPI_DEVICE_SELECTOR for Intel GPU.
Returns:
ID list (List[str]): according the resource model,

should move the long description up to the first paragraph.

backend = ray_constants.RAY_ONEAPI_DEVICE_BACKEND_TYPE
device_type = ray_constants.RAY_ONEAPI_DEVICE_TYPE
os.environ["ONEAPI_DEVICE_SELECTOR"] = backend + ":" + device_type


You can remove the above block, as ONEAPI_DEVICE_SELECTOR is already applied to dpctl.

jjyao (Contributor) left a comment

Could you create a test_intel_gpu.py file and add some tests? You can use test_tpu.py as an example.

jjyao (Contributor) commented Oct 19, 2023

Lint failed:



python/ray/_private/accelerators/intel_gpu.py:1:1: F401 're' imported but unused
python/ray/_private/accelerators/intel_gpu.py:3:1: F401 'sys' imported but unused
python/ray/_private/accelerators/intel_gpu.py:5:1: F401 'subprocess' imported but unused
python/ray/_private/accelerators/intel_gpu.py:6:1: F401 'importlib' imported but unused
python/ray/_private/accelerators/intel_gpu.py:55:18: E711 comparison to None should be 'if cond is not None:'

Comment on lines 62 to 71
def test_multi_gpu_with_different_vendor(ray_start_cluster):
    cluster = ray_start_cluster
    nvidia_gpu = NVIDIA_TESLA_A100
    intel_gpu = INTEL_MAX_1550
    prefix = ray._private.ray_constants.RESOURCE_CONSTRAINT_PREFIX
    nvidia_resource_name = f"{prefix}{nvidia_gpu}"
    intel_resource_name = f"{prefix}{intel_gpu}"
    cluster.add_node(num_cpus=1, num_gpus=10, resources={nvidia_resource_name: 1})
    cluster.add_node(num_cpus=1, num_gpus=10, resources={intel_resource_name: 1})
    ray.init(address=cluster.address)

This won't test anything. Since we didn't mock IntelGPUAcceleratorManager.get_current_node_num_accelerators, both nodes will have Nvidia GPUs.
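One way to make the second node actually behave as an Intel-GPU node would be to patch the Intel manager's detection while that node starts. A rough sketch, under the assumptions that IntelGPUAcceleratorManager is importable from ray._private.accelerators and that node startup with ray_start_cluster runs detection in the test process:

from unittest.mock import patch

import ray
from ray._private.accelerators import IntelGPUAcceleratorManager
from ray.util.accelerators import NVIDIA_TESLA_A100, INTEL_MAX_1550


def test_multi_gpu_with_different_vendor(ray_start_cluster):
    cluster = ray_start_cluster
    prefix = ray._private.ray_constants.RESOURCE_CONSTRAINT_PREFIX
    nvidia_resource_name = f"{prefix}{NVIDIA_TESLA_A100}"
    intel_resource_name = f"{prefix}{INTEL_MAX_1550}"

    # First node keeps the default (Nvidia) detection path.
    cluster.add_node(num_cpus=1, num_gpus=10, resources={nvidia_resource_name: 1})

    # Force Intel detection while the second node starts up, so it is
    # treated as an Intel-GPU node rather than another Nvidia one.
    with patch.object(
        IntelGPUAcceleratorManager,
        "get_current_node_num_accelerators",
        return_value=10,
    ):
        cluster.add_node(num_cpus=1, num_gpus=10, resources={intel_resource_name: 1})

    ray.init(address=cluster.address)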

Signed-off-by: harborn <gangsheng.wu@intel.com>
jjyao (Contributor) commented Oct 24, 2023

Tests failed on Windows:

================================== FAILURES ===================================
___________________ test_get_current_node_num_accelerators ____________________

    def test_get_current_node_num_accelerators():
        old_dpctl = None
        if "dpctl" in sys.modules:
            old_dpctl = sys.modules["dpctl"]

>       sys.modules["dpctl"] = __import__("mock_dpctl_1")
E       ModuleNotFoundError: No module named 'mock_dpctl_1'

\\?\C:\Users\ContainerAdministrator\AppData\Local\Temp\Bazel.runfiles_s1j40za1\runfiles\com_github_ray_project_ray\python\ray\tests\accelerators\test_intel_gpu.py:38: ModuleNotFoundError
___________________ test_get_current_node_accelerator_type ____________________

    def test_get_current_node_accelerator_type():
        old_dpctl = None
        if "dpctl" in sys.modules:
            old_dpctl = sys.modules["dpctl"]

>       sys.modules["dpctl"] = __import__("mock_dpctl_1")
E       ModuleNotFoundError: No module named 'mock_dpctl_1'

\\?\C:\Users\ContainerAdministrator\AppData\Local\Temp\Bazel.runfiles_s1j40za1\runfiles\com_github_ray_project_ray\python\ray\tests\accelerators\test_intel_gpu.py:53: ModuleNotFoundError
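One way to avoid depending on a separately importable mock_dpctl_1 file (which is what breaks here on Windows) would be to build the dpctl stand-in in memory. A rough sketch, assuming the manager imports dpctl inside the method and counts whatever dpctl.get_devices() returns:

import sys
import types


def make_fake_dpctl(num_devices: int) -> types.ModuleType:
    """Create an in-memory module exposing the small slice of dpctl that the
    Intel GPU accelerator manager is expected to call (an assumption)."""
    fake = types.ModuleType("dpctl")
    fake.get_devices = lambda backend=None, device_type=None: list(range(num_devices))
    return fake


def test_get_current_node_num_accelerators(monkeypatch):
    # No file-based mock module needed, so this also works on Windows.
    monkeypatch.setitem(sys.modules, "dpctl", make_fake_dpctl(6))
    from ray._private.accelerators import IntelGPUAcceleratorManager

    assert IntelGPUAcceleratorManager.get_current_node_num_accelerators() == 6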

Signed-off-by: harborn <gangsheng.wu@intel.com>
jjyao merged commit 8cfc894 into ray-project:master on Oct 25, 2023
34 of 37 checks passed
rickyyx mentioned this pull request on Dec 7, 2023