
[Core] Support Intel GPU #38553

Merged (10 commits) on Oct 25, 2023
Conversation

harborn (Contributor) commented Aug 17, 2023

Why are these changes needed?

Intel also provides general-purpose computing GPUs.
Internal Intel benchmarks show that Intel GPUs perform well on LLM training and inference workloads.

This PR aims to support Intel GPUs in Ray.
We add two device types as GPUs: INTEL_MAX_1550 and INTEL_MAX_1100.

This upgrade allows users to use Intel GPUs almost seamlessly, just like Nvidia's different GPU devices.

Usage of different GPU types in a Ray cluster

To use different GPU types in a Ray cluster:

  1. If the cluster has only one GPU type, you don't have to specify it on the task/actor. If accelerator_type is not set in the task/actor options, Ray automatically uses the single detected GPU type.
  2. If the cluster has more than one GPU type and the task/actor does not provide accelerator_type in its options, Ray raises a ValueError, because it cannot decide which GPU type to run the task/actor on.

For example:

import ray
from ray.cluster_utils import Cluster
from ray.util.accelerators import NVIDIA_TESLA_V100, INTEL_MAX_1550

# start a local multi-node cluster for illustration
cluster = Cluster()

# add a node with an Nvidia GPU to the cluster
cluster.add_node(num_cpus=1, num_gpus=8, resources={f"accelerator_type:{NVIDIA_TESLA_V100}": 1})

# add a node with an Intel GPU to the cluster
cluster.add_node(num_cpus=1, num_gpus=8, resources={f"accelerator_type:{INTEL_MAX_1550}": 1})

ray.init(address=cluster.address)

# use an Nvidia GPU to train
@ray.remote(num_gpus=1, accelerator_type=NVIDIA_TESLA_V100)
def train(data):
    return "This function was run on a node with an Nvidia Tesla V100 GPU"

ray.get(train.remote(1))

# use an Intel GPU to infer
@ray.remote(num_gpus=1, accelerator_type=INTEL_MAX_1550)
def infer(data):
    return "This function was run on a node with an Intel Max 1550 GPU"

ray.get(infer.remote(1))

The changes include two parts:

  1. upgrades to the GPU detection process in ray.init
  2. upgrades to GPU resource usage in Ray tasks and actors

Upgrades to the GPU detection process in ray.init

ray.init will autodetect all supported kinds of GPUs, currently including:

  • Nvidia GPU
  • Intel GPU

The GPU info is detected during ray.init() and stored in the node's resources field.
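A minimal sketch of what per-vendor autodetection could look like (illustrative only: the function names are hypothetical, and the dpctl/pynvml call patterns are assumptions rather than the exact code in this PR):

from typing import Dict


def _detect_intel_gpus() -> int:
    """Count Intel GPUs on this node via dpctl, if it is installed."""
    try:
        import dpctl  # SYCL runtime bindings shipped with Intel oneAPI
    except ImportError:
        return 0
    # Only Level Zero GPU devices are counted; other backends are ignored.
    return len(dpctl.get_devices(backend="level_zero", device_type="gpu"))


def _detect_nvidia_gpus() -> int:
    """Count Nvidia GPUs on this node via pynvml, if it is installed."""
    try:
        import pynvml
        pynvml.nvmlInit()
        count = pynvml.nvmlDeviceGetCount()
        pynvml.nvmlShutdown()
        return count
    except Exception:
        return 0


def detect_gpus() -> Dict[str, int]:
    """Return per-vendor GPU counts that ray.init() could fold into resources."""
    return {"nvidia": _detect_nvidia_gpus(), "intel": _detect_intel_gpus()}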

Upgrades to Ray tasks and actors

Only one accelerator type in the current Ray cluster:

# only NVIDIA_TESLA_V100 is detected, so it is used by default
@ray.remote(num_gpus=1)
def func():
    pass

@ray.remote(num_gpus=1, accelerator_type=NVIDIA_TESLA_V100)
def func():
    pass

@ray.remote(num_gpus=1, resources={f"accelerator_type:{NVIDIA_TESLA_V100}": 1})
def func():
    pass

Multiple accelerator types in the current Ray cluster:

Accelerator type not specified:

@ray.remote(num_gpus=1)
def func():
    pass
# raises ValueError, because the cluster has multiple GPU types and Ray cannot choose one automatically

Accelerator type specified:

# specify the accelerator type, e.g. INTEL_MAX_1550
@ray.remote(num_gpus=1, accelerator_type=INTEL_MAX_1550)
def func():
    pass

@ray.remote(num_gpus=1, resources={f"accelerator_type:{INTEL_MAX_1550}": 1})
def func():
    pass

Related issue number

#36493 previous implementation
#37998 auto detect aws accelerator

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

harborn changed the title from "Support Intel GPU" to "[Core] Support Intel GPU" on Aug 17, 2023
harborn (Contributor, Author) commented Aug 17, 2023

Please check this PR instead of https://github.com/ray-project/ray/pull/36493.
Sorry for some commit problems.
@abhilash1910 @cadedaniel @scv119

harborn (Contributor, Author) commented Aug 17, 2023

@xwu99
Please check here, Thanks.

harborn (Contributor, Author) commented Aug 17, 2023

Also updated based on the previous comments:

  1. Added 3 unit tests: test_xpu_ids, test_local_mode_xpus, test_disable_xpu_devices.
  2. Changed RAY_ACCELERATOR to RAY_EXPERIMENTAL_ACCELERATOR_TYPE.
  3. Unified the two environment variables into one: ONEAPI_DEVICE_SELECTOR, which plays a role similar to CUDA_VISIBLE_DEVICES (see the sketch after this list). Removed XPU_VISIBLE_DEVICES, which is not actually used in IPEX 1.13 and 2.0.
  4. Added some comments in the code.
  5. Only one type of accelerator can be used in the Ray cluster, even though there may be more than two accelerator types in the cluster.
  6. @xwu99 has added some documentation updates.
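A minimal sketch of how ONEAPI_DEVICE_SELECTOR could be used for per-worker device isolation, analogous to CUDA_VISIBLE_DEVICES (the helper name is hypothetical and the "level_zero" backend choice is an assumption, not necessarily what Ray writes):

import os
from typing import List


def set_visible_intel_gpus(device_ids: List[int]) -> None:
    # oneAPI selector syntax is "<backend>:<device>[,<device>...]",
    # e.g. "level_zero:0,2" exposes only devices 0 and 2 to this process.
    selector = "level_zero:" + ",".join(str(i) for i in device_ids)
    os.environ["ONEAPI_DEVICE_SELECTOR"] = selector


# Equivalent in spirit to setting CUDA_VISIBLE_DEVICES="0,2" for Nvidia GPUs.
set_visible_intel_gpus([0, 2])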

@cadedaniel

xwu99 mentioned this pull request on Aug 17, 2023
@@ -540,6 +542,25 @@ def check():
)


def test_disable_xpu_devices():
    script = """
import ray


Maybe indent the quoted script:

script= """
            import ray .....

LGTM otherwise

abhilash1910 left a comment

LGTM- ! Thanks

harborn (Contributor, Author) commented Aug 18, 2023

Previous comments are in https://github.com/ray-project/ray/pull/36493

harborn (Contributor, Author) commented Oct 7, 2023

@harborn

Is there a separate channel for discussions related to further integration/development (such as slack/discord etc?)

Could you reach out to me on Ray slack? We should set up a collaboration channel.

OK, I'll reach out to you on Slack.

object_store_memory,
resources,
redis_max_memory,
num_cpus, num_gpus, memory, object_store_memory, resources, redis_max_memory

It's better not to change the original format.

def _detect_gpus(num_gpus: Optional[int], resources: dict) -> int:
    """Detect GPUs by rules of 'Hardware Accelerators (GPUs) on Ray', link:
    https://github.com/ray-project/ray/blob/master/python/ray/util/accelerators/accelerators.md
    The mainly rule is Homogenous within a node:

Better to rephrase it like: The GPU type within the same node should be the same, but different nodes can have different types of GPUs.
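A minimal sketch of what that per-node homogeneity rule could look like as a check (illustrative only; the helper name is hypothetical and the actual validation in _detect_gpus may differ):

from typing import List


def check_node_gpu_homogeneity(detected_types: List[str]) -> None:
    """All GPUs within one node must share a single type; different nodes in
    the same cluster may still use different GPU types."""
    unique_types = set(detected_types)
    if len(unique_types) > 1:
        raise ValueError(
            f"A node must have a single GPU type, but detected: {sorted(unique_types)}"
        )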

if gpu_ids is not None:
    num_xpus = min(num_xpus, len(gpu_ids))

# resources.update({ray_constants.XPU: num_xpus})

Remove the redundant comment.

environment variable. CUDA_VISIBLE_DEVICES for Nvidia GPU,
ONEAPI_DEVICE_SELECTOR for Intel GPU.
Returns:
ID list (List[str]): according the resource model,

should move the long description up to the first paragraph.

backend = ray_constants.RAY_ONEAPI_DEVICE_BACKEND_TYPE
device_type = ray_constants.RAY_ONEAPI_DEVICE_TYPE
os.environ["ONEAPI_DEVICE_SELECTOR"] = backend + ":" + device_type


You can remove the above block, as ONEAPI_DEVICE_SELECTOR is already applied to dpctl.

jjyao (Contributor) left a comment

Could you create a test_intel_gpu.py file and add some tests? You can use test_tpu.py as an example.

jjyao (Contributor) commented Oct 19, 2023

Lint failed:



python/ray/_private/accelerators/intel_gpu.py:1:1: F401 're' imported but unused
python/ray/_private/accelerators/intel_gpu.py:3:1: F401 'sys' imported but unused
python/ray/_private/accelerators/intel_gpu.py:5:1: F401 'subprocess' imported but unused
python/ray/_private/accelerators/intel_gpu.py:6:1: F401 'importlib' imported but unused
python/ray/_private/accelerators/intel_gpu.py:55:18: E711 comparison to None should be 'if cond is not None:'

Comment on lines 62 to 71
def test_multi_gpu_with_different_vendor(ray_start_cluster):
    cluster = ray_start_cluster
    nvidia_gpu = NVIDIA_TESLA_A100
    intel_gpu = INTEL_MAX_1550
    prefix = ray._private.ray_constants.RESOURCE_CONSTRAINT_PREFIX
    nvidia_resource_name = f"{prefix}{nvidia_gpu}"
    intel_resource_name = f"{prefix}{intel_gpu}"
    cluster.add_node(num_cpus=1, num_gpus=10, resources={nvidia_resource_name: 1})
    cluster.add_node(num_cpus=1, num_gpus=10, resources={intel_resource_name: 1})
    ray.init(address=cluster.address)

This won't test anything. Since we didn't mock IntelGPUAcceleratorManager.get_current_node_num_accelerators, both nodes will have Nvidia GPUs.
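One way to make the second node actually behave as an Intel-GPU node would be to patch the Intel manager's detection while that node starts. A rough sketch, under the assumptions that IntelGPUAcceleratorManager is importable from ray._private.accelerators and that node startup with ray_start_cluster runs detection in the test process:

from unittest.mock import patch

import ray
from ray._private.accelerators import IntelGPUAcceleratorManager
from ray.util.accelerators import NVIDIA_TESLA_A100, INTEL_MAX_1550


def test_multi_gpu_with_different_vendor(ray_start_cluster):
    cluster = ray_start_cluster
    prefix = ray._private.ray_constants.RESOURCE_CONSTRAINT_PREFIX
    nvidia_resource_name = f"{prefix}{NVIDIA_TESLA_A100}"
    intel_resource_name = f"{prefix}{INTEL_MAX_1550}"

    # First node keeps the default (Nvidia) detection path.
    cluster.add_node(num_cpus=1, num_gpus=10, resources={nvidia_resource_name: 1})

    # Force Intel detection while the second node starts up, so it is
    # treated as an Intel-GPU node rather than another Nvidia one.
    with patch.object(
        IntelGPUAcceleratorManager,
        "get_current_node_num_accelerators",
        return_value=10,
    ):
        cluster.add_node(num_cpus=1, num_gpus=10, resources={intel_resource_name: 1})

    ray.init(address=cluster.address)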

Signed-off-by: harborn <gangsheng.wu@intel.com>
jjyao (Contributor) commented Oct 24, 2023

Tests failed on Windows:

================================== FAILURES ===================================
___________________ test_get_current_node_num_accelerators ____________________

    def test_get_current_node_num_accelerators():
        old_dpctl = None
        if "dpctl" in sys.modules:
            old_dpctl = sys.modules["dpctl"]

>       sys.modules["dpctl"] = __import__("mock_dpctl_1")
E       ModuleNotFoundError: No module named 'mock_dpctl_1'

\\?\C:\Users\ContainerAdministrator\AppData\Local\Temp\Bazel.runfiles_s1j40za1\runfiles\com_github_ray_project_ray\python\ray\tests\accelerators\test_intel_gpu.py:38: ModuleNotFoundError
___________________ test_get_current_node_accelerator_type ____________________

    def test_get_current_node_accelerator_type():
        old_dpctl = None
        if "dpctl" in sys.modules:
            old_dpctl = sys.modules["dpctl"]

>       sys.modules["dpctl"] = __import__("mock_dpctl_1")
E       ModuleNotFoundError: No module named 'mock_dpctl_1'

\\?\C:\Users\ContainerAdministrator\AppData\Local\Temp\Bazel.runfiles_s1j40za1\runfiles\com_github_ray_project_ray\python\ray\tests\accelerators\test_intel_gpu.py:53: ModuleNotFoundError
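One way to avoid depending on a separately importable mock_dpctl_1 file (which is what breaks here on Windows) would be to build the dpctl stand-in in memory. A rough sketch, assuming the manager imports dpctl inside the method and counts whatever dpctl.get_devices() returns:

import sys
import types


def make_fake_dpctl(num_devices: int) -> types.ModuleType:
    """Create an in-memory module exposing the small slice of dpctl that the
    Intel GPU accelerator manager is expected to call (an assumption)."""
    fake = types.ModuleType("dpctl")
    fake.get_devices = lambda backend=None, device_type=None: list(range(num_devices))
    return fake


def test_get_current_node_num_accelerators(monkeypatch):
    # No file-based mock module needed, so this also works on Windows.
    monkeypatch.setitem(sys.modules, "dpctl", make_fake_dpctl(6))
    from ray._private.accelerators import IntelGPUAcceleratorManager

    assert IntelGPUAcceleratorManager.get_current_node_num_accelerators() == 6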

Signed-off-by: harborn <gangsheng.wu@intel.com>
jjyao merged commit 8cfc894 into ray-project:master on Oct 25, 2023
34 of 37 checks passed
rickyyx mentioned this pull request on Dec 7, 2023