[Core] Support Intel GPU #38553
Conversation
Please check this PR instead https://github.com/ray-project/ray/pull/36493
@xwu99
Also updated previous comments:
python/ray/tests/test_basic.py
Outdated
@@ -540,6 +542,25 @@ def check():
    )


def test_disable_xpu_devices():
    script = """
import ray
Maybe indent the quoted script:
    script = """
    import ray ...
LGTM otherwise
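The reviewer's indentation suggestion can be sketched like this. It is a minimal illustration only: `run_driver_script` and the printed message are hypothetical stand-ins, not Ray's actual test helpers; the point is keeping the quoted driver script indented with the test body and dedenting it before execution.

```python
import subprocess
import sys
import textwrap

# Hypothetical helper: run a quoted driver script in a fresh interpreter.
# textwrap.dedent strips the indentation shared with the test body.
def run_driver_script(script: str) -> str:
    result = subprocess.run(
        [sys.executable, "-c", textwrap.dedent(script)],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout

# The script stays indented at the test body's level, as suggested.
output = run_driver_script(
    """
    print("xpu devices disabled")
    """
)
```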
LGTM! Thanks
Previous comments are in https://github.com/ray-project/ray/pull/36493
OK, I'll reach you on Slack.
python/ray/_private/resource_spec.py
Outdated
    object_store_memory,
    resources,
    redis_max_memory,
num_cpus, num_gpus, memory, object_store_memory, resources, redis_max_memory
It's better not to change the original format.
python/ray/_private/resource_spec.py
Outdated
def _detect_gpus(num_gpus: Optional[int], resources: dict) -> int:
    """Detect GPUs by rules of 'Hardware Accelerators (GPUs) on Ray', link:
    https://github.com/ray-project/ray/blob/master/python/ray/util/accelerators/accelerators.md
    The mainly rule is Homogenous within a node:
Better to rephrase like: The GPU type within one node should be the same, but different nodes can have different types of GPUs.
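The homogeneity rule the comment describes can be sketched in pure Python. This is an illustration only: `detect_node_gpu_type` is a hypothetical name, not Ray's actual detection code.

```python
# Hypothetical sketch of the homogeneity rule: all GPUs detected on one
# node must share a single type; different nodes may use different types.
def detect_node_gpu_type(detected_types: list) -> str:
    types = set(detected_types)
    if len(types) > 1:
        raise ValueError(f"Mixed GPU types on one node: {sorted(types)}")
    # One type: return it; no GPUs: return None.
    return types.pop() if types else None
```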
python/ray/_private/resource_spec.py
Outdated
if gpu_ids is not None:
    num_xpus = min(num_xpus, len(gpu_ids))

# resources.update({ray_constants.XPU: num_xpus})
Remove the redundant comment.
python/ray/_private/utils.py
Outdated
    environment variable. CUDA_VISIBLE_DEVICES for Nvidia GPU,
    ONEAPI_DEVICE_SELECTOR for Intel GPU.
Returns:
    ID list (List[str]): according the resource model,
The long description should move up into the first paragraph.
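The behavior the docstring under review describes can be illustrated with a pure-Python sketch. Note that `get_visible_device_ids` and its exact parsing rules are assumptions for illustration, not Ray's implementation; the `ONEAPI_DEVICE_SELECTOR` value is assumed to look like `backend:0,1`.

```python
import os

# Illustrative sketch: read the visible-device env variable and return an
# ID list. CUDA_VISIBLE_DEVICES lists Nvidia GPU IDs as "0,1,2";
# ONEAPI_DEVICE_SELECTOR lists Intel GPU IDs as "level_zero:0,1,2".
def get_visible_device_ids() -> list:
    cuda = os.environ.get("CUDA_VISIBLE_DEVICES")
    if cuda is not None:
        return [] if cuda == "" else cuda.split(",")
    oneapi = os.environ.get("ONEAPI_DEVICE_SELECTOR")
    if oneapi is not None:
        # Strip the "backend:" prefix, keep the comma-separated device list.
        _, _, devices = oneapi.partition(":")
        return devices.split(",") if devices else []
    return None  # Unset: all devices are visible.
```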
python/ray/_private/utils.py
Outdated
backend = ray_constants.RAY_ONEAPI_DEVICE_BACKEND_TYPE
device_type = ray_constants.RAY_ONEAPI_DEVICE_TYPE
os.environ["ONEAPI_DEVICE_SELECTOR"] = backend + ":" + device_type
The above block can be removed, as ONEAPI_DEVICE_SELECTOR is already applied to dpctl.
Could you create a test_intel_gpu.py file and add some tests? You can see test_tpu.py as an example.
Lint failed.
328340d
to
ca007ab
Compare
def test_multi_gpu_with_different_vendor(ray_start_cluster):
    cluster = ray_start_cluster
    nvidia_gpu = NVIDIA_TESLA_A100
    intel_gpu = INTEL_MAX_1550
    prefix = ray._private.ray_constants.RESOURCE_CONSTRAINT_PREFIX
    nvidia_resource_name = f"{prefix}{nvidia_gpu}"
    intel_resource_name = f"{prefix}{intel_gpu}"
    cluster.add_node(num_cpus=1, num_gpus=10, resources={nvidia_resource_name: 1})
    cluster.add_node(num_cpus=1, num_gpus=10, resources={intel_resource_name: 1})
    ray.init(address=cluster.address)
This won't test anything. Since we didn't mock IntelGPUAcceleratorManager.get_current_node_num_accelerators, both nodes will have Nvidia GPUs.
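One way to address this is to patch the detection call with `unittest.mock`, sketched below. The class here is a simplified stand-in defined locally for illustration, not Ray's actual module path or internals.

```python
from unittest.mock import patch

# Simplified stand-in for the real manager; the actual implementation
# would query the Intel GPU driver for the device count.
class IntelGPUAcceleratorManager:
    @staticmethod
    def get_current_node_num_accelerators() -> int:
        return 0  # No Intel GPUs detected by default.

# Patch the detection call so the "Intel" node reports 4 accelerators
# while the patch is active.
with patch.object(
    IntelGPUAcceleratorManager,
    "get_current_node_num_accelerators",
    return_value=4,
):
    num = IntelGPUAcceleratorManager.get_current_node_num_accelerators()
```

The patch only applies inside the `with` block, so other nodes in the same test still use the unmocked (Nvidia-path) detection.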
Signed-off-by: harborn <gangsheng.wu@intel.com>
Tests failed on Windows.
Signed-off-by: harborn <gangsheng.wu@intel.com>
Why are these changes needed?
Intel also provides general-purpose computing GPUs.
Intel's internal benchmarks show that Intel GPUs perform well on LLM training/inference workloads.
This PR aims to support Intel GPUs on Ray.
We add two device types as GPUs: INTEL_MAX_1550 and INTEL_MAX_1100.
This upgrade allows users to use Intel GPUs almost seamlessly, just like Nvidia's different GPU devices.
Usage of different GPU types in a Ray cluster
To use different GPU types in a Ray cluster:
- If the cluster has only one GPU type, users need not set accelerator_type in task/actor options; Ray will automatically use the only GPU type.
- If the cluster has multiple GPU types and accelerator_type is not set in options, Ray will raise a ValueError, because it cannot decide which GPU to run the task/actor on. Such as:
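The selection rule above can be sketched in pure Python. This is an illustrative model only: `resolve_accelerator` is a hypothetical name, not Ray's scheduler code.

```python
# Illustrative sketch of the rule: auto-pick when exactly one GPU type
# exists in the cluster, otherwise require an explicit accelerator_type.
def resolve_accelerator(cluster_gpu_types: set, accelerator_type: str = None) -> str:
    if accelerator_type is not None:
        if accelerator_type not in cluster_gpu_types:
            raise ValueError(f"No node provides accelerator type {accelerator_type}")
        return accelerator_type
    if len(cluster_gpu_types) == 1:
        return next(iter(cluster_gpu_types))
    raise ValueError("Multiple GPU types present; set accelerator_type in options.")
```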
The changes include 2 parts:
1. Upgrades to the GPU detection process of ray.init: ray.init will autodetect all kinds of GPUs. The GPU info is detected during ray.init() and stored in the resources field in options.
2. Upgrades to Ray tasks and actors, covering:
   - only one accelerator type in the current Ray service
   - multiple accelerator types in the current Ray service
     - accelerator type not specified
     - accelerator type specified
Related issue number
#36493 previous implementation
#37998 auto-detect AWS accelerators
Checks
- I've signed off every commit (git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- If I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.