
Add support for Intel Gaudi Backend #40561

Merged: 17 commits into ray-project:master, Oct 31, 2023

Conversation

jerome-habana (Contributor)

Added support for the Intel Gaudi backend, based on the new interfaces defined in #40286.
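For readers unfamiliar with the interface from #40286, an accelerator backend is essentially a set of static discovery hooks. Below is a minimal, illustrative sketch in plain Python of the two behaviors most visible in this PR's review threads; the class name, method shapes, and the HABANA_VISIBLE_DEVICES variable are assumptions drawn from the snippets quoted later in this thread, not Ray's exact API.

```python
import os

# Assumed env var name, based on the test snippets quoted later in this PR.
HABANA_VISIBLE_DEVICES_ENV_VAR = "HABANA_VISIBLE_DEVICES"


class HPUAcceleratorManagerSketch:
    """Illustrative stand-in for the accelerator-manager interface.

    Method names mirror those visible in this PR's lint output; the real
    implementation lives in python/ray/_private/accelerators/hpu.py.
    """

    @staticmethod
    def get_resource_name() -> str:
        # HPU devices are exposed to the scheduler under a resource name.
        return "HPU"

    @staticmethod
    def get_current_process_visible_accelerator_ids():
        # Visible devices are communicated via an env var, analogous to
        # CUDA_VISIBLE_DEVICES for NVIDIA GPUs. None means "unrestricted".
        devices = os.environ.get(HABANA_VISIBLE_DEVICES_ENV_VAR)
        if devices is None:
            return None
        return devices.split(",")


os.environ[HABANA_VISIBLE_DEVICES_ENV_VAR] = "0,1,2"
print(HPUAcceleratorManagerSketch.get_current_process_visible_accelerator_ids())
# ['0', '1', '2']
```

The real manager additionally handles device counting and type detection on the node, which this sketch omits.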

Why are these changes needed?

Related issue number

Checks

  • I've signed off every commit (using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Jerome <janand@habana.ai>
@jjyao (Contributor) left a comment:

Lg

Review threads (resolved):
  • python/ray/_private/accelerators/hpu.py (3)
  • python/ray/air/_internal/torch_utils.py
  • python/ray/tests/accelerators/test_hpu.py (2)
  • python/ray/train/torch/config.py
Signed-off-by: Jerome <janand@habana.ai>
Signed-off-by: Jerome Anand <88475913+jerome-habana@users.noreply.github.com>
@jjyao jjyao self-assigned this Oct 25, 2023
Review threads (resolved):
  • python/ray/_private/accelerators/hpu.py (4)
  • python/ray/train/torch/config.py
  • python/ray/tests/accelerators/test_hpu.py
Signed-off-by: Jerome <janand@habana.ai>
Signed-off-by: Jerome <janand@habana.ai>
@jjyao (Contributor) left a comment:

Have you tested this on the machine with Gaudi?

Review threads:
  • python/ray/_private/accelerators/hpu.py
  • python/ray/tests/accelerators/test_hpu.py
Signed-off-by: Jerome <janand@habana.ai>
Signed-off-by: Jerome Anand <88475913+jerome-habana@users.noreply.github.com>
Signed-off-by: Jerome Anand <88475913+jerome-habana@users.noreply.github.com>
@jjyao (Contributor) left a comment:

Last comment

Review thread (resolved): python/ray/_private/accelerators/hpu.py
@jjyao (Contributor) commented Oct 27, 2023

Lint failure:

Fri Oct 27 12:06:51 UTC 2023 Flake8....
python/ray/_private/utils.py:338:89: E501 line too long (108 > 88 characters)
python/ray/tests/accelerators/test_hpu.py:3:1: F401 'subprocess' imported but unused
python/ray/tests/accelerators/test_hpu.py:110:74: E711 comparison to None should be 'if cond is None:'
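On the E711 item: flake8 prefers `is None` because `==` goes through `__eq__`, which a class can override, while `is` checks object identity and cannot be fooled. A small self-contained illustration:

```python
class AlwaysEqual:
    """Pathological type whose __eq__ claims equality with everything."""

    def __eq__(self, other):
        return True


obj = AlwaysEqual()

# Equality comparison is answered by __eq__, so it can lie:
print(obj == None)  # True, even though obj is clearly not None

# Identity comparison cannot be overridden:
print(obj is None)  # False
```

This is why the lint fix at test_hpu.py:110 rewrites the comparison as `is None` rather than `== None`.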


jerome-habana and others added 2 commits October 30, 2023 08:48
Signed-off-by: Jerome Anand <88475913+jerome-habana@users.noreply.github.com>
Signed-off-by: Jerome <janand@habana.ai>
Review threads (resolved): python/ray/_private/accelerators/hpu.py (2)
@jjyao (Contributor) commented Oct 30, 2023

Lint failure:



 def test_get_current_process_visible_accelerator_ids():
     os.environ[hpu.HABANA_VISIBLE_DEVICES_ENV_VAR] = "0,1,2"
-    assert HPUAcceleratorManager.get_current_process_visible_accelerator_ids() == ["0", "1", "2"]  # noqa: E501
+    assert HPUAcceleratorManager.get_current_process_visible_accelerator_ids() == [
+        "0",
+        "1",
+        "2",
+    ]  # noqa: E501


jerome-habana and others added 3 commits October 30, 2023 07:15
Signed-off-by: Jerome <janand@habana.ai>
Signed-off-by: Jerome <janand@habana.ai>
Signed-off-by: Jerome Anand <88475913+jerome-habana@users.noreply.github.com>
Review threads (resolved):
  • python/ray/_private/accelerators/hpu.py (3)
  • python/ray/tests/accelerators/test_hpu.py
jerome-habana and others added 2 commits October 31, 2023 07:42
Signed-off-by: Jerome Anand <88475913+jerome-habana@users.noreply.github.com>
* Add Intel gaudi to accelerator list
* Add check for backend initialization with updated test

Signed-off-by: Jerome <janand@habana.ai>
@jjyao (Contributor) commented Oct 31, 2023

Lint failure:



     if HPUAcceleratorManager.is_initialized():
-        assert "Intel-GAUDI" in HPUAcceleratorManager.get_current_node_accelerator_type()
+        assert (
+            "Intel-GAUDI" in HPUAcceleratorManager.get_current_node_accelerator_type()
+        )
     else:
         assert HPUAcceleratorManager.get_current_node_accelerator_type() is None
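One detail worth noting in this test: the substring check (`in`) rather than strict equality means any type string containing "Intel-GAUDI" would pass, e.g. a generation-suffixed "Intel-GAUDI2". A tiny sketch; the suffixed string is an assumption for illustration, not a confirmed type name:

```python
INTEL_GAUDI = "Intel-GAUDI"

# Substring membership matches both the base type string and any
# hypothetical suffixed variant like "Intel-GAUDI2".
candidates = ["Intel-GAUDI", "Intel-GAUDI2"]
matches = [c for c in candidates if INTEL_GAUDI in c]
print(matches)  # ['Intel-GAUDI', 'Intel-GAUDI2']
```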


Signed-off-by: Jerome <janand@habana.ai>
@jerome-habana (Contributor, Author) commented:

Lint failure:



     if HPUAcceleratorManager.is_initialized():
-        assert "Intel-GAUDI" in HPUAcceleratorManager.get_current_node_accelerator_type()
+        assert (
+            "Intel-GAUDI" in HPUAcceleratorManager.get_current_node_accelerator_type()
+        )
     else:
         assert HPUAcceleratorManager.get_current_node_accelerator_type() is None

It might be nice to have an auto-corrector.

@@ -7,6 +7,7 @@
 NVIDIA_TESLA_A10G = "A10G"
 INTEL_MAX_1550 = "Intel-GPU-Max-1550"
 INTEL_MAX_1100 = "Intel-GPU-Max-1100"
+INTEL_GAUDI = "Intel-GAUDI"
Contributor:

Can you also add INTEL_GAUDI2 here?

Contributor (Author):

Sure. I've kept it generic for now. Let's update it once the right instance usage is settled.

Contributor:

Hi all, I'm working on LLM serving on Gaudi 2. Is Gaudi 2 not supported yet?

Signed-off-by: Jerome Anand <88475913+jerome-habana@users.noreply.github.com>
@jjyao jjyao merged commit 04a8aa3 into ray-project:master Oct 31, 2023
29 of 33 checks passed
@jerome-habana jerome-habana mentioned this pull request May 15, 2024
8 tasks
3 participants