
Add support for Intel Gaudi Backend #40561

Merged: 17 commits into ray-project:master, Oct 31, 2023

Conversation

jerome-habana (Contributor)

Added support for the Intel Gaudi backend, based on the new interfaces defined in #40286.
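For readers unfamiliar with the interface from #40286, an accelerator backend is essentially a set of static discovery hooks. Below is a minimal, illustrative sketch in plain Python of the two behaviors most visible in this PR's review threads; the class name, method shapes, and the HABANA_VISIBLE_DEVICES variable are assumptions drawn from the snippets quoted later in this thread, not Ray's exact API.

```python
import os

# Assumed env var name, based on the test snippets quoted later in this PR.
HABANA_VISIBLE_DEVICES_ENV_VAR = "HABANA_VISIBLE_DEVICES"


class HPUAcceleratorManagerSketch:
    """Illustrative stand-in for the accelerator-manager interface.

    Method names mirror those visible in this PR's lint output; the real
    implementation lives in python/ray/_private/accelerators/hpu.py.
    """

    @staticmethod
    def get_resource_name() -> str:
        # HPU devices are exposed to the scheduler under a resource name.
        return "HPU"

    @staticmethod
    def get_current_process_visible_accelerator_ids():
        # Visible devices are communicated via an env var, analogous to
        # CUDA_VISIBLE_DEVICES for NVIDIA GPUs. None means "unrestricted".
        devices = os.environ.get(HABANA_VISIBLE_DEVICES_ENV_VAR)
        if devices is None:
            return None
        return devices.split(",")


os.environ[HABANA_VISIBLE_DEVICES_ENV_VAR] = "0,1,2"
print(HPUAcceleratorManagerSketch.get_current_process_visible_accelerator_ids())
# ['0', '1', '2']
```

The real manager additionally handles device counting and type detection on the node, which this sketch omits.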

Why are these changes needed?

Related issue number

Checks

  • I've signed off every commit (using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Jerome <janand@habana.ai>
@jjyao (Contributor) left a comment:

Lg

Review threads (resolved):
  • python/ray/_private/accelerators/hpu.py (3)
  • python/ray/air/_internal/torch_utils.py
  • python/ray/tests/accelerators/test_hpu.py (2)
  • python/ray/train/torch/config.py
Signed-off-by: Jerome <janand@habana.ai>
Signed-off-by: Jerome Anand <88475913+jerome-habana@users.noreply.github.com>
@jjyao jjyao self-assigned this Oct 25, 2023
Review threads (resolved):
  • python/ray/_private/accelerators/hpu.py (4)
  • python/ray/train/torch/config.py
  • python/ray/tests/accelerators/test_hpu.py
Signed-off-by: Jerome <janand@habana.ai>
Signed-off-by: Jerome <janand@habana.ai>
@jjyao (Contributor) left a comment:

Have you tested this on the machine with Gaudi?

Review threads:
  • python/ray/_private/accelerators/hpu.py
  • python/ray/tests/accelerators/test_hpu.py
Signed-off-by: Jerome <janand@habana.ai>
Signed-off-by: Jerome Anand <88475913+jerome-habana@users.noreply.github.com>
Signed-off-by: Jerome Anand <88475913+jerome-habana@users.noreply.github.com>
@jjyao (Contributor) left a comment:

Last comment

Review thread (resolved): python/ray/_private/accelerators/hpu.py
@jjyao (Contributor) commented Oct 27, 2023

Lint failure:

Fri Oct 27 12:06:51 UTC 2023 Flake8....
python/ray/_private/utils.py:338:89: E501 line too long (108 > 88 characters)
python/ray/tests/accelerators/test_hpu.py:3:1: F401 'subprocess' imported but unused
python/ray/tests/accelerators/test_hpu.py:110:74: E711 comparison to None should be 'if cond is None:'
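On the E711 item: flake8 prefers `is None` because `==` goes through `__eq__`, which a class can override, while `is` checks object identity and cannot be fooled. A small self-contained illustration:

```python
class AlwaysEqual:
    """Pathological type whose __eq__ claims equality with everything."""

    def __eq__(self, other):
        return True


obj = AlwaysEqual()

# Equality comparison is answered by __eq__, so it can lie:
print(obj == None)  # True, even though obj is clearly not None

# Identity comparison cannot be overridden:
print(obj is None)  # False
```

This is why the lint fix at test_hpu.py:110 rewrites the comparison as `is None` rather than `== None`.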


jerome-habana and others added 2 commits October 30, 2023 08:48
Signed-off-by: Jerome Anand <88475913+jerome-habana@users.noreply.github.com>
Signed-off-by: Jerome <janand@habana.ai>
Review threads (resolved): python/ray/_private/accelerators/hpu.py (2)
@jjyao (Contributor) commented Oct 30, 2023

Lint failure:



 def test_get_current_process_visible_accelerator_ids():
     os.environ[hpu.HABANA_VISIBLE_DEVICES_ENV_VAR] = "0,1,2"
-    assert HPUAcceleratorManager.get_current_process_visible_accelerator_ids() == ["0", "1", "2"]  # noqa: E501
+    assert HPUAcceleratorManager.get_current_process_visible_accelerator_ids() == [
+        "0",
+        "1",
+        "2",
+    ]  # noqa: E501


jerome-habana and others added 3 commits October 30, 2023 07:15
Signed-off-by: Jerome <janand@habana.ai>
Signed-off-by: Jerome <janand@habana.ai>
Signed-off-by: Jerome Anand <88475913+jerome-habana@users.noreply.github.com>
Review threads (resolved):
  • python/ray/_private/accelerators/hpu.py (3)
  • python/ray/tests/accelerators/test_hpu.py
jerome-habana and others added 2 commits October 31, 2023 07:42
Signed-off-by: Jerome Anand <88475913+jerome-habana@users.noreply.github.com>
* Add Intel gaudi to accelerator list
* Add check for backend initialization with updated test

Signed-off-by: Jerome <janand@habana.ai>
@jjyao (Contributor) commented Oct 31, 2023

Lint failure:



     if HPUAcceleratorManager.is_initialized():
-        assert "Intel-GAUDI" in HPUAcceleratorManager.get_current_node_accelerator_type()
+        assert (
+            "Intel-GAUDI" in HPUAcceleratorManager.get_current_node_accelerator_type()
+        )
     else:
         assert HPUAcceleratorManager.get_current_node_accelerator_type() is None
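One detail worth noting in this test: the substring check (`in`) rather than strict equality means any type string containing "Intel-GAUDI" would pass, e.g. a generation-suffixed "Intel-GAUDI2". A tiny sketch; the suffixed string is an assumption for illustration, not a confirmed type name:

```python
INTEL_GAUDI = "Intel-GAUDI"

# Substring membership matches both the base type string and any
# hypothetical suffixed variant like "Intel-GAUDI2".
candidates = ["Intel-GAUDI", "Intel-GAUDI2"]
matches = [c for c in candidates if INTEL_GAUDI in c]
print(matches)  # ['Intel-GAUDI', 'Intel-GAUDI2']
```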


Signed-off-by: Jerome <janand@habana.ai>
@jerome-habana (Contributor, Author) commented:

Lint failure:



     if HPUAcceleratorManager.is_initialized():
-        assert "Intel-GAUDI" in HPUAcceleratorManager.get_current_node_accelerator_type()
+        assert (
+            "Intel-GAUDI" in HPUAcceleratorManager.get_current_node_accelerator_type()
+        )
     else:
         assert HPUAcceleratorManager.get_current_node_accelerator_type() is None

It might be nice to have an auto-corrector.

@@ -7,6 +7,7 @@
 NVIDIA_TESLA_A10G = "A10G"
 INTEL_MAX_1550 = "Intel-GPU-Max-1550"
 INTEL_MAX_1100 = "Intel-GPU-Max-1100"
+INTEL_GAUDI = "Intel-GAUDI"
Contributor:

Can you also add INTEL_GAUDI2 here?

Contributor (Author):

Sure. I've kept it generic for now. Let's update it once the right instance usage is settled.

Contributor:

Hi all, I'm working on LLM serving on Gaudi 2. Is Gaudi 2 not supported yet?

Signed-off-by: Jerome Anand <88475913+jerome-habana@users.noreply.github.com>
@jjyao jjyao merged commit 04a8aa3 into ray-project:master Oct 31, 2023
29 of 33 checks passed
@jerome-habana jerome-habana mentioned this pull request May 15, 2024
8 tasks
3 participants