
[Ray Core] Adds in Google Cloud TPUs as a native Resource #38669

Merged

merged 23 commits into ray-project:master on Sep 6, 2023

Conversation

@allenwang28 (Contributor) commented Aug 21, 2023

Why are these changes needed?

The linked issue has more details, but at a high level this change addresses the feature request of adding TPUs as a native resource within Ray.

For users, these changes are intended to shift from:

@ray.remote(resources={"TPU": 1})
def my_function():
    ...

to

@ray.remote(num_tpus=4)
def my_function():
    ...

Since we're adding TPUs as a native resource, this gives users the added ability to access individual TPU chips in a TPU host, e.g. the following would be valid:

@ray.remote(num_tpus=1)  # or 2
def my_function():
    import jax
    print(jax.device_count())  # -> 1 (or 2)

This is enabled by setting environment variables as described in jax-ml/jax#14977.
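As a rough illustration, here's a minimal sketch of that mechanism. TPU_VISIBLE_CHIPS shows up in this PR's commit messages; the exact value format and the import-ordering requirement below are assumptions based on jax-ml/jax#14977, not this PR's literal code:

```python
import os

# Restrict this worker process to a single TPU chip before JAX initializes.
os.environ["TPU_VISIBLE_CHIPS"] = "0"

import jax  # noqa: E402 -- JAX reads the TPU topology env vars at startup

print(jax.device_count())  # -> 1
```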

Short overview of changes:

  • Adds autodetection of the number of TPU chips and the TPU version (+ tests for each)
  • Adds num_tpus to ray_params and does input sanitization mirroring how GPUs work (e.g. num_tpus instead of TPU in resources)
  • Adds several constants
  • Since there are many changes related to neuron cores (from Auto-detection of accelerator_type for aws_accelerators trn1_inf #37998), I consolidated the neuron-core-specific logic with the TPU logic where applicable, for instance changing _validate_neuron_core_accelerator -> _validate_accelerator in ray_option_utils.py
  • Adds a lot of tests!

Related issue number

#38085

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: allenwang28 <allencwang@google.com>

…cally setting TPU_VISIBLE_CHIPS, test cases and other changes around Ray to make sure we're doing the same things as other accelerators. Still need to test more next.

…some final touch ups before sending out for review...
@allenwang28 marked this pull request as ready for review August 22, 2023 18:51
@architkulkarni self-assigned this Aug 22, 2023
@architkulkarni added the core-interface-change-approval-required label (this changes the Ray core behavior / API and requires broader approvals) Aug 22, 2023
@architkulkarni (Contributor)

@richardliaw @scv119 Do you approve the API change? I'll review the PR, perhaps you can suggest other reviewers too.

@ericl (Contributor) commented Aug 24, 2023

I don't believe we are intending to add vendor-specific resources directly to the Ray API; the pattern followed should be similar to #37998 for all new accelerator types.

@allenwang28 (Contributor, Author)

> I don't believe we are intending to add vendor-specific resources directly to the Ray API; the pattern followed should be similar to #37998 for all new accelerator types.

Thanks - my mistake. I'll pull out the API specific changes and align with the neuron_cores approach.

Signed-off-by: allenwang28 <allencwang@google.com>
@allenwang28 (Contributor, Author)

I've updated the PR to remove num_tpus, PTAL!
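(For reference, a sketch of what usage looks like after this update. This is hedged: the exact semantics follow the neuron_cores pattern from #37998, and the accepted accelerator_type values are an assumption based on this PR's validation strings.)

```python
import ray

# Request TPU chips via the custom "TPU" resource:
@ray.remote(resources={"TPU": 4})
def my_function():
    ...

# Optionally pin to a TPU generation via accelerator_type:
@ray.remote(resources={"TPU": 4}, accelerator_type="TPU-V4")
def my_v4_function():
    ...
```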

@architkulkarni (Contributor) left a comment


Looks good to me! Some nits around maintenance/documentation in case someone less familiar with TPUs has to update this code in the future, but these don't need to block the PR in my opinion.

        numeric_entries = [int(entry) for entry in vfio_entries if entry.isdigit()]
        return len(numeric_entries)
    except Exception:
        logging.info("Failed to detect number of TPUs.")
Contributor:

  • What specific exception are we trying to catch here?
  • Does the original exception get printed automatically? If not, we should probably log it

Contributor (Author):

Added FileNotFoundError!
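(For readers, a minimal sketch of the revised shape; the helper name and the directory being listed are assumptions, not the PR's literal code:)

```python
import logging
import os

def _autodetect_num_tpus() -> int:  # hypothetical helper name
    """Attempt to detect the number of TPU chips from /dev/vfio entries."""
    try:
        vfio_entries = os.listdir("/dev/vfio")
        numeric_entries = [int(entry) for entry in vfio_entries if entry.isdigit()]
        return len(numeric_entries)
    except FileNotFoundError as e:
        # /dev/vfio does not exist on hosts without TPUs attached.
        logging.info("Failed to detect number of TPUs: %s", e)
        return 0
```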

(i.e. a GCE VM), vs. through GKE:
- GCE VMs will always have a metadata server to poll this info
- GKE VMs will have environment variables preset.
Contributor:

Is there google documentation we can link to as a reference?

Contributor (Author):

Unfortunately not at this moment; this is all mostly internal knowledge related to GKE. I'm not sure whether it will be publicized.

"""

def accelerator_type_to_version(accelerator_type: str) -> str:
return "TPU-" + str(accelerator_type.split("-")[0]).upper()
Contributor:

Is it worth checking that accelerator-type contains a "-"?

Contributor (Author):

Done. I moved over some logic we used in config.py within the GCP autoscaler to validate the accelerator type (and tested that both succeed).
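(For context, a quick usage sketch of the helper shown above, assuming GCE-style accelerator-type strings such as "v4-8":)

```python
def accelerator_type_to_version(accelerator_type: str) -> str:
    return "TPU-" + str(accelerator_type.split("-")[0]).upper()

assert accelerator_type_to_version("v4-8") == "TPU-V4"
assert accelerator_type_to_version("v2-32") == "TPU-V2"
```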

Comment on lines +489 to +493
RAY_GCE_TPU_ACCELERATOR_ENDPOINT = (
"http://metadata.google.internal/computeMetadata/"
"v1/instance/attributes/accelerator-type"
)
RAY_GCE_TPU_HEADERS = {"Metadata-Flavor": "Google"}
Contributor:

Ideally we could link to documentation for this too in case someone needs to update this later

Contributor (Author):

Done
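(A sketch of how these constants could be used to poll the GCE metadata server; the choice of HTTP client here is an assumption, not necessarily what the PR uses:)

```python
import requests

RAY_GCE_TPU_ACCELERATOR_ENDPOINT = (
    "http://metadata.google.internal/computeMetadata/"
    "v1/instance/attributes/accelerator-type"
)
RAY_GCE_TPU_HEADERS = {"Metadata-Flavor": "Google"}

# Only reachable from inside a GCE VM; returns e.g. "v4-8" on a TPU VM.
resp = requests.get(
    RAY_GCE_TPU_ACCELERATOR_ENDPOINT, headers=RAY_GCE_TPU_HEADERS, timeout=2
)
accelerator_type = resp.text
```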

"""Attempt to detect the number of TPUs on this machine.

TPU chips are represented as devices within `/dev/`, either as
`/dev/accel*` or `/dev/vfio/*`.
Contributor:

Link to documentation?

Contributor (Author):

All internal knowledge unfortunately :(
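(The other detection path named in the docstring, sketched with glob; the exact fallback order between the two paths is in the PR, not shown here:)

```python
import glob

# TPU chips may also be exposed as /dev/accel* device files.
num_tpu_chips = len(glob.glob("/dev/accel*"))
```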


Returns:
    A string representing the TPU version,
    e.g. "V2", "V3", "V4" if applicable, else None.
Contributor:

Suggested change:
- e.g. "V2", "V3", "V4" if applicable, else None.
+ e.g. "TPU-V2", "TPU-V3", "TPU-V4" if applicable, else None.

Contributor (Author):

Done

raise ValueError(
    "Only one of 'num_gpus', 'neuron_cores/accelerator_type:aws-neuron-core' "
    "and 'TPU/accelerator_type:TPU-V*' can be set. "
    f"Detected {num_configured_accelerators} "
Contributor:

Consider giving more details instead of just num_configured_accelerators, for example listing which of the options were detected

Contributor (Author):

Done
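(A sketch of what the consolidated check might look like. Only the function name _validate_accelerator and the message text come from this PR; the options structure is an assumption:)

```python
def _validate_accelerator(options: dict) -> None:
    """Reject remote options that configure more than one accelerator kind."""
    resources = options.get("resources") or {}
    configured = [
        name
        for name, value in [
            ("num_gpus", options.get("num_gpus")),
            ("neuron_cores", resources.get("neuron_cores")),
            ("TPU", resources.get("TPU")),
        ]
        if value
    ]
    if len(configured) > 1:
        raise ValueError(
            "Only one of 'num_gpus', "
            "'neuron_cores/accelerator_type:aws-neuron-core' and "
            "'TPU/accelerator_type:TPU-V*' can be set. "
            f"Detected: {', '.join(configured)}."
        )

# Example: this would raise, since two accelerator kinds are set.
# _validate_accelerator({"num_gpus": 1, "resources": {"TPU": 4}})
```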

Comment on lines +477 to +479
global last_set_tpu_chips
if last_set_tpu_chips == tpu_chips:
    return  # optimization: already set
Contributor:

What are we optimizing? If it's just setting env vars, I'm not sure it's worth the additional complexity of making it stateful with a global variable

Contributor (Author):

I don't disagree, but I will note I was copying the previous examples that Neuron cores and GPU both used.

Contributor:

Gotcha! I'm curious why it's there but no need to change it in this PR
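(For readers following along, a sketch of the pattern under discussion; the function name is hypothetical and mirrors the GPU/neuron-core setters:)

```python
import os

last_set_tpu_chips = None  # module-level cache of the last value written

def set_tpu_visible_chips(tpu_chips):  # hypothetical name
    """Expose only the given TPU chips to this worker via TPU_VISIBLE_CHIPS."""
    global last_set_tpu_chips
    if last_set_tpu_chips == tpu_chips:
        return  # optimization: already set
    os.environ["TPU_VISIBLE_CHIPS"] = ",".join(str(chip) for chip in tpu_chips)
    last_set_tpu_chips = tpu_chips
```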

@@ -690,7 +690,7 @@ RAY_CONFIG(std::string, predefined_unit_instance_resources, "GPU")
/// The scheduler will treat these custom resource types as unit_instance.
/// Default custom_unit_instance_resources is "neuron_cores".
Contributor:

Suggested change:
- /// Default custom_unit_instance_resources is "neuron_cores".
+ /// Default custom_unit_instance_resources is "neuron_cores,TPU".

Is this right? Feel free to rewrite the whole comment if you understand it; it's kind of confusing to me at the moment.

Contributor (Author):

Great point, I've rewritten the comment to match my understanding.

@architkulkarni (Contributor)

Assigning @scv119 as Ray Core codeowner and someone who reviewed the AWS neuron core PR.

@jjyao self-assigned this Aug 29, 2023
@jjyao (Collaborator) commented Aug 29, 2023

Hi @allenwang28,

Thanks for the contribution. Given that multiple teams are trying to add support for different accelerators to Ray, we'd like to come up with a design first for how to support them in a unified way; that task is on me (#38504). Do you mind waiting until the design is done? I'll start working on it in 2-3 weeks, after finishing the Ray 2.7 release. Does this timeline work for you?

@allenwang28 (Contributor, Author)

Hey @jjyao - I understand, thanks for the heads up. @richardliaw is actually announcing this feature today, so I'm a bit worried about a post-2.7 timeline. I'm wondering if we can push this version through in time for the 2.7 release? But I can definitely help out with the refactor after the design if you need an extra set of hands!

Signed-off-by: allenwang28 <allencwang@google.com>
@architkulkarni (Contributor)

Thanks, the update looks great!

> Hey @jjyao - I understand, thanks for the heads up. @richardliaw is actually announcing this feature today, so I'm a bit worried about a post-2.7 timeline. I'm wondering if we can push this version through in time for the 2.7 release? But I can definitely help out with the refactor after the design if you need an extra set of hands!

I'm not familiar with the announcement, unfortunately. @richardliaw, can you confirm whether or not this PR should be included in Ray 2.7?

@ericl (Contributor) left a comment


Approving for API (not a new API anymore).

@ericl removed the core-interface-change-approval-required label Sep 5, 2023
@richardliaw (Contributor)

Hey @architkulkarni, looks like approvals are done; can you take a look at tests and merge if tests-ok?

Signed-off-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
@architkulkarni (Contributor)

Failed tests:

  • :kubernetes: chaos network delay many job submissions (unrelated)
  • doc build runtime_context.py errors (unrelated)
  • test_client_builder (unrelated)
  • py38 build failure (unrelated)

@architkulkarni added the tests-ok label (the tagger certifies test failures are unrelated and assumes personal liability) Sep 6, 2023
@architkulkarni merged commit 701a652 into ray-project:master Sep 6, 2023
2 checks passed
architkulkarni added a commit to architkulkarni/ray that referenced this pull request Sep 6, 2023
…t#38669)

The issue below has more details, but at a high level this change addresses the feature request of adding in TPUs as a native resource within Ray.

---------

Signed-off-by: allenwang28 <allencwang@google.com>
Signed-off-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
@matthewdeng (Contributor)

This broke the documentation build. @allenwang28 @architkulkarni, could you take a look?

/home/docs/checkouts/readthedocs.org/user_builds/anyscale-ray/checkouts/38669/python/ray/runtime_context.py:docstring of ray.runtime_context.RuntimeContext.get_resource_ids:5: WARNING: Unexpected indentation.
/home/docs/checkouts/readthedocs.org/user_builds/anyscale-ray/checkouts/38669/python/ray/runtime_context.py:docstring of ray.runtime_context.RuntimeContext.get_resource_ids:3: WARNING: Inline interpreted text or phrase reference start-string without end-string.

@architkulkarni (Contributor)

I'll revert this PR and open a new one with the fix.

GeneDer pushed a commit that referenced this pull request Sep 7, 2023
…39352)

* [Ray Core] Adds in Google Cloud TPUs as a native Resource (#38669)

The issue below has more details, but at a high level this change addresses the feature request of adding in TPUs as a native resource within Ray.

---------

Signed-off-by: allenwang28 <allencwang@google.com>
Signed-off-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>

* Fix docstring

Signed-off-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>

---------

Signed-off-by: allenwang28 <allencwang@google.com>
Signed-off-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
Co-authored-by: Allen Wang <allencwang@google.com>
jimthompson5802 pushed a commit to jimthompson5802/ray that referenced this pull request Sep 12, 2023
…t#38669)

The issue below has more details, but at a high level this change addresses the feature request of adding in TPUs as a native resource within Ray.

---------

Signed-off-by: allenwang28 <allencwang@google.com>
Signed-off-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
Signed-off-by: Jim Thompson <jimthompson5802@gmail.com>
vymao pushed a commit to vymao/ray that referenced this pull request Oct 11, 2023
…t#38669)

The issue below has more details, but at a high level this change addresses the feature request of adding in TPUs as a native resource within Ray.

---------

Signed-off-by: allenwang28 <allencwang@google.com>
Signed-off-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
Signed-off-by: Victor <vctr.y.m@example.com>