
[Ray Core] Adds in Google Cloud TPUs as a native Resource #38669

Merged

merged 23 commits into ray-project:master on Sep 6, 2023

Conversation

@allenwang28 (Contributor) commented Aug 21, 2023

Why are these changes needed?

The linked issue has more details, but at a high level this change addresses the feature request of adding TPUs as a native resource within Ray.

For users, these changes are intended to shift from:

@ray.remote(resources={"TPU": 1})
def my_function():
    ...

to

@ray.remote(num_tpus=4)
def my_function():
    ...

Since we're adding TPUs as a native resource, this gives users the added ability to access individual TPU chips in a TPU host, e.g. the following would be valid:

@ray.remote(num_tpus=1)  # or 2
def my_function():
    import jax
    print(jax.device_count())  # -> 1 (or 2)

This is enabled by setting environment variables as described in jax-ml/jax#14977.
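As a rough illustration, here's a minimal sketch of that mechanism. TPU_VISIBLE_CHIPS shows up in this PR's commit messages; the exact value format and the import-ordering requirement below are assumptions based on jax-ml/jax#14977, not this PR's literal code:

```python
import os

# Restrict this worker process to a single TPU chip before JAX initializes.
os.environ["TPU_VISIBLE_CHIPS"] = "0"

import jax  # noqa: E402 -- JAX reads the TPU topology env vars at startup

print(jax.device_count())  # -> 1
```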

Short overview of changes:

  • Adds autodetection of the number of TPU chips and the TPU version (+ tests for each)
  • Adds num_tpus to ray_params and does input sanitization mirroring how GPUs work (e.g. num_tpus instead of TPU in resources)
  • Adds several constants
  • Since there are many changes related to neuron cores (from Auto-detection of accelerator_type for aws_accelerators trn1_inf #37998), I consolidated the neuron-core-specific logic with the TPU logic where applicable, for instance changing _validate_neuron_core_accelerator -> _validate_accelerator in ray_option_utils.py
  • Adds a lot of tests!

Related issue number

#38085

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: allenwang28 <allencwang@google.com>

…cally setting TPU_VISIBLE_CHIPS, test cases and other changes around Ray to make sure we're doing the same things as other accelerators. Still need to test more next.

…some final touch ups before sending out for review...
@allenwang28 marked this pull request as ready for review August 22, 2023 18:51
@architkulkarni self-assigned this Aug 22, 2023
@architkulkarni added the core-interface-change-approval-required label (this changes the Ray core behavior / API and requires broader approvals) Aug 22, 2023
@architkulkarni (Contributor)

@richardliaw @scv119 Do you approve the API change? I'll review the PR, perhaps you can suggest other reviewers too.

@ericl (Contributor) commented Aug 24, 2023

I don't believe we are intending to add vendor-specific resources directly to the Ray API; the pattern followed should be similar to #37998 for all new accelerator types.

@allenwang28 (Contributor, Author)

> I don't believe we are intending to add vendor-specific resources directly to the Ray API; the pattern followed should be similar to #37998 for all new accelerator types.

Thanks - my mistake. I'll pull out the API specific changes and align with the neuron_cores approach.

Signed-off-by: allenwang28 <allencwang@google.com>
@allenwang28 (Contributor, Author)

I've updated the PR to remove num_tpus, PTAL!
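(For reference, a sketch of what usage looks like after this update. This is hedged: the exact semantics follow the neuron_cores pattern from #37998, and the accepted accelerator_type values are an assumption based on this PR's validation strings.)

```python
import ray

# Request TPU chips via the custom "TPU" resource:
@ray.remote(resources={"TPU": 4})
def my_function():
    ...

# Optionally pin to a TPU generation via accelerator_type:
@ray.remote(resources={"TPU": 4}, accelerator_type="TPU-V4")
def my_v4_function():
    ...
```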

@architkulkarni (Contributor) left a comment


Looks good to me! Some nits around maintenance/documentation in case someone less familiar with TPUs has to update this code in the future, but these don't need to block the PR in my opinion.

        numeric_entries = [int(entry) for entry in vfio_entries if entry.isdigit()]
        return len(numeric_entries)
    except Exception:
        logging.info("Failed to detect number of TPUs.")
Contributor:

  • What specific exception are we trying to catch here?
  • Does the original exception get printed automatically? If not, we should probably log it

Contributor (Author):

Added FileNotFoundError!
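(For readers, a minimal sketch of the revised shape; the helper name and the directory being listed are assumptions, not the PR's literal code:)

```python
import logging
import os

def _autodetect_num_tpus() -> int:  # hypothetical helper name
    """Attempt to detect the number of TPU chips from /dev/vfio entries."""
    try:
        vfio_entries = os.listdir("/dev/vfio")
        numeric_entries = [int(entry) for entry in vfio_entries if entry.isdigit()]
        return len(numeric_entries)
    except FileNotFoundError as e:
        # /dev/vfio does not exist on hosts without TPUs attached.
        logging.info("Failed to detect number of TPUs: %s", e)
        return 0
```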

(i.e. a GCE VM), vs. through GKE:
- GCE VMs will always have a metadata server to poll this info
- GKE VMs will have environment variables preset.
Contributor:

Is there google documentation we can link to as a reference?

Contributor (Author):

Unfortunately not at this moment; this is all mostly internal knowledge related to GKE. I'm not sure whether it will be publicized.

"""

def accelerator_type_to_version(accelerator_type: str) -> str:
return "TPU-" + str(accelerator_type.split("-")[0]).upper()
Contributor:

Is it worth checking that accelerator-type contains a "-"?

Contributor (Author):

Done. I moved over some logic we used in config.py within the GCP autoscaler to validate the accelerator type (and tested that both succeed).
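(For context, a quick usage sketch of the helper shown above, assuming GCE-style accelerator-type strings such as "v4-8":)

```python
def accelerator_type_to_version(accelerator_type: str) -> str:
    return "TPU-" + str(accelerator_type.split("-")[0]).upper()

assert accelerator_type_to_version("v4-8") == "TPU-V4"
assert accelerator_type_to_version("v2-32") == "TPU-V2"
```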

Comment on lines +489 to +493
RAY_GCE_TPU_ACCELERATOR_ENDPOINT = (
"http://metadata.google.internal/computeMetadata/"
"v1/instance/attributes/accelerator-type"
)
RAY_GCE_TPU_HEADERS = {"Metadata-Flavor": "Google"}
Contributor:

Ideally we could link to documentation for this too in case someone needs to update this later

Contributor (Author):

Done
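(A sketch of how these constants could be used to poll the GCE metadata server; the choice of HTTP client here is an assumption, not necessarily what the PR uses:)

```python
import requests

RAY_GCE_TPU_ACCELERATOR_ENDPOINT = (
    "http://metadata.google.internal/computeMetadata/"
    "v1/instance/attributes/accelerator-type"
)
RAY_GCE_TPU_HEADERS = {"Metadata-Flavor": "Google"}

# Only reachable from inside a GCE VM; returns e.g. "v4-8" on a TPU VM.
resp = requests.get(
    RAY_GCE_TPU_ACCELERATOR_ENDPOINT, headers=RAY_GCE_TPU_HEADERS, timeout=2
)
accelerator_type = resp.text
```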

"""Attempt to detect the number of TPUs on this machine.

TPU chips are represented as devices within `/dev/`, either as
`/dev/accel*` or `/dev/vfio/*`.
Contributor:

Link to documentation?

Contributor (Author):

All internal knowledge unfortunately :(
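(The other detection path named in the docstring, sketched with glob; the exact fallback order between the two paths is in the PR, not shown here:)

```python
import glob

# TPU chips may also be exposed as /dev/accel* device files.
num_tpu_chips = len(glob.glob("/dev/accel*"))
```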


Returns:
    A string representing the TPU version,
    e.g. "V2", "V3", "V4" if applicable, else None.
Contributor:

Suggested change:
- e.g. "V2", "V3", "V4" if applicable, else None.
+ e.g. "TPU-V2", "TPU-V3", "TPU-V4" if applicable, else None.

Contributor (Author):

Done

raise ValueError(
    "Only one of 'num_gpus', 'neuron_cores/accelerator_type:aws-neuron-core' "
    "and 'TPU/accelerator_type:TPU-V*' can be set. "
    f"Detected {num_configured_accelerators} "
Contributor:

Consider giving more details instead of just num_configured_accelerators, for example listing which of the options were detected

Contributor (Author):

Done
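(A sketch of what the consolidated check might look like. Only the function name _validate_accelerator and the message text come from this PR; the options structure is an assumption:)

```python
def _validate_accelerator(options: dict) -> None:
    """Reject remote options that configure more than one accelerator kind."""
    resources = options.get("resources") or {}
    configured = [
        name
        for name, value in [
            ("num_gpus", options.get("num_gpus")),
            ("neuron_cores", resources.get("neuron_cores")),
            ("TPU", resources.get("TPU")),
        ]
        if value
    ]
    if len(configured) > 1:
        raise ValueError(
            "Only one of 'num_gpus', "
            "'neuron_cores/accelerator_type:aws-neuron-core' and "
            "'TPU/accelerator_type:TPU-V*' can be set. "
            f"Detected: {', '.join(configured)}."
        )

# Example: this would raise, since two accelerator kinds are set.
# _validate_accelerator({"num_gpus": 1, "resources": {"TPU": 4}})
```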

Comment on lines +477 to +479
global last_set_tpu_chips
if last_set_tpu_chips == tpu_chips:
    return  # optimization: already set
Contributor:

What are we optimizing? If it's just setting env vars, I'm not sure it's worth the additional complexity of making it stateful with a global variable

Contributor (Author):

I don't disagree, but I will note I was copying the previous examples that Neuron cores and GPU both used.

Contributor:

Gotcha! I'm curious why it's there but no need to change it in this PR
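(For readers following along, a sketch of the pattern under discussion; the function name is hypothetical and mirrors the GPU/neuron-core setters:)

```python
import os

last_set_tpu_chips = None  # module-level cache of the last value written

def set_tpu_visible_chips(tpu_chips):  # hypothetical name
    """Expose only the given TPU chips to this worker via TPU_VISIBLE_CHIPS."""
    global last_set_tpu_chips
    if last_set_tpu_chips == tpu_chips:
        return  # optimization: already set
    os.environ["TPU_VISIBLE_CHIPS"] = ",".join(str(chip) for chip in tpu_chips)
    last_set_tpu_chips = tpu_chips
```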

@@ -690,7 +690,7 @@ RAY_CONFIG(std::string, predefined_unit_instance_resources, "GPU")
/// The scheduler will treat these custom resource types as unit_instance.
/// Default custom_unit_instance_resources is "neuron_cores".
Contributor:

Suggested change:
- /// Default custom_unit_instance_resources is "neuron_cores".
+ /// Default custom_unit_instance_resources is "neuron_cores,TPU".

Is this right? Feel free to rewrite the whole comment if you understand it; it's kind of confusing to me at the moment.

Contributor (Author):

Great point, I've rewritten the comment to match my understanding.

@architkulkarni (Contributor)

Assigning @scv119 as Ray Core codeowner and someone who reviewed the AWS neuron core PR.

@jjyao self-assigned this Aug 29, 2023
@jjyao (Collaborator) commented Aug 29, 2023

Hi @allenwang28,

Thanks for the contribution. Given that multiple teams are trying to add support for different accelerators to Ray, we'd like to come up with a design first for how to support them in a unified way; that task is on me (#38504). Do you mind waiting until the design is done? I'll start working on it in 2-3 weeks, after finishing the Ray 2.7 release. Does this timeline work for you?

@allenwang28 (Contributor, Author)

Hey @jjyao - I understand, thanks for the heads up. @richardliaw is actually announcing this feature today, so I'm a bit worried about a post-2.7 timeline. I'm wondering if we can push this version through in time for the 2.7 release? But I can definitely help out with the refactor after the design if you need an extra set of hands!

Signed-off-by: allenwang28 <allencwang@google.com>
@architkulkarni (Contributor)

Thanks, the update looks great!

> Hey @jjyao - I understand, thanks for the heads up. @richardliaw is actually announcing this feature today, so I'm a bit worried about a post-2.7 timeline. I'm wondering if we can push this version through in time for the 2.7 release? But I can definitely help out with the refactor after the design if you need an extra set of hands!

I'm not familiar with the announcement, unfortunately. @richardliaw, can you confirm whether or not this PR should be included in Ray 2.7?

@ericl (Contributor) left a comment


Approving for API (not a new API anymore).

@ericl removed the core-interface-change-approval-required label Sep 5, 2023
@richardliaw (Contributor)

Hey @architkulkarni, looks like approvals are done; can you take a look at tests and merge if tests-ok?

Signed-off-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
@architkulkarni (Contributor)

Failed tests:

  • :kubernetes: chaos network delay many job submissions (unrelated)
  • doc build runtime_context.py errors (unrelated)
  • test_client_builder (unrelated)
  • py38 build failure (unrelated)

@architkulkarni added the tests-ok label (the tagger certifies test failures are unrelated and assumes personal liability) Sep 6, 2023
@architkulkarni merged commit 701a652 into ray-project:master Sep 6, 2023
2 checks passed
architkulkarni added a commit to architkulkarni/ray that referenced this pull request Sep 6, 2023
…t#38669)

The issue below has more details, but at a high level this change addresses the feature request of adding in TPUs as a native resource within Ray.

---------

Signed-off-by: allenwang28 <allencwang@google.com>
Signed-off-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
@matthewdeng (Contributor)

This broke the documentation build. @allenwang28 @architkulkarni, could you take a look?

/home/docs/checkouts/readthedocs.org/user_builds/anyscale-ray/checkouts/38669/python/ray/runtime_context.py:docstring of ray.runtime_context.RuntimeContext.get_resource_ids:5: WARNING: Unexpected indentation.
/home/docs/checkouts/readthedocs.org/user_builds/anyscale-ray/checkouts/38669/python/ray/runtime_context.py:docstring of ray.runtime_context.RuntimeContext.get_resource_ids:3: WARNING: Inline interpreted text or phrase reference start-string without end-string.

@architkulkarni (Contributor)

I'll revert this PR and open a new one with the fix.

GeneDer pushed a commit that referenced this pull request Sep 7, 2023
…39352)

* [Ray Core] Adds in Google Cloud TPUs as a native Resource (#38669)

The issue below has more details, but at a high level this change addresses the feature request of adding in TPUs as a native resource within Ray.

---------

Signed-off-by: allenwang28 <allencwang@google.com>
Signed-off-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>

* Fix docstring

Signed-off-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>

---------

Signed-off-by: allenwang28 <allencwang@google.com>
Signed-off-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
Co-authored-by: Allen Wang <allencwang@google.com>
jimthompson5802 pushed a commit to jimthompson5802/ray that referenced this pull request Sep 12, 2023
…t#38669)

The issue below has more details, but at a high level this change addresses the feature request of adding in TPUs as a native resource within Ray.

---------

Signed-off-by: allenwang28 <allencwang@google.com>
Signed-off-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
Signed-off-by: Jim Thompson <jimthompson5802@gmail.com>
vymao pushed a commit to vymao/ray that referenced this pull request Oct 11, 2023
…t#38669)

The issue below has more details, but at a high level this change addresses the feature request of adding in TPUs as a native resource within Ray.

---------

Signed-off-by: allenwang28 <allencwang@google.com>
Signed-off-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
Signed-off-by: Victor <vctr.y.m@example.com>