[Ray Core] Adds in Google Cloud TPUs as a native Resource #38669
Conversation
…cally setting TPU_VISIBLE_CHIPS, test cases and other changes around Ray to make sure we're doing the same things as other accelerators. Still need to test more next. Signed-off-by: allenwang28 <allencwang@google.com>
…some final touch ups before sending out for review... Signed-off-by: allenwang28 <allencwang@google.com>
@richardliaw @scv119 Do you approve the API change? I'll review the PR; perhaps you can suggest other reviewers too.
I don't believe we are intending to add vendor-specific resources directly to the Ray API. The pattern followed should be similar to #37998 for all new accelerator types.
Thanks - my mistake. I'll pull out the API-specific changes and align with the suggested pattern.
Signed-off-by: allenwang28 <allencwang@google.com>
I've updated the PR to remove the API-specific changes.
Looks good to me! Some nits around maintenance/documentation in case someone less familiar with TPU has to update this code in the future, but these don't need to block the PR in my opinion
python/ray/_private/accelerator.py
        numeric_entries = [int(entry) for entry in vfio_entries if entry.isdigit()]
        return len(numeric_entries)
    except Exception:
        logging.info("Failed to detect number of TPUs.")
- What specific exception are we trying to catch here?
- Does the original exception get printed automatically? If not, we should probably log it
Added FileNotFoundError!
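For readers following along, here is a minimal, self-contained sketch of the detection pattern discussed above (the function name and the early `/dev/accel*` check are illustrative assumptions, not the exact code in this PR):

```python
import glob
import logging
import os
from typing import Optional


def _count_tpu_chips() -> Optional[int]:
    """Illustrative sketch: TPU chips appear as /dev/accel* or /dev/vfio/* devices."""
    accel_entries = glob.glob("/dev/accel*")
    if accel_entries:
        return len(accel_entries)
    try:
        # On VFIO-based hosts each chip is a numeric entry under /dev/vfio/.
        vfio_entries = os.listdir("/dev/vfio")
        numeric_entries = [int(entry) for entry in vfio_entries if entry.isdigit()]
        return len(numeric_entries)
    except FileNotFoundError as e:
        # Log the original exception so the failure is debuggable.
        logging.info("Failed to detect number of TPUs: %s", e)
        return None
```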
    ```
    (i.e. a GCE VM), vs through GKE:
    - GCE VMs will always have a metadata server to poll this info.
    - GKE VMs will have environment variables preset.
Is there Google documentation we can link to as a reference?
Unfortunately not at this moment; this is all mostly internal knowledge related to GKE. I'm not sure if this will be publicized or not.
""" | ||
|
||
def accelerator_type_to_version(accelerator_type: str) -> str: | ||
return "TPU-" + str(accelerator_type.split("-")[0]).upper() |
Is it worth checking that accelerator-type contains a "-" ?
Done. I moved some logic we used from config.py within the GCP autoscaler to assert the accelerator type (and tested that both are successful).
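A hedged sketch of what such a format check could look like; the helper name and regex are illustrative assumptions rather than the exact code added in this PR:

```python
import re


def _assert_tpu_accelerator_type(accelerator_type: str) -> None:
    """Illustrative check that the string looks like a GCE TPU type, e.g. 'v4-8'."""
    if not re.fullmatch(r"v\d+[a-z]*-\d+", accelerator_type):
        raise ValueError(
            "accelerator_type should match 'v<generation>-<chips>', "
            f"got {accelerator_type!r}"
        )


def accelerator_type_to_version(accelerator_type: str) -> str:
    """Map a GCE accelerator type such as 'v4-8' to a Ray resource version, 'TPU-V4'."""
    _assert_tpu_accelerator_type(accelerator_type)
    return "TPU-" + accelerator_type.split("-")[0].upper()


# Example: accelerator_type_to_version("v4-8") -> "TPU-V4"
```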
RAY_GCE_TPU_ACCELERATOR_ENDPOINT = (
    "http://metadata.google.internal/computeMetadata/"
    "v1/instance/attributes/accelerator-type"
)
RAY_GCE_TPU_HEADERS = {"Metadata-Flavor": "Google"}
Ideally we could link to documentation for this too in case someone needs to update this later
Done
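For future maintainers, a minimal sketch of how these constants can be used, covering both detection paths described earlier (GCE metadata server vs. GKE environment variables). The GKE variable name is internal knowledge per the thread above, so `TPU_ACCELERATOR_TYPE` below is a placeholder, not the real name:

```python
import os
from typing import Optional

import requests

# Endpoint and header mirror the constants in the diff above.
RAY_GCE_TPU_ACCELERATOR_ENDPOINT = (
    "http://metadata.google.internal/computeMetadata/"
    "v1/instance/attributes/accelerator-type"
)
RAY_GCE_TPU_HEADERS = {"Metadata-Flavor": "Google"}
GKE_TPU_ACCELERATOR_TYPE_ENV = "TPU_ACCELERATOR_TYPE"  # hypothetical placeholder


def detect_tpu_accelerator_type() -> Optional[str]:
    """Sketch: read the GKE-style env var if preset, else poll the GCE metadata server."""
    env_value = os.environ.get(GKE_TPU_ACCELERATOR_TYPE_ENV)
    if env_value:
        return env_value
    try:
        resp = requests.get(
            RAY_GCE_TPU_ACCELERATOR_ENDPOINT, headers=RAY_GCE_TPU_HEADERS, timeout=2
        )
        if resp.status_code == requests.codes.ok:
            return resp.text.strip()  # e.g. "v4-8"
    except requests.RequestException:
        pass
    return None
```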
"""Attempt to detect the number of TPUs on this machine. | ||
|
||
TPU chips are represented as devices within `/dev/`, either as | ||
`/dev/accel*` or `/dev/vfio/*`. |
Link to documentation?
All internal knowledge unfortunately :(
python/ray/_private/accelerator.py
Returns:
    A string representing the TPU version,
    e.g. "V2", "V3", "V4" if applicable, else None.
e.g. "V2", "V3", "V4" if applicable, else None. | |
e.g. "TPU-V2", "TPU-V3", "TPU-V4" if applicable, else None. |
Done
raise ValueError(
    "Only one of 'num_gpus', 'neuron_cores/accelerator_type:aws-neuron-core' "
    "and 'TPU/accelerator_type:TPU-V*' can be set. "
    f"Detected {num_configured_accelerators} "
Consider giving more details instead of just num_configured_accelerators, for example listing which of the options were detected
Done
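A rough sketch of the more descriptive error suggested here; the option keys and helper name are illustrative, not the exact implementation in `ray_option_utils.py`:

```python
from typing import Dict, Optional


def _validate_accelerator(options: Dict[str, Optional[float]]) -> None:
    """Sketch: report which accelerator options were set, not just how many."""
    # Hypothetical option keys for illustration.
    detected = [
        name for name in ("num_gpus", "neuron_cores", "TPU") if options.get(name)
    ]
    if len(detected) > 1:
        raise ValueError(
            "Only one of 'num_gpus', 'neuron_cores/accelerator_type:aws-neuron-core' "
            "and 'TPU/accelerator_type:TPU-V*' can be set. "
            f"Detected {len(detected)}: {detected}."
        )
```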
global last_set_tpu_chips
if last_set_tpu_chips == tpu_chips:
    return  # optimization: already set
What are we optimizing? If it's just setting env vars, I'm not sure it's worth the additional complexity of making it stateful with a global variable
I don't disagree, but I will note I was copying the previous examples that Neuron cores and GPU both used.
Gotcha! I'm curious why it's there but no need to change it in this PR
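For context, a minimal sketch of the stateful pattern being discussed, mirroring how CUDA_VISIBLE_DEVICES and Neuron cores are handled. Names are illustrative; the commit messages in this PR indicate the actual environment variable is TPU_VISIBLE_CHIPS:

```python
import os
from typing import List, Optional

# Module-level cache of the last value set, so repeated calls become no-ops.
last_set_tpu_chips: Optional[List[str]] = None


def set_tpu_visible_chips(tpu_chips: List[str]) -> None:
    """Sketch: expose only the assigned TPU chips to the worker via TPU_VISIBLE_CHIPS."""
    global last_set_tpu_chips
    if last_set_tpu_chips == tpu_chips:
        return  # optimization: already set
    os.environ["TPU_VISIBLE_CHIPS"] = ",".join(tpu_chips)
    last_set_tpu_chips = tpu_chips
```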
src/ray/common/ray_config_def.h
@@ -690,7 +690,7 @@ RAY_CONFIG(std::string, predefined_unit_instance_resources, "GPU")
/// The scheduler will treat these custom resource types as unit_instance.
/// Default custom_unit_instance_resources is "neuron_cores".
- /// Default custom_unit_instance_resources is "neuron_cores".
+ /// Default custom_unit_instance_resources is "neuron_cores,TPU".
Is this right? Feel free to rewrite the whole comment if you understand it; it's kind of confusing to me at the moment.
Great point, I've rewritten the comment to match my understanding
Assigning @scv119 as Ray Core codeowner and someone who reviewed the AWS Neuron core PR.
Hi @allenwang28, thanks for the contribution. Given that multiple teams are trying to add support for different accelerators to Ray, we'd like to come up with a design first to see how we should support them in a unified way, and that task is on me (#38504). Do you mind waiting until the design is done? I'll start to work on it in 2-3 weeks after finishing the Ray 2.7 release. Does this timeline work for you?
Hey @jjyao - I understand, thanks for the heads up. @richardliaw is actually announcing this feature today, so I'm a bit worried about a post-2.7 timeline. I'm wondering if we can push this version through in time for the 2.7 release? But I can definitely help out with the refactor after the design if you need an extra set of hands!
Signed-off-by: allenwang28 <allencwang@google.com>
Thanks, the update looks great!
I'm not familiar with the announcement, unfortunately. @richardliaw, can you confirm whether or not this PR should be included in Ray 2.7?
Approving for API (not a new API any more).
Hey @architkulkarni, looks like approvals are done; can you take a look at tests and merge if tests are OK?
Signed-off-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
Failed tests:
…t#38669) The issue below has more details, but at a high level this change addresses the feature request of adding in TPUs as a native resource within Ray. --------- Signed-off-by: allenwang28 <allencwang@google.com> Signed-off-by: Archit Kulkarni <architkulkarni@users.noreply.github.com> Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
This broke the documentation build - @allenwang28 @architkulkarni could you take a look?
I'll revert this PR and open a new one with the fix.
…39352) * [Ray Core] Adds in Google Cloud TPUs as a native Resource (#38669) The issue below has more details, but at a high level this change addresses the feature request of adding in TPUs as a native resource within Ray. --------- Signed-off-by: allenwang28 <allencwang@google.com> Signed-off-by: Archit Kulkarni <architkulkarni@users.noreply.github.com> Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com> * Fix docstring Signed-off-by: Archit Kulkarni <architkulkarni@users.noreply.github.com> --------- Signed-off-by: allenwang28 <allencwang@google.com> Signed-off-by: Archit Kulkarni <architkulkarni@users.noreply.github.com> Co-authored-by: Allen Wang <allencwang@google.com>
Why are these changes needed?
The issue below has more details, but at a high level this change addresses the feature request of adding in TPUs as a native resource within Ray.
To a user, these changes intend to shift from:
to
Since we're adding TPUs as a native resource, this gives users the added ability to access individual TPU chips in a TPU host, e.g. the following would be valid:
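The original snippets aren't reproduced on this page; as a hedged illustration (not the PR's exact example) of requesting a subset of a host's TPU chips once TPU is a native resource:

```python
import ray

ray.init()


@ray.remote(resources={"TPU": 4})
def partial_host_task() -> str:
    import os

    # The task is scheduled on a node advertising at least 4 TPU chips; per this
    # PR, visibility is restricted to the assigned chips via TPU_VISIBLE_CHIPS.
    return os.environ.get("TPU_VISIBLE_CHIPS", "")


print(ray.get(partial_host_task.remote()))
```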
This is enabled by setting environment variables as described in jax-ml/jax#14977.
Short overview of changes:
- Adds `num_tpus` in `ray_params` and does input sanitization similar to how GPUs function (e.g. `num_tpus` instead of `TPU` in resources)
- Renames `_validate_neuron_core_accelerator` -> `_validate_accelerator` in `ray_option_utils.py`
Related issue number
#38085
Checks
- I've signed off every commit (`git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I've added a new method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.