Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Telemetry] Inject env identifying KubeRay. #562

Merged

Conversation

DmitriGekhtman
Copy link
Collaborator

Signed-off-by: Dmitri Gekhtman dmitri.m.gekhtman@gmail.com

Why are these changes needed?

This change has the KubeRay operator inject an env var identifying a Ray container as having been deployed using KubeRay.
The purpose is to refine Ray usage stats collected by Ray's telemetry service (https://docs.ray.io/en/latest/cluster/usage-stats.html).

See ray-project/ray#28490

Related issue number

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

Signed-off-by: Dmitri Gekhtman <dmitri.m.gekhtman@gmail.com>
@DmitriGekhtman
Copy link
Collaborator Author

I'll add an e2e test for this in the Ray CI.

@DmitriGekhtman DmitriGekhtman merged commit 38ac168 into ray-project:master Sep 14, 2022
DmitriGekhtman added a commit to ray-project/ray that referenced this pull request Sep 14, 2022
)

Right now, Ray telemetry indicates the majority of Ray's CPU hour usage comes from Ray running within a Kubernetes cluster. However, we have no data on what method is used to deploy Ray on Kubernetes.

This PR enables Ray telemetry to distinguish between three methods of deploying Ray on Kubernetes:

KubeRay >= 0.4.0
Legacy Ray Operator with Ray >= 2.1.0
All other methods
The strategy is to have the operators inject an env variable into the Ray container's environment.
The variable identifies the deployment method.

This PR also modifies the legacy Ray operator to inject the relevant env variable.
A follow-up KubeRay PR changes the KubeRay operator to do the same thing: ray-project/kuberay#562

Signed-off-by: Dmitri Gekhtman <dmitri.m.gekhtman@gmail.com>
ArturNiederfahrenhorst added a commit to ray-project/ray that referenced this pull request Sep 15, 2022
…tests (#28535)

* [core/ci] Disallow protobuf 3.19.5 (#28504)

This leads to hangs in Ray client (e.g. test_dataclient_disconnect)

Signed-off-by: Kai Fricke <kai@anyscale.com>

* [tune] Fix trial checkpoint syncing after recovery from other node (#28470)

On restore from a different IP, the SyncerCallback currently still tries to sync from a stale node IP, because `trial.last_result` has not been updated, yet. Instead, the syncer callback should keep its own map of trials to IPs, and only act on this.

Signed-off-by: Kai Fricke <kai@anyscale.com>

* [air] minor example fix. (#28379)

Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>

* [cleanup] Remove memory unit conversion (#28396)

The internal memory unit was switched back to bytes years ago, there's no point in keeping confusing conversion code around anymore.

Recommendation: Review #28394 first, since this is stacked on top of it.

Co-authored-by: Alex <alex@anyscale.com>

* [RLlib] Sync policy specs from local_worker_for_synching while recovering rollout/eval workers. (#28422)

* Cast rewards as tf.float32 to fix error in DQN in tf2 (#28384)

* Cast rewards as tf.float32 to fix error in DQN in tf2

Signed-off-by: mgerstgrasser <matthias@gerstgrasser.net>

* Add test case for DQN with integer rewards

Signed-off-by: mgerstgrasser <matthias@gerstgrasser.net>

Signed-off-by: mgerstgrasser <matthias@gerstgrasser.net>

* [doc] [Datasets] Improve docstring and doctest for read_parquet (#28488)

This addresses some of the issues brought up in #28484

* [ci] Increase timeout on test_metrics (#28508)

10 milliseconds is ambitious for the CI to do anything.

Co-authored-by: Alex <alex@anyscale.com>

* [air/tune] Catch empty hyperopt search space, raise better Tuner error message (#28503)

* Add imports to object-spilling.rst Python code (#28507)

* Add imports to object-spilling.rst Python code

Also adjust a couple descriptions, retaining the same general information

Signed-off-by: Jake <DevJake@users.noreply.github.com>

* fix doc build / keep note formatting

Signed-off-by: Philipp Moritz <pcmoritz@gmail.com>

* another tiny fix

Signed-off-by: Philipp Moritz <pcmoritz@gmail.com>

Signed-off-by: Jake <DevJake@users.noreply.github.com>
Signed-off-by: Philipp Moritz <pcmoritz@gmail.com>
Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>

* [AIR] Make PathPartitionScheme a dataclass (#28390)

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>

* [Telemetry][Kuberentes] Distinguish Kubernetes deployment stacks (#28490)

Right now, Ray telemetry indicates the majority of Ray's CPU hour usage comes from Ray running within a Kubernetes cluster. However, we have no data on what method is used to deploy Ray on Kubernetes.

This PR enables Ray telemetry to distinguish between three methods of deploying Ray on Kubernetes:

KubeRay >= 0.4.0
Legacy Ray Operator with Ray >= 2.1.0
All other methods
The strategy is to have the operators inject an env variable into the Ray container's environment.
The variable identifies the deployment method.

This PR also modifies the legacy Ray operator to inject the relevant env variable.
A follow-up KubeRay PR changes the KubeRay operator to do the same thing: ray-project/kuberay#562

Signed-off-by: Dmitri Gekhtman <dmitri.m.gekhtman@gmail.com>

* [autoscaler][observability] Experimental verbose mode (#28392)

This PR introduces a super secret hidden verbose mode for ray status, which we can keep hidden while collecting feedback before going through the process of officially declaring it part of the public API.

Example output

======== Autoscaler status: 2020-12-28 01:02:03 ========
GCS request time: 3.141500s
Node Provider non_terminated_nodes time: 1.618000s

Node status
--------------------------------------------------------
Healthy:
 2 p3.2xlarge
 20 m4.4xlarge
Pending:
 m4.4xlarge, 2 launching
 1.2.3.4: m4.4xlarge, waiting-for-ssh
 1.2.3.5: m4.4xlarge, waiting-for-ssh
Recent failures:
 p3.2xlarge: RayletUnexpectedlyDied (ip: 1.2.3.6)

Resources
--------------------------------------------------------
Total Usage:
 1/2 AcceleratorType:V100
 530.0/544.0 CPU
 2/2 GPU
 2.00/8.000 GiB memory
 3.14/16.000 GiB object_store_memory

Total Demands:
 {'CPU': 1}: 150+ pending tasks/actors
 {'CPU': 4} * 5 (PACK): 420+ pending placement groups
 {'CPU': 16}: 100+ from request_resources()

Node: 192.168.1.1
 Usage:
  0.1/1 AcceleratorType:V100
  5.0/20.0 CPU
  0.7/1 GPU
  1.00/4.000 GiB memory
  3.14/4.000 GiB object_store_memory

Node: 192.168.1.2
 Usage:
  0.9/1 AcceleratorType:V100
  15.0/20.0 CPU
  0.3/1 GPU
  1.00/12.000 GiB memory
  0.00/4.000 GiB object_store_memory

Co-authored-by: Alex <alex@anyscale.com>

* [doc/tune] fix tune stopper attribute name (#28517)

* [doc] Fix tune stopper doctests (#28531)

* [air] Use self-hosted mirror for CIFAR10 dataset (#28480)

The CIFAR10 website host has been unreliable in the past. This PR injects our own mirror into our CI packages for testing.

Signed-off-by: Kai Fricke <kai@anyscale.com>

* draft

Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
Signed-off-by: mgerstgrasser <matthias@gerstgrasser.net>
Signed-off-by: Jake <DevJake@users.noreply.github.com>
Signed-off-by: Philipp Moritz <pcmoritz@gmail.com>
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: Dmitri Gekhtman <dmitri.m.gekhtman@gmail.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>
Co-authored-by: xwjiang2010 <87673679+xwjiang2010@users.noreply.github.com>
Co-authored-by: Alex Wu <alex@anyscale.io>
Co-authored-by: Alex <alex@anyscale.com>
Co-authored-by: Jun Gong <jungong@anyscale.com>
Co-authored-by: mgerstgrasser <matthias@gerstgrasser.net>
Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
Co-authored-by: Jake <DevJake@users.noreply.github.com>
Co-authored-by: Balaji Veeramani <bveeramani@berkeley.edu>
Co-authored-by: Dmitri Gekhtman <62982571+DmitriGekhtman@users.noreply.github.com>
Co-authored-by: Árpád Rózsás <rozsasarpi@gmail.com>
PaulFenton pushed a commit to PaulFenton/ray that referenced this pull request Sep 19, 2022
…-project#28490)

Right now, Ray telemetry indicates the majority of Ray's CPU hour usage comes from Ray running within a Kubernetes cluster. However, we have no data on what method is used to deploy Ray on Kubernetes.

This PR enables Ray telemetry to distinguish between three methods of deploying Ray on Kubernetes:

KubeRay >= 0.4.0
Legacy Ray Operator with Ray >= 2.1.0
All other methods
The strategy is to have the operators inject an env variable into the Ray container's environment.
The variable identifies the deployment method.

This PR also modifies the legacy Ray operator to inject the relevant env variable.
A follow-up KubeRay PR changes the KubeRay operator to do the same thing: ray-project/kuberay#562

Signed-off-by: Dmitri Gekhtman <dmitri.m.gekhtman@gmail.com>
Signed-off-by: PaulFenton <paul@fentonsoftware.com>
@DmitriGekhtman DmitriGekhtman deleted the dmitri/telemetry-env branch September 22, 2022 13:33
lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request Sep 24, 2023
Signed-off-by: Dmitri Gekhtman <dmitri.m.gekhtman@gmail.com>

This change has the KubeRay operator inject an env var identifying a Ray container as having been deployed using KubeRay.
The purpose is to refine Ray usage stats collected by Ray's telemetry service (https://docs.ray.io/en/latest/cluster/usage-stats.html).

See ray-project/ray#28490
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants