Starting in Ray 1.13, Ray collects usage stats data by default (guarded by an opt-out prompt). This data will be used by the open-source Ray engineering team to better understand how to improve our libraries and core APIs, and how to prioritize bug fixes and enhancements.
Here are the guiding principles of our collection policy:
- No surprises: you will be notified before we begin collecting data, and of any changes to the data being collected or how it is used.
- Easy opt-out: you will be able to easily opt out of data collection.
- Transparency: you will be able to review all data that is sent to us.
- Control: you will have control over your data, and we will honor requests to delete your data.
- We will not collect any personally identifiable data or proprietary code/data.
- We will not sell data or buy data about you.
You will always be able to :ref:`disable the usage stats collection <usage-disable>`.
For more context, please refer to this RFC.
We collect non-sensitive data that helps us understand how Ray is used (e.g., which Ray libraries are used). Personally identifiable data will never be collected. Please check the UsageStatsToReport class to see the data we collect.
There are multiple ways to disable usage stats collection before starting a cluster:
- Add the ``--disable-usage-stats`` option to the command that starts the Ray cluster (e.g., ``ray start --head --disable-usage-stats`` :ref:`command <ray-start-doc>`).
- Run :ref:`ray disable-usage-stats <ray-disable-usage-stats-doc>` to disable collection for all future clusters. This won't affect currently running clusters. Under the hood, this command writes ``{"usage_stats": false}`` to the global config file ``~/.ray/config.json``.
- Set the environment variable ``RAY_USAGE_STATS_ENABLED`` to 0 (e.g., ``RAY_USAGE_STATS_ENABLED=0 ray start --head`` :ref:`command <ray-start-doc>`).
- If you're using KubeRay, you can add ``disable-usage-stats: 'true'`` to ``.spec.[headGroupSpec|workerGroupSpecs].rayStartParams``.
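For launches scripted from Python, the environment-variable option above can be applied when spawning the ``ray start`` command. The following is a minimal sketch, assuming the ``ray`` CLI is available on ``PATH``; the helper names ``disabled_usage_stats_env`` and ``start_head_without_usage_stats`` are hypothetical and not part of Ray.

```python
import os
import subprocess


def disabled_usage_stats_env(base_env=None):
    """Return a copy of the environment with usage stats collection turned off."""
    env = dict(os.environ if base_env is None else base_env)
    # RAY_USAGE_STATS_ENABLED=0 disables collection for the cluster being started.
    env["RAY_USAGE_STATS_ENABLED"] = "0"
    return env


def start_head_without_usage_stats():
    """Start a head node with collection disabled (requires the ray CLI)."""
    # Equivalent to running: RAY_USAGE_STATS_ENABLED=0 ray start --head
    return subprocess.run(
        ["ray", "start", "--head"],
        env=disabled_usage_stats_env(),
        check=True,
    )
```

Because the variable is set only in the copied environment, the calling process and any clusters it starts later are unaffected.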
Currently there is no way to enable or disable collection for a running cluster; you have to stop and restart the cluster.
When a Ray cluster is started via :ref:`ray start --head <ray-start-doc>`, :ref:`ray up <ray-up-doc>`, :ref:`ray submit --start <ray-submit-doc>` or :ref:`ray exec --start <ray-exec-doc>`, Ray will decide whether usage stats collection should be enabled or not by considering the following factors in order:
- It checks whether the environment variable ``RAY_USAGE_STATS_ENABLED`` is set: 1 means enabled and 0 means disabled.
- If the environment variable is not set, it reads the value of key ``usage_stats`` in the global config file ``~/.ray/config.json``: true means enabled and false means disabled.
- If neither is set and the console is interactive, then the user will be prompted to enable or disable the collection. If the console is non-interactive, usage stats collection will be enabled by default. The decision will be saved to ``~/.ray/config.json``, so the prompt is only shown once.
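The order of precedence above can be sketched as a small function. This is an illustration only, not Ray's actual implementation: the function name ``resolve_usage_stats_enabled`` and its parameters are hypothetical, values of the environment variable other than 0 or 1 are simply ignored here, and the sketch assumes that only a prompted answer is written back to the config file.

```python
import json
from pathlib import Path


def resolve_usage_stats_enabled(environ, config_path, interactive, ask_user):
    """Sketch of the decision order; returns True if collection is enabled."""
    # 1. The environment variable takes precedence: "1" enables, "0" disables.
    value = environ.get("RAY_USAGE_STATS_ENABLED")
    if value == "1":
        return True
    if value == "0":
        return False

    # 2. Otherwise, consult the "usage_stats" key in the global config file.
    config_path = Path(config_path)
    config = {}
    if config_path.exists():
        loaded = json.loads(config_path.read_text())
        if isinstance(loaded, dict):
            config = loaded
    if isinstance(config.get("usage_stats"), bool):
        return config["usage_stats"]

    # 3. Neither is set: a non-interactive console defaults to enabled.
    if not interactive:
        return True

    # Interactive console: ask once and save the answer (assumption: only
    # the prompted decision is persisted), so the prompt is not shown again.
    enabled = bool(ask_user())
    config["usage_stats"] = enabled
    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(config))
    return enabled
```

For example, calling it with ``os.environ``, ``Path.home() / ".ray" / "config.json"``, and ``sys.stdin.isatty()`` would mirror the three checks for the current shell.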
Note: usage stats collection is not enabled when using local dev clusters started via ``ray.init()`` unless it's a nightly wheel. This means that Ray will never collect data from third-party library users not using Ray directly.
If usage stats collection is enabled, a background process on the head node will collect the usage stats and report to ``https://usage-stats.ray.io/`` every hour. The reported usage stats will also be saved to ``/tmp/ray/session_xxx/usage_stats.json`` on the head node for inspection. You can check the existence of this file to see if collection is enabled.
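The existence check can be scripted on the head node. A minimal sketch follows, assuming session directories sit directly under ``/tmp/ray`` and have names beginning with ``session_`` (the ``xxx`` part of the path above is a placeholder); the helper name ``find_usage_stats_files`` is hypothetical.

```python
from pathlib import Path


def find_usage_stats_files(temp_root="/tmp/ray"):
    """Return usage_stats.json files found in session directories under temp_root."""
    root = Path(temp_root)
    # Assumption: each session directory is named "session_<something>".
    return sorted(root.glob("session_*/usage_stats.json"))


if __name__ == "__main__":
    files = find_usage_stats_files()
    if files:
        for path in files:
            print(f"usage stats file found: {path}")
    else:
        print("no usage_stats.json found under /tmp/ray")
```

Note that a file left behind by an earlier session would also match the pattern, so the result is best read together with the session directory name.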
Usage stats collection is lightweight and should not affect your workload.
To request removal of collected data, please email us at usage_stats@ray.io with the ``session_id`` that you can find in ``/tmp/ray/session_xxx/usage_stats.json``.
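To pull the ``session_id`` out of the file without reading it by hand, something like the following could be used. It is a sketch that does not assume a particular layout of ``usage_stats.json``: it searches the parsed JSON for any key named ``session_id``, wherever it is nested. The function names are hypothetical.

```python
import json


def find_session_ids(node):
    """Collect every value stored under a "session_id" key in parsed JSON."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "session_id":
                found.append(value)
            else:
                # Keep searching nested objects and arrays.
                found.extend(find_session_ids(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_session_ids(item))
    return found


def read_session_ids(path):
    """Load a usage_stats.json file and return the session IDs it contains."""
    with open(path, encoding="utf-8") as f:
        return find_session_ids(json.load(f))
```

Pass the path of the file on the head node to ``read_session_ids`` and include the returned value in the deletion request.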
Does the session_id map to personal data?
No, the ``session_id`` is a random, Ray session/job-specific ID that cannot be used to identify a specific person or machine. It does not live beyond the lifetime of your Ray session, and it is captured primarily so that we can honor deletion requests.
Could an enterprise easily configure an additional endpoint or substitute a different endpoint?
We definitely see this use case and would love to chat with you to make this work. Please email usage_stats@ray.io.
If you have any feedback regarding usage stats collection, please email us at usage_stats@ray.io.