[Data] Update Ray Data documentation for {landing,overview,key concepts} #44008

Merged · 5 commits · Mar 18, 2024
2 changes: 1 addition & 1 deletion doc/source/data/data-internals.rst
@@ -7,7 +7,7 @@ Ray Data Internals
This guide describes the implementation of Ray Data. The intended audience is advanced
users and Ray Data developers.

For a gentler introduction to Ray Data, see :ref:`Key concepts <data_key_concepts>`.
For a gentler introduction to Ray Data, see :ref:`Quickstart <data_quickstart>`.

.. _dataset_concept:

14 changes: 6 additions & 8 deletions doc/source/data/data.rst
@@ -8,16 +8,14 @@ Ray Data: Scalable Datasets for ML
:hidden:

Overview <overview>
key-concepts
quickstart
user-guide
examples
api/api
data-internals

Ray Data is a scalable data processing library for ML workloads. It provides flexible and performant APIs for scaling :ref:`Offline batch inference <batch_inference_overview>` and :ref:`Data preprocessing and ingest for ML training <ml_ingest_overview>`. Ray Data uses `streaming execution <https://www.anyscale.com/blog/streaming-distributed-execution-across-cpus-and-gpus>`__ to efficiently process large datasets.

.. image:: images/dataset.svg

..
https://docs.google.com/drawings/d/16AwJeBNR46_TsrkOmMbGaBK7u-OPsf_V8fHjU-d2PPQ/edit
@@ -57,19 +55,19 @@ Learn more

.. grid-item-card::

**Key Concepts**
**Quickstart**
^^^

Understand the key concepts behind Ray Data. Learn what
:ref:`Datasets <dataset_concept>` are and how they're used.
Datasets are and how they're used.

+++
.. button-ref:: data_key_concepts
.. button-ref:: data_quickstart
:color: primary
:outline:
:expand:

Learn Key Concepts
Quickstart

.. grid-item-card::

@@ -118,7 +116,7 @@ Learn more

.. grid-item-card::

**Ray blogs**
**Ray Blogs**
^^^

Get the latest on engineering updates from the Ray team and how companies are using Ray Data.
@@ -11,7 +11,7 @@
"\n",
"We'll use Ray Data and a pretrained model from Hugging Face hub. You can easily adapt this example to use other similar models.\n",
"\n",
"We recommend reading [Ray Train Key Concepts](train-key-concepts) and [Ray Data Key Concepts](data_key_concepts) before starting this example.\n",
"We recommend reading [Ray Train Key Concepts](train-key-concepts) and [Ray Data Quickstart](data_quickstart) before starting this example.\n",
"\n",
"```{note}\n",
"To run this example, make sure your Ray cluster has access to at least one GPU with 16 or more GBs of memory. The amount of memory needed will depend on the model.\n",
1 change: 0 additions & 1 deletion doc/source/data/images/dataset.svg

This file was deleted.

58 changes: 25 additions & 33 deletions doc/source/data/overview.rst
@@ -5,8 +5,6 @@ Ray Data Overview

.. _data-intro:

.. image:: images/dataset.svg

..
https://docs.google.com/drawings/d/16AwJeBNR46_TsrkOmMbGaBK7u-OPsf_V8fHjU-d2PPQ/edit
@@ -15,53 +13,41 @@ Ray Data is a scalable data processing library for ML workloads, particularly su
- :ref:`Offline batch inference <batch_inference_overview>`
- :ref:`Data preprocessing and ingest for ML training <ml_ingest_overview>`

It provides flexible and performant APIs for distributed data processing:

- Simple transformations such as maps (:meth:`~ray.data.Dataset.map_batches`)
- Global and grouped aggregations (:meth:`~ray.data.Dataset.groupby`)
- Shuffling operations (:meth:`~ray.data.Dataset.random_shuffle`, :meth:`~ray.data.Dataset.sort`, :meth:`~ray.data.Dataset.repartition`).
It provides flexible and performant APIs for distributed data processing. For more details, see :ref:`Transforming Data <transforming_data>`.
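A minimal sketch of the transformation APIs referenced above (the schema and values are made up for illustration, and defaults such as the batch format can vary across Ray versions):

.. code-block:: python

    import ray

    # A small in-memory dataset with a hypothetical schema.
    ds = ray.data.from_items([{"id": i, "value": i % 3} for i in range(100)])

    # Simple transformation: map over batches (dicts of NumPy arrays by default).
    def add_one(batch):
        batch["value"] = batch["value"] + 1
        return batch

    ds = ds.map_batches(add_one)

    # Global and grouped aggregation.
    counts = ds.groupby("value").count()

    # Shuffling operations.
    shuffled = ds.random_shuffle()
    sorted_ds = ds.sort("value")
    repartitioned = ds.repartition(10)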

Ray Data is built on top of Ray, so it scales effectively to large clusters and offers scheduling support for both CPU and GPU resources. Ray Data uses `streaming execution <https://www.anyscale.com/blog/streaming-distributed-execution-across-cpus-and-gpus>`__ to efficiently process large datasets.

.. note::

Ray Data doesn't have a SQL interface and isn't meant as a replacement for generic
ETL pipelines like Spark.

Why choose Ray Data?
--------------------

.. dropdown:: Faster and cheaper for modern deep learning applications

Ray Data is designed for deep learning applications that involve both CPU preprocessing and GPU inference. Through its powerful streaming :ref:`Dataset <dataset_concept>` primitive, Ray Data streams working data from CPU preprocessing tasks to GPU inferencing or training tasks, allowing you to utilize both sets of resources concurrently.
Ray Data is designed for deep learning applications that involve both CPU preprocessing and GPU inference. Ray Data streams working data from CPU preprocessing tasks to GPU inferencing or training tasks, allowing you to utilize both sets of resources concurrently.

By using Ray Data, your GPUs are no longer idle during CPU computation, reducing overall cost of the batch inference job.
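A rough sketch of this streaming pattern, pairing a CPU preprocessing stage with a GPU inference stage (the stand-in model and the ``num_gpus``/``concurrency`` arguments are illustrative, and exact argument names can differ between Ray versions):

.. code-block:: python

    import ray

    # Placeholder dataset of feature vectors; swap in your own source.
    ds = ray.data.range_tensor(10_000, shape=(256,))

    # CPU stage: runs as Ray tasks while the GPU stage consumes earlier batches.
    def normalize(batch):
        batch["data"] = batch["data"].astype("float32") / 255.0
        return batch

    # GPU stage: a stateful actor that holds the (hypothetical) model.
    class Predictor:
        def __init__(self):
            self.predict = lambda x: x.sum(axis=1)  # stand-in for a real model

        def __call__(self, batch):
            batch["score"] = self.predict(batch["data"])
            return batch

    predictions = (
        ds.map_batches(normalize)
          .map_batches(Predictor, batch_size=64, num_gpus=1, concurrency=2)
    )
    predictions.show(3)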

.. dropdown:: Cloud, framework, and data format agnostic

Ray Data has no restrictions on cloud provider, ML framework, or data format.

Through the :ref:`Ray cluster launcher <cluster-index>`, you can start a Ray cluster on AWS, GCP, or Azure clouds. You can use any ML framework of your choice, including PyTorch, HuggingFace, or Tensorflow. Ray Data also does not require a particular file format, and supports a :ref:`wide variety of formats <loading_data>` including CSV, Parquet, and raw images.
You can start a Ray cluster on AWS, GCP, or Azure clouds. You can use any ML framework of your choice, including PyTorch, Hugging Face, or TensorFlow. Ray Data also does not require a particular file format, and supports a :ref:`wide variety of formats <loading_data>` including Parquet, images, JSON, text, and CSV.
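For example, each of these formats has a corresponding read call (the bucket paths below are placeholders):

.. code-block:: python

    import ray

    # Each call returns a Ray Dataset; the paths are placeholders.
    parquet_ds = ray.data.read_parquet("s3://my-bucket/tables/")
    image_ds = ray.data.read_images("s3://my-bucket/images/")
    json_ds = ray.data.read_json("s3://my-bucket/records/")
    text_ds = ray.data.read_text("s3://my-bucket/logs.txt")
    csv_ds = ray.data.read_csv("s3://my-bucket/table.csv")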

.. dropdown:: Out of the box scaling
.. dropdown:: Out-of-the-box scaling on heterogeneous clusters

Ray Data is built on Ray, so it easily scales to many machines. Code that works on one machine also runs on a large cluster without any changes.
Ray Data is built on Ray, so it easily scales across heterogeneous clusters with different types of CPU and GPU machines. Code that works on one machine also runs on a large cluster without any changes.

.. dropdown:: Python first
Ray Data can easily scale to hundreds of nodes to process hundreds of TB of data.

With Ray Data, you can express your inference job directly in Python instead of
YAML or other formats, allowing for faster iterations, easier debugging, and a native developer experience.
.. dropdown:: Unified API and backend for batch inference and ML training

With Ray Data, you can express both batch inference and ML training jobs directly with the same Ray Dataset API.


.. _batch_inference_overview:

Offline Batch Inference
-----------------------

.. tip::

`Get in touch <https://forms.gle/sGX7PQhheBGL6yxQ6>`_ to get help using Ray Data, the industry's fastest and cheapest solution for offline batch inference.

Offline batch inference is a process for generating model predictions on a fixed set of input data. Ray Data offers an efficient and scalable solution for batch inference, providing faster execution and cost-effectiveness for deep learning applications. For more details on how to use Ray Data for offline batch inference, see the :ref:`batch inference user guide <batch_inference_home>`.

.. image:: images/stream-example.png
@@ -72,8 +58,8 @@ Offline batch inference is a process for generating model predictions on a fixed
https://docs.google.com/presentation/d/1l03C1-4jsujvEFZUM4JVNy8Ju8jnY5Lc_3q7MBWi2PQ/edit#slide=id.g230eb261ad2_0_0

How does Ray Data compare to X for offline inference?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
How does Ray Data compare to other solutions for offline inference?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. dropdown:: Batch Services: AWS Batch, GCP Batch

@@ -94,12 +80,15 @@ How does Ray Data compare to X for offline inference?

Ray Data handles many of the same batch processing workloads as `Apache Spark <https://spark.apache.org/>`_, but with a streaming paradigm that is better suited for GPU workloads for deep learning inference.

Ray Data doesn't have a SQL interface and isn't meant as a replacement for generic ETL pipelines like Spark.

For a more detailed performance comparison between Ray Data and Apache Spark, see `Offline Batch Inference: Comparing Ray, Apache Spark, and SageMaker <https://www.anyscale.com/blog/offline-batch-inference-comparing-ray-apache-spark-and-sagemaker>`_.

Batch inference case studies
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- `ByteDance scales offline inference with multi-modal LLMs to 200 TB on Ray Data <https://www.anyscale.com/blog/how-bytedance-scales-offline-inference-with-multi-modal-llms-to-200TB-data>`_
- `Spotify's new ML platform built on Ray Data for batch inference <https://engineering.atspotify.com/2023/02/unleashing-ml-innovation-at-spotify-with-ray/>`_
- `Sewer AI speeds up object detection on videos 3x using Ray Data <https://www.anyscale.com/blog/inspecting-sewer-line-safety-using-thousands-of-hours-of-video>`_
- `Spotify's new ML platform built on Ray, using Ray Data for batch inference <https://engineering.atspotify.com/2023/02/unleashing-ml-innovation-at-spotify-with-ray/>`_

.. _ml_ingest_overview:

@@ -114,8 +103,9 @@ Key supported features for distributed training include:
- No dropped rows during distributed dataset iteration

Ray Data serves as a last-mile bridge from storage or ETL pipeline outputs to distributed
applications and libraries in Ray. Don't use it as a replacement for more general data
processing systems. For more details on how to use Ray Data for preprocessing and ingest for ML training, see :ref:`Data loading for ML training <data-ingest-torch>`.
applications and libraries in Ray. Use it for unstructured data processing. For more details
on how to use Ray Data for preprocessing and ingest for ML training, see
:ref:`Data loading for ML training <data-ingest-torch>`.
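A hedged sketch of the ingest pattern this enables, splitting one dataset across several training workers (the worker class and path are hypothetical; in practice Ray Train sets up per-worker shards for you via ``ray.train.get_dataset_shard``):

.. code-block:: python

    import ray

    ds = ray.data.read_parquet("s3://my-bucket/train-data/")  # placeholder path

    # Split the dataset into equal streaming shards, one per training worker.
    shards = ds.streaming_split(n=4, equal=True)

    @ray.remote
    class TrainWorker:
        def fit(self, shard):
            # Each worker streams batches from its own shard.
            for batch in shard.iter_torch_batches(batch_size=32):
                pass  # forward/backward pass would go here

    workers = [TrainWorker.remote() for _ in range(4)]
    ray.get([worker.fit.remote(shard) for worker, shard in zip(workers, shards)])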

.. image:: images/dataset-loading-1.svg
:width: 650px
@@ -125,8 +115,8 @@ processing systems. For more details on how to use Ray Data for preprocessing an
https://docs.google.com/presentation/d/1l03C1-4jsujvEFZUM4JVNy8Ju8jnY5Lc_3q7MBWi2PQ/edit

How does Ray Data compare to X for ML training ingest?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
How does Ray Data compare to other solutions for ML training ingest?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. dropdown:: PyTorch Dataset and DataLoader

@@ -156,7 +146,9 @@ How does Ray Data compare to X for ML training ingest?
* **Lower overhead:** Datasets has lower overhead: it supports zero-copy exchange between processes, in contrast to the multiprocessing-based pipelines used by NVTabular.
* **Heterogeneous compute:** NVTabular doesn't support mixing heterogeneous resources in dataset transforms (e.g. both CPU and GPU transformations), while Ray Data supports this.

ML ingest case studies
~~~~~~~~~~~~~~~~~~~~~~
ML training ingest case studies
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- `Pinterest uses Ray Data to do last mile data processing for model training <https://medium.com/pinterest-engineering/last-mile-data-processing-with-ray-629affbf34ff>`_
- `DoorDash elevates model training with Ray Data <https://raysummit.anyscale.com/agenda/sessions/144>`_
- `Instacart builds distributed machine learning model training on Ray Data <https://tech.instacart.com/distributed-machine-learning-at-instacart-4b11d7569423>`_
- `Predibase speeds up image augmentation for model training using Ray Data <https://predibase.com/blog/ludwig-v0-7-fine-tuning-pretrained-image-and-text-models-50x-faster-and>`_
- `Spotify's new ML platform built on Ray, using Ray Data for feature preprocessing <https://engineering.atspotify.com/2023/02/unleashing-ml-innovation-at-spotify-with-ray/>`_
@@ -1,7 +1,7 @@
.. _data_key_concepts:
.. _data_quickstart:

Key Concepts
============
Quickstart
==========

Learn about :class:`Dataset <ray.data.Dataset>` and the capabilities it provides.

@@ -90,52 +90,19 @@ Consuming data
Pass datasets to Ray Tasks or Actors, and access records with methods like
:meth:`~ray.data.Dataset.take_batch` and :meth:`~ray.data.Dataset.iter_batches`.

.. tab-set::

.. tab-item:: Local

.. testcode::

print(transformed_ds.take_batch(batch_size=3))

.. testoutput::
:options: +NORMALIZE_WHITESPACE

{'sepal length (cm)': array([5.1, 4.9, 4.7]),
'sepal width (cm)': array([3.5, 3. , 3.2]),
'petal length (cm)': array([1.4, 1.4, 1.3]),
'petal width (cm)': array([0.2, 0.2, 0.2]),
'target': array([0, 0, 0]),
'petal area (cm^2)': array([0.28, 0.28, 0.26])}

.. tab-item:: Tasks

.. testcode::

@ray.remote
def consume(ds: ray.data.Dataset) -> int:
num_batches = 0
for batch in ds.iter_batches(batch_size=8):
num_batches += 1
return num_batches

ray.get(consume.remote(transformed_ds))

.. tab-item:: Actors

.. testcode::

@ray.remote
class Worker:

def train(self, data_iterator):
for batch in data_iterator.iter_batches(batch_size=8):
pass
.. testcode::

workers = [Worker.remote() for _ in range(4)]
shards = transformed_ds.streaming_split(n=4, equal=True)
ray.get([w.train.remote(s) for w, s in zip(workers, shards)])
print(transformed_ds.take_batch(batch_size=3))

.. testoutput::
:options: +NORMALIZE_WHITESPACE

{'sepal length (cm)': array([5.1, 4.9, 4.7]),
'sepal width (cm)': array([3.5, 3. , 3.2]),
'petal length (cm)': array([1.4, 1.4, 1.3]),
'petal width (cm)': array([0.2, 0.2, 0.2]),
'target': array([0, 0, 0]),
'petal area (cm^2)': array([0.28, 0.28, 0.26])}
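To stream over the whole dataset rather than take a single batch, an ``iter_batches`` loop is a minimal option (the batch size here is arbitrary):

.. code-block:: python

    # Stream fixed-size batches without materializing the whole dataset.
    for batch in transformed_ds.iter_batches(batch_size=8):
        print(batch["target"])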

To learn more about consuming datasets, see
:ref:`Iterating over Data <iterating-over-data>` and :ref:`Saving Data <saving-data>`.
2 changes: 1 addition & 1 deletion doc/source/data/user-guide.rst
@@ -5,7 +5,7 @@ User Guides
===========

If you’re new to Ray Data, start with the
:ref:`Ray Data Key Concepts <data_key_concepts>`.
:ref:`Ray Data Quickstart <data_quickstart>`.
This user guide helps you navigate the Ray Data project and
shows you how to achieve several tasks.

2 changes: 1 addition & 1 deletion doc/source/ray-overview/getting-started.md
@@ -16,7 +16,7 @@ Use individual libraries for ML workloads. Click on the dropdowns for your workl
`````{dropdown} <img src="images/ray_svg_logo.svg" alt="ray" width="50px"> Data: Scalable Datasets for ML
:animate: fade-in-slide-down
Scale offline inference and training ingest with [Ray Data](data_key_concepts) --
Scale offline inference and training ingest with [Ray Data](data_quickstart) --
a data processing library designed for ML.
To learn more, see [Offline batch inference](batch_inference_overview) and
@@ -15,7 +15,7 @@
"\n",
"We load the pre-trained model from the HuggingFace model hub into a LightningModule and launch an FSDP fine-tuning job across 16 T4 GPUs with the help of {class}`Ray TorchTrainer <ray.train.torch.TorchTrainer>`. It is also straightforward to fine-tune other similar large language models in a similar manner as shown in this example.\n",
"\n",
"Before starting this example, we highly recommend reading [Ray Train Key Concepts](train-key-concepts) and [Ray Data Key Concepts](data_key_concepts)."
"Before starting this example, we highly recommend reading [Ray Train Key Concepts](train-key-concepts) and [Ray Data Quickstart](data_quickstart)."
]
},
{
@@ -24,7 +24,7 @@
"```{note}\n",
"This is an advanced example of Large Language Model fine-tuning with Ray Train. If you're a beginner or new to the concepts of Ray Train and our Lightning integrations, it would be beneficial to first explore the introductory documentation below to build a foundational understanding. \n",
"- [Ray Train Key Concepts](train-key-concepts) \n",
"- [Ray Data Key Concepts](data_key_concepts)\n",
"- [Ray Data Quickstart](data_quickstart)\n",
"- {doc}`[Basic] Image Classification with PyTorch Lightning and Ray Train <lightning_mnist_example>`\n",
"- {doc}`[Intermediate] Fine-tuning Lightning Modules with with Ray Data <lightning_cola_advanced>`\n",
"```\n"
2 changes: 1 addition & 1 deletion doc/source/tune/tutorials/tune_get_data_in_and_out.md
@@ -71,7 +71,7 @@ For example, passing in a large pandas DataFrame or an unserializable model obje
Instead, use strings or other identifiers as your values, and initialize/load the objects inside your Trainable directly depending on those.

```{note}
[Datasets](data_key_concepts) can be used as values in the search space directly.
[Datasets](data_quickstart) can be used as values in the search space directly.
```
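A minimal sketch of that usage (the datasets and the reported metric are placeholders, and the reporting call may differ slightly between Ray versions):

```python
import ray
from ray import train, tune

# Any Ray Dataset can appear as a search-space value; these are placeholders.
small_ds = ray.data.range(1_000)
large_ds = ray.data.range(100_000)

def training_function(config):
    ds = config["dataset"]
    train.report({"num_rows": ds.count()})  # placeholder metric

tuner = tune.Tuner(
    training_function,
    param_space={"dataset": tune.grid_search([small_ds, large_ds])},
)
results = tuner.fit()
```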

In our example, we want to tune the two model hyperparameters. We also want to set the number of epochs, so that we can easily tweak it later. For the hyperparameters, we will use the `tune.uniform` distribution. We will also modify the `training_function` to obtain those values from the `config` dictionary.