## Data ingestion

Start by reading the data from a public cloud storage bucket.

In [None]:
# Load data.
ds = ray.data.read_images(
    "s3://doggos-dataset/train",
    include_paths=True,
    shuffle="files",
)
ds.take(1)


2025-08-22 00:14:08,238	INFO worker.py:1747 -- Connecting to existing Ray cluster at address: 10.0.52.10:6379...
2025-08-22 00:14:08,250	INFO worker.py:1918 -- Connected to Ray cluster. View the dashboard at https://session-466hy7cqu1gzrp8zk8l4byz7l7.i.anyscaleuserdata.com
2025-08-22 00:14:08,255	INFO packaging.py:588 -- Creating a file package for local module '/home/ray/default/doggos/doggos'.
2025-08-22 00:14:08,258	INFO packaging.py:380 -- Pushing file package 'gcs://_ray_pkg_0193267f6c9951ce.zip' (0.02MiB) to Ray cluster...
2025-08-22 00:14:08,259	INFO packaging.py:393 -- Successfully pushed file package 'gcs://_ray_pkg_0193267f6c9951ce.zip'.
2025-08-22 00:14:08,262	INFO packaging.py:380 -- Pushing file package 'gcs://_ray_pkg_6d26725922931a7a9e87fca928dfafe4f4e5e54b.zip' (1.18MiB) to Ray cluster...
2025-08-22 00:14:08,268	INFO packaging.py:393 -- Successfully pushed file package 'gcs://_ray_pkg_6d26725922931a7a9e87fca928dfafe4f4e5e54b.zip'.

2025-08-22 00:15:25,802	INFO streaming_executor.py:231 -- ✔️  Dataset dataset_59_0 execution finished in 77.16 seconds


[{'image': array([[[123, 118,  78],
          [125, 120,  80],
          [128, 120,  83],
          ...,
          [162, 128,  83],
          [162, 128,  83],
          [161, 127,  82]],
  
         [[123, 118,  78],
          [125, 120,  80],
          [127, 119,  82],
          ...,
          [162, 128,  83],
          [162, 128,  83],
          [161, 127,  82]],
  
         [[123, 118,  78],
          [125, 120,  80],
          [127, 119,  82],
          ...,
          [161, 128,  83],
          [161, 128,  83],
          [160, 127,  82]],
  
         ...,
  
         [[235, 234, 239],
          [233, 232, 237],
          [221, 220, 225],
          ...,
          [158,  95,  54],
          [150,  85,  53],
          [151,  88,  57]],
  
         [[219, 220, 222],
          [227, 228, 230],
          [222, 223, 225],
          ...,
          [153,  91,  54],
          [146,  83,  52],
          [149,  88,  59]],
  
         [[213, 217, 216],
          [217, 221, 220],
          [213,

<div class="alert alert-block alert"> <b> ✍️ Distributed READ/WRITE</b> 

Ray Data supports a wide range of data sources for both [loading](https://docs.ray.io/en/latest/data/loading-data.html) and [saving](https://docs.ray.io/en/latest/data/saving-data.html), from generic binary files in cloud storage to structured data formats used by modern data platforms. This example reads from a public S3 bucket that already holds the dataset. This `read` operation, like the `write` operation in a later step, runs in a distributed fashion: Ray Data processes the data in parallel across the cluster and doesn't need to load the entire dataset into memory at once, which makes data loading scalable and memory-efficient.

<div class="alert alert-block alert"> <b>💡 Ray Data best practices</b>

- **trigger lazy execution**: Ray Data executes lazily. Transformations only build up an execution plan, which can reduce execution time and memory utilization. Nothing runs until you call an operation that consumes the data, such as [`take`](https://docs.ray.io/en/latest/data/api/doc/ray.data.Dataset.take.html), `count`, or a `write`, so this example uses `take` to actually execute the workflow DAG.
- **shuffling strategies**: the dataset is ordered by class, so this example randomly shuffles the order of the input files before reading (`shuffle="files"`). Ray Data also provides other [shuffling strategies](https://docs.ray.io/en/latest/data/shuffling-data.html), such as local shuffles and per-epoch shuffles.
- **`materialize` during development**: use [`materialize`](https://docs.ray.io/en/latest/data/api/doc/ray.data.Dataset.materialize.html) to execute the pipeline and store the resulting dataset in Ray's [shared-memory object store](https://docs.ray.io/en/latest/ray-core/objects.html). This acts as a checkpoint: later operations on the dataset start from the materialized result instead of rerunning every upstream operation from scratch. This is convenient during development, especially in a stateful environment like a Jupyter notebook, because you can continue from the saved checkpoint.

    ```python
    ds = ds.map(...)
    ds = ds.materialize()
    ```

    **Note**: only use this during development and with small datasets, because it loads the entire dataset into memory.
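The lazy-execution and `materialize` behavior described above can be illustrated with a small pure-Python sketch. The `LazyDataset` class and the `double` transform are hypothetical stand-ins written for this illustration; they are not Ray's API and say nothing about how Ray implements its planner or object store. The sketch only shows the idea: `map` records work, `take` runs just enough of it, and `materialize` caches the result so that later reads don't rerun the transform.

```python
from itertools import islice


class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset. This is not Ray's API."""

    def __init__(self, source, ops=()):
        self._source = source   # zero-argument callable returning an iterable of rows
        self._ops = tuple(ops)  # recorded transforms, not yet applied

    def map(self, fn):
        # Only record the transform; no row is processed here.
        return LazyDataset(self._source, self._ops + (fn,))

    def _rows(self):
        # Apply every recorded transform to each row, one row at a time.
        for row in self._source():
            for fn in self._ops:
                row = fn(row)
            yield row

    def take(self, n):
        # Consuming rows triggers execution, and only for the rows needed.
        return list(islice(self._rows(), n))

    def materialize(self):
        # Run the whole pipeline once and cache the result as a new starting point.
        cached = list(self._rows())
        return LazyDataset(lambda: cached)


calls = []  # records which rows the transform actually processed


def double(row):
    calls.append(row["id"])
    return {"id": row["id"], "value": row["id"] * 2}


ds = LazyDataset(lambda: ({"id": i} for i in range(5))).map(double)
calls_before_take = len(calls)        # map() alone ran nothing

first_two = ds.take(2)
calls_after_take = len(calls)         # only two rows were processed

checkpoint = ds.materialize()
calls_after_materialize = len(calls)  # all five rows were processed once more

checkpoint.take(5)
calls_after_reuse = len(calls)        # unchanged: the cached rows are reused
```

In this toy model the cache is an in-memory list, which is also why the note above applies: materializing holds the whole result in memory at once.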


You also want to add the class label to each data point. When you read the data with `include_paths=True`, Ray Data stores each file's path in a `path` column alongside the image. The path contains the class label, as the name of the file's parent directory, so extract it and add it to each row. Use Ray Data's [map](https://docs.ray.io/en/latest/data/api/doc/ray.data.Dataset.map.html) function to apply the function to each row.

In [None]:
def add_class(row):
    # The class label is the second-to-last path component (the parent directory).
    row["class"] = row["path"].rsplit("/", 3)[-2]
    return row


In [None]:
# Add class.
ds = ds.map(add_class, num_cpus=1, num_gpus=0, concurrency=4)
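
As a quick sanity check of the path parsing, the sketch below applies the same `rsplit` logic to a single row in plain Python, without Ray. The file name `1234.jpg` and the class name `border_collie` are hypothetical, chosen only to mirror the `<bucket>/train/<class>/<file>` layout that the parsing logic assumes.

```python
def add_class(row):
    # Same logic as above: the class label is the second-to-last path component.
    row["class"] = row["path"].rsplit("/", 3)[-2]
    return row


# Hypothetical row; rows read by Ray Data also carry an "image" array.
sample_row = {"path": "s3://doggos-dataset/train/border_collie/1234.jpg"}
labeled_row = add_class(sample_row)
print(labeled_row["class"])  # border_collie
```

If the bucket layout differed, for example with files sitting directly under `train/`, this expression would return the wrong component, so it's worth checking a few real paths before mapping over the full dataset.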


<div class="alert alert-block alert"> <b> Ray Data streaming execution</b> 

❌ Traditional batch execution (non-streaming systems such as Spark without pipelining or SageMaker Batch Transform):
- Reads the entire dataset into memory or a persistent intermediate format.
- Only then starts applying transformations like `.map` and `.filter`.
- Leads to higher memory pressure and startup latency.

✅ Streaming execution with Ray Data:
- Starts processing chunks ("blocks") as they're loaded, without waiting for the entire dataset to load.
- Reduces the memory footprint, which lowers the risk of OOM errors, and shortens the time to first output.
- Increases resource utilization by reducing idle time.
- Enables online-style inference pipelines with minimal latency.
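
The contrast between the two execution models can be sketched in plain Python with generators. This is a conceptual toy, not Ray Data's executor: `read_blocks`, `batch_execute`, `streaming_execute`, and `square` are hypothetical names introduced for this illustration, and the "peak" figure only counts input rows held at once.

```python
def read_blocks(num_blocks, block_size):
    """Yield blocks (lists of rows) one at a time, like an incremental read."""
    for b in range(num_blocks):
        start = b * block_size
        yield list(range(start, start + block_size))


def batch_execute(fn, num_blocks, block_size):
    # Load every input row first, then start transforming.
    rows = [row for block in read_blocks(num_blocks, block_size) for row in block]
    peak_input_rows = len(rows)
    return [fn(row) for row in rows], peak_input_rows


def streaming_execute(fn, num_blocks, block_size):
    # Transform each block as soon as it is read, then move on to the next one.
    results, peak_input_rows = [], 0
    for block in read_blocks(num_blocks, block_size):
        peak_input_rows = max(peak_input_rows, len(block))
        results.extend(fn(row) for row in block)
    return results, peak_input_rows


def square(x):
    return x * x


batch_out, batch_peak = batch_execute(square, num_blocks=4, block_size=3)
stream_out, stream_peak = streaming_execute(square, num_blocks=4, block_size=3)
```

Both functions produce the same output, but the batch version holds all 12 input rows before doing any work, while the streaming version never holds more than one 3-row block of input. The toy collects results in a list for comparison; a real pipeline would typically hand them to the next stage or write them out.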

<img src="https://raw.githubusercontent.com/anyscale/multimodal-ai/refs/heads/main/images/streaming.gif" width=1000>

**Note**: Ray Data isn't a real-time stream processing engine like Flink or Kafka Streams. Instead, it performs batch processing with streaming execution, which is especially useful for iterative ML workloads, ETL pipelines, and preprocessing before training or inference. In the linked benchmark experiments, Ray shows a [**2-17x throughput improvement**](https://www.anyscale.com/blog/offline-batch-inference-comparing-ray-apache-spark-and-sagemaker#-results-of-throughput-from-experiments) over solutions like Spark and SageMaker Batch Transform.
