## 2. Loading Data

- Ray Datasets uses Ray tasks to read data from remote storage.
- When reading from a file-based datasource (e.g., S3), it creates a number of read tasks proportional to the number of CPUs in the cluster.
- Each read task reads its assigned files and produces an output block.

<img src="https://anyscale-public-materials.s3.us-west-2.amazonaws.com/ray-summit/rag-app/dataset-read-cropped-v2.svg" width="500px">
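To make the idea of "assigned files" concrete, here is a minimal plain-Python sketch of how a list of file paths might be divided among read tasks. It is only an illustration, not Ray's actual scheduling logic: the function `assign_files_to_read_tasks`, the `tasks_per_cpu` factor, and the file names are all hypothetical.

```python
def assign_files_to_read_tasks(paths, num_cpus, tasks_per_cpu=2):
    """Divide file paths among a number of read tasks proportional to num_cpus.

    Hypothetical helper for illustration; Ray chooses its own task count.
    """
    # Never create more tasks than there are files, and always at least one.
    num_tasks = max(1, min(len(paths), num_cpus * tasks_per_cpu))
    # Round-robin assignment: task i gets files i, i + num_tasks, ...
    return [paths[i::num_tasks] for i in range(num_tasks)]


# Placeholder file names: 10 class folders with 50 images each.
example_paths = [f"{digit}/img_{i}.png" for digit in range(10) for i in range(50)]

read_tasks = assign_files_to_read_tasks(example_paths, num_cpus=4)

for task_id, files in enumerate(read_tasks):
    # In Ray, each read task would load its files and emit one output block.
    print(f"read task {task_id}: {len(files)} files")
```

With more CPUs, the same file list is spread over more read tasks, so more files are read at the same time.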

Let's load some `MNIST` data from S3.

In [22]:
# Here is our dataset; it contains 50 images per class
!aws s3 ls s3://anyscale-public-materials/ray-ai-libraries/mnist/50_per_index/

                           PRE 0/
                           PRE 1/
                           PRE 2/
                           PRE 3/
                           PRE 4/
                           PRE 5/
                           PRE 6/
                           PRE 7/
                           PRE 8/
                           PRE 9/


We will use the `read_images` function to load the image data.

In [21]:
import ray

ds = ray.data.read_images("s3://anyscale-public-materials/ray-ai-libraries/mnist/50_per_index/", include_paths=True)
ds

2024-12-06 09:56:54,059	INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2024-12-06_07-31-15_546129_2446/logs/ray-data
2024-12-06 09:56:54,060	INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> TaskPoolMapOperator[ExpandPaths] -> TaskPoolMapOperator[ReadFiles]



Dataset(
   num_rows=?,
   schema={image: numpy.ndarray(shape=(28, 28), dtype=uint8), path: string}
)

Refer to the [Input/Output docs](https://docs.ray.io/en/latest/data/api/input_output.html) for a comprehensive list of read functions.

### More about Datasets

- A Dataset consists of a list of Ray object references to **blocks**. 
- Having multiple blocks in a dataset allows for **parallel transformation and ingest**.

The following figure visualizes a **tabular dataset with three blocks, each holding 1000 rows**:

<img src='https://docs.ray.io/en/releases-2.6.1/_images/dataset-arch.svg' width=50%/>
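As a rough analogy for why blocks enable parallelism, the sketch below splits 3000 rows into three blocks of 1000 rows and transforms each block independently. It uses plain Python threads as a stand-in for Ray tasks, so it does not show Ray's API. The helpers `split_into_blocks` and `double_block` are hypothetical names chosen for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_blocks(rows, rows_per_block):
    """Partition rows into consecutive blocks of at most rows_per_block rows."""
    return [rows[i:i + rows_per_block] for i in range(0, len(rows), rows_per_block)]


def double_block(block):
    """Example per-block transformation: double every value in the block."""
    return [value * 2 for value in block]


rows = list(range(3000))
blocks = split_into_blocks(rows, rows_per_block=1000)

# Each block is processed by its own worker, independently of the others.
with ThreadPoolExecutor(max_workers=len(blocks)) as pool:
    transformed_blocks = list(pool.map(double_block, blocks))

# Flatten the transformed blocks back into a single list of rows.
result = [value for block in transformed_blocks for value in block]
print(len(blocks), [len(block) for block in blocks])
```

In Ray, the blocks would live in the distributed object store and the Dataset would hold references to them rather than the data itself, but the same principle applies: because no block depends on another, each can be transformed on its own.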

- Because a Dataset is just a list of Ray object references, it fits into Ray's general compute framework.
- It can be freely passed between Ray tasks, actors, and other libraries like any other object reference.
- This flexibility is a unique characteristic of Ray Datasets.