## 3. Loading data

Let's load some `MNIST` data from S3.

In [None]:
# Here is our dataset; it contains 50 images per class
!aws s3 ls s3://anyscale-public-materials/ray-ai-libraries/mnist/50_per_index/

We will use the `read_images` function to load the image data.

In [None]:
ds = ray.data.read_images("s3://anyscale-public-materials/ray-ai-libraries/mnist/50_per_index/", include_paths=True)
type(ds)

<div class="alert alert-block alert-info">
  <p><strong>Ray Data supports a variety of data sources for loading data</strong></p>
  <ul>
    <li>
      Reading files from common file formats (e.g. Parquet, CSV, JSON)
      <ul>
          <li><code>ds = ray.data.read_parquet("s3://...")</code></li>
      </ul>
    </li>
    <li>Loading from in-memory data structures (e.g. NumPy, PyTorch)
      <ul>
          <li><code>ds = ray.data.from_torch(torch_ds)</code></li>
      </ul>
    </li>
    <li>Loading from data lakehouses and warehouses such as Snowflake, Iceberg, and Databricks
      <ul>
          <li><code>ds = ray.data.read_databricks_tables()</code></li>
      </ul>
    </li>
  </ul>
  <p>
    See the full list of <a href="https://docs.ray.io/en/latest/data/api/input_output.html#input-output" target="_blank">supported formats</a>, and review further options in the <a href="https://docs.ray.io/en/latest/data/loading-data.html#loading-data" target="_blank">data loading guide</a>.
  </p>
</div>

Under the hood, Ray Data uses Ray tasks to read data from remote storage.

|<img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/ray-data-deep-dive/Ray+Data+Internals+-+reading.png" width="500px" loading="lazy">|
|:--|
|When reading from a file-based datasource, Ray Data starts with a number of read tasks proportional to the number of CPUs in the cluster. |
|Each read task reads its assigned files and produces output blocks.|
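
To make the idea of read tasks concrete, here is a minimal plain-Python sketch of how a list of files could be divided among a number of read tasks that scales with the CPU count. This is an illustration only, not Ray Data's actual planner: the function `plan_read_tasks`, the `tasks_per_cpu` parameter, and the `example-bucket` paths are all hypothetical.

```python
def plan_read_tasks(file_paths, num_cpus, tasks_per_cpu=2):
    """Assign files to read tasks round-robin (illustrative, not Ray's logic).

    `tasks_per_cpu` is a hypothetical knob that makes the number of read
    tasks proportional to the number of CPUs.
    """
    # Never create more tasks than files, and always create at least one.
    num_tasks = max(1, min(len(file_paths), num_cpus * tasks_per_cpu))
    tasks = [[] for _ in range(num_tasks)]
    for i, path in enumerate(file_paths):
        tasks[i % num_tasks].append(path)
    return tasks


# Hypothetical file listing; the bucket name is a placeholder.
files = [f"s3://example-bucket/images/{i}.png" for i in range(10)]
tasks = plan_read_tasks(files, num_cpus=2)

for task_id, assigned in enumerate(tasks):
    print(f"read task {task_id}: {len(assigned)} files")
```

In this sketch, each inner list stands for the files one read task would open; in Ray Data, each such task would then emit one or more output blocks.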

### 3.1 Note on blocks

|<img src="https://assets-training.s3.us-west-2.amazonaws.com/ray-intro/block.png" width="700px" loading="lazy">|
|:--|
|When materialized, a Dataset is a distributed collection of blocks. This example illustrates a materialized dataset with three blocks, each holding 1000 rows.|

<div class="alert alert-block alert-info">
A <strong>block</strong> is a contiguous subset of rows from a dataset. Blocks are distributed across the cluster and processed independently in parallel. By default, blocks are PyArrow tables.
</div>
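
The block concept can be sketched in plain Python. The example below is a simplified stand-in, assuming lists of dictionaries in place of PyArrow tables and a single process in place of a cluster; the helper `split_into_blocks` is hypothetical and not part of Ray Data. It splits 3000 rows into three contiguous blocks of 1000 rows, then processes each block on its own.

```python
def split_into_blocks(rows, rows_per_block):
    """Split rows into contiguous chunks (a stand-in for Ray Data blocks)."""
    return [
        rows[start:start + rows_per_block]
        for start in range(0, len(rows), rows_per_block)
    ]


# A toy dataset of 3000 rows.
rows = [{"id": i} for i in range(3000)]
blocks = split_into_blocks(rows, rows_per_block=1000)

print("number of blocks:", len(blocks))
print("rows per block:", [len(block) for block in blocks])

# Each block can be processed without looking at the others;
# the per-block results are then combined.
block_sums = [sum(row["id"] for row in block) for block in blocks]
total = sum(block_sums)
```

Because every block holds a contiguous, non-overlapping slice of the rows, combining the per-block results gives the same answer as processing the whole dataset at once.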