# ML-Ready Data

In this tutorial, we cover the basics of creating dataloaders for ML-ready data.

![](../assets/ml_ready_data.png)

In [1]:
import autoroot
import os
import xarray as xr
import matplotlib.pyplot as plt
from xrpatcher import XRDAPatcher
from torch.utils.data import Dataset, DataLoader, ConcatDataset
import numpy as np
import itertools
from dotenv import load_dotenv
from rs_tools._src.utils.io import get_list_filenames


xr.set_options(
    keep_attrs=True, 
    display_expand_data=False, 
    display_expand_coords=False, 
    display_expand_data_vars=False, 
    display_expand_indexes=False
)
np.set_printoptions(threshold=10, edgeitems=2)


save_dir = os.getenv("ITI_DATA_SAVEDIR")



## ML-Ready Datasets


In [2]:
list_of_files = get_list_filenames(f"{save_dir}/aqua/analysis", ".nc")
len(list_of_files)

96

In [3]:
ds = xr.open_dataset(list_of_files[0], engine="netcdf4")
ds

***

## PyTorch Integration

In [4]:
from rs_tools._src.datamodule.utils import load_nc_file

We will create a very simple demo dataset that reads one NetCDF file per sample and can be wrapped in a dataloader.

In [5]:
from torch.utils.data import Dataset, DataLoader
from typing import Optional, Callable, List

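class NCDataReader(Dataset):
    """Minimal dataset that loads one NetCDF file per sample."""

    def __init__(self, data_filenames: List[str], transforms: Optional[Callable]=None):
        self.data_filenames = data_filenames
        self.transforms = transforms

    def __getitem__(self, ind):
        # load the file at this index (a dictionary when no transforms are applied)
        nc_path = self.data_filenames[ind]
        x = load_nc_file(nc_path)
        # optionally apply the user-supplied transformation(s)
        if self.transforms is not None:
            x = self.transforms(x)
        return x

    def __len__(self):
        # one sample per file
        return len(self.data_filenames)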

In [6]:
list_of_files = get_list_filenames(f"{save_dir}/msg/analysis", ".nc")
len(list_of_files)

36

In [7]:
# initialize a simple dataset object
ds = NCDataReader(list_of_files[:10])

# initialize the dataloader
dl = DataLoader(ds, batch_size=8)

# do one iteration
out = next(iter(dl))

# list out the available keys
list(out.keys())

In [10]:
out["data"].shape, out["coords"].shape

(torch.Size([8, 11, 32, 32]), torch.Size([8, 2, 32, 32]))
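
The batch above is a dictionary because the `DataLoader` default collate function stacks each dictionary entry separately along a new leading batch dimension. The following is a minimal sketch of that behaviour. It assumes a hypothetical in-memory dataset, `ToyDictDataset`, filled with random values that only mimic the shapes shown above. It reads no files and is not the `load_nc_file` implementation.

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader


class ToyDictDataset(Dataset):
    """Hypothetical in-memory stand-in for NCDataReader (no files are read)."""

    def __init__(self, num_samples: int = 10):
        rng = np.random.default_rng(0)
        # each sample mimics the layout seen above: 11 data channels and
        # 2 coordinate channels on a 32 x 32 patch (values are random)
        self.samples = [
            {
                "data": rng.normal(size=(11, 32, 32)).astype(np.float32),
                "coords": rng.normal(size=(2, 32, 32)).astype(np.float32),
            }
            for _ in range(num_samples)
        ]

    def __getitem__(self, ind):
        return self.samples[ind]

    def __len__(self):
        return len(self.samples)


toy_dl = DataLoader(ToyDictDataset(), batch_size=8)

# the default collate function converts the arrays to tensors and
# stacks them key by key, so the batch is again a dictionary
toy_batch = next(iter(toy_dl))
print({key: tuple(value.shape) for key, value in toy_batch.items()})
```

With 10 samples and `batch_size=8`, the loader yields two batches, and the second one holds only the remaining 2 samples. Passing `drop_last=True` to the `DataLoader` would discard that incomplete batch.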

## Transformations / Editors

We can also apply custom transformations within the dataset, just as in standard PyTorch.

For example, suppose we want to normalize the coordinates and then stack all dictionary elements into a single tensor.
We can use the native transformations that are available in the ITI library.

In [11]:
from rs_tools._src.datamodule.editor import StackDictEditor, CoordNormEditor
from toolz import compose_left

In [12]:
transforms = compose_left(
    CoordNormEditor(), 
    StackDictEditor(),
)

The only extra piece we added is `compose_left`, which composes the transforms so that they are applied sequentially, from left to right.
**Note**: this plays the same role as `torchvision.transforms.Compose`.
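
To make the ordering concrete, here is a small standard-library sketch of left-to-right composition. The helper `compose_left_sketch` and the two toy functions are hypothetical names used only for illustration. The sketch reproduces the ordering of `toolz.compose_left` but is not its implementation.

```python
from functools import reduce


def compose_left_sketch(*funcs):
    """Apply funcs left to right: compose_left_sketch(f, g)(x) == g(f(x))."""

    def composed(x):
        # feed the running result through each function in turn
        return reduce(lambda acc, func: func(acc), funcs, x)

    return composed


def add_one(x):
    return x + 1


def double(x):
    return x * 2


pipeline = compose_left_sketch(add_one, double)
result = pipeline(3)  # (3 + 1) * 2
print(result)
```

Swapping the arguments changes the result, which is why the order of the editors in `transforms` matters: the coordinate normalization has to run while the sample is still a dictionary, before the stacking step.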

In [13]:
# initialize dataset with transforms
ds = NCDataReader(list_of_files[:10], transforms=transforms)

# initialize dataloader
dl = DataLoader(ds, batch_size=8)

# do one iteration
out = next(iter(dl))

# inspect a batch
out.shape

torch.Size([8, 14, 32, 32])
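
The internals of the two editors are not shown in this tutorial. The following NumPy sketch illustrates what such transforms could look like. The classes `CoordNormSketch` and `StackDictSketch` are hypothetical, and the min-max rescaling to [-1, 1] is an assumption rather than the behaviour of `CoordNormEditor`. The toy sample has only two entries, so its channel count need not match the one obtained from the real files.

```python
import numpy as np


class CoordNormSketch:
    """Hypothetical editor: min-max rescale the coordinate entry to [-1, 1]."""

    def __init__(self, key: str = "coords"):
        self.key = key

    def __call__(self, sample: dict) -> dict:
        coords = sample[self.key]
        # per-channel minimum and maximum over the two spatial axes
        lo = coords.min(axis=(-2, -1), keepdims=True)
        hi = coords.max(axis=(-2, -1), keepdims=True)
        # guard against division by zero for constant channels
        span = np.where(hi > lo, hi - lo, 1.0)
        out = dict(sample)  # shallow copy, the input sample is left untouched
        out[self.key] = 2.0 * (coords - lo) / span - 1.0
        return out


class StackDictSketch:
    """Hypothetical editor: concatenate all entries along the channel axis."""

    def __call__(self, sample: dict) -> np.ndarray:
        return np.concatenate(list(sample.values()), axis=0)


# toy sample: 11 constant data channels plus latitude/longitude grids
lat, lon = np.meshgrid(
    np.linspace(30.0, 40.0, 32), np.linspace(-10.0, 5.0, 32), indexing="ij"
)
toy_sample = {
    "data": np.ones((11, 32, 32), dtype=np.float32),
    "coords": np.stack([lat, lon]).astype(np.float32),
}

normed = CoordNormSketch()(toy_sample)
stacked = StackDictSketch()(normed)
print(stacked.shape)
```

Any callable that maps a sample to a sample can be passed to `NCDataReader` through `transforms` in the same way, either on its own or as one step of a composed pipeline.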

In the future, we will add demonstrations of how to include arbitrary transformations from other libraries, such as the [`Torchvision`](https://pytorch.org/vision/stable/index.html) transformations and the [`albumentations`](https://albumentations.ai/) library.