**DataLoaders** - get the data to the model efficiently.

It's time to worry about optimizing our data-loading pipeline.   
PyTorch's Dataset and DataLoader classes help us fetch the training batches 
in the background using multiple processes. 

These background processes help *avoid computational bottlenecks*: ideally, 
the next batch of data is already waiting when the model finishes its backward pass.

Using the **Dataset** and **DataLoader** classes is the most convenient way to load data in PyTorch.   
There is also a new DataPipe class that the PyTorch team is developing. 

In [None]:
"""
for each epoch:
    for each batch:
        # LOAD DATA --> x, y | **Bottleneck**: model waits for the next batch.

        out = model(x)
        loss...
        optimizer...
"""

**Bottleneck** - especially for large datasets that no longer fit into memory.  
If data loading is slow, training becomes inefficient because the model sits idle while the next batch is loaded.  

To make it *more efficient*, keep mini-batches pre-loaded so that the model never has to wait for the next one.

**Idea**: Run the data loading and the model training in two separate processes. 
- The data-loading process prepares the next mini-batches; the training process receives a ready mini-batch and runs the forward pass, loss, and optimizer step. 
- This removes the waiting period in which the model idles until the next mini-batch is loaded.

Process 1 - model training process  
Process 2 - data loader - prepares the next mini-batch while the model trains on the previous one

--> PyTorch does this for us (when `num_workers > 0`)!
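
To make the two-process idea concrete, here is a minimal sketch of the producer/consumer pattern using only the standard library. It is an illustration, not how `DataLoader` is implemented: it uses a background thread instead of worker processes, and `load_batch`, `producer`, and `train_with_prefetch` are hypothetical names invented for this example.

```python
import queue
import threading
import time


def load_batch(i):
    """Stand-in for slow data loading (hypothetical helper)."""
    time.sleep(0.01)
    return [i] * 4  # a fake mini-batch of four identical values


def producer(num_batches, q):
    """Runs in the background and fills the queue with ready batches."""
    for i in range(num_batches):
        q.put(load_batch(i))  # blocks while the queue is full
    q.put(None)  # sentinel: no more data


def train_with_prefetch(num_batches, prefetch=2):
    """Consume batches while the producer prepares the next ones."""
    q = queue.Queue(maxsize=prefetch)
    worker = threading.Thread(target=producer, args=(num_batches, q), daemon=True)
    worker.start()

    seen = []
    while True:
        batch = q.get()  # usually returns immediately: the batch is already loaded
        if batch is None:
            break
        time.sleep(0.01)  # stand-in for the forward/backward pass
        seen.append(batch[0])
    worker.join()
    return seen


seen = train_with_prefetch(5)
print(seen)  # [0, 1, 2, 3, 4]
```

The bounded queue plays the role of the pre-loaded mini-batches: the loader stays at most `prefetch` batches ahead, and the training loop only waits when the loader cannot keep up.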

In [None]:
"""
for epoch in range(num_epochs):
    for batch_idx, (inputs, labels) in enumerate(train_loader):
        # the rest
"""

train_loader - prepares the next mini-batches in the background as we iterate over it, while the model is training.


**Motivation** behind data loaders - have a background process handle the data loading while the model trains  

In [None]:
from torch.utils.data import DataLoader

train_loader = DataLoader(
    dataset = train_dataset,
    batch_size = 32,
    shuffle = True,
    drop_last = True,    # drop the last batch if train_size is not evenly divisible by batch size
    num_workers = 2
)

'''
**num_workers** - use multiple processes to load data in the background (pre-fetching next mini-batches) 
so that they are ready when the model is ready to train
'''
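
The effect of `drop_last` is easy to check on a toy dataset. The sketch below assumes 100 samples and a batch size of 32 (numbers chosen only for illustration) and uses `TensorDataset` as a stand-in for a real training set.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# 100 toy samples: 100 = 3 * 32 + 4, so the last batch is incomplete
toy = TensorDataset(torch.arange(100).float().unsqueeze(1), torch.zeros(100))

keep = DataLoader(toy, batch_size=32, drop_last=False)
drop = DataLoader(toy, batch_size=32, drop_last=True)

print(len(keep), len(drop))  # 4 3

# Batch sizes seen when the incomplete batch is kept
sizes = [x.shape[0] for x, _ in keep]
print(sizes)  # [32, 32, 32, 4]
```

Dropping the last batch during training avoids one very small, noisy batch per epoch; for validation and test loaders it is usually left at the default so that every sample is evaluated.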

### How do Datasets and DataLoaders work together?

Step 1: Create a custom Dataset
- defines how data is loaded 
- has a `__getitem__` method - defines how to load a single record (individual data sample)
- has a `__len__` method - returns the length of the dataset

Step 2: Instantiate Datasets
- Train/Val/Test Datasets

Step 3: Instantiate DataLoaders
- Train/Val/Test Data Loaders
- define: shuffling, batch size, number of processes etc.

Step 4: Test/Use DataLoaders
- inspect returned features, labels
- labels - shuffled?
- sanity check
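
The four steps can be sketched end to end on in-memory toy data before moving to real images. `ToyDataset` and its random features are assumptions made up for this illustration; only the `Dataset`/`DataLoader` API is from PyTorch.

```python
import torch
from torch.utils.data import DataLoader, Dataset


# Step 1: a custom Dataset (hypothetical toy example)
class ToyDataset(Dataset):
    def __init__(self, n):
        self.x = torch.randn(n, 3)     # n samples with 3 features each
        self.y = torch.arange(n) % 2   # alternating 0/1 labels

    def __getitem__(self, index):
        # how to load a single record
        return self.x[index], self.y[index]

    def __len__(self):
        return self.y.shape[0]


# Step 2: instantiate the datasets
train_ds = ToyDataset(80)
val_ds = ToyDataset(20)

# Step 3: instantiate the data loaders
train_dl = DataLoader(train_ds, batch_size=16, shuffle=True, num_workers=0)
val_dl = DataLoader(val_ds, batch_size=16, shuffle=False, num_workers=0)

# Step 4: inspect one batch as a sanity check
features, labels = next(iter(train_dl))
print(features.shape, labels.shape)
```

The loader stacks the individual `(x, y)` records returned by `__getitem__` into batched tensors, so `features` has one extra leading batch dimension.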

## Practice defining efficient Data Loaders

Obtaining the dataset from the GitHub repo.

In [None]:
!pip install gitpython

In [3]:
import os
from git import Repo

if not os.path.exists("mnist-pngs"):
    Repo.clone_from("https://github.com/rasbt/mnist-pngs", "mnist-pngs")

In [4]:
import pandas as pd

df_train = pd.read_csv('mnist-pngs/train.csv')
df_train.head()

Unnamed: 0,filepath,label
0,train/0/16585.png,0
1,train/0/24537.png,0
2,train/0/25629.png,0
3,train/0/20751.png,0
4,train/0/34730.png,0


In [5]:
df_test = pd.read_csv('mnist-pngs/test.csv')
df_test.head()

Unnamed: 0,filepath,label
0,test/0/66062.png,0
1,test/0/64675.png,0
2,test/0/62204.png,0
3,test/0/60407.png,0
4,test/0/67368.png,0


In [6]:
# Create Val split

# shuffle
df_train = df_train.sample(frac=1, random_state=123)

# create val
loc = round(df_train.shape[0]*0.9)
df_new_train = df_train.iloc[:loc]
df_new_val = df_train.iloc[loc:]

# save train-val splits
df_new_train.to_csv('mnist-pngs/new_train.csv', index=False)
df_new_val.to_csv('mnist-pngs/new_val.csv', index=False)
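
A quick way to convince yourself that this kind of split is sound is to run the same logic on a small toy DataFrame and check the sizes and the overlap. The 100-row frame below is made up for the check; the column names mirror the ones in `train.csv`.

```python
import pandas as pd

# Toy stand-in for train.csv (file paths and labels are invented)
df = pd.DataFrame(
    {
        "filepath": [f"train/{i}.png" for i in range(100)],
        "label": [i % 10 for i in range(100)],
    }
)

# Same recipe as above: shuffle, then cut at 90%
shuffled = df.sample(frac=1, random_state=123)
loc = round(shuffled.shape[0] * 0.9)
train_part = shuffled.iloc[:loc]
val_part = shuffled.iloc[loc:]

# The two parts should cover all rows and share none
overlap = set(train_part["filepath"]) & set(val_part["filepath"])
print(len(train_part), len(val_part), len(overlap))  # 90 10 0
```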

### 1) Defining the Dataset Class

In [7]:
from PIL import Image
from torch.utils.data import Dataset

In [8]:
class MyDataset(Dataset):
    def __init__(self, csv_path, img_dir, transform=None):

        df = pd.read_csv(csv_path)
        self.img_dir = img_dir
        self.transform = transform    # optional data transformations

        # based on DataFrame columns
        self.img_names = df["filepath"]
        self.labels = df["label"]

    def __getitem__(self, index):
        img = Image.open(os.path.join(self.img_dir, self.img_names[index]))

        # optional transformation (e.g., data augmentation)
        if self.transform is not None:
            img = self.transform(img)

        label = self.labels[index]
        return img, label

    def __len__(self):
        return self.labels.shape[0]

In [10]:
import matplotlib.pyplot as plt
import numpy as np
import torchvision.utils as vutils

In [None]:
def viz_batch_images(batch):
    """Visualize up to 64 images from a batch given as (images, labels)."""

    plt.figure(figsize=(8, 8))
    plt.axis("off")
    plt.title("Training images")
    plt.imshow(
        np.transpose(
            vutils.make_grid(batch[0][:64], padding=2, normalize=True), (1, 2, 0)
        )
    )
    plt.show()

### Defining optional image transformations

In [11]:
from torchvision import transforms

In [12]:
# Data Augmentations - transformations
data_transforms = {
    "train": transforms.Compose(
        [
            transforms.Resize(32),
            transforms.RandomCrop((28, 28)),
            transforms.ToTensor(),
            # normalize images to [-1, 1] range
            transforms.Normalize((0.5,), (0.5,)),
        ]
    ),
    "test": transforms.Compose(
        [
            transforms.Resize(32),
            transforms.CenterCrop((28, 28)),
            transforms.ToTensor(),
            # normalize images to [-1, 1] range
            transforms.Normalize((0.5,), (0.5,)),
        ]
    ),
}

### 2-3) Defining the DataSets and DataLoaders

In [13]:
from torch.utils.data import DataLoader

train_dataset = MyDataset(
    csv_path="mnist-pngs/new_train.csv",
    img_dir="mnist-pngs/",
    transform=data_transforms["train"],
)

# DataLoader - wrapper around the dataset
train_loader = DataLoader(
    dataset=train_dataset,
    batch_size=32,
    shuffle=True,   # want to shuffle the dataset
    num_workers=0,  # number of worker processes (0 = load in the main process)
)

In [14]:
val_dataset = MyDataset(
    csv_path="mnist-pngs/new_val.csv",
    img_dir="mnist-pngs/",
    transform=data_transforms["test"],
)

val_loader = DataLoader(
    dataset=val_dataset,
    batch_size=32,
    shuffle=False,
    num_workers=0,
)

In [15]:
test_dataset = MyDataset(
    csv_path="mnist-pngs/test.csv",
    img_dir="mnist-pngs/",
    transform=data_transforms["test"],
)

test_loader = DataLoader(
    dataset=test_dataset,
    batch_size=32,
    shuffle=False,
    num_workers=0
)

### 4) Testing the data loaders

In [None]:
import time

num_epochs = 1
for epoch in range(num_epochs):

    for batch_idx, (x, y) in enumerate(train_loader):
        time.sleep(1)
        if batch_idx >= 3:
            break
        print(" Batch index:", batch_idx, end="")
        print(" | Batch size:", y.shape[0], end="")
        print(" | x shape:", x.shape, end="")
        print(" | y shape:", y.shape)

print("Labels from current batch:", y)

# Uncomment to visualize a data batch:
# batch = next(iter(train_loader))
# viz_batch_images(batch)

 Batch index: 0 | Batch size: 32 | x shape: torch.Size([32, 1, 28, 28]) | y shape: torch.Size([32])
 Batch index: 1 | Batch size: 32 | x shape: torch.Size([32, 1, 28, 28]) | y shape: torch.Size([32])
 Batch index: 2 | Batch size: 32 | x shape: torch.Size([32, 1, 28, 28]) | y shape: torch.Size([32])
Labels from current batch: tensor([9, 1, 3, 2, 8, 4, 8, 3, 1, 3, 6, 5, 4, 9, 6, 2, 6, 0, 5, 8, 0, 5, 1, 0,
        4, 2, 3, 8, 5, 5, 4, 6])
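
The "labels - shuffled?" check can also be automated. The sketch below uses a ten-sample toy dataset whose labels equal their positions, so the order in which samples arrive is directly visible; the seed passed to the `generator` argument is an arbitrary choice to make the shuffle reproducible.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

labels = torch.arange(10)  # label i sits at position i
ds = TensorDataset(labels.float().unsqueeze(1), labels)

# shuffle=False: samples arrive in dataset order
ordered = torch.cat([y for _, y in DataLoader(ds, batch_size=4, shuffle=False)])

# shuffle=True: samples arrive in a random order, but each exactly once per epoch
g = torch.Generator().manual_seed(123)
shuffled = torch.cat(
    [y for _, y in DataLoader(ds, batch_size=4, shuffle=True, generator=g)]
)

print("ordered: ", ordered.tolist())
print("shuffled:", shuffled.tolist())
print("same samples:", sorted(shuffled.tolist()) == ordered.tolist())
```

If the training labels come out sorted (for example, all zeros in the first batches, as in the raw `train.csv`), shuffling is not active and the model would see one class at a time.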
