# Recitation 0J : Dataloaders

# Goal
Our goal in this recitation is to get comfortable using the PyTorch DataLoader object.

# Contents


1. Introduction to PyTorch DataLoader
2. Initializing a DataLoader Object
3. Handling Different Batching Strategies
4. Customizing Data Loading with Collate Functions
5. Leveraging Multi-Process Data Loading
6. Optimizing with Pin Memory




## Manual data feed

In [1]:
import torch
import numpy as np

**1 epoch**: one complete pass of the training dataset through the algorithm

**batch_size**: the number of training examples in one forward/backward pass. The higher the batch size, the more memory space you will need.


**Number of iterations per epoch = number of batches**: each iteration is one forward/backward pass over a single batch of batch_size examples.

Example: With 100 training examples and batch size of 20 it will take 5 iterations to complete 1 epoch.
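The arithmetic above can be sketched in a few lines. The helper name `iterations_per_epoch` is hypothetical (it is not part of PyTorch), and the sketch assumes the final, smaller batch is kept rather than dropped.

```python
import math


def iterations_per_epoch(n_samples, batch_size):
    """Number of batches needed for one full pass over n_samples examples.

    Assumes the last batch is kept even if it holds fewer than batch_size examples.
    """
    return math.ceil(n_samples / batch_size)


print(iterations_per_epoch(100, 20))  # 5
print(iterations_per_epoch(10, 4))    # 3 (the last batch holds only 2 examples)
```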



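```
import numpy as np

# Placeholder data: 10000 input samples (8 features each, chosen arbitrarily)
x = np.random.rand(10000, 8)
# 10000 target labels corresponding to x
y = np.random.randint(0, 2, size=10000)

batch_size = 100
n_batches = len(x) // batch_size

# Load data manually in batches
for epoch in range(10):
    for i in range(n_batches):
        # Local batches and labels: slice by batch_size, not by n_batches
        start, end = i * batch_size, (i + 1) * batch_size
        local_X, local_y = x[start:end], y[start:end]

        # Your model
        [...]
```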



# Dataloaders (PyTorch)

Documentation:
[Read Docs](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html)

The Dataset retrieves our dataset's features and labels one sample at a time. While training a model, we typically want to:

1.   Pass samples in “minibatches”
2.   Reshuffle the data at every epoch to reduce model overfitting
3.   Use Python's multiprocessing to speed up data retrieval
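As a rough illustration of the first two points, the sketch below iterates over the same loader for two epochs and records the order in which samples arrive. It assumes a toy `TensorDataset` with target = 2 x input; the names `toy_dataset` and `epoch_orders` are made up for this example.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy data (an assumption for this sketch): 10 inputs with target = 2 * input.
xs = torch.arange(10)
ys = 2 * xs
toy_dataset = TensorDataset(xs, ys)

# shuffle=True draws a fresh permutation each time the loader is iterated,
# i.e. once per epoch.
loader = DataLoader(toy_dataset, batch_size=4, shuffle=True)

epoch_orders = []
for epoch in range(2):
    seen = []
    for x_batch, y_batch in loader:
        seen.extend(x_batch.tolist())
    epoch_orders.append(seen)
    print(f"epoch {epoch}: inputs arrived in order {seen}")
```

Every epoch still visits each sample exactly once; only the order (and therefore the composition of each minibatch) changes.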

# Sample DataLoader

Handles data loading logic


In [3]:
from torch.utils.data import Dataset, DataLoader
# Dataloader will use dataset to create batches, process data etc.
# Visit Dataset Recitation for more details

class MyDataset(Dataset):
    # constructor, in this case it contains the data
    def __init__(self, xs, ys):
        self.input = xs
        self.target = ys

    # returns the length of the dataset
    def __len__(self):
        return len(self.input)

    # returns the item at index i
    def __getitem__(self, i):
        return self.input[i], self.target[i]

Suppose you want to train a model to learn that target = 2 x input. The following dummy dataset encodes that relationship:

In [None]:
# We are creating a dummy dataset to test Dataloaders
input = list(range(10))
target = list(range(0, 20, 2))
print('input values: ', input)
print('target values: ', target)

# Create an instance of MyDataset class
dataset = MyDataset(input, target)
print("The sample at index 2 is: ", dataset[2]) # returns the tuple (input[2], target[2])
# This is equivalent to
print("The sample at index 2 is: ", dataset.__getitem__(2))
# which is the method the DataLoader calls to fetch each sample

input values:  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
target values:  [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
The sample at index 2 is:  (2, 4)
The sample at index 2 is:  (2, 4)


### Let's look at different ways of creating a DataLoader object


In [None]:
# batch size of 1 (the default), so x and y each hold one sample; no shuffling
for x, y in DataLoader(dataset):
    print(f"batch of inputs: {x}, batch of labels: {y}")

batch of inputs: tensor([0]), batch of labels: tensor([0])
batch of inputs: tensor([1]), batch of labels: tensor([2])
batch of inputs: tensor([2]), batch of labels: tensor([4])
batch of inputs: tensor([3]), batch of labels: tensor([6])
batch of inputs: tensor([4]), batch of labels: tensor([8])
batch of inputs: tensor([5]), batch of labels: tensor([10])
batch of inputs: tensor([6]), batch of labels: tensor([12])
batch of inputs: tensor([7]), batch of labels: tensor([14])
batch of inputs: tensor([8]), batch of labels: tensor([16])
batch of inputs: tensor([9]), batch of labels: tensor([18])


In [None]:
# batch size of 4, so x and y each hold up to 4 samples (the last batch has only 2); no shuffling
for x, y in DataLoader(dataset, batch_size=4):
    print(f"batch of inputs: {x}, batch of labels: {y}")

batch of inputs: tensor([0, 1, 2, 3]), batch of labels: tensor([0, 2, 4, 6])
batch of inputs: tensor([4, 5, 6, 7]), batch of labels: tensor([ 8, 10, 12, 14])
batch of inputs: tensor([8, 9]), batch of labels: tensor([16, 18])


In [None]:
# batch size of 4, so x and y each hold up to 4 samples (the last batch has only 2); random shuffle
for x, y in DataLoader(dataset, batch_size=4, shuffle=True):
    print(f"batch of inputs: {x}, batch of labels: {y}")

batch of inputs: tensor([4, 9, 8, 5]), batch of labels: tensor([ 8, 18, 16, 10])
batch of inputs: tensor([0, 1, 7, 6]), batch of labels: tensor([ 0,  2, 14, 12])
batch of inputs: tensor([2, 3]), batch of labels: tensor([4, 6])


In [None]:
# batch size of 4, drop the last batch with less than 4 samples
for x, y in DataLoader(dataset, batch_size=4, shuffle=True, drop_last=True):
    print(f"batch of inputs: {x}, batch of labels: {y}")

batch of inputs: tensor([4, 9, 0, 6]), batch of labels: tensor([ 8, 18,  0, 12])
batch of inputs: tensor([3, 8, 5, 7]), batch of labels: tensor([ 6, 16, 10, 14])
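The effect of `drop_last` on the number of batches can be checked with `len()` on the loader. This is a minimal sketch on an assumed toy `TensorDataset` of 10 samples; the variable names are illustrative only.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy data (an assumption for this sketch): 10 samples with target = 2 * input.
toy_dataset = TensorDataset(torch.arange(10), 2 * torch.arange(10))

keep_loader = DataLoader(toy_dataset, batch_size=4)                  # keeps the short last batch
drop_loader = DataLoader(toy_dataset, batch_size=4, drop_last=True)  # discards it

# len() of a DataLoader is the number of batches per epoch.
print(len(keep_loader))  # 3
print(len(drop_loader))  # 2

keep_sizes = [len(x) for x, _ in keep_loader]
drop_sizes = [len(x) for x, _ in drop_loader]
print(keep_sizes)  # [4, 4, 2]
print(drop_sizes)  # [4, 4]
```

Dropping the last batch is mostly useful when downstream code expects every batch to have exactly `batch_size` samples.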


# Collate function

```collate_fn``` is a DataLoader parameter that controls how a list of individual samples is merged into a batch, so it can be used to customize automatic batching.

If a transformation has to be computed over a whole batch rather than per sample, apply it in the collate function instead of in the dataset class.
When ```num_workers``` is greater than 0, ```collate_fn()``` runs inside the worker processes, so it also benefits from the multi-process speed-up.
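One common batch-level operation, shown here only as an illustration, is padding variable-length sequences to a common length. The sketch below assumes each sample is a 1-D tensor of arbitrary length; `VariableLengthDataset` and `pad_collate` are hypothetical names, while `pad_sequence` comes from `torch.nn.utils.rnn`.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset


class VariableLengthDataset(Dataset):
    """Hypothetical dataset whose samples are 1-D tensors of different lengths."""

    def __init__(self, sequences, labels):
        self.sequences = sequences
        self.labels = labels

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, i):
        return self.sequences[i], self.labels[i]


def pad_collate(batch):
    # batch is a list of (sequence, label) tuples, one per sample
    sequences, labels = zip(*batch)
    lengths = torch.tensor([len(s) for s in sequences])
    # Pad every sequence with zeros up to the longest one in this batch
    padded = pad_sequence(list(sequences), batch_first=True, padding_value=0.0)
    return padded, lengths, torch.tensor(labels)


sequences = [torch.ones(n) for n in (3, 5, 2, 4)]
labels = [0, 1, 0, 1]
padded_loader = DataLoader(
    VariableLengthDataset(sequences, labels), batch_size=2, collate_fn=pad_collate
)

batch_shapes = []
for padded, lengths, y in padded_loader:
    batch_shapes.append(tuple(padded.shape))
    print(padded.shape, lengths.tolist(), y.tolist())
```

The default collate function would fail here because it tries to stack tensors of different lengths, which is why the padding has to happen at the batch level.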

In [None]:
# Define a custom dataset class that also provides a collate function
class MyNormalDataset(Dataset):
    # constructor, in this case it contains the data
    def __init__(self, xs, ys):
        self.input = xs
        self.target = ys

    # returns the length of the dataset
    def __len__(self):
        return len(self.input)

    # returns the item at index i
    def __getitem__(self, i):
        return self.input[i], self.target[i]

    # normalizes the inputs of each batch to zero mean and unit standard deviation
    def collate_fn(self, batch):
        x, y = zip(*batch)
        x = np.asarray(x)  # convert the tuple of samples to an array
        x_mean = np.mean(x)
        x_std = np.std(x)
        x_normal = (x - x_mean) / (x_std + 1e-9)
        return x_normal, y


input = np.array(list(range(10)))
target = np.array(list(range(0, 20, 2)))
print('input values: ', input)
print('target values: ', target)

# Create an instance of MyNormalDataset class
dataset = MyNormalDataset(input, target)
# Pass the custom collate_fn to the DataLoader
train_dataloader_custom = DataLoader(dataset, batch_size=5, shuffle=True, collate_fn=dataset.collate_fn)

# Display collated inputs and labels.
for i, (x, y) in enumerate(train_dataloader_custom):
    print(f"batch of inputs: {x}, batch of labels: {y}")


input values:  [0 1 2 3 4 5 6 7 8 9]
target values:  [ 0  2  4  6  8 10 12 14 16 18]
batch of inputs: [-0.62092042 -0.23284516  1.31945589  0.93138063 -1.39707095], batch of labels: (8, 10, 18, 16, 4)
batch of inputs: [-0.87988269  0.95320625  1.31982403 -0.14664711 -1.24650048], batch of labels: (2, 12, 14, 6, 0)


## Single and multi-process loading

We can use the ```num_workers``` argument to specify how many subprocesses to use for data loading. \
0 (the default) means that the data will be loaded in the main process.

In [None]:
# 2 subprocesses
train_dataloader_fast = DataLoader(dataset, batch_size=5, shuffle=True, collate_fn=dataset.collate_fn, num_workers=2)

In [None]:
# The maximum number of subprocesses you can use depends on the machine you are training on;
# you can try increasing it until you see a warning.

train_dataloader_fast = DataLoader(dataset, batch_size=5, shuffle=True, collate_fn=dataset.collate_fn, num_workers=4)
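Rather than finding the limit by trial and error, one possible heuristic is to cap the requested worker count at the number of CPUs the operating system reports. This is only a sketch: `pick_num_workers` is a hypothetical helper, not a PyTorch function, and the best value in practice also depends on memory and on how expensive each sample is to load.

```python
import os

import torch
from torch.utils.data import DataLoader, TensorDataset


def pick_num_workers(requested):
    """Clamp the requested worker count to the CPU count reported by the OS."""
    available = os.cpu_count() or 1  # os.cpu_count() may return None
    return max(0, min(requested, available))


# Toy data (an assumption for this sketch)
toy_dataset = TensorDataset(torch.arange(10), 2 * torch.arange(10))

num_workers = pick_num_workers(4)
# Worker processes are only started when the loader is iterated, not here.
worker_loader = DataLoader(toy_dataset, batch_size=5, num_workers=num_workers)
print(f"using {worker_loader.num_workers} worker process(es)")
```

On platforms that start subprocesses with the spawn method (for example Windows), a script that iterates over a multi-worker loader should do so under an `if __name__ == "__main__":` guard.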

Set ```pin_memory=True``` to have the DataLoader copy tensors into pinned (page-locked) host memory before returning them, which makes the subsequent transfer to a CUDA device faster.

In [None]:
train_dataloader_faster = DataLoader(dataset, batch_size=5, shuffle=True, collate_fn=dataset.collate_fn, num_workers=4, pin_memory=True)
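Pinned memory is typically paired with a non-blocking copy to the GPU inside the training loop. The sketch below assumes a toy `TensorDataset` and falls back to the CPU when no CUDA device is present, in which case pinning is switched off because it brings no benefit.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Toy data (an assumption for this sketch): target = 2 * input
inputs = torch.arange(10, dtype=torch.float32)
toy_dataset = TensorDataset(inputs, 2 * inputs)

# Only pin host memory when there is a GPU to copy to.
use_pinned = torch.cuda.is_available()
pinned_loader = DataLoader(toy_dataset, batch_size=5, pin_memory=use_pinned)

for x, y in pinned_loader:
    # With pinned source memory, non_blocking=True lets the host-to-device copy
    # run asynchronously; on the CPU these calls simply return the same tensor.
    x = x.to(device, non_blocking=True)
    y = y.to(device, non_blocking=True)
    print(x.device, y.device)
```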