# DataLoader

At the heart of PyTorch's data loading utility is the torch.utils.data.DataLoader class. It represents a Python iterable over a dataset, with support for

* map-style and iterable-style datasets,

* customizing data loading order,

* automatic batching,

* single- and multi-process data loading,

* automatic memory pinning.

These options are configured by the constructor arguments of a DataLoader, which has the signature:
~~~
DataLoader(dataset, batch_size=1, shuffle=False, sampler=None,
           batch_sampler=None, num_workers=0, collate_fn=None,
           pin_memory=False, drop_last=False, timeout=0,
           worker_init_fn=None, *, prefetch_factor=2,
           persistent_workers=False)
~~~
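
The notebook below uses a map-style dataset (one that implements `__getitem__` and `__len__`). For contrast, here is a minimal sketch of an iterable-style dataset. The class name `CountingStream` and its contents are hypothetical and chosen only for illustration.

```python
import torch
from torch.utils.data import DataLoader, IterableDataset


class CountingStream(IterableDataset):
    """Hypothetical iterable-style dataset: samples come from __iter__ only."""

    def __init__(self, n):
        super().__init__()
        self.n = n

    def __iter__(self):
        # No __getitem__ or __len__: samples are produced in stream order.
        for i in range(self.n):
            yield torch.tensor([float(i), float(i) + 0.5])


# Batching still works; the loader groups consecutive samples from the stream.
stream_loader = DataLoader(CountingStream(5), batch_size=2)
stream_batches = list(stream_loader)
for batch in stream_batches:
    print(batch.shape)
```

Because an iterable-style dataset has no keys, the loading order is determined by the dataset's own iterator, so `shuffle` and `sampler` are left at their defaults here.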

In [1]:
import torch
from torch.utils.data import Dataset, DataLoader

In [2]:
## Step 1: get your dataset

In [3]:
sample_size = 9
number_features =  3

features = torch.arange(sample_size*number_features).reshape(sample_size, number_features) * 1.0
label = torch.randint(low=0, high=3, size=(sample_size,))

print(features)
print(label)

tensor([[ 0.,  1.,  2.],
        [ 3.,  4.,  5.],
        [ 6.,  7.,  8.],
        [ 9., 10., 11.],
        [12., 13., 14.],
        [15., 16., 17.],
        [18., 19., 20.],
        [21., 22., 23.],
        [24., 25., 26.]])
tensor([2, 1, 2, 0, 0, 1, 1, 1, 1])


In [4]:
class CustomDataset(Dataset):
    def __init__(self, features, labels):
        super(CustomDataset, self).__init__()
        self.features = features
        self.labels = labels
        assert features.shape[0] == labels.shape[0]
    
    def __getitem__(self, i):
        return self.features[i], self.labels[i]
    
    def __len__(self):
        return self.features.shape[0]

In [5]:
dataset = CustomDataset(features, label)

In [6]:
## Step 2: create a DataLoader

## Loading Batched and Non-Batched Data

DataLoader supports automatically collating individual fetched data samples into batches via arguments batch_size, drop_last, batch_sampler, and collate_fn (which has a default function).

### Automatic batching (default)

This is the most common case, and corresponds to fetching a minibatch of data and collating them into batched samples, i.e., containing Tensors with one dimension being the batch dimension (usually the first).

When batch_size (default 1) is not None, the data loader yields batched samples instead of individual samples. The batch_size and drop_last arguments specify how the data loader obtains batches of dataset keys. For map-style datasets, users can alternatively specify batch_sampler, which yields a list of keys at a time.

In [7]:
dataloader = DataLoader(dataset=dataset, batch_size=2, shuffle=False, drop_last=False)

In [8]:
for batch_sample in dataloader:
    print(batch_sample)
    print("--------")

[tensor([[0., 1., 2.],
        [3., 4., 5.]]), tensor([2, 1])]
--------
[tensor([[ 6.,  7.,  8.],
        [ 9., 10., 11.]]), tensor([2, 0])]
--------
[tensor([[12., 13., 14.],
        [15., 16., 17.]]), tensor([0, 1])]
--------
[tensor([[18., 19., 20.],
        [21., 22., 23.]]), tensor([1, 1])]
--------
[tensor([[24., 25., 26.]]), tensor([1])]
--------
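
The last batch above holds a single sample because 9 is not divisible by 2. The following sketch shows how `drop_last` changes the number of batches. It uses `TensorDataset` as a stand-in for the `CustomDataset` defined earlier so that the snippet runs on its own.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in for the 9-sample dataset used in this notebook.
features = torch.arange(27, dtype=torch.float32).reshape(9, 3)
labels = torch.zeros(9, dtype=torch.long)
dataset = TensorDataset(features, labels)

keep_loader = DataLoader(dataset, batch_size=2, drop_last=False)
drop_loader = DataLoader(dataset, batch_size=2, drop_last=True)

# Number of batches per epoch: ceil(9 / 2) versus floor(9 / 2).
print(len(keep_loader))  # 5
print(len(drop_loader))  # 4

# With drop_last=True every batch has exactly batch_size samples.
drop_sizes = [x.shape[0] for x, _ in drop_loader]
print(drop_sizes)  # [2, 2, 2, 2]
```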


In [9]:
### shuffle=True

In [10]:
dataloader = DataLoader(dataset=dataset, batch_size=2, shuffle=True, drop_last=False)

In [11]:
for batch_sample in dataloader:
    print(batch_sample)
    print("--------")

[tensor([[0., 1., 2.],
        [6., 7., 8.]]), tensor([2, 2])]
--------
[tensor([[15., 16., 17.],
        [21., 22., 23.]]), tensor([1, 1])]
--------
[tensor([[ 3.,  4.,  5.],
        [ 9., 10., 11.]]), tensor([1, 0])]
--------
[tensor([[18., 19., 20.],
        [24., 25., 26.]]), tensor([1, 1])]
--------
[tensor([[12., 13., 14.]]), tensor([0])]
--------
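
The shuffled order above changes from run to run. If a reproducible order is needed, one option is to pass a seeded `torch.Generator` through the `generator` argument of `DataLoader` (an argument not listed in the abbreviated signature above). The helper `epoch_order` is a hypothetical name used only in this sketch.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

features = torch.arange(27, dtype=torch.float32).reshape(9, 3)
labels = torch.arange(9)  # label i identifies sample i, so the order is visible
dataset = TensorDataset(features, labels)


def epoch_order(seed):
    """Return the label batches of one shuffled epoch for a given seed."""
    g = torch.Generator()
    g.manual_seed(seed)
    loader = DataLoader(dataset, batch_size=2, shuffle=True, generator=g)
    return [y.tolist() for _, y in loader]


order_a = epoch_order(0)
order_b = epoch_order(0)
print(order_a == order_b)  # True: same seed, same shuffled order
```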


## Setting a custom sampler

A sequential or shuffled sampler is constructed automatically based on the shuffle argument to a DataLoader. Alternatively, users may use the sampler argument to specify a custom Sampler object that yields the next index/key to fetch each time.

A custom Sampler that yields a list of batch indices at a time can be passed as the batch_sampler argument. Automatic batching can also be enabled via the batch_size and drop_last arguments, as described in the section on loading batched and non-batched data above.
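
As a minimal sketch of the batch_sampler route, the built-in `BatchSampler` can wrap another sampler and yield lists of indices. When `batch_sampler` is given, `batch_size`, `shuffle`, `sampler`, and `drop_last` are left at their defaults on the `DataLoader` itself. `TensorDataset` again stands in for the notebook's `CustomDataset`.

```python
import torch
from torch.utils.data import (
    BatchSampler,
    DataLoader,
    SequentialSampler,
    TensorDataset,
)

features = torch.arange(27, dtype=torch.float32).reshape(9, 3)
labels = torch.arange(9)
dataset = TensorDataset(features, labels)

# Each item produced by the batch sampler is a list of dataset indices.
batch_sampler = BatchSampler(
    SequentialSampler(dataset), batch_size=4, drop_last=False
)
index_batches = list(batch_sampler)
print(index_batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8]]

# batch_size and drop_last are configured on the BatchSampler, not here.
loader = DataLoader(dataset, batch_sampler=batch_sampler)
label_batches = [y.tolist() for _, y in loader]
print(label_batches)
```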

In [12]:
sampler = torch.utils.data.RandomSampler(data_source=dataset, replacement=False, num_samples=None)

In [13]:
dataloader = DataLoader(dataset=dataset, batch_size=2, shuffle=False, sampler=sampler, drop_last=False) 

In [14]:
sampler = torch.utils.data.RandomSampler(data_source=dataset, replacement=True, num_samples=5)

In [15]:
dataloader = DataLoader(dataset=dataset, batch_size=2, shuffle=False, sampler=sampler, drop_last=False)
for batch_sample in dataloader:
    print(batch_sample)
    print("--------")

[tensor([[ 6.,  7.,  8.],
        [ 9., 10., 11.]]), tensor([2, 0])]
--------
[tensor([[15., 16., 17.],
        [ 0.,  1.,  2.]]), tensor([1, 2])]
--------
[tensor([[21., 22., 23.]]), tensor([1])]
--------
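
`RandomSampler` is one of the built-in samplers. A custom sampler only needs to yield dataset indices from `__iter__`. The sketch below uses a hypothetical `ReverseSampler` that visits the samples from last to first.

```python
import torch
from torch.utils.data import DataLoader, Sampler, TensorDataset


class ReverseSampler(Sampler):
    """Hypothetical sampler: yields indices from the last sample to the first."""

    def __init__(self, data_source):
        self.data_source = data_source

    def __iter__(self):
        return iter(range(len(self.data_source) - 1, -1, -1))

    def __len__(self):
        return len(self.data_source)


features = torch.arange(27, dtype=torch.float32).reshape(9, 3)
labels = torch.arange(9)
dataset = TensorDataset(features, labels)

# shuffle must stay False when an explicit sampler is passed.
loader = DataLoader(dataset, batch_size=4, sampler=ReverseSampler(dataset))
reversed_batches = [y.tolist() for _, y in loader]
print(reversed_batches)  # [[8, 7, 6, 5], [4, 3, 2, 1], [0]]
```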


## Working with collate_fn

The use of collate_fn differs slightly depending on whether automatic batching is enabled or disabled.

When automatic batching is disabled, collate_fn is called with each individual data sample, and the output is yielded from the data loader iterator. In this case, the default collate_fn simply converts NumPy arrays into PyTorch tensors.
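
Automatic batching is disabled by passing `batch_size=None` (with no `batch_sampler`). A minimal sketch, using tensors rather than NumPy arrays so that it has no extra dependencies: each sample is yielded on its own, without a batch dimension.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

features = torch.arange(27, dtype=torch.float32).reshape(9, 3)
labels = torch.arange(9)
dataset = TensorDataset(features, labels)

# batch_size=None turns automatic batching off.
loader = DataLoader(dataset, batch_size=None)
print(len(loader))  # 9: one item per sample

first_x, first_y = next(iter(loader))
print(first_x.shape)  # torch.Size([3]): no leading batch dimension
print(first_y.dim())  # 0: the label is a scalar tensor
```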

When automatic batching is enabled, collate_fn is called with a list of data samples each time. It is expected to collate the input samples into a batch for yielding from the data loader iterator. The rest of this section describes the behavior of the default collate_fn (default_collate()).

For instance, if each data sample consists of a 3-channel image and an integral class label, i.e., each element of the dataset returns a tuple (image, class_index), the default collate_fn collates a list of such tuples into a single tuple of a batched image tensor and a batched class label Tensor. In particular, the default collate_fn has the following properties:

* It always prepends a new dimension as the batch dimension.

* It automatically converts NumPy arrays and Python numerical values into PyTorch Tensors.

* It preserves the data structure, e.g., if each sample is a dictionary, it outputs a dictionary with the same set of keys but batched Tensors as values (or lists if the values cannot be converted into Tensors). The same applies to lists, tuples, namedtuples, etc.
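
The properties listed above can be observed with a small sketch in which each sample is a dictionary. The sample contents are made up for illustration, and a plain Python list serves as the map-style dataset.

```python
import torch
from torch.utils.data import DataLoader

# Hypothetical samples: a tensor, a Python int, and a string per sample.
samples = [
    {"x": torch.tensor([1.0, 2.0]), "y": 0, "name": "a"},
    {"x": torch.tensor([3.0, 4.0]), "y": 1, "name": "b"},
]

# No collate_fn is given, so the default collate function is used.
loader = DataLoader(samples, batch_size=2)
batch = next(iter(loader))

print(sorted(batch.keys()))  # ['name', 'x', 'y']: same keys as each sample
print(batch["x"].shape)      # torch.Size([2, 2]): batch dimension prepended
print(batch["y"].tolist())   # [0, 1]: Python ints became one tensor
print(batch["name"])         # strings are not converted and stay in a list
```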

Users may use customized collate_fn to achieve custom batching, e.g., collating along a dimension other than the first, padding sequences of various lengths, or adding support for custom data types.

If the outputs of your DataLoader have dimensions or types that differ from what you expect, check your collate_fn.

> Basically, collate_fn receives a list of tuples if the __getitem__ method of your Dataset subclass returns a tuple, or a plain list of elements if it returns a single element. Its main purpose is to build your batch so that you do not have to spend much time assembling it manually.

In [29]:
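def custom_collate_fn(batch):
    # batch is a list of (feature, label) tuples, one per sample
    batch_feature_list = []
    batch_label_list = []
    for sample in batch:
        feature, label = sample
        # Arbitrary per-sample transforms, for illustration only
        new_feature = feature + 0.1
        new_label = label + 2

        batch_feature_list.append(new_feature)
        batch_label_list.append(new_label)
    # Stack along a new leading (batch) dimension
    return [torch.stack(batch_feature_list, 0), torch.stack(batch_label_list, 0)]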

In [30]:
dataloader = DataLoader(dataset=dataset, batch_size=2, shuffle=False, drop_last=False, collate_fn = custom_collate_fn)

In [31]:
for batch_sample in dataloader:
    print(batch_sample)
    print("--------")

[tensor([[0.1000, 1.1000, 2.1000],
        [3.1000, 4.1000, 5.1000]]), tensor([4, 3])]
--------
[tensor([[ 6.1000,  7.1000,  8.1000],
        [ 9.1000, 10.1000, 11.1000]]), tensor([4, 2])]
--------
[tensor([[12.1000, 13.1000, 14.1000],
        [15.1000, 16.1000, 17.1000]]), tensor([2, 3])]
--------
[tensor([[18.1000, 19.1000, 20.1000],
        [21.1000, 22.1000, 23.1000]]), tensor([3, 3])]
--------
[tensor([[24.1000, 25.1000, 26.1000]]), tensor([3])]
--------
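
Padding sequences of different lengths, mentioned earlier as a use case for a custom collate_fn, can be sketched as follows. The function name `pad_collate` and the toy sequences are hypothetical. The padding itself relies on `torch.nn.utils.rnn.pad_sequence`.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

# Toy dataset: (sequence, label) pairs whose sequences differ in length.
sequences = [
    (torch.tensor([1, 2, 3]), 0),
    (torch.tensor([4, 5]), 1),
    (torch.tensor([6]), 0),
]


def pad_collate(batch):
    """Pad sequences to the longest one in the batch and keep true lengths."""
    seqs, labels = zip(*batch)
    lengths = torch.tensor([len(s) for s in seqs])
    padded = pad_sequence(list(seqs), batch_first=True, padding_value=0)
    return padded, lengths, torch.tensor(labels)


loader = DataLoader(sequences, batch_size=3, collate_fn=pad_collate)
padded, lengths, labels = next(iter(loader))
print(padded.tolist())   # [[1, 2, 3], [4, 5, 0], [6, 0, 0]]
print(lengths.tolist())  # [3, 2, 1]
```

The default collate function would fail on this data because tensors of different lengths cannot be stacked. Returning the original lengths alongside the padded tensor lets downstream code ignore the padding.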
