# Datasets and Dataloaders

PyTorch takes a lot of the data handling during training off our hands. A `DataLoader` serves the data in batches and, if we ask it to, reshuffles the data in every epoch of training. To do that, it needs to know how to fetch the data and what a single sample looks like. This is where the `Dataset` class comes in.

### Dataset

Let's pick a very simple dataset to work with. Many of you may be familiar with the iris dataset from R. It is also available in Python, for example through scikit-learn.

In [1]:
# import some super basic data
from sklearn.datasets import load_iris

# load the iris dataset
iris = load_iris(as_frame=True)
print(iris.keys())

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])


With the `Dataset` class, we can prepare the iris dataset for PyTorch. We need to inherit from the PyTorch `Dataset` class and define the `__init__()` constructor as well as the `__len__()` and `__getitem__()` methods.

Out of the box, the dataloader can only batch certain types of data. Among these are tensors, NumPy arrays and plain Python numbers. So we should know what format our data is in and change it if necessary.
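
To make this concrete, here is a minimal sketch of what the dataloader does with NumPy arrays and Python integers. The toy `samples` list is made up for illustration and has nothing to do with iris. A plain list is enough here because it supports indexing and `len()`.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader

# Hypothetical toy data: six (features, label) pairs made of NumPy arrays and Python ints.
samples = [(np.array([i, i + 0.5], dtype=np.float32), i % 3) for i in range(6)]

# The default batching logic stacks the samples and converts them to tensors.
loader = DataLoader(samples, batch_size=3)
features, labels = next(iter(loader))

print(type(features), features.shape, features.dtype)
print(labels)
```

Both the features and the labels come back as tensors, even though we never created a tensor ourselves. A pandas `DataFrame` gets no such treatment, which is why the conversion in our own `Dataset` matters.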

In [52]:
print(type(iris.data))
print(type(iris.target))

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>


The iris data is still in pandas format (a `DataFrame` and a `Series`), so we will convert both to tensors in our `__init__()` constructor.

In [11]:
# create a Dataset class for the iris data

# import torch and the Dataset class
import torch
from torch.utils.data import Dataset

class IrisDataset(Dataset):
    """This is a child of Dataset providing the iris data to the dataloader
    The __len__ and __getitem__ methods are what the dataloader relies on"""
    def __init__(self, data, targets):
        super().__init__()
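        # .values yields NumPy arrays; torch.tensor makes the float32 dtype explicit
        self.data = torch.tensor(data.values, dtype=torch.float32)
        # kept as float32 here; cast to torch.long if your loss expects class indices
        self.targets = torch.tensor(targets.values, dtype=torch.float32)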
    
    def __len__(self):
        return self.data.shape[0]
    
    def __getitem__(self, idx):
        # for a given index, return the data and target
        return self.data[idx,:], self.targets[idx]

iris_dataset = IrisDataset(iris.data, iris.target)
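
To see what the two methods buy us, here is a small sketch with a made-up `ToyDataset` (a hypothetical stand-in with synthetic numbers, so that it runs without the iris data). Calling `len()` on the dataset goes through `__len__()`, and indexing with square brackets goes through `__getitem__()`.

```python
import torch
from torch.utils.data import Dataset

class ToyDataset(Dataset):
    """Hypothetical stand-in for IrisDataset, filled with synthetic numbers."""
    def __init__(self, n_samples=10, n_features=4):
        super().__init__()
        # rows 0..n_samples-1, each with n_features consecutive numbers
        self.data = torch.arange(n_samples * n_features, dtype=torch.float32).reshape(n_samples, n_features)
        # three made-up classes, cycling 0, 1, 2, 0, ...
        self.targets = torch.arange(n_samples) % 3

    def __len__(self):
        return self.data.shape[0]

    def __getitem__(self, idx):
        return self.data[idx, :], self.targets[idx]

toy = ToyDataset()
n = len(toy)      # calls __len__
x0, y0 = toy[0]   # calls __getitem__ with idx=0
print(n)
print(x0, y0)
```

The same two calls work on `iris_dataset`, which is a quick way to check a new `Dataset` before handing it to a dataloader.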

With that, the most complicated part of dealing with the data is already done.
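
One detail worth knowing about the conversion to tensors is the dtype you end up with. The sketch below uses a made-up integer label array (an assumption for illustration, not the iris targets) to compare the legacy `torch.Tensor` constructor with `torch.tensor`.

```python
import numpy as np
import torch

# Hypothetical integer labels, similar in spirit to class targets.
int_labels = np.array([0, 1, 2, 1], dtype=np.int64)

as_float = torch.Tensor(int_labels)                    # legacy constructor: result is float32
as_int = torch.tensor(int_labels)                      # dtype is inferred from the array: int64
as_long = torch.tensor(int_labels, dtype=torch.long)   # dtype stated explicitly

print(as_float.dtype, as_int.dtype, as_long.dtype)
```

Float targets are fine for printing batches. Losses that expect class indices, such as `nn.CrossEntropyLoss`, need integer (`torch.long`) targets instead.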

### Dataloader

Now we are ready to use a dataloader to provide us with random batches of our data during training. There is a lot in the docs that makes it look complicated, but don't worry. Getting started with dataloaders is really simple. For a while, you will not need more than this:

In [12]:
# now we can create a dataloader
from torch.utils.data import DataLoader
iris_dataloader = DataLoader(iris_dataset, batch_size=8, shuffle=True)

# now we can iterate over the dataloader
for data, target in iris_dataloader:
    print(data)
    print(target)
    break

tensor([[4.7000, 3.2000, 1.6000, 0.2000],
        [5.9000, 3.0000, 5.1000, 1.8000],
        [4.6000, 3.2000, 1.4000, 0.2000],
        [6.9000, 3.1000, 5.1000, 2.3000],
        [6.1000, 2.6000, 5.6000, 1.4000],
        [5.5000, 2.4000, 3.8000, 1.1000],
        [5.6000, 3.0000, 4.5000, 1.5000],
        [5.2000, 2.7000, 3.9000, 1.4000]])
tensor([0., 2., 0., 2., 2., 1., 1., 1.])
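
To get a feeling for what happens over a full epoch, here is a minimal sketch that counts the batches. It uses `TensorDataset` with 150 synthetic samples (the same number of rows as iris, but the values are made up) so that it runs on its own.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in with 150 samples; the values are placeholders, not iris measurements.
features = torch.arange(150, dtype=torch.float32).unsqueeze(1)
labels = torch.arange(150) % 3
dataset = TensorDataset(features, labels)

loader = DataLoader(dataset, batch_size=8, shuffle=True)

batch_sizes = []
first_batches = []
for epoch in range(2):
    for i, (x, y) in enumerate(loader):
        if epoch == 0:
            batch_sizes.append(x.shape[0])
        if i == 0:
            # remember the first batch of each epoch to compare the shuffling
            first_batches.append(x.squeeze(1).tolist())

print(len(batch_sizes), batch_sizes[-1])
print(first_batches[0])
print(first_batches[1])
```

With 150 samples and `batch_size=8`, one epoch consists of 19 batches, and the last one holds only the remaining 6 samples. Because of `shuffle=True`, the order is drawn anew in every epoch, so the two printed first batches will usually differ. If the smaller last batch is a problem, `DataLoader` has a `drop_last=True` option that skips it.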
