<div>
<img src="https://discuss.pytorch.org/uploads/default/original/2X/3/35226d9fbc661ced1c5d17e374638389178c3176.png" width="400" style="margin: 50px auto; display: block; position: relative; left: -30px;" />
</div>

<!--NAVIGATION-->
# [< Basics](3-Basics.ipynb) | Data | [Autograd >](3-Autograd.ipynb)

### Data

In this notebook, we will see how to handle data in PyTorch using the `Dataset` and `DataLoader` classes.

### Table of Contents

#### 1. [Dataset](#Dataset)
#### 2. [Dataloader](#Dataloader)
#### 3. [A Real Example](#A-real-example:-Alien-vs-Predator)

---

## Dataset

In [None]:
import torch
from torch.utils.data import Dataset

To work with data, PyTorch provides a `Dataset` class that can be subclassed.  
A dataset is an object that can be queried with an index and returns the corresponding sample.  

A subclass should implement two methods:
- `__len__`: returns the size of the dataset
- `__getitem__`: returns the sample stored at a given index

<p>
<img src="figures/dataset.png" width="600" style="margin-left: auto;margin-right: auto;display: block;" />
</p>


In [None]:
class DummyDataset(Dataset):
    def __init__(self):
        self.data = torch.rand(10, 2)
        
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, index):
        sample = self.data[index]
        label = sample[0] > sample[1]
        return (sample, label)

In [None]:
dataset = DummyDataset()

In [None]:
dataset.data

When indexed, the dataset returns a tuple `(sample, class label)`.

In [None]:
dataset[1]
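
As a quick check of the two-method contract, the sketch below repeats the `DummyDataset` definition so that it runs on its own, then unpacks one item. It only relies on `len()` and indexing, which call `__len__` and `__getitem__`; the name `check_dataset` is just an illustrative variable.

```python
import torch
from torch.utils.data import Dataset


class DummyDataset(Dataset):
    # Same class as above, repeated so this snippet is self-contained.
    def __init__(self):
        self.data = torch.rand(10, 2)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        sample = self.data[index]
        label = sample[0] > sample[1]
        return (sample, label)


check_dataset = DummyDataset()
sample, label = check_dataset[1]  # calls __getitem__(1)

print(len(check_dataset))  # calls __len__, prints 10
print(sample.shape)        # torch.Size([2])
print(label.dtype)         # torch.bool: the label is a 0-dim boolean tensor
```

Note that the label is computed on the fly in `__getitem__`, so the dataset only needs to store the raw samples.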

## Dataloader

A `DataLoader` is a PyTorch utility class to iterate over the dataset.  
It allows multi-process data loading, automatic batching, shuffling and more.

In [None]:
from torch.utils.data import DataLoader

In [None]:
loader = DataLoader(dataset, batch_size=5, shuffle=True)

In [None]:
for sample, label in loader:
    print(sample, label, sep="\n")
    break
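
To see what automatic batching does, here is a small sketch that iterates over a whole epoch and records the batch sizes. It repeats `DummyDataset` so that it runs on its own; the batch size of 4 is chosen only to show what happens when the dataset size is not a multiple of the batch size.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class DummyDataset(Dataset):
    # Same class as above, repeated so this snippet is self-contained.
    def __init__(self):
        self.data = torch.rand(10, 2)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        sample = self.data[index]
        label = sample[0] > sample[1]
        return (sample, label)


demo_loader = DataLoader(DummyDataset(), batch_size=4, shuffle=True)

batch_sizes = []
for sample, label in demo_loader:
    # sample is a (batch, 2) float tensor, label is a (batch,) boolean tensor
    batch_sizes.append(sample.shape[0])

print(len(demo_loader))  # number of batches per epoch: 3
print(batch_sizes)       # [4, 4, 2]: the last batch holds the remaining samples
```

If the smaller final batch is a problem, passing `drop_last=True` to the `DataLoader` discards it.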

You can also use multiple workers to load data in parallel.  
Find out how in the [PyTorch doc](https://pytorch.org/docs/master/index.html), then update the loader below.

In [None]:
loader = DataLoader(dataset, batch_size=5, shuffle=True)

## A real example: Alien vs Predator

### Dataset 

The dataset is located in `data/alien-vs-predator`.

In [None]:
!tree -nd ./data/alien-vs-predator

Each directory contains the images of the corresponding class.

![predator](./data/alien-vs-predator/train/predator/10.jpg)
![alien](./data/alien-vs-predator/train/alien/10.jpg)

The code below implements a `Dataset` class for these images.  
It collects all the image paths and stores each one in the `img_instances` list along with its label.  
The alien class has label 0 while the predator class has label 1.

<div style="background-color:lightblue;padding:1rem;border-radius: 0.015rem 0.015rem 0.03rem 0.03rem;">
<h3 style="display: inline; font-weight:bold">Your turn!</h3>
</div>

This code is incomplete: you need to fill in the `__len__` and `__getitem__` methods.

You can use this snippet to load an image from a `path`:
```python
with open(path, 'rb') as f:
    img = Image.open(f).convert('RGB')
```


In [None]:
from pathlib import Path
from PIL import Image


class AlienPredatorDataset(Dataset):
    def __init__(self, root, split):
        self.root = root
        self.split = split
        
        # Load and save all image paths
        self.img_instances = []
        
        for img_path in Path(root, split, "alien").glob("*.jpg"):
            self.img_instances.append((img_path, 0))
            
        for img_path in Path(root, split, "predator").glob("*.jpg"):
            self.img_instances.append((img_path, 1))
    
    
    def __len__(self):
        return # YOUR TURN
    
    
    def __getitem__(self, index):
        # YOUR TURN
        return (img, target)

In [None]:
dataset = AlienPredatorDataset("./data/alien-vs-predator/", "train")

In [None]:
len(dataset)

In [None]:
dataset[0]  # Here again it returns a tuple (image, class label)

In [None]:
dataset[0][0]

Note that we get PIL images that are of different sizes.  
To create proper PyTorch batches, we need to input **tensors** that have the **same size**.  
To do so, we will use Torchvision transforms.

### Torchvision's transforms

In [None]:
from torchvision.transforms import ToTensor, ToPILImage, RandomCrop

crop_transform = RandomCrop(100)

In [None]:
img = dataset[0][0]
img

In [None]:
crop_transform(img)

In [None]:
from torchvision.transforms import Compose

all_transforms = Compose((
    RandomCrop(100),
    ToTensor(),
))

In [None]:
all_transforms(img)

In [None]:
all_transforms(img).shape

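Let's apply it to our dataset! Apply `all_transforms` to the image inside `__getitem__` before returning it, then re-create `dataset`, so that every sample is a tensor of the same size.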

### Dataloader

In [None]:
loader = DataLoader(dataset, batch_size=5, shuffle=True) # workers

for sample, label in loader:
    print(sample.shape, label)
    break

## Using torchvision ImageFolder

Torchvision provides many more useful classes to deal with images.

In particular, since image classification is a very common computer vision task, torchvision provides a dataset named `ImageFolder` that loads images from a folder in which each subfolder holds the images of one class.

In [None]:
from torchvision.datasets import ImageFolder
dataset = ImageFolder(root="./data/alien-vs-predator/train", transform=all_transforms)

In [None]:
dataset[0]
