# Exercise 00
## PyTorch Introduction - Part II

### Goals of this tutorial

- Understanding PyTorch's approach to loading data from storage.

In [None]:
import torch
import numpy as np
import matplotlib.pyplot as plt
from torch.utils.data import Dataset

%load_ext autoreload
%autoreload 2
%matplotlib inline

## 2.1 The Dataset class

In PyTorch, the `Dataset` class is the main way to access the data you are working with. Custom dataset classes are derived from this base class. The functionality of a `Dataset` is captured in three dunder methods:
- `__init__` runs once and instantiates the dataset by preloading all relevant information for the dataset
- `__len__` defines how many data points/samples are in the dataset
- `__getitem__` defines how to load an individual data sample and defines the structure of the output

Each derived `Dataset` class should implement these methods.
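
As a minimal sketch of how Python maps `len()` and indexing onto these methods, consider the toy class below. `SquaresDataset` is a hypothetical name used only for illustration; it is not part of PyTorch.

```python
import torch
from torch.utils.data import Dataset


class SquaresDataset(Dataset):
    """Hypothetical toy dataset: sample i is the pair (i, i**2)."""

    def __init__(self, n: int):
        # runs once: store everything that is needed later
        self.values = torch.arange(n, dtype=torch.float32)

    def __len__(self):
        # number of samples in the dataset
        return self.values.shape[0]

    def __getitem__(self, index):
        # load a single sample and define the output structure
        x = self.values[index]
        return x, x ** 2


squares = SquaresDataset(5)
num_samples = len(squares)  # calls SquaresDataset.__len__
x3, y3 = squares[3]         # calls SquaresDataset.__getitem__(3)
print(num_samples, x3.item(), y3.item())
```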

### 2.1.1 Creating a custom Dataset
For these examples, we work with small datasets, so we can preload them completely and keep them in memory. This usually does not work for larger datasets, especially image data. You will see a proper example in the next notebook; for demonstration purposes, a small dataset is sufficient here.

In [None]:
a = 1.0
b = 2.0
c = 3.0

# X represents the input data, y the target values
X = torch.linspace(-10, 10, 1000)
y = a * X**2 + b * X + c + 2.0 * torch.randn(1000)

The following class represents our dataset. In this case we initialize the object by passing the input data and the target values directly to the ``__init__`` method. However, this depends on the dataset that is used. Especially with larger datasets, it is infeasible to store all the data in the object. The [documentation](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html#creating-a-custom-dataset-for-your-files) provides an example of a more advanced custom dataset object.

In [None]:
class RegressionDataset(Dataset):
  def __init__(self, X: torch.Tensor, y: torch.Tensor):
    # store the input data
    self.X = X
    # store the target values
    self.y = y

  def __len__(self):
    # get the number of datapoints N from the target value tensor with shape N
    return self.y.shape[0]

  def __getitem__(self, index):
    # returns the datapoint at the index as a tuple of input data, output value
    return self.X[index], self.y[index]
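
When a dataset is too large to preload, `__init__` can store only lightweight metadata and `__getitem__` can produce each sample on demand. The following is a rough sketch of that idea, not the approach used in this notebook: `LazyRegressionDataset` is a hypothetical class that computes a noise-free sample from its index, standing in for reading a file from disk.

```python
import torch
from torch.utils.data import Dataset


class LazyRegressionDataset(Dataset):
    """Hypothetical variant that creates each sample on demand."""

    def __init__(self, num_samples: int, a=1.0, b=2.0, c=3.0):
        # store only metadata, not the samples themselves
        self.num_samples = num_samples
        self.a, self.b, self.c = a, b, c

    def __len__(self):
        return self.num_samples

    def __getitem__(self, index):
        if not 0 <= index < self.num_samples:
            raise IndexError(index)
        # map the index to an x value in [-10, 10]
        # (a stand-in for loading one sample from storage)
        step = 20.0 / max(self.num_samples - 1, 1)
        x = torch.tensor(-10.0 + step * index)
        y = self.a * x**2 + self.b * x + self.c
        return x, y


lazy_dataset = LazyRegressionDataset(1000)
x0, y0 = lazy_dataset[0]
print(len(lazy_dataset), x0.item(), y0.item())
```

The interface is identical to the preloaded version, so code that consumes the dataset does not need to know which strategy is used.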

### 2.1.2 Visualizing the whole dataset

In [None]:
dataset = RegressionDataset(X, y)

# visualization of the dataset
fig, ax = plt.subplots(1, 1, figsize=(10, 6))
plt.title('full dataset')
ax.scatter(dataset.X, dataset.y, s=1.0)
ax.set_xlabel("X")
ax.set_ylabel("y")
plt.show()

### 2.1.3 Splitting the dataset into train/test split

In machine learning tasks you often want to evaluate your model on unseen data. The whole dataset is therefore split into a training and a test split. PyTorch offers different options to split a dataset into subsets, for example `torch.utils.data.random_split`, which is used in the next cell.

In [None]:
train_len = int(0.8 * len(dataset))
test_len = len(dataset) - train_len

# creates multiple PyTorch Subset objects that hold the different splits
train_data, test_data = torch.utils.data.random_split(dataset, [train_len, test_len], generator=torch.Generator().manual_seed(0))

# visualizing the different splits
fig, axs = plt.subplots(1, 2, figsize=(20, 6))
# a Subset only stores indices into the original dataset,
# so we use them to index the underlying tensors
axs[0].scatter(dataset.X[train_data.indices], dataset.y[train_data.indices], s=1.0)
axs[0].set_title("training split")
axs[0].set_xlabel("X")
axs[0].set_ylabel("y")
# same indexing via the stored indices for the test split
axs[1].scatter(dataset.X[test_data.indices], dataset.y[test_data.indices], s=1.0, color='tab:orange')
axs[1].set_title("test split")
axs[1].set_xlabel("X")
axs[1].set_ylabel("y")
plt.show()
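
As one alternative to `random_split`, a split can also be built by hand from explicit index lists with `torch.utils.data.Subset`. The sketch below assumes a small toy `TensorDataset` (the `toy_*` names are illustrative only) so that it runs independently of the cells above.

```python
import torch
from torch.utils.data import Subset, TensorDataset

# toy dataset with 10 samples: (x, 2 * x)
toy_dataset = TensorDataset(torch.arange(10.0), torch.arange(10.0) * 2)

# shuffle the indices reproducibly, then cut at 80 %
generator = torch.Generator().manual_seed(0)
perm = torch.randperm(len(toy_dataset), generator=generator).tolist()
split = int(0.8 * len(toy_dataset))

toy_train = Subset(toy_dataset, perm[:split])
toy_test = Subset(toy_dataset, perm[split:])

print(len(toy_train), len(toy_test))
```

Explicit indices are useful when the split must follow a rule, for example keeping the last samples of a time series for testing instead of drawing them at random.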

### 2.1.4 Working with dataloaders

PyTorch offers a convenient way to obtain batches from a ``Dataset`` object by wrapping an iterable around it in the form of a ``DataLoader``. An overview of the different parameters can be found in the [documentation](https://pytorch.org/docs/stable/data.html?highlight=dataloader#torch.utils.data.DataLoader). A `DataLoader` aggregates the outputs of the `__getitem__` method along a new batch dimension.

In [None]:
from torch.utils.data import DataLoader
# create the python iterables for the datasets
train_dataloader = DataLoader(train_data, batch_size=8, shuffle=True)
test_dataloader = DataLoader(test_data, batch_size=8, shuffle=False)

fig, axs = plt.subplots(1, 2, figsize=(20, 6))

# obtain the next batch from the training split
data = next(iter(train_dataloader))
X_train, y_train = data

# visualize the whole batch
axs[0].scatter(X_train, y_train, s=100.0)
# visualize the first entry of the batch
axs[0].scatter(X_train[0], y_train[0], s=100.0, color="tab:orange")
axs[0].set_title("training split")
axs[0].set_xlabel("X")
axs[0].set_ylabel("y")

# obtain the next batch from the test split
data = next(iter(test_dataloader))
X_test, y_test = data

# visualize the whole batch
axs[1].scatter(X_test, y_test, s=100.0)
# visualize the first entry of the batch
axs[1].scatter(X_test[0], y_test[0], s=100.0, color="tab:orange")
axs[1].set_title("test split")
axs[1].set_xlabel("X")
axs[1].set_ylabel("y")
plt.show()
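
The cell above only draws a single batch from each loader. In a training loop you would typically iterate over the whole `DataLoader` once per epoch. The sketch below assumes a toy dataset with 20 samples (names are illustrative) and shows that the last batch can be smaller than `batch_size` unless `drop_last=True` is set.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# toy dataset with 20 samples: (x, x**2)
toy_dataset = TensorDataset(torch.arange(20.0), torch.arange(20.0) ** 2)
toy_loader = DataLoader(toy_dataset, batch_size=8, shuffle=False)

# one pass over the loader corresponds to one epoch
batch_sizes = []
for xb, yb in toy_loader:
    batch_sizes.append(xb.shape[0])
print(batch_sizes)  # [8, 8, 4]

# drop_last=True discards the incomplete final batch
toy_loader_drop = DataLoader(toy_dataset, batch_size=8, drop_last=True)
print(len(toy_loader), len(toy_loader_drop))  # 3 2
```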

## 2.2 An in-depth Look at `__getitem__`

The previous example was quite simple: a single scalar value as input and a single scalar value as output. When working with images, this is often not realistic. Our training data often consists of grayscale or RGB images, with ground truth that can range from semantic annotations to instance annotations and even camera pose transformation matrices. Sometimes it is also necessary to return meta-information about the data sample. In these cases, it is not enough to simply return a tuple of two values.

In the following, we will look at the different data structures that PyTorch allows you to return from a dataset.

### 2.2.1 Returning non-tensor data

PyTorch dataloaders allow your dataset to return data in most standard Python data structures, including `list`, `tuple`, and `dict`. They also support `str` and `np.ndarray` values. The next cell shows a `Dataset` that covers all these cases.

In [None]:
class DummyDataset(Dataset):
  def __init__(self, num_data=100):
    self.torch_data = torch.randn((num_data, 2))
    self.numpy_data = np.random.rand(num_data, 2)
    self.int_data = [idx for idx in range(num_data)]
    self.str_data = [f"{idx}" for idx in range(num_data)]

  def __len__(self):
    return self.torch_data.shape[0]

  def __getitem__(self, index):
    # in the following B will denote the batch size
    return (
        # for torch tensors the dataloader will return a torch.Tensor with size 
        # B,... where ... is the shape of your data
        self.torch_data[index],
        # for numpy arrays the dataloader will convert the arrays to a 
        # torch.Tensor with size B,... where ... is the shape of your data
        self.numpy_data[index],
        # for integers, floats, etc. the dataloader will return a torch.Tensor 
        # with size B
        self.int_data[index],
        # for strings the dataloader will return a list of length B
        self.str_data[index],
        # for list the dataloader will accumulate the data for each element of 
        # the list according to the previous rules
        [self.torch_data[index], self.numpy_data[index]], 
        # for dict the dataloader will accumulate the data for each element of 
        # the dict according to the previous rules and keep the keys
        {"torch": self.torch_data[index], "numpy": self.numpy_data[index]},
        )

### 2.2.2 Using a dataloader

Based on the output structure, the PyTorch `DataLoader` collates the data from the dataset output. You can familiarise yourself with the different options in the [documentation](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html#).

The ``DataLoader`` automatically accumulates your training data along a batch dimension while trying to preserve the data structure. If you need a more specific aggregation of data, you can always define a custom `collate_fn` yourself. However, this is an advanced topic that is rarely needed, so we will not cover it in depth here.

In [None]:
dummy_dataset = DummyDataset()
dataloader = DataLoader(dummy_dataset, batch_size=8)

print(f"Created a dataset with {len(dummy_dataset)} entries.")
print(f"Created a dataloader with {len(dataloader)} batches.")

for batch_idx, data in enumerate(dataloader):
  torch_data = data[0]
  print(f"torch data:\n{torch_data}\n", f"shape:\n{torch_data.shape}")
  numpy_data = data[1]
  print(f"numpy data:\n{numpy_data}\n", f"shape:\n{numpy_data.shape}")
  int_data = data[2]
  print(f"int data:\n{int_data}\n", f"shape:\n{int_data.shape}")
  str_data = data[3]
  print(f"str data:\n{str_data}")
  list_data = data[4]
  print(f"list data:\n{list_data}")
  dict_data = data[5]
  print(f"dict data:\n{dict_data}")
  break
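
For completeness, here is a rough sketch of what a custom `collate_fn` could look like. It assumes a case the default collation cannot handle: samples of different lengths, which cannot be stacked directly. `VariableLengthDataset` and `pad_collate` are hypothetical names for illustration; `pad_sequence` comes from `torch.nn.utils.rnn`.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset


class VariableLengthDataset(Dataset):
    """Hypothetical dataset whose samples have different lengths."""

    def __init__(self):
        self.sequences = [torch.ones(n) for n in (2, 5, 3, 4)]

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, index):
        return self.sequences[index]


def pad_collate(batch):
    # batch is a list of the individual __getitem__ outputs
    lengths = torch.tensor([len(seq) for seq in batch])
    # pad all sequences with zeros to the longest one in the batch
    padded = pad_sequence(batch, batch_first=True, padding_value=0.0)
    return padded, lengths


padded_loader = DataLoader(
    VariableLengthDataset(), batch_size=2, collate_fn=pad_collate
)
padded, lengths = next(iter(padded_loader))
print(padded.shape, lengths)
```

Returning the original lengths alongside the padded tensor lets later code ignore the padding, which is one common design choice among several.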

## References

1. [PyTorch Tutorial](https://pytorch.org/tutorials/)


