# Dataset and Dataloader

Based on **Patrick Loeber**'s video: https://www.youtube.com/watch?v=c36lUUr864M&t=7076s

## Not a good approach

If we load all the training samples and process them all at once, every training step becomes slow, because calculating the gradients over all the samples takes too much time.

In [None]:
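import numpy as np

data = np.loadtxt('wine.csv', delimiter=',', skiprows=1)
# training loop
for epoch in range(1000):
    # first column is the label, all other columns are the features
    x, y = data[:, 1:], data[:, 0]
    # forward + backward + weights updates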

## Good approach

When we want to work with **BIG** datasets, we cannot load and process all the data at once. Instead we divide the samples into smaller chunks, so-called batches, and the training loop then looks like the one below.

In [None]:
# training loop
for epoch in range(1000):
    # loop over all batches
    for i in range(total_batches):
        x_batch, y_batch = ...
        
# --> use Dataset and DataLoader to load wine.csv
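
Before handing the work to PyTorch, it may help to see what batching amounts to by hand. The following is a minimal sketch in plain Python; the helper `iter_batches` and the ten made-up samples are hypothetical and only illustrate how a dataset is cut into consecutive batches.

```python
from typing import Iterator, List, Sequence, Tuple

# (features, label); a hypothetical layout used only for this illustration
Sample = Tuple[List[float], int]


def iter_batches(samples: Sequence[Sample], batch_size: int) -> Iterator[Sequence[Sample]]:
    """Yield consecutive slices holding at most batch_size samples each."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]


# Ten made-up samples, not taken from wine.csv.
samples = [([float(i), float(i) * 2.0], i % 3 + 1) for i in range(10)]

batch_sizes = []
for epoch in range(2):
    for batch in iter_batches(samples, batch_size=4):
        # forward + backward + weight update would go here
        if epoch == 0:
            batch_sizes.append(len(batch))

print(batch_sizes)  # [4, 4, 2]
```

The last batch is smaller because 10 is not divisible by 4. This sketch does no shuffling and no parallel loading.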

## Batch training terms

+ epoch - one forward and backward pass of ALL training samples
+ batch_size - number of training samples in one forward and backward pass
+ number of iterations - number of passes, each pass using [batch_size] number of samples

e.g. 100 samples, batch_size=20 --> 100/20 = 5 iterations for 1 epoch
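
The same arithmetic can be checked in code. This is a small sketch; the function name `iterations_per_epoch` is made up for illustration, and rounding up is assumed so that a final, partially filled batch still counts as one iteration.

```python
import math


def iterations_per_epoch(n_samples: int, batch_size: int) -> int:
    """Number of batches needed to see every sample once (last batch may be smaller)."""
    return math.ceil(n_samples / batch_size)


print(iterations_per_epoch(100, 20))  # 5
print(iterations_per_epoch(178, 4))   # 45
```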

## Dataset

The first row of our dataset is the header. We want to predict the wine category, and there are three of them: 1, 2 and 3. The class label is in the very first column and all the other columns are the features. We will load this dataset and split the columns into X and y.
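
The column split can be tried on a tiny in-memory table first. The sketch below assumes the layout described above (header row, label in the first column); the three rows and the column names are placeholders, not values from the real wine.csv.

```python
import io

import numpy as np

# Made-up stand-in for wine.csv: a header row, then label + three features per row.
csv_text = """label,feature_a,feature_b,feature_c
1,14.2,1.7,2.4
2,12.4,0.9,1.9
3,13.1,3.1,2.5
"""

# skiprows=1 drops the header; np.loadtxt also accepts file-like objects.
xy = np.loadtxt(io.StringIO(csv_text), delimiter=",", dtype=np.float32, skiprows=1)

X = xy[:, 1:]   # every column except the first one
y = xy[:, [0]]  # first column only, kept 2-D with shape (n_samples, 1)

print(X.shape, y.shape)  # (3, 3) (3, 1)
```

Indexing with `[0]` instead of `0` keeps `y` two-dimensional.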

## Implementation

In [4]:
import torch
import torchvision
from torch.utils.data import Dataset, DataLoader
import numpy as np
import math

class WineDataset(Dataset):
    
    def __init__(self):
        # data loading
        xy = np.loadtxt('./wine_data/wine.csv', delimiter=",", dtype=np.float32, skiprows=1)
        # all rows and all columns without first one (y)
        self.x = torch.from_numpy(xy[:, 1:])
        self.y = torch.from_numpy(xy[:, [0]]) # n_samples, 1
        self.n_samples = xy.shape[0]
        
    def __getitem__(self, index):
        # dataset[0]
        return self.x[index], self.y[index]
        
    def __len__(self):
        # len(dataset)
        return self.n_samples

## Creating dataset 

In [5]:
dataset = WineDataset()

## Unpacking it

In [9]:
first_data = dataset[0]
features, labels = first_data

In [10]:
print(features, labels)

tensor([1.4230e+01, 1.7100e+00, 2.4300e+00, 1.5600e+01, 1.2700e+02, 2.8000e+00,
        3.0600e+00, 2.8000e-01, 2.2900e+00, 5.6400e+00, 1.0400e+00, 3.9200e+00,
        1.0650e+03]) tensor([1.])


## Using Data Loader

shuffle is an optional argument that reshuffles the data at every epoch when set to True. The num_workers argument enables multi-process data loading.
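
To see these arguments in isolation, here is a minimal sketch on a toy dataset. It uses `TensorDataset` with ten made-up samples instead of the wine data, and it also shows `drop_last`, an extra `DataLoader` option that is not used elsewhere in this notebook.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Ten made-up samples with two features each; labels are simply 0..9.
features = torch.arange(20, dtype=torch.float32).reshape(10, 2)
labels = torch.arange(10, dtype=torch.float32).reshape(10, 1)
toy_dataset = TensorDataset(features, labels)

# shuffle=True draws a new random order every time the loader is iterated.
loader = DataLoader(toy_dataset, batch_size=4, shuffle=True)
# drop_last=True discards the final batch when it is smaller than batch_size.
loader_drop = DataLoader(toy_dataset, batch_size=4, shuffle=True, drop_last=True)

print(len(loader), len(loader_drop))  # 3 2

# One full pass still visits every sample exactly once, only in a different order.
seen = torch.cat([y for _, y in loader]).flatten().sort().values
print(torch.equal(seen, labels.flatten()))  # True
```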

In [20]:
dataloader = DataLoader(dataset=dataset, batch_size=4, shuffle=True) #, num_workers=2)

In [23]:
dataiter = iter(dataloader)
data = next(dataiter)
features, labels = data
print(features, labels)

tensor([[1.3630e+01, 1.8100e+00, 2.7000e+00, 1.7200e+01, 1.1200e+02, 2.8500e+00,
         2.9100e+00, 3.0000e-01, 1.4600e+00, 7.3000e+00, 1.2800e+00, 2.8800e+00,
         1.3100e+03],
        [1.3360e+01, 2.5600e+00, 2.3500e+00, 2.0000e+01, 8.9000e+01, 1.4000e+00,
         5.0000e-01, 3.7000e-01, 6.4000e-01, 5.6000e+00, 7.0000e-01, 2.4700e+00,
         7.8000e+02],
        [1.3070e+01, 1.5000e+00, 2.1000e+00, 1.5500e+01, 9.8000e+01, 2.4000e+00,
         2.6400e+00, 2.8000e-01, 1.3700e+00, 3.7000e+00, 1.1800e+00, 2.6900e+00,
         1.0200e+03],
        [1.1030e+01, 1.5100e+00, 2.2000e+00, 2.1500e+01, 8.5000e+01, 2.4600e+00,
         2.1700e+00, 5.2000e-01, 2.0100e+00, 1.9000e+00, 1.7100e+00, 2.8700e+00,
         4.0700e+02]]) tensor([[1.],
        [3.],
        [1.],
        [2.]])


## Dummy training loop

In [24]:
num_epochs = 2
total_samples = len(dataset)
# number of iterations in one epoch
n_iterations = math.ceil(total_samples / 4) # where 4 is a batch size
print(f"Number of samples: {total_samples}, number of iterations: {n_iterations}")

Number of samples: 178, number of iterations: 45


In [25]:
for epoch in range(num_epochs):
    for i, (inputs, labels) in enumerate(dataloader):
        # forward, backward, update weights
        # normally we would do the steps above, but for this example we just print information about the batch
        if (i+1) % 5 == 0:
            print(f"epoch: {epoch +1} / {num_epochs}, step: {i+1} / {n_iterations}, inputs: {inputs.shape}")

epoch: 1 / 2, step: 5 / 45, inputs: torch.Size([4, 13])
epoch: 1 / 2, step: 10 / 45, inputs: torch.Size([4, 13])
epoch: 1 / 2, step: 15 / 45, inputs: torch.Size([4, 13])
epoch: 1 / 2, step: 20 / 45, inputs: torch.Size([4, 13])
epoch: 1 / 2, step: 25 / 45, inputs: torch.Size([4, 13])
epoch: 1 / 2, step: 30 / 45, inputs: torch.Size([4, 13])
epoch: 1 / 2, step: 35 / 45, inputs: torch.Size([4, 13])
epoch: 1 / 2, step: 40 / 45, inputs: torch.Size([4, 13])
epoch: 1 / 2, step: 45 / 45, inputs: torch.Size([2, 13])
epoch: 2 / 2, step: 5 / 45, inputs: torch.Size([4, 13])
epoch: 2 / 2, step: 10 / 45, inputs: torch.Size([4, 13])
epoch: 2 / 2, step: 15 / 45, inputs: torch.Size([4, 13])
epoch: 2 / 2, step: 20 / 45, inputs: torch.Size([4, 13])
epoch: 2 / 2, step: 25 / 45, inputs: torch.Size([4, 13])
epoch: 2 / 2, step: 30 / 45, inputs: torch.Size([4, 13])
epoch: 2 / 2, step: 35 / 45, inputs: torch.Size([4, 13])
epoch: 2 / 2, step: 40 / 45, inputs: torch.Size([4, 13])
epoch: 2 / 2, step: 45 / 45, inpu