# Dataset and Dataloader in PyTorch

Some important terminology:
- 1 epoch = one forward and backward pass over all training samples
- batch_size = number of training samples in one forward and backward pass
- number of iterations = number of passes per epoch, each pass using batch_size samples
- example: with 100 samples and batch_size = 20, one epoch takes 100/20 = 5 iterations.
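
The arithmetic above can be sketched in a few lines. The helper name `iterations_per_epoch` is made up for this illustration. It rounds up, on the assumption that a final, smaller batch is still processed when the sample count is not a multiple of the batch size.

```python
import math

def iterations_per_epoch(n_samples: int, batch_size: int) -> int:
    """Number of batches needed to see every sample once (hypothetical helper)."""
    # Round up so that a final partial batch still counts as one iteration.
    return math.ceil(n_samples / batch_size)

print(iterations_per_epoch(100, 20))  # 5
print(iterations_per_epoch(178, 4))   # 45, the last batch holds only 2 samples
```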

In [None]:
import torch
import torchvision
from torch.utils.data import Dataset, DataLoader
import numpy as np
import math

# creating a custom class for our dataset, which inherits from Dataset.
class WineDataset(Dataset):

    # __init__ loads the data once, when the dataset object is created
    def __init__(self):
        # skiprows=1 skips the header row of the CSV file
        xy = np.loadtxt('./wine.csv', delimiter=',', dtype=np.float32, skiprows=1)
        # the first column is the output label, the remaining columns are the features
        self.x = torch.from_numpy(xy[:, 1:])  # shape: (n_samples, n_features)
        self.y = torch.from_numpy(xy[:, [0]]) # shape: (n_samples, 1)
        self.n_samples = xy.shape[0]

    # this function allows indexing in our dataset
    def __getitem__(self, index):
        return self.x[index], self.y[index] # the function returns a tuple.

    # this allows us to call len on our dataset.
    def __len__(self):
        return self.n_samples

dataset = WineDataset()
# Indexing the dataset returns the first sample as a (features, label) tuple.
first_data = dataset[0]
features, labels = first_data
print(features, labels)

tensor([1.4230e+01, 1.7100e+00, 2.4300e+00, 1.5600e+01, 1.2700e+02, 2.8000e+00,
        3.0600e+00, 2.8000e-01, 2.2900e+00, 5.6400e+00, 1.0400e+00, 3.9200e+00,
        1.0650e+03]) tensor([1.])


### DataLoader Class

At the heart of PyTorch's data loading utility is the ```torch.utils.data.DataLoader``` class. It represents a Python iterable over a dataset.

For example, a Dataset of 1000 images only lets you access its samples by index, in the order in which they are stored. A DataLoader that wraps that Dataset, on the other hand, lets you iterate over the data in batches, shuffle it, sample from it, apply functions, and so on.

A dataloader only iterates over a dataset; it does not modify its contents. For example, it does not shuffle the stored data, but it can iterate over the contents of a dataset in a random order.
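
A quick way to check this claim is sketched below. It uses a small toy `TensorDataset` rather than the wine data; the names `values`, `toy_dataset` and `loader` exist only for this illustration.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Toy data (hypothetical): ten samples whose single feature equals their index.
values = torch.arange(10, dtype=torch.float32).unsqueeze(1)
toy_dataset = TensorDataset(values)

loader = DataLoader(toy_dataset, batch_size=3, shuffle=True)

# Collect everything the loader yields during one pass over the data.
seen = torch.cat([batch[0] for batch in loader])

# Read the dataset back by index: it is still in its original order.
stored = torch.stack([toy_dataset[i][0] for i in range(len(toy_dataset))])
dataset_unchanged = torch.equal(stored, values)

# The loader visited every sample exactly once, in some (random) order.
all_samples_seen = torch.equal(seen.sort(dim=0).values, values)

print(dataset_unchanged, all_samples_seen)
```

Both flags should come out `True`: the order in which the loader visits samples changes from pass to pass, while the dataset itself is left untouched.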

For an object to be iterable, its class must define the ```__iter__()``` method. Since the DataLoader class already defines it, we can pass a dataloader to both the ```enumerate()``` and the ```iter()``` built-in functions.
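
To make the protocol concrete, here is a minimal sketch in plain Python, independent of PyTorch. The `Countdown` class is invented for this example; it only shows that defining `__iter__()` is enough for `iter()`, `next()` and `enumerate()` to work.

```python
class Countdown:
    """A minimal iterable (hypothetical): counts down from `start` to 1."""

    def __init__(self, start):
        self.start = start

    def __iter__(self):
        # Because of `yield`, each call returns a fresh generator (an iterator).
        current = self.start
        while current > 0:
            yield current
            current -= 1

countdown = Countdown(3)
iterator = iter(countdown)          # calls Countdown.__iter__ behind the scenes
first = next(iterator)              # 3
pairs = list(enumerate(countdown))  # [(0, 3), (1, 2), (2, 1)]
print(first, pairs)
```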

In [None]:
# shuffle=True iterates over the samples in random order; num_workers=2 loads batches in 2 worker subprocesses, which can speed up loading.
dataloader = DataLoader(dataset=dataset, batch_size=4, shuffle=True, num_workers=2)


The ```iter()``` function looks up the ```__iter__()``` method in the object's class and calls it behind the scenes. It returns an iterator.

In [None]:
# Now we can get an iterator from the dataloader object.
dataiter = iter(dataloader)
# next() returns the first batch from the dataloader.
# Every time it is called, it gives us the next batch from the iterator.
data = next(dataiter)
features, labels = data
print(features, labels)

tensor([[1.2330e+01, 1.1000e+00, 2.2800e+00, 1.6000e+01, 1.0100e+02, 2.0500e+00,
         1.0900e+00, 6.3000e-01, 4.1000e-01, 3.2700e+00, 1.2500e+00, 1.6700e+00,
         6.8000e+02],
        [1.3320e+01, 3.2400e+00, 2.3800e+00, 2.1500e+01, 9.2000e+01, 1.9300e+00,
         7.6000e-01, 4.5000e-01, 1.2500e+00, 8.4200e+00, 5.5000e-01, 1.6200e+00,
         6.5000e+02],
        [1.3170e+01, 5.1900e+00, 2.3200e+00, 2.2000e+01, 9.3000e+01, 1.7400e+00,
         6.3000e-01, 6.1000e-01, 1.5500e+00, 7.9000e+00, 6.0000e-01, 1.4800e+00,
         7.2500e+02],
        [1.3940e+01, 1.7300e+00, 2.2700e+00, 1.7400e+01, 1.0800e+02, 2.8800e+00,
         3.5400e+00, 3.2000e-01, 2.0800e+00, 8.9000e+00, 1.1200e+00, 3.1000e+00,
         1.2600e+03]]) tensor([[2.],
        [3.],
        [3.],
        [1.]])


We get this output because we set batch_size to 4, so one batch from our dataset has 4 samples in it. The features have shape (4, 13) and the labels have shape (4, 1).

Now let's create a dummy training loop.

In [None]:
num_epochs = 2
total_samples = len(dataset)
n_iters = math.ceil(total_samples / 4)   # divided by batch size.
print(total_samples, n_iters)


178 45
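
The same count can be read off the dataloader itself, since `len()` on a DataLoader gives the number of batches per epoch. The sketch below assumes random stand-in data with the same shape as the wine dataset (178 samples, 13 features), so it runs without `wine.csv`. The names `fake_dataset` and `fake_loader` are illustrative only.

```python
import math
import torch
from torch.utils.data import TensorDataset, DataLoader

# Random stand-in data (an assumption): 178 samples, 13 features, labels in {1, 2, 3}.
fake_x = torch.randn(178, 13)
fake_y = torch.randint(1, 4, (178, 1)).float()
fake_dataset = TensorDataset(fake_x, fake_y)

batch_size = 4
fake_loader = DataLoader(fake_dataset, batch_size=batch_size, shuffle=True)

# len() of a DataLoader is the number of batches in one epoch.
n_batches = len(fake_loader)
print(n_batches, math.ceil(len(fake_dataset) / batch_size))

# Record the size of every batch; the last one is smaller when
# the sample count is not a multiple of batch_size.
batch_sizes = [inputs.shape[0] for inputs, _ in fake_loader]
print(batch_sizes[-1])
```

With 178 samples and a batch size of 4, the first 44 batches are full and the last batch holds the remaining 2 samples.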


The ```enumerate()``` function returns each item of an iterable together with its index. In the loop below, each batch is unpacked directly into inputs and labels, and its index is stored in i.

```enumerate()``` is a constructor that returns an object of the enumerate class for any object that supports iteration (an iterable, sequence, or iterator). The enumerate object yields one tuple per item, each containing an index and the value obtained from iterating over the iterable.

Some documentation for the enumerate function - https://www.tutorialsteacher.com/python/enumerate-method
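
As a small illustration of the nested unpacking used with `enumerate()`, the sketch below replaces the dataloader with a plain list of (inputs, labels) pairs. The list `fake_batches` is a made-up stand-in, not real batch data.

```python
# Hypothetical stand-in for a dataloader: each item is an (inputs, labels) pair.
fake_batches = [([1.0, 2.0], 0), ([3.0, 4.0], 1), ([5.0, 6.0], 0)]

# enumerate() wraps each item in an (index, item) tuple.
enumerated = enumerate(fake_batches)
first_item = next(enumerated)
print(first_item)  # (0, ([1.0, 2.0], 0))

# The pattern i, (inputs, labels) unpacks the outer and the inner tuple at once.
steps = []
for i, (inputs, labels) in enumerate(fake_batches):
    steps.append((i, inputs, labels))
print(steps)
```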

In [None]:
for epoch in range(num_epochs):
  # i, (inputs, labels) works because of nested tuple unpacking in Python.
  for i, (inputs, labels) in enumerate(dataloader):
    # the forward pass, backward pass and weight update would go here
    # for now, just print some info every 5 steps.
    if (i+1)%5 == 0:
      print(f'epoch {epoch+1}/{num_epochs}, step {i+1}/{n_iters}, inputs {inputs.shape}')

epoch 1/2, step 5/45, inputs torch.Size([4, 13])
epoch 1/2, step 10/45, inputs torch.Size([4, 13])
epoch 1/2, step 15/45, inputs torch.Size([4, 13])
epoch 1/2, step 20/45, inputs torch.Size([4, 13])
epoch 1/2, step 25/45, inputs torch.Size([4, 13])
epoch 1/2, step 30/45, inputs torch.Size([4, 13])
epoch 1/2, step 35/45, inputs torch.Size([4, 13])
epoch 1/2, step 40/45, inputs torch.Size([4, 13])
epoch 1/2, step 45/45, inputs torch.Size([2, 13])
epoch 2/2, step 5/45, inputs torch.Size([4, 13])
epoch 2/2, step 10/45, inputs torch.Size([4, 13])
epoch 2/2, step 15/45, inputs torch.Size([4, 13])
epoch 2/2, step 20/45, inputs torch.Size([4, 13])
epoch 2/2, step 25/45, inputs torch.Size([4, 13])
epoch 2/2, step 30/45, inputs torch.Size([4, 13])
epoch 2/2, step 35/45, inputs torch.Size([4, 13])
epoch 2/2, step 40/45, inputs torch.Size([4, 13])
epoch 2/2, step 45/45, inputs torch.Size([2, 13])


PyTorch already has some built-in datasets. We will use them later in some projects.

In [None]:
# torchvision.datasets.MNIST() is the famous MNIST dataset.