### In the previous examples, the data was loaded once, the training loop ran over the number of epochs, and the model was optimized on the whole dataset at every step.

## That approach is very <span style='color:cyan'>time consuming because each gradient calculation</span> uses the whole training data.

## A better way is to divide the large dataset into <span style='color:cyan'>smaller batches</span>. We loop over the number of epochs and then ***loop over all batches***. For each batch, we get the x and y samples and <span style='color:cyan'>do the optimization only on that batch</span>.
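The batched loop described above can be sketched as follows. This is a minimal illustration, not the notebook's own training code: the synthetic data, the linear model, the learning rate, and the names `X`, `y`, `true_w`, and `steps` are all assumptions made for the example.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Hypothetical synthetic data: 100 samples, 3 features, noisy linear target.
X = torch.randn(100, 3)
true_w = torch.tensor([[1.5], [-2.0], [0.5]])
y = X @ true_w + 0.1 * torch.randn(100, 1)

# 100 samples with batch_size=20 gives 5 batches per epoch.
loader = DataLoader(TensorDataset(X, y), batch_size=20, shuffle=True)
model = torch.nn.Linear(3, 1)
criterion = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

with torch.no_grad():
    initial_loss = criterion(model(X), y).item()

num_epochs = 5
steps = 0
for epoch in range(num_epochs):          # outer loop: epochs
    for x_batch, y_batch in loader:      # inner loop: batches
        loss = criterion(model(x_batch), y_batch)  # forward pass on one batch
        optimizer.zero_grad()
        loss.backward()                  # gradients from this batch only
        optimizer.step()                 # one weight update per batch
        steps += 1

with torch.no_grad():
    final_loss = criterion(model(X), y).item()

print(f"update steps: {steps}, loss {initial_loss:.3f} -> {final_loss:.3f}")
```

Each weight update here uses the gradient of only 20 samples, so one epoch produces five updates instead of one.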

# TERMS
* one epoch means one complete forward and backward pass of <span style='color:cyan'>ALL TRAINING SAMPLES</span>
* batch_size = number of training samples in one forward & backward pass
* no_of_iterations = number of passes, each pass using [batch_size] number of samples

## e.g. 100 samples, batch_size = 20 --> 100/20 = 5 iterations for 1 epoch
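The arithmetic above can be checked with a small helper. The function name `iterations_per_epoch` is made up for this sketch; rounding up accounts for a final batch that is smaller than `batch_size` when the sample count does not divide evenly.

```python
import math


def iterations_per_epoch(total_samples: int, batch_size: int) -> int:
    """Number of batches needed to see every sample once."""
    # Round up: a leftover partial batch still counts as one iteration.
    return math.ceil(total_samples / batch_size)


print(iterations_per_epoch(100, 20))  # 5
print(iterations_per_epoch(178, 4))   # 45 (44 full batches plus one batch of 2)
```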

In [1]:
import torch
import torchvision
from torch.utils.data import Dataset, DataLoader
import numpy as np
import math

In [2]:
class WineDataset(Dataset):
    def __init__(self):
        # Load the CSV, skipping the header row.
        # Column 0 holds the label; the remaining columns are the features.
        xy = np.loadtxt('./dataset/wine.csv', delimiter=',', dtype=np.float32, skiprows=1)
        self.x = torch.from_numpy(xy[:, 1:])
        # Indexing with the list [0] keeps y two-dimensional: y.shape => [no_of_samples, 1].
        # Plain xy[:, 0] would give a one-dimensional y: y.shape => (no_of_samples,).
        self.y = torch.from_numpy(xy[:, [0]])
        self.n_samples = xy.shape[0]
    
    def __getitem__(self, index):
        # dataset[index] returns the tuple (features, label)
        return self.x[index], self.y[index]

    def __len__(self):
        # len(dataset)
        return self.n_samples
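The shape difference between `xy[:, [0]]` and `xy[:, 0]` noted in the comments can be seen on a small array. The array below is made-up data standing in for the CSV contents, with column 0 playing the role of the label.

```python
import numpy as np
import torch

# Hypothetical stand-in for the loaded CSV: 4 samples, 1 label column + 2 feature columns.
xy = np.arange(12, dtype=np.float32).reshape(4, 3)

x = torch.from_numpy(xy[:, 1:])       # features: all columns after the first
y_2d = torch.from_numpy(xy[:, [0]])   # list index keeps the column dimension
y_1d = torch.from_numpy(xy[:, 0])     # integer index drops it

print(x.shape)     # torch.Size([4, 2])
print(y_2d.shape)  # torch.Size([4, 1])
print(y_1d.shape)  # torch.Size([4])
```

Keeping the label two-dimensional means each batch of labels has shape `[batch_size, 1]`, which lines up with the output of a model that produces one value per sample.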

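## On Windows, setting num_workers > 0 can raise an error because worker processes are started by spawning a fresh interpreter. Use num_workers=0 (the default), or in a script put the training code under `if __name__ == '__main__':`.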
dataloader = DataLoader(dataset=dataset,batch_size=4,shuffle=True,num_workers=2)

## With batch_size=4, the dataloader yields 4 samples per batch.

In [3]:
batch_size = 4
dataset = WineDataset()
dataloader = DataLoader(dataset=dataset,batch_size=batch_size,shuffle=True)

In [4]:
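# Create an iterator over the dataloader.
data_iter = iter(dataloader)
# Fetch the first batch. The built-in next() is used because the
# iterator's .next() method is not available in recent PyTorch versions.
data = next(data_iter)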
features,labels = data
print(f'shape of feature is {features.shape}\n shape of label is {labels.shape}')

shape of feature is torch.Size([4, 13])
 shape of label is torch.Size([4, 1])


In [5]:
# training parameters
num_epochs = 2
total_samples = len(dataset)
# round up: the last batch may hold fewer than batch_size samples
n_iterations = math.ceil(total_samples/batch_size)
print(total_samples,n_iterations)

178 45


In [9]:
for epoch in range(num_epochs):
    for i, (inputs,labels) in enumerate(dataloader):
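        # the forward pass, backward pass, and weight update would go here
        # note: inputs[1,1] is the second feature of the second sample in the batch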
        print(f'epoch {epoch+1}/{num_epochs},step {i+1}/{n_iterations},1st input data is {inputs[1,1]:.3f}')

epoch 1/2,step 1/45,1st input data is 3.450
epoch 1/2,step 2/45,1st input data is 2.430
epoch 1/2,step 3/45,1st input data is 1.170
epoch 1/2,step 4/45,1st input data is 1.810
epoch 1/2,step 5/45,1st input data is 4.280
epoch 1/2,step 6/45,1st input data is 1.470
epoch 1/2,step 7/45,1st input data is 1.660
epoch 1/2,step 8/45,1st input data is 1.210
epoch 1/2,step 9/45,1st input data is 1.130
epoch 1/2,step 10/45,1st input data is 1.500
epoch 1/2,step 11/45,1st input data is 2.450
epoch 1/2,step 12/45,1st input data is 3.990
epoch 1/2,step 13/45,1st input data is 1.610
epoch 1/2,step 14/45,1st input data is 1.640
epoch 1/2,step 15/45,1st input data is 1.590
epoch 1/2,step 16/45,1st input data is 2.460
epoch 1/2,step 17/45,1st input data is 4.430
epoch 1/2,step 18/45,1st input data is 3.910
epoch 1/2,step 19/45,1st input data is 1.010
epoch 1/2,step 20/45,1st input data is 0.740
epoch 1/2,step 21/45,1st input data is 1.730
epoch 1/2,step 22/45,1st input data is 1.730
epoch 1/2,step 23/4