### DataLoader and Dataset

In the last chapter, we looked at how to load data from a *.csv* file. In practice, however, we will often work with larger and more complex data such as images. For this, PyTorch provides two built-in tools: **Dataset** and **DataLoader**.  
   
Moreover, instead of **batch** gradient descent (all samples are used to compute each gradient) or **SGD** (a single sample is used to compute each gradient), we will use **mini-batch** training. It balances the two: it keeps some of the randomness that helps SGD get past saddle points, while retaining most of the computational speed of processing many samples at once.   
   

#### Some remarks:

Before we start, let's look at some important definitions. They are easy to confuse (they certainly confused me when I started with PyTorch):  

+ Epoch: One forward pass and one backward pass over **all the training samples**.
+ Batch-Size: The number of training samples in **one forward and backward pass**; in other words, how many samples are used to compute each gradient.
+ Iterations: The number of passes needed to complete one epoch, where each pass uses \[Batch-Size] samples.

Here is an example: assume we have 10,000 training samples in total and the batch size is set to 1,000. Then one epoch takes 10,000 / 1,000 = 10 iterations. The number of epochs is up to you, but it should be neither too large nor too small.
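
The arithmetic can be checked with a quick sketch. The helper name `iterations_per_epoch` is made up for illustration, and it assumes that a final, smaller batch is kept when the sample count is not a multiple of the batch size:

```python
import math


def iterations_per_epoch(num_samples: int, batch_size: int) -> int:
    """Number of mini-batches needed to see every sample once.

    Hypothetical helper for illustration; assumes the final, smaller
    batch is kept rather than dropped.
    """
    return math.ceil(num_samples / batch_size)


# The example from the text: 10,000 samples, batch size 1,000.
print(iterations_per_epoch(10_000, 1_000))  # 10

# When the batch size does not divide the sample count evenly,
# one extra (smaller) batch is needed.
print(iterations_per_epoch(10_000, 3_000))  # 4

# Total number of parameter updates over a whole training run.
num_epochs = 5
print(num_epochs * iterations_per_epoch(10_000, 1_000))  # 50
```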

Good, so what can **DataLoader** and **Dataset** do?   

In short, **Dataset** gives you indexed access to a set of samples; see the example below:
   
![SGD_vs_GD](./img/P8/loader.png)
   
After shuffling to add randomness, **DataLoader** packs the data into mini-batches; the mini-batch size in the figure is 2. A `DataLoader` is iterable: looping over it yields one mini-batch at a time.
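
To make the batching concrete, here is a minimal sketch with eight made-up samples and a batch size of 2, as in the figure. It uses `TensorDataset`, a ready-made `Dataset` from `torch.utils.data`, only to avoid defining a custom class at this point; the variable names are illustrative.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Eight toy samples (made-up data): sample i has feature i and label 10 * i.
features = torch.arange(8, dtype=torch.float32).unsqueeze(1)  # shape (8, 1)
labels = features * 10
toy_dataset = TensorDataset(features, labels)

loader = DataLoader(toy_dataset, batch_size=2, shuffle=True)

# A DataLoader is an iterable: iter() gives an iterator, next() the next batch.
batch_iterator = iter(loader)
first_x, first_y = next(batch_iterator)
print(first_x.shape, first_y.shape)  # torch.Size([2, 1]) torch.Size([2, 1])

# A for-loop does the same thing and stops after the last batch.
seen = []
for x_batch, y_batch in loader:
    seen.extend(x_batch.flatten().tolist())

print(len(loader))   # 4 batches of 2 samples each
print(sorted(seen))  # every sample appears exactly once per epoch
```

Because `shuffle=True`, the order of samples differs from one pass to the next, but each pass still covers all eight samples.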

In [6]:
import torch
import numpy as np
from torch.utils.data import Dataset
from torch.utils.data import DataLoader

Keep in mind that `Dataset` is an abstract base class: it is meant to be subclassed rather than used directly. So let's define our own class that inherits from it and implements `__getitem__` and `__len__`.   

In [7]:
class DiabetesDataset(Dataset):
    def __init__(self, filepath):
        xy = np.loadtxt(filepath, delimiter=',', dtype=np.float32)
        self.len = xy.shape[0]  # shape is (number of rows, number of columns)
        self.x_data = torch.from_numpy(xy[:, :-1])
        self.y_data = torch.from_numpy(xy[:, [-1]])

    # this magic method lets us fetch a sample by index: dataset[index]
    def __getitem__(self, index):
        return self.x_data[index], self.y_data[index]

    # this magic method lets len(dataset) return the number of samples
    def __len__(self):
        return self.len
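
To see the two magic methods in action without needing `diabetes.csv`, the sketch below writes a tiny made-up CSV to a temporary file and loads it with the same logic. The class name `CsvDataset` and the file contents are assumptions for illustration only.

```python
import os
import tempfile

import numpy as np
import torch
from torch.utils.data import Dataset


class CsvDataset(Dataset):
    """Same loading logic as DiabetesDataset, repeated so the sketch runs alone."""

    def __init__(self, filepath):
        xy = np.loadtxt(filepath, delimiter=',', dtype=np.float32)
        self.len = xy.shape[0]
        self.x_data = torch.from_numpy(xy[:, :-1])   # all columns but the last
        self.y_data = torch.from_numpy(xy[:, [-1]])  # last column, kept 2-D

    def __getitem__(self, index):
        return self.x_data[index], self.y_data[index]

    def __len__(self):
        return self.len


# Write a made-up 3-row CSV: two feature columns followed by one label column.
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False) as f:
    f.write('0.1,0.2,1\n0.3,0.4,0\n0.5,0.6,1\n')
    path = f.name

toy = CsvDataset(path)
os.remove(path)  # the data is already in memory

print(len(toy))            # calls __len__ -> 3
x0, y0 = toy[0]            # calls __getitem__(0)
print(x0.shape, y0.shape)  # torch.Size([2]) torch.Size([1])
```

Indexing with `[-1]` (a list) instead of `-1` keeps the label column two-dimensional, so each label comes back with shape `(1,)` and a batch of labels has shape `(batch_size, 1)`, matching the model output.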

`DataLoader`, on the other hand, is a concrete class that we can instantiate directly.

In [8]:
dataset = DiabetesDataset('diabetes.csv')
train_loader = DataLoader(dataset=dataset, batch_size=32, shuffle=True, num_workers=0)
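
A quick way to see what these arguments do is to look at the batch sizes the loader produces. The sketch below uses 100 made-up samples with 8 features each (standing in for the CSV data) and shows that, by default, the last batch simply holds whatever is left over; `drop_last=True` is the `DataLoader` option that discards it instead.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Made-up data: 100 samples with 8 features each and one binary label.
x = torch.randn(100, 8)
y = torch.randint(0, 2, (100, 1)).float()
loader = DataLoader(TensorDataset(x, y), batch_size=32, shuffle=True, num_workers=0)

batch_sizes = [inputs.shape[0] for inputs, labels in loader]
print(batch_sizes)  # [32, 32, 32, 4]: the last batch holds the leftovers

# drop_last=True discards the incomplete final batch instead.
strict_loader = DataLoader(TensorDataset(x, y), batch_size=32, shuffle=True, drop_last=True)
print(len(strict_loader))  # 3
```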

Then, follow the usual steps:

+ Prepare dataset -> using Dataset and DataLoader
+ Design model using class -> inherit from nn.Module
+ Construct loss and optimizer -> using PyTorch API
+ Do the training -> forward + backward + update

In [9]:
# design model using class
 
 
class Model(torch.nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.linear1 = torch.nn.Linear(8, 6)
        self.linear2 = torch.nn.Linear(6, 4)
        self.linear3 = torch.nn.Linear(4, 1)
        self.sigmoid = torch.nn.Sigmoid()
 
    def forward(self, x):
        x = self.sigmoid(self.linear1(x))
        x = self.sigmoid(self.linear2(x))
        x = self.sigmoid(self.linear3(x))
        return x
 
 
model = Model()
 
# construct loss and optimizer
criterion = torch.nn.BCELoss(reduction='mean')
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
 
# training cycle forward, backward, update
if __name__ == '__main__':
    for epoch in range(100):
        for i, data in enumerate(train_loader, start=0):  # train_loader shuffles first, then splits into mini-batches
            
            # Prepare data
            inputs, labels = data

            # Forward
            y_pred = model(inputs)
            loss = criterion(y_pred, labels)
            print(epoch, i, loss.item())

            # Backward
            optimizer.zero_grad()
            loss.backward()

            # Update
            optimizer.step()

0 0 0.6934363842010498
0 1 0.6829298138618469
0 2 0.6760309338569641
0 3 0.6819930672645569
0 4 0.6740755438804626
0 5 0.6889039278030396
0 6 0.6808962821960449
0 7 0.674917995929718
0 8 0.6826062798500061
0 9 0.6768490672111511
0 10 0.6822009682655334
0 11 0.6849596500396729
0 12 0.6785305738449097
0 13 0.6755026578903198
0 14 0.6846691966056824
0 15 0.6674026846885681
0 16 0.673801064491272
0 17 0.6908007860183716
0 18 0.6695032119750977
0 19 0.6797128319740295
0 20 0.6720973253250122
0 21 0.6753955483436584
0 22 0.6791607737541199
0 23 0.6699889302253723
1 0 0.6869614720344543
1 1 0.6703138947486877
1 2 0.6825776696205139
1 3 0.6654691696166992
1 4 0.6776126623153687
1 5 0.6684276461601257
1 6 0.686810314655304
1 7 0.6727266907691956
1 8 0.672193169593811
1 9 0.6437758207321167
1 10 0.6713302135467529
1 11 0.6611853241920471
1 12 0.6808934211730957
1 13 0.6706623435020447
1 14 0.6595011353492737
1 15 0.6911572813987732
1 16 0.653641939163208
1 17 0.6802568435668945
1 18 0.6690202355
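
The per-iteration losses are noisy because each one is computed on a different random mini-batch. One common way to get a smoother signal is to average the loss over each epoch. The sketch below is an illustration under stated assumptions, not the notebook's original code: it uses made-up random data in place of `diabetes.csv` and an `nn.Sequential` model with the same layer sizes as `Model`.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Made-up stand-in for the CSV data: 8 features, one binary label.
x = torch.randn(256, 8)
y = (x.sum(dim=1, keepdim=True) > 0).float()
loader = DataLoader(TensorDataset(x, y), batch_size=32, shuffle=True)

model = torch.nn.Sequential(
    torch.nn.Linear(8, 6), torch.nn.Sigmoid(),
    torch.nn.Linear(6, 4), torch.nn.Sigmoid(),
    torch.nn.Linear(4, 1), torch.nn.Sigmoid(),
)
criterion = torch.nn.BCELoss(reduction='mean')
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

epoch_losses = []
for epoch in range(3):
    running_loss = 0.0
    for inputs, labels in loader:
        # Forward
        y_pred = model(inputs)
        loss = criterion(y_pred, labels)

        # Backward and update
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Weight by batch size so a smaller final batch is not over-counted.
        running_loss += loss.item() * inputs.shape[0]

    epoch_losses.append(running_loss / len(loader.dataset))
    print(epoch, epoch_losses[-1])
```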