# CS 39AA - Notebook 6: Minibatches and Dataloaders

[![Kaggle](https://kaggle.com/static/images/open-in-kaggle.svg)](https://kaggle.com/kernels/welcome?src=https://github.com/sgeinitz/CS39AA/blob/main/nb6_minibatches_and_dataloaders.ipynb)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sgeinitz/CS39AA/blob/main/nb6_minibatches_and_dataloaders.ipynb)

In the previous notebook we trained a logistic regression model using scikit-learn and then implemented it again as a neural network using PyTorch.

For training the neural network we iterated over the entire dataset many times, using every observation for each parameter update. In practice, this is not what is done. Instead, we will use a subset (a mini-batch) of the training data for each update of the parameter estimates (i.e. make predictions on the subset, calculate the loss, derive gradients, then update the parameters).
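As a rough sketch of the four steps just listed, the snippet below performs a single parameter update on a randomly chosen subset using plain NumPy. The toy data and the names `x_toy`, `y_toy`, `lr`, and `batch_idx` are assumptions made for illustration, as is the choice of binary cross-entropy as the loss; this is not the code used later in the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (assumed for illustration): 100 observations, one feature each
x_toy = rng.normal(loc=1.0, size=100)
y_toy = (x_toy > 1).astype(float)

w, b, lr = 0.0, 0.0, 0.1  # starting parameters and learning rate (arbitrary)

def bce(w, b, xs, ys):
    """Binary cross-entropy of sigmoid(w * x + b) against 0/1 labels."""
    p = 1 / (1 + np.exp(-(w * xs + b)))
    return -np.mean(ys * np.log(p) + (1 - ys) * np.log(1 - p))

# Pick a mini-batch of 10 observations without replacement
batch_idx = rng.choice(len(x_toy), size=10, replace=False)
xb, yb = x_toy[batch_idx], y_toy[batch_idx]

# 1. Make predictions on the mini-batch
p = 1 / (1 + np.exp(-(w * xb + b)))
# 2. Calculate the loss
loss_before = bce(w, b, xb, yb)
# 3. Derive gradients (for cross-entropy, d loss / d logit = p - y)
grad_w = np.mean((p - yb) * xb)
grad_b = np.mean(p - yb)
# 4. Update the parameters with one gradient descent step
w_new, b_new = w - lr * grad_w, b - lr * grad_b
loss_after = bce(w_new, b_new, xb, yb)

print(f"loss before: {loss_before:.4f}, loss after: {loss_after:.4f}")
```

Only the 10 sampled observations are touched in this update; the remaining 90 play no role until they are drawn in a later mini-batch.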

Let's first quickly get back to where we were last time by generating the data of 100 observations, $i = 1, 2, \dots, 100$. Each observation, $i$, has a single __feature__, $x_i$ (aka predictor, independent variable, explanatory variable, etc.), and an __outcome__ or __target__, $y_i$.

In [None]:
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

random.seed(1) 
np.random.seed(1)

N = 100 # total number of observations
D_in = 1 # input dimension (i.e. dimension of a single observation's x vector)
D_out = 1 # output dimension (i.e. y), so just 1 for this example

# Create random input data and derive the 'true' labels/output
x = np.random.randn(N, D_in) + 1 
def true_y(x_in, n_obs):
    def addNoise(x):
        # check the narrower band first; otherwise this branch can never be reached
        if abs(x-1) < 0.1:
            return 0.2
        elif abs(x-1) < 1.0:
            return 0.1
        else:
            return 0.025

    # each row is a length-1 array; index it so the comparisons work on a scalar
    return np.apply_along_axis(lambda row: [int(row[0] < 1) if random.random() < addNoise(row[0]) else int(row[0] > 1)], 1, x_in)
    
y = true_y(x, N).flatten()

plt.scatter(x[y == 1,0], y[y == 1], c='blue', alpha=0.4)
plt.scatter(x[y == 0,0], y[y == 0], c='red', alpha=0.4)
plt.xlabel("x")
plt.ylabel("y")
plt.legend(('positive cases', 'negative cases'), loc='upper left')
plt.show()

We then looked at the surface of the loss function as a function of the parameters, $b$, and $w$, to see what their ideal values should be. 

* $\mathrm{Loss}_{MSE} = \frac{1}{N} \sum_{i=1}^{N} \Big( y_i - (1 + e^{-(b + w x_i)})^{-1} \Big)^2$

Since we want to minimize the loss function the parameter estimates will be where the surface is dark green in the plot shown below (i.e. small values of $b$ and large values of $w$).

In [None]:
w = np.arange(6, -4.1, -0.5)
b = np.arange(-6, 4.1, 0.5)
surf = np.array( [[1/N * np.square(y - 1 / (1 + np.exp(-1 * (w[i]*x[:,0] + b[j])))).sum() for j in range(len(b))] for i in range(len(w))] )
df = pd.DataFrame(surf, columns=b, index=w)
p1 = sns.heatmap(df, cbar_kws={'label': 'MSE loss'}, cmap="RdYlGn_r")
plt.xlabel("b")  # DataFrame columns (b) run along the x-axis
plt.ylabel("w")  # DataFrame index (w) runs along the y-axis
plt.show()
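Each cell of the surface above is the MSE formula evaluated at one $(w, b)$ pair. As a minimal sketch, the same formula can be written as a small function and evaluated at two points. The function name `mse_loss` and the noise-free toy data are assumptions for illustration, so the numbers will not match the heatmap exactly.

```python
import numpy as np

def mse_loss(w, b, x_vals, y_vals):
    """Mean squared error between the labels and sigmoid(b + w * x)."""
    preds = 1 / (1 + np.exp(-(b + w * x_vals)))
    return np.mean((y_vals - preds) ** 2)

rng = np.random.default_rng(0)
x_toy = rng.normal(loc=1.0, size=100)  # toy feature values (assumed)
y_toy = (x_toy > 1).astype(float)      # noise-free labels (assumed)

loss_flat = mse_loss(0.0, 0.0, x_toy, y_toy)    # every prediction is 0.5
loss_steep = mse_loss(4.0, -4.0, x_toy, y_toy)  # decision boundary at x = 1

print(f"w=0, b=0:  {loss_flat:.4f}")
print(f"w=4, b=-4: {loss_steep:.4f}")
```

With $w = b = 0$ every prediction is 0.5, so each squared error is 0.25 regardless of the label. A large $w$ paired with a small $b$ places the decision boundary near $x = 1$ and gives a lower loss.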

Next, we trained the single-layer perceptron model (which is equivalent to logistic regression) using PyTorch.

In [None]:
import torch

#del model 

# Randomly initialize weights and other data
torch.manual_seed(42)
x_tensor = torch.tensor(x).float()
y_tensor = torch.tensor(y).float().reshape(N, D_out)

# Define a Perceptron class 
class Perceptron(torch.nn.Module):
    def __init__(self, input_dim):
        super(Perceptron, self).__init__()
        self.lay1 = torch.nn.Linear(input_dim, 1) # linear layer
        self.act = torch.nn.Sigmoid()             # activation layer
    def forward(self, x, apply_sigmoid=False):
        output = self.lay1(x) 
        if apply_sigmoid:
            output = self.act(output)
        return output

model = Perceptron(1)

# Define loss function to be used 
#loss_fn = torch.nn.MSELoss()
#loss_fn = torch.nn.BCELoss()
loss_fn = torch.nn.BCEWithLogitsLoss() # expects only linear predictor (w/o sigmoid applied)
learning_rate = 5e-2
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
losses = []
params = []

# Carry out gradient descent 
for i in range(200+1):

    params.append(tuple([param.item() for param in model.parameters()]))

    # Forward pass: compute the linear predictor only (no sigmoid),
    # since BCEWithLogitsLoss applies the sigmoid internally
    y_pred = model.forward(x_tensor, apply_sigmoid=False)

    # Compute and store loss, and print occasionally
    loss = loss_fn(y_pred, y_tensor)

    losses.append(loss.item())
    if i % 50 == 0:
        print(f"iteration {i}: loss = {loss.item():.4f}")

    # Zero all gradients before backward pass
    optimizer.zero_grad()

    # Backprop then call optimizer step to update all (relevant) model parameters
    loss.backward()
    optimizer.step()

print("w and b estimates:")
for param in model.parameters():
    print(f"   {param.item():.4f}")


fig, (ax1, ax2, ax3) = plt.subplots(1,3, figsize=(18,4))
ax1.plot(range(0,len(losses)), losses)
ax1.set(xlabel="training iteration", ylabel="loss")
ax2.plot(range(0,len(params)), [parm[0] for parm in params])
ax2.set(xlabel="training iteration", ylabel="parameter: w")
ax3.plot(range(0,len(params)), [parm[1] for parm in params])
ax3.set(xlabel="training iteration", ylabel="parameter: b")
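The comment on `BCEWithLogitsLoss` in the cell above notes that it expects the linear predictor without the sigmoid applied. The short check below illustrates why that matters, using made-up logits and labels (the tensor values are assumptions for illustration only).

```python
import torch

# Made-up raw outputs of a linear layer and their 0/1 targets
logits = torch.tensor([[2.0], [-1.0], [0.5], [-3.0]])
targets = torch.tensor([[1.0], [0.0], [1.0], [0.0]])

# Option 1: pass raw logits to BCEWithLogitsLoss
loss_from_logits = torch.nn.BCEWithLogitsLoss()(logits, targets)

# Option 2: apply the sigmoid ourselves, then use plain BCELoss
loss_from_probs = torch.nn.BCELoss()(torch.sigmoid(logits), targets)

# Mistake: sigmoid applied first AND BCEWithLogitsLoss, so values are squashed twice
loss_double_sigmoid = torch.nn.BCEWithLogitsLoss()(torch.sigmoid(logits), targets)

print(loss_from_logits.item(), loss_from_probs.item(), loss_double_sigmoid.item())
```

The first two values agree up to floating-point error, while the third is larger because the doubly squashed predictions can never get close to 0 or 1.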

### 2. Using Batches

Now let's try training the same model but with batches (more precisely, mini-batches).

In [None]:
# Randomly initialize weights and other data
torch.manual_seed(42)
x_tensor = torch.tensor(x).float()
y_tensor = torch.tensor(y).float().reshape(N, D_out)

# Define a Perceptron class 
class Perceptron(torch.nn.Module):
    def __init__(self, input_dim):
        super(Perceptron, self).__init__()
        self.lay1 = torch.nn.Linear(input_dim, 1)
        self.act = torch.nn.Sigmoid()
    def forward(self, x):
        output = self.lay1(x)
        output = self.act(output)
        return output

# Declare a perceptron instance
model = Perceptron(1)

# Define loss function to be used 
n_epochs = 50
loss_fn = torch.nn.MSELoss()
learning_rate = 5e-2
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
losses = []
params = []

batch_size = 25
batch_iterations = N // batch_size # number of mini-batches per epoch

# Carry out gradient descent 
for i in range(1, n_epochs+1):

    for j in range(batch_iterations):

        params.append(tuple([param.item() for param in model.parameters()]))

        x_t = x_tensor[j*batch_size:(j+1)*batch_size]
        y_t = y_tensor[j*batch_size:(j+1)*batch_size]

        # Forward pass: compute predicted y
        y_pred = model.forward(x_t)
    
        # Compute and store loss
        loss = loss_fn(y_pred, y_t)
        losses.append(loss.item())

        # Zero all gradients before backward pass
        optimizer.zero_grad()

        # Backprop then call optimizer step to update all (relevant) model parameters
        loss.backward()
        optimizer.step()
    if i % 25 == 0:
        print(f"epoch {i}: loss = {loss.item():.4f}, w = {params[-1][0]:.4f}, b = {params[-1][1]:.4f}")

print("w and b estimates:")
for param in model.parameters():
    print(f"    {param.item():.4f}")

fig, (ax1, ax2, ax3) = plt.subplots(1,3, figsize=(18,4))
ax1.plot(range(0,len(losses)), losses)
ax1.set(xlabel="training iteration", ylabel="loss")
ax2.plot(range(0,len(params)), [parm[0] for parm in params])
ax2.set(xlabel="training iteration", ylabel="parameter: w")
ax3.plot(range(0,len(params)), [parm[1] for parm in params])
ax3.set(xlabel="training iteration", ylabel="parameter: b")

In [None]:
x_tensor[:5]


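Above we see that there is a systematic oscillation, or pattern, to the training. This happens because the mini-batches are the same fixed slices of the data, visited in the same order, in every epoch. In practice we want to randomly select a mini-batch each time. We could implement such a sampling technique ourselves, but PyTorch already provides utilities for this.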
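For reference, a hand-rolled version of random mini-batch selection might look like the sketch below. It reshuffles the indices once per epoch with `torch.randperm` and then slices the shuffled indices into batches. The names `x_demo`, `perm`, and `batches` are assumptions for illustration, and `x_demo` is only a stand-in for `x_tensor`.

```python
import torch

torch.manual_seed(0)
n_obs, batch_size = 100, 25
x_demo = torch.arange(n_obs).float().reshape(n_obs, 1)  # stand-in for x_tensor

# A fresh random ordering of the observation indices for this epoch
perm = torch.randperm(n_obs)

# Slice the shuffled indices into consecutive mini-batches
batches = [perm[k:k + batch_size] for k in range(0, n_obs, batch_size)]

for idx in batches:
    x_batch = x_demo[idx]  # rows of this mini-batch, in shuffled order
    # ... forward pass, loss, backward pass, and optimizer step would go here ...

print(len(batches), tuple(x_batch.shape))
```

Calling `torch.randperm` again at the start of each epoch gives every epoch a different grouping of observations, while still visiting each observation exactly once per epoch.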

### 3. Batches w/ PyTorch Dataset and DataLoader

To make it much easier to use batches, we'll first make use of the PyTorch Dataset class. The typical approach is to create our own child class that inherits from the PyTorch Dataset class. This way we can inherit the functionality we want while tailoring the class to behave exactly as we need it to (e.g. using names we want for x, y, etc.).

In [None]:
from torch.utils.data import Dataset

class MyDataset(Dataset):

    def __init__(self, x, y):
        self.x = torch.tensor(x, dtype=torch.float64)
        self.y = torch.tensor(y, dtype=torch.float64)

    def __getitem__(self, index):
        return {'x': self.x[index], 'y': self.y[index]}

    def __len__(self):
        return self.x.shape[0]

dataset = MyDataset(x, y)

print(f"Without dataset object x[4:6] is: {x[4:6]}")

print(f"Using dataset object dataset[4:6] is: {dataset[4:6]}")

dataset[4:6]['x']

Next, we can use the PyTorch DataLoader class to load our data and to generate a new batch each time we want to run a batch iteration.

In [None]:
from torch.utils.data import DataLoader

dataloader = DataLoader(dataset, batch_size=10, shuffle=True)

#for batch_index, batch_data in enumerate(dataloader):
#    print("batch", batch_index, ":", batch_data)
for batch_data in dataloader:
    print("batch: ", batch_data)
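When the batch size does not divide the number of observations evenly, the final batch of an epoch is smaller than the rest. The sketch below shows this with a toy dataset. It uses PyTorch's built-in `TensorDataset` purely to stay self-contained (an assumption for illustration; it yields tuples rather than the dictionaries returned by `MyDataset`).

```python
import math
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy data: 100 observations with one feature and a dummy label each
features = torch.arange(100).float().reshape(100, 1)
labels = torch.zeros(100, 1)
toy_dataset = TensorDataset(features, labels)

# 100 observations with batch_size=30 gives 4 batches; the last holds the remaining 10
loader = DataLoader(toy_dataset, batch_size=30, shuffle=True)
batch_sizes = [xb.shape[0] for xb, yb in loader]
print(len(loader), batch_sizes)

# drop_last=True discards the incomplete final batch instead
loader_drop = DataLoader(toy_dataset, batch_size=30, shuffle=True, drop_last=True)
print(len(loader_drop))
```

Because `shuffle=True`, the observations that end up in each batch change from epoch to epoch, but the number and sizes of the batches stay the same.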

In [None]:
# Randomly initialize weights and other data
torch.manual_seed(42)
w = torch.randn(1, requires_grad=True).reshape(1,1)
b = torch.randn(1, requires_grad=True).reshape(1,1)

# Define MyDataset class and create an instance
class MyDataset(Dataset):
    def __init__(self, x, y):
        self.x = torch.tensor(x, dtype=torch.float32)
        self.y = torch.tensor(y, dtype=torch.float32).reshape(-1, D_out) # one row per observation
    def __getitem__(self, index):
        return {'x': self.x[index], 'y': self.y[index]}
    def __len__(self):
        return self.x.shape[0]
dataset = MyDataset(x, y)


# Define a Perceptron class 
class Perceptron(torch.nn.Module):
    def __init__(self, input_dim):
        super(Perceptron, self).__init__()
        self.lay1 = torch.nn.Linear(input_dim, 1)
        self.act = torch.nn.Sigmoid()
    def forward(self, x):
        output = self.lay1(x)
        output = self.act(output)
        return output

# Declare a perceptron instance
model = Perceptron(1)

# Define loss function to be used 
n_epochs = 100
loss_fn = torch.nn.MSELoss()
learning_rate = 5e-1
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
losses = []
params = []

batch_size = 50

dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

# Carry out gradient descent 
for i in range(1, n_epochs+1):

    for batch_index, batch_data in enumerate(dataloader):

        params.append(tuple([param.item() for param in model.parameters()]))

        # Forward pass: compute predicted y
        y_pred = model.forward(batch_data['x'])
    
        # Compute and store loss
        loss = loss_fn(y_pred, batch_data['y'])
        losses.append(loss.item())

        # Zero all gradients before backward pass
        optimizer.zero_grad()

        # Backprop then call optimizer step to update all (relevant) model parameters
        loss.backward()
        optimizer.step()
        
    if i % 25 == 0:
        print(f"epoch {i}: loss = {loss.item():.4f}, w = {params[-1][0]:.4f}, b = {params[-1][1]:.4f}")


print("w and b estimates:")
for param in model.parameters():
    print(f"    {param.item():.4f}")

fig, (ax1, ax2, ax3) = plt.subplots(1,3, figsize=(18,4))
ax1.plot(range(0,len(losses)), losses)
ax1.set(xlabel="optimization iteration", ylabel="loss")
ax2.plot(range(0,len(params)), [parm[0] for parm in params])
ax2.set(xlabel="optimization iteration", ylabel="parameter: w")
ax3.plot(range(0,len(params)), [parm[1] for parm in params])
ax3.set(xlabel="optimization iteration", ylabel="parameter: b")


Another tool we will use often from now on is the torchsummary Python module, which prints out all of the layers of a PyTorch model. It also outputs the number of parameters in each layer and the dimensions of the data as it passes through each layer. This will be helpful in the future for understanding our models.

In [None]:
import torchsummary
torchsummary.summary(model, (1, 1))
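The parameter count in such a summary can also be checked by hand using only core PyTorch. The sketch below does this for a single `Linear(1, 1)` layer, the same shape as the perceptron's linear layer. The names `layer`, `n_params`, and `shapes` are assumptions for illustration.

```python
import torch

# Same shape as the perceptron's single linear layer: 1 input, 1 output
layer = torch.nn.Linear(1, 1)

# Count trainable parameters by summing the number of elements in each tensor
n_params = sum(p.numel() for p in layer.parameters() if p.requires_grad)

# Look at each parameter's name and shape
shapes = {name: tuple(p.shape) for name, p in layer.named_parameters()}

print(n_params)
print(shapes)
```

The layer holds one weight and one bias, which correspond to $w$ and $b$ in the plots above. The sigmoid activation has no parameters of its own.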

Lastly, let's take a quick peek at what the surface of the loss function looks like for different (mini)batch sizes. 

__TIP:__ Try changing the batch size parameter, `bs`, below to smaller and larger numbers to see how much the surface varies relative to the loss surface for all $N=100$ observations.

In [None]:
bs = 10 # batch_size

random.seed(1)
inds = [random.randint(0,99) for r in range(bs*3)]

x1 = np.array([x[inds[i],0] for i in range(bs)])
y1 = np.array([y[inds[i]] for i in range(bs)])
x2 = np.array([x[inds[i],0] for i in range(bs, bs*2)])
y2 = np.array([y[inds[i]] for i in range(bs, bs*2)])
x3 = np.array([x[inds[i],0] for i in range(bs*2, bs*3)])
y3 = np.array([y[inds[i]] for i in range(bs*2, bs*3)])

b1 = np.arange(6, -4.1, -0.5)
b0 = np.arange(-6, 4.1, 0.5)

# Average over the bs observations in each mini-batch (not over all N) so the scale matches the full-data MSE
surf1 = np.array( [[1/bs * np.square(y1 - 1 / (1 + np.exp(-1 * (b1[i]*x1 + b0[j])))).sum() for j in range(len(b0))] for i in range(len(b1))] )
df1 = pd.DataFrame(surf1, columns=b0, index=b1)

surf2 = np.array( [[1/bs * np.square(y2 - 1 / (1 + np.exp(-1 * (b1[i]*x2 + b0[j])))).sum() for j in range(len(b0))] for i in range(len(b1))] )
df2 = pd.DataFrame(surf2, columns=b0, index=b1)

surf3 = np.array( [[1/bs * np.square(y3 - 1 / (1 + np.exp(-1 * (b1[i]*x3 + b0[j])))).sum() for j in range(len(b0))] for i in range(len(b1))] )
df3 = pd.DataFrame(surf3, columns=b0, index=b1)

fig, (ax1, ax2, ax3) = plt.subplots(1,3, figsize=(15,4))
sns.heatmap(df1, ax=ax1, cmap="RdYlGn_r")
sns.heatmap(df2, ax=ax2, cmap="RdYlGn_r")
sns.heatmap(df3, ax=ax3, cbar_kws={'label': 'loss'}, cmap="RdYlGn_r")
# Label every panel, not just the last one
for ax in (ax1, ax2, ax3):
    ax.set(xlabel="beta0", ylabel="beta1")
plt.show()
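To put a rough number on how much the mini-batch loss varies with batch size, the sketch below fixes one $(w, b)$ pair and measures the spread of the loss across many randomly drawn mini-batches. The toy data, the fixed parameter values, and the helper names `mse` and `batch_loss_std` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
x_all = rng.normal(loc=1.0, size=100)  # toy feature values (assumed)
y_all = (x_all > 1).astype(float)      # noise-free labels (assumed)

def mse(w, b, xs, ys):
    """Mean squared error between the labels and sigmoid(w * x + b)."""
    return np.mean((ys - 1 / (1 + np.exp(-(w * xs + b)))) ** 2)

w_fixed, b_fixed = 1.0, 0.0  # an arbitrary point on the loss surface

def batch_loss_std(bs, n_draws=2000):
    """Standard deviation of the loss over n_draws random mini-batches of size bs."""
    losses = []
    for _ in range(n_draws):
        idx = rng.choice(len(x_all), size=bs, replace=False)
        losses.append(mse(w_fixed, b_fixed, x_all[idx], y_all[idx]))
    return np.std(losses)

std_small = batch_loss_std(5)
std_large = batch_loss_std(50)
print(f"std of loss with bs=5:  {std_small:.4f}")
print(f"std of loss with bs=50: {std_large:.4f}")
```

Smaller batches give a noisier estimate of the full-data loss at the same $(w, b)$, which is why their loss surfaces differ more from one another.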