# Neural Networks: Part 2

In part 2 of this series of tutorials on neural networks, we'll cover some more advanced ideas that make deeper neural networks easier to train: regularization, normalization, activation functions, batches and dataloaders, and stochastic optimizers. We won't introduce new architectures yet; we'll continue to stick with MLPs for now. Recall that an MLP is any neural network whose blocks are composed of linear layers followed by activation functions.

We'll again load the usual libraries we'll need, set a seed, and set a device for those who'd prefer to work on the GPU.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import torch
from torch import nn
import torch.nn.functional as F
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from tqdm.notebook import tqdm

seed = 12
np.random.seed(seed)
torch.manual_seed(seed)

<torch._C.Generator at 0x1220ae6d0>

In [2]:
import warnings
warnings.filterwarnings('ignore')

In [3]:
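# prefer CUDA, then Apple's MPS backend (the output below comes from an MPS machine), then CPU
device = 'cuda' if torch.cuda.is_available() else 'mps' if torch.backends.mps.is_available() else 'cpu'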
device

'mps'

## Regularization

Recall from past tutorials that in practice it's usually a bad sign to have zero loss or 100% accuracy on the training dataset, because it may mean the model is just memorizing the data instead of discovering the underlying patterns in it. This phenomenon is called **overfitting**. To help detect and prevent overfitting, it's customary to keep a hold-out sample of the dataset that the model doesn't get trained on, and to use that sample to evaluate the performance of the model. Such hold-out datasets are called the **test set** or the **validation set**. A model trained on a training set that also performs well on the test set is said to **generalize**, which is just a fancy way of saying the model isn't memorizing the data but actually learning its underlying patterns.

Let's look at an example of what overfitting might look like on a simple dataset, and then we'll discuss some deep learning based techniques to prevent it. Let's use the sklearn `make_blobs` function to create a binary classification dataset of 800 samples with 2 features. We'll then split it up into training and test sets using the sklearn `train_test_split` function. We then plot the data with each axis being a feature $x_1, x_2$ and each color being the class 0 or 1. The test set data is deliberately emphasized in the plot, with the fainter data being the training set data. Notice that the negative samples (the 0 classes) overlap a good bit with the positive samples (the 1 classes).
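The cells that build and plot this dataset aren't shown here, so what follows is only a minimal sketch of how they might look. The 800 samples, 2 features and 2 classes come from the description above; `cluster_std=3.0`, `test_size=0.2` and both `random_state` values are assumptions chosen for illustration, not the values originally used. A larger `cluster_std` spreads each blob out, which makes overlap between the classes more likely.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

# 800 samples, 2 features, 2 classes (0 and 1), as described in the text.
# cluster_std and random_state are ASSUMED values, not taken from the tutorial.
X, y = make_blobs(n_samples=800, n_features=2, centers=2,
                  cluster_std=3.0, random_state=12)

# ASSUMED 80/20 split between training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=12)

# Training points are drawn faintly; test points are emphasized with black edges.
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, alpha=0.2, label='train')
plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test, edgecolors='k', label='test')
plt.xlabel('$x_1$')
plt.ylabel('$x_2$')
plt.legend()
plt.show()
```

With these assumed settings the split leaves 640 training samples and 160 test samples, all as NumPy arrays that can then be converted to tensors.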

Let's now try to train a neural network with 2 hidden layers each of 100 neurons to fit this training set.

In [7]:
X_train = torch.tensor(X_train).float().to(device)
y_train = torch.tensor(y_train).float().to(device).reshape(-1, 1)
X_test = torch.tensor(X_test).float().to(device)
y_test = torch.tensor(y_test).float().to(device).reshape(-1, 1)

num_features = X_train.shape[1]
num_hidden = 100
num_targets = 1

# MLP with 2 hidden layers of num_hidden neurons each; the final sigmoid
# turns the single output into a probability for the positive class
model = nn.Sequential(
    nn.Linear(num_features, num_hidden),
    nn.Sigmoid(),
    nn.Linear(num_hidden, num_hidden),
    nn.Sigmoid(),
    nn.Linear(num_hidden, num_targets),
    nn.Sigmoid()
)
model = model.to(device)
model

Sequential(
  (0): Linear(in_features=2, out_features=100, bias=True)
  (1): Sigmoid()
  (2): Linear(in_features=100, out_features=100, bias=True)
  (3): Sigmoid()
  (4): Linear(in_features=100, out_features=1, bias=True)
  (5): Sigmoid()
)

In [None]:
loss_fn = nn.BCELoss()
opt = torch.optim.SGD(model.parameters(), lr=0.01)
num_iters = 100_000
for i in tqdm(range(num_iters)):
    # training: one full-batch gradient step on the training set
    opt.zero_grad()
    yhat = model(X_train)
    loss = loss_fn(yhat, y_train)
    loss.backward()
    opt.step()
    # inference: evaluate on the test set without tracking gradients
    with torch.no_grad():
        test_loss = loss_fn(model(X_test), y_test)
    if i % (num_iters // 10) == 0:
        print(f'iter = {i} \t\t train loss = {loss.item()} \t\t test loss = {test_loss.item()}')
print(f'iter = {i} \t\t train loss = {loss.item()} \t\t test loss = {test_loss.item()}')

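Loss values alone can be hard to interpret, so one common way to check for overfitting is to compare accuracy on the training set with accuracy on the test set. The helper below is a hypothetical sketch, not part of the original tutorial: `binary_accuracy` and its `threshold` parameter are names introduced here for illustration, and the small random dataset and one-layer model are stand-ins so the snippet runs on its own.

```python
import torch
from torch import nn

def binary_accuracy(model, X, y, threshold=0.5):
    """Fraction of samples whose thresholded sigmoid output matches the 0/1 label."""
    with torch.no_grad():  # evaluation only, no gradients needed
        preds = (model(X) >= threshold).float()
    return (preds == y).float().mean().item()

# Hypothetical stand-in data and model, used only to make this snippet self-contained.
torch.manual_seed(0)
X_demo = torch.randn(20, 2)
y_demo = (X_demo[:, :1] > 0).float()  # label is 1 when the first feature is positive
demo_model = nn.Sequential(nn.Linear(2, 1), nn.Sigmoid())

demo_acc = binary_accuracy(demo_model, X_demo, y_demo)
print(f'demo accuracy = {demo_acc:.3f}')
```

With the tensors from this tutorial, one would compare `binary_accuracy(model, X_train, y_train)` against `binary_accuracy(model, X_test, y_test)`. A training accuracy well above the test accuracy suggests the model is overfitting.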
