# PyTorch

This chapter: training, evaluating, fine-tuning, and optimizing models with PyTorch, then the Optuna library for hyperparameter tuning.

## Fundamentals

The core data type is the tensor: a multi-dimensional array with a shape and a data type. Tensors can live on a GPU and support automatic differentiation.

### Tensors

In [92]:
import torch

X = torch.tensor([[1.0, 4.0, 7.0], [2.0, 3.0, 6.0]])
X

tensor([[1., 4., 7.],
        [2., 3., 6.]])

In [93]:
# Tensors only take one data type. If you give it more than one, the most general type will be selected (complex > float > int > bool)
display(X.dtype)
X.shape

torch.float32

torch.Size([2, 3])
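
The promotion rule from the comment above can be checked by building tensors from mixed Python values and inspecting `dtype`. This is only a small sketch; the concrete types it lands on assume PyTorch's default dtype has not been changed.

```python
import torch

# bool + int -> integer tensor
mixed_bool_int = torch.tensor([True, 2])
# int + float -> float tensor
mixed_int_float = torch.tensor([1, 2.5])
# float + complex -> complex tensor
mixed_float_complex = torch.tensor([1.5, 2 + 1j])

for t in (mixed_bool_int, mixed_int_float, mixed_float_complex):
    print(t.dtype)
```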

In [94]:
# Syntax
X[:,1]
10 * (X+1)
X.exp() #item-wise exponential
X.mean(dim=0) # col-wise mean
X @ X.T

tensor([[66., 56.],
        [56., 49.]])

In [95]:
import numpy as np
torch.tensor(X.numpy(), dtype=torch.float32)
# you can convert btwn numpy and torch easily

tensor([[1., 4., 7.],
        [2., 3., 6.]])
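
One detail worth knowing when converting: some conversions copy the data and some share memory. A minimal sketch with a toy array (the variable names are only for illustration):

```python
import numpy as np
import torch

arr = np.array([[1.0, 4.0], [2.0, 3.0]], dtype=np.float32)

shared = torch.from_numpy(arr)  # shares memory with arr
copied = torch.tensor(arr)      # always makes a copy

arr[0, 0] = 100.0
print(shared[0, 0].item())  # 100.0, the change to arr is visible
print(copied[0, 0].item())  # 1.0, the copy is unaffected

back = shared.numpy()  # for CPU tensors, .numpy() is also a view on the same memory
```

So `torch.tensor(X.numpy())` as used above is the safe, copying route; `torch.from_numpy` avoids the copy but ties the two objects together.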

In [96]:
# You can modify in place
X[:,1] = 99
X

tensor([[ 1., 99.,  7.],
        [ 2., 99.,  6.]])

In [97]:
X[:,0] = -5
X.relu_() 
# Methods ending in _ are in place, normal methods are not in place

tensor([[ 0., 99.,  7.],
        [ 0., 99.,  6.]])

### Hardware Acceleration

PyTorch supports hardware accelerators from Intel, Apple, NVIDIA, AMD, and others.

In [7]:
if torch.cuda.is_available():
    device = "cuda"
    print("We have a gpu")
else:
    device = "cpu" # fall back so `device` is always defined

M = torch.tensor([[1,2,3],[4,5,6]], dtype=torch.float32)
M = M.to(device)

We have a gpu


In [8]:
M.device
# There are multiple ways to put tensors on gpus, like .cuda(), or passing device= to torch.tensor()

device(type='cuda', index=0)

In [9]:
R = M @ M.T
R

tensor([[14., 32.],
        [32., 77.]], device='cuda:0')

If your neural net is deep, GPU speed and RAM matter most; if it's shallow, getting the training data onto the GPU is the bottleneck.

In [10]:
M = torch.rand((1000,1000))
%timeit M @ M.T

26.5 ms ± 742 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [11]:
M = torch.rand((1000,1000), device="cuda")
%timeit M @ M.T

560 µs ± 12.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
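
CUDA kernels are launched asynchronously, so when timing GPU code by hand it is safer to synchronize before reading the clock. The helper below is a hypothetical sketch (`time_matmul` is not a PyTorch function) that does this and falls back to the CPU when no GPU is present.

```python
import time
import torch

def time_matmul(device: str, size: int = 1000, repeats: int = 10) -> float:
    """Return the average seconds per `M @ M.T` on the given device."""
    on_gpu = device.startswith("cuda")
    M = torch.rand((size, size), device=device)
    M @ M.T  # warm-up run, the first call pays one-time setup costs
    if on_gpu:
        torch.cuda.synchronize()  # wait for the warm-up to finish
    start = time.perf_counter()
    for _ in range(repeats):
        M @ M.T
    if on_gpu:
        torch.cuda.synchronize()  # wait for all queued kernels before stopping the clock
    return (time.perf_counter() - start) / repeats

timing_device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"{timing_device}: {time_matmul(timing_device) * 1e3:.3f} ms per matmul")
```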


### Autograd

PyTorch does reverse-mode auto-diff (ch9) quickly with a method called autograd (auto gradients).

In [12]:
x = torch.tensor(5.0, requires_grad=True) # requires_grad tells pytorch to keep track of computations for backpropagation
f = x**2 # f gets a grad_fn attribute that tells pytorch how to backpropagate through this op
f.backward() # computes gradients
x.grad
# the derivative of x**2 at x=5 is in fact 10

tensor(10.)

In [13]:
# To do gradient descent, you need to tell pytorch not to track this step
# Otherwise it would include it in backprop

lr = .1
with torch.no_grad():
    x -= lr*x.grad

# This code is equivalent: (x detached shares memory with x)
# x_detached = x.detach()
# x_detached -= lr*x.grad

In [14]:
# Before you repeat the forward > backward > gradient descent step, need to set gradients to 0
x.grad.zero_()

tensor(0.)

In [15]:
# The whole training loop:
lr = .1
x = torch.tensor(5.0, requires_grad=True)
for i in range(500):
    f = x**2
    f.backward()
    with torch.no_grad():
        x -= lr*x.grad
    x.grad.zero_()

x

tensor(2.8026e-45, requires_grad=True)

In [16]:
# If you want to use in-place operations to save memory you have to be careful
# Autograd doesnt let you modify a tensor in place if its value is needed for the backward pass

t = torch.tensor(2.0, requires_grad=True)
Z = t.exp() # intermediate step; exp() saves its output to compute the gradient later
Z+=1 # in-place operation overwrites that saved output
# Z.backward() #-> throws error

# you need to do Z = Z+1, it creates a new tensor and leaves the saved output intact

PyTorch saves different things for different operations:
- exp(), relu(), sqrt(), sigmoid(), tanh() save their output in the computation graph during the forward pass, so you cannot modify their output in place
- abs(), cos(), log() save their inputs, so you can't change their inputs in place before the backward pass
- max(), min(), sgn(), std() save inputs and outputs, so do not change their inputs or outputs in place before .backward()
- ceil(), floor(), mean(), sum() store nothing. Do what you want

Generally, make your models without in-place ops, then if you need to speed up or save memory you can convert to in-place operations
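
The `exp()` case above can be reproduced end to end. This sketch runs the in-place version, catches the error, then runs the out-of-place version to show that it backpropagates normally.

```python
import math
import torch

t = torch.tensor(2.0, requires_grad=True)

# In-place version: exp() saved its output, and += overwrites it
Z = t.exp()
Z += 1
try:
    Z.backward()
    failed = False
except RuntimeError as err:
    failed = True
    print("in-place version failed:", type(err).__name__)

# Out-of-place version: Z + 1 allocates a new tensor, the saved output is untouched
Z = t.exp()
Z = Z + 1
Z.backward()
print(t.grad)  # d/dt (exp(t) + 1) = exp(t), evaluated at t = 2
```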

## Implementing Linear Regression

In [17]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

data = fetch_california_housing(as_frame=False)
X_temp, X_test, y_temp, y_test = train_test_split(data.data, data.target, test_size=.2)
X_train, X_valid, y_train, y_valid = train_test_split(X_temp, y_temp, test_size=.2)

In [18]:
X_train = torch.FloatTensor(X_train)
X_valid = torch.FloatTensor(X_valid)
X_test = torch.FloatTensor(X_test)
means = X_train.mean(dim=0, keepdims=True)
stds = X_train.std(dim=0, keepdims=True)
# stdizing
X_train = (X_train-means)/stds
X_valid = (X_valid-means)/stds
X_test = (X_test-means)/stds

y_train = torch.FloatTensor(y_train).reshape(-1,1)
y_valid = torch.FloatTensor(y_valid).reshape(-1,1)
y_test = torch.FloatTensor(y_test).reshape(-1,1)

In [19]:
torch.manual_seed(42)
n = X_train.shape[1] # n features
w = torch.randn((n,1), requires_grad=True) # weights
b = torch.tensor(0., requires_grad=True) # biases

lr = .4
epochs = 20

for epoch in range(epochs):
    y_pred = X_train @ w + b
    loss = ((y_pred - y_train) ** 2).mean()
    loss.backward()
    with torch.no_grad():
        b -= b.grad * lr
        w -= w.grad * lr
        b.grad.zero_()
        w.grad.zero_()
    print(f"Epoch {epoch} loss: {loss}")


Epoch 0 loss: 16.10413360595703
Epoch 1 loss: 4.7485222816467285
Epoch 2 loss: 2.1557693481445312
Epoch 3 loss: 1.2681255340576172
Epoch 4 loss: 0.9314724206924438
Epoch 5 loss: 0.7929383516311646
Epoch 6 loss: 0.7285262942314148
Epoch 7 loss: 0.6930748224258423
Epoch 8 loss: 0.6697714328765869
Epoch 9 loss: 0.6522001624107361
Epoch 10 loss: 0.637814462184906
Epoch 11 loss: 0.6255297064781189
Epoch 12 loss: 0.6148261427879333
Epoch 13 loss: 0.6054105162620544
Epoch 14 loss: 0.5970878601074219
Epoch 15 loss: 0.5897108316421509
Epoch 16 loss: 0.5831598043441772
Epoch 17 loss: 0.5773330926895142
Epoch 18 loss: 0.5721437335014343
Epoch 19 loss: 0.5675159096717834


In [20]:
# Making predictions
with torch.no_grad():
    print(X_test[:3] @ w + b)

tensor([[1.3698],
        [3.1893],
        [4.4496]])


This works but PyTorch has a higher level API to do all this easily.

#### PyTorch API

In [None]:
import torch.nn as nn

torch.manual_seed(42)
model = nn.Linear(in_features=n, out_features=1)
model.weight # .weight and .bias are instances of torch.nn.Parameter, which is a subclass of torch.Tensor

Parameter containing:
tensor([[ 0.2703,  0.2935, -0.0828,  0.3248, -0.0775,  0.0713, -0.1721,  0.2076]],
       requires_grad=True)

In [22]:
for param in model.parameters():
    print(param)

Parameter containing:
tensor([[ 0.2703,  0.2935, -0.0828,  0.3248, -0.0775,  0.0713, -0.1721,  0.2076]],
       requires_grad=True)
Parameter containing:
tensor([0.3117], requires_grad=True)


In [23]:
model(X_train[:1])
# not trained yet so predictions are random

tensor([[0.4163]], grad_fn=<AddmmBackward0>)

In [24]:
# We need to pick an optimizer and a loss function
optimizer = torch.optim.SGD(model.parameters(), lr=lr)
mse = nn.MSELoss()

def train_bgd(model, optimizer, criterion, X_train, y_train, n_epochs):
    tenth = max(1, n_epochs // 10) # print ~10 times; max() avoids modulo by zero when n_epochs < 10
    for epoch in range(n_epochs):
        yPred = model(X_train)
        loss = criterion(yPred, y_train) # convention: (prediction, target)
        loss.backward()
        optimizer.step() # updates b, w
        optimizer.zero_grad()
        if(epoch % tenth == 0):
            print(f"Epoch {epoch}, Loss: {loss.item()}")

In [25]:
# training

train_bgd(model, optimizer, mse, X_train, y_train, 200)

Epoch 0, Loss: 4.3149943351745605
Epoch 20, Loss: 0.5357152223587036
Epoch 40, Loss: 0.5262251496315002
Epoch 60, Loss: 0.524800181388855
Epoch 80, Loss: 0.5245195627212524
Epoch 100, Loss: 0.5244590640068054
Epoch 120, Loss: 0.524445652961731
Epoch 140, Loss: 0.5244426131248474
Epoch 160, Loss: 0.5244419574737549
Epoch 180, Loss: 0.5244418382644653


In [26]:
X_new = X_test[:3]
with torch.no_grad():
    y_pred = model(X_new)

y_pred

tensor([[1.3073],
        [3.3351],
        [4.2830]])

## Implementing a Regression MLP

PyTorch has `nn.Sequential`, which lets you chain modules.

In [27]:
model = nn.Sequential(
    nn.Linear(n, 50), # n inputs and any number of outputs
    nn.ReLU(), # shape of output = shape input. just an activation function
    nn.Linear(50,40), # n inputs of this layer must equal n outputs of the previous Linear layer. number of outputs can be whatever you want though
    nn.ReLU(),
    nn.Linear(40,1) # final n outputs must match targets dimension
)

In [28]:
lr = 0.1
optimizer = torch.optim.SGD(model.parameters(), lr=lr)
mse = nn.MSELoss()

model = model.to(device)
X_train = X_train.to(device)
y_train = y_train.to(device)


train_bgd(model, optimizer, mse, X_train, y_train, 1000)
print(next(model.parameters()).device) 

Epoch 0, Loss: 5.926211833953857
Epoch 100, Loss: 0.526648998260498
Epoch 200, Loss: 0.45027604699134827
Epoch 300, Loss: 0.4206412136554718
Epoch 400, Loss: 0.4043380916118622
Epoch 500, Loss: 0.3916199207305908
Epoch 600, Loss: 0.38045424222946167
Epoch 700, Loss: 0.37107452750205994
Epoch 800, Loss: 0.3623618185520172
Epoch 900, Loss: 0.354219526052475
cuda:0


## Mini-Batch Gradient Descent w/ DataLoaders

Torch has a `torch.utils.data.DataLoader` class that efficiently loads data in batches and shuffles it if we want it to.

A DataLoader expects the dataset to implement the `__len__()` and `__getitem__()` methods.

In [29]:
from torch.utils.data import TensorDataset, DataLoader

train_dataset = TensorDataset(X_train, y_train)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

model = nn.Sequential(
    nn.Linear(n, 50), 
    nn.ReLU(), 
    nn.Linear(50,40),
    nn.ReLU(),
    nn.Linear(40,1) 
).to(device)

optimizer = torch.optim.SGD(model.parameters(), lr=lr)

In [30]:
def train(model, optimizer, criterion, train_loader, n_epochs):
    model.train() # this switches modules to training mode, doesnt matter rn but it will later
    for epoch in range(n_epochs):
        total_loss = 0.0
        for X_batch, y_batch in train_loader:
            X_batch, y_batch = X_batch.to(device), y_batch.to(device)
            y_pred = model(X_batch)
            loss = criterion(y_pred, y_batch) # convention: (prediction, target)
            total_loss+=loss.item()
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

        mean_loss = total_loss / len(train_loader)
        print(f"Epoch {epoch+1}/{n_epochs}, Loss: {mean_loss:.4f}")

In [31]:
train(model, optimizer, mse, train_loader, 20)

Epoch 1/20, Loss: nan
Epoch 2/20, Loss: nan
Epoch 3/20, Loss: nan
Epoch 4/20, Loss: nan
Epoch 5/20, Loss: nan
Epoch 6/20, Loss: nan
Epoch 7/20, Loss: nan
Epoch 8/20, Loss: nan
Epoch 9/20, Loss: nan
Epoch 10/20, Loss: nan
Epoch 11/20, Loss: nan
Epoch 12/20, Loss: nan
Epoch 13/20, Loss: nan
Epoch 14/20, Loss: nan
Epoch 15/20, Loss: nan
Epoch 16/20, Loss: nan
Epoch 17/20, Loss: nan
Epoch 18/20, Loss: nan
Epoch 19/20, Loss: nan
Epoch 20/20, Loss: nan

The loss is `nan` because training diverged, most likely because `lr` is still 0.1 from the full-batch run above, which is too high for the noisier mini-batch updates on this data. A much smaller rate, like the `lr = .002` used for the Wide and Deep model later in this chapter, converges. The `nan` metrics in the Model Evaluation section come from evaluating this same diverged model.


- You can set `pin_memory=True` in the DataLoader to speed up CPU-to-GPU transfers, at the cost of more CPU RAM
    - Also pass `non_blocking=True` to the `.to()` method so the CPU is not blocked during the data transfer

- This training loop waits until one batch is done before loading the next. Set the DataLoader's `num_workers=` to load batches in background worker processes, and tweak the number of batches each worker fetches in advance with `prefetch_factor`. Starting workers can be slow on Windows, so set `persistent_workers=True` to reuse them across epochs

- Note that these arguments seem to have issues in Jupyter notebooks, and you might have to work in a .py file
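
The options above might be combined roughly as follows. `make_loader` is a hypothetical helper and the random tensors are stand-in data. Note that these options assume the dataset tensors are still on the CPU, whereas in the cells above `X_train` and `y_train` were already moved to the device before the `TensorDataset` was built.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def make_loader(dataset, batch_size=32, num_workers=2):
    """Build a DataLoader with the speed-related options discussed above."""
    options = dict(
        batch_size=batch_size,
        shuffle=True,
        pin_memory=torch.cuda.is_available(),  # page-locked CPU memory, only useful with a GPU
        num_workers=num_workers,
    )
    if num_workers > 0:
        # these two are only valid when worker processes are used
        options["prefetch_factor"] = 2
        options["persistent_workers"] = True
    return DataLoader(dataset, **options)

# stand-in CPU data: 256 samples, 8 features
cpu_dataset = TensorDataset(torch.randn(256, 8), torch.randn(256, 1))
loader = make_loader(cpu_dataset, num_workers=0)  # 0 = load in the main process

target_device = "cuda" if torch.cuda.is_available() else "cpu"
for X_batch, y_batch in loader:
    # non_blocking only has an effect for pinned-memory -> GPU copies
    X_batch = X_batch.to(target_device, non_blocking=True)
    y_batch = y_batch.to(target_device, non_blocking=True)
```

Using `num_workers=0` here keeps the sketch notebook-friendly; in a .py script you would raise it and wrap the loop in an `if __name__ == "__main__":` guard.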

## Model Evaluation

In [None]:
def evaluate(model, data_loader, metric_fn, aggregate_fn=torch.mean):
    model.eval()
    metrics=[]
    with torch.no_grad():
        for X_batch, y_batch in data_loader:
            X_batch, y_batch = X_batch.to(device), y_batch.to(device)
            y_pred = model(X_batch)
            metric = metric_fn(y_pred, y_batch)
            metrics.append(metric)
    return aggregate_fn(torch.stack(metrics))

In [33]:
valid_set = TensorDataset(X_valid, y_valid)
valid_loader = DataLoader(valid_set, batch_size=32)
valid_mse = evaluate(model, valid_loader, mse)
valid_mse

tensor(nan, device='cuda:0')

In [34]:
# if you want to use rmse
def rmse(pred, true):
    return ((pred-true)**2).mean().sqrt()

evaluate(model, valid_loader, rmse)

# this is the mean of the per-batch rmse values, which is not the same as the rmse over the whole set (the root of the overall mse)

tensor(nan, device='cuda:0')

In [35]:
evaluate(model, valid_loader, mse, aggregate_fn=lambda metrics: torch.sqrt(torch.mean(metrics)))
# Use MSE as metric function, and aggregate by taking the root of the mean mse

tensor(nan, device='cuda:0')
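
A tiny worked example of why the two aggregations differ, using made-up squared errors for two equally sized batches:

```python
import torch

# two "batches" of squared errors, standing in for per-batch results
batch_sq_errors = [torch.tensor([1.0, 1.0]), torch.tensor([9.0, 9.0])]

batch_mse = torch.stack([b.mean() for b in batch_sq_errors])  # tensor([1., 9.])
batch_rmse = batch_mse.sqrt()                                  # tensor([1., 3.])

mean_of_rmse = batch_rmse.mean()       # (1 + 3) / 2 = 2.0
rmse_of_all = batch_mse.mean().sqrt()  # sqrt((1 + 9) / 2) = sqrt(5), about 2.236
print(mean_of_rmse.item(), rmse_of_all.item())
```

Also keep in mind that the mean of per-batch MSEs only equals the MSE over the whole set when all batches have the same size; a smaller last batch gets slightly over-weighted.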

## Nonsequential Models w/ Custom Modules

Wide and Deep neural net: all/part of inputs are connected directly to the output layer, letting the model learn shallow and deep patterns

We need to use `nn.Module` to build our custom network

In [36]:
class WideAndDeep(nn.Module):
    def __init__(self, n_features):
        super().__init__()
        self.deep_stack = nn.Sequential(
            nn.Linear(n_features, 50), nn.ReLU(),
            nn.Linear(50,40), nn.ReLU(),
        )
        self.output_layer = nn.Linear(40+n_features, 1)

    def forward(self, X):
        deep_output = self.deep_stack(X)
        wide_and_deep = torch.concat([X, deep_output], dim=1)
        return self.output_layer(wide_and_deep)

Modules have a .children() method that lets you iterate over the submodules. If your model has a variable number of submodules, you should store them in an nn.ModuleList, and if it has a variable number of params, you should keep them in an nn.ParameterList. Submodules kept in a plain Python list are not registered, so .parameters() and .to(device) would miss them.
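
As a sketch of the `nn.ModuleList` case, here is an MLP whose depth is decided by a list argument. The class name `FlexibleMLP` and its arguments are made up for illustration.

```python
import torch
import torch.nn as nn

class FlexibleMLP(nn.Module):  # hypothetical name, not from the chapter
    def __init__(self, n_features, hidden_sizes):
        super().__init__()
        sizes = [n_features] + list(hidden_sizes)
        # ModuleList registers every layer, so parameters() and .to(device) see them
        self.hidden = nn.ModuleList(
            [nn.Linear(n_in, n_out) for n_in, n_out in zip(sizes[:-1], sizes[1:])]
        )
        self.output_layer = nn.Linear(sizes[-1], 1)

    def forward(self, X):
        for layer in self.hidden:
            X = torch.relu(layer(X))
        return self.output_layer(X)

mlp = FlexibleMLP(n_features=8, hidden_sizes=[50, 40])
print([name for name, _ in mlp.named_children()])  # ['hidden', 'output_layer']
print(mlp(torch.randn(3, 8)).shape)                 # torch.Size([3, 1])
```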

In [37]:
model = WideAndDeep(n).to(device)
lr = .002
optimizer = torch.optim.SGD(model.parameters(), lr=lr)

train(model, optimizer, mse, train_loader, 20)

Epoch 1/20, Loss: 1.4733
Epoch 2/20, Loss: 0.6642
Epoch 3/20, Loss: 0.5963
Epoch 4/20, Loss: 0.5630
Epoch 5/20, Loss: 0.5397
Epoch 6/20, Loss: 0.5259
Epoch 7/20, Loss: 0.5143
Epoch 8/20, Loss: 0.5062
Epoch 9/20, Loss: 0.4983
Epoch 10/20, Loss: 0.4910
Epoch 11/20, Loss: 0.4866
Epoch 12/20, Loss: 0.4773
Epoch 13/20, Loss: 0.4719
Epoch 14/20, Loss: 0.4653
Epoch 15/20, Loss: 0.4612
Epoch 16/20, Loss: 0.4563
Epoch 17/20, Loss: 0.4504
Epoch 18/20, Loss: 0.4441
Epoch 19/20, Loss: 0.4402
Epoch 20/20, Loss: 0.4345


In [38]:
# If you want to send a subset of the features through the wide path, and a different (maybe overlapping) subset through the deep path, you can do something like this:
class WideAndDeep2(nn.Module):
    def __init__(self, n_features):
        super().__init__()
        self.deep_stack = nn.Sequential(
            nn.Linear(n_features - 2, 50), nn.ReLU(), # deep path gets X[:, 2:]
            nn.Linear(50,40), nn.ReLU(),
        )
        self.output_layer = nn.Linear(40 + 5, 1) # wide path gets X[:, :5]

    def forward(self, X):
        X_wide = X[:, :5]
        X_deep = X[:, 2:]
        deep_output = self.deep_stack(X_deep)
        wide_and_deep = torch.concat([X_wide, deep_output], dim=1)
        return self.output_layer(wide_and_deep)

### Making Models with Multiple Inputs

It's usually better to let the model take two tensors as input rather than splitting the data inside the model (for example, images + text).

In [39]:
class WideAndDeep3(nn.Module):
    def __init__(self, n_features):
        super().__init__()
        self.deep_stack = nn.Sequential(
            nn.Linear(n_features - 2, 50), nn.ReLU(), # X_deep has n_features - 2 columns
            nn.Linear(50,40), nn.ReLU(),
        )
        self.output_layer = nn.Linear(40 + 5, 1) # X_wide has 5 columns

    def forward(self, X_wide, X_deep):
        deep_output = self.deep_stack(X_deep)
        wide_and_deep = torch.concat([X_wide, deep_output], dim=1)
        return self.output_layer(wide_and_deep)

In [40]:
train_data = TensorDataset(X_train[:, :5], X_train[:, 2:], y_train)
train_loader_wd = DataLoader(train_data, batch_size=32, shuffle=True)

# you need to make sure your train and eval functions handle 3 tensors
def evaluate(model, data_loader, metric_fn, aggregate_fn=torch.mean):
    model.eval()
    metrics=[]
    with torch.no_grad():
        for X_batch_wide, X_batch_deep, y_batch in data_loader:
            X_batch_wide = X_batch_wide.to(device)
            X_batch_deep = X_batch_deep.to(device)
            y_batch = y_batch.to(device)
            y_pred = model(X_batch_wide, X_batch_deep)
            metric = metric_fn(y_pred, y_batch)
            metrics.append(metric)
    return aggregate_fn(torch.stack(metrics))

def train(model, optimizer, criterion, train_loader, n_epochs):
    model.train()
    for epoch in range(n_epochs):
        total_loss = 0.0
        for X_batch_wide, X_batch_deep, y_batch in train_loader:
            X_batch_wide = X_batch_wide.to(device)
            X_batch_deep = X_batch_deep.to(device)
            y_batch = y_batch.to(device)
            y_pred = model(X_batch_wide, X_batch_deep)
            loss = criterion(y_pred, y_batch) # convention: (prediction, target)
            total_loss+=loss.item()
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

        mean_loss = total_loss / len(train_loader)
        print(f"Epoch {epoch+1}/{n_epochs}, Loss: {mean_loss:.4f}")

In [41]:
# IFF your inputs are in the same order everywhere, you can use something like this: 

# for *X_batch_inputs, y_batch in data_loader:
#     X_batch_inputs = [X.to(device) for X in X_batch_inputs]
#     y_batch = y_batch.to(device)
#     y_pred = model(*X_batch_inputs)

If your model has a lot of inputs, it can be easy to mess up the order. You can make a custom dataset this way:

In [None]:
class WideAndDeepDataset(torch.utils.data.Dataset):
    def __init__(self, X_wide, X_deep, y):
        self.X_wide = X_wide
        self.X_deep = X_deep
        self.y = y

    def __len__(self):
        return len(self.y)
    
    def __getitem__(self, idx):
        input_dict = {"X_wide": self.X_wide[idx], "X_deep": self.X_deep[idx]}
        return input_dict, self.y[idx]

In [None]:
train_data_named = WideAndDeepDataset(
                    X_wide = X_train[:,:5], 
                    X_deep = X_train[:,2:], 
                    y=y_train)
train_loader_named = DataLoader(train_data_named, batch_size=32, shuffle=True)

In [44]:
# Update your train and eval with this chunk:

# for inputs, y_batch in data_loader:
#     inputs = {name: X.to(device) for name, X in inputs.items()}
#     y_batch = y_batch.to(device)
#     y_pred = model(X_wide=inputs["X_wide"], X_deep=inputs["X_deep"])

### Models w/ Multiple Outputs

Example problems:
- Locate and classify an object in an image (regression + classification)
- Multiple independent tasks based on the same data (is someone smiling, are they wearing glasses)
- Regularization (forcing one part of the network to learn something on its own)

In [45]:
class WideAndDeep4(nn.Module):
    def __init__(self, n_features):
        super().__init__()
        self.deep_stack = nn.Sequential(
            nn.Linear(n_features - 2, 50), nn.ReLU(), # X_deep has n_features - 2 columns
            nn.Linear(50,40), nn.ReLU(),
        )
        self.output_layer = nn.Linear(40 + 5, 1) # X_wide has 5 columns
        self.aux_output_layer = nn.Linear(40,1)

    def forward(self, X_wide, X_deep):
        deep_output = self.deep_stack(X_deep)
        wide_and_deep = torch.concat([X_wide, deep_output], dim=1)
        main_output = self.output_layer(wide_and_deep)
        aux_output = self.aux_output_layer(deep_output)
        return main_output, aux_output
    
# need to modify training function too

def train(model, optimizer, criterion, train_loader, n_epochs):
    model.train()
    for epoch in range(n_epochs):
        total_loss = 0.0
        for inputs, y_batch in train_loader:
            inputs = {k: v.to(device) for k, v in inputs.items()}
            y_batch = y_batch.to(device)
            y_pred, y_pred_aux = model(**inputs)
            main_loss = criterion(y_pred, y_batch)
            aux_loss = criterion(y_pred_aux, y_batch)
            loss = 0.8 * main_loss + 0.2 * aux_loss # you can fine tune the ratio
            total_loss+=loss.item()
            loss.backward()
            optimizer.step()
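            optimizer.zero_grad()

        mean_loss = total_loss / len(train_loader)
        print(f"Epoch {epoch+1}/{n_epochs}, Loss: {mean_loss:.4f}")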

def evaluate(model, data_loader, metric_fn, aggregate_fn=torch.mean):
    model.eval()
    metrics=[]
    with torch.no_grad():
        for inputs, y_batch in data_loader:
            inputs = {k: v.to(device) for k, v in inputs.items()}
            y_batch = y_batch.to(device)
            y_pred, _ = model(**inputs)
            metric = metric_fn(y_pred, y_batch)
            metrics.append(metric)
    return aggregate_fn(torch.stack(metrics))

## Image Classifier in PyTorch

TorchVision has a lot of tools for computer vision

In [46]:
import torchvision
import torchvision.transforms.v2 as T

toTensor = T.Compose([T.ToImage(), T.ToDtype(torch.float32, scale=True)]) #scales 0-1
# by default, fashionmnist loads PIL images, but we need pytorch float tensors
# you can transform data by specifying a transform argument in the dataset call
train_and_valid_data = torchvision.datasets.FashionMNIST(
    root="datasets", train=True, download=True, transform=toTensor
)
test_data = torchvision.datasets.FashionMNIST(
    root="datasets", train=False, download=True, transform=toTensor
)

torch.manual_seed(42)
train_data, valid_data = torch.utils.data.random_split(
    train_and_valid_data, [55000, 5000]
)



In [47]:
train_loader = DataLoader(train_data, batch_size=32, shuffle=True)
valid_loader = DataLoader(valid_data, batch_size=32)
test_loader = DataLoader(test_data, batch_size=32)

X_sample, y_sample = train_data[0]
X_sample.shape
# first dim is the channel dimension: greyscale has just 1 channel, rgb has 3

torch.Size([1, 28, 28])

In [48]:
train_and_valid_data.classes

['T-shirt/top',
 'Trouser',
 'Pullover',
 'Dress',
 'Coat',
 'Sandal',
 'Shirt',
 'Sneaker',
 'Bag',
 'Ankle boot']

### Building the Classifier

In [None]:
class ImageClassifier(nn.Module):
    def __init__(self, n_inputs, n_hidden1, n_hidden2, n_classes):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Flatten(), # reshapes each input to single dimension
            nn.Linear(n_inputs, n_hidden1),
            nn.ReLU(),
            nn.Linear(n_hidden1, n_hidden2),
            nn.ReLU(),
            nn.Linear(n_hidden2, n_classes)
        )

    def forward(self, X):
        return self.mlp(X)

In [50]:
model = ImageClassifier(n_inputs=28*28, n_hidden1=300, n_hidden2=100, n_classes=10)
xentropy = nn.CrossEntropyLoss()

We don't need a softmax layer to output class probabilities: PyTorch's `nn.CrossEntropyLoss` works directly with logits rather than probas (more efficient and numerically stable).

But then we need to compute the probas manually if we want them.

For binary classification, use a single output and `nn.BCEWithLogitsLoss`.
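
To make the logits point concrete, the sketch below checks on toy values that `nn.CrossEntropyLoss` applied to logits matches a log-softmax followed by the negative log-likelihood, and shows the binary counterpart. The numbers are arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]])  # 2 samples, 3 classes
targets = torch.tensor([0, 2])                                # class indices

loss_from_logits = nn.CrossEntropyLoss()(logits, targets)
# the same thing done by hand: log-softmax, then negative log-likelihood
loss_by_hand = F.nll_loss(F.log_softmax(logits, dim=1), targets)

probas = F.softmax(logits, dim=1)  # only needed if you want probabilities

# binary case: one logit per sample, float targets of the same shape
binary_logits = torch.tensor([[1.2], [-0.7]])
binary_targets = torch.tensor([[1.0], [0.0]])
binary_loss = nn.BCEWithLogitsLoss()(binary_logits, binary_targets)
binary_probas = torch.sigmoid(binary_logits)  # sigmoid plays the role of softmax here

print(loss_from_logits.item(), loss_by_hand.item(), binary_loss.item())
```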

In [51]:
!pip install torchmetrics
import torchmetrics
device = "cuda" if torch.cuda.is_available() else "cpu"

def train(model, optimizer, criterion, train_loader, n_epochs):
    model.train()
    for epoch in range(n_epochs):
        total_loss = 0.0
        for inputs, y_batch in train_loader:
            inputs = inputs.to(device)
            y_batch = y_batch.to(device)
            y_pred = model(inputs)
            loss = criterion(y_pred, y_batch)
            loss.backward()
            total_loss += loss.item()
            optimizer.step()
            optimizer.zero_grad()
        print(f"Epoch: {epoch} - Loss: {total_loss/len(train_loader)}")

def evaluate(model, data_loader, metric_fn, aggregate_fn=torch.mean):
    model.eval()
    metrics=[]
    with torch.no_grad():
        for inputs, y_batch in data_loader:
            inputs = inputs.to(device)
            y_batch = y_batch.to(device)
            y_pred = model(inputs)
            metric = metric_fn(y_pred, y_batch)
            metrics.append(metric)
    return aggregate_fn(torch.stack(metrics))

model = model.to(device)
accuracy = torchmetrics.Accuracy(task="multiclass", num_classes=10).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=lr)
train(model, optimizer, xentropy, train_loader, 2)
# only 2 epochs here because training is slow; use more than 5 if you really want to train this

Epoch: 0 - Loss: 2.067446428513097
Epoch: 1 - Loss: 1.2289777294780848


In [52]:
# Doing predictions
iter = __builtins__.iter

model.eval()
X_new, y_new = next(iter(valid_loader))
X_new = X_new[:3].to(device)
with torch.no_grad():
    y_pred_logits = model(X_new)

y_pred = y_pred_logits.argmax(dim=1)
[train_and_valid_data.classes[index] for index in y_pred]

['Sneaker', 'Coat', 'Coat']

In [53]:
# Getting predicted probas
import torch.nn.functional as F

F.softmax(y_pred_logits, dim=1).round(decimals=3)[0]

tensor([0.0020, 0.0020, 0.0090, 0.0030, 0.0050, 0.2260, 0.0050, 0.5580, 0.0690,
        0.1210], device='cuda:0')

In [54]:
## Get top k predictions
y_top4_logits, y_top4_idx = torch.topk(y_pred_logits, k=4, dim=1)
y_top4_probas = F.softmax(y_top4_logits, dim=1)
y_top4_probas.round(decimals=3)
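# y_top4_idx has the class index of each proba
# note: softmax over only the top 4 logits renormalizes them to sum to 1, so these values
# are slightly higher than the full-softmax probas above (0.573 vs 0.558 for the first one)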

tensor([[0.5730, 0.2330, 0.1240, 0.0710],
        [0.4800, 0.3040, 0.1670, 0.0500],
        [0.3540, 0.3190, 0.2170, 0.1100]], device='cuda:0')

For unbalanced datasets (the number of instances per class is not equal), set `weight` in `nn.CrossEntropyLoss` to a vector containing each class's weight, where a class's weight is total_instances / class_instances (scaled to 0-1), so smaller classes get a greater weight.
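
A minimal sketch of that weighting on toy labels. Dividing by the sum is just one way to scale the weights to 0-1 and is an assumption here, not something the chapter prescribes.

```python
import torch
import torch.nn as nn

labels = torch.tensor([0, 0, 0, 0, 0, 0, 1, 1, 1, 2])  # toy, unbalanced: 6 / 3 / 1

class_counts = torch.bincount(labels, minlength=3).float()
weights = labels.numel() / class_counts  # total_instances / class_instances
weights = weights / weights.sum()        # assumed scaling to 0-1: weights sum to 1

weighted_xentropy = nn.CrossEntropyLoss(weight=weights)
loss = weighted_xentropy(torch.zeros(10, 3), labels)  # dummy logits, just to show the call
print(weights, loss.item())
```

When training on a GPU, the weight tensor has to be on the same device as the logits, for example by calling `.to(device)` on the loss module.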

## Fine-Tuning Neural Nets with Optuna

In [55]:
%pip install optuna
import optuna



In [56]:
def objective(trial):
    lr = trial.suggest_float("learning_rate", 1e-5, 1e-1, log=True)
    n_hidden = trial.suggest_int("n_hidden", 20, 300)
    model = ImageClassifier(n_inputs=1*28*28, n_hidden1=n_hidden, n_hidden2=n_hidden, n_classes=10).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    xentropy = nn.CrossEntropyLoss()
    train(model, optimizer, xentropy, train_loader, 1)
    v = evaluate(model, valid_loader, xentropy)
    return v

In [57]:
sampler = optuna.samplers.TPESampler(seed=42)
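# the objective returns the validation loss, so lower is better -> minimize
# (the logged runs below were made with direction="maximize", which is why they report the highest-loss trial as best)
study = optuna.create_study(direction="minimize", sampler=sampler)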
study.optimize(objective, n_trials=1)

[I 2026-02-02 01:19:26,380] A new study created in memory with name: no-name-7cd68fd8-e096-4383-b27b-6bd448ebfe52


Epoch: 0 - Loss: 2.270235518787821


[I 2026-02-02 01:19:42,986] Trial 0 finished with value: 2.2352724075317383 and parameters: {'learning_rate': 0.00031489116479568613, 'n_hidden': 287}. Best is trial 0 with value: 2.2352724075317383.


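`suggest_float` and `suggest_int` ask Optuna for promising values. For the learning rate we sample on a log scale, so small values get sampled more often than they would on a uniform scale.

The direction passed to `create_study` must match what the objective returns: "maximize" for a score where higher is better (such as accuracy), "minimize" for a loss. This objective returns the validation cross-entropy loss, so the direction should be "minimize".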

In [58]:
study.best_params
# study.best_value

{'learning_rate': 0.00031489116479568613, 'n_hidden': 287}

Generally it's better to pass your dataloaders into the objective function as arguments.

In [59]:
def objective(trial, train_loader, valid_loader):
    lr = trial.suggest_float("learning_rate", 1e-5, 1e-1, log=True)
    n_hidden = trial.suggest_int("n_hidden", 20, 300)
    model = ImageClassifier(n_inputs=1*28*28, n_hidden1=n_hidden, n_hidden2=n_hidden, n_classes=10).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    xentropy = nn.CrossEntropyLoss()
    train(model, optimizer, xentropy, train_loader, 1)
    v = evaluate(model, valid_loader, xentropy)
    return v

objective_with_data = lambda trial: objective(
    trial, train_loader=train_loader, valid_loader=valid_loader)
study.optimize(objective_with_data, n_trials=1)

Epoch: 0 - Loss: 1.1457130481261997


[I 2026-02-02 01:19:59,612] Trial 1 finished with value: 0.7096635103225708 and parameters: {'learning_rate': 0.008471801418819975, 'n_hidden': 188}. Best is trial 0 with value: 2.2352724075317383.


You can make a pruner to get rid of bad trials instead of wasting compute on them

MedianPruner args:
- n_startup_trials: trials until pruning starts
- n_warmup_steps: epochs until pruning starts (after startup trials)
- interval_steps: how many epochs between checks for pruning

In [60]:
pruner = optuna.pruners.MedianPruner(n_startup_trials=3, n_warmup_steps=0, interval_steps=1)
study = optuna.create_study(direction="minimize", sampler=sampler, pruner=pruner) # the objective returns a loss

n_epochs=1
# adjust the objective function like so:
def objective(trial, train_loader, valid_loader):
    # sample hyperparameters and build the model once per trial, outside the epoch loop
    lr = trial.suggest_float("learning_rate", 1e-5, 1e-1, log=True)
    n_hidden = trial.suggest_int("n_hidden", 20, 300)
    model = ImageClassifier(n_inputs=1*28*28, n_hidden1=n_hidden, n_hidden2=n_hidden, n_classes=10).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    xentropy = nn.CrossEntropyLoss()
    for epoch in range(n_epochs):
        train(model, optimizer, xentropy, train_loader, 1)
        v = evaluate(model, valid_loader, xentropy)
        trial.report(v, epoch) # report the intermediate value so the pruner can compare trials
        if trial.should_prune():
            raise optuna.TrialPruned()
    return v

[I 2026-02-02 01:19:59,677] A new study created in memory with name: no-name-be08838d-fc3a-446b-b94b-dcb2e60c2f87


## Saving and Loading PyTorch Models

In [61]:
# simplest way:
torch.save(model, "my_model.pt")

# load:
loaded_model = torch.load("my_model.pt", weights_only=False)

You must keep the custom classes/functions available when loading a model saved this way, because the file only stores references to them, not the code itself.

Setting `weights_only=False` means you load the whole pickled model object, not just weights.

In [62]:
loaded_model.eval()
loaded_model(X_new) # make prediction

tensor([[-2.0883, -2.2779, -0.6950, -1.6345, -1.1914,  2.5527, -1.3021,  3.4541,
          1.3635,  1.9228],
        [ 1.2032, -2.2533,  3.9686,  0.2705,  4.4270, -2.0547,  3.3680, -4.3984,
          2.1554, -2.1012],
        [ 1.2739, -2.9186,  3.5424, -0.1503,  3.6466, -1.5096,  3.1561, -3.6141,
          2.4764, -1.5193]], device='cuda:0', grad_fn=<AddmmBackward0>)

There are some issues with this method:
- pickle (the engine behind torch.save()) can run arbitrary code when loading, so a .pt file from someone else could contain anything (malware)
- pickle locates your classes by import path, so loading can break if the code is moved or renamed

In [63]:
# to avoid this, only save and load weights
torch.save(model.state_dict(), "weights.pt")

# state_dict() returns all params + buffers (see ch11)

In [64]:
# to use the weights, need to create an identical model and manually set weights

new_model = ImageClassifier(n_inputs=28*28, n_hidden1=300, n_hidden2=100, n_classes=10)
weights = torch.load("weights.pt", weights_only=True)
new_model.load_state_dict(weights)
new_model.eval()

ImageClassifier(
  (mlp): Sequential(
    (0): Flatten(start_dim=1, end_dim=-1)
    (1): Linear(in_features=784, out_features=300, bias=True)
    (2): ReLU()
    (3): Linear(in_features=300, out_features=100, bias=True)
    (4): ReLU()
    (5): Linear(in_features=100, out_features=10, bias=True)
  )
)

In [65]:
# It's a good idea to save the hyperparameters alongside the weights as well

model_data = {
    "model_hyperparameters": {"n_inputs":28*28, "n_hidden1":300, "n_hidden2":100, "n_classes":10},
    "model_state_dict": model.state_dict()
}
torch.save(model_data, "weights.pt")

# you can rebuild like this

loaded_data = torch.load("weights.pt", weights_only=True)
new_model = ImageClassifier(**loaded_data["model_hyperparameters"])
new_model.load_state_dict(loaded_data["model_state_dict"])
new_model.eval()

ImageClassifier(
  (mlp): Sequential(
    (0): Flatten(start_dim=1, end_dim=-1)
    (1): Linear(in_features=784, out_features=300, bias=True)
    (2): ReLU()
    (3): Linear(in_features=300, out_features=100, bias=True)
    (4): ReLU()
    (5): Linear(in_features=100, out_features=10, bias=True)
  )
)

To stop in the middle of training, save, load, and then resume training, you also have to save the optimizer's state dict (which holds its hyperparameters and internal state), plus epoch/loss info if you need it.
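
A minimal sketch of such a checkpoint, assuming a hypothetical tiny nn.Linear model, random data, and made-up dictionary keys (`epoch`, `loss`, `model_state_dict`, `optimizer_state_dict`); none of these names come from the chapter's own training code:

```python
import os
import tempfile

import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 2)  # hypothetical stand-in for a real model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
loss_fn = nn.MSELoss()
X, y = torch.randn(8, 4), torch.randn(8, 2)  # random stand-in data

# Train for a few "epochs" (one batch each) so the optimizer builds up momentum state
for epoch in range(3):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optimizer.step()

# Save everything needed to resume
path = os.path.join(tempfile.mkdtemp(), "checkpoint.pt")
torch.save({
    "epoch": epoch,
    "loss": loss.item(),
    "model_state_dict": model.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(),
}, path)

# Later: rebuild the same model and optimizer, then restore their state
checkpoint = torch.load(path, weights_only=True)
resumed_model = nn.Linear(4, 2)
resumed_model.load_state_dict(checkpoint["model_state_dict"])
resumed_optimizer = torch.optim.SGD(resumed_model.parameters(), lr=0.1, momentum=0.9)
resumed_optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
start_epoch = checkpoint["epoch"] + 1  # continue from the next epoch
resumed_model.train()
```

Without the optimizer state, training would restart with empty momentum buffers, which can cause a jump in the loss right after resuming.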

HuggingFace's `safetensors` and TorchScript are also ways to save models.

## Compiling and Optimizing a PyTorch Model

PyTorch can automatically convert model code to TorchScript, which can improve speed through optimizations such as fusing operations and precomputing constant expressions (constant folding).

TorchScript can be serialized, saved to disk, then loaded and executed in Python or in C++ with LibTorch, which makes it possible to run PyTorch models on lots of devices.

There are 2 ways to convert a model to TorchScript:

1: **tracing**

- PyTorch runs your model with sample data, logs every operation, and converts this log to TorchScript

In [66]:
# tracing
torchscript_model = torch.jit.trace(model, X_new)

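This works well for static models, whose .forward() doesn't depend on conditionals or loops.

If you have an if/match statement, only the branch that gets executed during tracing is recorded by TorchScript. Likewise, loops are unrolled to the number of iterations seen during tracing.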
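
A small sketch of this pitfall, using a hypothetical toy module `SignShift` (not part of the chapter's model) whose forward() branches on the data:

```python
import warnings

import torch
import torch.nn as nn

class SignShift(nn.Module):  # hypothetical toy module with a data-dependent branch
    def forward(self, x):
        if x.sum() > 0:
            return x + 10
        return x - 10

shift = SignShift()
pos, neg = torch.ones(3), -torch.ones(3)

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # tracing warns about converting a tensor to a Python bool
    traced = torch.jit.trace(shift, pos)  # only the `x + 10` branch gets recorded

eager_out = shift(neg)    # regular PyTorch takes the other branch: x - 10
traced_out = traced(neg)  # the trace replays the recorded branch: x + 10
print(eager_out, traced_out)
```

The traced module silently gives a different answer from the original whenever the input would have taken the branch that was not seen during tracing.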

2: **scripting**

- PyTorch parses your code directly and converts it to TorchScript
- this works with if/while, as long as the conditions are tensors
- only works on a subset of Python: no global vars, no generators, no variable-length function arguments, types must be fixed
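
A minimal sketch of scripting a data-dependent branch, using a hypothetical toy module `SignShift` with explicit type annotations. Note that torch.jit.script needs access to the source code, so this is meant to be run from a file or a notebook cell:

```python
import torch
import torch.nn as nn

class SignShift(nn.Module):  # hypothetical toy module, not the chapter's model
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The condition is a tensor, so TorchScript can compile both branches
        if x.sum() > 0:
            out = x + 10
        else:
            out = x - 10
        return out

scripted = torch.jit.script(SignShift())
pos_out = scripted(torch.ones(3))   # takes the `x + 10` branch
neg_out = scripted(-torch.ones(3))  # takes the `x - 10` branch
print(pos_out, neg_out)
```

Because the code is parsed rather than recorded, both branches end up in the compiled program.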


In [67]:
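# scripting, then optimizing for inference (this freezes the model, which requires eval mode)
torchscript_model = torch.jit.script(model.eval())
optimized = torch.jit.optimize_for_inference(torchscript_model)
optimized.save("model_scripted.pt")  # stores the whole TorchScript program, not just weights
torch.jit.load("model_scripted.pt")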

RecursiveScriptModule(original_name=ImageClassifier)

You can optimize TorchScript models for inference regardless of whether you script or trace, then save and load them.

TorchScript is no longer actively developed; the PyTorch team now focuses on torch.compile()

In [68]:
compiled_model = torch.compile(model)

The compiled model can be used just like the original; it is compiled and optimized automatically the first time you call it.

torch.compile() relies on inspecting the Python bytecode to capture conditionals, loops, etc.

## Exercises

1. PyTorch features vs numpy: Autograd, DataLoaders, saving models, lots of activation functions
    - Also hardware acceleration
2. exp() vs exp_() : Not in place vs in place operations
3. Making tensor on gpu: You can specify the device in the tensor declaration or use its .to() method
4. Tensor ops without autograd: wrap them in with torch.no_grad():
    - Or create the tensor with requires_grad=False, or call .detach()
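5. I don't think you can run in-place operations on this because you're calling .backward()
    - Not true. In-place ops are fine as long as they don't overwrite a value autograd saved for the backward pass. You can call .exp_() last after cos(): exp's backward only needs its own output (the derivative of exp(x) is exp(x)), and cos's backward needs its input, which is untouched. You cannot call .cos_() last after exp(): exp() saved its output for backward, and cos_() overwrites it.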
6. Linear(100,200) module. Neurons: 200. Weight is 100x200 matrix, bias is 200x1 matrix. Expects z x 100 input. Produces z x 200 output.
    - Weight is actually 200x100 (out_features x in_features), and bias is a 1D tensor of shape (200,).
7. Training loop steps: Dataloader, optimizer, training call, forward pass, backward, gradient descent, zero_grad().
    - Sample batch > to gpu, forward pass, loss, backward, optimizer step, zero grad.
8. Why create optimizer after model is on gpu: 
    - Most optimizers have some internal state, and this state is on the same device as the model params
9. pin_memory, num_workers, prefetch_factor
10. Cross Entropy Loss, BCELoss, BCEWithLogitsLoss. BCE/BCEWithLogits is for binary classification, cross entropy is for multiclass.
11. Call model.train() and model.eval() because some layers don't behave the same way during training and evaluation
    - e.g. nn.Dropout, nn.BatchNorm1d/2d/3d
12. trace doesn't work with conditionals in forward, script does
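
A couple of these answers can be checked directly. The sketch below assumes exercise 5 is about chaining cos/exp with an in-place op last (the exact exercise code is not shown in these notes, so the tensors here are made up), and uses an arbitrary batch size of 32 for exercise 6:

```python
import torch
import torch.nn as nn

# Exercise 5: an in-place op is fine unless it overwrites a tensor saved for backward
t = torch.tensor(2.0, requires_grad=True)
z = t.cos().exp_()  # cos() saved its input t; exp_() only overwrites cos's output
z.backward()
expected = -torch.sin(t.detach()) * torch.exp(torch.cos(t.detach()))  # chain rule by hand

t2 = torch.tensor(2.0, requires_grad=True)
try:
    z2 = t2.exp().cos_()  # exp() saved its output for backward; cos_() overwrites it
    z2.backward()
    inplace_failed = False
except RuntimeError as err:
    inplace_failed = True
    print("backward failed:", str(err)[:60])

# Exercise 6: shapes of nn.Linear(100, 200)
layer = nn.Linear(100, 200)
out = layer(torch.rand(32, 100))
print(layer.weight.shape, layer.bias.shape, out.shape)
```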


In [104]:
x = torch.tensor([1.2, 3.4], requires_grad=True)
f = torch.sin(x[0]**2 * x[1])
f.backward()
x.grad

# you can also define the inputs in separate tensors and get the same result.

tensor([1.4899, 0.2629])
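
For reference, a sketch of the separate-tensor version mentioned in the comment above. The names `a` and `b` are made up here and stand for x[0] and x[1]; by hand, the partial derivatives are cos(a²b)·2ab and cos(a²b)·a²:

```python
import torch

# Same function as above, but each input lives in its own scalar tensor
a = torch.tensor(1.2, requires_grad=True)
b = torch.tensor(3.4, requires_grad=True)
f = torch.sin(a**2 * b)
f.backward()

# Each tensor gets its own .grad; stack them to compare with x.grad above
grads = torch.stack([a.grad, b.grad])
print(grads)
```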