# PyTorch

This chapter: train, evaluate, fine-tune, and optimize models with PyTorch. Then, the Optuna library for hyperparameter tuning.

## Fundamentals

The core data type is the tensor: a multi-dimensional array with a shape and a data type. Tensors can live on a GPU and support automatic differentiation.

### Tensors

In [93]:
import torch

X = torch.tensor([[1.0, 4.0, 7.0], [2.0, 3.0, 6.0]])
X

tensor([[1., 4., 7.],
        [2., 3., 6.]])

In [94]:
# A tensor holds a single data type. If you pass mixed types, the most general one is selected (complex > float > int > bool)
display(X.dtype)
X.shape

torch.float32

torch.Size([2, 3])
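A quick way to see the promotion rule is to build tensors from mixed Python values and inspect `dtype`. This is a minimal sketch: the variable names are illustrative, and the exact widths (`float32`, `complex64`) assume PyTorch's default settings.

```python
import torch

# Each tensor is built from mixed Python values; PyTorch picks the most general type.
bools = torch.tensor([True, False])          # bool only
ints = torch.tensor([True, 2, 3])            # bool + int -> int
floats = torch.tensor([1, 2.5, True])        # int + float + bool -> float
complexes = torch.tensor([1, 2.5, 3 + 4j])   # anything + complex -> complex

for name, t in [("bools", bools), ("ints", ints),
                ("floats", floats), ("complexes", complexes)]:
    print(f"{name}: {t.dtype}")
```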

In [95]:
# Syntax
X[:,1] # second column
10 * (X+1) # element-wise arithmetic
X.exp() # element-wise exponential
X.mean(dim=0) # column-wise mean
X @ X.T # matrix multiplication

tensor([[66., 56.],
        [56., 49.]])

In [96]:
import numpy as np
# Converting between NumPy and PyTorch is easy: X.numpy() returns an ndarray, torch.tensor() turns it back into a tensor
torch.tensor(X.numpy(), dtype=torch.float32)

tensor([[1., 4., 7.],
        [2., 3., 6.]])
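One detail worth knowing when converting: some conversions share memory and some copy. The sketch below (illustrative variable names) contrasts `torch.from_numpy()`, which shares memory with the array, with `torch.tensor()`, which copies. Calling `.numpy()` on a CPU tensor also returns an array that shares memory with the tensor.

```python
import numpy as np
import torch

a = np.array([1.0, 2.0, 3.0], dtype=np.float32)

shared = torch.from_numpy(a)   # shares memory with the NumPy array
copied = torch.tensor(a)       # makes an independent copy

a[0] = 100.0                   # modify the NumPy array in place

print(shared)  # reflects the change
print(copied)  # still holds the original values
```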

In [97]:
# You can modify in place
X[:,1] = 99
X

tensor([[ 1., 99.,  7.],
        [ 2., 99.,  6.]])

In [98]:
X[:,0] = -5
X.relu_()
# Methods ending in _ modify the tensor in place; their counterparts without _ return a new tensor

tensor([[ 0., 99.,  7.],
        [ 0., 99.,  6.]])

### Hardware Acceleration

PyTorch supports hardware accelerators from Intel, Apple, NVIDIA, AMD, and others.

In [99]:
device = "cuda" if torch.cuda.is_available() else "cpu" # fall back to the CPU so `device` is always defined
if device == "cuda":
    print("We have a gpu")

M = torch.tensor([[1,2,3],[4,5,6]], dtype=torch.float32)
M = M.to(device)

We have a gpu


In [100]:
M.device
# There are multiple ways to put tensors on the GPU, like .cuda(), or passing device= when creating the tensor, e.g. in torch.tensor()

device(type='cuda', index=0)

In [101]:
R = M @ M.T
R

tensor([[14., 32.],
        [32., 77.]], device='cuda:0')
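To keep the same notebook runnable on machines without an NVIDIA GPU, the device can be picked with a fallback. This is a sketch, not the notes' original code: `pick_device` is a hypothetical helper, and the `mps` check (Apple GPUs) assumes a PyTorch version that ships `torch.backends.mps`.

```python
import torch

def pick_device() -> str:
    """Return the best available device name, falling back to the CPU."""
    if torch.cuda.is_available():
        return "cuda"
    mps = getattr(torch.backends, "mps", None)  # missing in old PyTorch versions
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"

device = pick_device()

# Create the tensor directly on the chosen device instead of moving it afterwards
M = torch.tensor([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]], device=device)
R = M @ M.T  # the result lives on the same device as M
print(device, R)
```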

If your neural net is deep, GPU speed and RAM matter most. If it's shallow, getting the training data onto the GPU is the bottleneck.

In [102]:
M = torch.rand((1000,1000))
%timeit M @ M.T

18.5 ms ± 3.01 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [103]:
M = torch.rand((1000,1000), device="cuda")
%timeit M @ M.T

651 µs ± 13.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
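One caveat about timing GPU code: CUDA kernels are launched asynchronously, so a timer can stop before the GPU has actually finished, and the measured time may under-report the real cost. A more careful measurement synchronizes before reading the clock. The sketch below uses a hypothetical `time_matmul` helper and a small matrix so it also runs quickly on the CPU.

```python
import time
import torch

def time_matmul(device: str, n: int = 256, repeats: int = 10) -> float:
    """Average seconds per M @ M.T, waiting for the GPU to finish if needed."""
    M = torch.rand((n, n), device=device)
    M @ M.T  # warm-up run, so one-time setup costs are not measured
    if device == "cuda":
        torch.cuda.synchronize()  # make sure the warm-up is done
    start = time.perf_counter()
    for _ in range(repeats):
        M @ M.T
    if device == "cuda":
        torch.cuda.synchronize()  # wait for all queued kernels to finish
    return (time.perf_counter() - start) / repeats

cpu_seconds = time_matmul("cpu")
print(f"cpu: {cpu_seconds * 1e3:.3f} ms per matmul")
if torch.cuda.is_available():
    print(f"cuda: {time_matmul('cuda') * 1e3:.3f} ms per matmul")
```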


### Autograd

PyTorch does reverse-mode auto-diff (ch9) quickly with a method called autograd (auto gradients).

In [104]:
x = torch.tensor(5.0, requires_grad=True) # requires_grad tells PyTorch to track computations on x for backpropagation
f = x**2 # f gets a grad_fn attribute that tells PyTorch how to backpropagate through this operation
f.backward() # computes the gradients and stores them in x.grad
x.grad
# the derivative of x**2 is 2x, which is indeed 10 at x=5

tensor(10.)

In [105]:
# To do gradient descent, you need to tell pytorch not to track this step
# Otherwise it would include it in backprop

lr = .1
with torch.no_grad():
    x -= lr*x.grad

# This code is equivalent: (x detached shares memory with x)
# x_detached = x.detach()
# x_detached -= lr*x.grad

In [106]:
# Before repeating the forward > backward > gradient descent steps, reset the gradients to 0 (backward() adds to .grad instead of overwriting it)
x.grad.zero_()

tensor(0.)
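To see why the reset matters, here is a small sketch showing that `backward()` accumulates into `.grad` rather than overwriting it (variable names are illustrative):

```python
import torch

x = torch.tensor(5.0, requires_grad=True)

(x ** 2).backward()
first = x.grad.item()        # 2 * 5 = 10

(x ** 2).backward()          # no reset: the new gradient is added to the old one
accumulated = x.grad.item()  # 10 + 10 = 20

x.grad.zero_()               # reset, then recompute
(x ** 2).backward()
after_reset = x.grad.item()  # back to 10

print(first, accumulated, after_reset)
```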

In [107]:
# The whole training loop:
lr = .1
x = torch.tensor(5.0, requires_grad=True)
for _ in range(500): # `_` avoids shadowing the built-in iter()
    f = x**2
    f.backward()
    with torch.no_grad():
        x -= lr*x.grad
    x.grad.zero_()

x

tensor(2.8026e-45, requires_grad=True)

In [108]:
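# If you want to use in-place operations to save memory you have to be careful:
# autograd raises an error if an in-place op overwrites a tensor it saved for the backward pass
# (it also refuses in-place ops on leaf tensors that require grad, hence torch.no_grad() above)

t = torch.tensor(2.0, requires_grad=True)
Z = t.exp() # intermediate step; exp() saves its output to compute the gradient later
Z+=1 # in-place operation: overwrites the output that exp() saved
# Z.backward() #-> raises a RuntimeError

# Use Z = Z+1 instead: it creates a new tensor and leaves the saved output intact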

Different operations save different tensors for the backward pass:
- exp(), relu(), sqrt(), sigmoid(), tanh() save their output in the computation graph during the forward pass. You cannot modify their output in place.
- abs(), cos(), log() save their inputs, so you can't modify their inputs in place before the backward pass
- max(), min(), sgn(), std() save both inputs and outputs, so do not modify either in place before .backward()
- ceil(), floor(), mean(), sum() store nothing. Do what you want

Generally, build your models without in-place ops first. If you later need to speed things up or save memory, you can convert some operations to in-place ones.
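The rules above can be checked directly. The sketch below (illustrative variable names) tries three cases: modifying the output of `exp()` in place, the out-of-place alternative, and modifying the output of `sum()` in place, which is harmless because `sum()` saves no tensor.

```python
import math
import torch

# 1) exp() saves its output, so overwriting that output in place breaks backward()
t = torch.tensor(2.0, requires_grad=True)
z = t.exp()
z += 1
try:
    z.backward()
    in_place_failed = False
except RuntimeError:
    in_place_failed = True

# 2) The out-of-place version creates a new tensor, so the saved output is intact
t2 = torch.tensor(2.0, requires_grad=True)
z2 = t2.exp()
z2 = z2 + 1
z2.backward()            # d/dt (exp(t) + 1) = exp(t)

# 3) sum() saves no tensor, so modifying its output in place is fine
v = torch.tensor([1.0, 2.0], requires_grad=True)
s = v.sum()
s += 1
s.backward()             # gradient of the sum w.r.t. each element is 1

print(in_place_failed, t2.grad, v.grad)
```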

## Implementing Linear Regression

In [109]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

data = fetch_california_housing(as_frame=False)
X_temp, X_test, y_temp, y_test = train_test_split(data.data, data.target, test_size=.2)
X_train, X_valid, y_train, y_valid = train_test_split(X_temp, y_temp, test_size=.2)

In [110]:
X_train = torch.FloatTensor(X_train)
X_valid = torch.FloatTensor(X_valid)
X_test = torch.FloatTensor(X_test)
means = X_train.mean(dim=0, keepdims=True)
stds = X_train.std(dim=0, keepdims=True)
# standardize every split with the training set's mean and std
X_train = (X_train-means)/stds
X_valid = (X_valid-means)/stds
X_test = (X_test-means)/stds

y_train = torch.FloatTensor(y_train).reshape(-1,1)
y_valid = torch.FloatTensor(y_valid).reshape(-1,1)
y_test = torch.FloatTensor(y_test).reshape(-1,1)

In [111]:
torch.manual_seed(42)
n = X_train.shape[1] # number of features
w = torch.randn((n,1), requires_grad=True) # weights
b = torch.tensor(0., requires_grad=True) # bias

lr = .4
epochs = 20

for epoch in range(epochs):
    y_pred = X_train @ w + b
    loss = ((y_pred - y_train) ** 2).mean()
    loss.backward()
    with torch.no_grad():
        b -= b.grad * lr
        w -= w.grad * lr
        b.grad.zero_()
        w.grad.zero_()
    print(f"Epoch {epoch} loss: {loss}")


Epoch 0 loss: 16.12990951538086
Epoch 1 loss: 4.828019618988037
Epoch 2 loss: 2.190487861633301
Epoch 3 loss: 1.2756378650665283
Epoch 4 loss: 0.9258849024772644
Epoch 5 loss: 0.7817054390907288
Epoch 6 loss: 0.7151285409927368
Epoch 7 loss: 0.679024875164032
Epoch 8 loss: 0.6556843519210815
Epoch 9 loss: 0.6383019685745239
Epoch 10 loss: 0.6241695284843445
Epoch 11 loss: 0.6121392250061035
Epoch 12 loss: 0.6016687154769897
Epoch 13 loss: 0.5924591422080994
Epoch 14 loss: 0.5843158960342407
Epoch 15 loss: 0.5770943760871887
Epoch 16 loss: 0.5706777572631836
Epoch 17 loss: 0.5649677515029907
Epoch 18 loss: 0.5598797798156738
Epoch 19 loss: 0.5553401112556458


In [112]:
# Making predictions
with torch.no_grad():
    print(X_test[:3] @ w + b)

tensor([[1.7111],
        [0.9956],
        [1.2532]])


This works, but PyTorch has a higher-level API that handles all of this for us.

### PyTorch API

In [113]:
import torch.nn as nn

torch.manual_seed(42)
model = nn.Linear(in_features=n, out_features=1)
model.weight # .weight and .bias are instances of torch.nn.Parameter, which is a subclass of torch.Tensor

Parameter containing:
tensor([[ 0.2703,  0.2935, -0.0828,  0.3248, -0.0775,  0.0713, -0.1721,  0.2076]],
       requires_grad=True)

In [114]:
for param in model.parameters():
    print(param)

Parameter containing:
tensor([[ 0.2703,  0.2935, -0.0828,  0.3248, -0.0775,  0.0713, -0.1721,  0.2076]],
       requires_grad=True)
Parameter containing:
tensor([0.3117], requires_grad=True)
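When inspecting a model it is often handier to see each parameter's name and shape than its values. A short sketch using `named_parameters()`, assuming 8 input features as in the housing data:

```python
import torch
import torch.nn as nn

model = nn.Linear(in_features=8, out_features=1)

# Map each parameter name to its shape
param_shapes = {name: tuple(p.shape) for name, p in model.named_parameters()}

# Count the trainable parameters: 8 weights + 1 bias
n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

print(param_shapes)
print(n_params)
```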


In [115]:
model(X_train[:1])
# the model is not trained yet, so the prediction is essentially random

tensor([[0.6250]], grad_fn=<AddmmBackward0>)

In [116]:
# We need to pick an optimizer and a loss function
optimizer = torch.optim.SGD(model.parameters(), lr=lr)
mse = nn.MSELoss()

def train_bgd(model, optimizer, criterion, X_train, y_train, n_epochs):
    log_every = max(1, n_epochs // 10) # avoids a modulo by zero when n_epochs < 10
    for epoch in range(n_epochs):
        y_pred = model(X_train)
        loss = criterion(y_pred, y_train) # loss functions expect (prediction, target)
        loss.backward()
        optimizer.step() # updates b, w
        optimizer.zero_grad()
        if epoch % log_every == 0:
            print(f"Epoch {epoch}, Loss: {loss.item()}")

In [117]:
# training

train_bgd(model, optimizer, mse, X_train, y_train, 200)

Epoch 0, Loss: 4.2368388175964355
Epoch 20, Loss: 0.5237898826599121
Epoch 40, Loss: 0.5146560072898865
Epoch 60, Loss: 0.513292670249939
Epoch 80, Loss: 0.5130218267440796
Epoch 100, Loss: 0.5129623413085938
Epoch 120, Loss: 0.5129488706588745
Epoch 140, Loss: 0.5129457712173462
Epoch 160, Loss: 0.5129451155662537
Epoch 180, Loss: 0.5129449367523193


In [118]:
X_new = X_test[:3]
with torch.no_grad():
    y_pred = model(X_new)

y_pred

tensor([[1.7168],
        [0.9449],
        [1.1635]])
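A handy sanity check for a training loop like this is to run it on synthetic data where the true weights are known. The sketch below is not part of the housing example: the data, `w_true`, and `b_true` are made up, and the loop is the same forward > backward > step > zero_grad pattern.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Synthetic, noise-free data generated from known weights and bias
X = torch.randn(200, 3)
w_true = torch.tensor([[2.0], [-1.0], [0.5]])
b_true = 0.3
y = X @ w_true + b_true

model = nn.Linear(3, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.MSELoss()

for epoch in range(300):
    loss = criterion(model(X), y)  # (prediction, target)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# The learned parameters should be close to w_true and b_true
print(model.weight.detach(), model.bias.detach(), loss.item())
```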

## Implementing a Regression MLP

PyTorch has `nn.Sequential`, which lets you chain modules.

In [119]:
model = nn.Sequential(
    nn.Linear(n, 50), # n inputs; the number of outputs is up to you
    nn.ReLU(), # just an activation function: output shape = input shape
    nn.Linear(50,40), # a layer's inputs must match the previous layer's outputs; the number of outputs is again up to you
    nn.ReLU(),
    nn.Linear(40,1) # the final number of outputs must match the target's dimension
)
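To check that the layer sizes line up, you can push a dummy batch through the layers one at a time and record the shapes. A small sketch, assuming `n = 8` features and a made-up batch of 4 rows:

```python
import torch
import torch.nn as nn

n = 8  # number of input features (assumed here)
model = nn.Sequential(
    nn.Linear(n, 50),
    nn.ReLU(),
    nn.Linear(50, 40),
    nn.ReLU(),
    nn.Linear(40, 1),
)

x = torch.randn(4, n)  # dummy batch of 4 samples
shapes = []
with torch.no_grad():
    for layer in model:  # nn.Sequential is iterable over its layers
        x = layer(x)
        shapes.append(tuple(x.shape))
        print(f"{layer.__class__.__name__:>6}: {tuple(x.shape)}")
```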

In [120]:
lr = 0.1
optimizer = torch.optim.SGD(model.parameters(), lr=lr)
mse = nn.MSELoss()

model = model.to(device)
X_train = X_train.to(device)
y_train = y_train.to(device)


train_bgd(model, optimizer, mse, X_train, y_train, 1000)
print(next(model.parameters()).device) 

Epoch 0, Loss: 5.844911098480225
Epoch 100, Loss: 0.6239899396896362
Epoch 200, Loss: 0.44056236743927
Epoch 300, Loss: 0.41409310698509216
Epoch 400, Loss: 0.39982160925865173
Epoch 500, Loss: 0.38740676641464233
Epoch 600, Loss: 0.37716424465179443
Epoch 700, Loss: 0.36810338497161865
Epoch 800, Loss: 0.3594208359718323
Epoch 900, Loss: 0.3518669605255127
cuda:0


## Mini-Batch Gradient Descent w/ DataLoaders

PyTorch has a `torch.utils.data.DataLoader` class that loads data efficiently in batches and shuffles it if we want.

A DataLoader expects the dataset to implement the `__len__()` and `__getitem__()` methods.
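Any object with those two methods can be used as a dataset. The sketch below defines a hypothetical `PairDataset` class (it does roughly what `TensorDataset` does for two tensors) and shows how a DataLoader batches it:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class PairDataset(Dataset):
    """Minimal map-style dataset: one (features, target) pair per index."""
    def __init__(self, X, y):
        self.X, self.y = X, y

    def __len__(self):
        return len(self.X)  # number of samples

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]  # a single sample

X = torch.arange(20, dtype=torch.float32).reshape(10, 2)
y = torch.arange(10, dtype=torch.float32).reshape(10, 1)

loader = DataLoader(PairDataset(X, y), batch_size=4, shuffle=False)
batch_sizes = [len(X_batch) for X_batch, y_batch in loader]

print(len(loader), batch_sizes)  # 10 samples in batches of 4: the last batch is smaller
```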

In [137]:
from torch.utils.data import TensorDataset, DataLoader

train_dataset = TensorDataset(X_train, y_train)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

model = nn.Sequential(
    nn.Linear(n, 50), 
    nn.ReLU(), 
    nn.Linear(50,40),
    nn.ReLU(),
    nn.Linear(40,1) 
).to(device)

optimizer = torch.optim.SGD(model.parameters(), lr=lr)

In [138]:
def train(model, optimizer, criterion, train_loader, n_epochs):
    model.train() # switches the modules to training mode; it makes no difference yet, but it will later
    for epoch in range(n_epochs):
        total_loss = 0.0
        for X_batch, y_batch in train_loader:
            X_batch, y_batch = X_batch.to(device), y_batch.to(device)
            y_pred = model(X_batch)
            loss = criterion(y_pred, y_batch) # loss functions expect (prediction, target)
            total_loss+=loss.item()
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

        mean_loss = total_loss / len(train_loader)
        print(f"Epoch {epoch+1}/{n_epochs}, Loss: {mean_loss:.4f}")

In [139]:
train(model, optimizer, mse, train_loader, 20)

Epoch 1/20, Loss: 0.5512
Epoch 2/20, Loss: 0.4332
Epoch 3/20, Loss: 0.4005
Epoch 4/20, Loss: 0.3850
Epoch 5/20, Loss: 0.3608
Epoch 6/20, Loss: 0.3594
Epoch 7/20, Loss: 0.3470
Epoch 8/20, Loss: 0.3390
Epoch 9/20, Loss: 0.3317
Epoch 10/20, Loss: 0.3261
Epoch 11/20, Loss: 0.3226
Epoch 12/20, Loss: 0.3943
Epoch 13/20, Loss: 0.3429
Epoch 14/20, Loss: 0.3295
Epoch 15/20, Loss: 0.3286
Epoch 16/20, Loss: 0.3121
Epoch 17/20, Loss: 0.3163
Epoch 18/20, Loss: 0.3074
Epoch 19/20, Loss: 0.3028
Epoch 20/20, Loss: 0.2991


- You can set `pin_memory=True` in the DataLoader to speed up CPU-to-GPU transfers, at the cost of more CPU RAM. Pinning only applies to CPU tensors, so the dataset has to stay on the CPU for this to help
    - Also pass `non_blocking=True` to the `.to()` method to avoid blocking the CPU during the data transfer

- This training loop waits until one batch is done before loading the next. Set the DataLoader's `num_workers=` to load batches in background workers, and tweak the number of batches each worker fetches in advance with `prefetch_factor`. This sometimes lags on Windows, so set `persistent_workers=True` to reuse the workers across epochs

- Note that these arguments seem to have issues in Jupyter notebooks, so you might have to work in a .py file
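Put together, these options might look like the sketch below. `make_loader` is a hypothetical helper, the data is random, and the worker-related arguments are only passed when `num_workers > 0`, since `prefetch_factor` and `persistent_workers` are only valid with worker processes.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

device = "cuda" if torch.cuda.is_available() else "cpu"

def make_loader(dataset, batch_size=32, num_workers=0):
    """Build a DataLoader with the transfer-related options discussed above."""
    kwargs = dict(batch_size=batch_size, shuffle=True,
                  pin_memory=(device == "cuda"))  # pinning only helps GPU transfers
    if num_workers > 0:
        # These two arguments require worker processes
        kwargs.update(num_workers=num_workers, prefetch_factor=2,
                      persistent_workers=True)
    return DataLoader(dataset, **kwargs)

# Keep the dataset on the CPU: only CPU tensors can be pinned
dataset = TensorDataset(torch.randn(100, 8), torch.randn(100, 1))
loader = make_loader(dataset)  # num_workers=0 loads in the main process

n_seen = 0
for X_batch, y_batch in loader:
    # non_blocking=True lets the copy overlap with CPU work when memory is pinned
    X_batch = X_batch.to(device, non_blocking=True)
    y_batch = y_batch.to(device, non_blocking=True)
    n_seen += len(X_batch)

print(n_seen)
```

In a .py script, code that uses `num_workers > 0` is usually placed under `if __name__ == "__main__":`, because some platforms start workers by re-importing the script.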

## Model Evaluation

In [141]:
def evaluate(model, data_loader, metric_fn, aggregate_fn=torch.mean):
    model.eval()
    metrics=[]
    with torch.no_grad():
        for X_batch, y_batch in data_loader:
            X_batch, y_batch = X_batch.to(device), y_batch.to(device)
            y_pred = model(X_batch)
            metric = metric_fn(y_pred, y_batch)
            metrics.append(metric)
    return aggregate_fn(torch.stack(metrics))

In [142]:
valid_set = TensorDataset(X_valid, y_valid)
valid_loader = DataLoader(valid_set, batch_size=32)
valid_mse = evaluate(model, valid_loader, mse)
valid_mse

tensor(0.3346, device='cuda:0')

In [146]:
# if you want to use rmse
def rmse(pred, true):
    return ((pred-true)**2).mean().sqrt()

evaluate(model, valid_loader, rmse)

# This is not sqrt(valid_mse): evaluate() averages the per-batch RMSEs, which is not the same as the root of the overall MSE

tensor(0.5589, device='cuda:0')

In [149]:
evaluate(model, valid_loader, mse, aggregate_fn=lambda metrics: torch.sqrt(torch.mean(metrics)))
# Use MSE as metric function, and aggregate by taking the root of the mean mse

tensor(0.5784, device='cuda:0')
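The difference between the aggregation choices is easy to see on a tiny made-up example with two batches of unequal size. The sketch also shows a size-weighted variant: averaging per-batch MSEs matches the full-dataset MSE only when all batches have the same size, which may not hold for the last batch.

```python
import torch

# Made-up prediction errors for two batches of different sizes
errors_per_batch = [torch.tensor([1.0, 1.0, 1.0, 1.0]), torch.tensor([3.0, 3.0])]

batch_mse = torch.stack([(e ** 2).mean() for e in errors_per_batch])  # [1, 9]
batch_sizes = torch.tensor([len(e) for e in errors_per_batch], dtype=torch.float32)

mean_of_batch_rmse = batch_mse.sqrt().mean()       # (1 + 3) / 2
root_of_mean_batch_mse = batch_mse.mean().sqrt()   # sqrt((1 + 9) / 2)

# Weighting each batch by its size recovers the RMSE over all samples
weighted_rmse = ((batch_mse * batch_sizes).sum() / batch_sizes.sum()).sqrt()
exact_rmse = (torch.cat(errors_per_batch) ** 2).mean().sqrt()

print(mean_of_batch_rmse, root_of_mean_batch_mse, weighted_rmse, exact_rmse)
```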

## Nonsequential Models w/ Custom Modules

Wide & Deep neural net: all or part of the inputs are connected directly to the output layer, so the model can learn both shallow patterns (through the wide path) and deep patterns (through the deep path).

We need to use `nn.Module` to build our custom network

In [150]:
class WideAndDeep(nn.Module):
    def __init__(self, n_features):
        super().__init__()
        self.deep_stack = nn.Sequential(
            nn.Linear(n_features, 50), nn.ReLU(),
            nn.Linear(50,40), nn.ReLU(),
        )
        self.output_layer = nn.Linear(40+n_features, 1)

    def forward(self, X):
        deep_output = self.deep_stack(X)
        wide_and_deep = torch.concat([X, deep_output], dim=1)
        return self.output_layer(wide_and_deep)

Modules have a `.children()` method that lets you iterate over their submodules. If your model has a variable number of submodules, store them in an `nn.ModuleList`, and if it has a variable number of parameters, keep them in an `nn.ParameterList`. Submodules or parameters kept in a plain Python list are not registered, so `.parameters()` would miss them.
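As an illustration of `nn.ModuleList`, here is a sketch of an MLP whose number of hidden layers is set by a list of sizes. `FlexibleMLP` and `hidden_sizes` are hypothetical names, not something defined elsewhere in these notes.

```python
import torch
import torch.nn as nn

class FlexibleMLP(nn.Module):
    """MLP with a configurable number of hidden layers."""
    def __init__(self, n_features, hidden_sizes):
        super().__init__()
        sizes = [n_features, *hidden_sizes]
        # nn.ModuleList registers each layer, so its parameters are visible to the optimizer
        self.hidden = nn.ModuleList(
            [nn.Linear(n_in, n_out) for n_in, n_out in zip(sizes[:-1], sizes[1:])]
        )
        self.output_layer = nn.Linear(sizes[-1], 1)

    def forward(self, X):
        for layer in self.hidden:
            X = torch.relu(layer(X))
        return self.output_layer(X)

model = FlexibleMLP(n_features=8, hidden_sizes=[50, 40])
child_names = [name for name, _ in model.named_children()]
n_params = sum(p.numel() for p in model.parameters())
out = model(torch.randn(4, 8))

print(child_names, n_params, tuple(out.shape))
```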

In [153]:
model = WideAndDeep(n).to(device)
lr = .002
optimizer = torch.optim.SGD(model.parameters(), lr=lr)

train(model, optimizer, mse, train_loader, 20)

Epoch 1/20, Loss: 1.4628
Epoch 2/20, Loss: 0.6090
Epoch 3/20, Loss: 0.5629
Epoch 4/20, Loss: 0.5364
Epoch 5/20, Loss: 0.5209
Epoch 6/20, Loss: 0.5116
Epoch 7/20, Loss: 0.5048
Epoch 8/20, Loss: 0.5010
Epoch 9/20, Loss: 0.4959
Epoch 10/20, Loss: 0.4978
Epoch 11/20, Loss: 0.4879
Epoch 12/20, Loss: 0.4893
Epoch 13/20, Loss: 0.4851
Epoch 14/20, Loss: 0.4802
Epoch 15/20, Loss: 0.4772
Epoch 16/20, Loss: 0.4758
Epoch 17/20, Loss: 0.4706
Epoch 18/20, Loss: 0.4687
Epoch 19/20, Loss: 0.4684
Epoch 20/20, Loss: 0.4601


In [154]:
# If you want to send a subset of the features through the wide path, and a different (maybe overlapping) subset through the deep path, you can do something like this:
class WideAndDeep2(nn.Module):
    def __init__(self, n_features):
        super().__init__()
        self.deep_stack = nn.Sequential(
            nn.Linear(n_features - 2, 50), nn.ReLU(), # the deep path gets all but the first 2 features
            nn.Linear(50,40), nn.ReLU(),
        )
        self.output_layer = nn.Linear(40 + 5, 1) # 40 deep outputs + 5 wide features

    def forward(self, X):
        X_wide = X[:, :5] # first 5 features
        X_deep = X[:, 2:] # all features except the first 2
        deep_output = self.deep_stack(X_deep)
        wide_and_deep = torch.concat([X_wide, deep_output], dim=1)
        return self.output_layer(wide_and_deep)

### Making Models with Multiple Inputs

It's usually better to let the model take two tensors as input than to split the data inside the model (e.g., images + text).

In [155]:
class WideAndDeep3(nn.Module):
    def __init__(self, n_features):
        super().__init__()
        self.deep_stack = nn.Sequential(
            nn.Linear(n_features - 2, 50), nn.ReLU(), # X_deep = X[:, 2:] has n_features - 2 columns
            nn.Linear(50,40), nn.ReLU(),
        )
        self.output_layer = nn.Linear(40 + 5, 1) # 40 deep outputs + 5 wide features (X[:, :5])

    def forward(self, X_wide, X_deep):
        deep_output = self.deep_stack(X_deep)
        wide_and_deep = torch.concat([X_wide, deep_output], dim=1)
        return self.output_layer(wide_and_deep)

In [156]:
train_data = TensorDataset(X_train[:, :5], X_train[:, 2:], y_train)
train_loader_wd = DataLoader(train_data, batch_size=32, shuffle=True)

# the train and evaluate functions must now unpack three tensors per batch (these definitions replace the earlier ones)
def evaluate(model, data_loader, metric_fn, aggregate_fn=torch.mean):
    model.eval()
    metrics=[]
    with torch.no_grad():
        for X_batch_wide, X_batch_deep, y_batch in data_loader:
            X_batch_wide = X_batch_wide.to(device)
            X_batch_deep = X_batch_deep.to(device)
            y_batch = y_batch.to(device)
            y_pred = model(X_batch_wide, X_batch_deep)
            metric = metric_fn(y_pred, y_batch)
            metrics.append(metric)
    return aggregate_fn(torch.stack(metrics))

def train(model, optimizer, criterion, train_loader, n_epochs):
    model.train()
    for epoch in range(n_epochs):
        total_loss = 0.0
        for X_batch_wide, X_batch_deep, y_batch in train_loader:
            X_batch_wide = X_batch_wide.to(device)
            X_batch_deep = X_batch_deep.to(device)
            y_batch = y_batch.to(device)
            y_pred = model(X_batch_wide, X_batch_deep)
            loss = criterion(y_pred, y_batch) # loss functions expect (prediction, target)
            total_loss+=loss.item()
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

        mean_loss = total_loss / len(train_loader)
        print(f"Epoch {epoch+1}/{n_epochs}, Loss: {mean_loss:.4f}")

In [None]:
# If (and only if) the dataset yields the inputs in the same order that the model's forward() expects, followed by the target, you can use something like this:

# for *X_batch_inputs, y_batch in data_loader:
#     X_batch_inputs = [X.to(device) for X in X_batch_inputs]
#     y_batch = y_batch.to(device)
#     y_pred = model(*X_batch_inputs)
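Here is a runnable sketch of that generic loop. Everything in it is made up for illustration: `TwoInputNet` is a hypothetical small model, the data is random, and the wide/deep split (first 5 columns, all but the first 2) mirrors the slicing used above.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

class TwoInputNet(nn.Module):
    """Small wide-and-deep style model taking two input tensors."""
    def __init__(self, n_wide, n_deep):
        super().__init__()
        self.deep_stack = nn.Sequential(nn.Linear(n_deep, 16), nn.ReLU())
        self.output_layer = nn.Linear(16 + n_wide, 1)

    def forward(self, X_wide, X_deep):
        deep_output = self.deep_stack(X_deep)
        return self.output_layer(torch.cat([X_wide, deep_output], dim=1))

torch.manual_seed(0)
device = "cuda" if torch.cuda.is_available() else "cpu"

X = torch.randn(64, 8)
y = torch.randn(64, 1)
# Tensor order matters: inputs in forward() order, then the target last
dataset = TensorDataset(X[:, :5], X[:, 2:], y)
loader = DataLoader(dataset, batch_size=16, shuffle=True)

model = TwoInputNet(n_wide=5, n_deep=6).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.MSELoss()

model.train()
losses = []
for *X_batch_inputs, y_batch in loader:  # the last item is the target, the rest are inputs
    X_batch_inputs = [X_in.to(device) for X_in in X_batch_inputs]
    y_batch = y_batch.to(device)
    y_pred = model(*X_batch_inputs)
    loss = criterion(y_pred, y_batch)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    losses.append(loss.item())

print(len(losses), tuple(y_pred.shape))
```

The same loop works unchanged for a single-input model, since `*X_batch_inputs` then collects just one tensor.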