# PyTorch

This chapter: train, evaluate, fine-tune, and optimize models with PyTorch. Then, the Optuna library for hyperparameter tuning.

## Fundamentals

The core data type is the tensor. It's a multi-dimensional array with a shape and a data type. It can live on a GPU, and it supports automatic differentiation.

### Tensors

In [7]:
import torch

X = torch.tensor([[1.0, 4.0, 7.0], [2.0, 3.0, 6.0]])
X

tensor([[1., 4., 7.],
        [2., 3., 6.]])

In [8]:
# A tensor holds a single data type. If you give it more than one, values are promoted to the most general type (complex > float > int > bool)
display(X.dtype)
X.shape

torch.float32

torch.Size([2, 3])
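A quick sketch of the promotion rule above, assuming PyTorch's default float dtype (float32) has not been changed. The variable names are only for illustration.

```python
import torch

# Python lists with mixed types are promoted to the most general type
mixed_int_float = torch.tensor([1, 2.5])   # int + float -> float32
mixed_bool_int = torch.tensor([True, 2])   # bool + int  -> int64

# torch.promote_types reports the result dtype for a pair of dtypes
int_with_float = torch.promote_types(torch.int64, torch.float32)
float_with_complex = torch.promote_types(torch.float32, torch.complex64)

print(mixed_int_float.dtype, mixed_bool_int.dtype)
print(int_with_float, float_with_complex)
```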

In [9]:
# Syntax
X[:,1]
10 * (X+1)
X.exp() # element-wise exponential
X.mean(dim=0) # column-wise mean (one value per column)
X @ X.T

tensor([[66., 56.],
        [56., 49.]])

In [10]:
import numpy as np
torch.tensor(X.numpy(), dtype=torch.float32)
# Converting between NumPy and PyTorch is easy: .numpy() goes one way, torch.tensor() goes back (and copies the data)

tensor([[1., 4., 7.],
        [2., 3., 6.]])
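One detail worth knowing when converting: `torch.tensor()` copies the data, while `torch.from_numpy()` shares memory with the NumPy array. A minimal sketch (the array and variable names are made up for the example):

```python
import numpy as np
import torch

arr = np.array([1.0, 2.0, 3.0], dtype=np.float32)
shared = torch.from_numpy(arr)  # shares memory with arr
copied = torch.tensor(arr)      # independent copy

arr[0] = 100.0  # modify the NumPy array in place

print(shared)  # sees the change
print(copied)  # still holds the original values
```

If you do not need the NumPy array afterwards, the shared version avoids a copy. If you do, the copy is the safer choice.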

In [11]:
# You can modify in place
X[:,1] = 99
X

tensor([[ 1., 99.,  7.],
        [ 2., 99.,  6.]])

In [12]:
X[:,0] = -5
X.relu_() 
# Methods ending in _ operate in place; the versions without _ return a new tensor

tensor([[ 0., 99.,  7.],
        [ 0., 99.,  6.]])
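To make the difference concrete, here is a small sketch comparing `relu()` and `relu_()` on a toy tensor (names are illustrative only):

```python
import torch

A = torch.tensor([-1.0, 2.0])

B = A.relu()                 # out-of-place: returns a new tensor
a_after_relu = A.clone()     # snapshot: A is still [-1, 2] at this point

C = A.relu_()                # in-place: modifies A and returns it
same_memory = C.data_ptr() == A.data_ptr()

print(a_after_relu, A, same_memory)
```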

### Hardware Acceleration

PyTorch supports accelerators from Intel, Apple, NVIDIA, AMD, and others.

In [13]:
if torch.cuda.is_available():
    device = "cuda"
    print("We have a gpu")
else:
    device = "cpu"  # fall back so the cells below still run without a GPU

M = torch.tensor([[1,2,3],[4,5,6]], dtype=torch.float32)
M = M.to(device)

We have a gpu


In [14]:
M.device
# There are multiple ways to put tensors on the GPU, like .cuda(), or passing device= to torch.tensor()

device(type='cuda', index=0)
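A device-agnostic version of the cells above might look like the sketch below. `pick_device` is a hypothetical helper, not a PyTorch function, and the MPS check assumes a PyTorch version that ships `torch.backends.mps` (1.12 or later).

```python
import torch

def pick_device() -> str:
    """Return the name of the best available device (hypothetical helper)."""
    if torch.cuda.is_available():
        return "cuda"   # NVIDIA GPUs (also AMD GPUs on ROCm builds)
    if torch.backends.mps.is_available():
        return "mps"    # Apple Silicon GPUs
    return "cpu"

device = pick_device()

# Create the tensor directly on the chosen device instead of moving it later
M = torch.tensor([[1, 2, 3], [4, 5, 6]], dtype=torch.float32, device=device)
print(M.device)
```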

In [15]:
R = M @ M.T
R

tensor([[14., 32.],
        [32., 77.]], device='cuda:0')

If your neural net is deep, GPU speed and RAM matter most. If it's shallow, getting the training data onto the GPU is the bottleneck.

In [16]:
M = torch.rand((1000,1000))
%timeit M @ M.T

22.9 ms ± 677 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [17]:
M = torch.rand((1000,1000), device="cuda")
%timeit M @ M.T

543 µs ± 10.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
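One caveat when timing GPU code: CUDA kernels are launched asynchronously, so a timer can stop before the GPU has finished. The sketch below shows one way to time by hand with explicit synchronization. `time_matmul` and its default sizes are assumptions made for the example, and it falls back to the CPU when no GPU is present.

```python
import time
import torch

def time_matmul(device: str, size: int = 500, repeats: int = 5) -> float:
    """Average seconds per M @ M.T on `device` (hypothetical helper)."""
    M = torch.rand((size, size), device=device)
    M @ M.T  # warm-up run so one-time setup cost is excluded
    if device == "cuda":
        torch.cuda.synchronize()  # wait for queued kernels before starting the clock
    start = time.perf_counter()
    for _ in range(repeats):
        M @ M.T
    if device == "cuda":
        torch.cuda.synchronize()  # wait for the GPU to finish before stopping the clock
    return (time.perf_counter() - start) / repeats

device = "cuda" if torch.cuda.is_available() else "cpu"
elapsed = time_matmul(device)
print(f"{device}: {elapsed * 1e3:.3f} ms per matmul")
```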


### Autograd

PyTorch implements reverse-mode auto-diff (ch9) efficiently in a component called autograd (automatic gradients).

In [18]:
x = torch.tensor(5.0, requires_grad=True) # requires_grad tells pytorch to keep track of computations for backpropagation
f = x**2 # f gets a grad_fn attribute that tells pytorch how to backpropagate through this op
f.backward() # computes gradients
x.grad
# the derivative of x**2 at x=5 is in fact 10

tensor(10.)
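The same mechanism works with several inputs. A small sketch with a made-up function f(x, y) = x²y + y³, checked against the hand-derived partial derivatives:

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(3.0, requires_grad=True)

f = x**2 * y + y**3
f.backward()  # fills in both x.grad and y.grad

# By hand: df/dx = 2xy = 12, df/dy = x^2 + 3y^2 = 4 + 27 = 31
print(x.grad, y.grad)
```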

In [19]:
# To do gradient descent, you need to tell pytorch not to track this step
# Otherwise it would include it in backprop

lr = .1
with torch.no_grad():
    x -= lr*x.grad

# This code is equivalent: (x detached shares memory with x)
# x_detached = x.detach()
# x_detached -= lr*x.grad

In [20]:
# Before you repeat the forward > backward > gradient descent step, you need to set the gradients to 0,
# because backward() adds to .grad instead of overwriting it
x.grad.zero_()

tensor(0.)
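A short sketch of what happens if you skip the reset: each `backward()` call adds to the existing gradient.

```python
import torch

x = torch.tensor(5.0, requires_grad=True)

(x**2).backward()
first = x.grad.item()        # 10.0

(x**2).backward()            # no reset in between
accumulated = x.grad.item()  # 20.0: the new gradient was added to the old one

x.grad.zero_()
(x**2).backward()
after_reset = x.grad.item()  # 10.0 again

print(first, accumulated, after_reset)
```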

In [21]:
# The whole training loop:
lr = .1
x = torch.tensor(5.0, requires_grad=True)
for step in range(500):  # `step` avoids shadowing the built-in iter()
    f = x**2
    f.backward()
    with torch.no_grad():
        x -= lr*x.grad
    x.grad.zero_()

x

tensor(2.8026e-45, requires_grad=True)
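For this particular function the loop has a closed form: the gradient is 2x, so each step multiplies x by (1 - 2·lr) = 0.8, giving x_k = 5 · 0.8^k. The sketch below checks that over 10 steps (the step count is an arbitrary choice for the example). The printed 2.8026e-45 rather than exactly 0 is consistent with float32 running out of precision at its smallest representable values.

```python
import torch

lr = 0.1
n_steps = 10
x = torch.tensor(5.0, requires_grad=True)

for step in range(n_steps):
    f = x**2
    f.backward()
    with torch.no_grad():
        x -= lr * x.grad   # x <- x - lr * 2x = 0.8 * x
    x.grad.zero_()

closed_form = 5.0 * (1 - 2 * lr) ** n_steps
print(x.item(), closed_form)
```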

In [22]:
# If you want to use in-place operations to save memory you have to be careful
# Autograd doesn't allow in-place ops on a leaf tensor that requires grad (outside no_grad),
# and in-place ops on intermediate tensors can break the backward pass

t = torch.tensor(2.0, requires_grad=True)
Z = t.exp() # intermediate step; exp() saves its output for the backward pass
Z += 1 # in-place op overwrites that saved output, so the gradient of exp() can no longer be computed
# Z.backward() #-> throws error

# Use Z = Z + 1 instead: it creates a new tensor and leaves the saved output intact
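A runnable sketch of both variants, catching the error instead of leaving the call commented out (variable names are illustrative):

```python
import torch

# In-place version: backward() fails
t = torch.tensor(2.0, requires_grad=True)
Z = t.exp()
Z += 1
try:
    Z.backward()
    failed = False
except RuntimeError as err:
    failed = True
    print("backward failed:", str(err)[:70])

# Out-of-place version: backward() works
t2 = torch.tensor(2.0, requires_grad=True)
Z2 = t2.exp()
Z2 = Z2 + 1
Z2.backward()
print(t2.grad)  # d/dt (exp(t) + 1) = exp(t), evaluated at t = 2
```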

PyTorch saves different things for different operations:
- exp(), relu(), sqrt(), sigmoid(), tanh() save their output in the computation graph during the forward pass. Do not modify their output in place before the backward pass.
- abs(), cos(), log() save their inputs, so you can't change their inputs in place before the backward pass.
- max(), min(), sgn(), std() save both inputs and outputs, so do not change either in place before .backward().
- ceil(), floor(), mean(), sum() store nothing. Do what you want.

Generally, build your models without in-place ops first. Then, if you need to speed things up or save memory, convert to in-place operations.

## Implementing Linear Regression

In [None]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

data = fetch_california_housing(as_frame=False)
X_temp, X_test, y_temp, y_test = train_test_split(data.data, data.target, test_size=.2)
X_train, X_valid, y_train, y_valid = train_test_split(X_temp, y_temp, test_size=.2)

In [29]:
X_train = torch.FloatTensor(X_train)
X_valid = torch.FloatTensor(X_valid)
X_test = torch.FloatTensor(X_test)
means = X_train.mean(dim=0, keepdims=True)
stds = X_train.std(dim=0, keepdims=True)
# standardize using the training set's mean and std only (no information leaks from valid/test)
X_train = (X_train-means)/stds
X_valid = (X_valid-means)/stds
X_test = (X_test-means)/stds

y_train = torch.FloatTensor(y_train).reshape(-1,1)
y_valid = torch.FloatTensor(y_valid).reshape(-1,1)
y_test = torch.FloatTensor(y_test).reshape(-1,1)
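A sanity check for the standardization step, sketched on synthetic data so it runs without downloading the dataset. `X_demo` and its scales and offsets are made up for the example. Note that `Tensor.std()` applies Bessel's correction (divides by n - 1) by default.

```python
import torch

torch.manual_seed(0)

# Synthetic features with very different scales and offsets
scales = torch.tensor([10.0, 0.5, 3.0])
offsets = torch.tensor([5.0, -2.0, 100.0])
X_demo = torch.randn(1000, 3) * scales + offsets

means = X_demo.mean(dim=0, keepdim=True)
stds = X_demo.std(dim=0, keepdim=True)
X_scaled = (X_demo - means) / stds

# Each column should now have mean ~0 and std ~1
print(X_scaled.mean(dim=0))
print(X_scaled.std(dim=0))
```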

In [34]:
torch.manual_seed(42)
n = X_train.shape[1] # n features
w = torch.randn((n,1), requires_grad=True) # weights
b = torch.tensor(0., requires_grad=True) # bias (a single scalar)

lr = .4
epochs = 20

for epoch in range(epochs):
    y_pred = X_train @ w + b
    loss = ((y_pred - y_train) ** 2).mean()
    loss.backward()
    with torch.no_grad():
        b -= b.grad * lr
        w -= w.grad * lr
        b.grad.zero_()
        w.grad.zero_()
    print(f"Epoch {epoch} loss: {loss}")


Epoch 0 loss: 16.06003189086914
Epoch 1 loss: 4.835346221923828
Epoch 2 loss: 2.212664842605591
Epoch 3 loss: 1.3021349906921387
Epoch 4 loss: 0.952170193195343
Epoch 5 loss: 0.806835949420929
Epoch 6 loss: 0.7393255233764648
Epoch 7 loss: 0.7026336789131165
Epoch 8 loss: 0.6789199709892273
Epoch 9 loss: 0.6612536311149597
Epoch 10 loss: 0.6468537449836731
Epoch 11 loss: 0.6345417499542236
Epoch 12 loss: 0.6237688660621643
Epoch 13 loss: 0.6142401099205017
Epoch 14 loss: 0.6057689785957336
Epoch 15 loss: 0.5982191562652588
Epoch 16 loss: 0.5914807319641113
Epoch 17 loss: 0.5854612588882446
Epoch 18 loss: 0.5800800919532776
Epoch 19 loss: 0.5752664804458618


In [35]:
# Making predictions
with torch.no_grad():
    print(X_test[:3] @ w + b)

tensor([[1.5125],
        [4.2779],
        [1.8360]])


This works, but PyTorch has a higher-level API that makes all of this easier.

#### PyTorch API

In [41]:
import torch.nn as nn

torch.manual_seed(42)
model = nn.Linear(in_features=n, out_features=1)
model.weight # .weight and .bias are instances of torch.nn.Parameter, which is a subclass of torch.Tensor

Parameter containing:
tensor([[ 0.2703,  0.2935, -0.0828,  0.3248, -0.0775,  0.0713, -0.1721,  0.2076]],
       requires_grad=True)

In [44]:
for param in model.parameters():
    print(param)

Parameter containing:
tensor([[ 0.2703,  0.2935, -0.0828,  0.3248, -0.0775,  0.0713, -0.1721,  0.2076]],
       requires_grad=True)
Parameter containing:
tensor([0.3117], requires_grad=True)
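If you also want the parameter names, `named_parameters()` yields (name, parameter) pairs. A small sketch that lists them and counts the total, assuming 8 input features as in the California housing data:

```python
import torch.nn as nn

model = nn.Linear(in_features=8, out_features=1)

names = []
for name, param in model.named_parameters():
    names.append(name)
    print(name, tuple(param.shape))

# 8 weights + 1 bias
n_params = sum(p.numel() for p in model.parameters())
print(n_params)
```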


In [46]:
model(X_train[:1])
# not trained yet, so predictions are random

tensor([[7.5772]], grad_fn=<AddmmBackward0>)

In [None]:
# We need to pick an optimizer and a loss function
optimizer = torch.optim.SGD(model.parameters(), lr=lr)
mse = nn.MSELoss()
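With these two pieces, the manual loop from earlier could be rewritten roughly as in the sketch below. It uses synthetic data so it runs on its own. `X_demo`, `y_demo`, `true_w`, the learning rate of 0.1, and the 100 epochs are all assumptions made for the example, not values from this chapter.

```python
import torch
import torch.nn as nn

torch.manual_seed(42)

# Synthetic, noise-free linear data: y = X @ true_w + 0.3
X_demo = torch.randn(200, 3)
true_w = torch.tensor([[2.0], [-1.0], [0.5]])
y_demo = X_demo @ true_w + 0.3

model = nn.Linear(in_features=3, out_features=1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
mse = nn.MSELoss()

losses = []
for epoch in range(100):
    y_pred = model(X_demo)        # forward pass
    loss = mse(y_pred, y_demo)
    loss.backward()               # compute gradients
    optimizer.step()              # gradient descent step (replaces the manual no_grad block)
    optimizer.zero_grad()         # reset gradients (replaces the manual .grad.zero_() calls)
    losses.append(loss.item())

print(losses[0], losses[-1])
```

The optimizer handles the update for every parameter it was given, so the loop no longer needs to name `w` and `b` explicitly.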