# 1 Introduction to PyTorch

This notebook presents an introduction to the PyTorch framework, as part of the material for the first exercise class of the Deep Learning course in the Autumn semester 2024. It starts with a short introduction to PyTorch data structures (tensors), followed by practical examples involving all the components necessary for building, training and testing PyTorch models: module classes, optimizers, losses, datasets and dataloaders, along with the basics of backpropagation. After the PyTorch introduction, an example of training a perceptron is presented.

## 1.1 Tensors

Tensors are the main data type used in the PyTorch framework. Much like NumPy arrays, they hold numerical values in arrays of different shapes, and most mathematical operations behave the same way on both. A CPU tensor and a NumPy array can even share the same underlying memory: `torch.from_numpy(array)` and `tensor.numpy()` return objects that reference the same storage, so an in-place change to one is visible in the other. Note that `torch.tensor(array)` copies the data instead.
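
The memory-sharing behaviour can be checked with a small sketch. The variable names (`a`, `shared`, `copied`, `back`) are chosen here purely for illustration and are not part of the exercise code.

```python
import numpy as np
import torch

a = np.array([1.0, 2.0, 3.0])

shared = torch.from_numpy(a)   # shares memory with `a` (CPU tensors only)
copied = torch.tensor(a)       # copies the data into new storage

a[0] = 100.0                   # modify the NumPy array in place

print(shared)                  # reflects the change made through `a`
print(copied)                  # still holds the original values

back = shared.numpy()          # NumPy view of the same storage
back[1] = -1.0
print(a)                       # the original array sees this change too
```

If independent data is needed, `torch.tensor(...)` or `.clone()` avoids accidental in-place side effects.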

Tensors come with some additional features on top of NumPy arrays, the main one being that they can carry **gradient information**.
PyTorch tensors are designed to be used in the context of gradient descent optimization, so a tensor holds not only its numeric values but also a reference to **the computational graph** that produced them.
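
As a rough sketch of what "holding the computational graph" means in practice, the snippet below inspects the `grad_fn` and `is_leaf` attributes of two tensors. The tensors `a` and `b` are illustrative names only.

```python
import torch

a = torch.tensor([1.0, 2.0], requires_grad=True)  # leaf tensor created by the user
b = (a * 3).sum()                                  # result of tracked operations

print(a.is_leaf, a.grad_fn)   # a leaf tensor has no grad_fn
print(b.is_leaf, b.grad_fn)   # b remembers the operation that produced it

b.backward()                  # walk the graph from b back to a
print(a.grad)                 # d(b)/d(a) is 3 for each element
```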

In the following example, we will create some tensors and show how to compute the gradients with respect to any variable needed - this is a crucial part of neural network training.

### Example 1: Tensor gradients

In [1]:
import torch

In [None]:
x = torch.tensor([1., 2., 3.], requires_grad=True) # Input vector
w = torch.tensor([4., 5., 1.], requires_grad=True) # Weight vector
t = torch.tensor([2.], requires_grad=False) # Target value

y = w @ x # inner-product of x and w
z = (y - t)**2 # square error between the output and target values

z.backward()  # ask pytorch to trace back the computation of z and compute derivatives

# TODO: Compute the gradient of variable z with respect to the weight vector w, and the input vector x
dzdw = ...
dzdx = ...

print(f"Gradient computed automatically by pytorch: {w.grad}")  # the resulting gradient of z w.r.t w -> computed using backward()
print(f"Gradient computed manually:                 {dzdw}")

print(f"Gradient computed automatically by pytorch: {x.grad}")  # the resulting gradient of z w.r.t x -> computed using backward()
print(f"Gradient computed manually:                 {dzdx}")
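
One possible way to sanity-check the automatic gradients is to apply the chain rule by hand. For $z = (w^\top x - t)^2$ it gives $\partial z / \partial w = 2(y - t)\,x$ and $\partial z / \partial x = 2(y - t)\,w$. The sketch below repeats the setup from the cell above in a self-contained form; the names `residual`, `dzdw_manual` and `dzdx_manual` are assumptions made for this illustration.

```python
import torch

x = torch.tensor([1., 2., 3.], requires_grad=True)
w = torch.tensor([4., 5., 1.], requires_grad=True)
t = torch.tensor([2.])

y = w @ x
z = (y - t) ** 2
z.backward()

# Chain rule by hand, outside of autograd tracking
with torch.no_grad():
    residual = 2 * (y - t)          # dz/dy
    dzdw_manual = residual * x      # dy/dw = x
    dzdx_manual = residual * w      # dy/dx = w

print(torch.allclose(w.grad, dzdw_manual))
print(torch.allclose(x.grad, dzdx_manual))
```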

## 1.2 PyTorch models and datasets

In [3]:
import numpy as np
import random
import matplotlib.pyplot as plt
from tqdm import tqdm
from sklearn.datasets import make_regression, make_blobs
from sklearn.model_selection import train_test_split
from sklearn import preprocessing

import torch
from torch.utils.data import DataLoader, Dataset
import torch.nn as nn
import torch.nn.functional as F

%config InlineBackend.figure_format = 'svg'

In [4]:
X, y_true = make_regression(n_samples=60, n_features=10, noise=1.)
X_tensor, y_tensor = torch.tensor(X, dtype=torch.float32), torch.tensor(y_true, dtype=torch.float32)
X_train, X_test, y_train, y_test = train_test_split(X_tensor, y_tensor.unsqueeze(1), test_size=0.5)

### 1.2.1 Writing custom datasets
- A Dataset subclass wraps access to the data, and is specialized to the type of data it’s serving.
- The DataLoader knows nothing about the data, but organizes the input tensors served by the Dataset into batches with the parameters you specify.

`torch.utils.data.Dataset` is an abstract class representing a dataset. Your custom dataset should inherit from `Dataset` and override the methods `__len__` and `__getitem__`:
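
To illustrate how the two methods interact with a `DataLoader`, here is a minimal sketch on synthetic data. `ToyDataset` is a hypothetical class written for this illustration; it is not the `MyDataset` class of the exercise and ignores transforms.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ToyDataset(Dataset):  # hypothetical name, for illustration only
    def __init__(self, n_samples=25, n_features=3):
        # Synthetic inputs and targets: each target is the sum of the features
        self.X = torch.arange(n_samples * n_features, dtype=torch.float32).reshape(n_samples, n_features)
        self.y = self.X.sum(dim=1, keepdim=True)

    def __len__(self):
        # Number of samples the DataLoader can draw from
        return self.X.shape[0]

    def __getitem__(self, index):
        # One (input, target) pair; the DataLoader stacks these into batches
        return self.X[index], self.y[index]

loader = DataLoader(ToyDataset(), batch_size=10, shuffle=False)
batch_shapes = [(xb.shape, yb.shape) for xb, yb in loader]
for x_shape, y_shape in batch_shapes:
    print(x_shape, y_shape)
```

With 25 samples and a batch size of 10, the last batch is smaller than the others unless `drop_last=True` is passed to the `DataLoader`.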

In [5]:
class MyDataset(Dataset):
    def __init__(self, X, y, transform=None):
        """
        X is assumed to take the shape [num_samples, embedding_dim]
        y has the shape [num_samples, 1] (one target value per sample)
        """
        self.X = X
        self.y = y
        self.transform = transform

    def __len__(self):
        # TODO: Implement the __len__ method for the dataset of vector values
        pass

    def __getitem__(self, index):
        # TODO: Implement the __getitem__ method for the dataset of vector values, applying the transformation if one is provided
        pass
    
class MyTransform(object):
    """
    Example of a custom transform that is applied to every element of the Dataset class before they are returned using the __getitem__ method.
    There's no need to modify this class - it serves just as an example of how transforms can be made and applied during training of neural networks.
    """
    def __init__(self):
        pass
    def __call__(self, X, y):
        return (X-X.mean(axis=0))/X.std(axis=0), (y-y.mean(axis=0))/y.std(axis=0)

### 1.2.2 Steps for training a neural network
- **The loss function** is a measure of how far the model’s prediction was from our ideal output. Mean-squared-error loss is a typical loss function for regression models like ours.
- **The optimizer** is what drives the learning. Here we use an optimizer that implements stochastic gradient descent. Besides the parameters of the algorithm, such as the learning rate and momentum, we also pass in `model.parameters()`, a collection of all the learnable weights in the model, which is what the optimizer adjusts.
- **Zeroing the gradients** is an important step. Each call to `backward()` adds to the gradients already stored in the parameters rather than overwriting them. If we do not reset them before every batch, gradients from earlier batches keep accumulating, so the update no longer reflects the current batch and training typically fails to converge.
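
The accumulation behaviour can be observed directly on a single tensor. This is a minimal sketch with illustrative names (`first`, `accumulated`, `after_reset`); in a real training loop the reset is done through `optimizer.zero_grad()` rather than by hand.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

(w ** 2).sum().backward()
first = w.grad.clone()          # gradient is 2 * w

(w ** 2).sum().backward()       # second backward pass without resetting
accumulated = w.grad.clone()    # the new gradient was added to the old one

w.grad.zero_()                  # manual reset of the stored gradient
(w ** 2).sum().backward()
after_reset = w.grad.clone()    # same as after the first backward pass

print(first, accumulated, after_reset)
```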

Usual structure of a training loop, repeated for every batch:
- Reset the gradients stored from the previous batch
- Get the outputs for the current batch by passing it through the network
- Compute the loss (ground-truth labels vs. the network output)
- Perform a backward pass through the network (computes the gradients of the loss w.r.t. the network's parameters)
- Perform an optimizer step. This is decoupled from the backward pass so that the optimizer can be swapped out: it can implement many different optimization algorithms, all of which only need the gradients computed by the backward pass
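
The steps listed above can be sketched on a toy problem as follows. This is an illustration under assumed settings (a single `nn.Linear` layer, synthetic data, full-batch updates, `lr=0.05`), not the reference solution for the `train` function below.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.Linear(3, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)

# Synthetic regression data: the target is the sum of the three features
x = torch.randn(8, 3)
y = x.sum(dim=1, keepdim=True)

losses = []
for step in range(100):
    optimizer.zero_grad()        # reset gradients from the previous step
    y_pred = model(x)            # forward pass
    loss = criterion(y_pred, y)  # compare predictions with targets
    loss.backward()              # compute gradients w.r.t. the parameters
    optimizer.step()             # update the parameters using the gradients
    losses.append(loss.item())

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```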

In [8]:
def train(model, train_loader, optimizer, criterion, device):
    """
    Implements one epoch (one whole pass through the dataset) of training the provided model.
    """
    epoch_loss = 0

    # Set the model to training mode
    model.train()

    for x, y in train_loader:
        x, y = x.to(device), y.to(device)

        # TODO: Finish the training loop by performing the necessary operations using the optimizer, model and criterion objects.
        #       Accumulate each batch's loss in the variable epoch_loss.
        
    
    return epoch_loss/len(train_loader)

def evaluate(model, val_loader, criterion, device):
    # Set the model to evaluation mode
    model.eval()

    val_loss = 0
    for x, y in val_loader:
        x, y = x.to(device), y.to(device)
        with torch.no_grad():
            val_loss += criterion(model(x), y).item()

    return val_loss/len(val_loader)

**Note:** The loss and the optimizer are connected only through the model parameters. The loss is the final node (the root) of a single computational graph which starts with the model inputs and contains all model parameters. When we call `loss.backward()`, backpropagation starts at the loss and walks back through all of its parents, all the way to the model inputs. Every node in the graph holds a reference to its parents.

`loss.backward()` fills the `.grad` attribute of every leaf tensor with `requires_grad=True` (such as the model parameters) in the computational graph that ends in `loss`.

The optimizer simply iterates over the list of parameters it received on initialization. In the case of plain SGD, it subtracts from each parameter the gradient stored in its `.grad` attribute, multiplied by the learning rate. It does not need to know which loss the gradients were computed from; it only reads `.grad`. The gradients are stored by the tensors themselves (each tensor has a `grad` and a `requires_grad` attribute) once you call `backward()` on the loss. After that, calling `optimizer.step()` makes the optimizer go over all the parameters it is supposed to update and use their stored gradients to update their values.
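
A small sketch can make this concrete: for plain SGD without momentum or weight decay, `optimizer.step()` should give the same result as subtracting `lr * grad` by hand. The tensor `w` and the name `expected` are illustrative assumptions.

```python
import torch

w = torch.tensor([1.0, -2.0], requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1)

loss = (w ** 2).sum()
loss.backward()                                   # fills w.grad with 2 * w

expected = (w - 0.1 * w.grad).detach().clone()    # manual SGD update
optimizer.step()                                  # reads w.grad; never sees `loss`

print(w)
print(expected)
```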

### 1.2.3 Writing custom models
Some facts and tips about writing a custom PyTorch model:
- It **inherits from `torch.nn.Module`**; modules may be nested.
- A model should have an **`__init__()` method**, where it instantiates its layers and loads any data artifacts it might need (e.g., an NLP model might load a vocabulary).
- A model should have a **`forward()`** method. This is where the actual computation happens: an input is passed through the network layers and various functions to generate an output.
- Other than that, you can build out your model class like any other Python class, adding whatever properties and methods you need to support your model’s computation.
- PyTorch models assume they are working on batches of data.
- Inference is performed by calling the model like a function: `model(input)`.
- The output of a model also has a batch dimension, the size of which should always match the input batch dimension.
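
The points above can be illustrated with a very small module. `TinyRegressor`, its layer sizes and the `tanh` activation are assumptions made for this sketch; it is not the `MLP` class of the exercise.

```python
import torch
import torch.nn as nn

class TinyRegressor(nn.Module):  # hypothetical example model
    def __init__(self, input_dim, hidden_dim=4):
        super().__init__()
        # Layers are instantiated once, in __init__
        self.hidden = nn.Linear(input_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, 1)

    def forward(self, x):
        # x has shape [batch_size, input_dim]
        h = torch.tanh(self.hidden(x))
        return self.out(h)       # shape [batch_size, 1]

model = TinyRegressor(input_dim=10)
batch = torch.randn(5, 10)
output = model(batch)            # calls forward() through nn.Module.__call__
print(output.shape)              # torch.Size([5, 1])

# Parameters of nested layers are registered automatically
n_params = sum(p.numel() for p in model.parameters())
print(n_params)
```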

In [9]:
class MLP(nn.Module):
    def __init__(self, input_dim, output_dim):
        super().__init__()

        self.input_fc = nn.Linear(input_dim, 10)
        self.hidden_fc = nn.Linear(10, 10)
        self.output_fc = nn.Linear(10, output_dim)

    def forward(self, x):
        # TODO: Write the forward method of the MLP, using a suitable activation function.
        #       The output variable should be named y_pred.

        return y_pred

The code below uses a simple linear model as a baseline.
If you run it as is, you should see how the model performance improves over the epochs; in the next cell you can plot your predictions against the actual values.
Test your own model by uncommenting the `model = MLP(10, 1)` line. What do you observe, and can you fix it somehow? How does your choice of activation function affect the model performance, and why?

In [None]:
epochs = 40
device = "cuda" if torch.cuda.is_available() else "cpu"

model = torch.nn.Sequential(torch.nn.Linear(10, 1))
# model = MLP(10, 1)
criterion = torch.nn.MSELoss()

used_transformation = None # MyTransform()
train_dataset = MyDataset(X_train, y_train, transform=used_transformation)
val_dataset = MyDataset(X_test, y_test, transform=used_transformation)

train_loader = DataLoader(train_dataset, batch_size=10, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=10)

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

model.to(device)
train_loss = []
val_loss = []

for epoch in range(epochs):
    
    epoch_loss_train = train(model, train_loader, optimizer, criterion, device)
    train_loss.append(epoch_loss_train)
    
    epoch_loss_val = evaluate(model, val_loader, criterion, device)
    val_loss.append(epoch_loss_val)

    print(f'epoch {epoch+1}, validation loss {epoch_loss_val:.2f}')

plt.plot(range(epochs), train_loss, label='Training loss')
plt.plot(range(epochs), val_loss, label='Validation loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()

In [None]:
with torch.no_grad():
    pred = model(X_test.to(device)).cpu().numpy()  # use the device selected above instead of assuming CUDA

if used_transformation:
    pred = pred * y_train.std().item() + y_train.mean().item()

plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r')
plt.scatter(y_test.numpy(), pred)
plt.grid()

# 2 The perceptron

In [1]:
import numpy as np
import random
import matplotlib.pyplot as plt
from tqdm import tqdm
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
%config InlineBackend.figure_format = 'svg'

### 2.1 Random dataset with two blobs

In [None]:
X, y = make_blobs(n_samples=50, centers=2, random_state=4)

X = preprocessing.scale(X)
# X[y==0] += 1 # linearly non-separable data

y[y==0]=-1
plt.scatter(X[:,0], X[:,1], c=y)
plt.title("Dataset")
plt.gca().set_aspect('equal', adjustable='box')
plt.xlabel(r'$x_1$')
plt.ylabel(r'$x_2$')
plt.show()

In [None]:
# split train and test data
y_true = y

# add a dimension for bias
X = np.hstack((X, np.ones((X.shape[0], 1))))

X_train, X_test, y_train, y_test = train_test_split(X, y_true, stratify=y)
print(f'Shape X_train: {X_train.shape}')
print(f'Shape y_train: {y_train.shape}')
print(f'Shape X_test: {X_test.shape}')
print(f'Shape y_test: {y_test.shape}')

### 2.2 Perceptron class + training algorithm
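
For orientation, the classic perceptron rule updates the parameters only on a misclassified sample, by adding `lr * y * x` to `theta`. The sketch below applies it to a tiny hand-made, linearly separable dataset with the bias feature already appended. The data, the zero initialization and the iteration count are assumptions for illustration; this is not the reference solution for the class below.

```python
import numpy as np

# Four samples with two features plus a constant 1 for the bias
X = np.array([[ 2.0,  1.0, 1.0],
              [ 1.0,  2.0, 1.0],
              [-1.0, -2.0, 1.0],
              [-2.0, -1.0, 1.0]])
y = np.array([1, 1, -1, -1])

theta = np.zeros(X.shape[1])
lr = 1.0

for i in range(20):
    idx = i % X.shape[0]                     # cycle through the samples
    y_predict = np.sign(X[idx] @ theta)      # predicted label in {-1, 0, 1}
    if y_predict != y[idx]:                  # update only on a mistake
        theta = theta + lr * y[idx] * X[idx]

accuracy = np.mean(np.sign(X @ theta) == y)
print(theta, accuracy)
```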

In [None]:
class Perceptron():
    
    def __init__(self, n_samples, n_features, lr=1., n_iters=1000):
        self.lr = lr
        self.n_samples = n_samples
        self.n_features = n_features
        self.n_iters = n_iters
        self.theta_hist = np.zeros((self.n_features, self.n_iters))
        self._trained = False
        self.train_accuracies = []
        self.test_accuracies = []

        
    def train(self, X_train, y_train, X_test, y_test):
        
        theta = np.random.uniform(size=(self.n_features,))
        
        for i in tqdm(range(self.n_iters)):
            idx = i % self.n_samples
            #TODO: implement the prediction for X_train[idx,:].
            y_predict = ...

            #TODO: implement the update rule for the perceptron by updating theta.

            self.theta_hist[:,i] = theta

            # computes train accuracy
            y_train_predict = np.sign(np.inner(X_train, theta))
            self.train_accuracies.append(np.mean(y_train_predict == y_train))

            # computes test accuracy
            y_test_predict = np.sign(np.inner(X_test, theta))
            self.test_accuracies.append(np.mean(y_test_predict == y_test))
        
        self._trained = True
        

    def is_trained(self):
        if not self._trained:
            raise ValueError("Model has not been trained.")

    @property
    def theta(self):
        self.is_trained()
        return self.theta_hist[:, -1]
    
    @property
    def pocket_idx(self):
        self.is_trained()
        return np.argmax(self.train_accuracies)
    
    @property
    def pocket_theta(self):
        self.is_trained()
        return self.theta_hist[:, self.pocket_idx]
        
        
p = Perceptron(*X_train.shape)
p.train(X_train, y_train, X_test, y_test)  # train() returns None; results are stored on p

# pocket: take the theta with the best training accuracy.
print(f'\ntheta: {p.theta}')
print(f'pocket_theta: {p.pocket_theta}')

### 2.3 Plotting training curves

In [None]:
plt.plot(p.train_accuracies, label='train acc')
plt.plot(p.test_accuracies, label='test acc')
plt.scatter(p.pocket_idx, p.train_accuracies[p.pocket_idx], c='k', marker='*', label='pocket', s=100)
plt.ylim([0.3, 1.05])
plt.grid()
plt.legend()

print(f'Train accuracy last: {p.train_accuracies[-1]}')
print(f'Test accuracy last: {p.test_accuracies[-1]}')
print('############################################')
print(f'Train accuracy pocket: {p.train_accuracies[p.pocket_idx]}')
print(f'Test accuracy pocket: {p.test_accuracies[p.pocket_idx]}')
print('########## Thetas ##############')

### 2.4 Visualization of training (parameter space)

In [None]:
theta_history = p.theta_hist

plt.plot(theta_history[0,:], theta_history[1,:], 'b', lw=2)
plt.scatter(theta_history[0,:], theta_history[1,:], facecolor='b', s=30)
plt.scatter(theta_history[0,0], theta_history[1,0], facecolor='r', s=70)
plt.scatter(theta_history[0,-1], theta_history[1,-1], facecolor='g', s=70)

plt.grid()
plt.gca().set_aspect('equal', adjustable='box')
plt.xlabel(r'$\bf \theta$, first coordinate')
plt.ylabel(r'$\bf \theta$, second coordinate')
plt.title('trajectory starts from red point, ends at green point')
plt.savefig('traj.png',dpi = 400)

### 2.5 Visualization of solution

In [None]:
fig = plt.figure(figsize=(8,6))
plt.scatter(X_train[:,0], X_train[:,1], c=y_train)
plt.scatter(X_test[:,0], X_test[:,1], c=y_test, marker='x')
x_hyperplane = np.array([np.min(X[:,0]),np.max(X[:,0])])
y_hyperplane = -(x_hyperplane * p.theta[0] + p.theta[2]) / p.theta[1]
plt.plot(x_hyperplane, y_hyperplane, '-', label='last theta')

# pocket solution
y_hyperplane = -(x_hyperplane * p.pocket_theta[0] + p.pocket_theta[2]) / p.pocket_theta[1]
plt.plot(x_hyperplane, y_hyperplane, '-', label='pocket theta')

plt.title("Dataset")
plt.gca().set_aspect('equal', adjustable='box')
plt.xlabel(r'$\bf x$, first coordinate')
plt.ylabel(r'$\bf x$, second coordinate')
print(f'theta: {p.theta}')
print(f'pocket_theta: {p.pocket_theta}')
plt.legend()
plt.show()

### 2.6 Perceptron with Real World Data

In [219]:
from sklearn.datasets import load_digits

X, y = load_digits(n_class=2, return_X_y=True)
y[y==0] = -1.
X_train, X_test, y_train, y_test = train_test_split(X, y)
scaler = preprocessing.StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [None]:
p = Perceptron(*X_train.shape, n_iters=50)
p.train(X_train, y_train, X_test, y_test)

In [None]:
plt.plot(p.train_accuracies, label='train acc')
plt.plot(p.test_accuracies, label='test acc')
plt.scatter(p.pocket_idx, p.train_accuracies[p.pocket_idx], c='k', marker='*', label='pocket', s=100)
plt.ylim([0.3, 1.05])
plt.grid()
plt.legend()

print(f'Train accuracy last: {p.train_accuracies[-1]}')
print(f'Test accuracy last: {p.test_accuracies[-1]}')
print('############################################')
print(f'Train accuracy pocket: {p.train_accuracies[p.pocket_idx]}')
print(f'Test accuracy pocket: {p.test_accuracies[p.pocket_idx]}')
print('########## Thetas ##############')