# Lab 9: Training an MLP without autograd

In [101]:
import torch
import math

We want to build an MLP with two linear layers, each followed by an activation function.

## Activation function

Write two functions:

In [102]:
def sigma(x:torch.Tensor)->torch.Tensor:
    # Implementation of tanh
    return torch.tanh(x)

def dsigma(x:torch.Tensor)->torch.Tensor:
    # Implementation of the first derivative of tanh
    return 1.0 - torch.pow(torch.tanh(x), 2)

In [103]:
s = torch.empty(4, 3).normal_().requires_grad_(True)
l = torch.sum(sigma(s))  # plain sum: dl/dsigma(s) is all ones, so s.grad equals tanh'(s)
l.backward()

print((dsigma(s.detach()) == s.grad).all())

tensor(True)


Both take a floating-point tensor as input and return a tensor of the same size, obtained by applying tanh and the first derivative of tanh element-wise, respectively.

**Hint**: The functions must not contain any Python loop; use in particular `torch.tanh`, `torch.exp`, `torch.mul` and `torch.pow`.
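
The closed form $\tanh'(x) = 1 - \tanh^2(x)$ can also be checked without autograd, using a central finite difference. This is a minimal sketch: the step size `h`, the evaluation grid and the tolerance are illustrative choices, not part of the lab.

```python
import torch

def dsigma(x: torch.Tensor) -> torch.Tensor:
    # Same closed form as in the cell above: tanh'(x) = 1 - tanh(x)^2
    return 1.0 - torch.pow(torch.tanh(x), 2)

# float64 keeps the finite-difference error small
x = torch.linspace(-3.0, 3.0, steps=13, dtype=torch.float64)
h = 1e-5  # step size (illustrative choice)

# Central difference approximation of the derivative of tanh
numeric = (torch.tanh(x + h) - torch.tanh(x - h)) / (2 * h)

# Largest absolute gap between the numeric and the analytic derivative
max_err = (numeric - dsigma(x)).abs().max().item()
print(max_err < 1e-8)
```

Comparing with a tolerance rather than with `==` is usually the safer habit for floating-point checks, since two mathematically equal expressions may differ in their last bits.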

## Loss function

Write two functions:

In [104]:
def loss(v:torch.Tensor, t:torch.Tensor)->torch.Tensor:
    # Implementation of ||t-v||_2^2
    return torch.sum(torch.pow(t - v, 2.0))

def dloss(v:torch.Tensor, t:torch.Tensor)->torch.Tensor:
    # Implementation of the gradient of the loss function
    return 2 * (v - t)

Note: $\|v-t\|_2^2 = \left(\sqrt{\sum_i (t_i - v_i)^2}\right)^2 = \sum_i (t_i - v_i)^2$

In [105]:
v = torch.empty(4, 3).normal_().requires_grad_(True)
t = torch.empty(4, 3).normal_()
l = loss(v, t)  # scalar loss; after backward(), v.grad holds 2 * (v - t)
l.backward()

print((dloss(v.detach(), t) == v.grad).all())

tensor(True)


Both take two floating-point tensors of the same dimensions, with `v` the predicted tensor and `t` the target tensor, and return respectively $\|t - v\|_2^2$ and a tensor of the same size equal to the gradient of this quantity with respect to `v`.

**Hint**: The functions must not contain any Python loop; use in particular `torch.sum` and `torch.pow`.
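
For reference, the expression returned by `dloss` follows from differentiating the sum term by term. Only the term with index $i$ depends on $v_i$, so:

$$
\frac{\partial}{\partial v_i} \sum_j (t_j - v_j)^2 = -2\,(t_i - v_i) = 2\,(v_i - t_i)
$$

Stacking these components gives the tensor $2\,(v - t)$, which has the same shape as $v$.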

## Forward pass

Write a function:

In [106]:
def forward_step(xIn:torch.Tensor, w:torch.Tensor, b:torch.Tensor)->tuple[torch.Tensor, torch.Tensor]:
    """compute a step of forward pass, returns: (s, x)"""
    s = xIn @ w.T + b
    x = sigma(s)
    return (s, x)
 

In [107]:
def forward_pass(w1:torch.Tensor, b1:torch.Tensor, w2:torch.Tensor, 
                 b2:torch.Tensor, x:torch.Tensor)->tuple[torch.Tensor, ...]:
    """compute the forward pass, returns: (s1, x1, s2, x2)"""
    s1, x1 = forward_step(x, w1, b1)
    s2, x2 = forward_step(x1, w2, b2)
    return (s1, x1, s2, x2)

Its arguments are an input vector of the network and the weights and biases of the two layers; it returns a tuple made of all the intermediate tensors along the forward pass:
$$\begin{array}{lll}
x_0 & = & x \\
s_1 & = & w_1 \cdot x_0 + b_1 \\
x_1 & = & \sigma(s_1) \\
s_2 & = & w_2 \cdot x_1 + b_2 \\
x_2 & = & \sigma(s_2)
\end{array}$$
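
The equations above are written for a single input vector, whereas `forward_step` receives a batch with one sample per row, which is why it computes `xIn @ w.T + b`. Here is a minimal sketch of that correspondence. The sizes and the random data are illustrative assumptions.

```python
import torch

batch_size, nb_in, nb_out = 4, 5, 3  # illustrative sizes

x0 = torch.randn(batch_size, nb_in)  # one sample per row
w1 = torch.randn(nb_out, nb_in)      # same layout as in forward_step
b1 = torch.randn(nb_out)

# Batched form: each row of x0 goes through the layer, b1 is broadcast over rows
s1 = x0 @ w1.T + b1
x1 = torch.tanh(s1)

# Per-sample form from the equations: s_1 = w_1 . x_0 + b_1
s1_first = w1 @ x0[0] + b1

print(s1.shape)
print(torch.allclose(s1[0], s1_first, atol=1e-5))
```

Row `n` of `s1` is the per-sample pre-activation of sample `n`, so the batched code and the per-vector equations describe the same computation.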

## Backward pass

Write a function:

In [108]:
def backward_step(xIn:torch.Tensor, w:torch.Tensor, 
                  s:torch.Tensor, dl_xOut:torch.Tensor)->tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
    """compute a step of the backward pass, returns: (dl_w, dl_b, dl_xIn)"""
    dl_s = dsigma(s.detach()) * dl_xOut
    dl_w = dl_s.T @ xIn.detach()
    dl_b = dl_s.sum(dim=0)
    dl_xIn = dl_s @ w.detach()
    return (dl_w, dl_b, dl_xIn)

In [109]:
batchSize=4; nbIn=5; nbOut = 3
xIn = torch.empty((batchSize, nbIn)).normal_().requires_grad_(True)
w = torch.empty((nbOut, nbIn)).normal_().requires_grad_(True)
b = torch.empty((nbOut, )).normal_().requires_grad_(True)

s = xIn @ w.T + b; s.retain_grad()
xOut = sigma(s); xOut.retain_grad()
l = torch.sum(xOut); l.retain_grad()
l.backward()

dl_xOut = torch.ones((batchSize, nbOut))  # the loss is a plain sum, so dl/dxOut is all ones
dl_w, dl_b, dl_xIn = backward_step(xIn, w, s, dl_xOut)

print("dl_xOut:", (dl_xOut == xOut.grad).all())
print("dl_w:", (dl_w == w.grad).all())
print("dl_b:", (dl_b == b.grad).all())
print("dl_xIn:", (dl_xIn == xIn.grad).all())

dl_xOut: tensor(True)
dl_w: tensor(True)
dl_b: tensor(True)
dl_xIn: tensor(True)


In [110]:
def backward_pass(w1, b1, w2, b2, # MLP parameters
                  t, # target
                  x, s1, x1, s2, x2, # tensors from the forward pass
                  dl_dw1, dl_db1, dl_dw2, dl_db2 # gradients of the loss w.r.t. the parameters w1, b1, w2, b2
                 ):
    # compute the gradients layer by layer, from the output back to the input
    grad_x2 = dloss(v=x2, t=t)
    grad_w2, grad_b2, grad_x1 = backward_step(x1, w2, s2, grad_x2)
    grad_w1, grad_b1, grad_x = backward_step(x, w1, s1, grad_x1)
    # accumulate in place so that the caller's gradient tensors are updated
    dl_dw1 += grad_w1; dl_db1 += grad_b1
    dl_dw2 += grad_w2; dl_db2 += grad_b2

Its arguments are the network parameters, the target vector, the quantities computed by the forward pass, and the tensors holding the gradients; it updates the latter according to the backward-pass formulas:

$$\frac{\partial \mathcal{L}}{\partial x_2} = \nabla_{x_2} \mathcal{L}(x_2, t) = 2\,(x_2 - t)$$

* Derivatives through the activations, for layer $\ell$:

We have $x_{\ell}(i) = \sigma(s_{\ell}(i))$, hence:
$$
\frac{\partial \mathcal{L}}{\partial s_{\ell}(i)} = \sigma'(s_{\ell}(i)) \cdot \frac{\partial \mathcal{L}}{\partial x_{\ell}(i)}
$$

We have $s_{\ell}(i) = \sum_j w_{\ell}(i,j) \cdot x_{\ell - 1}(j) + b_{\ell}(i)$, hence:
$$
\frac{\partial \mathcal{L}}{\partial x_{\ell - 1}(j)} = \sum_i w_{\ell}(i,j) \cdot \frac{\partial \mathcal{L}}{\partial s_{\ell}(i)}
$$

In matrix notation, with $\odot$ the element-wise product:
$$
\frac{\partial \mathcal{L}}{\partial s_{\ell}} = \sigma'(s_{\ell}) \odot \frac{\partial \mathcal{L}}{\partial x_{\ell}}
$$

$$
\frac{\partial \mathcal{L}}{\partial x_{\ell - 1}} = w_{\ell}^T \cdot \frac{\partial \mathcal{L}}{\partial s_{\ell}}
$$

* Derivatives through the linear layers:

We have $s_\ell(i) = \sum_j w_{\ell}(i,j) \cdot x_{\ell - 1}(j) + b_{\ell}(i)$, hence:
$$
\frac{\partial \mathcal{L}}{\partial w_{\ell}(i,j)} = \frac{\partial \mathcal{L}}{\partial s_{\ell}(i)} \cdot x_{\ell - 1}(j)
$$
and
$$
\frac{\partial \mathcal{L}}{\partial b_{\ell}(i)} = \frac{\partial \mathcal{L}}{\partial s_{\ell}(i)}
$$

In matrix notation:
$$
\frac{\partial \mathcal{L}}{\partial w_{\ell}} = \frac{\partial \mathcal{L}}{\partial s_{\ell}} \cdot x_{\ell - 1}^T
$$
and
$$
\frac{\partial \mathcal{L}}{\partial b_{\ell}} = \frac{\partial \mathcal{L}}{\partial s_{\ell}}
$$

**Hint**: The functions must not contain any Python loop; use in particular `.T` (transpose) and `Tensor.view`.
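
The formulas above are written for a single sample stored as a column vector, while `backward_step` works on a batch stored row-wise. Below is a sketch of the corresponding batched form. The symbols $X_{\ell-1}$ (shape $B \times n_{\ell-1}$), $X_\ell$ and $S_\ell$ (shape $B \times n_\ell$) are notation introduced here for the stacked samples, with $B$ the batch size:

$$\begin{array}{lll}
\dfrac{\partial \mathcal{L}}{\partial S_{\ell}} & = & \sigma'(S_{\ell}) \odot \dfrac{\partial \mathcal{L}}{\partial X_{\ell}} \\
\dfrac{\partial \mathcal{L}}{\partial w_{\ell}} & = & \left(\dfrac{\partial \mathcal{L}}{\partial S_{\ell}}\right)^T \cdot X_{\ell - 1} \\
\dfrac{\partial \mathcal{L}}{\partial b_{\ell}} & = & \sum_{n=1}^{B} \dfrac{\partial \mathcal{L}}{\partial S_{\ell}}(n, \cdot) \\
\dfrac{\partial \mathcal{L}}{\partial X_{\ell - 1}} & = & \dfrac{\partial \mathcal{L}}{\partial S_{\ell}} \cdot w_{\ell}
\end{array}$$

These four lines map onto `dsigma(s) * dl_xOut`, `dl_s.T @ xIn`, `dl_s.sum(dim=0)` and `dl_s @ w` in `backward_step`. The sums over the batch appear because the loss is a sum over samples.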

## Training loop

Complete the code below

In [111]:
from sklearn.datasets import load_digits
import numpy as np


X_digits: np.ndarray; y_digits: np.ndarray
X_digits, y_digits = load_digits(return_X_y=True)
print(X_digits.shape, y_digits.shape)

(1797, 64) (1797,)


In [112]:
def one_hot_encode(y, num_classes=10):
    num_samples = len(y)
    one_hot = np.zeros((num_samples, num_classes))
    one_hot[np.arange(num_samples), y] = 1
    return one_hot

y_digits_one_hot = one_hot_encode(y_digits, num_classes=10)
print(y_digits_one_hot.shape)

(1797, 10)
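
To make the encoding concrete, here is a small sketch that applies the same function to three toy labels. The labels are made up for illustration and are not taken from the digits dataset.

```python
import numpy as np

def one_hot_encode(y, num_classes=10):
    # Same function as in the cell above, repeated so the example is self-contained
    num_samples = len(y)
    one_hot = np.zeros((num_samples, num_classes))
    one_hot[np.arange(num_samples), y] = 1
    return one_hot

labels = np.array([0, 2, 1])  # toy labels (assumption for this example)
encoded = one_hot_encode(labels, num_classes=3)
print(encoded)

# argmax along the class axis recovers the original labels
recovered = encoded.argmax(axis=1)
print(recovered)
```

The training loop relies on the same `argmax` over the class dimension to turn one-hot targets back into class indices.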


In [113]:
from sklearn.model_selection import train_test_split

train_input, test_input, train_target, test_target = \
    train_test_split(X_digits, y_digits_one_hot, test_size=0.2, shuffle=True)

In [114]:
train_input = torch.tensor(train_input, dtype=torch.float32)
train_target = torch.tensor(train_target, dtype=torch.long) 
test_input = torch.tensor(test_input, dtype=torch.float32)
test_target = torch.tensor(test_target, dtype=torch.long) 

In [115]:
print(train_input.shape, train_target.shape)
print(test_input.shape, test_target.shape)

torch.Size([1437, 64]) torch.Size([1437, 10])
torch.Size([360, 64]) torch.Size([360, 10])


In [116]:
nb_train_samples, nb_classes = train_target.shape
print(f"{nb_classes=}, {nb_train_samples=}")

nb_classes=10, nb_train_samples=1437


In [117]:
zeta = 0.9

train_target = train_target * zeta
test_target = test_target * zeta
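
The notebook does not say why the targets are scaled by `zeta`. One plausible reading, stated here as an assumption, is this: tanh only takes values strictly inside $(-1, 1)$, so a target of exactly 1 could only be approached with an unbounded pre-activation, whereas 0.9 is reached at a finite one. A small check with the standard `math` module:

```python
import math

zeta = 0.9  # same scaling factor as above

# Pre-activation at which tanh outputs exactly zeta
s_needed = math.atanh(zeta)
print(round(s_needed, 3))

# tanh gets close to 1 for large inputs but stays below it
print(math.tanh(5.0) < 1.0)
```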

In [122]:
nb_hidden = 50
eta = 1e-2 / nb_train_samples
print(f"{eta=:_.3g}")
epsilon = 1e-6

w1 = torch.empty(nb_hidden, train_input.size(1)).normal_(0, epsilon)
b1 = torch.empty(nb_hidden).normal_(0, epsilon)
w2 = torch.empty(nb_classes, nb_hidden).normal_(0, epsilon)
b2 = torch.empty(nb_classes).normal_(0, epsilon)

dl_dw1 = torch.empty(w1.size())
dl_db1 = torch.empty(b1.size())
dl_dw2 = torch.empty(w2.size())
dl_db2 = torch.empty(b2.size())

eta=6.96e-06
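
One way to read `eta = 1e-2 / nb_train_samples` is as follows: the loss and its gradients are summed over all samples, so dividing the rate by the number of samples amounts to a step of size `1e-2` along the averaged gradient. A minimal sketch of that equivalence on random data, where the names `base_rate`, `nb_samples`, `step_a` and `step_b` are introduced only for this example:

```python
import torch

torch.manual_seed(0)
nb_samples = 8  # illustrative size
v = torch.randn(nb_samples, 3)  # stand-in for predictions
t = torch.randn(nb_samples, 3)  # stand-in for targets

base_rate = 1e-2
eta = base_rate / nb_samples

grad_of_sum = 2 * (v - t)                # gradient of sum((t - v)^2) w.r.t. v
grad_of_mean = grad_of_sum / nb_samples  # gradient of the per-sample average

# Small rate on the summed gradient vs. base rate on the averaged gradient
step_a = eta * grad_of_sum
step_b = base_rate * grad_of_mean
print(torch.allclose(step_a, step_b))
```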


In [123]:
from itertools import batched
N_EPOCHS = 500
f = lambda d, tt: ", ".join(f"{100*v/tt:.1f}" for v in d.values())

for k in range(N_EPOCHS):
    acc_loss = 0
    nb_train_correct = 0

    # Empty the gradients here
    dl_dw1.zero_()
    dl_db1.zero_()
    dl_dw2.zero_()
    dl_db2.zero_()

    train_preds: dict[int, int] = {i:0 for i in range(nb_classes)}
    train_truths: dict[int, int] = {i:0 for i in range(nb_classes)}
    #for n_ in map(list, batched(range(nb_train_samples), n=200)):
    for n_ in [list(range(nb_train_samples))]:
        # Forward pass here
        x = train_input[n_]
        s1, x1, s2, x2 = forward_pass(w1, b1, w2, b2, x)

        # Computing the loss (only for logging purposes)
        preds = x2.argmax(dim=1)
        truths = train_target[n_].argmax(dim=1)
        nb_train_correct += (truths == preds).sum().item()
        for pred in preds.tolist(): train_preds[pred] += 1
        for truth in truths.tolist(): train_truths[truth] += 1
        acc_loss = acc_loss + loss(x2, train_target[n_]).sum()

        # Backward pass here
        t = train_target[n_]
        backward_pass(w1, b1, w2, b2, t, x, s1, x1, s2, x2, dl_dw1, dl_db1, dl_dw2, dl_db2)
    acc_loss /= train_input.size(0)

    # Gradient step
    w1 = w1 - eta * dl_dw1; b1 = b1 - eta * dl_db1
    w2 = w2 - eta * dl_dw2; b2 = b2 - eta * dl_db2

    # Test error
    if ((k % 20 == 0) or (k == N_EPOCHS-1)): 
        nb_test_correct = 0
        test_preds: dict[int, int] = {i:0 for i in range(nb_classes)}
        test_truths: dict[int, int] = {i:0 for i in range(nb_classes)}
        #for n_ in map(list, batched(range(test_input.size(0)), n=200)):
        for n_ in [list(range(test_input.size(0)))]:
            _, _, _, x2 = forward_pass(w1, b1, w2, b2, test_input[n_])

            preds = x2.argmax(dim=1)
            truths = test_target[n_].argmax(dim=1)
            nb_test_correct += (truths == preds).sum().item()
            for pred in preds.tolist(): test_preds[pred] += 1
            for truth in truths.tolist(): test_truths[truth] += 1

        trainAccuracy: float = nb_train_correct / train_input.size(0)
        testAccuracy: float = nb_test_correct / test_input.size(0)
        print()
        print("test_preds:", f(test_preds, test_input.size(0)))
        #print("test_truths:", f(test_truths, test_input.size(0)))
        #print("----")
        print("train_preds:", f(train_preds, train_input.size(0)))
        #print("train_truths:", f(train_truths, train_input.size(0)))
        print(f"epoch:{k:>4d}, trainLoss:{acc_loss:.5f}, tainAcc:{trainAccuracy:.1%}, testAcc:{testAccuracy:.1%}")


test_preds: 0.0, 0.0, 0.0, 0.0, 0.0, 100.0, 0.0, 0.0, 0.0, 0.0
train_preds: 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 100.0, 0.0
epoch:   0, trainLoss:0.81000, tainAcc:9.5%, testAcc:7.8%

test_preds: 0.0, 0.0, 0.0, 0.0, 0.0, 100.0, 0.0, 0.0, 0.0, 0.0
train_preds: 0.0, 0.0, 0.0, 0.0, 0.0, 100.0, 0.0, 0.0, 0.0, 0.0
epoch:  20, trainLoss:0.76502, tainAcc:10.7%, testAcc:7.8%

test_preds: 0.0, 0.0, 0.0, 0.0, 0.0, 100.0, 0.0, 0.0, 0.0, 0.0
train_preds: 0.0, 0.0, 0.0, 0.0, 0.0, 100.0, 0.0, 0.0, 0.0, 0.0
epoch:  40, trainLoss:0.74400, tainAcc:10.7%, testAcc:7.8%

test_preds: 0.0, 100.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0
train_preds: 0.0, 100.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0
epoch:  60, trainLoss:0.72937, tainAcc:10.6%, testAcc:8.3%

test_preds: 0.0, 100.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0
train_preds: 0.0, 100.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0
epoch:  80, trainLoss:0.72873, tainAcc:10.6%, testAcc:8.3%

test_preds: 0.0, 100.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0
tr