# Transformers

 - Practicum: [Week12](https://www.youtube.com/watch?v=f01J0Dri-6k) - [Instant](https://youtu.be/f01J0Dri-6k?t=2530)
 

## Theory


This lesson can be a bit dense and difficult to follow. At the same time, I have a special interest in learning about Transformers.

For these reasons, in this notebook I describe and complete the calculations and formulations presented in the video. In particular, I break down each equation and write out each vector and matrix.

I changed Alfredo's notation slightly to make it easier for me to write and understand. In any case, the changes are minor and easy to identify.

The following explanations follow the structure of the video and slides. You can combine both sources for a better understanding.

*GitHub users: the matrices and other mathematical formulas are not rendered properly in GitHub's view. For a proper visualization, please download the repository (to include the images) and open the notebook with Jupyter Notebook.*

### Self-attention


#### Input

Input set of size `t`: $\{x^{(i)}\}_{i=1}^{t} = \{x^{(1)}, x^{(2)}, ..., x^{(t)}\}$ , `t` elements
 
Each $x^{(i)} \in \mathbb{R}^n$, with $x^{(i)} = (x_1^{(i)}, x_2^{(i)}, ..., x_n^{(i)})$, is an embedding of `n` features

![input-repres](res/transformers_xinput_embbd.png)

If we concatenate the `t` elements as columns we get a matrix:

 $X \in \mathbb{R}^{n \times t} = \begin{pmatrix}x_1^{(1)} & x_1^{(2)} & x_1^{(3)} & ... & x_1^{(t)}\\\ 
                 x_2^{(1)} & x_2^{(2)} & x_2^{(3)} & ... & x_2^{(t)}\\\ 
                 ... \\\ 
                 x_n^{(1)} & x_n^{(2)} & x_n^{(3)} & ... & x_n^{(t)}
\end{pmatrix}$ , `n` rows (embedding) and `t` columns (inputs)

<br/>
<br/>

#### Hidden representation

Self-attention computes a hidden representation $h$ as a linear combination of the input vectors:

$h = Xa$ &rarr; $h \in \mathbb{R}^{n}$


$ 
h = 
\begin{pmatrix}
    x_1^{(1)} & x_1^{(2)} & x_1^{(3)} & ... & x_1^{(t)}\\\ 
    x_2^{(1)} & x_2^{(2)} & x_2^{(3)} & ... & x_2^{(t)}\\\ 
    ... \\\ 
    x_n^{(1)} & x_n^{(2)} & x_n^{(3)} & ... & x_n^{(t)}
\end{pmatrix}
\begin{pmatrix}
    a_1\\\
    a_2\\\
    ... \\\
    a_t
\end{pmatrix} = 
  a_1 \begin{pmatrix}x_1^{(1)}\\\ x_2^{(1)}\\\ ... \\\ x_n^{(1)}\end{pmatrix}
+ a_2 \begin{pmatrix}x_1^{(2)}\\\ x_2^{(2)}\\\ ... \\\ x_n^{(2)}\end{pmatrix} 
+ ... 
+ a_t \begin{pmatrix}x_1^{(t)}\\\ x_2^{(t)}\\\ ... \\\ x_n^{(t)}\end{pmatrix}  = 
\begin{pmatrix}
    h_1 \\\ 
    h_2\\\ 
    ... \\\ 
    h_n
\end{pmatrix}$ 


&rarr; $a \in \mathbb{R}^{t} = \begin{pmatrix}a_1\\\ a_2\\\ ... \\\ a_t\end{pmatrix}$
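
The product $h = Xa$ can be checked numerically. The following is a minimal sketch, assuming small illustrative sizes (`n = 3`, `t = 4`) and arbitrary coefficients in `a`; none of these values come from the lecture.

```python
import torch

n, t = 3, 4                                   # illustrative sizes (assumption)
# Columns of X play the role of x^(1) ... x^(t)
X = torch.arange(n * t, dtype=torch.float32).reshape(n, t)
a = torch.tensor([0.1, 0.2, 0.3, 0.4])        # one coefficient per column

h = X @ a                                     # matrix-vector product, shape (n,)

# The same result written as an explicit linear combination of the columns
h_sum = sum(a[i] * X[:, i] for i in range(t))
```

Both `h` and `h_sum` hold the same vector, which is what the expanded equation above states.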

<br/>
<br/>

#### Attention
$a$ is the attention.

##### Types of attention
We have 2 kinds of attention:

<br/>

  - *hard attention* 

$||a||_0 = 1$: the zero norm (the number of non-zero entries) is 1 and the single non-zero entry is equal to 1 &rarr; one-hot encoded vector

example, $a = \begin{pmatrix} 0\\\ 1\\\ 0\\\ ... \\\ 0\end{pmatrix}$

With a one-hot vector, $h$ pays attention to only one column of $X$, that is, to a single element $x^{(i)}$; all the other elements are multiplied by zero.
 
example, $ h = \begin{pmatrix}x_1^{(1)} & x_1^{(2)} & x_1^{(3)} & ... & x_1^{(t)}\\\ 
                 x_2^{(1)} & x_2^{(2)} & x_2^{(3)} & ... & x_2^{(t)}\\\ 
                 ... \\\ 
                 x_n^{(1)} & x_n^{(2)} & x_n^{(3)} & ... & x_n^{(t)}
\end{pmatrix} \begin{pmatrix}0\\\ 1\\\ 0 \\\ ... \\\ 0 \end{pmatrix}$ = $ 0 \begin{pmatrix}x_1^{(1)}\\\ x_2^{(1)}\\\ ... \\\ x_n^{(1)}\end{pmatrix} + 1 \begin{pmatrix}x_1^{(2)}\\\ x_2^{(2)}\\\ ... \\\ x_n^{(2)}\end{pmatrix} + 0 \begin{pmatrix}x_1^{(3)}\\\ x_2^{(3)}\\\ ... \\\ x_n^{(3)}\end{pmatrix} + ... + 0 \begin{pmatrix}x_1^{(t)}\\\ x_2^{(t)}\\\ ... \\\ x_n^{(t)}\end{pmatrix} $ = $\begin{pmatrix}x_1^{(2)}\\\ x_2^{(2)}\\\ ... \\\ x_n^{(2)}\end{pmatrix} = x^{(2)}$

<br/>
<br/>
 
  - *soft attention* 

$||a||_1 = 1$ with non-negative entries: the elements of $a$ sum to 1 &rarr; probability vector

example, $a = \begin{pmatrix} p_1\\\ p_2\\\ p_3\\\ ... \\\ p_t\end{pmatrix}$ => $\sum_{j=1}^{t} p_j = 1$

In this case, the hidden representation is a weighted combination of the columns: it pays attention to all the elements $x^{(i)}$, but to some more than to others.
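
To compare both kinds of attention, here is a small sketch with hand-picked vectors. The matrix `X` and the values in `a_hard` and `a_soft` are assumptions chosen only for illustration.

```python
import torch

X = torch.tensor([[1., 2., 3.],
                  [4., 5., 6.]])         # n = 2 features, t = 3 elements (columns)

a_hard = torch.tensor([0., 1., 0.])      # one-hot: exactly one non-zero entry
a_soft = torch.tensor([0.2, 0.7, 0.1])   # non-negative entries that sum to 1

h_hard = X @ a_hard                      # selects the second column of X
h_soft = X @ a_soft                      # weighted average of all the columns
```

Hard attention returns one element of the set unchanged, whereas soft attention mixes all of them according to the weights.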
 
 
<br/>
<br/>

##### Attention calculation

How `attention`, $a$, is calculated:

![attention_equation](res/transformers_attention_eq.png)

You can use:

   [argmax](https://pytorch.org/docs/stable/generated/torch.argmax.html): it gets the index of the max value of the vector &rarr; one-hot vector &rarr; `hard attention`
   
   
   [softargmax](https://pytorch.org/docs/stable/generated/torch.nn.Softmax.html): it obtains the probability distribution &rarr; `soft attention`  

The input of the equation is $X^Tx$.

With: $X^T \in \mathbb{R}^{t \times n}$; $x \in \mathbb{R}^{n}$ &rarr; $X^Tx \in \mathbb{R}^{t}$

$ X^Tx = 
\begin{pmatrix}x_1^{(1)} & x_2^{(1)} & x_3^{(1)} & ... & x_n^{(1)}\\\ 
               x_1^{(2)} & x_2^{(2)} & x_3^{(2)} & ... & x_n^{(2)}\\\ 
               ... \\\ 
               x_1^{(t)} & x_2^{(t)} & x_3^{(t)} & ... & x_n^{(t)}
\end{pmatrix}  
\begin{pmatrix}x_1^{'}\\\ x_2^{'}\\\ ... \\\ x_n^{'}\end{pmatrix}
= 
\begin{pmatrix} \alpha_1 \\\ \alpha_2\\\ ... \\\ \alpha_t\end{pmatrix}
$

> This multiplication, $X^Tx$, computes how aligned $x$ is with each element of the input set or, in other words, **how similar $x$ is to each element of the dataset $X$**.
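
The calculation above can be sketched in a few lines. This is only an illustration under assumptions: the data are random, the sizes are arbitrary, and `x` is taken to be the third element of the set.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n, t = 4, 5                      # illustrative sizes (assumption)
X = torch.randn(n, t)            # columns are the elements x^(1) ... x^(t)
x = X[:, 2]                      # the element we compute attention for

scores = X.T @ x                 # shape (t,): one dot product per element

# Soft attention: a probability vector over the t elements
a_soft = F.softmax(scores, dim=0)

# Hard attention: a one-hot vector at the position of the largest score
a_hard = torch.zeros(t)
a_hard[scores.argmax()] = 1.0
```

The entry `scores[2]` is the dot product of `x` with itself, and the other entries measure how aligned `x` is with the remaining elements.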

#### Vectorization

We have a set of $x$'s: $\{x^{(i)}\}_{i=1}^{t} = \{x^{(1)}, x^{(2)}, ..., x^{(t)}\}$ , `t` elements

That implies you have a set of $a$'s: $\{x^{(i)}\}_{i=1}^{t}$  &rarr;  $\{a^{(i)}\}_{i=1}^{t}$

In other words, you can apply `attention` to each $x^{(i)}$ of the set, and get the corresponding $a^{(i)}$.

Generalization, $\{a^{(i)}\}_{i=1}^{t}$; $a \in \mathbb{R}^{t}$  &rarr; $A \in \mathbb{R}^{t \times t}$

$A = \begin{pmatrix}
          a_1^{(1)} & a_1^{(2)} ... & a_1^{(t)} \\\
          a_2^{(1)} & a_2^{(2)} ... & a_2^{(t)} \\\ 
          ... \\\
          a_t^{(1)} & a_t^{(2)} ... & a_t^{(t)} 
     \end{pmatrix} $
     
It has dimension $t \times t$ because you have $t$ elements, and column $i$ of $A$ measures the `attention` of $x^{(i)}$ with respect to each of the $t$ elements of the dataset.
 
<br/> 


Then, if we have a set of `attentions` $a$'s we can get a set of `hidden layers` $h$'s: $\{a^{(i)}\}_{i=1}^{t}$  &rarr;  $\{h^{(i)}\}_{i=1}^{t}$

Generalization, $\{h^{(i)}\}_{i=1}^{t}$ = $H \in \mathbb{R}^{n \times t}$

Where, $H = XA$


$H = \begin{pmatrix}
        x_1^{(1)} & x_1^{(2)} & x_1^{(3)} & ... & x_1^{(t)}\\\ 
        x_2^{(1)} & x_2^{(2)} & x_2^{(3)} & ... & x_2^{(t)}\\\ 
        ... \\\ 
        x_n^{(1)} & x_n^{(2)} & x_n^{(3)} & ... & x_n^{(t)}
\end{pmatrix} \begin{pmatrix}
        a_1^{(1)} & a_1^{(2)} ... & a_1^{(t)} \\\
        a_2^{(1)} & a_2^{(2)} ... & a_2^{(t)} \\\ 
        ... \\\
        a_t^{(1)} & a_t^{(2)} ... & a_t^{(t)} 
\end{pmatrix} =
\begin{pmatrix}
        h_{(1,1)} & h_{(1,2)} & h_{(1,3)} & ... & h_{(1,t)}\\\ 
        h_{(2,1)} & h_{(2,2)} & h_{(2,3)} & ... & h_{(2,t)}\\\ 
        ... \\\ 
        h_{(n,1)} & h_{(n,2)} & h_{(n,3)} & ... & h_{(n,t)}
\end{pmatrix} 
$
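
As a rough sketch of this vectorization, assuming soft attention and random data, the whole matrix $A$ can be computed at once and compared with the single-vector calculation. With the column convention used here, each column of `A` is one attention vector, so the normalization runs over rows (`dim=0`).

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n, t = 4, 5                          # illustrative sizes (assumption)
X = torch.randn(n, t)

# Column i of A is a^(i) = softargmax(X^T x^(i))
A = F.softmax(X.T @ X, dim=0)        # shape (t, t)
H = X @ A                            # shape (n, t); column i is h^(i)

# Check one column against the computation for a single x^(i)
i = 1
a_i = F.softmax(X.T @ X[:, i], dim=0)
h_i = X @ a_i
```

Column `i` of `H` matches `h_i`, so the matrix product computes all the hidden representations in one step.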

<br/>
<br/>

### Key-value store

A key-value store is a paradigm for storing, retrieving and managing an associative array (dictionary / hash table). In detail, it is a paradigm to retrieve *values* ($v$'s) stored under *keys* ($k$'s) by means of a *query* ($q$).

| k      | v        |
|--------|----------|
| key_1  | value_A  |
| key_2  | value_B  |
| key_3  | value_C  |

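In this paradigm, the retrieval process returns the value stored under the key that is most similar to the query: $value = v^{(i^*)}$ with $i^* = \arg\max_i \; similarity(q, k^{(i)})$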

<br/>

For example, you have stored a list of videos (values) indexed by their titles (keys) and you search one using keywords (query).


| k      | v        |
|--------|----------|
| Understand ML  | http://youtube.com/video/understanding_ml  |
| Messi Best Goals   | http://youtube.com/video/messi_best_goals  |
| Practicum: Attention and the Transformer  | http://youtube.com/video/attention_transformer  |
| ...  | ...  |

Querying with $q$=`transformers`

we receive the value `http://youtube.com/video/attention_transformer`
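
The retrieval in this example can be sketched with a plain dictionary. The `similarity` function below is a hypothetical toy measure (longest common prefix between the query and any word of the key), chosen only so that the example runs; a real search engine would use something far more elaborate.

```python
def common_prefix_len(a: str, b: str) -> int:
    """Length of the common prefix of two strings."""
    length = 0
    for char_a, char_b in zip(a, b):
        if char_a != char_b:
            break
        length += 1
    return length


def similarity(query: str, key: str) -> int:
    """Toy similarity (assumption): best prefix match between the query and a word of the key."""
    q = query.lower()
    return max(common_prefix_len(q, word) for word in key.lower().split())


def retrieve(query: str, store: dict) -> str:
    """Return the value whose key is the most similar to the query."""
    best_key = max(store, key=lambda k: similarity(query, k))
    return store[best_key]


store = {
    "Understand ML": "http://youtube.com/video/understanding_ml",
    "Messi Best Goals": "http://youtube.com/video/messi_best_goals",
    "Practicum: Attention and the Transformer": "http://youtube.com/video/attention_transformer",
}

result = retrieve("transformers", store)
```

This is a hard retrieval: only the single best-matching key contributes to the result.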


#### Queries, keys and values

We can compute the queries, keys and values from the input itself when our goal is a transformation of the input set (e.g., translating a text into another language).

$q = W_qx$ ; $k = W_kx$ ; $v = W_vx$

To compare $q$ and $k$ they must have the same dimension: $q,k \in \mathbb{R}^{d}$

The value has its own dimensionality: $v \in \mathbb{R}^{p}$

Therefore,

 $W_q, W_k \in \mathbb{R}^{d \times n}$
 
 $W_v \in \mathbb{R}^{p \times n}$

In detail,

$ q = W_qx = 
\begin{pmatrix}
    w_{q(1,1)} & w_{q(1,2)} & w_{q(1,3)} & ... & w_{q(1,n)}\\\ 
    w_{q(2,1)} & w_{q(2,2)} & w_{q(2,3)} & ... & w_{q(2,n)}\\\
    ... \\\ 
    w_{q(d,1)} & w_{q(d,2)} & w_{q(d,3)} & ... & w_{q(d,n)}\\\
\end{pmatrix}  
\begin{pmatrix}
    x_1^{'}\\\ x_2^{'}\\\ ... \\\ x_n^{'}
\end{pmatrix}
= 
\begin{pmatrix} 
    q_1 \\\ q_2\\\ ... \\\ q_d 
\end{pmatrix}
$


$ k = W_kx = 
\begin{pmatrix}
    w_{k(1,1)} & w_{k(1,2)} & w_{k(1,3)} & ... & w_{k(1,n)}\\\ 
    w_{k(2,1)} & w_{k(2,2)} & w_{k(2,3)} & ... & w_{k(2,n)}\\\
    ... \\\ 
    w_{k(d,1)} & w_{k(d,2)} & w_{k(d,3)} & ... & w_{k(d,n)}\\\
\end{pmatrix}  
\begin{pmatrix}
    x_1^{'}\\\ x_2^{'}\\\ ... \\\ x_n^{'}
\end{pmatrix}
= 
\begin{pmatrix} 
    k_1 \\\ k_2\\\ ... \\\ k_d 
\end{pmatrix}
$


$ v = W_vx = 
\begin{pmatrix}
    w_{v(1,1)} & w_{v(1,2)} & w_{v(1,3)} & ... & w_{v(1,n)}\\\ 
    w_{v(2,1)} & w_{v(2,2)} & w_{v(2,3)} & ... & w_{v(2,n)}\\\
    ... \\\ 
    w_{v(p,1)} & w_{v(p,2)} & w_{v(p,3)} & ... & w_{v(p,n)}\\\
\end{pmatrix}  
\begin{pmatrix}
    x_1^{'}\\\ x_2^{'}\\\ ... \\\ x_n^{'}
\end{pmatrix}
= 
\begin{pmatrix} 
    v_1 \\\ v_2\\\ ... \\\ v_p 
\end{pmatrix}
$
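
The three projections can be sketched as follows. The matrices are random here only to show the shapes; in a real model they would be learned, and the sizes `n`, `d` and `p` are arbitrary assumptions.

```python
import torch

torch.manual_seed(0)
n, d, p = 6, 4, 3                 # input, query/key and value dimensions (assumption)

W_q = torch.randn(d, n)
W_k = torch.randn(d, n)
W_v = torch.randn(p, n)

x = torch.randn(n)                # a single input element

q = W_q @ x                       # shape (d,)
k = W_k @ x                       # shape (d,)
v = W_v @ x                       # shape (p,)

score = q @ k                     # defined because q and k share the dimension d
```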




#### Vectorization

We have a set of $x$'s: $\{x^{(i)}\}_{i=1}^{t} = \{x^{(1)}, x^{(2)}, ..., x^{(t)}\}$ , `t` elements

That implies you have sets of $q$'s, $k$'s and $v$'s: $\{x^{(i)}\}_{i=1}^{t}$  &rarr;  $\{q^{(i)}\}_{i=1}^{t}; \{k^{(i)}\}_{i=1}^{t} ; \{v^{(i)}\}_{i=1}^{t} $

In general,

$\{q^{(i)}\}_{i=1}^{t}; q \in \mathbb{R}^{d}$ &rarr; $Q \in \mathbb{R}^{d \times t}$

$Q = \begin{pmatrix}
        q_1^{(1)} & q_1^{(2)} & q_1^{(3)} & ... & q_1^{(t)}\\\ 
        q_2^{(1)} & q_2^{(2)} & q_2^{(3)} & ... & q_2^{(t)}\\\ 
        ... \\\ 
        q_d^{(1)} & q_d^{(2)} & q_d^{(3)} & ... & q_d^{(t)}
\end{pmatrix} $

<br/>

$\{k^{(i)}\}_{i=1}^{t}; k \in \mathbb{R}^{d}$ &rarr; $K \in \mathbb{R}^{d \times t}$

$K = \begin{pmatrix}
        k_1^{(1)} & k_1^{(2)} & k_1^{(3)} & ... & k_1^{(t)}\\\ 
        k_2^{(1)} & k_2^{(2)} & k_2^{(3)} & ... & k_2^{(t)}\\\ 
        ... \\\ 
        k_d^{(1)} & k_d^{(2)} & k_d^{(3)} & ... & k_d^{(t)}
\end{pmatrix} $

<br/>

$\{v^{(i)}\}_{i=1}^{t}; v \in \mathbb{R}^{p}$ &rarr; $V \in \mathbb{R}^{p \times t}$

$V = \begin{pmatrix}
        v_1^{(1)} & v_1^{(2)} & v_1^{(3)} & ... & v_1^{(t)}\\\ 
        v_2^{(1)} & v_2^{(2)} & v_2^{(3)} & ... & v_2^{(t)}\\\ 
        ... \\\ 
        v_p^{(1)} & v_p^{(2)} & v_p^{(3)} & ... & v_p^{(t)}
\end{pmatrix} $

<br/>



### Cross-attention

#### Attention calculation

Now, if we apply this `key-value` paradigm to the `self-attention` process we have the equation: 

![cross-attention](res/transformers_cross_attention_eq.png)

As before, we can use:

 - [argmax](https://pytorch.org/docs/stable/generated/torch.argmax.html): it gets the index of the max value of the vector &rarr; one-hot vector &rarr; from a logic perspective, it only retrieves one $value$
   
   
 - [softargmax](https://pytorch.org/docs/stable/generated/torch.nn.Softmax.html): it obtains the probability distribution &rarr; from a logic perspective, it retrieves weighted $values$

The input of the equation is $K^Tq$.

With: $K^T \in \mathbb{R}^{t \times d}$; $q \in \mathbb{R}^{d}$ &rarr; $K^Tq \in \mathbb{R}^{t}$

$ K^Tq = 
\begin{pmatrix}
        k_1^{(1)} & k_2^{(1)} & k_3^{(1)} & ... & k_d^{(1)}\\\ 
        k_1^{(2)} & k_2^{(2)} & k_3^{(2)} & ... & k_d^{(2)}\\\ 
        ... \\\ 
        k_1^{(t)} & k_2^{(t)} & k_3^{(t)} & ... & k_d^{(t)}
\end{pmatrix} 
\begin{pmatrix} q_1\\\ q_2\\\ ... \\\ q_d\end{pmatrix}
= 
\begin{pmatrix} \alpha_1 \\\ \alpha_2\\\ ... \\\ \alpha_t\end{pmatrix}
$

> This multiplication, $K^Tq$, computes how aligned $q$ is with each key or, in other words, **how similar $q$ is to each key in $K$**.

In this case, if we apply `softargmax` directly we have a problem: the magnitude of the entries of $K^Tq$ grows with the square root of the number of dimensions $d$, which saturates the `softargmax`.
To avoid it, we scale its input by:

$ \beta = 1/ \sqrt{d} $
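
The effect of the scaling can be observed empirically. The sketch below assumes that the components of $q$ and $k$ are independent with zero mean and unit variance (standard normal samples); the helper name `score_std` and the sample count are illustrative choices.

```python
import torch

torch.manual_seed(0)
num_samples = 10_000                       # illustrative sample size (assumption)


def score_std(d: int):
    """Standard deviation of q.k before and after scaling by 1/sqrt(d)."""
    q = torch.randn(num_samples, d)
    k = torch.randn(num_samples, d)
    raw = (q * k).sum(dim=1)               # one dot product per sample
    scaled = raw / d ** 0.5                # beta = 1 / sqrt(d)
    return raw.std().item(), scaled.std().item()


raw_16, scaled_16 = score_std(16)
raw_256, scaled_256 = score_std(256)
```

Under these assumptions the raw standard deviation is roughly $\sqrt{d}$ (about 4 for `d = 16` and about 16 for `d = 256`), while the scaled one stays close to 1 for both sizes.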

#### Hidden layer

With this attention we can compute the hidden layer:

$h = Va$

In detail, $V \in \mathbb{R}^{p \times t}$ ; $a \in \mathbb{R}^{t}$  &rarr;  $h \in \mathbb{R}^{p}$

$h = Va = \begin{pmatrix}
        v_1^{(1)} & v_1^{(2)} & v_1^{(3)} & ... & v_1^{(t)}\\\ 
        v_2^{(1)} & v_2^{(2)} & v_2^{(3)} & ... & v_2^{(t)}\\\ 
        ... \\\ 
        v_p^{(1)} & v_p^{(2)} & v_p^{(3)} & ... & v_p^{(t)}
\end{pmatrix} \begin{pmatrix}a_1\\\ a_2\\\ ... \\\ a_t\end{pmatrix} =
\begin{pmatrix}
    h_1 \\\ 
    h_2\\\ 
    ... \\\ 
    h_p
\end{pmatrix}
$



#### Vectorization

We have a set of $q$'s: $\{q^{(i)}\}_{i=1}^{t} = \{q^{(1)}, q^{(2)}, ..., q^{(t)}\}$ , `t` elements

That implies you have a set of $a$'s: $\{q^{(i)}\}_{i=1}^{t}$  &rarr;  $\{a^{(i)}\}_{i=1}^{t} $ &rarr; $A \in \mathbb{R}^{t \times t}$

$A = \begin{pmatrix}
          a_1^{(1)} & a_1^{(2)} ... & a_1^{(t)} \\\
          a_2^{(1)} & a_2^{(2)} ... & a_2^{(t)} \\\ 
          ... \\\
          a_t^{(1)} & a_t^{(2)} ... & a_t^{(t)} 
     \end{pmatrix} $

 
<br/> 


Then, if we have a set of `attentions` $a$'s we can get a set of `hidden layers` $h$'s: $\{a^{(i)}\}_{i=1}^{t}$  &rarr;  $\{h^{(i)}\}_{i=1}^{t}$

Generalization, $\{h^{(i)}\}_{i=1}^{t}$ = $H \in \mathbb{R}^{p \times t}$

Where, $H = VA$


$H = \begin{pmatrix}
        v_1^{(1)} & v_1^{(2)} & v_1^{(3)} & ... & v_1^{(t)}\\\ 
        v_2^{(1)} & v_2^{(2)} & v_2^{(3)} & ... & v_2^{(t)}\\\ 
        ... \\\ 
        v_p^{(1)} & v_p^{(2)} & v_p^{(3)} & ... & v_p^{(t)}
\end{pmatrix} \begin{pmatrix}
        a_1^{(1)} & a_1^{(2)} ... & a_1^{(t)} \\\
        a_2^{(1)} & a_2^{(2)} ... & a_2^{(t)} \\\ 
        ... \\\
        a_t^{(1)} & a_t^{(2)} ... & a_t^{(t)} 
\end{pmatrix} =
\begin{pmatrix}
        h_{(1,1)} & h_{(1,2)} & h_{(1,3)} & ... & h_{(1,t)}\\\ 
        h_{(2,1)} & h_{(2,2)} & h_{(2,3)} & ... & h_{(2,t)}\\\ 
        ... \\\ 
        h_{(p,1)} & h_{(p,2)} & h_{(p,3)} & ... & h_{(p,t)}
\end{pmatrix} 
$
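
Putting the previous pieces together, a minimal sketch of $H = VA$ in the column convention of this section could look like this. The weights and inputs are random, the sizes are arbitrary, and soft attention with $\beta = 1/\sqrt{d}$ is assumed.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n, d, p, t = 6, 4, 3, 5                       # illustrative sizes (assumption)

X = torch.randn(n, t)                         # inputs as columns
W_q, W_k, W_v = torch.randn(d, n), torch.randn(d, n), torch.randn(p, n)

Q, K, V = W_q @ X, W_k @ X, W_v @ X           # (d, t), (d, t), (p, t)

beta = 1 / d ** 0.5
A = F.softmax(beta * (K.T @ Q), dim=0)        # (t, t); column i is the attention for q^(i)
H = V @ A                                     # (p, t); column i is h^(i)

# Single-query computation for comparison
h_0 = V @ F.softmax(beta * (K.T @ Q[:, 0]), dim=0)
```

Note that the class in the practicum stores the elements as rows, so the same products appear there transposed.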

<br/>
<br/>

#### Implementation

To parallelize the computation we can stack the three matrices $W_q$, $W_k$ and $W_v$ into a single matrix:

![cross-attetion-imp](res/transformers_cross_attention_implementation_eq.png)

The stacked matrix is $\in \mathbb{R}^{(2d+p) \times n}$ and the resulting vector is $\in \mathbb{R}^{2d+p}$

$
\begin{pmatrix}
        q\\\ 
        k\\\ 
        v
\end{pmatrix} =
\begin{pmatrix}
    w_{q(1,1)} & w_{q(1,2)} & w_{q(1,3)} & ... & w_{q(1,n)}\\\ 
    w_{q(2,1)} & w_{q(2,2)} & w_{q(2,3)} & ... & w_{q(2,n)}\\\
    ... \\\ 
    w_{q(d,1)} & w_{q(d,2)} & w_{q(d,3)} & ... & w_{q(d,n)}\\\ 
    w_{k(1,1)} & w_{k(1,2)} & w_{k(1,3)} & ... & w_{k(1,n)}\\\ 
    w_{k(2,1)} & w_{k(2,2)} & w_{k(2,3)} & ... & w_{k(2,n)}\\\
    ... \\\ 
    w_{k(d,1)} & w_{k(d,2)} & w_{k(d,3)} & ... & w_{k(d,n)}\\\
    w_{v(1,1)} & w_{v(1,2)} & w_{v(1,3)} & ... & w_{v(1,n)}\\\ 
    w_{v(2,1)} & w_{v(2,2)} & w_{v(2,3)} & ... & w_{v(2,n)}\\\
    ... \\\ 
    w_{v(p,1)} & w_{v(p,2)} & w_{v(p,3)} & ... & w_{v(p,n)}\\\
\end{pmatrix}  
\begin{pmatrix}
    x_1^{'}\\\ x_2^{'}\\\ ... \\\ x_n^{'}
\end{pmatrix} =
\begin{pmatrix}
    q_1 \\
    q_2 \\
    ... \\
    q_d \\
    k_1 \\
    k_2 \\
    ... \\
    k_d \\
    v_1 \\
    v_2 \\
    ... \\
    v_p \\
\end{pmatrix}
$

In this way you can compute $q$, $k$ and $v$ with a single matrix multiplication.
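
A quick way to convince yourself that the stacked product gives the same result is the sketch below. The weights are random and the sizes are assumptions.

```python
import torch

torch.manual_seed(0)
n, d, p = 6, 4, 3                              # illustrative sizes (assumption)
W_q, W_k, W_v = torch.randn(d, n), torch.randn(d, n), torch.randn(p, n)
x = torch.randn(n)

W = torch.cat([W_q, W_k, W_v], dim=0)          # stacked matrix, shape (2d + p, n)
qkv = W @ x                                    # one product, shape (2d + p,)

# Split the long vector back into its three parts
q, k, v = torch.split(qkv, [d, d, p])
```

The three slices coincide with `W_q @ x`, `W_k @ x` and `W_v @ x` computed separately.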

* **Multi-head**

This implementation of cross-attention is considered `one` head. We can compute more than one head at the same time, that is, different `attention` values and `hidden representations` for the same `input`.

Usually, we consider `h` heads. The output vector is the concatenation of the `h` vectors of $q$, $k$ and $v$:

![cross-attetion-heads-imp](./res/transformers_cross_attention_imp_heads_eq.png)

The final vector is $\in \mathbb{R}^{h(2d+p)}$



$
\begin{pmatrix}
        q^{(1)}\\\ 
        ...\\\ 
        q^{(h)}\\\ 
        k^{(1)}\\\ 
        ...\\\ 
        k^{(h)}\\\ 
        v^{(1)}\\\ 
        ...\\\ 
        v^{(h)}
\end{pmatrix} =
\begin{pmatrix}
    w_{q(1,1)}^{(1)} & w_{q(1,2)}^{(1)} & w_{q(1,3)}^{(1)} & ... & w_{q(1,n)}^{(1)}\\\ 
    w_{q(2,1)}^{(1)} & w_{q(2,2)}^{(1)} & w_{q(2,3)}^{(1)} & ... & w_{q(2,n)}^{(1)}\\\
    ... \\\ 
    w_{q(d,1)}^{(1)} & w_{q(d,2)}^{(1)} & w_{q(d,3)}^{(1)} & ... & w_{q(d,n)}^{(1)}\\\ 
    ... ... ... \\\
    w_{q(1,1)}^{(h)} & w_{q(1,2)}^{(h)} & w_{q(1,3)}^{(h)} & ... & w_{q(1,n)}^{(h)}\\\ 
    w_{q(2,1)}^{(h)} & w_{q(2,2)}^{(h)} & w_{q(2,3)}^{(h)} & ... & w_{q(2,n)}^{(h)}\\\
    ... \\\ 
    w_{q(d,1)}^{(h)} & w_{q(d,2)}^{(h)} & w_{q(d,3)}^{(h)} & ... & w_{q(d,n)}^{(h)}\\\ 
    w_{k(1,1)}^{(1)} & w_{k(1,2)}^{(1)} & w_{k(1,3)}^{(1)} & ... & w_{k(1,n)}^{(1)}\\\ 
    w_{k(2,1)}^{(1)} & w_{k(2,2)}^{(1)} & w_{k(2,3)}^{(1)} & ... & w_{k(2,n)}^{(1)}\\\
    ... \\\ 
    w_{k(d,1)}^{(1)} & w_{k(d,2)}^{(1)} & w_{k(d,3)}^{(1)} & ... & w_{k(d,n)}^{(1)}\\\
    ... ... ... \\\
    w_{k(1,1)}^{(h)} & w_{k(1,2)}^{(h)} & w_{k(1,3)}^{(h)} & ... & w_{k(1,n)}^{(h)}\\\ 
    w_{k(2,1)}^{(h)} & w_{k(2,2)}^{(h)} & w_{k(2,3)}^{(h)} & ... & w_{k(2,n)}^{(h)}\\\
    ... \\\ 
    w_{k(d,1)}^{(h)} & w_{k(d,2)}^{(h)} & w_{k(d,3)}^{(h)} & ... & w_{k(d,n)}^{(h)}\\\
    w_{v(1,1)}^{(1)} & w_{v(1,2)}^{(1)} & w_{v(1,3)}^{(1)} & ... & w_{v(1,n)}^{(1)}\\\ 
    w_{v(2,1)}^{(1)} & w_{v(2,2)}^{(1)} & w_{v(2,3)}^{(1)} & ... & w_{v(2,n)}^{(1)}\\\
    ... \\\ 
    w_{v(p,1)}^{(1)} & w_{v(p,2)}^{(1)} & w_{v(p,3)}^{(1)} & ... & w_{v(p,n)}^{(1)}\\\
    ... ... ... \\\
    w_{v(1,1)}^{(h)} & w_{v(1,2)}^{(h)} & w_{v(1,3)}^{(h)} & ... & w_{v(1,n)}^{(h)}\\\ 
    w_{v(2,1)}^{(h)} & w_{v(2,2)}^{(h)} & w_{v(2,3)}^{(h)} & ... & w_{v(2,n)}^{(h)}\\\
    ... \\\ 
    w_{v(p,1)}^{(h)} & w_{v(p,2)}^{(h)} & w_{v(p,3)}^{(h)} & ... & w_{v(p,n)}^{(h)}
\end{pmatrix}  
\begin{pmatrix}
    x_1^{'}\\\ x_2^{'}\\\ ... \\\ x_n^{'}
\end{pmatrix} =
\begin{pmatrix}
    q_1^{(1)} \\
    q_2^{(1)} \\
    ... \\
    q_d^{(1)} \\
    ....\\
    q_1^{(h)} \\
    q_2^{(h)} \\
    ... \\
    q_d^{(h)} \\
    k_1^{(1)} \\
    k_2^{(1)} \\
    ... \\
    k_d^{(1)} \\
    ....\\
    k_1^{(h)} \\
    k_2^{(h)} \\
    ... \\
    k_d^{(h)} \\
    v_1^{(1)} \\
    v_2^{(1)} \\
    ... \\
    v_p^{(1)} \\
    ....\\
    v_1^{(h)} \\
    v_2^{(h)} \\
    ... \\
    v_p^{(h)} \\
\end{pmatrix}
$

This head-long vector can be used to calculate a head-long hidden representation:

$ 
H^{heads} =
\begin{pmatrix}
    H^{(1)} \\
    H^{(2)} \\
    ... \\
    H^{(h)} \\
\end{pmatrix} = 
\begin{pmatrix}
    V^{(1)}\\\ 
    ...\\\ 
    V^{(h)}
\end{pmatrix} 
\begin{pmatrix}
    A^{(1)} = [soft](arg)max(K^{T (1)}q^{(1)}) \\\ 
    ...\\\ 
    A^{(h)}  = [soft](arg)max(K^{T (h)}q^{(h)}) \\\ 
\end{pmatrix} =
\begin{pmatrix}
    h_1^{(1)} \\
    h_2^{(1)} \\
    ... \\
    h_p^{(1)} \\
    ....\\
    h_1^{(h)} \\
    h_2^{(h)} \\
    ... \\
    h_p^{(h)} \\
\end{pmatrix} 
$
$\in \mathbb{R}^{hp}$

> Don't confuse the `h` that denotes a value of the hidden representation vector with the `h` that denotes the number of heads. 

This head-long hidden representation vector can be converted to the desired dimensionality with a linear transformation.

$
W_h
\begin{pmatrix}
    h_1^{(1)} \\
    h_2^{(1)} \\
    ... \\
    h_p^{(1)} \\
    ....\\
    h_1^{(h)} \\
    h_2^{(h)} \\
    ... \\
    h_p^{(h)} \\
\end{pmatrix}  = H^{'} $ , with $W_h \in \mathbb{R}^{p \times hp}$ and $H^{'} \in \mathbb{R}^{p}$
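
The multi-head computation for a single query can be sketched with an explicit loop over the heads. Everything here is an assumption made for illustration: random weights, two heads, and soft attention scaled by $1/\sqrt{d}$. The loop is written for readability; the practicum class computes all heads in parallel.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n, d, p, t, heads = 8, 4, 3, 5, 2            # illustrative sizes (assumption)

X = torch.randn(n, t)                        # inputs for keys and values (columns)
x = X[:, 0]                                  # input for the query

head_outputs = []
for _ in range(heads):
    # Each head has its own projection matrices
    W_q, W_k, W_v = torch.randn(d, n), torch.randn(d, n), torch.randn(p, n)
    q, K, V = W_q @ x, W_k @ X, W_v @ X
    a = F.softmax((K.T @ q) / d ** 0.5, dim=0)   # attention of this head, shape (t,)
    head_outputs.append(V @ a)                   # hidden representation, shape (p,)

h_cat = torch.cat(head_outputs)              # head-long vector, shape (heads * p,)
W_h = torch.randn(p, heads * p)
h_out = W_h @ h_cat                          # back to dimension p
```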


### Transformers

Transformers use an encoder-decoder architecture.

This is the same architecture as that of `auto-encoders`:

![auto-encoder](res/autoencoder.png)

In transformers we have something similar.

#### Transformer encoder

![transformer_encoder](res/transformers_encoder.png)

Where:
 - The `self-attention` layer is the one explained before.
 - The `1-convolution` layer is also called `feed-forward`, but in the end it is a convolutional layer with kernel size 1, that is, a linear layer applied to every element of the set.
 - The `Add, norm` layer is composed of 2 components: 1) an addition and then 2) a layer normalization.
 
 ![transformers_add_norm](./res/transformers_add_norm.png)
 
 - Both the `self-attention` and the `1-convolution` parts have a residual connection. 
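
The `Add, norm` step with its residual connection can be sketched as follows. The `sublayer` is a hypothetical stand-in (a plain linear layer) for either the self-attention or the 1-convolution block, and the elements are stored as rows, following the PyTorch convention.

```python
import torch
from torch import nn

torch.manual_seed(0)
t, d_model = 5, 8                          # illustrative sizes (assumption)

x = torch.randn(t, d_model)                # one row per element of the set
sublayer = nn.Linear(d_model, d_model)     # stand-in for self-attention or 1-convolution
layer_norm = nn.LayerNorm(d_model)

# 1) add the residual connection, 2) normalise each element over its features
out = layer_norm(x + sublayer(x))
```

With the default initialisation of `nn.LayerNorm`, every row of `out` has approximately zero mean and unit variance.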

We feed the input set ($\{x^{(i)}\}_{i=1}^{t} = \{x^{(1)}, x^{(2)}, ..., x^{(t)}\}$) to this encoder and, at its output, we get the hidden representations ($\{h^{Enc(i)}\}_{i=1}^{t} = \{h^{Enc(1)}, h^{Enc(2)}, ..., h^{Enc(t)}\}$).

#### Transformer decoder

The decoder is similar to the encoder but it includes a new part between the `self-attention` and the `1-convolution` parts: the `cross-attention` component.

![transformer_decoder](res/transformers_decoder.png)

The input ($x$) of the `cross-attention` for the keys ($k$) and values ($v$) is the hidden representation of the encoder:

$k = W_kh^{Enc(i)}$ ; $v = W_vh^{Enc(i)}$

Meanwhile, for the query ($q$) the input is the output of the `self-attention` module:

$q = W_qx^{satt}$ 

In this transformer decoder, in an auto-regressive fashion, the output of one step becomes the input of the decoder in the next step.

$\{y^{(i)}\}_{i=0}^{t-1} = \{0, h_1^{Dec}, h_2^{Dec}, h_3^{Dec}, ..., h_{t-1}^{Dec} \}$

In general, the encoder summarises the input in the hidden representation. Then the decoder queries what it requires, through $q$, from the set of representations of the encoder ($k$, $v$).
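
The decoder cross-attention described above can be sketched in the column convention of the theory section. The names `H_enc` and `X_satt`, the random data and all the sizes are assumptions for illustration; note that the encoder and decoder sequences may have different lengths.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n, d, p = 8, 4, 8                           # illustrative dimensions (assumption)
t_enc, t_dec = 6, 3                         # encoder and decoder lengths (assumption)

H_enc = torch.randn(n, t_enc)               # encoder outputs h^Enc(i), one per column
X_satt = torch.randn(n, t_dec)              # outputs of the decoder self-attention

W_q, W_k, W_v = torch.randn(d, n), torch.randn(d, n), torch.randn(p, n)

Q = W_q @ X_satt                            # queries come from the decoder, (d, t_dec)
K = W_k @ H_enc                             # keys come from the encoder,    (d, t_enc)
V = W_v @ H_enc                             # values come from the encoder,  (p, t_enc)

A = F.softmax((K.T @ Q) / d ** 0.5, dim=0)  # (t_enc, t_dec)
H = V @ A                                   # (p, t_dec): one output per decoder position
```

Each decoder position therefore receives a weighted combination of the encoder values.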

### FINAL NOTES

 - Transformers perform all the computation in a single pass (no sequential operations). In contrast, other models such as RNNs take the input set item by item and, in consequence, perform a sequence of operations. The Transformer allows you to parallelize the process: $H = VA$
 
 - In transformers, we have a problem with the size of the matrix $A \in \mathbb{R}^{t \times t}$: it has $t^2$ entries, so we have to limit `t` to keep memory and computation under control.
 
 - Recommended readings:
   - http://jalammar.github.io/illustrated-transformer/ (the matrices are laid out horizontally but the computation is the same).
   - https://distill.pub/2016/augmented-rnns/ (see the attention illustrations).
   - https://lilianweng.github.io/lil-log/2020/04/07/the-transformer-family.html (there are some errors but it's a good explanation).

## Practicum

In this notebook we are going to create a transformer encoder and then train it to classify texts. In particular, we want to classify movie reviews.


In [1]:
import torch 
from torch import nn
import torch.nn.functional as f
import numpy as np 

In [2]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
nn_Softargmax = nn.Softmax  # fix wrong name

### Multi head attention

[Theory section](#Implementation)

When we consider `h` heads, the output vector is the concatenation of the `h` vectors of $q$, $k$ and $v$:

![cross-attetion-heads-imp](res/transformers_cross_attention_imp_heads_eq.png)

In [3]:
class MultiHeadAttention(nn.Module):
    
    def __init__(self, d_model, num_heads, p, d_input=None):
        super().__init__()
        self.num_heads = num_heads
        self.d_model = d_model
        if d_input is None:
            d_xq = d_xk = d_xv = d_model
        else:
            d_xq, d_xk, d_xv = d_input
            
        # Make sure that the embedding dimension of model is a multiple of number of heads
        assert d_model % self.num_heads == 0

        self.d_k = d_model // self.num_heads
        
        # These are still of dimension d_model. They will be split into number of heads 
        # - This matrix allows us to rotate the current input (see section: #Queries,-keys-and-values)
        self.W_q = nn.Linear(d_xq, d_model, bias=False)  # q = W_q*x
        self.W_k = nn.Linear(d_xk, d_model, bias=False)  # k = W_k*x
        self.W_v = nn.Linear(d_xv, d_model, bias=False)  # v = W_v*x
        
        
        # Outputs of all sub-layers need to be of dimension d_model
        # -  (see section (in cross-attention): #Implementation)
        self.W_h = nn.Linear(d_model, d_model)
        
    def scaled_dot_product_attention(self, Q, K, V):
        """
        Attention vectorization calculation: A = softargmax(K^T*Q) and
        Hidden vectorization representation: H = V*A 
        (see section: #Cross-attention)
        """
        
        batch_size = Q.size(0) 
        k_length = K.size(-2) 
        
        # Scaling by sqrt(d_k) so that the soft(arg)max doesn't saturate
        Q = Q / np.sqrt(self.d_k)                           # (bs, n_heads, q_length, dim_per_head)
        
        # -- Compute the K - Q alignment: K^T*Q
        # (rows index the elements here, so the theory's K^T*Q is written Q*K^T)
        scores = torch.matmul(Q, K.transpose(2,3))          # (bs, n_heads, q_length, k_length)
        
        # -- Compute the attention: softargmax(K-Q alignment)
        A = nn_Softargmax(dim=-1)(scores)   # (bs, n_heads, q_length, k_length)
        
        # -- Compute the hidden representation: H = V*A (written A*V with row vectors)
        # Get the weighted average of the values
        H = torch.matmul(A, V)     # (bs, n_heads, q_length, dim_per_head)

        return H, A 

        
    def split_heads(self, x, batch_size):
        """
        Split the last dimension into (heads X depth)
        Return after transpose to put in shape (batch_size X num_heads X seq_length X d_k)
        """
        return x.view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)

    def group_heads(self, x, batch_size):
        """
        Combine the heads again to get (batch_size X seq_length X (num_heads times d_k))
        """
        return x.transpose(1, 2).contiguous().view(batch_size, -1, self.num_heads * self.d_k)
    

    def forward(self, X_q, X_k, X_v):
        batch_size, seq_length, dim = X_q.size()

        # -- Gets query Q, keys K and values V from the inputs using W_q, W_k and W_v
        # After transforming, split into num_heads 
        Q = self.split_heads(self.W_q(X_q), batch_size)  # (bs, n_heads, q_length, dim_per_head) -> Q = W_q*X_q
        K = self.split_heads(self.W_k(X_k), batch_size)  # (bs, n_heads, k_length, dim_per_head) -> K = W_k*X_k
        V = self.split_heads(self.W_v(X_v), batch_size)  # (bs, n_heads, v_length, dim_per_head) -> V = W_v*X_v
        
        # -- Calculate the attention weights and hidden representations for each of the heads 
        H_cat, A = self.scaled_dot_product_attention(Q, K, V)
        
        # Put all the heads back together by concat
        H_cat = self.group_heads(H_cat, batch_size)    # (bs, q_length, dim)
        
        # -- Collapse the h-long hidden representation vector in the model dimension using W_h
        # -  (see section (in cross-attention): #Implementation)
        # Final linear layer  
        H = self.W_h(H_cat)          # (bs, q_length, dim)
        
        return H, A

#### Some sanity checks:

Now we are going to do some tests to check if the attention process implemented in the previous class works correctly.

In [4]:
temp_mha = MultiHeadAttention(d_model=512, num_heads=8, p=0)
def print_out(Q, K, V):
    temp_out, temp_attn = temp_mha.scaled_dot_product_attention(Q, K, V)
    print('Attention weights are:', temp_attn.squeeze())
    print('Output is:', temp_out.squeeze())

To check that our attention works: if the query matches one of the keys, all the attention should be focused there, and the returned value should be the value at that index.

In [5]:
test_K = torch.tensor(
    [[10, 0, 0],
     [ 0,10, 0],
     [ 0, 0,10],
     [ 0, 0,10]]
).float()[None,None]

test_V = torch.tensor(
    [[   1,0,0],
     [  10,0,0],
     [ 100,5,0],
     [1000,6,0]]
).float()[None,None]

test_Q = torch.tensor(
    [[0, 10, 0]]
).float()[None,None]

# You can see the q is equal to the second key : [0 10 0]
#  that implies => the attention should go to the second position
#  that implies => the output should be the second value : [10 0 0]
# Of course, the output is not exactly the value (because we use soft attention) but it is very close

print_out(test_Q, test_K, test_V)

Attention weights are: tensor([3.7266e-06, 9.9999e-01, 3.7266e-06, 3.7266e-06])
Output is: tensor([1.0004e+01, 4.0993e-05, 0.0000e+00])


Great! We can see that it focuses on the second key and returns the second value. 

If we give a query that matches two keys exactly, it should return the averaged value of the two values for those two keys. 

In [6]:
test_Q = torch.tensor([[0, 0, 10]]).float()[None,None]  # add batch and head dimensions, as above

# You can see the q is equal to the third and fourth key : [0 0 10]
#  that implies => the attention should go to the third and fourth position
#  that implies => the output should be the average of both values : (1/2) ([100 5 0] + [1000 6 0]) = [550 5.5 0]

print_out(test_Q, test_K, test_V)

Attention weights are: tensor([1.8633e-06, 1.8633e-06, 5.0000e-01, 5.0000e-01])
Output is: tensor([549.9979,   5.5000,   0.0000])


We see that it focuses equally on the third and fourth key and returns the average of their values.

Now giving all the queries at the same time:

In [7]:
test_Q = torch.tensor(
    [[0, 0, 10], [0, 10, 0], [10, 10, 0]]
).float()[None,None]

# Vectorization of the q's
#   we obtain the output for 3 queries in only one iteration

print_out(test_Q, test_K, test_V)

Attention weights are: tensor([[1.8633e-06, 1.8633e-06, 5.0000e-01, 5.0000e-01],
        [3.7266e-06, 9.9999e-01, 3.7266e-06, 3.7266e-06],
        [5.0000e-01, 5.0000e-01, 1.8633e-06, 1.8633e-06]])
Output is: tensor([[5.5000e+02, 5.5000e+00, 0.0000e+00],
        [1.0004e+01, 4.0993e-05, 0.0000e+00],
        [5.5020e+00, 2.0497e-05, 0.0000e+00]])


### 1D convolution with `kernel_size = 1`

This is basically an [MLP](https://en.wikipedia.org/wiki/Multilayer_perceptron) with one hidden layer and ReLU activation applied to each and every element in the set.
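
The equivalence between a convolution with `kernel_size = 1` and a linear layer applied to every element can be checked with a short sketch. The sizes are arbitrary assumptions, and the weights of the linear layer are copied into the convolution so that both compute the same function.

```python
import torch
from torch import nn

torch.manual_seed(0)
batch, seq_len, d_in, d_out = 2, 5, 8, 16      # illustrative sizes (assumption)

linear = nn.Linear(d_in, d_out)
conv = nn.Conv1d(d_in, d_out, kernel_size=1)

# Copy the linear weights into the convolution: (d_out, d_in) -> (d_out, d_in, 1)
with torch.no_grad():
    conv.weight.copy_(linear.weight.unsqueeze(-1))
    conv.bias.copy_(linear.bias)

x = torch.randn(batch, seq_len, d_in)

out_linear = linear(x)                                  # (batch, seq_len, d_out)
# nn.Conv1d expects (batch, channels, length), hence the transposes
out_conv = conv(x.transpose(1, 2)).transpose(1, 2)      # (batch, seq_len, d_out)
```

Both outputs agree up to floating point error, which is why the class below can use `nn.Linear` layers.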

In [8]:
class CNN(nn.Module):
    def __init__(self, d_model, hidden_dim, p):
        super().__init__()
        self.k1convL1 = nn.Linear(d_model,    hidden_dim)
        self.k1convL2 = nn.Linear(hidden_dim, d_model)
        self.activation = nn.ReLU()

    def forward(self, x):
        x = self.k1convL1(x)
        x = self.activation(x)
        x = self.k1convL2(x)
        return x

### Transformer encoder

Now we have all components for our Transformer Encoder block shown below!!!!


![transformer_encoder](res/transformers_encoder.png)

In [9]:
class EncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, conv_hidden_dim, p=0.1):
        super().__init__()

        self.mha = MultiHeadAttention(d_model, num_heads, p) # Self-attention = Cross-attention with q=k=v=x
        self.cnn = CNN(d_model, conv_hidden_dim, p)

        self.layernorm1 = nn.LayerNorm(normalized_shape=d_model, eps=1e-6)
        self.layernorm2 = nn.LayerNorm(normalized_shape=d_model, eps=1e-6)
    
    def forward(self, x):
        
        # Self attention 
        attn_output, _ = self.mha(x, x, x)  # (batch_size, input_seq_len, d_model)
        
        # Layer norm after adding the residual connection 
        out1 = self.layernorm1(x + attn_output)  # (batch_size, input_seq_len, d_model)
        
        # Feed forward 
        cnn_output = self.cnn(out1)  # (batch_size, input_seq_len, d_model)
        
        #Second layer norm after adding residual connection 
        out2 = self.layernorm2(out1 + cnn_output)  # (batch_size, input_seq_len, d_model)

        return out2

#### Encoder 

##### Blocks of N Encoder Layers + Positional encoding + Input embedding

We are going to use the encoder for text classification. 

As input we have a set of inputs $\{x^{(i)}\}_{i=1}^t = \{x^{(1)},x^{(2)}, ... x^{(t)}\}$, where $x^{(i)} \in \mathbb{R}^{n}$

![input-repres](res/transformers_xinput_embbd.png)

The transofmer encoder and attention are permutation equivariant. In other words, they don't use the order of the inputs as feature. You can change the order of the input set and the result is gonna be the same.

In contrast, to classify sentences, the position $i$ of the input $x^{(i)}$ matters. We need to include information about the position each input item occupies. 
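
The permutation equivariance described above can be checked numerically. This is a minimal sketch, not the notebook's `MultiHeadAttention`: `self_attention` is a hypothetical helper with no learned weights and q = k = v = x, and it stores the inputs as rows (the transpose of the $X$ matrix used in the theory part).

```python
import torch

torch.manual_seed(0)

def self_attention(x):
    """Plain single-head self-attention with q = k = v = x (no learned weights).

    x: (t, n) tensor holding t inputs of n features, one per row.
    """
    scores = x @ x.T / x.shape[-1] ** 0.5    # (t, t) scaled dot products
    a = torch.softmax(scores, dim=-1)        # each row sums to 1
    return a @ x                             # (t, n) hidden representations

t, n = 5, 4
x = torch.randn(t, n)
perm = torch.randperm(t)

h = self_attention(x)
h_perm = self_attention(x[perm])

# Equivariance: attention on the permuted inputs equals the permuted attention output
equivariant = torch.allclose(h_perm, h[perm], atol=1e-5)
print(equivariant)
```

Since a max over the sequence (as used later in the classifier) ignores order completely, without positional information the model could not tell a sentence from a shuffled copy of it.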

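Self-attention by itself has no recurrence or convolutions, so to make it sensitive to position we must provide additional positional encodings. For position $p$, feature-pair index $i$ (here $i$ indexes features, not inputs) and embedding dimension $d$, these are calculated as follows: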

\begin{aligned}
E(p, 2i)    &= \sin(p / 10000^{2i / d}) \\
E(p, 2i+1) &= \cos(p / 10000^{2i / d})
\end{aligned}

@alfredo: the theoretical concept is the important thing; how we include the positional encoding is only a technicality.
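
To make the formula concrete, here is a small NumPy sketch of the same table. `sinusoidal_table` is a hypothetical helper written for this illustration, and the sizes (50 positions, 8 features) are arbitrary.

```python
import numpy as np

def sinusoidal_table(nb_p, d):
    """Return a (nb_p, d) array E with E[p, 2i] = sin(.) and E[p, 2i+1] = cos(.)."""
    p = np.arange(nb_p)[:, None]            # positions, shape (nb_p, 1)
    j = np.arange(d)[None, :]               # feature indices, shape (1, d)
    # Features 2i and 2i+1 share the same frequency, hence j // 2
    theta = p / np.power(10000.0, 2 * (j // 2) / d)
    E = np.empty((nb_p, d))
    E[:, 0::2] = np.sin(theta[:, 0::2])     # even features
    E[:, 1::2] = np.cos(theta[:, 1::2])     # odd features
    return E

E = sinusoidal_table(nb_p=50, d=8)
print(E.shape)   # (50, 8)
print(E[0])      # position 0: sin(0) = 0 on even features, cos(0) = 1 on odd ones
```

Every position gets a different row, and all values stay in $[-1, 1]$, so they can be added to the word embeddings without dominating them.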


In [10]:
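def create_sinusoidal_embeddings(nb_p, dim, E):
    # theta[p, j] = p / 10000^(2 * (j // 2) / dim)
    theta = np.array([
        [p / np.power(10000, 2 * (j // 2) / dim) for j in range(dim)]
        for p in range(nb_p)
    ])
    # The table is fixed, not learned: disable gradients before writing in place
    # (recent PyTorch versions refuse in-place writes to a leaf that requires grad)
    E.requires_grad = False
    E[:, 0::2] = torch.FloatTensor(np.sin(theta[:, 0::2]))
    E[:, 1::2] = torch.FloatTensor(np.cos(theta[:, 1::2]))
    E.detach_()
    # No E.to(device) here: it would only rebind the local name; model.to(device) moves the weights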

class Embeddings(nn.Module):
    def __init__(self, d_model, vocab_size, max_position_embeddings, p):
        super().__init__()
        self.word_embeddings = nn.Embedding(vocab_size, d_model, padding_idx=1)
        self.position_embeddings = nn.Embedding(max_position_embeddings, d_model)
        create_sinusoidal_embeddings(
            nb_p=max_position_embeddings,
            dim=d_model,
            E=self.position_embeddings.weight
        )

        self.LayerNorm = nn.LayerNorm(d_model, eps=1e-12)

    def forward(self, input_ids):
        seq_length = input_ids.size(1)
        position_ids = torch.arange(seq_length, dtype=torch.long, device=input_ids.device) # (max_seq_length)
        position_ids = position_ids.unsqueeze(0).expand_as(input_ids)                      # (bs, max_seq_length)
        
        # Get word embeddings for each input id
        word_embeddings = self.word_embeddings(input_ids)                   # (bs, max_seq_length, dim)
        
        # Get position embeddings for each position id 
        position_embeddings = self.position_embeddings(position_ids)        # (bs, max_seq_length, dim)
        
        # Add them both 
        embeddings = word_embeddings + position_embeddings  # (bs, max_seq_length, dim)
        
        # Layer norm 
        embeddings = self.LayerNorm(embeddings)             # (bs, max_seq_length, dim)
        return embeddings
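
The position-id bookkeeping in `forward` is easier to follow on a tiny example. This is a sketch with made-up token ids and sizes; `word_emb` and `pos_emb` are stand-ins for the two embedding tables of the class above (the position table here is a plain learned one, not the sinusoidal one).

```python
import torch
import torch.nn as nn

input_ids = torch.tensor([[5, 8, 2, 1],
                          [7, 3, 1, 1]])        # (bs=2, seq_len=4), 1 is the padding id

seq_length = input_ids.size(1)
position_ids = torch.arange(seq_length, dtype=torch.long)      # tensor([0, 1, 2, 3])
position_ids = position_ids.unsqueeze(0).expand_as(input_ids)  # (2, 4), same row for every sentence

word_emb = nn.Embedding(10, 6, padding_idx=1)   # vocabulary of 10 tokens, 6 features
pos_emb = nn.Embedding(4, 6)                    # one vector per position

embeddings = word_emb(input_ids) + pos_emb(position_ids)       # (2, 4, 6)
print(position_ids)
print(embeddings.shape)   # torch.Size([2, 4, 6])
```

Note that `padding_idx=1` keeps the word vector of the padding token at zero, so a padded position carries only its positional vector.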

#### Deep Neural Network

We can stack several Transformer encoder layers, one after another, to get a deep neural network.

In [11]:
class Encoder(nn.Module):
    def __init__(self, num_layers, d_model, num_heads, ff_hidden_dim, input_vocab_size,
               maximum_position_encoding, p=0.1):
        super().__init__()

        self.d_model = d_model
        self.num_layers = num_layers

        self.embedding = Embeddings(d_model, input_vocab_size,maximum_position_encoding, p)

        # Stack several encoder layers, one after another, to get a deep neural network
        self.enc_layers = nn.ModuleList()
        for _ in range(num_layers):
            self.enc_layers.append(EncoderLayer(d_model, num_heads, ff_hidden_dim, p))
        
    def forward(self, x):
        x = self.embedding(x) # Transform to (batch_size, input_seq_length, d_model)

        # Feed x through the encoder layers, one after another
        for i in range(self.num_layers):
            x = self.enc_layers[i](x)

        return x  # (batch_size, input_seq_len, d_model)
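
The layers are kept in an `nn.ModuleList` rather than a plain Python list so that their parameters are registered in the model (and therefore seen by the optimizer and by `.to(device)`). A minimal sketch of the same stacking pattern, where `Stack` is a hypothetical class that uses `nn.Linear` layers in place of `EncoderLayer`:

```python
import torch
import torch.nn as nn

class Stack(nn.Module):
    def __init__(self, num_layers, d_model):
        super().__init__()
        # ModuleList registers every layer as a sub-module
        self.layers = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(num_layers)])

    def forward(self, x):
        for layer in self.layers:    # output of one layer is the input of the next
            x = layer(x)
        return x

stack = Stack(num_layers=3, d_model=8)
n_params = sum(p.numel() for p in stack.parameters())
y = stack(torch.randn(2, 5, 8))

print(n_params)   # 216 = 3 layers * (8*8 weights + 8 biases)
print(y.shape)    # torch.Size([2, 5, 8])
```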

#### Train network to classify texts

In [12]:
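# Note: Field, LabelField and BucketIterator belong to the legacy torchtext API
# (moved to torchtext.legacy in torchtext 0.9 and removed in 0.12), so this cell needs an old torchtext version
import torchtext.data as data
import torchtext.datasets as datasets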

* IMDB dataset

In [13]:
max_len = 200
text = data.Field(sequential=True, fix_length=max_len, batch_first=True, lower=True, dtype=torch.long)
label = data.LabelField(sequential=False, dtype=torch.long)
datasets.IMDB.download('./')
ds_train, ds_test = datasets.IMDB.splits(text, label, path='./imdb/aclImdb/')
print('train : ', len(ds_train))
print('test : ', len(ds_test))
print('train.fields :', ds_train.fields)



train :  25000
test :  25000
train.fields : {'text': <torchtext.data.field.Field object at 0x7f3d4ecaaf70>, 'label': <torchtext.data.field.LabelField object at 0x7f3d4ecaafd0>}


In [14]:
ds_train, ds_valid = ds_train.split(0.9)
print('train : ', len(ds_train))
print('valid : ', len(ds_valid))
print('test : ', len(ds_test))

train :  22500
valid :  2500
test :  25000


* Dataset Loaders

In [15]:
num_words = 50_000
text.build_vocab(ds_train, max_size=num_words)
label.build_vocab(ds_train)
vocab = text.vocab

In [16]:
batch_size = 164
train_loader, valid_loader, test_loader = data.BucketIterator.splits(
    (ds_train, ds_valid, ds_test), batch_size=batch_size, sort_key=lambda x: len(x.text), repeat=False)



* Composed network = stack of encoder layers + final layer for classification

In [17]:
class TransformerClassifier(nn.Module):
    def __init__(self, num_layers, d_model, num_heads, conv_hidden_dim, input_vocab_size, num_answers):
        super().__init__()
        
        self.encoder = Encoder(num_layers, d_model, num_heads, conv_hidden_dim, input_vocab_size,
                         maximum_position_encoding=10000)
        self.dense = nn.Linear(d_model, num_answers)

    def forward(self, x):
        x = self.encoder(x)
        
        # Max-pool over the sequence: (batch_size, seq_len, d_model) -> (batch_size, d_model)
        x, _ = torch.max(x, dim=1)
        x = self.dense(x)
        return x
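
The `torch.max(x, dim=1)` call in `forward` turns the whole sequence into a single vector by keeping, for each feature, its largest value over all positions. A tiny sketch with made-up numbers:

```python
import torch

x = torch.tensor([[[1.0, -2.0],
                   [0.5,  3.0],
                   [4.0,  0.0]]])          # (batch_size=1, seq_len=3, d_model=2)

pooled, indices = torch.max(x, dim=1)       # reduce over the sequence dimension
print(pooled)    # tensor([[4., 3.]])
print(indices)   # tensor([[2, 1]])  position where each maximum was found
```

The classifier only uses the pooled values; the indices are discarded (the `_` in the code above).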

In [18]:
model = TransformerClassifier(num_layers=1, d_model=32, num_heads=2, 
                         conv_hidden_dim=128, input_vocab_size=50002, num_answers=2)
model.to(device)

TransformerClassifier(
  (encoder): Encoder(
    (embedding): Embeddings(
      (word_embeddings): Embedding(50002, 32, padding_idx=1)
      (position_embeddings): Embedding(10000, 32)
      (LayerNorm): LayerNorm((32,), eps=1e-12, elementwise_affine=True)
    )
    (enc_layers): ModuleList(
      (0): EncoderLayer(
        (mha): MultiHeadAttention(
          (W_q): Linear(in_features=32, out_features=32, bias=False)
          (W_k): Linear(in_features=32, out_features=32, bias=False)
          (W_v): Linear(in_features=32, out_features=32, bias=False)
          (W_h): Linear(in_features=32, out_features=32, bias=True)
        )
        (cnn): CNN(
          (k1convL1): Linear(in_features=32, out_features=128, bias=True)
          (k1convL2): Linear(in_features=128, out_features=32, bias=True)
          (activation): ReLU()
        )
        (layernorm1): LayerNorm((32,), eps=1e-06, elementwise_affine=True)
        (layernorm2): LayerNorm((32,), eps=1e-06, elementwise_affine=True)
   

* Train


Every training loop has to include the following steps:

 1. Forward pass: `output = model(data)`
 2. Compute the loss: `loss = criterion(output, target)`
 3. Clear the gradient buffers: `optimizer.zero_grad()`
 4. Compute the gradient (the partial derivatives of the loss with respect to the network parameters): `loss.backward()`
 5. Perform a training step (a step in the direction opposite to the gradient): `optimizer.step()`

If any of these steps is missing, the training will go wrong!
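
The five steps above can be sketched on a toy problem before applying them to the Transformer. Everything here is made up for illustration: the data is random, and `toy_model` / `toy_optimizer` are hypothetical names chosen so they do not overwrite the notebook's `model` and `optimizer`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Toy data: 64 points with 10 features; the label depends only on the sign of the first feature
toy_data = torch.randn(64, 10)
toy_target = (toy_data[:, 0] > 0).long()

toy_model = nn.Linear(10, 2)
toy_optimizer = torch.optim.AdamW(toy_model.parameters(), lr=0.1)

toy_losses = []
for step in range(50):
    output = toy_model(toy_data)                  # 1. forward pass
    loss = F.cross_entropy(output, toy_target)    # 2. loss
    toy_optimizer.zero_grad()                     # 3. clear old gradients
    loss.backward()                               # 4. compute gradients
    toy_optimizer.step()                          # 5. update the parameters
    toy_losses.append(loss.item())

print(toy_losses[0], toy_losses[-1])              # the loss should go down
```

If step 3 is skipped, the gradients of all previous iterations accumulate in the buffers, which is a common reason for a loss that does not decrease.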

In [19]:
optimizer = torch.optim.AdamW(model.parameters(), lr=0.001)
epochs = 10
t_total = len(train_loader) * epochs

In [20]:
def train(train_loader, valid_loader):
    
    for epoch in range(epochs):
        train_iterator, valid_iterator = iter(train_loader), iter(valid_loader)
        nb_batches_train = len(train_loader)
        train_acc = 0
        model.train()
        losses = 0.0

        for batch in train_iterator:
            x = batch.text.to(device)
            y = batch.label.to(device)
            
            # Perform the forward pass of the model
            out = model(x)  # ①

            # Get loss
            loss = f.cross_entropy(out, y)  # ②
            
            # Clear gradient buffers
            model.zero_grad()  # ③

            # Calculate gradient 
            loss.backward()  # ④
            losses += loss.item()

            # Perform training step 
            optimizer.step()  # ⑤
                        
            train_acc += (out.argmax(1) == y).cpu().numpy().mean()
        
        print(f"Training loss at epoch {epoch} is {losses / nb_batches_train}")
        print(f"Training accuracy: {train_acc / nb_batches_train}")
        print('Evaluating on validation:')
        evaluate(valid_loader)

In [21]:
def evaluate(data_loader):
    data_iterator = iter(data_loader)
    nb_batches = len(data_loader)
    model.eval()
    acc = 0 
    # No gradients are needed for evaluation, so skip building the autograd graph
    with torch.no_grad():
        for batch in data_iterator:
            x = batch.text.to(device)
            y = batch.label.to(device)

            out = model(x)
            acc += (out.argmax(1) == y).cpu().numpy().mean()

    print(f"Eval accuracy: {acc / nb_batches}")
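
The accuracy of one batch is the fraction of rows whose largest logit matches the label. A small sketch with made-up logits and labels (`toy_out` and `toy_y` are hypothetical names):

```python
import torch

toy_out = torch.tensor([[2.0, 0.1],
                        [0.3, 1.5],
                        [0.9, 0.2],
                        [0.1, 0.4]])     # logits, (batch_size=4, num_answers=2)
toy_y = torch.tensor([0, 1, 1, 1])       # labels

pred = toy_out.argmax(1)                 # tensor([0, 1, 0, 1])
batch_acc = (pred == toy_y).float().mean().item()
print(batch_acc)                         # 0.75
```

`evaluate` averages these per-batch values, so if the last batch is smaller than the others the result is a close approximation of the per-example accuracy rather than the exact figure.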

Run training

In [22]:
train(train_loader, valid_loader)



Training loss at epoch 0 is 0.6701715005480725
Training accuracy: 0.5873265729939904
Evaluating on validation:
Eval accuracy: 0.6432545731707316
Training loss at epoch 1 is 0.5970086343046548
Training accuracy: 0.6884389360197948
Evaluating on validation:
Eval accuracy: 0.7049542682926829
Training loss at epoch 2 is 0.5117290296416351
Training accuracy: 0.7567161541180628
Evaluating on validation:
Eval accuracy: 0.7530868902439024
Training loss at epoch 3 is 0.42599487995755847
Training accuracy: 0.8094456963591377
Evaluating on validation:
Eval accuracy: 0.7768673780487806
Training loss at epoch 4 is 0.3614137342226678
Training accuracy: 0.8439322640509015
Evaluating on validation:
Eval accuracy: 0.7990853658536585
Training loss at epoch 5 is 0.3108235130059546
Training accuracy: 0.8714927978084124
Evaluating on validation:
Eval accuracy: 0.8073932926829268
Training loss at epoch 6 is 0.2587649969087131
Training accuracy: 0.8969158713326268
Evaluating on validation:
Eval accuracy: 0.8

Run evaluation

In [23]:
evaluate(test_loader)

Eval accuracy: 0.8041274775492854
