<a href="https://colab.research.google.com/github/stefanoiervese/DL_Project/blob/main/Transformer_con_Pytorch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import torch
import torch.nn as nn

Self-attention is built on three vectors: query, key, and value.
Given an input sentence of 4 words, we create the three vectors (query, key, value) for each word.

Multiplying the queries by key^T gives a matrix of scores.
It tells us how much attention each word should pay to every other word in the sentence.

Each element of the score matrix is then divided by the square root of the dimension of the query (or key) vector, giving the scaled scores matrix. A softmax is then applied to each row, so that every weight lies between 0 and 1 and each row sums to 1; the result is the attention matrix.

The attention matrix is multiplied by the value vectors to obtain one output vector per word.
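
The steps above can be sketched on a toy sentence of 4 words. This is a minimal illustration, not the notebook's implementation: the sizes and the random `Q`, `K`, `V` tensors are assumptions chosen only to make the shapes visible.

```python
import torch

torch.manual_seed(0)

seq_len, d_k = 4, 8            # 4 words, vectors of size 8 (illustrative sizes)
Q = torch.randn(seq_len, d_k)  # one query vector per word
K = torch.randn(seq_len, d_k)  # one key vector per word
V = torch.randn(seq_len, d_k)  # one value vector per word

scores = Q @ K.T                                   # (4, 4) score matrix
scaled_scores = scores / d_k ** 0.5                # divide by sqrt(d_k)
attention = torch.softmax(scaled_scores, dim=-1)   # each row sums to 1
output = attention @ V                             # (4, 8): one output vector per word

print(attention.shape, output.shape)
print(attention.sum(dim=-1))   # every row of the attention matrix sums to 1
```

Row i of `attention` holds the weights that word i assigns to each of the 4 words, and row i of `output` is the corresponding weighted average of the value vectors.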

The output then passes through a linear layer for further processing.

With multi-head attention using N heads, this procedure is carried out N times in parallel, once per head.
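
One common way to run the heads in parallel is to split each embedding into `heads` equal slices with a reshape, so that every head works on its own slice. The sketch below assumes illustrative sizes (`embed_size = 8`, `heads = 2`) and only shows the split and the merge, not the attention itself.

```python
import torch

N, seq_len, embed_size, heads = 2, 4, 8, 2   # illustrative sizes
head_dim = embed_size // heads

x = torch.randn(N, seq_len, embed_size)

# Split the last axis into (heads, head_dim); the values keep their order.
x_heads = x.reshape(N, seq_len, heads, head_dim)
print(x_heads.shape)   # torch.Size([2, 4, 2, 4])

# Head 0 sees the first head_dim features of every word, head 1 the remaining ones.
print(torch.equal(x_heads[:, :, 0, :], x[:, :, :head_dim]))   # True
print(torch.equal(x_heads[:, :, 1, :], x[:, :, head_dim:]))   # True

# Merging the heads back recovers the original embedding.
x_merged = x_heads.reshape(N, seq_len, heads * head_dim)
print(torch.equal(x_merged, x))   # True
```

This is why `embed_size` has to be divisible by the number of heads: otherwise the slices would not have equal size.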

The final output describes how each word relates to every other word.

Finally, the output of the multi-head attention layer is added to its original input (a residual connection).

The result of this sum is then fed into a linear layer, and that layer's output is again added to its own input, as shown in the figure.


![testo del link](https://machinelearningmastery.com/wp-content/uploads/2021/08/attention_research_1.png)
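
The two sums described above can be sketched as follows. This is a simplified illustration under stated assumptions: `attention_out` is a random stand-in for the multi-head attention output, `linear` is a single hypothetical `nn.Linear`, and normalization and dropout are left out.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embed_size = 8                               # illustrative size
x = torch.randn(2, 4, embed_size)            # (batch, words, embed_size)

attention_out = torch.randn_like(x)          # stand-in for the multi-head attention output
linear = nn.Linear(embed_size, embed_size)   # stand-in for the linear layer

h = x + attention_out     # first sum: attention output + original input
out = h + linear(h)       # second sum: linear output + its own input

print(out.shape)          # same shape as the input
```

Both sums require the two operands to have the same shape, which is why the attention layer and the linear layer map `embed_size` features back to `embed_size` features.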



In [None]:
class SelfAttention(nn.Module):
  def __init__(self, embed_size, heads):
      super(SelfAttention, self).__init__()
      self.embed_size = embed_size
      self.heads = heads
      self.head_dim = embed_size //heads

      assert (self.head_dim * heads == embed_size), "Embed size must be divisible by heads"


      # Self-attention is built on three vectors: query, key, and value.
      # For an input sentence of 4 words we create the three vectors (query, key, value) for each word.

      # nn.Linear(input dimension, output dimension)
      self.values = nn.Linear(self.head_dim, self.head_dim, bias=False)
      self.keys = nn.Linear(self.head_dim, self.head_dim, bias=False)
      self.queries = nn.Linear(self.head_dim, self.head_dim, bias=False)
      self.fc_out = nn.Linear(heads*self.head_dim, embed_size)


  def forward(self, values, keys, query, mask):
    N = query.shape[0]
    value_len, key_len, query_len= values.shape[1], keys.shape[1], query.shape[1]

    # Split the embedding into self.heads pieces

    # reshape turns each (N, len, embed_size) input tensor into (N, len, self.heads, self.head_dim)
    values= values.reshape(N, value_len, self.heads, self.head_dim)
    keys= keys.reshape(N, key_len, self.heads, self.head_dim)
    queries = query.reshape(N, query_len, self.heads, self.head_dim)

    values =self.values(values)
    keys =self.keys(keys)
    queries =self.queries(queries)

    # The arrow in an einsum equation must be the ASCII "->"; a Unicode arrow raises an error.
    # einsum computes the product; this line builds the score matrix, which we call energy.
    energy = torch.einsum("nqhd,nkhd->nhqk", [queries, keys])
    # queries shape: (N, query_len, heads, head_dim)
    # keys shape: (N, key_len, heads, head_dim)
    # energy shape: (N, heads, query_len, key_len)

    # Apply the mask. In the decoder it is a lower-triangular matrix that lets each word
    # attend only to itself and to the words before it, never to the words that follow.
    # If this is unclear, rewatch the transformer explanation video.
    if mask is not None:
        energy = energy.masked_fill(mask == 0, float("-1e20"))

    # Attention follows the formula Attention(Q,K,V) = softmax(Q*K^T/(d_k)^(1/2))V, where d_k is head_dim
    attention = torch.softmax(energy / (self.head_dim ** (1/2)), dim=3)

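    # einsum multiplies the attention weights by the values, summing over the key/value axis l
    out = torch.einsum("nhql,nlhd->nqhd", [attention, values]).reshape(
        N, query_len, self.heads * self.head_dim
    )
    # attention shape: (N, heads, query_len, key_len)
    # values shape: (N, value_len, heads, head_dim)
    # einsum output shape: (N, query_len, heads, head_dim), then flattened to (N, query_len, embed_size)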

    out= self.fc_out(out)
    return out
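
The two einsum contractions used in the forward pass can be checked against explicit permutes and batched matrix products. The sketch below is self-contained and uses random tensors with illustrative sizes; the einsum equations are written with the ASCII `->` arrow and a shared index for the key/value length.

```python
import torch

torch.manual_seed(0)
N, q_len, k_len, heads, head_dim = 2, 3, 5, 4, 8   # illustrative sizes
queries = torch.randn(N, q_len, heads, head_dim)
keys = torch.randn(N, k_len, heads, head_dim)
values = torch.randn(N, k_len, heads, head_dim)

# Scores via einsum: sum over the feature axis d for every (query, key) pair.
energy = torch.einsum("nqhd,nkhd->nhqk", queries, keys)

# Same computation with explicit permutes and a batched matrix product.
q = queries.permute(0, 2, 1, 3)        # (N, heads, q_len, head_dim)
k = keys.permute(0, 2, 3, 1)           # (N, heads, head_dim, k_len)
energy_matmul = q @ k                  # (N, heads, q_len, k_len)

attention = torch.softmax(energy / head_dim ** 0.5, dim=3)

# Weighted sum of the values: contract the key axis of attention with the length axis of values.
out = torch.einsum("nhql,nlhd->nqhd", attention, values)
v = values.permute(0, 2, 1, 3)                       # (N, heads, k_len, head_dim)
out_matmul = (attention @ v).permute(0, 2, 1, 3)     # (N, q_len, heads, head_dim)

print(torch.allclose(energy, energy_matmul, atol=1e-5))   # True
print(torch.allclose(out, out_matmul, atol=1e-5))         # True
```

The einsum form avoids the manual permutes, at the cost of having to keep the index letters consistent between the two operands.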

In [None]:
class TransformerBlock(nn.Module):
  def __init__(self, embed_size, heads, dropout, forward_expansion):
    super(TransformerBlock, self).__init__()
    self.attention = SelfAttention(embed_size, heads)
    self.norm1 = nn.LayerNorm(embed_size)
    self.norm2 = nn.LayerNorm(embed_size)

    self.feed_forward = nn.Sequential(
        nn.Linear(embed_size, forward_expansion*embed_size),
        nn.ReLU(),
        nn.Linear(forward_expansion*embed_size, embed_size)
    )
    self.dropout = nn.Dropout(dropout)

  def forward(self, value, key, query, mask):
    attention = self.attention(value, key, query, mask)

    x = self.dropout(self.norm1(attention + query))
    forward = self.feed_forward(x)
    out = self.dropout(self.norm2(forward + x))
    return out
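
The feed-forward part of the block can be looked at in isolation. This is a minimal sketch with illustrative sizes and without dropout: the first linear layer widens each word vector by `forward_expansion`, the second projects it back, and `nn.LayerNorm` normalizes the residual sum over the feature axis.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embed_size, forward_expansion = 8, 4     # illustrative sizes
x = torch.randn(2, 3, embed_size)        # (batch, words, embed_size)

feed_forward = nn.Sequential(
    nn.Linear(embed_size, forward_expansion * embed_size),   # widen 8 -> 32 features
    nn.ReLU(),
    nn.Linear(forward_expansion * embed_size, embed_size),   # project back 32 -> 8
)
norm = nn.LayerNorm(embed_size)

out = norm(feed_forward(x) + x)          # residual sum, then layer normalization

print(out.shape)                         # same shape as the input
print(out.mean(dim=-1).abs().max())      # close to 0: each word vector is re-centred
```

Because the output has the same shape as the input, several such blocks can be stacked one after another.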

In [None]:
class Encoder(nn.Module):
  def __init__(
      self,
      src_vocab_size,
      embed_size,
      num_layers,
      heads,
      device,
      forward_expansion,
      dropout,
      max_length,
  ):
    super(Encoder, self).__init__()
    self.embed_size = embed_size
    self.device = device
    self.word_embedding = nn.Embedding(src_vocab_size, embed_size)
    self.position_embedding = nn.Embedding(max_length, embed_size)

    self.layers = nn.ModuleList(
        [
            TransformerBlock(
                embed_size,
                heads,
                dropout = dropout,
                forward_expansion = forward_expansion
            )
        for _ in range(num_layers)]
    )
    self.dropout = nn.Dropout(dropout)

  def forward(self, x, mask):
    N, seq_length = x.shape
    positions = torch.arange(0 , seq_length).expand(N, seq_length).to(self.device)

    out = self.dropout(self.word_embedding(x) + self.position_embedding(positions))

    for layer in self.layers:
      out = layer(out, out, out , mask)

    return out
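
The position handling in the encoder's forward pass can be tried on a tiny batch. The vocabulary size, `max_length`, and token ids below are made-up values for illustration only.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, max_length, embed_size = 10, 6, 4   # illustrative sizes
word_embedding = nn.Embedding(vocab_size, embed_size)
position_embedding = nn.Embedding(max_length, embed_size)

x = torch.tensor([[1, 5, 6, 4], [1, 8, 7, 3]])  # (N=2, seq_length=4) token ids
N, seq_length = x.shape

# One row of position indices 0..seq_length-1 per sentence in the batch.
positions = torch.arange(0, seq_length).expand(N, seq_length)
print(positions)
# tensor([[0, 1, 2, 3],
#         [0, 1, 2, 3]])

# Each token vector is the sum of a word embedding and a learned position embedding.
out = word_embedding(x) + position_embedding(positions)
print(out.shape)   # torch.Size([2, 4, 4])
```

Since the position table has only `max_length` rows, a sequence longer than `max_length` would raise an index error in `position_embedding`.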



In [None]:
class DecoderBlock(nn.Module):
  def __init__(self, embed_size, heads, forward_expansion, dropout, device):
    super(DecoderBlock, self).__init__()
    self.attention = SelfAttention(embed_size, heads)
    self.norm = nn.LayerNorm(embed_size)
    self.transformer_block = TransformerBlock(
        embed_size, heads, dropout, forward_expansion
    )
    self.dropout = nn.Dropout(dropout)

  def forward(self, x, value, key, src_mask, trg_mask):
     attention = self.attention(x, x, x, trg_mask)
     query = self.dropout(self.norm(attention + x))
     out = self.transformer_block(value, key, query, src_mask)
     return out


In [None]:
class Decoder(nn.Module):
  def __init__(self,
               trg_vocab_size,
               embed_size,
               num_layers,
               heads,
               forward_expansion,
               dropout,
               device,
               max_length
               ):
    super(Decoder, self).__init__()
    self.device = device
    self.word_embedding = nn.Embedding(trg_vocab_size, embed_size)
    self.position_embedding = nn.Embedding(max_length, embed_size)

    self.layers = nn.ModuleList(
        [DecoderBlock(embed_size, heads, forward_expansion, dropout, device)
        for _ in range(num_layers)]

    )

    self.fc_out = nn.Linear(embed_size, trg_vocab_size)
    self.dropout = nn.Dropout(dropout)

  def forward(self, x, enc_out, src_mask, trg_mask):
    N, seq_length = x.shape
    positions = torch.arange(0, seq_length).expand(N, seq_length).to(self.device)
    x = self.dropout((self.word_embedding(x)+ self.position_embedding(positions)))

    for layer in self.layers:
      x= layer(x, enc_out, enc_out, src_mask, trg_mask)

    out = self.fc_out(x)
    return out



In [None]:
class Transformer(nn.Module):
  def __init__(
      self,
      src_vocab_size,
      trg_vocab_size,
      src_pad_idx,
      trg_pad_idx,
      embed_size=256,
      num_layers=6,
      forward_expansion=4,
      heads=8,
      dropout=0,
      device="cuda",
      max_length=100
  ):
    super(Transformer, self).__init__()

    self.encoder =Encoder(
        src_vocab_size,
        embed_size,
        num_layers,
        heads,
        device,
        forward_expansion,
        dropout,
        max_length
    )

    self.decoder= Decoder(
        trg_vocab_size,
        embed_size,
        num_layers,
        heads,
        forward_expansion,
        dropout,
        device,
        max_length
    )

    self.src_pad_idx = src_pad_idx
    self.trg_pad_idx = trg_pad_idx
    self.device = device

  def make_src_mask(self, src):
    src_mask = (src!= self.src_pad_idx).unsqueeze(1).unsqueeze(2)
    return src_mask.to(self.device)

  def make_trg_mask(self, trg):
    N, trg_len = trg.shape
    trg_mask = torch.tril(torch.ones((trg_len, trg_len))).expand(
        N, 1, trg_len, trg_len
    )
    return trg_mask.to(self.device)

  def forward(self, src, trg):
    src_mask = self.make_src_mask(src)
    trg_mask = self.make_trg_mask(trg)
    enc_src = self.encoder(src, src_mask)
    out = self.decoder(trg, enc_src, src_mask, trg_mask)
    return out
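
The two masks can be inspected on small tensors. In this sketch the padding index `0` and the token ids are assumptions made for illustration; the mask expressions mirror `make_src_mask` and `make_trg_mask`, and the last step shows the effect of `masked_fill` followed by softmax on a score matrix of zeros.

```python
import torch

src_pad_idx = 0                           # assumed padding index
src = torch.tensor([[1, 5, 6, 0, 0]])     # one source sentence, last two tokens are padding
trg = torch.tensor([[1, 7, 4]])           # one target sentence of 3 tokens

# Padding mask: (N, 1, 1, src_len), broadcast over heads and query positions.
src_mask = (src != src_pad_idx).unsqueeze(1).unsqueeze(2)
print(src_mask)
# tensor([[[[ True,  True,  True, False, False]]]])

# Causal mask: (N, 1, trg_len, trg_len), lower triangular.
N, trg_len = trg.shape
trg_mask = torch.tril(torch.ones((trg_len, trg_len))).expand(N, 1, trg_len, trg_len)

# Masked entries are set to -1e20, so softmax assigns them a weight of 0.
energy = torch.zeros(N, 1, trg_len, trg_len)
attention = torch.softmax(energy.masked_fill(trg_mask == 0, float("-1e20")), dim=3)
print(attention[0, 0])
# tensor([[1.0000, 0.0000, 0.0000],
#         [0.5000, 0.5000, 0.0000],
#         [0.3333, 0.3333, 0.3333]])
```

Row i of the result shows that target position i spreads its attention only over positions 0 to i, so no position can look at the words that come after it.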




In [1]:
! pip install mathematics_dataset

Collecting mathematics_dataset
  Downloading mathematics_dataset-1.0.1-py3-none-any.whl (93 kB)
Installing collected packages: mathematics_dataset
Successfully installed mathematics_dataset-1.0.1


In [2]:
! python -m mathematics_dataset.generate --filter=linear_1d

Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/mathematics_dataset/generate.py", line 29, in <module>
    from mathematics_dataset.modules import modules
  File "/usr/local/lib/python3.10/dist-packages/mathematics_dataset/modules/modules.py", line 21, in <module>
    from mathematics_dataset.modules import algebra
  File "/usr/local/lib/python3.10/dist-packages/mathematics_dataset/modules/algebra.py", line 25, in <module>
    from mathematics_dataset import example
  File "/usr/local/lib/python3.10/dist-packages/mathematics_dataset/example.py", line 23, in <module>
    from mathematics_dataset.util import composition
  File "/usr/local/lib/python3.10/dist-packages/mathematics_dataset/util/composition.py", line 28, in <module>
    from mathematics