# Transformers

Paper: [Attention is all you need](https://arxiv.org/abs/1706.03762)

Resources:
- [The Illustrated Transformer](https://jalammar.github.io/illustrated-transformer/)
- [Visualizing A Neural Machine Translation Model (Mechanics of Seq2seq Models With Attention)](https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/)
- [TF Docs: Transformer model for language understanding](https://www.tensorflow.org/text/tutorials/transformer)

#### Summary
- The network is built only from attention mechanisms, with no recurrence or convolutions.
- Addresses the long-range dependency problem of recurrent encoder-decoder models, which compress the whole input sequence into a single encoder hidden state.

### Approach

- Until done:
    1. Read the paper
    2. Try to implement until stuck
    3. Check resources
    4. Go to 1.


In [1]:
import tensorflow as tf
import numpy as np
import tensorflow_datasets as tfds


# Attention model

### Notation
- $Q$ = Query
- $K$ = Key, dimension $d_k$ = 64
- $V$ = Value, dimension $d_v$ = 64
- $n$ = input sequence length
- $m$ = output sequence length
- $d_\text{model}$ = model dimension = 512
- $d_{ff}$ = inner layer dimension = 2048
- $h$ = number of attention heads = 8



## Attention

### Basics
- Addresses the difficulty recurrent models have relating distant elements of a long sequence.
- Pass all hidden states (the full encoder output) to the decoder, not just the last one.
- The decoder scores each encoder hidden state, applies a softmax to the scores, then multiplies each hidden state by its softmax weight and sums the results.
- Rather than having a single hidden state from a recurrent sequence, the decoder has *all* the hidden states and a score (or attention) value for each of them to weight their importance.
- Self-attention means the network learns the relationships between elements of the same sequence, for example between the input tokens.
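
The scoring steps in the list above can be sketched in a few lines of NumPy. This is a minimal illustration, not the notebook's model: it assumes a plain dot product as the score function, and the names `attend`, `encoder_states`, and `decoder_state` are made up for the example.

```python
import numpy as np

def softmax(x):
    # subtract the max for numerical stability
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / np.sum(e, axis=-1, keepdims=True)

def attend(decoder_state, encoder_states):
    # one dot-product score per encoder hidden state (assumed score function)
    scores = encoder_states @ decoder_state      # (n,)
    weights = softmax(scores)                    # (n,), sums to 1
    # context vector: weighted sum of the encoder hidden states
    context = weights @ encoder_states           # (d,)
    return context, weights

encoder_states = np.array([[1.0, 0.0],
                           [0.0, 1.0],
                           [1.0, 1.0]])          # n = 3 hidden states of size 2
decoder_state = np.array([0.0, 5.0])

context, weights = attend(decoder_state, encoder_states)
print(np.round(weights, 3))
```

The first hidden state is orthogonal to the decoder state, so it receives almost no weight, while the other two share the rest equally.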

### Multi-head attention
- A Query, Key, and Value (Q, K, V) is computed for each encoder input vector created by the embedding.
- The attention output is the softmax of $QK^T$ scaled by $\sqrt{d_k}$, then multiplied by Value.

$$\text{Attention}(Q,K,V) = \text{softmax}_k\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

- Multiple heads: $h$ attention layers run in parallel, each on its own learned projection of Q, K, and V.

<img src="https://www.researchgate.net/publication/333078019/figure/fig1/AS:758304078839808@1557805189409/left-Scaled-Dot-Product-Attention-right-Multi-Head-Attention.png" height=300 />



## Attention

In [2]:
import tensorflow as tf
from tensorflow.keras import layers

# Attention
# sec 3.2.1
def scaled_dot_product_attention(q, k, v, mask):
    # get dimensions of the input, cast from tensor to float
    d_k = tf.cast(tf.shape(k)[-1], tf.float32)
    
    # compute queries x keys and scale by dimension
    scaled_attention = tf.matmul(q, k, transpose_b=True) / tf.sqrt(d_k)
    # print(f"scaled attention shape {scaled_attention.shape}")

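    # apply decoder mask: positions where mask == 1 get a large negative
    # score, so their softmax weight is effectively zero
    if mask is not None:
        scaled_attention += (mask * -1e9)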

    # normalize all scores
    attention_weights = tf.nn.softmax(scaled_attention, axis=-1)
    # print(f"attention shape {attention_weights.shape}")

    # times value
    output = tf.matmul(attention_weights, v)
    # print(f"output shape {output.shape}")

    return output, attention_weights

Use the `scaled_dot_product_attention` function to get a handle on how the model matches a query against the keys and returns the corresponding values.


In [3]:

def print_out(q, k, v):
    temp_out, temp_attn = scaled_dot_product_attention(q, k, v, None)
    print('Attention weights are:')
    print(np.round(temp_attn, decimals=2))
    print('Output is:')
    print(np.round(temp_out, decimals=2))

temp_k = tf.constant([[10, 0, 0],
                      [0, 10, 0],
                      [0, 0, 10],
                      [0, 0, 10]], dtype=tf.float32)  # (4, 3)

temp_v = tf.constant([[1, 0],
                      [10, 0],
                      [100, 5],
                      [1000, 6]], dtype=tf.float32)  # (4, 2)


# The dot product attention is selecting the key that aligns with the 
# query and then returning the associated value.
temp_q = tf.constant([[0, 10, 0]], dtype=tf.float32)  # (1, 3)
print_out(temp_q, temp_k, temp_v)

Attention weights are:
[[0. 1. 0. 0.]]
Output is:
[[10.  0.]]
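
To see what the decoder mask does, here is a small NumPy sketch of the same computation with a look-ahead mask. It assumes the convention that a mask value of 1 marks a position to hide; `masked_attention` and `look_ahead` are names chosen for this example only.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / np.sum(e, axis=-1, keepdims=True)

def masked_attention(q, k, v, mask=None):
    d_k = k.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)
    if mask is not None:
        # mask == 1 marks hidden positions; -1e9 drives their weight to ~0
        scores = scores + mask * -1e9
    weights = softmax(scores)
    return weights @ v, weights

seq_len = 3
# look-ahead mask: 1 above the diagonal, so position i cannot see positions > i
look_ahead = np.triu(np.ones((seq_len, seq_len)), k=1)

x = np.eye(seq_len)   # toy input used as queries, keys and values
out, weights = masked_attention(x, x, x, look_ahead)
print(np.round(weights, 2))
```

Every row of `weights` still sums to 1, but all entries above the diagonal are zero, so each position only attends to itself and earlier positions.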


## Multi-head attention

In [4]:
d_model = 512
d_ff = 2048
h = 8
d_k = 64
d_v = 64

# sec 3.2.2
class MultiHeadAttention(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        self.num_heads = num_heads
        self.d_model = d_model

        assert d_model % self.num_heads == 0 

        self.depth = self.d_model  // num_heads

        self.wq = layers.Dense(d_model)
        self.wk = layers.Dense(d_model)
        self.wv = layers.Dense(d_model)

        self.dense = layers.Dense(d_model)

    def split_heads(self, x, batch_size):
        """
        The inputs need to be reshaped in order to be fed into the attention portion.
        The model dimension d_model gets split into num_heads x depth.
        The result is then transposed to (batch_size, num_heads, seq_len, depth).

        """
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))
        return tf.transpose(x, perm=[0, 2, 1, 3])

    # forward computation
    def call(self, q, k, v, mask):
        batch_size = tf.shape(q)[0]

        q = self.wq(q) # (batch_size, seq_len, d_model)
        k = self.wk(k)
        v = self.wv(v)
        print(q.shape)

        q = self.split_heads(q, batch_size) # (batch_size, num_heads, seq_len_q, depth)
        k = self.split_heads(k, batch_size)
        v = self.split_heads(v, batch_size)

        # scaled_attention.shape == (batch_size, num_heads, seq_len_q, depth)
        # attention_weights.shape == (batch_size, num_heads, seq_len_q, seq_len_k)
        scaled_attention, attention_weights = scaled_dot_product_attention(q, k, v, mask)

        scaled_attention = tf.transpose(scaled_attention, perm=[0, 2, 1, 3])  # (batch_size, seq_len_q, num_heads, depth)

        concat_attention = tf.reshape(scaled_attention,
                                    (batch_size, -1, self.d_model))  # (batch_size, seq_len_q, d_model)

        output = self.dense(concat_attention)  # (batch_size, seq_len_q, d_model)

        return output, attention_weights

mha = MultiHeadAttention(d_model, h)
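
The reshape and transpose in `split_heads`, and the reverse step before the final dense layer, can be checked on their own. The NumPy sketch below uses a small batch and sequence length picked for illustration; it only demonstrates the shape bookkeeping, not the learned projections.

```python
import numpy as np

batch_size, seq_len, d_model, num_heads = 2, 5, 512, 8
depth = d_model // num_heads   # 64, matching d_k and d_v

x = np.random.rand(batch_size, seq_len, d_model)

# split: (batch, seq_len, d_model) -> (batch, num_heads, seq_len, depth)
heads = x.reshape(batch_size, seq_len, num_heads, depth).transpose(0, 2, 1, 3)

# merge: undo the transpose, then fold the heads back into d_model
merged = heads.transpose(0, 2, 1, 3).reshape(batch_size, seq_len, d_model)

print(heads.shape, merged.shape)
```

Head `j` sees features `j * depth` to `(j + 1) * depth` of every position, and merging recovers the original tensor exactly.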



# Encoder

In [5]:
x = tf.random.normal((1,60, 512), dtype=tf.float32)

out, attn = mha(x, x, x, None)  # self-attention, no mask
out.shape, attn.shape


(1, 60, 512)


(TensorShape([1, 60, 512]), TensorShape([1, 8, 60, 60]))

# Data
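- The paper trained on the WMT 2014 English-German and English-French datasets, with about 4.5 million sentence pairs and 36 million sentences respectively.
- English-German: shared source-target vocabulary of about 37000 byte-pair-encoded tokens
- English-French: 32000 word-piece vocabulary
- Instead, let's use the TensorFlow Datasets `ted_hrlr_translate/ru_to_en` set for Russian to English.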

In [6]:
# examples, metadata = tfds.load('ted_hrlr_translate/ru_to_en', with_info=True,
#                                as_supervised=True)
# train_examples, val_examples = examples['train'], examples['validation']

In [7]:
# print an example
# for ru, en in train_examples.take(1):
#   print("Russian: ", ru.numpy().decode('utf-8'))
#   print("English:   ", en.numpy().decode('utf-8'))

### Tokenization
- I can't find a ready-made tokenizer for the `ted_hrlr_translate` Russian-to-English set, so I have to build one.
- The TensorFlow transformer tutorial says it uses a subword tokenizer built with the BERT tokenizer tools.
- I'll just follow the example here: https://www.tensorflow.org/text/guide/subwords_tokenizer
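
For intuition about what a subword tokenizer does, here is a pure-Python sketch of greedy longest-match WordPiece splitting. It is a simplified illustration, not the `text.BertTokenizer` implementation; the function `wordpiece_tokenize` and the toy vocabulary are invented for this example.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # try the longest remaining substring first, then shrink it
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word becomes unknown
        tokens.append(piece)
        start = end
    return tokens

toy_vocab = {"token", "##izer", "##s", "un", "##related"}
print(wordpiece_tokenize("tokenizers", toy_vocab))  # ['token', '##izer', '##s']
print(wordpiece_tokenize("xyz", toy_vocab))         # ['[UNK]']
```

Rare words are broken into known pieces instead of collapsing to a single unknown token, which keeps the vocabulary small.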

In [8]:
!pip install -q -U tensorflow-text


In [9]:

import tensorflow_text as text
from tensorflow_text.tools.wordpiece_vocab import bert_vocab_from_dataset as bert_vocab

# train_en = train_examples.map(lambda ru, en: en)
# train_ru = train_examples.map(lambda ru, en: ru)

In [20]:
bert_tokenizer_params=dict(lower_case=True)
reserved_tokens=["[PAD]", "[UNK]", "[START]", "[END]"]

# bert_vocab_args = dict(
#     # The target vocabulary size
#     vocab_size = 8000,
#     # Reserved tokens that must be included in the vocabulary
#     reserved_tokens=reserved_tokens,
#     # Arguments for `text.BertTokenizer`
#     bert_tokenizer_params=bert_tokenizer_params,
#     # Arguments for `wordpiece_vocab.wordpiece_tokenizer_learner_lib.learn`
#     learn_params={},
# )

In [21]:
%%time
# en_vocab = bert_vocab.bert_vocab_from_dataset(
#     train_en.batch(1000).prefetch(2),
#     **bert_vocab_args
# )

CPU times: user 3 µs, sys: 1e+03 ns, total: 4 µs
Wall time: 6.91 µs


In [22]:
# print(en_vocab[:10])
# print(en_vocab[100:110])
# print(en_vocab[1000:1010])
# print(en_vocab[-10:])

In [23]:
# %%time
# ru_vocab = bert_vocab.bert_vocab_from_dataset(
#     train_ru.batch(1000).prefetch(2),
#     **bert_vocab_args
# )

In [24]:
# print(ru_vocab[:10])
# print(ru_vocab[100:110])
# print(ru_vocab[1000:1010])
# print(ru_vocab[-10:])

In [25]:
def write_vocab_file(filepath, vocab):
  with open(filepath, 'w') as f:
    for token in vocab:
      print(token, file=f)

In [26]:
# write_vocab_file('en_vocab.txt', en_vocab)
# write_vocab_file('ru_vocab.txt', ru_vocab)


In [27]:
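import pathlib
import re

# Helpers used by CustomTokenizer, following the subword tokenizer guide linked above
START = tf.argmax(tf.constant(reserved_tokens) == "[START]")
END = tf.argmax(tf.constant(reserved_tokens) == "[END]")

def add_start_end(ragged):
  # wrap every tokenized sentence in [START] ... [END] ids
  count = ragged.bounding_shape()[0]
  starts = tf.fill([count, 1], START)
  ends = tf.fill([count, 1], END)
  return tf.concat([starts, ragged, ends], axis=1)

def cleanup_text(reserved_tokens, token_txt):
  # drop reserved tokens (except "[UNK]") and join the rest into strings
  bad_tokens = [re.escape(tok) for tok in reserved_tokens if tok != "[UNK]"]
  bad_cells = tf.strings.regex_full_match(token_txt, "|".join(bad_tokens))
  result = tf.ragged.boolean_mask(token_txt, ~bad_cells)
  return tf.strings.reduce_join(result, separator=' ', axis=-1)

class CustomTokenizer(tf.Module):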
  def __init__(self, reserved_tokens, vocab_path):
    self.tokenizer = text.BertTokenizer(vocab_path, lower_case=True)
    self._reserved_tokens = reserved_tokens
    self._vocab_path = tf.saved_model.Asset(vocab_path)

    vocab = pathlib.Path(vocab_path).read_text().splitlines()
    self.vocab = tf.Variable(vocab)

    ## Create the signatures for export:   

    # Include a tokenize signature for a batch of strings. 
    self.tokenize.get_concrete_function(
        tf.TensorSpec(shape=[None], dtype=tf.string))

    # Include `detokenize` and `lookup` signatures for:
    #   * `Tensors` with shapes [tokens] and [batch, tokens]
    #   * `RaggedTensors` with shape [batch, tokens]
    self.detokenize.get_concrete_function(
        tf.TensorSpec(shape=[None, None], dtype=tf.int64))
    self.detokenize.get_concrete_function(
          tf.RaggedTensorSpec(shape=[None, None], dtype=tf.int64))

    self.lookup.get_concrete_function(
        tf.TensorSpec(shape=[None, None], dtype=tf.int64))
    self.lookup.get_concrete_function(
          tf.RaggedTensorSpec(shape=[None, None], dtype=tf.int64))

    # These `get_*` methods take no arguments
    self.get_vocab_size.get_concrete_function()
    self.get_vocab_path.get_concrete_function()
    self.get_reserved_tokens.get_concrete_function()

  @tf.function
  def tokenize(self, strings):
    enc = self.tokenizer.tokenize(strings)
    # Merge the `word` and `word-piece` axes.
    enc = enc.merge_dims(-2,-1)
    enc = add_start_end(enc)
    return enc

  @tf.function
  def detokenize(self, tokenized):
    words = self.tokenizer.detokenize(tokenized)
    return cleanup_text(self._reserved_tokens, words)

  @tf.function
  def lookup(self, token_ids):
    return tf.gather(self.vocab, token_ids)

  @tf.function
  def get_vocab_size(self):
    return tf.shape(self.vocab)[0]

  @tf.function
  def get_vocab_path(self):
    return self._vocab_path

  @tf.function
  def get_reserved_tokens(self):
    return tf.constant(self._reserved_tokens)

In [1]:
!wget 

--2021-06-11 01:41:34--  https://drive.google.com/file/d/1CWpmbIjMvJPz5muaIs14lKM2i9rKdw0L/view?usp=sharing
Resolving drive.google.com (drive.google.com)... 172.253.123.101, 172.253.123.102, 172.253.123.138, ...
Connecting to drive.google.com (drive.google.com)|172.253.123.101|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘view?usp=sharing’

view?usp=sharing        [ <=>                ]  65.11K  --.-KB/s    in 0.003s  

2021-06-11 01:41:34 (20.3 MB/s) - ‘view?usp=sharing’ saved [66676]



In [28]:


tokenizers = tf.Module()
tokenizers.ru = CustomTokenizer(reserved_tokens, 'ru_vocab.txt')
tokenizers.en = CustomTokenizer(reserved_tokens, 'en_vocab.txt')

NotFoundError: ignored

In [None]:
model_name = 'ted_hrlr_translate_ru_en_converter'
tf.saved_model.save(tokenizers, model_name)

In [None]:
reloaded_tokenizers = tf.saved_model.load(model_name)
reloaded_tokenizers.en.get_vocab_size().numpy()

In [None]:
tokens = reloaded_tokenizers.en.tokenize(['Hello TensorFlow!'])
tokens.numpy()

## Embedding
- Learned embeddings convert tokens to vectors of dimension $d_\text{model}$
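
An embedding lookup is just row indexing into a `(vocab_size, d_model)` matrix. The NumPy sketch below uses a random matrix as a stand-in for learned weights, hypothetical token ids, and a vocabulary size of 8000 (an assumption taken from the commented-out vocab settings); the scaling by $\sqrt{d_\text{model}}$ follows the paper.

```python
import numpy as np

vocab_size, d_model = 8000, 512
rng = np.random.default_rng(0)

# stand-in for a learned (vocab_size, d_model) embedding matrix
embedding_matrix = rng.normal(size=(vocab_size, d_model))

token_ids = np.array([[2, 57, 913, 3]])   # (batch, seq_len), hypothetical ids

# lookup is row indexing; the paper scales the result by sqrt(d_model)
embedded = embedding_matrix[token_ids] * np.sqrt(d_model)
print(embedded.shape)
```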

## Position encoding
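
Because the model has no recurrence, the paper adds a fixed sinusoidal position encoding to the embeddings: $PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d_\text{model}})$ and $PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d_\text{model}})$. A minimal NumPy sketch of that formula follows; the function name and the `max_len` of 60 are choices made for this example.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]        # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]     # (1, d_model / 2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe

pe = positional_encoding(max_len=60, d_model=512)
print(pe.shape)
```

The encoding has the same dimension as the embeddings, so the two can simply be summed.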

# Model

## Full Encoder
- 6 identical layers
- 2 sub-layers per layer
    - Multi-head self-attention
    - Position-wise fully connected feed-forward network
    - Residual connection around each sub-layer, followed by layer normalization
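
The feed-forward sub-layer and its residual wrapper can be sketched in NumPy. This is an illustration under assumptions: the weights are random stand-ins for learned parameters, the layer normalization (which the paper applies after each residual connection) omits its learned gain and bias, and the helper names are invented here.

```python
import numpy as np

d_model, d_ff = 512, 2048
rng = np.random.default_rng(0)

# randomly initialised stand-ins for learned weights
W1, b1 = rng.normal(scale=0.02, size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(scale=0.02, size=(d_ff, d_model)), np.zeros(d_model)

def feed_forward(x):
    # FFN(x) = max(0, x W1 + b1) W2 + b2, applied to each position independently
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

def layer_norm(x, eps=1e-6):
    # normalise over the feature axis (learned gain and bias omitted)
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def sublayer(x, fn):
    # LayerNorm(x + Sublayer(x))
    return layer_norm(x + fn(x))

x = rng.normal(size=(1, 60, d_model))
out = sublayer(x, feed_forward)
print(out.shape)
```

The same `sublayer` wrapper would apply to the self-attention sub-layer, since both map `(batch, seq_len, d_model)` to the same shape.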



## Decoder
- 6 identical layers
- Same 2 sub-layers as the Encoder, plus a third sub-layer between them
    - The middle sub-layer performs multi-head attention over the output from the Encoder
    - Same residual connections around each sub-layer
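
In the middle sub-layer the queries come from the decoder while the keys and values come from the encoder output, so the attention weights have shape $(m, n)$. The NumPy sketch below shows only that shape relationship for a single head; it leaves out the learned projections, and the sizes are arbitrary example values.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / np.sum(e, axis=-1, keepdims=True)

n, m, d_k = 7, 4, 64   # input length, output length, key dimension
rng = np.random.default_rng(0)

encoder_out = rng.normal(size=(n, d_k))   # keys and values come from the encoder
decoder_x = rng.normal(size=(m, d_k))     # queries come from the decoder

weights = softmax(decoder_x @ encoder_out.T / np.sqrt(d_k))   # (m, n)
context = weights @ encoder_out                               # (m, d_k)
print(weights.shape, context.shape)
```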

# Train

## Optimizer
- Adam $\beta_1$ = 0.9, $\beta_2$ = 0.98 and $\epsilon$ = $10^{-9}$
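
The paper pairs these Adam settings with a learning rate that rises linearly during warmup and then decays with the inverse square root of the step: $lrate = d_\text{model}^{-0.5} \cdot \min(step^{-0.5}, step \cdot warmup\_steps^{-1.5})$, with 4000 warmup steps. A small pure-Python sketch of that schedule, with a function name chosen for this example:

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    # lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)
    step = max(step, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for step in (1, 1000, 4000, 16000):
    print(step, transformer_lr(step))
```

The rate peaks at the end of warmup and falls off afterwards.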