<a href="https://colab.research.google.com/github/zerotodeeplearning/ztdl-masterclasses/blob/master/notebooks/Attention_with_Keras.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Learn with us: www.zerotodeeplearning.com

Copyright © 2021: Zero to Deep Learning ® Catalit LLC.

In [None]:
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Attention with Keras

This exercise follows:
https://keras.io/examples/nlp/text_classification_with_transformer/

In [None]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.datasets import imdb
import tensorflow as tf
from tensorflow.keras.layers import (Layer, Embedding, MultiHeadAttention, Dense,
                                     GlobalAveragePooling1D, Dropout,
                                     LayerNormalization, Input)
from tensorflow.keras.models import Sequential, Model
import matplotlib.pyplot as plt
from tensorflow.keras.utils import plot_model

## Load the IMDB dataset and its word index

In [None]:
vocab_size = 20000
maxlen = 200

(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=vocab_size)
X_train = pad_sequences(X_train, maxlen=maxlen)
X_test = pad_sequences(X_test, maxlen=maxlen)

word_index = imdb.get_word_index()
inverted_word_index = dict((i+3, word) for (word, i) in word_index.items())
inverted_word_index[0] = ''
inverted_word_index[1] = '<start>'
inverted_word_index[2] = '<oov>'
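
The `i+3` offset mirrors the defaults of `imdb.load_data`, which reserves index 1 for the start-of-sequence marker and 2 for out-of-vocabulary words, and shifts every word rank by 3 (`index_from=3`); index 0 is the padding value that `pad_sequences` adds. A minimal sketch with a made-up three-word index (`toy_word_index`, `toy_inverted`, and `encoded` are hypothetical names used only for illustration):

```python
# Hypothetical toy index in the same format as imdb.get_word_index():
# word -> frequency rank, starting at 1.
toy_word_index = {"the": 1, "movie": 2, "great": 3}

# Shift every rank by 3, mirroring the default index_from=3 of imdb.load_data.
toy_inverted = {rank + 3: word for word, rank in toy_word_index.items()}
toy_inverted[0] = ""          # padding value added by pad_sequences
toy_inverted[1] = "<start>"   # start-of-sequence marker
toy_inverted[2] = "<oov>"     # out-of-vocabulary marker

# A made-up encoded review: two padding tokens, the start marker, then words.
encoded = [0, 0, 1, 4, 5, 2, 6]
decoded = " ".join(toy_inverted[i] for i in encoded).strip()
print(decoded)
```

Padding tokens decode to empty strings, which is why decoded reviews that are shorter than `maxlen` start with a run of spaces.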

Let's check the data shape:

In [None]:
X_train.shape

In [None]:
X_train[0]

Let's decode a couple of reviews to check that the inverted word index is consistent:

In [None]:
" ".join(inverted_word_index[i] for i in X_train[0])

In [None]:
" ".join(inverted_word_index[i] for i in X_train[1])

## Token and Position Embedding

In [None]:
class TokenAndPositionEmbedding(Layer):
  def __init__(self, maxlen, vocab_size, embed_dim, **kwargs):
    super(TokenAndPositionEmbedding, self).__init__(**kwargs)
    self.token_emb = Embedding(input_dim=vocab_size, output_dim=embed_dim)
    self.pos_emb = Embedding(input_dim=maxlen, output_dim=embed_dim)

  def call(self, x):
    maxlen = tf.shape(x)[-1]
    positions = tf.range(start=0, limit=maxlen, delta=1)
    positions = self.pos_emb(positions)
    x = self.token_emb(x)
    return x + positions
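
The addition in `call` relies on broadcasting: the token embeddings have shape `(batch, maxlen, embed_dim)`, while the position embeddings have shape `(maxlen, embed_dim)` and are shared by every sequence in the batch. The following NumPy sketch illustrates this with random stand-ins for the two learned tables (all sizes and names are made up, and this is not how Keras implements the lookup internally):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes, much smaller than the notebook's.
batch, seq_len, vocab, dim = 2, 4, 10, 3

# Random stand-ins for the two learned lookup tables.
token_table = rng.normal(size=(vocab, dim))    # one row per token id
pos_table = rng.normal(size=(seq_len, dim))    # one row per position

x = rng.integers(0, vocab, size=(batch, seq_len))  # token ids

token_vecs = token_table[x]                    # (batch, seq_len, dim)
pos_vecs = pos_table[np.arange(seq_len)]       # (seq_len, dim)

# Broadcasting adds the same position vectors to every sequence in the batch.
out = token_vecs + pos_vecs
print(out.shape)  # (2, 4, 3)
```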

Let's display the embeddings of a few reviews:

In [None]:
example_tpe = Sequential([TokenAndPositionEmbedding(maxlen, vocab_size, 32)])

In [None]:
n_reviews = 5

In [None]:
embedded_sentences = example_tpe(X_train[:n_reviews])

In [None]:
plt.figure(figsize=(10, 10))
for i in range(n_reviews):
  plt.subplot(n_reviews, 1, i+1)
  plt.imshow(embedded_sentences.numpy()[i].transpose())
  plt.xlabel("word in sentence -->")
  plt.ylabel("<-- embedding dim")
  plt.title(f"movie review {i}")

plt.tight_layout();

## Transformer Block

In [None]:
class TransformerBlock(Layer):
  def __init__(self, embed_dim, n_att_heads, n_dense_nodes, rate=0.1, **kwargs):
    super(TransformerBlock, self).__init__(**kwargs)
    self.att = MultiHeadAttention(num_heads=n_att_heads, key_dim=embed_dim)
    self.ffn = Sequential([
        Dense(n_dense_nodes, activation="relu"),
        Dense(embed_dim)]
    )
    self.layernorm1 = LayerNormalization(epsilon=1e-6)
    self.layernorm2 = LayerNormalization(epsilon=1e-6)
    self.dropout1 = Dropout(rate)
    self.dropout2 = Dropout(rate)

  def call(self, inputs, training=None):
    attn_output = self.att(inputs, inputs)
    attn_output = self.dropout1(attn_output, training=training)
    out1 = self.layernorm1(inputs + attn_output)
    ffn_output = self.ffn(out1)
    ffn_output = self.dropout2(ffn_output, training=training)
    return self.layernorm2(out1 + ffn_output)
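
Calling `self.att(inputs, inputs)` passes the same tensor as query and value, which makes this self-attention. As a rough illustration of what a single head computes, here is a simplified NumPy sketch of scaled dot-product self-attention. It is not the Keras implementation: it assumes one head, no batch dimension, no biases, and no dropout, and the projection matrices `w_q`, `w_k`, `w_v`, `w_o` are random stand-ins for learned weights.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def single_head_self_attention(x, w_q, w_k, w_v, w_o):
    """x: (seq_len, embed_dim). Returns the output (seq_len, embed_dim) and the weights."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])   # (seq_len, seq_len)
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return (weights @ v) @ w_o, weights

rng = np.random.default_rng(0)
seq_len, embed_dim, key_dim = 5, 8, 4         # made-up toy sizes
x = rng.normal(size=(seq_len, embed_dim))
w_q, w_k, w_v = (rng.normal(size=(embed_dim, key_dim)) for _ in range(3))
w_o = rng.normal(size=(key_dim, embed_dim))

out, weights = single_head_self_attention(x, w_q, w_k, w_v, w_o)
print(out.shape, weights.shape)  # (5, 8) (5, 5)
```

The output has the same shape as the input, which is what allows the residual additions `inputs + attn_output` and `out1 + ffn_output` in the block above.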

## Exercise 1

Using either the Sequential or the Functional API in Keras, build a transformer classification model with the following architecture:

```
    TokenAndPositionEmbedding(...
    TransformerBlock(...
    GlobalAveragePooling1D(...
    Dropout(...
    Dense(...
    Dropout(...
    Dense(2, activation="softmax")
```

Once the model is built, print out the summary.

You will need to choose a few hyperparameters, including:

- Embedding size
- Number of attention heads
- Size of the dense hidden layer inside the transformer block
- Size of the other dense layers
- Dropout rate

## Exercise 2

Compile, train, and evaluate the model. Pay attention to the loss function. We defined the output layer as a `Dense(2, activation="softmax")` so you will need to choose the loss accordingly.
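
As a hint, the labels returned by `imdb.load_data` are integers (0 or 1), so one loss that fits a two-unit softmax output is sparse categorical cross-entropy (`"sparse_categorical_crossentropy"` in Keras). Binary cross-entropy would instead expect a single sigmoid unit. The NumPy sketch below shows what that loss computes on made-up probabilities; the function name matches the Keras loss only for readability and is a hand-written stand-in, not the library code.

```python
import numpy as np

def sparse_categorical_crossentropy(y_true, probs):
    """Mean negative log-probability assigned to the true class."""
    picked = probs[np.arange(len(y_true)), y_true]  # probability of the true class
    return float(-np.log(picked).mean())

# Made-up softmax outputs for three reviews (each row sums to 1).
probs = np.array([[0.9, 0.1],
                  [0.2, 0.8],
                  [0.5, 0.5]])
y_true = np.array([0, 1, 1])  # integer class labels, as in y_train

loss = sparse_categorical_crossentropy(y_true, probs)
print(round(loss, 4))
```

Confident correct predictions contribute little to the loss, while the uncertain third prediction contributes the most.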