# Deep Learning - Exercise 8

This lecture is focused on using the attention mechanism in deep learning models.

We recommend reading [this post](https://analyticsindiamag.com/a-beginners-guide-to-using-attention-layer-in-neural-networks/) for more detailed information.

[Open in Google colab](https://colab.research.google.com/github/rasvob/VSB-FEI-Deep-Learning-Exercises/blob/main/dl_08.ipynb)
[Download from Github](https://github.com/rasvob/VSB-FEI-Deep-Learning-Exercises/blob/main/dl_08.ipynb)

##### Remember to set **GPU** runtime in Colab!

In [None]:
!pip install keract
!pip install attention

In [None]:
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import numpy as np 
import pandas as pd
import seaborn as sns
import tensorflow as tf
import tensorflow.keras as keras
from tensorflow import string as tf_string
from tensorflow.keras.layers import TextVectorization
from tensorflow.keras.layers import LSTM, GRU, Bidirectional, Dense, Layer

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, classification_report
import os

from keract import get_activations
from keras import Input, Model
from tensorflow.keras.callbacks import Callback
import keras.backend as K

os.environ['KERAS_ATTENTION_DEBUG'] = '1'
from attention import Attention

plt.rcParams.update({'font.size': 8})

tf.version.VERSION

In [None]:
# tf.config.set_visible_devices([], 'GPU')

In [None]:
def show_history(history):
    plt.figure()
    for key in history.history.keys():
        plt.plot(history.epoch, history.history[key], label=key)
    plt.legend()
    plt.tight_layout()

# 📒 What is the Attention mechanism?

* When we think about the English word **Attention**, we know that it means **directing your focus at something** and taking greater notice
* The Attention mechanism in Deep Learning is based on this concept of directing your focus: it pays greater attention to certain parts of the input when processing the data
    * 📌 Paying attention to the important information is necessary and it can improve the performance of the model
* **The Attention mechanism can help a neural network retain information over long sequences**
    * 🔎 Remember the RNN and even LSTM long-context issues?
* 🔎 Can you imagine some use-cases where it can help us?

> 💡 In very simple terms, the Attention mechanism lets the model look back at every hidden state directly, so the important pieces of information are not lost to the forget mechanism of the LSTM layers

### The process is usually computed in these few steps

![Img00](https://github.com/rasvob/VSB-FEI-Deep-Learning-Exercises/blob/main/images/dl_08_04.png?raw=true)

* Let’s say that we have an input sequence $x$ of length $n$ and an output sequence $y$ of length $m$
    * $x=[x_1, x_2, ..., x_n]$
    * $y = [y_1, y_2, ..., y_m]$
    
* The encoder used in the network is a bidirectional LSTM, which has a forward hidden state and a backward hidden state
    * The encoder state is represented by concatenating the forward and backward states
    * $h_i = [h_i^{L2R}, h_i^{R2L}]$

* The decoder hidden state at output position $t$ is:
    * $s_t=f(s_{t-1}, y_{t-1}, c_t)$
    
* For the output word at position $t$, the context vector $c_t$ is a weighted sum of the hidden states of the input sequence
* Thus we have:

![Img02](https://github.com/rasvob/VSB-FEI-Deep-Learning-Exercises/blob/main/images/dl_08_02.png?raw=true)

* Here we can see that the sum of the hidden states is weighted by the alignment scores
* 💡 We can say that ${\alpha_{t,i}}$  are the weights that are responsible for defining how much of each source’s hidden state should be taken into consideration for each output

* 💡 There are various types of alignment score functions
    * They can be either linear (e.g. a dot product) or non-linear (e.g. a small network with a `tanh` activation)
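
The weighted sum described above can be sketched in a few lines of NumPy. This is only an illustration, not the code used later in the notebook: the sizes, the random values and the plain dot-product score are assumptions made for the example, and other score functions are possible.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 5, 4                       # assumed: 5 encoder steps, hidden size 4
h = rng.normal(size=(n, d))       # encoder hidden states h_1..h_n
s_prev = rng.normal(size=(d,))    # previous decoder state s_{t-1}

# Alignment scores: a plain dot product is assumed here for simplicity.
scores = h @ s_prev               # shape (n,)

# Softmax turns the scores into weights alpha_{t,i} that sum to 1.
exp_scores = np.exp(scores - scores.max())
alpha = exp_scores / exp_scores.sum()

# Context vector c_t: sum of the hidden states weighted by alpha.
c_t = (alpha[:, None] * h).sum(axis=0)   # shape (d,)

print(alpha.round(3), c_t.shape)
```

A hidden state with a large weight contributes most to `c_t`, while states with weights near zero are effectively ignored for this output position.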

### 📌 Below are some of the popular attention mechanisms:

![Img03](https://github.com/rasvob/VSB-FEI-Deep-Learning-Exercises/blob/main/images/dl_08_03.png?raw=true)

#### 💡 There are many variants of the mechanism in the wild but the basic computation process is the same

### A very common and easy-to-understand example is the **Self-Attention Mechanism**
* When an attention mechanism relates different positions of a single sequence to each other in order to compute a representation of that same sequence, it is called self-attention

![Img01](https://github.com/rasvob/VSB-FEI-Deep-Learning-Exercises/blob/main/images/dl_08_01.png?raw=true)

* In the image, red marks the word that is currently being processed, blue marks the memory, and the intensity of the color shows the degree of memory activation
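
To make the idea concrete, here is a minimal NumPy sketch of one common variant, scaled dot-product self-attention. It is an assumption chosen for illustration: it has no learned projections, the input is random, and the `softmax` and `self_attention` helpers are defined only for this example.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    """Scaled dot-product self-attention without learned projections."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)        # (T, T): every position vs. every position
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ x, weights          # new representation, attention map

rng = np.random.default_rng(1)
tokens = rng.normal(size=(6, 8))         # assumed: 6 positions, 8 features each
out, attn = self_attention(tokens)
print(out.shape, attn.shape)             # (6, 8) (6, 6)
```

Row `i` of `attn` says how strongly position `i` attends to every other position of the same sequence, which is what the color intensity in the image visualizes.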

## ⚡ We will use the Attention layer from the library first and try to solve the *Find-Max task*

### 💡 We need to define a callback for visualizing the attention maps

In [None]:
class VisualizeAttentionMap(Callback):

    def __init__(self, model, x):
        super().__init__()
        self.model = model
        self.x = x

    def on_epoch_begin(self, epoch, logs=None):
        attention_map = get_activations(self.model, self.x, layer_names='attention_weight')['attention_weight']
        x = self.x[..., 0]
        plt.close()
        fig, axes = plt.subplots(nrows=3, figsize=(10, 8))
        maps = [attention_map, create_argmax_mask(attention_map), create_argmax_mask(x)]
        maps_names = ['attention layer (continuous)', 'attention layer - argmax (discrete)', 'ground truth (discrete)']
        for i, ax in enumerate(axes.flat):
            im = ax.imshow(maps[i], interpolation='none', cmap='jet')
            ax.set_ylabel(maps_names[i] + '\n#sample axis')
            ax.set_xlabel('sequence axis')
            ax.xaxis.set_ticks([])
            ax.yaxis.set_ticks([])
        cbar_ax = fig.add_axes([0.75, 0.15, 0.05, 0.7])
        fig.colorbar(im, cax=cbar_ax)
        fig.suptitle(f'Epoch {epoch} - training\nEach plot shows a 2-D matrix x-axis: sequence length * y-axis: '
                     f'batch/sample axis. \nThe first matrix contains the attention weights (softmax).'
                     f'\nWe manually apply argmax on the attention weights to see which time step ID has '
                     f'the strongest weight. \nFinally, the last matrix displays the ground truth. The task '
                     f'is solved when the second and third matrix match.')
        plt.draw()
        plt.pause(0.001)


def create_argmax_mask(x):
    mask = np.zeros_like(x)
    for i, m in enumerate(x.argmax(axis=1)):
        mask[i, m] = 1
    return mask
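
A quick way to see what `create_argmax_mask` does is to run it on a tiny matrix. The weights below are made-up values, and the helper is repeated so that the snippet runs on its own.

```python
import numpy as np

def create_argmax_mask(x):
    # Same helper as in the cell above, repeated to keep the example standalone.
    mask = np.zeros_like(x)
    for i, m in enumerate(x.argmax(axis=1)):
        mask[i, m] = 1
    return mask

# Two hypothetical samples with three time steps each.
weights = np.array([[0.1, 0.7, 0.2],
                    [0.5, 0.3, 0.2]])
mask = create_argmax_mask(weights)
print(mask)
# [[0. 1. 0.]
#  [1. 0. 0.]]
```

Each row keeps a single 1 at the position of its largest value, which is how the callback turns the continuous attention weights into a discrete map.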

# We need to create the training examples first
* 📌 The goal of the task is to predict the maximum value of the input array
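
One way to see why attention suits this task: if the attention weights concentrate on the position that holds the largest value, the weighted sum returns that value. The sketch below is an idealized illustration, not the model trained later. The input values are made up and `sharpness` is a hypothetical stand-in for what the network would have to learn.

```python
import numpy as np

x = np.array([0.05, 0.62, 0.10, 0.31])   # made-up input sequence

def soft_max_value(x, sharpness):
    # Softmax over the scaled inputs plays the role of the attention weights.
    z = sharpness * x
    w = np.exp(z - z.max())
    w /= w.sum()
    # The weighted sum approaches max(x) as the weights become one-hot.
    return w, float((w * x).sum())

for sharpness in (1, 10, 100):
    w, value = soft_max_value(x, sharpness)
    print(sharpness, w.round(3), round(value, 3))
```

With a low `sharpness` the weights are almost uniform and the result is close to the mean. With a high one the weights are nearly one-hot and the result is close to the maximum.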

In [None]:
seq_length = 10
num_samples = 100000
# https://stats.stackexchange.com/questions/485784/which-distribution-has-its-maximum-uniformly-distributed
# Choose beta(1/N,1) to have max(X_1,...,X_n) ~ U(0, 1) => minimizes amount of knowledge.
# If all the max(s) are concentrated around 1, then it makes the task easy for the model.
x_data = np.random.beta(a=1 / seq_length, b=1, size=(num_samples, seq_length, 1))
y_data = np.max(x_data, axis=1)
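
The reasoning in the comments can be checked directly. The CDF of Beta(1/N, 1) is $x^{1/N}$, so the maximum of N independent draws has CDF $(x^{1/N})^N = x$, which is the CDF of U(0, 1). The snippet below is a small empirical check. The fixed seed and the reduced number of samples are assumptions made to keep it quick.

```python
import numpy as np

rng = np.random.default_rng(42)
seq_length = 10
num_samples = 20000   # smaller than in the notebook, enough for a check

x = rng.beta(a=1 / seq_length, b=1, size=(num_samples, seq_length))
row_max = x.max(axis=1)

# A U(0, 1) variable has mean 1/2 and variance 1/12.
print(row_max.mean(), row_max.var())

# Each of 10 equal-width bins should hold roughly 10 % of the maxima.
counts, _ = np.histogram(row_max, bins=10, range=(0, 1))
print(counts / num_samples)
```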

In [None]:
ax = sns.histplot(x_data.flatten(), kde=False)
ax.set_ylim([0, 10000])
plt.title('Histogram of input values sampled from Beta distribution')

## The data looks like this

In [None]:
print(f'Input:\n {x_data[0]},\n Output:\n {y_data[0]}')

In [None]:
print(f'Input:\n {x_data[1]},\n Output:\n {y_data[1]}')

# ⚡ We will employ a simple LSTM-based model with an attention layer stacked on top of it
* 🔎 What is the intuition behind using attention?

In [None]:
model_input = Input(shape=(seq_length, 1))
x = LSTM(128, return_sequences=True)(model_input)
x = Attention()(x)
x = Dense(1, activation='linear')(x)
model = Model(model_input, x)

model.compile(loss='mae')

In [None]:
model.summary()

# 🚀 Let's train the model
* 🔎 Take a look at the attention mask output - what is the ideal state?

In [None]:
max_epoch = 100
# visualize the attention on the first 12 samples.
visualize = VisualizeAttentionMap(model, x_data[0:12])
model.fit(x_data, y_data, epochs=max_epoch, validation_split=0.2, callbacks=[visualize])

## ⚡ Now that we know how the Attention layer works, we can employ it for the sentiment analysis task
* We will use the Yelp dataset, which contains restaurant reviews labelled as either positive (1) or negative (0)

## Download and load the dataset

In [None]:
path_to_file = tf.keras.utils.get_file('yelp_labelled.txt', 'https://raw.githubusercontent.com/rasvob/VSB-FEI-Deep-Learning-Exercises/main/datasets/yelp_labelled.txt')

In [None]:
path_to_file

In [None]:
with open(path_to_file) as f:
    lines = f.readlines()
    lines = [x.rstrip() for x in lines]

In [None]:
len(lines)

In [None]:
lines_dict = [{'Text': x[:-1].rstrip(), 'Label': int(x[-1])} for x in lines]
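
The list comprehension above takes the last character of each line as the label and everything before it as the text. A small standalone illustration, assuming each line ends with a tab followed by the label (the two sample lines are written here only for the example):

```python
# Illustrative lines in the assumed "text<TAB>label" layout.
sample_lines = [
    "Wow... Loved this place.\t1",
    "Crust is not good.\t0",
]

# x[-1] is the label character, x[:-1].rstrip() drops it and the trailing tab.
parsed = [{'Text': x[:-1].rstrip(), 'Label': int(x[-1])} for x in sample_lines]
for row in parsed:
    print(row)
# {'Text': 'Wow... Loved this place.', 'Label': 1}
# {'Text': 'Crust is not good.', 'Label': 0}
```

This only works because the label is a single digit at the very end of the line. A multi-character label would need a proper split on the separator.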

In [None]:
df = pd.DataFrame.from_dict(lines_dict)

In [None]:
df.head()

### ⚡ We will use the TextVectorization layer as usual and we will create a baseline model without the Attention layer first

In [None]:
embedding_dim = 64 # Dimension of the embedded representation - this is already part of the latent space; it captures dependencies among words, and these vectors are learned by the network
max_tokens = 3000
sequence_length = 32 # Output length after vectorizing - the integer token IDs in the vectorized representation are independent of each other

vect_layer = TextVectorization(max_tokens=max_tokens, output_mode='int', output_sequence_length=sequence_length)
vect_layer.adapt(df.Text.values)

In [None]:
vocab = vect_layer.get_vocabulary()

##  The dataset is balanced
* 💡 We will use the `stratify` parameter of `train_test_split` to make sure that it stays balanced

In [None]:
df.Label.value_counts()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df.Text, df.Label, test_size=0.20, random_state=13, stratify=df.Label)

In [None]:
print(X_train.shape, X_test.shape)

In [None]:
print('Train')
print(y_train.value_counts())
print('Test')
print(y_test.value_counts())

## Let's define a very simple model first

In [None]:
input_layer = keras.layers.Input(shape=(1,), dtype=tf_string)
x_v = vect_layer(input_layer)
emb = keras.layers.Embedding(len(vocab), output_dim=embedding_dim, embeddings_regularizer=keras.regularizers.l2(.001))(x_v)
x = LSTM(50, dropout=0.3,recurrent_dropout=0.4)(emb)
output_layer = keras.layers.Dense(1, 'sigmoid')(x)

model = keras.Model(input_layer, output_layer)
model.summary()

model.compile(optimizer=keras.optimizers.Adam(), loss=keras.losses.BinaryCrossentropy(from_logits=False), metrics=[keras.metrics.BinaryAccuracy()])

In [None]:
model_checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath='weights.best.tf',
    save_weights_only=True,
    monitor='val_loss',
    mode='auto',
    save_best_only=True)

In [None]:
batch_size = 128
epochs = 10

history = model.fit(X_train.values, y_train.values, validation_split=0.2, callbacks=[model_checkpoint_callback], epochs=epochs, batch_size=batch_size)

show_history(history)

In [None]:
model.evaluate(X_test.values, y_test.values)

# Now we will create our own Attention layer and add it to the model

![meme01](https://github.com/rasvob/VSB-FEI-Deep-Learning-Exercises/blob/main/images/dl_08_meme_01.jpg?raw=true)



In [None]:
class MyAttention(Layer):
    def __init__(self,**kwargs):
        super(MyAttention,self).__init__(**kwargs)

    def build(self,input_shape):
        self.W=self.add_weight(name="att_weight",shape=(input_shape[-1],1),initializer="normal")
        self.b=self.add_weight(name="att_bias",shape=(input_shape[1],1),initializer="zeros")        
        super(MyAttention, self).build(input_shape)

    def call(self,x):
        # print('input', x.shape, "W:", self.W.shape, "b:", self.b.shape)
        dot = K.dot(x,self.W)+self.b
        # print('dot', dot.shape)
        th = K.tanh(dot)
        # print('th', th.shape)
        et=K.squeeze(th,axis=-1)
        # print('squeeze', et.shape)
        at=K.softmax(et)
        # print('softmax', at.shape)
        at=K.expand_dims(at,axis=-1)
        # print('expand_dims', at.shape)
        output=x*at
        # print('output', output.shape)
        res = K.sum(output,axis=1)
        # print('res', res.shape)
        return res

    def compute_output_shape(self,input_shape):
        return (input_shape[0],input_shape[-1])

    def get_config(self):
        return super(MyAttention,self).get_config()

## 💡 If we uncomment the `print` statements the output for our model will look like this:
* input (None, 32, 128) W: (128, 1) b: (32, 1)
* dot (None, 32, 1)
* th (None, 32, 1)
* squeeze (None, 32)
* softmax (None, 32)
* expand_dims (None, 32, 1)
* output (None, 32, 128)
* res (None, 128)
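
The same sequence of shapes can be reproduced outside Keras with plain NumPy. This is a sketch that mirrors the steps of `call` under assumed values: a batch size of 2 and random inputs and weights. It is not the layer itself.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, steps, features = 2, 32, 128             # batch size of 2 is assumed

x = rng.normal(size=(batch, steps, features))   # LSTM output sequence
W = rng.normal(scale=0.05, size=(features, 1))  # one weight per feature
b = np.zeros((steps, 1))                        # one bias per time step

e = np.tanh(x @ W + b)                 # (2, 32, 1): one score per time step
e = e.squeeze(-1)                      # (2, 32)
a = np.exp(e - e.max(axis=1, keepdims=True))
a = a / a.sum(axis=1, keepdims=True)   # softmax over the time axis
res = (x * a[..., None]).sum(axis=1)   # (2, 128): weighted sum of the steps

print(e.shape, a.shape, res.shape)     # (2, 32) (2, 32) (2, 128)
```

`W` maps the feature vector of each time step to a single score, while `b` is broadcast across the batch and adds a separate offset to each time step.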

### 🔎 Why do we have `32` biases and `128` weights?

In [None]:
input_layer = keras.layers.Input(shape=(1,), dtype=tf_string)
x_v = vect_layer(input_layer)
emb = keras.layers.Embedding(len(vocab), output_dim=embedding_dim, embeddings_regularizer=keras.regularizers.l2(.001))(x_v)
x = LSTM(128, dropout=0.3,recurrent_dropout=0.2, return_sequences=True)(emb)
x = MyAttention()(x)
output_layer = keras.layers.Dense(1, 'sigmoid')(x)

model = keras.Model(input_layer, output_layer)
model.summary()

model.compile(optimizer=keras.optimizers.Adam(), loss=keras.losses.BinaryCrossentropy(from_logits=False), metrics=[keras.metrics.BinaryAccuracy()])

In [None]:
model_checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath='weights.best.tf',
    save_weights_only=True,
    monitor='val_loss',
    mode='auto',
    save_best_only=True)

In [None]:
batch_size = 128
epochs = 10

history = model.fit(X_train.values, y_train.values, validation_split=0.2, callbacks=[model_checkpoint_callback], epochs=epochs, batch_size=batch_size)

show_history(history)

In [None]:
model.evaluate(X_test.values, y_test.values)

## 🔎 Can you notice any difference in the model accuracy or training process progress?

# ✅  Tasks for the lecture (2p)

* `Attention` layer from the [library](https://github.com/philipperemy/keras-attention) has 2 `score` variants (1p)
    * Use the layer in your model and test both `score` variants
    * Is there any difference in the performance?

* It is possible to make LSTM/GRU layers `Bidirectional` using the [Bidirectional layer](https://keras.io/api/layers/recurrent_layers/bidirectional/) (1p)
    * Use it in your model - what happened to the number of weights? Was there any improvement?