# MultiHead-Attention Mechanism (Graded)

Welcome to your MultiHead-Attention (required) programming assignment! You will build a **Sentiment Analysis** model using MultiHead-Attention. You will use the [Coronavirus tweets](https://www.kaggle.com/datasets/datatattle/covid-19-nlp-text-classification/data?select=Corona_NLP_train.csv) dataset, which contains 40k+ tweets from Twitter about the coronavirus.

Your goal is to build a text classifier that labels the sentiment of each tweet as positive, neutral, or negative.

**Instructions:**
* Do not modify any of the existing code.
* Only write code when prompted. For example, in some sections you will find one of the following markers:
```
# YOUR CODE GOES HERE
# YOUR CODE STARTS HERE
# TODO
```
Only modify those sections of the code.
* You will find a **REFLECTION** prompt under a few code cells, where you are asked to write your thoughts on or interpretation of the outputs.


**You will learn to:**
* Explore the [Corona Virus Tweets](https://www.kaggle.com/datasets/datatattle/covid-19-nlp-text-classification/data?select=Corona_NLP_train.csv) dataset.
* Clean the dataset before using it for training.
* Build a robust text classification model using just the MultiHead-Attention mechanism.
* Run inference with the trained model to make predictions.

# Corona Virus Tweet Classification using MultiHead-Attention Mechanism

Download the train/test csv files from [here](https://www.kaggle.com/datasets/datatattle/covid-19-nlp-text-classification/data?select=Corona_NLP_train.csv).

In [None]:
import pandas as pd
import numpy
import matplotlib.pyplot as plt
import re
import nltk
from tests import *

# Download the train/test csv files and load them
df_train = #TODO: Load using Pandas
df_test = #TODO: Load using Pandas
df_train.head()

In [None]:
df_train.shape

In [None]:
# Train and test set
x_train = df_train['OriginalTweet']
y_train = df_train['Sentiment']

x_test = df_test['OriginalTweet']
y_test = df_test['Sentiment']

In [None]:
# TODO: plot a bar plot to check the count of Sentiment values


**Expected Output:**

<img src="assets/distribution.png" width=400>

# Data Preparation

The data is prepared by carrying out the following steps:

* **Data cleaning:** Lowercasing and removing punctuation, URLs, HTML tags, special characters, etc.
* **Tokenizing and Padding:** Converting each tweet to a sequence of integer token ids and padding the sequences to a uniform length.
* **Sentiment Mapping and OneHot Encoding:** Mapping the sentiment labels to numerical values using a dictionary, then one-hot encoding them.

## Data Cleaning

Example tweet from the dataset:
```
My food stock is not the only one which is empty...\r\r\n\r\r\nPLEASE, don't panic, THERE WILL BE ENOUGH FOOD FOR EVERYONE if you do not take more than you need. \r\r\nStay calm, stay safe.\r\r\n\r\r\n#COVID19france #COVID_19 #COVID19 #coronavirus #confinement #Confinementotal #ConfinementGeneral https://t.co/zrlG0Z520j
```

We'll perform the following data cleaning steps, removing:
* **URLs**: This ensures that web addresses don't interfere with the sentiment analysis.

* **HTML Tags**: Any HTML tags present in the tweets are stripped.

* **Digits**: Numerical digits usually don't carry much sentiment.

* **Hashtags and Mentions**: Hashtags (e.g., #COVID19) and mentions (e.g., @WHO) mostly name topics and accounts rather than express sentiment.

* **Stop Words**: Common words like "the," "a," "is," etc., which appear in almost every tweet and usually add little signal.
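
The cleaning steps listed above can be sketched with regular expressions. This is only an illustration on a made-up tweet, not the reference solution: the name `demo_text_cleaner`, the regex patterns, and the tiny `DEMO_STOP_WORDS` set are assumptions, and the assignment expects the full English stop-word list instead.

```python
import re

# Tiny illustrative stop-word set (assumption); the assignment expects the
# full English stop-word list rather than this handful of words.
DEMO_STOP_WORDS = {"the", "a", "is", "to", "at", "for"}

def demo_text_cleaner(tweet, stop_words=DEMO_STOP_WORDS):
    """Hypothetical cleaner covering the five removal steps."""
    tweet = re.sub(r"https?://\S+|www\.\S+", " ", tweet)  # URLs
    tweet = re.sub(r"<[^>]+>", " ", tweet)                 # HTML tags
    tweet = re.sub(r"[@#]\w+", " ", tweet)                 # mentions and hashtags
    tweet = re.sub(r"\d+", " ", tweet)                     # digits
    # Drop stop words; split() also collapses the extra whitespace left above.
    words = [w for w in tweet.split() if w.lower() not in stop_words]
    return " ".join(words)

sample = "Stay calm <b>please</b> @WHO says the shops open at 9 #COVID19 https://t.co/zrlG0Z520j"
cleaned = demo_text_cleaner(sample)
print(cleaned)  # Stay calm please says shops open
```

Hashtags are removed before digits here so that a tag such as `#COVID19` disappears as a whole instead of leaving `#COVID` behind for a later step.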

In [None]:
# TODO

def text_cleaner(tweet):
    # TODO: remove urls

    # TODO: remove html tags

    # TODO: remove digits

    # TODO: remove hashtags

    # TODO: remove mentions

    #TODO: remove stop words

    return tweet

stop_words = #TODO: Load English stopwords

#TODO: Clean tweets using text_cleaner on x_train and x_test
x_train_clean =
x_test_clean =

x_train_clean.head()

## Tokenizing and Padding

Example:

```
Sentence:
Due COVID- retail store classroom Atlanta open walk-in business classes next two weeks, beginning Monday, March . We continue process online phone orders normal! Thank understanding!

After tokenizing :
[34, 1, 69, 4, 11239, 4874, 153, 665, 39, 104, 2637, 174, 172, 146, 812, 766, 186, 25, 267, 1487, 13, 802, 450, 326, 102, 2185]

After padding :
[   34     1    69     4 11239  4874   153   665    39   104  2637   174
   172   146   812   766   186    25   267  1487    13   802   450   326
   102  2185     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0]

```
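
To make the two steps in the example concrete, here is a minimal plain-Python sketch of what tokenizing and padding do. The helper names `build_word_index` and `texts_to_padded` are hypothetical, and the sketch only approximates the Keras utilities the assignment asks for (it does no punctuation filtering and has no out-of-vocabulary handling).

```python
from collections import Counter

def build_word_index(texts):
    """Assign each word an integer id, most frequent first, starting at 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Index 0 is kept free so it can be used as the padding value.
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_padded(texts, word_index, max_len):
    """Convert texts to id sequences, then truncate or zero-pad to max_len."""
    padded = []
    for text in texts:
        seq = [word_index[w] for w in text.lower().split() if w in word_index]
        seq = seq[:max_len]
        padded.append(seq + [0] * (max_len - len(seq)))  # zeros appended at the end
    return padded

texts = ["stay calm stay safe", "stay home"]
word_index = build_word_index(texts)
padded = texts_to_padded(texts, word_index, max_len=5)
print(word_index)  # {'stay': 1, 'calm': 2, 'safe': 3, 'home': 4}
print(padded)      # [[1, 2, 1, 3, 0], [1, 4, 0, 0, 0]]
```

The example output above has its zeros at the end of the sequence, which corresponds to `padding="post"` in Keras `pad_sequences`; the Keras default pads at the front instead.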

In [None]:
len(x_train[0])

In [None]:
# TODO

# TODO: Import Tokenizer and pad_sequences

#TODO: Define tokenizer and fit on clean input text
tokenizer =


#TODO: Convert the clean text to sequences
x_train_seqs =
x_test_seqs =

# TODO: Pad the train and test sequences
x_train_seqs =
x_test_seqs =



In [None]:
x_test_seqs.shape

## Sentiment Mapping and One Hot Encoding

Build a sentiment map to create 3 classes:

* **Negative [0]** -> Extremely Negative, Negative
* **Neutral [1]**
* **Positive [2]** -> Extremely Positive, Positive
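
As a small illustration of the mapping above followed by one-hot encoding, the sketch below works on three made-up labels. The names `demo_sentiment_map` and `one_hot` are hypothetical; `one_hot` is a plain-Python stand-in for what `to_categorical` produces, not the function the assignment uses.

```python
# Collapse the five raw labels into three class ids, as in the list above.
demo_sentiment_map = {
    "Extremely Negative": 0,
    "Negative": 0,
    "Neutral": 1,
    "Positive": 2,
    "Extremely Positive": 2,
}

def one_hot(class_ids, num_classes=3):
    """Turn integer class ids into one-hot rows."""
    return [[1.0 if i == c else 0.0 for i in range(num_classes)] for c in class_ids]

raw_labels = ["Neutral", "Extremely Positive", "Negative"]
class_ids = [demo_sentiment_map[label] for label in raw_labels]
encoded = one_hot(class_ids)
print(class_ids)  # [1, 2, 0]
print(encoded)    # [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
```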


In [None]:
# TODO

# TODO: Map the sentiment labels to numerical values using a dictionary.
sentiments =

# TODO: Map the train and test target labels to above sentiments dictionary
y_train =
y_test =



In [None]:
# TODO: One hot encode target and test target

from tensorflow.keras.utils import to_categorical

y_train =
y_test =

print(y_train.shape)
print(y_test.shape)

**REFLECTION**

\<Why do you think we have to one hot encode our target values?>

In [None]:
validate_data_preparation(x_train_seqs, x_test_seqs, y_train, y_test)


# Model Training and Evaluation
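
The model in this section is built around a multi-head self-attention layer. As a rough picture of what such a layer computes, here is a NumPy sketch of scaled dot-product attention split across heads. It is a simplification under stated assumptions, not the Keras implementation: the function name `multi_head_self_attention` is hypothetical, each head gets `d_model // num_heads` dimensions, the weights are random instead of trained, and there are no biases, masking, or dropout.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """x: (seq_len, d_model); every weight matrix: (d_model, d_model)."""
    seq_len, d_model = x.shape
    head_dim = d_model // num_heads

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, head_dim)
        return t.reshape(seq_len, num_heads, head_dim).transpose(1, 0, 2)

    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)  # (heads, seq, seq)
    weights = softmax(scores, axis=-1)                      # each row sums to 1
    heads = weights @ v                                     # (heads, seq, head_dim)
    # Concatenate the heads back into (seq_len, d_model) and mix them.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o, weights

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 8, 2
x = rng.normal(size=(seq_len, d_model))  # stand-in for embedded tokens
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
output, weights = multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads)
print(output.shape, weights.shape)  # (5, 8) (2, 5, 5)
```

Each head produces its own `seq_len x seq_len` weight matrix, so different heads can attend to different token relationships within the same tweet.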


In [None]:
# TODO

# TODO: Import the necessary layers from TensorFlow

# Define the model
def multihead_attention_model(vocab_size, embedding_dim, num_heads, ff_dim):
    # TODO: Define an input layer

    # TODO: Define an Embedding layer

    # TODO: Define a Multi-head attention layer

    # TODO: Define a feed forward layer


    # TODO: Define a LayerNormalization layer

    # TODO: Define a Global Average Pooling layer to get a fixed-length representation

    # TODO: Define Output layer with 3 units corresponding to the number of classes

    model = Model(inputs=inputs, outputs=outputs)
    return model

In [None]:
# TODO: Define model parameters
vocab_size =
embedding_dim =
num_heads =
ff_dim =

# TODO: Build the model
model =
model.summary()

**REFLECTION**

\<Write your thoughts about the model structure here>

In [None]:
# TODO

def main(model):

  # TODO: Compile the model with a categorical crossentropy loss


  # TODO: Train the model
  history =

  return model, history

if __name__ == "__main__":
    model, history = main(model)

**REFLECTION**

\<Write your observations here. Why is the model behaving the way it is?>

To pass this test, **achieve at least 85% test accuracy**.

In [None]:
from sklearn.metrics import classification_report

# Evaluate the model
loss, test_accuracy = model.evaluate(x_test_seqs, y_test)
print('Test accuracy:', test_accuracy)

# Get predictions for the test set
y_pred = model.predict(x_test_seqs)

# Convert predictions to class labels (0, 1, 2)
y_pred_labels = numpy.argmax(y_pred, axis=1)

# Convert true labels (y_test) to class labels
y_true_labels = numpy.argmax(y_test, axis=1)

# Calculate classification report
# You can also use other metrics like precision, recall, F1-score
print(classification_report(y_true_labels, y_pred_labels))

test_model_accuracy(test_accuracy)

**REFLECTION**

\<Write your observations here>

In [None]:
plot_metrics(history)

**REFLECTION**

\<Write your observations here>

# Improvement Strategies

Here are some strategies you can consider to improve the model:

1. **Embedding Dimension (embedding_dim)**: Try values like 64, 128, 256, or 512.
2. **Number of Heads (num_heads)**: Experiment with values like 4, 8, or 16.
3. **Feedforward Dimension (ff_dim)**: Try values like 128, 256, or 512.
4. **Add More Layers**: Consider adding more multi-head attention layers and feedforward layers to increase the model's capacity to learn complex patterns.
5. **Learned Positional Embeddings**: Add a trainable embedding layer to represent the position of each word.
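
Strategy 5 can be pictured as a second lookup table indexed by position and added to the token embeddings. The NumPy sketch below only shows the lookup and the sum. The table names and sizes are made up for illustration and the values are random, whereas in a trained model both tables would be learned (for example as two embedding layers).

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_len, embedding_dim = 50, 6, 4  # illustrative sizes (assumption)

# Random stand-ins for the two trainable lookup tables.
token_table = rng.normal(size=(vocab_size, embedding_dim))
position_table = rng.normal(size=(max_len, embedding_dim))

# One padded sequence in which token id 7 appears at positions 1 and 2.
token_ids = np.array([[3, 7, 7, 0, 0, 0]])
positions = np.arange(max_len)

# Token vectors plus position vectors (broadcast over the batch dimension).
x = token_table[token_ids] + position_table[positions]
print(x.shape)                      # (1, 6, 4)
print(np.allclose(x[0, 1], x[0, 2]))  # False: same token, different positions
```

Without the position term, the two occurrences of token 7 would have identical vectors, and attention followed by average pooling would have no information about word order.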

# Inference

In [None]:
from sklearn.metrics import accuracy_score, classification_report

labels = ['Negative', 'Neutral', 'Positive']
sentence = "I dont know what to do!"

predicted_sentiment = run_inference(model, text_cleaner, tokenizer, sentence, labels)

print(f"Input Sentence: {sentence}")
print(f"Predicted Sentiment: {predicted_sentiment}")