## Introduction
Semantic textual similarity deals with determining how similar a pair of text documents is. The goal of the first task is to implement a new architecture by combining the ideas from the following papers:
- Siamese Recurrent Architectures for Learning Sentence Similarity, Jonas Mueller et al. (referred to as the AAAI paper)
- A Structured Self-Attentive Sentence Embedding, Zhouhan Lin et al. (referred to as the ICLR paper) <br/><br/>
Furthermore, you will evaluate whether the new architecture improves on the results of **Siamese Recurrent Architectures for Learning Sentence Similarity, Jonas Mueller et al.** Your overall network architecture should look similar to the following figure.
![Network architecture diagram](https://raw.githubusercontent.com/shahrukhx01/ocr-test/main/download.png)
<br/><br/>


Moreover, you will need to implement further helper functions that these papers propose, e.g., the attention penalty term for the loss.

### SICK dataset
We will use the SICK dataset throughout the project (at least in the first two tasks). For more information about the dataset, you can refer to the original [paper](http://www.lrec-conf.org/proceedings/lrec2014/pdf/363_Paper.pdf). You can download the dataset using one of the following links:
- [dataset page 1](https://marcobaroni.org/composes/sick.html)
- [dataset page 2](https://huggingface.co/datasets/sick)    

The relevant columns for the project are `sentence_A`, `sentence_B`, `relatedness_score`, where `relatedness_score` is the label. <br><br>
**Hint: For each task make sure to decide whether the label should be normalized or not.**<br><br>

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import torch
import test
import sts_data
from importlib import reload

## Part 1. Data pipeline (3 points)
Before working on the model, we must configure the data pipeline to load the data in the correct format. Please implement the functions for processing the data.

### Part 1.1 Loading and preprocessing the data (1 point)
Download the SICK dataset and store it in [pandas](https://pandas.pydata.org/docs/index.html) `DataFrame`s. You should use the official data split.  

Implement the `load_data` method of the `STSData` class in `sts_data.py`. The method must download the dataset and perform basic preprocessing. The minimal preprocessing required is:  
1. normalize text to lower case
2. remove punctuation  
3. remove [stopwords](https://en.wikipedia.org/wiki/Stop_word) - we provided you with the list of English stopwords.
4. Optionally, any other preprocessing that you deem necessary.

All the preprocessing code must be contained in the `preprocessing.py` file.  
You can use Hugging Face's [datasets library](https://huggingface.co/docs/datasets/) for easy dataset download.
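
The three mandatory preprocessing steps can be sketched roughly as follows. This is only a minimal illustration, not the expected contents of `preprocessing.py`: the function name `preprocess` and the tiny `STOPWORDS` set are hypothetical stand-ins (the project ships its own stopword list), and only ASCII punctuation is removed.

```python
import string

# Hypothetical stand-in for the stopword list provided with the project.
STOPWORDS = {"a", "an", "the", "is", "are", "in", "on", "of", "and"}

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess(text, stopwords=STOPWORDS):
    """Lowercase a sentence, strip punctuation, and drop stopwords."""
    text = text.lower().translate(_PUNCT_TABLE)
    tokens = [tok for tok in text.split() if tok not in stopwords]
    return " ".join(tokens)


print(preprocess("A man is playing the guitar, loudly!"))
```

Applied to a pandas column, a function like this could be used through `Series.apply` on both `sentence_A` and `sentence_B`.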

### Part 1.2 Building vocabulary (1 point)
Before we can feed our text to the model, it must be vectorized. We use 300-dimensional pretrained [FastText embeddings](https://fasttext.cc/docs/en/english-vectors.html) to map words to vectors. For general background on embeddings, you can refer to [this video](https://www.youtube.com/watch?v=ERibwqs9p38) (although we use a different type of embedding, FastText rather than the Word2Vec described in the video, their general purpose is the same).  
To apply the embeddings, we must first construct the vocabulary for the data. Complete the `create_vocab` method of the `STSData` class in `sts_data.py`, where you concatenate each sentence pair, tokenize it, and construct the vocabulary for the whole training data. You should use [torchtext](https://torchtext.readthedocs.io/en/latest/data.html) for processing the data. For tokenization, you can use any library (or write your own tokenizer), but we recommend the tokenizer from [spacy](https://spacy.io/). Use `fasttext.simple.300d` as the pretrained vectors.  
In the end, you must have a vocabulary object capable of mapping your input to the corresponding vectors. Remember that the vocabulary is created using only the training data (without touching the validation or test data).
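
To make the idea of a vocabulary concrete, here is a library-free sketch of the token-to-index mapping. It is an illustration only: the task itself asks for torchtext and pretrained vectors, whereas this sketch uses whitespace tokenization as a stand-in for spaCy, and the names `build_vocab`, `PAD_TOKEN`, and `UNK_TOKEN` are hypothetical.

```python
from collections import Counter

PAD_TOKEN, UNK_TOKEN = "<pad>", "<unk>"  # hypothetical special tokens


def build_vocab(sentence_pairs, min_freq=1):
    """Map every token seen in the training pairs to an integer index."""
    counter = Counter()
    for sent_a, sent_b in sentence_pairs:
        # Concatenate the pair and split on whitespace (stand-in for spaCy).
        counter.update(f"{sent_a} {sent_b}".split())
    # Index 0 is reserved for padding and index 1 for unknown tokens.
    itos = [PAD_TOKEN, UNK_TOKEN]
    # Sort by descending frequency, then alphabetically, for a stable order.
    ordered = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
    itos += [tok for tok, freq in ordered if freq >= min_freq]
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return stoi, itos


train_pairs = [
    ("man playing guitar", "man playing flute"),
    ("dog running", "dog sleeping"),
]
stoi, itos = build_vocab(train_pairs)
# Tokens outside the training data fall back to the unknown index.
print(stoi["man"], stoi.get("cat", stoi[UNK_TOKEN]))
```

Because only `train_pairs` is passed in, words that occur solely in the validation or test split end up mapped to the unknown index, which is the behaviour the task asks for.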

### Part 1.3 Creating DataLoader (1 point)
Implement the `get_data_loader` method of the `STSData` class in `sts_data.py`. It must perform the following operations on each of the data splits:
1. vectorize each pair of the sentences by replacing all tokens with their index in vocabulary
2. normalize labels
3. convert everything to PyTorch tensors
4. pad every sentence so that all of them have the same length
5. create `STSDataset` from `dataset.py`
6. create PyTorch DataLoader out of the created dataset. 


We have provided you with the interfaces of possible helper functions, but you can change them as needed.   
In the end, you must have three data loaders, one for each split.
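
Steps 1, 2, and 4 above can be sketched in plain Python before anything is converted to tensors. The helper names `vectorize`, `pad`, and `normalize_label` are hypothetical, and dividing by 5.0 is an assumption based on the `normalization_const=5.0` setting used below together with SICK's relatedness scores ranging from 1 to 5.

```python
def vectorize(sentence, stoi, unk_index=1):
    """Replace each whitespace token with its vocabulary index."""
    return [stoi.get(tok, unk_index) for tok in sentence.split()]


def pad(sequence, max_len, pad_index=0):
    """Right-pad (or truncate) a list of indices to exactly max_len entries."""
    return sequence[:max_len] + [pad_index] * max(0, max_len - len(sequence))


def normalize_label(score, normalization_const=5.0):
    """Scale a relatedness score from [1, 5] into [0.2, 1.0]."""
    return score / normalization_const


# Toy vocabulary; in the project this would come from the training data.
stoi = {"<pad>": 0, "<unk>": 1, "man": 2, "playing": 3, "guitar": 4}
ids = pad(vectorize("man playing violin", stoi), max_len=5)
print(ids, normalize_label(4.5))
```

The padded index lists and normalized labels would then be wrapped in tensors and handed to `STSDataset` and a PyTorch `DataLoader`.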

In [None]:
reload(sts_data)
from sts_data import STSData

columns_mapping = {
        "sent1": "sentence_A",
        "sent2": "sentence_B",
        "label": "relatedness_score",
    }
dataset_name = "sick"
sick_data = STSData(
    dataset_name=dataset_name,
    columns_mapping=columns_mapping,
    normalize_labels=True,
    normalization_const=5.0,
)
batch_size = 64
sick_dataloaders = sick_data.get_data_loader(batch_size=batch_size)

INFO:root:loading and preprocessing data...
INFO:root:reading and preprocessing data completed...
INFO:root:creating vocabulary...
INFO:torchtext.vocab:Loading vectors from .vector_cache/wiki.simple.vec.pt
INFO:root:creating vocabulary completed...
INFO:root:creating STSDataset completed...
INFO:root:creating dataloaders completed...


## Part 2. Model Configuration & Hyperparameter Tuning (3 points)
In this part, you are required to define a model capable of learning self-attentive sentence embeddings described in [this ICLR paper](https://arxiv.org/pdf/1703.03130.pdf). The sentence embedding learned by this model will be used for computing the similarity score instead of the simpler embeddings described in the original AAAI paper.  
Please familiarize yourself with the model described in the ICLR paper and implement `SiameseBiLSTMAttention` and `SelfAttention` classes in `siamese_lstm_attention.py`. Remember that you must run the model on each sentence in the sentence pair to calculate the similarity between them. You can use `similarity_score` from `utils.py` to compute the similarity score between two sentences. 
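
For orientation, the core quantities from the two papers can be summarized as follows. Here $H \in \mathbb{R}^{n \times 2u}$ stacks the BiLSTM hidden states of an $n$-token sentence, $W_{s1} \in \mathbb{R}^{d_a \times 2u}$ and $W_{s2} \in \mathbb{R}^{r \times d_a}$ are learned weights, and the softmax is taken over the $n$ tokens. Whether `similarity_score` in `utils.py` uses exactly the similarity function of the AAAI paper is an assumption.

$$
A = \operatorname{softmax}\!\left(W_{s2} \tanh\!\left(W_{s1} H^{\top}\right)\right) \in \mathbb{R}^{r \times n}, \qquad M = A H \in \mathbb{R}^{r \times 2u}
$$

$$
P = \left\lVert A A^{\top} - I \right\rVert_F^{2}, \qquad g(h_a, h_b) = \exp\!\left(-\lVert h_a - h_b \rVert_1\right) \in (0, 1]
$$

The penalty $P$ is added to the loss, scaled by the penalty coefficient, and encourages the $r$ attention rows to focus on different parts of the sentence.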
  
To get more theoretical information about attention mechanisms you can refer to [this chapter](https://web.stanford.edu/~jurafsky/slp3/10.pdf) of ["Speech and Language Processing" book](https://web.stanford.edu/~jurafsky/slp3/) by Dan Jurafsky and James H. Martin, where the attention mechanism is described in the context of the machine translation task. 
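
The self-attention layer and penalty term of the ICLR paper could be sketched in PyTorch roughly as below. This is not the reference solution for `siamese_lstm_attention.py`: the class name `SelfAttentionSketch` and the function `attention_penalty` are hypothetical, masking of padded positions is omitted, and averaging the penalty over the batch is a design assumption.

```python
import torch
import torch.nn as nn


class SelfAttentionSketch(nn.Module):
    """Structured self-attention: A = softmax(W_s2 tanh(W_s1 H^T))."""

    def __init__(self, input_size, da, r):
        super().__init__()
        self.w_s1 = nn.Linear(input_size, da, bias=False)
        self.w_s2 = nn.Linear(da, r, bias=False)

    def forward(self, lstm_out):
        # lstm_out: (batch, seq_len, input_size)
        scores = self.w_s2(torch.tanh(self.w_s1(lstm_out)))  # (batch, seq_len, r)
        # Softmax over the token dimension, then move the r hops to dim 1.
        attention = torch.softmax(scores, dim=1).transpose(1, 2)  # (batch, r, seq_len)
        embedding = torch.bmm(attention, lstm_out)  # (batch, r, input_size)
        return embedding, attention


def attention_penalty(attention):
    """Squared Frobenius norm of (A A^T - I), averaged over the batch."""
    r = attention.size(1)
    identity = torch.eye(r, device=attention.device)
    gram = torch.bmm(attention, attention.transpose(1, 2))  # (batch, r, r)
    return ((gram - identity) ** 2).sum(dim=(1, 2)).mean()


torch.manual_seed(0)
layer = SelfAttentionSketch(input_size=2 * 128, da=150, r=20)
hidden_states = torch.randn(4, 12, 2 * 128)  # 4 sentences, 12 tokens each
embedding, attention = layer(hidden_states)
print(embedding.shape, attention.shape, attention_penalty(attention).item())
```

In a Siamese setup the same layer instance would be applied to both sentences of a pair, so the two branches share their weights.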

Finally, once your implementation works on the default parameters stated below, make sure to perform **hyperparameter tuning** to find the best combination of hyperparameters.

In [None]:
output_size = 1
hidden_size = 128
vocab_size = len(sick_data.vocab)
embedding_size = 300
embedding_weights = sick_data.vocab.vectors
lstm_layers = 4
learning_rate = 1e-1
fc_hidden_size = 64
max_epochs = 20
bidirectional = True
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
## self attention config
self_attention_config = {
    "hidden_size": 150,  ## refers to variable 'da' in the ICLR paper
    "output_size": 20,  ## refers to variable 'r' in the ICLR paper
    "penalty": 0.0,  ## refers to penalty coefficient term in the ICLR paper
}

In [None]:
from siamese_lstm_attention import SiameseBiLSTMAttention
from train import train_model
from test import evaluate_test_set
from tuning import tune_model
import optuna

In [None]:
## Hyperparameter Tuning

# We use the Optuna framework here (reference below). It is a comprehensive platform-independent framework
# with inbuilt sampling and pruning mechanisms. We found it helpful to use Optuna along with PyTorch.

results = tune_model(sick_data, sick_dataloaders)
for key, value in results.best_params.items():
    print("{}: {}".format(key, value))

[I 2022-03-11 12:57:07,507] A new study created in memory with name: no-name-89857217-9050-4cc6-a9a3-3db05824ecf3
100%|██████████| 20/20 [11:44<00:00, 35.24s/it]
[I 2022-03-11 13:08:52,318] Trial 0 finished with value: 0.001 and parameters: {'hidden_size': 128, 'lstm_layers': 12, 'learning_rate': 1.0, 'fc_hidden_size': 112, 'da': 50, 'r': 20, 'penalty': 0.4}. Best is trial 0 with value: 0.001.
100%|██████████| 20/20 [02:48<00:00,  8.41s/it]
[I 2022-03-11 13:11:40,448] Trial 1 finished with value: 0.45686687703647466 and parameters: {'hidden_size': 64, 'lstm_layers': 4, 'learning_rate': 1e-05, 'fc_hidden_size': 64, 'da': 250, 'r': 10, 'penalty': 0.4}. Best is trial 1 with value: 0.45686687703647466.
100%|██████████| 20/20 [07:38<00:00, 22.92s/it]
[I 2022-03-11 13:19:18,902] Trial 2 finished with value: 0.40026872829181787 and parameters: {'hidden_size': 128, 'lstm_layers': 8, 'learning_rate': 0.001, 'fc_hidden_size': 96, 'da': 300, 'r': 10

In [None]:
# Plotting hyperparameter importances
optuna.visualization.plot_param_importances(results)

In [None]:
# Plotting parameter results for self-attention
optuna.visualization.plot_parallel_coordinate(results, params=["da", "r","penalty"])

In [None]:
best_params = results.best_params
output_size = 1
hidden_size = best_params['hidden_size']
vocab_size = len(sick_data.vocab)
embedding_size = 300
embedding_weights = sick_data.vocab.vectors
lstm_layers = best_params['lstm_layers']
learning_rate = best_params['learning_rate']
fc_hidden_size = best_params['fc_hidden_size']
max_epochs = 20
bidirectional = False
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
## self attention config
self_attention_config = {
    "hidden_size": best_params['da'],  ## refers to variable 'da' in the ICLR paper
    "output_size": best_params['r'],  ## refers to variable 'r' in the ICLR paper
    "penalty": best_params['penalty'],  ## refers to penalty coefficient term in the ICLR paper
}

In [None]:
## init siamese lstm
siamese_lstm_attention = SiameseBiLSTMAttention(
    batch_size=batch_size,
    output_size=output_size,
    hidden_size=hidden_size,
    vocab_size=vocab_size,
    embedding_size=embedding_size,
    embedding_weights=embedding_weights,
    lstm_layers=lstm_layers,
    self_attention_config=self_attention_config,
    fc_hidden_size=fc_hidden_size,
    device=device,
    bidirectional=bidirectional,
)
## move model to device
siamese_lstm_attention.to(device)
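## pass the tuned learning rate; without `lr`, Adam falls back to its default of 1e-3
optimizer = torch.optim.Adam(params=siamese_lstm_attention.parameters(), lr=learning_rate)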

## Part 3. Training (2 points)  
Perform the final training of the model by implementing the functions in `train.py` after setting the values of your best-chosen hyperparameters. Note that you can use the same training function when performing hyperparameter tuning.
- **What is a good choice of performance metric here for evaluating your model?** [Max 2-3 lines]
- **What other performance evaluation metric can we use here for this task? Motivate your answer.** [Max 2-3 lines]

In [None]:
tot_val_acc = train_model(
    model=siamese_lstm_attention,
    optimizer=optimizer,
    dataloader=sick_dataloaders,
    data=sick_data,
    max_epochs=max_epochs,
    config_dict={
        "device": device,
        "model_name": "siamese_lstm_attention",
        "self_attention_config": self_attention_config,
    },
)

INFO:root:Starting training...
  0%|          | 0/20 [00:00<?, ?it/s]INFO:root:Epoch 0:
INFO:root:Accuracy: 0.015615549750802957 Training Loss: 9.211551666259766
INFO:root:Evaluating accuracy on dev set
INFO:root:Train loss: 9.211551666259766 - acc: 0.015615549750802957 -- Validation loss: 0.5284193158149719 - acc: 0.01879272700172839
  5%|▌         | 1/20 [00:03<01:12,  3.81s/it]INFO:root:Epoch 1:
INFO:root:Accuracy: 0.05644307565969653 Training Loss: 4.692107677459717
INFO:root:Evaluating accuracy on dev set
INFO:root:Train loss: 4.692107677459717 - acc: 0.05644307565969653 -- Validation loss: 0.5687175393104553 - acc: 0.012255406359314436
 10%|█         | 2/20 [00:07<01:08,  3.80s/it]INFO:root:Epoch 2:
INFO:root:Accuracy: 0.0734773561410553 Training Loss: 4.637417316436768
INFO:root:Evaluating accuracy on dev set
INFO:root:Train loss: 4.637417316436768 - acc: 0.0734773561410553 -- Validation loss: 0.5248135924339294 - acc: 0.029531897582795247
 15%|█▌        | 3/20 [00:11<01:05,  3.

__Comments__
1. We use the *Pearson correlation coefficient* as the evaluation metric for this task. In statistics, it measures the extent to which two variables are linearly dependent on each other. Its simplicity makes it easy to calculate the correlation between two vectors, so it is widely used in the sub-fields of semantic similarity and textual entailment to compare scores for sentence pairs. It is therefore the primary metric for the SemEval task evaluated in the AAAI paper by Mueller and Thyagarajan, and it is also our preferred choice.

2. Another performance metric that can be used, also from the SemEval tasks, is *Spearman's rank correlation*. It differs from the Pearson correlation in that it measures the monotonicity of the relationship between the variables using a rank-based scheme, whereas Pearson assumes linearity. It is therefore considered alongside Pearson's as a secondary evaluation metric.
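
The difference between the two metrics can be illustrated with a small, dependency-free sketch. The functions `pearson`, `ranks`, and `spearman` are hypothetical helpers written for this illustration (in practice a library routine such as those in SciPy would normally be used), and the toy scores are made up.

```python
import math


def pearson(x, y):
    """Pearson correlation: covariance divided by the product of spreads."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """Average 1-based ranks, so tied values share the same rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))


gold = [1.0, 2.0, 3.0, 4.0, 5.0]
pred = [1.0, 4.0, 9.0, 16.0, 25.0]  # monotonic in gold, but not linear
print(pearson(gold, pred), spearman(gold, pred))
```

Here the predictions preserve the ordering of the gold scores perfectly, so the rank-based metric is 1, while the Pearson coefficient is slightly lower because the relationship is not linear.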

## Part 4. Evaluation and Analysis (2 points)  
Implement the function `evaluate_test_set` to compute the final value of the performance evaluation metric on the test data.  
Compare the result with the original AAAI paper. Comment on the effect of the penalty loss on model capacity. Did the inclusion of the self-attention block improve the results? If yes, then how? Can you think of additional techniques to improve the results? Briefly answer these questions in the markdown cells.

In [None]:
siamese_lstm_attention.load_state_dict(torch.load('siamese_lstm_attention.pth'))
siamese_lstm_attention.eval()
evaluate_test_set(
    model=siamese_lstm_attention,
    data_loader=sick_dataloaders,
    config_dict={
        "device": device,
        "model_name": "siamese_lstm_attention",
        "self_attention_config": self_attention_config,
    },
)

INFO:root:Evaluating accuracy on test set
Accuracy: 0.5606613688378107 Test Loss: 3.543844699859619


The similarity task here yields a Pearson coefficient of 0.56, compared to the MaLSTM result of 0.8345 from the AAAI paper. <br> The penalty loss helps the model produce meaningful predictions from the sentence embeddings. In our implementation of a model without the penalty and with the default parameters, the predictions had similar values for every sentence. Introducing the penalty helped us overcome this redundancy. <br>
We were unable to make any distinct observations on whether the model performs better in isolation without self-attention. However, on setting the attention parameter $r$ to 1 (the embedding becomes a plain vector) and the penalty to zero, the model performance drops to about 0.48, giving us an indication that self-attention probably works to our benefit. <br> To improve the model performance, we might consider replacing the LSTM block with a unidirectional GRU. This is based on the results from another paper on the SICK dataset for the similarity task, which reports better results than the MaLSTM model. [1] 

[1] https://aclanthology.org/R19-1116.pdf

# References:

Text Processing: <br>
[1] https://www.kaggle.com/sudalairajkumar/getting-started-with-text-preprocessing <br>
Batching of embeddings <br>
[2] https://github.com/ngarneau/understanding-pytorch-batching-lstm/blob/master/Understanding%20Pytorch%20Batching.ipynb <br>
Training <br>
[3] https://pytorch.org/tutorials/beginner/basics/data_tutorial.html <br>
Implementation <br>
[4] https://www.kaggle.com/ashishlepcha/semantic-text-similarity <br>
[5] https://github.com/simonjisu/nsmc_study <br>
[6] https://github.com/yufengm/SelfAttentive <br>
Project Organization <br>
[7] https://github.com/shahrukhx01/nnti_hindi_bengali_sentiment_analysis <br>
Hyperparameter Optimization (Optuna) <br>
[8] https://optuna.org