## Introduction
Semantic textual similarity deals with determining how similar two pieces of text are. The goal of the first task is to implement a new architecture by combining the ideas from the following papers:
- Siamese Recurrent Architectures for Learning Sentence Similarity, Jonas Mueller et al. (referred to as the AAAI paper)
- A Structured Self-Attentive Sentence Embedding, Zhouhan Lin et al. (referred to as the ICLR paper) <br/><br/>
Furthermore, you will evaluate whether the new architecture improves on the results of **Siamese Recurrent Architectures for Learning Sentence Similarity, Jonas Mueller et al.** Your overall network architecture should look similar to the following figure. 
![Untitled%20Diagram.drawio%20%281%29.png](https://raw.githubusercontent.com/shahrukhx01/ocr-test/main/download.png)
<br/><br/>


Moreover, you will be required to implement the helper functions that these papers propose, e.g., the attention penalty term for the loss.

### SICK dataset
We will use the SICK dataset throughout the project (at least in the first two tasks). For more information about the dataset, refer to the original [paper](http://www.lrec-conf.org/proceedings/lrec2014/pdf/363_Paper.pdf). You can download the dataset using one of the following links:
- [dataset page 1](https://marcobaroni.org/composes/sick.html)
- [dataset page 2](https://huggingface.co/datasets/sick)    

The relevant columns for the project are `sentence_A`, `sentence_B`, `relatedness_score`, where `relatedness_score` is the label. <br><br>
**Hint: For each task make sure to decide whether the label should be normalized or not.**<br><br>

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import torch
from test import evaluate_test_set
import sts_data
import siamese_lstm_attention
import train
import test
from importlib import reload

## Part 1. Data pipeline (3 points)
Before starting working on the model, we must configure the data pipeline to load the data in the correct format. Please, implement the functions for processing the data.

### Part 1.1 Loading and preprocessing the data (1 point)
Download the SICK dataset and store it in [pandas](https://pandas.pydata.org/docs/index.html) `DataFrame`s. You should use the official data split.  

Implement the `load_data` method of the `STSData` class in `sts_data.py`. The method must download the dataset and perform basic preprocessing. Minimal preprocessing required:  
1. normalize text to lower case
2. remove punctuation  
3. remove [stopwords](https://en.wikipedia.org/wiki/Stop_word) - we provided you with the list of English stopwords.
4. Optionally, any other preprocessing that you deem necessary.

All the preprocessing code must be contained in the `preprocessing.py` file.  
You can use Hugging Face's [datasets library](https://huggingface.co/docs/datasets/) for easy dataset download.
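
The minimal preprocessing steps listed above could be sketched roughly as follows. This is only an illustration, not the contents of `preprocessing.py`: the function name `preprocess` and the tiny `STOPWORDS` set are assumptions made for the example, and the stopword list provided with the project would be used in practice.

```python
import string

# Tiny illustrative stopword set (hypothetical); use the provided list in practice.
STOPWORDS = {"a", "an", "the", "is", "are", "in", "on", "of", "and"}

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess(text, stopwords=STOPWORDS):
    """Lower-case a sentence, strip punctuation, and drop stopwords."""
    text = text.lower().translate(_PUNCT_TABLE)
    tokens = [tok for tok in text.split() if tok not in stopwords]
    return " ".join(tokens)


print(preprocess("A man is playing the guitar, on a stage!"))
```

Applying the function to both sentence columns of each split keeps the preprocessing identical for training, validation, and test data.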

### Part 1.2 Building vocabulary (1 point)
Before we can feed our text to the model, it must be vectorized. We use 300-dimensional pretrained [FastText embeddings](https://fasttext.cc/docs/en/english-vectors.html) for mapping words to vectors. For general background on embeddings, you can refer to [this video](https://www.youtube.com/watch?v=ERibwqs9p38) (the video describes Word2Vec whereas we use FastText, but the general purpose of the two is the same).  
In order to apply the embedding, we must first construct the vocabulary for the data. Complete the `create_vocab` method of the `STSData` class in `sts_data.py`, where you concatenate each sentence pair, tokenize it, and construct the vocabulary for the whole training data. You should use [torchtext](https://torchtext.readthedocs.io/en/latest/data.html) for processing the data. For tokenization, you can use any library (or write your own tokenizer), but we recommend the tokenizer from [spaCy](https://spacy.io/). Use `fasttext.simple.300d` as the pretrained vectors.  
In the end, you must have a vocabulary object capable of mapping your input to the corresponding vectors. Remember that the vocabulary is created using only the training data (not touching the validation or test data).
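
To make the idea of a training-only vocabulary concrete, here is a library-free sketch. It deliberately does not use torchtext or spaCy: whitespace splitting stands in for the tokenizer, and the names `build_vocab`, `vectorize`, `PAD`, and `UNK` are assumptions introduced for this example.

```python
from collections import Counter

PAD, UNK = "<pad>", "<unk>"


def build_vocab(sentence_pairs, min_freq=1):
    """Map each token seen in the training pairs to an integer index."""
    counts = Counter()
    for sent_a, sent_b in sentence_pairs:
        # Concatenate the pair and split on whitespace (a stand-in for spaCy).
        counts.update(f"{sent_a} {sent_b}".split())
    # Reserve index 0 for padding and index 1 for unknown tokens.
    itos = [PAD, UNK] + sorted(tok for tok, c in counts.items() if c >= min_freq)
    return {tok: idx for idx, tok in enumerate(itos)}


def vectorize(sentence, stoi):
    """Replace every token by its index, falling back to <unk>."""
    return [stoi.get(tok, stoi[UNK]) for tok in sentence.split()]


train_pairs = [("man playing guitar", "man plays guitar"), ("dog runs", "dog running")]
stoi = build_vocab(train_pairs)
print(vectorize("man playing piano", stoi))
```

Because "piano" never occurs in the training pairs, it maps to the `<unk>` index, which is the behaviour validation and test tokens outside the training vocabulary would get.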

### Part 1.3 Creating DataLoader (1 point)
Implement the `get_data_loader` method of the `STSData` class in `sts_data.py`. It must perform the following operations on each of the data splits:
1. vectorize each pair of sentences by replacing all tokens with their index in the vocabulary
2. normalize labels
3. convert everything to PyTorch tensors
4. pad every sentence so that all of them have the same length
5. create `STSDataset` from `dataset.py`
6. create PyTorch DataLoader out of the created dataset. 


We have provided you with the interfaces of possible helper functions, but you can change them as you need.   
In the end, you must have three data loaders, one for each split.
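
The padding, label normalization, and batching steps could look roughly like the sketch below. It assumes the sentences are already vectorized; `make_loader` is a hypothetical helper, and `TensorDataset` is used here only as a stand-in for the project's `STSDataset`.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, TensorDataset


def make_loader(ids_a, ids_b, scores, batch_size=2, norm_const=5.0, pad_idx=0):
    """Pad index sequences, normalize labels, and wrap them in a DataLoader."""
    # Pad both sides together so that every sentence has the same length.
    seqs = [torch.tensor(s, dtype=torch.long) for s in ids_a + ids_b]
    padded = pad_sequence(seqs, batch_first=True, padding_value=pad_idx)
    sent_a, sent_b = padded[: len(ids_a)], padded[len(ids_a):]
    # Scale relatedness scores by the normalization constant.
    labels = torch.tensor(scores, dtype=torch.float) / norm_const
    # TensorDataset stands in for the project's STSDataset in this sketch.
    return DataLoader(TensorDataset(sent_a, sent_b, labels), batch_size=batch_size)


loader = make_loader([[4, 5, 3], [2, 8]], [[4, 6], [2, 7, 9, 3]], [4.5, 3.0])
batch_a, batch_b, batch_y = next(iter(loader))
print(batch_a.shape, batch_b.shape, batch_y)
```

Padding to a single global length is the simplest option; padding per batch in a custom `collate_fn` would be an alternative if memory becomes a concern.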

In [3]:
reload(sts_data)
from sts_data import STSData

columns_mapping = {
        "sent1": "sentence_A",
        "sent2": "sentence_B",
        "label": "relatedness_score",
    }
dataset_name = "sick"
sick_data = STSData(
    dataset_name=dataset_name,
    columns_mapping=columns_mapping,
    normalize_labels=True,
    normalization_const=5.0,
)
batch_size = 64
sick_dataloaders = sick_data.get_data_loader(batch_size=batch_size) # returns 3 dataloaders

INFO:root:loading and preprocessing data...
INFO:root:reading and preprocessing data completed...
INFO:root:creating vocabulary...
INFO:torchtext.vocab:Loading vectors from .vector_cache\wiki.simple.vec.pt
INFO:root:creating vocabulary completed...


## Part 2. Model Configuration & Hyperparameter Tuning (3 points)
In this part, you are required to define a model capable of learning self-attentive sentence embeddings described in [this ICLR paper](https://arxiv.org/pdf/1703.03130.pdf). The sentence embedding learned by this model will be used for computing the similarity score instead of the simpler embeddings described in the original AAAI paper.  
Please familiarize yourself with the model described in the ICLR paper and implement `SiameseBiLSTMAttention` and `SelfAttention` classes in `siamese_lstm_attention.py`. Remember that you must run the model on each sentence in the sentence pair to calculate the similarity between them. You can use `similarity_score` from `utils.py` to compute the similarity score between two sentences. 
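
As a rough orientation, the core computations from the two papers can be sketched as below. This is not the reference implementation of `SelfAttention` or `similarity_score`: the class name `SelfAttentionSketch`, the helper names, the bias-free linear layers, and averaging the penalty over the batch are all assumptions made for the example. It shows the annotation matrix A = softmax(W_s2 tanh(W_s1 H^T)), the embedding M = AH, the penalty ||AA^T - I||_F^2 from the ICLR paper, and the exp(-L1) similarity used in the AAAI paper.

```python
import torch
import torch.nn as nn


class SelfAttentionSketch(nn.Module):
    """Computes A = softmax(W_s2 tanh(W_s1 H^T)) and M = A H for a batch."""

    def __init__(self, input_size, d_a, r):
        super().__init__()
        self.W_s1 = nn.Linear(input_size, d_a, bias=False)
        self.W_s2 = nn.Linear(d_a, r, bias=False)

    def forward(self, H):
        # H: (batch, seq_len, input_size) holds the BiLSTM hidden states.
        scores = self.W_s2(torch.tanh(self.W_s1(H)))      # (batch, seq_len, r)
        # Softmax over the sequence dimension, then put the r hops first.
        A = torch.softmax(scores, dim=1).transpose(1, 2)  # (batch, r, seq_len)
        M = torch.bmm(A, H)                               # (batch, r, input_size)
        return M, A


def attention_penalty(A):
    """Squared Frobenius norm of (A A^T - I), averaged over the batch."""
    eye = torch.eye(A.size(1), device=A.device).unsqueeze(0)
    diff = torch.bmm(A, A.transpose(1, 2)) - eye
    return (diff ** 2).sum(dim=(1, 2)).mean()


def manhattan_similarity(M1, M2):
    """exp(-L1 distance) between flattened sentence embeddings."""
    return torch.exp(-(M1 - M2).flatten(1).abs().sum(dim=1))


torch.manual_seed(0)
attn = SelfAttentionSketch(input_size=256, d_a=350, r=30)
H1, H2 = torch.randn(4, 12, 256), torch.randn(4, 12, 256)
(M1, A1), (M2, _) = attn(H1), attn(H2)
print(M1.shape, attention_penalty(A1).item(), manhattan_similarity(M1, M2))
```

The same attention module is applied to both sentences, which is what makes the network Siamese. With `r = 30` hops over a bidirectional LSTM of hidden size 128, flattening `M` gives 30 × 256 = 7680 features per sentence.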
  
To get more theoretical information about attention mechanisms you can refer to [this chapter](https://web.stanford.edu/~jurafsky/slp3/10.pdf) of ["Speech and Language Processing" book](https://web.stanford.edu/~jurafsky/slp3/) by Dan Jurafsky and James H. Martin, where the attention mechanism is described in the context of the machine translation task. 

Finally, once your implementation works on the default parameters stated below, make sure to perform **hyperparameter tuning** to find the best combination of hyperparameters.
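
One simple way to organise the tuning is an exhaustive grid search, sketched below. The helper `grid_search` and the scoring function `fake_train_and_validate` are hypothetical: a real scoring function would train the model with the given settings and return a validation score, whereas the stand-in here just returns a made-up number so that the example runs on its own.

```python
import itertools


def grid_search(train_and_validate, grid):
    """Try every combination in `grid` and return the best (config, score)."""
    best_config, best_score = None, float("-inf")
    keys = list(grid)
    for values in itertools.product(*(grid[k] for k in keys)):
        config = dict(zip(keys, values))
        score = train_and_validate(**config)  # higher is better
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


def fake_train_and_validate(hidden_size, learning_rate, lstm_layers):
    """Hypothetical stand-in; a real version would train and validate the model."""
    return -abs(hidden_size - 128) / 128 - abs(learning_rate - 1e-2) - 0.01 * lstm_layers


grid = {
    "hidden_size": [64, 128, 256],
    "learning_rate": [1e-3, 1e-2],
    "lstm_layers": [1, 2],
}
best_config, best_score = grid_search(fake_train_and_validate, grid)
print(best_config)
```

Since each configuration requires a full training run, a small grid or a random subset of combinations is usually more practical than a large exhaustive search.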

In [22]:
# hyperparameters selected after hyperparam. tuning

output_size = 1
hidden_size = 128
vocab_size = len(sick_data.vocab)
embedding_size = 300
embedding_weights = sick_data.vocab.vectors
lstm_layers = 5
learning_rate = 1e-2
fc_hidden_size = 32
max_epochs = 5
bidirectional = True
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
## self attention config
self_attention_config = {
    "hidden_size": 350,  ## refers to variable 'da' in the ICLR paper
    "output_size": 30,  ## refers to variable 'r' in the ICLR paper
    "penalty": 1,  ## refers to penalty coefficient term in the ICLR paper
}

In [23]:
#reload(siamese_lstm_attention)
from siamese_lstm_attention import SiameseBiLSTMAttention
## init siamese lstm
siamese_lstm_attention_model = SiameseBiLSTMAttention(
    batch_size=batch_size,
    output_size=output_size,
    hidden_size=hidden_size,
    vocab_size=vocab_size,
    embedding_size=embedding_size,
    embedding_weights=embedding_weights,
    lstm_layers=lstm_layers,
    self_attention_config=self_attention_config,
    fc_hidden_size=fc_hidden_size,
    device=device,
    bidirectional=bidirectional,
)
## move model to device
siamese_lstm_attention_model.to(device)

300


SiameseBiLSTMAttention(
  (embeddings): Embedding(2052, 300)
  (bilstm): LSTM(300, 128, num_layers=5, bidirectional=True)
  (W_s1): Linear(in_features=256, out_features=350, bias=True)
  (dropout1): Dropout(p=0.1, inplace=False)
  (W_s2): Linear(in_features=350, out_features=30, bias=True)
  (fc_layer): Linear(in_features=7680, out_features=32, bias=True)
)

## Part 3. Training (2 points)  
Perform the final training of the model by implementing the functions in `train.py` after setting the values of your best-chosen hyperparameters. Note that you can use the same training function when performing hyperparameter tuning.
- **What is a good choice of performance metric here for evaluating your model?** [Max 2-3 lines]
- **What other performance evaluation metric can we use here for this task? Motivate your answer.** [Max 2-3 lines]

### Part 3 Answers

1) A good choice of performance metric is the mean squared error (MSE), since the relatedness score is a quantitative value and we are measuring how far the predicted score is from the gold score. We have therefore used (1 - MSE) over the normalised label values as the accuracy metric: the larger the (1 - MSE) value, the better the accuracy. 

2) Another good performance evaluation metric could be the Pearson correlation coefficient between the predicted and gold relatedness scores. This metric lies between -1 and 1: a value of 1 indicates perfect positive linear correlation between predictions and labels, and a value of 0 indicates no linear correlation.
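
A small worked example may help to see how the two metrics differ. The functions `mse` and `pearson_r` below are plain-Python sketches written for this illustration (they are not taken from `train.py`), and the scores are made-up numbers.

```python
import math


def mse(preds, golds):
    """Mean squared error between predictions and gold scores."""
    return sum((p - g) ** 2 for p, g in zip(preds, golds)) / len(golds)


def pearson_r(preds, golds):
    """Pearson correlation coefficient between predictions and gold scores."""
    n = len(golds)
    mean_p, mean_g = sum(preds) / n, sum(golds) / n
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(preds, golds))
    var_p = sum((p - mean_p) ** 2 for p in preds)
    var_g = sum((g - mean_g) ** 2 for g in golds)
    return cov / math.sqrt(var_p * var_g)


golds = [0.2, 0.4, 0.6, 0.8]
preds = [0.3, 0.5, 0.7, 0.9]  # same ordering and spacing, shifted by 0.1
print(round(mse(preds, golds), 4), round(pearson_r(preds, golds), 4))
```

In this example the predictions are consistently off by 0.1, so the MSE is non-zero even though the predictions are perfectly linearly related to the labels. Reporting both metrics therefore gives complementary information.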

In [24]:
reload(train)
from train import train_model
import torch.optim as optim


optimizer = optim.Adam(siamese_lstm_attention_model.parameters(), lr=learning_rate, betas=(0.9, 0.98))

siamese_lstm_attention = train_model(
    model=siamese_lstm_attention_model,
    optimizer=optimizer,
    dataloader=sick_dataloaders,
    data=sick_data,
    max_epochs=max_epochs,
    config_dict={
        "device": device,
        "model_name": "siamese_lstm_attention",
        "self_attention_config": self_attention_config,
    },
)

  0%|                                                                                            | 0/5 [00:00<?, ?it/s]

Running EPOCH 1
Running loss:  11.115558910369874
Training set accuracy: 0.8248226478695869
Running loss:  10.977493572235108
Training set accuracy: 0.9495640270411968
Running loss:  10.832866001129151
Training set accuracy: 0.9558994088321924
Running loss:  10.707572174072265
Training set accuracy: 0.95419748313725
Running loss:  10.331046009063721
Training set accuracy: 0.9337305121123791
Running loss:  9.7387225151062
Training set accuracy: 0.9081865184009075


INFO:root:Evaluating accuracy on dev set


Evaluating validation set ....
Validation loss: 8.798
Validation set accuracy: 0.903
Validation loss: 8.809
Validation set accuracy: 0.892
Validation loss: 8.785
Validation set accuracy: 0.916
Validation loss: 8.775
Validation set accuracy: 0.927
Validation loss: 8.823
Validation set accuracy: 0.878
Validation loss: 8.788
Validation set accuracy: 0.913


INFO:root:new model saved
INFO:root:Train loss: 10.428025245666504 - acc: 0.9163627695778142 -- Validation loss: 8.775415420532227 - acc: 0.9080489522644452
 20%|████████████████▌                                                                  | 1/5 [05:15<21:02, 315.61s/it]

Validation loss: 8.775
Validation set accuracy: 0.926
Finished Training
Running EPOCH 2
Running loss:  8.834084033966064
Training set accuracy: 0.8863781534135342
Running loss:  8.493098640441895
Training set accuracy: 0.9065636806190014
Running loss:  8.321365928649902
Training set accuracy: 0.9294439375400543
Running loss:  8.1313325881958
Training set accuracy: 0.9536877188831567
Running loss:  8.084994792938232
Training set accuracy: 0.9588215101510287
Running loss:  8.060502433776856
Training set accuracy: 0.9607950186356902


INFO:root:Evaluating accuracy on dev set


Evaluating validation set ....
Validation loss: 8.105
Validation set accuracy: 0.904
Validation loss: 8.123
Validation set accuracy: 0.886
Validation loss: 8.085
Validation set accuracy: 0.925
Validation loss: 8.083
Validation set accuracy: 0.927
Validation loss: 8.127
Validation set accuracy: 0.882
Validation loss: 8.096
Validation set accuracy: 0.913


INFO:root:new model saved
INFO:root:Train loss: 8.285820007324219 - acc: 0.9364562571534644 -- Validation loss: 8.081256866455078 - acc: 0.909367635846138
 40%|█████████████████████████████████▏                                                 | 2/5 [11:19<16:30, 330.23s/it]

Validation loss: 8.081
Validation set accuracy: 0.928
Finished Training
Running EPOCH 3
Running loss:  8.055578136444092
Training set accuracy: 0.9558352651074529
Running loss:  8.05319766998291
Training set accuracy: 0.9537899188697339
Running loss:  8.049210262298583
Training set accuracy: 0.954549840092659
Running loss:  8.0414888381958
Training set accuracy: 0.9627291116863489
Running loss:  8.03413667678833
Training set accuracy: 0.9681146085262299
Running loss:  8.031125259399413
Training set accuracy: 0.9706734491512179


INFO:root:Evaluating accuracy on dev set


Evaluating validation set ....
Validation loss: 8.061
Validation set accuracy: 0.939
Validation loss: 8.070
Validation set accuracy: 0.931
Validation loss: 8.050
Validation set accuracy: 0.951
Validation loss: 8.050
Validation set accuracy: 0.951
Validation loss: 8.063
Validation set accuracy: 0.938
Validation loss: 8.063
Validation set accuracy: 0.938


INFO:root:new model saved
INFO:root:Train loss: 8.042633056640625 - acc: 0.9619960479060377 -- Validation loss: 8.044320106506348 - acc: 0.9436368202524525
 60%|█████████████████████████████████████████████████▊                                 | 3/5 [17:04<11:08, 334.48s/it]

Validation loss: 8.044
Validation set accuracy: 0.956
Finished Training
Running EPOCH 4
Running loss:  8.031366634368897
Training set accuracy: 0.9700438816100359
Running loss:  8.027882194519043
Training set accuracy: 0.9734340833500028
Running loss:  8.028804492950439
Training set accuracy: 0.972716435790062
Running loss:  8.027166175842286
Training set accuracy: 0.9742087926715612
Running loss:  8.033147144317628
Training set accuracy: 0.9683932859450579
Running loss:  8.031132793426513
Training set accuracy: 0.9710642091929913


INFO:root:Evaluating accuracy on dev set


Evaluating validation set ....
Validation loss: 8.036
Validation set accuracy: 0.965
Validation loss: 8.037
Validation set accuracy: 0.963
Validation loss: 8.034
Validation set accuracy: 0.967
Validation loss: 8.046
Validation set accuracy: 0.955
Validation loss: 8.041
Validation set accuracy: 0.959
Validation loss: 8.039
Validation set accuracy: 0.962


INFO:root:new model saved
INFO:root:Train loss: 8.029426574707031 - acc: 0.9721133809374727 -- Validation loss: 8.02437686920166 - acc: 0.9636766527380262
 80%|██████████████████████████████████████████████████████████████████▍                | 4/5 [21:55<05:21, 321.46s/it]

Validation loss: 8.024
Validation set accuracy: 0.976
Finished Training
Running EPOCH 5
Running loss:  8.023860836029053
Training set accuracy: 0.9777472160756588
Running loss:  8.023838233947753
Training set accuracy: 0.9770548658445477
Running loss:  8.024714374542237
Training set accuracy: 0.9762678811326623
Running loss:  8.024611854553223
Training set accuracy: 0.9766347996890545
Running loss:  8.025888919830322
Training set accuracy: 0.975245013460517
Running loss:  8.024997901916503
Training set accuracy: 0.976062455959618


INFO:root:Evaluating accuracy on dev set


Evaluating validation set ....
Validation loss: 8.039
Validation set accuracy: 0.961
Validation loss: 8.048
Validation set accuracy: 0.953
Validation loss: 8.042
Validation set accuracy: 0.958
Validation loss: 8.041
Validation set accuracy: 0.960
Validation loss: 8.050
Validation set accuracy: 0.951
Validation loss: 8.045
Validation set accuracy: 0.955


INFO:root:Train loss: 8.024439811706543 - acc: 0.9766973451427792 -- Validation loss: 8.034225463867188 - acc: 0.9577606873852866
100%|███████████████████████████████████████████████████████████████████████████████████| 5/5 [26:34<00:00, 318.97s/it]

Validation loss: 8.034
Validation set accuracy: 0.966
Finished Training





## Part 4. Evaluation and Analysis (2 points)  
Implement the function `evaluate_test_set` to calculate the final value of the performance evaluation metric on the test data.  
Compare the result with the original AAAI paper. Comment on the effect of the penalty loss on model capacity. Did the inclusion of the self-attention block improve the results? If yes, then how? Can you think of additional techniques to improve the results? Briefly answer these questions in the markdown cells.

In [25]:
reload(test)
evaluate_test_set(
    model=siamese_lstm_attention,
    data_loader=sick_dataloaders,
    config_dict={
        "device": device,
        "model_name": "siamese_lstm_attention",
        "self_attention_config": self_attention_config,
    },
)



INFO:root:Evaluating accuracy on test set
INFO:root:Evaluating accuracy on test set


Finished testing..............
Total test set accuracy: 0.576


### Part 4 Answers

We obtain an MSE of (1 - 0.576) = 0.424, whereas the AAAI paper reports an MSE of 0.2286. Hence, we do not achieve a better result merely by adding the self-attention block. Note: our MSE is computed over the normalised label values, whereas the paper's figure appears to be on the original 1 to 5 relatedness scale, so the two numbers are not directly comparable. The gap may also be due to the fact that the paper uses different regularisation techniques from the ones we used. In general, we expect the self-attention block to improve results, as it assigns higher importance to words inferred from context, i.e. the words that matter most for the meaning of a sentence receive higher weights.

The penalty term acts as a regularisation term for the self-attention block: it is added to the MSE loss and, as proposed in the ICLR paper, discourages the different attention hops from focusing on the same words. To some extent this prevents the model from overfitting. Without the penalty term, the self-attention block may put a lot of weight on words that are not very relevant to the context. In our analysis we observed that the accuracy increases when the penalty term is added, i.e. the final accuracy is higher when a penalty coefficient of 1 is included.

To improve the results, various other forms of regularisation and initialisation can be used, e.g. initialising the hidden states of the BiLSTM with random Gaussian entries. 
Training can also be run for more epochs, with early stopping employed to keep the model that gives the lowest validation set error. Additional dropout layers can also be added to prevent the model from overfitting.
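
Early stopping, mentioned above, could be sketched as follows. The `EarlyStopping` class is a hypothetical helper and not part of `train.py`; the validation losses are the epoch-level values from the training log in this notebook, rounded to three decimals.

```python
class EarlyStopping:
    """Signals a stop once validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=2, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Epoch-level validation losses from the training log above.
val_losses = [8.775, 8.081, 8.044, 8.024, 8.034]
stopper = EarlyStopping(patience=2)
stopped_at = None
for epoch, loss in enumerate(val_losses, start=1):
    if stopper.step(loss):
        stopped_at = epoch
        break
print(stopper.best_loss, stopped_at)
```

With a patience of 2, the five logged epochs would not trigger a stop, because only the last epoch fails to improve on the best validation loss. Training for more epochs would be needed for the criterion to take effect.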

### Code References:

1. Self-attentive sentence embeddings (for Task 1 model reference): https://github.com/prakashpandey9/Text-Classification-Pytorch
2. Self-attentive sentence embeddings (for Task 1 utility functions reference, i.e. Frobenius norm calculation etc.): https://github.com/kaushalshetty/Structured-Self-Attention
3. Hugging Face - SICK dataset loading

### Extra libraries used:
1. Gensim - for easy stopword removal
2. HuggingFace - to load the SICK dataset directly with the official train, valid, test partitions

