## Introduction
Semantic textual similarity (STS) measures how similar two pieces of text are in meaning. The goal of the first task is to implement a new architecture that combines the ideas from two papers:
- Siamese Recurrent Architectures for Learning Sentence Similarity, Jonas Mueller et al. (referred to below as the AAAI paper)
- A Structured Self-Attentive Sentence Embedding, Zhouhan Lin et al. (referred to below as the ICLR paper) <br/><br/>
You will also evaluate whether the new architecture improves on the results of **Siamese Recurrent Architectures for Learning Sentence Similarity, Jonas Mueller et al.** Your overall network architecture should look similar to the following figure.
![Untitled%20Diagram.drawio%20%281%29.png](https://raw.githubusercontent.com/shahrukhx01/ocr-test/main/download.png)
<br/><br/>


In addition, you must implement the helper functions that these papers propose, e.g., the attention penalty term for the loss.

### SICK dataset
We will use the SICK dataset throughout the project (at least in the first two tasks). For more information about the dataset, refer to the original [paper](http://www.lrec-conf.org/proceedings/lrec2014/pdf/363_Paper.pdf). You can download the dataset from either of the following links:
- [dataset page 1](https://marcobaroni.org/composes/sick.html)
- [dataset page 2](https://huggingface.co/datasets/sick)    

The relevant columns for the project are `sentence_A`, `sentence_B`, `relatedness_score`, where `relatedness_score` is the label. <br><br>
**Hint: For each task make sure to decide whether the label should be normalized or not.**<br><br>
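As an illustration of what normalizing the label can mean, here is a minimal sketch. SICK relatedness scores lie between 1 and 5, so dividing by a constant of 5.0 maps them into the range 0.2 to 1.0. The function names `normalize_label` and `denormalize_label` are hypothetical and not part of the provided code.

```python
def normalize_label(score: float, normalization_const: float = 5.0) -> float:
    """Scale a relatedness score by dividing it by a constant (hypothetical helper)."""
    return score / normalization_const


def denormalize_label(value: float, normalization_const: float = 5.0) -> float:
    """Map a normalized value back to the original score scale."""
    return value * normalization_const


# The extremes of the SICK scale (1 and 5) after normalization
print(normalize_label(1.0), normalize_label(5.0))
```

Whether this scaling is appropriate depends on the range of the similarity function your model outputs, which is the decision the hint asks you to make.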

In [1]:
%load_ext autoreload
%autoreload 2

In [3]:
import torch
from test import evaluate_test_set
import sts_data
import siamese_lstm_attention
import train
import test
from importlib import reload

## Part 1. Data pipeline (3 points)
Before working on the model, we must configure the data pipeline to load the data in the correct format. Please implement the functions for processing the data.

### Part 1.1 Loading and preprocessing the data (1 point)
Download the SICK dataset and store it in [pandas](https://pandas.pydata.org/docs/index.html) `DataFrame`s. You should use the official data split.  

Implement the `load_data` method of the `STSData` class in `sts_data.py`. The method must download the dataset and perform basic preprocessing. Minimal preprocessing required:  
1. normalize text to lower case
2. remove punctuation  
3. remove [stopwords](https://en.wikipedia.org/wiki/Stop_word) (we have provided you with a list of English stopwords)
4. optionally, any other preprocessing that you deem necessary

All the preprocessing code must be contained in the `preprocessing.py` file.  
You can use Hugging Face's [datasets library](https://huggingface.co/docs/datasets/) for easy dataset download.
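The three required preprocessing steps can be sketched with the standard library alone. This is only an illustration: the function name `preprocess` and the tiny `STOPWORDS` set are assumptions, and the real stopword list is the one provided with the assignment.

```python
import string

# Tiny illustrative stopword set (assumption); use the provided list in practice.
STOPWORDS = {"a", "an", "the", "is", "are", "in", "on", "of", "and"}

# Translation table that deletes every ASCII punctuation character
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess(text: str, stopwords=STOPWORDS) -> str:
    """Lowercase, strip punctuation, and drop stopwords from one sentence."""
    text = text.lower()
    text = text.translate(_PUNCT_TABLE)
    tokens = [tok for tok in text.split() if tok not in stopwords]
    return " ".join(tokens)


print(preprocess("A group of kids is playing in a yard, and an old man is standing!"))
# prints: group kids playing yard old man standing
```

Whitespace splitting is used here only for brevity; a proper tokenizer may treat contractions and hyphenated words differently.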

### Part 1.2 Building vocabulary (1 point)
Before we can feed our text to the model, it must be vectorized. We use 300-dimensional pretrained [FastText embeddings](https://fasttext.cc/docs/en/english-vectors.html) to map words to vectors. For general background on embeddings, you can refer to [this video](https://www.youtube.com/watch?v=ERibwqs9p38) (the video describes Word2Vec whereas we use FastText, but the general purpose of both is the same).  
To apply the embeddings, we must first construct the vocabulary for the data. Complete the `create_vocab` method of the `STSData` class in `sts_data.py`, where you concatenate each sentence pair, tokenize it, and construct the vocabulary for the whole training data. You should use [torchtext](https://torchtext.readthedocs.io/en/latest/data.html) for processing the data. For tokenization, you can use any library (or write your own tokenizer), but we recommend the tokenizer from [spaCy](https://spacy.io/). Use `fasttext.simple.300d` as the pretrained vectors.  
In the end, you must have a vocabulary object capable of mapping your input to the corresponding vectors. Remember that the vocabulary is created using only the training data (without touching the validation or test data).
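To make the idea concrete, the following is a plain-Python stand-in for the vocabulary step. It is not the torchtext API the assignment asks for, and the names `build_vocab`, `vectorize`, `PAD`, and `UNK` are hypothetical. It shows the two properties that matter: the vocabulary is built from training pairs only, and unseen tokens fall back to an unknown-token index.

```python
from collections import Counter

PAD, UNK = "<pad>", "<unk>"  # special tokens (assumed names)


def build_vocab(sentence_pairs, min_freq: int = 1):
    """Build token<->index mappings from training sentence pairs only."""
    counter = Counter()
    for sent_a, sent_b in sentence_pairs:
        # Concatenate the pair, then tokenize on whitespace
        counter.update((sent_a + " " + sent_b).split())
    kept = sorted(tok for tok, count in counter.items() if count >= min_freq)
    itos = [PAD, UNK] + kept
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return stoi, itos


def vectorize(sentence: str, stoi) -> list:
    """Replace each token with its index; unseen tokens map to UNK."""
    return [stoi.get(tok, stoi[UNK]) for tok in sentence.split()]


train_pairs = [
    ("kids playing yard", "kids playing outdoors"),
    ("man standing", "old man standing"),
]
stoi, itos = build_vocab(train_pairs)
print(len(itos))                                 # 9
print(vectorize("kids playing chess", stoi))     # [2, 6, 1]
```

With torchtext, the same mapping additionally carries the pretrained vectors, so index `i` of the vocabulary lines up with row `i` of the embedding matrix.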

### Part 1.3 Creating DataLoader (1 point)
Implement the `get_data_loader` method of the `STSData` class in `sts_data.py`. It must perform the following operations on each of the data splits:
1. vectorize each pair of sentences by replacing every token with its index in the vocabulary
2. normalize labels
3. convert everything to PyTorch tensors
4. pad every sentence so that all of them have the same length
5. create an `STSDataset` (from `dataset.py`)
6. create a PyTorch `DataLoader` from the created dataset

We have provided you with the interfaces of possible helper functions, but you can change them as you need.   
In the end, you must have three data loaders, one for each split.
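The steps above can be sketched on toy data as follows. This is a minimal sketch under stated assumptions: `pad_to_length` is a hypothetical helper, `TensorDataset` stands in for the course's `STSDataset`, padding index 0 is assumed, and labels are divided by 5.0 as one possible normalization.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def pad_to_length(token_ids, max_len, pad_idx=0):
    """Truncate or right-pad a list of token indices to exactly max_len."""
    return token_ids[:max_len] + [pad_idx] * max(0, max_len - len(token_ids))


# Toy vectorized sentence pairs and raw relatedness labels
sent1_ids = [[2, 6, 8], [3, 7]]
sent2_ids = [[2, 6, 5], [4, 3, 7, 9]]
labels = [4.5, 3.2]

# Pad both sides to the longest sentence seen
max_len = max(len(s) for s in sent1_ids + sent2_ids)
sent1 = torch.tensor([pad_to_length(s, max_len) for s in sent1_ids], dtype=torch.long)
sent2 = torch.tensor([pad_to_length(s, max_len) for s in sent2_ids], dtype=torch.long)
targets = torch.tensor(labels, dtype=torch.float) / 5.0  # assumed normalization

# TensorDataset is a stand-in for STSDataset here
loader = DataLoader(TensorDataset(sent1, sent2, targets), batch_size=2, shuffle=False)
batch_sent1, batch_sent2, batch_targets = next(iter(loader))
print(batch_sent1.shape, batch_sent2.shape, batch_targets.shape)
```

In practice you would typically shuffle only the training loader and keep the validation and test loaders in a fixed order.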

In [5]:
reload(sts_data)
from sts_data import STSData

columns_mapping = {
        "sent1": "sentence_A",
        "sent2": "sentence_B",
        "label": "relatedness_score",
    }
dataset_name = "sick"
sick_data = STSData(
    dataset_name=dataset_name,
    columns_mapping=columns_mapping,
    normalize_labels=True,
    normalization_const=5.0,
)
batch_size = 64
sick_dataloaders = sick_data.get_data_loader(batch_size=batch_size)

INFO:root:loading and preprocessing data...
INFO:root:reading and preprocessing data completed...
INFO:root:creating vocabulary...
INFO:torchtext.vocab:Loading vectors from .vector_cache\wiki.simple.vec.pt
INFO:root:creating vocabulary completed...


## Part 2. Model Configuration & Hyperparameter Tuning (3 points)
In this part, you are required to define a model capable of learning self-attentive sentence embeddings described in [this ICLR paper](https://arxiv.org/pdf/1703.03130.pdf). The sentence embedding learned by this model will be used for computing the similarity score instead of the simpler embeddings described in the original AAAI paper.  
Please familiarize yourself with the model described in the ICLR paper and implement the `SiameseBiLSTMAttention` and `SelfAttention` classes in `siamese_lstm_attention.py`. Remember that you must run the model on each sentence of the pair to calculate the similarity between them. You can use `similarity_score` from `utils.py` to compute the similarity score between two sentences. 
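For reference, the AAAI paper scores a sentence pair with the exponential of the negative Manhattan (L1) distance between the two sentence representations, which yields a value in (0, 1]. The sketch below implements that formula under a hypothetical name, `manhattan_similarity`; the provided `similarity_score` in `utils.py` may be implemented differently.

```python
import torch


def manhattan_similarity(emb_a: torch.Tensor, emb_b: torch.Tensor) -> torch.Tensor:
    """Compute exp(-||a - b||_1) for each row of two (batch, dim) tensors."""
    l1_distance = torch.sum(torch.abs(emb_a - emb_b), dim=1)
    return torch.exp(-l1_distance)


a = torch.tensor([[0.1, 0.2], [0.5, 0.5]])
b = torch.tensor([[0.1, 0.2], [0.0, 0.0]])
# First pair is identical (similarity 1); second pair has L1 distance 1
print(manhattan_similarity(a, b))
```

Because the output is bounded by 1, this similarity pairs naturally with labels that have been scaled into a comparable range.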
  
For more theoretical background on attention mechanisms, you can refer to [this chapter](https://web.stanford.edu/~jurafsky/slp3/10.pdf) of the book ["Speech and Language Processing"](https://web.stanford.edu/~jurafsky/slp3/) by Dan Jurafsky and James H. Martin, where the attention mechanism is described in the context of machine translation. 
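The structured self-attention of the ICLR paper computes an attention matrix A = softmax(W_s2 tanh(W_s1 Hᵀ)) over the BiLSTM hidden states H and a sentence matrix M = AH. A minimal sketch of that computation is given below. The class name `SelfAttentionSketch` is hypothetical, the layers omit the bias term to follow the paper's formula, and the sizes in the usage lines (512, 350, 30) are taken from the default configuration in this notebook.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SelfAttentionSketch(nn.Module):
    """Structured self-attention over a sequence of hidden states (sketch)."""

    def __init__(self, input_size: int, hidden_size: int, output_size: int):
        super().__init__()
        # W_s1 projects each hidden state to d_a dims; W_s2 produces r attention hops
        self.W_s1 = nn.Linear(input_size, hidden_size, bias=False)
        self.W_s2 = nn.Linear(hidden_size, output_size, bias=False)

    def forward(self, lstm_out: torch.Tensor):
        # lstm_out: (batch, seq_len, input_size)
        scores = self.W_s2(torch.tanh(self.W_s1(lstm_out)))  # (batch, seq_len, r)
        # Softmax over the sequence dimension, then put hops first
        attention = F.softmax(scores, dim=1).transpose(1, 2)  # (batch, r, seq_len)
        sentence_matrix = torch.bmm(attention, lstm_out)      # (batch, r, input_size)
        return sentence_matrix, attention


torch.manual_seed(0)
hidden_states = torch.randn(2, 7, 512)  # batch of 2, 7 tokens, 2 * 256 BiLSTM features
attn = SelfAttentionSketch(input_size=512, hidden_size=350, output_size=30)
M, A = attn(hidden_states)
print(M.shape, A.shape)
```

Flattening `M` gives 30 × 512 = 15360 features per sentence, which is the input size a following fully connected layer would need with these defaults. Padding positions are not masked in this sketch.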

Finally, once your implementation works with the default parameters stated below, make sure to perform **hyperparameter tuning** to find the best combination of hyperparameters.

In [48]:
output_size = 1
hidden_size = 256
vocab_size = len(sick_data.vocab)
embedding_size = 300
embedding_weights = sick_data.vocab.vectors
lstm_layers = 4
learning_rate = 1e-2
fc_hidden_size = 1024
max_epochs = 5
bidirectional = True
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
## self attention config
self_attention_config = {
    "hidden_size": 350,  ## refers to variable 'da' in the ICLR paper
    "output_size": 30,  ## refers to variable 'r' in the ICLR paper
    "penalty": 1,  ## refers to penalty coefficient term in the ICLR paper
}
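The `penalty` coefficient above weights the redundancy penalty from the ICLR paper, P = ||AAᵀ − I||_F², which pushes the r attention hops to focus on different parts of the sentence. A minimal sketch follows; the function name `attention_penalty` and the choice to average over the batch are assumptions, not part of the provided code.

```python
import torch


def attention_penalty(attention: torch.Tensor) -> torch.Tensor:
    """Squared Frobenius norm of (A A^T - I), averaged over the batch.

    attention: tensor of shape (batch, r, seq_len) whose rows sum to 1.
    """
    r = attention.size(1)
    identity = torch.eye(r, device=attention.device, dtype=attention.dtype)
    gram = torch.bmm(attention, attention.transpose(1, 2))  # (batch, r, r)
    return ((gram - identity) ** 2).sum(dim=(1, 2)).mean()


# Each hop attends to a distinct token: no redundancy, so the penalty is zero
distinct = torch.eye(3).unsqueeze(0)
# Both hops attend uniformly to all four tokens: redundant, so the penalty is positive
redundant = torch.full((1, 2, 4), 0.25)
print(attention_penalty(distinct).item(), attention_penalty(redundant).item())
```

During training, this term would be multiplied by the penalty coefficient and added to the regression loss.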

In [49]:
#reload(siamese_lstm_attention)
from siamese_lstm_attention import SiameseBiLSTMAttention
## init siamese lstm
siamese_lstm_attention_model = SiameseBiLSTMAttention(
    batch_size=batch_size,
    output_size=output_size,
    hidden_size=hidden_size,
    vocab_size=vocab_size,
    embedding_size=embedding_size,
    embedding_weights=embedding_weights,
    lstm_layers=lstm_layers,
    self_attention_config=self_attention_config,
    fc_hidden_size=fc_hidden_size,
    device=device,
    bidirectional=bidirectional,
)
## move model to device
siamese_lstm_attention_model.to(device)

300


  torch.nn.init.xavier_uniform(self.W_s1.weight)
  torch.nn.init.xavier_uniform(self.W_s2.weight)
  torch.nn.init.xavier_uniform(self.fc_layer.weight)


SiameseBiLSTMAttention(
  (embeddings): Embedding(2052, 300)
  (bilstm): LSTM(300, 256, num_layers=4, bidirectional=True)
  (W_s1): Linear(in_features=512, out_features=350, bias=True)
  (W_s2): Linear(in_features=350, out_features=30, bias=True)
  (fc_layer): Linear(in_features=15360, out_features=1024, bias=True)
)

## Part 3. Training (2 points)  
Perform the final training of the model by implementing the functions in `train.py` after setting the values of your best-chosen hyperparameters. Note that you can use the same training function when performing hyperparameter tuning.
- **What is a good choice of performance metric here for evaluating your model?** [Max 2-3 lines]
- **What other performance evaluation metric can we use here for this task? Motivate your answer.** [Max 2-3 lines]
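Relatedness prediction is a regression task, and two metrics commonly reported for it are the Pearson correlation between predicted and gold scores and the mean squared error. The sketch below shows how both could be computed in plain Python; the helper names `pearson_r` and `mse` are hypothetical, and the numbers are toy values rather than results.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def mse(xs, ys):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


gold = [1.0, 2.0, 3.0]
pred = [2.0, 4.0, 6.0]  # perfectly correlated with gold, but on a different scale
print(pearson_r(gold, pred), mse(gold, pred))
```

The toy example illustrates why the two metrics complement each other: the correlation is perfect although every prediction is off, which only the error measure reveals.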

In [50]:
reload(train)
from train import train_model
import torch.optim as optim

# Adam with a non-default second-moment decay rate (beta2 = 0.98)
optimizer = optim.Adam(siamese_lstm_attention_model.parameters(), lr=learning_rate, betas=(0.9, 0.98))

# Use a separate name for the result so the imported `siamese_lstm_attention` module is not shadowed
trained_model = train_model(
    model=siamese_lstm_attention_model,
    optimizer=optimizer,
    dataloader=sick_dataloaders,
    data=sick_data,
    max_epochs=max_epochs,
    config_dict={
        "device": device,
        "model_name": "siamese_lstm_attention",
        "self_attention_config": self_attention_config,
    },
)

  0%|                                                                                            | 0/5 [00:00<?, ?it/s]

Running EPOCH 1
Running loss:  11.384491634368896
Training set accuracy: 0.5808462861925363
Running loss:  11.147420024871826
Training set accuracy: 0.7710365246981382
Running loss:  10.833740234375
Training set accuracy: 0.941816332936287
Running loss:  10.64198179244995
Training set accuracy: 0.9463532537221908
Running loss:  10.370060920715332
Training set accuracy: 0.953131376951933
Running loss:  10.058574867248534
Training set accuracy: 0.9604000316932797


INFO:root:Evaluating accuracy on dev set


Evaluating validation set ....
Validation loss: 9.756
Validation set accuracy: 0.959
Validation loss: 9.755
Validation set accuracy: 0.961
Validation loss: 9.755
Validation set accuracy: 0.960
Validation loss: 9.756
Validation set accuracy: 0.960
Validation loss: 9.761
Validation set accuracy: 0.954
Validation loss: 9.763
Validation set accuracy: 0.951


INFO:root:new model saved


Validation loss: 9.756
Validation set accuracy: 0.956


INFO:root:Train loss: 10.61784553527832 - acc: 0.8715602141953465 -- Validation loss: 9.755903244018555 - acc: 0.9573043526283332
 20%|████████████████▌                                                                  | 1/5 [05:44<22:59, 344.82s/it]

Finished Training
Running EPOCH 2
Running loss:  9.511654949188232
Training set accuracy: 0.9640319215133786
Running loss:  9.088527393341064
Training set accuracy: 0.964389693364501
Running loss:  9.480271339416504
Training set accuracy: 0.9517931638285517
Running loss:  8.932265758514404
Training set accuracy: 0.9578387599438429
Running loss:  8.651158905029297
Training set accuracy: 0.9587541466578842
Running loss:  8.320283126831054
Training set accuracy: 0.9624825693666935


INFO:root:Evaluating accuracy on dev set


Evaluating validation set ....
Validation loss: 8.210
Validation set accuracy: 0.955
Validation loss: 8.216
Validation set accuracy: 0.949
Validation loss: 8.208
Validation set accuracy: 0.957
Validation loss: 8.200
Validation set accuracy: 0.965
Validation loss: 8.208
Validation set accuracy: 0.957
Validation loss: 8.210
Validation set accuracy: 0.956


INFO:root:new model saved


Validation loss: 8.195
Validation set accuracy: 0.978


INFO:root:Train loss: 8.894986152648926 - acc: 0.9605846849904545 -- Validation loss: 8.194538116455078 - acc: 0.9595961552113295
 40%|█████████████████████████████████▏                                                 | 2/5 [11:31<17:16, 345.36s/it]

Finished Training
Running EPOCH 3
Running loss:  8.161473751068115
Training set accuracy: 0.9663184307515621
Running loss:  8.122446823120118
Training set accuracy: 0.9689879164099693
Running loss:  8.104555320739745
Training set accuracy: 0.9690468534827232
Running loss:  8.087018299102784
Training set accuracy: 0.9690792009234428
Running loss:  8.067243289947509
Training set accuracy: 0.9686214379966259
Running loss:  8.06061315536499
Training set accuracy: 0.9655605919659138


INFO:root:Evaluating accuracy on dev set


Evaluating validation set ....
Validation loss: 8.056
Validation set accuracy: 0.966
Validation loss: 8.049
Validation set accuracy: 0.973
Validation loss: 8.056
Validation set accuracy: 0.965
Validation loss: 8.061
Validation set accuracy: 0.961
Validation loss: 8.049
Validation set accuracy: 0.973
Validation loss: 8.059
Validation set accuracy: 0.963


INFO:root:new model saved


Validation loss: 8.041
Validation set accuracy: 0.980


INFO:root:Train loss: 8.094213485717773 - acc: 0.9683446197406105 -- Validation loss: 8.041289329528809 - acc: 0.9687460120767355
 60%|█████████████████████████████████████████████████▊                                 | 3/5 [17:26<11:36, 348.26s/it]

Finished Training
Running EPOCH 4
Running loss:  8.04365701675415
Training set accuracy: 0.975509144179523
Running loss:  8.045783710479736
Training set accuracy: 0.9697848709300161
Running loss:  8.035254192352294
Training set accuracy: 0.9762087542563677
Running loss:  8.034348392486573
Training set accuracy: 0.9733700772747398
Running loss:  8.030977725982666
Training set accuracy: 0.9742924038320779
Running loss:  8.031161022186279
Training set accuracy: 0.9749891463667154


INFO:root:Evaluating accuracy on dev set


Evaluating validation set ....
Validation loss: 8.031
Validation set accuracy: 0.972
Validation loss: 8.034
Validation set accuracy: 0.969
Validation loss: 8.035
Validation set accuracy: 0.968
Validation loss: 8.034
Validation set accuracy: 0.968
Validation loss: 8.029
Validation set accuracy: 0.973
Validation loss: 8.034
Validation set accuracy: 0.969


INFO:root:new model saved


Validation loss: 8.019
Validation set accuracy: 0.983


INFO:root:Train loss: 8.035984992980957 - acc: 0.9740953121496283 -- Validation loss: 8.019402503967285 - acc: 0.9717866685241461
 80%|██████████████████████████████████████████████████████████████████▍                | 4/5 [23:17<05:49, 349.07s/it]

Finished Training
Running EPOCH 5
Running loss:  8.025213527679444
Training set accuracy: 0.9767908871173858
Running loss:  8.023203659057618
Training set accuracy: 0.9779900879599154
Running loss:  8.024989700317382
Training set accuracy: 0.976160966604948
Running loss:  8.0289457321167
Training set accuracy: 0.9728824308142066
Running loss:  8.028114509582519
Training set accuracy: 0.9742569733411074
Running loss:  8.026698970794678
Training set accuracy: 0.9749311190098524


INFO:root:Evaluating accuracy on dev set


Evaluating validation set ....
Validation loss: 8.032
Validation set accuracy: 0.971
Validation loss: 8.032
Validation set accuracy: 0.971
Validation loss: 8.037
Validation set accuracy: 0.966
Validation loss: 8.035
Validation set accuracy: 0.968
Validation loss: 8.027
Validation set accuracy: 0.976
Validation loss: 8.033
Validation set accuracy: 0.971


INFO:root:new model saved


Validation loss: 8.023
Validation set accuracy: 0.980


INFO:root:Train loss: 8.026457786560059 - acc: 0.9752995442410094 -- Validation loss: 8.023056983947754 - acc: 0.972019100561738
100%|███████████████████████████████████████████████████████████████████████████████████| 5/5 [28:55<00:00, 347.01s/it]

Finished Training





## Part 4. Evaluation and Analysis (2 points)  
Implement the function `evaluate_test_set` to calculate the final value of the performance evaluation metric on the test data.  
Compare the result with the original AAAI paper. Comment on the effect of the penalty loss on model capacity. Did the inclusion of the self-attention block improve the results? If yes, then how? Can you think of additional techniques to improve the results? Briefly answer these questions in the markdown cells.

In [52]:
reload(test)
from test import evaluate_test_set  # re-import so the reloaded definition is used
evaluate_test_set(
    model=siamese_lstm_attention_model,
    data_loader=sick_dataloaders,
    config_dict={
        "device": device,
        "model_name": "siamese_lstm_attention",
        "self_attention_config": self_attention_config,
    },
)



INFO:root:Evaluating accuracy on test set
INFO:root:Evaluating accuracy on test set


Finished testing..............
Total test set accuracy: 0.457
