## Introduction
Semantic textual similarity (STS) is the task of determining how similar two pieces of text are. The goal of the first task is to implement a new architecture that combines the ideas from two papers:
- Siamese Recurrent Architectures for Learning Sentence Similarity, Jonas Mueller et al. (referred to as the AAAI paper)
- A Structured Self-Attentive Sentence Embedding, Zhouhan Lin et al. (referred to as the ICLR paper) <br/><br/>
You will then evaluate whether the new architecture improves on the results of **Siamese Recurrent Architectures for Learning Sentence Similarity, Jonas Mueller et al.** Your overall network architecture should look similar to the following figure.
![Overall network architecture](https://raw.githubusercontent.com/shahrukhx01/ocr-test/main/download.png)
<br/><br/>


You will also need to implement the helper functions that these papers propose, e.g., the attention penalty term for the loss.

### SICK dataset
We will use the SICK dataset throughout the project (at least in the first two tasks). For more information about the dataset, refer to the original [paper](http://www.lrec-conf.org/proceedings/lrec2014/pdf/363_Paper.pdf). You can download the dataset using one of the following links:
- [dataset page 1](https://marcobaroni.org/composes/sick.html)
- [dataset page 2](https://huggingface.co/datasets/sick)    

The relevant columns for the project are `sentence_A`, `sentence_B`, `relatedness_score`, where `relatedness_score` is the label. <br><br>
**Hint: For each task make sure to decide whether the label should be normalized or not.**<br><br>

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import torch
from test import evaluate_test_set
import sts_data
import siamese_lstm_attention
import sample_class
import sample_train
import train
import test
from importlib import reload

## Part 1. Data pipeline (3 points)
Before working on the model, we must configure the data pipeline to load the data in the correct format. Please implement the functions for processing the data.

### Part 1.1 Loading and preprocessing the data (1 point)
Download the SICK dataset and store it in [pandas](https://pandas.pydata.org/docs/index.html) `DataFrame`s. You should use the official data split.  

Implement the `load_data` method of the `STSData` class in `sts_data.py`. The method must download the dataset and perform basic preprocessing. The minimal preprocessing required is:  
1. normalize text to lower case
2. remove punctuation  
3. remove [stopwords](https://en.wikipedia.org/wiki/Stop_word) (we provide a list of English stopwords)
4. optionally, any other preprocessing that you deem necessary

All the preprocessing code must be contained in the `preprocessing.py` file.  
You can use Hugging Face's [datasets library](https://huggingface.co/docs/datasets/) for easy dataset download.
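
The three mandatory preprocessing steps can be sketched with the standard library alone. This is only an illustration: the function name `preprocess` and the tiny `STOPWORDS` set are assumptions made for the example, and the real implementation should use the stopword list provided with the assignment.

```python
import string

# Tiny illustrative stopword list (assumption); the assignment ships its own list.
STOPWORDS = {"a", "an", "the", "is", "are", "in", "on", "of", "and"}

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess(sentence, stopwords=STOPWORDS):
    """Lowercase, strip punctuation, and drop stopwords from one sentence."""
    lowered = sentence.lower()
    no_punct = lowered.translate(_PUNCT_TABLE)
    tokens = [tok for tok in no_punct.split() if tok not in stopwords]
    return " ".join(tokens)


print(preprocess("A man is playing the guitar, loudly!"))
# prints: man playing guitar loudly
```

With the data in a pandas `DataFrame`, a function like this could be applied per column, for example `df["sentence_A"].apply(preprocess)`.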

### Part 1.2 Building vocabulary (1 point)
Before we can feed text to the model, it must be vectorized. We use 300-dimensional pretrained [FastText embeddings](https://fasttext.cc/docs/en/english-vectors.html) to map words to vectors. For general background on embeddings, you can refer to [this video](https://www.youtube.com/watch?v=ERibwqs9p38) (the video describes Word2Vec while we use FastText, but the general purpose is the same).  
To apply the embeddings, we must first construct the vocabulary for the data. Complete the `create_vocab` method of the `STSData` class in `sts_data.py`, where you concatenate each sentence pair, tokenize it, and construct the vocabulary for the whole training data. You should use [torchtext](https://torchtext.readthedocs.io/en/latest/data.html) for processing the data. For tokenization, you can use any library (or write your own tokenizer), but we recommend the [spaCy](https://spacy.io/) tokenizer. Use `fasttext.simple.300d` as the pretrained vectors.  
In the end, you must have a vocabulary object capable of mapping your input to the corresponding vectors. Remember that the vocabulary is created using only the training data (not the validation or test data).
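
To make the idea concrete, here is a pure-Python sketch of building a token-to-index mapping from training pairs only. It is a stand-in, not the torchtext-based solution the task asks for: `build_vocab`, the special tokens, and whitespace tokenization are all assumptions chosen to keep the example self-contained.

```python
from collections import Counter

PAD, UNK = "<pad>", "<unk>"  # hypothetical special tokens


def build_vocab(sentence_pairs, min_freq=1):
    """Map each training token to an integer index; 0 is padding, 1 is unknown."""
    counts = Counter()
    for sent_a, sent_b in sentence_pairs:
        # Concatenate the pair, then tokenize on whitespace (stand-in for spaCy).
        counts.update(f"{sent_a} {sent_b}".split())
    itos = [PAD, UNK] + sorted(tok for tok, c in counts.items() if c >= min_freq)
    return {tok: idx for idx, tok in enumerate(itos)}


# Only training pairs are passed in; validation and test data never touch the vocabulary.
train_pairs = [
    ("man playing guitar", "person playing instrument"),
    ("dog running", "dog runs park"),
]
vocab = build_vocab(train_pairs)
print(len(vocab), vocab["dog"], vocab.get("cat", vocab[UNK]))
```

Tokens that were never seen during training, such as `"cat"` here, fall back to the unknown index.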

### Part 1.3 Creating DataLoader (1 point)
Implement the `get_data_loader` method of the `STSData` class in `sts_data.py`. It must perform the following operations on each of the data splits:
1. vectorize each sentence pair by replacing every token with its index in the vocabulary
2. normalize labels
3. convert everything to PyTorch tensors
4. pad every sentence so that all of them have the same length
5. create `STSDataset` from `dataset.py`
6. create a PyTorch `DataLoader` from the created dataset


We have provided interfaces for possible helper functions, but you can change them as needed.   
In the end, you must have three data loaders, one for each split.
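
Steps 1, 2 and 4 above can be illustrated without any framework. The helper names below (`vectorize`, `pad`, `normalize_label`) and the toy vocabulary are assumptions made for this sketch; the provided helper interfaces may look different.

```python
def vectorize(sentence, vocab, unk_index=1):
    """Replace each whitespace token with its vocabulary index."""
    return [vocab.get(tok, unk_index) for tok in sentence.split()]


def pad(indices, max_len, pad_index=0):
    """Right-pad (or truncate) a list of indices to exactly max_len items."""
    return indices[:max_len] + [pad_index] * max(0, max_len - len(indices))


def normalize_label(score, const=5.0):
    """Scale a relatedness score by a constant, e.g. 5.0 for SICK's 1-5 range."""
    return score / const


# Toy vocabulary and rows (assumptions for illustration only).
vocab = {"<pad>": 0, "<unk>": 1, "dog": 2, "runs": 3, "park": 4}
rows = [("dog runs", "dog runs park", 4.5), ("cat runs", "dog", 2.0)]
max_len = max(len(s.split()) for a, b, _ in rows for s in (a, b))

batch = [
    (pad(vectorize(a, vocab), max_len), pad(vectorize(b, vocab), max_len),
     normalize_label(y))
    for a, b, y in rows
]
print(batch)
```

The resulting index lists can then be converted with `torch.tensor(..., dtype=torch.long)` and wrapped in a dataset that is passed to `torch.utils.data.DataLoader`.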

In [3]:
reload(sts_data)
from sts_data import STSData

columns_mapping = {
    "sent1": "sentence_A",
    "sent2": "sentence_B",
    "label": "relatedness_score",
}
dataset_name = "sick"
sick_data = STSData(
    dataset_name=dataset_name,
    columns_mapping=columns_mapping,
    normalize_labels=True,
    normalization_const=5.0,
)
batch_size = 16
sick_dataloaders = sick_data.get_data_loader(batch_size=batch_size)

INFO:root:loading and preprocessing data...
INFO:root:reading and preprocessing data completed...
INFO:root:creating vocabulary...
INFO:torchtext.vocab:Loading vectors from .vector_cache\wiki.simple.vec.pt
INFO:root:creating vocabulary completed...


## Part 2. Model Configuration & Hyperparameter Tuning (3 points)
In this part, you are required to define a model capable of learning self-attentive sentence embeddings described in [this ICLR paper](https://arxiv.org/pdf/1703.03130.pdf). The sentence embedding learned by this model will be used for computing the similarity score instead of the simpler embeddings described in the original AAAI paper.  
Please familiarize yourself with the model described in the ICLR paper and implement `SiameseBiLSTMAttention` and `SelfAttention` classes in `siamese_lstm_attention.py`. Remember that you must run the model on each sentence in the sentence pair to calculate the similarity between them. You can use `similarity_score` from `utils.py` to compute the similarity score between two sentences. 
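
For reference, the core equations of the two papers can be summarized as follows. Let $H \in \mathbb{R}^{n \times 2u}$ hold the BiLSTM hidden states of a sentence with $n$ tokens, with $W_{s1} \in \mathbb{R}^{d_a \times 2u}$ and $W_{s2} \in \mathbb{R}^{r \times d_a}$. The ICLR paper defines the attention matrix $A \in \mathbb{R}^{r \times n}$, the sentence embedding matrix $M \in \mathbb{R}^{r \times 2u}$, and the penalty term $P$ as

$$A = \operatorname{softmax}\left(W_{s2} \tanh\left(W_{s1} H^{\top}\right)\right), \qquad M = A H$$

$$P = \left\lVert A A^{\top} - I \right\rVert_F^2$$

where the softmax is taken along the token dimension, so each row of $A$ sums to 1. The AAAI paper scores a pair of sentence representations $h_a$ and $h_b$ with

$$g(h_a, h_b) = \exp\left(-\lVert h_a - h_b \rVert_1\right) \in (0, 1]$$

How $M$ is reduced to the vectors compared by the similarity function is a design choice left to your implementation, and the provided `similarity_score` may differ in detail from the formula above.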
  
To get more theoretical information about attention mechanisms you can refer to [this chapter](https://web.stanford.edu/~jurafsky/slp3/10.pdf) of ["Speech and Language Processing" book](https://web.stanford.edu/~jurafsky/slp3/) by Dan Jurafsky and James H. Martin, where the attention mechanism is described in the context of the machine translation task. 
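
A minimal PyTorch sketch of the structured self-attention from the ICLR paper, together with its penalty term, might look like the following. The class name `SelfAttentionSketch` and the function `attention_penalty` are hypothetical and not the interface expected in `siamese_lstm_attention.py`. The sketch uses bias-free linear layers as in the paper, ignores padding masks, and averages the penalty over the batch, which is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SelfAttentionSketch(nn.Module):
    """Hypothetical stand-in for the assignment's SelfAttention class."""

    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.W_s1 = nn.Linear(input_size, hidden_size, bias=False)   # 2u -> d_a
        self.W_s2 = nn.Linear(hidden_size, output_size, bias=False)  # d_a -> r

    def forward(self, H):
        # H: (batch, seq_len, 2u) BiLSTM hidden states.
        scores = self.W_s2(torch.tanh(self.W_s1(H)))   # (batch, seq_len, r)
        # Softmax over the token dimension, then move r in front of seq_len.
        A = F.softmax(scores, dim=1).transpose(1, 2)   # (batch, r, seq_len)
        M = torch.bmm(A, H)                            # (batch, r, 2u)
        return M, A


def attention_penalty(A):
    """Squared Frobenius norm of (A A^T - I), averaged over the batch."""
    eye = torch.eye(A.size(1), device=A.device).unsqueeze(0)
    diff = torch.bmm(A, A.transpose(1, 2)) - eye
    return (diff ** 2).sum(dim=(1, 2)).mean()


attn = SelfAttentionSketch(input_size=256, hidden_size=350, output_size=30)
H = torch.randn(4, 12, 256)  # batch of 4 sentences, 12 tokens each
M, A = attn(H)
print(M.shape, A.shape, attention_penalty(A).item())
```

The sizes 256, 350 and 30 correspond to a bidirectional LSTM with 128 hidden units and the values of `da` and `r` used in this notebook's configuration.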

Finally, once your implementation works on the default parameters stated below, make sure to perform **hyperparameter tuning** to find the best combination of hyperparameters.
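
One simple way to organize the tuning is an exhaustive grid search over a small set of candidate values. The sketch below is generic: `grid_search` is a hypothetical helper, and `fake_validation_score` is a placeholder for training the model with the given parameters and returning a validation metric.

```python
from itertools import product


def grid_search(param_grid, evaluate):
    """Return the best (params, score) pair; higher scores are better."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def fake_validation_score(params):
    # Placeholder (assumption): stands in for a real train-and-validate run.
    return -abs(params["learning_rate"] - 1e-3) - abs(params["hidden_size"] - 128) / 1e4


grid = {"learning_rate": [1e-2, 1e-3, 1e-4], "hidden_size": [64, 128, 256]}
best_params, best_score = grid_search(grid, fake_validation_score)
print(best_params, best_score)
```

Since each configuration requires a full training run, keeping the grid small or sampling configurations at random is usually more practical than a large exhaustive search.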

In [4]:
output_size = 1
hidden_size = 128
vocab_size = len(sick_data.vocab)
embedding_size = 300
embedding_weights = sick_data.vocab.vectors
lstm_layers = 3
learning_rate = 1e-2
fc_hidden_size = 16
max_epochs = 5
bidirectional = True
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
## self attention config
self_attention_config = {
    "hidden_size": 350,  ## refers to variable 'da' in the ICLR paper
    "output_size": 30,  ## refers to variable 'r' in the ICLR paper
    "penalty": 1,  ## refers to penalty coefficient term in the ICLR paper
}

In [5]:
reload(siamese_lstm_attention)
from siamese_lstm_attention import SiameseBiLSTMAttention
## init siamese lstm
siamese_lstm_attention_model = SiameseBiLSTMAttention(
    batch_size=batch_size,
    output_size=output_size,
    hidden_size=hidden_size,
    vocab_size=vocab_size,
    embedding_size=embedding_size,
    embedding_weights=embedding_weights,
    lstm_layers=lstm_layers,
    self_attention_config=self_attention_config,
    fc_hidden_size=fc_hidden_size,
    device=device,
    bidirectional=bidirectional,
)
## move model to device
siamese_lstm_attention_model.to(device)

300


  torch.nn.init.xavier_uniform(self.W_s1.weight)
  torch.nn.init.xavier_uniform(self.W_s2.weight)


SiameseBiLSTMAttention(
  (embeddings): Embedding(2052, 300)
  (bilstm): LSTM(300, 128, num_layers=3, bidirectional=True)
  (W_s1): Linear(in_features=256, out_features=350, bias=True)
  (W_s2): Linear(in_features=350, out_features=30, bias=True)
)

## Part 3. Training (2 points)  
Perform the final training of the model by implementing the functions in `train.py` after setting your best hyperparameter values. Note that you can use the same training function for hyperparameter tuning.
- **What is a good choice of performance metric here for evaluating your model?** [Max 2-3 lines]
- **What other performance evaluation metric can we use here for this task? Motivate your answer.** [Max 2-3 lines]
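
As a starting point for thinking about these questions, relatedness prediction on SICK is commonly reported with correlation-based metrics and mean squared error. The sketch below shows plain-Python versions of two such metrics; the function names are assumptions, and whether these are the right choices for your model is for you to argue.

```python
import math


def pearson_r(predictions, targets):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(predictions)
    mean_p = sum(predictions) / n
    mean_t = sum(targets) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(predictions, targets))
    var_p = sum((p - mean_p) ** 2 for p in predictions)
    var_t = sum((t - mean_t) ** 2 for t in targets)
    return cov / math.sqrt(var_p * var_t)


def mse(predictions, targets):
    """Mean squared error between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


preds = [0.2, 0.5, 0.9, 0.4]
golds = [0.3, 0.6, 0.8, 0.2]
print(pearson_r(preds, golds), mse(preds, golds))
```

Note that `pearson_r` is undefined when either sequence is constant, because the denominator becomes zero.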

In [6]:
reload(train)
from train import train_model
import torch.optim as optim


optimizer = optim.Adam(siamese_lstm_attention_model.parameters(), lr=learning_rate, betas=(0.9, 0.98))

# Store the result under a distinct name so it does not shadow the imported `siamese_lstm_attention` module.
train_output = train_model(
    model=siamese_lstm_attention_model,
    optimizer=optimizer,
    dataloader=sick_dataloaders,
    data=sick_data,
    max_epochs=max_epochs,
    config_dict={
        "device": device,
        "model_name": "siamese_lstm_attention",
        "self_attention_config": self_attention_config,
    },
)

  0%|                                                                                            | 0/5 [00:00<?, ?it/s]

Running EPOCH 1
Running loss:  11.028406238555908
Training set accuracy: 0.9401179673150182
Running loss:  10.837737560272217
Training set accuracy: 0.9497564282268286
Running loss:  10.269200706481934
Training set accuracy: 0.9404192171990872
Running loss:  9.737455463409423
Training set accuracy: 0.9369120050221682
Running loss:  9.078691864013672
Training set accuracy: 0.9398130616173148
Running loss:  8.445305728912354
Training set accuracy: 0.9478476714342833
Running loss:  8.275097274780274
Training set accuracy: 0.9638353981077671
Running loss:  8.473273944854736
Training set accuracy: 0.9516985159367323
Running loss:  8.243950080871581
Training set accuracy: 0.9692144321277738
Running loss:  8.19719295501709
Training set accuracy: 0.9663635104894638
Running loss:  8.187867450714112
Training set accuracy: 0.9642333652824163
Running loss:  8.146709156036376
Training set accuracy: 0.9649798331782222
Running loss:  8.131890201568604
Training set accuracy: 0.9665213270112872
Running

INFO:root:Evaluating accuracy on dev set


Evaluating validation set ....
Validation loss: 8.021
Validation set accuracy: 0.987
Validation loss: 8.041
Validation set accuracy: 0.966
Validation loss: 8.034
Validation set accuracy: 0.974
Validation loss: 8.041
Validation set accuracy: 0.967
Validation loss: 8.062
Validation set accuracy: 0.946
Validation loss: 8.059
Validation set accuracy: 0.949
Validation loss: 8.042
Validation set accuracy: 0.966
Validation loss: 8.043
Validation set accuracy: 0.965
Validation loss: 8.052
Validation set accuracy: 0.956
Validation loss: 8.054
Validation set accuracy: 0.954
Validation loss: 8.040
Validation set accuracy: 0.968
Validation loss: 8.036
Validation set accuracy: 0.972
Validation loss: 8.031
Validation set accuracy: 0.976
Validation loss: 8.030
Validation set accuracy: 0.978
Validation loss: 8.043
Validation set accuracy: 0.964
Validation loss: 8.050
Validation set accuracy: 0.958
Validation loss: 8.066
Validation set accuracy: 0.942
Validation loss: 8.036
Validation set accuracy: 0.9

INFO:root:new model saved
INFO:root:Train loss: 8.504547119140625 - acc: 0.9597649516921074 -- Validation loss: 8.018549919128418 - acc: 0.9686752651197215
 20%|████████████████▌                                                                  | 1/5 [11:07<44:30, 667.57s/it]

Validation loss: 8.019
Validation set accuracy: 0.989
Finished Training
Running EPOCH 2
Running loss:  8.03251028060913
Training set accuracy: 0.9758011074736714
Running loss:  8.029551982879639
Training set accuracy: 0.975326725654304
Running loss:  8.026163959503174
Training set accuracy: 0.9771888297982514
Running loss:  8.031509971618652
Training set accuracy: 0.9711782420985401
Running loss:  8.035736656188964
Training set accuracy: 0.9669879799708724
Running loss:  8.037574577331544
Training set accuracy: 0.9647220017388463
Running loss:  8.075278091430665
Training set accuracy: 0.9668622859753668
Running loss:  8.414382076263427
Training set accuracy: 0.9756791025400162
Running loss:  8.40735149383545
Training set accuracy: 0.9748215552419424
Running loss:  9.746352481842042
Training set accuracy: 0.9676217459142208
Running loss:  8.640986442565918
Training set accuracy: 0.9644347619265318
Running loss:  8.726861095428466
Training set accuracy: 0.9621427973732353
Running loss:  

INFO:root:Evaluating accuracy on dev set


Evaluating validation set ....
Validation loss: 8.016
Validation set accuracy: 0.988
Validation loss: 8.039
Validation set accuracy: 0.965
Validation loss: 8.033
Validation set accuracy: 0.971
Validation loss: 8.018
Validation set accuracy: 0.985
Validation loss: 8.047
Validation set accuracy: 0.957
Validation loss: 8.033
Validation set accuracy: 0.971
Validation loss: 8.030
Validation set accuracy: 0.974
Validation loss: 8.034
Validation set accuracy: 0.970
Validation loss: 8.059
Validation set accuracy: 0.945
Validation loss: 8.043
Validation set accuracy: 0.960
Validation loss: 8.045
Validation set accuracy: 0.959
Validation loss: 8.043
Validation set accuracy: 0.960
Validation loss: 8.033
Validation set accuracy: 0.971
Validation loss: 8.030
Validation set accuracy: 0.974
Validation loss: 8.043
Validation set accuracy: 0.960
Validation loss: 8.060
Validation set accuracy: 0.944
Validation loss: 8.051
Validation set accuracy: 0.953
Validation loss: 8.041
Validation set accuracy: 0.9

INFO:root:Train loss: 8.19404125213623 - acc: 0.9697354319253231 -- Validation loss: 8.048758506774902 - acc: 0.9679052067299684
 40%|█████████████████████████████████▏                                                 | 2/5 [22:27<33:33, 671.26s/it]

Validation loss: 8.049
Validation set accuracy: 0.955
Finished Training
Running EPOCH 3
Running loss:  8.024634552001952
Training set accuracy: 0.9785029808990657
Running loss:  8.02640495300293
Training set accuracy: 0.976735769957304
Running loss:  8.025613403320312
Training set accuracy: 0.97648866167292
Running loss:  8.027535820007325
Training set accuracy: 0.9756338371895253
Running loss:  8.020456027984618
Training set accuracy: 0.9817224589176476
Running loss:  8.021339416503906
Training set accuracy: 0.9798331483267247
Running loss:  8.031108856201172
Training set accuracy: 0.9718410425819457
Running loss:  8.028542518615723
Training set accuracy: 0.9760130140930414
Running loss:  8.029377555847168
Training set accuracy: 0.9735537130385638
Running loss:  8.028589344024658
Training set accuracy: 0.9731653150171041
Running loss:  8.029176330566406
Training set accuracy: 0.9722711866721511
Running loss:  8.028875350952148
Training set accuracy: 0.9724428296089173
Running loss:  8

INFO:root:Evaluating accuracy on dev set


Evaluating validation set ....
Validation loss: 8.021
Validation set accuracy: 0.980
Validation loss: 8.016
Validation set accuracy: 0.985
Validation loss: 8.020
Validation set accuracy: 0.981
Validation loss: 8.019
Validation set accuracy: 0.982
Validation loss: 8.035
Validation set accuracy: 0.966
Validation loss: 8.029
Validation set accuracy: 0.972
Validation loss: 8.031
Validation set accuracy: 0.970
Validation loss: 8.028
Validation set accuracy: 0.973
Validation loss: 8.032
Validation set accuracy: 0.969
Validation loss: 8.047
Validation set accuracy: 0.954
Validation loss: 8.040
Validation set accuracy: 0.961
Validation loss: 8.030
Validation set accuracy: 0.971
Validation loss: 8.021
Validation set accuracy: 0.980
Validation loss: 8.017
Validation set accuracy: 0.984
Validation loss: 8.031
Validation set accuracy: 0.970
Validation loss: 8.049
Validation set accuracy: 0.952
Validation loss: 8.038
Validation set accuracy: 0.963
Validation loss: 8.021
Validation set accuracy: 0.9

INFO:root:new model saved
INFO:root:Train loss: 8.026447296142578 - acc: 0.9752485567623151 -- Validation loss: 8.013274192810059 - acc: 0.9756390636476377
 60%|█████████████████████████████████████████████████▊                                 | 3/5 [33:23<22:13, 666.78s/it]

Validation loss: 8.013
Validation set accuracy: 0.988
Finished Training
Running EPOCH 4
Running loss:  8.021738147735595
Training set accuracy: 0.9790239358320832
Running loss:  8.017763137817383
Training set accuracy: 0.9828557658940553
Running loss:  8.022573566436767
Training set accuracy: 0.9778769560158252
Running loss:  8.016219806671142
Training set accuracy: 0.9841144321486354
Running loss:  8.019940853118896
Training set accuracy: 0.9818643192294985
Running loss:  8.020362854003906
Training set accuracy: 0.980427265400067
Running loss:  8.023970794677734
Training set accuracy: 0.9763091379776597
Running loss:  8.0176212310791
Training set accuracy: 0.9827798295766115
Running loss:  8.02188024520874
Training set accuracy: 0.9784409699030221
Running loss:  8.026035499572753
Training set accuracy: 0.9745792727917433
Running loss:  8.020104312896729
Training set accuracy: 0.9810724340379238
Running loss:  8.022263050079346
Training set accuracy: 0.9787470900453628
Running loss:  8

INFO:root:Evaluating accuracy on dev set


Evaluating validation set ....
Validation loss: 8.010
Validation set accuracy: 0.990
Validation loss: 8.023
Validation set accuracy: 0.978
Validation loss: 8.019
Validation set accuracy: 0.982
Validation loss: 8.022
Validation set accuracy: 0.978
Validation loss: 8.047
Validation set accuracy: 0.953
Validation loss: 8.023
Validation set accuracy: 0.977
Validation loss: 8.031
Validation set accuracy: 0.969
Validation loss: 8.034
Validation set accuracy: 0.966
Validation loss: 8.034
Validation set accuracy: 0.967
Validation loss: 8.042
Validation set accuracy: 0.958
Validation loss: 8.038
Validation set accuracy: 0.963
Validation loss: 8.033
Validation set accuracy: 0.968
Validation loss: 8.020
Validation set accuracy: 0.980
Validation loss: 8.019
Validation set accuracy: 0.981
Validation loss: 8.040
Validation set accuracy: 0.960
Validation loss: 8.052
Validation set accuracy: 0.948
Validation loss: 8.029
Validation set accuracy: 0.972
Validation loss: 8.020
Validation set accuracy: 0.9

INFO:root:Train loss: 8.022141456604004 - acc: 0.978386455336552 -- Validation loss: 8.019641876220703 - acc: 0.9734700704924762
 80%|██████████████████████████████████████████████████████████████████▍                | 4/5 [43:46<10:53, 653.43s/it]

Validation loss: 8.020
Validation set accuracy: 0.981
Finished Training
Running EPOCH 5
Running loss:  8.017658996582032
Training set accuracy: 0.9826558765955269
Running loss:  8.01674289703369
Training set accuracy: 0.9834580713883042
Running loss:  8.02255744934082
Training set accuracy: 0.9776028745807708
Running loss:  8.018605136871338
Training set accuracy: 0.9815856418572366
Running loss:  8.01716661453247
Training set accuracy: 0.9830677167512476
Running loss:  8.025026226043702
Training set accuracy: 0.9752738393377512


 80%|██████████████████████████████████████████████████████████████████▍                | 4/5 [46:03<11:30, 690.80s/it]


KeyboardInterrupt: 

## Part 4. Evaluation and Analysis (2 points)  
Implement the function `evaluate_test_set` to calculate the final value of the performance evaluation metric on the test data.  
Compare the result with the original AAAI paper. Comment on the effect of the penalty loss on model capacity. Did the inclusion of the self-attention block improve the results? If yes, then how? Can you think of additional techniques to improve the results? Briefly answer these questions in the markdown cells.

In [7]:
reload(test)
evaluate_test_set(
    model=siamese_lstm_attention_model,
    data_loader=sick_dataloaders,
    config_dict={
        "device": device,
        "model_name": "siamese_lstm_attention",
        "self_attention_config": self_attention_config,
    },
)



INFO:root:Evaluating accuracy on test set


Finished testing..............
Total test set accuracy: 0.630
