# **HOMEWORK 2**

*   **Run the following cells to mount the drive and import all necessary dependencies**

In [1]:
# Mount Drive
#from google.colab import drive
#drive.mount('/content/drive')

# path to store the data
#%cd /content/drive/My Drive/DL4NLP_2024/

In [2]:
# Import Dependencies

!pip install datasets
from datasets import load_dataset

import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

import numpy as np

from tqdm import tqdm
from nltk import word_tokenize
import nltk
nltk.download('punkt')

from scipy.stats import pearsonr, kendalltau

from pprint import pprint

import matplotlib.pyplot as plt

import gdown



[nltk_data] Downloading package punkt to /home/steinerj/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [3]:
# Specify random seed

def seed_everything(seed: int):
    import random, os
    import numpy as np
    import torch

    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
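    # cudnn.benchmark picks kernels by timing them, which can vary between runs; keep it off for reproducibility
    torch.backends.cudnn.benchmark = False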

# **Task 1: Getting to Know PyTorch: Semantic Textual Similarity**

In this task, we define semantic textual similarity (STS) as a supervised regression task in which the semantic similarity of two pieces of text (typically sentences) should be determined.

### **Task 1.1: Data Preparation**

First, we load the dataset for this task. Each entry of this dataset contains one English sentence pair and its similarity score. The data is structured like both a Python dictionary and a Pandas DataFrame.
*   **Run the following cell to load the dataset:**


In [4]:
train_set = load_dataset("stsb_multi_mt", "en", split='train')
dev_set = load_dataset("stsb_multi_mt", "en", split='dev')

dev_set

Dataset({
    features: ['sentence1', 'sentence2', 'similarity_score'],
    num_rows: 1500
})

**a)** To get familiar with the data format, print the following information:
*  **The first entry of `train_set`**
*  **The size of `dev_set`**
*  **The first 3 `sentence1` values in `train_set`**


In [5]:
# TODO: YOUR CODE HERE
print(f"First entry of train_set: {train_set[0]}")
print(f"Size of dev_set: {len(dev_set)}")
print(f"First 3 sentence1 in train_set: {train_set[:3]['sentence1']}")

First entry of train_set: {'sentence1': 'A plane is taking off.', 'sentence2': 'An air plane is taking off.', 'similarity_score': 5.0}
Size of dev_set: 1500
First 3 sentence1 in train_set: ['A plane is taking off.', 'A man is playing a large flute.', 'A man is spreading shreded cheese on a pizza.']


## **Task 1.1.1: Sentence Embedding with FastText**
We will embed each sentence as the average of the [FastText embeddings](https://fasttext.cc/docs/en/english-vectors.html) of its words.

* **Run the following cell to download the embeddings.**

In [6]:
# download word embeddings to your drive and unzip the file (run this cell only if you haven't downloaded the embedding file yet)
!wget https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M.vec.zip
!unzip wiki-news-300d-1M.vec.zip



7[1A[1G[27G[Files: 0  Bytes: 0  [0 B/s] Re]87[2A[1G[27G[https://dl.fbaipublicfiles.com]87[2A[1Gwiki-news-300d-1M.ve   0% [>                             ]  547.11K    --.-KB/s87[2A[1Gwiki-news-300d-1M.ve   0% [>                             ]    1.48M    7.62MB/s87[2A[1Gwiki-news-300d-1M.ve   0% [>                             ]    2.59M    8.21MB/s87[2A[1Gwiki-news-300d-1M.ve   0% [>                             ]    3.92M    9.01MB/s87[2A[1Gwiki-news-300d-1M.ve   0% [>                             ]    5.23M    9.38MB/s87[2A[1Gwiki-news-300d-1M.ve   0% [>                             ]    6.40M    9.38MB/s87[2A[1Gwiki-news-300d-1M.ve   1% [>                             ]    7.58M    9.38MB/s87[2A[1Gwiki-news-300d-1M.ve   1% [>                             ]    8.69M    9.31MB/s87[2A[1Gwiki-news-300d-1M.ve   1% [>                             ]   10.00M    9.45MB/s87[2A[1Gwiki-news-300d-1M.ve   1% [>                             ]   11.26M    

**b)**    **Implement the function** `load_embeddings` to read the word embeddings into a Python dictionary that maps every token to its corresponding vector, and return this dictionary. Represent the vectors as NumPy arrays. Only load the embeddings of the first 30,000 tokens in the file.

In [23]:
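def load_embeddings(file="wiki-news-300d-1M.vec"):
    data = {}
    with open(file, 'r', encoding='utf-8', newline='\n', errors='ignore') as f:
        _ = f.readline()  # skip the header line (vocabulary size and dimension)
        for line in f:
            tokens = line.rstrip().split(' ')
            # first field is the token, the remaining fields are the vector components;
            # float32 avoids the precision loss of float16 and matches PyTorch's default dtype
            data[tokens[0]] = np.asarray(tokens[1:], dtype=np.float32)
            if len(data) == 30000:
                break
    return data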
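
To make the file format concrete: a `.vec` file starts with a header line (vocabulary size and dimension), followed by one token per line with its vector components separated by spaces. The sketch below parses a tiny in-memory stand-in for such a file. The toy tokens, the 4-dimensional vectors and the helper name `parse_vec_lines` are made up for illustration and are not part of the assignment.

```python
import io
import numpy as np

# Toy stand-in for a .vec file: header line, then "<token> <v1> <v2> ..." per line.
toy_vec = io.StringIO(
    "3 4\n"
    "the 0.1 0.2 0.3 0.4\n"
    "cat 1.0 0.0 -1.0 0.5\n"
    "sat 0.0 0.0 0.0 2.0\n"
)

def parse_vec_lines(f, max_tokens):
    """Read at most `max_tokens` embeddings from an open .vec-style file (hypothetical helper)."""
    f.readline()  # skip the header line
    word2emb = {}
    for line in f:
        fields = line.rstrip().split(" ")
        word2emb[fields[0]] = np.asarray(fields[1:], dtype=np.float32)
        if len(word2emb) == max_tokens:
            break
    return word2emb

toy_emb = parse_vec_lines(toy_vec, max_tokens=2)
print(len(toy_emb))        # number of loaded tokens
print(toy_emb["cat"][:2])  # first two dimensions of one vector
```

Stopping as soon as the dictionary reaches `max_tokens` means the rest of the (large) file is never read.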


**c)**   **Print** the size of the dictionary and the first 10 dimensions of the embedding for the word "Frequency".

In [24]:
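# TODO: YOUR CODE HERE
emb = load_embeddings()
print(f"Size of the dictionary: {len(emb)}")
# Assumes "Frequency" is among the loaded tokens; a KeyError here means it is not.
print(f"First 10 dimensions of 'Frequency': {emb['Frequency'][:10]}")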


**d)**  **Implement a function** `tokenize` that tokenizes a sentence using `nltk.word_tokenize` and returns the list of tokens for the given sentence.



In [45]:
def tokenize(sentence):
    return nltk.word_tokenize(sentence)
        

**e)**  **Print** the tokenized `sentence1` and `sentence2` of the 20th entry in the training set.

In [52]:
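# TODO: YOUR CODE HERE
# Slicing with [:20] selects the first 20 entries, so the 20th entry (index 19)
# is the last item of each printed list.
sentence1 = train_set[:20]['sentence1']
sentence2 = train_set[:20]['sentence2']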

sentence1_tokenized = [tokenize(s1) for s1 in sentence1]
sentence2_tokenized = [tokenize(s2) for s2 in sentence2]

pprint(sentence1_tokenized)
pprint(sentence2_tokenized)

[['A', 'plane', 'is', 'taking', 'off', '.'],
 ['A', 'man', 'is', 'playing', 'a', 'large', 'flute', '.'],
 ['A', 'man', 'is', 'spreading', 'shreded', 'cheese', 'on', 'a', 'pizza', '.'],
 ['Three', 'men', 'are', 'playing', 'chess', '.'],
 ['A', 'man', 'is', 'playing', 'the', 'cello', '.'],
 ['Some', 'men', 'are', 'fighting', '.'],
 ['A', 'man', 'is', 'smoking', '.'],
 ['The', 'man', 'is', 'playing', 'the', 'piano', '.'],
 ['A', 'man', 'is', 'playing', 'on', 'a', 'guitar', 'and', 'singing', '.'],
 ['A',
  'person',
  'is',
  'throwing',
  'a',
  'cat',
  'on',
  'to',
  'the',
  'ceiling',
  '.'],
 ['The', 'man', 'hit', 'the', 'other', 'man', 'with', 'a', 'stick', '.'],
 ['A', 'woman', 'picks', 'up', 'and', 'holds', 'a', 'baby', 'kangaroo', '.'],
 ['A', 'man', 'is', 'playing', 'a', 'flute', '.'],
 ['A', 'person', 'is', 'folding', 'a', 'piece', 'of', 'paper', '.'],
 ['A', 'man', 'is', 'running', 'on', 'the', 'road', '.'],
 ['A', 'dog', 'is', 'trying', 'to', 'get', 'bacon', 'off', 'his', 'b

**f)** **Implement a function** `embed_sentence` that maps a sentence to its embedding. The sentence-level embedding should be the average of the embeddings of its tokens. If a token does not exist in the vocabulary of FastText, embed this token as a 0-vector with the same dimensions as the FastText embeddings.



In [56]:
def embed_sentence(sentence, word2emb):
    tokens = tokenize(sentence)
    embedding = np.zeros(300)
    for w in tokens:
        # Out-of-vocabulary tokens are 0-vectors, so they add nothing to the sum.
        if w in word2emb:
            embedding += word2emb[w]
    # Average over all tokens (OOV included); guard against empty sentences.
    return embedding / max(len(tokens), 1)
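
As a worked check of the averaging rule, the sketch below uses a made-up 2-dimensional vocabulary and plain whitespace splitting in place of `word_tokenize` (both are assumptions for illustration only). One of the three tokens is out of vocabulary, yet the sum is still divided by three, because the unknown token counts as a 0-vector.

```python
import numpy as np

# Hypothetical 2-d embeddings; the real FastText vectors have 300 dimensions.
toy_word2emb = {
    "a": np.array([1.0, 3.0]),
    "plane": np.array([2.0, 0.0]),
}

def toy_embed_sentence(sentence, word2emb, dim=2):
    tokens = sentence.split()  # stand-in for nltk.word_tokenize
    total = np.zeros(dim)
    for token in tokens:
        # Unknown tokens fall back to a 0-vector of the same dimension.
        total += word2emb.get(token, np.zeros(dim))
    return total / max(len(tokens), 1)

vec = toy_embed_sentence("a plane zzz", toy_word2emb)
print(vec)  # (1+2+0)/3 = 1.0 and (3+0+0)/3 = 1.0
```

Dividing by the number of in-vocabulary tokens instead would give a different vector here, which is an easy way to tell the two conventions apart.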

**g)**  **Print** the shape and the first 10 dimensions of `sentence1`'s embedding of the 20th entry in the training set.

In [None]:
# TODO: YOUR CODE HERE

## **Task 1.1.2: Build Custom Dataset**


* **Implement a custom dataset class** `MLPDataset` inheriting `torch.utils.data.Dataset` and override the following methods:
  1.   `__len__`: returns the size of the dataset.
  2.   `__getitem__`: supports indexing such that `dataset[i]` returns the i-th sample.

      The i-th sample should be a Python dict with two entries:
    * `encoding`: the encoding of one sentence pair, which is the concatenation of the embeddings of the two sentences of the pair. E.g., for sent1 = [1,2] and sent2 = [3,4], the encoding of the pair should be [1,2,3,4].
    * `score`: the similarity score between the two sentences.

**Hint:** examples can be found here: https://pytorch.org/tutorials/beginner/data_loading_tutorial.html





In [None]:
class MLPDataset(Dataset):
  def __init__(self, sents_1, sents_2, scores):
    """
    Arguments:
      sents_1 (List[string]): the list of the first sentences.
      sents_2 (List[string]): the list of the second sentences.
      scores (List[float]): the list of the similarity scores.
    """
    # TODO: YOUR CODE HERE
    pass

  def __getitem__(self, idx):
    # TODO: YOUR CODE HERE
    pass

  def __len__(self):
    # TODO: YOUR CODE HERE
    pass
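
For orientation, here is a minimal sketch of the map-style dataset pattern, not the reference solution. The class name `ToyPairDataset`, the extra `embed_fn` argument and the 2-dimensional `toy_embed` function are assumptions made so that the example runs without the FastText file.

```python
import numpy as np
import torch
from torch.utils.data import Dataset

class ToyPairDataset(Dataset):
    """Illustrative dataset; `embed_fn` is a hypothetical sentence-embedding callable."""

    def __init__(self, sents_1, sents_2, scores, embed_fn):
        self.sents_1 = sents_1
        self.sents_2 = sents_2
        self.scores = scores
        self.embed_fn = embed_fn

    def __len__(self):
        return len(self.scores)

    def __getitem__(self, idx):
        # Concatenate the two sentence embeddings into one encoding vector.
        encoding = np.concatenate([self.embed_fn(self.sents_1[idx]),
                                   self.embed_fn(self.sents_2[idx])])
        return {
            "encoding": torch.tensor(encoding, dtype=torch.float32),
            "score": torch.tensor(self.scores[idx], dtype=torch.float32),
        }

# Made-up embedding: character count and word count, just to get a 2-d vector.
def toy_embed(sentence):
    return np.array([len(sentence), len(sentence.split())], dtype=np.float32)

toy_dataset = ToyPairDataset(["a b", "c"], ["d e f", "g"], [4.5, 1.0], toy_embed)
sample = toy_dataset[0]
print(len(toy_dataset), sample["encoding"], sample["score"])
```

Returning `float32` tensors keeps the samples compatible with the default dtype of `nn.Linear` weights, which avoids dtype mismatches when the loss is computed.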

**h)**  **Instantiate** the above class for our `train_set` and `dev_set`.

In [None]:
# TODO: YOUR CODE HERE

**i)**  **Print** the size of `dev_dataset` and the shape of the encoding of the first example.

In [None]:
# TODO: YOUR CODE HERE

## **Task 1.2: Scoring the Similarity**
We will train a simple multi-layer perceptron (MLP) to score the similarity of the two sentences.

### **Task 1.2.1: Build MLP using PyTorch**

We will use [`torch.nn`](https://pytorch.org/docs/stable/nn.html) to build our MLP.

**a)** **Implement a class** `MLP` inheriting [`torch.nn.Module`](https://pytorch.org/docs/stable/generated/torch.nn.Module.html) for our MLP, which has the following components:

- A [linear layer](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html#torch.nn.Linear) with 900 dimensions and [relu activation](https://pytorch.org/docs/stable/generated/torch.nn.ReLU.html#torch.nn.ReLU), which takes the encoding of one sentence pair as the input.
- A [dropout layer](https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html#torch.nn.Dropout) with probability 0.2.
- A linear layer with 600 dimensions and relu activation.
- A dropout layer with probability 0.2.
- A linear layer with 600 dimensions and relu activation.
- A dropout layer with probability 0.2.
- A linear layer with 1 dimension (output layer).

**Hint**:
- You need to override the method `forward` in this class
- Use `nn.Sequential` to sequentialize the layers.
- You may want to see a quick example: https://pytorch.org/tutorials/beginner/basics/buildmodel_tutorial.html?highlight=sequential


In [None]:
class MLP(nn.Module):
  def __init__(self):
    super(MLP, self).__init__()
    # TODO: YOUR CODE HERE
    pass

  def forward(self, x):
    # TODO: YOUR CODE HERE
    pass
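
The layer list above can be sketched with `nn.Sequential` roughly as follows. This is one possible reading, not the reference solution: the class name `SketchMLP` is hypothetical, and the input size of 600 assumes that the encoding is the concatenation of two 300-dimensional sentence embeddings.

```python
import torch
import torch.nn as nn

class SketchMLP(nn.Module):
    def __init__(self, input_dim=600):  # assumption: two concatenated 300-d embeddings
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(input_dim, 900), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(900, 600), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(600, 600), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(600, 1),  # single regression output
        )

    def forward(self, x):
        return self.layers(x)

sketch_model = SketchMLP()
dummy_batch = torch.zeros(4, 600)       # batch of 4 placeholder encodings
print(sketch_model(dummy_batch).shape)  # torch.Size([4, 1])
```

The trailing dimension of size 1 is why the training loop calls `.squeeze()` on the predictions before comparing them with the scores.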

**b)**  **Initialize the** `model` of this class and **print** it.

In [None]:
# TODO : YOUR CODE HERE

### **Task 1.2.2: Train MLP with PyTorch**


The method for training is provided below, which returns the list of the train loss at all epochs and the trained model.
*  **Run the code below**

In [None]:
def train(model, train_dataloader, eval_dataloader, optimizer, loss_func, num_epochs, device='cuda'):

  train_losses = []

  for epoch in range(num_epochs):

    if epoch == 0:
      model.eval()
      loss_per_epoch = 0
      for batch_data in train_dataloader:
        with torch.no_grad():
          predictions = model(batch_data['encoding'].to(device))
          targets = batch_data['score'].to(device) # only if device='cuda'
          train_loss = loss_func(predictions.squeeze(), targets)
          loss_per_epoch += train_loss.item()
      loss_per_epoch = loss_per_epoch/len(train_dataloader)
      train_losses.append(loss_per_epoch)
      print(f'\ninitial train loss: {loss_per_epoch}')

    model.train()
    loss_per_epoch = 0
    for batch_data in train_dataloader:
      predictions = model(batch_data['encoding'].to(device))
      targets = batch_data['score'].to(device) # only if device='cuda'
      train_loss = loss_func(predictions.squeeze(), targets)
      loss_per_epoch += train_loss.item()

      optimizer.zero_grad()
      train_loss.backward()
      optimizer.step()

    loss_per_epoch = loss_per_epoch/len(train_dataloader)
    train_losses.append(loss_per_epoch)
    print(f'\n Epoch {epoch+1} train loss: {loss_per_epoch}')
    #evaluate(model, eval_dataloader, loss_func)

  return train_losses, model

**c)** **Define the corresponding hyperparameters**
  *   The number of training epochs is 15
  *   Batch Size is 128
  *   Learning Rate is 2e-03



In [None]:
# Set random seeds; do not change this!
seed_everything(seed=999)

# TODO: YOUR CODE BELOW



**d)**   **Create Dataloaders** considering following information
  * `train_dataloader` : reshuffle at every epoch
  * `dev_dataloader` : batch size is 512

**Hint:** batch size of train dataloader is different than dev dataloader


In [None]:
# TODO: YOUR CODE HERE
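
The two requirements (reshuffling for training, a separate batch size for evaluation) map onto the `shuffle` and `batch_size` arguments of `DataLoader`. The sketch below uses random placeholder tensors and small batch sizes so that it runs on its own; the variable names are hypothetical.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder data: 10 "encodings" of dimension 4 with one score each.
toy_train = TensorDataset(torch.randn(10, 4), torch.rand(10))
toy_dev = TensorDataset(torch.randn(6, 4), torch.rand(6))

# shuffle=True draws a new random order at the start of every epoch.
toy_train_loader = DataLoader(toy_train, batch_size=4, shuffle=True)
# Evaluation needs no shuffling; a larger batch size only speeds it up.
toy_dev_loader = DataLoader(toy_dev, batch_size=8, shuffle=False)

print(len(toy_train_loader), len(toy_dev_loader))  # number of batches per epoch
```

Note that `len()` of a dataloader is the number of batches, not the number of samples, which is what the training loop divides the accumulated loss by.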


**e)**   **Define** the optimizer as AdamW and the loss function as Mean Squared Error.

In [None]:
# Initialize the model
device = 'cuda' # "cpu"
model = MLP()
model.to(device)

# TODO: YOUR CODE BELOW
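
As a rough illustration of how these two pieces fit together, the sketch below runs a single optimization step with `torch.optim.AdamW` and `nn.MSELoss` on placeholder data. A small linear layer stands in for the MLP, and all `toy_*` names are hypothetical.

```python
import torch
import torch.nn as nn

toy_model = nn.Linear(4, 1)  # stand-in for the MLP
toy_optimizer = torch.optim.AdamW(toy_model.parameters(), lr=2e-3)
toy_loss_func = nn.MSELoss()

# One optimization step on placeholder data.
inputs, targets = torch.randn(8, 4), torch.rand(8)
loss = toy_loss_func(toy_model(inputs).squeeze(), targets)
toy_optimizer.zero_grad()
loss.backward()
toy_optimizer.step()
print(loss.item())
```

The optimizer only updates the parameters it was constructed with, so a new model needs a new optimizer instance.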




**f)**   Use the train function with the hyperparameters above to **store training losses and the model** in variables called:
  * `train_losses`
  * `model`



In [None]:
# TODO: YOUR CODE HERE

**g)** **Plot** the training loss using `matplotlib.pyplot.plot` (Plotting takes time - consider waiting!)


In [None]:
# TODO: YOUR CODE HERE

**h)** **Implement** another MLP architecture using [`torch.nn.Module`](https://pytorch.org/docs/stable/generated/torch.nn.Module.html), which has the following components:
- A [linear layer](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html#torch.nn.Linear) with 1,200 dimensions and [relu activation](https://pytorch.org/docs/stable/generated/torch.nn.ReLU.html#torch.nn.ReLU), which takes the encoding of one sentence pair as the input.
- A linear layer with 300 dimensions and relu activation.
- A linear layer with 300 dimensions and relu activation.
- A linear layer with 1 dimension (output layer).

In [None]:
class MLP(nn.Module):
  def __init__(self):
    super(MLP, self).__init__()
    # TODO: YOUR CODE HERE
    pass

  def forward(self, x):
    # TODO: YOUR CODE HERE
    pass

**i)** **Initialize and train** the new class with the function from above

In [None]:
# TODO: YOUR CODE HERE
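model = MLP()  # instance of the second architecture defined above
model = model.to("cuda")
# Re-create the optimizer so that it updates the new model's parameters (learning rate 2e-03 as specified above).
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-3)
train_losses_second_model, model = train(model, train_dataloader, dev_dataloader, optimizer, loss_func, num_epochs, 'cuda')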

**j)** **Plot** the loss curves of both the first and the second model (you can reuse your code from above)

In [None]:
# YOUR CODE FOR PLOTTING HERE
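
One way to draw both curves in a single figure is sketched below. The loss values are placeholders invented for the example and say nothing about the actual results; in the notebook they would be the lists returned by `train`, whose first element is the loss measured before any training.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt

# Placeholder values only; replace with the lists returned by `train`.
losses_first = [3.0, 2.0, 1.5, 1.2]
losses_second = [3.1, 1.8, 1.1, 0.9]

fig, ax = plt.subplots()
ax.plot(range(len(losses_first)), losses_first, label="first model")
ax.plot(range(len(losses_second)), losses_second, label="second model")
ax.set_xlabel("epoch (0 = before training)")
ax.set_ylabel("mean train loss")
ax.legend()
```

Plotting both curves on shared axes makes the comparison asked for in the next task easier to read.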

**k)** **Compare** the difference in the architectures and their corresponding losses. Describe what might have caused these changes.

In [None]:
# TODO: YOUR COMMENT HERE