# Task 1: Word Embeddings (10 points)

This notebook will guide you through all steps necessary to train a word2vec model (Detailed description in the PDF).

## Imports

This code block is reserved for your imports. 

You are free to use the following packages: 

(List of packages)

In [1]:
# Imports
import torch
from torch import nn
import pandas as pd
import re
import numpy as np
from torch.utils.data import Dataset
from torch.utils.data import DataLoader
from torch.utils.data.dataloader import default_collate
from tqdm import tqdm
from torchsummary import summary
import matplotlib.pylab as plt

# 1.1 Get the data (0.5 points)

The Hindi portion HASOC corpus from [github.io](https://hasocfire.github.io/hasoc/2019/dataset.html) is already available in the repo, at data/hindi_hatespeech.tsv . Load it into a data structure of your choice. Then, split off a small part of the corpus as a development set (~100 data points).

If you are using Colab the first two lines will let you upload folders or files from your local file system.

In [15]:
#TODO: implement!

#from google.colab import files
#uploaded = files.upload()

data = pd.read_csv('data/bengali_hatespeech_.csv')


In [16]:
hindi_data.task_1.value_counts()

HOF    2469
NOT    2196
Name: task_1, dtype: int64

In [11]:
stopwords_bengali_file = open('data/stopwords-bn.txt', 'r')
stopwords_bengali = [line.replace('\n','') for line in stopwords_bengali_file.readlines()]

In [4]:
data.head()

Unnamed: 0,sentence,hate,category
0,যত্তসব পাপন শালার ফাজলামী!!!!!,1,sports
1,পাপন শালা রে রিমান্ডে নেওয়া দরকার,1,sports
2,জিল্লুর রহমান স্যারের ছেলে এতো বড় জারজ হবে এটা...,1,sports
3,শালা লুচ্চা দেখতে পাঠার মত দেখা যায়,1,sports
4,তুই তো শালা গাজা খাইছচ।তুর মার হেডায় খেলবে সাকিব,1,sports


In [25]:
def subset_data(n_positive, n_negative, df, split_col, random_seed=42):
    ## sampling the required samples for each label
    positive_df = df[df[split_col] == 1].sample(n_positive, random_state=random_seed)
    negative_df = df[df[split_col] == 0].sample(n_negative, random_state=random_seed)
    
    ## creating list of frames for concatenation
    frames = [positive_df, negative_df]
    
    return pd.concat(frames)

In [30]:
bengali_subset_df = subset_data(n_positive=2469, n_negative=2196, df=data, split_col='hate')

In [7]:
data.hate.value_counts()

0    20000
1    10000
Name: hate, dtype: int64

In [None]:
data = data.sample(1000)
data.shape

(1000, 6)

In [36]:
V = list(bengali_subset_df.sentence.str.split(expand=True).stack().value_counts().keys())
V2 = list(hindi_data.text.str.split(expand=True).stack().value_counts().keys())
len(V) , len(V2)

(17059, 25017)

## 1.2 Data preparation (0.5 + 0.5 points)

* Prepare the data by removing everything that does not contain information. 
User names (starting with '@') and punctuation symbols clearly do not convey information, but we also want to get rid of so-called [stopwords](https://en.wikipedia.org/wiki/Stop_word), i. e. words that have little to no semantic content (and, but, yes, the...). Hindi stopwords can be found [here](https://github.com/stopwords-iso/stopwords-hi/blob/master/stopwords-hi.txt) Then, standardize the spelling by lowercasing all words.
Do this for the development section of the corpus for now.

## * ?? What about hashtags (starting with '#') and emojis? Should they be removed too? Justify your answer in the report, and explain how you accounted for this in your implementation.

In [33]:
USERNAME_PATTERN = r'@([A-Za-z0-9_]+)'
PUNCTUATION_PATTERN = '\'’|!@$%^&*()_+<>?:.,;-'

In [34]:
#TODO: implement!
def remove_punctuations(text):
  return "".join([c for c in text if c not in PUNCTUATION_PATTERN])

def remove_stopwords(text):
  return " ".join([word for word in text.split() if word not in stopwords_hindi])

def remove_usernames(text):  
  return re.sub(USERNAME_PATTERN, '', text)
  

In [None]:
## normalizing text to lower case
data['clean_text'] = data.text.apply(lambda text: text.lower())

## removing usernames
data['clean_text'] = data.clean_text.apply(remove_usernames)

## removing punctuations
data['clean_text'] = data.clean_text.apply(remove_punctuations)

## removing stopwords
data['clean_text'] = data.clean_text.apply(remove_stopwords)


## 1.3 Build the vocabulary (0.5 + 0.5 points)

The input to the first layer of word2vec is an one-hot encoding of the current word. The output od the model is then compared to a numeric class label of the words within the size of the skip-gram window. Now

* Compile a list of all words in the development section of your corpus and save it in a variable ```V```.

In [None]:
#TODO: implement!
V = list(data.clean_text.str.split(expand=True).stack().value_counts().keys())

* Then, write a function ```word_to_one_hot``` that returns a one-hot encoding of an arbitrary word in the vocabulary. The size of the one-hot encoding should be ```len(v)```.

In [None]:
## vocabulary for mapping words to index
word2index = {word:index for index,word in enumerate(V)}

## vocabulary for mapping index to words
index2word = {index:word for index,word in enumerate(V)}

In [None]:
#TODO: implement!
def word_to_one_hot(word):
  one_hot_encoding = [0]*len(V)
  one_hot_encoding[word2index[word]] = 1.0
  return one_hot_encoding


## 1.4 Subsampling (0.5 points)

The probability to keep a word in a context is given by:

$P_{keep}(w_i) = \Big(\sqrt{\frac{z(w_i)}{0.001}}+1\Big) \cdot \frac{0.001}{z(w_i)}$

Where $z(w_i)$ is the relative frequency of the word $w_i$ in the corpus. Now,
* Calculate word frequencies
* Define a function ```sampling_prob``` that takes a word (string) as input and returns the probabiliy to **keep** the word in a context.

In [None]:
#TODO: implement!
word_frequencies = dict(data.clean_text.str.split(expand=True).stack().value_counts())
total_frequency = sum(word_frequencies.values())

def sampling_prob(word):
  relative_frequency = word_frequencies[word]/total_frequency
  return (np.sqrt(relative_frequency / .001) + 1 ) * (.001/relative_frequency)

# 1.5 Skip-Grams (1 point)

Now that you have the vocabulary and one-hot encodings at hand, you can start to do the actual work. The skip gram model requires training data of the shape ```(current_word, context)```, with ```context``` being the words before and/or after ```current_word``` within ```window_size```. 

* Have closer look on the original paper. If you feel to understand how skip-gram works, implement a function ```get_target_context``` that takes a sentence as input and [yield](https://docs.python.org/3.9/reference/simple_stmts.html#the-yield-statement)s a ```(current_word, context)```.

* Use your ```sampling_prob``` function to drop words from contexts as you sample them. 

In [None]:
#TODO: implement!
def get_target_context(sentence, window_size):
  tokens = sentence.split()
  for current_word_index, current_word in enumerate(tokens):
    context = []
    for context_word_index in range(current_word_index-window_size, current_word_index + window_size + 1):
      ## check wthether context word index is within sequence and is not the current word itself.
      if current_word_index != context_word_index and context_word_index <= len(tokens) -1 and context_word_index >=0:
        
        # increase sampling chances of infrequent words in context
        if np.random.random() < sampling_prob(tokens[context_word_index]):
          context.append(tokens[context_word_index])

    yield (current_word, context)


In [None]:

"""[(current_word, context) for (current_word, context) in 
        get_target_context(data.clean_text.values[4643], window_size=1)]"""

'[(current_word, context) for (current_word, context) in \n        get_target_context(data.clean_text.values[4643], window_size=1)]'

## 1.5a HOSAC Dataloader

In [None]:
class HOSACDataset:
    # A custom dataset class for holding data for word2vec embeddings (using skip-gram)
    def __init__(self, data, window_size, batch_size=32):
        """
        add docs here....
        """
        self.data = data
        self.window_size = window_size
        self.batch_size = batch_size

    def load_data(self):
          for i in tqdm(range(len(self.data.clean_text.values))):
            self.transform_data(i)


        
    def transform_data(self, index):

      X, Y = [], []

      ## get the text sequence from dataframe
      sentence = self.data.clean_text.values[index]

       ## fetch context words within the context window
      for current_word, context in get_target_context(sentence, window_size=self.window_size):
        current_word_onehot = word_to_one_hot(current_word)

    ## iterate over context list and one hot encode them and align them with input
        for context_word in context:
          context_word_onehot = word2index[context_word]

          X.append(current_word_onehot)
          Y.append(context_word_onehot)

      ## casting the lists to tensors, as forward pass expects float tensor and loss function expects long tensor
      self.inputs, self.labels = torch.FloatTensor(X), torch.LongTensor(Y)
    

    def batchify(self):
      index = 0
      for index in range(0, len(self.inputs), self.batch_size):
        yield (self.inputs[index:index+self.batch_size], self.labels[index:index+self.batch_size])
   

    

# 1.6 Hyperparameters (0.5 points)

According to the word2vec paper, what would be a good choice for the following hyperparameters? 

* Embedding dimension
* Window size

Initialize them in a dictionary or as independent variables in the code block below. 

In [None]:
# Set hyperparameters
window_size = 5
embedding_size = 300
input_size = len(V)
batch_size = 32

# More hyperparameters
learning_rate = 0.05
epochs = 500

In [None]:
## instantiate HOSAC dataset 
## window size changes over here
print('loading and transforming data...')
hosac_dataset = HOSACDataset(data, window_size, batch_size=batch_size)
hosac_dataset.load_data()

  1%|          | 12/1000 [00:00<00:19, 49.90it/s]

loading and transforming data...


100%|██████████| 1000/1000 [00:26<00:00, 38.08it/s]


# 1.7 Pytorch Module (0.5 + 0.5 + 0.5 points)

Pytorch provides a wrapper for your fancy and super-complex models: [torch.nn.Module](https://pytorch.org/docs/stable/generated/torch.nn.Module.html). The code block below contains a skeleton for such a wrapper. Now,

* Initialize the two weight matrices of word2vec as fields of the class.

* Override the ```forward``` method of this class. It should take a one-hot encoding as input, perform the matrix multiplications, and finally apply a log softmax on the output layer.

* Initialize the model and save its weights in a variable. The Pytorch documentation will tell you how to do that.

In [68]:
# Create model 

class Word2Vec(nn.Module):
  def __init__(self, input_size, hidden_size):
    super().__init__()
    self.layer1 = nn.Linear(in_features=input_size, out_features=hidden_size, bias=False)
    self.layer2 = nn.Linear(in_features=hidden_size, out_features=input_size, bias=False)
    #self.log_softmax = nn.LogSoftmax(dim=0)

  def forward(self, one_hot):
    x = self.layer1(one_hot)
    x = self.layer2(x)
    ## omitted logsoftmax since we use CrossEntropyLoss which has implicit NLL + Logsoftmax
    #y = self.log_softmax(x)
    return x

def init_weights(m):
  if type(m) == nn.Linear:
    nn.init.normal_(m.weight)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# 1.8 Loss function and optimizer (0.5 points)

Initialize variables with [optimizer](https://pytorch.org/docs/stable/optim.html#module-torch.optim) and loss function. You can take what is used in the word2vec paper, but you can use alternative optimizers/loss functions if you explain your choice in the report.

In [69]:
# Define optimizer and loss
word2vec_model = Word2Vec(input_size=input_size, hidden_size=embedding_size) 
is_untrained = True

if is_untrained:
  word2vec_model.apply(init_weights)
else:
  word2vec_model.load_state_dict(torch.load('/content/word2vec_ws10.pth'.format(window_size)))

word2vec_model = word2vec_model.to(device)
word2vec_model.train(True)

optimizer = torch.optim.Adam(word2vec_model.parameters(), lr=learning_rate)
#criterion = nn.NLLLoss()

criterion = nn.CrossEntropyLoss()

# 1.9 Training the model (3 points)

As everything is prepared, implement a training loop that performs several passes of the data set through the model. You are free to do this as you please, but your code should:

* Load the weights saved in 1.6 at the start of every execution of the code block
* Print the accumulated loss at least after every epoch (the accumulate loss should be reset after every epoch)
* Define a criterion for the training procedure to terminate if a certain loss value is reached. You can find the threshold by observing the loss for the development set.

You can play around with the number of epochs and the learning rate.

In [70]:
# Define train procedure

def train():
  min_loss = 1e3
  print("Training started for lr {}".format(learning_rate))
   
  for epoch in range(epochs):
    loss_val = []
    for (X,y) in hosac_dataset.batchify():
        X= X.to(device)
        y = y.to(device)

        ## forward pass
        output = word2vec_model(X)
        loss = criterion(output, y)
        ## backward and optimize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        loss_val.append(loss.item())
    
    if np.mean(loss_val) < min_loss:
      min_loss = np.mean(loss_val)
      print('new model saved with epoch loss {}'.format(min_loss))
      torch.save(word2vec_model.state_dict(), '/content/models/word2vec_ws{}.pth'.format(window_size))

    if (epoch+1) % 1 == 0:
      print (f'Epoch [{epoch+1}/{epochs}], Loss: {np.mean(loss_val):.4f}')

train()
print("Training finished")


Training started for lr 0.05
new model saved with epoch loss 64.49352863856724
Epoch [1/500], Loss: 64.4935
new model saved with epoch loss 32.67517607552664
Epoch [2/500], Loss: 32.6752
new model saved with epoch loss 20.72067928314209
Epoch [3/500], Loss: 20.7207
Epoch [4/500], Loss: 20.7969
Epoch [5/500], Loss: 22.1497
Epoch [6/500], Loss: 24.6651
Epoch [7/500], Loss: 26.7806
Epoch [8/500], Loss: 26.0376
Epoch [9/500], Loss: 25.9239
Epoch [10/500], Loss: 24.3989
Epoch [11/500], Loss: 22.9670
Epoch [12/500], Loss: 21.2584
new model saved with epoch loss 19.657137870788574
Epoch [13/500], Loss: 19.6571
Epoch [14/500], Loss: 20.6576
new model saved with epoch loss 18.904375076293945
Epoch [15/500], Loss: 18.9044
Epoch [16/500], Loss: 19.9128
new model saved with epoch loss 18.64438383919852
Epoch [17/500], Loss: 18.6444
new model saved with epoch loss 17.24529184613909
Epoch [18/500], Loss: 17.2453
Epoch [19/500], Loss: 17.9939
Epoch [20/500], Loss: 17.7212
Epoch [21/500], Loss: 17.898

# 1.10 Train on the full dataset (0.5 points)

Now, go back to 1.1 and remove the restriction on the number of sentences in your corpus. Then, reexecute code blocks 1.2, 1.3 and 1.6 (or those relevant if you created additional ones). 

* Then, retrain your model on the complete dataset.

* Now, the input weights of the model contain the desired word embeddings! Save them together with the corresponding vocabulary items (Pytorch provides a nice [functionality](https://pytorch.org/tutorials/beginner/saving_loading_models.html) for this).

In [71]:
word2vec_model.eval()

Word2Vec(
  (layer1): Linear(in_features=7348, out_features=300, bias=False)
  (layer2): Linear(in_features=300, out_features=7348, bias=False)
  (log_softmax): LogSoftmax(dim=0)
)

For **window_size** = 1, **embedding_size** = 300, **Best Epoch loss**: ?????<br/>
For **window_size** = 2, **embedding_size** = 300, **Best Epoch loss**: 1.555 <br/>
For **window_size** = 10, **embedding_size** = 300, **Best Epoch loss**: 1.8762

In [90]:
## inferencing outputs for the word: bad(खराब) 
predictions = word2vec_model(torch.unsqueeze(torch.tensor(word_to_one_hot('modi')), 0).cuda())



## sampling the top k neighbors for the input word
for i in torch.topk(predictions, 15)[1][0]:
  print(index2word[i.item()])


print('-------')
predictions2 = word2vec_model(torch.unsqueeze(torch.tensor(word_to_one_hot('स्कूल')), 0).cuda())
## sampling the top k neighbors for the input word
for i in torch.topk(predictions2, 15)[1][0]:
  print(index2word[i.item()])


हरी
करवाई
विलाप
।तू
byculla
केंद्र
also
लगता
हैक्योंकि
जुट
20
स्वरा
खबरों
शायद
गजवाए
-------
#modisarkaar2
छेद
140
#
मजेदार
#amethi
भाजपाचा
लागू
#iccwc2019
कथनी
जलता
मसरूफ़
[meridies
अखलाक
मौजूदा


In [82]:
word2index['modi']

1937

In [88]:
embedding_weights = word2vec_model.layer1.weight.data


input1 = torch.unsqueeze(torch.tensor(word_to_one_hot('india')), 0).cuda()

print(embedding_weights.shape, input1.T.shape)
embedding1 = torch.matmul(embedding_weights, input1.T)


input2 = torch.unsqueeze(torch.tensor(word_to_one_hot('modi')), 0).cuda()
embedding2 = torch.matmul(embedding_weights, input2.T)


torch.dot(embedding1.squeeze(1), embedding2.squeeze(1))


torch.Size([300, 7348]) torch.Size([7348, 1])


tensor(-3.6434, device='cuda:0')