# GloVe Embedding

GloVe (Global Vectors) aims to learn word vectors such that the dot product of two word vectors, plus their bias terms, approximates the logarithm of how often the two words co-occur.

To keep things short, I will just drop the objective function below, where $X_{ij}$ is the number of times word $j$ appears in the context of word $i$:

$$
J = \sum_{i,j} f(X_{ij}) \left( w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2
$$
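
The objective relies on a weighting function $f$ that is not defined in this section. In the original GloVe formulation it down-weights rare pairs and caps the influence of very frequent ones. The sketch below assumes the defaults reported by the GloVe authors (`x_max = 100`, `alpha = 0.75`); the function name `glove_weight` is hypothetical and used only for illustration.

```python
def glove_weight(x: float, x_max: float = 100.0, alpha: float = 0.75) -> float:
    """Weighting f(x) applied to a co-occurrence count x.

    x_max and alpha are assumed defaults, not values taken from this notebook.
    """
    if x < x_max:
        # Rare pairs contribute less to the loss
        return (x / x_max) ** alpha
    # Frequent pairs are capped at a weight of 1
    return 1.0


# Show how the weight grows with the count and saturates at x_max
for count in (1, 10, 100, 1000):
    print(count, round(glove_weight(count), 4))
```

Because $f(0) = 0$, pairs that never co-occur drop out of the sum, so $\log X_{ij}$ is never evaluated at zero.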

## Implementing GloVe

There are two ways to use GloVe. Since the method derives word vectors from global word co-occurrence statistics, the embeddings can be obtained in one of two ways:

1. Train the GloVe model first, then use it.

```mermaid
graph TD;
    A[Initialize GloVe model]-->B[Feed forward data to GloVe];
    B[Feed forward data to GloVe]-->C[Train the GloVe model];
    C[Train the GloVe model]-->D[Produce dense word vectors];
    D[Produce dense word vectors]-->E[Load GloVe embeddings];
```

2. Use a pre-trained GloVe model.

```mermaid
graph TD;
    A[Load pre-trained GloVe model]-->B[Load GloVe embeddings];
```

However, since I do not think a pre-trained GloVe model exists for the provided dataset, we need to take the first approach. The example of the second approach will then load the GloVe model trained here.
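
For reference, the second approach usually comes down to parsing a text file in which each line holds a word followed by its vector components, separated by spaces. The sketch below assumes that layout; `load_glove_vectors` is a hypothetical helper, and the tiny sample file stands in for a real embedding file so the example runs without any download.

```python
import os
import tempfile
from typing import Dict, List


def load_glove_vectors(path: str) -> Dict[str, List[float]]:
    """Parse a GloVe-style text file: one word per line, then its vector."""
    vectors: Dict[str, List[float]] = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            parts = line.rstrip().split(" ")
            if len(parts) < 2:
                continue  # skip blank or malformed lines
            vectors[parts[0]] = [float(value) for value in parts[1:]]
    return vectors


# Tiny stand-in file (made-up values) written to a temporary location
sample = "the 0.1 0.2 0.3\ncat -0.4 0.5 0.6\n"
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write(sample)
    sample_path = tmp.name

embeddings = load_glove_vectors(sample_path)
os.remove(sample_path)
print(embeddings["cat"])
```

The resulting dictionary can then be used to fill an embedding matrix row by row, using the tokenizer's word index to pick the row for each word.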

## Data Preparation

This part follows the steps below:

1. Load the data using Pandas.
2. Loop through the `Tweet` column and find the length of the longest tweet.
3. If that length is below 1000, set the vocabulary size to 1000; otherwise, set it to 10000.
4. Tokenize the `Tweet` data with `Tokenizer`.
5. Extract the context pairs and compute their co-occurrence counts.

In [1]:
import pandas
from tqdm import tqdm
from collections import Counter
import torch

import tensorflow

from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer

In [2]:
torch.cuda.is_available()

True

In [3]:
# Step 1: load the data
dataframe = pandas.read_csv("../data.csv")

# Step 2: find the length (in characters) of the longest tweet
longest_characters = 0

for tweet in tqdm(dataframe["Tweet"]):
    if len(tweet) > longest_characters:
        longest_characters = len(tweet)

# Step 3: choose the vocabulary size from the longest tweet length;
# any length of 1000 or more falls through to 10000
vocab_length = 1000 if longest_characters < 1000 else 10000

print(f"Vocab Length {vocab_length}")

100%|██████████| 10806/10806 [00:00<00:00, 1959179.09it/s]

Vocab Length 1000





In [4]:
# Step 4: fit the tokenizer on the tweets
tokenizer = Tokenizer(num_words = vocab_length, oov_token = "<OOV>")
tokenizer.fit_on_texts(dataframe["Tweet"])

word_index = tokenizer.word_index

In [5]:
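# Convert text to sequences of token ids
sequences = tokenizer.texts_to_sequences(dataframe["Tweet"])
# Note: pad_sequences returns a 2-D array of shape (n_tweets, 1000), while a
# DataFrame column holds one value per row; the output below shows a single
# token id per tweet. The co-occurrence step uses the unpadded `sequences`.
dataframe["Tweet"] = pad_sequences(sequences, maxlen = 1000, padding = "post", truncating = "post")

dataframe.head()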

| Unnamed: 0 | sentimen | Tweet | Unnamed: 2 |
|---|---|---|---|
| 0 | -1 | 435 | |
| 1 | -1 | 6 | |
| 2 | 1 | 292 | |
| 3 | 1 | 705 | |
| 4 | -1 | 2 | |


In [6]:
# Step 5: extract context pairs and count co-occurrences

# Use a context window of 5 words on each side; I looked through the dataset
# and found that the context becomes clear after reading about 5 words
context_size = 5

# Iterate over the corpus to collect every (target, context) pair in the window
context_pairs = []
print("Extracting context and computing co-occurance count")
for sequence in tqdm(sequences):
    for i, target_word_index in enumerate(sequence):
        for j in range(max(0, i - context_size), min(len(sequence), i + context_size + 1)):
            if j != i:
                context_word_index = sequence[j]
                context_pairs.append((target_word_index, context_word_index))

# Count how many times each (target, context) pair occurs
co_occurrence_counts = Counter(context_pairs)

print("\n\n")

# Representation of data for GloVe training: one entry per pair occurrence,
# so a pair seen n times appears n times, each time with count n
print("Converting tokenized target words to tensor of int64")
target_words = [torch.tensor(pair[0], dtype = torch.int64) for pair in tqdm(context_pairs)]

print("Converting tokenized contexts to tensor of int64")
context_words = [torch.tensor(pair[1], dtype = torch.int64) for pair in tqdm(context_pairs)]

print("Processing co occurance counts and converting it to tensor of float32")
# Note: this rebinds co_occurrence_counts from the Counter to a list of tensors
co_occurrence_counts = [torch.tensor(co_occurrence_counts[pair], dtype = torch.float32) for pair in tqdm(context_pairs)]

Extracting context and computing co-occurance count


100%|██████████| 10806/10806 [00:00<00:00, 21741.56it/s]





Converting tokenized target words to tensor of int64


100%|██████████| 1172736/1172736 [00:06<00:00, 182543.88it/s]


Converting tokenized contexts to tensor of int64


100%|██████████| 1172736/1172736 [00:05<00:00, 212945.68it/s]


Processing co occurance counts and converting it to tensor of float32


100%|██████████| 1172736/1172736 [00:06<00:00, 192010.71it/s]
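
The cell above stores one entry per pair occurrence, which is why each list holds 1172736 elements. The GloVe objective sums over distinct pairs $i,j$, so an alternative is to keep one entry per unique pair. The sketch below shows that variant on toy data in plain Python; `cooccurrence_triples` and `toy_sequences` are hypothetical names, and this is not what the notebook itself does.

```python
from collections import Counter


def cooccurrence_triples(sequences, context_size=5):
    """Return parallel lists (targets, contexts, counts), one per unique pair."""
    counts = Counter()
    for sequence in sequences:
        for i, target in enumerate(sequence):
            start = max(0, i - context_size)
            stop = min(len(sequence), i + context_size + 1)
            for j in range(start, stop):
                if j != i:
                    counts[(target, sequence[j])] += 1
    # Counter keeps insertion order, so the three lists stay aligned
    targets = [pair[0] for pair in counts]
    contexts = [pair[1] for pair in counts]
    values = [float(count) for count in counts.values()]
    return targets, contexts, values


# Toy token-id sequences standing in for the tokenized tweets
toy_sequences = [[1, 2, 3], [1, 2]]
targets, contexts, values = cooccurrence_triples(toy_sequences, context_size=1)
for triple in zip(targets, contexts, values):
    print(triple)
```

Each of the three lists could then be passed to `torch.tensor` once, which avoids creating one tensor per pair.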


## Training GloVe model

In [7]:
from torch.nn import Module, Embedding, MSELoss
from torch.optim import SGD

In [8]:
class GloveNN(Module):
    # Simplified model: a single embedding table shared by target and context
    # words; the bias terms of the GloVe objective are left commented out
    def __init__(self, vocab_length: int, embedding_dimension: int = 100):
        super().__init__()
        self.embedding = Embedding(vocab_length, embedding_dimension)
        # self.bias_target = Embedding(vocab_length, 1)
        # self.bias_context = Embedding(vocab_length, 1)

    def init_weights(self):
        # Not called below, so PyTorch's default Embedding initialization is used
        self.embedding.weight.data.uniform_(-0.5, 0.5)
        # self.bias_target.weight.data.zero_()
        # self.bias_context.weight.data.zero_()

    def forward(self, target, context):
        embed_target = self.embedding(target)
        embed_context = self.embedding(context)
        # bias_target = self.bias_target(target).squeeze(1)
        # bias_context = self.bias_context(context).squeeze(1)
        # Dot product of the two word vectors (target and context are scalar ids)
        dot_product = torch.sum(embed_target * embed_context)
        # return dot_product + bias_target + bias_context
        return dot_product

model = GloveNN(vocab_length)
loss_fn = MSELoss()
optimizer = SGD(model.parameters(), lr = 1e-3)

In [9]:
len(dataframe["Tweet"]), len(target_words), len(context_words), len(co_occurrence_counts)

(10806, 1172736, 1172736, 1172736)

In [10]:
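for epoch in range(25):
    total_loss = 0
    # One SGD step per (target, context) pair occurrence
    for i in tqdm(range(len(target_words))):
        optimizer.zero_grad()
        output = model(target_words[i], context_words[i])
        # The regression target here is the raw co-occurrence count,
        # without the logarithm or the weighting f of the GloVe objective
        loss = loss_fn(output, co_occurrence_counts[i])

        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    print(f"Epoch: {epoch + 1}, Loss: {total_loss}")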

100%|██████████| 1172736/1172736 [10:33<00:00, 1851.92it/s]


Epoch: 1, Loss: nan


100%|██████████| 1172736/1172736 [10:17<00:00, 1899.17it/s]


Epoch: 2, Loss: nan


100%|██████████| 1172736/1172736 [10:46<00:00, 1812.63it/s]


Epoch: 3, Loss: nan


100%|██████████| 1172736/1172736 [11:00<00:00, 1776.83it/s]


Epoch: 4, Loss: nan


 45%|████▌     | 532764/1172736 [04:42<05:39, 1887.36it/s] 


KeyboardInterrupt: 
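
Every epoch above reports a `nan` loss. A likely cause is that the model regresses the dot product directly onto raw co-occurrence counts with plain SGD, and leaves out the bias terms, the logarithm, and the weighting $f(X_{ij})$ from the objective. The following is a minimal sketch of a model closer to that objective, not a tested replacement for the notebook's training run. It assumes separate target and context tables, `x_max = 100` and `alpha = 0.75` for the weighting, Adagrad as the optimizer, and a mean instead of a sum over pairs; the names `GloveSketch` and `glove_loss` and the toy data are hypothetical.

```python
import torch
from torch import nn


class GloveSketch(nn.Module):
    """GloVe-style model with separate target/context vectors and biases."""

    def __init__(self, vocab_size: int, dim: int = 100):
        super().__init__()
        self.w = nn.Embedding(vocab_size, dim)        # target vectors w_i
        self.w_tilde = nn.Embedding(vocab_size, dim)  # context vectors w~_j
        self.b = nn.Embedding(vocab_size, 1)          # target biases b_i
        self.b_tilde = nn.Embedding(vocab_size, 1)    # context biases b~_j
        for emb in (self.w, self.w_tilde):
            nn.init.uniform_(emb.weight, -0.5 / dim, 0.5 / dim)
        for emb in (self.b, self.b_tilde):
            nn.init.zeros_(emb.weight)

    def forward(self, target, context):
        # Batched dot product: one scalar per (target, context) pair
        dot = (self.w(target) * self.w_tilde(context)).sum(dim=1)
        return dot + self.b(target).squeeze(1) + self.b_tilde(context).squeeze(1)


def glove_loss(pred, counts, x_max=100.0, alpha=0.75):
    # f(X_ij): grows with the count and is capped at 1 (assumed defaults)
    weight = torch.clamp(counts / x_max, max=1.0) ** alpha
    # Weighted squared error against log(X_ij); mean instead of sum
    return (weight * (pred - torch.log(counts)) ** 2).mean()


# Toy data: unique (target, context) pairs with made-up positive counts
torch.manual_seed(0)
targets = torch.tensor([0, 1, 2, 3])
contexts = torch.tensor([1, 0, 3, 2])
counts = torch.tensor([4.0, 4.0, 150.0, 150.0])

model = GloveSketch(vocab_size=5, dim=8)
optimizer = torch.optim.Adagrad(model.parameters(), lr=0.05)

losses = []
for step in range(200):
    optimizer.zero_grad()
    loss = glove_loss(model(targets, contexts), counts)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```

Processing a whole batch of pairs per step, as here, also avoids the per-pair Python loop that makes each epoch above take roughly ten minutes.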