# GloVe Embedding

GloVe aims to learn word vectors such that their dot product equals the logarithm of the word co-occurrence probability.

To keep things short, I will just drop the objective function that GloVe minimizes below:

$$
J = \sum_{i,j} f(X_{ij}) \left( w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log(X_{ij}) \right)^2
$$
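
The weighting function $f$ is not defined above. As a minimal sketch, assuming the form from the original GloVe paper ($f(x) = (x / x_{max})^{\alpha}$ for $x < x_{max}$ and $1$ otherwise, with the paper's suggested $x_{max} = 100$ and $\alpha = 0.75$), a single term of $J$ could be computed in plain Python like this. The names `weight` and `pair_loss` are made up for illustration.

```python
import math

def weight(x, x_max=100.0, alpha=0.75):
    """Weighting f(x): damps rare pairs and caps frequent ones at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def pair_loss(w_i, w_j, b_i, b_j, x_ij):
    """One term of J: f(X_ij) * (w_i . w_j + b_i + b_j - log X_ij)^2."""
    dot = sum(a * b for a, b in zip(w_i, w_j))
    return weight(x_ij) * (dot + b_i + b_j - math.log(x_ij)) ** 2

# Toy values, chosen only to exercise the functions
loss = pair_loss([0.1, 0.2], [0.3, 0.4], 0.0, 0.0, 10.0)
```

Summing `pair_loss` over every pair with a non-zero count gives $J$.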

## Implementing GloVe

There are two ways to use GloVe. Since it is based on counting global word co-occurrence statistics to determine the word vectors, the options are:

1. Train the GloVe model before use.

```mermaid
graph TD;
    A[Initialize GloVe model]-->B[Feed forward data to GloVe];
    B[Feed forward data to GloVe]-->C[Train the GloVe model];
    C[Train the GloVe model]-->D[Produce dense word vectors];
    D[Produce dense word vectors]-->E[Load GloVe embeddings];
```

2. Use a pre-trained GloVe model.

```mermaid
graph TD;
    A[Load pre-trained GloVe model]-->B[Load GloVe embeddings];
```

However, since I think there is no pre-trained GloVe model for the provided dataset, we may need to take the first approach. The example for the second approach will then load the GloVe model trained in the first.
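
For the second approach, GloVe vectors are commonly stored as plain text, one word per line followed by its space-separated vector components. Here is a rough sketch of a loader, assuming that layout. `load_glove_text` and the file name in the comment are hypothetical, and the sample lines are invented rather than real GloVe values.

```python
def load_glove_text(lines):
    """Parse GloVe text format: one word per line, followed by its vector."""
    embeddings = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        embeddings[parts[0]] = [float(v) for v in parts[1:]]
    return embeddings

# In-memory stand-in for a file such as the hypothetical "glove_vectors.txt"
sample = ["the 0.1 0.2 0.3", "cat -0.4 0.5 0.6"]
vectors = load_glove_text(sample)
```

With a real file, the same function can be fed an open file handle, for example `load_glove_text(open(path, encoding="utf-8"))`.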

## Data Preparation

This part follows the steps below:

1. Load the data using Pandas
2. Loop through the `Tweet` column and find the length of the longest tweet (in characters)
3. If that length is < 1000, set the vocab length to 1000; otherwise, set it to 10000
4. Tokenize the tweets with `Tokenizer`
5. Extract context pairs and compute their co-occurrence counts

In [8]:
import pandas
from tqdm import tqdm
from collections import Counter
import tensorflow

from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer

2.16.1


In [None]:
# Step 1: load the data
dataframe = pandas.read_csv("../data.csv")

# Step 2: find the length (in characters) of the longest tweet
longest_characters = 0

for tweet in tqdm(dataframe["Tweet"]):
    if len(tweet) > longest_characters:
        longest_characters = len(tweet)

# Step 3: choose the vocab length from the longest tweet length.
# A plain else covers every remaining case, so vocab_length is never left unset.
if longest_characters < 1000:
    vocab_length = 1000
else:
    vocab_length = 10000

print(f"Vocab Length {vocab_length}")

100%|██████████| 10806/10806 [00:00<00:00, 561645.26it/s]

Vocab Length 1000





In [None]:
# Step 4: fit the tokenizer on the tweets
tokenizer = Tokenizer(num_words=vocab_length, oov_token="<OOV>")
tokenizer.fit_on_texts(dataframe["Tweet"])

word_index = tokenizer.word_index
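
To give a feel for what `word_index` holds, here is a simplified sketch of the ranking idea as I understand it: the OOV token takes index 1, the remaining words follow in order of frequency, and index 0 is left free for padding. `build_word_index` is a made-up helper that only lowercases and splits on whitespace. It is not the Keras implementation, which also filters punctuation.

```python
from collections import Counter

def build_word_index(texts, oov_token="<OOV>"):
    """Rank words by frequency; the OOV token takes index 1, 0 is left for padding."""
    counts = Counter(word for text in texts for word in text.lower().split())
    vocab = [oov_token] + [word for word, _ in counts.most_common()]
    return {word: index for index, word in enumerate(vocab, start=1)}

toy_index = build_word_index(["good movie", "bad movie"])
```

In this toy corpus "movie" appears twice, so it gets the lowest index after the OOV token.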

In [None]:
# Convert text to sequences, then pad or cut each one to 1000 tokens at the end
sequences = tokenizer.texts_to_sequences(dataframe["Tweet"])
dataframe["Tweet"] = pad_sequences(sequences, maxlen=1000, padding="post", truncating="post")

dataframe.head()

| Unnamed: 0 | sentimen | Tweet | Unnamed: 2 |
|---|---|---|---|
| 0 | -1 | 435 | |
| 1 | -1 | 6 | |
| 2 | 1 | 292 | |
| 3 | 1 | 705 | |
| 4 | -1 | 2 | |
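
To make the `padding = "post"` and `truncating = "post"` arguments concrete, here is a plain-Python sketch of the behavior I expect from `pad_sequences` with these settings: sequences that are too long lose tokens from the end, and short ones get zeros appended. `pad_post` is a hypothetical helper for illustration, not part of Keras.

```python
def pad_post(sequence, maxlen, value=0):
    """Cut a sequence to maxlen from the end, then right-pad it with `value`."""
    trimmed = sequence[:maxlen]
    return trimmed + [value] * (maxlen - len(trimmed))

short = pad_post([435, 6, 292], maxlen=5)
long = pad_post([1, 2, 3, 4, 5, 6, 7], maxlen=5)
```

Here `short` is extended with two zeros, while `long` keeps only its first five tokens.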


In [None]:
# Step 5: extract context pairs and count co-occurrences

# Use a context size of 5: after looking through the dataset,
# I found the context becomes clear within about 5 words
context_size = 5

# Iterate over the corpus and collect every (target, context) pair inside the window
context_pairs = []
for sequence in sequences:
    for i, target_word_index in enumerate(sequence):
        for j in range(max(0, i - context_size), min(len(sequence), i + context_size + 1)):
            if j != i:
                context_word_index = sequence[j]
                context_pairs.append((target_word_index, context_word_index))

# Count how often each word pair co-occurs
pair_counter = Counter(context_pairs)

# Representation of the data for GloVe training: three parallel lists.
# The Counter gets its own name so the list below does not overwrite it.
target_words = [pair[0] for pair in context_pairs]
context_words = [pair[1] for pair in context_pairs]
co_occurrence_counts = [pair_counter[pair] for pair in context_pairs]
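
To see what the window loop produces, here is the same counting idea on a toy sequence, small enough to check by hand. `count_pairs` is a hypothetical helper and the token ids are invented.

```python
from collections import Counter

def count_pairs(sequence, context_size):
    """Count (target, context) pairs within `context_size` positions."""
    counts = Counter()
    for i, target in enumerate(sequence):
        start = max(0, i - context_size)
        stop = min(len(sequence), i + context_size + 1)
        for j in range(start, stop):
            if j != i:
                counts[(target, sequence[j])] += 1
    return counts

toy_counts = count_pairs([7, 8, 7, 9], context_size=1)
```

With a window of 1, token 8 sits between two 7s, so both `(7, 8)` and `(8, 7)` are counted twice, while `(7, 9)` and `(9, 7)` are counted once each. The counts are symmetric because every neighbor relation is seen from both sides.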

## Training GloVe model

In [None]:
import torch
from torch.nn import Module, Embedding, MSELoss
from torch.optim import SGD

In [None]:
class GloveNN(Module):
    def __init__(self, vocab_length: int, embedding_dimension: int = 100):
        super().__init__()
        # One shared embedding table plus separate target and context biases
        self.embedding = Embedding(vocab_length, embedding_dimension)
        self.bias_target = Embedding(vocab_length, 1)
        self.bias_context = Embedding(vocab_length, 1)
        self.init_weights()

    def init_weights(self):
        self.embedding.weight.data.uniform_(-0.5, 0.5)
        self.bias_target.weight.data.zero_()
        self.bias_context.weight.data.zero_()

    def forward(self, target, context):
        # target and context are 1-D tensors of word indices with the same length
        embed_target = self.embedding(target)
        embed_context = self.embedding(context)
        bias_target = self.bias_target(target).squeeze(1)
        bias_context = self.bias_context(context).squeeze(1)
        dot_product = torch.sum(embed_target * embed_context, dim=1)
        return dot_product + bias_target + bias_context

model = GloveNN(vocab_length)
loss_fn = MSELoss()
optimizer = SGD(model.parameters(), lr=1e-3)


In [None]:
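import torch
from torch.utils.data import DataLoader, TensorDataset

# Batch the prepared pairs; the target is log(X_ij) as in the objective (MSELoss leaves out the f(X_ij) weighting)
targets = torch.tensor(target_words)
contexts = torch.tensor(context_words)
log_counts = torch.log(torch.tensor(co_occurrence_counts, dtype=torch.float32))
# batch_size=1024 is an untuned assumption
loader = DataLoader(TensorDataset(targets, contexts, log_counts), batch_size=1024, shuffle=True)

for epoch in range(25):
    total_loss = 0
    for target, context, log_count in tqdm(loader):
        optimizer.zero_grad()
        output = model(target, context)
        loss = loss_fn(output, log_count)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    print(f"Epoch: {epoch + 1}, Loss: {total_loss}")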