# GloVe Embedding

GloVe aims to learn word vectors such that their dot product, plus a bias term for each word, approximates the logarithm of how often the two words co-occur.

To make things cool, I will just drop the formula below:

$$
J = \sum_{i,j} f(X_{ij}) \left( w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2
$$
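
To make the objective concrete, here is a minimal pure-Python sketch that evaluates $J$ on a toy example. The weighting function `weight` uses the form from the original GloVe paper, $(x / x_{max})^{\alpha}$ capped at 1 with $x_{max} = 100$ and $\alpha = 0.75$; the names `weight`, `dot`, and `glove_loss`, and all the numbers below, are assumptions made up for illustration and are not part of TF Glove.

```python
import math


def weight(x, x_max=100.0, alpha=0.75):
    """Weighting f(x): grows as (x / x_max) ** alpha and is capped at 1."""
    return min((x / x_max) ** alpha, 1.0)


def dot(u, v):
    """Dot product of two equal-length lists of floats."""
    return sum(a * b for a, b in zip(u, v))


def glove_loss(X, W, W_tilde, b, b_tilde):
    """Evaluate J over the non-zero entries of the co-occurrence counts X.

    X maps (i, j) index pairs to counts. Zero counts are simply absent,
    which sidesteps log(0).
    """
    total = 0.0
    for (i, j), x in X.items():
        prediction = dot(W[i], W_tilde[j]) + b[i] + b_tilde[j]
        total += weight(x) * (prediction - math.log(x)) ** 2
    return total


# Toy example: 3 words, 2-dimensional vectors, made-up counts.
X = {(0, 1): 4.0, (1, 0): 4.0, (0, 2): 1.0, (2, 0): 1.0, (1, 2): 2.0, (2, 1): 2.0}
W = [[0.1, -0.2], [0.3, 0.4], [-0.5, 0.2]]
W_tilde = [[0.2, 0.1], [-0.1, 0.3], [0.4, -0.3]]
b = [0.0, 0.0, 0.0]
b_tilde = [0.0, 0.0, 0.0]

loss = glove_loss(X, W, W_tilde, b, b_tilde)
print(f"J = {loss:.4f}")
```

Training amounts to adjusting `W`, `W_tilde`, `b`, and `b_tilde` until this value is as small as possible.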

## Implementing GloVe

There are two ways to use GloVe. Because the method derives word vectors from global word co-occurrence statistics, you can either:

1. Train the GloVe model yourself before using it.

```mermaid
graph TD;
    A[Initialize GloVe model]-->B[Feed forward data to GloVe];
    B[Feed forward data to GloVe]-->C[Train the GloVe model];
    C[Train the GloVe model]-->D[Produce dense word vectors];
    D[Produce dense word vectors]-->E[Load GloVe embeddings];
```

2. Use a pre-trained GloVe model.

```mermaid
graph TD;
    A[Load pre-trained GloVe model]-->B[Load GloVe embeddings];
```
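
This notebook takes the first route, but for reference the second route can be sketched as follows. Pre-trained GloVe vectors are commonly distributed as plain text, one word per line followed by its space-separated values. The function name `load_glove_vectors`, the file name in the comment, and the sample vectors are assumptions for illustration only.

```python
import io


def load_glove_vectors(handle):
    """Parse GloVe's plain-text format: a word, then its floats, per line."""
    embeddings = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        embeddings[parts[0]] = [float(value) for value in parts[1:]]
    return embeddings


# Stand-in for a real file, e.g. open("glove.6B.50d.txt", encoding="utf-8").
# The two 3-dimensional vectors below are made up for illustration.
sample = io.StringIO("aku 0.1 -0.2 0.3\nkamu 0.0 0.5 -0.4\n")
vectors = load_glove_vectors(sample)
print(vectors["aku"])
```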

The GloVe model in this notebook is implemented with TF Glove by [GradySimon](https://github.com/GradySimon/tensorflow-glove/blob/master/tf_glove.py).
Big thanks to GradySimon for providing the TF Glove helper.

## Data Preparation

This part follows the steps below:

1. Load the data using Pandas.
2. Loop through the `Tweet` column and find the longest tweet, measured in characters.
3. If that length is below 1000, set the vocab length to 1000. Otherwise, set it to 10000.
4. Tokenize the tweets with NLTK's `wordpunct_tokenize`.

In [1]:
import pandas
from tqdm import tqdm
import nltk

from tf_glove import GloVeModel




In [2]:
# Step 1: load the tab-separated dataset (a regex separator needs the Python engine)
dataframe = pandas.read_csv("../raw_dataset.csv", sep=r"\t+", engine="python")

# Step 2: find the longest tweet, measured in characters
longest_characters = 0

for tweet in tqdm(dataframe["Tweet"]):
    if len(tweet) > longest_characters:
        longest_characters = len(tweet)

# Step 3: pick the vocab length from the longest tweet
if longest_characters < 1000:
    vocab_length = 1000
else:
    vocab_length = 10000

print(f"Vocab Length {vocab_length}")


Vocab Length 1000





In [3]:
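def tokenize_corpus(corpus):
    """Lowercase a tweet and split it into word and punctuation tokens."""
    return nltk.wordpunct_tokenize(corpus.lower())


tqdm.pandas()  # registers progress_apply on pandas objects
dataframe["Tweet"] = dataframe["Tweet"].progress_apply(tokenize_corpus)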


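With the tweets tokenized, the global co-occurrence statistics that GloVe relies on can be counted with a sliding context window. The sketch below is a simplified illustration, not the actual TF Glove code: the function name `count_cooccurrences`, the 1 / distance weighting, and the toy corpus are assumptions made for this example.

```python
from collections import defaultdict


def count_cooccurrences(sentences, context_size=5):
    """Count symmetric word co-occurrences within a sliding window.

    Each pair is weighted by 1 / distance, so adjacent words count more
    than distant ones (an assumption of this sketch).
    """
    counts = defaultdict(float)
    for tokens in sentences:
        for center, word in enumerate(tokens):
            start = max(0, center - context_size)
            for position in range(start, center):
                pair_weight = 1.0 / (center - position)
                counts[(word, tokens[position])] += pair_weight
                counts[(tokens[position], word)] += pair_weight
    return dict(counts)


# Made-up tokenized tweets standing in for dataframe["Tweet"].
toy_corpus = [["aku", "suka", "kopi"], ["aku", "suka", "teh"]]
cooccurrences = count_cooccurrences(toy_corpus, context_size=2)
print(cooccurrences[("aku", "suka")])
```

These counts play the role of $X_{ij}$ in the objective, and `context_size` corresponds to the window size passed to the model when it is created.
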

## Training the GloVe Model



In [4]:
# Build the vocabulary and co-occurrence counts from the tokenized tweets, then train
model = GloVeModel(embedding_size=1000, context_size=5)
model.fit_to_corpus(dataframe["Tweet"])
model.train(num_epochs=10)




In [8]:
model.embedding_for("aku")

array([ 0.36739987,  1.2969656 , -0.3027271 ,  0.05119056, -0.9190519 ,
        0.51998466,  0.18125236, -0.48618796, -0.23000172, -0.08274281,
        0.06127553,  0.6205626 ,  0.8424106 ,  0.36313778, -0.5676478 ,
       -0.93488157,  1.0218089 , -0.59509873,  0.17597777,  0.13305256,
       -0.36728984,  0.7447316 ,  0.22728398,  0.9279646 ,  1.0747355 ,
        1.3750172 ,  0.00245342,  0.6356756 ,  0.5055693 ,  0.7495794 ,
        0.6845776 ,  0.27315468, -0.8640884 , -0.3232136 ,  0.25747523,
       -0.39579502, -0.19610685,  0.02571405, -1.1796508 , -0.68697745,
       -1.0049883 , -0.3549552 , -1.1043639 , -0.49495208,  0.32934785,
        0.30109826,  0.87725556, -0.6226714 , -0.3456232 ,  0.54201734,
        1.1347576 ,  0.18764961,  0.41242504,  0.41194054, -0.21827313,
        0.15250611,  1.0225685 , -1.3533379 ,  1.3144627 , -0.44066808,
        0.60559297, -0.81925744,  0.15231593,  0.75875187, -0.40349457,
       -0.64702666, -0.26469386, -0.21108013, -0.35439715, -0.50