# GloVe Embedding

GloVe (Global Vectors) aims to learn word vectors such that the dot product of two word vectors, plus their bias terms, approximates the logarithm of how often those two words co-occur.

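To keep things short, here is the loss function GloVe minimizes, where $X_{ij}$ counts how often word $j$ appears in the context of word $i$, $w$ and $\tilde{w}$ are the word and context vectors, $b$ and $\tilde{b}$ are their biases, and $f$ is a weighting function: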

$$
J = \sum_{i,j} f(X_{ij}) \left( w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log(X_{ij}) \right)^2
$$
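
To make the formula less abstract, here is a minimal pure-Python sketch of it. The weighting function follows the form given in the GloVe paper, $f(x) = (x / x_{max})^{\alpha}$ capped at 1, with `x_max = 100` and `alpha = 0.75`. The helper names (`weight`, `dot`, `glove_loss`) and the toy numbers are assumptions made up for illustration, not part of any library.

```python
import math


def weight(x, x_max=100.0, alpha=0.75):
    """Weighting function f(x): grows with the count, capped at 1 for frequent pairs."""
    return (x / x_max) ** alpha if x < x_max else 1.0


def dot(u, v):
    """Dot product of two equally sized lists of floats."""
    return sum(a * b for a, b in zip(u, v))


def glove_loss(cooccurrences, w, w_tilde, b, b_tilde):
    """Weighted squared error summed over the observed (i, j) pairs.

    cooccurrences maps (i, j) -> X_ij. Pairs with a zero count are skipped,
    because log(0) is undefined and f(0) = 0 anyway.
    """
    total = 0.0
    for (i, j), x_ij in cooccurrences.items():
        if x_ij <= 0:
            continue
        error = dot(w[i], w_tilde[j]) + b[i] + b_tilde[j] - math.log(x_ij)
        total += weight(x_ij) * error ** 2
    return total


# Toy data (hypothetical): two words, 2-dimensional vectors
cooccurrences = {(0, 1): 4.0, (1, 0): 4.0}
w = [[0.1, 0.2], [0.3, -0.1]]
w_tilde = [[0.0, 0.5], [0.2, 0.1]]
b = [0.0, 0.0]
b_tilde = [0.0, 0.0]

loss = glove_loss(cooccurrences, w, w_tilde, b, b_tilde)
print(loss)
```

Training then amounts to adjusting `w`, `w_tilde`, `b`, and `b_tilde` so that this number goes down.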

## Implementing GloVe

There are two ways to use GloVe. Because it derives word vectors by counting global word co-occurrence statistics, you can either:

1. Train the GloVe model on your own corpus before using it

```mermaid
graph TD;
    A[Initialize GloVe model]-->B[Feed forward data to GloVe];
    B[Feed forward data to GloVe]-->C[Train the GloVe model];
    C[Train the GloVe model]-->D[Produce dense word vectors];
    D[Produce dense word vectors]-->E[Load GloVe embeddings];
```

2. Use a pre-trained GloVe model.

```mermaid
graph TD;
    A[Load pre-trained GloVe model]-->B[Load GloVe embeddings];
```
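
For the pre-trained route, the publicly distributed GloVe vectors come as plain text: one token per line, followed by its vector components separated by spaces. Here is a minimal sketch of a loader for that format. The function name `load_glove_vectors` and the sample values are made up for illustration.

```python
import io


def load_glove_vectors(lines):
    """Parse GloVe's text format into a dict of token -> list of floats."""
    embeddings = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            # Skip blank or malformed lines
            continue
        embeddings[parts[0]] = [float(value) for value in parts[1:]]
    return embeddings


# Hypothetical two-word sample standing in for a real vectors file
sample = io.StringIO("aku 0.1 0.2 0.3\nkamu 0.4 0.5 0.6\n")
vectors = load_glove_vectors(sample)
print(vectors["aku"])
```

With a real file, you would pass an open file handle instead of the `StringIO` sample, for example `open(path, encoding="utf-8")`, where `path` points at whichever vectors file you downloaded.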

In this walkthrough, the GloVe model is trained with TF Glove by [GradySimon](https://github.com/GradySimon/tensorflow-glove/blob/master/tf_glove.py).
Big thanks to GradySimon for providing the TF Glove helper.
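
To show what "counting global word co-occurrence statistics" can look like, here is a small standalone sketch. It weights each pair by `1 / distance`, which is the scheme described in the GloVe paper. It is not the code from `tf_glove.py`; the function name `cooccurrence_counts` and the toy sentences are assumptions for illustration.

```python
from collections import defaultdict


def cooccurrence_counts(corpus, context_size=5):
    """Count weighted co-occurrences within a symmetric context window.

    corpus is a list of token lists. Each (word, context word) pair that is
    d tokens apart adds 1/d to its count.
    """
    counts = defaultdict(float)
    for tokens in corpus:
        for center_pos, center in enumerate(tokens):
            start = max(0, center_pos - context_size)
            end = min(len(tokens), center_pos + context_size + 1)
            for context_pos in range(start, end):
                if context_pos == center_pos:
                    continue
                distance = abs(context_pos - center_pos)
                counts[(center, tokens[context_pos])] += 1.0 / distance
    return counts


# Hypothetical toy corpus of two tokenized sentences
corpus = [["aku", "suka", "kopi"], ["aku", "suka", "teh"]]
counts = cooccurrence_counts(corpus, context_size=2)
print(counts[("aku", "suka")])
print(counts[("aku", "kopi")])
```

Adjacent words contribute a full count, while words further apart contribute less, so close neighbours dominate the statistics.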

## Data Preparation

This part follows these steps:

1. Load the data using Pandas
2. Loop through the `Tweet` column and find the longest tweet, measured in characters
3. If that length is below 1000, set the vocab length to 1000; otherwise, set it to 10000
4. Tokenize the `Tweet` data with NLTK's `wordpunct_tokenize`

In [1]:
import nltk
import pandas
from tqdm import tqdm

from tf_glove import GloVeModel




In [2]:
# Step 1: load the tab-separated dataset.
# A regex separator needs the Python parser engine, so request it explicitly.
dataframe = pandas.read_csv("../raw_dataset.csv", sep=r"\t+", engine="python")

# Step 2: find the length (in characters) of the longest tweet
longest_characters = 0

for tweet in tqdm(dataframe["Tweet"]):
    longest_characters = max(longest_characters, len(tweet))

# Step 3: pick the vocab length. A plain else keeps vocab_length from staying
# at 0 when the longest tweet has exactly 1000 characters, or 10000 and more.
vocab_length = 1000 if longest_characters < 1000 else 10000

print(f"Vocab Length {vocab_length}")

100%|██████████| 10806/10806 [00:00<00:00, 2703468.48it/s]

Vocab Length 1000





In [3]:
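def tokenize_corpus(corpus):
    # Lowercase the text, then split it into word tokens and punctuation tokens
    return nltk.wordpunct_tokenize(corpus.lower())


# Register progress_apply on pandas objects, then tokenize every tweet
tqdm.pandas()
dataframe["Tweet"] = dataframe["Tweet"].progress_apply(tokenize_corpus)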

100%|██████████| 10806/10806 [00:00<00:00, 161866.71it/s]
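
To see what this tokenization does to a single tweet, here is a sketch that needs only the standard library. NLTK's `wordpunct_tokenize` is built on the regular expression `\w+|[^\w\s]+`, so the same pattern is used directly here. The function name `regex_tokenize` and the sample sentence are made up for illustration.

```python
import re

# Runs of word characters, or runs of characters that are neither word nor whitespace
WORDPUNCT = re.compile(r"\w+|[^\w\s]+")


def regex_tokenize(text):
    """Lowercase the text and split it into word and punctuation tokens."""
    return WORDPUNCT.findall(text.lower())


print(regex_tokenize("Aku suka kopi, kamu?"))
# → ['aku', 'suka', 'kopi', ',', 'kamu', '?']
```

Note that punctuation is kept as separate tokens, so things like `,` and `?` also end up in the GloVe vocabulary.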


## Training GloVe model



In [4]:
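# 1000-dimensional word vectors, with a context window size of 5
model = GloVeModel(embedding_size=1000, context_size=5)
# Gather co-occurrence statistics from the tokenized tweets
model.fit_to_corpus(dataframe["Tweet"])
# Fit the vectors for 10 passes over those statistics
model.train(num_epochs=10)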




In [5]:
model.embedding_for("aku")

array([ 1.21333450e-01, -1.32528770e+00,  4.14777100e-02, -1.64152980e-02,
       -9.81633961e-01, -6.78149819e-01,  1.15415549e+00,  1.03544205e-01,
        9.02565241e-01,  2.04794914e-01, -7.17062414e-01,  5.55853128e-01,
       -9.23180223e-01,  1.36843920e-02,  7.19226301e-01, -6.36990190e-01,
        3.06786038e-02,  3.39919835e-01,  5.66846430e-01,  4.29743111e-01,
        3.19342017e-01, -8.68039727e-01, -1.00644445e+00,  4.65747863e-01,
        5.18741071e-01, -1.33658603e-01,  2.65532821e-01, -6.00757420e-01,
       -9.20075715e-01,  2.33205482e-02, -9.01953697e-01, -2.48577103e-01,
       -9.41582084e-01,  1.61590502e-01, -1.00619555e+00,  5.06463408e-01,
       -4.10894513e-01,  2.55994439e-01,  9.66291130e-02,  7.11476743e-01,
        4.62767899e-01,  7.60513306e-01, -1.32661736e+00, -1.45455432e+00,
       -1.77167535e-01, -6.34779930e-02, -4.10540938e-01, -3.95744890e-01,
        7.00006664e-01,  7.61516094e-02,  9.62969065e-01, -1.72387868e-01,
       -1.88607574e-01,  