# GloVe Text Representation Model

*GloVe + Gensim software example.*

**Prerequisites:** Tokenization skills with NLTK; conceptual knowledge of the GloVe text representation model.

**Data:** Gutenberg Corpus

**Install:**
pip install glove_python

## Outline

**Main Goal:** Practice creating a GloVe model with the glove-python and Gensim packages, using NLTK for preprocessing. Then show how to extract features from this text representation model and, finally, how to measure text similarity using those features.

- Acquiring and wrangling data for model initialization. 
- GloVe model generation/loading example
- Sklearn, Scipy text similarity measures examples
- GloVe original measures examples

## What is GloVe?

...[(XXX)](#XXX).

**Note**: For an introduction to the Gensim and NLTK packages, see the introductory notes in [2.01-TfIdf Notebook](2.01-TfIdf.ipynb).

In [1]:
from glove import Corpus, Glove
import gensim
from six import iteritems
from gensim.corpora import TextCorpus, MmCorpus, Dictionary
import time
import nltk
import os

## Wrangling Data

From txt to iterable of lists of strings.

In [2]:
doc_collection = []
file_path = 'gutenberg/'
# os.listdir is portable and avoids shelling out to `ls`
for file_name in sorted(os.listdir(file_path)):
    with open(os.path.join(file_path, file_name)) as doc:
        doc_collection.append(doc.read())

# Wrangle the data: list of document strings -> list of token lists, one per sentence
sentences = [nltk.word_tokenize(sent)
             for doc in doc_collection
             for sent in nltk.sent_tokenize(doc)]

# Load NLTK's English stopwords; a set makes membership tests fast
stopwords = set(nltk.corpus.stopwords.words('english'))

texts = [[word for word in sentence if word not in stopwords] for sentence in sentences]

id2word = gensim.corpora.Dictionary(texts)
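The stopword filter above compares tokens exactly as they appear, so capitalization matters. The following minimal sketch illustrates this with a hypothetical four-word stopword list (`STOPWORDS` and `filter_stopwords` are illustrative names, not part of NLTK); NLTK's English list is likewise all lowercase.

```python
# Hypothetical mini stopword list, for illustration only.
STOPWORDS = {"the", "to", "into", "here"}

def filter_stopwords(tokens, lowercase=False):
    """Drop stopwords; optionally lowercase tokens before the membership test."""
    if lowercase:
        tokens = [token.lower() for token in tokens]
    return [token for token in tokens if token not in STOPWORDS]

tokens = "Here Alice run to the hall".split()
case_sensitive = filter_stopwords(tokens)
case_insensitive = filter_stopwords(tokens, lowercase=True)

print(case_sensitive)    # 'Here' survives because only 'here' is listed
print(case_insensitive)  # lowercasing first removes it, but also lowercases 'Alice'
```

Whether to lowercase is a design choice: it shrinks the vocabulary, but it also merges proper nouns such as `Alice` with any lowercase homographs.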

## Generating GloVe Model

In [3]:
# Build the word-word co-occurrence matrix that GloVe is trained on
init = time.time()
corpus_model = Corpus()
corpus_model.fit(texts, window=10)
end = time.time() - init
corpus_model.save('gensim_data/glove_Gutenberg_corpus.model')
print('GloVe Model Generated in %d seconds' % end)

GloVe Model Generated in 4 seconds
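To make the `window` argument more concrete, here is a rough pure-Python sketch of window-based co-occurrence counting. It assumes the 1/distance weighting described in the GloVe paper; whether glove-python's `Corpus.fit` does exactly this internally is an assumption, and `cooccurrence_counts` is a hypothetical helper, not a library function.

```python
from collections import defaultdict

def cooccurrence_counts(sentences, window=10):
    """Count distance-weighted co-occurrences of word pairs within `window` tokens."""
    counts = defaultdict(float)
    for sentence in sentences:
        for i, word in enumerate(sentence):
            # Look ahead only; each unordered pair is stored once under a sorted key
            for j in range(i + 1, min(i + window + 1, len(sentence))):
                pair = tuple(sorted((word, sentence[j])))
                counts[pair] += 1.0 / (j - i)  # closer words contribute more
    return dict(counts)

toy_texts = [["girl", "run", "hall"], ["Alice", "run", "hall"]]
counts = cooccurrence_counts(toy_texts, window=2)
for pair, weight in sorted(counts.items()):
    print(pair, weight)
```

In this toy corpus the pair `run`/`hall` is adjacent in both sentences, so it accumulates the largest weight.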


In [4]:
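glove = Glove(no_components=300, learning_rate=0.05)
# Note: epochs=0 runs no training passes, so the word vectors keep their
# initial (untrained) values; use a positive number of epochs to train them.
glove.fit(corpus_model.matrix, epochs=0, no_threads=2, verbose=True)
glove.add_dictionary(corpus_model.dictionary)
glove.save('gensim_data/glove_Gutenberg.model')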

Performing 0 training epochs with 2 threads


In [5]:
# Show the first 10 components of a word vector; the dictionary maps each word to its row index
glove.word_vectors[glove.dictionary['girl']][:10]

array([ 9.97149822e-04, -6.69243747e-04, -1.99461371e-05,  1.65652432e-03,
        7.86864625e-04, -1.22957514e-03,  1.40204706e-03,  8.97689592e-04,
        9.06579444e-04, -1.13992907e-03])

## Sklearn GloVe-Cosine sentence similarity

### Wrangling Data

In [6]:
sentence1 = 'the girl run into the hall'
sentence2 = 'Here Alice run to the hall'

In [7]:
import numpy as np

def preproc_data(sent1, sent2, model):
    """Return one summed GloVe vector per sentence, each shaped (1, no_components)."""
    def sentence_vector(sentence):
        # Keep only the words present in the model vocabulary
        vectors = [model.word_vectors[model.dictionary[word]]
                   for word in sentence.split()
                   if word in model.dictionary]
        # Sum the word vectors and reshape to a single-row 2-D array for sklearn
        return np.sum(np.asarray(vectors), axis=0).reshape(1, -1)

    return sentence_vector(sent1), sentence_vector(sent2)

In [8]:
globvec_sent1, globvec_sent2 = preproc_data(sentence1,sentence2,glove)
print(globvec_sent1.shape)
print(globvec_sent2[0][:10])

(1, 300)
[-3.77823606e-04  5.52589552e-04 -2.03766040e-04 -2.25560992e-04
 -2.46303425e-03  1.05855900e-03 -2.60252426e-03 -5.67222960e-04
  1.31821014e-05  2.79117383e-03]
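The sentence vectors above are sums of word vectors. A small sketch with hypothetical 3-dimensional vectors (the values in `toy_vectors` are invented for illustration) shows that summing and averaging lead to the same cosine similarity, because cosine ignores vector length.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors, for illustration only
toy_vectors = {
    "girl": np.array([1.0, 0.0, 2.0]),
    "run": np.array([0.0, 1.0, 1.0]),
    "hall": np.array([2.0, 1.0, 0.0]),
    "Alice": np.array([1.0, 1.0, 1.0]),
}

def sentence_vector(sentence, vectors, pooling="sum"):
    """Pool the vectors of in-vocabulary words into one sentence vector."""
    known = [vectors[word] for word in sentence.split() if word in vectors]
    if not known:
        raise ValueError("no word of the sentence is in the vocabulary")
    stacked = np.vstack(known)
    return stacked.sum(axis=0) if pooling == "sum" else stacked.mean(axis=0)

def cosine(u, v):
    """Cosine similarity of two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

s1, s2 = "the girl run into the hall", "Here Alice run to the hall"
sim_sum = cosine(sentence_vector(s1, toy_vectors), sentence_vector(s2, toy_vectors))
sim_mean = cosine(sentence_vector(s1, toy_vectors, "mean"),
                  sentence_vector(s2, toy_vectors, "mean"))
print(sim_sum, sim_mean)
```

Unlike this sketch, which raises an error when no word is known, the notebook's helper silently skips out-of-vocabulary words; which behavior is preferable depends on the application.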


### Applying Similarity

In [9]:
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity(globvec_sent1,globvec_sent2)[0][0]

0.6076353473981132

In [10]:
#Filtering stopwords
sent1s = 'girl run hall'
sent2s = 'Alice run hall'
globvec_sent1s, globvec_sent2s = preproc_data(sent1s,sent2s,glove)
cosine_similarity(globvec_sent1s,globvec_sent2s)[0][0]

0.7088825862303931

## Scipy Cosine Similarity

In [11]:
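from scipy.spatial.distance import cosine as cosine_scipy

# scipy returns the cosine *distance* (1 - cosine similarity) and expects 1-D vectors
print(cosine_scipy(globvec_sent1.ravel(), globvec_sent2.ravel()))
print(cosine_scipy(globvec_sent1s.ravel(), globvec_sent2s.ravel()))  # stopwords filtered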

0.3923646526018869
0.29111741376960687
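Note that the scipy values are distances, not similarities: each one equals 1 minus the corresponding sklearn result (for example, 0.6076 + 0.3924 = 1). The relationship can be sketched in plain Python; `cosine_similarity_1d` and `cosine_distance_1d` are illustrative helpers, not library functions.

```python
import math

def cosine_similarity_1d(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_distance_1d(u, v):
    """Cosine distance, defined as 1 - cosine similarity."""
    return 1.0 - cosine_similarity_1d(u, v)

u, v = [3.0, 4.0], [4.0, 3.0]
similarity = cosine_similarity_1d(u, v)
distance = cosine_distance_1d(u, v)
print(similarity, distance)
```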


## GloVe transform_paragraph method

This method is still experimental and does not work.

Other measures, such as the harmonic mean of word similarities, are not possible with this package: these methods are implemented for every text representation model in Gensim, but Gensim does not provide an implementation of GloVe.

# Conclusions

1. Like word2vec, this model supports parallel training.
2. Like word2vec, this model generates a vector of length `no_components` for every word appearing in the corpus.
3. Summing the word vectors (ndarrays) of a sentence yields a sentence vector of the same length.
4. Any standard vector similarity or distance measure (scipy, sklearn, textsim tokendist) can then be used to compare two sentences.
5. With this mechanism it is possible to reproduce some of the **word embedding features** used in the Paraphrase Recognition task.
6. Generating a GloVe model from the English Wikipedia was not possible on this computer (i7, 8 GB RAM).

# Recommendations

- Test GloVe with the Spanish Wikipedia.