# GloVe Model

*GloVe + Gensim software example.*

**Install:**
pip install glove_python

In [1]:
from glove import Corpus, Glove
import gensim
from six import iteritems
from gensim.corpora import TextCorpus, MmCorpus, Dictionary
import time
import nltk

#Wiki corpus path
corpus_path = '/media/DATA/wiki_es/'
wiki_corpus = corpus_path+'dump/eswiki-20161201-pages-articles-multistream.xml.bz2'

## Wrangling Data 

From Wikipedia dump to glove.Corpus object.

This process takes many hours, as noted in the docstring of *~/gensim/scripts/make_wikicorpus.py*. Generating the .mm corpus from the Wikipedia dump takes 10 hours or more, even on a powerful machine. The next cells show how to build the GloVe model directly from the Wikipedia dump; the following section does the same with the .mm corpus produced by Gensim. For a simpler example, see the GloVe notebook built on the Gutenberg raw-text corpus from the NLTK library.
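To make clear what a `glove.Corpus` object has to accumulate from the token stream, here is a minimal sketch of windowed co-occurrence counting. It is an illustration only, not the glove_python implementation: the function name `cooccurrence_counts` is hypothetical, and the 1/distance weighting is an assumption taken from the GloVe paper.

```python
from collections import defaultdict


def cooccurrence_counts(documents, window=10):
    """Count weighted co-occurrences of token pairs within a sliding window.

    Assumption: each pair is weighted by 1/distance, as described in the
    GloVe paper. This is an illustration, not glove_python's own code.
    """
    word_ids = {}                # token -> integer id, assigned on first sight
    counts = defaultdict(float)  # (smaller_id, larger_id) -> accumulated weight

    for tokens in documents:     # documents can be a generator of token lists
        ids = [word_ids.setdefault(tok, len(word_ids)) for tok in tokens]
        for i, left in enumerate(ids):
            # Look only to the right; the pair key is order-independent.
            for j in range(i + 1, min(i + 1 + window, len(ids))):
                right = ids[j]
                if left == right:
                    continue
                key = (min(left, right), max(left, right))
                counts[key] += 1.0 / (j - i)
    return word_ids, counts


docs = [["the", "cat", "sat"], ["the", "cat", "ran"]]
word_ids, counts = cooccurrence_counts(iter(docs), window=2)
print(counts[(word_ids["the"], word_ids["cat"])])
```

Because the documents are consumed one at a time, only the dictionary and the pair counts stay in memory, which is why the cells below feed `Corpus.fit` from a generator.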

In [2]:
def read_wikipedia_corpus(filename, dictionary, article_min_tokens, article_max_tokens):

    # We don't want to do a dictionary construction step.
    corpus = gensim.corpora.WikiCorpus(filename, 
                                       dictionary=dictionary,
                                       article_min_tokens=article_min_tokens,
                                       article_max_tokens=article_max_tokens,
                                       lemmatize=None)

    for text in corpus.get_texts():
        yield text

In [3]:
dictionary = Dictionary.load_from_text(corpus_path+'_wordids.txt.bz2')

# Run this cell on the cluster; it is too slow on a laptop
corpus_model = Corpus()
init = time.time()
print(init)
corpus_model.fit(read_wikipedia_corpus(wiki_corpus, 
                                       dictionary=dictionary, 
                                       article_min_tokens=50,
                                       article_max_tokens=2000), window=10)
end = time.time() - init
corpus_model.save(corpus_path+'wiki-glove.model')
print('GloVe co-occurrence corpus generated in %d seconds' % end)



1521606059.1279578
----------- 1000
1521606072.001308
----------- 2000
1521606094.1524615
----------- 3000
1521606112.4316704
----------- 4000
1521606126.350563
----------- 5000
1521606139.8059845
----------- 6000
1521606151.0000122
----------- 7000
1521606156.9325473
----------- 8000
1521606167.9938378
----------- 9000
1521606182.402235
----------- 10000
1521606196.7842052
----------- 11000
1521606211.248341
----------- 12000
1521606227.6211655
----------- 13000
1521606245.197114
----------- 14000
1521606262.9224286
----------- 15000
1521606281.8131185
----------- 16000
1521606300.8992255
----------- 17000
1521606319.0168703
----------- 18000
1521606333.1823945
----------- 19000
1521606350.817441
----------- 20000
1521606368.8233101
----------- 21000
1521606385.2184923
----------- 22000
1521606404.8467731
----------- 23000
1521606423.9464397


KeyboardInterrupt: 

After 4 hours it was still not possible to obtain the model for the Spanish Wikipedia dump, and the run was interrupted.

### How to do this from a .mm corpus?

From the Wikipedia .mm file to a Gensim bag-of-words corpus, and then to a text file with one line per document (goal: iterate over the lines without exhausting the computer's memory).

In [3]:
wiki_id2word = Dictionary.load_from_text('/media/D/wiki_en/_wordids.txt.bz2')
bow_corpus = MmCorpus('/media/D/wiki_en/_bow.mm')



In [3]:
#print the word with index 1000 in the dictionary
wiki_id2word[1000]

'murska'

In [4]:
len(bow_corpus.index)

4181821

In [4]:
stopwords = nltk.corpus.stopwords.words('english')  # NLTK names its stopword lists by language
stopword_set = set(stopwords)
init = time.time()
# Recover the .mm text as one list of words per document.
with open('/media/D/wiki_en/wiki_text.txt','a') as model:
    for i,doc in enumerate(bow_corpus):
        wiki_text = ''
        if i%1000000 ==0:
            print('processing doc', i)
            print(time.time()-init)

        wiki_text = ' '.join([wiki_id2word[id2w] for id2w,frec in doc if wiki_id2word[id2w] not in stopword_set])
        model.write(wiki_text+'\n')  

processing doc 0
0.04697132110595703
processing doc 1000000
756.9963593482971
processing doc 2000000
1481.9766130447388
processing doc 3000000
2140.7227380275726
processing doc 4000000
3317.592767238617


In [14]:
def read_wiki_text(path, init):
    with open(path) as doc:
        for i,line in enumerate(doc):
            if i%10000==0:
                print(i, time.time()-init)
            yield line.split()

In [None]:
init = time.time()
wiki_corpus_model = Corpus()
wiki_corpus_model.fit(read_wiki_text('/media/D/wiki_en/wiki_text.txt', init), window=10)
wiki_corpus_model.save('gensim_data/wiki_glove_corpus.model')
wiki_glove = Glove(no_components=300, learning_rate=0.05)
wiki_glove.fit(wiki_corpus_model.matrix, epochs=2, no_threads=2, verbose=True)
wiki_glove.add_dictionary(wiki_corpus_model.dictionary)
wiki_glove.save('gensim_data/wiki_glove_glove.model')
print(wiki_glove.word_vectors[wiki_glove.dictionary['girl']][:10])
end = time.time()-init
print('Fit GloVe model, adding dict and saving took:',end)

0 0.00039386749267578125
10000 6.154743194580078
20000 14.900368213653564
30000 25.731306076049805
40000 37.31264615058899
50000 49.834702014923096
60000 62.695298194885254
70000 75.88489246368408
80000 89.51990628242493
90000 103.76476693153381
100000 118.72444653511047


## Pretrained Models

Training some of these models is computationally very expensive, so it is better to load pre-trained models to obtain the corresponding word vectors.

[Stanford pre-trained GloVe vectors, Wikipedia + Gigaword (glove.6B)](http://nlp.stanford.edu/data/glove.6B.zip)
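The archive contains plain-text files in which each line holds a token followed by its vector components, separated by spaces. A minimal sketch of a reader for that format follows; the function name `load_glove_text` is hypothetical, and the sample values are made up so that the example runs without downloading anything.

```python
import io


def load_glove_text(lines):
    """Parse GloVe text vectors: one token per line, followed by its floats."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


# Tiny in-memory stand-in for a real vector file (made-up values).
sample = io.StringIO("girl 0.1 0.2 0.3\nboy 0.2 0.1 0.3\n")
vectors = load_glove_text(sample)
print(len(vectors["girl"]))
```

With the unzipped archive, the same function should accept an open file handle, for example one opened on `glove.6B.50d.txt` with `encoding='utf-8'`.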

# Conclusions

1. Like word2vec, this model supports parallel training.
2. Like word2vec, this model generates a vector of length `no_components` for every word that appears in the corpus.
3. Summing the word vectors (ndarrays) of a sentence yields a sentence vector of the same length.
4. A standard vector similarity or distance measure (from scipy, sklearn, or textsim tokendist) then gives the similarity between two sentences.
5. This mechanism makes it possible to reproduce some of the **word embedding features** used in the Paraphrase Recognition task.
6. Generating the GloVe model from the English Wikipedia was not possible on this computer (i7, 8 GB RAM).
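
Points 3 and 4 can be sketched as follows. This is a minimal illustration under stated assumptions: plain Python lists stand in for ndarrays so the example has no dependencies, the helper names `sentence_vector` and `cosine_similarity` are hypothetical, and the three-dimensional vectors are made up rather than taken from a trained model.

```python
import math


def sentence_vector(tokens, word_vectors):
    """Sum the vectors of the known tokens; unknown tokens are skipped."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return None
    return [sum(components) for components in zip(*known)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Made-up 3-dimensional vectors, for illustration only.
toy_vectors = {
    "girl": [1.0, 0.0, 0.0],
    "boy": [0.0, 1.0, 0.0],
    "runs": [0.0, 0.0, 1.0],
}
s1 = sentence_vector("the girl runs".split(), toy_vectors)
s2 = sentence_vector("the boy runs".split(), toy_vectors)
print(cosine_similarity(s1, s2))
```

With a trained model, the dictionary lookup would be replaced by something like `wiki_glove.word_vectors[wiki_glove.dictionary[token]]`, as used earlier in this notebook.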

# Recommendations

- Test GloVe with the Spanish Wikipedia.