# Gensim Doc2vec

### Overview

We previously looked at Word2vec, which turns input text into word vectors but ignores word order. Quoc Le and Tomas Mikolov therefore proposed the Doc2Vec method, which represents not only words but also entire sentences and documents. The method is almost identical to Word2Vec, except that it generalizes the approach by adding a paragraph/document vector. An optimized implementation of Doc2Vec is available in Python through gensim.

This notebook uses Doc2Vec to show how to train and evaluate a model, which in turn helps in studying the relationships between documents.

The data used is the Lee corpus, which ships with gensim's test data. It is small enough for study purposes: it gives a clear idea of how things work, and one can experiment with it to get results quickly.

### Data

The Lee corpus was introduced by Dr. Michael Lee (http://faculty.sites.uci.edu/mdlee/). It contains 50 documents selected from the Australian Broadcasting Corporation’s news mail service, which provides text e-mails of headline stories and covers a number of broad topics. This notebook trains on the file lee_background.cor and tests on lee.cor.

In [1]:
import gensim
import os
import collections
import smart_open
import random



In [2]:
# Set file names for train and test data
test_data_dir = '{}'.format(os.sep).join([gensim.__path__[0], 'test', 'test_data'])
lee_train_file = test_data_dir + os.sep + 'lee_background.cor'
lee_test_file = test_data_dir + os.sep + 'lee.cor'

The location of the training file on disk:

In [3]:
print(lee_train_file)

C:\Users\Prajakta\Anaconda3\lib\site-packages\gensim\test\test_data\lee_background.cor


In [4]:
with open(lee_train_file) as f:
    for n, l in enumerate(f):
        if n < 5:
            print([l])

['Hundreds of people have been forced to vacate their homes in the Southern Highlands of New South Wales as strong winds today pushed a huge bushfire towards the town of Hill Top. A new blaze near Goulburn, south-west of Sydney, has forced the closure of the Hume Highway. At about 4:00pm AEDT, a marked deterioration in the weather as a storm cell moved east across the Blue Mountains forced authorities to make a decision to evacuate people from homes in outlying streets at Hill Top in the New South Wales southern highlands. An estimated 500 residents have left their homes for nearby Mittagong. The New South Wales Rural Fire Service says the weather conditions which caused the fire to burn in a finger formation have now eased and about 60 fire units in and around Hill Top are optimistic of defending all properties. As more than 100 blazes burn on New Year\'s Eve in New South Wales, fire crews have been called to new fire at Gunning, south of Goulburn. While few details are available at t

Gensim provides a pre-processing function, simple_preprocess, that tokenizes the text, removes punctuation, converts it to lowercase, and returns a list of words.
The function below uses it and, for the training corpus, also wraps each document in a TaggedDocument whose tag is the document's line number. Doc2Vec learns a vector for each word and a vector for each tag.

In [5]:
def read_corpus(fname, tokens_only=False):
    with smart_open.smart_open(fname, encoding="iso-8859-1") as f:
        for i, line in enumerate(f):
            if tokens_only:
                yield gensim.utils.simple_preprocess(line)
            else:
                # For training data, add tags
                yield gensim.models.doc2vec.TaggedDocument(gensim.utils.simple_preprocess(line), [i])
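
The pre-processing and tagging described above can be pictured with a standard-library sketch. This is only an approximation, not gensim's implementation: `approx_preprocess`, `tag_lines` and the `Tagged` namedtuple are hypothetical names, and the length limits of 2 and 15 characters are assumed to mirror the defaults of simple_preprocess.

```python
import re
from collections import namedtuple

# Hypothetical stand-in for TaggedDocument: a tuple with `words` and `tags`
Tagged = namedtuple("Tagged", ["words", "tags"])


def approx_preprocess(text, min_len=2, max_len=15):
    """Lowercase the text, keep runs of letters, and drop very short or very long tokens."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len]


def tag_lines(lines):
    """Yield one tagged document per line, using the line index as its tag."""
    for i, line in enumerate(lines):
        yield Tagged(approx_preprocess(line), [i])


docs = list(tag_lines([
    "At about 4:00pm AEDT, a marked deterioration in the weather.",
    "Fire crews have been called to a new fire at Gunning.",
]))
print(docs[0].words)
print(docs[1].tags)
```

Digits, punctuation and one-letter words such as "a" disappear, which matches what the tokenized training documents in this notebook look like.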

In [6]:
#Make lists of train and test corpus
train_corpus = list(read_corpus(lee_train_file))
test_corpus = list(read_corpus(lee_test_file, tokens_only=True))

In [7]:
train_corpus[:2]

[TaggedDocument(words=['hundreds', 'of', 'people', 'have', 'been', 'forced', 'to', 'vacate', 'their', 'homes', 'in', 'the', 'southern', 'highlands', 'of', 'new', 'south', 'wales', 'as', 'strong', 'winds', 'today', 'pushed', 'huge', 'bushfire', 'towards', 'the', 'town', 'of', 'hill', 'top', 'new', 'blaze', 'near', 'goulburn', 'south', 'west', 'of', 'sydney', 'has', 'forced', 'the', 'closure', 'of', 'the', 'hume', 'highway', 'at', 'about', 'pm', 'aedt', 'marked', 'deterioration', 'in', 'the', 'weather', 'as', 'storm', 'cell', 'moved', 'east', 'across', 'the', 'blue', 'mountains', 'forced', 'authorities', 'to', 'make', 'decision', 'to', 'evacuate', 'people', 'from', 'homes', 'in', 'outlying', 'streets', 'at', 'hill', 'top', 'in', 'the', 'new', 'south', 'wales', 'southern', 'highlands', 'an', 'estimated', 'residents', 'have', 'left', 'their', 'homes', 'for', 'nearby', 'mittagong', 'the', 'new', 'south', 'wales', 'rural', 'fire', 'service', 'says', 'the', 'weather', 'conditions', 'which', '

In [8]:
print(test_corpus[:2])

[['the', 'national', 'executive', 'of', 'the', 'strife', 'torn', 'democrats', 'last', 'night', 'appointed', 'little', 'known', 'west', 'australian', 'senator', 'brian', 'greig', 'as', 'interim', 'leader', 'shock', 'move', 'likely', 'to', 'provoke', 'further', 'conflict', 'between', 'the', 'party', 'senators', 'and', 'its', 'organisation', 'in', 'move', 'to', 'reassert', 'control', 'over', 'the', 'party', 'seven', 'senators', 'the', 'national', 'executive', 'last', 'night', 'rejected', 'aden', 'ridgeway', 'bid', 'to', 'become', 'interim', 'leader', 'in', 'favour', 'of', 'senator', 'greig', 'supporter', 'of', 'deposed', 'leader', 'natasha', 'stott', 'despoja', 'and', 'an', 'outspoken', 'gay', 'rights', 'activist'], ['cash', 'strapped', 'financial', 'services', 'group', 'amp', 'has', 'shelved', 'million', 'plan', 'to', 'buy', 'shares', 'back', 'from', 'investors', 'and', 'will', 'raise', 'million', 'in', 'fresh', 'capital', 'after', 'profits', 'crashed', 'in', 'the', 'six', 'months', 'to'

As shown above, the test corpus does not contain any tags.
Now we instantiate the Doc2Vec object with a vector size of 50 dimensions and 55 passes over the training corpus. The minimum word count is set to 2 so that words appearing only once are discarded. Increasing the number of iterations can improve the model, but it also increases training time. Small datasets with short documents, like this one, can benefit from more training passes. (In gensim 4.x the size and iter arguments are named vector_size and epochs.)

In [9]:
model = gensim.models.doc2vec.Doc2Vec(size=50, min_count=2, iter=55)

### Vocabulary building

Essentially, the vocabulary is a dictionary (accessible via model.wv.vocab) of all the unique words extracted from the training corpus, along with their counts (e.g., model.wv.vocab['penalty'].count gives the count for the word penalty).

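In [10]:
model.build_vocab(train_corpus)
# Train the model (assumed: the export skips cell 11, where training would run)
model.train(train_corpus, total_examples=model.corpus_count, epochs=model.iter)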
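
The idea of a vocabulary with counts and a minimum word count can be illustrated without gensim. The sketch below is a simplified picture, not what build_vocab does internally; `build_vocab_counts` and the toy documents are hypothetical.

```python
from collections import Counter


def build_vocab_counts(documents, min_count=2):
    """Count every token and keep only those seen at least min_count times."""
    counts = Counter(token for doc in documents for token in doc)
    return {word: n for word, n in counts.items() if n >= min_count}


# Hypothetical toy corpus of already-tokenized documents
toy_docs = [["fire", "crews", "fire"], ["fire", "penalty"], ["crews", "storm"]]
vocab = build_vocab_counts(toy_docs, min_count=2)
print(vocab)
```

Here "penalty" and "storm" occur only once, so a minimum count of 2 removes them from the vocabulary.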

### Infer Vector

Gensim can infer a vector for any piece of text by passing a list of words to the model.infer_vector function. This vector can then be compared with other vectors via cosine similarity.

In [12]:
model.infer_vector(['only', 'you', 'can', 'prevent', 'forrest', 'fires'])

array([-0.0488959 , -0.04925137,  0.0702027 , -0.08176951, -0.19408338,
        0.03227499,  0.16560279, -0.11836297, -0.21711209, -0.16258545,
       -0.01218748,  0.07400796, -0.14070468, -0.01611147, -0.03651289,
       -0.04847134, -0.05478801,  0.01811546, -0.2016954 ,  0.09516598,
        0.08637386,  0.01689524,  0.24296169, -0.10003939,  0.11473741,
        0.04977448,  0.0958508 ,  0.12221827,  0.02935479, -0.16181894,
        0.18574406, -0.08980884, -0.03306618,  0.03097287,  0.00084509,
        0.07294434,  0.04371838,  0.18647136,  0.13067466, -0.10793348,
        0.05411503, -0.02300423,  0.00850911, -0.08327397, -0.02406243,
       -0.15011084, -0.08370403,  0.07207834,  0.16361427,  0.07315579], dtype=float32)
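
Cosine similarity, mentioned above as the way to compare such vectors, is the dot product of two vectors divided by the product of their lengths. A minimal standard-library sketch follows; `cosine_similarity` is a hypothetical helper, not a gensim function.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])  # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])      # unrelated directions
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])       # opposite directions
print(same_direction, orthogonal, opposite)
```

The result ranges from -1 to 1 and depends only on direction, not on vector length, so documents of different sizes can still be compared.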

### Calculate ranks for evaluation

To evaluate this new model, we first infer a new vector for each document of the training corpus, compare the inferred vector with the vectors of the training corpus, and then record the rank of the document based on self-similarity. In effect, we pretend that the training corpus is new, unseen data and see how it compares with the trained model. The expectation is that we have likely overfit the model (i.e., all of the ranks will be less than 2), so each document should find itself very easily. Additionally, we keep track of the second-ranked document for each query as an example of a less similar document. This helps in judging whether the model is trained properly.

In [13]:
ranks = []
second_ranks = []
for doc_id in range(len(train_corpus)):
    inferred_vector = model.infer_vector(train_corpus[doc_id].words)
    sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))
    rank = [docid for docid, sim in sims].index(doc_id)
    ranks.append(rank)
    second_ranks.append(sims[1])

In [14]:
# Count how each document ranks against the training corpus
collections.Counter(ranks) 

Counter({0: 291, 1: 9})
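
The ranking procedure can be traced on toy data. The sketch below is an illustration under stated assumptions: the two-dimensional vectors are made up, and `self_rank` and `cosine_similarity` are hypothetical helpers standing in for the gensim calls used in the notebook.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def self_rank(doc_id, query_vector, stored_vectors):
    """Position of doc_id when stored vectors are sorted by similarity to the query."""
    order = sorted(
        range(len(stored_vectors)),
        key=lambda i: cosine_similarity(query_vector, stored_vectors[i]),
        reverse=True,
    )
    return order.index(doc_id)


# Made-up stored vectors and slightly different "inferred" vectors for the same documents
stored = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
inferred = [[0.9, 0.1], [0.1, 0.9], [1.0, 0.2]]  # the last one drifts toward document 0

toy_ranks = [self_rank(i, inferred[i], stored) for i in range(len(stored))]
print(Counter(toy_ranks))
```

A rank of 0 means the document found itself first; in this toy case the third document lands at rank 1 because its inferred vector is closer to document 0.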

About 97% of the inferred documents (291 of 300) are found to be most similar to themselves, and about 3% of the time (9 of 300) a document is mistakenly most similar to another document. Checking an inferred vector against a training vector is a sort of 'sanity check' on whether the model behaves in a usefully consistent manner; it is not a real 'accuracy' value.
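
The shares follow directly from the counter printed above. A small sketch of the arithmetic, with `rank_counts` re-entered by hand from that output:

```python
from collections import Counter

# Counts copied from the notebook output above
rank_counts = Counter({0: 291, 1: 9})

total = sum(rank_counts.values())
shares = {rank: count / total for rank, count in rank_counts.items()}
print(total, shares)
```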

This is encouraging and not entirely surprising. As an example, let us print the most, median, and least similar documents for one document. The cell below reuses doc_id and sims from the last iteration of the loop above, so it shows document 299.

In [15]:
print('Document ({}): «{}»\n'.format(doc_id, ' '.join(train_corpus[doc_id].words)))
print(u'SIMILAR/DISSIMILAR DOCS PER MODEL %s:\n' % model)
for label, index in [('MOST', 0), ('MEDIAN', len(sims)//2), ('LEAST', len(sims) - 1)]:
    print(u'%s %s: «%s»\n' % (label, sims[index], ' '.join(train_corpus[sims[index][0]].words)))

Document (299): «australia will take on france in the doubles rubber of the davis cup tennis final today with the tie levelled at wayne arthurs and todd woodbridge are scheduled to lead australia in the doubles against cedric pioline and fabrice santoro however changes can be made to the line up up to an hour before the match and australian team captain john fitzgerald suggested he might do just that we ll make team appraisal of the whole situation go over the pros and cons and make decision french team captain guy forget says he will not make changes but does not know what to expect from australia todd is the best doubles player in the world right now so expect him to play he said would probably use wayne arthurs but don know what to expect really pat rafter salvaged australia davis cup campaign yesterday with win in the second singles match rafter overcame an arm injury to defeat french number one sebastien grosjean in three sets the australian says he is happy with his form it not v

We can see that Document 299 is most similar to itself, with a similarity score of about 0.95, and that Document 76 is the least similar of all.

Next, we pick a random document from the training corpus with the random module and print the document that ranked second for it, using the second_ranks list built earlier.

In [16]:
# Pick a random document from the training corpus
doc_id = random.randint(0, len(train_corpus) - 1)  # randint includes both endpoints

# Print the document and the one ranked second-most similar to it
print('Train Document ({}): «{}»\n'.format(doc_id, ' '.join(train_corpus[doc_id].words)))
sim_id = second_ranks[doc_id]
print('Similar Document {}: «{}»\n'.format(sim_id, ' '.join(train_corpus[sim_id[0]].words)))

Train Document (76): «the death toll in argentina food riots has risen to local media reports say four more people died this morning in clashes between police and protesters near the presidential palace in the capital buenos aires president fernando de la rua has called on the opposition to take part in government of national unity and apparently will resign if it does not looting and rioting has generally given way to more peaceful demonstrations against the faltering government blamed for month recession heavily armed police using powers under day state of siege decree are attempting to prevent large public gatherings but union leaders say workers and the unemployed will not stop until the government is removed and living standards restored with argentina discredited economy minister now gone the government hopes to approve new budget acceptable to the international monetary fund imf to avoid default on the billion foreign debt the presidents of neighbouring brazil and chile say they
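
When picking a random index, it helps to remember that random.randint includes both endpoints, while random.randrange excludes the upper one. A short sketch, in which `corpus` is a hypothetical stand-in for a list of documents:

```python
import random

corpus = ["doc a", "doc b", "doc c"]  # hypothetical stand-in for a list of documents

random.seed(0)  # fixed seed only so the sketch is reproducible
picks = [random.randrange(len(corpus)) for _ in range(1000)]

# randrange(n) never returns n, so every pick is a valid index;
# randint(0, n) can return n itself, which would raise IndexError on corpus[n]
print(min(picks), max(picks))
```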

Finally, we infer a vector for a random document from the test corpus, follow the same procedure as for the training corpus, and compare the results.

In [17]:
# Pick a random document from the test corpus and infer a vector from the model
doc_id = random.randint(0, len(test_corpus) - 1)  # randint includes both endpoints
inferred_vector = model.infer_vector(test_corpus[doc_id])
sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))

# Compare and print the most/median/least similar documents from the train corpus
print('Test Document ({}): «{}»\n'.format(doc_id, ' '.join(test_corpus[doc_id])))
print(u'SIMILAR/DISSIMILAR DOCS PER MODEL %s:\n' % model)
for label, index in [('MOST', 0), ('MEDIAN', len(sims)//2), ('LEAST', len(sims) - 1)]:
    print(u'%s %s: «%s»\n' % (label, sims[index], ' '.join(train_corpus[sims[index][0]].words)))


SIMILAR/DISSIMILAR DOCS PER MODEL Doc2Vec(dm/m,d50,n5,w5,mc2,s0.001,t3):

MOST (21, 0.5991822481155396): «the nation road toll has risen to after another death on new south wales roads year old man who was injured in crash on the mid north coast last week has died in hospital year old boy who was passenger in the car when it hit telephone pole remains in critical condition new south wales has recorded holiday deaths seven people have died on queensland roads five in victoria three in the northern territory two each in western australia and south australia the act and tasmania remain fatality free»

MEDIAN (10, 0.37200823426246643): «work is continuing this morning to restore power supplies to tens of thousands of homes that were blacked out during wild storms that struck south east queensland last night gale force winds uprooted trees and brought down power lines damaging homes and cars energex and ergon energy have had every available person working through the night to restore power

For this test document, Document 21 of the training corpus is the most similar, with a cosine similarity of about 0.60.
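
The MOST, MEDIAN and LEAST entries come from indexing a list that is already sorted from most to least similar. A sketch with made-up (document id, similarity) pairs; none of these values come from the model:

```python
# Made-up (doc_id, similarity) pairs, sorted from most to least similar
toy_sims = [(0, 0.9), (1, 0.5), (2, 0.3), (3, 0.1), (4, -0.2)]

selected = {
    "MOST": toy_sims[0],                     # first entry: highest similarity
    "MEDIAN": toy_sims[len(toy_sims) // 2],  # middle entry of the sorted list
    "LEAST": toy_sims[-1],                   # last entry: lowest similarity
}
for label, (doc_id, score) in selected.items():
    print(label, doc_id, score)
```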

This gives an idea of how document vectors can be used to find relationships between documents and, from there, to support tasks such as document classification.