# Introduction

There are tons of articles explaining how Doc2Vec works, but sadly I had a really hard time finding a practical implementation of the model. So, in this article, we will look at how Doc2Vec works in practice and build a working implementation of the model 🙌

## Getting Started

We will build an article recommender using Gensim's Doc2Vec, so without further ado, let's get started. First off, let's do some housekeeping for our article recommender by importing the Python libraries we need.

In [1]:
# Import libraries and modules.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import pandas as pd

## Reading data into memory
I am reading a dataset file from my local machine. It has multiple columns, but I'll be using only the abstract column for this tutorial. You can find a good dataset on Kaggle (https://www.kaggle.com/benhamner/nips-papers?select=papers.csv). The process for that dataset will be pretty much the same.

In [2]:
file = pd.read_csv("articles.csv")
common_texts_pre = file["abstract"]
common_texts = common_texts_pre[:80]
common_texts

0     In this paper we develop a Bayesian nonparamet...
1     Q-learning and other linear dynamic learning a...
2     A more effective vision of machine learning sy...
3     Dealing with multiple labels is a supervised l...
4     This paper addresses the problem of learning e...
                            ...                        
75    Consider optimization problems, where a target...
76    To better understand and analyze text corpora,...
77    A number of intriguing decision scenarios, suc...
78    High rate of correctness of the information in...
79    In this paper, we are dealing with the automat...
Name: abstract, Length: 80, dtype: object
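
If you only need the abstract column, pandas can skip the other columns while reading. Here is a minimal sketch, assuming a small in-memory CSV in place of articles.csv. The `title` and `year` columns and the sample rows are made up for illustration.

```python
import io

import pandas as pd

# Small in-memory CSV standing in for articles.csv (hypothetical contents).
csv_text = """title,year,abstract
Paper A,2015,First abstract text.
Paper B,2016,Second abstract text.
Paper C,2017,Third abstract text.
"""

# usecols loads only the listed column; nrows caps how many rows are parsed.
frame = pd.read_csv(io.StringIO(csv_text), usecols=["abstract"], nrows=2)

# Drop rows with a missing abstract so later tokenization never sees NaN,
# then renumber the rows so positions stay contiguous.
abstracts = frame["abstract"].dropna().reset_index(drop=True)
print(abstracts.tolist())
```

With a real file you would pass the file path instead of the `io.StringIO` object and pick `nrows` to match how many abstracts you want to train on.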

## Data Pre-processing
To use this data, we need to pre-process it first. What do I mean by pre-processing? In this case, we will convert all the words to lowercase, remove stop words, and tokenize the text. For the stop words and the tokenizer we will be using NLTK, as you might have guessed from the imports.
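
The three steps can be sketched with the standard library alone, applying the stop-word check to each token rather than to the whole abstract. This is only an illustration: the tiny `STOP_WORDS` set and the regex tokenizer are stand-ins I made up for NLTK's stop-word list and `word_tokenize`, and unlike `word_tokenize` the regex also drops punctuation.

```python
import re

# Tiny illustrative stop-word set (an assumption for this sketch); with NLTK
# you would typically build it from stopwords.words("english").
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "on", "the", "this", "to", "we"}


def preprocess(text):
    """Lowercase, tokenize, and drop stop words from one document."""
    # Simple regex tokenizer: keeps runs of lowercase letters and digits.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Filter token by token, not document by document.
    return [tok for tok in tokens if tok not in STOP_WORDS]


docs = [
    "In this paper we develop a Bayesian technique.",
    "Q-learning is a learning algorithm.",
]
processed = [preprocess(doc) for doc in docs]
print(processed)
```

Building the stop-word collection once as a set keeps each membership test cheap, which matters once you have thousands of abstracts.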

In [3]:
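# Lowercase and tokenize each abstract.
# Note: the if clause compares each whole abstract (not each token) against
# the stop-word list, so in practice it drops nothing and the tokens shown
# below still contain stop words.
common_texts = [word_tokenize(sw_removed.lower()) for sw_removed in common_texts if sw_removed not in stopwords.words()]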
common_texts

[['in', 'this', 'paper', 'we', 'develop', 'a', 'bayesian', 'nonparametric',
  'inverse', 'reinforcement', 'learning', 'technique', 'for', 'switched',
  'markov', 'decision', 'processes', '(', 'mdp', ')', '.', 'similar', 'to',
  'switched', 'linear', 'dynamical', 'systems', ',', 'switched', 'mdp', '(',
  'smdp', ')', 'can', 'be', 'used', 'to', 'represent', 'complex', 'behaviors',
  'composed', 'of', 'temporal', 'transitions', 'between', 'simpler',
  'behaviors', 'each', 'represented', 'by', 'a', 'standard', 'mdp', '.', 'we',
  'use', 'sticky', 'hierarchical', 'dirichlet', 'process', 'as', 'a',
  'nonparametric', 'prior', 'on', 'the', 'smdp', 'model', 'space', ',', 'and',
  'describe', 'a', 'markov', 'chain', 'monte', 'carlo', 'method', 'to',
  'efficiently', 'learn', 'the', 'posterior', 'on', 'the', 'smdp', 'models',
  'given', 'the',


Lastly, we will tag our data. Tagging means that each document (here, each abstract) is mapped to a unique index. This tagged data will be the input for our model.

In [4]:
# Tagged documents are input for doc2vec model. 
tagged_data = []
for i, doc in enumerate(common_texts):
    tagged = TaggedDocument(doc, [i])
    tagged_data.append(tagged)

tagged_data

[TaggedDocument(words=['in', 'this', 'paper', 'we', 'develop', 'a', 'bayesian', 'nonparametric', 'inverse', 'reinforcement', 'learning', 'technique', 'for', 'switched', 'markov', 'decision', 'processes', '(', 'mdp', ')', '.', 'similar', 'to', 'switched', 'linear', 'dynamical', 'systems', ',', 'switched', 'mdp', '(', 'smdp', ')', 'can', 'be', 'used', 'to', 'represent', 'complex', 'behaviors', 'composed', 'of', 'temporal', 'transitions', 'between', 'simpler', 'behaviors', 'each', 'represented', 'by', 'a', 'standard', 'mdp', '.', 'we', 'use', 'sticky', 'hierarchical', 'dirichlet', 'process', 'as', 'a', 'nonparametric', 'prior', 'on', 'the', 'smdp', 'model', 'space', ',', 'and', 'describe', 'a', 'markov', 'chain', 'monte', 'carlo', 'method', 'to', 'efficiently', 'learn', 'the', 'posterior', 'on', 'the', 'smdp', 'models', 'given', 'the', 'behavior', 'data', '.', 'we', 'demonstrate', 'the', 'effectiveness', 'of', 'smdp', 'models', 'for', 'learning', ',', 'prediction', 'and', 'classification'

## Training the model
Once the data is ready, we can start training. We will create a Doc2Vec object, called 'model' here, and pass hyperparameters to it. You can tweak these hyperparameters to improve the quality of your model. Lastly, we will save the model, because retraining it every time is cumbersome, and yes, you can deploy your model using that saved file.

In [6]:
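max_epochs = 100   # number of outer loop iterations below
vec_size = 20      # dimensionality of each document vector
alpha = 0.025      # initial learning rate

model = Doc2Vec(vector_size=vec_size,
                alpha=alpha,
                min_alpha=0.00025,
                min_count=1,
                dm=1)  # dm=1 selects the distributed memory (PV-DM) variant

model.build_vocab(tagged_data)

# Note: each train() call below runs model.epochs passes over the data,
# so the total number of passes is max_epochs * model.epochs.
for epoch in range(max_epochs):
    print('iteration{0}'.format(epoch))
    model.train(tagged_data,
                total_examples=model.corpus_count,
                epochs=model.epochs)

    # Decrease the learning rate
    model.alpha -= 0.0002

    # Pin min_alpha to alpha so the next train() call uses a constant rate
    model.min_alpha = model.alpha

model.save("articles.model")
print("Model Saved")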

iteration0
iteration1
iteration2
...
iteration83
...
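
To see what the manual decay in the training loop does to the learning rate, here is a small arithmetic sketch using the same numbers as the training cell (100 iterations, a starting alpha of 0.025, and a step of 0.0002). It only reproduces the schedule and does not train anything. The variable names are mine.

```python
max_epochs = 100       # outer loop iterations, as in the training cell
start_alpha = 0.025    # initial value of model.alpha
step = 0.0002          # amount subtracted after each train() call

# Value of alpha at the moment train() is called in each outer iteration.
alphas = [start_alpha - step * i for i in range(max_epochs)]

# Value left in alpha once the loop has finished.
final_alpha = start_alpha - step * max_epochs

print(f"first iteration alpha: {alphas[0]:.4f}")
print(f"last iteration alpha:  {alphas[-1]:.4f}")
print(f"alpha after the loop:  {final_alpha:.4f}")
```

The rate stays positive here only because 100 × 0.0002 is smaller than 0.025. If you raise max_epochs beyond 125 without shrinking the step, later iterations would train with a zero or negative learning rate. As far as I know, Gensim's documentation recommends calling train() once with the epochs argument set and letting the library decay the rate from alpha to min_alpha, which avoids this bookkeeping.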

## Evaluating our model
Now that we have trained our model, it's time to see it in action. We will use cosine similarity to evaluate it. Let's import our libraries and modules again and load the model that we saved.
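
For intuition, cosine similarity is the dot product of two vectors divided by the product of their lengths: 1 means they point in the same direction, 0 means they are orthogonal. The sketch below computes it by hand and ranks a few toy vectors against a query. The vectors are made-up numbers, and this is a conceptual illustration, not Gensim's actual implementation.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy 3-dimensional "document vectors" keyed by tag (made-up numbers).
doc_vectors = {0: [1.0, 0.0, 0.0], 1: [0.0, 1.0, 0.0], 2: [1.0, 1.0, 0.0]}
query = [2.0, 0.0, 0.0]

# Rank tags from most to least similar to the query.
ranked = sorted(
    ((tag, cosine_similarity(query, vec)) for tag, vec in doc_vectors.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranked)
```

Note that the query is twice as long as the first document vector yet still scores 1.0 against it, because cosine similarity ignores vector length and compares direction only.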

In [7]:
from gensim.models.doc2vec import Doc2Vec
from nltk.tokenize import word_tokenize

model = Doc2Vec.load("articles.model")

Now let's create a new sentence and infer a vector for it, so that we can compare it against the documents our model has learned.

In [8]:
# Create a new sentence and vectorize it.
new_sentence = "this is a new sentence".split(" ")
new_sentence_vectorized = model.infer_vector(new_sentence)

# Find the most similar training documents by cosine similarity.
# Note: in Gensim 4.x, model.docvecs has been renamed to model.dv.
similar_sentences = model.docvecs.most_similar(positive=[new_sentence_vectorized])

In [9]:
similar_sentences

[(0, 0.7291395664215088),
 (17, 0.7216114401817322),
 (10, 0.7141540050506592),
 (7, 0.6756813526153564),
 (6, 0.6751018762588501),
 (14, 0.6680215001106262),
 (13, 0.667238712310791),
 (5, 0.6636887788772583),
 (18, 0.6495362520217896),
 (62, 0.6347824335098267)]

## The Results
To make the results easier to read, we will use pandas to convert the output into a data frame that shows each matching abstract next to its similarity score.

In [10]:
import pandas as pd

# Pair each matched abstract with its similarity score.
output = []
for index, score in similar_sentences:
    output.append([common_texts_pre[index], score])

pd.DataFrame(output, columns=["common_texts", "cosine_similarity"])

|   | common_texts | cosine_similarity |
|---|--------------|-------------------|
| 0 | In this paper we develop a Bayesian nonparamet... | 0.72914 |
| 1 | In this paper, we consider the adaptation of t... | 0.721611 |
| 2 | Human action recognition is an important compo... | 0.714154 |
| 3 | Facial expression recognition is an active are... | 0.675681 |
| 4 | The Restricted Boltzmann Machine (RBM), a spec... | 0.675102 |
| 5 | Activities capture vital facts for the semanti... | 0.668022 |
| 6 | This paper proposes a neurobiology-based exten... | 0.667239 |
| 7 | In this paper, we proposed a deep convolutiona... | 0.663689 |
| 8 | Association is widely used to find relations a... | 0.649536 |
| 9 | Building one's vocabulary in a language is an ... | 0.634782 |


## Conclusion
That's pretty much it for this tutorial. You now have a trained model that you can deploy and use as the basis for more useful projects.