# Finding similar documents using Doc2Vec

We just learned how PV-DM and PV-DBOW convert documents into vectors. Now, we will see how to find similar documents using Doc2Vec.

In this section, we will use the 20 newsgroups dataset. It consists of 20,000 documents over 20 different
news categories. We will use only four categories: Computer, Politics, Science, and Sports. We have 1000 documents under each of these four categories.

We rename the documents with a prefix, category_. For example, all science documents are renamed as
Science_1, Science_2, and so on. After renaming them, we combine all the documents and
place them in a single folder. The combined data is available in the data folder as new_dataset.zip.


## Import the required libraries

In [1]:
import warnings
warnings.filterwarnings('ignore')

import os
import gensim
from gensim.models.doc2vec import TaggedDocument

from nltk import RegexpTokenizer
from nltk.corpus import stopwords

tokenizer = RegexpTokenizer(r'\w+')
stopWords = set(stopwords.words('english'))
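The tokenizer and stop-word set defined above could be applied to each document before training. The following is a minimal sketch of that preprocessing step using only the standard library: re.findall with the same \w+ pattern stands in for the RegexpTokenizer, and STOP_WORDS is a tiny hypothetical list standing in for the NLTK English stop words. The function name preprocess is also an assumption made for illustration.

```python
import re

# Hypothetical, deliberately tiny stop-word list; the notebook's stopWords set
# from NLTK is much larger.
STOP_WORDS = {"the", "a", "an", "and", "of", "on", "to", "in", "is"}

def preprocess(text, stop_words=STOP_WORDS):
    """Lowercase the text, keep runs of word characters, drop stop words."""
    tokens = re.findall(r"\w+", text.lower())
    return [token for token in tokens if token not in stop_words]

print(preprocess("The cat sat on the mat, twice!"))
# ['cat', 'sat', 'mat', 'twice']
```

A token list produced this way could be passed as the words of a tagged document in place of a plain whitespace split.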

## Data Preparation

Load all the documents, saving the file names in a list called docLabels and the document contents in a list called data:

In [2]:
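# Collect the names of all text files in the dataset folder
docLabels = [f for f in os.listdir('data/news_dataset') if f.endswith('.txt')]

data = []
for doc in docLabels:
    # Read each document and close the file as soon as it has been read
    with open(os.path.join('data/news_dataset', doc)) as f:
        data.append(f.read())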

As shown below, docLabels holds the file names of our documents:

In [3]:
docLabels[:5]

['Electronics_827.txt',
 'Electronics_848.txt',
 'Science_377.txt',
 'Science_24.txt',
 'Politics_38.txt']

Define a class called DocIterator, which acts as an iterator that runs over all the documents and yields each one as a TaggedDocument tagged with its file name:

In [4]:
class DocIterator(object):
    def __init__(self, doc_list, labels_list):
        self.labels_list = labels_list
        self.doc_list = doc_list

    def __iter__(self):
        for idx, doc in enumerate(self.doc_list):
            yield TaggedDocument(words=doc.split(), tags=[self.labels_list[idx]])
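The corpus is wrapped in a class with an __iter__ method rather than passed as a plain generator because gensim reads the corpus more than once: one pass to build the vocabulary, then one pass per training epoch. The sketch below illustrates the difference without needing gensim. Tagged is a hypothetical stand-in for TaggedDocument, and the toy documents and names are made up for illustration.

```python
from collections import namedtuple

# Stand-in for gensim's TaggedDocument so the sketch runs without gensim
Tagged = namedtuple("Tagged", ["words", "tags"])

class RestartableDocs:
    """Iterable that starts a fresh pass every time it is iterated."""
    def __init__(self, doc_list, labels_list):
        self.doc_list = doc_list
        self.labels_list = labels_list

    def __iter__(self):
        for doc, label in zip(self.doc_list, self.labels_list):
            yield Tagged(words=doc.split(), tags=[label])

def one_shot(doc_list, labels_list):
    """Plain generator: it is exhausted after a single pass."""
    for doc, label in zip(doc_list, labels_list):
        yield Tagged(words=doc.split(), tags=[label])

docs = ["first toy document", "second toy document"]
labels = ["Toy_1.txt", "Toy_2.txt"]

restartable = RestartableDocs(docs, labels)
generator = one_shot(docs, labels)

# Count the documents seen on two consecutive passes over each corpus
first_pass = [len(list(restartable)), len(list(generator))]
second_pass = [len(list(restartable)), len(list(generator))]

print(first_pass)   # [2, 2]
print(second_pass)  # [2, 0]
```

The generator yields nothing on the second pass, which is why a restartable iterable is the safer choice for multi-epoch training.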

Create an instance of the DocIterator class called it:

In [5]:
it = DocIterator(data, docLabels)

## Build the model

Now let us build the model. We first define some of its important hyperparameters:

* size is the dimensionality of the document embeddings.

* alpha is the initial learning rate.

* min_alpha is the value that the learning rate alpha decays to during training.

* dm=1 selects the distributed memory model (PV-DM), and dm=0 selects the distributed bag of words model (PV-DBOW).

* min_count is the minimum word frequency, i.e. any word that occurs fewer than min_count times is ignored.

In [6]:
size = 100
alpha = 0.025
min_alpha = 0.025
dm = 1
min_count = 1
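Note that the cell above sets min_alpha equal to alpha, so the learning rate does not decay within a single call to train. The sketch below illustrates the linear decay from alpha to min_alpha described earlier. The function name linear_decay and the end value 0.0001 are assumptions chosen for illustration, and this is a simplified picture rather than gensim's exact internal schedule.

```python
def linear_decay(alpha, min_alpha, progress):
    """Learning rate after a fraction `progress` (0.0 to 1.0) of training."""
    return alpha - (alpha - min_alpha) * progress

# With min_alpha below alpha, the rate shrinks steadily during training
for progress in (0.0, 0.5, 1.0):
    print(progress, linear_decay(0.025, 0.0001, progress))

# With min_alpha equal to alpha, as in the cell above, the rate never changes
print(linear_decay(0.025, 0.025, 0.5))  # 0.025
```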

Define the model:

In [7]:
# vector_size is the current name of the parameter that older gensim releases called size
model = gensim.models.Doc2Vec(vector_size=size, min_count=min_count, alpha=alpha, min_alpha=min_alpha, dm=dm)
model.build_vocab(it)

Train the model:

In [None]:
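for epoch in range(10):
    # total_examples must match the number of documents counted by build_vocab
    model.train(it, total_examples=model.corpus_count, epochs=model.epochs)
    # Lower the learning rate by hand after each pass; with ten passes it
    # stays positive (0.025 on the first pass, 0.007 on the last)
    model.alpha -= 0.002
    model.min_alpha = model.alpha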
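A hand-rolled schedule like this is worth checking before training, because a fixed step eventually drives the learning rate to zero or below. The following is a minimal sketch in plain Python that lists the rate used on each pass and counts how many passes keep it positive. The helper names manual_schedule and max_safe_passes are hypothetical.

```python
def manual_schedule(start_alpha, step, passes):
    """Learning rate used on each pass when `step` is subtracted after every pass."""
    return [start_alpha - step * i for i in range(passes)]

def max_safe_passes(start_alpha, step):
    """Number of passes that can run before the learning rate stops being positive."""
    if step <= 0:
        raise ValueError("step must be positive")
    passes = 0
    while start_alpha - step * passes > 0:
        passes += 1
    return passes

print(max_safe_passes(0.025, 0.002))  # 13

# Flag the first pass, if any, that would train with a non-positive rate
rates = manual_schedule(0.025, 0.002, 100)
first_bad = next((i for i, rate in enumerate(rates) if rate <= 0), None)
print(first_bad)  # 13 (zero-based index of the first unusable pass)
```

Starting from 0.025 with a step of 0.002, only the first 13 passes use a positive learning rate, so the number of passes and the step size need to be chosen together.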

Save the model:

In [10]:
# Make sure the target folder exists before saving
os.makedirs('model', exist_ok=True)
model.save('model/doc2vec.model')


We can load the saved model using the load function:

In [11]:
d2v_model = gensim.models.doc2vec.Doc2Vec.load('model/doc2vec.model')

## Evaluate the model

After training, we evaluate the model. As shown below, when we pass the tag Electronics_724.txt as input, most_similar returns the ten most similar documents along with their cosine similarity scores:

In [12]:
model.docvecs.most_similar('Electronics_724.txt')

[('Electronics_407.txt', 0.9127770662307739),
 ('Electronics_163.txt', 0.8796253800392151),
 ('Science_480.txt', 0.8787260055541992),
 ('Science_769.txt', 0.8782669305801392),
 ('Science_627.txt', 0.8712874054908752),
 ('Science_737.txt', 0.8702232241630554),
 ('Electronics_461.txt', 0.8684250116348267),
 ('Science_377.txt', 0.8677175045013428),
 ('Electronics_786.txt', 0.867066502571106),
 ('Politics_167.txt', 0.8663994669914246)]
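Because each file name starts with its category, one rough way to judge a result like this is to count how many of the returned neighbours share the query's category. The sketch below does this for the output shown above. The helper names category_of and category_counts are hypothetical, and a single query is only an illustration, not a full evaluation.

```python
from collections import Counter

# The ten neighbours returned for Electronics_724.txt, copied from the output above
neighbours = [
    ('Electronics_407.txt', 0.9127770662307739),
    ('Electronics_163.txt', 0.8796253800392151),
    ('Science_480.txt', 0.8787260055541992),
    ('Science_769.txt', 0.8782669305801392),
    ('Science_627.txt', 0.8712874054908752),
    ('Science_737.txt', 0.8702232241630554),
    ('Electronics_461.txt', 0.8684250116348267),
    ('Science_377.txt', 0.8677175045013428),
    ('Electronics_786.txt', 0.867066502571106),
    ('Politics_167.txt', 0.8663994669914246),
]

def category_of(file_name):
    """Return the category prefix of a name such as 'Science_24.txt'."""
    return file_name.split('_', 1)[0]

def category_counts(results):
    """Count the neighbours per category in a most_similar style result list."""
    return Counter(category_of(name) for name, _ in results)

counts = category_counts(neighbours)
share = counts[category_of('Electronics_724.txt')] / len(neighbours)

print(counts)  # Counter({'Science': 5, 'Electronics': 4, 'Politics': 1})
print(share)   # 0.4
```

For this query, four of the ten neighbours come from the same category, including the two closest ones. Averaging this share over many query documents would give a more reliable picture of the embedding quality.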

We learned how to generate document embeddings using the doc2vec algorithms. In the next section, we will learn how to generate sentence embeddings using the skip-thoughts and quick-thoughts algorithms.