# <center> <font size = 24 color = 'steelblue'> <b> Doc2Vec

<div class="alert alert-block alert-info">
    
<font size = 4>

- Demonstration of Doc2Vec using a custom corpus

# <a id= 'dv0'>
<font size = 4>
    
**Table of contents:**<br>
[1. Install and import the requirements](#dv1)<br>
[2. Preparing the data](#dv2)<br>
[3. Distributed bag of words version of paragraph vector (DBoW)](#dv3)<br>
[4. Distributed memory version of paragraph vector (PV-DM)](#dv4)<br>

##### <a id = 'dv1'>
<font size = 10 color = 'midnightblue'> <b>Install and import the requirements

In [1]:
!pip install gensim
!pip install spacy
!pip install nltk



<font size = 5 color = seagreen> <b>Import necessary packages

In [1]:
import nltk
from nltk.tokenize import word_tokenize
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

In [2]:
# To suppress warning messages
import warnings
warnings.filterwarnings('ignore')

<font size = 5 color = seagreen><b> Download the necessary corpora.

In [3]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/toddwalters/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

[top](#dv0)

##### <a id = 'dv2'>
<font size = 10 color = 'midnightblue'> <b>Preparing the data

<font size = 5 color = pwdrblue><b> Define the documents

In [4]:
documents = ["The analyst reviews the dataset to identify trends and patterns."
             "Data analysis helps businesses make informed decisions based on facts and figures.",
             "In a research project the team gathers data for subsequent analysis.",
             "Charts and graphs are used to visually represent the results of data analysis.",
             "Analyzing customer feedback data provides valuable insights for product improvement."]

<font size = 5 color = pwdrblue><b>Create tagged documents:
<div class="alert alert-block alert-success">
    
<font size = 4>
    
- The TaggedDocument function represents document along with a tag.
- This generates data in the acceptable input format for the Doc2Vec function.

In [5]:
tagged_data = [TaggedDocument(words=word_tokenize(word.lower()), tags=[str(i)]) for i, word in enumerate(documents)]

In [6]:
print(tagged_data[1])

TaggedDocument<['in', 'a', 'research', 'project', 'the', 'team', 'gathers', 'data', 'for', 'subsequent', 'analysis', '.'], ['1']>


[top](#dv0)

##### <a id = 'dv3'>
<font size = 10 color = 'midnightblue'> <b> Distributed bag of words version of paragraph vector (DBoW)

<div class="alert alert-block alert-success">
    
<font size = 4>
    
- The model is trained to predict words randomly sampled from the paragraph (document) it is processing, without using the word order information.

<font size = 5 color = pwdrblue><b>  Create the model object with tagged data

In [7]:
dbow_model = Doc2Vec(tagged_data,vector_size=20, min_count=1, epochs=2,dm=0)

<font size = 5 color = pwdrblue><b>  Get feature vector for : "***Data analysis identifies trends and patterns.***"

In [8]:
print(dbow_model.infer_vector(["Data", "analysis", "identifies", "trends", "and", "patterns"]))

[ 0.00942159  0.02201645 -0.02475953 -0.01763088 -0.01945561 -0.00756865
 -0.0089284   0.01868463 -0.01877881 -0.01570377  0.0041464  -0.02134602
  0.00803086 -0.00548086  0.00700091  0.02061669 -0.00221698 -0.0096117
 -0.01813458 -0.01563823]


<font size = 5 color = pwdrblue><b>  Get top 5 most simlar words.

In [9]:
dbow_model.wv.most_similar("analysis", topn=5)

[('subsequent', 0.5829257965087891),
 ('reviews', 0.38686245679855347),
 ('dataset', 0.3179265558719635),
 ('insights', 0.26625779271125793),
 ('research', 0.19855979084968567)]

<font size = 5 color = pwdrblue><b>  Get the cosine similarity between the two sets of documents.

In [10]:
dbow_model.wv.n_similarity(["data", "analysis"],["insights"])

0.20002559

[top](#dv0)

##### <a id = 'dv4'>
<font size = 10 color = 'midnightblue'> <b> Distributed memory version of paragraph vector (PV-DM)

<font size = 5 color = pwdrblue><b>  Create model object

In [11]:
dm_model = Doc2Vec(tagged_data, min_count=1, vector_size=20, epochs=2,dm=1)

<font size = 5 color = pwdrblue><b>  Get feature vector for : "***Data analysis identifies trends and patterns.***"

In [12]:
print(dm_model.infer_vector(["Data", "analysis", "identifies", "trends", "and", "patterns"]))


[ 0.00942171  0.02201657 -0.02475939 -0.01763096 -0.01945573 -0.00756857
 -0.00892823  0.01868476 -0.01877866 -0.01570369  0.0041464  -0.02134612
  0.00803107 -0.00548067  0.00700104  0.02061677 -0.00221697 -0.00961185
 -0.01813435 -0.01563825]


<font size = 5 color = pwdrblue><b>  Get top5 most similar keys to given word

In [13]:
dm_model.wv.most_similar("analysis",topn=5)


[('subsequent', 0.5829257965087891),
 ('reviews', 0.386778861284256),
 ('dataset', 0.31784698367118835),
 ('insights', 0.26625844836235046),
 ('research', 0.1984594762325287)]

In [14]:
dm_model.wv.n_similarity(["data", "analysis"],["insights"])

0.20011896

<div class="alert alert-block alert-success">
    
<font size = 4>

<center> <b> What happens when we compare between words which are not in the vocabulary?

In [15]:
dm_model.wv.n_similarity(['covid'],['data'])

0.0

<div class="alert alert-block alert-success">
    
<font size = 4>
    
<center> <b>If the word is not in vocabulary the similarity score with other words will be zero.


[top](#dv0)