<a href="https://colab.research.google.com/github/rahiakela/practical-natural-language-processing/blob/chapter-3-text-representation/8_training_document_vectors_using_gensim.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Distributed Representations Beyond Words and Characters

So far, we’ve seen two approaches to coming up with text representations using embeddings. 

1. Word2vec learned representations for words, and we aggregated them
to form text representations. 
2. fastText learned representations for character n-grams, which were aggregated to form word representations and then text representations. 

A potential problem with both approaches is that they do not take the context of words into account: once the individual vectors are aggregated, word order is lost.

Take, for example, the sentences “dog bites man” and “man bites dog.” Both receive the same representation in these approaches, but they obviously have very different meanings. 
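
The order problem can be seen with a few lines of plain Python. This is a minimal sketch, assuming that the aggregation step is a simple average; the three-dimensional `toy_vectors` and the helper `average_embedding` are made up for illustration and are not taken from a trained model.

```python
# Toy 3-dimensional word vectors; the values are invented for illustration.
toy_vectors = {
    "dog":   [0.9, 0.1, 0.0],
    "bites": [0.2, 0.8, 0.3],
    "man":   [0.1, 0.2, 0.9],
}

def average_embedding(tokens, vectors):
    """Element-wise mean of the word vectors for the given tokens."""
    dims = len(next(iter(vectors.values())))
    totals = [0.0] * dims
    for token in tokens:
        for i, value in enumerate(vectors[token]):
            totals[i] += value
    return [total / len(tokens) for total in totals]

first = average_embedding(["dog", "bites", "man"], toy_vectors)
second = average_embedding(["man", "bites", "dog"], toy_vectors)

# Averaging ignores word order, so the two sentence vectors coincide
# (up to floating-point rounding).
same = all(abs(a - b) < 1e-9 for a, b in zip(first, second))
print(same)
```

Any order-insensitive aggregation (sum, mean, max) behaves the same way, which is the motivation for learning a vector for the whole text instead.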

Let’s look at another approach, Doc2vec, which allows us to directly learn the representations for texts of arbitrary lengths (phrases, sentences,
paragraphs, and documents) by taking the context of words in the text into account.

Doc2vec is based on the paragraph vectors framework and is implemented in gensim. Its general architecture is similar to that of Word2vec, except that, in addition to the word vectors, it also learns a “paragraph vector” that represents the full text (i.e., with words in context). 

When learning with a large corpus of many texts, the paragraph vectors are unique for a given text (where “text” can mean any piece of text of arbitrary length), while word vectors will be shared across all texts. 
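
One way to picture this is as two separate lookup tables. The sketch below is only an illustration of the bookkeeping, not of training: the names `paragraph_vectors`, `word_vectors`, and `random_vector` are hypothetical, and the vectors are random placeholders.

```python
import random

random.seed(0)
DIM = 4  # toy dimensionality, chosen arbitrarily

# Two texts that contain exactly the same words in a different order.
texts = {"0": ["dog", "bites", "man"], "1": ["man", "bites", "dog"]}

def random_vector(dim):
    """Placeholder initialisation; a real model would learn these values."""
    return [random.uniform(-0.5, 0.5) for _ in range(dim)]

# One paragraph vector per text, keyed by the text's tag.
paragraph_vectors = {tag: random_vector(DIM) for tag in texts}

# One word vector per vocabulary entry, shared by every text.
vocabulary = sorted({word for tokens in texts.values() for word in tokens})
word_vectors = {word: random_vector(DIM) for word in vocabulary}

print(len(paragraph_vectors), "paragraph vectors,", len(word_vectors), "word vectors")
```

Even though both texts use the same three words, each one gets its own entry in `paragraph_vectors`, while `word_vectors` holds a single entry per word.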

The shallow neural networks used to learn Doc2vec embeddings are very similar to the CBOW and SkipGram architectures of Word2vec. The two Doc2vec architectures are called distributed memory (DM), which resembles CBOW, and distributed bag of words (DBOW), which resembles SkipGram.

<img src='https://github.com/rahiakela/img-repo/blob/master/practical-nlp/doc2vec-architecture.png?raw=1' width='800'/>
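
A rough way to compare the two architectures is to list what each one takes as input and what it tries to predict. The sketch below only enumerates (input, target) pairs and does no training; the function names `dm_examples` and `dbow_examples` and the `window` parameter are hypothetical, and real implementations differ in details such as how words are sampled.

```python
def dm_examples(tag, tokens, window=1):
    """DM: the paragraph tag plus neighbouring words predict the centre word."""
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        examples.append(((tag, tuple(context)), target))
    return examples

def dbow_examples(tag, tokens):
    """DBOW: the paragraph tag alone predicts each word of the text."""
    return [(tag, target) for target in tokens]

tokens = ["dog", "bites", "man"]

for inputs, target in dm_examples("doc_0", tokens):
    print("DM  ", inputs, "->", target)

for inputs, target in dbow_examples("doc_0", tokens):
    print("DBOW", inputs, "->", target)
```

In DM the paragraph vector acts as an extra context item alongside the surrounding words, while DBOW ignores the surrounding words in its input entirely.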


## Document Vectors

In this notebook, we demonstrate how to train a Doc2vec model on a custom corpus.

In [5]:
import warnings
warnings.filterwarnings('ignore')

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize
from pprint import pprint

In [6]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [8]:
# Assume each sentence in `data` corresponds to a separate document.
data = [
  "Dog bites man.",
  "Man bites dog.",
  "Dog eats meat.",
  "Man eats food."
]

# Lowercase and tokenize each document, and tag it with its index (as a string).
tagged_data = [TaggedDocument(words=word_tokenize(doc.lower()), tags=[str(i)]) for i, doc in enumerate(data)]
tagged_data

[TaggedDocument(words=['dog', 'bites', 'man', '.'], tags=['0']),
 TaggedDocument(words=['man', 'bites', 'dog', '.'], tags=['1']),
 TaggedDocument(words=['dog', 'eats', 'meat', '.'], tags=['2']),
 TaggedDocument(words=['man', 'eats', 'food', '.'], tags=['3'])]

In [9]:
# Distributed bag of words (DBOW): dm=0
model_dbow = Doc2Vec(tagged_data, vector_size=20, min_count=1, epochs=2, dm=0)

In [10]:
# Infer a feature vector for the tokens of "man eats food".
model_dbow.infer_vector(['man','eats','food'])

array([-0.00573573, -0.00158627, -0.01403498,  0.0076449 , -0.00503669,
       -0.01186671, -0.01365232, -0.00052146,  0.01362064, -0.00897493,
       -0.01328038,  0.01102936, -0.01257596, -0.02315849, -0.01362393,
       -0.00459146,  0.01593641,  0.00050824, -0.02132892, -0.01697199],
      dtype=float32)

In [11]:
# Top 5 words most similar to "man".
model_dbow.wv.most_similar("man", topn=5)

[('food', 0.4122661054134369),
 ('eats', 0.11905860155820847),
 ('dog', 0.10442954301834106),
 ('meat', 0.024429846554994583),
 ('bites', -0.01625148206949234)]

In [12]:
model_dbow.wv.n_similarity(["dog"], ["man"])

0.10442954

In [13]:
# Distributed memory (DM): dm=1
model_dm = Doc2Vec(tagged_data, min_count=1, vector_size=20, epochs=2, dm=1)
print("Inference Vector of man eats food\n ", model_dm.infer_vector(['man','eats','food']))

Inference Vector of man eats food
  [-0.00573572 -0.00158626 -0.01403498  0.00764489 -0.00503669 -0.01186671
 -0.01365232 -0.00052146  0.01362063 -0.00897494 -0.01328037  0.01102937
 -0.01257597 -0.02315848 -0.01362392 -0.00459146  0.0159364   0.00050822
 -0.02132893 -0.01697199]


In [14]:
print("Most similar words to man in our corpus\n", model_dm.wv.most_similar("man",topn=5))

Most similar words to man in our corpus
 [('food', 0.4122661054134369), ('eats', 0.11905860155820847), ('dog', 0.10442954301834106), ('meat', 0.024429846554994583), ('bites', -0.01625148206949234)]


In [15]:
print("Similarity between man and dog: ", model_dm.wv.n_similarity(["dog"],["man"]))

Similarity between man and dog:  0.10442954


What happens when we compare words that are not in the vocabulary? Here, “covid” never appears in the training data.

In [16]:
model_dm.wv.n_similarity(['covid'],['man'])

KeyError: ignored
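
One way to avoid this failure is to check for membership before comparing. The helper below is a minimal sketch: `safe_similarity` is a hypothetical name, and it only assumes that the vectors object supports `word in vectors` and `vectors[word]`. A plain dict of made-up vectors is used here so the example runs on its own; a trained model's `wv` object supports the same two operations.

```python
import math

def safe_similarity(vectors, word_a, word_b):
    """Cosine similarity of two words, or None if either is out of vocabulary."""
    if word_a not in vectors or word_b not in vectors:
        return None
    a, b = vectors[word_a], vectors[word_b]
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else None

# Made-up 2-dimensional vectors standing in for a trained vocabulary.
toy_vectors = {"dog": [1.0, 0.0], "man": [0.0, 1.0]}

print(safe_similarity(toy_vectors, "dog", "man"))    # both words are known
print(safe_similarity(toy_vectors, "covid", "man"))  # 'covid' is not a key
```

Returning `None` leaves it to the caller to decide how to treat unknown words, for example by skipping the pair or falling back to a default score.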

Once the Doc2vec model is trained, paragraph vectors for new texts are inferred
using the common word vectors from training. Doc2vec was perhaps the first widely
accessible implementation for getting an embedding representation for the full text
instead of using a combination of individual word vectors. Since it models some form
of context and can encode texts of arbitrary length into a fixed, low-dimensional,
dense vector, it has been used in a wide range of NLP applications, such as
text classification, document tagging, text recommendation systems, and simple chatbots
for FAQs.