# What is Doc2Vec?

Doc2Vec is an extension of the Word2Vec algorithm, designed to create vector representations (embeddings) of entire documents rather than just individual words. It was introduced by Quoc Le and Tomas Mikolov in 2014.

Key features of Doc2Vec:
1. Document-level embeddings: It can generate fixed-length feature vectors for variable-length pieces of text, such as sentences, paragraphs, or entire documents.
2. Unsupervised learning: It learns from unlabeled text data.
3. Semantic understanding: The resulting embeddings capture semantic similarities between documents.
4. Versatility: It can be used for various NLP tasks like document classification, clustering, and information retrieval.

Doc2Vec has two main training algorithms:
1. Distributed Memory (DM)
2. Distributed Bag of Words (DBOW)

These embeddings can be used as features for machine learning models or for measuring document similarity.
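
Document similarity is commonly measured as the cosine of the angle between two embedding vectors. The following is a small sketch in plain Python; the function name and the three toy vectors are made up for illustration and are not output of a trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional document embeddings
doc_a = [0.2, 0.1, -0.4]
doc_b = [0.4, 0.2, -0.8]    # same direction as doc_a, twice the length
doc_c = [-0.1, 0.4, 0.05]   # perpendicular to doc_a

print(cosine_similarity(doc_a, doc_b))  # close to 1.0: same direction
print(cosine_similarity(doc_a, doc_c))  # close to 0.0: unrelated direction
```

Because cosine similarity ignores vector length, a long and a short document can still score as highly similar when their embeddings point in the same direction.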


# Distributed Memory (DM)

Distributed Memory (DM) is one of the two main training approaches in Doc2Vec, alongside Distributed Bag of Words (DBOW). In this model:

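1. The model is trained to predict a target word from the document vector combined with the vectors of the surrounding context words. The document vector acts as a memory of the topic that the local context alone does not capture.
2. It's similar to the Continuous Bag of Words (CBOW) model in Word2Vec, but with an additional document vector.
3. When the document and word vectors are concatenated, DM preserves word order within a small context window; when they are averaged or summed, order inside the window is lost.
4. Le and Mikolov report that DM outperforms DBOW in their experiments, although relative performance depends on the dataset and hyperparameters.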
5. The DM model can be computationally more expensive than DBOW.

The DM approach in Doc2Vec allows for a richer representation of documents, capturing both the semantic meaning and some aspects of word order, which can be beneficial for various NLP tasks such as document classification and information retrieval.
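
To make the DM training signal concrete, here is a rough sketch in plain Python of how training examples could be formed for one document. It assumes a symmetric context window (as in CBOW); the function name and sample tokens are hypothetical, and real implementations add subsampling, negative sampling, and the actual vector updates.

```python
def dm_training_examples(doc_id, tokens, window=2):
    """Build (doc_id, context_words, target_word) triples for one document.

    In DM, the document vector for doc_id and the vectors of context_words
    are combined (averaged, summed, or concatenated) to predict target_word.
    """
    examples = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        if context:
            examples.append((doc_id, context, target))
    return examples

tokens = ["the", "cat", "sat", "down"]
examples = dm_training_examples("doc_0", tokens, window=1)
for example in examples:
    print(example)
```

Every triple carries the same `doc_id`, which is why the document vector ends up summarizing what all the context windows of that document have in common.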


# Distributed Bag of Words (DBOW)

Distributed Bag of Words (DBOW) is the other main training approach in Doc2Vec. In this model:

1. The document vector is trained to predict the words in the document.
2. It's similar to the Skip-gram model in Word2Vec, but instead of using a word to predict surrounding words, it uses the document vector to predict words in the document.
3. DBOW is generally faster and uses less memory than the Distributed Memory (DM) approach.
4. It has fewer parameters than DM, since in its basic form no word vectors need to be stored, which can help on small datasets.

The DBOW model in Doc2Vec captures the overall semantic meaning of the document, allowing for efficient document similarity comparisons and other NLP tasks.
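
The DBOW training signal can be sketched the same way. In this simplified plain-Python illustration (the function name and tokens are hypothetical), each training pair consists only of the document ID and one word from the document, so shuffling the words does not change the set of pairs.

```python
import random

def dbow_training_pairs(doc_id, tokens):
    """Build (doc_id, word) pairs: the document vector alone predicts each word."""
    return [(doc_id, word) for word in tokens]

tokens = ["the", "cat", "sat", "down"]
pairs = dbow_training_pairs("doc_0", tokens)
print(pairs)

# Shuffle a copy of the tokens: the multiset of training pairs is unchanged,
# which is why DBOW is described as a bag-of-words model.
shuffled = tokens[:]
random.Random(0).shuffle(shuffled)
same_pairs = sorted(dbow_training_pairs("doc_0", shuffled)) == sorted(pairs)
print(same_pairs)
```

No context words appear in the pairs, which is where the speed and memory advantage over DM comes from. In gensim this mode is selected with `dm=0`.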


# Example usage of Doc2Vec

```python
# Import necessary libraries
import nltk
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize

# word_tokenize needs NLTK's Punkt tokenizer data. Recent NLTK releases look
# for "punkt_tab" and older ones for "punkt", so both are requested here.
nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)

# Sample documents
documents = [
    "Doc2Vec is an extension of Word2Vec for entire documents.",
    "It creates vector representations of variable-length text.",
    "The algorithm can capture semantic similarities between documents.",
    "Doc2Vec is useful for various NLP tasks like classification and clustering.",
    "It uses unsupervised learning to generate document embeddings."
]

# Preprocess the documents: lowercase, tokenize, and tag each one with its index
tagged_data = [
    TaggedDocument(words=word_tokenize(doc.lower()), tags=[str(i)])
    for i, doc in enumerate(documents)
]

# Train a Doc2Vec model (min_count=2 drops words that appear only once)
model = Doc2Vec(vector_size=50, min_count=2, epochs=40)
model.build_vocab(tagged_data)
model.train(tagged_data, total_examples=model.corpus_count, epochs=model.epochs)

# Example usage: infer a vector for a new, unseen document.
# Inference is randomized, so the values may differ between runs.
new_doc = "Doc2Vec is used for document embeddings"
new_vector = model.infer_vector(word_tokenize(new_doc.lower()))

print("Vector for new document:", new_vector)

# Find the training documents most similar to the inferred vector
similar_docs = model.dv.most_similar([new_vector])
print("\nSimilar documents:")
for doc, similarity in similar_docs:
    print(f"Document {doc}: Similarity = {similarity:.4f}")
```


```text
Vector for new document: [-7.3994203e-03  5.6641987e-03 -5.2486341e-03  8.0357371e-03
  3.5742640e-03  4.0831073e-04 -6.3471780e-03  4.2954260e-03
  7.9677692e-03 -2.5655502e-03  1.0256020e-03  6.1734598e-03
  3.4994003e-03  2.0215532e-03  1.2132036e-04 -1.0053399e-02
 -6.3982286e-04 -9.6847489e-03 -7.3886719e-03  7.2929231e-03
 -3.8932411e-03  4.4750838e-04 -3.4880273e-03  2.9103050e-03
  6.3013220e-03  6.6286274e-03  9.2212204e-03 -4.8543504e-03
 -6.3678785e-03 -5.2491724e-03 -4.9779154e-03 -8.0104759e-03
 -2.5848330e-03 -9.3911439e-03  8.8495361e-03 -3.8854422e-03
  8.9726066e-03  6.0794423e-03 -3.5426712e-03 -1.9116531e-03
 -9.6985502e-03 -9.8396381e-03 -6.1360551e-03  9.6534677e-03
  9.4779823e-03  2.5378338e-03  9.5006726e-05 -8.8074617e-03
  6.2138718e-03 -6.3329767e-03]

Similar documents:
Document 3: Similarity = 0.1360
Document 0: Similarity = 0.0904
Document 1: Similarity = -0.0176
Document 2: Similarity = -0.0890
Document 4: Similarity = -0.2382
```

With only five short training documents, the learned vectors stay close to their random initialization, so these similarity scores are near zero and their ranking is not meaningful. The exact numbers also vary between runs. A useful model needs a much larger training corpus.
