# Doc2Vec

DS 5001 Text as Data

**Purpose:** Demonstrate the use of Gensim's Doc2Vec implementation, which can serve as the basis for document retrieval tools.

See https://www.tutorialspoint.com/gensim/gensim_doc2vec_model.htm#

> the Doc2Vec model, as opposed to the Word2Vec model, is used to create a vectorised representation of a group of words taken collectively as a single unit.\
> It doesn’t only give the simple average of the words in the sentence.
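
For contrast, the "simple average" baseline mentioned in the quote can be sketched in a few lines. The two-dimensional word vectors below are invented purely for illustration (they do not come from any trained model), and `average_vector` is a hypothetical helper. The sketch only shows that plain averaging ignores word order: two sentences built from the same words get the same vector.

```python
# Toy word vectors, invented for illustration only (not from a trained model).
toy_vectors = {
    "dog": [1.0, 0.0],
    "bites": [0.0, 1.0],
    "man": [-1.0, 0.5],
}

def average_vector(words, vectors):
    """Return the element-wise mean of the vectors for `words`."""
    rows = [vectors[w] for w in words]
    return [sum(col) / len(rows) for col in zip(*rows)]

a = average_vector("dog bites man".split(), toy_vectors)
b = average_vector("man bites dog".split(), toy_vectors)

# Same words, different order: the averaged vectors are identical.
print(a == b)
```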

# Set Up

In [None]:
import pandas as pd
import numpy as np
import gensim
import plotly.express as px

In [None]:
import configparser
config = configparser.ConfigParser()
config.read("../../../env.ini")
data_home = config['DEFAULT']['data_home']
output_dir = config['DEFAULT']['output_dir']
local_lib = config['DEFAULT']['local_lib']

In [None]:
corpus_prefix = 'austen-melville'
OHCO = ['book_id','chap_id','para_num','sent_num','token_num']
BAG = OHCO[:1] # BOOKS

# Get Data

In [None]:
LIB = pd.read_csv(f"{output_dir}/{corpus_prefix}-LIB.csv").set_index(['book_id'])
LIB['author_id'] = LIB.author.str.split(', ').str[0]
LIB['book_label'] = LIB.author_id + ' ' + LIB.index.astype('str') + ': ' + LIB.title.str[:20]

In [None]:
CORPUS = pd.read_csv(f"{output_dir}/{corpus_prefix}-CORPUS.csv").set_index(OHCO)[['pos','term_str']]
# DOCS = CORPUS.groupby(BAG)

# Convert to Gensim

We follow the Gensim recipe for converting our data from a dataframe to a TaggedDocument.

Note that we use `yield` here. Using `yield` inside a function makes it a **generator**: rather than returning all of its results at once, the function yields them one at a time, pausing after each result and resuming where it left off when the next one is requested.
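
The pause-and-resume behavior can be seen in a small, self-contained sketch. `count_up_to` and the `log` list are hypothetical names used only for illustration; the log records when the function body actually runs.

```python
log = []

def count_up_to(n):
    """Yield the integers 1..n one at a time, logging each step."""
    for i in range(1, n + 1):
        log.append(f"producing {i}")
        yield i

gen = count_up_to(3)        # calling the function only creates the generator
started = list(log)         # snapshot of the log: the body has not run yet

first = next(gen)           # runs the body up to the first yield
after_first = list(log)     # only the first step has been logged

rest = list(gen)            # resumes after the first yield and drains the rest
leftover = list(gen)        # an exhausted generator yields nothing more
```

Because a generator can be consumed only once, a model that needs several passes over the data has to be given something that can be iterated repeatedly, such as a list built from the generator.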

In [None]:
gensim.models.doc2vec.TaggedDocument?

In [None]:
def tagged_document(list_of_list_of_words):
    """Yield one TaggedDocument per token list, tagged with its position."""
    for i, list_of_words in enumerate(list_of_list_of_words):
        yield gensim.models.doc2vec.TaggedDocument([str(w) for w in list_of_words], [i])

In [None]:
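# One token list per book, in the sorted book_id order that groupby produces
data = CORPUS.groupby(BAG).term_str.apply(lambda x: x.to_list()).to_list()
# Materialize the generator as a list so it can be iterated more than once
data_for_training = list(tagged_document(data))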

In [None]:
data_for_training[6][0][:10]

In [None]:
data_for_training[6][1]
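
The two cells above read each training item by position. Gensim's `TaggedDocument` is a named tuple with the fields `words` and `tags`, so the same values can also be read by name. The sketch below avoids importing Gensim by using a stand-in built with the standard library's `namedtuple`; `ToyTaggedDocument`, `toy_tagged_document`, and the sample tokens are assumptions made for illustration.

```python
from collections import namedtuple

# Stand-in with the same field names as Gensim's TaggedDocument (illustration only).
ToyTaggedDocument = namedtuple("ToyTaggedDocument", ["words", "tags"])

def toy_tagged_document(list_of_list_of_words):
    """Mirror of the notebook's generator, using the stand-in class."""
    for i, list_of_words in enumerate(list_of_list_of_words):
        yield ToyTaggedDocument([str(w) for w in list_of_words], [i])

# Made-up token lists standing in for two books.
sample = [["call", "me", "ishmael"], ["it", "is", "a", "truth"]]
docs = list(toy_tagged_document(sample))

doc = docs[1]
print(doc.words[:3])  # same value as doc[0][:3]
print(doc.tags)       # same value as doc[1]
```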

# Generate Model

In [None]:
# 40-dimensional vectors; ignore words seen fewer than 2 times; 30 training passes
model = gensim.models.doc2vec.Doc2Vec(vector_size=40, min_count=2, epochs=30)
model.build_vocab(data_for_training)
model.train(data_for_training, total_examples=model.corpus_count, epochs=model.epochs)

# Document Embedding Matrix

In [None]:
# Row i of model.dv is the book tagged i (sorted book_id order), so sort LIB to match
X = pd.DataFrame(model.dv.get_normed_vectors(), index=LIB.sort_index().book_label)

In [None]:
X.head()

In [None]:
px.imshow(X, color_continuous_scale=px.colors.diverging.Spectral)

In [None]:
import sys
sys.path.append(local_lib)
from hac2 import HAC

In [None]:
dv_tree = HAC(X)
dv_tree.color_thresh = 1.5
dv_tree.plot()

# Try Out

In [None]:
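# Lowercase the queries, assuming term_str is lowercased, so capitalized tokens can match
r1 = model.infer_vector("We went sailing on the Pacific".lower().split())
r2 = model.infer_vector("I so enjoyed the visit to Bath".lower().split())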

In [None]:
R = pd.DataFrame(dict(r1=r1, r2=r2))

In [None]:
px.imshow(R.T, width=1000, height=200, color_continuous_scale=px.colors.diverging.Spectral, color_continuous_midpoint=0)
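
To connect this back to document retrieval: one simple way to rank documents against a query is cosine similarity between an inferred query vector and each document vector. The sketch below is a minimal illustration, not the notebook's method. It uses plain Python lists with invented three-dimensional vectors and labels, and `cosine` and `rank_by_cosine` are hypothetical helpers, so it runs without a trained model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_by_cosine(query, doc_vectors):
    """Return (label, similarity) pairs sorted from most to least similar."""
    scores = [(label, cosine(query, vec)) for label, vec in doc_vectors.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Invented vectors and labels, for illustration only.
toy_docs = {
    "book A": [1.0, 0.0, 0.0],
    "book B": [0.0, 1.0, 0.0],
    "book C": [0.7, 0.7, 0.0],
}
toy_query = [0.9, 0.1, 0.0]

ranking = rank_by_cosine(toy_query, toy_docs)
for label, score in ranking:
    print(f"{label}: {score:.3f}")
```

With the notebook's objects, the same ranking could be computed over the rows of `X`, using `r1` or `r2` as the query. Gensim's `model.dv.most_similar([r1], topn=5)` should give a comparable ranking, keyed by the integer tags rather than by book label.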