# doc2vec: How To Implement doc2vec

### Train Our Own Model

In [3]:
# Read in data, clean it, and then split into train and test sets
import gensim
import pandas as pd
from sklearn.model_selection import train_test_split
pd.set_option('display.max_colwidth', 100)

messages = pd.read_csv('../../../data/spam.csv', encoding='latin-1')
# Drop the unused extra columns and give the remaining two descriptive names
messages = messages.drop(labels=["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], axis=1)
messages.columns = ["label", "text"]
# Lowercase and tokenize each message into a list of words
messages['text_clean'] = messages['text'].apply(gensim.utils.simple_preprocess)

X_train, X_test, y_train, y_test = train_test_split(messages['text_clean'],
                                                    messages['label'], test_size=0.2)

In [5]:
# Wrap each tokenized message in a TaggedDocument so the model can be trained on it;
# the tag is the message's position in X_train
tagged_docs = [gensim.models.doc2vec.TaggedDocument(v, [i]) for i, v in enumerate(X_train)]

In [6]:
# Look at what a tagged document looks like
tagged_docs[0]

TaggedDocument(words=['ll', 'pick', 'you', 'up', 'at', 'about', 'pm', 'to', 'go', 'to', 'taunton', 'if', 'you', 'still', 'want', 'to', 'come'], tags=[0])

In [8]:
# Train a basic doc2vec model
d2v_model = gensim.models.Doc2Vec(tagged_docs,
                                  vector_size=100,
                                  window=5,
                                  min_count=2)

In [10]:
# What happens if we pass in a single word like we did for word2vec?
d2v_model.infer_vector('text')

TypeError: Parameter doc_words of infer_vector() must be a list of strings (not a single string).

In [11]:
# What happens if we pass in a list of words?
d2v_model.infer_vector(['i', 'am', 'learning', 'nlp'])

array([ 2.55125901e-03,  1.18638361e-02,  9.61328112e-03,  2.11844430e-03,
       -7.85729848e-03, -1.96640566e-02,  8.51910841e-03,  3.11424695e-02,
       -1.43427067e-02, -1.96799412e-02, -1.37006759e-03, -2.21764240e-02,
       -8.15948658e-03,  1.23746553e-02,  1.32412405e-03, -1.01742046e-02,
        2.08183847e-05, -1.34120099e-02, -7.40417372e-03, -2.66675595e-02,
        1.37395440e-02,  9.10776388e-03,  4.81781224e-03, -2.20786221e-02,
       -1.43928418e-03,  2.50591012e-03, -1.01464866e-02, -7.01714121e-03,
       -1.83200464e-02, -5.16230846e-03,  1.79592799e-02,  1.02383541e-02,
        2.05909777e-02, -1.65945217e-02, -1.19528482e-02,  2.33237837e-02,
        2.91534397e-03, -1.78704914e-02, -1.78031828e-02, -2.93328520e-02,
        2.23389757e-03, -6.97192550e-03, -7.49399769e-04, -8.88905674e-03,
        1.25945993e-02, -5.76909212e-03, -1.54217407e-02, -3.57724051e-03,
        1.29239438e-02,  1.63594056e-02,  8.67526419e-03, -9.38259531e-03,
       -8.72932188e-03, -

### What About Pre-trained Document Vectors?

Fewer pre-trained options exist for document vectors than for word vectors. There is also no convenient API for reading them in, as there is for `word2vec`, so using them takes more work.

Pre-trained models trained on Wikipedia and Associated Press News are available in the [jhlau/doc2vec repository](https://github.com/jhlau/doc2vec). Feel free to explore on your own!