# doc2vec: How To Implement doc2vec

### Train Our Own Model

In [2]:
# Read in data, clean it, and then split into train and test sets
import gensim
import pandas as pd
from sklearn.model_selection import train_test_split
pd.set_option('display.max_colwidth', 100)

# Load the raw data, drop the unused extra columns, and name the remaining two
messages = pd.read_csv('../data/spam.csv', encoding='latin-1')
messages = messages.drop(labels=["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], axis=1)
messages.columns = ["label", "text"]

# Lowercase and tokenize each message into a list of words
messages['text_clean'] = messages['text'].apply(gensim.utils.simple_preprocess)

# Hold out 20% of the messages for testing
X_train, X_test, y_train, y_test = train_test_split(messages['text_clean'],
                                                    messages['label'], test_size=0.2)

In [3]:
# Create tagged document objects to prepare to train the model
tagged_docs = [gensim.models.doc2vec.TaggedDocument(v, [i]) for i, v in enumerate(X_train)]

In [4]:
# Look at what a tagged document looks like
tagged_docs[0]

TaggedDocument(words=['lol', 'yes', 'our', 'friendship', 'is', 'hanging', 'on', 'thread', 'cause', 'won', 'buy', 'stuff'], tags=[0])
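
A `TaggedDocument` is a named tuple with two fields, `words` and `tags`. The sketch below shows that structure with a stand-in named tuple from the standard library and a made-up two-message corpus. Both `TaggedDoc` and `toy_corpus` are assumptions for illustration, not part of gensim or the dataset above.

```python
from collections import namedtuple

# Stand-in with the same two fields as gensim's TaggedDocument
# (hypothetical name, for illustration only).
TaggedDoc = namedtuple('TaggedDoc', ['words', 'tags'])

# Toy tokenized messages (made up for this sketch).
toy_corpus = [['free', 'prize', 'call', 'now'],
              ['see', 'you', 'at', 'lunch']]

# Tag each document with its position, mirroring the enumerate() call above.
toy_tagged = [TaggedDoc(words, [i]) for i, words in enumerate(toy_corpus)]

print(toy_tagged[1])
# TaggedDoc(words=['see', 'you', 'at', 'lunch'], tags=[1])
```

Because `enumerate` yields positions 0, 1, 2, and so on, each tag refers to a document's position in `X_train`, not to its original row index in `messages`.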

In [5]:
# Train a basic doc2vec model
d2v_model = gensim.models.Doc2Vec(tagged_docs, vector_size=100, window=5, min_count=2)
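
To make the `min_count=2` setting concrete: gensim ignores every word whose total frequency in the corpus is below `min_count`. The standalone sketch below mimics that filtering with `collections.Counter` on a made-up corpus. The helper name `filter_rare_words` is hypothetical, and this is an illustration of the effect rather than gensim's internal implementation.

```python
from collections import Counter

def filter_rare_words(docs, min_count=2):
    """Drop words whose total corpus frequency is below min_count."""
    counts = Counter(word for doc in docs for word in doc)
    return [[w for w in doc if counts[w] >= min_count] for doc in docs]

# Toy tokenized messages (made up for this sketch).
toy_docs = [['free', 'prize', 'call'],
            ['call', 'me', 'later'],
            ['free', 'call']]

filtered = filter_rare_words(toy_docs, min_count=2)
print(filtered)
# [['free', 'call'], ['call'], ['free', 'call']]
```

On short text messages a threshold like this can remove a noticeable share of the vocabulary, so it may be worth checking how many tokens survive before training.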

In [None]:
# What happens if we pass in a single word like we did for word2vec?
d2v_model.infer_vector('text')

# TypeError: Parameter doc_words of infer_vector() must be a list of strings (not a single string).
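
The error arises because `infer_vector` expects a list of tokens, not raw text. One way to avoid it, sketched here with a hypothetical helper named `as_tokens`, is to normalize the input before it reaches the model. Splitting on whitespace is an assumption made to keep the sketch dependency-free; in this notebook, `gensim.utils.simple_preprocess` would match the training preprocessing more closely.

```python
def as_tokens(doc):
    """Return doc as a list of string tokens.

    A raw string is lowercased and split on whitespace (a simplification);
    any other iterable is assumed to already hold tokens.
    """
    if isinstance(doc, str):
        return doc.lower().split()
    return [str(token) for token in doc]

print(as_tokens('text'))                  # ['text']
print(as_tokens('I am learning NLP'))     # ['i', 'am', 'learning', 'nlp']
print(as_tokens(['already', 'tokens']))   # ['already', 'tokens']
```

With the trained model, `d2v_model.infer_vector(as_tokens('text'))` would then receive a list rather than a single string.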

In [7]:
# What happens if we pass in a list of words?
d2v_model.infer_vector(['i', 'am', 'learning', 'nlp'])

array([-0.00168596,  0.01508794,  0.01160038, -0.00338361,  0.01132102,
       -0.02389889,  0.02116293,  0.04805538, -0.01647898, -0.01417084,
       -0.01098558, -0.03593954,  0.00255337,  0.00353758,  0.00733562,
       -0.01752469,  0.0067111 , -0.02172901,  0.00240758, -0.04477942,
        0.01149775,  0.00413259,  0.01695592, -0.01012047, -0.0054517 ,
        0.00896844, -0.01013042, -0.0165707 , -0.02167361, -0.00137383,
        0.02361354,  0.00513901,  0.01327512, -0.01172036, -0.00941433,
        0.02568521,  0.00435821, -0.01291472, -0.01023596, -0.02167915,
       -0.00579567, -0.00668968, -0.00233986, -0.00142471,  0.01381543,
       -0.01183296, -0.01288486, -0.00672589,  0.01407277,  0.01379746,
        0.01204773, -0.01390184,  0.00456665, -0.00428431, -0.0160882 ,
        0.00530108,  0.00379643, -0.00118838, -0.01480482,  0.00714468,
        0.00046443, -0.00112335,  0.00092854, -0.00984535, -0.02009081,
        0.02129154,  0.0121479 ,  0.01603481, -0.01980219,  0.02
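
Once documents are mapped to vectors, a common way to compare two of them is cosine similarity. The sketch below is a dependency-free version. The function name `cosine_similarity` and the short example vectors are made up for illustration; real inferred vectors from this model would have 100 dimensions to match `vector_size=100`.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError('vectors must have the same length')
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError('cosine similarity is undefined for a zero vector')
    return dot / (norm_u * norm_v)

# Toy vectors (made up for this sketch).
vec_a = [0.1, -0.2, 0.3]
vec_b = [0.2, -0.4, 0.6]   # same direction as vec_a
vec_c = [0.3, 0.0, -0.1]   # orthogonal to vec_a

print(round(cosine_similarity(vec_a, vec_b), 6))
print(round(cosine_similarity(vec_a, vec_c), 6))
```

The same function accepts the arrays returned by `infer_vector`, so two messages could be compared by inferring a vector for each and passing both in.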

### What About Pre-trained Document Vectors?

Fewer pre-trained options exist for document vectors than for word vectors. There is also no easy API for loading them, as there is for `word2vec`, so using them takes more time.

Pre-trained vectors trained on Wikipedia and Associated Press News are available in the [jhlau/doc2vec repository](https://github.com/jhlau/doc2vec). Feel free to explore on your own!