# Doc2Vec: How to Implement It

Instead of creating a vector for each word, `doc2vec` creates a vector for each document or collection of text, whether that is a sentence or a paragraph. The goal is the same as with `word2vec`: to create a numeric representation of text that a machine learning model can use to better capture its meaning.

Recall that `word2vec` is a shallow, two-layer neural network that accepts a text corpus as input and returns a set of vectors, also known as embeddings. Each vector is a numeric representation of a given word. `doc2vec` is basically the same thing, but instead of returning a numeric vector for each word, it returns a numeric vector for each sentence or paragraph.

The real benefit of `doc2vec` is that it captures information about a sentence or paragraph, which is what we need here, in a more sophisticated way than creating word vectors and then averaging them. With `word2vec`, averaging the word vectors into a sentence-level or text-level representation loses information, such as word order. `doc2vec` instead learns the sentence-level representation directly.
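
To see why averaging loses information, here is a minimal sketch with made-up two-dimensional word vectors. The values and the `average_vectors` helper are illustrative assumptions, not output from a trained model. Two sentences with the same words in a different order end up with the same averaged vector:

```python
# Toy word vectors; the numbers are invented for illustration only.
toy_word_vectors = {
    "dog": [1.0, 0.0],
    "bites": [0.0, 1.0],
    "man": [0.5, 0.5],
}


def average_vectors(words, word_vectors):
    """Average the vectors of the given words, dimension by dimension."""
    vectors = [word_vectors[w] for w in words]
    dim = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]


sentence_a = ["dog", "bites", "man"]
sentence_b = ["man", "bites", "dog"]

avg_a = average_vectors(sentence_a, toy_word_vectors)
avg_b = average_vectors(sentence_b, toy_word_vectors)

# Averaging ignores word order, so both sentences get the same vector.
print(avg_a == avg_b)
```

The two sentences mean very different things, yet the averaged representation cannot tell them apart.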

### Train Our Own Model

In [5]:
# Read in data, clean it, and then split into train and test sets
import gensim
import pandas as pd
from sklearn.model_selection import train_test_split
pd.set_option('display.max_colwidth', 100)


messages = pd.read_csv('data/spam.csv', encoding='latin-1')
messages = messages.drop(labels = ["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], axis = 1)
messages.columns = ["label", "text"]
messages['text_clean'] = messages['text'].apply(lambda x: gensim.utils.simple_preprocess(x))


# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(messages['text_clean'],  # Cleaned text data for training
                                                    messages['label'],       # Corresponding target labels
                                                    test_size=0.2)           # Use 20% of the data for testing

messages.head()

| | label | text | text_clean |
|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there g... | [go, until, jurong, point, crazy, available, only, in, bugis, great, world, la, buffet, cine, th... |
| 1 | ham | Ok lar... Joking wif u oni... | [ok, lar, joking, wif, oni] |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ... | [free, entry, in, wkly, comp, to, win, fa, cup, final, tkts, st, may, text, fa, to, to, receive,... |
| 3 | ham | U dun say so early hor... U c already then say... | [dun, say, so, early, hor, already, then, say] |
| 4 | ham | Nah I don't think he goes to usf, he lives around here though | [nah, don, think, he, goes, to, usf, he, lives, around, here, though] |
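
The `text_clean` column shows what `simple_preprocess` does to each message: it lowercases the text, splits it into tokens, and drops digits, punctuation, and very short tokens. A rough standard-library approximation, assuming the default limits of 2 to 15 characters per token, might look like this. The `rough_preprocess` helper is hypothetical and is not gensim's implementation:

```python
import re


def rough_preprocess(text, min_len=2, max_len=15):
    """Lowercase, keep runs of non-digit word characters, drop tokens outside the length limits."""
    tokens = re.findall(r"[^\W\d]+", text.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len]


print(rough_preprocess("Ok lar... Joking wif u oni..."))
print(rough_preprocess("Nah I don't think he goes to usf, he lives around here though"))
```

Single-character tokens such as `u`, `I`, and the `t` from `don't` fall below the minimum length, which is why they are missing from the cleaned output.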


One difference between `word2vec` and `doc2vec` is that `doc2vec` requires tagged documents. A `TaggedDocument` expects a list of words and a list of tags for each document.

We iterate through `X_train` with `enumerate`, which returns the position and the value of each text message in `X_train`. The position serves as the document's tag.

In [8]:
# Create tagged document objects to prepare to train the model
tagged_docs = [gensim.models.doc2vec.TaggedDocument(v, [i]) for i, v in enumerate(X_train)]

# Look at what a tagged document looks like
tagged_docs[0]

TaggedDocument(words=['dont', 'thnk', 'its', 'wrong', 'calling', 'between', 'us'], tags=[0])
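
The tagging step can be sketched without gensim by using a `namedtuple` stand-in. `ToyTaggedDocument` and the sample messages are hypothetical; the real class lives in `gensim.models.doc2vec`. Note that `enumerate` numbers the documents by position, starting at 0, regardless of the row labels they carried before the shuffle:

```python
from collections import namedtuple

# Stand-in with the same two fields as gensim's TaggedDocument.
ToyTaggedDocument = namedtuple("ToyTaggedDocument", ["words", "tags"])

# Pretend these tokenized messages kept their original row labels after shuffling.
shuffled_messages = {
    17: ["ok", "lar", "joking"],
    3: ["free", "entry", "to", "win"],
    42: ["see", "you", "later"],
}

toy_tagged = [
    ToyTaggedDocument(words, [position])
    for position, words in enumerate(shuffled_messages.values())
]

# The tags are positions in the training set, not the original row labels.
tags = [doc.tags[0] for doc in toy_tagged]
print(tags)
```

If the original row labels matter later, for example to map a document vector back to its message, they would need to be stored separately or used as the tags instead.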

In [9]:
# Train a basic doc2vec model
d2v_model = gensim.models.Doc2Vec(tagged_docs,     # Tagged documents for training
                                 vector_size=100,  # Dimensionality of the document vectors
                                 window=5,         # Maximum distance between the current and predicted word within a sentence
                                 min_count=2)      # Minimum number of occurrences of a word to be considered

In [10]:
# What happens if we pass in a single word like we did for word2vec?
d2v_model.infer_vector(['text'])

array([-0.00798229,  0.01140039, -0.01043457, -0.01564807,  0.0034668 ,
       -0.03537234,  0.01152923,  0.05926319, -0.00906062, -0.0155829 ,
       -0.01331935, -0.02749657, -0.00572696,  0.00810215,  0.00133955,
       -0.03076872,  0.00469946, -0.02882499,  0.0082822 , -0.04647089,
        0.01412934,  0.01810895,  0.0142637 , -0.00232068,  0.00919335,
       -0.00369618, -0.0025273 , -0.01774842, -0.02490121, -0.01018221,
        0.01805401, -0.00242939,  0.01390081, -0.00753134, -0.01360392,
        0.03192316,  0.00762084, -0.0185664 , -0.00654213, -0.0283397 ,
       -0.01166244, -0.01969059, -0.00832936,  0.01400114,  0.01149535,
       -0.0071169 , -0.01649878,  0.00077341,  0.00119541,  0.02045974,
        0.00764342, -0.02728012,  0.01257649,  0.00521487, -0.00386214,
        0.01318282,  0.01741318, -0.00256588, -0.02155077,  0.00336554,
        0.01338103,  0.00300988,  0.00979202, -0.00682402, -0.0317105 ,
        0.02232279,  0.00080215,  0.01393274, -0.02342951,  0.02

In [11]:
# What happens if we pass in a list of words?
d2v_model.infer_vector(['i','am','learning','nlp'])

array([-5.4133963e-03,  1.1715040e-02, -5.1480252e-03, -9.6279001e-03,
       -1.4430014e-03, -3.2380532e-02,  1.8065844e-02,  4.9476977e-02,
       -6.2905448e-03, -1.6607398e-02, -7.7082855e-03, -2.9892946e-02,
       -5.2800183e-03,  6.7910198e-03,  2.2265767e-03, -1.7420869e-02,
        6.7678122e-03, -2.0868720e-02,  1.1361656e-03, -4.6570800e-02,
        8.9863259e-03,  7.8002485e-03,  1.6005872e-02, -1.1420925e-02,
        1.5199286e-03, -8.6489422e-03, -3.3598340e-03, -1.8516034e-02,
       -2.0690057e-02, -8.1399987e-03,  2.2926476e-02,  4.3311116e-05,
        2.6588293e-02, -1.8358380e-02, -7.4351230e-03,  3.4932412e-02,
        4.0876816e-04, -6.6025234e-03, -1.2406030e-02, -3.4672942e-02,
       -8.1659677e-03, -2.0882104e-02,  1.1908055e-03,  8.4546031e-03,
        1.6867347e-02, -1.1269779e-02, -1.4139514e-02, -1.2567131e-02,
        7.0599164e-03,  2.2530489e-02,  1.1712081e-02, -1.6853958e-02,
        4.2819671e-04, -1.1067763e-02, -1.3313100e-02,  7.2365846e-03,
      

### What About Pre-trained Document Vectors?

Fewer pre-trained options exist for document vectors than for word vectors. There is also no simple API for loading them, as there is for `word2vec`, so using them takes more work.

Pre-trained vectors from training on Wikipedia and Associated Press News can be found [here](https://github.com/jhlau/doc2vec). Feel free to explore on your own!

# How To Prep Document Vectors For Modeling

In [14]:
# Read in data, clean it, split it into train/test, and then train a doc2vec model
import gensim
import pandas as pd
from sklearn.model_selection import train_test_split
pd.set_option('display.max_colwidth', 100)


messages = pd.read_csv('data/spam.csv', encoding='latin-1')
messages = messages.drop(labels = ["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], axis = 1)
messages.columns = ["label", "text"]
messages['text_clean'] = messages['text'].apply(lambda x: gensim.utils.simple_preprocess(x))


# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(messages['text_clean'],  # Cleaned text data for training
                                                    messages['label'],       # Corresponding target labels
                                                    test_size=0.2)           # Use 20% of the data for testing


# Create tagged document objects to prepare to train the model
tagged_docs_tr = [gensim.models.doc2vec.TaggedDocument(v, [i]) for i, v in enumerate(X_train)]


# Train a Doc2Vec model on the tagged documents
d2v_model = gensim.models.Doc2Vec(tagged_docs_tr,     # Tagged documents for training the model
                                  vector_size=50,     # Dimensionality of the document vectors
                                  window=2,           # Maximum distance between the current and predicted words within a sentence
                                  min_count=2)        # Minimum number of occurrences of a word to be considered in the model

We read in our data, clean it, split it into training and test sets, and then train our `doc2vec` model on the training set. Recall that we can create a document vector by passing a list of words to the `infer_vector` method of the trained model. This returns a single vector whose length equals `vector_size` (50 for the model trained above), ready to be passed directly into a machine learning model.

In [16]:
# What does a document vector look like again?
d2v_model.infer_vector(['convert', 'words', 'to', 'vectors'])

array([ 0.01124724,  0.01765448, -0.0173507 ,  0.01307189,  0.00111577,
       -0.00714126,  0.0196327 ,  0.04415387, -0.03992124, -0.01049349,
        0.01006596, -0.03300128, -0.00194706,  0.02483031, -0.0105353 ,
        0.01793564,  0.03844629, -0.00639612, -0.02700614, -0.0373323 ,
        0.00343428,  0.01769488,  0.04617324,  0.02278028,  0.03255073,
        0.02028335, -0.02793016, -0.0056128 , -0.02779184,  0.01132937,
       -0.00465862,  0.00612979, -0.01565399,  0.0041665 , -0.0126611 ,
        0.04156687,  0.00331445,  0.00062365,  0.01735365, -0.00391612,
        0.03749568, -0.00858164, -0.00824811,  0.00185709,  0.03762186,
        0.00539917, -0.00381365, -0.02473323, -0.00387513,  0.01493936],
      dtype=float32)

#### Explanation
`d2v_model`: This is your trained Doc2Vec model. It has learned to map documents and words to vectors in a continuous vector space.

`infer_vector`: This method is used to infer a vector for a new document that was not part of the training set. The model generates a vector that represents the content of the document.

`['convert', 'words', 'to', 'vectors']`: This is the list of words (tokens) in the new document you want to infer a vector for. The list should be preprocessed in the same way as the training documents (e.g., tokenized and cleaned).

#### What Happens Internally
**Inference**: The `infer_vector` method generates a vector for the provided list of words. The trained model's weights are held fixed, and a new document vector is fitted to the input words over several training passes. The result is the vector that best represents the new document given what the model learned during training.
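
As a rough illustration of inference, the sketch below fits a vector for one new document while a set of already-trained word weights stays frozen. It is a toy under stated assumptions: it uses a full softmax over a four-word vocabulary and invented weights, whereas gensim relies on its own initialization and on negative sampling or hierarchical softmax. The `infer_toy_vector` helper and all values are hypothetical.

```python
import math
import random


def infer_toy_vector(doc_words, frozen_word_weights, dim, epochs=50, lr=0.1, seed=0):
    """Fit a vector for one new document while the word weights stay fixed."""
    rng = random.Random(seed)
    doc_vec = [rng.uniform(-0.5, 0.5) / dim for _ in range(dim)]
    vocab = list(frozen_word_weights)
    for _ in range(epochs):
        for target in doc_words:
            if target not in frozen_word_weights:
                continue  # this toy skips words that are not in the vocabulary
            # Softmax over the vocabulary: how strongly doc_vec predicts each word.
            scores = [
                sum(d * w for d, w in zip(doc_vec, frozen_word_weights[v]))
                for v in vocab
            ]
            top = max(scores)
            exps = [math.exp(s - top) for s in scores]
            total = sum(exps)
            probs = [e / total for e in exps]
            # Gradient ascent on log p(target | doc), updating doc_vec only.
            for k in range(dim):
                expected = sum(
                    p * frozen_word_weights[v][k] for p, v in zip(probs, vocab)
                )
                doc_vec[k] += lr * (frozen_word_weights[target][k] - expected)
    return doc_vec


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Invented "trained" word weights; they are never modified during inference.
frozen_word_weights = {
    "free": [1.0, 0.0],
    "win": [0.8, 0.2],
    "hello": [-1.0, 0.0],
    "lunch": [0.0, 1.0],
}
weights_before = {w: list(v) for w, v in frozen_word_weights.items()}

toy_vec = infer_toy_vector(["free", "win", "unseenword"], frozen_word_weights, dim=2)

# The fitted vector scores the document's own words above an unrelated word.
print(dot(toy_vec, frozen_word_weights["free"]) > dot(toy_vec, frozen_word_weights["hello"]))
```

The key design point is that only `doc_vec` changes. The word weights come out of inference exactly as they went in, which is why inferring vectors for test documents does not alter the trained model.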

**Result**: The output is a fixed-length vector (with dimensions specified during training) that captures the semantic meaning of the input words. This vector can be used for various tasks, such as document similarity, classification, or clustering.
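
For the document similarity use case, a common choice is cosine similarity between two document vectors. The sketch below uses short made-up vectors as stand-ins for `infer_vector` output; the variable names and values are assumptions for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot_product = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot_product / (norm_a * norm_b)


# Invented stand-ins for inferred document vectors.
vec_spam_1 = [0.9, 0.1, 0.0]
vec_spam_2 = [0.8, 0.2, 0.1]
vec_ham = [-0.1, 0.9, 0.3]

sim_spam_spam = cosine_similarity(vec_spam_1, vec_spam_2)
sim_spam_ham = cosine_similarity(vec_spam_1, vec_ham)

# Vectors pointing in similar directions score closer to 1.
print(sim_spam_spam > sim_spam_ham)
```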

The code below is aimed at converting text data into numerical vectors that can be used for training and prediction in machine learning models. 

In [19]:
# How do we prepare these vectors to be used in a machine learning model?
vects=[[d2v_model.infer_vector(words)] for words in X_test]

vects[0]

[array([ 0.05731815,  0.01634658,  0.040213  , -0.01886609,  0.04126149,
        -0.00377404,  0.00072359,  0.0173449 ,  0.00927827,  0.00431059,
        -0.02341248, -0.01482392, -0.03337708,  0.02598854,  0.05156513,
         0.00017511,  0.00793516,  0.02812459,  0.07645095, -0.02688686,
         0.01720904, -0.06830827, -0.01210206,  0.04015526,  0.06324875,
         0.02678719,  0.02162048,  0.04558533,  0.04820809,  0.02977867,
        -0.00435448,  0.04377825,  0.02617526,  0.00801478,  0.04583059,
         0.00318846,  0.03561984,  0.01070106,  0.06706019,  0.00595404,
         0.01573588,  0.01055892,  0.04312979, -0.00069565, -0.00442941,
         0.02150692, -0.0069403 , -0.00156944,  0.06485177, -0.04376391],
       dtype=float32)]

These numbers may seem random to the human eye, but they follow a pattern that the `doc2vec` model learned as a way to encode meaning as a set of numbers.

#### Explanation
`d2v_model.infer_vector(words)`:

**d2v_model**: This is your trained *Doc2Vec* model.

**infer_vector**: This method infers a vector for the provided document (a list of words in this case) based on what the model has learned during training.

**words**: Each entry in *X_test* is expected to be a list of words (tokens) from a document in your test set.

`[[d2v_model.infer_vector(words)] for words in X_test]`:

* This list comprehension iterates over each document in the *X_test* dataset.
* For each document, it calls *d2v_model.infer_vector(words)*, which generates a vector representation for that document.
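* Each inferred vector is wrapped in a list (hence the double brackets `[[]]`), so every entry of `vects` is a one-element list containing a vector rather than the vector itself. Most scikit-learn estimators expect a 2-D input of shape (n_samples, n_features), so this extra level of nesting would need to be removed before fitting a model.

`vects[0]`:

* This retrieves the first element of the `vects` list: a one-element list holding the vector representation of the first document in *X_test*.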
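
Removing the extra nesting can be sketched as follows, using plain lists as a stand-in for the real `vects`. The `toy_vects` values are invented, and three features stand in for the 50 in the trained model:

```python
# Stand-in for `vects`: each entry is a one-element list wrapping a vector.
toy_vects = [[[0.1, 0.2, 0.3]], [[0.4, 0.5, 0.6]]]

# Unwrap the inner list so that each row is the vector itself.
feature_rows = [wrapped[0] for wrapped in toy_vects]

n_samples = len(feature_rows)
n_features = len(feature_rows[0])
print(n_samples, n_features)
```

With the real model, leaving out the inner brackets in the comprehension, as in `[d2v_model.infer_vector(words) for words in X_test]`, avoids the nesting in the first place.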