#### Sociology 128D: Mining Culture Through Text Data: Introduction to Social Data Science – Summer '22

# Notebook 10: Training Word Embeddings using `gensim`

Since we didn't spend much time on supervised machine learning in Notebook 8, I've rewritten Notebook 10 to use new data and incorporate classifiers trained on different types of features. Rather than simply see how we can use `gensim` to train word embeddings, we are going to see how those word embeddings can be used in a downstream task, namely classifying IMDB reviews as either positive or negative. For many tasks, various types of word embeddings and similar approaches have shown phenomenal performance. In this notebook, we'll compare an approach based on word embeddings to classifiers using word counts, TF-IDF weighting, and an extension of word2vec, doc2vec.

The dataset we are going to use was originally analyzed by [Maas et al. (2011)](https://ai.stanford.edu/~amaas/papers/wvSent_acl2011.pdf) and is available [here](https://ai.stanford.edu/~amaas/data/sentiment/). The dataset includes 50,000 IMDB reviews. These are evenly split into a training set of 25,000 and a test set of 25,000. The training set and test set each comprise 12,500 positive reviews and 12,500 negative reviews.

In **Part I**, we will load the raw data, combine the reviews into separate dataframes for the training and test data, and preprocess the text. With Dr. Maas's permission, I've uploaded preprocessed versions of the training and test data as JSON files to Canvas. You can load these files at the start of Part II to save time, although you should let me know if you have any remaining questions about what we do in Part I.

In **Part II**, we train classifiers using word frequencies (with <tt>CountVectorizer</tt>) and TF-IDF weights (with <tt>TfidfVectorizer</tt>) to establish baselines against which we will compare our later efforts. The classifiers actually do quite well, and the performance we see is comparable to what we see in the 2011 paper. Notably, these are [bag-of-words (BoW) models](https://en.wikipedia.org/wiki/Bag-of-words_model): We ignore word order and base our models on the frequencies of words only.

In **Part III**, we add a <tt>sentencizer</tt> to our `spacy` pipeline so that we can segment the reviews (which vary in length) in the training set into individual sentences, and we train word embeddings on those sentences using word2vec as implemented in `gensim`. We'll start with a basic model trained with only one pass through the data (i.e., one "epoch"), and we'll take a look at how well the model does at finding words similar to "good" and "bad" as well as completing an analogy. Next, we'll train a better model with more passes through the data ("epochs").

In **Part IV**, we'll use our custom word embeddings to convert the text of each review into a vector we can use for a classifier, and we'll see how this approach compares to our baseline models.

In **Part V**, we'll load pretrained word embeddings and use those to train a classifier. In **Part VI**, we will take a quick look at how **doc2vec**–an extension of word2vec for creating *document embeddings*–compares to our other results.

## I. Setup

If you've run all the previous notebooks on the computer you're using now, the libraries we import in the next cell should already be installed. If you need to install anything, you should first try the <tt>conda</tt> approach (e.g., <tt>conda install -c anaconda gensim</tt>) if you are using Anaconda. If you are not using Anaconda, or if a library is not available using <tt>conda</tt>, replace <tt>conda</tt> with <tt>pip</tt> (e.g., <tt>pip3 install --user gensim</tt>). To run these commands from within a Jupyter Notebook, add an exclamation mark (!) to the beginning. Otherwise, you can run the commands from the Anaconda Prompt or a terminal emulator (i.e., the command line).

We are going to use `gensim` to train word embeddings using word2vec, load pretrained word embeddings, and implement doc2vec. We will use scikit-learn (`sklearn`) to represent movie reviews as vectors of word counts, to use TF-IDF weighting, and to train and evaluate our classifiers. We will use a combination of `spacy`, `re`, `num2words`, and `unidecode` for preprocessing the text. We will store the training and test sets in separate dataframes using `pandas`. We will also import `os` for managing file paths and accessing the data in Part I.

In [None]:
import gensim.downloader
import os
import pandas as pd
import numpy as np
import re
import spacy

from gensim.models.callbacks import CallbackAny2Vec
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.models.word2vec import Word2Vec
from num2words import num2words
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from spacy.lang.en.stop_words import STOP_WORDS as spacy_stopwords
from unidecode import unidecode

The individual reviews are not labeled positive or negative, but they are in *directories* labeled "pos" (for "positive") and "neg" (for "negative"). We will get these labels from the directories. However, they are also split into training and test sets using separate directories as well. We have four file paths total for the positive training examples, negative training examples, positive test examples, and negative test examples.

In the cell below, we store these file paths as string variables. The following cell uses the built-in [`map()` function](https://www.geeksforgeeks.org/python-map-function/) to apply the same function to each file in a particular directory. `map()` takes two arguments: a function and an iterable (for example, a `list`). <tt>os.listdir(pos_training_data_path)</tt> returns a list of the names of all the files in <tt>pos_training_data_path</tt> ("data/aclImdb_v1/aclImdb/train/pos/"). The function we supply is a lambda function because this allows us to join the directory and each file name (our arbitrary variable name, <tt>x</tt>) and pass the resulting path to the built-in `open()` function. This looks a little complicated because we additionally use `read()` and `strip()` and cast the result as a `list`, but this is simply a quick way to say we want a list of the text of each review in each directory. The next line uses the built-in `zip()` function to pair each review with a label ("pos" or "neg"); this combines the reviews with a list that repeats "pos" or "neg" a number of times equal to the length of the list of reviews. We label reviews as "pos" if they're in a directory called "pos", and likewise for the "neg" directories.
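To make the `map()`, lambda, and `zip()` pattern concrete, here is a minimal sketch on a throwaway directory. The file names, the review texts, and the `read_file` helper are made up for illustration and are not part of the IMDB data or the notebook's own code. The sketch also reads each file inside a `with` block so that it is closed right away, which is a small stylistic choice rather than a requirement.

```python
import os
import tempfile


def read_file(path):
    # Hypothetical helper: the context manager closes the file after reading
    with open(path, encoding="utf-8") as f:
        return f.read().strip()


with tempfile.TemporaryDirectory() as toy_dir:
    # Write two made-up "reviews" with stray whitespace around the text
    for name, text in [("a.txt", "Loved it!\n"), ("b.txt", "  A fine film.  ")]:
        with open(os.path.join(toy_dir, name), "w", encoding="utf-8") as f:
            f.write(text)

    # os.listdir() returns file names only, so we join each one to the directory
    filenames = sorted(os.listdir(toy_dir))
    texts = list(map(lambda x: read_file(os.path.join(toy_dir, x)), filenames))

# Pair every text with the same label, repeated once per review
labeled = list(zip(["pos"] * len(texts), texts))
print(labeled)
```

Each element of `labeled` is a `(label, text)` tuple, which is the shape the dataframes are later built from.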

Importantly, we are keeping the training and test sets separate. We could put them all in one dataframe with a column indicating whether a given row is part of train or test set, but for teaching purposes it seems clearer to keep them separated entirely.

In [None]:
pos_training_data_path = "data/aclImdb_v1/aclImdb/train/pos/"
neg_training_data_path = "data/aclImdb_v1/aclImdb/train/neg/"

pos_test_data_path = "data/aclImdb_v1/aclImdb/test/pos/"
neg_test_data_path = "data/aclImdb_v1/aclImdb/test/neg/"

In [None]:
%%time

# positive training examples
pos_training_examples = list(map(lambda x: open(os.path.join(pos_training_data_path, x), encoding="utf-8").read().strip(), 
                                 os.listdir(pos_training_data_path)))
pos_training_examples = list(zip(["pos"]*len(pos_training_examples), pos_training_examples))


# negative training examples
neg_training_examples = list(map(lambda x: open(os.path.join(neg_training_data_path, x), encoding="utf-8").read().strip(), 
                                 os.listdir(neg_training_data_path)))
neg_training_examples = list(zip(["neg"]*len(neg_training_examples), neg_training_examples))


# positive test examples
pos_test_examples = list(map(lambda x: open(os.path.join(pos_test_data_path, x), encoding="utf-8").read().strip(), 
                                 os.listdir(pos_test_data_path)))
pos_test_examples = list(zip(["pos"]*len(pos_test_examples), pos_test_examples))


# negative test examples
neg_test_examples = list(map(lambda x: open(os.path.join(neg_test_data_path, x), encoding="utf-8").read().strip(), 
                                 os.listdir(neg_test_data_path)))
neg_test_examples = list(zip(["neg"]*len(neg_test_examples), neg_test_examples))

In [None]:
pos_training_examples[0]

Below, we create dataframes for the train and test sets using `pandas`. Since we put the labels ("pos" or "neg") first when we used `zip()` above, we name the columns <tt>label</tt> and <tt>review</tt>.

In the third and fourth lines, we use the `sample()` method with the argument <tt>frac=1</tt> to "sample" 100% of the rows from each dataframe at random. By default, this is sampling *without* replacement, so it just shuffles the order of the rows.
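As a quick check that `sample(frac=1)` only reorders rows, here is a small sketch on a made-up four-row dataframe. The `random_state` argument and the `reset_index()` call are additions for this illustration (the notebook's cell uses neither); `random_state` simply makes the shuffle repeatable.

```python
import pandas as pd

# Toy dataframe standing in for the real training data
toy_df = pd.DataFrame({"label": ["pos", "pos", "neg", "neg"],
                       "review": ["a", "b", "c", "d"]})

# frac=1 samples 100% of the rows without replacement, i.e., a shuffle
shuffled = toy_df.sample(frac=1, random_state=0)

# Every original row appears exactly once, just in a different order
print(sorted(shuffled.index.tolist()) == toy_df.index.tolist())

# Optionally renumber the rows 0..n-1 after shuffling
shuffled = shuffled.reset_index(drop=True)
print(shuffled)
```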

In [None]:
training_df = pd.DataFrame(pos_training_examples + neg_training_examples, columns=["label", "review"])
test_df = pd.DataFrame(pos_test_examples + neg_test_examples, columns=["label", "review"])

training_df = training_df.sample(frac=1)
test_df = test_df.sample(frac=1)

In [None]:
training_df.head()

In [None]:
test_df.head()

The helper functions below are based on the functions we have used previously for preprocessing. <tt>fix_ordinal_nums</tt> looks for tokens like "3rd" or "9th", removes the letters, and converts them to the word for the ordinal version of the number ("third" or "ninth").

<tt>preprocess_doc</tt> uses `unidecode()` to handle any encoding issues before lowercasing the text. By default, it will then remove stopwords while lemmatizing, but this argument can be changed. It then applies <tt>fix_ordinal_nums</tt> to each token before joining the tokens into a single string and then removing any non-lowercase, non-alphabetic characters that remain. Finally, it replaces any excessively long stretches of whitespace with a single space using a regular expression (with `re.sub()`) and removes whitespace from the beginning or end of the string using `strip()`.
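The last two cleanup steps described above can be tried in isolation. This is a small sketch on a made-up string, using only `re`; it skips the `spacy` and `unidecode` steps entirely.

```python
import re

text = "great film, 10/10!!   would watch again"  # made-up example

# Replace every character that is not a lowercase letter with a space
cleaned = re.sub(r"[^a-z]", " ", text)

# Collapse runs of whitespace into one space and trim both ends
cleaned = re.sub(r"\s+", " ", cleaned).strip()

print(cleaned)
```

Note that digits are removed along with punctuation, which is one reason ordinals like "3rd" are converted to words first.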

In [2]:
def fix_ordinal_nums(word: str) -> str:
    """
    Convert, e.g., "3rd" to "third"
    """
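    # The suffixes must be alternatives in a group. A character class such as
    # [(st)(nd)(rd)(th)] matches single characters, so "3d" or "1990s" would match too.
    ord_num_reg = r"(\d+)(st|nd|rd|th)"
    try:
        match = re.fullmatch(ord_num_reg, word)
        if match:
            word = num2words(int(match.group(1)), lang="en", to="ordinal")
        return word
    except Exception:
        return word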

    
def preprocess_doc(doc: str, remove_stop: bool = True) -> str:
    """
    Tokenize, lemmatize, remove stop words, 
    remove non-alphabetic characters.
    """
    doc = unidecode(str(doc).lower())
    if remove_stop:
        doc = [word.lemma_ for word in nlp(doc) if word.text not in spacy_stopwords and len(word.text) > 1]
    else:
        doc = [word.lemma_ for word in nlp(doc)]
    doc = " ".join([fix_ordinal_nums(word) for word in doc])
    doc = re.sub(r"[^a-z]", " ", doc)
    return re.sub(r"\s+", " ", doc).strip()

Below, we load one of `spacy`'s language models, <tt>en_core_web_sm</tt>. You can read more about the different language models available for English language text [here](https://spacy.io/models/en) and see the languages spaCy supports [here](https://spacy.io/usage/models).

In [None]:
nlp = spacy.load("en_core_web_sm", disable=["ner", "parser"])

In [None]:
s = "The cat in the hat"

In [None]:
preprocess_doc(s)

The cell below is not optimized to run quickly, and took about 14 minutes on my laptop. Ideally, we'd use something like the `multiprocessing` library, but this doesn't play nicely with the code that makes Jupyter Notebooks work. Later, we'll see how to do things faster by relying on built-in aspects of `spacy` and `gensim`.

Dr. Maas gave permission to share the preprocessed data, which I have made available in Canvas. If you run the following cell on your own, you should uncomment the lines in the cell below it so that you can save the dataframes for later use. If you don't have a subdirectory called <tt>data</tt> in your working directory, you will need to change the file path.

In [None]:
%%time

training_df["preprocessed"] = training_df.review.apply(preprocess_doc)
test_df["preprocessed"] = test_df.review.apply(preprocess_doc)

In [None]:
# training_df.to_json("data/imdb_training_df.json")
# test_df.to_json("data/imdb_test_df.json")

## II. Baseline Classifier Performance

In classification problems, two major properties of our models are *precision* and *recall*. The diagram below clarifies the difference between the two. In practice, people often default to using an [F-score](https://en.wikipedia.org/wiki/F-score), such as F<sub>1</sub>, which is the harmonic mean of precision and recall.

<img src="https://raw.githubusercontent.com/soc128d/soc128d.github.io/master/assets/images/precision_recall_wiki_walber_side_by_side.png" width=800 align="left"/> <br>

([Image source](https://en.wikipedia.org/wiki/F-score#/media/File:Precisionrecall.svg))
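To see how the three quantities relate, here is a small worked example in plain Python. The labels and predictions are made up, and "pos" is treated as the class of interest.

```python
# Made-up true labels and predictions for eight reviews
y_true = ["pos", "pos", "pos", "pos", "neg", "neg", "neg", "neg"]
y_pred = ["pos", "pos", "pos", "neg", "pos", "pos", "neg", "neg"]

pairs = list(zip(y_true, y_pred))
tp = sum(t == "pos" and p == "pos" for t, p in pairs)  # true positives
fp = sum(t == "neg" and p == "pos" for t, p in pairs)  # false positives
fn = sum(t == "pos" and p == "neg" for t, p in pairs)  # false negatives

precision = tp / (tp + fp)  # of everything we called "pos", how much was right?
recall = tp / (tp + fn)     # of everything that was "pos", how much did we find?
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Here precision is 3/5, recall is 3/4, and F<sub>1</sub> falls between them at 2/3.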

I have commented out the cell below in case you choose to preprocess the data yourself but haven't uncommented and run the cell above to save the dataframes. If you've saved them above or downloaded them from Canvas, you can uncomment the cell below to load the data, including the preprocessed text. You may need to change the file paths to reflect wherever you have saved or downloaded the JSON files. Once you set this up, you can skip Part I of this notebook.

In [None]:
# training_df = pd.read_json("data/imdb_training_df.json")
# test_df = pd.read_json("data/imdb_test_df.json")

For our first baseline, we'll use word frequencies as the input to our classifier. In Notebook 8, we used `sklearn`'s <tt>train_test_split</tt> function to create train and test sets from our corpus. The IMDB data is already separated into train and test sets, so we are going to use that structure instead.

We are going to use [<tt>CountVectorizer()</tt>](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) to represent the reviews as vectors of word frequencies. We could use any available type of model, but we are going to stick with logistic regression for this example.

**Note:** This classifier may not converge. There are a lot of possible explanations, but these go a bit beyond the scope of this notebook. In general, it's not great if a model doesn't converge, but the classifier based on word counts alone is intended to be the worse of the two baselines, so this isn't too troubling.

In [None]:
%%time

X_train = training_df.preprocessed.tolist()
y_train = training_df.label.tolist()
X_test = test_df.preprocessed.tolist()
y_test = test_df.label.tolist()

vectorizer = CountVectorizer(min_df=2, max_features=10000)
X_train_counts = vectorizer.fit_transform(X_train)
X_test_counts = vectorizer.transform(X_test)

clf_counts = LogisticRegression(fit_intercept=True, solver="liblinear", penalty="l2", max_iter=200)
clf_counts.fit(X_train_counts, y_train)

y_pred = clf_counts.predict(X_test_counts)

print(classification_report(y_test, y_pred))

In the cell below, we train a classifier using TF-IDF weighting instead of raw word frequencies. Specifically, we use <tt>scikit-learn</tt>'s [<tt>TfidfVectorizer</tt>](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html). We see a decent improvement in performance.
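For intuition about what the weighting does, here is a rough sketch of the arithmetic in plain Python. It follows the formulas as I understand them from the scikit-learn documentation (smoothed IDF, plus the sublinear term-frequency scaling we request with <tt>sublinear_tf=True</tt>) and leaves out the final length normalization, so the numbers will not match <tt>TfidfVectorizer</tt> output exactly. The toy documents and helper names are made up.

```python
import math

# Made-up, already tokenized documents
toy_docs = [["good", "good", "movie"], ["bad", "movie"], ["bad", "plot"]]
n_docs = len(toy_docs)


def smoothed_idf(term):
    # Number of documents containing the term
    df = sum(term in doc for doc in toy_docs)
    return math.log((1 + n_docs) / (1 + df)) + 1


def sublinear_tf(term, doc):
    # 1 + log(count) dampens the effect of repeating a word many times
    count = doc.count(term)
    return 1 + math.log(count) if count > 0 else 0.0


first_doc = toy_docs[0]
weights = {term: sublinear_tf(term, first_doc) * smoothed_idf(term)
           for term in set(first_doc)}
print(weights)
```

In the first document, "good" gets a higher weight than "movie" because it is repeated and appears in fewer documents overall.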

In [None]:
%%time

X_train = training_df.preprocessed.tolist()
y_train = training_df.label.tolist()

tfidf_vectorizer = TfidfVectorizer(min_df=2, max_features=10000, sublinear_tf=True)
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

clf_tfidf = LogisticRegression(fit_intercept=True, solver="liblinear", penalty="l2", max_iter=200)
clf_tfidf.fit(X_train_tfidf, y_train)

y_pred = clf_tfidf.predict(X_test_tfidf)

print(classification_report(y_test, y_pred))

## III. Training Word Embeddings

At long last, we are going to train our own word embeddings using `gensim`! We are going to stick with [word2vec](https://jalammar.github.io/illustrated-word2vec/) for this example. word2vec can be trained in a couple of ways, but we are going to use the skipgram approach with negative sampling. When we train the model, it will try to predict the context words from a target word. The goal is to maximize the probability of the actual context words (i.e., the other words nearby) while minimizing the probability of the so-called negative samples (random words that are *not* nearby).
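To make "target word" and "context words" concrete, here is a conceptual sketch that lists the (target, context) pairs a skip-gram model would see for one short sentence. The function name and the example sentence are made up, and this is not how `gensim` is implemented internally; it only illustrates the windowing idea. Negative samples would be additional words drawn at random from the vocabulary for each pair.

```python
def skipgram_pairs(tokens, window):
    """List (target, context) pairs for every word within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


toy_pairs = skipgram_pairs(["the", "movie", "was", "great"], window=1)
for pair in toy_pairs:
    print(pair)
```

With a window of 1, each word is paired only with its immediate neighbors; a larger window produces more pairs per target word.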

Rather than train our word embeddings on the full documents, we are going to segment the reviews into their component sentences. We will then train the word embeddings on the sentences. We are going to use `spacy` to segment each review into its individual sentences, and then we are going to preprocess each sentence. The preprocessing will be similar to what we did above, but we aren't going to remove stop words.

The <tt>n_process</tt> argument allows us to use multiprocessing (i.e., make use of multiple cores/threads on a single computer). This can speed things up substantially. You can execute the code <tt>os.cpu_count()</tt> to see how many threads are available. This notebook sets <tt>n_process</tt> equal to <tt>max(1, os.cpu_count()-1)</tt> by default, which will return the greater of 1 or the number of available threads minus 1. You can change this to suit your machine.

In [None]:
nlp = spacy.load("en_core_web_sm", disable=["ner", "parser"])
nlp.add_pipe("sentencizer") # this will allow us to iterate through sentences in a review

In [None]:
%%time

training_reviews = training_df.review.tolist()
training_reviews = nlp.pipe(training_reviews, n_process=max(1, os.cpu_count()-1))

sentences = []

for review in training_reviews:
    for sent in review.sents:
        sent = [word.lemma_ for word in sent]
        sent = " ".join([fix_ordinal_nums(word) for word in sent])
        sent = re.sub("[^a-z]", " ", sent.lower()).strip()
        sent = re.sub(r"\s+", " ", sent)
        sentences.append(sent)

In [None]:
sentences[0]

In [None]:
len(sentences)

We can save the preprocessed sentences by simply writing them to a text file, with one on each line.

In [None]:
with open("data/imdb_training_sentences.txt", "w") as writer:
    for sentence in sentences:
        writer.write(sentence + "\n")

We can load the sentences later using code like the cell below.

In [None]:
sentences = open("data/imdb_training_sentences.txt", "r").read().strip().split("\n")

To train our word embeddings, we want to convert each sentence from a string to a list of words.

In [None]:
sentences = [sent.split() for sent in sentences]

In [None]:
sentences[0]

Training the word embeddings is fairly straightforward. You can read more about `gensim`'s implementation of word2vec [here](https://radimrehurek.com/gensim/models/word2vec.html). We are training our embeddings on the preprocessed sentences and using a vector size of 100 (i.e., each vector will be 100 numbers, rather than having one number for each unique word in the vocabulary). The window size is how far away from the target word the algorithm will look for context words during training. The <tt>sg</tt> argument tells the function whether or not we want to use the skipgram method. <tt>negative</tt> tells the function how many negative samples to use, and the documentation recommends 5-20. <tt>epochs</tt> is the number of times we want the algorithm to pass through all of the sentences. First, let's just try one pass through the data. <tt>workers</tt> is equivalent to `spacy`'s <tt>n_process</tt> argument: It allows us to use more cores/threads in a single computer to speed things up.

In [None]:
%%time

basic_custom_embeddings = Word2Vec(sentences, vector_size = 100, window = 7, sg = 1, negative = 5, epochs = 1,
                                   workers = max(1, os.cpu_count()-1), min_count = 5)

Once we're done training our word embeddings, we'll run the line below, which makes it easier to access the word vectors.

In [None]:
basic_custom_embeddings = basic_custom_embeddings.wv

In [None]:
type(basic_custom_embeddings)

In [None]:
len(basic_custom_embeddings)

So how did we do? Let's see how this model does at identifying words similar to "good" and "bad"–and we'll also see if it can complete an analogy.
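The analogy query can be read as vector arithmetic: start at "king", subtract "man", add "woman", and look for the nearest remaining word. The sketch below does this with hand-made three-dimensional vectors, which are invented for illustration and come from no trained model. `gensim`'s `most_similar()` differs in its details (for example, in how it normalizes vectors), so treat this only as the general idea.

```python
import numpy as np

# Invented toy vectors; the dimensions loosely mean (royalty, male, female)
toy_vectors = {
    "king": np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man": np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "movie": np.array([0.2, 0.5, 0.5]),
}


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


query = toy_vectors["king"] - toy_vectors["man"] + toy_vectors["woman"]

# Rank every word except the three query words by similarity to the result
query_words = {"king", "man", "woman"}
candidates = {word: cosine(query, vec)
              for word, vec in toy_vectors.items() if word not in query_words}
best = max(candidates, key=candidates.get)
print(best, candidates)
```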

In [None]:
basic_custom_embeddings.most_similar("good")

In [None]:
basic_custom_embeddings.most_similar("bad")

In [None]:
basic_custom_embeddings.most_similar(positive=["king", "woman"], negative=["man"], topn=10)

These results aren't *great*. The quality of the word embeddings will depend on the specific task we want to use them for as well as properties of the training data and the way we specify the model.

Let's train a model for more epochs. While it isn't strictly true that training for longer is always better for every task (for example, fitting the training data better can result in [overfitting](https://en.wikipedia.org/wiki/Overfitting)), it *should* offer some improvement in the quality of the embeddings. A bit of randomness is introduced by our use of multiple threads. The trade-off between speed and reproducibility seemed worth it for an in-class example, but it's possible we won't see much improvement. (For what it's worth, `gensim` defaults to five epochs for word2vec.)

We're also going to monitor the loss as we train the model. After each epoch, we will print the current loss. The cell below defines a callback class that will handle this for us. You can take a look at the StackOverflow thread the class is from for more detail. In brief, by default, gensim will give us a *cumulative* loss, so it will always look like it's increasing; the callback below subtracts the previous loss each time and prints only the new loss instead.
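The subtraction itself is simple. Here is a sketch with made-up cumulative loss values (they are not real training output) showing how a running total is turned into per-epoch losses.

```python
# Made-up cumulative losses reported after epochs 0, 1, and 2
cumulative_losses = [1000.0, 1800.0, 2400.0]

previous = 0.0
per_epoch_losses = []
for total in cumulative_losses:
    # The loss for this epoch is whatever was added since the last report
    per_epoch_losses.append(total - previous)
    previous = total

print(per_epoch_losses)
```

The cumulative values rise every epoch, but the per-epoch values fall, which is the pattern we hope to see during training.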

In [None]:
class callback(CallbackAny2Vec):
    """
    Callback to print loss after each epoch.
    from https://stackoverflow.com/a/58515344
    
    The loss returned at each iteration is cumulative, so we subtract the previous loss each time.
    """

    def __init__(self):
        self.epoch = 0
        self.loss_to_be_subed = 0

    def on_epoch_end(self, model):
        loss = model.get_latest_training_loss()
        loss_now = loss - self.loss_to_be_subed
        self.loss_to_be_subed = loss
        print(f"Loss after epoch {self.epoch}: {loss_now:,}")
        self.epoch += 1

In [None]:
%%time

better_custom_embeddings = Word2Vec(sentences, vector_size = 100, window = 7, sg = 1, negative = 5, 
                                    workers = max(1, os.cpu_count()-1), min_count = 5, epochs = 40, callbacks=[callback()], 
                                    compute_loss = True)

In [None]:
better_custom_embeddings = better_custom_embeddings.wv

In [None]:
better_custom_embeddings.most_similar("good")

In [None]:
better_custom_embeddings.most_similar("bad")

In [None]:
better_custom_embeddings.most_similar(positive=["king", "woman"], negative=["man"], topn=10)

## IV. Custom Word Embeddings as Inputs to a Classifier

Now let's see how well our word embeddings do as inputs to a classifier. Keep in mind that the embeddings model was trained only on our training set and with the goal of maximizing the predicted probability that actual context words would be near target words in a sentence while minimizing the predicted probability that random other words would be guessed. The embeddings may have picked up on idiosyncratic traits of the training set that are not present in the test set. More importantly, the training process was unrelated to any kind of classification task. On a range of tasks, embeddings have been shown to be quite powerful. Here, they may not be.

The function below averages the word embeddings for each word in a review that is represented in the word embedding model; if a word is not represented (e.g., due to low frequency in the training corpus), a vector of zeros is substituted instead. Although we trained the word embeddings on individual sentences without removing stop words, here we are using the preprocessed reviews with stop words removed. Although word2vec can learn syntactic information about words better if we leave in stop words (e.g., prepositions), we don't necessarily want those for our classifier. Instead, we're going to average over the words that are more important for capturing what the sentence is about from a computational standpoint–the semantically important words.
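Here is a tiny worked example of the averaging described above. A plain dictionary of made-up two-dimensional vectors stands in for the embedding model, and "zzz" plays the role of a word the model has never seen.

```python
import numpy as np

# Made-up embeddings; a real model would have 100 or 300 dimensions
toy_embeddings = {"great": np.array([1.0, 0.0]),
                  "film": np.array([0.0, 1.0])}
toy_vector_size = 2

tokens = "great film zzz".split()

# Look up each word, substituting zeros for words missing from the model
vecs = [toy_embeddings[t] if t in toy_embeddings else np.zeros(toy_vector_size)
        for t in tokens]

doc_vector = np.mean(vecs, axis=0)
print(doc_vector)
```

Because the missing word still counts toward the average, each coordinate is 1/3 rather than 1/2: unknown words pull the document vector toward zero.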

In [None]:
def average_word_embeddings(doc, embedding_model, vector_size):
    """
    This function creates a vector for a review by averaging the word embeddings for each
    word in the review that is represented in the word embedding model. If a word is missing, 
    it uses a vector of zeros.
    """
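    if isinstance(doc, str):
        doc = doc.split()
    vecs = []
    for word in doc:
        if word in embedding_model:
            vec = embedding_model[word]
            vecs.append(vec)
        else:
            vecs.append(np.zeros((vector_size,)))
    if not vecs:
        # An empty review has nothing to average; return zeros rather than NaN
        return np.zeros((vector_size,))
    return np.mean(vecs, axis=0)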

In [None]:
%%time

training_vecs = training_df.preprocessed.apply(lambda x: average_word_embeddings(x, better_custom_embeddings, 100)).tolist()
test_vecs = test_df.preprocessed.apply(lambda x: average_word_embeddings(x, better_custom_embeddings, 100)).tolist()
print(set([vec.shape for vec in training_vecs + test_vecs]))

y_train = training_df.label.tolist()
y_test = test_df.label.tolist()

clf_custom_emb = LogisticRegression(fit_intercept=True, solver="liblinear", penalty="l2", max_iter=200)
clf_custom_emb.fit(training_vecs, y_train)

y_pred = clf_custom_emb.predict(test_vecs)

print(classification_report(y_test, y_pred))

## V. Pretrained Word Embeddings as Inputs to a Classifier

What if we had more data? Would our results improve? We can test this with pretrained embeddings, although they have been trained on data from other domains. That may or may not be an issue. You can read about the embedding models available from `gensim` [here](https://github.com/RaRe-Technologies/gensim-data#models).

In [None]:
pretrained_embeddings = gensim.downloader.load("word2vec-google-news-300")

In [None]:
pretrained_embeddings.most_similar("good")

In [None]:
pretrained_embeddings.most_similar("bad")

In [None]:
pretrained_embeddings.most_similar(positive=["king", "woman"], negative=["man"], topn=10)

In [None]:
%%time

training_vecs = training_df.preprocessed.apply(lambda x: average_word_embeddings(x, pretrained_embeddings, 300)).tolist()
test_vecs = test_df.preprocessed.apply(lambda x: average_word_embeddings(x, pretrained_embeddings, 300)).tolist()
print(set([vec.shape for vec in training_vecs + test_vecs]))

y_train = training_df.label.tolist()
y_test = test_df.label.tolist()

clf_pretrained_emb = LogisticRegression(fit_intercept=True, solver="liblinear", penalty="l2", max_iter=200)
clf_pretrained_emb.fit(training_vecs, y_train)

y_pred = clf_pretrained_emb.predict(test_vecs)

print(classification_report(y_test, y_pred))

## VI. Doc2Vec

An extension of the kind of work that led to the production of word2vec is doc2vec, which is often better suited for representing sentences, paragraphs, or documents than simply averaging individual word embeddings. `gensim`'s implementation of doc2vec requires the data to be in a particular format. Specifically, each document should be paired with a list of tags in the <tt>TaggedDocument</tt> format. Don't worry too much about this for now. The function below iterates through the corpus and puts each review in the right format. When we call this function in the following cell, we cast it as a `list`. The reason for this is that the function itself uses the keyword <tt>yield</tt>, which produces a *generator*. It will give us each document one by one, but only once. Since we want to train the model for more than one epoch, we cast it as a `list`.
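The "only once" behavior of generators is easy to see with a toy example. The function below is made up for illustration and has nothing to do with `gensim`.

```python
def number_generator(n):
    """Yield the integers 0..n-1 one at a time."""
    for i in range(n):
        yield i


gen = number_generator(3)
first_pass = list(gen)   # consumes the generator
second_pass = list(gen)  # nothing is left to yield

print(first_pass, second_pass)

# Casting to a list up front gives us something we can loop over repeatedly
as_list = list(number_generator(3))
print(as_list, as_list)
```

A model that needs several epochs would see an empty corpus on every pass after the first if we handed it the bare generator.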

In [None]:
def taggeddocument_generator(corpus):
    """
    This helper function is meant to convert our training examples to the TaggedDocument format,
    which is used by gensim's implementation of doc2vec
    """
    for i, doc in enumerate(corpus):
        doc = doc.split()
        yield TaggedDocument(doc, [i])

In [None]:
%%time

training_data = list(taggeddocument_generator(training_df.preprocessed))

doc2vec_model = Doc2Vec(vector_size=100, min_count=5, epochs=20, negative=5)
doc2vec_model.build_vocab(training_data)
doc2vec_model.train(training_data, total_examples=doc2vec_model.corpus_count, epochs=doc2vec_model.epochs)

In [None]:
%%time

X_train_doc2vec = [doc2vec_model.infer_vector(doc) for doc in training_df.preprocessed.apply(str.split).tolist()]
y_train = training_df.label.tolist()

X_test_doc2vec = [doc2vec_model.infer_vector(doc) for doc in test_df.preprocessed.apply(str.split).tolist()]
y_test = test_df.label.tolist()

clf_doc2vec = LogisticRegression(fit_intercept=True, solver="liblinear", penalty="l2", max_iter=200)
clf_doc2vec.fit(X_train_doc2vec, y_train)

y_pred = clf_doc2vec.predict(X_test_doc2vec)

print(classification_report(y_test, y_pred))