# Doc2Vec Modeling

The code in this notebook is adapted from [this Medium post](https://towardsdatascience.com/multi-class-text-classification-with-doc2vec-logistic-regression-9da9947b43f4).

In [21]:
# imports
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import cm
import seaborn as sns; sns.set()
%matplotlib inline
import nltk
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction import text 
from nltk.stem import WordNetLemmatizer 
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer
from nltk.probability import FreqDist
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn import metrics
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split, GridSearchCV
# model
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
# metrics
from sklearn import metrics, model_selection, svm
# plot_confusion_matrix was removed in scikit-learn 1.2 and is unused here, so it is not imported
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, roc_curve, auc, classification_report
import pickle

## Importing `clean_df`

In [2]:
clean_df = pd.read_pickle('../pickle/clean_df.pkl')

In [3]:
clean_df.head(2)

| | total_votes | hate_speech_votes | other_votes | label | tweet | round_1_tweet |
|---|---|---|---|---|---|---|
| 0 | 3 | 0 | 3 | 0 | !!! RT @mayasolovely: As a woman you shouldn't... | as a woman you shouldnt complain about clea... |
| 1 | 3 | 0 | 3 | 0 | !!!!! RT @mleew17: boy dats cold...tyga dwn ba... | boy dats coldtyga dwn bad for cuffin dat ho... |


## Train-Test Split

In [4]:
train, test = train_test_split(clean_df, test_size=0.3, random_state=42)

## Preparing the Data

In [5]:
def tokenize_text(text):
    tokens = []
    for sent in nltk.sent_tokenize(text):
        for word in nltk.word_tokenize(sent):
            if len(word) < 2:
                continue
            tokens.append(word.lower())
    return tokens

In [6]:
tagged_train = train.apply(
    lambda r: TaggedDocument(words=tokenize_text(r['round_1_tweet']), tags=[r.label]), axis=1)
tagged_test = test.apply(
    lambda r: TaggedDocument(words=tokenize_text(r['round_1_tweet']), tags=[r.label]), axis=1)

In [7]:
tagged_train.values[30]

TaggedDocument(words=['this', 'bitch', 'instating', 'and', 'driving', 'for', 'me'], tags=[0])

## Training DBOW Model

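DBOW (distributed bag of words) is the Doc2Vec variant analogous to the Skip-gram model in Word2Vec, and Gensim makes training a Doc2Vec model fairly straightforward. Note that Gensim selects DBOW with `dm=0`. The cell below passes `dm=1`, which trains the distributed memory (PV-DM) variant instead, so the results in this notebook come from PV-DM despite the `dbow_model` name.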

In [8]:
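# train a doc2vec model, using only training data
# NOTE: dm=1 is PV-DM (distributed memory); use dm=0 for DBOW.
# epochs=100 is not used by the training loop below, which passes epochs=1 on each call.
dbow_model = Doc2Vec(vector_size=100, 
                alpha=0.025, 
                min_count=5,
                dm=1, epochs=100)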

In [9]:
from tqdm import tqdm
tqdm.pandas(desc="progress-bar")

# building vocabulary 
dbow_model.build_vocab([x for x in tqdm(tagged_train.values)])


100%|██████████| 17348/17348 [00:00<00:00, 2728468.04it/s]


In [10]:
from sklearn import utils
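# One pass per iteration over a reshuffled corpus, with a manually decayed learning rate.
# Caution: 0.025 - 0.002 * 13 < 0, so alpha is negative from the 14th pass onward.
for epoch in range(30):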
    dbow_model.train(utils.shuffle([x for x in tqdm(tagged_train.values)]), total_examples=len(tagged_train.values), epochs=1)
    dbow_model.alpha -= 0.002
    dbow_model.min_alpha = dbow_model.alpha
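The manual learning-rate decay in the loop above can be traced with plain arithmetic. The sketch below reuses the three numbers from the loop (0.025, 0.002, 30) and records the rate in effect during each pass. The variable names are illustrative and no Gensim call is involved.

```python
# Values taken from the training loop: starting rate, per-pass decrement, number of passes
start_alpha = 0.025
decrement = 0.002
n_passes = 30

alpha = start_alpha
alphas_used = []
for _ in range(n_passes):
    alphas_used.append(alpha)  # rate in effect during this pass
    alpha -= decrement         # mirrors `dbow_model.alpha -= 0.002`

# 0-based index of the first pass that runs with a negative rate
first_negative = next(i for i, a in enumerate(alphas_used) if a < 0)
n_negative = sum(a < 0 for a in alphas_used)
print(first_negative, n_negative)
```

Under this schedule the rate crosses zero partway through training, so the later passes push the vectors away from the objective. One possible remedy, offered as a suggestion rather than what the notebook does, is a single `train()` call with `epochs=30`, which leaves the decay from `alpha` to `min_alpha` to Gensim.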


### Building the final vector feature for the classifier

In [11]:
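def vec_for_learning(model, tagged_docs):
    # returns (labels, inferred document vectors) for a Series of TaggedDocument objects
    sents = tagged_docs.values
    # `steps` was renamed to `epochs` in Gensim; Gensim 4 accepts only `epochs`
    targets, regressors = zip(*[(doc.tags[0], model.infer_vector(doc.words, epochs=20)) for doc in sents])
    return targets, regressors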

## Baseline Models

In [22]:
# infer Doc2Vec feature vectors and collect labels for the train and test sets
y_train, X_train = vec_for_learning(dbow_model, tagged_train)
y_test, X_test = vec_for_learning(dbow_model, tagged_test)

## Logistic Regression

With TF-IDF vectorization, the Logistic Regression baseline had the highest unweighted F1 score, 0.387805.

In [24]:
logreg = LogisticRegression(n_jobs=1, C=1e5)

In [25]:
logreg.fit(X_train, y_train)

LogisticRegression(C=100000.0, n_jobs=1)

In [26]:
logreg_y_preds = logreg.predict(X_test)

In [27]:
logreg_precision = precision_score(y_test, logreg_y_preds)
logreg_recall = recall_score(y_test, logreg_y_preds)
logreg_f1_score = f1_score(y_test, logreg_y_preds)
logreg_f1_weighted = f1_score(y_test, logreg_y_preds, average='weighted')

In [29]:
print('Precision: {:.4}'.format(logreg_precision))
print('Recall: {:.4}'.format(logreg_recall))
print('F1 Score: {:.4}'.format(logreg_f1_score))
print('Weighted F1 Score: {:.4}'.format(logreg_f1_weighted))


Precision: 0.504
Recall: 0.1475
F1 Score: 0.2283
Weighted F1 Score: 0.9276


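With Doc2Vec features, the Logistic Regression model's F1 score drops to 0.2283, compared with 0.387805 for the TF-IDF baseline. The weighted F1 score is much higher, at 0.9276, because it averages the per-class scores by support and the majority class dominates.

Relative to that baseline, the Doc2Vec features increased precision but decreased recall.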

According to the scikit-learn documentation, a weighted F1 score calculates metrics for each label, and finds their average weighted by support (the number of true instances for each label). **This alters ‘macro’ to account for label imbalance;** it can result in an F-score that is not between precision and recall.
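To make the difference between the default (positive-class) F1 and the weighted F1 concrete, here is a small worked example in plain Python. The confusion counts are hypothetical and chosen only to resemble an imbalanced binary problem. They are not this notebook's data, and `f1_from_counts` is an illustrative helper, not a scikit-learn function.

```python
def f1_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Hypothetical confusion counts (not taken from this notebook)
tp, fn = 9, 51     # 60 actual positives (minority class)
fp, tn = 9, 931    # 940 actual negatives (majority class)

f1_pos = f1_from_counts(tp, fp, fn)  # positive class treated as the target
f1_neg = f1_from_counts(tn, fn, fp)  # roles swapped for the negative class

support_pos, support_neg = tp + fn, tn + fp
macro_f1 = (f1_pos + f1_neg) / 2
weighted_f1 = (support_pos * f1_pos + support_neg * f1_neg) / (support_pos + support_neg)

print(f"positive-class F1: {f1_pos:.4f}")
print(f"macro F1:          {macro_f1:.4f}")
print(f"weighted F1:       {weighted_f1:.4f}")
```

Because the majority class carries most of the support, its high F1 dominates the weighted average even when the minority class is predicted poorly, which matches the pattern in the metrics above.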



In [30]:
# creating dictionary with all metrics
metric_dict = {}
metric_dict['Baseline Logisitic Regression'] = {'precision': logreg_precision, 'recall': logreg_recall, 'f1_score': logreg_f1_score, 'weighted_f1': logreg_f1_weighted}

## Support Vector Machine (SVM)

With TF-IDF vectorization, the baseline SVM had the highest weighted F1 score, 0.938102.

In [31]:
SVM_baseline = svm.SVC(C=1.0, kernel='linear', degree=3, gamma='auto')

In [38]:
# fit the training dataset on the classifier
SVM_baseline.fit(X_train, y_train)

SVC(gamma='auto', kernel='linear')

In [39]:
# predict the labels on the test set
SVM_y_preds = SVM_baseline.predict(X_test)

In [40]:
SVM_precision = precision_score(y_test, SVM_y_preds)
SVM_recall = recall_score(y_test, SVM_y_preds)
SVM_f1_score = f1_score(y_test, SVM_y_preds)
SVM_f1_weighted = f1_score(y_test, SVM_y_preds, average='weighted')

In [41]:
# printing evaluation metrics to 4 significant digits
print('Testing Metrics for SVM Baseline with Doc2Vec Features')
print('Precision: {:.4}'.format(SVM_precision))
print('Recall: {:.4}'.format(SVM_recall))
print('F1 Score: {:.4}'.format(SVM_f1_score))
print('Weighted F1 Score: {:.4}'.format(SVM_f1_weighted))

Testing Metrics for SVM Baseline with Doc2Vec Features
Precision: 1.0
Recall: 0.002342
F1 Score: 0.004673
Weighted F1 Score: 0.915
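The SVM figures can be sanity-checked with simple arithmetic. The sketch below assumes the classifier made exactly one positive prediction, which was correct, out of 427 positive test tweets. Both counts are inferred from the rounded metrics above rather than stated anywhere in the notebook, so treat them as a hypothesis.

```python
# Hypothetical counts inferred from the rounded metrics; not stated in the notebook
tp, fp = 1, 0
n_positive = 427
fn = n_positive - tp

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Same 4-significant-digit formatting as the notebook's print statements
print('Precision: {:.4}'.format(precision))
print('Recall: {:.4}'.format(recall))
print('F1 Score: {:.4}'.format(f1))
```

If that reading is right, the model labels nearly every tweet as the majority class. That would explain a perfect precision alongside a recall close to zero, with the weighted F1 staying high.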


In [42]:
metric_dict['Baseline SVM'] = {'precision': SVM_precision, 'recall': SVM_recall, 'f1_score': SVM_f1_score, 'weighted_f1': SVM_f1_weighted}

## Evaluation Metrics

In [43]:
pd.DataFrame.from_dict(metric_dict, orient='index')

| | precision | recall | f1_score | weighted_f1 |
|---|---|---|---|---|
| Baseline Logisitic Regression | 0.504 | 0.147541 | 0.228261 | 0.927634 |
| Baseline SVM | 1.0 | 0.002342 | 0.004673 | 0.915034 |


Overall, logistic regression performs better than the SVM baseline on Doc2Vec features, with a weighted F1 of 0.927634 versus 0.915034. However, the SVM baseline with TF-IDF vectorization performs slightly better than both, with a weighted F1 of 0.938102.

A reasonable next step is to go back and run a grid search on the baseline SVM with TF-IDF vectorization instead.