# Doc2Vec Modeling

The code in this notebook is adapted from [this](https://towardsdatascience.com/multi-class-text-classification-with-doc2vec-logistic-regression-9da9947b43f4) Medium post.

In [46]:
# imports
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import cm
import seaborn as sns; sns.set()
%matplotlib inline
# NLP
import nltk
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction import text 
from nltk.stem import WordNetLemmatizer 
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer
from nltk.probability import FreqDist
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
# modeling
from sklearn import metrics, utils
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split, GridSearchCV
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
# metrics
from sklearn import model_selection, svm
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, plot_confusion_matrix, roc_curve, auc, classification_report
import pickle

## Importing `clean_df`

In [47]:
clean_df = pd.read_pickle('../pickle/clean_df.pkl')

In [48]:
clean_df.head(2)

| | total_votes | hate_speech_votes | other_votes | label | tweet | clean_tweets |
|---|---|---|---|---|---|---|
| 0 | 3 | 0 | 3 | 0 | !!! RT @mayasolovely: As a woman you shouldn't... | as a woman you shouldnt complain about clea... |
| 1 | 3 | 0 | 3 | 0 | !!!!! RT @mleew17: boy dats cold...tyga dwn ba... | boy dats coldtyga dwn bad for cuffin dat ho... |


## Train-Test Split

In [49]:
doc_train, doc_test = train_test_split(clean_df, test_size=0.3, random_state=42)

## Preparing the Data

In [50]:
def tokenize_text(text):
    tokens = []
    for sent in nltk.sent_tokenize(text):
        for word in nltk.word_tokenize(sent):
            if len(word) < 2:
                continue
            tokens.append(word.lower())
    return tokens

In [51]:
tagged_train = doc_train.apply(
    lambda r: TaggedDocument(words=tokenize_text(r['clean_tweets']), tags=[r.label]), axis=1)
tagged_test = doc_test.apply(
    lambda r: TaggedDocument(words=tokenize_text(r['clean_tweets']), tags=[r.label]), axis=1)

In [52]:
tagged_train.values[30]

TaggedDocument(words=['this', 'bitch', 'instating', 'and', 'driving', 'for', 'me'], tags=[0])

## Training DBOW Model

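DBOW (Distributed Bag of Words) is the Doc2Vec variant analogous to the Skip-gram model in Word2Vec. Gensim selects DBOW with `dm=0`; the cell below passes `dm=1`, which selects the Distributed Memory (PV-DM) variant instead, so the results in this notebook come from PV-DM despite the `dbow_model` name. Either way, training a Doc2Vec model is straightforward in Gensim.

In [53]:
# train a doc2vec model, using only training data
# dm=1 selects PV-DM; dm=0 would select DBOW
# epochs=100 is only a default: the training loop passes epochs=1 explicitly
dbow_model = Doc2Vec(vector_size=100,
                     alpha=0.025,
                     min_count=5,
                     dm=1, epochs=100)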

In [54]:
from tqdm import tqdm
tqdm.pandas(desc="progress-bar")

# building vocabulary 
dbow_model.build_vocab([x for x in tqdm(tagged_train.values)])


100%|██████████| 17348/17348 [00:00<00:00, 2569942.63it/s]


In [55]:
%%time
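# this cell takes about 26 seconds to run
# NOTE: alpha starts at 0.025 and drops by 0.002 per pass, so it is negative
# from the 14th pass onward; gensim recommends a single train() call with
# epochs set instead of a manual loop.
for epoch in range(30):
    dbow_model.train(utils.shuffle([x for x in tqdm(tagged_train.values)]), total_examples=len(tagged_train.values), epochs=1)
    dbow_model.alpha -= 0.002  # manual learning-rate decay
    dbow_model.min_alpha = dbow_model.alpha  # hold the rate fixed within each pass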

100%|██████████| 17348/17348 [00:00<00:00, 2433456.60it/s]
...
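
The training loop above lowers `alpha` by 0.002 after each of its 30 passes, starting from 0.025. The short check below replays only that arithmetic, without calling gensim, to show where the learning rate crosses zero. The constant names are illustrative and not part of the notebook.

```python
# Replay the manual learning-rate schedule from the training loop.
# The three constants are copied from the notebook cells; the names are
# hypothetical and nothing here touches the Doc2Vec model itself.
START_ALPHA = 0.025
STEP = 0.002
N_PASSES = 30

alpha = START_ALPHA
alphas_used = []
for _ in range(N_PASSES):
    alphas_used.append(alpha)  # value in effect while this pass trains
    alpha -= STEP              # mirrors `dbow_model.alpha -= 0.002`

# Zero-based index of the first pass that trains with a negative rate.
first_negative_pass = next(i for i, a in enumerate(alphas_used) if a < 0)

print(f"first pass with a negative alpha: {first_negative_pass + 1}")
print(f"alpha on the final pass: {alphas_used[-1]:.3f}")
```

Since 0.025 / 0.002 = 12.5, only the first 13 passes train with a positive rate. A single `train(...)` call with `epochs=30`, which leaves the decay from `alpha` to `min_alpha` to gensim, would likely be a safer pattern, although it would also change the scores reported below.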

### Building the final vector feature for the classifier

In [56]:
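def vec_for_learning(model, tagged_docs):
    """Return (labels, inferred vectors) for a Series of TaggedDocuments."""
    sents = tagged_docs.values
    # `steps` is the gensim 3.x keyword; gensim 4.x renamed it to `epochs`
    targets, regressors = zip(*[(doc.tags[0], model.infer_vector(doc.words, steps=20)) for doc in sents])
    return targets, regressors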

## Baseline Models

In [57]:
# infer document vectors (features) and labels for the train and test sets
y_train, X_train = vec_for_learning(dbow_model, tagged_train)
y_test, X_test = vec_for_learning(dbow_model, tagged_test)

## Logistic Regression

With TF-IDF vectorization, the Logistic Regression baseline had the highest unweighted F1 score: 0.387805.

In [58]:
logreg = LogisticRegression(n_jobs=1, C=1e5)

In [59]:
%%time
logreg.fit(X_train, y_train)

CPU times: user 781 ms, sys: 14.7 ms, total: 796 ms
Wall time: 220 ms


LogisticRegression(C=100000.0, n_jobs=1)

In [60]:
logreg_y_preds = logreg.predict(X_test)

In [61]:
logreg_precision = precision_score(y_test, logreg_y_preds)
logreg_recall = recall_score(y_test, logreg_y_preds)
logreg_f1_score = f1_score(y_test, logreg_y_preds)
logreg_f1_weighted = f1_score(y_test, logreg_y_preds, average='weighted')

In [62]:
print('Precision: {:.4}'.format(logreg_precision))
print('Recall: {:.4}'.format(logreg_recall))
print('F1 Score: {:.4}'.format(logreg_f1_score))
print('Weighted F1 Score: {:.4}'.format(logreg_f1_weighted))


Precision: 0.4741
Recall: 0.1288
F1 Score: 0.2026
Weighted F1 Score: 0.9257


Using Doc2Vec vectors with Logistic Regression lowers the unweighted F1 score considerably, from 0.3878 with TF-IDF to 0.2026, but the score rises to 0.9257 when it is computed with `average='weighted'`.

Compared with the TF-IDF baseline, this method also increased precision but decreased recall.

According to the scikit-learn documentation, the weighted F1 score calculates metrics for each label and finds their average weighted by support (the number of true instances for each label). **This alters ‘macro’ to account for label imbalance;** it can result in an F-score that is not between precision and recall.
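
To make the difference between the averaging modes concrete, the sketch below computes per-class, macro, and weighted F1 by hand. The class supports (7008 and 427) come from the classification report further down in this notebook. The four confusion counts are an assumption: they are back-calculated from the printed precision and recall rather than taken from any notebook output.

```python
# Confusion counts for the Logistic Regression baseline.
# ASSUMPTION: back-calculated from the printed precision (0.4741),
# recall (0.1288), and class supports (7008 / 427); not notebook output.
tp, fp, fn, tn = 55, 61, 372, 6947


def f1_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall."""
    return 2 * tp / (2 * tp + fp + fn)


# Per-class F1. For class 0 the roles swap: its true positives are the
# true negatives of class 1, and its false positives are class 1's misses.
f1_class1 = f1_from_counts(tp, fp, fn)
f1_class0 = f1_from_counts(tn, fn, fp)

support0, support1 = tn + fp, tp + fn  # true instances per class
total = support0 + support1

macro_f1 = (f1_class0 + f1_class1) / 2  # every class counts equally
weighted_f1 = (f1_class0 * support0 + f1_class1 * support1) / total

print(f"class 1 F1:  {f1_class1:.4f}")
print(f"macro F1:    {macro_f1:.4f}")
print(f"weighted F1: {weighted_f1:.4f}")
```

Under these counts the class 1 and weighted values round to 0.2026 and 0.9257, matching the printed metrics. Class 0 supplies about 94% of the support, so the weighted score mostly tracks majority-class performance, while the macro average exposes the weak minority class.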



In [63]:
# creating dictionary with all metrics
metric_dict = {}
metric_dict['Baseline Logisitic Regression'] = {'precision': logreg_precision, 'recall': logreg_recall, 'f1_score': logreg_f1_score, 'weighted_f1': logreg_f1_weighted}

## Support Vector Machine (SVM)
With TF-IDF vectorization, the baseline SVM had the highest weighted F1 score: 0.938102.



In [64]:
SVM_baseline = svm.SVC(C=1.0, kernel='linear', degree=3, gamma='auto', class_weight='balanced')

In [65]:
%%time
# this cell takes about 26 seconds to run
# fit the training dataset on the classifier
SVM_baseline.fit(X_train, y_train)

CPU times: user 26.8 s, sys: 352 ms, total: 27.1 s
Wall time: 27.3 s


SVC(class_weight='balanced', gamma='auto', kernel='linear')

In [66]:
# predict the labels on validation dataset
SVM_y_preds = SVM_baseline.predict(X_test)

In [67]:
SVM_precision = precision_score(y_test, SVM_y_preds)
SVM_recall = recall_score(y_test, SVM_y_preds)
SVM_f1_score = f1_score(y_test, SVM_y_preds)
SVM_f1_weighted = f1_score(y_test, SVM_y_preds, average='weighted')

In [68]:
# printing evaluation metrics to 4 significant digits
print('Testing Metrics for SVM Baseline with Doc2Vec Features')
print('Precision: {:.4}'.format(SVM_precision))
print('Recall: {:.4}'.format(SVM_recall))
print('F1 Score: {:.4}'.format(SVM_f1_score))
print('Weighted F1 Score: {:.4}'.format(SVM_f1_weighted))

Testing Metrics for SVM Baseline with Doc2Vec Features
Precision: 0.1902
Recall: 0.6276
F1 Score: 0.2919
Weighted F1 Score: 0.8653


In [69]:
metric_dict['Baseline SVM'] = {'precision': SVM_precision, 'recall': SVM_recall, 'f1_score': SVM_f1_score, 'weighted_f1': SVM_f1_weighted}

## Evaluation Metrics

In [70]:
pd.DataFrame.from_dict(metric_dict, orient='index')

| | precision | recall | f1_score | weighted_f1 |
|---|---|---|---|---|
| Baseline Logisitic Regression | 0.474138 | 0.128806 | 0.202578 | 0.925716 |
| Baseline SVM | 0.190206 | 0.627635 | 0.291939 | 0.865324 |


At first glance, the SVM model does better on unweighted F1 (0.2919 vs. 0.2026), while the Logistic Regression model does better on weighted F1 (0.9257 vs. 0.8653).



In [73]:
target_names = ['class 0', 'class 1']
# logistic regression baseline
print(classification_report(y_test, logreg_y_preds, target_names=target_names))
# SVM baseline
print(classification_report(y_test, SVM_y_preds, target_names=target_names))

              precision    recall  f1-score   support

     class 0       0.95      0.99      0.97      7008
     class 1       0.47      0.13      0.20       427

    accuracy                           0.94      7435
   macro avg       0.71      0.56      0.59      7435
weighted avg       0.92      0.94      0.93      7435

              precision    recall  f1-score   support

     class 0       0.97      0.84      0.90      7008
     class 1       0.19      0.63      0.29       427

    accuracy                           0.83      7435
   macro avg       0.58      0.73      0.60      7435
weighted avg       0.93      0.83      0.87      7435



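However, both Doc2Vec baselines score below the TF-IDF results quoted above: unweighted F1 of 0.2026 vs. 0.3878 for Logistic Regression, and weighted F1 of 0.8653 vs. 0.9381 for SVM. Grid searching over the model hyperparameters may narrow this gap.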
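
One way the grid search could look is sketched below, using scikit-learn's `GridSearchCV`. It runs on synthetic, imbalanced data from `make_classification` as a stand-in for the Doc2Vec vectors. The parameter grid, the `f1` scoring choice, and the 3-fold setting are illustrative assumptions, not values taken from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic, imbalanced stand-in for the Doc2Vec features (hypothetical data).
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=5,
    weights=[0.9, 0.1], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# Candidate settings (assumed): regularization strength and class weighting.
param_grid = {
    "C": [0.01, 0.1, 1.0, 10.0],
    "class_weight": [None, "balanced"],
}

# scoring="f1" optimizes the unweighted F1 of the positive class,
# the metric on which the Doc2Vec baselines are weakest.
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    scoring="f1",
    cv=3,
)
grid.fit(X_tr, y_tr)

best_params = grid.best_params_
best_cv_f1 = grid.best_score_
print(best_params, round(best_cv_f1, 4))
```

To apply this to the notebook's data, `X_tr` and `y_tr` would be replaced by the `X_train` and `y_train` vectors inferred earlier. The test split should stay out of the search and be used only for the final evaluation.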
