## In this notebook, we will explore various ML algorithms

1. Random forest
2. Naive Bayes
3. Logistic Regression
4. Support Vector Machines
5. Gradient Boosted Classifiers

## Converting text data into forms that ML models can read

Before we use machine learning models for prediction, we need to convert the text data into a numeric form that the models can read.
Here are some forms we will explore:

1. Word2Vec representation of text (for this, it might be better to not remove stopwords: https://www.kaggle.com/code/harshitmakkar/nlp-word2vec)
2. Bag of Words representation
3. TF-IDF representation

In [1]:
%pip install gensim
%pip install scipy==1.12.0
%pip install xgboost

Collecting scipy==1.12.0
  Downloading scipy-1.12.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (38.4 MB)
Installing collected packages: scipy
  Attempting uninstall: scipy
    Found existing installation: scipy 1.11.4
    Uninstalling scipy-1.11.4:
      Successfully uninstalled scipy-1.11.4
Successfully installed scipy-1.12.0


In [25]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from nltk.data import find
import gensim
import nltk
from gensim.models import KeyedVectors, Word2Vec
from xgboost import XGBClassifier
from sklearn.preprocessing import MinMaxScaler

Reading the data

In [3]:
train_df = pd.read_csv('data/ML/ML_train.csv')
test_df = pd.read_csv('data/ML/ML_test.csv')

In [4]:
train_df.head()

Unnamed: 0,text,humor
0,watch swimmer disappear winter storm jonas,False
1,laughed reagan trump idea outlast political stage,False
2,hey cold go corner 90 degress,True
3,cant get standing desk almost good,False
4,wanna hear joke penis never mind long,True


In [5]:
test_df.head()

Unnamed: 0,text,humor
0,thought reddit joke today triangle rectangle f...,True
1,much pirate pay corn buck ear,True
2,hillary clinton sent book every gop candidatee...,False
3,italian union lambast new museum bos working hard,False
4,life ocean surface wholly depends live,False


In [6]:
X_train = train_df['text']
y_train = train_df['humor']
X_test = test_df['text']
y_test = test_df['humor']

Word2Vec: We will use a pretrained model

In [7]:
import gensim.downloader as api
# path = api.load("word2vec-google-news-300", return_path = True)
import json
info = api.info()
for model_name, model_data in sorted(info['models'].items()):
    print(
        '%s (%d records): %s' % (
            model_name,
            model_data.get('num_records', -1),
            model_data['description'],
        )
    )

__testing_word2vec-matrix-synopsis (-1 records): [THIS IS ONLY FOR TESTING] Word vecrors of the movie matrix.
conceptnet-numberbatch-17-06-300 (1917247 records): ConceptNet Numberbatch consists of state-of-the-art semantic vectors (also known as word embeddings) that can be used directly as a representation of word meanings or as a starting point for further machine learning. ConceptNet Numberbatch is part of the ConceptNet open data project. ConceptNet provides lots of ways to compute with word meanings, one of which is word embeddings. ConceptNet Numberbatch is a snapshot of just the word embeddings. It is built using an ensemble that combines data from ConceptNet, word2vec, GloVe, and OpenSubtitles 2016, using a variation on retrofitting.
fasttext-wiki-news-subwords-300 (999999 records): 1 million word vectors trained on Wikipedia 2017, UMBC webbase corpus and statmt.org news dataset (16B tokens).
glove-twitter-100 (1193514 records): Pre-trained vectors based on  2B tweets, 27B toke

In [8]:
path = api.load('word2vec-google-news-300', return_path=True)



In [9]:
model = KeyedVectors.load_word2vec_format(path, binary=True)

In [37]:
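# Use a function to vectorize the sentences into Word2Vec form

def get_vector(tokens, vector, k=300):
    # `tokens` should be a sequence of word tokens; a raw string would be
    # iterated character by character.
    if len(tokens) < 1:
        # empty input: return a zero vector
        return np.zeros(k)

    # look up each token; out-of-vocabulary tokens get a random vector of size k
    vectorized = [vector[word] if word in vector else np.random.rand(k) for word in tokens]
    # average the token vectors into a single k-dimensional embedding
    return np.mean(vectorized, axis=0)


def get_embedding(vectors, text, k=300):
    # embed every row of the Series; returns a list of k-dimensional vectors
    embs = text.apply(lambda x: get_vector(x, vectors, k=k))
    return list(embs)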

In [38]:
w2v_X_train = get_embedding(model, train_df['text'], k=300)
w2v_X_test = get_embedding(model, test_df['text'], k=300)
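
The `text` column shown by `train_df.head()` holds whitespace-separated strings, so a function that iterates over a row visits single characters unless the row is split into words first. The following is a minimal sketch of word-level averaging under stated assumptions: `toy_vectors` is a hypothetical three-dimensional dictionary standing in for the pretrained model, `average_word_vectors` is an illustrative name, and out-of-vocabulary words are skipped rather than replaced by random vectors, which is a different design choice from the cell above.

```python
import numpy as np

# Hypothetical toy embeddings standing in for the pretrained KeyedVectors model.
toy_vectors = {
    "cold": np.array([1.0, 0.0, 0.0]),
    "corner": np.array([0.0, 1.0, 0.0]),
}


def average_word_vectors(text, vectors, k=3):
    """Average the vectors of the in-vocabulary words of a whitespace-separated string."""
    tokens = text.split()  # split into words so we do not iterate over characters
    known = [vectors[word] for word in tokens if word in vectors]
    if not known:
        # Empty text or no known words: fall back to a zero vector.
        return np.zeros(k)
    return np.mean(known, axis=0)


# "hey" and "go" are not in the toy vocabulary, so only two words are averaged.
emb = average_word_vectors("hey cold go corner", toy_vectors)
print(emb)
```

Because the lookup only relies on `word in vectors` and `vectors[word]`, the same function should also accept a gensim `KeyedVectors` object with `k=300`.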

Bag of Words

In [None]:
bow_vectorizer = CountVectorizer()
bow_vectorizer.fit(X_train)

bow_X_train = bow_vectorizer.transform(X_train)
bow_X_test = bow_vectorizer.transform(X_test)

TF-IDF

In [None]:
# candidate settings (not applied here): ngram_range=(1, 3), min_df=2, max_df=0.85
tfidf_vectorizer = TfidfVectorizer()
tfidf_vectorizer.fit(X_train)

tfidf_X_train = tfidf_vectorizer.transform(X_train)
tfidf_X_test = tfidf_vectorizer.transform(X_test)
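
As a rough illustration of how TF-IDF differs from raw counts, the sketch below uses three hypothetical toy documents in which the word "joke" appears in every document. With the default `TfidfVectorizer` settings, a term that occurs everywhere receives a lower weight than a term that occurs in only one document.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical): "joke" occurs in every document, the other words once.
docs = ["joke pirate", "joke corn", "joke reddit"]

vectorizer = TfidfVectorizer()
weights = vectorizer.fit_transform(docs).toarray()

# Column indices of the two terms that appear in the first document.
joke_col = vectorizer.vocabulary_["joke"]
pirate_col = vectorizer.vocabulary_["pirate"]

joke_weight = weights[0, joke_col]
pirate_weight = weights[0, pirate_col]

# The ubiquitous term is down-weighted relative to the rarer one.
print(f"joke: {joke_weight:.3f}, pirate: {pirate_weight:.3f}")
```

Settings such as `ngram_range`, `min_df`, and `max_df` are constructor arguments of `TfidfVectorizer`, so the values in the comment above could be tried by passing them to the constructor.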

# Function to test the models

In [23]:
def train_and_eval(model, trainX, trainY, testX, testY):

    # training the model
    fitted_model = model.fit(trainX, trainY)

    # getting predictions
    y_preds_train = fitted_model.predict(trainX)
    y_preds_test = fitted_model.predict(testX)

    # evaluating the model
    print()
    print(model)
    print(f"Train accuracy score : {accuracy_score(trainY, y_preds_train)}")
    print(f"Test accuracy score : {accuracy_score(testY, y_preds_test)}")
    print(classification_report(testY, y_preds_test))
    print('\n',40*'-')
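
`train_and_eval` reports accuracy and a classification report. The `confusion_matrix` function imported earlier could complement these by showing the raw counts behind the precision and recall figures. The following is a minimal sketch on hypothetical toy labels, not on the notebook's predictions.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels for five examples.
y_true = [False, False, True, True, True]
y_pred = [False, True, True, True, False]

# Rows are true classes, columns are predicted classes, in the order given by `labels`.
cm = confusion_matrix(y_true, y_pred, labels=[False, True])

# For a binary problem the flattened matrix is (TN, FP, FN, TP).
tn, fp, fn, tp = cm.ravel()
print(cm)
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```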

# Multinomial Naive Bayes

Multinomial Naive Bayes with BoW

In [None]:
nb_model = MultinomialNB()
train_and_eval(nb_model, bow_X_train, y_train, bow_X_test, y_test)


MultinomialNB()
Train accuracy score : 0.9181125
Test accuracy score : 0.901575
              precision    recall  f1-score   support

       False       0.91      0.89      0.90     20000
        True       0.89      0.91      0.90     20000

    accuracy                           0.90     40000
   macro avg       0.90      0.90      0.90     40000
weighted avg       0.90      0.90      0.90     40000


 ----------------------------------------


Multinomial Naive Bayes with TF-IDF

In [None]:
nb_model = MultinomialNB()
train_and_eval(nb_model, tfidf_X_train, y_train, tfidf_X_test, y_test)


MultinomialNB()
Train accuracy score : 0.9185125
Test accuracy score : 0.899375
              precision    recall  f1-score   support

       False       0.91      0.89      0.90     20000
        True       0.89      0.91      0.90     20000

    accuracy                           0.90     40000
   macro avg       0.90      0.90      0.90     40000
weighted avg       0.90      0.90      0.90     40000


 ----------------------------------------


Multinomial Naive Bayes with Word2Vec

In [29]:
nb_model = MultinomialNB()
scaler = MinMaxScaler()

# rescale the inputs to [0, 1] because Multinomial Naive Bayes does not accept negative feature values
scaler.fit(w2v_X_train)
nb_w2v_train = scaler.transform(w2v_X_train)
nb_w2v_test = scaler.transform(w2v_X_test)

train_and_eval(nb_model, nb_w2v_train, y_train, nb_w2v_test, y_test)


MultinomialNB()
Train accuracy score : 0.5845
Test accuracy score : 0.5841
              precision    recall  f1-score   support

       False       0.58      0.62      0.60     20000
        True       0.59      0.55      0.57     20000

    accuracy                           0.58     40000
   macro avg       0.58      0.58      0.58     40000
weighted avg       0.58      0.58      0.58     40000


 ----------------------------------------
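
The scaling step matters because `MultinomialNB` rejects negative feature values, and averaged word embeddings usually contain them. The sketch below demonstrates this on a hypothetical toy matrix: fitting on the raw values raises a `ValueError`, while fitting after `MinMaxScaler` succeeds.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MinMaxScaler

# Hypothetical dense features with negative entries, as embeddings typically have.
X = np.array([[-0.5, 0.2], [0.3, -0.1], [0.4, 0.6], [-0.2, -0.4]])
y = np.array([0, 1, 1, 0])

# Fitting on the raw features fails because of the negative values.
try:
    MultinomialNB().fit(X, y)
    raised = False
except ValueError:
    raised = True

# Rescaling each feature to [0, 1] removes the negative values.
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
nb = MultinomialNB().fit(X_scaled, y)

print(raised, X_scaled.min(), X_scaled.max())
```

Min-max scaling makes the input valid, but the scaled embedding dimensions are still not counts, which may be one reason this combination performs poorly. `GaussianNB`, imported above, accepts negative features directly and could be compared as an alternative.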


BoW works best for the Multinomial Naive Bayes model (test accuracy 0.901575), narrowly ahead of TF-IDF (0.899375) and far ahead of Word2Vec (0.5841).

# Logistic Regression

Logistic Regression with BoW

In [None]:
log_model = LogisticRegression(random_state=42)
train_and_eval(log_model, bow_X_train, y_train, bow_X_test, y_test)


LogisticRegression(random_state=42)
Train accuracy score : 0.93783125
Test accuracy score : 0.90545
              precision    recall  f1-score   support

       False       0.91      0.90      0.91     20000
        True       0.90      0.91      0.91     20000

    accuracy                           0.91     40000
   macro avg       0.91      0.91      0.91     40000
weighted avg       0.91      0.91      0.91     40000


 ----------------------------------------


Logistic Regression with TF-IDF

In [None]:
log_model = LogisticRegression(random_state=42)
train_and_eval(log_model, tfidf_X_train, y_train, tfidf_X_test, y_test)


LogisticRegression(random_state=42)
Train accuracy score : 0.91860625
Test accuracy score : 0.9008
              precision    recall  f1-score   support

       False       0.90      0.90      0.90     20000
        True       0.90      0.90      0.90     20000

    accuracy                           0.90     40000
   macro avg       0.90      0.90      0.90     40000
weighted avg       0.90      0.90      0.90     40000


 ----------------------------------------


Logistic Regression with Word2Vec

In [30]:
log_model = LogisticRegression(random_state=42)
train_and_eval(log_model, w2v_X_train, y_train, w2v_X_test, y_test)


LogisticRegression(random_state=42)
Train accuracy score : 0.6046875
Test accuracy score : 0.601375
              precision    recall  f1-score   support

       False       0.60      0.62      0.61     20000
        True       0.61      0.58      0.59     20000

    accuracy                           0.60     40000
   macro avg       0.60      0.60      0.60     40000
weighted avg       0.60      0.60      0.60     40000


 ----------------------------------------


Logistic Regression works best with BoW (test accuracy 0.90545), compared with 0.9008 for TF-IDF and 0.601375 for Word2Vec.

# Random Forest Classifier

Random Forest with BoW

In [None]:
clf = RandomForestClassifier(random_state=42)
train_and_eval(clf, bow_X_train, y_train, bow_X_test, y_test)


RandomForestClassifier(random_state=42)
Train accuracy score : 0.99999375
Test accuracy score : 0.87755
              precision    recall  f1-score   support

       False       0.89      0.86      0.88     20000
        True       0.86      0.90      0.88     20000

    accuracy                           0.88     40000
   macro avg       0.88      0.88      0.88     40000
weighted avg       0.88      0.88      0.88     40000


 ----------------------------------------


Random Forest with TF-IDF

In [None]:
clf = RandomForestClassifier(random_state=42)
train_and_eval(clf, tfidf_X_train, y_train, tfidf_X_test, y_test)


RandomForestClassifier(random_state=42)
Train accuracy score : 0.99999375
Test accuracy score : 0.874625
              precision    recall  f1-score   support

       False       0.89      0.85      0.87     20000
        True       0.86      0.90      0.88     20000

    accuracy                           0.87     40000
   macro avg       0.88      0.87      0.87     40000
weighted avg       0.88      0.87      0.87     40000


 ----------------------------------------


Random Forest with Word2Vec

In [32]:
clf = RandomForestClassifier(random_state=42)
train_and_eval(clf, w2v_X_train, y_train, w2v_X_test, y_test)


RandomForestClassifier(random_state=42)
Train accuracy score : 1.0
Test accuracy score : 0.625325
              precision    recall  f1-score   support

       False       0.62      0.63      0.63     20000
        True       0.63      0.62      0.62     20000

    accuracy                           0.63     40000
   macro avg       0.63      0.63      0.63     40000
weighted avg       0.63      0.63      0.63     40000


 ----------------------------------------


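Random Forest works best with BoW (test accuracy 0.87755, against 0.874625 for TF-IDF and 0.625325 for Word2Vec). With every representation the train accuracy is at or near 1.0, so the default forest overfits the training data.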

# Gradient Boosted Classifier

XGB classifier with BoW

In [None]:
xgb = XGBClassifier(random_state=42)
train_and_eval(xgb, bow_X_train, y_train, bow_X_test, y_test)


XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, device=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, feature_types=None,
              gamma=None, grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=None, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=None, max_leaves=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              multi_strategy=None, n_estimators=None, n_jobs=None,
              num_parallel_tree=None, random_state=42, ...)
Train accuracy score : 0.836325
Test accuracy score : 0.8283
              precision    recall  f1-score   support

       False       0.80      0.88      0.84     20000
        True       0.87      0.78      0.82     20000

    accuracy      

XGB classifier with TF-IDF

In [None]:
xgb = XGBClassifier(random_state=42)
train_and_eval(xgb, tfidf_X_train, y_train, tfidf_X_test, y_test)


XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, device=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, feature_types=None,
              gamma=None, grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=None, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=None, max_leaves=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              multi_strategy=None, n_estimators=None, n_jobs=None,
              num_parallel_tree=None, random_state=42, ...)
Train accuracy score : 0.8391625
Test accuracy score : 0.8272
              precision    recall  f1-score   support

       False       0.79      0.88      0.84     20000
        True       0.87      0.77      0.82     20000

    accuracy     

XGB classifier with Word2Vec

In [31]:
xgb = XGBClassifier(random_state=42)
train_and_eval(xgb, w2v_X_train, y_train, w2v_X_test, y_test)


XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, device=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, feature_types=None,
              gamma=None, grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=None, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=None, max_leaves=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              multi_strategy=None, n_estimators=None, n_jobs=None,
              num_parallel_tree=None, random_state=42, ...)
Train accuracy score : 0.75400625
Test accuracy score : 0.628825
              precision    recall  f1-score   support

       False       0.62      0.66      0.64     20000
        True       0.64      0.59      0.62     20000

    accuracy  

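The XGB classifier works best with BoW (test accuracy 0.8283), marginally ahead of TF-IDF (0.8272) and well ahead of Word2Vec (0.628825).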
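
To compare the runs side by side, the test accuracies printed in the cells above can be collected into one table. This is a small sketch that copies the reported numbers by hand; the variable names and row labels are illustrative.

```python
import pandas as pd

# Test accuracies copied from the outputs printed above.
test_accuracy = pd.DataFrame(
    {
        "BoW": [0.901575, 0.90545, 0.87755, 0.8283],
        "TF-IDF": [0.899375, 0.9008, 0.874625, 0.8272],
        "Word2Vec": [0.5841, 0.601375, 0.625325, 0.628825],
    },
    index=["Multinomial NB", "Logistic Regression", "Random Forest", "XGBoost"],
)

# Best representation for each model (column with the highest accuracy per row).
best_features = test_accuracy.idxmax(axis=1)

# Model with the highest test accuracy over all representations.
best_model = test_accuracy.max(axis=1).idxmax()

print(test_accuracy)
print(best_features)
print(best_model)
```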