## In this notebook, we will explore various ML algorithms

1. Random forest
2. Naive Bayes
3. Logistic Regression
4. Support Vector Machines
5. Gradient Boosted Classifiers

## Converting text data into forms that ML models can read

Before we use machine learning models for prediction, we need to convert the text data into a numeric form that ML models can read.
Here are some forms we will explore:

1. Word2Vec representation of text (for this, it might be better to not remove stopwords: https://www.kaggle.com/code/harshitmakkar/nlp-word2vec)
2. Bag of Words representation
3. TF-IDF representation

In [2]:
%pip install gensim
%pip install scipy==1.12.0

Note: you may need to restart the kernel to use updated packages.
Collecting scipy==1.12.0
  Using cached scipy-1.12.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (60 kB)
Using cached scipy-1.12.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (38.4 MB)
Installing collected packages: scipy
  Attempting uninstall: scipy
    Found existing installation: scipy 1.13.0
    Uninstalling scipy-1.13.0:
      Successfully uninstalled scipy-1.13.0
Successfully installed scipy-1.12.0
Note: you may need to restart the kernel to use updated packages.


In [2]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from nltk.data import find
import gensim
import nltk
from gensim.models import KeyedVectors, Word2Vec

Reading the data

In [4]:
train_df = pd.read_csv('data/ML/ML_train.csv')
test_df = pd.read_csv('data/ML/ML_test.csv')

In [5]:
train_df.head()

Unnamed: 0,text,humor
0,watch swimmer disappear winter storm jonas,False
1,laughed reagan trump idea outlast political stage,False
2,hey cold go corner 90 degress,True
3,cant get standing desk almost good,False
4,wanna hear joke penis never mind long,True


In [6]:
test_df.head()

Unnamed: 0,text,humor
0,thought reddit joke today triangle rectangle f...,True
1,much pirate pay corn buck ear,True
2,hillary clinton sent book every gop candidatee...,False
3,italian union lambast new museum bos working hard,False
4,life ocean surface wholly depends live,False


In [7]:
X_train = train_df['text']
y_train = train_df['humor']
X_test = test_df['text']
y_test = test_df['humor']

Word2Vec: Using a pre-trained model (INCOMPLETE)

In [1]:
# import gensim.downloader as api
# path = api.load("word2vec-google-news-300", return_path = True)

In [2]:
# path

'/home/codespace/gensim-data/word2vec-google-news-300/word2vec-google-news-300.gz'

In [8]:
# using a pruned sample of a pre-trained model that was trained on 100 billion words from the Google News dataset
# models/word2vec_sample/pruned.word2vec.txt
# word2vec_sample = str(find('models/word2vec_sample/pruned.word2vec.txt'))
# model = gensim.models.KeyedVectors.load_word2vec_format(word2vec_sample, binary=False)

In [14]:
# model

<gensim.models.keyedvectors.KeyedVectors at 0x7f16281f0cd0>
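The Word2Vec section is marked incomplete, so the following is only a minimal sketch of one common way to turn word vectors into a fixed-length feature vector per text: averaging the vectors of the words that have an embedding. The dictionary `toy_vectors` and the helper `average_vector` are hypothetical names introduced for illustration. The 3-dimensional vectors stand in for the 300-dimensional pre-trained ones, and a gensim `KeyedVectors` object should support the same `in` check and `[]` lookup.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings standing in for the pre-trained 300-dimensional vectors
toy_vectors = {
    "cold": np.array([1.0, 0.0, 0.0]),
    "joke": np.array([0.0, 1.0, 0.0]),
    "storm": np.array([0.0, 0.0, 1.0]),
}

def average_vector(text, vectors, dim):
    """Average the embeddings of the tokens in `text` that have a vector."""
    tokens = [tok for tok in text.split() if tok in vectors]
    if not tokens:
        # no known word: fall back to an all-zero vector
        return np.zeros(dim)
    return np.mean([vectors[tok] for tok in tokens], axis=0)

# "unknownword" has no embedding, so it is skipped
doc_vec = average_vector("cold joke unknownword", toy_vectors, dim=3)
empty_vec = average_vector("zzz", toy_vectors, dim=3)
print(doc_vec)
print(empty_vec)
```

Applying such a helper to every row of `X_train` and stacking the results would give a dense matrix. Unlike the sparse BoW and TF-IDF matrices, it can contain negative values, which `MultinomialNB` does not accept.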

Bag of Words

In [8]:
bow_vectorizer = CountVectorizer()
bow_vectorizer.fit(X_train)

bow_X_train = bow_vectorizer.transform(X_train)
bow_X_test = bow_vectorizer.transform(X_test)

TF-IDF

In [9]:
# ngram_range=(1, 3), min_df=2, max_df=0.85
tfidf_vectorizer = TfidfVectorizer()
tfidf_vectorizer.fit(X_train)

tfidf_X_train = tfidf_vectorizer.transform(X_train)
tfidf_X_test = tfidf_vectorizer.transform(X_test)

# Function to test the models

In [10]:
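def train_and_eval(model, trainX, trainY, testX, testY):
    """Fit `model` on the training data, then print train/test accuracy and a test-set classification report."""
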
    # training the model
    fitted_model = model.fit(trainX, trainY)

    # getting predictions
    y_preds_train = fitted_model.predict(trainX)
    y_preds_test = fitted_model.predict(testX)

    # evaluating the model
    print()
    print(model)
    print(f"Train accuracy score : {accuracy_score(trainY, y_preds_train)}")
    print(f"Test accuracy score : {accuracy_score(testY, y_preds_test)}")
    print(classification_report(testY, y_preds_test))
    print('\n',40*'-')
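`confusion_matrix` is imported at the top of the notebook but not used by `train_and_eval`. As a possible complement to the classification report, the sketch below shows how it could be read for boolean labels like `humor`. The toy labels are invented for illustration.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# invented labels and predictions, in the same boolean format as the `humor` column
toy_true = [True, True, False, False, True]
toy_pred = [True, False, False, True, True]

# rows = true class, columns = predicted class, in the order given by `labels`
cm = confusion_matrix(toy_true, toy_pred, labels=[False, True])
print(cm)
# [[1 1]    <- true False: 1 predicted False, 1 predicted True
#  [1 2]]   <- true True:  1 predicted False, 2 predicted True

toy_acc = accuracy_score(toy_true, toy_pred)
print(toy_acc)  # 0.6
```

The diagonal holds the correct predictions, so accuracy is the diagonal sum divided by the total: (1 + 2) / 5 = 0.6.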

# Multinomial Naive Bayes

Multinomial Naive Bayes with BoW

In [12]:
nb_model = MultinomialNB()
train_and_eval(nb_model, bow_X_train, y_train, bow_X_test, y_test)


MultinomialNB()
Train accuracy score : 0.9181125
Test accuracy score : 0.901575
              precision    recall  f1-score   support

       False       0.91      0.89      0.90     20000
        True       0.89      0.91      0.90     20000

    accuracy                           0.90     40000
   macro avg       0.90      0.90      0.90     40000
weighted avg       0.90      0.90      0.90     40000


 ----------------------------------------


Multinomial Naive Bayes with TF-IDF

In [21]:
nb_model = MultinomialNB()
train_and_eval(nb_model, tfidf_X_train, y_train, tfidf_X_test, y_test)


MultinomialNB()
Train accuracy score : 0.9185125
Test accuracy score : 0.899375
              precision    recall  f1-score   support

       False       0.91      0.89      0.90     20000
        True       0.89      0.91      0.90     20000

    accuracy                           0.90     40000
   macro avg       0.90      0.90      0.90     40000
weighted avg       0.90      0.90      0.90     40000


 ----------------------------------------


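BoW gives a slightly higher test accuracy than TF-IDF for the Multinomial Naive Bayes model (0.9016 vs. 0.8994), although the precision, recall, and F1 scores are identical to two decimal places.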

# Logistic Regression

Logistic Regression with BoW

In [22]:
log_model = LogisticRegression(random_state=42)
train_and_eval(log_model, bow_X_train, y_train, bow_X_test, y_test)


LogisticRegression(random_state=42)
Train accuracy score : 0.93783125
Test accuracy score : 0.90545
              precision    recall  f1-score   support

       False       0.91      0.90      0.91     20000
        True       0.90      0.91      0.91     20000

    accuracy                           0.91     40000
   macro avg       0.91      0.91      0.91     40000
weighted avg       0.91      0.91      0.91     40000


 ----------------------------------------


Logistic Regression with TF-IDF

In [23]:
log_model = LogisticRegression(random_state=42)
train_and_eval(log_model, tfidf_X_train, y_train, tfidf_X_test, y_test)


LogisticRegression(random_state=42)
Train accuracy score : 0.91860625
Test accuracy score : 0.9008
              precision    recall  f1-score   support

       False       0.90      0.90      0.90     20000
        True       0.90      0.90      0.90     20000

    accuracy                           0.90     40000
   macro avg       0.90      0.90      0.90     40000
weighted avg       0.90      0.90      0.90     40000


 ----------------------------------------
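With BoW features, the logistic regression model scores 0.938 on the training set and 0.905 on the test set, so the regularization strength may be worth tuning. `Pipeline` and `GridSearchCV` are imported above but unused. The sketch below shows one way they could be combined, assuming a small grid over `C`. The toy corpus, the grid values, `max_iter=1000`, and `cv=2` are all illustrative assumptions, not settings from this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# invented toy corpus: 4 humorous and 4 non-humorous texts
toy_texts = [
    "hear joke never mind long", "pirate pay corn buck ear",
    "hey cold go corner", "joke triangle rectangle",
    "swimmer disappear winter storm", "union lambast new museum",
    "clinton sent book candidate", "life ocean surface depends",
]
toy_labels = [True, True, True, True, False, False, False, False]

# the vectorizer sits inside the pipeline, so it is re-fitted on each training fold
pipe = Pipeline([
    ("vec", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000, random_state=42)),
])

# step name + "__" + parameter name addresses a parameter inside the pipeline
param_grid = {"clf__C": [0.1, 1.0, 10.0]}

search = GridSearchCV(pipe, param_grid, cv=2, scoring="accuracy")
search.fit(toy_texts, toy_labels)

print(search.best_params_)
```

On the real data, the same object would be fitted with `search.fit(X_train, y_train)` on the raw text column rather than on the pre-computed matrices.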


# Random Forest Classifier

Random Forest with BoW

In [None]:
clf = RandomForestClassifier(random_state=42)
train_and_eval(clf, bow_X_train, y_train, bow_X_test, y_test)

Random Forest with TF-IDF

In [None]:
clf = RandomForestClassifier(random_state=42)
train_and_eval(clf, tfidf_X_train, y_train, tfidf_X_test, y_test)
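The two Random Forest cells above have no recorded output. Should fitting a default forest on the full sparse matrices prove slow (an assumption, the notebook does not say), a smaller, depth-limited forest trained in parallel is one option. The values `n_estimators=50` and `max_depth=20` below are illustrative guesses, not tuned settings, and the toy corpus is invented.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# invented toy corpus standing in for X_train / y_train
toy_texts = [
    "hear joke never mind long", "pirate pay corn buck ear",
    "hey cold go corner", "joke triangle rectangle",
    "swimmer disappear winter storm", "union lambast new museum",
    "clinton sent book candidate", "life ocean surface depends",
]
toy_labels = [True, True, True, True, False, False, False, False]

toy_X = CountVectorizer().fit_transform(toy_texts)

fast_clf = RandomForestClassifier(
    n_estimators=50,   # fewer trees than the default of 100
    max_depth=20,      # cap tree depth; unlimited by default
    n_jobs=-1,         # build trees on all available CPU cores
    random_state=42,
)
fast_clf.fit(toy_X, toy_labels)
toy_preds = fast_clf.predict(toy_X)
print(toy_preds)
```

The same estimator can be passed to `train_and_eval` in place of `clf`. Limiting depth and tree count trades some accuracy for speed, so the result would need to be checked against the other models.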