# Modeling with Gradient Boosting

_This notebook uses Gradient Boosting to classify Reddit posts into the correct subreddit. The joined data is first split into train and test sets, then vectorized with TF-IDF and reduced in dimensionality with Latent Semantic Analysis (truncated SVD), and finally passed to the Gradient Boosting model._

In [2]:
# Importing necessary libraries
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.ensemble import GradientBoostingClassifier, AdaBoostClassifier
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer

import matplotlib.pyplot as plt
%matplotlib inline

## Joining, Preprocessing the Data

In [3]:
# Reading the dataset
join = pd.read_csv('./datasets/joined.csv')
join.head()

# Assigning target
target = join['subreddit']

# Dropping selftext and title columns
df = join.drop(['selftext','title'], axis=1)

# Importing nltk's English stop words
# Note: Using camelCase to avoid overwriting the imported `stopwords` name
stopWords = stopwords.words('english')

# From the EDA, I concluded that the following words were very common
# in both /r/Jokes and /r/AntiJokes and only added noise,
# so I added them to the nltk stop words
stopWords.extend(['wa', 'say', 'said', 'did', 'like', 'asked', 'woman', 
                  'don', 'know', 'year', 'wife', 'good', 'want', 'got', 
                  'ha', 'people', 'make', 'tell', 'didn', 'joke', 'x200b', 
                  'way', 'think', 'walk', 'll', 'home', 't'])

# Tokenizing by alphanumeric characters
# (raw string so that \w is not treated as a string escape)
tokenizer = RegexpTokenizer(r'\w+')

# Making all tokens lowercase
tokens = [tokenizer.tokenize(post.lower()) for post in (df['joined'])]

# Initializing lemmatizer
lemmatizer = WordNetLemmatizer()

# First had to lemmatize each word
# then rejoin words into one string
lems = []
for post in tokens:
    tok_post = []
    for word in post:
        tok_post.append(lemmatizer.lemmatize(word))
    posts = " ".join(tok_post)
    lems.append(posts)

# Adding the lemmatized data back to the DataFrame
join['text'] = lems
join['text'] = join['text'].astype(str)

# Dropping unnecessary columns
join=join.drop(['selftext','title','joined'],axis=1)
join.shape

(1725, 2)

_Adding a column for word count. Based on the EDA, I believe this column will add signal to the model._

In [4]:
join['word_count'] = join['text'].map(lambda x: len(x.split()))

In [5]:
join.head()

| Unnamed: 0 | subreddit | text | char_count | word_count |
|---|---|---|---|---|
| 0 | 0 | husband wa screwing his secretary up the as wh... | 153 | 34 |
| 1 | 0 | why doe batman wear dark clothing batman doesn... | 132 | 26 |
| 2 | 0 | a man is in court the judge say on the 3rd aug... | 1156 | 235 |
| 3 | 0 | a poor old lady wa forced to sell her valuable... | 1540 | 299 |
| 4 | 0 | how do you get a nun pregnant dress her up a a... | 56 | 14 |


## Train-Test-Splitting the Data

In [6]:
X = join[['text','word_count']]
y = join['subreddit']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [7]:
X_train.shape

(1293, 3)

## Using TFIDF to Vectorize the Data

In [8]:
# Initializing the TFIDF Vectorizer
tfidf = TfidfVectorizer(stop_words=stopWords, min_df=1, max_df=1.0)

# Fitting and transforming the data
X_train_tf = tfidf.fit_transform(X_train['text'])
X_test_tf = tfidf.transform(X_test['text'])

In [9]:
X_train_tf.shape

(1293, 5523)

## Using SVD for Dimensionality Reduction

In [10]:
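# Converting the sparse TF-IDF matrices to DataFrames
# Note: get_feature_names_out() replaces get_feature_names(),
# which was removed in scikit-learn 1.2
feature_names = tfidf.get_feature_names_out()
X_train_tf_df = pd.DataFrame(X_train_tf.toarray(), columns=feature_names)
X_test_tf_df = pd.DataFrame(X_test_tf.toarray(), columns=feature_names)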

In [11]:
# Initializing the SVD model
# From the LSA notebook, it was determined that 346 components 
# would allow for a variance explained score of 75%
SVD = TruncatedSVD(n_components=346)
X_train_svd = SVD.fit_transform(X_train_tf_df)
X_test_svd = SVD.transform(X_test_tf_df)
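
_The 346-component figure comes from the separate LSA notebook, which is not shown here. As a rough sketch of how such a number could be found, the hypothetical helper `components_for_variance` below returns the smallest number of components whose cumulative explained-variance ratio reaches a target. It assumes the ratios are ordered like `TruncatedSVD.explained_variance_ratio_`; the example ratios are made up for illustration and are not values from this dataset._

```python
from itertools import accumulate

def components_for_variance(ratios, target=0.75):
    """Return the smallest k such that the first k ratios sum to at least target.

    Hypothetical helper: `ratios` is assumed to be ordered like
    TruncatedSVD.explained_variance_ratio_. Returns None if the target
    is never reached.
    """
    for k, total in enumerate(accumulate(ratios), start=1):
        if total >= target:
            return k
    return None

# Made-up ratios for illustration only, not values from this dataset
example_ratios = [0.40, 0.20, 0.10, 0.08, 0.05]
k = components_for_variance(example_ratios, target=0.75)
```

_With a fitted model, the corresponding call would be `components_for_variance(SVD.explained_variance_ratio_)`, after first fitting with a generous `n_components` so that the target can actually be reached._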

In [12]:
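# Wrapped the SVD output in a DataFrame for the purpose of
# adding back in the word_count and character_count features.
# The 'svd_' prefix gives every column a string name, since recent
# scikit-learn versions reject a mix of integer and string column names
X_train_svd_df = pd.DataFrame(X_train_svd).add_prefix('svd_')
X_test_svd_df = pd.DataFrame(X_test_svd).add_prefix('svd_')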

In [13]:
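# Had to reset the index of the original X_train and X_test data.
# pandas aligns on index labels when assigning a column, and the
# train/test split leaves shuffled labels that do not match the
# 0..n-1 index of the SVD DataFrames. Resetting the index solved the issue
X_train = X_train.reset_index(drop=True)
X_test = X_test.reset_index(drop=True)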
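
_The index issue above can be reproduced on a toy example. This is a minimal sketch with made-up frames (`features` and `svd_like` are illustrative names, not objects from this notebook): assigning a column from a frame with shuffled index labels aligns on those labels rather than on row position, so most rows come back as NaN until the index is reset._

```python
import pandas as pd

# A shuffled index, similar to what train_test_split leaves behind
features = pd.DataFrame({'word_count': [34, 26, 235]}, index=[7, 2, 5])

# A fresh frame with the default 0..n-1 index, like the SVD output
svd_like = pd.DataFrame({'component_0': [0.1, 0.2, 0.3]})

# Assignment aligns on index labels: only label 2 exists on both sides,
# so the other two rows receive NaN
misaligned = svd_like.copy()
misaligned['word_count'] = features['word_count']

# After resetting the index, rows line up by position
aligned = svd_like.copy()
aligned['word_count'] = features.reset_index(drop=True)['word_count']
```

_Passing `drop=True` discards the old labels instead of keeping them as an extra `index` column._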

In [14]:
# Adding word_count back into the train and test data
X_train_svd_df['word_count'] = X_train['word_count']
X_test_svd_df['word_count'] = X_test['word_count']

## Using the Gradient Boosting Model

_Gradient Boosting builds a model iteratively: each new weak learner is fit to the residuals (errors) of the ensemble built so far, and its predictions are added to the running total. This differs from Adaptive Boosting, where each successive model puts heavier weight on the observations that earlier models misclassified._
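
_To make the residual-fitting idea concrete, here is a toy sketch in plain Python. It is not how `GradientBoostingClassifier` is implemented: it assumes a regression problem with squared error (where the negative gradient is exactly the residual), a single feature, and one-split "stumps" as weak learners. The data and the names `fit_stump`, `boost`, and `mse` are made up for illustration._

```python
def fit_stump(xs, residuals):
    """Find the threshold split on a 1-D feature that minimizes squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Toy gradient boosting for squared error on a single feature."""
    base = sum(ys) / len(ys)           # start from the mean prediction
    stumps = []
    preds = [base] * len(xs)
    for _ in range(n_rounds):
        # Residuals of the current ensemble are the targets for the next stump
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Add a shrunken version of the new stump to the running prediction
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

def mse(model, xs, ys):
    """Mean squared error of a fitted model on the given points."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Made-up data with a step between x=3 and x=4
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
one_round = boost(xs, ys, n_rounds=1)
twenty_rounds = boost(xs, ys, n_rounds=20)
```

_Comparing `mse(one_round, xs, ys)` with `mse(twenty_rounds, xs, ys)` shows the training error shrinking as rounds are added, which is the role `n_estimators` plays in the grid search below._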

In [25]:
gb = GradientBoostingClassifier(random_state=42)
params = {
    'max_depth' : [1,2,3,4],
    'n_estimators' : [25, 50, 75],
}
grid = GridSearchCV(gb, param_grid = params, cv=5)
grid.fit(X_train_svd_df, y_train)
print(grid.best_score_)
grid.best_params_

0.6411446249033256


{'max_depth': 1, 'n_estimators': 50}

In [26]:
grid.score(X_test_svd_df, y_test)

0.6481481481481481

_With GridSearch and Gradient Boosting, the test accuracy is now at about 65%. I decided to fit an Adaptive Boosting model to see whether it would increase my accuracy. Adaptive Boosting is also iterative, but it puts a heavier weight on observations that were misclassified in earlier iterations._
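
_The reweighting step can be sketched as follows. This is one round of the classic (discrete) AdaBoost update, assuming labels coded as -1/+1 and weights that sum to 1; scikit-learn's `AdaBoostClassifier` uses a SAMME variant, so treat this as an illustration of the idea rather than its exact implementation. The function name `adaboost_reweight` and the data are made up._

```python
import math

def adaboost_reweight(weights, y_true, y_pred):
    """One round of the classic (discrete) AdaBoost weight update.

    Assumes labels are -1 or +1, weights sum to 1, and the weighted
    error is strictly between 0 and 1. Returns the learner weight alpha
    and the renormalized observation weights.
    """
    # Weighted error: total weight of the misclassified observations
    error = sum(w for w, y, p in zip(weights, y_true, y_pred) if y != p)
    alpha = 0.5 * math.log((1 - error) / error)
    # Correct predictions (y * p = +1) shrink, mistakes (y * p = -1) grow
    updated = [w * math.exp(-alpha * y * p)
               for w, y, p in zip(weights, y_true, y_pred)]
    total = sum(updated)
    return alpha, [w / total for w in updated]

y_true = [1, 1, -1, -1, 1]
y_pred = [1, 1, -1, -1, -1]   # the weak learner misses the last observation
weights = [0.2] * 5
alpha, new_weights = adaboost_reweight(weights, y_true, y_pred)
```

_In this toy case the single misclassified observation ends up carrying half of the total weight, so the next weak learner is pushed to get it right._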

In [30]:
ada = AdaBoostClassifier(random_state=42)
params = {
    'n_estimators' : [40,50,60]
}
grid1 = GridSearchCV(ada, param_grid = params, cv=5)
grid1.fit(X_train_svd_df, y_train)
print(grid1.best_score_)
grid1.best_params_

0.617169373549884


{'n_estimators': 50}

In [31]:
grid1.score(X_test_svd_df, y_test)

0.5902777777777778

_Adaptive Boosting does not improve on the Gradient Boosting accuracy in this case: it scores about 59% on the test set._

_I achieved the highest accuracy with Gradient Boosting, at 65%. Admittedly, this was not as high a score as I wanted to achieve. However, I realize that the subtle difference between a Joke and an AntiJoke is something that even we as humans have difficulty explaining or understanding at times. The 65% accuracy score tells me that there is definitely some signal that machine learning is picking up, which is a hopeful sign for future endeavors. Future steps would be to become more well-versed in the academic literature concerning AI and humor, as well as to look more deeply into NLP as a whole._