# Modeling with Random Forests

_This notebook uses Random Forests to classify the Reddit posts into the correct subreddit. The joined data is first split into train and test sets, then vectorized with TF-IDF and reduced with Latent Semantic Analysis (truncated SVD), and finally passed to the Random Forests model._

In [1]:
# Importing necessary libraries
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer

import matplotlib.pyplot as plt
%matplotlib inline

## Joining, Preprocessing the Data

In [2]:
# Reading the dataset
join = pd.read_csv('./datasets/joined.csv')
join.head()

# Assigning target
target = join['subreddit']

# Dropping selftext and title columns
df = join.drop(['selftext','title'], axis=1)

# Importing nltk's English stop words
# Note: Using camelCase to prevent overwriting a variable later
stopWords = stopwords.words('english')

# From the EDA, I concluded that the following words were very common 
# in both /r/Jokes and /r/AntiJokes and were added noise. 
# I added these words to the nltk stop words
stopWords.extend(['wa', 'say', 'said', 'did', 'like', 'asked', 'woman', 
                  'don', 'know', 'year', 'wife', 'good', 'want', 'got', 
                  'ha', 'people', 'make', 'tell', 'didn', 'joke', 'x200b', 
                  'way', 'think', 'walk', 'll', 'home', 't'])

# Tokenizing by alphanumeric characters
tokenizer = RegexpTokenizer(r'\w+')

# Making all tokens lowercase
tokens = [tokenizer.tokenize(post.lower()) for post in (df['joined'])]

# Initializing lemmatizer
lemmatizer = WordNetLemmatizer()

# Lemmatizing each word, then rejoining the words
# of each post into one string
lems = [" ".join(lemmatizer.lemmatize(word) for word in post)
        for post in tokens]

# Adding the lemmatized data back to the DataFrame
join['text'] = lems
join['text'] = join['text'].astype(str)

# Dropping unnecessary columns
join=join.drop(['selftext','title','joined'],axis=1)
join.shape

(1725, 2)

_Adding columns for word count and character count. Based on the EDA, I expect these columns to add signal to the model._

In [3]:
join['char_count'] = join['text'].map(len)
join['word_count'] = join['text'].map(lambda x: len(x.split()))

In [4]:
join.head()

| | subreddit | text | char_count | word_count |
|---|---|---|---|---|
| 0 | 0 | husband wa screwing his secretary up the as wh... | 153 | 34 |
| 1 | 0 | why doe batman wear dark clothing batman doesn... | 132 | 26 |
| 2 | 0 | a man is in court the judge say on the 3rd aug... | 1156 | 235 |
| 3 | 0 | a poor old lady wa forced to sell her valuable... | 1540 | 299 |
| 4 | 0 | how do you get a nun pregnant dress her up a a... | 56 | 14 |


## Train-Test-Splitting the Data

In [5]:
X = join[['text','char_count','word_count']]
y = join['subreddit']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [6]:
X_train.shape

(1293, 3)
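
_A minimal sketch of what the split above does, using a stand-in DataFrame rather than the real data. With no `test_size` given, `train_test_split` holds out 25% of the rows, which is how 1,725 posts become 1,293 training rows. The `stratify` argument shown at the end is not used in this notebook; it is an optional way to keep the class proportions the same in both splits. The toy labels and variable names are assumptions for illustration._

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in data with the same number of rows as the joined dataset (hypothetical values)
n_rows = 1725
toy_X = pd.DataFrame({'word_count': np.arange(n_rows)})
toy_y = pd.Series(np.arange(n_rows) % 2, name='subreddit')  # alternating 0/1 labels

# Default test_size is 0.25, so 75% of the rows go to training
tr_X, te_X, tr_y, te_y = train_test_split(toy_X, toy_y, random_state=42)
print(tr_X.shape, te_X.shape)

# Optional: stratify on the target so both splits keep the same class balance
s_tr_X, s_te_X, s_tr_y, s_te_y = train_test_split(
    toy_X, toy_y, random_state=42, stratify=toy_y
)
print(round(s_tr_y.mean(), 3), round(s_te_y.mean(), 3))
```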

## Using TFIDF to Vectorize the Data

In [7]:
# Initializing the TFIDF Vectorizer
tfidf = TfidfVectorizer(stop_words=stopWords, min_df=1, max_df=1.0)

# Fitting and transforming the data
X_train_tf = tfidf.fit_transform(X_train['text'])
X_test_tf = tfidf.transform(X_test['text'])

In [8]:
X_train_tf.shape

(1293, 5523)
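
_The vectorizer above is fit on the training text only and then reused on the test text, so the vocabulary and IDF weights never see the test set. A small sketch of that pattern on a made-up corpus (the sentences are hypothetical, and the default `TfidfVectorizer` settings are used instead of the custom stop word list):_

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical posts, not taken from the dataset)
train_docs = ["why did the chicken cross the road",
              "a man walks into a bar"]
test_docs = ["the chicken walks home"]

toy_tfidf = TfidfVectorizer()
train_matrix = toy_tfidf.fit_transform(train_docs)  # learns vocabulary and IDF weights
test_matrix = toy_tfidf.transform(test_docs)        # reuses them; unseen words are ignored

print(sorted(toy_tfidf.vocabulary_))
print(train_matrix.shape, test_matrix.shape)
```

_Both matrices have the same number of columns, one per word learned from the training text. The word "home" appears only in the test document, so it has no column._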

## Using SVD for Dimensionality Reduction

In [9]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
X_train_tf_df = pd.DataFrame(X_train_tf.toarray(), columns=tfidf.get_feature_names_out())
X_test_tf_df = pd.DataFrame(X_test_tf.toarray(), columns=tfidf.get_feature_names_out())

In [10]:
# Initializing the SVD model
# From the LSA notebook, it was determined that 346 components
# explain about 75% of the variance
SVD = TruncatedSVD(n_components=346)
X_train_svd = SVD.fit_transform(X_train_tf_df)
X_test_svd = SVD.transform(X_test_tf_df)
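
_The LSA notebook is not reproduced here, so the following is only a sketch of one way a component count for a target explained variance could be found: fit `TruncatedSVD`, take the cumulative sum of `explained_variance_ratio_`, and pick the first position that reaches the threshold. The helper `components_for_variance` and the random matrix are hypothetical stand-ins, not the author's code or data._

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

def components_for_variance(cumulative, threshold):
    """Smallest number of components whose cumulative explained variance
    reaches `threshold`, or None if the threshold is never reached."""
    cumulative = np.asarray(cumulative)
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return k if k <= len(cumulative) else None

# Random sparse matrix standing in for the TF-IDF matrix (hypothetical data)
toy_matrix = sparse_random(200, 50, density=0.1, random_state=42, format='csr')

toy_svd = TruncatedSVD(n_components=40, random_state=42)
toy_svd.fit(toy_matrix)
cumulative = np.cumsum(toy_svd.explained_variance_ratio_)

print(components_for_variance(cumulative, 0.75))
```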

In [11]:
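# Wrapping the SVD output in DataFrames so that the
# word_count and char_count features can be added back in.
# String column names avoid mixing int and str names later on,
# which recent scikit-learn versions reject
svd_cols = [f'svd_{i}' for i in range(X_train_svd.shape[1])]
X_train_svd_df = pd.DataFrame(X_train_svd, columns=svd_cols)
X_test_svd_df = pd.DataFrame(X_test_svd, columns=svd_cols)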

In [12]:
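# Resetting the index of the original X_train and X_test data.
# train_test_split keeps the shuffled original row labels, while the SVD
# DataFrames are indexed 0..n-1. Column assignment aligns on the index,
# so without the reset the added columns would be misaligned
X_train = X_train.reset_index(drop=True)
X_test = X_test.reset_index(drop=True)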
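
_The index issue mentioned in the comments above comes from pandas aligning on row labels when a Series is assigned as a new column. A small sketch with made-up values shows the failure and the fix (all names here are hypothetical):_

```python
import pandas as pd

# SVD output wrapped in a DataFrame gets a fresh 0..n-1 index
toy_svd_df = pd.DataFrame({'svd_0': [0.1, 0.2, 0.3]})

# Rows coming out of train_test_split keep their shuffled original labels
toy_counts = pd.Series([10, 20, 30], index=[7, 9, 5], name='word_count')

misaligned = toy_svd_df.copy()
misaligned['word_count'] = toy_counts  # no labels match, so every value is NaN

aligned = toy_svd_df.copy()
aligned['word_count'] = toy_counts.reset_index(drop=True)  # labels are now 0..n-1

print(misaligned['word_count'].isna().sum())
print(aligned['word_count'].tolist())
```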

In [13]:
# Adding word_count and char_count back into the train and test data
X_train_svd_df['word_count'] = X_train['word_count']
X_train_svd_df['char_count'] = X_train['char_count']
X_test_svd_df['word_count'] = X_test['word_count']
X_test_svd_df['char_count'] = X_test['char_count']

## Using the Random Forests Model

_Random Forests aggregate bagged decision trees and additionally restrict each split to a random subset of the features, forcing each tree to make decisions based on a fraction of the available information. Aggregating many such trees results in a model that has low variance and high accuracy._

In [20]:
rf = RandomForestClassifier(random_state=42)
params = {
    'n_estimators' : [175, 200, 225],
    'max_depth' : [None],
    # 'sqrt' is the classifier default (formerly named 'auto', since removed);
    # 1.0 uses all features, a quick way to test a bagging classifier
    'max_features' : ['sqrt', 1.0],
    'min_samples_split' : [2,3]
}
grid = GridSearchCV(rf, param_grid = params, cv=5)
grid.fit(X_train_svd_df, y_train)
print(grid.best_score_)
grid.best_params_

0.6767208043310131


{'max_depth': None,
 'max_features': 1.0,
 'min_samples_split': 2,
 'n_estimators': 200}

_With GridSearch and Random Forests, the best cross-validated accuracy is about 68%. I decided to fit an Extra Trees model to see if it increases the accuracy. Extra Trees builds on Random Forests but adds another layer of randomness: the features considered at each split are randomly selected, and the values used to divide them are also drawn at random._
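
_To make the difference between the variants concrete, here is a rough sketch that cross-validates bagged trees (`max_features=1.0`), a Random Forest (`max_features='sqrt'`), and Extra Trees on synthetic data. The data from `make_classification` and the small `n_estimators` are assumptions chosen to keep the example quick; the scores say nothing about the Reddit data._

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic binary classification data (hypothetical, stands in for the SVD features)
toy_X, toy_y = make_classification(n_samples=300, n_features=20,
                                   n_informative=5, random_state=42)

models = {
    # All features considered at every split: effectively bagged decision trees
    'bagged trees': RandomForestClassifier(n_estimators=50, max_features=1.0,
                                           random_state=42),
    # Random subset of sqrt(n_features) candidates at every split
    'random forest': RandomForestClassifier(n_estimators=50, max_features='sqrt',
                                            random_state=42),
    # Random feature subset and randomly drawn split thresholds
    'extra trees': ExtraTreesClassifier(n_estimators=50, max_features='sqrt',
                                        random_state=42),
}

# Mean 5-fold cross-validated accuracy for each model
cv_scores = {name: cross_val_score(model, toy_X, toy_y, cv=5).mean()
             for name, model in models.items()}
for name, score in cv_scores.items():
    print(f'{name}: {score:.3f}')
```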

In [18]:
from sklearn.ensemble import ExtraTreesClassifier
et = ExtraTreesClassifier(random_state=42)
params = {
    'n_estimators' : [100, 150, 200],
    'max_depth' : [None,3,4],
    # 'sqrt' is the classifier default (formerly named 'auto', since removed);
    # 1.0 uses all features, a quick way to test a bagging classifier
    'max_features' : ['sqrt', 1.0],
    'min_samples_split' : [2,3,4]
}
grid1 = GridSearchCV(et, param_grid = params, cv=5)
grid1.fit(X_train_svd_df, y_train)
print(grid1.best_score_)
grid1.best_params_

0.6790409899458624


{'max_depth': None,
 'max_features': 1.0,
 'min_samples_split': 2,
 'n_estimators': 200}

In [19]:
grid1.score(X_test_svd_df, y_test)

0.6412037037037037

_Extra Trees doesn't seem to improve on Random Forests in this case: both reach a cross-validated accuracy of about 68%, and the Extra Trees model scores about 64% on the test set._
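
_One way to put a test accuracy in context is to compare it with the accuracy of always predicting the most frequent class. The class balance of the real data is not shown in this notebook, so the labels below are made up; this is only a sketch of how such a majority-class baseline could be computed with `DummyClassifier`._

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels: 60% class 0, 40% class 1
toy_y_train = np.array([0] * 60 + [1] * 40)
toy_y_test = np.array([0] * 12 + [1] * 8)

# The dummy model ignores the features, so a column of zeros is enough
toy_X_train = np.zeros((len(toy_y_train), 1))
toy_X_test = np.zeros((len(toy_y_test), 1))

# Always predicts the most frequent class seen during fit
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(toy_X_train, toy_y_train)
baseline_accuracy = baseline.score(toy_X_test, toy_y_test)
print(baseline_accuracy)
```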

_In this notebook, I took the cleaned and preprocessed data and applied Random Forests and Extra Trees models to try to build a high-accuracy model. The best cross-validated accuracy was about 68%, and the Extra Trees model scored about 64% on the test set. In the next notebook, I will apply Gradient Boosting to try to achieve a higher accuracy score._