# Phase III: Preprocessing & Model Selection

To find the best modeling method for the question at hand, I fit four models to the training data, using `GridSearchCV` to tune the hyperparameters of each:

- [Logistic Regression](#Logistic-Regression)
- [Random Forest Classifier](#Random-Forest)
- [Multinomial Naive Bayes](#Naive-Bayes)
- [Bagging Classifier](#Bagging-Classifier)

I decided to use only the text data from the **'title'** column of the data set as the X variable. I chose to model the data this way so that we could easily interpret the performance of each model: limiting the predictors to a single text feature forces the model to make predictions based solely on text content. Since the goal here is to see how well a computer can distinguish between substantiated and unsubstantiated claims, I felt it appropriate to isolate text as the sole predictor of class.

Logistic Regression and Random Forest had the strongest initial performance, so I decided to proceed with these two models.

---

### Importing Packages & Libraries:

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier
from sklearn.pipeline import Pipeline
import warnings
warnings.filterwarnings('ignore')



### Reading in the Data:

In [3]:
df = pd.read_csv('./Datasets/Final_Reddit_Dataset_CLEAN.csv')
df.shape

(1916, 9)

---

### Preprocessing & Gridsearching Hyperparameters

Using pipelines in conjunction with `GridSearchCV` allows for easy preprocessing, model tuning, and model fitting, all in the same step. After setting the predictor and target variables, running a train/test split, and instantiating the vectorizers and models, I built out pipelines for each of the following:

- Logistic Regression
- Random Forest Classifier
- Multinomial Naive Bayes
- Bagging Classifier

I fit each model to the training data twice, once using the Count Vectorizer and once using the TF-IDF vectorizer. I then manually adjusted the parameter grid for each pipeline until the train and test scores stopped improving. Certain parameter values below have been commented out; these are settings that were tested but that ultimately did not improve the score.
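
As a rough illustration of how the two vectorizers differ, the sketch below runs both on three hypothetical titles (invented for this example, not drawn from the Reddit data set). It compares a word that appears in every title with a word that appears in only one.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical titles, not taken from the Reddit data set.
titles = [
    "new study finds coffee improves memory",
    "coffee secretly controls your memory",
    "study finds sleep improves memory",
]

count_vec = CountVectorizer()
tfidf_vec = TfidfVectorizer()

counts = count_vec.fit_transform(titles)   # raw term counts per title
weights = tfidf_vec.fit_transform(titles)  # counts reweighted by inverse document frequency

# "memory" appears in every title; "secretly" appears in one title only.
memory_count = counts[1, count_vec.vocabulary_["memory"]]
secretly_count = counts[1, count_vec.vocabulary_["secretly"]]
memory_weight = weights[1, tfidf_vec.vocabulary_["memory"]]
secretly_weight = weights[1, tfidf_vec.vocabulary_["secretly"]]

# Both words occur once in the second title, so their raw counts match,
# but TF-IDF gives the rarer word the larger weight.
print(memory_count, secretly_count)
print(memory_weight, secretly_weight)
```

With raw counts the two words look identical in the second title, while TF-IDF down-weights the word shared by every title. That difference is one reason the same classifier can score differently under each vectorizer.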

In [4]:
X = df['title']
y = df['subreddit_class']

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)
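
To show what `stratify=y` does in the split above, here is a minimal sketch on a hypothetical imbalanced target (the `X_demo` and `y_demo` names and values are made up for illustration). It checks that the class proportions stay roughly the same in both halves.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 70 rows of class 0, 30 rows of class 1.
y_demo = pd.Series([0] * 70 + [1] * 30)
X_demo = pd.Series([f"title {i}" for i in range(100)])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=42
)

# With no test_size given, scikit-learn holds out 25% of the rows.
print(len(y_tr), len(y_te))

# The share of class 1 stays close to 0.30 in both splits.
print(y_tr.mean(), y_te.mean())
```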

In [6]:
cv = CountVectorizer()
tf = TfidfVectorizer()

lr = LogisticRegression(random_state=42)
rf = RandomForestClassifier(random_state=42)
nb = MultinomialNB()
bg = BaggingClassifier()
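
The parameter grids in the sections that follow use keys such as `cv__ngram_range` and `lr__C`. A minimal sketch of that naming convention, on a hypothetical toy corpus (the texts, labels, and `demo_` names are assumptions made for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus; the labels are made up for illustration.
texts = ["good claim", "bad rumor", "good study", "bad hoax"] * 5
labels = [1, 0, 1, 0] * 5

demo_pipe = Pipeline([
    ("cv", CountVectorizer()),
    ("lr", LogisticRegression(random_state=42)),
])

# Keys follow the pattern <step name>__<parameter name>, so each value
# is routed to the matching step of the pipeline.
demo_params = {
    "cv__ngram_range": [(1, 1), (1, 2)],
    "lr__C": [0.5, 1.0],
}

demo_gs = GridSearchCV(demo_pipe, param_grid=demo_params, cv=3)
demo_gs.fit(texts, labels)

# 2 n-gram ranges x 2 values of C = 4 candidate settings.
n_candidates = len(demo_gs.cv_results_["params"])
print(n_candidates)
print(demo_gs.best_params_)
```

Because the vectorizer sits inside the pipeline, it is refit on the training portion of every cross-validation fold, so no vocabulary leaks in from the held-out fold.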

### Logistic Regression

#### Gridsearching with CountVectorizer

In [6]:
pipe = Pipeline([
    ('cv', cv),
    ('lr', lr)
])

params={
    'cv__stop_words': [None, 'english'],
    'cv__max_features': [None, 4000, 5000, 6000], #3000
    'cv__ngram_range': [(1, 3), (1, 4), (1, 5), (1, 6)], #(1, 1), (1, 2), (1, 3)
    #'cv__max_df': [2, 5, 7],
    #'cv__min_df': [1, 2, 3],
    'lr__class_weight': [None], # {1: .45}
    'lr__penalty': ['l2'], # 'l1'
    'lr__C': [.5] # 1.0, .01
}
gs_lr_cv = GridSearchCV(pipe, param_grid=params, return_train_score=True)
gs_lr_cv.fit(X_train, y_train);

print('GridSearch Best Score:', gs_lr_cv.best_score_)
print('GridSearch Best Params:',gs_lr_cv.best_params_)
print()
print('Train Score:', gs_lr_cv.score(X_train, y_train))
print('Test Score:', gs_lr_cv.score(X_test, y_test))

GridSearch Best Score: 0.7049408489909533
GridSearch Best Params: {'cv__max_features': 5000, 'cv__ngram_range': (1, 5), 'cv__stop_words': None, 'lr__C': 0.5, 'lr__class_weight': None, 'lr__penalty': 'l2'}

Train Score: 0.9693806541405706
Test Score: 0.6993736951983298
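
The gap between the train score (about 0.97) and the test score (about 0.70) suggests overfitting. Since the searches here pass `return_train_score=True`, one way to inspect that gap for every candidate setting is through `cv_results_`. The sketch below assumes hypothetical toy data in place of `X_train` and `y_train`, and the `gap` column is a name introduced for this example.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for X_train / y_train.
texts = ["good claim", "bad rumor", "good study", "bad hoax"] * 5
labels = [1, 0, 1, 0] * 5

demo_gs = GridSearchCV(
    Pipeline([("cv", CountVectorizer()), ("lr", LogisticRegression())]),
    param_grid={"lr__C": [0.01, 0.5, 1.0]},
    cv=3,
    return_train_score=True,  # adds the mean_train_score column used below
)
demo_gs.fit(texts, labels)

# One row per candidate setting, with cross-validated train and validation scores.
results = pd.DataFrame(demo_gs.cv_results_)
results["gap"] = results["mean_train_score"] - results["mean_test_score"]
print(results[["param_lr__C", "mean_train_score", "mean_test_score", "gap"]])
```

Note that `mean_test_score` in `cv_results_` refers to the held-out folds inside the training data, not to the separate `X_test` split.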


#### Gridsearching with TF-IDF

In [7]:
pipe = Pipeline([
    ('tf', tf),
    ('lr', lr)
])

params={
    'tf__stop_words': [None, 'english'],
    'tf__max_features': [None, 2000, 3000, 4000], # 5000
    'tf__ngram_range': [(1, 1), (1, 2), (1, 3)], # (1, 5), (1, 4)
    #'tf__max_df': [2, 5, 7],
    #'tf__min_df': [1, 2, 3],
    #'tf__smooth_idf': [True, False],
    'lr__class_weight': [None, {1: .45}],
    'lr__penalty': ['l2', 'l1'],
    'lr__C': [1.0, .5, .01]
}
gs_lr_tf = GridSearchCV(pipe, param_grid=params, return_train_score=True)
gs_lr_tf.fit(X_train, y_train)

print('GridSearch Best Score:', gs_lr_tf.best_score_)
print('GridSearch Best Params:',gs_lr_tf.best_params_)
print()
print('Train Score:', gs_lr_tf.score(X_train, y_train))
print('Test Score:', gs_lr_tf.score(X_test, y_test))

GridSearch Best Score: 0.7042449547668754
GridSearch Best Params: {'lr__C': 0.5, 'lr__class_weight': None, 'lr__penalty': 'l2', 'tf__max_features': None, 'tf__ngram_range': (1, 2), 'tf__stop_words': None}

Train Score: 0.9227557411273486
Test Score: 0.7265135699373695


---

### Random Forest



#### Gridsearching with CountVectorizer

In [8]:
pipe = Pipeline([
    ('cv', cv),
    ('rf', rf)
])

params = {
    'cv__stop_words': [None, 'english'],
    'cv__max_features': [5000, 6000], # 3000, None, 4000, 
    'cv__ngram_range': [(1, 5)], #(1, 1), (1, 2), (1, 3), (1, 4), (1, 5), (1, 6)
    #'cv__max_df': [2, 5, 7],
    #'cv__min_df': [1, 2, 3],
    'rf__n_estimators': [150, 250], # 10, 50, 100,
    'rf__criterion': ['entropy'], # 'gini'
    'rf__max_depth': [None], # 2, 4, 1, 3
    'rf__max_features': ['auto'] # , 1.0, .5
}

gs_rf_cv = GridSearchCV(pipe, param_grid=params, return_train_score=True)
gs_rf_cv.fit(X_train, y_train)

print('GridSearch Best Score:', gs_rf_cv.best_score_)
print('GridSearch Best Params:',gs_rf_cv.best_params_)
print()
print('Train Score:', gs_rf_cv.score(X_train, y_train))
print('Test Score:', gs_rf_cv.score(X_test, y_test))

GridSearch Best Score: 0.7021572720946416
GridSearch Best Params: {'cv__max_features': 5000, 'cv__ngram_range': (1, 5), 'cv__stop_words': None, 'rf__criterion': 'entropy', 'rf__max_depth': None, 'rf__max_features': 'auto', 'rf__n_estimators': 150}

Train Score: 0.9937369519832986
Test Score: 0.7202505219206681


#### Gridsearching with TF-IDF

In [9]:
pipe = Pipeline([
    ('tf', tf),
    ('rf', rf)
])

params={
    'tf__stop_words': [None, 'english'],
    'tf__max_features': [3000, 4000, 5000],
    'tf__ngram_range': [(1, 2), (1, 3), (1, 4)], # (1, 1), (1, 5)
    'tf__smooth_idf': [True, False],
    'rf__n_estimators': [10, 50], # 100
    'rf__criterion': ['gini'], # 'entropy'
    'rf__max_depth': [None, 1, 3], # 2, 4
    'rf__max_features': ['auto'] #1.0, .5
}
gs_rf_tf = GridSearchCV(pipe, param_grid=params, return_train_score=True)
gs_rf_tf.fit(X_train, y_train)

print('GridSearch Best Score:', gs_rf_tf.best_score_)
print('GridSearch Best Params:',gs_rf_tf.best_params_)
print()
print('Train Score:', gs_rf_tf.score(X_train, y_train))
print('Test Score:', gs_rf_tf.score(X_test, y_test))

GridSearch Best Score: 0.6896311760612387
GridSearch Best Params: {'rf__criterion': 'gini', 'rf__max_depth': None, 'rf__max_features': 'auto', 'rf__n_estimators': 50, 'tf__max_features': 4000, 'tf__ngram_range': (1, 3), 'tf__smooth_idf': True, 'tf__stop_words': None}

Train Score: 0.9930410577592206
Test Score: 0.7202505219206681


---

### Naive Bayes

#### Gridsearching with CountVectorizer

In [10]:
pipe = Pipeline([
    ('cv', cv),
    ('nb', nb)
])

params={
    'cv__stop_words': [None, 'english'],
    'cv__max_features': [None], # 3000, 4000, 5000, 6000
    'cv__ngram_range': [(1, 1), (1, 2), (1, 3), (1, 4)], # (1, 5), (1, 6), (1, 7)
    #'cv__max_df': [5, 7], # 2
    #'cv__min_df': [1, 2, 3],
    'nb__alpha': [0], # 2, 5, .5, 1
    'nb__fit_prior': [True, False]
}
gs_nb_cv = GridSearchCV(pipe, param_grid=params, return_train_score=True)
gs_nb_cv.fit(X_train, y_train)

print('GridSearch Best Score:', gs_nb_cv.best_score_)
print('GridSearch Best Params:',gs_nb_cv.best_params_)
print()
print('Train Score:', gs_nb_cv.score(X_train, y_train))
print('Test Score:', gs_nb_cv.score(X_test, y_test))

GridSearch Best Score: 0.6638830897703549
GridSearch Best Params: {'cv__max_features': None, 'cv__ngram_range': (1, 3), 'cv__stop_words': None, 'nb__alpha': 0, 'nb__fit_prior': True}

Train Score: 0.9916492693110647
Test Score: 0.6638830897703549
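
The `nb__alpha` setting in the grid above is the additive smoothing term of Multinomial Naive Bayes. For reference, the scikit-learn documentation gives the smoothed estimate of the probability of feature $i$ in class $y$ as:

$$
\hat{\theta}_{yi} = \frac{N_{yi} + \alpha}{N_{y} + \alpha n}
$$

Here $N_{yi}$ is the number of times feature $i$ appears in training samples of class $y$, $N_{y}$ is the total count of all features for class $y$, and $n$ is the number of features. With $\alpha = 0$ the estimate is unsmoothed, so an n-gram never seen in a class gets an estimated probability of zero for that class. This may contribute to the large gap between the train and test scores here.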


#### Gridsearching with TF-IDF

In [11]:
pipe = Pipeline([
    ('tf', tf),
    ('nb', nb)
])

params={
    'tf__stop_words': [None, 'english'], # 
    'tf__max_features': [None, 2000, 3000, 4000], # 5000
    'tf__ngram_range': [(1, 2), (1, 3), (1, 4)], # (1, 1), (1, 5)
    'tf__smooth_idf': [True, False],
    #'nb__alpha': [1, 2],
    #'nb__fit_prior': [True, False]
}
gs_nb_tf = GridSearchCV(pipe, param_grid=params, return_train_score=True)
gs_nb_tf.fit(X_train, y_train)

print('GridSearch Best Score:', gs_nb_tf.best_score_)
print('GridSearch Best Params:',gs_nb_tf.best_params_)
print()
print('Train Score:', gs_nb_tf.score(X_train, y_train))
print('Test Score:', gs_nb_tf.score(X_test, y_test))

GridSearch Best Score: 0.6868475991649269
GridSearch Best Params: {'tf__max_features': None, 'tf__ngram_range': (1, 2), 'tf__smooth_idf': True, 'tf__stop_words': None}

Train Score: 0.9909533750869868
Test Score: 0.7118997912317327


---
### Bagging Classifier

#### Gridsearching with CountVectorizer

In [12]:
pipe = Pipeline([
    ('cv', cv),
    ('bg', bg)
])

params={
    'cv__stop_words': [None], #
    'cv__max_features': [5000], # 3000, None, 4000, 6000
    'cv__ngram_range': [(1, 4), (1, 5)], # (1, 1), (1, 2), (1, 3), (1, 6)
    #'cv__max_df': [2, 5, 7],
    #'cv__min_df': [1, 2, 3],
    'bg__n_estimators': [50, 75], # 5, 10
    'bg__bootstrap_features': [True] # False
}
gs_bg_cv = GridSearchCV(pipe, param_grid=params, return_train_score=True)
gs_bg_cv.fit(X_train, y_train);

print('GridSearch Best Score:', gs_bg_cv.best_score_)
print('GridSearch Best Params:',gs_bg_cv.best_params_)
print()
print('Train Score:', gs_bg_cv.score(X_train, y_train))
print('Test Score:', gs_bg_cv.score(X_test, y_test))

GridSearch Best Score: 0.6965901183020181
GridSearch Best Params: {'bg__bootstrap_features': True, 'bg__n_estimators': 50, 'cv__max_features': 5000, 'cv__ngram_range': (1, 5), 'cv__stop_words': None}

Train Score: 0.9930410577592206
Test Score: 0.7014613778705637


#### Gridsearching with TF-IDF

In [11]:
pipe = Pipeline([
    ('tf', tf),
    ('bg', bg)
])

params={
    'tf__stop_words': [None, 'english'],
    'tf__max_features': [None, 3000, 4000, 5000], # 2000
    'tf__ngram_range': [(1, 1), (1, 2), (1, 3)], # (1, 5), (1, 4)
    'bg__n_estimators': [5, 10, 50],
    'bg__bootstrap_features': [True, False]
}
gs_bg_tf = GridSearchCV(pipe, param_grid=params, return_train_score=True)
gs_bg_tf.fit(X_train, y_train)

print('GridSearch Best Score:', gs_bg_tf.best_score_)
print('GridSearch Best Params:',gs_bg_tf.best_params_)
print()
print('Train Score:', gs_bg_tf.score(X_train, y_train))
print('Test Score:', gs_bg_tf.score(X_test, y_test))

GridSearch Best Score: 0.6805845511482255
GridSearch Best Params: {'bg__bootstrap_features': True, 'bg__n_estimators': 50, 'tf__max_features': 4000, 'tf__ngram_range': (1, 1), 'tf__stop_words': None}

Train Score: 0.9937369519832986
Test Score: 0.7160751565762005
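
To compare the eight runs side by side, the scores printed above can be collected into one table. This is only a convenience sketch: the numbers are copied from the cell outputs and rounded to four decimals, and the column names are chosen for this example.

```python
import pandas as pd

# Scores copied from the cell outputs above, rounded to four decimals.
scores = pd.DataFrame(
    [
        ("Logistic Regression", "CountVectorizer", 0.7049, 0.9694, 0.6994),
        ("Logistic Regression", "TF-IDF", 0.7042, 0.9228, 0.7265),
        ("Random Forest", "CountVectorizer", 0.7022, 0.9937, 0.7203),
        ("Random Forest", "TF-IDF", 0.6896, 0.9930, 0.7203),
        ("Multinomial Naive Bayes", "CountVectorizer", 0.6639, 0.9916, 0.6639),
        ("Multinomial Naive Bayes", "TF-IDF", 0.6868, 0.9910, 0.7119),
        ("Bagging Classifier", "CountVectorizer", 0.6966, 0.9930, 0.7015),
        ("Bagging Classifier", "TF-IDF", 0.6806, 0.9937, 0.7161),
    ],
    columns=["model", "vectorizer", "cv_score", "train_score", "test_score"],
)

# A large train/test gap is a sign of overfitting.
scores["train_test_gap"] = scores["train_score"] - scores["test_score"]

ranked = scores.sort_values("test_score", ascending=False).reset_index(drop=True)
print(ranked)
```

Ranked by test score, Logistic Regression with TF-IDF comes first and the two Random Forest runs follow, which matches the decision to proceed with those two models. Every run shows a train score well above its test score.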
