# Preprocessing and Hyperparameter Tuning

This notebook prepares our dataset for hyperparameter tuning. After splitting the data into training and testing sets, we clean the text and build count-vectorized and TF-IDF features. Using GridSearchCV, we compare several classifiers and their hyperparameters to determine which will be best for classification.

In [27]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier, ExtraTreeClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search was removed in scikit-learn 0.20
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text  import CountVectorizer, TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from nltk import RegexpTokenizer, WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
import pickle
import re

### Import data and train-test-split

In [2]:
#read in data
df = pd.read_csv('../data/data_final.csv')

In [3]:
df.head()

Unnamed: 0,text,target
0,/r/BravoRealHousewives daily OT thread. Today ...,1
1,The Real Housewives of New Jersey S09E07 - Bru...,1
2,If we could pool our money and hire MKE to do ...,1
3,Gotta pay for that wedding somehow but holy Fa...,1
4,RHONJ Season 9 Midseason Trailer,1


In [4]:
df.tail()

Unnamed: 0,text,target
3599,His fresh fade has evolved.,0
3600,Ron Baker The Virginity Taker,0
3601,Ahhh the infamous karma whore. Probably hops i...,0
3602,Fuck Pacers have McDermott... RIP raptors,0
3603,I swear to God --- That'd be epic af!,0


In [5]:
df['target'].value_counts(normalize=True)

1    0.514151
0    0.485849
Name: target, dtype: float64

 - Our classes are split close to 50-50, so we do not need to worry about oversampling or undersampling.
 - The baseline score for our classification model is 51.4%, the accuracy of always predicting the majority class.
 - 1 = BravoRealHousewives
 - 0 = NBA
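
The baseline is simply the share of the majority class, i.e. the accuracy of a model that always predicts the most common label. A minimal sketch with made-up labels (the `toy_labels` list is illustrative, not our dataset):

```python
from collections import Counter

# Hypothetical binary labels for illustration only
toy_labels = [1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1]

counts = Counter(toy_labels)
majority_class, majority_count = counts.most_common(1)[0]

# Accuracy of always predicting the majority class
baseline_accuracy = majority_count / len(toy_labels)
print(majority_class, round(baseline_accuracy, 3))
```

Any model we tune should beat this number to be worth keeping.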

In [6]:
#train test split
X = df[['text']]
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = y, random_state=42)
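
Passing `stratify=y` keeps the class proportions roughly equal in both splits. The sketch below checks this on synthetic labels (the 52/48 toy data and the `*_toy` names are assumptions for illustration; `train_test_split` holds out 25% of the rows by default):

```python
from sklearn.model_selection import train_test_split

# Synthetic, illustrative data: 52 positives and 48 negatives
y_toy = [1] * 52 + [0] * 48
X_toy = list(range(len(y_toy)))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, stratify=y_toy, random_state=42
)

# Share of the positive class in each split
train_share = sum(y_tr) / len(y_tr)
test_share = sum(y_te) / len(y_te)
print(len(y_tr), len(y_te), round(train_share, 2), round(test_share, 2))
```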

### Instantiate lemmatizer, tokenizer, list of stop words, and a function to clean our text data.

In [7]:
#instantiate lemmatizer, tokenizer, and stemmer
lemmatizer = WordNetLemmatizer()
tokenizer = RegexpTokenizer(r'\w+')
p_stemmer = PorterStemmer()

#create set of stopwords from NLTK and add more words
stops = set(stopwords.words('english'))
#remove meaningless tokens xb, amp, and r
more_stops = ['xb','amp','r']
for w in more_stops:
    stops.add(w)

#function to clean text
def to_words(raw_text):
    #remove links
    raw_text = re.sub(r'http\S+', '', raw_text)
    raw_text = re.sub(r'www\S+', '', raw_text)
    #remove numbers
    raw_text = re.sub(r'\d+', '', raw_text)
    #tokenize
    words = tokenizer.tokenize(raw_text.lower())
    #remove stop words and stem
    meaningful_words = [p_stemmer.stem(w) for w in words if w not in stops]
    
    return " ".join(meaningful_words)
#use our to_words function to create a list of texts for our training and testing set

# Initialize empty lists to hold the clean texts.
clean_train_text = []
clean_test_text = []

# Append clean texts to list.
for text in X_train['text']:
    clean_train_text.append(to_words(text))
for text in X_test['text']:
    clean_test_text.append(to_words(text))

 - Our text data has been stemmed and stripped of stop words, numbers, and hyperlinks. (In the end, we forgo lemmatizing.)
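
To see what each cleaning step does without installing NLTK, here is a simplified, standard-library-only sketch. It assumes a tiny stop list (`TOY_STOPS`) and skips stemming entirely, so its output is not identical to what the notebook's cleaning function produces; `clean_without_nltk` is a hypothetical name.

```python
import re

# Tiny illustrative stop list; the notebook uses NLTK's English list plus xb, amp, r
TOY_STOPS = {"for", "the", "a", "and", "xb", "amp", "r"}

def clean_without_nltk(raw_text):
    text = re.sub(r'http\S+', '', raw_text)    # remove links
    text = re.sub(r'www\S+', '', text)
    text = re.sub(r'\d+', '', text)            # remove numbers
    tokens = re.findall(r'\w+', text.lower())  # tokenize on word characters
    return " ".join(t for t in tokens if t not in TOY_STOPS)

sample = "Check https://example.com for 3 new RHONJ trailers &amp; more"
cleaned = clean_without_nltk(sample)
print(cleaned)
```

Note how the HTML entity `&amp;` leaves behind the token `amp`, which is why it is added to the stop words.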

In [8]:
clean_test_text[:5]

['rhoa offici intro new taglin releas',
 'southern charm ladi',
 'nikola jokic flex felip eichenberg denver nugget head strength coach',
 'total',
 'housew moment life chill part mayb use drama thursday last met hw moment long interestingli explain got think though hw worthi moment last year']

In [9]:
clean_train_text[:5]

['jokic realiz swaggi p team right',
 'bravorealhousew daili ot thread today novemb daili thread topic discuss',
 'nba shoud introduc hypermax contract basic instead measli cap supermax hypermax would team cap mean player could potenti receiv million year nba current million cap would team alway gonna front offic either desper enough dumb enough shell money player even mention sentenc max contract',
 'told someon attend parti like sign outsid hous establish arriv guest consent form sometim non disclosur agreement given guest complet enter parti venu',
 'umm want emili back know anti tamra']

### Tune models with CountVectorizer dataframe
- Grid searching over hyperparameters for Logistic Regression, Random Forest, AdaBoost, and Gradient Boost classification models.
- Cross-validated scoring of the Multinomial Naive Bayes classification model.
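
One alternative worth considering (it is not what this notebook does) is to wrap the vectorizer and classifier in a `Pipeline`, so that the vocabulary is re-learned inside each cross-validation fold and vectorizer settings can be tuned alongside the model. A minimal sketch on a made-up toy corpus; all names and values below are illustrative assumptions:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Made-up, already-cleaned toy texts (1 = housewives-like, 0 = basketball-like)
toy_texts = ["housew reunion drama", "reunion dress drama",
             "housew taglin season", "season drama taglin",
             "team trade player", "player contract cap",
             "team cap season", "trade contract player"]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipe = Pipeline([
    ("vec", CountVectorizer()),
    ("clf", LogisticRegression(solver="liblinear")),
])

# Step name + double underscore + parameter name addresses a step's parameter
toy_grid = {
    "vec__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0],
}

search = GridSearchCV(pipe, toy_grid, cv=2)
search.fit(toy_texts, toy_labels)
print(search.best_params_, search.best_score_)
```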

In [10]:
#create CountVectorizer Dataframe for Train and Test
cv = CountVectorizer(analyzer = "word",   
                             tokenizer = None,    
                             preprocessor = None, 
                             stop_words = None,   
                             max_features = 1500, ngram_range=(1,3)) 

train_cv = cv.fit_transform(clean_train_text)
test_cv = cv.transform(clean_test_text)

train_cv = train_cv.toarray()

test_cv = test_cv.toarray()

X_train_cv = pd.DataFrame(train_cv,columns=cv.get_feature_names(),index=y_train)
X_test_cv = pd.DataFrame(test_cv,columns=cv.get_feature_names(),index=y_test)
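
With `ngram_range=(1,3)`, every run of one, two, or three consecutive tokens becomes a candidate feature, and `max_features=1500` keeps only the most frequent of them across the corpus. The plain-Python sketch below illustrates the n-gram expansion and counting on two made-up documents; `toy_ngrams` is a hypothetical helper, not part of scikit-learn.

```python
from collections import Counter

def toy_ngrams(tokens, n_min=1, n_max=3):
    """Return all n-grams of length n_min..n_max as space-joined strings."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

toy_docs = ["year old player", "year old show"]

# Corpus-wide counts of every 1-, 2-, and 3-gram
gram_counts = Counter(g for doc in toy_docs for g in toy_ngrams(doc.split()))
print(gram_counts["year old"], gram_counts["year old player"])
```

This is why features such as `aaron fox` and `year old` appear next to single stems in the feature table.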

In [26]:
#dataframe of features and their appearance count for each observation
X_train_cv.head()

target,aaron,aaron fox,abil,abl,absolut,abus,accord,account,accus,act,...,wwhl,ye,yeah,year,year old,yesterday,yet,york,young,youtub
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [13]:
lr = LogisticRegression()
lr_params = {
    'penalty': ['l2','l1'],
    'C': [.1, .5, .7, .9, .95, .99,1],
}

lr_gs = GridSearchCV(lr, lr_params, cv=5)
lr_model_cv = lr_gs.fit(X_train_cv,y_train)
print(lr_model_cv.best_estimator_)
print('LogisticRegression best score:', lr_model_cv.best_score_)
print('LogisticRegression train score:', lr_model_cv.score(X_train_cv,y_train))
print('LogisticRegression test score:', lr_model_cv.score(X_test_cv,y_test))

LogisticRegression(C=0.95, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
LogisticRegression best score: 0.85275619681835
LogisticRegression train score: 0.9452460229374768
LogisticRegression test score: 0.8679245283018868
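
To get a feel for the cost of this search: GridSearchCV fits every combination in the grid once per fold, plus one final refit of the best combination on the full training set when `refit=True` (the default). A small sketch that enumerates the logistic regression grid used above (`lr_params_sketch` is a copy made for illustration):

```python
from itertools import product

# Copy of the logistic regression grid from the cell above
lr_params_sketch = {
    'penalty': ['l2', 'l1'],
    'C': [.1, .5, .7, .9, .95, .99, 1],
}
n_folds = 5

# Cartesian product of all parameter values
candidates = [dict(zip(lr_params_sketch, values))
              for values in product(*lr_params_sketch.values())]

n_cv_fits = len(candidates) * n_folds
print(len(candidates), n_cv_fits)  # 14 candidates, 70 cross-validation fits
```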


In [14]:
rf = RandomForestClassifier(random_state=42)

rf_params = {
    'n_estimators' : [100,125,150],
}

rf_gs = GridSearchCV(rf, rf_params, cv=5)
rf_model_cv = rf_gs.fit(X_train_cv,y_train)
print(rf_model_cv.best_estimator_)
print('RandomForest best score:', rf_model_cv.best_score_)
print('RandomForest train score:', rf_model_cv.score(X_train_cv,y_train))
print('RandomForest test score:', rf_model_cv.score(X_test_cv,y_test))

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=125, n_jobs=1,
            oob_score=False, random_state=42, verbose=0, warm_start=False)
RandomForest best score: 0.8383277839437662
RandomForest train score: 0.975212726600074
RandomForest test score: 0.8645948945615982


In [15]:
ada = AdaBoostClassifier(base_estimator=DecisionTreeClassifier())
ada_params = {
    'base_estimator__max_depth' : [1,2,3],
    'n_estimators' : [40,50,60],
}
ada_gs = GridSearchCV(ada, param_grid=ada_params, cv = 5)

ada_model_cv = ada_gs.fit(X_train_cv,y_train)
print(ada_model_cv.best_estimator_)
print('AdaBoost best score:', ada_model_cv.best_score_)
print('AdaBoost train score:', ada_model_cv.score(X_train_cv,y_train))
print('AdaBoost test score:', ada_model_cv.score(X_test_cv,y_test))

AdaBoostClassifier(algorithm='SAMME.R',
          base_estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=3,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best'),
          learning_rate=1.0, n_estimators=60, random_state=None)
AdaBoost best score: 0.7909729929707732
AdaBoost train score: 0.9041805401405846
AdaBoost test score: 0.8157602663706992
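
The grid key `base_estimator__max_depth` relies on scikit-learn's double-underscore convention for reaching into a nested estimator. Newer scikit-learn releases rename the AdaBoost argument to `estimator`, so the key becomes `estimator__max_depth` there. A version-agnostic way to discover the right key is to inspect `get_params()`; the sketch below assumes only that the tree is passed as the first positional argument:

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Pass the tree positionally so the sketch does not depend on the keyword name
ada_sketch = AdaBoostClassifier(DecisionTreeClassifier())

# Nested parameters are exposed as '<outer param>__<inner param>'
depth_keys = [name for name in ada_sketch.get_params()
              if name.endswith('__max_depth')]
print(depth_keys)
```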


In [16]:
gb = GradientBoostingClassifier()
gb_params = {
    'learning_rate':[.25,.5,.75],
}
gb_gs = GridSearchCV(gb,param_grid=gb_params, cv=3)
gb_model_cv = gb_gs.fit(X_train_cv,y_train)
print(gb_model_cv.best_estimator_)
print('GradientBoost best score:', gb_model_cv.best_score_)
print('GradientBoost train score:', gb_model_cv.score(X_train_cv,y_train))
print('GradientBoost test score:', gb_model_cv.score(X_test_cv,y_test))

GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.5, loss='deviance', max_depth=3,
              max_features=None, max_leaf_nodes=None,
              min_impurity_decrease=0.0, min_impurity_split=None,
              min_samples_leaf=1, min_samples_split=2,
              min_weight_fraction_leaf=0.0, n_estimators=100,
              presort='auto', random_state=None, subsample=1.0, verbose=0,
              warm_start=False)
GradientBoost best score: 0.82944876063633
GradientBoost train score: 0.903070662227155
GradientBoost test score: 0.8290788013318535


In [12]:
nb = MultinomialNB()

nb_model_cv = nb.fit(X_train_cv,y_train)
print('NaiveBayes Multinomial train score:',nb_model_cv.score(X_train_cv,y_train))
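#cv=3 makes explicit the default of the scikit-learn version used here (newer versions default to 5)
print('NaiveBayes Multinomial CV score:',cross_val_score(nb,X_train_cv, y_train, cv=3).mean())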
print('NaiveBayes Multinomial test score:',nb_model_cv.score(X_test_cv,y_test))

NaiveBayes Multinomial train score: 0.8945615982241953
NaiveBayes Multinomial CV score: 0.8845730141894296
NaiveBayes Multinomial test score: 0.8612652608213096


### Tune models with TF-IDF dataframe

In [17]:
#create TF-IDF Dataframe for Train and Test
tv = TfidfVectorizer(analyzer = "word",
                     tokenizer = None,
                     preprocessor = None,
                     stop_words = None, 
                     max_features = 1500, ngram_range=(1,3))

train_tv = tv.fit_transform(clean_train_text)
test_tv = tv.transform(clean_test_text)

train_tv = train_tv.toarray()
test_tv = test_tv.toarray()

X_train_tv = pd.DataFrame(train_tv,columns=tv.get_feature_names())
X_test_tv = pd.DataFrame(test_tv,columns=tv.get_feature_names())
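
For intuition about what the TF-IDF features contain, the sketch below reproduces the weighting in plain Python under TfidfVectorizer's default settings as we understand them: raw term counts, a smoothed inverse document frequency of ln((1 + n) / (1 + df)) + 1, and L2 normalisation of each row. The toy documents and helper names are illustrative assumptions.

```python
import math
from collections import Counter

toy_docs = [["team", "trade", "player"], ["team", "drama"], ["team", "drama", "drama"]]
n_docs = len(toy_docs)

# Document frequency: number of documents containing each term
doc_freq = Counter(term for doc in toy_docs for term in set(doc))

# Smoothed idf: ln((1 + n) / (1 + df)) + 1
idf_sketch = {term: math.log((1 + n_docs) / (1 + d)) + 1
              for term, d in doc_freq.items()}

def tfidf_row(doc):
    """Term count times idf, then scaled to unit Euclidean length."""
    weights = {term: count * idf_sketch[term] for term, count in Counter(doc).items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

row = tfidf_row(toy_docs[2])
print({term: round(w, 3) for term, w in row.items()})
```

A term that occurs in every document (`team` here) receives the minimum idf of 1, so rarer terms dominate each row.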

In [19]:
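#fit a fresh grid search: lr_gs.fit returns lr_gs itself, so refitting it would overwrite lr_model_cv
lr_model_tv = GridSearchCV(lr, lr_params, cv=5).fit(X_train_tv,y_train)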
print(lr_model_tv.best_estimator_)
print('LogisticRegression best score:', lr_model_tv.best_score_)
print('LogisticRegression train score:', lr_model_tv.score(X_train_tv,y_train))
print('LogisticRegression test score:', lr_model_tv.score(X_test_tv,y_test))

LogisticRegression(C=0.5, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
LogisticRegression best score: 0.8679245283018868
LogisticRegression train score: 0.9197188309285979
LogisticRegression test score: 0.8756936736958935


In [20]:
#fit a fresh grid search: rf_gs.fit returns rf_gs itself, so refitting it would overwrite rf_model_cv
rf_model_tv = GridSearchCV(rf, rf_params, cv=5).fit(X_train_tv,y_train)
print(rf_model_tv.best_estimator_)
print('RandomForest best score:', rf_model_tv.best_score_)
print('RandomForest train score:', rf_model_tv.score(X_train_tv,y_train))
print('RandomForest test score:', rf_model_tv.score(X_test_tv,y_test))

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=125, n_jobs=1,
            oob_score=False, random_state=42, verbose=0, warm_start=False)
RandomForest best score: 0.8472068072512023
RandomForest train score: 0.975212726600074
RandomForest test score: 0.8634850166481687


In [21]:
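#fit a fresh grid search: ada_gs.fit returns ada_gs itself, so refitting it would overwrite ada_model_cv
ada_model_tv = GridSearchCV(ada, param_grid=ada_params, cv=5).fit(X_train_tv,y_train)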
print(ada_model_tv.best_estimator_)
print('AdaBoost best score:', ada_model_tv.best_score_)
print('AdaBoost train score:', ada_model_tv.score(X_train_tv,y_train))
print('AdaBoost test score:', ada_model_tv.score(X_test_tv,y_test))

AdaBoostClassifier(algorithm='SAMME.R',
          base_estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=2,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best'),
          learning_rate=1.0, n_estimators=40, random_state=None)
AdaBoost best score: 0.7846836847946725
AdaBoost train score: 0.8386977432482426
AdaBoost test score: 0.7746947835738068


In [22]:
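#fit a fresh grid search: gb_gs.fit returns gb_gs itself, so refitting it would overwrite gb_model_cv
gb_model_tv = GridSearchCV(gb, param_grid=gb_params, cv=3).fit(X_train_tv,y_train)
print(gb_model_tv.best_estimator_)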
print('Gradient Boost best score:', gb_model_tv.best_score_)
print('Gradient Boost train score:', gb_model_tv.score(X_train_tv,y_train))
print('Gradient Boost test score:', gb_model_tv.score(X_test_tv,y_test))

GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.25, loss='deviance', max_depth=3,
              max_features=None, max_leaf_nodes=None,
              min_impurity_decrease=0.0, min_impurity_split=None,
              min_samples_leaf=1, min_samples_split=2,
              min_weight_fraction_leaf=0.0, n_estimators=100,
              presort='auto', random_state=None, subsample=1.0, verbose=0,
              warm_start=False)
Gradient Boost best score: 0.8087310395856456
Gradient Boost train score: 0.9089900110987791
Gradient Boost test score: 0.8135405105438401


In [18]:
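#fit a new estimator: nb.fit returns nb itself, so refitting it would overwrite nb_model_cv
nb_model_tv = MultinomialNB().fit(X_train_tv,y_train)
print('NaiveBayes Multinomial train score:', nb_model_tv.score(X_train_tv,y_train))
#cv=3 makes explicit the default of the scikit-learn version used here (newer versions default to 5)
print('NaiveBayes Multinomial CV score:', cross_val_score(nb,X_train_tv, y_train, cv=3).mean())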
print('NaiveBayes Multinomial test score:', nb_model_tv.score(X_test_tv,y_test))

NaiveBayes Multinomial train score: 0.9137994820569737
NaiveBayes Multinomial CV score: 0.8886437988248984
NaiveBayes Multinomial test score: 0.881243063263041


- In general, our classification models appear to be overfitting on the training data for both CountVectorized features and TF-IDF features. 
- Multinomial Naive Bayes performs the best with train and cross-validated scores falling close to the test score. 
  - This might indicate that our features are largely independent of each other, which is not surprising; the features, even as n-grams of 2 or 3 tokens, might need more context in order to be classified more accurately.
- For our final model, we will use the CountVectorized features and their respective combination of models, as three of the five models (Random Forest, AdaBoost, and Gradient Boost) score higher on the test set with them.
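
One way to put a number on the overfitting is the gap between train and test accuracy for each model and feature set. The sketch below uses the scores printed earlier in this notebook, rounded to four decimals; the dictionary layout and names are our own for illustration.

```python
# (train, test) accuracy copied from the outputs above, rounded to four decimals
reported = {
    ('LogisticRegression', 'count'): (0.9452, 0.8679),
    ('RandomForest', 'count'): (0.9752, 0.8646),
    ('AdaBoost', 'count'): (0.9042, 0.8158),
    ('GradientBoost', 'count'): (0.9031, 0.8291),
    ('MultinomialNB', 'count'): (0.8946, 0.8613),
    ('LogisticRegression', 'tfidf'): (0.9197, 0.8757),
    ('RandomForest', 'tfidf'): (0.9752, 0.8635),
    ('AdaBoost', 'tfidf'): (0.8387, 0.7747),
    ('GradientBoost', 'tfidf'): (0.9090, 0.8135),
    ('MultinomialNB', 'tfidf'): (0.9138, 0.8812),
}

# Train-minus-test gap: larger values suggest more overfitting
gaps = {key: round(train - test, 4) for key, (train, test) in reported.items()}

for (model, features), gap in sorted(gaps.items(), key=lambda kv: kv[1]):
    print(f'{model:<20}{features:<8}{gap:.4f}')

smallest_gap = min(gaps, key=gaps.get)
largest_gap = max(gaps, key=gaps.get)
```

Multinomial Naive Bayes has the smallest gap on both feature sets, while Random Forest has the largest.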


In [23]:
#export data and best models
models = {'lr_model_cv' : lr_model_cv.best_estimator_,
          'rf_model_cv' : rf_model_cv.best_estimator_,
          'ada_model_cv': ada_model_cv.best_estimator_,
          'gb_model_cv' : gb_model_cv.best_estimator_,
          'nb_model_cv' : nb_model_cv,
          'lr_model_tv' : lr_model_tv.best_estimator_,
          'rf_model_tv' : rf_model_tv.best_estimator_,
          'ada_model_tv': ada_model_tv.best_estimator_,
          'gb_model_tv' : gb_model_tv.best_estimator_,
          'nb_model_tv' : nb_model_tv,
          'X_train' : X_train,
          'X_train_cv' : X_train_cv,
          'X_train_tv' : X_train_tv,
          'X_test_cv' : X_test_cv,
          'X_test_tv' : X_test_tv,
          'y_train' : y_train,      
          'y_test' : y_test,      
          'clean_train_text' : clean_train_text,
          'clean_test_text' :clean_test_text
         }

#use a context manager so the file is closed after writing
with open('../data/models.pk', 'wb') as f:
    pickle.dump(models, f)
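
A later notebook can load the exported dictionary back with `pickle.load`. The round trip below is a self-contained sketch that writes a small stand-in dictionary to a temporary file (the `bundle` contents and file name are illustrative); with the real file, fitted estimators should generally be unpickled under the same scikit-learn version that created them.

```python
import os
import pickle
import tempfile

# Stand-in for the exported dictionary; the real one holds fitted estimators and data
bundle = {'clean_test_text': ['southern charm ladi'], 'note': 'illustrative only'}

path = os.path.join(tempfile.mkdtemp(), 'models_sketch.pk')

with open(path, 'wb') as f:
    pickle.dump(bundle, f)

with open(path, 'rb') as f:
    restored = pickle.load(f)

print(sorted(restored))
```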