# Building ML Classifiers: Evaluate Random Forest with GridSearchCV

*Grid-search*: Exhaustively search all parameter combinations in a given grid to determine the best model.  
*Cross-validation*: Divide a dataset into k subsets and repeat the holdout method k times where a different subset is used as the holdout set in each iteration.
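To make the two definitions concrete, here is a minimal sketch that counts the parameter combinations in the grid used later in this notebook and shows how a 5-fold split partitions a dataset. The 10-row array `X_toy` is a made-up stand-in for the real features, and plain `KFold` is used only for illustration (for a classifier, `GridSearchCV` with `cv=5` uses stratified folds by default).

```python
import numpy as np
from sklearn.model_selection import KFold, ParameterGrid

# Same grid as the Random Forest search later in this notebook
param = {
    'n_estimators': [10, 150, 300],
    'max_depth': [30, 60, 90, None],
}

# Grid search: every combination of the listed values is one candidate model
combos = list(ParameterGrid(param))
n_combos = len(combos)  # 3 values x 4 values

# Cross-validation: each of the k folds is held out exactly once
X_toy = np.arange(20).reshape(10, 2)  # hypothetical 10-row dataset
kf = KFold(n_splits=5)
fold_sizes = [(len(train_idx), len(test_idx)) for train_idx, test_idx in kf.split(X_toy)]

# Each candidate is fit once per fold
total_fits = n_combos * kf.get_n_splits()
print(n_combos, fold_sizes, total_fits)
```

With 12 candidates and 5 folds, the search performs 60 model fits, plus one final refit of the best candidate on the full data when `refit=True` (the default).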

### Read in text

In [1]:
import pandas as pd
import nltk
import re
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
import string

stopwords = nltk.corpus.stopwords.words('english')
ps = nltk.PorterStemmer()

data = pd.read_csv("SMSSpamCollection.tsv", sep='\t')
data.columns = ['label', 'body_text']

def count_punct(text):
    count = sum([1 for char in text if char in string.punctuation])
    return round(count/(len(text) - text.count(" ")), 3)*100

data['body_len'] = data['body_text'].apply(lambda x: len(x) - x.count(" "))
data['punct%'] = data['body_text'].apply(lambda x: count_punct(x))

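def clean_text(text):
    # Drop punctuation characters and lowercase what remains
    text = "".join([char.lower() for char in text if char not in string.punctuation])
    # Split on runs of non-word characters (raw string avoids an invalid-escape warning)
    tokens = re.split(r'\W+', text)
    # Remove stopwords, then stem each remaining token
    text = [ps.stem(word) for word in tokens if word not in stopwords]
    return text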

In [2]:
# TF-IDF
tfidf_vect = TfidfVectorizer(analyzer=clean_text)
X_tfidf = tfidf_vect.fit_transform(data['body_text'])
X_tfidf_feat = pd.concat([data['body_len'], data['punct%'], pd.DataFrame(X_tfidf.toarray(), columns=tfidf_vect.get_feature_names_out())], axis=1)
X_tfidf_feat.head()

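| | body_len | punct% | | 0 | 008704050406 | 0089mi | 0121 | 01223585236 | 01223585334 | 0125698789 | ... | zindgi | zoe | zogtoriu | zoom | zouk | zyada | é | ü | üll | 〨ud |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 128 | 4.7 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 49 | 4.1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 62 | 3.2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 28 | 7.1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 135 | 4.4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |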


In [3]:
# CountVectorizer
count_vect = CountVectorizer(analyzer=clean_text)
X_count = count_vect.fit_transform(data['body_text'])
X_count_feat = pd.concat([data['body_len'], data['punct%'], pd.DataFrame(X_count.toarray(), columns=count_vect.get_feature_names_out())], axis=1)
X_count_feat.head()

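| | body_len | punct% | | 0 | 008704050406 | 0089mi | 0121 | 01223585236 | 01223585334 | 0125698789 | ... | zindgi | zoe | zogtoriu | zoom | zouk | zyada | é | ü | üll | 〨ud |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 128 | 4.7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 49 | 4.1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 62 | 3.2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 28 | 7.1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 135 | 4.4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |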


### Exploring parameter settings using GridSearchCV

In [4]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

In [5]:
# TF-IDF models

rf = RandomForestClassifier()
param = {
    'n_estimators': [10, 150, 300],
    'max_depth': [30, 60, 90, None],
}

gs = GridSearchCV(rf, param, cv=5, n_jobs=-1) # return_train_score=False by default
# GridSearchCV is itself a scikit-learn estimator: calling fit() trains the model
# on every parameter combination in the grid, across all 5 folds
gs_fit = gs.fit(X_tfidf_feat, data['label'])

pd.DataFrame(gs_fit.cv_results_).sort_values('mean_test_score', ascending=False)[0:5]

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_max_depth | param_n_estimators | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8 | 33.464239 | 0.968948 | 0.464548 | 0.043731 | 90.0 | 300 | {'max_depth': 90, 'n_estimators': 300} | 0.978456 | 0.978456 | 0.973944 | 0.968553 | 0.973046 | 0.974491 | 0.003717 | 1 |
| 10 | 18.908961 | 0.507772 | 0.305568 | 0.037735 | | 150 | {'max_depth': None, 'n_estimators': 150} | 0.977558 | 0.978456 | 0.975741 | 0.966757 | 0.971249 | 0.973952 | 0.004372 | 2 |
| 6 | 1.966229 | 0.177248 | 0.143863 | 0.011178 | 90.0 | 10 | {'max_depth': 90, 'n_estimators': 10} | 0.977558 | 0.983842 | 0.973046 | 0.962264 | 0.973046 | 0.973951 | 0.007058 | 3 |
| 7 | 17.037896 | 0.511843 | 0.269527 | 0.012663 | 90.0 | 150 | {'max_depth': 90, 'n_estimators': 150} | 0.978456 | 0.975763 | 0.974843 | 0.967655 | 0.972147 | 0.973773 | 0.003664 | 4 |
| 11 | 31.198131 | 0.565726 | 0.265494 | 0.054262 | | 300 | {'max_depth': None, 'n_estimators': 300} | 0.976661 | 0.973968 | 0.975741 | 0.969452 | 0.972147 | 0.973594 | 0.002585 | 5 |


- **mean_fit_time**: average time (in seconds) to fit the model on the training folds
- **mean_score_time**: average time (in seconds) to score the fitted model on the held-out fold
- **mean_test_score**: average accuracy on the held-out folds
- **mean_train_score**: average accuracy on the training folds (only reported when `return_train_score=True`, so it is absent from the table above)
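Since `mean_train_score` is not part of the default output, the following sketch shows one way to request it. It runs on a small synthetic dataset from `make_classification` rather than the SMS data, and the names `X_toy`, `y_toy`, and `gs_toy` as well as the reduced grid are assumptions made purely to keep the example fast.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Hypothetical toy data standing in for the real feature matrix
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=0)

gs_toy = GridSearchCV(
    RandomForestClassifier(random_state=0),
    {'n_estimators': [5, 10], 'max_depth': [2, None]},
    cv=5,
    return_train_score=True,  # adds the *_train_score columns to cv_results_
)
gs_toy.fit(X_toy, y_toy)

results = pd.DataFrame(gs_toy.cv_results_)
cols = ['param_max_depth', 'param_n_estimators', 'mean_train_score', 'mean_test_score']
summary = results[cols]
print(summary)
```

A large gap between `mean_train_score` and `mean_test_score` is a common sign of overfitting. Computing train scores costs extra time, which is why it is off by default.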
  
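The best-performing TF-IDF models are those with the deepest individual decision trees (`max_depth` of 90 or None), while the number of estimators matters much less: the top five mean test scores all fall between 0.9736 and 0.9745.
The model with just 10 estimators is also far faster, fitting in about 2 seconds on average versus roughly 17 to 33 seconds for the 150- and 300-estimator models.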
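One way to act on this speed versus accuracy trade-off is sketched below: keep every candidate whose mean test score is within one standard deviation of the best, then pick the one that fits fastest. This selection heuristic is an illustration, not something the notebook itself applies, and the synthetic data and the names `gs_toy`, `threshold`, `candidates`, and `fastest` are assumptions.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Hypothetical toy data standing in for the real feature matrix
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=0)

gs_toy = GridSearchCV(
    RandomForestClassifier(random_state=0),
    {'n_estimators': [5, 20, 40], 'max_depth': [3, None]},
    cv=5,
)
gs_toy.fit(X_toy, y_toy)
results = pd.DataFrame(gs_toy.cv_results_)

# The top-ranked candidate and its fold-to-fold spread
best = results.loc[gs_toy.best_index_]
threshold = best['mean_test_score'] - best['std_test_score']

# Candidates that are statistically hard to distinguish from the best
candidates = results[results['mean_test_score'] >= threshold]

# Among those, prefer the one that is cheapest to fit
fastest = candidates.sort_values('mean_fit_time').iloc[0]

print(gs_toy.best_params_, round(gs_toy.best_score_, 4))
print(fastest[['params', 'mean_fit_time', 'mean_test_score']])
```

`best_params_`, `best_score_`, and `best_index_` are attributes of the fitted `GridSearchCV` object, so the top-ranked setting can be read off directly without sorting `cv_results_`.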

In [6]:
# CountVectorizer models

rf = RandomForestClassifier()
param = {
    'n_estimators': [10, 150, 300],
    'max_depth': [30, 60, 90, None],
}

gs = GridSearchCV(rf, param, cv=5, n_jobs=-1) # return_train_score=False by default
# GridSearchCV is itself a scikit-learn estimator: calling fit() trains the model
# on every parameter combination in the grid, across all 5 folds
gs_fit = gs.fit(X_count_feat, data['label'])

pd.DataFrame(gs_fit.cv_results_).sort_values('mean_test_score', ascending=False)[0:5]

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_max_depth | param_n_estimators | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7 | 17.219825 | 0.501948 | 0.293461 | 0.011252 | 90.0 | 150 | {'max_depth': 90, 'n_estimators': 150} | 0.976661 | 0.975763 | 0.973046 | 0.969452 | 0.971249 | 0.973234 | 0.002699 | 1 |
| 10 | 21.175483 | 1.572806 | 0.348967 | 0.075441 | | 150 | {'max_depth': None, 'n_estimators': 150} | 0.981149 | 0.973968 | 0.973046 | 0.967655 | 0.97035 | 0.973234 | 0.004531 | 2 |
| 8 | 36.510925 | 1.31809 | 0.719888 | 0.06604 | 90.0 | 300 | {'max_depth': 90, 'n_estimators': 300} | 0.977558 | 0.974865 | 0.974843 | 0.966757 | 0.968553 | 0.972515 | 0.004129 | 3 |
| 11 | 36.288682 | 1.439076 | 0.326608 | 0.070011 | | 300 | {'max_depth': None, 'n_estimators': 300} | 0.976661 | 0.97307 | 0.974843 | 0.966757 | 0.97035 | 0.972336 | 0.003481 | 4 |
| 5 | 27.291945 | 0.488282 | 0.393102 | 0.004521 | 60.0 | 300 | {'max_depth': 60, 'n_estimators': 300} | 0.975763 | 0.970377 | 0.972147 | 0.964061 | 0.97035 | 0.97054 | 0.003792 | 5 |


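TF-IDF performs slightly better than the count vectorizer here (best mean test score of 0.9745 versus 0.9732), although the gap is smaller than the fold-to-fold standard deviation. As with TF-IDF, deeper trees (`max_depth` of 90 or None) dominate the top of the ranking. Unlike TF-IDF, however, the number of estimators may matter: every model in the top five uses 150 or 300 estimators.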
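The two searches can also be run in one loop so the best result for each feature type lands in a single table. The sketch below assumes a tiny made-up corpus (`texts`, `labels`), default vectorizer settings instead of the notebook's `clean_text` analyzer, and a reduced grid, so its scores say nothing about the SMS dataset.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import GridSearchCV

# Hypothetical toy corpus: 10 spam-like and 10 ham-like messages
spam = [f"free prize call now win cash offer {i}" for i in range(10)]
ham = [f"see you at lunch tomorrow ok {i}" for i in range(10)]
texts = spam + ham
labels = ['spam'] * 10 + ['ham'] * 10

param = {'n_estimators': [5, 10], 'max_depth': [2, None]}

rows = []
for name, vect in [('tfidf', TfidfVectorizer()), ('count', CountVectorizer())]:
    X = vect.fit_transform(texts)
    gs = GridSearchCV(RandomForestClassifier(random_state=0), param, cv=5)
    gs.fit(X, labels)
    rows.append({
        'features': name,
        'best_score': gs.best_score_,
        # Spread across folds for the winning setting, to judge how real a gap is
        'std': gs.cv_results_['std_test_score'][gs.best_index_],
        **gs.best_params_,
    })

comparison = pd.DataFrame(rows)
print(comparison)
```

Reporting the standard deviation next to each best score makes it easier to tell whether a small difference between feature types is larger than the noise between folds.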

In practice we would usually explore many more settings (e.g., n-grams, other vectorizer parameters, four or five additional Random Forest hyperparameters, whether to keep stopwords, and whether removing punctuation helps). It's not uncommon to test over a hundred, or even a thousand, models in some cases.
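A wider search like this can be expressed by wrapping the vectorizer and the classifier in a scikit-learn `Pipeline`, so that vectorizer settings become grid parameters too. The following is a minimal sketch under stated assumptions: the toy corpus, the step names `vect` and `rf`, and the grid values are all illustrative, and the default analyzer is used because `ngram_range` and `stop_words` have no effect when a custom callable such as `clean_text` is passed as `analyzer`.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: 10 spam-like and 10 ham-like messages
spam = [f"free prize call now win cash offer {i}" for i in range(10)]
ham = [f"see you at lunch tomorrow ok {i}" for i in range(10)]
texts = spam + ham
labels = ['spam'] * 10 + ['ham'] * 10

pipe = Pipeline([
    ('vect', TfidfVectorizer()),
    ('rf', RandomForestClassifier(random_state=0)),
])

# Parameters are addressed as <step name>__<parameter name>
param = {
    'vect__ngram_range': [(1, 1), (1, 2)],   # unigrams vs unigrams + bigrams
    'vect__stop_words': [None, 'english'],   # keep or drop stopwords
    'rf__n_estimators': [5, 10],
    'rf__max_depth': [2, None],
}

gs = GridSearchCV(pipe, param, cv=5)
gs.fit(texts, labels)

n_models = len(gs.cv_results_['params'])  # 2 * 2 * 2 * 2 candidates
print(n_models, gs.best_params_)
```

Because the vectorizer sits inside the pipeline, it is refit on the training folds of each split rather than on the full dataset. The number of candidates multiplies with each added parameter, which is how searches reach hundreds of models.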