# Building ML Classifiers: Random Forest model with grid search

We can often improve a model simply by changing its hyperparameter settings (e.g. the number of estimators or the maximum tree depth).
Grid search means defining a grid of hyperparameter settings and then fitting a model with each combination of those settings in order to choose the best one.
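
As a minimal sketch of what "a grid" means here, the snippet below enumerates every combination of two hyperparameter lists with `itertools.product`. The names `param_grid` and `combos` are illustrative and not part of the notebook; the values are the same ones explored later in the manual grid search.

```python
from itertools import product

# Hypothetical grid: each key is a hyperparameter, each value lists the settings to try.
param_grid = {
    'n_estimators': [10, 50, 100],
    'max_depth': [10, 20, 30, None],
}

# Cartesian product of all settings: one dict per candidate model.
combos = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]

print(len(combos))  # 3 * 4 = 12 candidate models
print(combos[0])    # {'n_estimators': 10, 'max_depth': 10}
```

Each dictionary in `combos` describes one model to fit and evaluate, so the cost of a grid search grows with the product of the list lengths.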

### Read in & clean text

In [1]:
import nltk
import pandas as pd
import re
from sklearn.feature_extraction.text import TfidfVectorizer
import string

stopwords = nltk.corpus.stopwords.words('english')
ps = nltk.PorterStemmer()

data = pd.read_csv("SMSSpamCollection.tsv", sep='\t')
data.columns = ['label', 'body_text']

def count_punct(text):
    count = sum([1 for char in text if char in string.punctuation])
    return round(count/(len(text) - text.count(" ")), 3)*100

data['body_len'] = data['body_text'].apply(lambda x: len(x) - x.count(" "))
data['punct%'] = data['body_text'].apply(lambda x: count_punct(x))

def clean_text(text):
    text = "".join([char.lower() for char in text if char not in string.punctuation])
    tokens = re.split(r'\W+', text)
    text = [ps.stem(word) for word in tokens if word not in stopwords]
    return text

tfidf_vect = TfidfVectorizer(analyzer=clean_text)
X_tfidf = tfidf_vect.fit_transform(data['body_text'])

X_features = pd.concat([data['body_len'], data['punct%'], pd.DataFrame(X_tfidf.toarray(), columns=tfidf_vect.get_feature_names_out())], axis=1)
# X_features holds only the predictors; the label column is deliberately left out
X_features.head()

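| | body_len | punct% | | 0 | 008704050406 | 0089mi | 0121 | 01223585236 | 01223585334 | 0125698789 | ... | zindgi | zoe | zogtoriu | zoom | zouk | zyada | é | ü | üll | 〨ud |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 128 | 4.7 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 49 | 4.1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 62 | 3.2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 28 | 7.1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 135 | 4.4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |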


### Build our own Grid-search

In [2]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.model_selection import train_test_split

In [3]:
X_train, X_test, y_train, y_test = train_test_split(X_features, data['label'], test_size=0.2)

In [4]:
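def train_RF(n_est, depth):
    # Fit a random forest with the given settings on the training split
    rf = RandomForestClassifier(n_estimators=n_est, max_depth=depth)
    rf_model = rf.fit(X_train, y_train)
    # Score on the held-out test split, treating 'spam' as the positive class
    y_pred = rf_model.predict(X_test)
    precision, recall, fscore, support = score(y_test, y_pred, pos_label='spam', average='binary')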

    precision = round(precision, 3)
    recall = round(recall, 3)
    accuracy = round((y_pred==y_test).sum() / len(y_pred), 3)
    print(f'Est: {n_est} / Depth: {depth} ----- Precision: {precision} / Recall: {recall} / Accuracy: {accuracy}')

In [5]:
for n_est in [10, 50, 100]:
    for depth in [10, 20, 30, None]:
        train_RF(n_est, depth)

Est: 10 / Depth: 10 ----- Precision: 1.0 / Recall: 0.291 / Accuracy: 0.91
Est: 10 / Depth: 20 ----- Precision: 0.99 / Recall: 0.688 / Accuracy: 0.96
Est: 10 / Depth: 30 ----- Precision: 1.0 / Recall: 0.745 / Accuracy: 0.968
Est: 10 / Depth: None ----- Precision: 0.991 / Recall: 0.823 / Accuracy: 0.977
Est: 50 / Depth: 10 ----- Precision: 1.0 / Recall: 0.248 / Accuracy: 0.905
Est: 50 / Depth: 20 ----- Precision: 1.0 / Recall: 0.645 / Accuracy: 0.955
Est: 50 / Depth: 30 ----- Precision: 1.0 / Recall: 0.801 / Accuracy: 0.975
Est: 50 / Depth: None ----- Precision: 1.0 / Recall: 0.837 / Accuracy: 0.979
Est: 100 / Depth: 10 ----- Precision: 1.0 / Recall: 0.213 / Accuracy: 0.9
Est: 100 / Depth: 20 ----- Precision: 1.0 / Recall: 0.66 / Accuracy: 0.957
Est: 100 / Depth: 30 ----- Precision: 1.0 / Recall: 0.78 / Accuracy: 0.972
Est: 100 / Depth: None ----- Precision: 1.0 / Recall: 0.865 / Accuracy: 0.983


As the output shows, the model with n_estimators=100 and max_depth=None achieves the best recall (0.865) while keeping precision at 1.0. This model flags spam more aggressively and therefore fits our purpose better.
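
Rather than reading the best setting off the printed log, the comparison can also be done in code. The sketch below hard-codes the twelve results printed above as `(n_est, depth, precision, recall, accuracy)` tuples and picks the one with the highest recall, using precision as a tie-breaker. The names `results` and `best` are illustrative only; in practice `train_RF` could return these values instead of printing them.

```python
# (n_est, depth, precision, recall, accuracy), copied from the printed output above
results = [
    (10, 10, 1.0, 0.291, 0.91),
    (10, 20, 0.99, 0.688, 0.96),
    (10, 30, 1.0, 0.745, 0.968),
    (10, None, 0.991, 0.823, 0.977),
    (50, 10, 1.0, 0.248, 0.905),
    (50, 20, 1.0, 0.645, 0.955),
    (50, 30, 1.0, 0.801, 0.975),
    (50, None, 1.0, 0.837, 0.979),
    (100, 10, 1.0, 0.213, 0.9),
    (100, 20, 1.0, 0.66, 0.957),
    (100, 30, 1.0, 0.78, 0.972),
    (100, None, 1.0, 0.865, 0.983),
]

# Rank by recall first, then by precision to break ties.
best = max(results, key=lambda r: (r[3], r[2]))
print(best)  # (100, None, 1.0, 0.865, 0.983)
```

Because the train/test split above is random, the exact numbers will differ from run to run, so the ranking is best treated as indicative rather than final.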