# 6 Machine Learning Classifiers
## 6.3 Random Forest model
### Read in & clean text

In [24]:
import nltk
import pandas as pd
import re
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
import string

stopwords = nltk.corpus.stopwords.words('english')
ps = nltk.PorterStemmer()
wn = nltk.WordNetLemmatizer()

data = pd.read_csv('SMSSpamCollection', sep='\t')
data.columns = ['label', 'body_text']

def count_punct(text):
    count = sum([1 for char in text if char in string.punctuation])
    return round(count/(len(text) - text.count(' ')), 3)*100

def count_cap(text):
    count = len(re.findall('[A-Z]', text))
    return round(count/(len(text) - text.count(' ')), 3)*100

data['body_len'] = data['body_text'].apply(lambda x: len(x) - x.count(' '))
data['punct%'] = data['body_text'].apply(lambda x: count_punct(x))
data['cap%'] = data['body_text'].apply(lambda x: count_cap(x))

def clean_text_ps(text):
    text = ''.join([char.lower() for char in text if char not in string.punctuation])
    text = re.sub('[0-9]+', '', text)
    tokens = re.findall('\w+', text)
    text = [ps.stem(word) for word in tokens if word not in stopwords]
    return text

def clean_text_wn(text):
    text = ''.join([char.lower() for char in text if char not in string.punctuation])
    text = re.sub('[0-9]+', '', text)
    tokens = re.findall('\w+', text)
    text = [wn.lemmatize(word) for word in tokens if word not in stopwords]
    return text

# TF-IDF using stemming
tfidf_vect_ps = TfidfVectorizer(analyzer=clean_text_ps)
X_tfidf_ps = tfidf_vect_ps.fit_transform(data['body_text'])
X_tfidf_feat_ps = pd.concat([data['body_len'], data['punct%'], data['cap%'], pd.DataFrame(X_tfidf_ps.toarray())], axis=1)

# TF-IDF using lemmatizing
tfidf_vect_wn = TfidfVectorizer(analyzer=clean_text_wn)
X_tfidf_wn = tfidf_vect_wn.fit_transform(data['body_text'])
X_tfidf_feat_wn = pd.concat([data['body_len'], data['punct%'], data['cap%'], pd.DataFrame(X_tfidf_wn.toarray())], axis=1)

# CountVectorizer using stemming
count_vect_ps = CountVectorizer(analyzer=clean_text_ps)
X_count_ps = count_vect_ps.fit_transform(data['body_text'])
X_count_feat_ps = pd.concat([data['body_len'], data['punct%'], data['cap%'], pd.DataFrame(X_count_ps.toarray())], axis=1)

# CountVectorizer using lemmatizing
count_vect_wn = CountVectorizer(analyzer=clean_text_wn)
X_count_wn = count_vect_wn.fit_transform(data['body_text'])
X_count_feat_wn = pd.concat([data['body_len'], data['punct%'], data['cap%'], pd.DataFrame(X_count_wn.toarray())], axis=1)

### RandomForestClassifier

In [2]:
from sklearn.ensemble import RandomForestClassifier

**Attributes**
* `feature_importances_`: outputs the value of each feature to the model
* `fit`: fit actual model and store that fit model as object
* `predict`: make prediction on test set

**Hyperparameters**
* `max_depth`: the depth of each decision trees is, default is none
* `n_estimators`: the number of decision trees built within random forest, default is 10

### RandomForestClassifier through Cross_Validation

In [19]:
from sklearn.model_selection import KFold, cross_val_score

# Build individual decision trees in parallel
rf = RandomForestClassifier(n_jobs=-1)
k_fold = KFold(n_splits=5)
print('IF-IDF using stemming: ', cross_val_score(rf, X_tfidf_feat_ps, data['label'], cv=k_fold, scoring='accuracy', n_jobs=-1))
print('TF-IDF using lemmatizing: ', cross_val_score(rf, X_tfidf_feat_wn, data['label'], cv=k_fold, scoring='accuracy', n_jobs=-1))
print('Count using stemming: ', cross_val_score(rf, X_count_feat_ps, data['label'], cv=k_fold, scoring='accuracy', n_jobs=-1))
print('Count using lemmatizing: ', cross_val_score(rf, X_count_feat_wn, data['label'], cv=k_fold, scoring='accuracy', n_jobs=-1))

IF-IDF using stemming:  [0.97578475 0.97755835 0.97307002 0.96768402 0.97486535]
TF-IDF using lemmatizing:  [0.96860987 0.97396768 0.97576302 0.95960503 0.97307002]
Count using stemming:  [0.96860987 0.97307002 0.97576302 0.95960503 0.97576302]
Count using lemmatizing:  [0.96950673 0.97037702 0.97486535 0.96140036 0.96947935]


### RandomForestClassifier through Holdout Test Set

In [4]:
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.model_selection import train_test_split

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X_features_ps, data['label'], test_size=0.2)

rf_h = RandomForestClassifier(n_estimators=50, max_depth=20, n_jobs=-1)
rf_model = rf_h.fit(X_train, y_train)

sorted(zip(rf_model.feature_importances_, X_train.columns), reverse=True)[0:5]

[(0.05895296291205725, 'body_len'),
 (0.046055213992170996, 'cap%'),
 (0.036850849709918206, 843),
 (0.03033015077545737, 1069),
 (0.029476376792644476, 3829)]

In [11]:
y_pred = rf_model.predict(X_test)
precision, recall, fscore, support = score(y_test, y_pred, pos_label='spam', average='binary')

print('Precision: {} / Recall: {} / Accuracy: {}'.format(round(precision, 3),
                                                        round(recall, 3),
                                                        round((y_pred==y_test).sum()/len(y_test), 3)))

Precision: 1.0 / Recall: 0.72 / Accuracy: 0.967


* **Precision** 100%: All mail in the spam folder is actually spam.
* **Recall** 72%: 72% of the all spam that has come into email was properly placed in the spam folder.
* **Accuracy** 96.7%: 96.7% of emails that have come into emails were correctly identified as spam or ham.

### Random Forest model with Grid-search
**Grid-search**: Exhaustively search all parameter combinations in a given grid to determine the best model.

In [14]:
def train_rf(n_est, depth):
    rf = RandomForestClassifier(n_estimators=n_est, max_depth=depth, n_jobs=-1)
    rf_model = rf.fit(X_train, y_train)
    y_pred = rf_model.predict(X_test)
    precision, recall, fscore, support = score(y_test, y_pred, pos_label='spam', average='binary')
    print('Est: {} / Depth: {} ---- Precision: {} / Recall: {} / Accuracy: {}'.format(
        n_est, depth, round(precision, 3), round(recall, 3), round((y_pred==y_test).sum()/len(y_test), 3)))

In [15]:
for n_est in [10, 50, 100]:
    for depth in [10, 20, 30, None]:
        train_rf(n_est, depth)

Est: 10 / Depth: 10 ---- Precision: 1.0 / Recall: 0.303 / Accuracy: 0.917
Est: 10 / Depth: 20 ---- Precision: 1.0 / Recall: 0.636 / Accuracy: 0.957
Est: 10 / Depth: 30 ---- Precision: 1.0 / Recall: 0.803 / Accuracy: 0.977
Est: 10 / Depth: None ---- Precision: 0.973 / Recall: 0.826 / Accuracy: 0.977
Est: 50 / Depth: 10 ---- Precision: 1.0 / Recall: 0.371 / Accuracy: 0.926
Est: 50 / Depth: 20 ---- Precision: 1.0 / Recall: 0.727 / Accuracy: 0.968
Est: 50 / Depth: 30 ---- Precision: 1.0 / Recall: 0.795 / Accuracy: 0.976
Est: 50 / Depth: None ---- Precision: 0.983 / Recall: 0.879 / Accuracy: 0.984
Est: 100 / Depth: 10 ---- Precision: 1.0 / Recall: 0.318 / Accuracy: 0.919
Est: 100 / Depth: 20 ---- Precision: 1.0 / Recall: 0.689 / Accuracy: 0.963
Est: 100 / Depth: 30 ---- Precision: 1.0 / Recall: 0.803 / Accuracy: 0.977
Est: 100 / Depth: None ---- Precision: 0.983 / Recall: 0.871 / Accuracy: 0.983


Through grid-search, we can see how certain parameters impact the aggressiveness in the model. Here, as the depth increases from 10, 20, to 30, and eventually to nono, the recall increases quite drastically, while the precision doesn't really drop. So the model is getting much better and more aggressive as the depth increases. On the other side, we noticed that adding estimators might be helping a little bit, but the improvement isn't as adding depth to the individual trees. 10 is clearly pretty bad, no matter how many estimators. 20 isn't really great either. Once get towards 30, it starts to level out. So we can eliminate model with limited max depth to get best random forest model.

### Evaluate Random Forest with GridSearchCV to explore parameter settings
**Cross-validation**: Divide a dataset into *k* subsets and repeat the holdout method _k_ times where a different subset is used as the holdout set in each iteration.

In [21]:
from sklearn.model_selection import GridSearchCV
import time

In [25]:
rf = RandomForestClassifier()
param = {'n_estimators': [10, 150, 300], 'max_depth': [30, 60, 90, None]}

gs = GridSearchCV(rf, param, cv=5, n_jobs=-1, return_train_score=True)

# TF-IDF
start = time.time()
gs_fit = gs.fit(X_tfidf_feat_ps, data['label'])
end = time.time()
print('Fit time: ', round(end - start, 3))
pd.DataFrame(gs_fit.cv_results_).sort_values('mean_test_score', ascending=False).head()

Fit time:  514.623


Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_max_depth,param_n_estimators,params,split0_test_score,split1_test_score,split2_test_score,...,mean_test_score,std_test_score,rank_test_score,split0_train_score,split1_train_score,split2_train_score,split3_train_score,split4_train_score,mean_train_score,std_train_score
10,34.836667,0.769263,0.415283,0.03,,150,"{'max_depth': None, 'n_estimators': 150}",0.979372,0.977578,0.974865,...,0.976306,0.001944,1,1.0,1.0,1.0,1.0,1.0,1.0,0.0
11,60.273258,6.539977,0.574894,0.073772,,300,"{'max_depth': None, 'n_estimators': 300}",0.978475,0.976682,0.974865,...,0.975588,0.002086,2,1.0,1.0,1.0,1.0,1.0,1.0,0.0
8,61.792665,0.592637,0.593537,0.016471,90.0,300,"{'max_depth': 90, 'n_estimators': 300}",0.978475,0.975785,0.976661,...,0.975408,0.002703,3,0.999551,0.999551,0.999551,1.0,0.999551,0.999641,0.00018
6,4.960417,0.148955,0.22578,0.021746,90.0,10,"{'max_depth': 90, 'n_estimators': 10}",0.978475,0.974888,0.974865,...,0.975049,0.004624,4,0.997756,0.998654,0.998654,0.998429,0.998205,0.99834,0.000336
7,32.746749,0.622644,0.402019,0.02817,90.0,150,"{'max_depth': 90, 'n_estimators': 150}",0.979372,0.973991,0.976661,...,0.975049,0.003654,4,0.999327,0.999551,0.999327,1.0,0.999551,0.999551,0.000246


* **mean_fit_time**: the average time it takes each model to fit
* **mean_score_time**: the average amount of time it takes each model to make a prediction on the test set
* **mean_test_score**: the average accuracy on the test set
* **mean_train_score**: the average accuracy on the training set

In terms of parameter combinations, we notice that the best performing models are the ones with the deepest individual decision trees. The number of estimators doesn't seem to matter as much. Looking at the mean_fit_time, it's much faster for 10 estimators than it is for 150 or 300. 

In [26]:
# CountVectorizer
start = time.time()
gs_fit1 = gs.fit(X_count_feat_ps, data['label'])
end = time.time()
print('Fit time: ', round(end - start, 3))
pd.DataFrame(gs_fit1.cv_results_).sort_values('mean_test_score', ascending=False).head()

Fit time:  502.035


Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_max_depth,param_n_estimators,params,split0_test_score,split1_test_score,split2_test_score,...,mean_test_score,std_test_score,rank_test_score,split0_train_score,split1_train_score,split2_train_score,split3_train_score,split4_train_score,mean_train_score,std_train_score
10,34.628215,0.424908,0.435221,0.022161,,150,"{'max_depth': None, 'n_estimators': 150}",0.980269,0.975785,0.976661,...,0.976126,0.002583,1,1.0,1.0,1.0,1.0,1.0,1.0,0.0
8,62.067042,0.487726,0.577043,0.023161,90.0,300,"{'max_depth': 90, 'n_estimators': 300}",0.980269,0.976682,0.975763,...,0.975947,0.002868,2,0.998878,0.999327,0.999103,1.0,0.999327,0.999327,0.000375
7,32.046166,0.326746,0.390415,0.015534,90.0,150,"{'max_depth': 90, 'n_estimators': 150}",0.978475,0.976682,0.975763,...,0.975767,0.002413,3,0.999327,0.999327,0.999103,0.999551,0.999551,0.999372,0.000168
11,61.82121,6.535141,0.596016,0.112341,,300,"{'max_depth': None, 'n_estimators': 300}",0.977578,0.977578,0.973968,...,0.975767,0.001615,3,1.0,1.0,1.0,1.0,1.0,1.0,0.0
4,26.80229,0.212494,0.36887,0.033454,60.0,150,"{'max_depth': 60, 'n_estimators': 150}",0.978475,0.975785,0.972172,...,0.975049,0.002234,5,0.995287,0.995287,0.996186,0.996635,0.995065,0.995692,0.000609


The mean_test_score at 97.61% for CountVectorizer is just a tight below 97.63% for TF-IDF, so TF-IDF is doing slightly better. The max_depth with 90 or none are in most of the top models on this side. The number of estimators seems that it matters a little bit more as there are only 150 and 300.
## 6.4 Gradient Boosting model
### GradientBoostingClassifier

In [27]:
from sklearn.ensemble import GradientBoostingClassifier

Recall back to random forest, the default setting for `max_depth` is none, so it could build each tree as deep as it wanted. For gradient boosting, the `max_depth` default setting is three. Also for random forest the default for the number of estimators was 10, and for gradient boosting it's 100. Random forest is built with a couple fully grown trees, whereas gradient boosting uses a lot of very basic trees. We can notice there's no `n_jobs` paramter like for random forest. So gradient boosting can't be parallelized because each iteration builds on the prior iteration. `learning_rate` determines how quickly an algorithm optimizes, but it also has performance implications, because it can cause the model to optimize too quickly without truly finding the best model.
### Gradiet Boosting model with Grid-search

In [28]:
def train_gb(est, max_depth, lr):
    gb = GradientBoostingClassifier(n_estimators=est, max_depth=max_depth, learning_rate=lr)
    start = time.time()
    gb_model = gb.fit(X_train, y_train)
    end = time.time()
    y_pred = gb_model.predict(X_test)
    precision, recall, fscore, support = score(y_test, y_pred, pos_label='spam', average='binary')
    print('Fit time: {} / Est: {} / Depth: {} / LR: {} ---- Precision: {} / Recall: {} / Accuracy: {}'.format(
        round(end-start, 3), est, max_depth, lr, round(precision, 3), round(recall, 3), 
        round((y_pred==y_test).sum()/len(y_test), 3)))

In [29]:
for n_est in [50, 100, 150]:
    for max_depth in [3, 7, 11, 15]:
        for lr in [0.05, 0.1, 1]:
            train_gb(n_est, max_depth, lr)

Fit time: 40.868 / Est: 50 / Depth: 3 / LR: 0.05 ---- Precision: 0.88 / Recall: 0.78 / Accuracy: 0.961
Fit time: 38.697 / Est: 50 / Depth: 3 / LR: 0.1 ---- Precision: 0.889 / Recall: 0.848 / Accuracy: 0.97
Fit time: 33.715 / Est: 50 / Depth: 3 / LR: 1 ---- Precision: 0.83 / Recall: 0.848 / Accuracy: 0.961
Fit time: 97.537 / Est: 50 / Depth: 7 / LR: 0.05 ---- Precision: 0.884 / Recall: 0.864 / Accuracy: 0.97
Fit time: 91.083 / Est: 50 / Depth: 7 / LR: 0.1 ---- Precision: 0.878 / Recall: 0.871 / Accuracy: 0.97
Fit time: 72.068 / Est: 50 / Depth: 7 / LR: 1 ---- Precision: 0.898 / Recall: 0.871 / Accuracy: 0.973
Fit time: 144.766 / Est: 50 / Depth: 11 / LR: 0.05 ---- Precision: 0.889 / Recall: 0.848 / Accuracy: 0.97
Fit time: 141.948 / Est: 50 / Depth: 11 / LR: 0.1 ---- Precision: 0.885 / Recall: 0.871 / Accuracy: 0.971
Fit time: 51.359 / Est: 50 / Depth: 11 / LR: 1 ---- Precision: 0.863 / Recall: 0.856 / Accuracy: 0.967
Fit time: 188.081 / Est: 50 / Depth: 15 / LR: 0.05 ---- Precision: 0.

If set the learning rate too low, the models might not be able to predict any messages to be spam, the precision can't be calculated and will be set to zero.

**Poorly performing models**
* Very low max depth
* Very low number of estimators

**Best performing models**
* Learning rate of 0.1
* Very high max depth
* Very high number of estimators

### Evaluate Gradient Boosting with GridSearchCV to explore parameter settings

In [30]:
gb = GradientBoostingClassifier()
param = {
    'n_estimators': [100, 150],
    'max_depth': [7, 11, 15],
    'learning_rate': [0.1, 0.3]
}

gs = GridSearchCV(gb, param, cv=5, n_jobs=-1, return_train_score=True)

# TF-IDF
start = time.time()
gs_fit = gs.fit(X_tfidf_feat_ps, data['label'])
end = time.time()
print('Fit time: ', round(end - start, 3))
pd.DataFrame(gs_fit.cv_results_).sort_values('mean_test_score', ascending=False).head()

Fit time:  54150.29


Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_learning_rate,param_max_depth,param_n_estimators,params,split0_test_score,split1_test_score,...,mean_test_score,std_test_score,rank_test_score,split0_train_score,split1_train_score,split2_train_score,split3_train_score,split4_train_score,mean_train_score,std_train_score
9,522.031952,11.573965,0.379298,0.042388,0.3,11,150,"{'learning_rate': 0.3, 'max_depth': 11, 'n_est...",0.975785,0.983857,...,0.977383,0.003834,1,1.0,1.0,1.0,1.0,1.0,1.0,0.0
7,350.827729,3.627007,0.352834,0.02858,0.3,7,150,"{'learning_rate': 0.3, 'max_depth': 7, 'n_esti...",0.976682,0.98296,...,0.977203,0.003253,2,1.0,1.0,1.0,1.0,1.0,1.0,0.0
11,412.56342,25.534372,0.312962,0.010733,0.3,15,150,"{'learning_rate': 0.3, 'max_depth': 15, 'n_est...",0.975785,0.984753,...,0.977203,0.004573,2,1.0,1.0,1.0,1.0,1.0,1.0,0.0
10,432.301239,16.106874,0.347346,0.013523,0.3,15,100,"{'learning_rate': 0.3, 'max_depth': 15, 'n_est...",0.976682,0.98565,...,0.977024,0.00478,4,1.0,1.0,1.0,1.0,1.0,1.0,0.0
8,373.915833,2.72982,0.343288,0.026163,0.3,11,100,"{'learning_rate': 0.3, 'max_depth': 11, 'n_est...",0.976682,0.98296,...,0.976485,0.003921,5,1.0,1.0,1.0,1.0,1.0,1.0,0.0


Lookint at results for TF-IDF, in `mean_fit_time`, for random forest, the most time consuming model took around 60 secondes to fit. Gradient boosting models are taking a minimum of 350 seconds, and up to almost 500. Move to `mean_train_score`, these models are getting perfect scores on the training set. If the model is overfitting, to the point of just memorizing the training set, then that's bad because it won't do well generalizing to the test set. But if the model is just getting so good that it's doing really well on the training set, and it can still generalize, then that's great. The `mean_test_score` tells us whether the model can generalize to data that it was not trained on. The `mean_test_score` are all just around 97.77%, or just below, and the best models are the ones with 150 estimators, and around 11 `max_depth`. So the very best model has 150 estimators, and a `max_depth` of 11.

In [31]:
# CountVectorizer
start = time.time()
gs_fit1 = gs.fit(X_count_feat_ps, data['label'])
end = time.time()
print('Fit time: ', round(end - start, 3))
pd.DataFrame(gs_fit1.cv_results_).sort_values('mean_test_score', ascending=False).head()

Fit time:  65641.601


Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_learning_rate,param_max_depth,param_n_estimators,params,split0_test_score,split1_test_score,...,mean_test_score,std_test_score,rank_test_score,split0_train_score,split1_train_score,split2_train_score,split3_train_score,split4_train_score,mean_train_score,std_train_score
7,339.770966,3.35238,0.339256,0.021179,0.3,7,150,"{'learning_rate': 0.3, 'max_depth': 7, 'n_esti...",0.980269,0.986547,...,0.978639,0.005784,1,1.0,1.0,1.0,1.0,1.0,1.0,0.0
9,2311.477935,2177.338841,0.330816,0.043589,0.3,11,150,"{'learning_rate': 0.3, 'max_depth': 11, 'n_est...",0.978475,0.986547,...,0.978101,0.005592,2,1.0,1.0,1.0,1.0,1.0,1.0,0.0
10,2206.522328,2181.255303,0.345557,0.024413,0.3,15,100,"{'learning_rate': 0.3, 'max_depth': 15, 'n_est...",0.975785,0.986547,...,0.977383,0.005491,3,1.0,1.0,1.0,1.0,1.0,1.0,0.0
11,421.516874,20.547864,0.322416,0.011674,0.3,15,150,"{'learning_rate': 0.3, 'max_depth': 15, 'n_est...",0.975785,0.98565,...,0.977203,0.00529,4,1.0,1.0,1.0,1.0,1.0,1.0,0.0
6,232.196359,6.406001,0.336145,0.034254,0.3,7,100,"{'learning_rate': 0.3, 'max_depth': 7, 'n_esti...",0.977578,0.984753,...,0.977024,0.005561,5,1.0,1.0,1.0,1.0,1.0,1.0,0.0


Looking at the results for the count vectorizing, the time it takes to fit the `mean_train_score`, and `mean_test_score`, are quite different, some are all pretty much right in line with TF-IDF, some are not. However, there's some difference on `mean_test_score`, all the test score results are just above 97.7%, and the best model reaches 97.8%. It's the one with 150 estimators, and a `max_depth` of 7. Thus, count vectorizing performs a little better than TF-IDF.

## Model selection

In [40]:
data1 = pd.read_csv('SMSSpamCollection', sep='\t')
data1.columns = ['label', 'body_text']

data1['body_len'] = data['body_text'].apply(lambda x: len(x) - x.count(' '))
data1['punct%'] = data['body_text'].apply(lambda x: count_punct(x))
data1['cap%'] = data['body_text'].apply(lambda x: count_cap(x))

### Split into train/test

In [41]:
X_train, X_test, y_train, y_test = train_test_split(data1[['body_text', 'body_len', 'punct%', 'cap%']],
                                                    data1['label'], test_size=0.2)

### Vectorize text

In [47]:
tfidf_vect = TfidfVectorizer(analyzer=clean_text_ps)
tfidf_vect_fit = tfidf_vect.fit(X_train['body_text'])

tfidf_train = tfidf_vect_fit.transform(X_train['body_text'])
tfidf_test = tfidf_vect_fit.transform(X_test['body_text'])

X_train_vect = pd.concat([X_train[['body_len', 'punct%', 'cap%']].reset_index(drop=True),
                          pd.DataFrame(tfidf_train.toarray())], axis=1)
X_test_vect = pd.concat([X_test[['body_len', 'punct%', 'cap%']].reset_index(drop=True),
                          pd.DataFrame(tfidf_test.toarray())], axis=1)

X_train_vect.head()

Unnamed: 0,body_len,punct%,cap%,0,1,2,3,4,5,6,...,6347,6348,6349,6350,6351,6352,6353,6354,6355,6356
0,129,4.7,8.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,156,10.9,3.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,124,3.2,5.6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,128,7.0,18.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,36,5.6,5.6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Final evaluation of models

In [51]:
rf = RandomForestClassifier(n_estimators=150, max_depth=None, n_jobs=-1)

start = time.time()
rf_model = rf.fit(X_train_vect, y_train)
end = time.time()
fit_time = end - start

start = time.time()
y_pred = rf_model.predict(X_test_vect)
end = time.time()
pred_time = end - start

precision, recall, fscore, train_support = score(y_test, y_pred, pos_label='spam', average='binary')
print('Fit time: {} / Predict time: {} ---- Precision: {} / Recall: {} / Accuracy: {}'.format(
    round(fit_time, 3), round(pred_time, 3), round(precision, 3), round(recall, 3),
    round((y_pred==y_test).sum()/len(y_test), 3)))

Fit time: 8.315 / Predict time: 0.223 ---- Precision: 1.0 / Recall: 0.853 / Accuracy: 0.982


In [54]:
gb = GradientBoostingClassifier(n_estimators=150, max_depth=11, learning_rate=0.3)

start = time.time()
gb_model = gb.fit(X_train_vect, y_train)
end = time.time()
fit_time = end - start

start = time.time()
y_pred = gb_model.predict(X_test_vect)
end = time.time()
pred_time = end -start

precision, recall, fscore, train_support = score(y_test, y_pred, pos_label='spam', average='binary')
print('Fit time: {} / Predict time: {} ---- Precision: {} / Recall: {} / Accuracy: {}'.format(
    round(fit_time, 3), round(pred_time, 3), round(precision, 3), round(recall, 3),
    round((y_test==y_pred).sum()/len(y_test), 3)))

Fit time: 338.49 / Predict time: 0.208 ---- Precision: 0.984 / Recall: 0.912 / Accuracy: 0.987


With particular focus on predict time, precision, and recall, let's do a final comparison. Once these models are fit, they wouldn't really ever be refit or retrained again until current model needs to be replaced. Thus we care more about predict time than fit time. Moreover, precision and recall can give a more in-depth look into how is the model's performence. Even though gradient boosting taks way longer than random forest does to fit, it actually takes less time to predict. In terms of precision and recall, random forest has much better precision at 100%, but gradient boosting has slightly better recall. So there will be a trade off when we pick up model. If we care more about precision than than we do predict time or recall, we can use random forest, and vice versa.

Further evaluation: Slice test set in a variety of different ways to understand how it does across a number of different dimensions, s.t. text meassages that have a length greater than 50 or that have zero punctuation. Examine text messages the model is getting wrong.

Results of trade-off &mdash; consider business context: Is predict time going to create a bottlenect? Spam filter &mdash; optimize for precision. Antivirus software &mdash; optimize for recall.