# Building Machine Learning Classifiers: Model selection

### Read in & clean text

In [1]:
import nltk
import pandas as pd
import re
from sklearn.feature_extraction.text import TfidfVectorizer
import string

stopwords = nltk.corpus.stopwords.words('english')
ps = nltk.PorterStemmer()

data = pd.read_csv("SMSSpamCollection.tsv", sep='\t')
data.columns = ['label', 'body_text']

def count_punct(text):
    count = sum([1 for char in text if char in string.punctuation])
    return round(count/(len(text) - text.count(" ")), 3)*100

data['body_len'] = data['body_text'].apply(lambda x: len(x) - x.count(" "))
data['punct%'] = data['body_text'].apply(lambda x: count_punct(x))

def clean_text(text):
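    # Drop punctuation character by character and lowercase what remains
    text = "".join([char.lower() for char in text if char not in string.punctuation])
    # Raw string so that \W is not treated as an invalid string escape
    tokens = re.split(r'\W+', text)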
    text = [ps.stem(word) for word in tokens if word not in stopwords]
    return text

### Split into train/test

In [2]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(data[['body_text', 'body_len', 'punct%']], data['label'], test_size=0.2)

### Vectorize text

In [3]:
tfidf_vect = TfidfVectorizer(analyzer=clean_text)
tfidf_vect_fit = tfidf_vect.fit(X_train['body_text'])

tfidf_train = tfidf_vect_fit.transform(X_train['body_text'])
tfidf_test = tfidf_vect_fit.transform(X_test['body_text'])

X_train_vect = pd.concat([X_train[['body_len', 'punct%']].reset_index(drop=True), 
           pd.DataFrame(tfidf_train.toarray())], axis=1)
X_test_vect = pd.concat([X_test[['body_len', 'punct%']].reset_index(drop=True), 
           pd.DataFrame(tfidf_test.toarray())], axis=1)

X_train_vect.head()

Unnamed: 0,body_len,punct%,0,1,2,3,4,5,6,7,...,7153,7154,7155,7156,7157,7158,7159,7160,7161,7162
0,19,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,115,3.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,106,2.8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,29,3.4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,152,4.6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Final evaluation of models

In [4]:
# Import the two classifiers, the scoring function, and a timer
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import precision_recall_fscore_support as score
import time

In [5]:
rf = RandomForestClassifier(n_estimators=150, max_depth=None, n_jobs=-1)

start = time.time()
rf_model = rf.fit(X_train_vect, y_train)
end = time.time()
fit_time = (end - start)

start = time.time()
y_pred = rf_model.predict(X_test_vect)
end = time.time()
pred_time = (end - start)

precision, recall, fscore, train_support = score(y_test, y_pred, pos_label='spam', average='binary')
print('Fit time: {} / Predict time: {} ---- Precision: {} / Recall: {} / Accuracy: {}'.format(
    round(fit_time, 3), round(pred_time, 3), round(precision, 3), round(recall, 3), round((y_pred==y_test).sum()/len(y_pred), 3)))

Fit time: 1.782 / Predict time: 0.213 ---- Precision: 1.0 / Recall: 0.81 / Accuracy: 0.975


In [6]:
gb = GradientBoostingClassifier(n_estimators=150, max_depth=11)

start = time.time()
gb_model = gb.fit(X_train_vect, y_train)
end = time.time()
fit_time = (end - start)

start = time.time()
y_pred = gb_model.predict(X_test_vect)
end = time.time()
pred_time = (end - start)

precision, recall, fscore, train_support = score(y_test, y_pred, pos_label='spam', average='binary')
print('Fit time: {} / Predict time: {} ---- Precision: {} / Recall: {} / Accuracy: {}'.format(
    round(fit_time, 3), round(pred_time, 3), round(precision, 3), round(recall, 3), round((y_pred==y_test).sum()/len(y_pred), 3)))

Fit time: 186.61 / Predict time: 0.135 ---- Precision: 0.889 / Recall: 0.816 / Accuracy: 0.962


For the final comparison, we compare fit time, predict time, precision, recall, and accuracy between the two models, with a particular focus on predict time, precision, and recall.

The reason is that once a model is fit, you generally store it so it can make predictions later on. It wouldn't be refit or retrained until you decide that the current model needs to be replaced. That's why we care more about predict time than fit time.

Precision and recall give you a more in-depth look at how your model is performing than accuracy does. That's why we focus on those three metrics.
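To see why accuracy alone can hide a problem, here is a minimal sketch on made-up labels (not the SMS data). The helper name `binary_metrics` is hypothetical, and returning 0.0 when a metric is undefined is a convention chosen for this sketch only.

```python
def binary_metrics(y_true, y_pred, pos_label='spam'):
    """Return (precision, recall, accuracy) for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == pos_label and p == pos_label)
    fp = sum(1 for t, p in pairs if t != pos_label and p == pos_label)
    fn = sum(1 for t, p in pairs if t == pos_label and p != pos_label)
    tn = sum(1 for t, p in pairs if t != pos_label and p != pos_label)
    # Assumption of this sketch: report 0.0 when the denominator is zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / len(pairs)
    return precision, recall, accuracy

# Made-up, imbalanced example: 2 spam messages out of 10.
y_true = ['spam'] * 2 + ['ham'] * 8
# A useless "model" that labels every message as ham.
y_pred = ['ham'] * 10

precision, recall, accuracy = binary_metrics(y_true, y_pred)
print('Precision: {} / Recall: {} / Accuracy: {}'.format(precision, recall, accuracy))
```

On this toy data the accuracy is 0.8 even though the model never catches a single spam message, which recall (0.0) makes obvious.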

Even though GradientBoosting takes far longer than RandomForest to fit (about 187 seconds versus about 1.8 seconds), it takes less time to predict (0.135 versus 0.213 seconds). In terms of precision and recall, our RandomForest has much better precision at 100% (versus 88.9%), but GradientBoosting has slightly better recall (81.6% versus 81.0%).

Now we find ourselves in a situation where, no matter which model we pick, we're making some kind of trade-off. If we pick RandomForest, that means we care more about precision than about predict time or recall, and vice versa. This kind of trade-off is very common, which brings me to a couple of very important points.

**Further Evaluation:**
First, you'll generally dive into the metrics much more than we are here. We wouldn't base the decision only on overall precision, recall, and predict time. We'd split our test set in a variety of ways to understand how the model does across a number of different dimensions.

For example, let's look only at text messages that have a length greater than 50 and see how our model does there. Or let's look at text messages that have zero punctuation and see how our model does there. We'd slice the data in a variety of ways to really understand where the model is doing well and where it isn't. That would also include looking at the specific text messages that the model is getting wrong.
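The slicing idea can be sketched roughly as follows. The rows below are made up for illustration and do not come from the SMS data; the `pred` field and the helper `precision_recall` are hypothetical names. With the notebook's own objects, the same filters could be written as boolean masks on `X_test['body_len']` and `X_test['punct%']`.

```python
def precision_recall(rows, pos_label='spam'):
    """Return (precision, recall) for a list of rows; None where undefined."""
    tp = sum(1 for r in rows if r['label'] == pos_label and r['pred'] == pos_label)
    fp = sum(1 for r in rows if r['label'] != pos_label and r['pred'] == pos_label)
    fn = sum(1 for r in rows if r['label'] == pos_label and r['pred'] != pos_label)
    precision = tp / (tp + fp) if (tp + fp) else None
    recall = tp / (tp + fn) if (tp + fn) else None
    return precision, recall

# Made-up test rows: features, true label, and a model's prediction.
rows = [
    {'body_len': 120, 'punct%': 4.2, 'label': 'spam', 'pred': 'spam'},
    {'body_len': 95,  'punct%': 0.0, 'label': 'spam', 'pred': 'ham'},
    {'body_len': 30,  'punct%': 0.0, 'label': 'ham',  'pred': 'ham'},
    {'body_len': 60,  'punct%': 3.1, 'label': 'ham',  'pred': 'ham'},
    {'body_len': 22,  'punct%': 0.0, 'label': 'spam', 'pred': 'ham'},
    {'body_len': 140, 'punct%': 5.0, 'label': 'spam', 'pred': 'spam'},
]

# Each slice is a named subset of the test rows.
slices = {
    'body_len > 50': [r for r in rows if r['body_len'] > 50],
    'punct% == 0': [r for r in rows if r['punct%'] == 0],
}
report = {name: precision_recall(subset) for name, subset in slices.items()}

# The individual messages the model got wrong, for manual inspection.
errors = [r for r in rows if r['label'] != r['pred']]

for name, (precision, recall) in report.items():
    print('{}: precision={} / recall={}'.format(name, precision, recall))
print('Misclassified rows: {}'.format(len(errors)))
```

In this toy data every missed spam message has zero punctuation, which is the kind of pattern that overall metrics would not reveal.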

**Results Trade-off:**
The second point is that after a thorough training and evaluation process, you usually end up with some kind of trade-off, like the one we have here between performance and predict time. In that case, and this is very important, you make your decision based on the business problem or the business context. That means asking: is a longer predict time going to create a huge bottleneck in your process? In some business contexts, a model that takes over 0.2 seconds to predict might be a deal breaker, so you might have no choice but to go with the GradientBoosting model.
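As a rough sketch of that check, the snippet below filters the two models by a latency budget using the predict times printed earlier. The 0.2 second budget is the hypothetical limit from the example above, and `time_predict` and `within_budget` are illustrative helper names, not part of the notebook.

```python
import time

def time_predict(predict_fn, batch, repeats=5):
    """Return the best wall-clock time in seconds over several predict calls."""
    best = float('inf')
    for _ in range(repeats):
        start = time.perf_counter()
        predict_fn(batch)
        best = min(best, time.perf_counter() - start)
    return best

def within_budget(measured_seconds, budget_seconds):
    """True if the measured predict time fits the latency budget."""
    return measured_seconds <= budget_seconds

# Predict times printed by the two cells above (seconds for the whole test set).
reported = {'RandomForest': 0.213, 'GradientBoosting': 0.135}
budget = 0.2  # hypothetical business limit, in seconds

eligible = [name for name, seconds in reported.items() if within_budget(seconds, budget)]
print('Models within the {}s budget: {}'.format(budget, eligible))
```

With a fitted model, `time_predict(rf_model.predict, X_test_vect)` would give a fresh measurement. Timings from a single run vary between machines and runs, so repeating the call and keeping the best time is one simple way to reduce noise.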

Additionally, most problems have a higher cost on either false positives, which means you would prioritize precision, or false negatives, which means you'd prioritize recall. For instance, with a spam filter, you can probably deal with some spam in your inbox here and there, but you don't want the filter to capture real emails, so here we'd prioritize precision. When it says a message is spam, it had better be spam. In this case, false positives are very costly.

The second case would be something like anti-virus software. A false positive, where the software says you have a virus but you really don't, can be scary without a doubt, but if you're getting hacked and your software doesn't catch it, that's much worse. In this case, we should optimize for recall so that if there's a breach, the model catches it.

**Summary:** With all that said, assuming that predict time is not a deal breaker for your business problem, and you don't have a clear answer on whether false positives or false negatives are more costly, the model you'd probably select here is the RandomForest model. That's because its precision is so much better than the GradientBoosting model's (1.0 versus 0.889), and its recall is very close (0.81 versus 0.816).
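One way to make the precision-versus-recall weighting explicit is the F-beta score, which the notebook itself does not use; it is brought in here purely as an illustration. A beta below 1 weights precision more heavily, and a beta above 1 weights recall more heavily. The sketch plugs in the precision and recall values printed above, and `f_beta` is a hand-written helper, not a library call.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score: (1 + beta^2) * P * R / (beta^2 * P + R)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Precision and recall printed by the two evaluation cells above.
results = {
    'RandomForest': {'precision': 1.0, 'recall': 0.81},
    'GradientBoosting': {'precision': 0.889, 'recall': 0.816},
}

winners = {}
for beta in (0.5, 1.0, 2.0):
    scores = {name: f_beta(m['precision'], m['recall'], beta)
              for name, m in results.items()}
    winners[beta] = max(scores, key=scores.get)
    summary = ' / '.join('{}: {}'.format(name, round(s, 3)) for name, s in scores.items())
    print('beta={} ---- {}'.format(beta, summary))
```

For this particular run, RandomForest scores higher at all three weightings, which is consistent with choosing it when the relative cost of false positives and false negatives is unclear. A different train/test split could change these numbers.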