# Model Selection

### Machine Learning Pipeline

1. Read in raw text
2. Clean text and tokenize
3. Feature engineering
4. Fit simple model
5. Tune hyperparameters and evaluate with GridSearchCV
6. Final model selection

Previously we've been bending the rules a bit with regard to our vectorizers.  
Vectorizers are like models: they should be fit on the training set and only used to transform the test set. When a fitted vectorizer transforms the test set, it creates columns only for the words that were in the training set. Any word that appears in the test set but not in the training set will not show up in the vectorized version of the test set.  
Up to this point we've been fitting the vectorizer on the entire dataset instead of just the training set, because that makes things easier with grid search; separating the two steps would require an introduction to scikit-learn pipelines.  
That's why we're going to tweak the process a little as we go into the final model selection.  
  
Process
1. Split the data into training and test set
2. Train vectorizers on training set and use that to transform test set
3. Fit best random forest model and best gradient boosting model on training set and predict on test set
4. Evaluate results of these two models to select the best one
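
For reference, the scikit-learn pipelines mentioned earlier are one way to keep the vectorizer inside the cross-validation loop, so that it is refit on the training part of every fold. The sketch below is an assumption about how that could look, not what this notebook does: the toy messages, the step names `tfidf` and `rf`, and the small parameter grid are all made up, and the extra `body_len` and `punct%` features are left out for brevity.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy messages and labels, made up for illustration
texts = [
    "win a free prize now", "free cash call now",
    "claim your free prize", "urgent free offer call",
    "are you coming home", "see you at lunch",
    "call me when you are home", "lunch at noon tomorrow",
]
labels = ["spam", "spam", "spam", "spam", "ham", "ham", "ham", "ham"]

# The vectorizer is a pipeline step, so each CV fold fits it on that fold's training part only
pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("rf", RandomForestClassifier(random_state=0)),
])

# Parameters of a step are addressed as <step name>__<parameter name>
param_grid = {"rf__n_estimators": [10, 50], "rf__max_depth": [None, 5]}

search = GridSearchCV(pipe, param_grid, cv=2)
search.fit(texts, labels)
print(search.best_params_)
```

In this notebook we take the simpler route described in the process above: one explicit train/test split, with the vectorizer fit on the training set only.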

### Read in & clean text

In [7]:
import nltk
import pandas as pd
import re
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
import string

stopwords = nltk.corpus.stopwords.words('english')
ps = nltk.PorterStemmer()

data = pd.read_csv("SMSSpamCollection.tsv", sep='\t')
data.columns = ['label', 'body_text']

def count_punct(text):
    count = sum([1 for char in text if char in string.punctuation])
    return round(count/(len(text) - text.count(" ")), 3)*100

data['body_len'] = data['body_text'].apply(lambda x: len(x) - x.count(" "))
data['punct%'] = data['body_text'].apply(lambda x: count_punct(x))

def clean_text(text):
    text = "".join([word.lower() for word in text if word not in string.punctuation])
    tokens = re.split(r'\W+', text)  # raw string avoids an invalid escape sequence warning
    text = [ps.stem(word) for word in tokens if word not in stopwords]
    return text

### Split into train/test

In [8]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(data[['body_text', 'body_len', 'punct%']], data['label'], test_size=0.2)

### Vectorize text

In [9]:
tfidf_vect = TfidfVectorizer(analyzer=clean_text)

# fit
tfidf_vect_fit = tfidf_vect.fit(X_train['body_text'])

# transform
tfidf_train = tfidf_vect_fit.transform(X_train['body_text'])
tfidf_test = tfidf_vect_fit.transform(X_test['body_text'])

# create features
X_train_vect = pd.concat([
    # X_train keeps the row index from the original dataset, while the new TF-IDF DataFrame
    # gets a fresh 0..n-1 index. The messages are still in the same order, so resetting the
    # index (and dropping the old one) makes the indices of the two DataFrames match.
    X_train[['body_len', 'punct%']].reset_index(drop=True),
    pd.DataFrame(tfidf_train.toarray(), columns=tfidf_vect.get_feature_names_out())
], axis=1)
# axis=1 concatenates side by side; axis=0 would stack X_train on top of tfidf_train

X_test_vect = pd.concat([
    X_test[['body_len', 'punct%']].reset_index(drop=True),
    pd.DataFrame(tfidf_test.toarray(), columns=tfidf_vect.get_feature_names_out())
], axis=1)

X_train_vect.head()

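| | body_len | punct% | | 0 | 008704050406 | 0089mi | 0121 | 01223585236 | 01223585334 | 020603 | ... | zhong | zindgi | zoe | zogtoriu | zouk | zyada | é | ü | üll | 〨ud |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 57 | 24.6 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 60 | 5.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 39 | 7.7 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 138 | 8.7 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 145 | 2.1 | 0.0 | 0.0 | 0.0 | 0.274595 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |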


As we can see, we now have only about 7k words, compared with about 8k when we fit the vectorizer on the full dataset rather than just the training set.  
This tells us that around 1k words appear only in the test set and won't be recognized by the vectorizer.
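
The gap between the two vocabulary sizes is just a set difference. A minimal sketch with made-up token lists (the names `train_tokens` and `test_tokens` are hypothetical and stand in for the cleaned, tokenized messages):

```python
# Toy tokenized messages, made up for illustration
train_tokens = [["free", "prize", "now"], ["call", "me", "now"]]
test_tokens = [["free", "pizza", "now"], ["call", "mum"]]

# Vocabulary = set of distinct tokens seen in each split
train_vocab = {tok for msg in train_tokens for tok in msg}
test_vocab = {tok for msg in test_tokens for tok in msg}

# Words a vectorizer fit on the training split would never have seen
unseen = sorted(test_vocab - train_vocab)

print(len(train_vocab))               # vocabulary size when fitting on train only
print(len(train_vocab | test_vocab))  # vocabulary size when fitting on everything
print(unseen)
```

Here the training vocabulary has 5 words, the combined vocabulary has 7, and the 2 missing words ("mum" and "pizza") are exactly the ones that occur only in the test split.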

### Final evaluation

In [10]:
import time
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import precision_recall_fscore_support as score

In [11]:
# Random Forest

rf = RandomForestClassifier(n_estimators=150, max_depth=None, n_jobs=-1)

start = time.time()
rf_model = rf.fit(X_train_vect, y_train)
end = time.time()
fit_time = round((end - start), 3)

start = time.time()
y_pred = rf_model.predict(X_test_vect)
end = time.time()
pred_time = round((end - start), 3)

precision, recall, fscore, support = score(y_test, y_pred, pos_label='spam', average='binary')
precision = round(precision, 3)
recall = round(recall, 3)
accuracy = round((y_pred==y_test).sum() / len(y_pred), 3)

print(f'Fit time: {fit_time} / Predict time: {pred_time} / Precision: {precision} / Recall: {recall} / Accuracy: {accuracy}')

Fit time: 2.4 / Predict time: 0.192 / Precision: 1.0 / Recall: 0.867 / Accuracy: 0.982


In [12]:
# Gradient Boosting

gb = GradientBoostingClassifier(n_estimators=150, max_depth=15)

start = time.time()
gb_model = gb.fit(X_train_vect, y_train)
end = time.time()
fit_time = round((end - start), 3)

start = time.time()
y_pred = gb_model.predict(X_test_vect)
end = time.time()
pred_time = round((end - start), 3)

precision, recall, fscore, support = score(y_test, y_pred, pos_label='spam', average='binary')
precision = round(precision, 3)
recall = round(recall, 3)
accuracy = round((y_pred==y_test).sum() / len(y_pred), 3)

print(f'Fit time: {fit_time} / Predict time: {pred_time} / Precision: {precision} / Recall: {recall} / Accuracy: {accuracy}')

Fit time: 158.618 / Predict time: 0.145 / Precision: 0.883 / Recall: 0.853 / Accuracy: 0.965
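
The Random Forest and Gradient Boosting cells above repeat the same timing and scoring steps. One possible way to avoid the duplication is a small helper; `fit_and_evaluate` is a hypothetical name, and the toy numeric data below is made up so the sketch runs on its own.

```python
import time

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_fscore_support as score


def fit_and_evaluate(model, X_train, y_train, X_test, y_test, pos_label="spam"):
    """Hypothetical helper: fit, predict, and return timings plus rounded metrics."""
    start = time.time()
    model.fit(X_train, y_train)
    fit_time = round(time.time() - start, 3)

    start = time.time()
    y_pred = model.predict(X_test)
    pred_time = round(time.time() - start, 3)

    # zero_division=0 avoids a warning if the model never predicts the positive class
    precision, recall, _, _ = score(
        y_test, y_pred, pos_label=pos_label, average="binary", zero_division=0
    )
    accuracy = float(np.mean(np.asarray(y_pred) == np.asarray(y_test)))

    return {
        "fit_time": fit_time,
        "pred_time": pred_time,
        "precision": round(float(precision), 3),
        "recall": round(float(recall), 3),
        "accuracy": round(accuracy, 3),
    }


# Toy, well-separated data, made up for illustration
X_train_toy = [[0, 0], [0, 1], [5, 5], [6, 5]]
y_train_toy = ["ham", "ham", "spam", "spam"]
X_test_toy = [[0, 0.5], [5.5, 5]]
y_test_toy = ["ham", "spam"]

results = fit_and_evaluate(
    RandomForestClassifier(n_estimators=10, random_state=0),
    X_train_toy, y_train_toy, X_test_toy, y_test_toy,
)
print(results)
```

In this notebook the equivalent call would be something like `fit_and_evaluate(rf, X_train_vect, y_train, X_test_vect, y_test)`, and the same for `gb`.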


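The results show that although Gradient Boosting takes far longer to fit (about 158.6 s versus 2.4 s), it is slightly faster at prediction (0.145 s versus 0.192 s), so its run-time performance when serving predictions will be slightly better. Precision, recall, and accuracy are all better for Random Forest, with the largest gap in precision (1.0 versus 0.883).  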
  
The final decision should rest on business needs and context. For a spam filter, for example, false positives are very costly, so we may want the model with better precision, because we don't want the filter to capture real emails.  
 
Antivirus software, by contrast, may be optimized for recall, because missing an actual virus is much worse than raising a false positive.  
The difference in prediction speed can matter a lot for systems where model activity is a potential bottleneck.
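
To illustrate the precision/recall trade-off described above, here is a small sketch with made-up labels for two hypothetical filters: a cautious one that never flags a real message but misses some spam, and an aggressive one that catches all spam but also flags a real message.

```python
from sklearn.metrics import precision_score, recall_score

# Toy ground truth and predictions, made up for illustration
y_true = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "ham"]
y_cautious = ["spam", "spam", "ham", "ham", "ham", "ham", "ham", "ham"]
y_aggressive = ["spam", "spam", "spam", "spam", "ham", "ham", "ham", "ham"]

# Precision: of the messages flagged as spam, how many really are spam
# Recall: of the real spam messages, how many were flagged
cautious_precision = precision_score(y_true, y_cautious, pos_label="spam")
cautious_recall = recall_score(y_true, y_cautious, pos_label="spam")
aggressive_precision = precision_score(y_true, y_aggressive, pos_label="spam")
aggressive_recall = recall_score(y_true, y_aggressive, pos_label="spam")

print(f"cautious:   precision={cautious_precision:.3f} recall={cautious_recall:.3f}")
print(f"aggressive: precision={aggressive_precision:.3f} recall={aggressive_recall:.3f}")
```

The cautious filter reaches precision 1.0 with recall 2/3, which is the profile a spam filter would usually prefer; the aggressive one reaches recall 1.0 with precision 0.75, which is closer to what an antivirus scanner might accept.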