# Building Machine Learning Classifiers: Model selection

## Machine Learning Pipeline:
1. Read in raw text.
2. Clean and tokenize the text.
3. Engineer features.
4. Fit a simple model.
5. Tune hyperparameters and evaluate each setting using GridSearchCV.
6. Select the final model.

Vectorizers behave like models: they must be fit on the training set, and the fitted object is then reused to transform the test set. For a vectorizer, fitting essentially means storing the vocabulary, that is, every term that appears in the training set (a TF-IDF vectorizer also stores the IDF weight of each term). When the fitted vectorizer transforms the test set, it creates columns only for those stored terms, so any word that appears in the test set but not in the training set is dropped from the vectorized test data. Up to this point, we've been fitting the vectorizer on the entire data set rather than on the training set alone, because that makes it easier to work with GridSearchCV. In this section we fit it on the training set only.
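To make the fit/transform behavior concrete, here is a minimal sketch on three made-up messages (not the SMS data). It uses `TfidfVectorizer` with its default tokenizer rather than the `clean_text` analyzer defined below, so the exact tokens differ from what the notebook produces.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up messages, for illustration only.
train_docs = ["free prize now", "call me now"]
test_docs = ["free lunch now"]

vect = TfidfVectorizer()
vect.fit(train_docs)                     # stores the training vocabulary

vocab = sorted(vect.vocabulary_)         # terms learned from the training set
test_matrix = vect.transform(test_docs)  # one column per *training* term

print(vocab)
print(test_matrix.shape)
print("lunch" in vect.vocabulary_)       # "lunch" was never seen during fit
```

The transformed test matrix has exactly as many columns as there are training terms, and "lunch" contributes nothing because it has no column.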

**Process:**
1. Split the data into a training set and a test set.
2. Fit the vectorizer on the training set and use it to transform both the training set and the test set.
3. Fit the best gradient boosting model and the best random forest model on the training set, then predict on the test set.
4. Evaluate the results of these two models to select the final model.
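The four steps above can be sketched end to end as follows. This is only an outline under stated assumptions: the corpus is a tiny made-up stand-in for the SMS data, the `n_estimators` values are placeholders rather than the tuned settings from the grid search, and the `body_len` and `punct%` features are left out for brevity.

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import train_test_split

# Tiny made-up corpus standing in for the SMS data (illustrative only).
texts = [
    "win a free prize now", "free cash call now", "claim your free prize",
    "urgent win cash now", "free entry win prize", "call now to claim cash",
    "are you coming home", "see you at lunch", "call me when you are home",
    "lunch at noon today", "coming home late today", "see you later today",
]
labels = ["spam"] * 6 + ["ham"] * 6

# 1. Split into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels)

# 2. Fit the vectorizer on the training set only, then transform both sets.
vect = TfidfVectorizer()
X_train_vect = vect.fit_transform(X_train)
X_test_vect = vect.transform(X_test)

# 3. Fit both candidates on the training set (placeholder hyperparameters).
candidates = {
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
}

# 4. Score each candidate on the held-out test set.
scores = {}
for name, model in candidates.items():
    model.fit(X_train_vect, y_train)
    y_pred = model.predict(X_test_vect)
    precision, recall, fscore, _ = precision_recall_fscore_support(
        y_test, y_pred, pos_label="spam", average="binary", zero_division=0)
    scores[name] = (precision, recall, fscore)

for name, (precision, recall, fscore) in scores.items():
    print(f"{name}: precision={precision:.3f} recall={recall:.3f} f1={fscore:.3f}")
```

With a corpus this small the scores themselves mean nothing; the point is the order of operations, in particular that the test set is never seen by `fit`.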

### Read in & clean text

In [1]:
import nltk
import pandas as pd
import re
from sklearn.feature_extraction.text import TfidfVectorizer
import string

stopwords = nltk.corpus.stopwords.words('english')
ps = nltk.PorterStemmer()

data = pd.read_csv("SMSSpamCollection.tsv", sep='\t')
data.columns = ['label', 'body_text']

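def count_punct(text):
    # Percentage of non-space characters that are punctuation.
    non_space = len(text) - text.count(" ")
    count = sum(1 for char in text if char in string.punctuation)
    # Round after scaling to avoid float artifacts; guard against empty messages.
    return round(count / non_space * 100, 1) if non_space else 0.0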

data['body_len'] = data['body_text'].apply(lambda x: len(x) - x.count(" "))
data['punct%'] = data['body_text'].apply(lambda x: count_punct(x))

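def clean_text(text):
    # Lowercase the message and strip punctuation, character by character.
    text = "".join([char.lower() for char in text if char not in string.punctuation])
    tokens = re.split(r'\W+', text)  # raw string avoids an invalid-escape warning
    # Drop stopwords and stem the remaining tokens.
    text = [ps.stem(word) for word in tokens if word not in stopwords]
    return text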

### Split into train/test

In [2]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(data[['body_text', 'body_len', 'punct%']], data['label'], test_size=0.2)

### Vectorize text

In [3]:
tfidf_vect = TfidfVectorizer(analyzer=clean_text)
tfidf_vect_fit = tfidf_vect.fit(X_train['body_text'])

tfidf_train = tfidf_vect_fit.transform(X_train['body_text'])
tfidf_test = tfidf_vect_fit.transform(X_test['body_text'])

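# train_test_split shuffles the rows, so reset the index to line up with the fresh 0..n-1 index of the TF-IDF frame.
X_train_vect = pd.concat([X_train[['body_len', 'punct%']].reset_index(drop=True),
           pd.DataFrame(tfidf_train.toarray())], axis=1)
X_test_vect = pd.concat([X_test[['body_len', 'punct%']].reset_index(drop=True),
           pd.DataFrame(tfidf_test.toarray())], axis=1)
# Newer scikit-learn releases reject column names that mix str and int, so make them all strings.
X_train_vect.columns = X_train_vect.columns.astype(str)
X_test_vect.columns = X_test_vect.columns.astype(str)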

X_train_vect.head()

Unnamed: 0,body_len,punct%,0,1,2,3,4,5,6,7,...,7177,7178,7179,7180,7181,7182,7183,7184,7185,7186
0,27,3.7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,27,3.7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,27,3.7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,33,6.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,39,5.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


The vectorizer fit on the full data set generated over 8,000 features; in other words, it recognized over 8,000 unique terms. This one contains just over 7,000 (the TF-IDF columns run from 0 to 7186), because it was fit only on the training data. That means roughly 1,000 terms occur only in the test set. The vectorizer does not recognize them, so they are dropped when the test set is transformed.
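The arithmetic behind that gap can be illustrated with plain sets: the vocabulary of the full data set is the union of the training and test vocabularies, so the features that disappear are exactly the terms found only in the test set. The sketch below uses a hypothetical `vocabulary` helper with simple whitespace tokenization and made-up messages; it is a stand-in for the idea, not for `clean_text` or scikit-learn's tokenizer.

```python
def vocabulary(docs):
    """Return the set of lowercase, whitespace-separated tokens in docs."""
    return {token for doc in docs for token in doc.lower().split()}

# Made-up messages, for illustration only.
train_docs = ["free prize now", "call me now"]
test_docs = ["free lunch now", "lunch at noon"]

train_vocab = vocabulary(train_docs)
test_vocab = vocabulary(test_docs)

# Terms a vectorizer fit on train_docs would never have seen.
unseen = test_vocab - train_vocab
full_vocab = train_vocab | test_vocab

print(sorted(unseen))
print(len(full_vocab) - len(train_vocab) == len(unseen))
```

Applied to the real split, the same set difference would list the specific terms that the training-only vectorizer ignores.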