## ***Building Machine Learning Classifiers.***
### **Machine Learning**
- *The field of study that gives computers the ability to learn without being explicitly programmed*
- *A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.*
- *Algorithms that 'can figure out how to perform important tasks by generalizing from examples'*
- *Practice of using algorithms to parse data, learn from it, and then make a determination or prediction about something in the world*

#### **Two Broad Types of Machine Learning**
- `Supervised Learning` - *Task of inferring a function from labeled training data to make predictions on unseen data*
    - example: Predict whether any given email is spam based on known information about the email.
- `Unsupervised Learning` - *Deriving structure from data where we don't know the effect of any of the variables*
    - example: Based on the content of an email, group similar emails together in distinct folders.
    
#### **Holdout Test Set**
- Sample of data not used in fitting a model, set aside for the purpose of evaluating the model's ability to generalize to unseen data.

#### ***K-Fold Cross-Validation***
- This is used to evaluate the model.
- The full dataset is divided into k subsets and the holdout method is repeated k times. Each time, one of the k subsets is used as the test set and the other k-1 subsets are combined to train the model.
- This gives a more robust read on the performance of the model than a single holdout test set.<br>

**Illustration:**<br>
<img src='five-fold cross validation.png'>
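
A minimal sketch of how the folds are formed, assuming ten toy samples and scikit-learn's `KFold` (the array `X_toy` is made up for illustration and is not part of the SMS dataset). Each sample ends up in the test fold exactly once.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten toy samples; the values are placeholders for illustration only.
X_toy = np.arange(10).reshape(-1, 1)

kf = KFold(n_splits=5)  # 5 folds, no shuffling
seen_in_test = []
for fold, (train_idx, test_idx) in enumerate(kf.split(X_toy), start=1):
    # Each iteration holds out a different fifth of the data.
    print(f'Fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}')
    seen_in_test.extend(test_idx.tolist())
```

Every index appears in `seen_in_test` once, which is why the k scores together cover the whole dataset.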

#### ***Evaluation Metrics***

- **$Accuracy = \frac{\#\ predicted\ correctly}{total\ \#\ of\ observations}$**
- **$Precision = \frac{\#\ predicted\ as\ spam\ that\ are\ actually\ spam [True\ Positive]}{total\ \#\ predicted\ as\ spam}$**
- **$Recall = \frac{\#\ predicted\ as\ spam\ that\ are\ actually\ spam [True\ Positive]}{total\ \#\ that\ are\ actually\ spam}$**
<img src='Precision-Recall.png'>
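
As a worked example of the three formulas, here is a small sketch on eight made-up labels (the lists `y_true` and `y_pred` are hypothetical). It computes the metrics by hand and prints scikit-learn's values next to them as a cross-check.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical ground truth and predictions for eight messages.
y_true = ['spam', 'spam', 'spam', 'ham', 'ham', 'ham', 'ham', 'ham']
y_pred = ['spam', 'spam', 'ham', 'ham', 'ham', 'ham', 'ham', 'spam']

pairs = list(zip(y_true, y_pred))
tp = sum(t == 'spam' and p == 'spam' for t, p in pairs)  # spam predicted as spam
fp = sum(t == 'ham' and p == 'spam' for t, p in pairs)   # ham predicted as spam
fn = sum(t == 'spam' and p == 'ham' for t, p in pairs)   # spam predicted as ham

accuracy = sum(t == p for t, p in pairs) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(accuracy, accuracy_score(y_true, y_pred))
print(precision, precision_score(y_true, y_pred, pos_label='spam'))
print(recall, recall_score(y_true, y_pred, pos_label='spam'))
```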

> ***Precision and recall let you tailor the aggressiveness of your algorithm based on your business problem.***
<hr>
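
One common way to tune that aggressiveness is to move the score threshold above which a message is flagged. The sketch below assumes a model that outputs a spam score per message; the scores, the labels, and the helper `precision_recall_at` are all made up for illustration.

```python
def precision_recall_at(threshold, spam_scores, is_spam):
    """Return (precision, recall) when flagging every score >= threshold as spam."""
    predicted = [score >= threshold for score in spam_scores]
    tp = sum(p and t for p, t in zip(predicted, is_spam))
    fp = sum(p and not t for p, t in zip(predicted, is_spam))
    fn = sum((not p) and t for p, t in zip(predicted, is_spam))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical spam scores and true labels for six messages.
spam_scores = [0.95, 0.80, 0.60, 0.40, 0.30, 0.10]
is_spam = [True, True, False, True, False, False]

for threshold in (0.7, 0.5, 0.35):
    p, r = precision_recall_at(threshold, spam_scores, is_spam)
    print(f'threshold={threshold}: precision={p:.2f} recall={r:.2f}')
```

On this toy data a high threshold flags less and is always right, while a low threshold catches every spam message at the cost of one false alarm.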

### **Fitting a Machine Learning model on text data.**
> #### ***`Ensemble Method`***
> - ***Technique that creates multiple models and then combines them to produce better results than any of the single models individually.***
> - Idea is to ***`combine a lot of weak models to create a single strong model.`***
> - Basic idea - This leverages the ***aggregate opinion of many over the isolated opinion*** of one. 
>
> <img src='Ensemble-method.png'>
> <hr>
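
The "aggregate opinion" idea can be sketched with a simple majority vote. The three prediction lists and the helper `majority_vote` below are hypothetical; real ensembles may combine models in other ways (for example by averaging probabilities).

```python
from collections import Counter

def majority_vote(predictions_per_model):
    """Combine per-model label lists into one label per sample by majority."""
    combined = []
    for votes in zip(*predictions_per_model):
        # most_common(1) returns [(label, count)] for the most frequent label.
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined

# Hypothetical predictions of three weak models on the same three messages.
model_a = ['spam', 'ham', 'ham']
model_b = ['spam', 'spam', 'ham']
model_c = ['ham', 'spam', 'ham']

ensemble = majority_vote([model_a, model_b, model_c])
print(ensemble)
```

Each individual model gets one of the first two messages "wrong" relative to the others, but the vote smooths those disagreements out.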

> #### ***`Random Forest`*** 
> - ***Ensemble learning method that constructs a `collection of decision trees` and then aggregates the predictions of each tree to determine the final prediction***
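
A small sketch of that aggregation step, assuming synthetic data from `make_classification` rather than the SMS dataset. In scikit-learn the fitted trees are exposed as `estimators_`, and averaging their class probabilities reproduces the forest's own prediction.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic toy data, used only to have something to fit.
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=0)
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X_toy, y_toy)

# Average the class probabilities predicted by each individual tree.
tree_probas = np.mean(
    [tree.predict_proba(X_toy) for tree in forest.estimators_], axis=0
)
# The class with the highest averaged probability is the final prediction.
manual_pred = forest.classes_[np.argmax(tree_probas, axis=1)]

print(np.allclose(tree_probas, forest.predict_proba(X_toy)))
print((manual_pred == forest.predict(X_toy)).all())
```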

> ***`Benefits` of Random Forest***
> - *Can be used for classification or regression.*
> - *Easily handles outliers, missing values, etc.*
> - *Accepts various types of inputs (continuous, ordinal, etc.)*
> - *Less likely to overfit*
> - *Outputs feature importance*

> #### **Building a basic Random Forest model.**

In [1]:
# Read in and clean text

import nltk
import pandas as pd
import re 
from sklearn.feature_extraction.text import TfidfVectorizer
import string

stopwords = nltk.corpus.stopwords.words('english')
ps = nltk.PorterStemmer()

data = pd.read_csv('SMSSpamCollection.tsv', sep='\t', header=None)
data.columns = ['label', 'text_body']

# function for creating a new feature: percentage of punctuation characters in the text body
def punct_percentage(text):
    count = sum([1 for char in text if char in string.punctuation])
    return round(count/(len(text) - text.count(' ')), 3)*100

# create new feature
data['body_len'] = data['text_body'].apply(lambda x: len(x)-x.count(' '))
data['punct_percent'] = data['text_body'].apply(lambda x: punct_percentage(x))

# function to clean text
def clean_text(text):
    text = ''.join([word.lower() for word in text if word not in string.punctuation])
    tokens = re.split(r'\W+', text)  # raw string avoids the invalid-escape warning for \W
    text = [ps.stem(token) for token in tokens if token not in stopwords]
    return text

# vectorization
tfidf_vect = TfidfVectorizer(analyzer=clean_text)
X_tfidf = tfidf_vect.fit_transform(data['text_body'])

# concatenate the vectorized features with the newly created features
X_features = pd.concat([data['body_len'], 
                        data['punct_percent'], 
                        pd.DataFrame(X_tfidf.toarray())], axis=1)
X_features.head()

Unnamed: 0,body_len,punct_percent,0,1,2,3,4,5,6,7,...,8097,8098,8099,8100,8101,8102,8103,8104,8105,8106
0,160,2.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,128,4.7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,49,4.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,62,3.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,28,7.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


> ***Explore RandomForestClassifier Attributes & Hyperparameters***

In [2]:
from sklearn.ensemble import RandomForestClassifier

In [4]:
print(dir(RandomForestClassifier))
print(RandomForestClassifier())

['__abstractmethods__', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_abc_impl', '_estimator_type', '_get_param_names', '_get_tags', '_make_estimator', '_more_tags', '_required_parameters', '_set_oob_score', '_validate_X_predict', '_validate_estimator', '_validate_y_class_weight', 'apply', 'decision_path', 'feature_importances_', 'fit', 'get_params', 'predict', 'predict_log_proba', 'predict_proba', 'score', 'set_params']
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_sample

> ***Explore RandomForestClassifier through Cross-Validation***

In [3]:
from sklearn.model_selection import KFold, cross_val_score

In [4]:
rf = RandomForestClassifier(n_jobs=-1, random_state=42)      # n_jobs=-1 => use all available cores, so the decision trees are built in parallel
k_folds = KFold(n_splits=5)     #5-folds
cross_val_score(rf, X_features, data['label'], cv=k_folds, scoring='accuracy', n_jobs=-1)

array([0.97755835, 0.97845601, 0.97396768, 0.96585804, 0.97394429])
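
The five fold scores are usually summarized by their mean and spread. A short sketch using the values printed above:

```python
import numpy as np

# Accuracy per fold, copied from the cross_val_score output above.
scores = np.array([0.97755835, 0.97845601, 0.97396768, 0.96585804, 0.97394429])

# The mean is the headline number; the standard deviation shows fold-to-fold variation.
print(f'Mean accuracy: {scores.mean():.3f} (std: {scores.std():.3f})')
```

A small standard deviation relative to the mean suggests the model's accuracy does not depend much on which fold is held out.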

> #### ***Random Forest on a holdout test set***

In [6]:
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.model_selection import train_test_split as tts

In [7]:
X_train, X_test, y_train, y_test = tts(X_features, data['label'], test_size=0.2)

In [8]:
rf = RandomForestClassifier(n_estimators=50, max_depth=20, n_jobs=-1)
rf_model = rf.fit(X_train, y_train)

In [9]:
# print top 10 important features

sorted(zip(rf_model.feature_importances_, X_train.columns), reverse=True)[0:10]
# model.feature_importances_ gives a list without feature names, hence we zipped column names with it

[(0.07768423141760826, 'body_len'),
 (0.05079383461068237, 7353),
 (0.034540120207089194, 3135),
 (0.0318269854549568, 1804),
 (0.03160939230385166, 5727),
 (0.03122544832610075, 4799),
 (0.023014978482740363, 6749),
 (0.022408596703639518, 2032),
 (0.017864758929738024, 690),
 (0.017536690722193132, 392)]

In [10]:
# predict on X_test

y_pred = rf_model.predict(X_test)
precision, recall, fscore, support = score(y_test, y_pred, pos_label='spam', average='binary')

In [12]:
# Print precision, recall, and accuracy

print(f'Precision: {round(precision,3)} / Recall: {round(recall,3)} / Accuracy: {round(sum(y_pred==y_test)/len(y_pred), 3)}')

Precision: 1.0 / Recall: 0.62 / Accuracy: 0.944


> <img src='PR-results1.png'>
>
> - ***The `recall` tells us that only `62%` of `all actual spam` messages were predicted as `spam`, so this model is `not great`: `38% of the actual spam messages are predicted as ham.`***
> - *The share of spam messages classified as ham tells us that our model is* ***not good enough at identifying spam.***
>
> <hr>

> #### **Random Forest** model with ***`Grid-Search`***
> ***`Grid-search`***: **Exhaustively search all parameter combinations in a given grid to determine the best model.**
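
To make "all parameter combinations" concrete, here is a sketch that enumerates a grid with scikit-learn's `ParameterGrid`. The grid values are illustrative; any dictionary of lists works the same way.

```python
from sklearn.model_selection import ParameterGrid

# Illustrative grid: 3 values x 4 values = 12 combinations.
grid = {'n_estimators': [10, 50, 100], 'max_depth': [10, 20, 30, None]}

combinations = list(ParameterGrid(grid))
print(len(combinations))
for params in combinations[:3]:
    print(params)
```

The number of models to train grows multiplicatively with every hyperparameter added to the grid, which is why grids are usually kept small.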

> **Build `OWN` Grid Search**

In [13]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.model_selection import train_test_split as tts

In [14]:
X_train, X_test, y_train, y_test = tts(X_features, data['label'], test_size=0.2)

In [16]:
# function for training and evaluation, will help while grid-search
def train_RF(n_est, depth):
    rf = RandomForestClassifier(n_estimators=n_est, max_depth=depth, n_jobs=-1, random_state=42)
    rf_model = rf.fit(X_train, y_train)
    y_pred = rf_model.predict(X_test)
    precision, recall, fscore, support = score(y_test, y_pred, pos_label='spam', average='binary')    
    print(f'Est: {n_est} / Max_depth: {depth} <====> Precision: {round(precision,3)} / Recall: {round(recall,3)} / Accuracy: {round(sum(y_pred==y_test)/len(y_pred), 3)}')

In [17]:
for n_est in [10,50,100]:
    for depth in [10,20,30,None]:
        train_RF(n_est, depth)

Est: 10 / Max_depth: 10 <====> Precision: 1.0 / Recall: 0.234 / Accuracy: 0.885
Est: 10 / Max_depth: 20 <====> Precision: 1.0 / Recall: 0.599 / Accuracy: 0.94
Est: 10 / Max_depth: 30 <====> Precision: 0.991 / Recall: 0.689 / Accuracy: 0.952
Est: 10 / Max_depth: None <====> Precision: 1.0 / Recall: 0.796 / Accuracy: 0.969
Est: 50 / Max_depth: 10 <====> Precision: 1.0 / Recall: 0.234 / Accuracy: 0.885
Est: 50 / Max_depth: 20 <====> Precision: 1.0 / Recall: 0.659 / Accuracy: 0.949
Est: 50 / Max_depth: 30 <====> Precision: 1.0 / Recall: 0.76 / Accuracy: 0.964
Est: 50 / Max_depth: None <====> Precision: 1.0 / Recall: 0.868 / Accuracy: 0.98
Est: 100 / Max_depth: 10 <====> Precision: 1.0 / Recall: 0.21 / Accuracy: 0.882
Est: 100 / Max_depth: 20 <====> Precision: 1.0 / Recall: 0.665 / Accuracy: 0.95
Est: 100 / Max_depth: 30 <====> Precision: 1.0 / Recall: 0.754 / Accuracy: 0.963
Est: 100 / Max_depth: None <====> Precision: 0.986 / Recall: 0.868 / Accuracy: 0.978


> Now we can select the hyperparameters that best suit our results and business needs.
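
One possible way to make that selection explicit is to rank the runs by the metric that matters most. The sketch below copies a few rows from the output above and ranks them by recall, breaking ties by precision; this ranking rule is an assumption for illustration, not the only reasonable choice.

```python
# (n_estimators, max_depth, precision, recall, accuracy), copied from the output above.
results = [
    (10, None, 1.0, 0.796, 0.969),
    (50, 30, 1.0, 0.76, 0.964),
    (50, None, 1.0, 0.868, 0.98),
    (100, 30, 1.0, 0.754, 0.963),
    (100, None, 0.986, 0.868, 0.978),
]

# Rank by recall first, then by precision (assumed priority for this sketch).
ranked = sorted(results, key=lambda row: (row[3], row[2]), reverse=True)
best = ranked[0]
print(f'Best: n_estimators={best[0]}, max_depth={best[1]}')
```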

> #### ***Random Forest*** with ***`GridSearchCV`***
> - ***`Grid-search`***: *Exhaustively search all parameter combinations in a given grid to determine the best model.*
> - ***`Cross-Validation`***: *Divide a dataset into k subsets and repeat the holdout method k times where a different subset is used as the holdout set in each iteration.*
>
> Thus GridSearchCV is the combination of Grid-search and Cross-validation.
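
A compact, runnable sketch of that combination on synthetic data (the data from `make_classification` and the tiny grid are assumptions chosen so the example runs in a few seconds). With 4 parameter combinations and `cv=3`, the search fits 12 models plus one final refit.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the real feature matrix.
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=0)

param_grid = {'n_estimators': [5, 20], 'max_depth': [3, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X_toy, y_toy)

# cv_results_ holds one row per parameter combination.
results = pd.DataFrame(search.cv_results_)
print(results[['params', 'mean_test_score']]
      .sort_values('mean_test_score', ascending=False))
print(search.best_params_)
```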

In [18]:
# Read in and clean text

import nltk
import pandas as pd
import re 
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
import string

stopwords = nltk.corpus.stopwords.words('english')
ps = nltk.PorterStemmer()

data = pd.read_csv('SMSSpamCollection.tsv', sep='\t', header=None)
data.columns = ['label', 'text_body']

# function for creating a new feature: percentage of punctuation characters in the text body
def punct_percentage(text):
    count = sum([1 for char in text if char in string.punctuation])
    return round(count/(len(text) - text.count(' ')), 3)*100

# create new feature
data['body_len'] = data['text_body'].apply(lambda x: len(x)-x.count(' '))
data['punct_percent'] = data['text_body'].apply(lambda x: punct_percentage(x))

# function to clean text
def clean_text(text):
    text = ''.join([word.lower() for word in text if word not in string.punctuation])
    tokens = re.split(r'\W+', text)  # raw string avoids the invalid-escape warning for \W
    text = [ps.stem(token) for token in tokens if token not in stopwords]
    return text

# vectorization
# Tf-idf
tfidf_vect = TfidfVectorizer(analyzer=clean_text)
X_tfidf = tfidf_vect.fit_transform(data['text_body'])
X_tfidf_feat = pd.concat([data['body_len'], 
                        data['punct_percent'], 
                        pd.DataFrame(X_tfidf.toarray())], axis=1)

# CountVectorizer
count_vect = CountVectorizer(analyzer=clean_text)
X_count = count_vect.fit_transform(data['text_body'])
X_count_feat = pd.concat([data['body_len'], 
                        data['punct_percent'], 
                        pd.DataFrame(X_count.toarray())], axis=1)

> ***Exploring parameter settings using `GridSearchCV`***

In [19]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

In [None]:
rf = RandomForestClassifier()

# defining a grid of parameters for grid-search
param = {'n_estimators': [10,150,300,400],
        'max_depth': [30,60,90,None]} 

# create GridSearchCV object, you need to explicitly train this object
gs = GridSearchCV(rf, param_grid=param, cv=5, n_jobs=-1)
gs_tfidf_fit = gs.fit(X_tfidf_feat, data['label'])

# print out the top 5 models from the 'cv_results_' attribute of GridSearchCV
pd.DataFrame(gs_tfidf_fit.cv_results_).sort_values('mean_test_score', ascending=False)[0:5]

> results:
> <img src='gs_cv_tfidf_results.png'>

In [None]:
# create a GridSearchCV object for the count-vectorized features; it must be fit explicitly
gs = GridSearchCV(rf, param_grid=param, cv=5, n_jobs=-1)
gs_count_fit = gs.fit(X_count_feat, data['label'])

# print out the top 5 models from the 'cv_results_' attribute of GridSearchCV
pd.DataFrame(gs_count_fit.cv_results_).sort_values('mean_test_score', ascending=False)[0:5]

> results:
> <img src='gs_cv_countvect_results.png'>

> In practice, we should experiment a lot: try different feature engineering steps, different vectorizers, and different ML models. All of this was just a small example of the steps involved in fitting an NLP-based model.
> <hr>


> #### ***`Gradient Boosting`***
> - **Ensemble learning method** that takes an ***iterative approach to `combining weak learners` to create a `strong learner` by `focusing on mistakes of prior iterations`.*** 
> - Uses decision trees as well, but they're incredibly basic, like a ***`decision stump.`***
> - It evaluates what the first tree gets right and what it gets wrong, and with each following iteration it places a ***`heavier weight` on those observations that it got wrong*** (in gradient boosting this is done by fitting each new tree to the errors left by the trees before it). It repeats this, focusing on the examples it doesn't quite understand yet, until it has minimized the error as much as possible.
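
The iterative idea can be sketched from scratch for a regression problem with squared-error loss, where "focusing on mistakes" means fitting each new stump to the current residuals. This is a simplified illustration on synthetic data, not how `GradientBoostingClassifier` is implemented internally; the learning rate and the number of rounds are arbitrary choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression data: a noisy sine curve.
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 6, size=(200, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(0, 0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y_toy, y_toy.mean())  # start from a constant guess
initial_mse = np.mean((y_toy - prediction) ** 2)

stumps = []
for _ in range(100):
    residuals = y_toy - prediction                  # what is still being got wrong
    stump = DecisionTreeRegressor(max_depth=1).fit(X_toy, residuals)
    prediction += learning_rate * stump.predict(X_toy)  # small corrective step
    stumps.append(stump)

final_mse = np.mean((y_toy - prediction) ** 2)
print(f'MSE before boosting: {initial_mse:.4f}, after: {final_mse:.4f}')
```

Each stump on its own is a very weak model, but the sum of many small corrections follows the curve closely. Because every round depends on the residuals of the previous one, the rounds cannot be trained independently.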

> **How are RF and Gradient Boosting different?**
> <img src='rf-vs-gb.png'>

> ***`Bagging` vs `Boosting`*** 
>
> |Bagging|Boosting|
> |:-----:|:------:|
> |Samples Randomly|Samples with an increased weight on the ones that it got wrong previously|
> |Can be parallelized|Can't be parallelized; it is iterative, and each tree relies on the tree before it|
> |Less time taken to train|Much more time taken to train|
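
The "samples randomly" column refers to bootstrap sampling: each tree in a bagged ensemble sees rows drawn with replacement. A minimal sketch with NumPy, assuming a hypothetical dataset of 1000 rows:

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows = 1000  # hypothetical dataset size

# Draw n_rows row indices with replacement: one bootstrap sample for one tree.
bootstrap_idx = rng.integers(0, n_rows, size=n_rows)

# Some rows are drawn several times, others not at all.
unique_fraction = len(np.unique(bootstrap_idx)) / n_rows
print(f'Fraction of distinct rows in the sample: {unique_fraction:.3f}')
```

On average roughly 63% of the rows appear in a given bootstrap sample. Since each sample is drawn independently, the trees can be trained in parallel.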

> ***Gradient Boosting is typically more powerful and better-performing if tuned properly***

> **Trade-Offs of Gradient Boosting**:
> - `Pros`
>    - Extremely powerful
>    - Accepts various types of inputs
>    - Can be used for classification or regression
>    - Outputs feature importance
> - `Cons`
>    - Longer to train (can't parallelize)
>    - More likely to overfit
>    - More difficult to properly tune

> #### **Explore Gradient Boosting model with grid-search**

In [3]:
# Read in and clean text

import nltk
import pandas as pd
import re 
from sklearn.feature_extraction.text import TfidfVectorizer
import string

stopwords = nltk.corpus.stopwords.words('english')
ps = nltk.PorterStemmer()

data = pd.read_csv('SMSSpamCollection.tsv', sep='\t', header=None)
data.columns = ['label', 'text_body']

# function for creating a new feature: percentage of punctuation characters in the text body
def punct_percentage(text):
    count = sum([1 for char in text if char in string.punctuation])
    return round(count/(len(text) - text.count(' ')), 3)*100

# create new feature
data['body_len'] = data['text_body'].apply(lambda x: len(x)-x.count(' '))
data['punct_percent'] = data['text_body'].apply(lambda x: punct_percentage(x))

# function to clean text
def clean_text(text):
    text = ''.join([word.lower() for word in text if word not in string.punctuation])
    tokens = re.split(r'\W+', text)  # raw string avoids the invalid-escape warning for \W
    text = [ps.stem(token) for token in tokens if token not in stopwords]
    return text

# vectorization
tfidf_vect = TfidfVectorizer(analyzer=clean_text)
X_tfidf = tfidf_vect.fit_transform(data['text_body'])

# concate this vectorized features with the new made features
X_features = pd.concat([data['body_len'], 
                        data['punct_percent'], 
                        pd.DataFrame(X_tfidf.toarray())], axis=1)
X_features.head()

Unnamed: 0,body_len,punct_percent,0,1,2,3,4,5,6,7,...,8097,8098,8099,8100,8101,8102,8103,8104,8105,8106
0,160,2.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,128,4.7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,49,4.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,62,3.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,28,7.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [4]:
from sklearn.ensemble import GradientBoostingClassifier

print(dir(GradientBoostingClassifier))
print(GradientBoostingClassifier())

['_SUPPORTED_LOSS', '__abstractmethods__', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_abc_impl', '_check_initialized', '_check_params', '_clear_state', '_compute_partial_dependence_recursion', '_estimator_type', '_fit_stage', '_fit_stages', '_get_param_names', '_get_tags', '_init_state', '_is_initialized', '_make_estimator', '_more_tags', '_raw_predict', '_raw_predict_init', '_required_parameters', '_resize_state', '_staged_raw_predict', '_validate_estimator', '_validate_y', 'apply', 'decision_function', 'feature_importances_', 'fit', 'get_params', 'predict', 'predict_log_proba', 'predict_proba', 'score', 'set_params', 

In [5]:
# Build our own grid search

from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.model_selection import train_test_split as tts

X_train, X_test, y_train, y_test = tts(X_features, data['label'], test_size=0.2)

In [8]:
def train_GB(est, max_depth, lr):
    gb = GradientBoostingClassifier(n_estimators=est, max_depth=max_depth, learning_rate=lr)
    gb_model = gb.fit(X_train, y_train)
    y_pred = gb_model.predict(X_test)
    precision, recall, fscore, support = score(y_test, y_pred, pos_label='spam', average='binary')    
    # use the function's own argument `est` rather than the loop's global `n_est`
    print(f'Est: {est} / Max_depth: {max_depth} / LR: {lr} <====> Precision: {round(precision,3)} / Recall: {round(recall,3)} / Accuracy: {round(sum(y_pred==y_test)/len(y_pred), 3)}')

In [9]:
for n_est in [50,100,150]:
    for max_depth in [3,7,11]:
        for lr in [0.01,0.1,1]:
            train_GB(n_est, max_depth, lr)

  _warn_prf(average, modifier, msg_start, len(result))


Est: 50 / Max_depth: 3 / LR: 0.01 <====> Precision: 0.0 / Recall: 0.0 / Accuracy: 0.851
Est: 50 / Max_depth: 3 / LR: 0.1 <====> Precision: 0.959 / Recall: 0.705 / Accuracy: 0.952
Est: 50 / Max_depth: 3 / LR: 1 <====> Precision: 0.901 / Recall: 0.771 / Accuracy: 0.953
Est: 50 / Max_depth: 7 / LR: 0.01 <====> Precision: 1.0 / Recall: 0.006 / Accuracy: 0.852
Est: 50 / Max_depth: 7 / LR: 0.1 <====> Precision: 0.947 / Recall: 0.753 / Accuracy: 0.957
Est: 50 / Max_depth: 7 / LR: 1 <====> Precision: 0.899 / Recall: 0.807 / Accuracy: 0.958
Est: 50 / Max_depth: 11 / LR: 0.01 <====> Precision: 1.0 / Recall: 0.018 / Accuracy: 0.854
Est: 50 / Max_depth: 11 / LR: 0.1 <====> Precision: 0.928 / Recall: 0.777 / Accuracy: 0.958
Est: 50 / Max_depth: 11 / LR: 1 <====> Precision: 0.918 / Recall: 0.807 / Accuracy: 0.961
Est: 100 / Max_depth: 3 / LR: 0.01 <====> Precision: 0.966 / Recall: 0.512 / Accuracy: 0.925
Est: 100 / Max_depth: 3 / LR: 0.1 <====> Precision: 0.961 / Recall: 0.741 / Accuracy: 0.957
Est:

> **Insights**<br>
> <img src='poorly-performing-gb-gs-model.png'>
> <img src='best-performing-gb-gs-models.png'>

> #### **Evaluate Gradient Boosting with GridSearchCV**
> Just as we used GridSearchCV with RF, we can use `GridSearchCV` with Gradient Boosting.

> Here is just the syntax

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

In [None]:
gb = GradientBoostingClassifier()
param = {
    'n_estimators': [100,150],
    'max_depth': [7,11,15],
    'learning_rate': [0.1]
}

gs = GridSearchCV(gb, param, cv=5, n_jobs=-1)

# fitting data for tfidf vectorized data (X_tfidf_feat)
cv_fit = gs.fit(X_tfidf_feat, data['label'])
pd.DataFrame(cv_fit.cv_results_).sort_values('mean_test_score', ascending=False)[0:5]

# fitting data for count-vectorized data (X_count_feat)
cv_fit = gs.fit(X_count_feat, data['label'])
pd.DataFrame(cv_fit.cv_results_).sort_values('mean_test_score', ascending=False)[0:5]

> #### ***`Final Model Selection`***
> Now that we have completed the entire pipeline except for `final model selection`, we would tweak this final process a bit.
> ### ***Process*** (`new/tweaked`)
> - *Split the data* into training and test set.
> - *Train vectorizers on training set* and use that to *transform test set*. [IMPORTANT CHANGE]
> - Fit best random forest model and best gradient boosting model on training set and *predict on test set*.
> - Thoroughly evaluate results of these two models to *select best model.*
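
The important change is that the vectorizer only learns its vocabulary from the training set. A small sketch with made-up messages (the texts are hypothetical, and the default `TfidfVectorizer` tokenizer is used here instead of `clean_text` to keep the example self-contained):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical messages standing in for the train/test split.
train_texts = ['free prize call now', 'see you at lunch', 'call me later']
test_texts = ['free lunch voucher']

vect = TfidfVectorizer()
vect.fit(train_texts)                      # vocabulary comes from training data only

train_matrix = vect.transform(train_texts)
test_matrix = vect.transform(test_texts)   # words unseen in training are ignored

print(sorted(vect.vocabulary_))
print(train_matrix.shape, test_matrix.shape)
```

The word "voucher" never appears in the training messages, so it gets no column. This mirrors production use, where the model meets words it has never seen, and it keeps information from the test set out of the features.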

In [11]:
# Read and clean text
import nltk
import pandas as pd
import re
from sklearn.feature_extraction.text import TfidfVectorizer
import string

stopwords = nltk.corpus.stopwords.words('english')
ps = nltk.PorterStemmer()

data = pd.read_csv("SMSSpamCollection.tsv", sep='\t', header=None)  # the file has no header row
data.columns = ['label', 'body_text']

def count_punct(text):
    count = sum([1 for char in text if char in string.punctuation])
    return round(count/(len(text) - text.count(" ")), 3)*100

data['body_len'] = data['body_text'].apply(lambda x: len(x) - x.count(" "))
data['punct%'] = data['body_text'].apply(lambda x: count_punct(x))

def clean_text(text):
    text = "".join([word.lower() for word in text if word not in string.punctuation])
    tokens = re.split(r'\W+', text)  # raw string avoids the invalid-escape warning for \W
    text = [ps.stem(word) for word in tokens if word not in stopwords]
    return text

In [12]:
# Split the data into training and test set
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(data[['body_text', 'body_len', 'punct%']], data['label'], test_size=0.2)

In [13]:
# Vectorizing
tfidf_vect = TfidfVectorizer(analyzer=clean_text)
tfidf_vect_fit = tfidf_vect.fit(X_train['body_text'])

tfidf_train = tfidf_vect_fit.transform(X_train['body_text'])
tfidf_test = tfidf_vect_fit.transform(X_test['body_text'])

X_train_vect = pd.concat([X_train[['body_len', 'punct%']].reset_index(drop=True), 
           pd.DataFrame(tfidf_train.toarray())], axis=1)
X_test_vect = pd.concat([X_test[['body_len', 'punct%']].reset_index(drop=True), 
           pd.DataFrame(tfidf_test.toarray())], axis=1)

X_train_vect.head()

Unnamed: 0,body_len,punct%,0,1,2,3,4,5,6,7,...,7135,7136,7137,7138,7139,7140,7141,7142,7143,7144
0,63,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,31,6.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,129,3.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,21,4.8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,95,3.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
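
The `reset_index(drop=True)` calls in the vectorizing cell matter because `pd.concat(..., axis=1)` aligns rows by index label, and the rows of a shuffled split keep their original labels. A small sketch with made-up frames (`meta` and `vect` are hypothetical stand-ins):

```python
import pandas as pd

# Stand-in for X_train[['body_len']]: shuffled rows keep their original labels.
meta = pd.DataFrame({'body_len': [10, 20, 30]}, index=[7, 3, 5])
# Stand-in for the vectorized matrix: a fresh 0..n-1 index.
vect = pd.DataFrame({'tfidf_0': [0.1, 0.2, 0.3]})

# Without resetting, rows are matched by label and padded with NaN.
misaligned = pd.concat([meta, vect], axis=1)

# With the index reset, row i of one frame lines up with row i of the other.
aligned = pd.concat([meta.reset_index(drop=True), vect], axis=1)

print(misaligned.shape, aligned.shape)
```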


In [14]:
# Final evaluation of the model
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import precision_recall_fscore_support as score
import time

In [15]:
rf = RandomForestClassifier(n_estimators=150, max_depth=None, n_jobs=-1)

rf_model = rf.fit(X_train_vect, y_train)
y_pred = rf_model.predict(X_test_vect)

precision, recall, fscore, train_support = score(y_test, y_pred, pos_label='spam', average='binary')
print('Precision: {} / Recall: {} / Accuracy: {}'.format(
    round(precision, 3), round(recall, 3), round((y_pred==y_test).sum()/len(y_pred), 3)))

Precision: 1.0 / Recall: 0.879 / Accuracy: 0.985


In [16]:
gb = GradientBoostingClassifier(n_estimators=150, max_depth=11)

gb_model = gb.fit(X_train_vect, y_train)
y_pred = gb_model.predict(X_test_vect)

precision, recall, fscore, train_support = score(y_test, y_pred, pos_label='spam', average='binary')
print('Precision: {} / Recall: {} / Accuracy: {}'.format(
    round(precision, 3), round(recall, 3), round((y_pred==y_test).sum()/len(y_pred), 3)))

Precision: 0.907 / Recall: 0.83 / Accuracy: 0.968


> Checking the `time taken to train` and evaluate for both the models<br>
> This becomes a major factor for `model selection`

In [17]:
rf = RandomForestClassifier(n_estimators=150, max_depth=None, n_jobs=-1)

start = time.time()
rf_model = rf.fit(X_train_vect, y_train)
end = time.time()
fit_time = (end - start)

start = time.time()
y_pred = rf_model.predict(X_test_vect)
end = time.time()
pred_time = (end - start)

precision, recall, fscore, train_support = score(y_test, y_pred, pos_label='spam', average='binary')
print('Fit time: {} / Predict time: {} ---- Precision: {} / Recall: {} / Accuracy: {}'.format(
    round(fit_time, 3), round(pred_time, 3), round(precision, 3), round(recall, 3), round((y_pred==y_test).sum()/len(y_pred), 3)))

Fit time: 2.616 / Predict time: 0.191 ---- Precision: 1.0 / Recall: 0.865 / Accuracy: 0.983


In [18]:
gb = GradientBoostingClassifier(n_estimators=150, max_depth=11)

start = time.time()
gb_model = gb.fit(X_train_vect, y_train)
end = time.time()
fit_time = (end - start)

start = time.time()
y_pred = gb_model.predict(X_test_vect)
end = time.time()
pred_time = (end - start)

precision, recall, fscore, train_support = score(y_test, y_pred, pos_label='spam', average='binary')
print('Fit time: {} / Predict time: {} ---- Precision: {} / Recall: {} / Accuracy: {}'.format(
    round(fit_time, 3), round(pred_time, 3), round(precision, 3), round(recall, 3), round((y_pred==y_test).sum()/len(y_pred), 3)))

Fit time: 128.681 / Predict time: 0.058 ---- Precision: 0.902 / Recall: 0.844 / Accuracy: 0.969


> ***Insights***<br>
> - For the final comparison we should ideally compare `fit time`, `predict time`, `precision`, `recall`, and `accuracy` between the two models, with a particular *focus* on **predict time, precision and recall**.<br>
> - The **Gradient Boosting model** takes far `more time to train` (about 129 s vs. 2.6 s here), but it takes `less time to predict` (0.058 s vs. 0.191 s). For real-time use, prediction speed can be a reason to prefer the Gradient Boosting model.<br>
> - In this run RF has better precision (1.0 vs. 0.902) and slightly better recall (0.865 vs. 0.844).
> - The final selection has to be made based on the business needs.
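
The repeated start/end bookkeeping in the timing cells can be wrapped in a small helper. This is a sketch under assumptions: the helper name `timed` is hypothetical, the data is synthetic, and `time.perf_counter` is used because it is intended for measuring short intervals.

```python
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

def timed(func, *args):
    """Call func(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start

# Synthetic stand-in for the vectorized training data.
X_toy, y_toy = make_classification(n_samples=300, n_features=10, random_state=0)
model = RandomForestClassifier(n_estimators=20, random_state=0)

_, fit_seconds = timed(model.fit, X_toy, y_toy)
y_hat, predict_seconds = timed(model.predict, X_toy)
print(f'Fit time: {fit_seconds:.3f} s / Predict time: {predict_seconds:.3f} s')
```

Single measurements are noisy, so repeating each timing a few times and comparing the typical value gives a fairer comparison between models.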

> ### **Two Final Points**
> - Further evaluation
>    - Slice test set
>    - Examine text messages the model is getting wrong
> - **Result trade-off**
>    - Make decisions based on **business problem or the business context**
>    - Is predict time of 0.191 vs. 0.058 seconds going to create a bottleneck?
>    - Precision/recall
>       - Spam filter- optimize for precision (when it says spam, it better be spam)
>       - Antivirus software- optimize for recall
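
One way to encode "optimize for precision" or "optimize for recall" in a single number is the F-beta score, which weights recall `beta` times as much as precision. The sketch below applies the standard formula to the precision and recall values from the timed runs above; the helper `f_beta` is written here for illustration.

```python
def f_beta(precision, recall, beta):
    """F-beta score: beta < 1 favors precision, beta > 1 favors recall."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Precision and recall copied from the timed runs above.
rf_scores = {'precision': 1.0, 'recall': 0.865}
gb_scores = {'precision': 0.902, 'recall': 0.844}

for beta in (0.5, 1, 2):
    rf_f = f_beta(rf_scores['precision'], rf_scores['recall'], beta)
    gb_f = f_beta(gb_scores['precision'], gb_scores['recall'], beta)
    print(f'beta={beta}: RF={rf_f:.3f} GB={gb_f:.3f}')
```

A spam filter might be judged with `beta=0.5` and antivirus software with `beta=2`; with these particular numbers RF scores higher in both cases.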

> In our case, RF has better precision, recall at least as good as that of GB, and a predict time that is not large, hence we can easily **select RF as the final model** here!