### Mission

We recently used Naive Bayes to classify spam in this [SMSSpamClassifier dataset](http://localhost:8888/edit/Notebooks/Supervised%20Learning/Ensemble%20Learning/data/SMSSpamCollection). \
In this notebook, we will expand on the previous analysis by using a few of the new techniques we've learned in ensemble learning.

First we will recreate the Naive Bayes model, then apply the ensemble techniques to the same dataset and compare all of the models.

#### 1. Recreate the Naive Bayes model on the Spam dataset

In [1]:
# Import the necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score


# Read in our dataset
df = pd.read_table('./data/SMSSpamCollection', sep = '\t', names = ['label', 'sms_message'])

# Fix our response value
df['label'] = df.label.map({'ham': 0, 'spam': 1})

# Split our dataset into training and testing data
# Pass the labels as a 1-D array; a 2-D column vector triggers a DataConversionWarning on fit
X_train, X_test, y_train, y_test = train_test_split(df['sms_message'].values,
                                                    df['label'].values,
                                                    random_state = 1)

# Instantiate the CountVectorizer method
count_vector = CountVectorizer()

# Fit the training data and then return the matrix
training_data = count_vector.fit_transform(X_train)

# Transform testing data and return the matrix. Note we are not fitting the testing data into the CountVectorizer()
testing_data = count_vector.transform(X_test)

# Instantiate our model
naive_bayes = MultinomialNB()

# Fit our model to the training data
naive_bayes.fit(training_data, y_train)

# Predict on the test data
predictions = naive_bayes.predict(testing_data)

# Score our model
print('Accuracy score: ', format(accuracy_score(y_test, predictions)))
print('Precision score: ', format(precision_score(y_test, predictions)))
print('Recall score: ', format(recall_score(y_test, predictions)))
print('F1 score: ', format(f1_score(y_test, predictions)))

Accuracy score:  0.9885139985642498
Precision score:  0.9720670391061452
Recall score:  0.9405405405405406
F1 score:  0.9560439560439562




#### 2. Now try different Ensemble techniques and compare the models 

We can see from the scores above that our Naive Bayes model actually does a pretty good job of classifying "spam" and "ham." However, let's take a look at a few additional models to see if we can improve on it.

Specifically, we will take a look at the following techniques:

* [BaggingClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html#sklearn.ensemble.BaggingClassifier)
* [RandomForestClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier)
* [AdaBoostClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html#sklearn.ensemble.AdaBoostClassifier)

A really useful guide to ensemble methods can be found [in the documentation here](http://scikit-learn.org/stable/modules/ensemble.html).

Since we have already cleaned and vectorized the text above, we can now focus on the machine learning part.

In general, there is a five-step process that can be used each time you want to use a supervised learning method (the same one we followed above for Naive Bayes):

1. **Import** the model.
2. **Instantiate** the model with the hyperparameters of interest.
3. **Fit** the model to the training data.
4. **Predict** on the test data.
5. **Score** the model by comparing the predictions to the actual values.
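
As a minimal sketch of these five steps in isolation, the block below runs them on a handful of made-up messages. The toy texts, labels, and the `toy_`-prefixed names are hypothetical and are not part of the SMS dataset or of the notebook's own variables; `random_state` is set only so the sketch is repeatable.

```python
# 1. Import the model (plus the helpers needed for the toy data and scoring)
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score

# Hypothetical toy messages (1 = spam, 0 = ham), not taken from the SMS dataset
toy_train_texts = ["win a free prize now", "free cash prize waiting",
                   "are we still on for lunch", "see you at home tonight"]
toy_train_labels = [1, 1, 0, 0]
toy_test_texts = ["free prize now", "lunch at home"]
toy_test_labels = [1, 0]

# Vectorize: fit on the training text only, then reuse the vocabulary on the test text
toy_vectorizer = CountVectorizer()
X_train_toy = toy_vectorizer.fit_transform(toy_train_texts)
X_test_toy = toy_vectorizer.transform(toy_test_texts)

# 2. Instantiate the model with the hyperparameters of interest
toy_model = RandomForestClassifier(n_estimators=50, random_state=0)

# 3. Fit the model to the training data
toy_model.fit(X_train_toy, toy_train_labels)

# 4. Predict on the test data
toy_preds = toy_model.predict(X_test_toy)

# 5. Score the model by comparing the predictions to the actual values
toy_accuracy = accuracy_score(toy_test_labels, toy_preds)
print('Toy accuracy:', toy_accuracy)
```

The cells that follow apply the same sequence to the real `training_data` and `testing_data` matrices, one step per cell.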

In [2]:
# Import the Bagging, RandomForest, and AdaBoost Classifier

from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier

Now instantiate each of the classifiers with the hyperparameters mentioned in each comment:

In [3]:
# Instantiate a BaggingClassifier with:
# 200 weak learners (n_estimators) and everything else as default values

bag_classifier = BaggingClassifier(n_estimators = 200)


# Instantiate a RandomForestClassifier with:
# 200 weak learners (n_estimators) and everything else as default values

random_classifier = RandomForestClassifier(n_estimators = 200)


# Instantiate an AdaBoostClassifier with:
# 300 weak learners (n_estimators) and a learning_rate of 0.2

ada_classifier = AdaBoostClassifier(n_estimators = 300, learning_rate = 0.2)

Fit each model using the `training_data` and `y_train` obtained above in the Naive Bayes section.

In [4]:
# Fit your BaggingClassifier to the training data

bag_classifier.fit(training_data, y_train)


# Fit your RandomForestClassifier to the training data

random_classifier.fit(training_data, y_train)


# Fit your AdaBoostClassifier to the training data

ada_classifier.fit(training_data, y_train)



AdaBoostClassifier(learning_rate=0.2, n_estimators=300)

It's time to get the predictions from the above models using the `testing_data`.

In [5]:
# Predict using BaggingClassifier on the test data

y_pred_bag = bag_classifier.predict(testing_data)


# Predict using RandomForestClassifier on the test data

y_pred_randforest = random_classifier.predict(testing_data)


# Predict using AdaBoostClassifier on the test data

y_pred_ada = ada_classifier.predict(testing_data)

Define a function that prints the score values for our models. \
We will also use it on Naive Bayes to compare the performance side by side.

In [6]:
def print_metrics(y_true, preds, model_name = None):
    '''
    INPUT:
    y_true - the y values that are actually true in the dataset (NumPy array or pandas series)
    preds - the predictions for those values from some model (NumPy array or pandas series)
    model_name - (str - optional) a name associated with the model if you would like to add it to the print statements 
    
    OUTPUT:
    None - prints the accuracy, precision, recall, and F1 score
    '''
    if model_name is None:
        print('Accuracy score: ', format(accuracy_score(y_true, preds)))
        print('Precision score: ', format(precision_score(y_true, preds)))
        print('Recall score: ', format(recall_score(y_true, preds)))
        print('F1 score: ', format(f1_score(y_true, preds)))
        print('\n\n')
    
    else:
        print('Accuracy score for ' + model_name + ' :' , format(accuracy_score(y_true, preds)))
        print('Precision score ' + model_name + ' :', format(precision_score(y_true, preds)))
        print('Recall score ' + model_name + ' :', format(recall_score(y_true, preds)))
        print('F1 score ' + model_name + ' :', format(f1_score(y_true, preds)))
        print('\n\n')

Now print the metrics for individual models:

In [7]:
# Print Bagging scores

print_metrics(y_test, y_pred_bag, 'BaggingClassifier')


# Print Random Forest scores

print_metrics(y_test, y_pred_randforest, 'RandomForestsClassifier')


# Print AdaBoost scores

print_metrics(y_test, y_pred_ada, 'AdaboostClassifier')


# Naive Bayes Classifier scores

print_metrics(y_test, predictions, 'NaiveBayesClassifier')

Accuracy score for BaggingClassifier : 0.9741564967695621
Precision score BaggingClassifier : 0.9116022099447514
Recall score BaggingClassifier : 0.8918918918918919
F1 score BaggingClassifier : 0.9016393442622951



Accuracy score for RandomForestsClassifier : 0.9834888729361091
Precision score RandomForestsClassifier : 1.0
Recall score RandomForestsClassifier : 0.8756756756756757
F1 score RandomForestsClassifier : 0.9337175792507205



Accuracy score for AdaboostClassifier : 0.9770279971284996
Precision score AdaboostClassifier : 0.9693251533742331
Recall score AdaboostClassifier : 0.8540540540540541
F1 score AdaboostClassifier : 0.9080459770114943



Accuracy score for NaiveBayesClassifier : 0.9885139985642498
Precision score NaiveBayesClassifier : 0.9720670391061452
Recall score NaiveBayesClassifier : 0.9405405405405406
F1 score NaiveBayesClassifier : 0.9560439560439562
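
On the scores printed above, Naive Bayes has the highest accuracy, recall, and F1 score, while the random forest has the highest precision (1.0) but lower recall. To make that kind of comparison easier to read, one option is to collect the metrics into a single table instead of printing them model by model. The sketch below assumes a hypothetical helper, `metrics_table`, and made-up demo labels and predictions; none of these names or values come from the notebook.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score


def metrics_table(y_true, preds_by_model):
    """Return a DataFrame with one row per model and one column per metric."""
    names = []
    rows = []
    for name, preds in preds_by_model.items():
        names.append(name)
        rows.append({
            'accuracy': accuracy_score(y_true, preds),
            'precision': precision_score(y_true, preds),
            'recall': recall_score(y_true, preds),
            'f1': f1_score(y_true, preds),
        })
    return pd.DataFrame(rows, index=names)


# Hypothetical labels and predictions, used only to demonstrate the helper
demo_true = [1, 0, 1, 1, 0, 0]
demo_preds = {
    'model_a': [1, 0, 1, 0, 0, 0],  # misses one spam message
    'model_b': [1, 1, 1, 1, 0, 0],  # flags one ham message as spam
}

demo_table = metrics_table(demo_true, demo_preds)
print(demo_table.sort_values('f1', ascending=False))
```

In this notebook, the same helper could be called with `y_test` and a dictionary mapping each model name to `y_pred_bag`, `y_pred_randforest`, `y_pred_ada`, and `predictions`.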



