## In-sample evaluation and cross-validation

### Naive Bayes: spam filter
- We are building a spam filter.
- The model needs spam/ham labels to train on. In this dataset the labeling is already done for us.


In [None]:
# Import the necessary libraries.
import numpy as np
import pandas as pd
import scipy
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score

%matplotlib inline

In [None]:
# Grab and process the raw data.
data_path = ("https://raw.githubusercontent.com/Thinkful-Ed/data-201-resources/"
             "master/sms_spam_collection/SMSSpamCollection"
            )
sms_raw = pd.read_csv(data_path, delimiter= '\t', header=None)
sms_raw.columns = ['spam', 'message']

In [None]:
# Let's look at the data.
sms_raw

#### Data analysis
- We have two columns: a label and a message.
- We have to get features from this data. A raw message isn't really a feature, but we can engineer features from the message field relatively easily. This kind of feature engineering is a basic version of what we'll cover in the NLP section.
- The most obvious feature is whether the message contains a given word, for example one of keywords = ['click', 'offer', 'winner', 'buy', 'free', 'cash', 'urgent'].
- Let's add those columns to our dataframe. The words chosen below are simply intuited as possibly having something to do with spam. Try some of your own ideas too!
- Note that you can add new features to the dataframe simply by adding words to the keywords list.

In [None]:
keywords = ['click', 'offer', 'winner', 'buy', 'free', 'cash', 'urgent']

# Check whether each message contains any of these words. If it does, it may be spam.
for key in keywords:
    # Note that we add spaces around the key so that we're getting the word,
    # not just pattern matching. This misses a keyword at the very start or
    # end of a message, or one that sits next to punctuation.
    sms_raw[str(key)] = sms_raw.message.str.contains(
        ' ' + str(key) + ' ',
        case=False
    )
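
The space-padded match above can be compared with a regex word-boundary match. The following is a minimal sketch, assuming a handful of made-up toy messages (not rows from the SMS dataset) and a hypothetical `features` dataframe. It uses `\b` so that a keyword is found at the start or end of a message and next to punctuation, but not inside a longer word such as "carefree".

```python
import re
import pandas as pd

# Hypothetical toy messages for illustration; these are not from the dataset.
messages = pd.Series([
    "Free entry now!",
    "Click here for an offer",
    "I feel carefree today",
    "call me, it is urgent.",
])

keywords = ['click', 'offer', 'free', 'urgent']

features = pd.DataFrame({'message': messages})
for key in keywords:
    # \b marks a word boundary; re.escape guards against keywords that
    # contain regex metacharacters.
    pattern = r'\b' + re.escape(key) + r'\b'
    features[key] = messages.str.contains(pattern, case=False, regex=True)

print(features[keywords])
```

With space padding, "Free entry now!" would not be flagged for `free` because the word starts the message; the word-boundary version flags it while still ignoring "carefree".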

In [None]:
# Look at the data again, now with the keyword columns added.
sms_raw

#### Data analysis
- Another possible feature is whether the message is written entirely in uppercase. That seems kind of spammy, doesn't it?

In [None]:
# Is the message all uppercase?
sms_raw['allcaps'] = sms_raw.message.str.isupper()

In [None]:
# Before we go further, let's turn the spam column into a boolean so we can easily 
# do some statistics to prepare for modeling.

sms_raw['spam'] = (sms_raw['spam'] == 'spam')
# Note that if you run this cell a second time everything will become false.
# So... Don't.

In [None]:
sms_raw

### Naive Bayes assumptions
- As we covered before, one of the main assumptions of Naive Bayes is that the variables fed into the model are independent of each other (strictly, conditionally independent given the class).
- Let's check how well that holds in this case using pandas' built-in correlation matrix function, corr(), and the heatmap from seaborn.
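
For reference, the independence assumption is what lets Naive Bayes factor the class posterior into a product of per-feature terms. The notation here is ours: $y$ is the spam label and $x_1, \dots, x_n$ are the boolean features.

$$
P(y \mid x_1, \dots, x_n) \;\propto\; P(y) \prod_{i=1}^{n} P(x_i \mid y)
$$

Strongly correlated features are effectively counted more than once in this product, which is why the correlation check is worth doing.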

In [None]:
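plt.figure(figsize=(20,15))
# Drop the text column first: corr() only makes sense for the numeric/boolean
# columns, and recent pandas versions raise an error on string columns.
sns.heatmap(sms_raw.drop(columns='message').corr(), linewidths=1, annot=True, cmap='coolwarm')
plt.show()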

#### Data analysis
- Most of the keywords show little correlation with each other. The only exceptions are free:offer and cash:winner.

### Building model: Pick out your training data


In [None]:
# Training data
# scikit-learn requires you to specify an outcome (y, or dependent variable)
# and your inputs (x, or independent variables).
# We'll do that below under the names data and target.

data = sms_raw[keywords + ['allcaps']]
target = sms_raw['spam']

In [None]:
#data

In [None]:
#target

### Create Model

In [None]:
# Our data is binary / boolean, so we're importing the Bernoulli classifier.
from sklearn.naive_bayes import BernoulliNB

# Instantiate our model and store it in a new variable.
bnb = BernoulliNB()

# Fit our model to the data.
bnb.fit(data, target)

# Classify, storing the result in a new variable.
y_pred = bnb.predict(data)

# Display our results.
print("Number of mislabeled points out of a total {} points : {}".format(
    data.shape[0],
    (target != y_pred).sum()
))

## Check accuracy

In [None]:
# Calculate the accuracy of your model here.
# Display our results.
print("Number of mislabeled points out of a total {} points : {}, performance {:05.2f}%".format(
    data.shape[0],
    (target != y_pred).sum(),
    100*(1-(target != y_pred).sum()/data.shape[0])
))

## Use a confusion matrix to see FP and FN

In [None]:
confusion_matrix(target, y_pred)

## Build your own confusion matrix to see FP and FN

In [None]:
# Build your confusion matrix and calculate sensitivity and specificity here.
# cm = [[TN, FP], [FN, TP]]: true negative, false positive, false negative, true positive

def manually_calculate_cm(test_y, test_predictions):
    """Manually create a confusion matrix by comparing predictions with answers."""
    TP = 0
    TN = 0
    FP = 0
    FN = 0

    # Iterate over label/prediction pairs instead of indexing by position,
    # so this also works when test_y is a Series with a non-default index.
    for actual, predicted in zip(test_y, test_predictions):
        if not actual and not predicted:
            TN += 1
        elif not actual and predicted:
            FP += 1
        elif actual and predicted:
            TP += 1
        else:
            FN += 1

    cm = [[TN, FP], [FN, TP]]
    return cm

manual = manually_calculate_cm(target, y_pred)

# Compare against scikit-learn's result; the two should match.
print("sklearn cm:\n", confusion_matrix(target, y_pred))
print("Manual:", manual)
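
Sensitivity and specificity follow directly from the four counts in the confusion matrix. The sketch below assumes the same `[[TN, FP], [FN, TP]]` layout used above; the helper name `sensitivity_specificity` and the example counts are hypothetical and are not results from this model.

```python
def sensitivity_specificity(cm):
    """Return (sensitivity, specificity) for cm = [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    # Sensitivity: share of actual positives (spam) that were caught.
    sensitivity = tp / (tp + fn) if (tp + fn) else float('nan')
    # Specificity: share of actual negatives (ham) that were left alone.
    specificity = tn / (tn + fp) if (tn + fp) else float('nan')
    return sensitivity, specificity

# Made-up counts for illustration only.
example_cm = [[90, 10], [30, 70]]
sens, spec = sensitivity_specificity(example_cm)
print("Sensitivity: {:.2f}, specificity: {:.2f}".format(sens, spec))
```

In the notebook you would pass in `manual`, or `confusion_matrix(target, y_pred).tolist()`, instead of `example_cm`.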

## Test your model with different holdout groups

In [None]:
# Test your model with different holdout groups.

from sklearn.model_selection import train_test_split
# Use train_test_split to create the necessary training and test groups
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.2, random_state=20)
print('With 20% Holdout: ' + str(bnb.fit(X_train, y_train).score(X_test, y_test)))
print('Testing on Sample: ' + str(bnb.fit(data, target).score(data, target)))
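
A single 20% holdout depends on which rows happen to land in the test set. One way to see that variation is to repeat the split with several seeds. This is a minimal sketch on synthetic boolean data (the names `X_demo`, `y_demo` and `holdout_scores` are hypothetical), so the numbers it prints say nothing about the spam model itself.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB

# Synthetic boolean features and labels, standing in for data and target.
rng = np.random.default_rng(1)
X_demo = rng.integers(0, 2, size=(500, 4)).astype(bool)
y_demo = X_demo[:, 0] | X_demo[:, 2]

holdout_scores = []
for seed in range(5):
    # Each seed gives a different 80/20 split of the same data.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.2, random_state=seed
    )
    model = BernoulliNB().fit(X_tr, y_tr)
    holdout_scores.append(model.score(X_te, y_te))

print("Holdout accuracies:", np.round(holdout_scores, 3))
print("Spread (max - min):", round(max(holdout_scores) - min(holdout_scores), 3))
```

If the scores vary a lot between seeds, the model is sensitive to the particular split, which is one sign of overfitting.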

## Cross Validation - with multiple holdouts (folds)

In [None]:
cross_val_score(bnb, data, target, cv=10)

## Cross Validation: your own method
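
One possible way to write cross-validation by hand is sketched below. It assumes shuffled, non-stratified folds, so its scores will not match `cross_val_score` exactly (for classifiers, scikit-learn stratifies the folds by default). The function name `manual_cross_val_score` and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.base import clone
from sklearn.naive_bayes import BernoulliNB

def manual_cross_val_score(model, X, y, n_folds=10, seed=0):
    """Score a model on n_folds held-out folds, training on the rest each time."""
    X = np.asarray(X)
    y = np.asarray(y)
    # Shuffle the row indices, then cut them into n_folds roughly equal parts.
    indices = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(indices, n_folds)

    scores = []
    for i, test_idx in enumerate(folds):
        # Train on every fold except the i-th one.
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        fold_model = clone(model)  # fresh, unfitted copy for each fold
        fold_model.fit(X[train_idx], y[train_idx])
        scores.append(fold_model.score(X[test_idx], y[test_idx]))
    return np.array(scores)

# Synthetic boolean data for illustration only.
rng = np.random.default_rng(42)
X_demo = rng.integers(0, 2, size=(200, 5)).astype(bool)
y_demo = X_demo[:, 0] | X_demo[:, 1]

scores = manual_cross_val_score(BernoulliNB(), X_demo, y_demo, n_folds=5)
print("Fold accuracies:", np.round(scores, 3))
print("Mean accuracy:", round(scores.mean(), 3))
```

In the notebook the call would be `manual_cross_val_score(bnb, data, target, n_folds=10)`.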

## Perform your additional evaluation here
- Using the topics introduced earlier in this lesson, try a more in-depth evaluation of the model: look at the kinds of errors it generates and at what accuracy we would get by just guessing randomly. A confusion matrix is a useful way to show the different kinds of errors.
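
To put the model's accuracy in context, it helps to compute what simple guessing strategies would score. The sketch below assumes three baselines of our own choosing: always predicting the majority class, guessing spam at the observed spam rate, and a fair coin flip. The helper name `baseline_accuracies` and the toy labels are hypothetical.

```python
import numpy as np

def baseline_accuracies(y):
    """Expected accuracy of three guessing strategies for boolean labels y."""
    y = np.asarray(y, dtype=bool)
    p_spam = y.mean()
    return {
        # Always predict whichever class is more common.
        'majority_class': max(p_spam, 1 - p_spam),
        # Guess spam with probability p_spam, independently of the message:
        # correct when guess and truth agree, p^2 + (1 - p)^2.
        'proportional_guess': p_spam ** 2 + (1 - p_spam) ** 2,
        # Fair coin flip, correct half the time whatever the class balance.
        'coin_flip': 0.5,
    }

# Toy labels for illustration: 2 spam messages out of 10.
toy_target = [True, True] + [False] * 8
baselines = baseline_accuracies(toy_target)
for name, acc in baselines.items():
    print("{}: {:.2f}".format(name, acc))
```

In the notebook, `baseline_accuracies(target)` gives the numbers to compare against the model's accuracy. When classes are imbalanced, the majority-class baseline can already be high, so accuracy alone can flatter a model.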

## Use confusion matrix again to see FP and FN
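
The earlier confusion matrix was computed on the same rows the model was trained on. A minimal sketch of an out-of-sample version, assuming `cross_val_predict` from scikit-learn and synthetic boolean data (`X_demo`, `y_demo` are hypothetical stand-ins for `data` and `target`):

```python
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import BernoulliNB

# Synthetic boolean data for illustration only.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(300, 4)).astype(bool)
y_demo = X_demo[:, 0] & X_demo[:, 1]

# Each row is predicted by a model that did not see it during training.
y_cv_pred = cross_val_predict(BernoulliNB(), X_demo, y_demo, cv=10)

cm = confusion_matrix(y_demo, y_cv_pred)
tn, fp, fn, tp = cm.ravel()
print(cm)
print("False positives:", fp, "False negatives:", fn)
```

For a spam filter the two error types are not equally costly: a false positive hides a legitimate message, while a false negative lets spam through.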

### Magda Questions
- Do we always create our own cross-validation? If so, how?
- How do I do the additional evaluation here?