# PMR3508 - Machine Learning and Pattern Recognition
**Assignment 2 - Learning to work with the Naive Bayes classifier**

HASH: f20977e697


## Table of Contents

- [Introduction](#introduction)
- [Initial Data Exploration](#initial)
- [Naive Bayes](#classifier)
- [K-nearest neighbors](#classifier)
- [Conclusions](#conclusions)
- [References](#references) 


## 1. Introduction

  This spam classifier is built with a Naive Bayes model, a simple but very effective approach to spam classification. Naive Bayes remains a popular baseline method for text categorization, the problem of assigning documents to one category or another (such as spam or legitimate, sports or politics), with word frequencies as the features. In this kernel we will explore some data about spam and ham, compare two classification methods and discuss the differences between them.

### Environment Setting

Before we start, we need to set up the environment. If you copy this code to a local file, make sure these packages are installed on your computer.

In [None]:
#Importing stuff that I will need in this notebook

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import feature_extraction, model_selection, naive_bayes, metrics
from collections import Counter
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import accuracy_score,classification_report,confusion_matrix,fbeta_score
from scipy import stats

### Read Data File

The first task is to read the data from the .csv files. Most of the features are the frequency of a specific word.

In [None]:
train = pd.read_csv( "../input/spambase/train_data.csv")
testTOsubmit = pd.read_csv( "../input/spambase/test_features.csv")

In [None]:
print ("train size: ", train.shape)
print ("test size: ", testTOsubmit.shape)

In [None]:
train.head()

 - As we can see below, most features are the frequency of a specific word. The database also has no missing data, which simplifies the analysis.

In [None]:
train.info()

## 2. Initial data exploration

 In this section, we will explore the data to understand what each feature means and how it relates to the target variable.

In [None]:
# Plot a graph with the proportions of spam or ham in our data

count_Class = train["ham"].value_counts(sort=True)
count_Class.plot(kind= 'bar', color= ["green", "red"])
plt.title('Is it a ham?')
plt.show()

In [None]:
train.describe()

 - As we can see above, some features vary much more than others relative to their mean, which suggests that the corresponding words appear in only a small share of the emails.

### Separate in groups (for frequency analysis)

 - First, we focus on the frequency variables and how they relate to each other. So, let's drop the other columns and separate the data into two groups (one with spam emails and the other with non-spam emails).

In [None]:
# Drop the columns that are not about frequency

train2 = train.drop(columns = ['capital_run_length_average', 'capital_run_length_longest', 'capital_run_length_total', 'Id'])

In [None]:
# Separate the ham and spam for better analysis

HAMdata = train2[train2['ham']==True]
SPAMdata = train2[train2['ham']==False]

In [None]:
# Drop the column that says whether the email is ham or not
# (drop() returns a new DataFrame, so the result has to be assigned)

HAMdata = HAMdata.drop(columns = ['ham'])
SPAMdata = SPAMdata.drop(columns = ['ham'])

### Find useful information

- Secondly, we will try to find out some basic info and limitations of our classifier.

In [None]:
# Discovering some information about our data that can be useful

HAMline = HAMdata.shape[0]
SPAMline = SPAMdata.shape[0]
ALLlines = SPAMline + HAMline
SPAMperc = SPAMline * 100 / ALLlines
HAMperc = HAMline * 100 / ALLlines

print('     Useful Information')
print()
print ("HAM data size: ", HAMdata.shape)
print ("SPAM data size: ", SPAMdata.shape)
print("HAM percentage: ", HAMperc)
print("SPAM percentage: ", SPAMperc)

 - Our classifier needs an error below 38%, because the trivial strategy of labelling every email as HAM already reaches an error of 38%. We can also see the proportion between spam and non-spam emails and note that our database is reasonably large, which is good for machine learning.
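
To make this baseline concrete, here is a minimal sketch that computes the error of a classifier that always predicts the most common class. The series `labels` is a made-up stand-in for `train['ham']`; its 62/38 proportion is an assumption chosen only to mirror the percentage quoted above.

```python
import pandas as pd

# Hypothetical labels standing in for train['ham']: 62 ham (True), 38 spam (False)
labels = pd.Series([True] * 62 + [False] * 38)

# Relative frequency of each class
class_share = labels.value_counts(normalize=True)

# A classifier that always predicts the most common class is wrong
# exactly as often as the other class occurs
majority_class = class_share.idxmax()
baseline_error = 1.0 - class_share.max()

print("Majority class:", majority_class)
print("Baseline error: %.2f" % baseline_error)
```

Any model worth keeping should reach an error clearly below this value.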

### Comparing frequency of each word or char in spam and non-spam emails

- Now, let's see the frequency of each word or char in spam and non-spam emails

In [None]:
# Compute the total frequency of words and chars in the non-spam emails

sums1 = HAMdata.select_dtypes(np.number).sum().rename('total')
print('Frequency of words and chars in the non-spam email')
print()
print(sums1)

In [None]:
# Compute the total frequency of words and chars in the spam emails

sums2 = SPAMdata.select_dtypes(np.number).sum().rename('total')
print('Frequency of words and chars in the spam email')
print()
print(sums2)

In [None]:
# Plot the graph of more frequent words in non-spam messages

sums1.plot.bar(legend = False)
plt.title('More frequent words in non-spam messages')
plt.xlabel('Words or Chars')
plt.ylabel('Numbers')
plt.show()

In [None]:
# Plot the graph of more frequent words in spam messages

sums2.plot.bar(legend = False)
plt.title('More frequent words in spam messages')
plt.xlabel('Words or Chars')
plt.ylabel('Numbers')
plt.show()

 - As we can see, the frequency of "you" is very high in both classes of emails, whereas "your" is more frequent in spam messages. So, "you" on its own looks less informative and is the only frequency we consider dropping.

In [None]:
# Note: drop() returns a new DataFrame and the result is not assigned here,
# so `train` keeps the 'word_freq_you' column in the steps below
train.drop(columns = ['word_freq_you'])

 - Moreover, we can notice that "hp", "hpl" and "george" are much more frequent in non-spam messages than in spam messages; this information may be useful later.

### Comparing frequency between each word or char in spam and non-spam emails

- Now, let's look at the difference between the frequency of each word or char in spam and non-spam emails. From it, we can infer which words or chars matter most for our classification.

In [None]:
# Average frequency of each word or char per non-spam email

sums1FRACTION = sums1 / HAMline

# Average frequency of each word or char per spam email

sums2FRACTION = sums2 / SPAMline

 - We will subtract one average from the other. The larger the absolute value of the result, the more relevant the word or char is to our classifier. Positive numbers indicate that the frequency is higher in spam emails, while negative numbers indicate a higher frequency in ham emails.

In [None]:
# Set the difference between frequencies of words and chars

difSums = sums2FRACTION - sums1FRACTION
print(difSums)

In [None]:
# Plot the graph of the difference between frequencies of words and chars

difSums.plot.bar(legend = False)
plt.title('Difference between frequencies of words and chars')
plt.xlabel('Words or Chars')
plt.ylabel('Numbers')
plt.show()
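
The comparison of per-class averages can also be written compactly with boolean masks. The sketch below uses a made-up four-row table `toy`; its column names imitate the real ones and its values are illustrative assumptions, not figures from the data set.

```python
import pandas as pd

# Hypothetical miniature data set; values are invented for illustration
toy = pd.DataFrame({
    "word_freq_free":   [0.0, 0.1, 1.2, 0.9],
    "word_freq_george": [1.5, 2.0, 0.0, 0.0],
    "ham":              [True, True, False, False],
})

is_ham = toy["ham"]
features = toy.drop(columns=["ham"])

# Mean frequency of every feature within each class
ham_mean = features[is_ham].mean()
spam_mean = features[~is_ham].mean()

# Spam mean minus ham mean: positive values point to spam, negative to ham
diff = spam_mean - ham_mean

# Rank the features by the absolute size of the difference
ranked = diff.loc[diff.abs().sort_values(ascending=False).index]
print(ranked)
```

Sorting by absolute value puts the most discriminative features first, whichever class they favour.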

 - As we can see above, the graph supports several conclusions. The chars "$" and "#" differ strongly in frequency between spam and non-spam emails, so emails containing those chars are likely to be spam. The word "free" shows the same pattern, so the same reasoning applies. The words "you" and "your" also show a large difference in frequency but, unlike the words and chars above, they are very common in everyday language (text classifiers treat them as stop words), so they also appear in non-spam emails, only with a much lower frequency.
 - The same idea applies to differences that are negative and large in absolute value: emails containing words like "hp", "hpl" and "george" are probably non-spam.
 - It is worth pointing out that these are hypotheses drawn from the data, and this discussion does not replace a mathematical classifier.

### Separate in groups (for length analysis)
 
 - In this part, we focus on the length variables and how they relate to each other. So, let's keep only those columns and separate the data into two groups (one with spam emails and the other with non-spam emails).

In [None]:
# Keep only the length columns and the target (this leaves out the frequency columns and the Id column)

columns = ['capital_run_length_average', 'capital_run_length_longest', 'capital_run_length_total', 'ham']
train3 = train[columns]

In [None]:
# Separate the ham and spam for better analysis

HAMcarac = train3[train3['ham']==True]
SPAMcarac = train3[train3['ham']==False]

In [None]:
# Drop the column that says whether the email is ham or not
# (drop() returns a new DataFrame, so the result has to be assigned)

HAMcarac = HAMcarac.drop(columns = ['ham'])
SPAMcarac = SPAMcarac.drop(columns = ['ham'])

### Comparing types of length between spam and non-spam emails

- In this section, we will explore the length features (runs of capital letters in each email). As you can see, we will apply a method similar to the one used above.

In [None]:
# Compute the average of each length feature in the non-spam emails

sums3 = HAMcarac.select_dtypes(np.number).sum().rename('total')
sums3FRACTION = sums3 / HAMline
print('Rate of the types of length in the non-spam email')
print()
print(sums3FRACTION)

In [None]:
# Compute the average of each length feature in the spam emails

sums4 = SPAMcarac.select_dtypes(np.number).sum().rename('total')
sums4FRACTION = sums4 / SPAMline
print('Rate of the types of length in the spam email')
print()
print(sums4FRACTION)

 - As we can see above, the average length of uninterrupted sequences of capital letters (capital_run_length_average), the length of the longest uninterrupted sequence of capital letters (capital_run_length_longest) and the total number of capital letters in the e-mail (capital_run_length_total) are generally larger in spam emails. That is, spam emails tend to contain more capital letters, in longer runs, than non-spam emails.

## 3. Naive Bayes

 Now, we will choose the classifier that best fits the problem. Then, we will apply performance assessment methods to make decisions about the classifier and its hyperparameters.

- First, we will split our training data into four parts: training inputs, test inputs, training targets and test targets.

In [None]:
# Splitting the data into target (output) and features (input)

outputDATA = train['ham']
inputDATA = train.drop(columns=['ham'])

In [None]:
# Splitting the data into train inputs, test inputs, train output and test output

data_train, data_test, target_train, target_test = train_test_split(
    inputDATA,
    outputDATA,
    random_state = 0) 
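
As a small illustration of what `train_test_split` returns, the sketch below splits made-up arrays `X` and `y` (assumptions, not the spam data). It relies on the scikit-learn default of holding out 25% of the rows when no `test_size` is given, and also shows the optional `stratify` argument, which the notebook itself does not use.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples, 3 features, 60 ham (True) and 40 spam (False)
X = np.arange(300).reshape(100, 3)
y = np.array([True] * 60 + [False] * 40)

# With no test_size given, scikit-learn holds out 25% of the rows
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
print(X_train.shape, X_test.shape)

# Optional: stratify=y keeps the ham/spam proportion equal in both parts
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X, y, random_state=0, stratify=y)
print(ys_train.mean(), ys_test.mean())
```

Stratifying is worth considering when the classes are unbalanced, because a purely random split can shift the class proportions slightly between the two parts.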

In [None]:
# Creating the Naive Bayes classifier object, which assumes normally distributed features

gnb = GaussianNB()
gnb.fit(data_train, target_train)

# Predicting with the Naive Bayes classifier
# (on the training split; data_test is held out and not used here)

predictions = gnb.predict(data_train)
print(predictions)
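
To show what a Gaussian Naive Bayes classifier computes, here is a minimal hand-written sketch: for each class it adds the log prior to the sum of per-feature Gaussian log-likelihoods and picks the class with the largest total. The synthetic data and the helper `gaussian_nb_predict` are illustrative assumptions; scikit-learn additionally applies a small variance smoothing, so the two are expected to agree closely rather than bit for bit.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical two-class data with two features
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
               rng.normal(4.0, 1.0, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

def gaussian_nb_predict(X_train, y_train, X_new):
    """Return, for each row of X_new, the class with the largest
    log prior plus sum of per-feature Gaussian log-likelihoods."""
    classes = np.unique(y_train)
    log_posteriors = []
    for c in classes:
        Xc = X_train[y_train == c]
        mean = Xc.mean(axis=0)
        var = Xc.var(axis=0)
        log_prior = np.log(len(Xc) / len(X_train))
        # Features are treated as independent given the class (the "naive" part)
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * var)
                                + (X_new - mean) ** 2 / var, axis=1)
        log_posteriors.append(log_prior + log_lik)
    return classes[np.argmax(np.array(log_posteriors), axis=0)]

manual = gaussian_nb_predict(X, y, X)
reference = GaussianNB().fit(X, y).predict(X)
agreement = np.mean(manual == reference)
print("Agreement with scikit-learn:", agreement)
```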

 - Here, we will show our confusion matrix. In machine learning, and specifically in statistical classification, a confusion matrix is a table layout that allows visualization of the performance of an algorithm, typically a supervised learning one.

In [None]:
print ('Accuracy Score: ' + str(accuracy_score(target_train, predictions)))
print()
print (classification_report(target_train, predictions))
print (confusion_matrix(target_train, predictions))

In [None]:
print('     Confusion Matrix')
m_confusion_test = metrics.confusion_matrix(target_train, predictions)
# Rows and columns follow the sorted labels: ham == False (spam) first, then ham == True
pd.DataFrame(m_confusion_test, columns = ['Predicted Spam', 'Predicted Ham'],
            index = ['Actual Spam', 'Actual Ham'])

In [None]:
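# Setting some information that will be useful
# Index 0 is ham == False, so spam is the positive class in the rates below

TP = m_confusion_test[0][0]  # spam predicted as spam
TN = m_confusion_test[1][1]  # ham predicted as ham
FP = m_confusion_test[1][0]  # ham predicted as spam
FN = m_confusion_test[0][1]  # spam predicted as ham
Trues = TP + TN
Falses = FP + FN
All = TP + TN + FP + FN
recall = TP / (TP + FN)
precision = TP / (TP + FP)
specificity = TN / (TN + FP)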
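
scikit-learn orders the rows and columns of a confusion matrix by sorted label, so with a boolean target the first row and column belong to `False`. The sketch below checks this on made-up labels (an assumption, with `True` meaning ham) and shows how the `labels` argument fixes the order explicitly.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels: True = ham, False = spam
actual    = np.array([True, True, True, False, False, False, False])
predicted = np.array([True, True, False, False, False, False, True])

# Rows are actual classes, columns are predicted classes,
# both in sorted label order: False (spam) first, then True (ham)
cm = confusion_matrix(actual, predicted)
print(cm)

# Passing labels explicitly removes any doubt about the order
cm_ham_first = confusion_matrix(actual, predicted, labels=[True, False])
print(cm_ham_first)
```

Which class sits at index 0 decides which class counts as "positive" when recall and precision are read off the matrix by hand.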

 - Then, let's calculate our F-score with beta equal to 3, which means that we consider recall 3 times as important as precision. In statistical analysis of binary classification, the F-beta score (a generalization of the F-score or F-measure) is a measure of a test's accuracy. It is the weighted harmonic mean of precision and recall; the F3 score reaches its best value at 1 (perfect precision and recall) and its worst at 0.

In [None]:
F3_score = 10 * precision * recall / (9 * precision + recall)
print ('F3 Score: ' + str(F3_score))
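
The factor 10 and the factor 9 in the expression above come from the general F-beta formula, (1 + beta^2) * P * R / (beta^2 * P + R), with beta = 3. The sketch below, on made-up labels, computes it by hand through a hypothetical helper `f_beta` and compares the result with scikit-learn's `fbeta_score`.

```python
from sklearn.metrics import fbeta_score, precision_score, recall_score

# Hypothetical labels, 1 = positive class
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0]

precision = precision_score(y_true, y_pred)   # 3 true positives / 5 predicted positives
recall = recall_score(y_true, y_pred)         # 3 true positives / 4 actual positives

def f_beta(precision, recall, beta):
    """F-beta = (1 + beta^2) * P * R / (beta^2 * P + R)."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

manual_f3 = f_beta(precision, recall, beta=3)
library_f3 = fbeta_score(y_true, y_pred, beta=3)
print(manual_f3, library_f3)
```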

 - In this step, we will plot the ROC curve. It is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied. Its analysis provides tools to select possibly optimal models and to discard suboptimal ones independently from (and prior to specifying) the cost context or the class distribution. ROC analysis is related in a direct and natural way to cost/benefit analysis of diagnostic decision making.

In [None]:
# Calculate the FPR and TPR for the thresholds of the classification
# (hard class predictions are passed here, so the curve has a single interior point)

fpr, tpr, threshold = metrics.roc_curve(target_train, predictions)
roc_auc = metrics.auc(fpr, tpr)

# Plotting the ROC curve

plt.title('Receiver Operating Characteristic (The ROC curve)')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
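
A ROC curve is traced by varying a threshold over continuous scores, so it is usually computed from class probabilities rather than from hard class labels. The following sketch, on synthetic data that is an assumption and not the spam set, contrasts the two: scores from `predict_proba` yield many operating points, while hard predictions yield at most three.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Hypothetical two-class data with overlapping classes
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)),
               rng.normal(1.5, 1.0, size=(200, 2))])
y = np.array([0] * 200 + [1] * 200)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = GaussianNB().fit(X_tr, y_tr)

# Column 1 of predict_proba holds the estimated probability of class 1
scores = model.predict_proba(X_te)[:, 1]
fpr, tpr, thresholds = roc_curve(y_te, scores)
roc_auc = auc(fpr, tpr)
print("Operating points from scores:", len(fpr))
print("AUC from scores: %.3f" % roc_auc)

# Hard labels collapse the curve to (0, 0), one interior point and (1, 1)
fpr_hard, tpr_hard, _ = roc_curve(y_te, model.predict(X_te))
print("Operating points from hard labels:", len(fpr_hard))
```

The example also evaluates on a held-out split, which gives a less optimistic picture than scoring the data the model was fitted on.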

- Now we will look for the best number of cross-validation folds (CV) by testing several values.

In [None]:
CVlist = [5, 7, 10, 13, 15, 17, 20, 25, 30]
for i in CVlist:
    scores = cross_val_score(gnb, data_train, target_train, cv=i)
    strI = str(i)
    mean = str(np.mean(scores))
    print('Scores with CV equal to ' + strI)
    print(scores)
    print()
    print('Mean of scores: ' + mean)
    print()
    print()

- The best mean score was found with CV equal to 5.
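
When comparing fold counts, it helps to keep in mind that the number of folds changes how the score is estimated, not the model itself. A minimal sketch on synthetic data (an assumption, not the spam set) reports the spread of the fold scores next to their mean, which indicates how noisy each estimate is.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Hypothetical two-class data standing in for the spam features
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(150, 3)),
               rng.normal(1.0, 1.0, size=(150, 3))])
y = np.array([0] * 150 + [1] * 150)

summary = {}
for folds in [5, 10, 20]:
    scores = cross_val_score(GaussianNB(), X, y, cv=folds)
    # One score per fold; the standard deviation shows the spread between folds
    summary[folds] = (scores.mean(), scores.std())
    print("cv=%2d  mean=%.3f  std=%.3f" % (folds, scores.mean(), scores.std()))
```

Small differences between the means are often within this spread, so they should be read with caution.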

## 4. K-nearest neighbors

 - In pattern recognition, the k-nearest neighbors algorithm (k-NN) is a non-parametric method used for classification and regression. In both cases, the input consists of the k closest training examples in the feature space.

- First, we will split our training data into four parts

In [None]:
# Splitting the data into target (output) and features (input)

outputDATA = train['ham']
inputDATA = train.drop(columns=['ham'])

In [None]:
# Splitting the data into train inputs, test inputs, train output and test output

dataVtrain, dataVtest, targetVtrain, targetVtest = train_test_split(
    inputDATA,
    outputDATA,
    random_state = 0) 

 - Then, let's try several values of K and CV to find the best performance

In [None]:
# Setting the K values and our train and test data

neighbors = [3,5,7,9,13,15,20,25]
Xtrain = dataVtrain
Ytrain = targetVtrain
Xtest = dataVtest
Ytest = targetVtest

In [None]:
# For each number of CV folds, evaluate K-NN with several values of K,
# record the misclassification error (1 - mean accuracy) and plot it against K

for cv in [3, 5, 7, 9, 13, 15, 20, 25]:
    print('With CV = %d' % cv)
    cv_scores = []

    for k in neighbors:
        knn = KNeighborsClassifier(n_neighbors=k)
        scores = cross_val_score(knn, Xtrain, Ytrain, cv=cv, scoring='accuracy')
        cv_scores.append(scores.mean())

    # MSE here stands for the misclassification error, not the mean squared error
    MSE = [1 - x for x in cv_scores]
    optimal_k = neighbors[MSE.index(min(MSE))]
    print("The optimal number of neighbors is %d" % optimal_k)

    # Performance of K-NN for each number of neighbors with this number of folds
    plt.plot(neighbors, MSE)
    plt.xlabel('Number of Neighbors K')
    plt.ylabel('Misclassification Error')
    plt.show()
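
The search over K can also be delegated to scikit-learn's `GridSearchCV`, which runs the cross-validation for every candidate and keeps the best one. This is a sketch on synthetic data (an assumption), reusing the same list of candidate values for K; it is an alternative formulation, not the procedure used in this notebook.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical two-class data standing in for the spam features
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(2.0, 1.0, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

# Candidate numbers of neighbors, evaluated with 5-fold cross-validation
param_grid = {"n_neighbors": [3, 5, 7, 9, 13, 15, 20, 25]}
search = GridSearchCV(KNeighborsClassifier(), param_grid,
                      cv=5, scoring="accuracy")
search.fit(X, y)

print("Best K:", search.best_params_["n_neighbors"])
print("Best mean accuracy: %.3f" % search.best_score_)
```

After fitting, `search.best_estimator_` holds a K-NN model refitted on all the data passed to `fit`, using the selected K.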

 - We can conclude that our best performance with K-NN was with CV equal to 20 and 13 neighbors. So, let's use these values and extract more useful information about the classification.

In [None]:
# Set the K-nn with K=13 and CV=20

knn = KNeighborsClassifier(n_neighbors=13)
knn.fit(Xtrain,Ytrain)
scores = cross_val_score(knn, Xtrain, Ytrain, cv=20)
print(scores)

In [None]:
# Set the prediction

YtrainPred = knn.predict(Xtrain)

 - Here, we will show our confusion matrix. In machine learning, and specifically in statistical classification, a confusion matrix is a table layout that allows visualization of the performance of an algorithm, typically a supervised learning one.

In [None]:
print ('Accuracy Score: ' + str(accuracy_score(Ytrain, YtrainPred)))
print()
print (classification_report(Ytrain, YtrainPred))
print (confusion_matrix(Ytrain, YtrainPred))

In [None]:
print('     Confusion Matrix')
n_confusion_test = metrics.confusion_matrix(Ytrain, YtrainPred)
# Rows and columns follow the sorted labels: ham == False (spam) first, then ham == True
pd.DataFrame(n_confusion_test, columns = ['Predicted Spam', 'Predicted Ham'],
            index = ['Actual Spam', 'Actual Ham'])

In [None]:
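# Setting some information that will be useful
# Index 0 is ham == False, so spam is the positive class in the rates below

TP = n_confusion_test[0][0]  # spam predicted as spam
TN = n_confusion_test[1][1]  # ham predicted as ham
FP = n_confusion_test[1][0]  # ham predicted as spam
FN = n_confusion_test[0][1]  # spam predicted as ham
Trues = TP + TN
Falses = FP + FN
All = TP + TN + FP + FN
recall = TP / (TP + FN)
precision = TP / (TP + FP)
specificity = TN / (TN + FP)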

 - Then, let's calculate our F-score with beta equal to 3, which means that we consider recall 3 times as important as precision. In statistical analysis of binary classification, the F-beta score (a generalization of the F-score or F-measure) is a measure of a test's accuracy. It is the weighted harmonic mean of precision and recall; the F3 score reaches its best value at 1 (perfect precision and recall) and its worst at 0.

In [None]:
F3_score = 10 * precision * recall / (9 * precision + recall)
print ('F3 Score: ' + str(F3_score))

 - In this step, we will plot the ROC curve. It is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied. Its analysis provides tools to select possibly optimal models and to discard suboptimal ones independently from (and prior to specifying) the cost context or the class distribution. ROC analysis is related in a direct and natural way to cost/benefit analysis of diagnostic decision making.

In [None]:
# Calculate the FPR and TPR for the thresholds of the classification
# (hard class predictions are passed here, so the curve has a single interior point)

fpr, tpr, threshold = metrics.roc_curve(Ytrain, YtrainPred)
roc_auc = metrics.auc(fpr, tpr)

# Plotting the ROC curve

plt.title('Receiver Operating Characteristic (The ROC curve)')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')

## 5. Conclusions

 - After analyzing all the data and looking at two classification methods, we reach some conclusions.

 1. As we said in our initial data exploration, our classifier needs an error below 38%, because the trivial strategy of labelling every email as HAM already reaches an error of 38%.
 2. Comparing Naive Bayes and K-NN, we conclude that, in this specific case, Naive Bayes performs better than K-NN. The F3 score of Naive Bayes was around 0.932, while K-NN reached 0.624. Moreover, the accuracy (0.825 versus 0.742) and the other rates were generally better for Naive Bayes. These figures were measured on the training split.
 3. It is worth pointing out that Naive Bayes is not necessarily the best possible method for this case; however, compared to K-NN, it is the better one.
 4. More manipulations could improve our classifier: as we saw in the initial data exploration (section 2), a lot of information could be added to it. Furthermore, we could extend our classifier with a spell corrector, since informal email text often contains spelling errors; for example, [this is one of the most used](https://github.com/phatpiglet/autocorrect) spelling correctors. Also, checking the origin of the email could be useful information to add to our classification because, according to research, most accounts that send spam emails are used only for that purpose.

 - Now, let's predict on our test file and submit the results

In [None]:
# Open the sample file that shows the format expected for our predictions

sample = pd.read_csv("../input/pmrdatasettarefa/sample_submission_1.csv")
print ("submission size: ", sample.shape)

In [None]:
sample.head()

In [None]:
# Predicting and saving the predictions

predictionsTOsubmit = gnb.predict(testTOsubmit)
ids = testTOsubmit['Id']
submission = pd.DataFrame({'Id':ids,'ham':predictionsTOsubmit})
submission.to_csv("predictions.csv", index = False)
print(submission.shape)
submission.head()

## References


 - [Cross Validation Score](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html)
 - [Confusion Matrix](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html)
 - [K-Nearest Neighbors](http://scikit-learn.org/stable/modules/neighbors.html)
 - [Naive Bayes](http://scikit-learn.org/stable/modules/naive_bayes.html)
 - [Pandas](https://pandas.pydata.org/pandas-docs/stable/)
 - [ROC Curve](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html)
 - [DataFrame.drop](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html)
 