# ME3


## A simple classification task with Naive Bayes classifier & ROC curve

### Setup

First, let's import a few common modules, ensure Matplotlib plots figures inline, and prepare a function to save the figures. We also check that Python 3.5 or later is installed (although Python 2.x may work, it is deprecated, so we strongly recommend you use Python 3 instead), as well as Scikit-Learn ≥0.20.

In [None]:
%matplotlib notebook
# Python ≥3.5 is required
import sys
assert sys.version_info >= (3, 5)

# Scikit-Learn ≥0.20 is required
import sklearn
assert sklearn.__version__ >= "0.20"

# Common imports
import numpy as np
import pandas as pd
import seaborn as sn
import os

from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification, make_blobs
from matplotlib.colors import ListedColormap
from sklearn.datasets import load_breast_cancer


# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)


# to make this notebook's output stable across runs
np.random.seed(42)

# Where to save the figures
PROJECT_ROOT_DIR = "."

CHAPTER_ID = 'Naive Bayesian'
IMAGES_PATH = os.path.join(PROJECT_ROOT_DIR, "images", CHAPTER_ID)

os.makedirs(IMAGES_PATH, exist_ok=True)

def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    path = os.path.join(IMAGES_PATH, fig_id + "." + fig_extension)
    print("Saving figure", fig_id)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)

In [None]:
#pip install -U scikit-learn

## Part 0:

Read and run each cell of the example. 

### Confusion matrix - simple example 1

A simple example showing what a confusion matrix represents.
This example includes two class labels, 0 and 1. 

In [None]:
y_true1 = [1, 0, 0, 1, 1, 0, 1, 1, 0]
y_pred1 = [1, 1, 0, 1, 1, 0, 1, 1, 1]

confusion_mat1 = confusion_matrix(y_true1, y_pred1)
print(confusion_mat1)

In [None]:
# Print classification report
target_names1 = ['Class-0', 'Class-1']

result_metrics1 = classification_report(y_true1, y_pred1, target_names=target_names1)
print(result_metrics1)

# We can also retrieve the metrics as a dictionary and look up individual values by key
result_metrics_dict1 = classification_report(y_true1, y_pred1, target_names=target_names1, output_dict=True)
print(result_metrics_dict1)
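
As a rough sketch of how the report relates to the matrix, the four cells of a binary confusion matrix can be unpacked with `ravel()` and the Class-1 precision and recall recomputed by hand. The names `tn`, `fp`, `fn`, `tp`, and `report` are illustrative only and do not appear elsewhere in this notebook.

```python
from sklearn.metrics import confusion_matrix, classification_report

y_true1 = [1, 0, 0, 1, 1, 0, 1, 1, 0]
y_pred1 = [1, 1, 0, 1, 1, 0, 1, 1, 1]

# Rows are true labels, columns are predicted labels,
# so ravel() yields: true neg, false pos, false neg, true pos
tn, fp, fn, tp = confusion_matrix(y_true1, y_pred1).ravel()

precision_1 = tp / (tp + fp)  # of the samples predicted as 1, the share that really are 1
recall_1 = tp / (tp + fn)     # of the real 1s, the share that were predicted as 1

# The same values are available by key in the dictionary form of the report
report = classification_report(y_true1, y_pred1,
                               target_names=['Class-0', 'Class-1'],
                               output_dict=True)

print(tn, fp, fn, tp)
print(precision_1, report['Class-1']['precision'])
print(recall_1, report['Class-1']['recall'])
```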

### Confusion matrix - simple example 2

A simple example showing what a confusion matrix represents. 

This example includes four class labels, 0, 1, 2 and 3. 

In [None]:
y_true2 = [1, 0, 0, 2, 1, 0, 3, 3, 3]
y_pred2 = [1, 1, 0, 2, 1, 0, 1, 3, 3]

confusion_mat2 = confusion_matrix(y_true2, y_pred2)
print(confusion_mat2)

In [None]:
target_names2 = ['Class-0', 'Class-1', 'Class-2', 'Class-3']

result_metrics2 = classification_report(y_true2, y_pred2, target_names=target_names2)
print(result_metrics2)


# We can also retrieve the metrics as a dictionary and look up individual values by key
result_metrics_dict2 = classification_report(y_true2, y_pred2, target_names=target_names2, output_dict=True)
print(result_metrics_dict2)
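
With more than two classes, the per-class metrics can still be read off the matrix. The sketch below (variable names are illustrative) divides the diagonal by the row sums to get recall and by the column sums to get precision.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true2 = [1, 0, 0, 2, 1, 0, 3, 3, 3]
y_pred2 = [1, 1, 0, 2, 1, 0, 1, 3, 3]

cm = confusion_matrix(y_true2, y_pred2)

correct = np.diag(cm)                 # correctly classified samples per class
recall = correct / cm.sum(axis=1)     # row sums = number of true samples per class
precision = correct / cm.sum(axis=0)  # column sums = number of predictions per class
accuracy = correct.sum() / cm.sum()   # overall share of correct predictions

print(cm)
print("recall:   ", recall)
print("precision:", precision)
print("accuracy: ", accuracy)
```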

## Naive Bayes Classifiers

- Read Naive Bayes classifier in Python:
https://scikit-learn.org/stable/modules/naive_bayes.html

- Check out the difference between model parameters and hyper parameters:
https://towardsdatascience.com/model-parameters-and-hyperparameters-in-machine-learning-what-is-the-difference-702d30970f6

### 1. Synthetic Datasets

In [None]:
# synthetic dataset for classification (binary)

cmap_bold = ListedColormap(['#FFFF00', '#00FF00', '#0000FF','#000000'])

plt.figure()
plt.title('Sample binary classification problem with two informative features')

# generate X values and y values (labels)
X, y = make_classification(n_samples = 100, n_features=2,
                                n_redundant=0, n_informative=2,
                                n_clusters_per_class=1, flip_y = 0.1,
                                class_sep = 0.5, random_state=0)

# plot the data
plt.scatter(X[:, 0], X[:, 1], marker= 'o', c=y, s=50, cmap=cmap_bold)
plt.show()

###  Naive Bayes classifier 1

#### Split the data to training data and test data

In [None]:
from sklearn.naive_bayes import GaussianNB

# split the data into training data and testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

#### Training: Develop a model using training data

In [None]:
# create a Naive Bayes classifier using the training data
nbclf = GaussianNB()
nbclf.fit(X_train, y_train)

#### Testing: evaluate the model using testing data

In [None]:
# predict class labels on test data
y_pred = nbclf.predict(X_test)

#### Model Evaluation

In [None]:
# compute and print the confusion matrix
confusion_mat = confusion_matrix(y_test, y_pred)

print(confusion_mat)

# Print classification report
target_names = ['Class 0', 'Class 1']

result_metrics = classification_report(y_test, y_pred, target_names=target_names)
print(result_metrics)

In [None]:
# Mean accuracy of the model on the test data. This matches the 'accuracy' row of the report (not 'macro avg')
nbclf.score(X_test, y_test)
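
A quick way to check what `score` returns is to compare it with `accuracy_score` and with the `'accuracy'` entry of the report dictionary. This is a minimal sketch that rebuilds the synthetic dataset from above so it can run on its own. The names `clf`, `acc`, and `report` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Same synthetic data and split as in the cells above
X, y = make_classification(n_samples=100, n_features=2,
                           n_redundant=0, n_informative=2,
                           n_clusters_per_class=1, flip_y=0.1,
                           class_sep=0.5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = GaussianNB().fit(X_train, y_train)
y_pred = clf.predict(X_test)

score = clf.score(X_test, y_test)      # mean accuracy on the test set
acc = accuracy_score(y_test, y_pred)   # same quantity, computed from predictions
report = classification_report(y_test, y_pred, output_dict=True)

print(score, acc, report['accuracy'])
```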

In [None]:
from adspy_shared_utilities import plot_class_regions_for_classifier

# This shows the boundaries of classified regions
# build a NB model using training data and display the classified region 
plot_class_regions_for_classifier(nbclf, X_train, y_train, X_test, y_test,
                                 'Gaussian Naive Bayes classifier: Dataset 1')

## ROC Curve

In [None]:
from sklearn.metrics import roc_curve, auc
y_score = nbclf.predict_proba(X_test)

false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_score[:,1])

roc_auc = auc(false_positive_rate, true_positive_rate)

# roc_auc is the area under the ROC curve (AUC), not the accuracy
print('AUC = ', roc_auc)

# Plotting
plt.title('ROC')
plt.plot(false_positive_rate, true_positive_rate, label=('AUC = %0.2f'%roc_auc))
plt.legend(loc='lower right', prop={'size':8})
plt.plot([0,1],[0,1], color='lightgrey', linestyle='--')
plt.xlim([-0.05,1.0])
plt.ylim([0.0,1.05])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
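
The value computed by `auc(false_positive_rate, true_positive_rate)` is the area under the ROC curve. As a sketch, the same number can be obtained in one call with `roc_auc_score`. The block below rebuilds the synthetic data so it is self-contained and skips the plotting. The names `clf`, `area`, and `area_direct` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_curve, auc, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=100, n_features=2,
                           n_redundant=0, n_informative=2,
                           n_clusters_per_class=1, flip_y=0.1,
                           class_sep=0.5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = GaussianNB().fit(X_train, y_train)

# Probability of the positive class (column 1) is used as the ranking score
y_score = clf.predict_proba(X_test)[:, 1]

fpr, tpr, thresholds = roc_curve(y_test, y_score)
area = auc(fpr, tpr)                          # trapezoidal area under the curve
area_direct = roc_auc_score(y_test, y_score)  # same quantity in a single call

print(area, area_direct)
```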

## 2. Application to a real-world dataset

- Breast Cancer dataset: one of the well-known datasets used in ML. 


In [None]:
# Breast cancer dataset for classification
cancer = load_breast_cancer()
(X_cancer, y_cancer) = load_breast_cancer(return_X_y = True)

print(X_cancer)

In [None]:
# Print class labels
target_names = cancer.target_names
target_names

#### Modeling through k-Fold Cross-Validation

- Create 10 folds for training and testing.
- Evaluate model performance for each iteration and obtain the average. 

In [None]:
from sklearn.model_selection import KFold 

# We start with k=3 and will increase it to 10.
kf = KFold(n_splits=3, random_state=None, shuffle=True) # Define the split into k folds (here k=3)

kf.get_n_splits(X_cancer) # returns the number of splitting iterations in the cross-validator

print (kf)
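
To see what the splitter produces, the sketch below lists the size of the training and test portion of each fold. It fixes `random_state=0` only so the example is repeatable, which differs from the `random_state=None` used above. `kf_demo` and `fold_sizes` are illustrative names.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold

X_cancer, y_cancer = load_breast_cancer(return_X_y=True)

kf_demo = KFold(n_splits=3, shuffle=True, random_state=0)

fold_sizes = []
for fold, (train_index, test_index) in enumerate(kf_demo.split(X_cancer)):
    # every sample lands in the test portion of exactly one fold
    fold_sizes.append((len(train_index), len(test_index)))
    print("fold", fold, "train:", len(train_index), "test:", len(test_index))
```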

#### Apply k-Fold Cross-Validation

In [None]:
nbclf = GaussianNB()

for train_index, test_index in kf.split(X_cancer):
    # for each iteration, get training data and test data
    X_train, X_test = X_cancer[train_index], X_cancer[test_index]
    y_train, y_test = y_cancer[train_index], y_cancer[test_index]

    # train the model using training data
    nbclf.fit(X_train, y_train)
    
    # show how model performs with training data and test data
    print('Accuracy of GaussianNB classifier on training set: {:.2f}'
         .format(nbclf.score(X_train, y_train)))

    print('Accuracy of GaussianNB classifier on test set: {:.2f}'
         .format(nbclf.score(X_test, y_test)))

#### Model performance using k-Fold Cross-Validation

In [None]:
nbclf2 = GaussianNB()

# !!!!! Please make a summary of the model performance (averaging k folds' results) using result_metrics_dict 
for train_index, test_index in kf.split(X_cancer):
    # for each iteration, get training data and test data
    X_train, X_test = X_cancer[train_index], X_cancer[test_index]
    y_train, y_test = y_cancer[train_index], y_cancer[test_index]

    # train the model using training data
    nbclf2.fit(X_train, y_train)
    
    # predict y values using test data
    y_pred = nbclf2.predict(X_test)

    confusion_mat = confusion_matrix(y_test, y_pred)
    print(confusion_mat)
    
    print(classification_report(y_test, y_pred, target_names=target_names))
    
    # Since the metrics can be retrieved as a dictionary and accessed by key,
    # we can sum the results of each iteration and compute the average
    result_metrics_dict = classification_report(y_test, y_pred, target_names=target_names, output_dict=True)
    print(result_metrics_dict)
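
One possible way to summarize the folds, offered as a sketch rather than the required solution, is to keep each fold's report dictionary in a list and average the entries afterwards. The names `fold_reports`, `mean_accuracy`, and `mean_metrics` are hypothetical, and `random_state=0` is set only for repeatability.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import classification_report
from sklearn.model_selection import KFold
from sklearn.naive_bayes import GaussianNB

cancer = load_breast_cancer()
X_cancer, y_cancer = cancer.data, cancer.target
target_names = [str(name) for name in cancer.target_names]

kf = KFold(n_splits=3, shuffle=True, random_state=0)

fold_reports = []
for train_index, test_index in kf.split(X_cancer):
    clf = GaussianNB()
    clf.fit(X_cancer[train_index], y_cancer[train_index])
    y_pred = clf.predict(X_cancer[test_index])
    # keep the dictionary form of the report for this fold
    fold_reports.append(classification_report(y_cancer[test_index], y_pred,
                                              target_names=target_names,
                                              output_dict=True))

# average the overall accuracy and the per-class metrics over the folds
mean_accuracy = np.mean([r['accuracy'] for r in fold_reports])
mean_metrics = {
    name: {m: np.mean([r[name][m] for r in fold_reports])
           for m in ('precision', 'recall', 'f1-score')}
    for name in target_names
}

print("mean accuracy:", mean_accuracy)
for name, metrics in mean_metrics.items():
    print(name, metrics)
```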

### ROC Curve

This example plots a ROC curve for a single train/test split. The same can be done for each fold in k-fold cross-validation.

In [None]:
from sklearn.metrics import roc_curve, auc

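X_train, X_test, y_train, y_test = train_test_split(X_cancer, y_cancer, random_state = 0)

# refit on this training split so the model has not already seen the test samples
nbclf2.fit(X_train, y_train)
y_score = nbclf2.predict_proba(X_test)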

false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_score[:,1])

roc_auc = auc(false_positive_rate, true_positive_rate)
# roc_auc is the area under the ROC curve (AUC), not the accuracy
print('AUC = ', roc_auc)

# Plotting
plt.title('ROC')
plt.plot(false_positive_rate, true_positive_rate, label=('AUC = %0.2f'%roc_auc))
plt.legend(loc='lower right', prop={'size':8})
plt.plot([0,1],[0,1], color='lightgrey', linestyle='--')
plt.xlim([-0.05,1.0])
plt.ylim([0.0,1.05])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
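
To get an AUC for every fold rather than for one split, a short sketch is to let `cross_val_score` do the loop with `scoring='roc_auc'`. This only reports the area per fold and does not draw the curves. `auc_scores` is an illustrative name and `random_state=0` is set only for repeatability.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

X_cancer, y_cancer = load_breast_cancer(return_X_y=True)

kf = KFold(n_splits=3, shuffle=True, random_state=0)

# a fresh GaussianNB is fitted on each training fold and scored on the held-out fold
auc_scores = cross_val_score(GaussianNB(), X_cancer, y_cancer,
                             cv=kf, scoring='roc_auc')

print("AUC per fold:", auc_scores)
print("mean AUC:", auc_scores.mean())
```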

## ME3 Part 1

#### Build Naive Bayes classifiers on a well-known dataset, the iris dataset. 

You are asked to build NB classifiers on two different datasets: (1) the original dataset (the data is not normalized) and (2) the normalized dataset. Use k-fold cross-validation to evaluate the model performance. 

In [None]:
from IPython.display import Image

Image("images/iris.png")

### Dataset 1: iris

Obtain the data through either (1) or (2). 

- (1) You can read the data from sklearn.datasets using load_iris()
- (2) you can directly read the data from a local file: iris.csv is stored in a folder "data"

Run one of the two. 

#### (1) Obtain the data from sklearn.datasets

In [None]:
from sklearn.datasets import load_iris
iris = load_iris()

X = iris.data # all four features: sepal length, sepal width, petal length, petal width
y = iris.target
print(iris.target_names)
print(X)
print(y)

#### (2) Read the data from a local file: iris.csv is stored in a folder "data"

In [None]:
# read data from CSV file to dataframe
iris = pd.read_csv('./data/iris.csv')

# define target_names (class labels)
target_names = ['setosa', 'versicolor', 'virginica']

print(iris.head())
print(iris.tail())

# X contains the first four columns, y contains class labels
#X = iris_data.iloc[:, [0,1,2,3]]
X = iris.drop(['Name', 'Class'], axis=1)
y = iris.iloc[:, [5]]
print(X.head())
print(y.head())

### Tasks:

- First, run basic Python functions for checking the data.

    - describe(), info(), isnull(), boxplot(), etc. 

- Your modeling analysis should be done on two different datasets, (1) the original dataset and (2) the normalized dataset.
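
The data checks listed above could look roughly like the sketch below. It assumes option (1), wrapping the `load_iris()` arrays in a DataFrame named `df` (a name chosen here for illustration). With the CSV from option (2), the same calls apply to the DataFrame you read.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()

# put the features and the class label into one DataFrame
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['target'] = iris.target

print(df.describe())           # count, mean, std, min, quartiles, max per column
df.info()                      # column types and non-null counts
missing = df.isnull().sum()    # number of missing values per column
print(missing)

# df.boxplot(column=list(iris.feature_names))  # uncomment in a notebook to draw box plots
```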

(1) NB classifier using the original dataset

- Create Naive Bayes classifier. 

- Set up a k-fold cross-validation framework (k = 3).

- Display confusion matrix (a matrix with numbers).

- Print a summary of performance metrics.

- Plot ROC curves (this task is done. See the example code segment). 

#### ROC Curve

- This part is done. This code assumes that your NB classifier is defined as nbclf. 

- The code segment shows how to draw ROC curves for multi-class classification, where there are more than two class labels. 

In [None]:
from sklearn.preprocessing import label_binarize

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)

# we assume that your NB classifier's name is nbclf.
# Otherwise, you need to modify the name of the model. 
y_score = nbclf.predict_proba(X_test)
    
y_test = label_binarize(y_test, classes=[0,1,2])
n_classes = 3

# Compute ROC curve and ROC area for each class
fpr = dict()
tpr = dict()
roc_auc = dict()

for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])

# Plot of a ROC curve for a specific class
for i in range(n_classes):
    print("AUC: " , roc_auc[i])
    plt.figure()
    plt.plot(fpr[i], tpr[i], label='ROC curve (area = %0.2f)' % roc_auc[i])
    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic example for class ' + str(i) )
    plt.legend(loc="lower right")
    plt.show()
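
For reference, `label_binarize` turns integer labels into one indicator column per class, which is what lets the loop above treat each class as a separate one-vs-rest binary problem. A tiny illustration with made-up labels:

```python
from sklearn.preprocessing import label_binarize

labels = [0, 1, 2, 1]  # made-up labels, for illustration only

# one column per class; a 1 marks the class the sample belongs to
indicator = label_binarize(labels, classes=[0, 1, 2])

print(indicator)
```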

(2) NB classifier using the normalized dataset

- Normalize the data. Make sure that you normalize only the X values. 

- Create Naive Bayes classifier. 

- Set up a k-fold cross-validation framework (k = 3).

- Display confusion matrix (a matrix with numbers).

- Print a summary of performance metrics.

- Plot ROC curves (this task is done. See the example code segment). 
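
A minimal sketch of the normalization step, assuming min-max scaling to the range [0, 1] via `MinMaxScaler`. The task does not say which normalization to use, so this choice is an assumption, and `X_norm` is an illustrative name. Only the feature matrix is transformed and the labels are left as they are.

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler

X, y = load_iris(return_X_y=True)

# rescale each feature column to [0, 1]; y is not passed to the scaler
scaler = MinMaxScaler()
X_norm = scaler.fit_transform(X)

print(X_norm.min(axis=0), X_norm.max(axis=0))
print(y[:5])  # class labels are unchanged
```

Fitting the scaler on the full dataset keeps the example short. Inside cross-validation, a stricter variant would fit it on each training fold only, for example by wrapping the scaler and the classifier in a `sklearn.pipeline.Pipeline`.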

### Part 2 Summary

- Upload your notebook to a GitHub repo and provide a URL to the file.

- Write a summary of the analysis and submit it to Canvas. Your summary should include the comparisons of the two models in terms of their performance. 