In this notebook we will look at <b>multi-class classification</b> using the <b>Naive Bayes algorithm</b>.

## Loading the dataset

In this section, we import all the necessary packages and load the dataset we plan to work on. For this exercise, we have picked the Wine dataset, available through sklearn:
<a href="https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_wine.html">Wine Dataset</a>


In [1]:
import pandas as pd
from sklearn import datasets
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn import metrics

The Wine dataset is a dictionary-like object with its contents stored in key:value pairs.

In [2]:
# Load dataset
wine_data = datasets.load_wine()
wine_data.keys()

['target_names', 'data', 'target', 'DESCR', 'feature_names']

In [38]:
wine_data['target_names']

array(['class_0', 'class_1', 'class_2'], dtype='|S7')

## Explore the dataset
Understanding the data, its features, and their distribution is a major part of building ML models.

In [3]:
# Let us convert the dataset to Pandas for easier exploration
wine = pd.DataFrame(wine_data.data, columns=wine_data.feature_names)

# The target is the class label column that we intend to predict; here it is the wine type, which is categorical
wine['wine_class'] = wine_data.target

# Dataset has 13 numeric features
print("Features: ", wine_data.feature_names)

# Dataset has 3 classes - type of wine (class_0, class_1, class_2)
print("Labels: ", wine_data.target_names)

Features:  ['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline']
Labels:  ['class_0' 'class_1' 'class_2']


In [21]:
# It has 178 rows (data points) and 13 features + 1 target variable
wine.shape

(178, 14)


In [20]:
# Let us see the data 
wine.head(5)

,alcohol,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,od280/od315_of_diluted_wines,proline,wine_class
0,14.23,1.71,2.43,15.6,127.0,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065.0,0
1,13.2,1.78,2.14,11.2,100.0,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050.0,0
2,13.16,2.36,2.67,18.6,101.0,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185.0,0
3,14.37,1.95,2.5,16.8,113.0,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480.0,0
4,13.24,2.59,2.87,21.0,118.0,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735.0,0


In [28]:
# Check the label distribution 
wine.wine_class.value_counts()

1    71
0    59
2    48
Name: wine_class, dtype: int64

In [29]:
wine.isnull().sum()

alcohol                         0
malic_acid                      0
ash                             0
alcalinity_of_ash               0
magnesium                       0
total_phenols                   0
flavanoids                      0
nonflavanoid_phenols            0
proanthocyanins                 0
color_intensity                 0
hue                             0
od280/od315_of_diluted_wines    0
proline                         0
wine_class                      0
dtype: int64

## Splitting your Data
In machine learning problems, we split the dataset into two parts, so that the model is trained on one part and evaluated on data it has not seen during training.

Training - 80%
Testing - 20%

You further need to split your data into features and labels.
<ol>
        <li> Features - Input X to the model </li>
        <li>Label - Expected output Y </li>
</ol>
For the Wine dataset, we use all 13 feature columns as inputs and predict 'wine_class'.

In [4]:
features =  wine_data.feature_names
X_train, X_test, y_train, y_test = train_test_split(wine[features], wine['wine_class'], test_size=0.2,random_state=109)

In [53]:
# check label distribution for training data
y_train.value_counts()

1    55
0    46
2    41
Name: wine_class, dtype: int64

In [54]:
# check label distribution for test data
y_test.value_counts()

1    16
0    13
2     7
Name: wine_class, dtype: int64
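
The split sizes and label counts above can be checked with a little arithmetic. The sketch below is only an illustration built from the numbers printed in this notebook; the helper `proportions` is a hypothetical name, and rounding the test share up is an assumption that is consistent with the 142/36 split observed here.

```python
import math

# Numbers taken from the notebook: 178 rows, test_size=0.2
n_samples = 178
test_size = 0.2

# Assumption: the test share is rounded up, which matches the observed 36 test rows
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

# Label counts printed by y_train.value_counts() and y_test.value_counts()
train_counts = {0: 46, 1: 55, 2: 41}
test_counts = {0: 13, 1: 16, 2: 7}


def proportions(counts):
    """Convert raw label counts into fractions of the total (hypothetical helper)."""
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


train_props = proportions(train_counts)
test_props = proportions(test_counts)

print("train rows:", n_train, "test rows:", n_test)
for label in sorted(train_counts):
    print(f"class {label}: train {train_props[label]:.3f}, test {test_props[label]:.3f}")
```

Class 2 makes up a noticeably smaller share of the test set than of the training set. If closer proportions are wanted, `train_test_split` accepts a `stratify` argument (for example `stratify=wine['wine_class']`); this notebook does not use it.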

## Model Creation 
We will train a Naive Bayes classifier on our training data and evaluate its performance using the test data.
<a href="https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html">Sklearn Gaussian NB Documentation</a>

In [5]:
# Create a Gaussian Naive Bayes classifier
# No prior is specified, so the model estimates the class priors from the training data
gnb = GaussianNB()
# Train the model using the training set
# The classes are roughly balanced, so we leave sample_weight at its default (None)
gnb.fit(X_train, y_train, sample_weight=None)

GaussianNB(priors=None, var_smoothing=1e-09)
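
To give a feel for what fitting a Gaussian Naive Bayes model involves, here is a minimal from-scratch sketch. It is not scikit-learn's implementation: the names `fit_gaussian_nb`, `log_gaussian` and `predict_proba_one`, the toy data, and the fixed `eps` added to each variance are all assumptions made for illustration (scikit-learn instead scales `var_smoothing` by the largest feature variance).

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y, eps=1e-9):
    """Estimate a prior, per-feature means and per-feature variances for each class."""
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)

    model = {}
    for label, rows in rows_by_class.items():
        n = len(rows)
        columns = list(zip(*rows))  # one tuple of values per feature
        means = [sum(col) / n for col in columns]
        # eps is a simplified stand-in for variance smoothing (assumption)
        variances = [
            sum((v - m) ** 2 for v in col) / n + eps
            for col, m in zip(columns, means)
        ]
        model[label] = {"prior": n / len(y), "means": means, "variances": variances}
    return model


def log_gaussian(x, mean, var):
    """Log density of a normal distribution with the given mean and variance."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)


def predict_proba_one(model, row):
    """Posterior probability of each class for a single data point."""
    log_joint = {
        label: math.log(p["prior"])
        + sum(log_gaussian(x, m, v) for x, m, v in zip(row, p["means"], p["variances"]))
        for label, p in model.items()
    }
    # Subtract the maximum before exponentiating for numerical stability
    top = max(log_joint.values())
    unnormalised = {label: math.exp(v - top) for label, v in log_joint.items()}
    total = sum(unnormalised.values())
    return {label: u / total for label, u in unnormalised.items()}


# Toy data with two well-separated classes (made up for this example)
X_toy = [[1.0, 2.1], [1.2, 1.9], [0.8, 2.0], [3.0, 4.1], [3.2, 3.9], [2.9, 4.0]]
y_toy = [0, 0, 0, 1, 1, 1]

toy_model = fit_gaussian_nb(X_toy, y_toy)
toy_probs = predict_proba_one(toy_model, [1.1, 2.0])
toy_label = max(toy_probs, key=toy_probs.get)
print(toy_probs, toy_label)
```

The "naive" part is the sum over features inside `predict_proba_one`: each feature contributes its own independent Gaussian term given the class.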

In [6]:
# predict method is used to predict the label for test data
y_predict=gnb.predict(X_test)

In [30]:
# If we want to predict the probability instead of label
np.set_printoptions(suppress=True)
y_prob = gnb.predict_proba(X_test)
print("Example-- ")
print("Probability of data point 0 belonging to Class 0, Class 1, Class 2 -->", y_prob[0])
print("Predicted label of data point 0 --> ", y_predict[0])

Example-- 
Probability of data point 0 belonging to Class 0, Class 1, Class 2 --> [0.99999544 0.00000456 0.        ]
Predicted label of data point 0 -->  0
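
The predicted label is simply the class with the highest probability. The short sketch below reproduces that step for data point 0 using the probabilities printed above; the variable names are ours, and the label list `[0, 1, 2]` is assumed to match the column order of `predict_proba` (scikit-learn exposes that order as `gnb.classes_`).

```python
# Probabilities printed above for data point 0
probabilities = [0.99999544, 0.00000456, 0.0]

# Assumed to match the column order of predict_proba (see gnb.classes_)
class_labels = [0, 1, 2]

# Index of the largest probability, i.e. an argmax over the classes
best_index = max(range(len(probabilities)), key=probabilities.__getitem__)
predicted_label = class_labels[best_index]

print("Predicted label:", predicted_label)
print("Probabilities sum to:", sum(probabilities))
```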


In [31]:
# Concatenate the actual label, predicted label and probability of each class for better visualization
pd.options.display.float_format = '{:.5f}'.format

y_test=y_test.reset_index(drop=True)
y_prob_df=pd.DataFrame(y_prob,columns=['Class_0','Class_1','Class_2'])
y_predict_df=pd.DataFrame(y_predict,columns=['Predicted_label'])

tmp=pd.concat([y_predict_df,y_prob_df],axis=1)
prediction = pd.concat([y_test,tmp],axis=1)
prediction.head(5)

,wine_class,Predicted_label,Class_0,Class_1,Class_2
0,0,0,1.0,0.0,0.0
1,0,0,0.60527,0.39473,0.0
2,1,1,0.0,1.0,0.0
3,2,2,0.0,0.0,1.0
4,0,0,1.0,0.0,0.0


## Model Evaluation 

Accuracy

In [107]:
print("Accuracy:", metrics.accuracy_score(y_test, y_predict))

Accuracy: 0.9444444444444444


Confusion matrix

In [41]:
from sklearn.metrics import confusion_matrix
# In scikit-learn's confusion matrix, rows are the true classes and columns are the predicted classes
cf = confusion_matrix(y_test, y_predict)
cf_df = pd.DataFrame(cf,
                     index=['Class_0_true', 'Class_1_true', 'Class_2_true'],
                     columns=['Class_0_pred', 'Class_1_pred', 'Class_2_pred'])
cf_df

,Class_0_pred,Class_1_pred,Class_2_pred
Class_0_true,13,0,0
Class_1_true,0,14,2
Class_2_true,0,0,7
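
The accuracy reported earlier can be recovered from this matrix: the diagonal holds the correctly classified points, so accuracy is the diagonal sum divided by the total. A minimal sketch, with the matrix values copied from the output above and variable names of our own choosing:

```python
# Confusion matrix values copied from the output above
confusion = [
    [13, 0, 0],
    [0, 14, 2],
    [0, 0, 7],
]

# Correct predictions sit on the diagonal
correct = sum(confusion[i][i] for i in range(len(confusion)))
total = sum(sum(row) for row in confusion)
accuracy = correct / total

print(correct, "correct out of", total, "-> accuracy", accuracy)
```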


Precision, Recall and F-score

In [28]:
# precision_recall_fscore_support returns one array per metric (precision, recall, F-score, support),
# each with one value per class, so we transpose to get one row per class
prf = metrics.precision_recall_fscore_support(y_test, y_predict, average=None)
prf_df = pd.DataFrame(np.array(prf[0:3]).T,
                      index=['Class_0', 'Class_1', 'Class_2'],
                      columns=['Precision', 'Recall', 'F-score'])
prf_df

,Precision,Recall,F-score
Class_0,1.0,1.0,1.0
Class_1,1.0,0.875,0.933333
Class_2,0.777778,1.0,0.875
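
As a cross-check, per-class precision, recall and F-score can also be computed by hand from the confusion matrix. The sketch below assumes scikit-learn's convention that rows index the true class and columns the predicted class; `per_class_metrics` is a hypothetical helper written for this illustration.

```python
# Confusion matrix from this notebook: rows = true class, columns = predicted class
confusion = [
    [13, 0, 0],
    [0, 14, 2],
    [0, 0, 7],
]


def per_class_metrics(matrix):
    """Return a (precision, recall, f_score) tuple for each class (hypothetical helper)."""
    n = len(matrix)
    results = []
    for k in range(n):
        true_positives = matrix[k][k]
        predicted_as_k = sum(matrix[i][k] for i in range(n))  # column sum
        actually_k = sum(matrix[k])                           # row sum
        precision = true_positives / predicted_as_k if predicted_as_k else 0.0
        recall = true_positives / actually_k if actually_k else 0.0
        denominator = precision + recall
        f_score = 2 * precision * recall / denominator if denominator else 0.0
        results.append((precision, recall, f_score))
    return results


class_metrics = per_class_metrics(confusion)
for k, (precision, recall, f_score) in enumerate(class_metrics):
    print(f"Class_{k}: precision={precision:.6f} recall={recall:.6f} f-score={f_score:.6f}")
```

Precision divides by a column sum (everything predicted as the class), while recall divides by a row sum (everything that truly belongs to the class).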
