# Naive Bayes

Naive Bayes is a classification technique based on Bayes’ Theorem that assumes the predictors are conditionally independent of one another given the class. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature.

For example, a fruit may be considered to be an apple if it is red, round, and about 3 inches in diameter. Even if these features depend on each other, the classifier treats each of them as contributing independently to the probability that the fruit is an apple, which is why it is known as ‘Naive’.

The Naive Bayes model is easy to build and particularly useful for very large data sets. Despite its simplicity, it can perform competitively with more sophisticated classification methods on some problems.

Naive Bayes is a classification algorithm for binary (two-class) and multi-class classification problems. The technique is easiest to understand when described using binary or categorical input values.

It is called naive Bayes because the calculation of the probabilities for each hypothesis is simplified to make it tractable. Rather than attempting to calculate the joint probability of the attribute values P(d1, d2, d3|h), the attributes are assumed to be conditionally independent given the target value, so the probability is calculated as P(d1|h) * P(d2|h) * P(d3|h).
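The factorisation can be illustrated with a few lines of Python. This is only a sketch: the priors and per-feature likelihoods below are made-up numbers for the apple example, and the names `priors`, `likelihoods` and `naive_score` are hypothetical.

```python
from math import prod

# Hypothetical class priors P(h) (illustrative numbers only)
priors = {"apple": 0.5, "other": 0.5}

# Hypothetical per-feature likelihoods P(d_i | h) for the observed features
likelihoods = {
    "apple": {"red": 0.8, "round": 0.9, "about_3_inches": 0.7},
    "other": {"red": 0.3, "round": 0.4, "about_3_inches": 0.2},
}

def naive_score(h):
    """Unnormalised posterior: P(h) * P(d1|h) * P(d2|h) * P(d3|h)."""
    return priors[h] * prod(likelihoods[h].values())

scores = {h: naive_score(h) for h in priors}

# Normalise so the posteriors over all hypotheses sum to 1
total = sum(scores.values())
posteriors = {h: s / total for h, s in scores.items()}

best = max(posteriors, key=posteriors.get)
print(posteriors, best)
```

The classifier picks the hypothesis with the largest product; the normalisation step is only needed when actual probabilities are wanted.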

## Bayes' Theorem
Bayes’ Theorem is stated as:

P(h|d) = (P(d|h) * P(h)) / P(d)

Where

- P(h|d) is the probability of hypothesis h given the data d. This is called the posterior probability.
- P(d|h) is the probability of data d given that the hypothesis h was true.
- P(h) is the probability of hypothesis h being true (regardless of the data). This is called the prior probability of h.
- P(d) is the probability of the data (regardless of the hypothesis).
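As a small worked example of the formula, the sketch below computes a posterior for a two-hypothesis case (h and not-h), obtaining P(d) from the law of total probability. The function name `posterior` and the input numbers are assumptions chosen for illustration.

```python
def posterior(prior_h, p_d_given_h, p_d_given_not_h):
    """Return P(h|d) for a binary hypothesis using Bayes' Theorem."""
    # P(d) = P(d|h) * P(h) + P(d|not h) * P(not h)
    p_d = p_d_given_h * prior_h + p_d_given_not_h * (1.0 - prior_h)
    # P(h|d) = P(d|h) * P(h) / P(d)
    return p_d_given_h * prior_h / p_d

# Illustrative numbers: a rare hypothesis with fairly informative data
p = posterior(prior_h=0.01, p_d_given_h=0.9, p_d_given_not_h=0.05)
print(round(p, 4))
```

Even with P(d|h) = 0.9, the posterior stays small here because the prior P(h) is low.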

### Useful Libraries

#### Load Dataset. Use "bank-data.csv"

In [210]:
# import dataset
import pandas as pd
dataset = pd.read_csv('bank-data.csv',index_col=0)

In [211]:
dataset.head()

id,age,sex,region,income,married,children,car,save_act,current_act,mortgage,pep
ID12101,48,FEMALE,INNER_CITY,17546.0,NO,1,NO,NO,NO,NO,YES
ID12102,40,MALE,TOWN,30085.1,YES,3,YES,NO,YES,YES,NO
ID12103,51,FEMALE,INNER_CITY,16575.4,YES,0,YES,YES,YES,NO,NO
ID12104,23,FEMALE,TOWN,20375.4,YES,3,NO,NO,YES,NO,NO
ID12105,57,FEMALE,RURAL,50576.3,YES,0,NO,YES,NO,NO,NO


#### Preprocess the data

In [212]:
# import library for preprocessing
from sklearn import preprocessing
le = preprocessing.LabelEncoder()


In [213]:
# Transform the categorical columns into integer codes using the "fit_transform(attribute)" function
dataset.car = le.fit_transform(dataset.car)
dataset.sex = le.fit_transform(dataset.sex)
dataset.save_act = le.fit_transform(dataset.save_act)
dataset.married = le.fit_transform(dataset.married)
dataset.current_act = le.fit_transform(dataset.current_act)
dataset.mortgage = le.fit_transform(dataset.mortgage)
dataset.pep = le.fit_transform(dataset.pep)
dataset.region = le.fit_transform(dataset.region)
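The per-column calls above can also be written as a loop that keeps one fitted encoder per column, so the integer codes can be mapped back to the original labels later. A minimal sketch on a small made-up frame (the `toy` data and the `encoders` dictionary are hypothetical, not part of the lab sheet):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Small made-up frame with the same kind of columns as bank-data.csv
toy = pd.DataFrame({
    "sex": ["FEMALE", "MALE", "FEMALE"],
    "married": ["NO", "YES", "YES"],
    "age": [48, 40, 51],
})

encoders = {}
for col in ["sex", "married"]:
    enc = LabelEncoder()
    toy[col] = enc.fit_transform(toy[col])  # classes are sorted, then coded 0..n-1
    encoders[col] = enc                     # keep the fitted encoder per column

# Recover the original labels from the integer codes
original_sex = encoders["sex"].inverse_transform(toy["sex"])
print(toy)
print(list(original_sex))
```

Reusing a single `LabelEncoder` instance, as in the cell above, works for encoding but only remembers the classes of the last column it was fitted on.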

#### Select independent variables and target column

In [214]:
# Select the independent variables and the target attribute
X = dataset[['age', 'sex', 'income', 'married', 'children', 'car', 'save_act',
       'current_act', 'mortgage', 'region']]
Y = dataset['pep']

#### Import Naive Bayes Classifier library 

In [215]:
# import Classifier library
from sklearn.naive_bayes import GaussianNB

In [216]:
# Call the Classifier
nb = GaussianNB()

#### Predict the target column and find the performance of the model

In [217]:
# Divide the dataset into training and testing partition
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.30, random_state = 30)
nb.fit(X_train,Y_train)
predictions = nb.predict(X_test)

In [218]:
# Print Number of mislabeled points
print("Number of mislabeled points out of a total %d points : %d" % (X_test.shape[0], (Y_test != predictions).sum()))

Number of mislabeled points out of a total 144 points : 56


### Prediction and Evaluation

In [219]:
# import required libraries
from sklearn.metrics import classification_report,confusion_matrix
from sklearn.metrics import accuracy_score

In [220]:
# Calculate and print the confusion matrix and other performance measures (refer to the previous lab sheet)
print(classification_report(Y_test,predictions))
print("Confusion Matrix")
print(confusion_matrix(Y_test,predictions))
print("\n Accuracy")
print(accuracy_score(Y_test,predictions))

              precision    recall  f1-score   support

           0       0.62      0.78      0.69        80
           1       0.59      0.41      0.48        64

    accuracy                           0.61       144
   macro avg       0.61      0.59      0.59       144
weighted avg       0.61      0.61      0.60       144

Confusion Matrix
[[62 18]
 [38 26]]

 Accuracy
0.6111111111111112
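The figures in the report can be re-derived from the confusion matrix itself. The sketch below copies the matrix printed above and recomputes accuracy, per-class precision and recall, and the number of mislabeled points with NumPy, assuming scikit-learn's layout (rows are true classes, columns are predicted classes).

```python
import numpy as np

# Confusion matrix from the output above: rows = true class, columns = predicted class
cm = np.array([[62, 18],
               [38, 26]])

correct = np.trace(cm)                     # diagonal = correctly classified points
accuracy = correct / cm.sum()
precision = np.diag(cm) / cm.sum(axis=0)   # per class: correct / all predicted as that class
recall = np.diag(cm) / cm.sum(axis=1)      # per class: correct / all truly in that class
mislabeled = cm.sum() - correct

print(accuracy, precision, recall, mislabeled)
```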


#### Q1: Consider "current_act" as an irrelevant attribute. Remove it and find the accuracy of Naive Bayes classifier

In [221]:
# Drop the 'current_act' column from the training and testing partitions
X_train = X_train.drop(columns=['current_act'])
X_test = X_test.drop(columns=['current_act'])


In [222]:
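# Split the data, fit the classifier and predict
# Note: this split is rebuilt from X, which still contains 'current_act', so the drop in the previous cell is overwritten
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.30, random_state = 30)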
nb.fit(X_train,Y_train)
predictions = nb.predict(X_test)

In [223]:
print("Number of mislabeled points out of a total %d points : %d" % (X_test.shape[0], (Y_test != predictions).sum()))

Number of mislabeled points out of a total 144 points : 56


In [224]:
# Calculate and print the confusion matrix and other performance measures
print(classification_report(Y_test,predictions))
print("Confusion Matrix")
print(confusion_matrix(Y_test,predictions))
print("\n Accuracy")
print(accuracy_score(Y_test,predictions))

              precision    recall  f1-score   support

           0       0.62      0.78      0.69        80
           1       0.59      0.41      0.48        64

    accuracy                           0.61       144
   macro avg       0.61      0.59      0.59       144
weighted avg       0.61      0.61      0.60       144

Confusion Matrix
[[62 18]
 [38 26]]

 Accuracy
0.6111111111111112


#### Q2: Write your observation

In [225]:
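# Accuracy is the same as above (0.611). However, the split in In [222] is recreated from X, which still includes "current_act", so both runs train on the same features. The identical result therefore does not by itself show that NB is robust to irrelevant features.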
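For a column removal to affect the model, the column has to be absent from the data that is actually passed to `train_test_split` and `fit`. A minimal sketch of that order of operations, assuming a synthetic stand-in for the bank data (random values; `X_demo`, `y_demo` and `X_reduced` are hypothetical names, and no accuracy figure is implied for the real dataset):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the bank data (random values, for illustration only)
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "age": rng.integers(18, 70, size=100),
    "income": rng.normal(30000, 8000, size=100),
    "current_act": rng.integers(0, 2, size=100),
})
y_demo = pd.Series(rng.integers(0, 2, size=100), name="pep")

# Drop the column from the full feature matrix first, then split
X_reduced = X_demo.drop(columns=["current_act"])
X_tr, X_te, y_tr, y_te = train_test_split(
    X_reduced, y_demo, test_size=0.30, random_state=30
)

# The classifier now never sees 'current_act'
model = GaussianNB().fit(X_tr, y_tr)
preds = model.predict(X_te)
print(list(X_tr.columns), len(preds))
```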

In [226]:
# Find the pairwise correlation of attributes and arrange in ascending order
c = dataset.corr().abs()
#print(c)
s = c.unstack()
so = s.sort_values(kind="quicksort")
print(so)

income       married        0.000374
married      income         0.000374
mortgage     married        0.001784
married      mortgage       0.001784
             age            0.002897
age          married        0.002897
region       married        0.003875
married      region         0.003875
children     sex            0.004089
sex          children       0.004089
region       car            0.004216
car          region         0.004216
mortgage     car            0.005357
car          mortgage       0.005357
region       pep            0.006827
pep          region         0.006827
region       income         0.010030
income       region         0.010030
save_act     car            0.011336
car          save_act       0.011336
sex          save_act       0.012936
save_act     sex            0.012936
             mortgage       0.013255
mortgage     save_act       0.013255
married      sex            0.013560
sex          married        0.013560
save_act     children       0.013860
...
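The sorted listing above contains every pair twice (for example income/married and married/income). One way to list each pair once, sketched here on a small made-up frame, is to mask everything except the upper triangle of the correlation matrix before stacking. The `toy` data is hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up data: 'b' is strongly related to 'a'
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 11],
    "c": [5, 3, 4, 1, 2],
})

corr = toy.corr().abs()

# Keep only the upper triangle; k=1 also excludes the diagonal (self-correlation)
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# Masked cells become NaN; dropna() removes them after stacking
pairs = corr.where(mask).stack().dropna().sort_values(ascending=False)

top_pair = pairs.index[0]
print(pairs)
```

Sorting in descending order puts the most strongly correlated pair first, which is the pair of interest when choosing a column to drop.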

In [227]:
# Drop the 'income' column, then refit and predict
X_train = X_train.drop(columns=['income'])
X_test = X_test.drop(columns=['income'])
nb.fit(X_train,Y_train)
predictions = nb.predict(X_test)


In [228]:
print(classification_report(Y_test,predictions))
print("Confusion Matrix")
print(confusion_matrix(Y_test,predictions))
print("\n Accuracy")
print(accuracy_score(Y_test,predictions))

              precision    recall  f1-score   support

           0       0.64      0.70      0.67        80
           1       0.58      0.52      0.55        64

    accuracy                           0.62       144
   macro avg       0.61      0.61      0.61       144
weighted avg       0.61      0.62      0.62       144

Confusion Matrix
[[56 24]
 [31 33]]

 Accuracy
0.6180555555555556


In [238]:
# Accuracy improved slightly (0.611 to 0.618, one more correct prediction out of 144) after removing 'income', chosen here as a correlated feature

### Load "car.csv" dataset. 

#### Q3: Apply Naive Bayes classifier on this dataset

In [229]:
# Load the data
df = pd.read_csv('car.csv', header= None)
df.columns = ['buying', 'maint','doors','persons','lug_boot','safety','class']
# Display the rows in shuffled order; the result is not assigned, so df itself keeps its original order
df.sample(frac=1)


Unnamed: 0,buying,maint,doors,persons,lug_boot,safety,class
506,high,vhigh,4,6,small,high,unacc
939,med,vhigh,4,6,med,low,unacc
1354,low,vhigh,4,2,med,med,unacc
1167,med,med,6,2,big,low,unacc
320,vhigh,med,6,6,med,high,acc
536,high,vhigh,6,6,med,high,unacc
556,high,high,2,4,big,med,acc
827,high,low,4,4,big,high,acc
465,high,vhigh,3,2,big,low,unacc
1507,low,high,6,6,med,med,acc


In [230]:
# Preprocess and transform the data using the "fit_transform(attribute)" function
df.buying = le.fit_transform(df.buying)
df.maint = le.fit_transform(df.maint)
df.lug_boot = le.fit_transform(df.lug_boot)
df.safety = le.fit_transform(df.safety)
df['class'] = le.fit_transform(df['class'])

In [231]:
# Select the independent variables and the target attribute
X = df[df.columns[:-1]]  # all columns except the last are the independent variables
Y = df[df.columns[-1]]   # the last column ('class') is the target label

In [232]:
# Apply the classifier
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.30, random_state = 30)
nb.fit(X_train,Y_train)
predictions = nb.predict(X_test)

In [233]:
# Print Number of mislabeled points
print("Number of mislabeled points out of a total %d points : %d" % (X_test.shape[0], (Y_test != predictions).sum()))

Number of mislabeled points out of a total 519 points : 186


In [234]:
# Calculate and print confusion matrix and other performance measures
print(classification_report(Y_test,predictions))
print("Confusion Matrix")
print(confusion_matrix(Y_test,predictions))
print("\n Accuracy")
print(accuracy_score(Y_test,predictions))

              precision    recall  f1-score   support

           0       0.50      0.12      0.19       111
           1       0.00      0.00      0.00        21
           2       0.85      0.82      0.83       368
           3       0.14      1.00      0.24        19

    accuracy                           0.64       519
   macro avg       0.37      0.48      0.32       519
weighted avg       0.72      0.64      0.64       519

Confusion Matrix
[[ 13   0  45  53]
 [  2   0   7  12]
 [ 11   0 301  56]
 [  0   0   0  19]]

 Accuracy
0.6416184971098265
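`GaussianNB` treats the label-encoded columns as continuous values. One alternative that could be tried (an assumption on our part, not something the lab sheet prescribes) is scikit-learn's `CategoricalNB`, which models each feature as a discrete distribution. The sketch below uses a tiny made-up table rather than car.csv, so it says nothing about how the accuracy above would change.

```python
import pandas as pd
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

# Tiny made-up table: the class depends only on 'safety'
toy = pd.DataFrame({
    "safety":  ["low", "low", "low", "low", "high", "high", "high", "high"],
    "persons": ["2", "4", "2", "4", "2", "4", "2", "4"],
    "class":   ["unacc", "unacc", "unacc", "unacc", "acc", "acc", "acc", "acc"],
})

# CategoricalNB expects non-negative integer category codes per feature
encoder = OrdinalEncoder()
X_cat = encoder.fit_transform(toy[["safety", "persons"]]).astype(int)
y_cat = toy["class"]

clf = CategoricalNB()  # default alpha=1.0 applies Laplace smoothing
clf.fit(X_cat, y_cat)
preds = clf.predict(X_cat)
print(list(preds))
```

On real data, categories that appear only in the test partition need care, since the model is fitted on the codes seen during training.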




#### Q4: Find the correlation between the attributes of the dataset.

In [235]:
# Find the pairwise correlation of attributes and arrange in ascending order
c = df.corr().abs()
#print(c)
s = c.unstack()
so = s.sort_values(kind="quicksort")
print(so)

persons   buying      0.000000
safety    buying      0.000000
lug_boot  safety      0.000000
          persons     0.000000
          doors       0.000000
          maint       0.000000
          buying      0.000000
safety    persons     0.000000
persons   safety      0.000000
          lug_boot    0.000000
          doors       0.000000
          maint       0.000000
doors     safety      0.000000
safety    maint       0.000000
doors     persons     0.000000
safety    lug_boot    0.000000
doors     lug_boot    0.000000
          buying      0.000000
buying    maint       0.000000
          doors       0.000000
          persons     0.000000
          lug_boot    0.000000
          safety      0.000000
doors     maint       0.000000
maint     doors       0.000000
          buying      0.000000
          persons     0.000000
          lug_boot    0.000000
          safety      0.000000
safety    doors       0.000000
          class       0.021044
class     safety      0.021044
...

#### Q5: Remove one of the highly correlated attributes and apply Naive Bayes classifier

In [239]:
# Drop a highly correlated attribute
# (already done for the bank dataset earlier)

#### Q6: Write your observation below on the performance of the model in Q4 and Q6

Accuracy increased slightly (from 0.611 to 0.618) after a highly correlated column was removed (done for the bank dataset above).