# Naive Bayes

Naive Bayes is a classification technique based on Bayes’ Theorem that assumes the predictors are independent of one another. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature.

For example, a fruit may be considered to be an apple if it is red, round, and about 3 inches in diameter. Even if these features depend on each other or upon the existence of the other features, the classifier treats each of them as contributing independently to the probability that the fruit is an apple, which is why it is known as ‘Naive’.

A Naive Bayes model is easy to build and particularly useful for very large data sets. Despite its simplicity, it can perform competitively with, and on some problems outperform, more sophisticated classification methods.

Naive Bayes is a classification algorithm for binary (two-class) and multi-class classification problems. The technique is easiest to understand when described using binary or categorical input values.

It is called naive Bayes because the calculation of the probabilities for each hypothesis is simplified to make it tractable. Rather than attempting to calculate the joint probability of the attribute values, P(d1, d2, d3|h), the attributes are assumed to be conditionally independent given the target value, so the probability is calculated as P(d1|h) * P(d2|h) * P(d3|h).
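
The product of per-attribute probabilities can be sketched in a few lines. The priors and likelihoods below are made-up numbers for the fruit example, not estimates from any dataset, and `naive_score` is a hypothetical helper name.

```python
from math import prod

# Hypothetical priors and per-feature likelihoods (illustrative values only)
priors = {"apple": 0.3, "other": 0.7}
likelihoods = {
    "apple": {"red": 0.8, "round": 0.9, "about_3_inches": 0.7},
    "other": {"red": 0.2, "round": 0.4, "about_3_inches": 0.3},
}

def naive_score(h, features):
    """Unnormalised posterior: P(h) * P(d1|h) * P(d2|h) * ..."""
    return priors[h] * prod(likelihoods[h][f] for f in features)

observed = ["red", "round", "about_3_inches"]
scores = {h: naive_score(h, observed) for h in priors}

# Dividing by the sum of the scores plays the role of P(d)
total = sum(scores.values())
posteriors = {h: s / total for h, s in scores.items()}
best = max(posteriors, key=posteriors.get)
print(best, round(posteriors[best], 3))
```

Only one probability per attribute and class is needed, which is what keeps the calculation tractable as the number of attributes grows.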

## Bayes' Theorem
Bayes’ Theorem is stated as:

P(h|d) = (P(d|h) * P(h)) / P(d)

Where

###### P(h|d) is the probability of hypothesis h given the data d. This is called the posterior probability.
###### P(d|h) is the probability of data d given that the hypothesis h was true.
###### P(h) is the probability of hypothesis h being true (regardless of the data). This is called the prior probability of h.
###### P(d) is the probability of the data (regardless of the hypothesis).
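
As a quick numerical check of the formula, the sketch below plugs in invented probabilities (they are assumptions chosen for illustration). P(d) is obtained by summing over the two cases "h is true" and "h is false".

```python
def posterior(p_d_given_h, p_h, p_d):
    """Bayes' Theorem: P(h|d) = P(d|h) * P(h) / P(d)."""
    return p_d_given_h * p_h / p_d

# Invented values for illustration
p_h = 0.01              # prior P(h)
p_d_given_h = 0.9       # P(d|h)
p_d_given_not_h = 0.05  # P(d|not h)

# Total probability of the data across both hypotheses
p_d = p_d_given_h * p_h + p_d_given_not_h * (1 - p_h)

p_h_given_d = posterior(p_d_given_h, p_h, p_d)
print(round(p_h_given_d, 4))
```

Even with a high P(d|h), the posterior stays modest here because the prior P(h) is small.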

### Useful Libraries

#### Load Dataset. Use "bank-data.csv"

In [1]:
# import dataset
import pandas as pd
df = pd.read_csv('bank-data.csv')
df.head()

Unnamed: 0,id,age,sex,region,income,married,children,car,save_act,current_act,mortgage,pep
0,ID12101,48,FEMALE,INNER_CITY,17546.0,NO,1,NO,NO,NO,NO,YES
1,ID12102,40,MALE,TOWN,30085.1,YES,3,YES,NO,YES,YES,NO
2,ID12103,51,FEMALE,INNER_CITY,16575.4,YES,0,YES,YES,YES,NO,NO
3,ID12104,23,FEMALE,TOWN,20375.4,YES,3,NO,NO,YES,NO,NO
4,ID12105,57,FEMALE,RURAL,50576.3,YES,0,NO,YES,NO,NO,NO


#### Preprocess the data

In [2]:
# import library for preprocessing
from sklearn import preprocessing
encoder = preprocessing.LabelEncoder()
df.sex = encoder.fit_transform(df.sex)
df.region = encoder.fit_transform(df.region)
df.married = encoder.fit_transform(df.married)
df.car = encoder.fit_transform(df.car)
df.save_act = encoder.fit_transform(df.save_act)
df.current_act = encoder.fit_transform(df.current_act)
df.mortgage = encoder.fit_transform(df.mortgage)
df.head()


Unnamed: 0,id,age,sex,region,income,married,children,car,save_act,current_act,mortgage,pep
0,ID12101,48,0,0,17546.0,0,1,0,0,0,0,YES
1,ID12102,40,1,3,30085.1,1,3,1,0,1,1,NO
2,ID12103,51,0,0,16575.4,1,0,1,1,1,0,NO
3,ID12104,23,0,3,20375.4,1,3,0,0,1,0,NO
4,ID12105,57,0,1,50576.3,1,0,0,1,0,0,NO
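
`LabelEncoder` assigns integer codes in sorted order of the distinct values, which is why `FEMALE` became 0 and `MALE` became 1 above. The sketch below uses a small hand-written list rather than the real file and inspects the learned mapping through `classes_`. The value `SUBURBAN` is an assumption, added to account for the unused code 2 in the `region` column.

```python
from sklearn.preprocessing import LabelEncoder

# Hand-written sample values, not read from bank-data.csv;
# "SUBURBAN" is an assumed fourth region
regions = ["INNER_CITY", "TOWN", "INNER_CITY", "RURAL", "SUBURBAN"]

encoder = LabelEncoder()
codes = encoder.fit_transform(regions)

# classes_ holds the distinct values in sorted order; the index is the integer code
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)
print(codes.tolist())
```

Because the same `encoder` object is refitted for every column, it only remembers the mapping of the last column it saw. Inspect `classes_` right after each `fit_transform` call if the mapping is needed later.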


In [3]:
# Transform data using "fit_transform(attribute)" function

#### Select independent variables and target column

In [4]:
# Select the independent variables and the target attribute
from sklearn.model_selection import train_test_split
X = df[df.columns[1:-1]]
Y = df[df.columns[-1]]

#### Import Naive Bayes Classifier library 

In [5]:
# import Classifier library
from sklearn.naive_bayes import GaussianNB

In [6]:
# Call the Classifier
naiveBayes = GaussianNB()
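
`GaussianNB` models each numeric feature within each class as a normal distribution, so fitting amounts to estimating a per-class mean and variance for every column. A minimal sketch on a toy array (the numbers are invented for illustration) shows the fitted per-class means in `theta_` and the class probabilities returned by `predict_proba`.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Toy data: one feature, two well-separated classes (invented values)
X_toy = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y_toy = np.array(["NO", "NO", "NO", "YES", "YES", "YES"])

model = GaussianNB()
model.fit(X_toy, y_toy)

# theta_ has one row per class (in the order of classes_) and one column per feature
print(model.classes_)
print(model.theta_)

# A point near the first cluster should be assigned to "NO"
print(model.predict([[2.5]]))
print(model.predict_proba([[2.5]]))
```

The Gaussian assumption suits continuous columns such as `age` and `income`. For integer-coded categorical columns it is only an approximation.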

#### Predict the target column and find the performance of the model

In [7]:
# Divide the dataset into training and testing partition
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.30, random_state = 30)
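
With `test_size=0.30`, scikit-learn rounds the size of the test partition up, so a 479-row dataset such as this one yields 144 test rows and 335 training rows (144 is the `support` total in the classification reports). The sketch below checks this on a placeholder array of the same length. `random_state` only fixes which rows land in each partition.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as the bank dataset (479)
X_demo = np.arange(479).reshape(-1, 1)
y_demo = np.arange(479) % 2

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=30
)
print(len(X_tr), len(X_te))
```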

In [8]:
# Print Number of mislabeled points
naiveBayes.fit(X_train, Y_train)
predictions = naiveBayes.predict(X_test)
print("Number of mislabeled points : %d" % (Y_test != predictions).sum())

Number of mislabeled points : 56


### Prediction and Evaluation

In [13]:
# import required libraries
from sklearn.metrics import classification_report,confusion_matrix
from sklearn.metrics import accuracy_score

In [14]:
# Calculate and print confusion matrix and other performance measures (Refer previous labsheet)
print("Classification Report")
print(classification_report(Y_test,predictions))
cm = confusion_matrix(Y_test,predictions)
print("Confusion Matrix")
print(cm)
accuracy = accuracy_score(Y_test,predictions)
print("Accuracy")
print(accuracy)
print("Mislabelled Points")
print(cm[0,1]+cm[1,0])

Classification Report
              precision    recall  f1-score   support

          NO       0.62      0.78      0.69        80
         YES       0.59      0.41      0.48        64

    accuracy                           0.61       144
   macro avg       0.61      0.59      0.59       144
weighted avg       0.61      0.61      0.60       144

Confusion Matrix
[[62 18]
 [38 26]]
Accuracy
0.6111111111111112
Mislabelled Points
56
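
The figures in the report can be recovered by hand from the confusion matrix, whose rows are the actual classes (`NO`, `YES`) and whose columns are the predicted classes. A short sketch using the matrix printed above:

```python
# Confusion matrix from the output above: rows = actual, columns = predicted
cm = [[62, 18],
      [38, 26]]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total

# Precision: correct predictions of a class / all predictions of that class
precision_no = cm[0][0] / (cm[0][0] + cm[1][0])
precision_yes = cm[1][1] / (cm[0][1] + cm[1][1])

# Recall: correct predictions of a class / all actual members of that class
recall_no = cm[0][0] / (cm[0][0] + cm[0][1])
recall_yes = cm[1][1] / (cm[1][0] + cm[1][1])

print(round(accuracy, 4))
print(round(precision_no, 2), round(recall_no, 2))
print(round(precision_yes, 2), round(recall_yes, 2))
```

The off-diagonal entries (18 + 38) give the 56 mislabeled points.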


#### Q1: Consider "current_act" as an irrelevant attribute. Remove it and find the accuracy of Naive Bayes classifier

In [15]:
# keep a copy of the original dataframe, drop the attribute and display the first 5 rows
df1 = df.copy()
df.drop(columns='current_act', inplace=True)
df.head()

Unnamed: 0,id,age,sex,region,income,married,children,car,save_act,mortgage,pep
0,ID12101,48,0,0,17546.0,0,1,0,0,0,YES
1,ID12102,40,1,3,30085.1,1,3,1,0,1,NO
2,ID12103,51,0,0,16575.4,1,0,1,1,0,NO
3,ID12104,23,0,3,20375.4,1,3,0,0,0,NO
4,ID12105,57,0,1,50576.3,1,0,0,1,0,NO


In [16]:
# Selecting the independent variables
X = df[df.columns[1:-1]]
print(X)


     age  sex  region   income  married  children  car  save_act  mortgage
0     48    0       0  17546.0        0         1    0         0         0
1     40    1       3  30085.1        1         3    1         0         1
2     51    0       0  16575.4        1         0    1         1         0
3     23    0       3  20375.4        1         3    0         0         0
4     57    0       1  50576.3        1         0    0         1         0
..   ...  ...     ...      ...      ...       ...  ...       ...       ...
474   31    0       3  22678.1        0         1    1         1         1
475   33    0       3  12178.5        1         2    0         1         1
476   43    1       1  26106.7        0         1    0         0         0
477   40    1       0  27417.6        1         0    0         1         1
478   47    1       3  23337.2        1         2    0         1         1

[479 rows x 9 columns]


In [17]:
# selecting only the target labelled column
Y = df[df.columns[-1]]

In [18]:
# Apply the classifier and Print Number of mislabeled points
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.30, random_state = 30)
naiveBayes.fit(X_train, Y_train)
predictions = naiveBayes.predict(X_test)
print("Number of mislabeled points : %d" % (Y_test != predictions).sum())

Number of mislabeled points : 56


In [19]:
# Calculate and print confusion matrix and other performance measures
print("Classification Report")
print(classification_report(Y_test,predictions))
cm = confusion_matrix(Y_test,predictions)
print("Confusion Matrix")
print(cm)
accuracy = accuracy_score(Y_test,predictions)
print("Accuracy")
print(accuracy)
print("Mislabelled Points")
print(cm[0,1]+cm[1,0])

Classification Report
              precision    recall  f1-score   support

          NO       0.62      0.76      0.69        80
         YES       0.59      0.42      0.49        64

    accuracy                           0.61       144
   macro avg       0.60      0.59      0.59       144
weighted avg       0.61      0.61      0.60       144

Confusion Matrix
[[61 19]
 [37 27]]
Accuracy
0.6111111111111112
Mislabelled Points
56


#### Q2: Write your observation

In [20]:
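print("Accuracy did not change after removing current_act (56 of 144 test points mislabeled both times), which suggests Naive Bayes is fairly robust to an irrelevant attribute")

Accuracy did not change after removing current_act (56 of 144 test points mislabeled both times), which suggests Naive Bayes is fairly robust to an irrelevant attribute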


### Load "car.csv" dataset. 

#### Q3: Apply Naive Bayes classifier on this dataset

In [9]:
# Load the data
import pandas as pd
df2 = pd.read_csv('car.csv', header=None)
df2.columns = ['price', 'maintenance_cost', 'doors', 'person_capacity', 'luggage_boot_size', 'safety', 'class']
# Note: sample(frac=1) returns a shuffled copy; the result is not assigned here,
# so df2 keeps the original row order shown below
df2.sample(frac=1)
df2


Unnamed: 0,price,maintenance_cost,doors,person_capacity,luggage_boot_size,safety,class
0,vhigh,vhigh,2,2,small,low,unacc
1,vhigh,vhigh,2,2,small,med,unacc
2,vhigh,vhigh,2,2,small,high,unacc
3,vhigh,vhigh,2,2,med,low,unacc
4,vhigh,vhigh,2,2,med,med,unacc
...,...,...,...,...,...,...,...
1723,low,low,6,6,med,med,good
1724,low,low,6,6,med,high,vgood
1725,low,low,6,6,big,low,unacc
1726,low,low,6,6,big,med,good


In [22]:
# Preprocess and transform data using "fit_transform(attribute)" function
df2.price = encoder.fit_transform(df2.price)
df2.maintenance_cost = encoder.fit_transform(df2.maintenance_cost)
df2.doors = encoder.fit_transform(df2.doors)
df2.person_capacity = encoder.fit_transform(df2.person_capacity)
df2.luggage_boot_size = encoder.fit_transform(df2.luggage_boot_size)
df2.safety = encoder.fit_transform(df2.safety)
df2.head()


Unnamed: 0,price,maintenance_cost,doors,person_capacity,luggage_boot_size,safety,class
0,3,3,0,0,2,1,unacc
1,3,3,0,0,2,2,unacc
2,3,3,0,0,2,0,unacc
3,3,3,0,0,1,1,unacc
4,3,3,0,0,1,2,unacc


In [23]:
# Select the independent variables and the target attribute
X2 = df2[df2.columns[:-1]]
Y2 = df2[df2.columns[-1]]

In [24]:
# Apply the classifier
naiveBayes2 = GaussianNB()



In [25]:
# Divide the dataset into training and testing partition
# predictions for testing partition
X_train2, X_test2, Y_train2, Y_test2 = train_test_split(X2, Y2, test_size=0.30, random_state = 30)

In [26]:
# Print Number of mislabeled points
naiveBayes2.fit(X_train2, Y_train2)
predictions2 = naiveBayes2.predict(X_test2)
print("Number of mislabeled points : %d" % (Y_test2 != predictions2).sum())
print("Total number of points : %d" % (X2.shape[0]))

Number of mislabeled points : 195
Total number of points : 1728


In [27]:
# Calculate and print confusion matrix and other performance measures
print("Classification Report")
print(classification_report(Y_test2,predictions2))
cm2 = confusion_matrix(Y_test2,predictions2)
print("Confusion Matrix")
print(cm2)
accuracy2 = accuracy_score(Y_test2,predictions2)
print("Accuracy")
print(accuracy2)

Classification Report
              precision    recall  f1-score   support

         acc       0.45      0.12      0.19       111
        good       0.00      0.00      0.00        21
       unacc       0.85      0.79      0.82       368
       vgood       0.13      1.00      0.23        19

    accuracy                           0.62       519
   macro avg       0.36      0.48      0.31       519
weighted avg       0.70      0.62      0.63       519

Confusion Matrix
[[ 13   0  45  53]
 [  2   0   7  12]
 [ 14   0 292  62]
 [  0   0   0  19]]
Accuracy
0.6242774566473989
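
In the report above the `good` row has precision 0.00 because the classifier never predicted `good` (the second column of the confusion matrix is all zeros), which makes precision undefined for that class. scikit-learn reports such cases as 0 and emits an `UndefinedMetricWarning`. A hedged sketch on invented labels shows how the `zero_division` argument makes that choice explicit:

```python
from sklearn.metrics import classification_report, precision_score

# Invented labels: the class "good" occurs in y_true but is never predicted
y_true = ["acc", "good", "unacc", "unacc", "good", "acc"]
y_pred = ["acc", "unacc", "unacc", "unacc", "acc", "acc"]

# zero_division=0 reports 0.0 for the undefined precision without a warning
print(classification_report(y_true, y_pred, zero_division=0))

# Per-class precision in the order given by labels
per_class = precision_score(
    y_true, y_pred, labels=["acc", "good", "unacc"],
    average=None, zero_division=0,
)
print(per_class)
```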




#### Q4: Find the correlation between the attributes of the dataset.

In [29]:
# Find the pairwise correlation of the numeric attributes (the string column 'class' is excluded) and arrange in ascending order
c = df2.drop(columns='class').corr().abs()
s = c.unstack()
so = s.sort_values(kind="quicksort")
print(so)

doors              safety               0.0
luggage_boot_size  person_capacity      0.0
                   doors                0.0
                   maintenance_cost     0.0
                   price                0.0
person_capacity    safety               0.0
                   luggage_boot_size    0.0
safety             maintenance_cost     0.0
person_capacity    doors                0.0
                   maintenance_cost     0.0
                   price                0.0
safety             luggage_boot_size    0.0
doors              luggage_boot_size    0.0
                   person_capacity      0.0
safety             doors                0.0
doors              maintenance_cost     0.0
                   price                0.0
maintenance_cost   safety               0.0
                   luggage_boot_size    0.0
                   person_capacity      0.0
                   doors                0.0
safety             person_capacity      0.0
maintenance_cost   price        
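
Correlations of exactly 0.0 are what one would expect if the dataset lists every combination of attribute values once, since each value of one attribute is then paired equally often with each value of another. Whether `car.csv` is built this way is an assumption here, though 1728 rows would fit a 4 x 4 x 4 x 3 x 3 x 3 grid. The sketch below checks the idea on a small hypothetical grid, with `pearson` as an illustrative helper.

```python
from itertools import product
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical grid: every combination of a 4-level and a 3-level attribute
grid = list(product(range(4), range(3)))
a = [row[0] for row in grid]
b = [row[1] for row in grid]

print(pearson(a, b))
```

A zero linear correlation between inputs says nothing about how useful each input is for predicting the class.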

#### Q5: Remove one of the highly correlated attributes and apply Naive Bayes classifier

In [128]:
# Drop highly correlated attribute
df2.drop(columns='luggage_boot_size', inplace=True)
X2 = df2[df2.columns[:-1]]
Y2 = df2[df2.columns[-1]]


In [129]:
# Apply the classifier
# Divide the dataset into training and testing partition
# predictions for testing partition
# Print Number of mislabeled points
X_train2, X_test2, Y_train2, Y_test2 = train_test_split(X2, Y2, test_size=0.30, random_state = 30)
naiveBayes2.fit(X_train2, Y_train2)
predictions2 = naiveBayes2.predict(X_test2)
print("Number of mislabeled points : %d" % (Y_test2 != predictions2).sum())
print("Total number of points : %d" % (X2.shape[0]))


Number of mislabeled points : 216
Total number of points : 1728


In [130]:
# Calculate and print confusion matrix and other performance measures
print("Classification Report")
print(classification_report(Y_test2,predictions2))
cm2 = confusion_matrix(Y_test2,predictions2)
print("Confusion Matrix")
print(cm2)
accuracy2 = accuracy_score(Y_test2,predictions2)
print("Accuracy")
print(accuracy2)

Classification Report
              precision    recall  f1-score   support

         acc       0.50      0.13      0.20       111
        good       0.00      0.00      0.00        21
       unacc       0.85      0.73      0.79       368
       vgood       0.11      1.00      0.20        19

    accuracy                           0.58       519
   macro avg       0.36      0.46      0.30       519
weighted avg       0.71      0.58      0.61       519

Confusion Matrix
[[ 14   0  42  55]
 [  2   0   7  12]
 [ 12   0 270  86]
 [  0   0   0  19]]
Accuracy
0.5838150289017341




#### Q6: Write your observation on the performance of the models in Q3 and Q5

In [131]:
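print("Accuracy dropped from about 0.62 to 0.58 after removing luggage_boot_size; its correlation with every other attribute was 0.0, so the removed attribute was not redundant")

Accuracy dropped from about 0.62 to 0.58 after removing luggage_boot_size; its correlation with every other attribute was 0.0, so the removed attribute was not redundant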
