# Naive Bayes Classifier

Implementing a Naive Bayes Classifier from scratch.

### Step 0: Loading Data

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

In [2]:
df = pd.read_csv('diabetes.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
Pregnancies                 768 non-null int64
Glucose                     768 non-null int64
BloodPressure               768 non-null int64
SkinThickness               768 non-null int64
Insulin                     768 non-null int64
BMI                         768 non-null float64
DiabetesPedigreeFunction    768 non-null float64
Age                         768 non-null int64
Outcome                     768 non-null int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


In [3]:
df.head()

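| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |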


In [4]:
df.isnull().any().any()

False

Q1. Fill out this function which splits the dataset into X_train, y_train, X_test, y_test.

In [5]:
def splitDataset(dataset, split, target_label):
    dataset = dataset.sample(frac=1)  # shuffle rows; pass e.g. random_state=0 to sample() for a repeatable split
    train_size = int(len(dataset) * split)
    X = dataset.drop(target_label, axis=1)
    y = dataset[target_label]
    X_train = X[:train_size].values
    X_test = X[train_size:].values
    y_train = y[:train_size].values
    y_test = y[train_size:].values
    return X_train, y_train, X_test, y_test

In [6]:
dataset = pd.DataFrame([[1, 0], [2, 0], [3, 1], [4, 1], [5, 1]])
split = 0.67
target_label = 1
X_train, y_train, X_test, y_test = splitDataset(dataset, split, target_label)
print(('Split {0} rows into train with {1} and test with {2}').format(len(dataset), len(y_train), len(y_test)))

Split 5 rows into train with 3 and test with 2
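
Because `dataset.sample(frac=1)` is called without a seed, the rows that end up in each split change from run to run. The following is a minimal sketch of the same shuffle-then-cut idea using only the standard library and a fixed seed. The function name `split_rows` and the `seed` parameter are hypothetical and not part of the notebook.

```python
import random

def split_rows(rows, labels, split, seed=0):
    """Shuffle row indices with a fixed seed, then cut at int(n * split)."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)   # seeded, so the order is repeatable
    train_size = int(len(rows) * split)    # e.g. int(5 * 0.67) == 3
    train_idx, test_idx = indices[:train_size], indices[train_size:]
    X_train = [rows[i] for i in train_idx]
    y_train = [labels[i] for i in train_idx]
    X_test = [rows[i] for i in test_idx]
    y_test = [labels[i] for i in test_idx]
    return X_train, y_train, X_test, y_test

rows = [[1], [2], [3], [4], [5]]
labels = [0, 0, 1, 1, 1]
X_train, y_train, X_test, y_test = split_rows(rows, labels, 0.67, seed=42)
print(len(X_train), len(X_test))  # 3 2
```

Calling `split_rows` twice with the same seed returns the same split, which makes accuracy figures easier to compare between runs.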


### Step 1: Separate By Class

Q2. Fill out this function, which separates the data by class (label). Build a dictionary whose keys are the class values and whose values are lists of all the records that belong to that class.

In [7]:
def separateByClass(X, y):
    separated = {}
    for i in range(len(X)):
        if y[i] not in separated:
            separated[y[i]] = []
        separated[y[i]].append(X[i])
    return separated

In [8]:
#This cell should run properly
X = [[1, 20], [2, 21], [3, 22]]
y = [1, 0, 1]
separated = separateByClass(X, y)
separated

{1: [[1, 20], [3, 22]], 0: [[2, 21]]}
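
The same grouping can be written a little more compactly with `collections.defaultdict`, which creates the empty list for a new class automatically. This is only an alternative sketch. The name `separate_by_class` is hypothetical and chosen to avoid clashing with the notebook's `separateByClass`.

```python
from collections import defaultdict

def separate_by_class(X, y):
    """Group the rows of X by their label in y."""
    separated = defaultdict(list)
    for row, label in zip(X, y):
        separated[label].append(row)   # a missing key starts as an empty list
    return dict(separated)             # plain dict, same shape as the original

result = separate_by_class([[1, 20], [2, 21], [3, 22]], [1, 0, 1])
print(result)  # {1: [[1, 20], [3, 22]], 0: [[2, 21]]}
```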

### Step 2: Summarize Data

Q3. Fill out the functions which return the mean and standard deviation of a list of numbers.

In [9]:
import math


def mean(numbers):
    return sum(numbers) / float(len(numbers))


def stdev(numbers):
    avg = mean(numbers)
    # Sample variance: divide by n - 1 rather than n
    variance = sum([pow(x - avg, 2) for x in numbers]) / float(len(numbers) - 1)
    return math.sqrt(variance)

In [10]:
#This cell should run properly
numbers = [1, 2, 3, 4, 5]
print(('Summary of {0}: mean={1}, stdev={2}').format(numbers, mean(numbers), stdev(numbers)))

Summary of [1, 2, 3, 4, 5]: mean=3.0, stdev=1.5811388300841898
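
The value 1.58 above is the sample standard deviation (denominator n - 1), not the population one (denominator n). As a quick cross-check, the standard library's `statistics` module offers both, and the sample version should agree with the hand-written `stdev`.

```python
import math
import statistics

numbers = [1, 2, 3, 4, 5]
sample_sd = statistics.stdev(numbers)       # divides by n - 1
population_sd = statistics.pstdev(numbers)  # divides by n
print(statistics.mean(numbers), sample_sd, population_sd)
```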


Q4. Fill out a function which calculates the mean and standard deviation of each column in a dataset. Store the mean and standard deviation for each column as a tuple or list. Return a list that contains each column's statistics (mean, stdev).

In [11]:
def summarize(X):
    summary = [(mean(col), stdev(col)) for col in zip(*X)]
    return summary

In [12]:
#This cell should run properly
dataset = [[1, 20, 0], [2, 21, 0], [3, 22, 10]]
summary = summarize(dataset)
print(('Attribute summaries: {0}').format(summary))

Attribute summaries: [(2.0, 1.0), (21.0, 1.0), (3.3333333333333335, 5.773502691896257)]
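
The column-wise statistics rely on `zip(*X)`, which transposes a list of rows into a sequence of columns. A small illustration on the same toy dataset:

```python
dataset = [[1, 20, 0], [2, 21, 0], [3, 22, 10]]

# Unpacking the rows into zip() pairs up the i-th element of every row,
# so each resulting tuple is one column of the dataset.
columns = list(zip(*dataset))
print(columns)  # [(1, 2, 3), (20, 21, 22), (0, 0, 10)]
```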


### Step 3: Summarize Data By Class

Q5. Summarize the columns in the dataset organized by class values. Split the dataset by class, then calculate statistics on each subset. Return a dictionary that contains the results in the form of a list of tuples of statistics for each class value.

In [13]:
def summarizeByClass(X, y):
    separated = separateByClass(X, y)
    summaries = {}
    for label in separated:
        summaries[label] = summarize(separated[label])
    return summaries

In [14]:
#This should work properly
X = [[1, 20], [2, 21], [3, 22], [4, 21]]
y = [1, 0, 1, 0]
summary = summarizeByClass(X, y)
print(('Summary by class value: {0}').format(summary))

Summary by class value: {1: [(2.0, 1.4142135623730951), (21.0, 1.4142135623730951)], 0: [(3.0, 1.4142135623730951), (21.0, 0.0)]}
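
Note that the second attribute of class 0 has a standard deviation of 0.0 in the output above, because both of its values are 21. A Gaussian density with zero standard deviation is undefined, since the formula divides by it. One possible workaround, sketched here under the assumption that a tiny floor is acceptable, is to clamp the standard deviation from below. The function `stdev_with_floor` and the constant `MIN_STDEV` are hypothetical and not part of the notebook, and the floor value is an arbitrary choice.

```python
import math

MIN_STDEV = 1e-9  # hypothetical floor; pick a value suited to the data's scale

def stdev_with_floor(numbers, floor=MIN_STDEV):
    """Sample standard deviation, clamped so it is never exactly zero."""
    avg = sum(numbers) / len(numbers)
    variance = sum((x - avg) ** 2 for x in numbers) / (len(numbers) - 1)
    return max(math.sqrt(variance), floor)

constant_column = stdev_with_floor([21, 21])  # would be 0.0 without the floor
varying_column = stdev_with_floor([20, 22])   # unaffected by the floor
print(constant_column, varying_column)
```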


### Step 4: Gaussian Probability Density Function

We will now calculate the likelihood of a data point under a given class.

One way to do this is to assume that the values of each attribute are drawn from a known distribution, such as a Gaussian distribution (bell curve).

Q6. Fill out the function which calculates the likelihood of a data point using the Gaussian density function.
<img src=https://wikimedia.org/api/rest_v1/media/math/render/svg/f0506065a47bd1efc86fe9aa01a1ed66c6846a02>
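
In case the image does not render, the Gaussian density with mean $\mu$ and standard deviation $\sigma$, which is the form the code in the next cell computes, can be written as:

```latex
f(x \mid \mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)
```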

In [15]:
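def calculateProbability(x, mean, stdev):
    # Exponent of the Gaussian density: -(x - mean)^2 / (2 * stdev^2)
    exp = -(x - mean)**2 / float(2 * (stdev**2))
    # Normalise by stdev * sqrt(2 * pi)
    return np.exp(exp) / float(np.sqrt(2 * np.pi) * stdev)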

In [16]:
#This cell should run properly
x1 = 71.5
mean1 = 73
stdev1 = 6.2
probability = calculateProbability(x1, mean1, stdev1)
print(('Probability of belonging to this class: {0}').format(probability))

Probability of belonging to this class: 0.06248965759370005


### Step 5: Class Probabilities

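Q7. Fill out the function which calculates, for each class, how likely it is that a data point belongs to that class. We can calculate the likelihood of each attribute value under a class using the function above, and combine these likelihoods by multiplying them together (the "naive" independence assumption). The function returns a dictionary that maps each class value to the resulting score for the data point.

    P(class=0|X1,X2) ∝ P(X1|class=0) * P(X2|class=0) * P(class=0)

The right-hand side is proportional to the posterior rather than equal to it, because the normalising term P(X1,X2) is dropped. The function in the next cell multiplies only the per-attribute likelihoods and leaves out the prior P(class=0), which is equivalent to assuming equal priors for all classes.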

In [17]:
def calculateClassProbabilities(summaries, inputVector):
    probabilities = {}
    for classValue, classSummaries in summaries.items():
        probabilities[classValue] = 1
        for i in range(len(classSummaries)):
            mean, stdev = classSummaries[i]
            probabilities[classValue] *= calculateProbability(inputVector[i], mean, stdev)
    return probabilities

In [18]:
#This cell should run properly
summaries = {0: [(1, 0.5)], 1: [(20, 5.0)]}
inputVector = [1.1, '?']  # only the first value is used: each class has a single attribute summary
probabilities = calculateClassProbabilities(summaries, inputVector)
print(('Probabilities for each class: {0}').format(probabilities))

Probabilities for each class: {0: 0.7820853879509118, 1: 6.298736258150442e-05}
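
Multiplying many small densities can underflow to zero when there are many attributes. A common remedy is to add logarithms instead of multiplying. The sketch below also includes the class prior from the formula above. It is a variant for illustration, not the notebook's method: the names `gaussian_log_pdf` and `log_class_scores` are hypothetical, and the equal priors are an assumption made for the example.

```python
import math

def gaussian_log_pdf(x, mean, stdev):
    """Natural log of the Gaussian density at x."""
    return -((x - mean) ** 2) / (2 * stdev ** 2) - math.log(stdev * math.sqrt(2 * math.pi))

def log_class_scores(summaries, priors, input_vector):
    """log P(class) plus the sum of per-attribute log densities, per class."""
    scores = {}
    for class_value, class_summaries in summaries.items():
        score = math.log(priors[class_value])
        # zip stops at the shorter sequence, so extra input values are ignored
        for x, (mean, stdev) in zip(input_vector, class_summaries):
            score += gaussian_log_pdf(x, mean, stdev)
        scores[class_value] = score
    return scores

summaries = {0: [(1, 0.5)], 1: [(20, 5.0)]}
priors = {0: 0.5, 1: 0.5}  # assumed equal priors, for illustration only
scores = log_class_scores(summaries, priors, [1.1])
best_class = max(scores, key=scores.get)
print(scores, best_class)
```

Since the logarithm is monotonic, the class with the highest log score is also the class with the highest product, so the prediction is unchanged whenever the product does not underflow.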


Q8a. Fill out the function which predicts the class a data point belongs to, i.e. the class with the highest probability.

In [19]:
import operator

def predict(summaries, inputVector):
    probabilities = calculateClassProbabilities(summaries, inputVector)
    return max(probabilities.items(), key=operator.itemgetter(1))[0]

In [20]:
#This cell should run properly
summaries = {'A': [(1, 0.5), (2, 1)], 'B': [(20, 5.0), (20, 1.0)]}  # When our dataset has 2 attributes/features
inputVector = [1.1, 3]
result = predict(summaries, inputVector)
print(('Prediction: {0}').format(result))

Prediction: A


Q8b. Fill out this function, which generates predictions for a list of test data points.

In [21]:
def getPredictions(summaries, X_test):
    predictions = []
    for i in range(len(X_test)):
        predictions.append(predict(summaries, X_test[i]))
    return predictions

In [22]:
#This cell should run properly
summaries = {'A': [(1, 0.5), (2, 1)], 'B': [(20, 5.0), (20, 1.0)]}
testSet = [[1.1,3], [19.1, 16]]
predictions = getPredictions(summaries, testSet)
print(('Predictions: {0}').format(predictions))

Predictions: ['A', 'B']


### Step 6: Get Accuracy

Q9. Fill out this function which returns the accuracy of the predictions generated by the Naive Bayes Classifier.

In [23]:
def getAccuracy(y_test, y_pred):
    return sum([y_test[i] == y_pred[i] for i in range(len(y_test))]) / float(len(y_test)) * 100

In [24]:
#This cell should run properly
test = ['a', 'a', 'b']
predictions = ['a', 'a', 'a']
accuracy = getAccuracy(test, predictions)
print(('Accuracy: {0}').format(accuracy))

Accuracy: 66.66666666666666
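
Accuracy alone does not show which classes are being confused. As a small complement, the pairs of (actual, predicted) labels can be tallied with `collections.Counter`. This is an illustrative sketch on the same toy labels, not part of the original exercise.

```python
from collections import Counter

actual = ['a', 'a', 'b']
predicted = ['a', 'a', 'a']

# Count each (actual, predicted) pair; off-diagonal pairs are the mistakes.
confusion = Counter(zip(actual, predicted))
print(confusion)

correct = sum(count for (a, p), count in confusion.items() if a == p)
print(correct, "of", len(actual), "correct")  # 2 of 3 correct
```

Here the single error is a `'b'` that was predicted as `'a'`.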


### Step 7: Combine it all

Q10. Fill out this Naive Bayes function, which takes the dataframe and target_label as parameters, prints the sizes of the train/test split, and returns the accuracy on the test set as a percentage.

In [25]:
def NaiveBayesClassifier(dataset, target_label):
    split = 0.7
    X_train, y_train, X_test, y_test = splitDataset(dataset, split, target_label)
    print(('Split {0} rows into train={1} and test={2} rows').format(len(dataset), len(y_train), len(y_test)))

    summaries = summarizeByClass(X_train, y_train)
    y_pred = getPredictions(summaries, X_test)

    return getAccuracy(y_test, y_pred)

In [26]:
#This cell should run properly
NaiveBayesClassifier(df, "Outcome")

Split 768 rows into train=537 and test=231 rows


79.65367965367966
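
Because the split is shuffled without a seed, the accuracy above will differ somewhat between runs. One way to get a steadier picture is to repeat the evaluation and report the mean and spread. The helper `repeated_accuracy` below is a hypothetical sketch, and the three scores in the demo are made-up placeholders, not results from the diabetes data.

```python
import statistics

def repeated_accuracy(evaluate, runs=10):
    """Call a zero-argument evaluation function `runs` times and summarise."""
    scores = [evaluate() for _ in range(runs)]
    return statistics.mean(scores), statistics.stdev(scores)

# Stand-in for a real evaluation: replays three made-up accuracy values.
fake_scores = iter([70.0, 80.0, 75.0])
mean_acc, spread = repeated_accuracy(lambda: next(fake_scores), runs=3)
print(mean_acc, spread)  # 75.0 5.0
```

In the notebook, the call would look like `repeated_accuracy(lambda: NaiveBayesClassifier(df, "Outcome"))`, which also prints the split message once per run.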