### Naive Bayes Classifier
#### Important Formula
$$P(Y=c \mid X) \propto \prod_{i=1}^n P(x_i \mid Y=c) \cdot P(Y=c)$$
Essentially,
$$\text{Posterior Probability} \propto \text{Likelihood} \cdot \text{Prior Probability}$$

##### The complete formula: 
$$P(Y=c \mid X) = \frac{\prod\limits_{i=1}^n P(x_i \mid Y=c) \cdot P(Y=c)}{\sum\limits_{c'=1}^{k}\left( \prod\limits_{i=1}^n P(x_i \mid Y=c') \cdot P(Y=c') \right)}$$
where
- $k$ is the number of classes
- $n$ is the number of features
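
For reference, a short derivation sketch of where this comes from: Bayes' theorem combined with the "naive" assumption that the features are conditionally independent given the class. Here $X = (x_1, \dots, x_n)$, and $P(X)$ is the sum of the numerator over all classes.

$$P(Y=c \mid X) = \frac{P(X \mid Y=c) \cdot P(Y=c)}{P(X)}, \qquad P(X \mid Y=c) = \prod_{i=1}^n P(x_i \mid Y=c)$$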

> We skip the denominator when comparing classes: it is the same for every class (a constant), so it does not change which class has the highest posterior
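
A small numeric sketch of why dropping the denominator is safe. The two scores are made-up values for illustration, not numbers computed from the mushroom data.

```python
# Hypothetical unnormalised scores (likelihood * prior) for two classes.
scores = {"edible": 0.03, "poisonous": 0.01}

# The denominator is the sum of these scores over all classes.
evidence = sum(scores.values())

# Dividing by it rescales the scores so that they sum to 1 ...
posteriors = {label: s / evidence for label, s in scores.items()}

# ... but the class with the largest value stays the same.
best_unnormalised = max(scores, key=scores.get)
best_normalised = max(posteriors, key=posteriors.get)

print(posteriors)
print(best_unnormalised == best_normalised)  # True
```

Normalising only matters if the actual probability values are needed, not just the predicted class.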

### Mushroom Dataset
- Using Naive Bayes to classify mushrooms

In [21]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

plt.style.use("dark_background")

In [22]:
# Reading the dataset
df = pd.read_csv("mushrooms.csv")
df.head()

Unnamed: 0,type,cap_shape,cap_surface,cap_color,bruises,odor,gill_attachment,gill_spacing,gill_size,gill_color,...,stalk_surface_below_ring,stalk_color_above_ring,stalk_color_below_ring,veil_type,veil_color,ring_number,ring_type,spore_print_color,population,habitat
0,p,x,s,n,t,p,f,c,n,k,...,s,w,w,p,w,o,p,k,s,u
1,e,x,s,y,t,a,f,c,b,k,...,s,w,w,p,w,o,p,n,n,g
2,e,b,s,w,t,l,f,c,b,n,...,s,w,w,p,w,o,p,n,n,m
3,p,x,y,w,t,p,f,c,n,n,...,s,w,w,p,w,o,p,k,s,u
4,e,x,s,g,f,n,f,w,b,k,...,s,w,w,p,w,o,e,n,a,g


The dataset has 23 columns: the `type` label and 22 features

The values are categorical letters rather than numbers, hence we need to convert them first
- We use `LabelEncoder` for that
- It maps each distinct value in a column to an integer and replaces every occurrence of the value with that integer
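
As a rough picture of what the encoder does to a single column, here is a pure-Python sketch. The helper `label_encode` is hypothetical and only mimics the behaviour: `LabelEncoder` assigns integer codes in sorted order of the distinct values.

```python
def label_encode(values):
    """Map each distinct value to an integer, in sorted order of the values."""
    classes = sorted(set(values))
    mapping = {value: code for code, value in enumerate(classes)}
    return [mapping[v] for v in values], mapping

# The `type` column of the first five rows shown above.
codes, mapping = label_encode(["p", "e", "e", "p", "e"])

print(mapping)  # {'e': 0, 'p': 1}
print(codes)    # [1, 0, 0, 1, 0]
```

The integers are only identifiers for the categories. Their order carries no meaning, which is fine here because the classifier only ever tests values for equality.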

In [23]:
# Importing the needed libraries
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

#### Encoding the data

In [24]:
le = LabelEncoder()

# Fits and applies the encoder on each column separately (it is refit for every column)
ds = df.apply(le.fit_transform)
ds.head()

Unnamed: 0,type,cap_shape,cap_surface,cap_color,bruises,odor,gill_attachment,gill_spacing,gill_size,gill_color,...,stalk_surface_below_ring,stalk_color_above_ring,stalk_color_below_ring,veil_type,veil_color,ring_number,ring_type,spore_print_color,population,habitat
0,1,5,2,4,1,6,1,0,1,4,...,2,7,7,0,2,1,4,2,3,5
1,0,5,2,9,1,0,1,0,0,4,...,2,7,7,0,2,1,4,3,2,1
2,0,0,2,8,1,3,1,0,0,5,...,2,7,7,0,2,1,4,3,2,3
3,1,5,3,8,1,6,1,0,1,5,...,2,7,7,0,2,1,4,2,3,5
4,0,5,2,3,0,5,1,1,0,4,...,2,7,7,0,2,1,0,3,0,1


Now every column holds integer codes, hence we can work with the data as a NumPy array

In [25]:
data = ds.values
data.shape

(8124, 23)

#### Splitting into train and test sets

In [26]:
x_train, x_test, y_train, y_test = train_test_split(data[:,1:], data[:, 0], test_size=0.2)

x_train.shape, y_test.shape

((6499, 22), (1625,))

In [27]:
np.unique(y_train)

array([0, 1])

We can see that there are only 2 classes of mushroom (0 for `e`, edible, and 1 for `p`, poisonous), and we need to assign each sample to one of them

### Building Our Classifier

#### Some underlying functions

In [28]:
def priorProbability(y, label):

    total = y.shape[0]
    current = np.sum(y == label)

    return current/total

In [29]:
def conditionalProbability(X, y, feature_column, feature_value, label):
    # Computes P(x_f = feature_value | y = label), estimated from counts

    # getting all X where y is label
    X_filtered = X[y == label]

    numerator = np.sum(X_filtered[:, feature_column] == feature_value)
    denominator = np.sum(y == label)

    return numerator/denominator
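
To make the two estimates concrete, here is the same counting done by hand on a tiny made-up dataset (one feature, five samples). The arrays are assumptions for illustration only.

```python
import numpy as np

# Toy data: a single feature column and binary labels.
X = np.array([[0], [0], [1], [1], [1]])
y = np.array([0, 0, 0, 1, 1])

# Prior: fraction of samples with label 0 (3 of 5).
prior = np.sum(y == 0) / y.shape[0]

# Conditional: among samples with label 0, fraction whose feature equals 0 (2 of 3).
X_filtered = X[y == 0]
cond = np.sum(X_filtered[:, 0] == 0) / X_filtered.shape[0]

# A value that never occurs with a label gets probability 0 (0 of 2 here).
cond_unseen = np.sum(X[y == 1][:, 0] == 0) / np.sum(y == 1)

print(prior, cond, cond_unseen)
```

Note the last case: a single zero conditional makes the whole product for that class zero, however well the other features match.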


In [30]:
def predict(X, y, x_test):
    # x_test is a single point, having n features

    classes = np.unique(y)
    n = X.shape[1]

    posteriorProbabilities = []

    # Computing posterior for each class
    for label in classes:

        # Posterior_c = likelihood * prior
        likelihood = 1.0

        for f in range(n):
            cond = conditionalProbability(X, y, f, x_test[f], label)
            likelihood *= cond

        prior = priorProbability(y, label)
        posterior = likelihood * prior

        posteriorProbabilities.append(posterior)

    # argmax returns an index into `classes`; map it back to the class label
    pred = classes[np.argmax(posteriorProbabilities)]
    return pred
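
With 22 features the product above is unlikely to cause trouble, but multiplying a long run of small probabilities can underflow to zero. A common variant, which is not what this notebook does, compares sums of logarithms instead. The helper `log_score` and the numbers below are hypothetical and only illustrate the idea.

```python
import math

def log_score(conditionals, prior):
    """Return log(prior) + sum of log(conditionals); -inf if any factor is 0."""
    factors = [prior, *conditionals]
    if any(p == 0 for p in factors):
        return float("-inf")
    return sum(math.log(p) for p in factors)

# Made-up example: 400 features, each with conditional probability 0.1.
conditionals = [0.1] * 400

product = 0.5
for p in conditionals:
    product *= p  # becomes too small for a float and ends up as 0.0

print(product)                        # 0.0
print(log_score(conditionals, 0.5))   # a finite negative number
```

Because the logarithm is increasing, the class with the largest log score is also the class with the largest product.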

In [31]:
# testing the function
output = predict(x_train, y_train, x_test[1])
print(output, y_test[1])

0 0


We see our prediction matches the true label for this sample!

In [36]:
# Scoring
def score(x_train, y_train, x_test, y_test):

    pred = []
    for i in range(x_test.shape[0]):
        pred.append(predict(x_train, y_train, x_test[i]))

    pred = np.array(pred)
    return np.sum(pred == y_test)/y_test.shape[0]
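
Scoring this way recomputes every prior and conditional probability for each test point. One possible alternative, sketched here under the assumption that the features are non-negative integer codes (as produced by the label encoding), is to count once and look the probabilities up afterwards. The names `fit_tables` and `predict_from_tables` are hypothetical, and the toy arrays are made up.

```python
import numpy as np

def fit_tables(X, y):
    """Precompute priors and per-feature tables of P(x_f = v | y = c)."""
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])
    tables = []  # tables[f][k, v] = P(x_f = v | y = classes[k])
    for f in range(X.shape[1]):
        n_values = X[:, f].max() + 1
        table = np.zeros((len(classes), n_values))
        for k, c in enumerate(classes):
            counts = np.bincount(X[y == c, f], minlength=n_values)
            table[k] = counts / counts.sum()
        tables.append(table)
    return classes, priors, tables

def predict_from_tables(classes, priors, tables, x):
    """Multiply prior and looked-up conditionals, return the best class label."""
    scores = priors.copy()
    for f, table in enumerate(tables):
        if x[f] < table.shape[1]:
            scores *= table[:, x[f]]
        else:
            scores *= 0.0  # value never seen during training
    return classes[np.argmax(scores)]

# Toy data: two features, binary labels.
X = np.array([[0, 1], [0, 0], [1, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1, 1])

classes, priors, tables = fit_tables(X, y)
print(predict_from_tables(classes, priors, tables, np.array([0, 1])))
print(predict_from_tables(classes, priors, tables, np.array([1, 0])))
```

The estimates are the same counts as before, so the predictions should agree with `predict`; only the amount of repeated work changes.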

In [37]:
score(x_train, y_train, x_test, y_test)

0.9975384615384615

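We get an accuracy of about $99.75\%$ on the test set, which is pretty good. The exact value depends on the random split, since no `random_state` is fixed in `train_test_split`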