# Part 2: Bayes' Theorem & Naive Bayes


Author: Shreya Parjan

7 Oct 2019

This notebook covers part 2 of assignment 4 and follows the tutorial "Naive Bayes Classifier from Scratch" (https://chrisalbon.com/machine_learning/naive_bayes/naive_bayes_classifier_from_scratch/).

## Table of contents
1. [Preliminaries](#s0)
2. [Create Data](#s1)
3. [Calculate Priors](#s2)
4. [Calculate Likelihood](#s3)
5. [Apply Bayes Classifier To New Data Point](#s4)

## Preliminaries
<a id="s0"></a>

In [15]:
import pandas as pd
import numpy as np

## Create Data
<a id="s1"></a>
Our dataset contains data on eight individuals. We will use it to construct a classifier that takes in the height, weight, and foot size of an individual and outputs a prediction for their gender.


In [16]:
# Create an empty dataframe
data = pd.DataFrame()

# Create our target variable
data['Gender'] = ['male','male','male','male','female','female','female','female']

# Create our feature variables
data['Height'] = [6,5.92,5.58,5.92,5,5.5,5.42,5.75]
data['Weight'] = [180,190,170,165,100,150,130,150]
data['Foot_Size'] = [12,11,12,10,6,8,7,9]

# View the data
data


,Gender,Height,Weight,Foot_Size
0,male,6.0,180,12
1,male,5.92,190,11
2,male,5.58,170,12
3,male,5.92,165,10
4,female,5.0,100,6
5,female,5.5,150,8
6,female,5.42,130,7
7,female,5.75,150,9


We also create a single unlabeled observation, `person`, that the classifier will label later.

In [17]:
# Create an empty dataframe
person = pd.DataFrame()

# Create some feature values for this single row
person['Height'] = [6]
person['Weight'] = [130]
person['Foot_Size'] = [8]

# View the data 
person

,Height,Weight,Foot_Size
0,6,130,8


## Calculate Priors
<a id="s2"></a>
Priors can be either constants or probability distributions. In our example, the prior for each gender is simply the fraction of individuals in the dataset with that gender:

In [18]:
# Number of males
n_male = data['Gender'][data['Gender'] == 'male'].count()

# Number of females
n_female = data['Gender'][data['Gender'] == 'female'].count()

# Total rows
total_ppl = data['Gender'].count()
# Number of males divided by the total rows
P_male = n_male/total_ppl

# Number of females divided by the total rows
P_female = n_female/total_ppl
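As a cross-check, the same priors can be read off in one step. This is a minimal sketch, not part of the tutorial; the name `priors` is introduced here for illustration, and the `Gender` column is rebuilt so the snippet runs on its own.

```python
import pandas as pd

# Same Gender column as the dataset above
gender = pd.Series(['male'] * 4 + ['female'] * 4, name='Gender')

# normalize=True returns each class's share of the rows instead of raw counts
priors = gender.value_counts(normalize=True)

print(priors['male'], priors['female'])
```

With four males and four females, both priors are 0.5, so in this dataset the prior does not favor either class.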

## Calculate Likelihood
<a id="s3"></a>
For each class (e.g. female) and feature (e.g. height) combination, we need the mean and variance of that feature within that class; these are the parameters of the Gaussian likelihood defined later in this section. Pandas makes this easy:

In [19]:
# Group the data by gender and calculate the means of each feature
data_means = data.groupby('Gender').mean()

# View the values
data_means

Gender,Height,Weight,Foot_Size
female,5.4175,132.5,7.5
male,5.855,176.25,11.25


In [20]:
# Group the data by gender and calculate the variance of each feature
data_variance = data.groupby('Gender').var()

# View the values
data_variance

Gender,Height,Weight,Foot_Size
female,0.097225,558.333333,1.666667
male,0.035033,122.916667,0.916667
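The two tables above can also be produced in a single call. The sketch below is an illustration rather than the tutorial's method; `stats` and `pop_var` are names introduced here. It also makes explicit that pandas' `var()` defaults to the sample variance (`ddof=1`), which is what the table above reports.

```python
import pandas as pd

data = pd.DataFrame({
    'Gender': ['male'] * 4 + ['female'] * 4,
    'Height': [6, 5.92, 5.58, 5.92, 5, 5.5, 5.42, 5.75],
    'Weight': [180, 190, 170, 165, 100, 150, 130, 150],
    'Foot_Size': [12, 11, 12, 10, 6, 8, 7, 9],
})

# One row per class; columns form a (feature, statistic) MultiIndex
stats = data.groupby('Gender').agg(['mean', 'var'])
print(stats['Height'])

# Population variance (ddof=0), shown only for comparison with the default
pop_var = data.groupby('Gender').var(ddof=0)
print(pop_var['Height'])
```

With only four rows per class, the two variance conventions differ by a factor of 3/4, so the choice noticeably affects the likelihoods.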


I wrote the helper functions `getMeans` and `getVars` to remove the repetition in the tutorial's code. The tutorial's original lines are kept below as comments for comparison.

In [21]:

# Return the (height, weight, foot size) means for the given gender
def getMeans(gender):
    hMean = data_means['Height'][data_means.index==gender].values[0]
    wMean = data_means['Weight'][data_means.index==gender].values[0]
    fMean = data_means['Foot_Size'][data_means.index==gender].values[0]
    return hMean, wMean, fMean

# Return the (height, weight, foot size) variances for the given gender
def getVars(gender):
    hVar = data_variance['Height'][data_variance.index==gender].values[0]
    wVar = data_variance['Weight'][data_variance.index==gender].values[0]
    fVar = data_variance['Foot_Size'][data_variance.index==gender].values[0]
    return hVar, wVar, fVar

male_height_mean, male_weight_mean, male_footsize_mean = getMeans('male')

# Means for male
##male_height_mean = data_means['Height'][data_variance.index == 'male'].values[0]
##male_weight_mean = data_means['Weight'][data_variance.index == 'male'].values[0]
##male_footsize_mean = data_means['Foot_Size'][data_variance.index == 'male'].values[0]


male_height_variance, male_weight_variance, male_footsize_variance = getVars('male')
# Variance for male
##male_height_variance = data_variance['Height'][data_variance.index == 'male'].values[0]
##male_weight_variance = data_variance['Weight'][data_variance.index == 'male'].values[0]
##male_footsize_variance = data_variance['Foot_Size'][data_variance.index == 'male'].values[0]


female_height_mean, female_weight_mean, female_footsize_mean = getMeans('female')
# Means for female
##female_height_mean = data_means['Height'][data_variance.index == 'female'].values[0]
##female_weight_mean = data_means['Weight'][data_variance.index == 'female'].values[0]
##female_footsize_mean = data_means['Foot_Size'][data_variance.index == 'female'].values[0]

female_height_variance, female_weight_variance, female_footsize_variance = getVars('female')
# Variance for female
##female_height_variance = data_variance['Height'][data_variance.index == 'female'].values[0]
##female_weight_variance = data_variance['Weight'][data_variance.index == 'female'].values[0]
##female_footsize_variance = data_variance['Foot_Size'][data_variance.index == 'female'].values[0]
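Because both summary tables are indexed by gender, the per-class values can also be looked up by label with `.loc`. This is one possible alternative to the helpers above, not the tutorial's approach; `get_stats` is a hypothetical helper named here for illustration.

```python
import pandas as pd

data = pd.DataFrame({
    'Gender': ['male'] * 4 + ['female'] * 4,
    'Height': [6, 5.92, 5.58, 5.92, 5, 5.5, 5.42, 5.75],
    'Weight': [180, 190, 170, 165, 100, 150, 130, 150],
    'Foot_Size': [12, 11, 12, 10, 6, 8, 7, 9],
})
data_means = data.groupby('Gender').mean()
data_variance = data.groupby('Gender').var()

# Hypothetical helper: returns two Series (means, variances) indexed by feature name
def get_stats(gender):
    return data_means.loc[gender], data_variance.loc[gender]

male_means, male_vars = get_stats('male')
print(male_means['Height'], male_vars['Foot_Size'])
```

Returning a Series per class keeps the feature names attached to the values, which avoids relying on the order of a returned tuple.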

In [22]:
# Create a function that calculates p(x | y):
def p_x_given_y(x, mean_y, variance_y):

    # Input the arguments into a probability density function
    p = 1/(np.sqrt(2*np.pi*variance_y)) * np.exp((-(x-mean_y)**2)/(2*variance_y))
    
    # return p
    return p
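The function above evaluates the normal density $p(x \mid y) = \frac{1}{\sqrt{2\pi\sigma_y^2}} \exp\left(-\frac{(x-\mu_y)^2}{2\sigma_y^2}\right)$. The sketch below is a quick sanity check of that formula, using the male height mean and variance from the tables above as example inputs; the grid size and the ±10 standard deviation range are arbitrary choices made here.

```python
import numpy as np

def p_x_given_y(x, mean_y, variance_y):
    # Normal density with the given mean and variance
    return 1 / np.sqrt(2 * np.pi * variance_y) * np.exp(-(x - mean_y) ** 2 / (2 * variance_y))

mean, variance = 5.855, 0.035033  # male height statistics from the tables above

# The density peaks at the mean; with a small variance the peak exceeds 1
peak = p_x_given_y(mean, mean, variance)

# Riemann-sum approximation of the area under the curve
sd = np.sqrt(variance)
xs = np.linspace(mean - 10 * sd, mean + 10 * sd, 20001)
area = np.sum(p_x_given_y(xs, mean, variance)) * (xs[1] - xs[0])

print(peak, area)
```

The peak value is greater than 1 while the area is approximately 1, a reminder that `p_x_given_y` returns a density rather than a probability.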

## Apply Bayes Classifier To New Data Point
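<a id="s4"></a>
For each class, the numerator of the posterior is the prior multiplied by the likelihood of each feature value under that class. The denominator (the evidence) is the same for both classes, so comparing numerators is enough to pick the more probable class.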

In [23]:
# Numerator of the posterior if the unclassified observation is a male
P_male * \
p_x_given_y(person['Height'][0], male_height_mean, male_height_variance) * \
p_x_given_y(person['Weight'][0], male_weight_mean, male_weight_variance) * \
p_x_given_y(person['Foot_Size'][0], male_footsize_mean, male_footsize_variance)

6.197071843878078e-09

In [24]:
# Numerator of the posterior if the unclassified observation is a female
P_female * \
p_x_given_y(person['Height'][0], female_height_mean, female_height_variance) * \
p_x_given_y(person['Weight'][0], female_weight_mean, female_weight_variance) * \
p_x_given_y(person['Foot_Size'][0], female_footsize_mean, female_footsize_variance)

0.0005377909183630018

Because the numerator of the posterior is greater for female than for male, we predict that the person is female.
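The two numerators can be turned into posterior probabilities by dividing each by their sum. The following is a self-contained sketch of the whole pipeline under that assumption; the names `gaussian_pdf`, `numerators`, `posteriors`, and `prediction` are introduced here and do not appear in the tutorial.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    'Gender': ['male'] * 4 + ['female'] * 4,
    'Height': [6, 5.92, 5.58, 5.92, 5, 5.5, 5.42, 5.75],
    'Weight': [180, 190, 170, 165, 100, 150, 130, 150],
    'Foot_Size': [12, 11, 12, 10, 6, 8, 7, 9],
})
person = {'Height': 6, 'Weight': 130, 'Foot_Size': 8}

priors = data['Gender'].value_counts(normalize=True)
means = data.groupby('Gender').mean()
variances = data.groupby('Gender').var()

def gaussian_pdf(x, mean, variance):
    # Normal density, same formula as p_x_given_y above
    return np.exp(-(x - mean) ** 2 / (2 * variance)) / np.sqrt(2 * np.pi * variance)

# Prior times the product of per-feature likelihoods, for each class
numerators = {}
for gender in priors.index:
    likelihood = np.prod([
        gaussian_pdf(person[f], means.loc[gender, f], variances.loc[gender, f])
        for f in person
    ])
    numerators[gender] = priors[gender] * likelihood

# Dividing by the shared evidence term makes the posteriors sum to 1
evidence = sum(numerators.values())
posteriors = {g: n / evidence for g, n in numerators.items()}
prediction = max(posteriors, key=posteriors.get)

print(numerators, posteriors, prediction)
```

Normalizing does not change which class wins; it only rescales the numerators so they can be read as probabilities.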