# Part 3: Design your own classification problem

Author: Shreya Parjan

7 Oct 2019

This notebook covers part 3 of assignment 4. Following the tutorial from part 2, I create a new dataset of applicants to Wellesley who were admitted or rejected on the basis of GPA and test scores. I then write a Naive Bayes classifier to classify a new applicant.

## Table of contents
1. [Preliminaries](#s0)
2. [Create Data](#s1)
3. [Calculate Priors](#s2)
4. [Calculate Likelihood](#s3)
5. [Apply Bayes Classifier To New Data Point](#s4)

## Preliminaries
<a id="s0"></a>

In [18]:
import pandas as pd
import numpy as np

## Create Data
<a id="s1"></a>
Our dataset contains data on eight applicants. We will use it to construct a classifier that takes in the GPA, SAT Math, and SAT Writing scores of an applicant and outputs a prediction for their admission decision (admit or reject).


In [19]:
# Create an empty dataframe
data = pd.DataFrame()

# Create our target variable
data['Decision'] = ['admit','reject','reject','admit','reject','admit','reject','admit']

# Create our feature variables
data['GPA'] = [3.9,3.2,2.9,4.0,3.6,3.8,3.7,3.8]
data['SAT Math'] = [740,570,700,800,690,700,610,780]
data['SAT Writing'] = [750,600,690,780,690,720,630,800]

# View the data
data


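| | Decision | GPA | SAT Math | SAT Writing |
|---|---|---|---|---|
| 0 | admit | 3.9 | 740 | 750 |
| 1 | reject | 3.2 | 570 | 600 |
| 2 | reject | 2.9 | 700 | 690 |
| 3 | admit | 4.0 | 800 | 780 |
| 4 | reject | 3.6 | 690 | 690 |
| 5 | admit | 3.8 | 700 | 720 |
| 6 | reject | 3.7 | 610 | 630 |
| 7 | admit | 3.8 | 780 | 800 |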


In [20]:
# Create an empty dataframe
person = pd.DataFrame()

# Create some feature values for this single row
person['GPA'] = [3.7]
person['SAT Math'] = [740]
person['SAT Writing'] = [770]

# View the data 
person

| | GPA | SAT Math | SAT Writing |
|---|---|---|---|
| 0 | 3.7 | 740 | 770 |


## Calculate Priors
<a id="s2"></a>
Priors can be either constants or probability distributions. In our example, the priors are simply the probabilities of being admitted and of being rejected, estimated from how often each decision appears in the data. Calculating them is simple:

In [28]:
# Number of admits
n_admit = data['Decision'][data['Decision'] == 'admit'].count()

# Number of rejects
n_reject = data['Decision'][data['Decision'] == 'reject'].count()

# Total rows
total_ppl = data['Decision'].count()
# Number of admits divided by the total rows
P_admit = n_admit/total_ppl

# Number of rejects divided by the total rows
P_reject = n_reject/total_ppl
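
As a cross-check on the priors above, the same class frequencies can be obtained in one step with `value_counts(normalize=True)`. This is only a sketch of an alternative, not part of the original tutorial; the `decisions` and `priors` names are introduced here for illustration.

```python
import pandas as pd

# Same target column as in the dataset above (name `decisions` is illustrative)
decisions = pd.Series(
    ['admit', 'reject', 'reject', 'admit', 'reject', 'admit', 'reject', 'admit'],
    name='Decision',
)

# Relative frequency of each class, i.e. the empirical prior
priors = decisions.value_counts(normalize=True)

P_admit = priors['admit']
P_reject = priors['reject']
print(priors)
```

With four admits and four rejects, both priors come out equal, so in this dataset the prior does not favor either class and the prediction is driven entirely by the likelihoods.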

## Calculate Likelihood
<a id="s3"></a>
We model each feature as normally distributed within each class, so for each class (e.g. admit) and feature (e.g. GPA) combination we need to calculate the mean and variance from the data. Pandas makes this easy:

In [22]:
# Group the data by decision and calculate the means of each feature
data_means = data.groupby('Decision').mean()

# View the values
data_means

| Decision | GPA | SAT Math | SAT Writing |
|---|---|---|---|
| admit | 3.875 | 755.0 | 762.5 |
| reject | 3.35 | 642.5 | 652.5 |


In [23]:
# Group the data by decision and calculate the variance of each feature
data_variance = data.groupby('Decision').var()

# View the values
data_variance

| Decision | GPA | SAT Math | SAT Writing |
|---|---|---|---|
| admit | 0.009167 | 1966.666667 | 1225.0 |
| reject | 0.136667 | 3958.333333 | 2025.0 |


In [24]:
def getMeans(decision):
    # Per-class mean of each feature for the given decision
    gMean = data_means.loc[decision, 'GPA']
    mMean = data_means.loc[decision, 'SAT Math']
    wMean = data_means.loc[decision, 'SAT Writing']
    return gMean, mMean, wMean

def getVars(decision):
    # Per-class sample variance of each feature for the given decision
    gVar = data_variance.loc[decision, 'GPA']
    mVar = data_variance.loc[decision, 'SAT Math']
    wVar = data_variance.loc[decision, 'SAT Writing']
    return gVar, mVar, wVar

admit_gpa_mean, admit_math_mean, admit_writ_mean = getMeans('admit')
admit_gpa_variance, admit_math_variance, admit_writ_variance = getVars('admit')
reject_gpa_mean, reject_math_mean, reject_writ_mean = getMeans('reject')
reject_gpa_variance, reject_math_variance, reject_writ_variance = getVars('reject')

In [25]:
# Create a function that calculates p(x | y):
def p_x_given_y(x, mean_y, variance_y):

    # Evaluate the Gaussian (normal) probability density at x
    p = 1/(np.sqrt(2*np.pi*variance_y)) * np.exp((-(x-mean_y)**2)/(2*variance_y))
    
    # return p
    return p
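
For reference, the function above evaluates the normal density, and the classifier in the next section multiplies one such density per feature by the class prior. Written out (the symbols $\mu_y$, $\sigma_y^2$, and $x_i$ are notation introduced here, standing for `mean_y`, `variance_y`, and the three feature values):

$$
p(x \mid y) = \frac{1}{\sqrt{2\pi\sigma_y^2}} \exp\!\left(-\frac{(x-\mu_y)^2}{2\sigma_y^2}\right)
$$

$$
P(y \mid x_1, x_2, x_3) \;\propto\; P(y)\, p(x_1 \mid y)\, p(x_2 \mid y)\, p(x_3 \mid y)
$$

The product form is the "naive" assumption that GPA, SAT Math, and SAT Writing are conditionally independent given the decision.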

## Apply Bayes Classifier To New Data Point
<a id="s4"></a>

In [26]:
# Numerator of the posterior if the unclassified observation is an admit
P_admit * \
p_x_given_y(person['GPA'][0], admit_gpa_mean, admit_gpa_variance) * \
p_x_given_y(person['SAT Math'][0], admit_math_mean, admit_math_variance) * \
p_x_given_y(person['SAT Writing'][0], admit_writ_mean, admit_writ_variance)

3.710032420366205e-05

In [27]:
# Numerator of the posterior if the unclassified observation is a reject
P_reject * \
p_x_given_y(person['GPA'][0], reject_gpa_mean, reject_gpa_variance) * \
p_x_given_y(person['SAT Math'][0], reject_math_mean, reject_math_variance) * \
p_x_given_y(person['SAT Writing'][0], reject_writ_mean, reject_writ_variance)

1.928756687955893e-07

Because the numerator of the posterior for admit is greater than the numerator for reject, we predict that the applicant is admitted.
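
The two numerators are not probabilities on their own, but dividing each by their sum gives posterior probabilities that add to one. The following is a minimal, self-contained sketch of the whole pipeline under the same Gaussian assumption; the names `gaussian_pdf`, `numerators`, `posteriors`, and `prediction` are introduced here for illustration and are not part of the cells above.

```python
import numpy as np
import pandas as pd

# Same training data and new applicant as in the notebook
data = pd.DataFrame({
    'Decision': ['admit', 'reject', 'reject', 'admit', 'reject', 'admit', 'reject', 'admit'],
    'GPA': [3.9, 3.2, 2.9, 4.0, 3.6, 3.8, 3.7, 3.8],
    'SAT Math': [740, 570, 700, 800, 690, 700, 610, 780],
    'SAT Writing': [750, 600, 690, 780, 690, 720, 630, 800],
})
person = {'GPA': 3.7, 'SAT Math': 740, 'SAT Writing': 770}

def gaussian_pdf(x, mean, variance):
    # Normal density with the given mean and variance, evaluated at x
    return np.exp(-(x - mean) ** 2 / (2 * variance)) / np.sqrt(2 * np.pi * variance)

grouped = data.groupby('Decision')
means = grouped.mean()
variances = grouped.var()  # sample variance (ddof=1), pandas' default
priors = data['Decision'].value_counts(normalize=True)

# Prior times the product of per-feature likelihoods, for each class
numerators = {}
for decision in means.index:
    likelihood = 1.0
    for feature, value in person.items():
        likelihood *= gaussian_pdf(value,
                                   means.loc[decision, feature],
                                   variances.loc[decision, feature])
    numerators[decision] = priors[decision] * likelihood

# Normalize so the posteriors sum to one, then pick the larger
evidence = sum(numerators.values())
posteriors = {d: n / evidence for d, n in numerators.items()}
prediction = max(posteriors, key=posteriors.get)
print(posteriors, prediction)
```

Normalizing does not change which class wins, since both numerators are divided by the same constant; it only makes the result easier to read as a probability.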