# <center> PPOL564: DS1 | Foundations <br><br> Checkpoint Assignment 5 </center>

# Instructions

In this assignment, you'll practice the probability concepts covered in this course. This assignment should be completed independently or with your randomly assigned partner (see Course Policies "Homework Partner" in the Syllabus). 

**Be careful to follow the instructions for each question.** 

Recall that all assignment submissions must adhere to the following guidelines: 

- (i) all code must run; one point will be deducted if the entire notebook doesn't run on the Professor's/TA's computer (<font color = "darkred">Point(s) = 1</font>)
- (ii) solutions should be readable
    + Code should be thoroughly commented (the Professor/TA should be able to understand the code's purpose by reading the comment),
    + Coding solutions should be broken up into individual code chunks in Jupyter notebooks, not clumped together into one large code chunk,
    + Each student-defined function must contain a docstring explaining what the function does, each input argument, and what the function returns;
    + All numerical output should be rounded to the second decimal place.
- (iii) Commentary, responses, and/or written solutions should all be written in Markdown and should contain no grammatical or spelling errors;
- (iv) All mathematical formulas should be written in LaTeX.

There are a total of **_11 points_** available for this assignment.

In [182]:
name = 'Adam Hearn'
studentid = 'ach154'

In [183]:
import pandas as pd
import pprint as pp # for printing
import scipy.stats as st # for Normal PDF
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# Part 1: Probability

The following table summarizes respondents in the DMV and their transportation preferences for getting to work. Use it to calculate the probabilities for questions 1 through 3. 

In [184]:
tab = pd.DataFrame([[100,87,5],[57,301,22],[67,53,12]],
                   columns=["Male","Female","Other"],
                   index=["Drive","Metro","Uber"])
tab

| | Male | Female | Other |
|---|---|---|---|
| Drive | 100 | 87 | 5 |
| Metro | 57 | 301 | 22 |
| Uber | 67 | 53 | 12 |


### (1) What is the probability of a respondent taking an Uber to work?

(_Point 1_)

In [185]:
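# Row totals: number of respondents per transportation mode
index_sums = tab.sum(axis = 1)
# Grand total of respondents in the table
total = sum(index_sums)

# Marginal probability of taking an Uber
prob_uber = index_sums['Uber'] / total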

print("Respondents who Uber:", index_sums['Uber'])
print("Total number of respondents:", total)
print("Probability of respondent taking Uber:", prob_uber)

Respondents who Uber: 132
Total number of respondents: 704
Probability of respondent taking Uber: 0.1875


$$ Pr(\text{Uber}) = \frac{132}{704} = .1875$$

### (2) What is the probability of a respondent driving to work given they identify as a male?

(_Point 1_)

In [186]:
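# Column totals: number of respondents per gender category
gender_sum = tab.sum(axis = 0)
# Number of male respondents who drive
male_drive = tab.at['Drive','Male']

# Conditional probability: restrict the sample to males
prob_male_drive = male_drive / gender_sum['Male']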

print("Males who drive:", male_drive)
print("Total number of males:", gender_sum['Male'])
print("Probability of respondent driving to work given they identify as a male:", round(prob_male_drive,4))

Males who drive: 100
Total number of males: 224
Probability of respondent driving to work given they identify as a male: 0.4464


$$ Pr(\text{Drive } | \text{ Male}) = \frac{100}{224} \approx .4464 $$

### (3) What is the probability of a respondent identifying as other given that they report taking the Metro?

(_Point 1_)

In [187]:
# Number of respondents who take the Metro (row total from question 1)
metro_sum = index_sums['Metro']
# Number of Metro riders who identify as other
other_metro = tab.at['Metro', 'Other']

# Conditional probability: restrict the sample to Metro riders
prob_other_metro = other_metro / metro_sum
print("Respondents who identify as other who take Metro:", other_metro)
print("Total number of respondents who take the Metro:", metro_sum)
print("Probability of a respondent identifying as other given that they report taking the Metro:", round(prob_other_metro,4))

Respondents who identify as other who take Metro: 22
Total number of respondents who take the Metro: 380
Probability of a respondent identifying as other given that they report taking the Metro: 0.0579


$$ Pr(\text{Other } | \text{ Metro}) = \frac{22}{380} \approx .0579 $$
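The three answers above can also be read off normalized versions of the table. The following is a minimal sketch, not part of the graded solution; the names `marginal_mode`, `given_gender`, and `given_mode` are illustrative assumptions.

```python
import pandas as pd

# Same contingency table as above
tab = pd.DataFrame([[100, 87, 5], [57, 301, 22], [67, 53, 12]],
                   columns=["Male", "Female", "Other"],
                   index=["Drive", "Metro", "Uber"])

# Grand total of respondents
total = tab.to_numpy().sum()

# Marginal distribution over transportation modes: Pr(mode)
marginal_mode = tab.sum(axis=1) / total

# Pr(mode | gender): divide each column by its column total
given_gender = tab.div(tab.sum(axis=0), axis=1)

# Pr(gender | mode): divide each row by its row total
given_mode = tab.div(tab.sum(axis=1), axis=0)

print(round(marginal_mode["Uber"], 4))              # 0.1875
print(round(given_gender.loc["Drive", "Male"], 4))  # 0.4464
print(round(given_mode.loc["Metro", "Other"], 4))   # 0.0579
```

Dividing by column totals conditions on gender, while dividing by row totals conditions on transportation mode, so the direction of the division determines which event sits to the right of the bar.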

### (4) What is the probability of a recession given that the yield curve inverted? 
(_Point 1_)

- $Pr(\text{recession}) = 2.5\%$
- $Pr(\text{yield curve inverted}) = 12\%$
- $Pr(\text{recession} \cap \text{yield curve inverted}) = .3\%$

$$ Pr(\text{Recession} | \text{Yield Curve Inverted}) = \frac{Pr(\text{Recession} \cap \text{Yield Curve Inverted})}{Pr(\text{Yield Curve Inverted})} $$

In [188]:
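# Probabilities given in the question
rec = .025            # Pr(recession)
inv_yield = .12       # Pr(yield curve inverted)
rec_inv_yield = .003  # Pr(recession and yield curve inverted)

# Conditional probability: Pr(recession | yield curve inverted)
prob_rec = rec_inv_yield/inv_yield
prob_rec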

0.025

$$ Pr(\text{Recession} | \text{Yield Curve Inverted}) = \frac{Pr(\text{Recession} \cap \text{Yield Curve Inverted})}{Pr(\text{Yield Curve Inverted})} = \frac{0.003}{0.12} = 0.025 $$

### (5) Is the probability of a recession dependent on the yield curve inverting? Explain your answer.

(_Point 1_)

Events $A$ and $B$ are independent if $Pr(A \cap B) = Pr(A) \cdot Pr(B)$. Let's check whether this holds for a recession and an inverted yield curve:

In [189]:
rec_inv_yield == rec * inv_yield

True

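The equality holds: $Pr(\text{Recession} \cap \text{Yield Curve Inverted}) = 0.003 = 0.025 \times 0.12$. Equivalently, $Pr(\text{Recession} | \text{Yield Curve Inverted}) = 0.025 = Pr(\text{Recession})$, so with these numbers the probability of a recession is independent of the yield curve inverting.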
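Comparing floating-point products with `==` can be fragile (for instance, `0.1 * 3 == 0.3` evaluates to `False` in Python). The sketch below repeats the independence check with a tolerance; the helper name `is_independent` is an illustrative assumption, not something defined elsewhere in this notebook.

```python
import math

def is_independent(p_a, p_b, p_a_and_b, rel_tol=1e-9):
    """Return True when Pr(A and B) equals Pr(A) * Pr(B) within a relative tolerance."""
    return math.isclose(p_a_and_b, p_a * p_b, rel_tol=rel_tol)

# Probabilities from question 4
rec = 0.025
inv_yield = 0.12
rec_inv_yield = 0.003

# Check 1: the joint probability equals the product of the marginals
print(is_independent(rec, inv_yield, rec_inv_yield))

# Check 2: conditioning on the yield curve leaves Pr(recession) unchanged
print(math.isclose(rec_inv_yield / inv_yield, rec, rel_tol=1e-9))
```

Both checks express the same condition, so they should agree whenever the conditioning event has nonzero probability.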

# Part 2: Build a Naive Bayesian Classifier

(_Point 5_)

**Can we predict whether someone will vote or not?**

In the last assignment, we explored the `turnout.csv` data, which was drawn from the 2012 National Election Survey. The data records the age, education level (total years in school), income, race (caucasian or not), and whether or not the respondent voted in the 2012 Presidential election. The sample comprises 2000 individual respondents in total. I have broken the data up into a training dataset (1600 entries, 80%) and a test dataset (400 entries, 20%) (see below). 

Use what we learned to build a Naive Bayesian Classifier that tries to predict whether a respondent will vote in a presidential election or not (Class == Vote). The classifier must be built from scratch. Do not use a third party ML or statistical package.

Feel free to manipulate the data however you see fit. Run your algorithm and see how it predicts the training data. Then report how accurate you were on predicting someone's propensity to vote in the test data. Did you do better or worse than chance (50%)?

When completing this answer, be sure to: 

- comment on all your code
- provide a narrative for what you're doing
- summarize your results and findings

In [192]:
dat = pd.read_csv('turnout.csv')

# Break data up into training and test data
train=dat.sample(frac=0.8,random_state=323)
test=dat.drop(train.index)

# Reset the indices for both the train and test
train.reset_index(drop=True,inplace=True)
test.reset_index(drop=True,inplace=True)

# Drop id variable
train = train.drop(['id'],axis = 1) 
test = test.drop(['id'],axis = 1) 
# Reorder columns so the outcome (vote) comes first
train = train[['vote', 'age', 'educate', 'income', 'white']]
test = test[['vote', 'age', 'educate', 'income', 'white']]

#preview data
train

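| | vote | age | educate | income | white |
|---|---|---|---|---|---|
| 0 | 0 | 46 | 9.0 | 1.8429 | 0 |
| 1 | 1 | 25 | 15.0 | 3.8606 | 1 |
| 2 | 1 | 69 | 17.0 | 13.3041 | 1 |
| 3 | 1 | 53 | 10.0 | 3.5800 | 1 |
| 4 | 1 | 34 | 16.0 | 5.4713 | 0 |
| ... | ... | ... | ... | ... | ... |
| 1595 | 0 | 26 | 16.0 | 1.8967 | 1 |
| 1596 | 1 | 71 | 11.0 | 0.9780 | 1 |
| 1597 | 0 | 55 | 11.0 | 1.4398 | 1 |
| 1598 | 1 | 62 | 12.0 | 6.6992 | 1 |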


First we calculate class probabilities:

In [193]:
y1 = train.query("vote == 1")
y0 = train.query("vote == 0")

# Class probabilities.
pr_y1 = y1.shape[0]/train.shape[0]
pr_y0 = y0.shape[0]/train.shape[0]

print(pr_y1)
print(pr_y0)

0.741875
0.258125


Next, we calculate the mean and standard deviation of each predictor within each class:

In [194]:
# Collect the mean and standard dev. of each conditional distribution
dist_locs = \
{("age",1):{'mean':y1.age.mean(),'sd':y1.age.std()},
 ("age",0):{'mean':y0.age.mean(),'sd':y0.age.std()},
 ("educate",1):{'mean':y1.educate.mean(),'sd':y1.educate.std()},
 ("educate",0):{'mean':y0.educate.mean(),'sd':y0.educate.std()},
 ("income",1):{'mean':y1.income.mean(),'sd':y1.income.std()},
 ("income",0):{'mean':y0.income.mean(),'sd':y0.income.std()},
 ("white",1):{'mean':y1.white.mean(),'sd':y1.white.std()},
 ("white",0):{'mean':y0.white.mean(),'sd':y0.white.std()}
}

# Print the collected parameters
pp.pprint(dist_locs)

{('age', 0): {'mean': 42.653753026634384, 'sd': 19.127668078553697},
 ('age', 1): {'mean': 46.14827295703454, 'sd': 16.76013458371937},
 ('educate', 0): {'mean': 10.665859564164649, 'sd': 3.2326417027391328},
 ('educate', 1): {'mean': 12.59519797809604, 'sd': 3.2493017317689734},
 ('income', 0): {'mean': 2.8083740920096854, 'sd': 2.2222018360245834},
 ('income', 1): {'mean': 4.258665796124684, 'sd': 2.9009783277578696},
 ('white', 0): {'mean': 0.7796610169491526, 'sd': 0.41497792824208074},
 ('white', 1): {'mean': 0.8761583824768323, 'sd': 0.3295396173140738}}


In [195]:
def predict(data,dist_locs):
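    '''
    Predict the class of each row with a Gaussian naive Bayes rule.

    Args:
        data (DataFrame): outcome in the first column, predictors in the remaining columns.
        dist_locs (dict): maps (variable, class) to a dict with the conditional 'mean' and 'sd'.

    Returns:
        DataFrame: one row per observation with the unnormalized class scores
        (pr_0, pr_1) and the predicted class (pred).

    Note: relies on the class probabilities pr_y0 and pr_y1 defined above.
    '''
    store_preds = []
    for i,row in data.iterrows():
        
        # Get the predictions using a Gaussian distribution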
        pr_0 = 1; pr_1 = 1
        for j in range(1,len(row)):
            pr_0 *= st.norm(dist_locs[(row.index[j],0)]['mean'],
                            dist_locs[(row.index[j],0)]['sd']).pdf(row.values[j])
            pr_1 *= st.norm(dist_locs[(row.index[j],1)]['mean'], 
                            dist_locs[(row.index[j],1)]['sd']).pdf(row.values[j])
        pr_0 *= pr_y0
        pr_1 *= pr_y1
        
        # Assign the class designation to the highest probability
        if pr_0 >= pr_1:
            class_pred = 0
        else:
            class_pred = 1
            
        store_preds.append([pr_0,pr_1,class_pred])
        
    return pd.DataFrame(store_preds,columns=["pr_0","pr_1","pred"])

# Run
preds_train = predict(train,dist_locs)
preds_train.head()

| | pr_0 | pr_1 | pred |
|---|---|---|---|
| 0 | 1.540406e-05 | 4.036699e-06 | 0 |
| 1 | 2.367139e-05 | 0.0001142917 | 1 |
| 2 | 8.099369e-11 | 4.100905e-07 | 1 |
| 3 | 7.930411e-05 | 0.0002188205 | 1 |
| 4 | 2.215246e-06 | 4.285584e-06 | 1 |
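The scores above are products of small densities, which can underflow toward zero as more predictors are added. One common alternative is to sum log densities instead. The following is a minimal sketch using only the standard library, with the priors and parameters copied from the printed output above; the names `PRIORS`, `PARAMS`, `gaussian_logpdf`, `log_scores`, and `predict_one` are illustrative assumptions, not part of the solution.

```python
import math

# Class probabilities and (mean, sd) pairs copied from the output above
PRIORS = {0: 0.258125, 1: 0.741875}
PARAMS = {
    ("age", 0): (42.653753026634384, 19.127668078553697),
    ("age", 1): (46.14827295703454, 16.76013458371937),
    ("educate", 0): (10.665859564164649, 3.2326417027391328),
    ("educate", 1): (12.59519797809604, 3.2493017317689734),
    ("income", 0): (2.8083740920096854, 2.2222018360245834),
    ("income", 1): (4.258665796124684, 2.9009783277578696),
    ("white", 0): (0.7796610169491526, 0.41497792824208074),
    ("white", 1): (0.8761583824768323, 0.3295396173140738),
}
FEATURES = ["age", "educate", "income", "white"]

def gaussian_logpdf(x, mean, sd):
    """Log of the Normal(mean, sd) density evaluated at x."""
    return -0.5 * math.log(2 * math.pi * sd ** 2) - (x - mean) ** 2 / (2 * sd ** 2)

def log_scores(row):
    """Return {class: log prior + sum of per-feature log densities}."""
    return {
        c: math.log(PRIORS[c])
        + sum(gaussian_logpdf(row[f], *PARAMS[(f, c)]) for f in FEATURES)
        for c in PRIORS
    }

def predict_one(row):
    """Predict 0 or 1; ties go to class 0, matching the rule used in predict()."""
    scores = log_scores(row)
    return 0 if scores[0] >= scores[1] else 1

# First two training rows from the preview above
first_row = {"age": 46, "educate": 9.0, "income": 1.8429, "white": 0}
second_row = {"age": 25, "educate": 15.0, "income": 3.8606, "white": 1}

print(predict_one(first_row), predict_one(second_row))
print(math.exp(log_scores(first_row)[0]))  # comparable to pr_0 for row 0
```

Because the logarithm is monotonic, comparing log scores picks the same class as comparing the original products, so predictions should match those of `predict` whenever the products have not underflowed.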


In [196]:
accuracy_train = sum(train.vote == preds_train.pred)/train.shape[0]
accuracy_train

0.726875

In [197]:
preds_test = predict(test,dist_locs)
accuracy_test = sum(test.vote == preds_test.pred)/test.shape[0]
accuracy_test

0.7275

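The classifier correctly predicts turnout for about 73% of respondents in both the training data (72.69%) and the test data (72.75%), which is better than chance (50%). However, 74.19% of training respondents voted, so the classifier does not improve on simply predicting that everyone votes, at least in the training data.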
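Accuracy is easier to interpret next to a simple reference point, such as always predicting the most common class. A minimal sketch of that comparison follows; the helpers `accuracy` and `majority_baseline` and the toy labels are hypothetical and are not drawn from `turnout.csv`.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Share of predictions that match the observed labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def majority_baseline(y_true):
    """Accuracy obtained by always predicting the most common label."""
    most_common_count = Counter(y_true).most_common(1)[0][1]
    return most_common_count / len(y_true)

# Toy labels for illustration only (hypothetical values)
y_true = [1, 1, 1, 0, 1, 0, 1, 1]
y_pred = [1, 1, 0, 0, 1, 1, 1, 1]

print(accuracy(y_true, y_pred))    # 0.75
print(majority_baseline(y_true))   # 0.75
```

With the notebook's objects, the same helpers could be applied to `test.vote` and `preds_test.pred` to see how the test accuracy compares with the share of voters in the test data.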