# <center> PPOL564: DS1 | Foundations <br><br> Checkpoint Assignment 5 </center>

# Instructions

In this assignment, you'll practice what you have learned about the probability concepts covered in this course. This assignment should be completed independently or with your randomly assigned partner (see Course Policies "Homework Partner" in the Syllabus). 

**Be careful to follow the instructions for each question.** 

Recall that all assignment submissions must adhere to the following guidelines: 

- (i) all code must run; one point will be deducted if the entire notebook doesn't run on the Professor's/TA's computer (<font color = "darkred">Point(s) = 1</font>)
- (ii) solutions should be readable
    + Code should be thoroughly commented (the Professor/TA should be able to understand the code's purpose by reading the comment),
    + Coding solutions should be broken up into individual code chunks in Jupyter notebooks, not clumped together into one large code chunk,
    + Each student-defined function must contain a docstring explaining what the function does, each input argument, and what the function returns;
    + All numerical output should be rounded to the second decimal place.
- (iii) Commentary, responses, and/or written solutions should all be written in Markdown and should contain no grammatical or spelling errors;
- (iv) All mathematical formulas should be written in LaTeX.

There are a total of **_11 points_** available for this assignment.

In [1]:
import pandas as pd
import numpy as np
import pprint as pp
import scipy.stats as st # for Normal PDF

# Part 1: Probability

The following table summarizes respondents in the DMV by their preferred mode of transportation to work. Please calculate the probabilities for questions 1 through 3 using this table. 

In [28]:
tab = pd.DataFrame([[100,87,5],[57,301,22],[67,53,12]],
                   columns=["Male","Female","Other"],
                   index=["Drive","Metro","Uber"])
tab

| | Male | Female | Other |
|---|---|---|---|
| Drive | 100 | 87 | 5 |
| Metro | 57 | 301 | 22 |
| Uber | 67 | 53 | 12 |


In [29]:
tab.loc['Column_Total']= tab.sum(numeric_only=True, axis=0)
tab.loc[:,'Row_Total'] = tab.sum(numeric_only=True, axis=1)
tab

| | Male | Female | Other | Row_Total |
|---|---|---|---|---|
| Drive | 100 | 87 | 5 | 192 |
| Metro | 57 | 301 | 22 | 380 |
| Uber | 67 | 53 | 12 | 132 |
| Column_Total | 224 | 441 | 39 | 704 |


### (1) What is the probability of a respondent taking an Uber to work?

(_Point 1_)

<br>

$$ Pr(\text{Uber}) = \frac{132}{704} = 0.1875$$

<br>

### (2) What is the probability of a respondent driving to work given they identify as a male?

(_Point 1_)

<br>

$$ Pr(\text{Drive|Male}) = \frac{100}{224} \approx 0.446$$

<br>

### (3) What is the probability of a respondent identifying as other given that they report taking the Metro?

(_Point 1_)

<br>

$$ Pr(\text{Other|Metro}) = \frac{22}{380} \approx 0.058$$

<br>
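The three answers above can be checked directly against the counts in the table. The following is a minimal sketch in plain Python (dictionaries instead of the notebook's `tab` DataFrame, so it runs on its own); the variable names are illustrative and not part of the assignment.

```python
# Counts copied from the table above (rows: mode of transport, columns: gender).
counts = {
    "Drive": {"Male": 100, "Female": 87, "Other": 5},
    "Metro": {"Male": 57, "Female": 301, "Other": 22},
    "Uber": {"Male": 67, "Female": 53, "Other": 12},
}

# Marginal totals for rows, columns, and the whole table.
row_total = {mode: sum(by_gender.values()) for mode, by_gender in counts.items()}
col_total = {}
for by_gender in counts.values():
    for gender, n in by_gender.items():
        col_total[gender] = col_total.get(gender, 0) + n
grand_total = sum(row_total.values())

# (1) Marginal probability: Pr(Uber) = row total / grand total
pr_uber = row_total["Uber"] / grand_total
# (2) Conditioning on a column: Pr(Drive | Male) = cell / column total
pr_drive_given_male = counts["Drive"]["Male"] / col_total["Male"]
# (3) Conditioning on a row: Pr(Other | Metro) = cell / row total
pr_other_given_metro = counts["Metro"]["Other"] / row_total["Metro"]

print(round(pr_uber, 4), round(pr_drive_given_male, 3), round(pr_other_given_metro, 3))
# prints: 0.1875 0.446 0.058
```

The only difference between the three cases is the denominator: the grand total for a marginal probability, and the total of the conditioning row or column for a conditional one.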

### (4) What is the probability of a recession given that the yield curve inverted? 
(_Point 1_)

- $Pr(\text{recession}) = 2.5\%$
- $Pr(\text{yield curve inverted}) = 12\%$
- $Pr(\text{recession} \cap \text{yield curve inverted}) = .3\%$

<br>

$$ Pr(\text{Recession|yield curve inverted}) = \frac{Pr(\text{recession} \cap \text{yield curve inverted})}{Pr(\text{yield curve inverted})} = \frac{0.3\%}{12\%} = 0.025 = 2.5\% $$

<br>

### (5) Is the probability of a recession dependent on the yield curve inverting? Explain your answer.

(_Point 1_)

No, the probability of a recession is not dependent on the yield curve inverting. If the two events are independent, then we would have
<br>
$$ Pr(\text{Recession|yield curve inverted}) = Pr(\text{Recession}) $$
<br>
Given our calculation above, $Pr(\text{Recession|yield curve inverted}) = 2.5\%$ and $Pr(\text{Recession}) = 2.5\%$ are equal. Therefore, the probability of a recession is independent of the yield curve inverting.
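The independence check can also be reproduced numerically. The sketch below uses the three probabilities given in question 4 and tests both equivalent conditions for independence; `math.isclose` is used because exact equality is unreliable for floating-point numbers. The variable names are illustrative.

```python
import math

# Probabilities stated in question 4.
pr_recession = 0.025   # Pr(recession)
pr_inverted = 0.12     # Pr(yield curve inverted)
pr_both = 0.003        # Pr(recession and yield curve inverted)

# Conditional probability from the definition Pr(A|B) = Pr(A and B) / Pr(B).
pr_recession_given_inverted = pr_both / pr_inverted

# Two equivalent checks for independence:
#   Pr(A|B) == Pr(A)   and   Pr(A and B) == Pr(A) * Pr(B)
same_conditional = math.isclose(pr_recession_given_inverted, pr_recession)
same_product = math.isclose(pr_both, pr_recession * pr_inverted)

print(same_conditional, same_product)
# prints: True True
```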

# Part 2: Build a Naive Bayesian Classifier

(_Point 5_)

**Can we predict whether someone will vote or not?**

In the last assignment, we explored the `turnout.csv` data, which was drawn from the 2012 National Election Survey. The data records the age, education level (total years in school), income, race (caucasian or not), and whether or not the respondent voted in the 2012 Presidential election. The sample comprises 2,000 individual respondents in total. I have broken the data up into a training (1600 entries, 80%) and test dataset (400 entries, 20%) (see below). 

Use what we learned to build a Naive Bayesian Classifier that tries to predict whether a respondent will vote in a presidential election or not (Class == Vote). The classifier must be built from scratch. Do not use a third party ML or statistical package.

Feel free to manipulate the data however you see fit. Run your algorithm and see how it predicts the training data. Then report how accurate you were on predicting someone's propensity to vote in the test data. Did you do better or worse than chance (50%)?

When completing this answer, be sure to: 

- comment on all your code
- provide a narrative for what you're doing
- summarize your results and findings

In [101]:
dat = pd.read_csv('turnout.csv')

# Break data up into training and test data
train=dat.sample(frac=0.8,random_state=323)
test=dat.drop(train.index)

# Reset the indices for both the train and test
train.reset_index(drop=True,inplace=True)
test.reset_index(drop=True,inplace=True)

# Preview the training data 
train.head()

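| | id | age | educate | income | vote | white |
|---|---|---|---|---|---|---|
| 0 | 1353 | 46 | 9.0 | 1.8429 | 0 | 0 |
| 1 | 122 | 25 | 15.0 | 3.8606 | 1 | 1 |
| 2 | 1530 | 69 | 17.0 | 13.3041 | 1 | 1 |
| 3 | 162 | 53 | 10.0 | 3.58 | 1 | 1 |
| 4 | 1807 | 34 | 16.0 | 5.4713 | 1 | 0 |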


In [102]:
## drop id
train = train.drop('id',axis = 1)
train.head(5)

| | age | educate | income | vote | white |
|---|---|---|---|---|---|
| 0 | 46 | 9.0 | 1.8429 | 0 | 0 |
| 1 | 25 | 15.0 | 3.8606 | 1 | 1 |
| 2 | 69 | 17.0 | 13.3041 | 1 | 1 |
| 3 | 53 | 10.0 | 3.58 | 1 | 1 |
| 4 | 34 | 16.0 | 5.4713 | 1 | 0 |


In [103]:
## Step 1: split the training data by class and calculate the class probabilities
y1 = train.query("vote == 1")
y0 = train.query("vote == 0")

# Class probabilities.
pr_y1 = y1.shape[0]/train.shape[0]
pr_y0 = y0.shape[0]/train.shape[0]

In [104]:
# Collect the mean and standard dev. of each conditional distribution
dist_locs = \
{("age",1):{'mean':y1.age.mean(),'sd':y1.age.std()},
 ("age",0):{'mean':y0.age.mean(),'sd':y0.age.std()},
 ("educate",1):{'mean':y1.educate.mean(),'sd':y1.educate.std()},
 ("educate",0):{'mean':y0.educate.mean(),'sd':y0.educate.std()},
 ("income",1):{'mean':y1.income.mean(),'sd':y1.income.std()},
 ("income",0):{'mean':y0.income.mean(),'sd':y0.income.std()}
}

# Print
pp.pprint(dist_locs)

{('age', 0): {'mean': 42.653753026634384, 'sd': 19.127668078553704},
 ('age', 1): {'mean': 46.14827295703454, 'sd': 16.760134583719374},
 ('educate', 0): {'mean': 10.665859564164649, 'sd': 3.2326417027391363},
 ('educate', 1): {'mean': 12.59519797809604, 'sd': 3.249301731768977},
 ('income', 0): {'mean': 2.8083740920096854, 'sd': 2.2222018360245834},
 ('income', 1): {'mean': 4.258665796124673, 'sd': 2.900978327757866}}
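Each mean and standard deviation above parameterizes a Normal density that serves as the conditional likelihood of a continuous predictor given the class. As a minimal sketch, the density can be written out by hand; this is the same quantity that `st.norm(mean, sd).pdf(x)` evaluates. The helper `gaussian_pdf` is illustrative, and the parameters are rounded copies of the printed values.

```python
import math

def gaussian_pdf(x, mean, sd):
    """Density of a Normal(mean, sd) distribution evaluated at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))

# Parameters for age, rounded from the dictionary printed above.
age_params = {
    1: {"mean": 46.148, "sd": 16.760},  # respondents who voted
    0: {"mean": 42.654, "sd": 19.128},  # respondents who did not vote
}

x = 46  # age of the first respondent in the training preview
for y, p in age_params.items():
    # Likelihood of observing this age under each class-conditional distribution.
    print(y, gaussian_pdf(x, p["mean"], p["sd"]))
```

These values are densities rather than probabilities, so they are only meaningful relative to one another across classes.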


In [105]:
## Deal with the discrete (binary) variable 'white'
# First, subset the data to contain only vote (the outcome variable) and the discrete variable
trainwhite = train.filter(['vote','white'])
trainwhite.head(5)

| | vote | white |
|---|---|---|
| 0 | 0 | 0 |
| 1 | 1 | 1 |
| 2 | 1 | 1 |
| 3 | 1 | 1 |
| 4 | 1 | 0 |


In [106]:
## use the function in lecture notes 24 to calculate discrete probability

def calc_probs(data,outcome_var=""):
    '''
    Calculate the class and conditional probabilities in binary data.

    Arguments:
        data (DataFrame): the outcome column plus binary (0/1) predictor columns.
        outcome_var (str): name of the outcome column.

    Returns:
        class_probs (dict): {class: Pr(class)}
        cond_probs (dict): {(variable, value, class): Pr(variable = value | class)}
    '''
    # Generate empty dictionary containers.
    class_probs = {}
    cond_probs = {}
    # Locate all variables that are not the outcome.
    predictors = [v for v in data.columns if v != outcome_var]
    # iterate through the class outcomes
    for y, d in data.groupby(outcome_var): 
        # calculate the class probabilities
        class_probs.update({y: d.shape[0]/data.shape[0]})
        for v in predictors:
            # calculate the conditional probabilities for each variable given the class.
            pr = d[v].sum()/d.shape[0]
            cond_probs[(v,1,y)] = pr 
            cond_probs[(v,0,y)] = 1 - pr
    return class_probs, cond_probs


# Run
class_probs, cond_probs = calc_probs(trainwhite,outcome_var="vote")

# Print
print("class probabilities",end="\n\n")
pp.pprint(class_probs)
print("\n")
print("conditional probabilities",end="\n\n")
pp.pprint(cond_probs)

class probabilities

{0: 0.258125, 1: 0.741875}


conditional probabilities

{('white', 0, 0): 0.22033898305084743,
 ('white', 0, 1): 0.12384161752316769,
 ('white', 1, 0): 0.7796610169491526,
 ('white', 1, 1): 0.8761583824768323}


In [107]:
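def predict(data, dist_locs):
    '''
    Predict whether each respondent votes by comparing a score for class 0 with a score for class 1.

    Arguments:
        data (DataFrame): columns age, educate, income, vote, white (in that order).
        dist_locs (dict): {(variable, class): {'mean': ..., 'sd': ...}} for the continuous variables.

    The function also reads pr_y0, pr_y1, class_probs, and cond_probs from the notebook's global scope.

    Returns:
        DataFrame with columns pr_0, pr_1, and pred (one row per respondent).
    '''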
    store_preds = []
    for i,row in data.iterrows():
        
        # Get the predictions for continuous variables using a Gaussian distribution
        pr_0 = 1; pr_1 = 1
        for j in [0,1,2]: ## because vote is in position 3
            pr_0 *= st.norm(dist_locs[(row.index[j],0)]['mean'],
                            dist_locs[(row.index[j],0)]['sd']).pdf(row.values[j])
            pr_1 *= st.norm(dist_locs[(row.index[j],1)]['mean'], 
                            dist_locs[(row.index[j],1)]['sd']).pdf(row.values[j])
        pr_0 *= pr_y0
        pr_1 *= pr_y1
                     
        ## This loop is for discrete variable 'white'
        pr_1d = 1; pr_0d = 1
        for j in range(4,len(row.index)):
            pr_0d *= cond_probs[(row.index[j],row.values[j],0)]
            pr_1d *= cond_probs[(row.index[j],row.values[j],1)]     
        pr_0d *= class_probs[0]
        pr_1d *= class_probs[1]
            
        pr_0 *= pr_0d
        pr_1 *= pr_1d
        
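        # Assign the class with the higher score. Note: as written, this compares only the
        # discrete-variable scores (pr_0d, pr_1d), not the full products pr_0 and pr_1.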
        if pr_0d >= pr_1d:
            class_pred = 0
        else:
            class_pred = 1
            
        store_preds.append([pr_0,pr_1,class_pred])
        
    return pd.DataFrame(store_preds,columns=["pr_0","pr_1","pred"])

# Run
preds_train = predict(train,dist_locs)
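For comparison, here is a sketch of a variant that scores each class with the full naive Bayes product (prior times every conditional likelihood, continuous and discrete) and picks the larger score. It is not the function used to produce the results in this notebook. It works in log space to avoid multiplying many small numbers, counts the class prior exactly once, and uses rounded copies of the parameters printed earlier; `predict_row` and `log_gaussian_pdf` are hypothetical helper names.

```python
import math

def log_gaussian_pdf(x, mean, sd):
    """Log density of a Normal(mean, sd) distribution evaluated at x."""
    z = (x - mean) / sd
    return -0.5 * z * z - math.log(sd * math.sqrt(2 * math.pi))

def predict_row(row, priors, gauss_params, binary_probs):
    """Return (log_score_0, log_score_1, predicted_class) for one observation.

    row: dict mapping variable name -> observed value
    priors: dict class -> Pr(class)
    gauss_params: dict (variable, class) -> {'mean': ..., 'sd': ...}
    binary_probs: dict (variable, value, class) -> Pr(variable = value | class)
    """
    scores = {}
    for y in (0, 1):
        score = math.log(priors[y])  # the class prior enters exactly once
        # Continuous predictors: add the Gaussian log likelihoods.
        for (var, cls), p in gauss_params.items():
            if cls == y:
                score += log_gaussian_pdf(row[var], p["mean"], p["sd"])
        # Binary predictors: add the log of the matching conditional probability.
        for (var, val, cls), pr in binary_probs.items():
            if cls == y and row[var] == val:
                score += math.log(pr)
        scores[y] = score
    # Compare the full scores, not just the discrete part.
    pred = 0 if scores[0] >= scores[1] else 1
    return scores[0], scores[1], pred

# Parameters rounded from the outputs printed earlier in this notebook.
priors = {0: 0.258125, 1: 0.741875}
gauss_params = {
    ("age", 1): {"mean": 46.148, "sd": 16.760},
    ("age", 0): {"mean": 42.654, "sd": 19.128},
    ("educate", 1): {"mean": 12.595, "sd": 3.249},
    ("educate", 0): {"mean": 10.666, "sd": 3.233},
    ("income", 1): {"mean": 4.259, "sd": 2.901},
    ("income", 0): {"mean": 2.808, "sd": 2.222},
}
binary_probs = {
    ("white", 0, 0): 0.2203, ("white", 0, 1): 0.1238,
    ("white", 1, 0): 0.7797, ("white", 1, 1): 0.8762,
}

# Second row of the training preview: age 25, 15 years of education, income 3.8606, white = 1.
example = {"age": 25, "educate": 15.0, "income": 3.8606, "white": 1}
print(predict_row(example, priors, gauss_params, binary_probs))
```

Because the logarithm is monotone, comparing log scores picks the same class as comparing the raw products.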

In [108]:
preds_train.head()

| | pr_0 | pr_1 | pred |
|---|---|---|---|
| 0 | 5.323266e-06 | 1.050043e-05 | 1 |
| 1 | 5.705501e-06 | 6.58556e-05 | 1 |
| 2 | 1.952186e-11 | 2.362967e-07 | 1 |
| 3 | 1.911462e-05 | 0.0001260857 | 1 |
| 4 | 7.65535e-07 | 1.114783e-05 | 1 |


In [109]:
accuracy = sum(train['vote'] == preds_train.pred)/train.shape[0]
accuracy

0.741875

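The accuracy of our classifier on the training data is 74.19%, which is better than chance (50%). It is, however, identical to the share of voters in the training data (class probability 0.741875), so it is no better than always predicting that a respondent votes.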

In [110]:
test = test.drop('id',axis = 1)
testwhite = test.filter(['vote','white'])

In [113]:
preds_test = predict(test,dist_locs)

In [114]:
accuracy2 = sum(test.vote == preds_test.pred)/test.shape[0]
accuracy2

0.7625

The prediction accuracy on the test data is 76.25%, which is also better than chance.
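A 50% benchmark is most informative when the classes are balanced. Since roughly 74% of the training respondents voted, a stricter yardstick is the accuracy of always predicting the majority class. The sketch below shows that comparison on a small set of hypothetical labels (made up for illustration, not taken from the turnout data); `accuracy` and `majority_baseline` are illustrative helper names.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def majority_baseline(y_true):
    """Accuracy obtained by always predicting the most common class (binary 0/1 labels)."""
    share_of_ones = sum(y_true) / len(y_true)
    return max(share_of_ones, 1 - share_of_ones)

# Hypothetical labels and predictions, for illustration only.
y_true = [1, 1, 0, 1, 0, 1, 1, 1]
y_pred = [1, 1, 1, 1, 0, 1, 1, 1]

print(accuracy(y_true, y_pred), majority_baseline(y_true))
# prints: 0.875 0.75
```

A classifier adds value over the base rate only when its accuracy exceeds the second number.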