<a href="https://colab.research.google.com/github/saravan2/Learning_Python/blob/master/Section_2_Notebook_Gold_Standard.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Section 2 Notebook
In this notebook we will reason about recent presidential approval poll data. We will explore how conditional probability, the Law of Total Probability, and Bayes' Theorem help us better understand a simple survey. Along the way we will learn how the Python data analysis library `pandas` makes it easy to manipulate data tables.

**Learning Goals:**
1. Analyze poll data with conditional probability, the Law of Total Probability, and Bayes' Theorem
2. Learn some basic `pandas` skills

# Poll Data - Presidential Approval
**Problem:** You collect data on whether or not people approve of President Trump, a potential candidate in the upcoming election. The data below are real results from the last 13 CNN polls, which can be found [here](https://www.realclearpolitics.com/epolls/other/president_trump_job_approval-6179.html) (direct link to the CNN poll [here](https://cdn.cnn.com/cnn/2020/images/01/20/rel1a.-.trump,.impeachment.pdf)).

Let $A$ be the event that a person says they approve of the way President Trump is handling his job as president. Let $M$ be the event that a person answered "No opinion." We are interested in estimating $P(A)$, but this is hard because a small but significant share of respondents answered "No opinion."

**Note 1:** Our model assumes that, given enough information, the "No opinion" respondents would make an approve/disapprove decision.

**Note 2:** The latest CNN poll (Jan 16-19, 2020) had a sample of 1156 respondents. For simplicity we will assume all polls also had this sample size.
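
To make the sample-size assumption in Note 2 concrete, the sketch below converts one poll's reported percentages into approximate respondent counts. The helper name `approx_counts` is hypothetical, and the counts are only estimates because the published percentages are rounded.

```python
# Assumed sample size for every poll (Note 2).
SAMPLE_SIZE = 1156

def approx_counts(percentages, sample_size=SAMPLE_SIZE):
    """Convert reported percentages into approximate respondent counts."""
    total = sum(percentages.values())  # can be 99-101 because of rounding
    return {k: round(sample_size * v / total) for k, v in percentages.items()}

# Percentages reported for the Jan 16-19, 2020 poll.
jan2020 = {'approve': 43, 'disapprove': 53, 'no_opinion': 4}
counts = approx_counts(jan2020)
print(counts)
```

Dividing by the row total rather than by 100 keeps the counts consistent for polls whose rounded percentages do not sum to exactly 100.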

In [None]:
dates = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'June', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec', 'Jan2020']
data = {}
data['approve'] = [37, 40, 42, 43, 43, 43, 40, 39, 41, 42, 43, 43]
data['disapprove'] = [57, 55, 51, 52, 52, 52, 54, 55, 57, 54, 53, 53]
data['no_opinion'] = [7, 5, 8, 5, 5, 5, 6, 6, 2, 4, 4, 4]

In [None]:
import pandas as pd

polldf = pd.DataFrame(data, index=dates)
polldf

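| | approve | disapprove | no_opinion |
|---|---|---|---|
| Jan | 37 | 57 | 7 |
| Feb | 40 | 55 | 5 |
| Mar | 42 | 51 | 8 |
| Apr | 43 | 52 | 5 |
| May | 43 | 52 | 5 |
| June | 43 | 52 | 5 |
| Aug | 40 | 54 | 6 |
| Sep | 39 | 55 | 6 |
| Oct | 41 | 57 | 2 |
| Nov | 42 | 54 | 4 |
| Dec | 43 | 53 | 4 |
| Jan2020 | 43 | 53 | 4 |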


## a) **For each month**, what fraction of respondents gave an opinion, $P(M^C)$?

In [None]:
(polldf['approve'] + polldf['disapprove']) / polldf.sum(axis=1)

Jan        0.930693
Feb        0.950000
Mar        0.920792
Apr        0.950000
May        0.950000
June       0.950000
Aug        0.940000
Sep        0.940000
Oct        0.980000
Nov        0.960000
Dec        0.960000
Jan2020    0.960000
dtype: float64
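
The Jan and Mar values are not round numbers because the rounded percentages in those rows sum to 101 rather than 100. The following sketch, which rebuilds three rows of the table in a separate DataFrame (the names `sample`, `row_totals`, and `p_no_opinion` are illustrative), checks the row totals and recovers $P(M^C)$ as the complement of $P(M)$.

```python
import pandas as pd

# Three rows copied from the poll table above.
sample = pd.DataFrame(
    {'approve': [37, 40, 42], 'disapprove': [57, 55, 51], 'no_opinion': [7, 5, 8]},
    index=['Jan', 'Feb', 'Mar'],
)

row_totals = sample.sum(axis=1)                    # 101, 100, 101
p_no_opinion = sample['no_opinion'] / row_totals   # P(M)
p_responded = 1 - p_no_opinion                     # P(M^C), same as part (a)

print(row_totals)
print(p_responded.round(6))
```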

## b) For each month, what is the probability that a respondent said they approve, given that they gave an opinion, $P(A|M^C)$?


In [None]:
polldf['approve'] / (polldf['approve'] + polldf['disapprove'])

Jan        0.393617
Feb        0.421053
Mar        0.451613
Apr        0.452632
May        0.452632
June       0.452632
Aug        0.425532
Sep        0.414894
Oct        0.418367
Nov        0.437500
Dec        0.447917
Jan2020    0.447917
dtype: float64

## c) Compute $P(A)$ under the following assumptions:
1. $P(A|M) = P(A|M^C)$. That is, people with no opinion would approve at the same rate as those who gave an opinion.
2. $P(A|M) = 0$. That is, people with no opinion would all disapprove.
3. $P(A^C|M) = 0$. That is, people with no opinion would all approve.

**Solution:** 
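1. $P(A) = P(A|M) \cdot P(M) + P(A|M^C) \cdot P(M^C) = P(A|M^C) \cdot (P(M) + P(M^C)) = P(A|M^C)$
2. $P(A) = P(A|M) \cdot P(M) + P(A|M^C) \cdot P(M^C) = P(A|M^C) \cdot P(M^C)$
3. $P(A) = 1 - P(A^C) = 1 - P(A^C|M^C) \cdot P(M^C)$, since $P(A^C|M) = 0$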
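
As a check on the three formulas, here is a minimal sketch that applies the Law of Total Probability to a single poll row, taking $P(M)$ and $P(M^C)$ from that row's own numbers. The function name `approval_estimates` is hypothetical and not part of the notebook.

```python
def approval_estimates(approve, disapprove, no_opinion):
    """Return P(A) under assumptions 1, 2 and 3 for one poll."""
    total = approve + disapprove + no_opinion
    p_m = no_opinion / total                          # P(M)
    p_mc = (approve + disapprove) / total             # P(M^C)
    p_a_given_mc = approve / (approve + disapprove)   # P(A|M^C)

    # Law of Total Probability: P(A) = P(A|M) P(M) + P(A|M^C) P(M^C)
    def total_probability(p_a_given_m):
        return p_a_given_m * p_m + p_a_given_mc * p_mc

    return (
        total_probability(p_a_given_mc),  # assumption 1: P(A|M) = P(A|M^C)
        total_probability(0.0),           # assumption 2: P(A|M) = 0
        total_probability(1.0),           # assumption 3: P(A|M) = 1
    )

# Jan2020 row: 43% approve, 53% disapprove, 4% no opinion.
a1, a2, a3 = approval_estimates(43, 53, 4)
print(round(a1, 6), round(a2, 6), round(a3, 6))
```

Writing all three cases as one call to `total_probability` makes it explicit that the assumptions differ only in the value chosen for $P(A|M)$.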

In [None]:
responded = polldf['approve'] + polldf['disapprove']
p_responded = responded / (responded + polldf['no_opinion'])  # P(M^C), as in (a)

polldf['P(A) w/ A.1'] = polldf['approve'] / responded  # Same as (b)
polldf['P(A) w/ A.2'] = (polldf['approve'] / responded) * p_responded
polldf['P(A) w/ A.3'] = 1 - (polldf['disapprove'] / responded) * p_responded
polldf

| | approve | disapprove | no_opinion | P(A) w/ A.1 | P(A) w/ A.2 | P(A) w/ A.3 |
|---|---|---|---|---|---|---|
| Jan | 37 | 57 | 7 | 0.393617 | 0.366337 | 0.435644 |
| Feb | 40 | 55 | 5 | 0.421053 | 0.400000 | 0.450000 |
| Mar | 42 | 51 | 8 | 0.451613 | 0.415842 | 0.495050 |
| Apr | 43 | 52 | 5 | 0.452632 | 0.430000 | 0.480000 |
| May | 43 | 52 | 5 | 0.452632 | 0.430000 | 0.480000 |
| June | 43 | 52 | 5 | 0.452632 | 0.430000 | 0.480000 |
| Aug | 40 | 54 | 6 | 0.425532 | 0.400000 | 0.460000 |
| Sep | 39 | 55 | 6 | 0.414894 | 0.390000 | 0.450000 |
| Oct | 41 | 57 | 2 | 0.418367 | 0.410000 | 0.430000 |
| Nov | 42 | 54 | 4 | 0.437500 | 0.420000 | 0.460000 |
| Dec | 43 | 53 | 4 | 0.447917 | 0.430000 | 0.470000 |
| Jan2020 | 43 | 53 | 4 | 0.447917 | 0.430000 | 0.470000 |


## d) Discuss: Which of the assumptions do you think is best? What assumptions would you employ in practice, or what other data would you gather to support arguments using this survey data?
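
One way to ground the discussion: assumptions 2 and 3 are the two extremes for the "No opinion" group, so they bound $P(A)$ from below and above, and the gap between them is exactly $P(M)$. The sketch below illustrates this on three rows of the poll table; the DataFrame names `polls` and `bounds` and the column labels are illustrative choices.

```python
import pandas as pd

# Three rows copied from the poll table.
polls = pd.DataFrame(
    {'approve': [37, 41, 43], 'disapprove': [57, 57, 53], 'no_opinion': [7, 2, 4]},
    index=['Jan', 'Oct', 'Jan2020'],
)
total = polls.sum(axis=1)

bounds = pd.DataFrame({
    # Assumption 2: everyone with no opinion disapproves.
    'lower (A.2)': polls['approve'] / total,
    # Assumption 3: everyone with no opinion approves.
    'upper (A.3)': (polls['approve'] + polls['no_opinion']) / total,
})
bounds['width'] = bounds['upper (A.3)'] - bounds['lower (A.2)']  # equals P(M)
print(bounds.round(4))
```

The fewer respondents answer "No opinion," the tighter this interval becomes, which is one argument for gathering follow-up data on how undecided respondents lean.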