# Programming Assignment 1
This assignment will familiarize you with statistical programming in Python.

## Part 1. Lottery Simulation
A state lottery draws 5 numbers (1-36) for each lottery ticket; all numbers are drawn independently and uniformly. There is one winning 5-number combination, and there can be multiple winners. Each ticket costs $2.

The prizes are as follows:
* A match of all 5 numbers in order receives a prize of $100000
* A match of the first 4 numbers in order receives a prize of $5000
* A match of the first 3 numbers in order receives a prize of $200
* A match of the last 3 numbers in order receives a prize of $50
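
The prize rules above can be sketched as a small function. This is a minimal sketch, not a definitive reading of the rules: it assumes each ticket wins only its highest matching prize, which the assignment text does not state explicitly. The names `prize` and `tier_counts` are hypothetical helpers, and the enumeration runs on a scaled-down lottery (numbers 1-6 instead of 1-36) so that every ticket can be listed quickly.

```python
from itertools import product

def prize(ticket, winning):
    """Prize in dollars for one ticket; highest matching tier only (an assumption)."""
    if ticket == winning:
        return 100000
    if ticket[:4] == winning[:4]:
        return 5000
    if ticket[:3] == winning[:3]:
        return 200
    if ticket[-3:] == winning[-3:]:
        return 50
    return 0

def tier_counts(n, winning):
    """Count tickets per prize tier by enumerating all n**5 ordered tickets."""
    counts = {}
    for ticket in product(range(1, n + 1), repeat=5):
        p = prize(ticket, winning)
        counts[p] = counts.get(p, 0) + 1
    return counts

# Scaled-down lottery: numbers 1-6, so 6**5 = 7776 tickets in total.
small = tier_counts(6, (1, 2, 3, 4, 5))
print(small)
```

With `n` possible numbers, the counts per tier under this assumption are 1, `n - 1`, `n * (n - 1)`, and `n * n - 1`, which can be compared against the enumeration.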

### Q1. How many possible lottery tickets are there?
Calculate the total number of distinct lottery tickets.

In [1]:
total = 36 ** 5
total

60466176

### Q2. Suppose the state sells 100000 lottery tickets. What is its expected profit?
Expected profit is defined as the state's total ticket sales minus the total expected prizes paid out to ticket holders. You may assume that all numbers on the sold tickets are drawn independently and uniformly at random.

You should solve this problem using code.

In [2]:
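sales = 2 * 100000
# Assumption: each ticket wins only its highest matching prize.
prob_all_5 = 1/total
# First 4 match and the 5th differs: 35 tickets.
prob_first_4 = 35/total
# First 3 match, the 4th differs, the 5th is free: 35*36 tickets
# (tickets matching the first 4 are already counted above).
prob_first_3 = (35*36)/total
# Last 3 match, excluding the single ticket that matches all 5: 36*36-1 tickets.
prob_last_3 = (36*36-1)/total
expected_prizes = 100000 * ((prob_all_5 * 100000) + (prob_first_4 * 5000) + (prob_first_3 * 200) + (prob_last_3 * 50))
expected_profit = sales - expected_prizes
round(expected_profit, 2)

199021.35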

## Part 2. Election Simulation
Let's consider the topical problem of election simulation. We're going to assume an election format like the one in the US where there is an electoral college and the popular vote winner of each state "takes all" of the electoral votes (I know that this is not exactly true for all states, but bear with me). 

### Q1. Load Data From The Two Provided Datasets
You are given two datasets, `survey.csv` and `states.csv`. Write some code to load both datasets into Python (you may use any intermediate format to represent the data, e.g. dictionaries, dataframes, or lists). Here is a brief description of the data.

`survey.csv` - Opinion polling data from the 2020 election. 
* STATE_NAME - the name of the state
* BIDEN - the share of people polled supporting Joseph Biden in the state, stored as a percentage (e.g., 44 means 0.44)
* TRUMP - the share of people polled supporting Donald Trump in the state, in the same units
* K - the total number of people polled in the state

`states.csv` - A dataset of states and their electoral college votes
* STATE_NAME - the name of the state
* EC - the number of electoral college votes

In [3]:
import pandas as pd

states = pd.read_csv('states.csv')
survey = pd.read_csv('survey.csv')

# Keep one row per state, then index both tables by state name for direct lookups.
states = states.drop_duplicates(subset='STATE_NAME', keep='first').set_index('STATE_NAME')
survey = survey.drop_duplicates(subset='STATE_NAME', keep='first').set_index('STATE_NAME')
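
Because the later questions look up each surveyed state in the electoral-vote table, it can help to check that the two tables line up after loading. The following is a minimal sketch under stated assumptions: the inline CSV text is a made-up stand-in for `survey.csv` and `states.csv` (the state names and numbers are hypothetical), and `demo_survey`, `demo_states`, and `missing` are illustrative names.

```python
import io

import pandas as pd

# Hypothetical stand-ins for the two CSV files; all values are made up.
survey_csv = io.StringIO(
    "STATE_NAME,BIDEN,TRUMP,K\n"
    "Alpha,52,48,400\n"
    "Beta,45,55,900\n"
    "Beta,45,55,900\n"  # duplicate row, removed below
)
states_csv = io.StringIO("STATE_NAME,EC\nAlpha,10\nBeta,7\n")

demo_survey = (
    pd.read_csv(survey_csv)
    .drop_duplicates(subset="STATE_NAME", keep="first")
    .set_index("STATE_NAME")
)
demo_states = (
    pd.read_csv(states_csv)
    .drop_duplicates(subset="STATE_NAME", keep="first")
    .set_index("STATE_NAME")
)

# Surveyed states with no electoral-vote entry would break later lookups.
missing = demo_survey.index.difference(demo_states.index)
total_ec = int(demo_states["EC"].sum())
print(list(missing), total_ec)
```

An empty `missing` index means every surveyed state has an electoral-vote entry.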

### Q2. Simulate Polling Errors
All opinion polls have a "margin of error", or how far the poll might deviate from reality, because of the way that they are sampled. We will use a simple model to simulate polling errors. Let's assume that the polling errors are "normally" distributed (we will discuss later in the class why this is the case!). In this question, you will use the survey data to build a simulator for different possible state-wise results.

* Let $\mu_{BIDEN}$ be the fraction of support Joseph Biden has in the poll for a particular state (e.g., 44 means 0.44)
* Likewise, let $\mu_{TRUMP}$ be the same for Donald Trump
* Let $K$ be the number of people polled in a particular state
* Assume polling errors in all states are independent

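The "true" result of the election in a state will be denoted by two random variables, $X_{BIDEN}$ and $X_{TRUMP}$. The second parameter of the normal distribution is the variance, so the standard deviation is $\frac{1}{2\sqrt{K}}$.
$$X_{BIDEN} \sim Normal(\mu_{BIDEN}, \frac{1}{4K})$$
$$X_{TRUMP} = 1 - X_{BIDEN}$$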
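
Under this model, Biden carries a state exactly when $X_{BIDEN} > 0.5$, so the per-state win probability has a closed form in terms of the normal CDF. The sketch below is one way to compute it with the standard library only; `biden_win_probability` is a hypothetical helper name, and it assumes `mu` is already a fraction between 0 and 1. It can serve as a cross-check on a simulator.

```python
import math

def biden_win_probability(mu, k):
    """P(X_BIDEN > 0.5) when X_BIDEN ~ Normal(mu, variance 1/(4k))."""
    sigma = math.sqrt(1.0 / (4.0 * k))  # standard deviation, 1 / (2 * sqrt(k))
    z = (0.5 - mu) / sigma              # distance from mu to the 50% line, in sigmas
    # 1 - Phi(z), with the normal CDF Phi written in terms of erf.
    return 0.5 * (1.0 - math.erf(z / math.sqrt(2.0)))

# Hypothetical poll: 52% support among 400 respondents.
p = biden_win_probability(0.52, 400)
print(round(p, 4))
```

For a poll at exactly 50%, the function returns 0.5 regardless of `k`, since the distribution is symmetric around its mean.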

You will write a function that simulates the outcome for a given state. It will take as input the state name and will simulate one possible election day outcome.

In [4]:
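import math

import numpy as np

def simulate_state(state_name):
    # The poll share is stored as a percentage, so convert it to a fraction.
    mu = survey.at[state_name, 'BIDEN'] / 100
    k = survey.at[state_name, 'K']
    # The variance is 1/(4K), so the standard deviation is 1/(2*sqrt(K)).
    sigma = 1 / (2 * math.sqrt(k))
    b = np.random.normal(loc=mu, scale=sigma)
    t = 1 - b
    # The candidate with the larger simulated share takes the state.
    return 'BIDEN' if b > t else 'TRUMP'

simulate_state('New York')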

'BIDEN'

### Q3. Simulate National Results
Use your result in Q2 to simulate the overall result of the election (the number of electoral votes received by each candidate). You may assume that the candidate with the larger fraction of votes in a state receives all of its electoral votes.

In [5]:
def simulate_overall():
    # Electoral votes accumulated by each candidate in one simulated election.
    results = {'TRUMP': 0, 'BIDEN': 0}

    for state_name in survey.index:
        winner = simulate_state(state_name)
        results[winner] += states.at[state_name, 'EC']

    return results

simulate_overall()

{'TRUMP': 262, 'BIDEN': 276}

### Q4. Repeated Trials
Run 1 million simulations of the function you wrote for Q3. In what fraction of those simulations does Biden win?

In [None]:
n = 1000000
wins = 0

for x in range(n):
    results = simulate_overall()
    if (results['BIDEN'] > results['TRUMP']):
        wins += 1
wins/n
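
A pure-Python loop over one million elections can take a long time, because each trial makes one pandas lookup and one random draw per state. One possible alternative, sketched here under stated assumptions, is to draw all states for many trials at once with NumPy. The function name `biden_win_fraction`, its parameters, and the three-state demo data are hypothetical; `mu` is assumed to be a fraction between 0 and 1, and the arrays are assumed to be aligned state by state.

```python
import numpy as np

def biden_win_fraction(mu, k, ec, n_trials, seed=0):
    """Fraction of simulated elections in which Biden wins more electoral votes."""
    rng = np.random.default_rng(seed)
    mu = np.asarray(mu, dtype=float)
    sigma = 1.0 / (2.0 * np.sqrt(np.asarray(k, dtype=float)))
    ec = np.asarray(ec)
    # One row per trial, one column per state.
    draws = rng.normal(loc=mu, scale=sigma, size=(n_trials, mu.size))
    # Biden collects a state's electoral votes when his simulated share exceeds 50%.
    biden_ec = np.where(draws > 0.5, ec, 0).sum(axis=1)
    trump_ec = ec.sum() - biden_ec
    return float(np.mean(biden_ec > trump_ec))

# Made-up three-state example, not the assignment data.
frac = biden_win_fraction(mu=[0.52, 0.45, 0.50], k=[400, 900, 625],
                          ec=[10, 7, 3], n_trials=10000)
print(frac)
```

With the real tables, the inputs would presumably be the `BIDEN` column divided by 100, the `K` column, and the `EC` values reordered to match the survey index. For one million trials the `draws` array holds one float per state per trial, so processing the trials in smaller batches may be needed to limit memory use.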