# Assignment 1. Opinion Polling

In this assignment, you will be expected to analyze a dataset on your own and answer questions about your findings. 

*Due date*: Friday Oct 25, 2019 

Completed assignments will be collected in class.

You’ve been hired as a consultant to predict how a state school board election will turn out.
* There are three candidates and all voters must vote for one of them: Pearle Goodman, Masako Holley, Genevieve Gallegos.
* The candidate with the highest final vote count wins the election.
* You are given the list of registered voters here:
https://github.com/sjyk/cmsc21800/blob/master/voters.csv
* The state gives you two samples of data, one collected by SurveyMonkey and one collected by Qualtrics:
https://github.com/sjyk/cmsc21800/blob/master/survey_monkey.csv 
https://github.com/sjyk/cmsc21800/blob/master/qualtrics.csv

## Initial Steps
Let's first set up our data analysis environment by loading all of the datasets:

In [16]:
import pandas as pd

voter_roll = pd.read_csv('voters.csv')
survey_monkey = pd.read_csv('survey_monkey.csv')
qualtrics = pd.read_csv('qualtrics.csv')


# the voter roll contains duplicate names due to a bug (my fault :( ), let's remove them
def remove_duplicate_rows(df):
    # get counts per name
    name_cts = df.groupby('Voter')['Voter'].count()
    
    # find all names that appear two or more times
    dups = name_cts[name_cts >= 2]
    
    # for each duplicated name, mark its first occurrence for removal
    # (this assumes each duplicated name appears exactly twice)
    indexes_to_remove = []
    for d in dups.index:
        # look up rows in the frame passed in, not the global voter_roll
        dup_pair = df[df['Voter'] == d]
        first_index = dup_pair.index[0]
        indexes_to_remove.append(first_index)
    
    df_cpy = df.drop(indexes_to_remove)
    return df_cpy

#merge the datasets
voter_roll = remove_duplicate_rows(voter_roll)
survey_monkey = survey_monkey.merge(voter_roll, on='Voter')
qualtrics = qualtrics.merge(voter_roll, on='Voter')
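
Removing duplicates before merging matters because a name that appears twice in the roll would otherwise match a survey row twice. For reference, pandas also has a built-in method for this kind of de-duplication. The sketch below uses a small made-up table (the names are hypothetical, not taken from `voters.csv`) and assumes each duplicated name appears exactly twice, in which case keeping the last occurrence gives the same rows as dropping the first one.

```python
import pandas as pd

# Toy voter roll; names and values are made up for illustration
toy_roll = pd.DataFrame({
    'Voter': ['Ann Lee', 'Bob Ray', 'Ann Lee', 'Cy Dunn'],
    'Gender': ['female', 'male', 'female', 'male'],
})

# Keep only the last row for each name, i.e. drop the earlier occurrences
deduped = toy_roll.drop_duplicates(subset='Voter', keep='last')

print(deduped['Voter'].tolist())
```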

### Q1. The SurveyMonkey data shows Genevieve Gallegos winning with 59% of the vote among 100 people polled, and the Qualtrics data shows her losing with 42% of the vote among 50 people polled. Which of the following best describes the likelihood that a difference this large (>17%) happened purely by random chance and not because of an error in the polling process?

In [17]:
"""
Let's calculate the likelihood that a single poll could be off by 17%
"""

import numpy as np

# the variance of a (possibly biased) coin flip is p*(1-p), which is largest at p = 0.5
MAX_VARIANCE = 0.25

def ci(size):
    se = np.sqrt(MAX_VARIANCE/size)
    return {'68% +/-': se, '95% +/-': 1.96*se, '99% +/-': 2.57*se}

print('50 polled: ', ci(50))
print('100 polled: ', ci(100))

50 polled:  {'68% +/-': 0.07071067811865475, '95% +/-': 0.13859292911256332, '99% +/-': 0.1817264427649427}
100 polled:  {'68% +/-': 0.05, '95% +/-': 0.098, '99% +/-': 0.1285}


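A single poll being off by 17 points falls outside the 95% margin of error at both sample sizes (about 13.9 points for 50 polled and 9.8 points for 100 polled), so a difference this large is unlikely to be due to chance alone. So (c) is the right answer.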
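
As a complementary check, the gap between the two polls can also be compared against the standard error of a difference. This is only a sketch: it assumes the two polls are independent, reuses the worst-case variance of 0.25, and relies on a normal approximation. The variable names are illustrative.

```python
import math

MAX_VARIANCE = 0.25  # worst-case variance of a single response

n_survey_monkey, n_qualtrics = 100, 50
observed_gap = 0.59 - 0.42

# Variances of independent estimates add, so the standard error of the
# difference combines both sample sizes
se_diff = math.sqrt(MAX_VARIANCE / n_survey_monkey + MAX_VARIANCE / n_qualtrics)
z = observed_gap / se_diff

# Two-sided tail probability under a normal approximation
p_value = math.erfc(z / math.sqrt(2))

print('se of difference:', se_diff)
print('gap in se units:', z)
print('two-sided tail probability:', p_value)
```

Under these assumptions the gap comes out at roughly two standard errors.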

### Q2. The data provider suspects that the SurveyMonkey dataset is biased. What do you think?

In [22]:
#Let's look at all the variables of interest
cols = ['County', 'Age', 'Gender']
for col in cols:
    print(survey_monkey.groupby(col)[col].count()/100)
    print(qualtrics.groupby(col)[col].count()/50)
    print()

County
Bailey           0.03
Black            0.09
Mountain Farm    0.79
Riverside        0.09
Name: County, dtype: float64
County
Bailey           0.06
Black            0.02
Mountain Farm    0.80
Riverside        0.12
Name: County, dtype: float64

Age
26-35    0.04
36-45    0.17
46-55    0.46
56-65    0.14
65+      0.19
Name: Age, dtype: float64
Age
26-35    0.06
36-45    0.20
46-55    0.36
56-65    0.26
65+      0.12
Name: Age, dtype: float64

Gender
female    0.23
male      0.77
Name: Gender, dtype: float64
Gender
female    0.42
male      0.58
Name: Gender, dtype: float64



The SurveyMonkey sample looks gender-biased: it is 77% male, compared with 58% male in the Qualtrics sample. Let's see if this could have happened by chance.

In [30]:
voter_roll.groupby('Gender')['Gender'].count()/4239
# Men make up a fraction of 0.521585 (about 52.2%) of the whole population
observed_difference = 0.77-0.521585

#calculate the worst case standard error
se = np.sqrt(MAX_VARIANCE/100)

print('Number of se\'s from the expected value', observed_difference/se)
# roughly equal to 5, very unlikely!!!

Number of se's from the expected value 4.968300000000001
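
The calculation above uses the worst-case variance. As a sketch of a slightly tighter version, the standard error can instead be computed from the population share of men itself, assuming the SurveyMonkey respondents would behave like a simple random sample of the voter roll if the poll were unbiased. The variable names here are illustrative.

```python
import math

p_male = 0.521585      # share of men in the voter roll, from the cell above
sample_male = 0.77     # share of men in the SurveyMonkey sample
n = 100                # SurveyMonkey sample size

# Standard error of a sample proportion when the true proportion is p_male
se_exact = math.sqrt(p_male * (1 - p_male) / n)
z = (sample_male - p_male) / se_exact

print("Number of se's from the expected value", z)
```

Because the population share is close to 0.5, this gives nearly the same answer as the worst-case calculation.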


### Q3. Which of the following best describes the margin of error of the Qualtrics poll?

In [32]:
print('50 polled: ', ci(50)) #answer is (b)

50 polled:  {'68% +/-': 0.07071067811865475, '95% +/-': 0.13859292911256332, '99% +/-': 0.1817264427649427}


### Q4.  A news report suggests that Pearle Goodman is dropping out of the election. Is it clear which candidate benefits from her departure?

In [41]:
combined_dataset = pd.concat([survey_monkey, qualtrics])
candidates = ['Genevieve Gallegos', 'Masako Holley', 'Pearle Goodman']
for cand in candidates:
    filtered = combined_dataset[combined_dataset['Vote'] == cand] #get those rows that voted for each candidate
    
    print("--- Breakdown for", cand ,"---")
    
    cols = ['County', 'Age', 'Gender']
    for col in cols:
        
        print(filtered.groupby(col)[col].count()/len(filtered))
    print("++")
    print()

--- Breakdown for Genevieve Gallegos ---
County
Bailey           0.0250
Black            0.0875
Mountain Farm    0.7750
Riverside        0.1125
Name: County, dtype: float64
Age
26-35    0.0750
36-45    0.1625
46-55    0.4250
56-65    0.1750
65+      0.1625
Name: Age, dtype: float64
Gender
female    0.0875
male      0.9125
Name: Gender, dtype: float64
++

--- Breakdown for Masako Holley ---
County
Bailey           0.066667
Black            0.050000
Mountain Farm    0.800000
Riverside        0.083333
Name: County, dtype: float64
Age
26-35    0.016667
36-45    0.200000
46-55    0.433333
56-65    0.166667
65+      0.183333
Name: Age, dtype: float64
Gender
female    0.6
male      0.4
Name: Gender, dtype: float64
++

--- Breakdown for Pearle Goodman ---
County
Mountain Farm    0.9
Riverside        0.1
Name: County, dtype: float64
Age
36-45    0.2
46-55    0.4
56-65    0.3
65+      0.1
Name: Age, dtype: float64
Gender
female    0.1
male      0.9
Name: Gender, dtype: float64
++



As you can see above, Pearle Goodman's supporters have nearly the same male-female breakdown as Genevieve Gallegos's. So it would be reasonable to assume that Goodman's votes would go to Gallegos. However, we also accepted arguments that the sample of Goodman supporters was too small to tell.
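
To illustrate the sample-size argument, the sketch below computes a worst-case 95% margin of error for each candidate's group of supporters. The counts of 80, 60, and 10 supporters are an assumption inferred from the fractions printed above, not figures stated in the assignment, and `margin_95` is a hypothetical helper.

```python
import math

MAX_VARIANCE = 0.25

def margin_95(size):
    # worst-case 95% margin of error for a proportion estimated from `size` responses
    return 1.96 * math.sqrt(MAX_VARIANCE / size)

# Assumed supporter counts in the combined sample (inferred, not stated in the source)
supporters = [('Genevieve Gallegos', 80), ('Masako Holley', 60), ('Pearle Goodman', 10)]
for cand, n in supporters:
    print(cand, n, margin_95(n))
```

With only about 10 respondents, the margin on any share within Goodman's group is roughly 30 points, which is why the gender breakdown for her supporters is weak evidence on its own.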

### Q5. How likely are the following scenarios to affect the polling results (choose between “in favor of Genevieve Gallegos”, “in favor of Masako Holley”, or statistically insignificant/unclear)? Explain.

We accepted all reasonable arguments here.