# Opinion Polling

<strong>Q1.</strong> The SurveyMonkey data shows Genevieve Gallegos winning with 59% of the vote among 100 people polled, and the Qualtrics data shows her losing with 42% of the vote among 50 people polled. Which of the following best describes the likelihood that a difference this large (about 17 percentage points) arose purely by random chance rather than from an error in the polling process?

a) 20% chance of a variation greater than 17% in independent polls.\
b) 10% chance of a variation greater than 17% in independent polls.\
<strong>c) <5% chance of a variation greater than 17% in independent polls.</strong> \
d) Not enough information to determine this.

In [43]:
import pandas as pd


sm = pd.read_csv('survey_monkey.csv')
q = pd.read_csv('qualtrics.csv')

# Convert each sample to binary dataset (1 if voted Genevieve)

sm_votes = (sm['Vote'] == 'Genevieve Gallegos')
q_votes = (q['Vote'] == 'Genevieve Gallegos')
sm_votes = sm_votes.astype(int)
q_votes = q_votes.astype(int)

var = (sm_votes.var()/sm_votes.shape[0]) + (q_votes.var()/q_votes.shape[0])
diff = (sm_votes.mean() - q_votes.mean())
print('diff =', diff, ', diff^2 =', diff**2)
print('var =', var, ', 4 * var =', 4 * var)

diff = 0.16999999999999998 , diff^2 = 0.028899999999999995
var = 0.007414862914862911 , 4 * var = 0.029659451659451642


Let $\hat{\mu}_1$ and $\hat{\mu}_2$ denote the sample means of the two surveys (the fractions of the sample populations that voted Genevieve).
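We assume that, if both polls sample the same population without bias, $\hat{\mu}_1 - \hat{\mu}_2$ is approximately normally distributed, centered on 0, with variance $\frac{\sigma_1^2}{N_1} + \frac{\sigma_2^2}{N_2}$. The observed difference is 0.17, and its square ($\approx 0.0289$) is roughly equal to $4 \cdot (\frac{\sigma_1^2}{N_1} + \frac{\sigma_2^2}{N_2}) \approx 0.0297$, so the difference is about two standard deviations away from 0. About 95% of a normal distribution lies within two standard deviations of its mean, so there is roughly a (100% - 95%) = 5% chance of a variation this large. More precisely, 1.96 standard deviations is $\approx 0.169$, just under the observed 0.17, so the chance is slightly below 5%.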
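
As a rough check on the two-standard-deviation argument, the sketch below recomputes the test from the counts stated in the question (59 of 100 and 21 of 50) and converts the z-score into a two-sided tail probability. It assumes the normal approximation described above; the helper `sample_variance` is an illustrative name, not part of the original notebook.

```python
import math

# Counts from the question: 59 of 100 (SurveyMonkey), 21 of 50 (Qualtrics).
n1, k1 = 100, 59
n2, k2 = 50, 21

def sample_variance(k, n):
    """Unbiased sample variance of n binary votes containing k ones.

    This matches the default behaviour of pandas Series.var() (ddof=1).
    """
    p = k / n
    return p * (1 - p) * n / (n - 1)

diff = k1 / n1 - k2 / n2
var = sample_variance(k1, n1) / n1 + sample_variance(k2, n2) / n2

# z-score of the observed difference under the "no real difference" assumption
z = diff / math.sqrt(var)

# Two-sided tail probability of a standard normal: P(|Z| > z) = erfc(z / sqrt(2))
p_two_sided = math.erfc(abs(z) / math.sqrt(2))

print(f"diff = {diff:.2f}, var = {var:.6f}")
print(f"z = {z:.3f}, two-sided p = {p_two_sided:.4f}")
```

The z-score lands just below 2 and the tail probability slightly under 0.05, which agrees with answer c).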

<strong>Q2.</strong> The data provider suspects that the SurveyMonkey dataset is biased. What do you think? \
<strong>a) Yes, the SurveyMonkey dataset shows a clear bias in data collection</strong> \
b) No, the observed bias is likely due to the natural variation in randomly sampled data \
c) The sample size is too small to determine this

In [78]:
v = pd.read_csv('voters.csv')
voters_survey_monkey = v[v.Voter.isin(sm.Voter)]
print('SurveyMonkey Voters: ')
print('Gender stats')
sm_gender = voters_survey_monkey['Gender']
print(sm_gender.value_counts())
print('m:f ratio:', sm_gender.value_counts().male/sm_gender.value_counts().female)
print('Age stats')
print(voters_survey_monkey['Age'].value_counts())
print('County stats')
print(voters_survey_monkey['County'].value_counts())

print('----------------------------')
voters_qualtrics = v[v.Voter.isin(q.Voter)]
print('Qualtrics Voters: ')
print('Gender stats')
q_gender = voters_qualtrics['Gender']
print(q_gender.value_counts())
print('m:f ratio:', q_gender.value_counts().male/q_gender.value_counts().female)
print('Age stats')
print(voters_qualtrics['Age'].value_counts())
print('County stats')
print(voters_qualtrics['County'].value_counts())

SurveyMonkey Voters: 
Gender stats
male      77
female    23
Name: Gender, dtype: int64
m:f ratio: 3.347826086956522
Age stats
46-55    47
65+      18
36-45    16
56-65    15
26-35     4
Name: Age, dtype: int64
County stats
Mountain Farm    79
Riverside        10
Black             8
Bailey            3
Name: County, dtype: int64
----------------------------
Qualtrics Voters: 
Gender stats
male      29
female    21
Name: Gender, dtype: int64
m:f ratio: 1.380952380952381
Age stats
46-55    17
56-65    13
36-45    11
65+       6
26-35     3
Name: Age, dtype: int64
County stats
Mountain Farm    40
Riverside         6
Bailey            3
Black             1
Name: County, dtype: int64


Let $x_i$ be an observation of the $i$-th participant's gender (1 if male, 0 if female) in the SurveyMonkey poll. The observed mean, then, is $\bar{x} = \frac{77}{100} = 0.77$.

In [79]:
print('Population: ')
print('Gender stats')
p_gender = v['Gender']
print(p_gender.value_counts())
print('m:f ratio:', p_gender.value_counts().male/p_gender.value_counts().female)
print('Age stats')
print(v['Age'].value_counts())
print('County stats')
print(v['County'].value_counts())

Population: 
Gender stats
male      2210
female    2028
Name: Gender, dtype: int64
m:f ratio: 1.0897435897435896
Age stats
46-55    1627
56-65     872
36-45     849
65+       742
26-35      98
18-25      50
Name: Age, dtype: int64
County stats
Mountain Farm    3388
Riverside         380
Black             334
Bailey            136
Name: County, dtype: int64


In the population, the mean of the gender random variable is $\mu = \frac{2210}{4238} \approx 0.521$. A conservative 95% margin of error for the SurveyMonkey sample is $\frac{(1-0)}{\sqrt{100}}=0.1$, the worst-case bound of about two standard errors, $2\sqrt{0.25/N} = 1/\sqrt{N}$. Since $|0.77-0.521| = 0.249 > 0.1$, the sample mean lies well outside that margin, so the sample is very likely biased with regard to gender (with more than 95% confidence).
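
The same comparison can be applied to both polls. The sketch below uses only the gender counts printed above (77 of 100 and 29 of 50 male respondents, 2210 of 4238 in the population) and the conservative $1/\sqrt{N}$ margin; `conservative_moe` is a hypothetical helper name introduced for illustration.

```python
import math

def conservative_moe(n):
    """Conservative 95% margin of error for a 0/1 variable: (1 - 0) / sqrt(n)."""
    return (1 - 0) / math.sqrt(n)

# Population share of male voters, from the voters.csv output above
pop_male_share = 2210 / 4238

# (poll name, male respondents, sample size), from the outputs above
samples = [("SurveyMonkey", 77, 100), ("Qualtrics", 29, 50)]

results = {}
for name, males, n in samples:
    gap = abs(males / n - pop_male_share)   # distance from the population mean
    moe = conservative_moe(n)
    results[name] = (gap, moe)
    print(f"{name}: gap = {gap:.3f}, margin = {moe:.3f}, outside margin: {gap > moe}")
```

Under these assumptions the SurveyMonkey gap falls outside its margin while the Qualtrics gap stays inside it, which supports treating only the SurveyMonkey sample as gender-biased.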

<strong>Q3.</strong> Which of the following best describes the margin of error for the winning candidate of the
Qualtrics poll: \
a) +/- 10% with 99% confidence \
<strong>b) +/- 15% with 95% confidence</strong> \
c) +/- 20% with 95% confidence

In [106]:
import math
moe = 1.96 * math.sqrt(q_votes.var() / q_votes.shape[0])
print(moe)

0.13819638200763432


The margin of error with 95% confidence is $\approx \pm0.14$; the nearest answer is $\pm15\%$.
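
To compare the answer options directly, the following sketch recomputes the margin of error at both 95% and 99% confidence from the Qualtrics counts (21 of 50 votes for Genevieve Gallegos, as in Q1). It assumes the same normal approximation as the cell above; `margin_of_error` is an illustrative helper, and the critical value comes from the standard library's `statistics.NormalDist`.

```python
import math
from statistics import NormalDist

def margin_of_error(k, n, confidence):
    """Normal-approximation margin of error for k ones among n binary votes."""
    p = k / n
    var = p * (1 - p) * n / (n - 1)                  # sample variance (ddof=1), as pandas .var()
    z = NormalDist().inv_cdf(0.5 + confidence / 2)   # two-sided critical value
    return z * math.sqrt(var / n)

moe_95 = margin_of_error(21, 50, 0.95)
moe_99 = margin_of_error(21, 50, 0.99)

print(f"95% margin of error: +/- {moe_95:.3f}")
print(f"99% margin of error: +/- {moe_99:.3f}")
```

The 99% margin comes out at roughly 18 points, wider than the 10 points in option a), while the 95% margin of about 14 points is closest to option b).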

<strong>Q4.</strong> How likely are the following scenarios to meaningfully affect the polling results (choose
between “in favor of Genevieve Gallegos”, “in favor of Masako Holley”, or statistically
insignificant/unclear)? Explain.

a) Only half of the registered voters from Mountain Farm County turn out to vote.

In [136]:
# DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent
combined = pd.concat([sm, q], ignore_index=True)
voters_combined = v[v.Voter.isin(combined.Voter)]
voters_mf = voters_combined[voters_combined['County'] == 'Mountain Farm']
MF_votes = voters_mf.merge(combined, on='Voter')
MF_votes = (MF_votes['Vote'] == 'Genevieve Gallegos')

# get the votes for Genevieve Gallegos from people living in Mountain Farm County
MF_votes = MF_votes.astype(int)
print('mean:', MF_votes.mean())
print('var:', MF_votes.var())
MF_moe = 1.96 * math.sqrt(MF_votes.var() / MF_votes.shape[0])
print('MOE with 95% confidence: (+/-)', MF_moe)

total = v.shape[0]
MF_count = v[v['County'] == 'Mountain Farm'].shape[0] / 2
print('Registered voters who are voting:', total - MF_count)
print('Mountain Farm voters who turned out to vote:', MF_count)

mean: 0.5210084033613446
var: 0.2516735507762424
MOE with 95% confidence: (+/-) 0.09013664289354112
Registered voters who are voting: 2544.0
Mountain Farm voters who turned out to vote: 1694.0


About 52% of surveyed voters from Mountain Farm would vote for Genevieve Gallegos, meaning that the other two candidates would each get less than 48% of the votes. With 95% confidence, Genevieve Gallegos would get 43-61% of the Mountain Farm vote, which corresponds to 728-1033 of the 1694 votes cast there. Given that at most 2544 voters would turn out in total (since only half of the registered Mountain Farm voters are voting), and that she also has supporters in other counties, this is a significant number of votes. We can say that this scenario <strong>is in favor of Genevieve Gallegos</strong>.
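
The vote range can be reproduced from the printed values as a quick arithmetic check. The sketch below uses the unrounded mean and margin of error, so its bounds differ by a couple of votes from the figures in the text, which are based on the rounded 43% and 61%.

```python
# Values printed by the cell above
mean = 0.5210084033613446    # Mountain Farm share for Genevieve Gallegos
moe = 0.09013664289354112    # 95% margin of error for that share

mf_turnout = 3388 / 2               # half of the registered Mountain Farm voters
total_turnout = 4238 - mf_turnout   # everyone else is assumed to vote

low_votes = (mean - moe) * mf_turnout
high_votes = (mean + moe) * mf_turnout

print(f"Mountain Farm votes for Gallegos: {low_votes:.0f} to {high_votes:.0f}")
print(f"Share of total turnout: {low_votes / total_turnout:.1%} to {high_votes / total_turnout:.1%}")
```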

b) The elections are held during the regional college’s final exam week leading to a poor
turnout for the 18-25 and 26-35 age group.

In [143]:
surveyed_voters = voters_combined.merge(combined, on='Voter')
# keep only the surveyed voters in the two youngest age groups
young_voters = surveyed_voters[surveyed_voters['Age'].isin(['18-25', '26-35'])]
print(young_voters)

               Voter Gender    Age         County                Vote
27   Darrell Vadnais   male  26-35  Mountain Farm       Masako Holley
40    Kenneth Heilig   male  26-35  Mountain Farm  Genevieve Gallegos
50     Thomas Badger   male  26-35  Mountain Farm  Genevieve Gallegos
60        Hank Baker   male  26-35          Black  Genevieve Gallegos
118   Warren Berumen   male  26-35  Mountain Farm  Genevieve Gallegos
129    Elijah Arnold   male  26-35      Riverside  Genevieve Gallegos
142   Esteban Tipton   male  26-35  Mountain Farm  Genevieve Gallegos


Since the surveys contain no voters aged 18-25 and only 7 voters aged 26-35, there is too little information to draw a conclusion, so the scenario is <strong>statistically insignificant</strong>.
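
Two numbers make the "too little information" argument concrete: how small the two age groups are in the electorate, and how wide the margin of error is for a sample of 7. The sketch below uses the age counts printed for `voters.csv` and the conservative $1/\sqrt{N}$ margin as a rough guide only.

```python
import math

population = 4238
young_registered = 50 + 98   # voters aged 18-25 and 26-35 in voters.csv
young_surveyed = 7           # surveyed voters in those groups (all aged 26-35)

young_share = young_registered / population
moe_young = 1 / math.sqrt(young_surveyed)   # conservative 95% margin of error

print(f"Share of electorate aged 18-35: {young_share:.1%}")
print(f"Margin of error for n = {young_surveyed}: +/- {moe_young:.2f}")
```

The two groups make up only about 3.5% of registered voters, and a margin of nearly 38 points on 7 respondents is too wide to say how they would vote.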

c) A women’s organization in Mountain Farm County endorsed Masako Holley.

In [151]:
MF_female_voters = v[(v['Gender'] == 'female') & (v['County'] == 'Mountain Farm')]
print(MF_female_voters)
print(total)

               Voter  Gender    Age         County
0      Jessica Perez  female  56-65  Mountain Farm
2       Ellen Delrio  female  56-65  Mountain Farm
3        Betty Lewis  female  36-45  Mountain Farm
5        Gloria Lowe  female  36-45  Mountain Farm
6       Sarah Ybarra  female  56-65  Mountain Farm
...              ...     ...    ...            ...
4222   Johnetta King  female  56-65  Mountain Farm
4228  Judith Flatley  female  36-45  Mountain Farm
4230     Donna Dixon  female  36-45  Mountain Farm
4233  Chelsea Butler  female  56-65  Mountain Farm
4236   Thuy Thompson  female  56-65  Mountain Farm

[1626 rows x 4 columns]
4238


Given that there are 4238 registered voters in total, 1626 female voters in Mountain Farm who are likely to vote for Masako Holley is a significant number once the rest of Masako Holley's supporters are added. Thus, the <strong>scenario is in favor of Masako Holley</strong>.
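
For scale, the size of this group can be expressed as a share of the electorate and of Mountain Farm County, using the counts printed above. This only measures how large the group is; how many of these voters would actually follow the endorsement is not something the data shows.

```python
total_voters = 4238          # all registered voters
mf_voters = 3388             # registered voters in Mountain Farm County
mf_female_voters = 1626      # female registered voters in Mountain Farm County

share_of_electorate = mf_female_voters / total_voters
share_of_county = mf_female_voters / mf_voters

print(f"Share of all registered voters: {share_of_electorate:.1%}")
print(f"Share of Mountain Farm voters:  {share_of_county:.1%}")
```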

