### Use the state data (the state of your choice) generated in Stage II to fit a distribution to the number of COVID-19 cases. (25 points)

#### Graphically plot the distribution and describe the distribution statistics. If using discrete values, calculate the Probability Mass Function for the individual values or range (if using histogram) and plot that.

In [None]:
# Imports
import pandas as pd
import numpy as np
import plotly.graph_objects as go
import matplotlib.pyplot as plt
import scipy.stats as stats

from IPython.display import Image

In [None]:
# Load cases, deaths, and county population data
dataFileCases = '../../Team/covid_confirmed_usafacts.csv'
dataFileDeaths = '../../Team/covid_deaths_usafacts.csv'
dataFilePopulation = '../../Team/covid_county_population_usafacts.csv'

dfCases = pd.read_csv(dataFileCases)
dfDeaths = pd.read_csv(dataFileDeaths)
dfPopulation = pd.read_csv(dataFilePopulation)

In [None]:
# Get only the entries relating to the state of Georgia
dfCasesGA = dfCases.loc[dfCases['State'] == 'GA']

In [None]:
# Select case data on a weekly basis (Wednesday from each week)
# July 1, 2020 is a Wednesday
count = 2
weeklyCasesGA = [] # where Wednesday from each week is chosen
for col in dfCasesGA.columns:
    count = count + 1
    if count % 7 == 0:
        thisSum = dfCasesGA[col].sum()
        # Normalize to cases per 100,000 residents (Georgia population ~10.8 million)
        thisSum = thisSum / 10800000 * 100000
        weeklyCasesGA.append(thisSum)
dfWeeklyCasesGA = pd.Series(weeklyCasesGA)
dfWeeklyCasesGA.head()

In [None]:
# Graph case data
fig = go.Figure()

fig.add_trace(go.Scatter(x=dfWeeklyCasesGA.index, y=dfWeeklyCasesGA,
                         mode='lines', name='Cases'))
fig.update_layout(
    title='Weekly Georgia COVID-19 Cases',
    xaxis=dict(title='Weeks From July 2020'),
    yaxis=dict(title='Cases per 100,000 Residents'),
    showlegend=True,
    width=800,
    height=500
)

fig.write_image("ga_cases.png")
Image(filename="ga_cases.png")
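
The task statement above also asks for a probability mass function when the values are binned. A minimal sketch of one way to do that, assuming equal-width histogram bins; `empirical_pmf` and the `sample` array are hypothetical stand-ins and not part of the original notebook (in practice `dfWeeklyCasesGA` would be passed in):

```python
import numpy as np

def empirical_pmf(values, bins=10):
    """Return bin edges and the share of observations falling in each bin."""
    counts, edges = np.histogram(values, bins=bins)
    pmf = counts / counts.sum()  # relative frequency per bin; sums to 1
    return edges, pmf

# Hypothetical weekly values standing in for dfWeeklyCasesGA
sample = np.array([1.0, 2.0, 2.5, 3.0, 7.0, 8.0, 8.5, 9.0, 9.5, 10.0])
edges, pmf = empirical_pmf(sample, bins=3)
print("Bin edges:", edges)
print("PMF:", pmf)
```

The resulting `pmf` can be drawn with a bar chart, using the bin midpoints on the x-axis.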

#### Describe the type of distribution (modality) and its statistics (moments of a distribution - center, variance, skewness, kurtosis) in the report and the notebook.

In [None]:
# Calculate center, variance, skewness and kurtosis
print("Center:", dfWeeklyCasesGA.mean())
print("Variance:", dfWeeklyCasesGA.var())
print("Skewness:", dfWeeklyCasesGA.skew())
print("Kurtosis:", dfWeeklyCasesGA.kurt())
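
For reference, the four statistics can also be computed from their textbook definitions. This is a sketch with a hypothetical helper name, `describe_moments`, and it uses the plain population formulas for skewness and excess kurtosis. pandas' `skew()` and `kurt()` apply small-sample bias corrections, so their output will differ slightly from these values on short series.

```python
import numpy as np

def describe_moments(values):
    """Mean, sample variance, skewness and excess kurtosis from central moments."""
    x = np.asarray(values, dtype=float)
    mean = x.mean()
    dev = x - mean
    m2 = np.mean(dev ** 2)  # second central moment
    m3 = np.mean(dev ** 3)  # third central moment
    m4 = np.mean(dev ** 4)  # fourth central moment
    return {
        "mean": mean,
        "variance": x.var(ddof=1),            # sample variance, as in pandas .var()
        "skewness": m3 / m2 ** 1.5,           # 0 for a symmetric sample
        "excess_kurtosis": m4 / m2 ** 2 - 3,  # 0 for a normal distribution
    }

# Hypothetical data, used only to exercise the function
moments = describe_moments([1, 2, 3, 4, 5])
print(moments)
```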

#### Compare the distribution and its statistics to 5 other states of your choosing. Describe if the distributions look different and what does that imply.

In [None]:
# [1] Case data for Florida (FL)
dfCasesFL = dfCases.loc[dfCases['State'] == 'FL']

# [2] Case data for Michigan (MI)
dfCasesMI = dfCases.loc[dfCases['State'] == 'MI']

# [3] Case data for North Carolina (NC)
dfCasesNC = dfCases.loc[dfCases['State'] == 'NC']

# [4] Case data for New Jersey (NJ)
dfCasesNJ = dfCases.loc[dfCases['State'] == 'NJ']

# [5] Case data for Texas (TX)
dfCasesTX = dfCases.loc[dfCases['State'] == 'TX']

In [None]:
# Gather weekly Florida cases
count = 2
weeklyCasesFL = []
for col in dfCasesFL.columns:
    count = count + 1
    if count % 7 == 0:
        weeklyCasesFL.append(dfCasesFL[col].sum())
dfWeeklyCasesFL = pd.Series(weeklyCasesFL)
dfWeeklyCasesFL.head()

In [None]:
# Gather weekly Michigan cases
count = 2
weeklyCasesMI = []
for col in dfCasesMI.columns:
    count = count + 1
    if count % 7 == 0:
        weeklyCasesMI.append(dfCasesMI[col].sum())
dfWeeklyCasesMI = pd.Series(weeklyCasesMI)
dfWeeklyCasesMI.head()

In [None]:
# Gather weekly North Carolina cases
count = 2
weeklyCasesNC = []
for col in dfCasesNC.columns:
    count = count + 1
    if count % 7 == 0:
        weeklyCasesNC.append(dfCasesNC[col].sum())
dfWeeklyCasesNC = pd.Series(weeklyCasesNC)
dfWeeklyCasesNC.head()

In [None]:
# Gather weekly New Jersey cases
count = 2
weeklyCasesNJ = []
for col in dfCasesNJ.columns:
    count = count + 1
    if count % 7 == 0:
        weeklyCasesNJ.append(dfCasesNJ[col].sum())
dfWeeklyCasesNJ = pd.Series(weeklyCasesNJ)
dfWeeklyCasesNJ.head()

In [None]:
# Gather weekly Texas cases
count = 2
weeklyCasesTX = []
for col in dfCasesTX.columns:
    count = count + 1
    if count % 7 == 0:
        weeklyCasesTX.append(dfCasesTX[col].sum())
dfWeeklyCasesTX = pd.Series(weeklyCasesTX)
dfWeeklyCasesTX.head()
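
The per-state loops above all select the same columns (positions 4, 11, 18, ... of the frame) and sum them across counties. A possible consolidation is sketched below. `weekly_totals` is a hypothetical helper, and it assumes the first four columns are metadata (county FIPS, county name, state, state FIPS) so that position 4 is the first date column. The optional `population` argument is also an assumption, added so every state could be put on the same per-100,000 scale.

```python
import pandas as pd

def weekly_totals(df_state, first_col=4, step=7, population=None):
    """Sum every `step`-th column across counties, starting at `first_col`."""
    totals = df_state.iloc[:, first_col::step].sum(axis=0).reset_index(drop=True)
    if population is not None:
        totals = totals / population * 100000  # cases per 100,000 residents
    return totals

# Small synthetic frame: 4 metadata columns followed by 15 daily columns
dates = {f"d{i}": [i, 2 * i] for i in range(15)}
demo = pd.DataFrame({
    "countyFIPS": [1, 2],
    "County Name": ["A", "B"],
    "State": ["GA", "GA"],
    "StateFIPS": [13, 13],
    **dates,
})
demo_weekly = weekly_totals(demo)
print(demo_weekly)
```

With this helper, a call such as `weekly_totals(dfCasesFL)` would stand in for one of the loops, which also avoids copy-and-paste slips between states.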

In [None]:
# Graph all state case data
fig = go.Figure()

fig.add_trace(go.Scatter(x=dfWeeklyCasesGA.index, y=dfWeeklyCasesGA,
                         mode='lines', name='Georgia'))
fig.add_trace(go.Scatter(x=dfWeeklyCasesFL.index, y=dfWeeklyCasesFL,
                         mode='lines', name='Florida'))
fig.add_trace(go.Scatter(x=dfWeeklyCasesMI.index, y=dfWeeklyCasesMI,
                         mode='lines', name='Michigan'))
fig.add_trace(go.Scatter(x=dfWeeklyCasesNC.index, y=dfWeeklyCasesNC,
                         mode='lines', name='North Carolina'))
fig.add_trace(go.Scatter(x=dfWeeklyCasesNJ.index, y=dfWeeklyCasesNJ,
                         mode='lines', name='New Jersey'))
fig.add_trace(go.Scatter(x=dfWeeklyCasesTX.index, y=dfWeeklyCasesTX,
                         mode='lines', name='Texas'))
fig.update_layout(
    title='Weekly COVID-19 Cases',
    xaxis=dict(title='Weeks From July 2020'),
    yaxis=dict(title='Number of Cases'),
    showlegend=True,
    width=800,
    height=500
)

fig.write_image("all_cases.png")
Image(filename="all_cases.png")

In [None]:
# Calculate center, variance, skewness and kurtosis for other states
dfs = {"Florida": dfWeeklyCasesFL, "Michigan": dfWeeklyCasesMI, "North Carolina": dfWeeklyCasesNC, "New Jersey": dfWeeklyCasesNJ, "Texas": dfWeeklyCasesTX }

for key, value in dfs.items():
    print("~~~~~~~~ " + key + " ~~~~~~~~")
    print("Center:", value.mean())
    print("Variance:", value.var())
    print("Skewness:", value.skew())
    print("Kurtosis:", value.kurt())
    print()

<span style="color:red">**The distribution for Florida differs from the distribution for Georgia.**</span>

### Model a Poisson distribution of COVID-19 cases and deaths of a state and compare it to 5 other states. Describe how the Poisson modeling is different from the first modeling you did. (25 points)

In [None]:
# [1] Death data for Georgia (GA)
dfDeathsGA = dfDeaths.loc[dfDeaths['State'] == 'GA']

# [2] Death data for Florida (FL)
dfDeathsFL = dfDeaths.loc[dfDeaths['State'] == 'FL']

# [3] Death data for Michigan (MI)
dfDeathsMI = dfDeaths.loc[dfDeaths['State'] == 'MI']

# [4] Death data for North Carolina (NC)
dfDeathsNC = dfDeaths.loc[dfDeaths['State'] == 'NC']

# [5] Death data for New Jersey (NJ)
dfDeathsNJ = dfDeaths.loc[dfDeaths['State'] == 'NJ']

# [6] Death data for Texas (TX)
dfDeathsTX = dfDeaths.loc[dfDeaths['State'] == 'TX']

In [None]:
# Gather weekly Georgia Deaths
count = 2
weeklyDeathsGA = []
for col in dfDeathsGA.columns:
    count = count + 1
    if count % 7 == 0:
        weeklyDeathsGA.append(dfDeathsGA[col].sum())
dfWeeklyDeathsGA = pd.Series(weeklyDeathsGA)
dfWeeklyDeathsGA.head()

In [None]:
# Gather weekly Florida Deaths
count = 2
weeklyDeathsFL = []
for col in dfDeathsFL.columns:
    count = count + 1
    if count % 7 == 0:
        weeklyDeathsFL.append(dfDeathsFL[col].sum())
dfWeeklyDeathsFL = pd.Series(weeklyDeathsFL)
dfWeeklyDeathsFL.head()

In [None]:
# Gather weekly Michigan Deaths
count = 2
weeklyDeathsMI = []
for col in dfDeathsMI.columns:
    count = count + 1
    if count % 7 == 0:
        weeklyDeathsMI.append(dfDeathsMI[col].sum())
dfWeeklyDeathsMI = pd.Series(weeklyDeathsMI)
dfWeeklyDeathsMI.head()

In [None]:
# Gather weekly North Carolina Deaths
count = 2
weeklyDeathsNC = []
for col in dfDeathsNC.columns:
    count = count + 1
    if count % 7 == 0:
        weeklyDeathsNC.append(dfDeathsNC[col].sum())
dfWeeklyDeathsNC = pd.Series(weeklyDeathsNC)
dfWeeklyDeathsNC.head()

In [None]:
# Gather weekly New Jersey Deaths
count = 2
weeklyDeathsNJ = []
for col in dfDeathsNJ.columns:
    count = count + 1
    if count % 7 == 0:
        weeklyDeathsNJ.append(dfDeathsNJ[col].sum())
dfWeeklyDeathsNJ = pd.Series(weeklyDeathsNJ)
dfWeeklyDeathsNJ.head()

In [None]:
# Gather weekly Texas Deaths
count = 2
weeklyDeathsTX = []
for col in dfDeathsTX.columns:
    count = count + 1
    if count % 7 == 0:
        weeklyDeathsTX.append(dfDeathsTX[col].sum())
dfWeeklyDeathsTX = pd.Series(weeklyDeathsTX)
dfWeeklyDeathsTX.head()

In [None]:
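# Graph Poisson distribution of COVID-19 Cases
# poisson.rvs requires an integer sample size, so round the (possibly fractional) sum
sampleSizeGA = int(round(np.sum(weeklyCasesGA)))
meanSizeGA = np.mean(weeklyCasesGA)
poissonCasesGA = stats.poisson.rvs(size=sampleSizeGA, mu=meanSizeGA)
# One bin per integer value, centered on that value
pd.DataFrame(poissonCasesGA).hist(range=(-0.5,max(poissonCasesGA)+0.5), bins=max(poissonCasesGA)+1, ec='black')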

In [None]:
# Graph Poisson distribution of COVID-19 Deaths
sampleSizeDeathsGA = np.sum(weeklyDeathsGA)
meanSizeDeathsGA = np.mean(weeklyDeathsGA)
poissonDeathsGA = stats.poisson.rvs(size=sampleSizeDeathsGA, mu=meanSizeDeathsGA)
# One bin per integer value, centered on that value (same binning as the cases plot)
pd.DataFrame(poissonDeathsGA).hist(range=(-0.5,max(poissonDeathsGA)+0.5), bins=max(poissonDeathsGA)+1, ec='black')
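
One way to describe how the Poisson model differs from the first, purely descriptive one: a Poisson distribution has a single parameter, and its variance equals its mean. The sketch below compares the two on observed counts and evaluates the fitted PMF. `poisson_check` and the `demo` array are hypothetical, introduced only for illustration.

```python
import numpy as np
from scipy import stats

def poisson_check(counts):
    """Fit a Poisson rate and report how far the data are from variance == mean."""
    x = np.asarray(counts)
    lam = x.mean()                      # maximum-likelihood estimate of the rate
    dispersion = x.var(ddof=1) / lam    # about 1 if the data are Poisson-like
    k = np.arange(0, x.max() + 1)       # support on which to evaluate the PMF
    pmf = stats.poisson.pmf(k, mu=lam)  # P(X = k) under the fitted model
    return lam, dispersion, k, pmf

# Hypothetical small counts standing in for a weekly series
demo = np.array([2, 3, 1, 4, 2, 3, 5, 2, 3, 5])
lam, dispersion, k, pmf = poisson_check(demo)
print("Rate:", lam, "Dispersion index:", dispersion)
```

A dispersion index far above 1 would suggest that a single-parameter Poisson model is too narrow for the series.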

### Perform correlation between Enrichment data variables and COVID-19 cases to observe any patterns. (20 points)

In [None]:
# Load election data
dfGovernor = pd.read_csv("../election_data/governors_county_candidate.csv")
dfPresident = pd.read_csv("../election_data/president_county_candidate.csv")
dfSenator = pd.read_csv("../election_data/senate_county_candidate.csv") 
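
A sketch of how the correlation itself could be computed once the election data has been joined to case counts at the county level. The column names `cases_per_100k` and `dem_vote_share` and all values are hypothetical; the real column names in the election files are not shown in this notebook.

```python
import pandas as pd

def pearson_r(df, x_col, y_col):
    """Pearson correlation coefficient between two columns of a DataFrame."""
    return df[x_col].corr(df[y_col])  # Series.corr defaults to Pearson

# Hypothetical county-level table after merging cases with election results
county = pd.DataFrame({
    "cases_per_100k": [4200.0, 5100.0, 3900.0, 6100.0, 4800.0],
    "dem_vote_share": [0.61, 0.42, 0.55, 0.38, 0.47],
})
r = pearson_r(county, "cases_per_100k", "dem_vote_share")
print("Pearson r:", r)
```

Values of `r` near 0 indicate little linear association, and a strong correlation would still not establish causation.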

### Formulate hypothesis between Enrichment data and number of cases to be compared against states. Choose 3 different variables to compare against. (30 points)

I propose the following hypotheses based on the variables that I chose for the previous section. Looking at the data, it is unlikely that any of the variables I analyzed in the previous section are correlated with COVID-19 cases. Still, it is worth confirming this through hypothesis testing.

* Does an increase in COVID-19 cases affect whether a candidate from the DEM or REP party wins?
* Does an increase in COVID-19 cases affect voter participation?
* Does an increase in COVID-19 cases affect the percentage of U.S. House and U.S. Senate seat flips?
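
One possible way to put the first question into testable form is to compare case rates between counties won by each party, with the null hypothesis that the two group means are equal. The sketch below uses Welch's two-sample t-test. `compare_case_rates`, the 0.05 threshold, and the sample values are assumptions for illustration, not results from this project.

```python
import numpy as np
from scipy import stats

def compare_case_rates(rates_a, rates_b, alpha=0.05):
    """Welch's t-test for a difference in mean case rate between two groups."""
    t_stat, p_value = stats.ttest_ind(rates_a, rates_b, equal_var=False)
    return t_stat, p_value, bool(p_value < alpha)

# Hypothetical cases per 100,000 in DEM-won and REP-won counties
dem_rates = np.array([4200.0, 3900.0, 4600.0, 4100.0, 4400.0])
rep_rates = np.array([4300.0, 4000.0, 4700.0, 4050.0, 4500.0])
t_stat, p_value, reject = compare_case_rates(dem_rates, rep_rates)
print("t =", t_stat, "p =", p_value, "reject H0:", reject)
```

The other two questions could be handled the same way by swapping in voter turnout or seat-flip indicators as the grouping or outcome variable.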