### Prototype: Statistical Analysis Phase

# Correlations between demographic and socio-economic factors and incidence of Covid 19 infection and mortality in U.S. Counties

#### Objectives
1. Gain a greater understanding of the relationship between race/ethnicity, gender, poverty and severe health conditions and Covid 19 morbidity and mortality.
2. Apply recently acquired Data Science skills to advocacy that addresses health disparities.

#### Method
From previous step:
1. Source data on race/ethnicity, gender, poverty and severe health conditions and Covid 19 morbidity and mortality at the U.S. county level
2. Clean and pre-process data according to unique identifiers

In this step:
3. Conduct exploratory data analysis
4. Test hypothesis that no relationship exists between features using statistical regression (Ordinary Least Squares).
5. Articulate conclusions and next steps for Statistical Analysis Phase.

In the next step:
6. Test hypothesis that features with highest importance are unable to predict Covid 19 morbidity and mortality using machine learning (Random Forest).
7. Visualise findings: a) highly correlated features and possible Simpson's paradox; b) outcomes of machine learning.
8. Articulate conclusions and next steps for Machine Learning Phase
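
Simpson's paradox, flagged in step 7, is one way a pooled correlation can take the opposite sign of the relationship seen within each subgroup. A small synthetic illustration follows; the groups and numbers are invented and are not drawn from the county data.

```python
import pandas as pd

# Two invented groups: within each, y rises with x,
# but the group with larger x values has a much lower baseline y.
df = pd.DataFrame({
    "group": ["low_x"] * 4 + ["high_x"] * 4,
    "x": [1, 2, 3, 4, 6, 7, 8, 9],
    "y": [10, 11, 12, 13, 2, 3, 4, 5],
})

# Correlation over all rows, ignoring the grouping.
pooled_r = df["x"].corr(df["y"])

# Correlation computed separately inside each group.
within_r = {name: g["x"].corr(g["y"]) for name, g in df.groupby("group")}

print(f"Pooled correlation: {pooled_r:.2f}")
print(within_r)
```

Here the pooled correlation is negative even though both within-group correlations are positive, which is the pattern to look for when a county-level result runs against expectation.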

### Acknowledgments
- Thanks to my instructors Andrew Worsely, Lydia Peabody, the team at General Assembly and my peers in GA Data Science June-August 2020.
- Julian Hatwell

The first step was to create a correlation matrix of the features. All numeric features are log-transformed.
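
The file loaded in the next cell is already log-transformed, and the exact transform applied upstream is not shown in this notebook. As a minimal sketch, assuming a `log1p` transform (which tolerates zero counts) and using made-up column names and values, that step might look like this:

```python
import numpy as np
import pandas as pd

# Hypothetical raw counts; the real columns come from the previous phase.
raw = pd.DataFrame({
    "County": ["A", "B", "C"],
    "Cases": [0, 9, 99],
    "Deaths": [0, 1, 9],
})

# Select numeric columns only, so identifiers are left untouched.
numeric_cols = raw.select_dtypes(include="number").columns

# log1p(x) = log(1 + x), which is defined for zero counts.
# Assumption: the original pipeline may have used a different transform.
logged = raw.copy()
logged[numeric_cols] = np.log1p(raw[numeric_cols])

print(logged)
```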

In [None]:
import pandas as pd

all_data_5 = pd.read_csv("../input/covid-19-race-gender-poverty-risk-us-county/covid_data_log_200922.csv")

In [None]:
all_data_5.corr()

In [None]:
all_data_5.info()

The next step was to visualise the correlations.

In [None]:
import seaborn as sns
sns.heatmap(all_data_5.corr())

#### Heatmap 1. Feature Correlations

A negative correlation was identified between 'Cases' and 'Risk Index', which needed to be validated.

In [None]:
corr = all_data_5.corr()
ax = sns.heatmap(
    corr, 
    vmin=-1, vmax=1, center=0,
    cmap=sns.diverging_palette(20, 220, n=200),
    square=True
)
ax.set_xticklabels(
    ax.get_xticklabels(),
    rotation=45,
    horizontalalignment='right'
);

#### Heatmap 2. Feature Correlations

A comparison heatmap was made for visual validation purposes.

### Observations for further investigation

1. The relationship between Risk Index and Cases/Deaths seems to be inverse
2. Cases and Deaths seem to be highly correlated
3. Deaths and Black Populations (Male & Female) seem to be highly correlated

Side note: Poverty seems to be perfectly correlated with White Populations (Male & Female), likely because both track overall county population.
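
Observations like these can be read off the correlation matrix programmatically rather than by eye. The helper below is a sketch (the name `top_correlations` and the demo values are illustrative, not part of the original analysis) that lists the feature pairs with the largest absolute Pearson correlation:

```python
import numpy as np
import pandas as pd

def top_correlations(df, n=5):
    """Return the n feature pairs with the largest absolute correlation."""
    corr = df.select_dtypes(include="number").corr()
    # Keep the upper triangle only (k=1 drops the diagonal),
    # so each pair appears exactly once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    # Order by magnitude but keep the sign in the returned values.
    order = pairs.abs().sort_values(ascending=False).index
    return pairs.reindex(order).head(n)

# Invented values standing in for the log-transformed county features.
demo = pd.DataFrame({
    "Cases": [1.0, 2.0, 3.0, 4.0, 5.0],
    "Deaths": [1.1, 1.9, 3.2, 3.9, 5.1],
    "Risk_Index": [5.0, 4.0, 3.5, 2.0, 1.0],
    "Noise": [2.0, 1.0, 2.0, 1.0, 2.0],
})

result = top_correlations(demo, n=3)
print(result)
```

Keeping the sign in the output makes inverse relationships, such as the one between Risk Index and Cases, stand out immediately.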

### Evaluating Distribution of Risk Index Feature

#### Plot 1. x = Risk Index (%; log); y = Cases (log)

In [None]:
# sns.scatterplot(all_data_5["Risk_Index"], all_data_5["Cases"])

# seaborn >= 0.12 requires x and y to be passed as keyword arguments
sns.jointplot(x="Risk_Index", y="Cases", data=all_data_5);


#### Plot 2. x = Risk Index (%; log); y = Deaths (log)

In [None]:
# sns.scatterplot(all_data_5["Risk_Index"], all_data_5["Deaths"])

# seaborn >= 0.12 requires x and y to be passed as keyword arguments
sns.jointplot(x="Risk_Index", y="Deaths", data=all_data_5);

#### Plot 3. x = Risk Index (%; log); y = Black Females (log)

In [None]:
# sns.scatterplot(all_data_5["Risk_Index"], all_data_5["B_Female"])

# seaborn >= 0.12 requires x and y to be passed as keyword arguments
sns.jointplot(x="Risk_Index", y="B_Female", data=all_data_5);

#### Plot 4. x = Risk Index (%; log); y = Hispanic Females (log)

In [None]:
# sns.scatterplot(x="Risk_Index", y="H_Female", data=all_data_5)

# seaborn >= 0.12 requires x and y to be passed as keyword arguments
sns.jointplot(x="Risk_Index", y="H_Female", data=all_data_5);

#### Plot 5. x = Risk Index (%; log); y = Poverty (log)

In [None]:
# sns.scatterplot(all_data_5["Risk_Index"], all_data_5["Poverty"])

# seaborn >= 0.12 requires x and y to be passed as keyword arguments
sns.jointplot(x="Risk_Index", y="Poverty", data=all_data_5);

#### Histogram 1. x = Risk Index (%; log)

In [None]:
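# distplot is deprecated since seaborn 0.11; histplot with kde=True is the closest replacement
sns.histplot(all_data_5["Risk_Index"], kde=True);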

In [None]:
all_data_5.loc[:,["Risk_Index"]].describe()

### Evaluating the relationship between Cases and Risk Index

#### OLS Regression (Cases / Risk Index (log) only)

Given the inverse relation between Cases and Risk Index illustrated in Heatmap 1, further investigation into the relationships between the features was required.
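
For a regression with a single predictor, the OLS slope always has the same sign as the Pearson correlation, because slope = r · (s_y / s_x). The sketch below checks that identity on invented numbers using NumPy only; it illustrates why a negative cell in the heatmap implies a negative coefficient, and is not a re-run of the model fitted in the next cell.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # stand-in for Risk_Index
y = np.array([9.0, 7.5, 6.0, 5.5, 3.0])   # stand-in for Cases

# Pearson correlation and the slope it implies.
r = np.corrcoef(x, y)[0, 1]
slope_from_r = r * y.std(ddof=1) / x.std(ddof=1)

# np.polyfit with deg=1 returns [slope, intercept] of the least-squares line.
slope, intercept = np.polyfit(x, y, deg=1)

print(f"r = {r:.3f}, slope = {slope:.3f}, slope from r = {slope_from_r:.3f}")
```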

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Data 
data = pd.read_csv('../input/covid-19-race-gender-poverty-risk-us-county/covid_data_log_200922.csv') 
df = pd.DataFrame(data)

# Fit Model & Output Regression Results Summary

# Import Package
import statsmodels.api as sm
from statsmodels.api import add_constant

# Build Model
X = data.loc[:,["Risk_Index"]]
y = data.loc[:,["Cases"]]

X = sm.add_constant(X)
model1 = sm.OLS(y,X)
results = model1.fit()

# MSE of the residuals
print(f"MSE: {results.mse_resid}")

# Output Results
results.summary()

In [None]:
# Define function to output plot of the model coefficients

def coefplot(results):
    '''
    Takes in results of OLS model and returns a plot of 
    the coefficients with 95% confidence intervals.
    
    Removes intercept, so if uncentered will return error.
    '''
    # Create dataframe of results summary 
    coef_df = pd.DataFrame(results.summary().tables[1].data)
    
    # Add column names
    coef_df.columns = coef_df.iloc[0]

    # Drop the extra row with column labels
    coef_df=coef_df.drop(0)

    # Set index to variable names 
    coef_df = coef_df.set_index(coef_df.columns[0])

    # Change datatype from object to float
    coef_df = coef_df.astype(float)

    # Get errors; (coef - lower bound of conf interval)
    errors = coef_df['coef'] - coef_df['[0.025']
    
    # Append errors column to dataframe
    coef_df['errors'] = errors

    # Drop the constant for plotting
    coef_df = coef_df.drop(['const'])

    # Sort values by coef ascending
    coef_df = coef_df.sort_values(by=['coef'])

    ### Plot Coefficients ###

    # x-labels
    variables = list(coef_df.index.values)
    
    # Add variables column to dataframe
    coef_df['variables'] = variables
    
    # Set sns plot style back to 'poster'
    # This will make bars wide on plot
    sns.set_context("poster")

    # Define figure, axes, and plot
    fig, ax = plt.subplots(figsize=(15, 10))
    
    # Error bars for 95% confidence interval
    # Can increase capsize to add whiskers
    coef_df.plot(x='variables', y='coef', kind='bar',
                 ax=ax, color='none', fontsize=22, 
                 ecolor='steelblue',capsize=0,
                 yerr='errors', legend=False)
    
    # Set title & labels
    plt.title('Coefficients of Features w/ 95% Confidence Intervals',fontsize=30)
    ax.set_ylabel('Coefficients',fontsize=22)
    ax.set_xlabel('',fontsize=22)
    
    # Coefficients
    # pd.np was removed in pandas 2.0; use numpy (imported as np) directly
    ax.scatter(x=np.arange(coef_df.shape[0]),
               marker='o', s=80, 
               y=coef_df['coef'], color='steelblue')
    
    # Line to define zero on the y-axis
    ax.axhline(y=0, linestyle='--', color='red', linewidth=1)
    
    return plt.show()

#### Coefficient Plot 1. Coefficient Plot Cases / Risk Index (log)

In [None]:
coefplot(results)

### Evaluating Morbidity (Cases)

A series of regressions was run with different feature combinations for evaluation and comparison purposes. Models were compared based on the statistical significance of features, plus AIC and R-squared.
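
The yardsticks quoted in the notes and cells below (R-squared, AIC and the residual MSE) can all be derived from the residual sum of squares. The NumPy sketch below works on synthetic data and uses the Gaussian log-likelihood form of AIC, which is assumed here to match what statsmodels reports for OLS; the function name `ols_metrics` is illustrative.

```python
import numpy as np

def ols_metrics(X, y):
    """Fit OLS with an intercept and return R-squared, AIC and residual MSE."""
    n = len(y)
    # Prepend a column of ones for the intercept.
    design = np.column_stack([np.ones(n), X])
    k = design.shape[1]                        # parameters incl. intercept
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    ssr = float(resid @ resid)                 # residual sum of squares
    sst = float(((y - y.mean()) ** 2).sum())   # total sum of squares
    # Gaussian log-likelihood at the ML variance estimate ssr / n.
    llf = -0.5 * n * (np.log(2 * np.pi) + np.log(ssr / n) + 1)
    return {
        "r_squared": 1 - ssr / sst,
        "aic": -2 * llf + 2 * k,
        "mse_resid": ssr / (n - k),
    }

# Synthetic data: y depends on both x1 and x2.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
y = 1.0 + 2.0 * x1 + 0.5 * x2 + rng.normal(scale=0.5, size=200)

small = ols_metrics(x1.reshape(-1, 1), y)
full = ols_metrics(np.column_stack([x1, x2]), y)
print(small)
print(full)
```

Lower AIC and MSE and higher R-squared favour a model. R-squared never decreases when predictors are added, which is why AIC's penalty on the parameter count is a useful counterweight when comparing the feature combinations that follow.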

#### Table. Risk Index Data Summary

In [None]:
# seaborn >= 0.12 requires x and y to be passed as keyword arguments
sns.scatterplot(x="Poverty", y="Cases", data=all_data_5)

#### Plot 5. x = Poverty (log); y = Cases (log)

#### OLS Regression (Cases / Poverty (log) only)

In [None]:
import statsmodels.formula.api as smf
fit1 = smf.ols("Cases ~ Poverty", data=all_data_5).fit()

fit1.summary()

#### Note: AIC: 9959; R-Squared: 0.745

In [None]:
print(f"MSE: {fit1.mse_resid}")

#### Plot 6. Residual analysis: Cases ~ Poverty (log)

In [None]:
# seaborn >= 0.12 requires x and y to be passed as keyword arguments
sns.scatterplot(x=fit1.fittedvalues, y=fit1.resid)
plt.xlabel("Fitted Values")
plt.ylabel("Residuals")
plt.show()

#### OLS Regression (All Features (log) Approach)

In [None]:
fit_all = smf.ols("Cases ~ Poverty + Population + W_Male + W_Female + B_Male + B_Female + H_Male + H_Female + I_Male + I_Female + A_Male + A_Female + NH_Male + NH_Female + Risk_Index", data=all_data_5).fit()

fit_all.summary()

#### Note: AIC: 8754; R-Squared: 0.828

In [None]:
print(f"MSE: {fit_all.mse_resid}")

#### Plot 7. Residual analysis: All Features (log)

In [None]:
# seaborn >= 0.12 requires x and y to be passed as keyword arguments
sns.scatterplot(x=fit_all.fittedvalues, y=fit_all.resid)
plt.xlabel("Fitted Values")
plt.ylabel("Residuals")
plt.show()

#### OLS Regression (Cases / White, Black and Hispanic Populations of Males and Females (log) only)

In [None]:
fit_cases_v1 = smf.ols("Cases ~ Poverty + W_Male + W_Female + B_Male + B_Female + H_Male + H_Female", data=all_data_5).fit()

fit_cases_v1.summary()

#### Note: AIC: 8983; R-Squared: 0.814

In [None]:
print(f"MSE: {fit_cases_v1.mse_resid}")

#### OLS Regression (Cases / Poverty & Black Populations, Male & Female (log) only)

In [None]:
fit_cases_v2a = smf.ols("Cases ~ Poverty + B_Male + B_Female", data=all_data_5).fit()

fit_cases_v2a.summary()

#### Note: AIC: 9119; R-Squared: 0.805

In [None]:
print(f"MSE: {fit_cases_v2a.mse_resid}")

#### OLS Regression (Cases / Poverty & Hispanic Populations, Male & Female (log) only)

In [None]:
fit_cases_v2b = smf.ols("Cases ~ Poverty + H_Male + H_Female", data=all_data_5).fit()

fit_cases_v2b.summary()

In [None]:
print(f"MSE: {fit_cases_v2b.mse_resid}")

#### OLS Regression (Cases / Poverty, Female Populations (log) only)

In [None]:
fit_cases_v3 = smf.ols("Cases ~ Poverty + W_Female + B_Female + H_Female + I_Female + A_Female + NH_Female", data=all_data_5).fit()

fit_cases_v3.summary()

#### Note: AIC: 8890; R-Squared: 0.819

In [None]:
print(f"MSE: {fit_cases_v3.mse_resid}")

#### Plot 6. Linear Regression Plot: Risk Index / Cases (log)

In [None]:
sns.set(color_codes=True)


sns.lmplot(x="Risk_Index", y="Cases", data=all_data_5);

In [None]:
fit_cases_v6 = smf.ols("Cases ~ W_Female + B_Female + H_Female + I_Female + A_Female + NH_Female", data=all_data_5).fit()

fit_cases_v6.summary()

#### Note: AIC: 8982; R-Squared: 0.814

In [None]:
print(f"MSE: {fit_cases_v6.mse_resid}")

### Conclusion (Morbidity)

After evaluating feature combinations across various regression models, the statistical model that regressed Cases on Risk Index performed best in terms of AIC and R-squared. However, the Cases to Risk Index model revealed an inverse relationship, pointing to a potential Poisson distribution.

For example, while the data attempts to illustrate the distribution of Covid risk, the distribution of infection is not a constant. Epidemiological factors that determine the spread of infection are beyond the scope of this model.

Further observations include the significance of Covid 19 morbidity in relation to the female population (all races, particularly non-White) and populations living in poverty.

### Evaluating Mortality (Deaths)

A series of regressions was run with different feature combinations for evaluation and comparison purposes. Models were compared based on the statistical significance of features, plus AIC and R-squared.

#### OLS Regression (Death / Poverty (log) only)

In [None]:
fit_deaths_v1 = smf.ols("Deaths ~ Poverty", data=all_data_5).fit()

fit_deaths_v1.summary()

#### Note: AIC: 1.368e+04; R-squared: 0.566

#### OLS Regression (Death / Black Populations, Male & Female (log) only)

In [None]:
fit_deaths_v2a = smf.ols("Deaths ~ B_Male + B_Female", data=all_data_5).fit()

fit_deaths_v2a.summary()

#### Note: AIC: 1.338e+04; R-squared: 0.606

In [None]:
print(f"MSE: {fit_deaths_v2a.mse_resid}")

#### OLS Regression (Death / Hispanic Populations, Male & Female (log) only)

In [None]:
fit_deaths_v2b = smf.ols("Deaths ~ H_Male + H_Female", data=all_data_5).fit()

fit_deaths_v2b.summary()

#### Note: AIC: 1.436e+04; R-squared: 0.462

In [None]:
print(f"MSE: {fit_deaths_v2b.mse_resid}")

#### OLS Regression (Death / Cases & Poverty (log) only)

In [None]:
fit_deaths_v3 = smf.ols("Deaths ~ Cases + Poverty", data=all_data_5).fit()

fit_deaths_v3.summary()

#### Note: AIC: 1.256e+04; R-squared: 0.696

In [None]:
print(f"MSE: {fit_deaths_v3.mse_resid}")

#### OLS Regression (Death / Cases & Risk Index (log) only)

In [None]:
fit_deaths_v4 = smf.ols("Deaths ~ Cases + Risk_Index", data=all_data_5).fit()

fit_deaths_v4.summary()

#### Note: AIC: 1.261e+04; R-squared: 0.691

In [None]:
print(f"MSE: {fit_deaths_v4.mse_resid}")

### Model 1: Death / Cases, Poverty, Black Populations - Male & Female

### Coefficient Tree (Model 1)

Refit the OLS model with a constant (intercept) term added to the X-values, and report the MSE alongside the summary.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Data 
data = pd.read_csv('../input/covid-19-race-gender-poverty-risk-us-county/covid_data_log_200922.csv') 
df = pd.DataFrame(data)

# Fit Model & Output Regression Results Summary

# Import Package
import statsmodels.api as sm
from statsmodels.api import add_constant

# Build Model
X = data.loc[:,["Cases", "Poverty", "B_Male", "B_Female"]]
y = data.loc[:,["Deaths"]]

X = sm.add_constant(X)
model1 = sm.OLS(y,X)
results = model1.fit()

# MSE of the residuals
print(f"MSE: {results.mse_resid}")

# Output Results
results.summary()

#### Coefficient Plot 2. Death / Cases, Poverty, Black Population (Male & Female) (log)

In [None]:
coefplot(results)

In [None]:
import statsmodels.api as sm

fig = sm.graphics.plot_partregress_grid(results)

### Model 2: Death / Cases, Risk Index, Black Populations - Male & Female

In [None]:
# Fit Model & Output Regression Results Summary

# Import Package
import statsmodels.api as sm
from statsmodels.api import add_constant

# Build Model
X = data.loc[:,["Cases", "Risk_Index", "B_Male", "B_Female"]]
y = data.loc[:,["Deaths"]]

X = sm.add_constant(X)
model2 = sm.OLS(y,X)
results = model2.fit()

# MSE of the residuals
print(f"MSE: {results.mse_resid}")

# Output Results
results.summary()

#### Coefficient Plot 3. Death / Cases, Risk Index, Black Population (Male & Female) (log)

In [None]:
coefplot(results)

### Model 3: Deaths / Female Populations Only

In [None]:
# Fit Model & Output Regression Results Summary

# Import Package
import statsmodels.api as sm
from statsmodels.api import add_constant

# Build Model
X = data.loc[:,["W_Female", "B_Female", "H_Female", "I_Female", "A_Female", "NH_Female"]]
# X = data.loc[:,["W_Female", "B_Female", "H_Female"]]
y = data.loc[:,["Deaths"]]

X = sm.add_constant(X)
model3 = sm.OLS(y,X)
results = model3.fit()

# MSE of the residuals
print(f"MSE: {results.mse_resid}")

# Output Results
results.summary()

#### Coefficient Plot 4. Death / Female Populations (all races) (log)

In [None]:
coefplot(results)

### Conclusion (Mortality)

Poverty and Risk Index are virtually interchangeable in their statistical significance to Covid 19 Deaths when Covid 19 Morbidity (Cases) and Black Population (Male & Female) are held constant.

It was unsurprising that Model 3 (Female Populations, all races), which omits the Cases feature, produced a higher MSE score. Even so, the model illustrates inequalities in mortality according to race.

# Next Steps (Statistical Analysis)

1. Add a feature to identify counties by the racial and socioeconomic profile of their majority populations (i.e. counties with a majority Black population, a majority of persons living in poverty, etc.).
2. Add feature illustrating population by age groups.
3. Add transgender population stats once new Census data is available.
4. Add feature to illustrate capacity of health systems.
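
Next step 1 could be prototyped as a pair of boolean flags. The sketch below assumes raw (not log-transformed) counts and uses hypothetical column names and a simple 50% threshold; none of these are taken from the actual dataset.

```python
import pandas as pd

# Hypothetical raw counts per county (names and values are invented).
counties = pd.DataFrame({
    "County": ["A", "B", "C"],
    "Population": [1000, 2000, 500],
    "Black_Total": [600, 400, 100],
    "Poverty_Count": [300, 1200, 50],
})

# Flag a county when the group exceeds half of its total population.
counties["Majority_Black"] = (
    counties["Black_Total"] / counties["Population"] > 0.5
)
counties["Majority_Poverty"] = (
    counties["Poverty_Count"] / counties["Population"] > 0.5
)

print(counties[["County", "Majority_Black", "Majority_Poverty"]])
```

Flags like these would let the regressions be run per county profile, which is one way to probe for the Simpson's paradox effect mentioned in the method.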