### Prototype: Machine Learning Phase

# Correlations between demographic and socio-economic factors and incidence of Covid 19 infection and mortality in U.S. Counties

#### Objective
Gain a greater understanding of the relationship between race/ethnicity, gender, poverty and severe health conditions on the one hand, and Covid 19 morbidity and mortality on the other.
Apply skills recently acquired via a part-time Data Science course at General Assembly Australia.

#### Method
Previous project phases completed:
1. Source data on race/ethnicity, gender, poverty and severe health conditions and Covid 19 morbidity and mortality at the U.S. county level
2. Clean and pre-process data according to unique identifiers
3. Conduct exploratory data analysis
4. Test hypothesis that no relationship exists between features using statistical regression (Ordinary Least Squares).

In this phase:
5. Test hypothesis that features with highest importance are <b>unable</b> to predict Covid 19 morbidity and mortality using machine learning (Random Forest).
6. Compare accuracy of the Random Forest algorithm against an alternative algorithm.
7. Articulate conclusions and next steps.

### Data Sources
Data for this prototype was sourced from the following and then cleaned:
1. Covid 19 Morbidity by U.S. County (USA Facts/U.S. CDC, 2020): timeseries from 22/01/2020 to 31/07/2020
2. Covid 19 Mortality by U.S. County (USA Facts/U.S. CDC, 2020): timeseries from 22/01/2020 to 31/07/2020
3. Poverty Universe, All ages, by U.S. County (SAIPE, U.S. Census, 2019)
4. Annual County Resident Population Estimates by Age, Sex, Race, and Hispanic Origin (U.S. Census, 2019)
5. Severe COVID-19 Health Risk Index by U.S. County (Policy Map/NY Times/2017 SMART-BRFSS, U.S. CDC, 2017)

### Acknowledgments
- Thanks to my instructors Andrew Worsely, Lydia Peabody, the team at General Assembly and my peers in GA Data Science June-August 2020.
- Julian Hatwell

## Model 1: Predicting Morbidity

Apply statistical insights to create and test a machine learning model where y = morbidity (cases), using the Random Forest algorithm as a baseline.

### Iteration 1A: All Features Used

### Feature & Target Selection

Pre-processed data from Prototype Phase 1 imported where all population values and values for Risk Index are log-transformed.  

In [None]:
import pandas as pd

all_data_4 = pd.read_csv("../input/covid-19-race-gender-poverty-risk-us-county/covid_data_log_200922.csv")
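
The Phase 1 pre-processing is not shown in this notebook. As a rough illustration of what a log transform of count columns can look like, the sketch below applies `np.log1p` to a toy table. The column names and values are hypothetical, and whether Phase 1 used `log`, `log1p` or another base is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: values are illustrative, not taken from the project files
toy = pd.DataFrame({
    "County": ["A", "B", "C"],
    "B_Female": [0, 99, 9999],
    "Poverty": [9, 999, 99999],
})

# log1p computes log(1 + x), so counties with a zero count stay finite
numeric_cols = ["B_Female", "Poverty"]
toy_log = toy.copy()
toy_log[numeric_cols] = np.log1p(toy[numeric_cols])

print(toy_log)
```

Compressing the range this way keeps very large counties from dominating plots and linear models; tree-based models are much less sensitive to it.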

Input and output variable defined. Non-numerical features dropped. 

In [None]:
y = all_data_4["Cases"]

X = all_data_4.drop(["Deaths", "Cases", "FIPS", "stateFIPS"
                     , "countyFIPS_2d", "County", "State", "Risk_Cat"],  axis=1)

X

### Train / Test Split

In [None]:
# train-test split
from sklearn.model_selection import train_test_split

# allocate 70% at random to training data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

### Algorithm Selected

Random Forest Regressor was selected because tree-based models do not require feature scaling, so they are likely to handle non-normalised population data with an extremely large range, and because the algorithm provides feature importance scores.

Fit the Random Forest with out-of-bag scoring enabled (`oob_score=True`), inspect the variable importance results, then test and evaluate.

In [None]:
from sklearn.ensemble import RandomForestRegressor

reg = RandomForestRegressor(max_depth=2, random_state=10, oob_score=True, bootstrap=True)

### Model Fitting

In [None]:
reg.fit(X_train, y_train)
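
The model is created with `oob_score=True`, but the out-of-bag estimate is never read back. The sketch below shows one way to inspect it after fitting. It runs on synthetic stand-in data (an assumption, so the numbers say nothing about the Covid dataset) and uses the same model settings as the notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 300 rows, 4 features, noisy linear target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
y_demo = 3.0 * X_demo[:, 0] + X_demo[:, 1] + rng.normal(scale=0.1, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.30, random_state=42)

# Same settings as the notebook's model
demo_reg = RandomForestRegressor(max_depth=2, random_state=10, oob_score=True, bootstrap=True)
demo_reg.fit(X_tr, y_tr)

# oob_score_ is the R^2 computed on rows each tree did not see during bootstrapping
oob_r2 = demo_reg.oob_score_
test_r2 = demo_reg.score(X_te, y_te)
print("OOB R^2: ", round(oob_r2, 3))
print("Test R^2:", round(test_r2, 3))
```

If the out-of-bag R^2 and the held-out R^2 are far apart on the real data, that may be a sign that the single 70/30 split is unrepresentative.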

### Feature Importances

In [None]:
# Get numerical feature importances
importances = list(reg.feature_importances_)

# List of tuples with variable and importance
feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(X_train, importances)]

# Sort the feature importances by most important first
feature_importances = sorted(feature_importances, key = lambda x: x[1], reverse = True)

# Print out the feature and importances 
for pair in feature_importances:
    print('Variable: {:20} Importance: {}'.format(*pair))
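
As an alternative to the list of tuples, the same ranking can be held in a pandas Series, which is easier to sort, plot or join across iterations. This is a minimal sketch on synthetic data; the feature names `f_strong`, `f_weak` and `f_noise` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (assumption): one strong, one weak and one irrelevant feature
rng = np.random.default_rng(1)
X_demo = pd.DataFrame(rng.normal(size=(200, 3)), columns=["f_strong", "f_weak", "f_noise"])
y_demo = 5.0 * X_demo["f_strong"] + 0.5 * X_demo["f_weak"] + rng.normal(scale=0.1, size=200)

demo_reg = RandomForestRegressor(max_depth=2, random_state=10).fit(X_demo, y_demo)

# feature_importances_ holds impurity-based importances, normalised to sum to 1
importance_table = (
    pd.Series(demo_reg.feature_importances_, index=X_demo.columns)
    .sort_values(ascending=False)
    .round(2)
)
print(importance_table)
```

Because the importances are relative shares, a feature's score can drop simply because a stronger feature was added, not because it stopped being informative.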

### Model Evaluation

In [None]:
preds = reg.predict(X_test)

In [None]:
evaluate = pd.DataFrame({
    "actual" : y_test
    , "predicted" : preds
})

evaluate["error"] = evaluate["actual"] - evaluate["predicted"]

evaluate.head()

In [None]:
import numpy as np

# Calculate the absolute errors
errors = abs(preds - y_test)

# Print out the mean absolute error (MAE), in the same units as the target y
print('Mean Absolute Error:', round(np.mean(errors), 2))
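
The manual MAE calculation can be cross-checked against `sklearn.metrics`, which also provides the Mean Squared Error. The worked example below uses four made-up values so that the arithmetic can be followed by hand; none of the numbers come from the models in this notebook.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up actual and predicted values, for illustration only
y_true_demo = np.array([10.0, 20.0, 30.0, 40.0])
y_pred_demo = np.array([12.0, 18.0, 33.0, 36.0])

# Absolute errors are 2, 2, 3, 4 -> MAE = 11 / 4
mae = mean_absolute_error(y_true_demo, y_pred_demo)

# Squared errors are 4, 4, 9, 16 -> MSE = 33 / 4
mse = mean_squared_error(y_true_demo, y_pred_demo)

# RMSE puts the squared error back into the units of the target
rmse = np.sqrt(mse)

print("MAE: ", mae)
print("MSE: ", mse)
print("RMSE:", round(rmse, 3))
```

MSE penalises large misses more heavily than MAE, which matters when a few very large counties are badly predicted.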

### Iteration 1B: Most Important Features Only

### Feature & Target Selection

Second model iteration guided by feature importance data from 1st Iteration. Input and output variable defined. Non-numeric features dropped.

In [None]:
all_data_5 = all_data_4.copy()

y = all_data_5["Cases"]

X = all_data_5.drop(["Deaths", "Cases", "FIPS", "stateFIPS", "countyFIPS_2d", "County"
                     , "State", "Risk_Cat", "Risk_Index", "H_Male", "H_Female", "I_Male", "I_Female"
                    , "A_Male", "A_Female", "NH_Male", "NH_Female"],  axis=1)

X

### Train / Test Split

In [None]:
# train-test split
from sklearn.model_selection import train_test_split

# allocate 70% at random to training data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

Fit the Random Forest with out-of-bag scoring enabled (`oob_score=True`), inspect the variable importance results, then test and evaluate.

In [None]:
from sklearn.ensemble import RandomForestRegressor

reg = RandomForestRegressor(max_depth=2, random_state=10, oob_score=True, bootstrap=True)

In [None]:
reg.fit(X_train, y_train)

### Feature Importance

In [None]:
# Get numerical feature importances
importances = list(reg.feature_importances_)

# List of tuples with variable and importance
feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(X_train, importances)]

# Sort the feature importances by most important first
feature_importances = sorted(feature_importances, key = lambda x: x[1], reverse = True)

# Print out the feature and importances 
for pair in feature_importances:
    print('Variable: {:20} Importance: {}'.format(*pair))

### Prediction

In [None]:
preds = reg.predict(X_test)

### Evaluation

In [None]:
evaluate = pd.DataFrame({
    "actual" : y_test
    , "predicted" : preds
})

evaluate["error"] = evaluate["actual"] - evaluate["predicted"]

evaluate.head()

In [None]:
# Calculate the absolute errors
errors = abs(preds - y_test)

# Print out the mean absolute error (MAE), in the same units as the target y
print('Mean Absolute Error:', round(np.mean(errors), 2))

### Conclusion (Morbidity)

There is no difference in MAE between Model Iterations 1A and 1B. The high importance ranking of Black Females and Poverty supports advocacy for universal access to health care for Black women and people living in poverty as an essential intervention that may contribute to reducing Covid 19 morbidity.
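
The comparison between iterations repeats the same split, fit and score steps for each feature set. One way to make it repeatable is a small helper that takes a feature table and returns the test-set MAE. The sketch below is an assumption about how that could be organised, not the code that produced the result above; the helper name `fit_and_score`, the column names and the data are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split


def fit_and_score(X, y):
    """Fit a Random Forest with the notebook's settings and return the test-set MAE."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=42)
    model = RandomForestRegressor(max_depth=2, random_state=10, oob_score=True, bootstrap=True)
    model.fit(X_tr, y_tr)
    return mean_absolute_error(y_te, model.predict(X_te))


# Synthetic stand-in data (assumption): only column "a" drives the target
rng = np.random.default_rng(2)
demo = pd.DataFrame(rng.normal(size=(300, 4)), columns=["a", "b", "c", "d"])
target = 4.0 * demo["a"] + rng.normal(scale=0.1, size=300)

# Score two feature sets side by side, mirroring the 1A versus 1B comparison
results = pd.Series({
    "all features": fit_and_score(demo, target),
    "top feature only": fit_and_score(demo[["a"]], target),
})
print(results.round(3))
```

Keeping the split and model settings inside one function ensures both feature sets are scored on exactly the same rows.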

## Model 2: Predicting Mortality

Apply statistical insights to create and test a machine learning model where y = mortality (deaths) using Random Forest algorithm as a baseline.

### Iteration 2A: All Features

### Feature & Target Selection

Pre-processed data from Prototype Phase 1 imported where all population values and values for Risk Index are log-transformed.  

In [None]:
all_data_6 = all_data_4.copy()

In [None]:
y = all_data_6["Deaths"]

X = all_data_6.drop(["Deaths", "FIPS", "stateFIPS"
                     , "countyFIPS_2d", "County", "State", "Risk_Cat"],  axis=1)

X

In [None]:
# train-test split
from sklearn.model_selection import train_test_split

# allocate 70% at random to training data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

Fit the Random Forest with out-of-bag scoring enabled (`oob_score=True`), inspect the variable importance results, then test and evaluate.

In [None]:
from sklearn.ensemble import RandomForestRegressor

reg = RandomForestRegressor(max_depth=2, random_state=10, oob_score=True, bootstrap=True)

### Fit Training Data

In [None]:
reg.fit(X_train, y_train)

### Feature Importance

In [None]:
# Get numerical feature importances
importances = list(reg.feature_importances_)

# List of tuples with variable and importance
feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(X_train, importances)]

# Sort the feature importances by most important first
feature_importances = sorted(feature_importances, key = lambda x: x[1], reverse = True)

# Print out the feature and importances 
for pair in feature_importances:
    print('Variable: {:20} Importance: {}'.format(*pair))

### Prediction

In [None]:
preds = reg.predict(X_test)

### Evaluation

In [None]:
evaluate = pd.DataFrame({
    "actual" : y_test
    , "predicted" : preds
})

evaluate["error"] = evaluate["actual"] - evaluate["predicted"]

evaluate.head()

In [None]:
# Calculate the absolute errors
errors = abs(preds - y_test)

# Print out the mean absolute error (MAE), in the same units as the target y
print('Mean Absolute Error:', round(np.mean(errors), 2))

### Iteration 2B: Important Features (with Cases)

### Feature & Target Selection

In [None]:
all_data_7 = all_data_4.copy()

In [None]:
y = all_data_7["Deaths"]

X = all_data_7.drop(["Deaths", "FIPS", "stateFIPS", "countyFIPS_2d", "County"
                     , "State", "Risk_Cat", "I_Male", "I_Female"
                    , "A_Male", "A_Female", "NH_Male", "NH_Female"],  axis=1)

X

In [None]:
# train-test split
from sklearn.model_selection import train_test_split

# allocate 70% at random to training data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

Fit the Random Forest with out-of-bag scoring enabled (`oob_score=True`), inspect the variable importance results, then test and evaluate.

In [None]:
from sklearn.ensemble import RandomForestRegressor

reg = RandomForestRegressor(max_depth=2, random_state=10, oob_score=True, bootstrap=True)

### Fitting Training Data

In [None]:
reg.fit(X_train, y_train)

### Feature Importances

In [None]:
# Get numerical feature importances
importances = list(reg.feature_importances_)

# List of tuples with variable and importance
feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(X_train, importances)]

# Sort the feature importances by most important first
feature_importances = sorted(feature_importances, key = lambda x: x[1], reverse = True)

# Print out the feature and importances 
for pair in feature_importances:
    print('Variable: {:20} Importance: {}'.format(*pair))

### Prediction

In [None]:
preds = reg.predict(X_test)

### Evaluation

In [None]:
evaluate = pd.DataFrame({
    "actual" : y_test
    , "predicted" : preds
})

evaluate["error"] = evaluate["actual"] - evaluate["predicted"]

evaluate.head()

In [None]:
# Calculate the absolute errors
errors = abs(preds - y_test)

# Print out the mean absolute error (MAE), in the same units as the target y
print('Mean Absolute Error:', round(np.mean(errors), 2))

### Iteration 2C: Female Population Features (without Cases)

### Feature & Target Selection

In [None]:
all_data_8 = all_data_4.copy()

In [None]:
y = all_data_8["Deaths"]

X = all_data_8.drop(["Cases", "Deaths", "FIPS", "stateFIPS", "countyFIPS_2d", "County"
                     , "State", "Risk_Cat", "W_Male", "B_Male", "H_Male"
                     , "I_Male", "A_Male", "NH_Male"],  axis=1)

X

In [None]:
# train-test split
from sklearn.model_selection import train_test_split

# allocate 70% at random to training data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

In [None]:
from sklearn.ensemble import RandomForestRegressor

reg = RandomForestRegressor(max_depth=2, random_state=10, oob_score=True, bootstrap=True)

### Fitting Training Data

In [None]:
reg.fit(X_train, y_train)

### Feature Importances

In [None]:
# Get numerical feature importances
importances = list(reg.feature_importances_)

# List of tuples with variable and importance
feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(X_train, importances)]

# Sort the feature importances by most important first
feature_importances = sorted(feature_importances, key = lambda x: x[1], reverse = True)

# Print out the feature and importances 
for pair in feature_importances:
    print('Variable: {:20} Importance: {}'.format(*pair))

### Prediction

In [None]:
preds = reg.predict(X_test)

### Evaluation

In [None]:
evaluate = pd.DataFrame({
    "actual" : y_test
    , "predicted" : preds
})

evaluate["error"] = evaluate["actual"] - evaluate["predicted"]

evaluate.head()

In [None]:
# Calculate the absolute errors
errors = abs(preds - y_test)

# Print out the mean absolute error (MAE), in the same units as the target y
print('Mean Absolute Error:', round(np.mean(errors), 2))

### Conclusion (Mortality)
The presence of Covid Case data diminished the importance of race, gender and poverty in this model. However, when Cases are excluded from the features, Black Females ranked markedly higher in importance for predicting mortality than the other features. Risk Index did not play an important role in the model, possibly due to the potential Poisson distribution that was observed in Phase 2 of this project.
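
The way a dominant predictor crowds out other features can be reproduced on synthetic data. In the sketch below, `cases` is constructed to track the target closely while `population` is only an upstream driver. Both variables and their relationship are assumptions made for illustration and do not reproduce the project's results.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic chain (assumption): population -> cases -> deaths, with noise at each step
rng = np.random.default_rng(4)
n = 400
population = rng.normal(size=n)
cases = population + rng.normal(scale=1.0, size=n)
deaths = cases + rng.normal(scale=0.2, size=n)

with_cases = pd.DataFrame({"population": population, "cases": cases})
without_cases = with_cases[["population"]]


def importances(X, y):
    """Return impurity-based importances from a shallow Random Forest."""
    model = RandomForestRegressor(max_depth=2, random_state=10).fit(X, y)
    return pd.Series(model.feature_importances_, index=X.columns)


imp_with = importances(with_cases, deaths)
imp_without = importances(without_cases, deaths)

print("With cases:\n", imp_with.round(2))
print("Without cases:\n", imp_without.round(2))
```

When `cases` is available the trees split on it almost exclusively, so the upstream feature appears unimportant even though it still carries signal once `cases` is removed.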

# Next Steps (Machine Learning)
1. Produce Mean Squared Error (MSE) metrics for all models and create a comparison table.
2. Compare models using alternative algorithms and illustrate findings using MSE metrics.
3. Test the model with fresh morbidity and mortality data from the period (1 August to 30 September).
4. Present prototype and seek out peer reviewers and collaborators 
5. Replicate for other geographies.
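
Steps 1 and 2 above could be sketched as a loop over candidate models that collects MAE and MSE into one table. The alternative algorithm has not been chosen in this notebook, so `LinearRegression` is used here purely as a placeholder assumption, and the data is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): noisy linear target with 3 features
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(300, 3))
y_demo = 2.0 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(scale=0.1, size=300)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.30, random_state=42)

# Candidate models; the second entry is a placeholder for whichever alternative is chosen
models = {
    "Random Forest": RandomForestRegressor(max_depth=2, random_state=10),
    "Linear Regression": LinearRegression(),
}

# Fit each model on the same split and record both metrics
rows = []
for name, model in models.items():
    demo_preds = model.fit(X_tr, y_tr).predict(X_te)
    rows.append({
        "model": name,
        "MAE": mean_absolute_error(y_te, demo_preds),
        "MSE": mean_squared_error(y_te, demo_preds),
    })

comparison = pd.DataFrame(rows).set_index("model")
print(comparison.round(3))
```

Linear regression scores better here only because the synthetic target is linear by construction; the ranking on the county data may well differ.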