# Overview

COVID-19 has been spreading widely for almost nine months. The United States is one of the countries hit hardest by the virus, and its number of confirmed cases increases dramatically each day. However, confirmed cases are not distributed evenly across the country: some states have far fewer confirmed cases than others.

We want to find out which factors contribute most to a high number of confirmed cases. In the following analysis, we look for the features most strongly correlated with confirmed cases and build a model to predict confirmed cases within the United States. We then use the model to estimate how many confirmed cases could be avoided by reducing the chance of spreading the virus.

In [None]:
import os
import datetime
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split, learning_curve
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Data Sourcing

We will walk through a few datasets and collect metrics that might be related to the number of confirmed cases.

In [None]:
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


## Dataset 1 - covid19testing

From the covid19testing dataset, we collect the daily confirmed cases for each state in the United States.

In [None]:
df_covid19 = pd.read_csv("/kaggle/input/covid19testing/tested_worldwide.csv")
df_covid19 = df_covid19[(df_covid19['Country_Region'] == "United States") & (df_covid19.Province_State != "All States")].fillna(0).reset_index(drop=True)
df_covid19.drop(['Country_Region', 'active', 'recovered', 'death', 'hospitalizedCurr', 'hospitalized', 'daily_positive', 'total_tested'], axis=1, inplace=True)
df_covid19.head()

## Dataset 2 - covid19-state-data

From the covid19-state-data dataset, we collect additional state-level metrics covering different areas (economy, health, population, age, ...).

In [None]:
df_state = pd.read_csv("/kaggle/input/covid19-state-data/COVID19_state.csv")
df_state.drop(['Infected', 'Deaths', 'Tested', 'ICU Beds', 'Population', 'Income', 'School Closure Date'], axis=1, inplace=True)
df_state.head()

## Dataset 3 - covid19-mobility-data

From the covid19-mobility-data dataset (Apple mobility trends), we collect daily traffic trends.

In [None]:
df_traffic = pd.read_csv("https://covid19-static.cdn-apple.com/covid19-mobility-data/2014HotfixDev15/v3/en-us/applemobilitytrends-2020-08-14.csv")
df_traffic = df_traffic.loc[df_traffic['country'] == "United States"].reset_index(drop=True)
df_traffic.drop(['geo_type', 'region', 'country', 'alternative_name'], axis=1, inplace=True)
df_traffic.head()

## Dataset 4 - covid19-mobility-report

From the covid19-mobility-report dataset (Google Community Mobility Reports), we collect daily trends in community visits to different categories of places.

In [None]:
df_visit = pd.read_csv("https://www.gstatic.com/covid19/mobility/Global_Mobility_Report.csv", low_memory=False)
df_visit = df_visit.loc[df_visit['country_region_code'] == "US"].fillna(0).reset_index(drop=True)
df_visit = df_visit[df_visit.sub_region_1 != 0].drop(['country_region', 'sub_region_2', 'metro_area', 'iso_3166_2_code', 'census_fips_code'], axis=1)
df_visit.head()

# Data Cleansing

In this step, we merge the datasets collected above into a single dataframe. We also remove corrupted or invalid values and encode the remaining values so that the model can process them.

## Merge dataset into single dataframe

Let's start by converting each state name to its two-letter abbreviation, which is used as the key to merge the datasets.

In [None]:
us_state_abbrev = {
    'Alabama': 'AL',
    'Alaska': 'AK',
    'American Samoa': 'AS',
    'Arizona': 'AZ',
    'Arkansas': 'AR',
    'California': 'CA',
    'Colorado': 'CO',
    'Connecticut': 'CT',
    'Delaware': 'DE',
    'District of Columbia': 'DC',
    'Florida': 'FL',
    'Georgia': 'GA',
    'Guam': 'GU',
    'Hawaii': 'HI',
    'Idaho': 'ID',
    'Illinois': 'IL',
    'Indiana': 'IN',
    'Iowa': 'IA',
    'Kansas': 'KS',
    'Kentucky': 'KY',
    'Louisiana': 'LA',
    'Maine': 'ME',
    'Maryland': 'MD',
    'Massachusetts': 'MA',
    'Michigan': 'MI',
    'Minnesota': 'MN',
    'Mississippi': 'MS',
    'Missouri': 'MO',
    'Montana': 'MT',
    'Nebraska': 'NE',
    'Nevada': 'NV',
    'New Hampshire': 'NH',
    'New Jersey': 'NJ',
    'New Mexico': 'NM',
    'New York': 'NY',
    'North Carolina': 'NC',
    'North Dakota': 'ND',
    'Northern Mariana Islands':'MP',
    'Ohio': 'OH',
    'Oklahoma': 'OK',
    'Oregon': 'OR',
    'Pennsylvania': 'PA',
    'Puerto Rico': 'PR',
    'Rhode Island': 'RI',
    'South Carolina': 'SC',
    'South Dakota': 'SD',
    'Tennessee': 'TN',
    'Texas': 'TX',
    'Utah': 'UT',
    'Vermont': 'VT',
    'Virgin Islands': 'VI',
    'Virginia': 'VA',
    'Washington': 'WA',
    'West Virginia': 'WV',
    'Wisconsin': 'WI',
    'Wyoming': 'WY'
}
df_covid19 = df_covid19.rename(columns={'Province_State': 'State'})
# Map full state names to abbreviations; assign the result back rather than
# calling replace(..., inplace=True) on a column, which may act on a copy
df_covid19['State'] = df_covid19['State'].replace(us_state_abbrev)
df_state['State'] = df_state['State'].replace(us_state_abbrev)

In [None]:
df_covid19 = pd.merge(df_covid19, df_state, on='State')
df_covid19.head()
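
An inner merge silently drops any row whose `State` key has no match in the other dataframe. A minimal sketch of how that could be checked, using invented toy frames rather than the real datasets (the names `cases`, `states` and `check` are illustrative only):

```python
import pandas as pd

# Toy stand-ins for df_covid19 and df_state (all values are invented)
cases = pd.DataFrame({"State": ["NY", "CA", "GU"], "positive": [100, 80, 5]})
states = pd.DataFrame({"State": ["NY", "CA", "TX"], "GDP": [1.7, 3.1, 1.9]})

# An outer merge with indicator=True labels each row by where its key was found
check = pd.merge(cases, states, on="State", how="outer", indicator=True)
unmatched = check.loc[check["_merge"] != "both", ["State", "_merge"]]
print(unmatched)

# The default inner merge keeps only the keys present in both frames
merged = pd.merge(cases, states, on="State")
print(len(merged))
```

If `unmatched` is not empty, the listed states would disappear from the merged dataframe, which may or may not be intended.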

Here we convert the date to YYYY-MM-DD format, which is used as the key to merge the rest of the datasets.

In [None]:
df_covid19['Date'] = pd.to_datetime(df_covid19.Date).dt.strftime("%Y-%m-%d")
df_covid19['Date'].head()

In [None]:
df_traffic_agg = pd.DataFrame()
for state in df_traffic['sub-region'].unique():
    df_traffic_state = df_traffic.loc[df_traffic['sub-region'] == state].groupby(df_traffic['transportation_type'], sort=False).aggregate({date: 'sum' for date in df_traffic.columns[4:]}).T
    df_traffic_state['sub-region'] = state
    df_traffic_state = df_traffic_state.reset_index()
    df_traffic_agg = df_traffic_state if df_traffic_agg.empty else pd.concat([df_traffic_agg, df_traffic_state], ignore_index=True)
df_traffic_agg = df_traffic_agg.rename(columns={'sub-region': 'State', 'index': 'Date'})
df_traffic_agg['State'] = df_traffic_agg['State'].replace(us_state_abbrev)
df_traffic_agg['Date'] = pd.to_datetime(df_traffic_agg.Date).dt.strftime("%Y-%m-%d")
# Filter the rows with common date in df_traffic_agg and df_covid19
df_traffic_agg = df_traffic_agg.loc[df_traffic_agg['Date'].isin(df_covid19['Date'].unique())].reset_index(drop=True)
df_traffic_agg.head()
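
The per-state loop above turns a wide table (one column per date) into one row per state and date. As a rough alternative sketch, the same reshaping can be written with `melt` and `pivot_table`; the toy frame below only imitates the layout of `df_traffic`, and all of its values are invented:

```python
import pandas as pd

# Toy frame in the same wide layout as df_traffic (values are invented)
wide = pd.DataFrame({
    "transportation_type": ["driving", "walking", "driving", "driving"],
    "sub-region": ["New York", "New York", "New York", "Texas"],
    "2020-03-01": [100.0, 90.0, 110.0, 120.0],
    "2020-03-02": [95.0, 85.0, 105.0, 115.0],
})

# Wide -> long: one row per (transportation type, state, date)
long = wide.melt(id_vars=["transportation_type", "sub-region"],
                 var_name="Date", value_name="index_value")

# Sum within each state and date, with one column per transportation type
agg = (long.pivot_table(index=["sub-region", "Date"],
                        columns="transportation_type",
                        values="index_value", aggfunc="sum")
           .reset_index()
           .rename(columns={"sub-region": "State"}))
agg.columns.name = None
print(agg)
```

States with no rows for a given transportation type end up with NaN in that column, just as they do after the `pd.concat` in the loop.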

In [None]:
df_visit_agg = df_visit.groupby(['sub_region_1', 'date'], sort=False).aggregate({date: 'sum' for date in df_visit.columns[3:]}).reset_index()
df_visit_agg = df_visit_agg.rename(columns={'sub_region_1': 'State', 'date': 'Date'})
df_visit_agg['State'] = df_visit_agg['State'].replace(us_state_abbrev)
df_visit_agg['Date'] = pd.to_datetime(df_visit_agg.Date).dt.strftime("%Y-%m-%d")
# Filter the rows with common date in df_visit_agg and df_covid19
df_visit_agg = df_visit_agg.loc[df_visit_agg['Date'].isin(df_covid19['Date'].unique())].reset_index(drop=True)
df_visit_agg.head()

In [None]:
df_covid19 = pd.merge(df_covid19, df_traffic_agg, on=['Date', 'State'])
df_covid19 = pd.merge(df_covid19, df_visit_agg, on=['Date', 'State'])

# Reorder the columns
df_covid19_cols = df_covid19.columns.tolist()
df_covid19_cols = df_covid19_cols[0:2] + df_covid19_cols[4:] + [df_covid19_cols[3], df_covid19_cols[2]]
df_covid19 = df_covid19[df_covid19_cols]
df_covid19.head()

## Replace invalid value

Missing values appear as NaN. We replace them with zero.

In [None]:
df_covid19.fillna(0, inplace=True)

Negative values are not valid for daily test counts. We replace them with zero.

In [None]:
df_covid19.loc[df_covid19['daily_tested'] < 0, 'daily_tested'] = 0
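
Overwriting values hides how many rows were affected. A small sketch, on an invented toy column, of counting the problems first and then applying the same two replacements in one chained expression:

```python
import numpy as np
import pandas as pd

# Toy column with one missing value and one negative correction (invented numbers)
toy = pd.DataFrame({"daily_tested": [120.0, np.nan, -15.0, 300.0]})

# Count the problems before overwriting them, so the cleaning step is auditable
n_missing = int(toy["daily_tested"].isna().sum())
n_negative = int((toy["daily_tested"] < 0).sum())

# fillna(0) followed by clip(lower=0) has the same effect as the two cells above
toy["daily_tested"] = toy["daily_tested"].fillna(0).clip(lower=0)
print(n_missing, n_negative, toy["daily_tested"].tolist())
```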

## Transform non-numerical data

Because the model cannot process string values, we encode each string class as a discrete numerical value.

In [None]:
le = LabelEncoder()
le.fit(df_covid19['State'])
df_covid19['State'] = le.transform(df_covid19['State'])
df_covid19['State'].unique()
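
To make the encoding concrete, here is a small sketch on an invented list of states. It also shows one-hot encoding with `pd.get_dummies` as a possible alternative; that is not what this notebook does, but it avoids implying an order between states, which can matter for a linear model.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

toy_states = pd.Series(["NY", "CA", "TX", "CA"], name="State")

# LabelEncoder assigns integer codes in sorted order of the labels
toy_le = LabelEncoder()
codes = toy_le.fit_transform(toy_states)       # CA -> 0, NY -> 1, TX -> 2
decoded = toy_le.inverse_transform(codes)      # back to the original strings

# One-hot encoding: one indicator column per state, with no implied ordering
one_hot = pd.get_dummies(toy_states, prefix="State")
print(codes, list(decoded), list(one_hot.columns))
```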

The date can be expressed as the number of days since the earliest date in the dataset.

In [None]:
df_covid19['Date'] = pd.to_datetime(df_covid19['Date'])
df_covid19['Date'] = df_covid19['Date'].sub(df_covid19['Date'].min()).dt.days
df_covid19['Date'].unique()

## Normalize and scaling the data

It is good practice to normalize the data so that the model is not biased toward metrics with larger numeric ranges.

In [None]:
scaler = MinMaxScaler()
scaled_cols = ['Pop Density', 'GDP',
               'retail_and_recreation_percent_change_from_baseline', 'grocery_and_pharmacy_percent_change_from_baseline',
               'parks_percent_change_from_baseline', 'transit_stations_percent_change_from_baseline',
               'workplaces_percent_change_from_baseline', 'residential_percent_change_from_baseline']
# MinMaxScaler scales each column independently, so one call covers all of them
df_covid19[scaled_cols] = scaler.fit_transform(df_covid19[scaled_cols])
df_covid19[scaled_cols].head()
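
For reference, min-max scaling maps each column to the range [0, 1] using (x - min) / (max - min). A short sketch on invented numbers that checks `MinMaxScaler` against that formula:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy columns on very different scales (invented numbers)
toy = pd.DataFrame({"GDP": [40000.0, 60000.0, 80000.0],
                    "Pop Density": [10.0, 110.0, 1010.0]})

# MinMaxScaler rescales every column independently to [0, 1]
scaled = MinMaxScaler().fit_transform(toy)

# Same result by hand: (x - min) / (max - min), column by column
manual = (toy - toy.min()) / (toy.max() - toy.min())
print(np.allclose(scaled, manual.to_numpy()))
```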

# Feature Selection

When we have gathered a lot of data, we want to avoid using irrelevant features, which can have an adverse impact on model training. Selecting appropriate features not only helps to increase the accuracy of the model, but also reduces overfitting and training time.

We are going to try different approaches and select features based on the following criteria.
1. Select all features
2. Correlation
3. Statistical test scores
4. Feature importance

## Approach 1 - Select all features

We simply select all features from the dataframe for training.

In [None]:
X_all = df_covid19.iloc[:, :-1]
y_all = df_covid19.iloc[:, -1]
X_all.head()

## Approach 2 - Correlation

We can visualize the correlation of the data with a heat map and select the 10 features most correlated with the target for training.

In [None]:
df_covid19_corr = df_covid19.corr()
df_covid19_corr_selected = df_covid19_corr['positive'].sort_values(ascending=False).iloc[1:].head(10)  # skip 'positive' itself, keep the top 10
display(df_covid19_corr_selected)

plt.figure(figsize=(10, 10))
display(sns.heatmap(df_covid19_corr, cmap="RdYlGn"))

In [None]:
X_corr = df_covid19[df_covid19_corr_selected.index]
X_corr.head()
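
Sorting signed correlations in descending order favors positively correlated features, yet a strongly negative correlation can be just as informative. A minimal sketch, on synthetic data with invented column names, of ranking by absolute correlation instead:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({"a": rng.normal(size=n), "b": rng.normal(size=n),
                    "noise": rng.normal(size=n)})
# Invented target: rises with "a", falls with "b", ignores "noise"
toy["positive"] = 3 * toy["a"] - 2 * toy["b"] + rng.normal(scale=0.1, size=n)

corr_with_target = toy.corr()["positive"].drop("positive")
# Ranking by absolute value keeps strongly negative features such as "b"
top2 = corr_with_target.abs().nlargest(2).index.tolist()
print(corr_with_target.round(2))
print(top2)
```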

## Approach 3 - Statistical test scores

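We can also select features with a statistical test. Here we use the chi-squared test to select the best 10 features. Note that `chi2` expects non-negative feature values and is designed for classification targets, so its scores on a continuous target such as the case count should be read with caution.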

In [None]:
selector = SelectKBest(chi2, k=10)
selector.fit(X_all, y_all)
selector_score = pd.Series(selector.scores_, index=X_all.columns).nlargest(10)
display(selector_score)

display(selector_score.plot(kind='barh'))

In [None]:
X_stat = df_covid19[selector_score.index]
X_stat.head()
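
For a continuous target, scikit-learn also provides `f_regression`, which scores each feature with a univariate F-test. The sketch below is an alternative to the chi-squared selection above, not the method this notebook uses, and it runs on synthetic data with invented feature names:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
X_toy = pd.DataFrame(rng.normal(size=(300, 5)),
                     columns=["f0", "f1", "f2", "f3", "f4"])
# Invented continuous target that depends only on f1 and f3
y_toy = 4 * X_toy["f1"] - 3 * X_toy["f3"] + rng.normal(scale=0.5, size=300)

# Keep the two features with the highest F-scores against y
toy_selector = SelectKBest(f_regression, k=2).fit(X_toy, y_toy)
chosen = X_toy.columns[toy_selector.get_support()].tolist()
print(chosen)
```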

## Approach 4 - Feature importance

We can also choose features by their relevance to the output. Here we use an Extra Trees classifier to select the best 10 features.

In [None]:
clf = ExtraTreesClassifier()
# Limit the number of samples to avoid OOM
clf.fit(X_all.iloc[-100:], y_all.iloc[-100:])
clf_score = pd.Series(clf.feature_importances_, index=X_all.columns).nlargest(10)
display(clf_score)

display(clf_score.plot(kind='barh'))

In [None]:
X_importance = df_covid19[clf_score.index]
X_importance.head()
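
A classifier treats every distinct case count as a separate class. Since the target here is a number, a regressor such as `ExtraTreesRegressor` may be a more natural fit. The following is only a sketch on synthetic data (feature names and coefficients are invented), not a replacement that has been evaluated on this dataset:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesRegressor

rng = np.random.default_rng(0)
X_toy = pd.DataFrame(rng.uniform(size=(300, 4)), columns=["f0", "f1", "f2", "f3"])
# Invented target in which only f2 matters
y_toy = 10 * X_toy["f2"] + rng.normal(scale=0.1, size=300)

# Fit a small forest and read the impurity-based importances (they sum to 1)
reg_toy = ExtraTreesRegressor(n_estimators=50, random_state=0).fit(X_toy, y_toy)
importances = pd.Series(reg_toy.feature_importances_, index=X_toy.columns)
print(importances.sort_values(ascending=False))
```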

# Model Learning

The data is now prepared, and it's time to move on to model learning. Here we use a linear regression model.

The dataset is split into training and test sets in an 8:2 ratio. The model is trained on the training set and its accuracy is validated on the test set.

In [None]:
X_all_train, X_all_test, y_train, y_test = train_test_split(X_all, y_all, train_size=0.8, random_state=0)
X_corr_train, X_corr_test = train_test_split(X_corr, train_size=0.8, random_state=0)
X_stat_train, X_stat_test = train_test_split(X_stat, train_size=0.8, random_state=0)
X_importance_train, X_importance_test = train_test_split(X_importance, train_size=0.8, random_state=0)

display(X_all_train.index)
display(X_all_test.index)
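
The four separate calls above stay aligned only because they share the same `random_state` and the same number of rows. As a sketch of a more explicit option, `train_test_split` accepts several arrays at once and splits them all with the same row indices (toy arrays below are invented):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_a = np.arange(20).reshape(10, 2)      # toy feature set A, row i = [2i, 2i + 1]
X_b = np.arange(10).reshape(10, 1)      # toy feature set B, row i = [i]
y = np.arange(10)                       # toy target, y[i] = i

# One call splits every array with the same shuffled row indices
X_a_tr, X_a_te, X_b_tr, X_b_te, y_tr, y_te = train_test_split(
    X_a, X_b, y, train_size=0.8, random_state=0)

# Row i of X_b and y[i] both equal i here, so alignment is easy to verify
print(np.array_equal(X_b_tr.ravel(), y_tr))
```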

## Linear regression - Training

Train the linear regression models and assess their goodness of fit with the R2 score.

In [None]:
# The `normalize` argument was removed from LinearRegression in scikit-learn 1.2;
# for ordinary least squares it should not change the predictions, so it is dropped here
reg_all = LinearRegression().fit(X_all_train, y_train)
reg_corr = LinearRegression().fit(X_corr_train, y_train)
reg_stat = LinearRegression().fit(X_stat_train, y_train)
reg_importance = LinearRegression().fit(X_importance_train, y_train)

print("Score (Select all features)    : {:.4f}".format(reg_all.score(X_all_train, y_train)))
print("Score (Correlation)            : {:.4f}".format(reg_corr.score(X_corr_train, y_train)))
print("Score (Statistical test scores): {:.4f}".format(reg_stat.score(X_stat_train, y_train)))
print("Score (Feature importance)     : {:.4f}".format(reg_importance.score(X_importance_train, y_train)))
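
The value returned by `score` is the coefficient of determination, 1 - SS_res / SS_tot. A brief sketch on synthetic data (the slope and noise level are invented) that computes it by hand and compares it with `score`:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_toy = rng.uniform(size=(100, 1))
y_toy = 2.0 * X_toy[:, 0] + rng.normal(scale=0.1, size=100)

model = LinearRegression().fit(X_toy, y_toy)
pred = model.predict(X_toy)

# R^2 = 1 - SS_res / SS_tot: the share of variance in y explained by the model
ss_res = np.sum((y_toy - pred) ** 2)
ss_tot = np.sum((y_toy - y_toy.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot
print(r2_manual, model.score(X_toy, y_toy))
```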

## Linear regression - Evaluation

Evaluate the performance of the models by plotting learning curves.

In [None]:
def plot_learning_curve(estimator, title, X, y, axes=None, cv=None,
                        n_jobs=None, train_sizes=np.linspace(.1, 1.0, 5)):
    axes.set_title(title)
    axes.set_xlabel("Training examples")
    axes.set_ylabel("Score")

    train_sizes, train_scores, test_scores = \
        learning_curve(estimator, X, y, cv=cv, n_jobs=n_jobs,
                       train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)

    # Plot learning curve
    axes.grid()
    axes.fill_between(train_sizes, train_scores_mean - train_scores_std,
                         train_scores_mean + train_scores_std, alpha=0.1,
                         color="r")
    axes.fill_between(train_sizes, test_scores_mean - test_scores_std,
                         test_scores_mean + test_scores_std, alpha=0.1,
                         color="g")
    axes.plot(train_sizes, train_scores_mean, 'o-', color="r",
                 label="Training score")
    axes.plot(train_sizes, test_scores_mean, 'o-', color="g",
                 label="Cross-validation score")
    axes.legend(loc="best")

    return plt

_, axes = plt.subplots(1, 4, figsize=(25, 5))
plot_learning_curve(reg_all, "Learning curves (Select all features)", X_all_train, y_train, axes[0])
plot_learning_curve(reg_corr, "Learning curves (Correlation)", X_corr_train, y_train, axes[1])
plot_learning_curve(reg_stat, "Learning curves (Statistical test scores)", X_stat_train, y_train, axes[2])
plot_learning_curve(reg_importance, "Learning curves (Feature importance)", X_importance_train, y_train, axes[3])

Evaluate the accuracy of the prediction models on the training and test sets using mean absolute error.

In [None]:
y_all_pred = reg_all.predict(X_all_train)
y_corr_pred = reg_corr.predict(X_corr_train)
y_stat_pred = reg_stat.predict(X_stat_train)
y_importance_pred = reg_importance.predict(X_importance_train)

print("Accuracy of prediction model on training set")
print("Mean absolute error (Select all features)     : {:.4f}".format(mean_absolute_error(y_train, y_all_pred)))
print("Mean absolute error (Correlation)             : {:.4f}".format(mean_absolute_error(y_train, y_corr_pred)))
print("Mean absolute error (Statistical test scores) : {:.4f}".format(mean_absolute_error(y_train, y_stat_pred)))
print("Mean absolute error (Feature importance)      : {:.4f}".format(mean_absolute_error(y_train, y_importance_pred)))

y_all_pred = reg_all.predict(X_all_test)
y_corr_pred = reg_corr.predict(X_corr_test)
y_stat_pred = reg_stat.predict(X_stat_test)
y_importance_pred = reg_importance.predict(X_importance_test)

print("")
print("Accuracy of prediction model on test set")
print("Mean absolute error (Select all features)     : {:.4f}".format(mean_absolute_error(y_test, y_all_pred)))
print("Mean absolute error (Correlation)             : {:.4f}".format(mean_absolute_error(y_test, y_corr_pred)))
print("Mean absolute error (Statistical test scores) : {:.4f}".format(mean_absolute_error(y_test, y_stat_pred)))
print("Mean absolute error (Feature importance)      : {:.4f}".format(mean_absolute_error(y_test, y_importance_pred)))
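
Mean absolute error is easier to judge next to a trivial baseline. The sketch below, on invented numbers, computes MAE by hand and compares it with a `DummyRegressor` that always predicts the mean of the target:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error

y_true = np.array([100.0, 200.0, 300.0, 400.0])   # invented case counts
y_pred = np.array([110.0, 190.0, 330.0, 360.0])   # invented model predictions

# MAE is the mean of |y_true - y_pred|, in the same units as the target
mae_manual = np.mean(np.abs(y_true - y_pred))
mae_sklearn = mean_absolute_error(y_true, y_pred)

# Baseline: always predict the mean of y; a useful model should beat this
X_dummy = np.zeros((4, 1))                        # features are ignored by the baseline
baseline = DummyRegressor(strategy="mean").fit(X_dummy, y_true)
mae_baseline = mean_absolute_error(y_true, baseline.predict(X_dummy))
print(mae_manual, mae_sklearn, mae_baseline)
```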

In general, the overall performance of the prediction model is not satisfactory. Here we list some common problems that we are facing.

1. Bad features
> The features have low correlation with the target. We selected features by correlation matrix or statistical scores, but those features do not perform well on the test set, which suggests they may not help in building the prediction model. Feature selection is one of the key components of a high-accuracy model. We want to avoid redundant or irrelevant features, which adversely affect the learning process.
>
> We could spend more time collecting data and extracting features by taking a closer look at the datasets. Many factors can influence the number of confirmed cases (e.g. which cities are under lockdown, the number of tests conducted each day, ...). If we can identify highly correlated features, performance should improve.

2. Underfitting
> The model suffers from high bias. The training score and cross-validation score converge early at a low value, which is a symptom of underfitting. The mean absolute errors on the training and test sets are close to each other, but both are high.
>
> There are several possible ways to address the high bias. For example, we could get more training data and use more highly correlated features for training. We could also use a more complex model, as the data might not have a linear relationship with the outcome.
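
To illustrate the last point, here is a sketch on synthetic data with a deliberately non-linear (quadratic) relationship. The choice of `RandomForestRegressor` is an assumption made for illustration only; it has not been tried on the datasets in this notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_toy = rng.uniform(-3, 3, size=(500, 1))
# Invented target: quadratic in the feature, so a straight line cannot fit it
y_toy = X_toy[:, 0] ** 2 + rng.normal(scale=0.2, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, train_size=0.8, random_state=0)

# Compare test-set R^2 of a linear model and a more flexible model
linear_r2 = LinearRegression().fit(X_tr, y_tr).score(X_te, y_te)
forest_r2 = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr).score(X_te, y_te)
print(linear_r2, forest_r2)
```

On data like this the linear model underfits in the same way as described above, while the more flexible model can follow the curve.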

## Linear regression - Prediction

Even though the model does not provide high-accuracy predictions, we will try to predict the confirmed cases after altering the values of certain metrics.

Let's take New York state as the example, as it has the highest number of confirmed cases. The government has urged people to stay at home in order to reduce the chance of spreading the virus. We wonder how much the number of infections would drop if the policy were followed strictly. To simulate this, we predict the confirmed cases after reducing the mobility trends by half and compare the result with the original data.

In [None]:
X_pred = X_all.loc[X_all['State'] == le.transform(['NY'])[0]]

X_pred_halved = X_pred.copy()
# Assign the halved columns directly; X_pred_halved[:]['col'] = ... is chained
# assignment and may modify a temporary copy instead of X_pred_halved
mobility_cols = ['driving', 'transit', 'walking',
                 'retail_and_recreation_percent_change_from_baseline', 'grocery_and_pharmacy_percent_change_from_baseline',
                 'parks_percent_change_from_baseline', 'transit_stations_percent_change_from_baseline',
                 'workplaces_percent_change_from_baseline', 'residential_percent_change_from_baseline']
X_pred_halved[mobility_cols] = X_pred_halved[mobility_cols] * .5

# Compare the same window (last 7 rows) of actual and predicted cases
print("The confirmed cases in New York state would drop by {:.0f} if we reduced the mobility trends by half!".
      format(np.mean(y_all[X_pred.index][-7:]) - np.mean(reg_all.predict(X_pred_halved[-7:]))))
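
Subtracting a prediction from the actual case count mixes the effect of the change with the model's own error. One possible refinement, sketched here on synthetic data with invented coefficients, is to compare the model's prediction on the original features with its prediction on the altered features:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
toy = pd.DataFrame({"driving": rng.uniform(50, 150, size=200),
                    "walking": rng.uniform(50, 150, size=200)})
# Invented relationship: cases rise with both mobility indices
cases = 5 * toy["driving"] + 2 * toy["walking"] + rng.normal(scale=10, size=200)
toy_reg = LinearRegression().fit(toy, cases)

# Counterfactual copy: halve the mobility columns with a single assignment
halved = toy.copy()
halved[["driving", "walking"]] = halved[["driving", "walking"]] * 0.5

# Prediction minus prediction, so the model's error largely cancels out
estimated_drop = toy_reg.predict(toy).mean() - toy_reg.predict(halved).mean()
print(round(estimated_drop, 1))
```

Such an estimate still describes only what the fitted model implies; it is not evidence of a causal effect.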

# Stay Home, Stay Safe!