# Corona Virus Correlations

TL;DR: across US states there is a statistically significant rank correlation between public transport usage and the current number of confirmed corona virus cases. The Spearman correlation coefficient comes out at 0.55, with a highly significant p-value.

In the **first part** of this notebook we explore the possible correlation between public transport usage in the US and the number of confirmed corona virus cases, by State.

The transportation research I am referencing can be found here: https://www.usnews.com/news/best-states/rankings/infrastructure/transportation
It is, in part, based on research conducted by the U.S. Department of Transportation.

Transportation is further broken down into four metrics: commute time, road quality, bridge quality and public transit usage.

Each metric is represented by a ranking for the state. The main metric for this study is the public transit usage metric and is defined as:

**Public Transit Usage**

This metric measures the average miles traveled on public transportation by one state resident in 2016. With an average of about 1 mile traveled per person, Mississippi’s public transit system earned the state the bottom slot, and with about 42 miles per person, New York was No. 1.

The **null hypothesis** for this study is that there is no correlation between Public Transit Usage and the number of confirmed cases of corona virus currently. My alpha is at the 5% level.

My thinking is that, before lockdown, the virus had a chance to spread via public transport infrastructure.

In the **second part** of this notebook we explore the possible correlation between the American Human Development Index and the number of confirmed corona virus cases, by State.

The data were taken from the [American Human Development Report](http://measureofamerica.org/human-development/#american%20human%20development%20index).
According to the report, human development is defined as:

**Human development**

Human development is defined as the process of enlarging people’s freedoms and opportunities and improving their well-being. Human development is about the real freedom ordinary people have to decide who to be, what to do, and how to live.

Follow the link to read more about it; it's quite interesting.

The **null hypothesis** for this study is that there is no correlation between the American Human Development Index and the number of confirmed cases of corona virus currently. My alpha is at the 5% level.

# Part 1: Public Transport Usage

In [None]:

import numpy as np
import pylab as pl
import pandas as pd
import matplotlib.pyplot as plt 
%matplotlib inline
import seaborn as sns
from scipy.stats import spearmanr
import os

Import the data:

In [None]:
train = pd.read_csv("/kaggle/input/covid19-global-forecasting-week-3/train.csv")
transport_rankings = pd.read_csv("/kaggle/input/transport-rankings-by-state/US_Transport_Rankings.csv")
train.head(2)

In [None]:
transport_rankings.head(5)

Rename the columns of the external data set and filter the training data to US rows:

In [None]:
transport_rankings.columns=('OverallTransportRank','Province_State','CommuteTime','PublicTransitUsage','RoadQuality','BridgeQuality')
train = train[train['Country_Region'] == 'US']

Group by state and sum the confirmed cases (yes, the column is cumulative, but the sum still serves as a measure of magnitude, which is all the ranking needs), then rank the states by confirmed cases:

In [None]:
unique = pd.DataFrame(train.groupby(['Country_Region', 'Province_State'],as_index=False)['ConfirmedCases'].sum())
unique['ConfirmedCases_rank'] = unique['ConfirmedCases'].rank(ascending=False)
unique.sort_values(by=['ConfirmedCases_rank'], inplace=True)
unique.head(5)
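
To see what `rank(ascending=False)` does with tied totals, here is a minimal sketch on made-up numbers. The state labels and counts are hypothetical and are not taken from the Kaggle files.

```python
import pandas as pd

# Toy data: hypothetical state totals, NOT from the Kaggle data set.
toy = pd.DataFrame({
    "Province_State": ["A", "B", "C", "D"],
    "ConfirmedCases": [500, 120, 120, 30],
})

# ascending=False gives rank 1 to the largest total.
# With the default method="average", tied totals share the mean of their ranks.
toy["ConfirmedCases_rank"] = toy["ConfirmedCases"].rank(ascending=False)

ranks = toy["ConfirmedCases_rank"].tolist()
print(ranks)
```

The two tied states share rank 2.5, which is the tie handling that rank correlation tests such as Spearman's expect.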

Merge the data on the state column:

In [None]:
combined = pd.DataFrame(unique.merge(transport_rankings, on='Province_State'))
combined.head(5)
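
`merge` defaults to an inner join, so any `Province_State` value that appears in only one of the two frames silently drops out. A rough way to check this, sketched here on hypothetical stand-ins for `unique` and `transport_rankings`, is an outer merge with `indicator=True`:

```python
import pandas as pd

# Hypothetical stand-ins for `unique` and `transport_rankings` (made-up rows).
cases = pd.DataFrame({
    "Province_State": ["New York", "Texas", "Guam"],
    "ConfirmedCases_rank": [1.0, 2.0, 3.0],
})
rankings = pd.DataFrame({
    "Province_State": ["New York", "Texas", "Ohio"],
    "PublicTransitUsage": [1, 30, 25],
})

# The default inner merge keeps only names present in both frames.
inner = cases.merge(rankings, on="Province_State")

# An outer merge with indicator=True adds a `_merge` column showing
# whether each row came from the left frame, the right frame, or both.
check = cases.merge(rankings, on="Province_State", how="outer", indicator=True)
unmatched = sorted(check.loc[check["_merge"] != "both", "Province_State"])
print(len(inner), unmatched)
```

Running the same check on the real frames would show which territories or states are left out of the correlation tests.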

So now we are ready to run some rank correlation tests. I will be using Spearman's and Kendall's coefficients for my tests.

In [None]:
coef, p = spearmanr(combined['ConfirmedCases_rank'], combined['PublicTransitUsage'])
print('Spearmans rank correlation coefficient and p-value respectively: %.3f' % coef,p)
# interpret the significance
alpha = 0.05
if p > alpha:
    print('Samples are uncorrelated (fail to reject H0) p=%.3f' % p)
else:
    print('Samples are correlated (reject H0) p=%.3f' % p)

The p-value is highly significant, so we reject the null hypothesis.
Let's try Kendall's tau:

In [None]:
from scipy.stats import kendalltau
# calculate kendall's correlation
coef, p = kendalltau(combined['ConfirmedCases_rank'], combined['PublicTransitUsage'])
print('Kendall correlation coefficient and p-value respectively: %.3f' % coef, p)
# interpret the significance
alpha = 0.05
if p > alpha:
    print('Samples are uncorrelated (fail to reject H0) p=%.3f' % p)
else:
    print('Samples are correlated (reject H0) p=%.3f' % p)

Again the p-value is significant, so we reject the null hypothesis that there is no correlation between Public Transit Usage and the number of confirmed cases of corona virus currently.
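
For intuition about what the Spearman coefficient measures: it is Pearson's correlation computed on the ranks of the two variables. The sketch below checks this on six made-up states; the numbers are hypothetical and only illustrate the mechanics.

```python
import numpy as np
from scipy.stats import pearsonr, rankdata, spearmanr

# Hypothetical values for six states (NOT real data).
cases = np.array([9000, 4000, 2500, 800, 300, 120])
transit_miles = np.array([42.0, 11.0, 15.0, 3.0, 5.0, 1.0])

# Spearman's rho straight from the raw values.
rho, p_value = spearmanr(cases, transit_miles)

# The same quantity as Pearson's r on the ranked values.
r_on_ranks, _ = pearsonr(rankdata(cases), rankdata(transit_miles))

print(round(rho, 3), round(r_on_ranks, 3))
```

In this notebook both columns are already ranks in which 1 means "most", so a positive coefficient says that states with heavier public transit usage tend to be the ones with more confirmed cases.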

In [None]:
# scatter plot of the two rank columns
plt.scatter(combined['ConfirmedCases_rank'], combined['PublicTransitUsage'])
plt.xlabel('Confirmed cases rank')
plt.ylabel('Public transit usage rank')
plt.title('Confirmed Cases Rank vs Public Transport Usage Rank')
plt.show()

I know you want to see the correlations with the other columns. I'll run Spearman's for these:

In [None]:
coef, p = spearmanr(combined['ConfirmedCases_rank'], combined['RoadQuality'])
print('Spearmans rank correlation coefficient and p-value respectively (RoadQuality): %.3f' % coef,p)
coef, p = spearmanr(combined['ConfirmedCases_rank'], combined['BridgeQuality'])
print('Spearmans rank correlation coefficient and p-value respectively (BridgeQuality): %.3f' % coef,p)
coef, p = spearmanr(combined['ConfirmedCases_rank'], combined['OverallTransportRank'])
print('Spearmans rank correlation coefficient and p-value respectively (OverallTransportRank): %.3f' % coef,p)

Nothing to write home about re those three correlations. 
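
The repeated `spearmanr` calls can also be collected into one table. This is only a sketch: `spearman_table` is a hypothetical helper, not part of the notebook, and the toy frame holds made-up ranks.

```python
import pandas as pd
from scipy.stats import spearmanr


def spearman_table(df, target, columns):
    """Return Spearman's rho and p-value of `target` against each column."""
    rows = []
    for col in columns:
        rho, p = spearmanr(df[target], df[col])
        rows.append({"metric": col, "rho": rho, "p_value": p})
    return pd.DataFrame(rows).set_index("metric")


# Hypothetical ranks for six states (NOT real data).
toy = pd.DataFrame({
    "ConfirmedCases_rank": [1, 2, 3, 4, 5, 6],
    "RoadQuality": [6, 5, 4, 3, 2, 1],
    "BridgeQuality": [1, 2, 3, 4, 5, 6],
})

table = spearman_table(toy, "ConfirmedCases_rank", ["RoadQuality", "BridgeQuality"])
print(table)
```

On the real data the call would presumably be `spearman_table(combined, 'ConfirmedCases_rank', [...])` with the transport columns of interest.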

So what comes out of this analysis is the idea that citizens' engagement with public transport infrastructure could serve as an indicator of when to start introducing extra measures against biological threats to human health.

# Part 2: American Human Development Index

In [None]:
HDI_rankings = pd.read_csv("/kaggle/input/american-human-development-index/US_HDI_Rankings.csv")
HDI_rankings.head()

In [None]:
HDI_rankings.columns=('HDI_rank','Province_State','HDI')
combined = pd.DataFrame(unique.merge(HDI_rankings, on='Province_State'))
combined.head(5)

Now we test for significance again on the rank columns.
Spearman's rank correlation test:

In [None]:
coef, p = spearmanr(combined['ConfirmedCases_rank'], combined['HDI_rank'])
print('Spearmans rank correlation coefficient and p-value respectively: %.3f' % coef,p)
# interpret the significance
alpha = 0.05
if p > alpha:
    print('Samples are uncorrelated (fail to reject H0) p=%.3f' % p)
else:
    print('Samples are correlated (reject H0) p=%.3f' % p)

Kendall's tau:

In [None]:
# calculate kendall's correlation
coef, p = kendalltau(combined['ConfirmedCases_rank'], combined['HDI_rank'])
print('Kendall correlation coefficient: %.3f' % coef)
# interpret the significance
alpha = 0.05
if p > alpha:
    print('Samples are uncorrelated (fail to reject H0) p=%.3f' % p)
else:
    print('Samples are correlated (reject H0) p=%.3f' % p)

In [None]:
# scatter plot of the two rank columns
plt.scatter(combined['ConfirmedCases_rank'], combined['HDI_rank'])
plt.xlabel('Confirmed cases rank')
plt.ylabel('HDI rank')
plt.title('Confirmed Cases Rank vs Human Development Index')
plt.show()

Both of the above tests fail to reject the null hypothesis that there is no correlation between the American Human Development Index and the current confirmed cases rank for corona virus.
What is interesting (although the graph is perhaps not so forgiving visually) is that they *almost* do reject it: the p-values are significant at the 10% level, just not at the 5% alpha set for this study.
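
The "significant at 10% but not at 5%" situation can be made explicit with a small helper that applies the same decision rule as the cells above at several alpha levels. The function name and the example p-value are hypothetical; the p-value is not the notebook's actual result.

```python
def decisions(p_value, alphas=(0.05, 0.10)):
    """Map each alpha to True if H0 is rejected at that level (p <= alpha)."""
    return {alpha: p_value <= alpha for alpha in alphas}


# Hypothetical p-value lying between the two thresholds.
example_p = 0.07
result = decisions(example_p)
print(result)
```

A p-value in that range rejects the null hypothesis at the 10% level only, so the conclusion depends entirely on the alpha fixed before looking at the data.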

Now if we ascribe anything to that, we would do well to use the index as a barometer for when we need to introduce extra measures against such natural evils as a virus bent on taking us out!