##### D.Tom (April, 2020)
# <center>COVID19 Global Forecasting (Week 3):</center>
<b>Task: </b>*Forecast daily COVID-19 spread in regions around the world.  Predict the cumulative number of confirmed COVID-19 cases in various locations across the world, as well as the number of resulting fatalaties, for future dates.*

### Introduction

The White House Office of Science and Technology Policy alongside research groups and companies have prepared a COVID-19 Open Research Dataset ('CORD-19').

CORD-19 alongside the United Nation's dataset [('SYB62_1_201907')](https://data.un.org/_Docs/SYB/CSV/SYB62_1_201907_Population,%20Surface%20Area%20and%20Density.csv) on Population, Surface Area and Density will identify specific variables asociated to the infection and fatality rates based on location.

Proper Data Science techniques will afford governments around the world to better appropraite resources in an attempt to slow the infection, decrease mortality rates, and eradicate the virus.  

<b>No prediction is guarenteed</b>:
    - An appropriate predictive model will save time, money, and lives through proper planning.
    - More location specific data should accurately predict future cases. (ie. City population)
    - More social specific data should accurately predict future cases. (ie. People's movements)

## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

1. [Setting up the Environment](#0)<br>
2. [Combine the *pandas* dataframes](#2)<br>
3. [Analyze correlations and probability values](#4)<br>
4. [Plot the training data analyzing 'ConfirmedCases'](#6)<br>
5. [Build a Long-Term Predictive Model based on Hubei province (Polynomial Regression)](#8)<br>
6. [Generate a Sigmoid Predictive Model based on Hubei Province](#10)<br>
6. [Conclusion](#12) <br>
</div>
<hr>

# Setting up the Environment<a id="0"></a>

The below environment will allow for most machine learning methods to be quickly loaded and incorporated into the project.  Such modules include loading datasets, analyzing and plotting values across various diagrams, optimizating dependencies, and prebuilt machine learning algorithms.  Additional optimization methods can be performed to remove and omit certain modules based on one's given project.

##### Download and import dependencies

In [None]:
%%capture
!pip install -upgrade pip
!pip install ipywidgets matplotlib==2.2.0 pyproj==1.9.6 numpy pandas proj pillow cython sklearn datetime seaborn pylab

In [None]:
import matplotlib as mpl
import matplotlib.pyplot as plt
import pandas as pd
#import pylab as pl
import numpy as np
from scipy import stats
import seaborn as sns
import datetime as dt

from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt
#from pylab import rcParams

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import r2_score
from scipy.stats import pearsonr
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

from scipy.optimize import curve_fit

mpl.style.use('ggplot') # optional: for ggplot-like style
print ('Matplotlib version: ', mpl.__version__) # >= 2.0.0 # check for latest version of Matplotlib

%matplotlib inline

## Download and load files

##### It is suggested to download and unzip the files manually from:
    1. https://www.kaggle.com/ 
          - Competition requires approvaland verification before downloads can be authorized
          
    2. https://data.un.org/_Docs/SYB/CSV/SYB62_1_201907_Population,%20Surface%20Area%20and%20Density.csv

In [None]:
#fname = 'SYB62_1_201907_Population, Surface Area and Density.csv'
#site = 'https://data.un.org/_Docs/SYB/CSV/SYB62_1_201907_Population,%20Surface%20Area%20and%20Density.csv'
#!wget -O  'SYB62_1_201907_Population, Surface Area and Density.csv' 'https://data.un.org/_Docs/SYB/CSV/SYB62_1_201907_Population,%20Surface%20Area%20and%20Density.csv'
#!unzip -o -j 'SYB62_1_201907_Population, Surface Area and Density.csv'

##### Load the Kaggle Competition training data 'train.csv'.  
Clean the dataset:
     - parse dates
     - create DateKey column

In [None]:
path = '../input/covid19-global-forecasting-week-3/'
csv = 'train.csv'
filepath = path + csv

pdf = pd.read_csv(filepath, parse_dates=['Date'])

pdf['Year'] = pdf['Date'].dt.year
pdf['Month'] = pdf['Date'].dt.month
pdf['Week'] = pdf['Date'].dt.week
pdf['Day'] = pdf['Date'].dt.day
pdf['DateKey'] = (pdf['Date'] - pdf['Date'].min()).astype(int)

bkdf = pdf
today = pd.datetime.now()

print(' Filepath: ', filepath, '\n', 'Shape: ', pdf.shape, '\n', 'Date: ', today)
pdf.head()

##### Load the United Nations Population, Surface Area and Density dataset
Clean the dataset: 
     - allign the dataset with the Kaggle dataset
     - create Total Population column

In [None]:
#path2 = '../output/kaggle/working/'
csv2 = 'SYB62_1_201907_Population, Surface Area and Density.csv'
filepath2 = csv2 #path2 + csv2

asdf = pd.read_csv(filepath2, encoding='iso-8859-1', header=1)
asdf = asdf.pivot_table(values='Value', index=['Region/Country/Area', 'Unnamed: 1', 'Year'], columns='Series').reset_index()
asdf = asdf.rename(columns={'Unnamed: 1':'Country_Region'}, inplace=False)
asdf['Population mid-year Total'] = ((asdf['Population mid-year estimates for females (millions)'] 
                                    + asdf['Population mid-year estimates for males (millions)'])
                                    * 1000000)
asdf = asdf.drop(['Year'], axis=1, inplace=False)
asdf = asdf.dropna()

print(' Filepath: ', filepath2, '\n', 'Shape: ', asdf.shape, '\n', 'Date: ', today)
asdf.head()

##### Load the Kaggle Competition testing data 'test.csv'.  
Clean the dataset:
     - parse dates

In [None]:
tp = '../input/covid19-global-forecasting-week-3/'
tf = 'test.csv'
filepath3 = tp + tf

test = pd.read_csv(filepath3, parse_dates=['Date'])

print(' Filepath: ', filepath3, '\n', 'Shape: ', test.shape, '\n', 'Date: ', today)
test.head()

##### At this point there are three *pandas* dataframes loaded into the notebook:
    
    pdf = training data
    asdf = additional data
    test = testing data
    
*These dataframes will be merged into two separate dataframes:*
    
    tes = (test + asdf) 'Testing + Additional'
    res = (pdf + asdf) 'Training' + Additional'

# Combine the *pandas* dataframes:<a id="2"></a>

Create the testing dataframe:

    tes = (test + asdf) 'Testing + Additional'
    
##### Cleanup the data by parsing dates and creating a 'DateKey' similar to the above dataframes.  

    - Drop unassociated null values 
    - Create column for 'Cases per pop'
    - Create column for 'Case Density'

In [None]:
tes = pd.concat([pdf, test])
tes = tes.reset_index(drop=True)
tes = pd.merge(tes, asdf, on='Country_Region', how='left')

tes['Year'] = tes['Date'].dt.year
tes['Month'] = tes['Date'].dt.month
tes['Week'] = tes['Date'].dt.week
tes['Day'] = tes['Date'].dt.day
tes['DateKey'] = (tes['Date'] - tes['Date'].min()).astype(int)

tes['ConfirmedCases'] = tes['ConfirmedCases'].replace(np.nan, 0)
tes['ConfirmedCases'] = tes['ConfirmedCases'].dropna()

tes['Fatalities'] = tes['Fatalities'].replace(np.nan, 0)
tes['Fatalities'] = tes['Fatalities'].dropna()

tes['Cases per pop'] = (tes['Population mid-year Total'] / (tes['ConfirmedCases'].max()))
casp = (tes['Population mid-year Total'].mean() / (tes['ConfirmedCases'].max()))
tes['Cases per pop'] = tes['Cases per pop'].fillna(casp).astype(int)
tes['Case Density'] = tes['Cases per pop'] / tes['Population density']

print('Shape: ', tes.shape)
tes.tail()

Create a training dataframe:

    res = (pdf + asdf) 'Training' + Additional'
    
##### Cleanup the data by parsing dates and creating a 'DateKey' similar to the above dataframes.  

    - Drop unassociated null values 
    - Create column for 'Cases per pop'
    - Create column for 'Case Density'

In [None]:
res = pd.merge(pdf, asdf, on='Country_Region', how='left')
res = res.reset_index(drop=True)


res['Cases per pop'] = (res['Population mid-year Total'] / (res['ConfirmedCases'].max()))
casp2 = (res['Population mid-year Total'].mean() / (res['ConfirmedCases'].max()))
res['Cases per pop'] = res['Cases per pop'].fillna(casp).astype(int)
res['Case Density'] = res['Cases per pop'] / res['Population density']


print('Shape: ', res.shape)
res.tail()

## Analyze correlations and probability values:<a id="4"></a>
    tes = testing dataframe
    res = training dataframe

In [None]:
tes['Country_Region'].value_counts()

In [None]:
res['Country_Region'].value_counts()

##### Identify any Pearson Correlation values greater than or equal to 0.3

Notice the correlations between 'Case Density', 'Cases per pop', and 'Surface area (thousand km2)'

In [None]:
name = res
name.corr()[name.corr() >= 0.3]

##### Identify any probability values less than or equal to 0.2

Notice the significance betwen 'ConfirmedCases', 'Week', and 'Month'.

In [None]:
name = res.dropna()

def calculate_pvalues(name):
    name = name._get_numeric_data()
    col = pd.DataFrame(columns=name.columns)
    pval = col.transpose().join(col, how='outer')
    for r in name.columns:
        for c in name.columns:
            pval[r][c] = round(pearsonr(name[r], name[c])[1], 4)
    return pval

# 0 SIGNIFICANT, >0.1 NO SIGNIFICANCE
calculate_pvalues(name)[calculate_pvalues(name) <= 0.2]

##### Initial analysis of the available datasets indicates a correlation between 'ConfirmedCases' and 'Fatalities' with regards to the following variables:

    1. 'Case Density'
    2. 'Cases per pop'
    3. 'Month'
    4. 'Week'
    
<b>A rough predictive model for 'Predicted Cases' and 'Predicted Fatalaties' can be built around the four above variables.</b>

*It must be noted that additional data may indicate additional correlations, or lack there of.*

# Plot the training data analyzing 'ConfirmedCases':<a id="6"></a>

##### Below is a diagram indicating the True Values of 'ConfirmedCases' in reference to 'Date'

In [None]:
xydf = res

xx = 'Date'
yy = 'ConfirmedCases'

xxdf = xydf[xx]
yydf = xydf[yy]

plt.scatter(xxdf, yydf,  color='blue')
plt.xlabel("True Values: " + xx)
plt.ylabel("True Values: " + yy)
plt.show()

*It should be noted that the above dataset contains outliers that will require additional data to properly correct for.  Such data may include the amount of tests with regards to testing, population, and location.*  
*Without the exact confirmed case value it will be nearly impossible to accurately predict any values, however the available data will give insight as to what may occur globally over the course of the next few years.*

#### It is well documented that China, the specific location being Wuhan within Hubei, was the first known carrier of the COVID-19 outbreak.  This location will be further analyzed to determine a long term correlation of Predicted Values to that of confirmed cases.

Create a specific testing China and Hubei dataframe:
    - Identify/remove outliers (ie. Hubei Province)
    - Identify a long term prediction for the ConfirmedCases and Fatalities based on the below variables: 

    1. 'Case Density'
    2. 'Cases per pop'
    3. 'Month'
    4. 'Week'

In [None]:
resch = res.Country_Region.str.contains('China')
resch = res[resch]
resch

In [None]:
hubeidf = resch.dropna()
hubeidf = hubeidf.reset_index(drop=True)
hubei2df = hubeidf.Province_State.str.contains('Hubei')
hubei2df = hubeidf[hubei2df]
hubei2df

# Build a Long-Term Predictive Model based on Hubei province (Polynomial Regression)<a id="8"></a>

A predictive model will be analyzed based on the Hubei province using a multivariate polynomial regression plot.  The variables used in this predictive model will be based on all the available numerical data and will use sklearn for the automated machine learning pipeline.

##### ConfirmedCases

In [None]:
y = hubei2df['ConfirmedCases']
Z = hubei2df.drop(['ConfirmedCases', 'Country_Region', 'Date', 'Province_State'], axis=1, inplace=False)


Input=[('scale',StandardScaler()),('model',LinearRegression())]

pipe=Pipeline(Input)

pipe.fit(Z,y)

ypipe=pipe.predict(Z)

hubei2df['Predicted Cases'] = ypipe
hubei2df['Predicted Cases'] = hubei2df['Predicted Cases'].astype(int)
r_squared = r2_score(y, ypipe)
print('The R-square value is: ', r_squared)
hubei2df.head()

##### Fatalities

In [None]:
y2 = hubei2df['Fatalities']
Z2 = hubei2df.drop(['Fatalities', 'Country_Region', 'Date', 'Province_State'], axis=1, inplace=False)


Input=[('scale',StandardScaler()),('model',LinearRegression())]

pipe=Pipeline(Input)

pipe.fit(Z2,y2)

y2pipe=pipe.predict(Z2)

hubei2df['Predicted Fatalities'] = y2pipe
hubei2df['Predicted Fatalities'] = hubei2df['Predicted Fatalities'].astype(int)
r_squared2 = r2_score(y2, y2pipe)
print('The R-square value is: ', r_squared2)
hubei2df.head()

##### To further analyze the polynomial regression, an equation will be based on the Confirmed Values and the Date.
*Note: A 16-order polynomial was found to accurately represent the number of Confirmed Values in reference to the 'DateKey'.*

In [None]:
x = hubei2df['DateKey']
cr = hubei2df['ConfirmedCases']

f = np.polyfit(cr, x, 16)
p = np.poly1d(f)
print(p)

r_squared = r2_score(x, p(cr))
mse = mean_squared_error(x, p(cr))

plt.title('Polynomial fit')
plt.xlabel('DateKey')
plt.ylabel('ConfirmedCases')
plt.plot(x, ypipe, '-', x, cr, '.')
plt.show()

print('The R-square value is: ', r_squared)
print('The MSE value is: ', mse)

In [None]:
def PlotPolly(model, independent_variable, dependent_variabble, Name):
    x_new = np.linspace(15, 55, 100)
    y_new = model(x_new)

    plt.plot(independent_variable, dependent_variabble, '.', x_new, y_new, '-')
    plt.title('Polynomial Fit with Matplotlib for Confirmed Cases ~ Predicted Cases')
    ax = plt.gca()
    ax.set_facecolor((0.898, 0.898, 0.898))
    fig = plt.gcf()
    plt.xlabel(Name)
    plt.ylabel('ConfirmedCases')
    plt.show()
    plt.close()

In [None]:
pc = hubei2df['Predicted Cases']
f = np.polyfit(y, pc, 16)
p = np.poly1d(f)
print(p)
PlotPolly(p, y, pc, 'Predicted Cases')
print(np.polyfit(y, pc, 16))

Further analyze the polynomial model's 'Predicted Cases' and 'Predicted Fatalaties' by calculating their pearson correlation to that of 'ConfirmedCases' and 'Fatalities'.

In [None]:
hubei2df.corr()[hubei2df.corr() >= 0.1 ]

##### It should be noted that the above analysis proves that a model can be generated to predict cases with relative accuracy, however, a comparison of the True Values Predicted Values shows that the polynomial regression does perform as intended and may require more data or a separate model.

# Generate a Sigmoid Predictive Model based on Hubei Province<a id="10"></a>

A new attempt will be made to generate a predictive model for the Hubei Province based on a Sigmoid function.  This will use scipy for the automation to better fit the beta_1 and beta_2 values.

*It should be noted that ConfirmedCases should both rise and fall based on whether or not the infection is spreading or regressing.  A sigmoid function does not allow for such ebb and flow, but a sigmoid predictive model might accurately predict a long-term correlation associated to both 'ConfirmedCases' and 'Date' to that of the potential effects of possible quarentine measures.*

##### Note: The more effective the quarentine, the lower the rate of infection.  The infection might spread to the entire population, but it might take a certain amount of time (such as [arbitrary value: 2 years] with effective quarentine measures) to reach that point.  If a vaccine is found before the entire population is infected then the sigmoid will reverse giving a bell curve of infected cases to that of the vaccinated population.

In [None]:
def sigmoid(x, Beta_1, Beta_2):
     y = 1 / (1 + np.exp(-Beta_1*(x-Beta_2)))
     return y

In [None]:
xdata = hubei2df['DateKey'].values/max(hubei2df['DateKey'].values)
ydata = hubei2df['ConfirmedCases'].values/max(hubei2df['ConfirmedCases'].values)
popt, pcov = curve_fit(sigmoid, xdata, ydata)

print(" beta_1 = %f, beta_2 = %f" % (popt[0], popt[1]))

x = np.linspace(0, 51, 52)
x = x/max(x)
plt.figure(figsize=(8,5))
y = sigmoid(x, *popt)

plt.plot(xdata, ydata, 'ro', label='data')
plt.plot(x,y, linewidth=3.0, label='fit')
plt.legend(loc='best')
plt.ylabel('ConfirmedCases')
plt.xlabel('Date')
plt.show()

In [None]:
# split data into train/test
msk = np.random.rand(len(hubei2df)) < 0.8
train_x = xdata[msk]
test_x = xdata[~msk]
train_y = ydata[msk]
test_y = ydata[~msk]

# build the model using train set
popt, pcov = curve_fit(sigmoid, train_x, train_y)

# predict using test set
y_hat = sigmoid(test_x, *popt)

# evaluation
print("Mean absolute error: %.2f" % np.mean(np.absolute(y_hat - test_y)))
print("Residual sum of squares (MSE): %.2f" % np.mean((y_hat - test_y) **2))
from sklearn.metrics import r2_score
print("R2-Score: %.2f" % r2_score(y_hat, test_y))

#logistic function
Y_pred = sigmoid(xdata, popt[0], popt[1])

#plot initial prediction against datapoints
plt.plot(xdata, Y_pred)
plt.plot(xdata, ydata, 'ro')

The above sigmoid fits very well, it should be noted however that the number of confirmed cases may vary greatly based on testing per location (ie. Tests per 1000 population) and may require an additional dataset based on estimated 'Un-ConfirmedCases'.

*Note: If the date of a vaccine is 2 years and a known rate of infection is calculated based on the above sigmoid model, then it can be reasonably assumed that a prediction can be built based on the Total Population.*

*Note2: If the amount of tests performed is equal to that of the population, then the number of 'ConfirmedCases' would be completely accurate, however this value is not the case and may need a new variable to consititute the lack of testing.*

### Create a column of Predicted Cases based on the Sigmoid model

In [None]:
xdata2 = tes['ConfirmedCases'].values/max(tes['ConfirmedCases'].values)
sigpred = sigmoid(xdata2, *popt)#[0], popt[1])
print(len(sigpred))
print(sigpred)
tes['Predicted Cases (Sig)'] = (sigpred * popt[0]).astype(int)

In [None]:
xdata2 = tes['Fatalities'].values/max(tes['Fatalities'].values)
sigpred = sigmoid(xdata2, *popt)#[0], popt[1])
print(len(sigpred))
print(sigpred)
tes['Predicted Fatalities (Sig)'] = (sigpred * popt[0]).astype(int)
tes [32300:32370]

# Conclusion:<a id="12"></a>

<b>A 16-order Polynomial Function will accurately predict COVID-19 cases and fatalaties across various locations.  Main factors associated to this conclusion are location specific and will require more data for to accurately predict the case count.  Datasets required for a more accurate analysis include:</b>

    Datasets Required:
    - More location specific data should accurately predict future cases. (ie. City population)
    - More social specific data should accurately predict future cases. (ie. People's movements)
    - More testing specific data should accurately predict future cases. (ie. Time and Volume of tests)
    - More quarentine specific data should accurately predict future cases. (ie. Limiting People's movements)
    
    Minimum Necessary Variables:
    1. 'Case Density'
    2. 'Cases per pop'
    3. 'Month'
    4. 'Week'
    

<b>A sigmoid function based on China's Hubei province was successful in visually representing the appropriate values, however results for the predictive model cannot be accurately utilized in the project without further analysis of the values.  It must be directly stated that it appears an issue occured somewhere in the above prediction values, the error will be submitted based on the time constraints of the project.</b>

##### Prediction based on Sigmoid

In [None]:
sub = pd.read_csv(path + 'submission.csv')
sub['ConfirmedCases'] = tes['Predicted Cases (Sig)']
sub['Fatalities'] = tes['Predicted Fatalities (Sig)']
sub.to_csv('submission.csv', index=False)
sub.head()