# This Kaggle challenge aims to predict future confirmed cases and fatalities for COVID-19 in California, US, to help prepare the medical infrastructure to treat this terrible disease that is hitting our society so hard.

To do that, Kaggle provided a dataset with the confirmed cases and fatalities in California, US, since the start of the outbreak.

First, we check the files provided by Kaggle:

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

To check the information we load the files into Pandas dataframes:

In [None]:
# Load data into Pandas dataframes
df_train = pd.read_csv('/kaggle/input/covid19-local-us-ca-forecasting-week-1/ca_train.csv')
df_test = pd.read_csv('/kaggle/input/covid19-local-us-ca-forecasting-week-1/ca_test.csv')
df_submission = pd.read_csv('/kaggle/input/covid19-local-us-ca-forecasting-week-1/ca_submission.csv')

In [None]:
# Check a preview of the data
df_train.tail()

As we can see, the information we need is the date, confirmed cases and fatalities. Nevertheless, we check that the other columns don't change their values:

In [None]:
# Check the properties of the data

print(df_train['Province/State'].unique())
print(df_train['Country/Region'].unique())
print(df_train['Lat'].unique())
print(df_train['Long'].unique())
print(df_train.dtypes)
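
A more compact way to confirm that columns never change is to count distinct values per column. The sketch below is only an illustration: the small `sample` frame and its values are made up and stand in for `df_train`.

```python
import pandas as pd

# Hypothetical stand-in for df_train; all values here are made up.
sample = pd.DataFrame({
    'Province/State': ['California'] * 4,
    'Country/Region': ['US'] * 4,
    'Lat': [36.0] * 4,
    'Long': [-119.0] * 4,
    'ConfirmedCases': [0.0, 0.0, 1.0, 2.0],
})

# nunique() counts distinct values per column; a count of 1 means
# the column holds a single constant value.
distinct_counts = sample.nunique()
constant_columns = distinct_counts[distinct_counts == 1].index.tolist()
print(constant_columns)
```

Columns listed in `constant_columns` carry no signal for a single-region forecast, which is why they can be dropped.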

We describe the data:

In [None]:
df_train.describe()

In [None]:
# Check the distribution of the confirmed cases

df_train.hist(column='ConfirmedCases')

In [None]:
# Check the distribution of the fatalities

df_train.hist(column='Fatalities')

As we can see, most of the records are from dates when there were no confirmed cases. We update the dataframe to keep just the needed columns:

In [None]:
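# Take only what we need: date, confirmed cases and fatalities
# .copy() makes an independent frame, so later column assignments
# do not trigger pandas' SettingWithCopyWarning

df_train = df_train[['Date', 'ConfirmedCases', 'Fatalities']].copy()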

We transform the date column to the Pandas date format and sort the dataframe:

In [None]:
# Convert the Date column to Pandas datetime and sort to get chronological data

df_train['Date'] = pd.to_datetime(df_train['Date'])
df_train = df_train.sort_values(by=['Date'])

We check the trend of confirmed cases and fatalities in a bar plot:

In [None]:
# Check the trend in a chart

df_train.plot.bar(x='Date', y=['ConfirmedCases','Fatalities'])

The growth seems to be exponential, but too many records contain 0 confirmed cases. To take a closer look, we plot without those records:

In [None]:
# Confirmed cases only appear late in the series, so we focus on that period

df_train2 = df_train.query('ConfirmedCases != 0.0')

df_train2.plot.bar(x='Date', y=['ConfirmedCases', 'Fatalities'])
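
One way to probe the "exponential" impression is to fit a straight line to the logarithm of the counts: for exponential growth, the log is linear in time and the slope gives the doubling time. The sketch below uses synthetic numbers (not the Kaggle series), and the names `days`, `cases` and `doubling_time` are illustrative only.

```python
import numpy as np

# Synthetic cumulative counts that double every 3 days (made-up data).
days = np.arange(15)
cases = 10.0 * 2.0 ** (days / 3.0)

# For exponential growth, log(cases) is a straight line in time.
slope, intercept = np.polyfit(days, np.log(cases), deg=1)

# The doubling time follows from the slope of that line.
doubling_time = np.log(2) / slope
print(round(doubling_time, 2))
```

On the real data the same fit would only make sense on the non-zero records, since the logarithm of 0 is undefined.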

Now, to make the predictions we first need to expand the features, in this case, we expand the date column:

In [None]:
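# .dt.week was removed in pandas 2.0; isocalendar().week gives the same ISO week number
df_train['Week'] = df_train['Date'].dt.isocalendar().week.astype(int)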
df_train['Day'] = df_train['Date'].dt.day
df_train['WeekDay'] = df_train['Date'].dt.dayofweek
df_train['YearDay'] = df_train['Date'].dt.dayofyear
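
Because the same four features are needed again for the test dataset, they could be wrapped in a small helper. This is a sketch, not part of the original notebook: `add_date_features` is a hypothetical name, and it assumes pandas 1.1 or later for `isocalendar()`.

```python
import pandas as pd

def add_date_features(df, date_column='Date'):
    """Return a copy of df with calendar features derived from date_column."""
    out = df.copy()
    dates = pd.to_datetime(out[date_column])
    out[date_column] = dates
    out['Week'] = dates.dt.isocalendar().week.astype(int)  # ISO week number
    out['Day'] = dates.dt.day                              # day of the month
    out['WeekDay'] = dates.dt.dayofweek                    # Monday=0 ... Sunday=6
    out['YearDay'] = dates.dt.dayofyear                    # 1-based day of the year
    return out

example = add_date_features(pd.DataFrame({'Date': ['2020-03-23']}))
print(example)
```

Using one function for both dataframes keeps the column names and their order identical, which matters when the fitted model later predicts on the test features.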

Now we import all models, create them, fit them to the data and compare their R² scores to find the best one:

In [None]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import Lasso, ElasticNet
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score

from sklearn.model_selection import train_test_split

predictors = df_train.drop(['Date', 'ConfirmedCases', 'Fatalities'], axis=1)
target = df_train[['ConfirmedCases', 'Fatalities']]
x_train, x_test, y_train, y_test = train_test_split(predictors, target, test_size=0.2, random_state=1)

def scores(alg):
    # Instantiate the regressor class, fit it to each target in turn and
    # append the R^2 on the held-out split to the global lists s and s2
    model = alg()

    model.fit(x_train, y_train['ConfirmedCases'])
    s.append(r2_score(y_test['ConfirmedCases'], model.predict(x_test)))

    model.fit(x_train, y_train['Fatalities'])
    s2.append(r2_score(y_test['Fatalities'], model.predict(x_test)))
    
algos = [KNeighborsRegressor, LinearRegression, RandomForestRegressor, GradientBoostingRegressor, Lasso, ElasticNet, DecisionTreeRegressor]

s = []
s2 = []

for algo in algos:
    scores(algo)
    
models = pd.DataFrame({
    'Method': ['KNeighborsRegressor', 'LinearRegression', 'RandomForestRegressor', 'GradientBoostingRegressor', 'Lasso', 'ElasticNet', 'DecisionTreeRegressor'],
    'ScoreCC': s,
    'ScoreF' : s2
})

models.sort_values(by=['ScoreCC', 'ScoreF'], ascending=False)
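
Note that `train_test_split` shuffles the rows, so each held-out day has training days on both sides of it. Since the goal is to forecast future dates, a chronological hold-out may give a more realistic picture. The sketch below is an assumption-laden illustration: `chronological_split` is a hypothetical helper and the frame is synthetic.

```python
import pandas as pd

def chronological_split(df, date_column='Date', test_fraction=0.2):
    """Hold out the most recent rows instead of a random sample."""
    ordered = df.sort_values(by=date_column).reset_index(drop=True)
    n_test = max(1, int(round(len(ordered) * test_fraction)))
    return ordered.iloc[:-n_test], ordered.iloc[-n_test:]

# Synthetic frame with ten consecutive days (made-up values).
frame = pd.DataFrame({
    'Date': pd.date_range('2020-03-01', periods=10, freq='D'),
    'ConfirmedCases': range(10),
})
train_part, test_part = chronological_split(frame)
print(train_part['Date'].max(), test_part['Date'].min())
```

With this split, every held-out date lies after the last training date, which mirrors how the submission is actually scored.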

As we can see, the random forest regressor gets the best prediction score, and without any hyperparameter tuning. Nevertheless, it is good practice to check whether an ARIMA model, which is widely used for time series, can perform better.

First we need to check the autocorrelation plot to choose the ARIMA model parameters:

In [None]:
# Finally, let's try an ARIMA model

# The data is not stationary, so we check the autocorrelation of the time series

from pandas.plotting import autocorrelation_plot

autocorrelation_plot(df_train['ConfirmedCases'])
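
To read the plot, it can help to see what a single point on it represents. The sketch below computes the sample autocorrelation at one lag with NumPy, using the usual estimator (to my understanding, the same one `autocorrelation_plot` draws). The function name and the toy series are illustrative assumptions.

```python
import numpy as np

def autocorrelation(values, lag):
    """Sample autocorrelation of a 1-D series at the given lag."""
    x = np.asarray(values, dtype=float)
    n = len(x)
    centered = x - x.mean()
    c0 = np.sum(centered ** 2) / n                        # lag-0 autocovariance
    ch = np.sum(centered[:n - lag] * centered[lag:]) / n  # lag-h autocovariance
    return ch / c0

# A steadily growing series (made-up data) is strongly correlated with its recent past.
trending = np.cumsum(np.arange(1, 31))
print(autocorrelation(trending, 1))

# An alternating series is negatively correlated at lag 1.
alternating = np.array([1, -1] * 5)
print(autocorrelation(alternating, 1))
```

A cumulative count such as `ConfirmedCases` behaves like the trending example, which is why its autocorrelation decays slowly with the lag.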

The highest significant autocorrelation is near the 5th lag. Now let's create the model, train it and check the coefficients and residuals:

In [None]:
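# Legacy ARIMA API: this module was removed in statsmodels 0.13 in favour of
# statsmodels.tsa.arima.model.ARIMA, so this cell needs an older statsmodels release
from statsmodels.tsa.arima_model import ARIMA
# Import pyplot explicitly instead of relying on a wildcard import
from matplotlib import pyplot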

arima_model = ARIMA(df_train['ConfirmedCases'], order=(4,1,0)).fit(disp=0, transparams=True, trend='c')
print(arima_model.summary())

residuals = pd.DataFrame(arima_model.resid)
residuals.plot()
pyplot.show()
residuals.plot(kind='kde')
pyplot.show()
print(residuals.describe())

arima_model2 = ARIMA(df_train['Fatalities'], order=(4,1,0)).fit(disp=0, transparams=True, trend='c')
print(arima_model2.summary())

residuals2 = pd.DataFrame(arima_model2.resid)
residuals2.plot()
pyplot.show()
residuals2.plot(kind='kde')
pyplot.show()
print(residuals2.describe())

The coefficients are poor for both confirmed cases and fatalities, and the residuals show a lot of variation that still needs to be accounted for. Let's check the predictions in a plot:

In [None]:
predictions_arima = list(arima_model.predict())
predictions_arima.append(arima_model.forecast()[0][0])
predictions_arima.append(arima_model.forecast()[0][0])

df_train['arima'] = predictions_arima

predictions_arima2 = list(arima_model2.predict())
predictions_arima2.append(arima_model2.forecast()[0][0])
predictions_arima2.append(arima_model2.forecast()[0][0])

df_train['arima2'] = predictions_arima2

df_train.plot.bar(x='Date', y=['ConfirmedCases', 'arima'])
df_train.plot.bar(x='Date', y=['Fatalities', 'arima2'])

Indeed, the results are not good at all, so the best model for this case is the random forest regressor.

In [None]:
df_submission.head()

In [None]:
print(df_test['Date'].values)
print(len(df_test['Date']))

We need to provide the predictions using the test dataset, so we process the date column just as we did with the training dataset:

In [None]:
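# .copy() avoids pandas' SettingWithCopyWarning on the assignments below
df_test = df_test[['ForecastId', 'Date']].copy()

df_test['Date'] = pd.to_datetime(df_test['Date'])
# .dt.week was removed in pandas 2.0; isocalendar().week gives the same ISO week number
df_test['Week'] = df_test['Date'].dt.isocalendar().week.astype(int)
df_test['Day'] = df_test['Date'].dt.day
df_test['WeekDay'] = df_test['Date'].dt.dayofweek
df_test['YearDay'] = df_test['Date'].dt.dayofyear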

df_test.head()

We create the random forest regressor, fit it, predict on the test dataset records and add the results to the dataframe:

In [None]:
# One model per target, fitted on the same training split used for scoring
model = RandomForestRegressor()
model.fit(x_train, y_train['ConfirmedCases'])

model2 = RandomForestRegressor()
model2.fit(x_train, y_train['Fatalities'])

# Build the feature matrix once so both models see the same columns
test_features = df_test.drop(['Date', 'ForecastId'], axis=1)
df_test['ConfirmedCases'] = model.predict(test_features)
df_test['Fatalities'] = model2.predict(test_features)

In [None]:
# .copy() avoids pandas' SettingWithCopyWarning on the assignments below
df_final = df_test[['ForecastId', 'ConfirmedCases', 'Fatalities']].copy()
df_final['ConfirmedCases'] = df_final['ConfirmedCases'].astype(int)
df_final['Fatalities'] = df_final['Fatalities'].astype(int)

df_final.head()
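
Keep in mind that `astype(int)` truncates toward zero rather than rounding, so a prediction of 9.9 becomes 9. If rounding to the nearest count is preferred, the values can be rounded first. The numbers below are made up for illustration.

```python
import pandas as pd

# Made-up fractional predictions, as a regressor might return them.
raw = pd.Series([0.4, 9.9, 120.6])

truncated = raw.astype(int)        # drops the fractional part
rounded = raw.round().astype(int)  # rounds to the nearest integer first

print(truncated.tolist())
print(rounded.tolist())
```

Which behavior is better depends on the competition metric. The notebook itself uses truncation.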

In [None]:
df_final.plot.bar(x='ForecastId', y=['ConfirmedCases', 'Fatalities'])

Those are the predicted values. Now let's write the submission file and we are done.

In [None]:
df_final.to_csv('submission.csv', index=False)
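
Before uploading, it may be worth comparing the result against the sample submission. The sketch below is a hypothetical check, assuming the sample file has the same three columns as `df_final`. The helper name `submission_problems` and the small frames are illustrative, not part of the competition tooling.

```python
import pandas as pd

def submission_problems(final, sample, id_column='ForecastId'):
    """Return a list of human-readable problems; empty means the frames match."""
    problems = []
    if list(final.columns) != list(sample.columns):
        problems.append('column names or order differ')
    if len(final) != len(sample):
        problems.append('row count differs')
    elif (id_column in final.columns and id_column in sample.columns
            and final[id_column].tolist() != sample[id_column].tolist()):
        problems.append('ids differ')
    if final.isna().any().any():
        problems.append('missing values present')
    return problems

# Made-up stand-ins for df_submission and df_final.
sample_like = pd.DataFrame({'ForecastId': [1, 2], 'ConfirmedCases': [1, 1], 'Fatalities': [1, 1]})
final_like = pd.DataFrame({'ForecastId': [1, 2], 'ConfirmedCases': [150, 180], 'Fatalities': [3, 4]})
print(submission_problems(final_like, sample_like))
```

In the notebook, the call would be `submission_problems(df_final, df_submission)`, and an empty list would suggest that the file has the expected shape.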