# Introduction

This notebook describes in detail the models that I tried for the COVID19 global forecasting challenge. The challenge was to predict the number of confirmed cases and fatalities in every state of every country, every day, for the following 6 weeks.

Since the challenge requires quite a bit of extrapolation, typical machine learning tools (random forests, regression, ARIMA models) are not the best choice: they are poor at extrapolating non-seasonal data so far into the future. It is better to use some domain knowledge about epidemiology.

We start by loading the necessary libraries.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from matplotlib import pyplot as plt
import matplotlib.dates as mdates
from datetime import timedelta
from scipy.optimize import curve_fit
from scipy.stats import linregress
from scipy.special import erf
from sklearn.metrics import mean_squared_error
import warnings

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

# Data preparation

Now we read the files and prepare the dataframes for analysis.

This involves telling pandas how to handle the date column, how to fill missing values, a few small cleanup tricks, plus adding two columns with the logarithm of the confirmed cases and of the fatalities. Because the number of cases grows roughly exponentially during an outbreak, it is better to predict the logarithmic values than the values themselves.

To get back to the confirmed cases, we just have to exponentiate the predictions.
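As a quick illustration of why the logarithm is convenient, here is a minimal sketch with made-up numbers (not competition data): exponential growth becomes a straight line in log space, and `np.exp` undoes `np.log`.

```python
import numpy as np

# Hypothetical case counts that double every day (illustrative values only)
cases = np.array([100.0, 200.0, 400.0, 800.0])

log_cases = np.log(cases)

# In log space, exponential growth has a constant day-to-day increment
increments = np.diff(log_cases)

# Exponentiating recovers the original counts
recovered = np.exp(log_cases)
```

Keep in mind that `np.log(0)` is `-inf`, so days with zero cases or zero fatalities have to be filtered out before fitting.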

In [None]:
train_data = pd.read_csv('../input/covid19-global-forecasting-week-1/train.csv')
test_data = pd.read_csv('../input/covid19-global-forecasting-week-1/test.csv')

# Change column names: '/' character may cause problems

train_data = train_data.rename(columns={ 'Province/State' : 'State', 'Country/Region' : 'Country',
                                         'Date' : 'DateAsString' })

test_data = test_data.rename(columns={ 'Province/State' : 'State', 'Country/Region' : 'Country',
                                         'Date' : 'DateAsString' })

# Put dates as datetime64 datatype

train_data['Date'] = pd.to_datetime(train_data['DateAsString'], format='%Y-%m-%d')
test_data['Date'] = pd.to_datetime(test_data['DateAsString'], format='%Y-%m-%d')

# If there are no states, there is only one state, called 'All'

train_data['State'] = train_data['State'].fillna('All')
test_data['State'] = test_data['State'].fillna('All')

# Remove apostrophes from country names (they break the quoted strings used in queries later)

train_data = train_data.replace("Cote d'Ivoire","Cote d Ivoire")
test_data = test_data.replace("Cote d'Ivoire","Cote d Ivoire")


# Add logarithmic values, because they are often the best quantity to model here

train_data[['LogConfirmed']] = train_data[['ConfirmedCases']].apply(np.log)
train_data[['LogFatalities']] = train_data[['Fatalities']].apply(np.log)

test_data['ConfirmedPred'] = np.zeros(test_data.shape[0])
test_data['FatalitiesPred'] = np.zeros(test_data.shape[0])

Finally, a couple of variable definitions that will be necessary later...

In [None]:
last_day = np.datetime64('2020-04-23')
first_prediction = np.datetime64('2020-03-12')

# Finding a model for prediction

## Logistic-like model

There are several models to predict the number of infected people over time. A well-known one is the [SIR model](https://en.wikipedia.org/wiki/Compartmental_models_in_epidemiology#The_SIR_model). Unfortunately, its solution [cannot easily be written in closed form](https://arxiv.org/abs/1403.2160), so it is not the best one to fit the data to.
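To make the point concrete, here is a rough sketch of how the SIR equations are typically handled: they are integrated numerically rather than evaluated from a formula. The parameters `beta` and `gamma` and the initial fractions are hypothetical values chosen only for illustration; nothing in this cell is used by the rest of the notebook.

```python
import numpy as np
from scipy.integrate import solve_ivp

def sir_rhs(t, y, beta, gamma):
    """Right-hand side of the SIR equations for the fractions S, I and R."""
    s, i, r = y
    return [-beta * s * i, beta * s * i - gamma * i, gamma * i]

# Hypothetical parameters (infection and recovery rates per day)
beta, gamma = 0.4, 0.1
y0 = [0.999, 0.001, 0.0]      # initial susceptible, infected, recovered fractions
t_eval = np.arange(0, 161)    # one point per day

sol = solve_ivp(lambda t, y: sir_rhs(t, y, beta, gamma), (0, 160), y0, t_eval=t_eval)
s, i, r = sol.y
```

Fitting such a model means running the integrator inside the optimizer for every trial set of parameters, which is why a closed-form curve is more convenient here.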

Therefore, we will try a simpler functional form. We need a "test" case, *i.e.* data from a country where the epidemic is already in a "stable" stage. The paradigmatic case is South Korea. Let's have a look at the evolution (on a logarithmic scale) of the epidemic after 100 confirmed cases (before that, the data are too noisy).

In [None]:
train_data_korea = train_data.query("Country == 'Korea, South'")

train_data_korea_100 = train_data_korea.query("ConfirmedCases > 100")

plt.plot(train_data_korea_100['LogConfirmed'], label='log(confirmed cases)')
plt.plot(train_data_korea_100['LogFatalities'], label='log(fatalities)')
plt.legend()

A simple [logistic function](https://en.wikipedia.org/wiki/Logistic_function) looks like a good fit. We are going to try it along with two other functions that look similar: the [error function](https://en.wikipedia.org/wiki/Error_function) and the [log-normal function](https://en.wikipedia.org/wiki/Log-normal_distribution) (the log-normal function has [skewness](https://en.wikipedia.org/wiki/Skewness), like the SIR model function).

We use the mean squared error (MSE) to decide which function better fits the logarithm of confirmed cases over time: the lower the MSE, the better the fit.

One may argue that we should split the data into train and test subsets and calculate the MSE over the latter. But here we are looking for a descriptive model rather than a predictive one, so using all the data is justified. One may also argue that we *do* want to predict future cases, so we should look at the predictive capacity of the model. This is true, but not here: the model that best predicts future cases in a country where the epidemic is in a "stable" stage is not necessarily good in countries where the number of cases is increasing fast. Once again, *here* we are just looking for a descriptive model.

In [None]:
def logistic_to_fit(x,k,L):
    return L/(1 + np.exp(-k*x))

def error_to_fit(x,k,L):
    return L*(1 + erf(k*x))

def log_normal_to_fit(x,k,L,x_0):
    return L*(1+ erf(np.log(x)-x_0)/k)

x_tofit = np.arange(train_data_korea_100.shape[0])
y_tofit = train_data_korea_100['LogConfirmed'].to_numpy()

model_to_fit = [ logistic_to_fit, error_to_fit, log_normal_to_fit]
popt_confirmed = []
pcov_confirmed = []

for i_model in model_to_fit:
    i_popt, i_pcov = curve_fit(i_model, x_tofit, y_tofit)
    popt_confirmed.append(i_popt)
    pcov_confirmed.append(i_pcov)
    mse = mean_squared_error(y_tofit,i_model(x_tofit,*i_popt))
    print('For model {0} the MSE is {1}'.format(i_model.__name__, mse))

It looks like the logistic function is the one that fits best! The other models don't fit the data badly either. Other kagglers have also used the logistic function to fit the data, but not in the same way: here we apply the logistic function to the *logarithm* of the number of cases. Mathematically, our fitting function is (with $x_{0} = 0$ in the code above, since $x$ is counted from the first day with more than 100 cases):

$\displaystyle \ln y = \frac{L}{1+e^{-k(x-x_{0})}} \Rightarrow y = \exp\left(\frac{L}{1+e^{-k(x-x_{0})}}\right)$

while other kagglers have tried:

$\displaystyle y = \frac{L}{1+e^{-k(x-x_{0})}}$
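To see how the two forms differ in practice, here is a small sketch. The values `k=0.23` and `L=9.0` are borrowed from the initial guess used later in this notebook, and `L=8000.0` is a hypothetical plateau picked for comparison; none of them are fitted results. In the first form the number of cases saturates at $e^{L}$, in the second at $L$.

```python
import numpy as np

def logistic(x, k, L, x_0=0.0):
    """Plain logistic curve with growth rate k, plateau L and midpoint x_0."""
    return L / (1 + np.exp(-k * (x - x_0)))

x = np.arange(0, 200)

# Form used in this notebook: logistic in log space, cases plateau at exp(L)
y_log_logistic = np.exp(logistic(x, k=0.23, L=9.0))

# Form used by other kagglers: logistic on the raw counts, cases plateau at L
y_plain_logistic = logistic(x, k=0.23, L=8000.0)
```

With $x_{0} = 0$, the first curve starts at $e^{L/2}$ cases while the second one starts at $L/2$ cases, so the two parameterizations describe quite different early growth.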

## Exponential model

The logistic equation is a good fit when the epidemic has stabilized, or in countries heading towards stabilization. At the beginning of the epidemic, however, the number of confirmed cases increases exponentially (so we can fit it with a linear equation in logarithmic space).
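As an illustration of the linear fit in logarithmic space, the sketch below uses synthetic data (100 cases on day 0 and a made-up growth rate of 0.2 per day). A straight-line fit to the logarithm recovers both the growth rate and the initial number of cases.

```python
import numpy as np
from scipy.stats import linregress

# Synthetic data: 100 cases on day 0, growing as exp(0.2 * day)
days = np.arange(10)
cases = 100.0 * np.exp(0.2 * days)

fit = linregress(days, np.log(cases))

growth_rate = fit.slope                  # slope in log space
initial_cases = np.exp(fit.intercept)    # intercept maps back to the day-0 count
doubling_time = np.log(2) / fit.slope    # days needed for the cases to double
```

The same `np.log(2) / slope` relation is what the prediction code further down uses to report the duplication time.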

How do we know whether the logistic fit or the linear fit is better? As in the previous section, we can use the MSE to decide which equation (linear or logistic) fits the data better.

But we can do even better. Why choose between the linear and the logistic fit? Let's use both!

Wait... what? The idea is simple: instead of taking only one solution, we take a weighted average of both $\ln y = ax + b$ and $\ln y = L/(1+e^{-k(x-x_{0})})$. What are the weights? Let's use the MSE itself:

$\displaystyle w_{\text{logistic}} = \frac{MSE_{\text{linear}}^{2}}{MSE_{\text{logistic}}^{2} + MSE_{\text{linear}}^{2}}$ ; $\displaystyle w_{\text{linear}} = \frac{MSE_{\text{logistic}}^{2}}{MSE_{\text{logistic}}^{2} + MSE_{\text{linear}}^{2}}$

It may sound weird that the weight of the logistic fit involves the MSE of the linear fit, but it makes sense: in the end, we want to give *less* weight to the logistic fit if the linear fit performs better (*i.e.* its MSE is *lower*).
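A small worked example of these weights, with hypothetical MSE values (the helper name `mse_weights` is made up for this sketch):

```python
def mse_weights(mse_logistic, mse_linear):
    """Return (w_logistic, w_linear) computed from the two mean squared errors."""
    total = mse_logistic ** 2 + mse_linear ** 2
    return mse_linear ** 2 / total, mse_logistic ** 2 / total

# Hypothetical errors: the logistic fit has a three times lower MSE than the linear fit
w_logistic, w_linear = mse_weights(mse_logistic=0.01, mse_linear=0.03)
```

Here the logistic fit gets a weight of 0.9 and the linear fit 0.1. The two weights always add up to 1, and equal errors give equal weights of 0.5.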

**Note:** There are better ways of getting a prediction that is a mixture of two regression models (such as the [voting regressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingRegressor.html), implemented in scikit-learn). Unfortunately, given the time constraints of this competition, I didn't have the time to implement it.

OK. Enough text (and math). Let's implement a function that predicts the number of cases, given a dataframe with the data from one Country/State.

In [None]:
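def do_fitting(state_data):
    # Fit the log of confirmed cases with a logistic and a linear model,
    # then combine both fits with MSE-based weights
    x_tofit = np.arange(state_data.shape[0])
    y_tofit = state_data['LogConfirmed'].to_numpy()
    # Average offset between log(cases) and log(fatalities)
    offset_array = state_data['LogConfirmed'].to_numpy() - state_data['LogFatalities'].to_numpy()
    offset = offset_array.mean()
    not_fit = False
    # Turn warnings into exceptions only while fitting, so that a failed
    # logistic fit can be detected without touching the global warning filters
    with warnings.catch_warnings():
        warnings.simplefilter('error')
        try:
            popt, pcov = curve_fit(logistic_to_fit, x_tofit, y_tofit, p0=[0.23, 9])
        except (Warning, RuntimeError):
            # curve_fit raises RuntimeError when the optimization does not converge
            not_fit = True
        slope, intercept, rvalue, pvalue, stderr = linregress(x_tofit, y_tofit)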

    if not_fit:
        # Fallback: use the linear fit for both models, with equal weights
        def linear_fitted(x):
            y = intercept + slope * x
            return y
        def logistics_fitted(x):
            y = intercept + slope * x
            return y

        mse_logistic = 0.5
        mse_linear = 0.5
        popt = (1e+10,1e+10)

    else:
        def logistics_fitted(x):
            return logistic_to_fit(x,*popt)
        def linear_fitted(x):
            y = intercept + slope * x
            return y

        mse_logistic = mean_squared_error(y_tofit, logistics_fitted(x_tofit))
        mse_linear = mean_squared_error(y_tofit,linear_fitted(x_tofit))

    weight_logistic2 = mse_linear*mse_linear / (mse_linear*mse_linear+mse_logistic*mse_logistic)
    weight_linear2 = mse_logistic*mse_logistic / (mse_linear * mse_linear + mse_logistic * mse_logistic)

    def weighted_fitted(x):
        y = weight_logistic2 * logistics_fitted(x) + weight_linear2 * linear_fitted(x)
        return y

    return (weighted_fitted,logistics_fitted,linear_fitted,weight_logistic2,slope,np.exp(popt[1]),offset)


The number of fatalities is too small and too noisy to give accurate predictions after one month. However, in South Korea both curves are quite similar: there is just an offset between them. This offset in the logarithmic graph translates into a factor between the two quantities: the death rate.

Therefore, a good way to predict the number of fatalities would be predicting the number of cases and then using a state-dependent offset.
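The relation between the offset and the death rate can be sketched as follows. The offset value is the average one used later in this notebook, while the number of predicted cases is a hypothetical input.

```python
import math

# Average offset between log(cases) and log(fatalities) used later in this notebook
avg_offset = 4.577772921762378

# ln(cases) - ln(fatalities) = offset  =>  fatalities = cases * exp(-offset)
fatality_rate = math.exp(-avg_offset)

# Hypothetical state with 5000 predicted cases
predicted_cases = 5000
predicted_fatalities = predicted_cases * fatality_rate
```

With this offset the implied death rate is about 1% of the confirmed cases.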

It is also a good idea to treat states with more and with fewer than 100 cases differently. Data from states with fewer than 100 cases are just too noisy, so it is hard to fit a model to them. For those states, we simply predict an exponential increase in the number of cases (it makes sense: with so few cases, we are at the first stage of the epidemic, the exponential increase). Which exponent (*i.e.* slope in the logarithmic graph) do we use? Let's use the average of the exponents (slopes) of the states with more than 100 cases. Likewise, the offset between cases and fatalities will be the average of the offsets of the states with more than 100 cases.

The average slope and offset used here are obtained [*a posteriori*](https://www.kaggle.com/enriqueabad/exponential-and-logistic-like-fit/#slope_and_offset). Again, this is not best practice for producing clean code, but the time constraint of the competition was tight.
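A minimal sketch of this exponential projection is shown below. The slope is the average value used in the code that follows; the helper `project_cases` and the starting count of 50 cases are hypothetical and only serve the example.

```python
import math

avg_slope = 0.19266443693901186          # average slope in log space, from this notebook
daily_growth = math.exp(avg_slope)       # multiplicative growth factor per day
doubling_time = math.log(2) / avg_slope  # days needed for the cases to double

def project_cases(last_confirmed, n_days):
    """Project cases forward assuming constant exponential growth."""
    return [last_confirmed * daily_growth ** d for d in range(1, n_days + 1)]

# Hypothetical state with 50 confirmed cases on its last known day
projection = project_cases(last_confirmed=50, n_days=7)
```

This slope corresponds to roughly 21% more cases every day, i.e. a doubling time of about 3.6 days.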

In [None]:
def predictions_100plus(key,dictionary,plot_it=True):
    country_values = dictionary[key]
    country, state = key.split('; ')
    texto = ''
    weighted_fitted = country_values[0]
    logistic_fitted = country_values[1]
    linear_fitted = country_values[2]
    weight_logistic2 = country_values[3]
    slope = country_values[4]
    stabilization = round(country_values[5])
    offset = country_values[6]
    first_day = country_values[7]
    first_day_100 = country_values[8]
    if weight_logistic2 > 0.5:
        texto = texto + 'Stabilizing phase. Odds {0:4.2f} to 1.\n'.format(weight_logistic2/(1-weight_logistic2))
    else:
        texto = texto + 'Exponential phase. Odds {0:4.2f} to 1\n'.format((1-weight_logistic2)/weight_logistic2)
    t_double = np.log(2) / slope
    texto = texto + 'Duplication every {0:5.2f} days\n'.format(t_double)
    texto = texto + 'The state may stabilize with {0} cases\n'.format(stabilization)

    difference = first_day_100 - first_day
    first_day_int = difference.astype(int)
    difference = last_day - first_day_100
    last_day_int = difference.astype(int)
    x_tofit = np.arange(-first_day_int, last_day_int)
    x_date = np.arange(first_day, last_day)
    y_exponential = np.exp(linear_fitted(x_tofit))
    y_logistic = np.exp(logistic_fitted(x_tofit))
    y_weighted = np.exp(weighted_fitted(x_tofit))
    y_fatalities = np.exp(weighted_fitted(x_tofit) - offset)

    if plot_it:
        fig = plt.figure()
        plt.plot(x_date, y_exponential, label='Exponential fit')
        # Set ticks and labels explicitly to work around matplotlib's default date formatting
        plt.xticks(np.arange(np.datetime64('2020-01-30'), np.datetime64('2020-04-15'),timedelta(days=15)),
                   labels=np.datetime_as_string(np.arange(np.datetime64('2020-01-30'), np.datetime64('2020-04-15'),timedelta(days=15)),unit='D'))
        plt.plot(x_date, y_logistic, label='Logistic fit')
        plt.plot(x_date, y_weighted, label='Weighted fit')
        plt.plot(x_date, y_fatalities, label='Fatalities fit')
        plt.title('{0} ({1})'.format(country, state))
        plt.legend()
        plt.text(first_day, 4.9, texto)
        fig.autofmt_xdate()
    else:
        for i_x, i_date in enumerate(x_date):
            if i_date >= first_prediction:
                query_text = "Country == '{0}' & State == '{1}' & Date == '{2}'".format(country,state,i_date)
                # .at expects the row label, not the ForecastId value
                index = test_data.query(query_text)['ForecastId'].index[0]
                y_weighted = np.exp(weighted_fitted(x_tofit[i_x]))
                y_fatalities = np.exp(weighted_fitted(x_tofit[i_x]) - offset)
                test_data.at[index,'ConfirmedPred'] = y_weighted
                test_data.at[index,'FatalitiesPred'] = y_fatalities

def predictions_100minus(country,state,data_this_state):
    #print(data_this_state)
    avg_slope = 0.19266443693901186
    avg_offset = 4.577772921762378
    avg_growth_rate = np.exp(avg_slope)
    avg_fatality_rate = np.exp(-avg_offset)
    first_date = data_this_state.head(1)['Date'].to_numpy()[0]
    last_date = data_this_state.tail(1)['Date'].to_numpy()[0]
    this_date = first_date
    while this_date <= last_date:
        query_text = "Country == '{0}' & State == '{1}' & Date == '{2}'".format(country, state, this_date)
        try:
            index_test = test_data.query(query_text)['ForecastId'].index[0]
            index_train = train_data.query(query_text)['Id'].index[0]
            #print(index_train,index_test)
            test_data.at[index_test, 'ConfirmedPred'] = train_data.loc[index_train,'ConfirmedCases']
        except IndexError:
            pass
        this_date = this_date + np.timedelta64(1,'D')
    last_confirmed = data_this_state.tail(1)['ConfirmedCases'].to_numpy()[0]
    this_date = last_date + np.timedelta64(1,'D')
    while this_date <= last_day:
        query_text = "Country == '{0}' & State == '{1}' & Date == '{2}'".format(country, state, this_date)
        index = test_data.query(query_text)['ForecastId'].index[0]
        this_confirmed = last_confirmed * avg_growth_rate
        this_fatality = this_confirmed * avg_fatality_rate
        #print(this_date,this_confirmed,this_fatality)
        test_data.at[index, 'ConfirmedPred'] = this_confirmed
        test_data.at[index, 'FatalitiesPred'] = this_fatality
        this_date = this_date + np.timedelta64(1, 'D')
        last_confirmed = this_confirmed



# Prediction

OK, we have our model defined and described, and functions to predict the number of cases in countries with more and less than 100 cases. Let's put everything together in a loop and do predictions for all countries and all states.

In [None]:
large_countries = train_data['Country'].unique()

country_confirmed_parameters = {}

for i_country in large_countries:
    print(i_country)
    data_this_country = train_data.query("Country == '{0}'".format(i_country))
    states = data_this_country.State.unique()
    for i_state in states:
        data_this_state = data_this_country.query("State == '{0}'".format(i_state))
        #print(data_this_state)
        data_this_state_100 = data_this_state.query("ConfirmedCases > 100 & Fatalities > 0")
        if data_this_state_100.shape[0] < 2:
            predictions_100minus(i_country, i_state, data_this_state)
        else:
            confirmed_parameters_this_state = do_fitting(data_this_state_100)
            first_day = data_this_state['Date'].iloc[0].to_numpy().astype('datetime64[D]')
            first_day_100 = data_this_state_100['Date'].iloc[0].to_numpy().astype('datetime64[D]')
            confirmed_parameters_this_state = confirmed_parameters_this_state + (first_day, first_day_100)
            state_key = '{0}; {1}'.format(i_country, i_state)
            country_confirmed_parameters[state_key] = confirmed_parameters_this_state
            predictions_100plus(state_key,country_confirmed_parameters,plot_it=False)



And... that's it. Everything is predicted.

For states with more than 100 confirmed cases we can even see the graph!

In [None]:

from ipywidgets import widgets, interactive

i_country = widgets.Dropdown(options=train_data['Country'].unique().tolist(),description='Country:',value='Iran')
i_state = widgets.Dropdown(description='State')
#i_state = widgets.Dropdown(options=train_data['State'].unique().tolist(),description='State:')

def update(*args):
    i_state.options = train_data.query("Country == '{0}'".format(i_country.value)).State.unique()
    
i_country.observe(update)

def plotit(i_country,i_state):
    try:
        state_key = '{0}; {1}'.format(i_country, i_state)
        if state_key == 'Iran; None':
            predictions_100plus('Iran; All',country_confirmed_parameters)
        else:
            predictions_100plus(state_key,country_confirmed_parameters)
    
    except KeyError:
        print('This Country/State combination has fewer than 100 cases')
    

interactive(plotit,i_country=i_country,i_state=i_state)


<a id='slope_and_offset'></a>

Here we compute the average slope and offset.

In [None]:
list_confirmed_parameters = list(country_confirmed_parameters.values())

slopes = []
offsets = []

for i_element in list_confirmed_parameters:
    slopes.append(i_element[4])
    offsets.append(i_element[6])

average_slope = np.array(slopes).mean()
average_offset = np.array(offsets).mean()

# Show the values that are hard-coded in predictions_100minus
print(average_slope, average_offset)


And finally, we save the results

In [None]:
prediction = test_data[['ForecastId','ConfirmedPred','FatalitiesPred']]

prediction.to_csv('submission.csv',
                  header=['ForecastId','ConfirmedCases','Fatalities'],index=False)
