![title](https://raw.githubusercontent.com/emdemor/Covid-Brasil/main/source/title.png)

<center> <h2>SUMMARY</h2> </center>


*I developed a model to predict the curves of infections, deaths and recoveries for COVID-19. Unlike time series analyses, the model uses regression techniques to predict the rates of change as a function of the current numbers of infections, deaths and recoveries. With this, I hope that the model will learn how the time derivatives change as the case numbers vary (even suddenly) and, after a numerical integration, predict future data with greater precision than a simple temporal projection with time series regressors.*

<img src="https://raw.githubusercontent.com/emdemor/Covid-Brasil/main/source/india_results.png">

<a id="toc"></a>
  
  
  <div  style="margin-top: 9px; background-color: #efefef; padding-top:10px; padding-bottom:10px;margin-bottom: 9px;box-shadow: 5px 5px 5px 0px rgba(87, 87, 87, 0.2);">
    <center>
        <h2>Content</h2>
    </center>

   
<ol>
    <li><a href="#01" style="color: #37509b;">Description</a></li>
    <li><a href="#02" style="color: #37509b;">Introduction</a></li>
    <li><a href="#03" style="color: #37509b;">Dataset</a></li>
    <li><a href="#04" style="color: #37509b;">Model</a></li>

</ol>


</div>

<a id="01" style="
  background-color: #37509b;
  border: none;
  color: white;
  padding: 2px 10px;
  text-align: center;
  text-decoration: none;
  display: inline-block;
  font-size: 10px;" href="#toc">TOC ↻</a>
  
  
<div  style="margin-top: 9px; background-color: #efefef; padding-left:35px; padding-top:10px; padding-bottom:10px;margin-bottom: 9px;box-shadow: 5px 5px 5px 0px rgba(87, 87, 87, 0.2);">
    
<h1>1. Description</h1>

   
   
<ol type="i">
<!--     <li><a href="#0101" style="color: #37509b;">Inicialização</a></li>
    <li><a href="#0102" style="color: #37509b;">Pacotes</a></li>
    <li><a href="#0103" style="color: #37509b;">Funcoes</a></li>
    <li><a href="#0104" style="color: #37509b;">Dados de Indicadores Sociais</a></li>
    <li><a href="#0105" style="color: #37509b;">Dados de COVID-19</a></li>
 -->
</ol>




<a id="0101"></a>
<h2>1.1 Epidemiological Models <a href="#01"
style="
    border-radius: 10px;
    background-color: #f1f1f1;
    border: none;
    color: #37509b;
    text-align: center;
    text-decoration: none;
    display: inline-block;
    padding: 4px 4px;
    font-size: 14px;
">↻</a></h2>

>Readers familiar with epidemiological models can skip to <a href="#0102">section 1.2</a>.


The simplest epidemiological models used for COVID-19 are based on first principles and defined by differential equations for quantities that describe the epidemiological status of the population. In the simplest cases, these quantities are:

* $D(t)$: cumulative number of deaths at time $t$;
* $C(t)$: cumulative number of confirmed cases at time $t$;
* $R(t)$: cumulative number of recovered cases at time $t$;
* $I(t)$: current number of infections at time $t$.

With the total population held constant, only three of these four functions are independent and, in general, the function $ C(t) $ is eliminated through the relation $ C (t) = I (t) + R (t ) + D (t) $.
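As a small illustration of this relation, the sketch below derives the number of active infections from cumulative counts. The numbers are invented for the example and the column names are only assumptions that mirror the ones used later in this notebook.

```python
import pandas as pd

# Hypothetical cumulative counts for five consecutive days
counts = pd.DataFrame({
    "cases":     [10, 25, 60, 110, 180],
    "recovered": [0, 2, 10, 30, 70],
    "deaths":    [0, 0, 1, 3, 6],
})

# I(t) = C(t) - R(t) - D(t)
counts["active"] = counts["cases"] - counts["recovered"] - counts["deaths"]
print(counts["active"].tolist())
```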

![title](https://raw.githubusercontent.com/emdemor/Covid-Brasil/main/source/Disease-Propagation-Simulation.gif)

An additional concept that appears in the models is *susceptibility*, which accounts for the fact that not the entire population is subject to infection. The number of susceptible people is expected to be a dynamic variable, since cases of reinfection are practically non-existent (or at least very unlikely compared with first infections). This variable is called:


* $S(t)$: number of individuals susceptible to infection at time $t$


Susceptible people, when infected, leave the group $S(t)$ and are counted in $C(t)$. As a consequence:

$\\\\$

<center>
$ \frac{dS}{dt}+\frac{dC}{dt} = 0, \tag{1}$
</center>

$\\\\$


Intuitively, the rate of infection is correlated with the number of susceptible people and the number of people actively infected. A simple model of this relationship is:

<center>
$ \frac{dC}{dt} = k_s S(t) I(t) \label{eq:propC} \tag{2a} ,$
</center>

$\\\\$

which, through equation (1), naturally leads to:

$\\\\$

<center>
$\frac{dS}{dt} = - k_s S(t) I(t) \label{eq:propS} \tag{2b} ,$
</center>

$\\\\$


where $k_s > 0$. 

Another expected behavior is that, at a given time $ t $, both the recovery rate and the death rate are proportional to the number of actively infected people:

$\\\\$


<center>
$ \frac{dR}{dt} = k_r I(t) \label{eq:propR} \tag{3a}$
</center>

$\\\\$

and

$\\\\$

<center>
$ \frac{dD}{dt} = k_d I(t) \label{eq:propD} \tag{3b}$
</center>

$\\\\$

where $k_r,k_d > 0$. 

Alternatively, the constraint $ C(t) = I (t) + R (t) + D (t) $ can be used to deduce the differential equation for $ I (t) $:

$\\\\$

<center>
$ \frac{dI}{dt} = \left( k_s S(t) - k_r - k_d\right)I(t)  \label{eq:propI} \tag{4}$
</center>

$\\\\$
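To make Eqs. (2b), (3a), (3b) and (4) concrete, the following sketch integrates them numerically with `scipy.integrate.odeint`. The population size, the initial condition and the values of $k_s$, $k_r$ and $k_d$ are arbitrary assumptions chosen only for illustration; they are not fitted to any data.

```python
import numpy as np
from scipy.integrate import odeint

def compartment_rhs(y, t, k_s, k_r, k_d):
    """Right-hand side of Eqs. (2b), (4), (3a) and (3b)."""
    S, I, R, D = y
    dS = -k_s * S * I                  # Eq. (2b)
    dI = (k_s * S - k_r - k_d) * I     # Eq. (4)
    dR = k_r * I                       # Eq. (3a)
    dD = k_d * I                       # Eq. (3b)
    return [dS, dI, dR, dD]

# Illustrative values only (assumptions, not estimates)
N = 1_000_000.0
k_s, k_r, k_d = 0.3 / N, 0.1, 0.002
y0 = [N - 10.0, 10.0, 0.0, 0.0]        # S, I, R, D at t = 0
t_grid = np.linspace(0, 200, 201)

solution = odeint(compartment_rhs, y0, t_grid, args=(k_s, k_r, k_d))
S_t, I_t, R_t, D_t = solution.T

# Cumulative confirmed cases follow from C = I + R + D
C_t = I_t + R_t + D_t
print("final cumulative cases: %d" % C_t[-1])
```

Because Eq. (1) holds, the sum $S(t) + C(t)$ stays constant along the solution, which is a convenient sanity check for the integration.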

The model described above is well founded, but it rests on several hypotheses. It has been successfully applied to describe the evolution of Covid-19 in different regions. However, these hypotheses correspond to idealized cases and, because of this, the model is often not general enough to fit Covid-19 data. When this happens, authors resort to generalizations of the model, usually based on differential equations of the type:

$\\\\$

<center>
$ \frac{dS}{dt} = f_s(S(t),I(t),D(t),R(t)), \tag{5a}$
</center>

$\\\\$

<center>
$ \frac{dI}{dt} = f_i(S(t),I(t),D(t),R(t)), \tag{5b}$
</center>

$\\\\$

<center>
$ \frac{dD}{dt} = f_d(S(t),I(t),D(t),R(t)), \tag{5c}$
</center>

$\\\\$

<center>
$ \frac{dR}{dt} = f_r(S(t),I(t),D(t),R(t)), \tag{5d}$
</center>

$\\\\$

where the algebraic functions $ f_s $, $ f_i $, $ f_d $ and $ f_r $ are determined according to the problem under study.

In my opinion, the biggest problem in applying epidemiological models to Covid-19 lies in the human factor. Several times, anomalous behaviors were observed in the Covid-19 curves (as in the cases of China and the USA): changes, abrupt or even mild, that do not follow the equations of the model. These deviations are, in general, due to human action, whether through changes in collective behavior (such as mass abandonment of isolation rules), political influence on social behavior, changes in public policy, etc. In such cases, Machine Learning models can be used.

> [1] Ranjan, R. (2020). *Predictions for COVID-19 outbreak in India using Epidemiological models*. medRxiv.


> [2] Hethcote, H. W. (2009). The basic epidemiology models: models, expressions for R0, parameter estimation, and applications. In *Mathematical understanding of infectious disease dynamics* (pp. 1-61).

![title](https://raw.githubusercontent.com/emdemor/Covid-Brasil/main/source/211462.gif)

<a id="0102"></a>
<h2>1.2 Machine Learning Models <a href="#01"
style="
    border-radius: 10px;
    background-color: #f1f1f1;
    border: none;
    color: #37509b;
    text-align: center;
    text-decoration: none;
    display: inline-block;
    padding: 4px 4px;
    font-size: 14px;
">↻</a></h2>

Machine learning models are, in general, structured differently from epidemiological models. There is a wide range of approaches, but the dynamic variables $ C (t) $, $ D (t) $, $ R (t) $ and $ I (t) $ are generally treated as time series. As this approach is data-driven, it makes no reference to the dynamic variable $ S (t) $, since the number of susceptible people is an abstract quantity that cannot be observed directly. In the short term, this type of approach is quite satisfactory, because it allows forecasting the evolution of the disease over the coming days. In the long run, however, predictions are more dispersed and inaccurate. The reason is that such models relate each dynamic variable only to time, whereas, as discussed in the section on epidemiological models, the variables $ C (t) $, $ D (t) $, $ R (t) $ and $ I (t) $ are correlated with one another.

The model proposed here is described as follows: I want to obtain a set of differential equations similar to Eqs. (5), but considering $ C (t) $ instead of $ I (t) $.

* Since it is not possible to accurately infer a time series for $ S (t) $, this variable won't be considered.
* In addition, I want to allow the $ f $ functions to depend explicitly on time too.

Under these considerations, the system reduces to:

$\\\\$


<center>
$ \frac{dC}{dt} = f_c(t,C(t),D(t),R(t)), \tag{6a}$
</center>

$\\\\$

<center>
$ \frac{dD}{dt} = f_d(t,C(t),D(t),R(t)), \tag{6b}$
</center>

$\\\\$

<center>
$ \frac{dR}{dt} = f_r(t,C(t),D(t),R(t)), \tag{6c}$
</center>

$\\\\$

* The $ f $ functions will be determined by applying regression models. This yields three machine learning models capable of predicting the rates of change of $ C $, $ D $ and $ R $, taking $ t_ {in} $, $ C (t_ {in}) $, $ D (t_ {in}) $ and $ R (t_ {in}) $ as input.

* With the rate models in hand, the system can be integrated backward in time, taking the last data point as the initial condition. This step validates the machine learning model.

* Once the model is validated, future predictions can be made.
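The steps above can be sketched on synthetic data. In the example below, a logistic curve stands in for $C(t)$, the rate of change is regressed on the current state, and the fitted rate is then integrated with `odeint`. For simplicity, the rate depends on $C$ only (no explicit time dependence and no $D$ or $R$), and all numerical values are assumptions made for the illustration.

```python
import numpy as np
from scipy.integrate import odeint
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic cumulative curve standing in for C(t): logistic growth (assumption)
K, r, t_mid = 1000.0, 0.1, 60.0
t = np.arange(0.0, 120.0)
C = K / (1.0 + np.exp(-r * (t - t_mid)))

# "Observed" rate of change, estimated from the data by finite differences
dC = np.gradient(C, t)

# Step 1: regress the rate of change on the current state
rate_model = Pipeline(steps=[
    ("polynomial", PolynomialFeatures(degree=2)),
    ("regressor", LinearRegression()),
])
rate_model.fit(C.reshape(-1, 1), dC)

# Step 2: integrate dC/dt = f(C), starting from a known data point
def rhs(c, _t):
    return [rate_model.predict(np.array([[c[0]]]))[0]]

start = 30
C_model = odeint(rhs, [C[start]], t[start:])[:, 0]

# Step 3: compare the integrated curve with the data
max_error = np.max(np.abs(C_model - C[start:]))
print("largest deviation: %.2f" % max_error)
```

The logistic rate is exactly quadratic in $C$, so a degree-2 polynomial is enough here; real data will generally not be this well behaved.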

![title](https://raw.githubusercontent.com/emdemor/Covid-Brasil/main/source/tenor.gif)

<a id="02" style="
  background-color: #37509b;
  border: none;
  color: white;
  padding: 2px 10px;
  text-align: center;
  text-decoration: none;
  display: inline-block;
  font-size: 10px;" href="#toc">TOC ↻</a>
  
  
<div  style="margin-top: 9px; background-color: #efefef; padding-left:35px; padding-top:10px; padding-bottom:10px;margin-bottom: 9px;box-shadow: 5px 5px 5px 0px rgba(87, 87, 87, 0.2);">
    
<h1>2. Introduction</h1>

   
   
<ol type="i">
<!--     <li><a href="#0101" style="color: #37509b;">Inicialização</a></li>
    <li><a href="#0102" style="color: #37509b;">Pacotes</a></li>
    <li><a href="#0103" style="color: #37509b;">Funcoes</a></li>
    <li><a href="#0104" style="color: #37509b;">Dados de Indicadores Sociais</a></li>
    <li><a href="#0105" style="color: #37509b;">Dados de COVID-19</a></li>
 -->
</ol>




<a id="0201"></a>
<h2>2.1 Modules and Packages <a href="#02"
style="
    border-radius: 10px;
    background-color: #f1f1f1;
    border: none;
    color: #37509b;
    text-align: center;
    text-decoration: none;
    display: inline-block;
    padding: 4px 4px;
    font-size: 14px;
">↻</a></h2>

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings


from datetime import datetime,timedelta
from scipy.integrate import odeint

!pip install xtlearn
from xtlearn.feature_selection import FeatureSelector
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator,TransformerMixin
from sklearn.metrics import r2_score,mean_squared_log_error,mean_absolute_error

<a id="0202"></a>
<h2>2.2 Settings<a href="#02"
style="
    border-radius: 10px;
    background-color: #f1f1f1;
    border: none;
    color: #37509b;
    text-align: center;
    text-decoration: none;
    display: inline-block;
    padding: 4px 4px;
    font-size: 14px;
">↻</a></h2>

In [None]:
warnings.filterwarnings("ignore")

# Setting seaborn style
sns.set_style("darkgrid")
colors = ["#449353","#ff9999","#4f7cac", "#80e4ed","#f8f99e","#b5ddbd"]
sns.set_palette(sns.color_palette(colors))
plt.rcParams["figure.figsize"] = [8,5]

# Setting Pandas float format
pd.options.display.float_format = '{:,.1f}'.format

SEED = 42   #The Answer to the Ultimate Question of Life, The Universe, and Everything
np.random.seed(SEED)

<a id="0203"></a>
<h2>2.3 Classes and Functions<a href="#02"
style="
    border-radius: 10px;
    background-color: #f1f1f1;
    border: none;
    color: #37509b;
    text-align: center;
    text-decoration: none;
    display: inline-block;
    padding: 4px 4px;
    font-size: 14px;
">↻</a></h2>

The creation of this class was necessary to introduce the possibility of calculating the moving average in the model's pipelines.

In [None]:
class RollingMean(BaseEstimator,TransformerMixin):
    '''
    Description
    ----------
    Provide rolling window calculations.
   
    Arguments
    ----------
    window: int
        Size of the moving window. This is the number of observations used for calculating the statistic.
        
    min_periods: int, default None
        Minimum number of observations in window required to have a value 
        (otherwise result is NA). For a window that is specified by an offset, 
        min_periods will default to 1. Otherwise, min_periods will default 
        to the size of the window.
        
    center: bool, default False
        Set the labels at the center of the window.
        

    active: boolean
        Controls whether the transformation is applied. This is useful in hyperparameter searches to test its contribution
        to the final score.

    columns: list or 'all', default 'all'
        Column names to which the rolling mean is applied.
        
    '''
    
    def __init__(self,window,
                 min_periods = None,
                 center = False,
                 active=True,
                 columns = 'all'
                ):
        self.columns = columns
        self.active = active
        self.window = window
        self.min_periods = min_periods
        self.center = center

        
    def fit(self,X,y=None):
        return self
        
    def transform(self,X):
        if not self.active:
            return X
        else:
            return self.__transformation(X)

    def __transformation(self,X_in):
        X = X_in.copy()
        
        if type(self.columns) == str:
            if self.columns == 'all':
                self.columns = list(X.columns)
        
        for col in self.columns:  
            X[col] = X[col].fillna(0).rolling(window = self.window,
                                              min_periods = self.min_periods,
                                              center = self.center
                                             ).mean()
        return X.dropna()
        
    def inverse_transform(self,X):
        return X
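One consequence of the `dropna()` call in `RollingMean` is that the transformed frame has fewer rows than the input. The sketch below shows this on a tiny, made-up series using plain pandas; the window size of 3 is chosen only to keep the example short.

```python
import pandas as pd

frame = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]})

# Centered window of 3 observations: the first and last rows have no full window
smoothed = frame["x"].rolling(window=3, center=True).mean()

# Dropping the incomplete windows shortens the series by window - 1 rows
trimmed = smoothed.dropna()
print(trimmed)
```

With a centered window, the rows are lost symmetrically at both ends of the series, so the most recent days are among those removed.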

This class allows the application of the logarithm in the attributes to be one of the steps in the pipeline.

In [None]:
class ApplyLog1p(BaseEstimator,TransformerMixin):
    '''
    Description
    ----------
    Apply numpy.log1p to specified features.
   
    Arguments
    ----------
        
    columns: list or 'all', default 'all'
        Column names to apply numpy.log1p.
        

    active: boolean
        Controls whether the transformation is applied. This is useful in hyperparameter searches to test its contribution
        to the final score.
        
    '''
    
    def __init__(self,active=True,columns = 'all'):
        self.columns = columns
        self.active = active
        
    def fit(self,X,y=None):
        return self
        
    def transform(self,X):
        if not self.active:
            return X
        else:
            return self.__transformation(X)

    def __transformation(self,X_in):
        X = X_in.copy()
        
        if type(self.columns) == str:
            if self.columns == 'all':
                self.columns = list(X.columns)
        
        for col in self.columns:  
            X[col] = np.log1p(X[col])
            
        return X
        
    def inverse_transform(self,X):
        if not self.active:
            return X
        else:
            return self.__inverse_transformation(X)

    def __inverse_transformation(self,X_in):
        X = X_in.copy()
        
        if type(self.columns) == str:
            if self.columns == 'all':
                self.columns = list(X.columns)
        
        for col in self.columns:  
            X[col] = np.expm1(X[col])
            
        return X
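The pair `numpy.log1p` / `numpy.expm1` used by `ApplyLog1p` can be checked in isolation. The short sketch below, with arbitrary sample values, shows that the transformation maps zero counts to zero and that the inverse recovers the original values.

```python
import numpy as np

values = np.array([0.0, 1.0, 10.0, 1e6])

logged = np.log1p(values)       # log(1 + x), defined for x = 0
restored = np.expm1(logged)     # exp(y) - 1, the inverse of log1p

print(logged)
print(restored)
```

A plain logarithm would return `-inf` for days with zero counts, which is a plausible reason for preferring `log1p` here.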

<a id="03" style="
  background-color: #37509b;
  border: none;
  color: white;
  padding: 2px 10px;
  text-align: center;
  text-decoration: none;
  display: inline-block;
  font-size: 10px;" href="#toc">TOC ↻</a>
  
  
<div  style="margin-top: 9px; background-color: #efefef; padding-left:35px; padding-top:10px; padding-bottom:10px;margin-bottom: 9px;box-shadow: 5px 5px 5px 0px rgba(87, 87, 87, 0.2);">
    
<h1>3. Dataset</h1>

   
   
<ol type="i">
<!--     <li><a href="#0101" style="color: #37509b;">Inicialização</a></li>
    <li><a href="#0102" style="color: #37509b;">Pacotes</a></li>
    <li><a href="#0103" style="color: #37509b;">Funcoes</a></li>
    <li><a href="#0104" style="color: #37509b;">Dados de Indicadores Sociais</a></li>
    <li><a href="#0105" style="color: #37509b;">Dados de COVID-19</a></li>
 -->
</ol>




In [None]:
country = 'India'
df = pd.read_csv(
    'https://media.githubusercontent.com/media/microsoft/Bing-COVID-19-Data/master/data/Bing-COVID19-Data.csv',
parse_dates=['Updated'])
filter_ = True
filter_ &= df['Country_Region'] == country
filter_ &= df['AdminRegion1'].isna()


dataset = df[filter_].rename(columns={
    "Updated": "date",
    "Country_Region": "country",
    "Confirmed": "cases",
    "ConfirmedChange": "change_cases",
    "Deaths": "deaths",
    "DeathsChange": "change_deaths",
    "Recovered": "recovered",
    "RecoveredChange": "change_recovered",
})[["date","country","cases","deaths",
   "recovered","change_cases","change_deaths","change_recovered"]].fillna(0)

In [None]:
# dataset = pd.read_csv('data/brazil_covid19_macro.csv',parse_dates=['date']).drop(columns=['monitoring'])
first_day = dataset.iloc[0]['date']
dataset['days'] = (dataset['date']-first_day).dt.days

dataset['change_cases'] = (dataset['cases']-dataset['cases'].shift())
dataset['change_deaths'] = (dataset['deaths']-dataset['deaths'].shift())
dataset['change_recovered'] = (dataset['recovered']-dataset['recovered'].shift())

In [None]:
# counts are scaled to millions to match the y-axis label
plt.scatter(dataset['date'],0.000001*dataset['cases'],linewidth=1,s=5,label='Confirmed Cases')
plt.scatter(dataset['date'],0.000001*dataset['deaths'],linewidth=1,s=5, label='Deaths')
plt.scatter(dataset['date'],0.000001*dataset['recovered'],linewidth=1,s=5, label='Recovered Cases')
plt.xticks(rotation=45)
plt.title('Number of Cases')
plt.xlabel('Date')
plt.ylabel('Millions of Cases')
plt.legend()
plt.show()

In [None]:
plt.plot(dataset['date'],dataset['change_cases']/1000,linewidth=1,label='Daily Cases',alpha = 0.4,color = colors[0])
plt.plot(dataset['date'],0.001*dataset['change_cases'].rolling(window = 14,center=True).mean(),
         label='Rolling Mean (14 days)',color = colors[0])

plt.xticks(rotation=45)
plt.title('Daily Cases of Covid-19')
plt.xlabel('Date')
plt.ylabel('Thousands of Cases')
plt.legend(loc = 'upper left')
plt.show()

In [None]:
plt.plot(dataset['date'],dataset['change_deaths'],linewidth=1,label='Daily Deaths',color = colors[1],alpha=0.4)
plt.plot(dataset['date'],dataset['change_deaths'].rolling(window = 14,center=True).mean(),
         label='Rolling Mean (14 days)',color = colors[1])


plt.xticks(rotation=45)
plt.title('Daily Deaths of Covid-19')
plt.xlabel('Date')
plt.ylabel('Number of Deaths')
plt.legend(loc = 'upper left')
plt.show()

In [None]:
plt.plot(dataset['date'],dataset['change_recovered']/1000,linewidth=1,label='Daily Recoveries',color = colors[2],alpha=0.4)
plt.plot(dataset['date'],0.001*dataset['change_recovered'].rolling(window = 14,center=True).mean(),
         label='Rolling Mean (14 days)',color = colors[2])


plt.xticks(rotation=45)
plt.title('Daily Recoveries of Covid-19')
plt.xlabel('Date')
plt.ylabel('Thousands of Cases')
plt.legend()
plt.show()

<a id="04" style="
  background-color: #37509b;
  border: none;
  color: white;
  padding: 2px 10px;
  text-align: center;
  text-decoration: none;
  display: inline-block;
  font-size: 10px;" href="#toc">TOC ↻</a>
  
  
<div  style="margin-top: 9px; background-color: #efefef; padding-left:35px; padding-top:10px; padding-bottom:10px;margin-bottom: 9px;box-shadow: 5px 5px 5px 0px rgba(87, 87, 87, 0.2);">
    
<h1>4. Model</h1>

   
   
<ol type="i">
<!--     <li><a href="#0101" style="color: #37509b;">Inicialização</a></li>
    <li><a href="#0102" style="color: #37509b;">Pacotes</a></li>
    <li><a href="#0103" style="color: #37509b;">Funcoes</a></li>
    <li><a href="#0104" style="color: #37509b;">Dados de Indicadores Sociais</a></li>
    <li><a href="#0105" style="color: #37509b;">Dados de COVID-19</a></li>
 -->
</ol>




<a id="0401"></a>
<h2>4.1 Pre-Processing Pipelines<a href="#04"
style="
    border-radius: 10px;
    background-color: #f1f1f1;
    border: none;
    color: #37509b;
    text-align: center;
    text-decoration: none;
    display: inline-block;
    padding: 4px 4px;
    font-size: 14px;
">↻</a></h2>

The pre-processing steps are:
1. Apply the 21-day rolling mean
2. Select only the attributes *('days', 'cases', 'deaths', 'recovered', 'change_cases', 'change_deaths', 'change_recovered')*
3. Apply the logarithmic scale (`log1p`) to the attributes

The rolling mean and the logarithmic scale use the two classes defined in the 'Classes and Functions' section.

In [None]:
# Pipeline for preprocessing
preproc = Pipeline(steps = [
    ('rolling_mean',RollingMean(window = 21,columns = [
        'cases', 'deaths', 'recovered',
        'change_cases','change_deaths','change_recovered'],center = True)),
    
    ('select',FeatureSelector(features = ['days','cases', 'deaths',
        'recovered','change_cases','change_deaths','change_recovered'])
    ),
])

# Log scaling
log_apply = ApplyLog1p(columns = 'all')

<a id="0402"></a>
<h2>4.2 Splitting Training and Test Data<a href="#04"
style="
    border-radius: 10px;
    background-color: #f1f1f1;
    border: none;
    color: #37509b;
    text-align: center;
    text-decoration: none;
    display: inline-block;
    padding: 4px 4px;
    font-size: 14px;
">↻</a></h2>

In [None]:
# Applying pre-processing pipeline
df = log_apply.transform(preproc.transform(dataset))

# full dataset
X = df[['days','cases', 'deaths', 'recovered']]
yc = df['change_cases']
yd = df['change_deaths']
yr = df['change_recovered']

train_size = 0.85
index_split = int(round(train_size*len(X)))

# training dataset
X_trn  = X.iloc[:index_split]
yc_trn = yc.iloc[:index_split]
yd_trn = yd.iloc[:index_split]
yr_trn = yr.iloc[:index_split]

#test dataset
X_tst  = X.iloc[index_split:]
yc_tst = yc.iloc[index_split:]
yd_tst = yd.iloc[index_split:]
yr_tst = yr.iloc[index_split:]

<a id="0403"></a>
<h2>4.3 Regression of the Time Derivatives of Cases, Deaths and Recovery<a href="#04"
style="
    border-radius: 10px;
    background-color: #f1f1f1;
    border: none;
    color: #37509b;
    text-align: center;
    text-decoration: none;
    display: inline-block;
    padding: 4px 4px;
    font-size: 14px;
">↻</a></h2>

### Pipelines

In [None]:
# Pipeline for regression
regression_c = Pipeline(steps = [
    ('polinomial',PolynomialFeatures(degree = 2)),
    ('regressor',LinearRegression()),
])
regression_d = Pipeline(steps = [
    ('polinomial',PolynomialFeatures(degree = 2)),
    ('regressor',LinearRegression()),
])
regression_r = Pipeline(steps = [
    ('polinomial',PolynomialFeatures(degree = 2)),
    ('regressor',LinearRegression()),
])
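To see what `PolynomialFeatures(degree = 2)` contributes to each pipeline, the sketch below expands a single made-up input row with four attributes, the same number of columns as `X`. The input values are arbitrary.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2)

# One hypothetical row with four attributes (e.g. days, cases, deaths, recovered)
sample = np.array([[1.0, 2.0, 3.0, 4.0]])
expanded = poly.fit_transform(sample)

# 1 constant + 4 linear + 10 quadratic terms (squares and pairwise products)
n_terms = expanded.shape[1]
print(n_terms)
print(expanded[0])
```

The linear regression is therefore fitted on 15 terms per target, which lets the rate of change depend on products such as time multiplied by the number of cases.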

### Metrics

In [None]:
# Errors are measured on the log1p scale, since the targets were log-transformed
regression_c.fit(X_trn,yc_trn)
print('M.A.E. of cases (train)= %.4f'%mean_absolute_error(yc_trn,regression_c.predict(X_trn)))
print('M.A.E. of cases (test) = %.4f'%mean_absolute_error(yc_tst,regression_c.predict(X_tst)))

regression_d.fit(X_trn,yd_trn)
print('\nM.A.E. of deaths (train)= %.4f'%mean_absolute_error(yd_trn,regression_d.predict(X_trn)))
print('M.A.E. of deaths (test)= %.4f'%mean_absolute_error(yd_tst,regression_d.predict(X_tst)))

regression_r.fit(X_trn,yr_trn)
print('\nM.A.E. of recovered (train)= %.4f'%mean_absolute_error(yr_trn,regression_r.predict(X_trn)))
print('M.A.E. of recovered (test)= %.4f'%mean_absolute_error(yr_tst,regression_r.predict(X_tst)))

In [None]:
predictions = log_apply.inverse_transform(pd.concat([
    X.reset_index(drop=True),
    pd.DataFrame(regression_c.predict(X),columns = ['change_cases']),
    pd.DataFrame(regression_d.predict(X),columns = ['change_deaths']),
    pd.DataFrame(regression_r.predict(X),columns = ['change_recovered']),
],axis=1))

In [None]:
plt.scatter(dataset['days'],dataset['change_cases']/1000,s=5,label='Daily Cases',alpha = 0.3,color = colors[0])
plt.plot(predictions['days'],0.001*predictions['change_cases'].rolling(window = 14,center=True).mean(),
         linewidth=2,label='Regression',color = colors[0])

# plt.xticks(rotation=45)
plt.title('Model for Daily Cases of Covid-19')
plt.xlabel('Days After First Case')
plt.ylabel('Thousands of Cases')
plt.legend(loc = 'upper left')
plt.show()

In [None]:
plt.scatter(dataset['days'],dataset['change_deaths']/1000,s=5,label='Daily Deaths',alpha = 0.3,color = colors[1])
plt.plot(predictions['days'],0.001*predictions['change_deaths'].rolling(window = 14,center=True).mean(),
         linewidth=2,label='Regression',color = colors[1])

# plt.xticks(rotation=45)
plt.title('Model for Daily Deaths by Covid-19')
plt.xlabel('Days After First Case')
plt.ylabel('Thousands of Cases')
plt.legend(loc = 'upper left')
plt.show()

In [None]:
plt.figure(figsize=(8,5))
plt.scatter(dataset['days'],dataset['change_recovered']/1000,s=5,label='Daily Recoveries',alpha = 0.3,color = colors[2])
plt.plot(predictions['days'],0.001*predictions['change_recovered'].rolling(window = 14,center=True).mean(),
         linewidth=2,label='Regression',color = colors[2])

# plt.xticks(rotation=45)
plt.title('Model for Daily Recoveries of Covid-19')
plt.xlabel('Days After First Case')
plt.ylabel('Thousands of Cases')
plt.legend(loc = 'upper left')
plt.show()

<a id="0404"></a>
<h2>4.4 Integration of Differential Equations<a href="#04"
style="
    border-radius: 10px;
    background-color: #f1f1f1;
    border: none;
    color: #37509b;
    text-align: center;
    text-decoration: none;
    display: inline-block;
    padding: 4px 4px;
    font-size: 14px;
">↻</a></h2>

Now that we have used the test dataset to check the model, we can train the model with the entire dataset to make future predictions:

In [None]:
regression_c.fit(X,yc)
regression_d.fit(X,yd)
regression_r.fit(X,yr)

predictions = log_apply.inverse_transform(pd.concat([
    X.reset_index(drop=True),
    pd.DataFrame(regression_c.predict(X),columns = ['change_cases']),
    pd.DataFrame(regression_d.predict(X),columns = ['change_deaths']),
    pd.DataFrame(regression_r.predict(X),columns = ['change_recovered']),
],axis=1))

With regressions, we have a machine learning model for the following system of differential equations:

$\\\\$


<center>
$ \frac{dC}{dt} = f_c(t,C(t),D(t),R(t)), $
</center>

$\\\\$

<center>
$ \frac{dD}{dt} = f_d(t,C(t),D(t),R(t)), $
</center>

$\\\\$

<center>
$ \frac{dR}{dt} = f_r(t,C(t),D(t),R(t)),$
</center>

$\\\\$
This can be solved numerically with the `scipy` library.

First, the system of differential equations is defined as:

In [None]:
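def diff_eq(x,t):
    """
    Right-hand side of the system dx/dt = f(t, x), with x = (C, D, R),
    evaluated with the fitted regressions.

    """
    # state vector, in the same order as the columns of X: cases, deaths, recovered
    c,d,r = x
    lnt = np.log1p(t)
    lnx = np.log1p(x)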
    
    
    # mathematical equations
    DiffC = np.expm1(regression_c.predict([[lnt]+list(lnx)]))[0]
    DiffD = np.expm1(regression_d.predict([[lnt]+list(lnx)]))[0]
    DiffR = np.expm1(regression_r.predict([[lnt]+list(lnx)]))[0]

    return np.array([DiffC,DiffD,DiffR])

def neg_diff_eq(x,t):
    return -diff_eq(x,-t)
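The role of `neg_diff_eq` is to integrate backward in time: with the substitution $s = -t$, the system $dx/dt = f(x,t)$ becomes $dx/ds = -f(x,-s)$, and increasing $s$ corresponds to decreasing $t$. The sketch below checks this trick on a toy equation with a known solution, $dx/dt = x/(1+t)$ with $x(t) = 1 + t$. This equation is an assumption chosen for the test and is not part of the model.

```python
import numpy as np
from scipy.integrate import odeint

def rhs(x, t):
    # toy equation dx/dt = x / (1 + t), whose solution through x(0) = 1 is x(t) = 1 + t
    return x / (1.0 + t)

def neg_rhs(x, s):
    # same time-reversal pattern as neg_diff_eq
    return -rhs(x, -s)

# Start from a known point at t0 = 4 and integrate back to t = 0
t0, x0 = 4.0, 5.0
s_grid = np.linspace(-t0, 0.0, 50)
x_reversed = odeint(neg_rhs, [x0], s_grid)[:, 0]

# Reorder so that time increases again
t_back = -s_grid[::-1]
x_back = x_reversed[::-1]
print(x_back[0])   # value recovered at t = 0
```

The reordering at the end mirrors the `[::-1]` slices used when joining the left and right solutions.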

Performing the integrations:

In [None]:
# defining the limits
t_min = 20
t_max   = 500
n_points = 500

# initial conditions
t0,*x0 = np.expm1(list(X.iloc[-50]))

# counting the points
n_points_right = int(round(n_points*(t_max-t0) / (t_max-t_min)))
n_points_left = int(round(n_points*(t0-t_min) / (t_max-t_min)))

# right integrate
days_list = np.linspace(t0,t_max,n_points_right)
x = odeint(diff_eq,x0,days_list)

# left integrate
neg_days_list = np.linspace(-t0,-t_min,n_points_left)
neg_x = odeint(neg_diff_eq,x0,neg_days_list)

# joining the left and right solutions
t_full = np.concatenate((-neg_days_list[::-1], days_list))
x_full = np.concatenate((neg_x[::-1], x))

print('Deaths: %d' % x[-1,1])

<a id="0405"></a>
<h2>4.5 Comparison between model prediction and data<a href="#04"
style="
    border-radius: 10px;
    background-color: #f1f1f1;
    border: none;
    color: #37509b;
    text-align: center;
    text-decoration: none;
    display: inline-block;
    padding: 4px 4px;
    font-size: 14px;
">↻</a></h2>

In [None]:
plt.scatter(dataset['days'],0.000001*dataset['cases'],marker='.',s=80,alpha=0.3,color = colors[0],
            label='Confirmed Cases')
plt.plot(t_full,0.000001*x_full[:,0],color='black',linestyle='dashed',linewidth=1.3,label='Model')

plt.title('Model for Confirmed Cases of Covid-19')
plt.xlabel('Days After First Case')
plt.ylabel('Millions of Cases')
plt.legend(loc = 'upper left')
plt.show()

In [None]:
plt.scatter(dataset['days'],0.001*dataset['deaths'],marker='.',s=80,alpha=0.3,color = colors[1],label='Deaths')
plt.plot(t_full,0.001*x_full[:,1],color='black',linestyle='dashed',linewidth=1.3,label='Model')

plt.title('Model for Deaths by Covid-19')
plt.xlabel('Days After First Case')
plt.ylabel('Thousands of Cases')
plt.legend(loc = 'upper left')
plt.show()

In [None]:
plt.scatter(dataset['days'],0.000001*dataset['recovered'],marker='.',s=80,alpha=0.3,color = colors[2],
            label='Recoveries')
plt.plot(t_full,0.000001*x_full[:,2],color='black',linestyle='dashed',linewidth=1.3,label='Model')

plt.title('Model for Recovered Cases of Covid-19')
plt.xlabel('Days After First Case')
plt.ylabel('Millions of Cases')
plt.legend(loc = 'upper left')
plt.show()

<a id="0406"></a>
<h2>4.6 Class for Complete Processing<a href="#04"
style="
    border-radius: 10px;
    background-color: #f1f1f1;
    border: none;
    color: #37509b;
    text-align: center;
    text-decoration: none;
    display: inline-block;
    padding: 4px 4px;
    font-size: 14px;
">↻</a></h2>

In the previous sections, each step of the process was carried out in detail. However, to search for the best hyperparameters of the model, you must run the entire sequence of steps again for each attempt. Instead, it is better to automate the process and the best way to do this is by defining a new class:

In [None]:
class Covid19Regressor(BaseEstimator,TransformerMixin):
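    '''
    Description
    ----------
    Bundles the steps of the previous sections (rolling mean, log scale and
    regression of the rates of change) in a single estimator.

    Arguments
    ----------
    confirmed, deaths, recovered: str
        Names of the columns with the cumulative counts.

    confirmed_rate, deaths_rate, recovered_rate: str
        Names of the columns with the daily changes.

    time: str
        Name of the column with the number of days.

    window, min_periods, center:
        Parameters of the rolling mean (see RollingMean).

    polynomial_degree: int
        Degree of the polynomial features.

    regressor, regressor_parameters:
        Regressor class and its parameters.

    t_initial, t_min, t_max, n_points:
        Initial time, limits and number of points of the numerical integration.

    '''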
    
    def __init__(self,
                 confirmed = 'cases', 
                 deaths = 'deaths',
                 recovered = 'recovered',
                 
                 confirmed_rate = 'change_cases', 
                 deaths_rate = 'change_deaths',
                 recovered_rate = 'change_recovered',
                 
                 time = 'days',
                 window = 7,
                 min_periods = None,
                 center = True,
                 polynomial_degree = 2,
                 regressor = LinearRegression,
                 regressor_parameters = {},
                 t_initial = 'last',
                 t_min = 20,
                 t_max = 500,
                 n_points = 500
                 
                ):
        
        self.confirmed = confirmed
        self.confirmed_rate = confirmed_rate
        self.deaths = deaths
        self.deaths_rate = deaths_rate
        self.recovered = recovered
        self.recovered_rate = recovered_rate
        self.time = time
        self.window = window
        self.min_periods = min_periods
        self.center = center
        self.polynomial_degree = polynomial_degree
        self.regressor = regressor
        self.regressor_parameters = regressor_parameters
        self.t_initial = t_initial
        self.t_min = t_min
        self.t_max = t_max
        self.n_points = 1+n_points
        
        
    def fit(self,X,y):
        
        # Receiving the data
        self.X = X[[self.time,self.confirmed,self.deaths,self.recovered]].copy()
        self.y = y[[self.confirmed_rate,self.deaths_rate,self.recovered_rate]].copy()
        
        # Evaluating the rolling mean for X
        for col in [self.confirmed,self.deaths,self.recovered]:  
            self.X[col] = self.X[col].fillna(0).rolling(window = self.window,
                                              min_periods = self.min_periods,
                                              center = self.center
                                             ).mean()
            
        # Evaluating the rolling mean for y    
        for col in [self.confirmed_rate,self.deaths_rate,self.recovered_rate]:  
            self.y[col] = self.y[col].fillna(0).rolling(window = self.window,
                                              min_periods = self.min_periods,
                                              center = self.center
                                             ).mean()
            
        # Applying the log scale
        self.X[self.time] = np.log1p(self.X[self.time])

        for col in [self.confirmed,self.deaths,self.recovered]: 
            self.X[col] = np.log1p(self.X[col])
            
        for col in [self.confirmed_rate,self.deaths_rate,self.recovered_rate]: 
            self.y[col] = np.log1p(self.y[col])
        
        # Dropping NaN
        temp = pd.concat([self.X,self.y],axis=1).dropna()
        self.X = temp[[self.time,self.confirmed,self.deaths,self.recovered]]
        self.y = temp[[self.confirmed_rate,self.deaths_rate,self.recovered_rate]]
            
        # Pipeline for regression
        regression_c = Pipeline(steps = [
            ('polinomial',PolynomialFeatures(degree = self.polynomial_degree)),
            ('regressor',self.regressor(**self.regressor_parameters)),
        ])
        regression_d = Pipeline(steps = [
            ('polinomial',PolynomialFeatures(degree = self.polynomial_degree)),
            ('regressor',self.regressor(**self.regressor_parameters)),
        ])
        regression_r = Pipeline(steps = [
            ('polinomial',PolynomialFeatures(degree = self.polynomial_degree)),
            ('regressor',self.regressor(**self.regressor_parameters)),
        ])
        
        # Fitting model
        regression_c.fit(self.X,self.y[self.confirmed_rate])
        regression_d.fit(self.X,self.y[self.deaths_rate])
        regression_r.fit(self.X,self.y[self.recovered_rate])
        
        
        # Predicted Rates
        self.predicted_rate = pd.concat([
            self.X.reset_index(drop=True),
            pd.DataFrame(regression_c.predict(self.X),columns = ['pred_'+self.confirmed_rate]),
            pd.DataFrame(regression_d.predict(self.X),columns = ['pred_'+self.deaths_rate]),
            pd.DataFrame(regression_r.predict(self.X),columns = ['pred_'+self.recovered_rate]),
        ],axis=1)
        
        for col in self.predicted_rate.columns:
            self.predicted_rate[col] = np.expm1(self.predicted_rate[col])
        
        
        # Defining the differential equations
        def diff_eq(x,t):
            """
            Function returning the differential equations of the model

            """
            # setting the functions (same order as the columns of self.X)
            c,d,r = x
            lnt = np.log1p(t)
            lnx = np.log1p(x)


            # mathematical equations
            DiffC = np.expm1(regression_c.predict([[lnt]+list(lnx)]))[0]
            DiffD = np.expm1(regression_d.predict([[lnt]+list(lnx)]))[0]
            DiffR = np.expm1(regression_r.predict([[lnt]+list(lnx)]))[0]

            return np.array([DiffC,DiffD,DiffR])

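        def neg_diff_eq(x,t):
            # time-reversed system: with s = -t, dx/ds = -f(x,-s);
            # used to integrate backwards from t0 to t_min
            return -diff_eq(x,-t)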
        
        # Day used as initial condition: the last fitted day or a user-given day
        if self.t_initial == 'last':
            t_initial = int(round(list(self.predicted_rate[self.time])[-1]))
        else:
            t_initial = self.t_initial
        
        
        ind_ref = self.predicted_rate[self.time][
            round(self.predicted_rate[self.time]).astype(int) == t_initial].index[0]
        
        # initial conditions
        t0,*x0 = np.expm1(list(self.X.iloc[ind_ref]))

        n_points_right = int(round(self.n_points*(self.t_max-t0) / (self.t_max-self.t_min)))
        n_points_left = int(round(self.n_points*(t0-self.t_min) / (self.t_max-self.t_min)))


        # integrating forward in time, from t0 to t_max
        days_list = np.linspace(t0,self.t_max,n_points_right)
        x = odeint(diff_eq,x0,days_list)

        # integrating backward in time, from t0 to t_min
        neg_days_list = np.linspace(-t0,-self.t_min,n_points_left)
        neg_x = odeint(neg_diff_eq,x0,neg_days_list)

        # joining the two solutions in increasing order of time
        self.t_ode = np.concatenate((-neg_days_list[::-1], days_list))
        self.x_ode = np.concatenate((neg_x[::-1], x))
        
        self.predictions = pd.concat([
            pd.DataFrame(self.t_ode,columns = [self.time]),
            pd.DataFrame(self.x_ode,columns = [self.confirmed,self.deaths,self.recovered])]
        ,axis=1)

        return self
        
    def transform(self,X):
        return X
    
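    def predict(self,X):
        # Interpolates the integrated curves at the times given in X;
        # rows of the result: confirmed, deaths, recovered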
        return np.array([
            np.interp(X, self.t_ode, self.x_ode[:,0]),
            np.interp(X, self.t_ode, self.x_ode[:,1]),
            np.interp(X, self.t_ode, self.x_ode[:,2]),
        ])
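
The `fit` method integrates to the left of the initial day by solving a time-reversed system (`neg_diff_eq`) and then flipping the result. The sketch below illustrates that idea on a toy equation with a known solution. It is only an illustration under stated assumptions: the right-hand side `f`, the helpers `rk4` and `linspace`, and all numeric values are hypothetical, and a hand-written Runge-Kutta stepper is used in place of `odeint` so that the example runs with the standard library alone.

```python
def rk4(f, x0, t_grid):
    """Integrate dx/dt = f(x, t) over t_grid with the classic RK4 scheme (scalar state)."""
    xs = [x0]
    for t_a, t_b in zip(t_grid[:-1], t_grid[1:]):
        h = t_b - t_a
        x = xs[-1]
        k1 = f(x, t_a)
        k2 = f(x + 0.5 * h * k1, t_a + 0.5 * h)
        k3 = f(x + 0.5 * h * k2, t_a + 0.5 * h)
        k4 = f(x + h * k3, t_b)
        xs.append(x + h * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0)
    return xs


def linspace(a, b, n):
    """Return n evenly spaced values from a to b (both included)."""
    step = (b - a) / (n - 1)
    return [a + i * step for i in range(n)]


# Toy right-hand side (an assumption for illustration): dx/dt = x / (1 + t),
# whose exact solution through (t0, x0) is x(t) = x0 * (1 + t) / (1 + t0).
def f(x, t):
    return x / (1.0 + t)


# Time-reversed system: with s = -t we get dx/ds = -f(x, -s).
def neg_f(x, s):
    return -f(x, -s)


t0, x0, t_min, t_max = 10.0, 22.0, 2.0, 20.0

# Forward integration, from t0 to t_max
t_right = linspace(t0, t_max, 101)
x_right = rk4(f, x0, t_right)

# Backward integration, from t0 to t_min, expressed in the reversed time s = -t
s_left = linspace(-t0, -t_min, 81)
x_left = rk4(neg_f, x0, s_left)

# Join both pieces in increasing order of t, dropping the duplicated point at t0
t_joined = [-s for s in reversed(s_left)][:-1] + t_right
x_joined = x_left[::-1][:-1] + x_right

# Compare with the exact solution
exact = [x0 * (1.0 + t) / (1.0 + t0) for t in t_joined]
max_err = max(abs(a - b) for a, b in zip(x_joined, exact))
print('points: %d, max abs error: %.2e' % (len(t_joined), max_err))
```

Unlike this sketch, the class keeps both copies of the point at `t0` when concatenating; dropping one of them, as done here, keeps the time grid strictly increasing for the later interpolation.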
        

<a id="0407"></a>
<h2>4.7 Size of Training Dataset<a href="#04"
style="
    border-radius: 10px;
    background-color: #f1f1f1;
    border: none;
    color: #37509b;
    text-align: center;
    text-decoration: none;
    display: inline-block;
    padding: 4px 4px;
    font-size: 14px;
">↻</a></h2>

In [None]:
cov19 = Covid19Regressor(window = 7,polynomial_degree = 2,t_initial = 300,t_max=365)
cov19.fit(dataset[['days','cases','deaths','recovered']],
    dataset[['change_cases','change_deaths','change_recovered']])

t_list = np.arange(20,365,1)
x_list = cov19.predict(t_list)

plt.figure(figsize=(8,5))
plt.scatter(dataset['days'],0.000001*dataset['cases'],marker='.',s=80,alpha=0.3,color = colors[0],label='Confirmed Cases')
plt.plot(t_list,0.000001*x_list[0],color=colors[0],linestyle='dashed',linewidth=1.3,label='Model - Cases')

plt.scatter(dataset['days'],0.000001*dataset['deaths'],marker='.',s=80,alpha=0.3,color = colors[1],label='Deaths')
plt.plot(t_list,0.000001*x_list[1],color=colors[1],linestyle='dashed',linewidth=1.3,label='Model - Deaths')

plt.scatter(dataset['days'],0.000001*dataset['recovered'],marker='.',s=80,alpha=0.3,color = colors[2],label='Recoveries')
plt.plot(t_list,0.000001*x_list[2],color=colors[2],linestyle='dashed',linewidth=1.3,label='Model - Recoveries')


plt.title('Model for Covid-19')
plt.xlabel('Days After First Case')
plt.ylabel('Millions of People')
# plt.legend(loc = 'lower right')
plt.legend(bbox_to_anchor=(0.02, 0.6), loc=3, borderaxespad=0.)
plt.show()
print('Deaths: %d' % x_list[1][-1])