# Estimating the SIR parameters

In [None]:
import time
time.asctime()

We start with the SIR model

(1) $\frac{d S}{dt} = - \beta \frac{S I}{N}$

(2) $\frac{d I}{dt} = \beta \frac{S I}{N} - \gamma I$,

which describes the evolution of an epidemic, see e.g. https://en.wikipedia.org/wiki/Compartmental_models_in_epidemiology#The_SIR_model. Here $S$ is the number of susceptible people who have not yet been exposed to the virus, $I$ is the number of infected people who can transmit the disease, and $t$ is time. The total population $N = S + I + R$, where $R$ is the number of people who have recovered and are no longer infectious, is constant.
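As a rough illustration of Eqs. (1) and (2), the sketch below integrates the model with a classical fourth-order Runge-Kutta step in plain Python. The parameter values ($\beta = 0.25$, $\gamma = 1/15$, $N = 10^6$, ten initial infections) and the helper names `sir_rhs` and `simulate_sir` are assumptions chosen for illustration; they are not taken from the dataset.

```python
def sir_rhs(s, i, beta, gamma, n):
    """Right-hand sides of Eqs. (1) and (2): returns (dS/dt, dI/dt)."""
    new_infections = beta * s * i / n
    return -new_infections, new_infections - gamma * i


def simulate_sir(beta, gamma, n, i0, days, steps_per_day=10):
    """Integrate the SIR model with RK4 and return one (S, I, R) tuple per day."""
    h = 1.0 / steps_per_day
    s, i = n - i0, i0
    # R follows from the conservation law N = S + I + R
    history = [(s, i, n - s - i)]
    for _day in range(days):
        for _step in range(steps_per_day):
            k1s, k1i = sir_rhs(s, i, beta, gamma, n)
            k2s, k2i = sir_rhs(s + 0.5 * h * k1s, i + 0.5 * h * k1i, beta, gamma, n)
            k3s, k3i = sir_rhs(s + 0.5 * h * k2s, i + 0.5 * h * k2i, beta, gamma, n)
            k4s, k4i = sir_rhs(s + h * k3s, i + h * k3i, beta, gamma, n)
            s += h * (k1s + 2 * k2s + 2 * k3s + k4s) / 6
            i += h * (k1i + 2 * k2i + 2 * k3i + k4i) / 6
        history.append((s, i, n - s - i))
    return history


# Illustrative values only (assumed, not fitted to any data)
trajectory = simulate_sir(beta=0.25, gamma=1 / 15, n=1_000_000, i0=10, days=200)
peak_day = max(range(len(trajectory)), key=lambda d: trajectory[d][1])
print("I(t) peaks on day", peak_day)
print("susceptible fraction after 200 days:", trajectory[-1][0] / 1_000_000)
```

Because $R$ is computed from $N - S - I$, the total population is conserved by construction; only $S$ and $I$ are actually integrated.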

The critical parameters of the model are $\beta$ (the rate of infection or transmission) and $\gamma$ (the rate of recovery). The aim of this notebook is to estimate $\beta$ and $\gamma$ from the covid19-global-forecasting dataset. Note that this dataset does not provide $S$ or $I$; instead it gives the cumulative number of confirmed COVID-19 cases on a daily basis for many countries.


## A differential equation for the cumulative number of confirmed cases

Let 

(3) $C(t) \equiv p \int_0^t I(\tau) d\tau$

be the cumulative number of COVID-19 infected people detected at time $t$. Here $p$ is the unknown fraction of infected people who are detected. (It's likely that $p \ll 1$ since not everybody who shows symptoms gets tested.)

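After inserting (3), Equation (1) can be integrated directly. A second expression for $S$ follows from the conservation law $N = S + I + R$, with $I = \frac{d}{dt} \frac{C}{p}$ and $R = \gamma C/p$ (using $dR/dt = \gamma I$ and $R(0) = 0$). This gives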

$S(t) = S(0) \exp(-\frac{\beta C(t)}{N p})$

$S(t) = N - \frac{\gamma C}{p} - \frac{d}{dt} \frac{C}{p}$

After eliminating $S(t)$ and taking $S(0) \approx N$, one gets

(4) $\frac{dC}{C} = dt \, \beta \left(\frac{\beta C}{N p}\right)^{-1} \left( 1 - \exp \left(- \frac{\beta C}{N p}\right) \right) - dt \, \gamma$

At the onset of the epidemic, $x \equiv \beta C/(N p) \ll 1$ and the term $(1 - \exp(-x))/x$ on the right-hand side can be approximated by $(1 - (1 - x + x^2/2 - \cdots))/x \approx 1 - x/2$:

(5) $\frac{\Delta C}{C} \approx (\beta - \gamma)  - \frac{\beta^2}{2 N p} C$,

where $\Delta C$ is the daily increment of confirmed cases (the time step is one day). This equation states that the relative increment of cases ($\Delta C/C$) is approximately a linear function of $C$ when $\beta C/(N p) \ll 1$.
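To get a feel for how far the linear form (5) can be trusted, the sketch below compares it with the un-linearised relative growth rate $\frac{N p}{C}\left(1 - e^{-\beta C/(N p)}\right) - \gamma$ that results from eliminating $S(t)$ with $S(0) \approx N$. The values of $\beta$, $\gamma$ and $N p$ are assumptions picked for illustration, and the function names are hypothetical.

```python
import math


def relative_growth_exact(c, beta, gamma, n_p):
    """Relative growth rate (1/C) dC/dt before linearisation, assuming S(0) ~ N."""
    x = beta * c / n_p
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for very small x
    return beta * (-math.expm1(-x)) / x - gamma


def relative_growth_linear(c, beta, gamma, n_p):
    """Linearised relative growth rate, Eq. (5)."""
    return (beta - gamma) - beta**2 * c / (2.0 * n_p)


# Illustrative values only (assumed, not fitted to any data)
beta, gamma, n_p = 0.25, 1 / 15, 100_000
for c in (100, 1_000, 10_000, 50_000):
    exact = relative_growth_exact(c, beta, gamma, n_p)
    linear = relative_growth_linear(c, beta, gamma, n_p)
    print(f"C={c:>6}  exact={exact:.5f}  linear={linear:.5f}  diff={exact - linear:.1e}")
```

The difference grows roughly like $\beta x^2/6$ with $x = \beta C/(N p)$, so the straight line is a good description only while the confirmed cases remain a small fraction of $N p$.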


## Coefficient $\gamma$ as the inverse of the time it takes to recover

There is no reason to assume that people recover at different rates across countries, so $\gamma$ can be taken to be the same for all countries. Coefficient $\gamma$ can be inferred from the average number of days patients spend in hospital (12 according to https://abc7ny.com/health/aetna-waives-patient-payments-for-coronavirus-hospital-stays/6049429/). The typical number of days it takes for a person to flush out the virus would have to be larger; let's assume 15 days. Hence $\gamma \approx 1/15$ per day.

## Coefficient $\beta$ estimated from the $y$-intercept
Coefficient $\beta$ may vary in the range $0.05 - 0.2$, depending on social distancing and other implemented measures. Once $\gamma$ is nailed down, use the $y$-intercept $\beta - \gamma$ in Eq. (5) to estimate $\beta$. Note that the $y$-intercept relates to the basic reproduction number $R_0 \equiv \beta/\gamma$ through $\beta - \gamma = \gamma (R_0 - 1)$. For the epidemic to start, $R_0 > 1$, i.e. the $y$-intercept must be positive.
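The arithmetic can be sketched as follows, using the standard definition $R_0 = \beta/\gamma$. The intercept value of 0.2 per day is a hypothetical input chosen for illustration, and the function names are not part of the notebook.

```python
def beta_from_intercept(intercept, gamma):
    """The y-intercept of Eq. (5) is beta - gamma, so beta = intercept + gamma."""
    return intercept + gamma


def basic_reproduction_number(beta, gamma):
    """Standard SIR definition R0 = beta / gamma."""
    return beta / gamma


gamma = 1 / 15        # assumed recovery rate, per day
intercept = 0.2       # hypothetical fitted y-intercept, per day

beta = beta_from_intercept(intercept, gamma)
r0 = basic_reproduction_number(beta, gamma)
print(f"beta = {beta:.3f} per day, R0 = {r0:.2f}")

# Equivalent check: the intercept equals gamma * (R0 - 1)
print(f"gamma * (R0 - 1) = {gamma * (r0 - 1):.3f}")
```

A positive intercept corresponds to $R_0 > 1$ (growing epidemic); a negative intercept would mean $R_0 < 1$.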

## The slope gives the population tested for the virus $N p$
The slope of $\Delta C/C$ as a function of the number of confirmed cases $C$ is $-\beta^2/(2 N p)$. Apart from the variability in $\beta$ and the proportion of people tested $p$, the slope is expected to be flatter for large populations (large $N$). 
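A minimal sketch of this relation, in both directions, is given below. The numbers are assumptions for illustration and the function names are hypothetical; the point is only that, for fixed $\beta$ and $p$, doubling $N$ halves the magnitude of the slope.

```python
def slope_from_population(beta, n_p):
    """Slope of Delta C / C versus C predicted by Eq. (5)."""
    return -beta**2 / (2.0 * n_p)


def detected_population(beta, slope):
    """Invert the slope of Eq. (5) to obtain N*p."""
    if slope >= 0:
        raise ValueError("Eq. (5) predicts a negative slope")
    return -beta**2 / (2.0 * slope)


beta = 0.25  # assumed transmission rate, per day
for n_p in (50_000, 100_000):
    slope = slope_from_population(beta, n_p)
    recovered = detected_population(beta, slope)
    print(f"N*p = {n_p:>7}  slope = {slope:.2e}  recovered N*p = {recovered:.0f}")
```

Since $\beta$ enters squared, any error in the estimate of $\beta$ is amplified in the inferred $N p$.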

## Looking at the confirmed cases

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

In [None]:
df = pd.read_csv('/kaggle/input/covid19-global-forecasting-week-2/train.csv')

# remove rows that have fewer than 100 cases (the SIR model is only valid for large S, I and N)
df_100 = df.loc[df.ConfirmedCases >= 100]

# combine country and region into a single column
df_100 = df_100.fillna('-')

df_100['country_state'] = df_100.Country_Region + ' ' + df_100.Province_State

In [None]:
import plotly.express as px
import plotly.graph_objects as go

In [None]:
fig = px.line(df_100, x='Date', y='ConfirmedCases', color='country_state')
fig.update_layout(yaxis_type='log')
fig.show()

In [None]:
# focus on a few countries and provinces/states
countries = ['Italy -', 'Spain -', 'Germany -', 'Iran -', 'Switzerland -', 'Korea, South -', 'Netherlands -',
      'Singapore -', 'France -', 'Austria -', 'New Zealand -', 'United Kingdom -',
      'US New York', 'US New Jersey', 'US Washington', 'US California', 'China Hubei', 'China Hong Kong']

fig = go.Figure()
for country in countries:
    d = df_100.loc[df_100.country_state == country]
    fig.add_trace(go.Scatter(x=d.Date, y=d.ConfirmedCases, mode='lines', name=country))
fig.update_layout(yaxis_type='log', yaxis_title='ConfirmedCases')
fig.show()

## Looking at the relative daily increases $\Delta C/C$

In [None]:
def getDailyIncrement(C):
    """
    Estimate Delta C with second-order accurate centered differences
    """
    
    dC = np.zeros(C.shape, np.float32)
    n = C.shape[0]

    if n >= 3:
        # need at least three points
        
        # in the interior, apply second order differencing
        dC[1:-1] = 0.5*(C[2:] - C[:-2])
        
        # copy the nearest interior value to the first and last points
        dC[0] = dC[1]
        dC[-1] = dC[-2]
    
    return dC

def addDailyIncrementColumn(df):
    """
    Add column with time derivative of the cumulative number of cases
    """
    # get the list of country/regions
    countries = df['Country_Region'].unique()
    
    # initialize
    df['dC'] = np.zeros(df.shape[0], np.float32)  # d C/dt
    
    for country in countries:
        
        df2 = df.loc[df.Country_Region == country]
        
        # get the list of states
        states = df2['Province_State'].unique()
        mskCountry = (df.Country_Region == country)
        
        if len(states) == 1:
            # no states, just one country
            C = df.loc[mskCountry, 'ConfirmedCases'].array
            dC = getDailyIncrement(C)
            df.loc[mskCountry, 'dC'] = dC
        else:
            # treat each state separately
            for state in states:
                msk = mskCountry & (df.Province_State == state)
                C = df.loc[msk, 'ConfirmedCases'].array
                dC = getDailyIncrement(C)
                df.loc[msk, 'dC'] = dC
  

In [None]:
# add "dC" (Delta C) column
addDailyIncrementColumn(df_100)

In [None]:
fig = go.Figure()
for country in countries:
    d = df_100.loc[df_100.country_state == country]
    fig.add_trace(go.Scatter(x=d.ConfirmedCases, y=d.dC/d.ConfirmedCases, mode='lines', name=country))
fig.update_layout(xaxis_title="C = # of confirmed cases", 
                  yaxis_title="Delta C/C = relative daily increase of confirmed cases",)
fig.show()

In [None]:
# extract beta and N*p from linear regression
from sklearn import linear_model
lm = linear_model.LinearRegression()
# choose gamma
gamma = 1/15.
res = {'country': [],
       'y-intercept': [],
       'slope': [],
       'beta': [],
       'N*p': []}
for country in countries:
    d = df_100.loc[df_100.country_state == country]
    x = d.ConfirmedCases.to_numpy()
    y = d.dC.to_numpy()/x
    lm.fit(x.reshape(-1, 1), y)  # fit Delta C/C = intercept + slope * C
    res['country'].append(country)
    res['y-intercept'].append(lm.intercept_)
    res['slope'].append(lm.coef_[0])
    beta = lm.intercept_ + gamma
    res['beta'].append(beta)
    Np = - beta**2 / (2*lm.coef_[0])
    res['N*p'].append(Np)
df_res = pd.DataFrame(res)
df_res

## Summary 

Discarding data with $C < 10^4$, we see that most trajectories settle to a value $\beta - \gamma$ of about 0.2 per day and a negative slope of $\sim -0.1/(8 \times 10^{4}) \approx -1.3 \times 10^{-6}$.

Given an estimate of $\gamma \approx 1/15 \approx 0.07$ per day, this would mean that $\beta \approx 0.2 - 0.3$ per day. There are some regional differences: New York, Spain and Italy have higher $\beta$ values, while Singapore, Washington and Switzerland have lower ones (assuming $\gamma$ is the same across countries). Washington State might end up with 15k cases and Switzerland with 30k cases. For Singapore, the number of confirmed cases may never reach 2k.
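One way to turn a fitted line into a rough plateau estimate is to extrapolate Eq. (5) to the point where $\Delta C/C$ reaches zero, i.e. $C \approx -\text{intercept}/\text{slope}$. The sketch below does this for the round numbers quoted above; the function name is hypothetical. Note that the linearisation assumes $\beta C/(N p) \ll 1$, which no longer holds near the plateau, so the result is an order-of-magnitude indication at best.

```python
def plateau_cases(intercept, slope):
    """C at which the fitted line intercept + slope * C crosses zero."""
    if slope >= 0:
        raise ValueError("a plateau requires a negative slope")
    return -intercept / slope


# Round numbers from the summary: intercept ~ 0.2 per day, slope ~ -0.1 / 80e3
estimate = plateau_cases(0.2, -0.1 / 80e3)
print(f"extrapolated plateau: about {estimate:,.0f} confirmed cases")
```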