# Overview
This notebook consists of two main parts:
1. Data Analysis (Development, Fatality Rate, ...) (finished)
2. Modelling and Prediction (SIR, ML approaches, ...) (not fully finished)

Note: the number of confirmed cases depends strongly on how many Covid-19 tests are performed over time. Many countries are still ramping up their testing efforts, and they do so at different speeds. Thus, I will mostly exclude the number of confirmed cases from these analyses; the number of fatalities is much less error-prone, so I will focus on that instead. (I will still predict the number of confirmed cases, as that's necessary for the submission file.)

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline 
!pip install mpld3
import mpld3
mpld3.enable_notebook()

In [None]:
train = pd.read_csv('../input/covid19-global-forecasting-week-2/train.csv')
test = pd.read_csv('../input/covid19-global-forecasting-week-2/test.csv')
submission = pd.read_csv('../input/covid19-global-forecasting-week-2/submission.csv')

In [None]:
train

In [None]:
train["Country_Region"] = [country_name.replace("'","") for country_name in train["Country_Region"]]

# 1. Data Analysis (Development, Fatality Rate, ...)

In [None]:
LAST_DATE = train.iloc[-1]["Date"]

## 1.1 Development

Exemplary development in one country

In [None]:
train[train["Country_Region"]=="Italy"][["ConfirmedCases", "Fatalities", "Date"]].plot(x="Date", figsize=(8, 4), title="Covid-19 cases and fatalities in Italy");

Progression for the whole world (i.e. all countries summed up)

In [None]:
train.groupby("Date")[["ConfirmedCases", "Fatalities"]].sum().plot(figsize=(8, 4), title="Covid-19 total cases and fatalities (world)");  # select numeric columns before summing

## 1.2 Fatalities and Case Fatality Rates

Countries with no fatalities yet

In [None]:
print("Countries with no fatalities as of " + LAST_DATE)
fatalities_by_country = train.groupby("Country_Region")["Fatalities"].sum()  # zero only if no fatality was ever reported
print(*fatalities_by_country[fatalities_by_country == 0].index.tolist(), sep=", ")

Countries with the most fatalities

In [None]:
# Sum over provinces/states first so that each country appears only once
train[train["Date"] == LAST_DATE].groupby("Country_Region")[["ConfirmedCases", "Fatalities"]].sum().sort_values("Fatalities", ascending=False).head(10)

Comparing countries' [case fatality rate](https://en.wikipedia.org/wiki/Case_fatality_rate) ("death rate"). **Careful here, Fatalities/ConfirmedCases is not necessarily the real CFR; this can only be calculated ex post**. Thus, a higher CFR as calculated *here* does not necessarily mean that one country's CFR really is higher, it could very well be because of less/more prevalent testing (example: Country A only tests patients that are already in critical condition, Country B tests the whole population -> Country A's Fatalities/ConfirmedCases - ratio will be much higher).
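
To make the testing effect concrete, here is a small sketch with made-up numbers (both countries and all figures are hypothetical and not taken from the dataset). The two countries have the same true number of infections and deaths and differ only in how many infections their testing detects.

```python
# Hypothetical illustration: identical outbreaks, different testing coverage.
def naive_cfr(fatalities, confirmed_cases):
    """Fatalities / ConfirmedCases in percent (the ratio used in this notebook)."""
    return fatalities / confirmed_cases * 100

true_infections = 10_000   # assumed, identical in both countries
fatalities = 100           # assumed, identical in both countries

# Country A only tests critical patients and detects 5% of all infections;
# Country B tests broadly and detects 80% (both shares are made up).
confirmed_a = true_infections * 5 // 100
confirmed_b = true_infections * 80 // 100

cfr_a = naive_cfr(fatalities, confirmed_a)
cfr_b = naive_cfr(fatalities, confirmed_b)
print(f"Country A: {cfr_a:.2f}%  Country B: {cfr_b:.2f}%")
```

Although the underlying outbreak is the same, Country A's ratio comes out far higher, purely because of its narrower testing.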

Only countries with at least 100 fatalities are considered.

In [None]:
# Sum over provinces/states first so that each country appears only once
tmp = train[train["Date"] == LAST_DATE].groupby("Country_Region", as_index=False)[["ConfirmedCases", "Fatalities"]].sum()
tmp["CaseFatalityRate"] = tmp["Fatalities"] / tmp["ConfirmedCases"] * 100  # CFR here is Fatalities/ConfirmedCases * 100 (so that it's in percent)
print("Mean CFR (%):", tmp["CaseFatalityRate"].mean())

# Filter and sort once, then reuse the result for bar heights and labels
top = tmp[tmp["Fatalities"] >= 100].sort_values("CaseFatalityRate", ascending=False)
heights = top["CaseFatalityRate"].values
bars = top["Country_Region"].values
y_pos = np.arange(len(bars))

plt.figure(figsize=(11,4))
plt.bar(y_pos, heights, width=0.5)
 
plt.xticks(y_pos, bars, size="small")
plt.yticks(np.arange(0.0, 11.0, 1.0))
plt.title("Preliminary Case Fatality Rate in Percent by Country")

plt.show();

# 2. Modelling and Prediction

## 2.1 SIR Model

The [SIR](https://en.wikipedia.org/wiki/Compartmental_models_in_epidemiology#The_SIR_model) Model tries to model infectious disease developments, with the population split up into three groups ("compartments"):
- **S**usceptible: They can still be infected (healthy)
- **I**nfected: Currently infected
- **R**ecovered: They are presumed immune (cannot contract the virus again)

"$\cdot$" is "multiplied with".

Let $\beta$ (beta) be the **probability of transmission from infected to healthy $\cdot$ the number of people a person is in contact with per day**. Thus, it can be thought of as the **expected number of people an infected person infects per day** (or any other timestep; I'll use days here). Example: Let the probability of an infected person infecting a healthy/susceptible person be $5 \%$ and the average number of people a person is in contact with per day be $6$. Then, $\beta = 0.05 \cdot 6 = 0.3$, that is, an infected person infects $0.3$ people per day on average.


Now one can see that the **number of days that an infected person can spread the disease** is extremely important. Let $D$ be that number. Then, the number of people an infected person infects on average is the **expected number of people an infected person infects per day $\cdot$ number of days the person can spread the disease**, and that's exactly $\beta \cdot D$. This is the [basic reproduction number](https://en.wikipedia.org/wiki/Basic_reproduction_number) $R_{0}$. Continuing the example from above: let $D = 10$, then $R_{0} = \beta \cdot D = 0.3 \cdot 10 = 3$. That means that on average, each infected person infects $3$ others.

Now the model below does not use $D$ but $\gamma$, and that's just $= \frac{1}{D}$ (Thus, $R_{0} = \beta \cdot D = \frac{\beta}{\gamma}$ and $\beta = R_{0} \cdot \gamma$)
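
The relations between $\beta$, $D$, $\gamma$ and $R_{0}$ can be checked with the numbers from the example above. This is just the arithmetic from the text written out; the variable names `p_transmission` and `contacts_per_day` are chosen here for illustration.

```python
# Worked example from the text: 5% transmission probability, 6 contacts per day, D = 10 days.
p_transmission = 0.05
contacts_per_day = 6
D = 10

beta = p_transmission * contacts_per_day   # people infected per infected person per day
gamma = 1 / D                              # inverse of the number of infectious days
R_0 = beta * D                             # the same value as beta / gamma

print(beta, gamma, R_0)
```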

In [None]:
from scipy.integrate import odeint # a lot of the code for SIR from https://scipython.com/book/chapter-8-scipy/additional-examples/the-sir-epidemic-model/

In [None]:
# The SIR model differential equations.
def deriv(y, t, N, beta, gamma):
    S, I, R = y
    dSdt = -beta * S * I / N
    dIdt = beta * S * I / N - gamma * I
    dRdt = gamma * I
    return dSdt, dIdt, dRdt

We now want to add *cumulative* Deaths $X$ to the model: $X(t) = \textit{number of deaths from day 0 to day t}$ for $t\geq 14$, else $0$. 

Recursively, the number of cumulative deaths on day $t$ is equal to the number of cumulative deaths on day $t-1$ (that's $=X(t-1)$) plus the number of newly infected 13 days prior multiplied with the case fatality rate $\alpha$ (alpha) (I chose 13 days as that is reported as the average time from infection until death in [this study](https://wwwnc.cdc.gov/eid/article/26/6/20-0320_article)).

Now, the number of newly infected 13 days prior (that's the people who can die on day $t$) is equal to the number of infected 14 days prior multiplied with the expected amount of people an infected person infects per day (that's $\beta$). So the number of newly infected 13 days prior is $\beta \cdot I(t-14)$.

Putting it all together: $X(t) = X(t-1) + \alpha \cdot \beta \cdot I(t-14)$.

This is equal to the closed form formula $X(t) = \alpha \cdot \beta \cdot \displaystyle \sum_{i=0}^{t-14} I(i)$

Proof: induction over $t$ for $t\geq 14$ (both sides are $0$ for $t<14$).

Base case: $X(14) = X(13) + \alpha \cdot \beta \cdot I(14-14) = 0 + \alpha \cdot \beta \cdot I(0) = \alpha \cdot \beta \cdot \displaystyle \sum_{i=0}^{0} I(i)$.

Inductive step: assume $X(t) = \alpha \cdot \beta \cdot \displaystyle \sum_{i=0}^{t-14} I(i)$ holds for some $t\geq 14$. Then:

$X(t+1) = X(t+1-1) + \alpha \cdot \beta \cdot I(t+1-14) = X(t) + \alpha \cdot \beta \cdot I(t-13) \stackrel{inductive\, assumption}{=} \alpha \cdot \beta \cdot \displaystyle \sum_{i=0}^{t-14} I(i) + \alpha \cdot \beta \cdot I(t-13) = \alpha \cdot \beta \cdot \displaystyle \sum_{i=0}^{t-13} I(i) = \alpha \cdot \beta \cdot \displaystyle \sum_{i=0}^{(t+1)-14} I(i)$

which was to be shown.
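
As a numerical sanity check of the proof, the recursive and the closed-form definition can be compared on an arbitrary series. The infected curve used here is made up (a simple geometric sequence), not output of the SIR model, and the values of `alpha` and `beta` are only illustrative.

```python
# Compare the recursion with the closed form on a made-up infected curve I(t).
alpha, beta = 0.05, 0.3
I = [1.2 ** day for day in range(40)]  # hypothetical values, not model output

# Recursion: X(t) = X(t-1) + alpha * beta * I(t-14) for t >= 14, else 0
X_recursive = [0.0] * len(I)
for t in range(14, len(I)):
    X_recursive[t] = X_recursive[t - 1] + alpha * beta * I[t - 14]

# Closed form: X(t) = alpha * beta * sum_{i=0}^{t-14} I(i) for t >= 14, else 0
X_closed = [alpha * beta * sum(I[: t - 13]) if t >= 14 else 0.0 for t in range(len(I))]

max_diff = max(abs(r - c) for r, c in zip(X_recursive, X_closed))
print(max_diff)  # differs from zero only by floating-point rounding
```

Note that the slice `I[:t - 13]` covers the indices $0, \dots, t-14$, which is the same indexing the `SIR_model` code uses.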


In [None]:
def SIR_model(N, D, R_0, CaseFatalityRate, max_days):
    '''
    N: total population
    D, R_0, CaseFatalityRate: see texts above
    '''
    I0, R0 = 1, 0  # Initial number of infected and recovered individuals (1 infected, 0 recovered) [this R0 has nothing to do with the basic reproduction number R0]
    S0 = N - I0 - R0 # Initial number of susceptible (everyone else)

    gamma = 1.0 / D  # see texts above
    beta = R_0 * gamma  # see texts above
    alpha = CaseFatalityRate

    t = np.linspace(0, max_days, max_days) # Grid of time points (in days)

    # Initial conditions vector
    y0 = S0, I0, R0
    # Integrate the SIR equations over the time grid, t.
    ret = odeint(deriv, y0, t, args=(N, beta, gamma))
    S, I, R = ret.T

    # Adding deaths (see text above)
    X = np.zeros(max_days)
    for day in range(13, max_days):
        X[day] = sum(I[:day-13])
    X = alpha * beta * X


    # Plot the data on three separate curves for S(t), I(t) and R(t)
    f, ax = plt.subplots(1,1,figsize=(10,4))
    ax.plot(t, S, 'b', alpha=0.7, linewidth=2, label='Susceptible')
    ax.plot(t, I, 'y', alpha=0.7, linewidth=2, label='Infected')
    ax.plot(t, X, 'r', alpha=0.7, linewidth=2, label='Dead')
    ax.plot(t, R, 'g', alpha=0.7, linewidth=2, label='Recovered')

    ax.set_xlabel('Time (days)')
    ax.title.set_text('SIR-Model. Total Population: ' + str(N) + ", Days Infectious: " + str(D) + ", R_0: " + str(R_0) + ", CFR: " + str(CaseFatalityRate*100) + "%")
    # ax.set_ylabel('Number (1000s)')
    # ax.set_ylim(0,1.2)
    ax.yaxis.set_tick_params(length=0)
    ax.xaxis.set_tick_params(length=0)
    ax.grid(True, which='major', c='w', lw=2, ls='-')  # the `b` keyword was removed in newer matplotlib versions
    legend = ax.legend()
    legend.get_frame().set_alpha(0.5)
    for spine in ('top', 'right', 'bottom', 'left'):
        ax.spines[spine].set_visible(False)
    plt.show();

Example of a SIR Model

In [None]:
SIR_model(N=1_000_000, D=14.0, R_0=2.0, CaseFatalityRate=0.05, max_days=360)

## 2.2 SIR-Model with Lockdown

### We now want to find suitable parameters (Days infectious, R_0, CFR) for the SIR model

As I said before, the number of confirmed cases is likely far off from the real number (as not the whole population is getting tested) and thus is not very useful to fit our data to a SIR-Model.

So, we'll mainly use the number of deceased from the dataset to find parameters for the SIR model. What's important to note is that many countries implemented a *lockdown* that greatly reduces the basic reproduction number R_0; thus, we first tweak the model to allow for a second R_0_2 to come into effect on day L (for lockdown).
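
The idea of a second reproduction number taking effect on day L can be sketched as a step function for R_0. The sketch below is not the notebook's implementation: to stay dependency-free it uses a simple forward-Euler integration instead of `odeint`, and the names `piecewise_R_0` and `simulate_sir` are made up for this illustration.

```python
# Sketch: the lockdown is modelled as a step in R_0 on day L.
def piecewise_R_0(t, R_0, R_0_2, L):
    """Return R_0 before day L and R_0_2 from day L onwards."""
    return R_0 if t < L else R_0_2

def simulate_sir(N, D, R_0, R_0_2, L, max_days, steps_per_day=10):
    """Forward-Euler SIR simulation; returns the number of infected at each full day."""
    gamma = 1.0 / D
    S, I, R = N - 1.0, 1.0, 0.0
    dt = 1.0 / steps_per_day
    infected_per_day = [I]
    for step in range(max_days * steps_per_day):
        t = step / steps_per_day
        beta = piecewise_R_0(t, R_0, R_0_2, L) * gamma
        new_infections = beta * S * I / N * dt
        recoveries = gamma * I * dt
        S, I, R = S - new_infections, I + new_infections - recoveries, R + recoveries
        if (step + 1) % steps_per_day == 0:
            infected_per_day.append(I)
    return infected_per_day

no_lockdown = simulate_sir(N=1_000_000, D=4, R_0=3.0, R_0_2=3.0, L=22, max_days=60)
lockdown = simulate_sir(N=1_000_000, D=4, R_0=3.0, R_0_2=0.9, L=22, max_days=60)
print(max(no_lockdown), max(lockdown))
```

Both runs are identical up to day L; afterwards the lower R_0_2 makes the infected curve decline instead of continuing to grow.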

In [None]:
def SIR_model_with_lockdown(N, D, R_0, CaseFatalityRate, max_days, L, R_0_2):
    '''
    N: total population
    D, R_0, CaseFatalityRate, ...: see texts above
    '''
    # BEFORE LOCKDOWN (same code as first model)
    I0, R0 = 1, 0  # Initial number of infected and recovered individuals (1 infected, 0 recovered) [this R0 has nothing to do with the basic reproduction number R0]
    S0 = N - I0 - R0 # Initial number of susceptible (everyone else)

    gamma = 1.0 / D  # see texts above
    beta = R_0 * gamma  # see texts above
    alpha = CaseFatalityRate

    t = np.linspace(0, L, L)  # Grid of time points (in days)
    
    # Initial conditions vector
    y0 = S0, I0, R0
    # Integrate the SIR equations over the time grid, t.
    ret = odeint(deriv, y0, t, args=(N, beta, gamma))
    S, I, R = ret.T
    
    
    # AFTER LOCKDOWN
    I0_2, R0_2, S0_2 = I[-1], R[-1], S[-1]  # beginning of lockdown -> starting Infected/Susceptible/Recovered numbers are the numbers at the end of no-lockdown period

    gamma = 1.0 / D  # same after lockdown
    beta_2 = R_0_2 * gamma
    alpha = CaseFatalityRate  # same after lockdown

    t_2 = np.linspace(0, max_days - L + 1, max_days - L + 1)
    
    # Initial conditions vector
    y0_2 = S0_2, I0_2, R0_2
    # Integrate the SIR equations over the time grid, t.
    ret_2 = odeint(deriv, y0_2, t_2, args=(N, beta_2, gamma))
    S_2, I_2, R_2 = ret_2.T

    
    # COMBINING PERIODS
    S_full = np.concatenate((S, S_2[1:]))
    I_full = np.concatenate((I, I_2[1:]))
    R_full = np.concatenate((R, R_2[1:]))
    t_full = np.linspace(0, max_days, max_days)
    
    # Adding deaths
    X = np.zeros(max_days)
    for day in range(13, max_days):
        for valid_day in range(day-13):
            if valid_day < L:
                X[day] += alpha * beta * I_full[valid_day]
            else:
                X[day] += alpha * beta_2 * I_full[valid_day]

    

    # Plot the data on three separate curves for S(t), I(t) and R(t)
    f, ax = plt.subplots(1,1,figsize=(10,4))
    ax.plot(t_full, S_full, 'b', alpha=0.7, linewidth=2, label='Susceptible')
    ax.plot(t_full, I_full, 'y', alpha=0.7, linewidth=2, label='Infected')
    ax.plot(t_full, X, 'r', alpha=0.7, linewidth=2, label='Dead')
    ax.plot(t_full, R_full, 'g', alpha=0.7, linewidth=2, label='Recovered')

    ax.set_xlabel('Time (days)')
    ax.title.set_text('SIR-Model with Lockdown. Total Population: ' + str(N) + 
                      ", Days Infectious: " + str(D) + ", R_0: " + str(R_0) + 
                      ", CFR: " + str(CaseFatalityRate*100) + "%, R_0_2: " + str(R_0_2) + 
                      ", L: " + str(L) + " days")
    # ax.set_ylabel('Number (1000s)')
    # ax.set_ylim(0,1.2)
    plt.text(L,N/20,'Lockdown')
    plt.plot(L, 0, marker='o', markersize=6, color="red")
    ax.yaxis.set_tick_params(length=0)
    ax.xaxis.set_tick_params(length=0)
    ax.grid(True, which='major', c='w', lw=2, ls='-')  # the `b` keyword was removed in newer matplotlib versions
    legend = ax.legend()
    legend.get_frame().set_alpha(0.5)
    for spine in ('top', 'right', 'bottom', 'left'):
        ax.spines[spine].set_visible(False)
    plt.show();

### (fictitious) Case Study: No Lockdown vs Lockdown
We model a highly infectious virus with an R_0 of 3.0 and 4 days infectious spreading in a population of 1 million. The CFR is set to 5%.
We look at the development without a lockdown and with a lockdown after 22 and 30 days that reduces R_0 to 0.9.

No Lockdown:

About 130k Fatalities at the end.

In [None]:
SIR_model(N=1_000_000, D=4, R_0=3.0, CaseFatalityRate=0.05, max_days=60)

Lockdown after 30 Days:

Fatalities around 85k. The lockdown is too late to stop the spread, but still has significant impact. However, at the time of the lockdown, there are only about 500 fatalities; as the R_0 is so high, the virus spreads incredibly fast and a lockdown would have to come into effect very soon.

In [None]:
SIR_model_with_lockdown(N=1_000_000, D=4, R_0=3.0, CaseFatalityRate=0.05, max_days=60, L=30, R_0_2=0.9)

Lockdown after 22 days:

The lockdown is able to break the chain of infection early on! Fatalities are around 12k.

In [None]:
SIR_model_with_lockdown(N=1_000_000, D=4, R_0=3.0, CaseFatalityRate=0.05, max_days=60, L=22, R_0_2=0.9)

As you can see, with highly contagious viruses, each day counts. Let's impose even more drastic measures: a complete lockdown after 15 days that reduces R_0 to 0.1 reduces fatalities to around 600 people only!

In [None]:
SIR_model_with_lockdown(N=1_000_000, D=4, R_0=3.0, CaseFatalityRate=0.05, max_days=60, L=15, R_0_2=0.1)

## 2.3 Fitting SIR with Lockdown to real-world data

We now try to fit the SIR-Model's Dead Curve to real data by tweaking the variables. Some of them are constant:
- max_days is set to `len(train.groupby("Date").sum().index)` so that we can compare against all available data
- N is fixed for each country, that's just the total population
- L is fixed for each country (the date it went into lockdown)
- D is set to vary from 5 to 20 (according to [this study](https://www.ncbi.nlm.nih.gov/pubmed/32150748), it takes on avg. 5 days to show symptoms, at most 14; according to [this source (German)](https://www.rki.de/DE/Content/InfAZ/N/Neuartiges_Coronavirus/Steckbrief.html#doc13776792bodyText5), people are infectious up to 5 days after onset of symptoms).
- CFR set to vary from $0.1\% - 10\%$ (according to [this study](https://wwwnc.cdc.gov/eid/article/26/6/20-0320_article))
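- R_0 and R_0_2 are set to vary from 0.1 to 3.5

Note that the bounds actually passed to `set_param_hint` in `fit_SIR` below are somewhat wider than the ranges listed here (D: 4 to 25, CFR: 0.01% to 10%, R_0 and R_0_2: 0.1 to 5.0).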

In [None]:
# SIR-Model's Fatality Curve (no plotting etc.):
def SIR_model_with_lockdown_deaths(x, N, D, R_0, CaseFatalityRate, max_days, L, R_0_2):
    # BEFORE LOCKDOWN (same code as first model)
    I0, R0 = 1, 0  # Initial number of infected and recovered individuals (1 infected, 0 recovered) [this R0 has nothing to do with the basic reproduction number R0]
    S0 = N - I0 - R0 # Initial number of susceptible (everyone else)

    gamma = 1.0 / D  # see texts above
    beta = R_0 * gamma  # see texts above
    alpha = CaseFatalityRate

    t = np.linspace(0, L, L)  # Grid of time points (in days)
    
    # Initial conditions vector
    y0 = S0, I0, R0
    # Integrate the SIR equations over the time grid, t.
    ret = odeint(deriv, y0, t, args=(N, beta, gamma))
    S, I, R = ret.T
    
    
    # AFTER LOCKDOWN
    I0_2, R0_2, S0_2 = I[-1], R[-1], S[-1]  # beginning of lockdown -> starting Infected/Susceptible/Recovered numbers are the numbers at the end of no-lockdown period

    gamma = 1.0 / D  # same after lockdown
    beta_2 = R_0_2 * gamma
    alpha = CaseFatalityRate  # same after lockdown

    t_2 = np.linspace(0, max_days - L + 1, max_days - L + 1)
    
    # Initial conditions vector
    y0_2 = S0_2, I0_2, R0_2
    # Integrate the SIR equations over the time grid, t.
    ret_2 = odeint(deriv, y0_2, t_2, args=(N, beta_2, gamma))
    S_2, I_2, R_2 = ret_2.T

    
    # COMBINING PERIODS
    S_full = np.concatenate((S, S_2[1:]))
    I_full = np.concatenate((I, I_2[1:]))
    R_full = np.concatenate((R, R_2[1:]))
    t_full = np.linspace(0, max_days, max_days)
    
    # Adding deaths
    X = np.zeros(max_days)
    for day in range(13, max_days):
        for valid_day in range(day-13):
            if valid_day < L:
                X[day] += alpha * beta * I_full[valid_day]
            else:
                X[day] += alpha * beta_2 * I_full[valid_day]
    return X[x]

The (hidden as it's almost the same as before) code above defines a function with signature

`SIR_model_with_lockdown_deaths(x, N, D, R_0, CaseFatalityRate, max_days, L, R_0_2)`

that takes the same variables as before plus an x as input and returns the number of fatalities on day x. This function will be used to find suitable parameters D, CFR, R_0 and R_0_2 for the model.

In [None]:
!pip install lmfit
from lmfit import Model

In [None]:
# Load countries data file (from https://www.kaggle.com/saga21/covid-global-forecast-sir-model-ml-regressions)
world_population = pd.read_csv("/kaggle/input/population-by-country-2020/population_by_country_2020.csv")

# Select desired columns and rename some of them
world_population = world_population[['Country (or dependency)', 'Population (2020)', 'Density (P/Km²)', 'Land Area (Km²)', 'Med. Age', 'Urban Pop %']]
world_population.columns = ['Country (or dependency)', 'Population (2020)', 'Density', 'Land Area', 'Med Age', 'Urban Pop']

# Replace United States by US
world_population.loc[world_population['Country (or dependency)']=='United States', 'Country (or dependency)'] = 'US'

# Remove the % character from Urban Pop values
world_population['Urban Pop'] = world_population['Urban Pop'].str.rstrip('%')

# Replace Urban Pop and Med Age "N.A" by their respective modes, then transform to int
world_population.loc[world_population['Urban Pop']=='N.A.', 'Urban Pop'] = int(world_population.loc[world_population['Urban Pop']!='N.A.', 'Urban Pop'].mode()[0])
world_population['Urban Pop'] = world_population['Urban Pop'].astype('int16')
world_population.loc[world_population['Med Age']=='N.A.', 'Med Age'] = int(world_population.loc[world_population['Med Age']!='N.A.', 'Med Age'].mode()[0])
world_population['Med Age'] = world_population['Med Age'].astype('int16')

We now define 
1. `fit_SIR`: this function takes a country name, a lockdown date (and optionally a region name). It first gathers the data (fatalities progression, population, etc.) and then fits the `SIR_model_with_lockdown_deaths` function from above with fixed N (population), max_days (however many dates are supplied) and L (lockdown date), and with varying D, R_0, R_0_2 and CFR. The function returns the lmfit module's result object and the country name. The result object contains all we want to know about the curve fitting.
2. `fitted_plot`: this function takes a lmfit-result-object and country name and plots the fitted SIR-model against the real curve.

In [None]:
lockdown_dates = {"Italy": "2020-03-10", "Spain": "2020-03-15", "Germany": "2020-03-23"}

def fit_SIR(country_name, lockdown_date=None, region_name=None):
    """
    y_data: the fatalities data of one country/region (array)
    population: total population of country
    lockdown_date: format YYYY-MM-DD
    """
    if lockdown_date is None:
        lockdown_date = lockdown_dates[country_name]

    if region_name:
        y_data = train[(train["Country_Region"] == country_name) & (train["Region"] == region_name)].Fatalities.values
    else:
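        # len() of a boolean mask is always len(train), so count the matching rows with .sum() instead
        if (train["Country_Region"] == country_name).sum() > (train["Country_Region"] == "Germany").sum():  # country with several regions and no region provided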
            y_data = train[(train["Country_Region"] == country_name) & (train["Region"].isnull())].Fatalities.values
        else:
            y_data = train[train["Country_Region"] == country_name].Fatalities.values
        
    max_days = len(train.groupby("Date").sum().index) # constant for all countries

    # country specific values
    N = world_population.loc[world_population['Country (or dependency)'] == country_name]["Population (2020)"].values[0]
    L = train.groupby("Date").sum().index.tolist().index(lockdown_date)  # index of the lockdown date

    # x_data is just the [0, 1, ..., max_days - 1] array
    x_data = np.linspace(0, max_days - 1, max_days, dtype=int)
    
    # curve fitting from here
    mod = Model(SIR_model_with_lockdown_deaths)

    # initial values and bounds
    mod.set_param_hint('N', value=N)
    mod.set_param_hint('max_days', value=max_days)
    mod.set_param_hint('L', value=L)
    mod.set_param_hint('D', value=10, min=4, max=25)
    mod.set_param_hint('CaseFatalityRate', value=0.01, min=0.0001, max=0.1)
    mod.set_param_hint('R_0', value=2.0, min=0.1, max=5.0)
    mod.set_param_hint('R_0_2', value=2.0, min=0.1, max=5.0)

    params = mod.make_params()

    # fixing constant parameters
    params['N'].vary = False
    params['max_days'].vary = False
    params['L'].vary = False

    result = mod.fit(y_data, params, x=x_data, method="least_squares")
    
    return result, country_name

def fitted_plot(result, country_name, region_name=None):
    if region_name:
        y_data = train[(train["Country_Region"] == country_name) & (train["Region"] == region_name)].Fatalities.values
    else:
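        # len() of a boolean mask is always len(train), so count the matching rows with .sum() instead
        if (train["Country_Region"] == country_name).sum() > (train["Country_Region"] == "Germany").sum():  # country with several regions and no region provided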
            y_data = train[(train["Country_Region"] == country_name) & (train["Region"].isnull())].Fatalities.values
        else:
            y_data = train[train["Country_Region"] == country_name].Fatalities.values

    max_days = len(train.groupby("Date").sum().index)
    x_data = np.linspace(0, max_days - 1, max_days, dtype=int)
    x_ticks = train[train["Country_Region"] == "Germany"].Date.values  # same for all countries
    
    plt.figure(figsize=(10,5))
    
    real_data, = plt.plot(x_data, y_data, 'bo', label="real data")
    SIR_fit = plt.plot(x_data, result.best_fit, 'r-', label="SIR model")
    
    plt.xlabel("Day")
    plt.xticks(x_data[::10], x_ticks[::10])
    plt.ylabel("Fatalities")
    plt.title("Real Data vs SIR-Model in " + country_name)
    plt.legend(numpoints=1, loc=2, frameon=None)
    plt.show()

In [None]:
result, _ = fit_SIR("Italy")
print(result.fit_report())
fitted_plot(result, "Italy")

In [None]:
result, _ = fit_SIR("Spain")
print(result.fit_report())
fitted_plot(result, "Spain")

In [None]:
result, _ = fit_SIR("Germany")
print(result.fit_report())
fitted_plot(result, "Germany")

## 2.4 SIR with time-dependent R_0 and CFR
While the prior models capture some aspects of the outbreak quite well, fitting the curves to the outbreak period is not that hard, as they all look quite similar. To make better predictions, we now treat R_0 and CFR as functions. For example, there is no fixed "lockdown" date anymore at which R_0 jumps to a different value; it can now change continuously. Also, the CFR was treated as constant until now; however, with more people infected, treatment becomes less available and the case fatality rate increases. Now, CFR is treated as a function of the ratio $\frac{I(t)}{N}$ (the fraction of the total population that is infected):

$CFR(t) = s \cdot \frac{I(t)}{N} + \alpha_{OPT}$, with $s$ being some arbitrary but fixed scaling factor and $\alpha_{OPT}$ being the CFR with optimal treatment available.
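
To get a feeling for this formula, here is a minimal sketch that evaluates it for a few infection levels. The function name `cfr` and the chosen values ($s = 0.1$, $\alpha_{OPT} = 2\%$, $N = 1000$) are only illustrative assumptions, not fitted parameters.

```python
def cfr(infected, N, s, alpha_opt):
    """CFR(t) = s * I(t) / N + alpha_OPT for a given number of currently infected."""
    return s * infected / N + alpha_opt

# Illustrative values (assumed, not fitted): s = 0.1, alpha_OPT = 2%
N = 1_000
for infected in (0, 100, 500):
    print(infected, f"{cfr(infected, N, s=0.1, alpha_opt=0.02) * 100:.1f}%")
```

With nobody infected, the CFR equals $\alpha_{OPT}$; it then grows linearly with the infected fraction of the population.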

$R_{0}$ will be fitted to one of several possible functions of time that we'll look at.

In [None]:
# extended SIR model differential equations. Beta is now a function.
def extended_deriv(y, t, N, beta, gamma):
    S, I, R = y
    dSdt = -beta(t) * S * I / N
    dIdt = beta(t) * S * I / N - gamma * I
    dRdt = gamma * I
    return dSdt, dIdt, dRdt

In [None]:
def extended_SIR(N, D, max_days, CFR_OPT, CFR_scaling_factor, R_0, **R_0_kwargs):
    '''
    R_0: callable
    '''
    I0, R0 = 1, 0  # Initial number of infected and recovered individuals (1 infected, 0 recovered) [this R0 has nothing to do with the basic reproduction number R0]
    S0 = N - I0 - R0 # Initial number of susceptible (everyone else)

    gamma = 1.0 / D  # see texts above

    def beta(t):
        return R_0(t, **R_0_kwargs) * gamma

    t = np.linspace(0, max_days, max_days)  # Grid of time points (in days)
    
    # Initial conditions vector
    y0 = S0, I0, R0
    # Integrate the SIR equations over the time grid, t.
    ret = odeint(extended_deriv, y0, t, args=(N, beta, gamma))
    S, I, R = ret.T

    def CFR(t):
        return CFR_OPT + CFR_scaling_factor * (I[t] / N)

    # Adding deaths
    X = np.zeros(max_days)
    for day in range(13, max_days):
        for valid_day in range(day-13):
            X[day] += CFR(valid_day) * beta(valid_day) * I[valid_day]

    return t, S, I, R, X, [R_0(t, **R_0_kwargs) for t in range(max_days)], N, [CFR(t) for t in range(max_days)]

In [None]:
def plot_extended_SIR(t, S, I, R, X, R_0, N, CFR):
    # Plot the data on three separate curves for S(t), I(t) and R(t)
    f, ax = plt.subplots(1,1,figsize=(10,4))
    ax.plot(t, S, 'b', alpha=0.7, linewidth=2, label='Susceptible')
    ax.plot(t, I, 'y', alpha=0.7, linewidth=2, label='Infected')
    ax.plot(t, X, 'r', alpha=0.7, linewidth=2, label='Dead')
    ax.plot(t, R, 'g', alpha=0.7, linewidth=2, label='Recovered')

    ax.set_xlabel('Time (days)')
    ax.title.set_text('SIR-Model with varying R_0 and CFR')
    # ax.set_ylabel('Number (1000s)')
    # ax.set_ylim(0,1.2)
    ax.yaxis.set_tick_params(length=0)
    ax.xaxis.set_tick_params(length=0)
    ax.grid(True, which='major', c='w', lw=2, ls='-')  # the `b` keyword was removed in newer matplotlib versions
    legend = ax.legend()
    legend.get_frame().set_alpha(0.5)
    for spine in ('top', 'right', 'bottom', 'left'):
        ax.spines[spine].set_visible(False)
    plt.show();
    
    
    # plt.figure(figsize=(10,4))
    
    f = plt.figure(figsize=(10,4))
    
    # sp1
    ax1 = f.add_subplot(121)
    ax1.plot(t, R_0, 'b--', alpha=0.7, linewidth=2, label='R_0')
    
    ax1.set_xlabel('Time (days)')
    ax1.title.set_text('R_0 over time')
    # ax.set_ylabel('Number (1000s)')
    # ax.set_ylim(0,1.2)
    ax1.yaxis.set_tick_params(length=0)
    ax1.xaxis.set_tick_params(length=0)
    ax1.grid(True, which='major', c='w', lw=2, ls='-')
    legend = ax1.legend()
    legend.get_frame().set_alpha(0.5)
    for spine in ('top', 'right', 'bottom', 'left'):
        ax1.spines[spine].set_visible(False)  # hide the spines of this subplot (was `ax`, the first figure's axes)

    # sp2
    ax2 = f.add_subplot(122)
    ax2.plot(t, CFR, 'r--', alpha=0.7, linewidth=2, label='CFR')
    
    ax2.set_xlabel('Time (days)')
    ax2.title.set_text('CFR over time')
    # ax.set_ylabel('Number (1000s)')
    # ax.set_ylim(0,1.2)
    ax2.yaxis.set_tick_params(length=0)
    ax2.xaxis.set_tick_params(length=0)
    ax2.grid(True, which='major', c='w', lw=2, ls='-')
    legend = ax2.legend()
    legend.get_frame().set_alpha(0.5)
    for spine in ('top', 'right', 'bottom', 'left'):
        ax2.spines[spine].set_visible(False)  # hide the spines of this subplot (was `ax`, the first figure's axes)

    plt.show();

Example: 

In [None]:
N = 1_000
D = 4
max_days = 100

I0, R0 = 1, 0
S0 = N - I0 - R0
s = CFR_scaling_factor = 0.1
CFR_OPT = 0.02  # no one in hospital -> only 2% die

def new_R0(t, a, b, c):
    return a / (1 + (t/c)**b)


plot_extended_SIR(*extended_SIR(N, D, max_days, CFR_OPT, CFR_scaling_factor, new_R0, a=3.0, b=1.5, c=50))

## 2.5 Fitting extended SIR to data

In [None]:
def fit_extended_SIR(country_name, R_0_function, region_name=None, fit_method="least_squares", **R_0_kwargs):

    if region_name:
        y_data = train[(train["Country_Region"] == country_name) & (train["Region"] == region_name)].Fatalities.values
    else:
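        # len() of a boolean mask is always len(train), so count the matching rows with .sum() instead
        if (train["Country_Region"] == country_name).sum() > (train["Country_Region"] == "Germany").sum():  # country with several regions and no region provided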
            y_data = train[(train["Country_Region"] == country_name) & (train["Region"].isnull())].Fatalities.values
        else:
            y_data = train[train["Country_Region"] == country_name].Fatalities.values
        
    max_days = len(train.groupby("Date").sum().index) # constant for all countries
 
    # country specific values
    N = world_population.loc[world_population['Country (or dependency)'] == country_name]["Population (2020)"].values[0]

    # x_data is just the [0, 1, ..., max_days - 1] array
    x_data = np.linspace(0, max_days - 1, max_days, dtype=int)

    # curve fitting from here
    def extended_SIR_deaths(x, N, D, max_days, CFR_OPT, CFR_scaling_factor, **R_0_kwargs):
        t_, S_, I_, R_, X, R_0_, N_, CFR_ = extended_SIR(N, D, max_days, CFR_OPT, CFR_scaling_factor, R_0=R_0_function, **R_0_kwargs)
        return X[x]

    mod = Model(extended_SIR_deaths)

    # initial values and bounds
    mod.set_param_hint('N', value=N, vary=False)
    mod.set_param_hint('max_days', value=max_days, vary=False)

    mod.set_param_hint('D', value=10, min=4, max=25)
    mod.set_param_hint('CFR_OPT', value=0.01, min=0.0001, max=0.1)
    mod.set_param_hint('CFR_scaling_factor', value=0.1, min=0.0001, max=1.0)
    if R_0_kwargs:
        for arg in R_0_kwargs:
            mod.set_param_hint(arg, value=R_0_kwargs[arg])

    params = mod.make_params()
    # print(params)
    result = mod.fit(y_data, params, method=fit_method, x=x_data)
    
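    # fetch some result parameters
    D = result.params["D"].value  # fitted D; without this line, the returned D would be whatever the global D is
    CFR_OPT = result.params["CFR_OPT"].value
    CFR_scaling_factor = result.params["CFR_scaling_factor"].value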
    R_0_result_params = {}
    for val in R_0_kwargs:
        R_0_result_params[val] = result.params[val].value

    
    # return result, country_name
    return result, country_name, N, D, max_days, CFR_OPT, CFR_scaling_factor, R_0_function, R_0_result_params

def fitted_plot(result, country_name, region_name=None):
    if region_name:
        y_data = train[(train["Country_Region"] == country_name) & (train["Region"] == region_name)].Fatalities.values
    else:
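        # len() of a boolean mask is always len(train), so count the matching rows with .sum() instead
        if (train["Country_Region"] == country_name).sum() > (train["Country_Region"] == "Germany").sum():  # country with several regions and no region provided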
            y_data = train[(train["Country_Region"] == country_name) & (train["Region"].isnull())].Fatalities.values
        else:
            y_data = train[train["Country_Region"] == country_name].Fatalities.values

    max_days = len(train.groupby("Date").sum().index)
    x_data = np.linspace(0, max_days - 1, max_days, dtype=int)
    x_ticks = train[train["Country_Region"] == "Germany"].Date.values  # same for all countries
    
    plt.figure(figsize=(10,5))
    
    real_data, = plt.plot(x_data, y_data, 'bo', label="real data")
    SIR_fit = plt.plot(x_data, result.best_fit, 'r-', label="SIR model")
    
    plt.xlabel("Day")
    plt.xticks(x_data[::10], x_ticks[::10])
    plt.ylabel("Fatalities")
    plt.title("Real Data vs SIR-Model in " + country_name)
    plt.legend(numpoints=1, loc=2, frameon=None)
    plt.show()

In [None]:
def new_R0(t, a, b, c):
    return a / (1 + (t/c)**b)

result, country_name, N, D, max_days, CFR_OPT, CFR_scaling_factor, R_0_function, R_0_result_params = fit_extended_SIR("Italy", new_R0, region_name=None, fit_method="least_squares", a=3.0, b=1.5, c=50)
print(result.fit_report())
fitted_plot(result, "Italy");
plot_extended_SIR(*extended_SIR(N, D, max_days, CFR_OPT, CFR_scaling_factor, R_0_function, **R_0_result_params))
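
As a quick sanity check of the decaying R_0 curve `new_R0` used above, the sketch below evaluates the same formula at a few points in plain Python. The values `a=3.0`, `b=1.5`, `c=50` are the initial guesses passed to the fit, not fitted results.

```python
def new_R0(t, a, b, c):
    # same formula as in the cell above: starts at a and decays towards 0
    return a / (1 + (t / c) ** b)

a, b, c = 3.0, 1.5, 50  # initial guesses from the fit call, not fitted values

r_start = new_R0(0, a, b, c)       # equals a, because (0/c)**b == 0
r_half = new_R0(c, a, b, c)        # equals a / 2, because (c/c)**b == 1
r_late = new_R0(10 * c, a, b, c)   # far below a / 2

print(r_start, r_half, r_late)
```

So `c` can be read as the day on which R_0 has dropped to half its initial value, while `b` controls how sharp the drop is.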

## 2.6 Final Model (extended SEIR)
We are now going to make some final changes to the model (and then finally get to the predictions):
1. Switch to SEIR instead of SIR: as Covid-19 appears to take about 3 days on average ($=\sigma^{-1}$) until an infected person becomes infectious, we'll add an "Exposed" compartment of people who carry the virus and will become infectious 3 days later
2. Change the number of days until death to 19, as reported more recently
3. Shift the CFR curve so that it is calculated with the number of infected 7 days prior, as that's the average time (3 days incubation + 7 days with symptoms) until patients get to the E.R. (and thus reduce capacity)
4. Use a logistic curve $\displaystyle\frac{R_{0_{start}} - R_{0_{end}}}{1 + e^{-k(-x+x_{0})}} + R_{0_{end}}$ as template for the R_0 curve (R_0_start and R_0_end are the beginning and end values of R_0; x_0 is the x-value of the inflection point, i.e. where R_0 declines most steeply, which could be thought of as the main "lockdown" date; k lets us vary how quickly R_0 declines)
5. Add an `outbreak` parameter that sets the day the outbreak occurred; this is important because currently day 0 of the given data (2020-01-22) is treated as the outbreak date by default. (In the fitting code below this is approximated by `missing_days`, which prepends days with zero fatalities to the data.)

In [None]:
def logistic_R_0(t, R_0_start, k, x0, R_0_end):
    return (R_0_start-R_0_end) / (1 + np.exp(-k*(-t+x0))) + R_0_end
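
To make the roles of the parameters concrete, here is a small check of the three regimes of the logistic curve: long before `x0`, exactly at `x0`, and long after `x0`. It is a scalar sketch that swaps `np.exp` for `math.exp` so it runs without NumPy; the parameter values are the ones used in the plot of this section.

```python
import math

def logistic_R_0(t, R_0_start, k, x0, R_0_end):
    # scalar version of the function above, using math.exp instead of np.exp
    return (R_0_start - R_0_end) / (1 + math.exp(-k * (-t + x0))) + R_0_end

params = dict(R_0_start=2.0, k=1.0, x0=50, R_0_end=1.4)

early = logistic_R_0(0, **params)     # close to R_0_start
middle = logistic_R_0(50, **params)   # halfway between start and end values
late = logistic_R_0(100, **params)    # close to R_0_end

print(early, middle, late)
```

At `t = x0` the curve sits exactly halfway between `R_0_start` and `R_0_end`, which is why `x0` can be interpreted as the main "lockdown" date.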

In [None]:
x = np.linspace(0, 100, 100)
plt.title("logistic R_0: initial R_0 2.0, final R_0 1.4, x0=50, varying k-values")
plt.plot(x, logistic_R_0(x, R_0_start=2, k=1.0, x0=50, R_0_end=1.4), label="k=1.0")
plt.plot(x, logistic_R_0(x, R_0_start=2, k=0.5, x0=50, R_0_end=1.4), label="k=0.5")
plt.plot(x, logistic_R_0(x, R_0_start=2, k=0.1, x0=50, R_0_end=1.4), label="k=0.1")
plt.legend()
plt.show();

In [None]:
def extended_deriv_SEIR(y, t, N, beta, gamma, sigma):
    S, E, I, R = y
    dSdt = -beta(t) * S * I / N  # same as before
    dEdt = beta(t) * S * I / N - sigma * E  # changed
    dIdt = sigma * E - gamma * I  # changed
    dRdt = gamma * I  # same as before
    return dSdt, dEdt, dIdt, dRdt
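
One property worth checking for any compartment model is that people only move between compartments, so the four derivatives should sum to zero. The sketch below restates the right-hand side with a constant `beta` (an assumption made for brevity; the notebook uses a time-dependent `beta(t)`) and made-up compartment sizes.

```python
def seir_derivatives(S, E, I, R, N, beta, gamma, sigma):
    # same right-hand side as extended_deriv_SEIR, but with a constant beta
    dSdt = -beta * S * I / N
    dEdt = beta * S * I / N - sigma * E
    dIdt = sigma * E - gamma * I
    dRdt = gamma * I
    return dSdt, dEdt, dIdt, dRdt

# illustrative numbers only, not taken from the data
derivs = seir_derivatives(S=900.0, E=50.0, I=30.0, R=20.0, N=1000.0,
                          beta=0.3, gamma=1 / 9, sigma=1 / 3)
total_flow = sum(derivs)  # should vanish up to floating-point rounding

print(derivs, total_flow)
```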

In [None]:
def extended_SEIR(N, D, max_days, CFR_OPT, CFR_scaling_factor, R_0, **R_0_kwargs):
    '''
    R_0: callable
    '''
    I0, R0, E0 = 0, 0, 1  # changed: one exposed at the beginning
    S0 = N - I0 - R0 - E0

    gamma = 1.0 / D
    sigma = 1.0 / 3.0  # changed: 3 days until infectious

    def beta(t):
        return R_0(t, **R_0_kwargs) * gamma

    t = np.linspace(0, max_days - 1, max_days)  # one grid point per day, so I[d] is the value on day d

    # Initial conditions vector
    y0 = S0, E0, I0, R0
    # Integrate the SEIR equations over the time grid, t.
    ret = odeint(extended_deriv_SEIR, y0, t, args=(N, beta, gamma, sigma))
    S, E, I, R = ret.T

    def CFR(t):
        if t < 7:
            return CFR_OPT
        else:
            return CFR_OPT + CFR_scaling_factor * (I[t - 7] / N)  # changed: implemented 7-day shift until patients get to hospital

    # Adding deaths
    X = np.zeros(max_days)
    for day in range(16, max_days):  # changed: changed to 19 days until death minus 3 for the three "exposed days"
        for valid_day in range(day-16):
            X[day] += CFR(valid_day) * beta(valid_day) * I[valid_day]

    return t, S, E, I, R, X, [R_0(t, **R_0_kwargs) for t in range(max_days)], N, [CFR(t) for t in range(max_days)]
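
The nested loop that accumulates deaths recomputes the same partial sums for every day, which costs O(max_days²) work per model evaluation, and the model is evaluated many times during curve fitting. A possible alternative, sketched below, keeps a running total instead. Here `contrib` is a made-up list standing in for the daily terms `CFR(d) * beta(d) * I[d]`, and both helper names are hypothetical.

```python
def deaths_nested(contrib, delay=16):
    # mirrors the double loop in extended_SEIR
    n = len(contrib)
    X = [0.0] * n
    for day in range(delay, n):
        for valid_day in range(day - delay):
            X[day] += contrib[valid_day]
    return X

def deaths_running(contrib, delay=16):
    # same result with a single pass: add one new term per day
    n = len(contrib)
    X = [0.0] * n
    total = 0.0
    for day in range(delay + 1, n):
        total += contrib[day - delay - 1]
        X[day] = total
    return X

contrib = [float(d + 1) for d in range(40)]  # illustrative values 1.0, 2.0, ...
nested = deaths_nested(contrib)
running = deaths_running(contrib)

print(nested[16:20], running[16:20])
```

Both versions add the terms in the same order, so they agree exactly; with NumPy the same idea could be expressed as a shifted cumulative sum.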

In [None]:
def plot_extended_SEIR(t, S, E, I, R, X, R_0, N, CFR, x_ticks=None):
    # Plot S(t), E(t), I(t), deaths X(t) and R(t) as separate curves on one axis
    f, ax = plt.subplots(1,1,figsize=(10,4))
    ax.plot(t, S, 'b', alpha=0.7, linewidth=2, label='Susceptible')
    ax.plot(t, E, 'y--', alpha=0.7, linewidth=2, label='Exposed')
    ax.plot(t, I, 'y', alpha=0.7, linewidth=2, label='Infected')
    ax.plot(t, X, 'r', alpha=0.7, linewidth=2, label='Dead')
    ax.plot(t, R, 'g', alpha=0.7, linewidth=2, label='Recovered')

    ax.set_xlabel('Time (days)')
    ax.title.set_text('SEIR-Model with varying R_0 and CFR')
    # ax.set_ylabel('Number (1000s)')
    # ax.set_ylim(0,1.2)
    ax.yaxis.set_tick_params(length=0)
    ax.xaxis.set_tick_params(length=0)

    if x_ticks is not None:
        ax.set_xticks(t[::21])
        ax.set_xticklabels(x_ticks[::21])    

    ax.grid(True, which='major', c='w', lw=2, ls='-')  # positional flag: the `b` keyword was removed in newer Matplotlib
    legend = ax.legend()
    legend.get_frame().set_alpha(0.5)
    for spine in ('top', 'right', 'bottom', 'left'):
        ax.spines[spine].set_visible(False)
    plt.show();
    
    f = plt.figure(figsize=(10,4))
    # sp1
    ax1 = f.add_subplot(121)
    ax1.plot(t, R_0, 'b--', alpha=0.7, linewidth=2, label='R_0')
 
    ax1.set_xlabel('Time (days)')
    ax1.title.set_text('R_0 over time')
    # ax.set_ylabel('Number (1000s)')
    # ax.set_ylim(0,1.2)
    ax1.yaxis.set_tick_params(length=0)
    ax1.xaxis.set_tick_params(length=0)
    if x_ticks is not None:
        ax1.set_xticks(t[::35])
        ax1.set_xticklabels(x_ticks[::35])    
    ax1.grid(True, which='major', c='w', lw=2, ls='-')
    legend = ax1.legend()
    legend.get_frame().set_alpha(0.5)
    for spine in ('top', 'right', 'bottom', 'left'):
        ax1.spines[spine].set_visible(False)  # hide the frame of this subplot (was `ax`)

    # sp2
    ax2 = f.add_subplot(122)
    ax2.plot(t, CFR, 'r--', alpha=0.7, linewidth=2, label='CFR')
    
    ax2.set_xlabel('Time (days)')
    ax2.title.set_text('CFR over time')
    # ax.set_ylabel('Number (1000s)')
    # ax.set_ylim(0,1.2)
    ax2.yaxis.set_tick_params(length=0)
    ax2.xaxis.set_tick_params(length=0)
    if x_ticks is not None:
        ax2.set_xticks(t[::70])
        ax2.set_xticklabels(x_ticks[::70])
    ax2.grid(True, which='major', c='w', lw=2, ls='-')
    legend = ax2.legend()
    legend.get_frame().set_alpha(0.5)
    for spine in ('top', 'right', 'bottom', 'left'):
        ax2.spines[spine].set_visible(False)  # hide the frame of this subplot (was `ax`)

    plt.show();

Example: 
Now we can **really** see the flatten-the-curve patterns: population of 80 million, 9 days infectious (days 3-12 after exposure), CFR of 2% when hospitals are empty.

In [None]:
N = 80_000_000
D = 9
max_days = 400

s = CFR_scaling_factor = 0.1  # everyone infected at same time -> 12% instead of 2% die
CFR_OPT = 0.02  # no one in hospital -> only 2% die
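
To see what these settings imply, the sketch below works out the rates and the CFR range by hand. It uses the formulas from `extended_SEIR` (`gamma = 1/D`, `beta = R_0 * gamma`, `CFR = CFR_OPT + s * I/N`); the R_0 of 2.5 is the starting value of the scenarios in this section, and the helper `cfr` is only illustrative.

```python
N = 80_000_000
D = 9
CFR_OPT = 0.02
CFR_scaling_factor = 0.1

gamma = 1.0 / D            # recovery rate per day
sigma = 1.0 / 3.0          # rate of leaving the exposed compartment
beta_start = 2.5 * gamma   # infection rate for a starting R_0 of 2.5

def cfr(infected_7_days_ago):
    # CFR grows linearly with the share of the population infected 7 days earlier
    return CFR_OPT + CFR_scaling_factor * (infected_7_days_ago / N)

cfr_empty = cfr(0)          # hospitals empty
cfr_10pct = cfr(0.1 * N)    # 10% of the population infected
cfr_all = cfr(N)            # everyone infected at the same time

print(beta_start, cfr_empty, cfr_10pct, cfr_all)
```

The CFR therefore ranges from 2% (empty hospitals) to 12% (everyone infected at once), with 3% when a tenth of the population is infected.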

First with only a slight and late reduction of R_0 from 2.5 towards 1.9 around day 200.

In [None]:
plot_extended_SEIR(*extended_SEIR(N, D, max_days, CFR_OPT, CFR_scaling_factor, logistic_R_0, R_0_start=2.5, k=0.3, x0=200, R_0_end=1.9))

Now with an almost complete lockdown after 170 days:

In [None]:
plot_extended_SEIR(*extended_SEIR(N, D, max_days, CFR_OPT, CFR_scaling_factor, logistic_R_0, R_0_start=2.5, k=0.3, x0=170, R_0_end=0.2))

Okay, we now have all we need to fit the curves. Let's define the curve fitting methods as before:

In [None]:
def fit_extended_SEIR(country_name, missing_days=0, region_name=None, fit_method="least_squares", **R_0_kwargs):

    if region_name is not None:
        y_data = train[(train["Country_Region"] == country_name) & (train["Province_State"] == region_name)].Fatalities.values
    else:
        # compare row counts: more rows than a single-region country (Germany) means several regions
        if (train["Country_Region"] == country_name).sum() > (train["Country_Region"] == "Germany").sum():  # country with several regions and no region provided
            y_data = train[(train["Country_Region"] == country_name) & (train["Province_State"].isnull())].Fatalities.values
        else:
            y_data = train[train["Country_Region"] == country_name].Fatalities.values
        
    max_days = len(train.groupby("Date").sum().index) + missing_days # constant for all countries
    y_data = np.concatenate((np.zeros(missing_days), y_data))
    # country specific values
    N = world_population.loc[world_population['Country (or dependency)'] == country_name]["Population (2020)"].values[0]

    # x_data is just [0, 1, ..., max_days] array
    x_data = np.linspace(0, max_days - 1, max_days, dtype=int)

    # curve fitting from here
    def extended_SEIR_deaths(x, N, D, CFR_OPT, CFR_scaling_factor, R_0_delta, **R_0_kwargs):
        # print(x)
        t_, S_, E_, I_, R_, X, R_0_, N_, CFR_ = extended_SEIR(N, D, max_days, CFR_OPT, CFR_scaling_factor, R_0=logistic_R_0, **R_0_kwargs)
        # return np.concatenate((np.zeros(int(outbreak)), X))
        return X[x]

    mod = Model(extended_SEIR_deaths)

    # initial values and bounds
    mod.set_param_hint('N', value=N, vary=False)
    # mod.set_param_hint('max_days', value=max_days, vary=False)
    mod.set_param_hint('D', value=9, vary=False)

    mod.set_param_hint('CFR_OPT', value=0.01, min=0.0001, max=0.1)
    mod.set_param_hint('CFR_scaling_factor', value=0.1, min=0.0001, max=1.0)
    
    mod.set_param_hint('R_0_start', value=2.5, min=1.0, max=5.0)
    mod.set_param_hint('R_0_end', value=0.7, min=0.01, max=5.0)
    # mod.set_param_hint('outbreak', value=20, min=0, max=150)
    mod.set_param_hint('x0', value=30.0, min=0.0, max=float(max_days))
    mod.set_param_hint('k', value=0.1, min=0.01, max=5.0)
    '''
    if R_0_kwargs:
        for arg in R_0_kwargs:
            mod.set_param_hint(arg, value=R_0_kwargs[arg])
    '''

    params = mod.make_params()
    params.add('R_0_delta', value=1.0, min=0.0, expr="R_0_start - R_0_end")  # add constraint R_0_start >= R_0_end
    # print(params)
    result = mod.fit(y_data, params, method=fit_method, x=x_data)

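    # fetch some result parameters
    D = result.params["D"].value  # fixed via the hint above; read it back so the return value does not rely on a global D
    CFR_OPT = result.params["CFR_OPT"].value
    CFR_scaling_factor = result.params["CFR_scaling_factor"].value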
    R_0_result_params = {}
    for val in R_0_kwargs:
        R_0_result_params[val] = result.params[val].value

    return result, country_name, y_data, N, D, max_days, CFR_OPT, CFR_scaling_factor, R_0_result_params


def extended_SEIR_fitted_plot(result, country_name, y_data):
    # one date label per data point, ending on LAST_DATE (the day of the last observation)
    last_day = np.datetime64(LAST_DATE)
    x_ticks = np.arange(last_day - np.timedelta64(len(y_data) - 1, 'D'), last_day + np.timedelta64(1, 'D'), step=np.timedelta64(1, 'D'))
    x_ticks = [np.datetime_as_string(t, unit='D') for t in x_ticks]

    plt.figure(figsize=(10,5))
    x_data = np.arange(len(y_data))  # day index 0, 1, ..., len(y_data) - 1
    real_data, = plt.plot(x_data, y_data, 'bo', label="real data")
    SIR_fit = plt.plot(x_data, result.best_fit, 'r-', label="SEIR model")
    
    plt.xlabel("Day")
    plt.xticks(x_data[::30], x_ticks[::30])
    # print(x_ticks)
    plt.ylabel("Fatalities")
    plt.title("Real Data vs SEIR-Model in " + country_name)
    plt.legend(numpoints=1, loc=2, frameon=None)
    plt.show()
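
When labelling the x-axis, each data point needs exactly one date, and the last point should carry the date of the last observation. A minimal standard-library sketch of that bookkeeping follows; the helper `daily_labels` and the example date are illustrative and not part of the notebook.

```python
from datetime import date, timedelta

def daily_labels(last_date, n_points):
    # hypothetical helper: n_points consecutive days, the final one being last_date
    last = date.fromisoformat(last_date)
    first = last - timedelta(days=n_points - 1)
    return [(first + timedelta(days=i)).isoformat() for i in range(n_points)]

labels = daily_labels("2020-03-31", 5)  # example date, not read from the data
print(labels)
```

Starting `n_points - 1` days before the last date (rather than `n_points`) is what keeps the final label aligned with the final observation.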

In [None]:
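# note: the R_0 keyword values below only name the parameters to read back after the fit;
# the fit itself starts from the hints set inside fit_extended_SEIR
result, country_name, y_data, N, D, max_days, CFR_OPT, CFR_scaling_factor, R_0_result_params = fit_extended_SEIR(
    "Italy", missing_days=30, fit_method="least_squares",
    R_0_start=2.5, k=0.3, x0=170, R_0_end=0.2)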

print(result.fit_report())
extended_SEIR_fitted_plot(result, "Italy", y_data);

future = 100
# one label per simulated day: the data end on LAST_DATE, followed by `future` extra days
x_ticks = np.arange(np.datetime64(LAST_DATE) - np.timedelta64(len(y_data) - 1, 'D'), np.datetime64(LAST_DATE) + np.timedelta64(future + 1, 'D'), step=np.timedelta64(1,'D'))
x_ticks = [pd.to_datetime(str(t)).strftime("%m/%d") for t in x_ticks]
plot_extended_SEIR(*extended_SEIR(N, D, max_days + future, CFR_OPT, CFR_scaling_factor, logistic_R_0, **R_0_result_params), x_ticks=x_ticks)

In [None]:
y_data = train[(train["Country_Region"] == "Italy") & (train["Province_State"].isnull())].Fatalities.values

x_orig = np.arange(100, len(y_data) + 100)  # original series shifted right by 100 days
# print(x_orig.shape)
plt.plot(x_orig, y_data)

zero_part = np.zeros(100)
y_2 = np.concatenate((zero_part, y_data))
# Gaussian noise: mean 0, standard deviation 1, one sample per element of y_2
noise = np.random.normal(0, 1, y_2.shape)
plt.plot(y_2 + noise)


plt.show();

# //TODO: 
1. use fitted SIR-Models to predict for all countries
2. use ML to predict