# Introduction
Using mathematical epidemic models, this notebook predicts the number of COVID-19 cases in the US.
The data describe COVID-19 (Novel Corona Virus 2019, 2019-nCoV), the disease caused by the SARS-CoV-2 virus.
The goal of this notebook is to compare metapopulation models with traditional epidemic models.

 * [Preparation (load data, preprocessing)](https://www.kaggle.com/prbocca/covid-19-data-with-mphsir-model-us-scenario#Preparation)
 * [Trend analysis](https://www.kaggle.com/prbocca/covid-19-data-with-mphsir-model-us-scenario#Trend-analysis)
 * [Traditional Epidemic Models: SIR, SIR-D, SIR-F, SEWIR-F](https://www.kaggle.com/prbocca/covid-19-data-with-mphsir-model-us-scenario#Traditional-Epidemic-Models)
 * [First MetaPopulation model: MPHSIR](https://www.kaggle.com/prbocca/covid-19-data-with-mphsir-model-us-scenario#MetaPopulation-Epidemic-Models)
 * [Conclusion: Model comparison](https://www.kaggle.com/prbocca/covid-19-data-with-mphsir-model-us-scenario#Conclusion:-Models-comparision)

Note: This notebook was designed to work with high-precision mobility data, for example data obtained from mobile applications or mobile operators. For confidentiality reasons, outdated public mobility data from the [Bureau of Transportation Statistics](https://www.transtats.bts.gov/DatabaseInfo.asp?DB_ID=111) are used here instead.

Note:  
"Infected" means the currently infected and confirmed cases.  
This can be calculated  as "Confirmed" - "Deaths" - "Recovered"

In [None]:
# n_trials: main optimization parameter for all models
#n_trials=10 #debug
n_trials=500 #production

from datetime import datetime
time_format = "%d%b%Y %H:%M"
datetime.now().strftime(time_format)

## Major update
 * 11Apr2020: First version following the work done by [@lisphilar](https://www.kaggle.com/lisphilar) in [covid-19-data-with-sir-model](https://www.kaggle.com/lisphilar/covid-19-data-with-sir-model)


## Acknowledgement
### Datasets in kaggle
* The number of cases: [Novel Corona Virus 2019 Dataset](https://www.kaggle.com/sudalairajkumar/novel-corona-virus-2019-dataset)
* Total population: [covid19-global-forecasting-locations-population](https://www.kaggle.com/dgrechka/covid19-global-forecasting-locations-population/metadata)

### External resources
* Population pyramid: [PopulationPyramid.net](https://www.populationpyramid.net/) licensed under [Creative Commons license CC BY 3.0](https://creativecommons.org/licenses/by/3.0/igo/)


### References
* Simple SIR model: [The SIR epidemic model](https://scipython.com/book/chapter-8-scipy/additional-examples/the-sir-epidemic-model/)
* SEIR model: [Introduction to SEIR Models](http://indico.ictp.it/event/7960/session/3/contribution/19/material/slides/)
* Basic reproduction number: [Van den Driessche, P., & Watmough, J. (2002).](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6002118/)
* Basic reproduction number: [Infection Modeling — Part 1: Estimating the Impact of a Pathogen via Monte Carlo Simulation](https://towardsdatascience.com/infection-modeling-part-1-87e74645568a)
* Growth Factor: [Exponential growth and epidemics](https://www.youtube.com/watch?v=Kas0tIxDvrg)
* Physical distancing (social distancing): [YouTube: Simulating an epidemic](https://www.youtube.com/watch?v=gxAaO2rsdIs)

# Preparation

## Package

In [None]:
from collections import defaultdict
from datetime import timedelta
from dateutil.relativedelta import relativedelta
import math
import os
from pprint import pprint
import warnings
from fbprophet import Prophet
from fbprophet.plot import add_changepoints_to_plot
import pystan.misc  # workaround: without this import, model.fit() raises "AttributeError: module 'pystan' has no attribute 'misc'"
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import matplotlib
from matplotlib.ticker import ScalarFormatter
%matplotlib inline
import numpy as np
import optuna
optuna.logging.disable_default_handler()
import pandas as pd
import dask.dataframe as dd
pd.plotting.register_matplotlib_converters()
import seaborn as sns
from scipy.integrate import solve_ivp
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

import pickle

# display: show several DataFrames per cell; Markdown: show rich text
from IPython.display import Markdown, display


def printmd(string):
    """Render a Markdown-formatted string in the notebook output."""
    display(Markdown(string))

## Local functions

In [None]:
# Upload local code as a dataset:
# use "+ Add data" to create a dataset named "localsrc", then drag and drop the .py files

from shutil import copyfile

#copy our file into the working directory (make sure it has .py suffix)
copyfile(src = "../input/localsrc/d01_utils_common.py", dst = "../working/d01_utils_common.py")
copyfile(src = "../input/localsrc/d03_models_epidemic_models.py", dst = "../working/d03_models_epidemic_models.py")
copyfile(src = "../input/localsrc/d03_models_mp_epidemic_models.py", dst = "../working/d03_models_mp_epidemic_models.py")


#import all our functions
#from my_functions import *
from d01_utils_common import *
# Plotting
# Trend analysis
# Dataset arrangement

from d03_models_epidemic_models import *
# Numerical simulation. We will perform numerical analysis to solve the ODE using scipy.integrate.solve_ivp function.
# Parameter Estimation using Optuna
# Description of math model: SIR, SIR-D, SIR-F, SEWIR-F, SIR-FV models
# Prediction of the data using some models

from d03_models_mp_epidemic_models import *
# Numerical simulation. We will perform numerical analysis to solve the ODE using scipy.integrate.solve_ivp function.
# Parameter Estimation using Optuna
# Description of math model: MPHSIR, ... models
# Prediction of the data using some models



In [None]:
# Random seed
np.random.seed(2019)
os.environ["PYTHONHASHSEED"] = "2019"
# Matplotlib
plt.style.use("seaborn-ticks")
plt.rcParams["xtick.direction"] = "in"
plt.rcParams["ytick.direction"] = "in"
plt.rcParams["font.size"] = 11.0
plt.rcParams["figure.figsize"] = (9, 6)
# Pandas
pd.set_option("display.max_colwidth", 1000)

## List of dataset

In [None]:
for dirname, _, filenames in os.walk("/kaggle/input"):
    for filename in filenames:
        print(os.path.join(dirname, filename))

## Total population

In [None]:
population_raw = pd.read_csv(
    "/kaggle/input/covid19-global-forecasting-locations-population/locations_population.csv"
)

pprint("Number of NULL values:")
pprint(pd.DataFrame(population_raw.isnull().sum()).T)

display(population_raw.head())

I transform the dataset: country totals are added by summing the province / state records, and a global total is added by summing over all countries.

In [None]:
df = population_raw.copy()
df = df.rename({"Province.State": "Province", "Country.Region": "Country"}, axis=1)
cols = ["Country", "Province", "Population"]
df = df.loc[:, cols].fillna("-")
df.loc[df["Country"] == df["Province"], "Province"] = "-"
# Add total records
_total_df = df.loc[df["Province"] != "-", :].groupby("Country").sum()
_total_df = _total_df.reset_index().assign(Province="-")
df = pd.concat([df, _total_df], axis=0, sort=True)
df = df.drop_duplicates(subset=["Country", "Province"], keep="first")
# Global
global_value = df.loc[df["Province"] == "-", "Population"].sum()
df = df.append(pd.Series(["Global", "-", global_value], index=cols), ignore_index=True)
# Sorting
df = df.sort_values("Population", ascending=False).reset_index(drop=True)
df = df.loc[:, cols]
population_df = df.copy()
population_df.head()

In the US, the provinces are the states.

In [None]:
display(population_df[population_df['Country']=="US"])
pprint(sorted(population_df.loc[population_df['Country']=="US", "Province"].to_list()))

Save the country-level populations in the dictionary `population_dict`.

In [None]:
df = population_df.loc[population_df["Province"] == "-", :]
population_dict = df.set_index("Country").to_dict()["Population"]
population_dict

## Raw data: the number of cases

In [None]:
raw = pd.read_csv("/kaggle/input/novel-corona-virus-2019-dataset/covid_19_data.csv")

pprint("INFO:")
pprint(raw.info())

pprint("DESCRIBE:")
pprint(raw.describe())

pprint("NULL DATA:")
pprint(pd.DataFrame(raw.isnull().sum()).T)

pprint("REPORTED COUNTRIES:")
pprint(", ".join(raw["Country/Region"].unique().tolist()))

pprint("RARE COUNTRIES:")
pprint(raw.loc[raw["Country/Region"] == "Others", "Province/State"].unique().tolist(), compact=True)

raw.tail()

## Data cleaning: the number of cases
Note: "Infected" = "Confirmed" - "Deaths" - "Recovered"

In [None]:
data_cols = ["Infected", "Deaths", "Recovered"]
data_cols_all = ["Confirmed", "Infected", "Deaths", "Recovered"]
rate_cols = ["Fatal per Confirmed", "Recovered per Confirmed", "Fatal per (Fatal or Recovered)"]
variable_dict = {"Susceptible": "S", "Infected": "I", "Recovered": "R", "Deaths": "D"}

In [None]:
df = raw.rename({"ObservationDate": "Date", "Province/State": "Province"}, axis=1)
df["Date"] = pd.to_datetime(df["Date"])
df["Country"] = df["Country/Region"].replace(
    {
        "Mainland China": "China",
        "Hong Kong SAR": "Hong Kong",
        "Taipei and environs": "Taiwan",
        "Iran (Islamic Republic of)": "Iran",
        "Republic of Korea": "South Korea",
        "Republic of Ireland": "Ireland",
        "Macao SAR": "Macau",
        "Russian Federation": "Russia",
        "Republic of Moldova": "Moldova",
        "Taiwan*": "Taiwan",
        "Cruise Ship": "Others",
        "United Kingdom": "UK",
        "Viet Nam": "Vietnam",
        "Czechia": "Czech Republic",
        "St. Martin": "Saint Martin",
        "Cote d'Ivoire": "Ivory Coast",
        "('St. Martin',)": "Saint Martin",
        "Congo (Kinshasa)": "Congo",
    }
)
df["Province"] = df["Province"].fillna("-").replace(
    {
        "Cruise Ship": "Diamond Princess",
        "Diamond Princess cruise ship": "Diamond Princess"
    }
)
df.loc[df["Country"] == "Diamond Princess", ["Country", "Province"]] = ["Others", "Diamond Princess"]
df["Infected"] = df["Confirmed"] - df["Deaths"] - df["Recovered"]
df[data_cols_all] = df[data_cols_all].astype(np.int64)
ncov_df_ungrouped = df.loc[:, ["Date", "Country", "Province", *data_cols_all]]


pprint("INFO:")
display(ncov_df_ungrouped.info())

pprint("DESCRIBE:")
display(ncov_df_ungrouped.describe(include="all").fillna("-"))

pprint("NULL DATA:")
display(pd.DataFrame(ncov_df_ungrouped.isnull().sum()).T)

pprint("REPORTED COUNTRIES:")
pprint(", ".join(ncov_df_ungrouped["Country"].unique().tolist()))

ncov_df_ungrouped.tail()

## Grouping by growth factor
The number of confirmed cases is increasing in many countries, but there are two types of countries. In a first-type country, the growth factor is larger than 1 and the number of cases is increasing rapidly. In a second-type country, the growth factor is less than 1.

### Calculate growth factor
Where $C$ is the number of confirmed cases,  
$$\mathrm{Growth\ Factor} = \cfrac{\Delta \mathrm{C}_{n}}{\Delta \mathrm{C}_{n-1}}$$
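As a small worked example of this ratio, the sketch below applies the formula to a hypothetical series of cumulative confirmed counts (the numbers are illustrative, not taken from the dataset). It uses a plain `shift(1)` on consecutive daily rows and treats a zero denominator as missing; both are simplifying assumptions of the sketch.

```python
import numpy as np
import pandas as pd

# Hypothetical cumulative confirmed counts for one country (illustrative values only)
confirmed = pd.Series(
    [10, 20, 40, 60, 70],
    index=pd.date_range("2020-03-01", periods=5, freq="D"),
    name="Confirmed",
)

# Daily new cases: delta C_n
new_cases = confirmed.diff()

# Growth factor: delta C_n / delta C_(n-1)
growth_factor = new_cases / new_cases.shift(1)

# A zero denominator would give inf; treat it as missing in this sketch
growth_factor = growth_factor.replace([np.inf, -np.inf], np.nan)

print(growth_factor)
```

In this toy series the daily increments are 10, 20, 20 and 10, so the growth factor moves from above 1 (accelerating) through 1 (steady) to below 1 (slowing).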

In [None]:
df = ncov_df_ungrouped.pivot_table(
    index="Date", columns="Country", values="Confirmed", aggfunc="sum"
).fillna(method="ffill").fillna(0)
# Growth factor: (delta Number_n) / (delta Number_(n-1))
df = df.diff() / df.diff().shift(freq="D")
df = df.replace(np.inf, np.nan).fillna(1.0)
# Rolling mean (window: 7 days)
df = df.rolling(7).mean()
df = df.iloc[6:, :]
# round: 0.01
growth_value_df = df.round(2)
growth_value_df.tail()

### Grouping countries based on growth factor
* Outbreaking: growth factor $>$ 1 for the last 7 days
* Stopping: growth factor $<$ 1 for the last 7 days
* At a crossroad: the others
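The "straight days" count behind these groups can be obtained with a reversed cumulative product. The sketch below illustrates the idea on hypothetical growth-factor values for three made-up countries `A`, `B` and `C`; the helper names `straight_days` and `classify` are introduced here only for illustration.

```python
import pandas as pd

# Hypothetical 7-day-averaged growth factors (illustrative values only)
gf = pd.DataFrame(
    {
        "A": [0.8, 1.2, 1.1, 1.3, 1.2, 1.4, 1.1, 1.2, 1.3],
        "B": [1.5, 1.2, 0.9, 0.8, 0.7, 0.9, 0.8, 0.6, 0.9],
        "C": [1.2, 1.1, 0.9, 1.2, 1.3, 0.8, 1.1, 1.2, 1.3],
    },
    index=pd.date_range("2020-04-01", periods=9, freq="D"),
)


def straight_days(mask: pd.DataFrame) -> pd.Series:
    """Count how many of the most recent rows are True without interruption."""
    # Reverse so the latest day comes first; the cumulative product
    # drops to 0 at the first False and stays there.
    return mask.iloc[::-1].astype(int).cumprod().sum(axis=0)


def classify(n_more: int, n_less: int) -> str:
    """Apply the three grouping rules listed above."""
    if n_more >= 7:
        return "Outbreaking"
    if n_less >= 7:
        return "Stopping"
    return "Crossroad"


more = straight_days(gf > 1)
less = straight_days(gf < 1)
groups = pd.Series(
    [classify(m, l) for m, l in zip(more, less)], index=gf.columns, name="Group"
)
print(pd.DataFrame({"GF > 1": more, "GF < 1": less, "Group": groups}))
```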

In [None]:
df = growth_value_df.copy()
df = df.iloc[-7:, :].T
day_cols = df.columns.strftime("%d%b%Y")
df.columns = day_cols
last_date = day_cols[-1]
# Grouping
more_col, less_col = "GF > 1 [straight days]", "GF < 1 [straight days]"
df[more_col] = (growth_value_df > 1).iloc[::-1].cumprod().sum(axis=0)
df[less_col] = (growth_value_df < 1).iloc[::-1].cumprod().sum(axis=0)
df["Group"] = df[[more_col, less_col]].apply(
    lambda x: "Outbreaking" if x[0] >= 7 else "Stopping" if x[1] >= 7 else "Crossroad",
    axis=1
)
# Sorting
df = df.loc[:, ["Group", more_col, less_col, *day_cols]]
df["rank1"] = df[more_col] * df[last_date]
df["rank2"] = df[less_col] * df[last_date]
df = df.sort_values(["Group", "rank1", "rank2"], ascending=False)
df = df.drop(["rank1", "rank2"], axis=1)
growth_df = df.copy()
growth_df.head()

In [None]:
df = pd.merge(ncov_df_ungrouped, growth_df["Group"].reset_index(), on="Country")
ncov_df = df.loc[:, ["Date", "Group", *ncov_df_ungrouped.columns[1:]]]
ncov_df.tail()

The cleaned US dataset:

In [None]:
display(ncov_df[ncov_df['Country']=='US'])

df = ncov_df[ncov_df['Country']=='US'].groupby("Date").sum()
df = df.loc[:,[*data_cols]]
display(df)
line_plot(df, "US: Cases over time")

### Visualize total data

In [None]:
# compute total
total_df = ncov_df.groupby("Date").sum()
total_df[rate_cols[0]] = total_df["Deaths"] / total_df[data_cols].sum(axis=1)
total_df[rate_cols[1]] = total_df["Recovered"] / total_df[data_cols].sum(axis=1)
total_df[rate_cols[2]] = total_df["Deaths"] / (total_df["Deaths"] + total_df["Recovered"])


# plotting
pprint(f"{(total_df.index.max() - total_df.index.min()).days} days have passed from the date of the first record.")
line_plot(total_df[data_cols], "Total number of cases over time")
line_plot(total_df[rate_cols], "Global rate over time", ylabel="", math_scale=False)

total_df[rate_cols].plot.kde()
plt.title("Kernel density estimation of the rates")
plt.show()
display(total_df[rate_cols].describe().T)

total_df.tail()

## Mobile Mobility

Note: This notebook was designed to work with high-precision mobility data, for example data obtained from mobile applications or mobile operators. For confidentiality reasons, outdated public mobility data from the [Bureau of Transportation Statistics](https://www.transtats.bts.gov/DatabaseInfo.asp?DB_ID=111) are used here instead.

The data are disaggregated by state, and I aggregate the states into subpopulations.
As a starting point, I use the 4 regions designated by the US Census Bureau: [Census_Bureau-designated_regions_and_divisions](https://en.wikipedia.org/wiki/List_of_regions_of_the_United_States#Census_Bureau-designated_regions_and_divisions)
<div>
<img src="attachment:image.png" width="500"/>
</div>

In [None]:
#Region 1: Northeast: Connecticut, Maine, Massachusetts, New Hampshire, Rhode Island, Vermont, New Jersey, New York, Pennsylvania
#Region 2: Midwest: Illinois, Indiana, Michigan, Ohio, Wisconsin, Iowa, Kansas, Minnesota, Missouri, Nebraska, North Dakota, South Dakota
#Region 3: South: Delaware, Florida, Georgia, Maryland, North Carolina, South Carolina, Virginia, District of Columbia, West Virginia, Alabama, Kentucky, Mississippi, Tennessee, Arkansas, Louisiana, Oklahoma, Texas
#Region 4: West: Arizona, Colorado, Idaho, Montana, Nevada, New Mexico, Utah, Wyoming, Alaska, California, Hawaii, Oregon, Washington

subpopulation_name   = ["Northeast", "Midwest", "South", "West"]
subpopulation_mapper = {
     'Alabama':"South",
     'Alaska':"West",
     'Arizona':"West",
     'Arkansas':"South",
     'California':"West",
     'Colorado':"West",
     'Connecticut':"Northeast",
     'Delaware':"South",
     'Diamond Princess':"-",
     'District of Columbia':"South",
     'Florida':"South",
     'Georgia':"South",
     'Grand Princess':"-",
     'Guam': "-",
     'Hawaii':"West",
     'Idaho':"West",
     'Illinois':"Midwest",
     'Indiana':"Midwest",
     'Iowa':"Midwest",
     'Kansas':"Midwest",
     'Kentucky':"South",
     'Louisiana':"South",
     'Maine':"Northeast",
     'Maryland':"South",
     'Massachusetts':"Northeast",
     'Michigan':"Midwest",
     'Minnesota':"Midwest",
     'Mississippi':"South",
     'Missouri':"Midwest",
     'Montana':"West",
     'Nebraska':"Midwest",
     'Nevada':"West",
     'New Hampshire':"Northeast",
     'New Jersey':"Northeast",
     'New Mexico':"West",
     'New York':"Northeast",
     'North Carolina':"South",
     'North Dakota':"Midwest",
     'Ohio':"Midwest",
     'Oklahoma':"South",
     'Oregon':"West",
     'Pennsylvania':"Northeast",
     'Puerto Rico': "-",
     'Rhode Island':"Northeast",
     'South Carolina':"South",
     'South Dakota':"Midwest",
     'Tennessee':"South",
     'Texas':"South",
     'United States Virgin Islands':"-",
     'Utah':"West",
     'Vermont':"Northeast",
     'Virgin Islands':"-",
     'Virginia':"South",
     'Washington':"West",
     'West Virginia':"South",
     'Wisconsin':"Midwest",
     'Wyoming':"West"
}
    
subpopulation_excluded = ['U.S. Pacific Trust Territories and Possessions', 
                          'U.S. Virgin Islands',
                          'Diamond Princess',
                          'Grand Princess',
                          'Guam',
                          'Puerto Rico',
                          'United States Virgin Islands',
                          'Virgin Islands'
                         ]
                          

In [None]:
df = population_df.loc[population_df["Country"] == "US", :]
df = df[df["Province"]!="-"]  # drop the country total
df = df.drop("Country", axis=1)  # drop the country column
df = df[~df.Province.isin(subpopulation_excluded)]  # drop territories and cruise ships

# Add the subpopulations
df['sp'] = [ subpopulation_mapper[d] for d in df['Province']]
for sp in subpopulation_name:
    sp_value = df.loc[df["sp"]==sp, "Population"].sum()
    df = df.append(pd.Series([sp,sp_value, "-"], index=["Province", "Population", "sp"]), ignore_index=True)

# Add the total
global_value = df.loc[df["sp"]!="-", "Population"].sum()
df = df.append(pd.Series(["Total",global_value, "-"], index=["Province", "Population", "sp"]), ignore_index=True)

#display(df)
subpopulation_dict = df.set_index("Province").to_dict()["Population"]
display(subpopulation_dict)

### Load mobility

We will use the "T-100 Domestic Market (All Carriers)" dataset from the [Bureau of Transportation Statistics](https://www.transtats.bts.gov/DatabaseInfo.asp?DB_ID=111),
with data from January to September of 2019.

In [None]:
raw_S = pd.read_csv("/kaggle/input/localsrc/8573377_T_T100D_MARKET_ALL_CARRIER.csv")
#raw_S["Date"] = [str(row["YEAR"]) + "-" + str(row["MONTH"]) + "-01"  for index, row in raw_S.iterrows()]
raw_S["Date"] = raw_S["YEAR"].astype(str) + "-" + raw_S["MONTH"].astype(str).str.zfill(2) + "-01"  # zero-pad the month so that the string dates sort and compare chronologically
raw_S.rename(columns={"ORIGIN_STATE_NM": "orig_dept", "DEST_STATE_NM":"dest_dept", "PASSENGERS":"count"}, inplace=True)
raw_S = raw_S.loc[:, ["Date","orig_dept", "dest_dept","count"]]
raw_S = raw_S.groupby(["Date","orig_dept", "dest_dept"]).sum()
raw_S = raw_S.reset_index(["Date","orig_dept","dest_dept"])
#display(raw_S.head())

pprint("NULL DATA:")
display(pd.DataFrame(raw_S.isnull().sum()).T)
printmd("**There are no null values**")
#raw_S = raw_S.dropna()

pprint("INFO:")
display(raw_S.info())

pprint("DESCRIBE:")
display(raw_S.describe(include="all").fillna("-"))

pprint("REPORTED SUBPOPULATIONS:")
display(", ".join(sorted(raw_S["orig_dept"].fillna("-").unique().tolist())))

pprint("REMOVE BAD STATES:")
pprint(raw_S.shape)
raw_S = raw_S[~raw_S.orig_dept.isin(subpopulation_excluded)]
raw_S = raw_S[~raw_S.dest_dept.isin(subpopulation_excluded)]
pprint(raw_S.shape)

display(raw_S.tail())

### Subpopulation Mapping

In [None]:
raw_S['orig_sp'] = [ subpopulation_mapper[d] for d in raw_S['orig_dept']]
raw_S['dest_sp'] = [ subpopulation_mapper[d] for d in raw_S['dest_dept']]

pprint("Number of null values:")
display(pd.DataFrame(raw_S.isnull().sum()).T)

display(raw_S)

### Subpopulation Mobility Analysis

First I define the "market share" used to scale the mobility data: the fraction of each subpopulation's total population that is covered by the mobility data.

In [None]:
# TODO: no market-share information is available for this dataset

'''
_marketshare_df = raw_P.groupby(["Date","sp"]).sum()
_marketshare_df = _marketshare_df.reset_index("sp")
_marketshare_df = _marketshare_df.pivot_table(index=['Date'], columns='sp')
_marketshare_df.columns = _marketshare_df.columns.droplevel().rename(None)
display(_marketshare_df.head())

_subpopulation_size = _marketshare_df.mean() # size of the subpopulations sampled by the mobility data (not the total population)
subpopulation_size = np.array([_subpopulation_size[sp] for sp in subpopulation_name]) # make sure the order is correct
subpopulation_marketshare = np.array([_subpopulation_size[sp]/subpopulation_dict[sp] for sp in subpopulation_name]) # make sure the order is correct

line_plot(_marketshare_df, "Total number of individuals in subpopulations over time", h = subpopulation_size)

_marketshare_df.plot.kde()
plt.title("Kernel density estimation of the marketshares")
plt.show()
display(_marketshare_df.describe().T)
'''
subpopulation_marketshare = np.ones(len(subpopulation_name))


pprint("The fraction of each subpopulation " + str(subpopulation_name) + " with known mobility is:")
display(subpopulation_marketshare)

Now I calculate the average mobility matrices before and during the pandemic.

I do not have mobility data for the pandemic period, so I assume that mobility was 30% lower than in the comparable 2019 data.
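To make the scaling concrete, here is a minimal arithmetic sketch of how a monthly passenger count becomes a per-capita daily rate and then a per-minute rate under the assumed 30% reduction. The passenger count and population are hypothetical round numbers, not values from the BTS data, and the variable names are chosen only for this example.

```python
# Hypothetical inputs (illustrative values only, not taken from the BTS data)
monthly_passengers = 1_500_000   # average passengers per month between two regions
population = 50_000_000          # population used for the per-capita scaling
market_share = 1.0               # fraction of travellers covered by the data
pandemic_factor = 0.7            # assumed 30% reduction during the pandemic

months, days = 9, 273            # January to September 2019

# Monthly average -> per-capita daily average
l_before = monthly_passengers / market_share / population * months / days

# Assumed pandemic-period mobility
l_pandemic = l_before * pandemic_factor

# Per-minute rate, matching the [1/min] unit used by the model parameters
a_pandemic = l_pandemic / (24 * 60)

print(f"daily rate before:   {l_before:.6f}")
print(f"daily rate pandemic: {l_pandemic:.6f}")
print(f"per-minute rate:     {a_pandemic:.3e}")
```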

In [None]:
pprint("Min date in dataset: " + str(min(raw_S["Date"])))
pprint("Max date in dataset: " + str(max(raw_S["Date"])))

before_sample_start = "2019-01-01"  # start of the pre-pandemic sample window
before_sample_end = "2019-12-31"  # end of the pre-pandemic sample window (exclusive)

# no pandemic-period data is available, so the same 2019 window is reused
pandemia_sample_start = "2019-01-01"
pandemia_sample_end = "2019-12-31"
pandemic_factor = 0.7


In [None]:
_l_df = raw_S.groupby(["Date","orig_sp","dest_sp"]).sum() # aggregate the trips between all states of the same pair of subpopulations
_l_df = _l_df.reset_index(["orig_sp","dest_sp"])
_l_df = _l_df.loc[_l_df["orig_sp"]!= _l_df["dest_sp"],:] # drop trips within a subpopulation: the model does not use them (and they only represent trips between states of that subpopulation)
_l_df = _l_df.pivot_table(index=['Date'], columns=["orig_sp","dest_sp"]) # unmelt
_l_df.columns = _l_df.columns.droplevel() # drop the first level of the column MultiIndex

_l_before_df = _l_df.loc[(_l_df.index >= before_sample_start) & (_l_df.index < before_sample_end),:]
_l_pandemia_df = _l_df.loc[(_l_df.index >= pandemia_sample_start) & (_l_df.index < pandemia_sample_end),:]
display(_l_pandemia_df.head())

_lflatten_df = _l_df.copy()
_lflatten_df.columns = _lflatten_df.columns.map('|'.join).str.strip('|') # join the two remaining MultiIndex levels for display
line_plot(_lflatten_df, "Total number of individuals that travel in the day to sp living in orig (sp|orig)", math_scale=False, 
          h=np.concatenate([np.array(_l_before_df.mean()),np.array(_l_pandemia_df.mean())]))

# convert to a matrix, scaled by market share
# (indexing by subpopulation_name keeps rows and columns in the right order)
_l_before_df = _l_before_df.mean()
_l_pandemia_df = _l_pandemia_df.mean()
l_before = np.zeros([len(subpopulation_name),len(subpopulation_name)])
l_pandemia = np.zeros([len(subpopulation_name),len(subpopulation_name)])
for sp_i in range(len(subpopulation_name)):
    for sp_j in range(len(subpopulation_name)):
        if (sp_i != sp_j):
            l_before[sp_i,sp_j] = _l_before_df[subpopulation_name[sp_i],subpopulation_name[sp_j]] / subpopulation_marketshare[sp_j] / subpopulation_dict[subpopulation_name[sp_j]]
            l_pandemia[sp_i,sp_j] = _l_pandemia_df[subpopulation_name[sp_i],subpopulation_name[sp_j]] / subpopulation_marketshare[sp_j] / subpopulation_dict[subpopulation_name[sp_j]]
l_before =  l_before/273*9  # monthly average to daily average (data only from January to September: 9 months, 273 days)
l_pandemia =  l_pandemia/273*9*pandemic_factor
pprint("Fraction of individuals living in subpopulation j who travel to subpopulation i, per day, scaled by market share:")
display(l_pandemia)

a_before =  l_before/24/60 
a_pandemia =  l_pandemia/24/60 
pprint("Fraction of individuals living in subpopulation j who travel to subpopulation i, per minute:")
display(a_pandemia)

In [None]:
pprint("No return-trip data is available, so r = l is assumed (see below).")
r_before = l_before
r_pandemia = l_pandemia

pprint("The rate of individuals that commute from subpopulation i to their homes at subpopulation j, per day, scaled by market share:")
display(r_pandemia)

b_before =  r_before/24/60 
b_pandemia =  r_pandemia/24/60 
pprint("The rate of individuals that commute from subpopulation i to their homes at subpopulation j, per minute:")
display(b_pandemia)

# Trend analysis

Using the fbprophet package, we will find the change points of log10(confirmed/deaths/recovered).  
The analysis is applied to the US, the country this notebook focuses on.

In [None]:
uy_country = 'US'
uy_df = ncov_df.loc[ncov_df["Country"] == uy_country, ["Date", *data_cols_all]].groupby("Date").sum()

display(uy_df)
line_plot(uy_df, f"{uy_country}: Cases over time", y_integer=True)

In [None]:
show_trend(ncov_df, variable="Confirmed", places=[(uy_country, None)])
printmd("**The slope changed on 29Feb2020.**")

In [None]:
#show_trend(ncov_df, variable="Confirmed", places=[(uy_country, None)], n_changepoints=-1, start_date="29Feb2020")
printmd("**It changed again on 28Mar2020...**")

In [None]:
#show_trend(ncov_df, variable="Confirmed", places=[(uy_country, None)], n_changepoints=-1, start_date="28Mar2020")
printmd("**We can accept that it has not changed since then...**")

In [None]:
uy_country_start = "28Mar2020"

printmd("**Records after " + uy_country_start + " will be used to fit the mathematical models.**")

# Traditional Epidemic Models

## SIR model
To understand the trend of infection, we will use a mathematical epidemic model. Let's start the discussion with a basic model named SIR.

### What is SIR model?
The SIR model is a simple mathematical model for understanding outbreaks of infectious diseases.  
[The SIR epidemic model - Learning Scientific Programming with Python](https://scipython.com/book/chapter-8-scipy/additional-examples/the-sir-epidemic-model/)

 * S: Susceptible (=All - Confirmed)
 * I: Infected (=Confirmed - Recovered - Deaths)
 * R: Recovered or fatal (=Recovered + Deaths)
 
Note: this is not the standard SIR model.  
In the standard SIR model, R means "recovered and immune". Here I define R as "recovered or fatal", because the mortality rate cannot be ignored in the real COVID-19 data.

Model:  
S + I $\overset{\beta}{\longrightarrow}$ 2I  
I $\overset{\gamma}{\longrightarrow}$ R

$\beta$: Effective contact rate [1/min]  
$\gamma$: Recovery(+Mortality) rate [1/min]  

Ordinary Differential Equation (ODE):   
$\frac{\mathrm{d}S}{\mathrm{d}T}= - N^{-1}\beta S I$  
$\frac{\mathrm{d}I}{\mathrm{d}T}= N^{-1}\beta S I - \gamma I$  
$\frac{\mathrm{d}R}{\mathrm{d}T}= \gamma I$  

Where $N=S+I+R$ is the total population, $T$ is the elapsed time from the start date.
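The equations above can be integrated numerically with `scipy.integrate.solve_ivp`. The following is a minimal sketch, not the notebook's `Estimator` code; the population, rates and initial conditions are hypothetical values picked only to produce a visible outbreak.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Hypothetical parameter values, chosen only for illustration
N = 1_000_000            # total population
beta = 0.2 / 1440        # effective contact rate [1/min]
gamma = 0.1 / 1440       # recovery (+ mortality) rate [1/min]


def sir_ode(T, state):
    """Right-hand side of the SIR equations; T is elapsed time in minutes."""
    S, I, R = state
    new_infections = beta * S * I / N
    return [-new_infections, new_infections - gamma * I, gamma * I]


minutes_per_day = 1440
t_end = 200 * minutes_per_day
t_eval = np.arange(0, t_end + 1, minutes_per_day)   # one sample per day

sol = solve_ivp(
    sir_ode, (0, t_end), [N - 100, 100, 0],
    t_eval=t_eval, rtol=1e-8, atol=1e-6,
)
S, I, R = sol.y

peak_day = int(np.argmax(I))
print(f"Infected peak: {I[peak_day]:.0f} people on day {peak_day}")
print(f"S + I + R stays at N: {np.allclose(S + I + R, N)}")
```

Because the three right-hand sides sum to zero, $S+I+R$ stays equal to $N$ throughout the simulation, which is a convenient sanity check.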

#### Non-dimensional SIR model
To simplify the model, we will remove the units of the variables from ODE.

Set $(S, I, R) = N \times (x, y, z)$ and $(T, \beta, \gamma) = (\tau t, \tau^{-1} \rho, \tau^{-1} \sigma)$.  

This results in the ODE  
$\frac{\mathrm{d}x}{\mathrm{d}t}= - \rho x y$  
$\frac{\mathrm{d}y}{\mathrm{d}t}= \rho x y - \sigma y$  
$\frac{\mathrm{d}z}{\mathrm{d}t}= \sigma y$  

Where $N$ is the total population and $\tau$ is a time coefficient in minutes (restricted to integers for simplicity).  
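
As a worked step, and only as a sketch of the substitution described above: inserting $S = Nx$, $I = Ny$, $T = \tau t$ and $\beta = \tau^{-1}\rho$ into the first equation gives

$$\frac{\mathrm{d}S}{\mathrm{d}T} = \frac{N}{\tau}\frac{\mathrm{d}x}{\mathrm{d}t} = -N^{-1}\,\tau^{-1}\rho\,(Nx)(Ny) = -\frac{N}{\tau}\,\rho x y$$

Dividing both sides by $N/\tau$ leaves $\frac{\mathrm{d}x}{\mathrm{d}t} = -\rho x y$. The same substitution, together with $\gamma = \tau^{-1}\sigma$, turns the equations for $I$ and $R$ into the equations for $y$ and $z$.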

The range of variables and parameters:  
$0 < (x, y, z, \rho, \sigma) < 1$  
$1\leq \tau \leq 1440$  

The basic reproduction number, a non-dimensional parameter, is defined as  
$R_0 = \rho \sigma^{-1} = \beta \gamma^{-1}$  

Estimated Mean Values of $R_0$:  
$R_0$ means "the average number of secondary infections caused by an infected host" ([Infection Modeling — Part 1](https://towardsdatascience.com/infection-modeling-part-1-87e74645568a)).  
(Secondary data: [Van den Driessche, P., & Watmough, J. (2002).](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6002118))  
2.06: Zika in South America, 2015-2016  
1.51: Ebola in Guinea, 2014  
1.33: H1N1 influenza in South Africa, 2009  
3.5 : SARS in 2002-2003  
1.68: H2N2 influenza in US, 1957  
3.8 : Fall wave of 1918 Spanish influenza in Geneva  
1.5 : Spring wave of 1918 Spanish influenza in Geneva  

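When $x=\frac{1}{R_0}$, $\frac{\mathrm{d}y}{\mathrm{d}t}=0$, so the infected fraction $y$ reaches its peak. Because $x+y+z=1$, the confirmed fraction ($=y+z$) equals $1-\frac{1}{R_0}$ at that moment; it keeps growing afterwards as long as $y>0$.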
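
The condition $x=\frac{1}{R_0}$ can be checked numerically on the non-dimensional model. The sketch below uses hypothetical values of $\rho$ and $\sigma$ and a hypothetical initial state; it locates the time step where $y$ is largest and compares $x$ and $y+z$ at that step with $\frac{1}{R_0}$ and $1-\frac{1}{R_0}$.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Hypothetical non-dimensional parameters, chosen only for illustration
rho, sigma = 0.2, 0.075
r0 = rho / sigma


def sir_nd(t, state):
    """Non-dimensional SIR right-hand side for (x, y, z)."""
    x, y, z = state
    return [-rho * x * y, rho * x * y - sigma * y, sigma * y]


sol = solve_ivp(
    sir_nd, (0, 300), [0.999, 0.001, 0.0],
    t_eval=np.linspace(0, 300, 3001), rtol=1e-9, atol=1e-12,
)
x, y, z = sol.y

k = int(np.argmax(y))  # index of the time step where y is largest
print(f"x at the peak of y:     {x[k]:.4f}  (1/R0     = {1 / r0:.4f})")
print(f"y + z at the peak of y: {y[k] + z[k]:.4f}  (1 - 1/R0 = {1 - 1 / r0:.4f})")
print(f"y + z at the end of the simulation: {y[-1] + z[-1]:.4f}")
```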

### Hyperparameter optimization
Using the Optuna package, ($\rho, \sigma, \tau$) will be estimated by model fitting.

In [None]:
tau=1440

In [None]:
%%time

sir_estimator = Estimator(
    SIR, ncov_df, population_dict[uy_country], name=uy_country, places=[(uy_country, None)],
    start_date=uy_country_start,
    tau=tau
)
sir_dict = sir_estimator.run(n_trials)

In [None]:
display(sir_estimator.history_df().head())

sir_estimator.history_graph()

display(pd.DataFrame.from_dict({"SIR": sir_dict}, orient="index"))

sir_estimator.compare_graph()


In [None]:
sir_estimator.predict_graph(step_n=400)

df = sir_estimator.predict_df(400)
display(df.loc[datetime.today():, ["Infected", "Recovered/Deaths"]].head(14).style.background_gradient(axis=0))

## SIR-D model
Because we can measure the number of fatal cases and recovered cases separately, we can use two variables ("Recovered" and "Deaths") instead of "Recovered + Deaths" in the mathematical model.

### What is SIR-D model?
* S: Susceptible
* I: Infected
* R: Recovered
* D: Fatal

Model:  
S + I $\overset{\beta}{\longrightarrow}$ 2I  
I $\overset{\gamma}{\longrightarrow}$ R  
I $\overset{\alpha}{\longrightarrow}$ D  

$\alpha$: Mortality rate [1/min]  
$\beta$: Effective contact rate [1/min]  
$\gamma$: Recovery rate [1/min]  

Ordinary Differential Equation (ODE):   
$\frac{\mathrm{d}S}{\mathrm{d}T}= - N^{-1}\beta S I$  
$\frac{\mathrm{d}I}{\mathrm{d}T}= N^{-1}\beta S I - (\gamma + \alpha) I$  
$\frac{\mathrm{d}R}{\mathrm{d}T}= \gamma I$  
$\frac{\mathrm{d}D}{\mathrm{d}T}= \alpha I$  

Where $N=S+I+R+D$ is the total population, $T$ is the elapsed time from the start date.

### Non-dimensional SIR-D model
Set $(S, I, R, D) = N \times (x, y, z, w)$ and $(T, \alpha, \beta, \gamma) = (\tau t, \tau^{-1} \kappa, \tau^{-1} \rho, \tau^{-1} \sigma)$.  
This results in the ODE  
$\frac{\mathrm{d}x}{\mathrm{d}t}= - \rho x y$  
$\frac{\mathrm{d}y}{\mathrm{d}t}= \rho x y - (\sigma + \kappa) y$  
$\frac{\mathrm{d}z}{\mathrm{d}t}= \sigma y$  
$\frac{\mathrm{d}w}{\mathrm{d}t}= \kappa y$  

Where $N$ is the total population and $\tau$ is a time coefficient in minutes (restricted to integers for simplicity).  

The range of variables and parameters:  
$0 \leq (x, y, z, w, \kappa, \rho, \sigma) \leq 1$  
$1\leq \tau \leq 1440$

The reproduction number can be defined as  
$R_0 = \rho (\sigma + \kappa)^{-1} = \beta (\gamma + \alpha)^{-1}$
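
A minimal numerical sketch of the non-dimensional SIR-D system follows. The parameter values are hypothetical and the code is independent of the notebook's `SIRD` class. It also converts the parameters back to dimensional rates using $\tau$ and checks that the fatal-to-recovered ratio equals $\kappa/\sigma$, which follows from the last two equations when $z$ and $w$ both start at zero.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Hypothetical non-dimensional parameters, chosen only for illustration
kappa, rho, sigma = 0.005, 0.2, 0.075
tau = 1440  # [min]; with this value one non-dimensional time unit is one day


def sird_ode(t, state):
    """Non-dimensional SIR-D right-hand side for (x, y, z, w)."""
    x, y, z, w = state
    return [
        -rho * x * y,
        rho * x * y - (sigma + kappa) * y,
        sigma * y,
        kappa * y,
    ]


sol = solve_ivp(
    sird_ode, (0, 300), [0.999, 0.001, 0.0, 0.0],
    t_eval=np.linspace(0, 300, 301), rtol=1e-9, atol=1e-12,
)
x, y, z, w = sol.y

r0 = rho / (sigma + kappa)
alpha, beta, gamma = kappa / tau, rho / tau, sigma / tau  # back to [1/min]

print(f"R0 = {r0:.2f}")
print(f"alpha = {alpha:.2e}, beta = {beta:.2e}, gamma = {gamma:.2e} [1/min]")
print(f"fatal / recovered = {w[-1] / z[-1]:.4f}, kappa / sigma = {kappa / sigma:.4f}")
```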

### Hyperparameter optimization
Using the Optuna package, ($\kappa, \rho, \sigma, \tau$) will be estimated by model fitting.

In [None]:
%%time
sird_estimator = Estimator(
    SIRD, ncov_df, population_dict[uy_country],
    name=uy_country, places=[(uy_country, None)],
    start_date=uy_country_start,
    tau=tau
)
sird_dict = sird_estimator.run(n_trials)

In [None]:
display(sird_estimator.history_df().head())

sird_estimator.history_graph()

display(pd.DataFrame.from_dict({"SIR": sir_dict, "SIR-D": sird_dict}, orient="index").fillna("-"))

sird_estimator.compare_graph()


In [None]:
sird_estimator.predict_graph(step_n=300)

df = sird_estimator.predict_df(300)
display(df.loc[datetime.today():, ["Infected", "Recovered", "Deaths"]].head(14).style.background_gradient(axis=0))

## SIR-F model
Some cases are reported as fatal before a clinical diagnosis of COVID-19 is made. To account for this, the transition "S + I $\to$ Fatal + I" is added to the model.

### What is SIR-F model?
* S: Susceptible
* S$^\ast$: Confirmed and un-categorized
* I: Confirmed and categorized as I
* R: Recovered
* F: Fatal with confirmation

Measurable variables:  
Confirmed = $I+R+F$  
Recovered = $R$  
Deaths = $F$  

Model:  
S $\overset{\beta \mathrm{I}}{\longrightarrow}$ S$^\ast$ $\overset{\alpha_1}{\longrightarrow}$ F  
S $\overset{\beta \mathrm{I}}{\longrightarrow}$ S$^\ast$ $\overset{1 - \alpha_1}{\longrightarrow}$ I  
I $\overset{\gamma}{\longrightarrow}$ R  
I $\overset{\alpha_2}{\longrightarrow}$ F  

$\alpha_1$: Mortality rate of S$^\ast$ cases [-]  
$\alpha_2$: Mortality rate of I cases [1/min]  
$\beta$: Effective contact rate [1/min]  
$\gamma$: Recovery rate [1/min]  

Ordinary Differential Equation (ODE):   
$\frac{\mathrm{d}S}{\mathrm{d}T}= - N^{-1}\beta S I$  
$\frac{\mathrm{d}I}{\mathrm{d}T}= N^{-1}(1 - \alpha_1) \beta S I - (\gamma + \alpha_2) I$  
$\frac{\mathrm{d}R}{\mathrm{d}T}= \gamma I$  
$\frac{\mathrm{d}F}{\mathrm{d}T}= N^{-1}\alpha_1 \beta S I + \alpha_2 I$  

Where $N=S+I+R+F$ is the total population, $T$ is the elapsed time from the start date.

### Non-dimensional SIR-F model
Set $(S, I, R, F) = N \times (x, y, z, w)$ and $(T, \alpha_1, \alpha_2, \beta, \gamma) = (\tau t, \theta, \tau^{-1} \kappa, \tau^{-1} \rho, \tau^{-1} \sigma)$.  
This results in the ODE  
$\frac{\mathrm{d}x}{\mathrm{d}t}= - \rho x y$  
$\frac{\mathrm{d}y}{\mathrm{d}t}= \rho (1-\theta) x y - (\sigma + \kappa) y$  
$\frac{\mathrm{d}z}{\mathrm{d}t}= \sigma y$  
$\frac{\mathrm{d}w}{\mathrm{d}t}= \rho \theta x y + \kappa y$  

Where $N$ is the total population and $\tau$ is a time-scale coefficient in minutes (restricted to integer values for simplicity).  

The range of variables and parameters:  
$0 \leq (x, y, z, w, \theta, \kappa, \rho, \sigma) \leq 1$  
$1 \leq \tau \leq 1440$  

Reproduction number can be defined as  
$R_0 = \rho (1 - \theta) (\sigma + \kappa)^{-1} = \beta (1 - \alpha_1) (\gamma + \alpha_2)^{-1}$
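
To make the link between the non-dimensional and the dimensional parameters concrete, the following sketch converts one set of values back to rates in 1/min and to typical durations in days. The function name `sirf_summary`, the dictionary keys and the numbers are assumptions for illustration only, not fitted results.

```python
def sirf_summary(theta, kappa, rho, sigma, tau):
    """Convert non-dimensional SIR-F parameters to dimensional quantities.

    tau is given in minutes, so the dimensional rates come out in 1/min.
    """
    alpha2 = kappa / tau   # mortality rate of I cases [1/min]
    beta = rho / tau       # effective contact rate [1/min]
    gamma = sigma / tau    # recovery rate [1/min]
    return {
        "alpha1": theta,   # already dimensionless
        "alpha2": alpha2,
        "beta": beta,
        "gamma": gamma,
        "R0": rho * (1 - theta) / (sigma + kappa),
        "1/alpha2 [day]": 1 / alpha2 / 1440,
        "1/beta [day]": 1 / beta / 1440,
        "1/gamma [day]": 1 / gamma / 1440,
    }

# Hypothetical parameter values, for illustration only.
summary = sirf_summary(theta=0.002, kappa=0.005, rho=0.2, sigma=0.075, tau=1440)
for key, value in summary.items():
    print(f"{key}: {value:.6g}")
```

Both forms of the reproduction number give the same value, because the factor $\tau^{-1}$ cancels between numerator and denominator.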

### Hyperparameter optimization
Using the Optuna package, ($\theta, \kappa, \rho, \sigma, \tau$) will be estimated by fitting the model to the data.

In [None]:
%%time
sirf_estimator = Estimator(
    SIRF, ncov_df, population_dict[uy_country],
    name=uy_country, places=[(uy_country, None)],
    start_date=uy_country_start,
    tau=tau
)
sirf_dict = sirf_estimator.run(n_trials)

In [None]:
display(sirf_estimator.history_df().head())

sirf_estimator.history_graph()

display(pd.DataFrame.from_dict({"SIR": sir_dict, "SIR-D": sird_dict, "SIR-F": sirf_dict}, orient="index").fillna("-"))

sirf_estimator.compare_graph()


In [None]:
sirf_estimator.predict_graph(step_n=400)

df = sirf_estimator.predict_df(400)
display(df.loc[datetime.today():, ["Infected", "Recovered", "Fatal"]].head(14).style.background_gradient(axis=0))

# MetaPopulation Epidemic Models

## MP*SIR model
To understand the trend of infection, we will use a mathematical epidemic model. We start the discussion with a basic model named MPHSIR.

### What is MPHSIR model?
MPHSIR is a simple mathematical model of the outbreak of an infectious disease in which travellers link several subpopulations.
The metapopulation concept subdivides the entire population into distinct "subpopulations", each of which has its own epidemiological dynamics, together with limited interaction between the subpopulations (caused by the travellers).
The epidemiological dynamics are assumed to be identical (homogeneous) across subpopulations, that is, with identical contact and recovery rates;
this is where the name MetaPopulationHomogeneous-SIR (MPHSIR) comes from.

This model is based on [Modeling Infectious Diseases in Humans and Animals, Matt J. Keeling & Pejman Rohani](https://homepages.warwick.ac.uk/~masfz/ModelingInfectiousDiseases/Chapter7/Program_7.2/index.html).

Note: This is not the general model because:
 * Although R in the SIR model normally means "recovered with immunity", here R is defined as "recovered or fatal", because the mortality rate cannot be ignored in real COVID-19 data.
 * The effective contact rate and the recovery (+ mortality) rate are assumed to be homogeneous across subpopulations.
 * Birth and death rates are ignored because of the fast spread of COVID-19.
 * Permanent relocation from one subpopulation to another is sufficiently rare that it may be ignored as an epidemiologically significant force. Instead, it is more natural to think about commuters spreading the disease. Commuters live in one subpopulation but travel occasionally to another subpopulation. Therefore, we need two matrices, $l$ and $r$, that give the rates at which individuals leave and return to their home subpopulation.

Within each subpopulation a simple SIR model is used, with the key difference that the individuals present in a subpopulation change over time because of travellers (which adds a new route of infection):
 * $S_{ij}$: number of Susceptible (= All - Confirmed) individuals currently in subpopulation $i$ who live in subpopulation $j$.
 * $I_{ij}$: number of Infected (= Confirmed - Recovered - Deaths) individuals currently in subpopulation $i$ who live in subpopulation $j$.
 * $R_{ij}$: number of Recovered or fatal (= Recovered + Deaths) individuals currently in subpopulation $i$ who live in subpopulation $j$.
 * $N_{ij}$: total number of hosts currently in subpopulation $i$ who live in subpopulation $j$.

From the standard SIR models we consider the number of individuals of each type in each spatial class:
 * S + I $\overset{\beta}{\longrightarrow}$ 2I  
 * I $\overset{\gamma}{\longrightarrow}$ R
 * N $=$ S + I + R

Ordinary Differential Equation (ODE):   
$\frac{\mathrm{d}S_{ii}}{\mathrm{d}T}= - \beta S_{ii} \frac{\sum_k I_{ik}}{\sum_k N_{ik}} - \sum_j l_{ji} S_{ii} + \sum_j r_{ji} S_{ji}$  
$\frac{\mathrm{d}S_{ij}}{\mathrm{d}T}= - \beta S_{ij} \frac{\sum_k I_{ik}}{\sum_k N_{ik}} + l_{ij} S_{jj} - r_{ij} S_{ij}$  
$\frac{\mathrm{d}I_{ii}}{\mathrm{d}T}= + \beta S_{ii} \frac{\sum_k I_{ik}}{\sum_k N_{ik}} - \gamma I_{ii} - \sum_j l_{ji} I_{ii} + \sum_j r_{ji} I_{ji}$  
$\frac{\mathrm{d}I_{ij}}{\mathrm{d}T}= + \beta S_{ij} \frac{\sum_k I_{ik}}{\sum_k N_{ik}} - \gamma I_{ij} + l_{ij} I_{jj} - r_{ij} I_{ij}$  
$\frac{\mathrm{d}R_{ii}}{\mathrm{d}T}= + \gamma I_{ii} - \sum_j l_{ji} R_{ii} + \sum_j r_{ji} R_{ji}$  
$\frac{\mathrm{d}R_{ij}}{\mathrm{d}T}= + \gamma I_{ij} + l_{ij} R_{jj} - r_{ij} R_{ij}$  
$\frac{\mathrm{d}N_{ii}}{\mathrm{d}T}= - \sum_j l_{ji} N_{ii} + \sum_j r_{ji} N_{ji}$  
$\frac{\mathrm{d}N_{ij}}{\mathrm{d}T}= + l_{ij} N_{jj} - r_{ij} N_{ij}$  


Where: 
 * $T$ is the elapsed time from the start date.
 * $n$ is the number of subpopulations. Note that all parameters are vectors of size $n$ or matrices of size $n \times n$.
 * $\beta$: Effective contact rate [1/min]  
 * $\gamma$: Recovery (+ mortality) rate [1/min]  
 * $l_{ij}$: the rate at which individuals leave their home subpopulation $j$ and commute to subpopulation $i$. $l$ is a matrix of size $n \times n$.
 * $r_{ij}$: the rate at which individuals return to their home subpopulation $j$ from subpopulation $i$. $r$ is a matrix of size $n \times n$.

All rates are expressed per minute [1/min].

Requirements:  
All parameters must be positive, except the diagonal terms of the $l$ and $r$ matrices, which are expected to be zero.
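
The equations above can be sketched in NumPy as follows. The array convention (`X[i, j]` = currently in $i$, living in $j$), the helper names `movement` and `mphsir_rhs`, the forward-Euler loop and every numeric value are assumptions made for this illustration; this is not the `MPHSIR` class used later in the notebook.

```python
import numpy as np

def movement(X, l, r):
    """Commuting terms for one compartment X, where X[i, j] = currently in i, living in j."""
    home = np.diag(X)                         # X_jj: residents of j who are at home
    dX = l * home[np.newaxis, :] - r * X      # i != j: + l_ij X_jj - r_ij X_ij
    # i == j: - sum_j l_ji X_ii + sum_j r_ji X_ji (overwrites the zero diagonal)
    dX[np.diag_indices_from(dX)] = -l.sum(axis=0) * home + (r * X).sum(axis=0)
    return dX

def mphsir_rhs(S, I, R, beta, gamma, l, r):
    """Right-hand side of the dimensional MPHSIR ODE (rates in 1/min)."""
    N = S + I + R
    force = I.sum(axis=1) / N.sum(axis=1)     # infected fraction currently in each i
    new_inf = beta * S * force[:, np.newaxis]
    dS = -new_inf + movement(S, l, r)
    dI = new_inf - gamma * I + movement(I, l, r)
    dR = gamma * I + movement(R, l, r)
    return dS, dI, dR

# Hypothetical two-subpopulation setup; every number below is illustrative only.
S = np.array([[999.0, 0.0], [0.0, 500.0]])
I = np.array([[1.0, 0.0], [0.0, 0.0]])
R = np.zeros((2, 2))
l = np.array([[0.0, 0.10], [0.05, 0.0]]) / 1440   # leave rates [1/min], zero diagonal
r = np.array([[0.0, 0.80], [0.80, 0.0]]) / 1440   # return rates [1/min], zero diagonal
beta, gamma = 0.2 / 1440, 0.075 / 1440

dt = 60.0                                          # forward-Euler step [min]
for _ in range(24 * 30):                           # 30 days
    dS, dI, dR = mphsir_rhs(S, I, R, beta, gamma, l, r)
    S, I, R = S + dt * dS, I + dt * dI, R + dt * dR

residents = (S + I + R).sum(axis=0)                # people living in each subpopulation
print(residents)
print("infected residents of subpopulation 1:", I[:, 1].sum())
```

The number of residents of each subpopulation (a column sum) stays constant because commuting only moves people between rows. The infection reaches the second subpopulation purely through commuters, even though it starts with no cases.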


#### Non-dimensional MPHSIR model
To simplify the model, we remove the units of the variables from the ODE.
We define:   
$\tau$ is a time-scale coefficient in minutes (restricted to integer values for simplicity)     
$T = \tau t$     
$\rho = \tau \beta$      
$\sigma = \tau \gamma$   
$a_{ij} = \tau l_{ij}$                   
$b_{ij} = \tau r_{ij}$                      
total individuals that live in subpopulation $j$: $N_j = \sum_i N_{ij}$            
total individuals currently in subpopulation $i$:  $sumN_i = \sum_j N_{ij}$, $sumI_i = \sum_j I_{ij}$

And we change the variables:

$S_{ii} = N_i x_{ii}$,
$S_{ij} = N_j x_{ij}$,
$I_{ii} = N_i y_{ii}$,
$I_{ij} = N_j y_{ij}$,
$R_{ii} = N_i z_{ii}$,
$R_{ij} = N_j z_{ij}$,


This results in the ODE:

$\frac{\mathrm{d}x_{ii}}{\mathrm{d}t}= - \rho x_{ii} \frac{sumI_i}{sumN_i} - \sum_j a_{ji} x_{ii} + \sum_j b_{ji} x_{ji}$          
$\frac{\mathrm{d}x_{ij}}{\mathrm{d}t}= - \rho x_{ij} \frac{sumI_i}{sumN_i} + a_{ij} x_{jj} - b_{ij} x_{ij}$           
$\frac{\mathrm{d}y_{ii}}{\mathrm{d}t}= + \rho x_{ii} \frac{sumI_i}{sumN_i} - \sigma y_{ii} - \sum_j a_{ji} y_{ii} + \sum_j b_{ji} y_{ji}$          
$\frac{\mathrm{d}y_{ij}}{\mathrm{d}t}= + \rho x_{ij} \frac{sumI_i}{sumN_i} - \sigma y_{ij} + a_{ij} y_{jj} - b_{ij} y_{ij}$           
$\frac{\mathrm{d}z_{ii}}{\mathrm{d}t}= + \sigma y_{ii} - \sum_j a_{ji} z_{ii} + \sum_j b_{ji} z_{ji}$               
$\frac{\mathrm{d}z_{ij}}{\mathrm{d}t}= + \sigma y_{ij} + a_{ij} z_{jj} - b_{ij} z_{ij}$             

The range of variables and parameters:  
$0 \leq (x_{ij}, y_{ij}, z_{ij}, \rho, \sigma, a_{ij}, b_{ij}) \leq 1$ for all $i, j$  
$1 \leq \tau \leq 1440$  
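
A small sketch of the scaling step, assuming the rates are available as NumPy arrays in 1/min. The helper `to_nondimensional` and the numeric values are hypothetical, and the bounds are treated as inclusive here so that the zero diagonals of $a$ and $b$ pass the check.

```python
import numpy as np

def to_nondimensional(beta, gamma, l, r, tau):
    """Scale dimensional rates [1/min] by tau [min] and check the expected ranges."""
    rho, sigma = tau * beta, tau * gamma
    a = tau * np.asarray(l, dtype=float)
    b = tau * np.asarray(r, dtype=float)
    if np.any(np.diag(a) != 0) or np.any(np.diag(b) != 0):
        raise ValueError("diagonal terms of l and r are expected to be zero")
    for name, value in {"rho": rho, "sigma": sigma, "a": a, "b": b}.items():
        if np.any(np.asarray(value) < 0) or np.any(np.asarray(value) > 1):
            raise ValueError(f"{name} falls outside [0, 1]")
    return rho, sigma, a, b

tau = 1440  # one day, in minutes
# Hypothetical rates in 1/min, for illustration only.
l = np.array([[0.0, 7.0e-5], [3.5e-5, 0.0]])
r = np.array([[0.0, 6.0e-4], [6.0e-4, 0.0]])
rho, sigma, a, b = to_nondimensional(beta=1.4e-4, gamma=5.0e-5, l=l, r=r, tau=tau)
print(rho, sigma)
print(a)
print(b)
```

With $\tau = 1440$ the non-dimensional rates read as "per day", so a return rate above one trip per day would fall outside the stated range and would call for a smaller $\tau$.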


### Hyperparameter optimization

The target dataset has to be built explicitly, because it cannot be obtained directly from `ncov_df`.

In [None]:

def create_target_uscensus_df(ncov_df, subpopulation_total, subpopulation_name, start_date=None):
    """
    Select the US records of the mapped provinces, aggregate them by subpopulation,
     and calculate the number of susceptible people and the elapsed time [min] from the first date.
    Note: the province-to-subpopulation mapping is read from the global subpopulation_mapper.
    @ncov_df <pd.DataFrame>: the clean data
    @subpopulation_total <array[int]>: total population in each subpopulation
    @subpopulation_name <list[str]>: names of the subpopulations, in the same order as subpopulation_total
    @start_date: only records on or after this date are kept
    @return <tuple(2 objects)>:
        - 1. first_date <pd.Timestamp>: the first date of the selected records
        - 2. target_df <pd.DataFrame>: one row per T; the other columns are prefixed with "<subpopulation index>@"
            - column T: elapsed time [min] from the start date of the dataset
            - column Susceptible: the number of people who are in the places but not infected/recovered/died
            - column Infected: the number of infected cases
            - column Recovered: the number of recovered cases
            - column Deaths: the number of death cases
    ASSUMES there are records for every province on every day.
    """
    province_name= []
    for k, v in subpopulation_mapper.items():
        if v!="-":
            province_name.append(k)
    
    df = ncov_df[(ncov_df['Country']=='US') & ncov_df['Province'].isin(province_name)].copy()  # US only, and only the mapped provinces
    if start_date is not None:
        df = df[df['Date'] >= start_date]  # keep records on or after the start date
    df['sp'] = [ subpopulation_mapper[d] for d in df['Province'] ]
    df = df.groupby(["Date","sp"]).sum()
    df = df.reset_index()
    #display(df)
    first_date = df.loc[df.index[0], "Date"]
    # column T
    df["T"] = ((df["Date"] - first_date).dt.total_seconds() / 60).astype(int)
    response_variables = ["Infected", "Recovered", "Deaths"]
    df = df.loc[:, ["T", "sp", *response_variables]]
    target_df = pd.DataFrame()
    for sp in range(len(subpopulation_name)):
        #pprint( str(sp) + " - " + subpopulation_name[sp] + " - " + str(subpopulation_total[sp]))
        target_df_sp =  df.loc[df["sp"]==subpopulation_name[sp], ["T", *response_variables]].copy()
        target_df_sp["Susceptible"] =  int(subpopulation_total[sp]) - target_df_sp["Infected"] - target_df_sp["Recovered"] - target_df_sp["Deaths"]
        target_df_sp.columns = str(sp) +"@" + target_df_sp.columns
        target_df_sp.rename(columns={str(sp) +'@T': 'T'}, inplace=True)
        #display(target_df_sp)
        if len(target_df)>0:
            target_df = pd.merge(target_df, target_df_sp, on="T")
        else:
            target_df = target_df_sp
    return (first_date, target_df)


tau = 1440
subpopulation_total = np.array([subpopulation_dict[sp] for sp in subpopulation_name])  # total population of each subpopulation

# basic function that filters the ncov_df dataset and creates a fictitious north/south dataset
start_date, target_df = create_target_uscensus_df(
    ncov_df,  subpopulation_total, subpopulation_name, start_date=uy_country_start
)
pprint([subpopulation_name, subpopulation_total, start_date.strftime(time_format)])
display(target_df)

display(a_pandemia)
display(b_pandemia)


In [None]:

%%time

mphsir_estimator = MPEstimator(
    MPHSIR, start_date, 
    target_df, subpopulation_total, subpopulation_name,
    tau=tau, a=a_pandemia, b=b_pandemia, N=subpopulation_total)  # fixed parameters
mphsir_dict = mphsir_estimator.run(n_trials)


In [None]:
display(mphsir_estimator.compare_df())

display(mphsir_estimator.history_df().head())

mphsir_estimator.history_graph()

display(pd.DataFrame.from_dict({"MPHSIR": mphsir_dict}, orient="index"))

mphsir_estimator.compare_graph()


In [None]:
mphsir_estimator.predict_graph(step_n=400)

df = mphsir_estimator.predict_df(400)
display(df.loc[datetime.today():, :].head(14).style.background_gradient(axis=0))

# Conclusion: Models comparision

### Parameters

In [None]:
display(pd.DataFrame.from_dict({"SIR": sir_dict, 
                                "SIR-D": sird_dict, 
                                "SIR-F": sirf_dict, 
                                #"SEWIR-F": sewirf_dict, 
                                "MPHSIR": mphsir_dict
                               }, orient="index").fillna("-"))

### Plots

In [None]:
sir_estimator.predict_graph(400)
sird_estimator.predict_graph(400)
sirf_estimator.predict_graph(400)
#sewirf_estimator.predict_graph(400)
mphsir_estimator.predict_graph(400)

### Predictions

In [None]:
real_data = ncov_df.loc[(ncov_df['Country']==uy_country), ["Date","Infected"]].groupby("Date").sum()
real_data = real_data.reset_index()
real_data["Date"] = real_data["Date"].dt.date
real_data.rename(columns={"Infected": "Infected_real"}, inplace=True)
#display(real_data)

sir_data = sir_estimator.predict_df(400)
sir_data["Date"] = pd.to_datetime(sir_data.index)
sir_data["Date"] = sir_data["Date"].dt.date
sir_data = pd.DataFrame([g.iloc[np.argmax(g.index)] for _, g in sir_data.groupby('Date')])  # if there is more than one record per date, keep the last one
sir_data.rename(columns={"Infected": "Infected_sir"}, inplace=True)
#display(sir_data)

sird_data = sird_estimator.predict_df(400)
sird_data["Date"] = pd.to_datetime(sird_data.index)
sird_data["Date"] = sird_data["Date"].dt.date
sird_data = pd.DataFrame([g.iloc[np.argmax(g.index)] for _, g in sird_data.groupby('Date')])  # if there is more than one record per date, keep the last one
sird_data.rename(columns={"Infected": "Infected_sird"}, inplace=True)
#display(sird_data)

sirf_data = sirf_estimator.predict_df(400)
sirf_data["Date"] = pd.to_datetime(sirf_data.index)
sirf_data["Date"] = sirf_data["Date"].dt.date
sirf_data = pd.DataFrame([g.iloc[np.argmax(g.index)] for _, g in sirf_data.groupby('Date')])  # if there is more than one record per date, keep the last one
sirf_data.rename(columns={"Infected": "Infected_sirf"}, inplace=True)
#display(sirf_data)

'''
df = sewirf_estimator.predict_df(400)
sewirf_data = sewirf_estimator.predict_df(400)
sewirf_data["Date"] = pd.to_datetime(sewirf_data.index)
sewirf_data["Date"] = sewirf_data["Date"].dt.date
sewirf_data = pd.DataFrame([g.iloc[np.argmax(g.index)] for _, g in sewirf_data.groupby('Date')])  # if there is more than one record per date, keep the last one
sewirf_data.rename(columns={"Infected": "Infected_sewirf"}, inplace=True)
#display(sewirf_data)
'''

mphsir_data = mphsir_estimator.predict_df(400)
mphsir_data["Date"] = pd.to_datetime(mphsir_data.index)
mphsir_data["Date"] = mphsir_data["Date"].dt.date
mphsir_data = pd.DataFrame([g.iloc[np.argmax(g.index)] for _, g in mphsir_data.groupby('Date')])  # if there is more than one record per date, keep the last one
mphsir_data.rename(columns={"Infected": "Infected_mphsir"}, inplace=True)
#display(mphsir_data)


join_data = pd.merge(real_data, sir_data[["Date","Infected_sir"]], how='inner', on=["Date"])
join_data = pd.merge(join_data, sird_data[["Date","Infected_sird"]], how='inner', on=["Date"])
join_data = pd.merge(join_data, sirf_data[["Date","Infected_sirf"]], how='inner', on=["Date"])
#join_data = pd.merge(join_data, sewirf_data[["Date","Infected_sewirf"]], how='inner', on=["Date"])
join_data = pd.merge(join_data, mphsir_data[["Date","Infected_mphsir"]], how='inner', on=["Date"])
join_data["Date"] =  pd.to_datetime(join_data["Date"])
join_data = join_data.set_index("Date")

display(join_data)
line_plot(join_data, "Comparison", math_scale=False)
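
The cell above repeats the same steps for each model: derive a calendar date from the timestamp index, keep the last record of each date, and rename `Infected`. One possible way to factor this out is sketched below. It assumes that `predict_df` returns a frame indexed by timestamp with an `Infected` column, as the cell above implies; `last_record_per_date` is a hypothetical helper and the demo data are made up.

```python
import pandas as pd

def last_record_per_date(pred_df, new_name):
    """Return one row per calendar date (the latest timestamp), with 'Infected' renamed."""
    out = pred_df.sort_index().copy()
    out["Date"] = pd.to_datetime(out.index).date
    out = out.groupby("Date").tail(1)          # last record of each date
    out = out.rename(columns={"Infected": new_name})
    return out[["Date", new_name]].reset_index(drop=True)

# Hypothetical predictions with two timestamps on the first day.
idx = pd.to_datetime(["2020-04-11 00:00", "2020-04-11 12:00", "2020-04-12 00:00"])
demo = pd.DataFrame({"Infected": [10, 12, 15]}, index=idx)
daily = last_record_per_date(demo, "Infected_demo")
print(daily)
```

With such a helper, each model would contribute one call followed by the same inner merge on `Date`.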