# All COVID-19 Models Are Wrong: Are any of them useful?

While comparing the fatality forecasts from the Kaggle [COVID19 Global Forecasting (Week 4)](https://www.kaggle.com/c/covid19-global-forecasting-week-4/overview/description) competition with other models (IHME and LANL), I observed a high degree of variability in which model performed best at the US state level; it appeared fairly random which model did best for each state.

The key factor in a model's short-term performance was how current its training data was. A model that had three extra days of training data tended to outperform other models, or even the same model trained three days earlier. This strongly suggests that these models are missing important dynamics of COVID-19 spread, or that the dynamics are non-linear to a degree that makes long-term prediction nearly impossible.

### Key Takeaways:
* The performance of models in the short term depends heavily on the last few days of training data (see the Wyoming, Texas, and Ohio examples below).
* It's very difficult to say which model is "best" at this point (although the Kaggle models and LANL appear to be more robust).
* The ability of these models to accurately predict weeks into the future is doubtful, except perhaps to within an order of magnitude.
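
One way to make "which model is best" concrete is to score each forecast against the actuals on the dates they share. The following is only a minimal sketch: the metric (root mean squared log error) is my choice for illustration, the helper names `rmsle` and `score_forecast` are hypothetical, and the numbers are made up rather than taken from the benchmark files.

```python
import numpy as np
import pandas as pd

def rmsle(actual, predicted):
    """Root mean squared log error between two aligned sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((np.log1p(predicted) - np.log1p(actual)) ** 2)))

def score_forecast(actuals, forecast):
    """Score a forecast only on the dates it shares with the actuals.

    Both arguments are pandas Series of cumulative fatalities indexed by date string.
    """
    common = actuals.index.intersection(forecast.index)
    return rmsle(actuals.loc[common], forecast.loc[common])

# Made-up numbers, for illustration only
dates = ['2020-04-15', '2020-04-16', '2020-04-17']
actuals = pd.Series([10, 12, 15], index=dates)
older_forecast = pd.Series([14, 19, 26], index=dates)  # trained on older data
newer_forecast = pd.Series([10, 13, 16], index=dates)  # trained three days later

print('older:', score_forecast(actuals, older_forecast))
print('newer:', score_forecast(actuals, newer_forecast))
```

A log-scale error is used here because states differ by orders of magnitude in fatality counts; any other metric could be swapped in without changing the structure.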


### Data Used for Benchmarks

The data for this notebook contains three sets of model predictions:
1. Two sets of IHME predictions, made on April 13 and April 16, 2020.
2. Two sets of LANL predictions, made on April 12 and April 15, 2020.
3. Selected predictions from Week 4 of the Kaggle competition, made on April 14, 2020.

For the IHME and LANL predictions, I plot both dates in the same color, using a dashed line for the earlier date.

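For the Kaggle predictions, I include the median of all selected submissions, as well as the median of the predictions from the four teams that did best in the Week 3 competition (which suggests they are the stronger models). In the plotting code below, the line for the all-submissions median is currently commented out, so the solid red line shows the top-four median.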
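
To illustrate how a per-row median across submissions can be computed, here is a small sketch using `DataFrame.quantile(0.5, axis=1)`, the same call the plotting function uses. The column names and values are hypothetical stand-ins for the real submission files.

```python
import pandas as pd

# Hypothetical predictions: one row per ForecastId, one column per submission
preds = pd.DataFrame(
    {
        'sub_a.csv': [10, 20, 30],
        'sub_b.csv': [12, 18, 33],
        'sub_c.csv': [50, 22, 36],  # an outlier on the first row
        'sub_d.csv': [11, 21, 31],
    },
    index=pd.Index([1, 2, 3], name='ForecastId'),
)

# Median across all submissions for each forecast row
median_all = preds.quantile(0.5, axis=1)

# Median across a chosen subset of submissions
top = ['sub_a.csv', 'sub_b.csv']
median_top = preds[top].quantile(0.5, axis=1)

print(median_all)
print(median_top)
```

Taking the median rather than the mean keeps a single extreme submission (such as `sub_c.csv` on the first row) from dominating the combined forecast.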

In [None]:
import numpy as np
import pandas as pd

from pathlib import Path
data_path_benchmark = Path('/kaggle/input/covid19-benchmarks/')
data_path_competition = Path('/kaggle/input/covid19-global-forecasting-week-4/')
data_path_actuals = Path('/kaggle/input/covid19-models-raw-data')

# import os
# for dirname, _, filenames in os.walk('/kaggle/input'):
#     for filename in filenames:
#         print(os.path.join(dirname, filename))

import matplotlib.pyplot as plt

In [None]:
train = pd.read_csv(data_path_actuals / 'actuals.csv', index_col='Id')
wk4_test = pd.read_csv(data_path_competition / 'test.csv', index_col='ForecastId')

wk4_preds_fatalities = pd.read_csv(data_path_benchmark / 'wk4_preds_fatalities_selected_deduped.csv', index_col='ForecastId')

ihme_cols = ['location_name', 'date', 'deaths_mean','deaths_lower', 'deaths_upper']
ihme_1 = pd.read_csv(data_path_benchmark / 'ihme_2020_04_13.csv')[ihme_cols]
ihme_2 = pd.read_csv(data_path_benchmark / 'ihme_2020_04_16.csv')[ihme_cols]

lanl_cols = ['dates', 'q.50', 'state']
lanl_1 = pd.read_csv(data_path_benchmark / 'lanl_2020_04_12.csv')[lanl_cols]
lanl_2 = pd.read_csv(data_path_benchmark / 'lanl_2020_04_15.csv')[lanl_cols]

In [None]:
states = [
  'Alabama','Alaska','Arizona','Arkansas','California','Colorado',
  'Connecticut','Delaware','Florida','Georgia','Hawaii','Idaho','Illinois',
  'Indiana','Iowa','Kansas','Kentucky','Louisiana','Maine','Maryland',
  'Massachusetts','Michigan','Minnesota','Mississippi','Missouri','Montana',
  'Nebraska','Nevada','New Hampshire','New Jersey','New Mexico','New York',
  'North Carolina','North Dakota','Ohio','Oklahoma','Oregon','Pennsylvania',
  'Rhode Island','South Carolina','South Dakota','Tennessee','Texas','Utah',
  'Vermont','Virginia','Washington','West Virginia','Wisconsin','Wyoming']

In [None]:
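def plot_fatalities(state):
    """Plot cumulative fatalities for one US state: actuals plus the LANL, IHME, and Kaggle forecasts."""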

    fig = plt.figure(figsize=(12,8))
    ax = plt.axes()
    
    # to help with ylim
    max_preds = []

    # Actual
    select = (train['Country_Region']=='US') & (train['Province_State']==state)
    ids = train.loc[select].index.tolist()
    dates = train.loc[select, 'Date'].values
    data = train.loc[ids, 'Fatalities'].values
    plt.plot(dates, data, c='k', label='Actual', linewidth=4)
    max_preds.append(data[-1])

    # LANL 1
    lanl_ids = lanl_1['state'] == state
    dates = lanl_1.loc[lanl_ids, 'dates'].values
    data = lanl_1.loc[lanl_ids, 'q.50'].values
    plt.plot(dates, data, label=f'LANL (Apr 12)', c='g', alpha=0.5, linestyle='--')
    max_preds.append(data[-1])
    
    # IHME 1
    ihme_ids = ihme_1['location_name'] == state
    dates = ihme_1.loc[ihme_ids, 'date'].values.tolist()
    data = ihme_1.loc[ihme_ids, 'deaths_mean'].cumsum().values.tolist()
    start = dates.index('2020-03-15')
    dates = dates[start:]
    data = data[start:]
    plt.plot(dates, data, label=f'IHME (Apr 13)', alpha=0.5, c='b', linestyle='--')
    max_preds.append(data[-1])

    # Kaggle median of all selected submissions (line plot disabled below; the value still feeds the y-limit)
    select = (wk4_test['Country_Region']=='US') & (wk4_test['Province_State']==state)
    ids = wk4_test.loc[select].index.tolist()
    dates = wk4_test.loc[select, 'Date'].tolist()
    data = wk4_preds_fatalities.loc[ids].quantile(0.5, axis=1).values
#     plt.plot(dates[13:], data[13:], c='r', linestyle='--', label=f'Kaggle Median (Apr 14)') # predictions start on 13th row
    max_preds.append(data[-1])

    # Top four Week 3 teams
    subs = ['15210308.csv', '15210199.csv', '15208266.csv', '15210154.csv']
    data = wk4_preds_fatalities.loc[ids, subs].quantile(0.5, axis=1).values
    plt.plot(dates[13:], data[13:], c='r', label='Kaggle Top 4 Teams (Apr 14)')  # forecast rows start at index 13
    max_preds.append(data[-1])
    
    # LANL 2
    lanl_ids = lanl_2['state'] == state
    dates = lanl_2.loc[lanl_ids, 'dates'].values
    data = lanl_2.loc[lanl_ids, 'q.50'].values
    plt.plot(dates, data, label=f'LANL (Apr 15)', c='g')
    max_preds.append(data[-1])
    
    # IHME 2
    ihme_ids = ihme_2['location_name'] == state  # build the mask from ihme_2, the frame it indexes
    dates = ihme_2.loc[ihme_ids, 'date'].values.tolist()
    data = ihme_2.loc[ihme_ids, 'deaths_mean'].cumsum().values.tolist()
    start = dates.index('2020-03-15')
    dates = dates[start:]
    data = data[start:]
    plt.plot(dates, data, label=f'IHME (Apr 16)', c='b')
    max_preds.append(data[-1])
    

    
    fig.autofmt_xdate()
    ax.set_xlim(('2020-04-01', '2020-05-31'))
    ax.grid(False)
    plt.xticks(rotation=90)
    plt.title(f'{state} Fatalities (Cumulative)\n', fontsize=20)

    ylim = int(np.ceil(max(max_preds) / 100.0)) * 100 # round up to nearest 100
    ax.set_ylim(0, ylim)
    plt.legend(fontsize=14, loc=2)
    plt.show()

## How 3 days drastically changed an IHME prediction

Wyoming is a very clear instance of what a small bump can do to the IHME predictions. The blue dashed line is the prediction made on April 13. Because the actuals saw a step-change increase (albeit a relatively small one), the IHME model drastically changed its forecast (blue solid line) when it was updated three days later, on April 16.

Texas shows the opposite behavior. With three days of new data, the IHME predictions are reduced by more than a factor of 2.

Finally, Ohio is an example where the April 16 IHME forecast appears to be doing the best, but it's easy to see that having the latest data gives it an advantage over the other models.
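
The size of a revision like these can be put into a single number by comparing two forecast vintages on a shared date. This is a rough sketch only: `revision_ratio` is a hypothetical helper, and the values are made up rather than read from the IHME files.

```python
import pandas as pd

def revision_ratio(older, newer, date):
    """Ratio of the newer forecast to the older one on a shared date.

    Values below 1 mean the forecast was revised down, above 1 revised up.
    """
    return float(newer.loc[date] / older.loc[date])

# Made-up cumulative fatality forecasts indexed by date string
older = pd.Series([100.0, 400.0], index=['2020-04-30', '2020-05-31'])
newer = pd.Series([90.0, 160.0], index=['2020-04-30', '2020-05-31'])

ratio = revision_ratio(older, newer, '2020-05-31')
print(f'Revision ratio on 2020-05-31: {ratio:.2f}')
```

With these illustrative numbers, the later forecast is 0.4 times the earlier one, that is, a reduction by a factor of 2.5.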

In [None]:
plot_fatalities('Wyoming')

In [None]:
plot_fatalities('Texas')

In [None]:
plot_fatalities('Ohio')

# Plots of all states

In [None]:
for state in states:
    plot_fatalities(state)