SARS-CoV-2, Novel Coronavirus, Covid-19, Corona. There are almost as many names for the virus that has shaken the world in the last 5 months as there are approaches countries have taken to combat it. But what's really working?


To start answering this difficult question, we've decided to focus on four extreme cases: two "bad" examples and two "good" examples. What are the common threads, and what metrics can best predict whether a region's actions during a pandemic are good or bad?


# More Effective Covid-19 Response Strategies

**South Korea**
- Early school closures
- Widespread testing 
- No lockdown

**New Zealand**
- Widespread testing

# Less Effective Covid-19 Response Strategies

**Italy**
- Lockdowns too late
- Elderly population

**United States**
- Lockdowns
- Limited Testing

In [None]:
%matplotlib inline

import datetime
import matplotlib.pyplot as plt
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import plotly
import plotly.express as px
import plotly.graph_objs as go
import plotly.offline as py
from plotly.offline import iplot
import seaborn as sns

In [None]:
# Import dataset
covid_df = pd.read_csv('../input/covid19dataexploration/covid19_data.csv')

In [None]:
# Rename columns for ease of use 
covid_df['Date'] = pd.to_datetime(covid_df['Date'])
covid_df = covid_df.rename(columns={'Cumulative tests': 'agg_tests',
                                    'Cumulative tests per million': 'agg_tests_per_mil',
                                    'Total confirmed cases (cases)': 'agg_cases',
                                    'Confirmed cases per million (cases per million)': 'agg_cases_per_mil',
                                    'Total confirmed deaths (deaths)': 'agg_deaths',
                                    'Confirmed deaths per million (deaths per million)': 'agg_deaths_per_mil'})
covid_df[['agg_tests', 'agg_cases', 'agg_deaths']] = covid_df[['agg_tests', 'agg_cases', 'agg_deaths']].fillna(value=0)

We'll start by comparing total confirmed deaths over the course of the pandemic for the 4 countries we have chosen, starting each country's timeline on the day it recorded its first confirmed death so that we compare similar stages of the outbreak. This gives an approximation of the speed of the outbreak in each country.

We can do the same analysis for total confirmed cases and cumulative tests as well.
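
The "days since first X" alignment can also be computed without a row-by-row `apply`. The following is a minimal sketch on made-up toy data, not the notebook's dataset; the helper name `days_since_first_vectorized` is hypothetical, and it assumes rows carry a `Date` and an `Entity` column as in `covid_df`.

```python
import pandas as pd

# Toy data standing in for covid_df; the values are made up for illustration.
toy = pd.DataFrame({
    'Entity': ['A', 'A', 'A', 'B', 'B', 'B'],
    'Date': pd.to_datetime(['2020-03-01', '2020-03-02', '2020-03-03',
                            '2020-03-01', '2020-03-02', '2020-03-03']),
    'agg_deaths': [0, 1, 3, 0, 0, 2],
})

def days_since_first_vectorized(df, col):
    """Days elapsed since each entity's earliest date with a nonzero `col`."""
    # Keep the date only where `col` is nonzero, then broadcast the earliest
    # such date back to every row of the same entity.
    first_date = df['Date'].where(df[col] != 0).groupby(df['Entity']).transform('min')
    # Rows before the first nonzero value get negative day counts.
    return (df['Date'] - first_date).dt.days

toy['days_since_1st_death'] = days_since_first_vectorized(toy, 'agg_deaths')
print(toy)
```

Entities that never report a nonzero value end up with missing day counts, which mirrors the `None` returned by the row-wise version.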

In [None]:
# Calculate days since first X (can be cases, tests, death, etc.)
def days_since_first(col):
    first_dates = covid_df[covid_df[col] != 0].groupby('Entity').first()['Date']
    
    def day_diff(row):
        if row['Entity'] not in first_dates:
            return None
        return (row['Date'] - first_dates[row['Entity']]).days
    
    return covid_df.apply(day_diff, axis=1)

covid_df['days_since_1st_death'] = days_since_first('agg_deaths')
covid_df['days_since_1st_case'] = days_since_first('agg_cases')
covid_df['days_since_1st_test'] = days_since_first('agg_tests')

In [None]:
data=covid_df[covid_df['Entity'].isin(['Italy','United States', 'South Korea','New Zealand'])]
fig = px.line(data[data['days_since_1st_death'] >=0], x='days_since_1st_death', y='agg_deaths_per_mil', color='Entity')
fig.update_layout(title='Figure 1. Total Confirmed COVID-19 Deaths per Million Since 1st Confirmed Death',
                   xaxis_title='Days Since 1st Confirmed Death',
                   yaxis_title='Total Confirmed Deaths per Million')
fig.show()


In [None]:
data=covid_df[covid_df['Entity'].isin(['Italy','United States', 'South Korea','New Zealand'])]
fig = px.line(data[data['days_since_1st_case'] >=0], x='days_since_1st_case', y='agg_cases_per_mil', color='Entity')
fig.update_layout(title='Figure 2. Total Confirmed COVID-19 Cases per Million Since 1st Confirmed Case',
                   xaxis_title='Days Since 1st Confirmed Case',
                   yaxis_title='Total Confirmed Cases per Million')
fig.show()


In [None]:
data=covid_df[covid_df['Entity'].isin(['Italy','United States', 'South Korea','New Zealand'])]
fig = px.line(data[data['days_since_1st_test'] >=0], x='days_since_1st_test', y='agg_tests_per_mil', color='Entity')
fig.update_layout(title='Figure 3. Total COVID-19 Tests per Million Since 1st Reported Test',
                   xaxis_title='Days Since 1st Reported Test',
                   yaxis_title='Total Confirmed Tests per Million')
fig.show()

The figures comparing deaths per million and cases per million (Figures 1 and 2) make sense: Italy and the U.S. follow a roughly exponential curve, while New Zealand and South Korea stay flat. Figure 3, comparing tests per million, is more surprising. One would expect that deploying many tests quickly leads to better disease management. However, it appears Italy deployed more tests per million at a faster rate than South Korea, yet still saw the most dramatic growth in deaths per million.

Let's investigate the timing of when these tests were deployed to see if that sheds more light.

In [None]:
data=covid_df[covid_df['Entity'].isin(['Italy','United States', 'South Korea','New Zealand'])]
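# Plot against days since the first confirmed case, matching the x-axis title below.
fig = px.line(data[data['days_since_1st_case'] >= 0], x='days_since_1st_case', y='agg_tests_per_mil', color='Entity')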
fig.update_layout(title='Figure 4. Total COVID-19 Tests per Million since 1st Confirmed Case',
                   xaxis_title='Days Since 1st Confirmed Case',
                   yaxis_title='Total Confirmed Tests per Million')
fig.show()

Figure 4 suggests there is more to the story: measured from the first confirmed case in each region, South Korea did not deploy significantly more tests at a faster rate than Italy, yet it experienced significantly fewer deaths per capita. Perhaps there were already a lot of unconfirmed cases in Italy by the time the first case was confirmed, and the virus was spreading undetected?

To investigate this, we will look at the Test Positive Ratio, or the percentage of tests performed in a given region that come back as confirmed cases. The higher the Test Positive Ratio (TPR), the more likely that there are more cases out there that haven't been caught. A high TPR can imply that a region is primarily only testing to confirm obvious cases, leaving a lot of the less severe cases to go under the radar. 
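
As a small worked example of the ratio described above, here is a sketch on made-up numbers. The column names mirror `covid_df`, but the data and the choice to treat zero tests as missing are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Made-up cumulative counts: A and B confirm the same number of cases,
# but A has run ten times as many tests. C has not reported any tests.
toy = pd.DataFrame({
    'Entity': ['A', 'B', 'C'],
    'agg_cases': [50, 50, 3],
    'agg_tests': [1000, 100, 0],
})

# Treat zero tests as missing so the ratio is NaN rather than infinite.
tests = toy['agg_tests'].replace(0, np.nan)
toy['test_positive_ratio'] = toy['agg_cases'] / tests
print(toy)
```

Entity B's TPR is ten times Entity A's despite identical case counts, which is the situation where many milder cases are likely going uncounted. Because the population factor cancels, dividing cases per million by tests per million gives the same ratio as dividing the raw counts.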

To see if TPR is a better metric for understanding a region's response to COVID-19, we will look at the relationship between the maximum TPR vs deaths per million (Figure 5).

In [None]:
test_positive_ratio = covid_df['agg_cases_per_mil'].astype(float) / covid_df['agg_tests_per_mil'].astype(float)
covid_df['test_positive_ratio'] = test_positive_ratio

There are some days within the data where the U.S. is reporting more confirmed cases than tests performed. This could potentially be due to U.S. patients being tested by agencies outside the U.S. as they contracted the virus outside of the country. 

In [None]:
covid_df[covid_df['test_positive_ratio'] > 1]


For the sake of the following analysis, we will exclude these rows.

In [None]:
def generate_max(df, col):
    return df.groupby('Entity')[col].max()

# Exclude rows where reported cases exceed reported tests (TPR > 1).
valid_tpr_df = covid_df[covid_df['test_positive_ratio'] <= 1]
max_positive_ratio = generate_max(valid_tpr_df, 'test_positive_ratio')
max_death_per_mil = generate_max(valid_tpr_df, 'agg_deaths_per_mil')
max_cases_per_mil = generate_max(valid_tpr_df, 'agg_cases_per_mil')

# Combine the per-entity maxima; the shared 'Entity' index keeps the rows aligned.
positive_test_ratio_vs_deaths = pd.DataFrame({'max_positive_ratio': max_positive_ratio,
                                              'max_deaths_per_mil': max_death_per_mil,
                                              'max_cases_per_mil': max_cases_per_mil}).reset_index()

# Plot deaths per million (not cases) on the y-axis so the data matches the figure title.
fig = px.scatter(positive_test_ratio_vs_deaths,
                 x='max_positive_ratio',
                 y='max_deaths_per_mil',
                 color='Entity')
fig.update_layout(title='Figure 5. Maximum Test Positive Ratio (TPR) vs. Total Deaths per Million',
                   xaxis_title='Maximum Test Positive Ratio (TPR)',
                   yaxis_title='Total Deaths per Million')
fig.show()

## There appears to be a relatively linear trend between TPR and Total Deaths per Million where the higher the Test Positive Ratio, the more severe the pandemic is in that region. 

There are some significant outliers to this trend. The Philippines has a very high maximum TPR but low deaths per million, which may indicate that deaths there are underreported. Italy has very high total deaths per million, yet a relatively low maximum TPR. **Why is this the case for Italy?**


If a higher Test Positive Ratio (TPR) means an increased probability that there are more cases in a country that aren't being caught, then a decrease in TPR over time could mean that the region is past the peak of the pandemic. Let's dig in some more and see how TPR changes in our regions of interest over the course of the pandemic.

First, we compute the percentage of days on which tests were reported since the first case in a region was reported. We'll call this the test reporting rate.


In [None]:
entity = covid_df[covid_df['days_since_1st_case'] > 0].groupby('Entity')
entity_trr = entity['agg_tests_per_mil'].count() / (entity['days_since_1st_test'].last() - entity['days_since_1st_test'].first())
# Let's drop entities with zero tests.
entity_trr = entity_trr[entity_trr.ne(0)].dropna()

fig = plt.figure(figsize=(15,5))
g = sns.barplot(x=entity_trr.index, y=entity_trr.values)
g.axes.axhline(1, ls='--')
plt.title('Figure 6. Test Reporting Rate by Entity')
plt.xticks(rotation=90)

To account for sporadic test recording and get a better feel for general trends, we will take a rolling mean of TPR data over 7 days.
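
A per-entity rolling mean can be sketched with `groupby(...).transform(...)`, which returns values aligned to the original row index, so the result does not depend on how the rows are sorted. This is an illustrative alternative on made-up data; the helper name `rolling_mean_by_entity` is hypothetical, and a 3-row window is used only to keep the toy example short.

```python
import pandas as pd

# Made-up TPR values for two entities.
toy = pd.DataFrame({
    'Entity': ['A', 'A', 'A', 'A', 'B', 'B', 'B'],
    'test_positive_ratio': [0.10, 0.20, 0.30, 0.40, 0.50, 0.10, 0.30],
})

def rolling_mean_by_entity(df, col, window):
    """Rolling mean of `col` computed separately within each entity."""
    # min_periods=1 averages whatever rows are available at the start of a series.
    return df.groupby('Entity')[col].transform(
        lambda s: s.rolling(window, min_periods=1).mean())

toy['tpr_rolling'] = rolling_mean_by_entity(toy, 'test_positive_ratio', 3)
print(toy)
```

Note that the window counts rows rather than calendar days, so gaps in reporting stretch the window over a longer period.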

In [None]:
def plot_against_entity(df, entity_list, x, y, title, xaxis_title, yaxis_title, horiz_line=False):
    entity_df = df[df['Entity'].isin(entity_list)][['Entity', x, y]]
    fig = px.line(entity_df, x=x, y=y, color='Entity')
    fig.update_layout(title=title, xaxis_title=xaxis_title, yaxis_title=yaxis_title)
    if horiz_line:
        fig.update_layout(shapes=[dict(type='line',
                                   yref='y', y0=0, y1=0,
                                   xref='paper', x0=0, x1=1)])
    return fig

def generate_rolling_mean(df, days, mean_col):
    return df.reset_index().set_index('days_since_1st_case').groupby('Entity').rolling(days, min_periods=1)[mean_col].mean().values

covid_df.loc[covid_df['test_positive_ratio'] <= 1, 'test_positive_ratio_7_day_rolling'] = generate_rolling_mean(covid_df[covid_df['test_positive_ratio'] <= 1], 7, 'test_positive_ratio')

# covid_df.loc['test_positive_ratio'] <= 1]['test_positive_ratio_5_day_rolling'] = generate_rolling_mean(covid_df[covid_df['test_positive_ratio'] <= 1], 5, 'test_positive_ratio')

plot_against_entity(df=covid_df[(covid_df['days_since_1st_test'] >= 0) & (covid_df['test_positive_ratio'] <= 1)], 
                    entity_list=['Italy','United States', 'South Korea','New Zealand'], 
                    x='days_since_1st_test', y='test_positive_ratio_7_day_rolling',
                    title='Figure 7. 7-Day Rolling Test Positive Ratio (TPR)', 
                    xaxis_title='Days Since 1st Test', 
                    yaxis_title='7-Day Rolling Test Positive Ratio (TPR)')

Italy appears to peak 

I'd like to look at the maximum rate of change (ROC) of the test positive ratio to find where an outbreak might be occurring. To remove some of the noise that appears in the first week of reported testing data, we will look at days 8 and beyond from the 1st reported test. This noise may come from countries still getting their test reporting structures in place.
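
The rate of change here is the day-over-day percent change in TPR. The sketch below uses made-up data to show why it may be worth computing the change within each entity: an ungrouped `pct_change` would compare the first row of one entity against the last row of the previous one.

```python
import pandas as pd

# Made-up TPR values; rows are ordered by entity, then by date.
toy = pd.DataFrame({
    'Entity': ['A', 'A', 'A', 'B', 'B'],
    'test_positive_ratio': [0.10, 0.20, 0.15, 0.40, 0.50],
})

# Percent change within each entity; the first row of each entity is NaN.
toy['tpr_roc'] = toy.groupby('Entity')['test_positive_ratio'].pct_change()

# Ungrouped percent change, shown only for comparison.
toy['tpr_roc_ungrouped'] = toy['test_positive_ratio'].pct_change()
print(toy)
```

In the ungrouped column, Entity B's first row shows a large jump that is really just the difference between two unrelated entities.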

In [None]:
# covid_df['test_positive_ratio_7_day_rolling_ROC'] = covid_df['test_positive_ratio_7_day_rolling'].pct_change()
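# Compute the percent change within each entity so that the first row of one entity
# is not compared against the last row of the previous entity.
covid_df['test_positive_ratio_ROC'] = covid_df[covid_df['test_positive_ratio'] <= 1].groupby('Entity')['test_positive_ratio'].pct_change()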

# Replace infinite values with NaN
covid_df = covid_df.replace([np.inf, -np.inf], np.nan)

covid_df['test_positive_ratio_7_day_rolling_ROC'] = generate_rolling_mean(covid_df, 7, 'test_positive_ratio_ROC')

# max_tpr_5_day_rolling_roc = covid_df.groupby('Entity')['test_positive_ratio_5_day_rolling_ROC'].max()

covid_df_after_7_days = covid_df[covid_df['days_since_1st_test'] >= 7]
fig = plot_against_entity(df=covid_df_after_7_days,
                    entity_list=['Italy', 'New Zealand', 'South Korea', 'United States'],
                    x='days_since_1st_test',
                    y='test_positive_ratio_7_day_rolling_ROC',
                    title='Figure 8. 7-Day Rolling Test Positive Ratio (TPR) Rate of Change (ROC)', 
                    xaxis_title='Days Since 1st Reported Test ', 
                    yaxis_title='7-Day Rolling Test Positive Ratio (TPR) Rate of Change (ROC)',
                    horiz_line=True)
fig.show()

Next, for each entity, let's find the date of the maximum 7-day rolling TPR ROC and the first later date on which the rolling ROC turns negative. We'll treat the number of days between those two dates as the length of that entity's peak.

In [None]:
maxidx_tpr_7_day_rolling_roc_per_entity = covid_df_after_7_days.replace([np.inf, -np.inf], np.nan).groupby('Entity')['test_positive_ratio_7_day_rolling_ROC'].idxmax().dropna()

def find_neg_tpr_roc_after_max(row):
    if row['Entity'] not in maxidx_tpr_7_day_rolling_roc_per_entity:
        return
    if row.name < maxidx_tpr_7_day_rolling_roc_per_entity[row['Entity']]:
        return
    if row['test_positive_ratio_7_day_rolling_ROC'] < 0:
        return row['Date']
    
entity_with_completed_peak_list = []
first_neg_tpr_roc_after_max = []
for entity in covid_df_after_7_days['Entity'].unique():
    tmp = covid_df_after_7_days[covid_df_after_7_days['Entity'] == entity].apply(find_neg_tpr_roc_after_max, axis=1)
    if len(tmp.dropna().index) > 0:
        entity_with_completed_peak_list.append(entity)
        first_neg_tpr_roc_after_max.append(tmp.dropna().iloc[0])

entity_first_neg_dict = dict(zip(entity_with_completed_peak_list, first_neg_tpr_roc_after_max))

# Now using this dictionary we can find the length of the peaks for those countries that had a negative TPR ROC after the max.
entities_completed_peak_length = {}
for entity, first_neg_date in entity_first_neg_dict.items():
    tpr_roc_max_idx = maxidx_tpr_7_day_rolling_roc_per_entity[entity]
    tpr_roc_max_date = covid_df.loc[tpr_roc_max_idx, 'Date']
    length = (first_neg_date - tpr_roc_max_date).days
    entities_completed_peak_length.update({entity: length})

print(entities_completed_peak_length)
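
The peak-length logic above can be sketched for a single entity as follows. The series is made up, and the helper name `peak_length_days` is hypothetical; it assumes a date-indexed, chronologically sorted series of rolling ROC values.

```python
import pandas as pd

# Made-up 7-day rolling TPR ROC values for one entity.
roc = pd.Series(
    [0.02, 0.10, 0.30, 0.12, 0.04, -0.01, -0.03],
    index=pd.date_range('2020-03-10', periods=7, freq='D'),
)

def peak_length_days(roc_series):
    """Days from the maximum ROC to the first negative ROC at or after it.

    Returns None when the ROC has not yet turned negative after its maximum.
    """
    clean = roc_series.dropna()
    if clean.empty:
        return None
    max_date = clean.idxmax()
    after_max = clean.loc[max_date:]
    negative = after_max[after_max < 0]
    if negative.empty:
        return None
    return (negative.index[0] - max_date).days

print(peak_length_days(roc))           # completed peak
print(peak_length_days(roc.iloc[:5]))  # data ends before the ROC turns negative
```

An entity for which the function returns `None` corresponds to one that is "not yet through its peak" in the analysis that follows.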

In [None]:
entities_peak_length_df = pd.DataFrame(entities_completed_peak_length.values(), columns=['Peak Length'], index=entities_completed_peak_length.keys())

fig = px.bar(entities_peak_length_df, x=entities_peak_length_df.index, y=entities_peak_length_df['Peak Length'])
fig.update_layout(title='Figure 9. Completed Peak Lengths by Entity',
                   xaxis_title='Entities',
                   yaxis_title='Peak Length')
fig.show()

Now let's get the average outbreak length across all these countries. We can use this to "predict" when countries that have not yet reached the peak might reach it.

There are a lot of factors we're not accounting for. We could define a transfer function between tests/cases/deaths (or rate of change of those) and the length in days of the outbreak. This would help us better predict the length of our current outbreak.

If an entity hasn't reached the peak yet, but is already more than the average number of days (~12 days) past its max TPR ROC without a negative TPR ROC, we could hypothesize that it either hasn't reached its true maximum TPR ROC or that the measures it is taking are flattening the curve.

In [None]:
average_outbreak_length = np.array([entities_completed_peak_length[entity] for entity in entities_completed_peak_length]).mean()
print('Average Outbreak Length: {} days'.format(average_outbreak_length))

entities_not_yet_through_peak = set(maxidx_tpr_7_day_rolling_roc_per_entity.index).difference(set(entity_with_completed_peak_list))
print('Entities not yet through peak: {}'.format(entities_not_yet_through_peak))

For the entities that are not yet through their peak of the pandemic, let's try to fit a model to predict when they will experience a peak. To accomplish this, we will use the maximum of Deaths per Million, Cases per Million, and Tests per Million as inputs to the model, as well as generated features such as Mean Case Fatality Rate (CFR, deaths per confirmed cases), Maximum CFR, Mean and Maximum TPR, Mean and Maximum TPR ROC, and the incremental area under the curve of TPR (TPR iAUC).

### Feature Generation

In [None]:
def calc_features(covid_entity):
    # Let's build a DataFrame of maximum aggregate X per million with the outbreak length in days.
    covid_entity['max_agg_deaths_per_mil'] = covid_entity['agg_deaths_per_mil'].max()
    covid_entity['max_agg_cases_per_mil'] = covid_entity['agg_cases_per_mil'].max()
    covid_entity['max_agg_tests_per_mil'] = covid_entity['agg_tests_per_mil'].max()

    # Calculate Case Fatality Rate (CFR) and find the maximum and mean.
    covid_entity['mean_cfr'] = covid_entity['cfr'].mean()
    covid_entity['max_cfr'] = covid_entity['cfr'].max()

    # Calculate mean/max Test Positive Rate
    covid_entity['mean_tpr'] = covid_entity['test_positive_ratio'].mean()
    covid_entity['max_tpr'] = covid_entity['test_positive_ratio'].max()

    # Calculate mean (raw) and max (7-day rolling) Test Positive Rate Rate of Change
    covid_entity['mean_tpr_roc'] = covid_entity['test_positive_ratio_ROC'].mean()
    covid_entity['max_tpr_roc'] = covid_entity['test_positive_ratio_7_day_rolling_ROC'].max()

    # Calculate Area Under the Curve of Test Positive Rate over Time
    covid_entity_prev_days = covid_entity['days_since_1st_test'].shift(periods=1)
    covid_entity_prev_tpr = covid_entity['test_positive_ratio'].shift(periods=1)
    day_diff = covid_entity['days_since_1st_test'] - covid_entity_prev_days
    tpr_sum = covid_entity['test_positive_ratio'] + covid_entity_prev_tpr
    tpr_auc = 0.5 * day_diff * tpr_sum
    covid_entity['tpr_iauc'] = tpr_auc.sum()

    return covid_entity
    
covid_df['cfr'] = covid_df['agg_deaths'] / covid_df['agg_cases']

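# group_keys=False keeps the original row index, so 'Entity' stays an ordinary column
# and the later groupby('Entity') is not ambiguous on recent pandas versions.
calculated_features = covid_df.groupby('Entity', group_keys=False).apply(calc_features)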

feature_list = ['max_agg_deaths_per_mil', 'max_agg_cases_per_mil', 'max_agg_tests_per_mil',
                'mean_cfr', 'max_cfr',
                'mean_tpr', 'max_tpr', 'max_tpr_roc', 'mean_tpr_roc', 'tpr_iauc']

# Let's see what these features look like.
print(calculated_features[calculated_features['Entity'] == 'South Korea'][feature_list].head())
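
The `tpr_iauc` feature above is a trapezoid-rule sum. Here is a minimal sketch on made-up numbers that checks the shift-based form used in `calc_features` against an explicit loop over segments.

```python
import pandas as pd

# Made-up observations with an uneven gap between day 1 and day 3.
days = pd.Series([0, 1, 3, 4])
tpr = pd.Series([0.10, 0.20, 0.40, 0.20])

# Shift-based trapezoids, mirroring calc_features. The first row has no
# previous point, so it is NaN and is skipped by sum().
day_diff = days - days.shift(1)
tpr_sum = tpr + tpr.shift(1)
iauc_shift = (0.5 * day_diff * tpr_sum).sum()

# The same area written as an explicit loop over consecutive pairs.
iauc_loop = sum(0.5 * (days[i] - days[i - 1]) * (tpr[i] + tpr[i - 1])
                for i in range(1, len(days)))

print(iauc_shift, iauc_loop)
```

One caveat with the shift-based form: a missing TPR on any single day silently drops both trapezoids that touch that day, so the area is understated for entities with gaps in reporting.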

Let's build a DataFrame consisting of only the countries that have completed their outbreak.

In [None]:
outbreak_length_df = pd.DataFrame(entities_completed_peak_length.values(), index=entities_completed_peak_length.keys(), columns=['outbreak_length_in_days'])
single_calc_features = calculated_features.groupby('Entity')[feature_list].first()
for feature in feature_list:
    outbreak_length_df[feature] = single_calc_features[feature]

# Let's look at this too.
print(outbreak_length_df.head())

### Training

Now let's fit a couple of models to the data. One will be polynomial and the other will be linear.

We're training on all of the entities that are through the peak of the pandemic, with no held-out test set, because we have so little data. We can test our models later by comparing the predicted peaks to real data from entities that have since completed their peaks (our dataset only runs through early April).

In [None]:
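from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

X = outbreak_length_df[feature_list].values
y = outbreak_length_df['outbreak_length_in_days'].values

poly_reg = PolynomialFeatures(degree=2)
X_poly = poly_reg.fit_transform(X)

# The `normalize` argument was removed from LinearRegression in scikit-learn 1.2,
# so the features are scaled in a pipeline instead (a close substitute, not an exact one).
poly_model = make_pipeline(StandardScaler(), LinearRegression())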
poly_model.fit(X_poly, y)

print('Poly Model Score: {}\n'.format(poly_model.score(X_poly, y)))

model = LinearRegression()
model.fit(X, y)

print('Linear Model Score: {}\n'.format(model.score(X, y)))

def predict_peak(entity):
    values = [single_calc_features.loc[entity, feature_list].values]
    return model.predict(values)[0]

def predict_peak_poly(entity):
    # Use transform (not fit_transform) so the already-fitted expansion is reused.
    values = poly_reg.transform([single_calc_features.loc[entity, feature_list].values])
    return poly_model.predict(values)[0]

### Prediction

We got a perfect fit with the polynomial model, so we've almost certainly overfit our small dataset with that model. 0.53 is the best R^2 value we've been able to get for the linear model. Let's use our models to predict the end of the pandemic peak for entities that have not yet made it through the peak (with the data we have).
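
To see why a perfect training score is a warning sign rather than good news, here is a sketch using only NumPy and pure random noise. The sample and feature counts are arbitrary assumptions chosen so that there are more fitted coefficients than observations, which is the situation a degree-2 expansion of ten features creates on a small set of entities.

```python
import numpy as np

rng = np.random.default_rng(0)

# More coefficients than observations: 8 samples, 12 features plus an intercept.
n_samples, n_features = 8, 12
X = rng.normal(size=(n_samples, n_features))
y = rng.normal(size=n_samples)  # pure noise, unrelated to X

# Ordinary least squares with an intercept column.
X1 = np.column_stack([np.ones(n_samples), X])
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
pred = X1 @ coef

# Training R^2: the fraction of variance "explained" on the data we fit to.
r2 = 1 - ((y - pred) ** 2).sum() / ((y - y.mean()) ** 2).sum()
print('Training R^2 on pure noise: {:.6f}'.format(r2))
```

The model reproduces the noise essentially exactly, so a training R^2 near 1 in this regime says nothing about how well the model will predict entities it has not seen.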

First, let's use the Polynomial Model.

In [None]:
def run_entities_through_model(poly=False):
    predicted_peak = {}
    for entity in entities_not_yet_through_peak:
        if entity not in single_calc_features.index:
            continue
        maxdate = covid_df.loc[maxidx_tpr_7_day_rolling_roc_per_entity[entity], 'Date']
        predicted_peaklength = predict_peak_poly(entity) if poly else predict_peak(entity)
        predicted_peakdate = maxdate + datetime.timedelta(days=predicted_peaklength)
        predicted_peak.update({entity: {'start': maxdate, 'peak': predicted_peakdate, 'length': predicted_peaklength}})
        
    return predicted_peak

predicted_peak_poly = run_entities_through_model(poly=True)

fig = plt.figure(figsize=(15,5))
sns.barplot(x=list(predicted_peak_poly.keys()), y=[val['length'] for val in predicted_peak_poly.values()])
plt.title('Figure 10. Predicted Peak Length in Days (Polynomial Model)')
plt.xlabel('Entity')
plt.ylabel('Peak Length (Days)')
plt.xticks(rotation=45)

We're getting huge negative numbers for the Philippines and even larger numbers for the US. It seems the polynomial model has definitely overfit the dataset. Let's try the linear model instead.

In [None]:
predicted_peak = run_entities_through_model(poly=False)

fig = plt.figure(figsize=(15,5))
sns.barplot(x=list(predicted_peak.keys()), y=[val['length'] for val in predicted_peak.values()])
plt.title('Figure 11. Predicted Peak Length in Days (Linear Model)')
plt.xlabel('Entity')
plt.ylabel('Peak Length (Days)')
plt.xticks(rotation=45)

This seems far more reasonable. Let's create a timeline with this data, showing for each entity the span from the date of its maximum TPR ROC to the predicted end of its peak.

In [None]:
fig = plt.figure(figsize=(20,5))
plt.title('Figure 12. Predicted Peak Timeline')
plt.xlabel('Date')
plt.ylabel('Entity')
plt.grid(True)
for entity, item in predicted_peak.items():
    plt.plot([item['start'], item['peak']], [entity, entity], linewidth=10)

In [None]:
for entity, item in predicted_peak.items():
    print(entity)
    print('Start: {}, End: {}\n'.format(item['start'], item['peak']))

## Let's check our model with more current real data.

In [None]:
ihme_covid_df = pd.read_csv('/kaggle/input/ihmes-covid19-projections/2020_05_10/Hospitalization_all_locs.csv')
ihme_covid_df['date'] = pd.to_datetime(ihme_covid_df['date'])
ihme_covid_df['tpr'] = ihme_covid_df['confirmed_infections'] / ihme_covid_df['total_tests']

def days_since_first_ihme(col):
    first_dates = ihme_covid_df[ihme_covid_df[col] != 0].groupby('location_name').first()['date']
    
    def day_diff(row):
        if row['location_name'] not in first_dates:
            return None
        return (row['date'] - first_dates[row['location_name']]).days
    
    return ihme_covid_df.apply(day_diff, axis=1)

ihme_covid_df['days_since_1st_case'] = days_since_first_ihme('confirmed_infections')
ihme_covid_df['days_since_1st_test'] = days_since_first_ihme('total_tests')

def generate_rolling_mean_ihme(df, days, mean_col):
    return df.reset_index().set_index('days_since_1st_case').groupby('location_name').rolling(days, min_periods=1)[mean_col].mean().values

ihme_covid_df.loc[ihme_covid_df['tpr'] <= 1, 'tpr_7_day_rolling'] = generate_rolling_mean_ihme(ihme_covid_df[ihme_covid_df['tpr'] <= 1], 7, 'tpr')

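# Compute the percent change within each location so values are not compared across locations.
ihme_covid_df['tpr_ROC'] = ihme_covid_df[ihme_covid_df['tpr'] <= 1].groupby('location_name')['tpr'].pct_change()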

# Replace infinite values with NaN
ihme_covid_df = ihme_covid_df.replace([np.inf, -np.inf], np.nan)

ihme_covid_df['tpr_7_day_rolling_ROC'] = generate_rolling_mean_ihme(ihme_covid_df, 7, 'tpr_ROC')

In [None]:
ihme_covid_df_after_7_days = ihme_covid_df[ihme_covid_df['days_since_1st_test'] >= 7]
maxidx_tpr_7_day_rolling_roc_per_entity = ihme_covid_df_after_7_days.replace([np.inf, -np.inf], np.nan).groupby('location_name')['tpr_7_day_rolling_ROC'].idxmax().dropna()

def find_neg_tpr_roc_after_max_ihme(row):
    if row['location_name'] not in maxidx_tpr_7_day_rolling_roc_per_entity:
        return
    if row.name < maxidx_tpr_7_day_rolling_roc_per_entity[row['location_name']]:
        return
    if row['tpr_7_day_rolling_ROC'] < 0:
        return row['date']
    
entity_with_completed_peak_list_ihme = []
first_neg_tpr_roc_after_max_ihme = []
for entity in ihme_covid_df_after_7_days['location_name'].unique():
    tmp = ihme_covid_df_after_7_days[ihme_covid_df_after_7_days['location_name'] == entity].apply(find_neg_tpr_roc_after_max_ihme, axis=1)
    if len(tmp.dropna().index) > 0:
        entity_with_completed_peak_list_ihme.append(entity)
        first_neg_tpr_roc_after_max_ihme.append(tmp.dropna().iloc[0])

entity_first_neg_dict_ihme = dict(zip(entity_with_completed_peak_list_ihme, first_neg_tpr_roc_after_max_ihme))

# Now using this dictionary we can find the length of the peaks for those countries that had a negative TPR ROC after the max.
entities_completed_peak_length_ihme = {}
for entity, first_neg_date in entity_first_neg_dict_ihme.items():
    tpr_roc_max_idx = maxidx_tpr_7_day_rolling_roc_per_entity[entity]
    tpr_roc_max_date = ihme_covid_df.loc[tpr_roc_max_idx, 'date']
    length = (first_neg_date - tpr_roc_max_date).days
    entities_completed_peak_length_ihme.update({entity: length})

print(entities_completed_peak_length_ihme)

In [None]:
entity_absolute_relative_error = []
entity_ok = []
for entity, length in entities_completed_peak_length_ihme.items():
    if length <= 0:
        continue
    if entity in predicted_peak:
        print('Entity: {}'.format(entity))
        print('Predicted length of pandemic peak: {}'.format(predicted_peak[entity]['length']))
        print('Actual length of pandemic peak: {}'.format(length))
        score = abs(predicted_peak[entity]['length'] - length) / length
        print('Error: {}\n'.format(score))
        entity_absolute_relative_error.append(score)
        entity_ok.append(entity)
        
        

print('Average relative error: {}'.format(np.array(entity_absolute_relative_error).mean()))
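
The error metric used above is the absolute relative error, |predicted - actual| / actual, averaged over entities. A small sketch with made-up numbers follows; the helper name `mean_absolute_relative_error` is hypothetical.

```python
import numpy as np

def mean_absolute_relative_error(predicted, actual):
    """Average of |predicted - actual| / actual over entities present in both dicts.

    Entities with a non-positive actual length are skipped, as in the cell above.
    """
    scores = [abs(predicted[name] - actual[name]) / actual[name]
              for name in predicted
              if name in actual and actual[name] > 0]
    return float(np.mean(scores)) if scores else float('nan')

# Made-up peak lengths in days.
predicted = {'X': 15.0, 'Y': 8.0, 'Z': 10.0}
actual = {'X': 12, 'Y': 10, 'Z': 0}

print(mean_absolute_relative_error(predicted, actual))
```

Because the error is divided by the actual length, short actual peaks inflate the score: being off by 3 days on a 3-day peak counts as 100% error, while the same miss on a 30-day peak counts as 10%.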

In [None]:


for entity, item in predicted_peak.items():
    if entity not in entity_ok:
        continue
    entity_ihme = ihme_covid_df[ihme_covid_df['location_name'] == entity]
    entity_df = entity_ihme[['date', 'confirmed_infections']].dropna()
    fig = px.line(entity_df, x='date', y='confirmed_infections')
    fig.update_layout(title='{} Infections Per Day with Pandemic Peak Prediction Overlay'.format(entity))
    fig.add_shape(
                type="rect",
                # x-reference is assigned to the x-values
                xref="x",
                # y-reference is assigned to the plot paper [0,1]
                yref="paper",
                x0=item['start'],
                y0=0,
                x1=item['peak'],
                y1=1,
                fillcolor="LightSalmon",
                opacity=0.5,
                layer="below",
                line_width=0,
            )
    fig.show()