# COVID-19 Global Forecasting (Week 4)
## Exploring Available Data
![Alt text](https://images.newscientist.com/wp-content/uploads/2020/02/11165812/c0481846-wuhan_novel_coronavirus_illustration-spl.jpg)

The purpose of this notebook is to explore the various methods of modelling the available dataset in order to accurately forecast cases and fatalities of COVID-19. <br>

__Author:__ Cohen Robinson <br>
__Date Created:__ 10/04/2020 <br>
__Date Edited:__ 12/04/2020 <br>

## Changelog
- __v1.0.0__: Created initial version
- __v1.0.1__: Added further detail on global outbreak

___

## Table of Contents

- [Data Exploration](#data_exploration)
- [Preprocessing the Data](#data_pre)
- [Exploratory Data Analysis (EDA)](#eda)
    - [COVID-19: State of the World](#stateofworld)
    - [Infection Case Studies](#inf_hot)
        1. [China]()
        2. [United States]()
        3. [Australia](#cs_au)
        4. [Italy]()
        5. [France]()
___

## Data Exploration
<a id="data_exploration"></a>
Before we get started on forecasting and other exciting things, let us first understand the dataset that's been given to us.<br>

The provided dataset gives the cumulative sum of confirmed cases and deaths by date, country and state.

Let's have a look at the data "as is".

In [None]:
!pip install pycountry-convert
!pip install country-converter
!pip install plotly
!pip install plotly_express

In [None]:
# import libraries
import numpy as np
import pandas as pd

import datetime as dt
import os
import requests

import country_converter as coco
import pycountry as pyco
import pycountry_convert as pc

import plotly_express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

%matplotlib inline

In [None]:
# global constants
BASE_DATASET_DIR = "/kaggle/input/covid19-global-forecasting-week-4"

In [None]:
TRAIN_DF = pd.read_csv(os.path.join(BASE_DATASET_DIR, "train.csv"))
TEST_DF = pd.read_csv(os.path.join(BASE_DATASET_DIR, "test.csv"))

TRAIN = "train"
TEST = "test"

BASE_DFS = {TRAIN: TRAIN_DF, TEST: TEST_DF}

In [None]:
display(TRAIN_DF.head())
display(TRAIN_DF.describe())
TRAIN_DF.info()  # prints its summary directly and returns None
display(TRAIN_DF.dtypes)

Now that we have some idea of the data, let's look at what we can do to preprocess it.
___

## Preprocessing the data
<a id='data_pre'></a>
What is clear from our brief exploration of the data is the following:
- `Date` field is currently stored as an `object` rather than as a `datetime` field.
- `Id` field is fairly irrelevant, and only serves as an index.
- `Country_Region` and `Province_State` fields may need to be controlled, as these will be very important when joining other datasets.
- `Province_State` fields have some NaN fields.
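
The observations above can be reproduced with a couple of pandas calls. The sketch below uses a small hand-made frame (`toy_df`, a hypothetical stand-in for `TRAIN_DF` filled with placeholder values) to show how missing values and an unparsed `Date` column show up.

```python
import pandas as pd

# Hypothetical stand-in for TRAIN_DF; the values are placeholders, not real data.
toy_df = pd.DataFrame({
    "Id": [1, 2, 3],
    "Province_State": [None, "New South Wales", None],
    "Country_Region": ["Italy", "Australia", "France"],
    "Date": ["2020-01-22", "2020-01-22", "2020-01-22"],
})

# Count missing values per column.
missing_counts = toy_df.isna().sum()

# Check whether 'Date' has already been parsed as a datetime.
date_is_datetime = pd.api.types.is_datetime64_any_dtype(toy_df["Date"])

print(missing_counts)
print("Date parsed as datetime:", date_is_datetime)
```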

In [None]:
# reformat the 'Date' field first
for df in BASE_DFS.values():
    df['Date'] = pd.to_datetime(df['Date'], format='%Y-%m-%d')

Before moving on to the next preprocessing stage, let's briefly turn our attention to the bounds of the `Date` field.

In [None]:
min_train_date = BASE_DFS[TRAIN]["Date"].min()
max_train_date = BASE_DFS[TRAIN]["Date"].max()

print(f"Our training dataset ranges from {min_train_date} to {max_train_date}")
print("Total of %d days" % (max_train_date - min_train_date).days)

In [None]:
min_test_date = BASE_DFS[TEST]["Date"].min()
max_test_date = BASE_DFS[TEST]["Date"].max()

print(f"Our testing dataset ranges from {min_test_date} to {max_test_date}")
print("Total of %d days" % (max_test_date - min_test_date).days)

As we can see from the above, we have 72 days worth of training data, and then 42 days thereafter for testing.
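
Note that the difference printed above counts the gap between the first and last date, so a range covering n calendar days prints n - 1. It can also be worth confirming whether the test window starts after the training window ends or overlaps it. A minimal sketch of such a check, using a hypothetical helper `date_overlap_days` and placeholder dates:

```python
import pandas as pd

def date_overlap_days(train_dates, test_dates):
    """Return the number of calendar days shared by the two date ranges (0 if disjoint)."""
    start = max(train_dates.min(), test_dates.min())
    end = min(train_dates.max(), test_dates.max())
    return max((end - start).days + 1, 0)

# Placeholder ranges for illustration only; they are not the competition dates.
toy_train = pd.Series(pd.date_range("2020-01-01", "2020-01-10"))
toy_test = pd.Series(pd.date_range("2020-01-08", "2020-01-20"))

print(date_overlap_days(toy_train, toy_test))  # 3 shared days: 8, 9 and 10 January
```

With the frames in this notebook, the call would be `date_overlap_days(BASE_DFS[TRAIN]["Date"], BASE_DFS[TEST]["Date"])`.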

Now we move on to setting the `Id` field as an index.

In [None]:
for df in BASE_DFS.values():
    id_col = df.columns[0] # makes the assumption that id is 1st col
    df.set_index(id_col, inplace=True)  # set_index returns a copy unless inplace=True

We can now move on to standardising the country names. The [`country-converter`](https://pypi.org/project/country-converter/) package is installed for this purpose, although the lookups below go through `pycountry` and `pycountry_convert`.

In [None]:
def co_to_continent(alpha_2):
    """
    Converts country (ISO-2) to continent (name).
    
    Arguments:
        alpha_2: two-letter ISO 3166-1 country code
    """
    
    if len(alpha_2) != 2:
        return "UNKNOWN"
    try:
        continent_code = pc.country_alpha2_to_continent_code(alpha_2)
    except KeyError:
        return "UNKNOWN"
    
    return pc.convert_continent_code_to_continent_name(continent_code)

def co_to_country(country_name):
    """
    Converts a country name to a pycountry Country object
    (None if no match is found).
    
    Arguments:
        country_name
    """
    
    result = pyco.countries.get(name=country_name)
    
    if result is None:
        try:
            result = (pyco.countries.search_fuzzy(country_name))[0]
        except LookupError:
            result = None
        
    return result

In [None]:
cc = coco.CountryConverter()

for df in BASE_DFS.values():
    # saves repeatedly searching unnecessarily
    unique_countries = df["Country_Region"].unique()
    conv_countries = [co_to_country(country) for country in unique_countries]
    country_dict = dict(zip(unique_countries, conv_countries))
    
    df["country_iso2"] = df["Country_Region"].apply(lambda x: country_dict[x].alpha_2 
                                                    if country_dict[x] is not None else "")
    
    df["country_iso3"] = df["Country_Region"].apply(lambda x: country_dict[x].alpha_3 
                                                    if country_dict[x] is not None else "")
    
    df["Country_Region"] = df["Country_Region"].apply(lambda x: country_dict[x].name 
                                                      if country_dict[x] is not None else x)
    
    df["continent"] = df["country_iso2"].apply(co_to_continent)

In [None]:
# replace the NaN values
for df in BASE_DFS.values():
    df["Province_State"] = df["Province_State"].fillna("NaN")

## Exploratory Data Analysis (EDA)
<a id="eda"></a>
After exploring the base dataset and doing some light preprocessing, we can now begin to explore the dataset further.

### COVID-19: Current State of the World
<a id="stateofworld"></a>
Let's have a brief look at what is happening in the world with respect to COVID-19.

In [None]:
df_temp = BASE_DFS[TRAIN].copy()
df_temp["Province_State"] = df_temp["Province_State"].apply(lambda x: "" if x == "NaN" else x)

# we want to see data filtered on the current date (can change)
date = df_temp.Date.max()
df_temp = df_temp[df_temp['Date']==date]

df_temp["world"] = "World"
fig = px.treemap(df_temp, path=['world', 'continent', 'Country_Region','Province_State'], 
                 values='ConfirmedCases', color='ConfirmedCases', hover_data=['Country_Region'],
                 color_continuous_scale='haline_r', title='Current distribution of Global COVID-19 Cases')
fig.show()

In [None]:
df_temp = BASE_DFS[TRAIN].copy()
df_temp["Province_State"] = df_temp["Province_State"].apply(lambda x: "" if x == "NaN" else x)

# we want to see data filtered on the current date (can change)
date = df_temp.Date.max()
df_temp = df_temp[df_temp['Date']==date]

df_temp["world"] = "World"
fig = px.treemap(df_temp, path=['world', 'continent', 'Country_Region','Province_State'], values='Fatalities',
                  color='Fatalities', hover_data=['Country_Region'],
                  color_continuous_scale='magma_r', title='Current distribution of Global COVID-19 Deaths')
fig.show()

From the above, it is clear that Europe makes the largest contribution to global COVID-19 cases and deaths, followed by North America.

What I find interesting is the difference between the distributions of global deaths and global cases. The _United States_ has the most cases and deaths of any single country, yet it accounts for a lower proportion of global deaths than of global cases. This could mean one of two things:

- The outbreak is fairly recent in the _United States_, hence the full death potential hasn't been realised; and/or
- Countries in _Europe_ have a higher death rate, due to being more susceptible to the virus.

> These two factors could be important in forecasting the number of deaths. So we will explore these ideas further later.
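
The comparison between a country's share of global cases and its share of global deaths can be made explicit. The sketch below is one possible way to compute it; `toy_latest` and its figures are placeholders standing in for the latest-date slice of the training data.

```python
import pandas as pd

# Placeholder figures for illustration only; they are not taken from the dataset.
toy_latest = pd.DataFrame({
    "Country_Region": ["A", "B", "C"],
    "ConfirmedCases": [500, 300, 200],
    "Fatalities": [10, 30, 10],
})

# Share of the global total held by each country, for cases and for deaths.
shares = toy_latest.set_index("Country_Region")[["ConfirmedCases", "Fatalities"]]
shares = shares / shares.sum()
shares.columns = ["case_share", "death_share"]

# A ratio below 1 means the country holds a smaller share of deaths than of cases.
shares["death_to_case_share"] = shares["death_share"] / shares["case_share"]

print(shares.round(2))
```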

___

Now let's have a look at how the cases and deaths have developed globally over time.

In [None]:
df_world = BASE_DFS[TRAIN].groupby('Date')[['ConfirmedCases', 'Fatalities']].sum()

fig = go.Figure(data=[
    go.Bar(name='Cases', x=df_world.index, y=df_world['ConfirmedCases']),
    go.Bar(name='Fatalities', x=df_world.index, y=df_world['Fatalities'])
])

fig.update_layout(barmode='overlay', title='Global Cumulative Confirmed Cases and Fatalities')
fig.show()

In [None]:
print("Current global fatality rate: %.2f%%" 
      % (df_world['Fatalities'].max() 
         / df_world['ConfirmedCases'].max() * 100))

It seems that since approximately March 15, 2020, confirmed COVID-19 cases have been growing exponentially while fatalities have been increasing more steadily. This could suggest a relatively low death rate, or it could indicate a significant lag between catching the disease and dying from it.
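
One rough way to probe the lag hypothesis is to divide cumulative fatalities by the cumulative case count from some days earlier, rather than by the same day's count. The sketch below assumes a hypothetical helper `lagged_fatality_ratio` and an arbitrary `lag_days` value; the series are placeholders, and the result is an illustration rather than an epidemiological estimate.

```python
import pandas as pd

def lagged_fatality_ratio(cases, fatalities, lag_days):
    """Divide cumulative fatalities by the cumulative cases observed lag_days rows earlier."""
    return fatalities / cases.shift(lag_days)

# Placeholder cumulative series indexed by day; not real data.
dates = pd.date_range("2020-03-01", periods=6)
toy_cases = pd.Series([100, 200, 400, 800, 1600, 3200], index=dates)
toy_deaths = pd.Series([0, 1, 2, 4, 8, 16], index=dates)

naive = toy_deaths / toy_cases
lagged = lagged_fatality_ratio(toy_cases, toy_deaths, lag_days=2)

print(pd.DataFrame({"naive": naive, "lagged_2d": lagged}))
```

When cases are growing quickly, the same-day ratio sits well below the lagged one, which is the pattern the lag explanation would predict.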

To explore this further, let's have a look at the daily confirmed cases and fatalities alongside the growth rate.

In [None]:
def filter_down(df, filtered_df, parent_col, parent_val, child_cols, val_col, suffix):
    """
    Recursively filters a dataframe one column at a time; once no filter
    columns remain, writes the day-on-day difference of val_col into df.
    
    Arguments:
        df: dataframe that receives the new val_col+suffix column
        filtered_df: subset of df selected so far
        parent_col: column to filter on at this level
        parent_val: value of parent_col to keep
        child_cols: remaining columns to filter on, in order
        val_col: cumulative column to difference
        suffix: appended to val_col to name the new column
    """
    filtered_df = filtered_df[filtered_df[parent_col] == parent_val]
    if len(child_cols):
        new_parent_col = child_cols[0]
        unique_filter_vals = filtered_df[new_parent_col].unique()
        for value in unique_filter_vals:
            filter_down(df, filtered_df, new_parent_col, value, child_cols[1:], val_col, suffix)
    else:
        # rows arrive sorted by date, so diff() gives the daily increase;
        # the first row of each group has no predecessor and is set to 0
        df.loc[filtered_df.index, val_col+suffix] = filtered_df[val_col].diff().fillna(0)

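def discrete_growth(df, date_col, filter_cols, cum_sum_col, suffix="_discrete"):
    """
    Determines the discrete (day-on-day) growth within each combination
    of the filter columns.
    
    Arguments:
        df: source dataframe (not modified)
        date_col: name of the date column
        filter_cols: grouping columns, from broadest to narrowest
        cum_sum_col: cumulative column to difference
        suffix: appended to cum_sum_col to name the new column
    """
    new_df = df.copy()
    new_df = new_df.sort_values(by=filter_cols+[date_col])
    # split without mutating the caller's list
    column, child_cols = filter_cols[0], filter_cols[1:]
    unique_filter_vals = new_df[column].unique()

    for value in unique_filter_vals:
        filter_down(new_df, new_df, column, value, child_cols, cum_sum_col, suffix)
    
    return new_df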

In [None]:
dg_df = BASE_DFS[TRAIN].copy()
dg_df = discrete_growth(dg_df, "Date", ["continent", "Country_Region", "Province_State"], "ConfirmedCases")
dg_df = discrete_growth(dg_df, "Date", ["continent", "Country_Region", "Province_State"], "Fatalities")
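
The same per-group daily difference can also be obtained without recursion. The sketch below is an alternative, not the approach used in this notebook: `discrete_growth_vectorised` is a hypothetical helper built on pandas `groupby(...).diff()`, shown on placeholder data. Because `groupby` drops rows whose key is missing by default, it relies on `Province_State` having had its NaN values replaced, as was done during preprocessing.

```python
import pandas as pd

def discrete_growth_vectorised(df, date_col, group_cols, cum_col, suffix="_discrete"):
    """Add a per-group daily difference of a cumulative column (first day of each group is 0)."""
    out = df.sort_values(by=group_cols + [date_col]).copy()
    out[cum_col + suffix] = out.groupby(group_cols)[cum_col].diff().fillna(0)
    return out

# Placeholder data: two regions, three days each.
toy = pd.DataFrame({
    "Country_Region": ["A"] * 3 + ["B"] * 3,
    "Date": pd.to_datetime(["2020-03-01", "2020-03-02", "2020-03-03"] * 2),
    "ConfirmedCases": [1, 3, 6, 10, 10, 15],
})

result = discrete_growth_vectorised(toy, "Date", ["Country_Region"], "ConfirmedCases")
print(result["ConfirmedCases_discrete"].tolist())  # [0.0, 2.0, 3.0, 0.0, 0.0, 5.0]
```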

In [None]:
dg_df_world = dg_df.groupby(['Date']).sum(numeric_only=True)
dates = list(dg_df_world.index)
prev_date = dates.pop(0)

# calculate growth rates
for date in dates:
    dg_df_world.loc[date, "ConfirmedCases_GrowthRate"] \
            = dg_df_world.loc[date, "ConfirmedCases_discrete"] / dg_df_world.loc[prev_date, "ConfirmedCases"]
    dg_df_world.loc[date, "Fatalities_GrowthRate"] \
            = dg_df_world.loc[date, "Fatalities_discrete"] / dg_df_world.loc[prev_date, "Fatalities"]
    prev_date = date

fig = go.Figure(data=[
    go.Bar(name='Cases', x=dg_df_world.index, 
           y=dg_df_world['ConfirmedCases_discrete'], yaxis="y1", opacity=0.5),
    go.Bar(name='Fatalities', x=dg_df_world.index, 
           y=dg_df_world['Fatalities_discrete'], yaxis="y1", opacity=0.5),
    go.Scatter(name="Cases_GrowthRate", x=dg_df_world.index, mode="lines",
               y=dg_df_world["ConfirmedCases_GrowthRate"], yaxis="y2", line_color="forestgreen"),
    go.Scatter(name="Fatalities_GrowthRate", x=dg_df_world.index, mode="lines",
               y=dg_df_world["Fatalities_GrowthRate"], yaxis="y2", line_color="crimson")
])

fig.update_layout(barmode='overlay', title='Global Daily Confirmed Cases and Fatalities',
                 yaxis=dict(title="Cases and Fatalities"),
                 yaxis2=dict(title="Growth Rate %", overlaying='y',side='right'),
                 yaxis2_tickformat = '%')
fig.show()

The graph of `Global Daily Confirmed Cases and Fatalities` shows a clear increase in the number of cases as time has gone on. However, there has been a fairly significant 'levelling' of the growth rate since earlier in the outbreak. The reassuring factor about this graph is the growth rate of infection is clearly not increasing globally at this stage. This would suggest that the measures governments are taking around the world are reducing the number of new infections.

When we forecast, it may be useful to use linear regression on the growth rate of each region then apply this to the confirmed cases.
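
As a rough illustration of that idea, the sketch below fits a straight line to the daily growth rate $g_t = (C_t - C_{t-1}) / C_{t-1}$ and then rolls the cumulative count forward with $C_{t+1} = C_t (1 + g_{t+1})$. The helper `project_cases` is hypothetical, the input is placeholder data, and the code assumes every cumulative value used as a denominator is non-zero.

```python
import numpy as np

def project_cases(cumulative, horizon):
    """Fit a straight line to the daily growth rate and roll cumulative cases forward."""
    cumulative = np.asarray(cumulative, dtype=float)
    growth = np.diff(cumulative) / cumulative[:-1]   # g_t = (C_t - C_{t-1}) / C_{t-1}
    t = np.arange(len(growth))
    slope, intercept = np.polyfit(t, growth, deg=1)  # least-squares line through g_t

    projected = []
    level = cumulative[-1]
    for step in range(1, horizon + 1):
        # Extrapolate the fitted line; clip at 0 because cumulative counts cannot fall.
        g = max(slope * (len(growth) - 1 + step) + intercept, 0.0)
        level = level * (1.0 + g)
        projected.append(level)
    return np.array(projected)

# Placeholder cumulative counts whose growth rate falls by 0.1 per day: 0.5, 0.4, 0.3
toy_cumulative = [100.0, 150.0, 210.0, 273.0]
print(project_cases(toy_cumulative, horizon=3))
```

Clipping the extrapolated rate at zero keeps the projection from decreasing, which a cumulative series should never do. A per-region forecast would apply the same function to each region's series separately.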

___

Now, let's have a look at how the infection has progressed globally over time.

In [None]:
map_df = BASE_DFS[TRAIN].copy()
map_df['Date'] = map_df['Date'].astype(str)
map_df = map_df.groupby(['Date', 'Country_Region', 'country_iso3'], 
                        as_index=False)[['ConfirmedCases', 'Fatalities']].sum()

map_df['ln(ConfirmedCases)'] = np.log(map_df.ConfirmedCases + 1)
map_df['ln(Fatalities)'] = np.log(map_df.Fatalities + 1)

In [None]:
px.choropleth(map_df, 
              locations="country_iso3", 
              color="ln(ConfirmedCases)", 
              hover_name="Country_Region", 
              hover_data=["ConfirmedCases"] ,
              animation_frame="Date",
              color_continuous_scale=px.colors.sequential.Purples, 
              title='Total Confirmed Cases (Ln Scale) by Country/Region Filtered by Date')

In [None]:
px.choropleth(map_df, 
              locations="country_iso3", 
              color="ln(Fatalities)", 
              hover_name="Country_Region", 
              hover_data=["Fatalities"] ,
              animation_frame="Date",
              color_continuous_scale=px.colors.sequential.OrRd, 
              title='Total Fatalities (Ln Scale) by Country/Region Filtered by Date')

What these choropleths show is that the initial outbreak in China was intense but was then controlled relatively well. However, once COVID-19 began spreading worldwide, the outbreak in other countries, such as those in Europe and America, was not controlled and spread far more intensely than the original outbreak in China.

Referring back to an earlier point, pinpointing why these particular countries have higher infection and death rates will allow us to forecast more accurately. Some suggestions as to what these factors might be:
- Poor access to healthcare could lead to a higher death rate;
- Delays in implementing social distancing measures;
- Daily testing rate;
- Population density.
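
Factors like these would have to come from external tables joined onto the case data, which is where the standardised ISO codes become useful. The sketch below shows one way such a join might look; the `extra` frame, its `population_density` column and all values are made up purely to demonstrate the mechanics.

```python
import pandas as pd

# Placeholder frames; the feature values are made up purely to show the join.
cases = pd.DataFrame({
    "country_iso3": ["AUS", "ITA", ""],
    "ConfirmedCases": [10, 20, 5],
})
extra = pd.DataFrame({
    "country_iso3": ["AUS", "ITA", "FRA"],
    "population_density": [1.0, 2.0, 3.0],  # hypothetical column, dummy values
})

# Left join keeps every row of the case data; `indicator` records where each row matched.
merged = cases.merge(extra, on="country_iso3", how="left", indicator=True)

# Rows that found no partner in the external table (e.g. an empty ISO code).
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched[["country_iso3", "ConfirmedCases"]])
```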

With that in mind, let's move on and take a deeper dive into some of the more notable places of infection.
___
### Infection Case Studies
<a id="inf_hot"></a>
Currently, each country is dealing with the COVID-19 outbreak differently. Understanding the effects of the management strategies some countries have put in place may therefore help us forecast the outbreak more effectively.

1. __China__: The source and initial outbreak (likely population density driven) with a high death rate (likely ageing population driven) but managed to control the outbreak effectively.
2. __United States__: An example of the largest outbreak in one country, with policy struggling to keep it under control.
3. __Australia__: A low number of total cases; however, as I have local knowledge of the policy, I can more accurately analyse the relationship between policy and transmission.
4. __Italy__: The first example outside of China with a major outbreak.
5. __France__: Low proportion of global cases, in comparison to proportion of deaths.

#### Case Study: Australia
<a id="cs_au"></a>

In [None]:
r = requests.get(url="https://raw.githubusercontent.com/rowanhogan/australian-states/master/states.geojson")
topology = r.json()

In [None]:
aus_df = BASE_DFS[TRAIN].copy()
# copy the filtered slice so the column assignments below do not hit a view
aus_df = aus_df[aus_df["Country_Region"] == "Australia"].copy()
aus_df['Date'] = aus_df['Date'].astype(str)

fig = px.choropleth(pd.DataFrame((aus_df.groupby(["Province_State"])).max()).reset_index(),
                    geojson=topology,
                    locations='Province_State',
                    featureidkey="properties.STATE_NAME",
                    color_continuous_scale=px.colors.sequential.matter,
                    hover_name='Province_State',
                    color='ConfirmedCases',
                    title='Australia: Total Cases per State by Date'
                   )
fig.update_geos(fitbounds="locations", visible=True)
fig.show()

In [None]:
px.line(aus_df, x='Date', y='ConfirmedCases', color='Province_State', 
        title='Australia: Total Cases by State and Date').show()
px.line(aus_df, x='Date', y='Fatalities', color='Province_State', 
        title='Australia: Total Fatalities by State and Date').show()

In [None]:
aus_df['ln(ConfirmedCases)'] = np.log(aus_df.ConfirmedCases + 1)
aus_df['ln(Fatalities)'] = np.log(aus_df.Fatalities + 1)

In [None]:
px.line(aus_df, x='Date', y='ln(ConfirmedCases)', color='Province_State', 
        title='Australia: Total ln(ConfirmedCases) by State and Date').show()
px.line(aus_df, x='Date', y='ln(Fatalities)', color='Province_State', 
        title='Australia: Total ln(Fatalities) by State and Date').show()

In [None]:
gth_aus_df = discrete_growth(aus_df, "Date", 
                             ["continent", "Country_Region", "Province_State"], 
                             "ConfirmedCases")
gth_aus_df = discrete_growth(gth_aus_df, "Date", 
                             ["continent", "Country_Region", "Province_State"], 
                             "Fatalities")

In [None]:
dg_df_world = gth_aus_df.groupby(['Date']).sum(numeric_only=True)
dates = list(dg_df_world.index)
prev_date = dates.pop(0)

# calculate growth rates
for date in dates:
    dg_df_world.loc[date, "ConfirmedCases_GrowthRate"] \
            = dg_df_world.loc[date, "ConfirmedCases_discrete"] / dg_df_world.loc[prev_date, "ConfirmedCases"]
    dg_df_world.loc[date, "Fatalities_GrowthRate"] \
            = dg_df_world.loc[date, "Fatalities_discrete"] / dg_df_world.loc[prev_date, "Fatalities"]
    prev_date = date

fig = go.Figure(data=[
    go.Bar(name='Cases', x=dg_df_world.index, 
           y=dg_df_world['ConfirmedCases_discrete'], yaxis="y1", opacity=0.5),
    go.Bar(name='Fatalities', x=dg_df_world.index, 
           y=dg_df_world['Fatalities_discrete'], yaxis="y1", opacity=0.5),
    go.Scatter(name="Cases_GrowthRate", x=dg_df_world.index, mode="lines",
               y=dg_df_world["ConfirmedCases_GrowthRate"], yaxis="y2", line_color="forestgreen"),
    go.Scatter(name="Fatalities_GrowthRate", x=dg_df_world.index, mode="lines",
               y=dg_df_world["Fatalities_GrowthRate"], yaxis="y2", line_color="crimson")
])

fig.update_layout(barmode='overlay', title='Australia: Daily Confirmed Cases and Fatalities',
                 yaxis=dict(title="Cases and Fatalities"),
                 yaxis2=dict(title="Growth Rate %", overlaying='y',side='right'),
                 yaxis2_tickformat = '%')
fig.show()

In [None]:
grped_aus_df = aus_df.groupby('Date')[['ConfirmedCases', 'Fatalities']].sum()

print("Current Australian fatality rate: %.2f%%" 
      % (grped_aus_df['Fatalities'].max() 
         / grped_aus_df['ConfirmedCases'].max() * 100))