# COVID-19 Reported cases/death by country (Timeseries Clustering)
## Exploration notebook
### Final Project, BFH BZG1314a, HS20
Author: Alexandre Moeri  
Date: 09.11.2020  
Programming Language: Python  
Data Source: [WHO](https://covid19.who.int/WHO-COVID-19-global-data.csv)

## Setting up the environment

In [None]:
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.figure_factory as ff

## Importing the data 

In [None]:
data = pd.read_csv('/kaggle/input/who-covid-19-global-data/WHO-COVID-19-global-data.csv')
data.info()

## Exploring and cleaning the data

General structure of the data

In [None]:
data.head()

In [None]:
data.tail()

Check and fix keys (remove leading/trailing spaces)

In [None]:
data.keys()

In [None]:
data = data.rename(columns={' Country_code': 'Country_code', ' Country': 'Country', ' WHO_region': 'WHO_region', ' New_cases': 'New_cases', ' Cumulative_cases': 'Cumulative_cases', ' New_deaths': 'New_deaths', ' Cumulative_deaths': 'Cumulative_deaths'})

In [None]:
data.keys()

Earliest and latest report date

In [None]:
data['Date_reported'].min()

In [None]:
data['Date_reported'].max()

Check if series for each country are of equal length

In [None]:
if (data.groupby(['Country_code']).count().max().values == data.groupby(['Country_code']).count().min().values).all():
    print('All series are of equal length')
else:
    print('All series are *NOT* of equal length')

Checking for null values: data.info() has shown us that there are some null values in the column Country_code.

In [None]:
data[pd.isnull(data['Country_code'])]

It seems the country code for Namibia is missing. The correct country code is NA, which pandas reads as a missing value by default, so let's fill it in.
Since this is the only source of NaN in our dataset, we can just replace all of them with the country code (super hacky thing to do).

In [None]:
data = data.fillna('NA')
data.info()
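A likely cause of the missing code is the parser rather than the file: by default `pd.read_csv` treats the literal string `NA` as a missing value. The following is a minimal sketch on a made-up two-row CSV (not the WHO file) showing how `keep_default_na=False` keeps the string intact. Note that this switch disables the default NaN detection for every column, so it may not suit files that contain genuinely empty fields.

```python
import io
import pandas as pd

# Toy CSV standing in for the WHO file (illustrative values only)
toy_csv = io.StringIO(
    "Date_reported,Country_code,New_cases\n"
    "2020-03-14,NA,2\n"
    "2020-03-14,CH,100\n"
)

# Default parsing: the string 'NA' is read as NaN
default_parse = pd.read_csv(toy_csv)
print(default_parse['Country_code'].isna().tolist())

# keep_default_na=False keeps 'NA' as a regular string
toy_csv.seek(0)
kept = pd.read_csv(toy_csv, keep_default_na=False)
print(kept['Country_code'].tolist())
```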

Let's focus on new cases for now and get rid of the columns we don't require.

In [None]:
new_cases = data[['Date_reported', 'Country_code', 'New_cases']].rename(columns={'Date_reported': 'Date', 'Country_code': 'Country', 'New_cases': 'Cases'})
new_cases.info()

Let's have a first visual look at our data.

In [None]:
px.line(new_cases, x='Date', y='Cases', color='Country')

The following first observations can be made:
1. The number of cases before March 2020 is relatively low in non-Asian countries, especially when excluding China. This is obvious, since this is where the first cases were discovered. This suggests that we might want to address that to achieve useful clustering, for example with DTW.
2. There seem to be several occurrences of negative 'New Cases' entries. Those are most likely artifacts of data cleaning and certainly not a representation of the real world. This must be rectified so that we do not have values below 0 in our series.
3. Clearly the magnitude of cases is influenced by a country's population. It would make sense to normalize the data, either by looking at new cases / population or potentially at the relative rise or fall of new cases.
4. There seems to be clear seasonality, most likely based on weekdays (especially weekdays vs. weekends). We might consider decomposing the series to account mainly for trends(?)
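One simple way to damp the weekly pattern from observation 4 is a 7-day rolling mean, since every window then covers each weekday exactly once. The sketch below uses a made-up series with a weekend dip; the column name `Cases_7d` is only illustrative, and a full seasonal decomposition would be a different technique.

```python
import pandas as pd

# Toy daily series for one hypothetical country with a weekend dip
toy = pd.DataFrame({
    'Date': pd.date_range('2020-06-01', periods=14, freq='D'),
    'Cases': [100, 110, 105, 115, 120, 40, 30] * 2,
})

# Centered 7-day rolling mean: each window contains every weekday once,
# so the weekday/weekend pattern averages out
toy['Cases_7d'] = toy['Cases'].rolling(window=7, center=True).mean()
print(toy)
```

The first and last three days have no complete window and stay NaN, which shortens the usable series slightly.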

### Getting rid of values below zero

First we need to understand what the values below zero mean to develop a strategy for replacing them

In [None]:
data.where(data['New_cases']<0).dropna()

In [None]:
data.where(data['New_cases']<0).dropna().min()

In [None]:
data.where(data['New_cases']<0).dropna().sort_values(by=['New_cases'])[:20]

In [None]:
data.where(data['New_cases']==-8261).dropna()

In [None]:
# combine the conditions into one boolean mask instead of chaining filters
data[(data['Country_code']=='EC') & (data['Date_reported'] > '2020-09-01') & (data['Date_reported'] < '2020-09-10')]
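To see how the two WHO columns relate, here is a small sketch on made-up numbers. It assumes (not verified against WHO documentation) that a negative daily entry is a retroactive downward revision: the cumulative series then simply steps down once, and the daily series can be recovered as its first difference.

```python
import pandas as pd

# Toy series (made-up numbers): the third day carries a downward revision
toy = pd.DataFrame({
    'Date_reported': pd.date_range('2020-09-05', periods=5, freq='D'),
    'New_cases': [900, 1100, -8000, 1000, 950],
})
# Hypothetical starting level of 50000 cumulative cases
toy['Cumulative_cases'] = 50000 + toy['New_cases'].cumsum()

# The daily series equals the first difference of the cumulative one
recovered = toy['Cumulative_cases'].diff()
print(toy.assign(Recovered=recovered))
```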

Based on our findings we shall consider cumulative cases instead of new cases.

In [None]:
cases = data[['Date_reported', 'Country_code', 'Cumulative_cases']].rename(columns={'Date_reported': 'Date', 'Country_code': 'Country', 'Cumulative_cases': 'Cases'})
cases.info()

In [None]:
px.line(cases, x='Date', y='Cases', color='Country')

To normalize our data by population, let's get population numbers.

In [None]:
data = pd.read_csv('../input/countryinfo/covid19countryinfo.csv')
pop = data[['alpha2code', 'pop']].rename(columns={'alpha2code': 'Country', 'pop': 'Population'})
pop.head()
# China has duplicate entries, probably because territories are listed separately; ignore those, as the WHO data has only one CN
pop[pop['Country']=='CN']
pop = pop[pop['Population']!='19,116,201']
# Congo has a duplicate entry too
pop[pop['Country']=='CG']
pop = pop[pop['Population']!='5,518,087']


In [None]:
# Countries BL, MF, NC, PF, SX do not have population data -> remove them
pop = pop[~pop['Country'].isin(['BL', 'MF', 'NC', 'PF', 'SX'])]

In [None]:
cases = pd.merge(cases, pop, on="Country")
cases

In [None]:
cases['Population'] = pd.to_numeric(cases['Population'].str.replace(',',''))

In [None]:
# cases per inhabitant (vectorized, no row-wise apply needed)
cases['Cases'] = cases['Cases'] / cases['Population']
cases[['Date', 'Country', 'Cases']]
px.line(cases, x='Date', y='Cases', color='Country')

## First attempt at clustering the data

In [None]:
cases_wide = cases.pivot(index='Country', columns='Date', values='Cases')
from scipy.spatial.distance import squareform, pdist
# pivot() sorts its index, so take the labels from cases_wide instead of cases.Country.unique()
euclidian_distance_matrix = pd.DataFrame(squareform(pdist(cases_wide, metric='euclidean')), columns=cases_wide.index, index=cases_wide.index)
px.imshow(euclidian_distance_matrix)
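Labels for a distance matrix built from a pivoted table have to follow the row order of that table. A short sketch on toy data (hypothetical values) shows that `pivot` sorts its index, while `unique()` returns values in order of first appearance, so the two can disagree:

```python
import pandas as pd

# Toy long-format frame; countries deliberately appear in non-alphabetical order
toy = pd.DataFrame({
    'Country': ['US', 'US', 'CH', 'CH', 'IT', 'IT'],
    'Date': ['2020-03-01', '2020-03-02'] * 3,
    'Cases': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})
toy_wide = toy.pivot(index='Country', columns='Date', values='Cases')

pivot_order = list(toy_wide.index)               # row order of the wide table
appearance_order = list(toy['Country'].unique()) # order of first appearance
print(pivot_order)
print(appearance_order)
```

Using the wide table's own index as labels avoids having to reason about the order at all.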

In [None]:
from scipy.cluster.hierarchy import dendrogram, average
hierarchical_euclidian_cluster = average(pdist(cases_wide, metric='euclidean'))
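# create_dendrogram expects the observations (not a linkage matrix) and computes euclidean distances itself
fig = ff.create_dendrogram(cases_wide.values, labels=list(cases_wide.index), linkagefun=average)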
fig.update_layout(width=800, height=500)
fig.show()

In [None]:
from scipy.cluster.hierarchy import cut_tree
# label the clusters with the index of cases_wide, which is the row order pdist worked on
hierarchical_euclidian_cluster_10 = pd.DataFrame(cut_tree(hierarchical_euclidian_cluster, n_clusters=10), columns=['Cluster'], index=cases_wide.index)
cases_hec10 = cases.join(hierarchical_euclidian_cluster_10, on='Country', how='left')
px.line(cases_hec10, x='Date', y='Cases', color='Country', facet_col='Cluster', facet_col_wrap=4)

In [None]:
!pip install tslearn
from tslearn.utils import to_time_series_dataset
from tslearn.clustering import TimeSeriesKMeans
# the labels follow the row order of cases_wide, so index them with cases_wide.index
dtw_cluster_10 = pd.DataFrame(TimeSeriesKMeans(n_clusters=10, metric='dtw').fit(to_time_series_dataset(cases_wide.values)).labels_, columns=['Cluster'], index=cases_wide.index)
cases_dtw10 = cases.join(dtw_cluster_10, on='Country', how='left')
px.line(cases_dtw10, x='Date', y='Cases', color='Country', facet_col='Cluster', facet_col_wrap=4)
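For reference, DTW compares two series after an elastic alignment of their time axes. The sketch below is a hand-written textbook version in NumPy, not tslearn's implementation, and the two curves are made up. It only illustrates how DTW and the euclidean distance can differ for curves of the same shape with different onsets; how much of this carries over to the real data depends on the levels of the curves as well as their timing.

```python
import numpy as np

def dtw_distance(a, b):
    """Textbook dynamic-programming DTW with squared local cost."""
    n, m = len(a), len(b)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            # extend the cheapest of the three admissible predecessor alignments
            acc[i, j] = cost + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return np.sqrt(acc[n, m])

# Two toy curves with the same shape, one delayed by three steps
early = np.array([0, 0, 1, 2, 3, 3, 3, 3, 3, 3], dtype=float)
late = np.array([0, 0, 0, 0, 0, 1, 2, 3, 3, 3], dtype=float)

euclidean = np.linalg.norm(early - late)
dtw = dtw_distance(early, late)
print('euclidean:', euclidean)
print('dtw:', dtw)
```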

It seems the main cluster feature is when the number of cases started rising. We should thus further refine our data and replace absolute dates by "days after the first X reported cases". Our time series will then no longer have the same length, so we should shorten them to a common length and maybe consider dropping a few countries where reporting only started very late.

Let's set the threshold at 1 case per 100,000 population.

In [None]:
start_dates = cases[cases['Cases']>1/100000].groupby(['Country'], as_index=False).min()[['Country', 'Date']].rename(columns={"Date": "Start"})
cases_start = pd.merge(cases, start_dates, on="Country")

# days elapsed since the country crossed the threshold (negative before the start date)
cases_start['Day'] = pd.to_datetime(cases_start['Date']) - pd.to_datetime(cases_start['Start'])



In [None]:
# filter out days before the start date (negative offsets)
cases_day = cases_start[cases_start['Day']>=np.timedelta64(0, 'D')][['Country', 'Cases', 'Day']]
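The alignment step can be checked on a tiny example. The sketch below uses two hypothetical countries (`AA`, `BB`) with made-up per-capita values and stores the offset as an integer number of days instead of a timedelta, which is an assumption made here for readability and not what the notebook does.

```python
import pandas as pd

# Toy per-capita cumulative series for two hypothetical countries
toy = pd.DataFrame({
    'Country': ['AA'] * 4 + ['BB'] * 4,
    'Date': pd.to_datetime(['2020-03-01', '2020-03-02', '2020-03-03', '2020-03-04'] * 2),
    'Cases': [0.0, 2e-5, 4e-5, 8e-5, 0.0, 0.0, 0.0, 3e-5],
})
threshold = 1 / 100000

# First date on which each country exceeds the threshold
start = toy[toy['Cases'] > threshold].groupby('Country')['Date'].min().rename('Start')
aligned = toy.join(start, on='Country')

# Integer day offset relative to the start date; negative offsets precede the threshold
aligned['Day'] = (aligned['Date'] - aligned['Start']).dt.days
aligned = aligned[aligned['Day'] >= 0]
print(aligned[['Country', 'Day', 'Cases']])
```

After alignment the series differ in length (three days for `AA`, one for `BB`), which is why a common cutoff is needed before clustering.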

In [None]:
# let's find the countries with the fewest available days
cases_day.groupby('Country').count().sort_values(by=['Day']).head(20)

It seems 250 days would be a reasonable cutoff, so we will ignore VN, KH, CG, SY, MZ, UG, ET and ZW.

In [None]:
# drop the countries with fewer than 250 available days
cases_day = cases_day[~cases_day['Country'].isin(['VN', 'KH', 'CG', 'SY', 'MZ', 'UG', 'ET', 'ZW'])]
cases_day.groupby('Country').count().sort_values(by=['Day']).head(20)

In [None]:
cases_day.shape

In [None]:
cases_day.groupby('Country').count().shape

In [None]:
# now drop all days beyond day 250
cases_day = cases_day[cases_day['Day']<=np.timedelta64(250, 'D')]

In [None]:
px.line(cases_day, x='Day', y='Cases', color='Country')

In [None]:
cases_day_wide = cases_day.pivot(index='Country', columns='Day', values='Cases')
# label rows and columns with the (sorted) index of the pivoted table
euclidian_distance_matrix = pd.DataFrame(squareform(pdist(cases_day_wide, metric='euclidean')), columns=cases_day_wide.index, index=cases_day_wide.index)
px.imshow(euclidian_distance_matrix)

In [None]:
hierarchical_euclidian_cluster = average(pdist(cases_day_wide, metric='euclidean'))
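# create_dendrogram expects the observations (not a linkage matrix) and computes euclidean distances itself
fig = ff.create_dendrogram(cases_day_wide.values, labels=list(cases_day_wide.index), linkagefun=average)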
fig.update_layout(width=800, height=500)
fig.show()

In [None]:
# label the clusters with the index of cases_day_wide, which is the row order pdist worked on
hierarchical_euclidian_cluster_10 = pd.DataFrame(cut_tree(hierarchical_euclidian_cluster, n_clusters=10), columns=['Cluster'], index=cases_day_wide.index)
cases_day_hec10 = cases_day.join(hierarchical_euclidian_cluster_10, on='Country', how='left')
px.line(cases_day_hec10, x='Day', y='Cases', color='Country', facet_col='Cluster', facet_col_wrap=4)

In [None]:
# the labels follow the row order of cases_day_wide, so index them with cases_day_wide.index
dtw_cluster_10 = pd.DataFrame(TimeSeriesKMeans(n_clusters=10, metric='dtw').fit(to_time_series_dataset(cases_day_wide.values)).labels_, columns=['Cluster'], index=cases_day_wide.index)
cases_dtw10 = cases_day.join(dtw_cluster_10, on='Country', how='left')
px.line(cases_dtw10, x='Day', y='Cases', color='Country', facet_col='Cluster', facet_col_wrap=4)

In [None]:
case_clusters = cases_dtw10.groupby(['Country'], as_index=False).min()[['Country', 'Cluster']]
case_clusters

In [None]:
# re-enrich our cluster data with WHO information
who_data = pd.read_csv('/kaggle/input/who-covid-19-global-data/WHO-COVID-19-global-data.csv')
# strip stray whitespace from the column names so the rename below matches
who_data.columns = who_data.columns.str.strip()
who_data = who_data.rename(columns={'Country_code': 'Country', 'Country': 'Name', 'WHO_region': 'Region'})
# Namibia's code NA is parsed as NaN again, so restore it before grouping
who_data['Country'] = who_data['Country'].fillna('NA')
who_data = who_data[['Country', 'Name', 'Region']].groupby('Country', as_index=False).min()
clusters = pd.merge(case_clusters, who_data, on='Country')
clusters

In [None]:
clusters.groupby(['Cluster', 'Region']).count()
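A cross-tabulation can be an easier way to read the same cluster-by-region counts, because combinations that do not occur show up as explicit zeros. This is a sketch on made-up assignments (hypothetical country codes and labels, not results from this notebook):

```python
import pandas as pd

# Toy cluster assignments (made-up labels, not results from this notebook)
toy_clusters = pd.DataFrame({
    'Country': ['AA', 'BB', 'CC', 'DD', 'EE'],
    'Cluster': [0, 0, 1, 1, 1],
    'Region': ['EURO', 'EURO', 'EURO', 'AMRO', 'AMRO'],
})

# Rows: clusters, columns: regions, cells: number of countries
table = pd.crosstab(toy_clusters['Cluster'], toy_clusters['Region'])
print(table)
```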

In [None]:
clusters[clusters['Country']=='CH']

In [None]:
clusters[clusters['Cluster']==0]

In [None]:
clusters[clusters['Country']=='US']

In [None]:
clusters[clusters['Cluster']==7]

In [None]:
clusters[clusters['Country']=='IT']

In [None]:
clusters[clusters['Cluster']==2]

Let's try to look only at European countries in search of more meaningful results.

In [None]:
cases_day_euro = pd.merge(cases_day, who_data, on='Country')
cases_day_euro = cases_day_euro[cases_day_euro['Region']=='EURO']
cases_day_euro

In [None]:
px.line(cases_day_euro, x='Day', y='Cases', color='Country')

In [None]:
cases_day_euro_wide = cases_day_euro.pivot(index='Country', columns='Day', values='Cases')
# label rows and columns with the (sorted) index of the pivoted table
euclidian_distance_matrix = pd.DataFrame(squareform(pdist(cases_day_euro_wide, metric='euclidean')), columns=cases_day_euro_wide.index, index=cases_day_euro_wide.index)
px.imshow(euclidian_distance_matrix)

In [None]:
hierarchical_euclidian_cluster = average(pdist(cases_day_euro_wide, metric='euclidean'))
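# create_dendrogram expects the observations (not a linkage matrix) and computes euclidean distances itself
fig = ff.create_dendrogram(cases_day_euro_wide.values, labels=list(cases_day_euro_wide.index), linkagefun=average)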
fig.update_layout(width=800, height=500)
fig.show()

In [None]:
# six clusters this time; label them with the index of cases_day_euro_wide
hierarchical_euclidian_cluster_6 = pd.DataFrame(cut_tree(hierarchical_euclidian_cluster, n_clusters=6), columns=['Cluster'], index=cases_day_euro_wide.index)
cases_day_euro_hec6 = cases_day_euro.join(hierarchical_euclidian_cluster_6, on='Country', how='left')
px.line(cases_day_euro_hec6, x='Day', y='Cases', color='Country', facet_col='Cluster', facet_col_wrap=4)

In [None]:
# the labels follow the row order of cases_day_euro_wide, so index them with cases_day_euro_wide.index
dtw_cluster_10 = pd.DataFrame(TimeSeriesKMeans(n_clusters=6, metric='dtw').fit(to_time_series_dataset(cases_day_euro_wide.values)).labels_, columns=['Cluster'], index=cases_day_euro_wide.index)
cases_dtw10 = cases_day_euro.join(dtw_cluster_10, on='Country', how='left')
px.line(cases_dtw10, x='Day', y='Cases', color='Country', facet_col='Cluster', facet_col_wrap=4)

In [None]:
case_clusters_euro = cases_dtw10.groupby(['Country'], as_index=False).min()[['Name', 'Cluster']]
case_clusters_euro

In [None]:
case_clusters_euro[case_clusters_euro['Name']=='Switzerland']

In [None]:
case_clusters_euro[case_clusters_euro['Cluster']==0]