# Dataset for worldwide distribution of COVID-19 cases

This notebook downloads the dataset on the worldwide geographic distribution of COVID-19 cases from the European Centre for Disease Prevention and Control (ECDC) and converts it into a Pandas dataframe.

I considered making a dataset of this, but since the data is updated daily, a static copy would quickly go stale. Instead, this notebook illustrates how to download the current version of the dataset during execution, convert it into a Pandas dataframe, and perform some analysis on it.

The dataset could further be combined with other country or COVID-19 related data. Here I combine it with a dataset about countries and continents to map the data across continents.

The dataset is described here:
https://www.ecdc.europa.eu/en/publications-data/download-todays-data-geographic-distribution-covid-19-cases-worldwide

In [None]:

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import os


# Get the Dataset

## Download the file

In [None]:
import requests

URL = "https://opendata.ecdc.europa.eu/covid19/casedistribution/csv"
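r = requests.get(URL, timeout=30)  # the 30 s timeout is an arbitrary choice
r.raise_for_status()  # stop here if the download failed instead of parsing an error page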

## Read the data into dataframe

In [None]:
from io import StringIO

csv_io = StringIO(r.text)
df = pd.read_csv(csv_io)

The download and dataframe conversion above are all you really need to use the dataset, but let's see what it looks like.

In [None]:
df.head()
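As a small offline illustration of the `StringIO` step above, here is a sketch that parses a hand-written CSV string the same way. The rows are made up for illustration and only mimic a few of the ECDC column names; `sample_text` and `sample_df` are hypothetical names.

```python
from io import StringIO

import pandas as pd

# Made-up sample rows (not real ECDC figures), using a few of the real column names.
sample_text = (
    "dateRep,cases,deaths,countriesAndTerritories\n"
    "02/04/2020,10,1,Norway\n"
    "01/04/2020,7,0,Norway\n"
)

# StringIO wraps the string in a file-like object, which pd.read_csv accepts
# just like an open file.
sample_df = pd.read_csv(StringIO(sample_text))
print(sample_df.dtypes)
```

Note that `dateRep` comes out as plain strings at this point, which is why the date conversion is needed.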

Convert the date string column to Pandas datetime for processing.

In [None]:
df["dateRep"] = pd.to_datetime(df['dateRep'], format='%d/%m/%Y')
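Giving the format explicitly matters here because the dates are day-first. A minimal sketch with two hand-written date strings (not taken from the dataset) shows how `%d/%m/%Y` reads them; `dates` and `parsed` are hypothetical names.

```python
import pandas as pd

# Two hand-written day-first dates: 2 April 2020 and 13 April 2020.
dates = pd.Series(["02/04/2020", "13/04/2020"])

# With an explicit format there is no guessing about day/month order.
parsed = pd.to_datetime(dates, format="%d/%m/%Y")
print(parsed)
```

Both values end up in April, so sorting by the parsed column gives true chronological order.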

# Look at Data

What countries does it have?

In [None]:
df["countriesAndTerritories"].unique()

In [None]:
df.dtypes

### Take 3 Nordic Countries and Compare

In [None]:
df_norway = df[df["countriesAndTerritories"] == "Norway"].sort_values(by="dateRep")[["dateRep", "cases", "deaths"]]
df_sweden = df[df["countriesAndTerritories"] == "Sweden"].sort_values(by="dateRep")[["dateRep", "cases", "deaths"]]
df_finland = df[df["countriesAndTerritories"] == "Finland"].sort_values(by="dateRep")[["dateRep", "cases", "deaths"]]


In [None]:
df_norway.set_index('dateRep', inplace=True)
df_sweden.set_index('dateRep', inplace=True)
df_finland.set_index('dateRep', inplace=True)
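If you would rather compare the countries in one table than in three separate dataframes, a pivot is one option. This is only a sketch on made-up numbers; `sample`, `nordics` and `wide` are hypothetical names, and only two countries are included to keep it short.

```python
import pandas as pd

# Made-up daily case counts (not real ECDC figures).
sample = pd.DataFrame({
    "dateRep": pd.to_datetime(["2020-04-01", "2020-04-01", "2020-04-02", "2020-04-02"]),
    "countriesAndTerritories": ["Norway", "Sweden", "Norway", "Sweden"],
    "cases": [5, 8, 7, 12],
})

nordics = ["Norway", "Sweden"]

# One row per date, one column per country.
wide = (
    sample[sample["countriesAndTerritories"].isin(nordics)]
    .pivot(index="dateRep", columns="countriesAndTerritories", values="cases")
    .sort_index()
)
print(wide)
```

Calling `wide.cumsum()` would then give the cumulative counts for all countries at once.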


Function to plot cases and cumulative cases side-by-side:

In [None]:
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
%matplotlib inline

def dual_plot(col_name, df, type_name):
    plt.rcParams.update({'font.size': 22})
    plt.figure(figsize=(20,12))

    ax = plt.subplot(2, 2, 1)
    plt.xticks(rotation=45) #have to set rotate here before labels created
    locator = mdates.DayLocator(interval=15)
    ax.xaxis.set_major_locator(locator)
    ax.xaxis.set_major_formatter(mdates.DateFormatter('%d/%m'))
    plt.plot(df.index, df[col_name], 'c-')
    plt.title(f'{type_name}s per Day')
    plt.ylabel(f'{type_name}s')
    plt.xlabel('Date')

    ax = plt.subplot(2, 2, 2)
    plt.xticks(rotation=45) #have to set rotate here before labels created
    locator = mdates.DayLocator(interval=15)
    ax.xaxis.set_major_locator(locator)
    ax.xaxis.set_major_formatter(mdates.DateFormatter('%d/%m'))
    plt.plot(df.index, df[col_name].cumsum(), 'r-')
    plt.xlabel('Date')
    plt.ylabel(f'{type_name}s')
    plt.title(f'{type_name}s, cumulative')

    plt.show()

In [None]:
dual_plot("cases", df_norway, "Case")

In [None]:
dual_plot("cases", df_sweden, "Case")

In [None]:
dual_plot("cases", df_finland, "Case")

# Combine with Another Kaggle Dataset (Continents)

In [None]:
!ls /kaggle/input/country-to-continent

In [None]:
df_countries = pd.read_csv("/kaggle/input/country-to-continent/countryContinent.csv", encoding="iso-8859-1")
df_countries.head(10)

If you compare the two datasets, you see the continent dataset is missing Kosovo, which is in the ECDC daily dataset. So just add it:

In [None]:
set(df["countryterritoryCode"]) - set(df_countries["code_3"])

In [None]:
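# DataFrame.append returned a new frame instead of modifying df_countries (and was removed in pandas 2.0), so concatenate and assign the result back
kosovo = pd.DataFrame([{"country": "Kosovo", "code_2": "XK", "code_3": "XKX", "continent": "Europe", "sub_region": "Southern Europe", "region_code": 150, "sub_region_code": 39}])
df_countries = pd.concat([df_countries, kosovo], ignore_index=True)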

In [None]:
df[df["countryterritoryCode"] == "XKX"].head()

Merge the daily dataset now with the continent dataset.

In [None]:
df_m = pd.merge(df, df_countries, left_on='countryterritoryCode', right_on='code_3')
df_m.head(20)
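`pd.merge` does an inner join by default, so rows whose country code has no match in the continent dataset are silently dropped. One way to check for that is a left join with `indicator=True`. The sketch below uses two tiny made-up frames (`cases` and `lookup` are hypothetical names) with the same key columns as above.

```python
import pandas as pd

# Tiny made-up stand-ins for the daily data and the continent lookup.
cases = pd.DataFrame({
    "countryterritoryCode": ["NOR", "XKX"],
    "cases": [5, 3],
})
lookup = pd.DataFrame({
    "code_3": ["NOR"],
    "continent": ["Europe"],
})

# A left join keeps every daily row; the _merge column says where each row matched.
checked = pd.merge(
    cases, lookup, how="left",
    left_on="countryterritoryCode", right_on="code_3",
    indicator=True,
)
unmatched = checked.loc[checked["_merge"] == "left_only", "countryterritoryCode"].tolist()
print(unmatched)
```

Any code that shows up in `unmatched` would have disappeared from an inner merge.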

# Create Cumulative Counts per Continent

Here I create cumulative counts for the different continents. Someone could probably write this as a fancy groupby one-liner, but I decided to do it the simple way.

In [None]:
continents = df_m["continent"].unique()
continents

In [None]:
#df_sorted_cumulative = pd.DataFrame()
#for country in countries:
#    df_country = df_m[df_m["country"] == country].sort_values(by="dateRep")[["country", "continent", "sub_region", "dateRep", "cases", "deaths"]]
#    df_country["cum_cases"] = df_country["cases"].cumsum()
#    df_country["cum_deaths"] = df_country["deaths"].cumsum()
#    df_sorted_cumulative = pd.concat([df_sorted_cumulative, df_country], axis=0)


In [None]:
#df_sorted_cumulative.head(100)

First create cumulative counts per country, sorted so that each continent has its countries in a single sequence. This makes it easier to merge continent-level cumulative stats back in later, if desired.

In [None]:
df_sorted_cumulative = pd.DataFrame()
for continent in continents:
    continent_countries = df_m[df_m["continent"] == continent]["country"].unique()
    for country in continent_countries:
        df_country = df_m[df_m["country"] == country].sort_values(by="dateRep")[["country", "continent", "sub_region", "dateRep", "cases", "deaths"]]
        df_country["cum_cases"] = df_country["cases"].cumsum()
        df_country["cum_deaths"] = df_country["deaths"].cumsum()
        df_sorted_cumulative = pd.concat([df_sorted_cumulative, df_country], axis=0)


In [None]:
df_sorted_cumulative.head(100)
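For comparison, the per-country running totals from the loop above can probably also be had with a grouped `cumsum`. This is a sketch on made-up numbers, not a drop-in replacement tested against the full dataset; `sample` is a hypothetical name.

```python
import pandas as pd

# Made-up daily counts (not real ECDC figures), deliberately out of date order.
sample = pd.DataFrame({
    "country": ["Norway", "Sweden", "Norway", "Sweden"],
    "dateRep": pd.to_datetime(["2020-04-02", "2020-04-01", "2020-04-01", "2020-04-02"]),
    "cases": [7, 8, 5, 12],
})

# Sort so each country's rows are in date order, then accumulate within each country.
sample = sample.sort_values(["country", "dateRep"]).reset_index(drop=True)
sample["cum_cases"] = sample.groupby("country")["cases"].cumsum()
print(sample)
```

Adding `"continent"` as the first sort key would keep the countries of each continent together, as the loop does.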

### Global Statistics

Use the above per-country, per-day data to calculate worldwide statistics:

In [None]:
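# sum only the numeric columns; adding up the text columns (country, continent, ...) is not meaningful
df_total_cumulative = df_sorted_cumulative.groupby("dateRep")[["cases", "deaths", "cum_cases", "cum_deaths"]].sum()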

In [None]:
df_total_cumulative.tail()

If you compare the above final row to the [daily WHO statistics](https://www.who.int/emergencies/diseases/novel-coronavirus-2019/situation-reports), it should be about the same.

## Create Dataframes for Continents

Now calculate cumulative sums per continent from the country data:

In [None]:
df_sorted_cumulative_continent = pd.DataFrame()
for continent in continents:
    df_continent = df_sorted_cumulative[df_sorted_cumulative["continent"] == continent]
    df_continent = df_continent.sort_values(by="dateRep")
    df_continent["cum_cases"] = df_continent["cases"].cumsum()
    df_continent["cum_deaths"] = df_continent["deaths"].cumsum()
    df_sorted_cumulative_continent = pd.concat([df_sorted_cumulative_continent, df_continent], axis=0)


In [None]:
df_sorted_cumulative_continent.tail(10)

So we just need to take the last row per day from the above dataframe to get the cumulative count per continent.

For example, what does Asia look like?

In [None]:
df_sorted_cumulative_continent[df_sorted_cumulative_continent["continent"] == "Asia"].tail()

In [None]:
continents

There are only 5 continents listed, so I just take each separately:

In [None]:
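# Sum the daily counts over all countries of the continent: a plain max() would keep only the largest single-country value per day.
# The running totals computed above are taken as the maximum per day.
agg_spec = {"cases": "sum", "deaths": "sum", "cum_cases": "max", "cum_deaths": "max"}
df_asia = df_sorted_cumulative_continent[df_sorted_cumulative_continent["continent"] == "Asia"].groupby("dateRep").agg(agg_spec)
df_africa = df_sorted_cumulative_continent[df_sorted_cumulative_continent["continent"] == "Africa"].groupby("dateRep").agg(agg_spec)
df_europe = df_sorted_cumulative_continent[df_sorted_cumulative_continent["continent"] == "Europe"].groupby("dateRep").agg(agg_spec)
df_americas = df_sorted_cumulative_continent[df_sorted_cumulative_continent["continent"] == "Americas"].groupby("dateRep").agg(agg_spec)
df_oceania = df_sorted_cumulative_continent[df_sorted_cumulative_continent["continent"] == "Oceania"].groupby("dateRep").agg(agg_spec)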
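The five near-identical lines above could also be collapsed into a dictionary keyed by continent name. The sketch below shows the idea on made-up numbers: it sums daily cases per continent and date, then accumulates within each continent. `sample`, `daily` and `by_continent` are hypothetical names.

```python
import pandas as pd

# Made-up daily counts (not real ECDC figures).
sample = pd.DataFrame({
    "continent": ["Europe", "Europe", "Europe", "Asia", "Asia"],
    "country": ["Norway", "Sweden", "Norway", "Japan", "Japan"],
    "dateRep": pd.to_datetime(
        ["2020-04-01", "2020-04-01", "2020-04-02", "2020-04-01", "2020-04-02"]
    ),
    "cases": [5, 8, 7, 3, 4],
})

# Total daily cases per continent and date (groupby sorts by both keys).
daily = sample.groupby(["continent", "dateRep"])["cases"].sum().reset_index()

# Running total within each continent.
daily["cum_cases"] = daily.groupby("continent")["cases"].cumsum()

# One date-indexed dataframe per continent, looked up by name.
by_continent = {
    name: frame.set_index("dateRep") for name, frame in daily.groupby("continent")
}
print(by_continent["Europe"])
```

A frame such as `by_continent["Europe"]` has the same shape of index that `dual_plot` expects.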

You could further split these according to the sub-region in the continents dataset; let's see about that later.

# Cases and Deaths by Continent / Globally

## Global

In [None]:
dual_plot("cases", df_total_cumulative, "Case")

In [None]:
dual_plot("deaths", df_total_cumulative, "Death")

## Europe

In [None]:
dual_plot("cases", df_europe, "Case")

In [None]:
dual_plot("deaths", df_europe, "Death")

## Americas

It would be quite useful to split this into North and South America at least, but let's see.

In [None]:
dual_plot("cases", df_americas, "Case")

In [None]:
dual_plot("deaths", df_americas, "Death")

## Africa

In [None]:
dual_plot("cases", df_africa, "Case")

In [None]:
dual_plot("deaths", df_africa, "Death")

## Oceania

In [None]:
dual_plot("cases", df_oceania, "Case")

In [None]:
dual_plot("deaths", df_oceania, "Death")

In [None]:
#Possibly deeper splits
df_countries["sub_region"].unique()

To be continued, but you already get the idea of how to use such a dataset.