# European Centre for Disease Prevention and Control Dataset

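This notebook processes ECDC's legacy "Historical daily dataset". ECDC discontinued daily updates from 14 December 2020 and switched to weekly reporting on 17 December 2020, as its announcement explains: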

> ECDC switched to a weekly reporting schedule for the COVID-19 situation worldwide and in the EU/EEA and the UK on 17 December this year. Hence, all daily updates have been discontinued from 14 December. ECDC will publish updates on the number of cases and deaths reported worldwide and aggregated by week every Thursday. The weekly data will be available as downloadable files in the following formats: XLSX, CSV, JSON and XML. As an exception, the weekly updates for the end-of-year festive season will be published on 23 December and 30 December 2020.

For data after 17 December 2020, use the weekly table, which is also included in the data share.


In [None]:
import pandas as pd
import datetime
import pycountry
import re
import os
import numpy as np

In [None]:
# papermill parameters
output_folder = "../output/"

### Fetch data

In [None]:
df = pd.read_excel("https://www.ecdc.europa.eu/sites/default/files/documents/COVID-19-geographic-disbtribution-worldwide-2020-12-14.xlsx",
                  engine='openpyxl')

### Parse date

In [None]:
df["dateRep"] = pd.to_datetime(df["dateRep"], format="%d/%m/%Y")
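
The explicit format string implies that `dateRep` is written day-first. The sketch below uses made-up sample strings (not taken from the ECDC file) to illustrate why that matters for ambiguous dates such as `01/12/2020`:

```python
import pandas as pd

# Made-up sample strings in the day-first layout implied by "%d/%m/%Y".
sample = pd.Series(["14/12/2020", "01/12/2020"])

# With an explicit format, "01/12/2020" is read as 1 December, not 12 January.
parsed = pd.to_datetime(sample, format="%d/%m/%Y")
print(parsed.dt.strftime("%Y-%m-%d").tolist())
```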

### Add difference

In [None]:
# Sort by date first so diff() always compares a row with the previous day.
by_country = df.sort_values("dateRep").groupby(['countriesAndTerritories','continentExp'])
df['CASES_SINCE_PREV_DAY'] = by_country['cases'].diff().fillna(0).astype(int)
df['DEATHS_SINCE_PREV_DAY'] = by_country['deaths'].diff().fillna(0).astype(int)
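
A minimal sketch of a per-group day-over-day difference, on hypothetical toy data (the `country` and `change` column names are illustrative only). It assumes rows can arrive in any date order, so it sorts by date before calling `diff()` and relies on pandas index alignment to write the result back in the original row order:

```python
import pandas as pd

# Hypothetical toy data: two countries, dates deliberately out of order.
toy = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B"],
    "dateRep": pd.to_datetime(["2020-12-03", "2020-12-01", "2020-12-02",
                               "2020-12-02", "2020-12-01"]),
    "cases": [30, 10, 25, 7, 5],
})

# diff() works on row order, so sort by date within the pipeline; the
# resulting Series keeps the original index and aligns on assignment.
toy["change"] = (
    toy.sort_values("dateRep")
       .groupby("country")["cases"]
       .diff()
       .fillna(0)      # first day of each country has no previous row
       .astype(int)
)
print(toy)
```

The first reported day of each group has no predecessor, which is why its difference is filled with 0.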

### Drop cols

In [None]:
df = df.drop(columns=["day", "month", "year", "countryterritoryCode"])

In [None]:
# Boolean mask for the "international conveyance" rows reported under Japan.
int_conveyance = df["geoId"] == "JPG11668"
# Assign through df.loc: chained df[col].iloc[...] = ... may write to a copy.
df.loc[int_conveyance, ["geoId", "popData2019", "continentExp"]] = np.nan
df.loc[int_conveyance, "countriesAndTerritories"] = "Cases on an international conveyance Japan"

### Resolve Country/Region name

In [None]:
country_codes = df["geoId"].dropna().unique()
for code in country_codes:
    rows = df["geoId"] == code
    try:
        pyc = pycountry.countries.get(alpha_2=code)
        if pyc:
            df.loc[rows, "countriesAndTerritories"] = pyc.name
    except LookupError:
        df.loc[rows, "countriesAndTerritories"] = None

### Set Last Update Date and Last Reported Flag

In [None]:
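# Naive UTC timestamp; datetime.utcnow() is deprecated since Python 3.12.
df["LAST_UPDATE_DATE"] = datetime.datetime.now(datetime.timezone.utc).replace(tzinfo=None)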
df["LAST_REPORTED_FLAG"] = df["dateRep"].max() == df["dateRep"]
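
Note that the flag compares every row with the single latest date in the whole table, not with the latest date per country. A minimal sketch on hypothetical toy data (the `AA` and `BB` codes are made up) shows the consequence: a country whose last report is older than the global maximum has no flagged row at all.

```python
import pandas as pd

# Hypothetical toy data: "BB" stops reporting one day before "AA".
toy = pd.DataFrame({
    "geoId": ["AA", "AA", "BB", "BB"],
    "dateRep": pd.to_datetime(["2020-12-13", "2020-12-14",
                               "2020-12-12", "2020-12-13"]),
})

# Same expression as in the notebook: compare with the global maximum date.
toy["LAST_REPORTED_FLAG"] = toy["dateRep"].max() == toy["dateRep"]

# Does each country have at least one flagged row?
per_country = toy.groupby("geoId")["LAST_REPORTED_FLAG"].any()
print(toy)
print(per_country)
```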

### Rename Cols

In [None]:
df = df.rename(columns={
    "dateRep": "DATE",
    "countriesAndTerritories": "COUNTRY_REGION",
    "geoId": "ISO3166_1",
    # No "popData2018" entry: this file carries popData2019, which is saved under its own name.
})

### Save dataframe

In [None]:
df.to_csv(output_folder + "ECDC_GLOBAL.csv", index=False, columns=[
    "COUNTRY_REGION",
    "continentExp",
    "ISO3166_1",
    "cases",
    "deaths",
    "CASES_SINCE_PREV_DAY",
    "DEATHS_SINCE_PREV_DAY",
    "popData2019",
    "DATE",
    "LAST_UPDATE_DATE",
    "LAST_REPORTED_FLAG"
])
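
The `columns=` argument of `to_csv` both selects and orders the exported columns. A minimal sketch on a hypothetical toy frame, writing to an in-memory buffer instead of `output_folder`, shows one way to check the written layout by reading it back:

```python
import io

import pandas as pd

# Hypothetical toy frame with one column that should not be exported.
toy = pd.DataFrame({"b": [1, 2], "a": ["x", "y"], "unused": [0.1, 0.2]})

# Write only the listed columns, in the listed order, without the index.
buffer = io.StringIO()
toy.to_csv(buffer, index=False, columns=["a", "b"])

# Read the CSV text back to confirm the header and row count.
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored)
```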