# Daily data download and first processing

This notebook is part of a toolset to analyse and visualise data on the COVID-19 epidemic and is meant to be run once a day.

It downloads the daily updated datasheet published by the ECDC, which contains the number of new COVID-19 cases and deaths per country, courtesy of https://ourworldindata.org/coronavirus-source-data .

The data is downloaded as an Excel sheet and split into three datasets: new cases per country per date, new deaths per country per date, and the 2018 population of each country.
I will comment on what happens at each step and regularly print what we're working with for transparency.

These datasets will be further processed and visualised in the other notebooks contained in this directory.

### Data import

First we'll have to download the ECDC datasheet.
For now this can only be done locally on my machine, though the data will be saved in the data/ directory in this repository.

In [1]:
import urllib.request               # Calls url to download daily file
import urllib.error                 # Exception raised when the download fails
import datetime                     # Provides current date

try:
    today = datetime.date.today()
    url = 'https://www.ecdc.europa.eu/sites/default/files/documents/COVID-19-geographic-disbtribution-worldwide-' + str(today) + '.xlsx'

    # Second argument in urlretrieve is the destination and filename for saving
    _ = urllib.request.urlretrieve(url, 'data/daily_data.xlsx')

except urllib.error.URLError:
    # Today's file could not be downloaded, so fall back to yesterday's
    yesterday = datetime.date.today() - datetime.timedelta(days=1)
    url = 'https://www.ecdc.europa.eu/sites/default/files/documents/COVID-19-geographic-disbtribution-worldwide-' + str(yesterday) + '.xlsx'

    # Second argument in urlretrieve is the destination and filename for saving
    _ = urllib.request.urlretrieve(url, 'data/daily_data.xlsx')
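
The fallback above repeats the same download for two dates. As a rough sketch, the same idea can be written as a loop over candidate dates. The names `BASE_URL`, `build_url`, `download_latest` and the `max_days_back` parameter are hypothetical helpers introduced for illustration only; the URL pattern is copied unchanged from the cell above.

```python
import datetime
import urllib.error
import urllib.request

# URL pattern copied from the cell above (spelling kept exactly as used there).
BASE_URL = ('https://www.ecdc.europa.eu/sites/default/files/documents/'
            'COVID-19-geographic-disbtribution-worldwide-')


def build_url(date):
    """Return the datasheet URL for a datetime.date (hypothetical helper)."""
    return BASE_URL + str(date) + '.xlsx'


def download_latest(destination='data/daily_data.xlsx', max_days_back=1):
    """Try today's file first, then step back one day at a time."""
    today = datetime.date.today()
    for days_back in range(max_days_back + 1):
        date = today - datetime.timedelta(days=days_back)
        try:
            urllib.request.urlretrieve(build_url(date), destination)
            return date  # date of the file that was actually saved
        except urllib.error.URLError:
            continue     # download failed for this date: try the previous day
    raise RuntimeError('No datasheet could be downloaded')
```

Calling `download_latest()` with the default `max_days_back=1` would try today and then yesterday, like the cell above, and returns the date of the file it saved.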

Next we'll open the downloaded .xlsx file as a pandas dataframe so we can easily manipulate the data.

In [2]:
import pandas as pd                # Tool to easily manipulate data

df = pd.read_excel('data/daily_data.xlsx')
df

,dateRep,day,month,year,cases,deaths,countriesAndTerritories,geoId,countryterritoryCode,popData2018
0,2020-03-30,30,3,2020,8,1,Afghanistan,AF,AFG,37172386.0
1,2020-03-29,29,3,2020,15,1,Afghanistan,AF,AFG,37172386.0
2,2020-03-28,28,3,2020,16,1,Afghanistan,AF,AFG,37172386.0
3,2020-03-27,27,3,2020,0,0,Afghanistan,AF,AFG,37172386.0
4,2020-03-26,26,3,2020,33,0,Afghanistan,AF,AFG,37172386.0
...,...,...,...,...,...,...,...,...,...,...
7705,2020-03-25,25,3,2020,0,0,Zimbabwe,ZW,ZWE,14439018.0
7706,2020-03-24,24,3,2020,0,1,Zimbabwe,ZW,ZWE,14439018.0
7707,2020-03-23,23,3,2020,0,0,Zimbabwe,ZW,ZWE,14439018.0
7708,2020-03-22,22,3,2020,1,0,Zimbabwe,ZW,ZWE,14439018.0


### Data processing

As a first step we will clean up the data a bit, since it has quite a few columns that we do not need.

Since the 'dateRep' column already contains a datetime that Python can interpret, we will delete the redundant day, month and year columns, as well as the geoId and country code columns, since I will not be needing them yet.
This can be updated at a later time.

In [3]:
df = df.drop(columns=['day', 'month', 'year', 'geoId', 'countryterritoryCode'])
display(df)

,dateRep,cases,deaths,countriesAndTerritories,popData2018
0,2020-03-30,8,1,Afghanistan,37172386.0
1,2020-03-29,15,1,Afghanistan,37172386.0
2,2020-03-28,16,1,Afghanistan,37172386.0
3,2020-03-27,0,0,Afghanistan,37172386.0
4,2020-03-26,33,0,Afghanistan,37172386.0
...,...,...,...,...,...
7705,2020-03-25,0,0,Zimbabwe,14439018.0
7706,2020-03-24,0,1,Zimbabwe,14439018.0
7707,2020-03-23,0,0,Zimbabwe,14439018.0
7708,2020-03-22,1,0,Zimbabwe,14439018.0
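
To illustrate why the day, month and year columns can be dropped safely, here is a small sketch on a toy frame. It assumes 'dateRep' is parsed as a datetime column, and `toy`, `redundant` and `slim` are names made up for this example.

```python
import pandas as pd

# Toy frame using the datasheet's column names (two example rows only).
toy = pd.DataFrame({
    'dateRep': pd.to_datetime(['2020-03-30', '2020-03-29']),
    'day': [30, 29],
    'month': [3, 3],
    'year': [2020, 2020],
    'cases': [8, 15],
})

# The separate columns carry no information that 'dateRep' lacks ...
redundant = (
    (toy['dateRep'].dt.day == toy['day'])
    & (toy['dateRep'].dt.month == toy['month'])
    & (toy['dateRep'].dt.year == toy['year'])
).all()

# ... so they can be dropped and recomputed from 'dateRep' if ever needed.
slim = toy.drop(columns=['day', 'month', 'year'])
```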


Next, to make the following step a little easier, we will create two lists: one with every date, ordered from oldest to most recent, and another with every country found in the dataset.

In [4]:
dates = []
countries = []

for entry in df['dateRep']:
    if entry not in dates:
        dates.append(entry)

for entry in df['countriesAndTerritories']:
    if entry not in countries:
        countries.append(entry)
        
dates.sort()
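
The loops above collect unique values by hand. A possible shorter variant, shown here on a toy frame as a sketch rather than a replacement, uses `drop_duplicates()`, which keeps the first occurrence of each value in order of appearance. The frame `toy` is made up for this example.

```python
import pandas as pd

# Toy frame with the two columns the lists are built from.
toy = pd.DataFrame({
    'dateRep': pd.to_datetime(
        ['2020-03-30', '2020-03-29', '2020-03-30', '2020-03-29']),
    'countriesAndTerritories': [
        'Afghanistan', 'Afghanistan', 'Zimbabwe', 'Zimbabwe'],
})

# Unique dates, sorted from oldest to most recent.
dates = sorted(toy['dateRep'].drop_duplicates().tolist())

# Unique countries, kept in the order they first appear.
countries = toy['countriesAndTerritories'].drop_duplicates().tolist()
```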

Now for the actual data extraction. 

To create the structure of our dataframes we will make a list of tuples. Each tuple holds a column name (in this case 'Date' or a country name) followed by a list with the values for that column: the dates themselves, or the cases or deaths of that country.

We will always follow the order of the dates list we created before when filling the lists of country values, so they all line up. Dates for which a country has no entry get a 0.

At the end we convert those structure lists into dictionaries to be able to use pandas on them and convert them into dataframes.

In [5]:
struct_cases = [('Date', dates)]
struct_deaths = [('Date', dates)]
struct_populations = [('Countries', []),('Populations', [])]
i = 1

for country in countries:
    country_pop = df.loc[df['countriesAndTerritories']==country]['popData2018'].values[0]
    struct_populations[0][1].append(country)
    struct_populations[1][1].append(country_pop)
    
    struct_cases.append((country, []))
    struct_deaths.append((country, []))
    
    country_df = df.loc[df['countriesAndTerritories']==country]
    collected_dates = country_df['dateRep'].tolist()
    
    for date in dates:
        if date in collected_dates:
            entry = country_df.loc[country_df['dateRep']==date]
            struct_cases[i][1].append(entry['cases'].values[0])
            struct_deaths[i][1].append(entry['deaths'].values[0])
            
        else:
            struct_cases[i][1].append(0)
            struct_deaths[i][1].append(0)
    
    i += 1

dict_cases = {title: content for (title, content) in struct_cases}
dict_deaths = {title: content for (title, content) in struct_deaths}
dict_populations = {title: content for (title, content) in struct_populations}
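
For comparison, pandas can do this long-to-wide reshaping in one call. The sketch below uses `pivot_table` on a toy frame; it is an alternative technique, not the method used in this notebook, and it differs in two ways: columns come out sorted alphabetically rather than in order of appearance, and duplicate (date, country) rows would be summed instead of taking the first one. The names `toy` and `wide_cases` are made up for this example.

```python
import pandas as pd

# Toy long-format data: Zimbabwe has no row for 2020-03-29.
toy = pd.DataFrame({
    'dateRep': pd.to_datetime(['2020-03-29', '2020-03-30', '2020-03-30']),
    'countriesAndTerritories': ['Afghanistan', 'Afghanistan', 'Zimbabwe'],
    'cases': [15, 8, 1],
})

# One row per date, one column per country, missing combinations set to 0.
wide_cases = toy.pivot_table(
    index='dateRep',
    columns='countriesAndTerritories',
    values='cases',
    aggfunc='sum',
    fill_value=0,
)
wide_cases.index.name = 'Date'
```

The same call with `values='deaths'` would give the deaths table.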

Now we're finally ready to create the dataframes for both new cases and new deaths.
Any cells that are still empty because no data was provided are filled with 0.

In [6]:
df_cases_new = pd.DataFrame(dict_cases).set_index('Date')
df_cases_new = df_cases_new.fillna(0)

display(df_cases_new)

Date,Afghanistan,Angola,Albania,Andorra,Antigua_and_Barbuda,Algeria,Anguilla,Argentina,Armenia,Aruba,...,United_Kingdom,United_Republic_of_Tanzania,United_States_of_America,United_States_Virgin_Islands,Uruguay,Uzbekistan,Venezuela,Vietnam,Zambia,Zimbabwe
2019-12-31,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2020-01-01,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2020-01-02,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2020-01-03,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2020-01-04,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2020-03-26,33,0,23,24,0,33,0,115,25,2,...,1452,0,13963,0,28,10,15,14,9,1
2020-03-27,0,1,28,36,4,41,2,87,39,9,...,2129,1,16797,0,21,18,1,5,2,0
2020-03-28,16,1,12,43,0,0,0,101,43,0,...,2885,0,18695,2,0,21,12,16,2,2
2020-03-29,15,0,11,41,0,104,0,55,52,0,...,2546,0,19979,3,66,29,0,54,12,2


In [7]:
df_deaths_new = pd.DataFrame(dict_deaths).set_index('Date')
df_deaths_new = df_deaths_new.fillna(0)

display(df_deaths_new)

Date,Afghanistan,Angola,Albania,Andorra,Antigua_and_Barbuda,Algeria,Anguilla,Argentina,Armenia,Aruba,...,United_Kingdom,United_Republic_of_Tanzania,United_States_of_America,United_States_Virgin_Islands,Uruguay,Uzbekistan,Venezuela,Vietnam,Zambia,Zimbabwe
2019-12-31,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2020-01-01,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2020-01-02,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2020-01-03,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2020-01-04,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2020-03-26,0,0,0,0,0,0,0,2,0,0,...,41,0,249,0,0,0,0,0,0,0
2020-03-27,0,0,1,3,0,4,0,4,1,0,...,115,0,246,0,0,0,1,0,0,0
2020-03-28,1,0,3,0,0,0,0,5,0,0,...,181,0,411,0,0,0,0,0,0,0
2020-03-29,1,0,1,1,0,5,0,2,2,0,...,260,1,484,0,0,1,0,0,0,0


In [8]:
df_populations = pd.DataFrame(dict_populations).set_index('Countries')
df_populations = df_populations.fillna(0)

display(df_populations)

Countries,Populations
Afghanistan,37172386.0
Angola,30809762.0
Albania,2866376.0
Andorra,77006.0
Antigua_and_Barbuda,96286.0
...,...
Uzbekistan,32955400.0
Venezuela,28870195.0
Vietnam,95540395.0
Zambia,17351822.0
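
The population table takes the first 'popData2018' value found for each country. A compact sketch of the same idea, on a toy frame with made-up variable names (`toy`, `populations`), keeps the first row per country with `drop_duplicates`:

```python
import pandas as pd

# Toy frame: the population value repeats on every row of a country.
toy = pd.DataFrame({
    'countriesAndTerritories': ['Afghanistan', 'Afghanistan', 'Zimbabwe'],
    'popData2018': [37172386.0, 37172386.0, 14439018.0],
})

# Keep the first row per country, then shape it like the table above.
populations = (
    toy.drop_duplicates(subset='countriesAndTerritories')
       .set_index('countriesAndTerritories')['popData2018']
       .rename('Populations')
       .rename_axis('Countries')
       .to_frame()
)
```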


### Saving

With all the datasets created we can now save them to file, to be used in the other notebooks for further processing, visualisation and, later on, training networks and regression models.

The files are always saved under the same names in the data/ directory, so all other applications can simply be rerun whenever the data is updated.

In [9]:
df_cases_new.to_csv('data/cases_new.csv')
df_deaths_new.to_csv('data/deaths_new.csv')
df_populations.to_csv('data/populations_2018.csv')
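
As a sketch of how another notebook might load one of these files again, the example below writes a toy frame to a temporary directory and reads it back. The toy values and the temporary path are assumptions for illustration; in practice the path would be 'data/cases_new.csv'.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for df_cases_new (values are illustrative only).
cases = pd.DataFrame(
    {'Afghanistan': [15, 8], 'Zimbabwe': [2, 0]},
    index=pd.Index(pd.to_datetime(['2020-03-29', '2020-03-30']), name='Date'),
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'cases_new.csv')
    cases.to_csv(path)

    # to_csv writes the dates as text; index_col and parse_dates
    # restore them as a datetime index when reading the file back.
    reloaded = pd.read_csv(path, index_col='Date', parse_dates=True)
```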