# Daily data processing

This worksheet is part of a project to analyse and visualise data on the COVID-19 outbreak and will be run once a day.<br>
Though this can only be done locally on my machine, the data will be saved in ../data/ in this repository.

We will take the data downloaded and extracted by the other notebook in this directory and mold it into a few datasets that we will be working with. The content of these datasets will be explained as we create them, but here is an overview.

#### Outbreak spread:
    - Total cases, deaths
    - Weekly cases, deaths
    
#### Country impact:
    - Prevalence
    - Daily, weekly incidence
    
#### Other:
    - Mortality
    - Threshold dates

## Data import

Before we can do anything we'll have to load the three datasets created by the other worksheet, setting the index appropriately. Note that when saving the sets as .csv files, pandas saves the dates as strings. So to be able to compare them as dates, we need to tell pandas to convert that column to datetimes before we set it as the index.

To make the processing easier to read, we will also make two lists containing all the available dates and countries, and store the last available date.
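
As a possible alternative to converting the column after loading, `pd.read_csv` can parse the dates and set the index in one call through its `parse_dates` and `index_col` parameters. The sketch below is only an illustration: it reads a small in-memory CSV with made-up values instead of `../data/cases_daily.csv`, and the names `csv_text`, `df_sample`, `sample_dates`, `sample_countries` and `sample_latest` are hypothetical.

```python
import io
import pandas as pd

# Tiny stand-in for ../data/cases_daily.csv (illustrative values only).
csv_text = """Date,Global,Asia
2019-12-31,27,27
2020-01-01,0,0
2020-01-02,0,0
2020-01-03,17,17
"""

# parse_dates converts the 'Date' column to datetimes while reading;
# index_col makes that column the index in the same call.
df_sample = pd.read_csv(io.StringIO(csv_text), parse_dates=['Date'], index_col='Date')

sample_dates = df_sample.index        # all available dates
sample_countries = df_sample.columns  # all available countries
sample_latest = sample_dates[-1]      # last available date
```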

In [1]:
import pandas as pd
import datetime

df_cases_daily = pd.read_csv('../data/cases_daily.csv')
df_cases_daily['Date'] = df_cases_daily['Date'].astype('datetime64[ns]')
df_cases_daily = df_cases_daily.set_index('Date')

df_deaths_daily = pd.read_csv('../data/deaths_daily.csv')
df_deaths_daily['Date'] = df_deaths_daily['Date'].astype('datetime64[ns]') 
df_deaths_daily = df_deaths_daily.set_index('Date')

df_populations = pd.read_csv('../data/populations.csv').set_index('Ind')

dates_all = df_cases_daily.index
countries_all = df_cases_daily.columns

latest_date = dates_all[-1]

## Total

The first datasets we will create are the ones containing the total number of cases and deaths per day. This is easily done by looping through all the dates for a particular country and adding each day's value to a running counter.

Note that the structure of these datasets will be the same as the ones for daily cases and deaths: the same countries as columns and every day as index. So we can just copy those dataframes and overwrite the values instead of having to build new dictionaries to create our dataframes with.
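
The running total described above can also be sketched with pandas' built-in `DataFrame.cumsum`, which keeps the same index and columns as its input. This is a minimal illustration on made-up numbers, not a replacement for the cell that follows; `daily` and `total` are hypothetical names.

```python
import pandas as pd

dates = pd.to_datetime(['2019-12-31', '2020-01-01', '2020-01-02', '2020-01-03'])

# Illustrative daily counts for two columns.
daily = pd.DataFrame({'Global': [27, 0, 0, 17], 'Asia': [27, 0, 0, 17]}, index=dates)

# cumsum() adds up each column from the first row downwards,
# so every row holds the total up to and including that date.
total = daily.cumsum()
```

Because the result has the same shape as the daily dataframe, no copy-and-overwrite step is needed in this variant.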

In [2]:
df_cases_total = df_cases_daily.copy()
df_deaths_total = df_deaths_daily.copy()

for country in countries_all:
    cases = 0
    deaths = 0
    
    for date in dates_all:
        cases += df_cases_daily.at[date, country]
        deaths += df_deaths_daily.at[date, country]
        
        df_cases_total.at[date, country] = cases
        df_deaths_total.at[date, country] = deaths
        
display(df_cases_total)
display(df_deaths_total)

df_cases_total.to_csv('../data/cases_total.csv')
df_deaths_total.to_csv('../data/deaths_total.csv')

Date,Global,Africa,Asia,Europe,North America,Oceania,South America,Afghanistan,Albania,Algeria,...,United Kingdom,United Republic of Tanzania,United States Virgin Islands,United States of America,Uruguay,Uzbekistan,Venezuela,Vietnam,Zambia,Zimbabwe
2019-12-31,27,0,27,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2020-01-01,27,0,27,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2020-01-02,27,0,27,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2020-01-03,44,0,44,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2020-01-04,44,0,44,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2020-04-05,1174364,8489,213277,588988,332917,6731,23962,270,333,1300,...,41903,21,40,312237,400,266,144,240,39,9
2020-04-06,1245596,9175,223332,619493,360656,6847,26093,299,361,1320,...,47806,22,42,337635,406,342,148,241,39,9
2020-04-07,1316987,9803,232488,646772,392975,6981,27968,337,377,1423,...,51608,24,42,368196,415,397,159,245,39,9
2020-04-08,1391889,10525,241993,675720,425596,7136,30919,367,383,1468,...,55242,24,45,398809,424,504,166,251,39,10


Date,Global,Africa,Asia,Europe,North America,Oceania,South America,Afghanistan,Albania,Algeria,...,United Kingdom,United Republic of Tanzania,United States Virgin Islands,United States of America,Uruguay,Uzbekistan,Venezuela,Vietnam,Zambia,Zimbabwe
2019-12-31,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2020-01-01,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2020-01-02,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2020-01-03,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2020-01-04,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2020-04-05,64273,379,8290,45752,8996,40,816,5,19,130,...,4313,1,0,8501,5,2,5,0,1,1
2020-04-06,68928,434,8612,48703,10231,42,906,7,21,152,...,4934,1,1,9647,6,2,5,0,1,1
2020-04-07,74065,477,8890,51972,11659,48,1019,7,22,173,...,5373,1,1,10989,6,2,5,0,1,1
2020-04-08,81477,523,9204,56836,13664,52,1198,11,22,194,...,6159,1,1,12895,7,2,7,0,1,1


## Weekly

To create the datasets containing weekly cases and deaths, we will first need to create a list of dates, starting from the latest date we have and stepping back one week at a time for as long as the date seven days earlier still falls after the first available date.

When the list is created we will reverse it so the datasets we create will still be ordered from earliest to most recent.
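
For comparison, the same list of dates can be sketched with `pd.date_range`, which returns the dates in ascending order so no reversing is needed. This is a hedged sketch: the helper name `step_back_dates` and the example dates are assumptions for illustration, and the period count is chosen to reproduce the stopping condition described above.

```python
import pandas as pd

def step_back_dates(first_date, latest_date, days):
    """Dates ending at latest_date, `days` apart, oldest first.

    A date is included only if the date `days` earlier is still
    strictly after first_date.
    """
    span = (latest_date - first_date).days
    # Number of steps k >= 0 for which latest - (k + 1) * days > first_date.
    periods = max((span - 1) // days, 0)
    return pd.date_range(end=latest_date, periods=periods, freq=f'{days}D')

# Illustrative first and latest dates.
first = pd.Timestamp('2019-12-31')
latest = pd.Timestamp('2020-04-09')

weekly = step_back_dates(first, latest, 7)
```

The same helper with `days=3` would give the dates for a 3-day window.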

In [3]:
dates_weekly = []

start_date = latest_date
end_date = start_date - datetime.timedelta(days=7)

while end_date > df_cases_daily.index[0]:
    dates_weekly.append(start_date)
    
    start_date = end_date
    end_date = start_date - datetime.timedelta(days=7)

dates_weekly.reverse()

Now that we know which dates we need to look at we can create the datasets.
For each date we sum the values of the seven days ending on that date and divide by 7, so each entry is the average daily count for that week.

Since the structure will differ from the daily sets (there will be fewer rows), it is more readable to create them using dictionaries, as we saw in the other Data-Download notebook.
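
The windowed calculation can also be sketched without dictionaries by using `DataFrame.rolling`. This is an illustration on made-up counts, and it assumes the daily index has exactly one row per day, since `rolling(window=days)` counts rows rather than calendar days. The helper `window_average` and the names `daily`, `end_dates` and `weekly_avg` are hypothetical.

```python
import pandas as pd

def window_average(df_daily, end_dates, days):
    """Average daily value over the `days` rows ending on each date in end_dates."""
    # Each row of the rolling mean averages that row with the (days - 1) rows before it.
    rolling = df_daily.rolling(window=days).mean()
    # Keep only the rows for the requested end dates.
    return rolling.loc[end_dates]

dates = pd.date_range('2020-01-01', periods=10, freq='D')
daily = pd.DataFrame({'Global': range(10)}, index=dates)  # made-up counts 0..9

end_dates = [pd.Timestamp('2020-01-07'), pd.Timestamp('2020-01-10')]
weekly_avg = window_average(daily, end_dates, days=7)
```

Passing `days=3` with a matching list of end dates would give the corresponding 3-day averages.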

In [4]:
dict_cases_weekly = {'Date': dates_weekly}
dict_deaths_weekly = {'Date': dates_weekly}

for country in countries_all:
    cases_weekly = []
    deaths_weekly = []
    
    for date in dates_weekly:
        end_date = date - datetime.timedelta(days=6)
        
        cases_weekly.append(df_cases_daily.loc[end_date:date, country].sum() / 7)
        deaths_weekly.append(df_deaths_daily.loc[end_date:date, country].sum() / 7)
    
    dict_cases_weekly[country] = cases_weekly
    dict_deaths_weekly[country] = deaths_weekly

df_cases_weekly = pd.DataFrame(dict_cases_weekly).set_index('Date')
df_deaths_weekly = pd.DataFrame(dict_deaths_weekly).set_index('Date')

display(df_cases_weekly)
display(df_deaths_weekly)

df_cases_weekly.to_csv('../data/cases_weekly.csv')
df_deaths_weekly.to_csv('../data/deaths_weekly.csv')

Date,Global,Africa,Asia,Europe,North America,Oceania,South America,Afghanistan,Albania,Algeria,...,United Kingdom,United Republic of Tanzania,United States Virgin Islands,United States of America,Uruguay,Uzbekistan,Venezuela,Vietnam,Zambia,Zimbabwe
2020-01-09,4.571429,0.0,4.571429,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2020-01-16,0.285714,0.0,0.285714,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2020-01-23,81.428571,0.0,81.285714,0.0,0.142857,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.142857,0.0,0.0,0.0,0.0,0.0,0.0
2020-01-30,1027.428571,0.0,1024.142857,1.428571,1.0,0.857143,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.571429,0.0,0.0,0.0,0.285714,0.0,0.0
2020-02-06,2921.428571,0.0,2916.571429,2.571429,1.285714,1.0,0.0,0.0,0.0,0.0,...,0.285714,0.0,0.0,1.0,0.0,0.0,0.0,1.142857,0.0,0.0
2020-02-13,4579.285714,0.0,4576.142857,2.428571,0.571429,0.142857,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.285714,0.0,0.0,0.0,0.857143,0.0,0.0
2020-02-20,2199.285714,0.142857,2198.571429,0.142857,0.285714,0.142857,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.142857,0.0,0.0,0.0,0.0,0.0,0.0
2020-02-27,911.142857,0.142857,840.714286,62.142857,6.857143,1.142857,0.142857,0.142857,0.0,0.142857,...,0.571429,0.0,0.0,6.285714,0.0,0.0,0.0,0.0,0.0,0.0
2020-03-05,1887.857143,2.571429,1316.142857,544.0,18.285714,4.571429,2.285714,0.0,0.0,1.571429,...,10.285714,0.0,0.0,14.285714,0.0,0.0,0.0,0.0,0.0,0.0
2020-03-12,4311.571429,14.0,1422.714286,2663.571429,179.142857,10.857143,21.285714,0.857143,1.571429,1.142857,...,53.0,0.0,0.0,164.714286,0.0,0.0,0.0,3.285714,0.0,0.0


Date,Global,Africa,Asia,Europe,North America,Oceania,South America,Afghanistan,Albania,Algeria,...,United Kingdom,United Republic of Tanzania,United States Virgin Islands,United States of America,Uruguay,Uzbekistan,Venezuela,Vietnam,Zambia,Zimbabwe
2020-01-09,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2020-01-16,0.285714,0.0,0.285714,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2020-01-23,2.142857,0.0,2.142857,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2020-01-30,21.857143,0.0,21.857143,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2020-02-06,56.428571,0.0,56.428571,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2020-02-13,115.0,0.0,115.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2020-02-20,108.285714,0.0,108.142857,0.142857,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2020-02-27,96.0,0.0,94.142857,1.857143,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2020-03-05,68.857143,0.0,52.857143,14.142857,1.571429,0.285714,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.571429,0.0,0.0,0.0,0.0,0.0,0.0
2020-03-12,190.857143,0.285714,67.571429,119.714286,3.0,0.142857,0.142857,0.0,0.142857,0.0,...,0.857143,0.0,0.0,2.714286,0.0,0.0,0.0,0.0,0.0,0.0


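## 3-day average

The same procedure is repeated with a 3-day window: we step back from the latest date three days at a time and, for each date, average the daily values of the three days ending on that date.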

In [5]:
dates_3days = []

start_date = latest_date
end_date = start_date - datetime.timedelta(days=3)

while end_date > df_cases_daily.index[0]:
    dates_3days.append(start_date)
    
    start_date = end_date
    end_date = start_date - datetime.timedelta(days=3)

dates_3days.reverse()

dict_cases_3days = {'Date': dates_3days}
dict_deaths_3days = {'Date': dates_3days}

for country in countries_all:
    cases_3days = []
    deaths_3days = []
    
    for date in dates_3days:
        end_date = date - datetime.timedelta(days=2)
        
        cases_3days.append(df_cases_daily.loc[end_date:date, country].sum() / 3)
        deaths_3days.append(df_deaths_daily.loc[end_date:date, country].sum() / 3)
    
    dict_cases_3days[country] = cases_3days
    dict_deaths_3days[country] = deaths_3days

df_cases_3days = pd.DataFrame(dict_cases_3days).set_index('Date')
df_deaths_3days = pd.DataFrame(dict_deaths_3days).set_index('Date')

display(df_cases_3days)
display(df_deaths_3days)

df_cases_3days.to_csv('../data/cases_3days.csv')
df_deaths_3days.to_csv('../data/deaths_3days.csv')

Date,Global,Africa,Asia,Europe,North America,Oceania,South America,Afghanistan,Albania,Algeria,...,United Kingdom,United Republic of Tanzania,United States Virgin Islands,United States of America,Uruguay,Uzbekistan,Venezuela,Vietnam,Zambia,Zimbabwe
2020-01-04,5.666667,0.0,5.666667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2020-01-07,5.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2020-01-10,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2020-01-13,0.333333,0.0,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2020-01-16,0.333333,0.0,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2020-01-19,52.666667,0.0,52.666667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2020-01-22,105.0,0.0,104.666667,0.0,0.333333,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.333333,0.0,0.0,0.0,0.0,0.0,0.0
2020-01-25,272.0,0.0,270.333333,1.0,0.333333,0.333333,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.333333,0.0,0.0,0.0,0.666667,0.0,0.0
2020-01-28,1079.0,0.0,1076.0,0.333333,1.666667,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
2020-01-31,1746.333333,0.0,1740.333333,4.333333,0.666667,1.0,0.0,0.0,0.0,0.0,...,0.666667,0.0,0.0,0.333333,0.0,0.0,0.0,1.0,0.0,0.0


Date,Global,Africa,Asia,Europe,North America,Oceania,South America,Afghanistan,Albania,Algeria,...,United Kingdom,United Republic of Tanzania,United States Virgin Islands,United States of America,Uruguay,Uzbekistan,Venezuela,Vietnam,Zambia,Zimbabwe
2020-01-04,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2020-01-07,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2020-01-10,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2020-01-13,0.333333,0.0,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2020-01-16,0.333333,0.0,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2020-01-19,0.333333,0.0,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2020-01-22,4.666667,0.0,4.666667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2020-01-25,8.0,0.0,8.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2020-01-28,21.666667,0.0,21.666667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2020-01-31,35.666667,0.0,35.666667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Mortality

Next we will create the dataset containing the mortality rates. Again, since the structure doesn't change, we will copy an existing dataframe and overwrite it.

Note that we need to change the value type of the cells to float, since they were read as int when loading the file, which would cause all values to become 0. We also need to verify that there are any cases, so as not to divide by 0.

- mortality = total deaths / total cases
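
The formula above, including the guard against dividing by 0, can be sketched as a single element-wise operation. This is an illustration on made-up totals; `cases_total`, `deaths_total` and `mortality` are hypothetical names and the columns `A` and `B` stand in for countries.

```python
import pandas as pd

dates = pd.to_datetime(['2020-03-01', '2020-03-02', '2020-03-03'])

# Illustrative cumulative totals; column B never records a case.
cases_total = pd.DataFrame({'A': [0, 10, 40], 'B': [0, 0, 0]}, index=dates)
deaths_total = pd.DataFrame({'A': [0, 1, 2], 'B': [0, 0, 0]}, index=dates)

# Element-wise division returns floats; cells without cases would be NaN or inf,
# so where() keeps the ratio only where cases exist and uses 0 elsewhere.
mortality = (deaths_total / cases_total).where(cases_total != 0, 0.0)
```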

In [6]:
df_mortality = df_deaths_daily.copy().astype('float')

for country in countries_all:
    for date in dates_all:
        if df_cases_total.at[date, country] != 0:
            df_mortality.at[date, country] = df_deaths_total.at[date, country] / df_cases_total.at[date, country]
        else:
            df_mortality.at[date, country] = 0
            
display(df_mortality)
df_mortality.to_csv('../data/mortality.csv')

Date,Global,Africa,Asia,Europe,North America,Oceania,South America,Afghanistan,Albania,Algeria,...,United Kingdom,United Republic of Tanzania,United States Virgin Islands,United States of America,Uruguay,Uzbekistan,Venezuela,Vietnam,Zambia,Zimbabwe
2019-12-31,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000
2020-01-01,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000
2020-01-02,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000
2020-01-03,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000
2020-01-04,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2020-04-05,0.054730,0.044646,0.038870,0.077679,0.027022,0.005943,0.034054,0.018519,0.057057,0.100000,...,0.102928,0.047619,0.000000,0.027226,0.012500,0.007519,0.034722,0.0,0.025641,0.111111
2020-04-06,0.055337,0.047302,0.038561,0.078618,0.028368,0.006134,0.034722,0.023411,0.058172,0.115152,...,0.103209,0.045455,0.023810,0.028572,0.014778,0.005848,0.033784,0.0,0.025641,0.111111
2020-04-07,0.056238,0.048659,0.038239,0.080356,0.029669,0.006876,0.036434,0.020772,0.058355,0.121574,...,0.104112,0.041667,0.023810,0.029846,0.014458,0.005038,0.031447,0.0,0.025641,0.111111
2020-04-08,0.058537,0.049691,0.038034,0.084112,0.032106,0.007287,0.038746,0.029973,0.057441,0.132153,...,0.111491,0.041667,0.022222,0.032334,0.016509,0.003968,0.042169,0.0,0.025641,0.100000


## Thresholds

To make our lives easier when visualising the data, we're going to create a new dataframe, using a dictionary, holding the threshold dates. The reason for this will become apparent in the visualisation notebooks.

When a country hasn't passed a threshold, we fill in the day after the latest available date as a placeholder.
The thresholds are as follows:
- cases > 100
- deaths > 10
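
The lookup for a single threshold can be sketched as a small helper that returns the first date on which a cumulative series exceeds the threshold, or a placeholder when it never does. The helper name `first_date_above` and the sample numbers are assumptions for illustration only.

```python
import pandas as pd

def first_date_above(series, threshold, fallback):
    """First index label where series exceeds threshold, else fallback."""
    above = series[series > threshold]
    return above.index[0] if len(above) > 0 else fallback

dates = pd.date_range('2020-03-01', periods=4, freq='D')

# Illustrative cumulative case counts; column B never passes 100.
cases_total = pd.DataFrame({'A': [50, 99, 101, 150], 'B': [0, 1, 2, 3]}, index=dates)

# Placeholder: the day after the latest available date.
placeholder = dates[-1] + pd.Timedelta(days=1)

thresholds = {
    country: first_date_above(cases_total[country], 100, placeholder)
    for country in cases_total.columns
}
```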

In [7]:
dict_thresholds = {'ind' : ['cases', 'deaths']}

for country in countries_all:
    date_list = df_cases_total.loc[df_cases_total[country] > 100][country].index.tolist()
    if len(date_list) > 0:
        cases_date = date_list[0]
    else:
        cases_date = latest_date + datetime.timedelta(days=1)
    
    date_list = df_deaths_total.loc[df_deaths_total[country] > 10][country].index.tolist()
    if len(date_list) > 0:
        deaths_date = date_list[0]
    else:
        deaths_date = latest_date + datetime.timedelta(days=1)
    
    dict_thresholds[country] = [cases_date, deaths_date]
    
df_thresholds = pd.DataFrame(dict_thresholds).set_index('ind')
display(df_thresholds)
df_thresholds.to_csv('../data/thresholds.csv')

ind,Global,Africa,Asia,Europe,North America,Oceania,South America,Afghanistan,Albania,Algeria,...,United Kingdom,United Republic of Tanzania,United States Virgin Islands,United States of America,Uruguay,Uzbekistan,Venezuela,Vietnam,Zambia,Zimbabwe
cases,2020-01-19,2020-03-11,2020-01-19,2020-02-23,2020-03-02,2020-03-10,2020-03-11,2020-03-29,2020-03-25,2020-03-23,...,2020-03-06,2020-04-10,2020-04-10,2020-03-03,2020-03-21,2020-03-28,2020-03-26,2020-03-23,2020-04-10,2020-04-10
deaths,2020-01-22,2020-03-18,2020-01-22,2020-02-26,2020-03-05,2020-03-26,2020-03-19,2020-04-08,2020-03-31,2020-03-23,...,2020-03-15,2020-04-10,2020-04-10,2020-03-05,2020-04-10,2020-04-10,2020-04-10,2020-04-10,2020-04-10,2020-04-10


## Prevalence & Incidence

Before we continue creating the next datasets I want to clarify why I made the distinction at the top of this notebook between 'Outbreak spread' and 'Country impact', as this is a distinction that will continue throughout the project.

The daily, weekly and total datasets all contain absolute numbers, by which I mean they are not relative to any country metrics such as population size, density, healthcare, etc. This means that those numbers will be used to track and visualise the intensity of the spread of the outbreak itself in the different countries, in the hope of finding when countries are slowing down or containing the outbreak, so we can look at which measures were taken at those times and whether these might be replicated.

On the other hand, the datasets that we will now create consist of numbers that are relative to the population size, meaning that they represent how heavily each country is impacted by the outbreak. These can then be used to see which countries require more aid at this time and later, hopefully, show us general progressions that can be compared to countries where a new outbreak starts. The choice of these metrics was inspired by the WHO (World Health Organization); they are defined as follows:

- Prevalence = Total cases / Population
- Incidence = New cases / Population

With that out of the way, we can continue.<br>

Note that we verify whether we have population data for the country before we do this; otherwise we would divide by 0. In that case, for now, we will just fill that country's data with 0's. This will cause it to be disregarded in the visualisation and analysis, which is not ideal, but this will hopefully be rectified in the future.
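
The two formulas and the zero-population rule can be sketched with a column-wise division. This is a minimal illustration on made-up numbers: the populations are held in a plain `Series` here (an assumption, not the layout of `populations.csv`), a population of 0 stands for "unknown", and all names are hypothetical.

```python
import pandas as pd

dates = pd.date_range('2020-03-01', periods=3, freq='D')

# Illustrative daily cases and populations; B has no known population.
cases_daily = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 0, 0]}, index=dates)
populations = pd.Series({'A': 1000, 'B': 0})

# Turn unknown populations into NaN so the division yields NaN rather than inf.
safe_pop = populations.where(populations != 0)

# div(..., axis='columns') matches the Series labels against the column names.
# fillna(0.0) then fills countries without population data with 0's.
incidence = cases_daily.div(safe_pop, axis='columns').fillna(0.0)
prevalence = cases_daily.cumsum().div(safe_pop, axis='columns').fillna(0.0)
```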

In [8]:
df_prevalence = df_cases_daily.copy().astype('float')
df_incidence_daily = df_cases_daily.copy().astype('float')
df_incidence_weekly = df_cases_weekly.copy().astype('float')
df_incidence_3days = df_cases_3days.copy().astype('float')

for country in countries_all:
    ok = (df_populations.at['Populations', country] != 0)
    
    for date in dates_all:
        if ok:
            df_prevalence.at[date, country] = df_cases_total.at[date, country] / df_populations.at['Populations', country]
            df_incidence_daily.at[date, country] = df_cases_daily.at[date, country] / df_populations.at['Populations', country]
        else: 
            df_prevalence.at[date, country] = 0
            df_incidence_daily.at[date, country] = 0
            
    for date in dates_weekly:
        if ok:
            df_incidence_weekly.at[date, country] = df_cases_weekly.at[date, country] / df_populations.at['Populations', country]
        else: 
            df_incidence_weekly.at[date, country] = 0
            
    for date in dates_3days:
        if ok:
            df_incidence_3days.at[date, country] = df_cases_3days.at[date, country] / df_populations.at['Populations', country]
        else: 
            df_incidence_3days.at[date, country] = 0
            
display(df_prevalence)   
display(df_incidence_daily)
display(df_incidence_weekly)
display(df_incidence_3days)

df_prevalence.to_csv('../data/prevalence.csv')
df_incidence_daily.to_csv('../data/incidence_daily.csv')
df_incidence_weekly.to_csv('../data/incidence_weekly.csv')
df_incidence_3days.to_csv('../data/incidence_3days.csv')

Date,Global,Africa,Asia,Europe,North America,Oceania,South America,Afghanistan,Albania,Algeria,...,United Kingdom,United Republic of Tanzania,United States Virgin Islands,United States of America,Uruguay,Uzbekistan,Venezuela,Vietnam,Zambia,Zimbabwe
2019-12-31,3.600508e-09,0.000000,6.114138e-09,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000e+00,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000e+00
2020-01-01,3.600508e-09,0.000000,6.114138e-09,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000e+00,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000e+00
2020-01-02,3.600508e-09,0.000000,6.114138e-09,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000e+00,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000e+00
2020-01-03,5.867494e-09,0.000000,9.963781e-09,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000e+00,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000e+00
2020-01-04,5.867494e-09,0.000000,9.963781e-09,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000e+00,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000e+00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2020-04-05,1.566039e-04,0.000007,4.829648e-05,0.000766,0.000578,0.000168,0.000056,0.000007,0.000116,0.000031,...,0.000630,3.728803e-07,0.000374,0.000954,0.000116,0.000008,0.000005,0.000003,0.000002,6.233111e-07
2020-04-06,1.661029e-04,0.000007,5.057343e-05,0.000805,0.000627,0.000171,0.000061,0.000008,0.000126,0.000031,...,0.000719,3.906365e-07,0.000393,0.001032,0.000118,0.000010,0.000005,0.000003,0.000002,6.233111e-07
2020-04-07,1.756230e-04,0.000008,5.264681e-05,0.000841,0.000683,0.000174,0.000065,0.000009,0.000132,0.000034,...,0.000776,4.261489e-07,0.000393,0.001125,0.000120,0.000012,0.000006,0.000003,0.000002,6.233111e-07
2020-04-08,1.856114e-04,0.000008,5.479921e-05,0.000879,0.000739,0.000178,0.000072,0.000010,0.000134,0.000035,...,0.000831,4.261489e-07,0.000421,0.001219,0.000123,0.000015,0.000006,0.000003,0.000002,6.925679e-07


Date,Global,Africa,Asia,Europe,North America,Oceania,South America,Afghanistan,Albania,Algeria,...,United Kingdom,United Republic of Tanzania,United States Virgin Islands,United States of America,Uruguay,Uzbekistan,Venezuela,Vietnam,Zambia,Zimbabwe
2019-12-31,3.600508e-09,0.000000e+00,6.114138e-09,0.000000,0.000000,0.000000,0.000000,0.000000e+00,0.000000,0.000000e+00,...,0.000000,0.000000e+00,0.000000,0.000000,0.000000,0.000000,0.000000e+00,0.000000e+00,0.0,0.000000e+00
2020-01-01,0.000000e+00,0.000000e+00,0.000000e+00,0.000000,0.000000,0.000000,0.000000,0.000000e+00,0.000000,0.000000e+00,...,0.000000,0.000000e+00,0.000000,0.000000,0.000000,0.000000,0.000000e+00,0.000000e+00,0.0,0.000000e+00
2020-01-02,0.000000e+00,0.000000e+00,0.000000e+00,0.000000,0.000000,0.000000,0.000000,0.000000e+00,0.000000,0.000000e+00,...,0.000000,0.000000e+00,0.000000,0.000000,0.000000,0.000000,0.000000e+00,0.000000e+00,0.0,0.000000e+00
2020-01-03,2.266986e-09,0.000000e+00,3.849643e-09,0.000000,0.000000,0.000000,0.000000,0.000000e+00,0.000000,0.000000e+00,...,0.000000,0.000000e+00,0.000000,0.000000,0.000000,0.000000,0.000000e+00,0.000000e+00,0.0,0.000000e+00
2020-01-04,0.000000e+00,0.000000e+00,0.000000e+00,0.000000,0.000000,0.000000,0.000000,0.000000e+00,0.000000,0.000000e+00,...,0.000000,0.000000e+00,0.000000,0.000000,0.000000,0.000000,0.000000e+00,0.000000e+00,0.0,0.000000e+00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2020-04-05,1.156336e-05,9.195381e-07,2.772648e-06,0.000045,0.000063,0.000005,0.000005,9.415591e-07,0.000010,7.435749e-06,...,0.000056,0.000000e+00,0.000000,0.000105,0.000004,0.000001,0.000000e+00,1.046678e-08,0.0,0.000000e+00
2020-04-06,9.498939e-06,5.409975e-07,2.276950e-06,0.000040,0.000048,0.000003,0.000005,7.801490e-07,0.000010,4.736146e-07,...,0.000089,1.775620e-08,0.000019,0.000078,0.000002,0.000002,1.385512e-07,1.046678e-08,0.0,0.000000e+00
2020-04-07,9.520142e-06,4.952572e-07,2.073372e-06,0.000035,0.000056,0.000003,0.000004,1.022264e-06,0.000006,2.439115e-06,...,0.000057,3.551241e-08,0.000000,0.000093,0.000003,0.000002,3.810158e-07,4.186711e-08,0.0,0.000000e+00
2020-04-08,9.988341e-06,5.693881e-07,2.152403e-06,0.000038,0.000057,0.000004,0.000007,8.070507e-07,0.000002,1.065633e-06,...,0.000055,0.000000e+00,0.000028,0.000094,0.000003,0.000003,2.424646e-07,6.280066e-08,0.0,6.925679e-08


Date,Global,Africa,Asia,Europe,North America,Oceania,South America,Afghanistan,Albania,Algeria,...,United Kingdom,United Republic of Tanzania,United States Virgin Islands,United States of America,Uruguay,Uzbekistan,Venezuela,Vietnam,Zambia,Zimbabwe
2020-01-09,6.096097e-10,0.0,1.035198e-09,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2020-01-16,3.810061e-11,0.0,6.469987e-11,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2020-01-23,1.085867e-08,0.0,1.840711e-08,0.0,2.481826e-10,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,4.366484e-10,0.0,0.0,0.0,0.0,0.0,0.0
2020-01-30,1.370098e-07,0.0,2.319167e-07,1.857456e-09,1.737278e-09,2.134742e-08,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.746594e-09,0.0,0.0,0.0,2.990508e-09,0.0,0.0
2020-02-06,3.895787e-07,0.0,6.604563e-07,3.34342e-09,2.233643e-09,2.490532e-08,0.0,0.0,0.0,0.0,...,4.297167e-09,0.0,0.0,3.056539e-09,0.0,0.0,0.0,1.196203e-08,0.0,0.0
2020-02-13,6.106575e-07,0.0,1.036266e-06,3.157674e-09,9.927304e-10,3.557903e-09,0.0,0.0,0.0,0.0,...,1.504008e-08,0.0,0.0,8.732968e-10,0.0,0.0,0.0,8.971523e-09,0.0,0.0
2020-02-20,2.932794e-07,1.126609e-10,4.978655e-07,1.857456e-10,4.963652e-10,3.557903e-09,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,4.366484e-10,0.0,0.0,0.0,0.0,0.0,0.0
2020-02-27,1.215028e-07,1.126609e-10,1.903794e-07,8.079932e-08,1.191276e-08,2.846323e-08,3.321842e-10,3.843099e-09,0.0,3.382961e-09,...,8.594334e-09,0.0,0.0,1.921253e-08,0.0,0.0,0.0,0.0,0.0,0.0
2020-03-05,2.517498e-07,2.027896e-09,2.9804e-07,7.073191e-07,3.176737e-08,1.138529e-07,5.314947e-09,0.0,0.0,3.721257e-08,...,1.54698e-07,0.0,0.0,4.366484e-08,0.0,0.0,0.0,0.0,0.0,0.0
2020-03-12,5.749572e-07,1.104077e-08,3.22173e-07,3.463226e-06,3.11221e-07,2.704007e-07,4.949545e-08,2.305859e-08,5.482283e-07,2.706369e-08,...,7.971244e-07,0.0,0.0,5.034556e-07,0.0,0.0,0.0,3.439084e-08,0.0,0.0


Date,Global,Africa,Asia,Europe,North America,Oceania,South America,Afghanistan,Albania,Algeria,...,United Kingdom,United Republic of Tanzania,United States Virgin Islands,United States of America,Uruguay,Uzbekistan,Venezuela,Vietnam,Zambia,Zimbabwe
2020-01-04,7.556621e-10,0.0,1.283214e-09,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2020-01-07,6.667607e-10,0.0,1.132248e-09,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2020-01-10,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2020-01-13,4.445071e-11,0.0,7.548319e-11,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2020-01-16,4.445071e-11,0.0,7.548319e-11,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2020-01-19,7.023212e-09,0.0,1.192634e-08,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2020-01-22,1.400197e-08,0.0,2.370172e-08,0.0,5.790927e-10,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.018846e-09,0.0,0.0,0.0,0.0,0.0,0.0
2020-01-25,3.627178e-08,0.0,6.121686e-08,1.300219e-09,5.790927e-10,8.301775e-09,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.018846e-09,0.0,0.0,0.0,6.977851e-09,0.0,0.0
2020-01-28,1.438869e-07,0.0,2.436597e-07,4.334063e-10,2.895464e-09,2.490532e-08,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,3.056539e-09,0.0,0.0,0.0,0.0,0.0,0.0
2020-01-31,2.328773e-07,0.0,3.940977e-07,5.634282e-09,1.158185e-09,2.490532e-08,0.0,0.0,0.0,0.0,...,1.002672e-08,0.0,0.0,1.018846e-09,0.0,0.0,0.0,1.046678e-08,0.0,0.0
