# Daily data processing

This notebook is part of a toolset to analyse and visualise data on the COVID-19 epidemic and will be run once a day.<br>
Though this can only be done locally on my machine, the data will be saved in ../data/ in this repository.

We will take the data downloaded and extracted by the other notebook in this directory and mold it into a few datasets that we will be working with. The content of these datasets is explained as we create them; here is an overview.

#### Outbreak spread:
    - Total cases, deaths
    - Weekly cases, deaths
    
#### Country impact:
    - Prevalence
    - Daily, weekly incidence
    
#### Other:
    - Mortality
    - Threshold dates
    - Global data
    - Continental data

## Data import

Before we can do anything we have to load the three datasets created by the other notebook, setting the index appropriately. Note that when the sets are saved as .csv files, pandas writes the dates as strings. To be able to compare them as dates, we need to tell pandas to convert that column to datetimes before we set it as the index.

To make the processing easier to read we also create two collections holding all the available dates and countries, and store the last available date.

In [1]:
import pandas as pd
import datetime

df_cases_daily = pd.read_csv('../data/cases_daily.csv')
df_cases_daily['Date'] = df_cases_daily['Date'].astype('datetime64[ns]')
df_cases_daily = df_cases_daily.set_index('Date')

df_deaths_daily = pd.read_csv('../data/deaths_daily.csv')
df_deaths_daily['Date'] = df_deaths_daily['Date'].astype('datetime64[ns]') 
df_deaths_daily = df_deaths_daily.set_index('Date')

df_populations = pd.read_csv('../data/populations_2018.csv').set_index('Countries')

dates = df_cases_daily.index
countries = df_cases_daily.columns

latest_date = dates[-1]
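
As an alternative to converting the column after loading, `read_csv` can parse the dates and set the index in one step. A minimal sketch, using a small in-memory CSV as a stand-in for `../data/cases_daily.csv` (the country values and the `df_sample` name are made up for illustration):

```python
import io
import pandas as pd

# Hypothetical stand-in for ../data/cases_daily.csv
csv_text = "Date,Belgium,France\n2020-03-01,1,2\n2020-03-02,3,4\n"

# parse_dates converts the column to datetimes; index_col sets it as the index
df_sample = pd.read_csv(io.StringIO(csv_text), parse_dates=['Date'], index_col='Date')

print(df_sample.index.dtype)
```

Whether this reads better than the explicit `astype` conversion is a matter of taste; the resulting index is a `DatetimeIndex` either way.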

## Total

The first datasets we will create contain the total number of cases and deaths per day. This is easily done by looping through all the dates for a particular country and adding each value to a running counter.

Note that the structure of these datasets is the same as that of the daily cases and deaths: the same countries as columns and every day as index. So we can simply copy the daily dataset and overwrite it instead of building new dictionaries to create our dataframe from.

In [2]:
df_cases_total = df_cases_daily.copy()
df_deaths_total = df_deaths_daily.copy()

for country in countries:
    cases = 0
    deaths = 0
    
    for date in dates:
        cases += df_cases_daily.at[date, country]
        deaths += df_deaths_daily.at[date, country]
        
        df_cases_total.at[date, country] = cases
        df_deaths_total.at[date, country] = deaths
        
display(df_cases_total)
display(df_deaths_total)

df_cases_total.to_csv('../data/cases_total.csv')
df_deaths_total.to_csv('../data/deaths_total.csv')

Date,Afghanistan,Albania,Algeria,Andorra,Angola,Anguilla,Antigua and Barbuda,Argentina,Armenia,Aruba,...,United Kingdom,United Republic of Tanzania,United States Virgin Islands,United States of America,Uruguay,Uzbekistan,Venezuela,Vietnam,Zambia,Zimbabwe
2019-12-31,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2020-01-01,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2020-01-02,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2020-01-03,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2020-01-04,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2020-03-31,141,223,511,370,7,2,7,966,482,50,...,22141,19,30,164620,320,149,135,229,35,7
2020-04-01,166,243,584,376,7,2,7,966,532,55,...,25150,19,30,189618,338,173,135,229,35,8
2020-04-02,192,259,847,390,8,2,7,1133,571,55,...,29474,20,30,216721,338,187,143,235,36,8
2020-04-03,235,277,847,428,8,3,9,1133,663,60,...,33718,21,33,245540,369,190,144,239,39,8


Date,Afghanistan,Albania,Algeria,Andorra,Angola,Anguilla,Antigua and Barbuda,Argentina,Armenia,Aruba,...,United Kingdom,United Republic of Tanzania,United States Virgin Islands,United States of America,Uruguay,Uzbekistan,Venezuela,Vietnam,Zambia,Zimbabwe
2019-12-31,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2020-01-01,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2020-01-02,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2020-01-03,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2020-01-04,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2020-03-31,4,12,31,8,2,0,0,24,3,0,...,1408,1,0,3170,1,2,3,0,0,1
2020-04-01,4,15,35,12,2,0,0,24,3,0,...,1789,1,0,4079,1,2,3,0,0,1
2020-04-02,4,15,58,14,2,0,0,31,3,0,...,2532,1,0,5138,2,2,3,0,0,1
2020-04-03,4,16,58,15,2,0,0,34,4,0,...,2921,1,0,6053,4,2,3,0,1,1
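
The running totals computed by the loop above can also be obtained with pandas' built-in cumulative sum. A minimal sketch on made-up sample data (the `*_sample` names and values are illustrative, not taken from the real datasets):

```python
import pandas as pd

dates_sample = pd.date_range('2020-03-01', periods=4, freq='D')
df_daily_sample = pd.DataFrame(
    {'Belgium': [1, 0, 2, 3], 'France': [0, 5, 0, 1]},
    index=dates_sample,
)

# Running total per column, equivalent to adding each day to a counter
df_total_sample = df_daily_sample.cumsum()
print(df_total_sample)
```

The result keeps the same index and columns as the daily frame, so no copy-and-overwrite step is needed.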


## Weekly

To create the datasets containing weekly cases and deaths, we first need to create a list of dates, starting from the latest date we have and counting back one week at a time, for as long as the date a week earlier still falls after the first available date.

Once the list is created we reverse it, so the order of the datasets we create is still from earliest to most recent.

In [3]:
dates_weekly = []

start_date = latest_date
end_date = start_date - datetime.timedelta(days=7)

while end_date > df_cases_daily.index[0]:
    dates_weekly.append(start_date)
    
    start_date = end_date
    end_date = start_date - datetime.timedelta(days=7)

dates_weekly.reverse()
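
The same list of weekly dates can be sketched with `pd.date_range`, which can count backwards from an end date. The first and latest dates below are assumptions chosen for illustration, and the sketch assumes there is more than a week of data:

```python
import pandas as pd

first_date = pd.Timestamp('2019-12-31')          # assumed first date in the data
latest_date_sample = pd.Timestamp('2020-04-04')  # assumed latest date in the data

# Number of weekly dates for which the date a week earlier is still after first_date
n_weeks = ((latest_date_sample - first_date).days - 1) // 7

# Dates spaced 7 days apart, ending at the latest date, earliest first
dates_weekly_sample = pd.date_range(end=latest_date_sample, periods=n_weeks, freq='7D')
print(dates_weekly_sample[0], dates_weekly_sample[-1])
```

Because `date_range` already returns the dates in ascending order, no reversal is required.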

Now that we know which dates we need to look at, we can create the datasets. 
For each date, we take the sum of the values between that date and a week earlier.

Since the structure differs from the daily sets (there are fewer entries), it is more readable to create them using dictionaries, as we saw in the Data-Download notebook. 

In [4]:
dict_cases_weekly = {'Date': dates_weekly}
dict_deaths_weekly = {'Date': dates_weekly}

for country in countries:
    cases_weekly = []
    deaths_weekly = []
    
    for date in dates_weekly:
        end_date = date - datetime.timedelta(days=7)
        
        # Note: .loc label slices include both endpoints, so end_date:date covers 8 days
        cases_weekly.append(df_cases_daily.loc[end_date:date, country].sum())
        deaths_weekly.append(df_deaths_daily.loc[end_date:date, country].sum())
    
    dict_cases_weekly[country] = cases_weekly
    dict_deaths_weekly[country] = deaths_weekly

df_cases_weekly = pd.DataFrame(dict_cases_weekly).set_index('Date')
df_deaths_weekly = pd.DataFrame(dict_deaths_weekly).set_index('Date')

display(df_cases_weekly)
display(df_deaths_weekly)

df_cases_weekly.to_csv('../data/cases_weekly.csv')
df_deaths_weekly.to_csv('../data/deaths_weekly.csv')

Date,Afghanistan,Albania,Algeria,Andorra,Angola,Anguilla,Antigua and Barbuda,Argentina,Armenia,Aruba,...,United Kingdom,United Republic of Tanzania,United States Virgin Islands,United States of America,Uruguay,Uzbekistan,Venezuela,Vietnam,Zambia,Zimbabwe
2020-01-11,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2020-01-18,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2020-01-25,0,0,0,0,0,0,0,0,0,0,...,0,0,0,2,0,0,0,2,0,0
2020-02-01,0,0,0,0,0,0,0,0,0,0,...,2,0,0,6,0,0,0,3,0,0
2020-02-08,0,0,0,0,0,0,0,0,0,0,...,1,0,0,6,0,0,0,8,0,0
2020-02-15,0,0,0,0,0,0,0,0,0,0,...,6,0,0,3,0,0,0,4,0,0
2020-02-22,0,0,0,0,0,0,0,0,0,0,...,0,0,0,20,0,0,0,0,0,0
2020-02-29,1,0,1,0,0,0,0,0,0,0,...,9,0,0,50,0,0,0,0,0,0
2020-03-07,0,0,16,1,0,0,0,8,1,0,...,147,0,0,278,0,0,0,1,0,0
2020-03-14,6,33,9,1,0,0,0,32,12,2,...,592,0,0,1941,0,0,0,33,0,0


Date,Afghanistan,Albania,Algeria,Andorra,Angola,Anguilla,Antigua and Barbuda,Argentina,Armenia,Aruba,...,United Kingdom,United Republic of Tanzania,United States Virgin Islands,United States of America,Uruguay,Uzbekistan,Venezuela,Vietnam,Zambia,Zimbabwe
2020-01-11,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2020-01-18,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2020-01-25,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2020-02-01,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2020-02-08,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2020-02-15,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2020-02-22,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2020-02-29,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2020-03-07,0,0,0,0,0,0,0,0,0,0,...,1,0,0,14,0,0,0,0,0,0
2020-03-14,0,1,2,0,0,0,0,2,0,0,...,9,0,0,35,0,0,0,0,0,0
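
One detail worth knowing about the weekly sums: label-based slicing with `.loc` includes both endpoints, so a slice from "a week earlier" up to the date itself spans eight days, and consecutive windows share their boundary day. Whether that is intended is not stated here; the sketch below, on made-up data, only illustrates the difference between a 7-day and a 6-day offset:

```python
import datetime
import pandas as pd

# One case per day for ten days, so a sum equals the number of days covered
days = pd.date_range('2020-03-01', periods=10, freq='D')
sr_sample = pd.Series(1, index=days)

date = days[-1]
eight_days = sr_sample.loc[date - datetime.timedelta(days=7):date].sum()
seven_days = sr_sample.loc[date - datetime.timedelta(days=6):date].sum()
print(eight_days, seven_days)
```

If a strict seven-day window is wanted, subtracting six days instead of seven would be one way to get it.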


## Mortality

Next we create the dataset containing the mortality rates. Again, since the structure doesn't change, we copy an existing dataframe and overwrite it. 

Note that we need to change the cell type to float, since the values were read as int when loading the file, which would cause all mortality values to become 0. We also need to verify that there are any cases, so as not to divide by 0.

- mortality = total deaths / total cases

In [5]:
df_mortality = df_deaths_daily.copy().astype('float')

for country in countries:
    for date in dates:
        if df_cases_total.at[date, country] != 0:
            df_mortality.at[date, country] = df_deaths_total.at[date, country] / df_cases_total.at[date, country]
        else:
            df_mortality.at[date, country] = 0
            
display(df_mortality)
df_mortality.to_csv('../data/mortality.csv')

Date,Afghanistan,Albania,Algeria,Andorra,Angola,Anguilla,Antigua and Barbuda,Argentina,Armenia,Aruba,...,United Kingdom,United Republic of Tanzania,United States Virgin Islands,United States of America,Uruguay,Uzbekistan,Venezuela,Vietnam,Zambia,Zimbabwe
2019-12-31,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.000000,0.000000,0.0,...,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000
2020-01-01,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.000000,0.000000,0.0,...,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000
2020-01-02,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.000000,0.000000,0.0,...,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000
2020-01-03,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.000000,0.000000,0.0,...,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000
2020-01-04,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.000000,0.000000,0.0,...,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2020-03-31,0.028369,0.053812,0.060665,0.021622,0.285714,0.0,0.0,0.024845,0.006224,0.0,...,0.063592,0.052632,0.0,0.019256,0.003125,0.013423,0.022222,0.0,0.000000,0.142857
2020-04-01,0.024096,0.061728,0.059932,0.031915,0.285714,0.0,0.0,0.024845,0.005639,0.0,...,0.071133,0.052632,0.0,0.021512,0.002959,0.011561,0.022222,0.0,0.000000,0.125000
2020-04-02,0.020833,0.057915,0.068477,0.035897,0.250000,0.0,0.0,0.027361,0.005254,0.0,...,0.085906,0.050000,0.0,0.023708,0.005917,0.010695,0.020979,0.0,0.000000,0.125000
2020-04-03,0.017021,0.057762,0.068477,0.035047,0.250000,0.0,0.0,0.030009,0.006033,0.0,...,0.086630,0.047619,0.0,0.024652,0.010840,0.010526,0.020833,0.0,0.025641,0.125000
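
The mortality formula can also be applied to whole dataframes at once. A minimal sketch on made-up totals (the `*_sample` frames are illustrative), where cells without any cases are set to 0 just as in the loop above:

```python
import pandas as pd

idx = pd.date_range('2020-03-01', periods=3, freq='D')
df_cases_total_sample = pd.DataFrame({'A': [0, 10, 20], 'B': [0, 0, 4]}, index=idx)
df_deaths_total_sample = pd.DataFrame({'A': [0, 1, 5], 'B': [0, 0, 1]}, index=idx)

# Element-wise division; where there are no cases (0 / 0 gives NaN) use 0 instead
df_mortality_sample = (df_deaths_total_sample / df_cases_total_sample).where(
    df_cases_total_sample != 0, 0.0
)
print(df_mortality_sample)
```

Division in pandas always produces floats, so the explicit `astype('float')` step is not needed in this form.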


## Thresholds

To make our lives easier when visualising the data, we are going to create a new dataframe, using a dictionary, holding the threshold dates. The reason for this will become apparent in the visualization notebooks. 

When a country hasn't passed a threshold, we fill in the day after the latest available date as a placeholder.
The thresholds are as follows: 
- cases > 100
- deaths > 10

In [6]:
dict_thresholds = {'ind' : ['cases', 'deaths']}

for country in countries:
    date_list = df_cases_total.loc[df_cases_total[country] > 100][country].index.tolist()
    if len(date_list) > 0:
        cases_date = date_list[0]
    else:
        cases_date = latest_date + datetime.timedelta(days=1)
    
    date_list = df_deaths_total.loc[df_deaths_total[country] > 10][country].index.tolist()
    if len(date_list) > 0:
        deaths_date = date_list[0]
    else:
        deaths_date = latest_date + datetime.timedelta(days=1)
    
    dict_thresholds[country] = [cases_date, deaths_date]
    
df_thresholds = pd.DataFrame(dict_thresholds).set_index('ind')
display(df_thresholds)
df_thresholds.to_csv('../data/thresholds.csv')

ind,Afghanistan,Albania,Algeria,Andorra,Angola,Anguilla,Antigua and Barbuda,Argentina,Armenia,Aruba,...,United Kingdom,United Republic of Tanzania,United States Virgin Islands,United States of America,Uruguay,Uzbekistan,Venezuela,Vietnam,Zambia,Zimbabwe
cases,2020-03-29,2020-03-25,2020-03-23,2020-03-23,2020-04-05,2020-04-05,2020-04-05,2020-03-20,2020-03-19,2020-04-05,...,2020-03-06,2020-04-05,2020-04-05,2020-03-03,2020-03-21,2020-03-28,2020-03-26,2020-03-23,2020-04-05,2020-04-05
deaths,2020-04-05,2020-03-31,2020-03-23,2020-04-01,2020-04-05,2020-04-05,2020-04-05,2020-03-27,2020-04-05,2020-04-05,...,2020-03-15,2020-04-05,2020-04-05,2020-03-05,2020-04-05,2020-04-05,2020-04-05,2020-04-05,2020-04-05,2020-04-05
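
Finding the first date on which a total passes a threshold can be sketched without building intermediate lists. The example below uses made-up totals and a threshold of 100; the names `df_total_sample`, `placeholder` and `first_dates` are illustrative:

```python
import pandas as pd

idx = pd.date_range('2020-03-01', periods=4, freq='D')
df_total_sample = pd.DataFrame({'A': [50, 90, 120, 200], 'B': [0, 1, 2, 3]}, index=idx)

# Placeholder: the day after the latest available date
placeholder = idx[-1] + pd.Timedelta(days=1)

mask = df_total_sample > 100
# idxmax returns the first label holding the maximum, i.e. the first True per column
first_dates = mask.astype(int).idxmax()
# Columns that never pass the threshold get the placeholder date instead
first_dates = first_dates.where(mask.any(), placeholder)
print(first_dates)
```

The `mask.any()` check matters because `idxmax` would otherwise return the first date for a column that never passes the threshold.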


## Prevalence & Incidence

Before we continue creating the next datasets, I want to clarify why I made the distinction at the top of this notebook between 'Outbreak spread' and 'Country impact', as this distinction will continue throughout the project.

The daily, weekly and total datasets all contain absolute numbers, by which I mean they are not relative to any country metrics such as population size, density, healthcare, etc. This means that those numbers will be used to track and visualise the intensity of the spread of the outbreak itself in the different countries, in the hope of finding out when countries are slowing down or containing the outbreak, so we can look at which measures were taken at those times and whether these might be replicated.

On the other hand, the datasets that we will now create consist of numbers that are relative to the population size, meaning that they represent how heavily each country is impacted by the outbreak. These can be used to see which countries require more aid at this time and later, hopefully, show us general progressions that can be compared to countries where a new outbreak starts. The choice of these metrics was inspired by the WHO (World Health Organization); they are as follows:

- Prevalence = Total cases / Population
- Incidence = New cases / Population

With that out of the way, we can continue.<br>

Note that we check whether we have population data for the country before dividing; otherwise we would divide by 0. For now, we simply fill that country's data with 0's. This causes it to be disregarded in the visualization and analysis, which is not ideal, but will hopefully be rectified in the future.

In [7]:
df_prevalence = df_cases_daily.copy().astype('float')
df_incidence_daily = df_cases_daily.copy().astype('float')
df_incidence_weekly = df_cases_weekly.copy().astype('float')

for country in countries:
    ok = (df_populations.at[country, 'populations'] != 0)
    
    for date in dates:
        if ok:
            df_prevalence.at[date, country] = df_cases_total.at[date, country] / df_populations.at[country, 'populations']
            df_incidence_daily.at[date, country] = df_cases_daily.at[date, country] / df_populations.at[country, 'populations']
        else: 
            df_prevalence.at[date, country] = 0
            df_incidence_daily.at[date, country] = 0
            
    for date in dates_weekly:
        if ok:
            df_incidence_weekly.at[date, country] = df_cases_weekly.at[date, country] / df_populations.at[country, 'populations']
        else: 
            df_incidence_weekly.at[date, country] = 0
            
display(df_prevalence)   
display(df_incidence_daily)
display(df_incidence_weekly)

df_prevalence.to_csv('../data/prevalence.csv')
df_incidence_daily.to_csv('../data/incidence_daily.csv')
df_incidence_weekly.to_csv('../data/incidence_weekly.csv')

Date,Afghanistan,Albania,Algeria,Andorra,Angola,Anguilla,Antigua and Barbuda,Argentina,Armenia,Aruba,...,United Kingdom,United Republic of Tanzania,United States Virgin Islands,United States of America,Uruguay,Uzbekistan,Venezuela,Vietnam,Zambia,Zimbabwe
2019-12-31,0.000000,0.000000,0.000000,0.000000,0.000000e+00,0.0,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000e+00,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000e+00
2020-01-01,0.000000,0.000000,0.000000,0.000000,0.000000e+00,0.0,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000e+00,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000e+00
2020-01-02,0.000000,0.000000,0.000000,0.000000,0.000000e+00,0.0,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000e+00,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000e+00
2020-01-03,0.000000,0.000000,0.000000,0.000000,0.000000e+00,0.0,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000e+00,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000e+00
2020-01-04,0.000000,0.000000,0.000000,0.000000,0.000000e+00,0.0,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000e+00,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000e+00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2020-03-31,0.000004,0.000078,0.000012,0.004805,2.272007e-07,0.0,0.000073,0.000022,0.000163,0.000472,...,0.000333,3.373679e-07,0.000280,0.000503,0.000093,0.000005,0.000005,0.000002,0.000002,4.847975e-07
2020-04-01,0.000004,0.000085,0.000014,0.004883,2.272007e-07,0.0,0.000073,0.000022,0.000180,0.000520,...,0.000378,3.373679e-07,0.000280,0.000580,0.000098,0.000005,0.000005,0.000002,0.000002,5.540543e-07
2020-04-02,0.000005,0.000090,0.000020,0.005065,2.596580e-07,0.0,0.000073,0.000025,0.000193,0.000520,...,0.000443,3.551241e-07,0.000280,0.000662,0.000098,0.000006,0.000005,0.000002,0.000002,5.540543e-07
2020-04-03,0.000006,0.000097,0.000020,0.005558,2.596580e-07,0.0,0.000093,0.000025,0.000225,0.000567,...,0.000507,3.728803e-07,0.000308,0.000751,0.000107,0.000006,0.000005,0.000003,0.000002,5.540543e-07


Date,Afghanistan,Albania,Algeria,Andorra,Angola,Anguilla,Antigua and Barbuda,Argentina,Armenia,Aruba,...,United Kingdom,United Republic of Tanzania,United States Virgin Islands,United States of America,Uruguay,Uzbekistan,Venezuela,Vietnam,Zambia,Zimbabwe
2019-12-31,0.000000e+00,0.000000,0.000000,0.000000,0.000000e+00,0.0,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000e+00,0.000000,0.000000,0.000000,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00
2020-01-01,0.000000e+00,0.000000,0.000000,0.000000,0.000000e+00,0.0,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000e+00,0.000000,0.000000,0.000000,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00
2020-01-02,0.000000e+00,0.000000,0.000000,0.000000,0.000000e+00,0.0,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000e+00,0.000000,0.000000,0.000000,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00
2020-01-03,0.000000e+00,0.000000,0.000000,0.000000,0.000000e+00,0.0,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000e+00,0.000000,0.000000,0.000000,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00
2020-01-04,0.000000e+00,0.000000,0.000000,0.000000,0.000000e+00,0.0,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000e+00,0.000000,0.000000,0.000000,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2020-03-31,7.263456e-07,0.000004,0.000001,0.000467,0.000000e+00,0.0,0.000000,0.000003,0.000020,0.000000,...,0.000039,8.878101e-08,0.000000,0.000066,0.000003,1.213762e-07,5.542048e-07,1.046678e-08,3.457850e-07,0.000000e+00
2020-04-01,6.725422e-07,0.000007,0.000002,0.000078,0.000000e+00,0.0,0.000000,0.000000,0.000017,0.000047,...,0.000045,0.000000e+00,0.000000,0.000076,0.000005,7.282570e-07,0.000000e+00,0.000000e+00,0.000000e+00,6.925679e-08
2020-04-02,6.994439e-07,0.000006,0.000006,0.000182,3.245725e-08,0.0,0.000000,0.000004,0.000013,0.000000,...,0.000065,1.775620e-08,0.000000,0.000083,0.000000,4.248166e-07,2.771024e-07,6.280066e-08,5.763084e-08,0.000000e+00
2020-04-03,1.156773e-06,0.000006,0.000000,0.000493,0.000000e+00,0.0,0.000021,0.000000,0.000031,0.000047,...,0.000064,1.775620e-08,0.000028,0.000088,0.000009,9.103212e-08,3.463780e-08,4.186711e-08,1.728925e-07,0.000000e+00


Date,Afghanistan,Albania,Algeria,Andorra,Angola,Anguilla,Antigua and Barbuda,Argentina,Armenia,Aruba,...,United Kingdom,United Republic of Tanzania,United States Virgin Islands,United States of America,Uruguay,Uzbekistan,Venezuela,Vietnam,Zambia,Zimbabwe
2020-01-11,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2020-01-18,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2020-01-25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,6.113078e-09,0.0,0.0,0.0,2.093355e-08,0.0,0.0
2020-02-01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,3.008017e-08,0.0,0.0,1.833923e-08,0.0,0.0,0.0,3.140033e-08,0.0,0.0
2020-02-08,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.504008e-08,0.0,0.0,1.833923e-08,0.0,0.0,0.0,8.373422e-08,0.0,0.0
2020-02-15,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,9.02405e-08,0.0,0.0,9.169617e-09,0.0,0.0,0.0,4.186711e-08,0.0,0.0
2020-02-22,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,6.113078e-08,0.0,0.0,0.0,0.0,0.0,0.0
2020-02-29,2.690169e-08,0.0,2.368073e-08,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.353608e-07,0.0,0.0,1.528269e-07,0.0,0.0,0.0,0.0,0.0,0.0
2020-03-07,0.0,0.0,3.788917e-07,1.3e-05,0.0,0.0,0.0,1.797975e-07,3.387791e-07,0.0,...,2.210892e-06,0.0,0.0,8.497178e-07,0.0,0.0,0.0,1.046678e-08,0.0,0.0
2020-03-14,1.614101e-07,1.2e-05,2.131266e-07,1.3e-05,0.0,0.0,0.0,7.1919e-07,4.065349e-06,1.9e-05,...,8.90373e-06,0.0,0.0,5.932742e-06,0.0,0.0,0.0,3.454036e-07,0.0,0.0
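
Dividing every country's column by its own population can be sketched in a single call with `DataFrame.div`. The data below is made up, and treating a population of 0 as "no data" mirrors the check in the loop above:

```python
import numpy as np
import pandas as pd

idx = pd.date_range('2020-03-01', periods=2, freq='D')
df_cases_sample = pd.DataFrame({'A': [10, 20], 'B': [1, 2]}, index=idx)
sr_pop_sample = pd.Series({'A': 1000, 'B': 0})  # B has no population data

# Treat a population of 0 as missing so the division yields NaN rather than inf
pop = sr_pop_sample.replace(0, np.nan)

# axis='columns' matches the population index against the column names
df_rate_sample = df_cases_sample.div(pop, axis='columns').fillna(0.0)
print(df_rate_sample)
```

Applied to the total cases this would give the prevalence, and applied to the daily or weekly cases the corresponding incidence.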


## Global data

Lastly we will create one dataset that contains all the different metrics for the global population.

In [8]:
global_pop = df_populations.sum(axis=0).values
dict_global = {'Date': dates}

dict_global['cases_daily'] = df_cases_daily.sum(axis=1).values
dict_global['cases_total'] = df_cases_total.sum(axis=1).values

dict_global['prevalence'] = df_cases_total.sum(axis=1).values / global_pop
dict_global['incidence_daily'] = df_cases_daily.sum(axis=1).values / global_pop

dict_global['deaths_daily']  = df_deaths_daily.sum(axis=1).values
dict_global['deaths_total']  = df_deaths_total.sum(axis=1).values
dict_global['mortality']  = (df_deaths_total.sum(axis=1) / df_cases_total.sum(axis=1)).values


In [9]:
sr_weekly_cases = df_cases_weekly.sum(axis=1)
sr_weekly_deaths = df_deaths_weekly.sum(axis=1)

cases_list = []
deaths_list = []
for date in dates:
    if date in dates_weekly:
        cases_list.append(sr_weekly_cases[date])
        deaths_list.append(sr_weekly_deaths[date])
    else: 
        cases_list.append(0)
        deaths_list.append(0)
        
dict_global['cases_weekly'] = cases_list
dict_global['deaths_weekly'] = deaths_list
dict_global['incidence_weekly'] = cases_list / global_pop
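
The loop above places each weekly value on its own date and fills every other day with 0. A minimal sketch of the same alignment using `Series.reindex`, on made-up values (the `*_sample` names are illustrative):

```python
import pandas as pd

dates_sample = pd.date_range('2020-03-01', periods=8, freq='D')
sr_weekly_sample = pd.Series([5, 9], index=[dates_sample[0], dates_sample[7]])

# Align the weekly values to the daily index; days without a weekly value get 0
sr_aligned = sr_weekly_sample.reindex(dates_sample, fill_value=0)
print(sr_aligned.tolist())
```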

In [10]:
df_global = pd.DataFrame(dict_global).set_index('Date')
display(df_global)
df_global.to_csv('../data/global.csv')

Date,cases_daily,cases_total,prevalence,incidence_daily,deaths_daily,deaths_total,mortality,cases_weekly,deaths_weekly,incidence_weekly
2019-12-31,27,27,3.600777e-09,3.600777e-09,0,0,0.000000,0,0,0.000000
2020-01-01,0,27,3.600777e-09,0.000000e+00,0,0,0.000000,0,0,0.000000
2020-01-02,0,27,3.600777e-09,0.000000e+00,0,0,0.000000,0,0,0.000000
2020-01-03,17,44,5.867933e-09,2.267156e-09,0,0,0.000000,0,0,0.000000
2020-01-04,0,44,5.867933e-09,0.000000e+00,0,0,0.000000,0,0,0.000000
...,...,...,...,...,...,...,...,...,...,...
2020-03-31,62443,777796,1.037285e-04,8.327530e-06,3697,37271,0.047919,0,0,0.000000
2020-04-01,73512,851308,1.135322e-04,9.803716e-06,4614,41885,0.049201,0,0,0.000000
2020-04-02,77128,928436,1.238182e-04,1.028595e-05,4998,46883,0.050497,0,0,0.000000
2020-04-03,71813,1000249,1.333953e-04,9.577133e-06,4631,51514,0.051501,0,0,0.000000


## Continental data

In [11]:
continents = ['afrika', 'asia', 'europe', 'north_america', 'oceania', 'south_america']

afrika = ['Malawi', 'Niger', 'Nigeria', 'Zimbabwe','Zambia','United Republic of Tanzania','Uganda','Tunisia','Togo','Sudan','South Africa','Somalia','Sierra Leone','Seychelles', 'Senegal','Rwanda', 'Namibia','Mozambique','Morocco','Mauritius','Mauritania','Mali','Madagascar','Libya','Liberia', 'Kenya','Guinea Bissau','Guinea', 'Ghana', 'Gambia', 'Gabon','Ethiopia', 'Eswatini','Eritrea','Equatorial Guinea','Egypt', 'Djibouti', 'Democratic Republic of the Congo','Cote dIvoire', 'Congo', 'Chad', 'Central African Republic', 'Algeria', 'Angola', 'Benin', 'Botswana', 'Burkina Faso', 'Burundi', 'Cameroon', 'Cape Verde']
asia = ['Cases on an international conveyance Japan', 'Vietnam','Uzbekistan', 'United Arab Emirates','Turkey','Timor Leste','Thailand','Taiwan','Syria','Sri Lanka','South Korea','Singapore','Saudi Arabia','Qatar', 'Philippines','Palestine','Pakistan','Oman', 'Nepal','Myanmar', 'Mongolia','Maldives','Malaysia','Lebanon','Laos','Kyrgyzstan', 'Kuwait','Jordan', 'Japan', 'Israel','Iraq','Iran','Indonesia','India','China', 'Afghanistan', 'Armenia', 'Bahrain', 'Bangladesh', 'Bhutan', 'Brunei Darussalam', 'Cambodia']
europe = ['United Kingdom','Ukraine','Switzerland','Sweden','Spain','Slovenia','Slovakia','Serbia','San Marino','Romania', 'Russia','Portugal','Poland','Norway', 'North Macedonia','Netherlands','Monaco','Montenegro','Moldova','Malta','Luxembourg', 'Lithuania','Liechtenstein','Latvia','Kosovo','Kazakhstan', 'Jersey','Italy', 'Isle of Man','Ireland','Azerbaijan','Georgia','Iceland', 'Hungary', 'Holy See','Guernsey', 'Greece', 'Gibraltar', 'Germany', 'France', 'Finland', 'Faroe Islands', 'Estonia','Denmark','Czech Republic', 'Cyprus', 'Croatia', 'Albania', 'Andorra', 'Austria', 'Belarus', 'Belgium', 'Bosnia and Herzegovina', 'Bulgaria']
north_america = ['Belize', 'United States of America','United States Virgin Islands','Turks and Caicos islands','Sint Maarten','Saint Vincent and the Grenadines','Saint Lucia','Saint Kitts and Nevis','Saint Barthelemy','Puerto Rico', 'Panama','Nicaragua', 'Montserrat','Mexico', 'Jamaica', 'Honduras', 'Haiti', 'Guatemala', 'Grenada', 'Greenland','El Salvador','Dominican Republic', 'Dominica', 'Cuba', 'Cayman Islands', 'Anguilla', 'Antigua and Barbuda', 'Bahamas', 'Barbados', 'Bermuda', 'British Virgin Islands', 'Canada']
oceania = ['Papua New Guinea','Northern Mariana Islands','New Zealand','New Caledonia','Guam','French Polynesia', 'Fiji','Australia']
south_america = ['Bonaire, Saint Eustatius and Saba', 'Venezuela','Uruguay','Trinidad and Tobago','Suriname','Peru','Paraguay', 'Guyana', 'Ecuador', 'CuraÃ§ao', 'Costa Rica', 'Colombia', 'Chile', 'Argentina', 'Aruba', 'Bolivia', 'Brazil']


afrika.sort()
asia.sort()
europe.sort()
north_america.sort()
south_america.sort()
oceania.sort()

# Countries present in the data but not assigned to any continent
missing = []
for country in countries:
    if country not in afrika and country not in asia and country not in europe and country not in north_america and country not in south_america and country not in oceania:
        missing.append(country)
            
# Countries listed under a continent but absent from the data
too_many = []
for continent in continents:
    for country in eval(continent):
        if country not in countries:
            too_many.append(country)
            
if missing or too_many:
    print(f'The following countries are missing: {missing}')
    print(f'The following countries are too many: {too_many}')
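
The consistency check between the country columns and the continent lists can also be sketched with set differences. The country names and the dictionary below are made up for illustration and are not the real lists:

```python
# Hypothetical sample data standing in for the real country lists
countries_sample = ['Belgium', 'Kenya', 'Peru']
continent_lists = {
    'afrika': ['Kenya'],
    'europe': ['Belgium', 'Atlantis'],  # 'Atlantis' is a made-up extra entry
}

# Every country that appears in at least one continent list
assigned = {c for members in continent_lists.values() for c in members}

missing_sample = sorted(set(countries_sample) - assigned)  # in the data, in no list
extra_sample = sorted(assigned - set(countries_sample))    # in a list, not in the data
print(missing_sample, extra_sample)
```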

In [None]:
for continent in continents:
    # Look up the list of countries that belongs to this continent name
    members = eval(continent)
    continent_pop = df_populations.loc[members].sum(axis=0).values
    dict_continent = {'Date': dates}

    dict_continent['cases_daily'] = df_cases_daily[members].sum(axis=1).values
    dict_continent['cases_total'] = df_cases_total[members].sum(axis=1).values

    dict_continent['prevalence'] = df_cases_total[members].sum(axis=1).values / continent_pop
    dict_continent['incidence_daily'] = df_cases_daily[members].sum(axis=1).values / continent_pop

    dict_continent['deaths_daily'] = df_deaths_daily[members].sum(axis=1).values
    dict_continent['deaths_total'] = df_deaths_total[members].sum(axis=1).values
    dict_continent['mortality'] = (df_deaths_total[members].sum(axis=1) / df_cases_total[members].sum(axis=1)).values
    
    sr_weekly_cases = df_cases_weekly[members].sum(axis=1)
    sr_weekly_deaths = df_deaths_weekly[members].sum(axis=1)

    # Weekly values only exist on the weekly dates; all other days get 0
    cases_list = []
    deaths_list = []
    for date in dates:
        if date in dates_weekly:
            cases_list.append(sr_weekly_cases[date])
            deaths_list.append(sr_weekly_deaths[date])
        else: 
            cases_list.append(0)
            deaths_list.append(0)
        
    dict_continent['cases_weekly'] = cases_list
    dict_continent['deaths_weekly'] = deaths_list
    # Weekly incidence relative to the continent's population
    dict_continent['incidence_weekly'] = cases_list / continent_pop
    
    print(continent)
    df_continent = pd.DataFrame(dict_continent).set_index('Date')
    display(df_continent)
    df_continent.to_csv('../data/df_' + continent + '.csv')
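
Looking up each continent's list by name can also be done with a dictionary, which avoids evaluating strings as code. A minimal sketch on made-up data (the `continent_members` mapping and the sample frame are illustrative, not the real lists):

```python
import pandas as pd

idx = pd.date_range('2020-03-01', periods=2, freq='D')
df_cases_sample = pd.DataFrame(
    {'Kenya': [1, 2], 'Ghana': [0, 3], 'Belgium': [4, 5]},
    index=idx,
)

# Map each continent name to its member list instead of looking it up with eval
continent_members = {'afrika': ['Ghana', 'Kenya'], 'europe': ['Belgium']}

# Sum the member columns per day for every continent
continent_cases = {
    name: df_cases_sample[members].sum(axis=1)
    for name, members in continent_members.items()
}
df_continent_cases = pd.DataFrame(continent_cases)
print(df_continent_cases)
```

The dictionary keys can then double as the continent names used in the output file names.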