Some technical details of data preparation are presented in appendices, in order to keep the session notebooks simple.

# IMF World Economic Outlook Database

The first database we deal with is the IMF's World Economic Outlook (WEO) Database. The URL of the October 2020 file is:

https://www.imf.org/-/media/Files/Publications/WEO/WEO-Database/2020/02/WEOOct2020all.ashx

Importing the data into R did not go entirely smoothly:

In Bash, we download the data:

In [None]:
datapath="$HOME/data_ad454"  # "~" is not expanded inside double quotes in Bash, so use $HOME

In [None]:
curl "https://www.imf.org/-/media/Files/Publications/WEO/WEO-Database/2020/02/WEOOct2020all.ashx" > ${datapath}/csv/01_01_weodata.xls

We load some libraries into R:

In [None]:
library(data.table)
library(tidyverse)
library(readxl)
library(readr)

options(repr.matrix.max.rows=20, repr.matrix.max.cols=15) # limit the number of rows and columns shown when tables are printed

We set the path to the data directory (so that if the path changes, only this line needs to be modified):

In [None]:
datapath <- "~/data_ad454"

The tryCatch formulation is used here so that a "run all cells" action does not stop when an error is encountered while you try to reproduce the results.

In case of an error:
- Without tryCatch, the error halts the execution of the notebook, and you have to continue from the next cell manually
- With tryCatch, the execution is not halted, and the error message is printed or returned for information
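The difference can be seen on a toy example. This is a minimal sketch that simulates a failing import with `stop()`; the error text is made up and is not what the real import produces.

```r
# Toy illustration (not part of the original workflow): the handler turns
# an error into an ordinary return value, so the code after it keeps running.
result <- tryCatch(
    {
        stop("cannot open file")   # simulate a failing import
    },
    error = function(e)
    {
        conditionMessage(e)        # return the message instead of halting
    }
)

# Execution reaches this line, and result holds the error message
result
```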

In [None]:
tryCatch(
    {
        weo <- read_xls(sprintf("%s/csv/01_01_weodata.xls", datapath), 1)
    },
    error = function(e)
    {
        print(as.character(e$message))
    }
)

The file could not be read as an Excel file.

Let's try reading it as a CSV file:

In [None]:
tryCatch(
    {
        weo <- fread(sprintf("%s/csv/01_01_weodata.xls", datapath))
    },
    error = function(e)
    {
        print(as.character(e$message))
    }
)

Some non-printable characters were detected, so the file is possibly being treated as binary.

Check the encoding of the file:

In [None]:
file -bi ${datapath}/csv/01_01_weodata.xls

Yes, it is treated as a binary file. Let's strip the non-printable characters and save the result as a TSV (tab-separated values) file:

In [None]:
# keep only tab (\11), line feed (\12), carriage return (\15) and printable ASCII (\40-\176)
tr -cd '\11\12\15\40-\176' < ${datapath}/csv/01_01_weodata.xls > ${datapath}/csv/01_01_weodata.tsv
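To see what this filter keeps and drops, the same character set can be expressed as a regular expression in R. This is only a sketch on a made-up string, not on the file itself: R strings cannot hold NUL bytes, so the `tr` command remains the practical route for the real file.

```r
# Made-up string with two non-printable characters (octal 001 and 177)
x <- "GDP\001\tvalue\177\n"

# Remove everything except tab, line feed, carriage return and
# printable ASCII (0x20 to 0x7E), mirroring the character set given to tr
cleaned <- gsub("[^\\t\\n\\r\\x20-\\x7E]", "", x, perl = TRUE)

nchar(x)        # length before cleaning
nchar(cleaned)  # length after cleaning
```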

The final line is a footer:

In [None]:
tail -1 ${datapath}/csv/01_01_weodata.tsv

Let's delete it as well (the command below drops the last two lines of the file, i.e. the footer and the line before it):

In [None]:
head -n-2 ${datapath}/csv/01_01_weodata.tsv > ${datapath}/csv/01_01_weodata_2.tsv
mv ${datapath}/csv/01_01_weodata_2.tsv ${datapath}/csv/01_01_weodata.tsv

In [None]:
tryCatch(
    {
        weo <- fread(sprintf("%s/csv/01_01_weodata.tsv", datapath))
    },
    error = function(e)
    {
        print(as.character(e$message))
    }
)

In [None]:
weo %>% str

The file is imported, but all numeric columns come in as character. There are two reasons for that:

- Empty, "--" and "n/a" values should be treated as missing (NA)
- The thousands separator is ","
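Both problems can be reproduced on a handful of values in base R. The values below are made up in the style of the raw file; they are not actual WEO figures.

```r
# Made-up values in the style of the raw file (not actual WEO figures)
raw <- c("1,234.5", "--", "n/a", "", "42")

# Direct conversion fails for anything with a comma or a placeholder
direct <- suppressWarnings(as.numeric(raw))

# Mark the placeholders as missing, then drop the thousands separator
cleaned <- raw
cleaned[cleaned %in% c("", "--", "n/a")] <- NA
cleaned <- as.numeric(gsub(",", "", cleaned, fixed = TRUE))

direct   # only "42" survives the direct conversion
cleaned  # "1,234.5" is now parsed as a number too
```

A single unparseable value is enough to make a reader keep the whole column as character, which is why every year column is affected.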

read_delim() and its variants from the readr package are better options for importing files with such issues, since the separators and the missing-value markers can be declared explicitly:

In [None]:
weo <- readr::read_tsv(sprintf("%s/csv/01_01_weodata.tsv", datapath),
                       locale = locale(decimal_mark = ".", grouping_mark = ","),
                       na = c("", "--", "n/a")
                       )

In [None]:
weo %>% str

Let's get rid of the redundant spec attribute:

In [None]:
attributes(weo)$spec <- NULL

Let's convert it to a data.table (by reference):

In [None]:
setDT(weo)

In [None]:
weo %>% str

Now we can do a few more things to make this dataset more manageable:

- Delete some unnecessary columns such as `Estimates Start After`, `X57` or `Country/Series-specific Notes`
- Keep only the WEO subject codes in the main dataset and move the other subject-related columns (as unique rows) to a separate definitions file
- Keep only the ISO country codes and move the other country-related columns (as unique rows) to a separate countries file. In fact, the countrycode package handles the mapping between country codes and names

Delete columns:

In [None]:
weo[, c("Country/Series-specific Notes", "Estimates Start After", "Subject Notes", "X57") := NULL]

In [None]:
weo %>% str

Column names containing spaces are awkward to refer to.

Let's substitute them with underscores:

In [None]:
setnames(weo,
        names(weo),
        names(weo) %>% str_replace_all(" ", "_"))

In [None]:
names(weo)

Now let's get the unique rows of the country-related columns:

In [None]:
weo_countries <- weo %>%
    select(c("ISO", "WEO_Country_Code", "Country")) %>%
    unique %>%
    arrange(ISO)

In [None]:
weo_countries %>% str

In [None]:
weo_countries %>% glimpse

Keep ISO and delete other columns from weo:

In [None]:
weo[, c("WEO_Country_Code", "Country") := NULL]

In [None]:
weo %>% glimpse

Now let's do the same for the subject-related columns:

In [None]:
weo_subject <- weo %>%
    select(c("WEO_Subject_Code", "Subject_Descriptor", "Units", "Scale")) %>%
    unique

In [None]:
weo_subject %>% glimpse

In [None]:
weo[, c("Subject_Descriptor", "Units", "Scale") := NULL]

In [None]:
weo %>% str
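Nothing is lost by moving the descriptive columns out: they can be joined back by the key whenever they are needed. The sketch below uses base R `merge()` on made-up rows that stand in for `weo` and `weo_subject`; the object names `main`, `subjects` and `rejoined` and all values are hypothetical.

```r
# Made-up rows standing in for weo (values are not real WEO figures)
main <- data.frame(
    ISO = c("TUR", "TUR", "DEU"),
    WEO_Subject_Code = c("NGDP_R", "LUR", "LUR"),
    `2019` = c(1.0, 2.0, 3.0),
    check.names = FALSE
)

# Made-up lookup table standing in for weo_subject
subjects <- data.frame(
    WEO_Subject_Code = c("NGDP_R", "LUR"),
    Subject_Descriptor = c("Gross domestic product, constant prices",
                           "Unemployment rate")
)

# Left join: every row of main gets its descriptor back
rejoined <- merge(main, subjects, by = "WEO_Subject_Code", all.x = TRUE)
rejoined
```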

Now we can serialize these objects so that they can be reproduced exactly in another session with a single line of code:

In [None]:
saveRDS(weo, sprintf("%s/rds/01_01_weo.rds", datapath))
saveRDS(weo_subject, sprintf("%s/rds/01_01_weo_subject.rds", datapath))
saveRDS(weo_countries, sprintf("%s/rds/01_01_weo_countries.rds", datapath))

We can import them with the readRDS() function and assign the result to an object whose name we choose.

With RData files (save() and load()), by contrast, the objects are restored automatically under the names they were saved with.
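The contrast between the two approaches can be sketched with a toy object and temporary files. The names `toy`, `renamed`, `rds_file` and `rdata_file` are hypothetical and have nothing to do with the course data.

```r
# Toy object and temporary files (hypothetical, not the course data)
toy <- data.frame(ISO = c("DEU", "TUR"), value = c(1.5, 2.5))
rds_file <- tempfile(fileext = ".rds")
rdata_file <- tempfile(fileext = ".RData")

# saveRDS()/readRDS(): a single object, and the caller chooses the new name
saveRDS(toy, rds_file)
renamed <- readRDS(rds_file)

# save()/load(): the object comes back under the name it was saved with
save(toy, file = rdata_file)
rm(toy)
restored_names <- load(rdata_file)  # load() returns the names it restored

restored_names           # the name under which the object was restored
all.equal(renamed, toy)  # both routes give back the same data
```

Since load() silently overwrites any existing object of the same name, readRDS() is usually the safer choice when the session already holds objects.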

# COVID data from thevirustracker

Next, we have detailed daily data on COVID-19 for 179 countries, covering the first few months of the pandemic.

The data was provided in JSON format by the REST API of thevirustracker.com, which is now defunct.

First, let's see what the JSON looks like:

In [None]:
jq -C . $datapath/json/covid/TR.json

JSON has several convenient properties:

- It has a simple, extensible and flexible format, so the schema does not have to be declared beforehand (as fixed columns, etc.)
- It can easily represent hierarchical, nested and/or semi-structured data
- It is easy to serialize and deserialize (serde), which makes it convenient for sharing across hosts
- It can be parsed by many programming languages and tools
- Most data exchanged between web servers and clients (browsers, mobile apps, etc.) is in JSON format

However, JSON can also be flattened into a tabular form for easier analysis.
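As a minimal sketch of flattening, jsonlite's `fromJSON()` has a `flatten` argument that spreads nested fields into ordinary columns. The JSON below is made up for illustration; its fields are hypothetical and do not reproduce the API response used in this appendix.

```r
library(jsonlite)

# Made-up JSON with one nested level (hypothetical fields, not the real API response)
txt <- '[
  {"code": "TR", "stats": {"cases": 10, "deaths": 1}},
  {"code": "DE", "stats": {"cases": 20, "deaths": 2}}
]'

nested <- fromJSON(txt)                  # "stats" is a data frame inside a data frame
flat   <- fromJSON(txt, flatten = TRUE)  # nested fields become stats.cases, stats.deaths

names(nested)
names(flat)
```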

In [None]:
library(jsonlite)
library(listviewer)
library(tidyverse)
library(data.table)
library(imputeTS)
library(RcppRoll)
library(countrycode)

First define a path for covid data:

In [None]:
covidpath <- sprintf("%s/json/covid", datapath)
covidpath

Read all json data into a list:

In [None]:
all_countries <- lapply(list.files(covidpath, full.names = TRUE), fromJSON)

We can navigate through the nested object:

In [None]:
all_countries %>% jsonedit(mode = "form")

It seems that the first part of each country's data holds the country code information and the second part holds the actual data.

The last element of the actual data is a status code (OK), which can be dropped.

We must stitch those two parts together for each country and then combine all countries into a single large data table:

In [None]:
all_countries_2 <- lapply(all_countries, function(datax) # for each country
    {
        data1 <- datax[[2]] # get the actual data
        data1 <- data1[-length(data1)] # delete the last part
        if(length(data1) > 0) # if there is any data at all
        {
            # the dates are stored as the names of the per-date elements:
            # extract them as a Date column and bind that column to each date's data
            data1b <- mapply(function(x,y) cbind(date = x, y), as.Date(names(data1), format = "%m/%d/%y"), data1, SIMPLIFY = FALSE)
            data1c <- bind_rows(data1b) # stitch the part of each date into a single table
            data2 <- datax[[1]][[1]] # get the country code part
            data_c <- cbind(data2, data1c) # combine two parts
            data_c
        }
    }
)
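The core idea, turning element names into a date column, can be sketched in base R on a tiny made-up list. The field names and values below are assumptions for illustration only, and the sketch loops over indices instead of using `mapply()`.

```r
# Made-up structure mimicking one country's timeline (not real figures)
timeline <- list(
    "3/10/20" = list(new_daily_cases = 1, total_cases = 1),
    "3/11/20" = list(new_daily_cases = 0, total_cases = 1),
    "stat"    = "ok"
)

timeline <- timeline[-length(timeline)]  # drop the trailing status element
dates <- as.Date(names(timeline), format = "%m/%d/%y")

# One single-row data frame per date, with the date as the first column
rows <- lapply(seq_along(timeline), function(i) {
    data.frame(date = dates[i], as.data.frame(timeline[[i]]))
})

# Stack the per-date rows into one table
daily <- do.call(rbind, rows)
daily
```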

In [None]:
all_countries_2 %>% jsonedit(mode = "form")

Now there is a regular, unnested table for each country.

Let's combine them into a single large table:

In [None]:
all_countries_3 <- bind_rows(all_countries_2)

And convert it to a data.table:

In [None]:
setDT(all_countries_3)

See whether there are any missing values:

In [None]:
all_countries_3[,lapply(.SD, function(x) sum(is.na(x)))]

Where do the missing values occur?

In [None]:
all_countries_3[is.na(code), title %>% unique]

In [None]:
codelist2 <- as.data.table(codelist)

In [None]:
codelist2[country.name.en == "Namibia"] %>% t

Do you see it? Namibia's two-letter code is "NA", and it is interpreted as a missing value (NA) when the JSON is converted to an R object.

Let's restore the code:

In [None]:
all_countries_3[title == "Namibia", code := "NA"]

Now let's check the dates:

In [None]:
all_dates <- all_countries_3[, unique(date) %>% sort]

In [None]:
daterange <- all_dates %>% range
daterange

In [None]:
dateseq <- seq(daterange[1], daterange[2], by = 1)

Are there any missing dates?

In [None]:
(all_dates[-1] - all_dates[-length(all_dates)])

In [None]:
missing_dates <- dateseq[!(dateseq %in% all_dates)]
missing_dates

There are no missing dates
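For contrast, the same check on a toy vector with a deliberate gap shows what a missing date would look like. The dates are made up.

```r
# Made-up dates with one day deliberately left out
toy_dates <- as.Date(c("2020-03-01", "2020-03-02", "2020-03-04"))

# Full daily sequence between the first and the last date
toy_range <- range(toy_dates)
toy_seq <- seq(toy_range[1], toy_range[2], by = 1)

# Dates in the full sequence that do not appear in the data
toy_missing <- toy_seq[!(toy_seq %in% toy_dates)]

toy_missing      # the skipped day
diff(toy_dates)  # gaps between consecutive dates, in days
```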

In [None]:
all_countries_3 %>% str

In [None]:
all_countries_3 %>% DT::datatable()

Let's serialize it as rds:

In [None]:
saveRDS(all_countries_3, sprintf("%s/rds/01_02_covid.rds", datapath))

# Credit Suisse World Wealth Database

This panel dataset contains extensive information on the distribution of financial and non-financial wealth.

In [None]:
wealth <- read_xlsx(sprintf("%s/xlsx/01_03_wealth_dt.xlsx", datapath), 1)

In [None]:
setDT(wealth)

In [None]:
wealth %>% str

This dataset has a long history, from a large PDF file through many extractions, transformations, integrations and imputations.

I skip that story for the time being.

Let's convert spaces in column names to underscores:

In [None]:
setnames(wealth, names(wealth),
        names(wealth) %>% str_replace_all(" ", "_"))

In [None]:
names(wealth)

Let's serialize it as rds:

In [None]:
saveRDS(wealth, sprintf("%s/rds/01_03_wealth.rds", datapath))