In [None]:
library(tidyverse)
library(data.table)
library(plotly) # for interactive plotting
library(DT) # for interactive tabulation
library(countrycode) # for country code and name integration

In [None]:
options(repr.matrix.max.rows=20, repr.matrix.max.cols=15) # limit the number of rows and columns shown when tables are printed

We will examine three panel datasets that have country and date/year dimensions

# IMF WEO dataset

The WEO dataset has extensive macroeconomic information since 1980.

For details on how the dataset is prepared, see 01_2_appendix.ipynb.

## Import the data

We set the path to the data directory as a variable:

In [None]:
datapath <- "../data"

Read the binary files for objects:

In [None]:
weo <- readRDS(sprintf("%s/rds/01_01_weo.rds", datapath))
weo_countries <- readRDS(sprintf("%s/rds/01_01_weo_countries.rds", datapath))
weo_subject <- readRDS(sprintf("%s/rds/01_01_weo_subject.rds", datapath))

Let's get an overall summary of the data objects. str() or glimpse() will do that job:

Country codes and names:

In [None]:
weo_countries %>% str

The "%>%" operator "pipes" the output of the former instruction into the latter one as its first input.

In fact, "pipes" have been a very common programming concept since the mid-1970s.

In this excerpt from an interview, Brian Kernighan, a highly influential computer scientist, explains the pipe concept briefly and clearly:

[![](https://img.youtube.com/vi/L9GfCgLLZYE/0.jpg)](https://www.youtube.com/watch?v=L9GfCgLLZYE)
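
To make the pipe concrete, here is a minimal sketch on a made-up vector (the variable names are illustrative and not part of the WEO data). It shows that a piped chain and the equivalent nested call give the same result:

```r
library(magrittr)  # provides %>%; the tidyverse re-exports it

x <- c(4, 9, 16)   # toy input, chosen only for illustration

# Piped: read left to right, each result becomes the next call's first argument
piped <- x %>% sqrt %>% sum

# Nested: the same computation, read inside out
nested <- sum(sqrt(x))

# Extra arguments stay where they are; the piped value fills the first slot
rounded <- (x / 3) %>% round(digits = 1)
```

Since R 4.1 there is also a native pipe, `|>`, which behaves similarly for simple chains like these.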

Subject (data field) codes, descriptions and related info

In [None]:
weo_subject %>% str

And the main dataset:

In [None]:
weo %>% str

The problems with this format are:

- Features (subjects) are not in separate fields, so they are hard to access
- Years are in different columns, so they are hard to filter

## Reshape WEO data

We will reshape the data so that subjects are in their respective columns, while years are in a single column ready to be filtered.

In [None]:
weo %>% names

Using tidyr's gather():

In [None]:
weo_long1 <- weo %>%
    gather("year", "value", `1980`:`2025`, # reshape, names of new columns, and the columns to be molten
           na.rm = T) %>% # delete na's
    mutate(year = as.integer(year)) %>% # convert years to integer
    remove_rownames %>% # automatic rownames are redundant
    as.data.table # convert to data.table object

In [None]:
weo_long1
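
As a side note, the tidyr documentation marks `gather()` as superseded by `pivot_longer()`. The following is a minimal sketch of the same reshape on a toy table. The values are made up and the table only mimics the assumed layout of `weo`:

```r
library(tidyr)

# Toy stand-in for the wide WEO layout (values are invented for illustration)
toy <- data.frame(
  ISO = c("AAA", "BBB"),
  WEO_Subject_Code = c("LP", "LP"),
  `1980` = c(1, NA),
  `1981` = c(2, 3),
  check.names = FALSE  # keep the year columns named exactly "1980", "1981"
)

toy_long <- pivot_longer(
  toy,
  cols = `1980`:`1981`,    # the year columns to stack
  names_to = "year",       # former column names go here
  values_to = "value",     # former cell values go here
  values_drop_na = TRUE    # drop rows with missing values
)
toy_long$year <- as.integer(toy_long$year)  # years arrive as character
```

Note that `pivot_longer()` returns rows grouped by the original row rather than by year column, so the row order can differ from the `gather()` result.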

Using data.table's melt()

In [None]:
weo_long2 <- weo %>%
            melt(id.vars = c("ISO", "WEO_Subject_Code"), # reshape, columns to keep
                 variable.name = "year", # name of the field that the columns names will be converted into
                 variable.factor = F, # that variable column will not be of factor type
                 na.rm = T) %>% # skip na's
    mutate(year = as.integer(year)) # convert the years to integer

In [None]:
weo_long2

See whether they are identical

In [None]:
identical(weo_long1,
          weo_long2)
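
`identical()` is strict: it also compares attributes such as a data.table key, so it can return FALSE for tables that hold the same values. In that case a more tolerant check may be more informative. A minimal sketch on toy tables (made-up values), using data.table's `all.equal()` method:

```r
library(data.table)

# Two toy tables with the same content (values invented for illustration)
a <- data.table(ISO = c("AAA", "BBB"), year = c(2000L, 2000L), value = c(1.5, 2.5))
b <- copy(a)
setkey(b, ISO)  # adds a key attribute to b; the data itself is unchanged

strict_same <- identical(a, b)                                   # attributes differ
values_same <- isTRUE(all.equal(a, b, check.attributes = FALSE)) # compares values only
```

Here `strict_same` is FALSE only because of the key, while `values_same` is TRUE.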

The long format is a good step toward the **tidy data** format.

What is **tidy data**?

According to Hadley Wickham:

> Every column is a variable.
>
> Every row is an observation.
>
> Every cell is a single value.


[Tidy Data vignette](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html)

We have a single observation in each row. However, the variables should be in separate columns.

For that, we will reshape into wide format and hence "cast" the data:

Using tidyr's spread():

In [None]:
weo_wide1 <- weo_long1 %>%
    spread(key = "WEO_Subject_Code", value = "value") %>%
    remove_rownames

In [None]:
weo_wide1
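
Similarly, `spread()` is marked as superseded by `pivot_wider()` in the tidyr documentation. A minimal sketch on a toy long table (made-up values, using the same column names as above):

```r
library(tidyr)

# Toy long table: one row per country, year and subject (values are invented)
toy_long <- data.frame(
  ISO = c("AAA", "AAA", "BBB", "BBB"),
  year = c(2000L, 2000L, 2000L, 2000L),
  WEO_Subject_Code = c("LP", "PPPSH", "LP", "PPPSH"),
  value = c(10, 0.5, 20, 1.5)
)

# Each subject code becomes its own column, filled from "value"
toy_wide <- pivot_wider(
  toy_long,
  names_from = WEO_Subject_Code,
  values_from = value
)
```

The result has one row per country and year, with `LP` and `PPPSH` as separate columns.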

Using data.table's dcast:

In [None]:
weo_wide2 <- weo_long2 %>%
    dcast(ISO + year ~ WEO_Subject_Code, value.var = "value") %>% # name the value column explicitly instead of letting dcast guess it
    remove_rownames

In [None]:
weo_wide2

Now we can select some variables of interest and easily create summaries and visualizations to better understand the data.

In [None]:
weo_subject %>% datatable(
  filter = "top",
  options = list(pageLength = 20)
)

## Share in World Economy

PPPSH shows the relative weight of a country in the world economy, adjusted for purchasing power differences.

We can track how the center of gravity of the global economy has changed by looking at this measure.

Furthermore, we can measure to what extent economic power is concentrated over time.

In [None]:
weo_sub <- weo_wide2 %>% select(c("ISO", "year", "PPPSH"))

In [None]:
weo_sub

First let's see whether the shares sum up to 100 for each year:

With the data.table approach:

In [None]:
weo_sub[, .(sum = sum(PPPSH, na.rm = T)), by = year]

With some rounding errors, the shares sum up to 100

With the dplyr approach:

In [None]:
weo_sub %>%
    group_by(year) %>%
    summarise(sum = sum(PPPSH, na.rm = T))

### HHI concentration

Let's calculate the Herfindahl–Hirschman Index to summarize the concentration of economic sizes:

When a single economy holds a 100% share, HHI equals 1. As total size becomes more dispersed across countries, HHI approaches 0.
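
With shares $s_i$ expressed as fractions of the total, the index is $HHI = \sum_i s_i^2$. The following sketch checks the two limiting cases on made-up share vectors. The helper name `hhi` is hypothetical and takes shares in percent, like PPPSH:

```r
# Hypothetical helper: HHI from shares given in percent
hhi <- function(shares_pct) sum((shares_pct / 100)^2, na.rm = TRUE)

one_economy  <- hhi(100)          # a single economy with the whole share
four_equal   <- hhi(rep(25, 4))   # four economies of equal size
hundred_even <- hhi(rep(1, 100))  # one hundred economies of equal size
```

For n equally sized economies the index is 1/n, so it shrinks toward 0 as the total is spread over more countries.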

In [None]:
weo_hhi <- weo_sub[, .(hhi = sum((PPPSH/100)^2, na.rm = T)), by = year][order(year)]
weo_hhi %>% head

Let's visualize:

In [None]:
weo_hhi %>% ggplot(aes(x = year, y = hhi)) +
geom_line()

Figures after 2019 can be interpreted as forecasts. Let's show them with a different color:

In [None]:
weo_hhi_line2 <- weo_hhi %>% mutate(forecast = year > 2019) %>%
    ggplot(aes(x = year, y = hhi, color = forecast)) +
    geom_line()

weo_hhi_line2

ggplot2 is the most common visualization package in R.

The basic idea is that features are added to or modified on a base plot object using the "+" notation.
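
The layering idea can be sketched step by step on a toy series (the values are made up and the object names are illustrative):

```r
library(ggplot2)

# Toy series, invented for illustration
toy <- data.frame(year = 2000:2004, hhi = c(0.10, 0.09, 0.08, 0.085, 0.09))

p <- ggplot(toy, aes(x = year, y = hhi))  # base object: data and aesthetic mapping
p <- p + geom_line()                      # add a line layer
p <- p + geom_point()                     # add a point layer on top of it
p <- p + labs(title = "Toy HHI series",   # modify the labels
              x = "Year", y = "HHI")
```

Each `+` returns a new plot object, so intermediate versions can be stored and reused, as done with `weo_hhi_line2` above.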

For the basics needed to create visually appealing plots, please refer to:

[A quick introduction to ggplot2](https://towardsdatascience.com/a-quick-introduction-to-ggplot2-d406f83bb9c9)

[Data Visualisation Chapter from r4ds](https://r4ds.had.co.nz/data-visualisation.html)

and

[![](https://img.youtube.com/vi/hr2X7rmkprM/0.jpg)](https://www.youtube.com/watch?v=hr2X7rmkprM)

Let's create a more interactive version of this plot easily:

In [None]:
weo_hhi_line2 %>% ggplotly

plotly is a JavaScript-powered interactive graphics library.

The great thing about plotly is that you can convert a static ggplot2 visualization into an interactive plotly one with a single line of code.

The video linked below shows the difference between static and interactive visualizations and the basic features plotly offers:

[![](https://img.youtube.com/vi/7rvHnmRsE8w/0.jpg)](https://www.youtube.com/watch?v=7rvHnmRsE8w)

We see that after a long decline in concentration between 1989 and 2008, economic power has steadily become more concentrated, and this trend is forecast to continue.

What happened in 2008?

### Ranks and top 20

Now it is better to show more detail and stack the trajectories for countries.

However, there are too many of them, which would yield an overcrowded plot that is hard to interpret.

Let's compact the dataset and show only the 20 largest economies each year, while the rest will be shown as a single separate group.

In [None]:
weo_sub[, sharerank := frank(-PPPSH, ties.method = "first"), by = year]
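
To see what the ranking step does, here is a minimal sketch on a toy table (made-up shares). Negating the share makes the largest economy rank first within each year:

```r
library(data.table)

# Toy shares for three countries over two years (values are invented)
toy <- data.table(
  year  = c(2000L, 2000L, 2000L, 2001L, 2001L, 2001L),
  ISO   = c("AAA", "BBB", "CCC", "AAA", "BBB", "CCC"),
  PPPSH = c(20, 50, 30, 40, 35, 25)
)

# Rank within each year; ties.method = "first" breaks ties by row order
toy[, sharerank := frank(-PPPSH, ties.method = "first"), by = year]

# The largest economy of each year
leaders <- toy[sharerank == 1, ISO]
```

In this toy example BBB leads in 2000 and AAA in 2001.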

See the progression of largest economies:

In [None]:
weo_sub[sharerank == 1][order(year)]

China takes the lead from 2017 on in PPP terms

Get the largest 20 economies:

In [None]:
weo20 <- weo_sub[sharerank <= 20][order(year)]

In [None]:
weo20

Calculate the share of rest:

In [None]:
weorest <- weo20[, .(ISO = "zzz", PPPSH = 100 - sum(PPPSH), sharerank = 21), by = year]
weorest

The rest of the world after 20 largest economies has a total economic size near that of the largest economy

Now let's combine both

In [None]:
weo20rest <- bind_rows(weo20, weorest)
weo20rest

### Join with country names

ISO codes do not read well in a plot, so let's join with country names:

dplyr approach:

In [None]:
weo20rest2 <- weo20rest %>% left_join(weo_countries %>% select(-WEO_Country_Code), by = "ISO")
weo20rest2

Treat the NA country names:

In [None]:
weo20rest2[ISO == "zzz", Country := "Rest of the World"]

In [None]:
weo20rest2

data.table approach:

In [None]:
weo20rest2b <- weo_countries[, .SD, .SDcols = -"WEO_Country_Code"][weo20rest, on = "ISO"]

In [None]:
weo20rest2b[ISO == "zzz", Country := "Rest of the World"]

In [None]:
weo20rest2b

And using merge:

In [None]:
weo20rest2c <- merge(weo20rest, weo_countries[, .SD, .SDcols = -"WEO_Country_Code"],
     by = "ISO",
     all.x = T)

In [None]:
weo20rest2c[ISO == "zzz", Country := "Rest of the World"]

In [None]:
weo20rest2c
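
The join variants above should give the same country names. A minimal sketch on toy tables (made-up codes and names) comparing `merge()` with the data.table `X[Y, on = ]` join:

```r
library(data.table)

# Toy inputs; "zzz" has no match in the lookup table, "CCC" is unused
shares    <- data.table(ISO = c("AAA", "BBB", "zzz"), PPPSH = c(60, 30, 10))
countries <- data.table(ISO = c("AAA", "BBB", "CCC"),
                        Country = c("Country A", "Country B", "Country C"))

# Left join with merge(): keep every row of shares
via_merge <- merge(shares, countries, by = "ISO", all.x = TRUE)

# Same left join in data.table syntax: look up each row of shares in countries
via_join <- countries[shares, on = "ISO"]

# The unmatched row gets an NA name, which we fill in
via_merge[ISO == "zzz", Country := "Rest of the World"]
via_join[ISO == "zzz", Country := "Rest of the World"]
```

Both results have three rows; only the column order differs.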

### Stacked area

Now let's show the progression of world share as a single stacked area chart

In [None]:
options(repr.plot.width = 20, repr.plot.height = 20)

In [None]:
weo20_stacked <- weo20rest2c %>%
    arrange(year, desc(PPPSH)) %>%
    ggplot(aes(x = year, y = PPPSH, fill = Country)) +
    geom_area(position = "stack")

In [None]:
weo20_stacked
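
One caveat, stated as an assumption about recent ggplot2 versions: the order in which areas are stacked generally follows the levels of the `fill` variable rather than the row order set by `arrange()`. A minimal sketch on toy data (made-up values) that orders the levels by size with base R's `reorder()`:

```r
library(ggplot2)

# Toy shares for two countries over two years (values are invented)
toy <- data.frame(
  year    = rep(2000:2001, times = 2),
  Country = rep(c("Alpha", "Zed"), each = 2),
  PPPSH   = c(50, 48, 10, 12)
)

# Turn Country into a factor whose levels are sorted by mean share (ascending)
toy$Country <- reorder(toy$Country, toy$PPPSH)

toy_stacked <- ggplot(toy, aes(x = year, y = PPPSH, fill = Country)) +
  geom_area(position = "stack")
```

Here the levels become "Zed", "Alpha" instead of the alphabetical default, which changes the stacking and legend order accordingly.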

In [None]:
weo20_stacked %>% ggplotly

See how shares of US and China swapped

# COVID dataset

The now-defunct thevirustracker.com API provided detailed daily data on COVID cases for 190+ countries.

The dataset covers the first few months of the pandemic.

For details on how the dataset is prepared, see 01_2_appendix.ipynb.

In [None]:
covid <- readRDS(sprintf("%s/rds/01_02_covid.rds", datapath))

In [None]:
covid

This is the range of dates:

In [None]:
covid[, range(date)]

Let's delete some unnecessary columns:

In [None]:
covid[, c("ourid", "source") := NULL]

In [None]:
covid

## Merge with iso3c

Total case counts on their own do not mean much.

It is better to express them relative to population.

We can merge with WEO data.

However, we have a problem:

The two-letter country codes in the COVID data do not match the three-letter country codes in WEO.

We can use the countrycode package for that.
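
Besides the `codelist` lookup table, the package also offers the `countrycode()` function for direct conversion. A minimal sketch on a few well-known codes (the input vector is illustrative, not taken from the COVID data):

```r
library(countrycode)

iso2 <- c("US", "TR", "DE")  # two-letter ISO codes, chosen for illustration

# Convert two-letter codes to three-letter codes
iso3 <- countrycode(iso2, origin = "iso2c", destination = "iso3c")
```

Codes that cannot be matched are returned as NA with a warning, which is another way to spot the problematic entries.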

Select only a few columns:

In [None]:
codelist2 <- codelist %>% select(c("country.name.en", "country.name.en.regex", "imf", "iso2c", "iso3c", "iso3n"))

In [None]:
setDT(codelist2)

Check the two-letter codes that appear in the covid data but not in the iso2c column of codelist:

In [None]:
missing_codes <- setdiff(covid[, code %>% unique],
        codelist2$iso2c)

In [None]:
missing_codes

Which countries are they?

In [None]:
covid[code %in% missing_codes, .(title, code)] %>% unique

Diamond Princess was a cruise ship that was quarantined after a large number of cases. We can skip it for the moment.

Let's check whether Kosovo shows up in the codelist:

In [None]:
codelist2 %>% datatable

It has an IMF code but not iso2c or iso3c codes.

Since it is a small country, we can also skip it for the moment, in order not to complicate things

In [None]:
covid2 <- covid %>% inner_join(codelist2 %>% select("iso2c", "iso3c"), by = c("code" = "iso2c"))
covid2

## Merge with IMF data

Now let's switch to IMF subject description:

In [None]:
weo_subject %>% datatable(
  filter = "top",
  options = list(pageLength = 20)
)

LP is the right subject, population in millions

In [None]:
weo_lp <- weo_wide2 %>% select(c("ISO", "year", "LP"))

In [None]:
weo_lp

Filter for 2020

In [None]:
weo_lp2020 <- weo_lp[year == 2020]

In [None]:
weo_lp2020

Now let's merge!

In [None]:
covid2 %>% names

In [None]:
covid3 <- covid2 %>% inner_join(weo_lp2020, by = c("iso3c" = "ISO"))
covid3

An inner join keeps only the rows whose keys appear on both sides, so countries missing from either table are dropped.
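
A minimal sketch of the difference between an inner and a left join, on toy tables with made-up codes and values:

```r
library(dplyr)

# Toy inputs: "CCC" has no population entry, "DDD" has no case data
cases <- data.frame(iso3c = c("AAA", "BBB", "CCC"), total_cases = c(100, 200, 300))
pop   <- data.frame(ISO = c("AAA", "BBB", "DDD"), LP = c(1, 2, 4))

# Inner join: only codes present in both tables survive
inner <- inner_join(cases, pop, by = c("iso3c" = "ISO"))

# Left join: every row of cases is kept, unmatched LP becomes NA
left <- left_join(cases, pop, by = c("iso3c" = "ISO"))
```

The inner join returns two rows (AAA, BBB), while the left join returns three, with a missing `LP` for CCC.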

## Total cases per million

Calculate the total cases per million population:

In [None]:
covid3[, total_cases_pm := total_cases / LP]

In [None]:
covid3

Now let's see the countries with the highest total cases per million population reached by early April:

In [None]:
tcpm <- covid3[, .SD, .SDcols = c("title", "iso3c", "total_cases_pm")][
    , .(max_tcpm = max(total_cases_pm)), by = c("iso3c", "title")][order(-max_tcpm)]

In [None]:
tcpm

More prosperous nations tended to have higher total cases per million.

Let's plot in bars:

In [None]:
tcpm_plot <- tcpm[1:20] %>% mutate(title = reorder(title, max_tcpm)) %>%
ggplot(aes(x = title, y = max_tcpm)) +
geom_bar(stat="identity") + coord_flip()

In [None]:
tcpm_plot

In [None]:
tcpm_plot %>% ggplotly

Maybe we can combine this with some economic measure?

Let's leave it for the future.

# Wealth dataset

This dataset was extracted from a PDF report by Credit Suisse, then transformed and wrangled extensively to make it usable.

In [None]:
wealth <- readRDS(sprintf("%s/rds/01_03_wealth.rds", datapath))

In [None]:
wealth

In [None]:
wealth %>% names

Wealth_per_adult is financial wealth + non-financial wealth - debt, all per adult.

The gap between median and mean wealth can indicate how unequally wealth is distributed in the country.
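
To illustrate why the mean and median diverge under inequality, here is a minimal sketch on two made-up wealth vectors with the same mean. The helper name `mean_to_median` is hypothetical:

```r
# Two toy populations with the same mean wealth (values are invented)
wealth_equal  <- c(20, 20, 20, 20, 20)  # everyone holds the same amount
wealth_skewed <- c(1, 1, 1, 1, 96)      # one person holds almost everything

# Hypothetical helper: ratio of mean to median wealth
mean_to_median <- function(x) mean(x) / median(x)

ratio_equal  <- mean_to_median(wealth_equal)
ratio_skewed <- mean_to_median(wealth_skewed)
```

The ratio is 1 for the equal population and 20 for the skewed one, because a few large holdings pull the mean up while the median stays with the typical person.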

## Merge with WEO

We can also combine this data with, for example, per capita income from the WEO dataset.

In [None]:
weo_subject %>% datatable(
  filter = "top",
  options = list(pageLength = 20)
)

NGDPDPC, GDP per capita in current USD is the right measure to combine

In [None]:
weo_gpc <- weo_wide2 %>% select(c("ISO", "year", "NGDPDPC", "LP"))

We have to join by two columns:

In [None]:
wealth2 <- wealth %>% inner_join(weo_gpc,
                      by = c("iso3c" = "ISO",
                             "year" = "year"))

And calculate wealth to income, and adult to population (correcting for scale):

In [None]:
wealth2[, wealth2income := Wealth_per_adult / NGDPDPC]
wealth2[, adult2pop := Adults / LP / 1000]
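
The division by 1000 suggests that `Adults` is recorded in thousands while `LP` is in millions. That unit for `Adults` is an assumption inferred from the scale correction, not something stated in the data. A worked sketch with made-up numbers:

```r
# Assumed units: adults in thousands, population in millions (values invented)
adults_thousands <- 50000  # i.e. 50 million adults
lp_millions      <- 80     # i.e. 80 million people

# Formula used above
adult2pop <- adults_thousands / lp_millions / 1000

# Same ratio after converting both figures to persons
adult2pop_check <- (adults_thousands * 1e3) / (lp_millions * 1e6)
```

Both give 0.625, so under this assumption `adult2pop` is the fraction of the population that is adult.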

In [None]:
wealth2

## Wealth to income

Let's select 10 sample countries

In [None]:
iso_samp <- wealth2[, unique(iso3c) %>% sample(10)]
iso_samp
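
Because `sample()` is random, the selected countries change on every run. If a reproducible selection is preferred, the random seed can be fixed first. A minimal sketch on made-up codes (the seed value 42 is arbitrary):

```r
codes <- c("AAA", "BBB", "CCC", "DDD", "EEE", "FFF")  # toy country codes

set.seed(42)                    # fix the random number generator state
first_draw <- sample(codes, 3)

set.seed(42)                    # same seed, same state
second_draw <- sample(codes, 3)
```

With the same seed, both draws return the same three codes.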

And plot their wealth2income measures across time together:

In [None]:
w2i <- wealth2[iso3c %in% iso_samp] %>%
    select("Country", "year", "wealth2income") %>%
    ggplot(aes(x = year, y = wealth2income, color = Country)) +
geom_line()

In [None]:
w2i

In [None]:
w2i %>% ggplotly

What other economic measures can wealth2income be related to?

Think about this ...