In [None]:
library(tidyverse)
library(data.table)
library(plotly) # for interactive plotting
library(DT) # for interactive tabulation

In [None]:
options(repr.matrix.max.rows=20, repr.matrix.max.cols=15) # limit the number of rows and columns shown when tables are printed

In [None]:
datapath <- "../data"

# Visually Examining Cross Variable Relationships

In this session, we will visually examine relationships across variables.

Let's first import the objects for the WEO dataset: 

In [None]:
# wide data with features in the columns and countries/years in the rows
weo_wide2 <- readRDS(sprintf("%s/rds/01_01_weo_wide2.rds", datapath))

In [None]:
weo_countries <- readRDS(sprintf("%s/rds/01_01_weo_countries.rds", datapath))
weo_subject <- readRDS(sprintf("%s/rds/01_01_weo_subject.rds", datapath))

Remember the nice widget to navigate through and search in tabular data:

In [None]:
weo_subject %>% datatable(
  filter = "top",
  options = list(pageLength = 20)
)

## Feature engineering

First, I want to select a handful of the 45 variables:

In [None]:
features <- c("NGDP_RPCH", "NGDPRPPPPC", "NID_NGDP", "GGXONLB_NGDP", "BCA_NGDPD")

See what they are:

In [None]:
weo_subject[WEO_Subject_Code %in% features]

The primary balance is the government budget balance before interest payments, analogous to earnings before interest, taxes, depreciation and amortization (EBITDA) for a company.

We will keep the GDP growth (NGDP_RPCH) and investment/GDP (NID_NGDP) as is, while we will discretize other variables into factors.

Let's filter for 2019's data, extract necessary columns and keep those rows with no NA values:

In [None]:
weo_sub <- weo_wide2 %>%
    filter(year == 2019) %>%
    select(all_of(c("ISO", "year", features))) %>%
    na.omit # deletes rows with na's
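
As a quick illustration of what `na.omit` does in the pipeline above, here is a minimal sketch on a toy data frame. The country codes and values are hypothetical and are not taken from the WEO data.

```r
# Hypothetical toy data (illustration only, not WEO data)
toy <- data.frame(
  ISO        = c("AAA", "BBB", "CCC"),
  growth     = c(2.1, NA, 3.4),
  investment = c(20, 25, NA),
  stringsAsFactors = FALSE
)

# na.omit drops every row that has an NA in at least one column
clean <- na.omit(toy)
clean

# Number of rows removed
nrow(toy) - nrow(clean)
```

Note that a single missing value is enough to drop a whole row, so the more columns are selected, the more countries may be lost.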

In [None]:
weo_sub

First, let's create the income level factor variable again:

In [None]:
weo_sub[, income_level := cut(NGDPRPPPPC,
                              quantile(c(-Inf, NGDPRPPPPC, Inf),
                                       c(0, 0.25, 0.75, 1), na.rm = TRUE),
                              labels = c("low", "medium", "high"),
                              ordered_result = TRUE)] # makes the factor an ordered one, which allows comparison operators
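
To see what a quantile-based `cut()` does, here is a minimal sketch on a toy vector. The numbers in `gdp_pc` are made up for illustration, and the breaks are built slightly differently from the cell above (the infinite bounds are appended after the quartiles are computed), which is a choice of this sketch rather than the notebook's method.

```r
# Hypothetical GDP per capita values (illustration only, not WEO data)
gdp_pc <- c(1200, 3500, 8000, 15000, 22000, 41000, 56000, 63000)

# Break points: below the 1st quartile, between the quartiles, above the 3rd quartile
breaks <- c(-Inf, quantile(gdp_pc, c(0.25, 0.75)), Inf)

income_level <- cut(gdp_pc, breaks,
                    labels = c("low", "medium", "high"),
                    ordered_result = TRUE)

table(income_level)                 # counts per level
income_level[1] < income_level[8]   # ordered factors support comparison operators
```

By construction, roughly a quarter of the observations fall into "low", half into "medium" and a quarter into "high".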

Now let's cut the current account balance into "deficit" and "surplus", using 0 as the break point:

In [None]:
weo_sub[, external_balance := cut(BCA_NGDPD,
                                  c(-Inf, 0, Inf),
                                  labels = c("deficit", "surplus"),
                                  ordered_result = TRUE)]

And cut the primary balance into "negative" and "positive", again using 0 as the break point:

In [None]:
weo_sub[, primary_balance := cut(GGXONLB_NGDP,
                                 c(-Inf, 0, Inf),
                                 labels = c("negative", "positive"),
                                 ordered_result = TRUE)]
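
One detail of cutting at 0 is worth knowing: by default `cut()` uses right-closed intervals, so a balance of exactly 0 lands in the lower category ("deficit" or "negative"). The sketch below shows this on made-up values; the `right = FALSE` variant is shown only for comparison and is not used in this notebook.

```r
# Hypothetical balances in percent of GDP (illustration only)
balance <- c(-2.5, 0, 1.3)

# Default right = TRUE gives the intervals (-Inf, 0] and (0, Inf]
default_cut <- cut(balance, c(-Inf, 0, Inf),
                   labels = c("negative", "positive"))
default_cut

# right = FALSE gives [-Inf, 0) and [0, Inf), so 0 moves to "positive"
left_closed <- cut(balance, c(-Inf, 0, Inf),
                   labels = c("negative", "positive"),
                   right = FALSE)
left_closed
```

With real-valued ratios an exact 0 is rare, so the choice seldom matters in practice, but it is good to know which side the boundary belongs to.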

Let's see:

In [None]:
weo_sub

Let's get rid of year and the original fields that have been converted to factors:

In [None]:
weo_sub[, c("year", "NGDPRPPPPC", "BCA_NGDPD", "GGXONLB_NGDP") := NULL] # setting a column to NULL in data.table is equal to deleting the column

In [None]:
weo_sub

And give other variables more meaningful names:

In [None]:
setnames(weo_sub, c("NGDP_RPCH", "NID_NGDP"), c("growth", "investment")) # old names first, then new names

Now, our data is ready:

In [None]:
weo_sub

## Frequencies

We have three factor variables:

In [None]:
weo_sub %>% keep(is.factor) %>% lapply(levels) # select only factor variables and show the levels for each

Let's calculate the three dimensional frequencies:

In [None]:
crosst <- xtabs(~ external_balance + primary_balance + income_level, data = weo_sub)

In [None]:
crosst

Note that low and medium income countries mostly run a current account deficit (roughly, they import more than they export), while the opposite holds for high income countries.

For low and medium income countries, the primary balance (government budget balance before interest payments) is also mostly negative, while only half of the high income countries run a primary deficit.
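
Raw counts are hard to compare when the groups differ in size, so it can help to convert them into shares within each group. The following is a minimal sketch on a two-way toy table; the data are hypothetical and only meant to show how `xtabs` and `prop.table` fit together.

```r
# Hypothetical country classifications (illustration only, not WEO data)
toy <- data.frame(
  external_balance = factor(c("deficit", "deficit", "surplus",
                              "deficit", "surplus", "surplus"),
                            levels = c("deficit", "surplus")),
  income_level = factor(c("low", "low", "low", "high", "high", "high"),
                        levels = c("low", "high"))
)

# Two-way frequency table
tab <- xtabs(~ external_balance + income_level, data = toy)
tab

# Shares within each income level (margin = 2 normalises each column to 1)
shares <- prop.table(tab, margin = 2)
round(shares, 2)
```

For the three-way table above, something like `prop.table(crosst, margin = 3)` should give the shares within each income level in the same way.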

## Scatter plots and best fit lines

Let's create a scatter plot across investment/GDP and GDP growth rates.

We differentiate the points by income level:

In [None]:
plot1 <- weo_sub %>%
ggplot(aes(x = investment, y = growth, color = income_level)) + # define the aesthetics: variables and their roles in the plot
geom_point() # add the scatter plot

In [None]:
plot1 %>% ggplotly

A positive relationship is visible, but let's draw best fit lines through the points for each income level:

In [None]:
plot2 <- weo_sub %>%
ggplot(aes(x = investment, y = growth, color = income_level)) + # define the aesthetics: variables and their roles in the plot
geom_point() + # add the scatter plot
geom_smooth(method = "lm", formula = y ~ x, se = F) # add best fit line

In [None]:
plot2 %>% ggplotly

You can find more information on scatter plots and best fit lines at the link below:

[ggplot2 scatter plots: Quick start guide](http://www.sthda.com/english/wiki/ggplot2-scatter-plots-quick-start-guide-r-software-and-data-visualization)

Note that for medium income countries the best fit line has a steeper slope: a given increase in investment/GDP is associated with a larger increase in growth than for low and high income countries.
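
The slopes can also be read off numerically rather than judged by eye. The sketch below fits a separate linear model per group with base R's `lm`, which is the same kind of fit that `geom_smooth(method = "lm")` draws. The toy data are hypothetical and constructed to lie exactly on two lines, so the slopes are known in advance.

```r
# Hypothetical data lying exactly on two lines (illustration only)
toy <- data.frame(
  investment   = c(10, 20, 30, 10, 20, 30),
  growth       = c(1, 2, 3, 4, 7, 10),
  income_level = rep(c("low", "medium"), each = 3)
)

# Fit growth ~ investment separately for each group and extract the slope
slopes <- sapply(
  split(toy, toy$income_level),
  function(d) coef(lm(growth ~ investment, data = d))[["investment"]]
)
slopes
```

A slope of 0.3 means that one additional percentage point of investment/GDP goes together with 0.3 percentage points of additional growth in that group.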

So far, we could visualize three dimensions in a single plot.

Now let's add a fourth one, and split the plot horizontally for current account deficit and surplus countries using facet_wrap:

In [None]:
plot3 <- weo_sub %>%
ggplot(aes(x = investment, y = growth, color = income_level)) + # define the aesthetics: variables and their roles in the plot
geom_point() + # add the scatter plot
geom_smooth(method = "lm", formula = y ~ x, se = F) + # add best fit line
facet_wrap(~ external_balance) # split plot across external balance

In [None]:
plot3 %>% ggplotly

We see that the best fit lines are steeper for countries running a current account deficit: the association between investment/GDP and growth is stronger than for current account surplus countries.

Now let's add a fifth dimension, and split the plot into a grid across external balance levels and primary balance levels:

In [None]:
plot5 <- weo_sub %>%
ggplot(aes(x = investment, y = growth, color = income_level)) + # define the aesthetics: variables and their roles in the plot 
geom_point() + # add the scatter plot
geom_smooth(method = "lm", formula = y ~ x, se = F) + # add best fit line
facet_grid(primary_balance ~ external_balance) # split plot across primary balance and external balance

In [None]:
plot5 %>% ggplotly

You can find more information on ggplot facets at the link below:

[ggplot2 facet : split a plot into a matrix of panels](http://www.sthda.com/english/wiki/ggplot2-facet-split-a-plot-into-a-matrix-of-panels)

## Correlations

Best fit lines show the direction and slope of the relationship between two variables but do not tell us how strong the relationship is.

For this, we calculate the correlation coefficient:

In [None]:
weo_sub[, cor(investment, growth)]

This correlation is not very strong.
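
The slope of a best fit line and the correlation coefficient are closely linked: for a simple linear regression, the slope equals the correlation multiplied by the ratio of the standard deviations, and the squared correlation equals the share of variance explained (R-squared). The sketch below checks both identities on made-up numbers.

```r
# Hypothetical values (illustration only, not WEO data)
investment <- c(15, 18, 22, 25, 30, 35)
growth     <- c(1.0, 3.5, 2.0, 4.5, 3.0, 5.5)

fit   <- lm(growth ~ investment)
slope <- coef(fit)[["investment"]]
r     <- cor(investment, growth)

# Slope recovered from the correlation and the two standard deviations
slope_from_r <- r * sd(growth) / sd(investment)

# Share of variance in growth explained by investment
r_squared <- summary(fit)$r.squared

c(slope = slope, slope_from_r = slope_from_r, r = r, r_squared = r_squared)
```

This is why a steep line can still go with a weak correlation: the slope depends on the units and spread of the variables, while the correlation measures how tightly the points cluster around the line.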

Let's see whether the correlation coefficient is similar for each income level:

In [None]:
weo_sub[, .(n = .N, cor = cor(investment, growth)), # get the counts and correlation
        by = c("income_level")][order(-cor)] # for each income level and order by decreasing correlations

For low income countries, the relationship between investment and growth is much weaker.
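
When groups are small, correlation coefficients are noisy, so it may be worth looking at a confidence interval before reading too much into differences between groups. The sketch below uses base R's `cor.test` on simulated data; the sample size of 12 and the data-generating process are assumptions made for illustration, and the interval relies on the usual normality assumption behind the Pearson test.

```r
# Simulated data (illustration only): a small sample with a moderate relationship
set.seed(1)
x <- rnorm(12)
y <- 0.5 * x + rnorm(12)

# Pearson correlation with a 95% confidence interval
ct <- cor.test(x, y)

ct$estimate   # the correlation coefficient
ct$conf.int   # lower and upper bound of the 95% interval
```

A wide interval signals that the sample is too small to pin the correlation down precisely.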