In [None]:
library(tidyverse)
library(data.table)
library(plotly) # for interactive plotting
library(DT) # for interactive tabulation
library(broom) # for tidy statistical summaries
library(caret) # for regression performance measures
library(magrittr)

In [None]:
options(repr.matrix.max.rows=20, repr.matrix.max.cols=15) # limit the number of rows and columns shown when tables are printed

In [None]:
datapath <- "~/data_ad454"

# Logistic Regression on IMF WEO Dataset

Suppose you are the chief economist at a supranational economic agency.
The agency's main mission is to predict economic crises in national economies and to advise remedies, along with financial support, that prevent or lessen the effects of those crises.
A "crisis" is defined as negative real economic growth.
Note that the classes are imbalanced: cases of growth outnumber cases of crisis, so crises are relatively rare.
Even so, given the agency's mission, the cost of missing an upcoming crisis (a false negative) is too high.
At the same time, the agency's limited resources make it infeasible to grant support to too many countries that are predicted to have a crisis but end up growing (false positives), which is the more likely outcome.

## Preliminary data preparation

Let's first import the objects for the WEO dataset: 

In [None]:
# wide data with features in the columns and countries/years in the rows
weo_wide2 <- readRDS(sprintf("%s/rds/01_01_weo_wide2.rds", datapath))

In [None]:
weo_countries <- readRDS(sprintf("%s/rds/01_01_weo_countries.rds", datapath))
weo_subject <- readRDS(sprintf("%s/rds/01_01_weo_subject.rds", datapath))

Select some of the variables:

In [None]:
vars2 <- c("NGDP_RPCH","NID_NGDP", "LUR", "GGXONLB_NGDP", "BCA_NGDPD")
vars <- c("ISO", "year", vars2)

Now we wrangle the data to keep only the selected years, and reshape it so that each year's values of a variable are in a separate column:

In [None]:
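weo_wide3 <- weo_wide2 %>%
    filter(year %in% c(2016, 2018, 2019)) %>%
    select(all_of(vars)) %>%
    # reshape to long format: one row per ISO, year and variable
    gather("key", "value", -ISO, -year) %>%
    as.data.table %>%
    # keep 2019 only for the growth rate; the other variables use 2016 and 2018
    filter(key == "NGDP_RPCH" | year %in% 2016:2018) %>%
    # reshape to wide format: one column per variable-year pair, e.g. LUR_2018
    dcast(ISO ~ key + year, value.var = "value")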

In [None]:
weo_wide3

See what those selected variables are:

In [None]:
weo_subject[WEO_Subject_Code %in% vars2]

## Task definition

Your task is to devise a logistic regression model that predicts crises ahead of time from past economic data, as follows:
- The independent variables are the differences between the 2018 and 2016 values of the selected subjects. For example, for total investment it is NID_NGDP_2018 - NID_NGDP_2016. You can calculate them one by one in successive steps of a pipe or all at once using data.table notation; choose whichever approach is easiest for you.
- The dependent variable is a binary column derived from NGDP_RPCH_2019. A crisis (NGDP_RPCH_2019 < 0) is the positive case, coded as 1 (just as the detection of a disease is reported as "positive").
- Omit rows with NA values
- Partition the data into train and test sets. Set a fixed seed for reproducibility.
- You may try different model specifications (selection of variables, higher-degree polynomial terms, discretizations, interaction terms, exclusion of the intercept, etc.), different cutoff points for assigning classes, or different train/test split ratios. You may still opt for a simpler model in the end.
- You may check the collinearity of the variables included in the model
- Compare and interpret the null and residual deviances
- Calculate and interpret confusion matrices and ROC curves
- Sensitivity, i.e. the true positive rate (TP / (TP + FN)), should be no less than 0.8 for both the train and the test sets
- Specificity, i.e. the true negative rate (TN / (TN + FP)), should be no less than 0.5 for both the train and the test sets
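
To make the requirements above concrete, here is a minimal sketch on made-up data. It is not the solution: the table `toy`, its three country codes, all the numbers and the helper `class_rates` are hypothetical and only mimic the column naming of `weo_wide3`. It shows one possible way to build the difference columns and the binary target with data.table, and how the TP / (TP + FN) and TN / (TN + FP) rates from the last two requirements (commonly called sensitivity and specificity) follow from a chosen cutoff.

```r
library(data.table)

# Hypothetical stand-in for weo_wide3 (made-up countries and values)
toy <- data.table(
  ISO            = c("AAA", "BBB", "CCC"),
  NID_NGDP_2016  = c(20.0, 25.0, 18.0),
  NID_NGDP_2018  = c(22.5, 24.0, NA),
  LUR_2016       = c(8.0, 5.0, 10.0),
  LUR_2018       = c(7.0, 6.5, 9.0),
  NGDP_RPCH_2019 = c(1.2, -0.8, 2.0)
)

subjects <- c("NID_NGDP", "LUR")

# Independent variables: 2018 value minus 2016 value, one column per subject
for (s in subjects) {
  diff_values <- toy[[paste0(s, "_2018")]] - toy[[paste0(s, "_2016")]]
  set(toy, j = paste0(s, "_diff"), value = diff_values)
}

# Dependent variable: crisis (negative real growth in 2019) coded as 1
toy[, crisis := as.integer(NGDP_RPCH_2019 < 0)]

# Keep only the modelling columns and drop rows with any NA
keep_cols <- c("ISO", paste0(subjects, "_diff"), "crisis")
model_dt <- na.omit(toy[, .SD, .SDcols = keep_cols])

# Hypothetical helper: true positive and true negative rates for a cutoff
class_rates <- function(actual, prob, cutoff) {
  predicted <- as.integer(prob >= cutoff)
  tp <- sum(predicted == 1 & actual == 1)
  fn <- sum(predicted == 0 & actual == 1)
  tn <- sum(predicted == 0 & actual == 0)
  fp <- sum(predicted == 1 & actual == 0)
  list(sensitivity = tp / (tp + fn), specificity = tn / (tn + fp))
}

# Made-up labels and predicted probabilities, for illustration only
actual <- c(1, 1, 1, 1, 1, 0, 0, 0, 0, 0)
prob   <- c(0.90, 0.60, 0.40, 0.35, 0.10, 0.50, 0.45, 0.20, 0.15, 0.05)

rates <- class_rates(actual, prob, cutoff = 0.3)
print(model_dt)
print(rates)
```

In a real solution the probabilities would come from the fitted model, for example from `predict(..., type = "response")` on a `glm` fitted with `family = binomial`. Because crises are rare, a cutoff below 0.5 may be needed to reach the required rate on the positive class, at the cost of more false positives.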

## Solution