In [None]:
library(tidyverse)
library(data.table)
library(plotly) # for interactive plotting
library(DT) # for interactive tabulation
library(broom) # for tidy statistical summaries
library(caret) # for regression performance measures

In [None]:
options(repr.matrix.max.rows=20, repr.matrix.max.cols=15) # for limiting the number of top and bottom rows of tables printed 

In [None]:
datapath <- "~/data_ad454"

# Simple Linear Regression on IMF WEO Dataset

Let's first import the objects for the WEO dataset: 

In [None]:
# wide data with features in the columns and countries/years in the rows
weo_wide2 <- readRDS(sprintf("%s/rds/01_01_weo_wide2.rds", datapath))

In [None]:
weo_countries <- readRDS(sprintf("%s/rds/01_01_weo_countries.rds", datapath))
weo_subject <- readRDS(sprintf("%s/rds/01_01_weo_subject.rds", datapath))

Remember the nice widget to navigate through and search in tabular data:

In [None]:
weo_subject %>% datatable(
  filter = "top",
  options = list(pageLength = 20)
)

You are supposed to create a simple linear regression model to estimate the "NGDP_RPCH" feature for 2019 data in weo_wide2.

What does NGDP_RPCH stand for?

In [None]:
weo_subject[WEO_Subject_Code == "NGDP_RPCH"]

You should select a reasonable independent variable from the available features to explain this dependent variable.

The independent variable should not be any of the following ones:

In [None]:
weo_subject[str_detect(Subject_Descriptor, "Gross domestic product")]

So the independent variable can be any one of the following:

In [None]:
weo_subject[!str_detect(Subject_Descriptor, "Gross domestic product")]  %>% datatable(
  filter = "top",
  options = list(pageLength = 20)
)

Alternatively, you may create a new feature from several of the allowed features above using any kind of mathematical transformation. Note that only one independent variable can appear on the right-hand side of the model.
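As a minimal sketch of such a derived feature, the block below subtracts one growth rate from another on a toy table. The column names and values are hypothetical, not taken from weo_wide2. It also shows that an NA in either input carries over into the new feature, which matters for the missing-case limit.

```r
# Toy table with hypothetical columns (not real WEO data)
toy <- data.frame(
  country = c("A", "B", "C"),
  export_growth = c(4.0, -1.5, 2.0),
  import_growth = c(3.0, 0.5, NA)
)

# New feature built from two existing ones
toy$net_trade_growth <- toy$export_growth - toy$import_growth

# An NA in either input becomes an NA in the derived feature
n_missing <- sum(is.na(toy$net_trade_growth))
toy
n_missing
```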

You may want to inspect the pairwise relationship among multiple features visually to detect the candidates to be included in the model. Some possible tools are as follows:

- pairs()
- psych::pairs.panels()
- GGally::ggpairs()
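A minimal sketch of the first tool, base R's `pairs()`, on simulated data. The column names and the relationships between them are assumptions made up for illustration; with the real data you would pass a few selected columns of weo_wide2 instead.

```r
# Simulated stand-in for a few features (hypothetical names and values)
set.seed(1)
n <- 100
toy <- data.frame(export_growth = rnorm(n, mean = 3, sd = 4))
toy$import_growth <- 0.6 * toy$export_growth + rnorm(n, sd = 3)
toy$gdp_growth <- 1 + 0.2 * toy$export_growth + rnorm(n, sd = 2)

# Scatterplot matrix: one panel for every pair of columns
pairs(toy, main = "Pairwise relationships (simulated data)")

# Numeric companion to the plot: pairwise Pearson correlations
cor_mat <- round(cor(toy), 2)
cor_mat
```

The correlation matrix is only a rough screen, since it captures linear association and hides curved relationships that the scatterplots can reveal.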

The total number of missing cases (NAs) in the variable(s) that you use, whether directly as the independent variable or indirectly to calculate a new feature, should not exceed 30. You can check this for a column with `function(x) sum(is.na(x))`.
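One way to apply that check to every column at once is `sapply()`. The sketch below uses a small toy table with hypothetical columns, and a limit of 1 instead of 30 so that the filter has a visible effect.

```r
# Toy table with some missing values (hypothetical columns)
toy <- data.frame(
  a = c(1, NA, 3, NA, 5),
  b = c(NA, 2, 3, 4, 5),
  c = c(1, 2, 3, 4, 5)
)

count_na <- function(x) sum(is.na(x))

# Apply the counter to every column; the result is a named vector
na_counts <- sapply(toy, count_na)
na_counts

# Columns that stay within the chosen limit (30 in the assignment, 1 here)
max_na <- 1
eligible <- names(na_counts)[na_counts <= max_na]
eligible
```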

The steps you are required to follow are as follows:

- Show the calculations to create a new feature as an independent variable if you do so (if you include a feature directly skip this step)
- Filter for year 2019 and select the dependent and independent variables
- Create a scatterplot between the pair including a best fit line
- Split the dataset randomly into train and test partitions. The ratio of train partition to the overall set is supposed to be between 0.5 and 0.7
- Create and run a simple linear regression model and assign the model into a named object
- Print the summary of the model and interpret it in a few words. **Note that the coefficient of the independent variable should be significantly different from 0 at the 0.05 significance level. If not, please select another independent variable**
- Create four vectors for predictions/actual values of train/test datasets
- Print two scatter plots: 1) predictions vs actual values of train set, 2) predictions vs actual values of test set. You may need to combine the predictions and actual values into a data.table with appropriate names (one data.table for train, one data.table for the test set) in order to feed into ggplot + geom_point. Please also add a diagonal line and main title. 
- Compare the R2 and RMSE of train and test datasets and interpret the comparison with a few words

**Hint: Most steps follow the 03_02_simple_regression.ipynb notebook**

## Answer

**Tuana Damla Ünal**
<br> **2017301168**

In [None]:
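# Keep 2019, the two model variables and complete cases only,
# then keep rows whose TX_RPCH lies between its 5th and 95th percentiles
weo_sub <- weo_wide2 %>%
  filter(year == 2019) %>%
  select(TX_RPCH, NGDP_RPCH) %>%
  na.omit() %>%
  filter(between(TX_RPCH, quantile(TX_RPCH, 0.05), quantile(TX_RPCH, 0.95)))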

Exports of goods and services bring income and foreign currency into a country, which in turn raises GDP. This is why the growth of export volume can be a good independent variable for a GDP growth model.

In [None]:
weo_sub %>%
  ggplot(aes(x = TX_RPCH, y = NGDP_RPCH)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

The scatterplot and the best-fit line show a positive relationship between the growth of export volume and the growth of real GDP. However, the quality of the fit cannot be judged from the plot alone.


In [None]:
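set.seed(1234)
train_ratio <- 0.7
# sample(.N, k) draws k distinct row indices from all .N rows;
# sample(k) alone would only shuffle the first k rows
train_indices <- weo_sub[, sample(.N, floor(.N * train_ratio))]
train_data <- weo_sub[train_indices]
test_data <- weo_sub[-train_indices]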
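A point worth checking when drawing split indices is how `sample()` treats a single number. The sketch below, on a hypothetical table of 10 rows, contrasts `sample(n, k)`, which draws from all rows, with `sample(k)`, which only shuffles the first k rows.

```r
set.seed(42)
n <- 10                      # number of rows in a hypothetical table
train_ratio <- 0.7
k <- floor(n * train_ratio)  # size of the train partition

# sample(n, k): k distinct indices drawn from the whole range 1..n
idx_random <- sample(n, k)

# sample(k): a permutation of 1..k, so rows k+1..n can never be selected
idx_prefix <- sample(k)

sort(idx_prefix)   # always 1 2 3 4 5 6 7
sort(idx_random)   # a subset of 1..10 that depends on the seed
```

With the second form the test partition would always consist of the last rows of the table, so the split would depend on the row order rather than on chance.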

In [None]:
model1 <- lm(NGDP_RPCH ~ TX_RPCH, data = train_data)

In [None]:
summary(model1)
tidy(model1)

I randomly assigned 70% of the data to the train set and the rest to the test set.
<br> For the model fitted on the train data:
    <br> - The coefficient of export volume growth is greater than 0 and statistically significant, so there is a statistically significant positive relationship between the two variables.
    <br> - The model explains 13% of the variance. With a larger sample, for example by adding more years, this share might be higher.
    <br> - The intercept is greater than 0: when export growth is 0, the predicted GDP growth is still positive. This may be due to other economic factors such as savings.
    <br> - The slope coefficient of export volume growth is 0.16, so a one percentage point increase in export volume growth is associated with 0.16 percentage points higher GDP growth.
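The quantities used in this interpretation can also be read programmatically from the model summary. The sketch below does so in base R on simulated data; the variable names and the true slope of 0.15 are assumptions for illustration, not results from the WEO data.

```r
# Simulated data with a known positive slope (illustrative only)
set.seed(7)
x <- rnorm(80, mean = 3, sd = 4)
y <- 2 + 0.15 * x + rnorm(80, sd = 1.5)
toy_model <- lm(y ~ x)

# Coefficient table: estimate, standard error, t value, p-value
coefs <- summary(toy_model)$coefficients

slope     <- coefs["x", "Estimate"]
slope_p   <- coefs["x", "Pr(>|t|)"]
intercept <- coefs["(Intercept)", "Estimate"]
r_squared <- summary(toy_model)$r.squared

# Significance check at the 0.05 level
is_significant <- slope_p < 0.05
c(slope = slope, intercept = intercept, r_squared = r_squared)
is_significant
```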

In [None]:
actual_train <- train_data$NGDP_RPCH
predicted_train <- predict(model1, train_data)
actual_test <- test_data$NGDP_RPCH
predicted_test <- predict(model1, test_data)

In [None]:
data.table(actual = actual_train, predictions = predicted_train) %>%
  ggplot(aes(x = actual, y = predictions)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0) +
  ggtitle("Train Actual vs. Predictions")

In [None]:
data.table(actual = actual_test, predictions = predicted_test) %>%
  ggplot(aes(x = actual, y = predictions)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0) +
  ggtitle("Test Actual vs. Predictions")

In [None]:
model_dt <- data.table(
  partition = c("train", "test"),
  R2 = c(R2(predicted_train, actual_train),
         R2(predicted_test, actual_test)),
  RMSE = c(RMSE(predicted_train, actual_train),
           RMSE(predicted_test, actual_test)),
  MAE = c(MAE(predicted_train, actual_train),
          MAE(predicted_test, actual_test))
)
model_dt

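<br> - The test data has an R-squared of 37%, so in R-squared terms the model performs better on the test data than on the train data.
<br> - RMSE and MAE are also smaller on the test data than on the train data, which likewise indicates better performance.
<br> - So the test performance of the model is higher than its train performance. This is unusual, since a model normally fits the data it was trained on better, and it probably reflects the small sample and the particular split rather than a genuinely better fit.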
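For reference, the three measures can be reproduced in base R. The sketch below uses made-up actual and predicted values. As far as I know, caret's `R2()` returns the squared correlation between predictions and actual values by default, which can differ from the "traditional" 1 - SSE/SST definition, so both are shown.

```r
# Made-up values for illustration only
actual    <- c(2.0, 3.5, 1.0, 4.2, 2.8)
predicted <- c(2.4, 3.0, 1.6, 3.9, 2.5)

errors <- predicted - actual

rmse <- sqrt(mean(errors^2))  # root mean squared error
mae  <- mean(abs(errors))     # mean absolute error

# R-squared as squared Pearson correlation
r2_corr <- cor(predicted, actual)^2

# R-squared as 1 - SSE/SST; this one can be negative for a poor model
r2_trad <- 1 - sum(errors^2) / sum((actual - mean(actual))^2)

c(RMSE = rmse, MAE = mae, R2_corr = r2_corr, R2_trad = r2_trad)
```

Because RMSE squares the errors before averaging, it is never smaller than MAE and reacts more strongly to a few large misses.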