In [None]:
library(tidyverse)
library(data.table)
library(plotly) # for interactive plotting
library(DT) # for interactive tabulation
library(broom) # for tidy statistical summaries
library(caret) # for regression performance measures
library(psych) # for pairwise comparisons
library(GGally) # for pairwise comparisons
library(magrittr) # for two-way pipes
library(lindia) # for qqplots

In [None]:
options(repr.matrix.max.rows=20, repr.matrix.max.cols=15) # for limiting the number of top and bottom rows of tables printed 

In [None]:
datapath <- "~/data_ad454"

# Estimating covid total cases

In [None]:
covid <- readRDS(sprintf("%s/rds/05_01_covid5.rds", datapath))

The covid dataset was compiled for this course from several sources:

In [None]:
covid %>% str

In [None]:
covid

The variables are defined as follows:

- max_tc: Cumulative number of cases until April 2020
- intl_flights: Total number of international flights that the country received between January and April 2020
- LP: Total population in millions
- dom_flights: Total number of domestic flights that the country had between January and April 2020
- sq_km: Land area of the country in square kilometers
- household_size: Household size in persons

See the number of missing values for each column:

In [None]:
covid[, sapply(.SD, function(x) sum(is.na(x)))]
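As a cross-check, the same per-column counts can be obtained in base R with `colSums(is.na(...))`. A minimal sketch on a made-up table (the `toy` data frame and its values are illustrative only, not part of the covid data):

```r
# Toy data frame (hypothetical values) with some missing entries
toy <- data.frame(a = c(1, NA, 3),
                  b = c(NA, NA, 6),
                  c = c("x", "y", "z"))

# is.na() returns a logical matrix; colSums() counts the TRUEs per column
na_counts <- colSums(is.na(toy))
na_counts
```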

Your tasks are as follows:

- Impute the NA values of the intl_flights and dom_flights columns with 0. We assume that values are missing in those columns because no flights occurred. You may use any method, but I suggest combining mutate_at with either nafill or ifelse and is.na
- Exclude the household_size column, which has too many missing values. It is better not to impute that column
- Now keep only those rows with no missing values. You may use complete.cases or na.omit. The remaining data should be as follows:

<pre>Classes ‘data.table’ and 'data.frame':	155 obs. of  8 variables:
 $ iso3c       : chr  "USA" "ESP" "ITA" "FRA" ...
 $ title       : chr  "USA" "Spain" "Italy" "France" ...
 $ max_tc      : int  963379 226629 197675 162220 157495 152840 110130 90481 83909 80949 ...
 $ intl_flights: int  76790 19752 22432 35146 67399 53362 2168 0 1428 16667 ...
 $ year        : int  2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 ...
 $ LP          : num  330 46.5 60.3 65 83.2 ...
 $ dom_flights : int  1110676 4485 8721 14567 36131 9674 297 0 1780 8485 ...
 $ sq_km       : num  9831510 505935 301340 549087 357580 ...
 - attr(*, ".internal.selfref")=&lt;externalptr&gt; 
</pre>

- It is better to scale the remaining columns by population:
    - Create new columns by dividing max_tc, intl_flights and dom_flights by the LP column to get per million population figures
    - Create a new column for population density (persons per square kilometer) by dividing LP by sq_km; since LP is in millions, multiply the result by 1e6
    - Keep only the identifier columns and the newly created columns, as such:
    
<pre>Classes ‘data.table’ and 'data.frame':	155 obs. of  6 variables:
 $ iso3c          : chr  "USA" "ESP" "ITA" "FRA" ...
 $ title          : chr  "USA" "Spain" "Italy" "France" ...
 $ max_tcpm       : num  2919 4875 3279 2496 1894 ...
 $ intl_flights_pm: num  233 425 372 541 811 ...
 $ dom_flights_pm : num  3365.6 96.5 144.7 224.1 434.5 ...
 $ pop_density    : num  33.6 91.9 200.1 118.4 232.5 ...
 - attr(*, ".internal.selfref")=&lt;externalptr&gt; 
</pre>


- Now let's trim the extremes: keep only those rows where every numeric column lies within its own 0.05 and 0.95 quantile values. You may use piped filters for each column, or you may combine a single filter with the if_all, where and is.numeric functions (you may check their help pages). The result should be:

<pre>Classes ‘data.table’ and 'data.frame':	116 obs. of  6 variables:
 $ iso3c          : chr  "FRA" "GBR" "TUR" "IRN" ...
 $ title          : chr  "France" "UK" "Turkey" "Iran" ...
 $ max_tcpm       : num  2496 2272.5 1308.4 1075.2 59.8 ...
 $ intl_flights_pm: num  540.77 793.43 25.76 0 1.02 ...
 $ dom_flights_pm : num  224.13 143.84 3.53 0 1.27 ...
 $ pop_density    : num  118.4 276.1 107.2 48.2 146.9 ...
 - attr(*, ".internal.selfref")=&lt;externalptr&gt; 
</pre>

`covid3 %>% keep(is.numeric) %>% summary`

<pre>    max_tcpm        intl_flights_pm   dom_flights_pm      pop_density    
 Min.   :   1.917   Min.   :   0.00   Min.   :  0.0000   Min.   :  4.16  
 1st Qu.:  25.700   1st Qu.:   0.00   1st Qu.:  0.0000   1st Qu.: 40.23  
 Median :  78.099   Median :   0.00   Median :  0.0000   Median : 74.27  
 Mean   : 317.184   Mean   : 101.80   Mean   : 13.5795   Mean   : 93.15  
 3rd Qu.: 472.339   3rd Qu.:  54.59   3rd Qu.:  0.9095   3rd Qu.:110.64  
 Max.   :2495.961   Max.   :1554.66   Max.   :286.7446   Max.   :410.92  </pre>
 
- Visualize the cross variable relationships of numeric variables. You may use any of the following functions:
    - pairs()
    - psych::pairs.panels()
    - GGally::ggpairs()

    You may adjust plot size with:
`options(repr.plot.width = 15, repr.plot.height = 15)`

  Are there any independent variables left that are highly correlated with each other? Are there any variables that are highly correlated with the dependent variable `max_tcpm`?


- Split the data into train and test partitions. The train set should have a ratio of 0.5 - 0.7. You may pick an arbitrary seed value for reproducibility
- Create a linear model to estimate max_tcpm using all other numeric variables on the train set. Assign the model result to an object.
- Create the Q-Q plot for the residuals. What can you say about the normality of the residuals?
- For the train and test sets, calculate the fitted values (for the test set, we may call them "predictions")
- Calculate model fit metrics (R2 and RMSE) for both sets. Compare the fits. What can you say about the model fit for both sets?
- Plot residuals (difference between actual and fitted values) vs fitted values for both sets. What are some insights?
- Now run a second model using only the independent variable(s) significant at the 5% level in the first model and without an intercept (you can impose that with a "-1" term in the formula, e.g. y ~ x - 1). Repeat the steps starting from "Create a linear model ..."

Write your comments in markdown cells.

# Answer

In [None]:
covid %<>%
mutate_at(c("intl_flights", "dom_flights"), nafill, "const", 0)
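The task also mentions `ifelse` with `is.na` as an alternative to `nafill`. A minimal base R sketch of that variant, on a made-up table (the `toy` data frame and its values are hypothetical; only the column names mirror the exercise):

```r
# Toy table (hypothetical values): two flight columns and one other column with NAs
toy <- data.frame(intl_flights = c(10L, NA, 5L),
                  dom_flights  = c(NA, 2L, NA),
                  LP           = c(1.5, NA, 3))

flight_cols <- c("intl_flights", "dom_flights")

# Replace NA with 0 only in the flight columns; other columns are left untouched
toy[flight_cols] <- lapply(toy[flight_cols],
                           function(x) ifelse(is.na(x), 0L, x))
toy
```

The `LP` column keeps its missing value, so rows like the second one would still be dropped later by `na.omit`.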

In [None]:
covid[, sapply(.SD, function(x) sum(is.na(x)))]

In [None]:
covid %<>% select(-household_size) %>% na.omit

In [None]:
covid %>% str

In [None]:
covid[, sapply(.SD, function(x) sum(is.na(x)))]

In [None]:
covid[, max_tcpm := max_tc / LP]
covid[, intl_flights_pm := intl_flights / LP]
covid[, dom_flights_pm := dom_flights / LP]
covid[, pop_density := LP / sq_km * 1e6]
covid2 <- covid %>% select(iso3c, title, max_tcpm, intl_flights_pm, dom_flights_pm, pop_density)

In [None]:
covid2 %>% str

In [None]:
quants <- c(0.05, 0.95)

In [None]:
covid3 <- covid2 %>% filter(if_all(where(is.numeric), ~ . %between% quantile(. , quants)))
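To see what this joint filter does, here is a small base R sketch on made-up data (the `toy` data frame and the helper `in_range` are illustrative names, not part of the exercise). Each column is trimmed at its own 5% and 95% quantiles, and a row survives only if it passes in every column:

```r
# Toy data: two numeric columns whose extremes fall on different rows
toy <- data.frame(x = 1:100, y = c(51:100, 1:50))

# TRUE where a value lies within its column's [5%, 95%] quantile range (inclusive)
in_range <- function(v, probs = c(0.05, 0.95)) {
  b <- quantile(v, probs)
  v >= b[1] & v <= b[2]
}

# Combine the per-column conditions with a logical AND across columns
keep <- Reduce(`&`, lapply(toy, in_range))
trimmed <- toy[keep, ]
nrow(trimmed)
```

Each column on its own would keep 90 of the 100 rows, but because the extremes sit on different rows, the combined filter keeps fewer. This is why trimming four columns at once can remove noticeably more than 10% of the observations.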

In [None]:
covid3 %>% str

In [None]:
covid3 %>% keep(is.numeric) %>% summary

In [None]:
options(repr.plot.width = 15, repr.plot.height = 15)
covid3 %>% keep(is.numeric) %>% psych::pairs.panels()

In [None]:
covid4 <- covid3 %>% select(-c("iso3c", "title"))

In [None]:
set.seed(1000)
train_ratio <- 0.5

Randomly create row indices for train partition

In [None]:
# sample(n) alone only permutes 1:n, so draw the train size from all .N rows
train_indices <- covid4[, sample(.N, floor(.N * train_ratio))]

Split the data into two partitions

In [None]:
train_data <- covid4[train_indices]
test_data <- covid4[-train_indices]
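A note on `sample()` when building split indices: called with a single number `n`, it returns a random permutation of `1:n`, whereas `sample(n, size)` draws `size` distinct indices from `1:n`. A minimal sketch with made-up sizes (`n` and the variable names are illustrative):

```r
set.seed(1000)
n <- 10            # hypothetical number of rows
train_ratio <- 0.5

# One argument: a shuffled copy of 1:5, i.e. always the first five rows
perm_only <- sample(n * train_ratio)

# Two arguments: five distinct indices drawn from all ten rows
random_idx <- sample(n, floor(n * train_ratio))

sort(perm_only)
sort(random_idx)

# The complement gives the test rows, so the partitions never overlap
test_idx <- setdiff(seq_len(n), random_idx)
```

If the table is sorted (for example by case counts), the one-argument form would put only the top rows into the train partition.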

Run a model:

In [None]:
model1 <- lm(max_tcpm ~ ., train_data)

In [None]:
model1 %>% summary
model1 %>% tidy %>% filter(p.value < 0.05) # variables significant at the 5% level

In [None]:
options(repr.plot.width = 5, repr.plot.height = 5)

actual_train <- train_data$max_tcpm
predicted_train <- predict(model1, train_data)

actual_test <- test_data$max_tcpm
predicted_test <- predict(model1, test_data)

model_dt <- data.table(partition = c("train", "test"),
                       R2 = c(R2(predicted_train, actual_train),
                                R2(predicted_test, actual_test)),
                        RMSE = c(RMSE(predicted_train, actual_train),
                                 RMSE(predicted_test, actual_test)),
                        MAE = c(MAE(predicted_train, actual_train),
                                MAE(predicted_test, actual_test))
                        )

model_dt

The predictive performance is not very good.
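For reference, the metrics in the table can be reproduced by hand. The sketch below uses made-up vectors (`obs` and `pred` are hypothetical) and base R only. Note that `caret::R2` by default reports the squared correlation between predictions and observations (`formula = "corr"`); a "traditional" variant based on sums of squares is also available, and the two can differ, especially on a test set:

```r
# Hypothetical observed and predicted values
obs  <- c(3, 5, 7, 9)
pred <- c(2, 6, 7, 10)

err <- obs - pred                      # residuals: 1, -1, 0, -1

rmse <- sqrt(mean(err^2))              # root mean squared error
mae  <- mean(abs(err))                 # mean absolute error

# Squared correlation between predictions and observations
r2_corr <- cor(pred, obs)^2

# 1 - residual sum of squares / total sum of squares
r2_trad <- 1 - sum(err^2) / sum((obs - mean(obs))^2)

c(RMSE = rmse, MAE = mae, R2_corr = r2_corr, R2_trad = r2_trad)
```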

In [None]:
gg_qqplot(model1, scale.factor = 1)

Residuals are not perfectly normally distributed

In [None]:
data.table(residuals = actual_train - predicted_train, predictions = predicted_train) %>%
ggplot(aes(x = predictions, y = residuals)) +
geom_point() +
ggtitle("Train Fitted Values vs. Residuals")

data.table(residuals = actual_test - predicted_test, predictions = predicted_test) %>%
ggplot(aes(x = predictions, y = residuals)) +
geom_point() +
ggtitle("Test Predictions vs. Residuals")

There is a visible pattern in the residuals against the predicted values (on the test set), and the variance is not homogeneous. The model specification might be wrong: the variables could be transformed, or a non-linear model could be used instead.

In [None]:
model2 <- lm(max_tcpm ~ intl_flights_pm - 1, train_data)
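The effect of the `- 1` term can be checked on a small made-up example (the vectors `x` and `y` below are hypothetical). Without an intercept, the fitted line is forced through the origin and the single slope equals `sum(x * y) / sum(x^2)`:

```r
# Hypothetical data, roughly y = 2 * x
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

fit_int   <- lm(y ~ x)       # intercept and slope
fit_noint <- lm(y ~ x - 1)   # slope only, line through the origin

names(coef(fit_int))
names(coef(fit_noint))

# Closed-form least squares slope for the no-intercept case
slope_by_hand <- sum(x * y) / sum(x^2)
```

One caveat: for a model without an intercept, the R-squared reported by `summary()` is computed relative to zero rather than to the mean of the response, so it is not directly comparable with the R-squared of a model that has an intercept.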

In [None]:
model2 %>% summary
model2 %>% tidy %>% filter(p.value < 0.05) # variables significant at the 5% level

In [None]:
options(repr.plot.width = 5, repr.plot.height = 5)

actual_train <- train_data$max_tcpm
predicted_train <- predict(model2, train_data)

actual_test <- test_data$max_tcpm
predicted_test <- predict(model2, test_data)

model_dt <- data.table(partition = c("train", "test"),
                       R2 = c(R2(predicted_train, actual_train),
                                R2(predicted_test, actual_test)),
                        RMSE = c(RMSE(predicted_train, actual_train),
                                 RMSE(predicted_test, actual_test)),
                        MAE = c(MAE(predicted_train, actual_train),
                                MAE(predicted_test, actual_test))
                        )

model_dt

The predictive performance is still not very good.

In [None]:
gg_qqplot(model2, scale.factor = 1)

Residuals are not perfectly normally distributed

In [None]:
data.table(residuals = actual_train - predicted_train, predictions = predicted_train) %>%
ggplot(aes(x = predictions, y = residuals)) +
geom_point() +
ggtitle("Train Fitted Values vs. Residuals")

data.table(residuals = actual_test - predicted_test, predictions = predicted_test) %>%
ggplot(aes(x = predictions, y = residuals)) +
geom_point() +
ggtitle("Test Predictions vs. Residuals")

When only the significant variable is included, the pattern largely disappears compared to the previous model, although some pattern is still visible. There may still be some outlier values distorting the analysis.