In [None]:
library(tidyverse)
library(data.table)
library(broom) # for tidy statistical summaries
library(caret) # for regression performance measures and resampling
#library(nycflights13) # for data

In [None]:
options(repr.matrix.max.rows=20, repr.matrix.max.cols=15) # limit the number of rows and columns shown when tables are printed

In [None]:
datapath <- "~/data_ad454"

In [None]:
flights <- readRDS(sprintf("%s/rds/09_01_flights.rds", datapath))

In [None]:
flights

In [None]:
flights %>% str

You can find more information at:

https://nycflights13.tidyverse.org/

Using the flights dataset:

- Please select a random subset of 20k records (resampling on the full 300k+ records can take a long time).
- Create two multiple linear regression models to predict the dep_delay (departure delay) column. At least one model should have more than 5 independent variables. You may transform and wrangle the data in any way to create new features or modify existing ones.
- Run the models with 10-fold cross-validation using caret (the trainControl and train functions).
- Compare the models' predictive performance using the resampling results. Which model would you prefer, and why?

# Answer

In [None]:
flights <- copy(flights)
setDT(flights)

In [None]:
flights %>% str

In [None]:
# order by origin and date/time
setorder(flights, origin, year, month, day, dep_time)

In [None]:
# 10-flight moving average of dep_delay for each origin, lagged by one flight
# so that the current flight's own delay is not part of its average
flights[, dep_delay_ma := c(NA, RcppRoll::roll_meanr(dep_delay[-.N], 10)), by = origin]
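
To see what the lagged rolling mean does, here is a minimal sketch on a toy vector. It uses `data.table::frollmean()` followed by `shift()` as a stand-in for the `RcppRoll::roll_meanr()` construction above. The window of 3, the placeholder origin "AAA", and the delay values are assumptions made purely for illustration. Each row receives the mean of the flights that came before it, never its own delay.

```r
library(data.table)

# Toy delays for a single placeholder origin, already in chronological order
toy <- data.table(origin = "AAA", dep_delay = c(10, 20, 30, 40, 50))

# Right-aligned rolling mean over 3 flights, then shifted down one row so
# that each row only sees flights that departed before it
toy[, dep_delay_ma := shift(frollmean(dep_delay, 3)), by = origin]

toy$dep_delay_ma
# NA NA NA 20 30
```

The first three rows are NA because fewer than three earlier flights exist for them, which is why a later `na.omit()` drops the start of each group.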

In [None]:
# order by destination and date/time
setorder(flights, dest, year, month, day, dep_time)

In [None]:
# 10-flight moving average of arr_delay for each destination, lagged by one flight
# so that the current flight's own delay is not part of its average
flights[, arr_delay_ma := c(NA, RcppRoll::roll_meanr(arr_delay[-.N], 10)), by = dest]

In [None]:
# get the number of flights for each origin/dest and date
flights[, origin_counts := .N, by = c("origin", "year", "month", "day")]
flights[, dest_counts := .N, by = c("dest", "year", "month", "day")]

In [None]:
# drop rows with any missing value, including rows where the moving averages are not yet defined
flights <- na.omit(flights)

In [None]:
set.seed(454) # arbitrary seed so that the 20k random subset is reproducible
flights2 <- flights[sample(.N, 2e4)]

In [None]:
ctrl3 <- trainControl(method = "cv", number = 10, returnResamp = "all", savePredictions = TRUE)

In [None]:
modelf1 <- train(
  dep_delay ~ arr_delay_ma + dep_delay_ma,
  data = flights2,
  method = "lm",
  trControl = ctrl3
)

In [None]:
modelf2 <- train(
  dep_delay ~ arr_delay_ma + dep_delay_ma + origin_counts + dest_counts + origin*hour + month + air_time,
  data = flights2,
  method = "lm",
  trControl = ctrl3
)
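
The `origin*hour` term in the formula is shorthand for both main effects plus their interaction. A minimal sketch with base R shows the expansion; the four-row data frame is made up for illustration and is not taken from the flights data.

```r
# Tiny made-up data frame with a categorical origin and a numeric hour
toy <- data.frame(origin = c("EWR", "JFK", "EWR", "JFK"),
                  hour   = c(6, 9, 15, 20))

# Terms generated by the * operator
term_labels <- attr(terms(~ origin * hour), "term.labels")
term_labels
# "origin"      "hour"        "origin:hour"

# Columns of the design matrix that lm() would actually fit
design_cols <- colnames(model.matrix(~ origin * hour, data = toy))
design_cols
# "(Intercept)"    "originJFK"      "hour"           "originJFK:hour"
```

With three origins in the real data, the interaction adds one extra slope for `hour` per non-reference origin, so the model can fit a different hourly trend at each airport.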

In [None]:
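# name the models explicitly; these match the labels caret assigns to an unnamed list
resamps <- resamples(list(Model1 = modelf1, Model2 = modelf2))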

In [None]:
summary(resamps)

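- The 3rd-quartile RMSE across the 10 folds is lower for Model2 (lower is better).
- The 1st-quartile R2 across the 10 folds is higher for Model2 (higher is better).

Model2 (modelf2, the larger model) is preferred on these metrics, because it looks better even when its weaker folds are compared. You may choose different metrics or define your own.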
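
The quartile comparison above can be complemented by a paired, fold-by-fold comparison. This only makes sense if both models are evaluated on identical folds. The following is a minimal sketch of one way to do that with caret, not the method used in the answer above. It assumes the built-in `mtcars` data, 5 folds, and two arbitrary formulas, all chosen only to keep the example small and fast.

```r
library(caret)

set.seed(1)
# Fix the fold assignment once so both models are evaluated on identical splits
folds <- createFolds(mtcars$mpg, k = 5, returnTrain = TRUE)
ctrl  <- trainControl(method = "cv", index = folds)

m_small <- train(mpg ~ wt, data = mtcars, method = "lm", trControl = ctrl)
m_large <- train(mpg ~ wt + hp + qsec, data = mtcars, method = "lm", trControl = ctrl)

rs <- resamples(list(small = m_small, large = m_large))
summary(rs)        # distribution of MAE, RMSE and R-squared across folds, per model
summary(diff(rs))  # paired fold-by-fold differences between the two models
```

Passing the same `index` to both `train()` calls removes fold-to-fold noise from the comparison, so a difference in RMSE reflects the models rather than the random split.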