In [None]:
library(tidyverse)
library(data.table)
library(broom) # for tidy statistical summaries
library(caret) # for regression performance measures and resampling
library(nycflights13) # for data

In [None]:
options(repr.matrix.max.rows=20, repr.matrix.max.cols=15) # limit the number of rows and columns shown when tables are printed

In [None]:
datapath <- "~/data_ad454"

# Cross Validation with NYC Flights Data

We will select a small sample from the flights dataset of the nycflights13 package.

We will create two models and apply the k-fold cross-validation resampling method.

We will use caret's model wrapper, the train() function, together with the trainControl() function, which configures the resampling.

The following vignette and tutorial chapter show the basic usage of resampling and model training with caret:

https://cran.r-project.org/web/packages/caret/vignettes/caret.html

https://topepo.github.io/caret/model-training-and-tuning.html

## Data wrangling

In [None]:
flights %>% str

In [None]:
flights2 <- flights %>% as.data.table

In [None]:
flights2

Now let's select two features, air_time and distance, omit NAs and make some transformations:

- air_time is recorded in minutes (see ?flights), so we divide it by 60 to get fractional hours
- then we calculate the speed in miles per hour, since distance is in miles

In [None]:
flights3 <- flights2 %>% select(air_time, distance) %>% na.omit %>%
mutate(air_time2 = air_time / 60) %>%
mutate(speed = distance / air_time2)

Let's draw a random sample of 10,000 rows to speed up the resampling calculations:

In [None]:
set.seed(2000)
flights4 <- flights3[sample(.N, 1e4)]

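Now let's define the resampling method and its parameters. Here we use 10-fold cross-validation and keep all resampled metrics as well as the held-out predictions: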

In [None]:
ctrl3 <- trainControl(method = "cv", number = 10, returnResamp = "all", savePredictions = TRUE)
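To make it concrete what 10-fold cross-validation does, here is a minimal base R sketch of the procedure. It is only an illustration under assumptions: it uses the built-in mtcars data and a mpg ~ wt model rather than the flights sample, and the names fold_id and fold_rmse are made up for this example. caret builds its folds differently (it balances them on the outcome), so these numbers would not match train() output.

```r
# Manual k-fold cross-validation on the built-in mtcars data (illustrative only)
set.seed(2000)
k <- 10
n <- nrow(mtcars)

# Randomly assign each row to one of k folds of (nearly) equal size
fold_id <- sample(rep(seq_len(k), length.out = n))

# For each fold: fit on the other k - 1 folds, then score on the held-out fold
fold_rmse <- sapply(seq_len(k), function(i) {
  train_set <- mtcars[fold_id != i, ]
  test_set  <- mtcars[fold_id == i, ]
  fit  <- lm(mpg ~ wt, data = train_set)
  pred <- predict(fit, newdata = test_set)
  sqrt(mean((test_set$mpg - pred)^2))
})

fold_rmse        # one RMSE per held-out fold
mean(fold_rmse)  # cross-validated RMSE estimate
```

Every row is used for testing exactly once and for training k - 1 times, which is why the spread of the per-fold values is informative about how stable a model's predictions are.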

And run a simple linear model using cross validation:

In [None]:
modelf1 <- train(
  speed ~ distance,
  data = flights4,
  method = "lm",
  trControl = ctrl3
)

Now run a second model that uses a degree-10 polynomial of distance:

In [None]:
modelf2 <- train(
  speed ~ poly(distance, 10),
  data = flights4,
  method = "lm",
  trControl = ctrl3
)

Now let's get the final model summaries:

In [None]:
modelf1 %>% summary

In [None]:
modelf2 %>% summary

Note that the degree-10 polynomial model has a higher R-squared on the training data. But is it a better model in terms of prediction?

Let's check the prediction metrics on the resamples:

In [None]:
modelf1$resample

In [None]:
modelf2$resample

The second model, despite its higher R2, shows much more variation in R2 across resamples.

Let's compare the distributions of these metrics for both models more systematically:

In [None]:
resamps <- resamples(list(modelf1, modelf2))

In [None]:
summary(resamps)

We see that the second model has extreme outliers in its RMSE and Rsquared values on the held-out folds.
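Besides summary(), caret offers diff() and bwplot() methods for resamples objects. The sketch below shows them on a small stand-in problem; the mtcars data, the cubic model and the names folds, fit_linear, fit_cubic and resamps_demo are assumptions made for this illustration, not part of the analysis above. The folds are fixed once through the index argument so that both models are scored on identical splits.

```r
library(caret)

set.seed(2000)
# Fix the training rows of each fold once so both models see identical splits
folds <- createFolds(mtcars$mpg, k = 5, returnTrain = TRUE)
ctrl  <- trainControl(method = "cv", index = folds)

fit_linear <- train(mpg ~ wt, data = mtcars, method = "lm", trControl = ctrl)
fit_cubic  <- train(mpg ~ poly(wt, 3), data = mtcars, method = "lm", trControl = ctrl)

# Naming the list entries labels the models in every summary and plot
resamps_demo <- resamples(list(linear = fit_linear, cubic = fit_cubic))

summary(diff(resamps_demo))            # paired fold-by-fold differences
bwplot(resamps_demo, metric = "RMSE")  # box plots of the per-fold RMSE
```

The same calls should work on the notebook's own resamps object; naming the list entries there, for example list(linear = modelf1, poly10 = modelf2), makes the output easier to read.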

Now let's define our own metric. For example, we can compare the coefficient of variation (standard deviation divided by mean) of the RMSE and R2 values of both models:

In [None]:
modelf1$resample %>% as.data.table %>% .[, sd(RMSE) / mean(RMSE)]

In [None]:
modelf2$resample %>% as.data.table %>% .[, sd(RMSE) / mean(RMSE)]

In [None]:
modelf1$resample %>% as.data.table %>% .[, sd(Rsquared) / mean(Rsquared)]

In [None]:
modelf2$resample %>% as.data.table %>% .[, sd(Rsquared) / mean(Rsquared)]

So the second model's prediction performance varies more across resamples.
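The four cells above repeat the same expression, so a small helper can make the comparison less error-prone. This is a sketch only: coef_var is a hypothetical helper name, and toy_resample holds made-up numbers that merely stand in for a model's $resample table.

```r
# Coefficient of variation: standard deviation relative to the mean
coef_var <- function(x) sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)

# Made-up per-fold metrics standing in for a model's $resample table
toy_resample <- data.frame(
  RMSE     = c(50, 52, 48, 51, 49),
  Rsquared = c(0.30, 0.28, 0.33, 0.31, 0.29)
)

# Apply the helper to every metric column at once
sapply(toy_resample[, c("RMSE", "Rsquared")], coef_var)
```

With the objects from this notebook, sapply(modelf1$resample[, c("RMSE", "Rsquared")], coef_var) and the same call on modelf2 would give both values per model in one step.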

Note that you can use the bootstrap for resampling by passing "boot" to the method argument of trainControl(), and leave-one-out cross-validation by passing "LOOCV" to the same argument.
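A minimal sketch of both options follows. It assumes the built-in mtcars data and a simple mpg ~ wt model so that leave-one-out stays fast; the object names ctrl_boot, ctrl_loocv, fit_boot and fit_loocv are illustrative. On the 10,000-row flights sample, LOOCV would refit the model 10,000 times.

```r
library(caret)

set.seed(2000)
ctrl_boot  <- trainControl(method = "boot", number = 25)  # 25 bootstrap resamples
ctrl_loocv <- trainControl(method = "LOOCV")              # one held-out row per fit

fit_boot  <- train(mpg ~ wt, data = mtcars, method = "lm", trControl = ctrl_boot)
fit_loocv <- train(mpg ~ wt, data = mtcars, method = "lm", trControl = ctrl_loocv)

fit_boot$results   # metrics averaged over the bootstrap resamples
fit_loocv$results  # metrics computed from the pooled held-out predictions
```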

# Data-oriented eulogy

Hasan Saltık was a prominent Turkish record producer who issued more than 200 releases over three decades. He passed away on 3 June 2021 at the age of 57.

He never concentrated his efforts and resources on popular but culturally shallow works that could have yielded higher financial returns. Instead, he chose to be a cultural archaeologist. Through his efforts with Kalan Music, his record label, many cultural gems of our lands that would otherwise have been lost in space and time were brought to daylight (for example by restoring and digitizing very old and rare recordings on shellac records more than a hundred years old). Kalan releases are also a valuable part of my personal musical archive.

For his distinctive contributions, he was granted an honorary doctorate by ITU several years ago. You can read an interview with him given on the occasion of this award:

http://www.musikidergisi.com/haber-4738-hasan_saltik_ituden_fahri_doktora_kalan_muzik_uzerine

To commemorate his valuable and distinctive contributions to our culture, we can follow a data-oriented approach:

- www.discogs.com is a great website and data provider covering millions of musical releases from all over the world
- https://www.discogs.com/developers has information on the API, which returns data for search queries and makes data collection easy
- https://www.discogs.com/developers/#page:database,header:database-label has information on label search
- The code for the Kalan Music label is 99790.
- We collected the basic information on releases by the Kalan Music label through the REST API. The return values are in JSON format and can easily be converted to a data.frame with the jsonlite package.
- With some further wrangling and a small selection of the releases by Kalan, we can understand how important a cultural mission he accomplished:
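The collection step described in the list above could look roughly like the sketch below. It is not the code that produced the CSV file used here. The endpoint path, the page and per_page query parameters and the releases field of the response are assumptions taken from the Discogs developer pages and should be checked there. The function names and the user-agent string are placeholders. The network call is left commented out because it needs the httr and jsonlite packages and internet access.

```r
# Build the URL of the label-releases endpoint for one page of results
# (endpoint path and query parameters are assumptions; check the Discogs docs)
label_releases_url <- function(label_id, page = 1, per_page = 100) {
  sprintf("https://api.discogs.com/labels/%d/releases?page=%d&per_page=%d",
          label_id, page, per_page)
}

# Fetch one page and return the releases as a data.frame.
# Needs the httr and jsonlite packages and network access.
fetch_label_releases <- function(label_id, page = 1, per_page = 100) {
  resp <- httr::GET(
    label_releases_url(label_id, page, per_page),
    httr::user_agent("ad454-notebook/0.1")  # placeholder identifier
  )
  httr::stop_for_status(resp)
  parsed <- jsonlite::fromJSON(httr::content(resp, as = "text", encoding = "UTF-8"))
  parsed$releases  # assumed name of the field holding the release records
}

label_releases_url(99790)
# kalan_page1 <- fetch_label_releases(99790)  # uncomment to query the API
```

A full collection would loop over the pages and bind the resulting data frames together before writing them to disk.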

In [None]:
kalan <- fread(sprintf("%s/csv/09_01_kalan.csv", datapath))

In [None]:
kalan