Exploration & Research

Tuana Damla Ünal

In [None]:
library(tidymodels)

# Introduction
## Summary Information

"tidymodels" is a meta-package: installing and loading it brings in a set of packages that provide functions and interfaces for statistical analysis and modeling.

## Sub-Packages
It includes the following core packages:

- broom: Provides three main functions, tidy(), glance() and augment(), that turn model results into tidy tibbles so they are easier to summarize and reuse in further analysis.
- dials: A package for creating and managing tuning parameter values and grids. It is designed to work with parsnip models.
- dplyr: Offers efficient functions for data manipulation and wrangling. The most common ones are select(), mutate() and filter().
- ggplot2: A package for building detailed, customizable visualizations such as histograms, scatter plots, pie charts and line charts.
- infer: A package for statistical inference. Its main functions are specify(), hypothesize(), generate() and calculate().
- modeldata: A collection of datasets to be used for modeling examples.
- parsnip: Provides a tidy, unified interface to many models, so the syntax of each underlying modeling package does not have to be learned separately.
- purrr: Supports functional programming with tools that make it easier to work with vectors and functions.
- recipes: A package for preprocessing and feature engineering. It creates the design (regressor) matrix used for modeling and visualization.
- rsample: Provides tools for resampling, such as train/test splits and cross-validation, which are used to test a model.
- tibble: A reimagining of the data frame. It keeps only what is necessary and is stricter than a data frame: for example, it never changes variable names or types and does not do partial matching on column names.
- tidyr: Its main objective is tidying data. It handles pivoting, rectangling, nesting, splitting columns and completing missing values.
- tune: Provides hyperparameter tuning for models and preprocessing steps defined with tidymodels packages such as parsnip and recipes, using parameter grids from dials.
- workflows: Bundles preprocessing and modeling into one object, which eases building models that have multiple steps.
- yardstick: Consists of functions for evaluating model performance (RMSE, accuracy, etc.).
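
To make the "unified interface" idea of parsnip more concrete, here is a minimal sketch. It is not part of the lab material: the built-in mtcars data and the formula mpg ~ wt + hp are assumptions chosen only so the example runs on its own.

```r
library(tidymodels)

# Two different model types, specified with the same parsnip grammar.
lm_spec <- linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")

tree_spec <- decision_tree() %>%
  set_engine("rpart") %>%
  set_mode("regression")

# fit() and predict() are called in exactly the same way for both models.
lm_fit   <- fit(lm_spec,   mpg ~ wt + hp, data = mtcars)
tree_fit <- fit(tree_spec, mpg ~ wt + hp, data = mtcars)

lm_pred   <- predict(lm_fit,   new_data = mtcars)
tree_pred <- predict(tree_fit, new_data = mtcars)

# Both results are tibbles with a single .pred column.
head(lm_pred)
head(tree_pred)
```

Only the specification changes between the two models; the fitting and prediction code stays the same.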

# Advantages and Disadvantages of Tidymodels

## Comparison with Other Frameworks

### Tidymodels vs. mlr3

+ Tidymodels has more functionality for the preprocessing step. However, nested resampling is more straightforward and cleaner in mlr3.
+ Both frameworks support workflows and pipelines. However, mlr3 lacks counterparts for some of the individual preprocessing steps available in tidymodels, such as step_unknown(), step_other() and step_novel().
+ mlr3, through its mlr3pipelines extension, has a class named GraphLearner, which wraps a graph that strings together preprocessing, hyperparameter tuning and prediction.
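
The three recipes steps named above can be sketched as follows. The toy data frames (train, new_obs) and the 20% threshold are hypothetical and only serve to illustrate what each step does.

```r
library(recipes)

# Toy training data (hypothetical): one rare level ("c") and one missing value.
train <- data.frame(
  y    = 1:10,
  city = factor(c("a", "a", "a", "a", "b", "b", "b", "b", "c", NA))
)

# New data containing a level ("d") that never appeared in training, plus an NA.
new_obs <- data.frame(
  y    = 11:13,
  city = factor(c("a", "d", NA))
)

rec <- recipe(y ~ city, data = train) %>%
  step_novel(city) %>%                    # levels unseen in training become "new"
  step_unknown(city) %>%                  # missing values become "unknown"
  step_other(city, threshold = 0.2) %>%   # levels below 20% are pooled into "other"
  prep()

baked <- bake(rec, new_data = new_obs)
baked
```

After baking, the new observations contain neither missing values nor levels that the model has never seen.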


### Tidymodels & MLflow

+ Some people argue that comparing MLflow and tidymodels does not make sense, because the two complement each other.
+ With its tracking feature, MLflow collects model parameters, metrics and artifacts and displays them in a clear UI. The tidymodels packages integrate well with MLflow and allow some parts of the tracking to be automated.
+ Tidymodels therefore makes life a bit easier for R users who want to take advantage of MLflow.

### Tidymodels vs. Caret

+ Tidymodels produces tidy, neat outputs that give the end user a great deal of granularity.
+ Tidymodels includes packages that make it easy to analyze a model's estimated probabilities and resampled performance.
+ The great advantage of caret is that it combines many small pieces of code into a single call. It also aims to run them as fast as possible.
+ In one reported comparison, fitting a decision tree took over 1 minute with tidymodels and only 4 to 5 seconds with caret.

## Pros and Cons in General

### Pros

+ Packages are flexible and modular.
+ It provides tidy outcomes.
+ In the tidymodels ecosystem, the workflows package bundles model components together and promotes a more fluent modeling process.
+ The benefits range from making it easier to keep track of model components to avoiding data leakage in feature engineering.
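
As a small illustration of the data leakage point, the sketch below bundles a recipe and a model in one workflow. The mtcars data, the chosen predictors and the object names are assumptions made only for this example.

```r
library(tidymodels)

set.seed(1)
car_split <- initial_split(mtcars, prop = 0.75)
car_train <- training(car_split)
car_test  <- testing(car_split)

# The preprocessing is only declared here; nothing is estimated yet.
car_rec <- recipe(mpg ~ disp + hp + wt, data = car_train) %>%
  step_normalize(all_numeric_predictors())

car_wf <- workflow() %>%
  add_recipe(car_rec) %>%
  add_model(linear_reg() %>% set_engine("lm"))

# fit() estimates the normalization statistics and the model on the training set only.
car_fit <- fit(car_wf, data = car_train)

# predict() re-applies the stored training statistics to the test set.
car_pred <- predict(car_fit, new_data = car_test)

# The means and standard deviations stored inside the fitted workflow.
norm_stats <- tidy(extract_recipe(car_fit), number = 1)
norm_stats
```

Because the test rows are never used to compute the means and standard deviations, no information leaks from the test set into the model.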

### Cons

+ It is a newer framework, the successor to the robust and well-established caret.
+ Because of its modularity, it can be challenging for a newcomer to know where a specific problem fits in the ecosystem.
+ It can be slower than other packages.
+ There is still room for improvement through new functionality.

# Resources

+ https://tidymodels.tidymodels.org
+ https://cran.r-project.org/web/packages/tidymodels/tidymodels.pdf
+ https://mdneuzerling.com/post/tracking-tidymodels-with-mlflow/
+ https://pharmacoecon.me/post/2021-05-01-tidymodels-vs-mlr3/
+ https://konradsemsch.netlify.app/2019/08/caret-vs-tidymodels-comparing-the-old-and-new/
+ https://towardsdatascience.com/caret-vs-tidymodels-how-to-use-both-packages-together-ee3f85b381c
+ https://www.tidyverse.org/blog/2021/05/choose-tidymodels-adventure/
+ https://www.tidymodels.org/start/tuning/

# Examples from AD454 Lab Cases

## Broom Package

We generally use the summary() function to read a model. However, broom's tidy(), glance() and augment() functions return the results as tibbles, which are easier to use in further analysis.

In [None]:
datapath <- "~/data_ad454"

In [None]:
realty_data <- readRDS(sprintf("%s/rds/02_01_realty_data.rds", datapath))
features <- c("price", "brut_metrekare",
             "krediye_uygunluk",
             "kira_getirisi")
realty_data2 <- realty_data %>%
  select(all_of(features)) %>%
  # price and rent per gross square meter (brut_metrekare)
  mutate(unit_price = price / brut_metrekare,
         unit_rent = kira_getirisi / brut_metrekare) %>%
  # keep only the listings that are eligible for a loan ("uygun")
  filter(krediye_uygunluk == "uygun") %>%
  na.omit() %>%
  # trim both variables to their 5th-95th percentile range
  filter(between(unit_price, quantile(unit_price, 0.05), quantile(unit_price, 0.95))) %>%
  filter(between(unit_rent, quantile(unit_rent, 0.05), quantile(unit_rent, 0.95)))

model1 <- lm(unit_price ~ unit_rent, data = realty_data2)
tidy(model1)    # coefficient table as a tibble, ready for further analysis
glance(model1)  # one-row summary of model performance
augment(model1) # one row per observation, with fitted values and residuals
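
Since the realty data is not publicly available, the same three functions can be tried with a sketch on the built-in mtcars data. The model mpg ~ wt and the object names below are assumptions for illustration only.

```r
library(broom)

car_model <- lm(mpg ~ wt, data = mtcars)

summary(car_model)            # printed report; values have to be dug out of a list

coefs   <- tidy(car_model)    # one row per coefficient
quality <- glance(car_model)  # one row for the whole model
obs     <- augment(car_model) # one row per observation

# Because the results are data frames, ordinary subsetting works on them.
coefs[coefs$p.value < 0.05, c("term", "estimate", "p.value")]
quality$r.squared
head(obs[, c("mpg", "wt", ".fitted", ".resid")])
```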

## Yardstick Package

In the lab sessions, we build this evaluation table by defining our own function. yardstick provides it out of the box.

In [None]:
# a name other than `table` avoids masking base::table()
model1_results <- augment(model1)
metrics(model1_results, truth = unit_price, estimate = .fitted)
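
To show how a hand-written evaluation function relates to yardstick, here is a minimal sketch on the built-in mtcars data. The helper rmse_manual() is a hypothetical stand-in for the kind of function we define in the labs, not the actual lab code.

```r
library(yardstick)

car_model <- lm(mpg ~ wt, data = mtcars)
results <- data.frame(
  observed  = mtcars$mpg,
  predicted = unname(fitted(car_model))
)

# Hand-written RMSE (hypothetical helper).
rmse_manual <- function(truth, estimate) sqrt(mean((truth - estimate)^2))
manual_value <- rmse_manual(results$observed, results$predicted)
manual_value

# yardstick: the default regression metrics (rmse, rsq, mae) in one call ...
default_metrics <- metrics(results, truth = observed, estimate = predicted)
default_metrics

# ... or an explicitly chosen set of metrics.
reg_metrics <- metric_set(rmse, mae)
reg_metrics(results, truth = observed, estimate = predicted)
```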

## Tune & Dials Packages

We wrote loops to find the optimal complexity; the tune package can do this automatically.

In [None]:
library(modeldata)
data(mlc_churn, package = "modeldata")
churn <- mlc_churn

We split the data into training and test sets with the initial_split() function.

In [None]:
set.seed(123)
cell_split <- initial_split(churn, prop = 7/10)
cell_train <- training(cell_split)
cell_test  <- testing(cell_split)
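
As a possible variation that is not used in the lab code, initial_split() also accepts a strata argument. The sketch below assumes we want the share of churners to be about the same in both parts; the object names are chosen only for this example.

```r
library(tidymodels)
data(mlc_churn, package = "modeldata")

set.seed(123)
strat_split <- initial_split(mlc_churn, prop = 0.7, strata = churn)
strat_train <- training(strat_split)
strat_test  <- testing(strat_split)

# Share of churners in each part of the split.
p_train <- mean(strat_train$churn == "yes")
p_test  <- mean(strat_test$churn == "yes")
c(train = p_train, test = p_test)
```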

We specified the rpart model with the decision_tree() and tune() functions.

In [None]:
tune_spec <- 
  decision_tree(
    cost_complexity = tune()
  ) %>% 
  set_engine("rpart") %>% 
  set_mode("classification")

We chose the tuning parameter and the number of levels for our tree grid.

In [None]:
tree_grid <- grid_regular(cost_complexity(),
                          levels = 5)
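
To see what the grid actually contains, the sketch below prints the default grid and, as an assumption for illustration, a narrower grid between 10^-4 and 10^-1. The narrower range is hypothetical and not used in the rest of this section.

```r
library(dials)

# The range of cost_complexity() is defined on the log10 scale.
cost_complexity()

# Five values spread evenly (on the log10 scale) over the default range.
default_grid <- grid_regular(cost_complexity(), levels = 5)
default_grid

# A narrower, hypothetical range: 10^-4 to 10^-1.
narrow_grid <- grid_regular(cost_complexity(range = c(-4, -1)), levels = 4)
narrow_grid
```

Narrowing the range is one way to focus the search on values that change the tree noticeably.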

We created the workflow to try different complexity parameters and set up cross-validation folds on the training set.

In [None]:
tree_wf <- workflow() %>%
  add_model(tune_spec) %>%
  add_formula(churn ~ .)

set.seed(234)
cell_folds <- vfold_cv(cell_train)

With the 5 candidate values and the 10 folds that vfold_cv() creates by default, the model is refit 50 times on the resamples, and then we collect the summary metrics. Since the data is large, the calculation takes time. As noted earlier, some operations are slower with tidymodels.

In [None]:
tree_res <- 
  tree_wf %>% 
  tune_grid(
    resamples = cell_folds,
    grid = tree_grid
    )
tree_res %>% 
  collect_metrics()

To visualize the table above, I created the plots below. They show that Model 4 has the highest accuracy and roc_auc, so we should select it.

In [None]:
tree_res %>%
  collect_metrics() %>%
  ggplot(aes(cost_complexity, mean)) +
  geom_line(linewidth = 1.5, alpha = 0.6) +  # `linewidth` replaces `size` for lines in ggplot2 >= 3.4.0
  geom_point(size = 2) +
  facet_wrap(~ .metric, scales = "free", nrow = 2) +
  scale_x_log10(labels = scales::label_number())  # no color aesthetic is mapped, so no color scale is needed

Then we finalized the workflow with the best parameter and fit the final tree on the training set.

In [None]:
best_tree <- tree_res %>%
  select_best(metric = "roc_auc")  # recent versions of tune require `metric` to be named

final_wf <- 
  tree_wf %>% 
  finalize_workflow(best_tree)

final_tree <- 
  final_wf %>%
  fit(data = cell_train) 
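
The test set created at the beginning is not used above. As a possible next step, the sketch below repeats the whole process in compact form and evaluates the finalized workflow once on the test set with last_fit(). The smaller settings (3 folds, 3 grid values) and the object names are assumptions chosen so that the example runs quickly on its own.

```r
library(tidymodels)
data(mlc_churn, package = "modeldata")

set.seed(123)
churn_split <- initial_split(mlc_churn, prop = 0.7)
churn_train <- training(churn_split)

tree_spec <- decision_tree(cost_complexity = tune()) %>%
  set_engine("rpart") %>%
  set_mode("classification")

churn_wf <- workflow() %>%
  add_model(tree_spec) %>%
  add_formula(churn ~ .)

# Smaller settings than in the text so that the sketch runs quickly.
set.seed(234)
churn_folds <- vfold_cv(churn_train, v = 3)
churn_grid  <- grid_regular(cost_complexity(), levels = 3)

churn_res <- tune_grid(churn_wf, resamples = churn_folds, grid = churn_grid)

# Rank the candidates, then keep the best one.
show_best(churn_res, metric = "roc_auc", n = 3)
best_cc <- select_best(churn_res, metric = "roc_auc")

# last_fit() trains on the training set and evaluates once on the test set.
test_fit <- churn_wf %>%
  finalize_workflow(best_cc) %>%
  last_fit(churn_split)

test_metrics <- collect_metrics(test_fit)
test_metrics
```

Evaluating on the test set only once, after tuning is finished, keeps the reported performance independent of the tuning process.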