# Tidymodels Exploration & Research

## Summary

The tidymodels framework is a collection of packages for modeling and machine learning using tidyverse principles. 

It includes a core set of packages that are loaded on startup:

* broom takes the messy output of built-in functions in R, such as lm, nls, or t.test, and turns them into tidy data frames.

* dials has tools to create and manage values of tuning parameters.

* dplyr contains a grammar for data manipulation.

* ggplot2 implements a grammar of graphics.

* infer is a modern approach to statistical inference.

* parsnip is a tidy, unified interface to creating models.

* purrr is a functional programming toolkit.

* recipes is a general data preprocessor with a modern interface. It can create model matrices that incorporate feature engineering, imputation, and other help tools.

* rsample has infrastructure for resampling data so that models can be assessed and empirically validated.

* tibble has a modern re-imagining of the data frame.

* tune contains the functions to optimize model hyper-parameters.

* workflows has methods to combine pre-processing steps and models into a single object.

* yardstick contains tools for evaluating models (e.g. accuracy, RMSE).

The tidymodels framework also includes many other packages designed for specialized data analysis and modeling tasks.

It can be installed using install.packages("tidymodels").

The tidymodels framework supports processing on multiple cores, for example when resampling or tuning models with tune.
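
As a rough sketch of what multi-core use can look like: recent releases of tune run resampling through the future package, while older releases relied on a foreach backend such as doParallel. The mtcars data, the worker count, and the model below are illustrative assumptions and are not part of the lab.

```r
library(tidymodels)
library(future)  # assumption: a tune release that uses future for parallelism

# Start two background R sessions; pick a count that suits your machine
plan(multisession, workers = 2)

set.seed(1000)
folds <- vfold_cv(mtcars, v = 5)  # 5-fold cross-validation on a built-in data set

# Each fold can be fitted on a separate worker
cv_results <- fit_resamples(linear_reg(), mpg ~ wt, resamples = folds)
cv_metrics <- collect_metrics(cv_results)
cv_metrics

plan(sequential)  # return to single-core processing
```

On a data set this small the overhead of starting workers will likely outweigh any gain; the pattern pays off with larger data or tuning grids.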

## Comparison

By number of monthly downloads from GitHub, the most popular ML framework available for R to date is caret, together with its successor packages, which are wrapped together in the tidymodels framework. Max Kuhn develops both. Just as mlr was refactored into mlr3, caret was refactored into tidymodels. Because caret has been around for a long time, there are numerous resources, answers, and solutions for most questions that come up. Tidymodels, on the other hand, is newer and is built on tidyverse principles. RStudio hired Max Kuhn with the aim of designing a tidy version of caret. Because tidymodels follows tidyverse principles, it is more unified and follows familiar patterns, such as the use of pipes.

Caret is a single package with various functions for machine learning, for example createDataPartition for splitting data and trainControl for setting up cross-validation.

Tidymodels is a collection of packages for modelling. It is being designed as a set of decoupled packages, and the key modelling steps are already implemented. This offers greater flexibility for defining models. However, even though it is more readable and familiar to tidyverse users, it can be difficult to remember the workflow or stay on track while building a model, because the steps are spread across several packages. In other words, there still isn't a completely unified workflow that is as succinct and elegant as the one in caret.
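
To make the contrast concrete, here is a minimal side-by-side sketch of a cross-validated linear regression in both frameworks. It assumes caret is installed alongside tidymodels and uses the built-in mtcars data as a stand-in; the object names are hypothetical. The caret functions are called with the caret:: prefix to avoid name clashes with tidymodels packages.

```r
library(tidymodels)

set.seed(1000)

# caret: one package; train() handles resampling and fitting in a single call
idx <- caret::createDataPartition(mtcars$mpg, p = 0.75, list = FALSE)
caret_train <- mtcars[idx, ]
ctrl <- caret::trainControl(method = "cv", number = 5)
caret_fit <- caret::train(mpg ~ wt, data = caret_train, method = "lm", trControl = ctrl)

# tidymodels: rsample splits, parsnip specifies the model, tune runs the resampling
car_split <- initial_split(mtcars, prop = 0.75)
tidy_train <- training(car_split)
folds <- vfold_cv(tidy_train, v = 5)
tidy_cv <- fit_resamples(linear_reg(), mpg ~ wt, resamples = folds)
tidy_metrics <- collect_metrics(tidy_cv)

caret_fit$results  # cross-validated RMSE, Rsquared, and MAE from caret
tidy_metrics       # cross-validated rmse and rsq from tidymodels
```

The caret version is shorter, while the tidymodels version exposes each step as a separate object that can be inspected or swapped out.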

Pros:
* Familiar patterns from tidyverse
* Tidy version of caret
* Greater flexibility for defining models

Cons:
* It is still in the development phase
* Newer framework, so there are fewer resources and discussions about it
* There is not a completely unified workflow

## Links

* https://www.tidymodels.org/
* https://talkrtive.com/post/tidymodels-for-beginners/
* https://tidymodels.tidymodels.org/
* https://towardsdatascience.com/caret-vs-tidymodels-how-to-use-both-packages-together-ee3f85b381c
* https://www.r-bloggers.com/2019/12/meta-machine-learning-aggregator-packages-in-r-the-2nd-generation/

# Implementation

## Load the Data for the First Lab Exercise

In [None]:
# Install tidymodels if it is missing, then attach it
if (!requireNamespace("tidymodels", quietly = TRUE)) install.packages("tidymodels")
library(tidymodels)

In [None]:
options(repr.matrix.max.rows=20, repr.matrix.max.cols=15) # limit how many rows and columns are shown when tables are printed

In [None]:
datapath <- "~/data_ad454"

In [None]:
realty_data <- readRDS(file.path(datapath, "rds", "02_01_realty_data.rds"))

In [None]:
realty_data

In [None]:
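# Summary statistics for the numeric columns
# (tidy() on a data frame needs an older broom release; newer releases no longer include a data frame tidier)
realty_data %>% keep(is.numeric) %>% broom::tidy() %>% mutate_if(is.numeric, round, 2) %>%
  select(column, n, mean, sd, median, min, max)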
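
Where the data frame tidier is not available, a similar per-column summary can be sketched with dplyr and tidyr alone. This is one possible approach, not the lab's original method; mtcars stands in for realty_data, and the name numeric_summary is hypothetical.

```r
library(dplyr)
library(tidyr)

# Stand-in data; in the notebook this would be realty_data
example_data <- mtcars

numeric_summary <- example_data %>%
  select(where(is.numeric)) %>%
  # One row per (column, value) pair so that every column is summarised the same way
  pivot_longer(everything(), names_to = "column", values_to = "value") %>%
  group_by(column) %>%
  summarise(
    n      = sum(!is.na(value)),
    mean   = mean(value, na.rm = TRUE),
    sd     = sd(value, na.rm = TRUE),
    median = median(value, na.rm = TRUE),
    min    = min(value, na.rm = TRUE),
    max    = max(value, na.rm = TRUE)
  ) %>%
  mutate(across(where(is.numeric), ~ round(.x, 2)))

numeric_summary
```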

In [None]:
# oda = rooms, brut_metrekare = gross square meters
features <- c("oda", "brut_metrekare")

In [None]:
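# Keep listings with one bathroom (banyo_sayisi) and one living room (salon),
# then trim brut_metrekare to its 5th-95th percentile range
realty_data2 <- realty_data %>%
  filter(banyo_sayisi == 1 & salon == 1) %>%
  select(all_of(features)) %>%
  na.omit() %>%
  filter(between(brut_metrekare, quantile(brut_metrekare, 0.05), quantile(brut_metrekare, 0.95)))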

In [None]:
realty_data2 %>%
  ggplot(aes(x = oda, y = brut_metrekare)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

## Partitioning - with tidymodels

In [None]:
set.seed(1000)

In [None]:
realty_split <- initial_split(realty_data2, prop = 0.75,
                              strata = brut_metrekare)

In [None]:
realty_training <- realty_split %>%
  training()

In [None]:
realty_test <- realty_split %>%
  testing()

In [None]:
realty_training

In [None]:
realty_test
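
When strata is a numeric column, rsample bins its values into quantile-based groups (four by default) and samples within each group, so the outcome distribution should be similar in both sets. The sketch below illustrates this on the diamonds data from ggplot2, used here as an assumed stand-in because stratification needs a reasonably large data set; the object names are hypothetical.

```r
library(tidymodels)

set.seed(1000)

# diamonds ships with ggplot2; it stands in for the realty data here
diamond_split <- initial_split(diamonds, prop = 0.75, strata = price)
diamond_training <- training(diamond_split)
diamond_test <- testing(diamond_split)

# Share of rows that landed in the training set (close to 0.75)
nrow(diamond_training) / nrow(diamonds)

# With stratification the outcome summaries should look alike in both sets
bind_rows(
  training = summarise(diamond_training, mean_price = mean(price), median_price = median(price)),
  testing  = summarise(diamond_test, mean_price = mean(price), median_price = median(price)),
  .id = "set"
)
```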

## Model Specification and Fitting

In [None]:
lm_model <- linear_reg() %>% 
            set_engine('lm') %>% # adds lm implementation of linear regression
            set_mode('regression')

In [None]:
lm_fit <- lm_model %>% 
          fit(brut_metrekare ~ oda, data = realty_training)

In [None]:
lm_fit
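
The Summary lists recipes and workflows, which this lab does not otherwise use. As a hedged sketch of how they fit together, the example below bundles a preprocessing recipe and a model specification into one workflow. It uses mtcars as stand-in data, and the normalisation step is an illustrative choice rather than something the realty model requires.

```r
library(tidymodels)

set.seed(1000)
car_split <- initial_split(mtcars, prop = 0.75)

# Preprocessing: centre and scale the predictor (illustrative choice)
car_recipe <- recipe(mpg ~ wt, data = training(car_split)) %>%
  step_normalize(all_numeric_predictors())

car_model <- linear_reg() %>%
  set_engine("lm")

# A workflow keeps the recipe and the model specification together
car_workflow <- workflow() %>%
  add_recipe(car_recipe) %>%
  add_model(car_model)

car_fit <- fit(car_workflow, data = training(car_split))

# The same preprocessing is applied automatically at prediction time
car_predictions <- predict(car_fit, new_data = testing(car_split))
car_predictions
```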

## Exploring Training Results

In [None]:
names(lm_fit)

In [None]:
summary(lm_fit$fit)

In [None]:
par(mfrow=c(2,2))
plot(lm_fit$fit, 
     pch = 16,  
     col = '#006EA1')

In [None]:
# Tidy training results with broom: coefficient table (tidy) and model-level statistics (glance)

In [None]:
tidy(lm_fit)

In [None]:
glance(lm_fit)

## Evaluating Test Set Accuracy

In [None]:
predict(lm_fit, new_data = realty_test)

In [None]:
realty_test_results <- predict(lm_fit, new_data = realty_test) %>%
  bind_cols(realty_test)

In [None]:
realty_test_results
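
An alternative to combining predict() and bind_cols() by hand is augment(), which returns the new data together with a .pred column. The sketch below shows the idea on mtcars as stand-in data; the object names are hypothetical.

```r
library(tidymodels)

set.seed(1000)
car_split <- initial_split(mtcars, prop = 0.75)

car_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(mpg ~ wt, data = training(car_split))

# augment() attaches the predictions to the rows they were made for
car_results <- augment(car_fit, new_data = testing(car_split))
car_results %>% select(mpg, wt, .pred)
```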

## RMSE and R2 on the Test Data 

In [None]:
rmse(realty_test_results, 
     truth = brut_metrekare,
     estimate = .pred)

In [None]:
rsq(realty_test_results,
    truth = brut_metrekare,
    estimate = .pred)
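
Several yardstick metrics can also be computed in one call by bundling them with metric_set(). The sketch below uses a small invented table of true and predicted values, purely for illustration; in the notebook the same function could be applied to realty_test_results with truth = brut_metrekare.

```r
library(tidymodels)

# Bundle several yardstick metrics into one function
regression_metrics <- metric_set(rmse, rsq, mae)

# Toy predictions, invented for illustration only
toy_results <- tibble(
  truth = c(90, 110, 125, 140),
  .pred = c(95, 105, 130, 135)
)

# Returns one row per metric
toy_metrics <- regression_metrics(toy_results, truth = truth, estimate = .pred)
toy_metrics
```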

In [None]:
ggplot(data = realty_test_results,
       mapping = aes(x = brut_metrekare, y = .pred)) +
  geom_point(color = '#006EA1') +
  geom_abline(intercept = 0, slope = 1, color = 'orange') +
  labs(title = 'Linear Regression Results - Realty Test Set',
       x = 'Actual Gross Square Meters',
       y = 'Predicted Gross Square Meters')