Keep original labels in step_discretize #674

Closed
renanxcortes opened this issue Mar 27, 2021 · 6 comments · Fixed by #951

@renanxcortes

renanxcortes commented Mar 27, 2021

Hi there!

I'd like to request a feature that keeps the interval labels generated by the internal cut function in discretize, instead of "bin1", "bin2", etc. One option would be an argument such as keep_cut_labels = TRUE.

Minimal Reproducible Example:

Current Behaviour:

library(recipes)   # needed for recipe(), step_discretize(), prep(), bake()
library(modeldata)
data(biomass)

biomass_tr <- biomass[biomass$dataset == "Training",]
biomass_te <- biomass[biomass$dataset == "Testing",]

rec <- recipe(HHV ~ carbon,
              data = biomass_tr) %>% 
  step_discretize(carbon)

rec <- prep(rec, biomass_tr)
binned_te <- bake(rec, biomass_te)
table(binned_te$carbon)

image

Expected behaviour:

breaks <- quantile(biomass_tr$carbon, probs = seq(0, 1, length = 4 + 1))
table(cut(biomass_te$carbon, breaks = breaks))

image
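
One caveat with the quantile-plus-cut snippet above, sketched here with made-up numbers rather than the biomass data: cut() returns NA for new values outside the training range, and with its default right-closed intervals it also drops the training minimum. The tidy() output later in this thread shows step_discretize using -Inf and Inf as the outer breaks, so this sketch swaps them in the same way. The variable names are hypothetical.

```r
# Made-up training and test values (not the biomass data).
train <- c(1, 2, 3, 4, 5, 6, 7, 8)
test <- c(0, 1, 4.5, 9)

# Quartile breaks computed on the training values only.
breaks <- quantile(train, probs = seq(0, 1, length.out = 5), names = FALSE)

# Naive version: 0 and 9 fall outside the range, and 1 sits on the
# open left edge of the first interval, so all three become NA.
naive <- cut(test, breaks = breaks)

# Open-ended version: replace the outer breaks with -Inf and Inf.
open_breaks <- breaks
open_breaks[c(1, length(open_breaks))] <- c(-Inf, Inf)
safe <- cut(test, breaks = open_breaks)

print(data.frame(test, naive, safe))
```

With the open-ended breaks every test value lands in a bin, at the cost of the outer labels showing -Inf and Inf instead of the observed minimum and maximum.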

@juliasilge
Member

You can get those values out using the tidy() method for a recipe:

library(recipes)
#> Loading required package: dplyr
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
#> 
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stats':
#> 
#>     step
data(biomass, package = "modeldata")

biomass_tr <- biomass[biomass$dataset == "Training",]
biomass_te <- biomass[biomass$dataset == "Testing",]

rec <- recipe(HHV ~ carbon,
              data = biomass_tr) %>% 
  step_discretize(carbon)

rec <- prep(rec, biomass_tr)
binned_te <- bake(rec, biomass_te)
table(binned_te$carbon)
#> 
#> bin_missing        bin1        bin2        bin3        bin4 
#>           0          22          17          25          16

tidy(rec, 1)
#> # A tibble: 5 x 3
#>   terms   value id              
#>   <chr>   <dbl> <chr>           
#> 1 carbon -Inf   discretize_gclTQ
#> 2 carbon   44.7 discretize_gclTQ
#> 3 carbon   47.1 discretize_gclTQ
#> 4 carbon   49.7 discretize_gclTQ
#> 5 carbon  Inf   discretize_gclTQ

Created on 2021-03-29 by the reprex package (v1.0.0)

You can read more about tidying a recipe here. Can you say more about your use case for wanting factor levels like that in your output?
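
As a rough sketch of how those tidy() values could be used, the breaks can be passed straight to base R's cut() to get interval labels. The breaks below are copied from the printed output above (so they are rounded), the carbon values are hypothetical stand-ins for biomass_te$carbon, and the right-closed intervals are an assumption based on cut()'s default, so it is worth comparing the result against the baked bins.

```r
# Breaks copied from the tidy(rec, 1) output above (rounded as printed).
breaks <- c(-Inf, 44.7, 47.1, 49.7, Inf)

# Hypothetical carbon values standing in for biomass_te$carbon.
carbon <- c(40.2, 45.0, 48.3, 52.1, NA)

# cut() builds interval labels from the breaks.
# Assumption: intervals are right-closed, which is cut()'s default.
carbon_interval <- cut(carbon, breaks = breaks)

# Tabulate the intervals, showing missing values if there are any.
print(table(carbon_interval, useNA = "ifany"))
```

In a real session the breaks would come from tidy(rec, 1)$value rather than being typed in by hand.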

@renanxcortes
Author

renanxcortes commented Mar 29, 2021

Hi, Julia, thanks for the reply! I didn't know that you can tidy the recipe and get these values!

To be clearer about my use case: after baking a recipe with new data, it would be great if the columns produced by step_discretize could keep interval labels similar to the ones generated by the cut function, instead of "bin01", "bin02", "bin03", etc.

This would make exploring and understanding the data after baking much easier.

Is there a workaround for this?

Minimal reprex:

library(recipes)
data(biomass, package = "modeldata")

biomass_tr <- biomass[biomass$dataset == "Training",]
biomass_te <- biomass[biomass$dataset == "Testing",]

rec <- recipe(HHV ~ .,
              data = biomass_tr) %>% 
  step_discretize(all_numeric(), 
                  num_breaks = 10,
                  options = list(keep_na = TRUE, na.rm = TRUE)) %>% 
  step_mutate_at(all_numeric(), fn = as.factor) %>% 
  step_other(all_predictors()) %>% 
  step_unknown(all_predictors())

rec <- prep(rec, biomass_tr)
binned_te <- bake(rec, biomass_te) # Perhaps add keep_original_label = TRUE here
head(binned_te)

Tibble generated:

image
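
One possible workaround for the relabeling asked about above, as a sketch only: rename the "binNN" levels of an already baked factor using the breaks for that column. The helper relabel_bins() is hypothetical and not part of recipes. It assumes the bin levels are named "bin" plus a number, that every other level (such as "bin_missing") should be left alone, and that the intervals are right-closed as in cut()'s default.

```r
# Hypothetical helper: swap "bin1", "bin2", ... for cut()-style labels.
relabel_bins <- function(binned, breaks) {
  # cut() on an empty vector yields just the interval labels for these breaks.
  interval_labels <- levels(cut(numeric(0), breaks = breaks))

  old_levels <- levels(binned)
  is_bin <- grepl("^bin[0-9]+$", old_levels)

  # The number of bin levels must match the number of intervals.
  stopifnot(sum(is_bin) == length(interval_labels))

  # "bin03" -> 3, so each bin picks up the interval with the same index.
  bin_index <- as.integer(sub("^bin", "", old_levels[is_bin]))
  old_levels[is_bin] <- interval_labels[bin_index]

  levels(binned) <- old_levels
  binned
}

# Made-up baked column, with breaks copied from the tidy() output earlier.
binned <- factor(c("bin1", "bin3", "bin2", "bin4"),
                 levels = c("bin_missing", "bin1", "bin2", "bin3", "bin4"))
relabeled <- relabel_bins(binned, breaks = c(-Inf, 44.7, 47.1, 49.7, Inf))
print(table(relabeled))
```

For a recipe that discretizes many columns, the breaks for each column could be taken from the tidy() output grouped by terms, and the helper applied column by column. This would have to run on the output of step_discretize alone, before later steps such as step_other collapse or rename levels.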

@renanxcortes
Author

renanxcortes commented Mar 31, 2021

This is also related to #157. If step_cut allowed the user to specify the number of bins instead of fixed cut values, I think this issue would be solved.
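
Until something like that exists, a rough sketch of the idea is to compute the interior quantile breaks by hand and treat them as the fixed values. The data below is simulated and merely stands in for a training column, and num_bins is a hypothetical name. The resulting vector could presumably be handed to the breaks argument of step_cut, but that part is not run here.

```r
set.seed(1)

# Simulated stand-in for a numeric training column such as biomass_tr$carbon.
x <- rnorm(200, mean = 47, sd = 3)

# Desired number of roughly equal-frequency bins (hypothetical name).
num_bins <- 4

# Quantile breaks on the training data, then drop the minimum and maximum
# so that only the interior cut points remain.
all_breaks <- quantile(x, probs = seq(0, 1, length.out = num_bins + 1),
                       names = FALSE)
interior_breaks <- all_breaks[-c(1, length(all_breaks))]

# Check the bin frequencies with open-ended outer intervals.
bin_counts <- table(cut(x, breaks = c(-Inf, interior_breaks, Inf)))
print(bin_counts)
```

Because the breaks are computed from the training data only, new data is binned with the same cut points, which mirrors what prep() and bake() do for step_discretize.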

@renanxcortes
Author

renanxcortes commented Apr 3, 2021

I realized that step_discretize_xgb (as well as step_discretize_cart) from the embed package (https://github.com/tidymodels/embed) returns the labels requested in this issue.

Minimal reprex:

library(modeldata)
library(xgboost)
library(recipes)
library(embed)  # provides step_discretize_xgb() and step_discretize_cart()
data(biomass, package = "modeldata")

biomass_tr <- biomass[biomass$dataset == "Training",]
biomass_te <- biomass[biomass$dataset == "Testing",]

rec <- recipe(HHV ~ .,
              data = biomass_tr) %>% 
  step_discretize_xgb(all_numeric(), 
                      outcome = "HHV") %>% 
  step_mutate_at(all_numeric(), fn = as.factor) %>% 
  step_other(all_predictors()) %>% 
  step_unknown(all_predictors())

rec <- prep(rec, biomass_tr)
binned_te <- bake(rec, biomass_te)
head(binned_te)

Which returns:

image

The downside is that it does not necessarily return bins with roughly equal frequencies.

@renanxcortes changed the title from "Keep original labels in discretize" to "Keep original labels in step_discretize" on Apr 4, 2021
@renanxcortes
Author

Hello, folks, is there any ongoing work on this improvement? Cheers!

@github-actions

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators May 19, 2022