Feature
In the tidymodels developer section, there is an example that creates a step_percentile() function. I think this would be great to have included in recipes. There is a workaround using step_mutate(), but it leaves an easy and not entirely obvious opportunity for data leakage, and I would imagine it's a relatively common transformation regardless. Any thoughts?
library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#> method from
#> required_pkgs.model_spec parsnip
data("car_prices")
set.seed(24)
splits <- initial_split(car_prices)
car_train <- training(splits)
car_test <- testing(splits)
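# No leak: the ECDF is built explicitly from car_train, so bake() scores
# new data against the training distribution
car_rec_no_leak <- recipe(Price ~ ., data = car_train) %>%
  step_mutate(Mileage = ecdf(car_train$Mileage)(Mileage)) %>%
  prep()
# Leak: ecdf(Mileage) is re-evaluated on whatever data is passed to bake()
car_rec_leak <- recipe(Price ~ ., data = car_train) %>%
  step_mutate(Mileage = ecdf(Mileage)(Mileage)) %>%
  prep()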
bake(car_rec_no_leak, new_data = car_test)
#> # A tibble: 201 x 18
#> Mileage Cylinder Doors Cruise Sound Leather Buick Cadillac Chevy Pontiac
#> <dbl> <int> <int> <int> <int> <int> <int> <int> <int> <int>
#> 1 0.682 4 2 1 0 0 0 0 0 0
#> 2 0.153 4 4 1 1 0 0 0 0 0
#> 3 0.164 4 4 1 1 0 0 0 0 0
#> 4 0.597 4 4 1 1 0 0 0 0 0
#> 5 0.891 4 4 1 1 1 0 0 0 0
#> 6 0.257 4 4 1 1 1 0 0 0 0
#> 7 0.597 4 4 1 1 1 0 0 0 0
#> 8 0.461 4 4 1 0 0 0 0 0 1
#> 9 0.365 4 4 1 1 0 0 0 0 1
#> 10 0.935 8 2 1 1 1 0 0 1 0
#> # ... with 191 more rows, and 8 more variables: Saab <int>, Saturn <int>,
#> # convertible <int>, coupe <int>, hatchback <int>, sedan <int>, wagon <int>,
#> # Price <dbl>
# Testing data is used to compute the percentile
bake(car_rec_leak, new_data = car_test)
#> # A tibble: 201 x 18
#> Mileage Cylinder Doors Cruise Sound Leather Buick Cadillac Chevy Pontiac
#> <dbl> <int> <int> <int> <int> <int> <int> <int> <int> <int>
#> 1 0.652 4 2 1 0 0 0 0 0 0
#> 2 0.169 4 4 1 1 0 0 0 0 0
#> 3 0.179 4 4 1 1 0 0 0 0 0
#> 4 0.602 4 4 1 1 0 0 0 0 0
#> 5 0.925 4 4 1 1 1 0 0 0 0
#> 6 0.289 4 4 1 1 1 0 0 0 0
#> 7 0.597 4 4 1 1 1 0 0 0 0
#> 8 0.443 4 4 1 0 0 0 0 0 1
#> 9 0.333 4 4 1 1 0 0 0 0 1
#> 10 0.970 8 2 1 1 1 0 0 1 0
#> # ... with 191 more rows, and 8 more variables: Saab <int>, Saturn <int>,
#> # convertible <int>, coupe <int>, hatchback <int>, sedan <int>, wagon <int>,
#> # Price <dbl>
Created on 2021-08-05 by the reprex package (v2.0.0)
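To make the leak concrete without recipes, here is a minimal base-R sketch of the two behaviours. The mileage vectors are made-up data (not car_prices), and the "prep"/"bake" labels are only an analogy for what a dedicated step would presumably do: estimate the ECDF once from the training data and reuse it afterwards.

```r
# Made-up data for illustration only (not the car_prices data set)
set.seed(24)
train_mileage <- runif(100, min = 0, max = 50000)
test_mileage  <- runif(20, min = 20000, max = 70000)

# "prep": estimate the ECDF once, from the training data only
train_ecdf <- ecdf(train_mileage)

# "bake" without leak: reuse the stored training ECDF on new data
no_leak <- train_ecdf(test_mileage)

# "bake" with leak: the ECDF is re-estimated from the new data itself
leak <- ecdf(test_mileage)(test_mileage)

# The two columns disagree for the same test rows
head(cbind(no_leak, leak))

# Extreme case: a single new row is always its own maximum in the leaky version
single_row_leak <- ecdf(test_mileage[1])(test_mileage[1])
```

In the leaky version the result also depends on which rows happen to be scored together, so the same car can get a different percentile in a different batch.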