Request step_percentile(), non-obvious opportunity for data leak with step_mutate() #765

@joeycouse


Feature

The tidymodels developer section has an example that creates a step_percentile() function. I think it would be great to have this included in recipes. There is a workaround using step_mutate(), but it leaves an easy opportunity for a data leak that isn't entirely obvious: if the ECDF is built from the column itself inside step_mutate(), it is re-estimated from whatever data is passed to bake(). I would imagine it's a relatively common transformation regardless. Any thoughts?

library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#>   method                   from   
#>   required_pkgs.model_spec parsnip

data("car_prices")

set.seed(24)

splits <- initial_split(car_prices)
car_train <- training(splits)
car_test <- testing(splits)

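# ECDF is built from the training column explicitly, so bake() reuses
# the training distribution for any new data
car_rec_no_leak <- recipe(Price ~ ., data = car_train) %>%
  step_mutate(Mileage = ecdf(car_train$Mileage)(Mileage)) %>%
  prep()

# ECDF is built from whatever `Mileage` column is being processed, so
# bake() re-estimates it from the new data
car_rec_leak <- recipe(Price ~ ., data = car_train) %>%
  step_mutate(Mileage = ecdf(Mileage)(Mileage)) %>%
  prep()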


bake(car_rec_no_leak, new_data = car_test)
#> # A tibble: 201 x 18
#>    Mileage Cylinder Doors Cruise Sound Leather Buick Cadillac Chevy Pontiac
#>      <dbl>    <int> <int>  <int> <int>   <int> <int>    <int> <int>   <int>
#>  1   0.682        4     2      1     0       0     0        0     0       0
#>  2   0.153        4     4      1     1       0     0        0     0       0
#>  3   0.164        4     4      1     1       0     0        0     0       0
#>  4   0.597        4     4      1     1       0     0        0     0       0
#>  5   0.891        4     4      1     1       1     0        0     0       0
#>  6   0.257        4     4      1     1       1     0        0     0       0
#>  7   0.597        4     4      1     1       1     0        0     0       0
#>  8   0.461        4     4      1     0       0     0        0     0       1
#>  9   0.365        4     4      1     1       0     0        0     0       1
#> 10   0.935        8     2      1     1       1     0        0     1       0
#> # ... with 191 more rows, and 8 more variables: Saab <int>, Saturn <int>,
#> #   convertible <int>, coupe <int>, hatchback <int>, sedan <int>, wagon <int>,
#> #   Price <dbl>

# Testing data is used to compute the percentile
bake(car_rec_leak, new_data = car_test)
#> # A tibble: 201 x 18
#>    Mileage Cylinder Doors Cruise Sound Leather Buick Cadillac Chevy Pontiac
#>      <dbl>    <int> <int>  <int> <int>   <int> <int>    <int> <int>   <int>
#>  1   0.652        4     2      1     0       0     0        0     0       0
#>  2   0.169        4     4      1     1       0     0        0     0       0
#>  3   0.179        4     4      1     1       0     0        0     0       0
#>  4   0.602        4     4      1     1       0     0        0     0       0
#>  5   0.925        4     4      1     1       1     0        0     0       0
#>  6   0.289        4     4      1     1       1     0        0     0       0
#>  7   0.597        4     4      1     1       1     0        0     0       0
#>  8   0.443        4     4      1     0       0     0        0     0       1
#>  9   0.333        4     4      1     1       0     0        0     0       1
#> 10   0.970        8     2      1     1       1     0        0     1       0
#> # ... with 191 more rows, and 8 more variables: Saab <int>, Saturn <int>,
#> #   convertible <int>, coupe <int>, hatchback <int>, sedan <int>, wagon <int>,
#> #   Price <dbl>

Created on 2021-08-05 by the reprex package (v2.0.0)
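
The difference between the two recipes can be isolated without recipes at all. The following is a minimal base-R sketch, not the step_percentile() implementation from the developer article. It assumes the built-in mtcars data as a stand-in for car_prices, and the names train, test, and mpg_ecdf are illustrative only.

```r
# Base-R sketch; mtcars stands in for car_prices (an assumption for illustration)
set.seed(24)
idx <- sample(nrow(mtcars), size = 24)
train <- mtcars[idx, ]
test <- mtcars[-idx, ]

# "Prep": estimate the ECDF once, from the training data only
mpg_ecdf <- ecdf(train$mpg)

# "Bake" without a leak: apply the stored training ECDF to new data
pct_no_leak <- mpg_ecdf(test$mpg)

# Leaky variant: the ECDF is re-estimated from the data being transformed
pct_leak <- ecdf(test$mpg)(test$mpg)

# With a single new row the leaky variant carries no information at all:
# the ECDF of one value, evaluated at that value, is always 1
one_row <- test[1, ]
leak_one <- ecdf(one_row$mpg)(one_row$mpg)
no_leak_one <- mpg_ecdf(one_row$mpg)
```

In the leaky variant the result for a given car depends on which other rows happen to be in the data passed in, which is why the two bake() outputs above differ. A dedicated step would presumably store something like mpg_ecdf at prep() time and only apply it at bake() time.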
