Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

No formula and adding a step causes all variables to have role predictor role #296

Closed
olangfor opened this issue Mar 15, 2019 · 2 comments · Fixed by #297
Closed

No formula and adding a step causes all variables to have role predictor role #296

olangfor opened this issue Mar 15, 2019 · 2 comments · Fixed by #297

Comments

@olangfor
Copy link

Hi,

Firstly, thank you for this wonderful package.

I have query regarding the behaviour of adding a step (example below uses step_dummy). When I don't add a formula and perform juice(prepped object, all_predictors()), this returns all variables. It seems by not having a role, then adding step it makes all variables have a role of predictor.

Please see the examples below. I've tried to illustrate the issue in the first one, with expected behaviour in the next two examples.

Please excuse me if I've misused the package.

Best wishes,

Oli

library(recipes); 
library(nlme) # for data

# Add individual steps
obj_rec <- recipe(x = as.data.frame(Orthodont)) %>% 
  add_role(distance, new_role = 'outcome') %>%
  add_role(age, new_role = 'predictor') %>%
  add_role(Sex, new_role = 'predictor')  %>%
  step_dummy(Sex) 
obj_prep <- prep(obj_rec, training = Orthodont, retain = TRUE)
X <- juice(obj_prep, all_predictors()) 
head(X) # Expected?
y <- juice(obj_prep, all_outcomes()) 
head(y)

# No step_dummy
obj_rec <- recipe(x = as.data.frame(Orthodont)) %>% 
  add_role(distance, new_role = 'outcome') %>%
  add_role(age, new_role = 'predictor') %>%
  add_role(Sex, new_role = 'predictor') 
obj_prep <- prep(obj_rec, training = Orthodont, retain = TRUE)
X <- juice(obj_prep, all_predictors()) 
head(X)
y <- juice(obj_prep, all_outcomes()) 
head(y)


# Formula method works fine
obj_rec <- recipe(distance ~ age + Sex, data = as.data.frame(Orthodont)) %>% 
  step_dummy(Sex) 
obj_prep <- prep(obj_rec, training = Orthodont, retain = TRUE)
X <- juice(obj_prep, all_predictors()) 
head(X)
y <- juice(obj_prep, all_outcomes()) 
head(y)
@DavisVaughan
Copy link
Member

Two things. First this is a bug, so thanks for reporting! Second, generally you should be using update_role() rather than add_role() when you start with a data frame and not a formula. The default is to assume that you want to use all of the variables in the data frame, and they are all given an NA role. So you really want to update that, not add a new one. An open PR will prevent you from adding a role when an existing one is NA, so you should get used to this workflow below (it still has the bug though):

suppressPackageStartupMessages({
  library(recipes)
  library(nlme)
})

obj_rec <- recipe(x = as.data.frame(Orthodont)) %>% 
  update_role(distance, new_role = 'outcome') %>%
  update_role(age, new_role = 'predictor') %>%
  update_role(Sex, new_role = 'predictor')  %>%
  step_dummy(Sex) 

obj_prep <- prep(obj_rec, training = Orthodont)

juice(obj_prep, all_predictors()) 
#> # A tibble: 108 x 3
#>      age Subject Sex_Female
#>    <dbl> <ord>        <dbl>
#>  1     8 M01              0
#>  2    10 M01              0
#>  3    12 M01              0
#>  4    14 M01              0
#>  5     8 M02              0
#>  6    10 M02              0
#>  7    12 M02              0
#>  8    14 M02              0
#>  9     8 M03              0
#> 10    10 M03              0
#> # … with 98 more rows

Created on 2019-03-15 by the reprex package (v0.2.1.9000)

This is what I mean by the fact that you should not be adding roles when NA roles already exist.

suppressPackageStartupMessages({
  library(recipes)
  library(nlme)
})

obj_rec <- recipe(x = as.data.frame(Orthodont)) %>% 
  add_role(distance, new_role = 'outcome') %>%
  add_role(age, new_role = 'predictor') %>%
  add_role(Sex, new_role = 'predictor')  %>%
  step_dummy(Sex) 

summary(obj_rec)
#> # A tibble: 7 x 4
#>   variable type    role      source  
#>   <chr>    <chr>   <chr>     <chr>   
#> 1 distance numeric <NA>      original
#> 2 distance numeric outcome   original
#> 3 age      numeric <NA>      original
#> 4 age      numeric predictor original
#> 5 Subject  nominal <NA>      original
#> 6 Sex      nominal <NA>      original
#> 7 Sex      nominal predictor original

Created on 2019-03-15 by the reprex package (v0.2.1.9000)

I've opened a PR #297 to fix this issue. With that fixed version installed we get the desired behavior:

suppressPackageStartupMessages({
  library(recipes)
  library(nlme)
})

obj_rec <- recipe(x = as.data.frame(Orthodont)) %>% 
  update_role(distance, new_role = 'outcome') %>%
  update_role(age, new_role = 'predictor') %>%
  update_role(Sex, new_role = 'predictor')  %>%
  step_dummy(Sex) 

obj_prep <- prep(obj_rec, training = Orthodont)

juice(obj_prep, all_predictors()) 
#> # A tibble: 108 x 2
#>      age Sex_Female
#>    <dbl>      <dbl>
#>  1     8          0
#>  2    10          0
#>  3    12          0
#>  4    14          0
#>  5     8          0
#>  6    10          0
#>  7    12          0
#>  8    14          0
#>  9     8          0
#> 10    10          0
#> # … with 98 more rows

Created on 2019-03-15 by the reprex package (v0.2.1.9000)

@github-actions
Copy link

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators Feb 23, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants