unexpected behaviour with untransformed "numeric" values in step_num2factor #575

Open
Bijaelo opened this issue Sep 24, 2020 · 3 comments
Labels
bug an unexpected problem or unintended behavior

Comments


Bijaelo commented Sep 24, 2020

A few times I've come across a situation where a numeric variable has very few distinct values and I want to convert it to a factor. I've noticed that step_num2factor sometimes unexpectedly returns a column of NA values.

For example:

library(recipes)
data(mtcars)
mtcars %>% 
  recipe( mpg ~ hp + cyl ) %>% 
  step_num2factor(cyl, ordered = TRUE, levels = letters[1:3]) %>%
  prep() %>%
  juice()
# A tibble: 32 x 3
      hp cyl     mpg
   <dbl> <ord> <dbl>
 1   110 NA     21  
 2   110 NA     21  
 3    93 NA     22.8
 4   110 NA     21.4
 5   175 NA     18.7
 6   105 NA     18.1
 7   245 NA     14.3
 8    62 NA     24.4
 9    95 NA     22.8
10   123 NA     19.2
# … with 22 more rows

This is unexpected because

  1. There is no error or warning along the lines of "unknown levels replaced with NA"
  2. This behaviour is unlike base R's factor(mtcars$cyl, levels = c(4, 6, 8), labels = letters[1:3])
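
For comparison, here is a minimal base R sketch of the behaviour referred to in point 2. It uses only base::factor() and the built-in mtcars data; the variable name cyl_base is illustrative and not part of the original report.

```r
# Base R comparison: map the observed values 4, 6, 8 to the labels a, b, c.
data(mtcars)

cyl_base <- factor(
  mtcars$cyl,
  levels  = c(4, 6, 8),    # the numeric values actually present in cyl
  labels  = letters[1:3],  # the labels they should be mapped to
  ordered = TRUE
)

# The first six rows line up with hp = 110, 110, 93, 110, 175, 105.
head(as.character(cyl_base))
#> [1] "b" "b" "a" "b" "c" "b"

# No value is turned into NA, because levels lists the values themselves.
anyNA(cyl_base)
#> [1] FALSE
```

The difference is that factor() matches on the values given in levels, while step_num2factor appears to treat the (integer) values as positions in levels.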

This looks like a problem with how the numeric values are handled. It can be worked around by adding step_integer beforehand, but it is somewhat misleading that step_num2factor states that it converts numeric values if one has to convert them to integers first.

mtcars %>% 
  recipe( mpg ~ hp + cyl ) %>% 
  step_integer(cyl) %>%
  step_num2factor(cyl, ordered = TRUE, levels = c(letters[1:3])) %>%
  prep() %>%
  juice()
# A tibble: 32 x 3
      hp cyl     mpg
   <dbl> <ord> <dbl>
 1   110 b      21  
 2   110 b      21  
 3    93 a      22.8
 4   110 b      21.4
 5   175 c      18.7
 6   105 b      18.1
 7   245 c      14.3
 8    62 a      24.4
 9    95 a      22.8
10   123 b      19.2
# … with 22 more rows
sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04 LTS

Matrix products: default
BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-openmp/libopenblasp-r0.3.8.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8   
 [6] LC_MESSAGES=C              LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] patchwork_1.0.1      ggthemes_4.2.0       microbenchmark_1.4-7 yardstick_0.0.7      workflows_0.2.0      tune_0.1.1          
 [7] tidyr_1.1.2          tibble_3.0.3         rsample_0.0.7        recipes_0.1.13       purrr_0.3.4          parsnip_0.1.3       
[13] modeldata_0.0.2      infer_0.5.3          ggplot2_3.3.2        dplyr_1.0.2          dials_0.0.9          scales_1.1.1        
[19] broom_0.7.0          tidymodels_0.1.1    

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.5         lubridate_1.7.9    lattice_0.20-41    listenv_0.8.0      class_7.3-17       utf8_1.1.4         assertthat_0.2.1  
 [8] digest_0.6.25      ipred_0.9-9        foreach_1.5.0      R6_2.4.1           plyr_1.8.6         backports_1.1.9    evaluate_0.14     
[15] pillar_1.4.6       rlang_0.4.7        rstudioapi_0.11    DiceDesign_1.8-1   furrr_0.1.0        rpart_4.1-15       Matrix_1.2-18     
[22] rmarkdown_2.3      splines_4.0.2      stringr_1.4.0      gower_0.2.2        munsell_0.5.0      xfun_0.16          compiler_4.0.2    
[29] pkgconfig_2.0.3    htmltools_0.5.0    globals_0.13.0     nnet_7.3-14        tidyselect_1.1.0   prodlim_2019.11.13 bookdown_0.20     
[36] codetools_0.2-16   GPfit_1.0-8        fansi_0.4.1        future_1.19.1      crayon_1.3.4       withr_2.2.0        MASS_7.3-51.6     
[43] grid_4.0.2         gtable_0.3.0       lifecycle_0.2.0    magrittr_1.5       pROC_1.16.2        stringi_1.4.6      cli_2.0.2         
[50] timeDate_3043.102  ellipsis_0.3.1     lhs_1.0.2          generics_0.0.2     vctrs_0.3.4        lava_1.6.7         iterators_1.0.12  
[57] tools_4.0.2        glue_1.4.2         yaml_2.2.1         parallel_4.0.2     survival_3.1-12    colorspace_1.4-1   knitr_1.29 
@juliasilge
Member

This is not expected or intended; I believe something is going wrong with transform. The documentation for the transform argument says:

A function taking a single argument x that can be used to modify the numeric values prior to determining the levels (perhaps using base::as.integer()). The output of a function should be an integer that corresponds to the value of levels that should be assigned. If not an integer, the value will be converted to an integer during bake().

I can't get transform to work, though.

library(recipes)
#> Loading required package: dplyr
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
#> 
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stats':
#> 
#>     step
data(mtcars)
recipe( mpg ~ hp + cyl, data = mtcars ) %>% 
    step_num2factor(cyl, transform = function(x) as.integer(x), levels = letters[1:3]) %>%
    prep() %>%
    juice()
#> # A tibble: 32 x 3
#>       hp cyl     mpg
#>    <dbl> <fct> <dbl>
#>  1   110 <NA>   21  
#>  2   110 <NA>   21  
#>  3    93 <NA>   22.8
#>  4   110 <NA>   21.4
#>  5   175 <NA>   18.7
#>  6   105 <NA>   18.1
#>  7   245 <NA>   14.3
#>  8    62 <NA>   24.4
#>  9    95 <NA>   22.8
#> 10   123 <NA>   19.2
#> # … with 22 more rows

Created on 2020-09-24 by the reprex package (v0.3.0.9001)

@juliasilge juliasilge added the bug an unexpected problem or unintended behavior label Sep 24, 2020
@jkennel
Contributor

jkennel commented Jun 28, 2021

This is primarily an indexing issue and should likely be an error or warning, particularly when it would return all NA. For this dataset the step_num2factor code basically executes the following in the bake phase.

y <- as.integer(mtcars$cyl)
y
#>  [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
lvls <- letters[1:3]
lvls
#> [1] "a" "b" "c"
lvls[y] # bad indexing and no warning
#>  [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
#> [26] NA NA NA NA NA NA NA

Created on 2021-06-27 by the reprex package (v2.0.0)

step_integer recodes the values to 1:3, whereas as.integer leaves them as 4, 6, and 8. A transform function whose output maps onto 1 to 3 (once converted to integer) will work; as.factor is one such function.

library(recipes)
#> Loading required package: dplyr
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
#> 
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stats':
#> 
#>     step

recipe( mpg ~ hp + cyl, data = mtcars ) %>% 
  step_num2factor(cyl, transform = as.factor, levels = letters[1:3]) %>%
  prep() %>%
  juice()
#> # A tibble: 32 x 3
#>       hp cyl     mpg
#>    <dbl> <fct> <dbl>
#>  1   110 b      21  
#>  2   110 b      21  
#>  3    93 a      22.8
#>  4   110 b      21.4
#>  5   175 c      18.7
#>  6   105 b      18.1
#>  7   245 c      14.3
#>  8    62 a      24.4
#>  9    95 a      22.8
#> 10   123 b      19.2
#> # … with 22 more rows

Created on 2021-06-27 by the reprex package (v2.0.0)
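
Why as.factor works as a transform can be sketched in base R. This assumes, per the documentation quoted earlier in the thread, that non-integer transform output is passed through as.integer during bake(); the variable names are illustrative, and the match() variant is an alternative suggested here rather than something tested inside a recipe.

```r
cyl  <- c(6, 6, 4, 6, 8, 6)  # first six values of mtcars$cyl
lvls <- letters[1:3]

# as.integer() keeps the raw values, which all exceed length(lvls).
idx_raw <- as.integer(cyl)

# as.factor() followed by as.integer() yields the factor codes 1..3,
# based on the sorted unique values found in the data.
idx_fct <- as.integer(as.factor(cyl))

# match() makes the value-to-position mapping explicit instead.
idx_match <- match(cyl, c(4, 6, 8))

lvls[idx_raw]
#> [1] NA NA NA NA NA NA
lvls[idx_fct]
#> [1] "b" "b" "a" "b" "c" "b"
identical(idx_fct, idx_match)
#> [1] TRUE
```

One caveat with as.factor is that the codes depend on which values happen to be present in the data, so a missing cylinder count would shift the mapping; an explicit lookup such as match() does not have that problem.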

@juliasilge
Member

We should maybe change the documentation to say "(perhaps using base::as.integer() or base::as.factor())" at the minimum.

It is possible to end up with all NA values with a lot of recipes steps that handle factor levels, so I don't know if we will want to add any warnings/errors specific to this recipe step.
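
For reference, the kind of check discussed above could look roughly like the following base R sketch. check_level_index is a hypothetical helper written for illustration only; it is not part of recipes and does not reflect how step_num2factor is implemented.

```r
# Hypothetical helper (not part of recipes): warn when integer codes
# point outside the supplied levels, then index as usual.
check_level_index <- function(idx, levels) {
  out_of_range <- !is.na(idx) & (idx < 1L | idx > length(levels))
  if (any(out_of_range)) {
    warning(
      sum(out_of_range), " value(s) fall outside 1:", length(levels),
      " and will be converted to NA.",
      call. = FALSE
    )
    idx[out_of_range] <- NA  # make the NA conversion explicit
  }
  levels[idx]
}

# Codes within range index normally.
check_level_index(c(2L, 1L, 3L), letters[1:3])

# Raw cylinder counts are out of range, so this warns and returns NA.
check_level_index(as.integer(c(6, 4, 8)), letters[1:3])
```

Whether a warning or an error is the better choice here is an open question in this thread; the sketch only shows that the condition is cheap to detect.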
