unexpected behaviour with untransformed "numeric" values in step_num2factor #575
Comments
This is not as expected or intended, but I believe because something is going wrong with
I can't get
library(recipes)
#> Loading required package: dplyr
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
#>
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stats':
#>
#> step
data(mtcars)
recipe( mpg ~ hp + cyl, data = mtcars ) %>%
step_num2factor(cyl, transform = function(x) as.integer(x), levels = letters[1:3]) %>%
prep() %>%
juice()
#> # A tibble: 32 x 3
#> hp cyl mpg
#> <dbl> <fct> <dbl>
#> 1 110 <NA> 21
#> 2 110 <NA> 21
#> 3 93 <NA> 22.8
#> 4 110 <NA> 21.4
#> 5 175 <NA> 18.7
#> 6 105 <NA> 18.1
#> 7 245 <NA> 14.3
#> 8 62 <NA> 24.4
#> 9 95 <NA> 22.8
#> 10 123 <NA> 19.2
#> # … with 22 more rows
Created on 2020-09-24 by the reprex package (v0.3.0.9001)
This is primarily an indexing issue and should likely be an error or warning, particularly when it would return all NA. For this dataset the
y <- as.integer(mtcars$cyl)
y
#> [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
lvls <- letters[1:3]
lvls
#> [1] "a" "b" "c"
lvls[y] # bad indexing and no warning
#> [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
#> [26] NA NA NA NA NA NA NA
Created on 2021-06-27 by the reprex package (v2.0.0)
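To illustrate the "error or warning" idea, here is a minimal base R sketch of a guard that refuses to index past the supplied levels. The helper name index_levels is hypothetical and is not part of recipes; it only shows the kind of check being suggested.

```r
# Hypothetical helper (not part of recipes): index `lvls` by integer codes,
# but fail loudly when a code has no matching level.
index_levels <- function(codes, lvls) {
  codes <- as.integer(codes)
  out_of_range <- !is.na(codes) & (codes < 1L | codes > length(lvls))
  if (any(out_of_range)) {
    bad <- sort(unique(codes[out_of_range]))
    stop(
      "codes without a matching level: ", paste(bad, collapse = ", "),
      " (only ", length(lvls), " levels supplied)",
      call. = FALSE
    )
  }
  factor(lvls[codes], levels = lvls)
}

lvls <- letters[1:3]
index_levels(c(1, 3, 2), lvls)   # codes within 1..3 are accepted

# mtcars$cyl holds 4, 6 and 8, none of which index into three levels
res <- try(index_levels(mtcars$cyl, lvls), silent = TRUE)
inherits(res, "try-error")
```

Whether this should be an error or only a warning is a design choice; an error seemed the safer default for a sketch because an all-NA column is rarely what anyone wants.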
library(recipes)
#> Loading required package: dplyr
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
#>
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stats':
#>
#> step
recipe( mpg ~ hp + cyl, data = mtcars ) %>%
step_num2factor(cyl, transform = as.factor, levels = letters[1:3]) %>%
prep() %>%
juice()
#> # A tibble: 32 x 3
#> hp cyl mpg
#> <dbl> <fct> <dbl>
#> 1 110 b 21
#> 2 110 b 21
#> 3 93 a 22.8
#> 4 110 b 21.4
#> 5 175 c 18.7
#> 6 105 b 18.1
#> 7 245 c 14.3
#> 8 62 a 24.4
#> 9 95 a 22.8
#> 10 123 b 19.2
#> # … with 22 more rows
Created on 2021-06-27 by the reprex package (v2.0.0)
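A likely reason transform = as.factor gives the expected labels, assuming the step turns the transformed values into integers before indexing the levels (as the lvls[y] example suggests): converting a factor to integer yields its internal codes 1, 2, 3 rather than the original values 4, 6, 8. A base R sketch:

```r
cyl_fct <- as.factor(mtcars$cyl)
levels(cyl_fct)                # "4" "6" "8"

# as.integer() on a factor returns the internal codes, not the labels
codes <- as.integer(cyl_fct)
head(codes)                    # 2 2 1 2 3 2

# those codes index cleanly into three levels
letters[1:3][head(codes)]      # "b" "b" "a" "b" "c" "b"
```

This matches the first six rows of the output above (b, b, a, b, c, b).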
We should maybe change the documentation to say "(perhaps using It is possible to end up with all
A few times I've come across a situation where I have very few levels of a numeric variable, and I want these to be factors. But I've noticed every time that these sometimes unexpectedly return NA columns, for example
This is unexpected because
factor(mtcars$cyl, levels = c(4, 6, 8), labels = letters[1:3])
This is clearly a numeric error which can be alleviated by adding step_integer beforehand, but it is somewhat misleading that step_num2factor states that it converts numeric values if one has to convert them to integers first.
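As a rough illustration of the mapping that is needed, the base R sketch below converts each value to its position among the sorted unique values and compares the result with the factor() call above. The helper value_to_position is hypothetical; passing something like it as the transform argument is an untested assumption here, not documented behaviour.

```r
# Hypothetical helper: position of each value among a fixed set of values.
# By default the set is taken from the data itself; for new data it is safer
# to pass the values explicitly, e.g. values = c(4, 6, 8).
value_to_position <- function(x, values = sort(unique(x))) {
  match(x, values)
}

positions <- value_to_position(mtcars$cyl)   # 4 -> 1, 6 -> 2, 8 -> 3

by_index  <- factor(letters[1:3][positions], levels = letters[1:3])
by_factor <- factor(mtcars$cyl, levels = c(4, 6, 8), labels = letters[1:3])

# compare the label sequences produced by the two approaches
all(as.character(by_index) == as.character(by_factor))
```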