step_knnimpute fails imputing numeric columns when other character columns are used for imputation #213
Comments
As I mentioned above, in both cases (imputing nominal and numeric columns), the step fails when character columns are introduced and succeeds when factor columns are used.
library(recipes)
#> Loading required package: dplyr
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
#> Loading required package: broom
#>
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stats':
#>
#> step
# Case 1: stringsAsFactors = TRUE, so chr_col is stored as a factor; imputation succeeds
d <- data.frame(chr_col = c(sample(c("a", "b"), 99, replace = TRUE), NA),
                fctr_col = factor(c(sample(c("a", "b"), 99, replace = TRUE), NA)),
                num = 1:100, stringsAsFactors = TRUE)
rec_obj <-
  d %>%
  recipe(formula = "~.") %>%
  step_knnimpute(all_nominal())
prep(rec_obj) %>%
  bake(newdata = d) %>%
  summarise_all(~{mean(is.na(.x))})
#> # A tibble: 1 x 3
#> chr_col fctr_col num
#> <dbl> <dbl> <dbl>
#> 1 0 0 0
# Case 2: stringsAsFactors = FALSE, so chr_col stays character; imputation fails
d <- data.frame(chr_col = c(sample(c("a", "b"), 99, replace = TRUE), NA),
                fctr_col = factor(c(sample(c("a", "b"), 99, replace = TRUE), NA)),
                num = 1:100, stringsAsFactors = FALSE)
rec_obj <-
  d %>%
  recipe(formula = "~.") %>%
  step_knnimpute(all_nominal())
prep(rec_obj) %>%
  bake(newdata = d) %>%
  summarise_all(~{mean(is.na(.x))})
#> Error in gower_work(x = x, y = y, pair_x = pair_x, pair_y = pair_y, n = n, : STRING_ELT() can only be applied to a 'character vector', not a 'integer'
sessionInfo()
#> R version 3.5.1 (2018-07-02)
#> Platform: x86_64-apple-darwin15.6.0 (64-bit)
#> Running under: macOS High Sierra 10.13.6
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
#>
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] bindrcpp_0.2.2 recipes_0.1.3 broom_0.5.0 dplyr_0.7.6
#>
#> loaded via a namespace (and not attached):
#> [1] tidyselect_0.2.4 purrr_0.2.5 kernlab_0.9-27
#> [4] splines_3.5.1 lattice_0.20-35 geometry_0.3-6
#> [7] htmltools_0.3.6 yaml_2.2.0 utf8_1.1.4
#> [10] survival_2.42-6 prodlim_2018.04.18 rlang_0.2.2
#> [13] pillar_1.3.0 glue_1.3.0 bindr_0.1.1
#> [16] dimRed_0.1.0 lava_1.6.3 robustbase_0.93-2
#> [19] stringr_1.3.1 timeDate_3043.102 pls_2.7-0
#> [22] evaluate_0.11 knitr_1.20 magic_1.5-8
#> [25] class_7.3-14 fansi_0.3.0 DEoptimR_1.0-8
#> [28] Rcpp_0.12.18 backports_1.1.2 ipred_0.9-7
#> [31] CVST_0.2-2 abind_1.4-5 digest_0.6.16
#> [34] stringi_1.2.4 RcppRoll_0.3.0 ddalpha_1.3.4
#> [37] grid_3.5.1 rprojroot_1.3-2 cli_1.0.0
#> [40] tools_3.5.1 magrittr_1.5 tibble_1.4.2
#> [43] crayon_1.3.4 tidyr_0.8.1 DRR_0.0.3
#> [46] pkgconfig_2.0.2 MASS_7.3-50 Matrix_1.2-14
#> [49] lubridate_1.7.4 gower_0.1.2 assertthat_0.2.0
#> [52] rmarkdown_1.10 R6_2.2.2 rpart_4.1-13
#> [55] sfsmisc_1.1-2 nnet_7.3-12 nlme_3.1-137
#> [58] compiler_3.5.1
Created on 2018-10-04 by the reprex
For the character example, it looks like a function in The underlying issue is that the data being baked are having their character columns to factor (which it should unless |
The package maintainer has made a change that addresses the issue. I've added a warning when the types are different (character vs factor). There are some times when that will be okay and I don't think that repairing the incoming data is a good idea.
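To illustrate the kind of check described in that comment, here is a minimal base-R sketch that warns when a column is a factor in one data frame and a character in the other. The helper name `check_col_types` and the message wording are hypothetical; this is not the code that was added to recipes, only an approximation of the idea.

```r
# Hypothetical helper, NOT part of recipes: compare the (first) class of each
# column shared by a reference data frame and a new data frame, and warn on
# any mismatch instead of silently repairing the incoming data.
check_col_types <- function(ref, new) {
  shared <- intersect(names(ref), names(new))
  ref_cls <- vapply(ref[shared], function(x) class(x)[1], character(1))
  new_cls <- vapply(new[shared], function(x) class(x)[1], character(1))
  mismatched <- shared[ref_cls != new_cls]
  if (length(mismatched) > 0) {
    warning(
      "Column type mismatch between reference and new data: ",
      paste0(mismatched, " (", ref_cls[mismatched], " vs ",
             new_cls[mismatched], ")", collapse = ", "),
      call. = FALSE
    )
  }
  invisible(mismatched)
}

# Toy data: the same columns, once with strings as factors and once without.
ref <- data.frame(chr_col = c("a", "b", NA), num = 1:3, stringsAsFactors = TRUE)
new <- data.frame(chr_col = c("a", "b", NA), num = 1:3, stringsAsFactors = FALSE)

# Emits a warning naming chr_col and returns the mismatched column names.
mismatched <- check_col_types(ref, new)
```

Warning rather than converting leaves the decision with the caller, which matches the stated reluctance to repair incoming data.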
step_knnimpute fails imputing numeric columns when other character columns rather than factor columns are used for imputation. Below are two minimal reproducible examples that illustrate instances where factor and character columns are used.

I have noticed that this fails on line 18 of bake.step_knnimpute. In the factor example, new_data will only have fctr_col, whereas object$ref_data has the entire initial dataframe. In the character example (the step that fails), both new_data and object$ref_data will have all data from the initial df, but chr_col is converted to factor only in ref_data.

Created on 2018-10-04 by the reprex package (v0.2.0).
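The mismatch described above can be reproduced, and worked around on the user side, without recipes at all. The sketch below uses two toy data frames as stand-ins for `ref_data` and `new_data` (they are assumptions, not the real recipes internals), and a hypothetical helper `align_chr_to_factor` that converts character columns in the new data to factors using the reference levels.

```r
# Stand-ins for the two objects described above (not the real recipes internals):
# ref_data holds chr_col as a factor, new_data still holds it as character.
ref_data <- data.frame(chr_col = c("a", "b", "a"), num = 1:3,
                       stringsAsFactors = TRUE)
new_data <- data.frame(chr_col = c("b", NA, "a"), num = 4:6,
                       stringsAsFactors = FALSE)

# Hypothetical helper: for every shared column that is a factor in `ref` and a
# character vector in `new`, convert the `new` column to a factor that reuses
# the levels of the reference column.
align_chr_to_factor <- function(ref, new) {
  for (col in intersect(names(ref), names(new))) {
    if (is.factor(ref[[col]]) && is.character(new[[col]])) {
      new[[col]] <- factor(new[[col]], levels = levels(ref[[col]]))
    }
  }
  new
}

aligned <- align_chr_to_factor(ref_data, new_data)
sapply(new_data, class)  # chr_col is "character"
sapply(aligned, class)   # chr_col is "factor"
```

Values in the new data that are not among the reference levels become NA under this conversion, so it is only appropriate when the new data is expected to share the training categories.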