step_knnimpute fails imputing numeric columns when other character columns are used for imputation #213

Closed
glenrs opened this issue Oct 4, 2018 · 4 comments

glenrs commented Oct 4, 2018

step_knnimpute fails when imputing numeric columns if character columns, rather than factor columns, are used for the imputation. Below are two minimal reproducible examples, one using a character column and one using a factor column.

The failure occurs on line 18 of bake.step_knnimpute. In the factor example, new_data contains only fctr_col, whereas object$ref_data contains the entire initial data frame. In the character example (the one that fails), both new_data and object$ref_data contain all columns of the initial data frame, but chr_col is converted to factor only in ref_data.
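
The type mismatch described above can be illustrated without recipes at all. The following is a minimal base R sketch: `new_data` and `ref_data` here are hand-built stand-ins for the objects inside bake.step_knnimpute, not the real internals, and the class comparison is only an illustration of what differs between them.

```r
# Base R only; nothing here calls recipes or gower.
set.seed(1)
new_data <- data.frame(
  chr_col = sample(c("a", "c"), 10, replace = TRUE),
  num = sample(c(NA, 1:3), 10, replace = TRUE),
  stringsAsFactors = FALSE
)

# Stand-in for the stored reference data: same values, but the
# character column has been converted to a factor.
ref_data <- new_data
ref_data$chr_col <- factor(ref_data$chr_col)

# Compare the first class of each column side by side.
col_classes <- data.frame(
  column = names(new_data),
  new_data = vapply(new_data, function(x) class(x)[1], character(1)),
  ref_data = vapply(ref_data, function(x) class(x)[1], character(1)),
  row.names = NULL,
  stringsAsFactors = FALSE
)
print(col_classes)

# Columns whose type differs between the two data frames.
mismatched <- col_classes$column[col_classes$new_data != col_classes$ref_data]
print(mismatched)
```

Only chr_col is flagged: it is character in one data frame and factor in the other, which is the combination that reaches the distance computation in the failing example.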

library(recipes)
#> Loading required package: dplyr
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
#> Loading required package: broom
#> 
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stats':
#> 
#>     step

d <- data.frame(chr_col = sample(c("a", "c"), 100, replace = TRUE), 
                num = sample(c(NA, 1:3), 100, replace = TRUE), 
                stringsAsFactors = FALSE)

rec_obj <-
  d %>%
  recipe(formula = "~.") %>%
  step_knnimpute(all_numeric())

prep(rec_obj) %>%
  bake(newdata = d) %>%
  summarise_all(~{mean(is.na(.x))})
#> Error in gower_work(x = x, y = y, pair_x = pair_x, pair_y = pair_y, n = n, : STRING_ELT() can only be applied to a 'character vector', not a 'integer'

d <- data.frame(fctr_col = sample(c("a", "c"), 100, replace = TRUE), 
                num = sample(c(NA, 1:3), 100, replace = TRUE))

rec_obj <-
  d %>%
  recipe(formula = "~.") %>%
  step_knnimpute(all_numeric())

prep(rec_obj) %>%
  bake(newdata = d) %>%
  summarise_all(~{mean(is.na(.x))}) # checking to make sure that everything is imputed
#> # A tibble: 1 x 2
#>   fctr_col   num
#>      <dbl> <dbl>
#> 1        0     0

d %>%
  summarise_all(~{mean(is.na(.x))}) # the original missing data
#>   fctr_col  num
#> 1        0 0.21

sessionInfo()
#> R version 3.5.1 (2018-07-02)
#> Platform: x86_64-apple-darwin15.6.0 (64-bit)
#> Running under: macOS High Sierra 10.13.6
#> 
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] bindrcpp_0.2.2 recipes_0.1.3  broom_0.5.0    dplyr_0.7.6   
#> 
#> loaded via a namespace (and not attached):
#>  [1] tidyselect_0.2.4   purrr_0.2.5        kernlab_0.9-27    
#>  [4] splines_3.5.1      lattice_0.20-35    geometry_0.3-6    
#>  [7] htmltools_0.3.6    yaml_2.2.0         utf8_1.1.4        
#> [10] survival_2.42-6    prodlim_2018.04.18 rlang_0.2.2       
#> [13] pillar_1.3.0       glue_1.3.0         bindr_0.1.1       
#> [16] dimRed_0.1.0       lava_1.6.3         robustbase_0.93-2 
#> [19] stringr_1.3.1      timeDate_3043.102  pls_2.7-0         
#> [22] evaluate_0.11      knitr_1.20         magic_1.5-8       
#> [25] class_7.3-14       fansi_0.3.0        DEoptimR_1.0-8    
#> [28] Rcpp_0.12.18       backports_1.1.2    ipred_0.9-7       
#> [31] CVST_0.2-2         abind_1.4-5        digest_0.6.16     
#> [34] stringi_1.2.4      RcppRoll_0.3.0     ddalpha_1.3.4     
#> [37] grid_3.5.1         rprojroot_1.3-2    cli_1.0.0         
#> [40] tools_3.5.1        magrittr_1.5       tibble_1.4.2      
#> [43] crayon_1.3.4       tidyr_0.8.1        DRR_0.0.3         
#> [46] pkgconfig_2.0.2    MASS_7.3-50        Matrix_1.2-14     
#> [49] lubridate_1.7.4    gower_0.1.2        assertthat_0.2.0  
#> [52] rmarkdown_1.10     R6_2.2.2           rpart_4.1-13      
#> [55] sfsmisc_1.1-2      nnet_7.3-12        nlme_3.1-137      
#> [58] compiler_3.5.1

Created on 2018-10-04 by the reprex package (v0.2.0).

Author

glenrs commented Oct 4, 2018

I mentioned above that step_knnimpute only fails when imputing numeric columns. I was wrong. The error also occurs when imputing nominal columns.

Whether imputing nominal or numeric columns, it fails when character columns are present and succeeds when factor columns are used.

library(recipes)
#> Loading required package: dplyr
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
#> Loading required package: broom
#> 
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stats':
#> 
#>     step

d <- data.frame(chr_col = c(sample(c("a", "b"), 99, replace = TRUE), NA), 
                fctr_col = factor(c(sample(c("a", "b"), 99, replace = TRUE), NA)), 
                num = 1:100, stringsAsFactors = TRUE)

rec_obj <-
  d %>%
  recipe(formula = "~.") %>%
  step_knnimpute(all_nominal())

prep(rec_obj) %>%
  bake(newdata = d) %>%
  summarise_all(~{mean(is.na(.x))})
#> # A tibble: 1 x 3
#>   chr_col fctr_col   num
#>     <dbl>    <dbl> <dbl>
#> 1       0        0     0

d <- data.frame(chr_col = c(sample(c("a", "b"), 99, replace = TRUE), NA), 
                fctr_col = factor(c(sample(c("a", "b"), 99, replace = TRUE), NA)), 
                num = 1:100, stringsAsFactors = FALSE)

rec_obj <-
  d %>%
  recipe(formula = "~.") %>%
  step_knnimpute(all_nominal())

prep(rec_obj) %>%
  bake(newdata = d) %>%
  summarise_all(~{mean(is.na(.x))})
#> Error in gower_work(x = x, y = y, pair_x = pair_x, pair_y = pair_y, n = n, : STRING_ELT() can only be applied to a 'character vector', not a 'integer'

Created on 2018-10-04 by the reprex package (v0.2.0).

Member

topepo commented Oct 5, 2018

For the character example, it looks like a function in gower doesn't like it when one variable is a factor and the other is a character (see markvanderloo/gower#9). If both are factors or both are characters, it works (but the results differ, and they shouldn't).

The underlying issue is how character columns are converted to factor in the data being baked (the conversion should happen unless prep's stringsAsFactors argument is set to FALSE).
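
While the types disagree, one user-side option is to convert character columns to factors before the data reach the step. Below is a minimal base R sketch of that idea. The helper `align_to_reference()` is hypothetical (it is not part of recipes or gower), and it assumes the reference data frame already holds the factor version of each column.

```r
# Hypothetical helper, not part of recipes or gower: convert character
# columns in `new` to factors using the levels found in `ref`.
align_to_reference <- function(new, ref) {
  shared <- intersect(names(new), names(ref))
  for (col in shared) {
    if (is.character(new[[col]]) && is.factor(ref[[col]])) {
      # Values absent from the reference levels become NA.
      new[[col]] <- factor(new[[col]], levels = levels(ref[[col]]))
    }
  }
  new
}

ref <- data.frame(chr_col = factor(c("a", "c", "a")), num = c(1, NA, 3))
new <- data.frame(chr_col = c("c", "a", NA), num = c(NA, 2, 3),
                  stringsAsFactors = FALSE)

aligned <- align_to_reference(new, ref)
str(aligned)
```

Reusing the reference levels keeps the integer codes behind the factor consistent between the two data frames. A simpler alternative, under the same assumption, is to build the original data frame with factor columns in the first place, as in the examples that succeed.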

Member

topepo commented Nov 8, 2018

The package maintainer has made a change that addresses the issue. I've added a warning when the types are different (character vs factor). There are times when that will be okay, and I don't think that repairing the incoming data is a good idea.
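
For readers who want to see what a character-versus-factor check might look like, here is a minimal base R sketch. It is not the code that was added to recipes: the function name `warn_on_type_mismatch()` and the wording of the warning are assumptions made for illustration. Like the approach described above, it only reports the mismatch and leaves the incoming data unchanged.

```r
# Hypothetical check, not the implementation in recipes: warn when a column
# is character in one data frame and factor in the other.
warn_on_type_mismatch <- function(new, ref) {
  shared <- intersect(names(new), names(ref))
  is_mixed <- vapply(shared, function(col) {
    (is.character(new[[col]]) && is.factor(ref[[col]])) ||
      (is.factor(new[[col]]) && is.character(ref[[col]]))
  }, logical(1))
  mixed <- shared[is_mixed]
  if (length(mixed) > 0) {
    warning("Columns differ in type (character vs factor): ",
            paste(mixed, collapse = ", "), call. = FALSE)
  }
  # Return the flagged column names invisibly; the data are not modified.
  invisible(mixed)
}

ref <- data.frame(chr_col = factor(c("a", "b")), num = 1:2)
new <- data.frame(chr_col = c("a", "b"), num = 1:2, stringsAsFactors = FALSE)

# Capture the flagged columns while silencing the warning for this demo.
flagged <- suppressWarnings(warn_on_type_mismatch(new, ref))
print(flagged)
```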

topepo closed this as completed Nov 8, 2018

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex https://reprex.tidyverse.org) and link to this issue.

github-actions bot locked and limited conversation to collaborators Feb 24, 2021