discretize( ..., keep_na = T) does not work when there's NA (related to issue #127) #982

@albertiniufu

The problem

I had trouble applying recipes::step_discretize to predictors that contain NA values. This problem is related to issue #127.

Reproducible example

Using the example in ?recipes::discretize:

library(modeldata)
suppressPackageStartupMessages(library(recipes))
data(biomass)

biomass$carbon[1] = NA
summary(biomass$carbon)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
#>   14.61   44.70   47.10   48.29   49.70   97.18       1

# This fails, with a not very clear message, because carbon now contains an NA:
discretize(biomass$carbon, cuts = 2, infs = FALSE, keep_na = T)
#> Error in quantile.default(x, probs = seq(0, 1, length = cuts + 1), ...): missing values and NaN's not allowed if 'na.rm' is FALSE

# From issue #127 I found out that na.rm = T must be passed,
# although na.rm is mentioned neither in ?recipes::discretize
# nor in the related ?recipes::step_discretize.
discretize(biomass$carbon, cuts = 2, infs = FALSE, keep_na = T, na.rm = T)
#> Bins: 3 (includes missing category)
#> Breaks: 14.61, 47.1, 97.18

sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.2.0 (2022-04-22)
#>  os       Arch Linux
#>  system   x86_64, linux-gnu
#>  ui       X11
#>  language en
#>  collate  pt_BR.UTF-8
#>  ctype    pt_BR.UTF-8
#>  tz       America/Sao_Paulo
#>  date     2022-05-15
#>  pandoc   2.17.1.1 @ /usr/bin/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package      * version    date (UTC) lib source
#>  assertthat     0.2.1      2019-03-21 [2] CRAN (R 4.0.0)
#>  class          7.3-20     2022-01-16 [2] CRAN (R 4.2.0)
#>  cli            3.3.0      2022-04-25 [1] CRAN (R 4.2.0)
#>  codetools      0.2-18     2020-11-04 [2] CRAN (R 4.2.0)
#>  crayon         1.5.1      2022-03-26 [1] CRAN (R 4.2.0)
#>  DBI            1.1.2      2021-12-20 [1] CRAN (R 4.1.2)
#>  digest         0.6.29     2021-12-01 [1] CRAN (R 4.1.2)
#>  dplyr        * 1.0.9      2022-04-28 [1] CRAN (R 4.2.0)
#>  ellipsis       0.3.2      2021-04-29 [1] CRAN (R 4.1.2)
#>  evaluate       0.15       2022-02-18 [1] CRAN (R 4.2.0)
#>  fansi          1.0.3      2022-03-24 [1] CRAN (R 4.2.0)
#>  fastmap        1.1.0      2021-01-25 [1] CRAN (R 4.0.3)
#>  fs             1.5.2      2021-12-08 [1] CRAN (R 4.1.2)
#>  future         1.25.0     2022-04-24 [1] CRAN (R 4.2.0)
#>  future.apply   1.9.0      2022-04-25 [1] CRAN (R 4.2.0)
#>  generics       0.1.2      2022-01-31 [1] CRAN (R 4.1.3)
#>  globals        0.15.0     2022-05-09 [1] CRAN (R 4.2.0)
#>  glue           1.6.2      2022-02-24 [1] CRAN (R 4.2.0)
#>  gower          1.0.0      2022-02-03 [1] CRAN (R 4.2.0)
#>  hardhat        0.2.0      2022-01-24 [1] CRAN (R 4.1.3)
#>  highr          0.8        2019-03-20 [2] CRAN (R 4.0.0)
#>  htmltools      0.5.2      2021-08-25 [1] CRAN (R 4.1.2)
#>  ipred          0.9-12     2021-09-15 [1] CRAN (R 4.1.2)
#>  knitr          1.39       2022-04-26 [1] CRAN (R 4.2.0)
#>  lattice        0.20-45    2021-09-22 [2] CRAN (R 4.2.0)
#>  lava           1.6.10     2021-09-02 [1] CRAN (R 4.1.2)
#>  lifecycle      1.0.1      2021-09-24 [1] CRAN (R 4.1.2)
#>  listenv        0.8.0      2019-12-05 [1] CRAN (R 4.0.1)
#>  lubridate      1.8.0      2021-10-07 [1] CRAN (R 4.1.2)
#>  magrittr       2.0.3      2022-03-30 [1] CRAN (R 4.2.0)
#>  MASS           7.3-56     2022-03-23 [2] CRAN (R 4.2.0)
#>  Matrix         1.4-1      2022-03-23 [2] CRAN (R 4.2.0)
#>  modeldata    * 0.1.1      2021-07-14 [1] CRAN (R 4.1.3)
#>  nnet           7.3-17     2022-01-16 [2] CRAN (R 4.2.0)
#>  parallelly     1.31.1     2022-04-22 [1] CRAN (R 4.2.0)
#>  pillar         1.7.0      2022-02-01 [1] CRAN (R 4.2.0)
#>  pkgconfig      2.0.3      2019-09-22 [2] CRAN (R 4.0.0)
#>  prodlim        2019.11.13 2019-11-17 [1] CRAN (R 4.0.0)
#>  purrr          0.3.4      2020-04-17 [2] CRAN (R 4.0.0)
#>  R.cache        0.15.0     2021-04-30 [1] CRAN (R 4.1.2)
#>  R.methodsS3    1.8.1      2020-08-26 [1] CRAN (R 4.0.3)
#>  R.oo           1.24.0     2020-08-26 [1] CRAN (R 4.0.3)
#>  R.utils        2.11.0     2021-09-26 [1] CRAN (R 4.1.2)
#>  R6             2.5.1      2021-08-19 [1] CRAN (R 4.1.2)
#>  Rcpp           1.0.8.3    2022-03-17 [1] CRAN (R 4.2.0)
#>  recipes      * 0.2.0      2022-02-18 [1] CRAN (R 4.1.3)
#>  reprex         2.0.1      2021-08-05 [1] CRAN (R 4.1.2)
#>  rlang          1.0.2      2022-03-04 [1] CRAN (R 4.1.3)
#>  rmarkdown      2.14       2022-04-25 [1] CRAN (R 4.2.0)
#>  rpart          4.1.16     2022-01-24 [2] CRAN (R 4.2.0)
#>  rstudioapi     0.13       2020-11-12 [1] CRAN (R 4.0.5)
#>  sessioninfo    1.2.2      2021-12-06 [1] CRAN (R 4.1.2)
#>  stringi        1.7.6      2021-11-29 [1] CRAN (R 4.1.3)
#>  stringr        1.4.0      2019-02-10 [2] CRAN (R 4.0.0)
#>  styler         1.7.0      2022-03-13 [1] CRAN (R 4.2.0)
#>  survival       3.3-1      2022-03-03 [2] CRAN (R 4.2.0)
#>  tibble         3.1.7      2022-05-03 [1] CRAN (R 4.2.0)
#>  tidyselect     1.1.2      2022-02-21 [1] CRAN (R 4.1.3)
#>  timeDate       3043.102   2018-02-21 [1] CRAN (R 4.0.0)
#>  utf8           1.2.2      2021-07-24 [1] CRAN (R 4.1.2)
#>  vctrs          0.4.1      2022-04-13 [1] CRAN (R 4.2.0)
#>  withr          2.5.0      2022-03-03 [1] CRAN (R 4.2.0)
#>  xfun           0.31       2022-05-10 [1] CRAN (R 4.2.0)
#>  yaml           2.2.1      2020-02-01 [2] CRAN (R 4.0.0)
#> 
#>  [1] /home/marcelo/R/x86_64-pc-linux-gnu-library/3.5
#>  [2] /usr/lib/R/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────

Suggested solution

I suggest that

  1. ?recipes::discretize should document the na.rm argument.
  2. ?recipes::discretize should include a meaningful example of the effect of keep_na, perhaps based on the reprex above.
  3. keep_na = T should automatically imply na.rm = T.
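
Until something like point 3 exists in the package, the same behaviour can be approximated on the user side. The sketch below is only an illustration, written against the behaviour shown in the reprex (recipes 0.2.0, where na.rm reaches quantile() through `...`). The helper name discretize_keep_na is hypothetical and not part of recipes.

```r
library(recipes)

# Hypothetical helper, not part of recipes: forwards everything to
# discretize() and adds na.rm = TRUE when keep_na is TRUE and the
# caller did not set na.rm explicitly.
discretize_keep_na <- function(x, ..., keep_na = TRUE) {
  args <- list(...)
  if (isTRUE(keep_na) && !("na.rm" %in% names(args))) {
    args$na.rm <- TRUE
  }
  do.call(discretize, c(list(x), args, list(keep_na = keep_na)))
}

# Toy data with one missing value and enough unique values to be binned
x <- c(NA, seq_len(100))
bins <- discretize_keep_na(x, cuts = 2, infs = FALSE)
bins
```

An explicit na.rm supplied by the caller is left untouched, so the wrapper only changes the case that currently errors.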
