Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

step-na-indicate #617

Closed
wants to merge 3 commits into from

Conversation

konradsemsch
Copy link

Hi @juliasilge!

I've fitted my earlier code into the current implementation of recipes! As of now the implementation is pretty basic and I was thinking also of adding one more functionality that I wanted to check with you: creating indicator variables only for those features that had missing values/ certain prop of missing values in the first place. What do you think? How to make it fit in the entire recipes workflow? Would you have any other recommendations/ ideas?

On my list are also some tests, but wanted to agree on a baseline implementation first :) Let me know!

@konradsemsch konradsemsch changed the title first working version step-na-indicate Dec 13, 2020
@topepo
Copy link
Member

topepo commented Dec 14, 2020

Looks good! Some minor notes in-line.

R/naindicate.R Outdated Show resolved Hide resolved
R/naindicate.R Outdated Show resolved Hide resolved
R/naindicate.R Outdated Show resolved Hide resolved
R/naindicate.R Outdated Show resolved Hide resolved
R/naindicate.R Outdated Show resolved Hide resolved
@konradsemsch
Copy link
Author

@topepo - what do you think about this proposal from my first message?

creating indicator variables only for those features that had missing values/ certain prop of missing values in the first place

For example: right now if somebody called all_predictors, indicator variables would be created for all of them even though they do not have any data missing. Of course they could be later removed with step_nz or so, but perhaps we could control already for that instead of pushing users to remove them later on.

@juliasilge
Copy link
Member

juliasilge commented Dec 14, 2020

In the original PR we had the ability to remove the original columns, an idea we are also working through in #462. I think that would be quite useful in this context. The first PR called that argument remove but in #462 we have landed on keep_original_cols.

@topepo
Copy link
Member

topepo commented Dec 14, 2020

what do you think about this proposal from my first message?

I don't want to make selectors that use the missing value pattern to select predictors, if that's what you mean.

@topepo
Copy link
Member

topepo commented Dec 14, 2020

In the original PR we had the ability to remove the original columns,

I think in 99% of the cases, we would not want to remove the original columns. Both are needed for the type of data analysis that people would do with these missing indicators.

@konradsemsch
Copy link
Author

@topepo I've worked on the points you mentioned and added some basic tests. However, there's a warning showing up now (likely related to purrr) that I'm not sure how to resolve:

Warning message:
All elements of `...` must be named.
Did you want `data = c(type, role, source)`? 

I don't want to make selectors that use the missing value pattern to select predictors, if that's what you mean.

Ok, got you here. I both agree and disagree with that approach, but I guess that not having done that automatically will reduce unwanted/ unexpected surprises.

@topepo
Copy link
Member

topepo commented Dec 15, 2020

That message is from a tidyr nest or similar. Can you show the results of your sessioninfo::session_info()?

@topepo
Copy link
Member

topepo commented Dec 15, 2020

Also, can you update from master (to avoid some of the documentation conflicts)?

@konradsemsch
Copy link
Author

konradsemsch commented Dec 17, 2020

Session info ───────────────────────────────────────────────────────────────────────────────────────
 setting  value                       
 version  R version 3.6.1 (2019-07-05)
 os       macOS Catalina 10.15.3      
 system   x86_64, darwin15.6.0        
 ui       RStudio                     
 language (EN)                        
 collate  en_US.UTF-8                 
 ctype    en_US.UTF-8                 
 tz       Europe/Berlin               
 date     2020-12-17Packages ───────────────────────────────────────────────────────────────────────────────────────────
 ! package     * version    date       lib source        
   assertthat    0.2.1      2019-03-21 [1] CRAN (R 3.6.0)
   backports     1.1.7      2020-05-13 [1] CRAN (R 3.6.2)
   callr         3.4.3      2020-03-28 [1] CRAN (R 3.6.2)
   class         7.3-15     2019-01-01 [1] CRAN (R 3.6.1)
   cli           2.2.0      2020-11-20 [1] CRAN (R 3.6.2)
   crayon        1.3.4      2017-09-16 [1] CRAN (R 3.6.0)
   desc          1.2.0      2018-05-01 [1] CRAN (R 3.6.0)
   devtools      2.2.1      2019-09-24 [1] CRAN (R 3.6.0)
   digest        0.6.25     2020-02-23 [1] CRAN (R 3.6.0)
   dplyr       * 1.0.2      2020-08-18 [1] CRAN (R 3.6.2)
   ellipsis      0.3.1      2020-05-15 [1] CRAN (R 3.6.2)
   fansi         0.4.1      2020-01-08 [1] CRAN (R 3.6.1)
   fs            1.4.1      2020-04-04 [1] CRAN (R 3.6.2)
   generics      0.1.0      2020-10-31 [1] CRAN (R 3.6.2)
   glue          1.4.1      2020-05-13 [1] CRAN (R 3.6.2)
   gower         0.2.1      2019-05-14 [1] CRAN (R 3.6.0)
   ipred         0.9-9      2019-04-28 [1] CRAN (R 3.6.0)
   lattice       0.20-38    2018-11-04 [1] CRAN (R 3.6.1)
   lava          1.6.7      2020-03-05 [1] CRAN (R 3.6.0)
   lifecycle     0.2.0      2020-03-06 [1] CRAN (R 3.6.0)
   lubridate     1.7.8      2020-04-06 [1] CRAN (R 3.6.2)
   magrittr      2.0.1      2020-11-17 [1] CRAN (R 3.6.2)
   MASS          7.3-51.4   2019-03-31 [1] CRAN (R 3.6.1)
   Matrix        1.2-17     2019-03-22 [1] CRAN (R 3.6.1)
   memoise       1.1.0      2017-04-21 [1] CRAN (R 3.6.0)
   nnet          7.3-12     2016-02-02 [1] CRAN (R 3.6.1)
   packrat       0.5.0      2018-11-14 [1] CRAN (R 3.6.0)
   pillar        1.4.4      2020-05-05 [1] CRAN (R 3.6.2)
   pkgbuild      1.0.8      2020-05-07 [1] CRAN (R 3.6.2)
   pkgconfig     2.0.3      2019-09-22 [1] CRAN (R 3.6.0)
   pkgload       1.1.0      2020-05-29 [1] CRAN (R 3.6.2)
   prettyunits   1.1.1      2020-01-24 [1] CRAN (R 3.6.1)
   processx      3.4.2      2020-02-09 [1] CRAN (R 3.6.0)
   prodlim       2019.11.13 2019-11-17 [1] CRAN (R 3.6.0)
   ps            1.3.3      2020-05-08 [1] CRAN (R 3.6.2)
   purrr         0.3.4      2020-04-17 [1] CRAN (R 3.6.2)
   R6            2.4.1      2019-11-12 [1] CRAN (R 3.6.0)
   Rcpp          1.0.4      2020-03-17 [1] CRAN (R 3.6.0)
 P recipes     * 0.1.6.9000 2020-12-14 [?] local         
   remotes       2.1.0      2019-06-24 [1] CRAN (R 3.6.0)
   rlang         0.4.9      2020-11-26 [1] CRAN (R 3.6.2)
   rpart         4.1-15     2019-04-12 [1] CRAN (R 3.6.0)
   rprojroot     1.3-2      2018-01-03 [1] CRAN (R 3.6.0)
   rstudioapi    0.13       2020-11-12 [1] CRAN (R 3.6.2)
   sessioninfo   1.1.1      2018-11-05 [1] CRAN (R 3.6.0)
   survival      2.44-1.1   2019-04-01 [1] CRAN (R 3.6.1)
   testthat    * 2.3.2      2020-03-02 [1] CRAN (R 3.6.0)
   tibble        3.0.4      2020-10-12 [1] CRAN (R 3.6.2)
   tidyr         1.1.2      2020-08-27 [1] CRAN (R 3.6.2)
   tidyselect    1.1.0      2020-05-11 [1] CRAN (R 3.6.2)
   timeDate      3043.102   2018-02-21 [1] CRAN (R 3.6.0)
   usethis       1.5.1      2019-07-04 [1] CRAN (R 3.6.0)
   vctrs         0.3.5      2020-11-17 [1] CRAN (R 3.6.2)
   withr         2.2.0      2020-04-20 [1] CRAN (R 3.6.2)

[1] /Library/Frameworks/R.framework/Versions/3.6/Resources/library

 P ── Loaded and on-disk path mismatch.

@konradsemsch
Copy link
Author

Hmm, I'm not sure why these error are showing up... I've checked that I was up-to-date with master.

@juliasilge
Copy link
Member

This is looking really close @konradsemsch! 🙌

It seems like this branch was created from the main branch of recipes at a point is the "distant" past 🦖 👵 and some very old files have gotten included in the PR, which is causing some problems for us in installing the PR and testing it out. Such a bummer! Can we ask you to help us out in creating a more streamlined PR that will be easier to test and check? Here are the steps we think you should take:

  • Save the two 2️⃣ files that include your main changes, the main .R file for the step and the .R file for its tests, somewhere outside your repo (I usually use my desktop).
  • Burn it all down 🔥 and close this PR.
  • Use the usethis package to open a new PR with the new content (that you saved in a safe place outside your repo), via create_from_github(), then pr_init() and pr_push().

@konradsemsch
Copy link
Author

Thank you @juliasilge for explaining! Indeed something was mixed up over there, but now we can start clean :)

Things are looking good on the new PR: here. Closing this one.

@github-actions
Copy link

This pull request has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org

@github-actions github-actions bot locked and limited conversation to collaborators Feb 20, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants