Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix imputation for all NA input #719

Merged
merged 16 commits into from
Jun 28, 2021
Merged

Fix imputation for all NA input #719

merged 16 commits into from
Jun 28, 2021

Conversation

juliasilge
Copy link
Member

@juliasilge juliasilge commented Jun 4, 2021

Closes #689

This PR fixes the 馃悰 that we have around imputing on new data that is all NA. Here are the new results:

library(recipes)
#> Loading required package: dplyr
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
#> 
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stats':
#> 
#>     step

df <- data.frame(x1 = c(0, 1, 0),
                 x2 = c(1, 1, 0),
                 x3 = c(0, 1, 1))

rec <- recipe( ~ ., df) %>%
  step_impute_median(all_numeric_predictors()) %>%
  prep()

## always worked
bake(rec, new_data = data.frame(x1 = c(NA, NA, 1, 0, 1),
                                x2 = c(1, 1, 1, NA, 0),
                                x3 = c(0, 1, 1, 0, 0)))
#> # A tibble: 5 x 3
#>      x1    x2    x3
#>   <dbl> <dbl> <dbl>
#> 1     0     1     0
#> 2     0     1     1
#> 3     1     1     1
#> 4     0     1     0
#> 5     1     0     0

## works now
bake(rec, new_data = data.frame(x1 = c(NA, NA),
                                x2 = c(NA, NA),
                                x3 = c(NA, NA)))
#> # A tibble: 2 x 3
#>      x1    x2    x3
#>   <dbl> <dbl> <dbl>
#> 1     0     1     1
#> 2     0     1     1

Created on 2021-06-04 by the reprex package (v2.0.0)

There are problems with all the imputation steps, so I'll be working through them all. Once I started in on these, I found it makes more sense to use vctrs rather than hardhat::scream().

@juliasilge
Copy link
Member Author

juliasilge commented Jun 4, 2021

Looks like step_impute_lower() and step_impute_roll() don't need any changes, since it doesn't make much sense to impute for all NA for those. I did update the tests in the same way for everything else I am doing.

@juliasilge
Copy link
Member Author

Closes #704

Thank you so much @jake-mason for all your help on this! 馃檶 If you would like to install this PR and test it out, that would be great:

devtools::install_github("tidymodels/recipes@impute-na-data")

@juliasilge juliasilge marked this pull request as ready for review June 4, 2021 21:17
@jake-mason
Copy link

@juliasilge thanks so much for your help with this! Looks great.

@juliasilge juliasilge requested review from topepo and removed request for DavisVaughan June 10, 2021 15:30
Copy link
Member

@topepo topepo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks good. Can you add more to the new file about the warning that will be issued for old recipes?

@topepo topepo merged commit 82ec0f6 into master Jun 28, 2021
@topepo topepo deleted the impute-na-data branch June 28, 2021 18:02
@github-actions
Copy link

This pull request has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators Jul 13, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

step_modeimpute returns all NA when column is completely null, instead of mode from training data
3 participants