Change printing for step_impute_knn() #837

Closed
SewerynGrodny opened this issue Oct 20, 2021 · 4 comments · Fixed by #977
Labels
upkeep maintenance, infrastructure, and similar

Comments

@SewerynGrodny

step_impute_knn() seems to ignore the variable that should be imputed and instead imputes values in all of the other variables:
prep(base_rec)

K-nearest neighbor imputation for Sepal.Width, Petal.Width, Species [trained]

whereas it should be only Petal.Length.

Best,
Seweryn


library(tidymodels)

iris <- as_tibble(iris)

# Set every column of the first row to missing
iris[1, 1] <- as.numeric(NA)
iris[1, 2] <- as.numeric(NA)
iris[1, 3] <- as.numeric(NA)
iris[1, 4] <- as.numeric(NA)
iris[1, 5] <- as.numeric(NA)

set.seed(123)

iris_split <- iris %>% 
  initial_split(strata = Sepal.Length)

iris_training <- training(iris_split)
iris_testing <- testing(iris_split)

iris_rf_model <- rand_forest(
  mtry = 5,
  min_n = 5,
  trees = 500) %>%
  set_engine("ranger") %>%
  set_mode("regression")


base_rec <- recipe(Sepal.Length ~ .,
                   data = iris_training) %>% 
  step_impute_knn(Petal.Length) 

prep(base_rec)


@juliasilge
Member

There are a couple of things going on here that may be contributing to some confusion. When you do the data splitting, you end up with data in iris_training that doesn't have any NA values, so there is nothing that needs imputation. Try it just using the tibble that you add the missing data to, and you'll notice that it will fail because there is nothing to use for imputation.
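One quick way to confirm whether a data set has anything to impute is to count the missing values per column before building the recipe. This is a minimal base R sketch; `count_missing()` and `iris_na` are names made up for illustration and are not part of recipes or rsample.

```r
# Hypothetical helper: number of NA values in each column of a data frame.
count_missing <- function(data) colSums(is.na(data))

# Work on a copy of the built-in data with two gaps in the first row.
iris_na <- iris
iris_na[1, 2:3] <- NA

count_missing(iris_na)
# Sepal.Width and Petal.Length report 1; the other columns report 0.
```

Running the same helper on `iris_training` from the original post shows whether the row with missing values ended up in the training set.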

If instead we add only two NA values, you'll notice that the correct column (Petal.Length) is being imputed:

library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#>   method                   from   
#>   required_pkgs.model_spec parsnip

iris <- tibble(iris)
iris[1, 2] <- NA_real_
iris[1, 3] <- NA_real_
iris
#> # A tibble: 150 × 5
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
#>  1          5.1        NA           NA           0.2 setosa 
#>  2          4.9         3            1.4         0.2 setosa 
#>  3          4.7         3.2          1.3         0.2 setosa 
#>  4          4.6         3.1          1.5         0.2 setosa 
#>  5          5           3.6          1.4         0.2 setosa 
#>  6          5.4         3.9          1.7         0.4 setosa 
#>  7          4.6         3.4          1.4         0.3 setosa 
#>  8          5           3.4          1.5         0.2 setosa 
#>  9          4.4         2.9          1.4         0.2 setosa 
#> 10          4.9         3.1          1.5         0.1 setosa 
#> # … with 140 more rows

base_rec <- recipe(Sepal.Length ~ ., data = iris) %>% 
  step_impute_knn(Petal.Length) 

prep(base_rec)
#> Recipe
#> 
#> Inputs:
#> 
#>       role #variables
#>    outcome          1
#>  predictor          4
#> 
#> Training data contained 150 data points and 1 incomplete row. 
#> 
#> Operations:
#> 
#> K-nearest neighbor imputation for Sepal.Width, Petal.Width, Species [trained]
prep(base_rec) %>% bake(new_data = NULL)
#> # A tibble: 150 × 5
#>    Sepal.Width Petal.Length Petal.Width Species Sepal.Length
#>          <dbl>        <dbl>       <dbl> <fct>          <dbl>
#>  1        NA           1.44         0.2 setosa           5.1
#>  2         3           1.4          0.2 setosa           4.9
#>  3         3.2         1.3          0.2 setosa           4.7
#>  4         3.1         1.5          0.2 setosa           4.6
#>  5         3.6         1.4          0.2 setosa           5  
#>  6         3.9         1.7          0.4 setosa           5.4
#>  7         3.4         1.4          0.3 setosa           4.6
#>  8         3.4         1.5          0.2 setosa           5  
#>  9         2.9         1.4          0.2 setosa           4.4
#> 10         3.1         1.5          0.1 setosa           4.9
#> # … with 140 more rows

Created on 2021-10-20 by the reprex package (v2.0.1)

The Sepal.Width is not getting imputed because we didn't ask this recipe to do that.
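To see the two roles side by side, the `tidy()` method for a prepped recipe can be used. The sketch below assumes a recent version of recipes, where the tidied step reports the imputed column in `terms` and the columns used for imputation in `predictors`; `iris_na` and `rec` are illustrative names.

```r
library(recipes)

# Copy of the built-in data with the same two gaps as in the reprex.
iris_na <- iris
iris_na[1, 2:3] <- NA

rec <- recipe(Sepal.Length ~ ., data = iris_na) %>%
  step_impute_knn(Petal.Length)

# One row per (imputed column, column used for imputation) pair.
knn_info <- tidy(prep(rec), number = 1)
knn_info
```

If Sepal.Width should be imputed as well, it has to be named in the step, for example `step_impute_knn(Petal.Length, Sepal.Width)`.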

The printing is confusing because it is telling you the things that are possibly being used for imputation and not what is being imputed. Maybe we can fix that here:

cat("K-nearest neighbor imputation for ", sep = "")

We could just change "for" to "with"? Or we could change which variables are printed, showing the ones that are being imputed rather than the ones that are used for the imputation.
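The two options can be compared with a small base R sketch. `knn_print_line()` is a hypothetical helper written only for this comparison; it is not the print method in recipes and does not show how the fix was eventually implemented.

```r
# Hypothetical helper: build the summary line under either wording.
#   style = "for"  lists the columns being imputed
#   style = "with" lists the columns used to find neighbors
knn_print_line <- function(imputed, used, style = c("for", "with")) {
  style <- match.arg(style)
  vars <- if (style == "for") imputed else used
  paste0(
    "K-nearest neighbor imputation ", style, " ",
    paste(vars, collapse = ", ")
  )
}

imputed <- "Petal.Length"
used <- c("Sepal.Width", "Petal.Width", "Species")

knn_print_line(imputed, used, style = "for")
# "K-nearest neighbor imputation for Petal.Length"
knn_print_line(imputed, used, style = "with")
# "K-nearest neighbor imputation with Sepal.Width, Petal.Width, Species"
```

Either wording removes the ambiguity; the "for" variant matches what the user typed into the step, which is probably what most readers expect to see.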

@SewerynGrodny
Author

Hi, thanks for the explanation.
I like your proposal to change the message from what was used for imputation to what was imputed.
Thanks,
PS. Should I close this issue?

@juliasilge juliasilge changed the title step_impute_knn Change printing for step_impute_knn() Oct 22, 2021
@juliasilge juliasilge added the upkeep maintenance, infrastructure, and similar label Oct 22, 2021
@juliasilge
Member

No, we'll leave this open and close it when we change the printing! 👍

@github-actions

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators May 27, 2022