Add warning if step_normalize introduces NaN's in column #920

Closed
FieteO opened this issue Feb 25, 2022 · 3 comments
Labels
feature a feature request or enhancement


@FieteO

FieteO commented Feb 25, 2022

Background

For columns with zero variance, step_normalize() replaces the whole column with NaN values. This was very hard to debug: when I fed the data to a model, the only symptom was the message Error: Missing data in columns: followed by the column name, with nothing pointing back at the normalization step.
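
A minimal base-R sketch of why this happens, assuming normalization amounts to subtracting the mean and dividing by the standard deviation. The vector `x2` here is just an illustrative constant column, not the data from the example that follows:

```r
# A constant column has a standard deviation of exactly zero.
x2 <- rep(1, 5)
sd(x2)

# Centering gives 0 and scaling divides by 0, so every value becomes NaN.
z <- (x2 - mean(x2)) / sd(x2)
z
```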

Reproducible example

library(tidymodels)

x1 <- sample(c(NaN,5,10),300,replace=TRUE)
x2 <- sample(1,300,replace=TRUE)
x3 <- sample(c(0,1),300,replace=TRUE)
df1 <- data.frame(x1,x2,x3)

dfsplits <- initial_split(df1, strata = x3)
df_train <- training(dfsplits)
df_test  <- testing(dfsplits)

df_recipe <-
  recipe(x1 ~ ., data = df_train) %>%
  step_normalize(all_numeric_predictors())

df_recipe %>%
  prep(df_train) %>%
  bake(df_test)

# A tibble: 76 x 3
      x2     x3    x1
   <dbl>  <dbl> <dbl>
 1   NaN  0.946     5
 2   NaN -1.05    NaN
 3   NaN  0.946     5
 4   NaN -1.05      5
 5   NaN -1.05     10
 6   NaN  0.946    10
 7   NaN  0.946    10
 8   NaN  0.946    10
 9   NaN  0.946   NaN
10   NaN -1.05     10
# ... with 66 more rows
df_model <- rand_forest(mtry=50,trees=150) %>%
            set_engine('ranger',importance='impurity') %>%
            set_mode('regression')

df_workflow <- 
  workflow() %>% 
  add_model(df_model) %>% 
  add_recipe(df_recipe)

df_workflow %>% 
  fit(data = df_train)
# Error: Missing data in columns: x2.

I now know that this can be mitigated for this particular case with step_zv() to remove the zero-variance column, but it would be great if there had been any indication that step_normalize() introduced the NaN values.
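
For reference, a sketch of that mitigation, assuming the tidymodels packages are installed. The data frame is a simplified stand-in for the example above (no NaN in the outcome), and the seed and the name `zv_recipe` are arbitrary choices for illustration:

```r
library(tidymodels)

set.seed(920)  # arbitrary seed, only so the sketch is reproducible
df1 <- data.frame(
  x1 = sample(c(5, 10), 300, replace = TRUE),
  x2 = rep(1, 300),                           # zero-variance predictor
  x3 = sample(c(0, 1), 300, replace = TRUE)
)

zv_recipe <-
  recipe(x1 ~ ., data = df1) %>%
  step_zv(all_predictors()) %>%               # drops x2 before scaling
  step_normalize(all_numeric_predictors())

baked <- zv_recipe %>%
  prep() %>%
  bake(new_data = NULL)

names(baked)   # x2 is no longer present
anyNA(baked)   # no NaN values were introduced
```

Placing step_zv() before step_normalize() matters here: the constant column has to be removed before any standard deviation is computed for it.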

@juliasilge
Member

After we compute the standard deviations:

sds <- vapply(training[, col_names], sd, c(sd = 0), na.rm = x$na_rm)

we could check whether any are approximately zero, e.g. sds < .Machine$double.eps, and throw an error naming the affected columns. (We could also suggest using step_zv().)

We would also want to do this for step_scale().
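
A rough sketch of what such a check could look like in base R. The helper `check_zero_sd()` and its arguments are hypothetical and are not the code that ended up in recipes. It issues a warning, as the issue title requests; swapping `warning()` for `stop()` would give the error variant instead:

```r
# Hypothetical helper, not part of recipes: flags columns whose standard
# deviation is effectively zero before they are used for scaling.
check_zero_sd <- function(data, col_names, na_rm = TRUE) {
  sds <- vapply(data[, col_names, drop = FALSE], sd, c(sd = 0), na.rm = na_rm)
  zero_sd <- names(sds)[!is.na(sds) & sds < .Machine$double.eps]
  if (length(zero_sd) > 0) {
    warning(
      "Zero-variance column(s) would become NaN after scaling: ",
      paste(zero_sd, collapse = ", "),
      ". Consider using step_zv() first.",
      call. = FALSE
    )
  }
  invisible(sds)
}

# Usage on a toy data frame: x2 is constant, x3 is not.
toy <- data.frame(x2 = rep(1, 10), x3 = rep(c(0, 1), 5))
msg <- tryCatch(
  check_zero_sd(toy, c("x2", "x3")),
  warning = function(w) conditionMessage(w)
)
msg
```

Because the check only looks at the vector of standard deviations, the same helper would apply unchanged to a scaling-only step.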

@juliasilge juliasilge added the feature a feature request or enhancement label Feb 25, 2022
@EmilHvitfeldt
Member

This issue was closed by #927 thanks to @brunocarlin

@github-actions

github-actions bot commented Jul 2, 2022

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators Jul 2, 2022