Add warning if step_normalize introduces NaN's in column #920

@FieteO

Background

For columns with zero variance, step_normalize() replaces the whole column with NaN values. I had a very hard time debugging this: when I fed the data to a model, the fit failed with Error: Missing data in columns: x2.
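
The NaN values follow from the arithmetic of centering and scaling. Here is a minimal base R sketch, assuming normalization is computed as (x - mean) / sd. The vector x is a hypothetical stand-in for a constant column:

```r
# Hypothetical constant column, similar to x2 in the example below
x <- rep(1, 5)

# A constant column has a standard deviation of zero
sd(x)
# [1] 0

# Centering gives 0 everywhere, and 0 / 0 is NaN in R
(x - mean(x)) / sd(x)
# [1] NaN NaN NaN NaN NaN
```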

Reproducible example

library(tidymodels)

x1 <- sample(c(NaN, 5, 10), 300, replace = TRUE)  # outcome, contains NaN
x2 <- sample(1, 300, replace = TRUE)              # constant column (always 1)
x3 <- sample(c(0, 1), 300, replace = TRUE)
df1 <- data.frame(x1, x2, x3)

dfsplits <- initial_split(df1, strata = x3)
df_train <- training(dfsplits)
df_test  <- testing(dfsplits)

df_recipe <-
  recipe(x1 ~ ., data = df_train) %>%
  step_normalize(all_numeric_predictors())

df_recipe %>%
  prep(df_train) %>%
  bake(df_test)

# A tibble: 76 x 3
      x2     x3    x1
   <dbl>  <dbl> <dbl>
 1   NaN  0.946     5
 2   NaN -1.05    NaN
 3   NaN  0.946     5
 4   NaN -1.05      5
 5   NaN -1.05     10
 6   NaN  0.946    10
 7   NaN  0.946    10
 8   NaN  0.946    10
 9   NaN  0.946   NaN
10   NaN -1.05     10
# ... with 66 more rows

df_model <-
  rand_forest(mtry = 50, trees = 150) %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("regression")

df_workflow <- 
  workflow() %>% 
  add_model(df_model) %>% 
  add_recipe(df_recipe)

df_workflow %>% 
  fit(data = df_train)
# Error: Missing data in columns: x2.
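
A check along the following lines could flag the problem before the model is fit. This is only a sketch in base R: warn_if_nan_introduced() is a hypothetical helper, not part of recipes, and scale() stands in for the normalization step.

```r
# Hypothetical helper (not part of recipes): warn about columns that
# contain more missing values after processing than before.
warn_if_nan_introduced <- function(before, after) {
  shared <- intersect(names(before), names(after))
  na_before <- vapply(before[shared], function(col) sum(is.na(col)), integer(1))
  na_after  <- vapply(after[shared],  function(col) sum(is.na(col)), integer(1))
  affected <- shared[na_after > na_before]
  if (length(affected) > 0) {
    warning("Missing values introduced in: ", paste(affected, collapse = ", "))
  }
  invisible(affected)
}

# Small hypothetical data set with one constant column
raw <- data.frame(x2 = rep(1, 4), x3 = c(0, 1, 0, 1))

# scale() centers each column and divides by its standard deviation,
# so the constant column x2 becomes NaN
processed <- as.data.frame(scale(raw))

# Emits a warning naming x2
warn_if_nan_introduced(raw, processed)
```

The helper compares missing-value counts per column rather than testing for zero variance directly, so it would also catch NaN values introduced by other preprocessing steps.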

I now know that, in this particular case, the problem can be avoided by adding step_zv() to remove the zero-variance column. Still, it would have been very helpful if step_normalize() had given some indication that it introduced the NaN values.
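
The step_zv() workaround might look like the sketch below. It assumes the recipes package is installed. The data frame df is a hypothetical, simplified version of the example above (no NaN in the outcome), and step_zv() is placed before step_normalize() so the constant column is dropped before scaling.

```r
library(recipes)

set.seed(920)  # arbitrary seed, only to make the sketch repeatable
df <- data.frame(
  x1 = sample(c(5, 10), 300, replace = TRUE),  # outcome
  x2 = rep(1, 300),                            # zero-variance predictor
  x3 = sample(c(0, 1), 300, replace = TRUE)
)

rec <-
  recipe(x1 ~ ., data = df) %>%
  step_zv(all_numeric_predictors()) %>%        # drop zero-variance columns first
  step_normalize(all_numeric_predictors())     # then center and scale the rest

# new_data = NULL returns the processed training data
baked <- rec %>% prep() %>% bake(new_data = NULL)

# x2 is no longer present, and no column contains missing values
names(baked)
any(sapply(baked, anyNA))
```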

Labels: feature (a feature request or enhancement)
