type checks fail uninformatively when outcome has "label" attribute #1060

Closed
mesdi opened this issue Feb 2, 2024 · 4 comments · Fixed by #1062
Labels
bug an unexpected problem or unintended behavior

Comments

mesdi commented Feb 2, 2024

library(tidyverse)
library(WDI)

#Dataset for methane emissions
df_methane <- 
  WDI(indicator = "EN.ATM.METH.KT.CE", 
      extra = TRUE) %>% 
  as_tibble() %>% 
  janitor::clean_names() %>% 
  drop_na() %>% 
  rename(methane = en_atm_meth_kt_ce) 

#Modeling
library(tidymodels)

#Dataset for modeling
df_mod <- 
  df_methane %>% 
  filter(region != "Aggregates",
         income != "Not classified") %>% 
  mutate(latitude = latitude %>% as.numeric(),
         longitude = longitude %>% as.numeric(),
         income = income %>% as_factor(),
         region = region %>% as_factor()) %>% 
  select(region, income, longitude,latitude, methane) 


#Splitting
set.seed(12345)
df_split <- initial_split(df_mod, 
                          prop = 0.8,
                          strata = "income")

df_train <- training(df_split)
df_test <- testing(df_split)

#10-fold cross validation for tuning
set.seed(12345)
df_fold <- 
  vfold_cv(df_train,
           strata = income,
           repeats = 5)

#Linear regression models for different engines/packages
spec_lm <- 
  linear_reg() %>% 
  set_engine("lm") 


spec_glm <- 
  linear_reg() %>% 
  set_engine("glm")

spec_glmnet <- 
  linear_reg(penalty = tune(),
             mixture = tune()) %>% 
  set_engine("glmnet")

spec_keras <- 
  linear_reg(penalty = tune()) %>% 
  set_engine("keras")

spec_stan <- 
  linear_reg() %>% 
  set_engine("stan")

#Workflow set
basic_recipe <-
  recipe(methane ~ ., data = df_train) 

all_workflows <- 
  workflow_set(
    preproc = list(basic = basic_recipe), 
    models = list(LM = spec_lm, 
                  GLM = spec_glm,
                  GLMNET = spec_glmnet,
                  Keras = spec_keras,
                  Stan = spec_stan)
  )

#Tuning and evaluating the models
grid_ctrl <-
  control_grid(
    save_pred = TRUE,
    parallel_over = "everything",
    save_workflow = TRUE
  )

grid_results <-
  all_workflows %>%
  workflow_map(
    seed = 98765,
    resamples = df_fold,
    grid = 15,
    control = grid_ctrl
  )

#one of the error lines:  
# `y` should be one of the following classes: 'data.frame', 'matrix', 'factor', 'Surv'

simonpcouch commented Feb 2, 2024

Ah, this was an interesting one. Thanks for the issue!

I'm able to reproduce this by just passing a linear regression specification to tune_grid(). The issue here is that, when fitting regression models, tidymodels usually expects the outcome to be a vector. As far as is.vector() is concerned, it isn't one, though:

library(tidyverse)
library(tidymodels)
library(WDI)

df_methane <- 
   WDI(indicator = "EN.ATM.METH.KT.CE", 
       extra = TRUE) %>% 
   as_tibble() %>% 
   janitor::clean_names() %>% 
   drop_na() %>% 
   rename(methane = en_atm_meth_kt_ce) 

is.vector(df_methane$methane)
#> [1] FALSE

is.vector() returns FALSE because WDI() has attached a "label" attribute to the column, and is.vector() is only TRUE for objects with no attributes other than names:

attr(df_methane$methane, "label")
#> [1] "Methane emissions (kt of CO2 equivalent)"
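
For what it's worth, this is base R behavior rather than anything specific to WDI(). A minimal sketch with a made-up vector (the name `x` and the label text are illustrative only, not taken from the data above):

```r
# A plain numeric vector passes is.vector()
x <- c(1.5, 2.5, 3.5)
plain <- is.vector(x)

# Attaching any attribute other than "names" makes is.vector() return FALSE,
# even though the object is still numeric
attr(x, "label") <- "an illustrative label"
labelled <- is.vector(x)
still_numeric <- is.numeric(x)

# Removing the attribute restores the original behavior
attr(x, "label") <- NULL
restored <- is.vector(x)

c(plain = plain, labelled = labelled,
  still_numeric = still_numeric, restored = restored)
```

So the values themselves are fine; only the extra attribute trips the check.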

Thus:

df_mod <- 
   df_methane %>% 
   filter(region != "Aggregates",
          income != "Not classified") %>% 
   mutate(latitude = latitude %>% as.numeric(),
          longitude = longitude %>% as.numeric(),
          income = income %>% as_factor(),
          region = region %>% as_factor()) %>% 
   select(region, income, longitude,latitude, methane) 

set.seed(12345)
df_split <- initial_split(df_mod, 
                          prop = 0.8,
                          strata = "income")

df_train <- training(df_split)
df_test <- testing(df_split)

set.seed(12345)
df_fold <- 
   vfold_cv(df_train,
            strata = income,
            repeats = 5)

res <-
   tune_grid(
      linear_reg(),
      methane ~ ., 
      df_fold
   )
#> Warning: No tuning parameters have been detected, performance will be evaluated
#> using the resamples with no tuning. Did you want to [tune()] parameters?
#> → A | error:   `y` should be one of the following classes: 'data.frame', 'matrix', 'factor', 'Surv'
#> There were issues with some computations   A: x1

For now, the workaround is to remove the label attribute before modeling:

# remove the label attribute ---------------------------------------------
attr(df_methane$methane, "label") <- NULL

df_mod <- 
   df_methane %>% 
   filter(region != "Aggregates",
          income != "Not classified") %>% 
   mutate(latitude = latitude %>% as.numeric(),
          longitude = longitude %>% as.numeric(),
          income = income %>% as_factor(),
          region = region %>% as_factor()) %>% 
   select(region, income, longitude,latitude, methane) 

set.seed(12345)
df_split <- initial_split(df_mod, 
                          prop = 0.8,
                          strata = "income")

df_train <- training(df_split)
df_test <- testing(df_split)

set.seed(12345)
df_fold <- 
   vfold_cv(df_train,
            strata = income,
            repeats = 5)


res <-
   tune_grid(
      linear_reg(),
      methane ~ ., 
      df_fold
   )
#> Warning: No tuning parameters have been detected, performance will be evaluated
#> using the resamples with no tuning. Did you want to [tune()] parameters?

Created on 2024-02-02 with reprex v2.1.0
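
If several columns carry labels, the same idea can be applied to every column at once. This is only a sketch in base R: `strip_labels()` is a hypothetical helper, not a function from any package, and `toy` is made-up data standing in for the WDI() output.

```r
# Hypothetical helper: drop the "label" attribute from every column of a
# data frame, leaving all other attributes (e.g. factor levels) untouched
strip_labels <- function(df) {
  df[] <- lapply(df, function(col) {
    attr(col, "label") <- NULL
    col
  })
  df
}

# Toy data standing in for the WDI() output
toy <- data.frame(
  methane = c(10, 20, 30),
  income  = factor(c("Low", "High", "Low"))
)
attr(toy$methane, "label") <- "Methane emissions"

# The labelled column fails is.vector(); the cleaned one passes
before <- is.vector(toy$methane)
toy_clean <- strip_labels(toy)
after <- is.vector(toy_clean$methane)

c(before = before, after = after)
```

Running something like `strip_labels(df_methane)` before building `df_mod` should have the same effect as the single `attr()<-` call above, under the assumption that "label" is the only extra attribute involved.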

I'm going to move this to parsnip, where the issue originates.

simonpcouch changed the title from "Using a basic formula/preprocessor in the workflow set function does not work with workflow_map" to: type checks fail uninformatively when outcome has "label" attribute (Feb 2, 2024)
simonpcouch transferred this issue from tidymodels/workflows on Feb 2, 2024
simonpcouch commented

The offending lines:

parsnip/R/fit.R

Lines 414 to 417 in 8ccf1be

# `y` can be a vector (which is not a class), or a factor or
# Surv object (which are not vectors)
if (!is.null(y) && !is.vector(y))
  inher(y, c("data.frame", "matrix", "factor", "Surv"), cl)
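
To see why a labelled numeric falls through, here is a rough stand-in for that condition in base R. `passes_check()` is a hypothetical name; inher() is internal to parsnip, and the assumption here (based only on the error message) is that it errors when the object inherits from none of the listed classes. The sketch returns TRUE/FALSE instead of raising an error.

```r
# Stand-in for the quoted check: anything that is.vector() rejects must
# inherit from one of the listed classes
passes_check <- function(y) {
  if (!is.null(y) && !is.vector(y)) {
    return(inherits(y, c("data.frame", "matrix", "factor", "Surv")))
  }
  TRUE
}

y_plain    <- c(1, 2, 3)
y_labelled <- structure(c(1, 2, 3), label = "hi")
y_factor   <- factor(c("a", "b"))

# Plain vectors skip the class check; factors pass it; a labelled numeric
# is neither a "vector" nor one of the listed classes, so it fails
c(plain    = passes_check(y_plain),
  labelled = passes_check(y_labelled),
  factor   = passes_check(y_factor))
```

The labelled numeric is the only one rejected, and the resulting message lists classes that have nothing to do with the actual problem, which is what makes the failure uninformative.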

simonpcouch commented

Here's a more minimal reprex:

library(parsnip)

fit_xy(
  linear_reg(),
  data.frame(x = 1:5),
  y = data.frame(structure(rnorm(5), label = "hi"))
)
#> Error in `inher()`:
#> ! `y` should be one of the following classes: 'data.frame', 'matrix', 'factor', 'Surv'

Created on 2024-02-02 with reprex v2.1.0


This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

github-actions bot locked and limited conversation to collaborators on Feb 27, 2024