Clarify what you can expect to do after bundling, i.e. predict #50

Closed
ClaudiuPapasteri opened this issue Mar 4, 2023 · 12 comments

@ClaudiuPapasteri

I am not sure if this is a known issue, as it doesn't appear in the docs. It seems that, apart from predict, other methods like tidy or rank_results fail on the unbundled object.
This SO post references the same problem.

library(tidymodels)
library(agua)
library(bundle) # provides bundle() and unbundle()
h2o_start()

data(concrete)
set.seed(4595)
concrete_split <- initial_split(concrete, strata = compressive_strength)
concrete_train <- training(concrete_split)
concrete_test <- testing(concrete_split)

auto_spec <-
  auto_ml() %>%
  set_engine("h2o", max_runtime_secs = 120, seed = 1) %>%
  set_mode("regression")

normalized_rec <-
  recipe(compressive_strength ~ ., data = concrete_train) %>%
  step_normalize(all_predictors())

auto_wflow <-
  workflow() %>%
  add_model(auto_spec) %>%
  add_recipe(normalized_rec)

auto_fit <- fit(auto_wflow, data = concrete_train)

# Save
auto_fit_bundle <- bundle(auto_fit)
saveRDS(auto_fit_bundle, file = "test.h2o.auto_fit.rds") #save the object

# Load
auto_fit_bundle <- readRDS("test.h2o.auto_fit.rds")
auto_fit <- unbundle(auto_fit_bundle)

rank_results(auto_fit)
tidy(auto_fit)

Error in UseMethod("rank_results") :
no applicable method for 'rank_results' applied to an object of class "c('H2ORegressionModel', 'H2OModel', 'Keyed')"

@juliasilge
Member

That's true, yep! The focus of bundle is to capture the references needed by a model to make predictions in a new environment. For more info, you can look at:

I would generally expect functions like tidy() and rank_results() to be called during model development, and not so much during model deployment. Can you share a bit more about your use case?
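As a rough illustration of the predict-only use case described above, here is a minimal sketch that bundles a fitted model, passes it to a fresh R session, and predicts there. It assumes the bundle, parsnip, callr, and xgboost packages are installed; the xgboost model and the mtcars row split are illustrative choices, not something taken from the original report.

```r
library(bundle)
library(parsnip)
library(callr)

# Fit a small model in the current session (illustrative data and settings)
mod <-
  boost_tree(trees = 5, mtry = 3) |>
  set_mode("regression") |>
  set_engine("xgboost") |>
  fit(mpg ~ ., data = mtcars[1:25, ])

# Capture the references the model needs in order to predict elsewhere
bundled_mod <- bundle(mod)

# Predict in a brand-new R session, which stands in for a deployment target
preds <- r(
  func = function(bundled_mod) {
    library(bundle)
    library(parsnip)

    unbundled_mod <- unbundle(bundled_mod)
    predict(unbundled_mod, new_data = mtcars[26:32, ])
  },
  args = list(bundled_mod = bundled_mod)
)

preds
```

Only predict() is exercised here; whether development-time helpers such as tidy() or rank_results() also work on the unbundled object depends on the model type.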

@ClaudiuPapasteri
Author

Thank you for the helpful reply, I suspected this was the case and the links you shared made it much clearer. Unfortunately, although the scope of the bundle package should be clear to everyone, what you can do with the unbundled object (other than predicting from it) is not so obvious (for me, at least). Maybe it would be helpful to state this more clearly in the documentation.
Anyway, thank you guys for the awesome package ecosystem, and thank you Julia, your work and talks inspired and helped me throughout my data journey. It's an honor ...

@juliasilge
Member

Thank you so much for the kind words! ❤️

Let's keep this issue open and clarify some of the documentation about what you can expect to do after bundling, especially in the README and main vignette.

(As a side note, I also maintain butcher and this is about the same as how butcher works. Sometimes we keep components in butcher that are needed for something like predict(interval="prediction") but not just your typical predictions.)

@juliasilge changed the title from "H2O AutoML with agua: beyond predict other methods fail" to "Clarify what you can expect to do after bundling, i.e. predict" on Mar 7, 2023

@Steviey

Steviey commented Mar 29, 2024

Can we use the bundle package to:

  • save a tidy model
  • reload the tidy model
  • refit the tidy model on new data
  • predict on new data

... and if so, how, when taking parsnip::auto_ml() with the h2o engine into consideration?

Referring to:

https://rstudio.github.io/bundle/

https://rstudio.github.io/bundle/articles/bundle.html

https://rstudio.github.io/bundle/reference/bundle_h2o.html

@juliasilge
Member

@Steviey The normal usage that we expect after bundling is to predict with your model, but if you can get out the parsnip object, you should be able to refit:

library(bundle)
library(parsnip)
library(callr)

## bundle a model
mod <-
    boost_tree(trees = 5, mtry = 3) %>%
    set_mode("regression") %>%
    set_engine("xgboost") %>%
    fit(mpg ~ ., data = mtcars[1:25,])

bundled_mod <- bundle(mod)

## fit the model to new data
r(
  func = function(bundled_mod) {
    library(bundle)
    library(parsnip)
    
    unbundled_mod <- unbundle(bundled_mod)
    fittable_model <- extract_spec_parsnip(unbundled_mod)
    fittable_model |> fit(mpg ~ ., data = mtcars[26:32,])
  },
  args = list(
    bundled_mod = bundled_mod
  )
)
#> parsnip model object
#> 
#> ##### xgb.Booster
#> Handle is invalid! Suggest using xgb.Booster.complete
#> raw: 7.7 Kb 
#> call:
#>   xgboost::xgb.train(params = list(eta = 0.3, max_depth = 6, gamma = 0, 
#>     colsample_bytree = 1, colsample_bynode = 0.3, min_child_weight = 1, 
#>     subsample = 1), data = x$data, nrounds = 5, watchlist = x$watchlist, 
#>     verbose = 0, nthread = 1, objective = "reg:squarederror")
#> params (as set within xgb.train):
#>   eta = "0.3", max_depth = "6", gamma = "0", colsample_bytree = "1", colsample_bynode = "0.3", min_child_weight = "1", subsample = "1", nthread = "1", objective = "reg:squarederror", validate_parameters = "TRUE"
#> callbacks:
#>   cb.evaluation.log()
#> # of features: 10 
#> niter: 5
#> nfeatures : 10 
#> evaluation_log:
#>   iter training_rmse
#>  <num>         <num>
#>      1     16.923941
#>      2     12.953166
#>      3     10.022720
#>      4      7.801856
#>      5      6.089100

Created on 2024-03-29 with reprex v2.1.0

@Steviey

Steviey commented Mar 30, 2024

@juliasilge Thank you Julia. extract_spec_parsnip() returns a parsnip model specification. Does this include hyperparameters from earlier training runs and fits done before bundling? Would this include sub-models from the leaderboard of an H2O AutoML model?

@juliasilge
Member

@Steviey Hmmmm, I am not entirely sure as I don't have a ton of experience with H2O. I think a good venue for this kind of question is the agua repo: https://github.com/tidymodels/agua

@Steviey

Steviey commented Apr 1, 2024

@juliasilge Thank you Julia for the response. The h2o issue goes deeper, into h2o itself, as mentioned for example here: business-science/modeltime.h2o#14
I would guess this is still not really resolved after some years, so my hope was that the bundle package would do the job entirely.

More generally, related to tidymodels (models other than h2o):
I'm mainly interested in refitting on new data, but with hyperparameters found in an earlier search. Let's say I train a model and search for hyperparameters on one day, bundle and save the model or workflow etc. for later use, and then the next day unbundle and refit on new/more data. Can we then reuse the effort/compute time from the day before, namely the best hyperparameters found before bundling? Are they included in the bundle for later use? Or do we have to save and retrieve that stuff separately?

This could be an ecological question too (green ML/AI).

Maybe related:
tidymodels/tune#84

If bundle requires separate actions in this regard, I'm not sure if this is still best practice:

exec(update, object = tree_mod, !!!final_param)

@juliasilge
Member

@Steviey The bundle package can handle bundling up the needed references but doesn't have functionality for getting the best hyperparameters; you'd need to get that through tidymodels infrastructure in either tune or agua. Once you have those hyperparameters, then definitely bundle will work. 👍
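One way this could look in practice is sketched below: the best hyperparameters are found with tune, saved separately with saveRDS(), and later plugged back into the model specification with finalize_model() before refitting and bundling. This is a sketch under assumptions, not a recommended workflow: the xgboost model, the mtcars data, the grid size, and the file name `best_params.rds` are all illustrative, and it assumes the tidymodels, bundle, and xgboost packages are installed.

```r
library(tidymodels)
library(bundle)

set.seed(1)
folds <- vfold_cv(mtcars, v = 3)

# Model specification with two hyperparameters left to tune (illustrative)
boost_spec <-
  boost_tree(trees = 20, tree_depth = tune(), learn_rate = tune()) |>
  set_mode("regression") |>
  set_engine("xgboost")

# Day 1: search, then store the winning hyperparameters on their own
tuned <- tune_grid(boost_spec, mpg ~ ., resamples = folds, grid = 4)
best_params <- select_best(tuned, metric = "rmse")

params_file <- file.path(tempdir(), "best_params.rds") # hypothetical location
saveRDS(best_params, params_file)

# Day 2: reload the hyperparameters, finalize the spec, refit on the data
# available now (here simply the full mtcars), and bundle the new fit
best_params <- readRDS(params_file)
final_spec <- finalize_model(boost_spec, best_params)
final_fit <- fit(final_spec, mpg ~ ., data = mtcars)

bundled_fit <- bundle(final_fit)
```

The expensive search runs only once; the refit reuses the stored values, and the resulting bundle can be saved with saveRDS() like any other bundled model.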

@Steviey

Steviey commented Apr 2, 2024

@juliasilge OK, then I would bet on finalize more than on update.

@simonpcouch
Collaborator

Feels worth mentioning that the Value documentation for each bundle method states:

The output of unbundle() is a model object that is ready to predict() on new data, and other restored functionality (like plotting or summarizing) is supported as a side effect only.

I would argue that this is sufficient to set expectations for what users can do with unbundled objects. :)

@juliasilge
Member

That's a great point @simonpcouch. 👍

We haven't heard a lot of other confusion on this point to date, so let's close this as complete. We can revisit in the future as necessary!

4 participants