
AutoML H2o #4

Closed
Shafi2016 opened this issue Feb 8, 2021 · 3 comments

Comments


Shafi2016 commented Feb 8, 2021

Thanks for the nice package. Do you have any documentation for it, or a basic example of running AutoML? I was thinking of combining it with modeltime in R.

@stevenpawley (Owner)

Unfortunately I haven't gotten that far yet. However, it's mostly the same as using anything else in tidymodels/parsnip, except that you set the engine to "h2o". A conceptual difficulty is that H2O works best when the data is kept within an H2OFrame, but that doesn't work if you are using other tidymodels features (e.g. recipes, tune, etc.), which require the data to be in the R environment. There is a very rough tune_grid_h2o function in the package, which keeps the data within the H2O cluster.
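
For example, a rough sketch of the workflow (assuming boost_tree() is one of the registered specs; the package name and the registered models are assumptions here):

```r
library(parsnip)
library(h2oparsnip)  # this repository's package (name assumed)

h2o::h2o.init()  # start a local H2O cluster

# An ordinary parsnip specification; only the engine changes
spec <- boost_tree(trees = 100) %>%
  set_engine("h2o") %>%
  set_mode("regression")

# fit() takes a plain data frame; the data is converted to an
# H2OFrame behind the scenes (the conversion cost mentioned above)
fitted_model <- fit(spec, mpg ~ ., data = mtcars)

predict(fitted_model, new_data = mtcars)
```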

@Shafi2016 (Author)

Perhaps you could collaborate with the H2O team to improve it further.

@mdancho84

I agree with @stevenpawley that we need to minimize data conversion (converting to/from a data frame and an H2OFrame is actually very expensive).

The nice thing about H2O AutoML is that it manages the whole tuning process, so there shouldn't be much hyperparameter tweaking. If there is, the user can use set_engine() to specify the needed arguments, which would go straight to the h2o::h2o.automl() function and be used in the tuning process within H2O.
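
For example, a sketch (assuming the package exposes an automl() spec; max_runtime_secs and max_models are genuine h2o::h2o.automl() arguments):

```r
library(parsnip)
library(h2oparsnip)  # assumed package name

# Engine arguments pass straight through to h2o::h2o.automl()
auto_spec <- automl(mode = "regression") %>%
  set_engine("h2o",
             max_runtime_secs = 120,  # cap total training time
             max_models = 20)         # cap the number of models tried

auto_fit <- fit(auto_spec, mpg ~ ., data = mtcars)
```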

With that said, the only challenge I see is that (unlike other H2O algorithms) AutoML returns a leaderboard of ranked models rather than a single fit. This requires a choice on the user's end. Typically my choices are:

  • Best (normally ends up being a Stacked Ensemble)
  • Best explainable (I usually take the top explainable model, e.g. XGBoost, GBM, or Deep Learning; Stacked Ensembles don't have variable importance, so I don't use them for explainability)

An option during the training process would be to store both of these models. Then, when the user serializes (saves) the fit, they get both models: prediction happens with the best model, and explanation happens with the best explainable one.
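
Roughly, using the h2o API directly (assuming `aml` is a fitted H2OAutoML object):

```r
# Leaderboard as a plain data frame, ranked best-first
lb <- as.data.frame(aml@leaderboard)

# "Best": the leader, often a Stacked Ensemble
best <- aml@leader

# "Best explainable": the highest-ranked non-ensemble model,
# which still has variable importance
explainable_id <- lb$model_id[!grepl("StackedEnsemble", lb$model_id)][1]
best_explainable <- h2o::h2o.getModel(explainable_id)
```

Both models could then be stored on the parsnip fit and serialized together.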

These are just my thoughts... I'd be happy to discuss more as part of #5.
