This repository has been archived by the owner on Feb 14, 2023. It is now read-only.

H2O + Modeltime (Forecasting) #5

Open
mdancho84 opened this issue Mar 8, 2021 · 2 comments

Comments


mdancho84 commented Mar 8, 2021

Hi @stevenpawley ,

Nice looking package! I'm very interested in this. I wanted to see if I can help progress your development efforts.

Background

I wanted to reach out as the developer of modeltime (a forecasting ecosystem based on tidymodels) and an educator at Business Science, where I'm forming a small team of students to help with software development efforts.

Our next project is combining Modeltime (which leverages tidymodels) with H2O. It appears you have already done a lot of the heavy lifting in your h2oparsnip package.

I wanted to see if there is an opportunity to collaborate. We are looking at making a smaller, focused package that uses the h2o automl algorithm as a forecasting tool.

Next Steps

  • I'd love to see h2oparsnip make its way onto CRAN. We can help with this.
  • I'd also like to see if there is any interest in collaborative development efforts. We can help to expedite the h2oparsnip development process.
mdancho84 mentioned this issue Mar 8, 2021
@stevenpawley (Owner) commented:
Hi Matt,

Thanks for getting in touch! I would definitely welcome collaboration. Some background: I started h2oparsnip because I wanted to mix h2o models into workflows that aren't centred entirely on h2o. However, the questions that have arisen since then mostly concern striking the right balance between using data efficiently within the cluster and still working with the tidymodels approach. I assumed (based on my own use cases) that the data won't reside exclusively in the cluster, because if it does you should probably be using h2o directly. Parsnip also doesn't currently accept H2OFrames (although I suppose it could, since it supports spark).

Max Kuhn brought up a drawback of tuning within the tune infrastructure: when working on a remote cluster, it leads to a lot of back and forth with the data. Although I typically work with data in the same location, I occasionally connect to a remote cluster, and the transfer does add overhead. Because of this, there is an experimental 'tune_grid_h2o' function that moves the data into the cluster once and accepts/returns a resamples object that can otherwise be used in tune to finalize parsnip models, etc. I've been working on other things over the past few months and haven't given it enough thought, but there are trade-offs either way. First, to minimize data transfer, the scoring also has to occur in the cluster, which restricts the metrics to those that h2o supports. Also, you cannot tune recipe parameters. So potential options to move forward include:

  1. Finish tune_grid_h2o and/or possibly extend it so that it supports recipe parameters while attempting to minimize the back-and-forth, accepting that there will be some movement of data to the cluster.

  2. Ask for parsnip to accept H2OFrames, and/or use the h2o.grid function inside each model specification so that any hyper_params supplied as engine arguments are used for tuning automatically. This way the data can stay entirely within the cluster and it requires almost no work for the package; however, you obviously can't tune recipes or control the resampling as in tune. It is also awkward to select the best model other than via the default metric. Most problematic, if someone tunes the model this way but also tunes a recipe using tune, the resampling scheme will not be correct, so I feel that (1) is better.

  3. Managing data in the cluster in general. When you use h2o via parsnip as a drop-in replacement, it's easy to forget about managing the cluster and removing model clutter, particularly when tuning. This could be partially addressed in (1) via control options that specify whether predictions and/or models are retained or removed.

  4. The other aspect is the automl, which I haven't used much, so beyond the basic model specification I don't have a good understanding of the use cases, particularly of how you would want to use automl within a more composable workflow like tidymodels. So more work on that aspect, or even just discussion of use cases, would also be really welcome.
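To make option (1) concrete, here is a hypothetical sketch of how tune_grid_h2o might be used. The function is experimental, so its exact signature and argument names here are assumptions, not the confirmed API; the surrounding tune/rsample/dials calls are standard:

```r
# Hypothetical usage of the experimental tune_grid_h2o() described above.
# The tune_grid_h2o() call shape is an assumption; everything else is
# ordinary tidymodels tuning code.
library(parsnip)
library(tune)
library(rsample)
library(dials)
library(h2oparsnip)

folds <- vfold_cv(mtcars, v = 5)

spec <- rand_forest(mtry = tune(), trees = tune()) |>
  set_engine("h2o") |>
  set_mode("regression")

grid <- grid_regular(
  mtry(range = c(2L, 6L)),
  trees(range = c(100L, 500L)),
  levels = 3
)

# Per the description: data moves into the cluster once, scoring happens
# cluster-side (so metrics are limited to those h2o supports), and a
# tune-compatible resamples object comes back.
res <- tune_grid_h2o(spec, mpg ~ ., resamples = folds, grid = grid)

show_best(res, metric = "rmse")
```

Because the result is usable by tune, `select_best()`/`finalize_model()` would then work as usual, which is the main appeal of option (1) over tuning entirely inside h2o.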

I’m not sure how this fits into your plans but I’d appreciate any discussion and thoughts!

Steve

@mdancho84 (Author) commented:

Hey Steve,

Thanks for getting back to me. Yeah, when I read this I immediately thought about the challenges of H2OFrame vs. data.frame and the expensive data movement/transfer headaches that can result. I feel a better approach may be to simplify first and tackle the tougher problems down the road, especially if we need support from the tidymodels team (Max et al.).

Thoughts on H2O AutoML <- Start Here

H2O AutoML is one of h2o's greatest features. It creates and ranks many models, reducing the need for manual tuning. That is a big benefit here, because most of the expensive data transfer comes from data.frame-to-H2OFrame conversions moving data in and out of the cluster, and fewer manual tuning iterations means fewer of those conversions.

My gut is telling me to start here. It's just so powerful and easy to use, and it removes much of the demand for manual tuning.
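For reference, a minimal AutoML run via the h2o R package directly (not through h2oparsnip). After the single `as.h2o()` conversion, all model building and scoring stays inside the cluster, which is why AutoML sidesteps most of the transfer problem:

```r
# Minimal h2o AutoML run: one conversion in, everything else cluster-side.
library(h2o)
h2o.init()

train <- as.h2o(mtcars)  # the only data.frame -> H2OFrame transfer

aml <- h2o.automl(
  y = "mpg",              # response; all other columns used as predictors
  training_frame = train,
  max_models = 10,        # cap the number of models trained
  max_runtime_secs = 60,  # or cap by wall-clock time
  seed = 1
)

h2o.head(aml@leaderboard)             # models ranked by the default metric
pred <- h2o.predict(aml@leader, train)  # predictions also stay in the cluster
```

Wrapping this leaderboard-style interface in a modeltime-friendly spec is essentially the design question raised in point (4) above.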

Individual Algorithms

It looks like you have the heavy lifting done here. I'll take a deeper look to see if there's anything that I can add.

Tuning

This is where data transfer gets expensive. You're right: rather than moving the data, it may be better to move the tuning parameters into the cluster. This could be challenging, but it would limit data transfers during tuning.
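A sketch of what "transfer the tuning parameters instead of the data" looks like using h2o's own grid search: only the `hyper_params` list crosses to the cluster, while the training frame stays resident there (this is option (2) from Steve's list, with its limitations on recipes and resampling control):

```r
# Cluster-side grid search: hyperparameters move, the data does not.
library(h2o)
h2o.init()

train <- as.h2o(mtcars)  # single transfer into the cluster

grid <- h2o.grid(
  algorithm = "gbm",
  x = setdiff(names(train), "mpg"),
  y = "mpg",
  training_frame = train,
  hyper_params = list(          # the only thing shipped per-candidate
    max_depth  = c(3, 5, 7),
    learn_rate = c(0.05, 0.1)
  ),
  nfolds = 3                    # resampling handled by h2o, not rsample
)

# Models are scored inside the cluster; only the leaderboard comes back,
# sorted by an h2o-supported metric.
h2o.getGrid(grid@grid_id, sort_by = "rmse", decreasing = FALSE)
```

Note the trade-off flagged earlier: the cross-validation here is h2o's own, so combining it with recipe tuning in tune would use an inconsistent resampling scheme.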
