
Add H2O models #283

Closed · zachmayer opened this issue Oct 19, 2015 · 8 comments

Comments
@zachmayer (Collaborator) commented Oct 19, 2015

GLM, RF, and GBM, to start. I've been playing around with these on my local machine, and they're pretty awesome. Very fast training, and you can just install the package from CRAN to get started locally.

http://h2o-release.s3.amazonaws.com/h2o/rel-slater/8/docs-website/h2o-docs/index.html#Data%20Science%20Algorithms

Maybe add some pre-defined deep neural networks too (e.g. 1, 2, and 3 hidden layers).

@topepo (Owner) commented Oct 20, 2015

If you already have one, can you show a template?

I'm guessing that the user should call h2o.init() before running the model (as opposed to it being called inside the train method).

It might also help to look at what Erin LeDell has done with R/H2O ensembles.
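For reference, a hedged sketch of the fit/predict pattern such a module would need to wrap, assuming h2o.init() is run once up front and the iris data are just illustrative:

library(h2o)
h2o.init(nthreads = -1)                   # start (or connect to) the local JVM-backed cluster

dat  <- as.h2o(iris)                      # h2o models take H2OFrames, not data.frames
fit  <- h2o.gbm(x = setdiff(names(dat), "Species"),
                y = "Species",
                training_frame = dat,
                ntrees = 100)
pred <- as.data.frame(h2o.predict(fit, dat))   # pull predictions back into R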

@ck37 (Contributor) commented Dec 9, 2015

+1 to this - I'm also looking into h2o.

@coforfe commented Jan 23, 2016

Well, besides the initialization of the JVM with h2o.init(), the way data.frames are stored changes too: data have to be converted to H2OFrames.
The package also includes a very interesting new algorithm for dimensionality reduction, "Generalized Low Rank Models" (h2o.glrm()), which is quite impressive, in particular for treating NAs.
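A minimal GLRM sketch, assuming the h2o.glrm()/h2o.reconstruct() interface on an H2OFrame (the airquality data and rank k = 2 are just for illustration):

library(h2o)
h2o.init()
aq <- as.h2o(airquality)                          # airquality has NAs in Ozone and Solar.R
glrm_fit <- h2o.glrm(training_frame = aq, k = 2)  # rank-2 decomposition
recon <- h2o.reconstruct(glrm_fit, aq)            # low-rank reconstruction fills in the NAs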

These references are very convenient (from the "H2O World" conference):

https://www.youtube.com/watch?v=zwvzGuS82MA&list=PLNtMya54qvOH6YAVFigzoXb4iIzl0cvgd&index=57

http://www.slideshare.net/0xdata/h2o-world-geeneralized-low-rank-models-madeleine-udell

https://www.youtube.com/watch?v=gEZtZRANeLc&list=PLNtMya54qvOH6YAVFigzoXb4iIzl0cvgd&index=30

http://arxiv.org/pdf/1410.0342v4.pdf

http://www.slideshare.net/0xdata/h2o-world-glrm-anqi-fu

@topepo (Owner) commented Sep 17, 2016

Two new models were checked in, if you want to test them.
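A hedged sketch of testing one of them through train() ("gbm_h2o" is the method name used later in this thread; the small iris example and resampling settings are just placeholders):

library(caret)
library(h2o)
h2o.init(nthreads = -1)        # start h2o yourself, before calling train()

ctrl <- trainControl(method = "cv", number = 5)
fit  <- train(Species ~ ., data = iris,
              method = "gbm_h2o",
              trControl = ctrl)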

topepo added a commit that referenced this issue Sep 17, 2016
@ganeshmailbox commented Sep 27, 2016

I tested this feature; it is pretty nifty, and it sets the path to a more scalable caret. A few observations:

  1. I have a data set of about 13K records with 60 features. I ran the exact same data set with the exact same control settings and found that gbm_h2o took about 10x more time than the normal gbm caret model. This is on a single laptop (Windows) with 4 cores, with h2o set to utilize all cores (nthreads = -1 in the h2o.init() call).
  2. During the run I also saw that the CPU usage of the Java process (h2o) keeps dropping while the rsession CPU goes up and down. Does this mean there is some back and forth between Java and R during the train function call? I would have expected the R CPU usage to remain very low and the Java CPU to be consistently high.
@topepo (Owner) commented Sep 28, 2016

> ran the exact same data set with the exact same control settings and found that gbm_h2o took about 10x more time than the normal gbm caret model.

I'm not surprised. For each model fit, the data need to be passed over to h2o and then the model is run. I'll try to think of a way to optimize there, but I haven't seen many facilities for doing so.

Also keep in mind that caret does a lot of optimizations. For example, if you use:

gbm_grid <- expand.grid(interaction.depth = seq(1, 7, by = 2),  # 4 values
                        n.trees = seq(100, 1000, by = 50),      # 19 values
                        shrinkage = c(0.01, 0.001),             # 2 values
                        n.minobsinnode = c(5, 10, 15))          # 3 values

caret only has to fit 24 models instead of the full set of 4 × 19 × 2 × 3 = 456 models (per resample): a single gbm fit can generate predictions for any smaller number of trees, so only the 4 × 2 × 3 = 24 combinations of the other parameters need separate fits.

> During the run I also saw that the CPU usage of the Java process (h2o) keeps dropping while the rsession CPU goes up and down. Does this mean there is some back and forth between Java and R during the train function call? I would have expected the R CPU usage to remain very low and the Java CPU to be consistently high.

I have no idea. I would try writing a for loop that fits some models, to see if the same effect can be reproduced outside of train.
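A minimal sketch of that experiment, assuming the h2o.gbm() interface (the iris data and loop count are just placeholders):

library(h2o)
h2o.init(nthreads = -1)
dat <- as.h2o(iris)
for (i in 1:20) {
  # repeated fits outside of train(); watch the Java vs. R CPU usage here
  fit <- h2o.gbm(x = 1:4, y = 5, training_frame = dat, ntrees = 100)
}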

@coforfe commented Sep 28, 2016

Hello,

Thanks for this.

It is surely something very basic, but I am not able to see these new h2o functions when I install the caret dev version from GitHub. I first remove the stable version (6.0-71) and then install the new one via devtools, but I do not see gbm_h2o.R to test.
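A sketch of the install steps, assuming the package lives in the pkg/caret subdirectory of the repo; note that caret's model definitions such as gbm_h2o.R are source files compiled into the internal model list, not exported functions:

remove.packages("caret")
devtools::install_github("topepo/caret", subdir = "pkg/caret")  # subdir path is an assumption

# the model is looked up by name rather than called directly:
names(caret::getModelInfo("gbm_h2o"))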

Regarding how to interact with h2o, it could help to see what mlr has already implemented:

https://github.com/mlr-org/mlr/blob/master/R/RLearner_classif_h2ogbm.R

Thanks again,
Carlos.

@topepo (Owner) commented Jul 22, 2017

Added a card to the new models project page instead of using issues.

@topepo topepo closed this Jul 22, 2017