Datacratic MLDB results #25
Comments
Fantastic results, @nicolaskruchten, thanks for submitting (and congrats, @datacratic). I added the code here: and the results here: I have some questions, though:
Re previous: or maybe
validation_split=0.5 is a parameter for bagging, and causes each tree in the forest to train on a random half of the data. It's called validation_split because the other half of the data is available to the weak learner for early stopping; it's not used by the decision tree classifier, but it matters when you use boosting as the weak learner. Were we to use validation_split=1.0, the only diversity would come from the selection of features in the weak learners. In our experience, using bagging in this manner gives better diversity across the trees and a better AUC on held-out examples. We'll double-check the exact effect and report back here, with an extra result if appropriate. I'm certain that it's not a holdout for the classifier experiment: all of the training data is seen by the classifier training and incorporated into the output.

Categorical features are handled directly by the decision tree training and aren't expanded as with a one-hot encoding (the decision tree training code does consider each categorical value separately, however, so the effect on the trained classifier is the same). Thus the number of features is the same as the number of columns in the dataset, and sqrt(num features) would be around 3, which is too low (it would be faster, but accuracy would suffer, especially with such deep trees).

Stepping back, we are running bagged decision trees to get an effect as close as possible to classical random forests. This produces a committee of decision trees (like random forests) but with different hyperparameters and a different means of introducing entropy into the training set. Typically we would use bagged boosted decision trees for such a task, something like 20 bags of 20 rounds of boosting of depth-5 trees, but that would be hard to compare meaningfully with the other results in the benchmark.
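To make the bagging behaviour concrete, here is a minimal sketch in Python with scikit-learn (an illustration under stated assumptions, not MLDB's implementation; names like n_bags and the synthetic data are made up) of a committee where each tree is grown on a random half of the rows and the predictions are averaged:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.tree import DecisionTreeClassifier

# Toy data standing in for the benchmark's train/test split.
X, y = make_classification(n_samples=2000, n_features=8, random_state=0)
X_train, y_train = X[:1500], y[:1500]
X_test, y_test = X[1500:], y[1500:]

rng = np.random.RandomState(0)
n_bags = 100                      # illustrative; the thread discusses 100 and 200 bags
n_train = len(X_train)
scores = np.zeros(len(X_test))

for _ in range(n_bags):
    # validation_split=0.5: each tree sees a random half of the rows; the other
    # half would only be used for early stopping by a boosting weak learner,
    # so a plain decision tree simply ignores it.
    perm = rng.permutation(n_train)
    fit_idx = perm[: n_train // 2]
    tree = DecisionTreeClassifier(random_state=rng)
    tree.fit(X_train[fit_idx], y_train[fit_idx])
    scores += tree.predict_proba(X_test)[:, 1]

print("committee AUC:", roc_auc_score(y_test, scores / n_bags))
```

With validation_split=1.0 every tree would see the same rows, so, as noted above, the only diversity in the committee would come from feature sampling.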
Thanks @jeremybarnes for the comments, and I guess for authoring JML https://github.com/jeremybarnes/jml - likely the main reason why MLDB is so fast. Wrt I see what you are saying on categorical data; that's what H2O does as well, and from what I can see it's a huge performance boost vs 1-hot encoding. I think my understanding now fits what you are saying, except a bit on your last paragraph. I have the impression now that with I'm gonna try to run the code by @nicolaskruchten soon.
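As a toy illustration of why this matters for the per-split feature sample (a sketch with made-up column names, not the benchmark pipeline): one-hot encoding turns each categorical level into its own column, which inflates the count that sqrt(num features) is taken over, whereas handling categoricals directly keeps it at the original column count.

```python
# Toy illustration (made-up columns, not the benchmark data): the feature count
# stays at the column count when categoricals are handled directly, but grows
# with the number of levels once they are one-hot encoded.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "dep_hour": [9, 17, 6, 22],
    "distance": [300, 1200, 450, 800],
    "carrier":  ["AA", "UA", "DL", "WN"],       # categorical
    "origin":   ["ORD", "SFO", "JFK", "LAX"],   # categorical
})

p_direct = df.shape[1]                  # categoricals kept as single features
p_onehot = pd.get_dummies(df).shape[1]  # each level becomes its own 0/1 column

print("features, categoricals whole:", p_direct)
print("features, one-hot expanded  :", p_onehot)
print("sqrt(p):", round(np.sqrt(p_direct), 1), "vs", round(np.sqrt(p_onehot), 1))
```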
@jeremybarnes @datacratic @nicolaskruchten I'm trying to run your code. I get the same AUC even if I change the params, e.g.
That is strange. Can you look at the stdout/err of the Docker container? You can see it finishing bags as it goes... The number should correspond to the parameter you set. That leaves the possibility that there is a problem with the random selection and so each bag is the same. That's something that we can look into over the weekend.
Yes, I can see in the output the bags going up to 100 and 200, respectively. It also takes more time to train 200, but the AUC is the same. The AUC is also the same if I change some of the other numeric params:
I'm sure you guys can figure out what's going on 100x faster than I can ;)
I'm running this: Feel free to submit a pull request for the above code if you guys make corrections.
It took a little longer than anticipated to look into, but here are the conclusions:
I would suggest that we re-submit with the parameters
Thanks, Jeremy, for the clarifications (and yes, it makes absolute sense). It's amazing that a 15-year-old tool can keep up so well while new machine learning tools are written every day. I've been saying for a while that machine learning looks more like an HPC problem to me than a "big data" one. @nicolaskruchten (or @datacratic) can you guys run it and resubmit the results here with the settings @jeremybarnes suggested above? Also, it would be great if you can update the code https://github.com/szilard/benchm-ml/blob/master/z-other-tools/9a-datacratic.py and send a PR.
I will do an MLDB release and then resubmit the results and code :) |
Awesome, thanks. |
Thanks for the new results. I'm gonna try to verify in a few days.
@nicolaskruchten I was able to verify your latest results. Also, it seems the "same AUC" problem has been fixed in the latest release. Thanks @datacratic @nicolaskruchten @jeremybarnes for contributing to the benchmark.
This code gives an AUC of 0.7417 in 12.1s for the 1M training set on an r3.8xlarge EC2 instance with the latest release of Datacratic's Machine Learning Database (MLDB), available at http://mldb.ai/
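For context on how such a number is measured, here is a minimal sketch of the benchmark protocol (time the fit, then score AUC on held-out data); this is not the MLDB script referenced in the thread, and the scikit-learn learner and synthetic data below are only placeholders:

```python
# Sketch of the measurement: wall-clock the training step, then compute AUC on
# a held-out test set. Placeholder learner and data, not the MLDB run.
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10_000, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)

start = time.time()
model.fit(X_train, y_train)
elapsed = time.time() - start

auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"train time: {elapsed:.1f}s   AUC: {auc:.4f}")
```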