SMILE #32

Open

haifengl opened this issue Dec 23, 2015 · 9 comments

Comments

@haifengl

Thanks for the great work! We have an open source machine learning library called SMILE (https://github.com/haifengl/smile). We have incorporated your benchmark (https://github.com/haifengl/smile/blob/master/benchmark/src/main/scala/smile/benchmark/Airline.scala). We found that our system is much faster on this data set. For 100K training rows on a 4-core machine, we can train a random forest with 500 trees in 100 seconds and gradient boosted trees with 300 trees in 180 seconds. Projected to 32 cores, I think we will be much faster than all the tools you tested. You can try it out by cloning our project and then running

sbt benchmark/run

This also runs a benchmark on the USPS data, which you may ignore. Thanks!

@haifengl
Author

A couple of questions about your benchmark. First, about your data encoding: do you use the original 8 variables directly, or do you convert them to another representation?

Also, the data is highly unbalanced (positive : negative is about 1 : 4). Do you rebalance the data before training?

Can you also report other metrics besides AUC, such as accuracy, sensitivity, specificity, etc.? None of them is perfect, but it would be better to report more than just AUC. Thanks!

@haifengl
Author

BTW, our random forest AUC is low because the prediction probabilities are derived from votes instead of from leaf weights. We will update the calculation ASAP.

The AUC of our gradient boosted trees matches the other systems.

@szilard
Owner

szilard commented Dec 26, 2015

Thanks, I'll try it out.

Re: questions. I use the original (categorical) encoding for the algos/implementations that can deal with it and 1-hot encoding for the ones that cannot.
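
For reference, a minimal sketch of what 1-hot encoding a categorical column looks like. This is illustrative only; the helper name and types are assumptions, not the benchmark's actual code:

// Minimal 1-hot encoding sketch: one 0/1 slot per category level.
// Hypothetical helper, not from the benchmark repository.
def oneHot(values: Seq[String]): Seq[Array[Double]] = {
  val levels = values.distinct.sorted.toIndexedSeq   // observed category levels
  values.map { v =>
    val vec = Array.fill(levels.size)(0.0)
    vec(levels.indexOf(v)) = 1.0                     // mark the slot for this value
    vec
  }
}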

1:4 is not really "highly" unbalanced (1:100 would be), so I do not rebalance.

Surely, AUC is not "complete", but it captures a lot of what I'm interested in.

Yes, for RF averaging probabilities gives better AUC than averaging votes.
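
A minimal sketch of the difference, assuming hypothetical per-tree outputs (not SMILE's or the benchmark's API):

// Two ways to score a forest; illustrative only.
// Hard votes give coarse scores (k / nTrees); averaged leaf probabilities give
// smoother scores, which typically yields a better AUC.
def voteScore(treeVotes: Seq[Int]): Double =      // each vote is 0 or 1
  treeVotes.sum.toDouble / treeVotes.size

def probScore(treeProbs: Seq[Double]): Double =   // each tree's P(positive) from its leaf
  treeProbs.sum / treeProbs.size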

@haifengl
Author

Thanks! There are two real-valued variables (departure time and distance). Do you also treat them as categorical?

This data is unbalanced. Even though the AUC is about 70%, the sensitivity is only about 10% (at 99% specificity), which is pretty much useless for this particular problem. Our implementation can assign different weights to the classes. By adjusting the weights, we can achieve much higher sensitivity (and of course lower specificity) at a lower AUC. I feel that this is more meaningful in practice. As your benchmark is mostly about speed and memory usage, it may not be important.
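
A minimal sketch of the idea, with hypothetical weight parameters (not SMILE's actual API): with class weights wPos and wNeg, predicting positive whenever wPos * p >= wNeg * (1 - p) is the same as lowering the probability threshold to wNeg / (wPos + wNeg), which trades specificity for sensitivity.

// Weighted decision rule; illustrative only, parameter names are assumptions.
// Raising wPos lowers the effective threshold, trading specificity for sensitivity.
def predictWeighted(p: Double, wPos: Double, wNeg: Double): Boolean =
  wPos * p >= wNeg * (1.0 - p)   // equivalent to p >= wNeg / (wPos + wNeg)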

@haifengl
Author

haifengl commented Jan 7, 2016

Have you tried it? Is there anything I can help with? Thanks!

@szilard
Owner

szilard commented Jan 7, 2016

No, sorry. And I'll have very limited time for the next 3-4 weeks for sure. How about you take a look at https://github.com/szilard/benchm-ml/tree/master/z-other-tools, run random forests with 100 trees on 32 cores on the 1M dataset, and tell me the run time and AUC?

@haifengl
Author

haifengl commented Jan 7, 2016

No problem. I ran the 1M dataset on my 4-core Mac (while using it for other things). Here is the printout:

--------------- 100K samples ---------------------
class: "N", "Y"
train data size: 100000, test data size: 100000
train data positive : negative = 19044 : 80956
test data positive : negative = 21617 : 78383
Training Random Forest of 500 trees...
runtime: 40691.435646 ms
Accuracy = 78.56%
Sensitivity = 2.17%
Specificity = 99.62%
AUC = 69.05%
OOB error rate = 18.93%

Training Gradient Boosted Trees of 300 trees...
runtime: 6321.360014 ms
Accuracy = 79.66%
Sensitivity = 8.84%
Specificity = 99.19%
AUC = 72.50%

Training AdaBoost of 300 trees...
runtime: 6180.334174 ms
Accuracy = 79.06%
Sensitivity = 7.85%
Specificity = 98.70%
AUC = 71.76%

--------------- 1M samples ---------------------
class: "N", "Y"
train data size: 1000000, test data size: 100000
train data positive : negative = 192982 : 807018
test data positive : negative = 21617 : 78383
Training Random Forest of 500 trees...
runtime: 1436028.498601 ms
Accuracy = 78.41%
Sensitivity = 0.15%
Specificity = 99.99%
AUC = 69.91%
OOB error rate = 19.26%

Training Gradient Boosted Trees of 300 trees...
runtime: 83840.278901 ms
Accuracy = 79.63%
Sensitivity = 8.13%
Specificity = 99.35%
AUC = 72.79%

Training AdaBoost of 300 trees...
runtime: 96979.686961 ms
Accuracy = 79.15%
Sensitivity = 8.32%
Specificity = 98.68%
AUC = 71.65%

Note that I report other metrics besides AUC and also run AdaBoost. For gradient boosting, I use your second setting (300 trees). Thanks!

@haifengl
Author

haifengl commented Jan 7, 2016

My running times are in milliseconds, so it is about 1436 seconds for the random forest, 84 seconds for gradient boosting, and 97 seconds for AdaBoost on the 1M dataset. As random forest training scales roughly linearly with the number of cores, I expect we will use about 1/8 of the time on a 32-core box. We also parallelize tree training in gradient boosting and AdaBoost; I expect we will use less time there too, though not as little as 1/8.
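
For concreteness, the projection under that linear-scaling assumption (4 to 32 cores is 8x):

// Back-of-the-envelope projection, assuming perfect linear scaling.
val rfOn4Cores  = 1436.0                  // seconds, random forest, 1M rows
val rfOn32Cores = rfOn4Cores / (32.0 / 4) // ≈ 180 seconds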

@haifengl
Author

haifengl commented Jan 7, 2016

BTW, we calculate AUC with our own implementation (https://github.com/haifengl/smile/blob/master/core/src/main/java/smile/validation/AUC.java), which is based on the Mann-Whitney U statistic. I am not sure if it is the same as yours. If you want, I can send you the prediction results and you can calculate it with your AUC method. Thanks!
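
For reference, a minimal sketch of the rank-based (Mann-Whitney) AUC formula: AUC = (Rpos - nPos*(nPos+1)/2) / (nPos * nNeg), where Rpos is the sum of the 1-based ranks of the positive examples' scores. This is an illustrative re-implementation, not the SMILE code linked above:

// Rank-based AUC: the Mann-Whitney U statistic normalized by nPos * nNeg.
// Illustrative sketch only, not the SMILE implementation; ties get average ranks.
def auc(labels: Array[Int], scores: Array[Double]): Double = {
  val n = labels.length
  val order = (0 until n).sortBy(i => scores(i))      // indices by ascending score
  val ranks = new Array[Double](n)
  var i = 0
  while (i < n) {                                     // assign average ranks to tied scores
    var j = i
    while (j + 1 < n && scores(order(j + 1)) == scores(order(i))) j += 1
    val avgRank = (i + j + 2) / 2.0                   // 1-based average rank of the tie group
    for (k <- i to j) ranks(order(k)) = avgRank
    i = j + 1
  }
  val nPos = labels.count(_ == 1).toDouble
  val nNeg = n - nPos
  val posRankSum = (0 until n).collect { case k if labels(k) == 1 => ranks(k) }.sum
  (posRankSum - nPos * (nPos + 1) / 2.0) / (nPos * nNeg)  // U / (nPos * nNeg)
}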
