aGTBoost #35
Training: xgboost dense 0.17 sec/tree, agtboost 11.1 sec/tree (~60x slower). Scoring (100K records): xgboost sparse 0.13 sec, xgboost dense 1.26 sec, agtboost 21.3 sec. AUC: untuned xgboost 0.7324224, agtboost 0.7247399
From the project's README:
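For reference, a minimal sketch of how such a per-tree timing comparison can be set up in R. The data objects `x_tr`/`y_tr` are placeholders and the hyperparameters are illustrative, not the benchmark's exact settings; the `loss_function` value for `gbt.train` is assumed from the agtboost documentation:

```r
# Sketch: compare training time per tree for xgboost vs agtboost.
# x_tr is assumed to be a numeric matrix, y_tr a 0/1 response vector.
library(xgboost)
library(agtboost)

ntrees <- 100
t_xgb <- system.time({
  dtrain <- xgb.DMatrix(data = x_tr, label = y_tr)
  xgb.train(data = dtrain, nrounds = ntrees, eta = 0.1,
            max_depth = 10, objective = "binary:logistic")
})
t_agt <- system.time({
  # agtboost chooses its own number of trees via its information criterion
  gbt.train(y_tr, x_tr, learning_rate = 0.1, loss_function = "logloss")
})
t_xgb[["elapsed"]] / ntrees  # xgboost seconds per tree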
@szilard first of all thank you for testing aGTBoost! The short reply to this is that aGTBoost is not yet optimized for speed. The focus of aGTBoost has been to develop an information criterion for gradient tree boosting that should not incur too much overhead, so as to remove the (computational and code) burden of cross-validation for tuning hyperparameters.
As of now, however, aGTBoost should in principle be equivalent to dense deterministic xgboost. In code, any failure to reach the speed of the below version of xgboost is due to my own shortcomings:
For non-large datasets, the computation times should be fairly similar from what I have measured, with the advantage that aGTBoost does not require any tuning. But this is still a research project. I am very happy that you have tested aGTBoost, and would greatly appreciate it if this issue could be kept open, so I get some more time to implement the above-mentioned functionality (or for others to see this and provide assistance), and then test again. Thank you!
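For contrast, the cross-validation loop that aGTBoost's information criterion is meant to replace looks roughly like this in xgboost (a sketch; the fold count and early-stopping window are arbitrary, and `dxgb_train` is assumed to be an existing `xgb.DMatrix`):

```r
library(xgboost)
# 5-fold CV with early stopping to pick the number of trees
cv <- xgb.cv(data = dxgb_train, nrounds = 1000, nfold = 5,
             eta = 0.01, max_depth = 2,
             early_stopping_rounds = 50, verbose = 0)
cv$best_iteration  # tree count chosen by CV; aGTBoost selects this
                   # automatically via its information criterion
```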
Great work @Blunde1 and thanks for the explanations. Absolutely, I'm happy to keep this issue open so we can post findings and updated results here.
Thanks for clarifying that aGTBoost is so far similar to xgboost's exact method. Therefore here is a quick run of xgb exact for comparison:
Btw the server I'm testing on is a modest AWS instance with 8 cores, and I'm using a relatively old version of xgb (0.90.0.2).
So using histogram instead of exact will speed up training (and even more so for larger datasets). Not sure if aGTBoost can be amended to have something similar. And of course sparsity and parallelization would speed things up as well; see my comments on the already opened issues Blunde1/agtboost#18 and Blunde1/agtboost#24
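As a sketch of that comparison (parameters illustrative; `dxgb_train` is assumed to be an existing `xgb.DMatrix`):

```r
library(xgboost)
# exact: enumerates every candidate split point per feature
t_exact <- system.time(
  xgb.train(data = dxgb_train, nrounds = 100, eta = 0.1,
            max_depth = 10, tree_method = "exact")
)
# hist: buckets feature values into max_bin histogram bins first,
# which makes split finding much cheaper on large datasets
t_hist <- system.time(
  xgb.train(data = dxgb_train, nrounds = 100, eta = 0.1,
            max_depth = 10, tree_method = "hist", max_bin = 255)
)
```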
Looking at the complexity of trees: aGTBoost:
xgboost:
Looks like aGTBoost trees are very shallow, yet they achieve relatively good accuracy: AUC 0.724 vs 0.732 for xgboost. Any comments on this @Blunde1? (UPDATE: also see the lightgbm results below; AUC is not bad even for shallow trees) Full code here:
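One way to inspect tree complexity for an xgboost model is `xgb.model.dt.tree` from the xgboost R package (a sketch; `md` is assumed to be a trained booster):

```r
library(xgboost)
# Dump the model's trees into a tabular form
tree_dt <- as.data.frame(xgb.model.dt.tree(model = md))
# Leaf rows are marked with Feature == "Leaf"; count leaves per tree
leaves_per_tree <- table(tree_dt$Tree[tree_dt$Feature == "Leaf"])
summary(as.integer(leaves_per_tree))
```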
For lightgbm:
or as usually used without
Restricting
so AUC is similar to aGTBoost's when
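A sketch of such a restricted lightgbm run (values illustrative; `x_tr`/`y_tr` are placeholder data):

```r
library(lightgbm)
# Build the dataset and train shallow, stump-like trees,
# roughly comparable to aGTBoost's shallow trees above
dtrain <- lgb.Dataset(data = x_tr, label = y_tr)
md_lgb <- lgb.train(params = list(objective = "binary",
                                  learning_rate = 0.1,
                                  num_leaves = 2),
                    data = dtrain,
                    nrounds = 100)
```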
@szilard I will try to comment on the above results in the context of aGTBoost. First of all, it is worth mentioning that there are two flavours of aGTBoost, which differ in how they employ the criterion during tree building. However, both of these methods employ the mentioned information criterion. In particular, the implementation of the information criterion contains an independence assumption, which will in certain cases (when the design matrix is large and features are highly correlated) result in models that could benefit a little from slightly higher complexity. Certain things on the research side are likely to mitigate this: optimal stochastic sampling (which will allow higher complexity due to the averaging, or "bootstrap integration", of randomness), optimal L2 regularization, and obviously working on implementing the dependence effects directly. Again, an example in code.
Of course, most real data has lots of dependence, which implies that an agtboost model might need a little more complexity for optimal predictions (but not much). On the positive side, it is unlikely to overfit, and the underfit will be very small. It would be very interesting to see what an optimally tuned xgboost model would do in the above example. Something I forgot to mention in the previous post, but which is relevant:
Papers on the above points will of course also follow...
Thank you @Blunde1 for the insights into aGTBoost. I tried the code and the training time is:
It's good to have your understanding of the algo, thanks again for the input.
Thanks also for the simulation example above. I was looking at that and shared it with @Laurae2 ("It would be very interesting to see what an optimally tuned xgboost model would do in the above example.") and he'll answer here himself.
I understand your priorities for the next updates, and it makes perfect sense in the academic setting you are in (obviously you want to do research and finish the PhD etc.). However, if the goal were (say, later) for aGTBoost to become widely used by practitioners, then it would need to be fast/efficient (enough) to run competitively with xgboost/lightgbm on datasets of say 100K-1M records (apart from some niche use cases with small data, say ~1000 records, e.g. in the medical field or so). I'm not saying this must be the goal; it is perfectly OK to keep it a research project, heck, it's already a very good one.
Dumb "tuning" abusing the knowledge that only 1 feature is relevant allows us to choose a more proper maximum depth of 2 instead of an absurd 10, which leads us out of the box to a better result than agtboost.
Timings using a server with dual Xeon 6154 (fixed 3.7 GHz, 36 physical cores) and 768 GB RAM (2666 MHz). xgboost has no validation set and is untuned, and thus may perform better. Note that the example has no random seed set for reproducibility; I've included the example using seed 1 below.
# Load
library(agtboost)
library(xgboost)
# Dimensions
n <- 1000
m <- 1000
set.seed(1)
# Generate data
x_tr <- cbind(runif(n, -5, 5), matrix(rnorm(n * (m - 1)), nrow = n, ncol = m - 1))
x_te <- cbind(runif(n, -5, 5), matrix(rnorm(n * (m - 1)), nrow = n, ncol = m - 1))
y_tr <- rnorm(n, x_tr[,1]^2, 2) # Only the first feature is significant, the other 999 are noise
y_te <- rnorm(n, x_te[,1]^2, 2)
# Training agtboost normal
# it: 430 | n-leaves: 2 | tr loss: 4.064 | gen loss: 7.901
# user system elapsed
# 102.843 0.000 102.818
system.time({mod <- gbt.train(y_tr, x_tr, learning_rate = 0.01, verbose = 10)})
# Predict
pred <- predict(mod, x_te)
mean((y_te - pred)^2)
# [1] 4.398976
# Training agtboost vanilla
# it: 380 | n-leaves: 2 | tr loss: 4.045 | gen loss: 8.565
# user system elapsed
# 130.827 0.000 130.814
system.time({mod_van <- gbt.train(y_tr, x_tr, learning_rate = 0.01, verbose = 10, algorithm = "vanilla")})
# Predict
pred_van <- predict(mod_van, x_te)
mean((y_te - pred_van)^2)
# [1] 4.375327
# xgb with no validation set
# user system elapsed
# 230.341 0.000 3.205
dxgb_train <- xgb.DMatrix(data = x_tr, label = y_tr)
system.time({md <- xgb.train(data = dxgb_train,
nrounds = 1000, max_depth = 2, eta = 0.01,
verbose = 1, print_every_n = 10,
tree_method = "exact", nthread = 72)})
# xgb pred
pred_xgb <- predict(md, x_te)
mean((y_te - pred_xgb)^2)
# [1] 4.348369
New implementation: aGTBoost https://github.com/Blunde1/agtboost