catboost initial results #4
Is this the same speed benchmark as the front page for xgboost or lightgbm (r packages)? |
Yes, should be comparable to the results from the main README in this repo. |
@szilard actually we did tune the code to run much faster; we should now be faster than xgboost and on par with LightGBM. We are working on more speedups now. |
@szilard And we have also implemented GPU training. We compared on the Epsilon dataset, and it's 2 times faster than LightGBM and 20 times faster than XGBoost; it would be nice to add catboost to the GPU benchmarks. |
Thanks @annaveronika . Yeah, I talked to @sab (Sergey Brazhnik) at the NIPS conference in December, I should definitely run the benchmarks with the latest catboost version (and with the GPU version as well). |
The CPU version is still 10x slower than lightgbm:
@annaveronika @sab The catboost code I'm running is this: Anything wrong with that? |
If you are running the latest version built from the code on GitHub, then it is correct. But if I understand correctly you are running the benchmarks on the airline dataset - this is actually a dataset with 6 or 8 categorical features, so it's fair to use them as categorical. LightGBM does a specific optimization for this - it packs the binary features from one-hot encoding into one histogram, so it basically works with 8 features, not with all the binary ones. We'll do the same if you use the one-hot-max-size feature. On regular datasets that are not one-hot encoded, for example on Epsilon, this difference will be eliminated. But anyway, it's better even for the quality of catboost to use categorical features as-is, not one-hot encoded. |
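Not the benchmark's actual code, but a minimal sketch of what passing categorical features as-is could look like with the catboost R package (the data frame, feature names, and sizes here are made up for illustration):

```r
library(catboost)

# Hypothetical data: one categorical and one numeric feature
n <- 100000
d <- data.frame(carrier  = factor(sample(LETTERS[1:8], n, replace = TRUE)),
                dep_hour = runif(n, 0, 24))
y <- as.numeric(runif(n) < 0.2)

# Factor columns are treated as categorical features by the pool
pool <- catboost.load_pool(d, label = y)

# one_hot_max_size: categories with at most this many distinct values are
# one-hot encoded; the rest use catboost's categorical-feature statistics
params <- list(loss_function = 'Logloss', iterations = 100,
               one_hot_max_size = 10)
md <- catboost.train(pool, params = params)
```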
Adding
then
so it errors out, but speed is still |
Yes, I forgot about large one-hot-max-size values - we didn't add support for them because it is better for quality to use statistics for categorical features with many values. In the meantime I would suggest also trying some dataset with many features. |
We will also allow for one-hot-max-size > 255 |
@annaveronika I made a very simple test for
The results are:
I upgraded to the latest versions of both packages today:
Any idea what's up? Should the speeds be on par for this toy example? |
You need to set thread_count to the same number for both of them, for example to 16, so that they are parallelized with the same number of threads. Plus LightGBM builds different trees by default, so a fair comparison needs to take that into account: to build more or less the same trees you need to set num_leaves=64 in LightGBM. But there will still be a difference, and the reason for it is that we have expensive preprocessing of the data before starting the iterations, and it is proportional to how many different values the data has. If you generate the data at random, all the values will be different, so preprocessing will be long; for real data it is usually less. Plus in real scenarios, when you have a thousand iterations, this preprocessing doesn't play a role - and here it is more than half of the total time. We have other preprocessing schemes: you can set feature_border_type='Median' or 'GreedyLogSum', which are much faster. So your script with these parameters:
will give results:
This shows that catboost is a little faster and stable in speed, while LightGBM's timing varies across runs. |
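For reference, a sketch (on assumed synthetic data, not the thread's actual script) of the parameter alignment described above: the same thread_count for both libraries, num_leaves = 64 in LightGBM to build roughly the same trees as catboost's default depth-6 oblivious trees, and a faster border type for catboost:

```r
library(catboost)
library(lightgbm)

n <- 100000; p <- 10
X <- matrix(rnorm(n * p), n, p)
y <- as.numeric(X[, 1] + rnorm(n) > 0)

# catboost: 16 threads and a faster preprocessing scheme
cb_params <- list(loss_function = 'Logloss', iterations = 100,
                  thread_count = 16,
                  feature_border_type = 'Median')   # or 'GreedyLogSum'
cb_time <- system.time(
  catboost.train(catboost.load_pool(X, label = y), params = cb_params)
)["elapsed"]

# LightGBM: 16 threads and 64 leaves (~ catboost's default depth-6 trees)
lgb_params <- list(objective = "binary", num_threads = 16, num_leaves = 64)
lgb_time <- system.time(
  lgb.train(params = lgb_params, data = lgb.Dataset(X, label = y),
            nrounds = 100)
)["elapsed"]
```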
Actually LightGBM performs best with 1 thread per core, and it sets 32 threads by default, so the comparison needs to be changed - if we run it without hyperthreading, we get results like this:
|
And for large datasets we get the same speed (here again 16 threads, 64 leaves, no hyperthreading, and 100k docs):
|
Back to the airline data (1M), with
|
@szilard could you try upgrading the version? This bug should have been fixed today in version 0.6.1.1. One more thing about the airline data: it is a very special dataset since it has a small number of features. At this size the bottleneck of the algorithm changes - usually it is the selection of the tree structure, but if you have fewer than 10 features, the bottleneck becomes the calculation of the resulting leaf values. When calculating leaf values we do several gradient steps inside one tree, which makes this particular step longer; usually this is not visible. To compare implementation speed you can set leaf_estimation_iterations=1, but for quality purposes I would recommend the defaults. |
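A sketch of that setting (synthetic data assumed, not the thread's actual script): a single gradient step per tree to isolate the tree-structure search, as suggested above:

```r
library(catboost)

n <- 100000
X <- matrix(rnorm(n * 8), n, 8)   # few features, as in the airline data
y <- as.numeric(runif(n) < 0.2)
pool <- catboost.load_pool(X, label = y)

# leaf_estimation_iterations = 1: one gradient step per tree; faster on
# few-feature data, but the defaults are recommended for quality
params <- list(loss_function = 'Logloss', iterations = 100,
               thread_count = 16, leaf_estimation_iterations = 1)
md <- catboost.train(pool, params = params)
```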
OK, now
|
Just to keep in mind the other runs:
- without one-hot encoding: runs 65sec, AUC=0.7424685
- without one-hot encoding but with
I'm gonna try the GPU version as well. As far as I see the R package does not have GPU support, right? (In that case I guess I'll have to use the python API.) |
Summary: airline dataset 1M records
|
One more: |
@annaveronika Thank you for your answers - they are very helpful. There is a lot of in-depth knowledge in them. May I suggest a blog post or doc or example titled something like "How to make catboost as fast as possible" wherein you cover what you did in this thread? I understand there will likely be a hit in performance, but some practitioners may be OK with that. |
on GPU: p3.2xlarge, 1 Tesla V100 GPU - it trains in 5sec, but then the accuracy is also pretty bad: AUC=0.68341 |
It's again a bug (both the accuracy and the wait after the training); the code of the fix is already on GitHub, but not yet on PyPI. It will be there in about two days, together with a beta version of multi-machine training on GPU. You could try building from source using the instructions here https://tech.yandex.com/catboost/doc/dg/concepts/python-installation-docpage/ or wait for the fix on PyPI. One more thing about the speed: the current version has feature parallelization, which will not give optimal speedups for 8 features. The document-parallel version will come soon. |
Thanks for the update, I'll wait 2 days and try again. |
Results now (0.6.2):
|
The GPU/CPU usage by gpustat/mpstat:
|
So while training on the GPU is only 5sec, total training time is 20sec. There is about 5sec before GPU training and 10sec after GPU training when some computations happen on the CPU. I wonder what that is - can you elaborate on it (and maybe it can be cut/optimized)? |
The preprocessing contains data binarization, calculation of part of the statistics for categorical features, and loading everything onto the GPU. The postprocessing contains calculation of all the selected statistics on categorical features and loading everything back onto the CPU. These parts will be sped up, but for training runs longer than 100 iterations on a V100 they will not be the bottleneck anyway. In real life you don't train for just 100 iterations, so I don't think we should specifically optimize for that. Also, could you check that you are running 16 threads? |
@sergeyf We are planning to provide this guide. Here is the issue: catboost/catboost#253 |
Thank you!
|
Thanks @annaveronika for the explanation of the process before and after the GPU computation. I added
Not sure what you mean by "In real life you don't train for 100 iterations". For many datasets in practice you might overfit after a few hundred iterations, so one should do early stopping. Also, I would assume that 200 iterations takes more or less 2x the time of 100 iterations (at least with the other tools on CPU), so it shouldn't really matter that we benchmark 100 iterations (as long as we do the same across the board). |
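For what it's worth, early stopping of the kind mentioned above could look like this with the lightgbm R API (a sketch on assumed synthetic data):

```r
library(lightgbm)

n <- 100000; p <- 10
X <- matrix(rnorm(n * p), n, p)
y <- as.numeric(X[, 1] + rnorm(n) > 0)
itr <- sample(n, 0.8 * n)

dtrain <- lgb.Dataset(X[itr, ], label = y[itr])
dvalid <- lgb.Dataset.create.valid(dtrain, X[-itr, ], label = y[-itr])

# Train up to 1000 rounds, but stop once validation AUC stops improving
md <- lgb.train(params = list(objective = "binary", metric = "auc"),
                data = dtrain, nrounds = 1000,
                valids = list(valid = dvalid),
                early_stopping_rounds = 10)
```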
catboost GPU 200 iterations: GPU time 11sec (roughly 2x), wall time 32sec (so the CPU part increased only from 15sec to 21sec, not 2x). |
In comparison, for xgboost there is no pre- and post- GPU computation lag:
Similarly for lightgbm GPU, though the lightgbm GPU implementation seems less efficient (it's only using ~5% GPU and it's slower). |
I mean that 100 iterations is almost never optimal. There is a tradeoff between the learning rate and the number of iterations: the lower the learning rate, the more iterations you need, and the better the quality will be, until at some learning rate it converges to the best quality. To make sure you get good quality we set the learning rate to 0.03 by default, which in many cases is good enough. And for large datasets you need more iterations than for small ones given a fixed learning rate - for large datasets you usually need thousands of iterations for the best quality. And you need the GPU most of all for large datasets, because on them training is really slow on CPU. |
I actually agree with what you are saying above. The setup
The goal of my little benchmark is to compare speed (and also to see that accuracy is not something really bad, which would be a sign of a bug, e.g. the one it helped you find a few days ago). With the CPU versions the speed has been pretty much linear in the number of iterations (and dataset size); deeper trees are slower, etc. So I could do
However, I did all the above with the mindset of 2-3 years ago, when there were no GPU versions (that's when I started the other "main" benchm-ml GitHub repo). I see now that some of those premises about runtime scaling are not true on the GPU (e.g. vs dataset size or number of trees), so I might experiment a bit with changing the params in the next few days (for all tools). |
Btw are you guys planning on having the GPU version available from R any time soon as well? |
We definitely will do this, here is the issue: catboost/catboost#255 |
JFYI: in the attachment is a plot with CatBoost GPU vs CPU (dual-socket Intel Xeon E2560v2) speed comparisons for different sample counts on Tesla K40, GTX1080Ti and V100 (the plot was built from samples of our internal dataset with approximately 700 numerical features; the K40 is ≈6 times faster than CPU, but it's not easy to see on the plot because of the V100, which is ≈45 times faster). The benchmark was run with the -x32 option; for the default -x128 the results are slightly worse. |
It's not true. For histogram-based decision tree algorithms, learning the full ensemble is in general not a linear function of the number of trees. CatBoost (as well as LightGBM) uses at most half of the data to compute the necessary statistics for splits after the first leaf is built. For some datasets the splits in later trees are highly imbalanced, and in that case those trees are learned faster than the first, balanced ones.
Such small benchmarks would be almost correct (in terms of speed) for oblivious trees (because they are symmetric, which gives more balanced trees), but could be very misleading for GBMs with leaf-wise trees (like LightGBM). I have seen several examples where LightGBM learns small and simple trees in the first iterations and starts to build very deep imbalanced trees afterwards. |
@Noxoomo For CPU lightgbm the training time and AUC vs number of trees (with the data and code from this repo):
so runtime is not dramatically far from linear in the number of trees (the last trees are not even 2x faster than the first few). The code added to the code in this repo:
|
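A sketch of how such a measurement could look (assumed synthetic data and the ROCR package for AUC; not necessarily the exact code added to this repo): train lightgbm with an increasing number of trees and record wall time and AUC at each point:

```r
library(lightgbm)
library(ROCR)

n <- 100000; p <- 10
X <- matrix(rnorm(n * p), n, p)
y <- as.numeric(X[, 1] + rnorm(n) > 0)
itr <- sample(n, 0.8 * n)
dtrain <- lgb.Dataset(X[itr, ], label = y[itr])

for (ntrees in c(10, 100, 300, 1000)) {
  t <- system.time(
    md <- lgb.train(params = list(objective = "binary", num_threads = 16),
                    data = dtrain, nrounds = ntrees)
  )["elapsed"]
  phat <- predict(md, X[-itr, ])
  auc <- performance(prediction(phat, y[-itr]), "auc")@y.values[[1]]
  cat(ntrees, "trees:", round(t, 1), "sec  AUC:", round(auc, 4), "\n")
}
```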
CPU xgboost with
|
GPU xgboost
|
New boosting lib from yandex:
https://github.com/catboost/catboost