catboost initial results #4
Is this the same speed benchmark as the front page for xgboost or lightgbm (r packages)? |
Yes, should be comparable to the results from the main README in this repo. |
@szilard actually we did tune the code to run much faster; we should now be faster than xgboost and on par with LightGBM. We are working on more speedups now. |
@szilard And we have also implemented GPU training. We compared on the Epsilon dataset, and it's 2 times faster than LightGBM and 20 times faster than XGBoost; it would be nice to add catboost to the GPU benchmarks. |
Thanks @annaveronika . Yeah, I talked to @sab (Sergey Brazhnik) at the NIPS conference in December, I should definitely run the benchmarks with the latest catboost version (and with the GPU version as well). |
The CPU version is still 10x slower than lightgbm:
@annaveronika @sab The catboost code I'm running is this: Anything wrong with that? |
If you are running the latest version built from the code on GitHub, then it is correct. But if I understand correctly you are running the benchmarks on the airline dataset - this is actually a dataset with 6 or 8 categorical features, so it's fair to use them as categorical. LightGBM does a specific optimization for this - it packs the binary features from one-hot encoding into one histogram, so it basically works with 8 features, not with all the binary ones. We'll do the same if you use the one-hot-max-size feature. On regular datasets that are not one-hot encoded, for example on Epsilon, this difference will be eliminated. But anyway, it's better even for the quality of catboost to use categorical features as-is, not one-hot encoded. |
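Not the benchmark's actual code, but a minimal sketch of what passing categorical features as-is could look like with the catboost R package (the data frame, feature names, and sizes here are made up for illustration):

```r
library(catboost)

# Hypothetical data: one categorical and one numeric feature
n <- 100000
d <- data.frame(carrier  = factor(sample(LETTERS[1:8], n, replace = TRUE)),
                dep_hour = runif(n, 0, 24))
y <- as.numeric(runif(n) < 0.2)

# Factor columns are treated as categorical features by the pool
pool <- catboost.load_pool(d, label = y)

# one_hot_max_size: categories with at most this many distinct values are
# one-hot encoded; the rest use catboost's categorical-feature statistics
params <- list(loss_function = 'Logloss', iterations = 100,
               one_hot_max_size = 10)
md <- catboost.train(pool, params = params)
```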
Adding
then
so it errors out, but speed is still |
Yes, I forgot about large one-hot-max-size values - we didn't add support for them because it is better for quality to use statistics for categorical features with many values. In the meantime I would suggest also trying some dataset with many features. |
We will also allow for one-hot-max-size > 255 |
@annaveronika I made a very simple test for
The results are:
I upgraded to the latest versions of both packages today:
Any idea what's up? Should the speeds be on par for this toy example? |
You need to set thread_count to the same number for both of them, for example to 16, so that they are parallelized with the same number of threads. Plus LightGBM builds different trees by default, so a fair comparison needs to take that into account: to build more or less the same trees you need to set num_leaves=64 in LightGBM. But there will still be a difference, and the reason for it is that we have expensive preprocessing of the data before starting the iterations, and it is proportional to how many different values the data has. If you generate the data at random, all the values will be different, so preprocessing will be long; for real data it is usually less. Plus in real scenarios, when you have a thousand iterations, this preprocessing doesn't play a role - and here it is more than half of the total time. We have other preprocessing schemes: you can set feature_border_type='Median' or 'GreedyLogSum', which are much faster. So your script with these parameters:
will give results:
This shows that catboost is a little faster and stable in speed, while LightGBM's timing varies across runs. |
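For reference, a sketch (on assumed synthetic data, not the thread's actual script) of the parameter alignment described above: the same thread_count for both libraries, num_leaves = 64 in LightGBM to build roughly the same trees as catboost's default depth-6 oblivious trees, and a faster border type for catboost:

```r
library(catboost)
library(lightgbm)

n <- 100000; p <- 10
X <- matrix(rnorm(n * p), n, p)
y <- as.numeric(X[, 1] + rnorm(n) > 0)

# catboost: 16 threads and a faster preprocessing scheme
cb_params <- list(loss_function = 'Logloss', iterations = 100,
                  thread_count = 16,
                  feature_border_type = 'Median')   # or 'GreedyLogSum'
cb_time <- system.time(
  catboost.train(catboost.load_pool(X, label = y), params = cb_params)
)["elapsed"]

# LightGBM: 16 threads and 64 leaves (~ catboost's default depth-6 trees)
lgb_params <- list(objective = "binary", num_threads = 16, num_leaves = 64)
lgb_time <- system.time(
  lgb.train(params = lgb_params, data = lgb.Dataset(X, label = y),
            nrounds = 100)
)["elapsed"]
```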
Actually LightGBM performs best with 1 thread per core, and it sets 32 threads by default, so the comparison needs to be changed - if we run it without hyperthreading, we get results like this:
|
And for large datasets we get the same speed (here again 16 threads, 64 leaves, no hyperthreading, and 100k docs):
|
Back to the airline data (1M), with
|
@szilard could you try upgrading the version? This bug should have been fixed today in version 0.6.1.1. One more thing about the airline data: it is a very special dataset since it has a small number of features. At this size the bottleneck of the algorithm changes - usually it is the selection of the tree structure, but if you have fewer than 10 features, the bottleneck becomes the calculation of the resulting leaf values. When calculating leaf values we do several gradient steps inside one tree, which makes this particular step longer; usually this is not visible. To compare implementation speed you can set leaf_estimation_iterations=1, but for quality purposes I would recommend the defaults. |
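A sketch of that setting (synthetic data assumed, not the thread's actual script): a single gradient step per tree to isolate the tree-structure search, as suggested above:

```r
library(catboost)

n <- 100000
X <- matrix(rnorm(n * 8), n, 8)   # few features, as in the airline data
y <- as.numeric(runif(n) < 0.2)
pool <- catboost.load_pool(X, label = y)

# leaf_estimation_iterations = 1: one gradient step per tree; faster on
# few-feature data, but the defaults are recommended for quality
params <- list(loss_function = 'Logloss', iterations = 100,
               thread_count = 16, leaf_estimation_iterations = 1)
md <- catboost.train(pool, params = params)
```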
OK, now
|
Just to keep in mind the other runs:
- without one-hot encoding: runs 65sec, AUC=0.7424685
- without one-hot encoding but with
I'm gonna try the GPU version as well. As far as I see the R package does not have GPU support, right? (In that case I guess I'll have to use the python API.) |
Summary: airline dataset 1M records
|
One more: |
@annaveronika Thank you for your answers - they are very helpful. There is a lot of in-depth knowledge in them. May I suggest a blog post or doc or example titled something like "How to make catboost as fast as possible" wherein you cover what you did in this thread? I understand there will likely be a hit in performance, but some practitioners may be OK with that. |
on GPU: p3.2xlarge, 1 Tesla V100 GPU - it trains in 5sec, but then the accuracy is also pretty bad: AUC=0.68341 |
It's again a bug (both the accuracy and the wait after the training); the code of the fix is already on GitHub, but not yet on PyPI. It will be there in about two days, together with a beta version of multi-machine training on GPU. You could try building from source using the instructions here https://tech.yandex.com/catboost/doc/dg/concepts/python-installation-docpage/ or wait for the fix on PyPI. One more thing about the speed: the current version has feature parallelization, which will not give optimal speedups for 8 features. The document-parallel version will come soon. |
Thanks for the update, I'll wait 2 days and try again. |
Results now (0.6.2):
|
The GPU/CPU usage by gpustat/mpstat:
|
So while training on the GPU is only 5sec, total training time is 20sec. There is about 5sec before GPU training and 10sec after GPU training when some computations happen on the CPU. I wonder what that is - can you elaborate on it (and maybe it can be cut/optimized)? |
The preprocessing contains data binarization, calculation of part of the statistics for categorical features, and loading everything onto the GPU. The postprocessing contains calculation of all the selected statistics on categorical features and loading everything back onto the CPU. These parts will be sped up, but for training runs longer than 100 iterations on a V100 they will not be the bottleneck anyway. In real life you don't train for just 100 iterations, so I don't think we should specifically optimize for that. Also, could you check that you are running 16 threads? |
@sergeyf We are planning to provide this guide. Here is the issue: catboost/catboost#253 |
Thank you!
|
Thanks @annaveronika for the explanation of the process before and after the GPU computation. I added
Not sure what you mean by "In real life you don't train for 100 iterations". For many datasets in practice you might overfit after a few hundred iterations, so one should do early stopping. Also, I would assume that 200 iterations takes more or less 2x the time of 100 iterations (at least with the other tools on CPU), so it shouldn't really matter that we benchmark 100 iterations (as long as we do the same across the board). |
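For what it's worth, early stopping of the kind mentioned above could look like this with the lightgbm R API (a sketch on assumed synthetic data):

```r
library(lightgbm)

n <- 100000; p <- 10
X <- matrix(rnorm(n * p), n, p)
y <- as.numeric(X[, 1] + rnorm(n) > 0)
itr <- sample(n, 0.8 * n)

dtrain <- lgb.Dataset(X[itr, ], label = y[itr])
dvalid <- lgb.Dataset.create.valid(dtrain, X[-itr, ], label = y[-itr])

# Train up to 1000 rounds, but stop once validation AUC stops improving
md <- lgb.train(params = list(objective = "binary", metric = "auc"),
                data = dtrain, nrounds = 1000,
                valids = list(valid = dvalid),
                early_stopping_rounds = 10)
```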
catboost GPU 200 iterations: GPU time 11sec (roughly 2x), wall time 32sec (so the CPU part increased only from 15sec to 21sec, not 2x). |
In comparison, for xgboost there is no pre- and post- GPU computation lag:
Similarly for lightgbm GPU, though the lightgbm GPU implementation seems less efficient (it's only using ~5% GPU and it's slower). |
I mean that 100 iterations is almost never optimal. There is a tradeoff between the learning rate and the number of iterations: the lower the learning rate, the more iterations you need, and the better the quality will be, until at some learning rate it converges to the best quality. To make sure you get good quality we set the learning rate to 0.03 by default, which in many cases is good enough. And for large datasets you need more iterations than for small ones given a fixed learning rate - for large datasets you usually need thousands of iterations for the best quality. And you need the GPU most of all for large datasets, because on them training is really slow on CPU. |
I actually agree with what you are saying above. The setup
The goal of my little benchmark is to compare speed (and also to see that accuracy is not something really bad, which would be a sign of a bug, e.g. the one it helped you find a few days ago). With the CPU versions the speed has been pretty much linear in the number of iterations (and dataset size); deeper trees are slower, etc. So I could do
However, I did all the above with the mindset of 2-3 years ago, when there were no GPU versions (that's when I started the other "main" benchm-ml GitHub repo). I see now that some of those premises about runtime scaling are not true on the GPU (e.g. vs dataset size or number of trees), so I might experiment a bit with changing the params in the next few days (for all tools). |
Btw are you guys planning on having the GPU version available from R any time soon as well? |
We definitely will do this, here is the issue: catboost/catboost#255 |
JFYI: in the attachment is a plot with CatBoost GPU vs CPU (dual-socket Intel Xeon E2560v2) speed comparisons for different sample counts on Tesla K40, GTX1080Ti and V100 (the plot was built from samples of our internal dataset with approximately 700 numerical features; the K40 is ≈6 times faster than CPU, but it's not easy to see on the plot because of the V100, which is ≈45 times faster). The benchmark was run with the -x32 option; for the default -x128 the results are slightly worse. |
It's not true. For histogram-based decision tree algorithms, learning the full ensemble is in general not a linear function of the number of trees. CatBoost (as well as LightGBM) uses at most half of the data to compute the necessary statistics for splits after the first leaf is built. For some datasets the splits in later trees are highly imbalanced, and in that case those trees are learned faster than the first, balanced ones.
Such small benchmarks would be almost correct (in terms of speed) for oblivious trees (because they are symmetric, which gives more balanced trees), but could be very misleading for GBMs with leaf-wise trees (like LightGBM). I have seen several examples where LightGBM learns small and simple trees in the first iterations and starts to build very deep imbalanced trees afterwards. |
@Noxoomo For CPU lightgbm the training time and AUC vs number of trees (with the data and code from this repo):
so runtime is not dramatically far from linear in the number of trees (the last trees are not even 2x faster than the first few). The code added to the code in this repo:
|
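A sketch of how such a measurement could look (assumed synthetic data and the ROCR package for AUC; not necessarily the exact code added to this repo): train lightgbm with an increasing number of trees and record wall time and AUC at each point:

```r
library(lightgbm)
library(ROCR)

n <- 100000; p <- 10
X <- matrix(rnorm(n * p), n, p)
y <- as.numeric(X[, 1] + rnorm(n) > 0)
itr <- sample(n, 0.8 * n)
dtrain <- lgb.Dataset(X[itr, ], label = y[itr])

for (ntrees in c(10, 100, 300, 1000)) {
  t <- system.time(
    md <- lgb.train(params = list(objective = "binary", num_threads = 16),
                    data = dtrain, nrounds = ntrees)
  )["elapsed"]
  phat <- predict(md, X[-itr, ])
  auc <- performance(prediction(phat, y[-itr]), "auc")@y.values[[1]]
  cat(ntrees, "trees:", round(t, 1), "sec  AUC:", round(auc, 4), "\n")
}
```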
CPU xgboost with
|
GPU xgboost
|
New boosting lib from yandex:
https://github.com/catboost/catboost