
Benchmarking #9

Closed
skaae opened this issue Sep 24, 2021 · 4 comments

Comments

@skaae

skaae commented Sep 24, 2021

I tried benchmarking lleaves vs. treelite and found that lleaves is slightly slower than treelite. Am I doing something wrong?

I benchmark with Google Benchmark, using batch size 1 and random features. I have ~600 trees with 450 leaves each and max depth 13. Treelite is compiled with clang 10.0; we did see that treelite was a lot slower when compiled with GCC.

I also noticed that the compile step for lleaves took several hours, so maybe the forest I'm using is somehow off?

In any case I think your library looks very nice :)

Xeon E-2278G

------------------------------------------------------
Benchmark            Time             CPU   Iterations
------------------------------------------------------
BM_LLEAVES       32488 ns        32487 ns        21564
BM_TREELITE      27251 ns        27250 ns        25635

EPYC 7402P

------------------------------------------------------
Benchmark            Time             CPU   Iterations
------------------------------------------------------
BM_LLEAVES       38020 ns        38019 ns        18308
BM_TREELITE      32155 ns        32154 ns        21579
#include <benchmark/benchmark.h>
#include "lleavesheader.h"
#include "treeliteheader.h"
#include <iostream>
#include <random>
#include <vector>

constexpr std::size_t NUM_FEATURES = 108;

static void BM_LLEAVES(benchmark::State& state)
{
    std::random_device dev;
    std::mt19937 rng(dev());
    // Note: uniform_int_distribution must be instantiated with a signed
    // type to produce values in [-10, 10]; an unsigned result_type with
    // negative bounds is undefined behavior.
    std::uniform_int_distribution<int> dist(-10, 10);

    std::size_t N = 10000000;
    std::vector<double> f;
    f.reserve(N);
    for (std::size_t i = 0; i < N; ++i) {
        f.push_back(dist(rng));
    }

    double out;
    std::size_t i = 0;
    for (auto _ : state) {
        forest_root(f.data() + NUM_FEATURES * i, &out, (int)0, (int)1);
        i = (i + 1) % (N / NUM_FEATURES);  // wrap around to stay in bounds
    }
}

static void BM_TREELITE(benchmark::State& state)
{
    std::random_device dev;
    std::mt19937 rng(dev());
    std::uniform_int_distribution<int> dist(-10, 10);  // feature values in [-10, 10]

    std::size_t N = 10000000;
    std::vector<DE::Entry> f;
    f.reserve(N);
    for (std::size_t i = 0; i < N; ++i) {
        auto e = DE::Entry();
        e.missing = -1;
        e.fvalue = dist(rng);
        e.qvalue = 0;
        f.push_back(e);
    }

    std::size_t i = 0;
    union DE::Entry* pFeatures = nullptr;
    for (auto _ : state) {
        pFeatures = f.data() + NUM_FEATURES * i;
        predict(pFeatures, 1);  // call the treelite predict function
        i = (i + 1) % (N / NUM_FEATURES);  // wrap around to stay in bounds
    }
}

BENCHMARK(BM_LLEAVES);
BENCHMARK(BM_TREELITE);
BENCHMARK_MAIN();
@siboehm
Owner

siboehm commented Sep 24, 2021

Thank you for reporting your results! My single-batch benchmarking setup (c_bench) is pretty poor (as you probably noticed), so I should really improve this. Could you send me your full benchmarking code / a PR for this?

Regarding the compile time: Several hours is rather curious. How long did treelite take? I don't think I've ever seen any model take longer than 30min, and those were larger 1000 tree models. Can you send me the model.txt (email is fine)? Else it'll be hard for me to debug.

Regarding the runtime: I'm not super surprised by this; the gap between treelite and lleaves for numerical models and single-batch prediction has always been small. I'm currently working on making lleaves more configurable by adding compiler flags: https://github.com/siboehm/lleaves/tree/compile_flags. You could install from that branch and try using the small codemodel and disabling cache blocking. Be aware that without cache blocking, compilation will take significantly longer (unless you also disable function inlining).

Regarding your benchmark: I'm not a fan of benchmarking using random data. LightGBM generates very imbalanced trees so out-of-distribution datasets are likely to hit only short paths in the tree which would distort results. If I were you I'd benchmark on the shuffled training dataset. Also I'd argue the data matrix generation should go inside the loop for treelite, as this is really extra overhead, but maybe I'm just sour ;)

@skaae
Author

skaae commented Sep 24, 2021

Thank you for reporting your results! My single-batch benchmarking setup (c_bench) is pretty poor (as you probably noticed), so I should really improve this. Could you send me your full benchmarking code / a PR for this?

Sure. Do you want the Google Benchmark code and the CMake changes? For lleaves I linked the .o file, and for treelite we exported the C code and compiled it with clang++.
Do you have something similar in mind?

Regarding your benchmark: I'm not a fan of benchmarking using random data.

True, I will try to rerun it with real data.

@siboehm
Owner

siboehm commented Sep 24, 2021

Yes, the CMake and full benchmark code would be sweet!

@siboehm
Owner

siboehm commented Nov 16, 2021

Closing, since I cannot debug this issue without further details. lleaves now has proper benchmarking implemented with Google Benchmark on the minibenchmarks branch.

Feel free to reopen with more info.

@siboehm siboehm closed this as completed Nov 16, 2021