
[BUG]: inst_dealloc() Function Execution Is Slow #507

Closed
hxb-galaxy opened this issue Apr 6, 2024 · 3 comments


@hxb-galaxy

Problem description

Hello, I would like to report a performance issue.

The function static void inst_dealloc(PyObject *self) in the nb_type.cpp file has a performance problem.

In some application scenarios, the inst_dealloc() function accounts for over 70% of the samples in the performance flame graph.

Within the inst_dealloc() function, the operations internals.keep_alive.erase(it) and inst_c2p.erase(it) take a considerable amount of time.

Both keep_alive and inst_c2p use robin_hash. The erase function of robin_hash has a worst-case complexity of O(n), and its cost depends on the load factor of the hash table (load factor = number of elements / capacity): the lower the load factor and the larger the capacity, the slower the erase operation becomes.

The reason is that after deleting an element, erase() returns an iterator to the next element, which it finds by scanning forward from the current position until a non-empty bucket is found or the end of the table is reached.
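
To illustrate why this scan can dominate, here is a minimal toy sketch (my own illustration, not the actual tsl::robin_map internals): an open-addressing bucket array where erasing by iterator must scan forward to the next occupied bucket in order to return it. In a large, sparse table the scan length approaches the capacity, which is exactly the low-load-factor case described above.

#include <cstddef>
#include <cstdio>
#include <optional>
#include <vector>

// Toy open-addressing table: each bucket is either empty or holds an int.
// erase_and_return_next() mimics an erase(iterator) overload that has to
// return an iterator to the next element: after clearing the bucket it scans
// forward until it finds a non-empty bucket or reaches the end of the table.
struct toy_table {
    std::vector<std::optional<int>> buckets;
    explicit toy_table(std::size_t capacity) : buckets(capacity) {}

    std::size_t erase_and_return_next(std::size_t pos) {
        buckets[pos].reset();
        // This forward scan is O(capacity) when the table is sparse.
        while (pos < buckets.size() && !buckets[pos].has_value())
            ++pos;
        return pos;  // index of the next element, or buckets.size()
    }
};

int main() {
    toy_table t(32768);        // large capacity, almost no elements
    t.buckets[10] = 42;
    std::size_t next = t.erase_and_return_next(10);
    // The scan walked ~32K empty buckets just to compute the return value.
    std::printf("scanned up to index %zu\n", next);
    return 0;
}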

Reproducible example code

No response

@wjakob
Owner

wjakob commented Apr 8, 2024

It is expected that removing an entry from a hash table takes a small amount of time, potentially more so in a hash table with robin hood hashing. The point you make about the load factor does not look correct to me since the hash table will automatically expand once the load factor becomes too high.

What I cannot tell from this description is whether you simply expected this function to be faster by a constant amount, or if you were able to create some kind of worst-case behavior that triggers a particularly slow performance case. Nanobind uses a pointer hash "finisher" from the MurmurHash project that should normally exclude this possibility. If you have found a legitimate issue, then what this report needs is a minimal reproducer. There is no way that I can help without one.

I will close this issue for now but will reopen it if you are able to produce such a reproducer.
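
For reference, the 64-bit finalizer ("fmix64") from MurmurHash3 looks like the sketch below. Whether nanobind applies exactly this variant to pointer keys is an assumption on my part, but a bit mixer of this kind is what spreads pointer values evenly across buckets and makes a deliberately constructed worst case unlikely.

#include <cstdint>

// MurmurHash3 64-bit finalizer (fmix64): mixes the bits of a pointer-sized
// key so that nearby addresses do not cluster in the same buckets.
static inline uint64_t fmix64(uint64_t k) {
    k ^= k >> 33;
    k *= 0xff51afd7ed558ccdULL;
    k ^= k >> 33;
    k *= 0xc4ceb9fe1a85ec53ULL;
    k ^= k >> 33;
    return k;
}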

@wjakob wjakob closed this as completed Apr 8, 2024
@hxb-galaxy
Author

@wjakob, I've provided reproducible example code for the issue below. Could you please review it and consider reopening the issue?

Reproducible example code

C++ file:

#include <nanobind/nanobind.h>
#include <nanobind/stl/shared_ptr.h>

#include <memory>
#include <chrono>
#include <time.h>

namespace nb = nanobind;
using namespace nb::literals;


struct test_struct {
    int a = 0;
};
typedef std::shared_ptr<test_struct> test_struct_ptr;

nb::handle fun_expend_hash_table, fun_clear_hash_table, fun_test1, fun_test2;

void set_test_fun(nb::handle h1, nb::handle h2, nb::handle h3, nb::handle h4) noexcept {
    fun_expend_hash_table = h1;
    fun_clear_hash_table = h2;
    fun_test1 = h3;
    fun_test2 = h4;
}
void test_hash_op()  { 
    const int LOOP_CNT_MAX = 10000;
    for(int i=0; i < LOOP_CNT_MAX; ++i){
        test_struct_ptr p(new test_struct() );
        fun_test1( p );
    }
}
void test_expend_hash_table()  { 
    const int LOOP_CNT_MAX = 0x8000 / 2 + 1;  // 16385 insertions, enough to expand the instance hash table to a 32K capacity
    for(int i=0; i < LOOP_CNT_MAX; ++i){
        test_struct_ptr p(new test_struct() );        
        fun_expend_hash_table( p );
    }
}
void test_run(int run_flame_test_s)  { 
    using Duration = std::chrono::duration<double, std::nano>;
    using time_point_double = std::chrono::time_point<std::chrono::high_resolution_clock, Duration>;

    //normal op
    time_point_double time1 = std::chrono::high_resolution_clock::now();
    test_hash_op();
    time_point_double time2 = std::chrono::high_resolution_clock::now();

    // expand the hash table so it becomes large
    test_expend_hash_table() ;
    fun_clear_hash_table();

    // this second run will be slowed down
    time_point_double time3 = std::chrono::high_resolution_clock::now();
    test_hash_op();
    time_point_double time4 = std::chrono::high_resolution_clock::now();

    double time_cost1 = (time2 - time1).count();
    double time_cost2 = (time4 - time3).count();
    printf("time_cost1=%f time_cost2=%f  mult=%f\n", time_cost1,time_cost2, time_cost2/time_cost1);

    // loop for run_flame_test_s seconds; use a profiling tool to profile this loop and generate a flame graph
    if(run_flame_test_s > 0){
        printf("begin for run_flame_test\n");
        time_t loop_time_len_s = (time_t)run_flame_test_s;
        time_t time_begin = time(nullptr);
        do{
            for(int i=0; i< 1000; ++i){
                test_struct_ptr p(new test_struct() );
                fun_test1(p);
                fun_test2(1);
            }
        }while( (time(nullptr) - time_begin) < loop_time_len_s );
    }

    printf("finish\n");
}

NB_MODULE(test_hash, m) {
    nb::class_<test_struct>(m, "test_struct")
        .def_ro("a", &test_struct::a);

    m.def("set_test_fun", set_test_fun, "set_test_fun");
    m.def("test_run", test_run, "test_run");
}

Python file: test_instdealloc.py

import test_hash as t
import sys

hold_obj = []

def expend_hash(p): # used in test_expend_hash_table
    hold_obj.append(p)

def clear_hash():
    hold_obj.clear()

def test_fun1(p): 
    pass

def test_fun2(n:int): 
    pass

run_flame_test_s = 0
if len(sys.argv) > 1:
    run_flame_test_s = int(sys.argv[1])

t.set_test_fun(expend_hash, clear_hash, test_fun1, test_fun2) 
t.test_run(run_flame_test_s) 

Run result:
Running the command "python3 test_instdealloc.py 0" multiple times, the output is as follows:
time_cost1=4197632.000000 time_cost2=1061792256.000000 mult=252.950296
time_cost1=3419904.000000 time_cost2=1072608768.000000 mult=313.637099
time_cost1=3160064.000000 time_cost2=1766464000.000000 mult=558.996273
time_cost1=4704512.000000 time_cost2=1334190080.000000 mult=283.597976
time_cost1=4020480.000000 time_cost2=1275046400.000000 mult=317.137854
time_cost1=3365120.000000 time_cost2=1523500288.000000 mult=452.732826
time_cost1=3752192.000000 time_cost2=1046175744.000000 mult=278.817220

Running the command "python3 test_instdealloc.py 120": after it prints "begin for run_flame_test", use a profiling tool to profile the process and generate a flame graph. The resulting graph is as follows:
[flame graph image: inst_dealloc_flame]

@wjakob wjakob reopened this Apr 10, 2024
@wjakob wjakob closed this as completed in ad03e8a Apr 17, 2024
@wjakob
Owner

wjakob commented Apr 17, 2024

Thank you for finding this issue -- it should now be fixed.

wjakob added a commit that referenced this issue Apr 19, 2024
See Tessil/robin-map#76 and
#507 for context.

(This is still a pre-release version of robin-map, I will update
 the referenced commit once a proper release has been made)
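
For context on the caller side, here is a sketch of the pattern that avoids the cost described above: erase the entry without asking for an iterator to its successor. This assumes a robin-map version that provides such a variant; the method name erase_fast used below is an assumption and should be checked against the actual robin-map release.

#include <tsl/robin_map.h>
#include <cstdint>
#include <cstdio>

int main() {
    tsl::robin_map<std::uintptr_t, int> c2p;
    c2p[0x1000] = 1;

    auto it = c2p.find(0x1000);
    if (it != c2p.end()) {
        // c2p.erase(it);    // returns an iterator to the next element and may
        //                   // scan forward over many empty buckets
        c2p.erase_fast(it);  // assumed variant: no successor iterator computed
    }
    std::printf("size = %zu\n", c2p.size());
    return 0;
}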