
[BUG]: inst_dealloc() Function Execution Is Slow #507

Closed
hxb-galaxy opened this issue Apr 6, 2024 · 3 comments


@hxb-galaxy

Problem description

Hello, I would like to report a performance issue.

The function static void inst_dealloc(PyObject *self) in the nb_type.cpp file has a performance problem.

In some application scenarios, the inst_dealloc() function accounts for over 70% of the samples in the performance flame graph.

Within the inst_dealloc() function, the operations internals.keep_alive.erase(it) and inst_c2p.erase(it) take a considerable amount of time.

Both keep_alive and inst_c2p use robin_hash. The erase function of robin_hash has a worst-case complexity of O(n), and its cost depends on the load factor of the hash table (load factor = number of elements / capacity): the lower the load factor and the larger the capacity, the slower the erase operation becomes.

The reason is that after deleting an element, erase() returns an iterator to the next element, which it finds by scanning forward from the current position until a non-empty bucket is found or the end of the table is reached.
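
To illustrate why this scan can dominate, here is a minimal toy sketch (my own illustration, not the actual tsl::robin_map internals): an open-addressing bucket array where erasing by iterator must scan forward to the next occupied bucket in order to return it. In a large, sparse table the scan length approaches the capacity, which is exactly the low-load-factor case described above.

#include <cstddef>
#include <cstdio>
#include <optional>
#include <vector>

// Toy open-addressing table: each bucket is either empty or holds an int.
// erase_and_return_next() mimics an erase(iterator) overload that has to
// return an iterator to the next element: after clearing the bucket it scans
// forward until it finds a non-empty bucket or reaches the end of the table.
struct toy_table {
    std::vector<std::optional<int>> buckets;
    explicit toy_table(std::size_t capacity) : buckets(capacity) {}

    std::size_t erase_and_return_next(std::size_t pos) {
        buckets[pos].reset();
        // This forward scan is O(capacity) when the table is sparse.
        while (pos < buckets.size() && !buckets[pos].has_value())
            ++pos;
        return pos;  // index of the next element, or buckets.size()
    }
};

int main() {
    toy_table t(32768);        // large capacity, almost no elements
    t.buckets[10] = 42;
    std::size_t next = t.erase_and_return_next(10);
    // The scan walked ~32K empty buckets just to compute the return value.
    std::printf("scanned up to index %zu\n", next);
    return 0;
}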

Reproducible example code

No response

@wjakob
Owner

wjakob commented Apr 8, 2024

It is expected that removing an entry from a hash table takes a small amount of time, potentially more so in a hash table with robin hood hashing. The point you make about the load factor does not look correct to me since the hash table will automatically expand once the load factor becomes too high.

What I cannot tell from this description is whether you simply expected this function to be faster by a constant amount, or if you were able to create some kind of worst-case behavior that triggers a particularly slow performance case. Nanobind uses a pointer hash "finisher" from the MurmurHash project that should normally exclude this possibility. If you have found a legitimate issue, then what this report needs is a minimal reproducer. There is no way that I can help without one.

I will close this issue for now but will reopen it if you are able to produce such a reproducer.
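
For reference, the 64-bit finalizer ("fmix64") from MurmurHash3 looks like the sketch below. Whether nanobind applies exactly this variant to pointer keys is an assumption on my part, but a bit mixer of this kind is what spreads pointer values evenly across buckets and makes a deliberately constructed worst case unlikely.

#include <cstdint>

// MurmurHash3 64-bit finalizer (fmix64): mixes the bits of a pointer-sized
// key so that nearby addresses do not cluster in the same buckets.
static inline uint64_t fmix64(uint64_t k) {
    k ^= k >> 33;
    k *= 0xff51afd7ed558ccdULL;
    k ^= k >> 33;
    k *= 0xc4ceb9fe1a85ec53ULL;
    k ^= k >> 33;
    return k;
}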

@wjakob wjakob closed this as completed Apr 8, 2024
@hxb-galaxy
Author

@wjakob, I've provided reproducible example code for the issue below. Could you please review it and consider reopening the issue?

Reproducible example code

C++ file:

#include <nanobind/nanobind.h>
#include <nanobind/stl/shared_ptr.h>

#include <memory>
#include <chrono>
#include <time.h>

namespace nb = nanobind;
using namespace nb::literals;


struct test_struct {
    int a = 0;
};
typedef std::shared_ptr<test_struct> test_struct_ptr;

nb::handle fun_expend_hash_table, fun_clear_hash_table, fun_test1, fun_test2;

void set_test_fun(nb::handle h1, nb::handle h2, nb::handle h3, nb::handle h4) noexcept {
    fun_expend_hash_table = h1;
    fun_clear_hash_table = h2;
    fun_test1 = h3;
    fun_test2 = h4;
}
void test_hash_op()  { 
    const int LOOP_CNT_MAX = 10000;
    for(int i=0; i < LOOP_CNT_MAX; ++i){
        test_struct_ptr p(new test_struct() );
        fun_test1( p );
    }
}
void test_expend_hash_table()  { 
    const int LOOP_CNT_MAX = 0x8000 / 2 + 1;  // 16385 insertions, enough to expand the instance hash table to a 32K capacity
    for(int i=0; i < LOOP_CNT_MAX; ++i){
        test_struct_ptr p(new test_struct() );        
        fun_expend_hash_table( p );
    }
}
void test_run(int run_flame_test_s)  { 
    using Duration = std::chrono::duration<double, std::nano>;
    using time_point_double = std::chrono::time_point<std::chrono::high_resolution_clock, Duration>;

    //normal op
    time_point_double time1 = std::chrono::high_resolution_clock::now();
    test_hash_op();
    time_point_double time2 = std::chrono::high_resolution_clock::now();

    // expand the hash table so it becomes large
    test_expend_hash_table() ;
    fun_clear_hash_table();

    // this second run will be slowed down
    time_point_double time3 = std::chrono::high_resolution_clock::now();
    test_hash_op();
    time_point_double time4 = std::chrono::high_resolution_clock::now();

    double time_cost1 = (time2 - time1).count();
    double time_cost2 = (time4 - time3).count();
    printf("time_cost1=%f time_cost2=%f  mult=%f\n", time_cost1,time_cost2, time_cost2/time_cost1);

    // loop for run_flame_test_s seconds; use a profiling tool to profile this loop and generate a flame graph
    if(run_flame_test_s > 0){
        printf("begin for run_flame_test\n");
        time_t loop_time_len_s = (time_t)run_flame_test_s;
        time_t time_begin = time(nullptr);
        do{
            for(int i=0; i< 1000; ++i){
                test_struct_ptr p(new test_struct() );
                fun_test1(p);
                fun_test2(1);
            }
        }while( (time(nullptr) - time_begin) < loop_time_len_s );
    }

    printf("finish\n");
}

NB_MODULE(test_hash, m) {
    nb::class_<test_struct>(m, "test_struct")
        .def_ro("a", &test_struct::a);

    m.def("set_test_fun", set_test_fun, "set_test_fun");
    m.def("test_run", test_run, "test_run");
}

Python file: test_instdealloc.py

import test_hash as t
import sys

hold_obj = []

def expend_hash(p): # used in test_expend_hash_table
    hold_obj.append(p)

def clear_hash():
    hold_obj.clear()

def test_fun1(p): 
    pass

def test_fun2(n:int): 
    pass

run_flame_test_s = 0
if len(sys.argv) > 1:
    run_flame_test_s = int(sys.argv[1])

t.set_test_fun(expend_hash, clear_hash, test_fun1, test_fun2) 
t.test_run(run_flame_test_s) 

Run result:
Running the command "python3 test_instdealloc.py 0" multiple times, the output is as follows:
time_cost1=4197632.000000 time_cost2=1061792256.000000 mult=252.950296
time_cost1=3419904.000000 time_cost2=1072608768.000000 mult=313.637099
time_cost1=3160064.000000 time_cost2=1766464000.000000 mult=558.996273
time_cost1=4704512.000000 time_cost2=1334190080.000000 mult=283.597976
time_cost1=4020480.000000 time_cost2=1275046400.000000 mult=317.137854
time_cost1=3365120.000000 time_cost2=1523500288.000000 mult=452.732826
time_cost1=3752192.000000 time_cost2=1046175744.000000 mult=278.817220

Running the command "python3 test_instdealloc.py 120": after it prints "begin for run_flame_test", use a profiling tool to profile the process and generate a flame graph. The resulting graph is as follows:
[flame graph image: inst_dealloc_flame]

@wjakob wjakob reopened this Apr 10, 2024
@wjakob wjakob closed this as completed in ad03e8a Apr 17, 2024
@wjakob
Owner

wjakob commented Apr 17, 2024

Thank you for finding this issue -- it should now be fixed.

wjakob added a commit that referenced this issue Apr 19, 2024
See Tessil/robin-map#76 and
#507 for context.

(This is still a pre-release version of robin-map, I will update
 the referenced commit once a proper release has been made)
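
For context on the caller side, here is a sketch of the pattern that avoids the cost described above: erase the entry without asking for an iterator to its successor. This assumes a robin-map version that provides such a variant; the method name erase_fast used below is an assumption and should be checked against the actual robin-map release.

#include <tsl/robin_map.h>
#include <cstdint>
#include <cstdio>

int main() {
    tsl::robin_map<std::uintptr_t, int> c2p;
    c2p[0x1000] = 1;

    auto it = c2p.find(0x1000);
    if (it != c2p.end()) {
        // c2p.erase(it);    // returns an iterator to the next element and may
        //                   // scan forward over many empty buckets
        c2p.erase_fast(it);  // assumed variant: no successor iterator computed
    }
    std::printf("size = %zu\n", c2p.size());
    return 0;
}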