count_distinct can be faster when both columns are numerical #69

CarstVaartjes · 2016-07-04T11:51:25Z

Because the groupby column is always numerical (integer), there is a good chance that the "distinct" column is numerical as well.
In that case we can replace the string concatenation method with a much faster numerical hash.
We should also check whether the groupby column at that point is the groupby index or the actual separate columns (it should be the former). See also Python's hashing method, and look at how floats are hashed.
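
A rough sketch of the idea in plain Python, not the actual bquery code: the function names and the NaN policy below are assumptions made for illustration only. It contrasts a string-concatenation key with a numeric tuple key, which Python hashes directly from the numbers without building a string per row.

```python
from collections import defaultdict

_NAN = object()  # hypothetical sentinel so that all NaN values share one key


def count_distinct_string_keys(groups, values):
    """Baseline: one concatenated string key per (group, value) row."""
    seen = set()
    counts = defaultdict(int)
    for g, v in zip(groups, values):
        key = str(g) + "|" + str(v)
        if key not in seen:
            seen.add(key)
            counts[g] += 1
    return dict(counts)


def count_distinct_numeric_keys(groups, values):
    """Numeric variant: hash the (group, value) pair as a tuple of numbers."""
    seen = set()
    counts = defaultdict(int)
    for g, v in zip(groups, values):
        # NaN never equals itself, so map it to one sentinel (assumed policy).
        if isinstance(v, float) and v != v:
            v = _NAN
        key = (g, v)
        if key not in seen:
            seen.add(key)
            counts[g] += 1
    return dict(counts)


groups = [1, 1, 1, 2, 2, 3]
values = [10, 10, 20, 30, 30, 30]
result = count_distinct_numeric_keys(groups, values)
print(result)  # {1: 2, 2: 1, 3: 1}
```

On the float question: Python guarantees `hash(1) == hash(1.0)` because `1 == 1.0`, so the numeric key treats those as one value while the string key (`"1"` vs `"1.0"`) does not. Whether that is the behaviour we want for mixed int/float columns would need to be decided.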
