
Optimise HashingEncoder for both large and small dataframes #428

Merged: 10 commits, Nov 11, 2023

Conversation

@bkhant1 (Contributor) commented Oct 8, 2023

I used the HashingEncoder recently and found it weird that any call to fit or transform, even for a dataframe with only tens of rows and a couple of columns, took at least 2 s...

I also had quite a large amount of data to encode, and that took a long time.

That got me started on improving the performance of HashingEncoder, and here's the result! There are quite a few changes in there; each individual change is in its own commit. Here's a summary of the performance gain on my machine (macOS Monterey, i7 2.3 GHz).

Times are per loop (mean ± std. dev.):

| Configuration | Baseline | Numpy arrays instead of apply | Shared memory instead of queue | Fork instead of spawn | Faster hashlib usage |
| --- | --- | --- | --- | --- | --- |
| n_rows=30 n_features=3 n_components=10 n_process=4 | 3.55 s ± 150 ms | 3.62 s ± 140 ms | 2.2 s ± 41.6 ms | 56.6 ms ± 2.91 ms | 47.3 ms ± 516 µs |
| n_rows=30 n_features=3 n_components=10 n_process=1 | 1.24 s ± 52.6 ms | 1.42 s ± 170 ms | 1.74 ms ± 32.2 µs | 2.08 ms ± 91.7 µs | 1.86 ms ± 173 µs |
| n_rows=30 n_features=3 n_components=100 n_process=1 | 1.22 s ± 51.5 ms | 1.33 s ± 60.7 ms | 1.73 ms ± 29.7 µs | 2.01 ms ± 148 µs | 2.01 ms ± 225 µs |
| n_rows=10000 n_features=10 n_components=10 n_process=4 | 5.45 s ± 85.8 ms | 5.36 s ± 57.5 ms | 2.23 s ± 39.6 ms | 120 ms ± 3.02 ms | 96.4 ms ± 2.33 ms |
| n_rows=10000 n_features=10 n_components=10 n_process=1 | 1.61 s ± 30.1 ms | 1.45 s ± 27.2 ms | 227 ms ± 6.03 ms | 236 ms ± 3.06 ms | 170 ms ± 1.35 ms |
| n_rows=100000 n_features=10 n_components=10 n_process=4 | 5.99 s ± 215 ms | 5.71 s ± 148 ms | 4.8 s ± 25.4 ms | 836 ms ± 42.3 ms | 622 ms ± 33.2 ms |
| n_rows=100000 n_features=10 n_components=10 n_process=1 | 5.38 s ± 53 ms | 3.73 s ± 56.5 ms | 2.25 s ± 57.4 ms | 3.76 s ± 1.61 s | 1.68 s ± 19.9 ms |
| n_rows=1000000 n_features=50 n_components=10 n_process=4 | 50.8 s ± 1.17 s | 56.4 s ± 2.11 s | 37.1 s ± 576 ms | 36.9 s ± 2.19 s | 26.6 s ± 1.8 s |
| n_rows=1000000 n_features=50 n_components=10 n_process=1 | 2min 22s ± 2.05 s | 2min 19s ± 3.08 s | 1min 47s ± 1.15 s | 2min 10s ± 18.4 s | 1min 21s ± 1.67 s |

The notebook that produced that table can be found here

Proposed Changes

The changes are listed by commit.

Add a simple non-regression HashEncoder test

To make sure I am not breaking it.

In HashingEncoder process the df as a numpy array instead of using apply

It has no direct impact on performance, but it allows accessing the memory layout of the dataframe directly. That in turn allows using shared memory to communicate between processes instead of a data queue, which does improve performance.
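A minimal sketch of the idea (hypothetical names, not the PR's exact code; `hash_row` is a placeholder for the hashing trick):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"cat_a": ["x", "y", "z"], "cat_b": ["u", "v", "w"]})

# Before: row-wise apply goes through pandas machinery for every row.
# encoded = df.apply(lambda row: hash_row(row), axis=1)

# After: grab the raw values once and iterate over a plain ndarray,
# whose memory layout we control (and can later back with shared memory).
np_df = df.to_numpy()  # shape (n_rows, n_features), dtype=object
for row in np_df:
    pass  # apply the hashing trick to `row` here
```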

In HashEncoder use shared memory instead of queue for multiprocessing

It is faster to write directly to memory than to have the data transit through a queue.

The multiprocessing method is similar to what it was with queues: the dataframe is split into chunks, and each process applies the hashing trick to its chunk of the dataframe. Instead of writing the result to a queue, it writes it directly into a shared memory segment, which is also the underlying memory of a numpy array that is used to build the output dataframe. Tested on macOS Monterey (2.3 GHz Quad-Core i7), this is between 1.3 and 1.6 times faster depending on the dataframe size and how many processes are used.
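A minimal sketch of the pattern (simplified, with a placeholder instead of the real hashing trick; sizes and names are illustrative only):

```python
import numpy as np
from multiprocessing import Process
from multiprocessing.shared_memory import SharedMemory

def hash_chunk(shm_name, shape, start, end):
    # Each worker re-attaches to the segment and writes only to its own rows,
    # so no lock is needed.
    shm = SharedMemory(name=shm_name)
    view = np.ndarray(shape, dtype=np.int64, buffer=shm.buf)
    view[start:end] = 1  # placeholder for the real hashing trick
    del view             # drop the buffer view before closing the handle
    shm.close()

if __name__ == "__main__":
    n_rows, n_components = 1_000, 10
    shm = SharedMemory(create=True, size=n_rows * n_components * 8)  # 8 bytes per int64
    out = np.ndarray((n_rows, n_components), dtype=np.int64, buffer=shm.buf)

    bounds = [(0, 500), (500, 1_000)]
    workers = [Process(target=hash_chunk, args=(shm.name, out.shape, lo, hi))
               for lo, hi in bounds]
    for w in workers:
        w.start()
    for w in workers:
        w.join()

    encoded = out.copy()  # every chunk's result, ready to become a dataframe
    del out               # release the view before closing the segment
    shm.close()
    shm.unlink()
```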

Allow forking processes instead of spawning them and make it the default

This makes the HashEncoder transform method a lot faster on small datasets.

The spawn process creation method creates a new python interpreter from scratch and re-imports all required modules. In a minimal case (only pandas and category_encoders.hashing are imported) this adds a ~2 s overhead to any call to transform.

Fork creates a copy of the current process, and that's it. It is unsafe to use with threads, locks, file descriptors, etc., but in this case the only thing the forked process does is process some data and write it to *its own* segment of shared memory. It is a lot faster, as pandas doesn't have to be re-imported (around 20 ms?).

It might take up more memory, as more than the necessary variables will be copied (the largest by far being the HashEncoder instance, which includes the user dataframe). Add the option to use spawn instead of fork to potentially save some memory.
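A sketch of how the choice works (the encoder stores the user's choice in a `process_creation_method` option, visible in the diff quoted later; note that "fork" is only available on POSIX systems):

```python
import multiprocessing

# "fork" copies the current process as-is, so pandas stays imported and
# start-up is fast; "spawn" boots a fresh interpreter, which is safer with
# threads/locks and can use less memory, but costs ~2 s of imports here.

if __name__ == "__main__":
    ctx = multiprocessing.get_context("fork")  # or "spawn"
    proc = ctx.Process(target=print, args=("processing a chunk",))
    proc.start()
    proc.join()
```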

Remove python 2 check code and faster use of hashlib

Python 2 is not supported on master, so the check isn't useful.

Create int indexes from hashlib's bytes digest instead of its hex digest, as it's faster.

Call the md5 hashlib constructor directly instead of new('md5'), which is also faster.

Overall the performance gain is about 40% to 60%.
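A sketch of both micro-optimisations (assuming n_components=10 for the modulo; the two expressions are equivalent because the hex digest is just the big-endian bytes digest in hex):

```python
import hashlib

value = b"some_category_value"
n_components = 10

# Before: generic constructor plus a hex digest parsed back into an int.
slow = int(hashlib.new("md5", value).hexdigest(), 16) % n_components

# After: direct constructor and the raw bytes digest.
fast = int.from_bytes(hashlib.md5(value).digest(), "big") % n_components

assert slow == fast
```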

@PaulWestenthanner (Collaborator) left a comment:

Thanks for that contribution! It seems to improve the times a lot.
I tried to understand everything and left some comments where further clarification is necessary.

```
@@ -101,9 +102,10 @@ class HashingEncoder(util.BaseEncoder, util.UnsupervisedTransformerMixin):
    """
    prefit_ordinal = False
    encoding_relation = util.EncodingRelation.ONE_TO_M
    default_int_np_array = np.array(np.zeros((2,2), dtype='int'))
```
Collaborator:

why is this 2x2?

Contributor Author:

I'm not sure 😅 I am just using it to get the dtype of the default int array in that line, so it could be 1 by 1!

Collaborator:

Maybe it would make sense to change it to 1 by 1, as it would be sort of a minimal example. Also probably add a comment that you only need it for the datatype, so people won't wonder why it is there.

```python
n_process = []
chunk_size = int(len(np_df)/self.max_process)
ctx = multiprocessing.get_context(self.process_creation_method)
for i in range(0, self.max_process-1):
```
Collaborator:

Splitting this way is not very elegant if the last chunk has to be treated separately; using numpy's array split could be helpful: https://stackoverflow.com/a/75981560.

Wouldn't it be easier if the hash_chunk function hashed a chunk and returned an array? Then it wouldn't need the shm_result and shm_offset parameters (what does shm stand for, btw?), and you'd just concatenate all the chunks in the end. See the small illustration below.
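A small illustration of the suggestion with toy data — `np.array_split` tolerates uneven splits, so no chunk needs special-casing:

```python
import numpy as np

np_df = np.arange(10).reshape(5, 2)  # stand-in for the dataframe's values
max_process = 3

chunks = np.array_split(np_df, max_process)
print([len(c) for c in chunks])  # [2, 2, 1] — the remainder is handled for us
```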

Contributor Author:

shm stands for "shared_memory" - will update the variable name 👍

> Wouldn't it be easier if the hash_chunk function would hash a chunk and return an array? Then it wouldn't need the shm_result and shm_offset parameters (what does shm stand for btw?). Then you'd just concatenate all the chunks in the end?

That's a very good point, I will try that

Collaborator:

I"ve seen your implementation of the mulitprocess pool (here https://github.com/bkhant1/category_encoders/compare/all_optis...bkhant1:category_encoders:multiproc_pool?expand=1) and like it a lot. I think it is very clean and you should add it to the PR

@PaulWestenthanner (Collaborator):

One more question: you've removed all the data-lock stuff for multi-processing, right? Was this not necessary?

@bkhant1 (Contributor Author) commented Oct 20, 2023

Hi @PaulWestenthanner ! Thanks for your review!

> One more question: you've removed all the data-lock stuff for multi-processing, right? Was this not necessary?

It was necessary because all processes were writing to the same queue. With shared memory they write to their own section of the shared memory, so there is no concurrency to manage.

I think I replied to all your comments/addressed them.

The one outstanding thing is that you suggested:

> Wouldn't it be easier if the hash_chunk function would hash a chunk and return an array? Then it wouldn't need the shm_result and shm_offset parameters (what does shm stand for btw?). Then you'd just concatenate all the chunks in the end?

It's a little more complex than that, because return doesn't really work across processes (hence the use of queues or shared memory), but I remembered that python has a nice API, ProcessPoolExecutor (which uses queues behind the scenes), that makes it look like it's possible to return something from a process.
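A minimal sketch of that pattern (placeholder hashing and hypothetical sizes, not the fork's exact code):

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def hash_chunk(chunk):
    # Placeholder for the real hashing trick: n_components=10 columns per row.
    return np.ones((len(chunk), 10), dtype=int)

if __name__ == "__main__":
    np_df = np.empty((1_000, 3), dtype=object)
    chunks = np.array_split(np_df, 4)
    # map() pickles the chunks out and the results back (queues under the
    # hood), which is what makes the workers look like they can just return.
    with ProcessPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(hash_chunk, chunks))
    out = np.concatenate(results)
```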

Here's a PR in my fork showing what I tried. It could be a little slower, but the code is cleaner. I think I should add this commit to this PR, but let me know what you think!

Times are per loop (mean ± std. dev.):

| Configuration | Baseline | Before review | After review | Multiproc pool instead of SHM |
| --- | --- | --- | --- | --- |
| n_rows=30 n_features=3 n_components=10 n_process=4 | 3.12 s ± 20.2 ms | 53.9 ms ± 443 µs | 55 ms ± 1.21 ms | 59.8 ms ± 989 µs |
| n_rows=30 n_features=3 n_components=10 n_process=1 | 1.1 s ± 17 ms | 1.56 ms ± 45.5 µs | 1.5 ms ± 44.8 µs | 1.52 ms ± 39.8 µs |
| n_rows=30 n_features=3 n_components=100 n_process=1 | 1.1 s ± 14.5 ms | 1.52 ms ± 25.2 µs | 1.77 ms ± 317 µs | 1.49 ms ± 21.7 µs |
| n_rows=10000 n_features=10 n_components=10 n_process=4 | 4.75 s ± 36 ms | 99.3 ms ± 649 µs | 122 ms ± 13 ms | 113 ms ± 992 µs |
| n_rows=10000 n_features=10 n_components=10 n_process=1 | 1.51 s ± 13.5 ms | 156 ms ± 2.67 ms | 165 ms ± 17.6 ms | 146 ms ± 1.79 ms |
| n_rows=100000 n_features=10 n_components=10 n_process=4 | 5.53 s ± 30.6 ms | 584 ms ± 6.74 ms | 576 ms ± 25.2 ms | 572 ms ± 12.4 ms |
| n_rows=100000 n_features=10 n_components=10 n_process=1 | 5.22 s ± 55.3 ms | 1.55 s ± 11 ms | 1.48 s ± 12.6 ms | 1.46 s ± 32.7 ms |
| n_rows=1000000 n_features=50 n_components=10 n_process=4 | 53.2 s ± 2.35 s | 23.5 s ± 256 ms | 22.3 s ± 200 ms | 25.6 s ± 1.6 s |



@PaulWestenthanner (Collaborator):

Thanks for the changes. Looks really cool now.
I'd prefer the clean ProcessPoolExecutor implementation. Then we'll be able to merge!

@PaulWestenthanner (Collaborator):

Could you maybe also add a high-level summary to the changelog in the unreleased section (https://github.com/scikit-learn-contrib/category_encoders/blob/master/CHANGELOG.md)?

(Commit message: "It makes for cleaner, easier-to-read code, at very little performance impact.")
@bkhant1 (Contributor Author) commented Nov 9, 2023

I added the ProcessPoolExecutor commit to this branch and updated the changelog!

@PaulWestenthanner (Collaborator):

Looks good! Thanks for this improvement! It makes the code both more readable and faster.

@PaulWestenthanner merged commit 5c94e27 into scikit-learn-contrib:master on Nov 11, 2023. 5 checks passed.