
Optimise HashingEncoder for both large and small dataframes #428

Merged: 10 commits, Nov 11, 2023

Conversation

@bkhant1 (Contributor) commented Oct 8, 2023

I used the HashingEncoder recently and found it weird that any call to fit or transform, even for a dataframe with only tens of rows and a couple of columns, took at least 2 s...

I also had quite a large amount of data to encode, and that took a long time.

That got me started on improving the performance of HashingEncoder, and here's the result! There are quite a few changes in there; each individual change is in its own commit. Here's a summary of the performance gain on my machine (macOS Monterey, i7 2.3 GHz).

Times are per loop (mean ± std. dev.):

| Configuration | Baseline | Numpy arrays instead of apply | Shared memory instead of queue | Fork instead of spawn | Faster hashlib usage |
| --- | --- | --- | --- | --- | --- |
| n_rows=30 n_features=3 n_components=10 n_process=4 | 3.55 s ± 150 ms | 3.62 s ± 140 ms | 2.2 s ± 41.6 ms | 56.6 ms ± 2.91 ms | 47.3 ms ± 516 µs |
| n_rows=30 n_features=3 n_components=10 n_process=1 | 1.24 s ± 52.6 ms | 1.42 s ± 170 ms | 1.74 ms ± 32.2 µs | 2.08 ms ± 91.7 µs | 1.86 ms ± 173 µs |
| n_rows=30 n_features=3 n_components=100 n_process=1 | 1.22 s ± 51.5 ms | 1.33 s ± 60.7 ms | 1.73 ms ± 29.7 µs | 2.01 ms ± 148 µs | 2.01 ms ± 225 µs |
| n_rows=10000 n_features=10 n_components=10 n_process=4 | 5.45 s ± 85.8 ms | 5.36 s ± 57.5 ms | 2.23 s ± 39.6 ms | 120 ms ± 3.02 ms | 96.4 ms ± 2.33 ms |
| n_rows=10000 n_features=10 n_components=10 n_process=1 | 1.61 s ± 30.1 ms | 1.45 s ± 27.2 ms | 227 ms ± 6.03 ms | 236 ms ± 3.06 ms | 170 ms ± 1.35 ms |
| n_rows=100000 n_features=10 n_components=10 n_process=4 | 5.99 s ± 215 ms | 5.71 s ± 148 ms | 4.8 s ± 25.4 ms | 836 ms ± 42.3 ms | 622 ms ± 33.2 ms |
| n_rows=100000 n_features=10 n_components=10 n_process=1 | 5.38 s ± 53 ms | 3.73 s ± 56.5 ms | 2.25 s ± 57.4 ms | 3.76 s ± 1.61 s | 1.68 s ± 19.9 ms |
| n_rows=1000000 n_features=50 n_components=10 n_process=4 | 50.8 s ± 1.17 s | 56.4 s ± 2.11 s | 37.1 s ± 576 ms | 36.9 s ± 2.19 s | 26.6 s ± 1.8 s |
| n_rows=1000000 n_features=50 n_components=10 n_process=1 | 2min 22s ± 2.05 s | 2min 19s ± 3.08 s | 1min 47s ± 1.15 s | 2min 10s ± 18.4 s | 1min 21s ± 1.67 s |

The notebook that produced that table can be found here

Proposed Changes

The changes are listed by commit.

Add a simple non-regression HashEncoder test

To make sure I am not breaking it.

In HashingEncoder process the df as a numpy array instead of using apply

It has no direct impact on performance, but it allows accessing the memory layout of the dataframe directly. That in turn allows using shared memory to communicate between processes instead of a data queue, which does improve performance.
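A minimal sketch of the idea (hypothetical names, not the PR's exact code; `hash_row` is a placeholder for the hashing trick):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"cat_a": ["x", "y", "z"], "cat_b": ["u", "v", "w"]})

# Before: row-wise apply goes through pandas machinery for every row.
# encoded = df.apply(lambda row: hash_row(row), axis=1)

# After: grab the raw values once and iterate over a plain ndarray,
# whose memory layout we control (and can later back with shared memory).
np_df = df.to_numpy()  # shape (n_rows, n_features), dtype=object
for row in np_df:
    pass  # apply the hashing trick to `row` here
```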

In HashEncoder use shared memory instead of queue for multiprocessing

It is faster to write directly to memory than to have the data transit through a queue.

The multiprocessing method is similar to what it was with queues: the dataframe is split into chunks, and each process applies the hashing trick to its chunk of the dataframe. Instead of writing the result to a queue, it writes it directly into a shared memory segment, which is also the underlying memory of a numpy array that is used to build the output dataframe. Tested on macOS Monterey (2.3 GHz Quad-Core i7), this is between 1.3 and 1.6 times faster depending on the dataframe size and how many processes are used.
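A minimal sketch of the pattern (simplified, with a placeholder instead of the real hashing trick; sizes and names are illustrative only):

```python
import numpy as np
from multiprocessing import Process
from multiprocessing.shared_memory import SharedMemory

def hash_chunk(shm_name, shape, start, end):
    # Each worker re-attaches to the segment and writes only to its own rows,
    # so no lock is needed.
    shm = SharedMemory(name=shm_name)
    view = np.ndarray(shape, dtype=np.int64, buffer=shm.buf)
    view[start:end] = 1  # placeholder for the real hashing trick
    del view             # drop the buffer view before closing the handle
    shm.close()

if __name__ == "__main__":
    n_rows, n_components = 1_000, 10
    shm = SharedMemory(create=True, size=n_rows * n_components * 8)  # 8 bytes per int64
    out = np.ndarray((n_rows, n_components), dtype=np.int64, buffer=shm.buf)

    bounds = [(0, 500), (500, 1_000)]
    workers = [Process(target=hash_chunk, args=(shm.name, out.shape, lo, hi))
               for lo, hi in bounds]
    for w in workers:
        w.start()
    for w in workers:
        w.join()

    encoded = out.copy()  # every chunk's result, ready to become a dataframe
    del out               # release the view before closing the segment
    shm.close()
    shm.unlink()
```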

Allow forking processes instead of spawning them and make it the default

This makes the HashEncoder transform method a lot faster on small datasets.

The spawn process creation method creates a new python interpreter from scratch and re-imports all required modules. In a minimal case (only pandas and category_encoders.hashing are imported) this adds a ~2 s overhead to any call to transform.

Fork creates a copy of the current process, and that's it. It is unsafe to use with threads, locks, file descriptors, etc., but in this case the only thing the forked process does is process some data and write it to *its own* segment of shared memory. It is a lot faster, as pandas doesn't have to be re-imported (around 20 ms?).

It might take up more memory, as more than the necessary variables will be copied (the largest by far being the HashEncoder instance, which includes the user dataframe). Add the option to use spawn instead of fork to potentially save some memory.
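A sketch of how the choice works (the encoder stores the user's choice in a `process_creation_method` option, visible in the diff quoted later; note that "fork" is only available on POSIX systems):

```python
import multiprocessing

# "fork" copies the current process as-is, so pandas stays imported and
# start-up is fast; "spawn" boots a fresh interpreter, which is safer with
# threads/locks and can use less memory, but costs ~2 s of imports here.

if __name__ == "__main__":
    ctx = multiprocessing.get_context("fork")  # or "spawn"
    proc = ctx.Process(target=print, args=("processing a chunk",))
    proc.start()
    proc.join()
```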

Remove python 2 check code and faster use of hashlib

Python 2 is not supported on master, so the check isn't useful.

Create int indexes from hashlib's bytes digest instead of its hex digest, as it's faster.

Call the md5 hashlib constructor directly instead of new('md5'), which is also faster.

Overall the performance gain is about 40% to 60%.
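A sketch of both micro-optimisations (assuming n_components=10 for the modulo; the two expressions are equivalent because the hex digest is just the big-endian bytes digest in hex):

```python
import hashlib

value = b"some_category_value"
n_components = 10

# Before: generic constructor plus a hex digest parsed back into an int.
slow = int(hashlib.new("md5", value).hexdigest(), 16) % n_components

# After: direct constructor and the raw bytes digest.
fast = int.from_bytes(hashlib.md5(value).digest(), "big") % n_components

assert slow == fast
```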

@PaulWestenthanner (Collaborator) left a comment:

Thanks for that contribution! It seems to improve the times a lot.
I tried to understand everything and left some comments where further clarification is necessary.

```
@@ -101,9 +102,10 @@ class HashingEncoder(util.BaseEncoder, util.UnsupervisedTransformerMixin):
    """
    prefit_ordinal = False
    encoding_relation = util.EncodingRelation.ONE_TO_M
    default_int_np_array = np.array(np.zeros((2,2), dtype='int'))
```
Collaborator:

why is this 2x2?

Contributor Author:

I'm not sure 😅 I am just using it to get the dtype of the default int array in that line, so it could be 1 by 1!

Collaborator:

Maybe it would make sense to change it to 1 by 1, as it would be sort of a minimal example. Also probably add a comment that you only need it for the datatype, so people won't wonder why it is there.

```python
n_process = []
chunk_size = int(len(np_df)/self.max_process)
ctx = multiprocessing.get_context(self.process_creation_method)
for i in range(0, self.max_process-1):
```
Collaborator:

Splitting this way is not very elegant if the last chunk has to be treated separately; using numpy's array split could be helpful: https://stackoverflow.com/a/75981560.

Wouldn't it be easier if the hash_chunk function hashed a chunk and returned an array? Then it wouldn't need the shm_result and shm_offset parameters (what does shm stand for, btw?), and you'd just concatenate all the chunks in the end. See the small illustration below.
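A small illustration of the suggestion with toy data — `np.array_split` tolerates uneven splits, so no chunk needs special-casing:

```python
import numpy as np

np_df = np.arange(10).reshape(5, 2)  # stand-in for the dataframe's values
max_process = 3

chunks = np.array_split(np_df, max_process)
print([len(c) for c in chunks])  # [2, 2, 1] — the remainder is handled for us
```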

Contributor Author:

shm stands for "shared_memory" - will update the variable name 👍

> Wouldn't it be easier if the hash_chunk function would hash a chunk and return an array? Then it wouldn't need the shm_result and shm_offset parameters (what does shm stand for btw?). Then you'd just concatenate all the chunks in the end?

That's a very good point, I will try that

Collaborator:

I"ve seen your implementation of the mulitprocess pool (here https://github.com/bkhant1/category_encoders/compare/all_optis...bkhant1:category_encoders:multiproc_pool?expand=1) and like it a lot. I think it is very clean and you should add it to the PR

@PaulWestenthanner (Collaborator):

One more question: you've removed all the data-lock stuff for multi-processing, right? Was this not necessary?

@bkhant1 (Contributor Author) commented Oct 20, 2023

Hi @PaulWestenthanner ! Thanks for your review!

> One more question: you've removed all the data-lock stuff for multi-processing, right? Was this not necessary?

It was necessary because all processes were writing to the same queue. With shared memory they write to their own section of the shared memory, so there is no concurrency to manage.

I think I replied to all your comments/addressed them.

The one outstanding thing is that you suggested:

> Wouldn't it be easier if the hash_chunk function would hash a chunk and return an array? Then it wouldn't need the shm_result and shm_offset parameters (what does shm stand for btw?). Then you'd just concatenate all the chunks in the end?

It's a little more complex than that, because return doesn't really work across processes (hence the use of queues or shared memory), but I remembered that python has a nice API, ProcessPoolExecutor (which uses queues behind the scenes), that makes it look like it's possible to return something from a process.
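A minimal sketch of that pattern (placeholder hashing and hypothetical sizes, not the fork's exact code):

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def hash_chunk(chunk):
    # Placeholder for the real hashing trick: n_components=10 columns per row.
    return np.ones((len(chunk), 10), dtype=int)

if __name__ == "__main__":
    np_df = np.empty((1_000, 3), dtype=object)
    chunks = np.array_split(np_df, 4)
    # map() pickles the chunks out and the results back (queues under the
    # hood), which is what makes the workers look like they can just return.
    with ProcessPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(hash_chunk, chunks))
    out = np.concatenate(results)
```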

Here's a PR in my fork showing what I tried. It could be a little slower, but the code is cleaner. I think I should add this commit to this PR, but let me know what you think!

Times are per loop (mean ± std. dev.):

| Configuration | Baseline | Before review | After review | Multiproc pool instead of SHM |
| --- | --- | --- | --- | --- |
| n_rows=30 n_features=3 n_components=10 n_process=4 | 3.12 s ± 20.2 ms | 53.9 ms ± 443 µs | 55 ms ± 1.21 ms | 59.8 ms ± 989 µs |
| n_rows=30 n_features=3 n_components=10 n_process=1 | 1.1 s ± 17 ms | 1.56 ms ± 45.5 µs | 1.5 ms ± 44.8 µs | 1.52 ms ± 39.8 µs |
| n_rows=30 n_features=3 n_components=100 n_process=1 | 1.1 s ± 14.5 ms | 1.52 ms ± 25.2 µs | 1.77 ms ± 317 µs | 1.49 ms ± 21.7 µs |
| n_rows=10000 n_features=10 n_components=10 n_process=4 | 4.75 s ± 36 ms | 99.3 ms ± 649 µs | 122 ms ± 13 ms | 113 ms ± 992 µs |
| n_rows=10000 n_features=10 n_components=10 n_process=1 | 1.51 s ± 13.5 ms | 156 ms ± 2.67 ms | 165 ms ± 17.6 ms | 146 ms ± 1.79 ms |
| n_rows=100000 n_features=10 n_components=10 n_process=4 | 5.53 s ± 30.6 ms | 584 ms ± 6.74 ms | 576 ms ± 25.2 ms | 572 ms ± 12.4 ms |
| n_rows=100000 n_features=10 n_components=10 n_process=1 | 5.22 s ± 55.3 ms | 1.55 s ± 11 ms | 1.48 s ± 12.6 ms | 1.46 s ± 32.7 ms |
| n_rows=1000000 n_features=50 n_components=10 n_process=4 | 53.2 s ± 2.35 s | 23.5 s ± 256 ms | 22.3 s ± 200 ms | 25.6 s ± 1.6 s |



@PaulWestenthanner (Collaborator):

Thanks for the changes. Looks really cool now.
I'd prefer the clean ProcessPoolExecutor implementation. Then we'll be able to merge!

@PaulWestenthanner (Collaborator):

Could you maybe also add a high-level summary to the changelog in the unreleased section (https://github.com/scikit-learn-contrib/category_encoders/blob/master/CHANGELOG.md)?

(Commit message: "It makes for cleaner, easier-to-read code, at very little performance impact.")
@bkhant1 (Contributor Author) commented Nov 9, 2023

I added the ProcessPoolExecutor commit to this branch and updated the changelog!

@PaulWestenthanner (Collaborator):

Looks good! Thanks for this improvement! It makes the code both more readable and faster.

@PaulWestenthanner merged commit 5c94e27 into scikit-learn-contrib:master on Nov 11, 2023. 5 checks passed.