
Low decompression speed in multithreading #269

Open
Dalbasar opened this issue Jan 2, 2023 · 2 comments
Dalbasar commented Jan 2, 2023

As far as I understand it, the GIL should be dropped when calling the underlying LZ4 C library during compression and decompression; see lz4.block.decompress.

I only see a minor speedup when using multiple threads for decompression with python-lz4 4.3.2, both on a 6-core Intel i5-8400 on Debian 11 and on an AMD Ryzen 5900X on Windows 10, with Python 3.11.1 as well as 3.8.10.

Compression speed seems to increase almost linearly with the number of threads.

The following code gives me about 4500 MB/s decompression speed (a slight underestimation due to some overhead from starting the threads) when using 6 threads and ~4300 MB/s when using 1 thread on an AMD Ryzen 5900X on Windows 10. Using lz4.frame yields similar results. Using py-lz4framed instead gives me about 13000 MB/s with 6 threads and ~8300 MB/s with 1 thread (I am not sure the compression settings are the same, but at the very least there is some speedup from multithreading).

import os
import threading
import time
import lz4.block

size_mb = 2000
n_threads = 6

input_data = size_mb * 1024 * os.urandom(1024)  # size_mb MiB: a 1 KiB random block repeated
compressed = lz4.block.compress(input_data)
input_data = None


def decompress(data):
    start_time = time.perf_counter()
    thread_start_time = time.thread_time()
    lz4.block.decompress(data)
    stop_time = time.perf_counter()
    thread_stop_time = time.thread_time()
    print(f"{threading.current_thread()} Decompression took {(stop_time-start_time)*1000:.3f}ms "
          f"(Thread time: {(thread_stop_time-thread_start_time)*1000:.3f}ms)\n")


threads = [threading.Thread(name=str(i), target=decompress, args=(compressed,)) for i in range(n_threads)]

start_thread_time = time.perf_counter()
for t in threads:
    t.start()
for t in threads:
    t.join()
done_thread_time = time.perf_counter()
duration = done_thread_time - start_thread_time
MBs = n_threads*size_mb/duration
print(f"Total time: {duration}s : {MBs}MB/s")
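A quick way to check whether a given call actually drops the GIL is to time a pure-Python counter thread against it: the counter can only advance while the GIL is free. A minimal sketch, using zlib.decompress from the stdlib (which releases the GIL in CPython) as a stand-in so it runs without python-lz4; to test python-lz4, build the blob with lz4.block.compress and decompress it with lz4.block.decompress instead:

```python
import threading
import zlib

# ~64 MiB of zeros compresses to a tiny blob; decompressing it takes
# long enough to observe whether other threads can run in the meantime.
blob = zlib.compress(b"\x00" * (64 * 1024 * 1024), 1)

ticks = 0
stop = threading.Event()

def counter():
    # Pure-Python loop: it can only make progress while it holds the GIL.
    global ticks
    while not stop.is_set():
        ticks += 1

t = threading.Thread(target=counter)
t.start()
zlib.decompress(blob)  # stand-in for lz4.block.decompress
stop.set()
t.join()
print(f"counter advanced {ticks} times around the decompress call")
```

If the C call holds the GIL throughout, the counter barely moves while it runs; if the call releases the GIL, the count is large.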
@jonathanunderwood (Member)
Yes, a quick glance at lz4framed (wading through the macro definitions) shows that the key difference is that lz4framed calls
PyThread_acquire_lock and PyThread_release_lock from pythread.h before/after calling the lz4 library functions, which we're not doing in this library at the moment. The main challenge is knowing when to release the lock; it's not done consistently in lz4framed.

Shouldn't be too hard to do similarly here. Will have a look when I get time, unless you beat me to it.

@jonathanunderwood (Member)
For when I come back to this:
