
Please support runtimes other than tokio #6

Closed
joshtriplett opened this issue Aug 10, 2021 · 11 comments · Fixed by #7

Comments

@joshtriplett
Contributor

joshtriplett commented Aug 10, 2021

I'd love to use gzp in libraries that are already using a different async runtime, and I'd like to avoid adding the substantial additional dependencies of a separate async runtime that isn't otherwise used by the library.

Would you consider either supporting a thread pool like Rayon, or supporting other async runtimes?

(If it helps, you could use a channel library like flume that works on any runtime, so that you only need a spawn function.)
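To sketch the shape I mean (hypothetical code, not tied to gzp's internals):

    use std::thread;

    // flume channels work the same from sync and async contexts (there is
    // also recv_async for executors), so the only runtime-specific piece a
    // backend would need to provide is a spawn function.
    fn pipeline<S>(spawn: S)
    where
        S: Fn(Box<dyn FnOnce() + Send>),
    {
        let (tx, rx) = flume::unbounded::<Vec<u8>>();

        spawn(Box::new(move || {
            while let Ok(chunk) = rx.recv() {
                let _ = chunk; // ... compress / write the chunk ...
            }
        }));

        tx.send(b"some bytes".to_vec()).unwrap();
    }

    fn main() {
        // Plain OS threads stand in here for any executor's spawn.
        pipeline(|f| {
            thread::spawn(f);
        });
    }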

@sstadick
Owner

sstadick commented Aug 10, 2021

I will look into doing this! Flume looks excellent btw.

I'll see what I can do with rayon threadpools, and if I can make things contributor friendly to add more backends in an easy manner. I agree that gzp has a rather large footprint as is.

@joshtriplett
Contributor Author

joshtriplett commented Aug 10, 2021 via email

@sstadick
Owner

So, I have a few branches going and some interesting initial results. See the explore_runtimes branch for a rayon threadpool impl of pargz, and the feature/futures branch for pargz and parsnap on the futures executor (with flume) instead of tokio.

The notable conclusion is that rayon threadpools lag in performance when using fewer threads.

I'm pretty sure that this is mainly a result of the rayon version basically just spinning on 2 threads:

  • 1 thread blocking and waiting for chunks to be sent to it
  • 1 thread blocking and waiting for compressed chunks to write to file
  • remainder doing the actual compression

Additionally, the rayon version adds an extra channel that acts a bit like a future to eventually pull a result. Maybe there is a better way to organize all this on a rayon threadpool.
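To make that concrete, the pattern is roughly this (a simplified sketch, not the branch code verbatim; it uses flate2 directly as the compressor):

    use std::collections::VecDeque;
    use std::io::Write;

    use flate2::write::GzEncoder;
    use flate2::Compression;

    fn compress_chunks(chunks: Vec<Vec<u8>>, num_threads: usize) -> Vec<Vec<u8>> {
        let pool = rayon::ThreadPoolBuilder::new()
            .num_threads(num_threads)
            .build()
            .unwrap();

        // One single-use channel per chunk acts like a future: receivers
        // are stored in submission order so output stays ordered even
        // though compression finishes out of order.
        let mut pending = VecDeque::new();
        for chunk in chunks {
            let (tx, rx) = flume::bounded(1);
            pool.spawn(move || {
                let mut encoder = GzEncoder::new(Vec::new(), Compression::new(6));
                encoder.write_all(&chunk).unwrap();
                tx.send(encoder.finish().unwrap()).unwrap();
            });
            pending.push_back(rx);
        }

        // The "writer" side blocks on each receiver in order.
        pending.into_iter().map(|rx| rx.recv().unwrap()).collect()
    }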

In general the rayon version performs as well as the futures async version minus 2 cores, i.e. Gzip/6 with rayon ~= Gzip/4 with feature/futures.

The tokio runtime with its explicit spawn_blocking is nearly 2x faster than the futures runtime. Looking at the deps brought in by the tokio version with feature flags set, it really doesn't seem to be that much heavier than either rayon or futures.

I'll try to formalize this some more with tables comparing things. I'm mostly surprised that the tokio version is so much faster than the futures version. I also need to make sure that flume, which I'm not currently using in the tokio version, isn't the culprit for that slowdown.

I'm not fully convinced all things are equal between impls yet; these are just some interesting preliminary results.

@joshtriplett
Contributor Author

It'd be worth trying tokio with flume channels, to see if that's causing any performance delta (in either direction).

I don't know anything about the performance of the futures runtime, or whether it's been optimized for use cases like this.

For Rayon, you may also want to try ScopeFifo/in_place_scope_fifo or similar to run tasks in close to FIFO order (since you need the data in that order).
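Something like this is what I have in mind (untested sketch; the chunk handling is just a placeholder):

    // FIFO scopes dequeue tasks in close to submission order, though
    // completion order is still not guaranteed, so ordered output would
    // still need something like the channel-per-chunk trick.
    fn fifo_example(pool: &rayon::ThreadPool, chunks: Vec<Vec<u8>>) {
        pool.in_place_scope_fifo(|scope| {
            for chunk in chunks {
                scope.spawn_fifo(move |_| {
                    let _compressed = chunk; // ... compress here ...
                });
            }
        });
    }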

@sstadick
Owner

Flume ends up being a negligible difference. The real difference is with tokio::task::spawn_blocking which ends up being way more performant than a regular task.
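To make the distinction concrete, the difference is roughly the following (a simplified sketch, not the actual benchmark code; expensive_compress is a stand-in for the real compression):

    // CPU-bound work handed to tokio's dedicated blocking pool, so the
    // async worker threads stay free:
    async fn compress_chunk(chunk: Vec<u8>) -> Vec<u8> {
        tokio::task::spawn_blocking(move || expensive_compress(chunk))
            .await
            .unwrap()
    }

    // Whereas a plain spawn occupies one of the (few) async worker
    // threads for the whole computation:
    // tokio::spawn(async move { expensive_compress(chunk) });

    fn expensive_compress(chunk: Vec<u8>) -> Vec<u8> {
        chunk // placeholder for the real CPU-bound compression
    }

Benchmark results: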

| Benchmark | tokio (blocking) | tokio (spawn) | futures | rayon |
| --- | --- | --- | --- | --- |
| Gzip 2 | 2.27s | 6.5s | 6.7s | 6.5s |
| Gzip 4 | 1.39s | 2.2s | 2.2s | 2.3s |
| Gzip 8 | 0.79s | 1.0s | 1.0s | 1.1s |
| Gzip 16 | 0.52s | 0.58s | 0.56s | 0.74s |
| Gzip 30 | 0.44s | 0.41s | 0.36s | 0.49s |

I did manage to get rayon down to a more reasonable performance. It's worth noting that running the same benchmark with vanilla single threaded gzip encoding takes about 6.6s.

So for all runtimes except tokio (blocking), 2 threads breaks even with the overhead of multithreading, and 4 sees some substantial performance improvements.

Getting rayon / sync threads to be speedy here required some largish changes.

Branches:
tokio: feature/tokio
futures: feature/futures
rayon: explore_runtimes

Make bench data:

    cd bench-data
    cp shakespeare.txt shakespeare.txt.orig
    for i in {0..100}; do cat shakespeare.txt.orig >> shakespeare.txt; done

Run benchmarks:

    cargo bench --features pargz,zlib-ng-compat,parsnap_default -- Gzip --sample-size 10

At the moment I'm inclined to say that tokio is pulling its weight here and proves to be a worthwhile dependency.

@godmar

godmar commented Aug 13, 2021

Is 'Gzip/2' supposed to use 2 compression threads in rayon? Your code runs it with only 1:

    let handle = std::thread::spawn(move || {
        ParGz::run(rx_compressor, rx_writer, self.writer, self.num_threads - 1, comp_level)
    });

Similarly, Gzip/4 uses 3 threads (you can observe that with htop).

@godmar

godmar commented Aug 13, 2021

By contrast, your tokio (blocking) version doesn't perform any concurrency throttling at all - you spawn a new thread for each chunk.

Could it be you're comparing apples and oranges here?

@sstadick
Owner

@godmar, that is correct: the Gzip/2 threadpool gets 1 fewer thread than the number passed to the ParGzBuilder, to account for the background thread that is spawned to orchestrate everything (that same line you indicated).

The tokio runtimes have the thread count set in the tokio runtime builder, and have the same -1 from num_threads passed in to account for the initial background thread that is spawned.

    let rt = tokio::runtime::Builder::new_multi_thread()
        .worker_threads(num_threads)
        .build()?;

The docs on spawn_blocking say that if a task is spawned but we are out of threads, that task is put onto a queue and run when a thread frees up, which will definitely eat memory, but that's a fine tradeoff here.

So, I'm pretty sure this is all still apples to apples, but I appreciate you trying to find holes in it, as I've been staring at it for too long at this point.

@godmar

godmar commented Aug 13, 2021

I repeated your experiments and monitored the CPU usage. Your rayon test case uses 1 CPU; your tokio (blocking) case uses multiple CPUs. Fire up htop to see that.

Of course, if you're using more CPUs, results come in faster (as long as you're not out of CPUs, as in the case with 30 threads, where you'll see the performance become roughly equal).

This is not an apples to apples comparison. Your rayon threadpool is explicitly instructed to use only 1 thread, and it does that. Since the compression is CPU bound, this uses 100% of one core or CPU.
Your tokio/blocking adds a new thread for each chunk in the compression scheme. This will create as many threads as there are chunks, certainly more than one. These threads will be scheduled onto different CPUs by the OS, providing for parallelism that explains the speedup you're observing. You can also see this with htop.

If you're asking about the impact of different threadpools, you need to apply the same concurrency throttling strategy to all scenarios, in my opinion, or else the results don't make sense.

> to account for the background thread that is spawned to orchestrate everything (that same line you indicated).

The "background" thread does hardly any CPU work, so I wouldn't count it.

> The tokio runtimes have the thread count set in the tokio runtime builder, and have the same -1 from num_threads passed in to account for the initial background thread that is spawned.

This is referring to the threadpool tokio uses for async tasks, which again here do very little work, if any. You offload all CPU intensive work onto the so-called "spawn_blocking" threadpool which is not subject to concurrency control unless the (very large, larger than the number of CPUs) limit is reached.

@sstadick
Owner

You are correct. I thought that worker_threads limited the number of spawned blocking threads AND async threads, but it does not. max_blocking_threads does, though (https://docs.rs/tokio/1.10.0/tokio/runtime/struct.Builder.html#method.max_blocking_threads), and when that is employed tokio performance drops quickly.
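For the record, actually capping the blocking pool looks something like this (illustrative numbers):

    // Bounds both the async workers and the spawn_blocking pool so total
    // thread usage is really limited.
    let rt = tokio::runtime::Builder::new_multi_thread()
        .worker_threads(1)                 // async tasks do little work here
        .max_blocking_threads(num_threads) // caps spawn_blocking concurrency
        .build()?;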

I'm going to rework these benchmarks and likely just move entirely to rayon. Thanks for pushing on this till I read the docs 👍

@sstadick sstadick linked a pull request Aug 15, 2021 that will close this issue
@sstadick
Owner

Please see release v0.4.0 for gzp using rayon backend with accurate thread usage.
