
Please support runtimes other than tokio #6

Closed
joshtriplett opened this issue Aug 10, 2021 · 11 comments · Fixed by #7

Comments

@joshtriplett
Contributor

joshtriplett commented Aug 10, 2021

I'd love to use gzp in libraries that are already using a different async runtime, and I'd like to avoid adding the substantial additional dependencies of a separate async runtime that isn't otherwise used by the library.

Would you consider either supporting a thread pool like Rayon, or supporting other async runtimes?

(If it helps, you could use a channel library like flume that works on any runtime, so that you only need a spawn function.)
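To sketch the shape I mean (hypothetical code, not tied to gzp's internals):

    use std::thread;

    // flume channels work the same from sync and async contexts (there is
    // also recv_async for executors), so the only runtime-specific piece a
    // backend would need to provide is a spawn function.
    fn pipeline<S>(spawn: S)
    where
        S: Fn(Box<dyn FnOnce() + Send>),
    {
        let (tx, rx) = flume::unbounded::<Vec<u8>>();

        spawn(Box::new(move || {
            while let Ok(chunk) = rx.recv() {
                let _ = chunk; // ... compress / write the chunk ...
            }
        }));

        tx.send(b"some bytes".to_vec()).unwrap();
    }

    fn main() {
        // Plain OS threads stand in here for any executor's spawn.
        pipeline(|f| {
            thread::spawn(f);
        });
    }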

@sstadick
Owner

sstadick commented Aug 10, 2021

I will look into doing this! Flume looks excellent btw.

I'll see what I can do with rayon threadpools, and if I can make things contributor friendly to add more backends in an easy manner. I agree that gzp has a rather large footprint as is.

@joshtriplett
Contributor Author

joshtriplett commented Aug 10, 2021 via email

@sstadick
Owner

So, I have a few branches going and some interesting initial results. See the explore_runtimes branch for a rayon threadpool impl of pargz, and the feature/futures branch for pargz and parsnap on the futures executor (with flume) instead of tokio.

The notable conclusion is that rayon threadpools lag in performance when using fewer threads.

I'm pretty sure that this is mainly a result of the rayon version basically just spinning on 2 threads:

  • 1 thread blocking and waiting for chunks to be sent to it
  • 1 thread blocking and waiting for compressed chunks to write to file
  • remainder doing the actual compression

Additionally, the rayon version adds an extra channel that acts a bit like a future to eventually pull a result. Maybe there is a better way to organize all this on a rayon threadpool.
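To make that concrete, the pattern is roughly this (a simplified sketch, not the branch code verbatim; it uses flate2 directly as the compressor):

    use std::collections::VecDeque;
    use std::io::Write;

    use flate2::write::GzEncoder;
    use flate2::Compression;

    fn compress_chunks(chunks: Vec<Vec<u8>>, num_threads: usize) -> Vec<Vec<u8>> {
        let pool = rayon::ThreadPoolBuilder::new()
            .num_threads(num_threads)
            .build()
            .unwrap();

        // One single-use channel per chunk acts like a future: receivers
        // are stored in submission order so output stays ordered even
        // though compression finishes out of order.
        let mut pending = VecDeque::new();
        for chunk in chunks {
            let (tx, rx) = flume::bounded(1);
            pool.spawn(move || {
                let mut encoder = GzEncoder::new(Vec::new(), Compression::new(6));
                encoder.write_all(&chunk).unwrap();
                tx.send(encoder.finish().unwrap()).unwrap();
            });
            pending.push_back(rx);
        }

        // The "writer" side blocks on each receiver in order.
        pending.into_iter().map(|rx| rx.recv().unwrap()).collect()
    }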

In general the rayon version performs as well as the futures async version minus 2 cores, i.e. Gzip/6 with rayon ~= Gzip/4 with feature/futures.

The tokio runtime with its explicit spawn_blocking is nearly 2x faster than the futures runtime. Looking at the deps brought in by the tokio version with feature flags set, it really doesn't seem to be that much heavier than either rayon or futures.

I'll try to formalize this some more with tables comparing things. I'm mostly surprised that the tokio version is so much faster than the futures version. I also need to make sure that flume, which I'm not currently using in the tokio version, isn't the culprit for that slowdown.

I'm not fully convinced all things are equal between impls yet; these are just some interesting preliminary results.

@joshtriplett
Contributor Author

It'd be worth trying tokio with flume channels, to see if that's causing any performance delta (in either direction).

I don't know anything about the performance of the futures runtime, or whether it's been optimized for use cases like this.

For Rayon, you may also want to try ScopeFifo/in_place_scope_fifo or similar to run tasks in close to FIFO order (since you need the data in that order).
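Something like this is what I have in mind (untested sketch; the chunk handling is just a placeholder):

    // FIFO scopes dequeue tasks in close to submission order, though
    // completion order is still not guaranteed, so ordered output would
    // still need something like the channel-per-chunk trick.
    fn fifo_example(pool: &rayon::ThreadPool, chunks: Vec<Vec<u8>>) {
        pool.in_place_scope_fifo(|scope| {
            for chunk in chunks {
                scope.spawn_fifo(move |_| {
                    let _compressed = chunk; // ... compress here ...
                });
            }
        });
    }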

@sstadick
Owner

Flume ends up being a negligible difference. The real difference is with tokio::task::spawn_blocking which ends up being way more performant than a regular task.
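To make the distinction concrete, the difference is roughly the following (a simplified sketch, not the actual benchmark code; expensive_compress is a stand-in for the real compression):

    // CPU-bound work handed to tokio's dedicated blocking pool, so the
    // async worker threads stay free:
    async fn compress_chunk(chunk: Vec<u8>) -> Vec<u8> {
        tokio::task::spawn_blocking(move || expensive_compress(chunk))
            .await
            .unwrap()
    }

    // Whereas a plain spawn occupies one of the (few) async worker
    // threads for the whole computation:
    // tokio::spawn(async move { expensive_compress(chunk) });

    fn expensive_compress(chunk: Vec<u8>) -> Vec<u8> {
        chunk // placeholder for the real CPU-bound compression
    }

Benchmark results: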

| Benchmark | tokio (blocking) | tokio (spawn) | futures | rayon |
| --- | --- | --- | --- | --- |
| Gzip 2 | 2.27s | 6.5s | 6.7s | 6.5s |
| Gzip 4 | 1.39s | 2.2s | 2.2s | 2.3s |
| Gzip 8 | 0.79s | 1.0s | 1.0s | 1.1s |
| Gzip 16 | 0.52s | 0.58s | 0.56s | 0.74s |
| Gzip 30 | 0.44s | 0.41s | 0.36s | 0.49s |

I did manage to get rayon down to a more reasonable performance. It's worth noting that running the same benchmark with vanilla single threaded gzip encoding takes about 6.6s.

So for all runtimes except tokio (blocking), 2 threads breaks even with the overhead of multithreading, and 4 sees some substantial performance improvements.

Getting rayon / sync threads to be speedy here required some largish changes.

Branches:
tokio: feature/tokio
futures: feature/futures
rayon: explore_runtimes

Make bench data:

    cd bench-data
    cp shakespeare.txt shakespeare.txt.orig
    for i in {0..100}; do cat shakespeare.txt.orig >> shakespeare.txt; done

Run benchmarks:

    cargo bench --features pargz,zlib-ng-compat,parsnap_default -- Gzip --sample-size 10

At the moment I'm inclined to say that tokio is pulling its weight here and proves to be a worthwhile dependency.

@godmar

godmar commented Aug 13, 2021

Is 'Gzip/2' supposed to use 2 compression threads in rayon? Your code runs it with only 1:

    let handle = std::thread::spawn(move || {
        ParGz::run(rx_compressor, rx_writer, self.writer, self.num_threads - 1, comp_level)
    });

Similarly, Gzip/4 uses 3 threads (you can observe that with htop).

@godmar

godmar commented Aug 13, 2021

By contrast, your tokio (blocking) version doesn't perform any concurrency throttling at all - you spawn a new thread for each chunk.

Could it be you're comparing apples and oranges here?

@sstadick
Owner

@godmar, that is correct: the Gzip/2 threadpool gets 1 fewer thread than the number passed to the ParGzBuilder, to account for the background thread that is spawned to orchestrate everything (that same line you indicated).

The tokio runtimes have the thread count set in the tokio runtime builder, and have the same -1 from num_threads passed in to account for the initial background thread that is spawned.

    let rt = tokio::runtime::Builder::new_multi_thread()
        .worker_threads(num_threads)
        .build()?;

The docs on spawn_blocking say that if a task is spawned but we are out of threads, that task is put onto a queue and run when a thread frees up, which will definitely eat memory, but that's a fine tradeoff here.

So, I'm pretty sure this is all still apples to apples, but I appreciate you trying to find holes in it, as I've been staring at it for too long at this point.

@godmar

godmar commented Aug 13, 2021

I repeated your experiments and monitored the CPU usage. Your rayon test case uses 1 CPU; your tokio (blocking) case uses multiple CPUs. Fire up htop to see that.

Of course, if you're using more CPUs, results come in faster (as long as you're not out of CPUs, as in the case with 30 threads, where you'll see the performance become roughly equal).

This is not an apples to apples comparison. Your rayon threadpool is explicitly instructed to use only 1 thread, and it does that. Since the compression is CPU bound, this uses 100% of one core or CPU.
Your tokio/blocking adds a new thread for each chunk in the compression scheme. This will create as many threads as there are chunks, certainly more than one. These threads will be scheduled onto different CPUs by the OS, providing for parallelism that explains the speedup you're observing. You can also see this with htop.

If you're asking about the impact of different threadpools, you need to apply the same concurrency throttling strategy to all scenarios, in my opinion, or else the results don't make sense.

> to account for the background thread that is spawned to orchestrate everything (that same line you indicated).

The "background" thread does hardly any CPU work, so I wouldn't count it.

> The tokio runtimes have the thread count set in the tokio runtime builder, and have the same -1 from num_threads passed in to account for the initial background thread that is spawned.

This is referring to the threadpool tokio uses for async tasks, which again here do very little work, if any. You offload all CPU intensive work onto the so-called "spawn_blocking" threadpool which is not subject to concurrency control unless the (very large, larger than the number of CPUs) limit is reached.

@sstadick
Owner

You are correct. I thought that worker_threads limited the number of spawned blocking threads AND async threads, but it does not. max_blocking_threads does, though (https://docs.rs/tokio/1.10.0/tokio/runtime/struct.Builder.html#method.max_blocking_threads), and when that is employed tokio performance drops quickly.
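For the record, actually capping the blocking pool looks something like this (illustrative numbers):

    // Bounds both the async workers and the spawn_blocking pool so total
    // thread usage is really limited.
    let rt = tokio::runtime::Builder::new_multi_thread()
        .worker_threads(1)                 // async tasks do little work here
        .max_blocking_threads(num_threads) // caps spawn_blocking concurrency
        .build()?;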

I'm going to rework these benchmarks and likely just move entirely to rayon. Thanks for pushing on this till I read the docs 👍

@sstadick sstadick linked a pull request Aug 15, 2021 that will close this issue
@sstadick
Owner

Please see release v0.4.0 for gzp using rayon backend with accurate thread usage.
