How to use dyn Trait with rayon without performance loss? #1038

Open
kloki opened this issue Mar 26, 2023 · 5 comments

Comments

kloki commented Mar 26, 2023

I noticed a massive performance loss when looping over a Vec&lt;Box&lt;dyn Trait&gt;&gt; compared to looping over a Vec&lt;struct&gt; using into_par_iter. I'm not sure what is causing it or whether I'm doing something wrong.

Below I have a snippet that reproduces the behavior.

use rayon::prelude::*;
use std::f64::consts::PI;
use std::time::Instant;

pub trait Body: Sync + Send {
    fn size(&self) -> f64 {
        0.
    }
}

#[derive(Clone)]
pub struct Sphere {
    radius: f64,
}

impl Sphere {
    pub fn new() -> Self {
        Sphere { radius: 4.0 }
    }
}

impl Body for Sphere {
    fn size(&self) -> f64 {
        4.0 / 3.0 * PI * self.radius * self.radius * self.radius
    }
}
pub struct World {
    bodies: Vec<Box<dyn Body>>,
    spheres: Vec<Sphere>,
}

impl World {
    fn new() -> Self {
        let mut world = World {
            bodies: vec![],
            spheres: vec![],
        };
        for _ in 0..100_000_000 {
            world.spheres.push(Sphere::new());
            world.bodies.push(Box::new(Sphere::new()));
        }
        world
    }
}

pub fn main() {
    println!("Creating structs");
    let world = World::new();
    println!("Normal,Spheres");
    let timer = Instant::now();
    world.spheres.iter().map(|s| s.size()).sum::<f64>();
    println!("elapsed time:{:?}", timer.elapsed());

    println!("Normal,Bodies");
    let timer = Instant::now();
    world.bodies.iter().map(|s| s.size()).sum::<f64>();
    println!("elapsed time:{:?}", timer.elapsed());

    println!("Parallel,Spheres");
    let timer = Instant::now();
    world.spheres.into_par_iter().map(|s| s.size()).sum::<f64>();
    println!("elapsed time:{:?}", timer.elapsed());

    println!("Parallel,Bodies");
    let timer = Instant::now();
    world.bodies.into_par_iter().map(|s| s.size()).sum::<f64>();
    println!("elapsed time:{:?}", timer.elapsed());
}

When running with the release build, this produces the output:


Creating structs
Normal,Spheres
elapsed time:97ns
Normal,Bodies
elapsed time:323.960404ms
Parallel,Spheres
elapsed time:88.960257ms
Parallel,Bodies
elapsed time:6.015788183s

adamreichold (Collaborator) commented Mar 26, 2023

97ns is suspiciously short and I suspect that the optimizer is computing the sum at compile time.

The slowdown is due to consuming the values in the parallel computations, whereas the sequential ones just iterate over references. This has a large effect because the Box&lt;dyn Body&gt; case needs dynamic dispatch both to compute size and to drop the values themselves to free the individual heap allocations.

If I just replace .into_par_iter() by .par_iter(), I get

Creating structs
Normal,Spheres
elapsed time:80ns
Normal,Bodies
elapsed time:183.164156ms
Parallel,Spheres
elapsed time:25.692394ms
Parallel,Bodies
elapsed time:153.980776ms

which seems reasonable except for the 80ns case, which is probably not doing any computation at runtime at all.

EDIT: It really is just the parallel freeing and glibc's malloc implementation, not the dynamic dispatch itself. Remove either one from the picture and the pathological slowdown disappears.
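
For reference, a minimal sketch of the .par_iter() change applied to the timing loops from the snippet above (names like world refer to that snippet); borrowing the elements means no boxes are dropped inside the parallel loops:

    use rayon::prelude::*;

    // par_iter() yields &Sphere and &Box<dyn Body> respectively, so the
    // parallel loops only read the data; the heap allocations are freed
    // later, when `world` itself is dropped.
    world.spheres.par_iter().map(|s| s.size()).sum::<f64>();
    world.bodies.par_iter().map(|s| s.size()).sum::<f64>();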

adamreichold (Collaborator) commented Mar 26, 2023

(As to why the parallel versions suffer so much from freeing the heap allocations, note that the heap is a shared resource and even the best heap allocators need some level of synchronization, which will most likely experience significant contention when all hardware threads are basically freeing allocations from a single global heap.)

adamreichold (Collaborator) commented Mar 26, 2023

((For example, using glibc's default allocator I get 15s for your original "Parallel,Bodies" example whereas using a more specialized allocator which is well-suited to such a scenario like snmalloc-rs brings this down to 175ms.))
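
A minimal sketch of swapping in such an allocator (assuming the snmalloc-rs crate has been added to Cargo.toml; the exact version is not pinned here):

    // Route all heap allocations and frees through snmalloc instead of
    // the system allocator.
    use snmalloc_rs::SnMalloc;

    #[global_allocator]
    static ALLOC: SnMalloc = SnMalloc;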

kloki (Author) commented Mar 26, 2023

Thanks for the response. After fixing the into_par_iter usage, I get similar results.

I expected some slowdown but was unsure how much. I wanted to make sure I wasn't making an obvious mistake.

cuviper (Member) commented Mar 27, 2023

I suspect that the optimizer is computing the sum at compile time.

Not even that - it's completely throwing away the unused computation. Try printing it:

    let sum = /*...*/.sum::<f64>();
    println!("elapsed time:{:?}, sum:{sum}", timer.elapsed());

That slows down the serial versions, while the parallel versions are about the same.

In general, you could also use black_box for benchmarking, but it's sometimes tricky to make sure that you don't inhibit too much optimization that way.
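
For example, a minimal sketch using std::hint::black_box (stable since Rust 1.66), again assuming the world value from the snippet above:

    use std::hint::black_box;

    // black_box marks the value as observed, so the optimizer cannot
    // discard the summation as dead code.
    let sum: f64 = world.spheres.iter().map(|s| s.size()).sum();
    black_box(sum);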
