How to use dyn Trait with rayon without performance loss? #1038

Open
kloki opened this issue Mar 26, 2023 · 5 comments

Comments

kloki commented Mar 26, 2023

I noticed a massive performance loss when looping over a Vec&lt;Box&lt;dyn Trait&gt;&gt; compared to looping over a Vec&lt;struct&gt; using into_par_iter. I'm not sure what is causing it or whether I'm doing something wrong.

Below I have a snippet that reproduces the behavior.

use rayon::prelude::*;
use std::f64::consts::PI;
use std::time::Instant;

pub trait Body: Sync + Send {
    fn size(&self) -> f64 {
        0.
    }
}

#[derive(Clone)]
pub struct Sphere {
    radius: f64,
}

impl Sphere {
    pub fn new() -> Self {
        Sphere { radius: 4.0 }
    }
}

impl Body for Sphere {
    fn size(&self) -> f64 {
        4.0 / 3.0 * PI * self.radius * self.radius * self.radius
    }
}
pub struct World {
    bodies: Vec<Box<dyn Body>>,
    spheres: Vec<Sphere>,
}

impl World {
    fn new() -> Self {
        let mut world = World {
            bodies: vec![],
            spheres: vec![],
        };
        for _ in 0..100_000_000 {
            world.spheres.push(Sphere::new());
            world.bodies.push(Box::new(Sphere::new()));
        }
        world
    }
}

pub fn main() {
    println!("Creating structs");
    let world = World::new();
    println!("Normal,Spheres");
    let timer = Instant::now();
    world.spheres.iter().map(|s| s.size()).sum::<f64>();
    println!("elapsed time:{:?}", timer.elapsed());

    println!("Normal,Bodies");
    let timer = Instant::now();
    world.bodies.iter().map(|s| s.size()).sum::<f64>();
    println!("elapsed time:{:?}", timer.elapsed());

    println!("Parallel,Spheres");
    let timer = Instant::now();
    world.spheres.into_par_iter().map(|s| s.size()).sum::<f64>();
    println!("elapsed time:{:?}", timer.elapsed());

    println!("Parallel,Bodies");
    let timer = Instant::now();
    world.bodies.into_par_iter().map(|s| s.size()).sum::<f64>();
    println!("elapsed time:{:?}", timer.elapsed());
}

When running with the release build, this produces the output:


Creating structs
Normal,Spheres
elapsed time:97ns
Normal,Bodies
elapsed time:323.960404ms
Parallel,Spheres
elapsed time:88.960257ms
Parallel,Bodies
elapsed time:6.015788183s

adamreichold (Collaborator) commented Mar 26, 2023

97ns is suspiciously short and I suspect that the optimizer is computing the sum at compile time.

The slowdown is due to consuming the values in the parallel computations, whereas the sequential ones just iterate over references. This has a large effect because the Box&lt;dyn Body&gt; case needs dynamic dispatch both to compute size and to drop the values themselves to free the individual heap allocations.

If I just replace .into_par_iter() by .par_iter(), I get

Creating structs
Normal,Spheres
elapsed time:80ns
Normal,Bodies
elapsed time:183.164156ms
Parallel,Spheres
elapsed time:25.692394ms
Parallel,Bodies
elapsed time:153.980776ms

which seems reasonable except for the 80ns case, which is probably not doing any computation at runtime at all.

EDIT: It really is just the parallel freeing and glibc's malloc implementation, not the dynamic dispatch itself. Remove either one from the picture and the pathological slowdown disappears.
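
For reference, a minimal sketch of the .par_iter() change applied to the timing loops from the snippet above (names like world refer to that snippet); borrowing the elements means no boxes are dropped inside the parallel loops:

    use rayon::prelude::*;

    // par_iter() yields &Sphere and &Box<dyn Body> respectively, so the
    // parallel loops only read the data; the heap allocations are freed
    // later, when `world` itself is dropped.
    world.spheres.par_iter().map(|s| s.size()).sum::<f64>();
    world.bodies.par_iter().map(|s| s.size()).sum::<f64>();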

adamreichold (Collaborator) commented Mar 26, 2023

(As to why the parallel versions suffer so much from freeing the heap allocations, note that the heap is a shared resource and even the best heap allocators need some level of synchronization, which will most likely experience significant contention when all hardware threads are basically freeing allocations from a single global heap.)

adamreichold (Collaborator) commented Mar 26, 2023

((For example, using glibc's default allocator I get 15s for your original "Parallel,Bodies" example whereas using a more specialized allocator which is well-suited to such a scenario like snmalloc-rs brings this down to 175ms.))
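
A minimal sketch of swapping in such an allocator (assuming the snmalloc-rs crate has been added to Cargo.toml; the exact version is not pinned here):

    // Route all heap allocations and frees through snmalloc instead of
    // the system allocator.
    use snmalloc_rs::SnMalloc;

    #[global_allocator]
    static ALLOC: SnMalloc = SnMalloc;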

kloki (Author) commented Mar 26, 2023

Thanks for the response. After fixing the into_par_iter usage, I get similar results.

I expected some slowdown but was unsure how much. I wanted to make sure I wasn't making an obvious mistake.

cuviper (Member) commented Mar 27, 2023

I suspect that the optimizer is computing the sum at compile time.

Not even that - it's completely throwing away the unused computation. Try printing it:

    let sum = /*...*/.sum::<f64>();
    println!("elapsed time:{:?}, sum:{sum}", timer.elapsed());

That slows down the serial versions, while the parallel versions are about the same.

In general, you could also use black_box for benchmarking, but it's sometimes tricky to make sure that you don't inhibit too much optimization that way.
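
For example, a minimal sketch using std::hint::black_box (stable since Rust 1.66), again assuming the world value from the snippet above:

    use std::hint::black_box;

    // black_box marks the value as observed, so the optimizer cannot
    // discard the summation as dead code.
    let sum: f64 = world.spheres.iter().map(|s| s.size()).sum();
    black_box(sum);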
