
30% performance hit due to changing of the code in the non evaluating match arm #111075

Open
alexshagiev opened this issue May 1, 2023 · 2 comments
Labels
C-bug Category: This is a bug. C-discussion Category: Discussion or questions that doesn't represent real issues. I-slow Issue: Problems and improvements with respect to performance of generated code.

Comments

@alexshagiev

alexshagiev commented May 1, 2023

I tried this code: related to #110764

#![feature(test)]
extern crate test;
use test::Bencher;

use std::{self, hint::black_box};
#[derive(Debug)]
struct Error {
    message: String,
}
type Result<T> = std::result::Result<T, Error>;

struct Deserializer<'a> {
    bytes: &'a [u8],
    idx: usize,
}

impl<'a> Deserializer<'a> {
    pub fn deserialize_bytes<const N: usize>(&mut self) -> Result<[u8; N]> {
        match self.bytes.get(self.idx..self.idx + N) {
            Some(v) => {
                self.idx += N;
                Ok(v.try_into().unwrap()) // this shall not panic since slice method succeeded
            }
            None => {
                // ****** WE KNOW THIS ARM IS NEVER MATCHED BUT CHANGING IRRELEVANT HERE AFFECTS BENCHMARKS ******
                let e = Error {
                    message: "".to_string(),                            // (A)    test test_playground ... bench:         702 ns/iter (+/- 76)
                    // message: format!("{}", "a".repeat(usize::MAX)),  // (B)    test test_playground ... bench:         977 ns/iter (+/- 112)
                };

                // uncommenting print                                   // (A=on) test test_playground ... bench:       1,347 ns/iter (+/- 58)
                // uncommenting print                                   // (B=on) test test_playground ... bench:         972 ns/iter (+/- 104)
                // println!("Does not appear in terminal");                  
                Err(e)
            }
        }
    }
    pub fn new(bytes: &'a [u8]) -> Self {
        Self { bytes, idx: 0 }
    }
}

fn workload(buf: &[u8]) -> Result<()> {
    let mut des = black_box(Deserializer::new(buf));
    for _ in 0..400 {
        let r = black_box(des.deserialize_bytes::<2>());
        match r {
            Err(e) => {
                panic!("failed {}", e.message);
            }
            _ => {}
        }
    }
    Ok(())
}
#[bench]
fn test_playground(b: &mut Bencher) {
    let buf: [u8; 1024] = [0; 1024];
    b.iter(|| workload(&buf));
}

I expected to see this happen:

I expected not to see performance degradation of the function from making changes to a match arm that is never evaluated.

Instead, this happened:

Changing the string of the error message, or adding a println! call, in the match arm that will never be evaluated significantly affects the performance of the function. I have reproduced identical results on stable Rust, and with criterion instead of std bench, but I included the nightly results here so that it is easy for someone else to replicate.

I am certain that the match arm is not evaluated, because the workload is set to panic on error, yet it succeeds nevertheless.

Meta

rustc --version --verbose:

rustc 1.71.0-nightly (8bdcc62cb 2023-04-20)
binary: rustc
commit-hash: 8bdcc62cb0362869f0e7b43a6ae4f96b953d3cbc
commit-date: 2023-04-20
host: x86_64-apple-darwin
release: 1.71.0-nightly
LLVM version: 16.0.2
Backtrace

<backtrace>

@alexshagiev alexshagiev added the C-bug Category: This is a bug. label May 1, 2023
@alexshagiev alexshagiev changed the title function performance appears to be impacted due to changes in the non evaluated match arm 30% performance hit due to changing of the code in the non evaluating match arm May 1, 2023
@quaternic

quaternic commented May 1, 2023

In general it's not unexpected that even code paths that are never taken at runtime still affect the performance, because the compiler has to generate code that handles all the possibilities that it can't prove impossible. In your example code, the black_box is there specifically to prevent things like inlining the function into a context where it might be clear that the Option will always be Some.
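For readers unfamiliar with it, here is a minimal sketch (not from the issue) of what `std::hint::black_box` does: it is an identity function that the optimizer must treat as opaque, so constant folding and value-range assumptions stop at it, and its argument counts as "used".

```rust
use std::hint::black_box;

fn main() {
    // black_box(x) returns x unchanged, but the optimizer may not assume
    // anything about the value: without it, `2 + 2` would be folded away
    // at compile time; with it, the addition survives to runtime.
    let sum = black_box(2u32) + black_box(2u32);
    assert_eq!(sum, 4);

    // Passing a result through black_box also prevents a benchmarked
    // computation from being dead-code-eliminated.
    black_box(sum);
}
```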

I can't test it right now, but I'll take a guess that one significant difference in the variations is that an empty String can be created without allocating, so the fastest version could simply be returning a constant sentinel value, while the others will include some formatting code and function calls, which could significantly increase the size of the function and affect register allocation. I'll try to take a closer look in a bit.
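The allocation claim is easy to check directly (a small sketch, not from the thread): an empty String is just a (dangling pointer, length 0, capacity 0) triple, so constructing one performs no heap allocation and can be materialized as a constant.

```rust
fn main() {
    // String::new() is const-constructible and never touches the allocator.
    let a = String::new();
    assert_eq!(a.capacity(), 0);

    // "".to_string() likewise yields a zero-capacity String, because an
    // empty Vec<u8> never allocates.
    let b = "".to_string();
    assert_eq!(b.capacity(), 0);
}
```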

edit: I looked at these on godbolt.org. The meaningful differences caused by adding the println seem to be:

  • One extra register value is saved on the stack
  • Stack pointer moved by another 72 bytes (goes unused in the Some-case)
  • The function becomes potentially panicking
  • Code size for the function nearly triples

I can't know if this is the exact same code generation, and haven't benchmarked it.

@jyn514 jyn514 added I-slow Issue: Problems and improvements with respect to performance of generated code. C-discussion Category: Discussion or questions that doesn't represent real issues. labels May 1, 2023
@CAD97
Contributor

CAD97 commented May 2, 2023

Using #[cold] is the standard way to smooth over cases like this, e.g. (bench results on my x86_64-pc-windows-msvc machine)

impl<'a> Deserializer<'a> {
    pub fn deserialize_bytes<const N: usize>(&mut self) -> Result<[u8; N]> {
        match self.bytes.get(self.idx..self.idx + N) {
            Some(v) => {
                self.idx += N;
                Ok(v.try_into().unwrap()) // this shall not panic since slice method succeeded
            }
            None => {
                // ****** WE KNOW THIS ARM IS NEVER MATCHED BUT CHANGING IRRELEVANT HERE AFFECTS BENCHMARKS ******
                Err(self.deserialize_bytes_error())
            }
        }
    }

    #[cold]
    fn deserialize_bytes_error(&self) -> Error {
        let e = Error {
            message: "".to_string(),                            // (A)    test test_playground ... bench:         397 ns/iter (+/- 7)
            // message: format!("{}", "a".repeat(usize::MAX)),  // (B)    test test_playground ... bench:         784 ns/iter (+/- 26)
        };

        // uncommenting print                                   // (A=on) test test_playground ... bench:         780 ns/iter (+/- 6)
        // uncommenting print                                   // (B=on) test test_playground ... bench:         783 ns/iter (+/- 4)
        println!("Does not appear in terminal");

        e
    }
}

If I add #[inline(never)] or move the Err inside deserialize_bytes_error, the standard case also takes ~780 ns/iter. Without #[cold], I get ~400/780/600/600, so slightly better performance than with #[cold], actually, and never worse. So use #[cold] sparingly and only when benchmarks show it beneficial; here it's probably not.
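For reference, the general shape of this pattern in isolation (a hypothetical `parse_digit` example with made-up names, not code from this issue): keep the hot path small and move error construction into a `#[cold]`, non-inlined helper. As noted above, whether this actually helps should be confirmed by benchmarks.

```rust
// Hypothetical sketch of outlining a cold error path; `parse_digit` and
// `digit_error` are illustrative names, not from this issue.
fn parse_digit(b: u8) -> Result<u8, String> {
    if b.is_ascii_digit() {
        Ok(b - b'0')
    } else {
        // The hot path only pays for a call instruction here; the String
        // formatting machinery lives entirely in the outlined helper.
        Err(digit_error(b))
    }
}

#[cold]
#[inline(never)]
fn digit_error(b: u8) -> String {
    format!("not an ASCII digit: {b:#04x}")
}

fn main() {
    assert_eq!(parse_digit(b'7').unwrap(), 7);
    assert!(parse_digit(b'x').is_err());
}
```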

This does suggest that the impact comes from going from being able to completely inline the error case into essentially a trivial return of (0, 1, 0) — making the function a leaf frame that calls no other functions — to having to support calling another function if an error occurs.

The difference from godbolt:

constant trivial error
example::Deserializer::deserialize_bytes_2:
        push    rbp
        mov     rbp, rsp
        mov     rax, rdi
        mov     rcx, qword ptr [rsi + 16]
        cmp     rcx, -3
        ja      LBB0_2
        lea     rdx, [rcx + 2]
        cmp     rdx, qword ptr [rsi + 8]
        ja      LBB0_2
        mov     rdi, qword ptr [rsi]
        mov     qword ptr [rsi + 16], rdx
        movzx   ecx, word ptr [rdi + rcx]
        mov     word ptr [rax], cx
        mov     qword ptr [rax + 8], 0
        pop     rbp
        ret
LBB0_2:
        mov     qword ptr [rax], 0
        mov     qword ptr [rax + 8], 1
        mov     qword ptr [rax + 16], 0
        pop     rbp
        ret

cold noninlined function
example::Deserializer::deserialize_bytes_2:
        push    rbp
        mov     rbp, rsp
        push    rbx
        push    rax
        mov     rbx, rdi
        mov     rax, qword ptr [rsi + 16]
        cmp     rax, -3
        ja      LBB2_2
        lea     rcx, [rax + 2]
        cmp     rcx, qword ptr [rsi + 8]
        ja      LBB2_2
        mov     rdx, qword ptr [rsi]
        mov     qword ptr [rsi + 16], rcx
        movzx   eax, word ptr [rdx + rax]
        mov     word ptr [rbx], ax
        mov     qword ptr [rbx + 8], 0
LBB2_4:
        mov     rax, rbx
        add     rsp, 8
        pop     rbx
        pop     rbp
        ret
LBB2_2:
        mov     rdi, rbx
        call    example::Deserializer::deserialize_bytes_error
        jmp     LBB2_4

HOWEVER, I will argue that a better measure of actual performance would be something like

fn workload(buf: &[u8]) {
    let mut des = Deserializer::new(buf);
    for _ in 0..400 {
        match des.deserialize_bytes::<2>() {
            Ok(ok) => {
                let _ = black_box(ok);
            }
            Err(err) => {
                let _ = black_box(err);
            }
        }
    }
}

#[bench]
fn test_playground(b: &mut Bencher) {
    let buf: [u8; 1024] = [0; 1024];
    b.iter(|| workload(black_box(&buf)));
}

That gives me a consistent 200 ns/iter no matter what I do in the error case.¹ And it's not that the error case is being completely optimized out; the error handling code is still present and fn workload isn't being optimized for a specific buffer size. [godbolt] Conclusion: benchmarks are hard.

Footnotes

  1. This includes using #[inline(never)] on fn workload OR #[cold] on fn deserialize_bytes_error, but using both together pessimizes to ~580 ns/iter.
