Poor optimization of iter().skip() #101814

Tearth · 2022-09-14T16:58:14Z

Using iter().skip() in a for loop produces much worse optimized code than an equivalent hand-written loop over a range.
https://play.rust-lang.org/?version=stable&mode=release&edition=2021&gist=b7ed8bf9e4fc3341a92f301fa5185cc5

pub fn test_1(a: [i32; 10]) -> i32 {
    let mut sum = 0;
    for v in a.iter().skip(8) {
        sum += v;
    }
    
    sum
}

pub fn test_2(a: [i32; 10]) -> i32 {
    let mut sum = 0;
    for index in 8..10 {
        sum += a[index];
    }
    
    sum
}

This produces the following asm output:

playground::test_1:
	movq	%rdi, %r8
	addq	$40, %r8
	xorl	%esi, %esi
	movl	$8, %edx
	xorl	%eax, %eax
	testb	$1, %sil
	jne	.LBB0_2

.LBB0_5:
	leaq	-1(%rdx), %rsi
	movq	%r8, %rcx
	subq	%rdi, %rcx
	shrq	$2, %rcx
	cmpq	%rsi, %rcx
	jbe	.LBB0_4
	leaq	(%rdi,%rdx,4), %rdi

.LBB0_2:
	cmpq	%r8, %rdi
	je	.LBB0_4
	testq	%rdi, %rdi
	je	.LBB0_4
	addl	(%rdi), %eax
	addq	$4, %rdi
	movb	$1, %sil
	xorl	%edx, %edx
	testb	$1, %sil
	je	.LBB0_5
	jmp	.LBB0_2

.LBB0_4:
	retq

playground::test_2:
	movl	36(%rdi), %eax
	addl	32(%rdi), %eax
	retq

Given the zero-cost abstraction principle and the fact that the compiler knows the size of the array, test_1 should optimize to at least the same form as test_2, where the compiler correctly detected that only two values need to be summed. Instead, it emits a sizable chunk of asm with several branches.

The issue is present both in the stable version (1.63.0) and nightly/beta channels.

MatiF100 · 2022-09-14T18:51:37Z

Rewriting the function as follows produces the same assembly as the better-optimized variant. The issue seems to occur when an iterator chain is consumed by a for loop.
https://play.rust-lang.org/?version=stable&mode=release&edition=2021&gist=bcb4551c22ef765a682c1d6c41eb285f

pub fn test_3(a: [i32; 10]) -> i32 {
    a.iter().skip(8).fold(0, |sum, v| sum + v)
}
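For comparison, here is a minimal sketch of a few other ways to write the same sum without a for loop. The function names are hypothetical and were not part of the original report, and the generated assembly for these variants has not been checked here; they only illustrate forms where the iterator drives the loop itself.

```rust
// Hypothetical variants of test_3; names are illustrative only.

/// Interior iteration via for_each: the closure is called by the iterator.
pub fn sum_for_each(a: [i32; 10]) -> i32 {
    let mut sum = 0;
    a.iter().skip(8).for_each(|v| sum += v);
    sum
}

/// Interior iteration via the sum() consumer.
pub fn sum_consumer(a: [i32; 10]) -> i32 {
    a.iter().skip(8).sum()
}

/// Slicing first, so no skip adapter is involved at all.
pub fn sum_slice(a: [i32; 10]) -> i32 {
    a[8..].iter().sum()
}
```

All three return the sum of the last two elements, the same result as test_1 and test_2.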

nikic · 2022-09-14T19:02:11Z

This general class of problem is well known -- optimization of exterior iteration in Rust is very challenging. Using interior iteration (as in the previous comment) will generally optimize much better.
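To make the exterior/interior distinction concrete, the sketch below uses a simplified stand-in for a skipping adapter. SimpleSkip and both functions are hypothetical names written for illustration and are not the standard library source. With exterior iteration the caller repeatedly calls next(), so the "is there still something to skip?" check sits inside the loop body; with interior iteration the adapter can settle the skip once and then run a plain loop.

```rust
/// Simplified skipping adapter (hypothetical, for illustration only).
pub struct SimpleSkip<I> {
    iter: I,
    n: usize, // elements still to skip; becomes 0 after the first next()
}

impl<I: Iterator> Iterator for SimpleSkip<I> {
    type Item = I::Item;

    fn next(&mut self) -> Option<I::Item> {
        if self.n > 0 {
            // First call only: jump past the skipped prefix.
            let n = std::mem::take(&mut self.n);
            self.iter.nth(n)
        } else {
            self.iter.next()
        }
    }
}

/// Exterior iteration: roughly what a for loop over the adapter expands to.
/// The `n > 0` check inside next() is evaluated on every trip through the loop.
pub fn sum_exterior(a: [i32; 10]) -> i32 {
    let mut it = SimpleSkip { iter: a.iter(), n: 8 };
    let mut sum = 0;
    loop {
        match it.next() {
            Some(v) => sum += v,
            None => break,
        }
    }
    sum
}

/// Interior iteration: the skip is resolved once, before the loop starts.
pub fn sum_interior(a: [i32; 10]) -> i32 {
    let mut iter = a.iter();
    let n = 8;
    if n > 0 && iter.nth(n - 1).is_none() {
        return 0; // fewer than n elements: nothing left to sum
    }
    iter.fold(0, |sum, v| sum + v)
}
```

Both functions compute the same value; the difference is only in where the skip check ends up relative to the loop.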

That said, in this case optimization is likely feasible. Looking at the IR (https://rust.godbolt.org/z/cevdKWcTn) there is a clear opportunity for peeling based on phi invariance here, which should allow follow-on optimization. I would have to investigate more closely to find out why it does not trigger.
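As a rough source-level picture of what peeling would achieve, the sketch below writes the loop by hand twice. This is an illustration under the assumption that the skip counter is the loop-carried value that becomes invariant after the first iteration; the function names are hypothetical, and the real transformation happens on LLVM IR, not on Rust source.

```rust
/// Before peeling: the skip counter `n` is checked on every iteration,
/// even though it is always 0 after the first one.
pub fn sum_unpeeled(a: [i32; 10]) -> i32 {
    let mut iter = a.iter();
    let mut n = 8usize;
    let mut sum = 0;
    loop {
        let item = if n > 0 {
            let skip = n;
            n = 0;
            iter.nth(skip)
        } else {
            iter.next()
        };
        match item {
            Some(v) => sum += v,
            None => break,
        }
    }
    sum
}

/// After peeling: the first iteration is pulled out of the loop,
/// so the remaining loop no longer needs the `n > 0` check.
pub fn sum_peeled(a: [i32; 10]) -> i32 {
    let mut iter = a.iter();
    let mut sum = 0;
    // Peeled first iteration: this is the only place the skip happens.
    match iter.nth(8) {
        Some(v) => sum += v,
        None => return sum,
    }
    // Remaining iterations: a plain walk over what is left.
    for v in iter {
        sum += v;
    }
    sum
}
```

Once the loop has the second shape, the known array length lets follow-on passes reason about the small remaining trip count, which is the kind of follow-on optimization mentioned above.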

Tearth · 2022-09-14T19:57:03Z

Thanks, I wasn't aware that the compiler can have this kind of trouble with exterior iteration, but it's understandable. I will leave this issue open if you're saying that this case has the potential to improve.

With #[inline(always)] the body of default() will be inlined into external crates but the body will still contain calls to the LZOxide::new(), ParamsOxide::new(DEFAULT_FLAGS), Box::default() and DictOxide::new(DEFAULT_FLAGS). This ends up causing a copy of the large LZOxide to end up on the stack when used with Box::default as seen in: rust-lang/rust#101814

nikic · 2022-09-27T14:41:42Z

I took a closer look, and the reason this doesn't peel is multiple checks in canPeel(): https://github.com/llvm/llvm-project/blob/2769ceb0e7a4b4f11c2bf5bd21fd69c154c17ff8/llvm/lib/Transforms/Utils/LoopPeel.cpp#L88 We have a non-exiting latch here, and because of that the non-latch exits are also not terminated by unreachable. It should be possible to relax these requirements, but it would need some effort to support branch weight updates.

nikic · 2022-09-28T16:37:04Z

Upstream patch: https://reviews.llvm.org/D134803

nikic · 2023-04-03T12:24:46Z

Fixed by the LLVM 16 upgrade.

Add codegen tests for issues fixed by LLVM 16 Fixes rust-lang#75978. Fixes rust-lang#99960. Fixes rust-lang#101048. Fixes rust-lang#101082. Fixes rust-lang#101814. Fixes rust-lang#103132. Fixes rust-lang#103327.

An unfortunate find is that .skip(1) is actually slower than .collect::<Vec<_>>()[1..].to_vec(); the poor performance of .skip() has already been noted in rust-lang/rust/issues/101814.

Tearth added the C-bug Category: This is a bug. label Sep 14, 2022

nikic added A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. I-slow Issue: Problems and improvements with respect to performance of generated code. labels Sep 14, 2022

jrmuizel mentioned this issue Sep 15, 2022

Remove #[inline(always)] from CompressorOxide::default() Frommi/miniz_oxide#125

Merged

nikic self-assigned this Sep 27, 2022

Shnatsel mentioned this issue Dec 30, 2022

Iterators: recommend interior iteration - iter.for_each() instead of for x in iter nnethercote/perf-book#51

Closed

nikic added the E-needs-test Call for participation: An issue has been fixed and does not reproduce, but no test has been added. label Apr 3, 2023

nikic mentioned this issue Apr 3, 2023

Add codegen tests for issues fixed by LLVM 16 #109895

Merged

Nilstrieb added the T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. label Apr 5, 2023

bors closed this as completed in 73f40d4 Apr 12, 2023

Shnatsel mentioned this issue Oct 13, 2023

Document external .next() vs internal .try_fold() iteration nnethercote/perf-book#70

Open
