Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Missed optimization comparing TypeId arrays #84253

Closed
AngelicosPhosphoros opened this issue Apr 16, 2021 · 6 comments
Closed

Missed optimization comparing TypeId arrays #84253

AngelicosPhosphoros opened this issue Apr 16, 2021 · 6 comments
Assignees
Labels
A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. C-bug Category: This is a bug. I-slow Issue: Problems and improvements with respect to performance of generated code.

Comments

@AngelicosPhosphoros
Copy link
Contributor

AngelicosPhosphoros commented Apr 16, 2021

Edit: Look nikic example below.

I tried this code:

use std::any::TypeId;

trait CheckEqTuple{
    type OutSliceType: AsRef<[TypeId]> + AsMut<[TypeId]>;
    fn get_unsorted_ids()->Self::OutSliceType;
    fn get_quick_sorted_ids()->Self::OutSliceType{
        let mut ret = Self::get_unsorted_ids();
        ret.as_mut().sort_unstable();
        ret
    }
    fn get_slow_sorted_ids()->Self::OutSliceType{
        let mut ret = Self::get_unsorted_ids();
        ret.as_mut().sort();
        ret
    }
}

impl<A: 'static, B: 'static> CheckEqTuple for (A, B){
    type OutSliceType = [TypeId; 2];
    fn get_unsorted_ids()->Self::OutSliceType{
        [TypeId::of::<A>(), TypeId::of::<B>()]
    }
}

fn cmp_unord<A: CheckEqTuple, B: CheckEqTuple>()->bool{
    A::get_unsorted_ids().as_ref()
     == B::get_unsorted_ids().as_ref()
}

fn cmp_ord_q<A: CheckEqTuple, B: CheckEqTuple>()->bool{
    A::get_quick_sorted_ids().as_ref() 
        == B::get_quick_sorted_ids().as_ref()
}

fn cmp_ord_s<A: CheckEqTuple, B: CheckEqTuple>()->bool{
    A::get_slow_sorted_ids().as_ref()
     == B::get_slow_sorted_ids().as_ref()
}

pub fn cmp_same()->bool{
    cmp_unord::<(i32, bool), (i32, bool)>()
}

pub fn cmp_diff_order()->bool{
    cmp_unord::<(bool, i32), (i32, bool)>()
}

pub fn cmp_ord()->bool{
    cmp_ord_q::<(bool, i32), (i32, bool)>()
}

pub fn cmp_ord2()->bool{
    cmp_ord_s::<(bool, i32), (i32, bool)>()
}

I expected to see this happen:

This code should compile to return true or return false.

example::cmp_same:
        mov     al, 1
        ret

example::cmp_diff_order:
        xor     eax, eax
        ret

example::cmp_ord:
        mov     al, 1
        ret

Instead, this happened:

asm contains useless operations

example::cmp_same:
        sub     rsp, 32
        movabs  rax, -8661621401413125213
        mov     qword ptr [rsp + 8], rax
        mov     al, 1
        add     rsp, 32
        ret

example::cmp_diff_order:
        sub     rsp, 32
        movabs  rax, -5015437470765251660
        mov     qword ptr [rsp + 8], rax
        xor     eax, eax
        add     rsp, 32
        ret

example::cmp_ord:
        sub     rsp, 32
        movabs  rax, -5015437470765251660
        mov     qword ptr [rsp + 8], rax
        mov     al, 1
        add     rsp, 32
        ret

Meta

rustc --version --verbose:

rustc 1.51.0 (2fd73fabe 2021-03-23)

Proposed fix

If I add additional SROA flag, useless code is eliminated. I think, we need to add another SROA to pipeline somewhere before last instcombine and dead code elimination.

Here godbolt link.

@AngelicosPhosphoros AngelicosPhosphoros added the C-bug Category: This is a bug. label Apr 16, 2021
@AngelicosPhosphoros AngelicosPhosphoros changed the title Missed optimization with tuple equality Missed optimization with tuple type equality Apr 16, 2021
@nikic
Copy link
Contributor

nikic commented Apr 16, 2021

A simpler starting point that already doesn't optimize: https://godbolt.org/z/rvMah5b5o

@AngelicosPhosphoros
Copy link
Contributor Author

@nikic Do you know if rustc provides some custom LLVM passes to pipeline? After skimming of source, it looks like it just uses default optimization levels from LLVM.

@AngelicosPhosphoros AngelicosPhosphoros changed the title Missed optimization with tuple type equality Missed optimization comparing TypeId arrays Apr 17, 2021
@nikic
Copy link
Contributor

nikic commented Apr 17, 2021

@AngelicosPhosphoros Yes, rust uses only the default optimization pipeline.

In this case, I believe the problem lies with unrolling. After some cleanup, we have this IR going into loop-unroll-full: https://gist.github.com/nikic/9516dede0acedd8d320e9d067559e667

If I understand correctly, it is not unrolled fully because we know small upper bound on the tripcount (3), but the target (X86) does not enable UpperBound unrolling. It does work with -mtriple=arm, which enables it. This was kinda surprising to me, given LLVMs usual propensity to vectorize and unroll everything beyond any reasonable bounds. There's this comment in the implementation:

  // We can unroll by the upper bound amount if it's generally allowed or if
  // we know that the loop is executed either the upper bound or zero times.
  // (MaxOrZero unrolling keeps only the first loop test, so the number of
  // loop tests remains the same compared to the non-unrolled version, whereas
  // the generic upper bound unrolling keeps all but the last loop test so the
  // number of loop tests goes up which may end up being worse on targets with
  // constrained branch predictor resources so is controlled by an option.)
  // In addition we only unroll small upper bounds.

Which makes me think that this option should probably enabled on X86, which is certainly not a system with "constrained branch predictor resources".

Though there is another important factor here: What the constant upper bound doesn't capture, is the fact that in this case, there's two loop exits. One of the exits will be replicated by unrolling. However, the other one is fully determined by unrolling. These exits will be eliminated, together with the induction variable.

This means that the loop unrolling heuristic is inconsistent: If the loop had an interior (non-exiting) branch, then it would get fully unrolled, even though that replicates the exact same number of branches. It's just a difference between whether those branches are exiting or not.

@nikic
Copy link
Contributor

nikic commented Apr 17, 2021

Looking a bit further, there is this code: https://github.com/llvm/llvm-project/blob/ebc6608fb79057eaed27435d62d5dea0979bd9d3/llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp#L1099-L1108 For exact trip counts, LoopUnroll will actually not use the loop trip count, but rather the trip count of a single exit or the latch exit. This is really the case we want to fall into. Unfortunately, in our case the latch exit is the comparison of the loads (which has an unknown exit count) rather than the comparison of the induction variant (which is known).

There's two ways that could be addressed: The LoopUnroll logic could be more general about which exit count it uses. Or LoopRotate could be adjusted to perform a rotation in this case despite already having an exiting latch (there is already a heuristic for that, but it does not account for this case).

@nikic nikic added A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. I-slow Issue: Problems and improvements with respect to performance of generated code. labels Apr 17, 2021
@nikic nikic self-assigned this Jun 26, 2021
@nikic
Copy link
Contributor

nikic commented Jun 26, 2021

This issue should be addressed by https://reviews.llvm.org/D102982 (plus about a dozen preparatory patches to make loop unrolling robust against multiple exits).

@nikic
Copy link
Contributor

nikic commented Aug 22, 2021

This is fixed by #87570. New IR folds to true/false:

@_ZN5test28cmp_same17hf172c8bcc9b7859cE = unnamed_addr alias i1 (), i1 ()* @_ZN5test27cmp_ord17ha9ee7b72078df8e5E
@_ZN5test28cmp_ord217h23f0a5da1c84ecaaE = unnamed_addr alias i1 (), i1 ()* @_ZN5test27cmp_ord17ha9ee7b72078df8e5E

; test2::cmp_diff_order
; Function Attrs: mustprogress nofree norecurse nosync nounwind nonlazybind readnone uwtable willreturn
define zeroext i1 @_ZN5test214cmp_diff_order17hd0c16acee48eb39bE() unnamed_addr #0 personality i32 (i32, i32, i64, %"unwind::libunwind::_Unwind_Exception"*, %"unwind::libunwind::_Unwind_Context"*)* @rust_eh_personality {
start:
  ret i1 false
}

; test2::cmp_ord
; Function Attrs: mustprogress nofree norecurse nosync nounwind nonlazybind readnone uwtable willreturn
define zeroext i1 @_ZN5test27cmp_ord17ha9ee7b72078df8e5E() unnamed_addr #0 personality i32 (i32, i32, i64, %"unwind::libunwind::_Unwind_Exception"*, %"unwind::libunwind::_Unwind_Context"*)* @rust_eh_personality {
start:
  ret i1 true
}

@nikic nikic closed this as completed Aug 22, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. C-bug Category: This is a bug. I-slow Issue: Problems and improvements with respect to performance of generated code.
Projects
None yet
Development

No branches or pull requests

2 participants