RISC-V: Add CMO extension support #1

Merged 1 commit into ruyisdk:gcc-10.4.0 on Aug 28, 2023

Conversation

yulong18

No description provided.

Collaborator

@pz9115 pz9115 left a comment

LGTM, merged

@pz9115 pz9115 merged commit 46ea78c into ruyisdk:gcc-10.4.0 Aug 28, 2023
Liaoshihua pushed a commit to Liaoshihua/ruyi-gcc that referenced this pull request Feb 1, 2024
Hi all,

We noticed that calls to the vadcq and vsbcq intrinsics, both of
which use __builtin_arm_set_fpscr_nzcvqc to set the Carry flag in
the FPSCR, would produce the following code:

```
< r2 is the *carry input >
vmrs	r3, FPSCR_nzcvqc
bic	r3, r3, #536870912
orr	r3, r3, r2, lsl #29
vmsr	FPSCR_nzcvqc, r3
```

when the MVE ACLE instead gives a different instruction sequence of:
```
< Rt is the *carry input >
VMRS Rs,FPSCR_nzcvqc
BFI Rs,Rt,#29,#1
VMSR FPSCR_nzcvqc,Rs
```

The bic + orr pair is slower, and it's also wrong: if the *carry input
is greater than 1, we risk overwriting the top two bits of the FPSCR
register (the N and Z flags).

This turned out to be a problem in the header file and the solution was
to simply add a `& 0x1u` to the `*carry` input: then the compiler knows
that we only care about the lowest bit and can optimise to a BFI.
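
As a rough illustration (a minimal sketch with hypothetical names, not the
actual arm_mve.h code), masking the carry before inserting it into FPSCR
bit 29 looks like this:

```
#include <stdint.h>

/* Sketch only: 'fpscr' stands in for FPSCR_nzcvqc and 'carry' for the
   *carry input; neither name is taken from arm_mve.h.  */
static inline uint32_t
set_carry_bit (uint32_t fpscr, uint32_t carry)
{
  /* The & 0x1u tells the compiler that only the low bit of the carry
     is live, so bits 30/31 (Z and N) can never be clobbered and the
     update can be emitted as a single BFI into bit 29.  */
  return (fpscr & ~(1u << 29)) | ((carry & 0x1u) << 29);
}
```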

Ok for trunk?

Thanks,
Stam Markianos-Wright

gcc/ChangeLog:

	* config/arm/arm_mve.h (__arm_vadcq_s32): Fix arithmetic.
	(__arm_vadcq_u32): Likewise.
	(__arm_vadcq_m_s32): Likewise.
	(__arm_vadcq_m_u32): Likewise.
	(__arm_vsbcq_s32): Likewise.
	(__arm_vsbcq_u32): Likewise.
	(__arm_vsbcq_m_s32): Likewise.
	(__arm_vsbcq_m_u32): Likewise.
	* config/arm/mve.md (get_fpscr_nzcvqc): Make unspec_volatile.

gcc/testsuite/ChangeLog:
	* gcc.target/arm/mve/mve_vadcq_vsbcq_fpscr_overwrite.c: New.
Liaoshihua pushed a commit to Liaoshihua/ruyi-gcc that referenced this pull request Mar 11, 2024
This patch adjusts the costs so that we treat REG and SUBREG expressions the
same for costing.

This was motivated by bt_skip_func and bt_find_func in xz and results in nearly
a 5% improvement in the dynamic instruction count for input #2 and smaller, but
definitely visible improvements pretty much across the board.  Exceptions would
be perlbench input #1 and exchange2 which showed very small regressions.

In the bt_find_func and bt_skip_func cases we have  something like this:

> (insn 10 7 11 2 (set (reg/v:DI 136 [ x ])
>         (zero_extend:DI (subreg/s/u:SI (reg/v:DI 137 [ a ]) 0))) "zz.c":6:21 387 {*zero_extendsidi2_bitmanip}
>      (nil))
> (insn 11 10 12 2 (set (reg:DI 142 [ _1 ])
>         (plus:DI (reg/v:DI 136 [ x ])
>             (reg/v:DI 139 [ b ]))) "zz.c":7:23 5 {adddi3}
>      (nil))

[ ... ]> (insn 13 12 14 2 (set (reg:DI 143 [ _2 ])
>         (plus:DI (reg/v:DI 136 [ x ])
>             (reg/v:DI 141 [ c ]))) "zz.c":8:23 5 {adddi3}
>      (nil))

Note the two uses of (reg 136). The best way to handle that in combine might be
a 3->2 split.  But there's a much better approach if we look at fwprop...

(set (reg:DI 142 [ _1 ])
    (plus:DI (zero_extend:DI (subreg/s/u:SI (reg/v:DI 137 [ a ]) 0))
        (reg/v:DI 139 [ b ])))
change not profitable (cost 4 -> cost 8)

So that should be the same cost as a regular DImode addition when the ZBA
extension is enabled.  But it ends up costing more because the clause to cost
this variant isn't prepared to handle a SUBREG.  That results in the RTL above
having too high a cost and fwprop gives up.
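
As a point of reference, a small self-contained C function of the same shape
as the RTL above (a sketch with illustrative names, not code taken from xz)
would be:

#include <stdint.h>

/* One zero-extended SImode value (like reg 136) feeding two DImode
   additions, as in insns 10, 11 and 13 above.  */
uint64_t
use_zext_twice (uint32_t a, uint64_t b, uint64_t c, uint64_t *out)
{
  uint64_t x = a;   /* zero_extend:DI (subreg:SI ...) */
  *out = x + b;     /* first plus:DI */
  return x + c;     /* second plus:DI */
}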

One approach would be to replace the REG_P  with REG_P || SUBREG_P in the
costing code.  I ultimately decided against that and instead check if the
operand in question passes register_operand.

By far the most important case to handle is the DImode PLUS.  But for the sake
of consistency, I changed the other instances in riscv_rtx_costs as well.  For
those other cases we're talking about improvements in the .000001% range.

While we are into stage4, this just hits cost modeling which we've generally
agreed is still appropriate (though we were mostly talking about vector).  So
I'm going to extend that general agreement ever so slightly and include scalar
cost modeling :-)

gcc/
	* config/riscv/riscv.cc (riscv_rtx_costs): Handle SUBREG and REG
	similarly.

gcc/testsuite/

	* gcc.target/riscv/reg_subreg_costs.c: New test.

	Co-authored-by: Jivan Hakobyan <jivanhakobyan9@gmail.com>
Liaoshihua pushed a commit to Liaoshihua/ruyi-gcc that referenced this pull request Mar 11, 2024
Since the last import from upstream libsanitizer, the output has changed
and now looks more like this:

READ of size 6 at 0x7ff7beb2a144 thread T0
    #0 0x101cf7796 in MemcmpInterceptorCommon(void*, int (*)(void const*, void const*, unsigned long), void const*, void const*, unsigned long) sanitizer_common_interceptors.inc:813
    #1 0x101cf7b99 in memcmp sanitizer_common_interceptors.inc:840
    #2 0x108a0c39f in __stack_chk_guard+0xf (dyld:x86_64+0x8039f)

so let's adjust the pattern accordingly.
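
For illustration only (a hedged sketch, not the literal memcmp-1.c change),
the adjusted dg-output directive has to accept frame #1 resolving to memcmp
inside sanitizer_common_interceptors.inc, along the lines of:

/* { dg-output "    #1 0x\[0-9a-f\]+ +in memcmp" } */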

gcc/testsuite/ChangeLog:

	* c-c++-common/asan/memcmp-1.c: Adjust pattern on darwin.
Liaoshihua pushed a commit to Liaoshihua/ruyi-gcc that referenced this pull request Mar 11, 2024
During partial ordering, we want to look through dependent alias
template specializations within template arguments and otherwise
treat them as opaque in other contexts (see e.g. r7-7116-g0c942f3edab108
and r11-7011-g6e0a231a4aa240).  To that end template_args_equal was
given a partial_order flag that controls this behavior.  This flag
does the right thing when a dependent alias template specialization
appears as template argument of the partial specialization, e.g. in

  template<class T, class...> using first_t = T;
  template<class T> struct traits;
  template<class T> struct traits<first_t<T, T&>> { }; // #1
  template<class T> struct traits<first_t<const T, T&>> { }; // #2

we correctly consider #2 to be more specialized than #1.  But if the
alias specialization appears as a nested template argument of another
class template specialization, e.g. in

  template<class T> struct traits<A<first_t<T, T&>>> { }; // #1
  template<class T> struct traits<A<first_t<const T, T&>>> { }; // #2

then we incorrectly consider #1 and #2 to be unordered.  This is because

  1. we don't propagate the flag to recursive template_args_equal calls
  2. we don't use structural equality for class template specializations
     written in terms of dependent alias template specializations

This patch fixes the first issue by turning the partial_order flag into
a global.  This patch fixes the second issue by making us propagate
structural equality appropriately when building a class template
specialization.  In passing this patch also improves hashing of
specializations that use structural equality.

	PR c++/90679

gcc/cp/ChangeLog:

	* cp-tree.h (comp_template_args): Remove partial_order parameter.
	(template_args_equal): Likewise.
	* pt.cc (comparing_for_partial_ordering): New global flag.
	(iterative_hash_template_arg) <case tcc_type>: Hash the template
	and arguments for specializations that use structural equality.
	(template_args_equal): Remove partial order parameter and
	use comparing_for_partial_ordering instead.
	(comp_template_args): Likewise.
	(comp_template_args_porder): Set comparing_for_partial_ordering
	instead.  Make static.
	(any_template_arguments_need_structural_equality_p): Return true
	for an argument that's a dependent alias template specialization
	or a class template specialization that itself needs structural
	equality.
	* tree.cc (cp_tree_equal) <case TREE_VEC>: Adjust call to
	comp_template_args.

gcc/testsuite/ChangeLog:

	* g++.dg/cpp0x/alias-decl-75a.C: New test.
	* g++.dg/cpp0x/alias-decl-75b.C: New test.
Liaoshihua pushed a commit to Liaoshihua/ruyi-gcc that referenced this pull request Mar 11, 2024
Hi All,

This patch adds initial support for early break vectorization in GCC. In other
words it implements support for vectorization of loops with multiple exits.
The support is added for any target that implements a vector cbranch optab,
this includes both fully masked and non-masked targets.

Depending on the operation, the vectorizer may also require support for boolean
mask reductions using Inclusive OR/Bitwise AND.  This is however only checked
when the comparison would produce multiple statements.

This also fully decouples the vectorizer's notion of exit from the existing loop
infrastructure's exit.  Before this patch the vectorizer always picked the
natural loop latch connected exit as the main exit.

After this patch the vectorizer is free to choose any exit it deems appropriate
as the main exit.  This means that even if the main exit is not countable (i.e.
the termination condition could not be determined) we might still be able to
vectorize should one of the other exits be countable.

In such situations the loop is reflowed, which enables vectorization of many
other loop forms.

Concretely the kind of loops supported are of the forms:

 for (int i = 0; i < N; i++)
 {
   <statements1>
   if (<condition>)
     {
       ...
       <action>;
     }
   <statements2>
 }

where <action> can be:
 - break
 - return
 - goto

Any number of statements can be used before the <action> occurs.
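
For instance, a loop whose early <action> is a return (a hedged sketch with
illustrative names, not one of the patch's testcases) also fits this shape:

 #define N 1024
 unsigned a[N];

 int find_first_greater (unsigned key)
 {
   for (int i = 0; i < N; i++)
   {
     if (a[i] > key)   /* early exit taken mid-loop */
       return i;       /* <action> is a return */
   }
   return -1;
 }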

Since this is an initial version for GCC 14 it has the following limitations and
features:

- Only fixed sized iterations and buffers are supported.  That is to say any
  vectors loaded or stored must be to statically allocated arrays with known
  sizes. N must also be known (see the counter-example after this list).  This
  limitation is because our primary target for this optimization is SVE.  For
  VLA SVE we can't easily do cross page iteration checks. The result is likely
  to also not be beneficial. For that reason we punt support for variable
  buffers till we have First-Faulting support in GCC 15.
- any stores in <statements1> should not be to the same objects as in
  <condition>.  Loads are fine as long as they don't have the possibility to
  alias.  More concretely, we block RAW dependencies when the intermediate value
  can't be separated from the store, or the store itself can't be moved.
- Prologue peeling, alignment peeling and loop versioning are supported.
- Fully masked loops, unmasked loops and partially masked loops are supported.
- Any number of loop early exits are supported.
- No support for epilogue vectorization.  The only epilogue supported is the
  scalar final one.  Peeling code supports it but the code motion code cannot
  find instructions to make the move in the epilog.
- Early breaks are only supported for inner loop vectorization.
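
As the counter-example promised above (a hedged sketch with illustrative
names): a loop over a pointer with a runtime trip count falls outside this
initial version, since neither the buffer nor N is statically known.

 unsigned find_gt (unsigned *a, unsigned n, unsigned x)
 {
   /* Not handled by the initial version: 'a' is not a statically
      allocated array and 'n' is not a compile-time constant.  */
   for (unsigned i = 0; i < n; i++)
     if (a[i] > x)
       return i;
   return n;
 }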

With the help of IPA and LTO this still gets hit quite often.  During bootstrap
it hit rather frequently.  Additionally TSVC s332, s481 and s482 all pass now
since these are tests for support for early exit vectorization.

This implementation does not support completely handling the early break inside
the vector loop itself but instead supports adding checks such that if we know
that we have to exit in the current iteration then we branch to scalar code to
actually do the final VF iterations which handles all the code in <action>.

For the scalar loop we know that whatever exit you take you have to perform at
most VF iterations.  For vector code we only care about the state of fully
performed iterations and reset the scalar code to the (partially) remaining loop.

That is to say, the first vector loop executes so long as the early exit isn't
needed.  Once the exit is taken, the scalar code will perform at most VF extra
iterations.  The exact number depends on peeling and iteration start and which
exit was taken (natural or early).   For this scalar loop, all early exits are
treated the same.

When we vectorize we move any statement not related to the early break itself
and that would be incorrect to execute before the break (i.e. has side effects)
to after the break.  If this is not possible we decline to vectorize.  The
analysis and code motion also takes into account that it doesn't introduce a RAW
dependency after the move of the stores.

This means that we check at the start of iterations whether we are going to exit
or not.  During the analysis phase we check whether we are allowed to do this
moving of statements.  Also note that we only move the scalar statements, and
only do so after peeling, just before we start transforming statements.

With this the vector flow no longer necessarily needs to match that of the
scalar code.  In addition most of the infrastructure is in place to support
general control flow safely, however we are punting this to GCC 15.

Codegen:

for e.g.

unsigned vect_a[N];
unsigned vect_b[N];

unsigned test4(unsigned x)
{
 unsigned ret = 0;
 for (int i = 0; i < N; i++)
 {
   vect_b[i] = x + i;
   if (vect_a[i] > x)
     break;
   vect_a[i] = x;

 }
 return ret;
}

We generate for Adv. SIMD:

test4:
        adrp    x2, .LC0
        adrp    x3, .LANCHOR0
        dup     v2.4s, w0
        add     x3, x3, :lo12:.LANCHOR0
        movi    v4.4s, 0x4
        add     x4, x3, 3216
        ldr     q1, [x2, #:lo12:.LC0]
        mov     x1, 0
        mov     w2, 0
        .p2align 3,,7
.L3:
        ldr     q0, [x3, x1]
        add     v3.4s, v1.4s, v2.4s
        add     v1.4s, v1.4s, v4.4s
        cmhi    v0.4s, v0.4s, v2.4s
        umaxp   v0.4s, v0.4s, v0.4s
        fmov    x5, d0
        cbnz    x5, .L6
        add     w2, w2, 1
        str     q3, [x1, x4]
        str     q2, [x3, x1]
        add     x1, x1, 16
        cmp     w2, 200
        bne     .L3
        mov     w7, 3
.L2:
        lsl     w2, w2, 2
        add     x5, x3, 3216
        add     w6, w2, w0
        sxtw    x4, w2
        ldr     w1, [x3, x4, lsl 2]
        str     w6, [x5, x4, lsl 2]
        cmp     w0, w1
        bcc     .L4
        add     w1, w2, 1
        str     w0, [x3, x4, lsl 2]
        add     w6, w1, w0
        sxtw    x1, w1
        ldr     w4, [x3, x1, lsl 2]
        str     w6, [x5, x1, lsl 2]
        cmp     w0, w4
        bcc     .L4
        add     w4, w2, 2
        str     w0, [x3, x1, lsl 2]
        sxtw    x1, w4
        add     w6, w1, w0
        ldr     w4, [x3, x1, lsl 2]
        str     w6, [x5, x1, lsl 2]
        cmp     w0, w4
        bcc     .L4
        str     w0, [x3, x1, lsl 2]
        add     w2, w2, 3
        cmp     w7, 3
        beq     .L4
        sxtw    x1, w2
        add     w2, w2, w0
        ldr     w4, [x3, x1, lsl 2]
        str     w2, [x5, x1, lsl 2]
        cmp     w0, w4
        bcc     .L4
        str     w0, [x3, x1, lsl 2]
.L4:
        mov     w0, 0
        ret
        .p2align 2,,3
.L6:
        mov     w7, 4
        b       .L2

and for SVE:

test4:
        adrp    x2, .LANCHOR0
        add     x2, x2, :lo12:.LANCHOR0
        add     x5, x2, 3216
        mov     x3, 0
        mov     w1, 0
        cntw    x4
        mov     z1.s, w0
        index   z0.s, #0, #1
        ptrue   p1.b, all
        ptrue   p0.s, all
        .p2align 3,,7
.L3:
        ld1w    z2.s, p1/z, [x2, x3, lsl 2]
        add     z3.s, z0.s, z1.s
        cmplo   p2.s, p0/z, z1.s, z2.s
        b.any   .L2
        st1w    z3.s, p1, [x5, x3, lsl 2]
        add     w1, w1, 1
        st1w    z1.s, p1, [x2, x3, lsl 2]
        add     x3, x3, x4
        incw    z0.s
        cmp     w3, 803
        bls     .L3
.L5:
        mov     w0, 0
        ret
        .p2align 2,,3
.L2:
        cntw    x5
        mul     w1, w1, w5
        cbz     w5, .L5
        sxtw    x1, w1
        sub     w5, w5, #1
        add     x5, x5, x1
        add     x6, x2, 3216
        b       .L6
        .p2align 2,,3
.L14:
        str     w0, [x2, x1, lsl 2]
        cmp     x1, x5
        beq     .L5
        mov     x1, x4
.L6:
        ldr     w3, [x2, x1, lsl 2]
        add     w4, w0, w1
        str     w4, [x6, x1, lsl 2]
        add     x4, x1, 1
        cmp     w0, w3
        bcs     .L14
        mov     w0, 0
        ret

On the workloads this work is based on we see between 2-3x performance uplift
using this patch.

Follow up plan:
 - Boolean vectorization has several shortcomings.  I've filed PR110223 with the
   bigger ones that cause vectorization to fail with this patch.
 - SLP support.  This is planned for GCC 15 as for majority of the cases build
   SLP itself fails.  This means I'll need to spend time in making this more
   robust first.  Additionally it requires:
     * Adding support for vectorizing CFG (gconds)
     * Support for CFG to differ between vector and scalar loops.
   Both of which would be disruptive to the tree and I suspect I'll be handling
   fallouts from this patch for a while.  So I plan to work on the surrounding
   building blocks first for the remainder of the year.

Additionally it also contains reduced cases from issues found running over
various codebases.

Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.

Also regtested with:
 -march=armv8.3-a+sve
 -march=armv8.3-a+nosve
 -march=armv9-a
 -mcpu=neoverse-v1
 -mcpu=neoverse-n2

Bootstrapped Regtested x86_64-pc-linux-gnu and no issues.
Bootstrap and Regtest on arm-none-linux-gnueabihf and no issues.

gcc/ChangeLog:

	* tree-if-conv.cc (idx_within_array_bound): Expose.
	* tree-vect-data-refs.cc (vect_analyze_early_break_dependences): New.
	(vect_analyze_data_ref_dependences): Use it.
	* tree-vect-loop-manip.cc (vect_iv_increment_position): New.
	(vect_set_loop_controls_directly,
	vect_set_loop_condition_partial_vectors,
	vect_set_loop_condition_partial_vectors_avx512,
	vect_set_loop_condition_normal): Support multiple exits.
	(slpeel_tree_duplicate_loop_to_edge_cfg): Support LCSSA peeling for
	multiple exits.
	(slpeel_can_duplicate_loop_p): Change vectorizer from looking at BB
	count and instead look at loop shape.
	(vect_update_ivs_after_vectorizer): Drop asserts.
	(vect_gen_vector_loop_niters_mult_vf): Support peeled vector iterations.
	(vect_do_peeling): Support multiple exits.
	(vect_loop_versioning): Likewise.
	* tree-vect-loop.cc (_loop_vec_info::_loop_vec_info): Initialise
	early_breaks.
	(vect_analyze_loop_form): Support loop flows with more than single BB
	loop body.
	(vect_create_loop_vinfo): Support niters analysis for multiple exits.
	(vect_analyze_loop): Likewise.
	(vect_get_vect_def): New.
	(vect_create_epilog_for_reduction): Support early exit reductions.
	(vectorizable_live_operation_1): New.
	(find_connected_edge): New.
	(vectorizable_live_operation): Support early exit live operations.
	(move_early_exit_stmts): New.
	(vect_transform_loop): Use it.
	* tree-vect-patterns.cc (vect_init_pattern_stmt): Support gcond.
	(vect_recog_bitfield_ref_pattern): Support gconds and bools.
	(vect_recog_gcond_pattern): New.
	(possible_vector_mask_operation_p): Support gcond masks.
	(vect_determine_mask_precision): Likewise.
	(vect_mark_pattern_stmts): Set gcond def type.
	(can_vectorize_live_stmts): Force early break inductions to be live.
	* tree-vect-stmts.cc (vect_stmt_relevant_p): Add relevancy analysis for
	early breaks.
	(vect_mark_stmts_to_be_vectorized): Process gcond usage.
	(perm_mask_for_reverse): Expose.
	(vectorizable_comparison_1): New.
	(vectorizable_early_exit): New.
	(vect_analyze_stmt): Support early break and gcond.
	(vect_transform_stmt): Likewise.
	(vect_is_simple_use): Likewise.
	(vect_get_vector_types_for_stmt): Likewise.
	* tree-vectorizer.cc (pass_vectorize::execute): Update exits for value
	numbering.
	* tree-vectorizer.h (enum vect_def_type): Add vect_condition_def.
	(LOOP_VINFO_EARLY_BREAKS, LOOP_VINFO_EARLY_BRK_STORES,
	LOOP_VINFO_EARLY_BREAKS_VECT_PEELED, LOOP_VINFO_EARLY_BRK_DEST_BB,
	LOOP_VINFO_EARLY_BRK_VUSES): New.
	(is_loop_header_bb_p): Drop assert.
	(class loop): Add early_breaks, early_break_stores, early_break_dest_bb,
	early_break_vuses.
	(vect_iv_increment_position, perm_mask_for_reverse,
	ref_within_array_bound): New.
	(slpeel_tree_duplicate_loop_to_edge_cfg): Update for early breaks.
Liaoshihua pushed a commit to Liaoshihua/ruyi-gcc that referenced this pull request Mar 11, 2024
This patch adjusts the costs so that we treat REG and SUBREG expressions the
same for costing.

This was motivated by bt_skip_func and bt_find_func in xz and results in nearly
a 5% improvement in the dynamic instruction count for input #2 and smaller, but
definitely visible improvements pretty much across the board.  Exceptions would
be perlbench input ruyisdk#1 and exchange2 which showed very small regressions.

In the bt_find_func and bt_skip_func cases we have  something like this:

> (insn 10 7 11 2 (set (reg/v:DI 136 [ x ])
>         (zero_extend:DI (subreg/s/u:SI (reg/v:DI 137 [ a ]) 0))) "zz.c":6:21 387 {*zero_extendsidi2_bitmanip}
>      (nil))
> (insn 11 10 12 2 (set (reg:DI 142 [ _1 ])
>         (plus:DI (reg/v:DI 136 [ x ])
>             (reg/v:DI 139 [ b ]))) "zz.c":7:23 5 {adddi3}
>      (nil))

[ ... ]> (insn 13 12 14 2 (set (reg:DI 143 [ _2 ])
>         (plus:DI (reg/v:DI 136 [ x ])
>             (reg/v:DI 141 [ c ]))) "zz.c":8:23 5 {adddi3}
>      (nil))

Note the two uses of (reg 136). The best way to handle that in combine might be
a 3->2 split.  But there's a much better approach if we look at fwprop...

(set (reg:DI 142 [ _1 ])
    (plus:DI (zero_extend:DI (subreg/s/u:SI (reg/v:DI 137 [ a ]) 0))
        (reg/v:DI 139 [ b ])))
change not profitable (cost 4 -> cost 8)

So that should be the same cost as a regular DImode addition when the ZBA
extension is enabled.  But it ends up costing more because the clause to cost
this variant isn't prepared to handle a SUBREG.  That results in the RTL above
having too high a cost and fwprop gives up.

One approach would be to replace the REG_P  with REG_P || SUBREG_P in the
costing code.  I ultimately decided against that and instead check if the
operand in question passes register_operand.

By far the most important case to handle is the DImode PLUS.  But for the sake
of consistency, I changed the other instances in riscv_rtx_costs as well.  For
those other cases we're talking about improvements in the .000001% range.

While we are into stage4, this just hits cost modeling which we've generally
agreed is still appropriate (though we were mostly talking about vector).  So
I'm going to extend that general agreement ever so slightly and include scalar
cost modeling :-)

gcc/
	* config/riscv/riscv.cc (riscv_rtx_costs): Handle SUBREG and REG
	similarly.

gcc/testsuite/

	* gcc.target/riscv/reg_subreg_costs.c: New test.

	Co-authored-by: Jivan Hakobyan <jivanhakobyan9@gmail.com>
Liaoshihua pushed a commit to Liaoshihua/ruyi-gcc that referenced this pull request Mar 11, 2024
Since the last import from upstream libsanitizer, the output has changed
and now looks more like this:

READ of size 6 at 0x7ff7beb2a144 thread T0
    #0 0x101cf7796 in MemcmpInterceptorCommon(void*, int (*)(void const*, void const*, unsigned long), void const*, void const*, unsigned long) sanitizer_common_interceptors.inc:813
    ruyisdk#1 0x101cf7b99 in memcmp sanitizer_common_interceptors.inc:840
    #2 0x108a0c39f in __stack_chk_guard+0xf (dyld:x86_64+0x8039f)

so let's adjust the pattern accordingly.

gcc/testsuite/ChangeLog:

	* c-c++-common/asan/memcmp-1.c: Adjust pattern on darwin.
Liaoshihua pushed a commit to Liaoshihua/ruyi-gcc that referenced this pull request Mar 11, 2024
During partial ordering, we want to look through dependent alias
template specializations within template arguments and otherwise
treat them as opaque in other contexts (see e.g. r7-7116-g0c942f3edab108
and r11-7011-g6e0a231a4aa240).  To that end template_args_equal was
given a partial_order flag that controls this behavior.  This flag
does the right thing when a dependent alias template specialization
appears as template argument of the partial specialization, e.g. in

  template<class T, class...> using first_t = T;
  template<class T> struct traits;
  template<class T> struct traits<first_t<T, T&>> { }; // ruyisdk#1
  template<class T> struct traits<first_t<const T, T&>> { }; // #2

we correctly consider #2 to be more specialized than ruyisdk#1.  But if the
alias specialization appears as a nested template argument of another
class template specialization, e.g. in

  template<class T> struct traits<A<first_t<T, T&>>> { }; // ruyisdk#1
  template<class T> struct traits<A<first_t<const T, T&>>> { }; // #2

then we incorrectly consider ruyisdk#1 and #2 to be unordered.  This is because

  1. we don't propagate the flag to recursive template_args_equal calls
  2. we don't use structural equality for class template specializations
     written in terms of dependent alias template specializations

This patch fixes the first issue by turning the partial_order flag into
a global.  This patch fixes the second issue by making us propagate
structural equality appropriately when building a class template
specialization.  In passing this patch also improves hashing of
specializations that use structural equality.

	PR c++/90679

gcc/cp/ChangeLog:

	* cp-tree.h (comp_template_args): Remove partial_order parameter.
	(template_args_equal): Likewise.
	* pt.cc (comparing_for_partial_ordering): New global flag.
	(iterative_hash_template_arg) <case tcc_type>: Hash the template
	and arguments for specializations that use structural equality.
	(template_args_equal): Remove partial order parameter and
	use comparing_for_partial_ordering instead.
	(comp_template_args): Likewise.
	(comp_template_args_porder): Set comparing_for_partial_ordering
	instead.  Make static.
	(any_template_arguments_need_structural_equality_p): Return true
	for an argument that's a dependent alias template specialization
	or a class template specialization that itself needs structural
	equality.
	* tree.cc (cp_tree_equal) <case TREE_VEC>: Adjust call to
	comp_template_args.

gcc/testsuite/ChangeLog:

	* g++.dg/cpp0x/alias-decl-75a.C: New test.
	* g++.dg/cpp0x/alias-decl-75b.C: New test.
Liaoshihua pushed a commit to Liaoshihua/ruyi-gcc that referenced this pull request Mar 11, 2024
Hi All,

This patch adds initial support for early break vectorization in GCC. In other
words it implements support for vectorization of loops with multiple exits.
The support is added for any target that implements a vector cbranch optab,
this includes both fully masked and non-masked targets.

Depending on the operation, the vectorizer may also require support for boolean
mask reductions using Inclusive OR/Bitwise AND.  This is however only checked
then the comparison would produce multiple statements.

This also fully decouples the vectorizer's notion of exit from the existing loop
infrastructure's exit.  Before this patch the vectorizer always picked the
natural loop latch connected exit as the main exit.

After this patch the vectorizer is free to choose any exit it deems appropriate
as the main exit.  This means that even if the main exit is not countable (i.e.
the termination condition could not be determined) we might still be able to
vectorize should one of the other exits be countable.

In such situations the loop is reflowed which enabled vectorization of many
other loop forms.

Concretely the kind of loops supported are of the forms:

 for (int i = 0; i < N; i++)
 {
   <statements1>
   if (<condition>)
     {
       ...
       <action>;
     }
   <statements2>
 }

where <action> can be:
 - break
 - return
 - goto

Any number of statements can be used before the <action> occurs.

Since this is an initial version for GCC 14 it has the following limitations and
features:

- Only fixed sized iterations and buffers are supported.  That is to say any
  vectors loaded or stored must be to statically allocated arrays with known
  sizes. N must also be known.  This limitation is because our primary target
  for this optimization is SVE.  For VLA SVE we can't easily do cross page
  iteraion checks. The result is likely to also not be beneficial. For that
  reason we punt support for variable buffers till we have First-Faulting
  support in GCC 15.
- any stores in <statements1> should not be to the same objects as in
  <condition>.  Loads are fine as long as they don't have the possibility to
  alias.  More concretely, we block RAW dependencies when the intermediate value
  can't be separated fromt the store, or the store itself can't be moved.
- Prologue peeling, alignment peelinig and loop versioning are supported.
- Fully masked loops, unmasked loops and partially masked loops are supported
- Any number of loop early exits are supported.
- No support for epilogue vectorization.  The only epilogue supported is the
  scalar final one.  Peeling code supports it but the code motion code cannot
  find instructions to make the move in the epilog.
- Early breaks are only supported for inner loop vectorization.

With the help of IPA and LTO this still gets hit quite often.  During bootstrap
it hit rather frequently.  Additionally TSVC s332, s481 and s482 all pass now
since these are tests for support for early exit vectorization.

This implementation does not support completely handling the early break inside
the vector loop itself but instead supports adding checks such that if we know
that we have to exit in the current iteration then we branch to scalar code to
actually do the final VF iterations which handles all the code in <action>.

For the scalar loop we know that whatever exit you take you have to perform at
most VF iterations.  For vector code we only case about the state of fully
performed iteration and reset the scalar code to the (partially) remaining loop.

That is to say, the first vector loop executes so long as the early exit isn't
needed.  Once the exit is taken, the scalar code will perform at most VF extra
iterations.  The exact number depending on peeling and iteration start and which
exit was taken (natural or early).   For this scalar loop, all early exits are
treated the same.

When we vectorize we move any statement not related to the early break itself
and that would be incorrect to execute before the break (i.e. has side effects)
to after the break.  If this is not possible we decline to vectorize.  The
analysis and code motion also takes into account that it doesn't introduce a RAW
dependency after the move of the stores.

This means that we check at the start of iterations whether we are going to exit
or not.  During the analyis phase we check whether we are allowed to do this
moving of statements.  Also note that we only move the scalar statements, but
only do so after peeling but just before we start transforming statements.

With this the vector flow no longer necessarily needs to match that of the
scalar code.  In addition most of the infrastructure is in place to support
general control flow safely, however we are punting this to GCC 15.

Codegen:

for e.g.

unsigned vect_a[N];
unsigned vect_b[N];

unsigned test4(unsigned x)
{
 unsigned ret = 0;
 for (int i = 0; i < N; i++)
 {
   vect_b[i] = x + i;
   if (vect_a[i] > x)
     break;
   vect_a[i] = x;

 }
 return ret;
}

We generate for Adv. SIMD:

test4:
        adrp    x2, .LC0
        adrp    x3, .LANCHOR0
        dup     v2.4s, w0
        add     x3, x3, :lo12:.LANCHOR0
        movi    v4.4s, 0x4
        add     x4, x3, 3216
        ldr     q1, [x2, #:lo12:.LC0]
        mov     x1, 0
        mov     w2, 0
        .p2align 3,,7
.L3:
        ldr     q0, [x3, x1]
        add     v3.4s, v1.4s, v2.4s
        add     v1.4s, v1.4s, v4.4s
        cmhi    v0.4s, v0.4s, v2.4s
        umaxp   v0.4s, v0.4s, v0.4s
        fmov    x5, d0
        cbnz    x5, .L6
        add     w2, w2, 1
        str     q3, [x1, x4]
        str     q2, [x3, x1]
        add     x1, x1, 16
        cmp     w2, 200
        bne     .L3
        mov     w7, 3
.L2:
        lsl     w2, w2, 2
        add     x5, x3, 3216
        add     w6, w2, w0
        sxtw    x4, w2
        ldr     w1, [x3, x4, lsl 2]
        str     w6, [x5, x4, lsl 2]
        cmp     w0, w1
        bcc     .L4
        add     w1, w2, 1
        str     w0, [x3, x4, lsl 2]
        add     w6, w1, w0
        sxtw    x1, w1
        ldr     w4, [x3, x1, lsl 2]
        str     w6, [x5, x1, lsl 2]
        cmp     w0, w4
        bcc     .L4
        add     w4, w2, 2
        str     w0, [x3, x1, lsl 2]
        sxtw    x1, w4
        add     w6, w1, w0
        ldr     w4, [x3, x1, lsl 2]
        str     w6, [x5, x1, lsl 2]
        cmp     w0, w4
        bcc     .L4
        str     w0, [x3, x1, lsl 2]
        add     w2, w2, 3
        cmp     w7, 3
        beq     .L4
        sxtw    x1, w2
        add     w2, w2, w0
        ldr     w4, [x3, x1, lsl 2]
        str     w2, [x5, x1, lsl 2]
        cmp     w0, w4
        bcc     .L4
        str     w0, [x3, x1, lsl 2]
.L4:
        mov     w0, 0
        ret
        .p2align 2,,3
.L6:
        mov     w7, 4
        b       .L2

and for SVE:

test4:
        adrp    x2, .LANCHOR0
        add     x2, x2, :lo12:.LANCHOR0
        add     x5, x2, 3216
        mov     x3, 0
        mov     w1, 0
        cntw    x4
        mov     z1.s, w0
        index   z0.s, #0, ruyisdk#1
        ptrue   p1.b, all
        ptrue   p0.s, all
        .p2align 3,,7
.L3:
        ld1w    z2.s, p1/z, [x2, x3, lsl 2]
        add     z3.s, z0.s, z1.s
        cmplo   p2.s, p0/z, z1.s, z2.s
        b.any   .L2
        st1w    z3.s, p1, [x5, x3, lsl 2]
        add     w1, w1, 1
        st1w    z1.s, p1, [x2, x3, lsl 2]
        add     x3, x3, x4
        incw    z0.s
        cmp     w3, 803
        bls     .L3
.L5:
        mov     w0, 0
        ret
        .p2align 2,,3
.L2:
        cntw    x5
        mul     w1, w1, w5
        cbz     w5, .L5
        sxtw    x1, w1
        sub     w5, w5, ruyisdk#1
        add     x5, x5, x1
        add     x6, x2, 3216
        b       .L6
        .p2align 2,,3
.L14:
        str     w0, [x2, x1, lsl 2]
        cmp     x1, x5
        beq     .L5
        mov     x1, x4
.L6:
        ldr     w3, [x2, x1, lsl 2]
        add     w4, w0, w1
        str     w4, [x6, x1, lsl 2]
        add     x4, x1, 1
        cmp     w0, w3
        bcs     .L14
        mov     w0, 0
        ret

On the workloads this work is based on we see between 2-3x performance uplift
using this patch.

Follow up plan:
 - Boolean vectorization has several shortcomings.  I've filed PR110223 with the
   bigger ones that cause vectorization to fail with this patch.
 - SLP support.  This is planned for GCC 15 as for majority of the cases build
   SLP itself fails.  This means I'll need to spend time in making this more
   robust first.  Additionally it requires:
     * Adding support for vectorizing CFG (gconds)
     * Support for CFG to differ between vector and scalar loops.
   Both of which would be disruptive to the tree and I suspect I'll be handling
   fallouts from this patch for a while.  So I plan to work on the surrounding
   building blocks first for the remainder of the year.

Additionally it also contains reduced cases from issues found running over
various codebases.

Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.

Also regtested with:
 -march=armv8.3-a+sve
 -march=armv8.3-a+nosve
 -march=armv9-a
 -mcpu=neoverse-v1
 -mcpu=neoverse-n2

Bootstrapped Regtested x86_64-pc-linux-gnu and no issues.
Bootstrap and Regtest on arm-none-linux-gnueabihf and no issues.

gcc/ChangeLog:

	* tree-if-conv.cc (idx_within_array_bound): Expose.
	* tree-vect-data-refs.cc (vect_analyze_early_break_dependences): New.
	(vect_analyze_data_ref_dependences): Use it.
	* tree-vect-loop-manip.cc (vect_iv_increment_position): New.
	(vect_set_loop_controls_directly,
	vect_set_loop_condition_partial_vectors,
	vect_set_loop_condition_partial_vectors_avx512,
	vect_set_loop_condition_normal): Support multiple exits.
	(slpeel_tree_duplicate_loop_to_edge_cfg): Support LCSAA peeling for
	multiple exits.
	(slpeel_can_duplicate_loop_p): Change vectorizer from looking at BB
	count and instead look at loop shape.
	(vect_update_ivs_after_vectorizer): Drop asserts.
	(vect_gen_vector_loop_niters_mult_vf): Support peeled vector iterations.
	(vect_do_peeling): Support multiple exits.
	(vect_loop_versioning): Likewise.
	* tree-vect-loop.cc (_loop_vec_info::_loop_vec_info): Initialise
	early_breaks.
	(vect_analyze_loop_form): Support loop flows with more than single BB
	loop body.
	(vect_create_loop_vinfo): Support niters analysis for multiple exits.
	(vect_analyze_loop): Likewise.
	(vect_get_vect_def): New.
	(vect_create_epilog_for_reduction): Support early exit reductions.
	(vectorizable_live_operation_1): New.
	(find_connected_edge): New.
	(vectorizable_live_operation): Support early exit live operations.
	(move_early_exit_stmts): New.
	(vect_transform_loop): Use it.
	* tree-vect-patterns.cc (vect_init_pattern_stmt): Support gcond.
	(vect_recog_bitfield_ref_pattern): Support gconds and bools.
	(vect_recog_gcond_pattern): New.
	(possible_vector_mask_operation_p): Support gcond masks.
	(vect_determine_mask_precision): Likewise.
	(vect_mark_pattern_stmts): Set gcond def type.
	(can_vectorize_live_stmts): Force early break inductions to be live.
	* tree-vect-stmts.cc (vect_stmt_relevant_p): Add relevancy analysis for
	early breaks.
	(vect_mark_stmts_to_be_vectorized): Process gcond usage.
	(perm_mask_for_reverse): Expose.
	(vectorizable_comparison_1): New.
	(vectorizable_early_exit): New.
	(vect_analyze_stmt): Support early break and gcond.
	(vect_transform_stmt): Likewise.
	(vect_is_simple_use): Likewise.
	(vect_get_vector_types_for_stmt): Likewise.
	* tree-vectorizer.cc (pass_vectorize::execute): Update exits for value
	numbering.
	* tree-vectorizer.h (enum vect_def_type): Add vect_condition_def.
	(LOOP_VINFO_EARLY_BREAKS, LOOP_VINFO_EARLY_BRK_STORES,
	LOOP_VINFO_EARLY_BREAKS_VECT_PEELED, LOOP_VINFO_EARLY_BRK_DEST_BB,
	LOOP_VINFO_EARLY_BRK_VUSES): New.
	(is_loop_header_bb_p): Drop assert.
	(class loop): Add early_breaks, early_break_stores, early_break_dest_bb,
	early_break_vuses.
	(vect_iv_increment_position, perm_mask_for_reverse,
	ref_within_array_bound): New.
	(slpeel_tree_duplicate_loop_to_edge_cfg): Update for early breaks.
Liaoshihua pushed a commit to Liaoshihua/ruyi-gcc that referenced this pull request Mar 11, 2024
This patch adjusts the costs so that we treat REG and SUBREG expressions the
same for costing.

This was motivated by bt_skip_func and bt_find_func in xz and results in nearly
a 5% improvement in the dynamic instruction count for input #2 and smaller, but
definitely visible improvements pretty much across the board.  Exceptions would
be perlbench input ruyisdk#1 and exchange2 which showed very small regressions.

In the bt_find_func and bt_skip_func cases we have  something like this:

> (insn 10 7 11 2 (set (reg/v:DI 136 [ x ])
>         (zero_extend:DI (subreg/s/u:SI (reg/v:DI 137 [ a ]) 0))) "zz.c":6:21 387 {*zero_extendsidi2_bitmanip}
>      (nil))
> (insn 11 10 12 2 (set (reg:DI 142 [ _1 ])
>         (plus:DI (reg/v:DI 136 [ x ])
>             (reg/v:DI 139 [ b ]))) "zz.c":7:23 5 {adddi3}
>      (nil))

[ ... ]> (insn 13 12 14 2 (set (reg:DI 143 [ _2 ])
>         (plus:DI (reg/v:DI 136 [ x ])
>             (reg/v:DI 141 [ c ]))) "zz.c":8:23 5 {adddi3}
>      (nil))

Note the two uses of (reg 136). The best way to handle that in combine might be
a 3->2 split.  But there's a much better approach if we look at fwprop...

(set (reg:DI 142 [ _1 ])
    (plus:DI (zero_extend:DI (subreg/s/u:SI (reg/v:DI 137 [ a ]) 0))
        (reg/v:DI 139 [ b ])))
change not profitable (cost 4 -> cost 8)

So that should be the same cost as a regular DImode addition when the ZBA
extension is enabled.  But it ends up costing more because the clause to cost
this variant isn't prepared to handle a SUBREG.  That results in the RTL above
having too high a cost and fwprop gives up.

One approach would be to replace the REG_P  with REG_P || SUBREG_P in the
costing code.  I ultimately decided against that and instead check if the
operand in question passes register_operand.

By far the most important case to handle is the DImode PLUS.  But for the sake
of consistency, I changed the other instances in riscv_rtx_costs as well.  For
those other cases we're talking about improvements in the .000001% range.

While we are into stage4, this just hits cost modeling which we've generally
agreed is still appropriate (though we were mostly talking about vector).  So
I'm going to extend that general agreement ever so slightly and include scalar
cost modeling :-)

gcc/
	* config/riscv/riscv.cc (riscv_rtx_costs): Handle SUBREG and REG
	similarly.

gcc/testsuite/

	* gcc.target/riscv/reg_subreg_costs.c: New test.

	Co-authored-by: Jivan Hakobyan <jivanhakobyan9@gmail.com>
Liaoshihua pushed a commit to Liaoshihua/ruyi-gcc that referenced this pull request Mar 13, 2024
This patch adjusts the costs so that we treat REG and SUBREG expressions the
same for costing.

This was motivated by bt_skip_func and bt_find_func in xz and results in nearly
a 5% improvement in the dynamic instruction count for input #2 and smaller, but
definitely visible improvements pretty much across the board.  Exceptions would
be perlbench input ruyisdk#1 and exchange2 which showed very small regressions.

In the bt_find_func and bt_skip_func cases we have  something like this:

> (insn 10 7 11 2 (set (reg/v:DI 136 [ x ])
>         (zero_extend:DI (subreg/s/u:SI (reg/v:DI 137 [ a ]) 0))) "zz.c":6:21 387 {*zero_extendsidi2_bitmanip}
>      (nil))
> (insn 11 10 12 2 (set (reg:DI 142 [ _1 ])
>         (plus:DI (reg/v:DI 136 [ x ])
>             (reg/v:DI 139 [ b ]))) "zz.c":7:23 5 {adddi3}
>      (nil))

[ ... ]> (insn 13 12 14 2 (set (reg:DI 143 [ _2 ])
>         (plus:DI (reg/v:DI 136 [ x ])
>             (reg/v:DI 141 [ c ]))) "zz.c":8:23 5 {adddi3}
>      (nil))

Note the two uses of (reg 136). The best way to handle that in combine might be
a 3->2 split.  But there's a much better approach if we look at fwprop...

(set (reg:DI 142 [ _1 ])
    (plus:DI (zero_extend:DI (subreg/s/u:SI (reg/v:DI 137 [ a ]) 0))
        (reg/v:DI 139 [ b ])))
change not profitable (cost 4 -> cost 8)

So that should be the same cost as a regular DImode addition when the ZBA
extension is enabled.  But it ends up costing more because the clause to cost
this variant isn't prepared to handle a SUBREG.  That results in the RTL above
having too high a cost and fwprop gives up.

One approach would be to replace the REG_P  with REG_P || SUBREG_P in the
costing code.  I ultimately decided against that and instead check if the
operand in question passes register_operand.

By far the most important case to handle is the DImode PLUS.  But for the sake
of consistency, I changed the other instances in riscv_rtx_costs as well.  For
those other cases we're talking about improvements in the .000001% range.

While we are into stage4, this just hits cost modeling which we've generally
agreed is still appropriate (though we were mostly talking about vector).  So
I'm going to extend that general agreement ever so slightly and include scalar
cost modeling :-)

gcc/
	* config/riscv/riscv.cc (riscv_rtx_costs): Handle SUBREG and REG
	similarly.

gcc/testsuite/

	* gcc.target/riscv/reg_subreg_costs.c: New test.

	Co-authored-by: Jivan Hakobyan <jivanhakobyan9@gmail.com>
Liaoshihua pushed a commit to Liaoshihua/ruyi-gcc that referenced this pull request Mar 13, 2024
This patch adjusts the costs so that we treat REG and SUBREG expressions the
same for costing.

This was motivated by bt_skip_func and bt_find_func in xz and results in nearly
a 5% improvement in the dynamic instruction count for input #2 and smaller, but
definitely visible improvements pretty much across the board.  Exceptions would
be perlbench input ruyisdk#1 and exchange2 which showed very small regressions.

In the bt_find_func and bt_skip_func cases we have  something like this:

> (insn 10 7 11 2 (set (reg/v:DI 136 [ x ])
>         (zero_extend:DI (subreg/s/u:SI (reg/v:DI 137 [ a ]) 0))) "zz.c":6:21 387 {*zero_extendsidi2_bitmanip}
>      (nil))
> (insn 11 10 12 2 (set (reg:DI 142 [ _1 ])
>         (plus:DI (reg/v:DI 136 [ x ])
>             (reg/v:DI 139 [ b ]))) "zz.c":7:23 5 {adddi3}
>      (nil))

[ ... ]> (insn 13 12 14 2 (set (reg:DI 143 [ _2 ])
>         (plus:DI (reg/v:DI 136 [ x ])
>             (reg/v:DI 141 [ c ]))) "zz.c":8:23 5 {adddi3}
>      (nil))

Note the two uses of (reg 136). The best way to handle that in combine might be
a 3->2 split.  But there's a much better approach if we look at fwprop...

(set (reg:DI 142 [ _1 ])
    (plus:DI (zero_extend:DI (subreg/s/u:SI (reg/v:DI 137 [ a ]) 0))
        (reg/v:DI 139 [ b ])))
change not profitable (cost 4 -> cost 8)

So that should be the same cost as a regular DImode addition when the ZBA
extension is enabled.  But it ends up costing more because the clause to cost
this variant isn't prepared to handle a SUBREG.  That results in the RTL above
having too high a cost and fwprop gives up.

One approach would be to replace the REG_P  with REG_P || SUBREG_P in the
costing code.  I ultimately decided against that and instead check if the
operand in question passes register_operand.

By far the most important case to handle is the DImode PLUS.  But for the sake
of consistency, I changed the other instances in riscv_rtx_costs as well.  For
those other cases we're talking about improvements in the .000001% range.

While we are into stage4, this just hits cost modeling which we've generally
agreed is still appropriate (though we were mostly talking about vector).  So
I'm going to extend that general agreement ever so slightly and include scalar
cost modeling :-)

gcc/
	* config/riscv/riscv.cc (riscv_rtx_costs): Handle SUBREG and REG
	similarly.

gcc/testsuite/

	* gcc.target/riscv/reg_subreg_costs.c: New test.

	Co-authored-by: Jivan Hakobyan <jivanhakobyan9@gmail.com>
Liaoshihua pushed a commit to Liaoshihua/ruyi-gcc that referenced this pull request Mar 13, 2024
This patch adjusts the costs so that we treat REG and SUBREG expressions the
same for costing.

This was motivated by bt_skip_func and bt_find_func in xz and results in nearly
a 5% improvement in the dynamic instruction count for input #2 and smaller, but
definitely visible improvements pretty much across the board.  Exceptions would
be perlbench input ruyisdk#1 and exchange2 which showed very small regressions.

In the bt_find_func and bt_skip_func cases we have  something like this:

> (insn 10 7 11 2 (set (reg/v:DI 136 [ x ])
>         (zero_extend:DI (subreg/s/u:SI (reg/v:DI 137 [ a ]) 0))) "zz.c":6:21 387 {*zero_extendsidi2_bitmanip}
>      (nil))
> (insn 11 10 12 2 (set (reg:DI 142 [ _1 ])
>         (plus:DI (reg/v:DI 136 [ x ])
>             (reg/v:DI 139 [ b ]))) "zz.c":7:23 5 {adddi3}
>      (nil))

[ ... ]> (insn 13 12 14 2 (set (reg:DI 143 [ _2 ])
>         (plus:DI (reg/v:DI 136 [ x ])
>             (reg/v:DI 141 [ c ]))) "zz.c":8:23 5 {adddi3}
>      (nil))

Note the two uses of (reg 136). The best way to handle that in combine might be
a 3->2 split.  But there's a much better approach if we look at fwprop...

(set (reg:DI 142 [ _1 ])
    (plus:DI (zero_extend:DI (subreg/s/u:SI (reg/v:DI 137 [ a ]) 0))
        (reg/v:DI 139 [ b ])))
change not profitable (cost 4 -> cost 8)

So that should be the same cost as a regular DImode addition when the ZBA
extension is enabled.  But it ends up costing more because the clause to cost
this variant isn't prepared to handle a SUBREG.  That results in the RTL above
having too high a cost and fwprop gives up.

One approach would be to replace the REG_P  with REG_P || SUBREG_P in the
costing code.  I ultimately decided against that and instead check if the
operand in question passes register_operand.

By far the most important case to handle is the DImode PLUS.  But for the sake
of consistency, I changed the other instances in riscv_rtx_costs as well.  For
those other cases we're talking about improvements in the .000001% range.

While we are into stage4, this just hits cost modeling which we've generally
agreed is still appropriate (though we were mostly talking about vector).  So
I'm going to extend that general agreement ever so slightly and include scalar
cost modeling :-)

gcc/
	* config/riscv/riscv.cc (riscv_rtx_costs): Handle SUBREG and REG
	similarly.

gcc/testsuite/

	* gcc.target/riscv/reg_subreg_costs.c: New test.

	Co-authored-by: Jivan Hakobyan <jivanhakobyan9@gmail.com>
Liaoshihua pushed a commit to Liaoshihua/ruyi-gcc that referenced this pull request Mar 13, 2024
During partial ordering, we want to look through dependent alias
template specializations within template arguments and otherwise
treat them as opaque in other contexts (see e.g. r7-7116-g0c942f3edab108
and r11-7011-g6e0a231a4aa240).  To that end template_args_equal was
given a partial_order flag that controls this behavior.  This flag
does the right thing when a dependent alias template specialization
appears as template argument of the partial specialization, e.g. in

  template<class T, class...> using first_t = T;
  template<class T> struct traits;
  template<class T> struct traits<first_t<T, T&>> { }; // ruyisdk#1
  template<class T> struct traits<first_t<const T, T&>> { }; // #2

we correctly consider #2 to be more specialized than ruyisdk#1.  But if the
alias specialization appears as a nested template argument of another
class template specialization, e.g. in

  template<class T> struct traits<A<first_t<T, T&>>> { }; // ruyisdk#1
  template<class T> struct traits<A<first_t<const T, T&>>> { }; // #2

then we incorrectly consider ruyisdk#1 and #2 to be unordered.  This is because

  1. we don't propagate the flag to recursive template_args_equal calls
  2. we don't use structural equality for class template specializations
     written in terms of dependent alias template specializations

This patch fixes the first issue by turning the partial_order flag into
a global.  This patch fixes the second issue by making us propagate
structural equality appropriately when building a class template
specialization.  In passing this patch also improves hashing of
specializations that use structural equality.

	PR c++/90679

gcc/cp/ChangeLog:

	* cp-tree.h (comp_template_args): Remove partial_order parameter.
	(template_args_equal): Likewise.
	* pt.cc (comparing_for_partial_ordering): New global flag.
	(iterative_hash_template_arg) <case tcc_type>: Hash the template
	and arguments for specializations that use structural equality.
	(template_args_equal): Remove partial order parameter and
	use comparing_for_partial_ordering instead.
	(comp_template_args): Likewise.
	(comp_template_args_porder): Set comparing_for_partial_ordering
	instead.  Make static.
	(any_template_arguments_need_structural_equality_p): Return true
	for an argument that's a dependent alias template specialization
	or a class template specialization that itself needs structural
	equality.
	* tree.cc (cp_tree_equal) <case TREE_VEC>: Adjust call to
	comp_template_args.

gcc/testsuite/ChangeLog:

	* g++.dg/cpp0x/alias-decl-75a.C: New test.
	* g++.dg/cpp0x/alias-decl-75b.C: New test.
Liaoshihua pushed a commit to Liaoshihua/ruyi-gcc that referenced this pull request Mar 13, 2024
Hi All,

This patch adds initial support for early break vectorization in GCC. In other
words it implements support for vectorization of loops with multiple exits.
The support is added for any target that implements a vector cbranch optab,
this includes both fully masked and non-masked targets.

Depending on the operation, the vectorizer may also require support for boolean
mask reductions using Inclusive OR/Bitwise AND.  This is however only checked
then the comparison would produce multiple statements.

This also fully decouples the vectorizer's notion of exit from the existing loop
infrastructure's exit.  Before this patch the vectorizer always picked the
natural loop latch connected exit as the main exit.

After this patch the vectorizer is free to choose any exit it deems appropriate
as the main exit.  This means that even if the main exit is not countable (i.e.
the termination condition could not be determined) we might still be able to
vectorize should one of the other exits be countable.

In such situations the loop is reflowed which enabled vectorization of many
other loop forms.

Concretely the kind of loops supported are of the forms:

 for (int i = 0; i < N; i++)
 {
   <statements1>
   if (<condition>)
     {
       ...
       <action>;
     }
   <statements2>
 }

where <action> can be:
 - break
 - return
 - goto

Any number of statements can be used before the <action> occurs.

Since this is an initial version for GCC 14 it has the following limitations and
features:

- Only fixed sized iterations and buffers are supported.  That is to say any
  vectors loaded or stored must be to statically allocated arrays with known
  sizes. N must also be known.  This limitation is because our primary target
  for this optimization is SVE.  For VLA SVE we can't easily do cross page
  iteraion checks. The result is likely to also not be beneficial. For that
  reason we punt support for variable buffers till we have First-Faulting
  support in GCC 15.
- any stores in <statements1> should not be to the same objects as in
  <condition>.  Loads are fine as long as they don't have the possibility to
  alias.  More concretely, we block RAW dependencies when the intermediate value
  can't be separated fromt the store, or the store itself can't be moved.
- Prologue peeling, alignment peelinig and loop versioning are supported.
- Fully masked loops, unmasked loops and partially masked loops are supported
- Any number of loop early exits are supported.
- No support for epilogue vectorization.  The only epilogue supported is the
  scalar final one.  Peeling code supports it but the code motion code cannot
  find instructions to make the move in the epilog.
- Early breaks are only supported for inner loop vectorization.

With the help of IPA and LTO these loop forms are still hit quite often.  During
bootstrap they are hit rather frequently.  Additionally TSVC s332, s481 and s482
all pass now, since these are tests for early exit vectorization support.

This implementation does not support handling the early break completely inside
the vector loop itself.  Instead it adds checks such that, if we know we have to
exit in the current iteration, we branch to scalar code to actually do the final
VF iterations, which handles all the code in <action>.

For the scalar loop we know that whatever exit you take you have to perform at
most VF iterations.  For vector code we only care about the state of fully
performed iterations and reset the scalar code to the (partially) remaining loop.

That is to say, the first vector loop executes so long as the early exit isn't
needed.  Once the exit is taken, the scalar code will perform at most VF extra
iterations.  The exact number depends on peeling, the iteration start and which
exit was taken (natural or early).  For this scalar loop, all early exits are
treated the same.

When we vectorize we move any statement that is not related to the early break
itself and that would be incorrect to execute before the break (i.e. has side
effects) to after the break.  If this is not possible we decline to vectorize.
The analysis and code motion also take into account that they must not introduce
a RAW dependency after the move of the stores.

This means that we check at the start of each iteration whether we are going to
exit or not.  During the analysis phase we check whether we are allowed to do
this movement of statements.  Also note that we move only the scalar statements,
and we do so after peeling but just before we start transforming statements.
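
To make the resulting shape concrete, the following scalar sketch (hypothetical,
with VF standing for the vectorization factor; the real transform of course works
on vector statements and masks) shows the ordering produced for the test4 example
given below: the condition for a whole block of VF iterations is checked first,
the stores run only when no lane exits, and on exit we fall back to at most VF
scalar iterations of the original body.

#define N 803   /* assumed value; the original snippet leaves N undefined */
#define VF 4
unsigned vect_a[N];
unsigned vect_b[N];

unsigned test4_sketch (unsigned x)
{
 int i = 0;
 for (; i + VF <= N; i += VF)
 {
   int exit_needed = 0;
   for (int j = 0; j < VF; j++)      /* vectorized <condition>, checked first */
     exit_needed |= (vect_a[i + j] > x);
   if (exit_needed)
     break;                          /* branch to the scalar code below */
   for (int j = 0; j < VF; j++)      /* stores, moved to after the check */
   {
     vect_b[i + j] = x + i + j;
     vect_a[i + j] = x;
   }
 }
 /* Scalar code: at most VF extra iterations of the original loop body,
    including the iteration that actually breaks (or the loop remainder if
    no early exit was taken).  */
 for (; i < N; i++)
 {
   vect_b[i] = x + i;
   if (vect_a[i] > x)
     break;
   vect_a[i] = x;
 }
 return 0;
}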

With this the vector flow no longer necessarily needs to match that of the
scalar code.  In addition, most of the infrastructure is in place to support
general control flow safely; however, we are punting this to GCC 15.

Codegen:

for e.g.

unsigned vect_a[N];
unsigned vect_b[N];

unsigned test4(unsigned x)
{
 unsigned ret = 0;
 for (int i = 0; i < N; i++)
 {
   vect_b[i] = x + i;
   if (vect_a[i] > x)
     break;
   vect_a[i] = x;

 }
 return ret;
}

We generate for Adv. SIMD:

test4:
        adrp    x2, .LC0
        adrp    x3, .LANCHOR0
        dup     v2.4s, w0
        add     x3, x3, :lo12:.LANCHOR0
        movi    v4.4s, 0x4
        add     x4, x3, 3216
        ldr     q1, [x2, #:lo12:.LC0]
        mov     x1, 0
        mov     w2, 0
        .p2align 3,,7
.L3:
        ldr     q0, [x3, x1]
        add     v3.4s, v1.4s, v2.4s
        add     v1.4s, v1.4s, v4.4s
        cmhi    v0.4s, v0.4s, v2.4s
        umaxp   v0.4s, v0.4s, v0.4s
        fmov    x5, d0
        cbnz    x5, .L6
        add     w2, w2, 1
        str     q3, [x1, x4]
        str     q2, [x3, x1]
        add     x1, x1, 16
        cmp     w2, 200
        bne     .L3
        mov     w7, 3
.L2:
        lsl     w2, w2, 2
        add     x5, x3, 3216
        add     w6, w2, w0
        sxtw    x4, w2
        ldr     w1, [x3, x4, lsl 2]
        str     w6, [x5, x4, lsl 2]
        cmp     w0, w1
        bcc     .L4
        add     w1, w2, 1
        str     w0, [x3, x4, lsl 2]
        add     w6, w1, w0
        sxtw    x1, w1
        ldr     w4, [x3, x1, lsl 2]
        str     w6, [x5, x1, lsl 2]
        cmp     w0, w4
        bcc     .L4
        add     w4, w2, 2
        str     w0, [x3, x1, lsl 2]
        sxtw    x1, w4
        add     w6, w1, w0
        ldr     w4, [x3, x1, lsl 2]
        str     w6, [x5, x1, lsl 2]
        cmp     w0, w4
        bcc     .L4
        str     w0, [x3, x1, lsl 2]
        add     w2, w2, 3
        cmp     w7, 3
        beq     .L4
        sxtw    x1, w2
        add     w2, w2, w0
        ldr     w4, [x3, x1, lsl 2]
        str     w2, [x5, x1, lsl 2]
        cmp     w0, w4
        bcc     .L4
        str     w0, [x3, x1, lsl 2]
.L4:
        mov     w0, 0
        ret
        .p2align 2,,3
.L6:
        mov     w7, 4
        b       .L2

and for SVE:

test4:
        adrp    x2, .LANCHOR0
        add     x2, x2, :lo12:.LANCHOR0
        add     x5, x2, 3216
        mov     x3, 0
        mov     w1, 0
        cntw    x4
        mov     z1.s, w0
        index   z0.s, #0, #1
        ptrue   p1.b, all
        ptrue   p0.s, all
        .p2align 3,,7
.L3:
        ld1w    z2.s, p1/z, [x2, x3, lsl 2]
        add     z3.s, z0.s, z1.s
        cmplo   p2.s, p0/z, z1.s, z2.s
        b.any   .L2
        st1w    z3.s, p1, [x5, x3, lsl 2]
        add     w1, w1, 1
        st1w    z1.s, p1, [x2, x3, lsl 2]
        add     x3, x3, x4
        incw    z0.s
        cmp     w3, 803
        bls     .L3
.L5:
        mov     w0, 0
        ret
        .p2align 2,,3
.L2:
        cntw    x5
        mul     w1, w1, w5
        cbz     w5, .L5
        sxtw    x1, w1
        sub     w5, w5, #1
        add     x5, x5, x1
        add     x6, x2, 3216
        b       .L6
        .p2align 2,,3
.L14:
        str     w0, [x2, x1, lsl 2]
        cmp     x1, x5
        beq     .L5
        mov     x1, x4
.L6:
        ldr     w3, [x2, x1, lsl 2]
        add     w4, w0, w1
        str     w4, [x6, x1, lsl 2]
        add     x4, x1, 1
        cmp     w0, w3
        bcs     .L14
        mov     w0, 0
        ret

On the workloads this work is based on we see between 2-3x performance uplift
using this patch.

Follow up plan:
 - Boolean vectorization has several shortcomings.  I've filed PR110223 with the
   bigger ones that cause vectorization to fail with this patch.
 - SLP support.  This is planned for GCC 15, as for the majority of cases building
   SLP itself fails.  This means I'll need to spend time making this more
   robust first.  Additionally it requires:
     * Adding support for vectorizing CFG (gconds)
     * Support for CFG to differ between vector and scalar loops.
   Both of which would be disruptive to the tree and I suspect I'll be handling
   fallouts from this patch for a while.  So I plan to work on the surrounding
   building blocks first for the remainder of the year.

The patch also contains reduced test cases from issues found running it over
various codebases.

Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.

Also regtested with:
 -march=armv8.3-a+sve
 -march=armv8.3-a+nosve
 -march=armv9-a
 -mcpu=neoverse-v1
 -mcpu=neoverse-n2

Bootstrapped Regtested x86_64-pc-linux-gnu and no issues.
Bootstrap and Regtest on arm-none-linux-gnueabihf and no issues.

gcc/ChangeLog:

	* tree-if-conv.cc (idx_within_array_bound): Expose.
	* tree-vect-data-refs.cc (vect_analyze_early_break_dependences): New.
	(vect_analyze_data_ref_dependences): Use it.
	* tree-vect-loop-manip.cc (vect_iv_increment_position): New.
	(vect_set_loop_controls_directly,
	vect_set_loop_condition_partial_vectors,
	vect_set_loop_condition_partial_vectors_avx512,
	vect_set_loop_condition_normal): Support multiple exits.
	(slpeel_tree_duplicate_loop_to_edge_cfg): Support LCSSA peeling for
	multiple exits.
	(slpeel_can_duplicate_loop_p): Change vectorizer from looking at BB
	count and instead look at loop shape.
	(vect_update_ivs_after_vectorizer): Drop asserts.
	(vect_gen_vector_loop_niters_mult_vf): Support peeled vector iterations.
	(vect_do_peeling): Support multiple exits.
	(vect_loop_versioning): Likewise.
	* tree-vect-loop.cc (_loop_vec_info::_loop_vec_info): Initialise
	early_breaks.
	(vect_analyze_loop_form): Support loop flows with more than single BB
	loop body.
	(vect_create_loop_vinfo): Support niters analysis for multiple exits.
	(vect_analyze_loop): Likewise.
	(vect_get_vect_def): New.
	(vect_create_epilog_for_reduction): Support early exit reductions.
	(vectorizable_live_operation_1): New.
	(find_connected_edge): New.
	(vectorizable_live_operation): Support early exit live operations.
	(move_early_exit_stmts): New.
	(vect_transform_loop): Use it.
	* tree-vect-patterns.cc (vect_init_pattern_stmt): Support gcond.
	(vect_recog_bitfield_ref_pattern): Support gconds and bools.
	(vect_recog_gcond_pattern): New.
	(possible_vector_mask_operation_p): Support gcond masks.
	(vect_determine_mask_precision): Likewise.
	(vect_mark_pattern_stmts): Set gcond def type.
	(can_vectorize_live_stmts): Force early break inductions to be live.
	* tree-vect-stmts.cc (vect_stmt_relevant_p): Add relevancy analysis for
	early breaks.
	(vect_mark_stmts_to_be_vectorized): Process gcond usage.
	(perm_mask_for_reverse): Expose.
	(vectorizable_comparison_1): New.
	(vectorizable_early_exit): New.
	(vect_analyze_stmt): Support early break and gcond.
	(vect_transform_stmt): Likewise.
	(vect_is_simple_use): Likewise.
	(vect_get_vector_types_for_stmt): Likewise.
	* tree-vectorizer.cc (pass_vectorize::execute): Update exits for value
	numbering.
	* tree-vectorizer.h (enum vect_def_type): Add vect_condition_def.
	(LOOP_VINFO_EARLY_BREAKS, LOOP_VINFO_EARLY_BRK_STORES,
	LOOP_VINFO_EARLY_BREAKS_VECT_PEELED, LOOP_VINFO_EARLY_BRK_DEST_BB,
	LOOP_VINFO_EARLY_BRK_VUSES): New.
	(is_loop_header_bb_p): Drop assert.
	(class loop): Add early_breaks, early_break_stores, early_break_dest_bb,
	early_break_vuses.
	(vect_iv_increment_position, perm_mask_for_reverse,
	ref_within_array_bound): New.
	(slpeel_tree_duplicate_loop_to_edge_cfg): Update for early breaks.
Liaoshihua pushed a commit to Liaoshihua/ruyi-gcc that referenced this pull request Mar 13, 2024
This patch adjusts the costs so that we treat REG and SUBREG expressions the
same for costing.

This was motivated by bt_skip_func and bt_find_func in xz and results in nearly
a 5% improvement in the dynamic instruction count for input #2, and smaller but
definitely visible improvements pretty much across the board.  Exceptions would
be perlbench input #1 and exchange2, which showed very small regressions.

In the bt_find_func and bt_skip_func cases we have  something like this:

> (insn 10 7 11 2 (set (reg/v:DI 136 [ x ])
>         (zero_extend:DI (subreg/s/u:SI (reg/v:DI 137 [ a ]) 0))) "zz.c":6:21 387 {*zero_extendsidi2_bitmanip}
>      (nil))
> (insn 11 10 12 2 (set (reg:DI 142 [ _1 ])
>         (plus:DI (reg/v:DI 136 [ x ])
>             (reg/v:DI 139 [ b ]))) "zz.c":7:23 5 {adddi3}
>      (nil))

[ ... ]> (insn 13 12 14 2 (set (reg:DI 143 [ _2 ])
>         (plus:DI (reg/v:DI 136 [ x ])
>             (reg/v:DI 141 [ c ]))) "zz.c":8:23 5 {adddi3}
>      (nil))

Note the two uses of (reg 136). The best way to handle that in combine might be
a 3->2 split.  But there's a much better approach if we look at fwprop...

(set (reg:DI 142 [ _1 ])
    (plus:DI (zero_extend:DI (subreg/s/u:SI (reg/v:DI 137 [ a ]) 0))
        (reg/v:DI 139 [ b ])))
change not profitable (cost 4 -> cost 8)

So that should be the same cost as a regular DImode addition when the ZBA
extension is enabled.  But it ends up costing more because the clause to cost
this variant isn't prepared to handle a SUBREG.  That results in the RTL above
having too high a cost and fwprop gives up.
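
At the source level the shape is roughly the following (a hypothetical reduction,
not the actual xz code): one zero-extension of a 32-bit value feeding two 64-bit
additions.  With ZBA each such addition can become a single add.uw once the
extension is propagated into the plus, which is why the propagated form should
not be costed higher than a plain DImode addition.

#include <stdint.h>

/* Hypothetical reduction of the bt_find_func/bt_skip_func pattern: x is a
   zero-extended 32-bit value used by two DImode additions, matching the
   double use of (reg 136) in the RTL above.  */
uint64_t
two_uses (uint32_t a, uint64_t b, uint64_t c, uint64_t *out)
{
  uint64_t x = a;   /* zero_extend:DI of an SImode value */
  *out = x + b;     /* first DImode plus  */
  return x + c;     /* second DImode plus */
}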

One approach would be to replace the REG_P check with REG_P || SUBREG_P in the
costing code.  I ultimately decided against that and instead check whether the
operand in question passes register_operand.

By far the most important case to handle is the DImode PLUS.  But for the sake
of consistency, I changed the other instances in riscv_rtx_costs as well.  For
those other cases we're talking about improvements in the .000001% range.

While we are in stage4, this only touches cost modeling, which we've generally
agreed is still appropriate (though we were mostly talking about vector).  So
I'm going to extend that general agreement ever so slightly and include scalar
cost modeling :-)

gcc/
	* config/riscv/riscv.cc (riscv_rtx_costs): Handle SUBREG and REG
	similarly.

gcc/testsuite/

	* gcc.target/riscv/reg_subreg_costs.c: New test.

	Co-authored-by: Jivan Hakobyan <jivanhakobyan9@gmail.com>
Liaoshihua pushed a commit to Liaoshihua/ruyi-gcc that referenced this pull request Mar 15, 2024
Liaoshihua pushed a commit to Liaoshihua/ruyi-gcc that referenced this pull request Mar 15, 2024
yulong18 pushed a commit to yulong18/ruyisdk-gcc that referenced this pull request Mar 17, 2024