kunalspathak (Contributor) commented May 23, 2025

Overview

In .NET 9, we added SVE support that works only on hardware with a vector length (VL) of 16 bytes (16B). This prevents developers from using SVE on hardware that supports different vector lengths, and in NativeAOT scenarios, where a binary compiled for a particular VL needs recompilation to run on hardware with a different VL. This PR adds preliminary support for a limited set of vector lengths (32 bytes and 64 bytes) in the JIT scenario. Follow-up PRs will add support for other vector lengths as well as for NativeAOT.

Vector<T> is .NET's vector-length-agnostic type, and we will leverage it to generate SVE instructions. Currently, the heuristic is that Vector<T> continues to generate NEON instructions if the underlying VL is 16B. Only if VL > 16B do we start generating SVE instructions for it.
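The heuristic above can be sketched in C#. This is only an illustration of the decision rule, not the JIT's actual code (the JIT is written in C++): `IsaChooser`, `ChooseIsaForVectorT` and `CurrentVectorTBytes` are hypothetical names, and the 16-byte threshold is taken from the description above.

```csharp
using System.Numerics;

// Hypothetical illustration of the heuristic; the real decision is made inside the JIT.
public static class IsaChooser
{
    // NEON registers are 16 bytes wide, so a 16B vector length keeps using NEON.
    private const int NeonVectorBytes = 16;

    // Returns "SVE" only when the vector length exceeds 16 bytes, otherwise "NEON".
    public static string ChooseIsaForVectorT(int vectorLengthBytes) =>
        vectorLengthBytes > NeonVectorBytes ? "SVE" : "NEON";

    // Size in bytes of Vector<T> as observed by the running process.
    public static int CurrentVectorTBytes() => Vector<byte>.Count;
}
```

On Arm64, passing `IsaChooser.CurrentVectorTBytes()` to `ChooseIsaForVectorT` mirrors the rule stated above: a 16B machine stays on NEON, and anything wider takes the SVE path.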
 

TYP_SIMD*

SVE has variable-length vectors ranging from 16B to 256B, and the length should be a power of 2. So the applicable vector lengths are 16B, 32B, 64B, 128B and 256B. This PR adds preliminary support for agnostic VL by reusing some of the existing xarch logic around TYP_SIMD32 and TYP_SIMD64, and it can be further expanded to TYP_SIMD128 and TYP_SIMD256. It was easier to port the logic in various places using the existing wider vector types than to create a type whose size is determined at runtime and then handle that new type throughout the code base, especially around value numbering. Once all the issues are ironed out, I will reconsider adding a generalized TYP_SIMD type instead of 32B, 64B, etc.
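As a small sketch of the length rule stated above (power of 2, between 16B and 256B), the following C# helper enumerates and validates the applicable vector lengths. `SveVectorLengths` and its members are hypothetical names used only for illustration; `IsCoveredByThisPr` simply encodes the 32B and 64B scope mentioned in the overview.

```csharp
using System.Collections.Generic;
using System.Numerics;

// Hypothetical helper; it only restates the length rule from the description.
public static class SveVectorLengths
{
    public const int MinBytes = 16;
    public const int MaxBytes = 256;

    // A length is applicable when it is a power of 2 within [16B, 256B].
    public static bool IsApplicable(int bytes) =>
        bytes >= MinBytes && bytes <= MaxBytes && BitOperations.IsPow2(bytes);

    // Yields 16, 32, 64, 128, 256 by doubling from the minimum.
    public static IEnumerable<int> Enumerate()
    {
        for (int bytes = MinBytes; bytes <= MaxBytes; bytes *= 2)
        {
            yield return bytes;
        }
    }

    // Assumption from the overview: this PR targets only the 32B and 64B lengths.
    public static bool IsCoveredByThisPr(int bytes) => bytes == 32 || bytes == 64;
}
```

For example, `SveVectorLengths.IsApplicable(48)` is false because 48 is not a power of 2, while 128 and 256 are applicable but outside the scope of this PR.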

Vector<T>

Today, the Vector<T> type is mapped to the corresponding Vector128<T> intrinsic methods to generate NEON instructions, because NEON instructions operate on 16B of data. We will detect the vector length and, if it is > 16B, use SVE instructions. To do that, we stop mapping Vector<T> -> Vector128<T> and instead introduce new intrinsics based on Vector<T>. These intrinsics correspond to the methods available on Vector<T>. Next, we propagate these intrinsics throughout the code base. During codegen, when we see an intrinsic of Vector<T> type, we know that we need to generate an SVE instruction instead of a NEON instruction.
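From the caller's side, the remapping should not require source changes, because code written against Vector<T> never names a vector width. The sketch below shows a vector-length-agnostic element-wise add; `VectorAgnosticAdd` is a hypothetical helper, and which instructions (NEON or SVE) the JIT emits for it is an assumption based on the description above, not something the code controls.

```csharp
using System;
using System.Numerics;

// Hypothetical helper: the loop strides by Vector<int>.Count, whatever the hardware VL is.
public static class VectorAgnosticAdd
{
    public static void Add(ReadOnlySpan<int> left, ReadOnlySpan<int> right, Span<int> destination)
    {
        if (left.Length != right.Length || destination.Length < left.Length)
        {
            throw new ArgumentException("Span lengths do not match.");
        }

        int lanes = Vector<int>.Count; // 4 lanes for 16B, 8 for 32B, 16 for 64B
        int i = 0;

        // Vectorized body: process one full vector per iteration.
        for (; i <= left.Length - lanes; i += lanes)
        {
            var a = new Vector<int>(left.Slice(i));
            var b = new Vector<int>(right.Slice(i));
            (a + b).CopyTo(destination.Slice(i));
        }

        // Scalar tail for the elements that do not fill a whole vector.
        for (; i < left.Length; i++)
        {
            destination[i] = left[i] + right[i];
        }
    }
}
```

The same binary works for any vector length because the stride comes from `Vector<int>.Count` at run time.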

Register allocation

In .NET 9, we adopted a custom ABI for SVE registers. For now, we will continue to use that ABI. At a call boundary, only the lower half of v8~v15 is callee-saved, and today we preserve the upper half of live SIMD registers in those registers. Since SVE registers are wider, we might need more than v8~v15 to preserve the upper portion of the killed registers. Hence, I decided to just spill them to the stack. In the future, when we fine-tune our ABI, we will update this design.

Other optimizations

In xarch, there are several other optimizations, like ReadUtf8 or Memmove, that benefit from a higher VL. I tried to enable them for Arm64 with a higher VL, but for some of them I was not able to find optimal equivalent SVE instructions, and some needed SVE2 instructions. Hence, I decided not to do any optimization around this. We will enable them incrementally in the future.

Testing

I have introduced a DEBUG flag, DOTNET_UseSveForVectorT. When it is set, we hardcode the VL to 32B in order to exercise the Vector<T>/SVE path I mentioned above. This approach works for superpmi / jitstress testing. I still need to validate that it functions during actual execution on Cobalt machines, which have only a 16B VL. I considered introducing a flag like DOTNET_MinVectorTLengthForSve, which would specify the minimum vector length needed to trigger SVE instructions, so that during testing we could set it to 16B. However, I soon realized that a lot of code paths depend on TYP_SIMD16 and generate NEON instructions, so DOTNET_UseSveForVectorT felt like the better approach.

TODOs

There are several TODOs that I will address before marking the PR for review, but others might have to be done incrementally.

Reference: #115037

Examples

    [MethodImpl(MethodImplOptions.NoInlining)]
    private static bool Test2(Vector<int> a, Vector<int> b)
    {
        return Vector.LessThanAll(a, b);
    }


    [MethodImpl(MethodImplOptions.NoInlining)]
    private static void Test()
    {
        var a = GetVector<int>(5);
        var b = GetVector<int>(5);
        Vector<int> c = a + b;
        Consume(c);
    }


    [MethodImpl(MethodImplOptions.NoInlining)]
    private static void Test(Vector<int> a)
    {
        var b = GetVector<int>(5);
        Vector<int> c = a + b;
        Consume(c);
    }


    [MethodImpl(MethodImplOptions.NoInlining)]
    private static Vector<int> Test(Vector<int> a)
    {
        var b = GetVector<int>(5);
        var c = a << 8;
        Consume(c);
        return Cond() ? c : b;
    }


    [MethodImpl(MethodImplOptions.NoInlining)]
    private static void Test(Vector<int> a)
    {
        Vector<float> b = GetVector<float>(5.9f);
        Vector<float> c = GetVector<float>(5.9f);
        var result = Sve.CompareGreaterThan(b, c);
        Consume(result);
    }


kunalspathak (Contributor, Author) commented

/azp run runtime-coreclr outerloop

Azure Pipelines successfully started running 1 pipeline(s).

kunalspathak (Contributor, Author) commented

/azp run runtime-coreclr jitstress, runtime-coreclr libraries-jitstress, runtime-coreclr jitstressregs

Azure Pipelines successfully started running 3 pipeline(s).

kunalspathak (Contributor, Author) commented

/azp run runtime-coreclr jitstress, runtime-coreclr libraries-jitstress, runtime-coreclr jitstressregs

Azure Pipelines successfully started running 3 pipeline(s).

risc-vv commented Jul 4, 2025

@dotnet/samsung Could you please take a look? These changes may be related to riscv64.

risc-vv commented Jul 9, 2025

RISC-V Release-FX-QEMU: 275972 / 277031 (99.62%)
=======================
      passed: 275972
      failed: 1053
     skipped: 39
      killed: 6
------------------------
 TOTAL tests: 277070
VIRTUAL time: 30h 12min 7s 686ms
   REAL time: 1h 11min 49s 285ms
=======================

report.xml, report.md, failures.xml, testclr_details.tar.zst

Build information and commands

GIT: 7f88033009d1b80bdc860e9ead1343b2dae4b7aa
CI: d6c9c1ab3a7411819463edc05ded301e89ba586a
REPO: kunalspathak/runtime
BRANCH: variable-vl-3
CONFIG: Release
LIB_CONFIG: Release

risc-vv commented Jul 28, 2025

@dotnet/samsung Could you please take a look? These changes may be related to riscv64.

a74nh (Contributor) commented Aug 7, 2025

Checking the status of this PR on a Graviton 3 (which has SVE256).

With this program:

using System;
using System.Numerics;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;
using System.Runtime.CompilerServices;

public class Program
{
    public static sbyte s_1;

    public static void Main()
    {
        if (Sve.IsSupported)
        {
            var a = Vector.Create<int>(42);
            var b = Vector.Create<int>(43);
            var c = a + b;
            Consume(c);
        }
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    private static void Consume<T>(T v)
    {
        Console.WriteLine(v);
    }
}

Running as is:

❯ $CORE_ROOT/corerun ./bin/Release/net10.0/vector.dll

Assert failure(PID 228275 [0x00037bb3], Thread: 228275 [0x37bb3]): Assertion failed 'unreached' in 'System.SpanHelpers:SequenceCompareTo(byref,int,byref,int):int' during 'Generate code' (IL size 308; hash 0xbda601aa; Instrumented Tier0)

    File: /home/alahay01/dotnet/runtime_table/src/coreclr/jit/emitarm64sve.cpp:4438
    Image: /home/alahay01/dotnet/runtime_table/artifacts/tests/coreclr/linux.arm64.Checked/Tests/Core_Root/corerun

Breaking just before the error with dumps on:

Generating: N182 (???,???) [000249] -----------                            IL_OFFSET void   INLRT @ 0x066[E-] REG NA
Generating: N184 (???,???) [000141] -----+-----                  t141 =    LCL_VAR   byref  V00 arg0          x0 REG x0
Mapped BB30 to G_M65109_IG07
IN0044:             ldr     x0, [fp, #0x58]	// [V00 arg0]
							Byref regs: 0000 {} => 0001 {x0}
Generating: N186 (???,???) [000142] -----+-----                  t142 =    LCL_VAR   long   V06 loc2          x1 REG x1
IN0045:             ldr     x1, [fp, #0x30]	// [V06 loc2]
Generating: N188 (???,???) [000143] -c---+-----                  t143 =    CNS_INT   long   1 REG NA
                                                                        /--*  t142   long
                                                                        +--*  t143   long
Generating: N190 (???,???) [000144] -----+-----                  t144 = *  LSH       long   REG x1
IN0046:             lsl     x1, x1, #1
                                                                        /--*  t141   byref
                                                                        +--*  t144   long
Generating: N192 (???,???) [000145] -c---+-----                  t145 = *  LEA(b+(i*1)+0) byref  REG NA
                                                                        /--*  t145   byref
Generating: N194 (???,???) [000146] U--XG+-----                  t146 = *  IND       simd32 REG d16
							Byref regs: 0001 {x0} => 0000 {}

Thread 1 "corerun" hit Breakpoint 1, emitter::emitInsSve_R_R_R (this=0xffbef4036260, ins=INS_sve_ldr,
    attr=EA_SCALABLE, reg1=JITREG_V16, reg2=JITREG_R0, reg3=JITREG_R1, opt=INS_OPTS_NONE,
    sopt=INS_SCALABLE_OPTS_NONE) at /home/alahay01/dotnet/runtime_table/src/coreclr/jit/emitarm64sve.cpp:2954

I get the same error regardless of whether DOTNET_UseSveForVectorT is set or not. That'll be due to coreclr auto-enabling SVE for Vector<T> when the vector length is > 128 bits.

In 128-bit SVE mode on Graviton 3:

❯ $CORE_ROOT/corerun ./bin/Release/net10.0/vector.dll
<85, 85, 85, 85>

IN000b: 000000      stp     fp, lr, [sp, #-0x20]!
IN000c: 000004      mov     fp, sp
IN000d: 000008      str     xzr, [fp, #0x10]	// [V00 loc0]
IN000e: 00000C      str     xzr, [fp, #0x18]	// [V00 loc0+0x08]

G_M27646_IG02:        ; offs=0x000010, size=0x0028, bbWeight=1, PerfScore 10.50, gcrefRegs=0000 {}, byrefRegs=0000 {}, BB02 [0001], byref

IN0001: 000010      movi    v16.4s, #0x2B
IN0002: 000014      str     q16, [fp, #0x10]	// [V00 loc0]
IN0003: 000018      ldr     q16, [fp, #0x10]	// [V00 loc0]
IN0004: 00001C      movi    v17.4s, #0x2A
IN0005: 000020      add     v0.4s, v16.4s, v17.4s
IN0006: 000024      movz    x0, #0x4B70      // code for Program:Consume[System.Numerics.Vector`1[int]](System.Numerics.Vector`1[int])
IN0007: 000028      movk    x0, #0x9E01 LSL #16
IN0008: 00002C      movk    x0, #0xF720 LSL #32
IN0009: 000030      ldr     x0, [x0]
IN000a: 000034      blr     x0

G_M27646_IG03:        ; offs=0x000038, size=0x0008, bbWeight=1, PerfScore 2.00, epilog, nogc, extend

IN000f: 000038      ldp     fp, lr, [sp], #0x20
IN0010: 00003C      ret     lr

With DOTNET_UseSveForVectorT=1:

❯ $CORE_ROOT/corerun ./bin/Release/net10.0/vector.dll
<85, 85, 85, 85>

IN000b: 000000      stp     fp, lr, [sp, #-0x20]!
IN000c: 000004      mov     fp, sp
IN000d: 000008      str     xzr, [fp, #0x10]	// [V00 loc0]
IN000e: 00000C      str     xzr, [fp, #0x18]	// [V00 loc0+0x08]

G_M27646_IG02:        ; offs=0x000010, size=0x0028, bbWeight=1, PerfScore 14.50, gcrefRegs=0000 {}, byrefRegs=0000 {}, BB02 [0001], byref

IN0001: 000010      mov     z16.s, #43
IN0002: 000014      str     q16, [fp, #0x10]	// [V00 loc0]
IN0003: 000018      mov     z16.s, #42
IN0004: 00001C      ldr     q17, [fp, #0x10]	// [V00 loc0]
IN0005: 000020      add     z0.s, z16.s, z17.s
IN0006: 000024      movz    x0, #0x4B40      // code for Program:Consume[System.Numerics.Vector`1[int]](System.Numerics.Vector`1[int])
IN0007: 000028      movk    x0, #0x8E03 LSL #16
IN0008: 00002C      movk    x0, #0xE411 LSL #32
IN0009: 000030      ldr     x0, [x0]
IN000a: 000034      blr     x0

G_M27646_IG03:        ; offs=0x000038, size=0x0008, bbWeight=1, PerfScore 2.00, epilog, nogc, extend

IN000f: 000038      ldp     fp, lr, [sp], #0x20
IN0010: 00003C      ret     lr

a74nh (Contributor) commented Aug 7, 2025

Assert failure(PID 228275 [0x00037bb3], Thread: 228275 [0x37bb3]): Assertion failed 'unreached' in 'System.SpanHelpers:SequenceCompareTo(byref,int,byref,int):int' during 'Generate code' (IL size 308; hash 0xbda601aa; Instrumented Tier0) - this is due to the WriteLine call.

Removing the WriteLine (so that Consume is empty) gives a segfault.

; Assembly listing for method Program:Main() (Tier0)
; Emitting BLENDED_CODE for generic ARM64 + SVE on Unix
; Tier0 code
; fp based frame
; partially interruptible
; compiling with minopt
; Final local variable assignments
;
;  V00 loc0         [V00    ] (  1,  1   )  simd32  ->  [fp+0x10]   do-not-enreg[S] must-init <System.Numerics.Vector`1[int]>
;# V01 OutArgs      [V01    ] (  1,  1   )  struct ( 0) [sp+0x00]   do-not-enreg[XS] addr-exposed "OutgoingArgSpace" <Empty>
;
; Lcl frame size = 32

G_M27646_IG01:  ;; offset=0x0000
            nop
            brk     #0
            stp     fp, lr, [sp, #-0x30]!
            mov     fp, sp
            str     xzr, [fp, #0x10]	// [V00 loc0]
            str     xzr, [fp, #0x18]	// [V00 loc0+0x08]
            str     xzr, [fp, #0x20]	// [V00 loc0+0x10]
            str     xzr, [fp, #0x28]	// [V00 loc0+0x18]
						;; size=32 bbWeight=1 PerfScore 7.00
G_M27646_IG02:  ;; offset=0x0020
            mov     z16.s, #43
            str     z16, [fp, #1, mul vl]	// [V00 loc0]
            mov     z16.s, #42
            ldr     z17, [fp, #1, mul vl]	// [V00 loc0]
            add     z0.s, z16.s, z17.s
            movz    x0, #0x4A50      // code for Program:Consume[System.Numerics.Vector`1[int]](System.Numerics.Vector`1[int])
            movk    x0, #0x5E02 LSL #16
            movk    x0, #0xF909 LSL #32
            ldr     x0, [x0]
            blr     x0
						;; size=40 bbWeight=1 PerfScore 18.50
G_M27646_IG03:  ;; offset=0x0048
            ldp     fp, lr, [sp], #0x30
            ret     lr
						;; size=8 bbWeight=1 PerfScore 2.00

; Total bytes of code 80, prolog size 32, PerfScore 27.50, instruction count 20, allocated bytes for code 80 (MethodHash=cb019401) for method Program:Main() (Tier0)
Thread 1 "corerun" received signal SIGTRAP, Trace/breakpoint trap.
0x0000ffffa9bc32ac in ?? ()
(gdb) bt
#0  0x0000ffffa9bc32ac in ?? ()
#1  0x0000fffff769389c in NativeExceptionHolderBase::Push (this=0xffffffffdb20)
    at /home/alahay01/dotnet/runtime_table/src/coreclr/pal/inc/pal.h:4032
#2  CallDescrWorkerWithHandler (pCallDescrData=0xffffffffdd20, fCriticalCall=0)
    at /home/alahay01/dotnet/runtime_table/src/coreclr/vm/callhelpers.cpp:57
#3  0x0000000000000000 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb)

Edit: With DOTNET_TieredCompilation=0 I get the same error. In that case the SVE ldr/str above are optimised away, so they can't be the cause.

Draft Pull Request was automatically closed for 30 days of inactivity. Please let us know if you'd like to reopen it.

Labels: area-CodeGen-coreclr, NO-REVIEW