Sve: Preliminary support for agnostic VL for JIT scenarios #115948
/azp run runtime-coreclr outerloop |
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run runtime-coreclr jitstress, runtime-coreclr libraries-jitstress, runtime-coreclr jitstressregs |
Azure Pipelines successfully started running 3 pipeline(s). |
@dotnet/samsung Could you please take a look? These changes may be related to riscv64. |
7f88033 is being scheduled for building and testingGIT: |
Overview
In .NET 9, we added SVE support that works only on hardware with a vector length (VL) of 16 bytes (16B). This prevents developers from using SVE on hardware that supports other vector lengths, and in NativeAOT scenarios, where a binary compiled for a particular VL would need recompilation to run on hardware with a different VL. This PR adds preliminary support for a limited set of vector lengths (32 bytes and 64 bytes) for the JIT scenario. Follow-up PRs will add support for other vector lengths as well as for NativeAOT.
Vector<T> is .NET's vector-length-agnostic type, and we will leverage it to generate SVE instructions. Currently, the heuristic is set such that Vector<T> continues to generate NEON instructions if the underlying VL is 16B. Only if VL > 16B do we start generating SVE instructions for it.
TYP_SIMD*
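A minimal sketch of the size-class idea in this section, assuming a detected vector length in bytes. The enum and both function names are hypothetical and are not the JIT's actual types; only the TYP_SIMD16/32/64 names come from this description.

```cpp
#include <cstdint>

// Hypothetical stand-in for the JIT's SIMD size classes.
enum class SimdType { Unsupported, TYP_SIMD16, TYP_SIMD32, TYP_SIMD64 };

// As stated in this description: a vector length is a power of two
// between 16 and 256 bytes.
bool isValidSveVectorLength(uint32_t vlBytes)
{
    return vlBytes >= 16 && vlBytes <= 256 && (vlBytes & (vlBytes - 1)) == 0;
}

// Map a vector length onto one of the fixed-size types reused from xarch.
// 128B and 256B are valid lengths but are reported as Unsupported here,
// on the assumption that they are left for follow-up work.
SimdType simdTypeForVectorLength(uint32_t vlBytes)
{
    if (!isValidSveVectorLength(vlBytes))
    {
        return SimdType::Unsupported;
    }
    switch (vlBytes)
    {
        case 16: return SimdType::TYP_SIMD16;
        case 32: return SimdType::TYP_SIMD32;
        case 64: return SimdType::TYP_SIMD64;
        default: return SimdType::Unsupported;
    }
}
```

Using fixed-size classes keeps each value's size known at JIT time, which is the property the existing xarch logic relies on.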
SVE has variable-length vectors ranging from 16B to 256B, and the length should be a power of 2, so the applicable vector lengths are 16B, 32B, 64B, 128B and 256B. This PR adds preliminary support for agnostic VL by reusing some of the existing xarch logic around TYP_SIMD32 and TYP_SIMD64, which can be further expanded to TYP_SIMD128 and TYP_SIMD256. It was easier to port the logic in various places using the existing higher vector length types than to create a type whose size is determined at runtime and then handle that new type throughout the code base, especially around value numbering. Once all the issues are ironed out, I will reconsider adding a generalized TYP_SIMD type instead of 32B, 64B, etc.
Vector
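The two decisions this section describes (which intrinsic family Vector<T> maps to, and which instruction set codegen then emits) can be sketched roughly as follows. All names here are hypothetical illustrations, not the JIT's real enums or functions.

```cpp
#include <cstdint>

// Hypothetical tags for the intrinsic family a Vector<T> call is imported as.
enum class IntrinsicFamily { Vector128, VectorT };

// Hypothetical tags for the instruction set codegen will emit.
enum class InstructionSet { Neon, Sve };

// Import-time decision: keep the Vector<T> -> Vector128<T> mapping when the
// vector length is 16 bytes, otherwise use the new Vector<T>-based intrinsics.
IntrinsicFamily familyForVectorT(uint32_t vlBytes)
{
    return (vlBytes > 16) ? IntrinsicFamily::VectorT : IntrinsicFamily::Vector128;
}

// Codegen-time decision: an intrinsic of Vector<T> type means SVE,
// a Vector128<T> intrinsic means NEON.
InstructionSet instructionSetFor(IntrinsicFamily family)
{
    return (family == IntrinsicFamily::VectorT) ? InstructionSet::Sve
                                                : InstructionSet::Neon;
}
```

Carrying the family on the intrinsic itself means codegen does not need to re-query the vector length to pick an encoding.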
Today, the Vector<T> type is mapped to the corresponding Vector128<T> intrinsic methods to generate NEON instructions, because NEON instructions operate on 16B data. We will detect the vector length and, if it is > 16B, use SVE instructions. To do that, we stop mapping Vector<T> -> Vector128<T> and instead introduce new intrinsics based on Vector<T>. These intrinsics correspond to the methods available on Vector<T>. Next, we propagate these intrinsics throughout the code base. During codegen, when we see an intrinsic of Vector<T> type, we know that we need to generate an SVE instruction instead of a NEON instruction.
Register allocation
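As back-of-the-envelope arithmetic for this section: if only the low 8 bytes of a register survive a call, the remaining bytes of a live vector have to be kept somewhere else. The sketch below is illustrative only; the names are hypothetical, it assumes a vector length of at least 16 bytes, and it is not the register allocator's actual logic.

```cpp
#include <cstdint>

// Only the lower half (8 bytes) of v8~v15 is preserved across a call.
const uint32_t kPreservedLowBytes = 8;

enum class UpperSaveLocation { CalleeSavedRegister, Stack };

// Bytes of a live vector that are NOT preserved by the callee.
// Assumes vlBytes >= 16.
uint32_t upperBytesToPreserve(uint32_t vlBytes)
{
    return vlBytes - kPreservedLowBytes;
}

// With a 16B vector the unpreserved 8 bytes fit in the preserved half of one
// callee-saved register. With wider vectors they do not, so this sketch
// follows the PR's choice and sends them to the stack.
UpperSaveLocation upperSaveLocation(uint32_t vlBytes)
{
    return (upperBytesToPreserve(vlBytes) <= kPreservedLowBytes)
               ? UpperSaveLocation::CalleeSavedRegister
               : UpperSaveLocation::Stack;
}
```

For a 32B vector this gives 24 bytes to preserve per live register, three times what a single preserved half can hold.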
In .NET 9, we adopted a custom ABI for SVE registers. For now, we will continue to use that ABI. At a call boundary, only the lower half of v8~v15 is callee-saved, and today we preserve the upper half of live SIMD registers in those registers. Since SVE registers are wider, we might need more than v8~v15 to preserve the upper portion of the killed registers. Hence, I decided to just spill them to the stack. In the future, when we fine-tune our ABI, we will update this design.
Other optimizations
In xarch, there are several other optimizations, like ReadUtf8 or Memmove, that benefit from a higher VL. I tried to enable them for Arm64 with higher VL, but for some of them I was not able to find optimal equivalent SVE instructions, and some needed SVE2 instructions. Hence, I decided not to do any optimization around this for now. We will enable them incrementally in the future.
Testing
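The effect of the testing flag discussed in this section can be sketched as a simple override of the detected vector length. The function names are hypothetical, and the flag is modeled as a plain boolean rather than through the JIT's real configuration mechanism.

```cpp
#include <cstdint>

// When the DEBUG flag (DOTNET_UseSveForVectorT in the PR) is set, the vector
// length is forced to 32 bytes; otherwise the detected length is used as is.
uint32_t effectiveVectorLength(uint32_t detectedVlBytes, bool useSveForVectorT)
{
    return useSveForVectorT ? 32u : detectedVlBytes;
}

// The heuristic from the overview: SVE is used for Vector<T> only above 16B.
bool takesSvePath(uint32_t vlBytes)
{
    return vlBytes > 16;
}
```

On hardware with a 16B vector length, `takesSvePath(effectiveVectorLength(16, true))` is true, which is why the override helps for superpmi / jitstress runs where the generated code is not executed.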
I have introduced a DEBUG flag, DOTNET_UseSveForVectorT. When it is set, we hardcode the VL to 32B in order to exercise the Vector<T>/SVE path mentioned above. This approach will work for superpmi / jitstress testing. I still need to validate how it behaves during actual execution on Cobalt machines, which have only a 16B VL. I thought about introducing a flag like DOTNET_MinVectorTLengthForSve, which would specify the minimum vector length needed to trigger SVE instructions and which we could have set to 16B during testing. However, I soon realized that there are a lot of code paths that take a dependency on TYP_SIMD16 and generate NEON instructions. Having DOTNET_UseSveForVectorT felt like the better approach.
TODOs
There are several TODOs that I will address before marking the PR for review, but others might have to be done incrementally.
Reference: #115037
Examples