Description
Background and motivation
While trying to narrow a Vector256<double> to a Vector128<float>, I found three different ways to write the conversion. All three return identical results, but each produces different codegen:
using System.Runtime.Intrinsics;
public static class TestClass
{
    public static Vector128<float> Narrow1(Vector256<double> value) {
        return Vector128.Narrow(value.GetLower(), value.GetUpper());
    }

    public static Vector128<float> Narrow2(Vector256<double> value) {
        return Vector256.Narrow(value, Vector256<double>.Zero).GetLower();
    }

    public static Vector128<float> Narrow3(Vector256<double> value) {
        return Vector256.Narrow(value, value).GetLower();
    }
}
These translate to the following code on my workstation (Windows x64, .NET 9.0, AVX2 support):
TestClass.Narrow1(System.Runtime.Intrinsics.Vector256`1<Double>)
L0000: vmovups ymm0, [rdx]
L0004: vmovaps ymm1, ymm0
L0008: vcvtpd2ps xmm1, xmm1
L000c: vextractf128 xmm0, ymm0, 1
L0012: vcvtpd2ps xmm0, xmm0
L0016: vmovlhps xmm0, xmm1, xmm0
L001a: vmovups [rcx], xmm0
L001e: mov rax, rcx
L0021: vzeroupper
L0024: ret
TestClass.Narrow2(System.Runtime.Intrinsics.Vector256`1<Double>)
L0000: vcvtpd2ps xmm0, ymmword ptr [rdx]
L0004: vxorps ymm1, ymm1, ymm1
L0008: vcvtpd2ps xmm1, ymm1
L000c: vinsertf128 ymm0, ymm0, xmm1, 1
L0012: vmovups [rcx], xmm0
L0016: mov rax, rcx
L0019: vzeroupper
L001c: ret
TestClass.Narrow3(System.Runtime.Intrinsics.Vector256`1<Double>)
L0000: vcvtpd2ps xmm0, ymmword ptr [rdx]
L0004: vmovaps ymm1, ymm0
L0008: vinsertf128 ymm0, ymm1, xmm0, 1
L000e: vmovups [rcx], xmm0
L0012: mov rax, rcx
L0015: vzeroupper
L0018: ret
Of these, Narrow3() seems to be the optimal one on my workstation: it emits the fewest instructions and only a single vcvtpd2ps.
API Proposal
I suggest adding clearer, dedicated methods for narrowing a Vector256<TFrom> to a Vector128<TTo> and for widening a Vector128<TTo> back to a Vector256<TFrom>.
The TFrom/TTo conversion pairs are: double/float, long/int, ulong/uint, int/short, uint/ushort, short/sbyte, ushort/byte.
namespace System.Runtime.Intrinsics;
public static class Vector256
{
    public static Vector128<float> Narrow(Vector256<double> value) => ...
    public static Vector128<int> Narrow(Vector256<long> value) => ...
    public static Vector128<uint> Narrow(Vector256<ulong> value) => ...
    public static Vector128<short> Narrow(Vector256<int> value) => ...
    public static Vector128<ushort> Narrow(Vector256<uint> value) => ...
    public static Vector128<sbyte> Narrow(Vector256<short> value) => ...
    public static Vector128<byte> Narrow(Vector256<ushort> value) => ...

    public static Vector256<double> Widen(Vector128<float> value) => ...
    public static Vector256<long> Widen(Vector128<int> value) => ...
    public static Vector256<ulong> Widen(Vector128<uint> value) => ...
    public static Vector256<int> Widen(Vector128<short> value) => ...
    public static Vector256<uint> Widen(Vector128<ushort> value) => ...
    public static Vector256<short> Widen(Vector128<sbyte> value) => ...
    public static Vector256<ushort> Widen(Vector128<byte> value) => ...
}
and analogous methods on Vector128 and Vector512...
Remark: I'm aware that, as opposed to the current Narrow/Widen methods, these new methods cannot be implemented as generic methods.
API Usage
Vector256<double> v0 = Vector256.Create(1.1, 2.2, 3.3, 4.4);
Vector128<float> v1 = Vector256.Narrow(v0);
Vector256<double> v2 = Vector256.Widen(v1);
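For comparison, the double/float pair can be approximated today with the existing API. The following is only a minimal sketch: the class name NarrowWidenSketch is hypothetical and not part of the proposal, the Narrow body simply reuses the Narrow3 pattern from above, and the Widen body is one possible composition of the existing Vector128.Widen and Vector256.Create methods whose codegen I have not measured.

```csharp
using System.Runtime.Intrinsics;

// Hypothetical helper class; the name is an assumption, not part of the proposed API.
public static class NarrowWidenSketch
{
    // Narrows 4 doubles to 4 floats using the Narrow3 pattern:
    // both halves of the 256-bit result hold the same 4 floats, so the lower half is enough.
    public static Vector128<float> Narrow(Vector256<double> value)
        => Vector256.Narrow(value, value).GetLower();

    // Widens 4 floats to 4 doubles by widening into two 128-bit halves
    // and recombining them into one 256-bit vector.
    public static Vector256<double> Widen(Vector128<float> value)
    {
        (Vector128<double> lower, Vector128<double> upper) = Vector128.Widen(value);
        return Vector256.Create(lower, upper);
    }
}
```

With these helpers the round trip above would read NarrowWidenSketch.Widen(NarrowWidenSketch.Narrow(v0)). Values such as 1.1 are not exactly representable as float, so v2 generally differs from v0 after the narrowing step.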
Alternative Designs
Clarify the optimal usage in the current API documentation.
Risks
Because these are new overloads, I see no risk.