Optimize Vector.Max codegen using AVX512 #116117

Merged: 6 commits, Jun 6, 2025

@alexcovington
Contributor

alexcovington commented May 29, 2025

This PR optimizes the codegen for Vector128/256/512.Max and Vector128/256/512.Min when AVX512 is supported. The new sequence uses the same vrange + vfixupimm pattern that the JIT already emits for scalar Math.Max/Math.Min.

Scalar version + codegen
public double ScalarMax(double x, double y)
{
    return Math.Max(x, y);
}

public double ScalarMin(double x, double y)
{
    return Math.Min(x, y);
}
C.ScalarMax(Double, Double)
    L0000: vrangesd xmm0, xmm1, xmm2, 5
    L0007: vmovups xmm3, [0x7ffcb40e0060]
    L000f: vfixupimmsd xmm1, xmm2, xmm3, 0
    L0016: vfixupimmsd xmm0, xmm1, xmm3, 0
    L001d: ret
C.ScalarMin(Double, Double)
    L0000: vrangesd xmm0, xmm1, xmm2, 4
    L0007: vmovups xmm3, [0x7ffcb40e0060]
    L000f: vfixupimmsd xmm1, xmm2, xmm3, 0
    L0016: vfixupimmsd xmm0, xmm1, xmm3, 0
    L001d: ret
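
For context, the scalar sequence above has to preserve the edge-case rules of Math.Max/Math.Min, which is presumably why a plain vmaxsd/vminsd is not enough. A minimal sketch of those rules, assuming .NET Core 3.0 or later (where Math.Max/Math.Min propagate NaN and order -0.0 below +0.0); the class name is only illustrative:

```csharp
using System;

static class MinMaxSemantics
{
    static void Main()
    {
        // NaN propagates regardless of which operand carries it.
        Console.WriteLine(double.IsNaN(Math.Max(double.NaN, 1.0))); // True
        Console.WriteLine(double.IsNaN(Math.Min(1.0, double.NaN))); // True

        // Signed zeros are ordered: -0.0 is treated as smaller than +0.0.
        Console.WriteLine(double.IsNegative(Math.Max(0.0, -0.0)));  // False
        Console.WriteLine(double.IsNegative(Math.Min(0.0, -0.0)));  // True
    }
}
```

Any vectorized replacement has to give the same answers lane by lane, which is what the listings below are comparing.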
Vector Max version + codegen
public Vector128<double> VectorMax(Vector128<double> x, Vector128<double> y)
{
    return Vector128.Max(x, y);
}

Codegen (before):

       vmovups   xmm0,[r8]
       vmovups   xmm1,[r9]
       vcmpeqpd  xmm2,xmm1,xmm0
       vxorps    xmm3,xmm3,xmm3
       vpcmpgtq  xmm3,xmm3,xmm1
       vcmpneqpd xmm4,xmm0,xmm0
       vpternlogq xmm4,xmm3,xmm2,0F8
       vcmpltpd  xmm2,xmm1,xmm0
       vorpd     xmm2,xmm2,xmm4
       vpternlogq xmm2,xmm1,xmm0,0AC
       vmovups   [rdx],xmm2
       mov       rax,rdx
       ret

Codegen (after):

       vmovups   xmm0,[r8]
       vmovups   xmm1,[r9]
       vrangepd  xmm2,xmm0,xmm1,5
       vmovddup  xmm3,qword ptr [7FFA5DFC3030]
       vfixupimmpd xmm0,xmm1,xmm3,0
       vfixupimmpd xmm2,xmm0,xmm3,0
       vmovups   [rdx],xmm2
       mov       rax,rdx
       ret
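
Read instruction by instruction, the "before" sequence builds a selection mask and then blends x and y with it. The sketch below is my own transcription of that listing into Vector128 calls, not code taken from the runtime; MaxViaSelect is a hypothetical name, and the sketch assumes .NET 7 or later for the generic Vector128 helpers.

```csharp
using System;
using System.Runtime.Intrinsics;

static class MaxFallbackSketch
{
    // Hypothetical helper mirroring the "before" instruction sequence.
    public static Vector128<double> MaxViaSelect(Vector128<double> x, Vector128<double> y)
    {
        // vcmpeqpd: lanes where x and y compare equal (+0.0 == -0.0 counts as equal).
        Vector128<double> equal = Vector128.Equals(y, x);
        // vxorps + vpcmpgtq: lanes where y has its sign bit set (0 > y as int64).
        Vector128<double> yNegative =
            Vector128.GreaterThan(Vector128<long>.Zero, y.AsInt64()).AsDouble();
        // vcmpneqpd x, x: lanes where x is NaN.
        Vector128<double> xIsNaN = ~Vector128.Equals(x, x);
        // vpternlogq 0xF8: A | (B & C).
        Vector128<double> mask = xIsNaN | (yNegative & equal);
        // vcmpltpd + vorpd: also pick x wherever y < x.
        mask |= Vector128.LessThan(y, x);
        // vpternlogq 0xAC: mask ? x : y.
        return Vector128.ConditionalSelect(mask, x, y);
    }

    static void Main()
    {
        Vector128<double> x = Vector128.Create(double.NaN, 0.0);
        Vector128<double> y = Vector128.Create(1.0, -0.0);
        Vector128<double> r = MaxViaSelect(x, y);
        Console.WriteLine(double.IsNaN(r.GetElement(0)));      // True
        Console.WriteLine(double.IsNegative(r.GetElement(1))); // False
    }
}
```

When y is NaN and x is not, every comparison is false, so the blend falls through to y and the NaN still propagates.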
Vector Min version + codegen
public Vector128<double> VectorMin(Vector128<double> x, Vector128<double> y)
{
    return Vector128.Min(x, y);
}

Codegen (before):

       vmovups   xmm0,[r8]
       vmovups   xmm1,[r9]
       vcmpeqpd  xmm2,xmm1,xmm0
       vxorps    xmm3,xmm3,xmm3
       vpcmpgtq  xmm3,xmm3,xmm0
       vcmpneqpd xmm4,xmm0,xmm0
       vpternlogq xmm4,xmm3,xmm2,0F8
       vcmpltpd  xmm2,xmm0,xmm1
       vorpd     xmm2,xmm2,xmm4
       vpternlogq xmm2,xmm1,xmm0,0AC
       vmovups   [rdx],xmm2
       mov       rax,rdx
       ret

Codegen (after):

       vmovups   xmm0,[r8]
       vmovups   xmm1,[r9]
       vrangepd  xmm2,xmm0,xmm1,4
       vmovddup  xmm3,qword ptr [7FFA82BD3030]
       vfixupimmpd xmm0,xmm1,xmm3,0
       vfixupimmpd xmm2,xmm0,xmm3,0
       vmovups   [rdx],xmm2
       mov       rax,rdx
       ret
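
After the change, the Max and Min sequences differ only in the vrangepd immediate (5 versus 4). As I read the Intel SDM, imm8[1:0] picks the operation and imm8[3:2] picks how the sign of the result is chosen. The decoder below is a hypothetical illustration of that reading, not part of the JIT:

```csharp
using System;

static class VRangeImm
{
    // Decodes the low four bits of a vrange imm8 (my reading of the Intel SDM).
    public static string Describe(byte imm8)
    {
        string op = (imm8 & 0b11) switch
        {
            0 => "min",
            1 => "max",
            2 => "min of absolute values",
            _ => "max of absolute values",
        };
        string sign = ((imm8 >> 2) & 0b11) switch
        {
            0 => "sign from first source",
            1 => "sign from comparison result",
            2 => "sign forced to 0",
            _ => "sign forced to 1",
        };
        return $"{op}, {sign}";
    }

    static void Main()
    {
        Console.WriteLine(Describe(5)); // max, sign from comparison result
        Console.WriteLine(Describe(4)); // min, sign from comparison result
    }
}
```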
Performance

These are the existing Max microbenchmarks for System.Numerics.Tensors:

| Type                                | Method     | Job        | Toolchain                              | BufferLength | Mean       | Error      | StdDev     | Median     | Min        | Max        | Ratio | RatioSD | Allocated | Alloc Ratio |
|------------------------------------ |----------- |----------- |--------------------------------------- |------------- |-----------:|-----------:|-----------:|-----------:|-----------:|-----------:|------:|--------:|----------:|------------:|
| Perf_NumberTensorPrimitives<Double> | Max_Vector | Job-DWINXP | \core_roots\base\Core_Root\corerun.exe | 128          |  23.580 ns |  0.4058 ns |  0.3597 ns |  23.672 ns |  22.363 ns |  23.797 ns |  1.00 |    0.02 |         - |          NA |
| Perf_NumberTensorPrimitives<Double> | Max_Vector | Job-OMRYDZ | \core_roots\diff\Core_Root\corerun.exe | 128          |  15.917 ns |  0.0598 ns |  0.0530 ns |  15.912 ns |  15.797 ns |  16.002 ns |  0.68 |    0.01 |         - |          NA |
|                                     |            |            |                                        |              |            |            |            |            |            |            |       |         |           |             |
| Perf_NumberTensorPrimitives<Single> | Max_Vector | Job-DWINXP | \core_roots\base\Core_Root\corerun.exe | 128          |  13.422 ns |  0.1850 ns |  0.1731 ns |  13.425 ns |  13.119 ns |  13.691 ns |  1.00 |    0.02 |         - |          NA |
| Perf_NumberTensorPrimitives<Single> | Max_Vector | Job-OMRYDZ | \core_roots\diff\Core_Root\corerun.exe | 128          |   9.448 ns |  0.1728 ns |  0.1617 ns |   9.427 ns |   9.265 ns |   9.781 ns |  0.70 |    0.01 |         - |          NA |
|                                     |            |            |                                        |              |            |            |            |            |            |            |       |         |           |             |
| Perf_NumberTensorPrimitives<Double> | Max_Scalar | Job-DWINXP | \core_roots\base\Core_Root\corerun.exe | 128          |  17.686 ns |  0.2818 ns |  0.2498 ns |  17.797 ns |  17.083 ns |  17.856 ns |  1.00 |    0.02 |         - |          NA |
| Perf_NumberTensorPrimitives<Double> | Max_Scalar | Job-OMRYDZ | \core_roots\diff\Core_Root\corerun.exe | 128          |  10.053 ns |  0.0393 ns |  0.0368 ns |  10.060 ns |   9.986 ns |  10.117 ns |  0.57 |    0.01 |         - |          NA |
|                                     |            |            |                                        |              |            |            |            |            |            |            |       |         |           |             |
| Perf_NumberTensorPrimitives<Single> | Max_Scalar | Job-DWINXP | \core_roots\base\Core_Root\corerun.exe | 128          |  10.899 ns |  0.1788 ns |  0.1493 ns |  10.944 ns |  10.412 ns |  10.995 ns |  1.00 |    0.02 |         - |          NA |
| Perf_NumberTensorPrimitives<Single> | Max_Scalar | Job-OMRYDZ | \core_roots\diff\Core_Root\corerun.exe | 128          |   6.588 ns |  0.0550 ns |  0.0515 ns |   6.576 ns |   6.492 ns |   6.664 ns |  0.60 |    0.01 |         - |          NA |
|                                     |            |            |                                        |              |            |            |            |            |            |            |       |         |           |             |
| Perf_NumberTensorPrimitives<Double> | Max        | Job-DWINXP | \core_roots\base\Core_Root\corerun.exe | 128          |  39.597 ns |  0.3459 ns |  0.3236 ns |  39.617 ns |  38.713 ns |  40.068 ns |  1.00 |    0.01 |         - |          NA |
| Perf_NumberTensorPrimitives<Double> | Max        | Job-OMRYDZ | \core_roots\diff\Core_Root\corerun.exe | 128          |  20.630 ns |  0.0695 ns |  0.0616 ns |  20.622 ns |  20.556 ns |  20.759 ns |  0.52 |    0.00 |         - |          NA |
|                                     |            |            |                                        |              |            |            |            |            |            |            |       |         |           |             |
| Perf_NumberTensorPrimitives<Single> | Max        | Job-DWINXP | \core_roots\base\Core_Root\corerun.exe | 128          |  23.512 ns |  0.0944 ns |  0.0883 ns |  23.520 ns |  23.288 ns |  23.653 ns |  1.00 |    0.01 |         - |          NA |
| Perf_NumberTensorPrimitives<Single> | Max        | Job-OMRYDZ | \core_roots\diff\Core_Root\corerun.exe | 128          |  11.249 ns |  0.1301 ns |  0.1217 ns |  11.295 ns |  10.971 ns |  11.385 ns |  0.48 |    0.01 |         - |          NA |
|                                     |            |            |                                        |              |            |            |            |            |            |            |       |         |           |             |
| Perf_NumberTensorPrimitives<Double> | Max_Vector | Job-DWINXP | \core_roots\base\Core_Root\corerun.exe | 3079         | 549.776 ns |  3.1703 ns |  2.9655 ns | 550.484 ns | 544.151 ns | 553.314 ns |  1.00 |    0.01 |         - |          NA |
| Perf_NumberTensorPrimitives<Double> | Max_Vector | Job-OMRYDZ | \core_roots\diff\Core_Root\corerun.exe | 3079         | 441.097 ns |  1.1802 ns |  1.1039 ns | 441.030 ns | 439.368 ns | 442.973 ns |  0.80 |    0.00 |         - |          NA |
|                                     |            |            |                                        |              |            |            |            |            |            |            |       |         |           |             |
| Perf_NumberTensorPrimitives<Single> | Max_Vector | Job-DWINXP | \core_roots\base\Core_Root\corerun.exe | 3079         | 335.473 ns |  4.3674 ns |  4.0853 ns | 336.539 ns | 324.981 ns | 339.704 ns |  1.00 |    0.02 |         - |          NA |
| Perf_NumberTensorPrimitives<Single> | Max_Vector | Job-OMRYDZ | \core_roots\diff\Core_Root\corerun.exe | 3079         | 183.924 ns |  0.3314 ns |  0.2938 ns | 183.908 ns | 183.439 ns | 184.449 ns |  0.55 |    0.01 |         - |          NA |
|                                     |            |            |                                        |              |            |            |            |            |            |            |       |         |           |             |
| Perf_NumberTensorPrimitives<Double> | Max_Scalar | Job-DWINXP | \core_roots\base\Core_Root\corerun.exe | 3079         | 376.094 ns |  1.9991 ns |  1.8700 ns | 376.419 ns | 370.356 ns | 378.039 ns |  1.00 |    0.01 |         - |          NA |
| Perf_NumberTensorPrimitives<Double> | Max_Scalar | Job-OMRYDZ | \core_roots\diff\Core_Root\corerun.exe | 3079         | 303.696 ns |  1.1710 ns |  1.0954 ns | 303.673 ns | 302.163 ns | 305.591 ns |  0.81 |    0.00 |         - |          NA |
|                                     |            |            |                                        |              |            |            |            |            |            |            |       |         |           |             |
| Perf_NumberTensorPrimitives<Single> | Max_Scalar | Job-DWINXP | \core_roots\base\Core_Root\corerun.exe | 3079         | 177.732 ns |  0.6656 ns |  0.5558 ns | 177.806 ns | 176.879 ns | 178.902 ns |  1.00 |    0.00 |         - |          NA |
| Perf_NumberTensorPrimitives<Single> | Max_Scalar | Job-OMRYDZ | \core_roots\diff\Core_Root\corerun.exe | 3079         | 107.079 ns |  0.3884 ns |  0.3443 ns | 107.008 ns | 106.533 ns | 107.741 ns |  0.60 |    0.00 |         - |          NA |
|                                     |            |            |                                        |              |            |            |            |            |            |            |       |         |           |             |
| Perf_NumberTensorPrimitives<Double> | Max        | Job-DWINXP | \core_roots\base\Core_Root\corerun.exe | 3079         | 902.804 ns | 16.9216 ns | 16.6193 ns | 907.864 ns | 848.689 ns | 921.769 ns |  1.00 |    0.03 |         - |          NA |
| Perf_NumberTensorPrimitives<Double> | Max        | Job-OMRYDZ | \core_roots\diff\Core_Root\corerun.exe | 3079         | 513.187 ns |  2.4330 ns |  2.2758 ns | 513.181 ns | 510.050 ns | 517.755 ns |  0.57 |    0.01 |         - |          NA |
|                                     |            |            |                                        |              |            |            |            |            |            |            |       |         |           |             |
| Perf_NumberTensorPrimitives<Single> | Max        | Job-DWINXP | \core_roots\base\Core_Root\corerun.exe | 3079         | 461.626 ns |  3.2858 ns |  3.0735 ns | 462.374 ns | 453.206 ns | 465.063 ns |  1.00 |    0.01 |         - |          NA |
| Perf_NumberTensorPrimitives<Single> | Max        | Job-OMRYDZ | \core_roots\diff\Core_Root\corerun.exe | 3079         | 253.227 ns |  1.0693 ns |  1.0003 ns | 253.197 ns | 251.372 ns | 254.555 ns |  0.55 |    0.00 |         - |          NA |
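
The Ratio column is, to a close approximation, the diff mean divided by the base mean. A quick check against two Double rows of the table above (the helper name is illustrative; BenchmarkDotNet derives the column from its own statistics, so small rounding differences are possible):

```csharp
using System;

static class RatioCheck
{
    // Ratio of the diff toolchain's mean to the base toolchain's mean, rounded to two places.
    public static double Ratio(double baseNs, double diffNs) => Math.Round(diffNs / baseNs, 2);

    static void Main()
    {
        // Max_Vector, Double, BufferLength 128: table reports 0.68.
        Console.WriteLine(Ratio(23.580, 15.917));
        // Max, Double, BufferLength 128: table reports 0.52.
        Console.WriteLine(Ratio(39.597, 20.630));
    }
}
```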

There were not any equivalent Min benchmarks, so I copied the existing Max microbenchmarks and modified them to use Min:

| Type                                | Method     | Job        | Toolchain                   | BufferLength | Mean       | Error      | StdDev     | Median     | Min        | Max        | Ratio | RatioSD | Allocated | Alloc Ratio |
|------------------------------------ |----------- |----------- |---------------------------- |------------- |-----------:|-----------:|-----------:|-----------:|-----------:|-----------:|------:|--------:|----------:|------------:|
| Perf_NumberTensorPrimitives<Double> | Min_Vector | Job-YYSURG | \base\Core_Root\corerun.exe | 128          |  23.971 ns |  0.3589 ns |  0.3357 ns |  23.974 ns |  23.420 ns |  24.539 ns |  1.00 |    0.02 |         - |          NA |
| Perf_NumberTensorPrimitives<Double> | Min_Vector | Job-GRFSHG | \diff\Core_Root\corerun.exe | 128          |  15.898 ns |  0.0416 ns |  0.0369 ns |  15.893 ns |  15.842 ns |  15.971 ns |  0.66 |    0.01 |         - |          NA |
|                                     |            |            |                             |              |            |            |            |            |            |            |       |         |           |             |
| Perf_NumberTensorPrimitives<Single> | Min_Vector | Job-YYSURG | \base\Core_Root\corerun.exe | 128          |  13.124 ns |  0.2579 ns |  0.2760 ns |  13.154 ns |  12.374 ns |  13.467 ns |  1.00 |    0.03 |         - |          NA |
| Perf_NumberTensorPrimitives<Single> | Min_Vector | Job-GRFSHG | \diff\Core_Root\corerun.exe | 128          |  10.937 ns |  0.1407 ns |  0.1098 ns |  10.954 ns |  10.745 ns |  11.098 ns |  0.83 |    0.02 |         - |          NA |
|                                     |            |            |                             |              |            |            |            |            |            |            |       |         |           |             |
| Perf_NumberTensorPrimitives<Double> | Min_Scalar | Job-YYSURG | \base\Core_Root\corerun.exe | 128          |  23.141 ns |  0.2489 ns |  0.2328 ns |  23.107 ns |  22.706 ns |  23.552 ns |  1.00 |    0.01 |         - |          NA |
| Perf_NumberTensorPrimitives<Double> | Min_Scalar | Job-GRFSHG | \diff\Core_Root\corerun.exe | 128          |  10.044 ns |  0.0248 ns |  0.0232 ns |  10.047 ns |   9.991 ns |  10.087 ns |  0.43 |    0.00 |         - |          NA |
|                                     |            |            |                             |              |            |            |            |            |            |            |       |         |           |             |
| Perf_NumberTensorPrimitives<Single> | Min_Scalar | Job-YYSURG | \base\Core_Root\corerun.exe | 128          |  12.241 ns |  0.1603 ns |  0.1500 ns |  12.287 ns |  11.919 ns |  12.462 ns |  1.00 |    0.02 |         - |          NA |
| Perf_NumberTensorPrimitives<Single> | Min_Scalar | Job-GRFSHG | \diff\Core_Root\corerun.exe | 128          |   8.802 ns |  0.1085 ns |  0.0962 ns |   8.790 ns |   8.689 ns |   8.994 ns |  0.72 |    0.01 |         - |          NA |
|                                     |            |            |                             |              |            |            |            |            |            |            |       |         |           |             |
| Perf_NumberTensorPrimitives<Double> | Min        | Job-YYSURG | \base\Core_Root\corerun.exe | 128          |  41.013 ns |  0.6195 ns |  0.5173 ns |  40.804 ns |  40.420 ns |  41.904 ns |  1.00 |    0.02 |         - |          NA |
| Perf_NumberTensorPrimitives<Double> | Min        | Job-GRFSHG | \diff\Core_Root\corerun.exe | 128          |  21.078 ns |  0.0973 ns |  0.0910 ns |  21.072 ns |  20.925 ns |  21.260 ns |  0.51 |    0.01 |         - |          NA |
|                                     |            |            |                             |              |            |            |            |            |            |            |       |         |           |             |
| Perf_NumberTensorPrimitives<Single> | Min        | Job-YYSURG | \base\Core_Root\corerun.exe | 128          |  23.518 ns |  0.2475 ns |  0.2315 ns |  23.578 ns |  23.127 ns |  23.825 ns |  1.00 |    0.01 |         - |          NA |
| Perf_NumberTensorPrimitives<Single> | Min        | Job-GRFSHG | \diff\Core_Root\corerun.exe | 128          |  11.097 ns |  0.0823 ns |  0.0770 ns |  11.095 ns |  10.989 ns |  11.230 ns |  0.47 |    0.01 |         - |          NA |
|                                     |            |            |                             |              |            |            |            |            |            |            |       |         |           |             |
| Perf_NumberTensorPrimitives<Double> | Min_Vector | Job-YYSURG | \base\Core_Root\corerun.exe | 3079         | 545.986 ns |  6.8797 ns |  6.4353 ns | 547.852 ns | 527.453 ns | 551.870 ns |  1.00 |    0.02 |         - |          NA |
| Perf_NumberTensorPrimitives<Double> | Min_Vector | Job-GRFSHG | \diff\Core_Root\corerun.exe | 3079         | 444.900 ns |  1.2448 ns |  1.1035 ns | 444.832 ns | 443.414 ns | 447.291 ns |  0.81 |    0.01 |         - |          NA |
|                                     |            |            |                             |              |            |            |            |            |            |            |       |         |           |             |
| Perf_NumberTensorPrimitives<Single> | Min_Vector | Job-YYSURG | \base\Core_Root\corerun.exe | 3079         | 340.412 ns |  2.4956 ns |  2.2123 ns | 340.250 ns | 336.724 ns | 344.329 ns |  1.00 |    0.01 |         - |          NA |
| Perf_NumberTensorPrimitives<Single> | Min_Vector | Job-GRFSHG | \diff\Core_Root\corerun.exe | 3079         | 186.607 ns |  1.1039 ns |  0.9218 ns | 186.543 ns | 185.290 ns | 188.812 ns |  0.55 |    0.00 |         - |          NA |
|                                     |            |            |                             |              |            |            |            |            |            |            |       |         |           |             |
| Perf_NumberTensorPrimitives<Double> | Min_Scalar | Job-YYSURG | \base\Core_Root\corerun.exe | 3079         | 456.055 ns |  3.7783 ns |  3.5343 ns | 456.857 ns | 449.796 ns | 461.236 ns |  1.00 |    0.01 |         - |          NA |
| Perf_NumberTensorPrimitives<Double> | Min_Scalar | Job-GRFSHG | \diff\Core_Root\corerun.exe | 3079         | 304.747 ns |  1.1251 ns |  0.8784 ns | 304.872 ns | 303.667 ns | 305.891 ns |  0.67 |    0.01 |         - |          NA |
|                                     |            |            |                             |              |            |            |            |            |            |            |       |         |           |             |
| Perf_NumberTensorPrimitives<Single> | Min_Scalar | Job-YYSURG | \base\Core_Root\corerun.exe | 3079         | 221.380 ns |  1.2605 ns |  1.1791 ns | 221.479 ns | 218.182 ns | 222.767 ns |  1.00 |    0.01 |         - |          NA |
| Perf_NumberTensorPrimitives<Single> | Min_Scalar | Job-GRFSHG | \diff\Core_Root\corerun.exe | 3079         | 109.071 ns |  0.3560 ns |  0.3330 ns | 109.008 ns | 108.410 ns | 109.616 ns |  0.49 |    0.00 |         - |          NA |
|                                     |            |            |                             |              |            |            |            |            |            |            |       |         |           |             |
| Perf_NumberTensorPrimitives<Double> | Min        | Job-YYSURG | \base\Core_Root\corerun.exe | 3079         | 975.787 ns | 12.1159 ns | 10.7404 ns | 980.540 ns | 948.919 ns | 985.321 ns |  1.00 |    0.02 |         - |          NA |
| Perf_NumberTensorPrimitives<Double> | Min        | Job-GRFSHG | \diff\Core_Root\corerun.exe | 3079         | 519.316 ns |  4.3723 ns |  3.8760 ns | 520.750 ns | 511.615 ns | 524.735 ns |  0.53 |    0.01 |         - |          NA |
|                                     |            |            |                             |              |            |            |            |            |            |            |       |         |           |             |
| Perf_NumberTensorPrimitives<Single> | Min        | Job-YYSURG | \base\Core_Root\corerun.exe | 3079         | 488.570 ns |  7.7281 ns |  7.2289 ns | 484.997 ns | 475.244 ns | 499.179 ns |  1.00 |    0.02 |         - |          NA |
| Perf_NumberTensorPrimitives<Single> | Min        | Job-GRFSHG | \diff\Core_Root\corerun.exe | 3079         | 253.116 ns |  1.0311 ns |  0.9141 ns | 252.851 ns | 252.003 ns | 254.950 ns |  0.52 |    0.01 |         - |          NA |

@github-actions github-actions bot added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label May 29, 2025
@dotnet-policy-service dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label May 29, 2025
@jkotas jkotas added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI and removed needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners labels May 30, 2025
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@tannergooding
Member

Changes generally LGTM. A recent simplification to InstructionSet_AVX512* reduced what the JIT needs to handle in cases like this, and the PR needs to be updated to account for it.

There's likely an opportunity to share the general logic between the scalar and vector code paths, but it's not "critical" to do in this PR.

@tannergooding
Member

CC. @dotnet/jit-contrib, @EgorBo for secondary sign-off.

@tannergooding tannergooding requested a review from EgorBo June 5, 2025 17:06
@EgorBo
Member

EgorBo commented Jun 6, 2025

These Min and Max routines look like they can be unified to remove the code duplication: https://www.diffchecker.com/SYTwWTzO/

@tannergooding
Member

> These Min and Max routines look like they can be unified to remove the code duplication:

Right. They are unified for the scalar path already too.

I think when we finish the factoring to also accelerate *Magnitude, *MagnitudeNumber, and *Number, we should do so by extracting the scalar path into a reusable helper that takes 1 extra bool isScalar parameter. That way we can have one function that handles generation for all 16 variations (8 scalar, 8 vector).

@tannergooding tannergooding merged commit 9214279 into dotnet:main Jun 6, 2025
109 checks passed
@tannergooding
Member

@alexcovington, is covering the other Min/Max APIs listed above also something you'd be interested in doing? If not, I can get something up over the weekend handling them.

@alexcovington
Contributor Author

> @alexcovington, is covering the other Min/Max APIs listed above also something you'd be interested in doing? If not, I can get something up over the weekend handling them.

I'm not sure it's something I can commit to right now. It is probably better to assign the other variants to someone else for now.

@tannergooding
Member

No worries, I can get to it.

Thanks for the improvements made here!!

@alexcovington
Contributor Author

Thanks so much for the feedback and review @tannergooding and @EgorBo!

Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI community-contribution Indicates that the PR has been added by a community member