Prefer to keep member functions as member functions #3921

Open
jkwak-work opened this issue Apr 9, 2024 · 5 comments
Labels
goal:quality & productivity, kind:performance

Comments

@jkwak-work (Collaborator)

Description of the problem
It appears that DXC's compile time goes up when a certain condition is met. It is still unclear exactly what the condition is, but we would like to avoid any unnecessary extra compile time if possible.

The observation is that compile time goes up when calling global functions, compared to calling member functions. Since Slang turns most, if not all, member functions into global functions, there is a higher chance that we will trigger the issue on the DXC side.
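For illustration, here is a rough sketch of that lowering; the Slang source and the emitted HLSL below are simplified, and the identifier names are illustrative rather than actual compiler output:

// Slang source: an ordinary member function.
struct LaneVec
{
    float3 data[64];
    float3 get(uint i) { return data[i]; }
};

// Roughly what Slang emits as HLSL today: a global function with an
// explicit `this` parameter, so DXC sees a global call instead of a
// member call.
float3 LaneVec_get_0(LaneVec this_0, uint i)
{
    return this_0.data[i];
}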

Goal
We should prefer to keep member functions as member functions so that we have a better chance of avoiding this DXC performance issue.

Repro-steps
Note that the issue is not 100% reproducible just by calling a global function, although it occurs deterministically once you hit the case; there must be other conditions required to reproduce it.
The current theory is that a certain pattern of shader code triggers an optimization mechanism that increases the compile time unnecessarily.

Here is a simple example that can reproduce the issue.

#ifndef SIMD_SIZE
#define SIMD_SIZE 64
#endif

struct TLaneVector
{
    float3  RegisterArray[SIMD_SIZE];

    float3 GetElement(const uint SimdIndex)
    {
        return RegisterArray[SimdIndex];
    }

    void SetElement(const uint SimdIndex, float3 Element)
    {
        RegisterArray[SimdIndex] = Element;
    }
};

void TLaneVector_SetElement(inout TLaneVector this_0, uint SimdIndex, float3 Element)
{
    this_0.RegisterArray[SimdIndex] = Element;
}

float3 TLaneVector_GetElement(TLaneVector this_3, uint SimdIndex)
{
    return this_3.RegisterArray[SimdIndex];
}

TLaneVector min_0(TLaneVector A, TLaneVector B)
{
    TLaneVector R_11;

    [unroll]
    for(uint SimdIndex = 0U; SimdIndex < SIMD_SIZE; ++SimdIndex)
    {
#ifdef USE_MEMBER
        float3 lhs = A.GetElement(SimdIndex);
#else
        float3 lhs = TLaneVector_GetElement(A, SimdIndex);
#endif

        float3 rhs = TLaneVector_GetElement(B, SimdIndex);
        TLaneVector_SetElement(R_11, SimdIndex, min(lhs, rhs));
    }
    return R_11;
}

[numthreads(4, 1, 1)]
void MainCS(uint2 GroupId_0 : SV_GroupID, uint GroupThreadIndex_20 : SV_GroupIndex)
{
    TLaneVector AccumulatorData;
    TLaneVector Sample0;
    TLaneVector Sample1;

    //[unroll]
    for (uint SimdIndex = 0U; SimdIndex < SIMD_SIZE; ++SimdIndex)
    {
        TLaneVector_SetElement(Sample0, SimdIndex, 0);
        TLaneVector_SetElement(Sample1, SimdIndex, 1);
    }

#ifdef USE_MEANINGLESS_COPY
    // For some reason, this copy is required to reproduce the issue.
    // Probably this line triggers some kind of extra optimization that
    // increases the compile-time.
    AccumulatorData = Sample0;
#endif

    AccumulatorData = min_0(Sample0, Sample1);
}

Assuming that the shader code above is saved as "extra_compile_time.hlsl", it can be compiled in the following four ways:

/usr/bin/time -f "%e" dxc.exe -E MainCS -T cs_6_6 extra_compile_time.hlsl -Fo a.out -DSIMD_SIZE=64
/usr/bin/time -f "%e" dxc.exe -E MainCS -T cs_6_6 extra_compile_time.hlsl -Fo a.out -DSIMD_SIZE=64 -DUSE_MEMBER
/usr/bin/time -f "%e" dxc.exe -E MainCS -T cs_6_6 extra_compile_time.hlsl -Fo a.out -DSIMD_SIZE=64 -DUSE_MEANINGLESS_COPY
/usr/bin/time -f "%e" dxc.exe -E MainCS -T cs_6_6 extra_compile_time.hlsl -Fo a.out -DSIMD_SIZE=64 -DUSE_MEANINGLESS_COPY -DUSE_MEMBER

Each command took the following amount of time to compile on my machine:

0.11 seconds
0.20 seconds
2.68 seconds
0.18 seconds

Note that only the third case shows an increased compile time, while the others are similarly fast.
Also note that the generated DXIL binary files are all identical.

You can make the difference more obvious by using a larger value for SIMD_SIZE.
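For example, a sweep over SIMD_SIZE in the slow configuration (a sketch assuming bash and the same dxc.exe invocation as above) can be used to see how the compile time grows:

# Sketch: time the slow configuration (USE_MEANINGLESS_COPY without
# USE_MEMBER) for increasing SIMD_SIZE values; adjust the dxc path
# for your local build.
for n in 64 128 256; do
    echo "SIMD_SIZE=$n"
    /usr/bin/time -f "%e" dxc.exe -E MainCS -T cs_6_6 extra_compile_time.hlsl \
        -Fo a.out -DSIMD_SIZE=$n -DUSE_MEANINGLESS_COPY
done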

@jkwak-work (Collaborator, Author)

I was able to reproduce the issue with the latest DXC from GitHub.

dxcompiler.dll: 1.8 - 1.8.0.14549 (main, 0781ded87)

jkwak-work added the kind:performance and goal:quality & productivity labels on Apr 10, 2024
@csyonghe (Collaborator)

We should file this bug with DXC to see if they can find out what exactly caused the slowdown.

The amount of slowdown seems pathological.

@jkwak-work (Collaborator, Author)

I filed an issue on the DXC GitHub for the same problem.

https://github.com/microsoft/DirectXShaderCompiler/issues/6518

@jkwak-work (Collaborator, Author)

I got an explanation on the DXC GitHub.

https://github.com/microsoft/DirectXShaderCompiler/issues/6512#issuecomment-2047992891

This has to do with the fact that the "inout" modifier on global function parameters means "copy-in and copy-out" rather than pass-by-reference, and LLVM takes extra time to optimize the copies away so the code behaves as if it were a reference.

On the other hand, "this" is a true reference in member functions, which is why the compile time is faster.
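Put differently, under that explanation a call through an inout parameter behaves roughly like the conceptual expansion below (a sketch of the semantics, not literal DXC output):

// Conceptual expansion of TLaneVector_SetElement(R_11, SimdIndex, v)
// under copy-in/copy-out semantics:
TLaneVector tmp = R_11;                  // copy-in: the whole 64 x float3 array
tmp.RegisterArray[SimdIndex] = v;        // the function body runs on the copy
R_11 = tmp;                              // copy-out: the whole array copied back

// A member call, R_11.SetElement(SimdIndex, v), mutates the object through
// `this` as a true reference, so there are no whole-array copies for LLVM
// to prove away.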

@csyonghe (Collaborator)

For now, we may be able to work around the issue by marking the member methods with [ForceInline] in Slang.
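A sketch of what that might look like in Slang source (assuming [ForceInline] on each member method is enough to keep a wrapper with an inout `this` out of the generated HLSL; this still needs to be verified):

struct TLaneVector
{
    float3 RegisterArray[SIMD_SIZE];

    // Force-inline the member methods so no global wrapper function
    // with an inout `this` parameter reaches DXC.
    [ForceInline]
    float3 GetElement(const uint SimdIndex)
    {
        return RegisterArray[SimdIndex];
    }

    [ForceInline]
    void SetElement(const uint SimdIndex, float3 Element)
    {
        RegisterArray[SimdIndex] = Element;
    }
};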
