Prefer to keep member functions as member functions #3921

Open
jkwak-work opened this issue Apr 9, 2024 · 5 comments
Labels
goal:quality & productivity, kind:performance

Comments

@jkwak-work (Collaborator)

Description of the problem
It appears that DXC's compile time goes up when a certain condition is met. It is still unclear exactly what the condition is, but we would like to avoid any unnecessary extra compile time if possible.

The observation is that compile time goes up when calling global functions, compared to calling member functions. Since Slang turns most, if not all, member functions into global functions, there is a higher chance that we will trigger the issue on the DXC side.
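For illustration, here is a rough sketch of that lowering; the Slang source and the emitted HLSL below are simplified, and the identifier names are illustrative rather than actual compiler output:

// Slang source: an ordinary member function.
struct LaneVec
{
    float3 data[64];
    float3 get(uint i) { return data[i]; }
};

// Roughly what Slang emits as HLSL today: a global function with an
// explicit `this` parameter, so DXC sees a global call instead of a
// member call.
float3 LaneVec_get_0(LaneVec this_0, uint i)
{
    return this_0.data[i];
}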

Goal
We should prefer to keep member functions as member functions so that we have a better chance of avoiding this DXC performance issue.

Repro-steps
Note that the issue is not 100% reproducible just by calling a global function, although it occurs deterministically once you hit the case; there must be other conditions required to reproduce it.
The current theory is that a certain pattern of shader code triggers an optimization mechanism that increases the compile time unnecessarily.

Here is a simple example that can reproduce the issue.

#ifndef SIMD_SIZE
#define SIMD_SIZE 64
#endif

struct TLaneVector
{
    float3  RegisterArray[SIMD_SIZE];

    float3 GetElement(const uint SimdIndex)
    {
        return RegisterArray[SimdIndex];
    }

    void SetElement(const uint SimdIndex, float3 Element)
    {
        RegisterArray[SimdIndex] = Element;
    }
};

void TLaneVector_SetElement(inout TLaneVector this_0, uint SimdIndex, float3 Element)
{
    this_0.RegisterArray[SimdIndex] = Element;
}

float3 TLaneVector_GetElement(TLaneVector this_3, uint SimdIndex)
{
    return this_3.RegisterArray[SimdIndex];
}

TLaneVector min_0(TLaneVector A, TLaneVector B)
{
    TLaneVector R_11;

    [unroll]
    for(uint SimdIndex = 0U; SimdIndex < SIMD_SIZE; ++SimdIndex)
    {
#ifdef USE_MEMBER
        float3 lhs = A.GetElement(SimdIndex);
#else
        float3 lhs = TLaneVector_GetElement(A, SimdIndex);
#endif

        float3 rhs = TLaneVector_GetElement(B, SimdIndex);
        TLaneVector_SetElement(R_11, SimdIndex, min(lhs, rhs));
    }
    return R_11;
}

[numthreads(4, 1, 1)]
void MainCS(uint2 GroupId_0 : SV_GroupID, uint GroupThreadIndex_20 : SV_GroupIndex)
{
    TLaneVector AccumulatorData;
    TLaneVector Sample0;
    TLaneVector Sample1;

    //[unroll]
    for (uint SimdIndex = 0U; SimdIndex < SIMD_SIZE; ++SimdIndex)
    {
        TLaneVector_SetElement(Sample0, SimdIndex, 0);
        TLaneVector_SetElement(Sample1, SimdIndex, 1);
    }

#ifdef USE_MEANINGLESS_COPY
    // For some reason, this copy is required to reproduce the issue.
    // Probably this line triggers some kind of extra optimization that
    // increases the compile-time.
    AccumulatorData = Sample0;
#endif

    AccumulatorData = min_0(Sample0, Sample1);
}

Assuming that the shader code above is saved as "extra_compile_time.hlsl", it can be compiled in the following four ways:

/usr/bin/time -f "%e" dxc.exe -E MainCS -T cs_6_6 extra_compile_time.hlsl -Fo a.out -DSIMD_SIZE=64
/usr/bin/time -f "%e" dxc.exe -E MainCS -T cs_6_6 extra_compile_time.hlsl -Fo a.out -DSIMD_SIZE=64 -DUSE_MEMBER
/usr/bin/time -f "%e" dxc.exe -E MainCS -T cs_6_6 extra_compile_time.hlsl -Fo a.out -DSIMD_SIZE=64 -DUSE_MEANINGLESS_COPY
/usr/bin/time -f "%e" dxc.exe -E MainCS -T cs_6_6 extra_compile_time.hlsl -Fo a.out -DSIMD_SIZE=64 -DUSE_MEANINGLESS_COPY -DUSE_MEMBER

Each command took the following amount of time to compile on my machine:

0.11 seconds
0.20 seconds
2.68 seconds
0.18 seconds

Note that only the third case shows an increased compile time, while the others are similarly fast.
Also note that the generated DXIL binary files are all identical.

You can make the difference more obvious by using a larger value for SIMD_SIZE.
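For example, a sweep over SIMD_SIZE in the slow configuration (a sketch assuming bash and the same dxc.exe invocation as above) can be used to see how the compile time grows:

# Sketch: time the slow configuration (USE_MEANINGLESS_COPY without
# USE_MEMBER) for increasing SIMD_SIZE values; adjust the dxc path
# for your local build.
for n in 64 128 256; do
    echo "SIMD_SIZE=$n"
    /usr/bin/time -f "%e" dxc.exe -E MainCS -T cs_6_6 extra_compile_time.hlsl \
        -Fo a.out -DSIMD_SIZE=$n -DUSE_MEANINGLESS_COPY
done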

@jkwak-work (Collaborator, Author)

I was able to reproduce the issue with the latest DXC from GitHub.

dxcompiler.dll: 1.8 - 1.8.0.14549 (main, 0781ded87)

jkwak-work added the kind:performance and goal:quality & productivity labels on Apr 10, 2024
@csyonghe (Collaborator)

We should file this bug with DXC to see if they can find out what exactly caused the slowdown.

The amount of slowdown seems pathological.

@jkwak-work (Collaborator, Author)

I filed an issue on the DXC GitHub for the same problem.

https://github.com/microsoft/DirectXShaderCompiler/issues/6518

@jkwak-work (Collaborator, Author)

I got an explanation on the DXC GitHub.

https://github.com/microsoft/DirectXShaderCompiler/issues/6512#issuecomment-2047992891

This has to do with the fact that the "inout" modifier on global function parameters means "copy-in and copy-out" rather than pass-by-reference, and LLVM takes extra time to optimize the copies away so the code behaves as if it were a reference.

On the other hand, "this" is a true reference in member functions, which is why the compile time is faster.
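Put differently, under that explanation a call through an inout parameter behaves roughly like the conceptual expansion below (a sketch of the semantics, not literal DXC output):

// Conceptual expansion of TLaneVector_SetElement(R_11, SimdIndex, v)
// under copy-in/copy-out semantics:
TLaneVector tmp = R_11;                  // copy-in: the whole 64 x float3 array
tmp.RegisterArray[SimdIndex] = v;        // the function body runs on the copy
R_11 = tmp;                              // copy-out: the whole array copied back

// A member call, R_11.SetElement(SimdIndex, v), mutates the object through
// `this` as a true reference, so there are no whole-array copies for LLVM
// to prove away.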

@csyonghe (Collaborator)

For now, we may be able to work around the issue by marking the member methods with [ForceInline] in Slang.
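A sketch of what that might look like in Slang source (assuming [ForceInline] on each member method is enough to keep a wrapper with an inout `this` out of the generated HLSL; this still needs to be verified):

struct TLaneVector
{
    float3 RegisterArray[SIMD_SIZE];

    // Force-inline the member methods so no global wrapper function
    // with an inout `this` parameter reaches DXC.
    [ForceInline]
    float3 GetElement(const uint SimdIndex)
    {
        return RegisterArray[SimdIndex];
    }

    [ForceInline]
    void SetElement(const uint SimdIndex, float3 Element)
    {
        RegisterArray[SimdIndex] = Element;
    }
};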
