[dx11] [wip] (do not submit yet) Redirect global_tmps_buffer_i32 to u32 so DX11 doesn't bind 2 UAVs pointing to the same resource #5151

quadpixels · 2022-06-13T09:28:21Z

It seems in the SPIR-V program, global_tmps_buffer_u32 and global_tmps_buffer_i32 point to the same underlying resource.

In DX11, this means 2 UAVs can point to the same "global temp buffer" resource, and be bound to the same pipeline.

It seems this behavior is fine with the Vulkan backend, but will cause the following warning with the DX11 backend:

Resource being set to CS UnorderedAccessView slot X is already bound on output elsewhere. [ STATE_SETTING WARNING #2097354: DEVICE_CSSETUNORDEREDACCESSVIEWS_HAZARD

This can cause incorrect behavior, and may be reproduced with the following Taichi kernel:

import taichi as ti
ti.init(arch=ti.dx11)
idx = ti.field(dtype=ti.i32, shape=())
scratch = ti.field(dtype=ti.f32, shape=20)

@ti.kernel
def hey():
  ix = idx[None]
  idx[None] += 1
  for i in range(ix):
    if True:
      scratch[ix] += 233
for i in range(6): hey()

Output before the fix:

[3495.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.
    0.    0.    0.    0.    0.    0.    0.    0.]

Expected result:

[   0.  233.  466.  699.  932. 1165.    0.    0.    0.    0.    0.    0.
    0.    0.    0.    0.    0.    0.    0.    0.]

The way it happens is: the hey Taichi kernel gets compiled into 2 HLSL kernels. The first one saves the current value of idx[None], or ix, to global_tmps_buffer_u32. However, for some reason the second kernel decides to read the loop count (ix) from global_tmps_buffer_i32.

The first kernel looks like:

static const uint3 gl_WorkGroupSize = uint3(1u, 1u, 1u);

RWByteAddressBuffer root_buffer_0 : register(u2);
RWByteAddressBuffer global_tmps_buffer_u32 : register(u3);

static uint3 gl_GlobalInvocationID;
struct SPIRV_Cross_Input
{
    uint3 gl_GlobalInvocationID : SV_DispatchThreadID;
};

void comp_main()
{
    if (gl_GlobalInvocationID.x == 0u)
    {
        uint _41 = root_buffer_0.Load(0);       <-------------- Get the current value of idx[None], or "ix"
        int tmp2_i32 = int(_41);
        global_tmps_buffer_u32.Store(0, uint(tmp2_i32));   <--------------- Store the current value of ix to global temp
        root_buffer_0.Store(0, uint(tmp2_i32 + 1)); <----------------- Increment idx[None] by 1
    }
}

[numthreads(1, 1, 1)]
void main(SPIRV_Cross_Input stage_input)
{
    gl_GlobalInvocationID = stage_input.gl_GlobalInvocationID;
    comp_main();
}

The second kernel looks like:

static const uint3 gl_WorkGroupSize = uint3(128u, 1u, 1u);

RWByteAddressBuffer global_tmps_buffer_i32 : register(u2);      <----------- Note these two
RWByteAddressBuffer global_tmps_buffer_u32 : register(u3);     <----------- UAV slots
RWByteAddressBuffer root_buffer_0 : register(u4);
cbuffer SPIRV_Cross_NumWorkgroups
{
    uint3 SPIRV_Cross_NumWorkgroups_1_count : packoffset(c0);
};


static uint3 gl_GlobalInvocationID;
struct SPIRV_Cross_Input
{
    uint3 gl_GlobalInvocationID : SV_DispatchThreadID;
};

void comp_main()
{
    int begin_ = int(gl_GlobalInvocationID.x) + 0;
    int end_ = (int(global_tmps_buffer_i32.Load((0 >> 2) * 4 + 0)) - 0) + 0;   <------------- The loop range, ix, comes from i32 not u32 !
    int total_invocs = int(SPIRV_Cross_NumWorkgroups_1_count.x * 128u);
    uint _77;
    for (int ii_i32 = begin_; ii_i32 < end_; ii_i32 += total_invocs)
    {
        uint _74 = ((((0u + (uint(0) * 132u)) + 4u) + (uint(int(global_tmps_buffer_u32.Load(int(uint(0) >> uint(2)) * 4 + 0)) & 31) * 4u)) + 0u) >> 2u;
        for (;;)
        {
            uint _84 = root_buffer_0.Load(_74 * 4 + 0);
            uint _88;
            root_buffer_0.InterlockedCompareExchange(_74 * 4 + 0, _84, asuint(asfloat(_84) + 233.0f), _88);
            _77 = _88;
            if (_88 == _84)
            {
                break;
            }
            continue;
        }
    }
}

[numthreads(128, 1, 1)]
void main(SPIRV_Cross_Input stage_input)
{
    gl_GlobalInvocationID = stage_input.gl_GlobalInvocationID;
    comp_main();
}

As shown above, With the current code, the i32 and u32 versions of the global temp buffer gets two UAVs, u2 and u3 respectively. However, since they are pointing to the same buffer allocation, the following DX runtime warning will happen:

Resource being set to CS UnorderedAccessView slot is already bound on output elsewhere. [ STATE_SETTING WARNING #2097354: DEVICE_CSSETUNORDEREDACCESSVIEWS_HAZARD

The DX runtime will bind either u2 and u3 to NULL, depending on which one is the bound later than the other. This will cause _74 in the second kernel to have a value of 0 so only the first element in scratch gets incremented.

This change attempts to fix this bug by redirecting all operations on "i32" to "u32" (since u32 seems to be present by default)

This has not yet been extensively tested and may need to be improved a bit, I wonder if there is a better way? Any comments will be appreciated.

Thanks!

Related issue = #

netlify · 2022-06-13T09:28:27Z

✅ Deploy Preview for docsite-preview ready!

Name	Link
🔨 Latest commit	`18da94d`
🔍 Latest deploy log	https://app.netlify.com/sites/docsite-preview/deploys/62ef3f1ddae54400092fe4d8
😎 Deploy Preview	https://deploy-preview-5151--docsite-preview.netlify.app
📱 Preview on mobile	Toggle QR Code... Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site settings.

bobcao3 · 2022-06-13T17:32:06Z

This is a on-going struggle with the backend, I was thinking the best solution is to pre-compile a type table for all buffer binds and reduce the need of buffer aliasing

quadpixels · 2022-06-14T11:43:05Z

This is a on-going struggle with the backend, I was thinking the best solution is to pre-compile a type table for all buffer binds and reduce the need of buffer aliasing

I see, so it sounds "global_tmps_buffer_i32" and "global_tmps_buffer_u32" are indeed aliases of the same "global_tmps" buffer.
Is there an example of "pre-compile a type table for all buffer binds"?

In DX, global_tmps_buffer_u32 and global_tmps_buffer_i32 become two UAVs pointing to the same underlying ID3D11Buffer resource. This will cause the following error: Resource being set to CS UnorderedAccessView slot X is already bound on output elsewhere. [ STATE_SETTING WARNING #2097354: DEVICE_CSSETUNORDEREDACCESSVIEWS_HAZARD This can be reproduced with the following Taichi kernel: @ti.kernel def hey(): ix = idx[None] idx[None] += 1 for i in range(ix): if True: scratch[ix] += 233 This change attempts to fix this bug by redirecting all operations on "i32" to "u32" (since u32 seems to be present by default) This may need to be improved a bit.

for more information, see https://pre-commit.ci

feisuzhu · 2023-03-24T07:12:20Z

Closing ancient PR.

quadpixels changed the title ~~[wip] (do not submit yet) Redirect global_tmps_buffer_i32 to u32 so DX11 doesn't bind 2 UAVs pointing to the same resource~~ [dx11] [wip] (do not submit yet) Redirect global_tmps_buffer_i32 to u32 so DX11 doesn't bind 2 UAVs pointing to the same resource Jun 13, 2022

quadpixels and others added 2 commits August 6, 2022 21:24

[pre-commit.ci] auto fixes from pre-commit.com hooks

65684cf

for more information, see https://pre-commit.ci

quadpixels force-pushed the global_temp_i32_u32 branch from 936959a to 65684cf Compare August 7, 2022 04:25

[pre-commit.ci] auto fixes from pre-commit.com hooks

18da94d

for more information, see https://pre-commit.ci

feisuzhu closed this Mar 24, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[dx11] [wip] (do not submit yet) Redirect global_tmps_buffer_i32 to u32 so DX11 doesn't bind 2 UAVs pointing to the same resource #5151

[dx11] [wip] (do not submit yet) Redirect global_tmps_buffer_i32 to u32 so DX11 doesn't bind 2 UAVs pointing to the same resource #5151

quadpixels commented Jun 13, 2022

netlify bot commented Jun 13, 2022 •

edited

Loading

bobcao3 commented Jun 13, 2022

quadpixels commented Jun 14, 2022

feisuzhu commented Mar 24, 2023

[dx11] [wip] (do not submit yet) Redirect global_tmps_buffer_i32 to u32 so DX11 doesn't bind 2 UAVs pointing to the same resource #5151

[dx11] [wip] (do not submit yet) Redirect global_tmps_buffer_i32 to u32 so DX11 doesn't bind 2 UAVs pointing to the same resource #5151

Conversation

quadpixels commented Jun 13, 2022

netlify bot commented Jun 13, 2022 • edited Loading

✅ Deploy Preview for docsite-preview ready!

bobcao3 commented Jun 13, 2022

quadpixels commented Jun 14, 2022

feisuzhu commented Mar 24, 2023

netlify bot commented Jun 13, 2022 •

edited

Loading