Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[dx11] [wip] (do not submit yet) Redirect global_tmps_buffer_i32 to u32 so DX11 doesn't bind 2 UAVs pointing to the same resource #5151

Closed
wants to merge 3 commits into from

Conversation

quadpixels
Copy link
Contributor

It seems in the SPIR-V program, global_tmps_buffer_u32 and global_tmps_buffer_i32 point to the same underlying resource.

In DX11, this means 2 UAVs can point to the same "global temp buffer" resource, and be bound to the same pipeline.

It seems this behavior is fine with the Vulkan backend, but will cause the following warning with the DX11 backend:

Resource being set to CS UnorderedAccessView slot X is already bound on output elsewhere. [ STATE_SETTING WARNING #2097354: DEVICE_CSSETUNORDEREDACCESSVIEWS_HAZARD

This can cause incorrect behavior, and may be reproduced with the following Taichi kernel:

import taichi as ti
ti.init(arch=ti.dx11)
idx = ti.field(dtype=ti.i32, shape=())
scratch = ti.field(dtype=ti.f32, shape=20)

@ti.kernel
def hey():
  ix = idx[None]
  idx[None] += 1
  for i in range(ix):
    if True:
      scratch[ix] += 233
for i in range(6): hey()

Output before the fix:

[3495.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.
    0.    0.    0.    0.    0.    0.    0.    0.]

Expected result:

[   0.  233.  466.  699.  932. 1165.    0.    0.    0.    0.    0.    0.
    0.    0.    0.    0.    0.    0.    0.    0.]

The way it happens is: the hey Taichi kernel gets compiled into 2 HLSL kernels. The first one saves the current value of idx[None], or ix, to global_tmps_buffer_u32. However, for some reason the second kernel decides to read the loop count (ix) from global_tmps_buffer_i32.

The first kernel looks like:

static const uint3 gl_WorkGroupSize = uint3(1u, 1u, 1u);

RWByteAddressBuffer root_buffer_0 : register(u2);
RWByteAddressBuffer global_tmps_buffer_u32 : register(u3);

static uint3 gl_GlobalInvocationID;
struct SPIRV_Cross_Input
{
    uint3 gl_GlobalInvocationID : SV_DispatchThreadID;
};

void comp_main()
{
    if (gl_GlobalInvocationID.x == 0u)
    {
        uint _41 = root_buffer_0.Load(0);       <-------------- Get the current value of idx[None], or "ix"
        int tmp2_i32 = int(_41);
        global_tmps_buffer_u32.Store(0, uint(tmp2_i32));   <--------------- Store the current value of ix to global temp
        root_buffer_0.Store(0, uint(tmp2_i32 + 1)); <----------------- Increment idx[None] by 1
    }
}

[numthreads(1, 1, 1)]
void main(SPIRV_Cross_Input stage_input)
{
    gl_GlobalInvocationID = stage_input.gl_GlobalInvocationID;
    comp_main();
}

The second kernel looks like:

static const uint3 gl_WorkGroupSize = uint3(128u, 1u, 1u);

RWByteAddressBuffer global_tmps_buffer_i32 : register(u2);      <----------- Note these two
RWByteAddressBuffer global_tmps_buffer_u32 : register(u3);     <----------- UAV slots
RWByteAddressBuffer root_buffer_0 : register(u4);
cbuffer SPIRV_Cross_NumWorkgroups
{
    uint3 SPIRV_Cross_NumWorkgroups_1_count : packoffset(c0);
};


static uint3 gl_GlobalInvocationID;
struct SPIRV_Cross_Input
{
    uint3 gl_GlobalInvocationID : SV_DispatchThreadID;
};

void comp_main()
{
    int begin_ = int(gl_GlobalInvocationID.x) + 0;
    int end_ = (int(global_tmps_buffer_i32.Load((0 >> 2) * 4 + 0)) - 0) + 0;   <------------- The loop range, ix, comes from i32 not u32 !
    int total_invocs = int(SPIRV_Cross_NumWorkgroups_1_count.x * 128u);
    uint _77;
    for (int ii_i32 = begin_; ii_i32 < end_; ii_i32 += total_invocs)
    {
        uint _74 = ((((0u + (uint(0) * 132u)) + 4u) + (uint(int(global_tmps_buffer_u32.Load(int(uint(0) >> uint(2)) * 4 + 0)) & 31) * 4u)) + 0u) >> 2u;
        for (;;)
        {
            uint _84 = root_buffer_0.Load(_74 * 4 + 0);
            uint _88;
            root_buffer_0.InterlockedCompareExchange(_74 * 4 + 0, _84, asuint(asfloat(_84) + 233.0f), _88);
            _77 = _88;
            if (_88 == _84)
            {
                break;
            }
            continue;
        }
    }
}

[numthreads(128, 1, 1)]
void main(SPIRV_Cross_Input stage_input)
{
    gl_GlobalInvocationID = stage_input.gl_GlobalInvocationID;
    comp_main();
}

As shown above, With the current code, the i32 and u32 versions of the global temp buffer gets two UAVs, u2 and u3 respectively. However, since they are pointing to the same buffer allocation, the following DX runtime warning will happen:

Resource being set to CS UnorderedAccessView slot is already bound on output elsewhere. [ STATE_SETTING WARNING #2097354: DEVICE_CSSETUNORDEREDACCESSVIEWS_HAZARD

The DX runtime will bind either u2 and u3 to NULL, depending on which one is the bound later than the other. This will cause _74 in the second kernel to have a value of 0 so only the first element in scratch gets incremented.


This change attempts to fix this bug by redirecting all operations on "i32" to "u32" (since u32 seems to be present by default)

This has not yet been extensively tested and may need to be improved a bit, I wonder if there is a better way? Any comments will be appreciated.

Thanks!

Related issue = #

@netlify
Copy link

netlify bot commented Jun 13, 2022

Deploy Preview for docsite-preview ready!

Name Link
🔨 Latest commit 18da94d
🔍 Latest deploy log https://app.netlify.com/sites/docsite-preview/deploys/62ef3f1ddae54400092fe4d8
😎 Deploy Preview https://deploy-preview-5151--docsite-preview.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site settings.

@quadpixels quadpixels changed the title [wip] (do not submit yet) Redirect global_tmps_buffer_i32 to u32 so DX11 doesn't bind 2 UAVs pointing to the same resource [dx11] [wip] (do not submit yet) Redirect global_tmps_buffer_i32 to u32 so DX11 doesn't bind 2 UAVs pointing to the same resource Jun 13, 2022
@bobcao3
Copy link
Collaborator

bobcao3 commented Jun 13, 2022

This is a on-going struggle with the backend, I was thinking the best solution is to pre-compile a type table for all buffer binds and reduce the need of buffer aliasing

@quadpixels
Copy link
Contributor Author

This is a on-going struggle with the backend, I was thinking the best solution is to pre-compile a type table for all buffer binds and reduce the need of buffer aliasing

I see, so it sounds "global_tmps_buffer_i32" and "global_tmps_buffer_u32" are indeed aliases of the same "global_tmps" buffer.
Is there an example of "pre-compile a type table for all buffer binds"?

quadpixels and others added 2 commits August 6, 2022 21:24
In DX, global_tmps_buffer_u32 and global_tmps_buffer_i32 become two UAVs
pointing to the same underlying ID3D11Buffer resource.

This will cause the following error:
Resource being set to CS UnorderedAccessView slot X is already bound on output elsewhere. [ STATE_SETTING WARNING #2097354: DEVICE_CSSETUNORDEREDACCESSVIEWS_HAZARD

This can be reproduced with the following Taichi kernel:

@ti.kernel
def hey():
  ix = idx[None]
  idx[None] += 1
  for i in range(ix):
    if True:
      scratch[ix] += 233

This change attempts to fix this bug by redirecting all operations on
"i32" to "u32" (since u32 seems to be present by default)

This may need to be improved a bit.
@feisuzhu
Copy link
Contributor

Closing ancient PR.

@feisuzhu feisuzhu closed this Mar 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants