From c217107f48378cf9ac3549ff308d694423987826 Mon Sep 17 00:00:00 2001
From: Raph Levien <raph@google.com>
Date: Fri, 9 Jun 2023 09:50:23 -0700
Subject: [PATCH 1/5] Start draft of shader converter post

Committing as I have a bunch of text about mixed atomics that I'll likely cut from the final post, but did want to keep a record.
---
 _posts/2023-06-08-shader-converter.md | 81 +++++++++++++++++++++++++++
 1 file changed, 81 insertions(+)
 create mode 100644 _posts/2023-06-08-shader-converter.md

diff --git a/_posts/2023-06-08-shader-converter.md b/_posts/2023-06-08-shader-converter.md
new file mode 100644
index 0000000..a15a8ff
--- /dev/null
+++ b/_posts/2023-06-08-shader-converter.md
@@ -0,0 +1,81 @@
+---
+layout: post
+title:  "A note on Metal shader converter"
+date:   2023-06-09 07:03:42 -0700
+categories: [gpu]
+---
+At WWDC, Apple introduced [Metal shader converter], a tool for converting shaders from DXIL (the main compilation target of HLSL in DirectX12) to Metal. While it is no doubt useful for reducing the cost of porting games from DirectX to Metal, I feel it does not move us any closer to a world of robust GPU infrastructure, and in many ways just adds more underspecified layers of complexity.
+
+The specific feature I'm salty about is atomic barriers that allow for some sharing of work between threadgroups. These barriers are present in HLSL, and in fact have been since 2009, when [Direct3D 11] and Shader Model 5 were first introduced.
+
+## Typed vs untyped atomics
+
+Another challenge for reliable automated translation into Metal is typed vs untyped atomics. In C++, `atomic<int32_t>` and `int32_t` are distinct types, and atomic operations can only be performed on the former. This is a reasonable choice, and I'm generally in favor of relying on the type system to enforce invariants; Rust follows the same tradition.
+
+The problem is that other shader languages, in this case most importantly HLSL, have an *untyped* approach to atomics. A memory location simply has type `uint`, and that can be accessed both through ordinary loads and stores, and with atomic operations (called "interlocked" in HLSL argot). In some cases, atomic and non-atomic accesses can be cleanly separated, in other cases they might be inextricably mixed. The latter happens when a buffer is a [RWByteAddressBuffer](https://learn.microsoft.com/en-us/windows/win32/direct3dhlsl/sm5-object-rwbyteaddressbuffer) which presents as a completely untyped array of 32-bit words, and the actual semantic meaning of types is expressed in program logic above the low-level access to the raw buffer.
+
+Other cases are somewhat in-between. Here's a simple shader that computes the maximum value of each 256 chunk of input:
+```hlsl
+ByteAddressBuffer input;
+RWByteAddressBuffer output;
+
+groupshared uint max_value;
+
+[numthreads(256, 1, 1)]
+void main(uint index: SV_GroupIndex) {
+    if (index == 0) {
+        max_value = 0;
+    }
+    GroupMemoryBarrierWithGroupSync();
+    InterlockedMax(max_value, input.Load(index * 4));
+    GroupMemoryBarrierWithGroupSync();
+    if (index == 0) {
+        output.Store((index / 256) * 4, max_value);
+    }
+}
+```
+
+The initialization and use of `max_value` can be done with non-atomic operations, but of course the max computation needs to be atomic because all the threads are participating in parallel.
+
+Here's the translation of that using DXC and spirv-cross, a combination of open-source tools that accomplishes the same thing as the new Apple tool:
+
+```msl
+kernel void main0(const device type_ByteAddressBuffer& _input [[buffer(0)]], device type_RWByteAddressBuffer& _output [[buffer(1)]], uint gl_LocalInvocationIndex [[thread_index_in_threadgroup]])
+{
+    threadgroup uint max_value;
+    bool _26 = gl_LocalInvocationIndex == 0u;
+    if (_26)
+    {
+        max_value = 0u;
+    }
+    threadgroup_barrier(mem_flags::mem_threadgroup);
+    uint _33 = atomic_fetch_max_explicit((threadgroup atomic_uint*)&max_value, _input._m0[(gl_LocalInvocationIndex * 4u) >> 2u], memory_order_relaxed);
+    threadgroup_barrier(mem_flags::mem_threadgroup);
+    if (_26)
+    {
+        _output._m0[((gl_LocalInvocationIndex / 256u) * 4u) >> 2u] = max_value;
+    }
+}
+```
+
+The key bit is `(threadgroup atomic_uint*)&max_value`, which is a pointer cast from a non-atomic type to an atomic type. In C++, this is considered undefined behavior. Almost certainly, this should be considered "technical undefined behavior," because if the Metal shader compiler did anything other than the reasonable interpretation, a great many games in the App Store that use spirv-cross to translate shaders from HLSL would be extremely unhappy.
+
+Even so, we're in a position where it's not possible to *reason* about correctness systematically. There's a tradition in lock-free algorithms and data structures where the first publication is almost always flawed, then there's a follow-up that fixes it. It's hard to be confident any of these algorithms are correct until there's been formal verification of some kind. Fortunately, these formal tools exist and are put to good use; there are Alloy formulations of the C++11 memory model, model checking tools such as [CDSChecker] (and its Rust counterpart [loom]), and a small academic industry of proving lock-free algorithms correct. Trying to use these formal techniques to prove correctness of an algorithm translated into Metal would result in an instant report of UB.
+
+## Onward
+
+The Metal announcements from WWDC move us no closer to a world of robust GPU infrastructure. But there is much we can still do.
+
+For one, there *is* a GPU infrastructure stack that is based on careful specification and conformance testing, and has two high quality, open source implementations enabling deployment to almost all reasonably current GPU hardware. I speak of course of WebGPU. It's lacking the shiny features – raytracing, bindless, and cooperative matrix operations (marketed as "tensor cores" and quite important for maximum performance in AI workloads) – but what is there should work.
+
+For two, we can cheer on the work of Asahi Linux. They have recently announced [OpenGL 3.1 support] on Apple Silicon, and an intent to implement Vulkan. That work may be highly challenging, as obviously that implies implementing barriers which the Apple GPU engineers haven't been able to manage. But they have done consistently impressive work so far, and I certainly hope they succeed. If nothing else, their work will result in much better public documentation of the hardware's capabilities and limitations.
+
+
+
+[Metal shader converter]: https://developer.apple.com/metal/shader-converter/
+[Prefix sum on portable compute shaders]: https://raphlinus.github.io/gpu/2021/11/17/prefix-sum-portable.html
+[Direct3D 11]: https://en.wikipedia.org/wiki/Direct3D#Direct3D_11
+[CDSChecker]: http://plrg.eecs.uci.edu/software_page/42-2/
+[loom]: https://github.com/tokio-rs/loom
+[OpenGL 3.1 support]: https://asahilinux.org/2023/06/opengl-3-1-on-asahi-linux/
+

From 7dcb8c9c1ba0632d0294c6ca181042a4c2292572 Mon Sep 17 00:00:00 2001
From: Raph Levien <raph@google.com>
Date: Mon, 12 Jun 2023 09:50:05 -0700
Subject: [PATCH 2/5] Reasonable first draft of blog

---
 _posts/2023-06-08-shader-converter.md | 66 +++++++--------------------
 1 file changed, 16 insertions(+), 50 deletions(-)

diff --git a/_posts/2023-06-08-shader-converter.md b/_posts/2023-06-08-shader-converter.md
index a15a8ff..2e74398 100644
--- a/_posts/2023-06-08-shader-converter.md
+++ b/_posts/2023-06-08-shader-converter.md
@@ -8,59 +8,17 @@ At WWDC, Apple introduced [Metal shader converter], a tool for converting shader
 
 The specific feature I'm salty about is atomic barriers that allow for some sharing of work between threadgroups. These barriers are present in HLSL, and in fact have been since 2009, when [Direct3D 11] and Shader Model 5 were first introduced.
 
-## Typed vs untyped atomics
+I've discussed the value of this barrier in my blog post [Prefix sum on portable compute shaders], but I'll briefly recap. Among other things, it enables a single-pass implementation of prefix sum, using a technique such as decoupled look-back or the [SAM prefix sum] algorithm. A single-pass implementation can achieve the same throughput as memcpy, while a more traditional tree-reduction approach can at best achieve 2/3 that throughput, as it has to read the entire input in two separate dispatches. Further, tree reduction can actually be more complex to implement in practice, as the number of dispatches varies with the input size (it is typically `2 * ceil(log(n) / log(threadgroup size))`). Prefix sum, in turn is an important primitive for advanced compute workloads. There are a number of instances of it in the [Vello] pipeline, and it's also commonly used in stream compaction, decoding of variable length data streams, and compression.
 
-Another challenge for reliable automated translation into Metal is typed vs untyped atomics. In C++, `atomic<int32_t>` and `int32_t` are distinct types, and atomic operations can only be performed on the former. This is a reasonable choice, and I'm generally in favor of relying on the type system to enforce invariants; Rust follows the same tradition.
+I believe there are other important techniques that are similarly unlocked by the availability of these primitives. For example, Nanite's advanced compute pipelines schedule work through job queues, and in general it is not possible to reliably coordinate work between different threadgroups (even within the same dispatch) without such a barrier.
 
-The problem is that other shader languages, in this case most importantly HLSL, have an *untyped* approach to atomics. A memory location simply has type `uint`, and that can be accessed both through ordinary loads and stores, and with atomic operations (called "interlocked" in HLSL argot). In some cases, atomic and non-atomic accesses can be cleanly separated, in other cases they might be inextricably mixed. The latter happens when a buffer is a [RWByteAddressBuffer](https://learn.microsoft.com/en-us/windows/win32/direct3dhlsl/sm5-object-rwbyteaddressbuffer) which presents as a completely untyped array of 32-bit words, and the actual semantic meaning of types is expressed in program logic above the low-level access to the raw buffer.
+## Complexity and reasoning
 
-Other cases are somewhat in-between. Here's a simple shader that computes the maximum value of each 256 chunk of input:
-```hlsl
-ByteAddressBuffer input;
-RWByteAddressBuffer output;
+The GPU ecosystem exists at the knife edge of being strangled by complexity. A big part of the problem is that features tend to inhabit a quantum superposition of existing and not existing. Typically there is an anemic core, surrounded by a cloud of optional features. The Vulkan ecosystem is notorious for this: the [extension list at vulkan.gpuinfo.org] currently lists 146 extensions.
 
-groupshared uint max_value;
+The widespread use of shader translation makes the situation even worse. When writing HLSL that will be translated into other shader languages, it's no longer sufficient to consider [Shader Model 5] to be a baseline, but rather the developer needs to keep in mind all the features that don't translate to other languages. In some cases, the semantics change subtly (the rules for the various flavors "count leading zeros" when the input is 0 vary), and in other cases, like these device scoped barriers.
 
-[numthreads(256, 1, 1)]
-void main(uint index: SV_GroupIndex) {
-    if (index == 0) {
-        max_value = 0;
-    }
-    GroupMemoryBarrierWithGroupSync();
-    InterlockedMax(max_value, input.Load(index * 4));
-    GroupMemoryBarrierWithGroupSync();
-    if (index == 0) {
-        output.Store((index / 256) * 4, max_value);
-    }
-}
-```
-
-The initialization and use of `max_value` can be done with non-atomic operations, but of course the max computation needs to be atomic because all the threads are participating in parallel.
-
-Here's the translation of that using DXC and spirv-cross, a combination of open-source tools that accomplishes the same thing as the new Apple tool:
-
-```msl
-kernel void main0(const device type_ByteAddressBuffer& _input [[buffer(0)]], device type_RWByteAddressBuffer& _output [[buffer(1)]], uint gl_LocalInvocationIndex [[thread_index_in_threadgroup]])
-{
-    threadgroup uint max_value;
-    bool _26 = gl_LocalInvocationIndex == 0u;
-    if (_26)
-    {
-        max_value = 0u;
-    }
-    threadgroup_barrier(mem_flags::mem_threadgroup);
-    uint _33 = atomic_fetch_max_explicit((threadgroup atomic_uint*)&max_value, _input._m0[(gl_LocalInvocationIndex * 4u) >> 2u], memory_order_relaxed);
-    threadgroup_barrier(mem_flags::mem_threadgroup);
-    if (_26)
-    {
-        _output._m0[((gl_LocalInvocationIndex / 256u) * 4u) >> 2u] = max_value;
-    }
-}
-```
-
-The key bit is `(threadgroup atomic_uint*)&max_value`, which is a pointer cast from a non-atomic type to an atomic type. In C++, this is considered undefined behavior. Almost certainly, this should be considered "technical undefined behavior," because if the Metal shader compiler did anything other than the reasonable interpretation, a great many games in the App Store that use spirv-cross to translate shaders from HLSL would be extremely unhappy.
-
-Even so, we're in a position where it's not possible to *reason* about correctness systematically. There's a tradition in lock-free algorithms and data structures where the first publication is almost always flawed, then there's a follow-up that fixes it. It's hard to be confident any of these algorithms are correct until there's been formal verification of some kind. Fortunately, these formal tools exist and are put to good use; there are Alloy formulations of the C++11 memory model, model checking tools such as [CDSChecker] (and its Rust counterpart [loom]), and a small academic industry of proving lock-free algorithms correct. Trying to use these formal techniques to prove correctness of an algorithm translated into Metal would result in an instant report of UB.
+A separate category is things technically forbidden by the spec, but expected to work in practice. A good example here is the mixing of atomic and non-atomic memory operations (see gpuweb#2229). The spirv-cross shader translation tool casts non-atomic pointers to atomic pointers to support this common pattern, which is technically undefined behavior in C++, but in practice lots of people would be unhappy if the Metal shader compiler did anything other than the reasonable thing. Since Metal's semantics are based on C++, I'd personally love to see this resolved by adopting std::atomic_ref from C++20 (Metal is still based on C++14). I'll also not that the official Metal shader compiler tool generates [reasonable IR] for this pattern. It's concerning that using open source tools such as spirv-cross triggers technical undefined behavior, but it's probably not a big problem in practice.
 
 ## Onward
 
@@ -70,7 +28,7 @@ For one, there *is* a GPU infrastructure stack that is based on careful specific
 
 For two, we can cheer on the work of Asahi Linux. They have recently announced [OpenGL 3.1 support] on Apple Silicon, and an intent to implement Vulkan. That work may be highly challenging, as obviously that implies implementing barriers which the Apple GPU engineers haven't been able to manage. But they have done consistently impressive work so far, and I certainly hope they succeed. If nothing else, their work will result in much better public documentation of the hardware's capabilities and limitations.
 
-
+I have a recommendations for Apple as well. I hope that they document which HLSL features are expected to work and which are not. Currently in their documentation (which is admittedly beta), it just says "Some features not supported," which I personally find not very useful. I would also like to give them credit for clarifying the [Metal Shading Language Specification] with respect to the scope of the `mem_device` flag to `threadgroup_barrier`. It now says, "The flag ensures the GPU correctly orders the memory operations to device memory for threads in the threadgroup," which to a very careful reader does indicate threadgroup scope and no guarantee at device scope. Previously it [said][gpuweb#2297] "Ensure correct ordering of memory operations to device memory," which could easily be misinterpreted as providing a device scope guarantee.
 
 [Metal shader converter]: https://developer.apple.com/metal/shader-converter/
 [Prefix sum on portable compute shaders]: https://raphlinus.github.io/gpu/2021/11/17/prefix-sum-portable.html
@@ -78,4 +36,12 @@ For two, we can cheer on the work of Asahi Linux. They have recently announced [
 [CDSChecker]: http://plrg.eecs.uci.edu/software_page/42-2/
 [loom]: https://github.com/tokio-rs/loom
 [OpenGL 3.1 support]: https://asahilinux.org/2023/06/opengl-3-1-on-asahi-linux/
-
+[gpuweb#2297]: https://github.com/gpuweb/gpuweb/pull/2297
+[Metal Shading Language Specification]: https://developer.apple.com/metal/Metal-Shading-Language-Specification.pdf
+[SAM prefix sum]: https://dl.acm.org/doi/10.1145/2980983.2908089
+[Vello]: https://github.com/linebender/vello
+[extension list at vulkan.gpuinfo.org]: https://vulkan.gpuinfo.org/listfeaturesextensions.php
+[Shader Model 5]: https://learn.microsoft.com/en-us/windows/win32/direct3dhlsl/d3d11-graphics-reference-sm5
+[ghpuweb#2229]: https://github.com/gpuweb/gpuweb/issues/2229
+[std::atomic_ref]: https://en.cppreference.com/w/cpp/atomic/atomic_ref
+[reasonable IR]: https://gist.github.com/raphlinus/a8e0a3a3683127149b746eb37822bdc8

From 3c0ee434b094cbc08c5700f7eb6678170db2cdff Mon Sep 17 00:00:00 2001
From: Raph Levien <raph@google.com>
Date: Mon, 12 Jun 2023 10:41:51 -0700
Subject: [PATCH 3/5] A bit more text

Clarify role of the barrier within the broader GPU ecosystem. Thanks Daniel for the comments!
---
 _posts/2023-06-08-shader-converter.md | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/_posts/2023-06-08-shader-converter.md b/_posts/2023-06-08-shader-converter.md
index 2e74398..1e0b0ee 100644
--- a/_posts/2023-06-08-shader-converter.md
+++ b/_posts/2023-06-08-shader-converter.md
@@ -6,7 +6,7 @@ categories: [gpu]
 ---
 At WWDC, Apple introduced [Metal shader converter], a tool for converting shaders from DXIL (the main compilation target of HLSL in DirectX12) to Metal. While it is no doubt useful for reducing the cost of porting games from DirectX to Metal, I feel it does not move us any closer to a world of robust GPU infrastructure, and in many ways just adds more underspecified layers of complexity.
 
-The specific feature I'm salty about is atomic barriers that allow for some sharing of work between threadgroups. These barriers are present in HLSL, and in fact have been since 2009, when [Direct3D 11] and Shader Model 5 were first introduced.
+The specific feature I'm salty about is atomic barriers that allow for some sharing of work between threadgroups. These barriers are present in HLSL, and in fact have been since 2009, when [Direct3D 11] and Shader Model 5 were first introduced. This barrier is not supported in Metal, and of the major GPU APIs, Metal is the only one that doesn't support it. That holds back WebGPU's performance (see [gpuweb#3935 for discussion]), as WebGPU must be portable across the major APIs.
 
 I've discussed the value of this barrier in my blog post [Prefix sum on portable compute shaders], but I'll briefly recap. Among other things, it enables a single-pass implementation of prefix sum, using a technique such as decoupled look-back or the [SAM prefix sum] algorithm. A single-pass implementation can achieve the same throughput as memcpy, while a more traditional tree-reduction approach can at best achieve 2/3 that throughput, as it has to read the entire input in two separate dispatches. Further, tree reduction can actually be more complex to implement in practice, as the number of dispatches varies with the input size (it is typically `2 * ceil(log(n) / log(threadgroup size))`). Prefix sum, in turn is an important primitive for advanced compute workloads. There are a number of instances of it in the [Vello] pipeline, and it's also commonly used in stream compaction, decoding of variable length data streams, and compression.
 
@@ -18,7 +18,9 @@ The GPU ecosystem exists at the knife edge of being strangled by complexity. A b
 
 The widespread use of shader translation makes the situation even worse. When writing HLSL that will be translated into other shader languages, it's no longer sufficient to consider [Shader Model 5] to be a baseline, but rather the developer needs to keep in mind all the features that don't translate to other languages. In some cases, the semantics change subtly (the rules for the various flavors "count leading zeros" when the input is 0 vary), and in other cases, like these device scoped barriers.
 
-A separate category is things technically forbidden by the spec, but expected to work in practice. A good example here is the mixing of atomic and non-atomic memory operations (see gpuweb#2229). The spirv-cross shader translation tool casts non-atomic pointers to atomic pointers to support this common pattern, which is technically undefined behavior in C++, but in practice lots of people would be unhappy if the Metal shader compiler did anything other than the reasonable thing. Since Metal's semantics are based on C++, I'd personally love to see this resolved by adopting std::atomic_ref from C++20 (Metal is still based on C++14). I'll also not that the official Metal shader compiler tool generates [reasonable IR] for this pattern. It's concerning that using open source tools such as spirv-cross triggers technical undefined behavior, but it's probably not a big problem in practice.
+A separate category is things technically forbidden by the spec, but expected to work in practice. A good example here is the mixing of atomic and non-atomic memory operations (see gpuweb#2229). The spirv-cross shader translation tool casts non-atomic pointers to atomic pointers to support this common pattern, which is technically undefined behavior in C++, but in practice lots of people would be unhappy if the Metal shader compiler did anything other than the reasonable thing. Since Metal's semantics are based on C++, I'd personally love to see this resolved by adopting std::atomic_ref from C++20 (Metal is still based on C++14). I'll also note that the official Metal shader compiler tool generates [reasonable IR] for this pattern. It's concerning that using open source tools such as spirv-cross triggers technical undefined behavior, but it's probably not a big problem in practice.
+
+I understand the incentives, but overall I find it disappointing that Metal chases shiny new features like ray-tracing, while failing to provide a solid, spec-compliant foundation for GPU compute.
 
 ## Onward
 
@@ -36,12 +38,13 @@ I have a recommendations for Apple as well. I hope that they document which HLSL
 [CDSChecker]: http://plrg.eecs.uci.edu/software_page/42-2/
 [loom]: https://github.com/tokio-rs/loom
 [OpenGL 3.1 support]: https://asahilinux.org/2023/06/opengl-3-1-on-asahi-linux/
+[gpuweb#2229]: https://github.com/gpuweb/gpuweb/issues/2229
 [gpuweb#2297]: https://github.com/gpuweb/gpuweb/pull/2297
+[gpuweb#3935]: https://github.com/gpuweb/gpuweb/discussions/3935
 [Metal Shading Language Specification]: https://developer.apple.com/metal/Metal-Shading-Language-Specification.pdf
 [SAM prefix sum]: https://dl.acm.org/doi/10.1145/2980983.2908089
 [Vello]: https://github.com/linebender/vello
 [extension list at vulkan.gpuinfo.org]: https://vulkan.gpuinfo.org/listfeaturesextensions.php
 [Shader Model 5]: https://learn.microsoft.com/en-us/windows/win32/direct3dhlsl/d3d11-graphics-reference-sm5
-[ghpuweb#2229]: https://github.com/gpuweb/gpuweb/issues/2229
 [std::atomic_ref]: https://en.cppreference.com/w/cpp/atomic/atomic_ref
 [reasonable IR]: https://gist.github.com/raphlinus/a8e0a3a3683127149b746eb37822bdc8

From ec87e0bd22ea848e3ed8bb3d37930a1ef7750d3c Mon Sep 17 00:00:00 2001
From: Raph Levien <raph@google.com>
Date: Mon, 12 Jun 2023 11:07:06 -0700
Subject: [PATCH 4/5] Minor fixes

Also add a concluding sentence so it doesn't end abruptly.
---
 _posts/2023-06-08-shader-converter.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/_posts/2023-06-08-shader-converter.md b/_posts/2023-06-08-shader-converter.md
index 1e0b0ee..9934bc6 100644
--- a/_posts/2023-06-08-shader-converter.md
+++ b/_posts/2023-06-08-shader-converter.md
@@ -1,14 +1,14 @@
 ---
 layout: post
 title:  "A note on Metal shader converter"
-date:   2023-06-09 07:03:42 -0700
+date:   2023-06-12 11:05:42 -0700
 categories: [gpu]
 ---
 At WWDC, Apple introduced [Metal shader converter], a tool for converting shaders from DXIL (the main compilation target of HLSL in DirectX12) to Metal. While it is no doubt useful for reducing the cost of porting games from DirectX to Metal, I feel it does not move us any closer to a world of robust GPU infrastructure, and in many ways just adds more underspecified layers of complexity.
 
-The specific feature I'm salty about is atomic barriers that allow for some sharing of work between threadgroups. These barriers are present in HLSL, and in fact have been since 2009, when [Direct3D 11] and Shader Model 5 were first introduced. This barrier is not supported in Metal, and of the major GPU APIs, Metal is the only one that doesn't support it. That holds back WebGPU's performance (see [gpuweb#3935 for discussion]), as WebGPU must be portable across the major APIs.
+The specific feature I'm salty about is atomic barriers that allow for some sharing of work between threadgroups. These barriers are present in HLSL, and in fact have been since 2009, when [Direct3D 11] and Shader Model 5 were first introduced. This barrier is not supported in Metal, and of the major GPU APIs, Metal is the only one that doesn't support it. That holds back WebGPU's performance (see [gpuweb#3935] for discussion), as WebGPU must be portable across the major APIs.
 
-I've discussed the value of this barrier in my blog post [Prefix sum on portable compute shaders], but I'll briefly recap. Among other things, it enables a single-pass implementation of prefix sum, using a technique such as decoupled look-back or the [SAM prefix sum] algorithm. A single-pass implementation can achieve the same throughput as memcpy, while a more traditional tree-reduction approach can at best achieve 2/3 that throughput, as it has to read the entire input in two separate dispatches. Further, tree reduction can actually be more complex to implement in practice, as the number of dispatches varies with the input size (it is typically `2 * ceil(log(n) / log(threadgroup size))`). Prefix sum, in turn is an important primitive for advanced compute workloads. There are a number of instances of it in the [Vello] pipeline, and it's also commonly used in stream compaction, decoding of variable length data streams, and compression.
+I've discussed the value of this barrier in my blog post [Prefix sum on portable compute shaders], but I'll briefly recap. Among other things, it enables a single-pass implementation of prefix sum, using a technique such as decoupled look-back or the [SAM prefix sum] algorithm. A single-pass implementation can achieve the same throughput as memcpy, while a more traditional tree-reduction approach can at best achieve 2/3 that throughput, as it has to read the entire input in two separate dispatches. Further, tree reduction can actually be more complex to implement in practice, as the number of dispatches varies with the input size (it is typically `2 * ceil(log(n) / log(threadgroup size))`). Prefix sum, in turn, is an important primitive for advanced compute workloads. There are a number of instances of it in the [Vello] pipeline, and it's also commonly used in stream compaction, decoding of variable length data streams, and compression.
 
 I believe there are other important techniques that are similarly unlocked by the availability of these primitives. For example, Nanite's advanced compute pipelines schedule work through job queues, and in general it is not possible to reliably coordinate work between different threadgroups (even within the same dispatch) without such a barrier.
 
@@ -18,7 +18,7 @@ The GPU ecosystem exists at the knife edge of being strangled by complexity. A b
 
 The widespread use of shader translation makes the situation even worse. When writing HLSL that will be translated into other shader languages, it's no longer sufficient to consider [Shader Model 5] to be a baseline, but rather the developer needs to keep in mind all the features that don't translate to other languages. In some cases, the semantics change subtly (the rules for the various flavors "count leading zeros" when the input is 0 vary), and in other cases, like these device scoped barriers.
 
-A separate category is things technically forbidden by the spec, but expected to work in practice. A good example here is the mixing of atomic and non-atomic memory operations (see gpuweb#2229). The spirv-cross shader translation tool casts non-atomic pointers to atomic pointers to support this common pattern, which is technically undefined behavior in C++, but in practice lots of people would be unhappy if the Metal shader compiler did anything other than the reasonable thing. Since Metal's semantics are based on C++, I'd personally love to see this resolved by adopting std::atomic_ref from C++20 (Metal is still based on C++14). I'll also note that the official Metal shader compiler tool generates [reasonable IR] for this pattern. It's concerning that using open source tools such as spirv-cross triggers technical undefined behavior, but it's probably not a big problem in practice.
+A separate category is things technically forbidden by the spec, but expected to work in practice. A good example here is the mixing of atomic and non-atomic memory operations (see [gpuweb#2229]). The spirv-cross shader translation tool casts non-atomic pointers to atomic pointers to support this common pattern, which is technically undefined behavior in C++, but in practice lots of people would be unhappy if the Metal shader compiler did anything other than the reasonable thing. Since Metal's semantics are based on C++, I'd personally love to see this resolved by adopting std::atomic_ref from C++20 (Metal is still based on C++14). I'll also note that the official Metal shader compiler tool generates [reasonable IR] for this pattern. It's concerning that using open source tools such as spirv-cross triggers technical undefined behavior, but it's probably not a big problem in practice.
 
 I understand the incentives, but overall I find it disappointing that Metal chases shiny new features like ray-tracing, while failing to provide a solid, spec-compliant foundation for GPU compute.
 

From 4e6e386b4bf5111aa4387e4c0375c668e5026de8 Mon Sep 17 00:00:00 2001
From: Raph Levien <raph@google.com>
Date: Mon, 12 Jun 2023 11:21:42 -0700
Subject: [PATCH 5/5] Linkify std::atomic_ref

---
 _posts/2023-06-08-shader-converter.md | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/_posts/2023-06-08-shader-converter.md b/_posts/2023-06-08-shader-converter.md
index 9934bc6..e0cc6c0 100644
--- a/_posts/2023-06-08-shader-converter.md
+++ b/_posts/2023-06-08-shader-converter.md
@@ -18,7 +18,7 @@ The GPU ecosystem exists at the knife edge of being strangled by complexity. A b
 
 The widespread use of shader translation makes the situation even worse. When writing HLSL that will be translated into other shader languages, it's no longer sufficient to consider [Shader Model 5] to be a baseline, but rather the developer needs to keep in mind all the features that don't translate to other languages. In some cases, the semantics change subtly (the rules for the various flavors "count leading zeros" when the input is 0 vary), and in other cases, like these device scoped barriers.
 
-A separate category is things technically forbidden by the spec, but expected to work in practice. A good example here is the mixing of atomic and non-atomic memory operations (see [gpuweb#2229]). The spirv-cross shader translation tool casts non-atomic pointers to atomic pointers to support this common pattern, which is technically undefined behavior in C++, but in practice lots of people would be unhappy if the Metal shader compiler did anything other than the reasonable thing. Since Metal's semantics are based on C++, I'd personally love to see this resolved by adopting std::atomic_ref from C++20 (Metal is still based on C++14). I'll also note that the official Metal shader compiler tool generates [reasonable IR] for this pattern. It's concerning that using open source tools such as spirv-cross triggers technical undefined behavior, but it's probably not a big problem in practice.
+A separate category is things technically forbidden by the spec, but expected to work in practice. A good example here is the mixing of atomic and non-atomic memory operations (see [gpuweb#2229]). The spirv-cross shader translation tool casts non-atomic pointers to atomic pointers to support this common pattern, which is technically undefined behavior in C++, but in practice lots of people would be unhappy if the Metal shader compiler did anything other than the reasonable thing. Since Metal's semantics are based on C++, I'd personally love to see this resolved by adopting [std::atomic_ref] from C++20 (Metal is still based on C++14). I'll also note that the official Metal shader compiler tool generates [reasonable IR] for this pattern. It's concerning that using open source tools such as spirv-cross triggers technical undefined behavior, but it's probably not a big problem in practice.
 
 I understand the incentives, but overall I find it disappointing that Metal chases shiny new features like ray-tracing, while failing to provide a solid, spec-compliant foundation for GPU compute.
 
@@ -32,6 +32,8 @@ For two, we can cheer on the work of Asahi Linux. They have recently announced [
 
 I have a recommendations for Apple as well. I hope that they document which HLSL features are expected to work and which are not. Currently in their documentation (which is admittedly beta), it just says "Some features not supported," which I personally find not very useful. I would also like to give them credit for clarifying the [Metal Shading Language Specification] with respect to the scope of the `mem_device` flag to `threadgroup_barrier`. It now says, "The flag ensures the GPU correctly orders the memory operations to device memory for threads in the threadgroup," which to a very careful reader does indicate threadgroup scope and no guarantee at device scope. Previously it [said][gpuweb#2297] "Ensure correct ordering of memory operations to device memory," which could easily be misinterpreted as providing a device scope guarantee.
 
+I am optimistic in the long term about having really good, portable infrastructure for GPU compute, but it is clear that we have a long way to go.
+
 [Metal shader converter]: https://developer.apple.com/metal/shader-converter/
 [Prefix sum on portable compute shaders]: https://raphlinus.github.io/gpu/2021/11/17/prefix-sum-portable.html
 [Direct3D 11]: https://en.wikipedia.org/wiki/Direct3D#Direct3D_11