In the [last tutorial](auto-diff-tutorial-1.md), we explained the basics of Slang's autodiff feature and introduced some advanced techniques for custom derivative implementations and custom differential types.
In this tutorial, we will walk through a real-world example that combines what we've learned from the last tutorial.
We will learn how to implement a tiny Multi-Layer Perceptron (MLP) to approximate a set of polynomial functions using Slang's automatic differentiation capabilities.
### Problem Setup
In our interface design, we introduce an internal vector type, `MLVec<4>`, for the network evaluation. This design choice hides the details of how we perform arithmetic and logic operations on the vector type, since they are irrelevant to how the network itself is designed. For now, you can treat `MLVec` as an opaque type; we will cover its details in a later section. In the public method, we simply box the input into an `MLVec`, call the internal `_eval` method, and unbox the `MLVec` back into a normal vector. In the internal `_eval` method, we just chain the evaluations of each layer together.
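To make this concrete, here is a minimal sketch of the boxing/unboxing pattern; the layer fields, the dimensions, and the `MLVec` helper methods shown here are assumptions for illustration and may differ from the actual code in the repo:

```hlsl
public struct MyNetwork
{
    // Layers of the network (the count and sizes are illustrative).
    FeedForwardLayer layer0;
    FeedForwardLayer layer1;
    FeedForwardLayer layer2;

    // Public method: box the plain inputs into an MLVec, evaluate, then unbox.
    public half eval(half x, half y)
    {
        MLVec<4> v = MLVec<4>.fromScalars(x, y, 0.0h, 0.0h); // hypothetical helper
        MLVec<4> result = _eval(v);
        return result.data[0];
    }

    // Internal method: chain the evaluation of each layer together.
    MLVec<4> _eval(MLVec<4> input)
    {
        return layer2.eval(layer1.eval(layer0.eval(input)));
    }
}
```

The important part is only the shape of the two methods: the public `eval` does the boxing and unboxing, while `_eval` only ever sees `MLVec` values.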
### 3.2 Feed-Forward Layer Definition
After implementing the forward pass of the network evaluation, we will need to implement the backward pass.
### 4.1 Backward Pass of Loss
To implement the backward derivative of the loss function, we only need to mark the function as `[Differentiable]`, as we learned in the [last tutorial](auto-diff-tutorial-1.md):
```hlsl
[Differentiable]
...
```

And from the Slang kernel function, we just need to call the backward of the loss:

```hlsl
bwd_diff(loss)(network, input.x, input.y, 1.0h);
```

The last argument, `1.0h`, is the seed for the derivative of the loss value itself.
One important thing to notice is that we use the `no_diff` attribute to decorate the inputs `x`, `y` and the `groudtruth` calculation, because in the backward pass we only care about $\frac{\partial loss}{\partial\theta}$. The `no_diff` attribute tells Slang to treat the decorated variables or instructions as non-differentiable, so no backward-mode instructions are generated for them. Since we don't care about the derivative of the loss function with respect to the inputs, we can safely mark them as non-differentiable.
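As an illustration of this pattern, a loss function along these lines might look like the following sketch; the exact signature, the squared-error form, and the `evalPolynomial` helper are assumptions, not the tutorial's actual code:

```hlsl
[Differentiable]
half loss(MyNetwork network, no_diff half x, no_diff half y)
{
    // The ground-truth calculation is wrapped in no_diff: we only want
    // d(loss)/d(theta), not derivatives with respect to x, y, or the target.
    half groundtruth = no_diff evalPolynomial(x, y); // hypothetical helper

    // The network evaluation is the differentiable part of the computation.
    half prediction = network.eval(x, y);

    // A simple squared error between prediction and ground truth.
    half diff = prediction - groundtruth;
    return diff * diff;
}
```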
This implementation means that the call `network.eval(x, y)` must be differentiable, so next we are going to implement the backward pass for this method.
### 4.3 Custom Backward Pass for Layers
Following the propagation direction of the gradients, we will next implement the backward derivative of `FeedForwardLayer`. But this time we're going to do something different: instead of asking Slang to automatically synthesize the backward derivative for us, we will provide a custom derivative implementation. One reason is that the network parameters and gradients live in global buffer storage declared in the layer, so we have to provide a custom derivative that writes the gradients back to that buffer storage; you can refer to [how to propagate derivatives to global buffer storage](auto-diff-tutorial-1.md#how-to-propagate-derivatives-to-global-buffer-storage) in the last tutorial to refresh your memory. Another reason is that our layer is just a matrix multiplication with a bias, whose derivative is quite simple, and there are many options to accelerate it with dedicated hardware (e.g. NVIDIA Tensor Cores). Therefore, it's good practice to implement the custom derivative.
First, let's revisit the mathematical formula. Given:
$$Z=W\cdot x+b$$
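For reference, the standard gradients of this affine transform with respect to an upstream gradient $\frac{\partial loss}{\partial Z}$ are:

$$\frac{\partial loss}{\partial W}=\frac{\partial loss}{\partial Z}\cdot x^{T},\qquad \frac{\partial loss}{\partial b}=\frac{\partial loss}{\partial Z},\qquad \frac{\partial loss}{\partial x}=W^{T}\cdot\frac{\partial loss}{\partial Z}$$

In other words, the weight gradient is an outer product of the incoming gradient and the input, the bias gradient is the incoming gradient itself, and the input gradient is a transposed matrix multiply, which is what the `outerProductAccumulate` and `matMulTransposed` helpers mentioned below compute.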
From the forward pass we already know that the parameters stored in the global storage buffer are shared by all threads, and so are the gradients. Therefore, during the backward pass each thread accumulates its gradient data into the shared storage buffer, and we must use atomic adds to accumulate all the gradients without race conditions.
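For instance, accumulating the bias gradients could look like the following sketch, reusing the `InterlockedAddF16Emulated` helper from the example code (the loop bound, the `biasesGrad` pointer, the `dResult` value, and the exact helper signature are assumed here):

```hlsl
// Each thread atomically adds its per-thread bias gradient into the shared
// gradient buffer; plain stores would race with other threads doing the same.
for (int i = 0; i < 4; i++)
{
    half originalValue;
    InterlockedAddF16Emulated(biasesGrad + i, dResult.data[i], originalValue);
}
```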
The implementations of `outerProductAccumulate` and `matMulTransposed` are just trivial for-loop multiplications, so we will not show the details in this tutorial; the complete code can be found [here](https://github.com/shader-slang/neural-shading-s25/tree/main/hardware-acceleration/mlp-training).
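For completeness, trivially-looped versions of these helpers might look roughly like the following sketch for a 4x4 layer; the real implementations in the repo may differ, for example by accumulating directly into the gradient buffer:

```hlsl
// Accumulate the outer product dZ * x^T into a row-major weight-gradient array.
void outerProductAccumulate(MLVec<4> dZ, MLVec<4> x, inout half weightsGrad[16])
{
    for (int r = 0; r < 4; r++)
        for (int c = 0; c < 4; c++)
            weightsGrad[r * 4 + c] += dZ.data[r] * x.data[c];
}

// Multiply by the transposed weight matrix: dX = W^T * dZ.
MLVec<4> matMulTransposed(half weights[16], MLVec<4> dZ)
{
    MLVec<4> dX;
    for (int c = 0; c < 4; c++)
    {
        half sum = 0.0h;
        for (int r = 0; r < 4; r++)
            sum += weights[r * 4 + c] * dZ.data[r];
        dX.data[c] = sum;
    }
    return dX;
}
```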
### 4.4 Make the Vector Differentiable
The training process is a loop that alternately invokes the Slang compute kernels `learnGradient` and `adjustParameters` until the loss converges to an acceptable threshold value.
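While the exact kernel is not shown here, a minimal sketch of what `adjustParameters` could do is a plain SGD step followed by clearing the gradient accumulator; the buffer layout, parameter names, and thread mapping below are assumptions:

```hlsl
[shader("compute")]
[numthreads(256, 1, 1)]
void adjustParameters(
    uint3 tid : SV_DispatchThreadID,
    uniform RWStructuredBuffer<half> params,
    uniform RWStructuredBuffer<half> grads,
    uniform uint paramCount,
    uniform float learningRate)
{
    uint i = tid.x;
    if (i >= paramCount)
        return;

    // Plain gradient-descent update, then reset the gradient accumulator
    // so the next learnGradient dispatch starts from zero.
    params[i] = params[i] - half(learningRate) * grads[i];
    grads[i] = 0.0h;
}
```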
We will skip the host-side implementation for the MLP example; it is boilerplate code that sets up the graphics pipeline and allocates buffers for the parameters. You can access the [GitHub repo](https://github.com/shader-slang/neural-shading-s25/tree/main/hardware-acceleration/mlp-training) for the complete code of this example, including the host-side implementation. Alternatively, you can try the more powerful tool [SlangPy](https://github.com/shader-slang/slangpy) to run this MLP example without writing any graphics boilerplate code.