## What Will We Learn

In the [last tutorial](auto-diff-tutorial-1.md), we covered the basics of Slang's autodiff feature and introduced some advanced techniques for custom derivative implementations and custom differential types.

In this tutorial, we will walk through a real-world example using what we've learned from the last tutorial.
We will learn how to implement a tiny Multi-Layer Perceptron (MLP) to approximate a set of polynomial functions using Slang's automatic differentiation capabilities.
In our interface method design, we introduce an internal vector type, `MLVec<4>`, for the network evaluation. This design choice hides the details of how we perform arithmetic and logic operations on the vector type, since those details are irrelevant to how the network itself is designed. For now, you can treat `MLVec` as an opaque type; we will discuss its details in a later section. In the public method, we simply convert the input to `MLVec`, call the internal `_eval` method, and convert the resulting `MLVec` back to a normal vector. In the internal `_eval` method, we just chain the evaluations of each layer together.
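
A rough sketch of this wrapper pattern is shown below. The layer fields, layer sizes, and the `data` member of `MLVec` are illustrative assumptions based on the surrounding text, not the repository's verbatim code:

```hlsl
public struct MyNetwork
{
    // Assumed layer fields; the real network's layer count and sizes may differ.
    FeedForwardLayer<2, 4> layer1;
    FeedForwardLayer<4, 4> layer2;
    FeedForwardLayer<4, 4> layer3;

    // Public method: wrap the inputs into an MLVec, evaluate, then unwrap the result.
    public half4 eval(half x, half y)
    {
        MLVec<2> v;
        v.data[0] = x;
        v.data[1] = y;
        let o = _eval(v);
        return half4(o.data[0], o.data[1], o.data[2], o.data[3]);
    }

    // Internal method: chain each layer's evaluation together.
    MLVec<4> _eval(MLVec<2> v)
    {
        return layer3.eval(layer2.eval(layer1.eval(v)));
    }
}
```
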
> Note that we are using the `half` type, a [16-bit floating-point type](external/slang/docs/user-guide/02-conventional-features.md#scalar-types), for better throughput and reduced memory usage.
### 3.2 Feed-Forward Layer Definition
Each layer performs $LeakyRelu(Wx+b)$, so we can create a struct to abstract this as follows:
```hlsl
public struct FeedForwardLayer<int InputSize, int OutputSize>
{
    // ... weight and bias pointers, plus the eval method described below (elided) ...
}
```
The evaluation is very simple: a matrix-vector multiplication plus a bias vector, followed by a LeakyRelu activation. But note that we use pointers in this struct to reference the parameters (e.g. weight and bias) instead of declaring arrays or global storage buffers. Our MLP runs on the GPU and the evaluation function executes per-thread; if we used arrays for each layer, an array would be allocated for every thread, which would explode GPU memory usage. For the same reason, we cannot declare a storage buffer directly within the layer struct, because each thread would then hold its own copy of identical data, wasting an enormous amount of memory. Instead, we use pointers to reference a single shared global storage buffer.
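
As a concrete illustration, a minimal layer under these constraints might look like the following sketch. The `matMul` helper, the gradient pointers, and the LeakyRelu slope of `0.01` are assumptions for illustration; the repository's exact code may differ:

```hlsl
public struct FeedForwardLayer<int InputSize, int OutputSize>
{
    NFloat* weights;     // points into the single shared parameter buffer
    NFloat* biases;      // points into the same shared buffer
    NFloat* weightsGrad; // gradient buffers, used by the backward pass in Section 4.3
    NFloat* biasesGrad;

    public MLVec<OutputSize> eval(MLVec<InputSize> x)
    {
        // z = W * x (assumed helper, the counterpart of matMulTransposed in Section 3.3)
        var z = matMul<OutputSize>(x, weights);
        for (int i = 0; i < OutputSize; i++)
        {
            // Add the bias, then apply LeakyRelu with an assumed negative slope of 0.01.
            let zi = z.data[i] + biases[i];
            z.data[i] = zi > 0.0h ? zi : zi * 0.01h;
        }
        return z;
    }
}
```
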
### 3.3 Vector Type Definition
- Transposed matrix-vector multiplication

```hlsl
MLVec<OutputSize> matMulTransposed<int OutputSize, int InputSize>(MLVec<InputSize> v, NFloat* matrix);
```

- Outer product of two vectors
```hlsl
void outerProductAccumulate<int M, int N>(MLVec<M> v0, MLVec<N> v1, NFloat* matrix);
```
The first two operations are straightforward: matrix-vector multiplication and its transposed version. The last operation is the outer product of two vectors; the result is a matrix, such that $\text{x} \otimes \text{y} = \text{x} \cdot \text{y}^{T}$, where $\text{x}$ and $\text{y}$ are column vectors of the same length.
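
For intuition, an unoptimized `outerProductAccumulate` is just a doubly nested loop. A sketch, assuming `MLVec` exposes a `data` array and the matrix is stored row-major:

```hlsl
void outerProductAccumulate<int M, int N>(MLVec<M> v0, MLVec<N> v1, NFloat* matrix)
{
    // (v0 outer v1)[i][j] = v0[i] * v1[j], accumulated into the M x N matrix.
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++)
            matrix[i * N + j] += v0.data[i] * v1.data[j];
}
```

When the destination is a gradient buffer shared across threads, this accumulation must be atomic; we will come back to that in Section 4.3.
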
### 3.4 Loss Function Definition
The loss function measures how far the network output is from the target. Since we have already defined the interfaces for the MLP network, we can simply implement the loss function:
```hlsl
public half loss(MyNetwork* network, half x, half y)
{
    // ... evaluate the network and return its distance from groundtruth(x, y) (elided) ...
}
```

## 4. Backward Pass Design
After implementing the forward pass of the network evaluation, we need to implement the backward pass. You will see how effortless this is with Slang's autodiff. We will start the implementation from the end of the workflow and move toward the beginning, because that is the direction in which the gradients flow.
### 4.1 Backward Pass of Loss
To implement the backward derivative of the loss function, we only need to mark the function as `[Differentiable]`, as we learned in the [last tutorial](auto-diff-tutorial-1.md#forward-mode-differentiation):
```hlsl
[Differentiable]
public half loss(MyNetwork* network, no_diff half x, no_diff half y)
{
let networkResult = network->eval(x, y); // MLP(x,y; θ)
let gt = no_diff groundtruth(x, y); // target(x,y)
// ... return the distance between networkResult and gt (elided) ...
}
```

And from the Slang kernel function, we just need to call the backward mode of the loss function:

```hlsl
bwd_diff(loss)(network, input.x, input.y, 1.0h);
```
One important thing to notice is that we use the [`no_diff` attribute](external/slang/docs/user-guide/07-autodiff.html#excluding-parameters-from-differentiation) to decorate the inputs `x` and `y`, as well as the `groundtruth` calculation, because in the backward pass we only care about the result of $\frac{\partial loss}{\partial\theta}$. The `no_diff` attribute tells Slang to treat the variables or instructions as non-differentiable, so no backward-mode instructions are generated for them. Since we don't care about the derivative of the loss function with respect to the inputs, we can safely mark them as non-differentiable.
Since the `loss` function is now differentiable, every instruction inside it has to be differentiable, except those marked as `no_diff`. Therefore, `network->eval(x, y)` must be differentiable, so next we are going to implement the backward pass for this method.
### 4.2 Automatic Propagation to MLP
Just like the `loss` function, the only thing we need to do for `MyNetwork::eval` in order to use it with autodiff is to mark it as differentiable:
```hlsl
public struct MyNetwork
{
    // ... layer fields and the eval/_eval methods, now marked [Differentiable] (elided) ...
}
```

### 4.3 Custom Backward Pass for Layers
Following the propagation direction of the gradients, we will next implement the backward derivative of `FeedForwardLayer`. But here we're going to do something different: instead of asking Slang to automatically synthesize the backward pass for us, we will provide a custom derivative implementation. Because the network parameters and gradients live in global buffer storage referenced by the layer, we have to provide a custom derivative that writes the gradients back to that storage. You can revisit [propagate derivatives to storage buffers](auto-diff-tutorial-1.md#how-to-propagate-derivatives-to-global-buffer-storage) in the last tutorial to refresh your memory. Another benefit of providing a custom derivative here is that our layer is just a matrix multiplication with bias, and its derivative is quite simple, so there are many options to accelerate it with specialized hardware (e.g. Nvidia tensor cores). Therefore, it's good practice to implement the custom derivative.
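
In Slang, a custom backward derivative is attached to the forward method with the `[BackwardDerivative(...)]` attribute, which we covered in the last tutorial. A structural sketch, assuming the repository wires it up this way (bodies elided; the forward pass is in Section 3.2 and the backward pass is derived below):

```hlsl
public struct FeedForwardLayer<int InputSize, int OutputSize>
{
    // Tell Slang to use our hand-written backward pass instead of synthesizing one.
    [BackwardDerivative(evalBwd)]
    public MLVec<OutputSize> eval(MLVec<InputSize> x)
    {
        // ... forward pass from Section 3.2 ...
    }

    public void evalBwd(inout DifferentialPair<MLVec<InputSize>> x, MLVec<OutputSize> dResult)
    {
        // ... implemented below ...
    }
}
```
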
First, let's revisit the mathematical formulas. Given:

$$Z=W\cdot x+b$$

$$Y=LeakyRelu(Z)$$

Where $W \in R^{m \times n}$, $x \in R^{n}$, and $b \in R^{m}$, the gradients of $W$, $x$, and $b$ will be:

$$dZ=dY\cdot(Z > 0\ ?\ 1:\alpha)$$

$$dW=dZ\cdot x^{T}$$

$$db=dZ$$

$$dx=W^{T}\cdot dZ$$

Here the comparison in $dZ$ is applied elementwise, and $\alpha$ is the LeakyRelu negative slope.

Therefore, the implementation should be:
```hlsl
public void evalBwd(inout DifferentialPair<MLVec<InputSize>> x, MLVec<OutputSize> dResult)
{
    // ... apply the LeakyRelu derivative to dResult, then atomically accumulate
    // dW and db into the shared gradient buffers (elided) ...
    let dx = matMulTransposed<InputSize>(dResult, weights);
    x = {x.p, dx}; // Update differential pair
}
```

The key point in this implementation is that we use atomic add when writing the gradients:

```hlsl
InterlockedAddF16Emulated(biasesGrad + i, dResult.data[i], originalValue);
```
From the forward pass, we already know that the parameters stored in the global storage buffer are shared by all threads, and so are the gradients. During the backward pass, each thread accumulates its gradient data into that shared storage buffer, so we must use atomic add operations to accumulate all the gradients without race conditions.
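
For instance, accumulating the bias gradient $db = dZ$ inside `evalBwd` could look like this loop (a sketch; the `InterlockedAddF16Emulated` call and the `biasesGrad` pointer come from the snippet above, while the loop itself is assumed):

```hlsl
// Each thread atomically adds its local bias gradient into the shared buffer.
for (int i = 0; i < OutputSize; i++)
{
    NFloat originalValue;
    InterlockedAddF16Emulated(biasesGrad + i, dResult.data[i], originalValue);
}
```
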
The implementations of `outerProductAccumulate` and `matMulTransposed` are trivial for-loop multiplications, so we will not show the details in this tutorial. The complete code can be found [here](https://github.com/shader-slang/neural-shading-s25/tree/main/hardware-acceleration/mlp-training).
### 4.4 Make the Vector Differentiable
If we compiled what we have right now, we would get a compile error, because `MLVec` is not a differentiable type, so Slang doesn't expect the layer's backward eval method to have the signature:
```hlsl
public void evalBwd(inout DifferentialPair<MLVec<InputSize>> x, MLVec<OutputSize> dResult)
```

To fix this, we make `MLVec` conform to the `IDifferentiable` interface:

```hlsl
public struct MLVec<int N> : IDifferentiable
{
    // ... vector data; the differential members can be synthesized by Slang (elided) ...
}
```
### 4.5 Parameter Update
After backpropagation, the last step is to update the parameters using the gradients we just computed:
```hlsl
public struct Optimizer
{
public static const NFloat learningRate = 0.01h; // Step size
public static void step(inout NFloat param, inout NFloat grad)
{
    // Body elided in this excerpt; a plain gradient-descent step amounts to
    //   param -= learningRate * grad;
    // typically followed by resetting grad to zero for the next iteration.
}
}
```

## 5. Putting It All Together
We will create two kernel functions: one for training and another for parameter updating.
The training kernel will be:
```hlsl
void learnGradient( /* dispatch inputs and network pointer; parameters elided */ )
{
    // Kernel body elided in this excerpt; for each training sample it invokes
    // the backward pass of the loss, as shown in Section 4.1:
    // bwd_diff(loss)(network, input.x, input.y, 1.0h);
}
```

And the parameter-updating kernel will be:

```hlsl
void adjustParameters( /* parameter and gradient buffers; parameters elided */ )
{
    // ... invoke Optimizer::step on each parameter/gradient pair (elided) ...
}
```
The training process is a loop that alternately invokes the Slang compute kernels `learnGradient` and `adjustParameters` until the loss converges to an acceptable threshold value.
The host-side implementation for this example is not shown, as it is simply boilerplate code to set up the graphics pipeline and allocate buffers for parameters. You can access the [github repo](https://github.com/shader-slang/neural-shading-s25/tree/main/hardware-acceleration/mlp-training) for this example's complete code, which includes the host-side implementation. Alternatively, you can use the more powerful tool [SlangPy](https://github.com/shader-slang/slangpy) to run this MLP example without writing any graphics boilerplate code.
0 commit comments