# 5.3.6. Exercises

<https://d2l.ai/chapter_multilayer-perceptrons/backprop.html#exercises>


In [1]:
from torch.profiler import profile, record_function, ProfilerActivity
from d2l import torch as d2l
import torch.nn as nn
import torch

##### 1. Assume that the inputs $\mathbf{X}$ to some scalar function $f$ are $n \times m$ matrices. What is the dimensionality of the gradient of $f$ with respect to $\mathbf{X}$?

Answer: the gradient has the same shape as $\mathbf{X}$, namely $n \times m$, because it holds one partial derivative $\partial f / \partial x_{ij}$ for each entry of $\mathbf{X}$.
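
The shape claim can be checked with autograd. The sketch below is only an illustration: the sizes `n = 3`, `m = 4` and the function $f(\mathbf{X}) = \sum_{ij} x_{ij}^2$ are arbitrary choices, not part of the exercise.

```python
import torch

n, m = 3, 4  # arbitrary illustrative sizes (assumption)
X = torch.randn(n, m, requires_grad=True)

# Any scalar-valued function of X would do; the sum of squares is an assumption.
f = (X ** 2).sum()
f.backward()

# The gradient stores one partial derivative per entry of X.
grad_shape = tuple(X.grad.shape)
print(grad_shape)
```

For this particular $f$, the gradient is $2\mathbf{X}$, which again has the shape of $\mathbf{X}$.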


##### 2. Add a bias to the hidden layer of the model described in this section (you do not need to include bias in the regularization term).

1. Draw the corresponding computational graph.
2. Derive the forward and backward propagation equations.

Answer:

1. Computational graph
   ![image.png](5_3_1.png)
2. Equations

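   Forward propagation, with a hidden-layer bias $\mathbf{b}^{(1)}$, an output-layer bias $\mathbf{b}^{(2)}$, and a regularization term $s$ that excludes both biases:
   $$\mathbf{z}= \mathbf{W}^{(1)} \mathbf{x} + \mathbf{b}^{(1)},$$
   $$\mathbf{h}= \phi (\mathbf{z}),$$
   $$\mathbf{o}= \mathbf{W}^{(2)} \mathbf{h} + \mathbf{b}^{(2)},$$
   $$L = l(\mathbf{o}, y),$$
   $$s = \frac{\lambda}{2} \left(\|\mathbf{W}^{(1)}\|_\textrm{F}^2 + \|\mathbf{W}^{(2)}\|_\textrm{F}^2\right),$$
   $$J = L + s.$$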

   Backpropagation
   $$\frac{\partial J}{\partial L} = 1 \; \textrm{and} \; \frac{\partial J}{\partial s} = 1.$$

   $$
   \frac{\partial J}{\partial \mathbf{o}}
   = \textrm{prod}\left(\frac{\partial J}{\partial L}, \frac{\partial L}{\partial \mathbf{o}}\right)
   = \frac{\partial L}{\partial \mathbf{o}}
   \in \mathbb{R}^q.
   $$

   $$
   \frac{\partial J}{\partial \mathbf{b}^{(2)}}= \textrm{prod}\left(\frac{\partial J}{\partial \mathbf{o}}, \frac{\partial \mathbf{o}}{\partial \mathbf{b}^{(2)}}\right) = \frac{\partial L}{\partial \mathbf{o}}
   $$

   $$
   \frac{\partial s}{\partial \mathbf{W}^{(1)}} = \lambda \mathbf{W}^{(1)}
   \; \textrm{and} \;
   \frac{\partial s}{\partial \mathbf{W}^{(2)}} = \lambda \mathbf{W}^{(2)}.
   $$

   $$\frac{\partial J}{\partial \mathbf{W}^{(2)}}= \frac{\partial J}{\partial \mathbf{o}} \mathbf{h}^\top + \lambda \mathbf{W}^{(2)}.$$

   $$
   \frac{\partial J}{\partial \mathbf{h}}
   = \textrm{prod}\left(\frac{\partial J}{\partial \mathbf{o}}, \frac{\partial \mathbf{o}}{\partial \mathbf{h}}\right)
   = {\mathbf{W}^{(2)}}^\top \frac{\partial J}{\partial \mathbf{o}}.
   $$

   $$
   \frac{\partial J}{\partial \mathbf{z}}
   = \textrm{prod}\left(\frac{\partial J}{\partial \mathbf{h}}, \frac{\partial \mathbf{h}}{\partial \mathbf{z}}\right)
   = \frac{\partial J}{\partial \mathbf{h}} \odot \phi'\left(\mathbf{z}\right).
   $$

   $$
   \frac{\partial J}{\partial \mathbf{b}^{(1)}}=\textrm{prod}\left(\frac{\partial J}{\partial \mathbf{z}}, \frac{\partial \mathbf{z}}{\partial \mathbf{b}^{(1)}}\right)
   =\frac{\partial J}{\partial \mathbf{h}} \odot \phi'\left(\mathbf{z}\right)
   $$

   $$
   \frac{\partial J}{\partial \mathbf{W}^{(1)}}
   = \frac{\partial J}{\partial \mathbf{z}} \mathbf{x}^\top + \lambda \mathbf{W}^{(1)}.
   $$
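
The backward equations above can be checked numerically against autograd. The sketch below makes several assumptions that the exercise does not fix: $\phi = \tanh$, a squared-error loss $l(\mathbf{o}, y) = \frac{1}{2}\|\mathbf{o} - \mathbf{y}\|^2$, and small arbitrary layer sizes.

```python
import torch

torch.manual_seed(0)
# Illustrative sizes and weight-decay strength (assumptions).
num_inputs, num_hiddens, num_outputs, lam = 4, 3, 2, 0.1
kw = dict(dtype=torch.float64)

x = torch.randn(num_inputs, **kw)
y = torch.randn(num_outputs, **kw)
W1 = torch.randn(num_hiddens, num_inputs, requires_grad=True, **kw)
b1 = torch.randn(num_hiddens, requires_grad=True, **kw)
W2 = torch.randn(num_outputs, num_hiddens, requires_grad=True, **kw)
b2 = torch.randn(num_outputs, requires_grad=True, **kw)

# Forward propagation (phi = tanh and squared-error loss are assumptions).
z = W1 @ x + b1
h = torch.tanh(z)
o = W2 @ h + b2
L = 0.5 * ((o - y) ** 2).sum()
s = 0.5 * lam * ((W1 ** 2).sum() + (W2 ** 2).sum())  # biases are not regularized
J = L + s
J.backward()  # autograd reference gradients

# Manual backpropagation, following the equations term by term.
with torch.no_grad():
    dJ_do = o - y                                # dL/do for the squared-error loss
    dJ_db2 = dJ_do
    dJ_dW2 = torch.outer(dJ_do, h) + lam * W2
    dJ_dh = W2.T @ dJ_do
    dJ_dz = dJ_dh * (1 - h ** 2)                 # phi'(z) = 1 - tanh(z)^2
    dJ_db1 = dJ_dz
    dJ_dW1 = torch.outer(dJ_dz, x) + lam * W1

checks = {
    "W1": torch.allclose(W1.grad, dJ_dW1),
    "b1": torch.allclose(b1.grad, dJ_db1),
    "W2": torch.allclose(W2.grad, dJ_dW2),
    "b2": torch.allclose(b2.grad, dJ_db2),
}
print(checks)
```

Double precision is used so that the comparison is not dominated by rounding error.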


##### 3. Compute the memory footprint for training and prediction in the model described in this section.

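Answer: one empirical approach is to profile the model with `torch.profiler`, once while training and once while predicting. The cells below do this for a one-hidden-layer MLP on Fashion-MNIST.


In [2]:
# MLP with one hidden layer; plot_flag=False turns off plotting while profiling.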


class MLP(d2l.Classifier):
  def __init__(self, num_outputs, num_hiddens, lr, plot_flag=True):
    super().__init__()
    self.save_hyperparameters()
    self.net = nn.Sequential(
      nn.Flatten(), nn.LazyLinear(num_hiddens), nn.ReLU(), nn.LazyLinear(num_outputs)
    )

  def training_step(self, batch):
    l = self.loss(self(*batch[:-1]), batch[-1])
    if self.plot_flag:
      self.plot("loss", l, train=True)
    return l

  def validation_step(self, batch):
    Y_hat = self(*batch[:-1])
    l = self.loss(Y_hat, batch[-1])
    if self.plot_flag:
      # Reuse the loss computed above instead of evaluating it a second time.
      self.plot("loss", l, train=False)
      self.plot("acc", self.accuracy(Y_hat, batch[-1]), train=False)
    return l


model = MLP(num_outputs=10, num_hiddens=256, lr=0.1, plot_flag=False)
data = d2l.FashionMNIST(batch_size=256)
trainer = d2l.Trainer(max_epochs=1)




In [3]:
with profile(activities=[ProfilerActivity.CPU], profile_memory=True, record_shapes=True) as prof:
  with record_function("model_train"):
    trainer.fit(model, data)

print(prof.key_averages().table(sort_by="self_cpu_memory_usage", row_limit=10))

[W CPUAllocator.cpp:235] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event


-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg       CPU Mem  Self CPU Mem    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
enumerate(DataLoader)#_MultiProcessingDataLoaderIter...        83.39%        2.742s        83.42%        2.743s       9.902ms     209.92 Mb     209.46 Mb           277  
                                     aten::resolve_conj         0.00%       4.000us         0.00%       4.000us       0.001us     129.05 Mb     129.05 Mb          2747  
                                               aten::mm         0.94%      30.922ms         0.94%      30.922ms      43.861us     240.81 Mb     111.76

In [4]:
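with profile(activities=[ProfilerActivity.CPU], profile_memory=True, record_shapes=True) as prof:
  with record_function("model_infer"):
    # Prediction needs no autograd graph, so disable gradient tracking.
    with torch.no_grad():
      # Forward pass over the whole training set (60,000 images) at once.
      model(data.train.data.type(torch.float32))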

print(prof.key_averages().table(sort_by="self_cpu_memory_usage", row_limit=10))

------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                    Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg       CPU Mem  Self CPU Mem    # of Calls  
------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
     aten::empty_strided         0.04%      25.000us         0.04%      25.000us      25.000us     179.44 Mb     179.44 Mb             1  
             aten::addmm        38.11%      22.907ms        50.37%      30.276ms      15.138ms      60.88 Mb      60.88 Mb             2  
         aten::clamp_min        13.42%       8.065ms        13.42%       8.065ms       8.065ms      58.59 Mb      58.59 Mb             1  
                aten::to         0.01%       7.000us        29.15%      17.518ms      17.518ms     179.44 Mb           0 b             1  
          aten::_to_copy   
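
The profiler numbers can be cross-checked with a back-of-the-envelope count. The sketch below is a rough estimate under stated assumptions, not a measurement: float32 values, plain SGD (no optimizer state), and no accounting for the input batch or framework overhead. The function name `mlp_memory_estimate` is hypothetical.

```python
def mlp_memory_estimate(batch_size, num_inputs=784, num_hiddens=256,
                        num_outputs=10, bytes_per_value=4):
    """Rough memory estimate in bytes for a one-hidden-layer MLP (float32)."""
    # Weights and biases of the two linear layers.
    num_params = (num_inputs * num_hiddens + num_hiddens
                  + num_hiddens * num_outputs + num_outputs)
    param_bytes = num_params * bytes_per_value
    # Per-example intermediate values: pre-activation z, hidden h, output o.
    z, h, o = num_hiddens, num_hiddens, num_outputs
    # Training keeps every intermediate for the backward pass.
    train_act_bytes = batch_size * (z + h + o) * bytes_per_value
    # Prediction can free an intermediate once the next one is computed,
    # so only the largest adjacent pair has to be alive at the same time.
    pred_act_bytes = batch_size * max(z + h, h + o) * bytes_per_value
    return {
        "num_params": num_params,
        # parameters + one gradient per parameter + stored activations
        "training_bytes": 2 * param_bytes + train_act_bytes,
        # parameters + peak live activations
        "prediction_bytes": param_bytes + pred_act_bytes,
    }


estimate = mlp_memory_estimate(batch_size=256)
print(estimate)

# Hidden activations for 60,000 images in one pass, in units of 2**20 bytes.
hidden_mib = 60000 * 256 * 4 / 2**20
print(round(hidden_mib, 2))
```

The last value, 58.59, agrees with the `aten::clamp_min` row of the inference profile if the profiler's "Mb" is read as $2^{20}$ bytes. In that run the activations of the 60,000 images dominate, while the roughly 0.2 million parameters take well under 1 Mb.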



##### 4. Assume that you want to compute second derivatives. What happens to the computational graph? How long do you expect the calculation to take?

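Answer:

1. The computational graph becomes deeper and more complex: it must record not only how the loss depends on the parameters, but also how each gradient depends on the parameters, so that the gradient itself can be differentiated.
2. Suppose the network has $N$ parameters and the loss is a scalar. The first derivative (the gradient) has $N$ entries, but the second derivative (the Hessian) has $N^2$ entries. Computing the full Hessian by backpropagating through each gradient entry therefore takes roughly $N$ times as long as computing one gradient.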
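
A minimal sketch of this in PyTorch follows, assuming a toy loss $\sum_i w_i^3$ with `N = 5` parameters (both are arbitrary choices for illustration). Passing `create_graph=True` makes autograd record the backward pass itself, which is the extra graph described in point 1. The loop then needs one more backward pass per gradient entry.

```python
import torch

torch.manual_seed(0)
N = 5  # number of parameters (illustrative assumption)
w = torch.randn(N, requires_grad=True)
loss = (w ** 3).sum()  # toy scalar loss (assumption)

# First derivative: N entries. create_graph=True records the backward
# computation so that the gradient can be differentiated again.
(grad,) = torch.autograd.grad(loss, w, create_graph=True)

# Second derivative: one extra backward pass per gradient entry,
# giving N rows of N entries each.
rows = [torch.autograd.grad(grad[i], w, retain_graph=True)[0] for i in range(N)]
hessian = torch.stack(rows)

print(tuple(grad.shape), tuple(hessian.shape))
```

For this loss the Hessian is diagonal with entries $6 w_i$, which gives a simple way to confirm the result.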


##### 5. Assume that the computational graph is too large for your GPU.

1. Can you partition it over more than one GPU?
2. What are the advantages and disadvantages over training on a smaller minibatch?

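Answer:

1.  Yes. Either the model itself or the minibatch can be split across several GPUs.
2.  Compared with training on a smaller minibatch, partitioning across GPUs has the following trade-offs:
    - Advantages: it makes it possible to handle larger models or datasets that do not fit in the memory of a single GPU, and the parallel computation can shorten training time.
    - Disadvantages: exchanging information between GPUs adds communication overhead, which can slow training down, and synchronizing several GPUs can be complicated, especially with asynchronous updates. A smaller minibatch avoids these costs, but it gives noisier gradient estimates, which can slow convergence.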
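
Splitting the minibatch can be simulated on a CPU. In the sketch below, two equally sized shards stand in for two GPUs, and the linear model with a mean squared-error loss is an assumption made only for illustration. Averaging the per-shard gradients reproduces the full-minibatch gradient, which is why data parallelism does not change the update, whereas training on a single smaller shard would.

```python
import torch

torch.manual_seed(0)
# Toy minibatch of 8 examples with 3 features (illustrative assumption).
X = torch.randn(8, 3, dtype=torch.float64)
y = torch.randn(8, dtype=torch.float64)
w = torch.randn(3, dtype=torch.float64, requires_grad=True)


def grad_of_mean_loss(X_part, y_part):
    """Gradient of the mean squared error on one part of the minibatch."""
    loss = ((X_part @ w - y_part) ** 2).mean()
    return torch.autograd.grad(loss, w)[0]


# Gradient on the full minibatch, as a single device would compute it.
full_grad = grad_of_mean_loss(X, y)

# Each shard plays the role of one GPU; the results are then averaged,
# which corresponds to the communication step between devices.
shard_grads = [grad_of_mean_loss(Xs, ys) for Xs, ys in zip(X.chunk(2), y.chunk(2))]
averaged_grad = torch.stack(shard_grads).mean(dim=0)

print(torch.allclose(full_grad, averaged_grad))
```

The equality relies on the shards having the same size. With unequal shards the per-shard gradients would have to be weighted by shard size.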
