**Table of contents**<a id='toc0_'></a>    
- [Building the model layers](#toc1_)    
  - [Build a neural network](#toc1_1_)    
    - [Get a hardware device for training](#toc1_1_1_)    
    - [Define the class](#toc1_1_2_)    
  - [Weight and Bias](#toc1_2_)    
  - [Model layers](#toc1_3_)    
    - [nn.Flatten](#toc1_3_1_)    
    - [nn.Linear](#toc1_3_2_)    
    - [nn.ReLU](#toc1_3_3_)    
    - [nn.Sequential](#toc1_3_4_)    
    - [nn.Softmax](#toc1_3_5_)    
  - [Model parameters](#toc1_4_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

In [1]:
import os
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# <a id='toc1_'></a>[Building the model layers](#toc0_)

## <a id='toc1_1_'></a>[Build a neural network](#toc0_)

### <a id='toc1_1_1_'></a>[Get a hardware device for training](#toc0_)

In [2]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print('Using {} device'.format(device))

Using cpu device


### <a id='toc1_1_2_'></a>[Define the class](#toc0_)

- We define our neural network by subclassing ``nn.Module``, and initialize the neural network layers in ``__init__``.
- Every ``nn.Module`` subclass implements the operations on input data in the ``forward`` method.

Our neural network is composed of the following:
```
1. The input layer with 28x28 or 784 features/pixels.
2. The first linear module takes the input 784 features and transforms it to a hidden layer with 512 features.
3. The ReLU activation function will be applied in the transformation.
4. The second linear module takes 512 features as input from the first hidden layer and transforms it to the next hidden layer with 512 features.
5. The ReLU activation function will be applied in the transformation.
6. The third linear module takes 512 features as input from the second hidden layer and transforms them to the output layer with 10 features, which is the number of classes.
7. The ReLU activation function will be applied in the transformation.
```


In [28]:
class NeuralNetwork(nn.Module):
    def __init__(self):
        print("========__init__")
        super(NeuralNetwork, self).__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28*28, 512),
            nn.ReLU(),
            nn.Linear(512, 512),
            nn.ReLU(),
            nn.Linear(512, 10),
            nn.ReLU()
        )

    def forward(self, x):
        print("========forward")
        print("--x: ", x[0][0])
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        print("--logits", logits)
        return logits


In [29]:
# 1. Create an instance of NeuralNetwork, move it to the device, and print its structure
model = NeuralNetwork().to(device)
print(model)

NeuralNetwork(
  (flatten): Flatten(start_dim=1, end_dim=-1)
  (linear_relu_stack): Sequential(
    (0): Linear(in_features=784, out_features=512, bias=True)
    (1): ReLU()
    (2): Linear(in_features=512, out_features=512, bias=True)
    (3): ReLU()
    (4): Linear(in_features=512, out_features=10, bias=True)
    (5): ReLU()
  )
)


In [31]:
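# 2. Use the model instance
X = torch.rand(1, 28, 28, device=device)
logits = model(X)          # calling the model returns a 10-dimensional tensor of raw predicted values, one per class
print("---> ", logits)
pred_probab = nn.Softmax(dim=1)(logits)     # pass the values through an nn.Softmax instance to get predicted probabilities
y_pred = pred_probab.argmax(1)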
print(f"---> Predicted class: {y_pred}")

--x:  tensor([0.6713, 0.9800, 0.0375, 0.7683, 0.3016, 0.2872, 0.9026, 0.4352, 0.5712,
        0.9999, 0.1142, 0.1094, 0.4680, 0.6605, 0.6617, 0.6932, 0.9095, 0.1173,
        0.0639, 0.9552, 0.1322, 0.2908, 0.0895, 0.8303, 0.5566, 0.6190, 0.5297,
        0.3281])
--logits tensor([[0.0000, 0.0897, 0.0335, 0.0000, 0.0000, 0.0297, 0.0000, 0.1087, 0.0000,
         0.0085]], grad_fn=<ReluBackward0>)
--->  tensor([[0.0000, 0.0897, 0.0335, 0.0000, 0.0000, 0.0297, 0.0000, 0.1087, 0.0000,
         0.0085]], grad_fn=<ReluBackward0>)
---> Predicted class: tensor([7])


In [7]:
logits

tensor([[0.0000, 0.0218, 0.0875, 0.0030, 0.0212, 0.0000, 0.1574, 0.0776, 0.0228,
         0.0000]], grad_fn=<ReluBackward0>)

## <a id='toc1_2_'></a>[Weight and Bias](#toc0_)

- The nn.Linear module randomly initializes the weights and bias of each layer and stores the values internally as tensors.
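
To make the point above concrete, here is a minimal sketch that builds a standalone `nn.Linear` with the same sizes as the first layer and inspects its parameters. The bound `1 / sqrt(in_features)` follows the default initialization described in the PyTorch documentation for `nn.Linear`; the variable names (`layer`, `bound`, `max_abs_weight`) are illustrative and not part of the notebook.

```python
import math
import torch
from torch import nn

torch.manual_seed(0)  # fixed seed so repeated runs draw the same values

layer = nn.Linear(in_features=28 * 28, out_features=512)

# weight has shape (out_features, in_features); bias has shape (out_features,)
print(layer.weight.shape)
print(layer.bias.shape)

# Default init draws from U(-bound, bound) with bound = 1 / sqrt(in_features)
bound = 1 / math.sqrt(layer.in_features)
max_abs_weight = layer.weight.abs().max().item()
max_abs_bias = layer.bias.abs().max().item()
print(f"bound = {bound:.4f}")
print(f"largest |weight| = {max_abs_weight:.4f}")
print(f"largest |bias|   = {max_abs_bias:.4f}")
```

With 784 input features the bound is 1/28, roughly 0.0357, which is consistent with the magnitudes printed for the first layer in this notebook.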

In [33]:
print(f"First Linear weights: {model.linear_relu_stack[0].weight} \n")

print(f"First Linear biases: {model.linear_relu_stack[0].bias} \n")

First Linear weights: Parameter containing:
tensor([[-3.3600e-03,  1.7673e-02, -1.0084e-02,  ...,  2.1328e-02,
          1.2723e-02, -2.0640e-02],
        [ 1.9739e-02, -3.2739e-02, -1.1304e-02,  ...,  1.6896e-02,
         -1.0160e-03, -2.7725e-03],
        [ 1.6177e-02,  3.2788e-02,  2.0461e-02,  ..., -1.7207e-02,
         -1.2042e-02, -2.1971e-02],
        ...,
        [ 1.3602e-02, -2.0067e-02,  3.3359e-03,  ..., -7.0133e-03,
         -2.1222e-02, -3.1736e-04],
        [ 1.2792e-02, -4.7015e-03,  1.8607e-02,  ..., -2.7154e-05,
          2.4730e-02,  2.0821e-02],
        [-5.8050e-03,  2.4799e-02, -1.1905e-04,  ...,  8.8574e-03,
          2.1331e-02,  2.0435e-02]], requires_grad=True) 

First Linear biases: Parameter containing:
tensor([-1.3724e-02,  2.4594e-02, -9.8472e-03, -1.4274e-03, -3.3574e-02,
         1.8288e-02, -4.6431e-03, -2.9256e-02, -2.4660e-02,  2.5854e-02,
         8.3790e-03, -5.5297e-03, -1.4865e-02, -2.8793e-02, -9.7110e-03,
        -9.9543e-03, -1.7029e-02, -2.987

## <a id='toc1_3_'></a>[Model layers](#toc0_)

- Let's break down the layers in the FashionMNIST model.
- To illustrate, we take a sample minibatch of 3 images of size 28x28.

In [34]:
input_image = torch.rand(3,28,28)
print(input_image.size())

torch.Size([3, 28, 28])


### <a id='toc1_3_1_'></a>[nn.Flatten](#toc0_)

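- We initialize the nn.Flatten layer to convert each 2D 28x28 image into a contiguous array of 784 pixel values; the minibatch dimension (dim=0) is preserved.
- Each pixel is passed to the input layer of the neural network.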
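
As a quick illustration of what flattening does to the data layout, the sketch below compares `nn.Flatten` with an explicit `reshape`. The tensor name `batch` and the comparison itself are assumptions added for illustration.

```python
import torch
from torch import nn

batch = torch.rand(3, 28, 28)  # 3 images, 28x28 each

# Default start_dim=1 leaves dim 0 (the minibatch) untouched
flat = nn.Flatten()(batch)
print(flat.shape)

# Flattening matches reshaping everything after the batch dimension into one axis
same_as_reshape = torch.equal(flat, batch.reshape(batch.size(0), -1))
print(same_as_reshape)

# The first row of the first image becomes the first 28 entries of its flat vector
first_row_kept = torch.equal(flat[0, :28], batch[0, 0])
print(first_row_kept)
```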

In [35]:
flatten = nn.Flatten()
flat_image = flatten(input_image)
print(flat_image.size())

torch.Size([3, 784])


### <a id='toc1_3_2_'></a>[nn.Linear](#toc0_)

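- The linear layer is a module that applies a linear transformation to the input using its stored weights and biases.
- The grayscale value of each pixel in the input layer is connected to the neurons in the hidden layer for computation.
- The transformation computes weight * input + bias (for a batch `x`, this is `x @ weight.T + bias`).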
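
The computation can be checked by hand. The following sketch, with illustrative names `layer`, `x`, and `manual`, compares the output of `nn.Linear` against an explicit matrix product using the layer's own `weight` and `bias`.

```python
import torch
from torch import nn

torch.manual_seed(0)

layer = nn.Linear(in_features=784, out_features=20)
x = torch.rand(3, 784)  # minibatch of 3 flattened images

out = layer(x)

# Manual computation: weight has shape (20, 784), so it is transposed first
manual = x @ layer.weight.T + layer.bias

print(out.shape)
matches = torch.allclose(out, manual, atol=1e-6)
print(matches)
```

The transpose is needed because `nn.Linear` stores its weight as `(out_features, in_features)`.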

In [36]:
layer1 = nn.Linear(in_features=28*28, out_features=20)
hidden1 = layer1(flat_image)
print(hidden1.size())

torch.Size([3, 20])


### <a id='toc1_3_3_'></a>[nn.ReLU](#toc0_)

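- Non-linear activations are what create the complex mappings between the model's inputs and outputs. They are applied after linear transformations to introduce non-linearity, helping neural networks learn a wide variety of phenomena.
- In this model, we use nn.ReLU between the linear layers.
- The ReLU activation function takes the output computed by a linear layer and replaces negative values with zero.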


Linear output: $x = \text{weight} \cdot \text{input} + \text{bias}$.  
ReLU:  $f(x)= 
\begin{cases}
    0, & \text{if } x < 0\\
    x, & \text{if } x\geq 0\\
\end{cases}
$
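
A small worked example of the piecewise definition above, using a hand-written input tensor (the values are chosen for illustration only):

```python
import torch
from torch import nn

x = torch.tensor([-1.5, -0.0024, 0.0, 0.0578, 2.0])

relu_out = nn.ReLU()(x)
print(relu_out)  # negative entries become 0, the rest pass through unchanged

# ReLU is equivalent to clamping the input at zero from below
same_as_clamp = torch.equal(relu_out, torch.clamp(x, min=0))
print(same_as_clamp)
```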

In [37]:
print(f"Before ReLU: {hidden1}\n\n")
hidden1 = nn.ReLU()(hidden1)
print(f"After ReLU: {hidden1}")

Before ReLU: tensor([[ 0.4944, -0.1607, -0.1401,  0.4301,  0.5532,  0.0578,  0.3011, -0.0024,
         -0.4917, -0.1941, -0.2191, -0.5344, -0.4400, -0.2563, -0.1389,  0.4899,
          0.2086,  0.0346,  0.2361,  0.2533],
        [ 0.1026, -0.4872, -0.2421,  0.4621,  0.0791,  0.5096,  0.3423,  0.1377,
         -0.6847, -0.0187, -0.2458, -0.1424, -0.1524, -0.0881, -0.1300,  0.2688,
          0.1321, -0.0317,  0.0481,  0.1814],
        [ 0.3747, -0.3533,  0.0103,  0.6814,  0.3014,  0.3590,  0.4004,  0.1450,
         -0.9122, -0.2040, -0.5909, -0.0821, -0.2452,  0.1204, -0.0147,  0.3103,
          0.2773, -0.0685, -0.0464, -0.1741]], grad_fn=<AddmmBackward0>)


After ReLU: tensor([[0.4944, 0.0000, 0.0000, 0.4301, 0.5532, 0.0578, 0.3011, 0.0000, 0.0000,
         0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.4899, 0.2086, 0.0346,
         0.2361, 0.2533],
        [0.1026, 0.0000, 0.0000, 0.4621, 0.0791, 0.5096, 0.3423, 0.1377, 0.0000,
         0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.00

### <a id='toc1_3_4_'></a>[nn.Sequential](#toc0_)

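- nn.Sequential is an ordered container of modules. The data is passed through all the modules in the order in which they are defined.
- You can use sequential containers to put together a quick network such as seq_modules.

Note: this container is a way to combine other modules into one.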
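
To show what "in the order defined" means in practice, the sketch below applies the contained modules one at a time and compares the result with calling the container directly. The names `seq`, `step`, and `full` are illustrative.

```python
import torch
from torch import nn

torch.manual_seed(0)

seq = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 20),
    nn.ReLU(),
    nn.Linear(20, 10),
)

x = torch.rand(3, 28, 28)

# Apply each contained module in its stored order
step = x
for module in seq:
    step = module(step)

full = seq(x)  # calling the container does the same thing in one go

print(len(seq))                 # number of contained modules
print(type(seq[1]).__name__)    # modules can be indexed by position
print(torch.allclose(step, full))
```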


In [49]:
seq_modules = nn.Sequential(
    nn.Flatten(),
    nn.Linear(in_features=28*28, out_features=20),
    nn.ReLU(),
    nn.Linear(20, 10)
)
input_image = torch.rand(3,28,28)
logits = seq_modules(input_image)
logits

tensor([[ 0.1620,  0.3904, -0.3265,  0.0129,  0.0988,  0.0296, -0.2605,  0.0922,
          0.2681, -0.2154],
        [ 0.1186,  0.4392, -0.3286,  0.2184,  0.0560, -0.2330, -0.3739,  0.0420,
          0.1948, -0.2438],
        [ 0.0392,  0.2726, -0.3239,  0.1041, -0.0461, -0.0214, -0.1551, -0.1006,
          0.0993,  0.0305]], grad_fn=<AddmmBackward0>)

### <a id='toc1_3_5_'></a>[nn.Softmax](#toc0_)

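- The Softmax activation function is used to compute probabilities from the output of the neural network.
- It is only used on the output layer of a neural network.
- The results are scaled to values in \[0, 1\], representing the model's predicted probability for each class.
- The `dim` parameter indicates the dimension along which the result values must sum to 1. The node with the highest probability predicts the desired output.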
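
As a worked example of the formula behind softmax, the following plain-Python sketch computes `exp(v) / sum(exp(v))` for the first row of logits printed in the nn.Sequential section. The helper `softmax` is written here for illustration and is not the PyTorch implementation; subtracting the maximum first is a common numerical-stability step and does not change the result.

```python
import math

# First row of logits from the nn.Sequential example
logits = [0.1620, 0.3904, -0.3265, 0.0129, 0.0988,
          0.0296, -0.2605, 0.0922, 0.2681, -0.2154]


def softmax(values):
    """Return exp(v) / sum(exp(v)) for each v, shifted by the max for stability."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax(logits)
predicted = max(range(len(probs)), key=probs.__getitem__)

print([round(p, 4) for p in probs])
print(sum(probs))   # the probabilities sum to 1 (up to floating-point rounding)
print(predicted)    # index of the largest probability
```

The rounded probabilities should agree with the first row of the `pred_probab` output shown in this section, and the predicted index is the position of the largest logit.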


In [52]:
softmax = nn.Softmax(dim=1)
pred_probab = softmax(logits)
pred_probab

tensor([[0.1120, 0.1407, 0.0687, 0.0965, 0.1051, 0.0981, 0.0734, 0.1044, 0.1245,
         0.0768],
        [0.1102, 0.1518, 0.0704, 0.1217, 0.1035, 0.0775, 0.0673, 0.1020, 0.1189,
         0.0767],
        [0.1038, 0.1311, 0.0722, 0.1108, 0.0953, 0.0977, 0.0855, 0.0903, 0.1103,
         0.1029]], grad_fn=<SoftmaxBackward0>)

## <a id='toc1_4_'></a>[Model parameters](#toc0_)

- Many layers inside a neural network are parameterized, that is, they have associated weights and biases that are optimized during training.
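
For a sense of scale, the arithmetic below counts the trainable parameters implied by the three `nn.Linear` layers of the model in this notebook (`nn.Flatten` and `nn.ReLU` hold none). This is a hand calculation from the layer sizes, with `linear_shapes` as an illustrative name; with the model in memory, `sum(p.numel() for p in model.parameters())` should give the same total.

```python
# (in_features, out_features) of the three nn.Linear layers in the model
linear_shapes = [(784, 512), (512, 512), (512, 10)]

total = 0
for in_f, out_f in linear_shapes:
    weights = in_f * out_f   # weight matrix has shape (out_f, in_f)
    biases = out_f           # one bias per output feature
    print(f"{in_f:>3} -> {out_f:>3}: {weights + biases} parameters")
    total += weights + biases

print(f"total: {total}")
```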

In [54]:
print("Model structure: ", model, "\n\n")

# Iterate over each parameter and print its size and a preview of its values
for name, param in model.named_parameters():
    print(f"Layer: {name} | Size: {param.size()} | Values : {param[:2]} \n")

Model structure:  NeuralNetwork(
  (flatten): Flatten(start_dim=1, end_dim=-1)
  (linear_relu_stack): Sequential(
    (0): Linear(in_features=784, out_features=512, bias=True)
    (1): ReLU()
    (2): Linear(in_features=512, out_features=512, bias=True)
    (3): ReLU()
    (4): Linear(in_features=512, out_features=10, bias=True)
    (5): ReLU()
  )
) 


Layer: linear_relu_stack.0.weight | Size: torch.Size([512, 784]) | Values : tensor([[-0.0034,  0.0177, -0.0101,  ...,  0.0213,  0.0127, -0.0206],
        [ 0.0197, -0.0327, -0.0113,  ...,  0.0169, -0.0010, -0.0028]],
       grad_fn=<SliceBackward0>) 

Layer: linear_relu_stack.0.bias | Size: torch.Size([512]) | Values : tensor([-0.0137,  0.0246], grad_fn=<SliceBackward0>) 

Layer: linear_relu_stack.2.weight | Size: torch.Size([512, 512]) | Values : tensor([[-0.0213, -0.0405, -0.0068,  ...,  0.0376,  0.0434, -0.0058],
        [ 0.0183, -0.0006,  0.0382,  ..., -0.0136,  0.0409, -0.0433]],
       grad_fn=<SliceBackward0>) 

Layer: linear_re

In [61]:
for name, param in model.named_parameters():
    print("=================")
    print(f"Layer: {name} | Size: {param.size()} | Values : {param} \n")


Layer: linear_relu_stack.0.weight | Size: torch.Size([512, 784]) | Values : Parameter containing:
tensor([[-3.3600e-03,  1.7673e-02, -1.0084e-02,  ...,  2.1328e-02,
          1.2723e-02, -2.0640e-02],
        [ 1.9739e-02, -3.2739e-02, -1.1304e-02,  ...,  1.6896e-02,
         -1.0160e-03, -2.7725e-03],
        [ 1.6177e-02,  3.2788e-02,  2.0461e-02,  ..., -1.7207e-02,
         -1.2042e-02, -2.1971e-02],
        ...,
        [ 1.3602e-02, -2.0067e-02,  3.3359e-03,  ..., -7.0133e-03,
         -2.1222e-02, -3.1736e-04],
        [ 1.2792e-02, -4.7015e-03,  1.8607e-02,  ..., -2.7154e-05,
          2.4730e-02,  2.0821e-02],
        [-5.8050e-03,  2.4799e-02, -1.1905e-04,  ...,  8.8574e-03,
          2.1331e-02,  2.0435e-02]], requires_grad=True) 

Layer: linear_relu_stack.0.bias | Size: torch.Size([512]) | Values : Parameter containing:
tensor([-1.3724e-02,  2.4594e-02, -9.8472e-03, -1.4274e-03, -3.3574e-02,
         1.8288e-02, -4.6431e-03, -2.9256e-02, -2.4660e-02,  2.5854e-02,
         8.