# Build model
<font color=blue>[Covers the "Build model" tutorial and the "Modules" developer note]</font>
## 1. What is a module in PyTorch, and which module types does PyTorch provide?
· A module is the basic building block of a neural network. PyTorch ships a library of modules and also supports user-defined ones, which makes it easy to build multi-layer networks. Concretely, the <font color=green>**namespace**</font> **torch.nn** provides three main kinds of modules (layers, containers and utilities), plus nn.Parameter, a tensor type used for module parameters.
1. <font color=lightblue>**Layers:**</font> a NN operates on data through layers. PyTorch expresses these layers as modules, e.g. conv, affine, pooling, normalization, transformer and loss functions.
2. <font color=lightblue>**Containers:**</font> there are three kinds of container: nn.Module, nn.Sequential and holders of submodules.\
(1)**torch.nn.Module**: the base class of all NN modules; every module in PyTorch is a subclass of **nn.Module**\
(2)**torch.nn.Sequential**: arranges one or more modules in order as a sequence, which shows that modules are nestable\
(3)holders of submodules: **nn.ModuleList and nn.ModuleDict** store modules in a list and a dictionary respectively; **nn.ParameterList and nn.ParameterDict** store parameters in a list and a dictionary respectively.
3. <font color=lightblue>**Utilities:**</font> some data-processing functions expressed as modules. <font color=red>[Details to be added after using them???]</font>
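
The container types listed above can be sketched in a few lines. This is only a minimal illustration: the layer sizes and the dictionary keys ("relu", "tanh", "scale") are arbitrary choices, not taken from the notes.

```python
import torch
import torch.nn as nn

# nn.Sequential: chains modules, feeding each output to the next module.
seq = nn.Sequential(nn.Linear(4, 3), nn.ReLU(), nn.Linear(3, 1))

# nn.ModuleList / nn.ModuleDict: hold submodules in a list / dictionary.
layers = nn.ModuleList([nn.Linear(4, 4) for _ in range(2)])
acts = nn.ModuleDict({"relu": nn.ReLU(), "tanh": nn.Tanh()})

# nn.ParameterList / nn.ParameterDict: hold parameters in a list / dictionary.
p_list = nn.ParameterList([nn.Parameter(torch.zeros(2)) for _ in range(3)])
p_dict = nn.ParameterDict({"scale": nn.Parameter(torch.ones(2))})

# Every container is itself an nn.Module.
for container in (seq, layers, acts, p_list, p_dict):
    print(type(container).__name__, isinstance(container, nn.Module))

# Parameters of the two Linear layers in seq: (4*3 + 3) + (3*1 + 1)
num_seq_params = sum(p.numel() for p in seq.parameters())
print(num_seq_params)
```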

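## 2. Characteristics of modules
1. Modules work together with the autograd system: autograd automatically computes the gradients of every tensor with requires_grad=True inside a module, and the optimizer uses those gradients to update the parameters.
2. Modules in PyTorch can be nested: every neural network model is itself a module, and that module is in turn made up of other modules (layers). This nested structure makes it easy to build complex architectures. <font color=red>[Need to understand this better???]</font>
3. Subclasses of **nn.Module** track their parameters automatically; two methods expose them: parameters() and named_parameters().
4. Easy to work with and transform: modules are straightforward to save and restore, move between CPU and GPU, prune, quantize, and more.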
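
A small sketch of the first point, assuming a single nn.Linear layer and plain SGD without momentum (both chosen only for illustration): autograd fills in the gradients of the registered parameters, and the optimizer then updates them.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(3, 1)
optimizer = torch.optim.SGD(layer.parameters(), lr=0.1)

before = layer.weight.detach().clone()

# loss = w . [1, 1, 1] + b, so d(loss)/dw is a row of ones
loss = layer(torch.ones(3)).sum()
optimizer.zero_grad()
loss.backward()   # autograd fills .grad for every registered parameter
optimizer.step()  # the optimizer updates those parameters in place

grad = layer.weight.grad
after = layer.weight.detach().clone()
print(grad)
print(after - before)  # each entry moved by -lr * grad
```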

In [1]:
import os
import torch
import torch.nn as nn           # for torch.nn.Module
import torch.nn.functional as F # for the activation function
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

device = ("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using {device} device")

Using cuda device


## 3. Defining a NN
1. A custom model must also be defined as a subclass of **nn.Module**; each subclass must define two methods, \_\_init\_\_() and forward().
2. Everything the model does to its input data goes in forward(). That is, forward() specifies the computation to perform; the operations it uses are backed by subclasses of torch.autograd.Function, which can be PyTorch built-ins or user-defined.
3. **A module built by subclassing nn.Module only implements forward(), not backward()**, because:\
(1)<font color=blue>a custom torch.autograd.Function must implement both forward() and backward()</font> \
(2)<font color=blue>the autograd system uses each Function's backward() to handle the backward pass of the functions used in the module automatically.</font>
4. Parameters that a module defines must be registered in \_\_init\_\_(), by defining each one as an instance of nn.Parameter. They then become parameters registered by the module, which the autograd system also needs in order to run.
5. The Parameter class is a subclass of torch.Tensor with one special property: when assigned as an attribute of a Module, a Parameter is added to the list of the module's parameters, and can afterwards be iterated through with module.parameters() and module.named_parameters().
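
To illustrate point 3, here is a minimal sketch of a custom torch.autograd.Function used inside a module. The names Square and SquareLayer are hypothetical and the operation (y = x²) is chosen only because its derivative is easy to check by hand.

```python
import torch
import torch.nn as nn

class Square(torch.autograd.Function):
    """Hypothetical custom op y = x**2 with a hand-written backward pass."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)  # keep x for the backward pass
        return x * x

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        return 2 * x * grad_output  # chain rule: dy/dx = 2x

class SquareLayer(nn.Module):
    # The module only defines forward(); autograd calls Square.backward().
    def forward(self, x):
        return Square.apply(x)

x = torch.tensor([1.0, -2.0, 3.0], requires_grad=True)
y = SquareLayer()(x).sum()
y.backward()
print(x.grad)
```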

### 3.1 Defining a simple custom Module

In [2]:
class MyLinear(nn.Module):  # must be a subclass of nn.Module
    def __init__(self, in_features, out_features):
        super().__init__()

        # Registering parameters: define them as nn.Parameter instances.
        # autograd then tracks them, and the optimizer updates them at each iteration.
        self.weight = nn.Parameter(torch.randn(in_features, out_features))
        self.bias = nn.Parameter(torch.randn(out_features))

    # implement forward() method
    def forward(self, input):
        return input @ self.weight + self.bias

In [3]:
model = MyLinear(4, 3)          
sample_input = torch.randn(4)  

# A module is callable; calling it invokes its forward() method
model(sample_input)

tensor([ 0.8722,  3.7005, -0.6856], grad_fn=<AddBackward0>)

In [4]:
## Iterate over parameters()
for parameter in model.parameters():
    print(parameter)

print('\n')

## Iterate over named_parameters()
#  'weight' and 'bias' are the names of the parameters
for parameter in model.named_parameters():
    print(parameter)

Parameter containing:
tensor([[-0.7241, -1.8295,  0.4082],
        [-1.0282, -0.3029,  0.0207],
        [ 0.1802,  0.3633, -0.2619],
        [ 1.1791, -1.1295, -0.4416]], requires_grad=True)
Parameter containing:
tensor([-0.1510,  0.4955,  0.4743], requires_grad=True)


('weight', Parameter containing:
tensor([[-0.7241, -1.8295,  0.4082],
        [-1.0282, -0.3029,  0.0207],
        [ 0.1802,  0.3633, -0.2619],
        [ 1.1791, -1.1295, -0.4416]], requires_grad=True))
('bias', Parameter containing:
tensor([-0.1510,  0.4955,  0.4743], requires_grad=True))


### 3.2 Using modules as the building blocks of a model
· modules contain other modules

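#### i. Defining a simple module with nn.Sequential
· Sequential automatically passes each layer's output to the next layer as its input. <font color=red>This only works when every layer takes a single input and produces a single output.</font>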
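
The single-input restriction can be seen in a small sketch. Split, Add and AddPair are hypothetical helper modules written for this illustration: a layer that returns a tuple hands that tuple to the next layer as one argument, so the next layer must unpack it itself.

```python
import torch
import torch.nn as nn

class Split(nn.Module):
    def forward(self, x):
        return x, x * 2          # returns a tuple of two tensors

class Add(nn.Module):
    def forward(self, a, b):     # expects two separate arguments
        return a + b

class AddPair(nn.Module):
    def forward(self, pair):     # accepts the tuple as a single argument
        a, b = pair
        return a + b

x = torch.ones(2)

# Sequential calls Add with one argument (the tuple), which raises TypeError.
try:
    nn.Sequential(Split(), Add())(x)
    failed = False
except TypeError:
    failed = True
print(failed)

# Unpacking inside the layer keeps the chain single-input / single-output.
out = nn.Sequential(Split(), AddPair())(x)
print(out)
```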

In [5]:
# nn.Sequential is itself a subclass of nn.Module, so instantiating it also yields a module
net = nn.Sequential(
    MyLinear(4, 3),
    nn.ReLU(),
    MyLinear(3, 1)
)

sample_input = torch.randn(4)
net(sample_input)

tensor([-0.3434], grad_fn=<AddBackward0>)

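####  ii. Custom modules
· Apart from very simple cases like the one above, modules are usually not defined directly with Sequential; writing a custom module is more common.\
· The submodules defined in \_\_init\_\_() correspond to the layers of the NN.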

In [6]:
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer0 = MyLinear(4, 3)
        self.layer1 = MyLinear(3, 1)  # two submodules are defined
    
    def forward(self, x):
        x = self.layer0(x)
        x = F.relu(x)                 # relu is not a submodule
        x = self.layer1(x)
        return x

In [7]:
class Net2(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer0 = MyLinear(4, 3)
        self.relu = nn.ReLU()
        self.layer1 = MyLinear(3, 1)  # three submodules are defined, including relu
    
    def forward(self, x):
        x = self.layer0(x)
        x = self.relu(x)              # here relu is a submodule
        x = self.layer1(x)
        return x

**A module's immediate children can be iterated through with children() or named_children()** \
The children (i.e. submodules) of Net in the example above do not include the relu layer

In [8]:
net = Net()
for child in net.named_children():
    print(child)

('layer0', MyLinear())
('layer1', MyLinear())


In [9]:
# Compare with the earlier module defined directly with tensor operations
# That module has no children
for child in model.children():
    print(child)

In [10]:
# relu can also be turned into a module
net2 = Net2()
for child in net2.named_children():
    print(child)

('layer0', MyLinear())
('relu', ReLU())
('layer1', MyLinear())


**modules() and named_modules() recursively iterate through a module and its child modules**

In [11]:
class BigNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.l1 = MyLinear(5, 4)
        self.net = Net()
    def forward(self, x):
        return self.net(self.l1(x))

big_net = BigNet()
for module in big_net.named_modules():
    print('-' * 52)
    print(module)

----------------------------------------------------
('', BigNet(
  (l1): MyLinear()
  (net): Net(
    (layer0): MyLinear()
    (layer1): MyLinear()
  )
))
----------------------------------------------------
('l1', MyLinear())
----------------------------------------------------
('net', Net(
  (layer0): MyLinear()
  (layer1): MyLinear()
))
----------------------------------------------------
('net.layer0', MyLinear())
----------------------------------------------------
('net.layer1', MyLinear())


#### iii. Defining submodules dynamically
· Use ModuleList or ModuleDict \
· calls to parameters() and named_parameters() will recursively include child parameters, allowing for convenient optimization of all parameters within the network

In [12]:
class DynamicNet(nn.Module):
    def __init__(self, num_layers):
        super().__init__()
        self.linears = nn.ModuleList(
            [MyLinear(4, 4) for _ in range(num_layers)])
        self.activations = nn.ModuleDict({
            'relu': nn.ReLU(),
            'lrelu': nn.LeakyReLU()
        })
        self.final = MyLinear(4, 1)
        
    def forward(self, x, act):
        for linear in self.linears:
            x = linear(x)
        x = self.activations[act](x)
        # x = self.final(x)
        return x

dynamic_net = DynamicNet(3)
sample_input = torch.randn(4)
output = dynamic_net(sample_input, 'relu')

**· Child modules are determined by the modules assigned in \_\_init\_\_(), not by the computation that forward() actually performs**

In [13]:
for module in dynamic_net.named_modules():
    print('-'*52)
    print(module)

----------------------------------------------------
('', DynamicNet(
  (linears): ModuleList(
    (0-2): 3 x MyLinear()
  )
  (activations): ModuleDict(
    (relu): ReLU()
    (lrelu): LeakyReLU(negative_slope=0.01)
  )
  (final): MyLinear()
))
----------------------------------------------------
('linears', ModuleList(
  (0-2): 3 x MyLinear()
))
----------------------------------------------------
('linears.0', MyLinear())
----------------------------------------------------
('linears.1', MyLinear())
----------------------------------------------------
('linears.2', MyLinear())
----------------------------------------------------
('activations', ModuleDict(
  (relu): ReLU()
  (lrelu): LeakyReLU(negative_slope=0.01)
))
----------------------------------------------------
('activations.relu', ReLU())
----------------------------------------------------
('activations.lrelu', LeakyReLU(negative_slope=0.01))
----------------------------------------------------
('final', MyLinear())


In [14]:
for parameter in dynamic_net.named_parameters():
    print('-'*68)
    print(parameter)

--------------------------------------------------------------------
('linears.0.weight', Parameter containing:
tensor([[-1.2449, -1.5810, -0.9684,  0.7997],
        [-0.5384, -1.0668,  1.1007,  1.0570],
        [-1.5931, -0.3755,  1.8301,  0.3028],
        [ 2.0470,  1.0902,  1.1974,  1.2223]], requires_grad=True))
--------------------------------------------------------------------
('linears.0.bias', Parameter containing:
tensor([ 0.9569,  1.0514,  0.3401, -0.3270], requires_grad=True))
--------------------------------------------------------------------
('linears.1.weight', Parameter containing:
tensor([[ 1.3122,  1.1817,  0.0877, -0.2661],
        [ 0.4310,  0.0309, -0.5424, -0.6220],
        [-0.7864, -1.7989, -1.6627, -0.0295],
        [ 0.5377, -2.4508,  1.9957, -1.0140]], requires_grad=True))
--------------------------------------------------------------------
('linears.1.bias', Parameter containing:
tensor([-1.7527, -1.5700, -0.7644,  0.9201], requires_grad=True))
------------
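
One consequence of registration happening in \_\_init\_\_() can be sketched as follows. PartlyUsed is a hypothetical module whose second layer is never called in forward(): its parameters are still listed by named_parameters(), but they receive no gradient.

```python
import torch
import torch.nn as nn

class PartlyUsed(nn.Module):
    def __init__(self):
        super().__init__()
        self.used = nn.Linear(4, 2)
        self.unused = nn.Linear(2, 1)  # registered, but never called in forward()

    def forward(self, x):
        return self.used(x)

partly = PartlyUsed()
names = [name for name, _ in partly.named_parameters()]
print(names)  # includes the parameters of the unused layer

partly(torch.randn(4)).sum().backward()
print(partly.used.weight.grad is None)    # gradient was computed
print(partly.unused.weight.grad is None)  # no gradient: the layer took no part in forward()
```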

#### iv. Moving parameters to another device or changing their precision with .to()

In [15]:
# Move all parameters to the device selected earlier (CUDA when available)
dynamic_net.to(device=device)

# Change precision of all parameters
dynamic_net.to(dtype=torch.float64)

dynamic_net(torch.randn(4, device=device, dtype=torch.float64), 'relu')

tensor([ 0.0000, 19.5688,  0.0000, 13.0718], device='cuda:0',
       dtype=torch.float64, grad_fn=<ReluBackward0>)

#### v. Applying an arbitrary function, including a user-defined one, to a module and its submodules
an arbitrary function can be applied to a module and its submodules recursively by using the apply() function

In [16]:
# Define a function to initialize the weights of linear layers.
# Note that no_grad() is used here to avoid tracking this computation in the autograd graph.
@torch.no_grad()
def init_weights(m):
    # dynamic_net is built from MyLinear, so match it as well as nn.Linear
    if isinstance(m, (nn.Linear, MyLinear)):
        nn.init.xavier_normal_(m.weight)
        m.bias.fill_(0.0)

# Apply the function recursively on the module and its submodules.
dynamic_net.apply(init_weights)

DynamicNet(
  (linears): ModuleList(
    (0-2): 3 x MyLinear()
  )
  (activations): ModuleDict(
    (relu): ReLU()
    (lrelu): LeakyReLU(negative_slope=0.01)
  )
  (final): MyLinear()
)

## 4. Training a NN with modules
**A module has two modes: training mode and evaluation mode**
1. A module is in training mode by default. Use train() and eval() to switch modes.
2. If a module contains submodules whose output differs between training mode and evaluation mode (batchnorm, for example), switch it to evaluation mode for inference.
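
As a sketch of a built-in layer that behaves differently in the two modes, nn.Dropout is used here (my choice of example; the notes mention batchnorm). In training mode it zeroes elements at random and rescales the rest; in evaluation mode it returns the input unchanged.

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(8)

print(drop.training)   # modules start in training mode
train_out = drop(x)    # each element is either 0 or scaled by 1 / (1 - p) = 2
print(train_out)

drop.eval()            # same as drop.train(False)
eval_out = drop(x)     # dropout is the identity in evaluation mode
print(eval_out)

drop.train()           # switch back to training mode
```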

In [17]:
# Create the network and the optimizer
net = Net()
optimizer = torch.optim.SGD(net.parameters(), lr=1e-4, 
                            weight_decay=1e-2, momentum=0.9)

# Train the network
for _ in range(10000):
    input = torch.randn(4)
    output = net(input)
    loss = torch.abs(output)  # using abs as the loss pushes the network's output toward 0
    
    net.zero_grad()
    loss.backward()
    optimizer.step()
    
# After training, switch the module to eval mode
net.eval()

Net(
  (layer0): MyLinear()
  (layer1): MyLinear()
)

In [18]:
print(net.layer1.weight)

Parameter containing:
tensor([[0.0018],
        [0.7507],
        [0.4740]], requires_grad=True)


In [19]:
# An example whose output differs between training and evaluation mode
class ModalModule(nn.Module):
  def __init__(self):
    super().__init__()

  def forward(self, x):
    if self.training:
      # Add a constant only in training mode.
      return x + 1.
    else:
      return x

m = ModalModule()
x = torch.randn(4)
print('training mode output: {}'.format(m(x)))

m.eval()
print('evaluation mode output: {}'.format(m(x)))

training mode output: tensor([ 0.8559,  2.5287, -0.5101,  1.3260])
evaluation mode output: tensor([-0.1441,  1.5287, -1.5101,  0.3260])


## 5. module state
1. To save a trained model, store the module's state_dict, which holds the state that affects the module's computation. State consists of parameters and buffers.\
(1)**parameters**: learnable aspects of computation, stored in state_dict.\
(2)**buffers**: non-learnable aspects of computation. Some modules keep information beyond their parameters that does not need to be learned; it is stored in buffers.
2. There are two kinds of buffers: persistent buffers are stored in state_dict, non-persistent buffers are not.\
(1)Persistent buffers: serialized when saving and loading\
(2)non-Persistent buffers: left out of serialization
3. Characteristics of persistent buffers:\
(1)because the state is saved as part of state_dict, it is restored when loading a serialized form of the module. \
(2)unlike parameters, these variables are not updated by the optimizer, so they are non-learnable
4. Characteristics of non-persistent buffers:\
not saved as part of state_dict
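
A quick way to see the split between parameters and buffers is to inspect a built-in layer. The sketch below uses nn.BatchNorm1d with 3 features (an arbitrary size): its learnable weight and bias are parameters, while its running statistics are persistent buffers, so both groups appear in state_dict.

```python
import torch.nn as nn

bn = nn.BatchNorm1d(3)

param_names = [name for name, _ in bn.named_parameters()]
buffer_names = [name for name, _ in bn.named_buffers()]
state_keys = list(bn.state_dict().keys())

print(param_names)   # learnable state
print(buffer_names)  # non-learnable state
print(state_keys)    # parameters plus persistent buffers
```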

In [20]:
## save the module
torch.save(net.state_dict(), 'net.pt')

## load the module
#  1. Create a new module with the same structure
new_net = Net()
#  2. load state
new_net.load_state_dict(torch.load('net.pt'))

<All keys matched successfully>

In [21]:
## Using buffers: the module needs to keep a running mean
#  register_buffer() stores the current value of the running mean in state_dict

class RunningMean(nn.Module):
    def __init__(self, num_features, momentum=0.9):
        super().__init__()
        self.momentum = momentum
        self.register_buffer('mean', torch.zeros(num_features))
        # self.mean is now saved in state_dict
    
    def forward(self, x):
        # Update the running mean on every call
        # As part of state_dict, it is restored when the module is loaded
        self.mean = self.momentum * self.mean + (1.0 - self.momentum) * x
        return self.mean

In [22]:
torch.manual_seed(0)
m = RunningMean(4)
for _ in range(10):
    input = torch.randn(4)
    m(input)

print(m.state_dict())

# Serialized form will contain the 'mean' tensor
torch.save(m.state_dict(), 'mean.pt')

m_loaded = RunningMean(4)
m_loaded.load_state_dict(torch.load('mean.pt'))

# Note the differences between the assert and the two print calls below
assert(torch.all(m.mean == m_loaded.mean))
print(torch.all(m.mean == m_loaded.mean))
print(m.mean == m_loaded.mean)

OrderedDict([('mean', tensor([-0.1494,  0.1179, -0.3679, -0.1974]))])
tensor(True)
tensor([True, True, True, True])


In [23]:
## Store the running mean as a non-persistent buffer
#  Still uses register_buffer(), with the argument persistent=False

class RunningMean(nn.Module):
    def __init__(self, num_features, momentum=0.9):
        super().__init__()
        self.momentum = momentum
        self.register_buffer('mean', torch.zeros(num_features), persistent=False)
        # self.mean is now left out of state_dict
    
    def forward(self, x):
        self.mean = self.momentum * self.mean + (1.0 - self.momentum) * x
        return self.mean

torch.manual_seed(0)
m2 = RunningMean(4)
for _ in range(10):
    input = torch.randn(4)
    m2(input)

print(m2.state_dict()) # the state_dict printed here is empty

torch.save(m2.state_dict(), 'mean.pt')
m2_loaded = RunningMean(4)
m2_loaded.load_state_dict(torch.load('mean.pt'))
print(torch.all(m2.mean == m2_loaded.mean)) # prints False

OrderedDict()
tensor(False)


#### A module's buffers can be iterated over with buffers() and named_buffers()

In [24]:
for buffer in m.named_buffers():
    print(buffer)

('mean', tensor([-0.1494,  0.1179, -0.3679, -0.1974]))


In [25]:
for buffer in m2.named_buffers():
    print(buffer)

('mean', tensor([-0.1494,  0.1179, -0.3679, -0.1974]))


#### Both kinds of buffers are affected by model-wide device/dtype changes applied with the .to() method


In [26]:
m.to(device='cuda', dtype=torch.float64)

RunningMean()
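
The effect of .to() on both kinds of buffers can be checked on the CPU with a dtype change. TwoBuffers is a hypothetical module with one persistent and one non-persistent buffer, written only for this sketch.

```python
import torch
import torch.nn as nn

class TwoBuffers(nn.Module):
    def __init__(self):
        super().__init__()
        self.register_buffer("kept", torch.zeros(2))                       # persistent
        self.register_buffer("skipped", torch.zeros(2), persistent=False)  # non-persistent

tb = TwoBuffers().to(dtype=torch.float64)

print(tb.kept.dtype, tb.skipped.dtype)  # both buffers were converted
print(list(tb.state_dict().keys()))     # only the persistent buffer is serialized
```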

In [27]:
## A combined example
class StatefulModule(nn.Module):
    def __init__(self):
        super().__init__()
        # Assigning an nn.Parameter as an attribute automatically registers the tensor as a module parameter
        self.param1 = nn.Parameter(torch.randn(2))

        # Another way to register a tensor as a module parameter: the register_parameter() method
        self.register_parameter('param2', nn.Parameter(torch.randn(3)))

        # Reserve the attribute "param3" as a parameter without initializing it.
        # Its value None does not appear in state_dict
        self.register_parameter('param3', None)

        # Registers a list of parameters: entries are indexed, not named
        self.param_list = nn.ParameterList([nn.Parameter(torch.randn(2)) for i in range(3)])

        # Registers a dictionary of parameters: entries are named by key
        self.param_dict = nn.ParameterDict({
            'foo': nn.Parameter(torch.randn(3)),
            'bar': nn.Parameter(torch.randn(4))
        })

        # Registers a persistent buffer
        self.register_buffer('buffer1', torch.randn(4), persistent=True)

        # Registers a non-persistent buffer
        self.register_buffer('buffer2', torch.randn(5), persistent=False)

        # Reserve the attribute "buffer3" as a buffer without initializing it.
        # Its value None does not appear in state_dict either
        self.register_buffer('buffer3', None)

        # Adding a submodule automatically registers its parameters as parameters of the module
        self.linear = nn.Linear(2, 3)

m = StatefulModule()

# Save and load state_dict.
torch.save(m.state_dict(), 'state.pt')
m_loaded = StatefulModule()
m_loaded.load_state_dict(torch.load('state.pt'))

# state_dict contains neither the non-persistent buffer nor the reserved attributes "param3" and "buffer3"
print(m_loaded.state_dict())

OrderedDict([('param1', tensor([-0.0404,  0.2881])), ('param2', tensor([-0.0075, -0.9145, -1.0886])), ('buffer1', tensor([ 1.3232,  0.0371, -0.2849, -0.1334])), ('param_list.0', tensor([-0.2666,  0.1894])), ('param_list.1', tensor([-0.2190,  2.0576])), ('param_list.2', tensor([-0.0354,  0.0627])), ('param_dict.bar', tensor([ 0.1753, -0.9315, -1.5055, -0.6610])), ('param_dict.foo', tensor([-0.7663,  1.0993,  2.7565])), ('linear.weight', tensor([[ 0.0197, -0.0610],
        [ 0.1431,  0.4496],
        [ 0.6698,  0.4491]])), ('linear.bias', tensor([ 0.6713, -0.0511, -0.6352]))])


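## 6. Module initialization
1. By default, the parameters and floating-point buffers of the modules provided by torch.nn are initialized at instantiation as 32-bit floating-point values stored on the CPU.
2. To change the default initialization, pass the corresponding arguments when instantiating the module, or use the torch.nn.utils.skip_init() function to skip the default initialization and then apply a custom one.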

In [28]:
# Initialize the module directly on the GPU, with 16-bit floating-point parameters
m = nn.Linear(5, 3, device='cuda', dtype=torch.half)

In [29]:
# Besides parameters, this also applies to floating-point buffers registered for the module
m = nn.BatchNorm2d(3, dtype=torch.half)
print(m.running_mean)

tensor([0., 0., 0.], dtype=torch.float16)


In [30]:
# Example: custom initialization of the weight as an orthogonal matrix
m = torch.nn.utils.skip_init(nn.Linear, 5, 3)
nn.init.orthogonal_(m.weight)

Parameter containing:
tensor([[-0.3930, -0.1617,  0.4533, -0.0858,  0.7788],
        [ 0.0826,  0.2477,  0.6222, -0.6544, -0.3411],
        [ 0.1306,  0.3920,  0.5422,  0.7263, -0.0882]], requires_grad=True)

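#### When writing a custom module, it is recommended to follow the same conventions as torch.nn:
1. Provide a device constructor kwarg that applies to any parameters and buffers registered by the module
2. Provide a dtype constructor kwarg that applies to any parameters and floating-point buffers registered by the module
3. Only use initialization functions (e.g. those from the torch.nn.init package) on the parameters and buffers in the module constructor. Note that this is only required for the module to work with skip_init().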
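
These conventions might look as follows in a custom module. AffineLayer is a hypothetical name and the choice of normal_/zeros_ initializers is arbitrary; the sketch assumes that, because the constructor accepts device and only uses torch.nn.init functions, the module can also be created through skip_init().

```python
import torch
import torch.nn as nn

class AffineLayer(nn.Module):
    def __init__(self, in_features, out_features, device=None, dtype=None):
        super().__init__()
        # Forward the device/dtype kwargs to every tensor the module creates.
        factory_kwargs = {"device": device, "dtype": dtype}
        self.weight = nn.Parameter(torch.empty(in_features, out_features, **factory_kwargs))
        self.bias = nn.Parameter(torch.empty(out_features, **factory_kwargs))
        # Only torch.nn.init functions touch the parameters in the constructor.
        nn.init.normal_(self.weight)
        nn.init.zeros_(self.bias)

    def forward(self, x):
        return x @ self.weight + self.bias

# Regular construction with a non-default dtype.
layer = AffineLayer(4, 3, dtype=torch.float64)
print(layer.weight.dtype)

# Construction without running the default initialization,
# followed by a custom initialization of both parameters.
lazy = torch.nn.utils.skip_init(AffineLayer, 4, 3)
nn.init.normal_(lazy.weight)
nn.init.zeros_(lazy.bias)
print(lazy.weight.shape, lazy.weight.device)
```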

## 7. Typical layers among torch's built-in modules

### nn.Flatten
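1. Signature: torch.nn.Flatten(start_dim=1, end_dim=-1)
2. Flattens the dims in the range [start_dim, end_dim]
3. By default the input is flattened to 2-D data: the first dim is kept and the remaining dims are merged, e.g. an output of shape (N, D)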

In [31]:
input_image = torch.rand(3,28,28)
print(input_image.size())

flatten = nn.Flatten()
flat_image = flatten(input_image)
print(flat_image.size())

flatten2 = nn.Flatten(0, 1)  # flatten the dims in the range [0, 1]
flat_image2 = flatten2(input_image)
print(flat_image2.size())

torch.Size([3, 28, 28])
torch.Size([3, 784])
torch.Size([84, 28])


### nn.Linear
1. affine layer
2. Signature: torch.nn.Linear(in_features, out_features, bias=True, device=None, dtype=None)
   · in_features (int) – size of each input sample
   · out_features (int) – size of each output sample
   · bias (bool) – if False, the layer does not learn a bias. Default: True
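
The affine map can be checked by hand. nn.Linear stores its weight with shape (out_features, in_features) and computes x @ weight.T + bias; the sizes below are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
linear = nn.Linear(in_features=5, out_features=2)
x = torch.randn(3, 5)  # a batch of 3 samples with 5 features each

manual = x @ linear.weight.T + linear.bias  # the affine map written out
builtin = linear(x)
print(linear.weight.shape)                  # (out_features, in_features)
print(torch.allclose(manual, builtin, atol=1e-6))

# With bias=False the layer has no bias parameter at all.
no_bias = nn.Linear(5, 2, bias=False)
print(no_bias.bias)
```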

In [32]:
layer1 = nn.Linear(in_features=28*28, out_features=6)
hidden1 = layer1(flat_image)
print(hidden1.size())

torch.Size([3, 6])


### nn.ReLU

In [33]:
print(f"Before ReLU:\n {hidden1}\n")
hidden1 = nn.ReLU()(hidden1)
print(f"After ReLU:\n {hidden1}")

Before ReLU:
 tensor([[-0.1059, -0.3650, -0.2354,  0.1382,  0.0520,  0.0531],
        [-0.2357, -0.4797, -0.2849, -0.0148,  0.2071,  0.2770],
        [-0.0447, -0.2520, -0.0628,  0.0251,  0.4561,  0.1185]],
       grad_fn=<AddmmBackward0>)

After ReLU:
 tensor([[0.0000, 0.0000, 0.0000, 0.1382, 0.0520, 0.0531],
        [0.0000, 0.0000, 0.0000, 0.0000, 0.2071, 0.2770],
        [0.0000, 0.0000, 0.0000, 0.0251, 0.4561, 0.1185]],
       grad_fn=<ReluBackward0>)


### nn.Sequential
1. an ordered container of modules.
2. Data is processed by the layers in the order they are listed in the Sequential

In [34]:
seq_modules = nn.Sequential(
    flatten,
    layer1,
    nn.ReLU(),
    nn.Linear(6, 10)
)
input_image = torch.rand(3,28,28)
scores = seq_modules(input_image)

softmax = nn.Softmax(dim=1)
pred_probab = softmax(scores)
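
A short sketch of what the softmax output can be used for. The scores here are random stand-ins for the logits of 3 samples over 10 classes, not the output of the network above: each row of probabilities sums to 1, and argmax picks the predicted class.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
scores = torch.randn(3, 10)              # stand-in logits: 3 samples, 10 classes

pred_probab = nn.Softmax(dim=1)(scores)  # normalize over the class dimension
row_sums = pred_probab.sum(dim=1)        # every row sums to 1
y_pred = pred_probab.argmax(dim=1)       # index of the most probable class per sample

print(row_sums)
print(y_pred)
```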