Once we have chosen an architecture and set our hyperparameters, we proceed to the training loop, where our goal is to find parameter values that minimize our loss function. After training, we will need these parameters in order to make future predictions. Additionally, we will sometimes wish to extract the parameters, perhaps to reuse them in some other context, to save our model to disk so that it may be executed in other software, or to examine them in the hope of gaining scientific understanding.

When we move away from stacked architectures with standard layers, we will sometimes need to get into the weeds of declaring and manipulating parameters. In this section, we cover the following:

1. Accessing parameters for debugging, diagnostics, and visualizations.

2. Sharing parameters across different model components.

In [1]:
import torch
from torch import nn

In [2]:
net = nn.Sequential(nn.LazyLinear(8),
                    nn.ReLU(),
                    nn.LazyLinear(1))

X = torch.rand(size=(2, 4))
net(X).shape



torch.Size([2, 1])

In [7]:
# We can see that this fully connected layer contains two parameters, corresponding to that layer’s weights and biases, respectively.

net[2].state_dict()

OrderedDict([('weight',
              tensor([[-0.0537, -0.1868,  0.2111, -0.2906,  0.1202, -0.0968, -0.0236,  0.1317]])),
             ('bias', tensor([-0.3029]))])

In [5]:
# The following code extracts the bias from the second neural network layer, which returns a parameter class instance, and further accesses that parameter’s value.
type(net[2].bias), net[2].bias.data

(torch.nn.parameter.Parameter, tensor([-0.3029]))

In [6]:
# In addition to the value, each parameter also allows us to access the gradient. Because we have not invoked backpropagation for this network yet, it is in its initial state.
net[2].weight.grad is None

True
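
To see the gradient leave its initial state, here is a minimal sketch that runs one backward pass and inspects the gradient again. The summed output used as a loss is an assumption made purely for illustration; it is not part of the original example.

```python
import torch
from torch import nn

net = nn.Sequential(nn.LazyLinear(8), nn.ReLU(), nn.LazyLinear(1))
X = torch.rand(size=(2, 4))

# Stand-in scalar loss (an assumption for illustration only)
loss = net(X).sum()
loss.backward()

# After backpropagation the gradient is a tensor shaped like the weight
grad_is_set = net[2].weight.grad is not None
grad_shape = net[2].weight.grad.shape
print(grad_is_set, grad_shape)
```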

In [8]:
# Below we demonstrate accessing the parameters of all layers.
[(name, param.shape) for name, param in net.named_parameters()]

[('0.weight', torch.Size([8, 4])),
 ('0.bias', torch.Size([8])),
 ('2.weight', torch.Size([1, 8])),
 ('2.bias', torch.Size([1]))]
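
The names reported by `named_parameters()` can also serve as keys into the network's `state_dict()`. The sketch below rebuilds the same small network and looks up one parameter by name; the variable names `state` and `bias_from_state` are introduced here only for illustration.

```python
import torch
from torch import nn

net = nn.Sequential(nn.LazyLinear(8), nn.ReLU(), nn.LazyLinear(1))
net(torch.rand(size=(2, 4)))  # a forward pass materializes the lazy parameters

# The state dict is keyed by the same dotted names as named_parameters()
state = net.state_dict()
bias_from_state = state['2.bias']

# The entry holds the same values as the attribute-based access
same_values = torch.equal(bias_from_state, net[2].bias.data)
print(list(state.keys()), same_values)
```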

# We need to give the shared layer a name so that we can refer to its
# parameters
shared = nn.LazyLinear(8)
net = nn.Sequential(nn.LazyLinear(8), nn.ReLU(),
                    shared, nn.ReLU(),
                    shared, nn.ReLU(),
                    nn.LazyLinear(1))

net(X)
# Check whether the parameters are the same
print(net[2].weight.data[0] == net[4].weight.data[0])
net[2].weight.data[0, 0] = 100
# Make sure that they are actually the same object rather than just having the
# same value
print(net[2].weight.data[0] == net[4].weight.data[0])

# You might wonder: when parameters are tied, what happens to the gradients? Each parameter stores its own gradient, and both
# positions refer to the same parameter, so the gradients of the second and third hidden layers are added together during backpropagation.
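
One way to check the claim about summed gradients is to compare the tied network against an untied copy that holds the same values in two independent layers. The sketch below does this under a few assumptions that are not in the original: explicit `nn.Linear` sizes instead of lazy layers (so the layers can be deep-copied before any forward pass), a summed output as a stand-in loss, and the names `tied` and `untied`.

```python
import copy
import torch
from torch import nn

torch.manual_seed(0)
X = torch.rand(size=(2, 4))

# Tied network: the same layer object appears at positions 2 and 4
shared = nn.Linear(8, 8)
tied = nn.Sequential(nn.Linear(4, 8), nn.ReLU(),
                     shared, nn.ReLU(),
                     shared, nn.ReLU(),
                     nn.Linear(8, 1))

# Untied copy: identical values, but positions 2 and 4 are independent layers
untied = nn.Sequential(copy.deepcopy(tied[0]), nn.ReLU(),
                       copy.deepcopy(shared), nn.ReLU(),
                       copy.deepcopy(shared), nn.ReLU(),
                       copy.deepcopy(tied[6]))

# Stand-in scalar loss (an assumption for illustration only)
tied(X).sum().backward()
untied(X).sum().backward()

# The tied gradient should equal the sum of the two untied gradients
summed = untied[2].weight.grad + untied[4].weight.grad
grads_match = torch.allclose(tied[2].weight.grad, summed)
print(grads_match)
```

Because both networks compute the same function at these weights, the only difference is whether the two contributions land in one gradient buffer or in two.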

1. Accessing the parameters of various layers in NestMLP:

Assuming that the NestMLP model is defined as follows (as in Section 6.1):

```python
import torch
import torch.nn as nn

class NestMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 32), nn.ReLU())
        self.linear = nn.Linear(32, 16)

    def forward(self, x):
        return self.linear(self.net(x))
```

To access the parameters of various layers, you can use the `named_parameters()` method:

```python
nest_mlp = NestMLP()
for name, param in nest_mlp.named_parameters():
    print(name, param.size())
```
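
For a nested model, the dotted names mirror the attribute path, so a single parameter can be reached either by indexing into the submodules or by looking up its name. A short sketch, repeating the `NestMLP` definition assumed above so that it runs on its own:

```python
import torch
from torch import nn

class NestMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(20, 64), nn.ReLU(),
                                 nn.Linear(64, 32), nn.ReLU())
        self.linear = nn.Linear(32, 16)

    def forward(self, x):
        return self.linear(self.net(x))

nest_mlp = NestMLP()

# Dotted names follow the module hierarchy: attribute name, then index
names = [name for name, _ in nest_mlp.named_parameters()]
print(names)

# Attribute-style access and name-based lookup reach the same object
first_weight = nest_mlp.net[0].weight
by_name = dict(nest_mlp.named_parameters())['net.0.weight']
print(first_weight.shape, first_weight is by_name)
```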

2. Constructing an MLP containing a shared parameter layer and observing the model parameters and gradients:

```python
class SharedMLP(nn.Module):
    def __init__(self, shared_layer):
        super().__init__()
        self.shared_layer = shared_layer
        self.linear = nn.Linear(32, 16)

    def forward(self, x):
        return self.linear(self.shared_layer(x))

shared_layer = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 32), nn.ReLU())
mlp1 = SharedMLP(shared_layer)
mlp2 = SharedMLP(shared_layer)

# Train the MLPs and observe the parameters and gradients
# ...
```

During the training process, you can observe the model parameters and gradients of each layer using the following code snippet:

```python
for model in [mlp1, mlp2]:
    print(f"Model: {model}")
    for name, param in model.named_parameters():
        print(f"Parameter: {name}")
        print("Value:", param.data)
        print("Gradient:", param.grad)
    print()
```
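
Before training, it can help to confirm that the two models really hold the same parameter objects in the shared block while keeping separate output layers. The sketch below assumes a random input batch and a summed output as a stand-in loss; neither comes from the original exercise.

```python
import torch
from torch import nn

class SharedMLP(nn.Module):
    def __init__(self, shared_layer):
        super().__init__()
        self.shared_layer = shared_layer
        self.linear = nn.Linear(32, 16)

    def forward(self, x):
        return self.linear(self.shared_layer(x))

torch.manual_seed(0)
shared_layer = nn.Sequential(nn.Linear(20, 64), nn.ReLU(),
                             nn.Linear(64, 32), nn.ReLU())
mlp1 = SharedMLP(shared_layer)
mlp2 = SharedMLP(shared_layer)

# The shared block is one object; each model has its own output layer
same_object = mlp1.shared_layer[0].weight is mlp2.shared_layer[0].weight
heads_differ = mlp1.linear.weight is not mlp2.linear.weight

# Backward passes through either model write into the same gradient
x = torch.rand(4, 20)  # hypothetical input batch
mlp1(x).sum().backward()
mlp2(x).sum().backward()
same_grad = torch.equal(mlp1.shared_layer[0].weight.grad,
                        mlp2.shared_layer[0].weight.grad)
print(same_object, heads_differ, same_grad)
```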

3. Why is sharing parameters a good idea?

Sharing parameters is a good idea in certain scenarios, such as when you have multiple parts of your model that perform the same operation and learn the same features. By sharing parameters, you can:

- Reduce the total number of parameters in your model, which can help prevent overfitting.
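- Save memory, as only one copy of the shared weights (and of their gradients) is stored and updated. The forward and backward passes still run once for each place the layer is used.
- Make it easier to reuse learned parameters, as in transfer learning, where a pre-trained model can be fine-tuned for a new task with a smaller dataset.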
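
The reduction in parameter count can be checked directly, since `parameters()` yields each shared parameter only once. The sketch below compares a tied and an untied version of a small MLP; the helper `count_params` and the layer sizes are assumptions chosen for illustration.

```python
import torch
from torch import nn

def count_params(model):
    """Total number of scalar parameters, counting shared ones once."""
    return sum(p.numel() for p in model.parameters())

# Tied: the same 8x8 layer is used at two positions
shared = nn.Linear(8, 8)
tied = nn.Sequential(nn.Linear(4, 8), nn.ReLU(),
                     shared, nn.ReLU(),
                     shared, nn.ReLU(),
                     nn.Linear(8, 1))

# Untied: two independent 8x8 layers
untied = nn.Sequential(nn.Linear(4, 8), nn.ReLU(),
                       nn.Linear(8, 8), nn.ReLU(),
                       nn.Linear(8, 8), nn.ReLU(),
                       nn.Linear(8, 1))

tied_count = count_params(tied)
untied_count = count_params(untied)
print(tied_count, untied_count)  # 121 193
```

The difference of 72 is exactly one 8x8 weight matrix plus its 8 biases.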

However, it's important to note that sharing parameters is not always the best approach. It can limit the model's capacity to learn distinct features in different parts of the model. You should carefully consider the problem and the architecture of your model before deciding to share parameters.