# File I/O

So far we have discussed how to process data and how
to build, train, and test deep learning models.
However, at some point we will hopefully be happy enough
with the learned models that we will want
to save the results for later use in various contexts
(perhaps even to make predictions in deployment).
Additionally, when running a long training process,
it is best practice to periodically save intermediate results (checkpointing)
to ensure that we do not lose several days' worth of computation
if we trip over the power cord of our server.
Thus it is time to learn how to load and store
both individual weight vectors and entire models.
This section addresses both issues.


In [1]:
import torch
from torch import nn
from torch.nn import functional as F

## (**Loading and Saving Tensors**)

For individual tensors, we can directly
invoke the `load` and `save` functions
to read and write them respectively.
Both functions require that we supply a file name,
and `save` additionally takes the object to be saved as its first argument.


In [2]:
x = torch.arange(4)
torch.save(x, 'x-file')

We can now read the data from the stored file back into memory.


In [3]:
x2 = torch.load('x-file')
x2

tensor([0, 1, 2, 3])

We can [**store a list of tensors and read them back into memory.**]


In [4]:
y = torch.zeros(4)
torch.save([x, y],'x-files')
x2, y2 = torch.load('x-files')
(x2, y2)

(tensor([0, 1, 2, 3]), tensor([0., 0., 0., 0.]))

We can even [**write and read a dictionary that maps
from strings to tensors.**]
This is convenient when we want
to read or write all the weights in a model.


In [5]:
mydict = {'x': x, 'y': y}
torch.save(mydict, 'mydict')
mydict2 = torch.load('mydict')
mydict2

{'x': tensor([0, 1, 2, 3]), 'y': tensor([0., 0., 0., 0.])}

## [**Loading and Saving Model Parameters**]

Saving individual weight vectors (or other tensors) is useful,
but it gets very tedious if we want to save
(and later load) an entire model.
After all, we might have hundreds of
parameter groups sprinkled throughout.
For this reason the deep learning framework provides built-in functionalities
to load and save entire networks.
An important detail to note is that this
saves model *parameters* and not the entire model.
For example, if we have a 3-layer MLP,
we need to specify the architecture separately.
The reason for this is that the models themselves can contain arbitrary code,
hence they cannot be serialized as naturally.
Thus, in order to reinstate a model, we need
to generate the architecture in code
and then load the parameters from disk.
(**Let's start with our familiar MLP.**)


In [14]:
class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.LazyLinear(256)
        self.layer2 = nn.LazyLinear(100)
        self.output = nn.LazyLinear(10)

    def forward(self, x):
        x = F.relu(self.layer1(x))
        x = F.relu(self.layer2(x))
        return self.output(x)

net = MLP()
X = torch.randn(size=(2, 20))
Y = net(X)

Next, we [**store the parameters of the model as a file**] with the name "mlp.params".


In [15]:
torch.save(net.state_dict(), 'mlp.params')
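
To see what actually ends up in such a file, it can help to inspect a `state_dict` directly. The following is a minimal sketch that assumes a small stand-in network (`tiny_net`, with hypothetical layer sizes and eager `nn.Linear` layers) rather than the MLP above. It only illustrates that the dictionary maps parameter names to tensors and carries no architecture code.

```python
import torch
from torch import nn

# Hypothetical stand-in network; the layer sizes are chosen for illustration only.
tiny_net = nn.Sequential(nn.Linear(20, 256), nn.ReLU(), nn.Linear(256, 10))

# state_dict() maps parameter names to tensors. The ReLU has no parameters,
# so it contributes no entries, and no layer types or code are stored.
state = tiny_net.state_dict()
shapes = {name: tuple(tensor.shape) for name, tensor in state.items()}
for name, shape in shapes.items():
    print(name, shape)
```

The keys are derived from the attribute names (here the positions inside `nn.Sequential`), which is why the loading code must construct a module with matching names and shapes.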

To recover the model, we instantiate a clone
of the original MLP model.
Instead of randomly initializing the model parameters,
we [**read the parameters stored in the file directly**].


In [8]:
clone = MLP()
clone.load_state_dict(torch.load('mlp.params'))
clone.eval()

MLP(
  (layer1): LazyLinear(in_features=0, out_features=256, bias=True)
  (layer2): LazyLinear(in_features=0, out_features=100, bias=True)
  (output): LazyLinear(in_features=0, out_features=10, bias=True)
)

Since both instances have the same model parameters,
the computational result of the same input `X` should be the same.
Let's verify this.


In [9]:
Y_clone = clone(X)
Y_clone == Y

tensor([[True, True, True, True, True, True, True, True, True, True],
        [True, True, True, True, True, True, True, True, True, True]])

## Summary

The `save` and `load` functions can be used to perform file I/O for tensor objects.
We can save and load the entire sets of parameters for a network via a parameter dictionary.
Saving the architecture has to be done in code rather than in parameters.

## Exercises

1. Even if there is no need to deploy trained models to a different device, what are the practical benefits of storing model parameters?
1. Assume that we want to reuse only parts of a network to be incorporated into a network having a different architecture. How would you go about using, say, the first two layers from a previous network in a new network?
1. How would you go about saving the network architecture and parameters? What restrictions would you impose on the architecture?


[Discussions](https://discuss.d2l.ai/t/61)


1/
- Checkpointing and fault tolerance: training can take hours or days. If the program crashes, the power goes out, or the machine runs out of resources, we can resume training from the last checkpoint once the problem is resolved.
- Early stopping and model selection: we can save the model parameters at the best epoch of training and continue from them when appropriate.
- Reproducibility: saved model parameters help us visualize results, debug, and compare runs.
- Offline evaluation: we can save models trained with different seeds and hyperparameters and evaluate them later.
- Persistence: parameters held in RAM are temporary, whereas a saved file persists, so we can free up memory and compute for other purposes.
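
The checkpointing point can be sketched as follows. This is a minimal illustration, not a full training loop: the helper names `save_checkpoint` and `load_checkpoint`, the dictionary keys, and the tiny linear model are all assumptions made for the example. The idea is to store the optimizer state and the epoch counter next to the model parameters so that training can resume where it stopped.

```python
import os
import tempfile

import torch
from torch import nn


def save_checkpoint(path, model, optimizer, epoch):
    # Hypothetical helper: bundle everything needed to resume into one dict.
    torch.save({'epoch': epoch,
                'model_state': model.state_dict(),
                'optimizer_state': optimizer.state_dict()}, path)


def load_checkpoint(path, model, optimizer):
    # Hypothetical helper: restore the states and return the next epoch to run.
    checkpoint = torch.load(path)
    model.load_state_dict(checkpoint['model_state'])
    optimizer.load_state_dict(checkpoint['optimizer_state'])
    return checkpoint['epoch'] + 1


model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# One optimization step so that the optimizer has momentum state worth saving.
X, y = torch.randn(8, 4), torch.randn(8, 2)
loss = nn.functional.mse_loss(model(X), y)
loss.backward()
optimizer.step()

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'checkpoint.pt')
    save_checkpoint(path, model, optimizer, epoch=3)

    # After a crash we would rebuild the objects in code, then restore them.
    resumed = nn.Linear(4, 2)
    resumed_optimizer = torch.optim.SGD(resumed.parameters(), lr=0.1, momentum=0.9)
    start_epoch = load_checkpoint(path, resumed, resumed_optimizer)
```

Saving the optimizer state matters for optimizers with internal buffers (momentum here); without it, a resumed run would restart those buffers from zero.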


In [24]:
# 2. We can save the learned parameters of the first two layers using state_dict, then load those parameters into the corresponding layers of the new network.
class NewMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.LazyLinear(256)
        self.layer2 = nn.LazyLinear(100)
        self.output = nn.LazyLinear(5) # different architecture

    def forward(self, x):
        x = F.relu(self.layer1(x))
        x = F.relu(self.layer2(x))
        return self.output(x)

net.load_state_dict(torch.load('mlp.params'))
new_net = NewMLP()

# Materialize lazy layers
dummy = torch.rand(1, 20)
_ = net(dummy)
_ = new_net(dummy)

new_net.layer1.load_state_dict(net.layer1.state_dict())
new_net.layer2.load_state_dict(net.layer2.state_dict())

# Freeze the copied layers
for param in new_net.layer1.parameters():
    param.requires_grad = False
for param in new_net.layer2.parameters():
    param.requires_grad = False

X = torch.randn(size=(3, 20))
Y = new_net(X)
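
An alternative to copying layer by layer is to filter the saved dictionary and load it with `strict=False`. The sketch below assumes a hypothetical `SmallMLP` class with eager `nn.Linear` layers (so no dummy forward pass is needed) and the same attribute names as above. Note that `strict=False` only tolerates missing or unexpected keys; entries whose shapes differ still raise an error, so the `output.` entries are removed first.

```python
import torch
from torch import nn


class SmallMLP(nn.Module):
    # Hypothetical eager variant of the MLP, used only for this sketch.
    def __init__(self, num_outputs):
        super().__init__()
        self.layer1 = nn.Linear(20, 256)
        self.layer2 = nn.Linear(256, 100)
        self.output = nn.Linear(100, num_outputs)

    def forward(self, x):
        x = torch.relu(self.layer1(x))
        x = torch.relu(self.layer2(x))
        return self.output(x)


old_net = SmallMLP(num_outputs=10)
new_net = SmallMLP(num_outputs=5)

# Drop the output layer's entries, whose shapes do not match the new network.
shared = {name: tensor for name, tensor in old_net.state_dict().items()
          if not name.startswith('output.')}

# strict=False lets the new network keep its own freshly initialized output layer.
result = new_net.load_state_dict(shared, strict=False)
print(result.missing_keys)
```

The returned object lists the keys that were not loaded, which is a convenient check that only the intended layer was left untouched.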



3/ If we save the whole model, e.g. `torch.save(model, "model.pkl")`, the file depends closely on the class definitions in our code and on the PyTorch version, which makes it hard to maintain and reuse and prone to breaking when the code is refactored. In practice, we usually save the model parameters together with the model configuration (in JSON or YAML format) and rebuild the architecture from that configuration when loading.
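
A minimal sketch of the configuration-plus-parameters approach follows. The function `build_mlp`, the configuration keys (`num_inputs`, `hidden_sizes`, `num_outputs`), and the file names are assumptions chosen for illustration. The restriction this imposes is that the architecture must be fully describable by the configuration, here a plain stack of linear layers with ReLU activations.

```python
import json
import os
import tempfile

import torch
from torch import nn


def build_mlp(config):
    # Hypothetical builder: the architecture is fully determined by the config.
    layers, in_features = [], config['num_inputs']
    for width in config['hidden_sizes']:
        layers += [nn.Linear(in_features, width), nn.ReLU()]
        in_features = width
    layers.append(nn.Linear(in_features, config['num_outputs']))
    return nn.Sequential(*layers)


config = {'num_inputs': 20, 'hidden_sizes': [256, 100], 'num_outputs': 10}
model = build_mlp(config)

with tempfile.TemporaryDirectory() as tmp:
    config_path = os.path.join(tmp, 'mlp.json')
    params_path = os.path.join(tmp, 'mlp.params')

    # Save the architecture description and the parameters side by side.
    with open(config_path, 'w') as f:
        json.dump(config, f)
    torch.save(model.state_dict(), params_path)

    # Later: rebuild the architecture from the config, then load the parameters.
    with open(config_path) as f:
        restored = build_mlp(json.load(f))
    restored.load_state_dict(torch.load(params_path))

# Both networks should produce identical outputs on the same input.
X = torch.randn(2, 20)
same = torch.equal(model(X), restored(X))
```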