### Persist and load PyTorch models

* persist and load trained models with `torch.save` and `torch.load`
  + Pros
    + simple and intuitive syntax
    + saves the entire module using pickle
  + Cons
    + serialized data is bound to the specific classes used in your code
    + the exact directory structure in which the model was created is saved
    + the model class is not saved in isolation
    + the serialized state is tied to the current state and path of your file
    + introduces dependencies and fragility
    + pickle saves a path to the file containing the class, which is used at load time
      + if you refactor or move your code after serialization, loading can break
    + the model class must be importable from the file that loads the model. If the class is defined only in the file that saved the model, a new file that loads the model without that definition fails with an error
* recommended: persist the `state_dict` to save learnable parameters for maximum flexibility during restoration
  + save and load models and use checkpoints
  + use ONNX (Open Neural Network Exchange) for model portability
  + every object in PyTorch that learns from the data that you feed in during training has a state dictionary
    + this includes not only models, but optimizers
    + `state_dict` is a Python dictionary that maps each layer to its tensor of learnable parameters (weights and biases)
    + for optimizers, it also contains hyperparameter information
    + use `torch.nn.Module.load_state_dict` to load a model's parameter dictionary
    + this copies the weights from a deserialized `state_dict` into an already instantiated model
  + makes the state of models and optimizers very modular
  + allows you to resume training from a checkpoint
* checkpoints
  + checkpoints can be used to resume training for a model
  + when checkpointing, save the `state_dict` of both the model and the optimizer
* when you load a model and want to use it for prediction, call `model.eval()`
  + this switches dropout and batch normalization layers to inference behavior, which differs from their training behavior
  + skipping this call leads to inconsistent inference results
* to resume training, call `model.train()`; a newly constructed model is in training mode by default
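
The difference between the two modes can be seen with a tiny dropout model. This is a minimal sketch: the layer sizes and the name `net` are illustrative assumptions, not part of the original notes.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative model: one linear layer followed by dropout.
net = nn.Sequential(nn.Linear(8, 8), nn.Dropout(p=0.5))
x = torch.ones(4, 8)

# A freshly constructed module starts in training mode.
starts_in_training = net.training

# In eval mode dropout is a no-op, so repeated calls agree.
net.eval()
with torch.no_grad():
    eval_out_1 = net(x)
    eval_out_2 = net(x)
eval_is_deterministic = torch.equal(eval_out_1, eval_out_2)

# Back in training mode dropout randomly zeroes activations,
# so two forward passes generally differ.
net.train()
train_out_1 = net(x)
train_out_2 = net(x)
```
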
    
* production ML Workflow
![image.png](attachment:image.png)
* torch.save()
  + works with any kind of torch object such as models, tensors, dictionaries etc.
  + uses Python's pickle utility to serialize an object to disk
* torch.load()
  + use torch.load() to reload objects saved by torch.save()
  + uses Python's pickle utility to deserialize objects
  + the `map_location` argument specifies the device to load onto, such as the CPU or a CUDA GPU
* `torch.nn.Module.load_state_dict` loads model parameters into an initialized model
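
The recommended `state_dict` round trip might look like the sketch below. `TinyNet` is a hypothetical model, and an in-memory buffer stands in for a file path such as `model.pth`.

```python
import io

import torch
import torch.nn as nn


class TinyNet(nn.Module):
    """Illustrative model; the architecture is an assumption."""

    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 2)

    def forward(self, x):
        return self.fc(x)


trained = TinyNet()

# Serialize only the learnable parameters. A file path works
# the same way as this in-memory buffer.
buffer = io.BytesIO()
torch.save(trained.state_dict(), buffer)
buffer.seek(0)

# Restoring requires the class definition: build the model first,
# then copy the saved tensors into it.
restored = TinyNet()
state = torch.load(buffer, map_location="cpu")
restored.load_state_dict(state)
restored.eval()  # switch to inference behavior before predicting

x = torch.randn(3, 4)
with torch.no_grad():
    same_output = torch.allclose(trained(x), restored(x))
```

Because only tensors are stored, the saved file does not depend on the path or module layout of the code that defined `TinyNet`.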

### Save checkpoints to resume training
* collect the `state_dict` of the model and of the optimizer, together with the epoch number, in a dictionary called `checkpoint`, and save it with `torch.save`
* load the checkpoint with `torch.load`, then load each `state_dict` into a new model and optimizer to continue training from the saved epoch number
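
A minimal sketch of these two steps follows. The dictionary keys, the epoch value, and the stand-in `nn.Linear` model are assumptions chosen for illustration.

```python
import io

import torch
import torch.nn as nn

model = nn.Linear(4, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One training step so the optimizer has internal state worth saving.
loss = model(torch.randn(8, 4)).pow(2).mean()
loss.backward()
optimizer.step()

# Bundle everything needed to resume into one dictionary.
checkpoint = {
    "epoch": 5,  # illustrative epoch number
    "model_state_dict": model.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(),
}
buffer = io.BytesIO()  # stands in for a path such as "checkpoint.pth"
torch.save(checkpoint, buffer)
buffer.seek(0)

# Later: rebuild fresh objects, then restore their state.
new_model = nn.Linear(4, 2)
new_optimizer = torch.optim.Adam(new_model.parameters(), lr=1e-3)
loaded = torch.load(buffer, map_location="cpu")
new_model.load_state_dict(loaded["model_state_dict"])
new_optimizer.load_state_dict(loaded["optimizer_state_dict"])
start_epoch = loaded["epoch"] + 1

new_model.train()  # resume in training mode
```
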

### ONNX
* ONNX is an open format to represent deep learning models that allows models to be re-used across frameworks
* ONNX models supported in
  + Caffe2
  + Microsoft Cognitive Toolkit (CNTK)
  + Apache MXNet
  + PyTorch

In [1]:
import pandas as pd
import numpy as np

import torch
import torch.nn as nn
from torch.nn import functional as F

In [2]:
print(pd.__version__)

1.4.2


In [3]:
print(np.__version__)

1.24.1


In [4]:
print(torch.__version__)

2.0.1+cpu


In [5]:
!pip install protobuf onnx

Defaulting to user installation because normal site-packages is not writeable

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
mysql-connector-python 8.0.31 requires protobuf<=3.20.1,>=3.11.0, but you have protobuf 4.23.4 which is incompatible.



Collecting onnx
  Downloading onnx-1.14.0-cp310-cp310-win_amd64.whl (13.3 MB)
     ---------------------------------------- 13.3/13.3 MB 4.6 MB/s eta 0:00:00
Collecting protobuf
  Downloading protobuf-4.23.4-cp310-abi3-win_amd64.whl (422 kB)
     -------------------------------------- 422.5/422.5 kB 2.6 MB/s eta 0:00:00
Installing collected packages: protobuf, onnx
  Attempting uninstall: protobuf
    Found existing installation: protobuf 3.20.1
    Uninstalling protobuf-3.20.1:
      Successfully uninstalled protobuf-3.20.1
Successfully installed onnx-1.14.0 protobuf-4.23.4


### Run training using multiple processes, devices and machines
* the `torch.multiprocessing` package is a wrapper around Python's `multiprocessing` module
* a better option is usually data parallelism, which replicates the model and splits each batch across devices
* training can be parallelized across multiple GPUs or CPUs
  + for example, across multiple CUDA devices

#### use multiprocessing to train models by multiple python processes
* overview
  + `torch.multiprocessing`
  + wrapper around the native Python `multiprocessing` module
  + best suited to prototyping, because all data handling is left to the user
  + you must make sure there are no memory leaks and that data is shared across processes correctly
  + not recommended for production since it is prone to memory leaks
  + tensors holding your training data and model parameters are moved to shared memory accessible to all processes training the model
  + only CPU tensors are shared in this manner, using one of two sharing strategies
    + `file_descriptor`
      + uses file descriptors as shared memory handles to share data across processes
    + `file_system`
      + uses file names to identify shared memory regions
  + if you work with CUDA tensors, you need to use the `torch.cuda` API
* code for multiprocess training
  + define a `train(model, Xtrain, Ytrain, optimizer)` function to run training in different processes
    + this function uses the usual PyTorch routine to train the model
    ```python
    import torch
    import torch.multiprocessing as mp
    import torch.optim as optim
    from sklearn.metrics import accuracy_score

    # Net, train, Xtrain, Ytrain, Xtest and Ytest are assumed to be defined earlier

    # lists the sharing strategies for CPU tensors supported
    # on this machine; GPU tensors use the CUDA API instead
    print(mp.get_all_sharing_strategies())

    if __name__ == '__main__':
        # 'fork' clones the Python interpreter: the child process is
        # identical to the parent and shares the parent's resources.
        # On Windows, use 'spawn', which starts a fresh Python interpreter
        # whose child process inherits only some resources.
        mp.set_start_method('fork')

        # move training and test data to shared memory
        Xtrain.share_memory_()
        Ytrain.share_memory_()
        Xtest.share_memory_()
        Ytest.share_memory_()

        model = Net()
        model.share_memory()

        optimizer = optim.Adam(model.parameters(), lr=0.001)
        processes = []

        for rank in range(4):
            p = mp.Process(target=train, args=(model, Xtrain, Ytrain, optimizer))
            p.start()
            processes.append(p)

        # wait for every worker process to finish
        for p in processes:
            p.join()

        # evaluate only once all processes have completed
        if not any(p.is_alive() for p in processes):
            model.eval()

            with torch.no_grad():
                predict_out = model(Xtest)
            _, predict_y = torch.max(predict_out, 1)

            print("\n")
            print('predict accuracy', accuracy_score(Ytest.numpy(), predict_y.numpy()))
    ```
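
The notes refer to a `train` function without showing it. A minimal sketch of what it could look like is given below, assuming a classifier that outputs log-probabilities; the `Net` architecture, the `epochs` parameter, and the random data are hypothetical. It is run here in a single process only to check that it works.

```python
import torch
import torch.nn as nn


class Net(nn.Module):
    """Hypothetical classifier that outputs log-probabilities."""

    def __init__(self, n_features=4, n_classes=3):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(n_features, 16),
            nn.ReLU(),
            nn.Linear(16, n_classes),
            nn.LogSoftmax(dim=1),
        )

    def forward(self, x):
        return self.layers(x)


def train(model, Xtrain, Ytrain, optimizer, epochs=20):
    """Standard training loop; each worker process would run this on the shared model."""
    loss_fn = nn.NLLLoss()
    model.train()
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(model(Xtrain), Ytrain)
        loss.backward()
        optimizer.step()
    return loss.item()


# Single-process check on random data (illustrative only).
torch.manual_seed(0)
Xtrain = torch.randn(32, 4)
Ytrain = torch.randint(0, 3, (32,))
model = Net()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
initial_loss = nn.NLLLoss()(model(Xtrain), Ytrain).item()
final_loss = train(model, Xtrain, Ytrain, optimizer)
```
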
      
#### train model using different processes on different processors or devices using data parallel
* overview
  + very easy to place model on GPU and train your model on different devices including both CPU and GPU
  + for sequential training using just a single process, pytorch makes it very easy to place our model on GPU and only use a single GPU by default
  + if you want to use multiple GPUs, you need to use nn.DataParallel
    + this allows you to place the model that you want to train on different devices, whether CPUs or GPUs
    + you simply pass your model object to data parallel, which acts as a wrapper. Data parallel will then replicate the same model to all GPUs
    + the entire model is copied over, and the replicas are trained in parallel, which can significantly accelerate training
    + data parallel chunks the input data along the batch dimension, and each replica of the model handles one subset of the data
    + this chunking of the training data along the batch dimension is entirely transparent to you
      + it mitigates the data handling issues encountered with `torch.multiprocessing`
  + the sub-batches are distributed to the devices for the forward pass and gradient calculation, and the summed gradients are then sent to the model on the default device
* code for data parallel
```python
import torch
import torch.nn as nn

# Net, Xtrain and Ytrain are assumed to be defined earlier
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model = Net()

# replicates the model on each device to work on a subset of the input;
# chunks each input batch across the devices, which are on the same VM
model = nn.DataParallel(model)

# move the model to the default CUDA device, cuda:0;
# the model will be distributed to the available GPUs
model.to(device)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.NLLLoss()

# train the model as usual
epochs = 100

for epoch in range(epochs):
    optimizer.zero_grad()

    Ypred = model(Xtrain.to(device))
    loss = loss_fn(Ypred, Ytrain.to(device))

    # in the backward pass, gradients from each replica
    # are summed into the original model on the default device (cuda:0).
    # DataParallel takes care of this transparently
    loss.backward()
    optimizer.step()

    # the replicas run on different devices, each with its own chunk
    # of the data inside the model (forward() in model),
    # and run backward before the gradients are summed
    if torch.cuda.is_available():
        print(f"outside model, device_id: {torch.cuda.current_device()}")
```
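
The sketch below shows one way to make the same setup run on machines with zero, one, or several GPUs, and illustrates by hand the batch splitting that `nn.DataParallel` performs. The `nn.Linear` stand-in model and the two-way split are assumptions for illustration.

```python
import torch
import torch.nn as nn

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model = nn.Linear(10, 3)  # stand-in for the real network
if torch.cuda.device_count() > 1:
    # Only wrap when there is more than one GPU to spread batches over.
    model = nn.DataParallel(model)
model = model.to(device)

batch = torch.randn(8, 10, device=device)
# When the model is wrapped, per-device outputs are gathered on cuda:0.
output = model(batch)

# What DataParallel does to the input, shown by hand for two devices:
# the batch is split along dimension 0 and each replica sees one chunk.
chunks = torch.chunk(batch, 2, dim=0)
chunk_sizes = [c.shape[0] for c in chunks]
```
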
      
      
#### Use model parallel option to distribute model across multiple devices
  + data parallel does not work when the model parameters are too large to fit on a single GPU, since data parallel replicates the entire model to each GPU
  + model parallel splits the model parameters and places a different subset of them on each device
    + since the model is a neural network, you can imagine different sub-networks of your model being placed on different devices
  + only a subset of the model operates on each device, and many devices are used collectively to train a single model
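
A minimal sketch of the idea, assuming a hypothetical two-stage network: each stage is placed on its own device, and activations are copied between devices in `forward`. When fewer than two GPUs are present, both stages fall back to the CPU so the example still runs.

```python
import torch
import torch.nn as nn

# Use two GPUs when available; otherwise both halves share the CPU.
if torch.cuda.device_count() >= 2:
    dev0, dev1 = torch.device("cuda:0"), torch.device("cuda:1")
else:
    dev0 = dev1 = torch.device("cpu")


class TwoStageNet(nn.Module):
    """Hypothetical network whose two halves live on different devices."""

    def __init__(self):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Linear(10, 32), nn.ReLU()).to(dev0)
        self.stage2 = nn.Linear(32, 3).to(dev1)

    def forward(self, x):
        hidden = self.stage1(x.to(dev0))
        # Activations are copied to the second device between stages.
        return self.stage2(hidden.to(dev1))


model = TwoStageNet()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(8, 10)
y = torch.randint(0, 3, (8,))
loss = nn.CrossEntropyLoss()(model(x), y.to(dev1))
loss.backward()  # autograd routes gradients back across devices
optimizer.step()
```
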

#### Use distributed data parallel with distributed computing resources 
  + work with multiple machines in a cluster
  + `torch.nn.parallel.DistributedDataParallel` is a synchronous distributed training wrapper around a PyTorch model
  + you instantiate this object and pass your model in, which allows the model to be trained on a cluster of network-connected machines
  + the user must explicitly launch a separate copy of the training script for each process
  + this is preferable to data parallel even on a single machine because
    + each process has its own optimizer, so no parameter broadcast is needed, which greatly reduces network bandwidth use and latency
    + you save the time required to transfer tensors between nodes
    + each process has its own Python interpreter
    + together these make training very efficient
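
The synchronous step at the heart of distributed data parallel training is gradient averaging across workers. The sketch below simulates it in a single process, without any process group: two identical replicas each see half of a batch, and their averaged gradients match the full-batch gradient. The model, data, and two-worker setup are assumptions for illustration, and the equality relies on a mean-reduced loss with equally sized shards.

```python
import copy

import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(8, 4)
y = torch.randn(8, 1)
loss_fn = nn.MSELoss()  # mean reduction

reference = nn.Linear(4, 1)

# Two "workers", each holding an identical replica of the model.
replicas = [copy.deepcopy(reference) for _ in range(2)]

# Full-batch gradient on a single model, for comparison.
loss_fn(reference(X), y).backward()
full_grad = reference.weight.grad.clone()

# Each worker computes gradients on its own half of the batch.
shards = zip(torch.chunk(X, 2), torch.chunk(y, 2))
for replica, (X_part, y_part) in zip(replicas, shards):
    loss_fn(replica(X_part), y_part).backward()

# The all-reduce step, done by hand: average the per-worker gradients.
avg_grad = torch.stack([r.weight.grad for r in replicas]).mean(dim=0)
grads_match = torch.allclose(avg_grad, full_grad, atol=1e-6)
```

Since every worker applies the same averaged gradient to identical weights, the replicas stay in sync without broadcasting parameters.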

### Implementing Distributed training on multiple machines
* demo of distributed training with the AWS SageMaker API, using PyTorch estimators to parallelize training across machines
  + An estimator object is a high-level API that abstracts away the whole distributed training process from you
  + specific to a cloud platform that helps build, train and deploy PyTorch models
* on AWS, you have the options to run distributed training by
  + sagemaker notebook for prototyping
  + deep learning VM (Amazon Machine Image, AMI)
  + Pytorch estimator (sagemaker PyTorch Estimator)
    + SageMaker Python SDK with its own PyTorch estimators and models API
    + estimator API uses PyTorch open-source container to run the training
    + choose the backends as communication libraries
      + use Gloo for distributed CPU training
      + use NCCL for distributed GPU training
      + consult the fine print in the docs for anything beyond the plain vanilla setup
    + use torch.nn.parallel.DistributedDataParallel  
* On Azure, you can use
  + Azure notebook for prototyping
  + deep learning VM
  + estimator
* on GCP
  + notebook for prototyping (datalab)
  + deep learning VM
  + no estimator support


#### Use Sagemaker distributed framework
* go to the Amazon SageMaker service page and create a notebook instance under the Notebook item
* [Sagemaker tutorial on PyTorch MNIST data](https://github.com/aws/amazon-sagemaker-examples/blob/main/sagemaker-python-sdk/pytorch_mnist/pytorch_mnist.ipynb)
* Sagemaker notebooks of [tutorials on github](https://github.com/awslabs/amazon-sagemaker-examples)

##### Create mnist.py file
* to run training on distributed cluster, create a python file, mnist.py
  + the imported torch.distributed module provides communication primitives for multiprocess parallelism across nodes
  + set the logger at debug level
  + create a NN model as a CNN model that operates on MNIST handwritten digit images
  + use `torch.utils.data.distributed.DistributedSampler(dataset) if is_distributed else None` to load a subset of the dataset for training in a distributed manner using torch.nn.parallel.DistributedDataParallel
  + `torch.utils.data.DataLoader` object combines a dataset and a sampler to provide single-process or multi-process iterators over the data so that you can load the data in batches for training and testing
    + note that we only set the distributed sampler for the training dataset loader, not for the test dataset loader, since we don't need the cluster for prediction
  + `_average_gradients(model)` averages the gradients from models running on different machines. It is only used when training on multiple machines that have CPUs but no GPUs. The averaged gradients are used to update the model parameters. This is not needed in the multi-machine GPU case
  + in `train(args)` function
    + if we use a GPU, we set a single data-loading worker with pinned memory
    `kwargs = {'num_workers': 1, 'pin_memory': True} if use_cuda else {}`
      + `pin_memory` is an optimization that speeds up host-to-GPU data transfer, allowing faster training
    + `device = torch.device("cuda" if use_cuda else "cpu")`
      + the `device` variable holds the default CUDA device if `use_cuda` is True
      + `use_cuda` is defined by `args.num_gpus > 0`
    + set `world_size` to the number of hosts in `args.hosts`, and set `os.environ['WORLD_SIZE'] = str(world_size)` so that all processes on the machine can access this variable
    + get `host_rank` from `args.hosts.index(args.current_host)`, and store the rank in an environment variable
    + initialize the distributed environment on each of the machines by
    `dist.init_process_group(backend=args.backend, rank=host_rank, world_size=world_size)`
    + set the random seed during initialization with `torch.manual_seed(args.seed)`
    + if `use_cuda` is True, also seed CUDA with `torch.cuda.manual_seed(args.seed)`
    + set up loaders for training and testing. The batch size and data directory are set from the command line
    + log a debug message about the data loaders with `logger.debug`
    + load the model onto the device with `model = Net().to(device)`
    + if a distributed cluster is used, we use `torch.nn.parallel.DistributedDataParallel(model)`, a wrapper around any PyTorch model that provides synchronous distributed training across multiple machines
      + training data is split along the batch dimension, and a replica of the model is placed on each machine and each device
      + otherwise, we use `model = torch.nn.DataParallel(model)` to split input processing across multiple GPUs on a single machine
    + define optimizer, and traverse the epochs  
  
  + in `test(model, test_loader, device)` function
    + call `model.eval()`
    + initialize `test_loss = 0` and `correct = 0`
    + run predictions inside the `with torch.inference_mode():` context manager and collect performance metrics
    
  + define `def model_fn(model_dir)`, which SageMaker PyTorch model server expects. Sagemaker will use this function to load the PyTorch model to serve for predictions of test data
    + find the device and model 
    + load model parameters using `model.load_state_dict(torch.load(f))` where f is the location of model.pth
    + return model.to(device)
  + define `save_model(model, model_dir)` to save the model state to the model path
    + remember to move the model to the CPU before storing the `state_dict`
    + `torch.save(model.cpu().state_dict(), path)`
    
  + in main (`if __name__ == '__main__'`), define the command-line arguments with `argparse.ArgumentParser()`
  + SageMaker passes a number of environment variables to PyTorch, such as the hosts on which training runs, the current host, the model directory where the model should be stored, the number of GPUs on each machine, and so on
  + call `train(parser.parse_args())` to run training and testing
  + remember to save the mnist.py file so that training and testing can run on SageMaker
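
Several of the pieces described above can be sketched as follows. This is not the tutorial's actual mnist.py: the `Net` placeholder, the `parse_args` and `make_train_loader` helper names, and the local fallback values are assumptions. The `SM_HOSTS`, `SM_CURRENT_HOST`, `SM_MODEL_DIR` and `SM_NUM_GPUS` variables are the ones SageMaker sets inside the training container.

```python
import argparse
import json
import os
import tempfile

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler


class Net(nn.Module):
    """Placeholder for the CNN defined in mnist.py."""

    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(784, 10)

    def forward(self, x):
        return self.fc(x.flatten(1))


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--backend", type=str, default="gloo")
    # SageMaker exposes cluster details through SM_* environment variables;
    # the fallbacks (an assumption) let the script run on one local machine.
    parser.add_argument("--hosts", type=json.loads,
                        default=json.loads(os.environ.get("SM_HOSTS", '["localhost"]')))
    parser.add_argument("--current-host", type=str,
                        default=os.environ.get("SM_CURRENT_HOST", "localhost"))
    parser.add_argument("--model-dir", type=str,
                        default=os.environ.get("SM_MODEL_DIR", tempfile.gettempdir()))
    parser.add_argument("--num-gpus", type=int,
                        default=int(os.environ.get("SM_NUM_GPUS", 0)))
    return parser.parse_args(argv)


def make_train_loader(dataset, batch_size, world_size, rank):
    # Each host draws its own shard of the dataset when world_size > 1.
    sampler = (DistributedSampler(dataset, num_replicas=world_size, rank=rank)
               if world_size > 1 else None)
    return DataLoader(dataset, batch_size=batch_size,
                      shuffle=sampler is None, sampler=sampler)


def save_model(model, model_dir):
    path = os.path.join(model_dir, "model.pth")
    torch.save(model.cpu().state_dict(), path)  # store CPU tensors
    return path


def model_fn(model_dir):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = Net()
    with open(os.path.join(model_dir, "model.pth"), "rb") as f:
        model.load_state_dict(torch.load(f, map_location=device))
    return model.to(device)
```

Passing `num_replicas` and `rank` explicitly lets the sampler be built without an initialized process group, which keeps the helper easy to test locally.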

##### run SageMaker notebook
* create a new SageMaker notebook using the conda Python 3.6 environment from the dropdown list
* in notebook, import sagemaker and initialize a sagemaker session and a default s3 bucket
  ```python
  import sagemaker

  sagemaker_session = sagemaker.Session()

  # if you use another bucket, make sure it is
  # in the same region as the SageMaker notebook
  bucket = sagemaker_session.default_bucket()

  # S3 key prefix for the uploaded data (placeholder value; choose your own)
  prefix = "sagemaker/pytorch-mnist"

  # get the role used to access the bucket
  role = sagemaker.get_execution_role()

  # upload the previously downloaded data in ./data to the S3 bucket
  inputs = sagemaker_session.upload_data(path="data", bucket=bucket, key_prefix=prefix)
  ```
* create an estimator as a sagemaker.pytorch.PyTorch object
  + PyTorch estimator points to the training script and defines how it should be executed
  
``` python
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="mnist.py",
    role=role,
    py_version="py38",
    framework_version="1.11.0",
    instance_count=2, # use two instances of c5.2xlarge
    instance_type="ml.c5.2xlarge",
    hyperparameters={"epochs": 1, "backend": "gloo"},
)
```

* train model for only 1 epoch
  + fully trained model will be saved in the folder created by sagemaker in s3 bucket, in sub-folder of output
  + the file model.tar.gz contains the model parameters
  + in sagemaker dashboard, click on "Training jobs", you will see all the training jobs performed
``` python
estimator.fit({"training": inputs})
```
  
* for prediction
  + deploy the model for prediction
```python
predictor = estimator.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')
```
  + click "Endpoint configurations" link in sagemaker dashboard, you will see the endpoint of your deployment.
    + click the corresponding endpoint link to see the endpoint config details
    
  + in sagemaker, extract one observation from data and use the predictor
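  ```python
  # data is assumed to be an iterable of (features, target) batches, e.g. a DataLoader
  test_features, test_target = next(iter(data))
  response = predictor.predict(test_features)
  # predicted class of the first observation
  response.argmax(axis=1)[0]
  ```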
  

### Deploy PyTorch models to production
* deploy models for prediction using a flask web application
* make models available using a Clipper cluster
  + low latency serving framework which runs Docker containers to host models and a serving front-end
* options to deploy models via http endpoint
  + flask
  + clipper
  + serverless compute
    + AWS lambda
    + Google Cloud Functions
    + Azure functions
* for Flask hosting, an HTTP request is sent to nginx, which acts as a reverse proxy in front of Gunicorn; Gunicorn runs the Flask app
* deploy model to flask
  + upload the `state_dict` dictionary to a GCP bucket and make the bucket public, and copy the url of the bucket file to use in flask app
  + install and import PyTorch modules such as `torch.nn` and `torch.nn.functional`
  + get parameters from bucket
  + define the model class and instantiate the model
  + load parameters by `model.load_state_dict`
  + define the predict function to call model's prediction and return the results
    + the data for prediction is obtained by `data = request.get_json()`
    
* deploy to clipper cluster
  + install clipper_admin package
  + import clipper modules
  ```python
  from clipper_admin import ClipperConnection, DockerContainerManager

  clipper_conn = ClipperConnection(DockerContainerManager())
  clipper_conn.start_clipper()
  clipper_addr = clipper_conn.get_query_addr()
  ```

  + a clipper cluster has 3 docker containers running
    + a query front-end
      + listens for incoming requests and routes them to models deployed on clipper
    + the management-frontend
      + manages clipper's config state and tracks deployed models and registered endpoints
    + a redis instance
      + persistence storage for clipper's internal state
  + define a pytorch model
  + register the app with app name and default output    
  + define function `predict_torch_model(model, data)` to return prediction
  + deploy the model with `clipper_admin.deployers.pytorch`, specifying the clipper connection, name, input type, the prediction function defined previously as `func`, the PyTorch model as the model, and a batch size of 1
  + link model to app to link the deployed model to registered app
  `clipper_conn.link_model_to_app(app_name, model_name)`
  
  + send request using clipper address
  + stop cluster by `clipper_conn.stop_all()` 
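
The Flask steps listed earlier in this section can be sketched as below. This is a minimal illustration under stated assumptions: the `Net` placeholder, the `/predict` route, and the `{"features": [[...]]}` payload shape are hypothetical, and loading the parameters from the bucket is left as a comment.

```python
import torch
import torch.nn as nn
from flask import Flask, jsonify, request


class Net(nn.Module):
    """Placeholder model; use the same class that produced the state_dict."""

    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 3)

    def forward(self, x):
        return self.fc(x)


model = Net()
# In a real app, load the downloaded parameters here, for example:
# model.load_state_dict(torch.load("model.pth", map_location="cpu"))
model.eval()

app = Flask(__name__)


@app.route("/predict", methods=["POST"])
def predict():
    data = request.get_json()
    # Assumed payload shape: {"features": [[...], [...]]}
    features = torch.tensor(data["features"], dtype=torch.float32)
    with torch.no_grad():
        predicted = model(features).argmax(dim=1)
    return jsonify({"prediction": predicted.tolist()})
```

The app object can then be served by Gunicorn behind nginx, or exercised locally with Flask's built-in test client.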