# LEARN Workshop - session 2
_24 March 2023_

___

## Overview of last session

- Overview of DGX hardware.
- Overview of features.
- Justify usage of PyTorch Lightning.
- Docker: how to build an image.

___

## Objectives of today's session
- Explore MLflow and PyTorch Lightning.
- Start implementing individual use cases.

## References
- [PyTorch Lightning 1.9 docs](https://lightning.ai/docs/pytorch/1.9.3/)
- [MLflow docs](https://mlflow.org/docs/latest/index.html)
- [Ray Tune docs](https://docs.ray.io/en/latest/tune/index.html), if you want to perform hyperparameter tuning.

___

- For this session, you need to `docker pull` the image `rbonazzola/learn_workshop:session_2`. 
- Also, run `git pull` from within the LEARN repo's folder to update the repository with today's contents. If you made changes to the notebook that you want to preserve, save a copy of it first ("Save as..."). Then run `git reset --hard` to revert the changes to the original file (note that this discards all uncommitted changes to tracked files), and finally run `git pull`.

___

**Note on version compatibility.**

From the MLflow docs (2023/03/23) we have:

![](figures/mlflow_ptl_compatibility.png)

On the other hand, the following chart gives the range of PyTorch versions that officially work with specific PyTorch Lightning versions.

![](figures/ptl_compatibility_chart.png)

For this session, we will use the latest version of PTL that is officially supported by MLflow (1.9.3), and the latest PyTorch version that is officially supported by that PTL version (1.13).

In [None]:
import torch
import pytorch_lightning as ptl
import mlflow

print(f"Torch version: {torch.__version__}")
print(f"CUDA version: {torch.version.cuda}")
print(f"PyTorch Lightning version: {ptl.__version__}")
print(f"MLflow version: {mlflow.__version__}")
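
The version constraint described above can also be checked in code rather than by eye. The following is a minimal sketch using only the standard library; the helper name `matches_pin` and the example version strings are hypothetical, and only the pins 1.9.3 and 1.13 come from this session.

```python
def matches_pin(installed: str, pinned: str) -> bool:
    """Return True if `installed` starts with the same version components as `pinned`."""
    # Drop local build tags such as "+cu117" before comparing.
    public = installed.split("+")[0]
    installed_parts = public.split(".")
    pinned_parts = pinned.split(".")
    # Compare component by component, so "1.130" does not match the pin "1.13".
    return installed_parts[: len(pinned_parts)] == pinned_parts


print(matches_pin("1.13.1+cu117", "1.13"))  # True
print(matches_pin("2.0.0", "1.13"))         # False
print(matches_pin("1.9.3", "1.9.3"))        # True
```

In the notebook, one could then write, for example, `assert matches_pin(torch.__version__, "1.13")` to fail early when the image does not contain the expected versions.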


### PyTorch Lightning (PTL)

- A lightweight wrapper for PyTorch code.
- Requires a precise organisation of the code.
- Gets rid of boilerplate code.
- **Makes it easier to access hardware capabilities.**

It's built around three key abstractions:
- `ptl.LightningModule`: DL model itself plus specifications on what to do at each stage (training/validation/testing/inference). The optimizer configuration (Adam, SGD, etc.) must be supplied here as well.
- `ptl.LightningDataModule`: the data, plus how to partition it for each stage.
- `ptl.Trainer`: the object that takes the two previous ones and performs the training. Hardware details must be specified through this object.

The skeleton of a `ptl.LightningModule` looks like this:

```
class MyPTLModule(ptl.LightningModule):

  def __init__(self, ...):
      
      super().__init__()
      ...
      
      self.training_step_outputs = []
      self.validation_step_outputs = []
      self.test_step_outputs = []


  def forward(self, x):
      y = ...
      return y

  def training_step(self, batch, batch_idx):
      
      x, y = batch
      y_pred = self(x)
      loss = ...
      loss_dict = { "loss": loss }

      self.training_step_outputs.append(loss_dict)
        
      self.log_dict(loss_dict)      
      return loss_dict
      
  
  def on_train_epoch_end(self):
        
      outputs = self.training_step_outputs

      loss = torch.stack([x["loss"] for x in outputs])
      
      # Do something, e.g.
      avg_loss = loss.mean() 
      
      self.log_dict({
          "training_loss": avg_loss
        },
        on_epoch=True,
        prog_bar=True,
        logger=True,
      )
      self.training_step_outputs.clear()    
  

  # Same for validation and test
  #
  # def {validation|test}_step(self, batch, batch_idx):
  #     ...
  #     ...
  #
  # def on_{validation|test}_epoch_end(self):
  #     ...


  def configure_optimizers(self):
      optimizer = ...
      return optimizer            
    
```

Let's import the `LightningModule` and `LightningDataModule` subclasses from the file `MNIST_lightning.py`, together with the model and the callbacks:

In [None]:
from MNIST_example.MNIST import MNISTClassifier
from MNIST_lightning import CNN_Module, MNIST_DataModule 
from my_ptl_callbacks import early_stopping, model_checkpoint, progress_bar, rich_model_summary

In [None]:
torch.set_float32_matmul_precision('medium') # To enable optimal use of the Tensor Cores of the A100 GPU.

In [None]:
BATCH_SIZE = 512
PRECISION = "16"  # other values to try: 32, 64, "bf16"

datamodule = MNIST_DataModule(batch_size=BATCH_SIZE, split_lengths=[48000, 12000])
ptl_module = CNN_Module(model=MNISTClassifier())

callbacks = [ early_stopping, model_checkpoint, progress_bar, rich_model_summary ]

trainer = ptl.Trainer(
    accelerator='gpu', devices=1,
    precision=PRECISION,
    callbacks=callbacks,
    min_epochs=10
)

In [None]:
trainer.fit(ptl_module, datamodule)

___

## MLflow

MLflow is an open source platform to manage the ML lifecycle, including experimentation, reproducibility, deployment, and a central model registry.

MLflow currently offers **four components**:

- **MLflow Tracking**. Record and query experiments: code, data, config, and results.
- **MLflow Projects**. Package data science code in a format to reproduce runs on any platform.
- **MLflow Models**. Deploy machine learning models in diverse serving environments.
- **Model Registry**. Store, annotate, discover, and manage models in a central repository.

_We will focus on **MLflow Tracking**._

### Glossary

- **Run**: an instance of model training. More concretely, it is a collection of parameters (hyperparameters, network weights, seed, reference to input data), metrics, tags and artifacts.
- **Experiment**: a set of runs. Primary unit of organization of MLflow.
- **Parameters**: key-value parameters, where the value is either numeric or a string.
- **Metrics**: key-value metrics, where the value is numeric. Each metric can be updated throughout the course of the run (for example, to track how your model’s loss function is converging), and MLflow records and lets you visualize the metric’s full history (*from [the docs](https://www.MLflow.org/docs/latest/tracking.html#concepts)*).
- **Artifacts**: Output files in any format. For example, you can record images (like PNGs), trained models, and data files (for example, a csv file) as artifacts (*from [the docs](https://www.MLflow.org/docs/latest/tracking.html#concepts)*).

In [None]:
import mlflow
print(f"{mlflow.__version__}")

### Create experiments and runs

We will create an experiment called `"TEST"` and, within it, a run called `test_run` with one "hyperparameter" `a=1` and one "metric" `b=2`.

In [None]:
uri = "./mlruns"  # location where the runs are stored

In [None]:
from mlflow.exceptions import MlflowException

mlflow.set_tracking_uri(uri)
exp_name = "TEST"

try:
    exp_id = mlflow.create_experiment(exp_name)
except MlflowException:
    # If the experiment already exists, retrieve its ID instead
    exp_id = mlflow.get_experiment_by_name(exp_name).experiment_id


In [None]:
print("Experiment name:", exp_name)
print("Experiment ID:", exp_id)

In [None]:
run_name = "test_run"
with mlflow.start_run(run_name=run_name, experiment_id=exp_id):    
    mlflow.log_param("a", 1)
    mlflow.log_metric("b", 2)

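The GUI can be used to explore the experiments and runs in a web browser. To launch it, run `mlflow ui` on the command line from the directory that contains `mlruns`, then open the address it prints (by default `http://localhost:5000`).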

### Explore experiments with the MLflow Python API

In [None]:
import ipywidgets as widgets

In [None]:
exp_list = {experiment.name: experiment.experiment_id for experiment in mlflow.search_experiments()}
exp_list

In [None]:
exp_w = widgets.SelectMultiple(options=exp_list)
exp_w

In [None]:
exp_w.value

In [None]:
runs_df = mlflow.search_runs(
    experiment_ids=list(exp_w.value)  # SelectMultiple.value is a tuple; search_runs documents a list
)

runs_df

## PyTorch Lightning with MLflow

In [None]:
mlflow.pytorch.autolog(log_models=False)

In [None]:
BATCH_SIZE = 512
PRECISION = "16"

model = MNISTClassifier()
datamodule = MNIST_DataModule(batch_size=BATCH_SIZE, split_lengths=[48000, 12000])
ptl_module = CNN_Module(model)

trainer = ptl.Trainer(
    accelerator="gpu",
    devices=1,
    precision=PRECISION,
    callbacks=callbacks,
    min_epochs=5
)

In [None]:
trainer.fit(ptl_module, datamodule)