# Using PyTorch Lightning with Tune

(tune-pytorch-lightning-ref)=

PyTorch Lightning is a framework which brings structure into training PyTorch models. It
aims to avoid boilerplate code, so you don't have to write the same training
loops all over again when building a new model.

```{image} /images/pytorch_lightning_full.png
:align: center
```

The main abstraction of PyTorch Lightning is the `LightningModule` class, which
should be extended by your application. There is [a great post on how to transfer your models from vanilla PyTorch to Lightning](https://towardsdatascience.com/from-pytorch-to-pytorch-lightning-a-gentle-introduction-b371b7caaf09).

The class structure of PyTorch Lightning makes it very easy to define and tune model
parameters. This tutorial shows you how to use Tune with the AIR {class}`LightningTrainer <ray.train.lightning.LightningTrainer>` to find the best set of
parameters for your application, using an MNIST classifier as the example. Notably,
the `LightningModule` does not have to be altered at all for this, so you can
use it plug and play with your existing models, assuming their parameters are configurable!

:::{note}
If you don't want to use the AIR {class}`LightningTrainer <ray.train.lightning.LightningTrainer>` and prefer a vanilla PyTorch Lightning trainer with a function trainable, refer to {ref}`Using vanilla PyTorch Lightning with Tune <tune-vanilla-pytorch-lightning-ref>`.

:::

:::{note}
To run this example, you will need to install the following:

```bash
$ pip install "ray[tune]" torch torchvision pytorch-lightning
```
:::

```{contents}
:backlinks: none
:local: true
```

## PyTorch Lightning classifier for MNIST

Let's start with the basic PyTorch Lightning implementation of an MNIST classifier.
This classifier does not include any tuning code at this point.

First, we run some imports:

In [1]:
import os
import torch
import pytorch_lightning as pl
import torch.nn.functional as F
from filelock import FileLock
from torchmetrics import Accuracy
from torch.utils.data import DataLoader, random_split
from torchvision.datasets import MNIST
from torchvision import transforms

import ray
import ray.tune as tune
from ray.air.config import CheckpointConfig, ScalingConfig
from ray.train.lightning import LightningTrainer, LightningConfigBuilder
from ray.tune.schedulers import PopulationBasedTraining




In [2]:
# To run the full example, set SMOKE_TEST to False
SMOKE_TEST = True

Our example builds on the MNIST example from the [blog post](https://towardsdatascience.com/from-pytorch-to-pytorch-lightning-a-gentle-introduction-b371b7caaf09) we mentioned before. We adapted the original model and dataset definitions into `MNISTClassifier` and `MNISTDataModule`. 

In [3]:
class MNISTClassifier(pl.LightningModule):
    def __init__(self, config):
        super(MNISTClassifier, self).__init__()
        self.accuracy = Accuracy()
        self.layer_1_size = config["layer_1_size"]
        self.layer_2_size = config["layer_2_size"]
        self.lr = config["lr"]

        # MNIST images have shape (1, 28, 28): (channels, height, width)
        self.layer_1 = torch.nn.Linear(28 * 28, self.layer_1_size)
        self.layer_2 = torch.nn.Linear(self.layer_1_size, self.layer_2_size)
        self.layer_3 = torch.nn.Linear(self.layer_2_size, 10)

    def cross_entropy_loss(self, logits, labels):
        return F.nll_loss(logits, labels)

    def forward(self, x):
        batch_size, channels, height, width = x.size()
        x = x.view(batch_size, -1)

        x = self.layer_1(x)
        x = torch.relu(x)

        x = self.layer_2(x)
        x = torch.relu(x)

        x = self.layer_3(x)
        x = torch.log_softmax(x, dim=1)

        return x

    def training_step(self, train_batch, batch_idx):
        x, y = train_batch
        logits = self.forward(x)
        loss = self.cross_entropy_loss(logits, y)
        accuracy = self.accuracy(logits, y)

        self.log("ptl/train_loss", loss)
        self.log("ptl/train_accuracy", accuracy)
        return loss

    def validation_step(self, val_batch, batch_idx):
        x, y = val_batch
        logits = self.forward(x)
        loss = self.cross_entropy_loss(logits, y)
        accuracy = self.accuracy(logits, y)
        return {"val_loss": loss, "val_accuracy": accuracy}

    def validation_epoch_end(self, outputs):
        avg_loss = torch.stack([x["val_loss"] for x in outputs]).mean()
        avg_acc = torch.stack([x["val_accuracy"] for x in outputs]).mean()
        self.log("ptl/val_loss", avg_loss)
        self.log("ptl/val_accuracy", avg_acc)

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=self.lr)
        return optimizer


class MNISTDataModule(pl.LightningDataModule):
    def __init__(self, batch_size=128):
        super().__init__()
        self.data_dir = os.getcwd()
        self.batch_size = batch_size
        self.transform = transforms.Compose(
            [transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))]
        )

    def setup(self, stage=None):
        with FileLock(f"{self.data_dir}.lock"):
            mnist = MNIST(
                self.data_dir, train=True, download=True, transform=self.transform
            )
            self.mnist_train, self.mnist_val = random_split(mnist, [55000, 5000])

            self.mnist_test = MNIST(
                self.data_dir, train=False, download=True, transform=self.transform
            )

    def train_dataloader(self):
        return DataLoader(self.mnist_train, batch_size=self.batch_size, num_workers=4)

    def val_dataloader(self):
        return DataLoader(self.mnist_val, batch_size=self.batch_size, num_workers=4)

    def test_dataloader(self):
        return DataLoader(self.mnist_test, batch_size=self.batch_size, num_workers=4)
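
The `forward` method above returns log-probabilities (`log_softmax`), which is why the loss is computed with `F.nll_loss`: the two steps together give the usual cross-entropy. The following is a minimal pure-Python sketch of that relationship for a single example. The logits are made up, and the helper functions are illustrative stand-ins, not the PyTorch implementations.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax for one list of logits."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_sum_exp for v in logits]


def nll_loss(log_probs, label):
    """Negative log-likelihood of the true class."""
    return -log_probs[label]


def cross_entropy(logits, label):
    """Cross-entropy computed directly from raw logits."""
    probs = [math.exp(v) for v in logits]
    return -math.log(probs[label] / sum(probs))


# Hypothetical logits for a 3-class problem; the true class is index 0.
logits = [2.0, 0.5, -1.0]
label = 0

loss_two_step = nll_loss(log_softmax(logits), label)
loss_direct = cross_entropy(logits, label)
print(loss_two_step, loss_direct)
```

Both values agree up to floating-point rounding, so splitting the computation into `log_softmax` in `forward` and `nll_loss` in the loss function does not change the training objective.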


In [4]:
default_config = {
    "layer_1_size": 128,
    "layer_2_size": 256,
    "lr": 1e-3,
}


## Tuning the model parameters

The parameters above should already give you a good accuracy of over 90%. However, we might improve on this simply by changing some of the hyperparameters. For instance, we might get an even higher accuracy with a smaller learning rate and a larger middle layer size.

Instead of manually looping through all the parameter combinations, let's use Tune to systematically try out parameter combinations and find the best-performing set.

First, we need some additional imports:

In [5]:
from pytorch_lightning.loggers import TensorBoardLogger
from ray import air, tune
from ray.air import session
from ray.air.config import RunConfig, ScalingConfig, CheckpointConfig
from ray.tune import CLIReporter
from ray.tune.schedulers import ASHAScheduler, PopulationBasedTraining


### Configuring the search space

Now we configure the parameter search space using {class}`LightningConfigBuilder <ray.train.lightning.LightningConfigBuilder>`. We would like to choose between three different sizes for each of the two hidden layers; the batch size stays fixed at 128. The learning rate should be sampled between `0.0001` and `0.1`. The `tune.loguniform()` function samples uniformly in log space, which makes sampling across these different orders of magnitude easier; in particular, small values are sampled as often as large ones.
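
To see why a log-uniform distribution suits a learning rate range that spans several orders of magnitude, here is a minimal sketch that uses only the Python standard library. The `sample_loguniform` helper is a hypothetical stand-in for what a log-uniform sampler does conceptually; it is not Tune's implementation.

```python
import math
import random


def sample_loguniform(low, high, rng):
    """Draw one value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)  # fixed seed so the sketch is reproducible
n = 10_000

log_samples = [sample_loguniform(1e-4, 1e-1, rng) for _ in range(n)]
lin_samples = [rng.uniform(1e-4, 1e-1) for _ in range(n)]

# Fraction of draws that land in the smallest decade, [1e-4, 1e-3).
frac_log = sum(s < 1e-3 for s in log_samples) / n
frac_lin = sum(s < 1e-3 for s in lin_samples) / n
print(f"log-uniform: {frac_log:.2%} of samples below 1e-3")
print(f"uniform:     {frac_lin:.2%} of samples below 1e-3")
```

With log-uniform sampling, each of the three decades between `1e-4` and `1e-1` receives roughly a third of the draws, whereas plain uniform sampling almost never produces a learning rate below `1e-3`.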

:::{note}
In `LightningTrainer`, the frequency of metric reporting is the same as the frequency of checkpointing. For example, if you set `builder.checkpointing(..., every_n_epochs=2)`, then every 2 epochs all the latest metrics are reported to the Ray Tune session along with the latest checkpoint. Make sure the target metrics (e.g. the metrics specified in `TuneConfig`, schedulers, and searchers) are logged before a checkpoint is saved.

:::


:::{note}
Use `LightningConfigBuilder.checkpointing()` to specify the monitor metric and checkpoint frequency for the Lightning `ModelCheckpoint` callback. To properly save AIR checkpoints, you must also provide an AIR {class}`CheckpointConfig <ray.air.config.CheckpointConfig>`. Otherwise, `LightningTrainer` creates a default `CheckpointConfig`, which saves all the reported checkpoints.

:::

In [None]:
# The maximum training epochs
num_epochs = 5

# Number of samples from the parameter space
num_samples = 10

accelerator = "gpu"

config = {
    "layer_1_size": tune.choice([32, 64, 128]),
    "layer_2_size": tune.choice([64, 128, 256]),
    "lr": tune.loguniform(1e-4, 1e-1),
}

In [7]:
if SMOKE_TEST:
    num_epochs = 10
    num_samples = 10
    accelerator = "cpu"

In [8]:
dm = MNISTDataModule(batch_size=128)

lightning_config = (
    LightningConfigBuilder()
    .module(cls=MNISTClassifier, config=config)
    .trainer(max_epochs=num_epochs, accelerator=accelerator)
    .fit_params(datamodule=dm)
    .checkpointing(monitor="ptl/val_accuracy", save_top_k=2, mode="max")
    .build()
)

# Make sure to also define an AIR CheckpointConfig here
# to properly save checkpoints in AIR format.
run_config = RunConfig(
    checkpoint_config=CheckpointConfig(
        num_to_keep=2,
        checkpoint_score_attribute="ptl/val_accuracy",
        checkpoint_score_order="max",
    ),
)
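
The `CheckpointConfig` above asks for the two best checkpoints by `ptl/val_accuracy`. Conceptually, score-based retention behaves like the sketch below. This only illustrates the idea and is not Ray's implementation; the `keep_top_k` helper, the accuracies, and the checkpoint names are all hypothetical.

```python
import heapq


def keep_top_k(reported, k, mode="max"):
    """Return the k (score, name) pairs with the best score, best first."""
    if mode == "max":
        return heapq.nlargest(k, reported)
    return heapq.nsmallest(k, reported)


# Hypothetical (ptl/val_accuracy, checkpoint name) pairs, one per epoch.
reported = [
    (0.951, "epoch_0"),
    (0.968, "epoch_1"),
    (0.962, "epoch_2"),
    (0.977, "epoch_3"),
    (0.973, "epoch_4"),
]

kept = keep_top_k(reported, k=2, mode="max")
print([name for _, name in kept])
```

Here `k` plays the role of `num_to_keep`, and `mode` plays the role of `checkpoint_score_order`.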

### Selecting a scheduler

In this example, we use an [Asynchronous Hyperband](https://blog.ml.cmu.edu/2018/12/12/massively-parallel-hyperparameter-optimization/)
scheduler. This scheduler decides at each iteration which trials are likely to perform
badly, and stops these trials. This way we don't waste any resources on bad hyperparameter
configurations.

In [9]:
scheduler = ASHAScheduler(max_t=num_epochs, grace_period=1, reduction_factor=2)
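
To build intuition for `grace_period` and `reduction_factor`, the sketch below computes the iterations ("rungs") at which trials are compared and shows one simplified, synchronous halving step. ASHA itself makes these decisions asynchronously as results arrive, so treat this as an illustration of successive halving under simplifying assumptions, not as Tune's algorithm. The function names and scores are hypothetical.

```python
def rung_milestones(max_t, grace_period, reduction_factor):
    """Iterations at which trials are compared: grace_period * factor**k."""
    milestones = []
    t = grace_period
    while t <= max_t:
        milestones.append(t)
        t *= reduction_factor
    return milestones


def survivors(scores, reduction_factor):
    """Keep the top 1/reduction_factor of trials at a rung (higher is better)."""
    n_keep = max(1, len(scores) // reduction_factor)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:n_keep]


print(rung_milestones(max_t=5, grace_period=1, reduction_factor=2))

# Hypothetical validation accuracies of four trials at the first rung.
scores = {"trial_a": 0.91, "trial_b": 0.87, "trial_c": 0.95, "trial_d": 0.77}
print(survivors(scores, reduction_factor=2))
```

A larger `reduction_factor` prunes more aggressively at each rung, while a larger `grace_period` gives every trial more iterations before the first comparison.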


### Training with GPUs

We can specify the number of resources, including GPUs, that Tune should request for each trial.

`LightningTrainer` takes care of the environment setup for Distributed Data Parallel training, so the model and data are automatically distributed across GPUs. You only need to set the number of GPUs per worker in `ScalingConfig` and set `accelerator="gpu"` in `LightningConfigBuilder`.

In [10]:
scaling_config = ScalingConfig(
    num_workers=3, use_gpu=True, resources_per_worker={"CPU": 1, "GPU": 1}
)


In [11]:
if SMOKE_TEST:
    scaling_config = ScalingConfig(
        num_workers=3, use_gpu=False, resources_per_worker={"CPU": 1}
    )
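
Because every trial launches `num_workers` training workers, the resources requested per trial add up quickly. The arithmetic below is a rough sketch of how many trials could run at the same time. The cluster size is hypothetical, and the extra CPU for the coordinating trainer process is an assumption that may differ in your Ray version or configuration.

```python
def trial_resources(num_workers, resources_per_worker, trainer_resources=None):
    """Sum the resources one trial requests from the cluster."""
    total = dict(trainer_resources or {})
    for name, amount in resources_per_worker.items():
        total[name] = total.get(name, 0) + num_workers * amount
    return total


def max_concurrent_trials(cluster, per_trial):
    """How many trials fit on the cluster at the same time."""
    return min(
        int(cluster.get(name, 0) // amount)
        for name, amount in per_trial.items()
        if amount > 0
    )


per_trial = trial_resources(
    num_workers=3,
    resources_per_worker={"CPU": 1, "GPU": 1},
    trainer_resources={"CPU": 1},  # assumed overhead for the coordinating process
)
cluster = {"CPU": 16, "GPU": 8}  # hypothetical cluster size
print(per_trial, max_concurrent_trials(cluster, per_trial))
```

If fewer trials fit than `num_samples`, the remaining trials wait until resources are released.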


In [12]:
# Define a base LightningTrainer without hyper-parameters for Tuner
lightning_trainer = LightningTrainer(
    scaling_config=scaling_config,
    run_config=run_config,
)


### Putting it together

Lastly, we need to create a `Tuner()` object and start Ray Tune with `tuner.fit()`.

The full code looks like this:

In [13]:
def tune_mnist_asha(num_samples=10):
    scheduler = ASHAScheduler(max_t=num_epochs, grace_period=1, reduction_factor=2)

    tuner = tune.Tuner(
        lightning_trainer,
        param_space={"lightning_config": lightning_config},
        tune_config=tune.TuneConfig(
            metric="ptl/val_accuracy",
            mode="max",
            num_samples=num_samples,
            scheduler=scheduler,
        ),
        run_config=air.RunConfig(
            name="tune_mnist_asha",
        ),
    )
    results = tuner.fit()
    best_result = results.get_best_result(metric="ptl/val_accuracy", mode="max")
    return best_result


In the example above, Tune runs 10 trials with different hyperparameter configurations.
An example output could look like so:

```{code-block} bash
:emphasize-lines: 11

  +------------------------------+------------+-------------------+----------------+----------------+-------------+----------+-----------------+----------------------+
  | Trial name                   | status     | loc               |   layer_1_size |   layer_2_size |          lr |     loss |   mean_accuracy |   training_iteration |
  |------------------------------+------------+-------------------+----------------+----------------+-------------+----------+-----------------+----------------------|
  | LightningTrainer_9532b_00001 | TERMINATED |  10.0.37.7:448989 |            32  |            64  | 0.00025324  | 0.58146  |       0.866667  |                   1  |
  | LightningTrainer_9532b_00002 | TERMINATED |  10.0.37.7:449722 |            128 |            128 | 0.000166782 | 0.29038  |       0.933333  |                   2  |
  | LightningTrainer_9532b_00003 | TERMINATED |  10.0.37.7:453404 |            64  |            128 | 0.0004948   | 0.15375  |       0.9       |                   4  |
  | LightningTrainer_9532b_00004 | TERMINATED |  10.0.37.7:457981 |            128 |            128 | 0.000304361 | 0.17622  |       0.966667  |                   4  |
  | LightningTrainer_9532b_00005 | TERMINATED |  10.0.37.7:467478 |            128 |            64  | 0.0344561   | 0.34665  |       0.866667  |                   1  |
  | LightningTrainer_9532b_00006 | TERMINATED |  10.0.37.7:484401 |            128 |            256 | 0.0262851   | 0.34981  |       0.866667  |                   1  |
  | LightningTrainer_9532b_00007 | TERMINATED |  10.0.37.7:490670 |            32  |            128 | 0.0550712   | 0.62575  |       0.766667  |                   1  |
  | LightningTrainer_9532b_00008 | TERMINATED |  10.0.37.7:491159 |            32  |            64  | 0.000489046 | 0.27384  |       0.966667  |                   2  |
  | LightningTrainer_9532b_00009 | TERMINATED |  10.0.37.7:491494 |            64  |            256 | 0.000395127 | 0.09642  |       0.933333  |                   8  |
  +------------------------------+------------+-------------------+----------------+----------------+-------------+----------+-----------------+----------------------+
```

As you can see in the `training_iteration` column, trials with a high loss
(and low accuracy) have been terminated early. The best performing trial used
`layer_1_size=32`, `layer_2_size=64`, and `lr=0.000489046`.

## Using Population Based Training to find the best parameters

The `ASHAScheduler` terminates trials early if they perform badly.
Sometimes this stops trials that would improve with more training steps
and might eventually outperform other configurations.

Another popular method for hyperparameter tuning, called
[Population Based Training](https://deepmind.com/blog/article/population-based-training-neural-networks),
instead perturbs hyperparameters during the training run. Tune implements PBT, and
we only need to make some slight adjustments to our code.
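
The core of PBT's "explore" step for a continuous hyperparameter can be sketched in a few lines: a poorly performing trial first copies the state and configuration of a well-performing one, and the copied value is then either resampled from the search range or multiplied by a perturbation factor. The defaults used below (a resample probability of `0.25` and factors of `1.2` and `0.8`) reflect what are believed to be Tune's defaults, but treat them as assumptions. The `explore` helper is illustrative and not Tune's implementation.

```python
import math
import random


def explore(value, low, high, rng, resample_probability=0.25, factors=(1.2, 0.8)):
    """Perturb one continuous hyperparameter, PBT style.

    Returns the new value and a label describing what happened.
    """
    if rng.random() < resample_probability:
        # Draw a fresh value log-uniformly from the original search range.
        new_value = math.exp(rng.uniform(math.log(low), math.log(high)))
        return new_value, "resample"
    factor = rng.choice(factors)
    return value * factor, f"* {factor}"


rng = random.Random(0)  # fixed seed for reproducibility
lr = 0.005  # hypothetical learning rate copied from a well-performing trial

for _ in range(5):
    new_lr, how = explore(lr, 1e-4, 1e-1, rng)
    print(f"lr: {lr:.6f} --- ({how}) --> {new_lr:.6f}")
```

Only hyperparameters listed in `hyperparam_mutations` are perturbed this way. Structural parameters such as the layer sizes are not listed there because they cannot change without rebuilding the model.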

In [14]:
def tune_mnist_pbt(num_samples=10):
    # The range of hyperparameter perturbation.
    mutations_config = (
        LightningConfigBuilder()
        .module(
            config={
                "lr": tune.loguniform(1e-4, 1e-1),
            }
        )
        .build()
    )

    # Create a PBT scheduler
    scheduler = PopulationBasedTraining(
        perturbation_interval=1,
        time_attr="training_iteration",
        hyperparam_mutations={
            "lightning_config": mutations_config
        }
    )

    tuner = tune.Tuner(
        lightning_trainer,
        param_space={"lightning_config": lightning_config},
        tune_config=tune.TuneConfig(
            metric="ptl/val_accuracy",
            mode="max",
            num_samples=num_samples,
            scheduler=scheduler,
        ),
        run_config=air.RunConfig(
            name="tune_mnist_pbt",
        ),
    )
    results = tuner.fit()
    best_result = results.get_best_result(metric="ptl/val_accuracy", mode="max")
    return best_result
    

In [None]:
# tune_mnist_asha(num_samples=num_samples)
tune_mnist_pbt(num_samples=num_samples)


Current time: 2023-03-29 19:47:38. Running for: 00:06:03.46. Memory: 16.1/62.0 GiB.

| Trial name | status | loc | `layer_1_size` | `layer_2_size` | `lr` | iter | total time (s) | ptl/train_loss | ptl/train_accuracy | ptl/val_loss |
|---|---|---|---|---|---|---|---|---|---|---|
| LightningTrainer_6331d_00002 | RUNNING | 10.0.37.7:1249072 | 128 | 64 | 0.00115867 | 4 | 75.1855 | 0.073703 | 0.966667 | 0.0885324 |
| LightningTrainer_6331d_00003 | RUNNING | 10.0.37.7:1263175 | 128 | 128 | 0.00226503 | 4 | 71.8001 | 0.150485 | 0.933333 | 0.0605363 |
| LightningTrainer_6331d_00004 | RUNNING | 10.0.37.7:1250292 | 128 | 128 | 0.00712747 | 4 | 71.3569 | 0.12994 | 0.966667 | 0.083519 |
| LightningTrainer_6331d_00005 | RUNNING | 10.0.37.7:1264205 | 128 | 128 | 0.00712747 | 4 | 74.3851 | 0.0211624 | 1.0 | 0.0618065 |
| LightningTrainer_6331d_00000 | PAUSED | 10.0.37.7:1279466 | 128 | 128 | 0.00593956 | 5 | 91.3289 | 0.317815 | 0.966667 | 0.0472228 |
| LightningTrainer_6331d_00001 | PAUSED | 10.0.37.7:1280989 | 128 | 128 | 0.000367715 | 5 | 94.0547 | 0.14838 | 0.966667 | 0.0590947 |
| LightningTrainer_6331d_00006 | PAUSED | 10.0.37.7:1264873 | 128 | 128 | 0.00475165 | 4 | 72.0527 | 0.00528573 | 1.0 | 0.0938069 |
| LightningTrainer_6331d_00007 | PAUSED | 10.0.37.7:1266534 | 128 | 128 | 0.00712747 | 4 | 73.5926 | 0.0962502 | 0.933333 | 0.0701004 |
| LightningTrainer_6331d_00008 | PAUSED | 10.0.37.7:1279249 | 128 | 128 | 0.00188753 | 4 | 72.5808 | 0.131113 | 0.933333 | 0.0690694 |
| LightningTrainer_6331d_00009 | PAUSED | 10.0.37.7:1279305 | 32 | 256 | 0.00343816 | 4 | 78.8291 | 0.209262 | 0.933333 | 0.0791726 |



| Trial name | `_report_on` | date | done | epoch | hostname | `iterations_since_restore` | `node_ip` | pid | ptl/train_accuracy | ptl/train_loss | ptl/val_accuracy | ptl/val_loss | `should_checkpoint` | step | `time_since_restore` | `time_this_iter_s` | `time_total_s` | timestamp | `training_iteration` | `trial_id` |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LightningTrainer_6331d_00000 | train_epoch_end | 2023-03-29_19-47-29 | False | 4 | ip-10-0-37-7 | 1 | 10.0.37.7 | 1279466 | 0.966667 | 0.317815 | 0.985119 | 0.0472228 | True | 720 | 20.5497 | 20.5497 | 91.3289 | 1680144448 | 5 | 6331d_00000 |
| LightningTrainer_6331d_00001 | train_epoch_end | 2023-03-29_19-47-32 | False | 4 | ip-10-0-37-7 | 1 | 10.0.37.7 | 1280989 | 0.966667 | 0.14838 | 0.982143 | 0.0590947 | True | 720 | 21.0163 | 21.0163 | 94.0547 | 1680144451 | 5 | 6331d_00001 |
| LightningTrainer_6331d_00002 | train_epoch_end | 2023-03-29_19-46-22 | False | 3 | ip-10-0-37-7 | 1 | 10.0.37.7 | 1249072 | 0.966667 | 0.073703 | 0.973958 | 0.0885324 | True | 576 | 18.6628 | 18.6628 | 75.1855 | 1680144382 | 4 | 6331d_00002 |
| LightningTrainer_6331d_00003 | train_epoch_end | 2023-03-29_19-46-53 | False | 3 | ip-10-0-37-7 | 1 | 10.0.37.7 | 1263175 | 0.933333 | 0.150485 | 0.981027 | 0.0605363 | True | 576 | 19.236 | 19.236 | 71.8001 | 1680144413 | 4 | 6331d_00003 |
| LightningTrainer_6331d_00004 | train_epoch_end | 2023-03-29_19-46-26 | False | 3 | ip-10-0-37-7 | 1 | 10.0.37.7 | 1250292 | 0.966667 | 0.12994 | 0.975446 | 0.083519 | True | 576 | 21.2308 | 21.2308 | 71.3569 | 1680144386 | 4 | 6331d_00004 |
| LightningTrainer_6331d_00005 | train_epoch_end | 2023-03-29_19-46-56 | False | 3 | ip-10-0-37-7 | 1 | 10.0.37.7 | 1264205 | 1.0 | 0.0211624 | 0.980097 | 0.0618065 | True | 576 | 21.1813 | 21.1813 | 74.3851 | 1680144416 | 4 | 6331d_00005 |
| LightningTrainer_6331d_00006 | train_epoch_end | 2023-03-29_19-46-57 | False | 3 | ip-10-0-37-7 | 1 | 10.0.37.7 | 1264873 | 1.0 | 0.00528573 | 0.975508 | 0.0938069 | True | 576 | 20.9611 | 20.9611 | 72.0527 | 1680144417 | 4 | 6331d_00006 |
| LightningTrainer_6331d_00007 | train_epoch_end | 2023-03-29_19-47-01 | False | 3 | ip-10-0-37-7 | 1 | 10.0.37.7 | 1266534 | 0.933333 | 0.0962502 | 0.977679 | 0.0701004 | True | 576 | 20.5557 | 20.5557 | 73.5926 | 1680144421 | 4 | 6331d_00007 |
| LightningTrainer_6331d_00008 | train_epoch_end | 2023-03-29_19-47-26 | False | 3 | ip-10-0-37-7 | 1 | 10.0.37.7 | 1279249 | 0.933333 | 0.131113 | 0.977679 | 0.0690694 | True | 576 | 20.0167 | 20.0167 | 72.5808 | 1680144446 | 4 | 6331d_00008 |
| LightningTrainer_6331d_00009 | train_epoch_end | 2023-03-29_19-47-28 | False | 3 | ip-10-0-37-7 | 1 | 10.0.37.7 | 1279305 | 0.933333 | 0.209262 | 0.976191 | 0.0791726 | True | 576 | 21.1968 | 21.1968 | 78.8291 | 1680144448 | 4 | 6331d_00009 |


Hyperparameter perturbations logged by `[PopulationBasedTraining] [Explore]` during the run (the `lr` entry under `lightning_config` / `_module_init_config` / `config`):

| Time | Trial | Previous `lr` | Operation | New `lr` |
|---|---|---|---|---|
| 2023-03-29 19:42:45 | 6331d_00005 | 0.005939557174932708 | `* 1.2` | 0.007127468609919249 |
| 2023-03-29 19:43:04 | 6331d_00007 | 0.005939557174932708 | `* 1.2` | 0.007127468609919249 |
| 2023-03-29 19:44:16 | 6331d_00006 | 0.005939557174932708 | `* 0.8` | 0.0047516457399461665 |
| 2023-03-29 19:44:26 | 6331d_00008 | 0.005939557174932708 | resample | 0.0018875272358204525 |
| 2023-03-29 19:45:19 | 6331d_00004 | 0.005939557174932708 | `* 1.2` | 0.007127468609919249 |
| 2023-03-29 19:46:01 | 6331d_00001 | 0.007127468609919249 | resample | 0.0003677149260722046 |
| 2023-03-29 19:46:24 | 6331d_00003 | 0.0018875272358204525 | `* 1.2` | 0.0022650326829845428 |

If you have more resources available (e.g. a GPU), you can modify the above parameters accordingly.

An example output of a run could look like this:

```{code-block} bash
:emphasize-lines: 10

 +------------------------------+------------+-------+----------------+----------------+-----------+-----------+-----------------+----------------------+
 | Trial name                   | status     | loc   |   layer_1_size |   layer_2_size |        lr |      loss |   mean_accuracy |   training_iteration |
 |------------------------------+------------+-------+----------------+----------------+-----------+-----------+-----------------+----------------------|
 | LightningTrainer_85489_00000 | TERMINATED |       |            128 |            128 | 0.001     | 0.108734  |        0.973101 |                   10 |
 | LightningTrainer_85489_00001 | TERMINATED |       |            128 |            128 | 0.001     | 0.093577  |        0.978639 |                   10 |
 | LightningTrainer_85489_00002 | TERMINATED |       |            128 |            256 | 0.0008    | 0.0922348 |        0.979299 |                   10 |
 | LightningTrainer_85489_00003 | TERMINATED |       |             64 |            256 | 0.001     | 0.124648  |        0.973892 |                   10 |
 | LightningTrainer_85489_00004 | TERMINATED |       |            128 |             64 | 0.001     | 0.101717  |        0.975079 |                   10 |
 | LightningTrainer_85489_00005 | TERMINATED |       |             64 |             64 | 0.001     | 0.121467  |        0.969146 |                   10 |
 | LightningTrainer_85489_00006 | TERMINATED |       |            128 |            256 | 0.00064   | 0.053446  |        0.987062 |                   10 |
 | LightningTrainer_85489_00007 | TERMINATED |       |            128 |            256 | 0.001     | 0.129804  |        0.973497 |                   10 |
 | LightningTrainer_85489_00008 | TERMINATED |       |             64 |            256 | 0.0285125 | 0.363236  |        0.913867 |                   10 |
 | LightningTrainer_85489_00009 | TERMINATED |       |             32 |            256 | 0.001     | 0.150946  |        0.964201 |                   10 |
 +------------------------------+------------+-------+----------------+----------------+-----------+-----------+-----------------+----------------------+
```

As you can see, each sample ran the full 10 iterations.
All trials ended with quite good parameter combinations and showed relatively good performance.
In some runs, the parameters were perturbed, and the best configuration even reached a
mean validation accuracy of `0.987062`!
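
Conceptually, `results.get_best_result(metric=..., mode=...)` picks the trial whose reported metric is best. The sketch below applies the same arg-max / arg-min idea to the rows of the example table above. The `best_row` helper is illustrative and not part of Ray; the real `ResultGrid` also carries checkpoints and the full metric history.

```python
# Rows copied from the example PBT output table above.
rows = [
    {"trial": "LightningTrainer_85489_00000", "loss": 0.108734, "mean_accuracy": 0.973101},
    {"trial": "LightningTrainer_85489_00001", "loss": 0.093577, "mean_accuracy": 0.978639},
    {"trial": "LightningTrainer_85489_00002", "loss": 0.0922348, "mean_accuracy": 0.979299},
    {"trial": "LightningTrainer_85489_00003", "loss": 0.124648, "mean_accuracy": 0.973892},
    {"trial": "LightningTrainer_85489_00004", "loss": 0.101717, "mean_accuracy": 0.975079},
    {"trial": "LightningTrainer_85489_00005", "loss": 0.121467, "mean_accuracy": 0.969146},
    {"trial": "LightningTrainer_85489_00006", "loss": 0.053446, "mean_accuracy": 0.987062},
    {"trial": "LightningTrainer_85489_00007", "loss": 0.129804, "mean_accuracy": 0.973497},
    {"trial": "LightningTrainer_85489_00008", "loss": 0.363236, "mean_accuracy": 0.913867},
    {"trial": "LightningTrainer_85489_00009", "loss": 0.150946, "mean_accuracy": 0.964201},
]


def best_row(rows, metric, mode):
    """Return the row with the highest (mode='max') or lowest (mode='min') metric."""
    chooser = max if mode == "max" else min
    return chooser(rows, key=lambda row: row[metric])


best_by_accuracy = best_row(rows, "mean_accuracy", "max")
best_by_loss = best_row(rows, "loss", "min")
print(best_by_accuracy["trial"], best_by_loss["trial"])
```

In this table the same trial wins on both criteria, but that need not hold in general, so it is worth choosing the metric and mode deliberately.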

In summary, the AIR `LightningTrainer` is easy to use with Tune. It takes only a few additional lines of code to integrate with the Ray `Tuner` and find well-performing parameter configurations.

## More PyTorch Lightning Examples

- {ref}`Use LightningTrainer for Image Classification <lightning_mnist_example>`.
- {doc}`/tune/examples/includes/mnist_ptl_mini`:
  A minimal example of using [PyTorch Lightning](https://github.com/PyTorchLightning/pytorch-lightning)
  to train an MNIST model. This example utilizes the Ray Tune-provided
  {ref}`PyTorch Lightning callbacks <tune-integration-pytorch-lightning>`.
- {doc}`/tune/examples/includes/mlflow_ptl_example`: Example for using [MLflow](https://github.com/mlflow/mlflow/)
  and [PyTorch Lightning](https://github.com/PyTorchLightning/pytorch-lightning) with Ray Tune.