<a href="https://colab.research.google.com/github/TirendazAcademy/PyTorch-Lightning-Tutorials/blob/main/Lightning_with_Tensorboard.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### What is PyTorch Lightning?
![figure.png](https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2F8qhjh%2Fbtr5eobWvx3%2FXslpIFC0apO8lmSCUe8VVK%2Fimg.png)
PyTorch Lightning is an open-source Python library that provides a high-level interface for PyTorch. PyTorch alone is sufficient for building many kinds of AI models, but the code can become complex once experiments involve more advanced conditions such as GPUs, TPUs, 16-bit precision, or distributed training. PyTorch Lightning addresses this by abstracting that engineering code away, and it aims to establish a unified coding style rather than being just another framework.

```python
dataset = LightningDataset()
model = LightningModel()
trainer = pl.Trainer(max_epochs=10)
trainer.fit(model=model, datamodule=dataset)
```

This tutorial is heavily inspired by earlier PyTorch Lightning tutorials, including:

* [Pytorch lightning tutorials](https://lightning.ai/docs/pytorch/stable/tutorials.html?utm_source=chatgpt.com)
* [Lightning examples](https://github.com/Lightning-AI/tutorials/tree/main/lightning_examples)
* [Why You Should Use PyTorch Lightning and How to Get Started](https://www.sabrepc.com/blog/Deep-Learning-and-AI/why-use-pytorch-lightning)
* [Beginner guide to pytorch-lightning](https://www.kaggle.com/code/shivanandmn/beginners-guide-to-pytorch-lightning/notebook)

And the documentation:
* [Pytorch lightning - Read the Docs](https://lightning.ai/docs/pytorch/LTS/)

### Import required libraries

In [1]:
import os
import torch
import torch.nn.functional as F
import torchvision.datasets as datasets
import torchvision.transforms as transforms
from torch import nn, optim
from torch.utils.data import DataLoader
from torch.utils.data import random_split
import pytorch_lightning as pl
import torchmetrics
from torchmetrics import Metric
from pytorch_lightning.callbacks import EarlyStopping, Callback
from torchvision.transforms import RandomHorizontalFlip, RandomVerticalFlip
import torchvision
from pytorch_lightning.loggers import TensorBoardLogger
import lightning as L



In [2]:
print("torch version:",torch.__version__)
print("pytorch lightning version:",pl.__version__)

torch version: 2.4.1+cu121
pytorch lightning version: 2.5.0.post0


### Define a LightningModule
A LightningModule enables your PyTorch nn.Module to play together in complex ways inside the training_step (there is also an optional validation_step and test_step).

There are many reserved methods in a LightningModule, called hooks:

- ```configure_optimizers``` - should return the optimizer (for example, Adam or SGD)
- ```training_step``` - training loop body; takes batch and batch_idx as parameters
- ```validation_step``` - validation loop body; takes batch and batch_idx as parameters
- ```test_step``` - test loop body; takes batch and batch_idx as parameters


```python
class LightningModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
    
    def forward(self, x):
        pass
  
    def configure_optimizers(self):
        pass
  
    def loss_fn(self, output, target):
        pass 
  
    def training_step(self, batch, batch_idx):
        pass
  
    def validation_step(self, batch, batch_idx):
        pass
```
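
For intuition, the hooks above can be related to an ordinary PyTorch training loop. The following is a rough sketch in plain PyTorch of the kind of loop that the trainer automates, not Lightning's actual implementation. The toy linear model, the random batches, and the free functions named after the hooks are assumptions made for illustration.

```python
import torch
from torch import nn, optim

torch.manual_seed(0)

# Toy stand-ins (assumptions for illustration): a linear model and random data.
model = nn.Linear(4, 2)
batches = [(torch.randn(8, 4), torch.randint(0, 2, (8,))) for _ in range(5)]


def configure_optimizers():
    # Plays the role of the configure_optimizers hook.
    return optim.SGD(model.parameters(), lr=0.1)


def training_step(batch, batch_idx):
    # Plays the role of the training_step hook: returns the loss for one batch.
    x, y = batch
    return nn.functional.cross_entropy(model(x), y)


optimizer = configure_optimizers()
losses = []
for epoch in range(3):
    for batch_idx, batch in enumerate(batches):
        optimizer.zero_grad()                    # reset gradients
        loss = training_step(batch, batch_idx)   # forward pass + loss
        loss.backward()                          # backward pass
        optimizer.step()                         # parameter update
        losses.append(loss.item())

print(len(losses))  # 3 epochs x 5 batches = 15 steps
```

With a LightningModule, only the two hook bodies need to be written; the surrounding loop, device placement, and logging are handled by the trainer.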

In [3]:
import torch
import torch.nn as nn
import torch.optim as optim
import pytorch_lightning as pl

class MLP(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(1 * 28 * 28, 200)  # MNIST image size
        self.layer2 = nn.Linear(200, 200)
        self.layer3 = nn.Linear(200, 10)  # MNIST has 10 classes
        self.relu = nn.ReLU()
        self.loss_fn = nn.CrossEntropyLoss()  # MNIST is a multi-class classification problem

    def forward(self, x):
        x = x.view(x.size(0), -1)  # Flatten MNIST image
        x = self.relu(self.layer1(x))
        x = self.relu(self.layer2(x))
        x = self.layer3(x)
        return x

    def training_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x)  # Forward pass
        loss = self.loss_fn(logits, y)  # Compute loss
        self.log("train_loss", loss, prog_bar=True)  # Log the training loss
        return loss

    def configure_optimizers(self):
        return optim.Adam(self.parameters(), lr=1e-3)  # Adam Optimizer
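
As a quick sanity check on the layer sizes, the parameter counts can be worked out by hand. This is a small arithmetic sketch; `linear_params` is a hypothetical helper, and the 4 bytes per parameter assumes 32-bit floats. The results agree with the model summary that the trainer prints during training (157 K, 40.2 K, 2.0 K, 199 K total, 0.797 MB).

```python
def linear_params(in_features, out_features):
    # A fully connected layer has in*out weights plus one bias per output unit.
    return in_features * out_features + out_features


layer_sizes = [(1 * 28 * 28, 200), (200, 200), (200, 10)]
counts = [linear_params(i, o) for i, o in layer_sizes]
total = sum(counts)
size_mb = total * 4 / 1e6  # assumes float32 parameters (4 bytes each)

print(counts)             # [157000, 40200, 2010]
print(total)              # 199210
print(round(size_mb, 3))  # 0.797
```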


Define the validation_step and test_step methods for your MLP model in PyTorch Lightning:
- log "val_loss", "val_acc", "test_loss" and "test_acc"

In [4]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import pytorch_lightning as pl

class MLP(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(1 * 28 * 28, 200)  # MNIST image size
        self.layer2 = nn.Linear(200, 200)
        self.layer3 = nn.Linear(200, 10)  # MNIST has 10 classes

    def forward(self, x):
        x = x.view(x.shape[0], -1)  # Flatten, (B, 1*28*28)
        x = F.relu(self.layer1(x))  # (B, 200)
        x = F.relu(self.layer2(x))  # (B, 200)
        x = self.layer3(x)  # (B, 10)
        return x

    def training_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x)  # Forward pass
        loss = F.cross_entropy(logits, y)
        self.log("train_loss", loss, prog_bar=True)
        return loss

    def validation_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x)
        loss = F.cross_entropy(logits, y)
        acc = (logits.argmax(dim=1) == y).float().mean()  # Compute accuracy
        self.log("val_loss", loss, prog_bar=True)
        self.log("val_acc", acc, prog_bar=True)
        return loss

    def test_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x)
        loss = F.cross_entropy(logits, y)
        acc = (logits.argmax(dim=1) == y).float().mean()  # Compute accuracy
        self.log("test_loss", loss, prog_bar=True)
        self.log("test_acc", acc, prog_bar=True)
        return loss

    def configure_optimizers(self):
        return optim.Adam(self.parameters(), lr=1e-3)
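
The accuracy expression used in `validation_step` and `test_step` can be checked on a tiny hand-made batch. The logits and labels below are made-up values for illustration only.

```python
import torch

# Four samples, three classes (made-up numbers).
logits = torch.tensor([
    [2.0, 0.1, 0.3],   # predicted class 0
    [0.2, 1.5, 0.1],   # predicted class 1
    [0.9, 0.8, 0.1],   # predicted class 0 (wrong, label is 1)
    [0.1, 0.2, 3.0],   # predicted class 2
])
y = torch.tensor([0, 1, 1, 2])

preds = logits.argmax(dim=1)        # index of the largest logit per row
acc = (preds == y).float().mean()   # fraction of correct predictions

print(preds.tolist())  # [0, 1, 0, 2]
print(acc.item())      # 0.75
```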


### Define a dataset
Lightning supports ANY iterable (DataLoader, numpy, etc…) for the train/val/test/predict splits.

Hooks:
- ```train_dataloader()```
- ```val_dataloader()```
- ```test_dataloader()```

The methods above return the dataloaders for each split in a Lightning datamodule.

- ```prepare_data()```: download, tokenize, or otherwise preprocess the complete dataset. Because this is called on a single process even when you are using multiple GPUs, state assigned here is not shared across GPUs.
- ```setup()```: splitting, transformations, and so on. It takes a stage argument, for example "fit" for training or "test" for testing; the skeleton below defaults it to None.

```python
class LightningDataset(pl.LightningDataModule):
    def __init__(self):
        super().__init__()
  
    def prepare_data(self):
        pass
  
    def setup(self, stage=None):
        pass
  
    def train_dataloader(self):
        pass
  
    def val_dataloader(self):
        pass
  
    def test_dataloader(self):
        pass
```

In [5]:
import torch
import pytorch_lightning as pl
from torchvision import datasets, transforms
from torch.utils.data import DataLoader, random_split

class MnistDataModule(pl.LightningDataModule):
    def __init__(self, data_dir="D:/data", batch_size=32, num_workers=4):
        super().__init__()
        self.data_dir = data_dir
        self.batch_size = batch_size
        self.num_workers = num_workers

    def prepare_data(self):
        # Download the MNIST data
        datasets.MNIST(self.data_dir, train=True, download=True)
        datasets.MNIST(self.data_dir, train=False, download=True)

    def setup(self, stage=None):
        # Data transforms (including augmentation)
        transform = transforms.Compose([
            transforms.RandomVerticalFlip(p=0.1),
            transforms.RandomHorizontalFlip(p=0.1),
            transforms.ToTensor()
        ])

        # Load the full MNIST training set
        entire_dataset = datasets.MNIST(self.data_dir, train=True, transform=transform)
        
        # Split into train/val datasets (9:1 ratio)
        train_size = int(0.9 * len(entire_dataset))
        val_size = len(entire_dataset) - train_size
        self.train_ds, self.val_ds = random_split(entire_dataset, [train_size, val_size])

        # Test dataset
        self.test_ds = datasets.MNIST(self.data_dir, train=False, transform=transforms.ToTensor())

    def train_dataloader(self):
        return DataLoader(self.train_ds, batch_size=self.batch_size, num_workers=self.num_workers, shuffle=True)

    def val_dataloader(self):
        return DataLoader(self.val_ds, batch_size=self.batch_size, num_workers=self.num_workers)

    def test_dataloader(self):
        return DataLoader(self.test_ds, batch_size=self.batch_size, num_workers=self.num_workers)
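
The 9:1 split performed in `setup` can be tried on a toy dataset without downloading MNIST. This is a minimal sketch: the 100-sample `TensorDataset` is a stand-in, and passing a seeded `generator` is an assumption added here to make the split reproducible (the datamodule above does not seed it).

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Toy dataset standing in for MNIST (assumption): 100 samples with dummy labels.
features = torch.arange(100).float().unsqueeze(1)
labels = torch.zeros(100, dtype=torch.long)
dataset = TensorDataset(features, labels)

train_size = int(0.9 * len(dataset))
val_size = len(dataset) - train_size

# A seeded generator makes the random split repeatable across runs.
generator = torch.Generator().manual_seed(42)
train_ds, val_ds = random_split(dataset, [train_size, val_size], generator=generator)

print(len(train_ds), len(val_ds))  # 90 10

# The two subsets never share a sample.
overlap = set(train_ds.indices) & set(val_ds.indices)
print(len(overlap))  # 0
```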


### Callbacks
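Callbacks in PyTorch Lightning are a powerful tool for adding user-defined behavior that runs when specific events occur during model training. They let you automate a variety of tasks during training, validation, prediction, and other stages.
- Early Stopping (stop training early)
- Model Checkpointing (save the best model)
- Logging & Visualization (logging with TensorBoard, WandB, etc.)
- Learning Rate Scheduling (adjust the learning rate)
- Custom Actions (model evaluation, additional data logging, etc.)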

```python
trainer = pl.Trainer(
    max_epochs=10,
    callbacks=[
        early_stopping,
        checkpoint_callback,
        PrintLearningRateCallback()
    ]
)
```

1. [Built-in-callbacks](https://lightning.ai/docs/pytorch/stable/extensions/callbacks.html#built-in-callbacks)
2. [Callback API](https://lightning.ai/docs/pytorch/stable/extensions/callbacks.html#callback-api)
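
To make the idea of "hooks called at specific events" concrete, here is a toy illustration of the callback pattern in pure Python. It is not Lightning's implementation: `ToyTrainer`, `ToyCallback`, and `RecordingCallback` are hypothetical names, and the real Callback hooks also receive the LightningModule.

```python
class ToyCallback:
    """Base class with empty hooks, similar in spirit to a callback base class."""

    def on_train_start(self, trainer):
        pass

    def on_train_end(self, trainer):
        pass


class RecordingCallback(ToyCallback):
    """Overrides only the hooks it cares about and records when they fire."""

    def __init__(self):
        self.events = []

    def on_train_start(self, trainer):
        self.events.append("start")

    def on_train_end(self, trainer):
        self.events.append("end")


class ToyTrainer:
    def __init__(self, callbacks):
        self.callbacks = callbacks

    def _call_hook(self, hook_name):
        # Invoke the named hook on every registered callback, in order.
        for callback in self.callbacks:
            getattr(callback, hook_name)(self)

    def fit(self):
        self._call_hook("on_train_start")
        # ... the actual training loop would run here ...
        self._call_hook("on_train_end")


recorder = RecordingCallback()
ToyTrainer(callbacks=[recorder]).fit()
print(recorder.events)  # ['start', 'end']
```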

#### 1️⃣ A simple Callback example

The following is a simple Callback that prints a message when training starts and when it ends.

In [6]:
class MyPrintingCallback(Callback):
    def __init__(self):
        super().__init__()

    def on_train_start(self, trainer, pl_module):
        print("Starting to train!")

    def on_train_end(self, trainer, pl_module):
        print("Training is done.")

#### 2️⃣ Early Stopping Callback

PyTorch Lightning provides the EarlyStopping callback, which automatically stops training when the model's performance stops improving.

In [7]:
from pytorch_lightning.callbacks import EarlyStopping

early_stopping = EarlyStopping(
    monitor="val_loss",  # metric to monitor
    patience=3,          # number of epochs without improvement before stopping
    verbose=True,
    mode="min"           # treat the minimum as best (lower loss is better)
)
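
The effect of `patience` with `mode="min"` can be sketched in a few lines of pure Python. This is a simplified model of patience-based stopping, not the EarlyStopping source: it ignores options such as a minimum improvement threshold, and `stopping_epoch` and the loss history are hypothetical.

```python
def stopping_epoch(val_losses, patience=3):
    """Return the epoch index at which training would stop, or None."""
    best = float("inf")
    wait = 0  # consecutive epochs without improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Made-up validation losses: improvement stops after epoch 2.
history = [0.50, 0.40, 0.35, 0.36, 0.37, 0.38, 0.30]
print(stopping_epoch(history))             # 5
print(stopping_epoch([0.5, 0.4, 0.3]))     # None
```

Note that in this example training stops at epoch 5, so the better value at epoch 6 is never reached; a larger `patience` trades extra compute for a lower chance of stopping too early.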

#### 3️⃣ Model Checkpointing Callback

You can also use a Callback that saves the best-performing model.

In [8]:
from pytorch_lightning.callbacks import ModelCheckpoint

checkpoint_callback = ModelCheckpoint(
    monitor="val_loss",       # metric used as the criterion
    dirpath="checkpoints/",   # directory to save to
    filename="best-checkpoint",  # file name
    save_top_k=1,             # keep only the best k models
    mode="min",               # lower loss means a better model
    verbose=True
)

#### 4️⃣ Practice: Custom Callback (user-defined)

You can also write your own custom Callback.

First, let's create a Callback that prints the current learning rate every 10th batch. (Hint: on_train_batch_end)

In [9]:
import pytorch_lightning as pl

class PrintLearningRateCallback(pl.Callback):
    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        # Get the current learning rate from the optimizer
        lr = trainer.optimizers[0].param_groups[0]['lr']

        # Print every 10th batch
        if (batch_idx + 1) % 10 == 0:
            print(f"Batch {batch_idx + 1}: Learning Rate = {lr:.6f}")
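
Reading `param_groups[0]['lr']` only shows a changing value when a learning rate scheduler is attached to the optimizer. The sketch below uses plain PyTorch with a toy model to show how that value evolves; the choice of `StepLR` with `step_size=1` and `gamma=0.5` is an assumption for illustration. In Lightning, a scheduler would typically be returned from `configure_optimizers` alongside the optimizer.

```python
import torch
from torch import nn, optim

model = nn.Linear(2, 1)  # toy model (assumption)
optimizer = optim.Adam(model.parameters(), lr=1e-3)
# Halve the learning rate after every epoch (illustrative choice).
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.5)

lrs = []
for epoch in range(3):
    # Same lookup the callback performs on trainer.optimizers[0].
    lrs.append(optimizer.param_groups[0]["lr"])

    optimizer.zero_grad()
    loss = model(torch.ones(1, 2)).sum()
    loss.backward()
    optimizer.step()
    scheduler.step()  # update the learning rate at the end of the epoch

print(lrs)  # approximately [0.001, 0.0005, 0.00025]
```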


#### 5️⃣ Practice: Gradient Clipping (automatic gradient clipping)

In PyTorch Lightning, setting gradient_clip_val on the trainer enables gradient clipping,
but with a custom Callback you can also apply it only under specific conditions.

1. [torch.clamp](https://pytorch.org/docs/stable/generated/torch.clamp.html#torch.clamp)
2. [on_after_backward](https://lightning.ai/docs/pytorch/stable/extensions/callbacks.html#on-after-backward)

In [10]:
import pytorch_lightning as pl
import torch.nn.utils as nn_utils

class GradientClippingCallback(pl.Callback):
    def __init__(self, clip_value=0.5):
        super().__init__()
        self.clip_value = clip_value

    def on_after_backward(self, trainer, pl_module):
        # Clip the total gradient norm over all parameters to at most clip_value
        nn_utils.clip_grad_norm_(pl_module.parameters(), self.clip_value)
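
Norm clipping, as used in the callback, behaves differently from clamping each gradient element (the behavior of `torch.clamp`). The following sketch contrasts `clip_grad_norm_` with `clip_grad_value_` on a single hand-made gradient; the parameter and gradient values are made up, and `make_param` is a hypothetical helper.

```python
import torch
from torch import nn


def make_param():
    # A parameter whose gradient [3, 4] has L2 norm 5.
    p = nn.Parameter(torch.zeros(2))
    p.grad = torch.tensor([3.0, 4.0])
    return p


# Norm clipping rescales the whole gradient so its norm is at most 0.5,
# which preserves the gradient's direction.
p_norm = make_param()
nn.utils.clip_grad_norm_([p_norm], max_norm=0.5)

# Value clipping clamps each element to [-0.5, 0.5] independently,
# which can change the gradient's direction.
p_value = make_param()
nn.utils.clip_grad_value_([p_value], clip_value=0.5)

print(p_norm.grad)   # approximately tensor([0.3000, 0.4000])
print(p_value.grad)  # tensor([0.5000, 0.5000])
```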


#### 6️⃣ Practice: Callback that measures training time per epoch
This Callback measures and prints the elapsed time at the end of each epoch.

1. [on_train_epoch_start](https://lightning.ai/docs/pytorch/stable/extensions/callbacks.html#on-train-epoch-start)
2. [on_train_epoch_end](https://lightning.ai/docs/pytorch/stable/extensions/callbacks.html#on-train-epoch-end)

In [11]:
import time

class TimerCallback(pl.Callback):
    def on_train_epoch_start(self, trainer, pl_module):
        self.epoch_start_time = time.time()

    def on_train_epoch_end(self, trainer, pl_module):
        elapsed_time = time.time() - self.epoch_start_time
        print(f"Epoch {trainer.current_epoch} elapsed time: {elapsed_time:.2f}s")

### Setting the hyperparameters
PyTorch Lightning provides a variety of hyperparameters.
- [Trainer flags](https://lightning.ai/docs/pytorch/stable/common/trainer.html#trainer-flags)

In [15]:
model = MLP()
dm = MnistDataModule(
    data_dir="D:/data",
    batch_size=100,
    num_workers=4,
)
trainer = pl.Trainer(
    logger=TensorBoardLogger("tb_logs", name="mnist_model_v0"),
    accelerator="gpu",
    devices=[0],
    min_epochs=1,
    max_epochs=2,
    callbacks=[PrintLearningRateCallback(), GradientClippingCallback(), TimerCallback()],
)

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
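
The configuration above hard-codes `accelerator="gpu"`, which fails on a machine without CUDA. One possible way to make it portable is to choose the accelerator at runtime, as in the sketch below. `trainer_kwargs` is a hypothetical name; the dictionary is meant to be unpacked into `pl.Trainer(...)` together with the logger and callbacks shown above. Passing `accelerator="auto"` to the Trainer is another option.

```python
import torch

# Pick the GPU when CUDA is available, otherwise fall back to the CPU.
use_gpu = torch.cuda.is_available()

trainer_kwargs = dict(
    accelerator="gpu" if use_gpu else "cpu",
    devices=[0] if use_gpu else 1,  # GPU index 0, or a single CPU process
    min_epochs=1,
    max_epochs=2,
)
print(trainer_kwargs)

# Intended usage (not executed here):
# trainer = pl.Trainer(**trainer_kwargs, logger=..., callbacks=[...])
```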


### Training the model

In [16]:
trainer.fit(model, dm)
trainer.validate(model, dm)
trainer.test(model, dm)

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name   | Type   | Params | Mode 
------------------------------------------
0 | layer1 | Linear | 157 K  | train
1 | layer2 | Linear | 40.2 K | train
2 | layer3 | Linear | 2.0 K  | train
------------------------------------------
199 K     Trainable params
0         Non-trainable params
199 K     Total params
0.797     Total estimated model params size (MB)
3         Modules in train mode
0         Modules in eval mode


                                                                           

c:\Users\HP\AppData\Local\Programs\Python\Python39\lib\site-packages\pytorch_lightning\trainer\connectors\data_connector.py:420: Consider setting `persistent_workers=True` in 'train_dataloader' to speed up the dataloader worker initialization.


Epoch 0:   2%|▏         | 9/540 [00:00<00:10, 51.10it/s, v_num=1, train_loss=2.080]Batch 10: Learning Rate = 0.001000
Epoch 0:   4%|▎         | 19/540 [00:00<00:06, 83.65it/s, v_num=1, train_loss=1.410]Batch 20: Learning Rate = 0.001000
Epoch 0:   5%|▌         | 29/540 [00:00<00:04, 104.27it/s, v_num=1, train_loss=1.000]Batch 30: Learning Rate = 0.001000
Epoch 0:   7%|▋         | 39/540 [00:00<00:04, 117.62it/s, v_num=1, train_loss=0.732]Batch 40: Learning Rate = 0.001000
Epoch 0:   9%|▉         | 49/540 [00:00<00:03, 128.75it/s, v_num=1, train_loss=0.668]Batch 50: Learning Rate = 0.001000
Epoch 0:  11%|█         | 59/540 [00:00<00:03, 135.14it/s, v_num=1, train_loss=0.736]Batch 60: Learning Rate = 0.001000
Epoch 0:  13%|█▎        | 69/540 [00:00<00:03, 139.23it/s, v_num=1, train_loss=0.811]Batch 70: Learning Rate = 0.001000
Epoch 0:  15%|█▍        | 79/540 [00:00<00:03, 143.75it/s, v_num=1, train_loss=0.753]Batch 80: Learning Rate = 0.001000
Epoch 0:  16%|█▋        | 89/540 [00:00<00:

`Trainer.fit` stopped: `max_epochs=2` reached.


Epoch 1: 100%|██████████| 540/540 [00:12<00:00, 44.61it/s, v_num=1, train_loss=0.179, val_loss=0.225, val_acc=0.931]


LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


Validation DataLoader 0: 100%|██████████| 60/60 [00:00<00:00, 296.01it/s]
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
     Validate metric           DataLoader 0
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
         val_acc            0.9396666884422302
        val_loss            0.19124844670295715
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────


LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
c:\Users\HP\AppData\Local\Programs\Python\Python39\lib\site-packages\pytorch_lightning\trainer\connectors\data_connector.py:420: Consider setting `persistent_workers=True` in 'test_dataloader' to speed up the dataloader worker initialization.


Testing DataLoader 0: 100%|██████████| 100/100 [00:00<00:00, 493.82it/s]
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
       Test metric             DataLoader 0
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
        test_acc            0.9559999704360962
        test_loss           0.1369308978319168
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────


[{'test_loss': 0.1369308978319168, 'test_acc': 0.9559999704360962}]
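
The batch counts shown in the progress bars (540, 60, and 100) follow directly from the dataset sizes and the batch size of 100. A short arithmetic check, assuming the standard MNIST sizes of 60,000 training and 10,000 test images and the 9:1 train/val split used in the datamodule:

```python
import math

batch_size = 100
mnist_train_total = 60_000  # standard MNIST training set size
mnist_test_total = 10_000   # standard MNIST test set size

train_samples = int(0.9 * mnist_train_total)   # 9:1 split, as in setup()
val_samples = mnist_train_total - train_samples

# The DataLoader keeps a final partial batch by default, hence the ceiling.
train_batches = math.ceil(train_samples / batch_size)
val_batches = math.ceil(val_samples / batch_size)
test_batches = math.ceil(mnist_test_total / batch_size)

print(train_batches, val_batches, test_batches)  # 540 60 100
```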

In [18]:
%load_ext tensorboard
%tensorboard --logdir tb_logs

The tensorboard extension is already loaded. To reload it, use:
  %reload_ext tensorboard


Reusing TensorBoard on port 6007 (pid 19084), started 0:00:07 ago. (Use '!kill 19084' to kill it.)