### PyTorch Lightning Tutorial - Lightweight PyTorch Wrapper For ML Researchers
- https://www.youtube.com/watch?v=Hgg8Xy6IRig
- from: PyTorch Course by Python Engineer Channel


Let's see how much easier PyTorch Lightning makes it to write a training loop,
and maybe try TensorBoard too.

In [3]:
# !pip install pytorch-lightning

In [1]:
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms
import torch.nn.functional as F
import matplotlib.pyplot as plt
import pytorch_lightning as pl
from pytorch_lightning import Trainer
from torch.utils.data import DataLoader
import os
from pytorch_lightning.loggers import TensorBoardLogger



In [2]:
# 2.1.2
print(pl.__version__)

2.1.2


The code below is modified from:
- https://github.com/patrickloeber/pytorchTutorial/blob/master/13_feedforward.py

In [3]:
# Hyper-parameters
input_size = 784  # 28x28
hidden_size = 500
num_classes = 10
num_epochs = 2
batch_size = 100
learning_rate = 0.001


In [4]:
# logger = TensorBoardLogger("tb_logs", name="my_model")
# trainer = Trainer(logger=logger)

In [5]:
train_set = torchvision.datasets.MNIST(root=os.getcwd(),  #  get path of the current working directory
                                       download=True, 
                                       train=True, 
                                       transform=transforms.ToTensor())

test_set = torchvision.datasets.MNIST(root=os.getcwd(), 
                                      download=True, 
                                      train=False, 
                                      transform=transforms.ToTensor())

# use 20% of training data for validation
train_set_size = int(len(train_set) * 0.8)
valid_set_size = len(train_set) - train_set_size

# split the train set into two
seed = torch.Generator().manual_seed(42)
train_set, valid_set = torch.utils.data.random_split(train_set, 
                                         [train_set_size, valid_set_size], 
                                         generator=seed)
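
To see what the seeded generator buys us, here is a small sketch on a toy dataset. A `range(10)` stands in for MNIST, and `split_indices` is a hypothetical helper written only for this illustration: splitting twice with the same seed should give the same indices.

```python
import torch
from torch.utils.data import random_split

def split_indices(n_items, train_fraction, seed):
    """Split range(n_items) like the cell above and return both index lists."""
    train_size = int(n_items * train_fraction)
    valid_size = n_items - train_size
    generator = torch.Generator().manual_seed(seed)
    train_part, valid_part = random_split(
        range(n_items), [train_size, valid_size], generator=generator
    )
    return list(train_part.indices), list(valid_part.indices)

first = split_indices(10, 0.8, seed=42)
second = split_indices(10, 0.8, seed=42)
print(first == second)  # same seed, same split
```

Without a fixed seed, the validation subset would change every time the notebook is rerun, so validation losses from different runs would not be directly comparable.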

In [93]:
# Fully connected neural network with one hidden layer

# class NeuralNet(nn.Module):
class LitNeuralNet(pl.LightningModule):
    def __init__(self, input_size, hidden_size, num_classes):
        super(LitNeuralNet, self).__init__()
        self.input_size = input_size
        self.l1 = nn.Linear(input_size, hidden_size) 
        self.relu = nn.ReLU()
        self.l2 = nn.Linear(hidden_size, num_classes)  
        self.validation_outputs = []
        self.test_outputs = []
    
    def forward(self, x):
        out = self.l1(x)
        out = self.relu(out)
        out = self.l2(out)
        # no activation and no softmax at the end
        return out


    # the training step and the optimizer setup also live in the class
    # https://lightning.ai/docs/pytorch/stable/starter/introduction.html
    def training_step(self, batch, batch_idx):
        '''
            This replaces the body of the manual training loop, which looked like this:
            https://github.com/patrickloeber/pytorchTutorial/blob/master/13_feedforward.py#L68-L86
        '''
        images, y = batch # unpack batch

        # original shape: [100, 1, 28, 28]
        # reshaped: [100, 784]
        # flatten so that each pixel is one numeric feature; 100 is the batch_size
        images = images.reshape(-1, 28  * 28 )

        y_pred = self(images) # pred_y
        loss = F.cross_entropy(y_pred, y)
        '''
        Compared with the plain PyTorch version (shown below):
            - the only things left out here are .to(device), .zero_grad(), .backward() and .step()
            - so Lightning lets me skip most of the manual looping and optimizer/loss management

            # original shape: [100, 1, 28, 28]
            # reshaped: [100, 784]
            images = images.reshape(-1, 28*28).to(device)
            labels = labels.to(device)
            
            # Forward pass
            outputs = model(images)
            loss = criterion(outputs, labels)
            
            # Backward and optimize
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        '''
        
        self.log('train_loss', loss)  # sent to the Trainer's logger

        # returning the loss is enough; the old {'log': ...} return key is not used for logging in 2.x
        return loss

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
        return optimizer

    def train_dataloader(self): 
        
        # dataset = torchvision.datasets.MNIST(os.getcwd(),  #  get path of the current working directory
        #                                      train=True, 
        #                                      download=True, 
        #                                      transform = transforms.ToTensor())  
          
        loader =  torch.utils.data.DataLoader(dataset=train_set, 
                                              batch_size=batch_size, 
                                              num_workers=4, # use multi-core to load data
                                              shuffle=False)

        return loader
    

    def validation_step(self, batch, batch_idx):
        '''
            Same forward pass and loss as training_step, but run on a validation batch.
            There is no backward pass and no optimizer step here.
        '''
        images, y = batch # unpack batch

        # original shape: [100, 1, 28, 28]
        # reshaped: [100, 784]
        # flatten so that each pixel is one numeric feature; 100 is the batch_size
        images = images.reshape(-1, 28  * 28 )

        y_pred = self(images) # pred_y
        loss = F.cross_entropy(y_pred, y)
        tensorboard_logs = {'val_loss': loss}

        # self.log('val_loss', loss)

        self.validation_outputs.append(loss)

        return {'val_loss': loss, 'log': tensorboard_logs}  

    def val_dataloader(self): 
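        '''
            The official docs seem to do this a bit differently: the validation DataLoader is built
            the same way as the training one, just with a different dataset as input
                https://lightning.ai/docs/pytorch/stable/common/evaluation_basic.html

            train_loader = DataLoader(train_set)
            valid_loader = DataLoader(valid_set)

            # train with both splits
            trainer = L.Trainer()
            trainer.fit(model, train_loader, valid_loader)    

            The video took a shortcut and used the test set as the validation set;
            here I use a proper validation split instead.
        '''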

        loader =  torch.utils.data.DataLoader(dataset=valid_set, 
                                              batch_size=batch_size, 
                                              num_workers=4, # use multi-core to load data
                                              shuffle=False)

        return loader      
    

    # NotImplementedError: Support for `validation_epoch_end` has been removed in v2.0.0. `LitNeuralNet` implements this method. 
    # You can use the `on_validation_epoch_end` hook instead. To access outputs, save them in-memory as instance attributes. You can find migration examples in https://github.com/Lightning-AI/lightning/pull/16520.
    def on_validation_epoch_end(self):
        '''
            What do I expect to see from this hook?
            It seems that without it the training progress bar would not show the loss at all..?
        '''
        # validation_step appends every batch loss to self.validation_outputs,
        # so the epoch average can be computed here
        # print(self.validation_outputs)

        avg_loss = torch.stack(self.validation_outputs).mean()

        # logged once per epoch; it only appears in the trainer.fit() progress bar if prog_bar=True is passed
        self.log('val_loss', avg_loss, on_epoch=True)

        print(f' val_loss: {avg_loss}') # printed manually because it is not in the progress bar by default

        self.validation_outputs = [] # reset for the next epoch

    def test_step(self, batch, batch_idx):        
        images, y = batch # unpack batch

        # original shape: [100, 1, 28, 28]
        # reshaped: [100, 784]
        # flatten so that each pixel is one numeric feature; 100 is the batch_size
        images = images.reshape(-1, 28  * 28 )

        # print(f'images.max(): {images.max()}')

        y_pred = self(images) # pred_y
        loss = F.cross_entropy(y_pred, y)
        tensorboard_logs = {'test_loss': loss}

        self.test_outputs.append(loss)

        self.log('test_loss', loss)

        return {'test_loss': loss, 'log': tensorboard_logs}  
    
    def on_test_epoch_end(self):
    
        avg_loss = torch.stack(self.test_outputs).mean()

        # logged once per test run; it is part of the metrics returned by trainer.test()
        self.log('test_loss_agg', avg_loss, on_epoch=True)

        print(f' test_loss_agg: {avg_loss}') # printed manually as well

        self.test_outputs = [] # reset for the next run
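
The "no activation and no softmax at the end" comment in `forward` works because `F.cross_entropy` expects raw logits and applies log-softmax internally. A minimal check on made-up logits and targets (random values, not MNIST):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)           # raw scores: batch of 4 samples, 10 classes
targets = torch.tensor([3, 0, 7, 9])  # class indices

loss_direct = F.cross_entropy(logits, targets)
loss_manual = F.nll_loss(F.log_softmax(logits, dim=1), targets)

print(torch.allclose(loss_direct, loss_manual))
```

Adding a softmax layer before `F.cross_entropy` would therefore apply the normalization twice and distort the loss.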

In [94]:
torch.set_num_threads(4)



In [95]:
# trainer is the training loop

# trainer = Trainer(fast_dev_run=True) # fast_dev_run runs only a single batch, so you can quickly check that the model works
trainer = Trainer(max_epochs=num_epochs, 
                  fast_dev_run=False
                  )

model = LitNeuralNet(
    input_size=input_size,
    hidden_size=hidden_size,
    num_classes=num_classes
)

GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs


In [96]:
'''
    Instead of writing a double for loop over epochs and batches like the old way:
        n_total_steps = len(train_loader)
        for epoch in range(num_epochs):
            for i, (images, labels) in enumerate(train_loader):  
                ...

    you can simply do the following
'''

train_loader = DataLoader(train_set,
                          batch_size=batch_size, 
                          num_workers=4, # use multi-core to load data
                          shuffle=False)

# keep shuffle=False for validation; otherwise Lightning emits the following warning (not an error):
# PossibleUserWarning: Your `val_dataloader`'s sampler has shuffling enabled, it is strongly recommended that you turn shuffling off for val/test/predict dataloaders.category=PossibleUserWarning,
# Question: Why? Answer: according to ChatGPT, a fixed order keeps evaluation reproducible, so metrics stay consistent across runs and models.
valid_loader = DataLoader(valid_set,
                          batch_size=batch_size, 
                          num_workers=4, # use multi-core to load data
                          shuffle=False) 
 

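# TODO: need to understand some of these parameters..!
# Note: the commented-out options below are Trainer(...) arguments, not fit() arguments
trainer.fit(model=model,
            # gpus=1, # removed in 2.x; use Trainer(accelerator="gpu", devices=1) if you have a GPU
            # deterministic=True, # Trainer argument; makes runs more reproducible
            # auto_lr_find=True, # removed in 2.x; the learning-rate finder now lives in pytorch_lightning.tuner.Tuner
            # gradient_clip_val, # Trainer argument; a positive float, by default the maximum gradient norm
            train_dataloaders=train_loader, # loaders passed here are used instead of the train_dataloader()/val_dataloader() hooks defined on the model
            val_dataloaders=valid_loader
            )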


  | Name | Type   | Params
--------------------------------
0 | l1   | Linear | 392 K 
1 | relu | ReLU   | 0     
2 | l2   | Linear | 5.0 K 
--------------------------------
397 K     Trainable params
0         Non-trainable params
397 K     Total params
1.590     Total estimated model params size (MB)
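
The numbers in this summary can be reproduced by hand. The sketch below assumes each parameter is a 4-byte float32 and that "MB" means 10^6 bytes, which matches the printed 1.590; `linear_params` is a hypothetical helper.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features

l1 = linear_params(784, 500)   # 392,500 -> shown as "392 K"
l2 = linear_params(500, 10)    # 5,010   -> shown as "5.0 K"
total = l1 + l2                # 397,510 -> shown as "397 K"
size_mb = total * 4 / 1e6      # assumption: 4 bytes per float32 parameter

print(l1, l2, total, round(size_mb, 3))
```

Almost all of the parameters sit in the first layer, because it maps 784 input pixels to 500 hidden units.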


Sanity Checking: |          | 0/? [00:00<?, ?it/s]

/opt/anaconda3/envs/pytorch_py38/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:436: Consider setting `persistent_workers=True` in 'val_dataloader' to speed up the dataloader worker initialization.


Sanity Checking DataLoader 0: 100%|██████████| 2/2 [00:00<00:00, 67.76it/s] val_loss: 2.290222644805908
                                                                           

/opt/anaconda3/envs/pytorch_py38/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:436: Consider setting `persistent_workers=True` in 'train_dataloader' to speed up the dataloader worker initialization.


Epoch 0: 100%|██████████| 480/480 [00:21<00:00, 22.53it/s, v_num=21] val_loss: 0.18457770347595215
Epoch 1: 100%|██████████| 480/480 [00:39<00:00, 12.05it/s, v_num=21] val_loss: 0.12921838462352753
Epoch 1: 100%|██████████| 480/480 [00:53<00:00,  8.95it/s, v_num=21]

`Trainer.fit` stopped: `max_epochs=2` reached.


Epoch 1: 100%|██████████| 480/480 [00:53<00:00,  8.93it/s, v_num=21]
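
The `480/480` in the progress bar and the `step=960` in the checkpoint file name follow directly from the split and the batch size. A quick sanity check, assuming the standard MNIST sizes of 60,000 training and 10,000 test images:

```python
import math

train_size = int(60000 * 0.8)       # 80% of the MNIST training set
valid_size = 60000 - train_size     # remaining 20% used for validation
batch_size, num_epochs = 100, 2

train_batches = math.ceil(train_size / batch_size)
valid_batches = math.ceil(valid_size / batch_size)
test_batches = math.ceil(10000 / batch_size)
total_steps = train_batches * num_epochs  # one optimizer step per training batch

print(train_batches, valid_batches, test_batches, total_steps)
```

If the progress bar ever shows a different total, the split or the batch size is probably not what you think it is.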


In [97]:
# evaluate on the test set
# TODO: look into how to create a custom Dataset in PyTorch
test_loader = DataLoader(dataset=test_set,
                          batch_size=batch_size, 
                          num_workers=4, # use multi-core to load data
                          shuffle=False) 

In [98]:
# wait, but where is the output? 
# oh nice, it even prints a test metric table; the values come from the self.log() calls in test_step / on_test_epoch_end
trainer.test(model, dataloaders=test_loader)

/opt/anaconda3/envs/pytorch_py38/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:436: Consider setting `persistent_workers=True` in 'test_dataloader' to speed up the dataloader worker initialization.


Testing DataLoader 0: 100%|██████████| 100/100 [00:01<00:00, 63.48it/s] test_loss_agg: 0.10936376452445984
Testing DataLoader 0: 100%|██████████| 100/100 [00:01<00:00, 63.05it/s]


[{'test_loss': 0.10936377197504044, 'test_loss_agg': 0.10936376452445984}]

In [82]:
!ls lightning_logs/version19/checkpoints/epoch=1-step=960.ckpt

ls: lightning_logs/version19/checkpoints/epoch=1-step=960.ckpt: No such file or directory


In [83]:
# perform single inference

# issue: __init__() missing 3 required positional arguments: 'input_size', 'hidden_size', and 'num_classes'
# solved: the constructor arguments have to be passed again when loading, because they were not saved in the checkpoint
# (calling self.save_hyperparameters() in __init__ would store them)
model = LitNeuralNet.load_from_checkpoint(checkpoint_path="lightning_logs/version_8/checkpoints/epoch=1-step=960.ckpt",
                                        input_size=input_size,
                                        hidden_size=hidden_size,
                                        num_classes=num_classes)
model.eval()
x = torch.rand(1, 28 * 28)

with torch.no_grad():
    y_hat = model(x)

print(torch.argmax(y_hat, dim=1).item()) # use item to unwrap

3


-----------------
Question: __How to use PyTorch Lightning with TensorBoard__?

Install dependencies
- pip install lightning[app]
- pip install tensorboard

Run the following Command
- tensorboard --logdir=lightning_logs

Common issues when TensorBoard is not working:
- Restart the notebook kernel, then confirm that an event file such as `events.out.tfevents.1701028601.Qis-MacBook-Pro-2.local.14682.0` exists in your log directory
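
To confirm that event files exist, a small standard-library sketch can list them. `find_event_files` is a hypothetical helper, and `lightning_logs` is assumed to be the log directory, as in the `tensorboard --logdir` command above.

```python
from pathlib import Path

def find_event_files(log_dir="lightning_logs"):
    """Return all TensorBoard event files under log_dir, sorted by path."""
    root = Path(log_dir)
    if not root.is_dir():
        return []  # nothing has been logged to this directory yet
    return sorted(root.rglob("events.out.tfevents.*"))

for path in find_event_files():
    print(path)
```

An empty result means TensorBoard has nothing to display, which usually points to a wrong `--logdir` or to a run that had logging disabled.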

-----------------
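*Question*: why is my v_num= empty? Epoch 0: 100%|██████████| 2/2 [00:00<00:00,  4.29it/s, loss=2.32, v_num=]
- *Answer*: it seems to show up with Lightning 2.x once `fast_dev_run` is turned off; `fast_dev_run=True` disables the loggers, so there is likely no version number to display
    - Epoch 0: 100%|██████████| 480/480 [00:19<00:00, 25.11it/s, v_num=2]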

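*Question*: Difference between Lightning and Pytorch-Lightning?
- *Answer*: my first guess was that one is a library and the other a command line tool. As far as I can tell, both are Python packages: `pytorch-lightning` is the standalone package (`import pytorch_lightning as pl`), while `lightning` is the newer umbrella package that ships the same Trainer and LightningModule under `lightning.pytorch` (`import lightning as L`)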

TODO: go through the official docs carefully: https://lightning.ai/docs/pytorch/stable/starter/introduction.html

### High Level Summary
- Broadly speaking, it lets you write PyTorch models with an sklearn-like API such as .fit() and .predict()

### Other thoughts
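- The reworked official docs are hard to follow..
- Some handy params and logging also seem to be gone, which is frustrating; for example, the val_loss and test_loss I wanted are not shown automatically..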