The material in this Jupyter Notebook follows AladdinPersson's Pytorch Lightning tutorials.
- https://www.youtube.com/watch?v=XbIN9LaQycQ&list=PLhhyoLH6IjfyL740PTuXef4TstxAK6nGP&ab_channel=AladdinPersson

# Pytorch Lightning
https://www.youtube.com/watch?v=XbIN9LaQycQ&list=PLhhyoLH6IjfyL740PTuXef4TstxAK6nGP&ab_channel=AladdinPersson

Command to install PyTorch Lightning: `pip install lightning`

In [None]:
!pip install lightning

PyTorch Lightning (PL from here on) is a higher-level abstraction over PyTorch, much as Keras sits on top of TensorFlow. Aladdin Persson says he started using PL when he needed multi-GPU training. PL takes care of the setup that multi-GPU distributed training and TPU training require, which he finds very convenient. The `Scale your models` tagline on the PL website refers to this multi-GPU support.

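The second advantage concerns boilerplate. As the `without boilerplate` tagline on the homepage suggests, PL cuts down on repeated code and makes our code more compact, so it is clearer and easier to maintain.
- Boilerplate is code that is repeated in many places with little or no change.
- In my case, I am studying PyTorch Lightning because I want to write cleaner code.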


To practice PL and see its advantages, we work through a simple neural network model.

First, here is a simple example written in plain PyTorch. (The code below is very simple, but the same approach applies to more complex PyTorch code later.)
- Code source: https://github.com/aladdinpersson/Machine-Learning-Collection/blob/master/ML/Pytorch/pytorch_lightning/1.%20start%20code/simple_fc.py

In [None]:
# =========== Import Libraries
import torch
import torch.nn.functional as F
import torchvision.datasets as datasets
import torchvision.transforms as transforms
from torch import nn, optim
from torch.utils.data import DataLoader
from tqdm import tqdm
from torch.utils.data import random_split


# ============ Model made of two FC layers
class NN(nn.Module):
    def __init__(self, input_size, num_classes):
        super().__init__()
        self.fc1 = nn.Linear(input_size, 50)
        self.fc2 = nn.Linear(50, num_classes)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x


# Set device cuda for GPU if it's available otherwise run on the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hyperparameters
input_size = 784
num_classes = 10
learning_rate = 0.001
batch_size = 64
num_epochs = 3

# Load Data
entire_dataset = datasets.MNIST(
    root="dataset/", train=True, transform=transforms.ToTensor(), download=True
)
train_ds, val_ds = random_split(entire_dataset, [50000, 10000])
test_ds = datasets.MNIST(
    root="dataset/", train=False, transform=transforms.ToTensor(), download=True
)
train_loader = DataLoader(dataset=train_ds, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(dataset=train_ds, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(dataset=test_ds, batch_size=batch_size, shuffle=False)

# Initialize network
model = NN(input_size=input_size, num_classes=num_classes).to(device)

# Loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

# Train Network
for epoch in range(num_epochs):
    for batch_idx, (data, targets) in enumerate(tqdm(train_loader)):
        # Get data to cuda if possible
        data = data.to(device=device)
        targets = targets.to(device=device)

        # Get to correct shape
        data = data.reshape(data.shape[0], -1)

        # Forward
        scores = model(data)
        loss = criterion(scores, targets)

        # Backward
        optimizer.zero_grad()
        loss.backward()

        # Gradient descent or adam step
        optimizer.step()


# Check accuracy on training & test to see how good our model
# ==================== If we want to switch to another metric here, such as F1 score or ROC-AUC, this is where boilerplate tends to creep in; we will look at that later too.
def check_accuracy(loader, model):
    num_correct = 0
    num_samples = 0
    model.eval()

    # We don't need to keep track of gradients here so we wrap it in torch.no_grad()
    with torch.no_grad():
        # Loop through the data
        for x, y in loader:

            # Move data to device
            x = x.to(device=device)
            y = y.to(device=device)

            # Get to correct shape
            x = x.reshape(x.shape[0], -1)

            # Forward pass
            scores = model(x)
            _, predictions = scores.max(1)

            # Check how many we got correct
            num_correct += (predictions == y).sum()

            # Keep track of number of samples
            num_samples += predictions.size(0)

    model.train()
    return num_correct / num_samples


# Check accuracy on training & test to see how good our model
model.to(device)
print(f"Accuracy on training set: {check_accuracy(train_loader, model)*100:.2f}")
print(f"Accuracy on validation set: {check_accuracy(val_loader, model)*100:.2f}")
print(f"Accuracy on test set: {check_accuracy(test_loader, model)*100:.2f}")

Now we will convert this code into a Lightning module.

# 2. Lightning Module

https://www.youtube.com/watch?v=HGF2iyThWT8&list=PLhhyoLH6IjfyL740PTuXef4TstxAK6nGP&index=2

The code above is copied as-is, and every change made for Lightning is marked with @@@.  
The PyTorch -> Lightning conversion here happens in the class NN(nn.Module) part.
- pl.LightningModule: covers the same basic functionality as nn.Module. It inherits from nn.Module and adds extra methods on top.

In [None]:
# =========== Import Libraries
import torch
import torch.nn.functional as F
import torchvision.datasets as datasets
import torchvision.transforms as transforms
from torch import nn, optim
from torch.utils.data import DataLoader
from tqdm import tqdm
from torch.utils.data import random_split
import pytorch_lightning as pl # @@@


# ============ Simple model made of two FC layers
class NN(nn.Module):
    def __init__(self, input_size, num_classes):
        super().__init__()
        self.fc1 = nn.Linear(input_size, 50)
        self.fc2 = nn.Linear(50, num_classes)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x


# With only __init__ and forward, kept exactly as above, the code still runs just like the PyTorch version. That alone can hardly be called a well-implemented Lightning version, though.
class NN(pl.LightningModule):
    def __init__(self, input_size, num_classes):
        super().__init__()
        self.fc1 = nn.Linear(input_size, 50)
        self.fc2 = nn.Linear(50, num_classes)
        self.loss_fn = nn.CrossEntropyLoss()

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x

### 2. @@@ Methods added for PL. We do not use them yet, but we will later; they are the most basic pl.LightningModule methods.
    def training_step(self, batch, batch_idx):
      loss, scores, y = self._common_step(batch, batch_idx)
      self.log('train_loss',loss)
      # PL also provides logging functions. https://youtu.be/HGF2iyThWT8?list=PLhhyoLH6IjfyL740PTuXef4TstxAK6nGP&t=604
      # Logging is optional; PL does not require you to implement it.
      return loss

    def validation_step(self, batch, batch_idx):
      loss, scores, y = self._common_step(batch, batch_idx)
      self.log('val_loss',loss)
      return loss

    def test_step(self, batch, batch_idx):
      loss, scores, y = self._common_step(batch, batch_idx)
      self.log('test_loss',loss)
      return loss

    def _common_step(self, batch, batch_idx):
      x, y = batch
      x = x.reshape(x.size(0), -1)
      scores = self.forward(x)
      loss = self.loss_fn(scores, y)
      return loss, scores, y

    def predict_step(self, batch, batch_idx):
      x, y = batch
      x = x.reshape(x.size(0), -1)
      scores = self.forward(x)
      preds = torch.argmax(scores, dim=1)
      return preds

    # Finally, configure the optimizer. In plain PyTorch we would pass model.parameters() to the optimizer, but here we pass self.parameters().
    def configure_optimizers(self):
      return optim.Adam(self.parameters(), lr=0.001)
### @@@

# 3. Trainer

https://www.youtube.com/watch?v=eQvI5eAL0nA&list=PLhhyoLH6IjfyL740PTuXef4TstxAK6nGP&index=3

Here we actually use the training_step that was implemented above but never called.  
We delete the existing Train Network loop and create a trainer object to replace it.
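
To make this hand-off concrete, here is a minimal sketch of the kind of loop the trainer runs on our behalf. `TinyModule`, `toy_fit`, and the synthetic data are hypothetical names made up for this illustration. The real `pl.Trainer` does far more (device placement, validation, logging, checkpointing, mixed precision), so this only approximates the idea and is not Lightning's actual implementation.

```python
import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset


class TinyModule(nn.Module):
    """Plain nn.Module that exposes the same hook names a LightningModule would."""

    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 2)
        self.loss_fn = nn.CrossEntropyLoss()

    def forward(self, x):
        return self.fc(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return self.loss_fn(self.forward(x), y)

    def configure_optimizers(self):
        return optim.Adam(self.parameters(), lr=0.01)


def toy_fit(module, train_loader, max_epochs):
    """Hypothetical, heavily simplified stand-in for trainer.fit()."""
    optimizer = module.configure_optimizers()
    module.train()
    steps = 0
    for epoch in range(max_epochs):
        for batch_idx, batch in enumerate(train_loader):
            loss = module.training_step(batch, batch_idx)  # the module supplies the step
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            steps += 1
    return steps


torch.manual_seed(0)
data = TensorDataset(torch.randn(32, 4), torch.randint(0, 2, (32,)))
loader = DataLoader(data, batch_size=8, shuffle=True)

module = TinyModule()
before = module.fc.weight.detach().clone()
num_steps = toy_fit(module, loader, max_epochs=3)
print(num_steps)  # 4 batches per epoch * 3 epochs = 12
```

The point of the design is that the loop lives in one place while the model only describes what a single step does, which is why the hand-written Train Network loop can be deleted.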

In [None]:
# =========== Import Libraries
import torch
import torch.nn.functional as F
import torchvision.datasets as datasets
import torchvision.transforms as transforms
from torch import nn, optim
from torch.utils.data import DataLoader
from tqdm import tqdm
from torch.utils.data import random_split
import pytorch_lightning as pl # @@@


# ============ Simple model made of two FC layers
class NN(pl.LightningModule):
    def __init__(self, input_size, num_classes):
        super().__init__()
        self.fc1 = nn.Linear(input_size, 50)
        self.fc2 = nn.Linear(50, num_classes)
        self.loss_fn = nn.CrossEntropyLoss()

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x

### 2. @@@ Methods added for PL in step 2. They are the most basic additions, and the trainer further down is what calls them.
    def training_step(self, batch, batch_idx):
      loss, scores, y = self._common_step(batch, batch_idx)
      self.log('train_loss',loss)
      return loss

    def validation_step(self, batch, batch_idx):
      loss, scores, y = self._common_step(batch, batch_idx)
      self.log('val_loss',loss)
      return loss

    def test_step(self, batch, batch_idx):
      loss, scores, y = self._common_step(batch, batch_idx)
      self.log('test_loss',loss)
      return loss

    def _common_step(self, batch, batch_idx):
      x, y = batch
      x = x.reshape(x.size(0), -1)
      scores = self.forward(x)
      loss = self.loss_fn(scores, y)
      return loss, scores, y

    def predict_step(self, batch, batch_idx):
      x, y = batch
      x = x.reshape(x.size(0), -1)
      scores = self.forward(x)
      preds = torch.argmax(scores, dim=1)
      return preds

    def configure_optimizers(self):
      return optim.Adam(self.parameters(), lr=0.001)


# Hyperparameters
input_size = 784
num_classes = 10
learning_rate = 0.001
batch_size = 64
num_epochs = 3

# Load Data
entire_dataset = datasets.MNIST(
    root="dataset/", train=True, transform=transforms.ToTensor(), download=True
)
train_ds, val_ds = random_split(entire_dataset, [50000, 10000])
test_ds = datasets.MNIST(
    root="dataset/", train=False, transform=transforms.ToTensor(), download=True
)
train_loader = DataLoader(dataset=train_ds, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(dataset=train_ds, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(dataset=test_ds, batch_size=batch_size, shuffle=False)



model = NN(input_size=input_size, num_classes=num_classes).to(device)

trainer = pl.Trainer(accelerator="gpu", devices=1, min_epochs=1, max_epochs=3, precision=16) # 3. @@@ The trainer is created here.
trainer.fit(model, train_loader, val_loader) # 3. @@@ This call may change again once we get to the Lightning data module; for now we leave it as it is.
trainer.validate(model, val_loader) # 3. @@@
trainer.test(model, test_loader) # 3. @@@

# Original PyTorch training code
'''
# Train Network
for epoch in range(num_epochs):
    for batch_idx, (data, targets) in enumerate(tqdm(train_loader)):
        # Get data to cuda if possible
        data = data.to(device=device)
        targets = targets.to(device=device)

        # Get to correct shape
        data = data.reshape(data.shape[0], -1)

        # Forward
        scores = model(data)
        loss = criterion(scores, targets)

        # Backward
        optimizer.zero_grad()
        loss.backward()

        # Gradient descent or adam step
        optimizer.step()
'''


# Check accuracy on training & test to see how good our model
def check_accuracy(loader, model):
    num_correct = 0
    num_samples = 0
    model.eval()

    # We don't need to keep track of gradients here so we wrap it in torch.no_grad()
    with torch.no_grad():
        # Loop through the data
        for x, y in loader:

            # Move data to device
            x = x.to(device=device)
            y = y.to(device=device)

            # Get to correct shape
            x = x.reshape(x.shape[0], -1)

            # Forward pass
            scores = model(x)
            _, predictions = scores.max(1)

            # Check how many we got correct
            num_correct += (predictions == y).sum()

            # Keep track of number of samples
            num_samples += predictions.size(0)

    model.train()
    return num_correct / num_samples


# Check accuracy on training & test to see how good our model
model.to(device)
print(f"Accuracy on training set: {check_accuracy(train_loader, model)*100:.2f}")
print(f"Accuracy on validation set: {check_accuracy(val_loader, model)*100:.2f}")
print(f"Accuracy on test set: {check_accuracy(test_loader, model)*100:.2f}")


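The output contains the warning `val_dataloader's sampler has shuffling enabled`, which tells us the validation loader was created with shuffle=True. (Shuffling gives no benefit during validation and makes per-batch results harder to compare between runs, so validation and test loaders should use shuffle=False.)

The accuracy is also identical on the training set and the validation set, which strongly suggests a mistake in how the datasets are handled.
- The cause was that val_loader was also built from train_ds. It should be `DataLoader(dataset=val_ds, batch_size=batch_size, shuffle=False)`.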
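
As a small sketch of the corrected setup, the snippet below uses a synthetic `TensorDataset` in place of MNIST (the sizes 100/80/20 and the batch size of 16 are arbitrary assumptions for illustration). It checks that the two subsets returned by `random_split` do not share any sample, which is exactly the property that was lost when both loaders pointed at `train_ds`.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, random_split

# Synthetic stand-in for MNIST: 100 samples, 784 features, 10 classes.
full_ds = TensorDataset(torch.randn(100, 784), torch.randint(0, 10, (100,)))
train_ds, val_ds = random_split(full_ds, [80, 20])

# Shuffle only the training data; keep validation in a fixed order.
train_loader = DataLoader(train_ds, batch_size=16, shuffle=True)
val_loader = DataLoader(val_ds, batch_size=16, shuffle=False)

# random_split returns Subset objects whose .indices point into full_ds.
overlap = set(train_ds.indices) & set(val_ds.indices)
print(len(train_ds), len(val_ds), len(overlap))
```

With separate subsets, training and validation accuracy are no longer forced to be equal.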

The arguments that pl.Trainer() accepts include the following.
- accelerator: which hardware to run on (e.g. accelerator="gpu")
- gpus: how many GPUs, or which GPUs, to use (e.g. gpus=2 uses two GPUs, gpus=[0,1] uses GPU 0 and GPU 1)
  - devices replaces gpus (e.g. devices=1). gpus was deprecated and no longer exists in Lightning 2.x.

- min_epochs: the minimum number of epochs to train for (e.g. min_epochs=1)
- max_epochs: the maximum number of epochs to train for (e.g. max_epochs=1)
- precision: floating-point precision (e.g. precision=16)

Other pl.Trainer arguments
- logger: for logging, e.g. to TensorBoard
- enable_checkpointing: turns the default checkpoint callback on or off
- num_nodes: for distributed training across several machines
- overfit_batches: overfit on a small number of batches
  - Before any real training run, it is good practice to overfit a single batch to confirm that the code is correct and that training actually works.
- fast_dev_run: if True, runs a single batch of training, validation, and test, so we can check in one go that the whole train/valid/test pipeline runs
- strategy: how to distribute work when using multiple GPUs, e.g. copying the model to each GPU and processing the data in parallel
- profiler: reports where bottlenecks occur, e.g. whether the data loader is taking a long time



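Arguments of trainer.fit()
- model: must be a pl.LightningModule.
- train_dataloaders: the DataLoader(s) used for training.
- val_dataloaders: the DataLoader(s) used for validation.
- datamodule: covered in a later lesson.
- ckpt_path: path of the checkpoint from which to resume training

**We can now see that the trainer is what calls the training_step and validation_step defined inside the PL module.**
- This is what makes the resulting code convenient and compact.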

# 4. Metrics

https://www.youtube.com/watch?v=e6Nw01v2X4s&list=PLhhyoLH6IjfyL740PTuXef4TstxAK6nGP&index=4

Here we look at torchmetrics. It offers a wide range of metrics, from simple accuracy to F1 score and ROC-AUC, and for anything it lacks we can write a custom metric.

In [None]:
!pip install torchmetrics

In [None]:
# =========== Import Libraries
import torch
import torch.nn.functional as F
import torchvision.datasets as datasets
import torchvision.transforms as transforms
from torch import nn, optim
from torch.utils.data import DataLoader
from tqdm import tqdm
from torch.utils.data import random_split
import pytorch_lightning as pl # @@@
import torchmetrics # 4. @@@
from torchmetrics import Metric # 4. @@@ Used below to build our own custom metric.




# Building our own custom metric
class MyAccuracy(Metric): # 4. @@@ Inherits from Metric.
  def __init__(self):
    super().__init__()
    self.add_state("total", default=torch.tensor(0), dist_reduce_fx="sum") # dist_reduce_fx says how this state is combined across processes (e.g. multiple GPUs); "sum" adds the per-process counts.
    self.add_state("correct", default=torch.tensor(0), dist_reduce_fx = "sum")

  def update(self, preds, target):
    preds = torch.argmax(preds, dim=1)
    assert preds.shape == target.shape
    self.correct += torch.sum(preds == target)
    self.total += target.numel() # number of elements in target

  def compute(self):
    return self.correct.float() / self.total.float()

class NN(pl.LightningModule):
    def __init__(self, input_size, num_classes):
        super().__init__()
        self.fc1 = nn.Linear(input_size, 50)
        self.fc2 = nn.Linear(50, num_classes)
        self.loss_fn = nn.CrossEntropyLoss()
        self.accuracy = torchmetrics.Accuracy(task="multiclass", num_classes=num_classes) # 4. @@@
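        self.f1_score = torchmetrics.F1Score(task="multiclass", num_classes=num_classes) # 4. @@@ This should equal accuracy here: as far as I can tell this wrapper defaults to micro averaging, and micro-averaged F1 equals accuracy for single-label multiclass data.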


    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x

### 2. @@@
    def training_step(self, batch, batch_idx):
      loss, scores, y = self._common_step(batch, batch_idx)
      # self.log('train_loss',loss) # 4. @@@ -

      # The added metric code is somewhat slow. Running the profiler later should show that a lot of time is spent here.
      accuracy = self.accuracy(scores, y) # 4. @@@
      f1_score = self.f1_score(scores, y) # 4. @@@
      self.log_dict({"train_loss": loss, "train_accuracy": accuracy, "train_f1_score": f1_score},
                    on_step=False, on_epoch=True, prog_bar=True)
                    # 4. @@@ on_step=True logs the value at every step (batch); on_epoch=True accumulates it and logs the aggregate at the end of each epoch.
                    # With only on_epoch=True, nothing such as accuracy is shown while the first epoch is still running.
                    # With on_step=True, the value changes after every batch, even inside the first epoch.

      return {"loss": loss, "scores": scores, "y": y} # instead of returning only loss

    # def training_epoch_end(self, outputs): # 4. @@@ Receives every output of training_step for the epoch, which is why training_step returns a dictionary.
    #   pass
    # (Kept commented out: this hook was removed in Lightning 2.0 in favour of on_train_epoch_end.)


    def validation_step(self, batch, batch_idx):
      loss, scores, y = self._common_step(batch, batch_idx)
      self.log('val_loss',loss)
      return loss

    def test_step(self, batch, batch_idx):
      loss, scores, y = self._common_step(batch, batch_idx)
      self.log('test_loss',loss)
      return loss

    def _common_step(self, batch, batch_idx):
      x, y = batch
      x = x.reshape(x.size(0), -1)
      scores = self.forward(x)
      loss = self.loss_fn(scores, y)
      return loss, scores, y

    def predict_step(self, batch, batch_idx):
      x, y = batch
      x = x.reshape(x.size(0), -1)
      scores = self.forward(x)
      preds = torch.argmax(scores, dim=1)
      return preds

    def configure_optimizers(self):
      return optim.Adam(self.parameters(), lr=0.001)


# Hyperparameters
input_size = 784
num_classes = 10
learning_rate = 0.001
batch_size = 64
num_epochs = 3

# Load Data
entire_dataset = datasets.MNIST(
    root="dataset/", train=True, transform=transforms.ToTensor(), download=True
)
train_ds, val_ds = random_split(entire_dataset, [50000, 10000])
test_ds = datasets.MNIST(
    root="dataset/", train=False, transform=transforms.ToTensor(), download=True
)
train_loader = DataLoader(dataset=train_ds, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(dataset=val_ds, batch_size=batch_size, shuffle=False)
test_loader = DataLoader(dataset=test_ds, batch_size=batch_size, shuffle=False)



model = NN(input_size=input_size, num_classes=num_classes).to(device)

trainer = pl.Trainer(accelerator="gpu", devices = 1, min_epochs=1, max_epochs=3, precision=16) # 3. @@@
trainer.fit(model, train_loader, val_loader) # 3. @@@
trainer.validate(model, val_loader) # 3. @@@
trainer.test(model, test_loader) # 3. @@@

# Original PyTorch training code
'''
# Train Network
for epoch in range(num_epochs):
    for batch_idx, (data, targets) in enumerate(tqdm(train_loader)):
        # Get data to cuda if possible
        data = data.to(device=device)
        targets = targets.to(device=device)

        # Get to correct shape
        data = data.reshape(data.shape[0], -1)

        # Forward
        scores = model(data)
        loss = criterion(scores, targets)

        # Backward
        optimizer.zero_grad()
        loss.backward()

        # Gradient descent or adam step
        optimizer.step()
'''


# Check accuracy on training & test to see how good our model
# This code is error prone and messy.
def check_accuracy(loader, model):
    num_correct = 0
    num_samples = 0
    model.eval()

    # We don't need to keep track of gradients here so we wrap it in torch.no_grad()
    with torch.no_grad():
        # Loop through the data
        for x, y in loader:

            # Move data to device
            x = x.to(device=device)
            y = y.to(device=device)

            # Get to correct shape
            x = x.reshape(x.shape[0], -1)

            # Forward pass
            scores = model(x)
            _, predictions = scores.max(1)

            # Check how many we got correct
            num_correct += (predictions == y).sum()

            # Keep track of number of samples
            num_samples += predictions.size(0)

    model.train()
    return num_correct / num_samples


# Check accuracy on training & test to see how good our model
model.to(device)
print(f"Accuracy on training set: {check_accuracy(train_loader, model)*100:.2f}")
print(f"Accuracy on validation set: {check_accuracy(val_loader, model)*100:.2f}")
print(f"Accuracy on test set: {check_accuracy(test_loader, model)*100:.2f}")

- `self.accuracy = torchmetrics.Accuracy(task="multiclass", num_classes=num_classes)`
- `self.f1_score = torchmetrics.F1Score(task="multiclass", num_classes=num_classes)`

These two lines from `NN.__init__` can be taken as typical examples of how torchmetrics is used.
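
The `MyAccuracy` class above follows an update/compute pattern: `update` adds to running counts for every batch, and `compute` turns the counts into the final value. The framework-free sketch below (the `RunningAccuracy` class and the toy batches are hypothetical, and no torchmetrics API is used) shows why the counts are accumulated instead of averaging per-batch accuracies: the two give different answers when batch sizes differ.

```python
class RunningAccuracy:
    """Framework-free sketch of the update/compute pattern (hypothetical class)."""

    def __init__(self):
        self.correct = 0
        self.total = 0

    def update(self, preds, targets):
        # Accumulate raw counts, not ratios.
        self.correct += sum(int(p == t) for p, t in zip(preds, targets))
        self.total += len(targets)

    def compute(self):
        return self.correct / self.total


# Two batches of unequal size: 4 of 4 correct, then 0 of 1 correct.
batches = [([1, 0, 2, 2], [1, 0, 2, 2]), ([1], [0])]

metric = RunningAccuracy()
per_batch = []
for preds, targets in batches:
    metric.update(preds, targets)
    hits = sum(int(p == t) for p, t in zip(preds, targets))
    per_batch.append(hits / len(targets))

mean_of_batches = sum(per_batch) / len(per_batch)
print(metric.compute())   # 0.8 (4 correct out of 5 samples)
print(mean_of_batches)    # 0.5 (average of 1.0 and 0.0)
```

Keeping `correct` and `total` as summed state is also what allows the counts from several processes to be added together, which is the role of `dist_reduce_fx="sum"` in `MyAccuracy`.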

# 5. Data Module

https://www.youtube.com/watch?v=NjF1ZpRO4Ws&list=PLhhyoLH6IjfyL740PTuXef4TstxAK6nGP&index=5

In [None]:
# =========== Import Libraries
import torch
import torch.nn.functional as F
import torchvision.datasets as datasets
import torchvision.transforms as transforms
from torch import nn, optim
from torch.utils.data import DataLoader
from tqdm import tqdm
from torch.utils.data import random_split
import pytorch_lightning as pl # @@@
import torchmetrics # 4. @@@
from torchmetrics import Metric # 4. @@@ Used below to build our own custom metric.




# Building our own custom metric
class MyAccuracy(Metric): # 4. @@@ Inherits from Metric.
  def __init__(self):
    super().__init__()
    self.add_state("total", default=torch.tensor(0), dist_reduce_fx="sum") # dist_reduce_fx says how this state is combined across processes (e.g. multiple GPUs); "sum" adds the per-process counts.
    self.add_state("correct", default=torch.tensor(0), dist_reduce_fx = "sum")

  def update(self, preds, target):
    preds = torch.argmax(preds, dim=1)
    assert preds.shape == target.shape
    self.correct += torch.sum(preds == target)
    self.total += target.numel() # number of elements in target

  def compute(self):
    return self.correct.float() / self.total.float()

class NN(pl.LightningModule):
  def __init__(self, input_size, num_classes):
    super().__init__()
    self.fc1 = nn.Linear(input_size, 50)
    self.fc2 = nn.Linear(50, num_classes)
    self.loss_fn = nn.CrossEntropyLoss()
    self.accuracy = torchmetrics.Accuracy(task="multiclass", num_classes=num_classes) # 4. @@@
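    self.f1_score = torchmetrics.F1Score(task="multiclass", num_classes=num_classes) # 4. @@@ This should equal accuracy here: as far as I can tell this wrapper defaults to micro averaging, and micro-averaged F1 equals accuracy for single-label multiclass data.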


  def forward(self, x):
    x = F.relu(self.fc1(x))
    x = self.fc2(x)
    return x

  ### 2. @@@
  def training_step(self, batch, batch_idx):
    loss, scores, y = self._common_step(batch, batch_idx)
    # self.log('train_loss',loss) # 4. @@@ -

    # The added metric code is somewhat slow. Running the profiler later should show that a lot of time is spent here.
    accuracy = self.accuracy(scores, y) # 4. @@@
    f1_score = self.f1_score(scores, y) # 4. @@@
    self.log_dict({"train_loss": loss, "train_accuracy": accuracy, "train_f1_score": f1_score},
                  on_step=False, on_epoch=True, prog_bar=True)
                  # 4. @@@ on_step=True logs the value at every step (batch); on_epoch=True accumulates it and logs the aggregate at the end of each epoch.
                  # With only on_epoch=True, nothing such as accuracy is shown while the first epoch is still running.
                  # With on_step=True, the value changes after every batch, even inside the first epoch.

    return {"loss": loss, "scores": scores, "y": y} # instead of returning only loss

  def validation_step(self, batch, batch_idx):
    loss, scores, y = self._common_step(batch, batch_idx)
    self.log('val_loss',loss)
    return loss

  def test_step(self, batch, batch_idx):
    loss, scores, y = self._common_step(batch, batch_idx)
    self.log('test_loss',loss)
    return loss

  def _common_step(self, batch, batch_idx):
    x, y = batch
    x = x.reshape(x.size(0), -1)
    scores = self.forward(x)
    loss = self.loss_fn(scores, y)
    return loss, scores, y

  def predict_step(self, batch, batch_idx):
    x, y = batch
    x = x.reshape(x.size(0), -1)
    scores = self.forward(x)
    preds = torch.argmax(scores, dim=1)
    return preds

  # Finally, configure the optimizer. In plain PyTorch we would pass model.parameters(), but here we pass self.parameters().
  def configure_optimizers(self):
    return optim.Adam(self.parameters(), lr=0.001)


class MnistDataModule(pl.LightningDataModule): # @@@ 5.
  def __init__(self, data_dir, batch_size, num_workers):
    super().__init__()
    self.data_dir = data_dir
    self.batch_size = batch_size
    self.num_workers = num_workers

  def prepare_data(self):
    datasets.MNIST(self.data_dir, train=True, download=True)
    datasets.MNIST(self.data_dir, train=False, download=True)

  def setup(self, stage): # stage is "fit", "validate", "test", or "predict"; this implementation ignores it and builds every split each time
    # With multiple GPUs, setup() is called on every process (one per GPU), while prepare_data() is called from only one process per node
    entire_dataset = datasets.MNIST(
        root = self.data_dir,
        train=True,
        transform = transforms.ToTensor(),
        download=False
    )

    self.train_ds, self.val_ds = random_split(entire_dataset, (50000, 10000))

    # Held-out test split
    self.test_ds = datasets.MNIST(self.data_dir,
                                  train=False,
                                  transform = transforms.ToTensor(),
                                  download=False)

  def train_dataloader(self): # I first named this train_loader, but an error said it must be defined as train_dataloader.
    return DataLoader(self.train_ds,
                      batch_size=self.batch_size,
                      num_workers=self.num_workers,
                      shuffle=True)

  def val_dataloader(self):
    return DataLoader(self.val_ds,
                      batch_size=self.batch_size,
                      num_workers=self.num_workers,
                      shuffle=False)

  def test_dataloader(self):
    return DataLoader(self.test_ds,
                      batch_size=self.batch_size,
                      num_workers=self.num_workers,
                      shuffle=False)


# Hyperparameters
input_size = 784
num_classes = 10
learning_rate = 0.001
batch_size = 64
num_epochs = 3

'''
# Load Data
entire_dataset = datasets.MNIST(
    root="dataset/", train=True, transform=transforms.ToTensor(), download=True
)
train_ds, val_ds = random_split(entire_dataset, [50000, 10000])
test_ds = datasets.MNIST(
    root="dataset/", train=False, transform=transforms.ToTensor(), download=True
)
train_loader = DataLoader(dataset=train_ds, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(dataset=train_ds, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(dataset=test_ds, batch_size=batch_size, shuffle=False)
'''

dm = MnistDataModule(data_dir = "dataset/", batch_size = batch_size, num_workers=4) # 5. @@@

model = NN(input_size=input_size, num_classes=num_classes)

trainer = pl.Trainer(accelerator="gpu", devices=1, min_epochs=1, max_epochs=3, precision=16) # 3. @@@ The trainer is created here.
trainer.fit(model, dm) # 3. @@@ As anticipated in step 3, this call now takes the data module instead of the individual loaders.
trainer.validate(model, dm) # 3. @@@
trainer.test(model, dm) # 3. @@@

INFO:pytorch_lightning.utilities.rank_zero:Using 16bit Automatic Mixed Precision (AMP)
INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.callbacks.model_summary:
  | Name     | Type               | Params
------------------------------------------------
0 | fc1      | Linear             | 39.2 K
1 | fc2      | Linear             | 510   
2 | loss_fn  | CrossEntropyLoss   | 0     
3 | accuracy | MulticlassAccuracy | 0     
4 | f1_score | MulticlassF1Score  | 0     
------------------------------------------------
39.8 K    Trainable params
0         Non-trainable params
39.8 K    Total params
0.159     Total estim

INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=3` reached.
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


[{'test_loss': 0.2638879716396332}]

# 6. Restructuring the Code

https://www.youtube.com/watch?v=UtQoZ_v57uI&list=PLhhyoLH6IjfyL740PTuXef4TstxAK6nGP&index=6

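This time we will make the code cleaner.  
So far all of the code has lived in a single Python file; now we will split it into modules.  
The files are config.py, dataset.py, model.py, and train.py. In this notebook, comment banners mark where each file begins.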

In [None]:
# ================================= dataset.py

import torch
import torch.nn.functional as F
import torchvision.datasets as datasets
import torchvision.transforms as transforms
from torch import nn, optim
from torch.utils.data import DataLoader
from tqdm import tqdm
from torch.utils.data import random_split
import pytorch_lightning as pl # @@@
import torchmetrics # 4. @@@
from torchmetrics import Metric # 4. @@@ Used to build our own custom metric.

class MnistDataModule(pl.LightningDataModule): # @@@ 5.
  def __init__(self, data_dir, batch_size, num_workers):
    super().__init__()
    self.data_dir = data_dir
    self.batch_size = batch_size
    self.num_workers = num_workers

  def prepare_data(self):
    datasets.MNIST(self.data_dir, train=True, download=True)
    datasets.MNIST(self.data_dir, train=False, download=True)

  def setup(self, stage): # stage is "fit", "validate", "test", or "predict"; this implementation ignores it and builds every split each time
    # With multiple GPUs, setup() is called on every process (one per GPU), while prepare_data() is called from only one process per node
    entire_dataset = datasets.MNIST(
        root = self.data_dir,
        train=True,
        transform = transforms.ToTensor(),
        download=False
    )

    self.train_ds, self.val_ds = random_split(entire_dataset, (50000, 10000))

    # Held-out test split
    self.test_ds = datasets.MNIST(self.data_dir,
                                  train=False,
                                  transform = transforms.ToTensor(),
                                  download=False)

  def train_dataloader(self): # I first named this train_loader, but an error said it must be defined as train_dataloader.
    return DataLoader(self.train_ds,
                      batch_size=self.batch_size,
                      num_workers=self.num_workers,
                      shuffle=True)

  def val_dataloader(self):
    return DataLoader(self.val_ds,
                      batch_size=self.batch_size,
                      num_workers=self.num_workers,
                      shuffle=False)

  def test_dataloader(self):
    return DataLoader(self.test_ds,
                      batch_size=self.batch_size,
                      num_workers=self.num_workers,
                      shuffle=False)





# ================================= model.py

import torch
import torch.nn.functional as F
import torchvision.datasets as datasets
import torchvision.transforms as transforms
from torch import nn, optim
from torch.utils.data import DataLoader
from tqdm import tqdm
from torch.utils.data import random_split
import pytorch_lightning as pl # @@@
import torchmetrics # 4. @@@
from torchmetrics import Metric # 4. @@@ Used to build our own custom metric.

class NN(pl.LightningModule):
  def __init__(self, input_size, learning_rate, num_classes):
    super().__init__()
    self.fc1 = nn.Linear(input_size, 50)
    self.fc2 = nn.Linear(50, num_classes)
    self.lr = learning_rate
    self.loss_fn = nn.CrossEntropyLoss()
    self.accuracy = torchmetrics.Accuracy(task="multiclass", num_classes=num_classes) # 4. @@@
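    self.f1_score = torchmetrics.F1Score(task="multiclass", num_classes=num_classes) # 4. @@@ This should equal accuracy here: as far as I can tell this wrapper defaults to micro averaging, and micro-averaged F1 equals accuracy for single-label multiclass data.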


  def forward(self, x):
    x = F.relu(self.fc1(x))
    x = self.fc2(x)
    return x

  ### 2. @@@
  def training_step(self, batch, batch_idx):
    loss, scores, y = self._common_step(batch, batch_idx)
    # self.log('train_loss',loss) # 4. @@@ -

    # The added metric code is somewhat slow. Running the profiler later should show that a lot of time is spent here.
    accuracy = self.accuracy(scores, y) # 4. @@@
    f1_score = self.f1_score(scores, y) # 4. @@@
    self.log_dict({"train_loss": loss, "train_accuracy": accuracy, "train_f1_score": f1_score},
                  on_step=False, on_epoch=True, prog_bar=True)
                  # 4. @@@ on_step=True logs the value at every step (batch); on_epoch=True accumulates it and logs the aggregate at the end of each epoch.
                  # With only on_epoch=True, nothing such as accuracy is shown while the first epoch is still running.
                  # With on_step=True, the value changes after every batch, even inside the first epoch.

    return {"loss": loss, "scores": scores, "y": y} # instead of returning only loss

  def validation_step(self, batch, batch_idx):
    loss, scores, y = self._common_step(batch, batch_idx)
    self.log('val_loss',loss)
    return loss

  def test_step(self, batch, batch_idx):
    loss, scores, y = self._common_step(batch, batch_idx)
    self.log('test_loss',loss)
    return loss

  def _common_step(self, batch, batch_idx):
    x, y = batch
    x = x.reshape(x.size(0), -1)
    scores = self.forward(x)
    loss = self.loss_fn(scores, y)
    return loss, scores, y

  def predict_step(self, batch, batch_idx):
    x, y = batch
    x = x.reshape(x.size(0), -1)
    scores = self.forward(x)
    preds = torch.argmax(scores, dim=1)
    return preds

  # Finally, configure the optimizer. Normally you would pass model.parameters(), but inside a LightningModule we pass self.parameters().
  def configure_optimizers(self):
    return optim.Adam(self.parameters(), lr=self.lr)








# ================================= config.py -> settings

# Training Hyperparameters
INPUT_SIZE = 784
NUM_CLASSES = 10
LEARNING_RATE = 0.001
BATCH_SIZE = 64
NUM_EPOCHS = 3

# Dataset
DATA_DIR = "dataset/"
NUM_WORKERS = 4

# Computer related
ACCELERATOR = "gpu"
DEVICES = [0]
PRECISION = 16




# ================================= train.py

import torch
import pytorch_lightning as pl # @@@
from model import NN
from dataset import MnistDataModule
import config

dm = MnistDataModule(data_dir=config.DATA_DIR, batch_size=config.BATCH_SIZE, num_workers=config.NUM_WORKERS) # 5. @@@

model = NN(input_size=config.INPUT_SIZE, learning_rate=config.LEARNING_RATE, num_classes=config.NUM_CLASSES)

trainer = pl.Trainer(accelerator=config.ACCELERATOR, devices=config.DEVICES, min_epochs=1, max_epochs=config.NUM_EPOCHS, precision=config.PRECISION) # 3. @@@ The Trainer is created here.
trainer.fit(model, dm) # 3. @@@ Since step 5 the data comes from the LightningDataModule.
trainer.validate(model, dm) # 3. @@@
trainer.test(model, dm) # 3. @@@
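
The `on_step` / `on_epoch` flags passed to `log_dict` in model.py can be pictured with a small framework-free sketch. `ToyLogger` below is a hypothetical helper, not Lightning's logger: it only mimics the idea that step-level logging emits every value, while epoch-level logging averages the values seen during the epoch and emits once. Lightning's real reduction has more options, so treat this as an approximation.

```python
class ToyLogger:
    """Hypothetical stand-in that mimics step-level vs epoch-level logging."""

    def __init__(self):
        self.emitted = []   # (kind, name, value) tuples that would reach the progress bar
        self._buffer = {}   # values collected during the current epoch

    def log(self, name, value, on_step=False, on_epoch=True):
        if on_step:
            # Step-level logging: emit immediately, once per batch.
            self.emitted.append(("step", name, value))
        if on_epoch:
            # Epoch-level logging: only remember the value for now.
            self._buffer.setdefault(name, []).append(value)

    def end_epoch(self):
        # Emit one averaged value per metric, then reset for the next epoch.
        for name, values in self._buffer.items():
            self.emitted.append(("epoch", name, sum(values) / len(values)))
        self._buffer.clear()


logger = ToyLogger()
for batch_loss in [1.0, 0.5, 0.0]:
    logger.log("train_loss", batch_loss, on_step=False, on_epoch=True)
logger.end_epoch()
print(logger.emitted)
```

With `on_step=False, on_epoch=True` nothing is emitted until `end_epoch()` runs, which matches the observation that no accuracy appears during the first epoch.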

In [None]:
# Main code:

import torch
import pytorch_lightning as pl
from model import NN
from dataset import MnistDataModule
import config

**pip install black** installs Black, a formatter that can be used to keep the code style consistent.
- It is used near the end of the video.

# 7. Callbacks

https://www.youtube.com/watch?v=UtQoZ_v57uI&list=PLhhyoLH6IjfyL740PTuXef4TstxAK6nGP&index=6

This part looks at callbacks, a concept that also exists in Keras.
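
As a rough mental model (an illustration only, not how Lightning is implemented internally), a callback is an object whose hook methods the training loop calls at fixed points. `ToyCallback`, `PrintingCallback` and `ToyTrainer` below are hypothetical names; only the hook names `on_train_start` and `on_train_end` mirror the ones used in this section.

```python
class ToyCallback:
    """Base class with no-op hooks; subclasses override the ones they need."""

    def on_train_start(self, trainer):
        pass

    def on_train_end(self, trainer):
        pass


class PrintingCallback(ToyCallback):
    def __init__(self):
        self.messages = []

    def on_train_start(self, trainer):
        self.messages.append("Starting to train")

    def on_train_end(self, trainer):
        self.messages.append("Training is done")


class ToyTrainer:
    def __init__(self, callbacks):
        self.callbacks = callbacks
        self.epochs_run = 0

    def fit(self, max_epochs):
        for cb in self.callbacks:
            cb.on_train_start(self)      # hook fired once before the loop
        for _ in range(max_epochs):
            self.epochs_run += 1         # a real loop would train on batches here
        for cb in self.callbacks:
            cb.on_train_end(self)        # hook fired once after the loop


printer = PrintingCallback()
trainer = ToyTrainer(callbacks=[printer])
trainer.fit(max_epochs=3)
print(printer.messages)
```

The training loop never needs to know what a callback does; it only promises to call the hooks at the right moments.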

In [None]:
# =========== Import Libraries
import torch
import torch.nn.functional as F
import torchvision.datasets as datasets
import torchvision.transforms as transforms
from torch import nn, optim
from torch.utils.data import DataLoader
from tqdm import tqdm
from torch.utils.data import random_split
import pytorch_lightning as pl # @@@
import torchmetrics # 4. @@@
from torchmetrics import Metric # 4. @@@ Used later to build our own custom metric.
from pytorch_lightning.callbacks import EarlyStopping, Callback # 7. @@@



class MyPrintingCallback(Callback): # 7. @@@
  def __init__(self):
    super().__init__()

  def on_train_start(self, trainer, pl_module): # Called once when training starts.
    print("Starting to train")

  def on_train_end(self, trainer, pl_module): # Called once when training ends.
    print("Training is done")



# Building our own custom metric
class MyAccuracy(Metric): # 4. @@@ Inherits from Metric.
  def __init__(self):
    super().__init__()
    self.add_state("total", default=torch.tensor(0), dist_reduce_fx = "sum") # dist_reduce_fx sets how this state is combined across processes in multi-GPU training.
    self.add_state("correct", default=torch.tensor(0), dist_reduce_fx = "sum")

  def update(self, preds, target):
    preds = torch.argmax(preds, dim=1)
    assert preds.shape == target.shape
    self.correct += torch.sum(preds == target)
    self.total += target.numel() # number of elements in target

  def compute(self):
    return self.correct.float() / self.total.float()

class NN(pl.LightningModule):
  def __init__(self, input_size, num_classes):
    super().__init__()
    self.fc1 = nn.Linear(input_size, 50)
    self.fc2 = nn.Linear(50, num_classes)
    self.loss_fn = nn.CrossEntropyLoss()
    self.accuracy = torchmetrics.Accuracy(task="multiclass", num_classes=num_classes) # 4. @@@
    self.f1_score = torchmetrics.F1Score(task="multiclass", num_classes = num_classes) # 4. @@@ The classes are roughly balanced here, so this should come out about the same as accuracy.


  def forward(self, x):
    x = F.relu(self.fc1(x))
    x = self.fc2(x)
    return x

  ### 2. @@@
  def training_step(self, batch, batch_idx):
    loss, scores, y = self._common_step(batch, batch_idx)
    # self.log('train_loss',loss) # 4. @@@ -

    # The added metric code is somewhat slow. Running a profiler later should show that a lot of time is spent here.
    accuracy = self.accuracy(scores, y) # 4. @@@
    f1_score = self.f1_score(scores, y) # 4. @@@
    self.log_dict({"train_loss": loss, "train_accuracy": accuracy, "train_f1_score":f1_score},
                  on_step=False, on_epoch=True, prog_bar = True)
                  # 4. @@@ on_step=True logs the value at every batch (step); on_epoch=True aggregates it and logs once per epoch.
                  # With only on_epoch=True, values such as accuracy are not shown while the first epoch is still running.
                  # With on_step=True, the displayed values change after every batch, even within the first epoch.

    return {"loss": loss, "scores": scores, "y":y} # previously just loss

  def validation_step(self, batch, batch_idx):
    loss, scores, y = self._common_step(batch, batch_idx)
    self.log('val_loss',loss)
    return loss

  def test_step(self, batch, batch_idx):
    loss, scores, y = self._common_step(batch, batch_idx)
    self.log('test_loss',loss)
    return loss

  def _common_step(self, batch, batch_idx):
    x, y = batch
    x = x.reshape(x.size(0), -1)
    scores = self.forward(x)
    loss = self.loss_fn(scores, y)
    return loss, scores, y

  def predict_step(self, batch, batch_idx):
    x, y = batch
    x = x.reshape(x.size(0), -1)
    scores = self.forward(x)
    preds = torch.argmax(scores, dim=1)
    return preds

  # Finally, configure the optimizer. Normally you would pass model.parameters(), but inside a LightningModule we pass self.parameters().
  def configure_optimizers(self):
    return optim.Adam(self.parameters(), lr=0.001)


class MnistDataModule(pl.LightningDataModule): # @@@ 5.
  def __init__(self, data_dir, batch_size, num_workers):
    super().__init__()
    self.data_dir = data_dir
    self.batch_size = batch_size
    self.num_workers = num_workers

  def prepare_data(self):
    datasets.MNIST(self.data_dir, train=True, download=True)
    datasets.MNIST(self.data_dir, train=False, download=True)

  def setup(self, stage): # TODO: study stage further; it names the phase being set up ("fit", "validate", "test" or "predict").
    # done with multiple gpu, this is called on every gpu in the system
    entire_dataset = datasets.MNIST(
        root = self.data_dir,
        train=True,
        transform = transforms.ToTensor(),
        download=False
    )

    self.train_ds, self.val_ds = random_split(entire_dataset, (50000, 10000))

    # Held-out MNIST test split, kept separate from the train/val split above.
    self.test_ds = datasets.MNIST(self.data_dir,
                                  train=False,
                                  transform = transforms.ToTensor(),
                                  download=False)

  def train_dataloader(self): # Naming this method train_loader raised an error saying it must be defined as train_dataloader.
    return DataLoader(self.train_ds,
                      batch_size=self.batch_size,
                      num_workers=self.num_workers,
                      shuffle=True)

  def val_dataloader(self):
    return DataLoader(self.val_ds,
                      batch_size=self.batch_size,
                      num_workers=self.num_workers,
                      shuffle=False)

  def test_dataloader(self):
    return DataLoader(self.test_ds,
                      batch_size=self.batch_size,
                      num_workers=self.num_workers,
                      shuffle=False)


# Hyperparameters
input_size = 784
num_classes = 10
learning_rate = 0.001
batch_size = 64
num_epochs = 3

torch.set_float32_matmul_precision("medium") # Reportedly speeds up Lightning training by allowing lower-precision float32 matrix multiplications.

dm = MnistDataModule(data_dir = "dataset/", batch_size = batch_size, num_workers=4) # 5. @@@

model = NN(input_size=input_size, num_classes=num_classes)

trainer = pl.Trainer(accelerator="gpu",
                     devices = 1,
                     min_epochs=1,
                     max_epochs=1000,
                     precision=16,
                     callbacks=[MyPrintingCallback(), EarlyStopping(monitor="val_loss")]) # 3. @@@ The Trainer is created here.
trainer.fit(model, dm) # 3. @@@ Since step 5 the data comes from the LightningDataModule.
trainer.validate(model, dm) # 3. @@@
trainer.test(model, dm) # 3. @@@

  rank_zero_warn(
INFO:pytorch_lightning.utilities.rank_zero:Using 16bit Automatic Mixed Precision (AMP)
INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs


Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to dataset/MNIST/raw/train-images-idx3-ubyte.gz


100%|██████████| 9912422/9912422 [00:00<00:00, 94920414.89it/s]


Extracting dataset/MNIST/raw/train-images-idx3-ubyte.gz to dataset/MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz to dataset/MNIST/raw/train-labels-idx1-ubyte.gz


100%|██████████| 28881/28881 [00:00<00:00, 28650826.35it/s]


Extracting dataset/MNIST/raw/train-labels-idx1-ubyte.gz to dataset/MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz to dataset/MNIST/raw/t10k-images-idx3-ubyte.gz


100%|██████████| 1648877/1648877 [00:00<00:00, 27186496.94it/s]


Extracting dataset/MNIST/raw/t10k-images-idx3-ubyte.gz to dataset/MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz to dataset/MNIST/raw/t10k-labels-idx1-ubyte.gz


100%|██████████| 4542/4542 [00:00<00:00, 23119573.75it/s]


Extracting dataset/MNIST/raw/t10k-labels-idx1-ubyte.gz to dataset/MNIST/raw



INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.callbacks.model_summary:
  | Name     | Type               | Params
------------------------------------------------
0 | fc1      | Linear             | 39.2 K
1 | fc2      | Linear             | 510   
2 | loss_fn  | CrossEntropyLoss   | 0     
3 | accuracy | MulticlassAccuracy | 0     
4 | f1_score | MulticlassF1Score  | 0     
------------------------------------------------
39.8 K    Trainable params
0         Non-trainable params
39.8 K    Total params
0.159     Total estimated model params size (MB)


Sanity Checking: 0it [00:00, ?it/s]



Starting to train


Training: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

  rank_zero_warn("Detected KeyboardInterrupt, attempting graceful shutdown...")
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


Validation: 0it [00:00, ?it/s]

INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Exception ignored in: <function _after_fork at 0x7f3438661990>
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1635, in _after_fork
    thread._reset_internal_locks(False)
  File "/usr/lib/python3.10/threading.py", line 885, in _reset_internal_locks
    def _reset_internal_locks(self, is_alive):
KeyboardInterrupt: 
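
The custom `MyAccuracy` metric in the cell above keeps two running counters, `correct` and `total`, and `dist_reduce_fx="sum"` asks for those counters to be added up across processes before the final value is computed. A plain-Python sketch of that idea, with made-up per-device predictions (`ToyAccuracy` and `merge` are hypothetical names, not torchmetrics API):

```python
class ToyAccuracy:
    """Framework-free imitation of the update/compute pattern."""

    def __init__(self):
        self.correct = 0
        self.total = 0

    def update(self, preds, target):
        assert len(preds) == len(target)
        self.correct += sum(p == t for p, t in zip(preds, target))
        self.total += len(target)

    def compute(self):
        return self.correct / self.total


def merge(metrics):
    """Sum-reduce the states of several per-device metrics into one."""
    merged = ToyAccuracy()
    merged.correct = sum(m.correct for m in metrics)
    merged.total = sum(m.total for m in metrics)
    return merged


gpu0, gpu1 = ToyAccuracy(), ToyAccuracy()
gpu0.update(preds=[1, 2, 3, 4], target=[1, 2, 0, 4])  # 3 of 4 correct
gpu1.update(preds=[5, 6], target=[5, 0])              # 1 of 2 correct
overall = merge([gpu0, gpu1]).compute()
print(overall)
```

Summing the counters, rather than averaging the per-device accuracies, keeps the result correct when the devices see different numbers of samples.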


In [None]:
# callbacks.py

from pytorch_lightning.callbacks import EarlyStopping, Callback

class MyPrintingCallback(Callback):
  def __init__(self):
    super().__init__()

  def on_train_start(self, trainer, pl_module): # Called once when training starts.
    print("Starting to train")

  def on_train_end(self, trainer, pl_module): # Called once when training ends.
    print("Training is done")


# 8. Logging with Tensorboard

https://www.youtube.com/watch?v=iCO3h4WhvdQ&list=PLhhyoLH6IjfyL740PTuXef4TstxAK6nGP&index=8

In [None]:
!pip install tensorboard



In [None]:
# =========== Import Libraries
import torch
import torch.nn.functional as F
import torchvision.datasets as datasets
import torchvision.transforms as transforms
from torch import nn, optim
from torch.utils.data import DataLoader
from tqdm import tqdm
from torch.utils.data import random_split
import pytorch_lightning as pl # @@@
import torchmetrics # 4. @@@
from torchmetrics import Metric # 4. @@@ Used later to build our own custom metric.
from pytorch_lightning.callbacks import EarlyStopping, Callback # 7. @@@
from torchvision.transforms import RandomHorizontalFlip, RandomVerticalFlip # 8. @@@
import torchvision
from pytorch_lightning.loggers import TensorBoardLogger


class MyPrintingCallback(Callback): # 7. @@@
  def __init__(self):
    super().__init__()

  def on_train_start(self, trainer, pl_module): # Called once when training starts.
    print("Starting to train")

  def on_train_end(self, trainer, pl_module): # Called once when training ends.
    print("Training is done")



# Building our own custom metric
class MyAccuracy(Metric): # 4. @@@ Inherits from Metric.
  def __init__(self):
    super().__init__()
    self.add_state("total", default=torch.tensor(0), dist_reduce_fx = "sum") # dist_reduce_fx sets how this state is combined across processes in multi-GPU training.
    self.add_state("correct", default=torch.tensor(0), dist_reduce_fx = "sum")

  def update(self, preds, target):
    preds = torch.argmax(preds, dim=1)
    assert preds.shape == target.shape
    self.correct += torch.sum(preds == target)
    self.total += target.numel() # number of elements in target

  def compute(self):
    return self.correct.float() / self.total.float()


logger = TensorBoardLogger("tb_logs", name = "mnist_model_v1") # 8. @@@ tb_logs is the directory where the TensorBoard logs are saved.

class NN(pl.LightningModule):
  def __init__(self, input_size, num_classes):
    super().__init__()
    self.fc1 = nn.Linear(input_size, 50)
    self.fc2 = nn.Linear(50, num_classes)
    self.loss_fn = nn.CrossEntropyLoss()
    self.accuracy = torchmetrics.Accuracy(task="multiclass", num_classes=num_classes) # 4. @@@
    self.f1_score = torchmetrics.F1Score(task="multiclass", num_classes = num_classes) # 4. @@@ The classes are roughly balanced here, so this should come out about the same as accuracy.


  def forward(self, x):
    x = F.relu(self.fc1(x))
    x = self.fc2(x)
    return x

  ### 2. @@@
  def training_step(self, batch, batch_idx):
    x, y = batch
    loss, scores, y = self._common_step(batch, batch_idx)
    # self.log('train_loss',loss) # 4. @@@ -

    # The added metric code is somewhat slow. Running a profiler later should show that a lot of time is spent here.
    accuracy = self.accuracy(scores, y) # 4. @@@
    f1_score = self.f1_score(scores, y) # 4. @@@
    self.log_dict({"train_loss": loss,
                   "train_accuracy": accuracy,
                   "train_f1_score":f1_score},
                  on_step=False, on_epoch=True, prog_bar = True)
                  # 4. @@@ on_step=True logs the value at every batch (step); on_epoch=True aggregates it and logs once per epoch.
                  # With only on_epoch=True, values such as accuracy are not shown while the first epoch is still running.
                  # With on_step=True, the displayed values change after every batch, even within the first epoch.

    if batch_idx % 100 == 0: # 8. @@@
      x = x[:8]
      grid = torchvision.utils.make_grid(x.view(-1,1,28,28)) # Arranges the first 8 images of the batch into a single grid image.
      self.logger.experiment.add_image("mnist_images", grid, self.global_step) # See the TensorBoard part of the PyTorch docs for add_image.

    return {"loss": loss, "scores": scores, "y":y} # previously just loss

  def validation_step(self, batch, batch_idx):
    loss, scores, y = self._common_step(batch, batch_idx)
    self.log('val_loss',loss)
    return loss

  def test_step(self, batch, batch_idx):
    loss, scores, y = self._common_step(batch, batch_idx)
    self.log('test_loss',loss)
    return loss

  def _common_step(self, batch, batch_idx):
    x, y = batch
    x = x.reshape(x.size(0), -1)
    scores = self.forward(x)
    loss = self.loss_fn(scores, y)
    return loss, scores, y

  def predict_step(self, batch, batch_idx):
    x, y = batch
    x = x.reshape(x.size(0), -1)
    scores = self.forward(x)
    preds = torch.argmax(scores, dim=1)
    return preds

  # Finally, configure the optimizer. Normally you would pass model.parameters(), but inside a LightningModule we pass self.parameters().
  def configure_optimizers(self):
    return optim.Adam(self.parameters(), lr=0.001)


class MnistDataModule(pl.LightningDataModule): # @@@ 5.
  def __init__(self, data_dir, batch_size, num_workers):
    super().__init__()
    self.data_dir = data_dir
    self.batch_size = batch_size
    self.num_workers = num_workers

  def prepare_data(self):
    datasets.MNIST(self.data_dir, train=True, download=True)
    datasets.MNIST(self.data_dir, train=False, download=True)

  def setup(self, stage): # TODO: study stage further; it names the phase being set up ("fit", "validate", "test" or "predict").
    # done with multiple gpu, this is called on every gpu in the system
    entire_dataset = datasets.MNIST(
        root = self.data_dir,
        train=True,
        transform = transforms.Compose([
            transforms.RandomVerticalFlip(), # 8. @@@
            transforms.RandomHorizontalFlip(),
            transforms.ToTensor()
        ]),
        download=False
    )

    self.train_ds, self.val_ds = random_split(entire_dataset, (50000, 10000))

    # Held-out MNIST test split, kept separate from the train/val split above.
    self.test_ds = datasets.MNIST(self.data_dir,
                                  train=False,
                                  transform = transforms.ToTensor(),
                                  download=False)

  def train_dataloader(self): # Naming this method train_loader raised an error saying it must be defined as train_dataloader.
    return DataLoader(self.train_ds,
                      batch_size=self.batch_size,
                      num_workers=self.num_workers,
                      shuffle=True)

  def val_dataloader(self):
    return DataLoader(self.val_ds,
                      batch_size=self.batch_size,
                      num_workers=self.num_workers,
                      shuffle=False)

  def test_dataloader(self):
    return DataLoader(self.test_ds,
                      batch_size=self.batch_size,
                      num_workers=self.num_workers,
                      shuffle=False)


# Hyperparameters
input_size = 784
num_classes = 10
learning_rate = 0.001
batch_size = 64
num_epochs = 3

torch.set_float32_matmul_precision("medium") # Reportedly speeds up Lightning training by allowing lower-precision float32 matrix multiplications.

dm = MnistDataModule(data_dir = "dataset/", batch_size = batch_size, num_workers=4) # 5. @@@

model = NN(input_size=input_size, num_classes=num_classes)

trainer = pl.Trainer(logger = logger, # The logger automatically picks up every metric we log; in the model, log_dict registers train_loss, accuracy and the F1 score.
                     accelerator="gpu",
                     devices = 1,
                     min_epochs=1,
                     max_epochs=1000,
                     precision=16,
                     callbacks=[MyPrintingCallback(), EarlyStopping(monitor="val_loss")]) # 3. @@@ The Trainer is created here.
trainer.fit(model, dm) # 3. @@@ Since step 5 the data comes from the LightningDataModule.
trainer.validate(model, dm) # 3. @@@
trainer.test(model, dm) # 3. @@@

  rank_zero_warn(
INFO:pytorch_lightning.utilities.rank_zero:Using 16bit Automatic Mixed Precision (AMP)
INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.callbacks.model_summary:
  | Name     | Type               | Params
------------------------------------------------
0 | fc1      | Linear             | 39.2 K
1 | fc2      | Linear             | 510   
2 | loss_fn  | CrossEntropyLoss   | 0     
3 | accuracy | MulticlassAccuracy | 0     
4 | f1_score | MulticlassF1Score  | 0     
------------------------------------------------
39.8 K    Trainable params
0         Non-trainable params
39.8 K    Total params
0.1

Sanity Checking: 0it [00:00, ?it/s]

Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/util.py", line 300, in _run_finalizers
    finalizer()
  File "/usr/lib/python3.10/multiprocessing/util.py", line 224, in __call__
    res = self._callback(*self._args, **self._kwargs)
  File "/usr/lib/python3.10/multiprocessing/util.py", line 133, in _remove_temp_dir
    rmtree(tempdir)
  File "/usr/lib/python3.10/shutil.py", line 731, in rmtree
    onerror(os.rmdir, path, sys.exc_info())
  File "/usr/lib/python3.10/shutil.py", line 729, in rmtree
    os.rmdir(path)
OSError: [Errno 39] Directory not empty: '/tmp/pymp-8kf90p9t'


Starting to train


Training: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

RuntimeError: ignored

- It looks like only metrics registered through self.log or self.log_dict can be passed as the `monitor` argument of EarlyStopping.
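
To make the role of the monitored metric concrete, here is a simplified sketch of patience-based early stopping in plain Python. It is not Lightning's `EarlyStopping` implementation; `ToyEarlyStopping`, the `patience` and `min_delta` values, and the loss values are assumptions chosen for illustration. The point is that the check needs a value under the monitored name at every validation round, which is why the metric has to be logged.

```python
class ToyEarlyStopping:
    """Hypothetical, simplified early-stopping check (lower metric is better)."""

    def __init__(self, monitor, patience=3, min_delta=0.0):
        self.monitor = monitor
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.wait = 0  # validation rounds since the last improvement

    def should_stop(self, logged_metrics):
        if self.monitor not in logged_metrics:
            raise KeyError(f"{self.monitor!r} was never logged")
        current = logged_metrics[self.monitor]
        if current < self.best - self.min_delta:
            self.best = current   # improvement: remember it and reset the counter
            self.wait = 0
        else:
            self.wait += 1        # no improvement this round
        return self.wait >= self.patience


stopper = ToyEarlyStopping(monitor="val_loss", patience=2)
val_losses = [0.50, 0.40, 0.41, 0.42, 0.30]  # made-up validation losses
stopped_at = None
for epoch, loss in enumerate(val_losses):
    if stopper.should_stop({"val_loss": loss}):
        stopped_at = epoch
        break
print(stopped_at)
```

In this toy run the loop stops after two rounds without improvement, so the later, better value is never reached; a larger `patience` trades longer training for a lower chance of stopping too early.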