# 9. Hyperparameter Tuning

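> Learn how to use the Ray Tune framework to optimize the parameters used when training PyTorch models across many configs, so that the experimenter can easily find a locally optimal setting.  
This section covers basic parameter search methods such as grid, random, and Bayesian search, and shows how to structure a PyTorch deep learning project around the Ray Tune module.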

<br>

## Reference

- [Using PyTorch together with Ray](https://pytorch.org/tutorials/beginner/hyperparameter_tuning_tutorial.html)

<br>

## 9.1 Hyperparameter Tuning

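- Values that the model does not learn on its own must be set by a person
  - learning rate, model size, optimizer, etc.
- Results can sometimes depend heavily on the hyperparameters (less so these days?)
- Worth a try when you need to squeeze out the last 0.01!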

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
<img src='https://drive.google.com/uc?id=1HGaKiTmoL8lUCVYOJaNOwWZOBBpzFqOO' width=400/>

- Source: https://content.iospress.com/articles/journal-of-intelligent-and-fuzzy-systems/ifs190033

<br>

- The most basic methods
  - grid search: choose values at regular intervals within a range
  - random search: choose values at random
- Recently, Bayesian-based methods have taken the lead
  - Related paper: BOHB (2018)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
<img src='https://drive.google.com/uc?id=1xD62XXQohfURXwLal2hoW1xQDbJLgjsY' width=600/>

- Source: https://dl.acm.org/doi/pdf/10.5555/2188385.2188395
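
To make the difference between the two basic strategies concrete, here is a minimal sketch in plain Python. The objective `toy_loss` and its two parameters (`lr`, `hidden`) are hypothetical stand-ins for a real validation loss; only the way candidates are generated differs between the two searches.

```python
import itertools
import math
import random


def toy_loss(lr, hidden):
    # Hypothetical objective: best near lr = 10 ** -2.7 and hidden = 64.
    return (math.log10(lr) + 2.7) ** 2 + 0.01 * abs(hidden - 64) / 64


def grid_search(lrs, hiddens):
    # Grid search: evaluate every combination of the predefined values.
    return min(itertools.product(lrs, hiddens), key=lambda c: toy_loss(*c))


def random_search(n_trials, seed=0):
    # Random search: draw each value independently from its range.
    rng = random.Random(seed)
    candidates = [
        (10 ** rng.uniform(-4, -1), 2 ** rng.randint(2, 8))
        for _ in range(n_trials)
    ]
    return min(candidates, key=lambda c: toy_loss(*c))


best_grid = grid_search([1e-4, 1e-3, 1e-2, 1e-1], [4, 16, 64, 256])
best_random = random_search(n_trials=16)
print("grid  :", best_grid, toy_loss(*best_grid))
print("random:", best_random, toy_loss(*best_random))
```

Both searches spend 16 evaluations here, but the grid only ever tries 4 distinct learning rates, whereas random search tries a different one in every trial.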

<br>

## 9.2 Ray

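- A module that supports multi-node multiprocessing
- Developed for parallel processing of ML/DL workloads
- Essentially the current de facto standard among distributed, parallel ML/DL modules
- Provides a variety of modules for hyperparameter search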

<br>

### 9.2.1 Example Code

```python
import os
from functools import partial

import numpy as np
from ray import tune
from ray.tune import CLIReporter
from ray.tune.schedulers import ASHAScheduler

# load_data, train_cifar, max_num_epochs, num_samples and gpus_per_trial
# are defined in the practice code in section 9.3.
data_dir = os.path.abspath("./data")
load_data(data_dir)

# Define the search space in config
config = {
    "l1": tune.sample_from(lambda _: 2 ** np.random.randint(2, 9)),
    "l2": tune.sample_from(lambda _: 2 ** np.random.randint(2, 9)),
    "lr": tune.loguniform(1e-4, 1e-1),
    "batch_size": tune.choice([2, 4, 8, 16])
}

# Choose the trial scheduling algorithm
#  - unpromising hyperparameter settings are dropped from later steps (stopped early)
scheduler = ASHAScheduler(
    metric="loss", mode="min", max_t=max_num_epochs, grace_period=1,
    reduction_factor=2)

# Choose how progress is printed
reporter = CLIReporter(
    metric_columns=["loss", "accuracy", "training_iteration"])

# Run the trials in parallel
result = tune.run(
    partial(train_cifar, data_dir=data_dir),
    resources_per_trial={"cpu": 2, "gpu": gpus_per_trial},
    config=config, num_samples=num_samples,
    scheduler=scheduler,
    progress_reporter=reporter)
```
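
The search space above mixes three kinds of sampling. As a rough illustration of what each entry draws, the following sketch reproduces them with the standard library only. The helper names (`sample_power_of_two`, `sample_loguniform`, `sample_config`) are hypothetical, and this is not how Ray Tune implements sampling internally.

```python
import math
import random

rng = random.Random(0)


def sample_power_of_two(low_exp=2, high_exp=9):
    # Mirrors 2 ** np.random.randint(2, 9): the upper bound is exclusive,
    # so the result is one of 4, 8, ..., 256.
    return 2 ** rng.randrange(low_exp, high_exp)


def sample_loguniform(low=1e-4, high=1e-1):
    # Log-uniform: uniform in log space, so each decade is equally likely.
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def sample_config():
    # One draw per hyperparameter, analogous to one Ray Tune trial config.
    return {
        "l1": sample_power_of_two(),
        "l2": sample_power_of_two(),
        "lr": sample_loguniform(),
        "batch_size": rng.choice([2, 4, 8, 16]),
    }


configs = [sample_config() for _ in range(5)]
for cfg in configs:
    print(cfg)
```

Sampling the learning rate in log space matters because its useful values span several orders of magnitude; a plain uniform draw between 0.0001 and 0.1 would almost never land below 0.001.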

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
<img src='https://drive.google.com/uc?id=1weM8YEKnkjpoUZktlPwjGw3wxpKZSWmc' width=600/>
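
The `ASHAScheduler` in the example stops unpromising trials early. ASHA is an asynchronous variant of successive halving; the sketch below shows only the simpler synchronous idea, using `grace_period=1`, `reduction_factor=2` and a `max_t` of 8. The `fake_loss` function and the trial names are made up for illustration, and Ray's actual scheduler differs in its details.

```python
def fake_loss(quality, epoch):
    # Hypothetical learning curve: lower `quality` means a better trial,
    # and every trial improves as it trains longer.
    return quality + 1.0 / epoch


def successive_halving(trials, max_t=8, grace_period=1, reduction_factor=2):
    # trials: dict mapping a trial name to its (hidden) quality value.
    alive = dict(trials)
    t = grace_period
    history = []
    while True:
        # Train every surviving trial up to epoch t and rank by loss.
        ranked = sorted(alive, key=lambda name: fake_loss(alive[name], t))
        history.append((t, list(ranked)))
        if t >= max_t or len(ranked) == 1:
            return ranked[0], history
        # Keep the best 1 / reduction_factor of the trials, at least one.
        keep = max(1, len(ranked) // reduction_factor)
        alive = {name: alive[name] for name in ranked[:keep]}
        t = min(max_t, t * reduction_factor)


qualities = [0.9, 0.3, 0.7, 0.1, 0.5, 0.8, 0.2, 0.6]
trials = {f"trial_{i}": q for i, q in enumerate(qualities)}
best, history = successive_halving(trials)
for epoch, names in history:
    print(f"epoch {epoch}: {len(names)} trials still running")
print("best:", best)
```

Most of the training budget ends up on the few trials that survive the early comparisons, which is why a small `grace_period` makes a large search affordable.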

<br>

## 9.3 Practice: Ray Tune for Hyperparameter Tuning

In [1]:
!pip install ray


In [2]:
!pip install tensorboardX  # Ray uses the tensorboardX module internally.


In [3]:
!pip install wandb


In [4]:
from functools import partial
import numpy as np
import os
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import random_split
import torchvision
import torchvision.transforms as transforms
from ray import tune
from ray.tune import CLIReporter
from ray.tune.schedulers import ASHAScheduler

import wandb



In [5]:
def load_data(data_dir='./data'):
    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
    ])

    trainset = torchvision.datasets.CIFAR10(
        root=data_dir, train=True, download=True, transform=transform
    )
    testset = torchvision.datasets.CIFAR10(
        root=data_dir, train=False, download=True, transform=transform
    )

    return trainset, testset
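
As a small worked example of the transform in `load_data`: `ToTensor` scales 8-bit pixel values into the range 0 to 1, and `Normalize` with mean 0.5 and standard deviation 0.5 then maps that range onto -1 to 1. The `normalize` helper below is a hypothetical plain-Python stand-in for the per-channel arithmetic, not the torchvision implementation.

```python
def normalize(pixel, mean=0.5, std=0.5):
    # Same per-channel arithmetic as Normalize: (x - mean) / std.
    return (pixel - mean) / std


for raw in (0, 128, 255):
    scaled = raw / 255  # the scaling ToTensor applies to uint8 pixel values
    print(raw, round(scaled, 3), round(normalize(scaled), 3))
```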

In [6]:
class Net(nn.Module):
    def __init__(self, l1=120, l2=84):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, l1)
        self.fc2 = nn.Linear(l1, l2)
        self.fc3 = nn.Linear(l2, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x
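
The `16 * 5 * 5` input size of `fc1` follows from the layer shapes. A short sketch of that arithmetic, assuming 32x32 CIFAR-10 inputs and the default stride and padding used in `Net` (`conv_out` is a hypothetical helper):

```python
def conv_out(size, kernel, stride=1, padding=0):
    # Output width/height of a convolution or pooling layer (no dilation).
    return (size + 2 * padding - kernel) // stride + 1


size = 32                                  # CIFAR-10 images are 32x32
size = conv_out(size, kernel=5)            # conv1: 32 -> 28
size = conv_out(size, kernel=2, stride=2)  # pool:  28 -> 14
size = conv_out(size, kernel=5)            # conv2: 14 -> 10
size = conv_out(size, kernel=2, stride=2)  # pool:  10 -> 5
flat_features = 16 * size * size           # conv2 has 16 output channels
print(size, flat_features)
```

Only `l1` and `l2` are tuned, so this flattened size stays fixed across all trials.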

In [7]:
def train_cifar(config, checkpoint_dir=None, data_dir=None):
    # The whole training procedure must live inside a single function (so that Ray can call it later).
    net = Net(config['l1'], config['l2'])

    device = 'cpu'
    if torch.cuda.is_available():
        device = 'cuda:0'
        if torch.cuda.device_count() > 1:
            net = nn.DataParallel(net)
    net.to(device)

    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(net.parameters(), lr=config['lr'], momentum=0.9)

    if checkpoint_dir:
        model_state, optimizer_state = torch.load(
            os.path.join(checkpoint_dir, 'checkpoint')
        )
        net.load_state_dict(model_state)
        optimizer.load_state_dict(optimizer_state)

    trainset, testset = load_data(data_dir)

    test_abs = int(len(trainset) * 0.8)
    train_subset, val_subset = random_split(
        trainset, [test_abs, len(trainset) - test_abs]
    )

    trainloader = torch.utils.data.DataLoader(
        train_subset,
        batch_size=int(config['batch_size']),
        shuffle=True,
        num_workers=8
    )
    valloader = torch.utils.data.DataLoader(
        val_subset,
        batch_size=int(config['batch_size']),
        shuffle=True,
        num_workers=8
    )
    wandb.init(project='torch-turn', entity='zgotter')
    wandb.watch(net) # wandb tracking

    for epoch in range(10): # loop over the dataset multiple times
        running_loss = 0.0
        epoch_steps = 0
        for i, data in enumerate(trainloader, 0):
            # get the inputs; data is a list of [inputs, labels]
            inputs, labels = data
            inputs, labels = inputs.to(device), labels.to(device)

            # zero the parameter gradients
            optimizer.zero_grad()

            # forward + backward + optimize
            outputs = net(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

            # print statistics
            running_loss += loss.item()
            epoch_steps += 1
            if i % 2000 == 1999:  # print every 2000 mini-batches
                print("[%d, %5d] loss: %.3f" % (epoch + 1, i + 1,
                                                running_loss / epoch_steps))
                running_loss = 0.0
            
        # Validation loss
        val_loss = 0.0
        val_steps = 0
        total = 0
        correct = 0
        for i, data in enumerate(valloader, 0):
            with torch.no_grad():
                inputs, labels = data
                inputs, labels = inputs.to(device), labels.to(device)

                outputs = net(inputs)
                _, predicted = torch.max(outputs.data, 1)
                total += labels.size(0)
                correct += (predicted == labels).sum().item()                

                loss = criterion(outputs, labels)
                val_loss += loss.cpu().numpy()
                val_steps += 1

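        # Log the epoch's mean validation loss and accuracy in a single call,
        # matching the values reported to Ray Tune below.
        wandb.log({'val_loss': val_loss / val_steps, 'val_accuracy': correct / total})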

        with tune.checkpoint_dir(epoch) as checkpoint_dir:
            path = os.path.join(checkpoint_dir, 'checkpoint')
            torch.save((net.state_dict(), optimizer.state_dict()), path)

        tune.report(loss=(val_loss / val_steps), accuracy=(correct / total))
    print('Finished Training')
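
The batch size changes how many optimizer steps one epoch takes, which is why small batch sizes make trials slow. A quick calculation, assuming the 50,000-image CIFAR-10 training set and the 80/20 split used in `train_cifar`:

```python
import math

n_total = 50_000                  # CIFAR-10 ships 50,000 training images
n_train = int(n_total * 0.8)      # same 80/20 split as in train_cifar
n_val = n_total - n_train

# DataLoader keeps the last partial batch by default, hence the ceiling.
steps_per_epoch = {bs: math.ceil(n_train / bs) for bs in [2, 4, 6, 8, 16]}

print(f"train={n_train}, val={n_val}")
for batch_size, steps in steps_per_epoch.items():
    print(f"batch_size={batch_size:2d} -> {steps} steps per epoch")
```

With `batch_size=2` an epoch has 20,000 steps, which is consistent with the `[10, 20000]` progress line printed for the first trial in the output at the end of this section.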

In [8]:
def test_accuracy(net, device='cpu'):
    trainset, testset = load_data()

    testloader = torch.utils.data.DataLoader(
        testset, batch_size=4, shuffle=False, num_workers=2
    )

    correct = 0
    total = 0
    with torch.no_grad():
        for data in testloader:
            images, labels = data
            images, labels = images.to(device), labels.to(device)
            outputs = net(images)
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

    return correct / total            

In [11]:
from ray.tune.suggest.bayesopt import BayesOptSearch
from ray.tune.suggest.hyperopt import HyperOptSearch

def main(num_samples=10, max_num_epochs=10, gpus_per_trial=2):

    data_dir = os.path.abspath('./data')
    load_data(data_dir)
    
    config = {
        'l1': tune.sample_from(lambda _: 2 ** np.random.randint(2, 9)),
        'l2': tune.sample_from(lambda _: 2 ** np.random.randint(2, 9)),
        'lr': tune.loguniform(1e-4, 1e-1), # sample the learning rate from a log-uniform distribution (0.0001 to 0.1)
        'batch_size': tune.choice([2, 4, 6, 8, 16])
    }

    scheduler = ASHAScheduler(
        metric='loss',
        mode='min',
        max_t=max_num_epochs,
        grace_period=1,
        reduction_factor=2
    )

    reporter = CLIReporter(
        # parameter_columns=['l1', 'l2', 'lr', 'batch_size']
        metric_columns=['loss', 'accuracy', 'training_iteration']
    )

    result = tune.run(
        partial(train_cifar, data_dir=data_dir),
        resources_per_trial={'cpu': 2, 'gpu': gpus_per_trial},
        config=config,
        num_samples=num_samples,
        scheduler=scheduler,
        progress_reporter=reporter
    )

    best_trial = result.get_best_trial("loss", "min", "last")
    print("Best trial config: {}".format(best_trial.config))
    print("Best trial final validation loss: {}".format(best_trial.last_result["loss"]))
    print("Best trial final validation accuracy: {}".format(best_trial.last_result["accuracy"]))    

    best_trained_model = Net(best_trial.config["l1"], best_trial.config["l2"])
    device = "cpu"
    if torch.cuda.is_available():
        device = "cuda:0"
        if gpus_per_trial > 1:
            best_trained_model = nn.DataParallel(best_trained_model)
    best_trained_model.to(device)
    
    best_checkpoint_dir = best_trial.checkpoint.value
    model_state, optimizer_state = torch.load(
        os.path.join(best_checkpoint_dir, "checkpoint")
    )
    best_trained_model.load_state_dict(model_state)

    test_acc = test_accuracy(best_trained_model, device)
    print("Best trial test set accuracy: {}".format(test_acc))


if __name__ == "__main__":
    # Log in to W&B without hard-coding the key: wandb.login() reads the
    # WANDB_API_KEY environment variable or an existing login.
    wandb.login()
    # You can change the number of GPUs per trial here:
    main(num_samples=10, max_num_epochs=10, gpus_per_trial=0)

wandb: W&B API key is configured (use `wandb login --relogin` to force relogin)
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to /content/data/cifar-10-python.tar.gz



Extracting /content/data/cifar-10-python.tar.gz to /content/data
Files already downloaded and verified


2021-08-20 03:34:19,993	INFO services.py:1247 -- View the Ray dashboard at http://127.0.0.1:8265
2021-08-20 03:34:22,174	INFO registry.py:67 -- Detected unknown callable for trainable. Converting to class.


== Status ==
Memory usage on this node: 1.5/12.7 GiB
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 8.000: None | Iter 4.000: None | Iter 2.000: None | Iter 1.000: None
Resources requested: 2.0/2 CPUs, 0/1 GPUs, 0.0/7.32 GiB heap, 0.0/3.66 GiB objects (0.0/1.0 accelerator_type:P100)
Result logdir: /root/ray_results/DEFAULT_2021-08-20_03-34-22
Number of trials: 10/10 (9 PENDING, 1 RUNNING)
+---------------------+----------+-------+--------------+------+------+-------------+
| Trial name          | status   | loc   |   batch_size |   l1 |   l2 |          lr |
|---------------------+----------+-------+--------------+------+------+-------------|
| DEFAULT_8260c_00000 | RUNNING  |       |            2 |   64 |    4 | 0.000217048 |
| DEFAULT_8260c_00001 | PENDING  |       |            4 |    8 |  256 | 0.00052147  |
| DEFAULT_8260c_00002 | PENDING  |       |            2 |  256 |  128 | 0.00181035  |
| DEFAULT_8260c_00003 | PENDING  |       |            8 |   64 |  256 | 0.000104527 |
| D

(pid=652)   cpuset_checked))
(pid=652) wandb: Currently logged in as: zgotter (use `wandb login --relogin` to force relogin)
(pid=652) wandb: Tracking run with wandb version 0.12.0
(pid=652) wandb: Syncing run electric-fog-1
(pid=652) wandb:  View project at https://wandb.ai/zgotter/torch-turn
(pid=652) wandb:  View run at https://wandb.ai/zgotter/torch-turn/runs/2j53xzqe
(pid=652) wandb: Run data is saved locally in /root/ray_results/DEFAULT_2021-08-20_03-34-22/DEFAULT_8260c_00000_0_batch_size=2,l1=64,l2=4,lr=0.00021705_2021-08-20_03-34-22/wandb/run-20210820_033425-2j53xzqe
(pid=652) wandb: Run `wandb offline` to turn off syncing.
(pid=652)   cpuset_checked))
(pid=652)   return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)


(pid=652) [10, 20000] loss: 0.108


KeyboardInterrupt: ignored
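
The run above was interrupted before `main` reached its reporting step. As a rough picture of what `result.get_best_trial("loss", "min", "last")` is asked to do, the sketch below picks the trial whose last reported `loss` is smallest. The trial records are made-up numbers for illustration, and `pick_best_trial` is a hypothetical helper rather than Ray's implementation.

```python
def pick_best_trial(trials, metric="loss", mode="min"):
    # Compare trials by the value of `metric` in their last reported result.
    chooser = min if mode == "min" else max
    return chooser(trials, key=lambda t: t["results"][-1][metric])


# Made-up per-epoch results; trials stopped early simply have fewer entries.
trials = [
    {"config": {"lr": 0.0002, "batch_size": 2},
     "results": [{"loss": 2.1, "accuracy": 0.22}, {"loss": 1.8, "accuracy": 0.33}]},
    {"config": {"lr": 0.0018, "batch_size": 2},
     "results": [{"loss": 1.6, "accuracy": 0.41}, {"loss": 1.3, "accuracy": 0.53}]},
    {"config": {"lr": 0.05, "batch_size": 8},
     "results": [{"loss": 2.3, "accuracy": 0.10}]},
]

best = pick_best_trial(trials)
print("Best trial config:", best["config"])
print("Best trial final validation loss:", best["results"][-1]["loss"])
```

The selected trial's config is what `main` then uses to rebuild `Net` and load the saved checkpoint before measuring test accuracy.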