<a href="https://colab.research.google.com/github/xuwangfmc/dlbook/blob/main/wb_hydra/Sweeps.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sweep

该教程主要介绍如何将Weights&Bias的Artifacts工具运用到PyTorch当中，实现模型的超参数搜索以找到性能表现最好的模型设置。

Weights&Bias的Artifacts页面如下所示：
![6P0ATeH.png](https://s2.loli.net/2022/01/22/57duZjnVeXcQv1m.png)

Sweep记录了每次不同超参数设置下的模型表现，以及各超参数对模型性能的重要性。

运行Weights&Bias的超参数搜索方法主要分为三个步骤：

- 1) 定义Sweep的配置：可以通过创建字典或者YAML文件来设置需要搜索的超参数、搜索策略、优化指标等。

- 2) 初始化Sweep：通过wandb.sweep(sweep_config)传入Sweep的配置和完成Sweep的初始化。

- 3) 运行Sweep代理：调用wandb.agent()函数，传入sweep_id以及定义了模型结构的训练函数:wandb.agent(sweep_id, function=train)

## 安装wandb库并登录账号

在https://wandb.ai/authorize 登录账号并粘贴API key。

In [1]:
!pip install wandb
!wandb login

Collecting wandb
  Downloading wandb-0.12.9-py2.py3-none-any.whl (1.7 MB)
[?25l[K     |▏                               | 10 kB 34.5 MB/s eta 0:00:01[K     |▍                               | 20 kB 43.5 MB/s eta 0:00:01[K     |▋                               | 30 kB 32.7 MB/s eta 0:00:01[K     |▊                               | 40 kB 18.7 MB/s eta 0:00:01[K     |█                               | 51 kB 15.1 MB/s eta 0:00:01[K     |█▏                              | 61 kB 16.8 MB/s eta 0:00:01[K     |█▍                              | 71 kB 13.7 MB/s eta 0:00:01[K     |█▌                              | 81 kB 15.1 MB/s eta 0:00:01[K     |█▊                              | 92 kB 15.0 MB/s eta 0:00:01[K     |██                              | 102 kB 13.1 MB/s eta 0:00:01[K     |██                              | 112 kB 13.1 MB/s eta 0:00:01[K     |██▎                             | 122 kB 13.1 MB/s eta 0:00:01[K     |██▌                             | 133 kB 13.1 MB/s eta 

## 定义Sweep的配置

通过创建字典或者YAML文件可以完成Sweep的配置。

- Metric：这是Sweep尝试优化的指标。Metric有name和goal两个参数。

- 搜索策略：运用'method'变量指定。常用有随机搜索、网格搜索、贝叶斯搜索三种方式。

- Parameters：包含一系列超参数的字典，可以是离散值、最大值、最小值或者某种分布。

In [2]:
import torch
import torch.optim as optim
import torch.nn.functional as F
import torchvision.datasets as datasets
import torch.nn as nn
import wandb
from torchvision import datasets, transforms

sweep_config = {
    'method': 'random', #grid, random
    'metric': {
      'name': 'loss',
      'goal': 'minimize'   
    },
    'parameters': {
        'epochs': {
            'values': [2, 5, 10]
        },
        'batch_size': {
            'values': [256, 128, 64, 32]
        },
        'dropout': {
            'values': [0.3, 0.4, 0.5]
        },
        'learning_rate': {
            'values': [1e-2, 1e-3, 1e-4, 3e-4, 3e-5, 1e-5]
        },
        'fc_layer_size':{
            'values':[128,256,512]
        },
        'optimizer': {
            'values': ['adam', 'sgd']
        },
    }
}

## 初始化Sweep

In [4]:
sweep_id = wandb.sweep(sweep_config, project="Pytorch-sweeps")

Create sweep with ID: ea46d7yc
Sweep URL: https://wandb.ai/xuwangfmc/Pytorch-sweeps/sweeps/ea46d7yc


配置数据集

In [5]:
def build_dataset(batch_size):
   transform=transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,))
        ])
   dataset = datasets.MNIST('../data', train=True, download=True,
                       transform=transform)
   train_loader = torch.utils.data.DataLoader(dataset,batch_size=batch_size)

   return train_loader

定义网络架构及训练方式如下：

In [7]:
def train():
    # Default values for hyper-parameters we're going to sweep over
    config_defaults = {
        'epochs': 5,
        'batch_size': 128,
        'learning_rate': 1e-3,
        'optimizer': 'adam',
        'fc_layer_size': 128,
        'dropout': 0.5,
    }
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Initialize a new wandb run
    wandb.init(config=config_defaults)
    
    # Config is a variable that holds and saves hyperparameters and inputs
    config = wandb.config
    
    # Define the model architecture - This is a simplified version of the VGG19 architecture
    network = nn.Sequential(
     nn.Flatten(start_dim=1)
    ,nn.Linear(784, config.fc_layer_size)
    ,nn.ReLU()
    ,nn.Dropout(config.dropout)
    ,nn.Linear(config.fc_layer_size, 10)
    ,nn.LogSoftmax(dim=1)
     )
    train_loader = build_dataset(config.batch_size)
    # Set of Conv2D, Conv2D, MaxPooling2D layers with 32 and 64 filters

    # Define the optimizer
    if config.optimizer=='sgd':
      optimizer = optim.SGD(network.parameters(),lr=config.learning_rate, momentum=0.9)
    elif config.optimizer=='adam':
      optimizer = optim.Adam(network.parameters(),lr=config.learning_rate)

    network.train()
    network = network.to(device)
    for i in range(config.epochs):
      closs= 0
      for batch_idx, (data, target) in enumerate(train_loader):
          data, target = data.to(device), target.to(device)
          optimizer.zero_grad()
          output = network(data)
          loss = F.nll_loss(output, target)
          loss.backward()
          closs = closs + loss.item()
          optimizer.step()
          wandb.log({"batch loss":loss.item()})
      wandb.log({"loss":closs/config.batch_size}) 

## 运行Sweep代理

In [8]:
wandb.agent(sweep_id, train)

[34m[1mwandb[0m: Agent Starting Run: 136ep5ty with config:
[34m[1mwandb[0m: 	batch_size: 32
[34m[1mwandb[0m: 	dropout: 0.5
[34m[1mwandb[0m: 	epochs: 10
[34m[1mwandb[0m: 	fc_layer_size: 512
[34m[1mwandb[0m: 	learning_rate: 0.001
[34m[1mwandb[0m: 	optimizer: adam
[34m[1mwandb[0m: Currently logged in as: [33mxuwangfmc[0m (use `wandb login --relogin` to force relogin)


Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to ../data/MNIST/raw/train-images-idx3-ubyte.gz


  0%|          | 0/9912422 [00:00<?, ?it/s]

Extracting ../data/MNIST/raw/train-images-idx3-ubyte.gz to ../data/MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz to ../data/MNIST/raw/train-labels-idx1-ubyte.gz


  0%|          | 0/28881 [00:00<?, ?it/s]

Extracting ../data/MNIST/raw/train-labels-idx1-ubyte.gz to ../data/MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz to ../data/MNIST/raw/t10k-images-idx3-ubyte.gz


  0%|          | 0/1648877 [00:00<?, ?it/s]

Extracting ../data/MNIST/raw/t10k-images-idx3-ubyte.gz to ../data/MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz to ../data/MNIST/raw/t10k-labels-idx1-ubyte.gz


  0%|          | 0/4542 [00:00<?, ?it/s]

Extracting ../data/MNIST/raw/t10k-labels-idx1-ubyte.gz to ../data/MNIST/raw



VBox(children=(Label(value=' 0.00MB of 0.00MB uploaded (0.00MB deduped)\r'), FloatProgress(value=1.0, max=1.0)…

0,1
batch loss,▃▃▂▃▂▃▂▂▁▄▆▃▃▂▁▄▄▁▆▂▂▁▃▂█▁▄▆▁▁▂▁▁▁▂▁█▃▂▁
loss,█▄▃▂▂▂▂▁▁▁

0,1
batch loss,0.02907
loss,5.63215


[34m[1mwandb[0m: Agent Starting Run: uba87p95 with config:
[34m[1mwandb[0m: 	batch_size: 32
[34m[1mwandb[0m: 	dropout: 0.3
[34m[1mwandb[0m: 	epochs: 10
[34m[1mwandb[0m: 	fc_layer_size: 256
[34m[1mwandb[0m: 	learning_rate: 0.001
[34m[1mwandb[0m: 	optimizer: adam


VBox(children=(Label(value=' 0.00MB of 0.00MB uploaded (0.00MB deduped)\r'), FloatProgress(value=1.0, max=1.0)…

0,1
batch loss,▆▅▅▅▃▄▂▁▁▃▅▂▂▃▂▁▂▂▄▄▂▂█▅▁▁▂▂▁▁▁▁▁▂▁▁▄▁▂▁
loss,█▄▃▂▂▂▁▁▁▁

0,1
batch loss,0.0042
loss,3.53322


[34m[1mwandb[0m: Agent Starting Run: i0zsf9vd with config:
[34m[1mwandb[0m: 	batch_size: 32
[34m[1mwandb[0m: 	dropout: 0.3
[34m[1mwandb[0m: 	epochs: 5
[34m[1mwandb[0m: 	fc_layer_size: 512
[34m[1mwandb[0m: 	learning_rate: 3e-05
[34m[1mwandb[0m: 	optimizer: adam


VBox(children=(Label(value=' 0.00MB of 0.00MB uploaded (0.00MB deduped)\r'), FloatProgress(value=1.0, max=1.0)…

0,1
batch loss,█▅▄▃▃▂▃▃▂▂▄▂▂▂▂▂▁▁▃▂▁▂▃▁▂▂▂▃▁▁▁▂▂▁▂▁▃▁▂▁
loss,█▃▂▁▁

0,1
batch loss,0.05773
loss,11.52504


[34m[1mwandb[0m: Agent Starting Run: j0olzaph with config:
[34m[1mwandb[0m: 	batch_size: 64
[34m[1mwandb[0m: 	dropout: 0.4
[34m[1mwandb[0m: 	epochs: 2
[34m[1mwandb[0m: 	fc_layer_size: 128
[34m[1mwandb[0m: 	learning_rate: 0.0003
[34m[1mwandb[0m: 	optimizer: adam


VBox(children=(Label(value=' 0.00MB of 0.00MB uploaded (0.00MB deduped)\r'), FloatProgress(value=1.0, max=1.0)…

0,1
batch loss,█▆▃▂▂▂▃▁▃▂▂▂▂▂▃▂▂▂▁▁▃▁▂▁▂▂▂▂▁▁▁▂▁▃▁▂▂▂▁▃
loss,█▁

0,1
batch loss,0.08392
loss,3.80378


[34m[1mwandb[0m: Agent Starting Run: sq4di8n4 with config:
[34m[1mwandb[0m: 	batch_size: 64
[34m[1mwandb[0m: 	dropout: 0.3
[34m[1mwandb[0m: 	epochs: 2
[34m[1mwandb[0m: 	fc_layer_size: 256
[34m[1mwandb[0m: 	learning_rate: 3e-05
[34m[1mwandb[0m: 	optimizer: adam


VBox(children=(Label(value=' 0.00MB of 0.00MB uploaded (0.00MB deduped)\r'), FloatProgress(value=1.0, max=1.0)…

0,1
batch loss,█▇▆▅▅▄▄▃▃▃▃▂▃▃▃▃▂▂▁▁▂▂▂▁▂▂▂▁▁▁▂▂▂▂▁▂▁▂▂▃
loss,█▁

0,1
batch loss,0.15931
loss,6.47653


[34m[1mwandb[0m: Agent Starting Run: keledvtf with config:
[34m[1mwandb[0m: 	batch_size: 32
[34m[1mwandb[0m: 	dropout: 0.3
[34m[1mwandb[0m: 	epochs: 5
[34m[1mwandb[0m: 	fc_layer_size: 128
[34m[1mwandb[0m: 	learning_rate: 3e-05
[34m[1mwandb[0m: 	optimizer: adam


VBox(children=(Label(value=' 0.00MB of 0.00MB uploaded (0.00MB deduped)\r'), FloatProgress(value=1.0, max=1.0)…

0,1
batch loss,█▆▅▄▃▃▃▃▃▂▄▂▂▂▂▂▂▁▂▃▂▂▃▁▂▃▂▃▁▂▁▂▂▁▁▂▂▁▂▁
loss,█▃▂▁▁

0,1
batch loss,0.12498
loss,17.43142


[34m[1mwandb[0m: Agent Starting Run: viljp9lq with config:
[34m[1mwandb[0m: 	batch_size: 128
[34m[1mwandb[0m: 	dropout: 0.5
[34m[1mwandb[0m: 	epochs: 5
[34m[1mwandb[0m: 	fc_layer_size: 128
[34m[1mwandb[0m: 	learning_rate: 0.0003
[34m[1mwandb[0m: 	optimizer: sgd


VBox(children=(Label(value=' 0.00MB of 0.00MB uploaded (0.00MB deduped)\r'), FloatProgress(value=1.0, max=1.0)…

0,1
batch loss,█▇▆▅▅▄▄▄▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▁▂▁▂▁▁▂▂▁
loss,█▃▂▁▁

0,1
batch loss,0.53494
loss,1.84654


[34m[1mwandb[0m: Sweep Agent: Waiting for job.
[34m[1mwandb[0m: Job received.
[34m[1mwandb[0m: Agent Starting Run: pyyz16vb with config:
[34m[1mwandb[0m: 	batch_size: 128
[34m[1mwandb[0m: 	dropout: 0.4
[34m[1mwandb[0m: 	epochs: 2
[34m[1mwandb[0m: 	fc_layer_size: 128
[34m[1mwandb[0m: 	learning_rate: 0.001
[34m[1mwandb[0m: 	optimizer: adam


VBox(children=(Label(value=' 0.00MB of 0.00MB uploaded (0.00MB deduped)\r'), FloatProgress(value=1.0, max=1.0)…

0,1
batch loss,█▄▂▃▃▃▂▂▃▂▂▂▂▂▃▂▂▂▂▁▂▂▁▁▂▂▂▁▁▁▂▁▁▂▂▂▂▂▂▃
loss,█▁

0,1
batch loss,0.28656
loss,0.78056


[34m[1mwandb[0m: Agent Starting Run: 9wnygjzv with config:
[34m[1mwandb[0m: 	batch_size: 64
[34m[1mwandb[0m: 	dropout: 0.4
[34m[1mwandb[0m: 	epochs: 5
[34m[1mwandb[0m: 	fc_layer_size: 256
[34m[1mwandb[0m: 	learning_rate: 0.01
[34m[1mwandb[0m: 	optimizer: adam


VBox(children=(Label(value=' 0.00MB of 0.00MB uploaded (0.00MB deduped)\r'), FloatProgress(value=1.0, max=1.0)…

0,1
batch loss,▆▃▅▅▂▃▄▃▆▆▄▄▄▆▆▃▆▂▃▂▄█▁▃▄▇▃▇▃▄▆▃▄▂▃▂▄▆▃▆
loss,█▄▃▂▁

0,1
batch loss,0.28999
loss,6.9134


[34m[1mwandb[0m: Agent Starting Run: pxc0sgjc with config:
[34m[1mwandb[0m: 	batch_size: 64
[34m[1mwandb[0m: 	dropout: 0.4
[34m[1mwandb[0m: 	epochs: 10
[34m[1mwandb[0m: 	fc_layer_size: 128
[34m[1mwandb[0m: 	learning_rate: 0.001
[34m[1mwandb[0m: 	optimizer: adam


[34m[1mwandb[0m: Ctrl + C detected. Stopping sweep.


打开输出结果的链接即可查看到各种超参数设置下模型的训练结果。