<a href="https://colab.research.google.com/github/xuwangfmc/dlbook/blob/main/wb_hydra/Dashboard.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Dashboard

This tutorial shows how to use the Weights & Biases (W&B) Dashboard with PyTorch to monitor a model's training, evaluation, and parameter updates.

The W&B Dashboard looks like this:
![t8Rpp2J.png](https://s2.loli.net/2022/01/22/VvlfIHxwFBCGmyc.png)

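It is organized into four tabs: Charts, System, Logs, and Files.

- 1) **Charts** shows the metrics logged with wandb.log(), such as loss and accuracy, together with the gradients and parameters that wandb.watch() records during training.

- 2) **System** shows CPU, GPU, and memory usage while the model trains.

- 3) **Logs** records everything the code prints to the console while it runs.

- 4) **Files** lists all files saved by Weights & Biases, including the ONNX model and the configuration files.


The following pseudocode outlines the typical Weights & Biases workflow: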


```python
# import the library
import wandb

# start a new experiment
wandb.init(project="my-first-project")

# capture a dictionary of hyperparameters with config
wandb.config.update({"learning_rate": 0.01, "epochs": 30, "batch_size": 20})

# set up model and data (get_model and get_data are placeholders)
model, dataloader = get_model(), get_data()

# optional: track gradients
wandb.watch(model)

for batch in dataloader:
    metrics = model.training_step(batch)
    # log metrics inside your training loop to visualize model performance
    wandb.log(metrics)

# optional: save the model at the end
model.to_onnx()
wandb.save("model.onnx")
```




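- wandb.init() starts a new run, returns a run object, and creates a local directory. That directory holds all logs and files, which are uploaded asynchronously to the Weights & Biases server.

- wandb.config stores the model's hyperparameters. Settings captured in config can later be used to organize and query experiment results.

- wandb.log() records metrics such as loss and accuracy during training.

- wandb.watch() tracks the model's gradients during training (and, with `log="all"`, its parameters as well).

The rest of this tutorial walks through a complete example of using the dashboard:

## Configure PyTorch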

In [1]:
import random

import numpy as np
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms
from tqdm.notebook import tqdm

# Ensure deterministic behavior
# A fixed integer seed is used because hash() of a str changes between Python processes
SEED = 0
torch.backends.cudnn.deterministic = True
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)

# Device configuration
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
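
Python salts `hash()` for strings per interpreter process unless `PYTHONHASHSEED` is set, so a seed derived from it can differ from one run to the next. If phrase-based seeds are preferred, one option is to derive them from a stable digest. The following is a minimal sketch under that assumption; `seed_from_phrase` is a hypothetical helper, not part of PyTorch or wandb.

```python
import hashlib
import random


def seed_from_phrase(phrase: str) -> int:
    """Map a phrase to a stable integer in [0, 2**32 - 1]."""
    digest = hashlib.sha256(phrase.encode("utf-8")).digest()
    # The first 4 bytes form an unsigned 32-bit integer.
    return int.from_bytes(digest[:4], "big")


seed = seed_from_phrase("setting random seeds")
random.seed(seed)
first_draw = random.random()

# Re-seeding with the same value replays the same random stream.
random.seed(seed)
assert random.random() == first_draw
```

Because the value fits in 32 bits, the same integer can also be passed to `np.random.seed` and `torch.manual_seed`.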

## Install the wandb library

In [2]:
!pip install wandb --upgrade

## Import wandb and log in

On first use, sign up for an account at https://wandb.ai/authorize and paste the API key shown there when prompted.

In [3]:
import wandb
wandb.login()

wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


True

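## Track data and hyperparameters with wandb.init

The example below collects the training hyperparameters, the dataset name, and related metadata in a single config dictionary.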

In [4]:
config = dict(
    epochs=5,
    classes=10,
    kernels=[16, 32],
    batch_size=128,
    learning_rate=0.005,
    dataset="MNIST",
    architecture="CNN")

## Define, train, and test the model

The overall workflow is defined in the `model_pipeline` function:

In [5]:
def model_pipeline(hyperparameters):

    # tell wandb to get started
    with wandb.init(project="pytorch-demo", config=hyperparameters):
        # access all HPs through wandb.config, so logging matches execution!
        config = wandb.config

        # make the model, data, and optimization problem
        model, train_loader, test_loader, criterion, optimizer = make(config)
        print(model)

        # and use them to train the model
        train(model, train_loader, criterion, optimizer, config)

        # and test its final performance
        test(model, test_loader)
    return model

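The function above calls `make`, `train`, and `test` in order. Their roles are:
- `make`: builds the model and loads the dataset.

- `train`: defines how the model is trained.

- `test`: evaluates the trained model on the test set.

### Experiment setup
The `make` function loads the dataset and configures the loss function, optimizer, and model architecture. An example implementation: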

In [6]:
def make(config):
    # Make the data
    train, test = get_data(train=True), get_data(train=False)
    train_loader = make_loader(train, batch_size=config.batch_size)
    test_loader = make_loader(test, batch_size=config.batch_size)

    # Make the model
    model = ConvNet(config.kernels, config.classes).to(device)

    # Make the loss and optimizer
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(
        model.parameters(), lr=config.learning_rate)
    return model, train_loader, test_loader, criterion, optimizer

The data-loading functions `get_data` and `make_loader` are defined as follows:

In [7]:
def get_data(slice=5, train=True):
    full_dataset = torchvision.datasets.MNIST(root=".",
                          train=train, 
                          transform=transforms.ToTensor(),
                          download=True)
    #  equiv to slicing with [::slice] 
    sub_dataset = torch.utils.data.Subset(
      full_dataset, indices=range(0, len(full_dataset), slice))
    
    return sub_dataset

def make_loader(dataset, batch_size):
    loader = torch.utils.data.DataLoader(dataset=dataset,
                       batch_size=batch_size, 
                       shuffle=True,
                       pin_memory=True, num_workers=2)
    return loader
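
To make the subsampling in `get_data` concrete, the sketch below reproduces its index selection with plain Python. It assumes the standard MNIST split sizes (60,000 training and 10,000 test images); `subset_indices` is a hypothetical helper that mirrors the `range(0, len(full_dataset), slice)` argument passed to `Subset`.

```python
def subset_indices(dataset_len: int, step: int = 5) -> range:
    """Indices chosen by Subset(full_dataset, range(0, dataset_len, step))."""
    return range(0, dataset_len, step)


# Standard MNIST split sizes (assumption stated above).
train_idx = subset_indices(60_000)
test_idx = subset_indices(10_000)

# The selection is the same as slicing a list with [::step].
assert list(train_idx) == list(range(60_000))[::5]

print(len(train_idx), len(test_idx))  # 12000 2000

batch_size = 128
batches_per_epoch = -(-len(train_idx) // batch_size)  # ceiling division
last_batch = len(train_idx) - (batches_per_epoch - 1) * batch_size
print(batches_per_epoch, last_batch)  # 94 96
```

With `slice=5`, each epoch therefore covers 12,000 training images in 94 batches, and the test set shrinks to 2,000 images.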

The convolutional neural network `ConvNet` is defined as follows:

In [8]:
# Convolutional neural network
class ConvNet(nn.Module):
    def __init__(self, kernels, classes=10):
        super(ConvNet, self).__init__()
        
        self.layer1 = nn.Sequential(
            nn.Conv2d(1, kernels[0], kernel_size=5, stride=1, padding=2),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2))
        self.layer2 = nn.Sequential(
            nn.Conv2d(kernels[0], kernels[1], kernel_size=5, stride=1, padding=2),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2))
        self.fc = nn.Linear(7 * 7 * kernels[-1], classes)
        
    def forward(self, x):
        out = self.layer1(x)
        out = self.layer2(out)
        out = out.reshape(out.size(0), -1)
        out = self.fc(out)
        return out
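
The `7 * 7 * kernels[-1]` input size of the final linear layer follows from how each layer changes the spatial size. The sketch below works through that arithmetic, assuming 28x28 MNIST inputs; `conv_out` and `flattened_features` are hypothetical helpers written for this illustration, not PyTorch functions.

```python
def conv_out(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Output width/height of a convolution or pooling layer (floor mode)."""
    return (size + 2 * padding - kernel) // stride + 1


def flattened_features(image_size: int, kernels: list) -> int:
    """Number of features reaching the linear layer of the ConvNet above."""
    size = image_size
    for _ in kernels:
        size = conv_out(size, kernel=5, stride=1, padding=2)  # nn.Conv2d keeps the size
        size = conv_out(size, kernel=2, stride=2)             # nn.MaxPool2d halves it
    return size * size * kernels[-1]


print(flattened_features(28, [16, 32]))  # 1568
```

The spatial size goes 28 -> 14 -> 7 across the two blocks, so the linear layer receives 7 * 7 * 32 = 1568 features.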

### Training setup
The `train` function defines how the model is trained and logs the training results:

In [9]:
def train(model, loader, criterion, optimizer, config):
    # tell wandb to watch what the model gets up to: gradients, weights, and more!
    wandb.watch(model, criterion, log="all", log_freq=10)

    # Run training and track with wandb
    total_batches = len(loader) * config.epochs
    example_ct = 0  # number of examples seen
    batch_ct = 0
    for epoch in tqdm(range(config.epochs)):
        for images, labels in loader:

            loss = train_batch(images, labels, model, optimizer, criterion)
            example_ct += len(images)
            batch_ct += 1

            # Report metrics every 25 batches (first report at batch 24)
            if ((batch_ct + 1) % 25) == 0:
                train_log(loss, example_ct, epoch)


def train_batch(images, labels, model, optimizer, criterion):
    images, labels = images.to(device), labels.to(device)
    
    # Forward pass ➡
    outputs = model(images)
    loss = criterion(outputs, labels)
    
    # Backward pass ⬅
    optimizer.zero_grad()
    loss.backward()

    # Step with optimizer
    optimizer.step()

    return loss
    
def train_log(loss, example_ct, epoch):
    loss = float(loss)

    # where the magic happens
    wandb.log({"epoch": epoch, "loss": loss}, step=example_ct)
    print(f"Loss after {example_ct:05d} examples: {loss:.3f}")
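
The `step=example_ct` values passed to `wandb.log` depend on the batch counter and on the size of the last, smaller batch in each epoch. The sketch below simulates only the counting logic of `train`, assuming 12,000 training examples per epoch (MNIST subsampled with `slice=5`), a batch size of 128, and 5 epochs; `logged_example_counts` is a hypothetical helper and performs no training.

```python
def logged_example_counts(num_examples: int, batch_size: int,
                          epochs: int, every: int = 25) -> list:
    """Return the example counts at which train() would call train_log()."""
    counts = []
    example_ct = 0
    batch_ct = 0
    for _ in range(epochs):
        remaining = num_examples
        while remaining > 0:
            current = min(batch_size, remaining)  # last batch may be smaller
            remaining -= current
            example_ct += current
            batch_ct += 1
            # Same condition as in train()
            if (batch_ct + 1) % every == 0:
                counts.append(example_ct)
    return counts


counts = logged_example_counts(12_000, 128, 5)
print(counts[:4])               # [3072, 6272, 9472, 12640]
print(len(counts), counts[-1])  # 18 57344
```

These counts line up with the `Loss after ... examples` lines printed by the run later in this tutorial.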

### Testing setup

The `test` function defines how the model is evaluated and which metric is reported:

In [10]:
def test(model, test_loader):
    model.eval()

    # Run the model on some test examples
    with torch.no_grad():
        correct, total = 0, 0
        for images, labels in test_loader:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

        print(f"Accuracy of the model on the {total} " +
              f"test images: {100 * correct / total}%")
        
        wandb.log({"test_accuracy": correct / total})

    # Save the model in the exchangeable ONNX format
    torch.onnx.export(model, images, "model.onnx")
    wandb.save("model.onnx")
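
The accuracy computed in `test` is the fraction of examples whose highest-scoring class equals the label. A minimal sketch of the same calculation on plain Python lists, using made-up scores; `accuracy` is a hypothetical helper that stands in for the `torch.max(outputs.data, 1)` comparison:

```python
def accuracy(logits: list, labels: list) -> float:
    """Fraction of rows whose arg-max class matches the label."""
    correct = 0
    for row, label in zip(logits, labels):
        # Index of the largest score, like torch.max(outputs, 1)[1] for one row
        predicted = max(range(len(row)), key=row.__getitem__)
        correct += int(predicted == label)
    return correct / len(labels)


# Illustrative scores for 4 examples and 3 classes (not real model output)
logits = [[0.1, 2.0, -1.0], [1.5, 0.2, 0.3], [0.0, 0.1, 0.9], [0.4, 0.3, 0.2]]
labels = [1, 0, 2, 1]
print(accuracy(logits, labels))  # 0.75
```

Note that `test` logs the fraction `correct / total` to wandb, while the printed message reports the same value as a percentage.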

## Run the pipeline and view the training and test results

In [11]:
# Build, train and analyze the model with the pipeline
model = model_pipeline(config)

wandb: Currently logged in as: xuwangfmc (use `wandb login --relogin` to force relogin)


Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to ./MNIST/raw/train-images-idx3-ubyte.gz
Extracting ./MNIST/raw/train-images-idx3-ubyte.gz to ./MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz to ./MNIST/raw/train-labels-idx1-ubyte.gz
Extracting ./MNIST/raw/train-labels-idx1-ubyte.gz to ./MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz to ./MNIST/raw/t10k-images-idx3-ubyte.gz
Extracting ./MNIST/raw/t10k-images-idx3-ubyte.gz to ./MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz to ./MNIST/raw/t10k-labels-idx1-ubyte.gz
Extracting ./MNIST/raw/t10k-labels-idx1-ubyte.gz to ./MNIST/raw

ConvNet(
  (layer1): Sequential(
    (0): Conv2d(1, 16, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
    (1): ReLU()
    (2): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (layer2): Sequential(
    (0): Conv2d(16, 32, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
    (1): ReLU()
    (2): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (fc): Linear(in_features=1568, out_features=10, bias=True)
)


Loss after 03072 examples: 0.452
Loss after 06272 examples: 0.210
Loss after 09472 examples: 0.208
Loss after 12640 examples: 0.037
Loss after 15840 examples: 0.150
Loss after 19040 examples: 0.142
Loss after 22240 examples: 0.115
Loss after 25408 examples: 0.091
Loss after 28608 examples: 0.029
Loss after 31808 examples: 0.030
Loss after 35008 examples: 0.011
Loss after 38176 examples: 0.027
Loss after 41376 examples: 0.062
Loss after 44576 examples: 0.014
Loss after 47776 examples: 0.058
Loss after 50944 examples: 0.022
Loss after 54144 examples: 0.093
Loss after 57344 examples: 0.052
Accuracy of the model on the 2000 test images: 98.15%


Run history:
epoch: ▁▁▁▃▃▃▃▅▅▅▅▆▆▆▆███
loss: █▄▄▁▃▃▃▂▁▁▁▁▂▁▂▁▂▂
test_accuracy: ▁

Run summary:
epoch: 4.0
loss: 0.05239
test_accuracy: 0.9815


Open the link at the bottom of the output to view everything Weights & Biases recorded for the run.