<a href="https://colab.research.google.com/github/timsetsfire/wandb-examples/blob/main/colab/W%26B_Training.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# PyTorch + W&B

The purpose of this lab is to instrument W&B on top of existing ML workflows, which might be leveraging 
* PyTorch
* Tensorboard (for metric tracking)
* Python `logging` (for metric tracking)

We will augment this workflow by leveraging 
* Wandb Experiments and syncing with Tensorboard
* Wandb logging
* Wandb Artifacts for dataset and model logging / versioning
* Tables to surface prediction examples on Test datasets
* Lineage tracking for all artifacts and experiments completed

Lastly, we'll do a simple demo of sweeps and use the W&B API to query
* runs and run summaries
* artifacts

In [1]:
%%capture
!pip install wandb easydict --upgrade

In [2]:
%%capture
!pip install tensorboard dill

## Logging In

In [3]:
#@title Enter host address
#@markdown Enter the host URL that corresponds to your W&B instance.
host = "https://api.wandb.ai" #@param {type: "string"}


In [4]:
import wandb
## when using wandb anywhere other than wandb.ai, you must
## provide the proper host so the client knows where to communicate
## details of the experiment
# wandb.login(key = key, host = host)
wandb.login(host = host)

wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


True

In [14]:
import os
import random
import logging
import numpy as np
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms
from tqdm.notebook import tqdm
from torch.utils.tensorboard import SummaryWriter

# Ensure deterministic behavior. A fixed integer seed is used because Python
# salts str hashes per process (unless PYTHONHASHSEED is set), so seeds
# derived from hash("...") would differ between kernel restarts.
seed = 0
torch.backends.cudnn.deterministic = True
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)

# Device configuration
# if you wind up with any device other than cpu, some code below will need to 
# change specific to the way we are interacting with torch tensors.  
# device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
device = torch.device("cpu")

# remove slow mirror from list of MNIST mirrors
torchvision.datasets.MNIST.mirrors = [mirror for mirror in torchvision.datasets.MNIST.mirrors
                                      if not mirror.startswith("http://yann.lecun.com")]

## Get Data (and log it)

There are many ways to get data and log it. How you log the data, and whether you also log your retrieval mechanism, is a matter of preference and of any internal guidelines you need to follow.

In our approach, we will write a `getter` for our data. The benefit of writing a getter is that we can log its source code with our dataset as part of the artifact metadata.

Before we get started, it is important to set the namespace for your project. This is accomplished by passing a `project_name` as well as an `entity` to your W&B experiment.

`entity` corresponds to the team with which the project will be associated. The `entity` can be a team name or your user name.

In [6]:
project_name = "demos" #@param {type: "string"}
entity = "tim-w" #@param {type: "string"}

## Logging data

W&B is unopinionated about how you track your experiments. We could log data in any number of ways:
* Log one artifact that holds all the data: training, validation, and test
* Log several artifacts, one for each of the training, validation, and test data

It is a matter of what best suits your needs, workflows, and expectations.

### Anatomy of an artifact 

The `Artifact` class will correspond to an entry in the W&B Artifact registry.  The artifact has 
* a name
* a type
* metadata
* description
* files, directory of files, or references

Example usage 
```
run = wandb.init(project = "my-project")
artifact = wandb.Artifact(name = "my_artifact", type = "data")
artifact.add_file("/path/to/my/file.txt")
run.log_artifact(artifact)
run.finish()
```

In [15]:
## create the data directory locally if it does not already exist
from pathlib import Path
data_path = Path("./data")
data_path.mkdir(exist_ok = True)

## define our data getter
def get_data(slice=5, train=True):
  '''
  helper function to get data
  args: 
    slice: Int => passed to torch.utils.data.Subset indices argument
    train: Boolean => True to download training data, False for test data
  '''
  full_dataset = torchvision.datasets.MNIST(root=".",
                                            train=train, 
                                            transform=transforms.ToTensor(),
                                            download=True)
  #  equiv to slicing with [::slice] 
  sub_dataset = torch.utils.data.Subset(
    full_dataset, indices=range(0, len(full_dataset), slice))

  return sub_dataset
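
The subsetting in `get_data` determines how many examples are available for the later train/validation split. A minimal sketch of that arithmetic, assuming the standard MNIST split sizes (60,000 training and 10,000 test images) and the default `slice=5`; the constant names are illustrative and not part of the notebook:

```python
TRAIN_SIZE, TEST_SIZE = 60_000, 10_000  # standard MNIST split sizes (assumption stated above)
SLICE = 5                               # default `slice` argument of get_data

train_indices = range(0, TRAIN_SIZE, SLICE)
test_indices = range(0, TEST_SIZE, SLICE)

# range(0, n, k) selects the same positions as slicing with [::k]
same_as_slicing = list(train_indices) == list(range(TRAIN_SIZE))[::SLICE]

print(len(train_indices))  # 12000
print(len(test_indices))   # 2000
```

The 12,000 training examples are exactly what `random_split(train, [10000, 2000])` needs, since the split lengths must sum to the dataset length.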

In [16]:
logging.basicConfig(
                format="%(levelname)s - %(asctime)s - %(message)s",
        )
logger = logging.getLogger("CNN-Logger")
logger.setLevel("INFO")
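
The log lines captured later in this notebook appear as `INFO:CNN-Logger:...` rather than in the `levelname - asctime - message` format configured here. One likely cause is that `logging.basicConfig` does nothing when the root logger already has handlers, which notebook environments often install. The sketch below simulates that situation with a stand-in handler (`preinstalled` is hypothetical) and shows the effect of `force=True`, available in Python 3.8+:

```python
import io
import logging

root = logging.getLogger()

# Simulate a notebook environment that already attached a handler to the root logger
preinstalled = logging.StreamHandler(io.StringIO())
root.addHandler(preinstalled)

fmt = "%(levelname)s - %(asctime)s - %(message)s"

# Without force, basicConfig is a no-op because the root logger has handlers
logging.basicConfig(format=fmt)
ignored = preinstalled in root.handlers

# With force=True, existing root handlers are removed and the format is applied
stream = io.StringIO()
logging.basicConfig(format=fmt, stream=stream, force=True)

logger = logging.getLogger("CNN-Logger")
logger.setLevel("INFO")
logger.info("hello")

print(stream.getvalue(), end="")
```

Writing to an in-memory `stream` is only for illustration; dropping the `stream` argument sends records to stderr as usual.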

## Our First W&B Experiment / Run

We are going to 
* get our training and test data
* split the training data into training and validation
* create artifacts for all three datasets
* log those artifacts to W&B.  

In [9]:
#%%wandb -h 600 
import pickle
from dill.source import getsource
from dill import detect
from datetime import datetime 

with wandb.init(project = project_name, job_type = "data-acquisition") as run:

  train, test = get_data(train=True), get_data(train=False)
  train, validation = torch.utils.data.random_split(train, [10000, 2000])

  torch.save(train, './data/training_data.pt')
  torch.save(validation, './data/validation_data.pt')
  torch.save(test, './data/test_data.pt')

  train_artifact = wandb.Artifact(name = "mnist-training-data", type = "dataset", 
                                  description = "training data",
                                  metadata = { 
                                      "data-set": "MNIST training",
                                      "getter": getsource(detect.code(get_data))}
                                  )
  train_artifact.add_file("./data/training_data.pt")

  validation_artifact = wandb.Artifact(name = "mnist-validation-data", type = "dataset", 
                                       description = "validation data",
                                       metadata = { 
                                      "data-set": "MNIST validation",
                                      "getter": getsource(detect.code(get_data))})
  validation_artifact.add_file("./data/validation_data.pt")

  test_artifact = wandb.Artifact(name = "mnist-test-data", type = "dataset", 
                                 description = "test data",
                                 metadata = { 
                                      "data-set": "MNIST test",
                                      "getter": getsource(detect.code(get_data))})
  test_artifact.add_file("./data/test_data.pt")  
  
  run.log_artifact(train_artifact)
  run.log_artifact(validation_artifact)
  run.log_artifact(test_artifact)

wandb: Currently logged in as: tim-w. Use `wandb login --relogin` to force relogin

Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-images-idx3-ubyte.gz
Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-images-idx3-ubyte.gz to ./MNIST/raw/train-images-idx3-ubyte.gz
Extracting ./MNIST/raw/train-images-idx3-ubyte.gz to ./MNIST/raw

Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-labels-idx1-ubyte.gz
Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-labels-idx1-ubyte.gz to ./MNIST/raw/train-labels-idx1-ubyte.gz
Extracting ./MNIST/raw/train-labels-idx1-ubyte.gz to ./MNIST/raw

Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-images-idx3-ubyte.gz
Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-images-idx3-ubyte.gz to ./MNIST/raw/t10k-images-idx3-ubyte.gz
Extracting ./MNIST/raw/t10k-images-idx3-ubyte.gz to ./MNIST/raw

Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-labels-idx1-ubyte.gz
Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-labels-idx1-ubyte.gz to ./MNIST/raw/t10k-labels-idx1-ubyte.gz
Extracting ./MNIST/raw/t10k-labels-idx1-ubyte.gz to ./MNIST/raw

## Artifact usage (Creating the DAG)

Part of the value of W&B is the ability to capture lineage via Experiments and Artifacts. Next up in our workflow is to specify a model and commence training.

It is key to remember that experiments create and consume artifacts, and we have already completed one experiment in which we created dataset artifacts.

Next, we will run an experiment that consumes the artifacts from the previous run to train a model, and then creates a model artifact.
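
A consuming run refers to an artifact by a fully qualified name of the form `entity/project/name:alias`, which is the string passed to `run.use_artifact` later in this notebook. A small sketch of building and splitting such references, assuming two hypothetical helpers (`artifact_path` and `parse_artifact_path` are illustrative and not part of the `wandb` library):

```python
def artifact_path(entity: str, project: str, name: str, alias: str = "latest") -> str:
    """Build a fully qualified artifact reference: entity/project/name:alias."""
    return f"{entity}/{project}/{name}:{alias}"


def parse_artifact_path(path: str) -> dict:
    """Split entity/project/name:alias into its parts (alias defaults to 'latest')."""
    prefix, _, alias = path.partition(":")
    entity, project, name = prefix.split("/")
    return {"entity": entity, "project": project, "name": name, "alias": alias or "latest"}


training_ref = artifact_path("tim-w", "demos", "mnist-training-data")
print(training_ref)  # tim-w/demos/mnist-training-data:latest
print(parse_artifact_path(training_ref)["name"])  # mnist-training-data
```

Pinning a version alias such as `v0` instead of `latest` keeps the recorded lineage tied to one specific dataset version.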

## Specify the model



In [17]:
# Conventional and convolutional neural network
class ConvNet(nn.Module):
    def __init__(self, kernels, classes=10):
        super(ConvNet, self).__init__()
        
        self.layer1 = nn.Sequential(
            nn.Conv2d(1, kernels[0], kernel_size=5, stride=1, padding=2),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2))
        self.layer2 = nn.Sequential(
            nn.Conv2d(kernels[0], kernels[1], kernel_size=5, stride=1, padding=2),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2))
        self.fc = nn.Linear(7 * 7 * kernels[-1], classes)
        
    def forward(self, x):
        out = self.layer1(x)
        out = self.layer2(out)
        out = out.reshape(out.size(0), -1)
        out = self.fc(out)
        return out
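
The `7 * 7 * kernels[-1]` input size of the final linear layer follows from how each layer changes the spatial size of a 28x28 MNIST image. A minimal sketch of that calculation, using the standard output-size formula for convolution and pooling layers (the helper `conv_out` is illustrative, not part of the notebook):

```python
def conv_out(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


size = 28                                             # MNIST images are 28x28
size = conv_out(size, kernel=5, stride=1, padding=2)  # layer1 conv -> 28
size = conv_out(size, kernel=2, stride=2)             # layer1 pool -> 14
size = conv_out(size, kernel=5, stride=1, padding=2)  # layer2 conv -> 14
size = conv_out(size, kernel=2, stride=2)             # layer2 pool -> 7

kernels = [16, 32]
fc_in_features = size * size * kernels[-1]
print(size, fc_in_features)  # 7 1568
```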

In [18]:
def make_loader(dataset, batch_size):
    loader = torch.utils.data.DataLoader(dataset=dataset,
                                         batch_size=batch_size, 
                                         shuffle=True,
                                         pin_memory=True, num_workers=2)
    return loader

## Training

In our first model training experiment, we sync the W&B run with TensorBoard, so no W&B-specific logging is instrumented.


In [12]:
# %%wandb -h 600
# Run training and track it with wandb, but with no explicit wandb logging.
# Since we were already using TensorBoard via SummaryWriter, we'll
# sync W&B to TensorBoard.
config = dict(
    epochs=5,
    classes=10,
    kernels=[16, 32],
    batch_size=128,
    learning_rate=0.01,
    dataset="MNIST",
    architecture="CNN"
    )

with wandb.init(project = project_name, 
                 job_type = "training", 
                 config = config,
                 sync_tensorboard = True) as run:

  config = wandb.config
  ## or, if you have a nasty nested dictionary for your config
  # config = EasyDict(wandb.config)

  run.use_artifact(f"{run.entity}/{run.project}/mnist-training-data:latest")
  run.use_artifact(f"{run.entity}/{run.project}/mnist-validation-data:latest")
  ## download and instantiation of the artifacts might be necessary.  

  train_loader = make_loader(train, batch_size=config.batch_size)
  validation_loader = make_loader(validation, batch_size=config.batch_size)

  model = ConvNet(config.kernels, config.classes).to(device)
  criterion = nn.CrossEntropyLoss()
  optimizer = torch.optim.Adam(model.parameters(), lr=config.learning_rate)

  writer = SummaryWriter(log_dir = "./wandb/latest-run")
  total_batches = len(train_loader) * config.epochs
  example_ct = 0  # number of examples seen
  batch_ct = 0
  for epoch in tqdm(range(config.epochs)):
    for step, (images, labels) in enumerate(train_loader):
      images, labels = images.to(device), labels.to(device)
      # Forward pass ➡
      outputs = model(images)
      loss = criterion(outputs, labels)
      # Backward pass ⬅
      optimizer.zero_grad()
      loss.backward()
      # Step with optimizer
      optimizer.step()
      example_ct +=  len(images)
      batch_ct += 1
      # Report metrics every 25th batch
      if ((batch_ct + 1) % 25) == 0:
        writer.add_scalar("Train Metrics/loss", loss, batch_ct)
        writer.add_scalar("epoch", epoch, batch_ct)  # log the epoch index, not the loss
        logger.info(f"Epoch: {epoch}, Loss: {loss.item()}")
    with torch.no_grad():
      correct, total = 0, 0
      for images, labels in validation_loader:
          images, labels = images.to(device), labels.to(device)
          outputs = model(images)
          _, predicted = torch.max(outputs.data, 1)
          total += labels.size(0)
          correct += (predicted == labels).sum().item()
          loss = criterion(outputs, labels)
          writer.add_scalar("Validation Metrics/loss", loss, batch_ct)
          writer.add_scalar("epoch", epoch, batch_ct)
      logger.info(f"Epoch {epoch}, Accuracy of the model on the {total} test images: {100 * correct / total}%")
      writer.add_scalar("Validation Metrics/accuracy", correct/total, batch_ct)
      writer.add_scalar("epoch", epoch, batch_ct)

  torch.save(model.state_dict(), "model.pt")
  model_artifact = wandb.Artifact(name = "mnist-model", type = "model")
  model_artifact.add_file("model.pt")
  run.log_artifact(model_artifact)



INFO:CNN-Logger:Epoch: 0, Loss: 0.27248701453208923
INFO:CNN-Logger:Epoch: 0, Loss: 0.09341172873973846
INFO:CNN-Logger:Epoch: 0, Loss: 0.10226831585168839
INFO:CNN-Logger:Epoch 0, Accuracy of the model on the 2000 test images: 97.3%
INFO:CNN-Logger:Epoch: 1, Loss: 0.04534098878502846
INFO:CNN-Logger:Epoch: 1, Loss: 0.029938332736492157
INFO:CNN-Logger:Epoch: 1, Loss: 0.2079644650220871
INFO:CNN-Logger:Epoch 1, Accuracy of the model on the 2000 test images: 97.25%
INFO:CNN-Logger:Epoch: 2, Loss: 0.09108864516019821
INFO:CNN-Logger:Epoch: 2, Loss: 0.02359282784163952
INFO:CNN-Logger:Epoch: 2, Loss: 0.02562839910387993
INFO:CNN-Logger:Epoch 2, Accuracy of the model on the 2000 test images: 96.2%
INFO:CNN-Logger:Epoch: 3, Loss: 0.014863993041217327
INFO:CNN-Logger:Epoch: 3, Loss: 0.06850112974643707
INFO:CNN-Logger:Epoch: 3, Loss: 0.090143121778965
INFO:CNN-Logger:Epoch 3, Accuracy of the model on the 2000 test images: 98.0%
INFO:CNN-Logger:Epoch: 4, Loss: 0.01115749217569828
INFO:CNN-Log

Run history:
Train Metrics/loss,█▃▃▂▁▆▃▁▁▁▂▃
Validation Metrics/accuracy,██▁
Validation Metrics/loss,▁█▆▂
epoch,▂▁▁▁▁▁▁▃▁▁▁▆▁▁▁█
global_step,▁▂▂▂▃▃▄▄▅▅▆▆▆▇██

Run summary:
Train Metrics/loss,0.09014
Validation Metrics/accuracy,0.962
Validation Metrics/loss,0.07485
epoch,3.0
global_step,316.0
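
The three loss lines per epoch in the training log follow from the reporting condition `(batch_ct + 1) % 25 == 0` combined with the number of batches per epoch. A short sketch of that arithmetic, assuming the 10,000-example training split, `batch_size=128`, and `epochs=5` from the config above (variable names here are illustrative):

```python
import math

train_size, batch_size, epochs = 10_000, 128, 5

# The DataLoader keeps the last partial batch, hence the ceiling
batches_per_epoch = math.ceil(train_size / batch_size)
total_batches = batches_per_epoch * epochs

# batch_ct is incremented before the check, so it runs from 1 to total_batches
logged_steps = [b for b in range(1, total_batches + 1) if (b + 1) % 25 == 0]

# Count how many reports fall into each (0-based) epoch
logs_per_epoch = [0] * epochs
for b in logged_steps:
    logs_per_epoch[(b - 1) // batches_per_epoch] += 1

print(batches_per_epoch, total_batches)  # 79 395
print(logged_steps[:3])                  # [24, 49, 74]
print(logs_per_epoch)                    # [3, 3, 3, 3, 3]
```

Because of the `+ 1`, the first report lands on batch 24 rather than batch 25; using `batch_ct % 25 == 0` would align reports with multiples of 25.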


## Test Data Evaluation

In [13]:
import pandas as pd
with wandb.init(project = project_name, entity = entity, job_type = "evaluation") as run:
  model_artifact = run.use_artifact(model_artifact.wait())
  ## instantiate the model if necessary
  # model_dir = model_artifact.download()
  # model = ConvNet(config.kernels, config.classes)
  # model.load_state_dict(torch.load(f"{model_dir}/model.pt"))
  run.use_artifact(f"{run.entity}/{run.project}/mnist-test-data:latest")
  ## same goes for the dataset
  test_loader = make_loader(test, batch_size=config.batch_size)

  model.eval()
  # Run the model on some test examples

  with torch.no_grad():
      correct, total = 0, 0
      total_loss = 0
      all_data = []
      for images, labels in test_loader:
          images, labels = images.to(device), labels.to(device)
          outputs = model(images)
          _, predicted = torch.max(outputs.data, 1)
          total += labels.size(0)
          correct += (predicted == labels).sum().item()
          loss = criterion(outputs, labels)*labels.size(0)
          total_loss += loss
          wandb_images = []
          for image in images.numpy():
            temp = wandb.Image(image)
            wandb_images.append(temp) 
          scores = pd.DataFrame( outputs.numpy().tolist(), columns = [f"p{i}" for i in range(outputs.shape[1])]).to_dict(orient = "series")
          data = {"images":wandb_images, "predicted": predicted.numpy().tolist(), "labels": labels.numpy().tolist()}
          data = {**data, **scores}
          all_data.append(pd.DataFrame(data))
      df = pd.concat(all_data)
      run.log({"Predictions vs Actuals": wandb.Table(dataframe = df)})
      run.log({"Test Metrics/loss": total_loss / total, "Test Metrics/accuracy": correct / total})
      logger.info(f"Accuracy of the model on the {total} " +
            f"test images: {100 * correct / total}%")
          

INFO:CNN-Logger:Accuracy of the model on the 2000 test images: 97.7%


Run history:
Test Metrics/accuracy,▁
Test Metrics/loss,▁

Run summary:
Test Metrics/accuracy,0.977
Test Metrics/loss,0.08655
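
The evaluation cell multiplies each batch loss by `labels.size(0)` before summing because `nn.CrossEntropyLoss` returns the batch mean by default, and the last batch is usually smaller than the rest. The sketch below uses made-up per-example losses (purely hypothetical numbers) to show why the size-weighted sum recovers the true per-example mean while a plain mean of batch means does not:

```python
# Per-example losses for three hypothetical batches; the last batch is smaller
batches = [[0.2, 0.4, 0.6, 0.8], [0.1, 0.3, 0.5, 0.7], [2.0]]

# What a mean-reducing criterion would return for each batch
batch_means = [sum(b) / len(b) for b in batches]

# Same accumulation as the evaluation cell: batch mean * batch size, divided by the example count
total_loss = sum(mean * len(b) for mean, b in zip(batch_means, batches))
total = sum(len(b) for b in batches)
weighted = total_loss / total

# A plain mean of batch means over-weights the small final batch
naive = sum(batch_means) / len(batch_means)

all_losses = [x for b in batches for x in b]
true_mean = sum(all_losses) / len(all_losses)

print(weighted, naive, true_mean)
```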


## Sweeps

This is a very naive instrumentation of sweeps: the whole sweep runs on a single machine.

We'll switch from using the `sync_tensorboard` argument to using `wandb.log`.  

Also, we'll use sweeps to evaluate different batch sizes and learning rates.  

1. Add wandb: In your Python script, add a couple of lines of code to log hyperparameters and output metrics from your script.
2. Write config: Define the variables and ranges to sweep over. Pick a search strategy; W&B supports grid, random, and Bayesian search, plus techniques for faster iterations like early stopping.
3. Initialize sweep: Launch the sweep server. W&B hosts this central controller and coordinates between the agents that execute the sweep.
4. Launch agent(s): Run a single-line command on each machine you'd like to use to train models in the sweep. The agents ask the central sweep server what hyperparameters to try next, and then they execute the runs.
5. Visualize results: Open the live dashboard to see all your results in one central place.

![](https://1039519455-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-Lqya5RvLedGEWPhtkjU-1972196547%2Fuploads%2Fgit-blob-d7820a5646e118213a46afd4faa2c02eed7faf5c%2Fcentral-sweep-server-3%20(2)%20(2)%20(3)%20(3)%20(2)%20(1)%20(1)%20(1)%20(1)%20(1)%20(1)%20(1)%20(1)%20(1)%20(1)%20(1)%20(3)%20(1)%20(1)%20(1)%20(1)%20(1)%20(1)%20(1)%20(1)%20(1)%20(1)%20(1)%20(1)%20(1)%20(1)%20(1)%20(1)%20(1)%20(1)%20(1)%20(2).png?alt=media)

## Sweep Configuration

In [14]:
sweep_config = {
  'method': 'grid', 
  'metric': {
      'name': 'Validation Metrics/loss',  ## must match the metric name logged in the training function
      'goal': 'minimize'
  },
  'early_terminate':{
      'type': 'hyperband',
      'min_iter': 5
  },
  'parameters': {
      'learning_rate':{
          'values': [0.05,0.025,0.01,0.005,0.001]
      }, 
      'batch_size': { 
          'values': [128, 256]
      }
  }
}
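
With `method: grid`, the sweep has to cover every combination of the listed values. A quick sketch that enumerates those combinations locally with `itertools.product`, to show how many runs the grid implies compared with the `count = 5` passed to the agent later in this notebook (the enumeration order here is illustrative and is not claimed to match the order the sweep server uses):

```python
from itertools import product

parameters = {
    "learning_rate": [0.05, 0.025, 0.01, 0.005, 0.001],
    "batch_size": [128, 256],
}

# Every combination a grid search has to cover
grid = [dict(zip(parameters, values)) for values in product(*parameters.values())]

agent_count = 5  # the `count` argument given to wandb.agent in this notebook
remaining = len(grid) - agent_count

print(len(grid), remaining)  # 10 5
```

A single agent with `count = 5` therefore leaves half of the grid unexplored; a second agent call, or a larger `count`, would be needed to finish it.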


## Training Function

In [15]:
def sweep_train(config_defaults = dict(learning_rate=0.01, batch_size = 128)): 

  config_standard = dict(
    epochs=5,
    classes=10,
    kernels=[16, 32],
    batch_size=128,
    dataset="MNIST",
    architecture="CNN"
    )
  
  config = {**config_defaults, **config_standard}

  with wandb.init(config = config, sync_tensorboard = False) as run:

    config = wandb.config
    ## or, if you have a nasty nested dictionary for your config
    # config = EasyDict(wandb.config)

    run.use_artifact(f"{run.entity}/{run.project}/mnist-training-data:latest")
    run.use_artifact(f"{run.entity}/{run.project}/mnist-validation-data:latest")

    train_loader = make_loader(train, batch_size=config.batch_size)
    validation_loader = make_loader(validation, batch_size=config.batch_size)


    model = ConvNet(config.kernels, config.classes).to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=config.learning_rate)

    total_batches = len(train_loader) * config.epochs
    example_ct = 0  # number of examples seen
    batch_ct = 0
    for epoch in tqdm(range(config.epochs)):
      for _, (images, labels) in enumerate(train_loader):
        images, labels = images.to(device), labels.to(device)
        # Forward pass ➡
        outputs = model(images)
        loss = criterion(outputs, labels)
        # Backward pass ⬅
        optimizer.zero_grad()
        loss.backward()
        # Step with optimizer
        optimizer.step()
        example_ct +=  len(images)
        batch_ct += 1
        # Report metrics every 25th batch
        if ((batch_ct + 1) % 25) == 0:
          logger.info(f"Epoch: {epoch}, Loss: {loss.item()}")
          run.log({ "Train Metrics/loss": loss, "epoch": epoch})
      with torch.no_grad():
        correct, total = 0, 0
        for images, labels in validation_loader:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
            loss = criterion(outputs, labels)
            run.log({"Validation Metrics/loss": loss, "epoch": epoch})
        logger.info(f"Epoch {epoch}, Accuracy of the model on the {total} validation images: {100 * correct / total}%")
        run.log({"Validation Metrics/accuracy": correct / total, "epoch": epoch})

    torch.save(model.state_dict(), f"{run.id}-model.pt")
    model_artifact = wandb.Artifact(name = f"{run.id}-mnist-model", type = "model")
    model_artifact.add_file(f"{run.id}-model.pt")
    run.log_artifact(model_artifact)


## Initiate the Sweep

In [16]:
sweep_id = wandb.sweep(sweep_config, project=project_name)

Create sweep with ID: cfgbqajh
Sweep URL: https://wandb.ai/tim-w/demos/sweeps/cfgbqajh


## Run the Sweep Agent

In [17]:
wandb_agent = wandb.agent(sweep_id, function=sweep_train, count = 5)

wandb: Agent Starting Run: hc73c6u5 with config:
wandb: 	batch_size: 128
wandb: 	learning_rate: 0.05

INFO:CNN-Logger:Epoch: 0, Loss: 2.291313648223877
INFO:CNN-Logger:Epoch: 0, Loss: 2.3052728176116943
INFO:CNN-Logger:Epoch: 0, Loss: 2.3036835193634033
INFO:CNN-Logger:Epoch 0, Accuracy of the model on the 2000 validation images: 9.7%
INFO:CNN-Logger:Epoch: 1, Loss: 2.3076467514038086
INFO:CNN-Logger:Epoch: 1, Loss: 2.298306703567505
INFO:CNN-Logger:Epoch: 1, Loss: 2.3043770790100098
INFO:CNN-Logger:Epoch 1, Accuracy of the model on the 2000 validation images: 9.7%
INFO:CNN-Logger:Epoch: 2, Loss: 2.313431978225708
INFO:CNN-Logger:Epoch: 2, Loss: 2.3059988021850586
INFO:CNN-Logger:Epoch: 2, Loss: 2.2887861728668213
INFO:CNN-Logger:Epoch 2, Accuracy of the model on the 2000 validation images: 11.6%
INFO:CNN-Logger:Epoch: 3, Loss: 2.309713840484619
INFO:CNN-Logger:Epoch: 3, Loss: 2.322235107421875
INFO:CNN-Logger:Epoch: 3, Loss: 2.3006088733673096
INFO:CNN-Logger:Epoch 3, Accuracy of the model on the 2000 validation images: 10.65%
INFO:CNN-Logger:Epoch: 4, Loss: 2.306809902191162
INFO:CNN

Run history:
Train Metrics/loss,▂▄▄▅▃▄▆▅▁▅█▃▅▃▃
Validation Metrics/accuracy,▁▁█▅█
Validation Metrics/loss,▃▆▇▄▅▆▆▄▂▅▄▇▅▃▃▆▅▃▁▅▆▄▄▃▂▁▄█▇▂▂▃▅▅▃▄▄▆▁▅
epoch,▁▁▁▁▁▁▁▁▃▃▃▃▃▃▃▃▅▅▅▅▅▅▅▅▆▆▆▆▆▆▆▆████████

Run summary:
Train Metrics/loss,2.29698
Validation Metrics/accuracy,0.116
Validation Metrics/loss,2.30653
epoch,4.0

wandb: Agent Starting Run: 042b7efg with config:
wandb: 	batch_size: 128
wandb: 	learning_rate: 0.025

INFO:CNN-Logger:Epoch: 0, Loss: 0.41747933626174927
INFO:CNN-Logger:Epoch: 0, Loss: 0.33109694719314575
INFO:CNN-Logger:Epoch: 0, Loss: 0.17806939780712128
INFO:CNN-Logger:Epoch 0, Accuracy of the model on the 2000 validation images: 94.1%
INFO:CNN-Logger:Epoch: 1, Loss: 0.1433674693107605
INFO:CNN-Logger:Epoch: 1, Loss: 0.21962569653987885
INFO:CNN-Logger:Epoch: 1, Loss: 0.2470577210187912
INFO:CNN-Logger:Epoch 1, Accuracy of the model on the 2000 validation images: 95.25%
INFO:CNN-Logger:Epoch: 2, Loss: 0.1303713619709015
INFO:CNN-Logger:Epoch: 2, Loss: 0.21828550100326538
INFO:CNN-Logger:Epoch: 2, Loss: 0.17146150767803192
INFO:CNN-Logger:Epoch 2, Accuracy of the model on the 2000 validation images: 96.25%
INFO:CNN-Logger:Epoch: 3, Loss: 0.11800725013017654
INFO:CNN-Logger:Epoch: 3, Loss: 0.1038677841424942
INFO:CNN-Logger:Epoch: 3, Loss: 0.17247097194194794
INFO:CNN-Logger:Epoch 3, Accuracy of the model on the 2000 validation images: 95.65%
INFO:CNN-Logger:Epoch: 4, Loss: 0.0615539

Run history:
Train Metrics/loss,█▆▃▃▄▅▃▄▃▂▂▃▁▃▁
Validation Metrics/accuracy,▁▅█▆█
Validation Metrics/loss,▄▄▄█▅▃▃▅▄▂▆▃▁▂▂▃▂▂▄▁▁▂▂▂▂▂▃▂▅▁▃▂▄▃▃▃▃▂▁▁
epoch,▁▁▁▁▁▁▁▁▃▃▃▃▃▃▃▃▅▅▅▅▅▅▅▅▆▆▆▆▆▆▆▆████████

Run summary:
Train Metrics/loss,0.04598
Validation Metrics/accuracy,0.962
Validation Metrics/loss,0.06019
epoch,4.0

wandb: Sweep Agent: Waiting for job.
wandb: Job received.
wandb: Agent Starting Run: 8nauok9b with config:
wandb: 	batch_size: 128
wandb: 	learning_rate: 0.01

INFO:CNN-Logger:Epoch: 0, Loss: 0.6284818649291992
INFO:CNN-Logger:Epoch: 0, Loss: 0.20073556900024414
INFO:CNN-Logger:Epoch: 0, Loss: 0.12607581913471222
INFO:CNN-Logger:Epoch 0, Accuracy of the model on the 2000 validation images: 94.95%
INFO:CNN-Logger:Epoch: 1, Loss: 0.16424758732318878
INFO:CNN-Logger:Epoch: 1, Loss: 0.16210998594760895
INFO:CNN-Logger:Epoch: 1, Loss: 0.06629647314548492
INFO:CNN-Logger:Epoch 1, Accuracy of the model on the 2000 validation images: 95.9%
INFO:CNN-Logger:Epoch: 2, Loss: 0.041608866304159164
INFO:CNN-Logger:Epoch: 2, Loss: 0.07588917762041092
INFO:CNN-Logger:Epoch: 2, Loss: 0.11809847503900528
INFO:CNN-Logger:Epoch 2, Accuracy of the model on the 2000 validation images: 97.3%
INFO:CNN-Logger:Epoch: 3, Loss: 0.03595416992902756
INFO:CNN-Logger:Epoch: 3, Loss: 0.055061642080545425
INFO:CNN-Logger:Epoch: 3, Loss: 0.025203054770827293
INFO:CNN-Logger:Epoch 3, Accuracy of the model on the 2000 validation images: 97.45%
INFO:CNN-Logger:Epoch: 4, Loss: 0.01

Run history:
Train Metrics/loss,█▃▂▃▃▂▁▂▂▁▂▁▁▁▂
Validation Metrics/accuracy,▁▃▆▇█
Validation Metrics/loss,▅▅▆▃▅▅▂█▆▄▆▆▂▄▄█▄▂▂▃▅▄▂▄▁▃▃▂▄▃▂▄▄▃▅▂▄▁▁▁
epoch,▁▁▁▁▁▁▁▁▃▃▃▃▃▃▃▃▅▅▅▅▅▅▅▅▆▆▆▆▆▆▆▆████████

Run summary:
Train Metrics/loss,0.06275
Validation Metrics/accuracy,0.98
Validation Metrics/loss,0.01904
epoch,4.0

wandb: Agent Starting Run: roc5n7nr with config:
wandb: 	batch_size: 128
wandb: 	learning_rate: 0.005

INFO:CNN-Logger:Epoch: 0, Loss: 0.25563564896583557
INFO:CNN-Logger:Epoch: 0, Loss: 0.21827128529548645
INFO:CNN-Logger:Epoch: 0, Loss: 0.04436584934592247
INFO:CNN-Logger:Epoch 0, Accuracy of the model on the 2000 validation images: 95.9%
INFO:CNN-Logger:Epoch: 1, Loss: 0.053948935121297836
INFO:CNN-Logger:Epoch: 1, Loss: 0.07875742018222809
INFO:CNN-Logger:Epoch: 1, Loss: 0.09501388669013977
INFO:CNN-Logger:Epoch 1, Accuracy of the model on the 2000 validation images: 98.0%
INFO:CNN-Logger:Epoch: 2, Loss: 0.08248032629489899
INFO:CNN-Logger:Epoch: 2, Loss: 0.11124870181083679
INFO:CNN-Logger:Epoch: 2, Loss: 0.020634770393371582
INFO:CNN-Logger:Epoch 2, Accuracy of the model on the 2000 validation images: 97.9%
INFO:CNN-Logger:Epoch: 3, Loss: 0.034339915961027145
INFO:CNN-Logger:Epoch: 3, Loss: 0.033890265971422195
INFO:CNN-Logger:Epoch: 3, Loss: 0.09973181784152985
INFO:CNN-Logger:Epoch 3, Accuracy of the model on the 2000 validation images: 97.0%
INFO:CNN-Logger:Epoch: 4, Loss: 0.01

Run history:
Train Metrics/loss,█▇▂▂▃▃▃▄▁▁▁▃▁▁▂
Validation Metrics/accuracy,▁▇▇▄█
Validation Metrics/loss,▃▄▆▇▃▇▅█▃▁▇▃▃▃▂▂▃▂▄▅▃▃▁▂▂▂▂▄▅▆▄▂▄▃▂▃▁▂▁▄
epoch,▁▁▁▁▁▁▁▁▃▃▃▃▃▃▃▃▅▅▅▅▅▅▅▅▆▆▆▆▆▆▆▆████████

Run summary:
Train Metrics/loss,0.0475
Validation Metrics/accuracy,0.9825
Validation Metrics/loss,0.09966
epoch,4.0

wandb: Sweep Agent: Waiting for job.
wandb: Job received.
wandb: Agent Starting Run: i53rs22g with config:
wandb: 	batch_size: 128
wandb: 	learning_rate: 0.001

INFO:CNN-Logger:Epoch: 0, Loss: 0.9622626304626465
INFO:CNN-Logger:Epoch: 0, Loss: 0.5403366684913635
INFO:CNN-Logger:Epoch: 0, Loss: 0.26470300555229187
INFO:CNN-Logger:Epoch 0, Accuracy of the model on the 2000 validation images: 89.45%
INFO:CNN-Logger:Epoch: 1, Loss: 0.27045753598213196
INFO:CNN-Logger:Epoch: 1, Loss: 0.19254863262176514
INFO:CNN-Logger:Epoch: 1, Loss: 0.24743443727493286
INFO:CNN-Logger:Epoch 1, Accuracy of the model on the 2000 validation images: 94.45%
INFO:CNN-Logger:Epoch: 2, Loss: 0.23118655383586884
INFO:CNN-Logger:Epoch: 2, Loss: 0.1730300486087799
INFO:CNN-Logger:Epoch: 2, Loss: 0.16606776416301727
INFO:CNN-Logger:Epoch 2, Accuracy of the model on the 2000 validation images: 95.45%
INFO:CNN-Logger:Epoch: 3, Loss: 0.1044996827840805
INFO:CNN-Logger:Epoch: 3, Loss: 0.05652156099677086
INFO:CNN-Logger:Epoch: 3, Loss: 0.08873186260461807
INFO:CNN-Logger:Epoch 3, Accuracy of the model on the 2000 validation images: 96.55%
INFO:CNN-Logger:Epoch: 4, Loss: 0.096345

Run history:
Train Metrics/loss,█▅▃▃▂▂▂▂▂▁▁▁▁▂▂
Validation Metrics/accuracy,▁▆▇██
Validation Metrics/loss,▇▆▆▅█▇▇▅▃▄▅▄▃▄▃▄▂▂▂▂▄▆▂▂▂▂▂▁▂▃▁▂▁▂▁▁▂▄▂▂
epoch,▁▁▁▁▁▁▁▁▃▃▃▃▃▃▃▃▅▅▅▅▅▅▅▅▆▆▆▆▆▆▆▆████████

Run summary:
Train Metrics/loss,0.19961
Validation Metrics/accuracy,0.9695
Validation Metrics/loss,0.09303
epoch,4.0


## Use API to interact with W&B

In [18]:
import pandas as pd
import wandb
api = wandb.Api()
sweep = api.sweep(f"{entity}/{project_name}/{sweep_id}")
temp_data = []
for r in sweep.runs:
 temp_dict = dict(**dict(r.summary), **r.config)
 temp_dict["run_id"] = r.id
 temp_dict["run_name"] = r.name
 temp_data.append( temp_dict)
df = pd.DataFrame(temp_data)
df.set_index("run_id", inplace = True)
best_run_id = sweep.best_run().id
best_run = api.run(f"{entity}/{project_name}/{best_run_id}")
df.loc[best_run_id]

wandb: Sorting runs by +summary_metrics.Validation Metrics/loss


Validation Metrics/accuracy                 0.98
_step                                         99
epoch                                          4
_wandb                           {'runtime': 40}
_runtime                               40.897369
_timestamp                     1660662924.082012
Train Metrics/loss                      0.062746
Validation Metrics/loss                 0.019038
epochs                                         5
classes                                       10
dataset                                    MNIST
kernels                                 [16, 32]
batch_size                                   128
architecture                                 CNN
learning_rate                               0.01
run_name                            deft-sweep-3
Name: 8nauok9b, dtype: object
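
The best run is chosen by the sweep metric, `Validation Metrics/loss`, sorted ascending. Since the training function logs that metric once per validation batch, the summary value is the last batch logged, which can be noisy. The sketch below repeats the selection locally on the summary values printed in the sweep output (the dictionary layout and key names are illustrative) and compares it with selecting on validation accuracy:

```python
# Final validation loss and accuracy per run, copied from the run summaries in the sweep output
summaries = {
    "hc73c6u5": {"val_loss": 2.30653, "val_acc": 0.116},
    "042b7efg": {"val_loss": 0.06019, "val_acc": 0.962},
    "8nauok9b": {"val_loss": 0.01904, "val_acc": 0.98},
    "roc5n7nr": {"val_loss": 0.09966, "val_acc": 0.9825},
    "i53rs22g": {"val_loss": 0.09303, "val_acc": 0.9695},
}

best_by_loss = min(summaries, key=lambda run_id: summaries[run_id]["val_loss"])
best_by_acc = max(summaries, key=lambda run_id: summaries[run_id]["val_acc"])

print(best_by_loss)  # 8nauok9b
print(best_by_acc)   # roc5n7nr
```

The two criteria disagree here, so logging an epoch-level average of the validation loss may give a more stable sweep metric than the last batch value.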

In [None]:
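## restart the runtime by killing the current kernel process; the cells below run in a fresh session
import os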
os.kill(os.getpid(), 9)

In [1]:
project_name = "demos" #@param {type: "string"}
entity = "tim-w" #@param {type: "string"}
sweep_id = "cfgbqajh"

In [2]:
import pandas as pd
import wandb

api = wandb.Api()
sweep = api.sweep(f"{entity}/{project_name}/{sweep_id}")

# Build one row per run: summary metrics plus config (config wins on key collisions)
temp_data = []
for r in sweep.runs:
    temp_dict = {**dict(r.summary), **r.config}
    temp_dict["run_id"] = r.id
    temp_dict["run_name"] = r.name
    temp_data.append(temp_dict)
df = pd.DataFrame(temp_data)
df.set_index("run_id", inplace=True)

# Look up the sweep's best run and show its row
best_run_id = sweep.best_run().id
best_run = api.run(f"{entity}/{project_name}/{best_run_id}")
df.loc[best_run_id]

[34m[1mwandb[0m: Sorting runs by +summary_metrics.Validation Metrics/loss


_runtime                               40.897369
_timestamp                     1660662924.082012
Train Metrics/loss                      0.062746
Validation Metrics/loss                 0.019038
Validation Metrics/accuracy                 0.98
_step                                         99
epoch                                          4
_wandb                           {'runtime': 40}
epochs                                         5
classes                                       10
dataset                                    MNIST
kernels                                 [16, 32]
batch_size                                   128
architecture                                 CNN
learning_rate                               0.01
run_name                            deft-sweep-3
Name: 8nauok9b, dtype: object

In [3]:
model_artifact = [a for a in best_run.logged_artifacts() if a.type == "model"].pop()
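
Calling `.pop()` on an empty list raises a bare `IndexError`, so the lookup above fails without much context if the run logged no artifact of type `"model"`. A hedged sketch of a more explicit lookup follows. `last_of_type` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins for whatever `best_run.logged_artifacts()` yields; the sketch only assumes they expose `type` and `name` attributes.

```python
from types import SimpleNamespace


def last_of_type(artifacts, artifact_type):
    """Return the last artifact whose `type` matches, or raise a descriptive error."""
    matches = [a for a in artifacts if a.type == artifact_type]
    if not matches:
        raise LookupError(f"no artifact of type {artifact_type!r} was logged")
    # Same choice as list.pop(): the final match in iteration order.
    return matches[-1]


# Stand-ins for logged artifacts; the names are made up for illustration.
logged = [
    SimpleNamespace(name="example-dataset:v0", type="dataset"),
    SimpleNamespace(name="example-model:v0", type="model"),
    SimpleNamespace(name="example-model:v1", type="model"),
]

model_stub = last_of_type(logged, "model")
print(model_stub.name)
```

Like the original one-liner, this keeps the last matching artifact; only the failure mode changes.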

In [25]:
import pandas as pd
import torch
import torch.nn as nn

# ConvNet, make_loader, logger and device come from earlier cells;
# re-run those cells first if the kernel was restarted.
with wandb.init(project=project_name, entity=entity, job_type="evaluation", config=best_run.config) as run:
    config = wandb.config

    ## get and use model
    model_artifact = run.use_artifact(model_artifact.wait())
    model_artifact.download()
    model = ConvNet(config.kernels, config.classes)
    model.load_state_dict(torch.load(model_artifact.file(), map_location=device))
    model.to(device)  # keep the model on the same device as the input batches

    ## get and use test data
    test_data_artifact = run.use_artifact(f"{run.entity}/{run.project}/mnist-test-data:latest")
    test_data_artifact.download()
    test = torch.load(test_data_artifact.file())
    test_loader = make_loader(test, batch_size=config.batch_size)

    criterion = nn.CrossEntropyLoss()

    model.eval()
    # Run the model on the test examples
    with torch.no_grad():
        correct, total = 0, 0
        total_loss = 0.0
        all_data = []
        for images, labels in test_loader:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            _, predicted = torch.max(outputs, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
            # criterion returns the batch mean, so weight it by the batch size
            total_loss += criterion(outputs, labels).item() * labels.size(0)
            # Move tensors to the CPU before converting them to NumPy
            wandb_images = [wandb.Image(image) for image in images.cpu().numpy()]
            # One score column per class: p0, p1, ...
            scores = pd.DataFrame(
                outputs.cpu().numpy().tolist(),
                columns=[f"p{i}" for i in range(outputs.shape[1])],
            ).to_dict(orient="series")
            data = {"images": wandb_images, "predicted": predicted.cpu().numpy().tolist(), "labels": labels.cpu().numpy().tolist()}
            data = {**data, **scores}
            all_data.append(pd.DataFrame(data))
        df = pd.concat(all_data)
        run.log({"Predictions vs Actuals": wandb.Table(dataframe=df)})
        run.log({"Test Metrics/loss": total_loss / total, "Test Metrics/accuracy": correct / total})
        logger.info(f"Accuracy of the model on the {total} " +
                    f"test images: {100 * correct / total}%")

INFO:CNN-Logger:Accuracy of the model on the 2000 test images: 98.0%



Run history:
Test Metrics/accuracy,▁
Test Metrics/loss,▁

Run summary:
Test Metrics/accuracy,0.98
Test Metrics/loss,0.05746
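
The evaluation loop weights each batch's mean loss by the batch size before summing, then divides by the total number of examples. `nn.CrossEntropyLoss()` averages over the batch by default, so this weighting recovers the per-example mean even when the final batch is smaller than the others; averaging the batch means directly would give that short batch too much weight. A small sketch with made-up numbers (the batch losses and sizes are illustrative and not taken from the run above; both helper names are hypothetical):

```python
# (mean loss of the batch, number of examples in the batch); illustrative values only.
batches = [(0.10, 128), (0.20, 128), (0.90, 16)]


def weighted_mean_loss(batches):
    """Per-example mean loss: weight each batch mean by its batch size."""
    total_loss = 0.0
    total = 0
    for mean_loss, size in batches:
        total_loss += mean_loss * size
        total += size
    return total_loss / total


def naive_mean_loss(batches):
    """Unweighted average of batch means; biased when batch sizes differ."""
    return sum(mean_loss for mean_loss, _ in batches) / len(batches)


print("weighted:", weighted_mean_loss(batches))
print("naive:   ", naive_mean_loss(batches))
```

Here the short, high-loss final batch pulls the naive average well above the weighted one, which is why the cell accumulates `loss * labels.size(0)` rather than the raw batch loss.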


In [None]:
# Artifact path noted for reference: tim-w/model-registry/MNIST:v0

In [4]:
with wandb.init(project=project_name, job_type="register") as run:
    # Declare the model as an input of this run, then link it into the registry
    run.use_artifact(model_artifact.wait())
    run.link_artifact(
        model_artifact,
        f"{entity}/model-registry/MNIST-v2",
        aliases=["latest", "staging", "needs-validation"],
    )

[34m[1mwandb[0m: Currently logged in as: [33mtim-w[0m. Use [1m`wandb login --relogin`[0m to force relogin

