### Weights and Biases (`wandb`) Demo

In deep learning, we train models many times, especially when exploring novel neural architectures. The problem is that deep learning frameworks such as PyTorch do not provide sufficient tools to visualize input data, track the progress of experiments, log data, and visualize outputs.

`wandb` addresses this problem. In this demo, we show how to use `wandb` to visualize input data, predictions, and training progress in the form of the loss value and the validation accuracy.

**Note**: Before running this demo, please make sure that you have a free `wandb.ai` account.

Let us install `wandb`.

In [28]:
!pip install wandb



**Import** the required modules.

In [29]:
import torch
import torchvision
import wandb
import datetime
from torch.optim import SGD
from torch.optim.lr_scheduler import CosineAnnealingLR
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
from ui import progress_bar

**Log in to and initialize** `wandb`. You will need your `wandb` API key to run this demo.

As the config indicates, we will train our model on the `cifar10` dataset with a learning rate of `0.1` and a batch size of `128` for `100` epochs.

An epoch is one complete pass over the train split of the dataset. In the `wandb` plots, the term step is used instead of epoch, since this demo calls `wandb.log` once per epoch.  
The batch size is the number of samples processed per training iteration.
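
To make the relationship between epochs, batch size, and training iterations concrete, the short sketch below computes the number of batches per epoch. It assumes the standard CIFAR10 train split of 50,000 images; the variable names are illustrative only.

```python
import math

# Assumption: the CIFAR10 train split holds 50,000 images.
num_train_samples = 50_000
batch_size = 128
epochs = 100

# The last batch of an epoch may be smaller, hence the ceiling.
iterations_per_epoch = math.ceil(num_train_samples / batch_size)
total_iterations = iterations_per_epoch * epochs

print(iterations_per_epoch)
print(total_iterations)
```

The per-epoch count of 391 agrees with the `1/391` counter shown by the progress bar in the training output of this demo.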


In [30]:
wandb.login()
config = {
  "learning_rate": 0.1,
  "epochs": 100,
  "batch_size": 128,
  "dataset": "cifar10"
}
run = wandb.init(project="wandb-project", entity="upeee", config=config)

print(wandb.config)

2022-03-10 19:44:09.092609: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2022-03-10 19:44:09.092656: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.


{'learning_rate': 0.1, 'epochs': 100, 'batch_size': 128, 'dataset': 'cifar10'}


### Build the model

Use a ResNet18 from `torchvision`. Remove the last layer that was used for 1k-class ImageNet classification. Since we will use CIFAR10, the last layer is replaced by a linear layer with 10 outputs. We will train the model from scratch, so we set `pretrained=False`.
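
As a rough sanity check on the head swap, the sketch below counts the parameters of a fully connected layer before and after the replacement. It assumes a feature dimension of 512 feeding the last layer of ResNet18 (the value of `model.fc.in_features`); the helper `linear_param_count` is hypothetical and only for illustration.

```python
def linear_param_count(in_features: int, out_features: int) -> int:
    """Number of weights plus biases in a fully connected layer."""
    return in_features * out_features + out_features

in_features = 512  # assumption: feature size feeding ResNet18's last layer

imagenet_head = linear_param_count(in_features, 1000)  # original 1k-class head
cifar10_head = linear_param_count(in_features, 10)     # replacement 10-class head

print(imagenet_head, cifar10_head)
```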

In [31]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torchvision.models.resnet18(pretrained=False, progress=True)

model.fc = torch.nn.Linear(model.fc.in_features, 10)  
model.to(device)

# watch model gradients during training
wandb.watch(model)

[]

### Loss function, Optimizer, Scheduler and DataLoader

Cross entropy is the appropriate loss function for multi-category classification. We use `SGD` (stochastic gradient descent) for optimization. The learning rate starts at `0.1` and decays to zero by the end of the total number of epochs. The decay is controlled by a cosine learning rate scheduler.
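
The cosine schedule can be reproduced in a few lines of plain Python. This is a sketch of the closed-form cosine annealing formula with a minimum learning rate of zero, not the scheduler's actual implementation; `cosine_lr` is a hypothetical helper.

```python
import math

def cosine_lr(epoch: int, base_lr: float = 0.1, t_max: int = 100,
              eta_min: float = 0.0) -> float:
    """Closed-form cosine annealing: base_lr at epoch 0, eta_min at epoch t_max."""
    return eta_min + (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / t_max)) / 2

# Print the learning rate at a few epochs of a 100-epoch run.
for epoch in (0, 25, 50, 75, 99):
    print(epoch, f"{cosine_lr(epoch):.5f}")
```

At epoch 99 the formula gives roughly 2.5e-05, which is consistent with the final learning rate of about 2e-05 reported in the run summary of this demo.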

Finally, we use the `cifar10` dataset that is available in `torchvision`. We will discuss datasets and dataloaders in a future demo. For the time being, we can treat a dataloader as a data structure that dispenses batch-sized chunks of data from either the train or test split of the dataset.
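
To illustrate the idea of a dataloader as something that hands out batches, here is a minimal pure-Python sketch. It is not how `DataLoader` is implemented (there is no shuffling, no worker processes, and no tensor collation); `iterate_batches` is a hypothetical helper.

```python
from typing import Iterator, List, Sequence

def iterate_batches(samples: Sequence[int], batch_size: int) -> Iterator[List[int]]:
    """Yield consecutive slices of at most batch_size samples."""
    for start in range(0, len(samples), batch_size):
        yield list(samples[start:start + batch_size])

toy_dataset = list(range(10))  # stand-in for a dataset of 10 samples
batches = list(iterate_batches(toy_dataset, batch_size=4))

for batch in batches:
    print(batch)
```

Note that the final batch is smaller than the others whenever the dataset size is not a multiple of the batch size.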

In [32]:
loss = torch.nn.CrossEntropyLoss()
optimizer = SGD(model.parameters(), lr=wandb.config.learning_rate)
scheduler = CosineAnnealingLR(optimizer, T_max=wandb.config.epochs)

x_train = datasets.CIFAR10(root='./data', train=True, 
                           download=True, 
                           transform=transforms.ToTensor())
x_test = datasets.CIFAR10(root='./data',
                          train=False, 
                          download=True, 
                          transform=transforms.ToTensor())
train_loader = DataLoader(x_train, 
                          batch_size=wandb.config.batch_size, 
                          shuffle=True, 
                          num_workers=2)
test_loader = DataLoader(x_test, 
                         batch_size=wandb.config.batch_size, 
                         shuffle=False, 
                         num_workers=2)

Files already downloaded and verified
Files already downloaded and verified


### Visualizing sample data from test split

We can visualize data from the test split by getting a batch sample: `image, label = next(iter(test_loader))`. We use a `wandb` table to create columns for the image, the ground truth label, and the initial model predicted label.

In [33]:

label_human = ["airplane", "automobile", "bird", "cat", "deer", "dog", "frog", "horse", "ship", "truck"]

table_test = wandb.Table(columns=["Image", "Ground Truth", "Initial Pred Label"])

# use the built-in next(); the iterator's .next() method is unavailable in recent PyTorch releases
image, label = next(iter(test_loader))
with torch.no_grad():
  pred = torch.argmax(model(image.to(device)), dim=1).cpu().numpy()

for i in range(8):
  table_test.add_data(wandb.Image(image[i]),
                      label_human[label[i]], 
                      label_human[pred[i]])
  print(label_human[label[i]], "vs. ",  label_human[pred[i]])

#wandb.log({"Test data": table_test})
#wandb.run

cat vs.  ship
ship vs.  cat
ship vs.  deer
airplane vs.  cat
frog vs.  cat
frog vs.  cat
automobile vs.  cat
frog vs.  deer
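
The prediction step above reduces each row of 10 class scores to the index of the largest score and then looks up a readable name. A minimal pure-Python sketch of that mapping, using made-up scores and a hypothetical helper `predict_label`:

```python
label_human = ["airplane", "automobile", "bird", "cat", "deer",
               "dog", "frog", "horse", "ship", "truck"]

def predict_label(scores):
    """Return the class name whose score is largest (argmax over classes)."""
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    return label_human[best_index]

# Made-up scores for one image; index 3 ("cat") has the highest value.
example_scores = [0.1, -0.4, 0.0, 1.2, 0.3, 0.9, -1.0, 0.2, 0.5, -0.2]
print(predict_label(example_scores))
```

An untrained network produces essentially arbitrary scores, which is the likely reason the initial predictions above do not match the ground truth.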


### The train loop

At every epoch, we will run the train loop for the model.

In [34]:
def train(epoch):
  model.train()
  train_loss = 0
  correct = 0
  train_samples = 0

  # sample a batch, compute the loss, and backpropagate
  for batch_idx, (data, target) in enumerate(train_loader):
    optimizer.zero_grad()
    target = target.to(device)
    output = model(data.to(device))
    loss_value = loss(output, target)
    loss_value.backward()
    optimizer.step()
    # set the learning rate for the current epoch (same value for every batch);
    # note that passing the epoch argument is deprecated in recent PyTorch releases
    scheduler.step(epoch)
    # loss_value is already averaged over the batch
    train_loss += loss_value.item()
    train_samples += len(data)
    pred = output.argmax(dim=1, keepdim=True)
    correct += pred.eq(target.view_as(pred)).sum().item()
    if batch_idx % 10 == 0:
      # running accuracy over the samples seen so far in this epoch
      accuracy = 100. * correct / train_samples
      progress_bar(batch_idx,
                   len(train_loader),
                  'Train Epoch: {}, Loss: {:.6f}, Acc: {:.2f}%'.format(epoch+1, 
                  train_loss/train_samples, accuracy))
  
  train_loss /= len(train_loader.dataset)
  accuracy = 100. * correct / len(train_loader.dataset)

  return accuracy, train_loss

### The validation loop

After every epoch, we will run the validation loop for the model. In this way, we can track the progress of our model training.

In [35]:
def test():
  model.eval()
  test_loss = 0
  correct = 0
  with torch.no_grad():
    for data, target in test_loader:
      output = model(data.to(device))   
      target = target.to(device)

      test_loss += loss(output, target).item()
      pred = output.argmax(dim=1, keepdim=True)
      correct += pred.eq(target.view_as(pred)).sum().item()

  test_loss /= len(test_loader.dataset)
  accuracy = 100. * correct / len(test_loader.dataset)

  print('\nTest Loss: {:.4f}, Acc: {:.2f}%\n'.format(test_loss, accuracy))

  return accuracy, test_loss
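
One detail is worth knowing when reading the loss values: `torch.nn.CrossEntropyLoss` averages over the batch by default, so summing the per-batch values and dividing by the dataset size, as both loops do, gives roughly the mean per-sample loss divided by the batch size. The pure-Python sketch below illustrates this with made-up per-sample losses; all names are hypothetical.

```python
# Made-up per-sample losses for a toy dataset of 8 samples.
per_sample_losses = [2.0, 1.0, 3.0, 2.0, 1.0, 1.0, 2.0, 4.0]
batch_size = 4

# Batch means, mimicking a loss function that averages over each batch.
batch_means = [
    sum(per_sample_losses[i:i + batch_size]) / batch_size
    for i in range(0, len(per_sample_losses), batch_size)
]

# What the loops in this demo report: sum of batch means / dataset size.
reported_loss = sum(batch_means) / len(per_sample_losses)

# The conventional mean loss per sample.
mean_loss = sum(per_sample_losses) / len(per_sample_losses)

print(reported_loss, mean_loss, mean_loss / batch_size)
```

Under this reading, a reported test loss of 0.0131 corresponds to a mean cross entropy of roughly 1.7 at a batch size of 128. The shape of the loss curve is unaffected by the scaling.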

### `wandb` plots

Finally, we will use `wandb` to visualize the training progress. We will use the following plots:
- Model gradients (`wandb.watch(model)`)
- Training loss (`"Train loss": train_loss`)
- Validation accuracy (`"Test accuracy": test_acc`)
- Learning rate which decreases over epochs (`"Learning rate": optimizer.param_groups[0]['lr']`)

We re-use the earlier `table_test` to see the final prediction.
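
The training cell that follows also records the best validation accuracy so far and saves a checkpoint whenever it improves. That bookkeeping can be sketched in isolation, using the first five validation accuracies printed in the output of this demo; `epochs_with_new_best` is a hypothetical helper.

```python
def epochs_with_new_best(accuracies):
    """Return (epoch, accuracy) pairs at which a new best accuracy is reached."""
    best_acc = 0.0
    improvements = []
    for epoch, acc in enumerate(accuracies):
        if acc > best_acc:
            best_acc = acc
            improvements.append((epoch, acc))  # a checkpoint would be saved here
    return improvements

# First five validation accuracies from this demo's output.
val_acc = [43.60, 43.26, 53.88, 58.77, 54.82]
print(epochs_with_new_best(val_acc))
```

Epochs whose accuracy dips below the running best, such as the second and fifth here, leave the saved checkpoint untouched.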

In [36]:
run.display(height=1000)

start_time = datetime.datetime.now()
best_acc = 0
for epoch in range(wandb.config["epochs"]):
    train_acc, train_loss = train(epoch)
    test_acc, test_loss = test()
    if test_acc > best_acc:
        wandb.run.summary["Best accuracy"] = test_acc
        best_acc = test_acc
        torch.save(model, "resnet18_best_acc.pth")
    wandb.log({
        "Train accuracy": train_acc,
        "Test accuracy": test_acc,
        "Train loss": train_loss,
        "Test loss": test_loss,
        "Learning rate": optimizer.param_groups[0]['lr']
    })

elapsed_time = datetime.datetime.now() - start_time
print("Elapsed time: %s" % elapsed_time)
wandb.run.summary["Elapsed train time"] = str(elapsed_time)

with torch.no_grad():
  pred = torch.argmax(model(image.to(device)), dim=1).cpu().numpy()

final_pred = []
for i in range(8):
    final_pred.append(label_human[pred[i]])
    print(label_human[label[i]], "vs. ",  final_pred[i])

table_test.add_column(name="Final Pred Label", data=final_pred)

wandb.log({"Test data": table_test})

wandb.finish()



 [>.............................]  Step: 3m7s | Tot: 1ms | Train Epoch: 1, Loss: 0.019536, Acc: 0.02% 1/391 




Test Loss: 0.0131, Acc: 43.60%


Test Loss: 0.0138, Acc: 43.26%


Test Loss: 0.0115, Acc: 53.88%


Test Loss: 0.0097, Acc: 58.77%


Test Loss: 0.0131, Acc: 54.82%


Test Loss: 0.0106, Acc: 60.54%


Test Loss: 0.0086, Acc: 66.44%


Test Loss: 0.0087, Acc: 66.75%


Test Loss: 0.0092, Acc: 66.95%


Test Loss: 0.0098, Acc: 68.41%


Test Loss: 0.0101, Acc: 67.46%


Test Loss: 0.0104, Acc: 68.95%


Test Loss: 0.0115, Acc: 66.91%


Test Loss: 0.0110, Acc: 70.57%


Test Loss: 0.0116, Acc: 69.33%


Test Loss: 0.0122, Acc: 68.36%


Test Loss: 0.0130, Acc: 70.42%


Test Loss: 0.0126, Acc: 70.71%


Test Loss: 0.0139, Acc: 69.93%


Test Loss: 0.0135, Acc: 69.89%


Test Loss: 0.0131, Acc: 71.51%


Test Loss: 0.0165, Acc: 68.57%


Test Loss: 0.0149, Acc: 70.41%


Test Loss: 0.0169, Acc: 68.15%


Test Loss: 0.0137, Acc: 71.63%


Test Loss: 0.0135, Acc: 72.58%


Test Loss: 0.0170, Acc: 69.00%


Test Loss: 0.0140, Acc: 73.59%


Test Loss: 0.0157, Acc: 72.15%


Test Loss: 0.0152, Acc: 72.82%


Test Loss


Test Loss: 0.0170, Acc: 74.90%


Test Loss: 0.0169, Acc: 75.62%


Test Loss: 0.0171, Acc: 75.43%


Test Loss: 0.0167, Acc: 75.49%


Test Loss: 0.0172, Acc: 75.37%


Test Loss: 0.0170, Acc: 75.45%


Test Loss: 0.0171, Acc: 75.40%


Test Loss: 0.0171, Acc: 75.59%


Test Loss: 0.0170, Acc: 75.50%


Test Loss: 0.0171, Acc: 75.69%


Test Loss: 0.0171, Acc: 75.66%


Test Loss: 0.0171, Acc: 75.45%


Test Loss: 0.0171, Acc: 75.63%


Test Loss: 0.0172, Acc: 75.72%


Test Loss: 0.0170, Acc: 75.61%


Test Loss: 0.0175, Acc: 75.73%


Test Loss: 0.0172, Acc: 75.54%


Test Loss: 0.0170, Acc: 75.70%


Test Loss: 0.0173, Acc: 75.73%


Test Loss: 0.0172, Acc: 75.78%


Test Loss: 0.0171, Acc: 75.57%


Test Loss: 0.0172, Acc: 75.79%


Test Loss: 0.0172, Acc: 75.71%


Test Loss: 0.0172, Acc: 75.69%


Test Loss: 0.0173, Acc: 75.70%


Test Loss: 0.0173, Acc: 75.79%


Test Loss: 0.0172, Acc: 75.60%


Test Loss: 0.0172, Acc: 75.63%


Test Loss: 0.0172, Acc: 75.69%


Test Loss: 0.0172, Acc: 75.79%


Test Loss


Run history:
Learning rate,████████▇▇▇▇▇▆▆▆▆▅▅▅▄▄▄▄▃▃▃▃▂▂▂▂▂▁▁▁▁▁▁▁
Test accuracy,▁▃▅▆▆▆▆▇▇▇▇█▆███████████████████████████
Test loss,▄▃▂▁▂▃▃▄▄▅▄▅█▅▆▆▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇
Train accuracy,▁▄▆▆▇▇██████████████████████████████████
Train loss,█▅▃▃▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁

Run summary:
Best accuracy,75.87
Elapsed train time,0:13:17.525729
Learning rate,2e-05
Test accuracy,75.79
Test loss,0.01724
Train accuracy,100.0
Train loss,0.0
