<a href="https://colab.research.google.com/github/JerryKurata/colab-pytorch/blob/master/Fashion_MNIST_Torch_TensorBoard.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Monitoring PyTorch NN Performance with TensorBoard

TensorBoard is part of the TensorFlow framework.  However, TensorBoard's architecture uses files as the channel through which an ML framework communicates with it to display data.  Therefore, a third-party framework like PyTorch that writes properly structured files can use TensorBoard to visualize its data.

The code here is based on this tutorial: https://pytorch.org/tutorials/intermediate/tensorboard_tutorial.html.  I have modified the code to work in the Colab environment.




In [0]:
# Imports

# pyplot is plotting.  numpy is our best friend
import matplotlib.pyplot as plt
import numpy as np

# torch is the core library; torchvision provides vision datasets, models, and utilities
#   torchvision.transforms provides routines that transform image data
import torch
import torchvision
import torchvision.transforms as transforms

# torch.nn for layers, torch.nn.functional for the functional API (similar to Keras), and torch.optim for optimizers
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim



In [2]:
# As of Feb 2020, PyTorch's TensorBoard support has a compatibility problem with
# TensorFlow 2.x (worked around in the next cell).  We select TensorFlow 2.x here.
%tensorflow_version 2.x 

TensorFlow 2.x selected.


In [0]:
# Workaround for a bug in PyTorch's TensorBoard support (PyTorch issue #30966):
#   'tensorflow_core._api.v2.io.gfile' has no attribute 'get_filesystem'
import tensorflow as tf
import tensorboard as tb
# overwrite the default 2.x io gfile method with the one from the compat layer
tf.io.gfile = tb.compat.tensorflow_stub.io.gfile


In [4]:
# Define a transform that converts images to tensors and then normalizes them.
#  Each channel's values are mapped from [0.0, 1.0] to [-1.0, 1.0] via image = (image - mean) / std.
transform = transforms.Compose(
    [transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,))])
# Download and transform the Fashion-MNIST training and testing datasets
train_data = torchvision.datasets.FashionMNIST('./data',
    download=True,
    train=True,
    transform=transform)
test_data = torchvision.datasets.FashionMNIST('./data',
    download=True,
    train=False,
    transform=transform)

# Define Loaders for training and evaluating with the training and test datasets
#  num_workers = 2 runs 2 subprocesses to speed the loading
train_loader = torch.utils.data.DataLoader(train_data, batch_size=4, 
                                           shuffle=True, num_workers=2)
test_loader = torch.utils.data.DataLoader(test_data, batch_size=4, 
                                           shuffle=False, num_workers=2)


Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz to ./data/FashionMNIST/raw/train-images-idx3-ubyte.gz


Extracting ./data/FashionMNIST/raw/train-images-idx3-ubyte.gz to ./data/FashionMNIST/raw
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz to ./data/FashionMNIST/raw/train-labels-idx1-ubyte.gz


Extracting ./data/FashionMNIST/raw/train-labels-idx1-ubyte.gz to ./data/FashionMNIST/raw
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-images-idx3-ubyte.gz to ./data/FashionMNIST/raw/t10k-images-idx3-ubyte.gz


Extracting ./data/FashionMNIST/raw/t10k-images-idx3-ubyte.gz to ./data/FashionMNIST/raw
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-labels-idx1-ubyte.gz to ./data/FashionMNIST/raw/t10k-labels-idx1-ubyte.gz


Extracting ./data/FashionMNIST/raw/t10k-labels-idx1-ubyte.gz to ./data/FashionMNIST/raw
Processing...
Done!
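
The normalization applied by the transform above is simple arithmetic, and the `matplotlib_imshow` helper later in this notebook undoes it with `img / 2 + 0.5`.  As a minimal sketch in plain Python (the function names `normalize` and `unnormalize` are hypothetical, used only for illustration), the mapping and its inverse look like this:

```python
# Sketch of the arithmetic behind transforms.Normalize((0.5,), (0.5,)).
# normalize/unnormalize are illustrative names, not torchvision functions.
MEAN, STD = 0.5, 0.5

def normalize(pixel):
    """Map a pixel already scaled to [0, 1] by ToTensor into [-1, 1]."""
    return (pixel - MEAN) / STD

def unnormalize(value):
    """Invert normalize(); with MEAN = STD = 0.5 this is value / 2 + 0.5."""
    return value * STD + MEAN

for pixel in (0.0, 0.25, 0.5, 1.0):
    z = normalize(pixel)
    print(f"{pixel:.2f} -> {z:+.2f} -> {unnormalize(z):.2f}")
```

With a mean and standard deviation of 0.5, black (0.0) maps to -1.0 and white (1.0) maps to 1.0.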


In [0]:
# Define class label names for displaying.  Class labels are [0,1,2,...,9]
classes = ('T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
        'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle Boot')


# helper function to show an image
# (used in the `plot_classes_preds` function below)
def matplotlib_imshow(img, one_channel=False):
    if one_channel:
        img = img.mean(dim=0)
    img = img / 2 + 0.5     # unnormalize
    npimg = img.numpy()
    if one_channel:   # grayscale?
        plt.imshow(npimg, cmap="Greys")
    else:             # rgb
        plt.imshow(np.transpose(npimg, (1, 2, 0)))



In [6]:
#  Define NN Model class
class Net(nn.Module):

    # TODO: add layer labels
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 4 * 4, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 4 * 4)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

# Create instance of NN Model
net = Net()
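
The `16 * 4 * 4` input size of `fc1` follows from how each layer shrinks the 28 x 28 Fashion-MNIST image.  The sketch below traces that arithmetic in plain Python, assuming the layer defaults used above (stride 1 and no padding for the convolutions); `conv_out` and `pool_out` are hypothetical helper names.

```python
# Trace the spatial size of a 28x28 image through the layers of Net.
# conv_out/pool_out are illustrative helpers, not PyTorch functions.

def conv_out(size, kernel, stride=1, padding=0):
    """Output width/height of a convolution over a square input."""
    return (size + 2 * padding - kernel) // stride + 1

def pool_out(size, kernel, stride):
    """Output width/height of max pooling over a square input."""
    return (size - kernel) // stride + 1

size = 28                     # Fashion-MNIST images are 28x28
size = conv_out(size, 5)      # conv1, 5x5 kernel: 28 -> 24
size = pool_out(size, 2, 2)   # 2x2 max pool:      24 -> 12
size = conv_out(size, 5)      # conv2, 5x5 kernel: 12 -> 8
size = pool_out(size, 2, 2)   # 2x2 max pool:       8 -> 4

flat_features = 16 * size * size  # 16 channels out of conv2
print(size, flat_features)
```

If the kernel sizes or the input resolution change, the same calculation gives the new value to use in `fc1` and in the `view` call.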






When we train, we use the loss criterion to measure the loss and the optimizer to reduce it.

Each item belongs to one of 10 classes of fashion items.  CrossEntropyLoss measures how poorly our model predicts the correct class: the lower the probability the model assigns to the true class, the higher the loss.

The optimizer adjusts the parameters (weights) of the model to minimize this loss.

In [0]:
# Define loss criterion and optimizer method
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)
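
To make the loss concrete, here is a minimal plain-Python sketch of the cross-entropy of a single example: apply softmax to the raw scores, then take the negative log of the probability given to the true class.  The helper names are hypothetical, and this only illustrates the idea; `nn.CrossEntropyLoss` works on batches of tensors and, by default, averages over the batch.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)                          # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Negative log-probability assigned to the true class."""
    return -math.log(softmax(logits)[target])

uniform = [0.0] * 10        # a model with no preference among the 10 classes
confident = [0.0] * 10
confident[3] = 5.0          # a model that strongly favors class 3

print(cross_entropy(uniform, 3))     # equals ln(10), about 2.30
print(cross_entropy(confident, 3))   # true class is the favored one: small loss
print(cross_entropy(confident, 7))   # true class is a different one: large loss
```

A freshly initialized 10-class model tends to start near the uniform case, which is why training loss often begins close to 2.3.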

## Set up PyTorch support for TensorBoard

TensorBoard expects a folder containing sets of files for the various parts of the model, such as training and test loss or accuracy values over time, images used by the model, and derived information such as which images belong to the same class.

To populate the folder, PyTorch provides a TensorBoard module, `torch.utils.tensorboard`.  One key object in this module is SummaryWriter, which handles writing information to the folder.

As you read through, notice how the SummaryWriter is created once, while close() is called on it multiple times.  close() writes the pending data to the files in the folder that TensorBoard will read.

### Set up SummaryWriter

Since SummaryWriter defines the location for the files, we need to create it first.

In this example, we add a subfolder below the runs folder to keep the files of different experiments separate.  We would want to change the subfolder name when altering the experiment to prevent overwriting.

In [8]:
from torch.utils.tensorboard import SummaryWriter

# default `log_dir` is "runs" 
# We add subfolders below runs to segment various files from each other.  Change
#   the subfolder name if you alter the experiment to prevent overwriting.
log_dir = 'runs/fashion_mnist_experiment_1'
#  Clear folder
! rm -R $log_dir

#  Create instance of Summary Writer
writer = SummaryWriter(log_dir)


rm: cannot remove 'runs/fashion_mnist_experiment_1': No such file or directory
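
One way to avoid overwriting earlier experiments, instead of renaming the subfolder by hand, is to derive the folder name from the current time.  This is only a sketch of that idea: `make_run_dir` is a hypothetical helper, and the naming scheme is an assumption, not something TensorBoard requires.

```python
import os
from datetime import datetime

def make_run_dir(base="runs", experiment="fashion_mnist_experiment", now=None):
    """Build a log folder path that is unique per second, e.g. runs/<name>_<timestamp>."""
    now = now or datetime.now()
    stamp = now.strftime("%Y%m%d-%H%M%S")
    return os.path.join(base, f"{experiment}_{stamp}")

print(make_run_dir())
```

The returned path could be passed to `SummaryWriter(log_dir)` in place of the fixed name used above; TensorBoard pointed at the parent `runs` folder would then show each run separately.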


##  Save some sample training data images for visualization in TensorBoard

It is useful to be able to see some sample images in TensorBoard.  Here we output some sample images from the training data, but these could be any images you find useful.

In [0]:
# load next batch of images and labels from training data
images, labels = next(iter(train_loader))

# Write a grid of sample images.  
grid = torchvision.utils.make_grid(images)
writer.add_image('Sample_images', grid, 0)

# Write the model structure to TensorBoard
writer.add_graph(net, images)

# Close writer
writer.close()

In [10]:
# look at files in folder
! ls runs/fashion_mnist_experiment_1/

events.out.tfevents.1584838741.ae779b8cbb54.126.0


##  Save a visualization of the model graph we will train.

This shows us the current model's structure.  Note that when we change the structure, we should also change the subfolder for the data to prevent overwriting the model structure and creating confusion about which model produced which results.

In [0]:
writer.add_graph(net, images)
writer.close()

In [12]:
# look at files in folder
! ls runs/fashion_mnist_experiment_1/

events.out.tfevents.1584838741.ae779b8cbb54.126.0
events.out.tfevents.1584838743.ae779b8cbb54.126.1


## Save a projection of the data into 3D space

This creates a really nice projection of the images into 3-D space.  Images of the same class group together, so we can see that even though these items look different, they are considered to be the same class.

*Note:  In TensorBoard select “color: label” on the top left, as well as enabling “night mode”, which will make the images easier to see since their background is white.*


In [13]:
# helper function
def select_n_random(data, labels, n=100):
    '''
    Selects n random datapoints and their corresponding labels from a dataset
    '''
    assert len(data) == len(labels)

    perm = torch.randperm(len(data))
    return data[perm][:n], labels[perm][:n]

# select random images and their target indices
images, labels = select_n_random(train_data.data, train_data.targets)
print(labels)
# get the class labels for each image
class_labels = [classes[lab] for lab in labels]

# log embeddings
features = images.view(-1, 28 * 28)
writer.add_embedding(features,
                    metadata=class_labels,
                    label_img=images.unsqueeze(1))
writer.close()

tensor([5, 8, 6, 5, 7, 1, 5, 4, 5, 5, 0, 5, 7, 0, 7, 2, 3, 0, 1, 1, 1, 7, 3, 2,
        9, 1, 9, 5, 5, 1, 2, 0, 9, 9, 6, 5, 0, 2, 8, 6, 2, 9, 7, 8, 6, 1, 4, 2,
        6, 7, 1, 7, 5, 7, 7, 8, 9, 1, 3, 3, 8, 2, 6, 8, 8, 8, 9, 4, 3, 6, 6, 6,
        8, 3, 8, 1, 7, 0, 0, 5, 6, 4, 4, 3, 3, 6, 1, 2, 3, 9, 2, 2, 7, 1, 1, 8,
        5, 8, 7, 4])


### Show model training progress

TensorBoard lets us create graphs that show how our model's training is progressing.  

In [0]:
# helper functions
def images_to_probs(net, images):
    '''
    Generates predictions and corresponding probabilities from a trained
    network and a list of images
    '''
    output = net(images)
    # convert output scores (logits) to predicted class
    _, preds_tensor = torch.max(output, 1)
    preds = np.squeeze(preds_tensor.numpy())
    return preds, [F.softmax(el, dim=0)[i].item() for i, el in zip(preds, output)]


def plot_classes_preds(net, images, labels):
    '''
    Generates matplotlib Figure using a trained network, along with images
    and labels from a batch, that shows the network's top prediction along
    with its probability, alongside the actual label, coloring this
    information based on whether the prediction was correct or not.
    Uses the "images_to_probs" function.
    '''
    preds, probs = images_to_probs(net, images)
    # plot the images in the batch, along with predicted and true labels
    fig = plt.figure(figsize=(12, 48))
    for idx in np.arange(4):
        ax = fig.add_subplot(1, 4, idx+1, xticks=[], yticks=[])
        matplotlib_imshow(images[idx], one_channel=True)
        ax.set_title("{0}, {1:.1f}%\n(label: {2})".format(
            classes[preds[idx]],
            probs[idx] * 100.0,
            classes[labels[idx]]),
                    color=("green" if preds[idx]==labels[idx].item() else "red"))
    return fig

##  Train the model 

We finally get around to training the model.  We can log scalar data, such as the running loss, using the *.add_scalar* method.  We can also log associated figures with the *.add_figure* method.


In [15]:
running_loss = 0.0
for epoch in range(1):  # a single pass over the dataset; increase for more epochs

    for i, data in enumerate(train_loader, 0):

        # get the inputs; data is a list of [inputs, labels]
        inputs, labels = data

        # zero the parameter gradients
        optimizer.zero_grad()

        # forward + backward + optimize
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        running_loss += loss.item()
        if i % 1000 == 999:    # every 1000 mini-batches...

            # ...log the running loss
            writer.add_scalar('training loss',
                            running_loss / 1000,
                            epoch * len(train_loader) + i)

            # ...log a Matplotlib Figure showing the model's predictions on a
            # random mini-batch
            writer.add_figure('predictions vs. actuals',
                            plot_classes_preds(net, inputs, labels),
                            global_step=epoch * len(train_loader) + i)
            running_loss = 0.0
            writer.close() # write information to logs
            print(i)
print('Finished Training')

999
1999
2999
3999
4999
5999
6999
7999
8999
9999
10999
11999
12999
13999
14999
Finished Training
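
The step value passed to `add_scalar` and `add_figure` is `epoch * len(train_loader) + i`, which keeps the x-axis in TensorBoard increasing across epochs.  The plain-Python sketch below reproduces that bookkeeping for the settings used here (60,000 training images, batch size 4, logging every 1000 mini-batches); `logged_steps` is a hypothetical helper.

```python
dataset_size = 60000          # number of Fashion-MNIST training images
batch_size = 4
log_every = 1000
batches_per_epoch = dataset_size // batch_size   # what len(train_loader) returns here

def logged_steps(num_epochs):
    """Global step values at which the training loop writes to TensorBoard."""
    steps = []
    for epoch in range(num_epochs):
        for i in range(batches_per_epoch):
            if i % log_every == log_every - 1:   # i = 999, 1999, ...
                steps.append(epoch * batches_per_epoch + i)
    return steps

one_epoch = logged_steps(1)
print(len(one_epoch), one_epoch[0], one_epoch[-1])

two_epochs = logged_steps(2)
print(two_epochs[15])   # first logged step of the second epoch
```

For one epoch this gives the same 15 step values printed by the training cell above; with a second epoch the steps continue from 15999 rather than restarting at 999.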


### Evaluate the trained model's performance on Testing Data

Of course, once we have the model trained, we want to evaluate its performance.  That is why we separate training data from testing/evaluation data, and we never train with the testing/evaluation data.

So now we will use this testing/evaluation data to see how well our trained model does on data it was **not** trained on.

In [0]:
# 1. gets the probability predictions in a test_size x num_classes Tensor
# 2. gets the true labels in a test_size Tensor
# takes ~10 seconds to run
class_probs = []
label_batches = []
with torch.no_grad():
    for data in test_loader:
        images, labels = data
        output = net(images)
        class_probs_batch = [F.softmax(el, dim=0) for el in output]

        class_probs.append(class_probs_batch)
        label_batches.append(labels)

test_probs = torch.cat([torch.stack(batch) for batch in class_probs])
test_labels = torch.cat(label_batches)

# helper function
#   precision-recall curve
def add_pr_curve_tensorboard(class_index, test_probs, test_labels, global_step=0):
    '''
    Takes in a "class_index" from 0 to 9 and plots the corresponding
    precision-recall curve
    '''
    # add_pr_curve expects binary ground-truth labels, not the model's predictions
    tensorboard_truth = test_labels == class_index
    tensorboard_probs = test_probs[:, class_index]

    writer.add_pr_curve(classes[class_index],
                        tensorboard_truth,
                        tensorboard_probs,
                        global_step=global_step)
    writer.close()

# plot all the precision-recall (pr) curves
for i in range(len(classes)):
    add_pr_curve_tensorboard(i, test_probs, test_labels)
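
A precision-recall curve for one class is traced by sweeping a threshold over the probability predicted for that class and comparing the resulting decisions with binary ground-truth labels.  The plain-Python sketch below computes a single point on such a curve; `precision_recall` is a hypothetical helper and the six data points are made up purely for illustration.

```python
def precision_recall(truth, probs, threshold):
    """Precision and recall for one class at a given probability threshold."""
    predicted = [p >= threshold for p in probs]
    tp = sum(1 for t, p in zip(truth, predicted) if t and p)       # true positives
    fp = sum(1 for t, p in zip(truth, predicted) if not t and p)   # false positives
    fn = sum(1 for t, p in zip(truth, predicted) if t and not p)   # false negatives
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Illustrative data: whether each item truly belongs to the class,
# and the probability the model assigned to that class.
truth = [True, True, False, True, False, False]
probs = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]

for threshold in (0.25, 0.5, 0.75):
    precision, recall = precision_recall(truth, probs, threshold)
    print(f"threshold={threshold:.2f} precision={precision:.2f} recall={recall:.2f}")
```

Raising the threshold generally trades recall for precision, which is the trade-off the curves in TensorBoard show for each of the 10 classes.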

## Launch TensorBoard

TensorBoard can be run within Colab because both Colab and TensorBoard, as part of the TensorFlow ecosystem, come from Google.  However, there are a few rules to remember.

First, Colab runs on a server, and the web interface only sends commands to a backend running your code.  As such, the UI we use has no knowledge of the processes running on the server.  This means that if you invoke TensorBoard twice (or more), the second attempt may fail since TensorBoard is already running on the server.  You should get a message telling you about this and telling you to execute the shell command kill xxx, where xxx is the Process ID (PID) of the previous TensorBoard instance.

Second, if you fire off a lot of TensorBoard instances, it may hang the UI.  This happens very seldom, but it does happen.  In general, I suggest people execute the TensorBoard line manually rather than via "Run All".

In [17]:
!kill 831
%reload_ext tensorboard
%tensorboard --logdir $log_dir

/bin/bash: line 0: kill: (831) - No such process


<IPython.core.display.Javascript object>