# PyTorch Profiler with TensorBoard

It is probably easier to run this notebook locally, as I am not sure how to start the TensorBoard profiler server from Colab.

The TensorBoard profiler plugin also needs to be installed:
```sh
pip install torch_tb_profiler
```
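
A quick way to confirm that the plugin is importable from the active environment is sketched below. It assumes the package installs a top-level module named `torch_tb_profiler` (the same name as the pip package); the helper name `is_installed` is made up for this example.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found on the current Python path."""
    return importlib.util.find_spec(module_name) is not None


# Report the packages this notebook relies on
for name in ("torch", "torchvision", "torch_tb_profiler"):
    print(f"{name}: {'found' if is_installed(name) else 'MISSING'}")
```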

Sources:
- https://pytorch.org/tutorials/intermediate/tensorboard_profiler_tutorial.html

In [1]:
import torch
import torch.nn
import torch.optim
import torch.profiler
import torch.utils.data
import torchvision.datasets
import torchvision.models
import torchvision.transforms as T



Then prepare the input data. For this tutorial, we use the CIFAR10 dataset. Transform it to the desired format and use DataLoader to load each batch.



In [2]:
transform = T.Compose(
    [T.Resize(224),
     T.ToTensor(),
     T.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
train_set = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)

Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to ./data/cifar-10-python.tar.gz


100%|██████████| 170498071/170498071 [04:24<00:00, 644135.54it/s] 


Extracting ./data/cifar-10-python.tar.gz to ./data
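
`T.ToTensor()` scales pixel values to the range [0, 1], and `T.Normalize(mean, std)` then computes `(x - mean) / std` per channel. With a mean and standard deviation of 0.5, this maps [0, 1] to [-1, 1]. The plain-Python sketch below only reproduces that arithmetic for single values; `normalize` is a hypothetical helper, not a torchvision function.

```python
def normalize(x: float, mean: float = 0.5, std: float = 0.5) -> float:
    """Apply the same per-value formula that T.Normalize uses: (x - mean) / std."""
    return (x - mean) / std


# Darkest, mid-grey, and brightest pixel values after ToTensor()
for pixel in (0.0, 0.5, 1.0):
    print(pixel, "->", normalize(pixel))
```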


Next, create the ResNet model, loss function, and optimizer objects. To run on a GPU, move the model and loss to the GPU device.



In [3]:
# Device configuration
def get_device():
    if torch.cuda.is_available():
        return torch.device('cuda')
    elif torch.backends.mps.is_available():
        return torch.device('mps')
    else:
        return torch.device('cpu')

device = get_device()

In [4]:
# Load a pre-trained ResNet-18 (newer torchvision versions deprecate `pretrained=True` in favor of `weights=`)
model = torchvision.models.resnet18(pretrained=True).to(device)

# Loss function and optimizer used to train the model
criterion = torch.nn.CrossEntropyLoss().to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

# Put the model in training mode (this does not run any training by itself)
model.train()

Downloading: "https://download.pytorch.org/models/resnet18-f37072fd.pth" to /Users/vascomeerman/.cache/torch/hub/checkpoints/resnet18-f37072fd.pth
100%|██████████| 44.7M/44.7M [00:19<00:00, 2.40MB/s]


ResNet(
  (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
  (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU(inplace=True)
  (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
  (layer1): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (1): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
  

In [7]:
# Training step for each batch of data
def train(data):
    inputs, labels = data[0].to(device=device), data[1].to(device=device)
    outputs = model(inputs)
    loss = criterion(outputs, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

### Use profiler to record execution events

The profiler is enabled through a context manager and accepts several parameters.
Some of the most useful are:

- ``schedule`` - callable that takes step (int) as a single parameter
  and returns the profiler action to perform at each step.

  In this example, with ``wait=1, warmup=1, active=3, repeat=2``,
  the profiler skips the first step/iteration,
  starts warming up on the second,
  and records the following three iterations,
  after which the trace becomes available and ``on_trace_ready`` (when set) is called.
  In total, the cycle runs twice. Each cycle is called a "span" in the TensorBoard plugin.

  During ``wait`` steps, the profiler is disabled.
  During ``warmup`` steps, the profiler starts tracing but the results are discarded.
  This keeps start-up overhead out of the results:
  the overhead is high when profiling begins and can easily skew the measurements.
  During ``active`` steps, the profiler works and records events.
- ``on_trace_ready`` - callable that is called at the end of each cycle.
  In this example we use ``torch.profiler.tensorboard_trace_handler`` to generate result files for TensorBoard.
  After profiling, the result files are saved into the ``./log/resnet18`` directory.
  Specify this directory as the ``logdir`` parameter to analyze the profile in TensorBoard.
- ``record_shapes`` - whether to record shapes of the operator inputs.
- ``profile_memory`` - track tensor memory allocation/deallocation. Note: on PyTorch versions
  before 1.10, if profiling takes a long time, disable this option or upgrade to a newer version.
- ``with_stack`` - Record source information (file and line number) for the ops.
  If the TensorBoard is launched in VSCode ([reference](https://code.visualstudio.com/docs/datascience/pytorch-support#_tensorboard-integration)),
  clicking a stack frame will navigate to the specific code line.
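
To make the schedule concrete, the sketch below models which phase the profiler is in at each step. It is a simplified pure-Python stand-in written for this notebook, not the library's implementation: the function name `schedule_action` and the string labels are assumptions, and it ignores optional arguments that the real `torch.profiler.schedule` may accept.

```python
def schedule_action(step: int, wait: int, warmup: int, active: int, repeat: int = 0) -> str:
    """Simplified model of the profiler phase at a given step (0-based)."""
    cycle_len = wait + warmup + active
    # After `repeat` full cycles the profiler stays idle (repeat=0 means no limit)
    if repeat > 0 and step >= cycle_len * repeat:
        return "none"
    pos = step % cycle_len  # position inside the current cycle
    if pos < wait:
        return "none"
    if pos < wait + warmup:
        return "warmup"
    if pos < cycle_len - 1:
        return "record"
    # Last active step of the cycle: the trace is handed to on_trace_ready
    return "record_and_save"


actions = [schedule_action(s, wait=1, warmup=1, active=3, repeat=2) for s in range(12)]
for step, action in enumerate(actions):
    print(step, action)
```

With these settings the two cycles cover (1 + 1 + 3) * 2 = 10 steps, which is why the training loop further down stops once `step` reaches 10.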

In [8]:
# Schedule to use:
# wait=1   -> the profiler is idle during this phase
# warmup=1 -> the profiler starts tracing, but the results are discarded
# active=3 -> the profiler traces AND records the data
# repeat=2 -> upper bound on the number of cycles; each cycle is called a "span" in TensorBoard
schedule = torch.profiler.schedule(wait=1, warmup=1, active=3, repeat=2)

In [9]:
# The profile context manager profiles the code inside the `with` block according to the schedule
# (torch.profiler.record_function can additionally label 'code ranges' inside it; you can create several such ranges)
with torch.profiler.profile(
        schedule=schedule,
        on_trace_ready=torch.profiler.tensorboard_trace_handler('./log/resnet18'),
        record_shapes=True, # Record shapes of operator inputs
        profile_memory=True, # Also log memory consumed by the tensors
        with_stack=True # Shows the stacktrace for the code of an operation
) as prof:
    # Loop through the batches in our training dataset
    for step, batch_data in enumerate(train_loader):
        
        # Stop after (wait + warmup + active) * repeat steps; later steps would not be profiled
        if step >= (1 + 1 + 3) * 2:
            break

        # Call our train function
        train(batch_data)

        # prof.step() signals to the profiler that the next step has started
        # The current step is stored as prof.step_num
        prof.step()

STAGE:2022-12-07 16:18:07 68450:16419090 ActivityProfilerController.cpp:294] Completed Stage: Warm Up
[W CPUAllocator.cpp:231] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event
STAGE:2022-12-07 16:18:08 68450:16419090 ActivityProfilerController.cpp:300] Completed Stage: Collection
STAGE:2022-12-07 16:18:09 68450:16419090 output_json.cpp:417] Completed Stage: Post Processing
STAGE:2022-12-07 16:18:10 68450:16419090 ActivityProfilerController.cpp:294] Completed Stage: Warm Up
STAGE:2022-12-07 16:18:11 68450:16419090 ActivityProfilerController.cpp:300] Completed Stage: Collection
STAGE:2022-12-07 16:18:13 68450:16419090 output_json.cpp:417] Completed Stage: Post Processing
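
After the run, the trace files should be in `./log/resnet18`, presumably one per span. The sketch below lists them and prints the TensorBoard command to run from a terminal. It assumes the handler writes files ending in `.pt.trace.json` (optionally gzipped), which may differ between PyTorch versions; `find_trace_files` is a hypothetical helper.

```python
from pathlib import Path


def find_trace_files(log_dir: str) -> list:
    """Return the sorted names of profiler trace files found directly in `log_dir`."""
    root = Path(log_dir)
    if not root.is_dir():
        return []
    # Assumed naming convention: *.pt.trace.json or *.pt.trace.json.gz
    suffixes = (".pt.trace.json", ".pt.trace.json.gz")
    return sorted(p.name for p in root.iterdir() if p.name.endswith(suffixes))


log_dir = "./log/resnet18"
traces = find_trace_files(log_dir)
print(f"{len(traces)} trace file(s) in {log_dir}")
for name in traces:
    print("  ", name)
print(f"View them with: tensorboard --logdir={log_dir}")
```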


Alternatively, you can do the same thing without a context manager (this version leaves out `profile_memory`):

```python
prof = torch.profiler.profile(
        schedule=torch.profiler.schedule(wait=1, warmup=1, active=3, repeat=2),
        on_trace_ready=torch.profiler.tensorboard_trace_handler('./log/resnet18'),
        record_shapes=True,
        with_stack=True)
prof.start()
for step, batch_data in enumerate(train_loader):
    if step >= (1 + 1 + 3) * 2:
        break
    train(batch_data)
    prof.step()
prof.stop()
```
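
One difference worth noting: the `with` form stops the profiler even if a training step raises an exception, whereas the manual form above skips `prof.stop()` in that case. Wrapping the loop in `try`/`finally` restores that guarantee. The sketch below shows the pattern with a stand-in `ToyProfiler` class (hypothetical, not part of PyTorch) so that it runs without a dataset or model; the same structure should carry over to the `start()`, `step()`, and `stop()` calls used above.

```python
class ToyProfiler:
    """Stand-in for a profiler; it only records which methods were called."""

    def __init__(self):
        self.calls = []

    def start(self):
        self.calls.append("start")

    def step(self):
        self.calls.append("step")

    def stop(self):
        self.calls.append("stop")


def run(prof, batches):
    """Manual start/stop with the same cleanup guarantee as a `with` block."""
    prof.start()
    try:
        for batch in batches:
            if batch is None:  # simulate a training step that fails
                raise ValueError("bad batch")
            prof.step()
    finally:
        prof.stop()  # runs whether or not the loop raised


prof = ToyProfiler()
try:
    run(prof, [1, 2, None, 4])
except ValueError:
    pass
print(prof.calls)
```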