<a href="https://colab.research.google.com/github/wandb/edu/blob/main/lightning/performance/profile.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img src="https://i.imgur.com/gb6B4ig.png" width="400" alt="Weights & Biases" />

# Profiling PyTorch Code

In [None]:
%%capture
!pip install wandb pytorch_lightning torch_tb_profiler

In [None]:
import pytorch_lightning as pl
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchvision
from torchvision import datasets, transforms

from torch.profiler import tensorboard_trace_handler
import wandb

# drop slow mirror from list of MNIST mirrors
torchvision.datasets.MNIST.mirrors = [mirror for mirror in torchvision.datasets.MNIST.mirrors
                                      if not mirror.startswith("http://yann.lecun.com")]
                                      
# load tensorboard extension for Colab                                      
%load_ext tensorboard

# login to W&B
!wandb login

# CNN Module

In [None]:
class Net(pl.LightningModule):
    """Very simple LeNet-style CNN, plus Dropout."""

    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.dropout1 = nn.Dropout(0.25)
        self.dropout2 = nn.Dropout(0.5)
        self.fc1 = nn.Linear(9216, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.conv1(x)
        x = F.relu(x)
        x = self.conv2(x)
        x = F.relu(x)
        x = F.max_pool2d(x, 2)
        x = self.dropout1(x)
        x = torch.flatten(x, 1)
        x = self.fc1(x)
        x = F.relu(x)
        x = self.dropout2(x)
        x = self.fc2(x)
        output = F.log_softmax(x, dim=1)
        return output

    def training_step(self, batch, idx):
        inputs, labels = batch
        outputs = self(inputs)  # call this module rather than the global `model`
        loss = F.nll_loss(outputs, labels)

        return {"loss": loss}

    def configure_optimizers(self):
        return optim.Adadelta(self.parameters(), lr=0.1)


class TorchTensorboardProfilerCallback(pl.Callback):
  """Quick-and-dirty Callback for invoking TensorboardProfiler during training.
  
  For greater robustness, extend the pl.profiler.profilers.BaseProfiler, see
  https://pytorch-lightning.readthedocs.io/en/stable/advanced/profiler.html"""

  def __init__(self, profiler):
    super().__init__()
    self.profiler = profiler

  def on_train_batch_end(self, trainer, pl_module, outputs, *args, **kwargs):
    self.profiler.step()
    pl_module.log_dict(outputs)

# Run Profiled Training

In [None]:
# initial values are the library defaults, except batch_size (the DataLoader default is 1)
config = {"batch_size": 32,
          "num_workers": 0,
          "pin_memory": False,
          "precision": 32,
          }

with wandb.init(entity="wandb", project="profiler", config=config) as run:

    # Set up MNIST data
    transform=transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,))
        ])

    dataset = datasets.MNIST('../data', train=True, download=True,
                        transform=transform)

    ## Using a raw DataLoader, rather than LightningDataModule, for greater transparency
    trainloader = torch.utils.data.DataLoader(
      dataset,
      # Key performance-relevant configuration parameters:
      ## batch_size: how many datapoints are passed through the network at once?
      batch_size=wandb.config.batch_size,  # larger batch sizes are more efficient, up to memory constraints
      ##  num_workers: how many side processes to launch for dataloading (should be >0)
      num_workers=wandb.config.num_workers,  # needs to be tuned given model/batch size/compute
      ## pin_memory: should a fixed "pinned" memory block be allocated on the CPU?
      pin_memory=wandb.config.pin_memory,  # should nearly always be True for GPU models, see https://developer.nvidia.com/blog/how-optimize-data-transfers-cuda-cc/
      )
    
    # Set up model
    model = Net()

    # Set up profiler
    wait, warmup, active, repeat = 1, 1, 2, 1
    total_steps = (wait + warmup + active) * repeat  # repeat counts full wait/warmup/active cycles
    schedule =  torch.profiler.schedule(wait=wait, warmup=warmup, active=active, repeat=repeat)
    profiler = torch.profiler.profile(schedule=schedule, on_trace_ready=tensorboard_trace_handler("wandb/latest-run/tbprofile"), with_stack=True)

    with profiler:
        profiler_callback = TorchTensorboardProfilerCallback(profiler)

        trainer = pl.Trainer(gpus=1, max_epochs=1,
                             logger=pl.loggers.WandbLogger(log_model=True, save_code=True),
                             callbacks=[profiler_callback], precision=wandb.config.precision)

        trainer.fit(model, trainloader)
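
To make the `schedule` arguments above concrete, here is a minimal pure-Python sketch of how the wait, warmup, and active phases are laid out over training steps. The helper `profiler_phase` is hypothetical: it only mimics the documented behavior of `torch.profiler.schedule` (ignoring options such as `skip_first`) and is not the library's implementation.

```python
def profiler_phase(step, wait, warmup, active, repeat):
    """Return the phase a given step falls into (simplified, hypothetical model)."""
    cycle = wait + warmup + active
    # a positive `repeat` limits the number of cycles; 0 would mean "keep cycling"
    if repeat > 0 and step >= cycle * repeat:
        return "none"
    position = step % cycle
    if position < wait:
        return "wait"      # profiler is idle
    if position < wait + warmup:
        return "warmup"    # profiler runs, but results are discarded
    return "active"        # profiler records the trace


for step in range(6):
    print(step, profiler_phase(step, wait=1, warmup=1, active=2, repeat=1))
# 0 wait
# 1 warmup
# 2 active
# 3 active
# 4 none
# 5 none
```

Under this model, only two training steps end up in the trace for the settings used in this notebook.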

# View PyTorch Profiler in Tensorboard

_NOTE_: if you run into issues here, restart the Colab and try again.
If issues persist, you may need to activate third-party cookies.

In [None]:
# bash command to silently kill any old instances of tensorboard
!ps | grep tensorboard | grep -Eo "^\s*[0-9]+" | xargs kill 2> /dev/null

# launch a new tensorboard pointed at the latest run
## may take a minute
%tensorboard --logdir wandb/latest-run/tbprofile

## Exercises

#### 1. Reading the Profiler 



Run training with the default `config`uration
and launch TensorBoard using the cell above.
This TensorBoard instance contains a complete profiling trace
of a few training steps for the network above,
which we will use to understand the computations that happen
during training and how to speed them up.

The "Overview" tab appears first.
Note the composition of the Execution Pie Chart.
Which slice is the largest?

At the bottom of the Overview tab,
you'll see a "Performance Recommendation".
It should report a percentage of time spent waiting on the `DataLoader`.
A model is considered to be bottlenecked by the `DataLoader`
if that step takes up at least 5% of the time.
How much time does that step take with the default configuration?

In the "Views" dropdown,
head to the "Operator" tab.
This tab lists which operations
are taking the most time in the network.
Review the "Device Self Time" pie chart.
Which operations are taking the most time:
convolutional operations (`conv` appears in the name)
or the fully-connected operations (`add` or `mm` in the name)?

Compare that to the parameter counts in the model's summary below.
Is this counterintuitive?

In [None]:
model.summarize();
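
As a rough cross-check on this comparison, the sketch below counts parameters and multiply-accumulate operations (MACs) per input image for each layer of `Net`, assuming a 1x28x28 MNIST input. The helper functions are illustrative and not part of PyTorch, and MACs are only a proxy for the device time the profiler reports.

```python
def conv2d_stats(c_in, c_out, kernel, h_in, w_in):
    """Parameters and MACs per sample for an unpadded, stride-1 Conv2d with a square kernel."""
    h_out, w_out = h_in - kernel + 1, w_in - kernel + 1
    params = c_out * (c_in * kernel * kernel + 1)          # weights plus one bias per output channel
    macs = h_out * w_out * c_out * c_in * kernel * kernel  # one kernel application per output value
    return params, macs, (h_out, w_out)


def linear_stats(n_in, n_out):
    """Parameters and MACs per sample for a fully-connected layer."""
    return n_out * (n_in + 1), n_in * n_out


conv1_params, conv1_macs, (h, w) = conv2d_stats(1, 32, 3, 28, 28)
conv2_params, conv2_macs, (h, w) = conv2d_stats(32, 64, 3, h, w)
h, w = h // 2, w // 2  # F.max_pool2d(x, 2) halves each spatial dimension
flat_features = 64 * h * w  # matches the 9216 inputs of fc1

layer_stats = {
    "conv1": (conv1_params, conv1_macs),
    "conv2": (conv2_params, conv2_macs),
    "fc1": linear_stats(flat_features, 128),
    "fc2": linear_stats(128, 10),
}

for name, (params, macs) in layer_stats.items():
    print(f"{name}: {params:>9,} parameters, {macs:>10,} MACs per sample")
```

Under these assumptions, `fc1` holds most of the parameters while `conv2` performs most of the arithmetic, because each convolutional weight is reused at every spatial position.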

Lastly, switch to the "Trace" tab using the "Views" dropdown.
It may take up to several minutes for this tab to populate,
and understanding the contents requires a deeper knowledge of
neural networks and GPU-accelerated computation,
so feel free to skip to the next section.

The Trace tab shows which operations were running in each thread
on the CPU and on the GPU.

In the main thread (the one in which the Profiler Steps appear),
locate the following steps:
1. the loading of data (hint: look for `enumerate` on the CPU, nothing on the GPU)
2. the forward pass to calculate the loss (hint: look for simultaneous activity on CPU+GPU,
with `aten` in the operation names)
3. the backward pass to calculate the gradient of the loss (hint: look for simultaneous activity on CPU+GPU, with `backward` in the operation names).

Notice that these are all run sequentially,
meaning that between loading one batch
and loading the next,
the `DataLoader` is effectively idling.

See the next section for the solution to this issue.

#### 2. Critical Improvement: `num_workers>0`


The "Performance Recommendation" should include a suggestion to change `num_workers`.

The default value of `0`,
which disables multiprocessing for data loading,
is almost always a bad choice,
even for models run entirely on the CPU.
However, no single alternative value is always best.

A rough rule of thumb is to set the number of workers
equal to the number of CPU cores, or slightly fewer.

Run the cell below to determine the number of processors available
(on Colab it's typically 2).
Try with this as the value for `num_workers` and observe the effect on total runtime (printed to the command line by `wandb` in the Run Summary;
also available on the Run Page, in the Overview section).

Then try with half as many (but no less than 1!) and with twice as many.
It's common to tune this parameter based on runtime results
for individual architectures, datasets, and training algorithms.

In [None]:
!nproc
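
The three settings suggested above (half, equal to, and double the processor count) can be generated with a small helper. This is a sketch: `num_workers_candidates` is a hypothetical name, and `os.cpu_count()` may report a different number than `nproc` if the process is restricted to a subset of the CPUs.

```python
import os


def num_workers_candidates(n_cpus=None):
    """Return half, equal to, and double the processor count, never below 1."""
    if n_cpus is None:
        n_cpus = os.cpu_count() or 1  # cpu_count() can return None
    # a set removes duplicates, e.g. when half of 1 is clamped back to 1
    return sorted({max(1, n_cpus // 2), n_cpus, 2 * n_cpus})


print(num_workers_candidates(2))  # [1, 2, 4]
```

Each value can then be placed in `config["num_workers"]` for a separate run, so the runtimes can be compared on the W&B Run Page.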

Review the "Overview" tab in the TensorBoard Profiler again.

You should see that the `DataLoader` is no longer
the largest slice of the pie chart.

Ideally, the GPU kernel operations are the largest slice --
indicating that the real meat of the network computation
is where the majority of time is spent.
This may or may not occur,
depending on hardware and implementation details.
The next section indicates
some additional optimizations that can tip the scales
more clearly in the direction of GPU operations.

_Note_: if you looked through the "Trace" tab
for the default configuration,
look at it again with `num_workers=2`.

What you see may depend on the precise
hardware and implementation you are using
(which varies across Colab sessions),
but in general the `DataLoader` code no longer
blocks the forward and backward passes,
and so the GPU threads should be more densely filled
with operations.

#### 3. Marginal Further Improvements


The speedup from introducing `num_workers>0`
is often 2x (and can be higher with CPUs with more processors).
The following improvements are generally smaller.

If you reviewed the "Trace" tab in the first exercise,
then you may have noticed a number of simultaneous CPU operations
matching the GPU operations in the forward and backward passes.
These include setup and other computations that must be performed
with each operation,
but which have constant cost as the size of the actual array computation
increases.

If we increase the sizes of the tensors in our network,
we can spread that cost over a larger operation.
This can be done either by increasing the network size
or by increasing the `batch_size`.
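
This amortization argument can be sketched with a toy cost model. The per-step overhead and per-sample compute time below are made-up numbers chosen only for illustration, and the model assumes that per-sample compute time does not change with batch size, which real hardware only approximates.

```python
import math


def epoch_time_us(n_samples, batch_size, overhead_us_per_step, compute_us_per_sample):
    """Toy model: every step pays a fixed overhead plus time proportional to its size."""
    steps = math.ceil(n_samples / batch_size)
    return steps * overhead_us_per_step + n_samples * compute_us_per_sample


N_TRAIN = 60_000  # size of the MNIST training set

for batch_size in (32, 1024, 10_000):
    steps = math.ceil(N_TRAIN / batch_size)
    total_us = epoch_time_us(N_TRAIN, batch_size,
                             overhead_us_per_step=1_000,   # hypothetical value
                             compute_us_per_sample=10)     # hypothetical value
    print(f"batch_size={batch_size:>6}: {steps:>5} steps, {total_us / 1e6:.2f} s per epoch (toy model)")
```

In this model the compute term is identical for every batch size; only the number of times the fixed overhead is paid changes.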

Try with a batch size of `10_000`
(or `50_000`, which may crash the instance by consuming too much RAM,
especially with `num_workers=0`;
if `10_000` also crashes your machine, reduce to `1024`).
Do you see a faster runtime?

Another easy but small win comes from changing the
`pin_memory` parameter of the `DataLoader` to `True`.
The details of how this works are out-of-scope for this notebook
([see this NVIDIA blogpost for more](https://developer.nvidia.com/blog/how-optimize-data-transfers-cuda-cc/)),
but effectively this increases CPU RAM cost
to decrease the latency for transferring
data from CPU to GPU.
This is almost always a good trade.

Finally,
the runtime can be improved by reducing the bit depth or `precision`
of the floating point numbers used in the matrix math that makes up our network.
The capacity to use varying precision is built into PyTorch
and made easy by PyTorch Lightning: just pass `precision=k`
to the `Trainer`, for `k` one of `64` ("double"; default in Python/NumPy), `32` ("single"; default in PyTorch), or `16` ("half").

Try running your model at half precision.

_Note_: this often doesn't have a large direct effect on runtime;
instead it allows for larger models and batch sizes to run on fixed hardware.
Try re-running a batch size that crashed with the default `precision=32` at `precision=16`. Does this model run faster than others?
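
To see why lower precision lets larger batches fit, the sketch below estimates the memory needed to store a single activation tensor (the output of `conv2`) at different bit depths. It is a rough estimate under stated assumptions: the helper name is hypothetical, the shape assumes 28x28 MNIST inputs, and actual usage under Lightning's 16-bit mode depends on which tensors are kept in which precision.

```python
BYTES_PER_ELEMENT = {64: 8, 32: 4, 16: 2}  # precision in bits -> bytes per number


def activation_bytes(batch_size, elements_per_sample, precision):
    """Bytes needed to store one activation tensor at the given precision."""
    return batch_size * elements_per_sample * BYTES_PER_ELEMENT[precision]


conv2_elements = 64 * 24 * 24  # channels x height x width of the conv2 output for one image

for precision in (32, 16):
    n_bytes = activation_bytes(10_000, conv2_elements, precision)
    print(f"precision={precision}: {n_bytes / 1e6:,.0f} MB for the conv2 output at batch_size=10_000")
```

Halving the bit depth halves this estimate, which is the headroom that can be spent on a larger batch or model.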

For more on improving the performance of PyTorch code,
including more details on the optimizations above,
check out
[this excellent talk from NVIDIA's Szymon Migacz](https://www.youtube.com/watch?v=9mS1fIYj1So).

#### Endnote

You may have noticed that the losses for networks with larger
batch sizes are generally higher than those with lower batch sizes,
because the increased batch sizes mean there are fewer gradient updates
in an epoch.

To take full advantage of the speedup given by a larger batch size,
you would also need to scale the learning rate up
(either
[linearly or with the square root](https://stackoverflow.com/questions/53033556/how-should-the-learning-rate-change-as-the-batch-size-change/53046624)),
thus increasing the rate at which the weights change per update.
The choice of batch size may have an impact
on generalization performance as well,
but reports in the literature vary.
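
As a worked example of the two scaling rules, the sketch below starts from this notebook's `lr=0.1` at `batch_size=32` and computes the scaled learning rate for larger batches. The function `scaled_lr` is a hypothetical helper, and whether either rule suits the Adadelta optimizer used here is an assumption to verify empirically.

```python
import math


def scaled_lr(base_lr, base_batch_size, new_batch_size, rule="linear"):
    """Scale a learning rate with batch size, linearly or with the square root."""
    ratio = new_batch_size / base_batch_size
    if rule == "linear":
        return base_lr * ratio
    if rule == "sqrt":
        return base_lr * math.sqrt(ratio)
    raise ValueError(f"unknown rule: {rule!r}")


for batch_size in (32, 1024, 10_000):
    linear = scaled_lr(0.1, 32, batch_size, rule="linear")
    sqrt = scaled_lr(0.1, 32, batch_size, rule="sqrt")
    print(f"batch_size={batch_size:>6}: linear lr={linear:.3f}, sqrt lr={sqrt:.3f}")
```

The two rules diverge quickly as the batch grows, so it is worth comparing the loss curves of both in W&B before settling on one.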