<a href="https://colab.research.google.com/github/rastringer/GPU_CUDA_overview/blob/main/nsight_attention.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install jupyterlab-nvidia-nsight

# Profiling and optimizing PyTorch training

*(Make sure you are using the free T4 runtime in Colab)*

Since using GPUs is the most expensive step in ML training and inference, no small amount of work goes into optimizing their use. In the real world, very few organizations and developers work on low-level kernel optimizations. They typically work further up the stack with frameworks such as PyTorch, leaving PyTorch's optimizations to those working on its backend (which of course uses CUDA).

To give us a lens into the operations being performed on the accelerator and their efficiency in this scenario, there are a variety of profiling tools available. In this notebook, we will explore the use of Nvidia's [Nsight](https://developer.nvidia.com/nsight-systems). The software is available as a desktop application and command line tool.  

### Install Nsight tools

Since Colabs are essentially a linux-based virtual machine, we can use `apt get` to install the Nvidia tools

In [None]:
%%bash

apt update
apt install -y --no-install-recommends gnupg
echo "deb http://developer.download.nvidia.com/devtools/repos/ubuntu$(source /etc/lsb-release; echo "$DISTRIB_RELEASE" | tr -d .)/$(dpkg --print-architecture) /" | tee /etc/apt/sources.list.d/nvidia-devtools.list
apt-key adv --fetch-keys http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub
apt update
apt install nsight-systems-cli

### Check the installation

In [3]:
!nsys status -e

Timestamp counter supported: Yes

CPU Profiling Environment Check
Root privilege: enabled
Linux Kernel Paranoid Level = 2
Linux Distribution = Ubuntu
Linux Kernel Version = 6.1.85+: OK
Linux perf_event_open syscall available: OK
Sampling trigger event available: OK
Intel(c) Last Branch Record support: Not Available
CPU Profiling Environment (process-tree): OK
CPU Profiling Environment (system-wide): OK

See the product documentation at https://docs.nvidia.com/nsight-systems for more information,
including information on how to set the Linux Kernel Paranoid Level.


### Simple attention

Here's our basic attention mechanism that computes query, key, and value matrices to generate weighted representations of input data.
The SimpleTransformer class combines this attention mechanism with layer normalization in a residual connection setup.

We will include profiling code to measure CPU and GPU performance metrics when running the model on sample input data.

In [8]:
%%writefile profiler.py

import torch
import torch.nn as nn

class SimpleAttention(nn.Module):
    def __init__(self, embed_dim):
        super().__init__()
        self.query = nn.Linear(embed_dim, embed_dim)
        self.key = nn.Linear(embed_dim, embed_dim)
        self.value = nn.Linear(embed_dim, embed_dim)

    def forward(self, x):
        q = self.query(x)
        k = self.key(x)
        v = self.value(x)

        attn_weights = torch.matmul(q, k.transpose(-2, -1))
        attn_weights = torch.softmax(attn_weights, dim=-1)

        return torch.matmul(attn_weights, v)

class SimpleTransformer(nn.Module):
    def __init__(self, embed_dim, num_heads):
        super().__init__()
        self.attention = SimpleAttention(embed_dim)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):
        attn_output = self.attention(x)
        return self.norm(x + attn_output)

# Create a model and sample input
embed_dim = 256
seq_length = 100
batch_size = 32

model = SimpleTransformer(embed_dim, num_heads=1).cuda()
sample_input = torch.randn(batch_size, seq_length, embed_dim).cuda()

import torch.cuda.profiler as profiler

# Warm-up run
model(sample_input)

# Profile the model
with profiler.profile(activities=[profiler.ProfilerActivity.CPU, profiler.ProfilerActivity.CUDA], record_shapes=True, profile_memory=True, with_stack=True) as prof:
    with profiler.record_function("model_inference"):
        model(sample_input)

# Print profiling results
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

Writing profiler.py


In [10]:
!nsys profile --stats=true python profiler.py

Collecting data...
Traceback (most recent call last):
  File "/content/profiler.py", line 46, in <module>
    with profiler.profile(activities=[profiler.ProfilerActivity.CPU, profiler.ProfilerActivity.CUDA], record_shapes=True, profile_memory=True, with_stack=True) as prof:
AttributeError: module 'torch.cuda.profiler' has no attribute 'ProfilerActivity'
Generating '/tmp/nsys-report-4b28.qdstrm'
[3/8] Executing 'nvtx_sum' stats report
SKIPPED: /content/report1.sqlite does not contain NV Tools Extension (NVTX) data.
[4/8] Executing 'osrt_sum' stats report

 Time (%)  Total Time (ns)  Num Calls    Avg (ns)      Med (ns)    Min (ns)    Max (ns)    StdDev (ns)            Name         
 --------  ---------------  ---------  ------------  ------------  ---------  -----------  ------------  ----------------------
     76.0    1,702,896,538         29  58,720,570.3  77,869,201.0      4,006  100,150,826  44,495,375.6  poll                  
     13.2      295,663,488      1,672     176,832.2    

(Numbers will differ slightly each time we run these cells)

Let's analyze the "cuda_gpu_kern_sum" report, which shows the GPU kernel executions:

* `volta_sgemm_128x64_tn` (61.9% of GPU time):
This is likely the matrix multiplication for computing attention weights (q * k.transpose(-2, -1)). It's using NVIDIA's optimized GEMM (General Matrix Multiplication) kernel. Typically the most compute-intensive operation in a. transformer model.

* `volta_sgemm_64x64_tn` (14.2% of GPU time):
This could be another part of the attention computation, possibly the final matrix multiplication with the value matrix (attn_weights * v).

* `volta_sgemm_128x64_nn` (11.3% of GPU time):
This might be the matrix multiplication in one of the linear layers (query, key, or value projection).

* `vectorized_layer_norm_kernel` (8.1% of GPU time):
This corresponds to the LayerNorm operation in the SimpleTransformer class.
vectorized_elementwise_kernel (3.1% of GPU time):
This could be the element-wise addition in the residual connection (x + attn_output).

* `softmax_warp_forward` (1.4% of GPU time):
This is the softmax operation applied to the attention weights.

The `SimpleAttention` class operations are primarily represented by items 1, 2, 3, and 6 in this list. These operations account for about 88.8% of the GPU kernel execution time, which indicates that the attention mechanism is indeed a significant part of the computation.
To optimize this, we could:

* Use the optimized attention mechanism as suggested in the tutorial (torch.nn.functional.scaled_dot_product_attention).
* Experiment with different batch sizes or sequence lengths to find the optimal configuration for your hardware.
* Consider using mixed precision (float16).

Let's optimize our attention mechanism to use the `torch.nn.functional.scaled_dot_product_attention` function, optimized for GPUs. This method uses the Flash Attention algorithm when available.

In [11]:
%%writefile profiler.py

import torch
import torch.nn as nn

import torch.nn.functional as F

class OptimizedAttention(nn.Module):
    def __init__(self, embed_dim):
        super().__init__()
        self.query = nn.Linear(embed_dim, embed_dim)
        self.key = nn.Linear(embed_dim, embed_dim)
        self.value = nn.Linear(embed_dim, embed_dim)
        self.scale = embed_dim ** -0.5

    def forward(self, x):
        q = self.query(x)
        k = self.key(x)
        v = self.value(x)

        return F.scaled_dot_product_attention(q, k, v, scale=self.scale)

# Update the SimpleTransformer class to use OptimizedAttention
class OptimizedTransformer(nn.Module):
    def __init__(self, embed_dim, num_heads):
        super().__init__()
        self.attention = OptimizedAttention(embed_dim)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):
        attn_output = self.attention(x)
        return self.norm(x + attn_output)

# Create a model and sample input
embed_dim = 256
seq_length = 1000
batch_size = 32

# Create a new model with the optimized attention
optimized_model = OptimizedTransformer(embed_dim, num_heads=1).cuda()
sample_input = torch.randn(batch_size, seq_length, embed_dim).cuda()

import torch.cuda.profiler as profiler

# Warm-up run
optimized_model(sample_input)

# Profile the optimized model
with profiler.profile(activities=[profiler.ProfilerActivity.CPU, profiler.ProfilerActivity.CUDA], record_shapes=True, profile_memory=True, with_stack=True) as prof:
    with profiler.record_function("optimized_model_inference"):
        optimized_model(sample_input)

# Print profiling results
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

Overwriting profiler.py


In [12]:
!nsys profile --stats=true python profiler.py

Collecting data...
Traceback (most recent call last):
  File "/content/profiler.py", line 48, in <module>
    with profiler.profile(activities=[profiler.ProfilerActivity.CPU, profiler.ProfilerActivity.CUDA], record_shapes=True, profile_memory=True, with_stack=True) as prof:
AttributeError: module 'torch.cuda.profiler' has no attribute 'ProfilerActivity'
Generating '/tmp/nsys-report-ed31.qdstrm'
[3/8] Executing 'nvtx_sum' stats report
SKIPPED: /content/report2.sqlite does not contain NV Tools Extension (NVTX) data.
[4/8] Executing 'osrt_sum' stats report

 Time (%)  Total Time (ns)  Num Calls    Avg (ns)     Med (ns)    Min (ns)    Max (ns)    StdDev (ns)           Name         
 --------  ---------------  ---------  ------------  -----------  ---------  -----------  ------------  ---------------------
     79.1      501,166,791         17  29,480,399.5  2,985,176.0      3,734  100,157,864  38,521,524.2  poll                 
     11.6       73,748,312        639     115,412.1     12,66

(Numbers will differ slightly each time we run these cells)

Looking at the "cuda_gpu_kern_sum" report, we notice:

* volta_sgemm_128x64_tn (58.4% of GPU time, previously 61.9%):
  * We see a slight decrease in what is likely the matrix multiplication for computing attention weights. Though small on some tiny sample data, imagine these gains multiplied exponentially on real world training and inference involving text, images, video etc.
* volta_sgemm_64x64_tn (13.3%, previously 14.2%):
  * Final matrix multiplication with the value matrix.
* volta_sgemm_128x64_nn (10.6%, previously 11.3%):
  * The linear layer matrix multiplications.
* vectorized_layer_norm_kernel (7.7%, previously 8.1%):
  * This corresponds to the LayerNorm operation in the SimpleTransformer class.
* vectorized_elementwise_kernel (4.7% + 3.9% = 8.6%, previously 3.1%):
  * This now appears as two separate kernels, possibly for different elementwise operations.
* softmax_warp_forward (1.3%, previously 1.4%):
  * This is still the softmax operation applied to the attention weights.