# Coding Tutorial 15: Acceleration and Optimization in Deep Learning

```
Course: CSCI 5922 Spring 2025, University of Colorado Boulder
TA: Everley Tseng
Email: Yu-Yun.Tseng@colorado.edu
* AI assistant is used in making this tutorial
```

## Overview

Sections:
- Automatic Mixed Precision
- Operator Fusion and Dynamic Quantization
- Auto-Fusion with `torch.compile()`
- vLLM

Objectives:
- Learn how to apply common acceleration techniques in PyTorch: mixed precision, operator fusion, quantization, and compilation

## Automatic Mixed Precision

You can use PyTorch's automatic mixed precision (AMP) to train models using lower precision. Specifically, `torch.amp.autocast` supports FP16 and BF16, which reduces memory usage and can speed up training. We will use a simple architecture for demonstration.

1. Use `torch.amp.autocast` to enable mixed precision during the forward pass.
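2. Use `torch.amp.GradScaler` to scale the loss before the backward pass so that small FP16 gradients do not underflow to zero. The scaler unscales the gradients again before the optimizer step.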

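Note: AMP works on both CPU and GPU. It is most widely used on NVIDIA GPUs, where lower-precision matrix operations are hardware accelerated. The following cell uses a GPU runtime to demonstrate how to code AMP. To run it on a CPU instead, replace the `.cuda()` calls and the `'cuda'` device strings with their `cpu` equivalents; CPU autocast uses BF16 by default.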
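To see why loss scaling matters, the sketch below rounds a small value to FP16 with and without scaling. It uses only the standard library (`struct` format `"e"` is IEEE 754 half precision). The gradient value and the scale factor are illustrative assumptions, not values taken from the training loop in this tutorial.

```python
import struct

def to_fp16(value: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", value))[0]

# Hypothetical gradient that is too small for FP16 to represent.
tiny_grad = 1e-8
# Illustrative scale factor (2**16); GradScaler adjusts its scale dynamically.
loss_scale = 65536.0

# Without scaling, the gradient rounds to zero in FP16.
unscaled = to_fp16(tiny_grad)

# With scaling, the value survives the FP16 round trip. Dividing by the
# scale in full precision recovers an approximation of the gradient.
scaled = to_fp16(tiny_grad * loss_scale)
recovered = scaled / loss_scale

print(f"unscaled in FP16: {unscaled}")
print(f"recovered after scaling: {recovered:.3e}")
```

This is roughly the role that `scaler.scale(loss)` and the unscaling inside `scaler.step(optimizer)` play: the multiplication happens before the backward pass and the division happens before the weights are updated.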


In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.amp import autocast, GradScaler

# Simple neural network for illustration
class SimpleModel(nn.Module):
    def __init__(self):
        super(SimpleModel, self).__init__()
        self.fc = nn.Linear(784, 10)

    def forward(self, x):
        return self.fc(x)

# Initialize model, loss, and optimizer
model = SimpleModel().cuda()
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.001)

# Scaler for mixed precision
scaler = GradScaler('cuda')

# Training loop with mixed precision
def train(model, data_loader):
    model.train()
    for data, target in data_loader:
        data, target = data.cuda(), target.cuda()

        optimizer.zero_grad()

        # Automatic Mixed Precision (AMP) Context Manager
        with autocast('cuda'):
            output = model(data)
            loss = criterion(output, target)

        # Backward pass with gradient scaling
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()

from torch.utils.data import Dataset, DataLoader

# Create a custom Dataset class to include both data and target
class SimpleDataset(Dataset):
    def __init__(self, data_size, input_dim, num_classes):
        self.data = torch.randn(data_size, input_dim)  # Random input data
        self.target = torch.randint(0, num_classes, (data_size,))  # Random target labels

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx], self.target[idx]  # Return both data and target

# Create a dataset and data loader
dataset = SimpleDataset(data_size=100, input_dim=784, num_classes=10)
train_loader = DataLoader(dataset, batch_size=32)

# Run training loop
train(model, train_loader)
print('Finished training.')

Finished training.


## Operator Fusion and Dynamic Quantization

Now, we explore operator fusion and dynamic quantization, two techniques for improving the efficiency of deep learning models. Operator fusion combines multiple operations, such as a convolution and the activation (ReLU) that follows it, into a single operation. This avoids writing and re-reading intermediate results and reduces per-operation overhead, which can speed up inference.
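The idea behind fusion can be sketched in plain Python with a 1-D convolution followed by ReLU. This is a conceptual illustration only, not how PyTorch implements fused kernels; the function names and the sample data are made up for this example.

```python
def conv1d(signal, kernel):
    """Valid 1-D cross-correlation, returning a new list."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]

def relu(values):
    """Element-wise ReLU, returning a second new list."""
    return [max(0.0, v) for v in values]

def conv1d_relu_fused(signal, kernel):
    """Same result, but ReLU is applied as each output is produced,
    so no intermediate list of pre-activation values is stored."""
    k = len(kernel)
    out = []
    for i in range(len(signal) - k + 1):
        acc = sum(signal[i + j] * kernel[j] for j in range(k))
        out.append(acc if acc > 0.0 else 0.0)
    return out

signal = [1.0, -2.0, 3.0, -4.0, 5.0]
kernel = [0.5, -1.0]

unfused = relu(conv1d(signal, kernel))     # two passes, one temporary list
fused = conv1d_relu_fused(signal, kernel)  # one pass, no temporary list
print(unfused == fused)
```

Both versions do the same arithmetic. The fused version saves the extra pass over memory, which is where most of the benefit of fusing cheap element-wise operations tends to come from.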

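In this experiment, we apply operator fusion to a convolutional neural network (CNN) with three convolutional blocks, each consisting of a convolution layer followed by a ReLU activation. After fusing these operations, we apply dynamic quantization, which stores model weights as 8-bit integers instead of 32-bit floating point and quantizes activations on the fly during inference. This reduces memory usage and can increase inference speed, which is particularly important for deploying models on devices with limited resources. Note that PyTorch's eager-mode dynamic quantization converts layer types such as `nn.Linear` and `nn.LSTM`; convolution layers are not among them, so the `Conv2d` layers in this model are expected to remain in floating point.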
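The arithmetic behind 8-bit weight quantization can be sketched as follows. This is a simplified symmetric, per-tensor scheme written for illustration; PyTorch's actual quantization schemes may differ (for example, by using a zero point or per-channel scales), and the helper names and weights here are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization of floats to the int8 range."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the int8 range.
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Map int8 values back to approximate floats."""
    return [v * scale for v in q]

weights = [0.42, -1.27, 0.03, 0.9]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(f"int8 values: {q}")
print(f"scale: {scale:.6f}, max round-trip error: {max_error:.6f}")
```

Each weight now needs one byte instead of four, at the cost of a rounding error of at most half a quantization step.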

We use the `torch.quantization` package (also available as `torch.ao.quantization` in recent releases) for both module fusion and dynamic quantization.





Dynamic quantization applied via `quantize_dynamic()` targets CPU inference. Therefore, please switch to a CPU runtime for this section to see the effects.

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.quantization import quantize_dynamic, fuse_modules

# Define a simple CNN model with 3 convolutional blocks
class SimpleConvModel(nn.Module):
    def __init__(self):
        super(SimpleConvModel, self).__init__()
        self.conv1 = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1)
        self.relu1 = nn.ReLU()
        self.conv2 = nn.Conv2d(64, 128, kernel_size=3, stride=1, padding=1)
        self.relu2 = nn.ReLU()
        self.conv3 = nn.Conv2d(128, 256, kernel_size=3, stride=1, padding=1)
        self.relu3 = nn.ReLU()

    def forward(self, x):
        x = self.relu1(self.conv1(x))
        x = self.relu2(self.conv2(x))
        x = self.relu3(self.conv3(x))
        return x

# Initialize the model
model = SimpleConvModel().cpu()  # Move to CPU for quantization

# Set the model to evaluation mode before quantization
model.eval()

# Fuse Conv2d + ReLU layers for each convolutional block
model_fused = fuse_modules(model, [
    ['conv1', 'relu1'],  # Fuse conv1 and relu1
    ['conv2', 'relu2'],  # Fuse conv2 and relu2
    ['conv3', 'relu3'],  # Fuse conv3 and relu3
])

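# Request 8-bit dynamic quantization. quantize_dynamic only converts supported
# layer types (e.g. nn.Linear, nn.LSTM); Conv2d is not among them, so the
# convolution layers are expected to remain in FP32 here.
quantized_model = quantize_dynamic(model_fused, {torch.nn.Conv2d}, dtype=torch.qint8)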

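With both the original model and the quantized model, let's run inference to compare the latency. Each model is timed on a single forward pass, and the first forward pass in a session typically includes one-time warm-up costs, so treat the numbers as a rough indication only.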

In [None]:
import torch
import time

# Create a random input tensor (3x224x224 image)
input_data = torch.randn(1, 3, 224, 224)

# Ensure both models are in evaluation mode
model.eval()
quantized_model.eval()

# Test latency of the original model
start_time = time.time()
with torch.no_grad():
    original_output = model(input_data)  # Forward pass
original_latency = time.time() - start_time

# Test latency of the quantized model
start_time = time.time()
with torch.no_grad():
    quantized_output = quantized_model(input_data)  # Forward pass
quantized_latency = time.time() - start_time

# Print the latency results
print(f"Original model latency: {original_latency:.6f} seconds")
print(f"Quantized model latency: {quantized_latency:.6f} seconds")

Original model latency: 0.919242 seconds
Quantized model latency: 0.698787 seconds
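Because a single timed run is noisy, a more robust comparison discards a few warm-up runs and reports the median of several timed runs. The helper below is a minimal sketch using only the standard library; `benchmark`, its default arguments, and the stand-in `workload` are assumptions made for this example.

```python
import statistics
import time

def benchmark(fn, warmup=3, repeats=10):
    """Return the median wall-clock time of fn() in seconds."""
    for _ in range(warmup):
        fn()  # untimed runs absorb one-time setup costs
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)

# Stand-in workload. For the models above, this could instead be a function
# that calls `model(input_data)` inside a `torch.no_grad()` block.
def workload():
    return sum(i * i for i in range(10_000))

median_seconds = benchmark(workload)
print(f"median latency: {median_seconds:.6f} seconds")
```

Timing the original and the quantized model with the same helper puts both measurements on equal footing, since neither one pays the warm-up cost inside the timed region.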


## Auto-Fusion with `torch.compile()`

Instead of fusing modules manually using `fuse_modules`, you can use `torch.compile()`, a high-level API that automatically applies fusion and other optimizations to the model. It captures the computational graph and decides which operations can be fused or optimized based on the model's structure and the hardware you're running it on.

In this section, we apply `torch.compile()` to a GPT-like architecture and run it on a **CPU**.

Each block in the GPT-like model contains a CausalSelfAttention layer and an MLP.

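The CausalSelfAttention layer calls `F.scaled_dot_product_attention`, which lets PyTorch choose a fused attention implementation. On supported CUDA GPUs this can be a FlashAttention kernel, which computes the same attention result as the standard formulation while avoiding materializing the full attention matrix in GPU memory. On a CPU, as in this demo, PyTorch falls back to a different implementation, so the FlashAttention CUDA kernel itself is not exercised.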
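For reference, the computation that a causal attention kernel performs can be written naively in plain Python. This single-head sketch is for understanding only: it is far slower than any fused kernel, and the function name and toy inputs are assumptions made for this example.

```python
import math

def causal_attention(q, k, v):
    """Naive single-head causal attention over lists of vectors.

    q, k, v: lists of T vectors; q and k vectors have length d.
    """
    T, d = len(q), len(q[0])
    out = []
    for t in range(T):
        # Position t may only attend to positions 0..t (the causal mask).
        scores = [
            sum(q[t][i] * k[s][i] for i in range(d)) / math.sqrt(d)
            for s in range(t + 1)
        ]
        # Numerically stable softmax over the visible positions.
        m = max(scores)
        exps = [math.exp(x - m) for x in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the visible value vectors.
        out.append([
            sum(weights[s] * v[s][i] for s in range(t + 1))
            for i in range(len(v[0]))
        ])
    return out

q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out = causal_attention(q, k, v)
print(out[0])  # the first position can only attend to itself
```

Fused implementations produce the same kind of output but process the scores in tiles instead of building every row of scores in full, which is what saves memory for long sequences.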

Code reference: https://github.com/karpathy/build-nanogpt/


In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import time
from torch import fx
from torch.nn import functional as F
import torch._dynamo

# CausalSelfAttention with FlashAttention
class CausalSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd)
        self.c_proj = nn.Linear(config.n_embd, config.n_embd)
        self.n_head = config.n_head
        self.n_embd = config.n_embd

    def forward(self, x):
        B, T, C = x.size()
        qkv = self.c_attn(x)
        q, k, v = qkv.split(self.n_embd, dim=2)
        q, k, v = [x.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) for x in [q, k, v]]
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        y = self.c_proj(y)
        return y

# MLP
class MLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.c_fc = nn.Linear(config.n_embd, 4 * config.n_embd)
        self.gelu = nn.GELU()
        self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd)

    def forward(self, x):
        x = self.c_fc(x)
        x = self.gelu(x)
        x = self.c_proj(x)
        return x

# GPT Block
class Block(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.ln_1 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = nn.LayerNorm(config.n_embd)
        self.mlp = MLP(config)

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))
        x = x + self.mlp(self.ln_2(x))
        return x

# GPT Model
class GPT(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.transformer = nn.ModuleDict({
            'wte': nn.Embedding(config.vocab_size, config.n_embd),
            'wpe': nn.Embedding(config.block_size, config.n_embd),
            'h': nn.ModuleList([Block(config) for _ in range(config.n_layer)]),
            'ln_f': nn.LayerNorm(config.n_embd),
        })
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)

    def forward(self, idx):
        B, T = idx.size()
        pos = torch.arange(0, T, dtype=torch.long, device=idx.device)
        pos_emb = self.transformer.wpe(pos)
        tok_emb = self.transformer.wte(idx)
        x = tok_emb + pos_emb
        for block in self.transformer.h:
            x = block(x)
        x = self.transformer.ln_f(x)
        logits = self.lm_head(x)
        return logits

# Example GPTConfig
class GPTConfig:
    block_size: int = 1024
    vocab_size: int = 50257
    n_layer: int = 12
    n_head: int = 12
    n_embd: int = 768

# Initialize the model
config = GPTConfig()
model = GPT(config).cpu()

# Apply torch.compile() for operator fusion
compiled_model = torch.compile(model)

# Attention is computed by `F.scaled_dot_product_attention` inside the
# CausalSelfAttention module. PyTorch selects the backend; a FlashAttention
# kernel is only available on supported CUDA GPUs.

# Run a forward pass with the compiled model. The first call triggers
# compilation, so the time measured below includes that one-time cost.
dummy_input = torch.randint(0, config.vocab_size, (1, config.block_size))

start_time = time.time()
with torch.no_grad():
    output = compiled_model(dummy_input)
end_time = time.time()

print(f"Inference time with operator fusion and FlashAttention: {end_time - start_time:.6f} seconds")

No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'


Inference time with operator fusion and FlashAttention: 61.389983 seconds


## vLLM

vLLM is a library for fast inference and serving of large language models. Its features include PagedAttention for efficient management of the attention key-value cache, continuous batching of incoming requests, and distributed execution.
- [Github](https://github.com/vllm-project/vllm)
- [Documentation](https://docs.vllm.ai/en/latest/)

To install `vllm`, run the command:
```
!pip install vllm
```
Before installing the package, make sure that your environment meets the [requirements](https://docs.vllm.ai/en/latest/getting_started/installation.html) and that the libraries are installed with the correct [versions](https://github.com/vllm-project/vllm/tree/main/requirements). **The default environment on Colab might not meet the requirements**. Completing all required installations, including gcc/g++ and torch, might be time-consuming or exceed the storage or memory limit of a default free runtime. We cannot run a demo in this section, but we encourage you to try the installation on a higher-tier runtime or another platform.

vLLM supports generative and pooling models across various tasks. See this [page](https://docs.vllm.ai/en/latest/models/supported_models.html#supported-models) for supported models. Sample code is in the cell below.

In [None]:
# from vllm import LLM, SamplingParams

# # Load a model with vLLM (any supported model name)
# llm = LLM(model="gpt2")

# # Run inference; generate() returns one result object per prompt
# prompts = ["Once upon a time, in a land far away,"]
# outputs = llm.generate(prompts, SamplingParams(max_tokens=50))
# print(outputs[0].outputs[0].text)

## Review

In this tutorial, we implemented acceleration methods using PyTorch. As experienced in many tutorials and lab assignments, computing resources can be the limiting factor for deep learning projects. With the skills learned in this tutorial, we can make better use of the CPU and GPU resources available to us.

For any questions and discussions regarding this tutorial, attend [TA office hours](https://docs.google.com/spreadsheets/d/1fzfTJpEF7RaUYRA_NGa3DkiazdQXVj7QNBbp6DrEZ3I/edit?usp=sharing) or create a post on [Piazza](https://piazza.com/colorado/spring2025/csci5922/home) :) See you in the next tutorial!

\- Everley