## 1. The Roofline Model Concept

### What is the Roofline Model?

The roofline model is a **visual performance model** that shows:
- Maximum achievable performance for a given arithmetic intensity
- Whether a kernel is compute-bound or memory-bound
- How far from peak performance you are

### Key Concepts

```
          ^
  GFLOPS  |                  Ridge Point
          |                       ↓
          |                       ______________________  Peak Compute
          |                      /
          |                     /        * Kernel B (compute-bound)
          |                    /
          |                   /
          |                  /  * Kernel A (memory-bound)
          |                 /
          |                /
          |               /
          |  Memory-Bound Region  |  Compute-Bound Region
          +--------------------------------------------->
                    Arithmetic Intensity (FLOP/Byte)
```

### Definitions

| Term | Definition | Formula |
|------|------------|--------|
| **Arithmetic Intensity (AI)** | Compute work per byte transferred | FLOP / Bytes |
| **Peak Compute** | Max FLOPS the GPU can perform | From spec |
| **Peak Bandwidth** | Max memory throughput | From spec |
| **Ridge Point** | AI at which the memory roof meets the compute roof | Peak_FLOPS / Peak_BW |

### The Roofline Equation

$$\text{Achievable FLOPS} = \min(\text{Peak FLOPS}, \text{Peak BW} \times \text{AI})$$
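
As a minimal sketch of the equation above, the bound can be evaluated directly. The function name `attainable_gflops` and the T4-like numbers are illustrative assumptions; with compute in GFLOPS and bandwidth in GB/s, the product of bandwidth and AI comes out in GFLOPS.

```python
def attainable_gflops(ai, peak_gflops, peak_bw_gb_s):
    """Roofline bound: min(peak compute, peak bandwidth * AI)."""
    return min(peak_gflops, peak_bw_gb_s * ai)

# Assumed T4-like peaks: 8100 GFLOPS (FP32), 320 GB/s
for ai in (0.25, 2.5, 25.0, 250.0):
    bound = attainable_gflops(ai, 8100.0, 320.0)
    print(f"AI = {ai:6.2f} FLOP/Byte -> at most {bound:7.1f} GFLOPS")
```

Only the last value hits the flat compute roof; the others are limited by the slanted memory roof.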

### 🔷 CUDA C++ Implementation (Primary)

## 2. Calculating Arithmetic Intensity

### CUDA C++ Examples with AI Calculations

The following kernels demonstrate different arithmetic intensities:
- **Vector Add**: AI ≈ 0.083 (very memory-bound)
- **SAXPY**: AI ≈ 0.167 (memory-bound)
- **Dot Product**: AI = 0.25 (2N FLOP over 8N bytes, memory-bound)
- **Matrix Multiply**: AI scales with N (compute-bound for large N)
- **3D Stencil**: AI ≈ 0.19-0.75 (6 FLOP over 8 to 32 bytes per point, memory-bound)

```cpp
%%writefile arithmetic_intensity.cu
// arithmetic_intensity.cu - Calculate AI for different kernels
#include <cuda_runtime.h>
#include <stdio.h>

//=============================================================================
// Kernel 1: Vector Add
// FLOP: N (one add per element)
// Bytes: 3N * 4 (read 2 vectors, write 1 vector, float32)
// AI = N / (12N) = 1/12 ≈ 0.083 FLOP/Byte  → VERY memory-bound
//=============================================================================
__global__ void vectorAdd(float* c, const float* a, const float* b, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        c[idx] = a[idx] + b[idx];  // 1 FLOP, 12 bytes
    }
}

//=============================================================================
// Kernel 2: SAXPY (y = a*x + y)
// FLOP: 2N (mul + add per element)
// Bytes: 3N * 4 (read x, read y, write y) - ignoring scalar a (cached)
// AI = 2N / (12N) = 1/6 ≈ 0.167 FLOP/Byte  → Memory-bound
//=============================================================================
__global__ void saxpy(float* y, const float* x, float a, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        y[idx] = a * x[idx] + y[idx];  // 2 FLOP, 12 bytes
    }
}

//=============================================================================
// Kernel 3: Dot Product (reduction)
// FLOP: 2N (N multiplies + about N reduction adds)
// Bytes: 2N * 4 (read 2 vectors)
// AI = 2N / (8N) = 0.25 FLOP/Byte  → Memory-bound
// Assumes blockDim.x is a power of two <= 256 and *result is zeroed first.
//=============================================================================
__global__ void dotProduct(float* result, const float* a, const float* b, 
                           int n) {
    __shared__ float sdata[256];
    
    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    
    sdata[tid] = (idx < n) ? a[idx] * b[idx] : 0.0f;
    __syncthreads();
    
    // Tree reduction in shared memory
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    
    // One atomic per block accumulates the partial sums
    if (tid == 0) atomicAdd(result, sdata[0]);
}

//=============================================================================
// Kernel 4: Matrix Multiply (naive)
// For N×N matrices:
// FLOP: 2N³ (N³ multiplies + N³ adds)
// Bytes: 3N² * 4 (read A, B; write C) - minimum, assuming perfect reuse
// AI = 2N³ / (12N²) = N/6 FLOP/Byte  → Scales with N!
// For N=1024: AI ≈ 170 FLOP/Byte → Compute-bound
// Note: this naive kernel re-reads A and B from memory, so its actual
// traffic is higher and its effective AI lower than the minimum-bytes bound.
//=============================================================================
__global__ void matrixMul(float* C, const float* A, const float* B, 
                          int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    
    if (row < N && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < N; k++) {
            sum += A[row * N + k] * B[k * N + col];
        }
        C[row * N + col] = sum;
    }
}

//=============================================================================
// Kernel 5: Stencil (7-point 3D)
// FLOP per point: 6 (7 values combined with 6 adds)
// Bytes per point: 8 (1 read + 1 write, neighbors served from cache)
//                  to 32 (7 reads + 1 write, no cache reuse)
// AI ≈ 0.19 - 0.75 FLOP/Byte → Memory-bound
//=============================================================================
__global__ void stencil3D(float* out, const float* in, 
                          int nx, int ny, int nz) {
    // +1 skips the boundary layer so every neighbor index is in range
    int i = blockIdx.x * blockDim.x + threadIdx.x + 1;
    int j = blockIdx.y * blockDim.y + threadIdx.y + 1;
    int k = blockIdx.z * blockDim.z + threadIdx.z + 1;
    
    if (i < nx-1 && j < ny-1 && k < nz-1) {
        int idx = i + j * nx + k * nx * ny;
        
        out[idx] = in[idx] +
                   in[idx - 1] + in[idx + 1] +
                   in[idx - nx] + in[idx + nx] +
                   in[idx - nx*ny] + in[idx + nx*ny];
    }
}

// The kernels above are reference implementations for the AI calculations;
// main() only prints the summary and does not launch them.
int main() {
    printf("Arithmetic Intensity Examples:\n");
    printf("================================\n");
    printf("Vector Add:    AI ≈ 0.08 FLOP/Byte (memory-bound)\n");
    printf("SAXPY:         AI ≈ 0.17 FLOP/Byte (memory-bound)\n");
    printf("Dot Product:   AI = 0.25 FLOP/Byte (memory-bound)\n");
    printf("MatMul 1024:   AI ≈ 170 FLOP/Byte  (compute-bound)\n");
    printf("3D Stencil:    AI ≈ 0.19-0.75 FLOP/Byte (memory-bound)\n");
    printf("\nRidge point (typical GPU): ~10-40 FLOP/Byte\n");
    return 0;
}
```

```python
!nvcc -arch=sm_75 -O3 -o arithmetic_intensity arithmetic_intensity.cu
!./arithmetic_intensity
```

## 3. Building a Roofline Chart

### GPU Specifications

| GPU | Peak FP32 | Peak BW | Ridge Point |
|-----|-----------|---------|-------------|
| RTX 3090 | 35.6 TFLOPS | 936 GB/s | 38 FLOP/Byte |
| A100 | 19.5 TFLOPS | 2039 GB/s | 9.6 FLOP/Byte |
| V100 | 15.7 TFLOPS | 900 GB/s | 17.4 FLOP/Byte |
| T4 | 8.1 TFLOPS | 320 GB/s | 25.3 FLOP/Byte |

### Calculate Ridge Point

```
Ridge Point = Peak Compute / Peak Bandwidth

Example for A100:
Ridge = 19.5 TFLOPS / 2039 GB/s = 9.6 FLOP/Byte

Interpretation:
- AI < 9.6: Memory-bound
- AI > 9.6: Compute-bound
```
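
The same calculation can be sketched for every GPU in the table. The helper names `ridge_point` and `classify` are illustrative assumptions; the only subtlety is the unit conversion from TFLOPS to GFLOPS so that the ratio with GB/s yields FLOP/Byte.

```python
# Peak FP32 throughput (TFLOPS) and memory bandwidth (GB/s) from the table above
gpus = {
    "RTX 3090": (35.6, 936),
    "A100": (19.5, 2039),
    "V100": (15.7, 900),
    "T4": (8.1, 320),
}

def ridge_point(peak_tflops, peak_bw_gb_s):
    """Ridge point in FLOP/Byte; TFLOPS are converted to GFLOPS to match GB/s."""
    return peak_tflops * 1000.0 / peak_bw_gb_s

def classify(ai, ridge):
    """Label a kernel by comparing its AI with the ridge point."""
    return "memory-bound" if ai < ridge else "compute-bound"

for name, (tflops, bw) in gpus.items():
    r = ridge_point(tflops, bw)
    print(f"{name:9s} ridge = {r:5.1f} FLOP/Byte, "
          f"SAXPY (AI = 0.167) is {classify(0.167, r)}")
```

A kernel with a fixed AI can therefore change category between GPUs: an AI of 15 FLOP/Byte sits above the A100 ridge but below the T4 ridge.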

## 4. Nsight Compute Roofline Analysis

### Generate Roofline with ncu

```bash
# Generate roofline analysis
ncu --set roofline -o roofline_report ./my_program

# Open in GUI to see visual roofline
ncu-ui roofline_report.ncu-rep
```

### Interpreting the Roofline Chart

```
In Nsight Compute GUI:

1. Look for your kernel's position (dot on the chart)
2. Check which roof it's closest to:
   - Below slanted line (memory roof) → Memory-bound
   - Below horizontal line (compute roof) → Compute-bound
3. Distance from roof = optimization potential

Multiple roofs may appear:
- L1 cache roofline (highest bandwidth)
- L2 cache roofline
- DRAM roofline (lowest bandwidth)
- FP32 roofline
- FP64 roofline
```
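
Steps 2 and 3 above can also be expressed numerically. The following is a sketch under assumed inputs: the function name `roofline_position`, the T4-like peaks, and the example kernel are hypothetical and not taken from any Nsight Compute output.

```python
def roofline_position(ai, achieved_gflops, peak_gflops, peak_bw_gb_s):
    """Return (limiting roof, attainable GFLOPS, fraction of that roof achieved)."""
    memory_roof = peak_bw_gb_s * ai
    if memory_roof < peak_gflops:
        bound, roof = "memory", memory_roof
    else:
        bound, roof = "compute", peak_gflops
    return bound, roof, achieved_gflops / roof

# Hypothetical kernel: AI = 0.5 FLOP/Byte reaching 100 GFLOPS on a T4-like GPU
bound, roof, frac = roofline_position(0.5, 100.0, 8100.0, 320.0)
print(f"{bound}-bound, roof = {roof:.0f} GFLOPS, achieved {frac:.1%} of the roof")
```

The remaining fraction (here, the gap between 100 and 160 GFLOPS) is the headroom that the chart shows as vertical distance to the roof.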

### Command-Line Roofline Metrics

```bash
# Get metrics for manual roofline calculation.
# Continuation lines inside the metric list must not be indented:
# leading spaces would split the comma-separated list into separate arguments.
ncu --metrics \
sm__sass_thread_inst_executed_op_fadd_pred_on.sum,\
sm__sass_thread_inst_executed_op_fmul_pred_on.sum,\
sm__sass_thread_inst_executed_op_ffma_pred_on.sum,\
dram__bytes.sum,\
gpu__time_duration.avg \
  ./my_program

# Calculate:
# FLOP = FADD + FMUL + 2*FFMA   (an FMA counts as two operations)
# AI = FLOP / dram__bytes.sum
# Performance (FLOPS) = FLOP / time

### 🔷 CUDA C++ Implementation (Primary)

## 5. CUDA C++ Roofline Test Program

This program measures actual performance across different arithmetic intensities to verify the roofline model. By varying the number of compute iterations per memory access, we can sweep from memory-bound to compute-bound regimes.

```cpp
%%writefile roofline_test.cu
// roofline_test.cu - Measure roofline position
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// Abort with file/line context if a CUDA runtime call fails
#define CHECK_CUDA(call) do { \
    cudaError_t err = (call); \
    if (err != cudaSuccess) { \
        printf("CUDA error at %s:%d: %s\n", __FILE__, __LINE__, \
               cudaGetErrorString(err)); \
        exit(1); \
    } \
} while (0)

// SAXPY kernel - known AI (kept as a reference point; not launched below)
__global__ void saxpy(float* y, const float* x, float a, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        y[idx] = a * x[idx] + y[idx];
    }
}

// Compute-heavy kernel - adjustable AI
__global__ void computeHeavy(float* output, const float* input, 
                              int n, int compute_iters) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        float val = input[idx];
        
        // Increase compute per byte by iterating
        for (int i = 0; i < compute_iters; i++) {
            val = val * 1.00001f + 0.00001f;  // 2 FLOP
        }
        
        output[idx] = val;
    }
}

void measureKernel(const char* name, int n, int compute_iters, 
                   float* d_in, float* d_out) {
    int blockSize = 256;
    int gridSize = (n + blockSize - 1) / blockSize;
    
    // Warm up (also surfaces launch-configuration errors early)
    computeHeavy<<<gridSize, blockSize>>>(d_out, d_in, n, compute_iters);
    CHECK_CUDA(cudaGetLastError());
    CHECK_CUDA(cudaDeviceSynchronize());
    
    // Timing
    cudaEvent_t start, stop;
    CHECK_CUDA(cudaEventCreate(&start));
    CHECK_CUDA(cudaEventCreate(&stop));
    
    int iterations = 100;
    
    CHECK_CUDA(cudaEventRecord(start));
    for (int i = 0; i < iterations; i++) {
        computeHeavy<<<gridSize, blockSize>>>(d_out, d_in, n, compute_iters);
    }
    CHECK_CUDA(cudaEventRecord(stop));
    CHECK_CUDA(cudaEventSynchronize(stop));
    
    float ms = 0;
    CHECK_CUDA(cudaEventElapsedTime(&ms, start, stop));
    float avgTime = ms / iterations;  // milliseconds per launch
    
    // Calculate metrics
    long long flops = (long long)n * compute_iters * 2;  // 2 FLOP per iter
    long long bytes = (long long)n * 2 * sizeof(float);  // Read + Write
    
    float ai = (float)flops / bytes;
    float gflops = flops / (avgTime * 1e6);     // FLOP / (ms * 1e6) = GFLOPS
    float bandwidth = bytes / (avgTime * 1e6);  // GB/s
    
    printf("%s (iters=%d):\n", name, compute_iters);
    printf("  Time: %.3f ms\n", avgTime);
    printf("  AI: %.2f FLOP/Byte\n", ai);
    printf("  Performance: %.2f GFLOPS\n", gflops);
    printf("  Bandwidth: %.2f GB/s\n", bandwidth);
    printf("\n");
    
    CHECK_CUDA(cudaEventDestroy(start));
    CHECK_CUDA(cudaEventDestroy(stop));
}

int main() {
    const int N = 1 << 24;  // 16M elements
    
    float *d_in, *d_out;
    CHECK_CUDA(cudaMalloc(&d_in, N * sizeof(float)));
    CHECK_CUDA(cudaMalloc(&d_out, N * sizeof(float)));
    
    // Initialize
    float* h_in = new float[N];
    for (int i = 0; i < N; i++) h_in[i] = 1.0f;
    CHECK_CUDA(cudaMemcpy(d_in, h_in, N * sizeof(float), 
                          cudaMemcpyHostToDevice));
    
    printf("Roofline Test - Varying Arithmetic Intensity\n");
    printf("=============================================\n\n");
    
    // Test different AI levels
    // AI = (2 * iters) / 8 = iters / 4
    measureKernel("Low AI", N, 1, d_in, d_out);      // AI = 0.25
    measureKernel("Medium AI", N, 10, d_in, d_out);  // AI = 2.5
    measureKernel("High AI", N, 100, d_in, d_out);   // AI = 25
    measureKernel("Very High AI", N, 1000, d_in, d_out);  // AI = 250
    
    printf("Observations:\n");
    printf("- Low AI: Limited by memory bandwidth\n");
    printf("- High AI: Limited by compute throughput\n");
    printf("- Performance scales with AI until hitting compute roof\n");
    
    delete[] h_in;
    CHECK_CUDA(cudaFree(d_in));
    CHECK_CUDA(cudaFree(d_out));
    
    return 0;
}
```

```python
!nvcc -arch=sm_75 -O3 -lineinfo -o roofline_test roofline_test.cu
!./roofline_test
```
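
To compare the printed results against the model, the roofline prediction for each iteration count can be sketched as below. The peak values are assumed T4 specifications (the `sm_75` target above); substitute the numbers for your own GPU. Measured performance should fall at or below these bounds.

```python
PEAK_GFLOPS = 8100.0   # assumed T4 FP32 peak
PEAK_BW_GB_S = 320.0   # assumed T4 DRAM bandwidth

def predicted(compute_iters):
    """AI and roofline bound for computeHeavy: 2 FLOP per iteration, 8 bytes per element."""
    ai = 2 * compute_iters / 8
    return ai, min(PEAK_GFLOPS, PEAK_BW_GB_S * ai)

for iters in (1, 10, 100, 1000):
    ai, bound = predicted(iters)
    regime = "memory" if PEAK_BW_GB_S * ai < PEAK_GFLOPS else "compute"
    print(f"iters={iters:5d}  AI={ai:7.2f}  bound={bound:7.1f} GFLOPS  ({regime}-bound)")
```

Under these assumed peaks, 100 iterations (AI = 25) lands just below the T4 ridge point of about 25.3 FLOP/Byte, so only the 1000-iteration case is clearly on the compute roof.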

### 🔶 Python/Numba (Optional - Quick Testing)

## 6. Python/Numba Optional Backup

Let's visualize the roofline model in Python:

```python
!pip install numpy matplotlib numba -q
```

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_roofline(peak_compute_gflops, peak_bandwidth_gb_s, kernels=None):
    """
    Plot a roofline model.
    
    Parameters:
    - peak_compute_gflops: Peak compute throughput in GFLOPS
    - peak_bandwidth_gb_s: Peak memory bandwidth in GB/s
    - kernels: List of (name, AI, achieved_gflops) tuples
    """
    # Calculate ridge point
    ridge_point = peak_compute_gflops / peak_bandwidth_gb_s
    
    # AI range for plotting
    ai = np.logspace(-2, 3, 1000)  # 0.01 to 1000 FLOP/Byte
    
    # Roofline: min(peak_compute, peak_bw * AI)
    roofline = np.minimum(peak_compute_gflops, peak_bandwidth_gb_s * ai)
    
    # Plot
    plt.figure(figsize=(12, 8))
    
    # Roofline
    plt.loglog(ai, roofline, 'b-', linewidth=3, label='Roofline')
    
    # Ridge point
    plt.axvline(x=ridge_point, color='g', linestyle='--', linewidth=1,
                label=f'Ridge Point ({ridge_point:.1f} FLOP/Byte)')
    
    # Peak lines
    plt.axhline(y=peak_compute_gflops, color='r', linestyle=':', linewidth=1,
                label=f'Peak Compute ({peak_compute_gflops:.0f} GFLOPS)')
    
    # Plot kernel points if provided
    colors = ['red', 'orange', 'purple', 'brown', 'pink']
    if kernels:
        for i, (name, kernel_ai, kernel_perf) in enumerate(kernels):
            color = colors[i % len(colors)]
            plt.scatter(kernel_ai, kernel_perf, s=200, c=color, 
                       marker='*', zorder=5)
            plt.annotate(name, (kernel_ai, kernel_perf), 
                        textcoords="offset points", 
                        xytext=(10, 10), fontsize=10, color=color)
    
    # Shading for regions
    plt.fill_between(ai[ai < ridge_point], 0.1, roofline[ai < ridge_point],
                     alpha=0.1, color='blue', label='Memory-bound region')
    plt.fill_between(ai[ai >= ridge_point], 0.1, roofline[ai >= ridge_point],
                     alpha=0.1, color='red', label='Compute-bound region')
    
    plt.xlabel('Arithmetic Intensity (FLOP/Byte)', fontsize=12)
    plt.ylabel('Performance (GFLOPS)', fontsize=12)
    plt.title(f'Roofline Model\nPeak: {peak_compute_gflops} GFLOPS, '
              f'Bandwidth: {peak_bandwidth_gb_s} GB/s', fontsize=14)
    plt.legend(loc='lower right', fontsize=10)
    plt.grid(True, which="both", ls="-", alpha=0.3)
    plt.xlim(0.01, 1000)
    plt.ylim(1, peak_compute_gflops * 2)
    
    plt.tight_layout()
    plt.show()
    
    return ridge_point
```

```python
# Example: T4 GPU (Google Colab)
PEAK_COMPUTE = 8100  # GFLOPS (FP32)
PEAK_BANDWIDTH = 320  # GB/s

# Example kernels: (name, AI, achieved GFLOPS).
# The achieved values are illustrative placeholders, not measurements;
# each is chosen to sit below min(PEAK_COMPUTE, PEAK_BANDWIDTH * AI),
# since no kernel can exceed the roofline.
kernels = [
    ("Vector Add", 0.083, 22),      # Very memory-bound (roof ≈ 27 GFLOPS)
    ("SAXPY", 0.167, 45),           # Memory-bound (roof ≈ 53 GFLOPS)
    ("Dot Product", 0.25, 65),      # Memory-bound (roof = 80 GFLOPS)
    ("3D Stencil", 0.75, 180),      # Memory-bound (roof = 240 GFLOPS)
    ("MatMul 1024", 170, 5000),     # Compute-bound (roof = 8100 GFLOPS)
]

ridge = plot_roofline(PEAK_COMPUTE, PEAK_BANDWIDTH, kernels)
print(f"Ridge Point: {ridge:.2f} FLOP/Byte")
print("\nKernels with AI below the ridge point are memory-bound")
print("Kernels with AI above the ridge point are compute-bound")
```

```python
# Calculate AI for common operations

def calculate_ai(name, flops, bytes_transferred):
    ai = flops / bytes_transferred
    print(f"{name}:")
    print(f"  FLOP: {flops}")
    print(f"  Bytes: {bytes_transferred}")
    print(f"  AI: {ai:.3f} FLOP/Byte")
    print()
    return ai

N = 1024  # Vector/matrix size

print("Arithmetic Intensity Calculations")
print("=" * 40)

# Vector Add: c = a + b
calculate_ai("Vector Add (N elements)",
             flops=N,  # N additions
             bytes_transferred=3 * N * 4)  # 3 vectors × N × 4 bytes

# SAXPY: y = a*x + y  
calculate_ai("SAXPY",
             flops=2 * N,  # N muls + N adds
             bytes_transferred=3 * N * 4)  # x, y(read), y(write)

# Matrix Multiply: C = A × B (naive)
calculate_ai(f"Matrix Multiply ({N}×{N})",
             flops=2 * N**3,  # N³ muls + N³ adds
             bytes_transferred=3 * N**2 * 4)  # A, B, C

# Convolution 3×3
calculate_ai("3×3 Convolution (per pixel)",
             flops=9 * 2,  # 9 muls + 9 adds
             bytes_transferred=10 * 4)  # 9 reads + 1 write
```

```python
# Measure kernel performance across the AI spectrum
import time

import numpy as np
import matplotlib.pyplot as plt
from numba import cuda

@cuda.jit
def variable_ai_kernel(output, input_arr, iters):
    """Kernel with variable AI based on iteration count"""
    idx = cuda.grid(1)
    if idx < input_arr.size:
        val = input_arr[idx]
        # Use float32 constants: bare Python float literals are float64 and
        # would promote the whole loop to FP64, which does not match an FP32 roofline.
        scale = np.float32(1.00001)
        offset = np.float32(0.00001)
        for i in range(iters):
            val = val * scale + offset  # 2 FLOP per iteration
        output[idx] = val

if cuda.is_available():
    N = 1 << 22  # 4M elements
    h_input = np.ones(N, dtype=np.float32)
    d_input = cuda.to_device(h_input)
    d_output = cuda.device_array_like(h_input)
    
    threads = 256
    blocks = (N + threads - 1) // threads
    
    results = []
    
    for iters in [1, 5, 10, 50, 100, 500, 1000]:
        # Warm up (the first launch also triggers JIT compilation)
        variable_ai_kernel[blocks, threads](d_output, d_input, iters)
        cuda.synchronize()
        
        # Time it
        start = time.perf_counter()
        for _ in range(10):
            variable_ai_kernel[blocks, threads](d_output, d_input, iters)
        cuda.synchronize()
        elapsed = (time.perf_counter() - start) / 10
        
        # Calculate metrics
        flops = N * iters * 2
        bytes_trans = N * 2 * 4  # Read + Write, float32
        ai = flops / bytes_trans
        gflops = flops / elapsed / 1e9
        
        results.append((ai, gflops))
        print(f"Iterations: {iters:4d}, AI: {ai:7.2f}, Performance: {gflops:8.2f} GFLOPS")
    
    # Plot measured vs roofline
    ais, perfs = zip(*results)
    plt.figure(figsize=(10, 6))
    
    # Theoretical roofline (T4 specs: 8100 GFLOPS FP32, 320 GB/s)
    ai_range = np.logspace(-1, 3, 100)
    roofline = np.minimum(8100, 320 * ai_range)
    plt.loglog(ai_range, roofline, 'b-', linewidth=2, label='Roofline')
    
    # Measured points
    plt.scatter(ais, perfs, s=100, c='red', marker='o', label='Measured', zorder=5)
    
    plt.xlabel('Arithmetic Intensity (FLOP/Byte)')
    plt.ylabel('Performance (GFLOPS)')
    plt.title('Measured Performance vs Roofline')
    plt.legend()
    plt.grid(True, which='both', alpha=0.3)
    plt.show()
else:
    print("No CUDA GPU available for testing")
```

## 7. Optimization Guidance from Roofline

### If Memory-Bound (below ridge point)

1. **Improve memory access patterns**
   - Ensure coalescing
   - Reduce bank conflicts
   
2. **Use memory hierarchy**
   - Cache in shared memory
   - Use texture/constant memory
   
3. **Reduce memory traffic**
   - Recompute cheap values instead of storing and reloading them
   - Compress data
   
4. **Increase arithmetic intensity**
   - Fuse kernels
   - Compute more per load
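
The effect of kernel fusion on arithmetic intensity can be estimated on paper before writing any CUDA. The sketch below uses a hypothetical two-step computation (`t = a*x + y` followed by `out = t*t`) and an illustrative helper named `ai`; it counts float32 values moved per element and ignores caching.

```python
def ai(flop_per_elem, floats_moved_per_elem, bytes_per_float=4):
    """Arithmetic intensity in FLOP/Byte for a per-element operation."""
    return flop_per_elem / (floats_moved_per_elem * bytes_per_float)

# Unfused: two kernels, with the intermediate t written to and read from memory
#   kernel 1: t = a * x + y   -> 2 FLOP, reads x and y, writes t  (3 floats)
#   kernel 2: out = t * t     -> 1 FLOP, reads t, writes out      (2 floats)
unfused = ai(2 + 1, 3 + 2)

# Fused: out = (a * x + y)**2 in one kernel, with t kept in a register
#   3 FLOP, reads x and y, writes out                             (3 floats)
fused = ai(3, 3)

print(f"unfused AI = {unfused:.3f}, fused AI = {fused:.3f} FLOP/Byte")
```

The FLOP count is unchanged; the gain comes entirely from removing the round trip of the intermediate through memory, which for a memory-bound kernel translates roughly proportionally into runtime.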

### If Compute-Bound (above ridge point)

1. **Increase parallelism**
   - More threads
   - Instruction-level parallelism
   
2. **Reduce instruction latency**
   - Use intrinsics
   - Avoid divergence
   
3. **Use specialized units**
   - Tensor cores for matrix ops
   - FP16 for higher throughput where the hardware and accuracy requirements allow

## 8. Key Takeaways

### Roofline Essentials

1. **Arithmetic Intensity (AI)** = FLOP / Bytes transferred
2. **Ridge Point** = Peak Compute / Peak Bandwidth
3. **Below ridge** = Memory-bound, optimize memory access
4. **Above ridge** = Compute-bound, optimize compute

### Common AI Values

| Operation | AI (FLOP/Byte) | Bound |
|-----------|----------------|-------|
| Vector copy | 0 | Memory |
| Vector add | 0.08 | Memory |
| SAXPY | 0.17 | Memory |
| Dot product | 0.25 | Memory |
| Stencil (7-point 3D) | 0.19-0.75 | Memory |
| GEMM (large) | 50-200 | Compute |

### Best Practices

1. Calculate theoretical AI before optimizing
2. Use `ncu --set roofline` for measured roofline
3. Focus optimization on actual bottleneck
4. Kernel fusion can increase AI

## 9. Exercises

### Exercise 1: Calculate AI
Calculate the arithmetic intensity for a 5-point 2D stencil:
```cpp
out[i][j] = in[i][j] + in[i-1][j] + in[i+1][j] + in[i][j-1] + in[i][j+1];
```

### Exercise 2: Roofline Position
Given a GPU with:
- Peak compute: 15 TFLOPS
- Peak bandwidth: 500 GB/s

Where on the roofline would a kernel with AI=5 FLOP/Byte appear?

### Exercise 3: Profile with ncu
Use Nsight Compute to generate a roofline analysis for one of your kernels.

### Exercise 4: Increase AI
Modify a memory-bound kernel to increase its arithmetic intensity by fusing multiple operations.

## Summary

Today you learned:
- The **roofline model** visualizes performance bounds
- **Arithmetic intensity** determines if you're memory or compute bound
- **Ridge point** is the boundary between regions
- Use `ncu --set roofline` for visual analysis
- Optimization strategy depends on which roof limits you

**Next**: Day 3 - Nsight Systems for application-level profiling