# 🚀 Day 1: Memory Coalescing - The Key to GPU Performance

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sdodlapati3/cuda-lab/blob/main/learning-path/week-02/day-1-memory-coalescing.ipynb)

---

## 🎣 The Hook: Why is My GPU Code So Slow?

> *You bought a Ferrari, but you're stuck in traffic.*

You've written a GPU kernel. It should be 100x faster than the CPU, but it's only 3x faster, or worse. What's going on?

**The answer is almost always: memory access patterns.**

Your GPU has incredible compute power (the Ferrari), but if you're accessing memory inefficiently, you're creating a traffic jam at the memory controller. Today, we'll learn how to clear that traffic jam with **memory coalescing**.

---

## Learning Objectives

By the end of this notebook, you will understand:
- 🎯 **Why** memory bandwidth is the #1 GPU bottleneck
- 🔧 **How** the GPU memory system actually works (transactions, bursts)
- ✅ **What** makes an access pattern coalesced vs. non-coalesced  
- 📊 **How much** performance you gain (hint: often 10-30x!)

---

## Learning Philosophy

> **CUDA C++ First, Python/Numba as Optional Backup**

This notebook shows:
1. **CUDA C++ code** - The PRIMARY implementation you should learn
2. **Python/Numba code** - OPTIONAL for quick interactive testing in Colab

---

In [None]:
# ⚙️ Colab/Local Setup - Run this first!
# Python/Numba is OPTIONAL - for quick interactive testing only
import subprocess, sys
try:
    import google.colab
    print("🔧 Running on Google Colab - Installing dependencies...")
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", "numba"])
    print("✅ Setup complete!")
except ImportError:
    print("💻 Running locally - make sure you have: pip install numba numpy")

import numpy as np
from numba import cuda
import math
import time

print("\n⚠️  Remember: CUDA C++ code is the PRIMARY learning material!")

## 1. How GPU Memory Access Works

<details open>
<summary>💡 <b>Concept Card: The Delivery Truck Analogy</b></summary>

### 🎯 The Problem
When your GPU kernel reads data from global memory, it doesn't fetch individual bytes. Understanding this is the key to 10-30x performance gains.

### 🚚 The Delivery Truck Analogy
Think of the GPU memory controller like a **delivery truck with a minimum package size**.

- The truck (memory controller) can only deliver packages of **32, 64, or 128 bytes**
- Even if you only want **4 bytes** (one `float`), the truck delivers a full **128-byte package**
- If 32 threads in a warp request **adjacent `float` elements** (indices 0, 1, 2, ... 31), the truck makes **ONE trip** with one 128-byte package
- If 32 threads request **scattered elements** (indices 0, 32, 64, ... 992, each 128 bytes apart), the truck must make **32 separate trips**!

**Same data. Same computation. 32x more memory traffic.**

### 🔧 Hardware Reality
When a warp (32 threads) accesses global memory, the hardware:
1. Collects all memory addresses from all 32 threads
2. Groups addresses that fall within the same **128-byte aligned segment**
3. Issues one **memory transaction** per segment needed
4. Threads receive their data when the transaction completes

**Best case:** All 32 threads access one 128-byte segment → **1 transaction**  
**Worst case:** All 32 threads access different segments → **32 transactions**

### ✅ The Pattern
```
COALESCED (Good) - 1 transaction:
Thread:    0    1    2    3   ...  31
Index:    [0]  [1]  [2]  [3] ... [31]
           └────────────────────────┘
                 All in same 128-byte segment

NON-COALESCED (Bad) - up to 32 transactions:
Thread:    0    1    2    3   ...  31
Index:    [0] [32] [64] [96] ...[992]
           │    │    │    │       │
           Different segments → separate transactions
```

</details>
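
The four hardware steps above can be imitated on the CPU to build intuition. The following is a minimal sketch in plain Python, assuming 4-byte `float` elements and the 128-byte segments used in this card. `segments_touched` is a hypothetical helper written for illustration, not a CUDA or Numba API, and it ignores caches and the smaller 32-byte granularity that many GPUs also use.

```python
# Sketch: count how many aligned segments one warp touches.
# `segments_touched` is an illustrative helper, not part of CUDA or Numba.

WARP_SIZE = 32

def segments_touched(indices, elem_size=4, segment_bytes=128, base_addr=0):
    """Return the number of distinct aligned segments covered by the accesses."""
    segments = set()
    for i in indices:
        first_byte = base_addr + i * elem_size
        last_byte = first_byte + elem_size - 1
        # An element could straddle a boundary if it were misaligned,
        # so record the segment of both its first and last byte.
        segments.add(first_byte // segment_bytes)
        segments.add(last_byte // segment_bytes)
    return len(segments)

coalesced = [t for t in range(WARP_SIZE)]       # thread t reads element t
strided = [t * 32 for t in range(WARP_SIZE)]    # thread t reads element 32 * t

print("coalesced:", segments_touched(coalesced), "segment(s)")
print("stride 32:", segments_touched(strided), "segment(s)")
```

Replacing `t * 32` with `t * 2` gives the intermediate case: the same 32 elements spread over two segments.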

---

### 🔷 CUDA C++ Implementation (Primary)

Let's see this in action with a benchmark comparing coalesced vs. strided access:

In [None]:
%%writefile coalescing_demo.cu
#include <stdio.h>
#include <stdlib.h>  // malloc, free
#include <cuda_runtime.h>

// GOOD: Coalesced access - adjacent threads access adjacent memory
__global__ void coalescedCopy(const float* src, float* dst, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        dst[idx] = src[idx];  // Thread 0 → element 0, Thread 1 → element 1, ...
    }
}

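// BAD: Strided access - adjacent threads access scattered memory
// Every element is still copied exactly once (assumes n % stride == 0),
// so the timing is directly comparable with coalescedCopy.
__global__ void stridedCopy(const float* src, float* dst, int n, int stride) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        // Permuted index: Thread 0 → element 0, Thread 1 → element stride, ...
        // then wrap around to elements 1, 1 + stride, ...
        int per_pass = n / stride;
        int strided_idx = (idx % per_pass) * stride + idx / per_pass;
        dst[strided_idx] = src[strided_idx];
    }
}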

int main() {
    const int N = 1 << 20;  // 1M elements
    const int bytes = N * sizeof(float);
    
    // Allocate host memory
    float *h_src = (float*)malloc(bytes);
    float *h_dst = (float*)malloc(bytes);
    
    // Initialize source data
    for (int i = 0; i < N; i++) {
        h_src[i] = (float)i;
    }
    
    // Allocate device memory
    float *d_src, *d_dst;
    cudaMalloc(&d_src, bytes);
    cudaMalloc(&d_dst, bytes);
    
    // Copy to device
    cudaMemcpy(d_src, h_src, bytes, cudaMemcpyHostToDevice);
    
    // Launch coalesced kernel
    int threads = 256;
    int blocks = (N + threads - 1) / threads;
    
    printf("Testing coalesced vs strided access patterns:\n");
    printf("Array size: %d elements\n\n", N);
    
    // Time coalesced access
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    
    cudaEventRecord(start);
    for (int i = 0; i < 100; i++) {
        coalescedCopy<<<blocks, threads>>>(d_src, d_dst, N);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    
    float coalesced_ms;
    cudaEventElapsedTime(&coalesced_ms, start, stop);
    printf("Coalesced access (stride=1): %.3f ms (avg per iteration)\n", coalesced_ms / 100);
    
    // Time strided access with different strides
    for (int stride = 2; stride <= 32; stride *= 2) {
        cudaMemset(d_dst, 0, bytes);
        
        cudaEventRecord(start);
        for (int i = 0; i < 100; i++) {
            stridedCopy<<<blocks, threads>>>(d_src, d_dst, N, stride);
        }
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        
        float strided_ms;
        cudaEventElapsedTime(&strided_ms, start, stop);
        printf("Strided access (stride=%d): %.3f ms (%.1fx slower)\n", 
               stride, strided_ms / 100, strided_ms / coalesced_ms);
    }
    
    // Cleanup
    cudaFree(d_src);
    cudaFree(d_dst);
    free(h_src);
    free(h_dst);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    
    return 0;
}

In [None]:
!nvcc -arch=sm_75 -o coalescing_demo coalescing_demo.cu
!./coalescing_demo

In [None]:
# 📊 Visualization: Memory Access Patterns
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import numpy as np

fig, axes = plt.subplots(1, 2, figsize=(14, 4))

# Left: Coalesced Access
ax1 = axes[0]
ax1.set_title("✅ Coalesced Access\n(1 Memory Transaction)", fontsize=12, fontweight='bold', color='green')

# Draw memory segment (facecolor, not color: `color` would override edgecolor)
ax1.add_patch(plt.Rectangle((0, 0), 32, 1, fill=True, facecolor='lightgreen', edgecolor='green', linewidth=2))
ax1.text(16, 0.5, "128-byte Memory Segment", ha='center', va='center', fontsize=10, fontweight='bold')

# Draw threads
for i in range(8):
    ax1.annotate('', xy=(i*4 + 2, 0), xytext=(i*4 + 2, -0.8),
                arrowprops=dict(arrowstyle='->', color='blue', lw=1.5))
    ax1.text(i*4 + 2, -1, f'T{i}', ha='center', fontsize=8)

ax1.text(16, -1.5, "Threads 0-31 access float elements 0-31", ha='center', fontsize=9, style='italic')
ax1.set_xlim(-1, 33)
ax1.set_ylim(-2, 1.5)
ax1.axis('off')

# Right: Strided Access
ax2 = axes[1]
ax2.set_title("❌ Strided Access (stride=32)\n(32 Memory Transactions!)", fontsize=12, fontweight='bold', color='red')

# Draw multiple memory segments
for seg in range(4):
    ax2.add_patch(plt.Rectangle((seg*8, seg*0.3), 6, 0.8, fill=True, 
                                 facecolor='lightsalmon', edgecolor='red', linewidth=1.5))
    ax2.annotate('', xy=(seg*8 + 3, seg*0.3), xytext=(seg*8 + 3, -0.8),
                arrowprops=dict(arrowstyle='->', color='blue', lw=1.5))
    ax2.text(seg*8 + 3, -1, f'T{seg}', ha='center', fontsize=8)
    ax2.text(seg*8 + 3, seg*0.3 + 0.4, f'Seg {seg}', ha='center', fontsize=8)

ax2.text(14, 2.2, "...", fontsize=16, ha='center')
ax2.text(14, -1.5, "Each thread hits a different 128-byte segment!", ha='center', fontsize=9, style='italic')
ax2.set_xlim(-1, 33)
ax2.set_ylim(-2, 2.5)
ax2.axis('off')

plt.tight_layout()
plt.savefig('coalescing_diagram.png', dpi=150, bbox_inches='tight', facecolor='white')
plt.show()
print("💾 Diagram saved as coalescing_diagram.png")

### 🔶 Python/Numba (Optional - Quick Testing)

In [None]:
import numpy as np
from numba import cuda
import math
import time

print("CUDA device:", cuda.get_current_device().name.decode())

# Query basic device properties (warp size, max threads per block)
device = cuda.get_current_device()
print(f"Warp size: {device.WARP_SIZE}")
print(f"Max threads per block: {device.MAX_THREADS_PER_BLOCK}")

---

## 2. The Coalescing Rules

<details open>
<summary>💡 <b>Concept Card: The Golden Rule of GPU Memory</b></summary>

### 🎯 The Golden Rule
> **Adjacent threads (within a warp) should access adjacent memory locations.**

If you remember only one thing from this notebook, remember this. It applies to:
- 1D arrays: `arr[threadIdx.x]` ✅ vs `arr[threadIdx.x * stride]` ❌
- 2D arrays: iterate over columns (fast dimension) with `threadIdx.x`
- Structs: prefer Structure of Arrays (SoA) over Array of Structures (AoS)

### 🔧 Transaction Sizes
The memory controller issues transactions in fixed sizes:

| Transaction Size | Elements (`float32`) | When Used |
|-----------------|---------------------|-----------|
| 32 bytes | 8 floats | Partial warp access |
| 64 bytes | 16 floats | Half warp aligned |
| **128 bytes** | **32 floats** | **Full warp aligned** ← ideal! |

A perfectly coalesced warp access loads 32 floats in a single 128-byte transaction.

### ⚠️ Common Gotchas
1. **Stride > 1**: `arr[tid * 2]` doubles transactions
2. **Column-major access**: Reading down columns in a row-major array
3. **Misaligned base**: Starting address not 128-byte aligned
4. **Random access**: Hash tables, gather operations

</details>
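
To put rough numbers on the gotchas and on the AoS versus SoA advice above, this sketch computes how many 128-byte segments one warp fetches and what fraction of the fetched bytes it actually uses. It is an illustrative CPU-side model under this card's assumptions (4-byte floats, 128-byte segments); `warp_efficiency` and the 12-byte `{x, y, z}` struct are made-up examples, not part of any library.

```python
# Sketch: quantify the gotchas above for one warp of 32 threads.
# All names are illustrative; sizes assume 4-byte floats and 128-byte segments.

WARP_SIZE = 32
ELEM_SIZE = 4
SEGMENT_BYTES = 128

def warp_efficiency(byte_addresses):
    """Return (segments fetched, useful bytes / fetched bytes) for one warp."""
    segments = {addr // SEGMENT_BYTES for addr in byte_addresses}
    useful = len(set(byte_addresses)) * ELEM_SIZE
    fetched = len(segments) * SEGMENT_BYTES
    return len(segments), useful / fetched

threads = range(WARP_SIZE)
patterns = {
    "aligned, stride 1": [t * ELEM_SIZE for t in threads],
    "stride 2": [t * 2 * ELEM_SIZE for t in threads],
    "misaligned base (+4 bytes)": [4 + t * ELEM_SIZE for t in threads],
    # AoS: a hypothetical struct {float x, y, z;} is 12 bytes; each thread reads only .x
    "AoS, field x of 12-byte struct": [t * 12 for t in threads],
    # SoA: all x values are stored back to back
    "SoA, array of x": [t * ELEM_SIZE for t in threads],
}

for name, addrs in patterns.items():
    n_segments, efficiency = warp_efficiency(addrs)
    print(f"{name:<32} {n_segments} segment(s), {efficiency:.0%} of fetched bytes used")
```

In this model the stride-2 and misaligned cases both fetch two segments for one segment's worth of data, and the AoS read fetches three.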

### Quick Reference: Access Pattern Cheat Sheet

| Pattern | Example | Coalesced? | Why? |
|---------|---------|------------|------|
| Sequential | `arr[threadIdx.x]` | ✅ Yes | Adjacent threads → adjacent addresses |
| Strided | `arr[threadIdx.x * stride]` | ❌ No | Threads skip addresses |
| Random | `arr[hash(threadIdx.x)]` | ❌ No | Unpredictable addresses |
| Row-major 2D | `arr[row][col]` where `col = threadIdx.x` | ✅ Yes | Threads vary fast dimension |
| Column-major 2D | `arr[row][col]` where `row = threadIdx.x` | ❌ No | Threads vary slow dimension |

---

### 🔷 CUDA C++ 2D Access Patterns (Primary)

<details open>
<summary>💡 <b>Concept Card: Row-Major Layout and Thread Mapping</b></summary>

### 🎯 The Problem
2D arrays in C/C++/NumPy are stored in **row-major order**. This means:
- Elements in the same row are contiguous in memory
- Moving to the next row jumps by `num_columns` elements

### 🔧 Memory Layout Visualization
```
Logical 2D View:           Physical 1D Memory:
┌─────────────────┐
│ [0,0][0,1][0,2] │ Row 0   → [0,0][0,1][0,2][1,0][1,1][1,2][2,0]...
│ [1,0][1,1][1,2] │ Row 1         ↑         ↑
│ [2,0][2,1][2,2] │ Row 2     Contiguous  Row boundary (jump!)
└─────────────────┘
```

### ✅ The Pattern
```cuda
// ✅ COALESCED: threadIdx.x varies the column (fast dimension)
int col = blockIdx.x * blockDim.x + threadIdx.x;  // threadIdx.x → column
int row = blockIdx.y * blockDim.y + threadIdx.y;
int idx = row * num_cols + col;  // Adjacent threads → adjacent addresses

// ❌ NON-COALESCED: threadIdx.x varies the row (slow dimension)
int row = blockIdx.x * blockDim.x + threadIdx.x;  // threadIdx.x → row (WRONG!)
int col = blockIdx.y * blockDim.y + threadIdx.y;
int idx = row * num_cols + col;  // Adjacent threads → addresses num_cols apart
```

### ⚠️ Key Insight
The first dimension of your CUDA grid (`blockIdx.x`, `threadIdx.x`) should map to the **last dimension** of your array (columns in row-major).

</details>
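
The difference between the two mappings can be checked with plain arithmetic. The sketch below, written in Python for quick experimentation, evaluates `row * num_cols + col` for neighbouring values of `threadIdx.x` under each mapping. `NUM_COLS = 4096` and the helper names are assumptions chosen for illustration.

```python
# Sketch: linear indices produced by the two thread mappings.
# NUM_COLS and the helper names are illustrative assumptions.

NUM_COLS = 4096  # assumed matrix width

def linear_index(row, col, num_cols=NUM_COLS):
    """Row-major offset of element (row, col)."""
    return row * num_cols + col

def warp_indices(x_maps_to_column, n_threads=4):
    """Linear indices touched by threadIdx.x = 0..n_threads-1 (threadIdx.y = 0)."""
    if x_maps_to_column:
        return [linear_index(row=0, col=x) for x in range(n_threads)]
    return [linear_index(row=x, col=0) for x in range(n_threads)]

print("threadIdx.x -> column:", warp_indices(True))
print("threadIdx.x -> row:   ", warp_indices(False))
```

With `threadIdx.x` mapped to the column, neighbouring threads are one element apart; mapped to the row, they are `NUM_COLS` elements apart, which at 4 bytes per element puts every thread in a different 128-byte segment.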

In [None]:
%%writefile coalescing_2d.cu
#include <stdio.h>
#include <stdlib.h>  // malloc, free
#include <cuda_runtime.h>

// GOOD: Row-major access - threads in a warp access adjacent columns
__global__ void rowMajorAccess(float* matrix, float* output, int rows, int cols) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // Fast dimension
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    
    if (row < rows && col < cols) {
        int idx = row * cols + col;  // Row-major indexing
        output[idx] = matrix[idx] * 2.0f;
    }
}

// BAD: Column-major access - threads in a warp access scattered rows
__global__ void colMajorAccess(float* matrix, float* output, int rows, int cols) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;  // Wrong! Fast dimension on rows
    int col = blockIdx.y * blockDim.y + threadIdx.y;
    
    if (row < rows && col < cols) {
        int idx = row * cols + col;
        output[idx] = matrix[idx] * 2.0f;
    }
}

int main() {
    const int ROWS = 4096;
    const int COLS = 4096;
    const int SIZE = ROWS * COLS;
    const size_t bytes = SIZE * sizeof(float);
    
    float *d_matrix, *d_output;
    cudaMalloc(&d_matrix, bytes);
    cudaMalloc(&d_output, bytes);
    
    // Initialize
    float* h_matrix = (float*)malloc(bytes);
    for (int i = 0; i < SIZE; i++) h_matrix[i] = 1.0f;
    cudaMemcpy(d_matrix, h_matrix, bytes, cudaMemcpyHostToDevice);
    
    dim3 threads(16, 16);
    dim3 blocks_row((COLS + 15) / 16, (ROWS + 15) / 16);
    dim3 blocks_col((ROWS + 15) / 16, (COLS + 15) / 16);
    
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    
    printf("=== 2D Access Pattern Benchmark ===\n");
    printf("Matrix: %d x %d (%.1f MB)\n\n", ROWS, COLS, bytes / 1e6);
    
    // Benchmark row-major (coalesced)
    cudaEventRecord(start);
    for (int i = 0; i < 100; i++) {
        rowMajorAccess<<<blocks_row, threads>>>(d_matrix, d_output, ROWS, COLS);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    
    float row_ms;
    cudaEventElapsedTime(&row_ms, start, stop);
    float row_bw = (2 * bytes * 100) / (row_ms / 1000) / 1e9;
    
    // Benchmark column-major (non-coalesced)
    cudaEventRecord(start);
    for (int i = 0; i < 100; i++) {
        colMajorAccess<<<blocks_col, threads>>>(d_matrix, d_output, ROWS, COLS);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    
    float col_ms;
    cudaEventElapsedTime(&col_ms, start, stop);
    float col_bw = (2 * bytes * 100) / (col_ms / 1000) / 1e9;
    
    printf("Row-major (coalesced):    %.2f ms, %.1f GB/s\n", row_ms / 100, row_bw);
    printf("Column-major (strided):   %.2f ms, %.1f GB/s\n", col_ms / 100, col_bw);
    printf("Speedup from coalescing:  %.2fx\n", col_ms / row_ms);
    
    cudaFree(d_matrix);
    cudaFree(d_output);
    free(h_matrix);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    
    return 0;
}

In [None]:
!nvcc -o coalescing_2d coalescing_2d.cu && ./coalescing_2d

---

## 3. See the Impact: Coalesced vs Non-Coalesced Benchmarks

*Now that we understand the theory, let's **prove it with numbers**.*

The following Python/Numba code demonstrates the same principles for quick interactive testing.
Run both cells to see the dramatic performance difference.

In [None]:
@cuda.jit
def coalesced_copy(src, dst, n):
    """
    COALESCED: Adjacent threads access adjacent elements.
    Thread 0 → src[0], Thread 1 → src[1], ...
    """
    idx = cuda.grid(1)
    if idx < n:
        dst[idx] = src[idx]

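@cuda.jit
def strided_copy(src, dst, n, stride):
    """
    NON-COALESCED: Adjacent threads access elements `stride` apart.
    Thread 0 → src[0], Thread 1 → src[stride], Thread 2 → src[2*stride]...
    Every element is still copied exactly once (assumes n % stride == 0),
    so time and bandwidth are comparable with coalesced_copy.
    """
    tid = cuda.grid(1)
    if tid < n:
        per_pass = n // stride
        # Permuted index: walk the array with the given stride, then wrap around
        idx = (tid % per_pass) * stride + tid // per_pass
        dst[idx] = src[idx]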

def benchmark_access_pattern(n, stride=1, iterations=100):
    """Benchmark different access patterns"""
    src = cuda.to_device(np.random.randn(n).astype(np.float32))
    dst = cuda.device_array(n, dtype=np.float32)
    
    threads = 256
    blocks = math.ceil(n / threads)
    
    # Warmup
    if stride == 1:
        coalesced_copy[blocks, threads](src, dst, n)
    else:
        strided_copy[blocks, threads](src, dst, n, stride)
    cuda.synchronize()
    
    # Benchmark
    start = time.perf_counter()
    for _ in range(iterations):
        if stride == 1:
            coalesced_copy[blocks, threads](src, dst, n)
        else:
            strided_copy[blocks, threads](src, dst, n, stride)
    cuda.synchronize()
    elapsed = (time.perf_counter() - start) / iterations
    
    # Calculate bandwidth
    bytes_transferred = 2 * n * 4  # Read + write, float32
    bandwidth = bytes_transferred / elapsed / 1e9
    
    return elapsed * 1000, bandwidth

# Compare patterns
n = 100_000_000  # 100M elements

print(f"Array size: {n:,} elements ({n * 4 / 1e9:.2f} GB)")
print("=" * 60)
print(f"{'Pattern':<25} | {'Time (ms)':<12} | {'Bandwidth (GB/s)'}")
print("-" * 60)

time_coal, bw_coal = benchmark_access_pattern(n, stride=1)
print(f"{'Coalesced (stride=1)':<25} | {time_coal:<12.3f} | {bw_coal:.1f}")

for stride in [2, 4, 8, 16, 32]:
    time_s, bw_s = benchmark_access_pattern(n, stride=stride)
    slowdown = time_s / time_coal
    print(f"{'Strided (stride=' + str(stride) + ')':<25} | {time_s:<12.3f} | {bw_s:.1f} ({slowdown:.1f}x slower)")

### 🔶 Python/Numba Version (Optional)

The following demonstrates the same concept using Numba for quick interactive testing.
Note how the thread-to-index mapping determines whether access is coalesced.

In [None]:
@cuda.jit
def row_major_read(matrix, output, rows, cols):
    """
    COALESCED: Each warp reads along a row.
    threadIdx.x corresponds to column (fast-changing dimension)
    """
    col, row = cuda.grid(2)
    
    if row < rows and col < cols:
        # Reading matrix[row][col] - threads in a warp read adjacent columns
        output[row, col] = matrix[row, col] * 2.0

@cuda.jit
def col_major_read(matrix, output, rows, cols):
    """
    NON-COALESCED: Each warp reads down a column.
    threadIdx.x corresponds to row (slow-changing dimension)
    """
    row, col = cuda.grid(2)  # Note: swapped!
    
    if row < rows and col < cols:
        # Same operation, but different thread mapping
        output[row, col] = matrix[row, col] * 2.0

def benchmark_2d_pattern(rows, cols, pattern='row', iterations=100):
    """Benchmark 2D access patterns"""
    matrix = cuda.to_device(np.random.randn(rows, cols).astype(np.float32))
    output = cuda.device_array((rows, cols), dtype=np.float32)
    
    threads = (16, 16)  # 256 threads per block
    
    if pattern == 'row':
        blocks = (math.ceil(cols / 16), math.ceil(rows / 16))
        kernel = row_major_read
    else:
        blocks = (math.ceil(rows / 16), math.ceil(cols / 16))
        kernel = col_major_read
    
    # Warmup
    kernel[blocks, threads](matrix, output, rows, cols)
    cuda.synchronize()
    
    # Benchmark
    start = time.perf_counter()
    for _ in range(iterations):
        kernel[blocks, threads](matrix, output, rows, cols)
    cuda.synchronize()
    elapsed = (time.perf_counter() - start) / iterations
    
    bytes_transferred = 2 * rows * cols * 4
    bandwidth = bytes_transferred / elapsed / 1e9
    
    return elapsed * 1000, bandwidth

# Benchmark 2D patterns
rows, cols = 4096, 4096  # 64 MiB (about 67 MB) matrix

print(f"Matrix size: {rows} × {cols} ({rows * cols * 4 / 1e6:.1f} MB)")
print("=" * 55)

time_row, bw_row = benchmark_2d_pattern(rows, cols, 'row')
time_col, bw_col = benchmark_2d_pattern(rows, cols, 'col')

print(f"Row-major (coalesced):     {time_row:.3f} ms, {bw_row:.1f} GB/s")
print(f"Column-major (strided):    {time_col:.3f} ms, {bw_col:.1f} GB/s")
print(f"\nSpeedup from coalescing: {time_col/time_row:.2f}x")

In [None]:
# 📊 Visualize the Performance Difference
import matplotlib.pyplot as plt

# Use the results from the benchmark above (or example values if not run)
try:
    patterns = ['Row-Major\n(Coalesced)', 'Column-Major\n(Strided)']
    times = [time_row, time_col]
    bandwidths = [bw_row, bw_col]
except NameError:
    # Example values if benchmark wasn't run
    patterns = ['Row-Major\n(Coalesced)', 'Column-Major\n(Strided)']
    times = [0.15, 1.5]  # Example: 10x difference
    bandwidths = [400, 40]
    print("⚠️ Using example values. Run the benchmark above to see actual results.\n")

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

# Time comparison
colors = ['#2ecc71', '#e74c3c']  # Green for good, red for bad
bars1 = ax1.bar(patterns, times, color=colors)
ax1.set_ylabel('Time (ms)', fontsize=11)
ax1.set_title('⏱️ Execution Time\n(Lower is Better)', fontsize=12, fontweight='bold')
for bar, val in zip(bars1, times):
    ax1.text(bar.get_x() + bar.get_width()/2, bar.get_height() + max(times)*0.02,
            f'{val:.2f} ms', ha='center', fontsize=10, fontweight='bold')

# Bandwidth comparison  
bars2 = ax2.bar(patterns, bandwidths, color=colors)
ax2.set_ylabel('Bandwidth (GB/s)', fontsize=11)
ax2.set_title('📈 Memory Bandwidth Achieved\n(Higher is Better)', fontsize=12, fontweight='bold')
for bar, val in zip(bars2, bandwidths):
    ax2.text(bar.get_x() + bar.get_width()/2, bar.get_height() + max(bandwidths)*0.02,
            f'{val:.0f} GB/s', ha='center', fontsize=10, fontweight='bold')

plt.tight_layout()
plt.savefig('coalescing_benchmark.png', dpi=150, bbox_inches='tight', facecolor='white')
plt.show()

speedup = times[1] / times[0] if times[0] > 0 else 10
print(f"\n🎯 Key Result: Coalesced access is {speedup:.1f}x faster!")
print("   Same computation, same data size, just a different memory access pattern.")

---

## 5. Matrix Transpose: A Classic Coalescing Challenge

<details open>
<summary>💡 <b>Concept Card: The Transpose Dilemma</b></summary>

### 🎯 The Problem
Matrix transpose seems simple: copy `A[i][j]` to `B[j][i]`. But there's a fundamental conflict:

- **Reading rows** (coalesced) → **Writing columns** (non-coalesced)
- **Reading columns** (non-coalesced) → **Writing rows** (coalesced)

**You can't have both coalesced reads AND writes with a naive approach!**

### 🔧 Why This Happens
```
Input Matrix (read):                       Output Matrix (write):
┌─────────────┐                            ┌─────────────┐
│ A B C D ... │ ← Threads 0-3 read here    │ A E I M ... │ ← But must write scattered!
│ E F G H ... │                            │ B F J N ... │
│ I J K L ... │                            │ C G K O ... │
└─────────────┘                            └─────────────┘

If we read row 0 (A,B,C,D) contiguously...
We must write to column 0 (A,E,I,M) which is strided in memory!
```

### ✅ The Solution (Preview)
In the next notebook (Day 2), we'll use **shared memory** to fix this:
1. Threads read coalesced into shared memory
2. Synchronize (shared memory has no coalescing requirement!)
3. Threads write coalesced from shared memory

This can turn a slowdown of roughly 10x into nearly optimal performance; the exact factor depends on the GPU.

### ⚠️ Key Takeaway
Matrix transpose is the canonical example of why coalescing matters. Naive code can be **10-32x slower** than optimized code doing the exact same computation.

</details>
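
The conflict, and the shared-memory fix previewed above, can be made concrete by listing the indices one warp reads and writes. This is a CPU-side sketch in Python under stated assumptions: a 32-wide tile so that one warp covers one tile row (the CUDA kernels in this notebook use 16x16 blocks), 4-byte floats, and 128-byte segments. The helper names are hypothetical, and the staging tile is modelled only through its index mapping; no shared memory is simulated.

```python
# Sketch: per-warp read/write index patterns for a naive and a tiled transpose.
# TILE = 32 and all helper names are assumptions made for illustration.

ROWS, COLS = 4096, 4096
TILE = 32            # assumed tile width: one warp covers one tile row
SEGMENT_ELEMS = 32   # 128-byte segment / 4-byte float

def segments(indices):
    """Number of distinct 128-byte segments touched by a list of element indices."""
    return len({i // SEGMENT_ELEMS for i in indices})

def naive_warp(block_x, block_y, ty):
    """Read/write indices of one warp (tx = 0..31) in the naive transpose."""
    reads, writes = [], []
    for tx in range(TILE):
        col = block_x * TILE + tx
        row = block_y * TILE + ty
        reads.append(row * COLS + col)    # input[row][col]
        writes.append(col * ROWS + row)   # output[col][row]
    return reads, writes

def tiled_warp(block_x, block_y, ty):
    """Same warp when a tile is staged so that writes also run along a row."""
    reads, writes = [], []
    for tx in range(TILE):
        # Load phase: tile[ty][tx] = input[row][col], exactly as in the naive kernel.
        reads.append((block_y * TILE + ty) * COLS + block_x * TILE + tx)
        # Store phase: the thread writes tile[tx][ty] into output row block_x*TILE + ty.
        writes.append((block_x * TILE + ty) * ROWS + block_y * TILE + tx)
    return reads, writes

for name, fn in [("naive", naive_warp), ("tiled", tiled_warp)]:
    r, w = fn(block_x=1, block_y=2, ty=3)
    print(f"{name}: {segments(r)} read segment(s), {segments(w)} write segment(s)")
```

In this model the naive warp reads one segment but writes to 32, while the tiled mapping touches one segment on each side; the transposition itself happens in the order the tile is indexed.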

### 🔷 CUDA C++ Matrix Transpose (Primary)

In [None]:
%%writefile transpose_naive.cu
#include <stdio.h>
#include <stdlib.h>  // malloc, free
#include <cuda_runtime.h>

// Naive transpose: coalesced reads, non-coalesced writes
__global__ void transposeReadCoalesced(float* input, float* output, int rows, int cols) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    
    if (row < rows && col < cols) {
        // Read: input[row][col] - coalesced (adjacent threads read adjacent elements of a row)
        // Write: output[col][row] - non-coalesced (adjacent threads write `rows` elements apart)
        output[col * rows + row] = input[row * cols + col];
    }
}

// Alternative: coalesced writes, non-coalesced reads
// Note: this indexing assumes a square matrix (rows == cols), as in main() below
__global__ void transposeWriteCoalesced(float* input, float* output, int rows, int cols) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    
    if (row < rows && col < cols) {
        // Read: input[col][row] - non-coalesced (scattered reads)
        // Write: output[row][col] - coalesced (threads write along row)
        output[row * cols + col] = input[col * rows + row];
    }
}

int main() {
    const int ROWS = 4096;
    const int COLS = 4096;
    const size_t bytes = ROWS * COLS * sizeof(float);
    
    float *d_input, *d_output;
    cudaMalloc(&d_input, bytes);
    cudaMalloc(&d_output, bytes);
    
    // Initialize
    float* h_input = (float*)malloc(bytes);
    for (int i = 0; i < ROWS * COLS; i++) h_input[i] = (float)i;
    cudaMemcpy(d_input, h_input, bytes, cudaMemcpyHostToDevice);
    
    dim3 threads(16, 16);
    dim3 blocks((COLS + 15) / 16, (ROWS + 15) / 16);
    
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    
    printf("=== Matrix Transpose Coalescing Demo ===\n");
    printf("Matrix: %d x %d\n\n", ROWS, COLS);
    
    // Benchmark read-coalesced
    cudaEventRecord(start);
    for (int i = 0; i < 100; i++) {
        transposeReadCoalesced<<<blocks, threads>>>(d_input, d_output, ROWS, COLS);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    
    float read_ms;
    cudaEventElapsedTime(&read_ms, start, stop);
    
    // Benchmark write-coalesced
    cudaEventRecord(start);
    for (int i = 0; i < 100; i++) {
        transposeWriteCoalesced<<<blocks, threads>>>(d_input, d_output, ROWS, COLS);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    
    float write_ms;
    cudaEventElapsedTime(&write_ms, start, stop);
    
    printf("Read-coalesced transpose:  %.2f ms\n", read_ms / 100);
    printf("Write-coalesced transpose: %.2f ms\n", write_ms / 100);
    printf("\nNeither is optimal - shared memory fixes this (Day 2)!\n");
    
    cudaFree(d_input);
    cudaFree(d_output);
    free(h_input);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    
    return 0;
}

In [None]:
!nvcc -o transpose_naive transpose_naive.cu && ./transpose_naive

In [None]:
@cuda.jit
def transpose_naive(input_matrix, output_matrix, rows, cols):
    """
    Naive transpose: coalesced reads, non-coalesced writes
    """
    col, row = cuda.grid(2)
    
    if row < rows and col < cols:
        # Read from input[row][col] (coalesced - threads read along row)
        # Write to output[col][row] (non-coalesced - threads write to scattered locations)
        output_matrix[col, row] = input_matrix[row, col]

@cuda.jit
def transpose_read_coalesced(input_matrix, output_matrix, rows, cols):
    """
    Same as naive - prioritize coalesced reads
    """
    col, row = cuda.grid(2)
    
    if row < rows and col < cols:
        output_matrix[col, row] = input_matrix[row, col]

@cuda.jit
def transpose_write_coalesced(input_matrix, output_matrix, rows, cols):
    """
    Prioritize coalesced writes (non-coalesced reads).
    Note: this indexing assumes a square matrix (rows == cols).
    """
    col, row = cuda.grid(2)
    
    if row < rows and col < cols:
        # Read from input[col][row] (non-coalesced - scattered reads)
        # Write to output[row][col] (coalesced - threads write along row)
        output_matrix[row, col] = input_matrix[col, row]

def benchmark_transpose(rows, cols, kernel, iterations=100):
    """Benchmark transpose kernel"""
    input_mat = cuda.to_device(np.random.randn(rows, cols).astype(np.float32))
    output_mat = cuda.device_array((cols, rows), dtype=np.float32)
    
    threads = (16, 16)
    blocks = (math.ceil(cols / 16), math.ceil(rows / 16))
    
    # Warmup
    kernel[blocks, threads](input_mat, output_mat, rows, cols)
    cuda.synchronize()
    
    # Benchmark
    start = time.perf_counter()
    for _ in range(iterations):
        kernel[blocks, threads](input_mat, output_mat, rows, cols)
    cuda.synchronize()
    elapsed = (time.perf_counter() - start) / iterations
    
    bytes_transferred = 2 * rows * cols * 4
    bandwidth = bytes_transferred / elapsed / 1e9
    
    return elapsed * 1000, bandwidth

# Benchmark transpose patterns
rows, cols = 4096, 4096

print(f"Matrix transpose: {rows} × {cols}")
print("=" * 55)

time_read, bw_read = benchmark_transpose(rows, cols, transpose_read_coalesced)
time_write, bw_write = benchmark_transpose(rows, cols, transpose_write_coalesced)

print(f"Coalesced reads:   {time_read:.3f} ms, {bw_read:.1f} GB/s")
print(f"Coalesced writes:  {time_write:.3f} ms, {bw_write:.1f} GB/s")
print(f"\n💡 Neither is optimal! We need shared memory (Day 2) to fix this.")

## 🎯 Exercises

### 🔷 CUDA C++ Exercises (Primary)

Practice memory coalescing concepts with these hands-on exercises.

### Exercise 1: Identify the Access Pattern

For each kernel below, identify if the access is coalesced or not.

### 🔷 CUDA C++ Version (Primary)

In [None]:
%%writefile exercise_patterns.cu
// exercise_patterns.cu - Identify coalesced vs non-coalesced access
#include <stdio.h>
#include <cuda_runtime.h>

// Pattern A: Sequential access
__global__ void patternA(int* arr, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        arr[idx] = idx;  // TODO: Coalesced or not?
    }
}

// Pattern B: Reverse sequential access
__global__ void patternB(int* arr, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        arr[n - 1 - idx] = idx;  // TODO: Coalesced or not?
    }
}

// Pattern C: Strided access (every other element)
__global__ void patternC(int* arr, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n / 2) {
        arr[idx * 2] = idx;  // TODO: Coalesced or not?
    }
}

// Pattern D: 2D access with correct mapping
__global__ void patternD(int* matrix, int rows, int cols) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < rows && col < cols) {
        matrix[row * cols + col] = row + col;  // TODO: Coalesced or not?
    }
}

int main() {
    printf("=== Exercise 1: Identify Access Patterns ===\n\n");
    
    printf("Pattern A: arr[idx] = idx\n");
    printf("  → Threads 0,1,2,3,... write to indices 0,1,2,3,...\n");
    printf("  → ANSWER: ?\n\n");
    
    printf("Pattern B: arr[n - 1 - idx] = idx\n");
    printf("  → Threads 0,1,2,3,... write to indices n-1,n-2,n-3,...\n");
    printf("  → ANSWER: ?\n\n");
    
    printf("Pattern C: arr[idx * 2] = idx\n");
    printf("  → Threads 0,1,2,3,... write to indices 0,2,4,6,...\n");
    printf("  → ANSWER: ?\n\n");
    
    printf("Pattern D: matrix[row * cols + col] with threadIdx.x → col\n");
    printf("  → Adjacent threads access adjacent columns\n");
    printf("  → ANSWER: ?\n\n");
    
    printf("-------------------------------------------\n");
    printf("ANSWERS:\n");
    printf("A: COALESCED (sequential access)\n");
    printf("B: COALESCED (reverse is still contiguous within warp)\n");
    printf("C: NON-COALESCED (stride of 2, 50%% efficiency)\n");
    printf("D: COALESCED (threadIdx.x maps to fast dimension)\n");
    
    return 0;
}

In [None]:
!nvcc -o exercise_patterns exercise_patterns.cu && ./exercise_patterns

### 🔶 Python/Numba Version (Optional)

In [None]:
# Exercise 1: Identify coalesced vs non-coalesced

# Pattern A
@cuda.jit
def pattern_a(arr, n):
    idx = cuda.grid(1)
    if idx < n:
        arr[idx] = idx  # TODO: Coalesced or not?

# Pattern B
@cuda.jit
def pattern_b(arr, n):
    idx = cuda.grid(1)
    if idx < n:
        arr[n - 1 - idx] = idx  # TODO: Coalesced or not?

# Pattern C
@cuda.jit
def pattern_c(arr, n):
    idx = cuda.grid(1)
    if idx < n // 2:
        arr[idx * 2] = idx  # TODO: Coalesced or not?

# Pattern D
@cuda.jit
def pattern_d(matrix, rows, cols):
    col, row = cuda.grid(2)
    if row < rows and col < cols:
        matrix[row, col] = row + col  # TODO: Coalesced or not?

print("Analyze each pattern and answer:")
print("Pattern A: ?")
print("Pattern B: ?")
print("Pattern C: ?")
print("Pattern D: ?")

# Answers:
# A: Coalesced (sequential access)
# B: Coalesced (reverse sequential is still contiguous within warp)
# C: Non-coalesced (stride of 2)
# D: Coalesced (threadIdx.x maps to col, which is the fast dimension)
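
The answers above can be checked with a small host-side model that needs no GPU. This is a simplified sketch, not how the hardware is specified: it assumes 4-byte elements, an array that starts on a 128-byte boundary, a block at least 32 threads wide in x, and it only looks at the first warp. The helper `segments_touched` and the sizes `n` and `cols` are made up for illustration.

```python
# Simplified model: count the 128-byte segments one 32-thread warp touches.
# Assumptions: 4-byte elements, array base aligned to 128 bytes, first warp only.
WARP_SIZE = 32
ELEM_BYTES = 4
SEGMENT_BYTES = 128

def segments_touched(index_of_thread):
    """Return how many distinct 128-byte segments the first warp touches."""
    segments = set()
    for tid in range(WARP_SIZE):
        byte_addr = index_of_thread(tid) * ELEM_BYTES
        segments.add(byte_addr // SEGMENT_BYTES)
    return len(segments)

n = 1024      # hypothetical array length for patterns A-C
cols = 1024   # hypothetical row length for pattern D (row 0, threadIdx.x -> col)

patterns = {
    "A": lambda tid: tid,              # arr[idx]
    "B": lambda tid: n - 1 - tid,      # arr[n - 1 - idx]
    "C": lambda tid: tid * 2,          # arr[idx * 2]
    "D": lambda tid: 0 * cols + tid,   # matrix[row * cols + col] with row = 0
}

results = {name: segments_touched(f) for name, f in patterns.items()}
for name, count in results.items():
    print(f"Pattern {name}: {count} segment(s) fetched for 128 useful bytes")
```

Patterns that need one segment per warp move only the bytes they use. Pattern C needs two segments for the same 128 useful bytes, which is where the 50% efficiency figure comes from.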

### Exercise 2: Fix the Non-Coalesced Access

The kernel below processes a 2D array but has non-coalesced access. Fix it!

### 🔷 CUDA C++ Version (Primary)

In [None]:
%%writefile fix_coalescing.cu
// fix_coalescing.cu - Fix the non-coalesced access pattern
#include <stdio.h>
#include <cuda_runtime.h>

#define ROWS 2048
#define COLS 2048

// BAD: Non-coalesced access
// Problem: threadIdx.x maps to row, causing strided column access
__global__ void processMatrixBad(const float* input, float* output, int rows, int cols) {
    // This mapping is WRONG for row-major memory!
    int row = blockIdx.x * blockDim.x + threadIdx.x;  // threadIdx.x → row
    int col = blockIdx.y * blockDim.y + threadIdx.y;  // threadIdx.y → col
    
    if (row < rows && col < cols) {
        int idx = row * cols + col;
        output[idx] = input[idx] * 2.0f;
    }
}

// GOOD: Coalesced access
// TODO: Fix the thread-to-index mapping so adjacent threads access adjacent memory
__global__ void processMatrixGood(const float* input, float* output, int rows, int cols) {
    // FIX: threadIdx.x should map to the COLUMN (fast dimension in row-major)
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // threadIdx.x → col
    int row = blockIdx.y * blockDim.y + threadIdx.y;  // threadIdx.y → row
    
    if (row < rows && col < cols) {
        int idx = row * cols + col;
        output[idx] = input[idx] * 2.0f;
    }
}

int main() {
    const size_t bytes = ROWS * COLS * sizeof(float);
    
    float *d_input, *d_output;
    cudaMalloc(&d_input, bytes);
    cudaMalloc(&d_output, bytes);
    
    // Initialize with dummy data
    cudaMemset(d_input, 0, bytes);
    
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    
    printf("=== Exercise 2: Fix Non-Coalesced Access ===\n\n");
    printf("Matrix size: %d x %d\n\n", ROWS, COLS);
    
    // BAD version: threadIdx.x → row (WRONG for row-major!)
    dim3 blocksBad(ROWS / 16, COLS / 16);
    dim3 threadsBad(16, 16);  // x=16 threads for rows
    
    cudaEventRecord(start);
    for (int i = 0; i < 100; i++) {
        processMatrixBad<<<blocksBad, threadsBad>>>(d_input, d_output, ROWS, COLS);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    
    float badTime;
    cudaEventElapsedTime(&badTime, start, stop);
    float badBW = (2 * bytes * 100) / (badTime / 1000) / 1e9;
    
    // GOOD version: threadIdx.x → col (CORRECT for row-major!)
    dim3 blocksGood(COLS / 16, ROWS / 16);
    dim3 threadsGood(16, 16);  // x=16 threads for cols
    
    cudaEventRecord(start);
    for (int i = 0; i < 100; i++) {
        processMatrixGood<<<blocksGood, threadsGood>>>(d_input, d_output, ROWS, COLS);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    
    float goodTime;
    cudaEventElapsedTime(&goodTime, start, stop);
    float goodBW = (2 * bytes * 100) / (goodTime / 1000) / 1e9;
    
    printf("BAD  (threadIdx.x → row): %.2f ms, %.1f GB/s\n", badTime / 100, badBW);
    printf("GOOD (threadIdx.x → col): %.2f ms, %.1f GB/s\n", goodTime / 100, goodBW);
    printf("\nSpeedup: %.1fx\n", badTime / goodTime);
    
    printf("\n-------------------------------------------\n");
    printf("KEY INSIGHT:\n");
    printf("In row-major layout, adjacent columns are adjacent in memory.\n");
    printf("threadIdx.x should map to the COLUMN index for coalescing!\n");
    
    cudaFree(d_input);
    cudaFree(d_output);
    return 0;
}

In [None]:
!nvcc -o fix_coalescing fix_coalescing.cu && ./fix_coalescing

### 🔶 Python/Numba Version (Optional)

In [None]:
# Exercise 2: Fix the access pattern

@cuda.jit
def process_matrix_bad(matrix, output, rows, cols):
    """BAD: Non-coalesced access"""
    # Problem: threadIdx.x maps to row, causing non-coalesced column access
    row, col = cuda.grid(2)  # This mapping is wrong!
    
    if row < rows and col < cols:
        output[row, col] = matrix[row, col] * 2.0

@cuda.jit
def process_matrix_good(matrix, output, rows, cols):
    """TODO: Fix to be coalesced"""
    # TODO: Change the thread-to-index mapping
    row, col = cuda.grid(2)  # FIX THIS LINE
    
    if row < rows and col < cols:
        output[row, col] = matrix[row, col] * 2.0

# Test your fix
# rows, cols = 2048, 2048
# ... benchmark both versions
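
Before timing anything on a GPU, the effect of the two mappings can be estimated on the host. The sketch below is a rough model under stated assumptions: float32 elements, a 2048-column row-major matrix aligned to 128 bytes, and the 16x16 block shape used in the CUDA C++ version of this exercise. `warp_segments` is a hypothetical helper, not part of Numba or CUDA.

```python
# Rough host-side model of Exercise 2: which mapping keeps a warp contiguous?
# Assumptions: float32, row-major, 2048 columns, 16x16 thread block, warp 0 of block (0, 0).
WARP_SIZE = 32
ELEM_BYTES = 4
SEGMENT_BYTES = 128
COLS = 2048
BLOCK_X, BLOCK_Y = 16, 16

def warp_segments(x_maps_to_col):
    """Count 128-byte segments touched by warp 0 of block (0, 0)."""
    segments = set()
    for lane in range(WARP_SIZE):
        # CUDA linearizes threads with threadIdx.x varying fastest.
        tx, ty = lane % BLOCK_X, lane // BLOCK_X
        row, col = (ty, tx) if x_maps_to_col else (tx, ty)
        byte_addr = (row * COLS + col) * ELEM_BYTES
        segments.add(byte_addr // SEGMENT_BYTES)
    return len(segments)

bad = warp_segments(x_maps_to_col=False)
good = warp_segments(x_maps_to_col=True)
print(f"threadIdx.x -> row: {bad} segments per warp")
print(f"threadIdx.x -> col: {good} segments per warp")
```

With a 16-wide block, even the good mapping touches two segments, because each warp covers two matrix rows. In Numba, `cuda.grid(2)` returns the x index first, so the order in which you unpack it decides which mapping you get.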

### Exercise 3: Advanced Coalescing Challenges

### 🔷 CUDA C++ Exercises (Primary)

Complete these coalescing exercises to solidify your understanding.

In [None]:
%%writefile coalescing_exercises.cu
// coalescing_exercises.cu - Advanced Memory Coalescing Exercises
#include <stdio.h>
#include <cuda_runtime.h>

#define CHECK_CUDA(call) { \
    cudaError_t err = call; \
    if (err != cudaSuccess) { \
        printf("CUDA Error: %s at line %d\n", cudaGetErrorString(err), __LINE__); \
        exit(1); \
    } \
}

//=============================================================================
// EXERCISE 1: Analyze Access Patterns
// Determine if each kernel has coalesced memory access
//=============================================================================

// Pattern A: Direct indexing
__global__ void patternDirect(float* out, const float* in, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        out[idx] = in[idx] * 2.0f;  // Coalesced? _____
    }
}

// Pattern B: Offset indexing
__global__ void patternOffset(float* out, const float* in, int n, int offset) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int offsetIdx = idx + offset;
    if (offsetIdx < n) {
        out[offsetIdx] = in[offsetIdx] * 2.0f;  // Coalesced? _____
    }
}

// Pattern C: Warp-strided indexing
__global__ void patternWarpStrided(float* out, const float* in, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int stridedIdx = (idx / 32) * 64 + (idx % 32);  // Skip every other warp-sized chunk
    if (stridedIdx < n) {
        out[stridedIdx] = in[stridedIdx] * 2.0f;  // Coalesced? _____
    }
}

// Pattern D: Interleaved access
__global__ void patternInterleaved(float* out, const float* in, int n, int numArrays) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    // Access pattern: thread 0->arr0[0], thread 1->arr1[0], thread 2->arr2[0]...
    int arrayIdx = idx % numArrays;
    int elemIdx = idx / numArrays;
    int actualIdx = arrayIdx * (n / numArrays) + elemIdx;
    if (actualIdx < n) {
        out[actualIdx] = in[actualIdx] * 2.0f;  // Coalesced? _____
    }
}

//=============================================================================
// EXERCISE 2: Fix Non-Coalesced Access
// Rewrite these kernels to achieve coalesced memory access
//=============================================================================

// BAD: Column-major access in row-major array
__global__ void copyColumnMajorBad(float* dst, const float* src, int rows, int cols) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    int col = blockIdx.y * blockDim.y + threadIdx.y;
    
    if (row < rows && col < cols) {
        // Adjacent threads (in x) access rows -> non-coalesced!
        dst[row * cols + col] = src[row * cols + col];
    }
}

// TODO: Fix this kernel for coalesced access
__global__ void copyColumnMajorFixed(float* dst, const float* src, int rows, int cols) {
    // HINT: Map threadIdx.x to columns instead of rows
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // Fixed: x -> columns
    int row = blockIdx.y * blockDim.y + threadIdx.y;  // Fixed: y -> rows
    
    if (row < rows && col < cols) {
        dst[row * cols + col] = src[row * cols + col];
    }
}

//=============================================================================
// EXERCISE 3: Structure of Arrays (SoA) vs Array of Structures (AoS)
// Compare memory access patterns
//=============================================================================

// Array of Structures (AoS) - Poor coalescing
struct ParticleAoS {
    float x, y, z;
    float vx, vy, vz;
    float mass;
    int id;  // 7 floats + 1 int = 8 x 4 bytes = 32 bytes per particle
};

// Structure of Arrays (SoA) - Better coalescing
struct ParticlesSoA {
    float* x;
    float* y;
    float* z;
    float* vx;
    float* vy;
    float* vz;
    float* mass;
    int* id;
};

// BAD: AoS access pattern
__global__ void updateParticlesAoS(ParticleAoS* particles, int n, float dt) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        // Each thread accesses 32-byte struct, stride = 32 bytes between threads
        particles[idx].x += particles[idx].vx * dt;
        particles[idx].y += particles[idx].vy * dt;
        particles[idx].z += particles[idx].vz * dt;
    }
}

// GOOD: SoA access pattern
__global__ void updateParticlesSoA(float* x, float* y, float* z,
                                    float* vx, float* vy, float* vz,
                                    int n, float dt) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        // Adjacent threads access adjacent memory locations
        x[idx] += vx[idx] * dt;
        y[idx] += vy[idx] * dt;
        z[idx] += vz[idx] * dt;
    }
}

//=============================================================================
// EXERCISE 4: Benchmark Coalesced vs Non-Coalesced
//=============================================================================

__global__ void coalescedSum(float* output, const float* input, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        output[idx] = input[idx] + 1.0f;
    }
}

__global__ void nonCoalescedSum(float* output, const float* input, int n, int stride) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int stridedIdx = idx * stride;
    if (stridedIdx < n) {
        output[stridedIdx] = input[stridedIdx] + 1.0f;
    }
}

int main() {
    printf("=== Memory Coalescing Exercises ===\n\n");
    
    //-------------------------------------------------------------------------
    // Exercise 1: Access Pattern Analysis
    //-------------------------------------------------------------------------
    printf("EXERCISE 1: Access Pattern Analysis\n");
    printf("===================================\n");
    printf("Analyze each pattern and determine if it's coalesced:\n\n");
    
    printf("Pattern A (Direct): out[idx] = in[idx]\n");
    printf("  → Adjacent threads access adjacent memory\n");
    printf("  → ANSWER: COALESCED ✓\n\n");
    
    printf("Pattern B (Offset): out[idx + offset] = in[idx + offset]\n");
    printf("  → All threads offset by same amount, still contiguous\n");
    printf("  → ANSWER: COALESCED ✓ (if offset is aligned)\n\n");
    
    printf("Pattern C (Warp-strided): Skips every other 32-element chunk\n");
    printf("  → Within a warp, threads still access contiguous memory\n");
    printf("  → ANSWER: COALESCED ✓ (per warp)\n\n");
    
    printf("Pattern D (Interleaved): Threads scattered across arrays\n");
    printf("  → Adjacent threads access non-adjacent memory\n");
    printf("  → ANSWER: NON-COALESCED ✗\n\n");
    
    //-------------------------------------------------------------------------
    // Exercise 4: Benchmark (the Exercise 2 copy kernels are not timed here)
    //-------------------------------------------------------------------------
    printf("EXERCISE 4: Performance Benchmark\n");
    printf("=================================\n");
    
    const int N = 1 << 22;  // 4M elements
    const int bytes = N * sizeof(float);
    
    float *d_input, *d_output;
    CHECK_CUDA(cudaMalloc(&d_input, bytes));
    CHECK_CUDA(cudaMalloc(&d_output, bytes));
    
    // Initialize
    float* h_input = (float*)malloc(bytes);
    for (int i = 0; i < N; i++) h_input[i] = 1.0f;
    CHECK_CUDA(cudaMemcpy(d_input, h_input, bytes, cudaMemcpyHostToDevice));
    
    int blockSize = 256;
    int numBlocks = (N + blockSize - 1) / blockSize;
    
    cudaEvent_t start, stop;
    CHECK_CUDA(cudaEventCreate(&start));
    CHECK_CUDA(cudaEventCreate(&stop));
    
    // Benchmark coalesced access
    CHECK_CUDA(cudaEventRecord(start));
    for (int i = 0; i < 100; i++) {
        coalescedSum<<<numBlocks, blockSize>>>(d_output, d_input, N);
    }
    CHECK_CUDA(cudaEventRecord(stop));
    CHECK_CUDA(cudaEventSynchronize(stop));
    
    float coalescedTime;
    CHECK_CUDA(cudaEventElapsedTime(&coalescedTime, start, stop));
    
    // Benchmark non-coalesced (stride = 2)
    CHECK_CUDA(cudaEventRecord(start));
    for (int i = 0; i < 100; i++) {
        nonCoalescedSum<<<numBlocks, blockSize>>>(d_output, d_input, N, 2);
    }
    CHECK_CUDA(cudaEventRecord(stop));
    CHECK_CUDA(cudaEventSynchronize(stop));
    
    float stride2Time;
    CHECK_CUDA(cudaEventElapsedTime(&stride2Time, start, stop));
    
    // Benchmark non-coalesced (stride = 32)
    CHECK_CUDA(cudaEventRecord(start));
    for (int i = 0; i < 100; i++) {
        nonCoalescedSum<<<numBlocks/32, blockSize>>>(d_output, d_input, N, 32);
    }
    CHECK_CUDA(cudaEventRecord(stop));
    CHECK_CUDA(cudaEventSynchronize(stop));
    
    float stride32Time;
    CHECK_CUDA(cudaEventElapsedTime(&stride32Time, start, stop));
    
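    // The strided runs touch fewer elements (N/2 and N/32), so raw times are not
    // comparable. The slowdown below is normalized to time per element touched.
    printf("\nPerformance Results (100 iterations, slowdown per element touched):\n");
    printf("  Coalesced (stride=1):     %.2f ms\n", coalescedTime);
    printf("  Non-coalesced (stride=2): %.2f ms (%.1fx slower)\n", 
           stride2Time, 2.0f * stride2Time / coalescedTime);
    printf("  Non-coalesced (stride=32):%.2f ms (%.1fx slower)\n", 
           stride32Time, 32.0f * stride32Time / coalescedTime);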
    
    // Calculate bandwidth
    float coalescedBW = (2.0f * bytes * 100) / (coalescedTime / 1000.0f) / 1e9;
    printf("\nEffective Bandwidth:\n");
    printf("  Coalesced: %.1f GB/s\n", coalescedBW);
    
    //-------------------------------------------------------------------------
    // Exercise 3: SoA vs AoS
    //-------------------------------------------------------------------------
    printf("\nEXERCISE 3: SoA vs AoS Memory Layout\n");
    printf("====================================\n");
    printf("Array of Structures (AoS):\n");
    printf("  Memory: [x0,y0,z0,vx0,vy0,vz0,m0,id0][x1,y1,z1,...]\n");
    printf("  Thread 0 accesses byte 0, Thread 1 accesses byte 32\n");
    printf("  → NON-COALESCED (32-byte stride)\n\n");
    
    printf("Structure of Arrays (SoA):\n");
    printf("  Memory: [x0,x1,x2,...][y0,y1,y2,...][z0,z1,z2,...]\n");
    printf("  Thread 0 accesses x[0], Thread 1 accesses x[1]\n");
    printf("  → COALESCED (4-byte stride for floats)\n\n");
    
    printf("Recommendation: Use SoA layout for GPU-intensive code!\n");
    
    // Cleanup
    CHECK_CUDA(cudaFree(d_input));
    CHECK_CUDA(cudaFree(d_output));
    CHECK_CUDA(cudaEventDestroy(start));
    CHECK_CUDA(cudaEventDestroy(stop));
    free(h_input);
    
    printf("\n=== Exercises Complete ===\n");
    return 0;
}

In [None]:
!nvcc -arch=sm_75 -o coalescing_exercises coalescing_exercises.cu && ./coalescing_exercises

### 🔶 Python/Numba Version (Optional)

The exercises above cover the key coalescing concepts. For quick validation, you can use the Python/Numba kernels from earlier exercises.
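
For the AoS vs SoA comparison in particular, a host-only sketch can show why the layouts differ without launching a kernel. It assumes the `ParticleAoS` field layout from the CUDA C++ cell (7 floats followed by 1 int, no padding), 128-byte segments, and arrays aligned to 128 bytes. `segments_for_stride` is an illustrative helper.

```python
import struct

WARP_SIZE = 32
SEGMENT_BYTES = 128
# Same field layout as ParticleAoS: 7 floats followed by 1 int.
PARTICLE_BYTES = struct.calcsize("=7fi")
FLOAT_BYTES = struct.calcsize("=f")

def segments_for_stride(stride_bytes):
    """Segments touched when lane i reads one float at byte offset i * stride_bytes."""
    return len({(lane * stride_bytes) // SEGMENT_BYTES for lane in range(WARP_SIZE)})

aos_segments = segments_for_stride(PARTICLE_BYTES)  # particles[idx].x
soa_segments = segments_for_stride(FLOAT_BYTES)     # x[idx]

print(f"struct size: {PARTICLE_BYTES} bytes")
print(f"AoS read of .x: {aos_segments} segments per warp")
print(f"SoA read of x[]: {soa_segments} segment per warp")
```

This counts a single field read in isolation. In the `updateParticlesAoS` kernel, six of the eight fields are used, so caches can recover part of the cost. The gap is widest when a kernel touches only one or two fields of a large struct.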

---

## 🎓 Summary: What You Learned Today

<details open>
<summary><b>📋 Quick Reference Card</b></summary>

### The Golden Rule
> **Adjacent threads should access adjacent memory addresses.**

### The Delivery Truck Analogy
- Memory controller delivers in 128-byte packages (like a truck with minimum box size)
- Scattered requests = multiple trips = wasted bandwidth
- Contiguous requests = one trip = maximum efficiency

### Access Pattern Cheat Sheet
| Do This ✅ | Not This ❌ |
|-----------|------------|
| `arr[threadIdx.x]` | `arr[threadIdx.x * stride]` |
| `matrix[row][col]` with col = threadIdx.x | `matrix[row][col]` with row = threadIdx.x |
| Structure of Arrays (SoA) | Array of Structures (AoS) |
| Sequential access | Random/scattered access |

### Performance Impact
| Access Pattern | Typical Bandwidth (GPU-dependent) | Efficiency |
|---------------|-------------------|------------|
| Perfectly coalesced | 200-400 GB/s | ~100% |
| Stride of 2 | 100-200 GB/s | ~50% |
| Stride of 32 | 10-30 GB/s | ~3-10% |

</details>
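
The efficiency column in the reference card can be approximated with a simple ratio: useful bytes divided by bytes fetched. The sketch below is a model under assumptions, not a measurement. It uses 4-byte elements and an aligned array, and it evaluates two fetch granularities: the 128-byte figure used in this notebook, and a 32-byte sector size, which is an assumption about newer architectures that you should check for your GPU.

```python
WARP_SIZE = 32
ELEM_BYTES = 4

def efficiency(stride, granularity_bytes=128):
    """Useful bytes / fetched bytes for one warp reading arr[lane * stride]."""
    chunks = {(lane * stride * ELEM_BYTES) // granularity_bytes
              for lane in range(WARP_SIZE)}
    useful = WARP_SIZE * ELEM_BYTES
    fetched = len(chunks) * granularity_bytes
    return useful / fetched

for stride in (1, 2, 32):
    print(f"stride {stride:2d}: {efficiency(stride):6.1%} at 128 B, "
          f"{efficiency(stride, 32):6.1%} at 32 B")
```

Under this model, stride 32 lands at about 3% with 128-byte fetches and about 12% with 32-byte fetches, which roughly brackets the range quoted in the table.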

---

### 🔑 Three Things to Remember

1. **The WHY**: GPU memory is optimized for bulk transfers. Scattered access wastes bandwidth because each request still fetches 128 bytes.

2. **The RULE**: Map `threadIdx.x` (the fast-changing dimension) to the last array index (columns in row-major C/NumPy arrays).

3. **THE LIMITATION**: Some patterns (like transpose) can't be fixed by remapping threads alone, because either the reads or the writes end up strided. That's why we need shared memory (Day 2).
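
The transpose limitation in point 3 can be illustrated with the same kind of host-side counting. This sketch assumes a naive transpose of the form `out[col * ROWS + row] = in[row * COLS + col]` on a 2048x2048 float32 matrix, with a warp spanning 32 consecutive `threadIdx.x` values. The function names are hypothetical.

```python
WARP_SIZE = 32
ELEM_BYTES = 4
SEGMENT_BYTES = 128
ROWS = COLS = 2048

def count_segments(indices):
    """Distinct 128-byte segments covered by a list of element indices."""
    return len({(i * ELEM_BYTES) // SEGMENT_BYTES for i in indices})

def transpose_traffic(x_maps_to_col):
    """Segments one warp reads and writes for out[col*ROWS + row] = in[row*COLS + col]."""
    reads, writes = [], []
    for lane in range(WARP_SIZE):
        row, col = (0, lane) if x_maps_to_col else (lane, 0)
        reads.append(row * COLS + col)
        writes.append(col * ROWS + row)
    return count_segments(reads), count_segments(writes)

for x_maps_to_col in (True, False):
    r, w = transpose_traffic(x_maps_to_col)
    target = "col" if x_maps_to_col else "row"
    print(f"threadIdx.x -> {target}: {r} read segment(s), {w} write segment(s)")
```

Swapping the mapping only moves the strided side from the writes to the reads. That is the gap a shared-memory tile is meant to close.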

---

### 📚 What's Next?

**Day 2: Shared Memory** - You'll learn:
- How to use shared memory as a fast "scratch pad"
- The tiled algorithm pattern
- How to fix the transpose problem (10x → 1.1x overhead!)
- Thread synchronization with `__syncthreads()`

---

### 🔗 Deep Dive Resources
- [Device Memory Access](../../cuda-programming-guide/03-advanced/device-memory-access.md)
- [Performance Optimization](../../cuda-programming-guide/03-advanced/performance-optimization.md)
- [NVIDIA Memory Coalescing Guide](https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#coalesced-access-to-global-memory)