# 🚀 Day 1: Memory Coalescing - The Key to GPU Performance

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sdodlapati3/cuda-lab/blob/main/learning-path/week-02/day-1-memory-coalescing.ipynb)

## Learning Philosophy

> **CUDA C++ First, Python/Numba as Optional Backup**

This notebook shows:
1. **CUDA C++ code** - The PRIMARY implementation you should learn
2. **Python/Numba code** - OPTIONAL for quick interactive testing in Colab

---

In [None]:
# ⚙️ Colab/Local Setup - Run this first!
# Python/Numba is OPTIONAL - for quick interactive testing only
import subprocess, sys
try:
    import google.colab
    print("🔧 Running on Google Colab - Installing dependencies...")
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", "numba"])
    print("✅ Setup complete!")
except ImportError:
    print("💻 Running locally - make sure you have: pip install numba numpy")

import numpy as np
from numba import cuda
import math
import time

print("\n⚠️  Remember: CUDA C++ code is the PRIMARY learning material!")

# Day 1: Memory Coalescing - The Key to GPU Performance

Memory bandwidth is the #1 bottleneck in most GPU programs. Today you'll learn:
- How GPUs access global memory
- What memory coalescing means
- How to write coalesced access patterns
- Measuring the performance impact

---

## 1. How GPU Memory Access Works

When a warp (32 threads) accesses global memory, the hardware:
1. Collects all memory addresses from all threads
2. Groups them into **memory transactions** (32, 64, or 128 bytes)
3. Fetches data in as few transactions as possible
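
The three steps above can be modeled in a few lines of plain Python. This is a simplified sketch rather than a description of any particular GPU: it assumes fixed-size, aligned segments and ignores caches. The names `count_transactions` and `segment_bytes` are made up for illustration.

```python
WARP_SIZE = 32
FLOAT_BYTES = 4

def count_transactions(byte_addresses, segment_bytes=128):
    """Count the distinct aligned segments touched by one warp's addresses."""
    segments = {addr // segment_bytes for addr in byte_addresses}
    return len(segments)

# Coalesced: thread t reads float element t.
coalesced = [t * FLOAT_BYTES for t in range(WARP_SIZE)]

# Strided: thread t reads float element 32 * t.
strided = [t * 32 * FLOAT_BYTES for t in range(WARP_SIZE)]

print(count_transactions(coalesced))  # 1
print(count_transactions(strided))    # 32
```

The same 32 loads cost one segment in the first case and 32 segments in the second, which is the difference the rest of this notebook measures.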

### 🔷 CUDA C++ Implementation (Primary)

```
COALESCED ACCESS (Good):
Thread:    0    1    2    3   ...  31
Address:  [0]  [1]  [2]  [3] ... [31]
           └────────────────────────┘
                ONE 128-byte transaction!

NON-COALESCED ACCESS (Bad):
Thread:    0    1    2    3   ...  31
Address:  [0] [32] [64] [96] ...[992]
           │    │    │    │       │
           └────┴────┴────┴───────┘
           32 separate transactions! (up to 32x slower)
```

In [None]:
%%writefile coalescing_demo.cu
#include <stdio.h>
#include <cuda_runtime.h>

// GOOD: Coalesced access - adjacent threads access adjacent memory
__global__ void coalescedCopy(const float* src, float* dst, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        dst[idx] = src[idx];  // Thread 0→addr 0, Thread 1→addr 1, ...
    }
}

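// BAD: Strided access - adjacent threads access memory `stride` elements apart.
// Every thread still copies exactly one element, so the total work matches
// coalescedCopy and the timings are comparable. Assumes n is a multiple of stride.
__global__ void stridedCopy(const float* src, float* dst, int n, int stride) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        long long j = (long long)idx * stride;
        // Thread 0→addr 0, Thread 1→addr stride, ...; shift by one element after each pass
        int strided_idx = (int)(j % n + j / n);
        dst[strided_idx] = src[strided_idx];
    }
}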

int main() {
    const int N = 1 << 20;  // 1M elements
    const int bytes = N * sizeof(float);
    
    // Allocate host memory
    float *h_src = (float*)malloc(bytes);
    float *h_dst = (float*)malloc(bytes);
    
    // Initialize source data
    for (int i = 0; i < N; i++) {
        h_src[i] = (float)i;
    }
    
    // Allocate device memory
    float *d_src, *d_dst;
    cudaMalloc(&d_src, bytes);
    cudaMalloc(&d_dst, bytes);
    
    // Copy to device
    cudaMemcpy(d_src, h_src, bytes, cudaMemcpyHostToDevice);
    
    // Launch coalesced kernel
    int threads = 256;
    int blocks = (N + threads - 1) / threads;
    
    printf("Testing coalesced vs strided access patterns:\n");
    printf("Array size: %d elements\n\n", N);
    
    // Time coalesced access
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    
    cudaEventRecord(start);
    for (int i = 0; i < 100; i++) {
        coalescedCopy<<<blocks, threads>>>(d_src, d_dst, N);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    
    float coalesced_ms;
    cudaEventElapsedTime(&coalesced_ms, start, stop);
    printf("Coalesced access (stride=1): %.3f ms (avg per iteration)\n", coalesced_ms / 100);
    
    // Time strided access with different strides
    for (int stride = 2; stride <= 32; stride *= 2) {
        cudaMemset(d_dst, 0, bytes);
        
        cudaEventRecord(start);
        for (int i = 0; i < 100; i++) {
            stridedCopy<<<blocks, threads>>>(d_src, d_dst, N, stride);
        }
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        
        float strided_ms;
        cudaEventElapsedTime(&strided_ms, start, stop);
        printf("Strided access (stride=%d): %.3f ms (%.1fx slower)\n", 
               stride, strided_ms / 100, strided_ms / coalesced_ms);
    }
    
    // Cleanup
    cudaFree(d_src);
    cudaFree(d_dst);
    free(h_src);
    free(h_dst);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    
    return 0;
}

In [None]:
!nvcc -arch=sm_75 -o coalescing_demo coalescing_demo.cu
!./coalescing_demo

### 🔶 Python/Numba (Optional - Quick Testing)

In [None]:
import numpy as np
from numba import cuda
import math
import time

print("CUDA device:", cuda.get_current_device().name.decode())

# Basic device properties (warp size and block-size limit)
device = cuda.get_current_device()
print(f"Warp size: {device.WARP_SIZE}")
print(f"Max threads per block: {device.MAX_THREADS_PER_BLOCK}")

## 2. The Coalescing Rule

### The Golden Rule:
**Adjacent threads (within a warp) should access adjacent memory locations.**

### Transaction Sizes:
- 32 bytes (8 × float32)
- 64 bytes (16 × float32)
- 128 bytes (32 × float32) ← ideal for a warp!

### Good vs Bad Patterns:

| Pattern | Example | Coalesced? |
|---------|---------|------------|
| Sequential | `arr[threadIdx.x]` | ✅ Yes |
| Strided | `arr[threadIdx.x * stride]` | ❌ No (if stride > 1) |
| Random | `arr[random_index]` | ❌ No |
| Row-major 2D | `arr[row][col]` with col = threadIdx.x | ✅ Yes |
| Column-major 2D | `arr[row][col]` with row = threadIdx.x | ❌ No |
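
The table rows can be turned into numbers by asking what fraction of the transferred bytes a warp actually uses. The sketch below is an assumption-laden model, not a profiler: it takes a 32-byte transfer granularity, float32 elements, and a hypothetical row length `COLS = 4096`. The helper `warp_efficiency` is an illustrative name, and the random pattern is left out because its result depends on the indices drawn.

```python
WARP_SIZE = 32
ELEM_BYTES = 4      # float32
SECTOR_BYTES = 32   # assumed transfer granularity
COLS = 4096         # hypothetical row length of a row-major matrix

def warp_efficiency(index_of_thread):
    """Fraction of transferred bytes that the warp actually requested."""
    addresses = [index_of_thread(t) * ELEM_BYTES for t in range(WARP_SIZE)]
    sectors = {addr // SECTOR_BYTES for addr in addresses}
    useful_bytes = len(set(addresses)) * ELEM_BYTES
    return useful_bytes / (len(sectors) * SECTOR_BYTES)

patterns = {
    "sequential":   lambda t: t,              # arr[threadIdx.x]
    "stride 2":     lambda t: 2 * t,          # arr[threadIdx.x * 2]
    "row-major":    lambda t: 5 * COLS + t,   # row fixed at 5, col = threadIdx.x
    "column-major": lambda t: t * COLS + 5,   # col fixed at 5, row = threadIdx.x
}

for name, index_fn in patterns.items():
    print(f"{name:<13} {warp_efficiency(index_fn):.3f}")
```

Under these assumptions the sequential and row-major patterns use every byte they transfer, stride 2 uses half, and the column-major pattern uses one eighth.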

### 🔷 CUDA C++ 2D Access Patterns (Primary)

In [None]:
%%writefile coalescing_2d.cu
#include <stdio.h>
#include <cuda_runtime.h>

// GOOD: Row-major access - threads in a warp access adjacent columns
__global__ void rowMajorAccess(float* matrix, float* output, int rows, int cols) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // Fast dimension
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    
    if (row < rows && col < cols) {
        int idx = row * cols + col;  // Row-major indexing
        output[idx] = matrix[idx] * 2.0f;
    }
}

// BAD: Column-major access - threads in a warp access scattered rows
__global__ void colMajorAccess(float* matrix, float* output, int rows, int cols) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;  // Wrong! Fast dimension on rows
    int col = blockIdx.y * blockDim.y + threadIdx.y;
    
    if (row < rows && col < cols) {
        int idx = row * cols + col;
        output[idx] = matrix[idx] * 2.0f;
    }
}

int main() {
    const int ROWS = 4096;
    const int COLS = 4096;
    const int SIZE = ROWS * COLS;
    const size_t bytes = SIZE * sizeof(float);
    
    float *d_matrix, *d_output;
    cudaMalloc(&d_matrix, bytes);
    cudaMalloc(&d_output, bytes);
    
    // Initialize
    float* h_matrix = (float*)malloc(bytes);
    for (int i = 0; i < SIZE; i++) h_matrix[i] = 1.0f;
    cudaMemcpy(d_matrix, h_matrix, bytes, cudaMemcpyHostToDevice);
    
    dim3 threads(16, 16);
    dim3 blocks_row((COLS + 15) / 16, (ROWS + 15) / 16);
    dim3 blocks_col((ROWS + 15) / 16, (COLS + 15) / 16);
    
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    
    printf("=== 2D Access Pattern Benchmark ===\n");
    printf("Matrix: %d x %d (%.1f MB)\n\n", ROWS, COLS, bytes / 1e6);
    
    // Benchmark row-major (coalesced)
    cudaEventRecord(start);
    for (int i = 0; i < 100; i++) {
        rowMajorAccess<<<blocks_row, threads>>>(d_matrix, d_output, ROWS, COLS);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    
    float row_ms;
    cudaEventElapsedTime(&row_ms, start, stop);
    float row_bw = (2 * bytes * 100) / (row_ms / 1000) / 1e9;
    
    // Benchmark column-major (non-coalesced)
    cudaEventRecord(start);
    for (int i = 0; i < 100; i++) {
        colMajorAccess<<<blocks_col, threads>>>(d_matrix, d_output, ROWS, COLS);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    
    float col_ms;
    cudaEventElapsedTime(&col_ms, start, stop);
    float col_bw = (2 * bytes * 100) / (col_ms / 1000) / 1e9;
    
    printf("Row-major (coalesced):    %.2f ms, %.1f GB/s\n", row_ms / 100, row_bw);
    printf("Column-major (strided):   %.2f ms, %.1f GB/s\n", col_ms / 100, col_bw);
    printf("Speedup from coalescing:  %.2fx\n", col_ms / row_ms);
    
    cudaFree(d_matrix);
    cudaFree(d_output);
    free(h_matrix);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    
    return 0;
}

In [None]:
!nvcc -o coalescing_2d coalescing_2d.cu && ./coalescing_2d

## 3. Demonstrating Coalesced vs Non-Coalesced Access

Let's create two kernels that do the same work but with different access patterns:

In [None]:
@cuda.jit
def coalesced_copy(src, dst, n):
    """
    COALESCED: Adjacent threads access adjacent elements.
    Thread 0 → src[0], Thread 1 → src[1], ...
    """
    idx = cuda.grid(1)
    if idx < n:
        dst[idx] = src[idx]

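@cuda.jit
def strided_copy(src, dst, n, stride):
    """
    NON-COALESCED: Threads access memory with stride.
    Thread 0 → src[0], Thread 1 → src[stride], Thread 2 → src[2*stride]...
    Every thread copies one element, so the total work (and the bandwidth
    formula below) matches coalesced_copy. Assumes n is a multiple of stride.
    """
    tid = cuda.grid(1)
    if tid < n:
        # Strided index; after each pass over the array, shift by one element
        j = tid * stride
        idx = j % n + j // n
        dst[idx] = src[idx]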

def benchmark_access_pattern(n, stride=1, iterations=100):
    """Benchmark different access patterns"""
    src = cuda.to_device(np.random.randn(n).astype(np.float32))
    dst = cuda.device_array(n, dtype=np.float32)
    
    threads = 256
    blocks = math.ceil(n / threads)
    
    # Warmup
    if stride == 1:
        coalesced_copy[blocks, threads](src, dst, n)
    else:
        strided_copy[blocks, threads](src, dst, n, stride)
    cuda.synchronize()
    
    # Benchmark
    start = time.perf_counter()
    for _ in range(iterations):
        if stride == 1:
            coalesced_copy[blocks, threads](src, dst, n)
        else:
            strided_copy[blocks, threads](src, dst, n, stride)
    cuda.synchronize()
    elapsed = (time.perf_counter() - start) / iterations
    
    # Calculate bandwidth
    bytes_transferred = 2 * n * 4  # Read + write, float32
    bandwidth = bytes_transferred / elapsed / 1e9
    
    return elapsed * 1000, bandwidth

# Compare patterns
n = 100_000_000  # 100M elements

print(f"Array size: {n:,} elements ({n * 4 / 1e9:.2f} GB)")
print("=" * 60)
print(f"{'Pattern':<25} | {'Time (ms)':<12} | {'Bandwidth (GB/s)'}")
print("-" * 60)

time_coal, bw_coal = benchmark_access_pattern(n, stride=1)
print(f"{'Coalesced (stride=1)':<25} | {time_coal:<12.3f} | {bw_coal:.1f}")

for stride in [2, 4, 8, 16, 32]:
    time_s, bw_s = benchmark_access_pattern(n, stride=stride)
    slowdown = time_s / time_coal
    print(f"{'Strided (stride=' + str(stride) + ')':<25} | {time_s:<12.3f} | {bw_s:.1f} ({slowdown:.1f}x slower)")

## 4. 2D Array Access Patterns

For 2D arrays (matrices, images), the access pattern matters even more!

```
Row-Major Layout (C/NumPy default):
┌─────────────────────────────┐
│ (0,0) (0,1) (0,2) (0,3) ... │ Row 0 (contiguous in memory)
│ (1,0) (1,1) (1,2) (1,3) ... │ Row 1
│ (2,0) (2,1) (2,2) (2,3) ... │ Row 2
└─────────────────────────────┘

Memory: [0,0][0,1][0,2][0,3]...[1,0][1,1][1,2]...

✅ COALESCED: Threads read along a row (vary column)
   Thread 0 → (row, 0), Thread 1 → (row, 1), ...

❌ NON-COALESCED: Threads read down a column (vary row)
   Thread 0 → (0, col), Thread 1 → (1, col), ...
```
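
The kernels in this notebook launch 16×16 blocks, so a warp is not a single row of 32 threads. CUDA numbers the threads of a block with `threadIdx.x` varying fastest, which means one warp covers two rows of 16 threads. The sketch below lists the row-major offsets touched by the first warp of block (0, 0) under both mappings. The helper names and `COLS = 4096` are illustrative assumptions.

```python
BLOCK_X = 16
WARP_SIZE = 32
COLS = 4096  # row length of the row-major matrix

def first_warp_threads():
    """(threadIdx.x, threadIdx.y) pairs of warp 0; x varies fastest."""
    return [(lid % BLOCK_X, lid // BLOCK_X) for lid in range(WARP_SIZE)]

def offsets(x_maps_to_col):
    """Row-major element offsets touched by warp 0 of block (0, 0)."""
    result = []
    for tx, ty in first_warp_threads():
        row, col = (ty, tx) if x_maps_to_col else (tx, ty)
        result.append(row * COLS + col)
    return result

def contiguous_runs(offs):
    """Number of maximal runs of consecutive offsets."""
    ordered = sorted(offs)
    return 1 + sum(1 for a, b in zip(ordered, ordered[1:]) if b != a + 1)

good = offsets(x_maps_to_col=True)    # threadIdx.x -> col
bad = offsets(x_maps_to_col=False)    # threadIdx.x -> row

print(contiguous_runs(good))  # 2 runs of 16 floats each
print(contiguous_runs(bad))   # 16 runs of 2 floats each
```

With `threadIdx.x` mapped to the column, the warp reads two 64-byte runs. With it mapped to the row, the same warp touches 16 different matrix rows and reads only two adjacent floats from each.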

In [None]:
@cuda.jit
def row_major_read(matrix, output, rows, cols):
    """
    COALESCED: Each warp reads along a row.
    threadIdx.x corresponds to column (fast-changing dimension)
    """
    col, row = cuda.grid(2)
    
    if row < rows and col < cols:
        # Reading matrix[row][col] - threads in a warp read adjacent columns
        output[row, col] = matrix[row, col] * 2.0

@cuda.jit
def col_major_read(matrix, output, rows, cols):
    """
    NON-COALESCED: Each warp reads down a column.
    threadIdx.x corresponds to row (slow-changing dimension)
    """
    row, col = cuda.grid(2)  # Note: swapped!
    
    if row < rows and col < cols:
        # Same operation, but different thread mapping
        output[row, col] = matrix[row, col] * 2.0

def benchmark_2d_pattern(rows, cols, pattern='row', iterations=100):
    """Benchmark 2D access patterns"""
    matrix = cuda.to_device(np.random.randn(rows, cols).astype(np.float32))
    output = cuda.device_array((rows, cols), dtype=np.float32)
    
    threads = (16, 16)  # 256 threads per block
    
    if pattern == 'row':
        blocks = (math.ceil(cols / 16), math.ceil(rows / 16))
        kernel = row_major_read
    else:
        blocks = (math.ceil(rows / 16), math.ceil(cols / 16))
        kernel = col_major_read
    
    # Warmup
    kernel[blocks, threads](matrix, output, rows, cols)
    cuda.synchronize()
    
    # Benchmark
    start = time.perf_counter()
    for _ in range(iterations):
        kernel[blocks, threads](matrix, output, rows, cols)
    cuda.synchronize()
    elapsed = (time.perf_counter() - start) / iterations
    
    bytes_transferred = 2 * rows * cols * 4
    bandwidth = bytes_transferred / elapsed / 1e9
    
    return elapsed * 1000, bandwidth

# Benchmark 2D patterns
rows, cols = 4096, 4096  # 64MB matrix

print(f"Matrix size: {rows} × {cols} ({rows * cols * 4 / 1e6:.1f} MB)")
print("=" * 55)

time_row, bw_row = benchmark_2d_pattern(rows, cols, 'row')
time_col, bw_col = benchmark_2d_pattern(rows, cols, 'col')

print(f"Row-major (coalesced):     {time_row:.3f} ms, {bw_row:.1f} GB/s")
print(f"Column-major (strided):    {time_col:.3f} ms, {bw_col:.1f} GB/s")
print(f"\nSpeedup from coalescing: {time_col/time_row:.2f}x")

## 5. Matrix Transpose: A Classic Coalescing Problem

Matrix transpose is tricky because:
- Reading rows (coalesced) means writing columns (non-coalesced)
- Reading columns (non-coalesced) means writing rows (coalesced)

You can't win with a naive approach! (We'll fix this with shared memory on Day 2.)
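
The trade-off can be checked by counting the 32-byte sectors that one warp touches on each side of a transpose. This is a rough model under stated assumptions: a 4096×4096 float32 matrix, 16×16 thread blocks, and 32-byte sectors. The function names are illustrative and only mirror the two index formulas (`row * COLS + col` and `col * ROWS + row`).

```python
ROWS = COLS = 4096
ELEM_BYTES = 4
SECTOR_BYTES = 32   # assumed transfer granularity
BLOCK_X = 16
WARP_SIZE = 32

def sectors(element_offsets):
    """Distinct 32-byte sectors covering the given float32 offsets."""
    return len({off * ELEM_BYTES // SECTOR_BYTES for off in element_offsets})

def warp0_coords():
    """(col, row) handled by each thread of warp 0 in block (0, 0)."""
    return [(lid % BLOCK_X, lid // BLOCK_X) for lid in range(WARP_SIZE)]

def read_coalesced_transpose():
    reads = [row * COLS + col for col, row in warp0_coords()]
    writes = [col * ROWS + row for col, row in warp0_coords()]
    return sectors(reads), sectors(writes)

def write_coalesced_transpose():
    reads = [col * ROWS + row for col, row in warp0_coords()]
    writes = [row * COLS + col for col, row in warp0_coords()]
    return sectors(reads), sectors(writes)

print(read_coalesced_transpose())   # (4, 16): cheap reads, scattered writes
print(write_coalesced_transpose())  # (16, 4): scattered reads, cheap writes
```

Swapping the index formulas only moves the scattered side from the writes to the reads; the total sector count per warp stays the same.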

### 🔷 CUDA C++ Matrix Transpose (Primary)

In [None]:
%%writefile transpose_naive.cu
#include <stdio.h>
#include <cuda_runtime.h>

// Naive transpose: coalesced reads, non-coalesced writes
__global__ void transposeReadCoalesced(float* input, float* output, int rows, int cols) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    
    if (row < rows && col < cols) {
        // Read: input[row][col] - coalesced (threads read along row)
        // Write: output[col][row] - non-coalesced (threads write to scattered cols)
        output[col * rows + row] = input[row * cols + col];
    }
}

// Alternative: coalesced writes, non-coalesced reads
__global__ void transposeWriteCoalesced(float* input, float* output, int rows, int cols) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    
    if (row < rows && col < cols) {
        // Read: input[col][row] - non-coalesced (scattered reads)
        // Write: output[row][col] - coalesced (threads write along row)
        output[row * cols + col] = input[col * rows + row];
    }
}

int main() {
    const int ROWS = 4096;
    const int COLS = 4096;
    const size_t bytes = ROWS * COLS * sizeof(float);
    
    float *d_input, *d_output;
    cudaMalloc(&d_input, bytes);
    cudaMalloc(&d_output, bytes);
    
    // Initialize
    float* h_input = (float*)malloc(bytes);
    for (int i = 0; i < ROWS * COLS; i++) h_input[i] = (float)i;
    cudaMemcpy(d_input, h_input, bytes, cudaMemcpyHostToDevice);
    
    dim3 threads(16, 16);
    dim3 blocks((COLS + 15) / 16, (ROWS + 15) / 16);
    
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    
    printf("=== Matrix Transpose Coalescing Demo ===\n");
    printf("Matrix: %d x %d\n\n", ROWS, COLS);
    
    // Benchmark read-coalesced
    cudaEventRecord(start);
    for (int i = 0; i < 100; i++) {
        transposeReadCoalesced<<<blocks, threads>>>(d_input, d_output, ROWS, COLS);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    
    float read_ms;
    cudaEventElapsedTime(&read_ms, start, stop);
    
    // Benchmark write-coalesced
    cudaEventRecord(start);
    for (int i = 0; i < 100; i++) {
        transposeWriteCoalesced<<<blocks, threads>>>(d_input, d_output, ROWS, COLS);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    
    float write_ms;
    cudaEventElapsedTime(&write_ms, start, stop);
    
    printf("Read-coalesced transpose:  %.2f ms\n", read_ms / 100);
    printf("Write-coalesced transpose: %.2f ms\n", write_ms / 100);
    printf("\nNeither is optimal - shared memory fixes this (Day 2)!\n");
    
    cudaFree(d_input);
    cudaFree(d_output);
    free(h_input);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    
    return 0;
}

In [None]:
!nvcc -o transpose_naive transpose_naive.cu && ./transpose_naive

In [None]:
@cuda.jit
def transpose_naive(input_matrix, output_matrix, rows, cols):
    """
    Naive transpose: coalesced reads, non-coalesced writes
    """
    col, row = cuda.grid(2)
    
    if row < rows and col < cols:
        # Read from input[row][col] (coalesced - threads read along row)
        # Write to output[col][row] (non-coalesced - threads write to scattered locations)
        output_matrix[col, row] = input_matrix[row, col]

@cuda.jit
def transpose_read_coalesced(input_matrix, output_matrix, rows, cols):
    """
    Same as naive - prioritize coalesced reads
    """
    col, row = cuda.grid(2)
    
    if row < rows and col < cols:
        output_matrix[col, row] = input_matrix[row, col]

@cuda.jit
def transpose_write_coalesced(input_matrix, output_matrix, rows, cols):
    """
    Prioritize coalesced writes (non-coalesced reads)
    """
    col, row = cuda.grid(2)
    
    if row < rows and col < cols:
        # Read from input[col][row] (non-coalesced - scattered reads)
        # Write to output[row][col] (coalesced - threads write along row)
        output_matrix[row, col] = input_matrix[col, row]

def benchmark_transpose(rows, cols, kernel, iterations=100):
    """Benchmark transpose kernel"""
    input_mat = cuda.to_device(np.random.randn(rows, cols).astype(np.float32))
    output_mat = cuda.device_array((cols, rows), dtype=np.float32)
    
    threads = (16, 16)
    blocks = (math.ceil(cols / 16), math.ceil(rows / 16))
    
    # Warmup
    kernel[blocks, threads](input_mat, output_mat, rows, cols)
    cuda.synchronize()
    
    # Benchmark
    start = time.perf_counter()
    for _ in range(iterations):
        kernel[blocks, threads](input_mat, output_mat, rows, cols)
    cuda.synchronize()
    elapsed = (time.perf_counter() - start) / iterations
    
    bytes_transferred = 2 * rows * cols * 4
    bandwidth = bytes_transferred / elapsed / 1e9
    
    return elapsed * 1000, bandwidth

# Benchmark transpose patterns
rows, cols = 4096, 4096

print(f"Matrix transpose: {rows} × {cols}")
print("=" * 55)

time_read, bw_read = benchmark_transpose(rows, cols, transpose_read_coalesced)
time_write, bw_write = benchmark_transpose(rows, cols, transpose_write_coalesced)

print(f"Coalesced reads:   {time_read:.3f} ms, {bw_read:.1f} GB/s")
print(f"Coalesced writes:  {time_write:.3f} ms, {bw_write:.1f} GB/s")
print(f"\n💡 Neither is optimal! We need shared memory (Day 2) to fix this.")

## 🎯 Exercises

### Exercise 1: Identify the Access Pattern

For each kernel below, identify if the access is coalesced or not.

### 🔷 CUDA C++ Version (Primary)

In [None]:
%%writefile exercise_patterns.cu
// exercise_patterns.cu - Identify coalesced vs non-coalesced access
#include <stdio.h>
#include <cuda_runtime.h>

// Pattern A: Sequential access
__global__ void patternA(int* arr, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        arr[idx] = idx;  // TODO: Coalesced or not?
    }
}

// Pattern B: Reverse sequential access
__global__ void patternB(int* arr, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        arr[n - 1 - idx] = idx;  // TODO: Coalesced or not?
    }
}

// Pattern C: Strided access (every other element)
__global__ void patternC(int* arr, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n / 2) {
        arr[idx * 2] = idx;  // TODO: Coalesced or not?
    }
}

// Pattern D: 2D access with correct mapping
__global__ void patternD(int* matrix, int rows, int cols) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < rows && col < cols) {
        matrix[row * cols + col] = row + col;  // TODO: Coalesced or not?
    }
}

int main() {
    printf("=== Exercise 1: Identify Access Patterns ===\n\n");
    
    printf("Pattern A: arr[idx] = idx\n");
    printf("  → Threads 0,1,2,3,... write to indices 0,1,2,3,...\n");
    printf("  → ANSWER: ?\n\n");
    
    printf("Pattern B: arr[n - 1 - idx] = idx\n");
    printf("  → Threads 0,1,2,3,... write to indices n-1,n-2,n-3,...\n");
    printf("  → ANSWER: ?\n\n");
    
    printf("Pattern C: arr[idx * 2] = idx\n");
    printf("  → Threads 0,1,2,3,... write to indices 0,2,4,6,...\n");
    printf("  → ANSWER: ?\n\n");
    
    printf("Pattern D: matrix[row * cols + col] with threadIdx.x → col\n");
    printf("  → Adjacent threads access adjacent columns\n");
    printf("  → ANSWER: ?\n\n");
    
    printf("-------------------------------------------\n");
    printf("ANSWERS:\n");
    printf("A: COALESCED (sequential access)\n");
    printf("B: COALESCED (reverse is still contiguous within warp)\n");
    printf("C: NON-COALESCED (stride of 2, 50%% efficiency)\n");
    printf("D: COALESCED (threadIdx.x maps to fast dimension)\n");
    
    return 0;
}

In [None]:
!nvcc -o exercise_patterns exercise_patterns.cu && ./exercise_patterns

### 🔶 Python/Numba Version (Optional)

In [None]:
# Exercise 1: Identify coalesced vs non-coalesced

# Pattern A
@cuda.jit
def pattern_a(arr, n):
    idx = cuda.grid(1)
    if idx < n:
        arr[idx] = idx  # TODO: Coalesced or not?

# Pattern B
@cuda.jit
def pattern_b(arr, n):
    idx = cuda.grid(1)
    if idx < n:
        arr[n - 1 - idx] = idx  # TODO: Coalesced or not?

# Pattern C
@cuda.jit
def pattern_c(arr, n):
    idx = cuda.grid(1)
    if idx < n // 2:
        arr[idx * 2] = idx  # TODO: Coalesced or not?

# Pattern D
@cuda.jit
def pattern_d(matrix, rows, cols):
    col, row = cuda.grid(2)
    if row < rows and col < cols:
        matrix[row, col] = row + col  # TODO: Coalesced or not?

print("Analyze each pattern and answer:")
print("Pattern A: ?")
print("Pattern B: ?")
print("Pattern C: ?")
print("Pattern D: ?")

# Answers:
# A: Coalesced (sequential access)
# B: Coalesced (reverse sequential is still contiguous within warp)
# C: Non-coalesced (stride of 2)
# D: Coalesced (threadIdx.x maps to col, which is the fast dimension)

### Exercise 2: Fix the Non-Coalesced Access

The kernel below processes a 2D array but has non-coalesced access. Fix it!

### 🔷 CUDA C++ Version (Primary)

In [None]:
%%writefile fix_coalescing.cu
// fix_coalescing.cu - Fix the non-coalesced access pattern
#include <stdio.h>
#include <cuda_runtime.h>

#define ROWS 2048
#define COLS 2048

// BAD: Non-coalesced access
// Problem: threadIdx.x maps to row, causing strided column access
__global__ void processMatrixBad(const float* input, float* output, int rows, int cols) {
    // This mapping is WRONG for row-major memory!
    int row = blockIdx.x * blockDim.x + threadIdx.x;  // threadIdx.x → row
    int col = blockIdx.y * blockDim.y + threadIdx.y;  // threadIdx.y → col
    
    if (row < rows && col < cols) {
        int idx = row * cols + col;
        output[idx] = input[idx] * 2.0f;
    }
}

// GOOD: Coalesced access
// TODO: Fix the thread-to-index mapping so adjacent threads access adjacent memory
__global__ void processMatrixGood(const float* input, float* output, int rows, int cols) {
    // FIX: threadIdx.x should map to the COLUMN (fast dimension in row-major)
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // threadIdx.x → col
    int row = blockIdx.y * blockDim.y + threadIdx.y;  // threadIdx.y → row
    
    if (row < rows && col < cols) {
        int idx = row * cols + col;
        output[idx] = input[idx] * 2.0f;
    }
}

int main() {
    const size_t bytes = ROWS * COLS * sizeof(float);
    
    float *d_input, *d_output;
    cudaMalloc(&d_input, bytes);
    cudaMalloc(&d_output, bytes);
    
    // Initialize with dummy data
    cudaMemset(d_input, 0, bytes);
    
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    
    printf("=== Exercise 2: Fix Non-Coalesced Access ===\n\n");
    printf("Matrix size: %d x %d\n\n", ROWS, COLS);
    
    // BAD version: threadIdx.x → row (WRONG for row-major!)
    dim3 blocksBad(ROWS / 16, COLS / 16);
    dim3 threadsBad(16, 16);  // x=16 threads for rows
    
    cudaEventRecord(start);
    for (int i = 0; i < 100; i++) {
        processMatrixBad<<<blocksBad, threadsBad>>>(d_input, d_output, ROWS, COLS);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    
    float badTime;
    cudaEventElapsedTime(&badTime, start, stop);
    float badBW = (2 * bytes * 100) / (badTime / 1000) / 1e9;
    
    // GOOD version: threadIdx.x → col (CORRECT for row-major!)
    dim3 blocksGood(COLS / 16, ROWS / 16);
    dim3 threadsGood(16, 16);  // x=16 threads for cols
    
    cudaEventRecord(start);
    for (int i = 0; i < 100; i++) {
        processMatrixGood<<<blocksGood, threadsGood>>>(d_input, d_output, ROWS, COLS);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    
    float goodTime;
    cudaEventElapsedTime(&goodTime, start, stop);
    float goodBW = (2 * bytes * 100) / (goodTime / 1000) / 1e9;
    
    printf("BAD  (threadIdx.x → row): %.2f ms, %.1f GB/s\n", badTime / 100, badBW);
    printf("GOOD (threadIdx.x → col): %.2f ms, %.1f GB/s\n", goodTime / 100, goodBW);
    printf("\nSpeedup: %.1fx\n", badTime / goodTime);
    
    printf("\n-------------------------------------------\n");
    printf("KEY INSIGHT:\n");
    printf("In row-major layout, adjacent columns are adjacent in memory.\n");
    printf("threadIdx.x should map to the COLUMN index for coalescing!\n");
    
    cudaFree(d_input);
    cudaFree(d_output);
    return 0;
}

In [None]:
!nvcc -o fix_coalescing fix_coalescing.cu && ./fix_coalescing

### 🔶 Python/Numba Version (Optional)

In [None]:
# Exercise 2: Fix the access pattern

@cuda.jit
def process_matrix_bad(matrix, output, rows, cols):
    """BAD: Non-coalesced access"""
    # Problem: threadIdx.x maps to row, causing non-coalesced column access
    row, col = cuda.grid(2)  # This mapping is wrong!
    
    if row < rows and col < cols:
        output[row, col] = matrix[row, col] * 2.0

@cuda.jit
def process_matrix_good(matrix, output, rows, cols):
    """TODO: Fix to be coalesced"""
    # TODO: Change the thread-to-index mapping
    row, col = cuda.grid(2)  # FIX THIS LINE
    
    if row < rows and col < cols:
        output[row, col] = matrix[row, col] * 2.0

# Test your fix
# rows, cols = 2048, 2048
# ... benchmark both versions

## 📝 Key Takeaways

### Memory Coalescing Rules:

1. **Adjacent threads should access adjacent memory**
   - Thread 0 → mem[0], Thread 1 → mem[1], etc.

2. **For 2D arrays in row-major (C/NumPy):**
   - `threadIdx.x` should map to the column index
   - `col, row = cuda.grid(2)` is the correct order

3. **Strided access kills performance**
   - Stride of 2 → ~50% efficiency
   - Stride of 32 → as low as ~3% efficiency (one useful float per 128-byte transaction; 12.5% if the hardware fetches 32-byte pieces)

4. **Some patterns can't be fixed with coalescing alone**
   - Matrix transpose needs shared memory
   - We'll learn this tomorrow!
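
The stride figures in rule 3 follow from simple arithmetic: useful bytes divided by transferred bytes. The sketch below works that out for float32 data at two transfer granularities. Both granularities and the helper name `stride_efficiency` are assumptions for illustration; real hardware adds caching effects that this model ignores.

```python
WARP_SIZE = 32
ELEM_BYTES = 4  # float32

def stride_efficiency(stride, granularity_bytes):
    """Useful bytes / transferred bytes for one warp reading with a stride."""
    addresses = [t * stride * ELEM_BYTES for t in range(WARP_SIZE)]
    chunks = {addr // granularity_bytes for addr in addresses}
    useful_bytes = WARP_SIZE * ELEM_BYTES
    return useful_bytes / (len(chunks) * granularity_bytes)

print(f"{'stride':>6} | {'32-byte':>8} | {'128-byte':>8}")
for stride in (1, 2, 4, 8, 16, 32):
    e32 = stride_efficiency(stride, 32)
    e128 = stride_efficiency(stride, 128)
    print(f"{stride:>6} | {e32:>8.1%} | {e128:>8.1%}")
```

In this model a stride of 2 gives 50% at either granularity. A stride of 32 gives 3.125% with 128-byte transfers, and efficiency bottoms out at 12.5% with 32-byte transfers once the stride reaches 8.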

### Performance Impact:
- Coalesced: roughly 200-300 GB/s on a T4 (theoretical peak: 320 GB/s)
- Non-coalesced: 10-50 GB/s
- **Roughly a 4-30x performance difference!**
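
The benchmarks in this notebook report effective bandwidth as bytes read plus bytes written, divided by elapsed time. A small helper makes it easy to compare a measurement with the hardware limit. In this sketch the 0.5 ms timing is a made-up example, not a measured result, and `T4_PEAK_GBPS = 320.0` is taken from NVIDIA's published T4 specification as an assumption about the target GPU.

```python
def effective_bandwidth_gbps(num_elements, elem_bytes, elapsed_seconds):
    """Bytes read plus bytes written per second, in GB/s (1 GB = 1e9 bytes)."""
    bytes_moved = 2 * num_elements * elem_bytes  # one read and one write per element
    return bytes_moved / elapsed_seconds / 1e9

T4_PEAK_GBPS = 320.0  # assumed peak memory bandwidth of the target GPU

# Hypothetical timing: a 4096 x 4096 float32 copy that took 0.5 ms.
bw = effective_bandwidth_gbps(4096 * 4096, 4, 0.5e-3)
print(f"{bw:.1f} GB/s, {100 * bw / T4_PEAK_GBPS:.0f}% of assumed peak")
```

A result above the peak is a sign that the byte count or the timing is wrong, for example because the kernel touched fewer elements than the formula assumes.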

---

### 📚 Next Up: Day 2 - Shared Memory
- Using shared memory for tile-based algorithms
- Fixing the transpose problem
- Thread synchronization with `__syncthreads()`

---

### 🔗 Resources
- [Device Memory Access](../../cuda-programming-guide/03-advanced/device-memory-access.md)
- [Performance Optimization](../../cuda-programming-guide/03-advanced/performance-optimization.md)