# 🚀 Day 3: Bank Conflicts - The Hidden Performance Killer

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sdodlapati3/cuda-lab/blob/main/learning-path/week-02/day-3-bank-conflicts.ipynb)

## Learning Philosophy

> **CUDA C++ First, Python/Numba as Optional Backup**

This notebook shows:
1. **CUDA C++ code** - The PRIMARY implementation you should learn
2. **Python/Numba code** - OPTIONAL for quick interactive testing in Colab

---

In [None]:
# ⚙️ Colab/Local Setup - Run this first!
# Python/Numba is OPTIONAL - for quick interactive testing only
import subprocess, sys
try:
    import google.colab
    print("🔧 Running on Google Colab - Installing dependencies...")
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", "numba"])
    print("✅ Setup complete!")
except ImportError:
    print("💻 Running locally - make sure you have: pip install numba numpy")

import numpy as np
from numba import cuda, float32
import math
import time

print("\n⚠️  Remember: CUDA C++ code is the PRIMARY learning material!")

Yesterday you learned how shared memory provides ~100x faster access than global memory. But there's a catch: **bank conflicts** can serialize your accesses and wipe out much of that speedup!

## Learning Objectives
- Understand how shared memory is organized into banks
- Identify patterns that cause bank conflicts
- Apply techniques to avoid conflicts (padding, access patterns)
- Optimize the matrix transpose to avoid conflicts

---

## 1. Shared Memory Bank Architecture

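Shared memory is divided into **32 banks**, one per thread in a warp. Successive 4-byte words map to successive banks, and each bank can serve one word per cycle. The addresses in the diagram are 4-byte word indices.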

```
SHARED MEMORY BANKS (32 banks):

Address:  0   1   2   3   4   5  ...  30  31  32  33  34  ...
Bank:     0   1   2   3   4   5  ...  30  31   0   1   2  ...
          │   │   │   │   │   │       │   │   │   │   │
          ▼   ▼   ▼   ▼   ▼   ▼       ▼   ▼   ▼   ▼   ▼
        ┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐
        │ 0 │ 1 │ 2 │ 3 │ 4 │ 5 │...│30 │31 │ 0 │ 1 │ 2 │
        └───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘

Bank number = (byte_address / 4) % 32    (for 4-byte words)
```
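
The mapping can be checked on the host with plain Python. This is a minimal sketch that assumes 32 banks of 4-byte words, as in the formula above; `bank_of`, `NUM_BANKS`, and `BANK_WIDTH` are names introduced here for illustration only.

```python
NUM_BANKS = 32    # assumption: 32 banks, as described above
BANK_WIDTH = 4    # assumption: each bank is one 4-byte word wide

def bank_of(byte_address):
    """Return the bank holding the 4-byte word that contains byte_address."""
    return (byte_address // BANK_WIDTH) % NUM_BANKS

# A float array places element i at byte offset 4 * i.
for i in (0, 1, 31, 32, 33):
    print(f"smem[{i:2d}] -> byte {4 * i:3d} -> bank {bank_of(4 * i)}")
```

Elements 0 and 32 land in the same bank, which is why a stride of 32 floats is the worst case discussed below.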

### Bank Conflict Rules:

| Scenario | Result |
|----------|--------|
| Threads access different banks | Parallel (fast!) |
| Threads access same address (broadcast) | Parallel (fast!) |
| Threads access different addresses in same bank | **Serialized (slow!)** |
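
The three rows of the table can be modeled with a few lines of host-side Python. This is a sketch under the same assumptions (32 banks, 4-byte words, indices given as word indices); `conflict_degree` is a hypothetical helper, not part of CUDA or Numba.

```python
from collections import defaultdict

def conflict_degree(word_indices, num_banks=32):
    """Most distinct words that any single bank must serve for one warp access."""
    words_per_bank = defaultdict(set)
    for w in word_indices:
        # A set keeps each word once, so threads sharing an address
        # (broadcast) do not add to the count.
        words_per_bank[w % num_banks].add(w)
    return max(len(words) for words in words_per_bank.values())

warp = range(32)
print(conflict_degree([t for t in warp]))        # 1: different banks
print(conflict_degree([0 for t in warp]))        # 1: same address (broadcast)
print(conflict_degree([t * 32 for t in warp]))   # 32: different addresses, same bank
```

A result of 1 means the access completes in a single pass; a result of N corresponds to an N-way conflict.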

### 🔷 CUDA C++ Implementation (Primary)

In [None]:
%%writefile bank_conflict_demo.cu
#include <stdio.h>
#include <cuda_runtime.h>

// GOOD: Each thread accesses a different bank (no conflict)
__global__ void noBankConflict(float* out) {
    __shared__ float smem[256];
    int tid = threadIdx.x;
    
    // Thread 0→Bank 0, Thread 1→Bank 1, ...
    smem[tid] = (float)tid;
    __syncthreads();
    out[tid] = smem[tid];
}

// BAD: 32-way bank conflict! All threads hit same bank
__global__ void bankConflict32Way(float* out) {
    __shared__ float smem[8192];
    int tid = threadIdx.x;
    
    // All threads access Bank 0: addr 0, 32, 64, 96...
    int idx = tid * 32;
    smem[idx] = (float)tid;
    __syncthreads();
    out[tid] = smem[idx];
}

// 2-way bank conflict: stride of 2
__global__ void bankConflict2Way(float* out) {
    __shared__ float smem[512];
    int tid = threadIdx.x;
    
    // Thread 0→Bank 0, Thread 1→Bank 2, ..., Thread 16→Bank 0 (conflict!)
    int idx = tid * 2;
    smem[idx] = (float)tid;
    __syncthreads();
    out[tid] = smem[idx];
}

int main() {
    const int N = 256;
    float *d_out, *h_out;
    
    h_out = (float*)malloc(N * sizeof(float));
    cudaMalloc(&d_out, N * sizeof(float));
    
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    
    printf("=== Bank Conflict Demonstration ===\n\n");
    
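    // Warm up so one-time startup cost is not charged to the first benchmark
    noBankConflict<<<1, 256>>>(d_out);
    cudaDeviceSynchronize();
    
    // Benchmark no conflict
    cudaEventRecord(start);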
    for (int i = 0; i < 10000; i++) {
        noBankConflict<<<1, 256>>>(d_out);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float no_conflict_ms;
    cudaEventElapsedTime(&no_conflict_ms, start, stop);
    printf("No bank conflict (stride=1):     %.3f ms\n", no_conflict_ms);
    
    // Benchmark 2-way conflict
    cudaEventRecord(start);
    for (int i = 0; i < 10000; i++) {
        bankConflict2Way<<<1, 256>>>(d_out);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float conflict_2way_ms;
    cudaEventElapsedTime(&conflict_2way_ms, start, stop);
    printf("2-way bank conflict (stride=2):  %.3f ms (%.1fx slower)\n", 
           conflict_2way_ms, conflict_2way_ms / no_conflict_ms);
    
    // Benchmark 32-way conflict
    cudaEventRecord(start);
    for (int i = 0; i < 10000; i++) {
        bankConflict32Way<<<1, 256>>>(d_out);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float conflict_32way_ms;
    cudaEventElapsedTime(&conflict_32way_ms, start, stop);
    printf("32-way bank conflict (stride=32): %.3f ms (%.1fx slower)\n", 
           conflict_32way_ms, conflict_32way_ms / no_conflict_ms);
    
    printf("\n💡 32-way conflict serializes all 32 threads in a warp!\n");
    
    cudaFree(d_out);
    free(h_out);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    
    return 0;
}

In [None]:
!nvcc -arch=sm_75 -o bank_conflict_demo bank_conflict_demo.cu
!./bank_conflict_demo

### 🔶 Python/Numba (Optional - Quick Testing)

In [None]:
import numpy as np
from numba import cuda, float32
import math
import time

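# The device name may be bytes or str depending on the Numba version
name = cuda.get_current_device().name
print("GPU:", name.decode() if isinstance(name, bytes) else name)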

## 2. Common Bank Conflict Patterns

### ✅ No Conflict: Sequential Access
```
Thread 0 → Bank 0 (addr 0)
Thread 1 → Bank 1 (addr 1)
Thread 2 → Bank 2 (addr 2)
...all different banks, parallel access!
```

### ❌ 2-way Conflict: Stride of 2
```
Thread 0 → Bank 0 (addr 0)
Thread 1 → Bank 2 (addr 2)
Thread 2 → Bank 4 (addr 4)
...
Thread 16 → Bank 0 (addr 32)  ← CONFLICT with Thread 0!
```

### ❌❌ 32-way Conflict: Stride of 32
```
Thread 0 → Bank 0 (addr 0)
Thread 1 → Bank 0 (addr 32)  ← All same bank!
Thread 2 → Bank 0 (addr 64)
...32x serialization!
```

In [None]:
def visualize_bank_access(stride, num_threads=32):
    """Visualize which bank each thread accesses"""
    banks = [0] * 32  # Count accesses per bank
    
    print(f"Stride = {stride}")
    print("-" * 60)
    
    for tid in range(num_threads):
        addr = tid * stride
        bank = addr % 32
        banks[bank] += 1
        if tid < 8:
            print(f"  Thread {tid} → Address {addr:3d} → Bank {bank:2d}")
    
    if num_threads > 8:
        print("  ...")
    
    max_conflicts = max(banks)
    print(f"\nMax accesses to single bank: {max_conflicts} ({max_conflicts}-way conflict)")
    print(f"Banks with conflicts: {sum(1 for b in banks if b > 1)}")
    return max_conflicts

# Demonstrate different strides
print("=" * 60)
print("BANK CONFLICT ANALYSIS")
print("=" * 60)

for stride in [1, 2, 8, 16, 32]:
    print()
    visualize_bank_access(stride)

## 3. Measuring Bank Conflict Impact

Let's create kernels with different access patterns and measure the performance difference.

In [None]:
BLOCK_SIZE = 256
ITERATIONS_PER_THREAD = 1000  # Repeat to make timing measurable

@cuda.jit
def no_conflict_access(output):
    """Sequential access: No bank conflicts"""
    shared = cuda.shared.array(shape=256, dtype=float32)
    tid = cuda.threadIdx.x
    
    # Initialize
    shared[tid] = float(tid)
    cuda.syncthreads()
    
    # Repeated access pattern: stride 1 (no conflict)
    total = 0.0
    for i in range(ITERATIONS_PER_THREAD):
        idx = tid  # Stride 1: Thread 0→0, Thread 1→1, ...
        total += shared[idx]
    
    output[cuda.grid(1)] = total

@cuda.jit
def strided_access(output, stride):
    """Strided access: May cause bank conflicts"""
    shared = cuda.shared.array(shape=8192, dtype=float32)  # Large enough for any stride
    tid = cuda.threadIdx.x
    
    # Initialize more elements
    for i in range(32):
        if tid + i * 256 < 8192:
            shared[tid + i * 256] = float(tid)
    cuda.syncthreads()
    
    # Repeated access pattern with stride
    total = 0.0
    for i in range(ITERATIONS_PER_THREAD):
        idx = (tid * stride) % 8192
        total += shared[idx]
    
    output[cuda.grid(1)] = total

def benchmark_bank_conflicts():
    """Benchmark different stride patterns"""
    output = cuda.device_array(BLOCK_SIZE, dtype=np.float32)
    
    results = []
    
    for stride in [1, 2, 4, 8, 16, 32]:
        # Warmup
        strided_access[1, BLOCK_SIZE](output, stride)
        cuda.synchronize()
        
        # Benchmark
        start = time.perf_counter()
        for _ in range(100):
            strided_access[1, BLOCK_SIZE](output, stride)
        cuda.synchronize()
        elapsed = (time.perf_counter() - start) / 100
        
        results.append((stride, elapsed * 1000))
    
    return results

results = benchmark_bank_conflicts()

print("Bank Conflict Performance Impact")
print("=" * 50)
print(f"{'Stride':<10} | {'Time (ms)':<12} | {'Slowdown'}")
print("-" * 50)

baseline = results[0][1]
for stride, time_ms in results:
    slowdown = time_ms / baseline
    # With 32 banks, a stride of s words puts gcd(32, s) threads on each used bank
    conflict_type = "none" if stride == 1 else f"{math.gcd(32, stride)}-way"
    print(f"{stride:<10} | {time_ms:<12.4f} | {slowdown:.2f}x ({conflict_type})")

## 4. Matrix Transpose: The Bank Conflict Problem

In yesterday's shared memory transpose, we had a hidden problem:

```
tile[ty][tx] = input[row][col]  // Write: tx indexes the contiguous dimension → OK
output[...] = tile[tx][ty]      // Read: tx indexes the rows (stride 32), ty fixed → CONFLICT!
```

When a warp reads `tile[tx][ty]` with `ty` fixed, its 32 threads walk down a single column of the tile. In a 32×32 tile of floats, vertically adjacent elements are 32 words apart, so every thread lands in the same bank: a **32-way bank conflict!**

### The Fix: Padding

Add 1 extra element per row to shift the bank pattern:

```
Without padding (32 columns):        With padding (33 columns):
Row 0: Banks 0,1,2,...,31            Row 0: Banks 0,1,2,...,31,0
Row 1: Banks 0,1,2,...,31            Row 1: Banks 1,2,3,...,0,1
Row 2: Banks 0,1,2,...,31            Row 2: Banks 2,3,4,...,1,2
                                     
Column access: All Bank 0! ❌        Column access: Banks 0,1,2...! ✅
```
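
The diagram can be reproduced numerically. The sketch below assumes a row-major tile of 4-byte floats and 32 banks; `column_banks` is a helper name made up for this illustration.

```python
TILE = 32  # rows read by one warp, one row per thread

def column_banks(row_width, col, num_banks=32):
    """Banks touched when 32 threads read tile[0..31][col] of a row-major tile."""
    return [(row * row_width + col) % num_banks for row in range(TILE)]

unpadded = column_banks(row_width=32, col=0)
padded = column_banks(row_width=33, col=0)

print("32 columns:", sorted(set(unpadded)))                 # a single bank
print("33 columns:", len(set(padded)), "distinct banks")    # all 32 banks
```

The extra element per row costs 32 floats (128 bytes) of shared memory per tile and shifts each row's starting bank by one.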

In [None]:
TILE_DIM = 32

@cuda.jit
def transpose_shared_conflict(input_mat, output_mat):
    """
    Shared memory transpose WITH bank conflicts.
    tile[32][32] - column access causes 32-way conflicts!
    """
    tile = cuda.shared.array(shape=(TILE_DIM, TILE_DIM), dtype=float32)
    
    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    
    row = cuda.blockIdx.y * TILE_DIM + ty
    col = cuda.blockIdx.x * TILE_DIM + tx
    
    rows, cols = input_mat.shape
    
    if row < rows and col < cols:
        tile[ty, tx] = input_mat[row, col]
    
    cuda.syncthreads()
    
    out_row = cuda.blockIdx.x * TILE_DIM + ty
    out_col = cuda.blockIdx.y * TILE_DIM + tx
    
    if out_row < cols and out_col < rows:
        # Reading tile[tx, ty] with tx varying = column access = CONFLICT!
        output_mat[out_row, out_col] = tile[tx, ty]

@cuda.jit
def transpose_shared_padded(input_mat, output_mat):
    """
    Shared memory transpose WITHOUT bank conflicts.
    tile[32][33] - padding eliminates conflicts!
    """
    # Add 1 to column dimension to avoid bank conflicts
    tile = cuda.shared.array(shape=(TILE_DIM, TILE_DIM + 1), dtype=float32)
    
    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    
    row = cuda.blockIdx.y * TILE_DIM + ty
    col = cuda.blockIdx.x * TILE_DIM + tx
    
    rows, cols = input_mat.shape
    
    if row < rows and col < cols:
        tile[ty, tx] = input_mat[row, col]
    
    cuda.syncthreads()
    
    out_row = cuda.blockIdx.x * TILE_DIM + ty
    out_col = cuda.blockIdx.y * TILE_DIM + tx
    
    if out_row < cols and out_col < rows:
        # Now column access goes through different banks!
        output_mat[out_row, out_col] = tile[tx, ty]

def benchmark_transpose_conflicts(size):
    """Compare transpose with and without bank conflict avoidance"""
    input_mat = cuda.to_device(np.random.randn(size, size).astype(np.float32))
    output_conflict = cuda.device_array((size, size), dtype=np.float32)
    output_padded = cuda.device_array((size, size), dtype=np.float32)
    
    threads = (TILE_DIM, TILE_DIM)
    blocks = (math.ceil(size / TILE_DIM), math.ceil(size / TILE_DIM))
    
    iterations = 100
    
    # Warmup
    transpose_shared_conflict[blocks, threads](input_mat, output_conflict)
    transpose_shared_padded[blocks, threads](input_mat, output_padded)
    cuda.synchronize()
    
    # Benchmark with conflicts
    start = time.perf_counter()
    for _ in range(iterations):
        transpose_shared_conflict[blocks, threads](input_mat, output_conflict)
    cuda.synchronize()
    conflict_time = (time.perf_counter() - start) / iterations
    
    # Benchmark without conflicts (padded)
    start = time.perf_counter()
    for _ in range(iterations):
        transpose_shared_padded[blocks, threads](input_mat, output_padded)
    cuda.synchronize()
    padded_time = (time.perf_counter() - start) / iterations
    
    # Verify correctness
    input_host = input_mat.copy_to_host()
    conflict_result = output_conflict.copy_to_host()
    padded_result = output_padded.copy_to_host()
    
    correct = (np.allclose(conflict_result, input_host.T) and 
               np.allclose(padded_result, input_host.T))
    
    bytes_moved = 2 * size * size * 4
    conflict_bw = bytes_moved / conflict_time / 1e9
    padded_bw = bytes_moved / padded_time / 1e9
    
    return conflict_time, padded_time, conflict_bw, padded_bw, correct

# Benchmark
size = 4096
conflict_t, padded_t, conflict_bw, padded_bw, correct = benchmark_transpose_conflicts(size)

print(f"Matrix Transpose: {size} × {size}")
print("=" * 55)
print(f"With bank conflicts:    {conflict_t*1000:.3f} ms  ({conflict_bw:.1f} GB/s)")
print(f"Padded (no conflicts):  {padded_t*1000:.3f} ms  ({padded_bw:.1f} GB/s)")
print(f"Speedup from padding:   {conflict_t/padded_t:.2f}x")
print(f"Results correct: {correct}")

## 5. Bank Conflict Avoidance Techniques

### Technique 1: Padding
```python
# Instead of:
shared = cuda.shared.array((32, 32), float32)  # Conflicts!

# Use:
shared = cuda.shared.array((32, 33), float32)  # No conflicts
```

### Technique 2: Change Access Pattern
```python
# Instead of letting threadIdx.x pick the row (column-wise walk):
val = shared[tx, ty]  # Conflicts: consecutive tx are 32 words apart

# Let threadIdx.x pick the column (row-wise walk):
val = shared[ty, tx]  # No conflicts: consecutive tx hit consecutive banks
```

### Technique 3: Broadcast (same address = OK)
```python
# All threads reading the SAME address is fine:
val = shared[0]  # Broadcast, no conflict
```

In [None]:
# Example: Parallel reduction with bank conflict consideration

@cuda.jit
def reduce_sum_conflict_free(arr, partial_sums):
    """
    Reduction that avoids bank conflicts by using sequential addressing.
    """
    shared = cuda.shared.array(shape=256, dtype=float32)
    tid = cuda.threadIdx.x
    gid = cuda.grid(1)
    
    # Load
    shared[tid] = arr[gid] if gid < arr.size else 0.0
    cuda.syncthreads()
    
    # Reduce with sequential addressing (conflict-free)
    stride = 128  # Start with half block size
    while stride > 0:
        if tid < stride:
            # Adjacent threads access adjacent memory = no conflict!
            shared[tid] += shared[tid + stride]
        cuda.syncthreads()
        stride //= 2
    
    # Thread 0 writes result
    if tid == 0:
        partial_sums[cuda.blockIdx.x] = shared[0]

# Test reduction
n = 1024
arr = np.ones(n, dtype=np.float32)
arr_d = cuda.to_device(arr)

blocks = n // 256
partial_sums = cuda.device_array(blocks, dtype=np.float32)

reduce_sum_conflict_free[blocks, 256](arr_d, partial_sums)
result = partial_sums.copy_to_host()

print(f"Reduction of {n} ones:")
print(f"  Partial sums: {result}")
print(f"  Total sum: {result.sum()} (expected: {n})")

## 🎯 Exercises

### 🔷 CUDA C++ Exercises (Primary)

Complete the exercises below in CUDA C++. The code demonstrates bank conflict analysis and avoidance techniques.

In [None]:
%%writefile bank_conflicts_exercises.cu
#include <stdio.h>
#include <cuda_runtime.h>

#define BLOCK_SIZE 32

// =============================================================================
// Exercise 1: Analyze Bank Conflicts
// For each pattern, calculate the number of bank conflicts
// =============================================================================

// Pattern A: shared[threadIdx.x] - Stride 1
// Answer: No conflict (each thread accesses different bank)

// Pattern B: shared[threadIdx.x * 2] - Stride 2  
// Answer: 2-way conflict (threads 0,16 access bank 0; threads 1,17 access bank 2, etc.)

// Pattern C: shared[threadIdx.x * 33] - Stride 33
// Answer: No conflict! (33 % 32 = 1, so effective stride is 1)

// Pattern D: tile[32][32], accessing tile[threadIdx.y][threadIdx.x]
// Answer: No conflict (within a warp, tx varies 0-31, each accesses different bank)

// Pattern E: tile[32][32], accessing tile[threadIdx.x][threadIdx.y]
// Answer: 32-way conflict! (stride of 32, all threads access same bank)

__global__ void demonstrate_patterns() {
    __shared__ float shared[1024];
    __shared__ float tile[32][32];
    
    int tx = threadIdx.x;
    int ty = threadIdx.y;
    
    // Pattern A: Stride 1 - NO CONFLICT
    shared[tx] = 1.0f;
    
    // Pattern B: Stride 2 - 2-WAY CONFLICT
    // shared[tx * 2] = 1.0f;
    
    // Pattern C: Stride 33 - NO CONFLICT (33 % 32 = 1)
    // shared[tx * 33] = 1.0f;
    
    // Pattern D: Row-major access - NO CONFLICT
    tile[ty][tx] = 1.0f;
    
    // Pattern E: Column-major - 32-WAY CONFLICT
    // tile[tx][ty] = 1.0f;
}

// =============================================================================
// Exercise 2: Fix Bank Conflicts with Padding
// The column sum kernel has conflicts - fix it!
// =============================================================================

// PROBLEM VERSION: Has bank conflicts on column access
__global__ void matrix_column_sum_conflict(float* matrix, float* col_sums, 
                                            int rows, int cols) {
    __shared__ float tile[32][32];  // CONFLICTS when accessing columns!
    
    int tx = threadIdx.x;
    int ty = threadIdx.y;
    int col = blockIdx.x * 32 + tx;
    
    float sum = 0.0f;
    
    for (int row_offset = 0; row_offset < rows; row_offset += 32) {
        int row = row_offset + ty;
        
        // Store the tile transposed: thread (tx, ty) writes tile[tx][ty].
        // Within a warp tx varies, so neighbors are 32 words apart = CONFLICT!
        if (row < rows && col < cols) {
            tile[tx][ty] = matrix[row * cols + col];
        } else {
            tile[tx][ty] = 0.0f;
        }
        __syncthreads();
        
        // Sum each column - thread tx of warp 0 reads tile[tx][0..31]
        // tx varies within the warp at a fixed i = column of the tile = CONFLICT!
        if (ty == 0 && col < cols) {
            for (int i = 0; i < 32 && (row_offset + i) < rows; i++) {
                sum += tile[tx][i];  // BANK CONFLICT: stride 32
            }
        }
        __syncthreads();
    }
    
    if (ty == 0 && col < cols) {
        col_sums[col] = sum;
    }
}

// FIXED VERSION: Use padding to avoid bank conflicts
__global__ void matrix_column_sum_fixed(float* matrix, float* col_sums, 
                                         int rows, int cols) {
    // PADDED: 33 columns instead of 32
    __shared__ float tile[32][33];  // +1 padding eliminates conflicts!
    
    int tx = threadIdx.x;
    int ty = threadIdx.y;
    int col = blockIdx.x * 32 + tx;
    
    float sum = 0.0f;
    
    for (int row_offset = 0; row_offset < rows; row_offset += 32) {
        int row = row_offset + ty;
        
        // Store the tile transposed: thread (tx, ty) writes tile[tx][ty]
        if (row < rows && col < cols) {
            tile[tx][ty] = matrix[row * cols + col];
        } else {
            tile[tx][ty] = 0.0f;
        }
        __syncthreads();
        
        // Sum each column - rows are now 33 words apart, and 33 % 32 = 1,
        // so consecutive tx fall in consecutive banks
        if (ty == 0 && col < cols) {
            for (int i = 0; i < 32 && (row_offset + i) < rows; i++) {
                sum += tile[tx][i];  // NO CONFLICT: stride 33 acts like stride 1
            }
        }
        __syncthreads();
    }
    
    if (ty == 0 && col < cols) {
        col_sums[col] = sum;
    }
}

// Helper function for timing
__global__ void warmup() {
    // Empty kernel for GPU warmup
}

int main() {
    printf("=== Bank Conflict Exercises ===\n\n");
    
    // Exercise 1: Print pattern analysis
    printf("Exercise 1: Bank Conflict Analysis\n");
    printf("==================================\n");
    printf("Pattern A: shared[threadIdx.x]\n");
    printf("  → Stride = 1, Bank = tid %% 32\n");
    printf("  → Answer: NO CONFLICT (each thread, different bank)\n\n");
    
    printf("Pattern B: shared[threadIdx.x * 2]\n");
    printf("  → Stride = 2, Bank = (tid * 2) %% 32\n");
    printf("  → Answer: 2-WAY CONFLICT (threads 0,16 → bank 0)\n\n");
    
    printf("Pattern C: shared[threadIdx.x * 33]\n");
    printf("  → Stride = 33, Bank = (tid * 33) %% 32 = tid %% 32\n");
    printf("  → Answer: NO CONFLICT (33 %% 32 = 1)\n\n");
    
    printf("Pattern D: tile[32][32], access tile[threadIdx.y][threadIdx.x]\n");
    printf("  → Within warp: ty constant, tx varies 0-31\n");
    printf("  → Answer: NO CONFLICT (row-major access)\n\n");
    
    printf("Pattern E: tile[32][32], access tile[threadIdx.x][threadIdx.y]\n");
    printf("  → Within warp: tx varies, ty constant\n");
    printf("  → Address = tx * 32 + ty, stride = 32\n");
    printf("  → Answer: 32-WAY CONFLICT (all threads, same bank!)\n\n");
    
    // Exercise 2: Benchmark conflict vs no-conflict
    printf("Exercise 2: Column Sum - Conflict vs Fixed\n");
    printf("==========================================\n");
    
    const int rows = 1024;
    const int cols = 1024;
    
    // Allocate memory
    float *h_matrix = new float[rows * cols];
    float *h_col_sums = new float[cols];
    float *d_matrix, *d_col_sums;
    
    cudaMalloc(&d_matrix, rows * cols * sizeof(float));
    cudaMalloc(&d_col_sums, cols * sizeof(float));
    
    // Initialize matrix
    for (int i = 0; i < rows * cols; i++) {
        h_matrix[i] = 1.0f;  // Each element is 1, so column sum = rows
    }
    cudaMemcpy(d_matrix, h_matrix, rows * cols * sizeof(float), cudaMemcpyHostToDevice);
    
    dim3 block(32, 32);
    dim3 grid((cols + 31) / 32, 1);
    
    // Warmup
    warmup<<<1, 1>>>();
    cudaDeviceSynchronize();
    
    // Create events for timing
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    
    // Time version with conflicts
    cudaEventRecord(start);
    for (int i = 0; i < 100; i++) {
        matrix_column_sum_conflict<<<grid, block>>>(d_matrix, d_col_sums, rows, cols);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    
    float ms_conflict;
    cudaEventElapsedTime(&ms_conflict, start, stop);
    
    // Verify result
    cudaMemcpy(h_col_sums, d_col_sums, cols * sizeof(float), cudaMemcpyDeviceToHost);
    printf("Version with conflicts:\n");
    printf("  Time: %.3f ms (100 iterations)\n", ms_conflict);
    printf("  Result check: col_sums[0] = %.0f (expected: %d)\n\n", h_col_sums[0], rows);
    
    // Time fixed version
    cudaEventRecord(start);
    for (int i = 0; i < 100; i++) {
        matrix_column_sum_fixed<<<grid, block>>>(d_matrix, d_col_sums, rows, cols);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    
    float ms_fixed;
    cudaEventElapsedTime(&ms_fixed, start, stop);
    
    cudaMemcpy(h_col_sums, d_col_sums, cols * sizeof(float), cudaMemcpyDeviceToHost);
    printf("Version with padding (fixed):\n");
    printf("  Time: %.3f ms (100 iterations)\n", ms_fixed);
    printf("  Result check: col_sums[0] = %.0f (expected: %d)\n", h_col_sums[0], rows);
    printf("  Speedup: %.2fx\n\n", ms_conflict / ms_fixed);
    
    printf("Key Insight: Adding 1 element of padding per row eliminates\n");
    printf("32-way bank conflicts, significantly improving performance!\n");
    
    // Cleanup
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    delete[] h_matrix;
    delete[] h_col_sums;
    cudaFree(d_matrix);
    cudaFree(d_col_sums);
    
    return 0;
}

In [None]:
!nvcc -o bank_conflicts_exercises bank_conflicts_exercises.cu && ./bank_conflicts_exercises

### 🔶 Python/Numba Exercises (Optional)

### Exercise 1: Identify Bank Conflicts

For each access pattern, calculate the number of bank conflicts.

In [None]:
# Exercise 1: Analyze these patterns

# Pattern A: shared[threadIdx.x]
# → Stride = 1, Bank = tid % 32
# Answer: ?

# Pattern B: shared[threadIdx.x * 2]
# → Stride = 2, Bank = (tid * 2) % 32
# Answer: ?

# Pattern C: shared[threadIdx.x * 33]
# → Stride = 33, Bank = (tid * 33) % 32 = (tid * 1) % 32
# Answer: ?

# Pattern D: 2D array tile[32][32], accessing tile[threadIdx.y][threadIdx.x]
# → Within a warp (32 threads), ty is same, tx varies 0-31
# Answer: ?

# Pattern E: 2D array tile[32][32], accessing tile[threadIdx.x][threadIdx.y]
# → Within a warp (32 threads), tx varies 0-31, ty is same
# → Address = tx * 32 + ty, stride of 32
# Answer: ?

print("Analyze each pattern for bank conflicts!")
print("Hint: Bank = (element_index * sizeof(element)) / 4 % 32")

### Exercise 2: Fix the Bank Conflicts

The kernel below has bank conflicts. Fix it using padding.

In [None]:
# Exercise 2: Fix the bank conflicts

@cuda.jit
def matrix_column_sum_conflict(matrix, col_sums, rows, cols):
    """
    Sum each column of a matrix.
    PROBLEM: Column access pattern causes bank conflicts!
    """
    # Each block handles one tile of columns
    shared = cuda.shared.array(shape=(32, 32), dtype=float32)  # CONFLICTS!
    
    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    col = cuda.blockIdx.x * 32 + tx
    
    # Load tile
    for row_offset in range(0, rows, 32):
        row = row_offset + ty
        if row < rows and col < cols:
            shared[ty, tx] = matrix[row, col]
        cuda.syncthreads()
        
        # Sum down columns
        # TODO: Decide which index threadIdx.x selects when reading `shared`;
        # conflicts appear when it selects the row (stride 32). How to fix?
        
        cuda.syncthreads()

@cuda.jit  
def matrix_column_sum_fixed(matrix, col_sums, rows, cols):
    """
    TODO: Fix the bank conflicts using padding.
    """
    # TODO: Change shared array dimensions
    shared = cuda.shared.array(shape=(32, 32), dtype=float32)  # FIX THIS
    
    # Rest of the implementation...
    pass

## 📝 Key Takeaways

### Bank Conflict Rules:

1. **32 banks, each 4 bytes wide**
   - Bank = (byte_address / 4) % 32

2. **Conflict = multiple threads accessing different addresses in same bank**
   - N-way conflict → N sequential accesses

3. **Broadcast is OK**
   - All threads reading SAME address = no conflict

### Avoidance Techniques:

| Technique | When to Use |
|-----------|-------------|
| **Padding** | 2D arrays with column access |
| **Access restructuring** | When you can change the algorithm |
| **Sequential addressing** | Reductions, scans |
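
To see why sequential addressing suits reductions, the sketch below models the shared-memory reads of the first warp for two indexing schemes. It assumes 256 threads per block and 32 banks of 4-byte words. `sequential_reads` mirrors the `shared[tid + stride]` pattern of `reduce_sum_conflict_free`; `interleaved_reads` is a hypothetical alternative (`shared[2*s*tid + s]`) introduced only for comparison and does not appear elsewhere in this notebook.

```python
from collections import defaultdict

BLOCK = 256          # threads per block, matching the reduction kernel
WARP = range(32)     # model the first warp only

def degree(word_indices, num_banks=32):
    """Most distinct words any single bank must serve for one warp access."""
    per_bank = defaultdict(set)
    for w in word_indices:
        per_bank[w % num_banks].add(w)
    return max((len(ws) for ws in per_bank.values()), default=0)

def sequential_reads(stride):
    """Words read as shared[tid + stride] by threads with tid < stride."""
    return [t + stride for t in WARP if t < stride]

def interleaved_reads(s):
    """Words read as shared[2*s*tid + s] by threads with 2*s*tid < BLOCK."""
    return [2 * s * t + s for t in WARP if 2 * s * t < BLOCK]

steps = [1, 2, 4, 8, 16, 32, 64, 128]
seq = [degree(sequential_reads(s)) for s in steps]
inter = [degree(interleaved_reads(s)) for s in steps]

for s, a, b in zip(steps, seq, inter):
    print(f"step {s:3d}: sequential {a}-way, interleaved {b}-way")
```

In this model the sequential scheme stays at 1-way for every step, while the interleaved scheme reaches 8-way in the middle steps. Only the first warp is modeled, so treat the numbers as an illustration of the pattern rather than a timing prediction.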

### Quick Check:
- Stride 1: No conflict ✅
- Stride 2,4,8,16: 2,4,8,16-way conflict ❌
- Stride 32: 32-way conflict ❌❌❌
- Stride 33: No conflict! (33 % 32 = 1) ✅
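
The list above follows one rule: for a nonzero stride of `s` words and 32 banks, a full warp sees a `gcd(s, 32)`-way conflict. The rule is not stated elsewhere in this notebook, so the sketch below checks it by brute force rather than assuming it; `measured_degree` is a hypothetical helper name.

```python
import math

def measured_degree(stride, num_banks=32, warp_size=32):
    """Count how many threads of one warp land on the busiest bank."""
    hits = [0] * num_banks
    for tid in range(warp_size):
        hits[(tid * stride) % num_banks] += 1
    return max(hits)

for stride in (1, 2, 4, 8, 16, 32, 33):
    degree = measured_degree(stride)
    assert degree == math.gcd(stride, 32)   # brute force agrees with the gcd rule
    label = "no conflict" if degree == 1 else f"{degree}-way conflict"
    print(f"stride {stride:2d}: {label}")
```

Any odd stride is coprime with 32, which is why padding a 32-wide tile to 33 works.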

---

### 📚 Next Up: Day 4 - Special Memory Types
- Constant memory for read-only data
- Texture memory for spatial locality
- When to use each memory type

---

### 🔗 Resources
- [Device Memory Access](../../cuda-programming-guide/03-advanced/device-memory-access.md)
- [Performance Optimization](../../cuda-programming-guide/03-advanced/performance-optimization.md)