# 🚀 Day 1: Grid-Stride Loops - The Professional Pattern

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sdodlapati3/cuda-lab/blob/main/learning-path/week-03/day-1-grid-stride-loops.ipynb)

## Learning Objectives
- Understand the limitations of the naive "one thread per element" approach
- Master the grid-stride loop pattern for production code
- Handle arbitrary data sizes with fixed launch configurations
- Apply 1D and 2D grid-stride patterns

> **Primary Focus:** CUDA C++ code examples first, Python/Numba backup for interactive testing

---

In [None]:
# ⚙️ Colab/Local Setup - Run this first!
import subprocess, sys
try:
    import google.colab
    print("🔧 Running on Google Colab - Installing dependencies...")
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", "numba"])
    print("✅ Setup complete!")
except ImportError:
    print("💻 Running locally - make sure you have: pip install numba numpy")

import numpy as np
from numba import cuda
import math
import time

# Verify GPU
print(f"\nCUDA available: {cuda.is_available()}")
if cuda.is_available():
    device = cuda.get_current_device()
    print(f"Device: {device.name}")
    print(f"Compute capability: {device.compute_capability}")
    print(f"Max threads per block: {device.MAX_THREADS_PER_BLOCK}")
    print(f"Max blocks per grid: {device.MAX_GRID_DIM_X}")

---

## Part 1: The Problem with Naive Kernels

### Naive Approach: One Thread Per Element

The simplest approach assigns exactly one thread to each element.

### CUDA C++ Implementation (Primary)

Compile and run:
```bash
nvcc -arch=sm_75 -o naive_add naive_add.cu
./naive_add
```

### Python/Numba (Optional - Interactive Testing)

In [None]:
%%writefile naive_add.cu
// naive_add.cu - Naive vector addition
#include <stdio.h>
#include <cuda_runtime.h>

#define CUDA_CHECK(call) \
    do { \
        cudaError_t err = call; \
        if (err != cudaSuccess) { \
            fprintf(stderr, "CUDA error at %s:%d: %s\n", \
                    __FILE__, __LINE__, cudaGetErrorString(err)); \
            exit(1); \
        } \
    } while(0)

// Naive kernel: one thread handles exactly one element
__global__ void naiveAdd(const float* a, const float* b, float* out, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    
    if (tid < n) {  // Bounds check
        out[tid] = a[tid] + b[tid];
    }
}

int main() {
    int n = 1000000;
    size_t size = n * sizeof(float);
    
    // Allocate host memory
    float *h_a, *h_b, *h_out;
    h_a = (float*)malloc(size);
    h_b = (float*)malloc(size);
    h_out = (float*)malloc(size);
    
    // Initialize data
    for (int i = 0; i < n; i++) {
        h_a[i] = 1.0f;
        h_b[i] = 2.0f;
    }
    
    // Allocate device memory
    float *d_a, *d_b, *d_out;
    CUDA_CHECK(cudaMalloc(&d_a, size));
    CUDA_CHECK(cudaMalloc(&d_b, size));
    CUDA_CHECK(cudaMalloc(&d_out, size));
    
    // Copy to device
    CUDA_CHECK(cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice));
    
    // Launch configuration
    int threadsPerBlock = 256;
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
    
    printf("Launching with %d blocks, %d threads/block\n", blocksPerGrid, threadsPerBlock);
    
    naiveAdd<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_out, n);
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaDeviceSynchronize());
    
    // Copy back and verify
    CUDA_CHECK(cudaMemcpy(h_out, d_out, size, cudaMemcpyDeviceToHost));
    printf("Result[0] = %f (expected 3.0)\n", h_out[0]);
    
    // Cleanup
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_out);
    free(h_a); free(h_b); free(h_out);
    
    return 0;
}

In [None]:
!nvcc -arch=sm_75 -o naive_add naive_add.cu
!./naive_add

In [None]:
# Python equivalent for interactive testing
@cuda.jit
def naive_add(a, b, out, n):
    """Naive: One thread handles exactly one element."""
    tid = cuda.grid(1)
    
    if tid < n:  # Bounds check
        out[tid] = a[tid] + b[tid]

### Problems with Naive Approach

```
Problem 1: Block Size Dependency
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
N = 1000 elements
Block size = 256
Blocks needed = ceil(1000/256) = 4
Total threads = 4 × 256 = 1024
Wasted threads = 1024 - 1000 = 24 (2.4%)

Problem 2: Large Data
━━━━━━━━━━━━━━━━━━━━━
N = 1 billion elements
Block size = 256
Blocks needed = ceil(1B/256) = 3,906,250
Max blocks in x on older GPUs (compute capability 2.x) = 65,535
  (still the limit for the y and z grid dimensions on current GPUs)
❌ FAILS for very large data on those GPUs!

Problem 3: Launch Config Tied to N
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Every different N requires a different grid size.
Can't tune for occupancy independently of data size.
```
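
The arithmetic in Problems 1 and 2 can be reproduced in a few lines of plain Python. This is a minimal CPU-side sketch; `naive_launch` is a hypothetical helper name used only for illustration, not part of CUDA or Numba.

```python
def naive_launch(n, threads_per_block=256):
    """Return (blocks, total_threads, idle_threads) for a one-thread-per-element launch."""
    # Integer ceiling division: the same formula the naive kernel uses for its grid size
    blocks = (n + threads_per_block - 1) // threads_per_block
    total_threads = blocks * threads_per_block
    return blocks, total_threads, total_threads - n

for n in (1000, 1_000_000_000):
    blocks, total, idle = naive_launch(n)
    print(f"N={n:,}: blocks={blocks:,}, threads={total:,}, idle={idle}")
```

The block count grows linearly with `n`, which is exactly the dependency the grid-stride pattern removes.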

In [None]:
# Demonstrate the limitation
device = cuda.get_current_device()
max_blocks = device.MAX_GRID_DIM_X
threads_per_block = 256
max_elements_naive = max_blocks * threads_per_block

print(f"Max blocks in X dimension: {max_blocks:,}")
print(f"Threads per block: {threads_per_block}")
print(f"Max elements with naive approach: {max_elements_naive:,}")
print(f"That's {max_elements_naive / 1e9:.2f} billion elements on this device")
print(f"With the older 65,535-block limit: {65_535 * threads_per_block:,} elements")
print(f"\nModern datasets can have billions of elements!")

---

## Part 2: The Grid-Stride Loop Pattern

### The Solution: Each Thread Processes Multiple Elements

```
Grid-Stride Loop Concept:
━━━━━━━━━━━━━━━━━━━━━━━━━

Data: [0][1][2][3][4][5][6][7][8][9][10][11][12][13][14][15]...
       │  │  │  │  │  │  │  │  │   │   │   │   │   │   │   │
Grid:  T0 T1 T2 T3 T0 T1 T2 T3 T0  T1  T2  T3  T0  T1  T2  T3
       \________/  \________/  \_________/   \__________/
        Pass 1      Pass 2       Pass 3         Pass 4

Each thread processes elements at: tid, tid+gridsize, tid+2*gridsize, ...
```
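
The indexing rule in the diagram can be checked on the CPU before any GPU code is written. The sketch below emulates each GPU thread with an ordinary loop iteration and counts how often every element is visited; `visit_counts` is a hypothetical helper, and the sizes are arbitrary test values.

```python
def visit_counts(n, total_threads):
    """Emulate a 1D grid-stride loop on the CPU and count visits per element."""
    counts = [0] * n
    for tid in range(total_threads):            # one iteration per simulated GPU thread
        for i in range(tid, n, total_threads):  # tid, tid+stride, tid+2*stride, ...
            counts[i] += 1
    return counts

# 4 threads over 16 elements (as in the diagram), plus two awkward sizes
for n in (16, 3, 1001):
    assert all(c == 1 for c in visit_counts(n, total_threads=4))
print("every element visited exactly once")
```

The check also passes when `n` is smaller than the thread count: the surplus threads simply run zero loop iterations, so no separate bounds check is needed.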

### CUDA C++ Implementation (Primary)

**Key Difference:** 
- `gridDim.x * blockDim.x` = total threads in grid
- Loop `for (i = tid; i < n; i += stride)` processes multiple elements per thread

### CUDA C++ vs Python Comparison

| CUDA C++ | Python/Numba |
|----------|--------------|
| `blockIdx.x * blockDim.x + threadIdx.x` | `cuda.grid(1)` |
| `blockDim.x * gridDim.x` | `cuda.gridsize(1)` |

### Python/Numba (Optional - Interactive Testing)

In [None]:
%%writefile grid_stride_add.cu
// grid_stride_add.cu - Professional grid-stride pattern
#include <stdio.h>
#include <cuda_runtime.h>

// Grid-stride kernel: each thread processes multiple elements
__global__ void gridStrideAdd(const float* a, const float* b, float* out, int n) {
    // Global thread ID
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    // Total number of threads in the grid
    int stride = blockDim.x * gridDim.x;
    
    // Each thread processes elements at tid, tid+stride, tid+2*stride, ...
    for (int i = tid; i < n; i += stride) {
        out[i] = a[i] + b[i];
    }
}

int main() {
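    int n = 100000000;  // 100M elements - any n that fits in an int works
    size_t size = n * sizeof(float);
    
    // Allocate
    float *d_a, *d_b, *d_out;
    cudaMalloc(&d_a, size);
    cudaMalloc(&d_b, size);
    cudaMalloc(&d_out, size);
    
    // Zero the inputs so the kernel reads defined values
    cudaMemset(d_a, 0, size);
    cudaMemset(d_b, 0, size);
    
    // Fixed launch config - works for ANY data size
    int threadsPerBlock = 256;
    int blocksPerGrid = 256;  // Fixed, not dependent on n!
    
    printf("Processing %d elements with %d total threads\n", 
           n, threadsPerBlock * blocksPerGrid);
    printf("Each thread handles ~%d elements\n", 
           n / (threadsPerBlock * blocksPerGrid));
    
    gridStrideAdd<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_out, n);
    cudaError_t err = cudaGetLastError();                   // Catch launch failures
    if (err == cudaSuccess) err = cudaDeviceSynchronize();  // Catch execution failures
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
    }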
    
    // Cleanup
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_out);
    return 0;
}

In [None]:
!nvcc -arch=sm_75 -o grid_stride_add grid_stride_add.cu
!./grid_stride_add

In [None]:
# Python equivalent for interactive testing
@cuda.jit
def grid_stride_add(a, b, out, n):
    """Grid-stride loop: Each thread processes multiple elements."""
    tid = cuda.grid(1)           # Global thread ID (= blockIdx.x * blockDim.x + threadIdx.x)
    stride = cuda.gridsize(1)    # Total threads (= blockDim.x * gridDim.x)
    
    # Each thread processes elements at tid, tid+stride, tid+2*stride, ...
    for i in range(tid, n, stride):
        out[i] = a[i] + b[i]

### Key Functions

| Function | Returns | Description |
|----------|---------|-------------|
| `cuda.grid(1)` | int | Global thread ID (1D) |
| `cuda.gridsize(1)` | int | Total threads in grid = blocks × threads_per_block |
| `cuda.grid(2)` | (x, y) | Global thread ID (2D) |
| `cuda.gridsize(2)` | (x, y) | Total threads in each dimension |
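
To make the table concrete, the following plain-Python sketch spells out the arithmetic behind these calls for a single thread. The helper names (`grid_1d`, `gridsize_1d`, `grid_2d`, `gridsize_2d`) are hypothetical and exist only for this illustration; inside a real kernel you would call `cuda.grid` and `cuda.gridsize` directly.

```python
def grid_1d(block_idx, block_dim, thread_idx):
    """Global 1D thread ID: blockIdx.x * blockDim.x + threadIdx.x."""
    return block_idx * block_dim + thread_idx

def gridsize_1d(block_dim, grid_dim):
    """Total threads along one dimension: blockDim.x * gridDim.x."""
    return block_dim * grid_dim

def grid_2d(block_idx, block_dim, thread_idx):
    """Global 2D thread ID; every argument is an (x, y) tuple."""
    return (grid_1d(block_idx[0], block_dim[0], thread_idx[0]),
            grid_1d(block_idx[1], block_dim[1], thread_idx[1]))

def gridsize_2d(block_dim, grid_dim):
    """Total threads in x and in y; both arguments are (x, y) tuples."""
    return (gridsize_1d(block_dim[0], grid_dim[0]),
            gridsize_1d(block_dim[1], grid_dim[1]))

print(grid_1d(block_idx=1, block_dim=4, thread_idx=2))  # 6
print(gridsize_1d(block_dim=4, grid_dim=2))             # 8
print(grid_2d((1, 0), (16, 16), (3, 5)))                # (19, 5)
print(gridsize_2d((16, 16), (32, 32)))                  # (512, 512)
```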

In [None]:
# Visualize grid-stride pattern
@cuda.jit
def show_grid_stride(output, n):
    """Store which thread processed each element."""
    tid = cuda.grid(1)
    stride = cuda.gridsize(1)
    
    for i in range(tid, n, stride):
        output[i] = tid  # Store thread ID

# Small example
n = 20
blocks = 2
threads = 4
total_threads = blocks * threads

output = np.zeros(n, dtype=np.int32)
d_output = cuda.to_device(output)

show_grid_stride[blocks, threads](d_output, n)
result = d_output.copy_to_host()

print(f"Configuration: {blocks} blocks × {threads} threads = {total_threads} total threads")
print(f"Processing {n} elements")
print(f"\nElement index:  {list(range(n))}")
print(f"Processed by:   {list(result)}")
print(f"\nEach thread processes {n // total_threads} elements (plus remainder)")

---

## Part 3: Benefits of Grid-Stride Loops

### Benefit 1: Handle ANY Data Size

In [None]:
# Same kernel config works for different sizes
def test_sizes(kernel, sizes, blocks=256, threads=256):
    """Test kernel with various data sizes."""
    for n in sizes:
        a = np.random.rand(n).astype(np.float32)
        b = np.random.rand(n).astype(np.float32)
        out = np.zeros(n, dtype=np.float32)
        
        d_a = cuda.to_device(a)
        d_b = cuda.to_device(b)
        d_out = cuda.to_device(out)
        
        kernel[blocks, threads](d_a, d_b, d_out, n)
        result = d_out.copy_to_host()
        
        # Verify
        expected = a + b
        is_correct = np.allclose(result, expected)
        status = "✓" if is_correct else "✗"
        print(f"{status} N = {n:>12,} | Blocks = {blocks}, Threads = {threads}")

# Test grid-stride with various sizes
print("Grid-Stride Loop (same config for all sizes):")
test_sizes(grid_stride_add, [100, 1000, 10000, 100000, 1000000, 10000000])

### Benefit 2: Tune for Occupancy, Not Data Size

In [None]:
# Different launch configs, same result
n = 1_000_000
a = np.random.rand(n).astype(np.float32)
b = np.random.rand(n).astype(np.float32)
out = np.zeros(n, dtype=np.float32)

d_a = cuda.to_device(a)
d_b = cuda.to_device(b)
d_out = cuda.to_device(out)

configs = [
    (32, 64),    # Few threads
    (128, 128),  # Moderate
    (256, 256),  # Typical
    (512, 512),  # Many threads (not necessarily faster)
]

print(f"Testing different configs with N = {n:,}\n")
for blocks, threads in configs:
    start = time.perf_counter()
    for _ in range(100):
        grid_stride_add[blocks, threads](d_a, d_b, d_out, n)
    cuda.synchronize()
    elapsed = (time.perf_counter() - start) / 100 * 1000
    
    total_threads = blocks * threads
    elements_per_thread = n / total_threads
    print(f"Blocks={blocks:3}, Threads={threads:3} | "
          f"Total={total_threads:>7,} | "
          f"Elem/Thread={elements_per_thread:>6.1f} | "
          f"Time={elapsed:.3f}ms")

### Benefit 3: Better Memory Access Patterns

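Grid-stride loops naturally provide **coalesced memory access**: on every pass, consecutive threads in a warp touch consecutive elements.

```
Pass 1: Threads 0,1,2,...,31 access elements 0,1,2,...,31  ← Coalesced!
Pass 2: Threads 0,1,2,...,31 access elements stride, stride+1, ..., stride+31  ← Coalesced!
        (stride = total threads in the grid, e.g. 256 × 256 = 65,536)
...
```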
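
The coalescing claim can be illustrated by listing the element indices one warp touches at each step. The sketch below compares the grid-stride layout with an alternative in which each thread owns a contiguous chunk; the chunked layout and all helper names are assumptions introduced for contrast, not something from the kernels above.

```python
WARP_SIZE = 32

def warp_indices_grid_stride(pass_idx, stride):
    """Elements touched by threads 0..31 on a given pass of a grid-stride loop."""
    return [tid + pass_idx * stride for tid in range(WARP_SIZE)]

def warp_indices_chunked(step, chunk):
    """Elements touched by threads 0..31 when each thread owns a contiguous chunk."""
    return [tid * chunk + step for tid in range(WARP_SIZE)]

def is_contiguous(indices):
    """True when neighbouring threads touch neighbouring elements."""
    return all(b - a == 1 for a, b in zip(indices, indices[1:]))

stride = 256 * 256  # total threads for a 256-block, 256-thread launch
print(is_contiguous(warp_indices_grid_stride(0, stride)))  # True
print(is_contiguous(warp_indices_grid_stride(1, stride)))  # True
print(is_contiguous(warp_indices_chunked(0, chunk=16)))    # False
```

In the chunked layout, neighbouring threads are `chunk` elements apart at every step, so a warp's loads are spread over many memory segments instead of one contiguous range.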

---

## Part 4: Common Patterns

### Pattern 1: Basic 1D Grid-Stride (CUDA C++)

```cpp
// 1D Grid-Stride Template
__global__ void gridStride1D(float* data, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    
    for (int i = tid; i < n; i += stride) {
        data[i] = data[i] * 2.0f;  // Some operation
    }
}
```

### Pattern 2: 2D Grid-Stride (CUDA C++)

```cpp
// 2D Grid-Stride for images/matrices
__global__ void gridStride2D(float* data, int height, int width) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int strideX = blockDim.x * gridDim.x;
    int strideY = blockDim.y * gridDim.y;
    
    for (int row = y; row < height; row += strideY) {
        for (int col = x; col < width; col += strideX) {
            int idx = row * width + col;
            data[idx] = data[idx] * 2.0f;
        }
    }
}

// Launch with 2D config:
// dim3 threads(16, 16);
// dim3 blocks(32, 32);
// gridStride2D<<<blocks, threads>>>(d_data, height, width);
```
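
The nested 2D loops can be sanity-checked on the CPU in the same spirit. This sketch treats `stride_x` and `stride_y` as the total thread counts in each dimension and verifies that every pixel is written exactly once, including when the image size is not a multiple of the grid size. `visit_counts_2d` is a hypothetical helper and the sizes are arbitrary.

```python
def visit_counts_2d(height, width, stride_x, stride_y):
    """Emulate a 2D grid-stride kernel and count visits per pixel (row-major)."""
    counts = [0] * (height * width)
    for y in range(stride_y):                        # simulated global thread y
        for x in range(stride_x):                    # simulated global thread x
            for row in range(y, height, stride_y):
                for col in range(x, width, stride_x):
                    counts[row * width + col] += 1   # same flattening as idx above
    return counts

counts = visit_counts_2d(height=5, width=7, stride_x=4, stride_y=2)
print(all(c == 1 for c in counts))  # True
```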

### Pattern 3: Grid-Stride with Local Accumulation (CUDA C++)

```cpp
// Each thread computes partial sum of its elements
__global__ void gridStrideSum(const float* data, float* partialSums, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    
    float localSum = 0.0f;
    for (int i = tid; i < n; i += stride) {
        localSum += data[i];
    }
    
    // Store partial sum (will need reduction to complete).
    // partialSums must hold one float per launched thread (gridDim.x * blockDim.x).
    partialSums[tid] = localSum;
}
```

### Python/Numba Equivalents (Optional)

In [None]:
# Pattern 1: Basic 1D Grid-Stride (Python)
@cuda.jit
def grid_stride_1d(data, n):
    """Basic 1D grid-stride pattern."""
    tid = cuda.grid(1)
    stride = cuda.gridsize(1)
    
    for i in range(tid, n, stride):
        data[i] = data[i] * 2  # Some operation

### Pattern 2: 2D Grid-Stride (Python)

In [None]:
@cuda.jit
def grid_stride_2d(data, height, width):
    """2D grid-stride for images/matrices."""
    start_x, start_y = cuda.grid(2)
    stride_x, stride_y = cuda.gridsize(2)
    
    for y in range(start_y, height, stride_y):
        for x in range(start_x, width, stride_x):
            data[y, x] = data[y, x] * 2

# Test 2D grid-stride
height, width = 1000, 1000
data = np.random.rand(height, width).astype(np.float32)
d_data = cuda.to_device(data)

threads = (16, 16)
blocks = (32, 32)  # Can be any size, not dependent on image size

grid_stride_2d[blocks, threads](d_data, height, width)
result = d_data.copy_to_host()

print(f"Image: {height}x{width} = {height*width:,} pixels")
print(f"Grid: {blocks[0]*blocks[1]*threads[0]*threads[1]:,} threads")
print(f"Correct: {np.allclose(result, data * 2)}")

### Pattern 3: Grid-Stride with Local Accumulation (Python)

In [None]:
@cuda.jit
def grid_stride_sum(data, partial_sums, n):
    """Each thread computes partial sum of its elements."""
    tid = cuda.grid(1)
    stride = cuda.gridsize(1)
    
    local_sum = 0.0
    for i in range(tid, n, stride):
        local_sum += data[i]
    
    # Store partial sum (will need reduction to complete)
    partial_sums[tid] = local_sum

# Test
n = 1_000_000
data = np.random.rand(n).astype(np.float32)
blocks, threads = 256, 256
total_threads = blocks * threads

d_data = cuda.to_device(data)
d_partial = cuda.device_array(total_threads, dtype=np.float32)

grid_stride_sum[blocks, threads](d_data, d_partial, n)
partial = d_partial.copy_to_host()

gpu_sum = partial.sum()  # Final reduction on CPU
cpu_sum = data.sum()

print(f"CPU sum: {cpu_sum:.6f}")
print(f"GPU sum: {gpu_sum:.6f}")
print(f"Match: {np.isclose(cpu_sum, gpu_sum, rtol=1e-4)}")
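
The two-stage structure (per-thread partial sums, then a final reduction) can also be followed without a GPU. Below is a minimal CPU emulation; `partial_sums` is a hypothetical helper, and the input is chosen so the expected total is easy to verify by hand.

```python
def partial_sums(data, total_threads):
    """Emulate the kernel: each simulated thread sums its grid-stride elements."""
    n = len(data)
    partial = [0.0] * total_threads              # one slot per thread
    for tid in range(total_threads):
        for i in range(tid, n, total_threads):
            partial[tid] += data[i]
    return partial

data = [float(i) for i in range(1, 101)]         # 1 + 2 + ... + 100 = 5050
partial = partial_sums(data, total_threads=8)
print(len(partial), sum(partial))                # 8 5050.0
```

Threads with no elements to process leave their slot at zero, so the final reduction stays correct even when the grid has more threads than there are elements.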

---

## Part 5: Optimal Launch Configuration

### Guidelines for Choosing Blocks and Threads

In [None]:
device = cuda.get_current_device()

print("Device Properties:")
print(f"  Max threads per block: {device.MAX_THREADS_PER_BLOCK}")
print(f"  Warp size: {device.WARP_SIZE}")
print(f"  Max block dim X (threads): {device.MAX_BLOCK_DIM_X}")
print(f"  Multiprocessors: {device.MULTIPROCESSOR_COUNT}")

# Recommended config
threads = 256  # Multiple of warp size (32)
blocks = device.MULTIPROCESSOR_COUNT * 4  # Enough for good occupancy

print(f"\nRecommended starting config:")
print(f"  Threads per block: {threads}")
print(f"  Blocks: {blocks}")
print(f"  Total threads: {blocks * threads:,}")

### Rules of Thumb

```
Thread count per block:
━━━━━━━━━━━━━━━━━━━━━━━
• Always multiple of 32 (warp size)
• 128-256 is usually good
• 512 for compute-heavy kernels
• Max 1024 on most GPUs

Block count:
━━━━━━━━━━━━
• At least SMs × 2 (hide latency)
• SMs × 4 to SMs × 8 is often optimal
• More blocks = more flexibility for scheduler

With grid-stride, you can always use:
  blocks = SM_count * 4
  threads = 256
And it will work for any data size!
```
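
These rules of thumb can be wrapped in a small helper. The sketch below is one possible starting point rather than a definitive heuristic: `launch_config` is a hypothetical function, `sm_count=40` is just an example value, and capping the block count for small inputs is an extra assumption that the rules above do not state.

```python
def launch_config(n, sm_count, threads_per_block=256, blocks_per_sm=4):
    """Return (blocks, threads_per_block, elements_per_thread) for a grid-stride launch."""
    blocks = sm_count * blocks_per_sm
    # Assumption: do not launch more blocks than the data can fill
    blocks_needed = (n + threads_per_block - 1) // threads_per_block
    blocks = max(1, min(blocks, blocks_needed))
    total_threads = blocks * threads_per_block
    # Worst-case loop iterations for any single thread
    elements_per_thread = (n + total_threads - 1) // total_threads
    return blocks, threads_per_block, elements_per_thread

print(launch_config(1_000_000, sm_count=40))  # (160, 256, 25)
print(launch_config(1_000, sm_count=40))      # (4, 256, 1)
```

Treat the result as a baseline to benchmark against, not a guaranteed optimum for every kernel.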

---

## Part 6: Benchmarking

In [None]:
def benchmark_add(n, iterations=100):
    """Compare naive vs grid-stride for vector addition."""
    a = np.random.rand(n).astype(np.float32)
    b = np.random.rand(n).astype(np.float32)
    out = np.zeros(n, dtype=np.float32)
    
    d_a = cuda.to_device(a)
    d_b = cuda.to_device(b)
    d_out = cuda.to_device(out)
    
    # Naive config: one thread per element
    threads_naive = 256
    blocks_naive = (n + threads_naive - 1) // threads_naive
    
    # Grid-stride config: fixed size
    threads_gs = 256
    blocks_gs = 256
    
    # Warmup
    naive_add[blocks_naive, threads_naive](d_a, d_b, d_out, n)
    grid_stride_add[blocks_gs, threads_gs](d_a, d_b, d_out, n)
    cuda.synchronize()
    
    # Benchmark naive
    start = time.perf_counter()
    for _ in range(iterations):
        naive_add[blocks_naive, threads_naive](d_a, d_b, d_out, n)
    cuda.synchronize()
    naive_time = (time.perf_counter() - start) / iterations * 1000
    
    # Benchmark grid-stride
    start = time.perf_counter()
    for _ in range(iterations):
        grid_stride_add[blocks_gs, threads_gs](d_a, d_b, d_out, n)
    cuda.synchronize()
    gs_time = (time.perf_counter() - start) / iterations * 1000
    
    # NumPy baseline
    start = time.perf_counter()
    for _ in range(iterations):
        _ = a + b
    numpy_time = (time.perf_counter() - start) / iterations * 1000
    
    return naive_time, gs_time, numpy_time, blocks_naive

print(f"{'N':>12} | {'Naive (ms)':>10} | {'Grid-Stride':>11} | {'NumPy':>10} | {'GPU Speedup':>11}")
print("-" * 65)

for n in [10_000, 100_000, 1_000_000, 10_000_000]:
    naive_t, gs_t, np_t, blocks = benchmark_add(n)
    speedup = np_t / gs_t
    print(f"{n:>12,} | {naive_t:>10.3f} | {gs_t:>11.3f} | {np_t:>10.3f} | {speedup:>10.1f}x")
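
One way to put the timings in context is to convert them into effective memory bandwidth, since vector addition reads two arrays and writes one. This is a hedged sketch: `effective_bandwidth_gbs` is a hypothetical helper, and the 0.5 ms figure is a made-up input for illustration, not a measured result.

```python
def effective_bandwidth_gbs(n, elapsed_ms, bytes_per_element=4, arrays_touched=3):
    """Effective bandwidth in GB/s for an element-wise kernel over float32 arrays."""
    bytes_moved = n * bytes_per_element * arrays_touched  # 2 reads + 1 write per element
    seconds = elapsed_ms / 1000
    return bytes_moved / seconds / 1e9

# Illustrative input only: 10M elements in a hypothetical 0.5 ms
print(f"{effective_bandwidth_gbs(10_000_000, elapsed_ms=0.5):.1f} GB/s")  # 240.0 GB/s
```

Comparing this number against the device's peak memory bandwidth gives a rough sense of how close a memory-bound kernel is to the hardware limit.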

---

## Exercises

Complete these exercises in CUDA C++ first, then optionally test with Python.

### Exercise 1: Vector Scaling (CUDA C++)

```cpp
// TODO: Implement grid-stride vector scaling
// File: vector_scale.cu

__global__ void vectorScale(float* data, float scalar, int n) {
    // TODO: Implement grid-stride loop
    // Multiply every element by scalar: data[i] *= scalar
}

int main() {
    // Test with data = [1, 2, 3, 4, 5], scalar = 3
    // Expected: [3, 6, 9, 12, 15]
    return 0;
}
```

### Exercise 2: Vector Square Root (CUDA C++)

```cpp
// TODO: Apply sqrt to every element
// File: vector_sqrt.cu
#include <math.h>

__global__ void vectorSqrt(const float* input, float* output, int n) {
    // TODO: Implement grid-stride loop with sqrtf()
}
```

### Exercise 3: Conditional Processing (CUDA C++)

```cpp
// TODO: Double only positive values
// File: double_positives.cu

__global__ void doublePositives(float* data, int n) {
    // TODO: Double positive elements, leave others unchanged
    // Input:  [-2, -1, 0, 1, 2]
    // Output: [-2, -1, 0, 2, 4]
}
```

### Exercise 4: 2D Brightness Adjustment (CUDA C++)

```cpp
// TODO: Adjust image brightness
// File: brightness.cu

__global__ void adjustBrightness(float* image, float factor, int height, int width) {
    // TODO: Use 2D grid-stride pattern
    // Multiply all pixels by factor, clamp to [0, 1]
    // Clamp: fminf(1.0f, fmaxf(0.0f, value))
}
```

### Python/Numba Practice (Optional)

In [None]:
# Exercise 1: Vector Scaling (Python)
@cuda.jit
def vector_scale(data, scalar, n):
    """Multiply every element by scalar: data[i] *= scalar"""
    # TODO: Your implementation here
    pass

# Test
# data = np.array([1, 2, 3, 4, 5], dtype=np.float32)
# Expected after scale by 3: [3, 6, 9, 12, 15]

### Exercise 2: Vector Square Root (Python)

In [None]:
# TODO: Implement grid-stride sqrt
@cuda.jit
def vector_sqrt(input_data, output_data, n):
    """Compute sqrt of every element."""
    # Hint: Use math.sqrt(x) inside the kernel
    pass

# Test with input = [1, 4, 9, 16, 25]
# Expected output = [1, 2, 3, 4, 5]

### Exercise 3: Conditional Processing (Python)

In [None]:
# TODO: Double only positive values
@cuda.jit
def double_positives(data, n):
    """Double the value of positive elements, leave others unchanged."""
    # Your implementation here
    pass

# Test with input = [-2, -1, 0, 1, 2]
# Expected output = [-2, -1, 0, 2, 4]

### Exercise 4: 2D Brightness Adjustment (Python)

In [None]:
# TODO: Adjust image brightness
@cuda.jit
def adjust_brightness(image, factor, height, width):
    """Multiply all pixel values by factor, clamping to [0, 1]."""
    # Use 2D grid-stride pattern
    # Clamp: max(0, min(1, value))
    pass

# Test with a 100x100 image

---

## Summary

### CUDA C++ Grid-Stride Template

```cpp
// 1D Grid-Stride
__global__ void kernel1D(float* data, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    
    for (int i = tid; i < n; i += stride) {
        data[i] = process(data[i]);
    }
}

// 2D Grid-Stride
__global__ void kernel2D(float* data, int height, int width) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int strideX = blockDim.x * gridDim.x;
    int strideY = blockDim.y * gridDim.y;
    
    for (int row = y; row < height; row += strideY) {
        for (int col = x; col < width; col += strideX) {
            int idx = row * width + col;
            data[idx] = process(data[idx]);
        }
    }
}
```

### Python/Numba Equivalent (Optional Reference)

```python
@cuda.jit
def kernel_1d(data, n):
    tid = cuda.grid(1)
    stride = cuda.gridsize(1)
    for i in range(tid, n, stride):
        data[i] = process(data[i])

@cuda.jit
def kernel_2d(data, height, width):
    x, y = cuda.grid(2)
    stride_x, stride_y = cuda.gridsize(2)
    for row in range(y, height, stride_y):
        for col in range(x, width, stride_x):
            data[row, col] = process(data[row, col])
```

### Key Benefits
1. ✓ Handle any data size with fixed launch config
2. ✓ Tune occupancy independently of data
3. ✓ Natural coalesced access
4. ✓ Professional, reusable pattern

### Recommended Config
```cpp
int threadsPerBlock = 256;  // Multiple of 32
int blocksPerGrid = numSMs * 4;  // Good occupancy
```

---

## Next Steps

📋 **Day 2:** Element-wise vector operations (add, sub, mul, div, math functions)

We'll use grid-stride loops as the foundation for all our vector operations!