# GPU Optimization Step-by-Step
This notebook demonstrates GPU optimization techniques using CUDA and `pycuda`. We'll compare an **unoptimized implementation** with step-by-step optimizations and measure the improvements in performance.

## Learning Goals
- Understand the impact of unoptimized global memory access.
- Apply memory optimization techniques (e.g., shared memory).
- Optimize thread and block management to maximize GPU occupancy.
- Compare execution times and understand performance improvements.

## Steps
After installing the required libraries, the notebook proceeds in three steps:

1. Implement an unoptimized GPU kernel.
2. Add memory optimizations.
3. Optimize thread and block configurations.

Each step measures its execution time so the versions can be compared.

In [None]:
!pip install pycuda

## Step 1: Unoptimized Kernel
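This kernel performs element-wise matrix addition, with the matrices stored as flat arrays, using **direct global memory access**: each thread reads its two operands from global memory and writes the result back, with no staging in on-chip memory. It serves as the baseline for the later steps.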
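To make the indexing concrete, here is a minimal CPU-side sketch in NumPy that mimics how the kernel maps threads to elements. The function name `emulate_vector_add` and the arrays `x`, `y`, `z` are hypothetical, and the loop stands in for blocks that the GPU would run in parallel. It illustrates the index arithmetic only and says nothing about performance.

```python
import numpy as np

def emulate_vector_add(a, b, block_size=256):
    """CPU emulation of the kernel's index mapping (hypothetical helper)."""
    n = a.size
    # Ceiling division: enough blocks to cover all n elements
    grid_size = (n + block_size - 1) // block_size
    c = np.zeros_like(a)
    for block_idx in range(grid_size):
        # Global IDs of every thread in this block
        gid = block_idx * block_size + np.arange(block_size)
        # Bounds guard, like `if (gid < N)` in the kernel
        gid = gid[gid < n]
        c[gid] = a[gid] + b[gid]
    return c

x = np.arange(1000, dtype=np.float32)
y = np.ones(1000, dtype=np.float32)
z = emulate_vector_add(x, y)
print(np.array_equal(z, x + y))
```

With 1000 elements and 256 threads per block, the last block has threads whose global ID falls past the end of the array, which is why the bounds guard is needed.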

In [None]:
import pycuda.driver as cuda
import pycuda.autoinit
import numpy as np
from pycuda.compiler import SourceModule
import time

# Define unoptimized kernel
unoptimized_kernel = """
__global__ void matrix_add_unoptimized(float *a, float *b, float *c, int N) {
    int gid = blockIdx.x * blockDim.x + threadIdx.x;  // Global thread ID
    if (gid < N) {
        c[gid] = a[gid] + b[gid];
    }
}
"""

# Compile the unoptimized kernel
mod = SourceModule(unoptimized_kernel)
matrix_add_unoptimized = mod.get_function("matrix_add_unoptimized")

# Initialize data
N = 1024 * 1024
a = np.random.rand(N).astype(np.float32)
b = np.random.rand(N).astype(np.float32)
c = np.zeros_like(a)

# Allocate device memory
a_gpu = cuda.mem_alloc(a.nbytes)
b_gpu = cuda.mem_alloc(b.nbytes)
c_gpu = cuda.mem_alloc(c.nbytes)

# Copy data to device
cuda.memcpy_htod(a_gpu, a)
cuda.memcpy_htod(b_gpu, b)

# Define block and grid sizes
block_size = 256
grid_size = (N + block_size - 1) // block_size

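# Measure execution time for unoptimized kernel.
# The launch is asynchronous; memcpy_dtoh waits for the kernel to finish,
# so the measured time covers the kernel plus the copy back to the host.
start_time = time.perf_counter()
matrix_add_unoptimized(
    a_gpu, b_gpu, c_gpu,
    np.int32(N),
    block=(block_size, 1, 1), grid=(grid_size, 1)
)
cuda.memcpy_dtoh(c, c_gpu)
end_time = time.perf_counter()

# Sanity-check the result against NumPy
assert np.allclose(c, a + b)

print("Execution Time (Unoptimized):", end_time - start_time, "seconds")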

## Step 2: Optimizing Memory Access
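In this step, each block first stages its inputs in **shared memory**, a small on-chip memory shared by the threads of a block, and then computes from there. Shared memory reduces global memory traffic when threads in a block reuse the same data. In element-wise addition every value is read only once, so this version mainly demonstrates the pattern and may not run faster than Step 1.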
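One way to judge a memory-bound kernel such as this one is its effective bandwidth: bytes moved divided by elapsed time. The sketch below assumes each element costs two 4-byte reads and one 4-byte write to global memory, which holds whether or not the values pass through shared memory. The helper name `effective_bandwidth_gb_s` is hypothetical, and the elapsed time in the example is a placeholder, not a measurement.

```python
def effective_bandwidth_gb_s(n_elements, seconds, bytes_per_element=4):
    """Effective bandwidth in GB/s for c = a + b (hypothetical helper)."""
    # Two reads (a, b) and one write (c) per element
    total_bytes = 3 * n_elements * bytes_per_element
    return total_bytes / seconds / 1e9

# Placeholder timing of 1 ms, chosen only to illustrate the arithmetic
bandwidth = effective_bandwidth_gb_s(1024 * 1024, 1e-3)
print(f"{bandwidth:.2f} GB/s")
```

Comparing this figure with the device's peak memory bandwidth indicates how much headroom remains. For that comparison to be meaningful, the elapsed time should cover the kernel alone.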

In [None]:
# Define kernel that stages its inputs in shared memory
optimized_memory_kernel = """
__global__ void matrix_add_shared_memory(float *a, float *b, float *c, int N) {
    // One slot per thread; the array length must match the block size (256 threads)
    __shared__ float shared_a[256];
    __shared__ float shared_b[256];

    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + threadIdx.x;

    // Load data into shared memory (out-of-range threads skip the load)
    if (gid < N) {
        shared_a[tid] = a[gid];
        shared_b[tid] = b[gid];
    }
    // Every thread in the block must reach the barrier, so it sits outside the branch
    __syncthreads();

    // Perform addition using shared memory
    if (gid < N) {
        c[gid] = shared_a[tid] + shared_b[tid];
    }
}
"""

# Compile the optimized memory kernel
mod = SourceModule(optimized_memory_kernel)
matrix_add_shared_memory = mod.get_function("matrix_add_shared_memory")

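# Measure execution time for shared memory kernel.
# As before, the measured time covers the kernel plus the copy back to the host.
start_time = time.perf_counter()
matrix_add_shared_memory(
    a_gpu, b_gpu, c_gpu,
    np.int32(N),
    block=(block_size, 1, 1), grid=(grid_size, 1)
)
cuda.memcpy_dtoh(c, c_gpu)
end_time = time.perf_counter()

# Sanity-check the result against NumPy
assert np.allclose(c, a + b)

print("Execution Time (Shared Memory):", end_time - start_time, "seconds")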

## Step 3: Optimizing Thread and Block Management
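Now we'll change the block size and recompute the grid size to see how the launch configuration affects GPU occupancy. The best block size depends on the device and on the kernel's resource use, so treat 512 as a value to test rather than a guaranteed improvement. Keeping the block size a multiple of the warp size (32 threads) avoids partially filled warps.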
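As a rough guide to how block size relates to occupancy, the sketch below estimates the fraction of a streaming multiprocessor's (SM's) thread slots that can be filled. The per-SM limits (2048 resident threads, 32 resident blocks) are assumptions for illustration and vary by GPU. The estimate also ignores register and shared-memory limits, which can lower real occupancy. `theoretical_occupancy` is a hypothetical helper, not a PyCUDA function.

```python
def theoretical_occupancy(threads_per_block, max_threads_per_sm=2048, max_blocks_per_sm=32):
    """Upper-bound occupancy from thread and block limits only (assumed limits)."""
    # Blocks that fit on one SM, capped by the per-SM block limit
    resident_blocks = min(max_threads_per_sm // threads_per_block, max_blocks_per_sm)
    return resident_blocks * threads_per_block / max_threads_per_sm

for size in (32, 256, 512, 768, 1024):
    print(size, theoretical_occupancy(size))
```

Under these assumed limits, 256 and 512 threads per block reach the same bound, so a larger block is not automatically better. Measuring on the target device remains the deciding step.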

In [None]:
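# Try a larger block size; whether it helps depends on the device and the kernel
block_size = 512
grid_size = (N + block_size - 1) // block_size

# The kernel declares its shared arrays with 256 elements, so rebuild it
# with arrays that match the new block size to avoid out-of-bounds writes.
mod = SourceModule(optimized_memory_kernel.replace("[256]", f"[{block_size}]"))
matrix_add_shared_memory = mod.get_function("matrix_add_shared_memory")

# Measure execution time with the new block size (kernel plus copy back to the host)
start_time = time.perf_counter()
matrix_add_shared_memory(
    a_gpu, b_gpu, c_gpu,
    np.int32(N),
    block=(block_size, 1, 1), grid=(grid_size, 1)
)
cuda.memcpy_dtoh(c, c_gpu)
end_time = time.perf_counter()

# Sanity-check the result against NumPy
assert np.allclose(c, a + b)

print("Execution Time (Optimized Block Size):", end_time - start_time, "seconds")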

## Summary of Results
We started with an unoptimized kernel and progressively applied optimization techniques:

- **Unoptimized Kernel:** Direct global memory access.
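- **Memory Optimization:** Staged inputs in shared memory. Because element-wise addition reads each value only once, this mainly demonstrates the pattern; shared memory pays off when threads in a block reuse the same data.
- **Thread/Block Optimization:** Changed the block size from 256 to 512 to test its effect on GPU occupancy.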

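Compare the execution times to see how each change affects performance. Each measurement includes the device-to-host copy of the result, so differences between the kernels may be small relative to the total.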
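Single wall-clock measurements are noisy, and the first kernel launch can include one-time setup cost. Here is a minimal sketch of a more robust comparison, assuming each version is wrapped in a zero-argument callable that launches the kernel and waits for it to finish. `median_time` and `speedup` are hypothetical helpers, and the workload in the example is a CPU stand-in.

```python
import statistics
import time

def median_time(fn, warmup=1, repeats=5):
    """Median wall-clock time of fn() in seconds after warm-up runs."""
    for _ in range(warmup):
        fn()  # Warm-up runs are not timed
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

def speedup(baseline_seconds, optimized_seconds):
    """How many times faster the optimized version is than the baseline."""
    return baseline_seconds / optimized_seconds

# CPU stand-in workload; replace with a callable that runs a kernel and synchronizes
elapsed = median_time(lambda: sum(range(100_000)))
print(f"median: {elapsed:.6f} s")
```

The median is used here because it is less sensitive than the mean to an occasional slow run.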