# GPU Optimization Example
This notebook demonstrates techniques for optimizing GPU performance using CUDA and `pycuda`.
- **Memory Optimization Techniques:**
  - Reducing memory bandwidth usage.
  - Effective use of shared memory.
- **Thread and Block Management:**
  - Maximizing occupancy.
  - Avoiding divergence.
- **Profiling Tools:** NVIDIA Nsight.

**Steps:**
1. Install necessary libraries.
2. Define and run a CUDA kernel using `pycuda`.
3. Analyze optimization techniques in the code.

In [None]:
!pip install pycuda

In [None]:
import pycuda.driver as cuda
import pycuda.autoinit
import numpy as np
from pycuda.compiler import SourceModule
import time

# Define the CUDA kernel
kernel_code = """
__global__ void matrix_add_optimized(float *a, float *b, float *c, int N) {
    // Per-block staging buffers; their size must match the launch block size (256)
    __shared__ float shared_a[256];
    __shared__ float shared_b[256];

    int tid = threadIdx.x;  // Thread ID within the block
    int gid = blockIdx.x * blockDim.x + threadIdx.x;  // Global thread ID

    // Load data into shared memory (in-range threads only)
    if (gid < N) {
        shared_a[tid] = a[gid];
        shared_b[tid] = b[gid];
    }
    // The barrier sits outside the bounds check so every thread in the block reaches it
    __syncthreads();

    // Perform addition
    if (gid < N) {
        c[gid] = shared_a[tid] + shared_b[tid];
    }
}
"""

# Compile the CUDA kernel
mod = SourceModule(kernel_code)
matrix_add_optimized = mod.get_function("matrix_add_optimized")

# Define problem size: a 1024 x 1024 matrix stored as a flat float32 array
N = 1024 * 1024
a = np.random.rand(N).astype(np.float32)
b = np.random.rand(N).astype(np.float32)
c = np.zeros_like(a)

# Allocate device memory
a_gpu = cuda.mem_alloc(a.nbytes)
b_gpu = cuda.mem_alloc(b.nbytes)
c_gpu = cuda.mem_alloc(c.nbytes)

# Copy data to device
cuda.memcpy_htod(a_gpu, a)
cuda.memcpy_htod(b_gpu, b)

# Define block and grid sizes
block_size = 256
grid_size = (N + block_size - 1) // block_size

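# Measure kernel execution time. Kernel launches are asynchronous, so
# synchronize before reading the clock; the device-to-host copy is excluded.
start_time = time.perf_counter()

# Launch the kernel
matrix_add_optimized(
    a_gpu, b_gpu, c_gpu,
    np.int32(N),
    block=(block_size, 1, 1), grid=(grid_size, 1)
)
cuda.Context.synchronize()  # Wait for the kernel to finish

end_time = time.perf_counter()

# Copy result back to host
cuda.memcpy_dtoh(c, c_gpu)

# Verify results against the NumPy reference
assert np.allclose(c, a + b), "GPU result does not match a + b"
print("First 10 elements of c:", c[:10])
print("Kernel execution time:", end_time - start_time, "seconds")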

# Clean up
a_gpu.free()
b_gpu.free()
c_gpu.free()

### Explanation of Optimizations
#### **Memory Optimization Techniques**
1. **Reducing Memory Bandwidth Usage:**
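   - Element-wise addition reads each input once and writes each output once, so there is no data reuse for shared memory to exploit. Staging values in `shared_a` and `shared_b` here illustrates the load, synchronize, compute pattern; it does not reduce the number of global memory accesses.
   - What keeps bandwidth usage efficient in this kernel is coalescing: consecutive threads access consecutive `float` elements, so each warp's loads and stores combine into a few wide memory transactions. Shared memory pays off when several threads in a block reuse the same data, as in tiled matrix multiplication.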

2. **Effective Use of Shared Memory:**
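   - Each thread loads one element of `a` and one element of `b` into shared memory. `__syncthreads()` is a block-wide barrier that makes those writes visible to every thread in the block before any of them continues. Every thread in the block must reach the barrier, so it should not sit inside a branch that only some threads take.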
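
Because element-wise addition touches each value exactly once, a quick way to judge the kernel is to compare its effective bandwidth with the GPU's peak. The sketch below, in plain Python, counts the bytes moved (two `float32` loads and one store per element). `bytes_moved` and `effective_bandwidth_gbs` are hypothetical helpers, and the 0.1 ms kernel time is an illustrative value, not a measurement.

```python
def bytes_moved(n_elements, bytes_per_element=4, loads=2, stores=1):
    """Global-memory traffic for c[i] = a[i] + b[i]: two loads and one store per element."""
    return n_elements * bytes_per_element * (loads + stores)


def effective_bandwidth_gbs(n_elements, kernel_seconds):
    """Effective bandwidth in GB/s (1 GB = 1e9 bytes) for a measured kernel time."""
    return bytes_moved(n_elements) / kernel_seconds / 1e9


N = 1024 * 1024
print(bytes_moved(N))  # 12582912 bytes, i.e. 12 MiB

# Illustrative kernel time only; substitute the value you measure.
example_seconds = 1e-4
print(round(effective_bandwidth_gbs(N, example_seconds), 2))  # 125.83
```

If the measured figure is close to the peak memory bandwidth listed for your GPU, the kernel is already bandwidth-bound and further tuning of the arithmetic will not help.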

#### **Thread and Block Management**
1. **Maximizing Occupancy:**
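   - The grid size is computed by ceiling division, `(N + block_size - 1) // block_size`, so every element is covered even when `N` is not a multiple of the block size. `block_size` is 256, a multiple of the warp size (32), so no warp is partially filled. This is a reasonable default rather than a guarantee of full occupancy, which also depends on the registers and shared memory each block uses.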

2. **Avoiding Divergence:**
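   - The only branch is the bounds check `if (gid < N)`. Threads in a warp can disagree on it only in the warp that straddles index `N`, so at most one warp diverges. With `N = 1024 * 1024`, a multiple of the block size, every launched thread passes the check and no warp diverges.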
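
The launch arithmetic in this section can be checked on the host without a GPU. The following sketch computes the grid size by ceiling division, estimates theoretical occupancy from thread and block limits only, and counts how many warps straddle the `gid < N` boundary. The helper names are hypothetical, and the per-SM limits (2048 resident threads, 32 resident blocks) are assumed values that vary by compute capability; register and shared-memory limits are ignored.

```python
WARP_SIZE = 32


def grid_size_for(n, block_size):
    """Smallest number of blocks whose threads cover n elements (ceiling division)."""
    return (n + block_size - 1) // block_size


def theoretical_occupancy(block_size, max_threads_per_sm=2048, max_blocks_per_sm=32):
    """Resident warps divided by the maximum warps per SM.

    Considers thread and block limits only. The default limits are
    assumptions; look up the values for your GPU.
    """
    warps_per_block = (block_size + WARP_SIZE - 1) // WARP_SIZE
    max_warps_per_sm = max_threads_per_sm // WARP_SIZE
    resident_blocks = min(max_blocks_per_sm, max_warps_per_sm // warps_per_block)
    return resident_blocks * warps_per_block / max_warps_per_sm


def divergent_warps(n, block_size):
    """Number of warps in which some threads pass `gid < n` and others fail it.

    Assumes block_size is a multiple of WARP_SIZE. Only the last block can
    contain such a warp, and only if the boundary falls inside a warp.
    """
    active_in_last_block = n % block_size
    return 1 if active_in_last_block % WARP_SIZE != 0 else 0


N = 1024 * 1024
print(grid_size_for(N, 256))        # 4096
print(theoretical_occupancy(256))   # 1.0
print(theoretical_occupancy(32))    # 0.5 (block limit reached before thread limit)
print(divergent_warps(N, 256))      # 0
print(divergent_warps(1000, 256))   # 1
```

Under these assumed limits, very small blocks lose occupancy because the per-SM block limit is hit first, which is one reason 128 to 256 threads per block is a common starting point.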

#### **Profiling Tools**
To profile this code:
1. Download the script (`matrix_add.cu`) or run equivalent CUDA code on your local machine.
2. Use **NVIDIA Nsight Systems** for a timeline of kernel launches and memory copies, and **NVIDIA Nsight Compute** for per-kernel metrics such as execution time, memory throughput, and thread utilization.

### Profiling with NVIDIA Nsight
1. Install Nsight tools from NVIDIA.
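2. Compile the source and profile the resulting executable with the Nsight Compute CLI. The profiler runs a program, not a `.cu` source file, and the CLI is named `ncu` in current releases (`nv-nsight-cu-cli` in older ones):
   ```bash
   nvcc -o matrix_add matrix_add.cu
   ncu ./matrix_add
   ```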
3. Look for metrics like:
   - Memory access patterns.
   - Warp divergence.
   - Shared memory usage.