# 🚀 Day 4: Error Handling & Debugging

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sdodlapati3/cuda-lab/blob/main/learning-path/week-01/day-4-error-handling.ipynb)

## Learning Philosophy

> **CUDA C++ First, Python/Numba as Optional Backup**

This notebook shows:
1. **CUDA C++ code** - The PRIMARY implementation you should learn
2. **Python/Numba code** - OPTIONAL for quick interactive testing in Colab

> **Note:** If running on Google Colab, go to `Runtime → Change runtime type → T4 GPU` before starting!

---

In [None]:
# ⚙️ Colab/Local Setup - Run this first!
# Python/Numba is OPTIONAL - for quick interactive testing only
import subprocess, sys
try:
    import google.colab
    print("🔧 Running on Google Colab - Installing dependencies...")
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", "numba"])
    print("✅ Setup complete!")
except ImportError:
    print("💻 Running locally - make sure you have: pip install numba numpy")

import numpy as np
from numba import cuda
from numba.cuda.cudadrv.driver import CudaAPIError
import math
import traceback

print("\n⚠️  Remember: CUDA C++ code is the PRIMARY learning material!")
print("   Python/Numba is provided for quick interactive testing only.")

Bugs in CUDA code can be subtle and hard to find. Today you'll learn:
- How CUDA errors work (and the `CUDA_CHECK` macro)
- Proper error checking patterns in CUDA C++
- Common pitfalls and how to avoid them
- Debugging with `compute-sanitizer` (the replacement for the deprecated `cuda-memcheck`)

---

## 1. Understanding CUDA Errors

CUDA operations can fail for many reasons:
- Invalid kernel launch configuration
- Out of memory
- Device not available
- Invalid memory access
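- Race conditions (these usually produce wrong results silently rather than a reported error)

**Key concept:** Kernel launches are **asynchronous**: the call returns before the kernel has finished, so an error that occurs during execution is reported only by a later synchronizing call.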

```
kernel<<<grid, block>>>(...);  // Launches, returns immediately
// ... other code ...
cudaDeviceSynchronize();       // Error might appear HERE!
```
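
The same deferred-error behavior can be reproduced on the CPU, without a GPU. The sketch below is only an analogy, not CUDA code: it uses Python's standard `concurrent.futures` module, where `submit()` plays the role of the kernel launch and `result()` plays the role of `cudaDeviceSynchronize()`. The names `fake_kernel` and `launch_and_sync` are hypothetical and exist only for this illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_kernel(n):
    """Hypothetical stand-in for a kernel that fails while it is running."""
    raise RuntimeError(f"invalid memory access at index {n}")


def launch_and_sync(n):
    """Record where the failure becomes visible: at launch or at sync."""
    events = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fake_kernel, n)  # "launch": returns without raising
        events.append("launch returned")
        try:
            future.result()  # "synchronize": the stored error is raised here
        except RuntimeError as exc:
            events.append(f"sync raised: {exc}")
    return events


for event in launch_and_sync(100):
    print(event)
```

The failure is attached to the pending work, not to the call that started it, which is why error checks belong after a synchronization point as well as after the launch.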

In [None]:
# Setup
import numpy as np
from numba import cuda
from numba.cuda.cudadrv.driver import CudaAPIError
import math
import traceback

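device_name = cuda.get_current_device().name
# The name may be returned as bytes or as str; handle both
if isinstance(device_name, bytes):
    device_name = device_name.decode()
print("CUDA device:", device_name)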

## 2. Common CUDA Errors & How to Trigger Them

Let's intentionally cause errors to understand how they appear.

In [None]:
# Error 1: Invalid Launch Configuration
# Max threads per block is 1024. What happens if we exceed it?

@cuda.jit
def simple_kernel(arr):
    idx = cuda.grid(1)
    if idx < arr.size:
        arr[idx] = idx

arr = np.zeros(100, dtype=np.float32)
arr_d = cuda.to_device(arr)

print("Attempting to launch with 2048 threads per block...")
print("(Max allowed is 1024)")
print("-" * 50)

try:
    # This will fail - too many threads per block!
    simple_kernel[1, 2048](arr_d)
    cuda.synchronize()
except Exception as e:
    print(f"❌ Error caught: {type(e).__name__}")
    print(f"   Message: {e}")

In [None]:
# Error 2: Out of Memory
# Trying to allocate more than available GPU memory

ctx = cuda.current_context()
free_mem, total_mem = ctx.get_memory_info()
print(f"Free GPU memory: {free_mem / 1e9:.2f} GB")
print(f"Attempting to allocate: {free_mem * 2 / 1e9:.2f} GB (2x available)")
print("-" * 50)

try:
    # Try to allocate more than available
    huge_array = cuda.device_array(int(free_mem * 2), dtype=np.uint8)
except Exception as e:
    print(f"❌ Error caught: {type(e).__name__}")
    print(f"   Message: {e}")

## 3. The Debugging Checklist

When your CUDA code doesn't work, check these in order:

### 🔍 Checklist

1. **Is CUDA available?**
   ```python
   cuda.is_available()
   ```

2. **Are launch parameters valid?**
   - `threads_per_block` ≤ 1024
   - `blocks` > 0
   - Grid dimensions within limits

3. **Is there enough memory?**
   - Check `cuda.current_context().get_memory_info()`

4. **Are array sizes correct?**
   - Boundary checks in kernel: `if idx < n:`

5. **Are data types matching?**
   - NumPy defaults to float64; GPUs are usually faster with float32, so convert explicitly

6. **Did you synchronize?**
   - `cuda.synchronize()` before reading results
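
Item 2 of the checklist can be verified before touching the GPU at all. The sketch below is a minimal pure-Python pre-check, assuming the 1024-thread limit quoted above; `plan_launch` and `ASSUMED_MAX_THREADS_PER_BLOCK` are hypothetical names, and in real code the limit should be read from the device rather than hard-coded.

```python
# Assumed limit for illustration; query the real value from the device in practice.
ASSUMED_MAX_THREADS_PER_BLOCK = 1024


def plan_launch(n, threads_per_block=256, max_threads=ASSUMED_MAX_THREADS_PER_BLOCK):
    """Return (blocks, threads_per_block) covering n elements, or raise ValueError."""
    if n <= 0:
        raise ValueError(f"n must be positive, got {n}")
    if not 1 <= threads_per_block <= max_threads:
        raise ValueError(
            f"threads_per_block must be in [1, {max_threads}], got {threads_per_block}"
        )
    # Integer ceiling division: round up so every element gets a thread
    blocks = (n + threads_per_block - 1) // threads_per_block
    return blocks, threads_per_block


print(plan_launch(100))         # (1, 256)
print(plan_launch(10_000_000))  # (39063, 256)

try:
    plan_launch(100, threads_per_block=2048)
except ValueError as exc:
    print("rejected:", exc)
```

Because the block count is rounded up, the last block usually contains threads with no element to process, which is why the boundary check in item 4 is still required.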

In [None]:
# Helper function: Safe kernel launch wrapper
def safe_launch(kernel, grid, block, *args, **kwargs):
    """Launch kernel with error checking"""
    device = cuda.get_current_device()
    
    # Validate block size
    if isinstance(block, int):
        block = (block,)
    total_threads = 1
    for dim in block:
        total_threads *= dim
    if total_threads > device.MAX_THREADS_PER_BLOCK:
        raise ValueError(f"Block size {block} = {total_threads} threads exceeds max {device.MAX_THREADS_PER_BLOCK}")
    
    # Validate grid size
    if isinstance(grid, int):
        grid = (grid,)
    for i, dim in enumerate(grid):
        max_dim = [device.MAX_GRID_DIM_X, device.MAX_GRID_DIM_Y, device.MAX_GRID_DIM_Z][i]
        # Reject empty grids as well as oversized ones
        if dim < 1 or dim > max_dim:
            raise ValueError(f"Grid dimension {i} = {dim} must be between 1 and {max_dim}")
    
    # Launch
    kernel[grid, block](*args, **kwargs)
    cuda.synchronize()

# Test safe launch
print("Testing safe_launch helper:")
arr = cuda.device_array(100, dtype=np.float32)

try:
    safe_launch(simple_kernel, 1, 2048, arr)  # Should fail validation
except ValueError as e:
    print(f"✅ Caught before launch: {e}")

safe_launch(simple_kernel, 1, 256, arr)  # Should work
print("✅ Valid launch succeeded")

## 4. Common Pitfalls & Bug Patterns

### Pitfall 1: Missing Boundary Check
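
A launch usually rounds the thread count up to a multiple of the block size, so some threads have no element to work on. The pure-Python sketch below counts those threads for one 256-thread block launched over a 100-element array; these are the threads that an `if idx < n` guard has to mask off. `out_of_range_threads` is a hypothetical helper written for this illustration, not part of Numba.

```python
def out_of_range_threads(n, blocks, threads_per_block):
    """Return the global thread indices that fall outside an array of length n."""
    stray = []
    for block in range(blocks):
        for thread in range(threads_per_block):
            # Mirrors idx = blockIdx.x * blockDim.x + threadIdx.x for a 1D grid
            idx = block * threads_per_block + thread
            if idx >= n:
                stray.append(idx)
    return stray


stray = out_of_range_threads(n=100, blocks=1, threads_per_block=256)
print(len(stray), "threads have no element: indices", stray[0], "to", stray[-1])
# 156 threads have no element: indices 100 to 255
```

Without the guard, each of those threads would write past the end of the array.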

In [None]:
# BAD: No boundary check
@cuda.jit
def bad_kernel_no_bounds(arr):
    idx = cuda.grid(1)
    arr[idx] = idx  # 💥 Will access out-of-bounds memory!

# GOOD: With boundary check
@cuda.jit
def good_kernel_with_bounds(arr, n):
    idx = cuda.grid(1)
    if idx < n:  # ✅ Always check!
        arr[idx] = idx

# Demonstrate the difference
n = 100
arr = cuda.device_array(n, dtype=np.float32)
threads = 256  # More threads than elements!
blocks = 1

print("With proper bounds checking:")
good_kernel_with_bounds[blocks, threads](arr, n)
cuda.synchronize()
print("✅ Completed safely")

### Pitfall 2: Wrong Data Type
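
Part of the cost of float64 is simply data volume: each element is twice as large, so an element-wise kernel has to move twice as many bytes. The sketch below uses only the standard library to work out the footprint of an array of 10 million elements; `array_megabytes` and `BYTES_PER_ELEMENT` are hypothetical names chosen for this illustration.

```python
import struct

N = 10_000_000  # number of elements per array

BYTES_PER_ELEMENT = {
    "float32": struct.calcsize("f"),  # 4 bytes, single precision
    "float64": struct.calcsize("d"),  # 8 bytes, double precision
}


def array_megabytes(n, dtype_name):
    """Size in MB (10**6 bytes) of an n-element array of the given dtype."""
    return n * BYTES_PER_ELEMENT[dtype_name] / 1e6


for name in ("float32", "float64"):
    print(f"{name}: {array_megabytes(N, name):.0f} MB per array")
# float32: 40 MB per array
# float64: 80 MB per array
```

An addition kernel reads two such arrays and writes one, so the float64 version moves 240 MB per launch against 120 MB for float32.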

In [None]:
# NumPy defaults to float64, but CUDA prefers float32
import time

@cuda.jit
def add_arrays(a, b, c):
    idx = cuda.grid(1)
    if idx < c.size:
        c[idx] = a[idx] + b[idx]

n = 10_000_000

# float64 (default) - slower on most GPUs
a64 = np.random.randn(n)  # Default is float64!
b64 = np.random.randn(n)
c64 = np.zeros(n)

# float32 - preferred
a32 = np.random.randn(n).astype(np.float32)
b32 = np.random.randn(n).astype(np.float32)
c32 = np.zeros(n, dtype=np.float32)

threads, blocks = 256, math.ceil(n / 256)

# Benchmark float64
a64_d, b64_d = cuda.to_device(a64), cuda.to_device(b64)
c64_d = cuda.device_array(n, dtype=np.float64)
add_arrays[blocks, threads](a64_d, b64_d, c64_d)
cuda.synchronize()

start = time.perf_counter()
for _ in range(10):
    add_arrays[blocks, threads](a64_d, b64_d, c64_d)
cuda.synchronize()
time64 = (time.perf_counter() - start) / 10

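# Benchmark float32
a32_d, b32_d = cuda.to_device(a32), cuda.to_device(b32)
c32_d = cuda.device_array(n, dtype=np.float32)
# Warm-up: Numba compiles a separate kernel specialization for float32 on first call,
# so launch once before timing (the float64 run above does the same)
add_arrays[blocks, threads](a32_d, b32_d, c32_d)
cuda.synchronize()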

start = time.perf_counter()
for _ in range(10):
    add_arrays[blocks, threads](a32_d, b32_d, c32_d)
cuda.synchronize()
time32 = (time.perf_counter() - start) / 10

print(f"float64: {time64*1000:.3f} ms")
print(f"float32: {time32*1000:.3f} ms")
print(f"Speedup: {time64/time32:.2f}x")
print("\n💡 Tip: Always use .astype(np.float32) unless you need float64 precision!")

### Pitfall 3: Forgetting to Synchronize
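
The timing mistake can be reproduced on the CPU with any asynchronous API. The sketch below is an analogy only, not CUDA code: a thread pool stands in for the GPU, `submit()` for the kernel launch, and `result()` for `cuda.synchronize()`. The names `fake_slow_kernel` and `time_launch` are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_slow_kernel(duration):
    """Hypothetical stand-in for GPU work that takes `duration` seconds."""
    time.sleep(duration)


def time_launch(pool, duration, synchronize):
    """Time one asynchronous launch, optionally waiting for it to finish."""
    start = time.perf_counter()
    future = pool.submit(fake_slow_kernel, duration)  # returns immediately
    if synchronize:
        future.result()  # block until the work is done
    elapsed = time.perf_counter() - start
    future.result()  # always drain afterwards so measurements do not overlap
    return elapsed


with ThreadPoolExecutor(max_workers=1) as pool:
    unsynced = time_launch(pool, 0.05, synchronize=False)
    synced = time_launch(pool, 0.05, synchronize=True)

print(f"without sync: {unsynced * 1000:.2f} ms (measures only the launch)")
print(f"with sync:    {synced * 1000:.2f} ms (measures the work)")
```

Draining the queue after each measurement matters as much as the synchronization itself: otherwise leftover work from one run is billed to the next.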

In [None]:
# Kernel execution is ASYNCHRONOUS
@cuda.jit
def slow_kernel(arr):
    """Simulate slow computation"""
    idx = cuda.grid(1)
    if idx < arr.size:
        # Busy work
        val = 0.0
        for i in range(1000):
            val += idx * 0.001
        arr[idx] = val

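arr = cuda.device_array(10000, dtype=np.float32)
threads, blocks = 256, math.ceil(10000 / 256)

# Warm-up: the first launch triggers JIT compilation, which would distort the timing
slow_kernel[blocks, threads](arr)
cuda.synchronize()

# BAD: Timing without synchronization
start = time.perf_counter()
slow_kernel[blocks, threads](arr)
# Missing: cuda.synchronize()
bad_time = time.perf_counter() - start
print(f"Without sync: {bad_time*1000:.3f} ms (WRONG! Kernel may still be running)")
cuda.synchronize()  # Drain the queue so the next measurement starts clean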

# GOOD: Proper timing with synchronization  
start = time.perf_counter()
slow_kernel[blocks, threads](arr)
cuda.synchronize()  # Wait for kernel to complete
good_time = time.perf_counter() - start
print(f"With sync:    {good_time*1000:.3f} ms (Correct)")

print(f"\n⚠️ The unsynchronized measurement is {good_time/bad_time:.0f}x too short!")

## 5. Debugging with Print Statements

In Numba CUDA, you can call `print()` inside kernels for debugging. Use it sparingly: it is slow, and the order of output across threads is not guaranteed.

In [None]:
@cuda.jit
def debug_kernel(arr, n):
    idx = cuda.grid(1)
    
    # Only print from first few threads to avoid output flood
    if idx < 3:
        print("Thread", idx, "starting")
    
    if idx < n:
        arr[idx] = idx * 2
        
        # Debug: Print values for first few elements
        if idx < 3:
            print("Thread", idx, "wrote value", arr[idx])

# Run with small array
arr = cuda.device_array(10, dtype=np.float32)
debug_kernel[1, 10](arr, 10)
cuda.synchronize()

print("\nFinal array:", arr.copy_to_host())

## 🎯 Exercises

### Exercise 1: Error-Proof Kernel Wrapper
Create a robust wrapper function that validates all inputs before launching a kernel.

In [None]:
# TODO Exercise 1: Complete this error-proof wrapper

def launch_kernel_safe(kernel, data, threads_per_block=256):
    """
    Safely launch a kernel with automatic configuration and error checking.
    
    Args:
        kernel: The CUDA kernel function
        data: Input array (numpy or device array)
        threads_per_block: Threads per block (default 256)
    
    Returns:
        Device array with results
        
    Raises:
        ValueError: If inputs are invalid
        MemoryError: If not enough GPU memory
    """
    # TODO: Implement the following checks:
    # 1. Verify CUDA is available
    # 2. Check data is not empty
    # 3. Validate threads_per_block (1-1024)
    # 4. Check sufficient GPU memory
    # 5. Launch kernel with proper grid configuration
    # 6. Synchronize and check for errors
    
    pass

# Test your implementation
# ...

## 📝 Key Takeaways

### Error Handling Best Practices:

1. **Always synchronize** before reading results or timing
   ```python
   kernel[grid, block](...)
   cuda.synchronize()  # Wait for completion
   result = output.copy_to_host()
   ```

2. **Validate launch configuration**
   - threads_per_block ≤ 1024
   - Check grid dimensions against device limits

3. **Always include boundary checks**
   ```python
   if idx < n:
       arr[idx] = ...
   ```

4. **Use try/except for error handling**
   ```python
   try:
       kernel[grid, block](...)
       cuda.synchronize()  # Asynchronous errors surface here
   except CudaAPIError as e:
       print(f"CUDA Error: {e}")
   ```

5. **Prefer float32** unless you need float64 precision

6. **Debug strategically**
   - Use print() sparingly (only first few threads)
   - Use small test cases first
   - Verify the CPU version first, then compare GPU results against it

---

### 📚 Next Steps
You've completed Week 1! Before moving on:
1. Complete the checkpoint quiz
2. Finish all exercises in each notebook
3. Make sure you can run all code without errors

### 🔗 Resources
- [Error Handling Guide](../../cuda-programming-guide/02-basics/nvcc.md)
- [Debugging Documentation](../../cuda-programming-guide/04-special-topics/error-log-management.md)