# 🚀 Day 4: Error Handling & Debugging

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sdodlapati3/cuda-lab/blob/main/learning-path/week-01/day-4-error-handling.ipynb)

## Learning Philosophy

> **CUDA C++ First, Python/Numba as Optional Backup**

This notebook shows:
1. **CUDA C++ code** - The PRIMARY implementation you should learn
2. **Python/Numba code** - OPTIONAL for quick interactive testing in Colab

> **Note:** If running on Google Colab, go to `Runtime → Change runtime type → T4 GPU` before starting!

---

In [None]:
# Verify CUDA is available
!nvcc --version
!nvidia-smi --query-gpu=name,memory.total,compute_cap --format=csv

---

Bugs in CUDA code can be subtle and hard to find. Today you'll learn:
- The essential `CUDA_CHECK` macro
- How CUDA errors work (synchronous vs asynchronous)
- Common pitfalls and how to avoid them
- Debugging with `compute-sanitizer`

---

## 1. Understanding CUDA Errors

CUDA operations can fail for many reasons:
- Invalid kernel launch configuration
- Out of memory
- Invalid memory access
- Device not available

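**Key concept:** Kernel launches and many other CUDA operations are **asynchronous**: the call returns before the GPU has finished, so an execution error may only be reported by a later call such as `cudaDeviceSynchronize()`.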

```cpp
kernel<<<grid, block>>>(...);  // Launches, returns immediately
// ... other code ...
cudaDeviceSynchronize();       // Error might appear HERE!
```

## 2. The Essential CUDA_CHECK Macro

**Every CUDA call should be wrapped in error checking!**

This macro is used throughout production CUDA code:

In [None]:
%%writefile cuda_check.cu
#include <stdio.h>
#include <stdlib.h>  // malloc, free, exit
#include <cuda_runtime.h>

// ============================================================
// THE ESSENTIAL CUDA_CHECK MACRO - Use this in EVERY project!
// ============================================================
#define CUDA_CHECK(call) \
    do { \
        cudaError_t error = call; \
        if (error != cudaSuccess) { \
            fprintf(stderr, "CUDA Error: %s:%d, ", __FILE__, __LINE__); \
            fprintf(stderr, "code: %d, reason: %s\n", error, \
                    cudaGetErrorString(error)); \
            exit(1); \
        } \
    } while(0)

// Additional macro to check kernel launch errors
#define CUDA_CHECK_KERNEL() \
    do { \
        cudaError_t error = cudaGetLastError(); \
        if (error != cudaSuccess) { \
            fprintf(stderr, "CUDA Kernel Error: %s:%d, ", __FILE__, __LINE__); \
            fprintf(stderr, "code: %d, reason: %s\n", error, \
                    cudaGetErrorString(error)); \
            exit(1); \
        } \
        error = cudaDeviceSynchronize(); \
        if (error != cudaSuccess) { \
            fprintf(stderr, "CUDA Sync Error: %s:%d, ", __FILE__, __LINE__); \
            fprintf(stderr, "code: %d, reason: %s\n", error, \
                    cudaGetErrorString(error)); \
            exit(1); \
        } \
    } while(0)

// ============================================================

__global__ void simpleKernel(float* data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        data[idx] *= 2.0f;
    }
}

int main() {
    printf("=== CUDA Error Checking Demo ===\n\n");
    
    const int N = 1000;
    size_t size = N * sizeof(float);
    
    // Allocate host memory
    float* h_data = (float*)malloc(size);
    for (int i = 0; i < N; i++) {
        h_data[i] = i;
    }
    
    // Allocate device memory WITH error checking
    float* d_data;
    CUDA_CHECK(cudaMalloc(&d_data, size));
    printf("✅ cudaMalloc succeeded\n");
    
    // Copy to device WITH error checking
    CUDA_CHECK(cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice));
    printf("✅ cudaMemcpy H2D succeeded\n");
    
    // Launch kernel
    int threads = 256;
    int blocks = (N + threads - 1) / threads;
    simpleKernel<<<blocks, threads>>>(d_data, N);
    CUDA_CHECK_KERNEL();
    printf("✅ Kernel execution succeeded\n");
    
    // Copy back WITH error checking
    CUDA_CHECK(cudaMemcpy(h_data, d_data, size, cudaMemcpyDeviceToHost));
    printf("✅ cudaMemcpy D2H succeeded\n");
    
    // Verify
    bool correct = true;
    for (int i = 0; i < N; i++) {
        if (h_data[i] != i * 2.0f) {
            correct = false;
            break;
        }
    }
    printf("\n%s\n", correct ? "✅ Results correct!" : "❌ Results incorrect!");
    
    // Cleanup WITH error checking
    CUDA_CHECK(cudaFree(d_data));
    free(h_data);
    printf("✅ Cleanup succeeded\n");
    
    return 0;
}

In [None]:
!nvcc -o cuda_check cuda_check.cu && ./cuda_check

## 3. Common CUDA Errors & How to Trigger Them

Let's intentionally cause errors to understand how they appear:

In [None]:
%%writefile common_errors.cu
#include <stdio.h>
#include <cuda_runtime.h>

#define CUDA_CHECK(call) \
    do { \
        cudaError_t error = call; \
        if (error != cudaSuccess) { \
            fprintf(stderr, "❌ CUDA Error: %s\n", cudaGetErrorString(error)); \
            fprintf(stderr, "   at %s:%d\n", __FILE__, __LINE__); \
            return; \
        } \
    } while(0)

__global__ void simpleKernel(float* arr, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) arr[idx] = idx;
}

// Error 1: Invalid launch configuration (too many threads)
void testInvalidLaunchConfig() {
    printf("\n=== Test 1: Invalid Launch Configuration ===\n");
    printf("Attempting to launch with 2048 threads per block...\n");
    printf("(Max allowed is 1024)\n\n");
    
    float* d_arr;
    CUDA_CHECK(cudaMalloc(&d_arr, 100 * sizeof(float)));
    
    // This will fail - too many threads per block!
    simpleKernel<<<1, 2048>>>(d_arr, 100);
    
    cudaError_t error = cudaGetLastError();
    if (error != cudaSuccess) {
        printf("❌ Launch failed: %s\n", cudaGetErrorString(error));
    } else {
        printf("✅ Launch succeeded (unexpected!)\n");
    }
    
    // Reset error state for next test
    cudaGetLastError();
    cudaFree(d_arr);
}

// Error 2: Out of memory
void testOutOfMemory() {
    printf("\n=== Test 2: Out of Memory ===\n");
    
    size_t freeBytes, totalBytes;
    cudaMemGetInfo(&freeBytes, &totalBytes);
    printf("Free GPU memory: %.1f GB\n", freeBytes / 1e9);
    printf("Attempting to allocate: %.1f GB (2x available)\n\n", freeBytes * 2 / 1e9);
    
    float* hugePtr;
    cudaError_t error = cudaMalloc(&hugePtr, freeBytes * 2);
    
    if (error != cudaSuccess) {
        printf("❌ Allocation failed: %s\n", cudaGetErrorString(error));
    } else {
        printf("✅ Allocation succeeded (unexpected!)\n");
        cudaFree(hugePtr);
    }
    
    // Reset error state
    cudaGetLastError();
}

// Error 3: Invalid memory access
__global__ void outOfBoundsKernel(float* arr, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    // Intentionally access out of bounds!
    arr[idx + 1000000] = 42.0f;  // BAD!
}

void testInvalidMemoryAccess() {
    printf("\n=== Test 3: Invalid Memory Access ===\n");
    printf("Launching kernel that accesses out-of-bounds memory...\n\n");
    
    float* d_arr;
    CUDA_CHECK(cudaMalloc(&d_arr, 100 * sizeof(float)));
    
    outOfBoundsKernel<<<1, 32>>>(d_arr, 100);
    
    // Must synchronize to catch the error!
    cudaError_t error = cudaDeviceSynchronize();
    
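    if (error != cudaSuccess) {
        printf("❌ Execution failed: %s\n", cudaGetErrorString(error));
    } else {
        printf("✅ Execution succeeded (error not detected without sanitizer)\n");
    }
    
    // A faulting kernel leaves a "sticky" error that cudaGetLastError() cannot
    // clear; cudaDeviceReset() destroys the context (and frees d_arr with it).
    cudaDeviceReset();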
}

int main() {
    printf("=== Common CUDA Errors Demo ===\n");
    
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("Device: %s\n", prop.name);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    
    testInvalidLaunchConfig();
    testOutOfMemory();
    testInvalidMemoryAccess();
    
    printf("\n=== Summary ===\n");
    printf("1. Always check cudaGetLastError() after kernel launches\n");
    printf("2. Use cudaDeviceSynchronize() to catch async errors\n");
    printf("3. Use compute-sanitizer for memory errors\n");
    
    return 0;
}

In [None]:
!nvcc -o common_errors common_errors.cu && ./common_errors

## 4. The Debugging Checklist

When your CUDA code doesn't work, check these in order:

### 🔍 Checklist

1. **Is CUDA available?**
   ```cpp
   int deviceCount;
   cudaGetDeviceCount(&deviceCount);
   ```

2. **Are launch parameters valid?**
   - `threads_per_block` ≤ 1024
   - `blocks` > 0
   - Grid dimensions within limits

3. **Is there enough memory?**
   ```cpp
   size_t freeBytes, totalBytes;
   cudaMemGetInfo(&freeBytes, &totalBytes);
   ```

4. **Are array sizes correct?**
   - Boundary checks in kernel: `if (idx < n)`

5. **Are data types matching?**
   - Prefer `float` (float32) over `double` (float64) unless you need the extra precision; float64 is slower on most GPUs

6. **Did you synchronize?**
   - `cudaDeviceSynchronize()` before reading results
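
Items 2 and 4 of the checklist can be verified on the host before anything is launched. The following is a minimal sketch, not part of the original notebook: `LaunchCheck`, `ceilDiv`, and `checkLaunch1D` are hypothetical names, and the two limits are assumed to be read from `cudaGetDeviceProperties()` (`maxThreadsPerBlock` and `maxGridSize[0]`). It is plain host C++, so the arithmetic can be checked without a GPU.

```cpp
// Host-only sketch: validate a 1D launch configuration before launching.
// LaunchCheck, ceilDiv, and checkLaunch1D are hypothetical names, not CUDA APIs.
enum class LaunchCheck { Ok, BadSize, BadBlock, GridTooLarge };

// Round-up division that avoids the overflow risk of (n + d - 1) / d.
constexpr long long ceilDiv(long long n, long long d) {
    return n / d + (n % d != 0 ? 1 : 0);
}

// The two limits are assumed to come from cudaGetDeviceProperties()
// (maxThreadsPerBlock and maxGridSize[0]).
constexpr LaunchCheck checkLaunch1D(long long n, int threadsPerBlock,
                                    int maxThreadsPerBlock, int maxGridX) {
    if (n <= 0) return LaunchCheck::BadSize;
    if (threadsPerBlock <= 0 || threadsPerBlock > maxThreadsPerBlock)
        return LaunchCheck::BadBlock;
    if (ceilDiv(n, threadsPerBlock) > maxGridX)
        return LaunchCheck::GridTooLarge;
    return LaunchCheck::Ok;
}
```

For example, `checkLaunch1D(100, 2048, 1024, 2147483647)` yields `LaunchCheck::BadBlock`, the same configuration that fails in Test 1. The limits 1024 and 2147483647 are typical values for current GPUs, not guaranteed ones, so real code should query them from the device.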

In [None]:
%%writefile debug_checklist.cu
#include <stdio.h>
#include <cuda_runtime.h>

#define CUDA_CHECK(call) \
    do { \
        cudaError_t error = call; \
        if (error != cudaSuccess) { \
            fprintf(stderr, "❌ %s\n", cudaGetErrorString(error)); \
            return false; \
        } \
    } while(0)

// Validation function - run this before any CUDA code
bool validateCudaSetup() {
    printf("=== CUDA Setup Validation ===\n\n");
    
    // Check 1: Device availability
    int deviceCount;
    CUDA_CHECK(cudaGetDeviceCount(&deviceCount));
    if (deviceCount == 0) {
        printf("❌ No CUDA devices found!\n");
        return false;
    }
    printf("✅ Found %d CUDA device(s)\n", deviceCount);
    
    // Check 2: Device properties
    cudaDeviceProp prop;
    CUDA_CHECK(cudaGetDeviceProperties(&prop, 0));
    printf("✅ Device: %s\n", prop.name);
    printf("   - Compute capability: %d.%d\n", prop.major, prop.minor);
    printf("   - Max threads/block: %d\n", prop.maxThreadsPerBlock);
    printf("   - Max grid dims: [%d, %d, %d]\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    printf("   - Total memory: %.1f GB\n", prop.totalGlobalMem / 1e9);
    
    // Check 3: Available memory
    size_t freeBytes, totalBytes;
    CUDA_CHECK(cudaMemGetInfo(&freeBytes, &totalBytes));
    printf("✅ Memory: %.1f GB free / %.1f GB total\n",
           freeBytes / 1e9, totalBytes / 1e9);
    
    // Check 4: Test allocation
    float* testPtr;
    CUDA_CHECK(cudaMalloc(&testPtr, 1024));
    CUDA_CHECK(cudaFree(testPtr));
    printf("✅ Test allocation succeeded\n");
    
    printf("\n=== All checks passed! ===\n");
    return true;
}

// Launch-configuration validator. Note: this demo only checks the grid and
// block dimensions against the device limits; it never launches `kernel`.
template<typename KernelFunc>
bool safeLaunch(KernelFunc kernel, dim3 grid, dim3 block,
                const char* kernelName) {
    (void)kernel;  // Unused in this validation-only demo
    cudaDeviceProp prop;
    CUDA_CHECK(cudaGetDeviceProperties(&prop, 0));
    
    // Validate block size (total threads across x, y, and z)
    unsigned totalThreads = block.x * block.y * block.z;
    if (totalThreads > (unsigned)prop.maxThreadsPerBlock) {
        printf("❌ Block size %u exceeds max %d\n",
               totalThreads, prop.maxThreadsPerBlock);
        return false;
    }
    
    // Validate grid size
    if (grid.x > (unsigned)prop.maxGridSize[0] ||
        grid.y > (unsigned)prop.maxGridSize[1] ||
        grid.z > (unsigned)prop.maxGridSize[2]) {
        printf("❌ Grid size exceeds device limits\n");
        return false;
    }
    
    // dim3 members are unsigned, so print them with %u
    printf("✅ %s: grid(%u,%u,%u) block(%u,%u,%u) validated\n",
           kernelName, grid.x, grid.y, grid.z, block.x, block.y, block.z);
    return true;
}

int main() {
    if (!validateCudaSetup()) {
        return 1;
    }
    
    printf("\n=== Launch Validation Examples ===\n\n");
    
    // Valid configuration
    safeLaunch(nullptr, dim3(256), dim3(256), "validKernel");
    
    // Invalid: too many threads
    safeLaunch(nullptr, dim3(1), dim3(2048), "invalidKernel");
    
    return 0;
}

In [None]:
!nvcc -o debug_checklist debug_checklist.cu && ./debug_checklist

## 5. Common Pitfalls & Bug Patterns

### Pitfall 1: Missing Boundary Check

In [None]:
%%writefile pitfall_bounds.cu
#include <stdio.h>
#include <cuda_runtime.h>

// BAD: No boundary check - will access invalid memory!
__global__ void badKernelNoBounds(float* arr) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    arr[idx] = idx;  // 💥 May access out-of-bounds!
}

// GOOD: With boundary check - safe!
__global__ void goodKernelWithBounds(float* arr, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {  // ✅ Always check!
        arr[idx] = idx;
    }
}

int main() {
    printf("=== Boundary Check Demo ===\n\n");
    
    const int N = 100;
    const int THREADS = 256;  // More threads than elements!
    const int BLOCKS = 1;
    
    printf("Array size: %d elements\n", N);
    printf("Threads launched: %d (more than elements!)\n\n", THREADS * BLOCKS);
    
    float* d_arr;
    cudaMalloc(&d_arr, N * sizeof(float));
    
    // Safe version with bounds check
    goodKernelWithBounds<<<BLOCKS, THREADS>>>(d_arr, N);
    cudaDeviceSynchronize();
    
    cudaError_t error = cudaGetLastError();
    printf("%s Kernel with bounds check\n", 
           error == cudaSuccess ? "✅" : "❌");
    
    cudaFree(d_arr);
    
    printf("\n💡 Always use: if (idx < n) before accessing array[idx]\n");
    
    return 0;
}

In [None]:
!nvcc -o pitfall_bounds pitfall_bounds.cu && ./pitfall_bounds
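
The guard is needed because the grid size is rounded up to a whole number of blocks, so a launch usually contains more threads than array elements. The host-only sketch below counts those surplus threads; `blocksFor` and `idleThreads` are hypothetical helper names introduced for illustration, not part of the notebook or of CUDA.

```cpp
// Host-only sketch: how many launched threads fall outside the array?
// blocksFor and idleThreads are hypothetical helper names.
constexpr int blocksFor(int n, int threadsPerBlock) {
    return (n + threadsPerBlock - 1) / threadsPerBlock;  // round up
}

// Number of threads with idx >= n; each one must be stopped by `if (idx < n)`.
constexpr int idleThreads(int n, int threadsPerBlock) {
    return blocksFor(n, threadsPerBlock) * threadsPerBlock - n;
}
```

With the demo's values (100 elements, 256 threads per block), `idleThreads(100, 256)` is 156: without the bounds check, 156 threads would write past the end of the allocation.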

### Pitfall 2: Forgetting to Synchronize

In [None]:
%%writefile pitfall_sync.cu
#include <stdio.h>
#include <cuda_runtime.h>
#include <time.h>

__global__ void slowKernel(float* arr, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        // Simulate slow computation
        float val = 0.0f;
        for (int i = 0; i < 10000; i++) {
            val += idx * 0.0001f;
        }
        arr[idx] = val;
    }
}

int main() {
    printf("=== Synchronization Demo ===\n\n");
    
    const int N = 10000;
    float* d_arr;
    cudaMalloc(&d_arr, N * sizeof(float));
    
    int threads = 256;
    int blocks = (N + threads - 1) / threads;
    
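    // Warm-up launch so one-time startup costs do not distort the timings
    slowKernel<<<blocks, threads>>>(d_arr, N);
    cudaDeviceSynchronize();
    
    // BAD: Timing without synchronization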
    clock_t start1 = clock();
    slowKernel<<<blocks, threads>>>(d_arr, N);
    // Missing: cudaDeviceSynchronize();
    clock_t end1 = clock();
    float badTime = (float)(end1 - start1) / CLOCKS_PER_SEC * 1000;
    
    // Wait for kernel to complete before next measurement
    cudaDeviceSynchronize();
    
    // GOOD: Timing with synchronization
    clock_t start2 = clock();
    slowKernel<<<blocks, threads>>>(d_arr, N);
    cudaDeviceSynchronize();  // Wait for completion!
    clock_t end2 = clock();
    float goodTime = (float)(end2 - start2) / CLOCKS_PER_SEC * 1000;
    
    printf("Without sync: %.3f ms (WRONG - kernel still running!)\n", badTime);
    printf("With sync:    %.3f ms (Correct)\n", goodTime);
    if (badTime > 0.0f) {  // clock() may be too coarse to resolve the launch
        printf("\n⚠️  Unsynchronized time is %.0fx too fast!\n", goodTime / badTime);
    }
    
    printf("\n💡 Always call cudaDeviceSynchronize() before:\n");
    printf("   - Reading results from device memory\n");
    printf("   - Timing kernel execution\n");
    printf("   - Error checking for kernel issues\n");
    
    cudaFree(d_arr);
    return 0;
}

In [None]:
!nvcc -O3 -o pitfall_sync pitfall_sync.cu && ./pitfall_sync

### Pitfall 3: Wrong Data Type

In [None]:
%%writefile pitfall_dtype.cu
#include <stdio.h>
#include <cuda_runtime.h>

// Template kernel works with any floating point type
template<typename T>
__global__ void addArrays(T* a, T* b, T* c, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        c[idx] = a[idx] + b[idx];
    }
}

template<typename T>
float benchmarkAddition(int n, const char* typeName) {
    size_t size = n * sizeof(T);
    
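    // Allocate, then zero-fill the inputs so the kernel never reads uninitialized memory
    T *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, size);
    cudaMalloc(&d_b, size);
    cudaMalloc(&d_c, size);
    cudaMemset(d_a, 0, size);
    cudaMemset(d_b, 0, size);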
    
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    
    // Warmup
    addArrays<<<blocks, threads>>>(d_a, d_b, d_c, n);
    cudaDeviceSynchronize();
    
    // Time multiple runs
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    
    cudaEventRecord(start);
    for (int i = 0; i < 100; i++) {
        addArrays<<<blocks, threads>>>(d_a, d_b, d_c, n);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    
    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    ms /= 100;
    
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    
    printf("%s: %.3f ms\n", typeName, ms);
    return ms;
}

int main() {
    printf("=== Data Type Performance ===\n\n");
    
    const int N = 10000000;
    printf("Array size: %d elements\n\n", N);
    
    float timeF32 = benchmarkAddition<float>(N, "float32");
    float timeF64 = benchmarkAddition<double>(N, "float64");
    
    printf("\nSpeedup (float32 vs float64): %.2fx\n", timeF64 / timeF32);
    
    printf("\n💡 Use float32 unless you need float64 precision!\n");
    printf("   float64 moves twice the bytes, and most consumer GPUs also have limited float64 throughput.\n");
    
    return 0;
}

In [None]:
!nvcc -O3 -o pitfall_dtype pitfall_dtype.cu && ./pitfall_dtype
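
An element-wise add performs one addition per three memory accesses, so this benchmark is most likely limited by memory bandwidth rather than arithmetic. Under that assumption, much of the float64 slowdown comes from the doubled data volume. The host-only sketch below computes that volume; `bytesMoved` is a hypothetical helper name, not part of the notebook.

```cpp
#include <cstddef>

// Host-only sketch: global-memory traffic of c[i] = a[i] + b[i].
// bytesMoved is a hypothetical helper name.
template <typename T>
constexpr std::size_t bytesMoved(std::size_t n) {
    return 3 * n * sizeof(T);  // read a, read b, write c
}
```

For the benchmark's 10,000,000 elements this gives 120 MB per launch for `float` and 240 MB for `double`. Dividing the byte count by the measured kernel time gives a rough effective-bandwidth figure to compare against the GPU's specification.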

## 6. Debugging with compute-sanitizer

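NVIDIA's `compute-sanitizer` is like Valgrind for CUDA. Its tools catch:
- Out-of-bounds and misaligned memory accesses (`memcheck`, the default tool)
- Shared-memory race conditions (`racecheck`)
- Device memory leaks (`memcheck` with `--leak-check full`)
- Uninitialized global memory reads (`initcheck`)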

In [None]:
%%writefile sanitizer_test.cu
#include <stdio.h>
#include <string.h>  // strcmp
#include <cuda_runtime.h>

// This kernel has a bug - out of bounds access
__global__ void buggyKernel(float* arr, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    // BUG: Missing bounds check!
    arr[idx] = 42.0f;  // Will access beyond allocated memory
}

// This kernel is correct
__global__ void correctKernel(float* arr, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {  // Proper bounds check
        arr[idx] = 42.0f;
    }
}

int main(int argc, char** argv) {
    printf("=== compute-sanitizer Test ===\n\n");
    
    const int N = 100;
    float* d_arr;
    cudaMalloc(&d_arr, N * sizeof(float));
    
    // Run with more threads than elements to trigger the bug
    int threads = 256;
    int blocks = 1;
    
    if (argc > 1 && strcmp(argv[1], "--buggy") == 0) {
        printf("Running BUGGY kernel (will be caught by sanitizer)...\n");
        buggyKernel<<<blocks, threads>>>(d_arr, N);
    } else {
        printf("Running CORRECT kernel...\n");
        correctKernel<<<blocks, threads>>>(d_arr, N);
    }
    
    cudaDeviceSynchronize();
    
    cudaError_t error = cudaGetLastError();
    if (error != cudaSuccess) {
        printf("❌ CUDA Error: %s\n", cudaGetErrorString(error));
    } else {
        printf("✅ No CUDA errors detected by runtime\n");
    }
    
    cudaFree(d_arr);
    
    printf("\nRun with compute-sanitizer to detect memory errors:\n");
    printf("  compute-sanitizer --tool memcheck ./sanitizer_test --buggy\n");
    
    return 0;
}

In [None]:
!nvcc -g -G -o sanitizer_test sanitizer_test.cu && ./sanitizer_test

In [None]:
# Try to run with compute-sanitizer (may not be available on all systems)
!which compute-sanitizer && compute-sanitizer --tool memcheck ./sanitizer_test --buggy || echo "compute-sanitizer not available in this environment"

## 🎯 Exercises

### Exercise 1: Create a Safe Kernel Launcher
Complete this robust wrapper that validates all inputs before launching.

In [None]:
%%writefile exercise1_safe_launch.cu
#include <stdio.h>
#include <cuda_runtime.h>

#define CUDA_CHECK(call) \
    do { \
        cudaError_t err = call; \
        if (err != cudaSuccess) { \
            fprintf(stderr, "CUDA Error: %s\n", cudaGetErrorString(err)); \
            return false; \
        } \
    } while(0)

__global__ void scaleKernel(float* data, float scale, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) data[idx] *= scale;
}

// TODO: Implement this safe launcher
bool safeLaunchScaleKernel(float* d_data, float scale, int n, int threadsPerBlock) {
    // TODO: Add these checks:
    // 1. Verify d_data is not NULL
    // 2. Verify n > 0
    // 3. Verify threadsPerBlock is 1-1024
    // 4. Calculate proper grid size
    // 5. Check available memory
    // 6. Launch kernel
    // 7. Check for launch errors
    // 8. Synchronize and check for execution errors
    
    return true;  // Return false on any error
}

int main() {
    printf("Exercise 1: Implement safeLaunchScaleKernel!\n");
    printf("Add validation for all inputs and proper error checking.\n");
    return 0;
}

## 📝 Key Takeaways

### Error Handling Best Practices:

1. **Use CUDA_CHECK macro** for every API call
   ```cpp
   CUDA_CHECK(cudaMalloc(&ptr, size));
   ```

2. **Check kernel errors immediately**
   ```cpp
   kernel<<<grid, block>>>(...);
   CUDA_CHECK(cudaGetLastError());  // Launch errors
   CUDA_CHECK(cudaDeviceSynchronize());  // Execution errors
   ```

3. **Always include boundary checks**
   ```cpp
   if (idx < n) {
       array[idx] = ...;
   }
   ```

4. **Validate launch configuration**
   - threads_per_block ≤ 1024
   - Check grid dimensions against device limits

5. **Use compute-sanitizer for debugging**
   ```bash
   compute-sanitizer --tool memcheck ./myprogram
   ```

---

### 📚 Week 1 Complete!
You've learned:
- Day 1: GPU basics and your first kernel
- Day 2: Thread indexing and grid-stride loops
- Day 3: Memory management fundamentals
- Day 4: Error handling and debugging

**Next:** Week 2 - Shared Memory & Performance Basics

---

### 🔗 Resources
- [CUDA Error Handling](../../cuda-programming-guide/02-basics/intro-to-cuda-cpp.md)
- [Quick Reference](../../notes/cuda-quick-reference.md)

In [None]:
# Cleanup generated files
!rm -f cuda_check common_errors debug_checklist pitfall_bounds pitfall_sync pitfall_dtype sanitizer_test
!rm -f *.cu
print("✅ Cleanup complete!")

In [None]:
# ⚙️ Colab/Local Setup - Run this first!
# Python/Numba is OPTIONAL - for quick interactive testing only
import subprocess, sys
try:
    import google.colab
    print("🔧 Running on Google Colab - Installing dependencies...")
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", "numba"])
    print("✅ Setup complete!")
except ImportError:
    print("💻 Running locally - make sure you have: pip install numba numpy")

import numpy as np
from numba import cuda
from numba.cuda.cudadrv.driver import CudaAPIError
import math
import traceback

print("\n⚠️  Remember: CUDA C++ code is the PRIMARY learning material!")
print("   Python/Numba is provided for quick interactive testing only.")

# Day 4: Error Handling & Debugging

Bugs in CUDA code can be subtle and hard to find. Today you'll learn:
- How CUDA errors work (and the `CUDA_CHECK` macro)
- Proper error checking patterns in CUDA C++
- Common pitfalls and how to avoid them
- Debugging with `compute-sanitizer` (the successor to `cuda-memcheck`, which was deprecated and then removed in CUDA 12)

---

## 1. Understanding CUDA Errors

CUDA operations can fail for many reasons:
- Invalid kernel launch configuration
- Out of memory
- Device not available
- Invalid memory access
- Race conditions

**Key concept:** CUDA operations are often **asynchronous**. Errors may not be reported until later!

```cpp
kernel<<<grid, block>>>(...);  // Launches, returns immediately
// ... other code ...
cudaDeviceSynchronize();       // Error might appear HERE!
```

In [None]:
# Setup
import numpy as np
from numba import cuda
from numba.cuda.cudadrv.driver import CudaAPIError
import math
import traceback

print("CUDA device:", cuda.get_current_device().name.decode())

## 2. Common CUDA Errors & How to Trigger Them

Let's intentionally cause errors to understand how they appear.

In [None]:
# Error 1: Invalid Launch Configuration
# Max threads per block is 1024, what happens if we exceed it?

@cuda.jit
def simple_kernel(arr):
    idx = cuda.grid(1)
    if idx < arr.size:
        arr[idx] = idx

arr = np.zeros(100, dtype=np.float32)
arr_d = cuda.to_device(arr)

print("Attempting to launch with 2048 threads per block...")
print("(Max allowed is 1024)")
print("-" * 50)

try:
    # This will fail - too many threads per block!
    simple_kernel[1, 2048](arr_d)
    cuda.synchronize()
except Exception as e:
    print(f"❌ Error caught: {type(e).__name__}")
    print(f"   Message: {e}")

In [None]:
# Error 2: Out of Memory
# Trying to allocate more than available GPU memory

ctx = cuda.current_context()
free_mem, total_mem = ctx.get_memory_info()
print(f"Free GPU memory: {free_mem / 1e9:.2f} GB")
print(f"Attempting to allocate: {free_mem * 2 / 1e9:.2f} GB (2x available)")
print("-" * 50)

try:
    # Try to allocate more than available
    huge_array = cuda.device_array(int(free_mem * 2), dtype=np.uint8)
except Exception as e:
    print(f"❌ Error caught: {type(e).__name__}")
    print(f"   Message: {e}")

## 3. The Debugging Checklist

When your CUDA code doesn't work, check these in order:

### 🔍 Checklist

1. **Is CUDA available?**
   ```python
   cuda.is_available()
   ```

2. **Are launch parameters valid?**
   - `threads_per_block` ≤ 1024
   - `blocks` > 0
   - Grid dimensions within limits

3. **Is there enough memory?**
   - Check `cuda.current_context().get_memory_info()`

4. **Are array sizes correct?**
   - Boundary checks in kernel: `if idx < n:`

5. **Are data types matching?**
   - GPU prefers float32, not float64

6. **Did you synchronize?**
   - `cuda.synchronize()` before reading results

In [None]:
# Helper function: Safe kernel launch wrapper
def safe_launch(kernel, grid, block, *args, **kwargs):
    """Launch kernel with error checking"""
    device = cuda.get_current_device()
    
    # Validate block size
    if isinstance(block, int):
        block = (block,)
    total_threads = 1
    for dim in block:
        total_threads *= dim
    if total_threads > device.MAX_THREADS_PER_BLOCK:
        raise ValueError(f"Block size {block} = {total_threads} threads exceeds max {device.MAX_THREADS_PER_BLOCK}")
    
    # Validate grid size
    if isinstance(grid, int):
        grid = (grid,)
    for i, dim in enumerate(grid):
        max_dim = [device.MAX_GRID_DIM_X, device.MAX_GRID_DIM_Y, device.MAX_GRID_DIM_Z][i]
        if dim > max_dim:
            raise ValueError(f"Grid dimension {i} = {dim} exceeds max {max_dim}")
    
    # Launch
    kernel[grid, block](*args, **kwargs)
    cuda.synchronize()

# Test safe launch
print("Testing safe_launch helper:")
arr = cuda.device_array(100, dtype=np.float32)

try:
    safe_launch(simple_kernel, 1, 2048, arr)  # Should fail validation
except ValueError as e:
    print(f"✅ Caught before launch: {e}")

safe_launch(simple_kernel, 1, 256, arr)  # Should work
print("✅ Valid launch succeeded")

## 4. Common Pitfalls & Bug Patterns

### Pitfall 1: Missing Boundary Check

In [None]:
# BAD: No boundary check
@cuda.jit
def bad_kernel_no_bounds(arr):
    idx = cuda.grid(1)
    arr[idx] = idx  # 💥 Will access out-of-bounds memory!

# GOOD: With boundary check
@cuda.jit
def good_kernel_with_bounds(arr, n):
    idx = cuda.grid(1)
    if idx < n:  # ✅ Always check!
        arr[idx] = idx

# Demonstrate the difference
n = 100
arr = cuda.device_array(n, dtype=np.float32)
threads = 256  # More threads than elements!
blocks = 1

print("With proper bounds checking:")
good_kernel_with_bounds[blocks, threads](arr, n)
cuda.synchronize()
print("✅ Completed safely")

### Pitfall 2: Wrong Data Type

In [None]:
# NumPy defaults to float64, but CUDA prefers float32
import time

@cuda.jit
def add_arrays(a, b, c):
    idx = cuda.grid(1)
    if idx < c.size:
        c[idx] = a[idx] + b[idx]

n = 10_000_000

# float64 (default) - slower on most GPUs
a64 = np.random.randn(n)  # Default is float64!
b64 = np.random.randn(n)
c64 = np.zeros(n)

# float32 - preferred
a32 = np.random.randn(n).astype(np.float32)
b32 = np.random.randn(n).astype(np.float32)
c32 = np.zeros(n, dtype=np.float32)

threads, blocks = 256, math.ceil(n / 256)

# Benchmark float64
a64_d, b64_d = cuda.to_device(a64), cuda.to_device(b64)
c64_d = cuda.device_array(n, dtype=np.float64)
add_arrays[blocks, threads](a64_d, b64_d, c64_d)
cuda.synchronize()

start = time.perf_counter()
for _ in range(10):
    add_arrays[blocks, threads](a64_d, b64_d, c64_d)
cuda.synchronize()
time64 = (time.perf_counter() - start) / 10

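# Benchmark float32
a32_d, b32_d = cuda.to_device(a32), cuda.to_device(b32)
c32_d = cuda.device_array(n, dtype=np.float32)
add_arrays[blocks, threads](a32_d, b32_d, c32_d)  # Warm-up: triggers JIT compilation for float32
cuda.synchronize()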

start = time.perf_counter()
for _ in range(10):
    add_arrays[blocks, threads](a32_d, b32_d, c32_d)
cuda.synchronize()
time32 = (time.perf_counter() - start) / 10

print(f"float64: {time64*1000:.3f} ms")
print(f"float32: {time32*1000:.3f} ms")
print(f"Speedup: {time64/time32:.2f}x")
print("\n💡 Tip: Always use .astype(np.float32) unless you need float64 precision!")

### Pitfall 3: Forgetting to Synchronize

In [None]:
# Kernel execution is ASYNCHRONOUS
@cuda.jit
def slow_kernel(arr):
    """Simulate slow computation"""
    idx = cuda.grid(1)
    if idx < arr.size:
        # Busy work
        val = 0.0
        for i in range(1000):
            val += idx * 0.001
        arr[idx] = val

arr = cuda.device_array(10000, dtype=np.float32)
threads, blocks = 256, math.ceil(10000 / 256)

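# Warm-up launch: the first call JIT-compiles the kernel, which would otherwise dominate the timing
slow_kernel[blocks, threads](arr)
cuda.synchronize()

# BAD: Timing without synchronization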
start = time.perf_counter()
slow_kernel[blocks, threads](arr)
# Missing: cuda.synchronize()
bad_time = time.perf_counter() - start
print(f"Without sync: {bad_time*1000:.3f} ms (WRONG! Kernel still running)")

# GOOD: Proper timing with synchronization  
start = time.perf_counter()
slow_kernel[blocks, threads](arr)
cuda.synchronize()  # Wait for kernel to complete
good_time = time.perf_counter() - start
print(f"With sync:    {good_time*1000:.3f} ms (Correct)")

print(f"\n⚠️ The unsynchronized time is {good_time/bad_time:.0f}x too fast!")

## 5. Debugging with Print Statements

In Numba CUDA, you can use `print()` inside kernels for debugging (but use sparingly - it's slow!).

In [None]:
@cuda.jit
def debug_kernel(arr, n):
    idx = cuda.grid(1)
    
    # Only print from first few threads to avoid output flood
    if idx < 3:
        print("Thread", idx, "starting")
    
    if idx < n:
        arr[idx] = idx * 2
        
        # Debug: Print values for first few elements
        if idx < 3:
            print("Thread", idx, "wrote value", arr[idx])

# Run with small array
arr = cuda.device_array(10, dtype=np.float32)
debug_kernel[1, 10](arr, 10)
cuda.synchronize()

print("\nFinal array:", arr.copy_to_host())

## 🎯 Additional Exercises

### 🔷 CUDA C++ Exercises (Primary)

Complete this error handling exercise in CUDA C++.

In [None]:
%%writefile error_handling_exercises.cu
// error_handling_exercises.cu - Safe kernel launcher exercise
#include <stdio.h>
#include <cuda_runtime.h>
#include <stdlib.h>

// =============================================================================
// CUDA Error Checking Macro
// =============================================================================

#define CUDA_CHECK(call) \
    do { \
        cudaError_t err = call; \
        if (err != cudaSuccess) { \
            fprintf(stderr, "CUDA Error at %s:%d: %s\n", __FILE__, __LINE__, \
                    cudaGetErrorString(err)); \
            return false; \
        } \
    } while(0)

// =============================================================================
// Exercise 1: Error-Proof Kernel Wrapper
// Create a robust wrapper that validates all inputs before launching
// =============================================================================

__global__ void scaleKernel(float* data, float scale, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        data[idx] *= scale;
    }
}

// Safe launcher that validates inputs
bool safeLaunchScaleKernel(float* d_data, float scale, int n, int threadsPerBlock) {
    // 1. Check device is available
    int deviceCount;
    CUDA_CHECK(cudaGetDeviceCount(&deviceCount));
    if (deviceCount == 0) {
        fprintf(stderr, "Error: No CUDA devices found\n");
        return false;
    }
    
    // 2. Validate pointer (the kernel would dereference a NULL device pointer)
    if (d_data == NULL) {
        fprintf(stderr, "Error: Device pointer is NULL\n");
        return false;
    }
    
    // 3. Validate array size
    if (n <= 0) {
        fprintf(stderr, "Error: Array size must be positive (got %d)\n", n);
        return false;
    }
    
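    // 4. Validate threads per block against the limits of the current device
    //    (query the active device rather than assuming device 0)
    int device;
    CUDA_CHECK(cudaGetDevice(&device));
    cudaDeviceProp prop;
    CUDA_CHECK(cudaGetDeviceProperties(&prop, device));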
    
    if (threadsPerBlock <= 0 || threadsPerBlock > prop.maxThreadsPerBlock) {
        fprintf(stderr, "Error: Invalid threads per block: %d (max: %d)\n", 
                threadsPerBlock, prop.maxThreadsPerBlock);
        return false;
    }
    
    // 5. Calculate grid size and validate
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    if (blocks > prop.maxGridSize[0]) {
        fprintf(stderr, "Error: Grid too large: %d blocks (max: %d)\n",
                blocks, prop.maxGridSize[0]);
        return false;
    }
    
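    // 6. Check for enough memory
    //    NOTE: illustrative only. d_data is already allocated and this kernel
    //    works in place, so no additional memory is needed here. In a wrapper
    //    that allocates its own buffers, this check belongs before cudaMalloc.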
    size_t freeMem, totalMem;
    CUDA_CHECK(cudaMemGetInfo(&freeMem, &totalMem));
    size_t required = n * sizeof(float);
    if (required > freeMem) {
        fprintf(stderr, "Error: Not enough GPU memory. Need %zu, have %zu\n",
                required, freeMem);
        return false;
    }
    
    // 7. Launch kernel
    printf("  Launching with %d blocks × %d threads\n", blocks, threadsPerBlock);
    scaleKernel<<<blocks, threadsPerBlock>>>(d_data, scale, n);
    
    // 8. Check for launch errors
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "Kernel launch error: %s\n", cudaGetErrorString(err));
        return false;
    }
    
    // 9. Check for execution errors
    CUDA_CHECK(cudaDeviceSynchronize());
    
    return true;
}

// =============================================================================
// Test harness
// =============================================================================

int main() {
    printf("=== Error Handling Exercise ===\n\n");
    
    // Test valid launch
    printf("Test 1: Valid launch\n");
    {
        const int N = 1000;
        float *d_data;
        cudaMalloc(&d_data, N * sizeof(float));
        
        // Initialize
        float *h_data = (float*)malloc(N * sizeof(float));
        for (int i = 0; i < N; i++) h_data[i] = (float)i;
        cudaMemcpy(d_data, h_data, N * sizeof(float), cudaMemcpyHostToDevice);
        
        bool result = safeLaunchScaleKernel(d_data, 2.0f, N, 256);
        printf("  Result: %s\n\n", result ? "✓ SUCCESS" : "✗ FAILED");
        
        cudaFree(d_data);
        free(h_data);
    }
    
    // Test invalid threads per block
    printf("Test 2: Invalid threads per block (2048)\n");
    {
        const int N = 1000;
        float *d_data;
        cudaMalloc(&d_data, N * sizeof(float));
        
        bool result = safeLaunchScaleKernel(d_data, 2.0f, N, 2048);  // Too many!
        printf("  Result: %s (expected: caught error)\n\n", 
               result ? "✗ SHOULD HAVE FAILED" : "✓ Correctly caught");
        
        cudaFree(d_data);
    }
    
    // Test NULL pointer
    printf("Test 3: NULL pointer\n");
    {
        bool result = safeLaunchScaleKernel(NULL, 2.0f, 1000, 256);
        printf("  Result: %s (expected: caught error)\n\n",
               result ? "✗ SHOULD HAVE FAILED" : "✓ Correctly caught");
    }
    
    // Test invalid array size
    printf("Test 4: Invalid array size (0)\n");
    {
        float *d_data;
        cudaMalloc(&d_data, 1000 * sizeof(float));
        
        bool result = safeLaunchScaleKernel(d_data, 2.0f, 0, 256);
        printf("  Result: %s (expected: caught error)\n\n",
               result ? "✗ SHOULD HAVE FAILED" : "✓ Correctly caught");
        
        cudaFree(d_data);
    }
    
    printf("=== All tests complete! ===\n");
    return 0;
}

In [None]:
!nvcc -arch=sm_75 -o error_handling_exercises error_handling_exercises.cu && ./error_handling_exercises

### 🔶 Python/Numba Exercises (Optional)

### Exercise 1: Error-Proof Kernel Wrapper
Create a robust wrapper function that validates all inputs before launching a kernel.

In [None]:
# TODO Exercise 1: Complete this error-proof wrapper

def launch_kernel_safe(kernel, data, threads_per_block=256):
    """
    Safely launch a kernel with automatic configuration and error checking.
    
    Args:
        kernel: The CUDA kernel function
        data: Input array (numpy or device array)
        threads_per_block: Threads per block (default 256)
    
    Returns:
        Device array with results
        
    Raises:
        ValueError: If inputs are invalid
        MemoryError: If not enough GPU memory
    """
    # TODO: Implement the following checks:
    # 1. Verify CUDA is available
    # 2. Check data is not empty
    # 3. Validate threads_per_block (1-1024)
    # 4. Check sufficient GPU memory
    # 5. Launch kernel with proper grid configuration
    # 6. Synchronize and check for errors
    
    pass

# Test your implementation
# ...
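
The configuration-related checks (2, 3 and 5) do not need a GPU, so they can be prototyped and tested in plain Python before any kernel is involved. The sketch below is one possible starting point, not a reference solution. `compute_launch_config` is a hypothetical helper name, and the default limits (1024 threads per block, 2^31 - 1 blocks in the x dimension) are assumptions that match many recent NVIDIA GPUs; a real wrapper should read them from the device instead of hard-coding them.

```python
def compute_launch_config(n, threads_per_block=256,
                          max_threads_per_block=1024,   # assumed device limit
                          max_grid_dim_x=2**31 - 1):    # assumed device limit
    """Validate a 1D launch and return (blocks, threads_per_block).

    Hypothetical helper: raises ValueError instead of launching a kernel
    with a configuration that the device would reject.
    """
    if n <= 0:
        raise ValueError(f"Array size must be positive (got {n})")
    if not 1 <= threads_per_block <= max_threads_per_block:
        raise ValueError(
            f"Invalid threads per block: {threads_per_block} "
            f"(max: {max_threads_per_block})"
        )
    # Ceiling division: enough blocks to cover all n elements
    blocks = (n + threads_per_block - 1) // threads_per_block
    if blocks > max_grid_dim_x:
        raise ValueError(
            f"Grid too large: {blocks} blocks (max: {max_grid_dim_x})"
        )
    return blocks, threads_per_block


print(compute_launch_config(1000, 256))  # prints (4, 256)
```

Keeping this logic separate from the launch itself means the invalid cases from the C++ test harness (2048 threads, size 0) can be exercised without a device.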

## 📝 Key Takeaways

### Error Handling Best Practices:

1. **Always synchronize** before reading results or timing
   ```python
   kernel[grid, block](...)
   cuda.synchronize()  # Wait for completion
   result = output.copy_to_host()
   ```

2. **Validate launch configuration**
   - threads_per_block ≤ 1024
   - Check grid dimensions against device limits

3. **Always include boundary checks**
   ```python
   if idx < n:
       arr[idx] = ...
   ```

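4. **Use try/except for error handling**, and synchronize inside the `try` so that asynchronous kernel errors surface there
   ```python
   from numba.cuda.cudadrv.driver import CudaAPIError

   try:
       kernel[grid, block](...)
       cuda.synchronize()  # Asynchronous errors are reported here
   except CudaAPIError as e:
       print(f"CUDA Error: {e}")
   ```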

5. **Prefer float32** unless you need float64 precision

6. **Debug strategically**
   - Use print() sparingly (only first few threads)
   - Use small test cases first
   - Check GPU results against a trusted CPU reference
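
Comparing GPU output with a CPU reference, as in the last bullet, is easiest when the check reports where the first disagreement occurs. The following is a minimal sketch under stated assumptions: `first_mismatch` is a hypothetical helper name, the tolerances are arbitrary choices, and the "GPU result" is a plain list standing in for `output.copy_to_host()` so that the example runs without a GPU.

```python
import math


def first_mismatch(gpu_result, cpu_reference, rel_tol=1e-5, abs_tol=1e-8):
    """Return the index of the first element that differs beyond the
    tolerance, or None if the two sequences agree everywhere."""
    if len(gpu_result) != len(cpu_reference):
        raise ValueError("Result and reference have different lengths")
    for i, (g, c) in enumerate(zip(gpu_result, cpu_reference)):
        if not math.isclose(g, c, rel_tol=rel_tol, abs_tol=abs_tol):
            return i
    return None


# Small test case first: 10 elements, same computation as debug_kernel
cpu_reference = [i * 2.0 for i in range(10)]

gpu_result = list(cpu_reference)  # stand-in for output.copy_to_host()
gpu_result[7] = -1.0              # simulate a kernel bug at index 7

print(first_mismatch(gpu_result, cpu_reference))  # prints 7
```

Knowing the failing index often points straight at the cause, for example a mismatch that starts exactly at a block boundary suggests an indexing or boundary-check problem.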

---

### 📚 Next Steps
You've completed Week 1! Before moving on:
1. Complete the checkpoint quiz
2. Finish all exercises in each notebook
3. Make sure you can run all code without errors

### 🔗 Resources
- [Error Handling Guide](../../cuda-programming-guide/02-basics/nvcc.md)
- [Debugging Documentation](../../cuda-programming-guide/04-special-topics/error-log-management.md)