# Getting Started with CUDA in Python

This notebook introduces GPU programming using Numba and CuPy on Google Colab.

**Before running:** Make sure GPU is enabled!
- `Runtime → Change runtime type → GPU`

## 1️⃣ Check GPU Availability

In [None]:
# Check what GPU we have
!nvidia-smi

## 2️⃣ Verify Numba CUDA

In [None]:
from numba import cuda
import numpy as np

print("CUDA Available:", cuda.is_available())

if cuda.is_available():
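    device = cuda.get_current_device()
    name = device.name  # May be bytes or str depending on the Numba version/binding
    print("GPU Name:", name.decode() if isinstance(name, bytes) else name)
    print("Compute Capability:", device.compute_capability)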

## 3️⃣ Verify CuPy

In [None]:
import cupy as cp

print("CuPy version:", cp.__version__)
print("CUDA runtime version (encoded integer):", cp.cuda.runtime.runtimeGetVersion())

# Simple test
x_gpu = cp.array([1, 2, 3, 4, 5])
print("\nGPU array:", x_gpu)
print("Sum on GPU:", cp.sum(x_gpu))
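
`runtimeGetVersion()` returns a single integer rather than a dotted version string. CUDA encodes versions as `1000 * major + 10 * minor`, so a small helper can make the output easier to read. The function name `decode_cuda_version` below is hypothetical (it is not part of CuPy), and the sample values are illustrative rather than output from this notebook.

```python
def decode_cuda_version(version: int) -> str:
    """Turn an encoded CUDA version such as 12020 into '12.2'.

    Hypothetical helper: assumes the 1000 * major + 10 * minor encoding.
    """
    major = version // 1000
    minor = (version % 1000) // 10
    return f"{major}.{minor}"


# Illustrative inputs, not measured on any particular machine
print(decode_cuda_version(12020))  # 12.2
print(decode_cuda_version(11080))  # 11.8
```

In the cell above, this could be called as `decode_cuda_version(cp.cuda.runtime.runtimeGetVersion())`.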

## 4️⃣ First Numba CUDA Kernel: Vector Addition

Let's write our first CUDA kernel to add two arrays element-wise.

In [None]:
from numba import cuda
import numpy as np
import math

@cuda.jit
def add_kernel(a, b, c):
    """
    CUDA kernel to add two arrays: c = a + b
    Each thread handles one element.
    """
    idx = cuda.grid(1)  # Get global thread ID
    
    if idx < c.size:  # Boundary check
        c[idx] = a[idx] + b[idx]

# Create test data
n = 1_000_000
a = np.ones(n, dtype=np.float32)
b = np.ones(n, dtype=np.float32) * 2
c = np.zeros(n, dtype=np.float32)

# Copy to GPU
a_gpu = cuda.to_device(a)
b_gpu = cuda.to_device(b)
c_gpu = cuda.to_device(c)

# Configure kernel launch
threads_per_block = 256
blocks = math.ceil(n / threads_per_block)

# Launch kernel
add_kernel[blocks, threads_per_block](a_gpu, b_gpu, c_gpu)

# Copy result back to CPU
c_result = c_gpu.copy_to_host()

print(f"First 10 results: {c_result[:10]}")
print(f"All values correct: {np.allclose(c_result, 3.0)}")
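
To see why the boundary check matters, it can help to emulate the launch on the CPU. The sketch below is not how the GPU executes the kernel (threads run in parallel there, not in a loop). It only reproduces the index arithmetic: `cuda.grid(1)` evaluates to `blockIdx.x * blockDim.x + threadIdx.x`. The helper names `launch_config` and `emulate_add` are hypothetical.

```python
import math

import numpy as np


def launch_config(n, threads_per_block=256):
    """Return (blocks, threads_per_block) such that blocks * threads >= n."""
    blocks = math.ceil(n / threads_per_block)
    return blocks, threads_per_block


def emulate_add(a, b, threads_per_block=256):
    """Sequential CPU emulation of add_kernel's indexing (illustration only)."""
    n = a.size
    c = np.zeros_like(a)
    blocks, tpb = launch_config(n, threads_per_block)
    for block_idx in range(blocks):
        for thread_idx in range(tpb):
            idx = block_idx * tpb + thread_idx  # same value cuda.grid(1) yields
            if idx < n:  # surplus threads in the last block must do nothing
                c[idx] = a[idx] + b[idx]
    return c


n = 1000
blocks, tpb = launch_config(n)
# 4 blocks of 256 threads cover 1024 slots, so 24 threads have no element
print(blocks, tpb, blocks * tpb - n)  # 4 256 24
```

Because the grid is rounded up to a whole number of blocks, the last block usually contains threads whose index is past the end of the array. Without the `idx < c.size` check, those threads would read and write out of bounds.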

## 5️⃣ CuPy: NumPy on GPU

CuPy provides a NumPy-like interface for GPU arrays.

In [None]:
import cupy as cp
import numpy as np
import time

# Create large arrays
n = 10_000_000

# CPU version
a_cpu = np.random.rand(n)
b_cpu = np.random.rand(n)

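start = time.perf_counter()
c_cpu = a_cpu + b_cpu
cpu_time = time.perf_counter() - start

# GPU version
a_gpu = cp.random.rand(n)
b_gpu = cp.random.rand(n)

_ = a_gpu + b_gpu  # Warm-up: the first call includes one-off kernel compilation
cp.cuda.Stream.null.synchronize()  # Finish setup work before starting the clock

start = time.perf_counter()
c_gpu = a_gpu + b_gpu
cp.cuda.Stream.null.synchronize()  # Wait for GPU to finish
gpu_time = time.perf_counter() - start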

print(f"CPU time: {cpu_time:.4f}s")
print(f"GPU time: {gpu_time:.4f}s")
print(f"Speedup: {cpu_time/gpu_time:.2f}x")
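
A single timed run is noisy, and GPU work is asynchronous, so the clock should only start once earlier GPU work has finished. One way to package this is a small helper that warms up, synchronizes, and keeps the best of several runs. This is a minimal sketch: `time_op` and its `sync`, `warmup`, and `repeats` parameters are hypothetical names, and the demo uses NumPy so it runs without a GPU.

```python
import time

import numpy as np


def time_op(fn, sync=None, warmup=1, repeats=5):
    """Return the best wall-clock time of fn() in seconds (hypothetical helper).

    sync: optional callable that blocks until pending device work is done.
    """
    def run_once():
        if sync is not None:
            sync()  # drain earlier work so it is not charged to this run
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # wait for the timed work itself to finish
        return time.perf_counter() - start

    for _ in range(warmup):
        run_once()  # untimed: absorbs one-off compilation and allocation cost
    return min(run_once() for _ in range(repeats))


a = np.random.rand(1_000_000)
b = np.random.rand(1_000_000)
best = time_op(lambda: a + b)
print(f"Best of 5 runs: {best:.6f}s")
```

For the CuPy arrays in the cell above, the same helper could be called as `time_op(lambda: a_gpu + b_gpu, sync=cp.cuda.Stream.null.synchronize)`.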

## 6️⃣ Matrix Multiplication Comparison

In [None]:
import cupy as cp
import numpy as np
import time

# Matrix size
size = 2000

# CPU
A_cpu = np.random.rand(size, size)
B_cpu = np.random.rand(size, size)

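start = time.perf_counter()
C_cpu = np.dot(A_cpu, B_cpu)
cpu_time = time.perf_counter() - start

# GPU
A_gpu = cp.random.rand(size, size)
B_gpu = cp.random.rand(size, size)

_ = cp.dot(A_gpu, B_gpu)  # Warm-up: the first call pays one-off library setup cost
cp.cuda.Stream.null.synchronize()  # Finish setup work before starting the clock

start = time.perf_counter()
C_gpu = cp.dot(A_gpu, B_gpu)
cp.cuda.Stream.null.synchronize()  # Wait for GPU to finish
gpu_time = time.perf_counter() - start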

print(f"Matrix size: {size}x{size}")
print(f"CPU time: {cpu_time:.4f}s")
print(f"GPU time: {gpu_time:.4f}s")
print(f"Speedup: {cpu_time/gpu_time:.2f}x")
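
Raw seconds are hard to compare across matrix sizes. Multiplying two `size x size` matrices takes roughly `2 * size**3` floating-point operations (one multiply and one add per inner-product term), so a time can be converted into throughput. The helper name `matmul_gflops` is hypothetical, and the `0.5` second figure below is a placeholder, not a measurement from this notebook.

```python
def matmul_gflops(size, seconds):
    """Approximate throughput of a size x size matmul in GFLOP/s.

    Hypothetical helper: assumes the usual 2 * size**3 operation count.
    """
    flops = 2 * size ** 3
    return flops / seconds / 1e9


# Placeholder timing of 0.5 s, used only to show the arithmetic
print(matmul_gflops(2000, 0.5))  # 32.0
```

With the variables from the cell above, `matmul_gflops(size, cpu_time)` and `matmul_gflops(size, gpu_time)` would give the two throughputs side by side.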

## 🎯 Summary

In this notebook, you learned how to:
- ✅ Check GPU availability in Colab
- ✅ Verify the Numba and CuPy installations
- ✅ Write a simple CUDA kernel with Numba
- ✅ Use CuPy for NumPy-like GPU operations
- ✅ Compare CPU and GPU performance

**Next Steps:**
- Learn about CUDA thread hierarchy (blocks, grids, threads)
- Explore memory management (global, shared, local memory)
- Optimize kernel performance