# 🚀 Day 1: GPU Fundamentals & Your First CUDA Program

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sdodlapati3/cuda-lab/blob/main/learning-path/week-01/day-1-gpu-basics.ipynb)

## Learning Philosophy

> **CUDA C++ First, Python/Numba as Optional Backup**

This notebook shows:
1. **CUDA C++ code** - The PRIMARY implementation you should learn
2. **Python/Numba code** - OPTIONAL for quick interactive testing in Colab

> **Note:** If running on Google Colab, go to `Runtime → Change runtime type → T4 GPU` before starting!

---

In [None]:
# ⚙️ Colab/Local Setup - Run this first!
# Python/Numba is OPTIONAL - for quick interactive testing only
# Primary learning should be done with CUDA C++ code

import subprocess, sys
try:
    import google.colab
    print("🔧 Running on Google Colab - Installing dependencies...")
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", "numba"])
    print("✅ Setup complete!")
except ImportError:
    print("💻 Running locally - make sure you have: pip install numba numpy")

import numpy as np
from numba import cuda
import math

print("\n⚠️  Remember: CUDA C++ code is the PRIMARY learning material!")
print("   Python/Numba is provided for quick interactive testing only.")

# Day 1: GPU Fundamentals & Your First CUDA Program

Welcome to your CUDA learning journey! Today we'll understand:
- Why GPUs exist and when to use them
- GPU architecture basics
- How to query GPU properties
- Your first CUDA kernel

**Prerequisites:** Basic C/C++ knowledge, understanding of pointers

---

## 1. Why GPUs? The Parallel Computing Revolution

### The Fundamental Problem

Modern applications process **massive amounts of data**:
- Neural networks: billions of matrix operations
- Video processing: millions of pixels per frame
- Scientific simulations: countless particles/cells

CPUs are optimized for **speed on single tasks** (latency).  
GPUs are optimized for **throughput on many tasks** (parallelism).

### CPU vs GPU: An Analogy

```
🚗 CPU (Sports Car)          🚛 GPU (Fleet of Trucks)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
• 4-16 very fast cores       • 1000s of simpler cores
• Complex control logic      • Simple control logic  
• Large caches per core      • Smaller shared caches
• Great for: 1 task FAST     • Great for: MANY tasks
```

**Delivering 10,000 packages:**
- CPU: 4 sports cars × 2,500 trips = slow
- GPU: 1,000 trucks × 10 trips = FAST!

## 2. Your First CUDA Program: Device Query

### CUDA C++ Implementation (Primary)

In [None]:
%%writefile device_query.cu
// device_query.cu - Query GPU properties
#include <stdio.h>
#include <cuda_runtime.h>

int main() {
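    int deviceCount = 0;
    // Check the return status: a missing or mismatched driver is reported here
    cudaError_t err = cudaGetDeviceCount(&deviceCount);
    if (err != cudaSuccess) {
        printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }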
    
    if (deviceCount == 0) {
        printf("No CUDA devices found!\n");
        return 1;
    }
    
    printf("Found %d CUDA device(s)\n\n", deviceCount);
    
    for (int dev = 0; dev < deviceCount; dev++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        
        printf("Device %d: %s\n", dev, prop.name);
        printf("  Compute Capability: %d.%d\n", prop.major, prop.minor);
        printf("  Multiprocessors: %d\n", prop.multiProcessorCount);
        printf("  Max Threads/Block: %d\n", prop.maxThreadsPerBlock);
        printf("  Warp Size: %d\n", prop.warpSize);
        printf("  Global Memory: %.2f GB\n", prop.totalGlobalMem / 1e9);
        printf("  Shared Memory/Block: %.1f KB\n", prop.sharedMemPerBlock / 1024.0);
    }
    
    return 0;
}

In [None]:
# sm_75 matches the Colab T4 (compute capability 7.5); adjust -arch for other GPUs
!nvcc -arch=sm_75 -o device_query device_query.cu
!./device_query

### Python/Numba (Optional - Interactive Testing)

In [None]:
# Python equivalent for quick testing (OPTIONAL)
from numba import cuda

print("=" * 50)
print("CUDA AVAILABILITY CHECK")
print("=" * 50)

if cuda.is_available():
    print("✅ CUDA is available!")
    print(f"   CUDA GPUs detected: {len(cuda.gpus)}")
else:
    print("❌ CUDA is NOT available!")
    print("   Make sure you have:")
    print("   1. NVIDIA GPU installed")
    print("   2. CUDA Toolkit installed")
    print("   3. numba installed: pip install numba")

## 3. Understanding CUDA Device Properties

Before writing CUDA code, we need to understand our GPU's capabilities. Key properties include:

| Property | What It Means |
|----------|---------------|
| **Compute Capability** | GPU architecture version (e.g., 8.6 = Ampere) |
| **Streaming Multiprocessors (SMs)** | Independent processing units |
| **Max Threads per Block** | How many threads can cooperate |
| **Warp Size** | Threads executed in lockstep (32 on all current NVIDIA GPUs) |
| **Global Memory** | Total GPU memory (VRAM) |
| **Shared Memory per Block** | Fast on-chip memory for cooperation |

Let's query our GPU:

In [None]:
# Query GPU properties
device = cuda.get_current_device()

print("=" * 60)
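# device.name may be bytes or str depending on the Numba version
name = device.name.decode('utf-8') if isinstance(device.name, bytes) else device.name
print(f"GPU: {name}")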
print("=" * 60)

# Compute capability
cc = device.compute_capability
print(f"\n📊 Compute Capability: {cc[0]}.{cc[1]}")

# Architecture mapping
arch_names = {
    (7, 0): "Volta", (7, 5): "Turing",
    (8, 0): "Ampere", (8, 6): "Ampere", (8, 9): "Ada Lovelace",
    (9, 0): "Hopper"
}
arch = arch_names.get(cc, "Unknown")
print(f"   Architecture: {arch}")

# Processor info
print(f"\n🔧 Processor Info:")
print(f"   Multiprocessors (SMs): {device.MULTIPROCESSOR_COUNT}")
print(f"   Max Threads per Block: {device.MAX_THREADS_PER_BLOCK}")
print(f"   Max Block Dimensions: {device.MAX_BLOCK_DIM_X} x {device.MAX_BLOCK_DIM_Y} x {device.MAX_BLOCK_DIM_Z}")
print(f"   Max Grid Dimensions: {device.MAX_GRID_DIM_X} x {device.MAX_GRID_DIM_Y} x {device.MAX_GRID_DIM_Z}")
print(f"   Warp Size: {device.WARP_SIZE}")

# Memory info
print(f"\n💾 Memory Info:")
print(f"   Shared Memory per Block: {device.MAX_SHARED_MEMORY_PER_BLOCK / 1024:.1f} KB")

# Get total memory using context
context = cuda.current_context()
free_mem, total_mem = context.get_memory_info()
print(f"   Total Global Memory: {total_mem / (1024**3):.2f} GB")
print(f"   Free Memory: {free_mem / (1024**3):.2f} GB")

## 4. Your First CUDA Kernel: Vector Addition

A **kernel** is a function that runs on the GPU. Let's start with the "Hello World" of GPU programming: adding two vectors.

### Key Concepts:
- `__global__` keyword marks a function as a GPU kernel
- Kernels run on **many threads simultaneously**
- Each thread processes a different element

```
CPU View:           GPU View (1000 threads):
                    
for i in range(N):  Thread 0: c[0] = a[0] + b[0]
    c[i] = a[i]+b[i] Thread 1: c[1] = a[1] + b[1]
                    Thread 2: c[2] = a[2] + b[2]
(sequential)        ...
                    Thread 999: c[999] = a[999] + b[999]
                    (ALL AT ONCE!)
```
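
To make the "one thread per element" idea concrete without a GPU, the sketch below emulates it in plain Python. The function `vector_add_thread` is a hypothetical stand-in for the kernel body, and the loop only imitates what the hardware would run concurrently:

```python
def vector_add_thread(idx, a, b, c):
    """Work done by ONE thread: the kernel body with its own index."""
    if idx < len(c):  # boundary check: threads past the end do nothing
        c[idx] = a[idx] + b[idx]


a = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [10.0, 20.0, 30.0, 40.0, 50.0]
c = [0.0] * len(a)

# "Launch" 8 threads for 5 elements. A GPU would run these concurrently;
# the loop just shows that each index is handled independently.
for idx in range(8):
    vector_add_thread(idx, a, b, c)

print(c)  # [11.0, 22.0, 33.0, 44.0, 55.0]
```

Because no thread reads a value that another thread writes, the order in which the indices run does not matter, which is what makes this problem a good fit for a GPU.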

### CUDA C++ Implementation (Primary)

In [None]:
%%writefile vector_add.cu
// vector_add.cu - Your first CUDA kernel!
#include <stdio.h>
#include <stdlib.h>   // malloc, free
#include <cuda_runtime.h>

// CUDA kernel - runs on GPU
__global__ void vectorAdd(const float* a, const float* b, float* c, int n) {
    // Calculate global thread ID
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    
    // Boundary check
    if (idx < n) {
        c[idx] = a[idx] + b[idx];
    }
}

int main() {
    int n = 1000000;  // 1 million elements
    size_t size = n * sizeof(float);
    
    // Allocate host memory
    float *h_a = (float*)malloc(size);
    float *h_b = (float*)malloc(size);
    float *h_c = (float*)malloc(size);
    
    // Initialize host arrays
    for (int i = 0; i < n; i++) {
        h_a[i] = 1.0f;
        h_b[i] = 2.0f;
    }
    
    // Allocate device memory
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, size);
    cudaMalloc(&d_b, size);
    cudaMalloc(&d_c, size);
    
    // Copy data from host to device
    cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice);
    
    // Launch kernel
    int threadsPerBlock = 256;
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
    
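    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c, n);

    // A kernel launch returns no status, so check for launch errors explicitly
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        printf("Kernel launch failed: %s\n", cudaGetErrorString(err));
        return 1;
    }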
    
    // Copy result back to host
    cudaMemcpy(h_c, d_c, size, cudaMemcpyDeviceToHost);
    
    // Verify
    printf("c[0] = %f (expected 3.0)\n", h_c[0]);
    
    // Cleanup
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    
    return 0;
}

In [None]:
!nvcc -arch=sm_75 -o vector_add vector_add.cu
!./vector_add

### Python/Numba (Optional - Interactive Testing)

In [None]:
# Python equivalent for quick testing (OPTIONAL)
@cuda.jit
def vector_add_kernel(a, b, c):
    """Each thread computes one element of c = a + b"""
    idx = cuda.grid(1)  # Same as: blockIdx.x * blockDim.x + threadIdx.x
    if idx < c.size:
        c[idx] = a[idx] + b[idx]

# Create test data
N = 1_000_000
a_host = np.random.randn(N).astype(np.float32)
b_host = np.random.randn(N).astype(np.float32)
c_host = np.zeros(N, dtype=np.float32)

print(f"Vector size: {N:,} elements")
print(f"Memory per vector: {a_host.nbytes / 1024 / 1024:.2f} MB")

## 5. Memory Management: Host ↔ Device Transfers

### CUDA C++ Memory Functions

| Function | Description |
|----------|-------------|
| `cudaMalloc(&ptr, size)` | Allocate GPU memory |
| `cudaMemcpy(dst, src, size, kind)` | Copy between host/device |
| `cudaFree(ptr)` | Free GPU memory |

```cpp
// Memory management in CUDA C++
float *d_array;
cudaMalloc(&d_array, n * sizeof(float));                    // Allocate on GPU
cudaMemcpy(d_array, h_array, n * sizeof(float), cudaMemcpyHostToDevice);  // Copy to GPU
cudaMemcpy(h_result, d_array, n * sizeof(float), cudaMemcpyDeviceToHost); // Copy from GPU
cudaFree(d_array);                                           // Free GPU memory
```

Data must be **explicitly copied** between CPU (host) and GPU (device):

```
┌─────────────┐                    ┌─────────────┐
│   CPU       │  cudaMemcpy H→D    │   GPU       │
│   (Host)    │ ================►  │  (Device)   │
│             │                    │             │
│  a_host[]   │                    │  a_device[] │
│  b_host[]   │                    │  b_device[] │
│  c_host[]   │ ◄================  │  c_device[] │
│             │  cudaMemcpy D→H    │             │
└─────────────┘                    └─────────────┘
        PCIe Bus (bottleneck!)
```
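
A rough way to see why the bus matters is to count the bytes that vector addition moves across it: two input arrays go to the device and one result comes back. The sketch below does that arithmetic. The helper names are hypothetical, and the 10 GB/s bandwidth is a placeholder assumption, not a measured figure for any particular GPU or PCIe generation:

```python
def transfer_bytes(n, bytes_per_element=4, inputs=2, outputs=1):
    """Bytes crossing the bus for an element-wise op on n float32 values."""
    host_to_device = inputs * n * bytes_per_element
    device_to_host = outputs * n * bytes_per_element
    return host_to_device, device_to_host


def transfer_seconds(total_bytes, bandwidth_bytes_per_s):
    """Idealized copy time; ignores latency and per-call overhead."""
    return total_bytes / bandwidth_bytes_per_s


h2d, d2h = transfer_bytes(1_000_000)
print(f"Host -> device: {h2d:,} bytes")
print(f"Device -> host: {d2h:,} bytes")

# ASSUMPTION: 10 GB/s is a placeholder bandwidth chosen for illustration only
assumed_bandwidth = 10e9
print(f"Idealized transfer time: {transfer_seconds(h2d + d2h, assumed_bandwidth) * 1000:.2f} ms")
```

Every element is moved three times but used in only one addition, so for this kernel the copies can easily cost more than the computation itself.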

**Key Functions (Numba equivalents):**
- `cuda.to_device(array)` - Copy host → device
- `cuda.device_array(shape)` - Allocate on device (no copy)
- `device_array.copy_to_host()` - Copy device → host

In [None]:
# Transfer data to GPU
a_device = cuda.to_device(a_host)  # Copy a to GPU
b_device = cuda.to_device(b_host)  # Copy b to GPU
c_device = cuda.device_array(N, dtype=np.float32)  # Allocate c on GPU (no copy needed)

print("✅ Data transferred to GPU")
print(f"   a_device type: {type(a_device)}")
print(f"   Shape: {a_device.shape}, Dtype: {a_device.dtype}")

## 6. Thread and Block Configuration

CUDA organizes threads in a hierarchy:

```
                    Grid (all threads)
                    ┌────────────────────────────────────┐
                    │  Block 0    Block 1    Block 2  ...│
                    │  ┌──────┐   ┌──────┐   ┌──────┐    │
                    │  │Thread│   │Thread│   │Thread│    │
                    │  │  0-N │   │  0-N │   │  0-N │    │
                    │  └──────┘   └──────┘   └──────┘    │
                    └────────────────────────────────────┘
```
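
The hierarchy becomes a flat index through the same formula the kernel uses, `blockIdx.x * blockDim.x + threadIdx.x`. The sketch below enumerates it in plain Python for a deliberately tiny grid; `global_index` is a hypothetical helper name, and the sizes (3 blocks of 4 threads) are chosen only to keep the output short:

```python
def global_index(block_idx, block_dim, thread_idx):
    """1D global thread index: blockIdx.x * blockDim.x + threadIdx.x."""
    return block_idx * block_dim + thread_idx


grid_dim, block_dim = 3, 4  # 3 blocks of 4 threads (tiny on purpose)

for block_idx in range(grid_dim):
    ids = [global_index(block_idx, block_dim, t) for t in range(block_dim)]
    print(f"Block {block_idx}: global indices {ids}")
```

Thread numbering restarts at 0 inside every block, so it is the block offset `block_idx * block_dim` that makes each global index unique.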

### Kernel Launch Syntax: `kernel<<<blocks_per_grid, threads_per_block>>>(...)` (CUDA C++) or `kernel[blocks_per_grid, threads_per_block](...)` (Numba)

**Rules of thumb:**
- `threads_per_block`: Usually 128, 256, or 512 (a multiple of the warp size; must be ≤ 1024)
- `blocks_per_grid`: Calculated to cover all elements
- Total threads = blocks_per_grid × threads_per_block
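
These rules can be checked with plain arithmetic before touching the GPU. The sketch below uses the integer ceiling division from the CUDA C++ example, `(n + threadsPerBlock - 1) / threadsPerBlock`, and reports how many launched threads fall past the end of the array. `launch_config` is a hypothetical helper name:

```python
import math


def launch_config(n, threads_per_block=256):
    """Return (blocks_per_grid, total_threads, idle_threads) for n elements."""
    # Integer ceiling division: rounds up without going through floats
    blocks_per_grid = (n + threads_per_block - 1) // threads_per_block
    total_threads = blocks_per_grid * threads_per_block
    return blocks_per_grid, total_threads, total_threads - n


blocks, total, idle = launch_config(1_000_000)
print(blocks, total, idle)  # 3907 1000192 192

# The integer form agrees with math.ceil for this size
assert blocks == math.ceil(1_000_000 / 256)
```

The 192 idle threads are the reason the kernel needs its `if (idx < n)` boundary check: they are launched but must not touch memory.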

In [None]:
# Configure kernel launch parameters
threads_per_block = 256  # Common choice

# Calculate blocks needed to cover all elements
# Formula: ceil(N / threads_per_block)
blocks_per_grid = math.ceil(N / threads_per_block)

print(f"📐 Launch Configuration:")
print(f"   Array size: {N:,}")
print(f"   Threads per block: {threads_per_block}")
print(f"   Blocks per grid: {blocks_per_grid:,}")
print(f"   Total threads: {blocks_per_grid * threads_per_block:,}")
print(f"   Extra threads (boundary check needed): {blocks_per_grid * threads_per_block - N:,}")

# Launch the kernel!
vector_add_kernel[blocks_per_grid, threads_per_block](a_device, b_device, c_device)

# Wait for GPU to finish
cuda.synchronize()
print("\n✅ Kernel execution complete!")

In [None]:
# Copy result back to CPU and verify
c_host = c_device.copy_to_host()

# Verify correctness
expected = a_host + b_host
if np.allclose(c_host, expected):
    print("✅ VERIFICATION PASSED!")
    print(f"   First 5 elements: {c_host[:5]}")
    print(f"   Expected:         {expected[:5]}")
else:
    print("❌ VERIFICATION FAILED!")
    diff = np.abs(c_host - expected).max()
    print(f"   Max difference: {diff}")

## 7. Performance Comparison: CPU vs GPU

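Now let's measure it. The GPU timing below covers the full round trip (host-to-device copies, kernel launch, and the copy back), so for a simple operation like vector addition the GPU may lose on small arrays, where transfer and launch overhead dominate.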

In [None]:
import time

def benchmark_cpu_gpu(sizes):
    """Compare CPU and GPU performance across different array sizes"""
    results = []
    
    for N in sizes:
        # Create data
        a = np.random.randn(N).astype(np.float32)
        b = np.random.randn(N).astype(np.float32)
        
        # CPU timing
        start = time.perf_counter()
        c_cpu = a + b
        cpu_time = time.perf_counter() - start
        
        # GPU timing (including transfers)
        start = time.perf_counter()
        a_d = cuda.to_device(a)
        b_d = cuda.to_device(b)
        c_d = cuda.device_array(N, dtype=np.float32)
        
        tpb = 256
        bpg = math.ceil(N / tpb)
        vector_add_kernel[bpg, tpb](a_d, b_d, c_d)
        cuda.synchronize()
        c_gpu = c_d.copy_to_host()
        gpu_time = time.perf_counter() - start
        
        speedup = cpu_time / gpu_time if gpu_time > 0 else 0
        results.append((N, cpu_time*1000, gpu_time*1000, speedup))
        
    return results

# Run benchmarks
sizes = [1_000, 10_000, 100_000, 1_000_000, 10_000_000, 50_000_000]
print("🏁 Benchmarking CPU vs GPU...")
print("-" * 65)
print(f"{'Array Size':>12} | {'CPU (ms)':>10} | {'GPU (ms)':>10} | {'Speedup':>10}")
print("-" * 65)

results = benchmark_cpu_gpu(sizes)
for N, cpu_ms, gpu_ms, speedup in results:
    indicator = "🚀" if speedup > 1 else "🐢"
    print(f"{N:>12,} | {cpu_ms:>10.3f} | {gpu_ms:>10.3f} | {speedup:>9.2f}x {indicator}")

print("-" * 65)
print("\n💡 Note: GPU shines with larger arrays (overhead amortized)")

## 🎯 Exercises

Now it's your turn! Complete these exercises to solidify your understanding.

### Exercise 1: Vector Subtraction
Modify the vector addition kernel to perform subtraction (c = a - b).

### Exercise 2: Element-wise Multiplication  
Create a new kernel for element-wise multiplication (c = a * b).

### Exercise 3: Different Block Sizes
Experiment with different `threads_per_block` values (32, 64, 128, 256, 512, 1024).
Which performs best? Why might that be?

In [None]:
# TODO Exercise 1: Vector Subtraction
# Create a kernel that computes c = a - b

@cuda.jit
def vector_sub_kernel(a, b, c):
    idx = cuda.grid(1)
    if idx < c.size:
        # TODO: Replace pass with subtraction
        pass

# Test your kernel here:
# ...

In [None]:
# TODO Exercise 2: Element-wise Multiplication
# Create a kernel that computes c = a * b

@cuda.jit
def vector_mul_kernel(a, b, c):
    # TODO: Implement this kernel
    pass

# Test your kernel here:
# ...

In [None]:
# TODO Exercise 3: Block Size Experiment
# Try different threads_per_block values and compare performance

def benchmark_block_sizes(N=10_000_000):
    """Test different block sizes"""
    a = np.random.randn(N).astype(np.float32)
    b = np.random.randn(N).astype(np.float32)
    a_d = cuda.to_device(a)
    b_d = cuda.to_device(b)
    c_d = cuda.device_array(N, dtype=np.float32)
    
    block_sizes = [32, 64, 128, 256, 512, 1024]
    
    print(f"Testing with N = {N:,}")
    print("-" * 40)
    
    for tpb in block_sizes:
        bpg = math.ceil(N / tpb)
        
        # Warmup
        vector_add_kernel[bpg, tpb](a_d, b_d, c_d)
        cuda.synchronize()
        
        # Benchmark
        start = time.perf_counter()
        for _ in range(100):
            vector_add_kernel[bpg, tpb](a_d, b_d, c_d)
        cuda.synchronize()
        elapsed = (time.perf_counter() - start) / 100 * 1000
        
        print(f"Block size {tpb:4d}: {elapsed:.3f} ms")

# TODO: Run the benchmark and analyze results
# benchmark_block_sizes()

## 📝 Key Takeaways

### Today You Learned:

1. **GPUs vs CPUs**: GPUs excel at throughput (many simple operations), CPUs at latency (complex single operations)

2. **GPU Architecture**:
   - Streaming Multiprocessors (SMs) contain many CUDA cores
   - Warps are groups of 32 threads executing together
   - Compute capability indicates architecture generation

3. **CUDA Programming Model**:
   - Kernels run on GPU, host code runs on CPU
   - Threads organized into blocks, blocks into grids
   - Each thread computes a unique global index: `blockIdx.x * blockDim.x + threadIdx.x` in CUDA C++, `cuda.grid(1)` in Numba

4. **Memory Management**:
   - Data must be explicitly transferred between host and device
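   - `cudaMemcpy(..., cudaMemcpyHostToDevice)` copies to the GPU (`cuda.to_device()` in Numba)
   - `cudaMemcpy(..., cudaMemcpyDeviceToHost)` copies back (`array.copy_to_host()` in Numba)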

5. **Performance Considerations**:
   - GPU overhead matters for small arrays
   - GPU wins big for large parallel workloads
   - Block size affects performance (experiment!)

---

### 📚 Next Up: Day 2 - Thread Indexing Deep Dive
- 1D, 2D, and 3D thread indexing
- Grid-stride loops for arbitrary sizes
- Handling edge cases

---

### 🔗 Additional Resources
- [CUDA Programming Guide - Introduction](../../cuda-programming-guide/01-introduction/programming-model.md)
- [Quick Reference](../../notes/cuda-quick-reference.md)