# Exercise 02: Hello GPU 👋

Write your first CUDA kernel - printing from the GPU!

## Learning Goals
- Write a `__global__` function (kernel)
- Launch a kernel with `<<<blocks, threads>>>`
- Understand thread/block indexing
- See parallel execution in action

## 🚀 Setup

**Enable GPU**: Runtime → Change runtime type → T4 GPU → Save

## Step 1: Verify CUDA

In [None]:
!nvcc --version
!nvidia-smi --query-gpu=name --format=csv,noheader

## 📚 Key Concepts

### Kernel Declaration
```cpp
__global__ void myKernel() {
    // This runs on GPU in parallel!
}
```

### Kernel Launch Syntax
```cpp
myKernel<<<numBlocks, threadsPerBlock>>>();
cudaDeviceSynchronize();  // Wait for GPU to finish
```

### Thread Identification
```cpp
int threadId = threadIdx.x;   // 0 to blockDim.x-1
int blockId = blockIdx.x;     // 0 to gridDim.x-1
int globalId = blockIdx.x * blockDim.x + threadIdx.x;
```

### Visual Example
```
Launch: <<<2, 4>>>

Block 0:          Block 1:
[T0 T1 T2 T3]    [T0 T1 T2 T3]
Global IDs:      Global IDs:
[0  1  2  3]     [4  5  6  7]
```
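
The mapping in the diagram can be checked on the CPU before touching the GPU. The sketch below is plain host-side C++ that loops over the same block and thread indices a 1D launch would create. `globalIds` is a hypothetical helper name used for illustration, not part of CUDA.

```cpp
#include <cassert>
#include <vector>

// Hypothetical CPU-side helper (not a CUDA API): lists the global ID each
// thread would compute for a 1D launch <<<numBlocks, threadsPerBlock>>>.
std::vector<int> globalIds(int numBlocks, int threadsPerBlock) {
    assert(numBlocks > 0 && threadsPerBlock > 0);
    std::vector<int> ids;
    for (int block = 0; block < numBlocks; ++block) {               // role of blockIdx.x
        for (int thread = 0; thread < threadsPerBlock; ++thread) {  // role of threadIdx.x
            // threadsPerBlock plays the role of blockDim.x
            ids.push_back(block * threadsPerBlock + thread);
        }
    }
    return ids;
}
```

Calling `globalIds(2, 4)` yields the IDs 0 through 7, with block 1 starting at ID 4, matching the diagram. The loop order here is only a convenience of the CPU version; on the GPU the same IDs exist but threads run in no particular order.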

## Step 2: Your Exercise - Complete the Kernel

Fill in the TODOs below to create your first working kernel!

In [None]:
%%writefile hello_gpu.cu
/**
 * Exercise 02: Hello GPU
 * 
 * Your first CUDA kernel!
 * 
 * TODO: Complete the kernel and launch configurations
 */

#include <stdio.h>
#include <cuda_runtime.h>

// TODO 1: Write a kernel that prints a greeting from each thread
// The kernel should print:
// "Hello from block X, thread Y (global ID: Z)"
// 
// Hints:
// - Use printf() - it works on GPU!
// - blockIdx.x gives block index
// - threadIdx.x gives thread index within block
// - Global ID = blockIdx.x * blockDim.x + threadIdx.x

__global__ void helloKernel() {
    // Calculate global thread ID
    int globalId = blockIdx.x * blockDim.x + threadIdx.x;
    
    // Print greeting from this thread
    printf("Hello from block %d, thread %d (global ID: %d)\n",
           blockIdx.x, threadIdx.x, globalId);
}

int main() {
    printf("=== Launching with 1 block, 8 threads ===\n");
    
    // TODO 2: Launch helloKernel with 1 block and 8 threads
    // Syntax: kernelName<<<numBlocks, threadsPerBlock>>>();
    helloKernel<<<1, 8>>>();
    
    // Don't forget to synchronize!
    cudaDeviceSynchronize();
    
    printf("\n=== Launching with 2 blocks, 4 threads each ===\n");
    
    // TODO 3: Launch helloKernel with 2 blocks and 4 threads per block
    helloKernel<<<2, 4>>>();
    
    cudaDeviceSynchronize();
    
    printf("\n=== Launching with 4 blocks, 2 threads each ===\n");
    
    // TODO 4: Launch helloKernel with 4 blocks and 2 threads per block
    helloKernel<<<4, 2>>>();
    
    cudaDeviceSynchronize();
    
    printf("\n✅ All kernels completed!\n");
    
    return 0;
}

## Step 3: Compile

In [None]:
!nvcc -arch=sm_75 hello_gpu.cu -o hello_gpu && echo "✅ Compilation successful!"

## Step 4: Run and Observe!

In [None]:
!./hello_gpu

---

## 🔄 Python Comparison (Optional)

Want to see the same kernel in Python? Here's how it looks using **Numba CUDA**.

> **Note**: C++ is the industry standard. Python is great for quick prototyping and learning concepts.

### Side-by-Side Comparison

| CUDA C++ | Python (Numba) |
|----------|----------------|
| `__global__ void kernel()` | `@cuda.jit` |
| `blockIdx.x` | `cuda.blockIdx.x` |
| `threadIdx.x` | `cuda.threadIdx.x` |
| `blockDim.x` | `cuda.blockDim.x` |
| `<<<blocks, threads>>>` | `kernel[blocks, threads]()` |
| `cudaDeviceSynchronize()` | `cuda.synchronize()` |

### When to Use Each

| Use C++ When... | Use Python When... |
|-----------------|-------------------|
| Production code | Quick prototyping |
| Maximum performance | Data science workflows |
| Low-level control needed | Learning concepts |
| Industry/job requirements | Rapid experimentation |

In [None]:
# 🐍 Python Version (Numba) - Same logic, different syntax
# Run this cell to see the Python equivalent!

!pip install numba -q

from numba import cuda
import numpy as np

# Python kernel - note the @cuda.jit decorator instead of __global__
@cuda.jit
def hello_kernel_python():
    # Same indexing, just with cuda. prefix
    block_id = cuda.blockIdx.x
    thread_id = cuda.threadIdx.x
    global_id = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
    # Note: Numba kernels have no C-style printf. Numba documents a limited
    # print() for scalars and string literals; this kernel only computes indices.

# To return results to the host as data, write them to a device array
@cuda.jit
def hello_kernel_with_output(output):
    global_id = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
    if global_id < output.size:
        output[global_id] = global_id  # Store the global ID

# Launch equivalent to <<<1, 8>>>
output = np.zeros(8, dtype=np.int32)
d_output = cuda.to_device(output)

hello_kernel_with_output[1, 8](d_output)  # [blocks, threads] syntax
cuda.synchronize()

result = d_output.copy_to_host()
print("Python Numba result (global IDs):", result)
print("\n💡 Notice: Python syntax [1, 8] vs C++ syntax <<<1, 8>>>")

### 🎯 Key Takeaway

**C++ gives you more control** (formatted `printf` from the GPU, the full CUDA API), while **Python is more concise** but exposes only part of that functionality.

For serious CUDA work → **Stick with C++** (the exercises above)

---

## 🔍 Observations & Questions

### What to Notice:

1. **Output Order**: Are the threads printed in order?
   - ❓ Why might they appear out of order?
   - 💡 Threads execute in parallel - no guaranteed order!

2. **Block vs Thread IDs**:
   - In launch `<<<1, 8>>>`: How many blocks? How many threads?
   - In launch `<<<2, 4>>>`: How are global IDs calculated?

3. **Total Threads**:
   - `<<<1, 8>>>` = 1 × 8 = **8 threads**
   - `<<<2, 4>>>` = 2 × 4 = **8 threads** (same total, different organization!)
   - `<<<4, 2>>>` = 4 × 2 = **8 threads**

### 🧪 Experiments to Try

Run these experiments by modifying the code above:

In [None]:
%%writefile experiment1.cu
// Experiment 1: Large number of threads
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void helloKernel() {
    int globalId = blockIdx.x * blockDim.x + threadIdx.x;
    printf("Thread %d\n", globalId);
}

int main() {
    printf("Launching 256 threads...\n");
    helloKernel<<<4, 64>>>();  // 4 blocks × 64 threads = 256 threads
    cudaDeviceSynchronize();
    return 0;
}

In [None]:
!nvcc -arch=sm_75 experiment1.cu -o experiment1 && ./experiment1

## 📊 Understanding Kernel Launch

### Launch Configuration: `<<<blocks, threads>>>`

| Syntax | Meaning | Total Threads |
|--------|---------|---------------|
| `<<<1, 256>>>` | 1 block, 256 threads/block | 256 |
| `<<<256, 1>>>` | 256 blocks, 1 thread/block | 256 |
| `<<<16, 16>>>` | 16 blocks, 16 threads/block | 256 |
| `<<<4, 64>>>` | 4 blocks, 64 threads/block | 256 |

**All launch the same total number of threads, but organized differently!**

### Why Does Organization Matter?

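1. **Hardware Limits**:
   - Max threads per block: **1024** (on most GPUs)
   - A launch that exceeds the limit fails and the kernel never runs

2. **Performance**:
   - Threads in the same block can share fast on-chip memory and synchronize with each other
   - Threads execute in groups of 32 (warps), so block sizes that are multiples of 32 avoid partly filled warps

3. **Problem Mapping**:
   - For a 2D image: `<<<dim3((width + 15) / 16, (height + 15) / 16), dim3(16, 16)>>>`
   - The grid maps naturally onto the problem structure; rounding up covers images whose sides are not multiples of 16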
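
The grid size for a 2D image is a pair of ceiling divisions, one per axis. The sketch below does that arithmetic in plain host-side C++ so it runs without a GPU. `GridSize` and `gridFor` are hypothetical names standing in for `dim3` and the launch expression.

```cpp
#include <cassert>

// Hypothetical stand-in for CUDA's dim3, so the arithmetic runs without a GPU.
struct GridSize {
    int x;
    int y;
};

// Number of tile-by-tile blocks needed to cover a width-by-height image,
// rounding up so partial tiles at the right and bottom edges are included.
GridSize gridFor(int width, int height, int tile) {
    assert(width > 0 && height > 0 && tile > 0);
    return GridSize{(width + tile - 1) / tile, (height + tile - 1) / tile};
}
```

For a 1920 × 1080 image with 16 × 16 blocks, `gridFor(1920, 1080, 16)` gives 120 × 68 blocks. Plain `height / 16` would give 67 and leave the last 8 rows without threads. Because the rounded-up grid launches a few threads past the image edge, a real kernel would guard its work with a check such as `if (x < width && y < height)`.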

### 🎯 Tasks Checklist

- ✅ Complete the `helloKernel()` function
- ✅ Launch with different block/thread configurations
- ✅ Observe non-deterministic output order
- ✅ Calculate global IDs correctly
- ✅ Try launching 1024 threads (max per block)
- ✅ Try launching 10,000 threads across multiple blocks
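
The 10,000-thread task needs a block count that rounds up, since 10,000 is rarely an exact multiple of the block size. A minimal host-side sketch of that calculation follows. `blocksNeeded` and `surplusThreads` are hypothetical helper names, not CUDA functions.

```cpp
#include <cassert>

// Hypothetical helper: smallest number of blocks whose combined threads
// cover totalThreads (ceiling division).
int blocksNeeded(int totalThreads, int threadsPerBlock) {
    assert(totalThreads > 0 && threadsPerBlock > 0);
    return (totalThreads + threadsPerBlock - 1) / threadsPerBlock;
}

// Threads launched beyond the requested count. A kernel that indexes an
// array must skip these with a bounds check such as `if (globalId < n)`.
int surplusThreads(int totalThreads, int threadsPerBlock) {
    return blocksNeeded(totalThreads, threadsPerBlock) * threadsPerBlock - totalThreads;
}
```

With 256 threads per block, `blocksNeeded(10000, 256)` is 40, which launches 10,240 threads, 240 more than requested. The result would then be used as `helloKernel<<<40, 256>>>();`.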

### 💡 Bonus Challenges

1. **Max Threads**: Launch the maximum threads per block (1024)
2. **Huge Launch**: Launch 1 million threads. How many blocks do you need?
3. **Warp Alignment**: Launch 32, 64, 96 threads. Notice patterns?
4. **Error Check**: What happens if you try `<<<1, 2048>>>`? (exceeds limit)
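
For the error-check challenge, an oversized launch does not crash the program or print a warning on its own; the runtime has to be asked. The sketch below is one way to do that, assuming a GPU with the usual 1024-thread block limit. `emptyKernel` is a hypothetical name used for illustration.

```cpp
#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical do-nothing kernel; only the launch configuration matters here.
__global__ void emptyKernel() {}

int main() {
    // 2048 threads per block exceeds the 1024 limit, so the launch is rejected.
    emptyKernel<<<1, 2048>>>();

    // Launch-configuration errors are reported by cudaGetLastError().
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        printf("Launch failed: %s\n", cudaGetErrorString(err));
    }

    // Errors raised while a kernel is running surface at the next synchronization.
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("Execution failed: %s\n", cudaGetErrorString(err));
    }
    return 0;
}
```

On a T4 the first check is expected to report an invalid-configuration error, and the kernel body never runs. The file can be compiled with the same `nvcc -arch=sm_75` command used elsewhere in this notebook.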

In [None]:
%%writefile bonus_max.cu
// Bonus: Maximum threads per block
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void maxThreadsKernel() {
    int globalId = blockIdx.x * blockDim.x + threadIdx.x;
    if (globalId % 100 == 0) {  // Print every 100th to avoid spam
        printf("Thread %d\n", globalId);
    }
}

int main() {
    printf("Launching 1024 threads per block (max)...\n");
    maxThreadsKernel<<<1, 1024>>>();
    cudaDeviceSynchronize();
    
    printf("\nLaunching 1 million threads...\n");
    // Need: 1,000,000 / 1024 ≈ 976.6, rounded up to 977 blocks
    int numBlocks = (1000000 + 1024 - 1) / 1024;  // Ceiling division
    maxThreadsKernel<<<numBlocks, 1024>>>();
    cudaDeviceSynchronize();
    printf("✅ Launched %d threads!\n", numBlocks * 1024);
    
    return 0;
}

In [None]:
!nvcc -arch=sm_75 bonus_max.cu -o bonus_max && ./bonus_max

## ⚠️ Important Notes

1. **printf from GPU**:
   - Requires compute capability 2.0+ ✅
   - Output goes through a fixed-size buffer (1 MB by default)
   - If threads print more than the buffer holds, older output is overwritten and lost

2. **cudaDeviceSynchronize()**:
   - **Essential** to see printf output
   - Blocks the host until all queued GPU work has finished, which also flushes the printf buffer
   - Without it, the program may exit before the GPU output is displayed

3. **Non-deterministic Order**:
   - Threads execute in parallel
   - No guaranteed execution order
   - **Never assume sequential order!**

4. **Thread Limits**:
   - Max threads per block: 1024
   - Max blocks: 2^31 - 1 in the x dimension (huge!), 65535 in y and z
   - Check the limits of your GPU with `cudaGetDeviceProperties()`, which fills in a `cudaDeviceProp` struct
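
The limits listed above can be read from the device at run time instead of being memorized. The following is a minimal sketch, assuming the GPU of interest is device 0. It prints the thread and grid limits along with the current printf buffer size.

```cpp
#include <stdio.h>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    // Device index 0 is an assumption; multi-GPU machines may need another index.
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        printf("Could not query device: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("Device: %s (compute capability %d.%d)\n", prop.name, prop.major, prop.minor);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Max block dimensions: %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("Max grid dimensions: %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    printf("Warp size: %d\n", prop.warpSize);

    // Size in bytes of the buffer that device-side printf writes into.
    size_t printfBufferBytes = 0;
    cudaDeviceGetLimit(&printfBufferBytes, cudaLimitPrintfFifoSize);
    printf("printf buffer: %zu bytes\n", printfBufferBytes);

    return 0;
}
```

If a kernel needs to print more than the buffer holds, `cudaDeviceSetLimit(cudaLimitPrintfFifoSize, bytes)` can enlarge it before the launch.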

## ➡️ Next Steps

You've learned:
- ✅ How to write a `__global__` kernel
- ✅ How to launch kernels with `<<<blocks, threads>>>`
- ✅ Thread and block indexing
- ✅ Parallel execution is non-deterministic

**Next**: Learn how to do actual computation on GPU!
- Vector addition
- Memory transfers (host ↔ device)
- Array processing

Continue to the next exercise or explore [Week 1 Learning Path](../../../learning-path/week-01/)

---

**Questions to Think About:**
1. Why do we need blocks AND threads?
2. What happens if we launch more threads than CUDA cores?
3. How does the GPU schedule thread execution?

Answers in the [Programming Model](../../../cuda-programming-guide/01-introduction/programming-model.md) guide!