In [None]:
# ⚙️ Setup
import subprocess, sys
try:
    import google.colab
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", "numba"])
except ImportError:
    pass

import numpy as np
from numba import cuda
import time

print("⚠️  CUDA C++ is PRIMARY. Python/Numba for quick testing only.")
if cuda.is_available():
    print(f"GPU: {cuda.get_current_device().name}")

---

## Part 1: What are CUDA Streams?

### The Concept

```
STREAM = a sequence of operations that execute in issue order.
Operations in different streams may run concurrently.

Without streams (everything serialized, 6 time slots):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Default:   │  H2D 0   │ Kernel 0 │  D2H 0   │  H2D 1   │ Kernel 1 │  D2H 1   │

With streams (independent chunks overlap, 4 time slots):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Stream 0:  │  H2D 0   │ Kernel 0 │  D2H 0   │
Stream 1:             │  H2D 1   │ Kernel 1 │  D2H 1   │
```

### Why Streams Matter

```
GPU hardware has separate engines (the number of copy engines varies
by GPU; see asyncEngineCount in cudaDeviceProp):
┌─────────────────────┬──────────────────┬─────────────────────┐
│  Copy Engine (H2D)  │  Compute Engine  │  Copy Engine (D2H)  │
│  ─────────────────  │  ──────────────  │  ─────────────────  │
│  Can run while      │  Can run while   │  Can run while      │
│  compute runs!      │  copying!        │  compute runs!      │
└─────────────────────┴──────────────────┴─────────────────────┘

Streams let you keep all engines busy at the same time!
```
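
To put rough numbers on the benefit of keeping every engine busy, here is a minimal sketch of a timing model, not a measurement. It assumes one H2D copy engine, one compute engine, and one D2H copy engine, that each engine handles one chunk at a time, and that every chunk passes through the three stages in order. The function name `pipeline_time` and the stage durations are hypothetical and chosen only for illustration.

```python
def pipeline_time(chunks, h2d, kernel, d2h, overlap):
    """Toy model: total time to process `chunks` chunks.

    Each chunk needs an H2D copy, a kernel, and a D2H copy (durations in
    arbitrary units). With overlap=False everything is serialized; with
    overlap=True the three engines can work on different chunks at once.
    """
    if not overlap:
        return chunks * (h2d + kernel + d2h)

    # Time at which each engine finishes its most recent piece of work.
    h2d_free = kernel_free = d2h_free = 0
    for _ in range(chunks):
        # Copies queue up back to back on the H2D engine.
        h2d_free = h2d_free + h2d
        # A kernel needs its input on the device and a free compute engine.
        kernel_free = max(kernel_free, h2d_free) + kernel
        # The copy back needs the kernel result and a free D2H engine.
        d2h_free = max(d2h_free, kernel_free) + d2h
    return d2h_free


serial = pipeline_time(4, h2d=1, kernel=1, d2h=1, overlap=False)
overlapped = pipeline_time(4, h2d=1, kernel=1, d2h=1, overlap=True)
print(f"serialized: {serial} units, overlapped: {overlapped} units")
```

With equal stage times the overlapped schedule takes `chunks + 2` units instead of `3 * chunks`. Real speedups depend on the actual copy and kernel durations and on how many copy engines the GPU has.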

---

## Part 2: Stream Creation and Management

### CUDA C++ Stream Basics (Primary)

The following code demonstrates four ways of working with CUDA streams:
1. **Default Stream** - Used when no stream is specified; the legacy default stream orders its work against all other blocking streams
2. **Non-Default Streams** - Explicitly created streams whose work can run concurrently
3. **Stream with Flags** - Non-blocking streams (`cudaStreamNonBlocking`) that do not synchronize with the legacy default stream
4. **Per-Thread Default Stream** - Compile-time option (`--default-stream per-thread`) that gives each host thread its own default stream

### Compile and Run
```bash
nvcc -O3 -arch=sm_75 stream_basics.cu -o stream_basics
./stream_basics
```

In [None]:
%%writefile stream_basics.cu
// stream_basics.cu - Creating and using streams
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void simpleKernel(float* data, int n, int streamId) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        // Simulate some work
        for (int i = 0; i < 100; i++) {
            data[tid] = sqrtf(data[tid]) + 1.0f;
        }
    }
}

int main() {
    const int N = 1 << 20;  // 1M elements
    const int NUM_STREAMS = 4;
    
    // ============================================
    // Method 1: Default Stream (NULL stream)
    // ============================================
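    // Work in the legacy default stream is implicitly ordered against
    // every other blocking stream: it waits for their earlier work, and
    // their later work waits for it. Kernel launches are still
    // asynchronous with respect to the host.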
    
    float *d_default;
    cudaMalloc(&d_default, N * sizeof(float));
    
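    // These execute sequentially (default stream).
    // Note: 256 x 256 = 65,536 threads, so only the first 65,536 of the N
    // elements are touched; (N + 255) / 256 blocks would cover all of them.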
    simpleKernel<<<256, 256>>>(d_default, N, 0);  // Stream 0 (default)
    simpleKernel<<<256, 256>>>(d_default, N, 0);  // Waits for above
    
    cudaDeviceSynchronize();  // Wait for all
    
    // ============================================
    // Method 2: Create Non-Default Streams
    // ============================================
    cudaStream_t streams[NUM_STREAMS];
    float* d_data[NUM_STREAMS];
    
    // Create streams
    for (int i = 0; i < NUM_STREAMS; i++) {
        cudaStreamCreate(&streams[i]);
        cudaMalloc(&d_data[i], N * sizeof(float));
    }
    
    // Launch kernels in different streams (can run concurrently!)
    for (int i = 0; i < NUM_STREAMS; i++) {
        // Syntax: kernel<<<grid, block, sharedMem, stream>>>
        simpleKernel<<<256, 256, 0, streams[i]>>>(d_data[i], N, i);
    }
    
    // Synchronize all streams
    cudaDeviceSynchronize();
    
    // ============================================
    // Method 3: Stream with Flags
    // ============================================
    cudaStream_t nonBlockingStream;
    
    // cudaStreamNonBlocking: doesn't sync with default stream
    cudaStreamCreateWithFlags(&nonBlockingStream, cudaStreamNonBlocking);
    
    simpleKernel<<<256, 256, 0, nonBlockingStream>>>(d_default, N, 0);
    cudaStreamSynchronize(nonBlockingStream);  // Finish this work before cleanup
    
    // ============================================
    // Method 4: Per-Thread Default Stream
    // ============================================
    // Compile with: nvcc --default-stream per-thread
    // Makes each CPU thread have its own default stream
    
    // Cleanup
    for (int i = 0; i < NUM_STREAMS; i++) {
        cudaStreamDestroy(streams[i]);
        cudaFree(d_data[i]);
    }
    cudaStreamDestroy(nonBlockingStream);
    cudaFree(d_default);
    
    printf("Stream basics complete!\n");
    return 0;
}

In [None]:
!nvcc -arch=sm_75 -o stream_basics stream_basics.cu
!./stream_basics

---

## Part 3: Default Stream Behavior

### Legacy vs Per-Thread Default Stream

This example demonstrates the difference between:
- **Legacy Default Stream** - Implicitly synchronizes with every blocking stream (any stream created with `cudaStreamCreate`)
- **Non-Blocking Streams** - Use `cudaStreamNonBlocking` flag to avoid synchronization with the default stream

### Visual Comparison

```
Legacy Default Stream (--default-stream legacy):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Stream1:    ████████│wait│        │████████│
Default:            │    │████████│        │
                    ↑    ↑        ↑
              sync points (implicit barriers)

Per-Thread Default (--default-stream per-thread):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Stream1:    ████████████████
Default:    ████████
            (can overlap - each thread has own default)

Non-Blocking Flag (cudaStreamNonBlocking):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
NonBlock:   ████████████████████████
Default:    ████████
            (no sync between them)
```
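
The ordering rules in the diagram can be checked with a small host-only simulation. This is a toy sketch under stated assumptions, not how the CUDA runtime is implemented: it assumes unlimited GPU resources (so only ordering rules delay work), models the legacy default stream only, and ignores per-thread mode. The names `simulate` and `DEFAULT` are hypothetical.

```python
DEFAULT = "default"  # stands in for the legacy default (NULL) stream


def simulate(ops, non_blocking=()):
    """Return (makespan, schedule) for operations issued in order.

    ops: list of (stream_name, duration) pairs.
    non_blocking: names of streams created with cudaStreamNonBlocking.
    Assumes unlimited GPU resources, so only ordering rules delay work.
    """
    stream_end = {}  # finish time of the latest op issued to each stream
    schedule = []
    for stream, duration in ops:
        start = stream_end.get(stream, 0)  # in-order within one stream
        if stream == DEFAULT:
            # The legacy default stream waits for earlier work in blocking streams.
            for other, end in stream_end.items():
                if other not in non_blocking:
                    start = max(start, end)
        elif stream not in non_blocking:
            # Blocking streams wait for earlier default-stream work.
            start = max(start, stream_end.get(DEFAULT, 0))
        stream_end[stream] = start + duration
        schedule.append((stream, start, start + duration))
    makespan = max((end for _, _, end in schedule), default=0)
    return makespan, schedule


# One kernel in s1, one in the default stream, then another in s1.
ops = [("s1", 1), (DEFAULT, 1), ("s1", 1)]
blocking_time, _ = simulate(ops)
non_blocking_time, _ = simulate(ops, non_blocking={"s1"})
print(f"blocking s1: {blocking_time} units, non-blocking s1: {non_blocking_time} units")
```

In this model the blocking case takes 3 units because every launch waits for the previous one. The non-blocking case takes 2 units: the default-stream kernel overlaps with `s1`, while the two `s1` kernels still run one after the other.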

In [None]:
%%writefile default_stream_behavior.cu
// default_stream_behavior.cu
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void work(float* data, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        for (int i = 0; i < 1000; i++) {
            data[tid] = sinf(data[tid]);
        }
    }
}

int main() {
    const int N = 1 << 18;
    float *d_a, *d_b;
    cudaMalloc(&d_a, N * sizeof(float));
    cudaMalloc(&d_b, N * sizeof(float));
    
    cudaStream_t stream1;
    cudaStreamCreate(&stream1);
    
    // ============================================
    // LEGACY DEFAULT STREAM (blocking)
    // ============================================
    // Compile with: nvcc file.cu (default)
    
    printf("Testing with legacy default stream...\n");
    
    work<<<256, 256, 0, stream1>>>(d_a, N);  // Stream 1
    work<<<256, 256>>>(d_b, N);               // Default stream
    work<<<256, 256, 0, stream1>>>(d_a, N);  // Stream 1
    
    // Execution order with legacy default stream:
    // 1. First stream1 kernel starts
    // 2. Default stream kernel WAITS for stream1 to finish
    // 3. Second stream1 kernel WAITS for default stream
    // Result: NO OVERLAP (all sequential)
    
    cudaDeviceSynchronize();
    
    // ============================================
    // NON-BLOCKING STREAM (no sync with default)
    // ============================================
    cudaStream_t nonBlocking;
    cudaStreamCreateWithFlags(&nonBlocking, cudaStreamNonBlocking);
    
    printf("Testing with non-blocking stream...\n");
    
    work<<<256, 256, 0, nonBlocking>>>(d_a, N);  // Non-blocking
    work<<<256, 256>>>(d_b, N);                   // Default stream
    work<<<256, 256, 0, nonBlocking>>>(d_a, N);  // Non-blocking
    
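    // Execution: the default-stream kernel can overlap with the
    // non-blocking stream. The two kernels in `nonBlocking` still run
    // in order relative to each other (same stream).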
    
    cudaDeviceSynchronize();
    
    cudaStreamDestroy(stream1);
    cudaStreamDestroy(nonBlocking);
    cudaFree(d_a);
    cudaFree(d_b);
    
    printf("Default stream behavior demo complete!\n");
    return 0;
}

In [None]:
!nvcc -arch=sm_75 -o default_stream_behavior default_stream_behavior.cu
!./default_stream_behavior

In [None]:
# Python/Numba Stream Example (OPTIONAL)

@cuda.jit
def simple_work(data):
    tid = cuda.grid(1)
    if tid < data.shape[0]:
        val = data[tid]
        for _ in range(100):
            val = val * 1.001 + 0.001
        data[tid] = val

n = 1 << 20
threads = 256
blocks = (n + threads - 1) // threads

# Create streams
stream1 = cuda.stream()
stream2 = cuda.stream()

# Allocate data
d_a = cuda.device_array(n, dtype=np.float32)
d_b = cuda.device_array(n, dtype=np.float32)

# Launch in different streams
simple_work[blocks, threads, stream1](d_a)
simple_work[blocks, threads, stream2](d_b)

# Synchronize
stream1.synchronize()
stream2.synchronize()

print("Both kernels completed (potentially overlapped)")

---

## Part 4: Stream Synchronization

### Synchronization Methods

This example demonstrates different ways to synchronize with CUDA streams:
1. **cudaDeviceSynchronize()** - Waits for ALL streams (heavy-weight)
2. **cudaStreamSynchronize(stream)** - Waits for specific stream only
3. **cudaStreamQuery(stream)** - Non-blocking check if stream is done
4. **Callbacks** - Run a host function when the stream reaches a given point (`cudaStreamAddCallback`, superseded by `cudaLaunchHostFunc`)

### Synchronization Comparison

```
┌─────────────────────────────────────────────────────────┐
│           Synchronization Methods                       │
├──────────────────────┬──────────────────────────────────┤
│ Method               │ Use Case                         │
├──────────────────────┼──────────────────────────────────┤
│ cudaDeviceSynchronize│ End of program, debugging        │
│ cudaStreamSynchronize│ Wait for specific stream         │
│ cudaStreamQuery      │ Poll without blocking            │
│ cudaEventSynchronize │ Wait for specific point (Day 4)  │
│ cudaStreamWaitEvent  │ Inter-stream dependency (Day 4)  │
└──────────────────────┴──────────────────────────────────┘
```
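
For quick testing from Python, the same ideas can be sketched with Numba. This is an optional, hedged counterpart to the CUDA C++ program, not a replacement for it. It polls an event recorded on the stream instead of querying the stream itself (events are covered in more depth on Day 4). The function name `run_sync_demo` and the kernel `double` are hypothetical, and the sketch returns `None` when no CUDA GPU is available.

```python
try:
    import numpy as np
    from numba import cuda
    HAVE_GPU = cuda.is_available()
except Exception:  # NumPy/Numba missing or CUDA driver unavailable
    HAVE_GPU = False


def run_sync_demo(n=1 << 20):
    """Double an array of ones twice; return the host copy, or None without a GPU."""
    if not HAVE_GPU:
        return None

    @cuda.jit
    def double(data):
        tid = cuda.grid(1)
        if tid < data.shape[0]:
            data[tid] *= 2.0

    threads = 256
    blocks = (n + threads - 1) // threads
    stream = cuda.stream()
    d_data = cuda.to_device(np.ones(n, dtype=np.float32), stream=stream)

    # Counterpart of cudaStreamSynchronize: wait for this stream only.
    double[blocks, threads, stream](d_data)
    stream.synchronize()

    # Polling: record an event after the kernel and query it without blocking.
    double[blocks, threads, stream](d_data)
    done = cuda.event()
    done.record(stream)
    polls = 0
    while not done.query():
        polls += 1  # CPU work could go here while the GPU is busy

    # Counterpart of cudaDeviceSynchronize: wait for all outstanding work.
    cuda.synchronize()
    print(f"polled {polls} times")
    return d_data.copy_to_host()


result = run_sync_demo()
print("no CUDA GPU available" if result is None else f"first value: {result[0]}")
```

The poll count varies from run to run, since it depends on how quickly the kernel finishes relative to the host loop.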

In [None]:
%%writefile stream_sync.cu
// stream_sync.cu - Different synchronization approaches
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void kernel(float* data, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) data[tid] *= 2.0f;
}

int main() {
    const int N = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, N * sizeof(float));
    
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    
    // ============================================
    // Method 1: cudaDeviceSynchronize()
    // ============================================
    // Waits for ALL streams on the device
    // Heavy-weight, use sparingly
    
    kernel<<<256, 256, 0, stream>>>(d_data, N);
    cudaDeviceSynchronize();  // Blocks until ALL work done
    printf("Method 1: cudaDeviceSynchronize() - waited for all work\n");
    
    // ============================================
    // Method 2: cudaStreamSynchronize(stream)
    // ============================================
    // Waits for specific stream only
    // Lighter weight than device sync
    
    kernel<<<256, 256, 0, stream>>>(d_data, N);
    cudaStreamSynchronize(stream);  // Wait for this stream only
    printf("Method 2: cudaStreamSynchronize() - waited for specific stream\n");
    
    // ============================================
    // Method 3: cudaStreamQuery(stream)
    // ============================================
    // Non-blocking check if stream is done
    
    kernel<<<256, 256, 0, stream>>>(d_data, N);
    
    cudaError_t status;
    int pollCount = 0;
    do {
        status = cudaStreamQuery(stream);
        pollCount++;
        // Can do CPU work here while waiting!
    } while (status == cudaErrorNotReady);
    
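    if (status == cudaSuccess) {
        printf("Method 3: cudaStreamQuery() - polled %d times before completion\n", pollCount);
    } else {
        // Any status other than cudaSuccess/cudaErrorNotReady is a real error
        printf("Method 3: cudaStreamQuery() failed: %s\n", cudaGetErrorString(status));
    }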
    
    // ============================================
    // Method 4: Host callback (not demonstrated here)
    // ============================================
    // cudaLaunchHostFunc() - runs a host function when the stream reaches
    // this point. cudaStreamAddCallback() is the older API and is slated
    // for eventual deprecation.
    
    cudaStreamDestroy(stream);
    cudaFree(d_data);
    
    printf("Stream synchronization demo complete!\n");
    return 0;
}

In [None]:
!nvcc -arch=sm_75 -o stream_sync stream_sync.cu
!./stream_sync

---

## Exercises

### Exercise 1: Multi-Stream Kernel Launch

```cpp
// Launch 8 independent kernels in 4 streams
// Measure time with vs without streams

// Your implementation:
```

### Exercise 2: Stream Query Loop

```cpp
// While GPU is working, do CPU computation
// Use cudaStreamQuery to check completion

// Your implementation:
```

### Exercise 3: Non-Blocking vs Blocking

```cpp
// Compare behavior with:
// 1. Regular stream (cudaStreamCreate)
// 2. Non-blocking stream (cudaStreamNonBlocking)

// Your implementation:
```

---

## Key Takeaways

```
┌─────────────────────────────────────────────────────────┐
│                   STREAM BASICS                         │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  Stream = sequence of GPU operations                    │
│                                                         │
│  Key Functions:                                         │
│  • cudaStreamCreate(&stream)                            │
│  • cudaStreamDestroy(stream)                            │
│  • cudaStreamSynchronize(stream)                        │
│  • kernel<<<grid, block, smem, stream>>>()              │
│                                                         │
│  Default Stream Gotcha:                                 │
│  • Legacy: blocks other streams                         │
│  • Use cudaStreamNonBlocking for independence           │
│                                                         │
└─────────────────────────────────────────────────────────┘
```

## Next: Day 2 - Overlapping Data Transfers

Tomorrow we'll learn to overlap H2D/D2H transfers with computation using pinned memory!