# GPU Computing

## Introduction

The previous lecture covered techniques for writing efficient Python code: profiling, vectorization, Numba JIT compilation, and Cython. These approaches optimize code running on the CPU. For certain computations, we can achieve even greater speedups by using a Graphics Processing Unit (GPU).

GPUs were originally designed for rendering graphics, which requires performing the same operation on millions of pixels simultaneously. This architecture turns out to be ideal for many scientific computing tasks: matrix operations, Monte Carlo simulations, and any computation where the same operation is applied to large amounts of data.

This lecture covers GPU computing in Python using two libraries:

* **CuPy**: A drop-in replacement for NumPy that runs on NVIDIA GPUs
* **PyTorch**: A deep learning framework with powerful GPU array operations

Both libraries let you write Python code that executes on the GPU with minimal changes to your existing NumPy-based code.

### CPU vs GPU Architecture

A CPU is designed for general-purpose computing. It has a small number of powerful cores (typically 4-16 on consumer hardware) optimized for complex tasks with branching logic, sequential dependencies, and varied workloads. Each core can handle sophisticated operations independently.

A GPU takes a different approach. It has thousands of simpler cores designed to execute the same instruction on many data elements simultaneously. An NVIDIA GPU might have 5,000+ CUDA cores, each less powerful than a CPU core but collectively capable of massive parallelism.

```
CPU: Few powerful cores          GPU: Many simple cores
+----+----+----+----+           +--+--+--+--+--+--+--+--+
|    |    |    |    |           |  |  |  |  |  |  |  |  |
|Core|Core|Core|Core|           +--+--+--+--+--+--+--+--+
|  1 |  2 |  3 |  4 |           |  |  |  |  |  |  |  |  |
|    |    |    |    |           +--+--+--+--+--+--+--+--+
+----+----+----+----+           |  |  |  |  |  |  |  |  |
  4 cores @ 4 GHz               +--+--+--+--+--+--+--+--+
                                  ... thousands of cores
```

This architecture difference means:

* **GPUs excel at data parallelism**: When you need to perform the same operation on millions of elements (matrix multiplication, element-wise operations, convolutions), GPUs can process many elements simultaneously.

* **CPUs excel at task parallelism and complex logic**: When operations have dependencies, require branching, or involve complex control flow, CPUs are more efficient.

### When to Use GPU Computing

GPU acceleration provides significant speedups for:

* Large matrix operations (multiplication, decomposition)
* Element-wise operations on large arrays
* Monte Carlo simulations with independent trials
* Convolutions and signal processing
* Operations where the same computation is applied to many data points

GPU acceleration is **not** beneficial for:

* Small data (overhead of data transfer exceeds computation time)
* Sequential algorithms where each step depends on the previous
* Operations with complex branching or conditionals
* I/O-bound tasks (reading files, network operations)

A rule of thumb: if your computation is embarrassingly parallel and operates on millions of elements, consider GPU acceleration.

### The GPU Computing Landscape in Python

Several libraries provide GPU computing in Python:

* **CuPy**: NumPy-compatible array library for NVIDIA GPUs. Minimal code changes required.
* **PyTorch**: Deep learning framework with GPU tensors. Popular for ML but useful for general GPU computing.
* **JAX**: Google's library for high-performance numerical computing with automatic differentiation.
* **Numba CUDA**: Write custom GPU kernels in Python (covered in the Numba section of the efficient code lecture).

This lecture focuses on CuPy and PyTorch as they cover most statistical computing needs with minimal learning curve.

## GPU Architecture Basics

Before writing GPU code, understanding a few architectural concepts helps explain performance characteristics.

### CUDA and GPU Execution

NVIDIA GPUs use CUDA (Compute Unified Device Architecture) as their parallel computing platform. When you run CuPy or PyTorch code on an NVIDIA GPU, it uses CUDA under the hood.

Key concepts:

* **CUDA cores**: The basic processing units that execute computations in parallel
* **Streaming Multiprocessors (SMs)**: Groups of CUDA cores that share resources
* **Warps**: Groups of 32 threads that execute instructions in lockstep

For practical purposes, you can think of the GPU as having thousands of workers that can all perform the same operation simultaneously on different data.
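
To make the warp granularity concrete, here is a minimal sketch that counts how many 32-thread warps a given number of elements occupies. It assumes one thread per element, which is a common mapping but not the only one, and the name `warps_needed` is purely illustrative (it is not part of any CUDA API).

```python
import math

WARP_SIZE = 32  # threads per warp, as stated in the list above


def warps_needed(n_elements, warp_size=WARP_SIZE):
    """Return (number of warps, active threads in the last warp).

    Assumes n_elements >= 1 and one thread per element; real kernels
    may assign several elements to each thread.
    """
    n_warps = math.ceil(n_elements / warp_size)
    remainder = n_elements % warp_size
    last_active = remainder if remainder else warp_size
    return n_warps, last_active


for n in (32, 1000, 1_000_000):
    n_warps, last_active = warps_needed(n)
    print(f"{n:>9,} elements -> {n_warps:,} warps, "
          f"{last_active} active threads in the last warp")
```

When the element count is not a multiple of 32, the last warp is only partly filled, which is one reason array sizes that are multiples of 32 tend to map onto the hardware more evenly.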

### Host and Device Memory

A critical concept in GPU computing is the separation between:

* **Host**: The CPU and its main memory (RAM)
* **Device**: The GPU and its dedicated memory (VRAM)

Data must be explicitly transferred between host and device:

```
+------------------+      Transfer      +------------------+
|      HOST        |  <------------>   |     DEVICE       |
|   CPU + RAM      |                   |   GPU + VRAM     |
| (numpy arrays)   |                   | (cupy/torch)     |
+------------------+                   +------------------+
```

This data transfer has overhead. Moving a large array to the GPU takes time, even before any computation begins. The key to efficient GPU programming is:

1. Transfer data to GPU once
2. Perform many operations on GPU
3. Transfer results back when done

Avoid repeatedly moving data back and forth between CPU and GPU.
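
The effect of the three-step pattern above can be estimated with a rough cost model: each copy costs a fixed latency plus the number of bytes divided by the link bandwidth. The sketch below is a back-of-the-envelope calculation only. The default bandwidth (16 GB/s) and latency (10 microseconds) are illustrative assumptions, not measurements, and `transfer_seconds` is a hypothetical helper.

```python
def transfer_seconds(n_bytes, bandwidth_gb_s=16.0, latency_s=10e-6):
    """Rough host<->device copy time: fixed latency plus bytes / bandwidth.

    The default bandwidth and latency are illustrative assumptions,
    not measurements of any particular system.
    """
    return latency_s + n_bytes / (bandwidth_gb_s * 1e9)


n_bytes = 1_000_000 * 8   # one million float64 values
n_steps = 100             # number of GPU operations to run on the data

# Pattern 1: copy the input in and a scalar result out on every step
per_step = n_steps * (transfer_seconds(n_bytes) + transfer_seconds(8))

# Pattern 2: copy the input once, run all steps, copy one result back
once = transfer_seconds(n_bytes) + transfer_seconds(8)

print(f"Transfer time, copying every step: {per_step * 1e3:.2f} ms")
print(f"Transfer time, copying once:       {once * 1e3:.2f} ms")
```

Under this model the transfer cost grows linearly with the number of round trips, so keeping data on the device turns 100 copies into one.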

In [None]:
!pip install cupy-cuda12x
!pip install torch

### Checking GPU Availability

Before running GPU code, verify that a GPU is available:

In [None]:
# Check with CuPy
import cupy as cp

print(f"CuPy version: {cp.__version__}")
print(f"CUDA available: {cp.cuda.is_available()}")
if cp.cuda.is_available():
    # The device name is returned as bytes, so decode it for display
    print(f"GPU device: {cp.cuda.runtime.getDeviceProperties(0)['name'].decode()}")

In [None]:
# Check with PyTorch
import torch

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU device: {torch.cuda.get_device_name(0)}")
    print(f"GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
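
Both checks above raise `ImportError` if the library is not installed. As a preliminary step, the following sketch uses only the standard library to report which packages can be found, without importing them. The helper name `installed_gpu_libraries` is an illustrative assumption. Note that a library being installed does not mean a GPU is present, so the availability checks above are still needed.

```python
import importlib.util


def installed_gpu_libraries(names=("cupy", "torch")):
    """Map each library name to True if it can be found on this system.

    find_spec only locates the package; it does not import it, so this
    check is safe on machines without a GPU or CUDA driver.
    """
    return {name: importlib.util.find_spec(name) is not None for name in names}


for name, found in installed_gpu_libraries().items():
    status = "installed" if found else "not installed"
    print(f"{name}: {status}")
```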

### Question

Consider these three computational tasks:

1. Computing the mean of 100 numbers
2. Multiplying two 10000x10000 matrices
3. A recursive Fibonacci calculation

Which task(s) would benefit from GPU acceleration, and why?

**Answer:**

Only task 2 (matrix multiplication) would benefit from GPU acceleration.

* **Task 1** involves too little data. The overhead of transferring 100 numbers to the GPU would exceed the computation time.
* **Task 2** is ideal for GPU: matrix multiplication is highly parallelizable, and the large size (100 million elements per matrix, and on the order of 10^12 multiply-add operations) amortizes the transfer overhead.
* **Task 3** is inherently sequential. Each Fibonacci number depends on the previous two, so there's no parallelism to exploit.
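
For Task 2 it helps to separate two counts: the number of elements that must be moved and the number of arithmetic operations performed on them. The sketch below works through that arithmetic for the textbook matrix multiplication algorithm. The helper `matmul_cost` is hypothetical, and the count assumes float32 data with two inputs copied in and one result copied out.

```python
def matmul_cost(n, bytes_per_element=4):
    """Operation and memory counts for multiplying two n x n matrices.

    Uses the standard count for the textbook algorithm: each of the n*n
    output entries needs n multiplications and n - 1 additions, which is
    roughly 2 * n**3 floating-point operations in total.
    """
    elements_per_matrix = n * n
    flops = 2 * n**3
    # Two inputs are copied to the GPU and one result is copied back
    bytes_moved = 3 * elements_per_matrix * bytes_per_element
    return elements_per_matrix, flops, bytes_moved


elements, flops, bytes_moved = matmul_cost(10_000)
print(f"Elements per matrix: {elements:.1e}")
print(f"Floating-point ops:  {flops:.1e}")
print(f"Bytes transferred:   {bytes_moved / 1e9:.1f} GB (float32)")
print(f"Operations per byte: {flops / bytes_moved:.0f}")
```

The high ratio of operations to bytes moved is what lets a large matrix product absorb the cost of the transfer. By contrast, a mean of 100 numbers performs about one operation per element.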

## CuPy: NumPy on the GPU

CuPy provides a NumPy-compatible interface for GPU arrays. Most NumPy code can run on GPU by simply replacing `numpy` with `cupy`.

### Basic Usage

CuPy arrays work like NumPy arrays:

In [8]:
import numpy as np
import cupy as cp
import time

# Generate large arrays on CPU first to ensure same values
N = 10_000_000
print(f"Generating {N:,} random numbers on CPU...")
a_cpu = np.random.rand(N).astype(np.float32)
b_cpu = np.random.rand(N).astype(np.float32)

print("--- CuPy (GPU) ---")
# Transfer to GPU before timing computation to isolate processing speed
a_gpu = cp.asarray(a_cpu)
b_gpu = cp.asarray(b_cpu)

# Synchronize before timing to ensure GPU is ready
cp.cuda.Stream.null.synchronize()
start_gpu = time.perf_counter()

# Operations run on GPU
c_gpu = a_gpu + b_gpu
d_gpu = cp.sqrt(a_gpu)
e_gpu = cp.dot(a_gpu, b_gpu)

# Synchronize after operations to wait for completion
cp.cuda.Stream.null.synchronize()
end_gpu = time.perf_counter()

print(f"GPU Execution Time: {(end_gpu - start_gpu) * 1000:.4f} ms")

print("\n--- NumPy (CPU) ---")
start_cpu = time.perf_counter()

# Operations run on CPU
c_cpu = a_cpu + b_cpu
d_cpu = np.sqrt(a_cpu)
e_cpu = np.dot(a_cpu, b_cpu)

end_cpu = time.perf_counter()

print(f"CPU Execution Time: {(end_cpu - start_cpu) * 1000:.4f} ms")

speedup = (end_cpu - start_cpu) / (end_gpu - start_gpu)
print(f"\nSpeedup: {speedup:.2f}x")

Generating 10,000,000 random numbers on CPU...
--- CuPy (GPU) ---
GPU Execution Time: 1.6526 ms

--- NumPy (CPU) ---
CPU Execution Time: 34.6861 ms

Speedup: 20.99x


### Transferring Data Between CPU and GPU

In [10]:
import numpy as np
import cupy as cp

# Create NumPy array on CPU
data_cpu = np.random.randn(1000000)

# Transfer to GPU
data_gpu = cp.asarray(data_cpu)

# Perform computation on GPU
result_gpu = cp.mean(data_gpu ** 2)

# Transfer result back to CPU
result_cpu = result_gpu.get()  # or: cp.asnumpy(result_gpu)

print(f"Type on GPU: {type(result_gpu)}")
print(f"Result: {result_gpu}")
print(f"Type on CPU: {type(result_cpu)}")
print(f"Result: {result_cpu}")

Type on GPU: <class 'cupy.ndarray'>
Result: 1.0010916407329538
Type on CPU: <class 'numpy.ndarray'>
Result: 1.0010916407329538


Key functions:
* `cp.asarray(numpy_array)`: Transfer NumPy array to GPU
* `gpu_array.get()` or `cp.asnumpy(gpu_array)`: Transfer CuPy array to CPU

### CuPy as a Drop-in Replacement

Most NumPy functions have CuPy equivalents:

In [11]:
import numpy as np
import cupy as cp

# NumPy code
def compute_stats_numpy(data):
    mean = np.mean(data)
    std = np.std(data)
    normalized = (data - mean) / std
    return np.sum(normalized ** 2)

# CuPy code - just change np to cp
def compute_stats_cupy(data):
    mean = cp.mean(data)
    std = cp.std(data)
    normalized = (data - mean) / std
    return cp.sum(normalized ** 2)

# Test
data_np = np.random.randn(10000000)
data_cp = cp.asarray(data_np)

result_np = compute_stats_numpy(data_np)
result_cp = compute_stats_cupy(data_cp).get()

print(f"NumPy result: {result_np}")
print(f"CuPy result: {result_cp}")

NumPy result: 9999999.99999999
CuPy result: 10000000.0


### Random Number Generation

CuPy provides GPU-accelerated random number generation:

In [12]:
import cupy as cp

# Set seed for reproducibility
cp.random.seed(42)

# Generate random numbers on GPU
uniform = cp.random.random(1000000)
normal = cp.random.randn(1000000)
integers = cp.random.randint(0, 100, size=1000000)

print(f"Uniform mean: {cp.mean(uniform):.4f}")
print(f"Normal mean: {cp.mean(normal):.4f}")
print(f"Integer mean: {cp.mean(integers):.4f}")

Uniform mean: 0.4997
Normal mean: -0.0002
Integer mean: 49.5714


### Performance Comparison

Let's compare NumPy and CuPy for matrix multiplication:

In [14]:
import numpy as np
import cupy as cp
import time


def benchmark_matmul(size, n_runs=10):
    """Compare NumPy vs CuPy matrix multiplication."""
    # Create random matrices
    A_np = np.random.randn(size, size).astype(np.float32)
    B_np = np.random.randn(size, size).astype(np.float32)

    A_cp = cp.asarray(A_np)
    B_cp = cp.asarray(B_np)

    # Warm up GPU
    _ = cp.dot(A_cp, B_cp)
    cp.cuda.Stream.null.synchronize()

    # Time NumPy
    start = time.perf_counter()
    for _ in range(n_runs):
        C_np = np.dot(A_np, B_np)
    numpy_time = (time.perf_counter() - start) / n_runs

    # Time CuPy
    start = time.perf_counter()
    for _ in range(n_runs):
        C_cp = cp.dot(A_cp, B_cp)
        cp.cuda.Stream.null.synchronize()  # Wait for GPU to finish
    cupy_time = (time.perf_counter() - start) / n_runs

    print(f"Size {size}x{size}:")
    print(f"  NumPy: {numpy_time*1000:.2f} ms")
    print(f"  CuPy:  {cupy_time*1000:.2f} ms")
    print(f"  Speedup: {numpy_time/cupy_time:.1f}x")


# Test different sizes
for size in [100, 500, 1000, 2000, 4000]:
    benchmark_matmul(size)
    print()

Size 100x100:
  NumPy: 0.04 ms
  CuPy:  0.06 ms
  Speedup: 0.6x

Size 500x500:
  NumPy: 0.70 ms
  CuPy:  0.08 ms
  Speedup: 8.2x

Size 1000x1000:
  NumPy: 3.79 ms
  CuPy:  0.22 ms
  Speedup: 17.2x

Size 2000x2000:
  NumPy: 26.96 ms
  CuPy:  1.17 ms
  Speedup: 23.1x

Size 4000x4000:
  NumPy: 205.83 ms
  CuPy:  10.48 ms
  Speedup: 19.6x



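Typical results show:
* Small matrices (100x100): the GPU may be slower because kernel launch and synchronization overhead dominates (the matrices are already on the GPU in this benchmark, so transfer is not the cause)
* Large matrices (2000x2000+): the GPU can be 10-50x faster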
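
One way to read the benchmark is to convert each timing into throughput. The sketch below does this for the timings recorded in the output above, counting roughly `2 * size**3` floating-point operations per product. The helper `gflops` is illustrative, and because the recorded times are rounded to two decimals, the figures for the smallest sizes are only rough.

```python
# (size, NumPy ms, CuPy ms) copied from the benchmark output above
recorded = [
    (100, 0.04, 0.06),
    (500, 0.70, 0.08),
    (1000, 3.79, 0.22),
    (2000, 26.96, 1.17),
    (4000, 205.83, 10.48),
]


def gflops(size, milliseconds):
    """Throughput in GFLOP/s, counting 2 * size**3 operations per product."""
    return 2 * size**3 / (milliseconds * 1e-3) / 1e9


for size, numpy_ms, cupy_ms in recorded:
    print(f"{size:>5}: NumPy {gflops(size, numpy_ms):>8.0f} GFLOP/s, "
          f"CuPy {gflops(size, cupy_ms):>8.0f} GFLOP/s")
```

Read this way, the GPU's throughput at 100x100 is a small fraction of what it reaches at 4000x4000, which suggests that fixed per-call overhead, not arithmetic, limits the small cases.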

### Question

Consider the following code:

In [None]:
import numpy as np
import cupy as cp

data = np.random.randn(1000000)

# Version A
for i in range(100):
    gpu_data = cp.asarray(data)
    result = cp.sum(gpu_data ** 2)
    cpu_result = result.get()

# Version B
gpu_data = cp.asarray(data)
for i in range(100):
    result = cp.sum(gpu_data ** 2)
cpu_result = result.get()

Which version is faster, and why?

**Answer:**

Version B is significantly faster. In Version A, the array is copied to the GPU on every iteration and a result is copied back each time, for 200 transfers in total. In Version B, the array is copied once before the loop and one result is copied back after it. Since data transfer between CPU and GPU is slow relative to computation, Version A wastes time on redundant transfers. Always minimize data movement between host and device.

## PyTorch for GPU Computing

PyTorch is primarily known as a deep learning framework, but its GPU tensor operations are useful for general scientific computing. PyTorch offers some advantages over CuPy:

* Automatic differentiation (useful for optimization)
* Broader ecosystem and community support
* Easy model deployment

### Tensors: PyTorch's Array Type

PyTorch uses tensors instead of arrays. Tensors are similar to NumPy arrays but can live on GPU:

In [15]:
import torch
import numpy as np

# Create tensors
a = torch.tensor([1.0, 2.0, 3.0, 4.0, 5.0])
b = torch.zeros(3, 4)  # 3x4 matrix of zeros
c = torch.randn(2, 3)  # 2x3 matrix of random normal values

print(f"a: {a}")
print(f"b shape: {b.shape}")
print(f"c:\n{c}")

a: tensor([1., 2., 3., 4., 5.])
b shape: torch.Size([3, 4])
c:
tensor([[-0.2688, -0.7397, -0.2322],
        [-0.5727, -0.5272, -1.8478]])


### Moving Tensors to GPU

In [16]:
import torch

# Check if GPU is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

# Create tensor on CPU
x_cpu = torch.randn(1000, 1000)
print(f"x_cpu device: {x_cpu.device}")

# Move to GPU
x_gpu = x_cpu.to(device)
print(f"x_gpu device: {x_gpu.device}")

# Alternative: create directly on GPU
y_gpu = torch.randn(1000, 1000, device=device)
print(f"y_gpu device: {y_gpu.device}")

Using device: cuda
x_cpu device: cpu
x_gpu device: cuda:0
y_gpu device: cuda:0


Common patterns for device management:

In [17]:
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Method 1: .to(device)
x = torch.randn(100).to(device)

# Method 2: .cuda() (only works if CUDA available)
if torch.cuda.is_available():
    y = torch.randn(100).cuda()

# Method 3: Create directly on device
z = torch.randn(100, device=device)

### Converting Between PyTorch and NumPy

In [None]:
import torch
import numpy as np

# NumPy to PyTorch (CPU)
np_array = np.array([1.0, 2.0, 3.0])
torch_tensor = torch.from_numpy(np_array)

# PyTorch (CPU) to NumPy
back_to_numpy = torch_tensor.numpy()

# For GPU tensors, must move to CPU first
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
gpu_tensor = torch.randn(5, device=device)
cpu_numpy = gpu_tensor.cpu().numpy()

print(f"Original NumPy: {np_array}")
print(f"Torch tensor: {torch_tensor}")
print(f"Back to NumPy: {back_to_numpy}")
print(f"GPU tensor to NumPy: {cpu_numpy}")

### Basic Operations

PyTorch operations mirror NumPy:

In [None]:
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Create tensors on GPU
a = torch.randn(1000, 1000, device=device)
b = torch.randn(1000, 1000, device=device)

# Element-wise operations
c = a + b
d = a * b
e = torch.exp(a)

# Matrix operations
f = torch.mm(a, b)  # Matrix multiplication
g = torch.inverse(a)  # Matrix inverse

# Reductions
h = torch.sum(a)
i = torch.mean(a, dim=0)  # Mean of each column (reduces over rows)

# Linear algebra
u, s, vh = torch.linalg.svd(a)  # Replaces the deprecated torch.svd; vh is V transposed

print(f"Matrix product shape: {f.shape}")
print(f"Sum: {h.item():.4f}")  # .item() extracts scalar value

### Automatic Differentiation

Unlike CuPy, PyTorch provides automatic differentiation. It computes gradients for you, which is useful for optimization:

In [18]:
import torch

# Create tensor with gradient tracking
x = torch.tensor([2.0, 3.0], requires_grad=True, device='cuda')

# Compute a function
y = x[0]**2 + 3*x[1]**2

# Compute gradient
y.backward()

# Access gradients
print(f"x = {x}")
print(f"y = x[0]^2 + 3*x[1]^2 = {y.item()}")
print(f"dy/dx = {x.grad}")  # Should be [2*x[0], 6*x[1]] = [4, 18]

x = tensor([2., 3.], device='cuda:0', requires_grad=True)
y = x[0]^2 + 3*x[1]^2 = 31.0
dy/dx = tensor([ 4., 18.], device='cuda:0')


This is particularly useful for:
* Maximum likelihood estimation
* Gradient-based optimization
* Neural network training
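
The gradient printed above can be checked without PyTorch. The following sketch compares the hand-derived gradient `[2*x[0], 6*x[1]]` with a central finite-difference approximation in plain Python. The helper `numerical_gradient` and the step size `h` are illustrative choices, not part of the PyTorch API.

```python
def f(x):
    """The function differentiated above: f(x) = x[0]^2 + 3 * x[1]^2."""
    return x[0] ** 2 + 3 * x[1] ** 2


def numerical_gradient(func, x, h=1e-6):
    """Approximate the gradient of func at x with central differences."""
    grad = []
    for i in range(len(x)):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[i] += h
        x_minus[i] -= h
        grad.append((func(x_plus) - func(x_minus)) / (2 * h))
    return grad


x = [2.0, 3.0]
analytic = [2 * x[0], 6 * x[1]]      # derived by hand
numeric = numerical_gradient(f, x)

print(f"Analytic gradient: {analytic}")
print(f"Numeric gradient:  {[round(g, 4) for g in numeric]}")
```

Finite differences need two function evaluations per parameter, whereas automatic differentiation obtains the whole gradient from a single backward pass. That difference matters once a model has many parameters.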

### Question

What happens if you try to perform an operation between a CPU tensor and a GPU tensor in PyTorch?

In [19]:
import torch

a = torch.randn(100)  # CPU tensor
b = torch.randn(100, device='cuda')  # GPU tensor
c = a + b  # What happens?

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

**Answer:**

PyTorch raises a RuntimeError because tensors must be on the same device for operations. The error message will say something like "Expected all tensors to be on the same device." You must explicitly move tensors to the same device before operating on them: either `a.cuda() + b` or `a + b.cpu()`.

## Practical Considerations

### Timing GPU Operations

GPU operations are asynchronous. When you call a GPU function, Python returns immediately while the GPU continues working. To get accurate timing, you must synchronize:

In [20]:
import torch
import time

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

a = torch.randn(5000, 5000, device=device)
b = torch.randn(5000, 5000, device=device)

# WRONG: The timer can stop before the GPU has finished the work
start = time.perf_counter()
c = torch.mm(a, b)
wrong_time = time.perf_counter() - start
print(f"Wrong measurement: {wrong_time*1000:.2f} ms")

# CORRECT: Synchronize before starting and before stopping the timer
torch.cuda.synchronize()
start = time.perf_counter()
c = torch.mm(a, b)
torch.cuda.synchronize()  # Wait for GPU to finish
correct_time = time.perf_counter() - start
print(f"Correct measurement: {correct_time*1000:.2f} ms")

Wrong measurement: 11.56 ms
Correct measurement: 17.15 ms


For CuPy:

In [None]:
import cupy as cp
import time

a = cp.random.randn(5000, 5000)
b = cp.random.randn(5000, 5000)

# Synchronize with CuPy
cp.cuda.Stream.null.synchronize()
start = time.perf_counter()
c = cp.dot(a, b)
cp.cuda.Stream.null.synchronize()
elapsed = time.perf_counter() - start
print(f"Time: {elapsed*1000:.2f} ms")
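
The asynchronous behavior described in this section can be mimicked on the CPU, which may help if no GPU is at hand. The sketch below is only an analogy: a worker thread stands in for the GPU, `submit()` plays the role of launching a kernel, and `result()` plays the role of `synchronize()`. The function `slow_work` is a hypothetical stand-in that simply sleeps.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_work(seconds=0.2):
    """Stand-in for a GPU kernel: does nothing but take a while."""
    time.sleep(seconds)
    return "done"


with ThreadPoolExecutor(max_workers=1) as pool:
    # Without waiting: submit() returns as soon as the work is queued
    start = time.perf_counter()
    future = pool.submit(slow_work)
    launch_time = time.perf_counter() - start
    future.result()  # drain the queue before the next measurement

    # With waiting: result() blocks until the work has finished,
    # playing the role of synchronize()
    start = time.perf_counter()
    future = pool.submit(slow_work)
    future.result()
    full_time = time.perf_counter() - start

print(f"Without waiting: {launch_time * 1000:.2f} ms")
print(f"With waiting:    {full_time * 1000:.2f} ms")
```

The first measurement captures only the cost of handing off the work, which is the same mistake as timing a GPU call without synchronizing.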

### Memory Management

GPU memory is limited (typically 8-24 GB on consumer GPUs). Monitor and manage it:

In [21]:
import torch

if torch.cuda.is_available():
    # Check memory usage
    print(f"Allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
    print(f"Cached: {torch.cuda.memory_reserved() / 1e9:.2f} GB")

    # Create a large tensor
    x = torch.randn(10000, 10000, device='cuda')
    print(f"After allocation: {torch.cuda.memory_allocated() / 1e9:.2f} GB")

    # Delete and clear cache
    del x
    torch.cuda.empty_cache()
    print(f"After cleanup: {torch.cuda.memory_allocated() / 1e9:.2f} GB")

Allocated: 0.32 GB
Cached: 0.43 GB
After allocation: 0.72 GB
After cleanup: 0.32 GB
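
Before allocating, it is often possible to predict how much memory an array will need: the number of elements times the bytes per element. A minimal sketch of that arithmetic follows. The helper `array_gigabytes` and the `ITEMSIZE` table are illustrative, and the result ignores any allocator overhead.

```python
import math

# Bytes per element for common floating-point types
ITEMSIZE = {"float16": 2, "float32": 4, "float64": 8}


def array_gigabytes(shape, dtype="float32"):
    """Memory needed for a dense array of the given shape, in GB (1e9 bytes)."""
    return math.prod(shape) * ITEMSIZE[dtype] / 1e9


print(f"10000 x 10000 float32: {array_gigabytes((10000, 10000)):.2f} GB")
print(f"10000 x 10000 float64: {array_gigabytes((10000, 10000), 'float64'):.2f} GB")
```

The float32 figure comes to 0.40 GB, which matches the jump from 0.32 GB to 0.72 GB in the output above. The same array in float64 would need twice as much.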


### Common Pitfalls

**1. Data transfer overhead**

In [None]:
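import numpy as np
import torch

# Hypothetical input so the snippet runs: 1000 rows of 100 values each.
# A row sum stands in for whatever per-item processing you need.
data = np.random.randn(1000, 100).astype(np.float32)

# BAD: Transferring inside a loop (1000 copies in, 1000 copies out)
results = []
for i in range(1000):
    x_gpu = torch.tensor(data[i]).cuda()
    result = x_gpu.sum()
    results.append(result.cpu())

# GOOD: Batch transfer (one copy in, one copy out)
x_gpu = torch.tensor(data).cuda()
results_gpu = x_gpu.sum(dim=1)
results = results_gpu.cpu()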

**2. Forgetting to synchronize for timing**

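Call `torch.cuda.synchronize()` or `cp.cuda.Stream.null.synchronize()` both before starting the timer and before stopping it. Otherwise the measurement may capture only the time to launch the work, not the time to finish it.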

**3. Operations between different devices**

In [None]:
# This will error
a_cpu = torch.randn(100)
b_gpu = torch.randn(100, device='cuda')
# c = a_cpu + b_gpu  # RuntimeError!

# Fix: move to same device
c = a_cpu.cuda() + b_gpu

**4. Using GPU for small operations**

In [None]:
import torch
import time

device = 'cuda'

# Small operation: CPU is faster due to transfer overhead
small = torch.randn(100)
torch.cuda.synchronize()
start = time.perf_counter()
result = small.to(device).sum().cpu()
torch.cuda.synchronize()
print(f"Small tensor GPU: {(time.perf_counter()-start)*1000:.3f} ms")

start = time.perf_counter()
result = small.sum()
print(f"Small tensor CPU: {(time.perf_counter()-start)*1000:.3f} ms")

### Question

A researcher has the following code that runs slowly:

In [None]:
results = []
for i in range(10000):
    x = cp.asarray(data[i])  # data[i] is a NumPy array of shape (100,)
    y = cp.sum(x ** 2)
    results.append(y.get())

What is the main performance problem, and how would you fix it?

**Answer:**

The problem is that each of the 10,000 iterations transfers a small array (100 elements) to the GPU and transfers the result back. The data transfer overhead dominates the computation time.

Fix by batching the data:

In [None]:
# Stack all data into one array and transfer once
all_data = cp.asarray(np.stack(data))  # Shape: (10000, 100)
# Compute all results at once on GPU
results_gpu = cp.sum(all_data ** 2, axis=1)  # Shape: (10000,)
# Transfer back once
results = results_gpu.get()

This reduces 20,000 transfers to just 2 (one in, one out) and uses GPU parallelism effectively.

## Statistical Computing Examples

### Monte Carlo Pi Estimation

A classic example to demonstrate GPU parallelism:

In [27]:
import numpy as np
import cupy as cp
import torch
import time


def estimate_pi_numpy(n):
    """Estimate pi using Monte Carlo - NumPy."""
    # np.random.random returns float64; cast to float32 to match the GPU versions
    x = np.random.random(n).astype(np.float32)
    y = np.random.random(n).astype(np.float32)
    inside = np.sum(x**2 + y**2 <= 1)
    return 4 * inside / n


def estimate_pi_cupy(n):
    """Estimate pi using Monte Carlo - CuPy."""
    # Explicitly use float32 for GPU speed (GPUs are much faster at float32)
    x = cp.random.random(n, dtype=cp.float32)
    y = cp.random.random(n, dtype=cp.float32)
    inside = cp.sum(x**2 + y**2 <= 1)
    return 4 * float(inside.get()) / n


def estimate_pi_torch(n, device):
    """Estimate pi using Monte Carlo - PyTorch."""
    # PyTorch defaults to float32
    x = torch.rand(n, device=device)
    y = torch.rand(n, device=device)
    inside = torch.sum(x**2 + y**2 <= 1)
    return 4 * inside.item() / n


# Benchmark
n = 100_000_000
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

print(f"Benchmarking with N={n:,} samples...")
print("Warming up GPU... (to exclude compilation time)")
# Warmup
estimate_pi_cupy(1000)
if torch.cuda.is_available():
    estimate_pi_torch(1000, device)

print("--- Start Timing ---")

# NumPy
start = time.perf_counter()
pi_numpy = estimate_pi_numpy(n)
numpy_time = time.perf_counter() - start

# CuPy
cp.cuda.Stream.null.synchronize()
start = time.perf_counter()
pi_cupy = estimate_pi_cupy(n)
cp.cuda.Stream.null.synchronize()
cupy_time = time.perf_counter() - start

# PyTorch
torch.cuda.synchronize()
start = time.perf_counter()
pi_torch = estimate_pi_torch(n, device)
torch.cuda.synchronize()
torch_time = time.perf_counter() - start

print(f"NumPy (float32):  pi = {pi_numpy:.6f}, time = {numpy_time:.3f}s")
print(f"CuPy (float32):   pi = {pi_cupy:.6f}, time = {cupy_time:.3f}s, speedup = {numpy_time/cupy_time:.1f}x")
print(f"PyTorch (float32): pi = {pi_torch:.6f}, time = {torch_time:.3f}s, speedup = {numpy_time/torch_time:.1f}x")

Benchmarking with N=100,000,000 samples...
Warming up GPU... (to exclude compilation time)
--- Start Timing ---
NumPy (float32):  pi = 3.141690, time = 2.476s
CuPy (float32):   pi = 3.141643, time = 0.025s, speedup = 99.8x
PyTorch (float32): pi = 3.141643, time = 0.024s, speedup = 101.2x
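
To judge whether the estimates above are as accurate as they should be, compare their error with the standard error of the estimator. Each point falls inside the quarter circle with probability pi/4, so the estimate `4 * inside / n` has standard error `4 * sqrt(p * (1 - p) / n)` with `p = pi / 4`. The sketch below applies this standard binomial formula to the recorded values. The helper name `pi_standard_error` is illustrative.

```python
import math


def pi_standard_error(n):
    """Standard error of the Monte Carlo estimate 4 * (hits / n).

    Each point lands inside the quarter circle with probability p = pi / 4,
    so hits / n has variance p * (1 - p) / n.
    """
    p = math.pi / 4
    return 4 * math.sqrt(p * (1 - p) / n)


n = 100_000_000
se = pi_standard_error(n)
print(f"Standard error with n = {n:,}: {se:.2e}")

# Estimates recorded in the benchmark output above
for name, estimate in [("NumPy", 3.141690), ("CuPy", 3.141643), ("PyTorch", 3.141643)]:
    z = (estimate - math.pi) / se
    print(f"{name}: error = {estimate - math.pi:+.2e} ({z:+.2f} standard errors)")
```

Because the standard error shrinks only as `1 / sqrt(n)`, each extra digit of accuracy needs 100 times more samples. That scaling is why fast sample generation on the GPU is valuable here.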


### Bootstrap Confidence Intervals

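Bootstrap resampling benefits from GPU parallelism because the resamples are independent of one another. Note that the NumPy version below loops over resamples while the CuPy version draws them all at once, so the measured speedup reflects vectorization as well as the GPU: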

In [23]:
import numpy as np
import cupy as cp
import time


def bootstrap_mean_numpy(data, n_bootstrap=10000):
    """Bootstrap mean estimation - NumPy."""
    n = len(data)
    means = np.empty(n_bootstrap)
    for i in range(n_bootstrap):
        sample = np.random.choice(data, size=n, replace=True)
        means[i] = np.mean(sample)
    return means


def bootstrap_mean_cupy(data, n_bootstrap=10000):
    """Bootstrap mean estimation - CuPy (vectorized)."""
    n = len(data)
    # Generate all bootstrap indices at once
    indices = cp.random.randint(0, n, size=(n_bootstrap, n))
    # Gather samples and compute means
    samples = data[indices]
    means = cp.mean(samples, axis=1)
    return means


# Generate data
np.random.seed(42)
data_np = np.random.randn(10000)
data_cp = cp.asarray(data_np)

n_bootstrap = 10000

# NumPy
start = time.perf_counter()
means_np = bootstrap_mean_numpy(data_np, n_bootstrap)
numpy_time = time.perf_counter() - start

# CuPy
cp.cuda.Stream.null.synchronize()
start = time.perf_counter()
means_cp = bootstrap_mean_cupy(data_cp, n_bootstrap)
cp.cuda.Stream.null.synchronize()
cupy_time = time.perf_counter() - start

# Compute confidence intervals
ci_np = np.percentile(means_np, [2.5, 97.5])
ci_cp = cp.percentile(means_cp, [2.5, 97.5]).get()

print(f"Bootstrap with {n_bootstrap} resamples:")
print(f"NumPy:  CI = [{ci_np[0]:.4f}, {ci_np[1]:.4f}], time = {numpy_time:.3f}s")
print(f"CuPy:   CI = [{ci_cp[0]:.4f}, {ci_cp[1]:.4f}], time = {cupy_time:.3f}s")
print(f"Speedup: {numpy_time/cupy_time:.1f}x")

Bootstrap with 10000 resamples:
NumPy:  CI = [-0.0221, 0.0176], time = 1.975s
CuPy:   CI = [-0.0217, 0.0177], time = 0.504s
Speedup: 3.9x
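
The vectorized CuPy version trades memory for speed: it materializes an `n_bootstrap x n` index matrix and a sample matrix of the same shape. The sketch below estimates that footprint and shows one way to split the resamples into chunks that fit a memory budget. It assumes 8-byte integer indices and float64 data, and the helpers `bootstrap_memory_gb` and `chunk_sizes` are hypothetical, not CuPy functions.

```python
import math


def bootstrap_memory_gb(n, n_bootstrap, index_bytes=8, value_bytes=8):
    """Memory for the index matrix plus the gathered samples, in GB.

    Assumes 8-byte integer indices and float64 data; adjust the byte
    sizes if your arrays use other dtypes.
    """
    cells = n * n_bootstrap
    return cells * (index_bytes + value_bytes) / 1e9


def chunk_sizes(n_bootstrap, n, budget_gb):
    """Split n_bootstrap resamples into chunks that fit the memory budget."""
    per_resample_gb = bootstrap_memory_gb(n, 1)
    max_chunk = max(1, int(budget_gb / per_resample_gb))
    n_chunks = math.ceil(n_bootstrap / max_chunk)
    # Spread the resamples as evenly as possible across the chunks
    base, extra = divmod(n_bootstrap, n_chunks)
    return [base + 1] * extra + [base] * (n_chunks - extra)


print(f"All at once: {bootstrap_memory_gb(10_000, 10_000):.1f} GB")
print(f"Chunks for a 0.5 GB budget: {chunk_sizes(10_000, 10_000, 0.5)}")
```

Each chunk can then be passed to the vectorized function in turn and the resulting means concatenated, which keeps peak memory bounded at the cost of a few extra kernel launches.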


### Linear Regression with GPU

Matrix operations for linear regression:

In [24]:
import numpy as np
import torch
import time


def ols_numpy(X, y):
    """Ordinary least squares - NumPy."""
    return np.linalg.lstsq(X, y, rcond=None)[0]


def ols_torch(X, y):
    """Ordinary least squares - PyTorch."""
    return torch.linalg.lstsq(X, y).solution


# Generate data
np.random.seed(42)
n, p = 100000, 500
X_np = np.random.randn(n, p).astype(np.float32)
true_beta = np.random.randn(p).astype(np.float32)
y_np = X_np @ true_beta + 0.1 * np.random.randn(n).astype(np.float32)

# Convert to PyTorch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
X_torch = torch.from_numpy(X_np).to(device)
y_torch = torch.from_numpy(y_np).to(device)

# Benchmark
start = time.perf_counter()
beta_np = ols_numpy(X_np, y_np)
numpy_time = time.perf_counter() - start

torch.cuda.synchronize()
start = time.perf_counter()
beta_torch = ols_torch(X_torch, y_torch)
torch.cuda.synchronize()
torch_time = time.perf_counter() - start

print(f"OLS with n={n}, p={p}")
print(f"NumPy:   time = {numpy_time:.3f}s")
print(f"PyTorch: time = {torch_time:.3f}s")
print(f"Speedup: {numpy_time/torch_time:.1f}x")

# Verify results are similar
beta_torch_np = beta_torch.cpu().numpy()
print(f"Max coefficient difference: {np.max(np.abs(beta_np - beta_torch_np)):.6f}")

OLS with n=100000, p=500
NumPy:   time = 3.798s
PyTorch: time = 0.242s
Speedup: 15.7x
Max coefficient difference: 0.000001


## Summary

* **GPU computing** accelerates data-parallel computations by using thousands of simple cores working simultaneously.

* **CuPy** provides a NumPy-compatible interface for GPU arrays. Replace `numpy` with `cupy` and your code runs on GPU with minimal changes.

* **PyTorch** offers GPU tensors plus automatic differentiation. Use it when you need gradients or when your work is adjacent to deep learning.

* **Data transfer** between CPU and GPU has significant overhead. Minimize transfers by batching operations and keeping data on GPU.

* **Timing GPU operations** requires synchronization. Call `synchronize()` before starting the timer and again before stopping it.

* **GPU excels at** large matrix operations, element-wise computations on big arrays, and embarrassingly parallel tasks like Monte Carlo simulations.

* **GPU is not suitable for** small data, sequential algorithms, or tasks with complex branching.

## Recommended Resources

* [CuPy Documentation](https://docs.cupy.dev/) - Official CuPy documentation
* [CuPy User Guide](https://docs.cupy.dev/en/stable/user_guide/index.html) - Getting started with CuPy
* [PyTorch Documentation](https://pytorch.org/docs/stable/index.html) - Official PyTorch documentation
* [PyTorch CUDA Semantics](https://pytorch.org/docs/stable/notes/cuda.html) - GPU programming with PyTorch
* [NVIDIA CUDA Python](https://developer.nvidia.com/cuda-python) - NVIDIA's CUDA Python resources