# Parallel GPU: CPU vs GPU Performance Comparison

This notebook compares CPU and GPU performance for estimating confidence interval coverage rates using Monte Carlo simulation.

## Problem Setup
- We generate samples from a Poisson distribution with mean μ = 2
- For each sample, we compute a 95% confidence interval using the normal approximation
- We check whether the true mean falls within the confidence interval
- The coverage rate is the proportion of samples where the CI contains the true mean
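
Written as a formula, the estimated coverage rate described in the list above is the average of an indicator over replications. The symbols $R$ (number of replications), $r$ (replication index), and $\widehat{C}$ are notation introduced here for illustration; $\bar{x}_r$ and $s_r$ are the mean and standard deviation of the $r$-th sample of size $n$:

$$
\widehat{C} = \frac{1}{R} \sum_{r=1}^{R} \mathbf{1}\left\{ \bar{x}_r - 1.96 \, \frac{s_r}{\sqrt{n}} \le \mu \le \bar{x}_r + 1.96 \, \frac{s_r}{\sqrt{n}} \right\}
$$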

## This Approach
- **CPU**: Process each replication sequentially in a loop
- **GPU**: Generate all replications and compute coverage in parallel using CuPy

## Import Libraries

We import:
- `time` for measuring execution time
- `numpy` for CPU-based numerical computations
- `cupy` for GPU-accelerated numerical computations (if available)

In [1]:
import time
import numpy as np

try:
    import cupy as cp
except ImportError:
    cp = None

## Define Helper Function

`ci_bounds(x)` computes the 95% confidence interval bounds for a sample using the normal approximation:
- $\bar{x}$: sample mean
- $s$: sample standard deviation, computed with `np.std`, whose default `ddof=0` divides by $n$ rather than $n - 1$
- $n$: sample size
- Margin of error: $1.96 \times s / \sqrt{n}$
- Bounds: $[\bar{x} - \text{margin}, \bar{x} + \text{margin}]$

In [2]:
mu = 2

def ci_bounds(x):
    """Compute 95% confidence interval bounds."""
    n = len(x)
    xbar = np.mean(x)
    sig = np.std(x)
    upper = xbar + 1.96 / np.sqrt(n) * sig
    lower = xbar - 1.96 / np.sqrt(n) * sig
    return {"lower": lower, "upper": upper}
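
As a quick check of the formula, the sketch below applies the same calculation to a small hand-picked sample. The function name `ci_bounds_demo` and the data in `sample` are hypothetical and only serve the illustration; the arithmetic mirrors `ci_bounds` above.

```python
import numpy as np

def ci_bounds_demo(x):
    """Same formula as ci_bounds above; np.std uses ddof=0 by default."""
    x = np.asarray(x, dtype=float)
    n = x.size
    margin = 1.96 * np.std(x) / np.sqrt(n)
    return float(np.mean(x) - margin), float(np.mean(x) + margin)

sample = [1, 2, 3, 2, 2]  # hypothetical data, not drawn from the notebook
lower, upper = ci_bounds_demo(sample)
covers = lower <= 2 <= upper  # does the interval contain the true mean 2?
print(f"[{lower:.4f}, {upper:.4f}] covers mu: {covers}")
```

For this sample the mean is 2 and the `ddof=0` standard deviation is $\sqrt{0.4}$, so the margin is $1.96 \times \sqrt{0.4 / 5} \approx 0.554$ and the interval is roughly [1.446, 2.554].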

## Define Coverage Estimation Functions

### `coverage_cpu(rep, n, lam)`
- Generates `rep` samples sequentially
- For each sample, computes the CI and checks if it contains `lam`
- Returns the proportion of covered samples

### `coverage_gpu(rep, n, lam)`
- Generates all `rep` samples simultaneously as a (rep, n) array
- Computes means and standard deviations across all replications in parallel
- Checks coverage for all replications at once using vectorized operations
- Returns the proportion of covered samples

In [3]:
def coverage_cpu(rep, n, lam):
    """Estimate coverage rate using CPU.
    Generates samples and computes coverage sequentially."""
    covered_count = 0
    for _ in range(rep):
        x = np.random.poisson(lam, n)
        bounds = ci_bounds(x)
        if (bounds["lower"] <= lam) and (lam <= bounds["upper"]):
            covered_count += 1
    return covered_count / rep

def coverage_gpu(rep, n, lam):
    """Estimate coverage rate using GPU.
    All replications are generated and processed in parallel."""
    x = cp.random.poisson(lam=lam, size=(rep, n))
    xbar = cp.mean(x, axis=1)
    sig = cp.std(x, axis=1)
    margin = (1.96 / np.sqrt(n)) * sig
    covered = (xbar - margin <= lam) & (lam <= xbar + margin)
    return cp.mean(covered)
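
The vectorized logic in `coverage_gpu` does not depend on the GPU itself. As a point of reference, a minimal sketch of the same computation in plain NumPy might look like the following. The function name `coverage_numpy_vectorized` and the `seed` parameter are assumptions added here; this variant is not part of the original benchmark.

```python
import numpy as np

def coverage_numpy_vectorized(rep, n, lam, seed=None):
    """Vectorized CPU estimate of coverage (illustrative, not benchmarked here)."""
    rng = np.random.default_rng(seed)
    # One row per replication, one column per observation.
    x = rng.poisson(lam=lam, size=(rep, n))
    xbar = x.mean(axis=1)
    sig = x.std(axis=1)  # ddof=0, matching ci_bounds
    margin = 1.96 / np.sqrt(n) * sig
    covered = (xbar - margin <= lam) & (lam <= xbar + margin)
    return float(covered.mean())

est = coverage_numpy_vectorized(20_000, 5, 2, seed=0)
print(f"Vectorized NumPy coverage estimate: {est:.4f}")
```

Timing a variant like this alongside the two functions above would help separate the gain from vectorization from the gain from running on the GPU.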

## Set Experimental Configuration

We define:
- `Rep`: Number of Monte Carlo replications (100,000)
- `sample_size`: Sample size per replication (5)

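The coverage rate for small samples (n = 5) is expected to fall below the nominal 0.95: with only five Poisson draws, the sample mean is far from normally distributed and the standard deviation is estimated imprecisely, so the normal-approximation interval tends to be too narrow.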

In [4]:
# Experimental configuration
Rep = 100000
sample_size = 5

print(f"Sample size = {sample_size}, replications = {Rep}:")

Sample size = 5, replications = 100000:
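
To illustrate how the sample size affects coverage, the sketch below estimates the coverage rate for a few values of n using vectorized NumPy. The function name `coverage_by_n`, the chosen sample sizes, and the reduced replication count are assumptions made to keep the example small; they are not settings from this notebook.

```python
import numpy as np

def coverage_by_n(sample_sizes, lam=2, rep=20_000, seed=0):
    """Estimate coverage of the normal-approximation CI for several sample sizes."""
    rng = np.random.default_rng(seed)
    results = {}
    for n in sample_sizes:
        x = rng.poisson(lam=lam, size=(rep, n))
        xbar = x.mean(axis=1)
        margin = 1.96 * x.std(axis=1) / np.sqrt(n)  # ddof=0, as in ci_bounds
        covered = (xbar - margin <= lam) & (lam <= xbar + margin)
        results[n] = float(covered.mean())
    return results

rates = coverage_by_n([5, 30, 200])  # hypothetical sample sizes
for n, rate in rates.items():
    print(f"n = {n:>3}: estimated coverage = {rate:.3f}")
```

The estimate should move toward 0.95 as n grows, which is the pattern the normal approximation predicts.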


## CPU Benchmark

We run the CPU version and measure:
- Execution time
- Estimated coverage rate

The CPU version runs one Python-level loop iteration per replication. This is straightforward, but the run time grows linearly with the number of replications.

In [5]:
print(f"\n{'='*60}")
print("CPU Coverage Estimation")
print(f"{'='*60}")

# perf_counter is a monotonic, high-resolution clock intended for interval timing
t0 = time.perf_counter()
out_cpu = coverage_cpu(Rep, sample_size, mu)
cpu_time = time.perf_counter() - t0

print(f"CPU sequential run takes {cpu_time:.4f} seconds")
print(f"CPU coverage = {out_cpu:.4f}")


CPU Coverage Estimation
CPU sequential run takes 3.3451 seconds
CPU coverage = 0.8427


## GPU Benchmark

We run the GPU version and measure:
- GPU device name
- Execution time
- Estimated coverage rate

The GPU version:
1. Generates all replications at once using `cp.random.poisson`
2. Computes statistics across all replications in parallel
3. Calls `cp.cuda.Stream.null.synchronize()` before the timer is stopped, because CuPy launches GPU work asynchronously and the measured time would otherwise exclude work still in flight

When CuPy is not available or no GPU is detected, the GPU run is skipped.

In [6]:
if cp is None:
    print("\nGPU run skipped: cupy is not installed.")
    gpu_time = None
    out_gpu = None
else:
    try:
        n_gpu = cp.cuda.runtime.getDeviceCount()
        if n_gpu < 1:
            print("\nGPU run skipped: no CUDA device detected.")
            gpu_time = None
            out_gpu = None
        else:
            props = cp.cuda.runtime.getDeviceProperties(0)
            gpu_name = props["name"].decode("utf-8")
            print(f"\n{'='*60}")
            print("GPU Coverage Estimation")
            print(f"{'='*60}")
            print(f"GPU device: {gpu_name}")

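            # perf_counter is a monotonic, high-resolution clock intended for interval timing
            t0 = time.perf_counter()
            out_gpu = coverage_gpu(Rep, sample_size, mu)
            cp.cuda.Stream.null.synchronize()  # wait for queued GPU work to finish
            gpu_time = time.perf_counter() - t0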

            print(f"GPU parallel run takes {gpu_time:.4f} seconds")
            print(f"GPU coverage = {float(out_gpu.get()):.4f}")
    except cp.cuda.runtime.CUDARuntimeError as err:
        print(f"\nGPU run skipped: {err}")
        gpu_time = None
        out_gpu = None


GPU Coverage Estimation
GPU device: NVIDIA GeForce RTX 3060


GPU parallel run takes 0.2087 seconds
GPU coverage = 0.8440
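
The timing pattern used in the GPU benchmark (start the timer, run, synchronize, stop the timer) can be wrapped in a small helper that also performs a warm-up call, since the first GPU call may include one-time setup cost such as kernel compilation. The sketch below is a hypothetical helper, not code from this notebook: the name `time_call` and its parameters `sync`, `warmup`, and `repeats` are assumptions. It is demonstrated on a plain Python function so that it runs without a GPU.

```python
import time

def time_call(fn, *args, sync=None, warmup=1, repeats=3):
    """Return (best elapsed seconds, last result) for fn(*args).

    sync: optional zero-argument callable run after fn, e.g. a GPU
    synchronize function, so that asynchronous work is included.
    """
    def run():
        result = fn(*args)
        if sync is not None:
            sync()
        return result

    for _ in range(warmup):
        run()  # untimed warm-up calls

    best = float("inf")
    result = None
    for _ in range(repeats):
        t0 = time.perf_counter()
        result = run()
        best = min(best, time.perf_counter() - t0)
    return best, result

elapsed, total = time_call(sum, range(1_000_000))
print(f"best of 3 runs: {elapsed:.4f} s, result = {total}")
```

With CuPy available, the same helper could be called as `time_call(coverage_gpu, Rep, sample_size, mu, sync=cp.cuda.Stream.null.synchronize)`.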


## Performance Comparison

We compare CPU and GPU results:
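- **Execution Time**: The GPU is expected to be faster once the number of replications is large enough to outweigh its fixed launch overhead
- **Coverage Rate**: Both methods should agree up to Monte Carlo error (the standard error is about 0.001 at 100,000 replications)
- **Speedup**: Ratio of CPU time to GPU time
- **Time Saved**: CPU time minus GPU time, in seconds
- **Percentage Faster**: Time saved as a percentage of CPU time, that is, the reduction in run time relative to the CPU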

In [7]:
print(f"\n{'='*70}")
print("PERFORMANCE COMPARISON SUMMARY")
print(f"{'='*70}")
print(f"\nConfiguration: Sample size = {sample_size:,}, Replications = {Rep:,}")
print(f"\n{'Metric':<25} {'CPU':<15} {'GPU':<15} {'Speedup':<10}")
print("-" * 70)

print(f"{'Execution Time (s)':<25} {cpu_time:<15.4f} ", end="")
if gpu_time is not None:
    speedup = cpu_time / gpu_time
    print(f"{gpu_time:<15.4f} {speedup:<10.2f}x")
else:
    print(f"{'N/A':<15} {'N/A':<10}")

print(f"{'Coverage Rate':<25} {out_cpu:<15.4f} ", end="")
if out_gpu is not None:
    print(f"{float(out_gpu.get()):<15.4f} -")
else:
    print(f"{'N/A':<15} -")

if gpu_time is not None:
    print(f"\n{'='*70}")
    print(f"GPU Speedup: {speedup:.2f}x")
    print(f"Time Saved: {cpu_time - gpu_time:.4f} seconds")
    print(f"{(cpu_time - gpu_time) / cpu_time * 100:.1f}% faster than CPU")


PERFORMANCE COMPARISON SUMMARY

Configuration: Sample size = 5, Replications = 100,000

Metric                    CPU             GPU             Speedup   
----------------------------------------------------------------------
Execution Time (s)        3.3451          0.2087          16.03     x
Coverage Rate             0.8427          0.8440          -

GPU Speedup: 16.03x
Time Saved: 3.1365 seconds
93.8% faster than CPU
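
The three derived metrics in the summary can be reproduced from the two timings alone. The sketch below uses the rounded timings printed above. The function name `speedup_metrics` and its dictionary keys are hypothetical.

```python
def speedup_metrics(cpu_time, gpu_time):
    """Derive the summary metrics from two wall-clock timings in seconds."""
    saved = cpu_time - gpu_time
    return {
        "speedup": cpu_time / gpu_time,                  # how many times faster
        "time_saved": saved,                             # seconds
        "pct_time_reduction": saved / cpu_time * 100.0,  # share of CPU time removed
    }

m = speedup_metrics(3.3451, 0.2087)  # rounded timings from the output above
print(f"speedup = {m['speedup']:.2f}x, "
      f"time saved = {m['time_saved']:.4f} s, "
      f"reduction = {m['pct_time_reduction']:.1f}%")
```

Because the inputs here are already rounded to four decimals, the time saved comes out as 3.1364 s rather than the 3.1365 s reported above, which was computed from the unrounded timings. Note that the percentage measures a reduction in run time: a 16x speedup corresponds to about 94% less time, not to a GPU that is "94% faster" in the multiplicative sense.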
