# Writing Efficient Code and GPU Computing Homework

Please save your solutions as a **PDF** and upload it to Canvas.

## Problem 1: Profiling and Vectorization

**(a)** Consider the following cProfile output from a data analysis program:

```
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     5000    8.234    0.002    8.234    0.002 analysis.py:12(compute_distances)
     5000    0.089    0.000    0.089    0.000 analysis.py:28(normalize_vector)
  2500000    1.456    0.000    1.456    0.000 analysis.py:35(squared_diff)
        1    0.002    0.002    9.781    9.781 analysis.py:50(main)
```

Which function should you optimize first? Explain your reasoning based on the profiling data. What percentage of the total runtime does this function account for?
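For reference, a table like the one above can be produced with the standard-library `cProfile` and `pstats` modules. The sketch below is a minimal illustration only: `slow_sum`, `main`, and `profile_report` are hypothetical stand-ins, not the functions from `analysis.py`.

```python
import cProfile
import io
import pstats


def slow_sum(n):
    """Hypothetical workload: sum of squares in a pure-Python loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def main():
    """Hypothetical entry point that calls the workload several times."""
    return [slow_sum(20_000) for _ in range(50)]


def profile_report(func, sort_key="tottime", limit=5):
    """Run func under cProfile and return the formatted stats table as a string."""
    profiler = cProfile.Profile()
    profiler.enable()
    func()
    profiler.disable()

    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats(sort_key).print_stats(limit)  # top `limit` rows by sort_key
    return buffer.getvalue()


report = profile_report(main)
print(report)
```

In the resulting table, `tottime` is the time spent inside a function excluding the functions it calls, while `cumtime` includes those sub-calls.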

**(b)** The following function computes weighted squared differences between two arrays:

```python
def weighted_squared_diff_loop(x, y, w):
    """Compute sum of weighted squared differences using a loop."""
    n = len(x)
    total = 0.0
    for i in range(n):
        diff = x[i] - y[i]
        total += w[i] * diff * diff
    return total
```

Write a vectorized version of this function using NumPy operations. Your function should produce the same result as the loop version without any explicit Python loops. For example, `weighted_squared_diff(np.array([1, 2, 3]), np.array([0, 1, 1]), np.array([1, 2, 3]))` should return a value equal to `15.0`.

```python
import numpy as np


def weighted_squared_diff(x, y, w):
    """Compute sum of weighted squared differences using vectorization."""
    pass
```

**(c)** Write a function that transforms an array by replacing negative values with zero and dividing each positive value by the mean of the positive values. For example, given `np.array([-2, 4, -1, 6, 2])`, the positive values are `[4, 6, 2]` with mean `4.0`, so the result should be `np.array([0, 1, 0, 1.5, 0.5])`. Use boolean indexing instead of loops.
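As a reminder of how boolean indexing works, here is a small sketch on a different, hypothetical task (capping large values). The function name `cap_large_values` and its `limit` parameter are made up for illustration and are not part of the assignment.

```python
import numpy as np


def cap_large_values(arr, limit):
    """Hypothetical example: replace entries above `limit` with `limit`."""
    out = np.asarray(arr, dtype=float).copy()  # work on a float copy
    mask = out > limit            # boolean array, True where the condition holds
    out[mask] = limit             # assign only to the selected entries
    return out, int(mask.sum())   # capped array and number of entries changed


capped, n_changed = cap_large_values([1, 8, 3, 12, 5], limit=6)
print(capped)      # the entries 8 and 12 are replaced by 6
print(n_changed)   # number of entries that were capped
```

The mask has the same shape as the array, so it can be used both to select entries (`out[mask]`) and to count them (`mask.sum()`).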

```python
import numpy as np


def transform_array(arr):
    """Replace negatives with 0, scale positives by their mean."""
    pass
```

## Problem 2: Parallelization and JIT Compilation

**(a)** The following function computes the mean of a bootstrap sample:

```python
def compute_bootstrap_mean(args):
    """Compute mean of a bootstrap sample."""
    data, seed = args
    rng = np.random.RandomState(seed)
    sample = rng.choice(data, size=len(data), replace=True)
    return np.mean(sample)
```

Write a function that uses `multiprocessing.Pool` to compute `n_bootstrap` bootstrap means in parallel. Each bootstrap iteration should receive a unique seed to ensure different random samples. Return a list of the bootstrap means.

```python
import numpy as np
import multiprocessing as mp


def compute_bootstrap_mean(args):
    """Compute mean of a bootstrap sample."""
    data, seed = args
    rng = np.random.RandomState(seed)
    sample = rng.choice(data, size=len(data), replace=True)
    return np.mean(sample)


def parallel_bootstrap(data, n_bootstrap, n_workers=4):
    """Compute bootstrap means in parallel."""
    pass
```

**(b)** Write a Numba-optimized function that computes the running maximum of an array. For each position `i`, the output should contain the maximum of all elements from index 0 to `i` (inclusive). For example, `running_max(np.array([3, 1, 4, 1, 5, 9, 2, 6]))` should return `np.array([3, 3, 4, 4, 5, 9, 9, 9])`.

```python
from numba import njit
import numpy as np


@njit
def running_max(arr):
    """Compute running maximum of array."""
    pass
```

**(c)** The following Numba function attempts to filter an array to keep only positive values, but it fails to compile. Explain why it fails and provide a corrected version that compiles successfully with `@njit`.

```python
from numba import njit
import numpy as np


@njit
def filter_positive_broken(arr):
    """Return array containing only positive values (BROKEN)."""
    result = []
    for x in arr:
        if x > 0:
            result.append(x)
    return np.array(result)
```

## Problem 3: GPU Computing Fundamentals

**(a)** For each of the following computational tasks, state whether it would benefit from GPU acceleration and explain why or why not.

1. Computing the mean of 500 numbers
2. Multiplying two 5000 × 5000 matrices
3. Reading a 10 GB CSV file from disk
4. Running 1 million independent Monte Carlo simulations
5. Computing Fibonacci numbers recursively

**(b)** The following code runs slowly despite using the GPU. Identify the performance problem and rewrite the code to fix it. The goal is to compute the sum of squares for each of 1000 different arrays and then add up the results.

```python
import cupy as cp
import numpy as np

results = []
for i in range(1000):
    data = np.random.randn(10000)  # Generate on CPU
    gpu_data = cp.asarray(data)    # Transfer to GPU
    result = cp.sum(gpu_data ** 2) # Compute on GPU
    results.append(result.get())   # Transfer back to CPU

print(f"Total: {sum(results)}")
```

Write an efficient version that minimizes data transfers between CPU and GPU.

## Problem 4: CuPy and PyTorch

**(a)** Convert the following NumPy code to CuPy. The function applies z-score normalization to each column and then computes the correlation matrix.

```python
import numpy as np


def correlation_matrix_numpy(X):
    """Compute correlation matrix after z-score normalization.

    X has shape (n_samples, n_features).
    """
    # Z-score normalize each column
    mean = np.mean(X, axis=0)
    std = np.std(X, axis=0)
    Z = (X - mean) / std

    # Compute correlation matrix
    n = X.shape[0]
    corr = (Z.T @ Z) / n
    return corr
```

Write the CuPy version that performs the computation on the GPU and returns the result as a NumPy array.

```python
import cupy as cp
import numpy as np


def correlation_matrix_cupy(X):
    """Compute correlation matrix using CuPy (GPU)."""
    pass
```

**(b)** The following PyTorch code has a bug that causes a runtime error. Identify the error and provide the corrected code.

```python
import torch
import numpy as np


def process_data(numpy_array):
    """Process data using PyTorch on GPU."""
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

    # Convert to tensor and move to GPU
    x = torch.from_numpy(numpy_array).to(device)

    # Create another tensor for computation
    weights = torch.ones(len(numpy_array))

    # Weighted sum
    result = torch.sum(x * weights)

    return result.item()
```

**(c)** Explain why the following GPU timing code gives incorrect measurements. Then provide corrected code that accurately measures GPU computation time.

```python
import torch
import time

device = torch.device('cuda')
a = torch.randn(5000, 5000, device=device)
b = torch.randn(5000, 5000, device=device)

start = time.perf_counter()
c = torch.mm(a, b)
elapsed = time.perf_counter() - start
print(f"Time: {elapsed*1000:.2f} ms")
```

## Problem 5: Performance Comparison

**(a)** In extreme value statistics, we often need to estimate the probability that the maximum of `n` independent standard normal random variables exceeds a threshold `t`. This can be done via Monte Carlo simulation: generate `n` normal values, take the maximum, and check whether it exceeds `t`. Repeat this many times and compute the proportion of trials in which the maximum exceeds `t`.

Implement two versions of this simulation:

1. A Numba-optimized CPU version using `@njit`
2. A CuPy GPU version using vectorized operations

Both functions should take parameters `n` (number of normal values per trial), `t` (threshold), and `n_simulations` (number of Monte Carlo trials), and return the estimated probability.
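A closed-form value is handy for checking either implementation. Because the `n` values are independent, $P(\max \le t) = \Phi(t)^n$, so $P(\max > t) = 1 - \Phi(t)^n$, where $\Phi$ is the standard normal CDF. The sketch below evaluates this with the standard library; the helper name `exact_prob_max_exceeds` is hypothetical.

```python
from statistics import NormalDist


def exact_prob_max_exceeds(n, t):
    """Exact P(max of n i.i.d. standard normals > t) = 1 - Phi(t)**n."""
    phi_t = NormalDist().cdf(t)  # standard normal CDF evaluated at t
    return 1.0 - phi_t ** n


print(exact_prob_max_exceeds(1, 0.0))    # a single normal exceeds 0 half the time
print(exact_prob_max_exceeds(100, 3.0))  # reference value for n=100, t=3
```

A Monte Carlo estimate should approach this value as `n_simulations` grows, up to sampling noise.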

```python
from numba import njit
import numpy as np
import cupy as cp


@njit
def estimate_prob_numba(n, t, n_simulations):
    """Estimate P(max of n normals > t) using Numba."""
    pass


def estimate_prob_cupy(n, t, n_simulations):
    """Estimate P(max of n normals > t) using CuPy."""
    pass
```

**(b)** Design an experiment to find the "crossover point" where the GPU version becomes faster than the CPU version. Your experiment should vary the problem size (e.g., `n_simulations`) and measure execution time for both implementations. Describe what factors affect where this crossover occurs and what values you would test.
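One possible shape for such an experiment is sketched below, using only the standard library. The helpers `time_call` and `find_crossover` are hypothetical, and the two stand-in workloads simulate a cost that grows with size and a cost dominated by fixed overhead; they are not the Numba or CuPy functions. The sketch assumes each timed callable returns only once its work is complete.

```python
import time


def time_call(func, *args, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` calls.

    One untimed warm-up call runs first so that one-off costs
    (for example, JIT compilation) are not included.
    """
    func(*args)  # warm-up, not timed
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        best = min(best, time.perf_counter() - start)
    return best


def find_crossover(func_a, func_b, sizes):
    """Time both callables at each size.

    Returns (rows, crossover), where rows holds (size, time_a, time_b)
    and crossover is the first size at which func_b is faster, or None.
    """
    rows = []
    crossover = None
    for size in sizes:
        t_a = time_call(func_a, size)
        t_b = time_call(func_b, size)
        rows.append((size, t_a, t_b))
        if crossover is None and t_b < t_a:
            crossover = size
    return rows, crossover


def linear_cost(size):
    """Stand-in workload whose cost grows with the problem size."""
    time.sleep(size * 0.002)


def fixed_overhead(size):
    """Stand-in workload dominated by a constant overhead."""
    time.sleep(0.02)


rows, crossover = find_crossover(linear_cost, fixed_overhead, [1, 2, 50])
for size, t_a, t_b in rows:
    print(f"size={size:>3}  a={t_a * 1000:7.2f} ms  b={t_b * 1000:7.2f} ms")
print("first size where b is faster:", crossover)
```

Taking the best of several repeats reduces noise from other processes; a real experiment would likely use more sizes, spaced logarithmically.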

**(c)** Suppose you need to run a very large simulation with `n_simulations = 100_000_000`, but your GPU has only 8 GB of memory. The naive CuPy implementation would generate a matrix of shape `(n_simulations, n)`, which may not fit in GPU memory. Write a batched version that processes the simulations in chunks to stay within the memory limit.
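To get a feel for the sizes involved, the following back-of-the-envelope sketch computes the memory needed for the sample matrix alone. It assumes `n = 100` and 8-byte `float64` samples, neither of which is specified by the problem, and the helper `batch_plan` is hypothetical.

```python
import math


def batch_plan(n_simulations, n, batch_size, bytes_per_value=8):
    """Return (full_bytes, batch_bytes, n_batches) for an (n_simulations, n) array.

    bytes_per_value=8 assumes float64 samples.
    """
    full_bytes = n_simulations * n * bytes_per_value
    batch_bytes = min(batch_size, n_simulations) * n * bytes_per_value
    n_batches = math.ceil(n_simulations / batch_size)
    return full_bytes, batch_bytes, n_batches


# Assumed values for illustration: n=100 samples per trial.
full, per_batch, batches = batch_plan(100_000_000, n=100, batch_size=1_000_000)
print(f"full matrix: {full / 1e9:.1f} GB")       # 80.0 GB
print(f"one batch:   {per_batch / 1e9:.1f} GB")  # 0.8 GB
print(f"batches:     {batches}")                 # 100
```

Peak usage can be higher than these figures, because intermediate arrays created during the computation also occupy GPU memory.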

```python
import cupy as cp


def estimate_prob_cupy_batched(n, t, n_simulations, batch_size=1_000_000):
    """Estimate P(max of n normals > t) using CuPy with batching."""
    pass
```