# Homework: Documenting Your Code + Testing Your Code

## Problem 1: Write Docstrings

The following functions are missing docstrings. Write Google-style docstrings for each function, including `Args`, `Returns`, and `Raises` sections where appropriate. Make sure to document default values and explain what each parameter means.

In [None]:
import numpy as np

def normalize(data, method="zscore"):
    """Normalizes a dataset using the specified method.

    Standardizes or scales input data to a common range or distribution.
    Supports Z-score standardization (default) and Min-Max scaling.

    Args:
        data (array-like): The input data to be normalized. Can be a list,
            tuple, or NumPy array.
        method (str, optional): The normalization technique to use.
            Options are:
            * 'zscore': Standardizes data to have a mean of 0 and a
                standard deviation of 1.
            * 'minmax': Scales data to the range [0, 1].
            Defaults to "zscore".

    Returns:
        numpy.ndarray: An array containing the normalized data.

    Raises:
        ValueError: If `method` is not "zscore" or "minmax".
    """
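    # Convert first so that lists and tuples support the arithmetic below.
    data = np.asarray(data, dtype=float)
    if method == "zscore":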
        return (data - np.mean(data)) / np.std(data)
    elif method == "minmax":
        return (data - np.min(data)) / (np.max(data) - np.min(data))
    else:
        raise ValueError(f"Unknown method: {method}")


def weighted_mean(values, weights=None):
    """Calculates the arithmetic or weighted mean of a set of values.

    If weights are provided, computes the weighted average where each value
    contributes according to its corresponding weight. If no weights are
    provided, computes the standard arithmetic mean.

    Args:
        values (array-like): The input values to average. Can be a list,
            tuple, or NumPy array.
        weights (array-like, optional): An array of weights corresponding to
            `values`. If provided, must have the same length as `values`.
            Defaults to None.

    Returns:
        float: The calculated mean (weighted or arithmetic).

    Raises:
        ValueError: If `weights` is provided but does not have the same
            length as `values`.
    """
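    # Convert first so that lists and tuples support element-wise multiplication.
    values = np.asarray(values, dtype=float)
    if weights is None:
        return np.mean(values)
    weights = np.asarray(weights, dtype=float)
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return np.sum(values * weights) / np.sum(weights)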

def remove_outliers(data, threshold=3.0):
    """Removes outliers from a dataset based on Z-score thresholding.

    Filters out data points that lie more than a specified number of standard
    deviations away from the mean. This assumes the data follows a roughly
    normal distribution.

    Args:
        data (numpy.ndarray): The input data array containing numerical values.
        threshold (float, optional): The Z-score threshold. Data points with a
            Z-score absolute value greater than this will be removed.
            Defaults to 3.0.

    Returns:
        numpy.ndarray: A filtered array with outliers removed.
    """
    mean = np.mean(data)
    std = np.std(data)
    mask = np.abs(data - mean) <= threshold * std
    return data[mask]

## Problem 2: Add Type Hints

The following functions have incomplete or missing type hints. Add appropriate type hints for all parameters and return values. Use `|` syntax for union types where a parameter can accept multiple types or return `None`.

In [None]:
import numpy as np

def clip_values(arr: np.ndarray, lower: float | int, upper: float | int) -> np.ndarray:
    """Clip array values to be within [lower, upper] range."""
    return np.clip(arr, lower, upper)


def find_peaks(data: list[float] | np.ndarray, min_height: float | int | None = None) -> list[int] | None:
    """Find indices where values are local maxima above min_height.

    Returns None if no peaks are found.
    """
    peaks = []
    for i in range(1, len(data) - 1):
        if data[i] > data[i - 1] and data[i] > data[i + 1]:
            if min_height is None or data[i] >= min_height:
                peaks.append(i)
    if len(peaks) == 0:
        return None
    return peaks


def summarize(data: list[float] | np.ndarray, stats: list[str]) -> dict[str, float]:
    """Calculate summary statistics for data.

    Args:
        data: Input array of numeric values.
        stats: List of statistic names to compute.
            Valid options: "mean", "median", "std", "min", "max"

    Returns:
        Dictionary mapping statistic names to computed values.
    """
    result = {}
    for stat in stats:
        if stat == "mean":
            result[stat] = float(np.mean(data))
        elif stat == "median":
            result[stat] = float(np.median(data))
        elif stat == "std":
            result[stat] = float(np.std(data))
        elif stat == "min":
            result[stat] = float(np.min(data))
        elif stat == "max":
            result[stat] = float(np.max(data))
    return result

## Problem 3: Identifying Test Types

For each scenario below, identify whether the test being described is a **unit test**, **integration test**, or **regression test**. Briefly explain your reasoning.

**(a)** You write a test that verifies `calculate_variance()` returns 0 for the input `[3.0, 3.0, 3.0]`.

This is a **unit test** because it tests a single function in isolation with a specific input to verify its correctness.

**(b)** After discovering that `fit_model()` crashes when given a dataset with a single row, you fix the bug and add a test with a one-row input.

This is a **regression test** because it ensures that a previously identified bug (crashing with a single-row dataset) does not reoccur in future versions of the code.

**(c)** You write a test that loads data from a CSV file, passes it through `clean_data()`, fits a model with `fit_linear_regression()`, and verifies the model's R-squared value is within an expected range.

This is an **integration test** because it tests multiple components (data loading, cleaning, and model fitting) working together to ensure they function correctly as a whole.

**(d)** A user reports that `normalize()` returns incorrect values when all input values are negative. After fixing the issue, you add a test with input `[-5.0, -3.0, -1.0]`.

This is a **regression test** because it verifies that a previously reported issue (incorrect normalization of negative values) has been resolved and does not reoccur in future versions.
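As a rough illustration of scenario (d), a regression test might look like the sketch below. It includes a local copy of `normalize()` from Problem 1 so that it runs on its own; the `np.asarray` conversion and the test name are assumptions added for this example, and the expected values are worked out by hand from the min-max formula.

```python
import numpy as np


def normalize(data, method="zscore"):
    """Local copy of normalize() from Problem 1, for a self-contained example."""
    data = np.asarray(data, dtype=float)  # assumption: accept plain lists
    if method == "zscore":
        return (data - np.mean(data)) / np.std(data)
    elif method == "minmax":
        return (data - np.min(data)) / (np.max(data) - np.min(data))
    raise ValueError(f"Unknown method: {method}")


def test_normalize_all_negative_inputs():
    """Regression test: all-negative input must still normalize correctly."""
    data = [-5.0, -3.0, -1.0]

    # Min-max: (-5 + 5) / 4 = 0.0, (-3 + 5) / 4 = 0.5, (-1 + 5) / 4 = 1.0
    scaled = normalize(data, method="minmax")
    np.testing.assert_allclose(scaled, [0.0, 0.5, 1.0])

    # Z-score: the result should have mean 0 and standard deviation 1.
    z = normalize(data, method="zscore")
    np.testing.assert_allclose(np.mean(z), 0.0, atol=1e-12)
    np.testing.assert_allclose(np.std(z), 1.0)
```

The test pins the exact input from the bug report, so it documents the original failure as well as guarding against its return.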

## Problem 4: Code Review - What's Wrong with These Tests?

Review the following test code and identify at least **four** problems with the test design or implementation. Explain why each is problematic and suggest how to fix it.

In [None]:
import numpy as np
import pytest

def test_calculate_mean():
    data = [10, 20, 30, 40, 50]
    assert np.mean(data) == 30

def test_calculate_median():
    data = [10, 20, 30, 40, 50]
    assert np.median(data) == 30

def test_standard_deviation():
    data = [10, 20, 30, 40, 50]
    assert np.std(data) > 0

def test_min():
    data = [10, 20, 30, 40, 50]
    assert np.min(data) == 10

def test_max():
    data = [10, 20, 30, 40, 50]
    assert np.max(data) == 50

def test_sum():
    data = [10, 20, 30, 40, 50]
    assert np.sum(data) == 150

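def test_variance_positive():
    # Build the input inside the test; an undeclared `arr` parameter would be
    # treated by pytest as a missing fixture and the test would error out.
    arr = [10, 20, 30, 40, 50]
    var = np.var(arr)
    assert var >= 0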

def test_correlation():
    x = np.array([1.0, 2.0, 3.0])
    y = np.array([2.0, 4.0, 6.0])
    corr = np.corrcoef(x, y)[0, 1]
    assert corr == pytest.approx(1.0)


def test_append_result():
    results = []
    results.append(42)
    assert 42 in results
    assert len(results) == 1

def test_check_results():
    # Create the state this test needs instead of relying on another test.
    results = [42]
    assert len(results) == 1

The function `test_all_statistics` checks the mean, median, standard deviation, min, max, and sum in a single test. If the first assertion fails, the test stops there, so the remaining statistics are never checked and the failure report does not show which of them are broken. The fix: split it into separate unit tests, each verifying one specific behavior.

Pytest will not collect `verify_variance_positive` because its name does not start with `test_`. The fix: rename the function to `test_variance_positive`.

In `test_correlation`, the code uses `assert corr == 1.0` to check for perfect correlation. Floating-point arithmetic can introduce small rounding errors, so exact equality checks are unreliable. The fix: use a tolerance-based comparison, such as `abs(corr - 1.0) < 1e-6` or `corr == pytest.approx(1.0)`.
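A small sketch of why exact equality is fragile for floats, using only the standard library's `math.isclose` alongside NumPy. The variable names are illustrative, and the tolerances are arbitrary choices for this example rather than recommended values.

```python
import math

import numpy as np

# Classic example: 0.1 + 0.2 is not exactly representable as 0.3.
exact_sum_matches = (0.1 + 0.2 == 0.3)            # False
close_sum_matches = math.isclose(0.1 + 0.2, 0.3)  # True

# Same idea for the correlation check: compare within a tolerance
# instead of asserting exact equality with 1.0.
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
corr = np.corrcoef(x, y)[0, 1]
corr_is_one = math.isclose(corr, 1.0, rel_tol=1e-9)
```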

`test_append_result` and `test_check_results` rely on a global list named `results` to store intermediate results. This creates a hidden dependency between the tests, making them order-dependent and harder to maintain. The fix: each test should create its own state (for example, inside the test body or through a pytest fixture) rather than relying on shared global variables.

## Problem 5: The Flaky Test

Your colleague wrote the following test for a bootstrap confidence interval function:

In [None]:
import numpy as np

def bootstrap_ci(data, confidence=0.95, n_bootstrap=1000):
    """Compute bootstrap confidence interval for the mean."""
    means = []
    n = len(data)
    for _ in range(n_bootstrap):
        sample = np.random.choice(data, size=n, replace=True)
        means.append(np.mean(sample))

    alpha = 1 - confidence
    lower = np.percentile(means, 100 * alpha / 2)
    upper = np.percentile(means, 100 * (1 - alpha / 2))
    return lower, upper

def test_bootstrap_ci_contains_true_mean():
    data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
    true_mean = 5.5
    np.random.seed(42)
    lower, upper = bootstrap_ci(data, confidence=0.95, n_bootstrap=1000)
    assert lower <= upper
    assert lower >= np.min(data)
    assert upper <= np.max(data)
    assert lower < true_mean < upper

**(a)** The test passes most of the time but occasionally fails. Explain why this test is "flaky" (non-deterministic).

The assertion `lower < true_mean < upper` requires the interval to contain the mean on every run. Because `bootstrap_ci` draws its resamples from NumPy's global random state, an unseeded run produces different interval endpoints each time. Occasionally the resamples are skewed enough (for example, weighted toward 9s and 10s) that the interval no longer covers 5.5, and the test fails.

**(b)** Your colleague argues: "The test is correct because a 95% confidence interval should contain the true mean 95% of the time, so occasional failures are expected." Is this a good argument for keeping the test as-is? Why or why not?

This is a weak argument because it confuses statistical validity with test reliability. When the test fails, we cannot tell whether the code is broken or the run was simply unlucky (statistical noise). The purpose of the test is to verify that the Python implementation (the loop, the random choice, the percentile calculation) is correct, and we can do that without relying on uncontrolled randomness.

**(c)** Rewrite the test to be deterministic and reliable while still meaningfully testing the `bootstrap_ci` function. Your solution should: ensure reproducible results and verify that the confidence interval has reasonable properties.

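Set a fixed random seed at the start of `test_bootstrap_ci_contains_true_mean()` (for example, `np.random.seed(42)` before calling `bootstrap_ci`) so that every run draws the same resamples. Then assert properties that must hold for that run: `lower <= upper`, and both bounds lie within the range of the data.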
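One possible shape for the rewritten test is sketched below. It repeats `bootstrap_ci` from the cell above so that it runs on its own; the test names and the constant-data check are additions made for illustration, not part of the original assignment.

```python
import numpy as np


def bootstrap_ci(data, confidence=0.95, n_bootstrap=1000):
    """Compute bootstrap confidence interval for the mean (copied from above)."""
    means = []
    n = len(data)
    for _ in range(n_bootstrap):
        sample = np.random.choice(data, size=n, replace=True)
        means.append(np.mean(sample))

    alpha = 1 - confidence
    lower = np.percentile(means, 100 * alpha / 2)
    upper = np.percentile(means, 100 * (1 - alpha / 2))
    return lower, upper


def test_bootstrap_ci_is_reproducible_and_sane():
    data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

    # Same seed, same resamples, same interval.
    np.random.seed(42)
    first = bootstrap_ci(data, confidence=0.95, n_bootstrap=1000)
    np.random.seed(42)
    second = bootstrap_ci(data, confidence=0.95, n_bootstrap=1000)
    assert first == second

    # Structural properties of any valid interval for this data.
    lower, upper = first
    assert np.min(data) <= lower <= upper <= np.max(data)
    assert lower < np.mean(data) < upper


def test_bootstrap_ci_constant_data():
    # Every resample of constant data has the same mean, so the
    # interval collapses to a single point regardless of the seed.
    np.random.seed(0)
    lower, upper = bootstrap_ci(np.array([3.0] * 5), n_bootstrap=200)
    assert lower == upper == 3.0
```

The constant-data case is useful because its expected answer is known exactly, which checks the percentile logic without depending on the particular random draws.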

**(d)** Propose an alternative testing strategy that could verify the 95% coverage property without making the test flaky. You don't need to implement it, but describe the approach.

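Instead of checking whether a single interval captures the mean (which fails about 5% of the time by design), run many trials, each on a fresh dataset drawn from a distribution with a known mean, and check that the interval captures that mean in roughly 95% of the trials, within a tolerance. Seed the simulation so that it is reproducible.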
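The approach could be sketched roughly as follows. This is not the original `bootstrap_ci`: `bootstrap_ci_rng` is a hypothetical variant that takes an explicit `numpy.random.Generator` and vectorizes the resampling for speed, and `estimate_coverage`, the normal data model, the trial counts, and the tolerance band are all assumptions chosen for illustration.

```python
import numpy as np


def bootstrap_ci_rng(data, rng, confidence=0.95, n_bootstrap=500):
    """Hypothetical variant of bootstrap_ci that uses an explicit Generator."""
    n = len(data)
    # One row per bootstrap replicate, drawn with replacement.
    samples = rng.choice(data, size=(n_bootstrap, n), replace=True)
    means = samples.mean(axis=1)
    alpha = 1 - confidence
    lower = np.percentile(means, 100 * alpha / 2)
    upper = np.percentile(means, 100 * (1 - alpha / 2))
    return lower, upper


def estimate_coverage(n_trials=300, n=50, true_mean=5.0, seed=0):
    """Fraction of trials whose interval contains the known true mean."""
    rng = np.random.default_rng(seed)  # fixed seed makes the simulation repeatable
    hits = 0
    for _ in range(n_trials):
        data = rng.normal(loc=true_mean, scale=2.0, size=n)
        lower, upper = bootstrap_ci_rng(data, rng)
        hits += int(lower <= true_mean <= upper)
    return hits / n_trials


def test_bootstrap_ci_coverage():
    coverage = estimate_coverage()
    # Loose band around the nominal 95%: percentile bootstrap intervals
    # tend to under-cover slightly at moderate sample sizes.
    assert 0.85 <= coverage <= 1.0
```

The band is deliberately wide. With a few hundred trials the estimated coverage still carries sampling error of a percentage point or two, so a tight check such as `coverage == 0.95` would simply be brittle in a different way.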