FISHING

[@sparshsah](https://github.com/sparshsah)

# Context

We have a population afflicted with some medical condition.
We want to investigate whether a drug $D$ improves survivorship.
We want to run the investigation at the $\alpha = 0.05$ significance level,
i.e. if the drug does not improve survival probability,
we want only a 5\% chance of a false positive.

We _know_ that without the drug, survival probability is exactly $0.30$, i.e.
letting $D_s$ be a boolean indicator of whether patient $s$ is administered the drug
and $y_s$ be a boolean indicator of whether patient $s$ survives,

$$
    \Pr[y_s = 1 \mid D_s = 0] = 0.30
.$$

Although experimental design is a perfectly reasonable thing to scrutinize in real life,
that is not the point here, so in the following,
assume that all experimental-design considerations are airtight:
* Each investigation is double-blind,
* The participants are truly randomly selected,
* The control group, which was administered a placebo, achieved exactly
    $\hat{\Pr}[y_s = 1 \mid D_s = 0] = 0.30$,
* The Normal approximation to the Binomial is good enough (so we can use the z-score as our test statistic, with a one-tailed critical value of $1.65$),
* Etc.
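
As a quick sanity check on the $1.65$ figure, the critical value can be recovered from the standard Normal quantile function. This is a minimal sketch using the standard library's `statistics.NormalDist`; the variable names are illustrative and not part of the notebook.

```python
from statistics import NormalDist

alpha = 0.05
# One-tailed test: all of alpha sits in the upper tail.
one_tailed_critical_z = NormalDist().inv_cdf(1 - alpha)
# Two-tailed test: alpha is split evenly across both tails.
two_tailed_critical_z = NormalDist().inv_cdf(1 - alpha / 2)

print(f"one-tailed: {one_tailed_critical_z:.3f}")
print(f"two-tailed: {two_tailed_critical_z:.3f}")
```

The one-tailed value is about $1.645$, so rounding up to $1.65$ makes each investigator's test very slightly conservative.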

We will examine several investigators' proposed methods and determine which ones are valid, i.e. which ones keep the probability of a false positive at $\alpha = 0.05$ under the null hypothesis that the drug has no treatment effect:

$$
    H_0:
    \Pr[y_s = 1 \mid D_s = 1] = \Pr[y_s = 1 \mid D_s = 0] = 0.30
.$$
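
Spelled out under the assumptions above, here is a sketch of the one-tailed test each investigator runs. The notation is mine: $n$ is the number of treated patients and $\hat{p}$ is their observed survival fraction. The test statistic is

$$
    z = \frac{\hat{p} - 0.30}{\sqrt{0.30 \cdot 0.70 / n}}
,$$

and the investigator rejects $H_0$ when $z > 1.65$. For example, with $n = 1{,}000$ the standard error is $\sqrt{0.21 / 1000} \approx 0.0145$, so the test rejects when $\hat{p} > 0.30 + 1.65 \times 0.0145 \approx 0.324$.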

To determine validity, we will assume that the null is true and
simulate each investigator's investigation 10,000 times,
then hypothesis-test whether his or her proposed method was valid.
But of course each investigator's investigation will itself involve
statistical inference via hypothesis-testing,
so in each simulation we're going to be running a separate inner test.
In(ference)ception!

Note: I'm going to say "t-test" a lot below,
even though really, because we're assuming convergence of
the sampling distribution of the sample proportion to the Normal
and therefore using a z-score as our test statistic, it's really an asymptotic "large-sample Wald test".
But "large-sample Wald test" sounds unfamiliar while "t-test" is taught in high-school stats classes.

In [31]:
from typing import Final

import numpy as np

In [52]:
def get_bernoulli_stderr(
    p: float = 0.50,
    n: float = 100,
) -> float:
    """Stderr of sample proportion of successes observed among `n` Bernoulli draws,
    assuming that each draw has i.i.d. ground-truth probability of success = `p`.

    Remember:
        * Bernoulli variance is `pq`.
        * When you add `n` independent random variables, their variances add up.
        * When you scale a random variable by `c`, its variance scales by `c^2`.
    So, when calculating the sample mean:
        * You add `n` Bernoulli indicators, so the variance of the sum is npq.
        * You then divide by `n` to get the fraction of successes, so the variance
             of that guy becomes npq/n = pq/n.
        * And the standard error (standard deviation of the sampling distribution
             of the sample mean) is of course just the square root of the variance.
    """
    return np.sqrt(p * (1-p) / n)
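
To see the $\sqrt{pq/n}$ formula at work, one can compare it with the spread of many simulated sample proportions. This is only a rough check: the seed, the number of repeats, and the variable names are arbitrary choices of mine.

```python
import numpy as np

p, n, n_repeats = 0.30, 1_000, 20_000
rng = np.random.default_rng(seed=0)  # arbitrary seed, chosen only for reproducibility

# Each binomial draw counts successes among n Bernoulli(p) trials;
# dividing by n turns the count into a sample proportion.
sample_proportions = rng.binomial(n=n, p=p, size=n_repeats) / n

empirical_stderr = sample_proportions.std()
formula_stderr = np.sqrt(p * (1 - p) / n)
print(f"empirical: {empirical_stderr:.4f}, formula: {formula_stderr:.4f}")
```

The two numbers should agree closely, with the small remaining gap due to simulation noise.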

## Investigation setup

In [75]:
DESIRED_PROB_FALSE_POSITIVE: Final[float] = 0.05
# one-tailed 5% level
CHOSEN_CRITICAL_Z_SCORE: Final[float] = 1.65

GROUND_TRUTH_SURVIVAL_PROB_GIVEN_NO_DRUG: Final[float] = 0.30

NUM_PATIENTS: Final[int] = 1_000


def do_trials_conclude_drug_works(
    y: np.ndarray[np.ndarray[bool]],
    chosen_critical_z_score: float = CHOSEN_CRITICAL_Z_SCORE,
    ground_truth_survival_prob_given_no_drug: float = GROUND_TRUTH_SURVIVAL_PROB_GIVEN_NO_DRUG,
) -> np.ndarray[bool]:
    """Decide whether this investigator's trials yielded a positive result,
    i.e. concluded that the drug improves survivorship.

    Args:
        y: np.ndarray[np.ndarray[bool]], a boolean matrix of dimension
            `nsims X n_patients`, with entry [n,s] indicating whether,
            during simulation number `n`, patient number `s` survived
            after taking the drug.
        chosen_critical_z_score: float = 1.65, the critical value
            that the investigator will compare his or her observed test statistic against.
            In practice, this means that, after a given simulation,
            the investigator will perform a t-test (or really,
            because he or she is using z-score as the test statistic,
            an asymptotic large-sample Wald test)
            by comparing the observed survivorship against the
            control survivorship of 0.30,
            and decide whether the t-stat is larger than 1.65.
            If it is, he or she will conclude that there is statistically-significant
            evidence against the hypothesis that the drug had no effect,
            and thus conclude that it worked.
        ground_truth_survival_prob_given_no_drug: float = 0.30,
            We need something to compare the observed survivorship against.
            If the survivorship amongst people who weren't given the drug
            is 0%, and the survivorship amongst our participants is 5%,
            that's at least better than nothing.
            But if the survivorship amongst people who weren't given the drug
            is 80%, and the survivorship amongst our patients is 75% or even 81%,
            depending on the sample size, that isn't really anything to write home about.

    Returns:
        np.ndarray[bool], a boolean vector of length `nsims`
        indicating whether, during simulation number `n`,
        the investigator concluded that the drug worked.
    """
    observed_survival_prob_given_drug = y.mean(axis=1)
    nsims = y.shape[0]
    n_patients = y.shape[1]
    stderr = get_bernoulli_stderr(
        p=ground_truth_survival_prob_given_no_drug,
        n=n_patients,
    )
    z = (observed_survival_prob_given_drug - ground_truth_survival_prob_given_no_drug) / stderr
    flag = z > chosen_critical_z_score
    assert flag.shape == (nsims,), (flag.shape, y.shape)
    return flag
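
To make the decision rule concrete, here is a small self-contained sketch that applies the same arithmetic to three hand-built "simulations". The survivor counts are made up for illustration, and the code restates the rule inline rather than calling the function above.

```python
import numpy as np

n_patients = 1_000
control_survival_prob = 0.30
critical_z = 1.65

# Three hand-built "simulations" with 300, 323, and 324 survivors respectively.
survivor_counts = [300, 323, 324]
y = np.zeros((len(survivor_counts), n_patients), dtype=bool)
for row, count in enumerate(survivor_counts):
    y[row, :count] = True

# Same rule as above: z-score of each row's survival fraction against the control rate.
stderr = np.sqrt(control_survival_prob * (1 - control_survival_prob) / n_patients)
z = (y.mean(axis=1) - control_survival_prob) / stderr
concludes_drug_works = z > critical_z

for count, z_value, flag in zip(survivor_counts, z, concludes_drug_works):
    print(f"{count} survivors: z = {z_value:.2f}, positive result: {flag}")
```

Only the last row clears the critical value, so with 1,000 patients the investigator needs at least 324 survivors to declare a positive result.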

## Simulation setup

In [62]:
type(np.random.default_rng())

numpy.random._generator.Generator

In [98]:
_NSIMS: Final[int] = 10_000
# Different than "chosen critical z-score" above!
# Here, we want a two-sided test, because we'd like to catch
# an alpha that is _either_ lower or higher than the desired alpha.
# `_DESIRED_PROB_FALSE_POSITIVE` is the significance level of this outer "meta" test.
# It only appears in the printed message below; the test itself uses
# the two-sided critical value that it implies:
_DESIRED_PROB_FALSE_POSITIVE: Final[float] = 0.05
_CRITICAL_Z_SCORE: Final[float] = 1.96


def _get_rng(name: str = "alice") -> np.random.Generator:
    seed = sum(ord(letter) for letter in name)
    return np.random.default_rng(seed=seed)


def is_evidence_against_validity_significant(
    observed_indicator_false_positive: np.ndarray[bool],
    desired_prob_false_positive: float = DESIRED_PROB_FALSE_POSITIVE,
) -> None:
    """Determine whether we have statistically-significant evidence
    to reject the hypothesis that this investigator's proposed method
    respects the alpha = `DESIRED_PROB_FALSE_POSITIVE` significance level.

    Args:
        observed_indicator_false_positive: np.ndarray[bool], a boolean vector
            of length `nsims` indicating whether, in simulation number `n`,
            the investigator's method yielded a positive result
            allowing them to conclude that the drug works.
            Of course, we've deliberately set this up assuming that
            the drug does NOT work, so each `True` in this array is a false positive.
        desired_prob_false_positive: float = 5%,
            In order to determine whether this investigator's proposed method
            is valid, we're going to determine whether the observed frequency
            of false positives was "close enough" to the desired alpha=5% significance level.
            Now we could of course just eyeball it to determine whether it's close enough --
            But can we really determine whether an observed false-positive frequency
            of 5.3% is plausibly "close enough" to 5%? Or is it too far away?
            Well, let's do another meta-t-test to determine whether it's too far away
            to be plausible that it's close enough!

    Returns:
        Nothing. I just print the results. I know, I know. Sue me.
    """
    observed_prob_false_positive = observed_indicator_false_positive.mean()
    nsims = len(observed_indicator_false_positive)
    stderr = get_bernoulli_stderr(
        p=desired_prob_false_positive,
        n=nsims,
    )
    z = (observed_prob_false_positive - desired_prob_false_positive) / stderr
    flag = abs(z) > _CRITICAL_Z_SCORE
    assert isinstance(flag, np.bool_), flag
    msg = (
        "In this investigative method,"
        + f"\nWe observed a {observed_prob_false_positive :.4f} probability of false positives,"
        + f"\nAgainst an expectation of {desired_prob_false_positive :.2f}."
        + f"\nGiven that we ran {nsims :,} simulations,"
        + f"\nThe Bernoulli standard error is {stderr :.4f},"
        + f"\nYielding a z-score of {z :.2f} "
        + f"\nAgainst a critical value of {_CRITICAL_Z_SCORE :.2f}."
        + f"""\nTherefore, we {"DO" if flag else "DON'T"} have statistically-significant evidence"""
        + f"\nAgainst the validity of this investigative method at"
        + f"\nthe alpha = {_DESIRED_PROB_FALSE_POSITIVE :.2f} level."
    )
    print(msg)
    return
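
It can help to see how wide the outer test's tolerance is. The sketch below computes the band of observed false-positive rates that would not be flagged, using the same constants as the cell above; the variable names are illustrative.

```python
import math

nsims = 10_000
desired_prob_false_positive = 0.05
critical_z = 1.96  # two-sided 5% critical value, as in the cell above

# Standard error of the observed false-positive rate if the method is valid.
stderr = math.sqrt(desired_prob_false_positive * (1 - desired_prob_false_positive) / nsims)
lower = desired_prob_false_positive - critical_z * stderr
upper = desired_prob_false_positive + critical_z * stderr

print(f"stderr = {stderr:.4f}")
print(f"not flagged when the observed rate is in [{lower:.4f}, {upper:.4f}]")
```

So with 10,000 simulations, a method is flagged when its observed false-positive rate falls outside roughly 4.6% to 5.4%. Because this outer test is itself run at the 5% level, a perfectly valid method will still be flagged about one time in twenty.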

# Proposed investigative methods

## Alice

Alice proposes administering the drug to 1,000 patients and then t-testing for significance.

In [99]:
rng = _get_rng(name="alice")
y = rng.choice(
    [0, 1],
    p=[1-GROUND_TRUTH_SURVIVAL_PROB_GIVEN_NO_DRUG, GROUND_TRUTH_SURVIVAL_PROB_GIVEN_NO_DRUG],
    size=[_NSIMS, NUM_PATIENTS],
)
observed_indicator_false_positive = do_trials_conclude_drug_works(y=y)
is_evidence_against_validity_significant(
    observed_indicator_false_positive=observed_indicator_false_positive
)

In this investigative method,
We observed a 0.0524 probability of false positives,
Against an expectation of 0.05.
Given that we ran 10,000 simulations,
The Bernoulli standard error is 0.0022,
Yielding a z-score of 1.10 
Against a critical value of 1.96.
Therefore, we DON'T have statistically-significant evidence
Against the validity of this investigative method at
the alpha = 0.05 level.
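
As a cross-check that does not rely on simulation, Alice's false-positive probability can also be computed exactly from the Binomial distribution. This is a sketch under the same assumptions (1,000 patients, true survival probability 0.30, critical z-score 1.65); the helper `binomial_log_pmf` is my own and not part of the notebook.

```python
import math

n, p = 1_000, 0.30
critical_z = 1.65

stderr = math.sqrt(p * (1 - p) / n)
# Smallest survivor count whose z-score strictly exceeds the critical value.
min_survivors_to_reject = next(
    k for k in range(n + 1) if (k / n - p) / stderr > critical_z
)


def binomial_log_pmf(k: int) -> float:
    """Log of the Binomial(n, p) probability of exactly k successes."""
    log_comb = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_comb + k * math.log(p) + (n - k) * math.log(1 - p)


# Probability, under the null, of seeing enough survivors to reject.
exact_false_positive_prob = sum(
    math.exp(binomial_log_pmf(k)) for k in range(min_survivors_to_reject, n + 1)
)
print(min_survivors_to_reject, f"{exact_false_positive_prob:.4f}")
```

Working in log space avoids overflow in the binomial coefficients. The exact probability should land close to the nominal 0.05, with any small gap reflecting the discreteness of the survivor count and the Normal approximation.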
