_FISHING_

[@sparshsah](https://github.com/sparshsah)

CONTEXT:

We have a population afflicted with some medical condition.
We want to investigate whether a drug $D$ improves survivorship.
We want to run the investigation at the $\alpha = 0.05$ significance level,
i.e. if the drug does not improve survival probability,
we want only a 5\% chance of a false positive.

We _know_ that without the drug, survival probability is exactly $0.30$, i.e.
letting $D_s$ be a boolean indicator of whether patient $s$ is administered the drug
and $y_s$ be a boolean indicator of whether patient $s$ survives,

$$
    \Pr[y_s = 1 \mid D_s = 0] = 0.30
.$$

Although experimental design is a perfectly reasonable thing to scrutinize in real life,
that is not the point here, so in the following,
assume that all experimental-design considerations are airtight:
* Each investigation is double-blind,
* The participants are truly randomly selected,
* The control group administered a placebo achieved exactly
    $\hat{\Pr}[y_s = 1 \mid D_s = 0] = 0.30$,
* The Normal approximation to the Binomial is good enough (so we can use a z-score as our test statistic, with $1.65$ as its one-tailed critical value),
* Etc.

We will examine several investigators' proposed methods and determine which ones are valid, i.e. which ones keep the probability of a false positive at $\alpha = 0.05$ under the null hypothesis that the drug has no treatment effect:

$$
    H_0:
    \Pr[y_s = 1 \mid D_s = 1] = \Pr[y_s = 1 \mid D_s = 0] = 0.30
.$$
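
As a worked illustration of the inner test, here is a sketch that assumes a one-sample test against the known baseline of $0.30$, using the standard error implied by the null. The symbols $n$ (number of treated patients) and $\hat{p}$ (their observed survival frequency) are notation introduced only for this example:

$$
    z = \frac{\hat{p} - 0.30}{\sqrt{0.30 \cdot 0.70 / n}}
,\qquad
    \text{reject } H_0 \text{ if } z > 1.65
.$$

For example, with $n = 1{,}000$ the standard error is $\sqrt{0.21 / 1000} \approx 0.0145$, so the drug is declared effective only if $\hat{p} > 0.30 + 1.65 \times 0.0145 \approx 0.324$, i.e. if at least 324 of the 1,000 treated patients survive.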

To determine validity, we will assume that the null is true and
simulate each investigator's investigation 10,000 times,
then hypothesis-test whether his or her proposed method was valid.
But of course each investigator's investigation will itself involve
statistical inference via hypothesis-testing,
so in each simulation we're going to be running a separate inner test.
In(ference)ception!
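
To make the nesting concrete, here is a minimal sketch of the outer loop for the simplest possible investigation: a single look at the data, tested once. All variable names are hypothetical, and the 1,000 patients per investigation is an assumption made for this sketch. It is not any particular investigator's method, just the shape of the simulation.

```python
import numpy as np

# Hypothetical names for this sketch; the values mirror those in the text.
null_survival_prob = 0.30   # survival probability when the drug does nothing
num_patients = 1_000        # patients per simulated investigation (assumed)
num_sims = 10_000           # number of simulated investigations
inner_critical_z = 1.65     # one-sided 5% critical value used by the investigator

rng = np.random.default_rng(seed=0)

# Each entry is the number of survivors in one simulated investigation under the null.
survivors = rng.binomial(n=num_patients, p=null_survival_prob, size=num_sims)
observed_survival_prob = survivors / num_patients

# Inner test: one z-score per simulated investigation, using the null standard error.
inner_stderr = np.sqrt(null_survival_prob * (1 - null_survival_prob) / num_patients)
inner_z = (observed_survival_prob - null_survival_prob) / inner_stderr
is_false_positive = inner_z > inner_critical_z

# Outer check: how often did the inner test reject a true null?
false_positive_rate = is_false_positive.mean()
print(f"False-positive rate over {num_sims:,} simulations: {false_positive_rate:.4f}")
```

The boolean array `is_false_positive` is exactly the kind of input the outer (meta) test needs: one flag per simulated investigation.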

Note: I'm going to say "t-test" a lot below,
even though really, because we're assuming convergence of
the sampling distribution of the sample proportion to the Normal
and therefore using a z-score as our test-statistic, it's like a "Wald test".
But "Wald test" sounds unfamiliar while "t-test" is taught in high-school stats classes.

In [31]:
from typing import Final

import numpy as np

In [36]:
# investigation setup

DESIRED_PROB_FALSE_POSITIVE: Final[float] = 0.05
# one-tailed 5% level
CHOSEN_CRITICAL_Z_SCORE: Final[float] = 1.65

GROUND_TRUTH_SURVIVAL_PROB_GIVEN_NO_DRUG: Final[float] = 0.30

NUM_PATIENTS: Final[int] = 1_000

# def is_false_positive  # stub, not yet implemented

In [43]:
# simulation setup
_NSIMS: Final[int] = 10_000
# Different from the "chosen critical z-score" above!
# Here, we want a two-sided test, because we'd like to catch
# an alpha that is _either_ lower or higher than the desired alpha.
_DESIRED_PROB_FALSE_POSITIVE: Final[float] = 0.05
# `_DESIRED_PROB_FALSE_POSITIVE` appears only in the printed report;
# it is the basis for this critical value:
_CRITICAL_Z_SCORE: Final[float] = 1.96


def is_evidence_against_validity_significant(
    observed_indicator_false_positive: np.ndarray[bool],
    desired_prob_false_positive: float = DESIRED_PROB_FALSE_POSITIVE,
) -> None:
    """Determine whether we have statistically-significant evidence
    to reject the hypothesis that this investigator's proposed method
    respects the alpha = `desired_prob_false_positive` significance level.

    Args:
        observed_indicator_false_positive: np.ndarray[bool], a boolean array
            of length `nsims` indicating whether, in simulation number `n`,
            the investigator's method yielded a positive result
            allowing them to conclude that the drug works.
            Of course, we've deliberately set this up assuming that
            the drug does NOT work, so each `True` in this array is a false positive.
        desired_prob_false_positive: float = 5%,
            In order to determine whether this investigator's proposed method
            is valid, we're going to determine whether the observed frequency
            of false positives was "close enough" to the desired alpha=5% significance level.
            Now we could of course just eyeball it to determine whether it's close enough,
            but can we really determine whether an observed false-positive frequency
            of 5.3% is plausibly "close enough" to 5%? Or is it too far away?
            Well, let's do another meta-t-test to determine whether it's too far away
            to be plausible that it's close enough!

    Returns:
        Nothing. I just print the results. I know, I know. Sue me.
    """
    observed_prob_false_positive = observed_indicator_false_positive.mean()
    nsims = len(observed_indicator_false_positive)
    bernoulli_stderr = (
        np.sqrt(desired_prob_false_positive * (1-desired_prob_false_positive))
        / np.sqrt(nsims)
    )
    z = (observed_prob_false_positive - desired_prob_false_positive) / bernoulli_stderr
    flag = abs(z) > _CRITICAL_Z_SCORE
    msg = (
        "In this investigative method,"
        + f" we observed a {observed_prob_false_positive:.4f} probability of false positives,"
        + f" against an expectation of {desired_prob_false_positive:.4f}."
        + f" Given that we ran {nsims:,} simulations,"
        + f" the Bernoulli standard error is {bernoulli_stderr:.4f},"
        + f" yielding a z-score of {z:.2f}"
        + f" against a critical value of {_CRITICAL_Z_SCORE:.2f}."
        + f""" Therefore, we {"do" if flag else "don't"} have statistically-significant evidence"""
        + " against the validity of this investigative method at"
        + f" the alpha = {_DESIRED_PROB_FALSE_POSITIVE:.2f} level."
    )
    print(msg)
    return
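
The docstring's question about 5.3% can be checked by hand. Here is a minimal sketch, assuming 10,000 simulations and the two-sided critical value of 1.96; the variable names are illustrative and not part of the notebook's code:

```python
import math

alpha = 0.05          # desired false-positive probability
nsims = 10_000        # number of simulated investigations
critical_z = 1.96     # two-sided 5% critical value for the meta-test

# Standard error of the observed false-positive frequency if alpha is truly 5%.
stderr = math.sqrt(alpha * (1 - alpha) / nsims)

# Observed frequencies inside this band are "close enough" to 5%.
lower = alpha - critical_z * stderr
upper = alpha + critical_z * stderr

# The docstring's example: an observed false-positive frequency of 5.3%.
z_for_observed = (0.053 - alpha) / stderr

print(f"stderr = {stderr:.4f}")
print(f"band   = [{lower:.4f}, {upper:.4f}]")
print(f"z      = {z_for_observed:.2f}")
```

Under these assumptions the band runs from roughly 4.57% to 5.43%, so an observed 5.3% would not count as statistically significant evidence against validity.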

# Alice

Alice proposes administering the drug to 1,000 patients and then t-testing for significance.

In [19]:
# gender of each patient
rng = np.random.default_rng(seed=42)
g = rng.choice(2, size=NUM_PATIENTS)