FISHING

[@sparshsah](https://github.com/sparshsah)

Any errors my own.

# Context

We have a population afflicted with some medical condition.
We want to investigate whether a drug $D$ improves survivorship.
We want to run the investigation at the $\alpha = 0.05$ significance level,
i.e. if the drug does not improve survival probability,
we want only a 5\% chance of a false positive.

We _know_ that without the drug, survival probability is exactly $0.30$, i.e.
letting $D_s$ be a boolean indicator of whether patient $s$ is administered the drug
and $y_s$ be a boolean indicator of whether patient $s$ survives,

$$
    \Pr[y_s = 1 \mid D_s = 0] = 0.30
.$$

Although experimental design is a perfectly reasonable thing to scrutinize in real life,
that is not the point here, so in the following,
assume that all experimental-design considerations are airtight:
* Each investigation is double-blind,
* The participants are truly randomly selected,
* The control group administered a placebo achieved exactly
    $\hat{\Pr}[y_s = 1 \mid D_s = 0] = 0.30$,
* The Normal approximation to the Binomial is good enough (so we can use a z-score as our test statistic, with $1.65$ as its one-tailed critical value),
* Etc.
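
As a rough sanity check on the Normal-approximation assumption in the list above, the sketch below computes the exact Binomial probability that a single test comes up positive under the null. It assumes the 1,000-patient sample size used later in this notebook; the helper name `exact_false_positive_rate` is hypothetical and not part of the original code.

```python
import math


def exact_false_positive_rate(
    n: int = 1_000,        # assumed sample size (the notebook's later NUM_PATIENTS)
    p0: float = 0.30,      # survival probability under the null
    critical_z: float = 1.65,
) -> float:
    """Exact Pr[z > critical_z] when the survivor count is Binomial(n, p0)."""
    stderr = math.sqrt(p0 * (1 - p0) / n)
    log_p, log_q = math.log(p0), math.log(1 - p0)
    total = 0.0
    for k in range(n + 1):
        z = (k / n - p0) / stderr
        if z > critical_z:
            # log of the Binomial pmf, via log-gamma to avoid overflow
            log_pmf = (
                math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
                + k * log_p + (n - k) * log_q
            )
            total += math.exp(log_pmf)
    return total


rate = exact_false_positive_rate()
print(f"exact false-positive rate: {rate:.4f}")
```

Because the survivor count is discrete and $1.65$ is a rounded critical value, the exact rate need not equal 5\% on the nose, only land close to it.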

We will examine several investigators' proposed methods, and determine which ones are invalid, i.e. which ones do not respect the $\alpha = 0.05$ probability of a false positive under the null hypothesis that the drug has no treatment effect:

$$
    H_0:
    \Pr[y_s = 1 \mid D_s = 1] = \Pr[y_s = 1 \mid D_s = 0] = 0.30
.$$

To determine invalidity (or lack of evidence thereof), we will assume that the null is true and
simulate each investigator's investigation 10,000 times,
then hypothesis-test whether his or her proposed method yielded a false-positive rate way different from the desired 5\%. So each investigator will run inner t-tests to determine whether the drug works, and then we will, independently of that, run our own meta-t-tests to determine whether his or her method was invalid. In(feren)ception!

Note: I'm going to say "t-test" a lot below,
even though really, because we're assuming convergence of
the sampling distribution of the sample proportion to the Normal
and therefore using a z-score as our test statistic, it's really an asymptotic "large-sample Wald test".
But "large-sample Wald test" sounds unfamiliar while "t-test" is taught in high-school stats classes.

In [209]:
# very proud of how minimal this dependency is...
from typing import Final
import numpy as np

In [52]:
def get_bernoulli_stderr(
    p: float = 0.50,
    n: float = 100,
) -> float:
    """Stderr of sample proportion of successes observed among `n` Bernoulli draws,
    assuming that each draw has i.i.d. ground-truth probability of success = `p`.

    Remember:
        * Bernoulli variance is `pq`.
        * When you add `n` independent random variables, their variances add up.
        * When you scale a random variable by `c`, its variance scales by `c^2`.
    So, when calculating the sample mean:
        * You add `n` Bernoulli indicators, so the variance of the sum is npq.
        * You then divide by `n` to get the fraction of successes, so the variance
             of that guy becomes npq/n = pq/n.
        * And the standard error (standard deviation of the sampling distribution
             of the sample mean) is of course just the square root of the variance.
    """
    return np.sqrt(p * (1-p) / n)


def _get_rng(name: str = "alice") -> np.random.Generator:
    seed = sum(ord(letter) for letter in name)
    return np.random.default_rng(seed=seed)
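
The variance argument in the docstring of `get_bernoulli_stderr` can be checked exactly for a small case. This is a minimal sketch using only the standard library; `exact_stderr_of_sample_proportion` is a hypothetical helper, and `n = 20`, `p = 0.30` are arbitrary illustrative values.

```python
import math


def exact_stderr_of_sample_proportion(p: float, n: int) -> float:
    """Standard deviation of k/n, where k ~ Binomial(n, p), by direct enumeration."""
    pmf = [math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    mean = sum(prob * k / n for k, prob in enumerate(pmf))
    var = sum(prob * (k / n - mean) ** 2 for k, prob in enumerate(pmf))
    return math.sqrt(var)


p, n = 0.30, 20
enumerated = exact_stderr_of_sample_proportion(p, n)
closed_form = math.sqrt(p * (1 - p) / n)  # the formula used by get_bernoulli_stderr
print(enumerated, closed_form)
```

The two numbers should agree up to floating-point rounding.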

## Investigation setup

In [75]:
DESIRED_PROB_FALSE_POSITIVE: Final[float] = 0.05
# one-tailed 5% level
CHOSEN_CRITICAL_Z_SCORE: Final[float] = 1.65

GROUND_TRUTH_SURVIVAL_PROB_GIVEN_NO_DRUG: Final[float] = 0.30

NUM_PATIENTS: Final[int] = 1_000


def do_trials_conclude_drug_works(
    y: np.ndarray[np.ndarray[bool]],
    chosen_critical_z_score: float = CHOSEN_CRITICAL_Z_SCORE,
    ground_truth_survival_prob_given_no_drug: float = GROUND_TRUTH_SURVIVAL_PROB_GIVEN_NO_DRUG,
) -> np.ndarray[bool]:
    """Decide whether this investigator's trials yielded a positive result,
    i.e. concluded that the drug improves survivorship.

    Args:
        y: np.ndarray[np.ndarray[bool]], a boolean matrix of dimension
            `nsims X n_patients`, with entry [n,s] indicating whether,
            during simulation number `n`, patient number `s` survived
            after taking the drug.
        chosen_critical_z_score: float = 1.65, the critical value
            that the investigator will compare his or her observed test statistic against.
            In practice, this means that, after a given simulation,
            the investigator will perform a t-test (or really,
            because he or she is using z-score as the test statistic,
            an asymptotic large-sample Wald test)
            by comparing the observed survivorship against the
            control survivorship of 0.30,
            and decide whether the t-stat is larger than 1.65.
            If it is, he or she will conclude that there is statistically-significant
            evidence against the hypothesis that the drug had no effect,
            and thus conclude that it worked.
        ground_truth_survival_prob_given_no_drug: float = 0.30,
            We need something to compare the observed survivorship against.
            If the survivorship amongst people who weren't given the drug
            is 0%, and the survivorship amongst our participants is 5%,
            that's at least better than nothing.
            But if the survivorship amongst people who weren't given the drug
            is 80%, and the survivorship amongst our patients is 75% or even 81%,
            depending on the sample size, that isn't really anything to write home about.

    Returns:
        np.ndarray[bool], a boolean vector of length `nsims`
        indicating whether, during simulation number `n`,
        the investigator concluded that the drug worked.
    """
    observed_survival_prob_given_drug = y.mean(axis=1)
    nsims = y.shape[0]
    n_patients = y.shape[1]
    stderr = get_bernoulli_stderr(
        p=ground_truth_survival_prob_given_no_drug,
        n=n_patients,
    )
    z = (observed_survival_prob_given_drug - ground_truth_survival_prob_given_no_drug) / stderr
    flag = z > chosen_critical_z_score
    assert flag.shape == (nsims,), (flag.shape, y.shape)
    return flag
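
To make the decision rule concrete, here is a hand-sized version of the same calculation for a single trial, using only the standard library. The survivor counts 323 and 324 are illustrative inputs chosen to straddle the threshold, not results from the notebook.

```python
import math

p0 = 0.30          # control survival probability
n_patients = 1_000
critical_z = 1.65  # one-tailed 5% level, as above

stderr = math.sqrt(p0 * (1 - p0) / n_patients)


def z_score(num_survivors: int) -> float:
    """z-score of the observed survival fraction against the null value p0."""
    return (num_survivors / n_patients - p0) / stderr


for num_survivors in (323, 324):
    z = z_score(num_survivors)
    print(num_survivors, round(z, 3), z > critical_z)
```

With 1,000 patients, 324 survivors is the smallest count that clears the critical value.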

## Simulation setup

In [192]:
_NSIMS: Final[int] = 10_000
# Different than "chosen critical z-score" above!
# Here, we want a two-sided test, because we'd like to catch
# an alpha that is _either_ lower or higher than the desired alpha.
# It's set really low because I want to be _really_, _really_
# sure that a proposed method is invalid.
# _DESIRED_PROB_FALSE_POSITIVE: Final[float] = 0.0001
# We never actually use `_DESIRED_PROB_FALSE_POSITIVE`,
# it's just the basis for this critical value:
# https://www.criticalvaluecalculator.com/
_CRITICAL_Z_SCORE: Final[float] = 3.89


def is_evidence_against_validity_significant(
    observed_indicator_false_positive: np.ndarray[bool],
    desired_prob_false_positive: float = DESIRED_PROB_FALSE_POSITIVE,
) -> None:
    """Determine whether we have statistically-significant evidence
    to reject the hypothesis that this investigator's proposed method
    respects the alpha = `DESIRED_PROB_FALSE_POSITIVE` significance level.

    Args:
        observed_indicator_false_positive: np.ndarray[bool], a boolean vector
            of length `nsims` indicating whether, in simulation number `n`,
            the investigator's method yielded a positive result
            allowing them to conclude that the drug works.
            Of course, we've deliberately set this up assuming that
            the drug does NOT work, so each `True` in this array is a false positive.
        desired_prob_false_positive: float = 5%,
            In order to determine whether this investigator's proposed method
            is valid, we're going to determine whether the observed frequency
            of false positives was "close enough" to the desired alpha=5% significance level.
            Now we could of course just eyeball it to determine whether it's close enough --
            But can we really determine whether an observed false-positive frequency
            of 5.3% is plausibly "close enough" to 5%? Or is it too far away?
            Well, let's do another meta-t-test to determine whether it's too far away
            to be plausible that it's close enough!

    Returns:
        Nothing. I just print the results. I know, I know. Sue me.
    """
    observed_prob_false_positive = observed_indicator_false_positive.mean()
    nsims = len(observed_indicator_false_positive)
    stderr = get_bernoulli_stderr(
        p=desired_prob_false_positive,
        n=nsims,
    )
    z = (observed_prob_false_positive - desired_prob_false_positive) / stderr
    flag = abs(z) > _CRITICAL_Z_SCORE
    assert type(flag) == np.bool_, flag
    msg = (
        "In this investigative method,"
        + f"\nWe observed a {observed_prob_false_positive :.4f} probability of false positives,"
        + f"\nAgainst an expectation of {desired_prob_false_positive :.2f}."
        + f"\nGiven that we ran {nsims :,} simulations,"
        + f"\nThe Bernoulli standard error is {stderr :.4f},"
        + f"\nYielding a z-score of {z :.2f} "
        + f"\nAgainst a critical value of {_CRITICAL_Z_SCORE :.2f}."
        + f"""\nTherefore, we {"DO" if flag else "DON'T"} have statistically-significant evidence"""
        + f"\nAgainst the validity of this investigative method at"
        + f"\nthe alpha = {desired_prob_false_positive :.2f} level."
    )
    print(msg)
    return
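
The two critical values used in this notebook, and the meta-test's z-score, can be reproduced with the standard library's `statistics.NormalDist`. This is a sketch for illustration; the 5.24\% observed rate plugged in below is just an example input.

```python
import math
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

# Investigators' one-tailed test at alpha = 0.05 (the notebook rounds this to 1.65).
one_tailed_critical_z = std_normal.inv_cdf(1 - 0.05)

# Our two-tailed meta-test at alpha = 0.0001 (the notebook rounds this to 3.89).
two_tailed_critical_z = std_normal.inv_cdf(1 - 0.0001 / 2)

# Meta-test z-score for an example observed false-positive rate.
observed, desired, nsims = 0.0524, 0.05, 10_000
meta_stderr = math.sqrt(desired * (1 - desired) / nsims)
meta_z = (observed - desired) / meta_stderr

print(round(one_tailed_critical_z, 3), round(two_tailed_critical_z, 2), round(meta_z, 2))
```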

# Proposed investigative methods

## The Alices

Each of Allie and Alicia independently proposes administering the drug to 1,000 patients and then t-testing for significance.

In [193]:
rng = _get_rng(name="alice")

In [194]:
y_allie = rng.choice(
    [0, 1],
    p=[1-GROUND_TRUTH_SURVIVAL_PROB_GIVEN_NO_DRUG, GROUND_TRUTH_SURVIVAL_PROB_GIVEN_NO_DRUG],
    size=[_NSIMS, NUM_PATIENTS],
)
observed_indicator_false_positive = do_trials_conclude_drug_works(y=y_allie)
is_evidence_against_validity_significant(
    observed_indicator_false_positive=observed_indicator_false_positive
)

In this investigative method,
We observed a 0.0524 probability of false positives,
Against an expectation of 0.05.
Given that we ran 10,000 simulations,
The Bernoulli standard error is 0.0022,
Yielding a z-score of 1.10 
Against a critical value of 3.89.
Therefore, we DON'T have statistically-significant evidence
Against the validity of this investigative method at
the alpha = 0.05 level.


In [195]:
y_alicia = rng.choice(
    [0, 1],
    p=[1-GROUND_TRUTH_SURVIVAL_PROB_GIVEN_NO_DRUG, GROUND_TRUTH_SURVIVAL_PROB_GIVEN_NO_DRUG],
    size=[_NSIMS, NUM_PATIENTS],
)
observed_indicator_false_positive = do_trials_conclude_drug_works(y=y_alicia)
is_evidence_against_validity_significant(
    observed_indicator_false_positive=observed_indicator_false_positive
)

In this investigative method,
We observed a 0.0551 probability of false positives,
Against an expectation of 0.05.
Given that we ran 10,000 simulations,
The Bernoulli standard error is 0.0022,
Yielding a z-score of 2.34 
Against a critical value of 3.89.
Therefore, we DON'T have statistically-significant evidence
Against the validity of this investigative method at
the alpha = 0.05 level.


Good. This is the archetype of a valid experiment.

## The Bobs

The Bobs (Bobby and Robert) observe that Allie's results were null.
However, they discover that the observed survival probabilities differed between men and women.
Gender can obviously be a medically important factor in treatment effectiveness.
So, they split Allie's data up along that dimension to analyze separately.

In [197]:
# filter to just the simulations where there was a null result for the general populace
observed_indicator_false_positive_allie = do_trials_conclude_drug_works(y=y_allie)
is_null_result = ~observed_indicator_false_positive_allie
y_allie_where_null = y_allie[is_null_result, :]
# free
del observed_indicator_false_positive_allie, is_null_result

In [198]:
# we conditioned on null results -- so check that we did this correctly,
# i.e. that we have ZERO false positives
observed_indicator_false_positive = do_trials_conclude_drug_works(y=y_allie_where_null)
assert not observed_indicator_false_positive.any(), observed_indicator_false_positive.mean()

In [199]:
# Note the gender of each participant, evenly split between men and women.
# We'll just re-use this across the trials, it's not a random outcome anyway,
# it's just an irrelevant predictor.
rng = _get_rng(name="bob")
g = rng.choice([False, True], size=NUM_PATIENTS)
y_allie_where_null_among_men = y_allie_where_null[:, ~g]
y_allie_where_null_among_women = y_allie_where_null[:, g]
# free
del g

In [200]:
# we conditioned on null results -- so check that we did this correctly,
# i.e. that we have ZERO false positives
y_allie_where_null_among_men_and_women = np.concatenate(
    [y_allie_where_null_among_men, y_allie_where_null_among_women],
    axis=1,
)
assert y_allie_where_null_among_men_and_women.shape == y_allie_where_null.shape
observed_indicator_false_positive = do_trials_conclude_drug_works(
    y=y_allie_where_null_among_men_and_women
)
assert not observed_indicator_false_positive.any(), observed_indicator_false_positive.mean()
# free
del y_allie_where_null_among_men_and_women

### Bobby

Bobby wants to know whether the drug can at least help one or the other gender.

In [201]:
observed_indicator_false_positive_men = do_trials_conclude_drug_works(
    y=y_allie_where_null_among_men
)
observed_indicator_false_positive_women = do_trials_conclude_drug_works(
    y=y_allie_where_null_among_women
)
observed_indicator_false_positive = (
    observed_indicator_false_positive_men
    | observed_indicator_false_positive_women
)
is_evidence_against_validity_significant(
    observed_indicator_false_positive=observed_indicator_false_positive
)
# free
del observed_indicator_false_positive_men, observed_indicator_false_positive_women

In this investigative method,
We observed a 0.0700 probability of false positives,
Against an expectation of 0.05.
Given that we ran 9,476 simulations,
The Bernoulli standard error is 0.0022,
Yielding a z-score of 8.92 
Against a critical value of 3.89.
Therefore, we DO have statistically-significant evidence
Against the validity of this investigative method at
the alpha = 0.05 level.


Bobby's method is invalid.
He suffers from the multiple-comparisons problem.
This is a classic fishing expedition.

### Robert

Robert is more mature than Bobby and knows not to make his
multiple-comparisons mistake.
Robert instead wants to know whether the drug can help women in particular.

In [202]:
observed_indicator_false_positive = do_trials_conclude_drug_works(
    y=y_allie_where_null_among_women
)
is_evidence_against_validity_significant(
    observed_indicator_false_positive=observed_indicator_false_positive
)

In this investigative method,
We observed a 0.0339 probability of false positives,
Against an expectation of 0.05.
Given that we ran 9,476 simulations,
The Bernoulli standard error is 0.0022,
Yielding a z-score of -7.20 
Against a critical value of 3.89.
Therefore, we DO have statistically-significant evidence
Against the validity of this investigative method at
the alpha = 0.05 level.


Robert's method is also invalid.
He makes the opposite mistake from Bobby:
His method is too conservative, and results in too few false positives.
We know based on the experimental setup that this implies it will have lower power
than he's expecting.
This makes sense: Given that I already _told_ you that the overall results were null,
it becomes slightly less likely that the results will be positive for some specific subset.
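
A back-of-the-envelope calculation, under simplifying assumptions that the notebook does not make exactly, points the same way. Assume the men's and women's z-scores $Z_m$ and $Z_w$ are independent standard Normals and the split is exactly even, so the overall z-score is $(Z_m + Z_w)/\sqrt{2}$. The sketch below (hypothetical helper names, standard library only) numerically integrates the probability that the women's test is positive given that the overall test was null.

```python
import math


def normal_pdf(x: float) -> float:
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)


def normal_cdf(x: float) -> float:
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))


def subgroup_rate_given_overall_null(c: float = 1.65, steps: int = 20_000) -> float:
    """Pr[Z_w > c | (Z_m + Z_w)/sqrt(2) <= c], by midpoint integration over Z_w."""
    upper = 10.0  # the Normal density is negligible beyond this point
    h = (upper - c) / steps
    joint = 0.0
    for i in range(steps):
        x = c + (i + 0.5) * h
        # Given Z_w = x, the overall test is null iff Z_m <= c*sqrt(2) - x.
        joint += normal_pdf(x) * normal_cdf(c * math.sqrt(2) - x) * h
    return joint / normal_cdf(c)


robert_rate = subgroup_rate_given_overall_null()
# Both subgroup tests cannot be positive while the overall test is null
# (that would force the overall z-score above c*sqrt(2) > c), so the two
# events are disjoint and Bobby's "either gender" rate is just the sum.
bobby_rate = 2 * robert_rate
print(f"Robert: {robert_rate:.4f}, Bobby: {bobby_rate:.4f}")
```

Under these idealized assumptions, Robert's rate comes out below 5\% and Bobby's above it, matching the direction of the simulated results. The exact figures differ somewhat because the simulation uses discrete counts and a random (not exactly even) split.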

In [None]:
del y_allie_where_null_among_men, y_allie_where_null_among_women

## The Charlies

The Charlies (Chuck and Charles) strongly believe that the drug works.

In [206]:
rng = _get_rng(name="charlie")

### Chuck

Chuck strongly believes that the drug works.
He decides that if Allie's results come up positive, he'll accept them,
but if they come up null, he'll run his own experiment to see for himself whether it works.

In [207]:
observed_indicator_false_positive_allie = do_trials_conclude_drug_works(y=y_allie)
y_chuck = rng.choice(
    [0, 1],
    p=[1-GROUND_TRUTH_SURVIVAL_PROB_GIVEN_NO_DRUG, GROUND_TRUTH_SURVIVAL_PROB_GIVEN_NO_DRUG],
    # Ok in theory, we want to re-run sims for _only_ the sims where Allie's results were null.
    # The logic goes like this:
    #     For sim `n`: if Allie's result was positive, stop;
    #         Otherwise, re-run the sim and see if it's positive.
    # So there's conditional logic here:
    #     If Allie's results were positive, accept them,
    #     Otherwise, ignore them and run your own sim and take the results of that.
    # But, it's much faster to do this by just generating an unconditional fresh sim vector,
    # and doing an elementwise-or.
    # The results for each `n` will be the same:
    #     If Allie's result was positive, the or will be True;
    #     If Allie's result was negative but Chuck's result was positive, the or will be True;
    #     Finally if Allie's result was negative and then so was Chuck's, the or will be False.
    size=[_NSIMS, NUM_PATIENTS],
)
observed_indicator_false_positive_chuck = do_trials_conclude_drug_works(y=y_chuck)
observed_indicator_false_positive = (
    observed_indicator_false_positive_allie
    | observed_indicator_false_positive_chuck
)
is_evidence_against_validity_significant(
    observed_indicator_false_positive=observed_indicator_false_positive
)
# free
del observed_indicator_false_positive_allie, y_chuck, observed_indicator_false_positive_chuck

In this investigative method,
We observed a 0.1021 probability of false positives,
Against an expectation of 0.05.
Given that we ran 10,000 simulations,
The Bernoulli standard error is 0.0022,
Yielding a z-score of 23.91 
Against a critical value of 3.89.
Therefore, we DO have statistically-significant evidence
Against the validity of this investigative method at
the alpha = 0.05 level.


This is what I call the Donald-Trump approach:
If the polls report that I'm up, then clearly I'm leading;
But if the polls report that I'm down, then clearly they're rigged.
Under our experimental conditions, this is an invalid approach.
(Of course, under different conditions it could be valid:
Suppose you know for a fact that you're leading, then this logic is indeed right.)
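
The size of the inflation in Chuck's method is roughly what a two-independent-tries calculation predicts. As a sketch, assuming each experiment on its own has false-positive probability exactly $\alpha = 0.05$ and that the two experiments are independent:

$$
    \Pr[\text{Allie positive or Chuck positive}]
    = 1 - (1 - \alpha)^2
    = 1 - 0.95^2
    = 0.0975
.$$

This is in the neighborhood of the simulated rate above; the small gap is plausible given that each individual test's simulated rate sits a little above 5\%.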

### Charles

Like the Bobs, Charles observes Allie's null results,
but unlike the Bobs, he knows that picking through them is not a valid way to go.
However, like Chuck, he, too, strongly believes that the drug works.
He tries Allie's experiment a second time, to give the drug what he considers a fair second chance.

In [208]:
y_charles = rng.choice(
    [0, 1],
    p=[1-GROUND_TRUTH_SURVIVAL_PROB_GIVEN_NO_DRUG, GROUND_TRUTH_SURVIVAL_PROB_GIVEN_NO_DRUG],
    # Be careful -- we want to re-run sims for _only_ the sims where Allie's results were null,
    # so we take the number of sims from `y_allie_where_null`, which stores only Allie's null results.
    size=[y_allie_where_null.shape[0], NUM_PATIENTS],
)
observed_indicator_false_positive = do_trials_conclude_drug_works(y=y_charles)
is_evidence_against_validity_significant(
    observed_indicator_false_positive=observed_indicator_false_positive
)
# free
del y_charles

In this investigative method,
We observed a 0.0528 probability of false positives,
Against an expectation of 0.05.
Given that we ran 9,476 simulations,
The Bernoulli standard error is 0.0022,
Yielding a z-score of 1.23 
Against a critical value of 3.89.
Therefore, we DON'T have statistically-significant evidence
Against the validity of this investigative method at
the alpha = 0.05 level.


Charles's method cannot be rejected, even though it superficially looks like Chuck's!!
Even though he, like Chuck, is giving the drug a "second chance",
the fact that Allie's experiment was already said and done means that
we can condition on her results, and now his own experiment is independent of hers.
It would be very awkward indeed if Allie doing an experiment rippled through the
spacetime continuum and made that experiment forever invalid for anyone else to do,
even people who have never met her or seen her results...

## The Daves

The Daves (Davie and David) live by the mantra "more data is better".
Like the Charlies, they like to find or conduct replicating experiments to add more observations.
But therein lies the difference between them and the Charlies:
Whereas the Charlies conduct and analyze their new experiments separately from the original,
the Daves _pool_ the data from the two experiments,
so that they don't throw out perfectly-valid observations from the original run.

In [210]:
rng = _get_rng(name="dave")

### Davie

Davie does a meta-analysis: He pools the Alices' results. Notice that, given that the Alices had already committed to doing their experiments, it doesn't matter whether Davie stumbles upon their results after they've already come out and then gets the idea of doing a meta-analysis, or instead decides before they have even finished that he wants to run a meta-analysis and then just waits for their results to come out so he can do it. The code is the same. Just in the second case, add a `time.sleep(one_week)` before you run it.

In fact, there's nothing special about Davie:
* Allie could have stumbled upon Alicia's results after they had already come out, and then done the meta-analysis herself; or,
* Allie could have gotten advance knowledge that Alicia was going to do her experiment, waited for Alicia's results to come out, and then done the meta-analysis herself.

In [213]:
y_davie = np.concatenate([y_allie, y_alicia], axis=1)
assert y_davie.shape == (_NSIMS, NUM_PATIENTS*2), y_davie.shape
observed_indicator_false_positive = do_trials_conclude_drug_works(y=y_davie)
is_evidence_against_validity_significant(
    observed_indicator_false_positive=observed_indicator_false_positive
)
# free
del y_davie

In this investigative method,
We observed a 0.0561 probability of false positives,
Against an expectation of 0.05.
Given that we ran 10,000 simulations,
The Bernoulli standard error is 0.0022,
Yielding a z-score of 2.80 
Against a critical value of 3.89.
Therefore, we DON'T have statistically-significant evidence
Against the validity of this investigative method at
the alpha = 0.05 level.


Alright! We cannot reject Davie's method, a classic meta-analysis. The benefit, of course,
is that while he still respects the desired $\alpha = 0.05$ probability
of a false positive under the null,
by pooling two existing valid experiments,
he doubles his sample size and therefore compresses his standard error.
I do not show this, but this does mean that,
supposing the _alternative_ hypothesis were true,
i.e. the drug _did_ work,
he would have more "power", i.e. higher probability of a _true positive_.
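
The power claim can be illustrated with an exact Binomial calculation. The sketch below assumes a hypothetical true survival probability of $0.33$ with the drug (a value not taken from the notebook) and compares a single 1,000-patient experiment against the pooled 2,000-patient analysis; `prob_positive` is an illustrative helper.

```python
import math


def prob_positive(n: int, true_p: float, p0: float = 0.30, critical_z: float = 1.65) -> float:
    """Exact Pr[test is positive] when the survivor count is Binomial(n, true_p)."""
    stderr = math.sqrt(p0 * (1 - p0) / n)  # stderr under the null, as in the notebook
    total = 0.0
    for k in range(n + 1):
        if (k / n - p0) / stderr > critical_z:
            # log of the Binomial pmf, via log-gamma to avoid overflow
            log_pmf = (
                math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
                + k * math.log(true_p) + (n - k) * math.log(1 - true_p)
            )
            total += math.exp(log_pmf)
    return total


ASSUMED_TRUE_P = 0.33  # hypothetical treatment effect, for illustration only
power_single = prob_positive(n=1_000, true_p=ASSUMED_TRUE_P)
power_pooled = prob_positive(n=2_000, true_p=ASSUMED_TRUE_P)
print(f"power with 1,000 patients: {power_single:.3f}")
print(f"power with 2,000 patients: {power_pooled:.3f}")
```

Under this assumed effect size, doubling the sample raises the probability of a true positive noticeably; a different assumed effect would change the numbers but not the direction.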

### Aside: Schrodinger's Alice

Ok, assume that Allie has already done her experiment. Regardless of her results, she wants to increase her sample size, but lacks the funding to do so. She has no reason to believe that anyone else is planning to replicate her experiment, but resolves that, should anyone else get around to doing so within the next year, she will pounce on the opportunity to pool her data with theirs and conduct a meta-analysis. Unbeknownst to Allie, there actually _is_ another investigator named Alicia out there, but this Alicia isn't sure if she wants to do the experiment, so she will flip a coin to determine whether she does it or just binges Netflix instead.

In [215]:
# get allie's results
observed_indicator_false_positive_allie = do_trials_conclude_drug_works(y=y_allie)

# ok now for a moment let's just pretend we do the same thing as Davie and pool everything
y_pooled = np.concatenate([y_allie, y_alicia], axis=1)
assert y_pooled.shape == (_NSIMS, NUM_PATIENTS*2), y_pooled.shape
observed_indicator_false_positive_pooled = do_trials_conclude_drug_works(y=y_pooled)

# finally, simulate the coinflip: for the first half of the simulations,
# a year passes with no activity from Alicia...
observed_indicator_false_positive = observed_indicator_false_positive_allie.copy()
# ... whereas for the second half,
# Alicia does her experiment and Allie gets to pool the data.
cut = int(_NSIMS/2)
observed_indicator_false_positive[cut:] = observed_indicator_false_positive_pooled[cut:]

is_evidence_against_validity_significant(
    observed_indicator_false_positive=observed_indicator_false_positive
)
# free
del observed_indicator_false_positive_allie, observed_indicator_false_positive_pooled, cut

In this investigative method,
We observed a 0.0528 probability of false positives,
Against an expectation of 0.05.
Given that we ran 10,000 simulations,
The Bernoulli standard error is 0.0022,
Yielding a z-score of 1.28 
Against a critical value of 3.89.
Therefore, we DON'T have statistically-significant evidence
Against the validity of this investigative method at
the alpha = 0.05 level.


Quite frankly I wasn't sure about this one.
But it's comforting to see that we can't reject it.
I can totally imagine this scenario arising in real life.
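
One way to see why this works, sketched under the assumption that Alicia's coin flip $C$ is independent of everyone's data: the overall false-positive probability is a mixture of two tests that are each valid on their own,

$$
    \Pr[\text{positive}]
    = \Pr[C = 0] \cdot \Pr[\text{Allie alone positive}]
    + \Pr[C = 1] \cdot \Pr[\text{pooled positive}]
    = 0.5 \alpha + 0.5 \alpha
    = \alpha
.$$

The key is that whether the pooling happens does not depend on what Allie observed.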

### David

Like Davie, David wants to conduct a meta-analysis with a bigger sample size.
However, unlike Davie, David isn't aware that Alicia already ran her own experiment
independent of Allie.
So he says: If Allie's results are positive, there's no need for any further action.
But, if her results are inconclusive (null), then he will take it upon himself
to be resourceful and do a followup study replicating her experiment.

Note: This superficially looks like Chuck's invalid method from above,
but the important thing is that David doesn't discard Allie's data even if it comes out null.
He retains it, and directly pools hers with his.
So, he's in some sense already handicapping himself out the gate
by forcing himself to mix in results that he has already observed to be null.

In [216]:
# get allie's results
observed_indicator_false_positive_allie = do_trials_conclude_drug_works(y=y_allie)
is_null_result = ~observed_indicator_false_positive_allie

# ok now for a moment let's just pretend we do the same thing as Davie and pool everything
y_pooled = np.concatenate([y_allie, y_alicia], axis=1)
assert y_pooled.shape == (_NSIMS, NUM_PATIENTS*2), y_pooled.shape
observed_indicator_false_positive_pooled = do_trials_conclude_drug_works(y=y_pooled)

# finally simulate the "wait-and-see" dynamic:
# we know that for about 5% of the sims, Allie's results will be positive...
observed_indicator_false_positive = observed_indicator_false_positive_allie.copy()
# ... but in the other 95% of the sims, we'll take the pooled results instead
observed_indicator_false_positive[is_null_result] = (
    observed_indicator_false_positive_pooled[is_null_result]
)

is_evidence_against_validity_significant(
    observed_indicator_false_positive=observed_indicator_false_positive
)
# free
del observed_indicator_false_positive_allie, observed_indicator_false_positive_pooled

In this investigative method,
We observed a 0.0856 probability of false positives,
Against an expectation of 0.05.
Given that we ran 10,000 simulations,
The Bernoulli standard error is 0.0022,
Yielding a z-score of 16.33 
Against a critical value of 3.89.
Therefore, we DO have statistically-significant evidence
Against the validity of this investigative method at
the alpha = 0.05 level.


This one kinda freaks me out.
I always thought to myself,
"If your results are inconclusive,
Get up and collect more data!".
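This simulation shows that that advice is wrong, at least if you then pool the new data with the data you already peeked at and re-test at the same critical value.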

#### Can we save this? Let's try a Bonferroni Correction.

In [217]:
# get allie's results
observed_indicator_false_positive_allie = do_trials_conclude_drug_works(y=y_allie)
is_null_result = ~observed_indicator_false_positive_allie

# ok now for a moment let's just pretend we do the same thing as Davie and pool everything...
# BUT! apply a Bonferroni Correction. Keep in mind we're one-tailed here.
# That is, instead of using a naive alpha = 5% -> z* = 1.65,
# use a Bonferroni-corrected "adjusted" alpha = 2.5% -> z* = 1.96.
y_pooled = np.concatenate([y_allie, y_alicia], axis=1)
assert y_pooled.shape == (_NSIMS, NUM_PATIENTS*2), y_pooled.shape
observed_indicator_false_positive_pooled = do_trials_conclude_drug_works(
    y=y_pooled,
    chosen_critical_z_score=1.96,
)

# finally simulate the "wait-and-see" dynamic:
# we know that for about 5% of the sims, Allie's results will be positive...
observed_indicator_false_positive = observed_indicator_false_positive_allie.copy()
# ... but in the other 95% of the sims, we'll take the pooled results instead
observed_indicator_false_positive[is_null_result] = (
    observed_indicator_false_positive_pooled[is_null_result]
)

is_evidence_against_validity_significant(
    observed_indicator_false_positive=observed_indicator_false_positive
)
# free
del observed_indicator_false_positive_allie, observed_indicator_false_positive_pooled

In this investigative method,
We observed a 0.0649 probability of false positives,
Against an expectation of 0.05.
Given that we ran 10,000 simulations,
The Bernoulli standard error is 0.0022,
Yielding a z-score of 6.84 
Against a critical value of 3.89.
Therefore, we DO have statistically-significant evidence
Against the validity of this investigative method at
the alpha = 0.05 level.
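
A rough Normal-approximation calculation suggests why this particular correction is not enough: only the second (pooled) look is held to the stricter $z^* = 1.96$, while Allie's first look is still accepted at $z^* = 1.65$. The sketch below assumes the two experiments' z-scores $Z_1$ and $Z_2$ are independent standard Normals with equal sample sizes, so the pooled z-score is $(Z_1 + Z_2)/\sqrt{2}$; `two_look_false_positive_rate` is a hypothetical helper, not part of the notebook.

```python
import math


def normal_pdf(x: float) -> float:
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)


def normal_cdf(x: float) -> float:
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))


def two_look_false_positive_rate(
    c_first: float, c_pooled: float, steps: int = 20_000
) -> float:
    """Pr[Z1 > c_first] + Pr[Z1 <= c_first and (Z1 + Z2)/sqrt(2) > c_pooled]."""
    first_look = 1 - normal_cdf(c_first)
    lower = -10.0  # the Normal density is negligible below this point
    h = (c_first - lower) / steps
    second_look = 0.0
    for i in range(steps):
        x = lower + (i + 0.5) * h  # midpoint rule over Z1 = x
        # Given Z1 = x, the pooled test is positive iff Z2 > c_pooled*sqrt(2) - x.
        second_look += normal_pdf(x) * (1 - normal_cdf(c_pooled * math.sqrt(2) - x)) * h
    return first_look + second_look


naive = two_look_false_positive_rate(1.65, 1.65)             # David's original method
second_look_only = two_look_false_positive_rate(1.65, 1.96)  # correction applied above
both_looks = two_look_false_positive_rate(1.96, 1.96)        # stricter on both looks
print(f"{naive:.4f} {second_look_only:.4f} {both_looks:.4f}")
```

Under these assumptions, only the variant that applies the stricter threshold to both looks is guaranteed by the union bound to stay at or below 5\%.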
