<a href="https://colab.research.google.com/github/sokrypton/7.571/blob/main/L3/hypothesis_testing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Hypothesis Testing

In this notebook, we'll explore:
1. The basic framework for statistical inference
2. Type I errors (false positives) and Type II errors (false negatives)
3. Statistical power

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import ttest_ind

## 1 | Basic framework for statistical inference

**Scenario:** We want to test whether knocking out a protease affects protein X levels.

- **Null hypothesis (H₀):** μ_control = μ_ko (no difference)
- **Alternative hypothesis (H₁):** μ_control ≠ μ_ko (there is a difference)

We'll use a **two-sample t-test** to compare the means.
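
To make the test less of a black box, here is a minimal sketch of what a two-sample t-test computes, assuming equal variances in both groups (the default for `ttest_ind`). The six measurements are hypothetical values invented for illustration; they are not data from this notebook.

```python
import numpy as np
from scipy.stats import t, ttest_ind

# Hypothetical measurements, invented purely for illustration
control = np.array([950.0, 1100.0, 1020.0])
knockout = np.array([1180.0, 1250.0, 1090.0])

n1, n2 = len(control), len(knockout)

# Pooled variance (assumes both groups share the same true variance)
pooled_var = ((n1 - 1) * control.var(ddof=1)
              + (n2 - 1) * knockout.var(ddof=1)) / (n1 + n2 - 2)

# t-statistic: difference in sample means divided by its standard error
t_manual = (control.mean() - knockout.mean()) / np.sqrt(pooled_var * (1 / n1 + 1 / n2))

# Two-sided p-value from the t distribution with n1 + n2 - 2 degrees of freedom
p_manual = 2 * t.sf(abs(t_manual), df=n1 + n2 - 2)

# Compare with SciPy's implementation
t_scipy, p_scipy = ttest_ind(control, knockout)
print(f"manual: t = {t_manual:.3f}, p = {p_manual:.3f}")
print(f"scipy:  t = {t_scipy:.3f}, p = {p_scipy:.3f}")
```

The two lines should agree, which shows that the p-value depends only on the difference in means, the within-group spread, and the number of replicates.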

Let's start with a case where **H₀ is true** (the knockout has no effect).

In [None]:
# Define the TRUE populations (in real life, we don't know these!)
control_mu = 1000
control_sigma = 200

ko_mu = 1000      # Same as control - NO EFFECT
ko_sigma = 200

# Visualize the populations
population_size = 100000
control_population = np.random.normal(control_mu, control_sigma, population_size)
ko_population = np.random.normal(ko_mu, ko_sigma, population_size)

plt.hist(control_population, bins=100, alpha=0.5, label='control')
plt.hist(ko_population, bins=100, alpha=0.5, label='knockout')
plt.xlabel('Protein X levels')
plt.ylabel('Count')
plt.legend()
plt.title('True populations (H₀ is TRUE: no difference)')
plt.show()

### Run ONE experiment

In real life, we can only measure a small sample from each population.

In [None]:
# Take a small sample (like a real experiment)
n = 3  # number of replicates

control_sample = np.random.choice(control_population, size=n)
ko_sample = np.random.choice(ko_population, size=n)

# Visualize the samples with boxplot
plt.figure(figsize=(6, 5))
plt.boxplot([control_sample, ko_sample])
plt.xticks([1, 2], ['Control', 'Knockout'])  # boxplot's `labels=` argument is deprecated in Matplotlib >= 3.9

# Add individual points with jitter
for i, (data, color) in enumerate([(control_sample, 'blue'), (ko_sample, 'orange')]):
    x = np.random.normal(i + 1, 0.04, size=len(data))  # jitter
    plt.scatter(x, data, color=color, alpha=0.7, s=50)

plt.ylabel('Protein X levels')
plt.title('Your experiment (n=3 per group)')
plt.show()

# Perform t-test
t_stat, p_value = ttest_ind(control_sample, ko_sample)

print(f"Control: {control_sample.round(1)}")
print(f"Knockout: {ko_sample.round(1)}")
print(f"\nt-statistic: {t_stat:.3f}")
print(f"p-value: {p_value:.3f}")

if p_value < 0.05:
    print("\n⚠️ p < 0.05: We would reject H₀ (FALSE POSITIVE!)")
else:
    print("\n✓ p >= 0.05: We fail to reject H₀ (correct)")

### Key question

**Did anyone in the class get a "significant" result?**

Since H₀ is actually true, any p < 0.05 is a **false positive** (Type I error).

How often should this happen? Let's simulate many experiments:

In [None]:
# Simulate 1000 experiments
num_experiments = 1000
n = 3

p_values = []
for i in range(num_experiments):
    control_sample = np.random.choice(control_population, size=n)
    ko_sample = np.random.choice(ko_population, size=n)
    _, p_value = ttest_ind(control_sample, ko_sample)
    p_values.append(p_value)

p_values = np.array(p_values)

# Count false positives at α = 0.05
false_positives = np.sum(p_values < 0.05)
false_positive_rate = false_positives / num_experiments

print(f"Number of experiments: {num_experiments}")
print(f"False positives (p < 0.05): {false_positives}")
print(f"False positive rate: {false_positive_rate:.3f}")
print(f"\nExpected (α = 0.05): {0.05}")

**This is the definition of α!** When H₀ is true, we incorrectly reject it with probability α, so at α = 0.05 we expect about 5% of experiments to be false positives.
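
The same idea holds for any threshold, not just 0.05: when H₀ is true, p-values are (approximately) uniformly distributed, so the fraction below a cutoff is close to the cutoff itself. The sketch below checks this. It draws samples directly from the normal distribution instead of from a stored population, and the fixed seed and the number of experiments are assumptions chosen only to make the run repeatable.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)  # fixed seed (an assumption) so the sketch is repeatable
num_experiments, n = 5000, 3

# Each row is one experiment; both groups come from the same N(1000, 200) distribution
control = rng.normal(1000, 200, size=(num_experiments, n))
ko = rng.normal(1000, 200, size=(num_experiments, n))

# axis=1 runs one t-test per row (per experiment)
_, p_values = ttest_ind(control, ko, axis=1)

for threshold in [0.01, 0.05, 0.10, 0.20]:
    rate = np.mean(p_values < threshold)
    print(f"threshold = {threshold:.2f}: fraction of p-values below it = {rate:.3f}")
```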

---

## 2 | Type II errors and statistical power

Now let's consider a case where **H₀ is false** (the knockout DOES have an effect).

In [None]:
# Now the KO has a REAL effect
control_mu = 1000
control_sigma = 200

ko_mu = 1200      # Different from control - REAL EFFECT!
ko_sigma = 200

# Regenerate populations
control_population = np.random.normal(control_mu, control_sigma, population_size)
ko_population = np.random.normal(ko_mu, ko_sigma, population_size)

plt.hist(control_population, bins=100, alpha=0.5, label='control')
plt.hist(ko_population, bins=100, alpha=0.5, label='knockout')
plt.xlabel('Protein X levels')
plt.ylabel('Count')
plt.legend()
plt.title('True populations (H₀ is FALSE: real difference!)')
plt.show()

print(f"True effect size: {ko_mu - control_mu} ({(ko_mu - control_mu)/control_sigma:.1f} standard deviations)")

In [None]:
# Simulate 1000 experiments with a REAL effect
num_experiments = 1000
n = 3

p_values = []
for i in range(num_experiments):
    control_sample = np.random.choice(control_population, size=n)
    ko_sample = np.random.choice(ko_population, size=n)
    _, p_value = ttest_ind(control_sample, ko_sample)
    p_values.append(p_value)

p_values = np.array(p_values)

# Count detections (true positives) and misses (false negatives)
alpha = 0.05
detected = np.sum(p_values < alpha)      # True positives
missed = np.sum(p_values >= alpha)       # False negatives (Type II errors)

power = detected / num_experiments
beta = missed / num_experiments

print(f"With n = {n} replicates:")
print(f"  Detected the effect (p < 0.05): {detected} / {num_experiments}")
print(f"  Missed the effect (p >= 0.05): {missed} / {num_experiments}")
print(f"\n  Power (1-β): {power:.3f}")
print(f"  β (false negative rate): {beta:.3f}")

### Key insight

With only 3 replicates, we **miss** the real effect in most experiments! Each miss is a **Type II error** (false negative).

**Power** = probability of detecting a real effect = 1 - β
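
As a cross-check on the simulation, power can also be computed analytically. The sketch below assumes normally distributed data, equal variances, and a two-sided test, and uses the noncentral t distribution (`scipy.stats.nct`). The helper name `analytic_power` is made up for this example.

```python
import numpy as np
from scipy.stats import nct, t

def analytic_power(n, effect_size, alpha=0.05):
    """Power of a two-sided, equal-variance two-sample t-test with n replicates per group.

    effect_size is the true difference in means divided by sigma.
    """
    df = 2 * n - 2                           # degrees of freedom
    nc = effect_size * np.sqrt(n / 2)        # noncentrality parameter
    t_crit = t.ppf(1 - alpha / 2, df)        # rejection threshold for |t|
    # Probability that the t-statistic lands in either rejection tail
    return nct.sf(t_crit, df, nc) + nct.cdf(-t_crit, df, nc)

# Same setting as the simulation: means 1000 vs 1200, sigma = 200, n = 3
power = analytic_power(n=3, effect_size=(1200 - 1000) / 200)
print(f"Analytic power at n = 3: {power:.3f}")
print(f"Analytic beta at n = 3:  {1 - power:.3f}")
```

The analytic value should land close to the simulated power, with small differences due to sampling noise in the simulation.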

---

## 3 | How does sample size affect power?

In [None]:
# Test different sample sizes
sample_sizes = [2, 4, 8, 16, 32, 64]
num_experiments = 500
alpha = 0.05

powers = []

for n in sample_sizes:
    detected = 0
    for i in range(num_experiments):
        control_sample = np.random.choice(control_population, size=n)
        ko_sample = np.random.choice(ko_population, size=n)
        _, p_value = ttest_ind(control_sample, ko_sample)
        if p_value < alpha:
            detected += 1

    power = detected / num_experiments
    powers.append(power)
    print(f"n = {n:3d}: power = {power:.3f}")

# Plot
plt.figure(figsize=(8, 5))
plt.plot(sample_sizes, powers, 'bo-', markersize=10)
plt.axhline(0.8, color='red', linestyle='--', label='80% power')
plt.xlabel('Sample size (n per group)')
plt.ylabel('Power')
plt.title('Power vs Sample Size')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

### Key question

**How many samples do we need for 80% power?**
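
One way to answer this is to repeat the simulation on a finer grid of sample sizes and stop at the first n that reaches 80% power. This is a rough sketch under the same assumptions as above (means 1000 vs 1200, σ = 200, α = 0.05). It draws directly from the normal distribution, and the seed, the grid, and the number of experiments are arbitrary choices.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)  # fixed seed (an assumption) so the sketch is repeatable
num_experiments = 5000
alpha = 0.05

n_for_80 = None
for n in range(4, 25):
    # One row per simulated experiment
    control = rng.normal(1000, 200, size=(num_experiments, n))
    ko = rng.normal(1200, 200, size=(num_experiments, n))
    _, p_values = ttest_ind(control, ko, axis=1)

    power = np.mean(p_values < alpha)
    if power >= 0.8:
        n_for_80 = n
        break

print(f"Smallest n per group with simulated power >= 0.8: {n_for_80} (power = {power:.3f})")
```

Because the estimate is noisy near the threshold, a different seed may shift the answer by a replicate or so.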

---

## Summary

| | H₀ True (no effect) | H₀ False (real effect) |
|---|---|---|
| **Reject H₀** | Type I Error (α) | ✓ Correct (Power = 1-β) |
| **Fail to reject** | ✓ Correct | Type II Error (β) |

**Power depends on:**
1. **α** (significance threshold) - higher α → more power, but more false positives
2. **Sample size (n)** - more samples → more power
3. **Effect size** - larger effects (relative to the variability σ) are easier to detect
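
The effects of α and effect size can be checked the same way sample size was checked above. The sketch below holds n fixed and varies one factor at a time. The helper `simulated_power`, its default values, and the specific α and effect-size grids are assumptions made for this illustration.

```python
import numpy as np
from scipy.stats import ttest_ind

def simulated_power(n, effect, sigma=200, alpha=0.05, num_experiments=2000, seed=0):
    """Estimate power by simulation: the fraction of experiments with p < alpha."""
    rng = np.random.default_rng(seed)
    control = rng.normal(1000, sigma, size=(num_experiments, n))
    ko = rng.normal(1000 + effect, sigma, size=(num_experiments, n))
    _, p_values = ttest_ind(control, ko, axis=1)  # one test per row
    return np.mean(p_values < alpha)

n = 8  # replicates per group, held fixed

# Vary the significance threshold at a fixed effect of 200 (1 standard deviation)
for alpha in [0.01, 0.05, 0.10]:
    print(f"alpha = {alpha:.2f}: power = {simulated_power(n, effect=200, alpha=alpha):.3f}")

# Vary the true effect at alpha = 0.05
for effect in [100, 200, 400]:
    print(f"effect = {effect}: power = {simulated_power(n, effect=effect):.3f}")
```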

**Always do a power analysis BEFORE your experiment!**