# **Post-Read 2: Advanced Topics in Two-Sample and Proportion Tests**

*Welcome to **Post-Read 2**, a **supplementary** notebook expanding on **Main Lecture 2**. In Lecture 2, you learned how to perform two-sample tests (Z-Tests, T-Tests) and basic proportion tests. This notebook delves deeper into optional or advanced points, providing **conceptual clarity**, **practical code snippets**, and **mini-exercises** for further practice.*

In [None]:
!wget https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/102/634/original/medical_cost.zip


In [None]:
!unzip medical_cost.zip

Archive:  medical_cost.zip
  inflating: insurance.csv           


In [None]:
import pandas as pd

df = pd.read_csv('insurance.csv')
df.head()

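|   | age | sex | bmi | children | smoker | region | charges |
|---|-----|-----|-----|----------|--------|--------|---------|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.8552 |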


---
## 1. Introduction

### 1.1 What We’ll Cover

1. **Paired T-Test**: Understand how to handle before/after or matched-pair scenarios (e.g., medical charges before and after an intervention).
2. **Effect Size for Two-Sample Tests**: Review **Cohen’s d** for measuring the magnitude of differences, in both independent and paired designs.
3. **Power Analysis**: Learn how sample size relates to detecting differences with a certain probability (power).
4. **Confidence Intervals for Differences**: Expand beyond basics to interpret intervals for the difference in means (e.g., smoker vs. non-smoker).
5. **Advanced Proportion Testing**: Explore potential pitfalls (small samples, continuity corrections, or chi-square approaches).
6. **Mini Exercises**: Solidify these ideas with short practice questions.

---

## 2. Paired T-Test

### 2.1 Conceptual Explanation

A <font color="magenta">Paired T-Test</font> is used when:
- You have **two related measurements** from the **same** subject or matched pairs (e.g., “before and after” an intervention).
- Data are **dependent** (not independent samples).

**Null Hypothesis (H₀)** in a paired test often states: “The **mean difference** between paired measurements is 0.”
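
To make this null hypothesis concrete: a paired t-test is a one-sample t-test on the per-pair differences. The sketch below computes the statistic by hand using only the standard library. The `before` and `after` lists are made-up illustrative values, not taken from the insurance dataset.

```python
import math
import statistics

# Hypothetical paired measurements (illustrative values only)
before = [10, 12, 9, 14, 11]
after = [8, 11, 9, 12, 10]

# Work on the per-pair differences, not on the two columns separately
diffs = [b - a for b, a in zip(before, after)]

n = len(diffs)
mean_diff = statistics.mean(diffs)
sd_diff = statistics.stdev(diffs)      # sample SD (n - 1 in the denominator)
se_diff = sd_diff / math.sqrt(n)       # standard error of the mean difference

# t statistic for H0: mean difference = 0, with n - 1 degrees of freedom
t_manual = mean_diff / se_diff
print(f"mean diff = {mean_diff:.2f}, t = {t_manual:.3f}, df = {n - 1}")
```

Calling `stats.ttest_rel(before, after)` on the same lists should return the same statistic, since SciPy's paired test is defined on these differences.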

### 2.2 Example: Artificial Before/After Charges

For demonstration, let’s create a **hypothetical** scenario:
- We draw a random subset of patients (n=50) from the dataset.
- We treat their actual charges as `charges_before`.
- We simulate a hypothetical intervention that lowers each patient’s charges by a random amount, giving `charges_after`.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# For demonstration, let's create a small dataframe
np.random.seed(123)  # for reproducible results
n_sample = 50

# Randomly sample 50 individuals' charges from df
charges_subset = df['charges'].sample(n=n_sample, random_state=123).values

# Create a "before" scenario
charges_before = charges_subset

# Create an "after" scenario with some hypothetical reduction
# e.g., reduce each by a random value between 100 and 1000
reductions = np.random.uniform(100, 1000, n_sample)
charges_after = charges_before - reductions

paired_df = pd.DataFrame({
    'before': charges_before,
    'after': charges_after
})

paired_df.head()

|   | before | after |
|---|--------|-------|
| 0 | 9800.8882 | 9074.065933 |
| 1 | 4667.60765 | 4310.082249 |
| 2 | 34838.873 | 34534.706692 |
| 3 | 5125.2157 | 4529.032408 |
| 4 | 12142.5786 | 11395.056527 |


### 2.3 Performing a Paired T-Test

In [None]:
# SciPy's paired t-test can be done with ttest_rel
t_stat, p_value = stats.ttest_rel(paired_df['before'], paired_df['after'])

print(f"Paired T-Test Statistic: {t_stat:.3f}")
print(f"P-value: {p_value:.6f}")

Paired T-Test Statistic: 18.457
P-value: 0.000000


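- **Interpretation**: The printed p-value of 0.000000 is a rounded value far below 0.05, so we reject H₀ and conclude that mean charges differ between *before* and *after*. The positive t-statistic reflects that `before - after` is positive on average, i.e., charges went down.
- This result holds by construction, because every simulated reduction is positive. In real studies, `before` and `after` must come from genuine repeated measures or matched pairs.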

💡 **Tip**: Always ensure the *pairing* is correct—mix-ups in matched pairs can invalidate your test.
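
To see why pairing matters, the sketch below runs the same paired statistic twice: once with the correct pairing and once with the `after` column reversed to mimic a mix-up. The numbers and the `paired_t` helper are hypothetical, written only for this illustration.

```python
import math
import statistics

def paired_t(before, after):
    """Return the paired t statistic for H0: mean(before - after) = 0."""
    diffs = [b - a for b, a in zip(before, after)]
    se = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / se

# Hypothetical charges: each patient drops by roughly 100 (illustrative values)
before = [1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000]
after = [900, 1880, 2910, 3890, 4895, 5905, 6885, 7900]

t_correct = paired_t(before, after)

# Simulate a pairing mix-up by reversing the order of the "after" column
t_mispaired = paired_t(before, after[::-1])

print(f"Correct pairing:  t = {t_correct:.2f}")
print(f"Mixed-up pairing: t = {t_mispaired:.2f}")
```

The mean difference is identical in both runs. Only the spread of the differences changes, and that alone is enough to wipe out the signal.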

---

## 3. Effect Size for Two-Sample Tests

### 3.1 Cohen’s d for Independent Samples

When dealing with **two independent samples** (e.g., smokers vs. non-smokers), Cohen’s d can be computed as:

$
d = \frac{ \bar{x}_1 - \bar{x}_2 }{ s_{\text{pooled}} }
$

where $s_{\text{pooled}}$ is the **pooled** standard deviation of both groups:

$
s_{\text{pooled}} = \sqrt{ \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} }
$

### 3.2 Quick Code Example: Cohen’s d for Smokers vs. Non-Smokers

Let’s compare <font color="blue">charges</font> for smokers and non-smokers:

In [None]:
smokers = df[df['smoker'] == 'yes']['charges'].dropna()
non_smokers = df[df['smoker'] == 'no']['charges'].dropna()

mean_smokers = smokers.mean()
mean_non_smokers = non_smokers.mean()
std_smokers = smokers.std(ddof=1)
std_non_smokers = non_smokers.std(ddof=1)
n_smokers = len(smokers)
n_non_smokers = len(non_smokers)

# Pooled standard deviation
s_pooled = np.sqrt(
    ((n_smokers - 1)*std_smokers**2 + (n_non_smokers - 1)*std_non_smokers**2) /
    (n_smokers + n_non_smokers - 2)
)

cohens_d = (mean_smokers - mean_non_smokers) / s_pooled

print(f"Mean (Smokers): {mean_smokers:.2f}")
print(f"Mean (Non-Smokers): {mean_non_smokers:.2f}")
print(f"Cohen's d: {cohens_d:.3f}")

# Optional: interpret the effect size
if abs(cohens_d) < 0.2:
    interpretation = "Very small/negligible effect"
elif abs(cohens_d) < 0.5:
    interpretation = "Small effect"
elif abs(cohens_d) < 0.8:
    interpretation = "Medium effect"
else:
    interpretation = "Large effect"
print(f"Interpretation of effect size: {interpretation}")

Mean (Smokers): 32050.23
Mean (Non-Smokers): 8434.27
Cohen's d: 3.161
Interpretation of effect size: Large effect


### 3.3 Effect Size in Paired Tests

For **paired** data, Cohen’s d can be computed similarly, but we focus on **differences** (e.g., `before - after`) and use the **standard deviation of those differences** instead of a pooled standard deviation across two separate groups.
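
A minimal sketch of that paired effect size (sometimes written $d_z$), assuming a small set of hypothetical before/after values rather than the notebook's data:

```python
import math
import statistics

# Hypothetical before/after values for six subjects (illustrative only)
before = [520, 480, 610, 555, 490, 600]
after = [500, 470, 580, 540, 495, 570]

diffs = [b - a for b, a in zip(before, after)]

# Paired Cohen's d: mean difference divided by the SD of the differences
d_z = statistics.mean(diffs) / statistics.stdev(diffs)

# The paired t statistic is this effect size scaled by sqrt(n)
t_stat = d_z * math.sqrt(len(diffs))

print(f"Paired Cohen's d: {d_z:.3f}")
print(f"Implied paired t: {t_stat:.3f}")
```

Because $t = d_z \sqrt{n}$, a large t-statistic can come from either a large effect or simply a large sample, which is why reporting the effect size alongside the p-value is useful.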

---

## 4. Power Analysis for Two-Sample Tests

### 4.1 Why Power Analysis?

A test with too small a sample may **fail to detect a meaningful difference** (a high Type II error rate). Power analysis helps:
- Estimate how large your sample needs to be to reach a desired power (e.g., 80% or 90%) for a given effect size and significance level.

### 4.2 Example Using `statsmodels`

In [None]:
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Suppose we want to detect a difference of 4000 in mean charges
# between two groups (like smokers vs. non-smokers),
# and we estimate the common standard deviation based on previous data.
estimated_std = s_pooled  # from the earlier calculation
effect_size = 4000 / estimated_std  # difference / pooled std

alpha = 0.05
desired_power = 0.80

required_n_per_group = analysis.solve_power(effect_size=effect_size,
                                           alpha=alpha,
                                           power=desired_power,
                                           alternative='two-sided')

print(f"Required sample size per group for 80% power: {required_n_per_group:.2f}")

Required sample size per group for 80% power: 55.73



- **Interpretation**: Sample sizes are whole numbers, so round up: 55.73 means at least 56 per group. If the effect size is smaller, alpha is stricter, or the desired power is higher, you’ll need a **larger** sample.

🚀 **Note**: Real-world charges data can be skewed, so sometimes you’d consider transformations or non-parametric power analysis. This is just a demonstration of the general approach.
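
For intuition about how sample size scales with effect size, here is a rough sketch using the textbook normal approximation $n \approx 2\,(z_{1-\alpha/2} + z_{\text{power}})^2 / d^2$ per group. The function name `approx_n_per_group` is hypothetical, and this is not the method `statsmodels` uses internally; it is only a quick sanity check.

```python
import math
from statistics import NormalDist

def approx_n_per_group(effect_size, alpha=0.05, power=0.80):
    """Normal-approximation sample size per group for a two-sided,
    two-sample comparison of means with equal group sizes."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # critical value for alpha
    z_power = NormalDist().inv_cdf(power)           # quantile for target power
    return 2 * (z_alpha + z_power) ** 2 / effect_size ** 2

for d in (0.8, 0.5, 0.25):
    n = approx_n_per_group(d)
    print(f"d = {d:<4} -> about {math.ceil(n)} per group")
```

A t-based solver such as `TTestIndPower.solve_power` typically returns a slightly larger number, because it also accounts for estimating the standard deviation from the sample.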

---

## 5. Confidence Intervals for Differences

### 5.1 Independent Two-Sample CI

When you do a **two-sample t-test**, you can also construct a confidence interval for the **difference in means**. For large enough samples or known variances, you might use a z-based approach, but typically:

$
(\bar{x}_1 - \bar{x}_2) \; \pm \; t_{\alpha/2, \, df} \; \times \; SE
$

where $SE$ is the standard error of the difference in means. With unequal variances (Welch), $SE = \sqrt{s_1^2/n_1 + s_2^2/n_2}$; with equal variances, $SE = s_{\text{pooled}}\sqrt{1/n_1 + 1/n_2}$.

### 5.2 Code Example: CI for Difference Between Smokers & Non-Smokers

In [None]:
t_stat, p_value = stats.ttest_ind(smokers, non_smokers, equal_var=False)
print(f"Two-sample t-test (unequal variance) -> t_stat: {t_stat:.3f}, p_value: {p_value:.6f}")

# Confidence interval for difference
# We can compute the difference in means, and approximate the CI manually or use statsmodels
difference_in_means = mean_smokers - mean_non_smokers
smokers_nonsmokers_se = np.sqrt(std_smokers**2 / n_smokers + std_non_smokers**2 / n_non_smokers)

# Approx degrees of freedom using Welch–Satterthwaite
df_approx = (
    (std_smokers**2 / n_smokers + std_non_smokers**2 / n_non_smokers)**2 /
    ((std_smokers**2 / n_smokers)**2 / (n_smokers - 1) + (std_non_smokers**2 / n_non_smokers)**2 / (n_non_smokers - 1))
)

alpha = 0.05
t_crit = stats.t.ppf(1 - alpha/2, df_approx)
margin_of_error = t_crit * smokers_nonsmokers_se

lower = difference_in_means - margin_of_error
upper = difference_in_means + margin_of_error

print(f"95% CI for (Smokers - Non-Smokers): [{lower:.2f}, {upper:.2f}]")

Two-sample t-test (unequal variance) -> t_stat: 32.752, p_value: 0.000000
95% CI for (Smokers - Non-Smokers): [22197.21, 25034.71]


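- If the **95% CI** doesn’t include 0, we have evidence that the two groups’ means differ at α=0.05. Here the whole interval lies well above 0, so smokers’ mean charges are higher, by roughly 22,200 to 25,000.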

### 5.3 CI Comparisons

You can also explore **90%** or **99%** intervals, noting how the **confidence level** affects the **interval width**.
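
The sketch below illustrates that trade-off with hypothetical summary statistics (`diff_in_means` and `se_diff` are made-up values). It uses z critical values as a large-sample stand-in for t, so treat the numbers as approximate.

```python
from statistics import NormalDist

# Hypothetical summary statistics (illustrative, not from the dataset)
diff_in_means = 500.0
se_diff = 120.0

margins = {}
for level in (0.90, 0.95, 0.99):
    # Two-sided critical value; z is used here as a large-sample stand-in for t
    z_crit = NormalDist().inv_cdf(1 - (1 - level) / 2)
    margins[level] = z_crit * se_diff
    lower = diff_in_means - margins[level]
    upper = diff_in_means + margins[level]
    print(f"{level:.0%} CI: [{lower:.1f}, {upper:.1f}]  (width {upper - lower:.1f})")
```

Higher confidence demands a wider interval: the estimate and its standard error stay the same, and only the critical value grows.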


---

## 6. Advanced Proportion Testing

### 6.1 Two-Proportion Z-Test

If analyzing categorical proportions (e.g., proportion of smokers in two different regions), we might use a <font color="magenta">two-proportion z-test</font>.

**Key formula** for the test statistic (for large samples):

$
z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}
$

where:
- $\hat{p}_1$ and $\hat{p}_2$ are sample proportions,
- $\hat{p}$ is the pooled proportion $\frac{x_1 + x_2}{n_1 + n_2}$.
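
As a worked instance of this formula, the sketch below computes the statistic and a two-sided p-value by hand. The counts are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import math
from statistics import NormalDist

# Hypothetical counts (illustrative): successes x and sample sizes n per group
x1, n1 = 45, 200
x2, n2 = 30, 200

p1_hat = x1 / n1
p2_hat = x2 / n2
p_pooled = (x1 + x2) / (n1 + n2)     # pooled proportion under H0: p1 = p2

# Standard error of the difference under H0, using the pooled proportion
se = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
z = (p1_hat - p2_hat) / se

# Two-sided p-value from the standard normal distribution
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"z = {z:.3f}, two-sided p-value = {p_value:.4f}")
```

With these counts the difference of 7.5 percentage points falls just short of significance at α=0.05, a reminder that moderate gaps need fairly large samples to register.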

### 6.2 Code Example: Comparing Proportions (Smoker Rate in Two Regions)

Imagine we want to compare the proportion of **smokers** in the **southeast** vs. **southwest** region:

In [None]:
region_se = df[df['region'] == 'southeast']
region_sw = df[df['region'] == 'southwest']

n_se = len(region_se)
n_sw = len(region_sw)

smokers_se = region_se[region_se['smoker'] == 'yes']
smokers_sw = region_sw[region_sw['smoker'] == 'yes']

x_se = len(smokers_se)  # number of smokers in southeast
x_sw = len(smokers_sw)  # number of smokers in southwest

p_se = x_se / n_se
p_sw = x_sw / n_sw

# Two-proportion z-test using statsmodels
from statsmodels.stats.proportion import proportions_ztest

count = np.array([x_se, x_sw])
nobs = np.array([n_se, n_sw])

stat, pval = proportions_ztest(count, nobs, alternative='two-sided')
print(f"Two-proportion z-test -> z-stat: {stat:.3f}, p-value: {pval:.6f}")
print(f"Smoker proportion in southeast: {p_se:.3f}, southwest: {p_sw:.3f}")

Two-proportion z-test -> z-stat: 2.277, p-value: 0.022790
Smoker proportion in southeast: 0.250, southwest: 0.178


### 6.3 Small Sample or Edge Cases

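- The z-test relies on a normal approximation. A common rule of thumb is at least about 10 successes and 10 failures in each group; below that, consider a **continuity correction** or **Fisher’s Exact Test** (designed for 2×2 tables with small counts).
- A **chi-square test** on the 2×2 contingency table is an alternative to the two-proportion z-test. Without a continuity correction, its statistic equals $z^2$, so the two give the same two-sided p-value.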
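
To show what Fisher’s Exact Test does, here is a small from-scratch sketch for a 2×2 table. The function `fisher_exact_two_sided` and the counts are hypothetical, written for illustration. In practice `scipy.stats.fisher_exact` is the usual route and should agree with this on a 2×2 table.

```python
from math import comb

def fisher_exact_two_sided(table):
    """Two-sided Fisher's exact p-value for a 2x2 table [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2

    def prob(k):
        # Hypergeometric probability of k in the top-left cell, margins fixed
        return comb(col1, k) * comb(total - col1, row1 - k) / comb(total, row1)

    k_min = max(0, col1 - row2)
    k_max = min(col1, row1)
    p_obs = prob(a)
    # Sum every table that is no more likely than the observed one
    return sum(prob(k) for k in range(k_min, k_max + 1)
               if prob(k) <= p_obs * (1 + 1e-9))

# Hypothetical small-sample counts: rows = regions, columns = (yes, no)
table = [[1, 7],
         [6, 2]]
p_fisher = fisher_exact_two_sided(table)
print(f"Fisher's exact two-sided p-value: {p_fisher:.4f}")
```

The test enumerates every table with the same row and column totals, so it needs no large-sample approximation. That is why it stays valid when counts are tiny.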

---

## 7. Additional Practice Problems

Here are a few **short quizzes** and **reflection questions** to reinforce these advanced topics. Try them before viewing hints!

### 7.1 Paired T-Test Concept

**Q:** In a **paired** scenario of patient charges *before* and *after* a diet program, your p-value is 0.03 at α=0.05. What does that suggest about the diet program’s effect on mean charges?

<details>
<summary><em>Hint/Answer</em></summary>
Since p < 0.05, we **reject** the null hypothesis that the mean difference is 0: the diet program is associated with a statistically significant change in mean charges. The p-value alone does not give the direction; check the sign of the mean of `before - after` (positive means charges fell).
</details>

---

### 7.2 Effect Size & Interpretation

**Q:** If Cohen’s d for smokers vs. non-smokers is **1.2**, how would you interpret this effect size?

<details>
<summary><em>Hint/Answer</em></summary>
A Cohen’s d of 1.2 is typically considered **large** (anything above 0.8 is large). It suggests a substantial difference in mean charges between the two groups.
</details>

---

### 7.3 Power Analysis

**Q:** For detecting a smaller difference in means (e.g., 2000 instead of 4000), would the required sample size **increase** or **decrease** (keeping α and power the same)? Why?

<details>
<summary><em>Hint/Answer</em></summary>
It would **increase**. A smaller difference means a smaller effect size, so you need a larger sample to detect it with the same power. Because the required n scales roughly with 1/d², halving the difference roughly quadruples the sample size.
</details>

---

### 7.4 CI for Differences

**Q:** If a 95% CI for (Group1 - Group2) is $-1000$ to $+500$, does this indicate a significant difference at α=0.05?

<details>
<summary><em>Hint/Answer</em></summary>
No. Because the interval includes **0**, we fail to reject the null hypothesis at α=0.05. The possible difference in means could be positive or negative.
</details>

---

### 7.5 Proportion Testing

**Q:** You compare the proportion of policyholders **with children** (i.e., `children > 0` vs. `children == 0`) across two regions. One region’s sample size is tiny (n=8). Which test or approach might be safer than a large-sample z-test?

<details>
<summary><em>Hint/Answer</em></summary>
Consider **Fisher’s Exact Test** if you have a 2×2 contingency table, or be very cautious and use a continuity-corrected version of the z-test. But often, Fisher’s Exact is preferred for extremely small counts.
</details>

---