# Week 13: In-Class Exercise - Confidence Intervals & Hypothesis Testing

## Objective
Apply inferential statistics to the Education Statistics dataset to draw conclusions about population parameters.

## Time: ~30 minutes

## Dataset
Education Statistics from the Colombian Ministry of Education (MEN_ESTADISTICAS) - the same dataset we cleaned in Week 3.

### What You Will Do:
1. Calculate a confidence interval for the mean enrollment
2. Perform a t-test comparing urban vs rural dropout rates
3. Interpret p-values correctly
4. Make data-driven conclusions

---

## Setup
Run this cell to load the necessary libraries and dataset.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Set visualization style
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('husl')

# Load the Education Statistics dataset from datos.gov.co
url = "https://www.datos.gov.co/resource/ji8i-4anb.csv?$limit=15000"
df = pd.read_csv(url)

print(f"Dataset loaded: {df.shape[0]} rows, {df.shape[1]} columns")
print(f"\nColumns: {df.columns.tolist()}")

In [None]:
# Quick data inspection
df.head()

In [None]:
# Check for numeric columns we can use
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
print("Numeric columns available:")
for col in numeric_cols:
    print(f"  - {col}: mean = {df[col].mean():.2f}, std = {df[col].std():.2f}")

---

## Part 1: Understanding Confidence Intervals (10 minutes)

A **confidence interval** gives us a range of plausible values for a population parameter.

**Key concept:** A 95% confidence interval means that if we repeated this sampling process many times, about 95% of the intervals would contain the true population mean.
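
The "repeated sampling" idea can be checked with a small simulation. The sketch below is not part of the exercise: it draws many samples from a made-up normal population (the mean of 100, standard deviation of 20, and sample size of 40 are arbitrary assumptions) and counts how often the 95% interval contains the true mean. Variable names are prefixed with `sim_` so they do not overwrite anything in the notebook.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical population parameters (assumed for illustration only)
sim_true_mean, sim_true_sd = 100.0, 20.0
sim_n, sim_repeats = 40, 2000

# Each row is one simulated sample of size sim_n
sim_samples = rng.normal(sim_true_mean, sim_true_sd, size=(sim_repeats, sim_n))

# Build a 95% t-interval for every sample
sim_means = sim_samples.mean(axis=1)
sim_ses = sim_samples.std(axis=1, ddof=1) / np.sqrt(sim_n)
sim_t_crit = stats.t.ppf(0.975, df=sim_n - 1)
sim_lower = sim_means - sim_t_crit * sim_ses
sim_upper = sim_means + sim_t_crit * sim_ses

# Fraction of intervals that contain the true mean
sim_coverage = np.mean((sim_lower <= sim_true_mean) & (sim_true_mean <= sim_upper))
print(f"Share of intervals containing the true mean: {sim_coverage:.3f}")
```

The printed share should land close to 0.95. Any single interval either contains the true mean or does not; the 95% describes the procedure, not one interval.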

---

### Task 1.1: Calculate Confidence Interval for Enrollment Mean

Calculate a 95% confidence interval for the mean enrollment.

**Formula (using t-distribution):**
$$CI = \bar{x} \pm t_{\alpha/2} \times \frac{s}{\sqrt{n}}$$

Where:
- $\bar{x}$ = sample mean
- $t_{\alpha/2}$ = t-critical value
- $s$ = sample standard deviation
- $n$ = sample size
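
**Worked example (made-up numbers, not from the dataset):** suppose $\bar{x} = 50$, $s = 10$, and $n = 25$. With $n - 1 = 24$ degrees of freedom, the t-critical value for 95% confidence is approximately $2.064$.

$$\frac{s}{\sqrt{n}} = \frac{10}{\sqrt{25}} = 2, \qquad t_{\alpha/2} \times \frac{s}{\sqrt{n}} \approx 2.064 \times 2 = 4.128$$

$$CI \approx 50 \pm 4.128 = (45.87,\ 54.13)$$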

In [None]:
# First, identify an enrollment-related column
# Look for columns with names like: matricula, estudiantes, enrollment, total

# List numeric columns that might hold enrollment counts
# (restricting to numeric_cols avoids picking a text column by accident)
enrollment_candidates = [col for col in numeric_cols if any(x in col.lower() for x in ['matricula', 'estudiante', 'total', 'enrollment'])]
print(f"Potential enrollment columns: {enrollment_candidates}")

# Choose one to work with (update this based on your dataset)
# If no specific column found, use the first large numeric column
if enrollment_candidates:
    enrollment_col = enrollment_candidates[0]
else:
    # Use a numeric column with large values (likely enrollment)
    enrollment_col = df[numeric_cols].max().idxmax()

print(f"\nUsing column: {enrollment_col}")
print(f"Sample statistics:")
print(f"  Mean: {df[enrollment_col].mean():.2f}")
print(f"  Std: {df[enrollment_col].std():.2f}")
print(f"  n: {df[enrollment_col].notna().sum()}")

In [None]:
# Task 1.1: Calculate 95% confidence interval manually

# Step 1: Calculate sample statistics
data = df[enrollment_col].dropna()

sample_mean = data.mean()
sample_std = data.std(ddof=1)  # ddof=1 for sample standard deviation
n = len(data)

print(f"Sample mean (x-bar): {sample_mean:.2f}")
print(f"Sample std (s): {sample_std:.2f}")
print(f"Sample size (n): {n}")

# Step 2: Calculate standard error
# YOUR CODE HERE: standard_error = sample_std / sqrt(n)
standard_error = ___

print(f"Standard error: {standard_error:.4f}")

In [None]:
# Step 3: Find the t-critical value for 95% confidence
# Degrees of freedom = n - 1
# For 95% confidence, alpha = 0.05, so we need t at alpha/2 = 0.025

confidence_level = 0.95
alpha = 1 - confidence_level
degrees_of_freedom = n - 1

# Get t-critical value using scipy.stats
t_critical = stats.t.ppf(1 - alpha/2, df=degrees_of_freedom)

print(f"Confidence level: {confidence_level*100}%")
print(f"Alpha: {alpha}")
print(f"Degrees of freedom: {degrees_of_freedom}")
print(f"t-critical value: {t_critical:.4f}")

In [None]:
# Step 4: Calculate the confidence interval
# YOUR CODE HERE: Calculate margin of error and CI bounds

margin_of_error = t_critical * standard_error

ci_lower = ___  # sample_mean - margin_of_error
ci_upper = ___  # sample_mean + margin_of_error

print(f"\n=== 95% CONFIDENCE INTERVAL ===")
print(f"Margin of error: {margin_of_error:.2f}")
print(f"Lower bound: {ci_lower:.2f}")
print(f"Upper bound: {ci_upper:.2f}")
print(f"\nInterpretation: We are 95% confident that the true population mean")
print(f"enrollment is between {ci_lower:.2f} and {ci_upper:.2f}")

In [None]:
# Verify using scipy's built-in function
# This should give the same result!

ci_scipy = stats.t.interval(
    confidence=0.95,
    df=n-1,
    loc=sample_mean,
    scale=stats.sem(data)  # sem = standard error of the mean
)

print(f"Verification using scipy.stats.t.interval():")
print(f"95% CI: ({ci_scipy[0]:.2f}, {ci_scipy[1]:.2f})")

### Task 1.2: Visualize the Confidence Interval

Create a visualization showing the distribution and the confidence interval.

In [None]:
# Visualize the confidence interval
fig, ax = plt.subplots(figsize=(10, 6))

# Plot histogram of the data
ax.hist(data, bins=50, alpha=0.7, color='steelblue', edgecolor='white')

# Add vertical lines for mean and CI
ax.axvline(sample_mean, color='red', linewidth=2, label=f'Sample Mean: {sample_mean:.2f}')
ax.axvline(ci_lower, color='green', linewidth=2, linestyle='--', label=f'95% CI Lower: {ci_lower:.2f}')
ax.axvline(ci_upper, color='green', linewidth=2, linestyle='--', label=f'95% CI Upper: {ci_upper:.2f}')

# Shade the CI region
ax.axvspan(ci_lower, ci_upper, alpha=0.2, color='green', label='95% CI Region')

ax.set_xlabel(enrollment_col, fontsize=12)
ax.set_ylabel('Frequency', fontsize=12)
ax.set_title(f'Distribution of {enrollment_col} with 95% Confidence Interval', fontsize=14)
ax.legend(loc='upper right')

plt.tight_layout()
plt.show()

---

## Part 2: Hypothesis Testing - Comparing Two Groups (15 minutes)

**Scenario:** We want to test if there is a statistically significant difference in dropout rates between urban and rural areas.

**Hypotheses:**
- $H_0$ (Null): There is no difference in dropout rates between urban and rural areas ($\mu_{urban} = \mu_{rural}$)
- $H_1$ (Alternative): There IS a difference in dropout rates ($\mu_{urban} \neq \mu_{rural}$)

---

### Task 2.1: Prepare the Data

Identify the urban/rural classification and dropout rate columns.
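
Zone labels in open-data exports are sometimes inconsistently cased or padded with spaces, which silently splits one group into several. The sketch below shows one way to normalize labels before filtering. It uses a made-up toy table: the column names `zona` and `tasa` and all values are assumptions for illustration, not the actual columns of the MEN dataset.

```python
import pandas as pd

# Hypothetical toy data with messy zone labels
toy = pd.DataFrame({
    "zona": [" Urbana", "RURAL", "urbana ", "Rural", None, "URBANA"],
    "tasa": [3.1, 5.4, 2.8, 6.0, 4.2, 3.5],
})

# Strip whitespace and upper-case so equivalent labels match exactly
toy["zona_clean"] = toy["zona"].str.strip().str.upper()

# Missing labels stay missing and are excluded from the counts
zone_counts = toy["zona_clean"].value_counts()
print(zone_counts)

# Select each group with an exact match on the cleaned label
toy_urban = toy.loc[toy["zona_clean"] == "URBANA", "tasa"]
toy_rural = toy.loc[toy["zona_clean"] == "RURAL", "tasa"]
```

Checking `value_counts()` on the cleaned column before building the groups makes it easy to spot labels that still need attention.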

In [None]:
# Find urban/rural and dropout columns
print("Looking for relevant columns...\n")

# Urban/rural might be: zona, area, urbano, rural
zone_candidates = [col for col in df.columns if any(x in col.lower() for x in ['zona', 'area', 'urban', 'rural', 'sector'])]
print(f"Zone/Area columns: {zone_candidates}")

# Dropout might be: desercion, tasa_desercion, dropout
dropout_candidates = [col for col in df.columns if any(x in col.lower() for x in ['desercion', 'desertor', 'dropout', 'abandono'])]
print(f"Dropout columns: {dropout_candidates}")

# If we have candidates, check their unique values
for col in zone_candidates[:2]:  # Check first 2
    print(f"\nUnique values in {col}: {df[col].unique()[:10]}")

In [None]:
# Set up the columns for our test
# UPDATE THESE based on your dataset's column names

zone_col = zone_candidates[0] if zone_candidates else None
dropout_col = dropout_candidates[0] if dropout_candidates else None

# If columns not found, use alternatives
if zone_col is None:
    print("No zone column found. Check columns manually:")
    print(df.columns.tolist())
    # You may need to set zone_col manually, e.g.:
    # zone_col = 'YOUR_COLUMN_NAME'

if dropout_col is None:
    # Use any numeric column for demonstration
    print("No dropout column found. Using first numeric column for demonstration.")
    dropout_col = numeric_cols[0] if numeric_cols else None

print(f"\nUsing zone column: {zone_col}")
print(f"Using dropout column: {dropout_col}")

In [None]:
# Create two groups for comparison
# This assumes zone_col has values like 'URBANA'/'RURAL' or 'URBANO'/'RURAL'

if zone_col and dropout_col:
    # Check unique values
    print(f"Unique values in {zone_col}:")
    print(df[zone_col].value_counts())
    
    # Create groups (adjust the filter values based on your data)
    # Common patterns: URBANA/RURAL, URBANO/RURAL, 1/2, U/R
    
    # Try to identify urban and rural values
    unique_zones = df[zone_col].dropna().unique()
    urban_val = [z for z in unique_zones if 'urban' in str(z).lower()]
    rural_val = [z for z in unique_zones if 'rural' in str(z).lower()]
    
    if urban_val and rural_val:
        urban_val = urban_val[0]
        rural_val = rural_val[0]
    else:
        # Fallback: use first two unique values
        urban_val = unique_zones[0]
        rural_val = unique_zones[1] if len(unique_zones) > 1 else unique_zones[0]
    
    print(f"\nGroup 1 (Urban): {urban_val}")
    print(f"Group 2 (Rural): {rural_val}")

In [None]:
# Extract data for each group
group_urban = df[df[zone_col] == urban_val][dropout_col].dropna()
group_rural = df[df[zone_col] == rural_val][dropout_col].dropna()

print(f"Urban group: n = {len(group_urban)}, mean = {group_urban.mean():.4f}, std = {group_urban.std():.4f}")
print(f"Rural group: n = {len(group_rural)}, mean = {group_rural.mean():.4f}, std = {group_rural.std():.4f}")

### Task 2.2: Perform the Independent Samples t-Test

Use `scipy.stats.ttest_ind()` to compare the two groups.
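
Note that `ttest_ind` assumes equal variances in the two groups by default (Student's t-test). Passing `equal_var=False` runs Welch's t-test, which drops that assumption. The sketch below compares the two on synthetic data where the group sizes and spreads are deliberately unequal; the numbers are invented and the `demo_` names are hypothetical, chosen to avoid clashing with the notebook's variables.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Synthetic groups: large, tight group vs. small, widely spread group
demo_urban = rng.normal(4.0, 1.0, 200)
demo_rural = rng.normal(4.5, 5.0, 30)

# Student's t-test (default, pools the variances)
student = stats.ttest_ind(demo_urban, demo_rural)

# Welch's t-test (does not assume equal variances)
welch = stats.ttest_ind(demo_urban, demo_rural, equal_var=False)

print(f"Student: t = {student.statistic:.3f}, p = {student.pvalue:.4f}")
print(f"Welch:   t = {welch.statistic:.3f}, p = {welch.pvalue:.4f}")
```

When the standard deviations printed in Task 2.1 differ noticeably between groups, Welch's version is usually the safer choice.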

In [None]:
# Task 2.2: Perform the t-test

# The independent samples t-test compares means of two unrelated groups
# H0: The means are equal (no difference)
# H1: The means are different

# YOUR CODE HERE: Perform the t-test
# Hint: t_statistic, p_value = stats.ttest_ind(group1, group2)

t_statistic, p_value = stats.ttest_ind(___, ___)

print("=== INDEPENDENT SAMPLES T-TEST RESULTS ===")
print(f"t-statistic: {t_statistic:.4f}")
print(f"p-value: {p_value:.6f}")

### Task 2.3: Interpret the P-Value

**Critical question:** Is the difference statistically significant?

**Decision rule:** At significance level alpha = 0.05:
- If p-value < 0.05: Reject $H_0$ (significant difference)
- If p-value >= 0.05: Fail to reject $H_0$ (no significant difference)
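
The decision rule connects directly to Part 1: for a two-sided test at alpha = 0.05, rejecting $H_0$ is equivalent to the 95% confidence interval for the difference in means excluding zero. The sketch below illustrates this on synthetic data using Welch's formulas (unequal variances); the data and the `d_` variable names are assumptions made for the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
d_a = rng.normal(5.0, 1.5, 120)   # synthetic group A
d_b = rng.normal(5.8, 2.5, 80)    # synthetic group B

# Difference in sample means and its standard error (Welch)
d_diff = d_a.mean() - d_b.mean()
d_va = d_a.var(ddof=1) / len(d_a)
d_vb = d_b.var(ddof=1) / len(d_b)
d_se = np.sqrt(d_va + d_vb)

# Welch-Satterthwaite degrees of freedom
d_dof = (d_va + d_vb) ** 2 / (d_va ** 2 / (len(d_a) - 1) + d_vb ** 2 / (len(d_b) - 1))

# 95% CI for the difference in means
d_t_crit = stats.t.ppf(0.975, df=d_dof)
d_ci = (d_diff - d_t_crit * d_se, d_diff + d_t_crit * d_se)

# Matching two-sided Welch t-test
d_t, d_p = stats.ttest_ind(d_a, d_b, equal_var=False)

print(f"Difference in means: {d_diff:.3f}")
print(f"95% CI: ({d_ci[0]:.3f}, {d_ci[1]:.3f})")
print(f"p-value: {d_p:.4f}")
print(f"CI excludes 0: {not (d_ci[0] <= 0 <= d_ci[1])}, p < 0.05: {d_p < 0.05}")
```

Reporting the interval alongside the p-value also shows how large the difference might plausibly be, which the p-value alone does not.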

In [None]:
# Task 2.3: Interpret the results

alpha = 0.05  # Significance level

print("=== INTERPRETATION ===")
print(f"Significance level (alpha): {alpha}")
print(f"P-value: {p_value:.6f}")
print()

if p_value < alpha:
    print(f"Decision: REJECT the null hypothesis (p = {p_value:.4f} < {alpha})")
    print(f"\nConclusion: There IS a statistically significant difference")
    print(f"in {dropout_col} between urban and rural areas.")
    print(f"\nA difference of {abs(group_urban.mean() - group_rural.mean()):.4f}")
    print("would be unlikely if the two population means were equal.")
else:
    print(f"Decision: FAIL TO REJECT the null hypothesis (p = {p_value:.4f} >= {alpha})")
    print(f"\nConclusion: There is NO statistically significant difference")
    print(f"in {dropout_col} between urban and rural areas.")
    print(f"\nThe observed difference could have occurred by random chance.")

In [None]:
# Visualize the comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Boxplot comparison
plot_data = pd.DataFrame({
    'Group': ['Urban'] * len(group_urban) + ['Rural'] * len(group_rural),
    'Value': list(group_urban) + list(group_rural)
})

sns.boxplot(x='Group', y='Value', data=plot_data, ax=axes[0], palette=['steelblue', 'coral'])
axes[0].set_xlabel('Area Type', fontsize=12)
axes[0].set_ylabel(dropout_col, fontsize=12)
axes[0].set_title(f'Comparison: Urban vs Rural\n(p-value = {p_value:.4f})', fontsize=13)

# Distribution overlay
axes[1].hist(group_urban, bins=30, alpha=0.6, label=f'Urban (mean={group_urban.mean():.2f})', color='steelblue')
axes[1].hist(group_rural, bins=30, alpha=0.6, label=f'Rural (mean={group_rural.mean():.2f})', color='coral')
axes[1].axvline(group_urban.mean(), color='steelblue', linestyle='--', linewidth=2)
axes[1].axvline(group_rural.mean(), color='coral', linestyle='--', linewidth=2)
axes[1].set_xlabel(dropout_col, fontsize=12)
axes[1].set_ylabel('Frequency', fontsize=12)
axes[1].set_title('Distribution Comparison', fontsize=13)
axes[1].legend()

plt.tight_layout()
plt.show()

---

## Part 3: Common P-Value Misinterpretations (5 minutes)

**IMPORTANT:** Avoid these common mistakes!

---

### What the P-Value IS and IS NOT:

| What the p-value IS | What the p-value is NOT |
|---------|----------|
| The probability of observing data at least this extreme IF H0 is true | NOT the probability that H0 is true |
| A small p-value means the data would be unlikely under H0 | A small p-value does NOT prove H1 is true |
| Whether a result counts as significant depends on your chosen alpha | p < 0.05 is NOT a universal rule |
| A measure of evidence against H0 only; it says nothing about practical importance | NOT a measure of effect size: a significant result may not be practically meaningful |
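
To make the last row concrete, the sketch below generates two very large synthetic groups whose means differ by a tiny amount. The data are invented, and Cohen's d is used here as one common effect-size measure (computed with the simple equal-group-size pooled standard deviation); neither is prescribed by the exercise.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Two large synthetic groups with a very small true difference in means
big_a = rng.normal(10.00, 1.0, 50_000)
big_b = rng.normal(10.05, 1.0, 50_000)

_, big_p = stats.ttest_ind(big_a, big_b)

# Cohen's d: difference in means in units of the pooled standard deviation
# (this pooled formula assumes equal group sizes)
pooled_sd = np.sqrt((big_a.var(ddof=1) + big_b.var(ddof=1)) / 2)
cohens_d = (big_b.mean() - big_a.mean()) / pooled_sd

print(f"p-value:   {big_p:.2e}")
print(f"Cohen's d: {cohens_d:.3f}")
```

The p-value is far below 0.05, yet the groups differ by only a few hundredths of a standard deviation, which would rarely matter in practice.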

### The Courtroom Analogy:

- **H0** = Defendant is innocent (null hypothesis)
- **H1** = Defendant is guilty (alternative hypothesis)
- **Evidence** = Your data
- **P-value** = How surprising is this evidence IF the defendant is innocent?

A small p-value means: "This evidence would be very surprising if the defendant were innocent."

It does NOT mean: "The defendant is definitely guilty."
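
In the analogy's terms, even innocent defendants occasionally face surprising evidence. The simulation sketch below runs many t-tests on synthetic groups drawn from the same distribution, so $H_0$ is true by construction, and counts how often p < 0.05 occurs anyway. The population values and `h0_` names are assumptions for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
h0_sims, h0_group_size = 2000, 30

# Both groups come from the same population, so there is no real difference
h0_a = rng.normal(10.0, 3.0, size=(h0_sims, h0_group_size))
h0_b = rng.normal(10.0, 3.0, size=(h0_sims, h0_group_size))

# One t-test per row (axis=1 compares row i of h0_a with row i of h0_b)
_, h0_p = stats.ttest_ind(h0_a, h0_b, axis=1)

# Share of tests that reject H0 even though H0 is true
h0_false_positive_rate = np.mean(h0_p < 0.05)
print(f"Share of tests with p < 0.05 under H0: {h0_false_positive_rate:.3f}")
```

The share should be near 0.05: alpha is the rate of false alarms the test accepts when there is no real difference.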

### Quick Check: True or False?

Answer in the markdown cell below:

1. A p-value of 0.03 means there is a 3% chance the null hypothesis is true.
2. If p = 0.06, we should conclude there is no effect.
3. A statistically significant result is always practically important.
4. The p-value depends on sample size.

### Your Answers:

1. ___ (True/False) - Explanation: ___
2. ___ (True/False) - Explanation: ___
3. ___ (True/False) - Explanation: ___
4. ___ (True/False) - Explanation: ___

---

## Summary

In this exercise, you learned:

1. **Confidence Intervals**
   - How to calculate a CI for the mean
   - Components: sample mean, standard error, t-critical value
   - Interpretation: a range of plausible values for the population mean

2. **Hypothesis Testing (t-test)**
   - How to compare two independent groups
   - The t-statistic measures how far apart the group means are relative to the standard error of their difference
   - The p-value tells us how surprising the result is under H0

3. **P-Value Interpretation**
   - P-value is NOT the probability that H0 is true
   - Statistical significance is not the same as practical importance
   - Always consider context and effect size

---

*End of Exercise*