# Week 13 Workshop: Inferential Statistics

## Confidence Intervals & Hypothesis Testing

### Education Statistics Dataset (MEN_ESTADISTICAS)

### Objectives
1. Calculate confidence intervals for 3 metrics
2. Perform 2 hypothesis tests
3. Interpret results in business language
4. Visualize statistical significance

### Duration: 2-3 hours

---

## Setup

Run this cell to load the libraries and dataset.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Set visualization style
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('husl')

# Load the Education Statistics dataset from datos.gov.co
url = "https://www.datos.gov.co/resource/ji8i-4anb.csv?$limit=15000"
df = pd.read_csv(url)

print("Dataset loaded successfully!")
print(f"Shape: {df.shape[0]} rows x {df.shape[1]} columns")
print(f"\nColumns:")
print(df.columns.tolist())

In [None]:
# Preview the data
df.head()

In [None]:
# Check data types and identify numeric columns
print("Data types:")
print(df.dtypes)
print("\n" + "="*50)
print("\nNumeric columns summary:")
df.describe()

In [None]:
# Identify key columns for our analysis
# Find enrollment, dropout, and approval columns

print("Looking for relevant columns...\n")

# Enrollment columns
enrollment_cols = [col for col in df.columns if any(x in col.lower() for x in ['matricula', 'estudiante', 'enrollment', 'total'])]
print(f"Enrollment columns: {enrollment_cols}")

# Dropout columns
dropout_cols = [col for col in df.columns if any(x in col.lower() for x in ['desercion', 'desertor', 'dropout', 'abandono'])]
print(f"Dropout columns: {dropout_cols}")

# Approval/pass rate columns
approval_cols = [col for col in df.columns if any(x in col.lower() for x in ['aprobado', 'aprobacion', 'approval', 'promovido'])]
print(f"Approval columns: {approval_cols}")

# Zone columns (urban/rural)
zone_cols = [col for col in df.columns if any(x in col.lower() for x in ['zona', 'area', 'urban', 'rural', 'sector'])]
print(f"Zone columns: {zone_cols}")

# Year columns
year_cols = [col for col in df.columns if any(x in col.lower() for x in ['anio', 'ano', 'year', 'periodo'])]
print(f"Year columns: {year_cols}")

---

# Part 1: Confidence Intervals for 3 Metrics

Calculate and interpret 95% confidence intervals for:
1. Mean enrollment
2. Mean dropout rate
3. Mean approval rate
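
For reference, each interval in this part follows the standard t-based formula for a mean. The symbols map onto the variable names used in the cells below ($\bar{x}$ is `sample_mean`, $s$ is `sample_std`, $n$ is the sample size, and $\alpha = 0.05$ for 95% confidence). The formula assumes the sample mean is approximately normally distributed, which is usually reasonable for large $n$:

$$
\bar{x} \;\pm\; t_{1-\alpha/2,\,n-1} \cdot \frac{s}{\sqrt{n}}
$$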

---

## Metric 1: Mean Enrollment

Calculate a 95% confidence interval for the average enrollment.
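
A confidence interval for a mean only makes sense on a numeric column, and CSV exports sometimes deliver numbers as text. Below is a minimal sketch of a cleaning step. It assumes (this is not confirmed for this dataset) that values may carry `%` signs or decimal commas; `to_numeric_clean` is a hypothetical helper name, not a pandas function.

```python
import pandas as pd

def to_numeric_clean(series: pd.Series) -> pd.Series:
    """Coerce a column to numbers; unparseable entries become NaN and are dropped."""
    if not pd.api.types.is_numeric_dtype(series):
        # Assumption: text values may look like '12,5%' (decimal comma, percent sign)
        series = (series.astype(str)
                        .str.replace('%', '', regex=False)
                        .str.replace(',', '.', regex=False)
                        .str.strip())
    return pd.to_numeric(series, errors='coerce').dropna()

# Toy values for illustration only (not taken from the dataset)
raw = pd.Series(['12,5%', '8.0', 'N/A', None, '3,25'])
clean = to_numeric_clean(raw)
print(clean.tolist())
```

Check `df.dtypes` first: if the column is already numeric, the helper only drops missing values. Note that replacing commas with dots would corrupt values that use commas as thousands separators, so inspect a few raw entries before applying it.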

In [None]:
# Step 1: Select the enrollment column and extract clean data
# UPDATE the column name based on what you found above

enrollment_col = ___  # e.g., 'MATRICULA_TOTAL' or the first from enrollment_cols

# Extract data and remove missing values
enrollment_data = df[enrollment_col].dropna()

print(f"Column: {enrollment_col}")
print(f"Sample size (n): {len(enrollment_data)}")
print(f"Sample mean: {enrollment_data.mean():.2f}")
print(f"Sample std: {enrollment_data.std():.2f}")

In [None]:
# Step 2: Calculate the 95% CI manually

n = len(enrollment_data)
sample_mean = enrollment_data.mean()
sample_std = enrollment_data.std(ddof=1)  # ddof=1 for sample std

# Calculate standard error
# YOUR CODE HERE
standard_error = ___

# Get t-critical value for 95% confidence
confidence_level = 0.95
alpha = 1 - confidence_level
t_critical = stats.t.ppf(1 - alpha/2, df=n-1)

# Calculate margin of error
# YOUR CODE HERE
margin_of_error = ___

# Calculate CI bounds
# YOUR CODE HERE
ci_lower = ___
ci_upper = ___

print(f"=== 95% CONFIDENCE INTERVAL FOR ENROLLMENT ===")
print(f"Standard Error: {standard_error:.4f}")
print(f"t-critical value: {t_critical:.4f}")
print(f"Margin of Error: {margin_of_error:.2f}")
print(f"\n95% CI: ({ci_lower:.2f}, {ci_upper:.2f})")

In [None]:
# Step 3: Verify using scipy.stats
# Note: the `confidence` keyword requires SciPy >= 1.9 (older releases call it `alpha`)

ci_scipy = stats.t.interval(
    confidence=0.95,
    df=n-1,
    loc=sample_mean,
    scale=stats.sem(enrollment_data)
)

print(f"Verification (scipy): 95% CI = ({ci_scipy[0]:.2f}, {ci_scipy[1]:.2f})")
# Compare both bounds, not just the lower one
ci_match = np.allclose([ci_lower, ci_upper], ci_scipy, atol=0.01)
print(f"\nDo they match? {'Yes!' if ci_match else 'Check your calculations'}")

In [None]:
# Step 4: Visualize the CI

fig, ax = plt.subplots(figsize=(10, 6))

# Histogram
ax.hist(enrollment_data, bins=50, alpha=0.7, color='steelblue', edgecolor='white')

# Mean and CI lines
ax.axvline(sample_mean, color='red', linewidth=2, label=f'Mean: {sample_mean:.2f}')
ax.axvline(ci_lower, color='green', linewidth=2, linestyle='--', label=f'95% CI Lower: {ci_lower:.2f}')
ax.axvline(ci_upper, color='green', linewidth=2, linestyle='--', label=f'95% CI Upper: {ci_upper:.2f}')
ax.axvspan(ci_lower, ci_upper, alpha=0.2, color='green')

ax.set_xlabel(enrollment_col, fontsize=12)
ax.set_ylabel('Frequency', fontsize=12)
ax.set_title(f'Distribution of {enrollment_col} with 95% CI', fontsize=14)
ax.legend()

plt.tight_layout()
plt.show()

### Interpretation: Enrollment CI

**Write your interpretation below:**

*We are 95% confident that the true population mean enrollment is between ___ and ___. This means...*

*YOUR INTERPRETATION HERE*

---

## Metric 2: Mean Dropout Rate

Calculate a 95% confidence interval for the average dropout rate.
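
Because the same steps repeat for every metric, it can help to wrap the scipy verification in a small function and use it to cross-check your manual results. The sketch below uses synthetic data; `mean_ci` and the `demo_*` names are hypothetical and not part of the workshop template.

```python
import numpy as np
from scipy import stats

def mean_ci(data, confidence=0.95):
    """Return (mean, lower, upper) for a t-based confidence interval of the mean."""
    values = np.asarray(data, dtype=float)
    values = values[~np.isnan(values)]        # drop missing values
    n = values.size
    if n < 2:
        raise ValueError("need at least two observations")
    mean = values.mean()
    sem = stats.sem(values)                   # standard error of the mean (ddof=1)
    lower, upper = stats.t.interval(confidence, df=n - 1, loc=mean, scale=sem)
    return mean, lower, upper

# Synthetic rates for illustration only (not from the dataset)
rng = np.random.default_rng(0)
demo_rates = rng.normal(loc=0.05, scale=0.02, size=500)
demo_mean, demo_lower, demo_upper = mean_ci(demo_rates)
print(f"mean = {demo_mean:.4f}, 95% CI = ({demo_lower:.4f}, {demo_upper:.4f})")
```

Calling `mean_ci(dropout_data)` after you finish Step 2 should reproduce your manual bounds to within rounding.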

In [None]:
# Step 1: Select the dropout column
# UPDATE the column name based on what you found above

dropout_col = ___  # e.g., 'TASA_DESERCION' or first from dropout_cols

# Extract and clean data
dropout_data = df[dropout_col].dropna()

print(f"Column: {dropout_col}")
print(f"Sample size (n): {len(dropout_data)}")
print(f"Sample mean: {dropout_data.mean():.4f}")
print(f"Sample std: {dropout_data.std():.4f}")

In [None]:
# Step 2: Calculate the 95% CI
# YOUR CODE HERE: Follow the same steps as for enrollment

n_dropout = len(dropout_data)
mean_dropout = dropout_data.mean()
std_dropout = dropout_data.std(ddof=1)

# Standard error
se_dropout = ___

# t-critical
t_crit_dropout = stats.t.ppf(0.975, df=n_dropout-1)

# Margin of error
moe_dropout = ___

# CI bounds
ci_lower_dropout = ___
ci_upper_dropout = ___

print(f"=== 95% CONFIDENCE INTERVAL FOR DROPOUT RATE ===")
print(f"95% CI: ({ci_lower_dropout:.4f}, {ci_upper_dropout:.4f})")

In [None]:
# Step 3: Verify using scipy.stats

ci_dropout_scipy = stats.t.interval(
    confidence=0.95,
    df=n_dropout-1,
    loc=mean_dropout,
    scale=stats.sem(dropout_data)
)

print(f"Verification (scipy): 95% CI = ({ci_dropout_scipy[0]:.4f}, {ci_dropout_scipy[1]:.4f})")

In [None]:
# Step 4: Visualize the CI
# YOUR CODE HERE: Create a similar visualization as for enrollment

fig, ax = plt.subplots(figsize=(10, 6))

# YOUR CODE HERE

plt.tight_layout()
plt.show()

### Interpretation: Dropout Rate CI

**Write your interpretation below:**

*YOUR INTERPRETATION HERE*

---

## Metric 3: Mean Approval Rate

Calculate a 95% confidence interval for the average approval/pass rate.
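
If the dataset has no ready-made approval rate, one option is to derive it from counts. The sketch below shows the idea on a toy table; the column names `aprobados`, `matriculados`, and `tasa_aprobacion` are hypothetical and must be replaced with whatever count columns your data actually contains.

```python
import numpy as np
import pandas as pd

# Toy counts for illustration only; column names are hypothetical
toy = pd.DataFrame({
    'aprobados':    [90, 45, 0, 30],
    'matriculados': [100, 50, 0, 40],
})

# Rows with zero enrolled students get NaN instead of a division-by-zero result
denominator = toy['matriculados'].replace(0, np.nan)
toy['tasa_aprobacion'] = toy['aprobados'] / denominator

approval_example = toy['tasa_aprobacion'].dropna()
print(approval_example.tolist())
```

A rate built this way is a proportion between 0 and 1, so keep its scale in mind when comparing it with columns reported as percentages.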

In [None]:
# Step 1: Select the approval column
# UPDATE the column name

approval_col = ___  # e.g., 'TASA_APROBACION' or first from approval_cols

# If no approval column exists, you can use any other numeric metric
# or calculate approval rate from counts if available

# Extract and clean data
approval_data = df[approval_col].dropna()

print(f"Column: {approval_col}")
print(f"Sample size (n): {len(approval_data)}")
print(f"Sample mean: {approval_data.mean():.4f}")
print(f"Sample std: {approval_data.std():.4f}")

In [None]:
# Step 2: Calculate the 95% CI
# YOUR CODE HERE

n_approval = len(approval_data)
mean_approval = approval_data.mean()
std_approval = approval_data.std(ddof=1)

# Calculate SE, MOE, and CI bounds
# YOUR CODE HERE

se_approval = ___
t_crit_approval = stats.t.ppf(0.975, df=n_approval-1)
moe_approval = ___
ci_lower_approval = ___
ci_upper_approval = ___

print(f"=== 95% CONFIDENCE INTERVAL FOR APPROVAL RATE ===")
print(f"95% CI: ({ci_lower_approval:.4f}, {ci_upper_approval:.4f})")

In [None]:
# Step 3: Verify using scipy.stats
# YOUR CODE HERE


In [None]:
# Step 4: Visualize the CI
# YOUR CODE HERE


### Interpretation: Approval Rate CI

**Write your interpretation below:**

*YOUR INTERPRETATION HERE*

---

## Summary: All Three Confidence Intervals

Build a summary table of all three CIs, then plot the two rate CIs together (enrollment is on a different scale, so it appears only in the table).

In [None]:
# Summary table
summary_data = {
    'Metric': ['Enrollment', 'Dropout Rate', 'Approval Rate'],
    'Sample Mean': [sample_mean, mean_dropout, mean_approval],
    'CI Lower': [ci_lower, ci_lower_dropout, ci_lower_approval],
    'CI Upper': [ci_upper, ci_upper_dropout, ci_upper_approval],
    'Sample Size': [n, n_dropout, n_approval]
}

summary_df = pd.DataFrame(summary_data)
print("=== CONFIDENCE INTERVALS SUMMARY ===")
print(summary_df.to_string(index=False))

In [None]:
# Create error bar plot for rates (dropout and approval)
# Note: Enrollment is on a different scale, so we show rates separately

fig, ax = plt.subplots(figsize=(8, 6))

metrics = ['Dropout Rate', 'Approval Rate']
means = [mean_dropout, mean_approval]
errors = [
    [mean_dropout - ci_lower_dropout, ci_upper_dropout - mean_dropout],
    [mean_approval - ci_lower_approval, ci_upper_approval - mean_approval]
]
errors = np.array(errors).T  # Reshape for errorbar

x_pos = np.arange(len(metrics))

ax.errorbar(x_pos, means, yerr=errors, fmt='o', markersize=10, 
            capsize=10, capthick=2, color='steelblue', ecolor='gray')

ax.set_xticks(x_pos)
ax.set_xticklabels(metrics, fontsize=12)
ax.set_ylabel('Rate', fontsize=12)
ax.set_title('95% Confidence Intervals for Education Rates', fontsize=14)
ax.grid(axis='y', alpha=0.3)

# Add value labels
for i, (m, lower, upper) in enumerate(zip(means, [ci_lower_dropout, ci_lower_approval], [ci_upper_dropout, ci_upper_approval])):
    ax.annotate(f'{m:.3f}\n[{lower:.3f}, {upper:.3f}]', 
                xy=(i, m), xytext=(10, 10), 
                textcoords='offset points', fontsize=9)

plt.tight_layout()
plt.show()

---

# Part 2: Hypothesis Testing (2 Tests)

Perform two hypothesis tests to compare groups.

---

## Test 1: Urban vs Rural Dropout Rates

**Research Question:** Is there a statistically significant difference in dropout rates between urban and rural areas?

**Hypotheses:**
- $H_0$: There is no difference in mean dropout rates ($\mu_{urban} = \mu_{rural}$)
- $H_1$: There IS a difference in mean dropout rates ($\mu_{urban} \neq \mu_{rural}$)
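
By default `stats.ttest_ind` pools the two variances (Student's t-test, `equal_var=True`). When the groups differ in size and spread, Welch's version (`equal_var=False`) is often the safer choice. The sketch below compares both on synthetic data; the `*_demo` arrays are made up for illustration and say nothing about the real urban and rural groups.

```python
import numpy as np
from scipy import stats

# Synthetic groups with unequal sizes and unequal spread (illustration only)
rng = np.random.default_rng(42)
urban_demo = rng.normal(loc=0.03, scale=0.01, size=200)
rural_demo = rng.normal(loc=0.04, scale=0.03, size=60)

# Student's t-test: assumes both groups share one variance
t_pooled, p_pooled = stats.ttest_ind(urban_demo, rural_demo)

# Welch's t-test: does not assume equal variances
t_welch, p_welch = stats.ttest_ind(urban_demo, rural_demo, equal_var=False)

print(f"Pooled: t = {t_pooled:.3f}, p = {p_pooled:.4f}")
print(f"Welch:  t = {t_welch:.3f}, p = {p_welch:.4f}")
```

Comparing the group standard deviations printed in Step 2 is a quick way to judge whether the equal-variance assumption is plausible for your data.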

In [None]:
# Step 1: Identify and prepare the groups

# Find the zone column
zone_col = ___  # e.g., 'ZONA' or first from zone_cols

print(f"Zone column: {zone_col}")
print(f"\nUnique values:")
print(df[zone_col].value_counts())

In [None]:
# Step 2: Create the two groups
# Identify the values for urban and rural

unique_zones = df[zone_col].dropna().unique()
print(f"Available zone values: {unique_zones}")

# Set the urban and rural values (UPDATE based on your data)
urban_value = ___  # e.g., 'URBANA' or 'URBANO'
rural_value = ___  # e.g., 'RURAL'

# Extract dropout rates for each group
group_urban = df[df[zone_col] == urban_value][dropout_col].dropna()
group_rural = df[df[zone_col] == rural_value][dropout_col].dropna()

print(f"\nUrban group: n = {len(group_urban)}, mean = {group_urban.mean():.4f}, std = {group_urban.std():.4f}")
print(f"Rural group: n = {len(group_rural)}, mean = {group_rural.mean():.4f}, std = {group_rural.std():.4f}")

In [None]:
# Step 3: Perform the independent samples t-test
# YOUR CODE HERE

t_stat_1, p_value_1 = stats.ttest_ind(___, ___)

print("=== HYPOTHESIS TEST 1: URBAN VS RURAL DROPOUT RATES ===")
print(f"\nTest statistic (t): {t_stat_1:.4f}")
print(f"P-value: {p_value_1:.6f}")

In [None]:
# Step 4: Make a decision and interpret

alpha = 0.05

print(f"Significance level (alpha): {alpha}")
print(f"P-value: {p_value_1:.6f}")
print()

if p_value_1 < alpha:
    decision_1 = "REJECT"
    print(f"Decision: {decision_1} the null hypothesis")
    print(f"\nConclusion: There IS a statistically significant difference")
    print(f"in dropout rates between urban and rural areas.")
else:
    decision_1 = "FAIL TO REJECT"
    print(f"Decision: {decision_1} the null hypothesis")
    print(f"\nConclusion: There is NO statistically significant difference")
    print(f"in dropout rates between urban and rural areas.")

# Effect size (Cohen's d)
pooled_std = np.sqrt(((len(group_urban)-1)*group_urban.std()**2 + (len(group_rural)-1)*group_rural.std()**2) / 
                     (len(group_urban) + len(group_rural) - 2))
cohens_d = (group_urban.mean() - group_rural.mean()) / pooled_std

print(f"\nEffect size (Cohen's d): {cohens_d:.4f}")
if abs(cohens_d) < 0.2:
    print("Interpretation: Negligible effect")
elif abs(cohens_d) < 0.5:
    print("Interpretation: Small effect")
elif abs(cohens_d) < 0.8:
    print("Interpretation: Medium effect")
else:
    print("Interpretation: Large effect")

In [None]:
# Step 5: Visualize the comparison

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Boxplot
plot_data = pd.DataFrame({
    'Zone': ['Urban'] * len(group_urban) + ['Rural'] * len(group_rural),
    'Dropout Rate': list(group_urban) + list(group_rural)
})

sns.boxplot(x='Zone', y='Dropout Rate', data=plot_data, ax=axes[0], palette=['steelblue', 'coral'])
axes[0].set_title(f'Dropout Rate: Urban vs Rural\n(p-value = {p_value_1:.4f})', fontsize=13)

# Add significance annotation
if p_value_1 < 0.05:
    axes[0].annotate('*', xy=(0.5, max(group_urban.max(), group_rural.max())), 
                     fontsize=20, ha='center')
    axes[0].annotate('Significant', xy=(0.5, max(group_urban.max(), group_rural.max()) * 0.95), 
                     fontsize=10, ha='center', color='red')

# Distribution overlay
axes[1].hist(group_urban, bins=30, alpha=0.6, label=f'Urban (mean={group_urban.mean():.4f})', color='steelblue')
axes[1].hist(group_rural, bins=30, alpha=0.6, label=f'Rural (mean={group_rural.mean():.4f})', color='coral')
axes[1].axvline(group_urban.mean(), color='steelblue', linestyle='--', linewidth=2)
axes[1].axvline(group_rural.mean(), color='coral', linestyle='--', linewidth=2)
axes[1].set_xlabel('Dropout Rate', fontsize=12)
axes[1].set_ylabel('Frequency', fontsize=12)
axes[1].set_title('Distribution Comparison', fontsize=13)
axes[1].legend()

plt.tight_layout()
plt.show()

### Business Interpretation: Test 1

**Write your interpretation in business language (for a Ministry official):**

*YOUR INTERPRETATION HERE*

---

## Test 2: Enrollment Comparison Between Years

**Research Question:** Is there a statistically significant difference in enrollment between two different years?

**Hypotheses:**
- $H_0$: There is no difference in mean enrollment ($\mu_{year1} = \mu_{year2}$)
- $H_1$: There IS a difference in mean enrollment ($\mu_{year1} \neq \mu_{year2}$)
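
The decision rule is the same for every test: compare the p-value with the chosen significance level. A minimal sketch of that rule as a reusable function follows; `decide` is a hypothetical helper name, and the p-value of 0.03 is an arbitrary example, not a result from the dataset.

```python
def decide(p_value: float, alpha: float = 0.05) -> str:
    """Return the test decision for a given p-value and significance level."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must be between 0 and 1")
    return "REJECT" if p_value < alpha else "FAIL TO REJECT"

# The same p-value can lead to different decisions at different alpha levels
for alpha_level in (0.05, 0.01):
    print(f"alpha = {alpha_level}: {decide(0.03, alpha_level)} the null hypothesis")
```

This also illustrates why the significance level should be fixed before looking at the results.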

In [None]:
# Step 1: Identify the year column and available years

year_col = ___  # e.g., 'ANIO' or first from year_cols

print(f"Year column: {year_col}")
print(f"\nAvailable years:")
print(df[year_col].value_counts().sort_index())

In [None]:
# Step 2: Select two years to compare
# Choose years with sufficient data

year_1 = ___  # e.g., 2019
year_2 = ___  # e.g., 2020

# Extract enrollment for each year
enrollment_year1 = df[df[year_col] == year_1][enrollment_col].dropna()
enrollment_year2 = df[df[year_col] == year_2][enrollment_col].dropna()

print(f"Year {year_1}: n = {len(enrollment_year1)}, mean = {enrollment_year1.mean():.2f}")
print(f"Year {year_2}: n = {len(enrollment_year2)}, mean = {enrollment_year2.mean():.2f}")

In [None]:
# Step 3: Perform the t-test
# YOUR CODE HERE

t_stat_2, p_value_2 = stats.ttest_ind(___, ___)

print(f"=== HYPOTHESIS TEST 2: YEAR {year_1} VS YEAR {year_2} ENROLLMENT ===")
print(f"\nTest statistic (t): {t_stat_2:.4f}")
print(f"P-value: {p_value_2:.6f}")

In [None]:
# Step 4: Make a decision and interpret
# YOUR CODE HERE

alpha = 0.05

print(f"Significance level (alpha): {alpha}")
print(f"P-value: {p_value_2:.6f}")
print()

# YOUR CODE HERE: Write the decision logic



In [None]:
# Step 5: Visualize the comparison
# YOUR CODE HERE: Create a similar visualization as Test 1

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# YOUR CODE HERE

plt.tight_layout()
plt.show()

### Business Interpretation: Test 2

**Write your interpretation in business language:**

*YOUR INTERPRETATION HERE*

---


# Part 3: Business Interpretation

Write an executive summary of your statistical findings for a non-technical audience.

---

## Executive Summary

**Write 5-7 sentences summarizing your findings for a Ministry of Education official:**

*Avoid statistical jargon. Focus on:*
- *What did you find?*
- *How confident are you in these findings?*
- *What does this mean for education policy?*
- *What action should be taken?*

---

*YOUR EXECUTIVE SUMMARY HERE*





---


# Part 4: Visualization of Statistical Significance

Create publication-quality visualizations that effectively communicate your findings.

---

## Visualization 1: Combined Hypothesis Test Results

Create a single figure showing both hypothesis test results with clear significance annotations.
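
Since both panels need the same star notation, the p-value thresholds can live in one small function instead of being repeated per test. A sketch, with `significance_label` as a hypothetical helper name and arbitrary example p-values:

```python
def significance_label(p_value: float) -> str:
    """Map a p-value to the star notation used in the figure legend."""
    if p_value < 0.001:
        return '***'
    if p_value < 0.01:
        return '**'
    if p_value < 0.05:
        return '*'
    return 'ns'

# Arbitrary example p-values (not results from the dataset)
for p in (0.0004, 0.02, 0.30):
    print(p, significance_label(p))
```

The thresholds match the legend printed under the figure (`* p < 0.05, ** p < 0.01, *** p < 0.001, ns = not significant`).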

In [None]:
# Visualization 1: Combined comparison plot

fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# --- Test 1: Urban vs Rural ---
means_test1 = [group_urban.mean(), group_rural.mean()]
stds_test1 = [group_urban.std(), group_rural.std()]
x_test1 = ['Urban', 'Rural']

# Error bars show +/- 1 sample standard deviation (spread of the data, not the CI)
bars1 = axes[0].bar(x_test1, means_test1, yerr=stds_test1, capsize=5,
                    color=['steelblue', 'coral'], edgecolor='black', alpha=0.8)

# Add significance bracket
max_height = max(means_test1) + max(stds_test1) * 1.2
axes[0].plot([0, 0, 1, 1], [max_height*0.95, max_height, max_height, max_height*0.95], 'k-', linewidth=1.5)

if p_value_1 < 0.001:
    sig_text = '***'
elif p_value_1 < 0.01:
    sig_text = '**'
elif p_value_1 < 0.05:
    sig_text = '*'
else:
    sig_text = 'ns'

axes[0].text(0.5, max_height * 1.02, sig_text, ha='center', fontsize=14, fontweight='bold')

axes[0].set_ylabel('Dropout Rate', fontsize=12)
axes[0].set_title(f'Test 1: Urban vs Rural Dropout Rates\np = {p_value_1:.4f}', fontsize=13)
axes[0].set_ylim(0, max_height * 1.15)

# --- Test 2: Year Comparison ---
# YOUR CODE HERE: Create a similar bar chart for Test 2


# Add legend for significance
fig.text(0.5, 0.02, '* p < 0.05, ** p < 0.01, *** p < 0.001, ns = not significant', 
         ha='center', fontsize=10, style='italic')

plt.tight_layout(rect=[0, 0.05, 1, 1])
plt.show()

## Visualization 2: Confidence Intervals Summary

Create a forest plot or similar visualization showing all confidence intervals.
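
In a forest plot each metric gets one row, with a point at the mean and a horizontal bar spanning its interval. One possible approach is sketched below using `ax.errorbar` with `xerr`. The function name `forest_plot`, the labels, and all numbers are placeholders for illustration, not values from the dataset.

```python
import numpy as np
import matplotlib.pyplot as plt

def forest_plot(labels, means, lowers, uppers, ax=None):
    """Draw one horizontal CI bar per metric; returns the axes and the xerr array."""
    means = np.asarray(means, dtype=float)
    # errorbar expects distances from the mean, shape (2, N): row 0 = left, row 1 = right
    xerr = np.vstack([means - np.asarray(lowers, dtype=float),
                      np.asarray(uppers, dtype=float) - means])
    y = np.arange(len(labels))
    if ax is None:
        _, ax = plt.subplots(figsize=(8, 3))
    ax.errorbar(means, y, xerr=xerr, fmt='o', capsize=5,
                color='steelblue', ecolor='gray')
    ax.set_yticks(y)
    ax.set_yticklabels(labels)
    ax.set_ylim(-0.5, len(labels) - 0.5)
    ax.set_xlabel('Rate')
    return ax, xerr

# Placeholder values for illustration only
ax_demo, xerr_demo = forest_plot(['Metric A', 'Metric B'],
                                 means=[0.04, 0.90],
                                 lowers=[0.035, 0.88],
                                 uppers=[0.045, 0.92])
```

Call `plt.show()` afterwards to display the figure. Very narrow intervals can be hard to see when the metrics sit far apart on the axis, in which case one subplot per metric may read better.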

In [None]:
# Visualization 2: Forest plot of confidence intervals
# YOUR CODE HERE

# Create a forest-style plot of the confidence intervals
# (Only the two rate metrics are plotted here; enrollment is on a different scale)

fig, ax = plt.subplots(figsize=(10, 6))

# Data for plotting (using rates which are on similar scales)
metrics = ['Dropout Rate', 'Approval Rate']
means = [mean_dropout, mean_approval]
ci_lowers = [ci_lower_dropout, ci_lower_approval]
ci_uppers = [ci_upper_dropout, ci_upper_approval]

y_positions = np.arange(len(metrics))

# YOUR CODE HERE: Create horizontal error bars (forest plot style)
# Hint: use ax.errorbar with horizontal orientation or ax.hlines + ax.scatter


plt.tight_layout()
plt.show()

---

# Part 5: Reflection

Answer the following questions about your analysis.

---

## Reflection Questions

### 1. What assumptions did you make when using the t-test? Were they reasonable?

*YOUR ANSWER HERE*

### 2. How might the results change if you used a different significance level (e.g., alpha = 0.01)?

*YOUR ANSWER HERE*

### 3. What is the difference between statistical significance and practical significance in your findings?

*YOUR ANSWER HERE*

### 4. What additional data or analysis would strengthen your conclusions?

*YOUR ANSWER HERE*

---

## Final Checklist

Before submitting, verify:

- [ ] All cells have been executed (Kernel > Restart & Run All)
- [ ] Part 1: Three confidence intervals calculated and interpreted
- [ ] Part 2: Two hypothesis tests completed with conclusions
- [ ] Part 3: Executive summary written in business language
- [ ] Part 4: At least 2 publication-quality visualizations created
- [ ] Part 5: Reflection questions answered

---

*Week 13 Workshop - Data Analytics Course - Universidad Cooperativa de Colombia*