# Week 4: Statistical Analysis & Hypothesis Testing

**Course:** Applied ML Foundations for SaaS Analytics  
**Week Focus:** Move from descriptive to inferential statistics: test hypotheses, quantify uncertainty, and make data-driven decisions.

---

## 🎯 Learning Objectives

By the end of this week, you will:
- Understand statistical significance and p-values
- Conduct t-tests, chi-squared tests, and ANOVA
- Compare customer segments statistically
- Build confidence intervals for key metrics
- Avoid common statistical pitfalls (multiple comparison bias, p-hacking)
- Interpret results for business decisions

```python
from IPython.display import HTML
HTML('''
<style>
details {
  margin: 10px 0;
  padding: 8px 12px;
  border: 1px solid #d9e2ec;
  border-radius: 8px;
  background: #f9fbfd;
}
details summary {
  font-weight: 600;
  color: #0056b3;
  cursor: pointer;
}
details[open] {
  background: #f1f7ff;
  border-color: #c3d4f0;
}
details pre {
  background: #f8f9fa;
  padding: 8px;
  border-radius: 6px;
}
</style>
''')
```

## 🏢 Scenario: Is the Premium Plan Worth It?

Your Product Manager proposes a new premium plan. Early data shows:
- Premium: 8 out of 50 customers churned (16%)
- Standard: 12 out of 60 customers churned (20%)

Question: Is this difference real or just noise? Should we roll it out company-wide?

Use hypothesis testing to decide.
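As a first pass, the scenario counts can be placed in a 2x2 contingency table and tested directly. The sketch below assumes the only data available are the four counts quoted above; the table layout (rows are plans, columns are churned/retained) is a choice made for illustration.

```python
from scipy.stats import chi2_contingency, fisher_exact

# Rows: plan (Premium, Standard); columns: (churned, retained).
# Counts come straight from the scenario: 8 of 50 and 12 of 60 churned.
table = [[8, 50 - 8],
         [12, 60 - 12]]

# chi2_contingency applies Yates' continuity correction to 2x2 tables by default.
chi2, p_chi2, dof, expected = chi2_contingency(table)

# Fisher's exact test is a common cross-check when counts are small.
odds_ratio, p_fisher = fisher_exact(table)

print(f"Chi-squared = {chi2:.3f}, dof = {dof}, p = {p_chi2:.3f}")
print(f"Fisher exact p = {p_fisher:.3f}")
print("Reject H0 at alpha = 0.05?", p_chi2 < 0.05)
```

With samples this small, neither test reaches significance at 0.05, so the 16% vs 20% gap on its own is not strong evidence that the plans differ.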

## ✍️ Hands-on Exercises

1. **Chi-Squared Test**: Is churn significantly different across plan tiers? (χ² test)
2. **T-Test**: Do premium customers spend significantly more? Compare ARPU by segment
3. **ANOVA**: Are feature adoption rates significantly different across regions?
4. **Confidence Intervals**: Estimate churn rate with 95% CI for each segment
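
For the confidence-interval exercise, one possible starting point is the Wilson score interval, which tends to behave better than the plain normal approximation for small samples. The helper name `wilson_ci` is hypothetical, and the inputs are the scenario counts (8 of 50 and 12 of 60), not the full dataset.

```python
import math
from scipy.stats import norm

def wilson_ci(successes, n, confidence=0.95):
    """Wilson score interval for a binomial proportion (illustrative helper)."""
    z = norm.ppf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width

for plan, churned, total in [("Premium", 8, 50), ("Standard", 12, 60)]:
    lo, hi = wilson_ci(churned, total)
    print(f"{plan}: {churned / total:.1%} churn, 95% CI [{lo:.1%}, {hi:.1%}]")
```

The two intervals overlap heavily, which is consistent with a difference that cannot be distinguished from noise at this sample size.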

<details>
<summary>💡 Hint: Hypothesis Testing Framework</summary>

**Setup:**
1. Define H₀ (null): "No difference" vs H₁ (alternative): "Difference exists"
2. Choose significance level α (usually 0.05)
3. Compute test statistic (t, χ², F, etc.)
4. Compare p-value to α: if p < α, reject H₀

**Common Tests:**
- **Chi-squared**: categorical vs categorical (plan_type vs churned)
- **T-test**: continuous vs binary (ARPU vs churned)
- **ANOVA**: continuous across 3+ groups (adoption by region)

**Key insight:** Small sample? Use power analysis. Multiple tests? Adjust p-value (Bonferroni).

</details>
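
The t-test and ANOVA entries in the hint can be sketched with SciPy as follows. All data here are synthetic and generated only for illustration; the group means, spreads, sizes, and region names are assumptions, not values from the course dataset.

```python
import numpy as np
from scipy.stats import ttest_ind, f_oneway

rng = np.random.default_rng(42)  # fixed seed so the sketch is reproducible

# Synthetic ARPU (illustrative only): retained vs churned customers.
arpu_retained = rng.normal(loc=55.0, scale=12.0, size=200)
arpu_churned = rng.normal(loc=50.0, scale=15.0, size=80)

# Welch's t-test (equal_var=False) does not assume equal group variances.
t_stat, p_t = ttest_ind(arpu_retained, arpu_churned, equal_var=False)
print(f"Welch t = {t_stat:.3f}, p = {p_t:.4f}")

# Synthetic feature-adoption scores for three hypothetical regions.
adoption = {
    "NA": rng.normal(0.62, 0.10, size=120),
    "EMEA": rng.normal(0.60, 0.10, size=100),
    "APAC": rng.normal(0.55, 0.10, size=90),
}

# One-way ANOVA: H0 is that all region means are equal.
f_stat, p_f = f_oneway(*adoption.values())
print(f"ANOVA F = {f_stat:.3f}, p = {p_f:.4f}")
```

A significant ANOVA only says that at least one group mean differs; identifying which groups differ requires a post-hoc comparison.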

<details>
<summary>✅ Solution: Chi-Squared Test for Plan Churn</summary>

```python
import pandas as pd
import numpy as np
from scipy.stats import chi2_contingency

# Load subscriptions
subs = pd.read_csv('../data/subscriptions.csv', parse_dates=['signup_date','churn_date'])
subs['is_churned'] = subs['churn_date'].notna()

# Create contingency table: plan_tier vs is_churned
contingency = pd.crosstab(subs['plan_tier'], subs['is_churned'])
print("Contingency Table:")
print(contingency)
print()

# Chi-squared test
chi2, p_value, dof, expected = chi2_contingency(contingency)

print(f"Chi-squared statistic: {chi2:.4f}")
print(f"P-value: {p_value:.6f}")
print(f"Degrees of freedom: {dof}")
print()

alpha = 0.05
if p_value < alpha:
    print(f"✓ SIGNIFICANT (p={p_value:.4f} < {alpha})")
    print("  Conclusion: Churn rates differ significantly across plan types")
else:
    print(f"✗ NOT SIGNIFICANT (p={p_value:.4f} >= {alpha})")
    print("  Conclusion: No evidence of difference in churn rates")

# Show churn rates for context
print("\nChurn Rate by Plan:")
churn_rates = subs.groupby('plan_tier')['is_churned'].agg(['sum', 'count'])
churn_rates['churn_rate'] = churn_rates['sum'] / churn_rates['count']
print(churn_rates[['churn_rate']])
```

**Key insight:** Statistical significance ≠ practical significance. A 2-percentage-point difference with p=0.03 might not warrant action.

</details>

```python
import pandas as pd
import numpy as np
from scipy import stats
from scipy.stats import chi2_contingency

print("=" * 70)
print("WEEK 4: STATISTICAL HYPOTHESIS TESTING DEMO")
print("=" * 70)

# Load data; a customer counts as churned when churn_date is populated
subs = pd.read_csv('../data/subscriptions.csv', parse_dates=['signup_date','churn_date'])
subs['is_churned'] = subs['churn_date'].notna()

# 1. Chi-squared test for independence
print("\n1. CHI-SQUARED TEST: Does churn differ by plan type?")
print("-" * 70)
contingency = pd.crosstab(subs['plan_tier'], subs['is_churned'])
chi2, p_value, dof, expected = chi2_contingency(contingency)
print(f"Chi-squared = {chi2:.4f}, p-value = {p_value:.6f}")

churn_by_plan = subs.groupby('plan_tier')['is_churned'].mean()
print("\nChurn rates by plan:")
for plan, rate in churn_by_plan.items():
    print(f"  {plan}: {rate:.1%}")

if p_value < 0.05:
    print("✓ RESULT: Statistically significant difference (p < 0.05)")
else:
    print("✗ RESULT: No significant difference (p >= 0.05)")

# 2. Confidence interval for churn rate
print("\n2. CONFIDENCE INTERVALS: Churn rates with uncertainty")
print("-" * 70)

for plan in subs['plan_tier'].unique():
    plan_data = subs[subs['plan_tier'] == plan]
    churned = plan_data['is_churned'].sum()
    total = len(plan_data)
    rate = churned / total

    # Approximate 95% CI: central 95% range of Binomial(total, rate) counts,
    # rescaled to a proportion (a plug-in approximation, not an exact interval)
    ci_lower, ci_upper = stats.binom.interval(0.95, total, rate)
    ci_lower /= total
    ci_upper /= total

    print(f"{plan}:")
    print(f"  Churn rate: {rate:.1%}")
    print(f"  95% CI: [{ci_lower:.1%}, {ci_upper:.1%}]")

print("\n" + "=" * 70)
```

## 📚 Key Concepts: Statistical Decision Making

### P-Values Explained
- **p-value**: Probability of seeing data at least as extreme as what was observed, if the null hypothesis were true
- **Small p-value** (< 0.05): Unlikely under the null; reject H₀
- **Large p-value** (≥ 0.05): Plausible under the null; fail to reject H₀
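
One way to make the p-value definition concrete is a permutation test: shuffle the plan labels many times and count how often a difference at least as large as the observed one appears by chance. This is a sketch using the scenario counts (8 of 50 Premium and 12 of 60 Standard customers churned); the number of permutations and the seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# 1 = churned, 0 = retained, built from the scenario counts.
premium = np.array([1] * 8 + [0] * 42)
standard = np.array([1] * 12 + [0] * 48)
observed_diff = abs(premium.mean() - standard.mean())

# Under H0 the plan label carries no information, so labels can be shuffled.
pooled = np.concatenate([premium, standard])
n_perm = 10_000
at_least_as_extreme = 0
for _ in range(n_perm):
    shuffled = rng.permutation(pooled)
    diff = abs(shuffled[:50].mean() - shuffled[50:].mean())
    if diff >= observed_diff - 1e-12:  # small tolerance for float comparison
        at_least_as_extreme += 1

p_value = at_least_as_extreme / n_perm
print(f"Observed difference: {observed_diff:.3f}")
print(f"Permutation p-value: {p_value:.3f}")
```

The resulting p-value is simply the share of shuffles that look at least as extreme as the real data, which is what "probability under the null" means in practice.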

### Common Mistakes
1. **P-hacking**: Running 100 tests, finding 5 "significant" ones by chance
   - **Solution**: Pre-register hypotheses, adjust p-values for multiple comparisons
2. **Statistical ≠ Practical significance**
   - **Example**: 1M users, 2% vs 2.1% churn → p < 0.001, but is a 0.1 percentage point difference worth acting on?
3. **Confounding variables**: Association ≠ causation
   - **Example**: Premium users might have different product fit, not plan quality
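
The first two mistakes can be reproduced in a few lines. The sketch below is a simulation under stated assumptions: part 1 draws both groups from the same distribution, so any "significant" result is a false positive; part 2 assumes the 1M users are split evenly into two groups of 500,000.

```python
import numpy as np
from scipy.stats import ttest_ind, chi2_contingency

rng = np.random.default_rng(7)
alpha = 0.05
n_tests = 100

# 1. Multiple comparisons: 100 t-tests where H0 is true by construction.
p_values = np.array([
    ttest_ind(rng.normal(size=50), rng.normal(size=50)).pvalue
    for _ in range(n_tests)
])
raw_hits = int((p_values < alpha).sum())
# Bonferroni: compare each p-value against alpha / number of tests.
bonferroni_hits = int((p_values < alpha / n_tests).sum())
print(f"False positives at alpha={alpha}: {raw_hits} of {n_tests}")
print(f"False positives after Bonferroni: {bonferroni_hits} of {n_tests}")

# 2. Large sample: 2.0% vs 2.1% churn with 500,000 users per group (assumed split).
n = 500_000
table = [[10_000, n - 10_000],
         [10_500, n - 10_500]]
chi2, p_large, _, _ = chi2_contingency(table)
abs_diff = 10_500 / n - 10_000 / n
print(f"Large-sample p = {p_large:.5f}, absolute difference = {abs_diff:.2%}")
```

Part 2 shows a tiny p-value attached to a 0.1 percentage point gap, so the decision should rest on the business value of that gap, not on the p-value alone.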

## 🤔 Reflection & Application

**Question 1:** Your test shows p = 0.07. Can you conclude no effect?
- No! You can only say "insufficient evidence at α=0.05 level"
- With bigger sample, p might drop below 0.05
- Absence of evidence ≠ evidence of absence
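
To see how sample size alone moves the p-value, the sketch below holds the observed churn rates fixed at the scenario values (16% vs 20%) and scales the counts up. The scaled samples are hypothetical; real data would not keep the rates exactly constant.

```python
from scipy.stats import chi2_contingency

# (churned, retained) for Premium and Standard, from the scenario counts.
base_table = [[8, 42], [12, 48]]

results = []
for scale in [1, 5, 10, 20, 50]:
    table = [[cell * scale for cell in row] for row in base_table]
    # correction=False turns off Yates' correction so the statistic
    # grows in direct proportion to the sample size.
    chi2, p, _, _ = chi2_contingency(table, correction=False)
    results.append((scale, p))
    print(f"n = {110 * scale:>5}: chi2 = {chi2:6.3f}, p = {p:.4f}")
```

The same 4-point gap goes from clearly non-significant to clearly significant as n grows, which is why a large p-value from a small sample says little about whether an effect exists.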

**Question 2:** When should you run statistical tests on SaaS data?
- **Yes:** A/B tests, comparing segments, validating model assumptions
- **No:** Exploring data (causes p-hacking), sample sizes < 30 per group

**Question 3:** How do you choose α (significance level)?
- Standard: 0.05 (5% false positive rate)
- Medical studies: 0.01 (stricter)
- Exploratory analysis: 0.10 (more lenient)

## 📝 Practice Assignment

**Problem:** Customers report different satisfaction across regions. Using average customer lifetime as the measurable outcome, test whether the regional difference is statistically significant.
1. Load subscriptions data
2. For each region, compute average customer lifetime
3. Conduct ANOVA: is there significant difference across regions?
4. Post-hoc test: which specific regions differ?
5. Report with confidence intervals

## 🔗 Next Steps

In Week 5, we'll prepare data scientifically for modeling through feature engineering and preprocessing.