# Week 3: Data Visualization & Exploratory Data Analysis

**Course:** Applied ML Foundations for SaaS Analytics  
**Week Focus:** Create compelling visualizations to uncover patterns, trends, and anomalies in customer and product data.

---

## 🎯 Learning Objectives

By the end of this week, you will:
- Create effective visualizations for different data types
- Build dashboards that tell data-driven stories
- Detect trends, seasonality, and anomalies visually
- Compare segments and cohorts graphically
- Use visualization to guide statistical analysis and modeling

## 📊 Real-World Context

Visualizations are your primary tool for:
- **Communicating with stakeholders**: a well-chosen chart usually conveys a trend to executives faster than a table of numbers
- **Hypothesis generation**: seeing patterns guides your analysis
- **Quality assurance**: spotting data anomalies before modeling
- **Storytelling**: showing before/after, winners/losers, trends over time

```python
from IPython.display import HTML

# Style the collapsible <details> hint and solution blocks used in this notebook
HTML('''
<style>
details {
  margin: 10px 0;
  padding: 8px 12px;
  border: 1px solid #d9e2ec;
  border-radius: 8px;
  background: #f9fbfd;
}
details summary {
  font-weight: 600;
  color: #0056b3;
  cursor: pointer;
}
details[open] {
  background: #f1f7ff;
  border-color: #c3d4f0;
}
details pre {
  background: #f8f9fa;
  padding: 8px;
  border-radius: 6px;
}
</style>
''')
```

## 🏢 Scenario: Dashboard for Executive Review

Your CFO wants a weekly dashboard showing:
1. Churn trends by plan type and region
2. Feature adoption curves for new customers
3. Revenue impact: MRR by cohort
4. Customer retention: 30/60/90 day retention rates
5. Alerts: which regions/segments are declining?

You need compelling visuals that fit on 1 page and drive decision-making.

## 📚 Key Concepts: Visualization for Analytics

### Chart Types & When to Use Them
- **Line chart**: Trends over time (DAU, revenue, adoption)
- **Bar chart**: Comparing categories (plan type, region, feature)
- **Scatter plot**: Relationships between two continuous variables
- **Heatmap**: Patterns across two dimensions (cohort × time)
- **Box plot**: Distribution comparison across groups
- **Histogram**: Distribution of a single variable
- **Pie chart**: Generally avoid. Angles and areas are harder to compare than bar lengths, so use a bar chart instead
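
To make the histogram versus box plot distinction above concrete, here is a minimal sketch. The tenure data is synthetic (randomly generated for illustration only), and the column names `plan_tier` and `tenure_days` are assumptions, not columns confirmed in the course dataset.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic data for illustration only: tenure in days for two plan tiers.
rng = np.random.default_rng(seed=0)
tenure = pd.DataFrame({
    "plan_tier": np.repeat(["free", "pro"], 200),
    "tenure_days": np.concatenate([rng.exponential(60, 200), rng.exponential(180, 200)]),
})

fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(11, 4))

# Histogram: the shape of a single variable across all users.
ax_hist.hist(tenure["tenure_days"], bins=30, color="steelblue")
ax_hist.set_title("Distribution of Tenure")
ax_hist.set_xlabel("Tenure (days)")
ax_hist.set_ylabel("Number of users")

# Box plot: the same variable compared across groups.
grouped = {name: g["tenure_days"].to_numpy() for name, g in tenure.groupby("plan_tier")}
ax_box.boxplot(list(grouped.values()))
ax_box.set_xticks(range(1, len(grouped) + 1))
ax_box.set_xticklabels(list(grouped.keys()))
ax_box.set_title("Tenure by Plan Tier")
ax_box.set_xlabel("Plan tier")
ax_box.set_ylabel("Tenure (days)")

fig.tight_layout()
```

The histogram answers "what does tenure look like overall?", while the box plot answers "how does tenure differ between plans?". Call `plt.show()` if you run this outside a notebook.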

### Visualization Libraries
- **Matplotlib**: Low-level, very flexible
- **Seaborn**: Higher-level, statistical visualizations
- **Plotly**: Interactive, beautiful defaults
- **Pandas plotting**: Quick exploratory plots

### Effective Visualization Principles
1. **Simplicity**: Remove clutter, focus on the insight
2. **Clarity**: Clear labels, legends, and units
3. **Color**: Use color purposefully (accessibility first)
4. **Context**: Show baselines, targets, comparisons
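
One way the four principles might be applied to a single chart is sketched below. The churn rates and the 15% target are made-up values for illustration, and `plot_churn_with_context` is a hypothetical helper name.

```python
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.ticker import PercentFormatter

# Hypothetical churn rates and target, made up purely for illustration.
churn_by_plan = pd.Series({"free": 0.32, "starter": 0.18, "pro": 0.09})
target = 0.15


def plot_churn_with_context(rates, target):
    """Bar chart that applies simplicity, clarity, purposeful color, and context."""
    fig, ax = plt.subplots(figsize=(6, 4))
    # Color: neutral grey by default, one colorblind-safe accent for bars above target.
    colors = ["#d55e00" if rate > target else "#999999" for rate in rates]
    ax.bar(rates.index, rates.values, color=colors)
    # Context: show the target as a dashed reference line.
    ax.axhline(target, color="black", linestyle="--", linewidth=1,
               label=f"Target ({target:.0%})")
    # Clarity: title, axis labels, and percent units on the y-axis.
    ax.set_title("Churn Rate by Plan Tier")
    ax.set_xlabel("Plan tier")
    ax.set_ylabel("Churn rate")
    ax.yaxis.set_major_formatter(PercentFormatter(xmax=1))
    ax.legend()
    # Simplicity: remove the top and right borders.
    for side in ("top", "right"):
        ax.spines[side].set_visible(False)
    return fig, ax


fig, ax = plot_churn_with_context(churn_by_plan, target)
```

Using a single accent color draws the eye to the bars that need attention, rather than giving every category its own hue.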

<details>
<summary>💡 Hint: Building a Retention Cohort Chart</summary>

**Steps:**
1. Create signup cohorts (by month or week)
2. For each cohort, track what % return after 30/60/90 days
3. Visualize as heatmap: cohorts on rows, days on columns, % in cells

**Common mistakes:**
- Forgetting to normalize: compare cohorts only up to the age they have actually reached
- Filtering incorrectly: "30-day retention" means still active 30 or more days after signup
- Ignoring cohort maturity: recent cohorts have not had time to churn yet, so their retention looks inflated (bias!)

</details>
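
The steps and pitfalls in the hint can be sketched as follows. This is one possible approach, not the reference solution: it assumes `signup_date` and `churn_date` columns as in `subscriptions.csv`, treats a missing `churn_date` as "still active", and uses a tiny made-up sample plus a hypothetical `snapshot` date. Cohorts too young to be measured at a given horizon are left blank rather than reported with inflated retention.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


def retention_table(subs, snapshot, horizons=(30, 60, 90)):
    """Share of each signup-month cohort still subscribed N days after signup."""
    cohort = subs["signup_date"].dt.to_period("M")
    # Days from signup to churn; NaN for users who have not churned.
    tenure = (subs["churn_date"] - subs["signup_date"]).dt.days
    # Days from signup to the snapshot date (how long we could observe the user).
    age = (snapshot - subs["signup_date"]).dt.days
    columns = {}
    for n in horizons:
        observable = age >= n                      # old enough to measure at day n
        retained = tenure.isna() | (tenure >= n)   # not churned before day n
        # Users who are not yet observable become NaN and drop out of the mean.
        columns[f"{n}d"] = retained.astype(float).where(observable).groupby(cohort).mean()
    return pd.DataFrame(columns)


def plot_retention_heatmap(table):
    """Heatmap with cohorts on rows, horizons on columns, retention in cells."""
    fig, ax = plt.subplots(figsize=(5, 4))
    image = ax.imshow(table.to_numpy(dtype=float), cmap="Blues", vmin=0, vmax=1, aspect="auto")
    ax.set_xticks(range(table.shape[1]))
    ax.set_xticklabels(table.columns)
    ax.set_yticks(range(table.shape[0]))
    ax.set_yticklabels(table.index.astype(str))
    for i in range(table.shape[0]):
        for j in range(table.shape[1]):
            value = table.iat[i, j]
            if not np.isnan(value):  # leave immature cohorts blank
                ax.text(j, i, f"{value:.0%}", ha="center", va="center",
                        color="white" if value > 0.5 else "black")
    ax.set_xlabel("Days since signup")
    ax.set_ylabel("Signup month")
    ax.set_title("Retention by Signup Cohort")
    fig.colorbar(image, ax=ax, label="Retention rate")
    return fig, ax


# Made-up sample for illustration only.
subs = pd.DataFrame({
    "signup_date": pd.to_datetime(["2024-01-05", "2024-01-20", "2024-02-10",
                                   "2024-02-15", "2024-03-20"]),
    "churn_date": pd.to_datetime(["2024-01-25", None, "2024-04-01", None, None]),
})
snapshot = pd.Timestamp("2024-04-30")  # hypothetical "as of" date

table = retention_table(subs, snapshot)
fig, ax = plot_retention_heatmap(table)
```

In this sample the March cohort has no 60-day or 90-day value because nobody in it is old enough yet, which is exactly the normalization the hint warns about.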

<details>
<summary>✅ Solution: Churn Trends by Plan & Region</summary>

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load data
subs = pd.read_csv('../data/subscriptions.csv', parse_dates=['signup_date','churn_date'])

# Prepare data
subs['is_churned'] = subs['churn_date'].notna()
subs['signup_month'] = subs['signup_date'].dt.to_period('M')

# Churn rate by plan type
churn_by_plan = subs.groupby('plan_tier')['is_churned'].mean()

# Churn rate by signup cohort: share of each signup month that has churned so far
churn_timeline = subs.groupby('signup_month')['is_churned'].mean()

# Create visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Churn by plan
churn_by_plan.plot(kind='bar', ax=axes[0], color='steelblue')
axes[0].set_title('Churn Rate by Plan Type')
axes[0].set_ylabel('Churn Rate')
axes[0].set_xlabel('Plan')

# Plot 2: Churn by signup month (recent cohorts have had less time to churn)
churn_timeline.plot(ax=axes[1], marker='o', linewidth=2)
axes[1].set_title('Churn Rate by Signup Month')
axes[1].set_ylabel('Churn Rate')
axes[1].set_xlabel('Signup Month')

plt.tight_layout()
plt.show()

print("Churn by Plan:")
print(churn_by_plan.sort_values(ascending=False))
```

**Key insight:** If the free tier churns more than the paid tiers, try engagement tactics (onboarding, activation prompts) before pricing changes, since free users are not leaving over price.

</details>
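
The solution above breaks churn down by plan and by signup month. A regional breakdown could follow the same pattern, as in this sketch. It assumes the subscriptions data has a `region` column, which is not confirmed by the code above, and uses a small made-up sample in place of the real file.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up sample for illustration; `region` is an assumed column name.
subs = pd.DataFrame({
    "plan_tier": ["free", "free", "free", "pro", "pro", "pro", "free", "pro"],
    "region": ["EMEA", "EMEA", "APAC", "EMEA", "APAC", "APAC", "APAC", "EMEA"],
    "is_churned": [True, False, True, False, False, True, True, False],
})

# Rows = plan tier, columns = region, cells = churn rate.
churn_matrix = (
    subs.groupby(["plan_tier", "region"])["is_churned"].mean().unstack("region")
)
# Segment sizes, so that tiny segments are not over-interpreted.
segment_sizes = subs.groupby(["plan_tier", "region"]).size().unstack("region")

# Grouped bars: one cluster per plan, one bar per region.
fig, ax = plt.subplots(figsize=(7, 4))
churn_matrix.plot(kind="bar", ax=ax, rot=0)
ax.set_title("Churn Rate by Plan Tier and Region")
ax.set_xlabel("Plan tier")
ax.set_ylabel("Churn rate")
ax.legend(title="Region")
fig.tight_layout()

print(segment_sizes)
```

Printing `segment_sizes` next to the chart is a cheap guard: a 100% churn rate in a segment of two users is not the same signal as in a segment of two thousand.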

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Load data
subs = pd.read_csv('../data/subscriptions.csv', parse_dates=['signup_date','churn_date'])
feature_usage = pd.read_csv('../data/feature_usage.csv')

print("=" * 70)
print("WEEK 3: DATA VISUALIZATION DEMO")
print("=" * 70)

# Prepare churn analysis: a subscription counts as churned once it has a churn date
subs['is_churned'] = subs['churn_date'].notna()
subs['signup_month'] = subs['signup_date'].dt.to_period('M')

# 1. Churn by plan
print("\n1. CHURN RATE BY PLAN TYPE")
churn_by_plan = subs.groupby('plan_tier')['is_churned'].mean().sort_values(ascending=False)
print(churn_by_plan.round(3))

# 2. Churn by signup cohort (recent cohorts have had less time to churn)
print("\n2. CHURN RATE BY SIGNUP MONTH")
churn_timeline = subs.groupby('signup_month')['is_churned'].mean()
print(churn_timeline.round(3).tail(6))

# 3. Feature adoption vs churn
print("\n3. FEATURE ADOPTION ANALYSIS")
# Number of distinct features each user has used
user_feature_count = feature_usage.groupby('user_id')['feature_name'].nunique()
subs_merged = subs.merge(
    user_feature_count.reset_index().rename(columns={'feature_name': 'feature_count'}),
    on='user_id',
    how='left'
)
# Users with no usage rows have adopted zero features
subs_merged['feature_count'] = subs_merged['feature_count'].fillna(0)

adoption_vs_churn = subs_merged.groupby('feature_count')['is_churned'].agg(['mean', 'count'])
print(adoption_vs_churn)

# Summarize with numbers computed from the data rather than fixed statements
adoption_corr = subs_merged['feature_count'].corr(subs_merged['is_churned'].astype(float))

print("\n4. INSIGHTS")
print(f"   • Gap between highest- and lowest-churn plans: {churn_by_plan.max() - churn_by_plan.min():.1%} (percentage points)")
print(f"   • Correlation of feature count with churn: {adoption_corr:.2f} (negative = heavier adopters churn less)")
print("   • Next: plot these tables to decide where to focus acquisition & engagement efforts")

print("=" * 70)
```
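
The demo prints the feature adoption table but does not chart it. A minimal sketch of one way to plot it follows. The `subs_merged` frame here is a made-up stand-in with the same two columns, so the numbers are illustrative only.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up stand-in for `subs_merged`; values are for illustration only.
subs_merged = pd.DataFrame({
    "feature_count": [0, 0, 0, 1, 1, 2, 2, 3],
    "is_churned": [True, True, False, True, False, False, True, False],
})

adoption_vs_churn = subs_merged.groupby("feature_count")["is_churned"].agg(["mean", "count"])

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(adoption_vs_churn.index, adoption_vs_churn["mean"], marker="o")
# Annotate each point with its group size so thin groups are visible.
for feature_count, row in adoption_vs_churn.iterrows():
    ax.annotate(f"n={int(row['count'])}", (feature_count, row["mean"]),
                textcoords="offset points", xytext=(0, 8), ha="center")
ax.set_title("Churn Rate by Number of Features Adopted")
ax.set_xlabel("Distinct features used")
ax.set_ylabel("Churn rate")
ax.set_ylim(0, 1.1)
fig.tight_layout()
```

Labeling each point with `n` helps readers discount points that rest on only a handful of users.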

## 🤔 Reflection & Application

**Question 1:** If churn is higher in a region, what would you investigate?
- Product: Does the product work well for that region's use case?
- Pricing: Is pricing too high for the market?
- Support: Are they getting adequate help/response times?
- Competition: Did a competitor enter that market?
- Sales: Were these customers a good fit for the product when they were sold?

**Question 2:** Why might "feature adoption" correlate with lower churn?
- Causation: Using more features → more value → less likely to leave
- Selection: We acquired already-engaged customers who naturally use more features
- Both could be true! Deeper analysis is needed to separate them
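
A first, rough step toward that deeper analysis is to check whether the pattern holds within each segment, as sketched below. This is a stratified comparison on made-up data, with an arbitrary "2 or more features" cutoff chosen for illustration. It can reveal confounding by plan tier, but it does not establish causation.

```python
import pandas as pd

# Made-up sample for illustration only.
df = pd.DataFrame({
    "plan_tier": ["free"] * 5 + ["pro"] * 5,
    "feature_count": [0, 1, 3, 0, 2, 2, 3, 1, 4, 0],
    "is_churned": [True, False, False, True, True, False, False, True, False, False],
})

# Arbitrary illustrative cutoff: 2 or more distinct features counts as "high" adoption.
df["adoption"] = df["feature_count"].ge(2).map({True: "high", False: "low"})

# Churn rate for low vs high adopters, computed separately within each plan tier.
stratified = df.groupby(["plan_tier", "adoption"])["is_churned"].mean().unstack("adoption")
print(stratified.round(2))
```

If high adopters churn less inside every plan tier, plan mix alone does not explain the correlation. If the gap vanishes within tiers, the overall pattern may mostly reflect which customers were acquired.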

**Question 3:** How do you communicate findings to non-technical stakeholders?
- Avoid jargon; use business terminology
- Lead with the headline insight
- Use visuals; minimize tables
- Provide actionable next steps

## 📝 Practice Assignment

**Problem:** Create a 1-page dashboard showing:
1. Line chart: Monthly churn trend
2. Bar chart: Churn by plan tier
3. Scatter plot: Feature adoption vs retention
4. Text summary: Top 3 insights and recommended actions

**Deliverable:** Jupyter notebook with clean visualizations and narrative.
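
As a starting point for the layout, the four panels could be arranged in a 2×2 grid like this. It is a skeleton under stated assumptions, not a finished answer: all numbers are placeholders, `build_dashboard` is a hypothetical helper name, and retention is represented by an assumed `tenure_days` column.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Placeholder values for layout purposes only; replace with your own results.
monthly_churn = pd.Series([0.12, 0.10, 0.11, 0.09],
                          index=["2024-01", "2024-02", "2024-03", "2024-04"])
churn_by_plan = pd.Series({"free": 0.30, "starter": 0.15, "pro": 0.08})
users = pd.DataFrame({"feature_count": [0, 1, 2, 3, 4, 5],
                      "tenure_days": [20, 45, 60, 150, 200, 310]})
insights = ["Replace with insight 1", "Replace with insight 2", "Replace with insight 3"]


def build_dashboard(monthly_churn, churn_by_plan, users, insights):
    """Arrange line, bar, scatter, and text panels on one page."""
    fig, axes = plt.subplots(2, 2, figsize=(11, 8.5))  # roughly one letter-size page

    # 1. Line chart: monthly churn trend
    axes[0, 0].plot(monthly_churn.index, monthly_churn.values, marker="o")
    axes[0, 0].set_title("Monthly Churn Trend")
    axes[0, 0].set_ylabel("Churn rate")

    # 2. Bar chart: churn by plan tier
    axes[0, 1].bar(churn_by_plan.index, churn_by_plan.values, color="steelblue")
    axes[0, 1].set_title("Churn by Plan Tier")
    axes[0, 1].set_ylabel("Churn rate")

    # 3. Scatter plot: feature adoption vs retention (tenure in days)
    axes[1, 0].scatter(users["feature_count"], users["tenure_days"], alpha=0.7)
    axes[1, 0].set_title("Feature Adoption vs Retention")
    axes[1, 0].set_xlabel("Distinct features used")
    axes[1, 0].set_ylabel("Tenure (days)")

    # 4. Text summary: hide the axes and write the insights as a numbered list
    axes[1, 1].axis("off")
    summary = "Top insights\n\n" + "\n".join(
        f"{i}. {text}" for i, text in enumerate(insights, start=1)
    )
    axes[1, 1].text(0.0, 1.0, summary, va="top", ha="left", fontsize=11,
                    transform=axes[1, 1].transAxes)

    fig.suptitle("Churn Dashboard")
    fig.tight_layout()
    return fig, axes


fig, axes = build_dashboard(monthly_churn, churn_by_plan, users, insights)
```

Keeping the text summary inside the figure means the whole page can be exported as one image, for example with `fig.savefig("dashboard.png", dpi=150)`.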

## ✍️ Hands-on Exercises

1. **Retention Cohort Analysis**: Build a heatmap showing 30/60/90-day retention by signup month
2. **Feature Adoption Curves**: Plot cumulative feature adoption over customer lifetime for different segments
3. **Revenue Impact**: Visualize MRR by cohort and show correlation with engagement metrics
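
For exercise 2, one possible starting point is sketched below. It assumes the feature usage log has a timestamp column, here called `used_at`. That name is hypothetical and may not match `feature_usage.csv`. The sample events are made up, and the curve does not yet correct for users who signed up too recently to be observed over the full time axis.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up sample data; `used_at` is an assumed column name.
events = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 3],
    "feature_name": ["export", "export", "alerts", "export", "api", "alerts"],
    "used_at": pd.to_datetime(["2024-01-02", "2024-01-05", "2024-01-08",
                               "2024-01-10", "2024-01-13", "2024-01-22"]),
})
signups = pd.DataFrame({
    "user_id": [1, 2, 3],
    "signup_date": pd.to_datetime(["2024-01-01", "2024-01-10", "2024-01-15"]),
    "plan_tier": ["free", "pro", "pro"],
})

# First time each user touched each feature, expressed in days since signup.
first_use = events.groupby(["user_id", "feature_name"], as_index=False)["used_at"].min()
first_use = first_use.merge(signups, on="user_id")
first_use["day"] = (first_use["used_at"] - first_use["signup_date"]).dt.days

# New adoptions per segment per day, accumulated and divided by segment size.
users_per_segment = signups.groupby("plan_tier")["user_id"].nunique()
daily = (
    first_use.groupby(["plan_tier", "day"]).size()
    .unstack("plan_tier", fill_value=0)
    .sort_index()
)
curves = daily.cumsum() / users_per_segment  # average features adopted per user

fig, ax = plt.subplots(figsize=(7, 4))
curves.plot(ax=ax, drawstyle="steps-post")
ax.set_title("Cumulative Feature Adoption by Plan Tier")
ax.set_xlabel("Days since signup")
ax.set_ylabel("Avg. distinct features adopted per user")
ax.legend(title="Plan tier")
fig.tight_layout()
```

Counting only the first use of each feature keeps repeat usage from inflating the curve, so the y-axis reads as "how many features has a typical user tried by day N".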

## 🔗 Next Steps

In Week 4, we'll add statistical rigor: hypothesis testing, significance, and confidence to our visual insights.