# Churn Prediction: Exploratory Data Analysis (EDA)

In this notebook, we perform EDA with a specific goal: **identify drivers of churn and potential data leakage**.

## Goals
1. Understand data structure and quality
2. Identify leakage risks (e.g., features that would not be available at prediction time)
3. Discover non-linear relationships
4. Check for Simpson's Paradox

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Settings for cleaner output
pd.set_option('display.max_columns', None)
plt.style.use('seaborn-v0_8-whitegrid')

In [None]:
# Load Data
df = pd.read_csv('data/telco_churn.csv')
print(f"Shape: {df.shape}")
df.head()

## 1. Data Cleaning & Type Correction

**Observation**: `TotalCharges` is loaded as `object` dtype but should be float, which suggests that some entries are not numeric.
Let's investigate.

In [None]:
# Check for non-numeric values; whitespace-only strings sort ahead of digits
print(df['TotalCharges'].value_counts().sort_index().head())

# Discovery: whitespace-only strings (" ") for new customers (tenure = 0)
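
Before choosing how to fill these values, it helps to confirm that every unparseable `TotalCharges` entry really belongs to a customer with `tenure == 0`. The sketch below runs that check on a small made-up frame; the `toy` data and the names `parsed`, `blank_mask`, and `all_new_customers` are illustrative assumptions, and the same logic would need to run on `df` before any `fillna`.

```python
import pandas as pd

# Toy frame for illustration only; " " mimics the raw TotalCharges blanks
toy = pd.DataFrame({
    "tenure": [0, 0, 1, 24],
    "TotalCharges": [" ", " ", "29.85", "1889.5"],
})

# Coerce into a separate Series so unparseable entries surface as NaN
parsed = pd.to_numeric(toy["TotalCharges"], errors="coerce")
blank_mask = parsed.isna()

# True only if every unparseable row is a customer with tenure == 0
all_new_customers = bool((toy.loc[blank_mask, "tenure"] == 0).all())

print(f"Unparseable rows: {int(blank_mask.sum())}")
print(f"All of them have tenure == 0: {all_new_customers}")
```

If the second check came back `False` on the real data, filling with 0 would no longer be justified for every row.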

> **DECISION CHECKPOINT 1**: How to handle missing TotalCharges?
>
> Options:
>
> - A. Drop rows (loss of 11 new customers)
> - B. Fill with 0 (logical, they haven't paid yet)
> - C. Fill with mean (bad, ignores tenure)
>
> **Decision**: Fill with 0. New customers have paid 0 total.

In [None]:
# Convert to numeric, coercing errors to NaN
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
df['TotalCharges'] = df['TotalCharges'].fillna(0)

# Convert Churn to binary (Target)
df['Churn'] = (df['Churn'] == 'Yes').astype(int)

## 2. Target Distribution

Checking for class imbalance.

In [None]:
churn_rate = df['Churn'].mean()
print(f"Churn Rate: {churn_rate:.1%}")

plt.figure(figsize=(6, 4))
sns.countplot(x='Churn', data=df)
plt.title('Class Imbalance Check')
plt.show()

> **Insight**: 26.5% churn rate. This is imbalanced, but not extreme (compare fraud detection at around 1%).
>
> **Consequence**: Accuracy will be misleading, because a model that always predicts "no churn" already scores 73.5%. We must evaluate with Precision/Recall/F1 instead.
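
To make the accuracy baseline concrete, the sketch below scores a majority-class predictor on synthetic labels with the same 26.5% positive rate. The labels and variable names are illustrative assumptions, not the notebook's data, and the metrics are computed by hand to keep the example dependency-light.

```python
import numpy as np

# Synthetic labels with a 26.5% positive rate (illustrative only)
y_true = np.array([1] * 265 + [0] * 735)

# Majority-class baseline: predict "no churn" for every customer
y_pred = np.zeros_like(y_true)

# Confusion-matrix counts for the positive (churn) class
tp = int(((y_pred == 1) & (y_true == 1)).sum())
fp = int(((y_pred == 1) & (y_true == 0)).sum())
fn = int(((y_pred == 0) & (y_true == 1)).sum())

accuracy = float((y_pred == y_true).mean())
# Guard against division by zero when no positives are predicted
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

The baseline looks respectable on accuracy while catching no churners at all, which is why recall and F1 are the more informative numbers here.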

## 3. Numerical Features Analysis

In [None]:
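# Column names must match the data exactly; 'tenure' is lowercase, as used in the tenure-group step
num_cols = ['tenure', 'MonthlyCharges', 'TotalCharges']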

fig, axes = plt.subplots(1, 3, figsize=(18, 5))
for i, col in enumerate(num_cols):
    if col in df.columns:
        sns.histplot(data=df, x=col, hue='Churn', kde=True, ax=axes[i], multiple="stack")
        axes[i].set_title(f'{col} Distribution by Churn')
plt.tight_layout()
plt.show()

## 4. Categorical Analysis

Which services drive churn?

In [None]:
services = ['PhoneService', 'InternetService', 'OnlineSecurity', 'TechSupport', 'StreamingTV']

plt.figure(figsize=(15, 10))
for i, col in enumerate(services):
    plt.subplot(2, 3, i+1)
    sns.barplot(x=col, y='Churn', data=df, errorbar=None)  # errorbar=None replaces the deprecated ci=None (seaborn >= 0.12)
    plt.title(f'Churn Rate by {col}')
    plt.axhline(churn_rate, color='red', linestyle='--', label='Avg Churn')
plt.tight_layout()
plt.legend()
plt.show()

## 5. Detecting Simpson's Paradox

Is contract type the real driver, or is it tenure?

In [None]:
# Create tenure groups; include_lowest=True keeps tenure == 0 customers in the first bin
df['tenure_group'] = pd.cut(
    df['tenure'],
    bins=[0, 12, 24, 48, 72],
    labels=['0-1yr', '1-2yr', '2-4yr', '4+yr'],
    include_lowest=True,
)

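# Check churn by Contract type within tenure groups
# observed=True is set explicitly because tenure_group is categorical and the pandas default has changed across versions
pivot = df.pivot_table(index='tenure_group', columns='Contract', values='Churn', aggfunc='mean', observed=True)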
sns.heatmap(pivot, annot=True, fmt='.0%', cmap='Reds')
plt.title('Churn Rate by Tenure Group AND Contract')
plt.show()
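
Strictly speaking, Simpson's Paradox requires the aggregate comparison between two groups to reverse direction inside every stratum. The sketch below shows what such a reversal check could look like. The counts are made up specifically to produce a reversal and are not taken from the Telco data; `counts`, `overall`, `within`, and `reversal` are illustrative names.

```python
import pandas as pd

# Made-up counts chosen to produce a reversal; not taken from the Telco data
counts = pd.DataFrame({
    "Contract": ["Month-to-month", "Month-to-month", "One year", "One year"],
    "tenure_group": ["0-1yr", "4+yr", "0-1yr", "4+yr"],
    "n": [400, 100, 100, 400],
    "churned": [160, 10, 45, 60],
})

# Aggregate churn rate per contract, ignoring tenure
totals = counts.groupby("Contract")[["churned", "n"]].sum()
overall = totals["churned"] / totals["n"]

# Churn rate per contract within each tenure group
counts["rate"] = counts["churned"] / counts["n"]
within = counts.pivot(index="tenure_group", columns="Contract", values="rate")

# Difference (Month-to-month minus One year), overall and per stratum
overall_diff = overall["Month-to-month"] - overall["One year"]
within_diff = within["Month-to-month"] - within["One year"]

# Reversal: every stratum disagrees in sign with the aggregate comparison
reversal = bool(((within_diff * overall_diff) < 0).all())

print(overall.round(2).to_dict())
print(within.round(2).to_dict())
print(f"Simpson's reversal: {reversal}")
```

In this toy case month-to-month customers look worse overall only because they are concentrated in the high-churn tenure group. On the real data, the same comparison can be made by setting `df.groupby('Contract')['Churn'].mean()` against the heatmap's pivot.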

> **DECISION CHECKPOINT 2**: Interpret the Heatmap
>
> - **Month-to-month** contracts have high churn (40-50%) only in the first year.
> - Long-term month-to-month customers (4+ years) churn at 28%, which is much lower.
> - **One year** contracts in 0-1yr tenure churn at 12%.
>
> **Conclusion**: Interaction between Tenure and Contract is critical. We must use a model that captures interactions (Random Forest/XGBoost) or explicitly engineer this feature for linear models.
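
For the linear-model route, one way to engineer the interaction explicitly is to cross the two categorical variables and one-hot encode the result. This is a minimal sketch on made-up rows; the column name `contract_x_tenure` and the `cxt` prefix are hypothetical, and the bins mirror the tenure groups used for the heatmap.

```python
import pandas as pd

# Toy rows for illustration only; column names mirror the Telco dataset
toy = pd.DataFrame({
    "tenure": [0, 5, 30, 60],
    "Contract": ["Month-to-month", "One year", "Month-to-month", "Two year"],
})

# Same bins as the heatmap; include_lowest=True keeps tenure == 0 in the first bin
toy["tenure_group"] = pd.cut(
    toy["tenure"],
    bins=[0, 12, 24, 48, 72],
    labels=["0-1yr", "1-2yr", "2-4yr", "4+yr"],
    include_lowest=True,
)

# One categorical level per (contract, tenure group) pair
toy["contract_x_tenure"] = toy["Contract"] + " | " + toy["tenure_group"].astype(str)

# One-hot encode so a linear model can fit a separate coefficient per combination
interaction_dummies = pd.get_dummies(toy["contract_x_tenure"], prefix="cxt")

print(interaction_dummies.columns.tolist())
```

Crossing categories multiplies the number of columns (here up to 3 contracts × 4 tenure groups), so sparse combinations may need to be merged or regularized. Tree-based models avoid this step because they can learn the interaction from the raw features.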