# Survival Analysis: Telco Customer Churn Prediction
## Time-to-Churn Analysis

**Objective**: Predict when customers will churn and identify risk factors for early churn.

**Dataset**: Telco Customer Churn
- Available on Kaggle: https://www.kaggle.com/datasets/blastchar/telco-customer-churn

**Survival Analysis Setup**:
- **Time variable**: Tenure (months as customer)
- **Event**: Churn (customer left = 1, stayed = 0)
- **Censoring**: Active customers are right-censored (still with the company when the data was captured, so their churn time is unknown)
- **Features**: Demographics, services, contract details, charges
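
A minimal sketch of this encoding on a few made-up records. The values and the `to_survival_record` helper are illustrative only and are not taken from the dataset:

```python
# Hypothetical customer records: tenure in months and the raw Churn label.
customers = [
    {"tenure": 2, "Churn": "Yes"},   # left after 2 months: event observed
    {"tenure": 45, "Churn": "No"},   # still active: right-censored at 45 months
    {"tenure": 8, "Churn": "Yes"},
    {"tenure": 72, "Churn": "No"},
]

def to_survival_record(row):
    """Map a raw record to (duration, event); event = 0 means censored."""
    return row["tenure"], 1 if row["Churn"] == "Yes" else 0

records = [to_survival_record(c) for c in customers]
n_events = sum(event for _, event in records)
n_censored = len(records) - n_events

print(records)
print(f"events={n_events}, censored={n_censored}")
```

For a censored customer the duration is a lower bound on the true lifetime, which is why plain averages of tenure understate how long customers stay.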

In [None]:
# Install required packages
!pip install lifelines pandas numpy matplotlib seaborn scikit-learn

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from lifelines import KaplanMeierFitter, CoxPHFitter, WeibullAFTFitter
from lifelines.statistics import logrank_test, multivariate_logrank_test
from lifelines.utils import median_survival_times
import warnings
warnings.filterwarnings('ignore')

plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('Set2')
%matplotlib inline

## 1. Data Loading and Exploration

In [None]:
# Load Telco churn dataset
# Download from: https://www.kaggle.com/datasets/blastchar/telco-customer-churn
df = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')

print(f"Dataset shape: {df.shape}")
print(f"\nColumns: {df.columns.tolist()}")
print(f"\nChurn distribution:")
print(df['Churn'].value_counts())
print(f"\nChurn rate: {df['Churn'].value_counts(normalize=True)['Yes']:.1%}")

In [None]:
# Examine data types and missing values
print("Data info:")
print(df.info())
print(f"\nMissing values:")
print(df.isnull().sum()[df.isnull().sum() > 0])

# Check first few rows
df.head()

## 2. Data Preprocessing for Survival Analysis

In [None]:
# Clean and prepare data
survival_df = df.copy()

# Handle TotalCharges - convert to numeric (some values are spaces)
survival_df['TotalCharges'] = pd.to_numeric(survival_df['TotalCharges'], errors='coerce')

# Fill missing TotalCharges with MonthlyCharges * tenure
mask = survival_df['TotalCharges'].isnull()
survival_df.loc[mask, 'TotalCharges'] = survival_df.loc[mask, 'MonthlyCharges'] * survival_df.loc[mask, 'tenure']

# Create survival variables
survival_df['duration'] = survival_df['tenure']  # Time variable (months)
survival_df['event'] = (survival_df['Churn'] == 'Yes').astype(int)  # Event: churned

# Handle zero tenure (new customers)
survival_df['duration'] = survival_df['duration'].replace(0, 0.5)

print(f"Survival dataset shape: {survival_df.shape}")
print(f"\nDuration statistics (months):")
print(survival_df['duration'].describe())
print(f"\nEvents (churned): {survival_df['event'].sum()} ({survival_df['event'].mean():.1%})")
print(f"Censored (active): {(1 - survival_df['event']).sum()} ({(1 - survival_df['event']).mean():.1%})")

In [None]:
# Visualize tenure distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Overall tenure distribution
axes[0].hist(survival_df['duration'], bins=50, edgecolor='black', alpha=0.7)
axes[0].set_xlabel('Tenure (months)', fontsize=12)
axes[0].set_ylabel('Frequency', fontsize=12)
axes[0].set_title('Distribution of Customer Tenure', fontsize=14, fontweight='bold')
axes[0].grid(True, alpha=0.3)

# Tenure by churn status
churned = survival_df[survival_df['event'] == 1]['duration']
active = survival_df[survival_df['event'] == 0]['duration']

axes[1].hist([churned, active], bins=50, label=['Churned', 'Active'], 
            edgecolor='black', alpha=0.7)
axes[1].set_xlabel('Tenure (months)', fontsize=12)
axes[1].set_ylabel('Frequency', fontsize=12)
axes[1].set_title('Tenure Distribution by Churn Status', fontsize=14, fontweight='bold')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 3. Kaplan-Meier Analysis: Overall Customer Retention
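
The Kaplan-Meier estimator multiplies, at each observed churn time, the fraction of at-risk customers who did not churn. The following is a plain-Python sketch of that product on made-up data; the `kaplan_meier` helper is hypothetical, and the notebook itself relies on lifelines' `KaplanMeierFitter`:

```python
from collections import Counter

def kaplan_meier(durations, events):
    """Return [(t, S(t))] at each distinct event time, with S(t) = prod(1 - d_i / n_i)."""
    churned = Counter(t for t, e in zip(durations, events) if e == 1)
    exits = Counter(durations)  # every customer leaves the risk set at their own time
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = churned.get(t, 0)
        if d > 0:
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        # Churned and censored customers both drop out of the risk set after t
        at_risk -= exits[t]
    return curve

# Toy data: 1 = churned, 0 = censored (still active)
durations = [1, 2, 2, 3, 5, 5, 8]
events = [1, 1, 0, 1, 0, 1, 0]

curve = kaplan_meier(durations, events)
for t, s in curve:
    print(f"t={t}: S(t)={s:.3f}")
```

Censored customers never lower the curve directly; they only shrink the risk set used at later event times.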

In [None]:
# Overall retention curve
kmf = KaplanMeierFitter()
kmf.fit(survival_df['duration'], survival_df['event'], label='All Customers')

fig, ax = plt.subplots(figsize=(12, 6))
kmf.plot_survival_function(ax=ax, ci_show=True)
plt.title('Kaplan-Meier Retention Curve: Telco Customers', fontsize=14, fontweight='bold')
plt.xlabel('Tenure (months)', fontsize=12)
plt.ylabel('Probability of Retention', fontsize=12)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

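median_lifetime = kmf.median_survival_time_
if np.isinf(median_lifetime):
    # Retention never drops below 50% in the observed window, so the median is undefined
    print("Median customer lifetime: not reached within the observed tenure range")
else:
    print(f"Median customer lifetime: {median_lifetime:.1f} months")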
print(f"\nRetention rates:")
for months in [6, 12, 24, 36, 48, 60]:
    retention = kmf.predict(months)
    churn = 1 - retention
    print(f"  {months} months: {retention:.1%} retained, {churn:.1%} churned")

## 4. Stratified Analysis: By Contract Type
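
The log-rank test used in this section compares observed and expected churn counts per group at every event time. The following is a simplified two-group sketch in plain Python. The notebook itself uses lifelines' `multivariate_logrank_test` across all contract types; the toy data and the `logrank_two_groups` helper are hypothetical:

```python
import math

def logrank_two_groups(durations_a, events_a, durations_b, events_b):
    """Chi-square statistic (1 degree of freedom) of the two-group log-rank test."""
    all_durations = durations_a + durations_b
    all_events = events_a + events_b
    event_times = sorted({t for t, e in zip(all_durations, all_events) if e == 1})

    observed_minus_expected = 0.0
    variance = 0.0
    for t in event_times:
        n_a = sum(1 for d in durations_a if d >= t)  # at risk in group A
        n_b = sum(1 for d in durations_b if d >= t)  # at risk in group B
        d_a = sum(1 for d, e in zip(durations_a, events_a) if d == t and e == 1)
        d_b = sum(1 for d, e in zip(durations_b, events_b) if d == t and e == 1)
        n, d = n_a + n_b, d_a + d_b
        # Expected events in A if both groups shared the same hazard
        observed_minus_expected += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    return observed_minus_expected ** 2 / variance

# Toy data: short-lived month-to-month customers vs. mostly censored two-year customers
monthly_t, monthly_e = [1, 2, 3, 4, 5], [1, 1, 1, 1, 1]
two_year_t, two_year_e = [10, 12, 15, 20, 24], [0, 1, 0, 0, 0]

stat = logrank_two_groups(monthly_t, monthly_e, two_year_t, two_year_e)
# Survival function of a chi-square variable with 1 degree of freedom
p_value = math.erfc(math.sqrt(stat / 2))
print(f"statistic={stat:.3f}, p-value={p_value:.4f}")
```

A small p-value indicates that the retention curves differ between groups, but it says nothing about the size of the difference.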

In [None]:
# Compare retention by contract type
fig, ax = plt.subplots(figsize=(12, 6))

contract_types = survival_df['Contract'].unique()
for contract in contract_types:
    mask = survival_df['Contract'] == contract
    kmf_contract = KaplanMeierFitter()
    kmf_contract.fit(survival_df[mask]['duration'], 
                     survival_df[mask]['event'], 
                     label=contract)
    kmf_contract.plot_survival_function(ax=ax, ci_show=False)

plt.title('Retention Curves by Contract Type', fontsize=14, fontweight='bold')
plt.xlabel('Tenure (months)', fontsize=12)
plt.ylabel('Probability of Retention', fontsize=12)
plt.legend(title='Contract Type')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# Log-rank test
result = multivariate_logrank_test(
    survival_df['duration'],
    survival_df['Contract'],
    survival_df['event']
)
print(f"\nLog-rank test for Contract Type:")
print(f"Test statistic: {result.test_statistic:.4f}")
print(f"p-value: {result.p_value:.4e}")
print(f"Significant difference: {'Yes' if result.p_value < 0.05 else 'No'}")

## 5. Stratified Analysis: By Internet Service

In [None]:
# Compare by internet service type
fig, ax = plt.subplots(figsize=(12, 6))

for service in survival_df['InternetService'].unique():
    mask = survival_df['InternetService'] == service
    kmf_service = KaplanMeierFitter()
    kmf_service.fit(survival_df[mask]['duration'], 
                    survival_df[mask]['event'], 
                    label=service)
    kmf_service.plot_survival_function(ax=ax, ci_show=False)

plt.title('Retention Curves by Internet Service Type', fontsize=14, fontweight='bold')
plt.xlabel('Tenure (months)', fontsize=12)
plt.ylabel('Probability of Retention', fontsize=12)
plt.legend(title='Internet Service')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

## 6. Stratified Analysis: By Payment Method

In [None]:
# Compare by payment method
fig, ax = plt.subplots(figsize=(12, 6))

for method in survival_df['PaymentMethod'].unique():
    mask = survival_df['PaymentMethod'] == method
    kmf_payment = KaplanMeierFitter()
    kmf_payment.fit(survival_df[mask]['duration'], 
                    survival_df[mask]['event'], 
                    label=method)
    kmf_payment.plot_survival_function(ax=ax, ci_show=False)

plt.title('Retention Curves by Payment Method', fontsize=14, fontweight='bold')
plt.xlabel('Tenure (months)', fontsize=12)
plt.ylabel('Probability of Retention', fontsize=12)
plt.legend(title='Payment Method', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

## 7. Prepare Features for Cox Model

In [None]:
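# Select features for modeling
# Caution: TotalCharges is roughly MonthlyCharges x tenure, so it partly encodes
# the duration being modeled; consider dropping it if coefficients look implausible
numeric_features = ['SeniorCitizen', 'MonthlyCharges', 'TotalCharges']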

categorical_features = ['gender', 'Partner', 'Dependents', 'PhoneService', 
                       'MultipleLines', 'InternetService', 'OnlineSecurity',
                       'OnlineBackup', 'DeviceProtection', 'TechSupport',
                       'StreamingTV', 'StreamingMovies', 'Contract',
                       'PaperlessBilling', 'PaymentMethod']

# Create Cox dataset
cox_df = survival_df[numeric_features + categorical_features + ['duration', 'event']].copy()

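# Encode categorical variables (dtype=int keeps the dummies numeric;
# recent pandas versions default to boolean columns)
cox_df_encoded = pd.get_dummies(cox_df, columns=categorical_features, drop_first=True, dtype=int)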

# Remove any NaN
cox_df_encoded = cox_df_encoded.dropna()

print(f"Cox dataset shape: {cox_df_encoded.shape}")
print(f"Features: {cox_df_encoded.shape[1] - 2}")

## 8. Cox Proportional Hazards Model
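
The Cox model scales a shared baseline hazard by `exp(beta . x)`, so each coefficient is a log hazard ratio. The following sketch shows that relationship with made-up coefficients. The numbers and the `partial_hazard` helper are hypothetical and are not fitted values from this notebook:

```python
import math

# Hypothetical log hazard ratios; a fitted model would supply its own coefficients.
coefficients = {
    "Contract_Two year": -1.2,
    "InternetService_Fiber optic": 0.4,
}

def partial_hazard(features, coefs):
    """exp(beta . x): hazard relative to a customer with all features at zero."""
    return math.exp(sum(coefs[name] * features.get(name, 0) for name in coefs))

reference = {}
fiber = {"InternetService_Fiber optic": 1}
two_year = {"Contract_Two year": 1}

for label, profile in [("reference", reference), ("fiber", fiber), ("two-year", two_year)]:
    print(f"{label}: relative hazard = {partial_hazard(profile, coefficients):.3f}")
```

A relative hazard above 1 means churn occurs at a higher rate than for the reference profile at every tenure; below 1 means a lower rate.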

In [None]:
# Fit Cox model
cph = CoxPHFitter(penalizer=0.1)
cph.fit(cox_df_encoded, duration_col='duration', event_col='event')

print("Cox Proportional Hazards Model Summary:")
print(f"Concordance Index: {cph.concordance_index_:.4f}")
print(f"Log-likelihood: {cph.log_likelihood_:.4f}")
print(f"AIC: {cph.AIC_:.4f}")

In [None]:
# Display model summary
summary = cph.summary
summary['hazard_ratio'] = np.exp(summary['coef'])
summary_sorted = summary.sort_values('p', ascending=True)

print("\nTop 20 Most Significant Factors:")
significant = summary_sorted[summary_sorted['p'] < 0.05].head(20)
print(significant[['coef', 'hazard_ratio', 'p']].to_string())

In [None]:
# Visualize top factors
fig, ax = plt.subplots(figsize=(10, 10))

top_factors = significant.head(15)
y_pos = np.arange(len(top_factors))
hazard_ratios = top_factors['hazard_ratio'].values
labels = [label[:50] for label in top_factors.index]

colors = ['red' if hr > 1 else 'green' for hr in hazard_ratios]
ax.barh(y_pos, hazard_ratios - 1, color=colors, alpha=0.6)
ax.axvline(0, color='black', linestyle='--', linewidth=1)
ax.set_yticks(y_pos)
ax.set_yticklabels(labels, fontsize=10)
ax.set_xlabel('Hazard Ratio - 1', fontsize=12)
ax.set_title('Top 15 Churn Risk Factors\n(Red: Increases churn, Green: Decreases churn)', 
            fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3, axis='x')
plt.tight_layout()
plt.show()

## 9. Risk Stratification

In [None]:
# Calculate risk scores
risk_scores = cph.predict_partial_hazard(cox_df_encoded)

# Create risk groups
risk_percentiles = np.percentile(risk_scores, [20, 40, 60, 80])
risk_groups = pd.cut(risk_scores, 
                     bins=[0] + list(risk_percentiles) + [np.inf],
                     labels=['Very Low', 'Low', 'Medium', 'High', 'Very High'])

cox_df_encoded['risk_group'] = risk_groups

# Plot retention by risk group
fig, ax = plt.subplots(figsize=(12, 6))

for group in ['Very Low', 'Low', 'Medium', 'High', 'Very High']:
    mask = cox_df_encoded['risk_group'] == group
    kmf_risk = KaplanMeierFitter()
    kmf_risk.fit(cox_df_encoded[mask]['duration'], 
                 cox_df_encoded[mask]['event'], 
                 label=group)
    kmf_risk.plot_survival_function(ax=ax, ci_show=False)

plt.title('Retention Curves by Risk Stratification', fontsize=14, fontweight='bold')
plt.xlabel('Tenure (months)', fontsize=12)
plt.ylabel('Probability of Retention', fontsize=12)
plt.legend(title='Risk Group')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("\nRisk Group Distribution:")
print(cox_df_encoded['risk_group'].value_counts().sort_index())

print("\nChurn rates by risk group:")
churn_by_risk = cox_df_encoded.groupby('risk_group')['event'].agg(['mean', 'count'])
print(churn_by_risk)

## 10. Customer Lifetime Value (CLV) Analysis
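
Expected lifetime is the area under the retention curve, and CLV here is that lifetime multiplied by the average monthly charge. The following sketch integrates a step-shaped curve up to a fixed horizon. The curve values, the monthly charge, and the `restricted_mean_lifetime` helper are hypothetical:

```python
def restricted_mean_lifetime(curve, horizon):
    """Area under a step survival curve from 0 to `horizon`.

    `curve` is a list of (time, survival) pairs sorted by time;
    survival is taken to be 1.0 before the first listed time.
    """
    area = 0.0
    prev_t, prev_s = 0.0, 1.0
    for t, s in curve:
        if t >= horizon:
            break
        area += prev_s * (t - prev_t)  # rectangle up to the next drop
        prev_t, prev_s = t, s
    area += prev_s * (horizon - prev_t)  # final rectangle up to the horizon
    return area

# Hypothetical retention curve: (month, probability still retained)
toy_curve = [(6, 0.8), (12, 0.5), (24, 0.25)]
avg_monthly_charge = 70.0  # hypothetical

lifetime_36 = restricted_mean_lifetime(toy_curve, horizon=36)
clv_36 = lifetime_36 * avg_monthly_charge
print(f"Expected lifetime over 36 months: {lifetime_36:.1f} months")
print(f"Expected CLV over 36 months: ${clv_36:.2f}")
```

Because many customers are still active at the end of follow-up, the result is a lifetime restricted to the chosen horizon rather than a full lifetime, and it understates value for low-risk groups.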

In [None]:
# Calculate expected lifetime and CLV by risk group
results = []

for group in ['Very Low', 'Low', 'Medium', 'High', 'Very High']:
    mask = cox_df_encoded['risk_group'] == group
    group_data = cox_df_encoded[mask]
    
    # Fit KM for this group
    kmf_group = KaplanMeierFitter()
    kmf_group.fit(group_data['duration'], group_data['event'])
    
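    # Expected lifetime = area under the KM step curve, restricted to the
    # observed follow-up window (summing S(t) alone ignores uneven time steps)
    sf = kmf_group.survival_function_.iloc[:, 0]
    expected_lifetime = float(np.sum(sf.values[:-1] * np.diff(sf.index.values)))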
    
    # Average monthly charges for this group
    avg_monthly = survival_df.loc[group_data.index, 'MonthlyCharges'].mean()
    
    # CLV = Expected lifetime × Monthly charges
    clv = expected_lifetime * avg_monthly
    
    results.append({
        'Risk Group': group,
        'Count': len(group_data),
        'Churn Rate': group_data['event'].mean(),
        'Expected Lifetime (months)': expected_lifetime,
        'Avg Monthly Charges': avg_monthly,
        'Expected CLV': clv
    })

clv_df = pd.DataFrame(results)
print("\nCustomer Lifetime Value Analysis by Risk Group:")
print("=" * 90)
print(clv_df.to_string(index=False))

In [None]:
# Visualize CLV by risk group
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Expected lifetime
axes[0].bar(clv_df['Risk Group'], clv_df['Expected Lifetime (months)'], 
           color=sns.color_palette('RdYlGn_r', n_colors=5), alpha=0.7, edgecolor='black')
axes[0].set_xlabel('Risk Group', fontsize=12)
axes[0].set_ylabel('Expected Lifetime (months)', fontsize=12)
axes[0].set_title('Expected Customer Lifetime by Risk Group', fontsize=14, fontweight='bold')
axes[0].grid(True, alpha=0.3, axis='y')

# Expected CLV
axes[1].bar(clv_df['Risk Group'], clv_df['Expected CLV'], 
           color=sns.color_palette('RdYlGn_r', n_colors=5), alpha=0.7, edgecolor='black')
axes[1].set_xlabel('Risk Group', fontsize=12)
axes[1].set_ylabel('Expected CLV ($)', fontsize=12)
axes[1].set_title('Expected Customer Lifetime Value by Risk Group', fontsize=14, fontweight='bold')
axes[1].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

## 11. Churn Probability Predictions
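
A predicted survival curve starts at tenure 0, so reading `1 - S(6)` from it gives the churn probability for a brand-new customer. For a customer who has already stayed `t` months, the relevant quantity is the conditional probability `1 - S(t + h) / S(t)`. The sketch below uses a made-up curve and a hypothetical helper. lifelines appears to offer a `conditional_after` argument on `predict_survival_function` for this purpose; treat that as an assumption to verify against the installed version.

```python
# Hypothetical retention curve S(t) at selected months (not model output).
toy_survival = {0: 1.00, 6: 0.85, 12: 0.75, 18: 0.70, 24: 0.66}

def conditional_churn_probability(survival, current_tenure, horizon):
    """P(churn within `horizon` more months | still active at `current_tenure`)."""
    s_now = survival[current_tenure]
    s_later = survival[current_tenure + horizon]
    return 1 - s_later / s_now

new_customer = conditional_churn_probability(toy_survival, 0, 6)
existing_customer = conditional_churn_probability(toy_survival, 12, 6)

print(f"New customer, next 6 months: {new_customer:.1%}")
print(f"Customer at 12 months, next 6 months: {existing_customer:.1%}")
```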

In [None]:
# Predict churn probability at different time horizons for sample customers
sample_indices = [0, 100, 500, 1000, 2000]

print("Churn Probability Predictions for Sample Customers:")
print("=" * 90)

for idx in sample_indices:
    customer = cox_df_encoded.iloc[idx:idx+1]
    
    # Get survival function
    surv_func = cph.predict_survival_function(customer)
    
    # Look up this customer's risk score and risk group
    risk = risk_scores.iloc[idx]
    risk_group = risk_groups.iloc[idx]
    
    print(f"\nCustomer {idx}:")
    print(f"  Risk Score: {risk:.2f}")
    print(f"  Risk Group: {risk_group}")
    print(f"  Churn probability:")
    
    for months in [6, 12, 24, 36]:
        if months <= surv_func.index.max():
            churn_prob = 1 - surv_func.loc[months].values[0]
            print(f"    Within {months} months: {churn_prob:.1%}")

In [None]:
# Visualize survival curves for sample customers
fig, ax = plt.subplots(figsize=(12, 6))

for idx in sample_indices:
    customer = cox_df_encoded.iloc[idx:idx+1]
    surv_func = cph.predict_survival_function(customer)
    risk_group = risk_groups.iloc[idx]
    surv_func.plot(ax=ax, label=f'Customer {idx} ({risk_group} Risk)')

plt.title('Predicted Retention Curves for Sample Customers', fontsize=14, fontweight='bold')
plt.xlabel('Tenure (months)', fontsize=12)
plt.ylabel('Probability of Retention', fontsize=12)
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

## 12. Model Validation
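
Alongside the proportional hazards check, discrimination is summarized by the concordance index reported in Section 8: the share of comparable customer pairs in which the customer who churned earlier also received the higher risk score. The following is a simplified sketch of that idea in plain Python. The `concordance_index` helper and the data are hypothetical, and lifelines' own implementation may treat ties differently:

```python
def concordance_index(durations, events, risk_scores):
    """Simplified Harrell's C: higher risk should pair with earlier churn."""
    concordant = 0.0
    comparable = 0
    n = len(durations)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when i churned strictly before j's last observed time
            if events[i] == 1 and durations[i] < durations[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied scores count as half
    return concordant / comparable

# Toy data: 1 = churned, 0 = censored
durations = [2, 4, 6, 8]
events = [1, 1, 0, 1]
risk_scores = [0.9, 0.4, 0.3, 0.5]

c_index = concordance_index(durations, events, risk_scores)
print(f"Concordance index: {c_index:.2f}")
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking; the index says nothing about whether predicted probabilities are well calibrated.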

In [None]:
# Check proportional hazards assumption for key features
print("Checking Proportional Hazards Assumption...")
print("="*80)

# Test on simplified model with key features
key_features = ['MonthlyCharges', 'TotalCharges', 'SeniorCitizen'] + \
               [col for col in cox_df_encoded.columns if 'Contract_' in col or 'InternetService_' in col]

cox_simple = cox_df_encoded[key_features + ['duration', 'event']].copy()

cph_simple = CoxPHFitter()
cph_simple.fit(cox_simple, duration_col='duration', event_col='event')
cph_simple.check_assumptions(cox_simple, p_value_threshold=0.05, show_plots=True)

## 13. Key Insights and Recommendations

In [None]:
print("=" * 80)
print("KEY FINDINGS: TELCO CUSTOMER CHURN ANALYSIS")
print("=" * 80)

print(f"\n1. Overall Churn Statistics:")
print(f"   - Overall churn rate: {survival_df['event'].mean():.1%}")
print(f"   - Median customer lifetime: {kmf.median_survival_time_:.1f} months")
print(f"   - 1-year retention: {kmf.predict(12):.1%}")
print(f"   - 2-year retention: {kmf.predict(24):.1%}")

print(f"\n2. Top Churn Risk Factors (Increase Churn):")
top_risk = summary_sorted[summary_sorted['hazard_ratio'] > 1].head(5)
for idx, (factor, row) in enumerate(top_risk.iterrows(), 1):
    print(f"   {idx}. {factor}: HR={row['hazard_ratio']:.3f} (p={row['p']:.4f})")

print(f"\n3. Top Protective Factors (Decrease Churn):")
top_protect = summary_sorted[summary_sorted['hazard_ratio'] < 1].head(5)
for idx, (factor, row) in enumerate(top_protect.iterrows(), 1):
    print(f"   {idx}. {factor}: HR={row['hazard_ratio']:.3f} (p={row['p']:.4f})")

print(f"\n4. Model Performance:")
print(f"   - Concordance Index: {cph.concordance_index_:.4f}")

print(f"\n5. Customer Lifetime Value by Risk:")
for _, row in clv_df.iterrows():
    print(f"   {row['Risk Group']}: ${row['Expected CLV']:.2f} (lifetime: {row['Expected Lifetime (months)']:.1f} months)")

print(f"\n6. Contract Type Impact:")
for contract in survival_df['Contract'].unique():
    mask = survival_df['Contract'] == contract
    churn_rate = survival_df[mask]['event'].mean()
    avg_tenure = survival_df[mask]['duration'].mean()
    print(f"   {contract}: {churn_rate:.1%} churn rate, {avg_tenure:.1f} months avg tenure")

print("\n" + "=" * 80)
print("STRATEGIC RECOMMENDATIONS")
print("=" * 80)
print("1. RETENTION PROGRAMS:")
print("   - Target high-risk customers (Very High/High) with proactive outreach")
print("   - Focus on month-to-month contract customers for upgrades")
print("   - Implement early warning system for customers predicted to churn within 6 months")

print("\n2. PRODUCT & PRICING:")
print("   - Promote long-term contracts (1-2 year) with incentives")
print("   - Bundle services (security, backup, tech support) to increase stickiness")
print("   - Optimize pricing for Fiber optic customers who show higher churn")

print("\n3. CUSTOMER EXPERIENCE:")
print("   - Improve tech support and online security offerings")
print("   - Address payment method friction (electronic check users churn more)")
print("   - Implement customer success programs for first 12 months (high-risk period)")

print("\n4. INTERVENTION TIMING:")
print("   - Critical periods: 0-6 months (acquisition), 12-18 months (renewal decision)")
print("   - Deploy retention offers 2-3 months before predicted churn")
print("   - Different strategies for different risk tiers")

print("\n5. CLV OPTIMIZATION:")
print(f"   - CLV gap between highest- and lowest-value risk groups: ${clv_df['Expected CLV'].max() - clv_df['Expected CLV'].min():.2f}")
print("   - Focus acquisition on low-risk profile customers")
print("   - Allocate retention budget proportional to CLV by risk group")
print("=" * 80)

## Next Steps

1. **Time-varying covariates**: Model changes in usage, complaints, payment delays over time
2. **Uplift modeling**: Test impact of retention campaigns using A/B testing
3. **Survival forests**: Try Random Survival Forests for non-linear relationships
4. **Personalized interventions**: Develop targeted offers based on churn probability curves
5. **Real-time scoring**: Deploy model for real-time churn risk monitoring
6. **Competing risks**: Separate voluntary churn from involuntary (payment issues)
7. **Causal inference**: Identify truly causal factors vs correlations