# Exploratory Data Analysis (EDA) for Customer Churn Prediction

This notebook provides a comprehensive exploratory data analysis of customer churn datasets, including:
- Univariate analysis of all features
- Bivariate analysis with statistical testing
- Correlation analysis using appropriate measures
- Publication-ready visualizations
- Business insights and recommendations

## Table of Contents
1. [Setup and Data Loading](#setup)
2. [Data Overview](#overview)
3. [Univariate Analysis](#univariate)
4. [Bivariate Analysis](#bivariate)
5. [Correlation Analysis](#correlation)
6. [Statistical Testing](#testing)
7. [Key Insights and Business Recommendations](#insights)
8. [Conclusions](#conclusions)

In [2]:
# Import libraries
import sys
import warnings
from pathlib import Path

# Add src and root to path for imports
notebook_dir = Path.cwd()
if notebook_dir.name == 'notebooks':
    project_root = notebook_dir.parent
else:
    project_root = notebook_dir

sys.path.insert(0, str(project_root))
sys.path.insert(0, str(project_root / 'src'))

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Import our custom modules
try:
    from data_prep import DataLoader, DataCleaner, DataSplitter
    from eda import EDAAnalyzer, StatisticalTester, EDAVisualizer
    from config import config
    print("✅ Custom modules imported successfully!")
except ImportError as e:
    print(f"❌ Import error: {e}")
    print("Please ensure you're running this notebook from the project root or notebooks directory")
    raise

# Configure display options
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', 50)

# Configure plotting
try:
    plt.style.use('seaborn-v0_8')
except OSError:  # matplotlib raises OSError when the style name is unknown
    plt.style.use('default')
    print("Using default matplotlib style (seaborn-v0_8 not available)")

sns.set_palette("husl")
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(config.RANDOM_SEED)

print("Libraries imported successfully!")
print(f"Working directory: {Path.cwd()}")
print(f"Project root: {project_root}")
print(f"Config loaded - Random seed: {config.RANDOM_SEED}")

ModuleNotFoundError: No module named 'config'
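
The recorded output above shows that `config` could not be imported in that run. As a rough illustration only, a stand-in exposing the three attributes this notebook reads (`RANDOM_SEED`, `TABLES_PATH`, `FIGURES_PATH`) might look like the sketch below. The class name, the seed value, and the directory names are assumptions, not the project's actual configuration.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Config:
    """Hypothetical stand-in: only the attributes this notebook reads."""

    RANDOM_SEED: int = 42                              # assumed value
    TABLES_PATH: Path = Path("reports") / "tables"     # assumed location
    FIGURES_PATH: Path = Path("reports") / "figures"   # assumed location


config = Config()
print(config.RANDOM_SEED)
print(config.TABLES_PATH / "statistical_testing_report.csv")
```

Saving something like this as `config.py` in the project root or under `src/` would satisfy the `from config import config` line, since both directories are added to `sys.path` in the cell above.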

## 1. Setup and Data Loading {#setup}

The cell above imports the required libraries and project modules. Next, we load and clean the data.

### Load and Prepare Data

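This notebook demonstrates the EDA framework on the Telco Customer Churn dataset; the Olist dataset is not loaded here. If the Telco data cannot be loaded, the cell below generates synthetic Telco-like data so the remaining sections still run.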

In [None]:
# Initialize data components
loader = DataLoader()
cleaner = DataCleaner()

print("Loading Telco Customer Churn dataset...")
try:
    # Load Telco data
    telco_raw = loader.load_telco_data()
    print(f"Telco dataset loaded: {telco_raw.shape}")
    
    # Clean the data
    telco_clean = cleaner.clean_telco_data(telco_raw)
    print(f"Telco dataset cleaned: {telco_clean.shape}")
    
    telco_available = True
    
except Exception as e:
    print(f"Could not load Telco dataset: {e}")
    print("Will create synthetic data for demonstration...")
    
    # Create synthetic Telco-like data for demonstration
    np.random.seed(config.RANDOM_SEED)  # reuse the project seed instead of a second hard-coded value
    n_samples = 1000
    
    telco_clean = pd.DataFrame({
        'customerID': [f'C{i:04d}' for i in range(n_samples)],
        'gender': np.random.choice(['Male', 'Female'], n_samples),
        'SeniorCitizen': np.random.choice([0, 1], n_samples, p=[0.8, 0.2]),
        'Partner': np.random.choice([0, 1], n_samples, p=[0.5, 0.5]),
        'Dependents': np.random.choice([0, 1], n_samples, p=[0.7, 0.3]),
        'tenure': np.random.exponential(20, n_samples).astype(int),
        'PhoneService': np.random.choice([0, 1], n_samples, p=[0.1, 0.9]),
        'InternetService': np.random.choice(['No', 'DSL', 'Fiber optic'], n_samples, p=[0.2, 0.4, 0.4]),
        'Contract': np.random.choice(['Month-to-month', 'One year', 'Two year'], n_samples, p=[0.5, 0.3, 0.2]),
        'PaperlessBilling': np.random.choice([0, 1], n_samples, p=[0.4, 0.6]),
        'PaymentMethod': np.random.choice(['Electronic check', 'Mailed check', 'Bank transfer', 'Credit card'], n_samples),
        'MonthlyCharges': np.random.normal(65, 20, n_samples).clip(20, 120),
        'TotalCharges': None  # Will calculate based on tenure and monthly charges
    })
    
    # Calculate TotalCharges with some noise
    telco_clean['TotalCharges'] = (telco_clean['tenure'] * telco_clean['MonthlyCharges'] * 
                                  np.random.normal(1, 0.1, n_samples)).clip(0, None)
    
    # Create churn target with realistic patterns
    churn_prob = (
        0.1 +  # Base rate
        0.3 * (telco_clean['Contract'] == 'Month-to-month') +  # Contract effect
        0.2 * (telco_clean['tenure'] < 12) +  # Tenure effect
        0.1 * (telco_clean['MonthlyCharges'] > 80) +  # Price effect
        0.1 * telco_clean['SeniorCitizen']  # Age effect
    ).clip(0, 1)
    
    telco_clean['Churn'] = np.random.binomial(1, churn_prob, n_samples)
    
    telco_available = True
    print(f"Synthetic Telco dataset created: {telco_clean.shape}")

# Display basic info about the dataset
if telco_available:
    print("\n=== Telco Dataset Info ===")
    print(f"Shape: {telco_clean.shape}")
    print(f"Churn rate: {telco_clean['Churn'].mean():.2%}")
    print(f"Missing values: {telco_clean.isnull().sum().sum()}")
    
    # Display first few rows
    display(telco_clean.head())

## 2. Data Overview {#overview}

Let's get a comprehensive overview of our dataset structure and quality.

In [None]:
# Generate data quality report
quality_report = loader.generate_quality_report(telco_clean, "Telco")

print("=== Data Quality Report ===")
print(f"Total rows: {quality_report.total_rows:,}")
print(f"Total columns: {quality_report.total_columns}")
print(f"Memory usage: {quality_report.memory_usage}")
print(f"Duplicate rows: {quality_report.duplicate_rows}")

if quality_report.missing_values:
    print("\nMissing values:")
    for col, count in quality_report.missing_values.items():
        percentage = (count / quality_report.total_rows) * 100
        print(f"  {col}: {count} ({percentage:.1f}%)")
else:
    print("\nNo missing values found!")

if quality_report.outliers:
    print("\nOutliers detected:")
    for col, count in quality_report.outliers.items():
        percentage = (count / quality_report.total_rows) * 100
        print(f"  {col}: {count} ({percentage:.1f}%)")
else:
    print("\nNo outliers detected using IQR method.")
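
The report refers to outliers found with the IQR method, but the implementation inside `generate_quality_report` is not shown here. A minimal sketch of the usual rule, assuming the conventional multiplier of 1.5 and a hypothetical helper name `count_iqr_outliers`, could look like this:

```python
import pandas as pd


def count_iqr_outliers(series: pd.Series, k: float = 1.5) -> int:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR], ignoring missing values."""
    values = series.dropna()
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return int(((values < lower) | (values > upper)).sum())


# Toy data: six similar charges and one extreme value
charges = pd.Series([20.0, 22.0, 21.0, 23.0, 22.5, 21.5, 95.0])
n_outliers = count_iqr_outliers(charges)
print(n_outliers)
```

Here Q1 = 21.25 and Q3 = 22.75, so the fences are 19.0 and 25.0 and only the value 95.0 falls outside them.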

In [None]:
# Display data types and basic statistics
print("=== Data Types ===")
print(telco_clean.dtypes)

print("\n=== Basic Statistics ===")
display(telco_clean.describe(include='all'))

## 3. Univariate Analysis {#univariate}

Let's analyze each variable individually to understand their distributions and characteristics.

In [None]:
# Initialize EDA components
eda_analyzer = EDAAnalyzer(significance_level=0.05, correlation_threshold=0.7)
visualizer = EDAVisualizer()

# Perform univariate analysis
print("Performing univariate analysis...")
univariate_results = eda_analyzer.perform_univariate_analysis(telco_clean)

print(f"\nAnalyzed {len(univariate_results)} variables")

# Display results for each variable type
numeric_vars = [name for name, result in univariate_results.items() if result.data_type == 'numeric']
categorical_vars = [name for name, result in univariate_results.items() if result.data_type == 'categorical']

print(f"\nNumeric variables ({len(numeric_vars)}): {numeric_vars}")
print(f"Categorical variables ({len(categorical_vars)}): {categorical_vars}")

In [None]:
# Display detailed results for numeric variables
print("=== Numeric Variables Analysis ===")
numeric_summary = []

for name, result in univariate_results.items():
    if result.data_type == 'numeric':
        numeric_summary.append({
            'Variable': name,
            'Missing (%)': f"{result.missing_percentage:.1f}%",
            'Unique': result.unique_count,
            # Compare against None so legitimate zeros (e.g. skewness 0.0) are not shown as 'N/A'
            'Mean': f"{result.mean:.2f}" if result.mean is not None else 'N/A',
            'Median': f"{result.median:.2f}" if result.median is not None else 'N/A',
            'Std': f"{result.std:.2f}" if result.std is not None else 'N/A',
            'Skewness': f"{result.skewness:.2f}" if result.skewness is not None else 'N/A',
            'Kurtosis': f"{result.kurtosis:.2f}" if result.kurtosis is not None else 'N/A'
        })

numeric_df = pd.DataFrame(numeric_summary)
display(numeric_df)
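
For readers without access to `EDAAnalyzer`, the same kind of numeric summary can be approximated with plain pandas. The sketch below is not the analyzer's implementation; `summarize_numeric` is a hypothetical helper, and it relies on pandas' own sample skewness and excess kurtosis estimators.

```python
import pandas as pd


def summarize_numeric(series: pd.Series) -> dict:
    """Return basic distribution statistics for one numeric column."""
    values = series.dropna()
    return {
        "missing_pct": float(series.isna().mean() * 100),
        "unique": int(values.nunique()),
        "mean": float(values.mean()),
        "median": float(values.median()),
        "std": float(values.std()),
        "skewness": float(values.skew()),   # 0 for a symmetric sample
        "kurtosis": float(values.kurt()),   # excess kurtosis (normal = 0)
    }


summary = summarize_numeric(pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, None]))
print(summary)
```

Note that a perfectly symmetric sample gives a skewness of exactly zero, which is why the table above should test for `None` rather than truthiness.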

In [None]:
# Display detailed results for categorical variables
print("=== Categorical Variables Analysis ===")
categorical_summary = []

for name, result in univariate_results.items():
    if result.data_type == 'categorical':
        categorical_summary.append({
            'Variable': name,
            'Missing (%)': f"{result.missing_percentage:.1f}%",
            'Unique': result.unique_count,
            'Mode': result.mode,
            'Mode Freq': result.mode_frequency,
            'Mode (%)': f"{result.mode_percentage:.1f}%" if result.mode_percentage is not None else 'N/A'
        })

categorical_df = pd.DataFrame(categorical_summary)
display(categorical_df)

# Show top categories for each categorical variable
print("\n=== Top Categories by Variable ===")
for name, result in univariate_results.items():
    if result.data_type == 'categorical' and result.top_categories:
        print(f"\n{name}:")
        for category, count in list(result.top_categories.items())[:5]:  # Top 5
            percentage = (count / (len(telco_clean) - result.missing_count)) * 100
            print(f"  {category}: {count} ({percentage:.1f}%)")

In [None]:
# Create univariate summary visualization
fig_univariate = visualizer.plot_univariate_summary(univariate_results, 'univariate_summary')
plt.show()

## 4. Bivariate Analysis {#bivariate}

Now let's analyze the relationship between each feature and the target variable (churn).

In [None]:
# Perform bivariate analysis
print("Performing bivariate analysis...")
target_col = 'Churn'
feature_cols = [col for col in telco_clean.columns if col not in ['customerID', target_col]]

bivariate_results = eda_analyzer.perform_bivariate_analysis(
    telco_clean, target_col, feature_cols
)

print(f"\nAnalyzed {len(bivariate_results)} features against target '{target_col}'")

# Create summary of bivariate results
bivariate_summary = []

for name, result in bivariate_results.items():
    row = {
        'Feature': name,
        'Type': result.feature_type,
        'Test': result.test_name,
        # Compare against None: a p-value that underflows to 0.0 is still a valid (and significant) result
        'Statistic': f"{result.test_statistic:.3f}" if result.test_statistic is not None else 'N/A',
        'P-value': f"{result.p_value:.3e}" if result.p_value is not None else 'N/A',
        'Effect Size': f"{result.effect_size:.3f}" if result.effect_size is not None else 'N/A',
        'Effect Interpretation': result.effect_size_interpretation or 'N/A',
        'Significant': 'Yes' if result.p_value is not None and result.p_value < 0.05 else 'No'
    }
    
    # Add correlation for numeric features
    if result.correlation_coefficient is not None:
        row['Correlation'] = f"{result.correlation_coefficient:.3f}"
    
    bivariate_summary.append(row)

bivariate_df = pd.DataFrame(bivariate_summary)

# Sort significant features first, then by ascending p-value
bivariate_df['P-value_numeric'] = pd.to_numeric(bivariate_df['P-value'], errors='coerce')
bivariate_df = bivariate_df.sort_values(['Significant', 'P-value_numeric'], ascending=[False, True])
bivariate_df = bivariate_df.drop('P-value_numeric', axis=1)

print("\n=== Bivariate Analysis Results ===")
display(bivariate_df)
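
The tests that `perform_bivariate_analysis` selects are reported in the `Test` column but not shown in this notebook. As one plausible example for a categorical feature against a binary target, the sketch below runs a chi-square test of independence with `scipy.stats.chi2_contingency` and derives Cramér's V as the effect size. The helper name and the toy data are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats


def chi_square_with_cramers_v(df: pd.DataFrame, feature: str, target: str):
    """Chi-square test of independence plus Cramér's V effect size."""
    table = pd.crosstab(df[feature], df[target])
    chi2, p_value, _, _ = stats.chi2_contingency(table, correction=False)
    n = table.to_numpy().sum()
    min_dim = min(table.shape) - 1
    cramers_v = float(np.sqrt(chi2 / (n * min_dim))) if min_dim > 0 else float("nan")
    return float(chi2), float(p_value), cramers_v


# Toy data: contract type fully determines churn, so V should equal 1
toy = pd.DataFrame({
    "Contract": ["Month-to-month"] * 10 + ["Two year"] * 10,
    "Churn": [1] * 10 + [0] * 10,
})
chi2, p_value, v = chi_square_with_cramers_v(toy, "Contract", "Churn")
print(f"chi2={chi2:.2f}, p={p_value:.3e}, V={v:.2f}")
```

Cramér's V ranges from 0 (no association) to 1 (perfect association), which makes it easier to compare across features than the raw chi-square statistic.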

In [None]:
# Identify most significant features
significant_features = bivariate_df[bivariate_df['Significant'] == 'Yes']

print(f"\n=== Most Significant Features ({len(significant_features)} total) ===")
if len(significant_features) > 0:
    display(significant_features.head(10))
    
    print(f"\nTop 5 most significant features:")
    for i, (_, row) in enumerate(significant_features.head(5).iterrows()):
        print(f"{i+1}. {row['Feature']} (p={row['P-value']}, effect={row['Effect Interpretation']})")
else:
    print("No statistically significant features found at α=0.05 level.")

In [None]:
# Create bivariate summary visualization
fig_bivariate = visualizer.plot_bivariate_summary(bivariate_results, 'bivariate_summary')
plt.show()

### Detailed Analysis of Top Features

Let's create detailed visualizations for the most significant features.

In [None]:
# Plot churn rates for top categorical features
categorical_significant = significant_features[significant_features['Type'] == 'categorical']

if len(categorical_significant) > 0:
    print("=== Churn Rates by Top Categorical Features ===")
    
    for _, row in categorical_significant.head(3).iterrows():
        feature = row['Feature']
        print(f"\n{feature}:")
        
        # Calculate and display churn rates
        churn_rates = telco_clean.groupby(feature)[target_col].agg(['mean', 'count'])
        churn_rates.columns = ['Churn_Rate', 'Count']
        churn_rates['Churn_Rate_Pct'] = churn_rates['Churn_Rate'] * 100
        
        display(churn_rates.sort_values('Churn_Rate', ascending=False))
        
        # Create visualization
        fig = visualizer.plot_churn_rate_by_segments(
            telco_clean, feature, target_col, 
            title=f'Churn Rate by {feature.replace("_", " ").title()}',
            save_name=f'churn_rate_{feature}'
        )
        plt.show()
        plt.close(fig)

In [None]:
# Plot distributions for top numeric features
numeric_significant = significant_features[significant_features['Type'] == 'numeric']

if len(numeric_significant) > 0:
    print("=== Distributions of Top Numeric Features ===")
    
    for _, row in numeric_significant.head(3).iterrows():
        feature = row['Feature']
        print(f"\n{feature}:")
        
        # Calculate and display summary statistics by churn
        summary_stats = telco_clean.groupby(target_col)[feature].describe()
        display(summary_stats)
        
        # Create visualization
        fig = visualizer.plot_numeric_distribution_by_churn(
            telco_clean, feature, target_col,
            title=f'{feature.replace("_", " ").title()} Distribution by Churn Status',
            save_name=f'distribution_{feature}'
        )
        plt.show()
        plt.close(fig)

## 5. Correlation Analysis {#correlation}

Let's analyze correlations between variables using appropriate measures for different data types.

In [None]:
# Perform correlation analysis
print("Performing correlation analysis...")

# Use mixed correlation analysis (automatic method selection)
correlation_result = eda_analyzer.perform_correlation_analysis(
    telco_clean, method='auto', include_categorical=True
)

print(f"Correlation method used: {correlation_result.correlation_method}")
print(f"Correlation matrix shape: {correlation_result.correlation_matrix.shape}")
print(f"High correlations found: {len(correlation_result.high_correlations)}")
print(f"Significant correlations found: {len(correlation_result.significant_correlations)}")

In [None]:
# Display correlation matrix
print("=== Correlation Matrix ===")
display(correlation_result.correlation_matrix.round(3))

In [None]:
# Create correlation heatmap
fig_correlation = visualizer.plot_correlation_heatmap(
    correlation_result.correlation_matrix,
    title=f'Correlation Matrix ({correlation_result.correlation_method.title()})',
    save_name='correlation_heatmap'
)
plt.show()

In [None]:
# Analyze high correlations
if correlation_result.high_correlations:
    print("=== High Correlations (|r| > 0.7) ===")
    high_corr_df = pd.DataFrame(correlation_result.high_correlations, 
                               columns=['Variable 1', 'Variable 2', 'Correlation'])
    high_corr_df['Abs_Correlation'] = high_corr_df['Correlation'].abs()
    high_corr_df = high_corr_df.sort_values('Abs_Correlation', ascending=False)
    
    display(high_corr_df.drop('Abs_Correlation', axis=1))
    
    print("\n⚠️  High correlations may indicate multicollinearity issues.")
    print("Consider feature selection or dimensionality reduction techniques.")
else:
    print("✅ No high correlations found (|r| > 0.7). Good for avoiding multicollinearity.")
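
How `high_correlations` is populated inside `EDAAnalyzer` is not shown. A simple way to obtain the same kind of list from any correlation matrix is to scan its upper triangle, as in this sketch. The function name `find_high_correlations` and the toy values are assumptions.

```python
import numpy as np
import pandas as pd


def find_high_correlations(corr: pd.DataFrame, threshold: float = 0.7):
    """Return (var1, var2, r) for each unique pair with |r| above the threshold."""
    cols = corr.columns
    # Upper triangle without the diagonal, so each pair is visited once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = []
    for i, j in zip(*np.where(mask)):
        r = corr.iat[i, j]
        if abs(r) > threshold:
            pairs.append((cols[i], cols[j], float(r)))
    return sorted(pairs, key=lambda pair: abs(pair[2]), reverse=True)


toy = pd.DataFrame({
    "tenure": [1, 2, 3, 4, 5, 6],
    "TotalCharges": [10, 20, 31, 39, 52, 60],    # grows with tenure
    "MonthlyCharges": [50, 30, 70, 20, 60, 40],  # unrelated to tenure
})
high_pairs = find_high_correlations(toy.corr())
print(high_pairs)
```

In the Telco data, `tenure` and `TotalCharges` are a likely candidate for this kind of pair, because total charges accumulate over the customer's tenure.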

In [None]:
# Analyze correlations with target variable
if target_col in correlation_result.correlation_matrix.columns:
    target_correlations = correlation_result.correlation_matrix[target_col].drop(target_col)
    target_correlations = target_correlations.sort_values(key=abs, ascending=False)
    
    print(f"=== Correlations with Target Variable ({target_col}) ===")
    print("\nTop 10 strongest correlations:")
    
    target_corr_df = pd.DataFrame({
        'Feature': target_correlations.index,
        'Correlation': target_correlations.values,
        'Abs_Correlation': target_correlations.abs().values
    }).head(10)
    
    display(target_corr_df.drop('Abs_Correlation', axis=1))
    
    # Visualize top correlations with target
    plt.figure(figsize=(10, 6))
    top_10 = target_correlations.head(10)
    colors = ['red' if x < 0 else 'blue' for x in top_10.values]
    
    bars = plt.barh(range(len(top_10)), top_10.values, color=colors, alpha=0.7)
    plt.yticks(range(len(top_10)), [name.replace('_', ' ') for name in top_10.index])
    plt.xlabel(f'Correlation with {target_col}')
    plt.title(f'Top 10 Feature Correlations with {target_col}')
    plt.axvline(x=0, color='black', linestyle='-', alpha=0.3)
    plt.grid(True, alpha=0.3)
    
    # Add value labels
    for i, bar in enumerate(bars):
        width = bar.get_width()
        plt.text(width + (0.01 if width >= 0 else -0.01), bar.get_y() + bar.get_height()/2,
                f'{width:.3f}', ha='left' if width >= 0 else 'right', va='center')
    
    plt.tight_layout()
    plt.show()
else:
    print(f"Target variable '{target_col}' not found in correlation matrix.")

## 6. Statistical Testing {#testing}

Let's perform comprehensive statistical testing with proper multiple testing corrections.

In [None]:
# Initialize statistical tester
tester = StatisticalTester(significance_level=0.05, multiple_testing_correction='bonferroni')

# Perform comprehensive testing
print("Performing comprehensive statistical testing...")
test_results = tester.perform_comprehensive_testing(telco_clean, target_col, feature_cols)

print(f"\nCompleted testing for {len(test_results)} features")

# Calculate effect sizes
print("Calculating effect sizes...")
effect_sizes = tester.calculate_effect_sizes(telco_clean, target_col, feature_cols)

print(f"Calculated effect sizes for {len(effect_sizes)} features")
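
The formulas behind `calculate_effect_sizes` are not visible here. For a numeric feature compared between churned and retained customers, one common choice is Cohen's d with a pooled standard deviation, sketched below with Cohen's conventional cut-offs (0.2, 0.5, 0.8). Both helper names are hypothetical.

```python
import numpy as np


def cohens_d(group_a, group_b) -> float:
    """Standardized mean difference using the pooled sample standard deviation."""
    a = np.asarray(group_a, dtype=float)
    b = np.asarray(group_b, dtype=float)
    pooled_var = (
        (a.size - 1) * a.var(ddof=1) + (b.size - 1) * b.var(ddof=1)
    ) / (a.size + b.size - 2)
    return float((a.mean() - b.mean()) / np.sqrt(pooled_var))


def interpret_d(d: float) -> str:
    """Label the magnitude of d using Cohen's conventional thresholds."""
    magnitude = abs(d)
    if magnitude < 0.2:
        return "negligible"
    if magnitude < 0.5:
        return "small"
    if magnitude < 0.8:
        return "medium"
    return "large"


d = cohens_d([2, 4, 6], [1, 3, 5])
print(d, interpret_d(d))
```

Reporting an effect size next to each p-value helps separate features that are statistically significant from those that are practically important, which matters with large samples.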

In [None]:
# Generate comprehensive testing report
testing_report = tester.generate_testing_report(test_results, effect_sizes)

print("=== Statistical Testing Report ===")
display(testing_report)

# Save the report
report_path = config.TABLES_PATH / 'statistical_testing_report.csv'
testing_report.to_csv(report_path, index=False)
print(f"\nReport saved to: {report_path}")

In [None]:
# Analyze multiple testing correction impact
if 'p_value_corrected' in testing_report.columns:
    print("=== Multiple Testing Correction Impact ===")
    
    original_significant = testing_report['is_significant'].sum()
    corrected_significant = testing_report['is_significant_corrected'].sum()
    
    print(f"Original significant features: {original_significant}")
    print(f"Significant after correction: {corrected_significant}")
    print(f"Features lost due to correction: {original_significant - corrected_significant}")
    
    if corrected_significant > 0:
        print("\nFeatures remaining significant after correction:")
        remaining_significant = testing_report[testing_report['is_significant_corrected'] == True]
        for _, row in remaining_significant.iterrows():
            print(f"  - {row['feature']}: p={row['p_value']:.3e} → p_corrected={row['p_value_corrected']:.3e}")
    else:
        print("\n⚠️  No features remain significant after multiple testing correction.")
        print("Consider using a less conservative correction method or increasing sample size.")
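
To make the effect of the correction concrete, the sketch below adjusts a small set of p-values by hand with Bonferroni and with Benjamini-Hochberg, a less conservative alternative that controls the false discovery rate. This is an illustration in plain NumPy, not the code used by `StatisticalTester`, and the p-values are made up.

```python
import numpy as np


def bonferroni(p_values) -> np.ndarray:
    """Multiply each p-value by the number of tests, capped at 1."""
    p = np.asarray(p_values, dtype=float)
    return np.minimum(p * p.size, 1.0)


def benjamini_hochberg(p_values) -> np.ndarray:
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working from the largest p-value downward
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.minimum(adjusted_sorted, 1.0)
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]
bonf = bonferroni(raw)
bh = benjamini_hochberg(raw)
print("Bonferroni:", bonf, "significant:", int((bonf < 0.05).sum()))
print("Benjamini-Hochberg:", bh, "significant:", int((bh < 0.05).sum()))
```

With these four illustrative p-values, two tests stay below 0.05 after Bonferroni while all four do after Benjamini-Hochberg, which shows why the choice of correction method changes the list of significant features.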

## 7. Key Insights and Business Recommendations {#insights}

Based on our comprehensive EDA, let's summarize the key findings and provide actionable business recommendations.

In [None]:
# Generate analysis summary
analysis_summary = eda_analyzer.get_analysis_summary()

print("=== EDA Analysis Summary ===")
print(f"Analysis timestamp: {analysis_summary['timestamp']}")
print(f"Analyses performed: {', '.join(analysis_summary['analyses_performed'])}")

if 'univariate_summary' in analysis_summary:
    uni_summary = analysis_summary['univariate_summary']
    print(f"\nUnivariate Analysis:")
    print(f"  - Total columns analyzed: {uni_summary['total_columns']}")
    print(f"  - Numeric columns: {uni_summary['numeric_columns']}")
    print(f"  - Categorical columns: {uni_summary['categorical_columns']}")
    print(f"  - Columns with missing values: {uni_summary['columns_with_missing']}")

if 'bivariate_summary' in analysis_summary:
    bi_summary = analysis_summary['bivariate_summary']
    print(f"\nBivariate Analysis:")
    print(f"  - Total features tested: {bi_summary['total_features']}")
    print(f"  - Significant features: {bi_summary['significant_features']}")
    if bi_summary['significant_features'] > 0:
        print(f"  - Significant feature names: {', '.join(bi_summary['significant_feature_names'])}")

if 'correlation_summary' in analysis_summary:
    corr_summary = analysis_summary['correlation_summary']
    print(f"\nCorrelation Analysis:")
    print(f"  - Method used: {corr_summary['correlation_method']}")
    print(f"  - High correlations found: {corr_summary['high_correlations_count']}")
    print(f"  - Significant correlations: {corr_summary['significant_correlations_count']}")

In [None]:
# Identify key business insights
print("\n" + "="*60)
print("KEY BUSINESS INSIGHTS")
print("="*60)

# 1. Churn rate analysis
overall_churn_rate = telco_clean[target_col].mean()
print(f"\n1. OVERALL CHURN RATE: {overall_churn_rate:.1%}")

if overall_churn_rate > 0.25:
    print("   ⚠️  HIGH CHURN RATE - Immediate attention required")
elif overall_churn_rate > 0.15:
    print("   ⚡ MODERATE CHURN RATE - Monitor closely")
else:
    print("   ✅ ACCEPTABLE CHURN RATE - Continue monitoring")

# 2. Most impactful features
print("\n2. MOST IMPACTFUL FEATURES:")
if len(significant_features) > 0:
    top_features = significant_features.head(5)
    for i, (_, row) in enumerate(top_features.iterrows()):
        print(f"   {i+1}. {row['Feature']} ({row['Type']})")
        print(f"      - Statistical significance: p={row['P-value']}")
        print(f"      - Effect size: {row['Effect Interpretation']}")
else:
    print("   No statistically significant features identified.")

# 3. Data quality insights
print("\n3. DATA QUALITY ASSESSMENT:")
missing_vars = [name for name, result in univariate_results.items() if result.missing_count > 0]
if missing_vars:
    print(f"   ⚠️  Variables with missing data: {len(missing_vars)}")
    for var in missing_vars[:3]:  # Show the first 3
        result = univariate_results[var]
        print(f"      - {var}: {result.missing_percentage:.1f}% missing")
else:
    print("   ✅ No missing data detected")

# 4. Correlation insights
print("\n4. FEATURE RELATIONSHIPS:")
if correlation_result.high_correlations:
    print(f"   ⚠️  {len(correlation_result.high_correlations)} high correlations detected")
    print("      - May indicate multicollinearity issues")
    print("      - Consider feature selection for modeling")
else:
    print("   ✅ No concerning correlations detected")
    print("      - Features appear to be relatively independent")

In [None]:
# Generate business recommendations
print("\n" + "="*60)
print("BUSINESS RECOMMENDATIONS")
print("="*60)

recommendations = []

# Recommendation 1: Focus on significant features
if len(significant_features) > 0:
    top_feature = significant_features.iloc[0]['Feature']
    recommendations.append(
        f"1. PRIORITIZE {top_feature.upper()} MANAGEMENT\n"
        f"   - This is the most statistically significant predictor of churn\n"
        f"   - Develop targeted interventions around this feature\n"
        f"   - Monitor changes in this variable closely"
    )

# Recommendation 2: Address high-risk segments
if len(categorical_significant) > 0:
    recommendations.append(
        "2. SEGMENT-SPECIFIC RETENTION STRATEGIES\n"
        "   - Develop tailored retention programs for high-risk segments\n"
        "   - Focus resources on segments with highest churn rates\n"
        "   - Create early warning systems for at-risk customers"
    )

# Recommendation 3: Data quality improvements
if missing_vars:
    recommendations.append(
        "3. IMPROVE DATA COLLECTION\n"
        f"   - Address missing data in {len(missing_vars)} variables\n"
        "   - Implement data validation at point of collection\n"
        "   - Consider imputation strategies for modeling"
    )

# Recommendation 4: Feature engineering
if correlation_result.high_correlations:
    recommendations.append(
        "4. FEATURE ENGINEERING FOR MODELING\n"
        "   - Address multicollinearity through feature selection\n"
        "   - Consider creating composite features\n"
        "   - Use dimensionality reduction techniques if needed"
    )

# Recommendation 5: Monitoring and tracking
recommendations.append(
    "5. IMPLEMENT CONTINUOUS MONITORING\n"
    "   - Set up automated EDA reports for new data\n"
    "   - Track feature importance over time\n"
    "   - Monitor for data drift and concept drift"
)

for rec in recommendations:
    print(f"\n{rec}")
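
The monitoring recommendation mentions data drift without saying how to detect it. One simple option for a single numeric feature is a two-sample Kolmogorov-Smirnov test between a reference sample and newly collected data, sketched below with `scipy.stats.ks_2samp`. The helper name, the 0.05 threshold, and the simulated charges are assumptions.

```python
import numpy as np
from scipy import stats


def detect_drift(reference, current, alpha: float = 0.05) -> dict:
    """Flag drift when the two samples differ under a two-sample KS test."""
    result = stats.ks_2samp(reference, current)
    return {
        "statistic": float(result.statistic),
        "p_value": float(result.pvalue),
        "drift": bool(result.pvalue < alpha),
    }


rng = np.random.default_rng(0)
reference = rng.normal(65, 20, 500)   # e.g. MonthlyCharges at training time
shifted = rng.normal(105, 20, 500)    # mean moved by two standard deviations

drifted = detect_drift(reference, shifted)
unchanged = detect_drift(reference, reference)
print(drifted["drift"], unchanged["drift"])
```

This checks only the marginal distribution of one feature. Concept drift, where the relationship between features and churn changes, would need a separate check such as tracking model performance over time.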

## 8. Conclusions {#conclusions}

Let's create a comprehensive EDA dashboard and summarize our findings.

In [3]:
# Create comprehensive EDA dashboard
print("Creating comprehensive EDA dashboard...")

dashboard_figures = visualizer.create_eda_dashboard(
    telco_clean, target_col, univariate_results, bivariate_results, 
    correlation_result, save_name='eda_dashboard'
)

print(f"\nCreated {len(dashboard_figures)} dashboard figures")

# Display all dashboard figures
for i, fig in enumerate(dashboard_figures):
    print(f"\n--- Dashboard Figure {i+1} ---")
    display(fig)  # render this specific figure; plt.show() would flush every open figure at once
    plt.close(fig)

Creating comprehensive EDA dashboard...


NameError: name 'visualizer' is not defined

In [None]:
# Final summary and next steps
print("\n" + "="*80)
print("EXPLORATORY DATA ANALYSIS - FINAL SUMMARY")
print("="*80)

print(f"\n📊 DATASET OVERVIEW:")
print(f"   - Samples: {len(telco_clean):,}")
print(f"   - Features: {len(feature_cols)}")
print(f"   - Churn rate: {overall_churn_rate:.1%}")
print(f"   - Data quality: {'Good' if len(missing_vars) == 0 else 'Needs attention'}")

print(f"\n🔍 KEY FINDINGS:")
print(f"   - Significant predictors identified: {len(significant_features)}")
print(f"   - High correlations detected: {len(correlation_result.high_correlations)}")
print(f"   - Statistical tests performed: {len(test_results)}")
print(f"   - Effect sizes calculated: {len(effect_sizes)}")

print(f"\n📈 BUSINESS IMPACT:")
if len(significant_features) > 0:
    print(f"   - Clear drivers of churn identified")
    print(f"   - Actionable insights for retention strategies")
    print(f"   - Strong foundation for predictive modeling")
else:
    print(f"   - No clear statistical drivers identified")
    print(f"   - May need more data or different features")
    print(f"   - Consider advanced feature engineering")

print(f"\n🚀 NEXT STEPS:")
print(f"   1. Proceed with feature engineering based on EDA insights")
print(f"   2. Address data quality issues identified")
print(f"   3. Develop predictive models using significant features")
print(f"   4. Create business rules based on segment analysis")
print(f"   5. Set up monitoring for key churn indicators")

print(f"\n📁 OUTPUTS GENERATED:")
print(f"   - Statistical testing report: {config.TABLES_PATH / 'statistical_testing_report.csv'}")
print(f"   - EDA visualizations: {config.FIGURES_PATH}")
print(f"   - Analysis results stored in memory for next steps")

print("\n" + "="*80)
print("EDA ANALYSIS COMPLETE ✅")
print("="*80)

---

## Summary

This comprehensive EDA notebook has provided:

1. **Complete data profiling** with quality assessment
2. **Univariate analysis** of all features with appropriate statistics
3. **Bivariate analysis** with statistical testing against the target
4. **Correlation analysis** using appropriate measures for mixed data types
5. **Statistical significance testing** with multiple testing corrections
6. **Publication-ready visualizations** automatically saved to reports
7. **Business insights and actionable recommendations**

The analysis framework is designed to be:
- **Reproducible**: All random seeds set and parameters configurable
- **Scalable**: Works with different datasets and sizes
- **Comprehensive**: Covers all major aspects of EDA
- **Business-focused**: Provides actionable insights

**Next Steps**: Use these insights to guide feature engineering and model development in subsequent notebooks.