# Part A: Outlier Detection Methods

## Method 1: Partitioning Method (K-Means Based)

### Code Cell 1: Import Libraries and Load Data
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, DBSCAN
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
from scipy.spatial.distance import cdist
import warnings
warnings.filterwarnings('ignore')

# Load the dataset
df = pd.read_csv('reduced_file.csv')
print("Dataset shape:", df.shape)
print("\nFirst few rows:")
df.head()
```

### Text Cell 1: Data Preprocessing
Before detecting outliers, we need to prepare numerical features and handle missing values.

### Code Cell 2: Prepare Numerical Features
```python
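# Convert TotalCharges to numeric; blank or malformed strings become NaN
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
# Impute missing values with the median (assign back instead of using inplace on a column)
df['TotalCharges'] = df['TotalCharges'].fillna(df['TotalCharges'].median())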

# Select numerical features for outlier detection
numerical_features = ['tenure', 'MonthlyCharges', 'TotalCharges']
X_numerical = df[numerical_features].copy()

print("Numerical features statistics:")
print(X_numerical.describe())
```

### Text Cell 2: K-Means Clustering for Outlier Detection
We use K-Means to flag outliers by their distance from the nearest cluster centroid.
Points whose distance exceeds the 95th percentile of all distances are treated as outliers, so this rule flags roughly 5% of rows by construction.
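The thresholding rule can be illustrated on a toy example. This is a minimal sketch in plain Python, assuming the centroids are already known; the `centroids` and `points` values and the helper names are made up for illustration, and the notebook itself relies on scikit-learn's `KMeans` and `np.percentile`.

```python
import math

# Hypothetical, already-fitted centroids and a handful of 2-D points.
centroids = [(0.0, 0.0), (10.0, 10.0)]
points = [(0.1, 0.2), (0.3, -0.1), (9.8, 10.1), (10.2, 9.9), (5.0, 5.0)]

def nearest_centroid_distance(point, centroids):
    """Euclidean distance from a point to its closest centroid."""
    return min(math.dist(point, c) for c in centroids)

def percentile(values, q):
    """Percentile with linear interpolation between the two closest ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100
    lo, hi = math.floor(rank), math.ceil(rank)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)

distances = [nearest_centroid_distance(p, centroids) for p in points]
threshold = percentile(distances, 95)

# A point is flagged when it lies farther from its centroid than the threshold.
outlier_flags = [d > threshold for d in distances]
print(outlier_flags)
```

Only the point halfway between the two centroids is flagged here. Because the cut-off is a percentile of the distances themselves, some points are always flagged, even in data with no real anomalies.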

### Code Cell 3: K-Means Outlier Detection
```python
# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_numerical)

# Apply K-Means clustering
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
kmeans_labels = kmeans.fit_predict(X_scaled)

# Calculate distances to cluster centers
distances = cdist(X_scaled, kmeans.cluster_centers_, 'euclidean')
min_distances = np.min(distances, axis=1)

# Define outliers as points beyond 95th percentile distance
threshold_kmeans = np.percentile(min_distances, 95)
outliers_kmeans = min_distances > threshold_kmeans

print(f"K-Means Method:")
print(f"  Outliers detected: {outliers_kmeans.sum()}")
print(f"  Percentage: {outliers_kmeans.sum()/len(df)*100:.2f}%")

# Visualize
plt.figure(figsize=(12, 4))

plt.subplot(1, 3, 1)
plt.scatter(X_numerical['tenure'], X_numerical['MonthlyCharges'], 
            c=outliers_kmeans, cmap='coolwarm', alpha=0.6)
plt.xlabel('Tenure')
plt.ylabel('Monthly Charges')
plt.title('K-Means Outliers: Tenure vs Monthly Charges')
plt.colorbar(label='Outlier')

plt.subplot(1, 3, 2)
plt.scatter(X_numerical['tenure'], X_numerical['TotalCharges'], 
            c=outliers_kmeans, cmap='coolwarm', alpha=0.6)
plt.xlabel('Tenure')
plt.ylabel('Total Charges')
plt.title('K-Means Outliers: Tenure vs Total Charges')
plt.colorbar(label='Outlier')

plt.subplot(1, 3, 3)
plt.hist(min_distances, bins=50, alpha=0.7)
plt.axvline(threshold_kmeans, color='r', linestyle='--', label='Threshold')
plt.xlabel('Distance to Nearest Centroid')
plt.ylabel('Frequency')
plt.title('Distance Distribution')
plt.legend()

plt.tight_layout()
plt.show()
```

---

## Method 2: Hierarchical Clustering

### Text Cell 3: Hierarchical Clustering Approach
Hierarchical clustering builds a tree of clusters. We cut the tree into five clusters
and flag every member of a cluster that holds fewer than 5% of the rows as an outlier.
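The small-cluster rule can be sketched without any clustering library. The labels below are hypothetical stand-ins for what `fcluster` might return, and `min_fraction` is an illustrative name for the 5% cut-off.

```python
from collections import Counter

# Hypothetical cluster labels for 40 points.
labels = [1] * 25 + [2] * 14 + [3] * 1
min_fraction = 0.05  # clusters holding less than 5% of the points count as "small"

counts = Counter(labels)
small_clusters = {c for c, n in counts.items() if n < min_fraction * len(labels)}

# Every member of a small cluster is flagged as an outlier.
outlier_flags = [label in small_clusters for label in labels]
print(small_clusters, sum(outlier_flags))
```

If no cluster falls below the cut-off, this rule flags nothing, so the result depends heavily on the number of clusters chosen when cutting the tree.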

### Code Cell 4: Hierarchical Outlier Detection
```python
# Perform hierarchical clustering
linkage_matrix = linkage(X_scaled, method='ward')

# Cut tree to form clusters
n_clusters = 5
hierarchical_labels = fcluster(linkage_matrix, n_clusters, criterion='maxclust')

# Identify small clusters as outliers
cluster_counts = pd.Series(hierarchical_labels).value_counts()
small_clusters = cluster_counts[cluster_counts < len(df) * 0.05].index
outliers_hierarchical = np.isin(hierarchical_labels, small_clusters)

print(f"\nHierarchical Method:")
print(f"  Outliers detected: {outliers_hierarchical.sum()}")
print(f"  Percentage: {outliers_hierarchical.sum()/len(df)*100:.2f}%")
print(f"  Cluster sizes: {cluster_counts.sort_index().values}")

# Visualize dendrogram
plt.figure(figsize=(15, 5))

plt.subplot(1, 2, 1)
dendrogram(linkage_matrix, truncate_mode='lastp', p=30)
plt.xlabel('Sample Index or Cluster Size')
plt.ylabel('Distance')
plt.title('Hierarchical Clustering Dendrogram (Truncated)')

plt.subplot(1, 2, 2)
plt.scatter(X_numerical['tenure'], X_numerical['MonthlyCharges'], 
            c=outliers_hierarchical, cmap='coolwarm', alpha=0.6)
plt.xlabel('Tenure')
plt.ylabel('Monthly Charges')
plt.title('Hierarchical Outliers: Tenure vs Monthly Charges')
plt.colorbar(label='Outlier')

plt.tight_layout()
plt.show()
```

---

## Method 3: Density-Based Method (DBSCAN)

### Text Cell 4: DBSCAN for Outlier Detection
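DBSCAN labels points in low-density regions as noise (label -1), and we treat those points as outliers.
It is particularly effective for finding irregularly shaped clusters and does not need the number of clusters in advance, but its results depend on the `eps` and `min_samples` settings.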
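To make the noise definition concrete, here is a brute-force sketch in plain Python. It is not scikit-learn's implementation; the function name `dbscan_noise` and the toy points are assumptions for illustration. A core point has at least `min_samples` points (itself included) within `eps`, and noise is any non-core point with no core point within `eps`.

```python
import math

def dbscan_noise(points, eps, min_samples):
    """Return one flag per point: True if the point would be DBSCAN noise."""
    # Indices of all points within eps of each point (the point itself included).
    neighbors = [
        [j for j, q in enumerate(points) if math.dist(p, q) <= eps]
        for p in points
    ]
    is_core = [len(n) >= min_samples for n in neighbors]
    # Border points sit within eps of a core point; everything else non-core is noise.
    return [
        not is_core[i] and not any(is_core[j] for j in neighbors[i])
        for i in range(len(points))
    ]

# Four points forming a dense square plus one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (8, 8)]
noise = dbscan_noise(points, eps=1.5, min_samples=3)
print(noise)
```

The pairwise distance scan is quadratic in the number of points, which is fine for a toy example but is one reason to use the library implementation on real data.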

### Code Cell 5: DBSCAN Outlier Detection
```python
# Apply DBSCAN
dbscan = DBSCAN(eps=0.5, min_samples=10)
dbscan_labels = dbscan.fit_predict(X_scaled)

# Points labeled as -1 are outliers
outliers_dbscan = dbscan_labels == -1

print(f"\nDBSCAN Method:")
print(f"  Outliers detected: {outliers_dbscan.sum()}")
print(f"  Percentage: {outliers_dbscan.sum()/len(df)*100:.2f}%")
print(f"  Number of clusters found: {len(set(dbscan_labels)) - (1 if -1 in dbscan_labels else 0)}")

# Visualize
plt.figure(figsize=(12, 4))

plt.subplot(1, 3, 1)
plt.scatter(X_numerical['tenure'], X_numerical['MonthlyCharges'], 
            c=dbscan_labels, cmap='viridis', alpha=0.6)
plt.xlabel('Tenure')
plt.ylabel('Monthly Charges')
plt.title('DBSCAN Clusters')
plt.colorbar(label='Cluster')

plt.subplot(1, 3, 2)
plt.scatter(X_numerical['tenure'], X_numerical['MonthlyCharges'], 
            c=outliers_dbscan, cmap='coolwarm', alpha=0.6)
plt.xlabel('Tenure')
plt.ylabel('Monthly Charges')
plt.title('DBSCAN Outliers')
plt.colorbar(label='Outlier')

plt.subplot(1, 3, 3)
# Compare all three methods
comparison = pd.DataFrame({
    'K-Means': outliers_kmeans,
    'Hierarchical': outliers_hierarchical,
    'DBSCAN': outliers_dbscan
})
comparison.sum().plot(kind='bar', color=['#1f77b4', '#ff7f0e', '#2ca02c'])
plt.ylabel('Number of Outliers')
plt.title('Comparison of Outlier Detection Methods')
plt.xticks(rotation=45)

plt.tight_layout()
plt.show()
```

---

## Consensus Outliers and Data Cleaning

### Text Cell 5: Finding Consensus Outliers
We'll identify outliers detected by multiple methods for more reliable detection.
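The voting rule is simple enough to show on made-up flags. In this sketch the three lists are hypothetical per-row outlier flags, one list per method.

```python
# Hypothetical outlier flags for five rows, one list per method.
kmeans_flags       = [True,  False, True,  False, False]
hierarchical_flags = [True,  False, False, False, True]
dbscan_flags       = [True,  True,  True,  False, False]

# Count how many methods flagged each row (True counts as 1).
votes = [sum(flags) for flags in zip(kmeans_flags, hierarchical_flags, dbscan_flags)]

# A row is a consensus outlier when at least two methods agree.
consensus = [v >= 2 for v in votes]
print(votes, consensus)
```

Rows flagged by a single method are kept, which trades some sensitivity for fewer false removals.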

### Code Cell 6: Consensus and Removal
```python
# Create consensus outliers (detected by at least 2 methods)
outlier_votes = (outliers_kmeans.astype(int) + 
                 outliers_hierarchical.astype(int) + 
                 outliers_dbscan.astype(int))

consensus_outliers = outlier_votes >= 2

print("\nConsensus Analysis:")
print(f"  Outliers detected by all 3 methods: {(outlier_votes == 3).sum()}")
print(f"  Outliers detected by 2+ methods: {consensus_outliers.sum()}")
print(f"  Percentage: {consensus_outliers.sum()/len(df)*100:.2f}%")

# Create cleaned dataset
df_cleaned = df[~consensus_outliers].copy()
print(f"\nDataset size after removing outliers:")
print(f"  Original: {len(df)} rows")
print(f"  Cleaned: {len(df_cleaned)} rows")
print(f"  Removed: {len(df) - len(df_cleaned)} rows")

# Save cleaned dataset
df_cleaned.to_csv('churn_cleaned.csv', index=False)
print("\nCleaned dataset saved as 'churn_cleaned.csv'")

# Visualize consensus
plt.figure(figsize=(10, 6))
plt.scatter(X_numerical['tenure'], X_numerical['MonthlyCharges'], 
            c=outlier_votes, cmap='RdYlGn_r', alpha=0.6, s=50)
plt.colorbar(label='Number of Methods Detecting as Outlier')
plt.xlabel('Tenure (months)')
plt.ylabel('Monthly Charges ($)')
plt.title('Consensus Outlier Detection\n(0=Normal, 3=All methods agree)')
plt.tight_layout()
plt.show()
```

# Part B: Strategy to Enhance Performance

## Strategy Overview

### Text Cell 6: Performance Enhancement Strategy
**Strategy: Feature Engineering + Class Balancing + Ensemble Methods**

This strategy combines three powerful techniques:
1. **Feature Engineering**: Create meaningful features from existing data
2. **Class Balancing**: Address imbalanced churn rates using SMOTE
3. **Ensemble Methods**: Use multiple models for robust predictions

### Code Cell 7: Feature Engineering
```python
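# pandas, numpy and matplotlib are imported in Part A; repeated here so Part B runs on its own
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split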
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from imblearn.over_sampling import SMOTE

# Load cleaned dataset
df_clean = pd.read_csv('churn_cleaned.csv')

# Feature Engineering
def engineer_features(df):
    df = df.copy()
    
    # 1. Revenue per month of tenure
    df['RevenuePerMonth'] = df['TotalCharges'] / (df['tenure'] + 1)
    
    # 2. Service count (number of additional services)
    service_cols = ['OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 
                    'TechSupport', 'StreamingTV', 'StreamingMovies']
    df['ServiceCount'] = (df[service_cols] == 'Yes').sum(axis=1)
    
    # 3. Has premium services
    df['HasPremiumServices'] = (df['ServiceCount'] >= 3).astype(int)
    
    # 4. Contract risk score (month-to-month = high risk)
    df['ContractRisk'] = (df['Contract'] == 'Month-to-month').astype(int)
    
    # 5. Payment method risk
    df['PaymentRisk'] = (df['PaymentMethod'] == 'Electronic check').astype(int)
    
    # 6. Senior with dependents
    df['SeniorWithDependents'] = (df['SeniorCitizen'] == 1) & (df['Dependents'] == 'Yes')
    df['SeniorWithDependents'] = df['SeniorWithDependents'].astype(int)
    
    # 7. Tenure category (include_lowest=True keeps tenure == 0 in the 'New' bin instead of NaN)
    df['TenureCategory'] = pd.cut(df['tenure'], bins=[0, 12, 24, 48, 100],
                                   labels=['New', 'Medium', 'Long', 'VeryLong'],
                                   include_lowest=True)
    
    return df

df_engineered = engineer_features(df_clean)
print("New features created:")
print("  - RevenuePerMonth: Average revenue per tenure month")
print("  - ServiceCount: Number of additional services")
print("  - HasPremiumServices: Binary flag for 3+ services")
print("  - ContractRisk: Month-to-month contract flag")
print("  - PaymentRisk: Electronic check payment flag")
print("  - SeniorWithDependents: Senior citizen with dependents")
print("  - TenureCategory: Categorical tenure grouping")
```

### Text Cell 7: Why This Strategy Works
**Rationale for Feature Engineering:**
- **RevenuePerMonth**: Captures customer value intensity
- **ServiceCount**: Measures engagement with company services
- **Risk Scores**: Identify high-churn risk factors from domain knowledge
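- **Tenure Categories**: Let the models pick up non-linear effects of tenure on churn

**Benefits:**
- Encodes domain knowledge directly in the features
- Exposes non-linear relationships that a linear model could not learn from the raw columns
- Produces features that are easy to explain to business stakeholders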

### Code Cell 8: Prepare Data for Modeling
```python
# Encode categorical variables
df_model = df_engineered.copy()

# Label encode binary categories
binary_cols = ['gender', 'Partner', 'Dependents', 'PhoneService', 'PaperlessBilling']
le = LabelEncoder()
for col in binary_cols:
    df_model[col] = le.fit_transform(df_model[col])

# One-hot encode multi-class categories
categorical_cols = ['MultipleLines', 'InternetService', 'OnlineSecurity', 
                    'OnlineBackup', 'DeviceProtection', 'TechSupport',
                    'StreamingTV', 'StreamingMovies', 'Contract', 'PaymentMethod',
                    'TenureCategory']

df_model = pd.get_dummies(df_model, columns=categorical_cols, drop_first=True)

# Encode target
df_model['Churn'] = (df_model['Churn'] == 'Yes').astype(int)

# Drop customerID
df_model = df_model.drop('customerID', axis=1)

print(f"Final feature count: {df_model.shape[1] - 1}")
print(f"Churn distribution:\n{df_model['Churn'].value_counts()}")
```

### Text Cell 8: Class Balancing with SMOTE
**Why SMOTE?**
- Synthetic Minority Over-sampling Technique
- Generates synthetic minority-class examples by interpolating between a minority sample and one of its nearest minority neighbours
- Better than simple over-sampling (avoids exact duplicates)
- Helps model learn minority class patterns
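The interpolation idea behind SMOTE can be sketched in a few lines. This is a simplified illustration in plain Python, not the `imblearn` implementation; the function name `smote_like_sample` and the toy minority points are assumptions.

```python
import math
import random

def smote_like_sample(minority, k=2, rng=None):
    """Create one synthetic point on the segment between a random minority
    point and one of its k nearest minority neighbours."""
    rng = rng or random.Random(0)
    base = rng.choice(minority)
    # Other minority points, ordered by distance from the chosen base point.
    others = sorted((p for p in minority if p is not base),
                    key=lambda p: math.dist(base, p))
    neighbour = rng.choice(others[:k])
    gap = rng.random()  # position along the segment, in [0, 1)
    return tuple(b + gap * (n - b) for b, n in zip(base, neighbour))

# Hypothetical minority-class samples with two features each.
minority = [(1.0, 1.0), (1.5, 1.2), (2.0, 0.8), (1.2, 1.9)]
synthetic = smote_like_sample(minority, rng=random.Random(42))
print(synthetic)
```

Because the new point lies between two real minority samples, it is not an exact duplicate of either, which is the advantage over plain random over-sampling.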

### Code Cell 9: Apply SMOTE and Train Models
```python
# Split features and target
X = df_model.drop('Churn', axis=1)
y = df_model['Churn']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Before SMOTE - Training set:")
print(f"  Class 0: {(y_train == 0).sum()}")
print(f"  Class 1: {(y_train == 1).sum()}")

# Apply SMOTE to training data only
smote = SMOTE(random_state=42)
X_train_balanced, y_train_balanced = smote.fit_resample(X_train, y_train)

print(f"\nAfter SMOTE - Training set:")
print(f"  Class 0: {(y_train_balanced == 0).sum()}")
print(f"  Class 1: {(y_train_balanced == 1).sum()}")

# Train multiple models (Ensemble approach)
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42)
}

results = {}
for name, model in models.items():
    model.fit(X_train_balanced, y_train_balanced)
    y_pred = model.predict(X_test)
    y_pred_proba = model.predict_proba(X_test)[:, 1]
    
    results[name] = {
        'predictions': y_pred,
        'probabilities': y_pred_proba,
        'model': model
    }
    
print("\nModels trained successfully!")
```

### Text Cell 9: Why Ensemble Methods?
**Benefits of Using Multiple Models:**
1. **Diversity**: Different algorithms capture different patterns
2. **Robustness**: Reduces reliance on any single model's bias and lowers the risk of overfitting
3. **Improved Performance**: Averaging predictions often outperforms single models

**Model Selection:**
- **Logistic Regression**: Linear baseline, interpretable
- **Random Forest**: Handles non-linear relationships, feature importance
- **Gradient Boosting**: Often best performance, sequential learning
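Averaging predicted probabilities (often called soft voting) can be illustrated with made-up numbers. The probabilities below are hypothetical churn probabilities for three customers, one list per model.

```python
# Hypothetical predicted churn probabilities for three customers.
probabilities = {
    "Logistic Regression": [0.20, 0.45, 0.90],
    "Random Forest":       [0.10, 0.45, 0.80],
    "Gradient Boosting":   [0.30, 0.95, 0.70],
}

# Average the models' probabilities customer by customer.
averaged = [sum(col) / len(col) for col in zip(*probabilities.values())]

# Threshold the averaged probability at 0.5 to get the ensemble label.
predictions = [int(p > 0.5) for p in averaged]
print(predictions)
```

For the second customer, two of the three models predict no churn on their own, yet the averaged probability is above 0.5 because one model is very confident. This is where probability averaging differs from a hard majority vote.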

### Code Cell 10: Evaluate Models
```python
# Evaluate each model
print("="*60)
print("MODEL PERFORMANCE COMPARISON")
print("="*60)

for name, result in results.items():
    print(f"\n{name}:")
    print("-" * 40)
    print(classification_report(y_test, result['predictions']))
    print(f"ROC-AUC Score: {roc_auc_score(y_test, result['probabilities']):.4f}")
    
    # Confusion Matrix
    cm = confusion_matrix(y_test, result['predictions'])
    print(f"Confusion Matrix:")
    print(cm)

# Ensemble prediction (soft voting: average the predicted probabilities, then threshold at 0.5)
ensemble_pred_proba = np.mean([
    results['Logistic Regression']['probabilities'],
    results['Random Forest']['probabilities'],
    results['Gradient Boosting']['probabilities']
], axis=0)
ensemble_pred = (ensemble_pred_proba > 0.5).astype(int)

print("\n" + "="*60)
print("ENSEMBLE MODEL (Voting)")
print("="*60)
print(classification_report(y_test, ensemble_pred))
print(f"ROC-AUC Score: {roc_auc_score(y_test, ensemble_pred_proba):.4f}")
```

### Code Cell 11: Feature Importance Analysis
```python
# Get feature importance from Random Forest
rf_model = results['Random Forest']['model']
feature_importance = pd.DataFrame({
    'feature': X.columns,
    'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False)

# Plot top 15 features
plt.figure(figsize=(10, 6))
top_features = feature_importance.head(15)
plt.barh(range(len(top_features)), top_features['importance'])
plt.yticks(range(len(top_features)), top_features['feature'])
plt.xlabel('Importance')
plt.title('Top 15 Most Important Features (Random Forest)')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

print("\nTop 10 Most Important Features:")
print(feature_importance.head(10).to_string(index=False))
```

### Text Cell 10: Strategy Defense and Summary

**Why This Strategy Enhances Performance:**

**1. Feature Engineering (Addresses Domain Knowledge)**
   - Creates features aligned with business understanding
   - Captures complex relationships not visible in raw data
   - Examples: Customer value (RevenuePerMonth), engagement (ServiceCount)

**2. Class Balancing with SMOTE (Addresses Imbalance)**
   - Churn datasets typically have imbalanced classes
   - SMOTE creates synthetic minority samples by interpolating between neighbouring minority points
   - Reduces model bias toward the majority class
   - Improves recall on churn (minority) class

**3. Ensemble Methods (Addresses Model Variance)**
   - Combines strengths of multiple algorithms
   - Reduces overfitting through model diversity
   - More robust predictions through voting/averaging
   - Better generalization to unseen data

**Expected Improvements:**
- Better recall for churn class (fewer missed churners)
- Higher ROC-AUC score (better class separation)
- More balanced precision-recall trade-off
- Feature importance for business insights
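Recall and precision for the churn class follow directly from the confusion-matrix counts. The counts below are hypothetical; for a binary problem, `confusion_matrix(y_true, y_pred).ravel()` in scikit-learn returns them in the order `tn, fp, fn, tp`.

```python
# Hypothetical confusion-matrix counts, with churn as the positive class.
tn, fp, fn, tp = 120, 30, 15, 35

recall = tp / (tp + fn)      # share of actual churners the model caught
precision = tp / (tp + fp)   # share of predicted churners who really churned

print(f"Recall: {recall:.2f}, Precision: {precision:.2f}")
```

Balancing the training set typically raises recall at some cost in precision, so both should be reported together.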

**Trade-offs:**
- Increased computational cost (multiple models)
- More complex than single model approach
- Requires careful hyperparameter tuning

**Conclusion:**
This three-pronged strategy addresses the key challenges in churn prediction:
domain complexity, class imbalance, and model robustness. Each component
contributes to improved overall performance.

### Code Cell 12: Final Summary Statistics
```python
# Summary of improvements
print("="*60)
print("PERFORMANCE ENHANCEMENT SUMMARY")
print("="*60)
print(f"\n1. Dataset:")
print(f"   - Original size: {len(pd.read_csv('reduced_file.csv'))} samples")
print(f"   - After outlier removal: {len(df_clean)} samples")
print(f"   - Features after engineering: {X.shape[1]}")

print(f"\n2. Class Distribution:")
print(f"   - Original (Test): {(y_test==0).sum()} non-churn, {(y_test==1).sum()} churn")
print(f"   - Training (After SMOTE): {(y_train_balanced==0).sum()} each class")

print(f"\n3. Models Trained:")
for name in models.keys():
    print(f"   - {name}")
print(f"   - Ensemble (Voting)")

print(f"\n4. Key Features Created:")
print(f"   - RevenuePerMonth, ServiceCount, HasPremiumServices")
print(f"   - ContractRisk, PaymentRisk, TenureCategory")

print("\nStrategy successfully implemented!")
```

📝 Implementation Notes:
- All code cells include comments explaining each step
- Text cells provide context and rationale for the methods used
- Visualizations help understand outlier patterns
- Performance strategy is defended with clear reasoning
- Copy this content into a Jupyter Notebook to execute
