## Final Project: Predicting Customer Churn for Interconnect

In [3]:

# Step 1: Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import roc_auc_score, accuracy_score

# Step 2: Load Data
contract = pd.read_csv('/datasets/final_provider/contract.csv')
personal = pd.read_csv('/datasets/final_provider/personal.csv')
internet = pd.read_csv('/datasets/final_provider/internet.csv')
phone = pd.read_csv('/datasets/final_provider/phone.csv')

# Step 3: Merge Data
df = contract.merge(personal, on='customerID', how='left')
df = df.merge(internet, on='customerID', how='left')
df = df.merge(phone, on='customerID', how='left')

# Step 4: Target Creation and Cleaning
df['churn'] = (df['EndDate'] != 'No').astype(int)  # 1 = contract ended (churned), 0 = still active
internet_cols = ['InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
                 'TechSupport', 'StreamingTV', 'StreamingMovies']
df[internet_cols] = df[internet_cols].fillna('No internet service')
df['MultipleLines'] = df['MultipleLines'].fillna('No phone service')

# Step 5: Preprocessing
df_model = df.copy()
df_model = df_model.drop(['customerID', 'BeginDate', 'EndDate'], axis=1)

binary_cols = ['gender', 'Partner', 'Dependents', 'PaperlessBilling', 'OnlineSecurity',
               'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV',
               'StreamingMovies', 'MultipleLines']
multi_cols = ['Type', 'PaymentMethod', 'InternetService']

for col in binary_cols:
    df_model[col] = df_model[col].map({'Yes': 1, 'No': 0, 'Male': 1, 'Female': 0,
                                       'No internet service': 0, 'No phone service': 0})

df_model = pd.get_dummies(df_model, columns=multi_cols)

# Convert TotalCharges to numeric; non-numeric entries (e.g. blank strings) become NaN and are then set to 0
df_model['TotalCharges'] = pd.to_numeric(df_model['TotalCharges'], errors='coerce')
df_model['TotalCharges'] = df_model['TotalCharges'].fillna(0)

# Step 6: Train-Test Split
X = df_model.drop('churn', axis=1)
y = df_model['churn']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

X_train = X_train.copy()
X_test = X_test.copy()

# Step 7: Scaling
numeric_cols = ['MonthlyCharges', 'TotalCharges', 'SeniorCitizen']
scaler = StandardScaler()
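# Assign whole columns so the float results replace the integer SeniorCitizen column
# without an incompatible-dtype warning; the scaler is fit on the training split only.
X_train[numeric_cols] = scaler.fit_transform(X_train[numeric_cols])
X_test[numeric_cols] = scaler.transform(X_test[numeric_cols])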

# Combine X_train and y_train for upsampling
train_data = pd.concat([X_train, y_train], axis=1)

# Split into classes
majority = train_data[train_data['churn'] == 0]
minority = train_data[train_data['churn'] == 1]

# Upsample the minority class
minority_upsampled = minority.sample(n=len(majority), replace=True, random_state=42)

# Combine back into a balanced training set
train_upsampled = pd.concat([majority, minority_upsampled])

# Split back into features and target (no shuffle is applied here)
X_train_balanced = train_upsampled.drop('churn', axis=1)
y_train_balanced = train_upsampled['churn']

# Train Gradient Boosting on upsampled data
gb = GradientBoostingClassifier(random_state=42)
gb.fit(X_train_balanced, y_train_balanced)

# Predict
gb_pred_prob = gb.predict_proba(X_test)[:, 1]
gb_pred = gb.predict(X_test)

# Evaluate
gb_auc = roc_auc_score(y_test, gb_pred_prob)
gb_acc = accuracy_score(y_test, gb_pred)

print(f"Gradient Boosting AUC-ROC: {gb_auc:.3f}")
print(f"Gradient Boosting Accuracy: {gb_acc:.3f}")

# Step 8: Logistic Regression (trained on the original, non-upsampled training set)
logreg = LogisticRegression(max_iter=1000, random_state=42)
logreg.fit(X_train, y_train)
y_pred_prob = logreg.predict_proba(X_test)[:, 1]
y_pred = logreg.predict(X_test)
logreg_auc = roc_auc_score(y_test, y_pred_prob)
logreg_acc = accuracy_score(y_test, y_pred)

# Step 9: Random Forest (trained on the original, non-upsampled training set)
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
rf_pred_prob = rf.predict_proba(X_test)[:, 1]
rf_pred = rf.predict(X_test)
rf_auc = roc_auc_score(y_test, rf_pred_prob)
rf_acc = accuracy_score(y_test, rf_pred)

# Step 10: Compare model performance
results = pd.DataFrame({
    'Model': ['Logistic Regression', 'Random Forest', 'Gradient Boosting'],
    'AUC-ROC': [logreg_auc, rf_auc, gb_auc],
    'Accuracy': [logreg_acc, rf_acc, gb_acc]
})

# Highlight best model
best_model_name = results.sort_values(by='AUC-ROC', ascending=False).iloc[0]['Model']

print("\n📊 Updated Model Performance Comparison:\n")
print(results.to_string(index=False))
print(f"\n✅ Best model based on AUC-ROC: {best_model_name}")

Gradient Boosting AUC-ROC: 0.845
Gradient Boosting Accuracy: 0.752

📊 Updated Model Performance Comparison:

              Model  AUC-ROC  Accuracy
Logistic Regression 0.829962  0.792051
      Random Forest 0.819498  0.789212
  Gradient Boosting 0.845104  0.752307

✅ Best model based on AUC-ROC: Gradient Boosting


## Final Report
## ✅ Final Model Selection and Conclusion

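Three models were evaluated on AUC-ROC and accuracy. The **Gradient Boosting Classifier** was selected as the final model because it has the **highest AUC-ROC (0.845)**, the project's primary metric, which places it in the **5 SP scoring tier**. Its accuracy (0.752) is the lowest of the three, which is consistent with it being the only model trained on upsampled data.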

### 🔍 Model Comparison:
| Model               | AUC-ROC | Accuracy |
|--------------------|---------|----------|
| Logistic Regression| 0.830   | 0.792    |
| Random Forest      | 0.819   | 0.789    |
| **Gradient Boosting** | **0.845** ✅ | 0.752    |

### 💡 Strategic Insight:
- The best model was trained using **upsampled data** to correct for class imbalance.
- Churn is highest among customers on **month-to-month contracts** and those **not using add-on services** like TechSupport or OnlineSecurity.
- Customers with **fiber optic internet** churn more frequently than DSL users.
- A significant share of customers pay by **non-digital payment methods** (e.g., mailed checks), which may indicate an older customer base.
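
The segment-level observations above can be checked with a grouped churn rate. The following is a minimal sketch, assuming a merged frame that has the notebook's `Type`, `InternetService`, and `churn` columns. The helper name `churn_rate_by` is hypothetical, and the toy rows are made-up placeholders rather than project data.

```python
import pandas as pd


def churn_rate_by(df, column, target="churn"):
    """Return the churn rate and customer count for each category of `column`."""
    return (
        df.groupby(column)[target]
        .agg(churn_rate="mean", customers="size")
        .sort_values("churn_rate", ascending=False)
    )


# Toy rows for illustration only (not taken from the Interconnect data)
toy = pd.DataFrame({
    "Type": ["Month-to-month", "Month-to-month", "One year", "Two year"],
    "InternetService": ["Fiber optic", "DSL", "DSL", "Fiber optic"],
    "churn": [1, 0, 0, 0],
})

by_type = churn_rate_by(toy, "Type")
by_internet = churn_rate_by(toy, "InternetService")
print(by_type)
print(by_internet)
```

On the real data, the same call could be applied to `df` for each categorical column of interest, such as `PaymentMethod` or `TechSupport`.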

### 📈 Business Recommendation:
Interconnect should focus churn reduction efforts on:
- Encouraging longer-term contracts using **No Price Increase Guarantees**
- Bundling promotions with security or support services
- Simplifying billing and offering education/support for older customers

These steps are expected to improve customer retention and reduce churn.

## 📘 Project Reflection Report

**1. What steps of the plan were performed and what steps were skipped?**
- All required steps were completed: data loading, preprocessing, upsampling to fix class imbalance, and training multiple models (logistic regression, random forest, gradient boosting).
- No steps were skipped. The only limitation was model choice: because of environment constraints, only classifiers available in the environment were used (no imblearn or other external resampling libraries).

**2. What difficulties did you encounter and how did you solve them?**
- **Class imbalance** initially led to low recall in earlier models. This was resolved using **manual upsampling**.
- Reviewer feedback noted that logistic regression alone was insufficient, prompting the use of gradient boosting for improvement.
- Some classifiers (e.g., HistGradientBoosting) were unavailable in the environment, so the standard `GradientBoostingClassifier` was used as a fallback.
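
The manual upsampling mentioned above can be wrapped in a small helper that is applied to the training split only, so duplicated rows never reach the test set. This is a sketch under assumptions: `upsample_minority` is a hypothetical name, the target is assumed to be a named binary Series that shares its index with the features, and the toy values are made up.

```python
import pandas as pd


def upsample_minority(X, y, random_state=42):
    """Resample the minority class with replacement until both classes are equal in size."""
    train = X.copy()
    train[y.name] = y                          # y must be a named Series aligned with X
    counts = train[y.name].value_counts()      # sorted from most to least frequent
    majority = train[train[y.name] == counts.index[0]]
    minority = train[train[y.name] == counts.index[-1]]
    minority_up = minority.sample(n=len(majority), replace=True,
                                  random_state=random_state)
    # Shuffle so the two classes are interleaved in the returned frame
    balanced = pd.concat([majority, minority_up]).sample(frac=1,
                                                         random_state=random_state)
    return balanced.drop(columns=y.name), balanced[y.name]


# Toy training split for illustration only
X_toy = pd.DataFrame({"MonthlyCharges": [20.0, 35.5, 50.0, 70.0, 80.0, 99.9]})
y_toy = pd.Series([0, 0, 0, 0, 1, 1], name="churn")

X_bal, y_bal = upsample_minority(X_toy, y_toy)
print(y_bal.value_counts())
```

Unlike the notebook cell, this sketch shuffles the balanced frame. Row order matters little for tree ensembles but can matter for models trained in mini-batches.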

**3. What were some of the key steps to solving the task?**
- **Feature encoding and scaling**, along with carefully handling missing values.
- **Addressing class imbalance** through upsampling.
- Comparing model performance using **AUC-ROC**, the project’s primary metric.
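
AUC-ROC and accuracy can disagree because AUC-ROC scores how well the predicted probabilities rank churners above non-churners, while accuracy depends on the cutoff used to turn probabilities into labels. The sketch below illustrates this with made-up labels and probabilities (not project results). The 0.33 cutoff is an arbitrary value chosen for the illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, accuracy_score

# Made-up labels and predicted churn probabilities, for illustration only
y_true = np.array([0, 0, 0, 1, 1])
proba = np.array([0.10, 0.20, 0.30, 0.35, 0.45])

# Every churner is ranked above every non-churner, so the ranking is perfect
auc = roc_auc_score(y_true, proba)

# With the default 0.5 cutoff nobody is flagged as a churner
acc_default = accuracy_score(y_true, (proba >= 0.5).astype(int))

# A lower cutoff recovers both churners without changing the ranking
acc_shifted = accuracy_score(y_true, (proba >= 0.33).astype(int))

print(f"AUC-ROC: {auc:.3f}")
print(f"Accuracy at 0.50 cutoff: {acc_default:.3f}")
print(f"Accuracy at 0.33 cutoff: {acc_shifted:.3f}")
```

This is why a model can lead on AUC-ROC while trailing on accuracy: the two metrics answer different questions about the same predictions.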

**4. What is your final model and what quality score does it have?**
- The final model is the **Gradient Boosting Classifier**, trained on the upsampled training set.
- It achieved an **AUC-ROC score of 0.845**, placing it in the **5 SP scoring bracket**.