# Model Training
Objective:
The goal of this notebook is to train predictive models that identify customers likely to close their credit card accounts (the `AttritionFlag` target) using the processed, feature-engineered dataset.

Dataset:
- Processed datasets: `train.csv` and `test.csv`

- Features include demographic, transaction, card type, and engineered variables.

- Categorical variables are already handled in the data cleaning stage.

Workflow – Model Training:
1. Load Data
- Load `train.csv` and `test.csv`
- Separate features (`X`) and target (`y`)

2. Handle Class Imbalance
- Apply SMOTE on the training set to balance the minority class.

3. Train Classification Models
- Train two models:
  - Logistic Regression (baseline linear model)
  - Random Forest (tree ensemble that can capture non-linearities and interactions)
- Tune hyperparameters for each model with stratified 3-fold cross-validation, scored by ROC-AUC.

Next Steps:
- Each model gets a first-pass evaluation on the test set in this notebook (accuracy, precision, recall, F1-score, ROC-AUC, and confusion matrix). The trained models can then be saved for fuller evaluation in a separate notebook.
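
Saving a fitted model for a later notebook could look like the sketch below. It is a minimal example, not the project's actual persistence code: the toy model, the temporary directory, and the file name `logistic_regression.joblib` are all assumptions made for illustration.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy stand-in for a trained model; in this notebook it would be best_lr or best_rf.
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(40, 3))
y_toy = (X_toy[:, 0] > 0).astype(int)
toy_model = LogisticRegression().fit(X_toy, y_toy)

# Hypothetical output location; a temporary directory keeps the sketch self-contained.
model_dir = Path(tempfile.mkdtemp())
model_path = model_dir / "logistic_regression.joblib"

joblib.dump(toy_model, model_path)   # serialize the fitted estimator to disk
reloaded = joblib.load(model_path)   # restore it, e.g. in an evaluation notebook

# The reloaded estimator should reproduce the original predictions exactly.
same_predictions = bool((reloaded.predict(X_toy) == toy_model.predict(X_toy)).all())
print("Reloaded model reproduces predictions:", same_predictions)
```

If the logistic regression is saved this way, the fitted `StandardScaler` would need to be saved alongside it, since the model expects scaled inputs.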

In [None]:
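from pathlib import Path

import pandas as pd

# Paths to the saved CSVs (built with pathlib so they resolve on any OS,
# not only on Windows)
data_dir = Path('..') / '..' / 'data' / 'processed'
train_path = data_dir / 'train.csv'
test_path  = data_dir / 'test.csv'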

# Load the CSVs
train_df = pd.read_csv(train_path)
test_df  = pd.read_csv(test_path)

# Split into features and target
X_train = train_df.drop(columns=['AttritionFlag'])
y_train = train_df['AttritionFlag']

X_test  = test_df.drop(columns=['AttritionFlag'])
y_test  = test_df['AttritionFlag']

# Optional: check shapes
print("X_train shape:", X_train.shape)
print("y_train shape:", y_train.shape)
print("X_test shape:", X_test.shape)
print("y_test shape:", y_test.shape)

## 1. Handling Class Imbalance
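
SMOTE creates each synthetic minority sample by interpolating between an existing minority point and one of its nearest minority-class neighbours. The sketch below illustrates only that interpolation step. It is a simplified, pure-Python illustration (the helper `interpolate_sample` and the two points are hypothetical), not the `imblearn` implementation, and it skips the nearest-neighbour search entirely.

```python
import random


def interpolate_sample(x, neighbor, rng):
    """Return a synthetic point on the line segment between x and neighbor."""
    lam = rng.random()  # random position along the segment, uniform in [0, 1)
    return [xi + lam * (ni - xi) for xi, ni in zip(x, neighbor)]


rng = random.Random(42)

# Two hypothetical minority-class customers described by two numeric features.
minority_point = [1.0, 2.0]
minority_neighbor = [3.0, 6.0]

synthetic = interpolate_sample(minority_point, minority_neighbor, rng)
print("Synthetic sample:", synthetic)
```

Because synthetic points are built from real training rows, resampling is applied to the training set only; the test set keeps its original class distribution.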

In [None]:
from imblearn.over_sampling import SMOTE

# Initialize SMOTE
smote = SMOTE(random_state=42)

# Apply SMOTE to training data
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)

# Check the new class distribution
print("Before SMOTE:")
print(y_train.value_counts())
print("\nAfter SMOTE:")
print(y_train_res.value_counts())

# Optional: check shapes
print("\nX_train_res shape:", X_train_res.shape)
print("y_train_res shape:", y_train_res.shape)

## 2. Logistic Regression Training

In [None]:
from sklearn.preprocessing import StandardScaler

# Fit scaler on training data and transform both training and test
scaler = StandardScaler()
X_train_res_scaled = scaler.fit_transform(X_train_res)
X_test_scaled = scaler.transform(X_test)

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from tqdm import tqdm
import numpy as np


# Define Logistic Regression
lr = LogisticRegression(solver='lbfgs', max_iter=2000, random_state=42, class_weight='balanced')

# Hyperparameter candidates
param_grid = {'C': [0.1, 1, 10]}

# Stratified K-Fold
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

# Store results
results = []

# Total number of fits
total_fits = len(param_grid['C']) * cv.get_n_splits()
pbar = tqdm(total=total_fits, desc="Logistic Regression grid search")

# Loop through hyperparameters
for C_val in param_grid['C']:
    lr.set_params(C=C_val)
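    # Evaluate with cross-validation using scaled features.
    # Caveat: the folds are drawn from data that was already resampled and scaled,
    # so these CV scores are likely optimistic compared with the test set.
    scores = cross_val_score(lr, X_train_res_scaled, y_train_res, cv=cv, scoring='roc_auc', n_jobs=-1)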
    results.append({
        'C': C_val,
        'mean_cv_roc_auc': np.mean(scores)
    })
    pbar.update(cv.get_n_splits())  # Update progress bar for each fold set

pbar.close()

# Find best hyperparameter
best_result = max(results, key=lambda x: x['mean_cv_roc_auc'])
best_lr = LogisticRegression(
    C=best_result['C'], solver='lbfgs', max_iter=2000,
    random_state=42, class_weight='balanced'
)
best_lr.fit(X_train_res_scaled, y_train_res)

print("Best Hyperparameters:", best_result)
print("Best CV ROC-AUC:", best_result['mean_cv_roc_auc'])

In [None]:
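from sklearn.linear_model import LogisticRegression

# Refit with a fixed C=1 (no search required).
# Note: running this cell overwrites the tuned `best_lr` from the previous cell
# and, unlike that model, does not set class_weight='balanced'.
# Skip this cell to keep the tuned model.
best_lr = LogisticRegression(C=1, solver='lbfgs', max_iter=2000, random_state=42)
best_lr.fit(X_train_res_scaled, y_train_res)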


In [None]:
# Predicted probabilities (for ROC-AUC)
y_pred_prob = best_lr.predict_proba(X_test_scaled)[:, 1]

# Predicted class labels
y_pred = best_lr.predict(X_test_scaled)

In [None]:
import matplotlib.pyplot as plt

plt.hist(y_pred_prob, bins=20)
plt.title("Predicted Probabilities for Churn")
plt.xlabel("Probability")
plt.ylabel("Count")
plt.show()

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix

# zero_division=0 returns 0.0 (instead of raising a warning) when the model
# predicts no positive cases at all
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred, zero_division=0))
print("Recall:", recall_score(y_test, y_pred, zero_division=0))
print("F1-score:", f1_score(y_test, y_pred, zero_division=0))
print("ROC-AUC:", roc_auc_score(y_test, y_pred_prob))

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", cm)

## Logistic Regression Model Evaluation — Summary

**Test Results:**
- **Accuracy:** 95.04% (misleading due to imbalance)
- **Precision / Recall / F1-score:** 0.0 (model never predicted churn)
- **ROC-AUC:** 0.5088 (almost random performance)

**Key Takeaway:**  
High accuracy here is misleading since the model completely fails to detect churn and has no meaningful predictive power.
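
The arithmetic behind this takeaway can be checked by hand. The sketch below computes the four metrics from confusion-matrix counts for a model that always predicts "no churn". The counts (1,000 customers, 50 churners) are hypothetical round numbers chosen for illustration, not this notebook's actual test set, and `classification_metrics` is a helper written for this example.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against division by zero when no positives are predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical test set: 950 non-churners, 50 churners, every prediction is "no churn".
metrics = classification_metrics(tp=0, fp=0, fn=50, tn=950)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With a 5% churn rate, the always-negative model scores 95% accuracy while recall, precision, and F1 are all zero, which is the pattern reported above.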

## 3. Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from tqdm import tqdm
import numpy as np

# ----------------------
# Define Random Forest
# ----------------------
rf = RandomForestClassifier(
    n_estimators=200,        # number of trees
    random_state=42,
    class_weight='balanced', # handle imbalanced training data
    n_jobs=-1
)

# ----------------------
# Hyperparameter candidates
# ----------------------
param_grid = {
    'max_depth': [5, 10, None],
    'min_samples_split': [2, 5, 10]
}

# Stratified K-Fold
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

# Store results
results = []

# Total number of fits for progress bar
total_fits = len(param_grid['max_depth']) * len(param_grid['min_samples_split']) * cv.get_n_splits()
pbar = tqdm(total=total_fits, desc="Random Forest GridSearchCV")

# ----------------------
# Loop through hyperparameters
# ----------------------
for depth in param_grid['max_depth']:
    for min_split in param_grid['min_samples_split']:
        rf.set_params(max_depth=depth, min_samples_split=min_split)
        scores = cross_val_score(rf, X_train_res, y_train_res, cv=cv, scoring='roc_auc', n_jobs=-1)
        results.append({
            'max_depth': depth,
            'min_samples_split': min_split,
            'mean_cv_roc_auc': np.mean(scores)
        })
        pbar.update(cv.get_n_splits())

pbar.close()

# ----------------------
# Find best hyperparameters
# ----------------------
best_result = max(results, key=lambda x: x['mean_cv_roc_auc'])
best_rf = RandomForestClassifier(
    n_estimators=200,
    max_depth=best_result['max_depth'],
    min_samples_split=best_result['min_samples_split'],
    random_state=42,
    class_weight='balanced',
    n_jobs=-1
)
best_rf.fit(X_train_res, y_train_res)

print("Best Hyperparameters:", best_result)
print("Best CV ROC-AUC:", best_result['mean_cv_roc_auc'])

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix

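# Predict labels and probabilities.
# best_rf was fit on the unscaled X_train_res, so it must be given the unscaled
# X_test; X_test_scaled holds values on a different scale than the trees saw
# during training.
y_pred = best_rf.predict(X_test)
y_pred_prob = best_rf.predict_proba(X_test)[:, 1]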

# Metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, zero_division=0)
recall = recall_score(y_test, y_pred, zero_division=0)
f1 = f1_score(y_test, y_pred, zero_division=0)
roc_auc = roc_auc_score(y_test, y_pred_prob)

print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-score: {f1:.4f}")
print(f"ROC-AUC: {roc_auc:.4f}")
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

## Random Forest Model Evaluation

**Interpretation:**
- The model predicts **only the majority class** (non-churn).
- High accuracy is misleading.
- Precision, Recall, and F1 are all **zero**, meaning no churn cases were identified.
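- ROC-AUC is only slightly above 0.5, so the ranking of customers is close to random.
- This suggests **poor predictive power**; the features may not correlate strongly with churn. Because the forest is fit on unscaled features, this reading holds only if the test features passed to it are unscaled too; metrics computed from `X_test_scaled` should be rerun before being relied on.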
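
To make the "ROC-AUC near 0.5" reading concrete: ROC-AUC equals the fraction of (churner, non-churner) pairs in which the churner receives the higher score, with ties counted as half. The sketch below computes it that way in pure Python. The labels and scores are made up for illustration, and `roc_auc_pairwise` is a helper written for this example rather than a library function.

```python
def roc_auc_pairwise(y_true, scores):
    """ROC-AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("Need at least one positive and one negative label.")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels: four non-churners followed by two churners.
y_demo = [0, 0, 0, 0, 1, 1]

constant_scores = [0.2] * 6                      # same score for everyone
perfect_scores = [0.1, 0.2, 0.3, 0.4, 0.8, 0.9]  # churners always ranked highest

auc_constant = roc_auc_pairwise(y_demo, constant_scores)
auc_perfect = roc_auc_pairwise(y_demo, perfect_scores)
print("Uninformative scores:", auc_constant)
print("Perfect ranking:", auc_perfect)
```

A score that carries no information about churn lands at 0.5, so values of 0.51 to 0.52 mean churners are ranked above non-churners barely more often than chance.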

# Project Conclusion: Churn Prediction for Credit Card Holders

## Summary of Findings
Both **Logistic Regression** and **Random Forest** models were trained on the SMOTE-balanced training set and evaluated on the original, unresampled test set.

### Model Performances:
- **Accuracy**: ~95% (misleading due to class imbalance)
- **Precision / Recall / F1-score**: All **0.0** — no churn cases were identified by either model.
- **ROC-AUC**: ~0.51–0.52, only slightly better than random guessing.

### Interpretation:
- Despite feature engineering, hyperparameter tuning, and balancing of the training data, both models defaulted to predicting only the majority class (non-churn) on the test set.
- This suggests **no strong predictive signal** in the available features to separate churn from non-churn customers.
- The high accuracy reflects the majority class dominating the test set, not actual predictive performance.
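
Both models were scored at the default 0.5 decision cutoff. One check that might help separate "no signal" from "cutoff too high" is to sweep the threshold over the predicted probabilities and watch precision and recall. The sketch below does this on made-up labels and probabilities; `precision_recall_at` is a hypothetical helper, and the numbers are not from this dataset.

```python
def precision_recall_at(y_true, probs, threshold):
    """Precision and recall when predicting churn for probabilities >= threshold."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for y, yh in zip(y_true, preds) if y == 1 and yh == 1)
    fp = sum(1 for y, yh in zip(y_true, preds) if y == 0 and yh == 1)
    fn = sum(1 for y, yh in zip(y_true, preds) if y == 1 and yh == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical example: six non-churners, two churners, all probabilities below 0.5.
y_demo = [0, 0, 0, 0, 0, 0, 1, 1]
probs_demo = [0.05, 0.10, 0.12, 0.20, 0.25, 0.30, 0.28, 0.40]

sweep = {t: precision_recall_at(y_demo, probs_demo, t) for t in (0.5, 0.35, 0.25)}
for threshold, (precision, recall) in sweep.items():
    print(f"threshold={threshold:.2f}  precision={precision:.2f}  recall={recall:.2f}")
```

In this toy case a lower cutoff recovers the churners at the cost of false positives. With a ROC-AUC near 0.5 the same sweep on the real predictions is unlikely to find a useful cutoff, but running it would confirm that the zero recall is not just an artifact of the 0.5 default.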

## Final Conclusion
The dataset, in its current form, **does not appear to contain sufficient or relevant features** to build a model that can reliably predict customer attrition/churn.  
Future work should focus on **collecting richer, more behaviorally informative features** and possibly **reshaping the problem definition** before applying predictive modeling.
