# Boosting Techniques Assignment - DA-AG-015


## Question 1: What is Boosting in Machine Learning? Explain how it improves weak learners.

**Answer:**

Boosting is an ensemble learning method in machine learning that combines multiple weak learners sequentially to create a strong learner. It works on the principle that combining many simple models (weak learners) can result in a powerful and accurate model.

### Key Concepts of Boosting:

1. **Weak Learners**: These are models that perform slightly better than random chance. A common example is a decision stump (a decision tree with only one split).

2. **Sequential Training**: Unlike bagging where models are trained in parallel, boosting trains models one after another. Each new model focuses on correcting the errors made by the previous models.

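3. **Weighted Training**: In AdaBoost-style boosting, each iteration increases the weights of misclassified examples and decreases the weights of correctly classified ones, which forces subsequent models to focus on the "hard" cases. Gradient boosting achieves a similar focus by fitting each new model to the current residuals instead.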

### How Boosting Improves Weak Learners:

- **Error Correction**: Each new weak learner is trained to correct the mistakes of the previous ensemble.
- **Weight Adjustment**: Misclassified instances get higher weights in the next iteration, so the algorithm focuses on difficult cases.
- **Additive Combination**: The weighted combination of many weak learners forms a strong learner that can capture complex patterns.
- **Bias Reduction**: By repeatedly fitting what the current ensemble gets wrong, boosting primarily reduces bias in the final model.

The theoretical foundation (Schapire, 1990) is that weak and strong learnability are equivalent in the PAC learning framework: any algorithm that consistently performs slightly better than chance can be boosted into one with arbitrarily low error.
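
The weak-to-strong idea can be illustrated with a minimal from-scratch sketch of AdaBoost on a toy 1-D dataset. The data, the helper names (`stump_predict`, `fit_adaboost`, `ensemble_predict`, `accuracy`) and the choice of threshold stumps are assumptions made for illustration; this is not how any particular library implements boosting.

```python
import math

def stump_predict(x, threshold, polarity):
    """Decision stump: predict `polarity` left of the threshold, the opposite label right of it."""
    return polarity if x < threshold else -polarity

def fit_adaboost(xs, ys, n_rounds):
    """Fit n_rounds weighted stumps; labels must be -1 or +1."""
    n = len(xs)
    weights = [1.0 / n] * n  # every sample starts with equal weight
    sorted_xs = sorted(set(xs))
    thresholds = [(a + b) / 2 for a, b in zip(sorted_xs, sorted_xs[1:])]
    ensemble = []
    for _ in range(n_rounds):
        # Pick the stump with the lowest weighted error under the current weights
        best = None
        for threshold in thresholds:
            for polarity in (1, -1):
                err = sum(w for x, y, w in zip(xs, ys, weights)
                          if stump_predict(x, threshold, polarity) != y)
                if best is None or err < best[0]:
                    best = (err, threshold, polarity)
        err, threshold, polarity = best
        err = min(max(err, 1e-10), 1 - 1e-10)    # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # vote weight of this stump
        ensemble.append((alpha, threshold, polarity))
        # Up-weight misclassified samples, down-weight correct ones, then renormalise
        weights = [w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def ensemble_predict(ensemble, x):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(alpha * stump_predict(x, thr, pol) for alpha, thr, pol in ensemble)
    return 1 if score >= 0 else -1

def accuracy(ensemble, xs, ys):
    return sum(ensemble_predict(ensemble, x) == y for x, y in zip(xs, ys)) / len(xs)

# Toy data that no single stump can separate
xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

one_stump = fit_adaboost(xs, ys, n_rounds=1)
three_stumps = fit_adaboost(xs, ys, n_rounds=3)
print("Training accuracy, 1 stump: ", accuracy(one_stump, xs, ys))
print("Training accuracy, 3 stumps:", accuracy(three_stumps, xs, ys))
```

On this toy data the best single stump misclassifies three of the ten points, while the weighted vote of three stumps fits all of them. Training accuracy alone says nothing about generalization, so this only demonstrates the combination mechanism.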

## Question 2: What is the difference between AdaBoost and Gradient Boosting in terms of how models are trained?

**Answer:**

### AdaBoost (Adaptive Boosting):

**Training Process:**
- **Data Reweighting**: AdaBoost modifies the training data distribution by adjusting sample weights
- **Equal Initial Weights**: All training samples start with equal weights (1/n)
- **Error-based Reweighting**: After each weak learner is trained, incorrectly classified samples get increased weights, correctly classified samples get decreased weights
- **Weak Learner Weight**: Each weak learner gets a vote weight based on its weighted error rate ε, typically α = ½ ln((1 − ε) / ε), so more accurate learners count for more
- **Focus**: Works primarily on the data samples (rows)

### Gradient Boosting:

**Training Process:**
- **Residual Learning**: Instead of reweighting data, Gradient Boosting fits each new model to the residuals (errors) of the current ensemble
- **Gradient Descent**: Performs gradient descent in function space to minimize a differentiable loss function
- **Pseudo-residuals**: Each new model is trained on the negative gradient of the loss with respect to the current predictions; for squared error these are the ordinary residuals
- **Learning Rate**: Uses a shrinkage parameter (learning rate) to control the contribution of each model
- **Focus**: Works on the loss function optimization
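
The residual-fitting loop described above can be sketched in plain Python for squared-error loss. The toy data, the helper names (`fit_stump`, `fit_gradient_boosting`, `mean_squared_error`) and the use of one-split regression stumps are illustrative assumptions, not the scikit-learn implementation.

```python
def mean_squared_error(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

def fit_stump(xs, residuals):
    """Find the single split that minimises squared error on the residuals."""
    sorted_xs = sorted(set(xs))
    best = None
    for a, b in zip(sorted_xs, sorted_xs[1:]):
        threshold = (a + b) / 2
        left = [r for x, r in zip(xs, residuals) if x < threshold]
        right = [r for x, r in zip(xs, residuals) if x >= threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]

def fit_gradient_boosting(xs, ys, n_rounds, learning_rate):
    base = sum(ys) / len(ys)          # initial prediction: the mean target
    preds = [base] * len(ys)
    stumps = []
    mse_history = [mean_squared_error(ys, preds)]
    for _ in range(n_rounds):
        # For the loss 1/2 * (y - F)^2 the negative gradient is the plain residual
        residuals = [y - p for y, p in zip(ys, preds)]
        threshold, left_value, right_value = fit_stump(xs, residuals)
        stumps.append((threshold, left_value, right_value))
        # Shrink each stump's contribution by the learning rate
        preds = [p + learning_rate * (left_value if x < threshold else right_value)
                 for x, p in zip(xs, preds)]
        mse_history.append(mean_squared_error(ys, preds))
    return base, stumps, mse_history

xs = list(range(10))
ys = [x * x for x in xs]  # toy regression target
base, stumps, mse_history = fit_gradient_boosting(xs, ys, n_rounds=20, learning_rate=0.3)
print(f"Training MSE before boosting: {mse_history[0]:.2f}")
print(f"Training MSE after 20 rounds: {mse_history[-1]:.2f}")
```

Swapping the residual line for the negative gradient of another differentiable loss is what turns this sketch into the general framework; no sample weights are touched at any point, which is the contrast with AdaBoost.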

### Key Differences:

| Aspect | AdaBoost | Gradient Boosting |
|--------|----------|-------------------|
| **Data Handling** | Re-weights training samples | Fits to residuals/gradients |
| **Loss Function** | Fixed (exponential loss) | Flexible (any differentiable loss) |
| **Model Weights** | Based on error rates | Based on learning rate |
| **Optimization** | Sample weight adjustment | Gradient descent |
| **Generalization** | Specific algorithm | General framework |

### Mathematical Difference:
- **AdaBoost**: The learner weight and the sample-weight update have closed-form solutions that minimize exponential loss at each stage
- **Gradient Boosting**: Uses gradients for any loss function, making it more flexible
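
Written out, the two update rules look as follows. The symbols (ε_t for the weighted error, α_t for the learner weight, Z_t for the normalizer, ν for the learning rate, F_t for the ensemble after round t) are notation introduced here for illustration and assume binary labels in {−1, +1} for AdaBoost.

```latex
% AdaBoost: labels y_i \in \{-1, +1\}, normalised sample weights w_i
\epsilon_t = \sum_i w_i \, \mathbb{1}[h_t(x_i) \neq y_i], \qquad
\alpha_t = \tfrac{1}{2} \ln \frac{1 - \epsilon_t}{\epsilon_t}, \qquad
w_i \leftarrow \frac{w_i \exp\!\left(-\alpha_t \, y_i \, h_t(x_i)\right)}{Z_t}

% Gradient boosting: any differentiable loss L, learning rate \nu
r_{i,t} = -\left[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \right]_{F = F_{t-1}}, \qquad
F_t(x) = F_{t-1}(x) + \nu \, h_t(x)
```

For squared error, L = ½(y − F)², the pseudo-residual r reduces to the ordinary residual y − F.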

## Question 3: How does regularization help in XGBoost?

**Answer:**

Regularization in XGBoost is crucial for preventing overfitting and improving model generalization. XGBoost implements multiple regularization techniques that control model complexity.

### Types of Regularization in XGBoost:

#### 1. **L1 Regularization (Alpha Parameter)**
- **Purpose**: Adds penalty based on absolute values of leaf weights
- **Effect**: Encourages sparsity by driving some leaf weights to exactly zero
- **Parameter**: `alpha` or `reg_alpha`
- **Benefits**: Sparser leaf weights, simpler models, reduced overfitting (the penalty acts on leaf weights, so it does not by itself perform feature selection)

#### 2. **L2 Regularization (Lambda Parameter)**
- **Purpose**: Adds penalty based on squared values of leaf weights  
- **Effect**: Smoothly shrinks leaf weights towards zero
- **Parameter**: `lambda` or `reg_lambda`
- **Benefits**: Smoother weight distribution, better generalization

#### 3. **Tree-Specific Regularization**

**Gamma (Min Split Loss):**
- Controls minimum loss reduction required for node splitting
- Higher gamma values → more conservative splits → simpler trees

**Min Child Weight:**
- Requires a minimum sum of instance weights (Hessians) in each child; for squared-error loss this equals the number of samples in the leaf
- Prevents overly specific leaf nodes

**Max Depth:**
- Limits tree depth to control model complexity
- Prevents trees from becoming too deep and overfitting

#### 4. **Early Stopping**
- **Parameter**: `early_stopping_rounds`
- **Function**: Monitors a validation metric and stops training once it has not improved for the specified number of rounds
- **Benefit**: Finds optimal point before overfitting begins
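
The stopping rule itself is simple enough to sketch without any library. The function name `early_stopping_round`, the `patience` argument (playing the role of `early_stopping_rounds`) and the validation losses below are hypothetical values chosen for illustration.

```python
def early_stopping_round(validation_losses, patience):
    """Return (best_round, stopped_at) for a sequence of per-round validation losses.

    Training stops as soon as `patience` consecutive rounds fail to improve
    on the best loss seen so far.
    """
    best_loss = float("inf")
    best_round = -1
    for round_idx, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_loss, best_round = loss, round_idx
        elif round_idx - best_round >= patience:
            return best_round, round_idx
    # Never triggered: all rounds were used
    return best_round, len(validation_losses) - 1

# Hypothetical validation log-loss per boosting round
losses = [0.60, 0.50, 0.45, 0.44, 0.46, 0.47, 0.48, 0.43]
best_round, stopped_at = early_stopping_round(losses, patience=3)
print(f"Best round: {best_round}, training stopped at round: {stopped_at}")
```

With these numbers training halts at round 6 and keeps the model from round 3, so the later improvement at round 7 is never seen. That trade-off is why the patience value is usually tuned rather than set very low.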

### How Regularization Helps:

1. **Overfitting Prevention**: Penalizes complex models that fit noise
2. **Generalization**: Improves performance on unseen data
3. **Model Simplicity**: Creates more interpretable models
4. **Computational Efficiency**: Reduces unnecessary complexity
5. **Robust Performance**: Makes models less sensitive to training data variations

### Mathematical Formulation:
The XGBoost objective function includes regularization terms:

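```
Obj = Σ_i L(y_i, ŷ_i) + Σ_t Ω(f_t)
```

Where:
- L is the loss function, summed over the training samples i
- Ω(f_t) = γT + (λ/2) Σ_j w_j² + α Σ_j |w_j| is the regularization term for tree f_t
- T is the number of leaves in the tree, and w_j is the weight of leaf j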
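
To see how each penalty acts, the sketch below computes the optimal leaf weight and the gain of a candidate split from summed gradients `G` and Hessians `H`, using the standard second-order approximation of the objective. The function names and the numbers are assumptions for illustration; the code mirrors the formula above rather than XGBoost's internal code.

```python
def soft_threshold(g, reg_alpha):
    """L1 shrinkage: move g towards zero by reg_alpha, clamping at zero."""
    if g > reg_alpha:
        return g - reg_alpha
    if g < -reg_alpha:
        return g + reg_alpha
    return 0.0

def leaf_weight(G, H, reg_lambda, reg_alpha):
    """Weight that minimises G*w + 0.5*(H + lambda)*w^2 + alpha*|w|."""
    return -soft_threshold(G, reg_alpha) / (H + reg_lambda)

def leaf_score(G, H, reg_lambda, reg_alpha):
    """Twice the objective reduction achieved by the optimal leaf weight."""
    return soft_threshold(G, reg_alpha) ** 2 / (H + reg_lambda)

def split_gain(G_left, H_left, G_right, H_right, reg_lambda, reg_alpha, gamma):
    """Objective reduction from splitting a node; gamma is the cost of the extra leaf."""
    parent = leaf_score(G_left + G_right, H_left + H_right, reg_lambda, reg_alpha)
    children = (leaf_score(G_left, H_left, reg_lambda, reg_alpha)
                + leaf_score(G_right, H_right, reg_lambda, reg_alpha))
    return 0.5 * (children - parent) - gamma

# Hypothetical gradient statistics for one leaf
print(leaf_weight(G=-4.0, H=3.0, reg_lambda=0.0, reg_alpha=0.0))  # no regularization
print(leaf_weight(G=-4.0, H=3.0, reg_lambda=1.0, reg_alpha=0.0))  # L2 shrinks the weight
print(leaf_weight(G=-4.0, H=3.0, reg_lambda=1.0, reg_alpha=1.0))  # L1 shrinks it further
print(leaf_weight(G=-4.0, H=3.0, reg_lambda=1.0, reg_alpha=5.0))  # large L1 zeroes it

# The same split is accepted or rejected depending on gamma
gain_no_gamma = split_gain(-4.0, 3.0, 2.0, 2.0, reg_lambda=1.0, reg_alpha=0.0, gamma=0.0)
gain_with_gamma = split_gain(-4.0, 3.0, 2.0, 2.0, reg_lambda=1.0, reg_alpha=0.0, gamma=3.0)
print(gain_no_gamma, gain_with_gamma)
```

A split is kept only when its gain is positive, so raising `gamma` prunes splits that a less regularized model would accept.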

## Question 4: Why is CatBoost considered efficient for handling categorical data?

**Answer:**

CatBoost is specifically designed to handle categorical features efficiently without requiring manual preprocessing. Here's why it excels:

### 1. **Automatic Categorical Feature Handling**

**No Preprocessing Required:**
- CatBoost processes categorical features natively once they are declared (for example through the `cat_features` argument)
- No need for manual one-hot encoding or label encoding
- Preserves original feature meaning and relationships

**Built-in Encoding Methods:**
- Uses sophisticated encoding techniques internally
- Combines multiple encoding strategies for optimal performance

### 2. **Advanced Encoding Techniques**

**Target Statistics (CatBoost's Core Innovation):**
- Replaces each category with a statistic of the target variable for that category
- Computes the statistic for a row only from the rows that precede it in a random permutation, which limits overfitting
- Formula: `(countInClass + prior) / (totalCount + 1)`

**Ordered Boosting:**
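- Applies the same permutation idea to the boosting steps: the gradient for a row is estimated by a model that was not trained on that row
- Reduces the prediction shift caused by target leakage
- Together with ordered target statistics, gives less biased handling of categorical features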
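
A minimal sketch of an ordered target statistic, using the formula quoted above, is shown below. It assumes the rows are already in the order of one random permutation and that the target is binary; the function name `ordered_target_statistic` and the sample values are illustrative, and CatBoost itself averages over several permutations.

```python
def ordered_target_statistic(categories, targets, prior):
    """Encode each row using only the rows that came before it.

    For every row: (positives seen so far in this category + prior)
                   / (rows seen so far in this category + 1)
    """
    seen = {}  # category -> (rows seen so far, positive targets seen so far)
    encoded = []
    for category, target in zip(categories, targets):
        total, positives = seen.get(category, (0, 0))
        encoded.append((positives + prior) / (total + 1))
        # Only now does the current row become "history" for later rows
        seen[category] = (total + 1, positives + target)
    return encoded

categories = ["A", "B", "A", "A", "B"]
targets = [1, 0, 1, 0, 1]
encoded = ordered_target_statistic(categories, targets, prior=0.5)
for category, target, value in zip(categories, targets, encoded):
    print(f"{category} (target={target}) -> {value:.4f}")
```

Because a row's own target never enters its encoding, the encoded feature cannot simply memorize the label, which is the leakage that plain target encoding suffers from.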

### 3. **Handling High Cardinality**

**Efficient Memory Usage:**
- Handles features with thousands of categories
- Uses optimized data structures
- Doesn't explode dimensionality like one-hot encoding

**Combination Features:**
- Automatically creates combinations of categorical features
- Discovers interaction patterns between categories
- Builds more complex feature representations

### 4. **Technical Advantages**

**Missing Value Handling:**
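- Handles missing values in numerical features natively (controlled by `nan_mode`)
- Missing categorical values should be mapped to an explicit category (for example the string `"missing"`) before training
- Numerical features usually need no separate imputation step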

**One-Hot Encoding Control:**
- Uses one-hot encoding only for low-cardinality features
- Parameter: `one_hot_max_size` (default varies by conditions)
- Automatically chooses optimal encoding method

### 5. **Performance Benefits**

**Training Speed:**
- Avoids a separate encoding step, which can shorten the end-to-end training workflow
- Optimized C++ implementation
- GPU acceleration support

**Model Quality:**
- Better handling of categorical feature interactions
- Reduced information loss
- More accurate predictions on categorical-heavy datasets

### Why This Matters:

1. **Reduced Preprocessing Time**: No manual feature engineering needed
2. **Better Feature Representation**: Preserves categorical relationships
3. **Handling Complex Interactions**: Automatic feature combinations
4. **Robust Performance**: Less prone to overfitting on categorical features
5. **Industry Applications**: Excellent for domains with many categorical features (finance, e-commerce, healthcare)

## Question 5: What are some real-world applications where boosting techniques are preferred over bagging methods?

**Answer:**

Boosting techniques are preferred over bagging in several real-world scenarios where reducing bias and achieving high accuracy is crucial. Here are key applications:

### 1. **Financial Services**

**Credit Risk Assessment:**
- **Why Boosting**: Need to identify subtle patterns in default prediction
- **Algorithm Used**: XGBoost, LightGBM
- **Advantage**: Sequential learning captures complex risk factors that individual models miss
- **Real Impact**: Banks use boosting to reduce false positives/negatives in loan approvals

**Fraud Detection:**
- **Why Boosting**: Fraudulent transactions have subtle, evolving patterns
- **Algorithm Used**: AdaBoost, Gradient Boosting
- **Advantage**: Focuses on hard-to-detect fraudulent cases through iterative learning
- **Real Impact**: PayPal, Visa use boosting for real-time fraud scoring

### 2. **Healthcare and Medical Diagnosis**

**Disease Prediction:**
- **Why Boosting**: Medical diagnosis requires high precision on minority cases
- **Algorithm Used**: XGBoost, CatBoost
- **Advantage**: Supports class weighting (for example `scale_pos_weight` in XGBoost), which helps on imbalanced datasets such as rare diseases
- **Real Impact**: Used for cancer detection, cardiovascular risk prediction

### 3. **Computer Vision and Image Processing**

**Object Detection (Viola-Jones Algorithm):**
- **Why Boosting**: Real-time face detection requires speed and accuracy
- **Algorithm Used**: AdaBoost with Haar features
- **Advantage**: Creates strong classifier from simple rectangle features
- **Real Impact**: Used in cameras, security systems, photo tagging

### 4. **Search and Recommendation Systems**

**Web Search Ranking:**
- **Why Boosting**: Need to rank millions of pages accurately
- **Algorithm Used**: Gradient Boosted Regression Trees (GBRT)
- **Advantage**: Captures complex relevance signals
- **Real Impact**: Google, Bing use boosting in their ranking algorithms

### 5. **Marketing and Customer Analytics**

**Customer Churn Prediction:**
- **Why Boosting**: Early identification of at-risk customers
- **Advantage**: Focuses on borderline cases that are most actionable
- **Real Impact**: Telecom companies use boosting to reduce churn rates

### When Boosting is Preferred Over Bagging:

| Scenario | Why Boosting > Bagging |
|----------|------------------------|
| **High Bias Models** | Boosting reduces bias better than bagging |
| **Imbalanced Data** | Sequential focus on minority class |
| **Complex Patterns** | Better at capturing subtle interactions |
| **High Accuracy Requirements** | Often achieves better precision/recall |
| **Structured Data** | Excels on tabular data with mixed types |
| **Real-time Scoring** | Can be optimized for fast inference |

### Industry Examples:

- **Microsoft**: Uses LightGBM for Bing search ranking
- **Uber**: XGBoost for demand forecasting and pricing
- **Airbnb**: Boosting models for pricing recommendations
- **Pinterest**: CatBoost for content recommendation
- **Spotify**: Gradient boosting for music recommendation

The key insight is that boosting excels when you need to extract maximum predictive power from structured data and when the cost of misclassification is high.

---

# Practical Implementation Section

Now let's move to the practical coding questions using the specified datasets:
- **Classification tasks**: `sklearn.datasets.load_breast_cancer()`
- **Regression tasks**: `sklearn.datasets.fetch_california_housing()`

---

## Question 6: Train an AdaBoost Classifier on the Breast Cancer dataset

**Task:**
- Train an AdaBoost Classifier on the Breast Cancer dataset
- Print the model accuracy

In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import warnings
warnings.filterwarnings('ignore')

# Load the Breast Cancer dataset
print("Loading Breast Cancer Dataset...")
data = load_breast_cancer()
X, y = data.data, data.target

print(f"Dataset shape: {X.shape}")
print(f"Target classes: {data.target_names}")
print(f"Feature names: {data.feature_names[:5]}...")  # Show first 5 features

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set size: {X_train.shape[0]}")
print(f"Test set size: {X_test.shape[0]}")

In [None]:
# Create and train AdaBoost Classifier
print("Training AdaBoost Classifier...")
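# The weak learner is passed positionally because its keyword was renamed
# from `base_estimator` to `estimator` in scikit-learn 1.2
ada_classifier = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),  # Decision stumps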
    n_estimators=50,
    learning_rate=1.0,
    random_state=42
)

# Train the model
ada_classifier.fit(X_train, y_train)

# Make predictions
y_pred = ada_classifier.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)

print(f"\n=== AdaBoost Classifier Results ===")
print(f"Model Accuracy: {accuracy:.4f} ({accuracy*100:.2f}%)")

# Detailed classification report
print("\nDetailed Classification Report:")
print(classification_report(y_test, y_pred, target_names=data.target_names))

# Feature importance
feature_importance = ada_classifier.feature_importances_
top_features_idx = np.argsort(feature_importance)[-5:]
print("\nTop 5 Important Features:")
for idx in reversed(top_features_idx):
    print(f"{data.feature_names[idx]}: {feature_importance[idx]:.4f}")

## Question 7: Train a Gradient Boosting Regressor on the California Housing dataset

**Task:**
- Train a Gradient Boosting Regressor on the California Housing dataset
- Evaluate performance using R-squared score

In [None]:
# Import necessary libraries
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
import matplotlib.pyplot as plt

# Load the California Housing dataset
print("Loading California Housing Dataset...")
housing_data = fetch_california_housing()
X, y = housing_data.data, housing_data.target

print(f"Dataset shape: {X.shape}")
print(f"Target variable: House values in hundreds of thousands of dollars")
print(f"Feature names: {housing_data.feature_names}")
print(f"Target statistics: Mean={y.mean():.2f}, Std={y.std():.2f}, Min={y.min():.2f}, Max={y.max():.2f}")

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"\nTraining set size: {X_train.shape[0]}")
print(f"Test set size: {X_test.shape[0]}")

In [None]:
# Create and train Gradient Boosting Regressor
print("Training Gradient Boosting Regressor...")
gb_regressor = GradientBoostingRegressor(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=6,
    random_state=42,
    verbose=0
)

# Train the model
gb_regressor.fit(X_train, y_train)

# Make predictions
y_pred = gb_regressor.predict(X_test)

# Calculate R-squared score
r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, y_pred)

print(f"\n=== Gradient Boosting Regressor Results ===")
print(f"R-squared Score: {r2:.4f} ({r2*100:.2f}%)")
print(f"Mean Squared Error: {mse:.4f}")
print(f"Root Mean Squared Error: {rmse:.4f}")
print(f"Mean Absolute Error: {mae:.4f}")

# Feature importance
feature_importance = gb_regressor.feature_importances_
feature_names = housing_data.feature_names

print("\nFeature Importance Ranking:")
feature_importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': feature_importance
}).sort_values('Importance', ascending=False)

for idx, row in feature_importance_df.iterrows():
    print(f"{row['Feature']}: {row['Importance']:.4f}")

# Training progress
print(f"\nModel Training Info:")
print(f"Number of estimators used: {gb_regressor.n_estimators}")
# train_score_ holds the training loss at each iteration, not an R-squared value
print(f"Final training loss: {gb_regressor.train_score_[-1]:.4f}")

# Create a simple prediction comparison
print("\nSample Predictions vs Actual:")
sample_indices = np.random.choice(len(y_test), 5, replace=False)
print("Actual -> Predicted")
for idx in sample_indices:
    print(f"{y_test[idx]:.2f} -> {y_pred[idx]:.2f}")

## Question 8: Train an XGBoost Classifier with GridSearchCV for learning rate tuning

**Task:**
- Train an XGBoost Classifier on the Breast Cancer dataset
- Tune the learning rate using GridSearchCV
- Print the best parameters and accuracy

In [None]:
# Import necessary libraries
from sklearn.model_selection import GridSearchCV
import xgboost as xgb
from xgboost import XGBClassifier

# Load the Breast Cancer dataset
print("Loading Breast Cancer Dataset for XGBoost...")
data = load_breast_cancer()
X, y = data.data, data.target

# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Dataset shape: {X.shape}")
print(f"Training set: {X_train.shape[0]}, Test set: {X_test.shape[0]}")

In [None]:
# Create XGBoost Classifier
print("Setting up XGBoost Classifier...")
xgb_classifier = XGBClassifier(
    random_state=42,
    eval_metric='logloss',
    verbosity=0
)

# Define parameter grid for learning rate tuning
param_grid = {
    'learning_rate': [0.01, 0.1, 0.2, 0.3],
    'n_estimators': [50, 100],
    'max_depth': [3, 6]
}

print(f"Parameter grid: {param_grid}")
print("Starting GridSearchCV...")

# Perform GridSearchCV
grid_search = GridSearchCV(
    estimator=xgb_classifier,
    param_grid=param_grid,
    scoring='accuracy',
    cv=5,
    n_jobs=-1,
    verbose=1
)

# Fit the grid search
grid_search.fit(X_train, y_train)

In [None]:
# Get the best parameters
best_params = grid_search.best_params_
best_score = grid_search.best_score_

print(f"=== GridSearchCV Results ===")
print(f"Best Parameters: {best_params}")
print(f"Best Cross-Validation Score: {best_score:.4f} ({best_score*100:.2f}%)")

# GridSearchCV has already refit the best configuration on the full training set (refit=True)
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)

# Calculate final accuracy
final_accuracy = accuracy_score(y_test, y_pred)

print(f"\n=== Final XGBoost Model Performance ===")
print(f"Test Accuracy: {final_accuracy:.4f} ({final_accuracy*100:.2f}%)")

# Detailed results
print("\nDetailed Classification Report:")
print(classification_report(y_test, y_pred, target_names=data.target_names))

# Feature importance from best model
feature_importance = best_model.feature_importances_
top_features_idx = np.argsort(feature_importance)[-5:]

print("\nTop 5 Important Features:")
for idx in reversed(top_features_idx):
    print(f"{data.feature_names[idx]}: {feature_importance[idx]:.4f}")

# Show all parameter combinations tested
print("\nAll Parameter Combinations Tested:")
results_df = pd.DataFrame(grid_search.cv_results_)
for params, score in zip(results_df['params'], results_df['mean_test_score']):
    print(f"{params} -> CV Score: {score:.4f}")

## Question 9: Train a CatBoost Classifier and plot confusion matrix

**Task:**
- Train a CatBoost Classifier
- Plot the confusion matrix using seaborn

In [None]:
# Import necessary libraries
from catboost import CatBoostClassifier
import seaborn as sns
from sklearn.metrics import precision_score, recall_score, f1_score

# Load the Breast Cancer dataset
print("Loading Breast Cancer Dataset for CatBoost...")
data = load_breast_cancer()
X, y = data.data, data.target

# Convert to DataFrame for better handling
X_df = pd.DataFrame(X, columns=data.feature_names)

print(f"Dataset shape: {X_df.shape}")
print(f"Target classes: {data.target_names}")

# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X_df, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set: {X_train.shape[0]}, Test set: {X_test.shape[0]}")

In [None]:
# Create and train CatBoost Classifier
print("Training CatBoost Classifier...")
catboost_classifier = CatBoostClassifier(
    iterations=100,
    depth=6,
    learning_rate=0.1,
    loss_function='Logloss',
    verbose=False,
    random_state=42
)

# Train the model
catboost_classifier.fit(X_train, y_train)

# Make predictions
y_pred = catboost_classifier.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)

print(f"\n=== CatBoost Classifier Results ===")
print(f"Model Accuracy: {accuracy:.4f} ({accuracy*100:.2f}%)")

# Detailed classification report
print("\nDetailed Classification Report:")
print(classification_report(y_test, y_pred, target_names=data.target_names))

# Create confusion matrix
cm = confusion_matrix(y_test, y_pred)
print(f"\nConfusion Matrix:")
print(cm)

In [None]:
# Plot confusion matrix using seaborn
plt.figure(figsize=(8, 6))
sns.heatmap(cm, 
           annot=True, 
           fmt='d', 
           cmap='Blues',
           xticklabels=data.target_names,
           yticklabels=data.target_names,
           cbar_kws={'label': 'Count'})

plt.title('CatBoost Classifier - Confusion Matrix\nBreast Cancer Dataset', fontsize=14, pad=20)
plt.xlabel('Predicted Label', fontsize=12)
plt.ylabel('True Label', fontsize=12)
plt.tight_layout()
plt.show()

# Feature importance
feature_importance = catboost_classifier.feature_importances_
top_features_idx = np.argsort(feature_importance)[-10:]

print("\nTop 10 Important Features:")
for idx in reversed(top_features_idx):
    print(f"{data.feature_names[idx]}: {feature_importance[idx]:.4f}")

# Additional metrics (computed for label 1, which is 'benign' in this dataset)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(f"\n=== Additional Metrics ===")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1:.4f}")

# Model parameters used
print(f"\n=== Model Configuration ===")
print(f"Iterations: {catboost_classifier.get_params()['iterations']}")
print(f"Depth: {catboost_classifier.get_params()['depth']}")
print(f"Learning Rate: {catboost_classifier.get_params()['learning_rate']}")
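
The metrics printed above follow directly from the four cells of the confusion matrix. The sketch below uses hypothetical counts, not results from the model above, and assumes scikit-learn's layout for binary labels `[0, 1]`: rows are true labels and columns are predicted labels.

```python
# Hypothetical confusion matrix in scikit-learn layout:
# [[TN, FP],
#  [FN, TP]]   with label 1 treated as the positive class
cm_example = [[40, 2],
              [1, 71]]

tn, fp = cm_example[0]
fn, tp = cm_example[1]

accuracy_example = (tp + tn) / (tp + tn + fp + fn)
precision_example = tp / (tp + fp)  # of all predicted positives, how many were right
recall_example = tp / (tp + fn)     # of all actual positives, how many were found
f1_example = 2 * precision_example * recall_example / (precision_example + recall_example)

print(f"Accuracy:  {accuracy_example:.4f}")
print(f"Precision: {precision_example:.4f}")
print(f"Recall:    {recall_example:.4f}")
print(f"F1-Score:  {f1_example:.4f}")
```

In the Breast Cancer dataset label 1 is "benign", so the default precision and recall describe the benign class. If the malignant class is the one of interest, passing `pos_label=0` to the scikit-learn metric functions reports it instead.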

## Question 10: Complete Data Science Pipeline for Loan Default Prediction

**Task:**
You're working for a FinTech company trying to predict loan default using customer demographics and transaction behavior. The dataset is imbalanced, contains missing values, and has both numeric and categorical features.

Describe your step-by-step data science pipeline using boosting techniques:
- Data preprocessing & handling missing/categorical values
- Choice between AdaBoost, XGBoost, or CatBoost
- Hyperparameter tuning strategy
- Evaluation metrics you'd choose and why
- How the business would benefit from your model

In [None]:
# Import all necessary libraries for comprehensive pipeline
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import (roc_auc_score, roc_curve, precision_recall_curve, 
                           f1_score, recall_score)
from imblearn.over_sampling import SMOTE

print("=== FINTECH LOAN DEFAULT PREDICTION PIPELINE ===")
print("Using Boosting Techniques for Imbalanced Dataset\n")

# STEP 1: CREATE SYNTHETIC REALISTIC LOAN DATASET
print("STEP 1: Creating Synthetic Loan Dataset...")

np.random.seed(42)
n_samples = 5000

# Generate realistic loan data with missing values and mixed types
data = {
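    # Clip the draws so amounts stay positive and credit scores stay within 300-850
    'loan_amount': np.random.normal(25000, 15000, n_samples).clip(1000, None),
    'annual_income': np.random.normal(50000, 25000, n_samples).clip(10000, None),
    'credit_score': np.random.normal(650, 100, n_samples).clip(300, 850),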
    'employment_years': np.random.exponential(5, n_samples),
    'debt_to_income': np.random.uniform(0.1, 0.8, n_samples),
    'loan_term': np.random.choice([12, 24, 36, 48, 60], n_samples),
    'education': np.random.choice(['High School', 'Bachelor', 'Master', 'PhD'], n_samples),
    'home_ownership': np.random.choice(['Rent', 'Own', 'Mortgage'], n_samples),
    'loan_purpose': np.random.choice(['Debt Consolidation', 'Home Improvement', 
                                    'Medical', 'Business', 'Other'], n_samples),
    'state': np.random.choice(['CA', 'NY', 'TX', 'FL', 'IL'], n_samples)
}

# Create DataFrame
df = pd.DataFrame(data)

# Create realistic default probability based on features
default_prob = (
    (df['debt_to_income'] > 0.5) * 0.3 +
    (df['credit_score'] < 600) * 0.4 +
    (df['annual_income'] < 30000) * 0.2 +
    (df['employment_years'] < 2) * 0.15
)

# Create target variable (imbalanced; the realised default rate is printed below)
df['default'] = np.random.binomial(1, np.clip(default_prob, 0.05, 0.6), n_samples)

# Introduce missing values realistically
missing_indices = np.random.choice(n_samples, size=int(0.1 * n_samples), replace=False)
df.loc[missing_indices[:len(missing_indices)//2], 'credit_score'] = np.nan
df.loc[missing_indices[len(missing_indices)//2:], 'employment_years'] = np.nan

print(f"Dataset created with {n_samples} samples")
print(f"Default rate: {df['default'].mean():.1%}")
print(f"Missing values: {df.isnull().sum().sum()} total")

In [None]:
# STEP 2: DATA PREPROCESSING & FEATURE ENGINEERING
print("STEP 2: Data Preprocessing...")

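# Handle missing values
# (median imputation keeps the example simple; in a real pipeline, fit the imputer on the training split only)
print("Handling missing values...")
df['credit_score'] = df['credit_score'].fillna(df['credit_score'].median())
df['employment_years'] = df['employment_years'].fillna(df['employment_years'].median())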

# Feature engineering
print("Creating new features...")
df['loan_to_income_ratio'] = df['loan_amount'] / df['annual_income']
df['credit_score_category'] = pd.cut(df['credit_score'], 
                                   bins=[0, 600, 700, 850], 
                                   labels=['Poor', 'Good', 'Excellent'])

# Encode categorical variables
categorical_features = ['education', 'home_ownership', 'loan_purpose', 'state', 'credit_score_category']
label_encoders = {}

df_processed = df.copy()
for feature in categorical_features:
    le = LabelEncoder()
    df_processed[feature + '_encoded'] = le.fit_transform(df_processed[feature].astype(str))
    label_encoders[feature] = le

# Prepare features for modeling
numerical_features = ['loan_amount', 'annual_income', 'credit_score', 'employment_years', 
                     'debt_to_income', 'loan_term', 'loan_to_income_ratio']
encoded_features = [f + '_encoded' for f in categorical_features]

all_features = numerical_features + encoded_features
X = df_processed[all_features]
y = df_processed['default']

print(f"Final feature set: {len(all_features)} features")

In [None]:
# STEP 3: TRAIN-TEST SPLIT & CLASS IMBALANCE
print("STEP 3: Splitting data and addressing class imbalance...")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")
print(f"Training default rate: {y_train.mean():.1%}")

# Handle class imbalance with SMOTE
smote = SMOTE(random_state=42)
X_train_balanced, y_train_balanced = smote.fit_resample(X_train, y_train)

print(f"Original training set: {y_train.value_counts().to_dict()}")
print(f"Balanced training set: {pd.Series(y_train_balanced).value_counts().to_dict()}")
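
SMOTE creates each synthetic minority sample by interpolating between a minority point and one of its nearest minority neighbours. The sketch below shows only that interpolation step, with hypothetical feature values and a hypothetical helper `smote_like_sample`; the real algorithm also performs the nearest-neighbour search.

```python
import random

def smote_like_sample(point, neighbour, rng):
    """Return a point on the line segment between `point` and `neighbour`."""
    u = rng.random()  # interpolation factor in [0, 1)
    return [p + u * (q - p) for p, q in zip(point, neighbour)]

rng = random.Random(42)

# Hypothetical minority rows: [debt_to_income, label-encoded loan_purpose]
minority_point = [0.62, 3.0]
minority_neighbour = [0.70, 0.0]

synthetic = smote_like_sample(minority_point, minority_neighbour, rng)
print(f"Synthetic sample: {[round(value, 3) for value in synthetic]}")
```

The second coordinate illustrates a caveat: interpolating between label-encoded categories produces fractional codes that correspond to no real category. When categorical columns are present, `SMOTENC` from imbalanced-learn or class weighting in the booster may be a better fit.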

In [None]:
# STEP 4: MODEL SELECTION AND COMPARISON
print("STEP 4: Model Selection - Comparing Boosting Algorithms...")

# Initialize models
models = {
    'AdaBoost': AdaBoostClassifier(random_state=42),
    'XGBoost': XGBClassifier(random_state=42, eval_metric='logloss', verbosity=0),
    'CatBoost': CatBoostClassifier(random_state=42, verbose=False)
}

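# Cross-validation
# SMOTE runs inside each training fold (via an imblearn Pipeline) so that
# synthetic samples never leak into the validation folds
from imblearn.pipeline import Pipeline as ImbPipeline

cv_scores = {}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for name, model in models.items():
    pipeline = ImbPipeline([('smote', SMOTE(random_state=42)), ('model', model)])
    scores = cross_val_score(pipeline, X_train, y_train,
                           cv=cv, scoring='roc_auc', n_jobs=-1)
    cv_scores[name] = scores
    print(f"{name} CV AUC: {scores.mean():.4f} (+/- {scores.std() * 2:.4f})")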

# Select best model based on CV scores
best_model_name = max(cv_scores.keys(), key=lambda k: cv_scores[k].mean())
print(f"\nBest model: {best_model_name}")

In [None]:
# STEP 5: HYPERPARAMETER TUNING
print(f"STEP 5: Hyperparameter Tuning for {best_model_name}...")

if best_model_name == 'XGBoost':
    param_grid = {
        'n_estimators': [50, 100],
        'learning_rate': [0.1, 0.2],
        'max_depth': [3, 6],
        'subsample': [0.8, 1.0]
    }
    best_model = XGBClassifier(random_state=42, eval_metric='logloss', verbosity=0)
    
elif best_model_name == 'CatBoost':
    param_grid = {
        'iterations': [50, 100],
        'learning_rate': [0.1, 0.2],
        'depth': [4, 6]
    }
    best_model = CatBoostClassifier(random_state=42, verbose=False)
    
else:  # AdaBoost
    param_grid = {
        'n_estimators': [50, 100],
        'learning_rate': [0.5, 1.0, 1.5]
    }
    best_model = AdaBoostClassifier(random_state=42)

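# Perform grid search with SMOTE inside the pipeline, so resampling is
# applied only to the training portion of each fold
from imblearn.pipeline import Pipeline as ImbPipeline

grid_search = GridSearchCV(
    ImbPipeline([('smote', SMOTE(random_state=42)), ('model', best_model)]),
    {f'model__{key}': values for key, values in param_grid.items()},
    cv=3, scoring='roc_auc', n_jobs=-1
)

grid_search.fit(X_train, y_train)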

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.4f}")

In [None]:
# STEP 6: MODEL EVALUATION
print("STEP 6: Final Model Evaluation...")

# Train best model
final_model = grid_search.best_estimator_
y_pred = final_model.predict(X_test)
y_pred_proba = final_model.predict_proba(X_test)[:, 1]

# Calculate metrics
auc_score = roc_auc_score(y_test, y_pred_proba)
f1 = f1_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)

print(f"\n=== FINAL MODEL PERFORMANCE ===")
print(f"Model: {best_model_name}")
print(f"AUC Score: {auc_score:.4f}")
print(f"F1 Score: {f1:.4f}")
print(f"Recall (Sensitivity): {recall:.4f}")

print("\nDetailed Classification Report:")
print(classification_report(y_test, y_pred))

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
           xticklabels=['No Default', 'Default'],
           yticklabels=['No Default', 'Default'])
plt.title(f'{best_model_name} - Confusion Matrix\nLoan Default Prediction')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.tight_layout()
plt.show()

# ROC Curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label=f'{best_model_name} (AUC = {auc_score:.4f})')
plt.plot([0, 1], [0, 1], 'k--', label='Random')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve - Loan Default Prediction')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()

In [None]:
# STEP 7: BUSINESS IMPACT ANALYSIS
print("STEP 7: Business Impact Analysis...")

# Calculate business metrics
true_defaults = (y_test == 1).sum()
predicted_defaults = (y_pred == 1).sum()
correctly_identified_defaults = ((y_test == 1) & (y_pred == 1)).sum()

# Assume average loan amount and loss given default
avg_loan_amount = 25000
loss_given_default = 0.6  # 60% loss rate on defaults

# Business impact calculations
total_exposure = len(y_test) * avg_loan_amount
potential_losses = true_defaults * avg_loan_amount * loss_given_default
prevented_losses = correctly_identified_defaults * avg_loan_amount * loss_given_default
missed_losses = (true_defaults - correctly_identified_defaults) * avg_loan_amount * loss_given_default

print(f"\n=== BUSINESS IMPACT ANALYSIS ===")
print(f"Total loan portfolio exposure: ${total_exposure:,}")
print(f"Actual defaults in test set: {true_defaults}")
print(f"Predicted defaults: {predicted_defaults}")
print(f"Correctly identified defaults: {correctly_identified_defaults}")
print(f"Potential total losses: ${potential_losses:,}")
print(f"Prevented losses: ${prevented_losses:,}")
print(f"Missed losses: ${missed_losses:,}")
print(f"Loss prevention rate: {prevented_losses/potential_losses:.1%}")

# Risk-based pricing recommendations
high_risk_threshold = 0.3
medium_risk_threshold = 0.15

risk_categories = []
for prob in y_pred_proba:
    if prob >= high_risk_threshold:
        risk_categories.append('High Risk')
    elif prob >= medium_risk_threshold:
        risk_categories.append('Medium Risk')
    else:
        risk_categories.append('Low Risk')

risk_distribution = pd.Series(risk_categories).value_counts()
print(f"\n=== RISK-BASED PORTFOLIO SEGMENTATION ===")
print(risk_distribution)

print(f"\n=== MODEL DEPLOYMENT RECOMMENDATIONS ===")
print("1. Deploy model for real-time loan application scoring")
print("2. Implement risk-based pricing tiers")
print("3. Set up automated alerts for high-risk applications")
print("4. Regular model retraining (quarterly)")
print("5. Monitor model performance and drift")
print(f"6. Estimated gross prevented losses: ${prevented_losses:,} (before the cost of declined good loans)")

# Final summary
print(f"\n=== PIPELINE SUMMARY ===")
print(f"✓ Processed {n_samples} loan applications")
print(f"✓ Handled missing values and categorical features")
print(f"✓ Addressed class imbalance with SMOTE")
print(f"✓ Compared multiple boosting algorithms")
print(f"✓ Optimized hyperparameters")
print(f"✓ Achieved an AUC of {auc_score:.4f}")
print(f"✓ Prevented ${prevented_losses:,} in potential losses")
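
The business figures above count only the losses avoided on correctly flagged defaults. A fuller view also prices the good loans that are wrongly declined, which leads to choosing the decision threshold by expected cost. The sketch below is a minimal illustration: the labels, probabilities, helper names and the 2,000 cost of a false positive are hypothetical assumptions, while the 15,000 cost of a missed default reuses the 25,000 average loan and 60% loss rate from Step 7.

```python
def expected_cost(y_true, y_proba, threshold, cost_false_negative, cost_false_positive):
    """Total cost of missed defaults plus wrongly declined good loans at a threshold."""
    false_negatives = sum(1 for y, p in zip(y_true, y_proba) if y == 1 and p < threshold)
    false_positives = sum(1 for y, p in zip(y_true, y_proba) if y == 0 and p >= threshold)
    return false_negatives * cost_false_negative + false_positives * cost_false_positive

def best_threshold(y_true, y_proba, thresholds, cost_false_negative, cost_false_positive):
    """Threshold with the lowest expected cost among the candidates."""
    return min(thresholds,
               key=lambda t: expected_cost(y_true, y_proba, t,
                                           cost_false_negative, cost_false_positive))

# Hypothetical labels (1 = default) and predicted default probabilities
y_true_example = [0, 0, 0, 0, 1, 0, 1, 1]
y_proba_example = [0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.70, 0.90]

cost_fn = 25000 * 0.6  # missed default: average loan amount times loss given default
cost_fp = 2000         # assumed margin lost when a good loan is declined
candidates = [0.1, 0.3, 0.5, 0.7, 0.9]

chosen = best_threshold(y_true_example, y_proba_example, candidates, cost_fn, cost_fp)
chosen_cost = expected_cost(y_true_example, y_proba_example, chosen, cost_fn, cost_fp)
print(f"Lowest-cost threshold: {chosen} (total cost ${chosen_cost:,.0f})")
```

Because a missed default is assumed to cost far more than a declined good loan, the lowest-cost threshold here sits well below 0.5. The same reasoning explains why recall on the default class usually matters more than raw accuracy in this setting.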