### 1: Imports and Load All Assets

In [1]:
import pandas as pd
import numpy as np
import os
import warnings
import joblib
import optuna # Our new library for Phase 5

# --- Core Modeling Imports ---
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.metrics import f1_score

# --- Pipeline & Imbalance (Finding 4) ---
from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import SMOTE

# --- Model Algorithm ---
from xgboost import XGBClassifier

# --- Setup ---
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
optuna.logging.set_verbosity(optuna.logging.WARNING) # Quiets Optuna's logs

print("Libraries imported for Phase 5.")

# --- Define File Paths ---
# Use the full, confirmed paths from our last step
FULL_PATH_TO_PROCESSED = '/Users/yogeshdhaliya/Desktop/DS Learning/11. Projects/Credit-Risk-Prediction/data/processed'
FULL_PATH_TO_MODELS = '/Users/yogeshdhaliya/Desktop/DS Learning/11. Projects/Credit-Risk-Prediction/models'

X_path = os.path.join(FULL_PATH_TO_PROCESSED, 'X_model_input.csv')
y_path = os.path.join(FULL_PATH_TO_PROCESSED, 'y_target.csv')
PREPROCESSOR_path = os.path.join(FULL_PATH_TO_MODELS, 'preprocessor.joblib')
TARGET_ENCODER_path = os.path.join(FULL_PATH_TO_MODELS, 'target_encoder.joblib')

print("Loading all required assets...")

try:
    # Load our 38-feature dataset
    X = pd.read_csv(X_path, index_col='PROSPECTID')
    # Load the original string target
    y = pd.read_csv(y_path, index_col='PROSPECTID').squeeze('columns')
    
    # Load our saved preprocessor and target encoder
    preprocessor = joblib.load(PREPROCESSOR_path)
    target_encoder = joblib.load(TARGET_ENCODER_path)
    
    print("\n--- Assets Loaded Successfully ---")
    print(f"X (model input) shape: {X.shape}")
    print(f"y (target) shape: {y.shape}")

except FileNotFoundError as e:
    print("\n[ERROR] Files not found. Please ensure your paths are correct.")
    print(e)
    raise  # Stop here: every later block depends on these assets

Libraries imported for Phase 5.
Loading all required assets...

--- Assets Loaded Successfully ---
X (model input) shape: (51336, 38)
y (target) shape: (51336,)


### 2: Re-create Data Split (for Tuning)

In [2]:
print("\n--- Re-creating Data Split ---")

# 1. Encode y (using our saved encoder)
y_encoded = target_encoder.transform(y)

# 2. Split the data with the SAME random_state
X_train, X_test, y_train, y_test = train_test_split(
    X, 
    y_encoded, 
    test_size=0.2, 
    random_state=42, 
    stratify=y_encoded  # Per Finding 4
)

print(f"X_train shape: {X_train.shape}, y_train shape: {y_train.shape}")
print(f"X_test shape: {X_test.shape}, y_test shape: {y_test.shape}")
print("Data is ready for tuning.")


--- Re-creating Data Split ---
X_train shape: (41068, 38), y_train shape: (41068,)
X_test shape: (10268, 38), y_test shape: (10268,)
Data is ready for tuning.


### 3: Define the Optuna "Objective" Function

In [3]:
def objective(trial):
    """
    Optuna objective function to maximize the f1_macro score
    for our SMOTE + XGBoost pipeline.
    """
    
    # --- 1. Define Hyperparameter Search Space ---
    
    # A. SMOTE parameters
    # We'll tune k_neighbors, as it's the most sensitive
    smote_k = trial.suggest_int('smote_k_neighbors', 3, 10, log=True)
    
    # B. XGBoost parameters
    xgb_n_estimators = trial.suggest_int('xgb_n_estimators', 200, 1000)
    xgb_max_depth = trial.suggest_int('xgb_max_depth', 3, 10)
    xgb_learning_rate = trial.suggest_float('xgb_learning_rate', 0.01, 0.3, log=True)
    xgb_gamma = trial.suggest_float('xgb_gamma', 1e-8, 1.0, log=True)
    xgb_subsample = trial.suggest_float('xgb_subsample', 0.6, 1.0)
    xgb_colsample_bytree = trial.suggest_float('xgb_colsample_bytree', 0.6, 1.0)
    xgb_reg_alpha = trial.suggest_float('xgb_reg_alpha', 1e-8, 1.0, log=True)
    xgb_reg_lambda = trial.suggest_float('xgb_reg_lambda', 1e-8, 1.0, log=True)
    
    # --- 2. Create the Pipeline with Trial Parameters ---
    
    pipeline = ImbPipeline(steps=[
        ('preprocessor', preprocessor),
        ('smote', SMOTE(random_state=42, k_neighbors=smote_k)),
        ('model', XGBClassifier(
            objective='multi:softmax',
            num_class=4,
            eval_metric='mlogloss',
            random_state=42,
            n_jobs=-1,
            n_estimators=xgb_n_estimators,
            max_depth=xgb_max_depth,
            learning_rate=xgb_learning_rate,
            gamma=xgb_gamma,
            subsample=xgb_subsample,
            colsample_bytree=xgb_colsample_bytree,
            reg_alpha=xgb_reg_alpha,
            reg_lambda=xgb_reg_lambda
        ))
    ])
    
    # --- 3. Evaluate the Pipeline ---
    # We use 3-fold CV for speed during tuning
    cv_strategy = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
    
    score = cross_val_score(
        pipeline,
        X_train,  # Tune only on the training data
        y_train,
        cv=cv_strategy,
        scoring='f1_macro',
        n_jobs=-1
    )
    
    # --- 4. Return the Mean Score ---
    return np.mean(score)

print("Optuna 'objective' function defined successfully.")

Optuna 'objective' function defined successfully.


### 4: Run the Optuna Study

In [4]:
print("\n--- Starting Optuna Study (n_trials=50) ---")
print("This will take a significant amount of time. Please be patient.")

# 1. Create a study, 'direction' tells it to maximize the score
study = optuna.create_study(direction='maximize')

# 2. Run the optimization
# We pass our function and the number of trials
study.optimize(objective, n_trials=50)

# 3. Print the results
print("\n--- Optuna Study Complete ---")
print(f"Best F1 Macro Score: {study.best_value:.4f}")
print("\nBest Hyperparameters Found:")
print(study.best_params)


--- Starting Optuna Study (n_trials=50) ---
This will take a significant amount of time. Please be patient.

--- Optuna Study Complete ---
Best F1 Macro Score: 0.6107

Best Hyperparameters Found:
{'smote_k_neighbors': 6, 'xgb_n_estimators': 456, 'xgb_max_depth': 3, 'xgb_learning_rate': 0.02629064786870942, 'xgb_gamma': 0.07798851124134179, 'xgb_subsample': 0.620775868670427, 'xgb_colsample_bytree': 0.9978076845683496, 'xgb_reg_alpha': 3.0506264787927e-06, 'xgb_reg_lambda': 0.275595077031217}


### 5: Build, Train, and Evaluate the Tuned Model

In [5]:
from sklearn.metrics import classification_report

print("\n--- Building and Evaluating Tuned Model ---")

# 1. Get the best parameters from the study
try:
    best_params = study.best_params
    print(f"Loaded best parameters: {best_params}")
except NameError:
    print("[ERROR] 'study' object not found. Please re-run Block 4 or paste the 'best_params' dictionary here manually.")
    # In a real scenario, you'd paste the dictionary:
    # best_params = {'smote_k_neighbors': 6, 'xgb_n_estimators': 456, ...}
    raise

# 2. Create the new, tuned pipeline
tuned_pipeline = ImbPipeline(steps=[
    ('preprocessor', preprocessor),
    ('smote', SMOTE(
        random_state=42, 
        k_neighbors=best_params['smote_k_neighbors'] # From Optuna
    )),
    ('model', XGBClassifier(
        objective='multi:softmax',
        num_class=4,
        eval_metric='mlogloss',
        random_state=42,
        n_jobs=-1,
        # --- Unpack all the tuned XGB parameters ---
        n_estimators=best_params['xgb_n_estimators'],
        max_depth=best_params['xgb_max_depth'],
        learning_rate=best_params['xgb_learning_rate'],
        gamma=best_params['xgb_gamma'],
        subsample=best_params['xgb_subsample'],
        colsample_bytree=best_params['xgb_colsample_bytree'],
        reg_alpha=best_params['xgb_reg_alpha'],
        reg_lambda=best_params['xgb_reg_lambda']
    ))
])

print("\nTuned pipeline created successfully.")

# 3. Train the new pipeline on the FULL training set
print("Training final tuned model on full train set...")
tuned_pipeline.fit(X_train, y_train)
print("Tuned model trained.")

# 4. Evaluate on the held-out test set
y_pred_tuned = tuned_pipeline.predict(X_test)

# 5. Decode labels for the report
y_test_labels = target_encoder.inverse_transform(y_test)
y_pred_tuned_labels = target_encoder.inverse_transform(y_pred_tuned)

# 6. Generate and print the new report
print("\n--- Final Tuned Model Evaluation Report (on Test Set) ---")
tuned_report = classification_report(
    y_test_labels, 
    y_pred_tuned_labels, 
    labels=target_encoder.classes_,
    zero_division=0
)
print(tuned_report)


--- Building and Evaluating Tuned Model ---
Loaded best parameters: {'smote_k_neighbors': 6, 'xgb_n_estimators': 456, 'xgb_max_depth': 3, 'xgb_learning_rate': 0.02629064786870942, 'xgb_gamma': 0.07798851124134179, 'xgb_subsample': 0.620775868670427, 'xgb_colsample_bytree': 0.9978076845683496, 'xgb_reg_alpha': 3.0506264787927e-06, 'xgb_reg_lambda': 0.275595077031217}

Tuned pipeline created successfully.
Training final tuned model on full train set...
Tuned model trained.

--- Final Tuned Model Evaluation Report (on Test Set) ---
              precision    recall  f1-score   support

          P1       0.60      0.75      0.67      1161
          P2       0.81      0.82      0.81      6440
          P3       0.39      0.28      0.33      1491
          P4       0.63      0.63      0.63      1176

    accuracy                           0.71     10268
   macro avg       0.61      0.62      0.61     10268
weighted avg       0.70      0.71      0.71     10268



### 6: Save the Final Tuned Model

In [6]:
print("\n--- Saving Final Tuned Model ---")

# --- 1. Define Save Path ---
# (Using the paths from Block 1)
TUNED_MODEL_SAVE_PATH = os.path.join(FULL_PATH_TO_MODELS, 'tuned_model.joblib')

# --- 2. Save the Full Tuned Pipeline ---
# This is 'tuned_pipeline' from Block 5 (preprocessor + SMOTE + XGBoost)
joblib.dump(tuned_pipeline, TUNED_MODEL_SAVE_PATH)

print(f"Final tuned model pipeline saved to: {TUNED_MODEL_SAVE_PATH}")
print("\n--- Phase 5 is 100% complete and all assets are saved. ---")


--- Saving Final Tuned Model ---
Final tuned model pipeline saved to: /Users/yogeshdhaliya/Desktop/DS Learning/11. Projects/Credit-Risk-Prediction/models/tuned_model.joblib

--- Phase 5 is 100% complete and all assets are saved. ---


## Summary of Phase 5: Hyperparameter Tuning

Our primary goal in this phase was to improve upon our baseline model, which had a test-set `f1-macro` score of 0.60 and a critical weakness in identifying the **P3 (subprime)** class (0.25 F1-score).

### 1. The Tuning Strategy

We used **`Optuna`** to run a 50-trial hyperparameter search. We optimized for the `f1-macro` score by tuning a wide range of parameters for both `SMOTE` (like `k_neighbors`) and our `XGBClassifier` (like `n_estimators`, `max_depth`, and `learning_rate`).

### 2. The Result: A "Smarter" Model

The tuning improved the model. `Optuna` found a set of parameters that achieved a 3-fold cross-validation `f1-macro` score of **0.6107** on the training data, up from our baseline of 0.5933 (a gain of about 0.017).

### 3. Final Evaluation & Key Success

We trained a new pipeline with these "best parameters" and evaluated it on our unseen test set. The results confirmed the improvement:

* **Our primary goal was met:** The `f1-score` for the difficult **P3 class** rose from **0.25 to 0.33**, a **32% relative improvement** (0.08 in absolute terms).
* **A small trade-off:** The P3 gain came with a slight drop in F1 on the dominant P2 class (0.83 -> 0.81), while P1 and P4 performance held steady.
* **Overall improvement:** The final `f1-macro` score on the test set increased from **0.60 to 0.61**.
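
The figures in this list can be re-derived from the test-set classification report above. The short check below recomputes the macro average and the P3 gain. The tuned per-class values are copied from that report, and the baseline P3 score of 0.25 is the one quoted in this summary.

```python
# Per-class F1 scores copied from the tuned model's test-set report.
tuned_f1 = {"P1": 0.67, "P2": 0.81, "P3": 0.33, "P4": 0.63}

# Macro F1 is the unweighted mean of the per-class F1 scores.
macro_f1 = sum(tuned_f1.values()) / len(tuned_f1)

# Baseline P3 F1 as quoted in this summary.
baseline_p3_f1 = 0.25
absolute_gain = tuned_f1["P3"] - baseline_p3_f1
relative_gain = absolute_gain / baseline_p3_f1

print(f"macro F1: {macro_f1:.2f}")                # macro F1: 0.61
print(f"P3 absolute gain: {absolute_gain:.2f}")   # P3 absolute gain: 0.08
print(f"P3 relative gain: {relative_gain:.0%}")   # P3 relative gain: 32%
```

Because macro F1 weights all four classes equally, a gain on the small P3 class moves it as much as the same gain on P2 would, which is why it suits this imbalanced problem.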

### 4. Final Deliverable

We have saved this new, superior "champion" pipeline as **`tuned_model.joblib`**. This is the final model we will use for our application.
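
Before relying on a saved pipeline, it can help to confirm that it reloads and predicts identically. The sketch below shows such a round-trip check on a small stand-in pipeline written to a temporary directory. The data, the `demo_pipeline` object, and the file name `demo_model.joblib` are illustrative assumptions, and the same pattern would apply to `tuned_pipeline` and `tuned_model.joblib`.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small synthetic dataset used only for this demonstration.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(60, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)

# Stand-in for the tuned pipeline.
demo_pipeline = Pipeline(steps=[
    ("scale", StandardScaler()),
    ("model", LogisticRegression()),
]).fit(X_demo, y_demo)

# Save, reload, and compare predictions from both copies.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_model.joblib")
    joblib.dump(demo_pipeline, demo_path)
    reloaded_pipeline = joblib.load(demo_path)

round_trip_ok = np.array_equal(
    demo_pipeline.predict(X_demo),
    reloaded_pipeline.predict(X_demo),
)
print("Reloaded pipeline matches original:", round_trip_ok)
```

Note that a joblib file is generally only safe to load in an environment with the same library versions that produced it, so the app's environment should match this notebook's.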

---

This completes our model development and evaluation. We have successfully followed all 4 Key Findings and produced a final, tuned, and stable model that outperforms our baseline, especially on our key "P3" challenge.

We are now ready to begin our final phase: **Phase 6: Deployment Preparation**.

Our goal here is to build the Streamlit web app that is our final product. This app will:
1.  Load our two key assets: `tuned_model.joblib` (which is the full pipeline) and `target_encoder.joblib` (to decode the predictions).
2.  Provide a file uploader for the user (the underwriting team).
3.  Process the uploaded CSV of new applicants, run them through our pipeline, and display the predicted risk categories (P1, P2, P3, or P4).
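
The scoring step in this plan can be sketched independently of the UI. The helper below is a minimal sketch, not the app itself: `score_applicants` is a hypothetical name, the stand-in model and encoder replace the real `tuned_model.joblib` and `target_encoder.joblib`, and it assumes the encoder behaves like a scikit-learn `LabelEncoder` and that uploads carry the same `PROSPECTID` index column used in Block 1.

```python
import io

import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.preprocessing import LabelEncoder


def score_applicants(csv_file, pipeline, target_encoder, id_col="PROSPECTID"):
    """Read applicants from a path or file-like object and return decoded predictions."""
    applicants = pd.read_csv(csv_file, index_col=id_col)
    encoded = np.asarray(pipeline.predict(applicants)).astype(int)
    labels = target_encoder.inverse_transform(encoded)
    return pd.Series(labels, index=applicants.index, name="predicted_risk")


# --- Stand-ins for the saved assets (assumptions for this demo only) ---
demo_encoder = LabelEncoder().fit(["P1", "P2", "P3", "P4"])
demo_X = pd.DataFrame({"feature_a": [0.1, 0.4, 0.9, 0.7]})
demo_y = demo_encoder.transform(["P2", "P2", "P2", "P3"])
demo_pipeline = DummyClassifier(strategy="most_frequent").fit(demo_X, demo_y)

# An in-memory CSV plays the role of the uploaded file.
uploaded = io.StringIO("PROSPECTID,feature_a\n101,0.2\n102,0.8\n")
predictions = score_applicants(uploaded, demo_pipeline, demo_encoder)
print(predictions)
```

In the Streamlit app, the object returned by `st.file_uploader` is file-like, so it could be passed to this helper in place of the `StringIO` buffer. Keeping the scoring logic in a plain function also makes it testable without launching the UI.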
