In [1]:
# Import relevant libraries
import pandas as pd
from functools import reduce
import os
import pickle


# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

### **2.5 Apply the ML model to the test dataset**

In [2]:
# Load the test dataset
data_path = '../data/raw/home-credit-default-risk/'
processed_path = '../data/processed/'
app_test = pd.read_csv(data_path + 'application_test.csv')

In [3]:
# Transform the test dataset to match the training dataset for prediction
app_test['YEARS_BIRTH'] = app_test['DAYS_BIRTH'] / -365
app_test['YEARS_EMPLOYED'] = abs(app_test['DAYS_EMPLOYED'] / -365)

In [4]:
# Load the exported _agg DataFrames
bureau_agg = pd.read_csv(os.path.join(processed_path, 'bureau_agg.csv'))
pre_appl_agg = pd.read_csv(os.path.join(processed_path, 'pre_appl_agg.csv'))
pos_cash_agg = pd.read_csv(os.path.join(processed_path, 'pos_cash_agg.csv'))
install_pay_agg = pd.read_csv(os.path.join(processed_path, 'install_pay_agg.csv'))
credit_card_agg = pd.read_csv(os.path.join(processed_path, 'credit_card_agg.csv'))

In [5]:
# Load the final model from the file
model_path = '../models/'
with open(model_path+'final_model.pkl', 'rb') as model_file:
    final_model = pickle.load(model_file)

# Load the one-hot encoders from the file
with open(model_path+'one_hot_code_encoders.pkl', 'rb') as encoder_file:
    one_hot_encoders = pickle.load(encoder_file)

# Load the label encoders from the file
with open(model_path+'label_encoders.pkl', 'rb') as encoder_file:
    label_encoders = pickle.load(encoder_file)

# Load the imputer from the file
with open(model_path+'imputer.pkl', 'rb') as imputer_file:
    imputer = pickle.load(imputer_file)

# Load the selected Features from the file
with open(model_path+'selected_features.pkl', 'rb') as selected_features_file:
    selected_features = pickle.load(selected_features_file)

print("Model, encoders, selected features, and imputer have been loaded successfully.")

Model, encoders, selected features, and imputer have been loaded successfully.


In [6]:
# List of DataFrames to merge
dfs = [bureau_agg, pre_appl_agg, pos_cash_agg, install_pay_agg, credit_card_agg]

# Merge all DataFrames in the list on 'SK_ID_CURR' using reduce and lambda
final_merge_data = reduce(lambda left, right: left.merge(right, on='SK_ID_CURR', how='left'), [app_test] + dfs)

In [7]:
# Check the shape of the final merged data
app_test.shape, final_merge_data.shape

((48744, 123), (48744, 458))

In [8]:
# Apply the label and one-hot encoders to the test dataset
def apply_label_encoders(df, label_encoders, one_hot_encoders):
    """
    Applies label encoding and one-hot encoding to the DataFrame.
    
    Parameters:
    df (pd.DataFrame): The DataFrame containing the data.
    label_encoders (dict): Maps column names to fitted label encoders.
    one_hot_encoders (iterable): Names of the columns to one-hot encode
        (iterating a dict yields its keys).
    
    Returns:
    pd.DataFrame: A copy of the DataFrame with encoded categorical features.
    """
    df_encoded = df.copy()
    # Apply label encoding with the encoders fitted on the training data
    for column, encoder in label_encoders.items():
        df_encoded[column] = encoder.transform(df_encoded[column])
        
    # Apply one-hot encoding to each listed column
    for col in one_hot_encoders:
        df_encoded = pd.get_dummies(df_encoded, columns=[col], drop_first=True)
    
    return df_encoded

In [9]:
# Encode the test dataset
test_encoded_data = apply_label_encoders(final_merge_data, label_encoders, one_hot_encoders)

In [10]:
# Keep only the features that were selected on the training dataset
test_data_selected = test_encoded_data[selected_features]
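
The column selection above raises a `KeyError` if one-hot encoding the test data does not produce every column seen in training, for example when a category never appears in the test set. A minimal sketch of one way to guard against this, assuming the training column list is available (here it is rebuilt from a toy frame; the `CONTRACT` and `AMT` columns are hypothetical and not part of the dataset):

```python
import pandas as pd

# Toy frames; column names and values are illustrative assumptions
train = pd.DataFrame({"CONTRACT": ["Cash", "Revolving", "Cash"], "AMT": [1.0, 2.0, 3.0]})
test = pd.DataFrame({"CONTRACT": ["Cash", "Cash"], "AMT": [4.0, 5.0]})

# Training-side encoding, as in the notebook (drop_first=True)
train_encoded = pd.get_dummies(train, columns=["CONTRACT"], drop_first=True)
training_columns = list(train_encoded.columns)

# Encode the test data without dropping a level, then align to the training
# columns: missing dummies are added as 0 and extra dummies are discarded
test_encoded = pd.get_dummies(test, columns=["CONTRACT"])
test_aligned = test_encoded.reindex(columns=training_columns, fill_value=0)

print(test_aligned.columns.tolist())
```

Encoding the test data without `drop_first` and letting `reindex` discard the extra level avoids depending on which category happens to sort first in the test set.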

In [11]:
# Impute missing values with the imputer fitted on the training dataset
test_data_imputed = pd.DataFrame(imputer.transform(test_data_selected), columns=selected_features)

In [12]:
# Predict using the loaded model
y_test_pred = final_model.predict(test_data_imputed)
y_test_proba = final_model.predict_proba(test_data_imputed)[:, 1]

In [13]:
# Create submission file
submission = pd.DataFrame({"SK_ID_CURR": app_test["SK_ID_CURR"], "TARGET": y_test_proba})
submission.to_csv("../data/processed/submission.csv", index=False)

#### **Upload the results to Kaggle to check the score**

![Submission result from Kaggle](../reports/figures/Kaggle_result.png)

## **Part 3: Proposed methodologies for evaluating business impact and tracking model performance post-deployment**

### **Evaluating Business Impact**

#### **1. Assumptions**  
- Baseline AUC: **0.5** (random guessing)  
- Best Model AUC: **~0.77** (substantially better than random at ranking defaulters above non-defaulters)  

#### **2. Key Business Metrics**  
- **Default Detection Rate:**  
  - Measure the share of actual defaulters that the model flags as high-risk.  
- **Reduction in Bad Debt:**  
  - Measure the decrease in non-performing loans (NPL) attributable to improved risk assessment.  

#### **3. Approach: A/B Testing**  
- **Group A (Control):** Applicants randomly assigned to the existing credit risk assessment process.  
- **Group B (Test):** Applicants randomly assigned to loan approval with the ML model.  
- **Evaluation Period:** Monitor financial impact over 3-6 months.  
- **Expected Outcome:** If Group B shows a statistically significant reduction in bad debt while maintaining revenue growth, the ML model is validated for full deployment.  
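
One way to judge whether the difference in default rates between the two groups is statistically significant is a one-sided two-proportion z-test. The sketch below is illustrative only: the function name and the counts are hypothetical, and it assumes random assignment and a binary default outcome per loan.

```python
import math

def two_proportion_ztest(defaults_a, n_a, defaults_b, n_b):
    """One-sided z-test of H1: group B's default rate is lower than group A's."""
    p_a = defaults_a / n_a
    p_b = defaults_b / n_b
    # Pooled default rate under the null hypothesis of equal rates
    p_pool = (defaults_a + defaults_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Upper-tail probability of the standard normal distribution
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value

# Hypothetical counts: 400 defaults in 5000 control loans, 330 in 5000 test loans
z_stat, p_val = two_proportion_ztest(defaults_a=400, n_a=5000, defaults_b=330, n_b=5000)
print(f"z = {z_stat:.2f}, one-sided p = {p_val:.4f}")
```

A small p-value supports the claim that the ML-driven process lowers the default rate; revenue should be compared separately, since a model can reduce defaults simply by rejecting more applicants.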

---

### **Tracking Model Performance Post-Deployment**  

#### **1. ROC-AUC Monitoring**  
- Track **ROC-AUC** on each new cohort of loans once repayment outcomes become available, to confirm that the model maintains its predictive power over time.  
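
A minimal sketch of per-period ROC-AUC tracking with `sklearn.metrics.roc_auc_score`. The column names `period` and `score`, the toy data, and the alert tolerance of 0.03 are assumptions for illustration; the reference value 0.77 is the validation AUC quoted above.

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

def auc_by_period(scored, period_col="period", target_col="TARGET", score_col="score"):
    """Compute ROC-AUC separately for each period of scored, labelled loans."""
    rows = []
    for period, grp in scored.groupby(period_col):
        # ROC-AUC is undefined when a period contains only one class
        if grp[target_col].nunique() < 2:
            continue
        auc = roc_auc_score(grp[target_col], grp[score_col])
        rows.append({"period": period, "auc": auc, "n": len(grp)})
    return pd.DataFrame(rows)

def flag_degradation(auc_table, reference_auc, tolerance=0.03):
    """Mark periods whose AUC falls more than `tolerance` below the reference."""
    out = auc_table.copy()
    out["alert"] = out["auc"] < reference_auc - tolerance
    return out

# Toy data: predicted default probabilities with observed outcomes
scored = pd.DataFrame({
    "period": ["2024-01"] * 4 + ["2024-02"] * 4,
    "TARGET": [0, 0, 1, 1, 0, 1, 0, 1],
    "score": [0.1, 0.2, 0.8, 0.9, 0.6, 0.4, 0.3, 0.5],
})
report = flag_degradation(auc_by_period(scored), reference_auc=0.77)
print(report)
```

In practice each period needs enough labelled loans for the AUC estimate to be stable, so small cohorts may need to be pooled before alerting on them.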

#### **2. Drift Detection**  
- **Feature Drift:** Monitor whether the distributions of key features that drive loan approvals change relative to the training data.  
- **Concept Drift:** Detect shifts in the relationship between the features and default risk, for example due to economic changes.  
- **Feedback Loops:** Analyze false positives (incorrectly flagged as risky) and false negatives (missed defaulters) to refine the model.  
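
Feature drift can be quantified with the Population Stability Index (PSI), which compares the binned distribution of a feature in production against the training data. The sketch below is one possible implementation using NumPy only; the function name, the bin count, and the synthetic data are assumptions. Cut-offs near 0.1 (moderate shift) and 0.25 (major shift) are common rules of thumb rather than fixed standards.

```python
import numpy as np

def population_stability_index(expected, actual, bins=10):
    """PSI between a reference sample (expected) and a new sample (actual)."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)
    expected = expected[~np.isnan(expected)]
    actual = actual[~np.isnan(actual)]

    # Bin edges are quantiles of the reference sample; only the interior
    # edges are used, so values outside the reference range fall into the
    # first or last bin
    edges = np.unique(np.quantile(expected, np.linspace(0, 1, bins + 1)))
    interior = edges[1:-1]
    n_bins = len(interior) + 1

    exp_counts = np.bincount(np.searchsorted(interior, expected, side="right"), minlength=n_bins)
    act_counts = np.bincount(np.searchsorted(interior, actual, side="right"), minlength=n_bins)

    # Clip proportions away from zero to keep the logarithm finite
    eps = 1e-6
    exp_pct = np.clip(exp_counts / exp_counts.sum(), eps, None)
    act_pct = np.clip(act_counts / act_counts.sum(), eps, None)
    return float(np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct)))

# Synthetic example: a stable feature versus one whose mean has shifted
rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, 5000)
stable = rng.normal(0.0, 1.0, 5000)
shifted = rng.normal(1.0, 1.0, 5000)

psi_stable = population_stability_index(reference, stable)
psi_shifted = population_stability_index(reference, shifted)
print(f"stable: {psi_stable:.3f}, shifted: {psi_shifted:.3f}")
```

PSI only captures changes in feature distributions. Concept drift still has to be checked against observed outcomes, for instance through the ROC-AUC tracking described earlier.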

---

### **Post-Deployment Improvement Strategies**  

#### **1. Periodic Model Retraining**  
- Retrain the model on the latest data to account for changes in customer behavior and market trends. Retraining can be triggered when the data distribution changes or on a fixed schedule, such as quarterly.  
- Adjust hyperparameters and feature selection based on new insights.  
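
The retraining triggers can be combined into a simple rule. This is a sketch under stated assumptions: the function name, the 90-day schedule, the PSI threshold of 0.25, and the AUC tolerance of 0.03 are illustrative choices, not values taken from the project.

```python
from datetime import date

def retraining_reasons(last_trained, today, max_feature_psi, current_auc,
                       reference_auc, max_age_days=90, psi_threshold=0.25,
                       auc_tolerance=0.03):
    """Return the list of triggers that currently call for retraining."""
    reasons = []
    # Scheduled retraining, roughly quarterly
    if (today - last_trained).days >= max_age_days:
        reasons.append("scheduled")
    # Largest PSI observed across the monitored features
    if max_feature_psi >= psi_threshold:
        reasons.append("feature_drift")
    # Drop in ROC-AUC on recently labelled loans
    if current_auc < reference_auc - auc_tolerance:
        reasons.append("auc_drop")
    return reasons

healthy = retraining_reasons(date(2024, 1, 1), date(2024, 2, 1),
                             max_feature_psi=0.05, current_auc=0.76, reference_auc=0.77)
degraded = retraining_reasons(date(2024, 1, 1), date(2024, 4, 15),
                              max_feature_psi=0.30, current_auc=0.70, reference_auc=0.77)
print(healthy, degraded)
```

Returning the reasons rather than a single boolean keeps a record of why a retraining run was started, which helps when reviewing whether the thresholds are set sensibly.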

---

### **Conclusion**  
By combining **A/B testing** for business validation, **drift detection**, and **ongoing monitoring**, we ensure that the ML model remains effective and drives measurable business impact. This approach helps align technical model performance with **key financial and operational objectives**, ensuring long-term success.  