# Mohammad Tayyab Alam 24157 Midterm

## This notebook outlines the best-performing model, which achieved a score of 0.95338 on the Kaggle leaderboard.
### Objective
The goal of this project is to build a predictive model to identify overstressed customers in a microfinance setting. This will help financial institutions intervene early and avoid loan defaults. The dataset contains customer financial data, and the target variable `Y` indicates whether the customer is overstressed (1) or not (0).

# NOTE:
Throughout this project, my goal was to achieve the highest possible score, so most of the time and effort went into exploring the parameters of the best machine learning model. The project was carried out in two phases:

1.   Phase 1 -> All 10 models were tested on the raw data with no feature engineering, only minimal missing-value imputation and scaling, and no intensive hyperparameter tuning, so that every model started on an even footing. At the end of this phase the models that struggled were eliminated; only XGBoost, LightGBM, CatBoost, and Random Forest survived.
2.   Phase 2 -> The best-performing models were iterated on intensively: hyperparameters were tuned via random search, and feature engineering and regularization were also explored.



# Data Loading

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np

# Load datasets
train_df = pd.read_csv('fda_trainingset.csv')
test_df = pd.read_csv('fda_testset.csv')

# Check the first few rows of the training and test datasets
train_df.head(), test_df.head()

(   ID  Income index  income_volatility  employment_status_flag  \
 0   1     87.000000          34.118411                       0   
 1   2     82.372284          31.573280                       0   
 2   3     50.000000          27.771653                       0   
 3   4     66.236109          26.515922                       0   
 4   5     81.303299          20.843691                       0   
 
    dependents_count  gender_flag     Payment  collateral_flag  \
 0                 2            0  165.100000                1   
 1                 0            1  162.983897                1   
 2                 0            1  165.100000                1   
 3                 0            1  167.009549                1   
 4                 0            1  158.165419                0   
 
    repayment_history_score  missed_payment_count  ...       X70  X71  X72  \
 0                      829                     2  ...  0.040000  0.0  0.0   
 1                      724               

### Missing Value Counts

In [2]:
# Calculate missing values
missing_counts = train_df.isnull().sum()
missing_percent = (missing_counts / len(train_df)) * 100

# Combine into a summary DataFrame
missing_summary = pd.DataFrame({
    'Missing Count': missing_counts,
    'Missing Percentage': missing_percent
})

# Filter out columns with no missing values
missing_summary = missing_summary[missing_summary['Missing Count'] > 0]

# Sort by most missing
missing_summary = missing_summary.sort_values(by='Missing Count', ascending=False)

print(missing_summary)

                   Missing Count  Missing Percentage
Income index                2112              1.0560
income_volatility           1720              0.8600
X76                          371              0.1855
X78                          367              0.1835
X77                          360              0.1800
X75                          357              0.1785


In [3]:
# Check class distribution in column 'Y'
class_distribution = train_df['Y'].value_counts()

# Calculate percentages for each class
class_percentage = train_df['Y'].value_counts(normalize=True) * 100

# Display class counts and percentages
print("Class Distribution (Counts):")
print(class_distribution)
print("\nClass Distribution (Percentages):")
print(class_percentage)

Class Distribution (Counts):
0    199475
1       525
Name: Y, dtype: int64

Class Distribution (Percentages):
0    99.7375
1     0.2625
Name: Y, dtype: float64


# Data Preprocessing and Feature Engineering

We first handle missing values and standardize the data for better model performance. We also create additional row-wise features (the sum, mean, and standard deviation across each row's features) to capture more information from the data. These simple features helped increase the score from 0.93 to 0.95.


In [4]:
# Preprocessing
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Separate features and target
X = train_df.drop(columns=['Y'])
y = train_df['Y'].astype(pd.Int64Dtype()).fillna(0)

test_ids = test_df['ID']  # Save IDs

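# Combine train and test data for consistent preprocessing
# Note: X still contains the 'ID' column (only 'Y' was dropped), while the test
# frame has 'ID' removed, so 'ID' is NaN for the test rows and is median-imputed below.
combined = pd.concat([X, test_df.drop(columns=['ID'])], axis=0)

# Impute missing values with the column median (fitted on train and test together)
imputer = SimpleImputer(strategy='median')
combined_imputed = pd.DataFrame(imputer.fit_transform(combined), columns=combined.columns)

# Feature engineering: row-wise statistics
# Note: each statistic is computed over all columns present at that point, so
# feature_mean includes feature_sum, and feature_std includes both.
combined_imputed['feature_sum'] = combined_imputed.sum(axis=1)
combined_imputed['feature_mean'] = combined_imputed.mean(axis=1)
combined_imputed['feature_std'] = combined_imputed.std(axis=1)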

# Standardization
scaler = StandardScaler()
combined_scaled = pd.DataFrame(scaler.fit_transform(combined_imputed), columns=combined_imputed.columns)

# Split back into train and test sets
X_processed = combined_scaled.iloc[:len(X)]
test_processed = combined_scaled.iloc[len(X):]
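
In the cell above, `feature_mean` is computed after `feature_sum` has been added, so the mean includes the sum column, and `feature_std` includes both. That is the version that produced the scores reported in this notebook. If the intent is for each statistic to describe only the original columns, a minimal sketch could look like the following. The toy data and the name `base_cols` are assumptions made for illustration and are not part of the original pipeline.

```python
import pandas as pd

# Toy frame standing in for the imputed feature matrix (values are illustrative).
df = pd.DataFrame({"a": [1.0, 4.0], "b": [2.0, 5.0], "c": [3.0, 6.0]})

# Freeze the list of original columns before adding any engineered feature.
base_cols = list(df.columns)

# Every statistic is computed over the same base columns only.
df["feature_sum"] = df[base_cols].sum(axis=1)
df["feature_mean"] = df[base_cols].mean(axis=1)
df["feature_std"] = df[base_cols].std(axis=1)  # sample std (ddof=1), the pandas default

print(df)
```

Whether this variant scores better or worse than the original was not tested here; it only makes the three features independent of the order in which they are created.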


### Rechecking Missing Values

In [5]:
# Calculate missing values
missing_counts = X_processed.isnull().sum()
missing_percent = (missing_counts / len(X_processed)) * 100

# Combine into a summary DataFrame
missing_summary = pd.DataFrame({
    'Missing Count': missing_counts,
    'Missing Percentage': missing_percent
})

# Filter out columns with no missing values
missing_summary = missing_summary[missing_summary['Missing Count'] > 0]

# Sort by most missing
missing_summary = missing_summary.sort_values(by='Missing Count', ascending=False)

print(missing_summary)

Empty DataFrame
Columns: [Missing Count, Missing Percentage]
Index: []


# Model Selection and Hyper Parameter Tuning
## Throughout this project, around 80 submissions were made to determine the best possible model. Initial submissions focused on models such as:

| Model               | Best Score     | Remarks                                                                 |
|---------------------|----------------|-------------------------------------------------------------------------|
| Decision Tree       | 0.51470         | Model was discarded due to low performance                             |
| KNN                 | 0.55477         | Model was discarded due to low performance                             |
| Logistic Regression | 0.51836         | Model was discarded due to low performance                             |
| XGBOOST             | **0.95338**     | Best model with significant hyperparameter tuning                      |
| CATBOOST            | ~0.94           | Strong performance, but did not beat XGBOOST                           |
| LGBOOST             | ~0.93           | Good performance, but behind CATBOOST and XGBOOST                      |
| STACKING            | ~0.95           | Ensemble of LG, XG, and CATBOOST; strong but still behind XGBOOST      |
| ADABOOST            | ~0.52           | Model was discarded due to low performance                             |
| GRADIENT DESCENT    | ~0.50           | Model was discarded due to low performance                             |
| RANDOM FOREST       | ~0.84           | First strong model found                                               |
| NAIVE BAYES         | ~0.50           | Model was discarded due to low performance                             |


### When submitting these models, it quickly became clear that basic models like Decision Tree, Logistic Regression, and Naive Bayes struggle on this problem. This can be attributed to the 79 columns and the severe class imbalance: only 525 positive cases (1) out of 200,000 rows.
### So the next approach was to move to more complex models like LGBOOST, XGBOOST, CATBOOST, and ADABOOST, and to use stacking ensembles if required.

#### The strategy for the Kaggle submissions was as follows: if a model scored low on an initial run (missing values imputed, data scaled, class imbalance handled via SMOTE), it was discarded because of the time constraints and the competitive nature of the task. Once the best models were identified (LGBOOST, CATBOOST, XGBOOST, and a STACKING of these three), significant effort went into improving their scores via feature engineering and hyperparameter tuning, using random search instead of grid search due to computational limitations.
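
To put the class imbalance in perspective, the sketch below builds labels with the same class counts as the training set (199,475 zeros and 525 ones) and scores a trivial "model" that always predicts the majority class. The constant predictor is a hypothetical baseline introduced here for illustration; it is not one of the submitted models.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Labels with the same class counts as the training set.
y_true = np.concatenate([np.zeros(199_475, dtype=int), np.ones(525, dtype=int)])

# A baseline that gives every customer the same score and the majority label.
constant_scores = np.zeros(len(y_true), dtype=float)
constant_labels = np.zeros(len(y_true), dtype=int)

acc = accuracy_score(y_true, constant_labels)
auc = roc_auc_score(y_true, constant_scores)
print(f"accuracy={acc:.6f}, auc={auc:.2f}")
```

The baseline reaches an accuracy of 0.997375 while its AUC-ROC is 0.50, which is why plain accuracy says very little on this dataset and a ranking metric is more informative.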

# After more than 60 submissions these are the best parameters for the XGBOOST model.
## Hyperparameter Tuning with XGBoost and Insights Gained

### **1. Hyperparameters in XGBoost Model**

In this section, we define and fine-tune several **hyperparameters** of the XGBoost model using **RandomizedSearchCV**. Here’s an explanation of what each parameter means and how adjusting them can impact model performance:

- **n_estimators**:
    - This defines the number of trees (or boosting rounds) the model will use. More trees often lead to better performance but can also cause overfitting if not managed properly. Initially, when I only increased the number of trees (iterations), I observed that the model started to overfit. This taught me that simply increasing iterations isn't enough — careful tuning of other parameters is also crucial.

- **learning_rate**:
    - The learning rate determines the size of the steps the model takes during training. A smaller learning rate leads to more conservative updates and requires more trees (`n_estimators`) to converge. I tested values like `0.0025` to ensure that the model gradually learns without skipping important details, especially when combined with a higher number of trees.

- **max_depth**:
    - This parameter controls the maximum depth of the decision trees. A deeper tree can capture more complex relationships in the data, but if set too high, it risks overfitting by capturing noise in the training data. I tested values from `5` to `8` and found that setting a reasonable depth (like `7`) led to better generalization.

- **min_child_weight**:
    - This defines the minimum sum of instance weights (hessian) in a child node. This parameter is used to control overfitting. Higher values prevent the model from learning overly specific patterns that might not generalize well. I found that increasing it helped reduce overfitting when tuning my model.

- **gamma**:
    - Gamma is the minimum loss reduction required to make a further partition on a leaf node. Essentially, it prevents splits that would result in a minimal improvement. When set too low, the model may create overly complex trees. I experimented with values between `0.3` and `0.6` to ensure the model wasn't too sensitive to small changes.

- **subsample**:
    - This parameter controls the fraction of training data used to grow each tree. It helps prevent overfitting by introducing randomness into the model. I set it to `0.85`, which means 85% of the training data was used for each tree, leaving room for variation without sacrificing accuracy.

- **colsample_bytree**:
    - Similar to `subsample`, this controls the fraction of features used to grow each tree. It is another way to reduce overfitting. I set it to `0.85`, ensuring each tree sees a different subset of features to avoid overfitting on specific attributes.

- **reg_alpha** (L1 regularization):
    - In XGBoost, L1 regularization penalizes the absolute values of the leaf weights and can push small leaf weights to exactly zero, so weak splits stop contributing to the prediction. This is particularly useful when there are many features and some contribute little to the model. This was the key insight I gained: after applying L1 regularization (the same penalty used in Lasso), my model's score improved drastically. The extra regularization simplified the model and reduced the influence of uninformative features, leading to better generalization.

- **reg_lambda** (L2 regularization):
    - L2 regularization penalizes large leaf weights but does not shrink them to exactly zero. It helps prevent overfitting by discouraging overly complex models. I experimented with values like `2.0` and found that the combination of L1 and L2 regularization helped balance model complexity and accuracy.

- **objective**:
    - This specifies the learning task and objective function. For binary classification, the correct value is `'binary:logistic'`, which outputs probabilities for the positive class.

- **eval_metric**:
    - This determines how the model's performance is evaluated during training. The `logloss` metric is suitable for binary classification tasks, as it penalizes incorrect classifications based on their confidence (probability).

- **tree_method**:
    - This defines the algorithm used for tree construction. The `hist` method builds trees from binned (histogram) feature values, which makes it considerably faster than the `exact` method on large datasets.
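
To make the effect of `reg_alpha` and `reg_lambda` concrete, the sketch below computes the weight of a single leaf from the summed gradients and hessians of the rows that fall into it, following the regularized objective described in the XGBoost documentation. It is a simplified illustration under that assumption: `leaf_weight` is a name introduced here, not a library function, and the numbers are made up.

```python
def leaf_weight(grad_sum: float, hess_sum: float,
                reg_alpha: float, reg_lambda: float) -> float:
    """Weight of one leaf given the summed gradients and hessians of its rows."""
    # L1 part: soft-threshold the gradient sum; small sums collapse to exactly zero.
    if grad_sum > reg_alpha:
        shrunk = grad_sum - reg_alpha
    elif grad_sum < -reg_alpha:
        shrunk = grad_sum + reg_alpha
    else:
        shrunk = 0.0
    # L2 part: a larger reg_lambda inflates the denominator and shrinks the weight.
    return -shrunk / (hess_sum + reg_lambda)


# Same leaf statistics, without and with the regularization values used in this notebook.
unregularized = leaf_weight(-3.0, 4.0, reg_alpha=0.0, reg_lambda=0.0)
regularized = leaf_weight(-3.0, 4.0, reg_alpha=0.6, reg_lambda=2.0)
# A leaf with weak evidence is zeroed out by the L1 threshold.
weak_leaf = leaf_weight(0.5, 4.0, reg_alpha=0.6, reg_lambda=2.0)
print(unregularized, regularized, weak_leaf == 0.0)
```

The regularization acts on leaf weights rather than on per-feature coefficients: `reg_alpha` removes contributions from leaves with weak evidence, and `reg_lambda` scales down every leaf.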

### **2. Key Insights and Learnings**

- **Overfitting and Iterations**:
    - Initially, I only focused on increasing the number of trees (`n_estimators`), thinking it would improve accuracy. However, this led to overfitting. This experience taught me that adding more iterations or trees is not always beneficial — regularization and other parameters must also be tuned to prevent the model from becoming too complex.

- **Importance of Regularization (Lasso - L1)**:
    - The most significant insight I gained was the impact of **L1 (Lasso-style) regularization** through **reg_alpha**. After using it, my model's performance improved dramatically. This aligned with what I learned in class: **too many features are not good**. Rather than blindly relying on all available features, it pays to regularize the model and to engineer a few informative ones. Together with the engineered **sum, mean, and standard deviation** features, this improved my score from **0.93 to 0.95**.

- **Feature Engineering**:
    - By adding engineered features like `feature_sum`, `feature_mean`, and `feature_std`, I was able to provide the model with more useful information. This showed me that **feature engineering is crucial** for model performance. It's not just about having more features; it's about having the **right features** that contribute meaningfully to predictions. This approach helped the model generalize better and significantly improved performance.

### **3. Conclusion**

Through this exercise, I learned the importance of hyperparameter tuning, particularly in relation to regularization, and the impact of effective feature engineering. The experience validated key lessons from class, such as the importance of **feature selection**, **regularization** to prevent overfitting, and the fact that **more features are not always better** — the right features are the key to improving model performance.



In [6]:
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier

# Define the base XGBoost model
xgb_model = XGBClassifier(
    n_estimators=3000,
    learning_rate=0.005,
    max_depth=7,
    min_child_weight=8,
    gamma=0.5,
    subsample=0.85,
    colsample_bytree=0.85,
    reg_alpha=0.6,
    reg_lambda=2.0,
    objective='binary:logistic',
    eval_metric='logloss',
    tree_method='hist',
    random_state=42,
    verbosity=1
)

# Hyperparameter tuning with RandomizedSearchCV
param_dist = {
    'learning_rate': [0.001, 0.005, 0.01, 0.05],
    'max_depth': [5, 6, 7, 8],
    'min_child_weight': [5, 6, 7, 8],
    'gamma': [0.3, 0.4, 0.5, 0.6],
    'subsample': [0.7, 0.75, 0.8, 0.85],
    'colsample_bytree': [0.7, 0.75, 0.8, 0.85],
    'reg_alpha': [0.4, 0.5, 0.6],
    'reg_lambda': [1.5, 2.0, 2.5],
    'n_estimators': [2000, 3000],
}


# Set up RandomizedSearchCV
random_search = RandomizedSearchCV(xgb_model, param_distributions=param_dist, n_iter=20,
                                   scoring='roc_auc', cv=3, verbose=3, random_state=42, n_jobs=-1)

random_search.fit(X_processed, y)

# Display best hyperparameters
best_model = random_search.best_estimator_
print(f"\nBest Parameters from RandomizedSearchCV: {random_search.best_params_}")


Fitting 3 folds for each of 20 candidates, totalling 60 fits


14 fits failed out of a total of 60.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
1 fits failed with the following error:
Traceback (most recent call last):
  File "C:\ProgramData\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\Tayyab\AppData\Roaming\Python\Python310\site-packages\xgboost\core.py", line 729, in inner_f
    return func(**kwargs)
  File "C:\Users\Tayyab\AppData\Roaming\Python\Python310\site-packages\xgboost\sklearn.py", line 1663, in fit
    train_dmatrix, evals = _wrap_evaluation_matrices(
  File "C:\Users\Tayyab\AppData\Roaming\Python\Python310\site-packages\xgboost\sklearn.py", line 628, in _wrap_evalua


Best Parameters from RandomizedSearchCV: {'subsample': 0.85, 'reg_lambda': 1.5, 'reg_alpha': 0.6, 'n_estimators': 2000, 'min_child_weight': 8, 'max_depth': 7, 'learning_rate': 0.005, 'gamma': 0.6, 'colsample_bytree': 0.7}


# Best Parameters

Best Parameters from RandomizedSearchCV: {'subsample': 0.85, 'reg_lambda': 1.5, 'reg_alpha': 0.6, 'n_estimators': 2000, 'min_child_weight': 8, 'max_depth': 7, 'learning_rate': 0.005, 'gamma': 0.6, 'colsample_bytree': 0.7}
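
For a sense of scale, the sketch below counts how many combinations an exhaustive grid search over the same `param_dist` would have to evaluate and compares that with the 20 candidates drawn by random search. It uses scikit-learn's `ParameterGrid` and `ParameterSampler` only to enumerate and sample the settings; no model is trained.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Same search space as in the tuning cell above.
param_dist = {
    'learning_rate': [0.001, 0.005, 0.01, 0.05],
    'max_depth': [5, 6, 7, 8],
    'min_child_weight': [5, 6, 7, 8],
    'gamma': [0.3, 0.4, 0.5, 0.6],
    'subsample': [0.7, 0.75, 0.8, 0.85],
    'colsample_bytree': [0.7, 0.75, 0.8, 0.85],
    'reg_alpha': [0.4, 0.5, 0.6],
    'reg_lambda': [1.5, 2.0, 2.5],
    'n_estimators': [2000, 3000],
}

# Exhaustive grid: every combination of every value.
n_grid = len(ParameterGrid(param_dist))

# Random search: a fixed number of combinations drawn from the same space.
sampled = list(ParameterSampler(param_dist, n_iter=20, random_state=42))

print(n_grid, n_grid * 3, len(sampled))
```

The full grid has 73,728 combinations, or 221,184 fits with 3-fold cross-validation, compared with the 60 fits of the random search. With 2,000 to 3,000 trees per fit, this is the computational limitation that ruled out grid search.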

# Model Training with Cross-Validation

## Thought Process and Key Insights

In this section, I implemented **Stratified K-Fold Cross-Validation** and evaluated the model using **AUC-ROC**. Here are the key points I learned and the reasoning behind these choices:

### **1. Stratified K-Fold Cross-Validation**:
- **Stratified K-Fold** is used to ensure that each fold has the same distribution of the target classes (i.e., the same proportion of positive and negative samples) as the overall dataset. This is particularly important in cases of **class imbalance** where random splitting might lead to folds with an unequal class distribution.
- By using **n_splits=5**, I split the data into 5 parts (folds), ensuring that each fold is used for validation exactly once. This allows the model to be trained and validated on multiple different subsets of the data, providing a better estimate of its generalization performance.
- **Shuffling** the data before splitting ensures randomness and avoids potential bias from the ordering of the dataset, which matters when the rows are sorted in some way (e.g., by ID or by class).
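
The effect of stratification can be checked directly. The sketch below builds labels with the same class counts as the training set and counts the positives that land in each validation fold. The placeholder feature matrix `X_dummy` is an assumption made for illustration, since the split depends only on the labels.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Labels with the same class counts as the training set: 199,475 zeros and 525 ones.
y_demo = np.concatenate([np.zeros(199_475, dtype=int), np.ones(525, dtype=int)])
X_dummy = np.zeros((len(y_demo), 1))  # placeholder features, not used for splitting

skf_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Count the positive cases in each validation fold.
positives_per_fold = [int(y_demo[val_idx].sum())
                      for _, val_idx in skf_demo.split(X_dummy, y_demo)]
print(positives_per_fold)
```

Each validation fold receives 105 of the 525 positives, so every fold sees the same 0.2625% positive rate as the full dataset. A plain random split gives no such guarantee when positives are this rare.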

### **2. Out-of-Fold Predictions (OOF)**:
- The **OOF (Out-of-Fold) predictions** are made by training the model on all but one fold and predicting on the held-out fold. This approach allows for an evaluation of how the model performs on unseen data in each fold, providing an estimate of how the model generalizes to new data.
- I used **OOF predictions** to evaluate the model's performance during training. This was the first time I encountered the concept of **OOF predictions**. I learned that this technique helps in obtaining more reliable performance metrics since it ensures that the validation data has not been seen by the model during training.
- OOF predictions also help in detecting **overfitting**: a large gap between training performance and OOF performance shows that the model has memorized the training data rather than learned patterns that generalize.

### **3. Averaging Predictions Across Folds**:
- After each fold, I made predictions on the **test set**, and the predictions were averaged across all folds. This helps in reducing variance and ensures that the final prediction is more stable.
- Averaging the predictions from multiple models (from different folds) helps in improving accuracy and reducing the likelihood of overfitting to any particular fold.

### **4. AUC-ROC for Model Evaluation**:
- I chose **AUC-ROC** as the evaluation metric because it is a reliable measure for binary classification, especially when dealing with imbalanced datasets. The **AUC (Area Under the Curve)** tells us how well the model distinguishes between the two classes.
- **AUC-ROC** provides a clear view of the model's performance across all possible classification thresholds, making it a valuable tool for evaluating models on imbalanced datasets.

### **Key Insights I Gained**:
- **Learning About OOF**: This project was my first experience working with **OOF predictions**. I now understand that OOF predictions let me evaluate the model on data it hasn't seen during training, giving a better estimate of how it will perform on unseen data. This also makes overfitting easier to detect.
- **Stratified K-Fold**: Using Stratified K-Fold cross-validation helped me ensure that the class distribution remained consistent across all folds, which is critical in the case of imbalanced datasets.
- **Model Generalization**: By averaging predictions across all folds, I learned that combining predictions from multiple models helps in improving the robustness of the final prediction, reducing overfitting to a specific fold.
- **AUC-ROC**: The use of AUC-ROC as the evaluation metric gave me a comprehensive view of the model’s ability to distinguish between classes, which is especially important when the classes are imbalanced.

### **Conclusion**:
- This approach helped me get more confidence in the model's performance by using **Stratified K-Fold** cross-validation and **AUC-ROC** as the evaluation metric.
- I also learned that **OOF predictions** are crucial for obtaining more reliable model performance estimates, and they provide insight into how the model might generalize to unseen data.


In [7]:
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

# Initialize Stratified K-Fold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Arrays to collect OOF and test predictions
oof_preds = np.zeros(X_processed.shape[0])
test_preds = np.zeros(test_processed.shape[0])

# Perform training and prediction
for fold, (train_idx, val_idx) in enumerate(skf.split(X_processed, y)):
    print(f"\nFold {fold+1}")
    X_train_fold, X_val_fold = X_processed.iloc[train_idx], X_processed.iloc[val_idx]
    y_train_fold, y_val_fold = y.iloc[train_idx], y.iloc[val_idx]

    # Train the model
    best_model.fit(X_train_fold, y_train_fold, eval_set=[(X_val_fold, y_val_fold)], verbose=False)

    # Predict OOF and test set
    oof_preds[val_idx] = best_model.predict_proba(X_val_fold)[:, 1]
    test_preds += best_model.predict_proba(test_processed)[:, 1] / skf.n_splits

# Evaluate the model
print("\nAUC-ROC:", roc_auc_score(y, oof_preds))



Fold 1

Fold 2

Fold 3

Fold 4

Fold 5

AUC-ROC: 0.9515332509742837


## Model Evaluation

The model's performance is evaluated using AUC-ROC. A higher AUC indicates better separation of stressed and non-stressed customers. The out-of-fold AUC-ROC of 0.9515 means that, about 95% of the time, a randomly chosen stressed customer receives a higher predicted probability than a randomly chosen non-stressed one.

AUC-ROC summarizes the model's performance across all classification thresholds by combining the true positive rate and the false positive rate, so it does not depend on picking a single cutoff. This makes it more informative than accuracy when the classes are imbalanced.

The high AUC-ROC score of 0.9515 shows that the model ranks stressed customers well above non-stressed ones, making it an effective tool for identifying at-risk clients in the financial setting.
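
AUC-ROC can be read as the probability that a randomly chosen positive case is scored above a randomly chosen negative case. The sketch below checks this on a tiny made-up example by comparing every positive/negative pair and then confirming the result against `roc_auc_score`. The labels and scores are illustrative and unrelated to the competition data.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Tiny illustrative example: 2 positives, 4 negatives.
y_toy = np.array([0, 0, 0, 1, 0, 1])
scores_toy = np.array([0.10, 0.40, 0.35, 0.80, 0.70, 0.30])

pos_scores = scores_toy[y_toy == 1]
neg_scores = scores_toy[y_toy == 0]

# Compare every positive with every negative; ties count as half a win.
diff = pos_scores[:, None] - neg_scores[None, :]
pairwise_auc = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

library_auc = roc_auc_score(y_toy, scores_toy)
print(pairwise_auc, library_auc)
```

Here 5 of the 8 positive/negative pairs are ordered correctly, so both calculations give 0.625. The same reading applies to the 0.9515 reported above.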


## Final Submission

We will prepare the final submission file containing the test IDs and predicted probabilities of the target variable.


In [10]:
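# Predict the probabilities
# Note: best_model was refit inside the cross-validation loop, so at this point it is
# the model trained on the last fold's training split. The fold-averaged predictions
# are available in `test_preds` from the previous cell.
test_preds_probs = best_model.predict_proba(test_processed)[:, 1]

# Choose ONE of the following options (Option 1 is active):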

# Option 1: Save probabilities
test_preds_final = test_preds_probs

# Option 2: Save binary labels
#test_preds_final = (test_preds_probs > 0.5).astype(int)

# Prepare the final submission
submission = pd.DataFrame({
    'ID': test_ids,
    'target': test_preds_final
})

# Save the submission file
submission.to_csv('submission_binary.csv', index=False)
