# Credit Risk Prediction Project: Data Preprocessing and XGBoost Modeling

This notebook demonstrates a complete machine learning pipeline for the "Give Me Some Credit" Kaggle competition. It covers data loading, extensive preprocessing (handling outliers and missing values, feature transformation), XGBoost model training with hyperparameter tuning using cross-validation, and generating predictions for the test dataset.

## 1. Setup and Data Loading

### Import Necessary Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import xgboost as xgb
from sklearn.model_selection import StratifiedKFold, GridSearchCV, RandomizedSearchCV, train_test_split
from sklearn.metrics import roc_auc_score, classification_report, confusion_matrix
import joblib # For saving/loading models

### Load the Training Dataset (`cs-training.csv`)

In [2]:
# Define the path to your data. Adjust this to your local directory.
data_path = "/Users/pengchengyang/ML_python/Upstart_PJ/data/"
training_data_file = "cs-training.csv"
df_train = pd.read_csv(os.path.join(data_path, training_data_file), index_col=0)

print(f"Training data loaded successfully with shape: {df_train.shape}")
print("First 5 rows of the training dataset:")
display(df_train.head())

print("\nTraining Dataset Information:")
df_train.info()

print("\nTraining Dataset Descriptive Statistics:")
display(df_train.describe())

Training data loaded successfully with shape: (150000, 11)
First 5 rows of the training dataset:


Unnamed: 0,SeriousDlqin2yrs,RevolvingUtilizationOfUnsecuredLines,age,NumberOfTime30-59DaysPastDueNotWorse,DebtRatio,MonthlyIncome,NumberOfOpenCreditLinesAndLoans,NumberOfTimes90DaysLate,NumberRealEstateLoansOrLines,NumberOfTime60-89DaysPastDueNotWorse,NumberOfDependents
1,1,0.766127,45,2,0.802982,9120.0,13,0,6,0,2.0
2,0,0.957151,40,0,0.121876,2600.0,4,0,0,0,1.0
3,0,0.65818,38,1,0.085113,3042.0,2,1,0,0,0.0
4,0,0.23381,30,0,0.03605,3300.0,5,0,0,0,0.0
5,0,0.907239,49,1,0.024926,63588.0,7,0,1,0,0.0



Training Dataset Information:
<class 'pandas.core.frame.DataFrame'>
Index: 150000 entries, 1 to 150000
Data columns (total 11 columns):
 #   Column                                Non-Null Count   Dtype  
---  ------                                --------------   -----  
 0   SeriousDlqin2yrs                      150000 non-null  int64  
 1   RevolvingUtilizationOfUnsecuredLines  150000 non-null  float64
 2   age                                   150000 non-null  int64  
 3   NumberOfTime30-59DaysPastDueNotWorse  150000 non-null  int64  
 4   DebtRatio                             150000 non-null  float64
 5   MonthlyIncome                         120269 non-null  float64
 6   NumberOfOpenCreditLinesAndLoans       150000 non-null  int64  
 7   NumberOfTimes90DaysLate               150000 non-null  int64  
 8   NumberRealEstateLoansOrLines          150000 non-null  int64  
 9   NumberOfTime60-89DaysPastDueNotWorse  150000 non-null  int64  
 10  NumberOfDependents                    146076 non-null  float64

Unnamed: 0,SeriousDlqin2yrs,RevolvingUtilizationOfUnsecuredLines,age,NumberOfTime30-59DaysPastDueNotWorse,DebtRatio,MonthlyIncome,NumberOfOpenCreditLinesAndLoans,NumberOfTimes90DaysLate,NumberRealEstateLoansOrLines,NumberOfTime60-89DaysPastDueNotWorse,NumberOfDependents
count,150000.0,150000.0,150000.0,150000.0,150000.0,120269.0,150000.0,150000.0,150000.0,150000.0,146076.0
mean,0.06684,6.048438,52.295207,0.421033,353.005076,6670.221,8.45276,0.265973,1.01824,0.240387,0.757222
std,0.249746,249.755371,14.771866,4.192781,2037.818523,14384.67,5.145951,4.169304,1.129771,4.155179,1.115086
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.029867,41.0,0.0,0.175074,3400.0,5.0,0.0,0.0,0.0,0.0
50%,0.0,0.154181,52.0,0.0,0.366508,5400.0,8.0,0.0,1.0,0.0,0.0
75%,0.0,0.559046,63.0,0.0,0.868254,8249.0,11.0,0.0,2.0,0.0,1.0
max,1.0,50708.0,109.0,98.0,329664.0,3008750.0,58.0,98.0,54.0,98.0,20.0


## 2. Data Preprocessing for Training Data

This section replicates the preprocessing steps identified in `credit_risk_preprocessing.ipynb` and applies them to the training dataset. These steps include handling outliers, imputing missing values, and transforming skewed features. Importantly, any statistics (like medians or quantiles) for imputation and transformation are derived *only* from the training data to prevent data leakage.
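
One way to make the "training statistics only" rule explicit is to split the work into a fit step, which collects statistics from the training frame, and an apply step, which reuses them on any frame. The sketch below is a minimal illustration under that assumption, not the notebook's actual code: the helper names `fit_preprocessing_stats` and `apply_preprocessing` are hypothetical, and the toy values are invented.

```python
import numpy as np
import pandas as pd

DELINQUENCY_COLS = [
    "NumberOfTime30-59DaysPastDueNotWorse",
    "NumberOfTime60-89DaysPastDueNotWorse",
    "NumberOfTimes90DaysLate",
]


def fit_preprocessing_stats(train: pd.DataFrame) -> dict:
    """Collect every statistic the preprocessing needs, from training data only."""
    return {
        "median_age": train["age"].median(),
        "delinquency_medians": {c: train[c].median() for c in DELINQUENCY_COLS},
        "income_median": train["MonthlyIncome"].median(),
        "debt_ratio_cap": train["DebtRatio"].quantile(0.95),
    }


def apply_preprocessing(df: pd.DataFrame, stats: dict) -> pd.DataFrame:
    """Apply the cleaning steps to any frame using previously fitted statistics."""
    out = df.copy()
    out["age"] = out["age"].replace(0, stats["median_age"])
    for col, med in stats["delinquency_medians"].items():
        out[col] = out[col].replace({96: med, 98: med})
    out["MonthlyIncome"] = out["MonthlyIncome"].fillna(stats["income_median"])
    out["NumberOfDependents"] = out["NumberOfDependents"].fillna(0)
    out["DebtRatio"] = np.log1p(out["DebtRatio"].clip(0, stats["debt_ratio_cap"]))
    out["RevolvingUtilizationOfUnsecuredLines"] = np.log1p(
        out["RevolvingUtilizationOfUnsecuredLines"]
    )
    return out


# Invented toy frames, only to exercise the two helpers.
toy_train = pd.DataFrame({
    "age": [0, 30, 40, 50, 60],
    "NumberOfTime30-59DaysPastDueNotWorse": [0, 98, 0, 1, 0],
    "NumberOfTime60-89DaysPastDueNotWorse": [0, 0, 96, 0, 0],
    "NumberOfTimes90DaysLate": [0, 0, 0, 0, 98],
    "MonthlyIncome": [3000.0, np.nan, 5000.0, 7000.0, 9000.0],
    "NumberOfDependents": [1.0, np.nan, 0.0, 2.0, 0.0],
    "DebtRatio": [0.1, 0.5, 1.0, 2.0, 5000.0],
    "RevolvingUtilizationOfUnsecuredLines": [0.0, 0.5, 1.0, 2.0, 0.2],
})
toy_test = pd.DataFrame({
    "age": [0, 25],
    "NumberOfTime30-59DaysPastDueNotWorse": [96, 2],
    "NumberOfTime60-89DaysPastDueNotWorse": [0, 0],
    "NumberOfTimes90DaysLate": [0, 98],
    "MonthlyIncome": [np.nan, 4000.0],
    "NumberOfDependents": [np.nan, 3.0],
    "DebtRatio": [10000.0, 0.0],
    "RevolvingUtilizationOfUnsecuredLines": [1.0, 0.0],
})

stats = fit_preprocessing_stats(toy_train)      # fitted on training rows only
test_out = apply_preprocessing(toy_test, stats)  # test rows never influence stats
print(test_out)
```

Keeping the statistics in one dictionary also means they could be saved alongside the model, so that new data is cleaned the same way at prediction time.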

### 2.1. Handling Outliers

#### 'age' Anomaly

In [3]:
# Calculate median age from training data
median_age = df_train['age'].median()
df_train['age'] = df_train['age'].replace(0, median_age)
print(f"Replaced age=0 with median age (from training data): {median_age}")

Replaced age=0 with median age (from training data): 52.0


#### Delinquency Outliers (96 and 98)

In [4]:
delinquency_cols = ['NumberOfTime30-59DaysPastDueNotWorse', 'NumberOfTime60-89DaysPastDueNotWorse', 'NumberOfTimes90DaysLate']

for col in delinquency_cols:
    median_val = df_train[col].median() # Calculate median from training data
    df_train[col] = df_train[col].replace({96: median_val, 98: median_val})
    print(f"Replaced 96/98 in '{col}' with median (from training data): {median_val}")

Replaced 96/98 in 'NumberOfTime30-59DaysPastDueNotWorse' with median (from training data): 0.0
Replaced 96/98 in 'NumberOfTime60-89DaysPastDueNotWorse' with median (from training data): 0.0
Replaced 96/98 in 'NumberOfTimes90DaysLate' with median (from training data): 0.0


### 2.2. Missing Value Imputation

#### 'MonthlyIncome'

In [5]:
# Calculate median income from training data
monthly_income_median = df_train['MonthlyIncome'].median()
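# Assign the result back: chained inplace fillna stops working under pandas Copy-on-Write
df_train['MonthlyIncome'] = df_train['MonthlyIncome'].fillna(monthly_income_median)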
print(f"Filled missing 'MonthlyIncome' values with median (from training data): ${monthly_income_median:,.2f}")

Filled missing 'MonthlyIncome' values with median (from training data): $5,400.00


#### 'NumberOfDependents'

In [6]:
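# Assign the result back: chained inplace fillna stops working under pandas Copy-on-Write
df_train['NumberOfDependents'] = df_train['NumberOfDependents'].fillna(0)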
print("Filled missing 'NumberOfDependents' values with 0.")

Filled missing 'NumberOfDependents' values with 0.


### 2.3. Feature Transformation (Log Transformation)

In [7]:
# Calculate 95th percentile for DebtRatio from training data
debt_ratio_95th_percentile = df_train['DebtRatio'].quantile(0.95)
df_train['DebtRatio'] = np.log1p(df_train['DebtRatio'].clip(0, debt_ratio_95th_percentile))
df_train['RevolvingUtilizationOfUnsecuredLines'] = np.log1p(df_train['RevolvingUtilizationOfUnsecuredLines'])

print("Applied log transformation to 'DebtRatio' and 'RevolvingUtilizationOfUnsecuredLines' (using training data stats).")

Applied log transformation to 'DebtRatio' and 'RevolvingUtilizationOfUnsecuredLines' (using training data stats).
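
To see what the clip-then-`log1p` step does to individual values, the sketch below applies it to two `DebtRatio` values from the training summary using only the standard library. The cap of 2500.0 is an arbitrary stand-in, since the notebook does not print the real 95th-percentile value, and the helper name `clip_log1p` is hypothetical.

```python
import math


def clip_log1p(value, upper):
    """Clip value into [0, upper], then return log(1 + value)."""
    clipped = min(max(value, 0.0), upper)
    return math.log1p(clipped)


ASSUMED_CAP = 2500.0  # placeholder; not the notebook's real 95th percentile

typical = clip_log1p(0.802982, ASSUMED_CAP)    # DebtRatio of the first training row
extreme = clip_log1p(329664.0, ASSUMED_CAP)    # maximum DebtRatio in the summary
at_cap = clip_log1p(ASSUMED_CAP, ASSUMED_CAP)

print(typical)
print(extreme == at_cap)  # values above the cap collapse onto the cap
```

Every value above the cap maps to the same transformed number, which is how the step limits the influence of extreme ratios.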


## 3. XGBoost Model Training and Hyperparameter Tuning

This section defines the XGBoost model, sets up a cross-validation strategy, and tunes hyperparameters with `GridSearchCV`. The target is highly imbalanced (about 6.7% of rows belong to the positive class), so `StratifiedKFold` keeps the class ratio stable across folds and `scale_pos_weight` upweights the positive class during training.
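
`scale_pos_weight` is commonly set to the ratio of negative to positive examples. The sketch below works through that ratio with class counts implied by the dataset summary in Section 1 (150,000 rows, mean of `SeriousDlqin2yrs` equal to 0.06684). The counts are derived from those two figures, not read from the data.

```python
n_total = 150_000
positive_rate = 0.06684  # mean of SeriousDlqin2yrs from the describe() output

# Derived counts (an assumption based on the summary statistics)
n_positive = round(n_total * positive_rate)
n_negative = n_total - n_positive

scale_pos_weight = n_negative / n_positive
print(n_positive, n_negative, round(scale_pos_weight, 2))  # 10026 139974 13.96
```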

In [8]:
# Separate features (X) and target (y) for training
X_train = df_train.drop('SeriousDlqin2yrs', axis=1)
y_train = df_train['SeriousDlqin2yrs']

print(f"X_train shape: {X_train.shape}, y_train shape: {y_train.shape}")
print(f"Training set target distribution:\n{y_train.value_counts(normalize=True) * 100}")

# 3.1. Cross-Validation Strategy
# StratifiedKFold is essential due to class imbalance.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# 3.2. Define the XGBoost model and initial parameter grid
xgb_model = xgb.XGBClassifier(objective='binary:logistic', eval_metric='auc', random_state=42)

param_grid = {
    'n_estimators': [100, 200], 
    'learning_rate': [0.05, 0.1], 
    'max_depth': [3, 5], 
    'subsample': [0.7, 0.9], 
    'colsample_bytree': [0.7, 0.9], 
    'gamma': [0, 0.1], 
}

# Handle class imbalance using scale_pos_weight (calculated from training data)
scale_pos_weight_value = (y_train == 0).sum() / (y_train == 1).sum()
print(f"Calculated scale_pos_weight: {scale_pos_weight_value:.2f}")
xgb_model.set_params(scale_pos_weight=scale_pos_weight_value)

# 3.3. Hyperparameter Tuning using GridSearchCV
print("\nStarting GridSearchCV for XGBoost...")
grid_search = GridSearchCV(estimator=xgb_model,
                           param_grid=param_grid,
                           scoring='roc_auc',
                           cv=cv,
                           verbose=2,
                           n_jobs=-1)

grid_search.fit(X_train, y_train)

print(f"\nBest parameters found by GridSearchCV: {grid_search.best_params_}")
print(f"Best ROC AUC score found by GridSearchCV: {grid_search.best_score_:.4f}")

X_train shape: (150000, 10), y_train shape: (150000,)
Training set target distribution:
SeriousDlqin2yrs
0    93.316
1     6.684
Name: proportion, dtype: float64
Calculated scale_pos_weight: 13.96

Starting GridSearchCV for XGBoost...
Fitting 5 folds for each of 64 candidates, totalling 320 fits

Best parameters found by GridSearchCV: {'colsample_bytree': 0.7, 'gamma': 0, 'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 200, 'subsample': 0.9}
Best ROC AUC score found by GridSearchCV: 0.8652
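
The "64 candidates, 320 fits" figure in the log follows directly from the grid: six parameters with two values each, evaluated on five folds. A minimal sketch that enumerates the same combinations with plain `itertools` (an illustration of the arithmetic, not scikit-learn's own enumeration code):

```python
from itertools import product

param_grid = {
    "n_estimators": [100, 200],
    "learning_rate": [0.05, 0.1],
    "max_depth": [3, 5],
    "subsample": [0.7, 0.9],
    "colsample_bytree": [0.7, 0.9],
    "gamma": [0, 0.1],
}
n_splits = 5  # folds used by StratifiedKFold above

keys = sorted(param_grid)
candidates = [
    dict(zip(keys, combo))
    for combo in product(*(param_grid[k] for k in keys))
]
total_fits = len(candidates) * n_splits
print(len(candidates), total_fits)  # 64 320
```

Because the cost grows multiplicatively, adding a third value to every parameter would raise the count to 729 candidates, which is where a randomized search usually becomes the more practical choice.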


## 4. Final Model Training and Saving

In [9]:
# Train the final XGBoost model using the best parameters found by GridSearchCV
best_xgb_model = xgb.XGBClassifier(
    objective='binary:logistic',
    eval_metric='auc',
    random_state=42,
    scale_pos_weight=scale_pos_weight_value,
    **grid_search.best_params_ # Unpack the best parameters
)

print("\nTraining final XGBoost model on the full training dataset with best parameters...")
best_xgb_model.fit(X_train, y_train)
print("Final model training complete.")

# Save the trained model for future use (e.g., deployment or prediction on new data)
joblib.dump(best_xgb_model, 'best_credit_risk_xgb_model.pkl')
print("\nModel saved as 'best_credit_risk_xgb_model.pkl'")


Training final XGBoost model on the full training dataset with best parameters...
Final model training complete.

Model saved as 'best_credit_risk_xgb_model.pkl'


## 5. Preprocessing and Prediction for Test Data (`cs-test.csv`)

This section loads the separate `cs-test.csv` file, applies *the exact same preprocessing steps* as the training data using the *statistics derived from the training data*, and then uses the trained model to generate predictions.

In [11]:
# Load the raw test data
test_data_file = "cs-test.csv"
df_test_raw = pd.read_csv(os.path.join(data_path, test_data_file), index_col=0)

print(f"\nRaw test data loaded successfully with shape: {df_test_raw.shape}")
print("First 5 rows of raw test dataset:")
display(df_test_raw.head())

# Create a copy to preprocess, keeping the raw for reference if needed
df_test_processed = df_test_raw.copy()

# --- Apply SAME Preprocessing Steps to df_test_processed --- 
# IMPORTANT: Use statistics (medians, quantiles) calculated ONLY from the TRAINING DATA (df_train)

print("\nApplying preprocessing to test data...")

# 5.1. Handling 'age' Anomaly - use median_age calculated from df_train
df_test_processed['age'] = df_test_processed['age'].replace(0, median_age)

# 5.2. Handling Delinquency Outliers - use medians calculated from df_train for each column
for col in delinquency_cols:
    train_median = df_train[col].median()  # median of the (already cleaned) training column
    df_test_processed[col] = df_test_processed[col].replace({96: train_median, 98: train_median})

# 5.3. Impute 'MonthlyIncome' - use monthly_income_median from df_train
df_test_processed['MonthlyIncome'] = df_test_processed['MonthlyIncome'].fillna(monthly_income_median)

# 5.4. Impute 'NumberOfDependents' - use 0 (constant)
df_test_processed['NumberOfDependents'] = df_test_processed['NumberOfDependents'].fillna(0)

# 5.5. Feature Transformation (Log Transformation) - use debt_ratio_95th_percentile from df_train
df_test_processed['DebtRatio'] = np.log1p(df_test_processed['DebtRatio'].clip(0, debt_ratio_95th_percentile))
df_test_processed['RevolvingUtilizationOfUnsecuredLines'] = np.log1p(df_test_processed['RevolvingUtilizationOfUnsecuredLines'])

print("Test data preprocessing complete. Head of preprocessed test data:")
display(df_test_processed.head())
print(f"Missing values in preprocessed test data: {df_test_processed.isnull().sum().sum()}")



Raw test data loaded successfully with shape: (101503, 11)
First 5 rows of raw test dataset:


Unnamed: 0,SeriousDlqin2yrs,RevolvingUtilizationOfUnsecuredLines,age,NumberOfTime30-59DaysPastDueNotWorse,DebtRatio,MonthlyIncome,NumberOfOpenCreditLinesAndLoans,NumberOfTimes90DaysLate,NumberRealEstateLoansOrLines,NumberOfTime60-89DaysPastDueNotWorse,NumberOfDependents
1,,0.885519,43,0,0.177513,5700.0,4,0,0,0,0.0
2,,0.463295,57,0,0.527237,9141.0,15,0,4,0,2.0
3,,0.043275,59,0,0.687648,5083.0,12,0,1,0,2.0
4,,0.280308,38,1,0.925961,3200.0,7,0,2,0,0.0
5,,1.0,27,0,0.019917,3865.0,4,0,0,0,1.0



Applying preprocessing to test data...
Test data preprocessing complete. Head of preprocessed test data:


Unnamed: 0,SeriousDlqin2yrs,RevolvingUtilizationOfUnsecuredLines,age,NumberOfTime30-59DaysPastDueNotWorse,DebtRatio,MonthlyIncome,NumberOfOpenCreditLinesAndLoans,NumberOfTimes90DaysLate,NumberRealEstateLoansOrLines,NumberOfTime60-89DaysPastDueNotWorse,NumberOfDependents
1,,0.634203,43,0,0.163404,5700.0,4,0,0,0,0.0
2,,0.380691,57,0,0.42346,9141.0,15,0,4,0,2.0
3,,0.042365,59,0,0.523336,5083.0,12,0,1,0,2.0
4,,0.247101,38,1,0.655425,3200.0,7,0,2,0,0.0
5,,0.693147,27,0,0.019721,3865.0,4,0,0,0,1.0


Missing values in preprocessed test data: 101503
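
The count of 101503 equals the number of rows in the test set, which is consistent with every remaining missing value sitting in the all-NaN `SeriousDlqin2yrs` column and none in the features. A per-column check makes this easier to confirm. The sketch below shows the idea on a small invented frame, not the real test data.

```python
import numpy as np
import pandas as pd

# Invented toy frame: the target is entirely NaN, the features are complete.
toy_test = pd.DataFrame({
    "SeriousDlqin2yrs": [np.nan, np.nan, np.nan],
    "age": [43, 57, 59],
    "MonthlyIncome": [5700.0, 9141.0, 5083.0],
})

missing_per_column = toy_test.isnull().sum()
total_missing = int(missing_per_column.sum())
feature_missing = int(
    toy_test.drop(columns="SeriousDlqin2yrs").isnull().sum().sum()
)

print(missing_per_column)
print(f"Total missing: {total_missing}, missing in features: {feature_missing}")
```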


## 6. Generate Predictions and Create Submission File

In [14]:
# The 'SeriousDlqin2yrs' column in the actual Kaggle test data is all NaN, so we ensure it's not used as a feature.
# We are predicting probabilities for the positive class (SeriousDlqin2yrs = 1).
X_test_final = df_test_processed.drop('SeriousDlqin2yrs', axis=1, errors='ignore') # Use errors='ignore' in case it's already dropped or not present

y_test_pred_proba_final = best_xgb_model.predict_proba(X_test_final)[:, 1]

print("Sample of final predicted probabilities for the test set:")
print(y_test_pred_proba_final[:10])

# Create submission file in the format required by Kaggle
# The 'Id' column corresponds to the original index of the test DataFrame.
submission_df = pd.DataFrame({'Id': df_test_processed.index, 'Probability': y_test_pred_proba_final})

# Save the submission file to a CSV
submission_df.to_csv('submission.csv', index=False)

print("\nSubmission file 'submission.csv' created successfully.")
print("Head of submission file:")
display(submission_df.head())

Sample of final predicted probabilities for the test set:
[0.5122864  0.4350345  0.16844305 0.56528616 0.5939809  0.2927244
 0.29479548 0.28599513 0.02619811 0.64816475]

Submission file 'submission.csv' created successfully.
Head of submission file:


Unnamed: 0,Id,Probability
0,1,0.512286
1,2,0.435035
2,3,0.168443
3,4,0.565286
4,5,0.593981
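
Before uploading, it can help to check the file against the `Id`/`Probability` layout described in the code comments above. The following is a minimal sketch using only the standard library. The function name `find_submission_problems` is hypothetical, and the checks reflect assumptions about what a valid file looks like, not an official Kaggle validator.

```python
import csv
import io


def find_submission_problems(csv_text, expected_rows):
    """Return a list of human-readable problems; an empty list means none found."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames != ["Id", "Probability"]:
        return [f"unexpected header: {reader.fieldnames}"]

    problems = []
    ids = []
    for row in reader:
        ids.append(row["Id"])
        prob = float(row["Probability"])
        if not 0.0 <= prob <= 1.0:  # also flags NaN
            problems.append(f"Id {row['Id']}: probability {prob} outside [0, 1]")

    if len(ids) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(ids)}")
    if len(set(ids)) != len(ids):
        problems.append("duplicate Id values")
    return problems


# First five rows of the submission shown above
sample = (
    "Id,Probability\n"
    "1,0.512286\n"
    "2,0.435035\n"
    "3,0.168443\n"
    "4,0.565286\n"
    "5,0.593981\n"
)
sample_problems = find_submission_problems(sample, expected_rows=5)
print(sample_problems)  # []
```

For the full file, the text could be read from `submission.csv` and `expected_rows` set to the number of rows in the test set.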

