Credit Risk Modeling & Early Default Prediction

Dataset: Home Credit Default Risk (Home Credit Group)
Objective: Understand the structure, quality, and risk characteristics of the dataset to support explainable credit risk modeling.

In [None]:
import pandas as pd
import numpy as np

df = pd.read_csv("../data/application_train.csv")

df.head()

In [None]:
df.columns

In [None]:
df.shape

The dataset contains customer-level application data with a large number of financial and demographic variables, typical of real-world credit risk datasets.

In [None]:
df['TARGET'].value_counts()

In [None]:
df['TARGET'].value_counts(normalize=True)

Target = 1 (customer defaulted)
Target = 0 (customer did not default)
The target variable is highly imbalanced, reflecting real-world credit portfolios, and indicates that accuracy alone is not an appropriate evaluation metric.
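To illustrate why accuracy misleads on an imbalanced target, the sketch below uses synthetic labels with an assumed 8% default rate (an illustrative figure, not read from this dataset) and a trivial "model" that predicts non-default for everyone.

```python
import numpy as np

# Synthetic labels: 8% defaults (assumed rate, for illustration only)
rng = np.random.default_rng(0)
y_true = np.zeros(10_000, dtype=int)
y_true[:800] = 1
rng.shuffle(y_true)

# A "model" that always predicts the majority class (no default)
y_naive = np.zeros_like(y_true)

# Accuracy looks strong because most applicants do not default
accuracy = (y_naive == y_true).mean()
# Recall on defaults: share of actual defaulters that were flagged
recall_default = (y_naive[y_true == 1] == 1).mean()

print(f"accuracy={accuracy:.2f}, recall on defaults={recall_default:.2f}")
```

The naive model scores high on accuracy while catching no defaulters, which is why ranking metrics such as ROC-AUC and class-specific recall are more informative here.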

In [None]:
df.info()

In [None]:
missing_pct = df.isnull().mean().sort_values(ascending = False)
missing_pct.head(10)

Missing values in credit datasets often reflect data availability or customer behavior rather than random omission, and may themselves carry risk information.
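One way to check whether missingness carries risk information is to compare default rates between rows where a feature is missing and rows where it is present. The sketch below uses a small hypothetical table; the column `ext_score` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): 'ext_score' is missing more often for defaulters
toy = pd.DataFrame({
    "ext_score": [0.7, np.nan, 0.5, np.nan, 0.6, 0.8, np.nan, 0.4],
    "TARGET":    [0,   1,      0,   1,      0,   0,   0,      1],
})

# Flag missingness, then compare default rates between the two groups
toy["ext_score_missing"] = toy["ext_score"].isnull().astype(int)
rate_by_missing = toy.groupby("ext_score_missing")["TARGET"].mean()
print(rate_by_missing)
```

A clear gap between the two rates suggests keeping a missing-value flag as a feature rather than only imputing.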

In [None]:
df[['AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY']].describe()

In [None]:
import matplotlib.pyplot as plt

In [None]:
df['AMT_INCOME_TOTAL'].hist(bins=50)
plt.title("Applicant income distribution")
plt.show()

The applicant income distribution is highly right-skewed, with a small number of extreme outliers. This suggests that raw income values may not be directly suitable for modeling and motivates the use of ratio-based features, such as credit-to-income and annuity-to-income ratios, to better capture relative financial risk.
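As a rough illustration of the point above, the sketch below builds a toy table with one extreme income and compares the raw values with a log transform and a credit-to-income ratio. The figures are invented; only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Toy applicants: the last row is an extreme income outlier (made-up values)
toy = pd.DataFrame({
    "AMT_INCOME_TOTAL": [40_000, 90_000, 150_000, 9_000_000],
    "AMT_CREDIT":       [200_000, 270_000, 300_000, 900_000],
})

# log1p compresses the long right tail of income
toy["log_income"] = np.log1p(toy["AMT_INCOME_TOTAL"])
# The ratio expresses the loan relative to the applicant's means
toy["credit_income_ratio"] = toy["AMT_CREDIT"] / toy["AMT_INCOME_TOTAL"]

print(toy)
```

The ratio places the outlier on the same scale as everyone else: a very high earner borrowing a small multiple of income no longer dominates the feature.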

In [None]:
df[['AMT_CREDIT', 'AMT_ANNUITY']].describe()

Day 1 EDA Completed. 

Day 2: Preparing data for a regulated financial model.

In [None]:
target= 'TARGET'
id_col = 'SK_ID_CURR'

In [None]:
y=df[target]
x=df.drop(columns=[target, id_col])

I removed identifiers to prevent data leakage and ensure the model learns only from applicant characteristics.

In [None]:
missing = x.isnull().mean().sort_values(ascending=False)
missing.head(10)

Some variables are missing 50-70% of their values.

Decision rules
>60% missing: drop the feature (low information value)
5%-60% missing: median impute + missing flag (missingness may be informative)
<5% missing: median impute only (likely missing at random)

In [None]:
num_cols = x.select_dtypes(include= ['int64', 'float']).columns

In [None]:
for col in num_cols:
    missing_rate = x[col].isnull().mean()
    
    if missing_rate > 0.6:
        x.drop(columns=[col], inplace=True)
    elif missing_rate > 0.05:
        x[col + '_missing_flag'] = x[col].isnull().astype(int)
        # Use assignment instead of inplace=True
        x[col] = x[col].fillna(x[col].median())
    else:
        # Use assignment instead of inplace=True
        x[col] = x[col].fillna(x[col].median())

In [None]:
cat_cols = x.select_dtypes(include=['object']).columns
len(cat_cols)

In [None]:
x= pd.get_dummies(x, columns=cat_cols, drop_first =True)

One-hot encoding was applied to categorical variables to preserve interpretability and avoid imposing ordinal assumptions.
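A minimal sketch of what `pd.get_dummies(..., drop_first=True)` produces, using a hypothetical `housing` column that is not part of the dataset:

```python
import pandas as pd

# Hypothetical categorical column with three unordered levels
toy = pd.DataFrame({"housing": ["rent", "own", "parents", "own"]})

# drop_first=True removes the first level ("own" here, alphabetically),
# which becomes the reference category encoded as all zeros
encoded = pd.get_dummies(toy, columns=["housing"], drop_first=True)
print(encoded)
```

Each remaining column is a yes/no indicator, so a logistic regression coefficient on it reads as the effect of that level relative to the dropped reference level, with no implied ordering between levels.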

In [None]:
x['credit_income_ratio'] = df['AMT_CREDIT'] / df['AMT_INCOME_TOTAL']
x['annuity_income_ratio'] = df['AMT_ANNUITY'] / df['AMT_INCOME_TOTAL']
# Loan amount divided by the periodic annuity approximates the number of payments (loan term)
x['credit_term'] = df['AMT_CREDIT'] / df['AMT_ANNUITY']

In [None]:
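# DAYS_EMPLOYED is negative for employed applicants, so negate it to get tenure in years
x['employment_years'] = (-df['DAYS_EMPLOYED']) / 365
# Rows with positive DAYS_EMPLOYED (special codes) get negative years, fall outside the bins, and receive a NaN bucket
x['employment_bucket'] = pd.cut(
    x['employment_years'],
    bins=[0, 1, 3, 5, 10, 50],
    labels=['<1yr', '1-3yrs', '3-5yrs', '5-10yrs', '10+yrs']
)
x = pd.get_dummies(x, columns=['employment_bucket'], drop_first=True)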

In [None]:
x.isnull().sum().sum()

In [None]:
x.shape

In [None]:
x.isnull().sum().sort_values(ascending=False).head(10)

In [None]:
import numpy as np

In [None]:
#replacing infinite values with NaN
x.replace([np.inf, -np.inf], np.nan, inplace = True)

In [None]:
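# Remaining NaNs are imputed with the column median.
# Assign the result back instead of calling fillna(inplace=True) on a column selection,
# which may not modify the DataFrame in recent pandas versions.
for col in x.columns:
    if x[col].isnull().sum() > 0:
        x[col] = x[col].fillna(x[col].median())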

In [None]:
x.isnull().sum().sum()

Ratio-based features introduced a small number of missing values due to division by zero and special employment codes. These were handled by replacing infinite values with NaN and applying median imputation to preserve distributional integrity.
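The handling described above can be traced on a toy example. The values below are invented; one row has zero income so that the ratio becomes infinite.

```python
import numpy as np
import pandas as pd

# Toy data: the second applicant has zero recorded income (made-up values)
toy = pd.DataFrame({
    "AMT_CREDIT":       [100_000.0, 200_000.0, 300_000.0, 150_000.0],
    "AMT_INCOME_TOTAL": [50_000.0, 0.0, 100_000.0, 75_000.0],
})

# Division by zero yields inf in pandas rather than raising an error
ratio = toy["AMT_CREDIT"] / toy["AMT_INCOME_TOTAL"]

# Convert inf to NaN so it is treated as missing, then impute with the median
ratio = ratio.replace([np.inf, -np.inf], np.nan)
ratio = ratio.fillna(ratio.median())

print(ratio.tolist())
```

The median is computed after the infinite value has been converted to NaN, so the outlier does not influence the imputed value.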

What drives Risk

In [None]:
df['TARGET'].value_counts(normalize=True).plot(kind='bar')
plt.title("Default vs Non-default distribution")
plt.ylabel("Proportion")
plt.show()

Income Vs Default Risk (Boxplot)

In [None]:
import seaborn as sns

In [None]:
sns.boxplot(x= 'TARGET', y= 'AMT_INCOME_TOTAL', data = df)
plt.title("Income distribution by default status")
plt.yscale('log')
plt.show()

In [None]:
df['CREDIT_INCOME_RATION'] = df['AMT_CREDIT'] / df['AMT_INCOME_TOTAL']
sns.boxplot(x= 'TARGET', y= 'CREDIT_INCOME_RATION', data=df)
plt.title("Credit-to-Income Ratio by Default Status")
plt.ylim(0, 10)
plt.show()

I engineered financial ratios directly in the modeling dataset to ensure consistency between EDA, model training, and business interpretation.

In [None]:
sns.boxplot(x='TARGET', y='DAYS_EMPLOYED', data=df[df['DAYS_EMPLOYED'] < 0])
plt.title("Employment Length by Default Status")
plt.show()

Shorter employment tenure is associated with elevated default risk.

In [None]:
sns.lineplot(x='TARGET', y='DAYS_EMPLOYED', data=df[df['DAYS_EMPLOYED'] < 0], marker='o')
plt.title("Mean Employment Length by Default Status")
plt.xticks([0, 1]) # Ensures only 0 and 1 show on the axis
plt.show()

In [None]:
# Using stripplot (the standard way to do a scatter plot with categorical data)
sns.stripplot(x='TARGET', y='DAYS_EMPLOYED', data=df[df['DAYS_EMPLOYED'] < 0], alpha=0.3, jitter=True)
plt.title("Employment Length vs Default Status (Scatter/Strip Plot)")
plt.show()

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size= 0.2, random_state = 42, stratify = y)

The split is stratified on the target so that the default rate is preserved in both the training and test sets.
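A quick way to confirm the effect of `stratify` is to compare class rates before and after the split. The sketch below uses synthetic labels with an assumed 8% positive rate rather than the notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic target: 80 defaults out of 1,000 rows (assumed 8% rate)
y_toy = np.array([1] * 80 + [0] * 920)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)  # placeholder feature

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# With stratification the positive rate is (near) identical in every subset
print(y_toy.mean(), y_tr.mean(), y_te.mean())
```

Without `stratify`, a random 20% sample of a rare class can drift noticeably from the portfolio rate, which makes test metrics harder to compare with training metrics.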

Feature Scaling 

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
scaler = StandardScaler()

In [None]:
x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)

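Why scale? Features such as income and credit amount span very different ranges. Standardizing them helps the lbfgs solver converge, applies the default L2 regularization evenly across features, and makes coefficient magnitudes comparable. The scaler is fit on the training set only, so no test-set statistics leak into training.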
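For reference, standardization can be written out by hand. The sketch below uses made-up numbers and computes the mean and standard deviation from the training rows only, then reuses them for the test row, which is the same fit/transform pattern as the `StandardScaler` cell above.

```python
import numpy as np

# Toy feature matrix: [income, some ratio] (made-up values)
train = np.array([[50_000.0, 0.2], [150_000.0, 0.4], [100_000.0, 0.6]])
test = np.array([[200_000.0, 0.4]])

# Statistics come from the training data only
mu = train.mean(axis=0)
sigma = train.std(axis=0)  # population std (ddof=0), as StandardScaler uses

# z = (x - mean) / std, applied with the same mu and sigma to both sets
train_scaled = (train - mu) / sigma
test_scaled = (test - mu) / sigma

print(train_scaled)
print(test_scaled)
```

After scaling, both columns have mean 0 and unit variance on the training set, regardless of their original units.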

TRAINING THE MODEL

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
log_model = LogisticRegression(max_iter = 1000, class_weight ='balanced', solver = 'lbfgs')
log_model.fit(x_train_scaled, y_train)

Why class_weight='balanced'?

Defaults are rare, so weighting each class inversely to its frequency keeps the model from favoring the non-default majority.
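scikit-learn documents the `balanced` heuristic as `n_samples / (n_classes * np.bincount(y))`. The sketch below applies that formula to synthetic labels with an assumed 8% default rate to show the resulting weights.

```python
import numpy as np

# Synthetic target: 92 non-defaults and 8 defaults (assumed rate)
y_toy = np.array([0] * 92 + [1] * 8)

counts = np.bincount(y_toy)                    # samples per class
weights = len(y_toy) / (len(counts) * counts)  # "balanced" class weights

# Each class contributes the same total weight to the loss
print(dict(enumerate(weights.round(4))), weights * counts)
```

One consequence worth keeping in mind: because defaults are up-weighted, the predicted probabilities tend to sit above the true portfolio default rate. They remain useful for ranking applicants, but they would likely need recalibration before being read as literal probabilities of default.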

In [None]:
y_pred = log_model.predict(x_test_scaled)
y_prob = log_model.predict_proba(x_test_scaled)[:, 1] #Prob of default (pd)

In [None]:
#Model Evaluation ROC - AUC
from sklearn.metrics import roc_auc_score

In [None]:
roc_auc = roc_auc_score(y_test, y_prob)
roc_auc

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

In [None]:
cm = confusion_matrix(y_test, y_pred)
ConfusionMatrixDisplay(cm).plot() #False negatives are riskier than false positives in lending.

In [None]:
from sklearn.metrics import classification_report

In [None]:
print(classification_report(y_test, y_pred))

In [None]:
threshold = 0.3
y_custom_pred = (y_prob >= threshold).astype(int)

In [None]:
coef_df = pd.DataFrame({
    'Feature': x.columns,
    'Coefficient': log_model.coef_[0]
}).sort_values(by='Coefficient', ascending=False)

coef_df.head(10)

In [None]:
#Logistic Regression was selected as the baseline credit risk model due to its interpretability and regulatory suitability. 
#Model coefficients provide transparent insight into how borrower characteristics influence default risk, making the model appropriate for governance, validation, and policy decision-making.

In [None]:
results_df = x_test.copy()
results_df['actual_default'] = y_test.values
results_df['predicted_pd'] = y_prob
results_df['risk_segment'] = pd.cut(
    y_prob,
    bins=[0, 0.2, 0.5, 1],
    labels=['Low', 'Medium', 'High']
)

results_df.to_csv("logistic_regression_predictions.csv", index=False)

In [None]:
from sklearn.metrics import roc_curve
import matplotlib.pyplot as plt

fpr, tpr, _ = roc_curve(y_test, y_prob)

plt.plot(fpr, tpr, label='Logistic Regression')
plt.plot([0,1], [0,1], linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.show()

In [None]:
ks_df = pd.DataFrame({
    'y_true': y_test,
    'y_prob': y_prob
}).sort_values('y_prob')

ks_df['cum_good'] = (ks_df['y_true'] == 0).cumsum() / (ks_df['y_true'] == 0).sum()
ks_df['cum_bad'] = (ks_df['y_true'] == 1).cumsum() / (ks_df['y_true'] == 1).sum()

ks_stat = max(abs(ks_df['cum_good'] - ks_df['cum_bad']))
ks_stat

In [None]:
from sklearn.metrics import roc_auc_score

In [None]:
train_prob = log_model.predict_proba(x_train_scaled)[:,1]

train_auc = roc_auc_score(y_train, train_prob)
test_auc = roc_auc_score(y_test, y_prob)

train_auc, test_auc

In [None]:
coef_df.sort_values('Coefficient', ascending = False)

In [None]:
results_df.groupby('risk_segment')['actual_default'].mean()

The logistic regression model was validated using ROC-AUC, KS statistic, and stability checks between training and test datasets. 
Performance consistency and intuitive coefficient behavior indicate a stable and interpretable model suitable for credit risk decision support.
Risk segmentation further confirms the model’s ability to distinguish borrower risk levels.
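For readers unfamiliar with the KS statistic, the sketch below wraps the same cumulative-distribution calculation used earlier in a small helper and runs it on two toy score vectors. The function name `ks_statistic` is introduced here for illustration, and ties between scores are not given special treatment.

```python
import numpy as np

def ks_statistic(y_true, y_score):
    """Max gap between cumulative shares of goods (0) and bads (1), ordered by score."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)

    # Sort observations from lowest to highest predicted risk
    order = np.argsort(y_score, kind="mergesort")
    y_sorted = y_true[order]

    # Cumulative share of each class seen up to each score
    cum_good = np.cumsum(y_sorted == 0) / (y_sorted == 0).sum()
    cum_bad = np.cumsum(y_sorted == 1) / (y_sorted == 1).sum()

    return float(np.max(np.abs(cum_good - cum_bad)))

# Perfect separation: every good scores below every bad
perfect = ks_statistic([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9])
# Interleaved scores: weaker separation
mixed = ks_statistic([0, 1, 0, 1], [0.1, 0.2, 0.3, 0.4])

print(perfect, mixed)
```

KS ranges from 0 (score distributions of goods and bads coincide) to 1 (complete separation), so it complements ROC-AUC as a measure of how well the score separates the two groups.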

In [None]:
from xgboost import XGBClassifier

In [None]:
xgb_model = XGBClassifier(
    n_estimators = 200,
    max_depth = 4,
    learning_rate = 0.05,
    subsample = 0.8,
    colsample_bytree = 0.8,
    eval_metric = 'auc',
    random_state = 42
)
xgb_model.fit(x_train, y_train)

In [None]:
xgb_prob= xgb_model.predict_proba(x_test)[:,1]

In [None]:
xgb_auc = roc_auc_score(y_test, xgb_prob)
log_auc = roc_auc_score(y_test, y_prob)


In [None]:
log_auc, xgb_auc

In [None]:
ks_df_xgb = pd.DataFrame({
    'y_true': y_test,
    'y_prob': xgb_prob
}).sort_values('y_prob')

ks_df_xgb['cum_good'] = (ks_df_xgb['y_true'] == 0).cumsum() / (ks_df_xgb['y_true'] == 0).sum()
ks_df_xgb['cum_bad'] = (ks_df_xgb['y_true'] == 1).cumsum() / (ks_df_xgb['y_true'] == 1).sum()

ks_xgb = max(abs(ks_df_xgb['cum_good'] - ks_df_xgb['cum_bad']))
ks_xgb

In [None]:
importances = pd.Series(xgb_model.feature_importances_, index=x.columns)
importances.sort_values(ascending=False).head(10).plot(kind='barh')
plt.title("XGBoost Feature Importance")
plt.show()

XGBoost captures nonlinear relationships but lacks coefficient-level transparency.

While XGBoost demonstrated improved predictive performance, Logistic Regression was retained as the primary model due to its interpretability, stability, and regulatory suitability. 
XGBoost is positioned as a challenger model for monitoring and performance benchmarking.

In [None]:
results_df.columns

In [None]:
'SK_ID_CURR' in df.columns

In [None]:
results_df = df.loc[x_test.index, ['SK_ID_CURR']].copy()

results_df['actual_default'] = y_test.values
results_df['predicted_pd'] = y_prob

results_df['risk_segment'] = pd.cut(
    y_prob,
    bins=[0, 0.2, 0.5, 1],
    labels=['Low', 'Medium', 'High']
)

In [None]:
results_df.to_csv("credit_risk_predictions.csv", index = False)

In [None]:
results_df.head()

In [None]:
key_features= ['AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY', 'CREDIT_INCOME_RATION', 'DAYS_EMPLOYED']

In [None]:
df.columns

In [None]:
results_df = results_df.merge(df.loc[x_test.index, key_features], left_index = True, right_index = True)