Load Data

In [1]:
# Loading data
# Imports
from pathlib import Path
import sys
import pandas as pd
import numpy as np

# Find ROOT folder
ROOT = Path.cwd().resolve().parents[0] if Path.cwd().name == "notebooks" else Path.cwd()

# Add ROOT to sys.path so we can import from src/
if str(ROOT) not in sys.path:
    sys.path.append(str(ROOT))

# Load config
from src.data_loader import load_config
cfg = load_config()

# Build path to processed data
processed_path = (ROOT / cfg["paths"]["processed_data_dir"] / cfg["paths"]["processed_filename"]).resolve()

# Load the processed dataset
df = pd.read_csv(processed_path)
print(df.shape)


(10000, 4)


What we’ll add:

Log transforms for money columns: money is right-skewed; log1p (i.e., log(1+x)) makes patterns easier for linear models and tames outliers.

Ratio: balance_to_salary captures relative liquidity vs. income (often more informative than each alone).

Interaction: employed * log_salary lets the model express that salary might matter differently for employed vs unemployed people.



In [2]:
# Feature engineering
df_fe = df.copy()

# 1) Log transforms to reduce skew and tame outliers
for col in ["bank_balance", "annual_salary"]:
    df_fe[f"log_{col}"] = np.log1p(df_fe[col])  # log(1 + x) safe for zeros

# 2) Safe ratio: bank balance / salary (guard against divide-by-zero)
#    - Where salary is 0 the ratio is undefined: the 0 becomes NaN, and the resulting NaN ratio is filled with 0.
den = df_fe["annual_salary"].replace({0: np.nan})
ratio = df_fe["bank_balance"] / den
df_fe["balance_to_salary"] = ratio.fillna(0)

# 3) Simple interaction: employment * log(salary)
df_fe["employed_x_log_salary"] = df_fe["employed"] * df_fe["log_annual_salary"]

# 4) Sanity check for any inf/NaN introduced
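# `not` is used instead of `~` because `~` on a plain Python bool is a bitwise inversion (~True == -2)
bad_any = not np.isfinite(df_fe.select_dtypes(include=[np.number])).all().all()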
print("Any non-finite values after FE? ->", bad_any)

df_fe.head()

Any non-finite values after FE? -> False


,employed,bank_balance,annual_salary,DEFAULT,log_bank_balance,log_annual_salary,balance_to_salary,employed_x_log_salary
0,1,8754.36,532339.56,0,9.077421,13.185039,0.016445,13.185039
1,0,9806.16,145273.56,0,9.190868,11.886381,0.067501,0.0
2,1,12882.6,381205.68,0,9.46371,12.851097,0.033794,12.851097
3,1,6351.0,428453.88,0,8.756525,12.967941,0.014823,12.967941
4,1,9427.92,461562.0,0,9.151537,13.042374,0.020426,13.042374


What we did:

log_* columns compress long tails so differences at low–mid values matter more (helpful for logistic regression, which is linear in its inputs).

balance_to_salary captures relative ability to cover expenses/debt; raw values can mislead.

employed_x_log_salary allows the model to express: “salary helps most when the person is employed.”
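
One way to check that a log transform really reduces skew is to compare sample skewness before and after. The sketch below uses synthetic lognormal data, not the loan dataset; the column names `amount` and `log_amount` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "money" column (hypothetical data, not the loan dataset)
rng = np.random.default_rng(0)
toy = pd.DataFrame({"amount": rng.lognormal(mean=10.0, sigma=1.0, size=5_000)})
toy["log_amount"] = np.log1p(toy["amount"])

# Sample skewness: values near 0 indicate a roughly symmetric distribution
skew_raw = toy["amount"].skew()
skew_log = toy["log_amount"].skew()
print(f"skew raw: {skew_raw:.2f}, skew log1p: {skew_log:.2f}")
```

The same two `.skew()` calls can be run on `bank_balance` and `log_bank_balance` in `df_fe`. Any zero balances map to 0 on the log scale, so the transformed column may turn out left-skewed rather than symmetric.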

# Freeze the feature set we made

In [3]:
# Save engineered data
# Freeze the engineered dataset to disk so we can reuse it elsewhere
from pathlib import Path

# Use the same ROOT we defined above
processed_dir = ROOT / "data" / "processed"
processed_dir.mkdir(parents=True, exist_ok=True)  # make sure the folder exists

engineered_path = processed_dir / "loan_default_engineered.csv"
df_fe.to_csv(engineered_path, index=False)

print("Saved engineered dataset to:", engineered_path)
print("Rows, Cols:", df_fe.shape)


Saved engineered dataset to: C:\Users\sauna\DS Projects\Financial_Analytics_LDP\data\processed\loan_default_engineered.csv
Rows, Cols: (10000, 8)
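
A quick round-trip check can confirm that a saved CSV reloads with the same shape, columns, and values. This is a minimal sketch on a toy frame written to a temporary directory; `toy` and `toy_engineered.csv` are hypothetical stand-ins for `df_fe` and the real output file.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

# Hypothetical stand-in for df_fe
toy = pd.DataFrame({
    "employed": [1, 0, 1],
    "log_annual_salary": np.log1p([50_000.0, 20_000.0, 80_000.0]),
})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy_engineered.csv"
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# Floats are written as text, so compare with a tolerance rather than exact equality
pd.testing.assert_frame_equal(toy, reloaded, check_exact=False)
print("Round trip OK:", reloaded.shape)
```

Writing with `index=False`, as the cell above does, keeps the row index from being reloaded as an extra unnamed column.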


In [4]:
# Descriptive statistics now after new features
# Look at distributions and simple differences by DEFAULT
fe_cols = ["log_bank_balance", "log_annual_salary", "balance_to_salary", "employed_x_log_salary"]
display(df_fe[fe_cols + ["DEFAULT"]].describe().T)

group_means = df_fe.groupby("DEFAULT")[fe_cols].mean().T
display(group_means)


,count,mean,std,min,25%,50%,75%,max
log_bank_balance,10000.0,8.607409,2.115978,0.0,8.662469,9.198735,9.546574,10.368882
log_annual_salary,10000.0,12.810315,0.460599,9.13396,12.453271,12.935148,13.172474,13.690686
balance_to_salary,10000.0,0.032414,0.033757,0.0,0.013183,0.024301,0.04194,1.645025
employed_x_log_salary,10000.0,9.205848,5.95119,0.0,0.0,12.935148,13.172474,13.690686
DEFAULT,10000.0,0.0333,0.179428,0.0,0.0,0.0,0.0,1.0


DEFAULT,0,1
log_bank_balance,8.561854,9.929894
log_annual_salary,12.812071,12.759313
balance_to_salary,0.03117,0.068525
employed_x_log_salary,9.244619,8.080322


| feature                 | DEFAULT=0 (no default) | DEFAULT=1 (default) | What this suggests                                                                                                                         |
| ----------------------- | ---------------------: | ------------------: | ------------------------------------------------------------------------------------------------------------------------------------------ |
| `log_bank_balance`      |                  8.562 |               9.930 | Defaulters have **higher** (log) balances on average. That’s counterintuitive. Low values (zero balances give log = 0) likely pull the non-default mean down, so compare medians before interpreting. |
| `log_annual_salary`     |                 12.812 |              12.759 | Defaulters have **slightly lower** (log) salary. Small gap → weak but sensible signal.                                                     |
| `balance_to_salary`     |                  0.031 |               0.069 | Defaulters hold **more balance relative to salary**, roughly double on average. A promising feature, though the gap should be judged against the spread. |
| `employed_x_log_salary` |                  9.245 |               8.080 | Lower for defaulters → consistent with “less employed and/or lower salary” among defaulters.                                               |


What this means:

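Means “by DEFAULT” compare group averages. A clear gap between the DEFAULT=1 and DEFAULT=0 columns, judged against the feature’s spread, suggests the feature carries signal; a gap in means alone does not show how well the feature separates the two classes.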
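
To put a mean gap on a common scale, one option is a standardized mean difference (the gap divided by the pooled standard deviation). The sketch below is an illustration on synthetic data; the helper `standardized_mean_diff` and the columns `shifted` and `noise` are hypothetical and not part of this project.

```python
import numpy as np
import pandas as pd

def standardized_mean_diff(frame, feature, target="DEFAULT"):
    """Mean of target==1 minus mean of target==0, divided by the pooled std."""
    g0 = frame.loc[frame[target] == 0, feature]
    g1 = frame.loc[frame[target] == 1, feature]
    pooled_var = ((len(g0) - 1) * g0.var() + (len(g1) - 1) * g1.var()) / (len(g0) + len(g1) - 2)
    return (g1.mean() - g0.mean()) / np.sqrt(pooled_var)

# Hypothetical toy data: "shifted" differs between groups, "noise" does not
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "DEFAULT": np.repeat([0, 1], [900, 100]),
    "shifted": np.concatenate([rng.normal(0.0, 1.0, 900), rng.normal(1.0, 1.0, 100)]),
    "noise": rng.normal(0.0, 1.0, 1000),
})

for col in ["shifted", "noise"]:
    print(col, round(standardized_mean_diff(toy, col), 2))
```

Applied to `df_fe`, the same helper could be looped over `fe_cols` to rank the engineered features by how far apart the two groups sit relative to their spread.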

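The ratio and interaction show clear gaps between the two groups, which is what we hoped for. Whether they summarise economic position better than the raw columns still has to be confirmed when a model is fitted.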

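The “defaulters have higher bank balance” surprise deserves a closer look before it is put down to outliers. Every log balance is at most 10.369, and the defaulter mean (9.930) sits above the overall 75th percentile (9.547), so a handful of extreme defaulters cannot produce it; a large share of defaulters must hold balances in the top quartile. The non-defaulter mean (8.562), by contrast, is below the overall median (9.199), which points to low values dragging that average down (zero balances map to 0 on the log scale). Means stay sensitive to extreme values even after a log transform, so compare group medians before reading much into the gap.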
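
To see whether a gap in means reflects the typical group member or only a few extreme values, it helps to put the mean and median side by side per group. The toy frame below is made up for illustration: group 1 has a lower typical balance but one very large value.

```python
import pandas as pd

# Hypothetical toy data: group 1 has a lower typical value but one very large balance
toy = pd.DataFrame({
    "DEFAULT": [0] * 6 + [1] * 6,
    "bank_balance": [900, 1000, 1000, 1100, 1000, 1000,
                     500, 600, 500, 600, 500, 20000],
})

# Mean and median per group; a large mean/median disagreement flags extreme values
summary = toy.groupby("DEFAULT")["bank_balance"].agg(["mean", "median"])
print(summary)
```

In this toy case the mean is higher for group 1 while the median is lower. Running the same `agg(["mean", "median"])` on `df_fe.groupby("DEFAULT")[fe_cols]` would show whether the real data follows that pattern or whether the medians agree with the means.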