# Credit Score model - EDA

**Objective**: In this notebook we perform the exploratory data analysis (EDA) of the dataset labeled in `1.data_labeling.ipynb`.

Here we have:

1. Data loading and Initial Exploration.
2. First needed feature engineering based on the purpose of the modeling.
3. Second feature engineering based on business criteria.
4. EDA over remaining variables.
5. Feature selection.
6. Final databases storing.


# 1. Data loading and Initial Exploration.

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
from pathlib import Path

# Auxiliary functions made for this project
from functions import (
    DataProfile,
    FeatureEngineering,
    StabilityMetrics,
    WOEAnalysis
)

pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", 100)
sns.set_palette("husl")

RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

import warnings
warnings.filterwarnings("ignore")

In [2]:
data_dir = Path("data/processed")

train = pd.read_parquet(data_dir / "train.parquet", engine="fastparquet")
valid = pd.read_parquet(data_dir / "valid.parquet", engine="fastparquet")
test = pd.read_parquet(data_dir / "test.parquet", engine="fastparquet")

print("Dataset Shapes:")
print(f"Train: {train.shape}")
print(f"Valid: {valid.shape}")
print(f"Test:  {test.shape}")

print("\nDate Ranges:")
print(f"Train: {train['issue_d'].min()} → {train['issue_d'].max()}")
print(f"Valid: {valid['issue_d'].min()} → {valid['issue_d'].max()}")
print(f"Test:  {test['issue_d'].min()} → {test['issue_d'].max()}")

Dataset Shapes:
Train: (845679, 143)
Valid: (225005, 143)
Test:  (237923, 143)

Date Ranges:
Train: 2016-01-01 00:00:00 → 2017-12-01 00:00:00
Valid: 2018-01-01 00:00:00 → 2018-06-01 00:00:00
Test:  2018-07-01 00:00:00 → 2018-12-01 00:00:00
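The date ranges above suggest an out-of-time split: train ends before valid starts, and valid ends before test starts. A minimal sketch of how that property could be asserted is shown here. The helper `assert_out_of_time` and the toy frames are hypothetical; only the column name `issue_d` comes from the notebook.

```python
import pandas as pd

def assert_out_of_time(splits, date_col="issue_d"):
    """Raise if consecutive splits overlap in time.

    `splits` is a dict of name -> DataFrame, given in chronological order.
    """
    names = list(splits)
    for prev, nxt in zip(names, names[1:]):
        prev_max = splits[prev][date_col].max()
        nxt_min = splits[nxt][date_col].min()
        if not prev_max < nxt_min:
            raise ValueError(f"{prev} (max {prev_max}) overlaps {nxt} (min {nxt_min})")
    return True

# Toy frames standing in for train / valid / test (hypothetical data)
toy_splits = {
    "train": pd.DataFrame({"issue_d": pd.to_datetime(["2016-01-01", "2017-12-01"])}),
    "valid": pd.DataFrame({"issue_d": pd.to_datetime(["2018-01-01", "2018-06-01"])}),
    "test": pd.DataFrame({"issue_d": pd.to_datetime(["2018-07-01", "2018-12-01"])}),
}
split_ok = assert_out_of_time(toy_splits)
```

In the notebook the same call could be made with `{"train": train, "valid": valid, "test": test}`.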


# 2. First needed feature engineering based on the purpose of the modeling.

Since we want to build an origination score, variables that are only known after origination must be discarded. Based on the feature descriptions, these are:

- loan_status
- total_pymnt
- total_pymnt_inv
- total_rec_prncp
- total_rec_int
- total_rec_late_fee
- out_prncp
- out_prncp_inv
- recoveries
- collection_recovery_fee
- last_pymnt_d
- last_pymnt_amnt
- next_pymnt_d
- last_credit_pull_d
- last_fico_range_low
- last_fico_range_high
- pymnt_plan
- deferral_term
- orig_projected_additional_accrued_interest

Additionally, some prefixes (`hardship_`, `settlement_`, `debt_settlement_flag`) indicate that a feature is a post-origination feature.

Features related to pricing must also be dropped to avoid redundancy in modeling. Based on the descriptions, these are:

- int_rate
- grade
- sub_grade
- initial_list_status
- funded_amnt
- funded_amnt_inv


In [3]:
fe = FeatureEngineering()

PRICING_ARTIFACTS = [
    "int_rate",
    "grade",
    "sub_grade",
    "initial_list_status",
    "funded_amnt",
    "funded_amnt_inv",
]

POST_ORIG_BASE = [
    "loan_status",
    "total_pymnt",
    "total_pymnt_inv",
    "total_rec_prncp",
    "total_rec_int",
    "total_rec_late_fee",
    "out_prncp",
    "out_prncp_inv",
    "recoveries",
    "collection_recovery_fee",
    "last_pymnt_d",
    "last_pymnt_amnt",
    "next_pymnt_d",
    "last_credit_pull_d",
    "last_fico_range_low",
    "last_fico_range_high",
    "pymnt_plan",
    "deferral_term",
    "orig_projected_additional_accrued_interest",
]

FAMILY_PREFIXES = ("hardship_", "settlement_", "debt_settlement_flag")
family_cols = [c for c in train.columns if c.startswith(FAMILY_PREFIXES)]

drop_cols = set(PRICING_ARTIFACTS) | set(POST_ORIG_BASE) | set(family_cols)

train_filtered = fe.make_pre_offer_features(
    train, keep_term=False, keep_target=True, drop_cols=drop_cols
)
valid_filtered = fe.make_pre_offer_features(
    valid, keep_term=False, keep_target=True, drop_cols=drop_cols
)
test_filtered = fe.make_pre_offer_features(
    test, keep_term=False, keep_target=True, drop_cols=drop_cols
)

print(f"Kept {len(train_filtered.columns)} features + target")

Kept 101 features + target
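Since `make_pre_offer_features` is a project helper whose internals are not shown here, a small independent check can confirm that no pricing or post-origination column survived the filter. The function `leaked_columns` below is a hypothetical sketch, not part of the `functions` module.

```python
def leaked_columns(columns, exact=(), prefixes=()):
    """Return the columns that match a forbidden name or start with a forbidden prefix."""
    exact = set(exact)
    prefixes = tuple(prefixes)
    return sorted(c for c in columns if c in exact or c.startswith(prefixes))

# Toy column list (hypothetical); real usage would pass train_filtered.columns
toy_cols = ["loan_amnt", "int_rate", "hardship_flag", "dti"]
leaks = leaked_columns(toy_cols, exact={"int_rate", "grade"}, prefixes=("hardship_",))
```

In the notebook, `leaked_columns(train_filtered.columns, drop_cols, FAMILY_PREFIXES)` would be expected to return an empty list if the filter behaved as described.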


# 3. Second feature engineering based on business criteria

In [4]:
profiler = DataProfile()
profile = profiler.quick_profile(train_filtered, exclude=["target"])
print("\nData Quality Summary:")
display(profile.head(55))


Data Quality Summary:


Unnamed: 0,feature,dtype,missing_pct,n_unique,min,median,max,top_category,top_pct
0,sec_app_revol_util,float64,95.73,1146,0.0,62.4,182.5,,
1,revol_bal_joint,float64,95.66,27895,0.0,25323.0,357135.0,,
2,sec_app_chargeoff_within_12_mths,float64,95.66,19,0.0,0.0,21.0,,
3,sec_app_collections_12_mths_ex_med,float64,95.66,14,0.0,0.0,16.0,,
4,sec_app_earliest_cr_line,object,95.66,598,,,,Aug-2006,0.91
5,sec_app_fico_range_high,float64,95.66,61,544.0,669.0,850.0,,
6,sec_app_fico_range_low,float64,95.66,61,540.0,665.0,845.0,,
7,sec_app_inq_last_6mths,float64,95.66,7,0.0,0.0,6.0,,
8,sec_app_mort_acc,float64,95.66,18,0.0,1.0,18.0,,
9,sec_app_num_rev_accts,float64,95.66,74,0.0,11.0,96.0,,


The `application_type` feature shows that 97% of loans are `individual`. Given the low share of joint loans, features related to the second applicant are dropped: those that start with `sec_app_` and those that end with `joint`.

In [5]:
sec_app_cols = [c for c in train_filtered.columns if c.startswith("sec_app_")]

JOINT_FEATURES = {
    "annual_inc_joint",
    "dti_joint",
    "verification_status_joint",
    "revol_bal_joint",
    "application_type",
}

UTILITY_TEXT = ["url", "policy_code", "member_id", "id", "Unnamed: 0", "emp_title", "title"]

drop_cols = set(UTILITY_TEXT) | set(sec_app_cols) | JOINT_FEATURES

train_filtered = fe.make_pre_offer_features(
    train_filtered, keep_term=False, keep_target=True, drop_cols=drop_cols
)
valid_filtered = fe.make_pre_offer_features(
    valid_filtered, keep_term=False, keep_target=True, drop_cols=drop_cols
)
test_filtered = fe.make_pre_offer_features(
    test_filtered, keep_term=False, keep_target=True, drop_cols=drop_cols
)

print(f"Kept {len(train_filtered.columns)} features + target")

Kept 79 features + target


### 3.1. Feature Engineering over `emp_length`, `revol_util` and `fico_range` features

Some transformations are applied to these variables to make them easier to analyze:

1. emp_length: transformed into a monotonic numerical variable.
2. revol_util: stored as a string because percentages have `%` at the end; transformed into a numerical variable.
3. fico_range: the midpoint of the lower and upper `fico_range` variables is taken to build a new feature.

In [6]:
train_eng = fe.engineer_features(train_filtered)
valid_eng = fe.engineer_features(valid_filtered)
test_eng = fe.engineer_features(test_filtered)

train_eng.drop(columns=["emp_length", "revol_util", "fico_range_low", "fico_range_high"], inplace=True)
valid_eng.drop(columns=["emp_length", "revol_util", "fico_range_low", "fico_range_high"], inplace=True)
test_eng.drop(columns=["emp_length", "revol_util", "fico_range_low", "fico_range_high"], inplace=True)
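The three transformations listed in section 3.1 can be sketched as follows. This is not the code of `fe.engineer_features`, whose internals are not shown; the function name `engineer_sketch` is hypothetical. The output names `emp_length_yrs`, `revol_util_pct` and `fico_mid` are the ones that appear in later outputs, and the mapping of `< 1 year` to 0.5 and `10+ years` to 10 is an assumption that is consistent with the profile shown later (min 0.5, max 10.0, 11 unique values).

```python
import pandas as pd

def engineer_sketch(df):
    """Hypothetical re-creation of the three transformations on a copy of df."""
    out = df.copy()

    # emp_length: "3 years" -> 3, "10+ years" -> 10, "< 1 year" -> 0.5 (assumed mapping)
    emp = out["emp_length"]
    years = pd.to_numeric(emp.str.extract(r"(\d+)", expand=False), errors="coerce")
    years = years.mask(emp.str.startswith("<", na=False), 0.5)
    out["emp_length_yrs"] = years.astype("float32")

    # revol_util: "48.8%" -> 48.8
    util = pd.to_numeric(out["revol_util"].str.rstrip("%"), errors="coerce")
    out["revol_util_pct"] = util.astype("float32")

    # fico_mid: midpoint of the reported FICO range
    out["fico_mid"] = (out["fico_range_low"] + out["fico_range_high"]) / 2
    return out

# Toy rows (hypothetical data)
toy_raw = pd.DataFrame({
    "emp_length": ["< 1 year", "3 years", "10+ years", None],
    "revol_util": ["48.8%", "0%", None, "173.2%"],
    "fico_range_low": [660, 700, 720, 680],
    "fico_range_high": [664, 704, 724, 684],
})
toy_eng = engineer_sketch(toy_raw)
```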

### 3.2. Some imputations based on business criteria

NaN values in credit bureau variables like `open_acc_6m` could mean that there is no information for the variable, which makes sense. Next we impute some NaN values with values commonly used for these cases:

- 999: for variables that start with `mths_since_` and for the two features `num_tl_120dpd_2m` and `mo_sin_old_il_acct`
- 0: for counts and the other bureau variables listed in `count_cols`

In [7]:
count_cols = [
    "il_util",
    "all_util",
    "bc_util",
    "percent_bc_gt_75",
    "bc_open_to_buy",
    "open_acc_6m",
    "open_il_12m",
    "open_il_24m",
    "open_rv_12m",
    "open_rv_24m",
    "inq_fi",
    "inq_last_12m",
    "total_bal_il",
    "total_cu_tl",
    "max_bal_bc",
    "open_act_il",
]

no_rec_cols = ["num_tl_120dpd_2m", "mo_sin_old_il_acct"]

train_clean = fe.handle_missing_values(
    train_eng, prefix="mths_since_", count_cols=count_cols, no_rec_cols=no_rec_cols
)
valid_clean = fe.handle_missing_values(
    valid_eng, prefix="mths_since_", count_cols=count_cols, no_rec_cols=no_rec_cols
)
test_clean = fe.handle_missing_values(
    test_eng, prefix="mths_since_", count_cols=count_cols, no_rec_cols=no_rec_cols
)
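The imputation rule described in section 3.2 can be sketched as below. The function `impute_bureau_nans` is a hypothetical stand-in for `fe.handle_missing_values`, whose actual implementation is not shown; it only assumes the two rules stated in the text (999 as a sentinel, 0 for the listed columns).

```python
import numpy as np
import pandas as pd

def impute_bureau_nans(df, prefix, count_cols, no_rec_cols, sentinel=999):
    """Fill NaN with a sentinel for 'months since' style columns and with 0 for the listed columns."""
    out = df.copy()
    sentinel_cols = [c for c in out.columns if c.startswith(prefix)]
    sentinel_cols += [c for c in no_rec_cols if c in out.columns and c not in sentinel_cols]
    zero_cols = [c for c in count_cols if c in out.columns]
    if sentinel_cols:
        out[sentinel_cols] = out[sentinel_cols].fillna(sentinel)
    if zero_cols:
        out[zero_cols] = out[zero_cols].fillna(0)
    return out

# Toy frame (hypothetical data)
toy_missing = pd.DataFrame({
    "mths_since_last_delinq": [12.0, np.nan],
    "mo_sin_old_il_acct": [np.nan, 130.0],
    "open_acc_6m": [np.nan, 2.0],
    "dti": [np.nan, 18.2],  # not listed, so it is left untouched
})
toy_imputed = impute_bureau_nans(
    toy_missing, prefix="mths_since_", count_cols=["open_acc_6m"], no_rec_cols=["mo_sin_old_il_acct"]
)
```

A sentinel such as 999 only makes sense when the variable is later binned or fed to a model that can isolate it; used as a raw number in a linear model it would behave like a very large real value.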

### 3.3 Default variable mapping

In [8]:
target_mapping = {"non_default": 0, "default": 1}

train_clean["target"] = train_clean["target"].map(target_mapping).astype("int8")
valid_clean["target"] = valid_clean["target"].map(target_mapping).astype("int8")
test_clean["target"] = test_clean["target"].map(target_mapping).astype("int8")

print("\nTarget Distribution:")
print(test_clean["target"].value_counts())
print(f"\nDefault Rate: {test_clean['target'].mean():.2%}")


Target Distribution:
target
0    217809
1     20114
Name: count, dtype: int64

Default Rate: 8.45%


# 4. Exploratory Data Analysis (EDA)

A quick look at the profile of the remaining variables.

In [9]:
train_final = train_clean.copy()
valid_final = valid_clean.copy()
test_final = test_clean.copy()

profile = profiler.quick_profile(train_final, exclude=["target"])
print("\nData Quality Summary:")
display(profile.head(20))


Data Quality Summary:


Unnamed: 0,feature,dtype,missing_pct,n_unique,min,median,max,top_category,top_pct
0,emp_length_yrs,float32,6.82,11,0.5,6.0,10.0,,
1,revol_util_pct,float32,0.08,1216,0.0,48.799999,173.2,,
2,dti,float64,0.06,8155,-1.0,18.17,999.0,,
3,acc_now_delinq,float64,0.0,7,0.0,0.0,7.0,,
4,acc_open_past_24mths,float64,0.0,52,0.0,4.0,61.0,,
5,addr_state,object,0.0,50,,,,CA,13.31
6,all_util,float64,0.0,183,0.0,60.0,211.0,,
7,annual_inc,float64,0.0,45349,0.0,66000.0,110000000.0,,
8,avg_cur_bal,float64,0.0,69794,0.0,7366.0,752994.0,,
9,bc_open_to_buy,float64,0.0,69839,0.0,5588.0,711140.0,,


## 4.1. Creation of useful ratios

Ratios are very useful for scorecard modelling because they measure variables relative to others. Features related to income are very informative, so we create the respective "to_income" and "to_limit" ratios and drop the original variables.

In [10]:
for df in (train_final, valid_final, test_final):
    df["loan_to_income"] = df["loan_amnt"] / df["annual_inc"]
    df["install_to_income"] = df["installment"] / (df["annual_inc"] / 12)
    df["util_to_limit"] = df["tot_cur_bal"] / df["tot_hi_cred_lim"]
    df["balance_to_income"] = df["tot_cur_bal"] / df["annual_inc"]
    df.drop(
        columns=["loan_amnt", "installment", "tot_cur_bal", "annual_inc", "tot_hi_cred_lim"],
        inplace=True
    )
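One detail worth keeping in mind: the profile above reports a minimum of 0 for `annual_inc`, so plain division can yield `inf` (positive numerator) or `NaN` (zero numerator). The sketch below shows both cases and one possible guard. The helper `safe_ratio` is hypothetical and is not what the notebook uses; treating non-positive denominators as missing is an assumption.

```python
import numpy as np
import pandas as pd

def safe_ratio(num, den):
    """Divide two Series, treating non-positive denominators as missing."""
    return num / den.where(den > 0)

# Toy rows (hypothetical data)
toy_ratio = pd.DataFrame({
    "loan_amnt": [10000.0, 5000.0, 0.0],
    "annual_inc": [50000.0, 0.0, 0.0],
})
plain = toy_ratio["loan_amnt"] / toy_ratio["annual_inc"]          # finite, inf, NaN
guarded = safe_ratio(toy_ratio["loan_amnt"], toy_ratio["annual_inc"])  # finite, NaN, NaN
```

With plain division, an `inf` value still falls into a last bin whose upper edge is `np.inf`, whereas a guarded `NaN` would be routed to the `no_info` bin in the binning step.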

### Binning of important numerical variables

An offline iteration showed that binning some numerical variables and some ratios could help improve the stability and performance of the model. Here we bin some of the most important numerical variables, based on business criteria and on our offline analysis of WOE discrimination per bin.

In [11]:
num_bins = {
    "loan_to_income": [0, 0.04, 0.08, 0.12, 0.16, 0.22, 0.30, np.inf],
    "inq_last_6mths": [0, 1, 2, np.inf],
    "avg_cur_bal": [0, 11000, 15000, 25000, 30000, np.inf],
    "install_to_income": [-0.00001, 0.02, 0.04, 0.07, 0.1, np.inf],
    "util_to_limit": [-0.00001, 0.5, 1, np.inf],
    "balance_to_income": [-0.00001, 0.4, np.inf],
    "delinq_2yrs": [-0.001, 1, np.inf],
    "inq_last_12m": [-0.001, 1, 4, np.inf],
    "mths_since_last_delinq": [-0.001, 900],
    "revol_util_pct": [-0.00001, 30, 60, 90, np.inf],
}

dfs = [train_final, valid_final, test_final]

for col, edges in num_bins.items():
    for df in dfs:
        if col in df.columns:
            x = pd.to_numeric(df[col], errors="coerce")
            b = pd.cut(x, bins=np.array(edges, dtype=float), include_lowest=True)
            lab = b.astype(str).where(b.notna(), "no_info")
            lab = lab.str.replace(r"[^A-Za-z0-9]+", "_", regex=True).str.strip("_")
            df[f"{col}_bin"] = pd.Categorical(lab)

for df in dfs:
    df.drop(columns=[c for c in num_bins if c in df.columns], inplace=True)
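The bin labels produced by the cell above can look cryptic (for example `0_08_0_12` or `0_3_inf`). The sketch below repeats the same labelling steps on a few toy values to show where they come from; the wrapper name `bin_labels` and the toy inputs are hypothetical. Note that `include_lowest=True` makes pandas shift the first left edge slightly below the stated value, which is why the first label starts with a small number such as `0_001` rather than `0`.

```python
import numpy as np
import pandas as pd

def bin_labels(values, edges):
    """Cut values into intervals and turn interval strings into identifier-like labels."""
    x = pd.to_numeric(pd.Series(values), errors="coerce")
    b = pd.cut(x, bins=np.array(edges, dtype=float), include_lowest=True)
    lab = b.astype(str).where(b.notna(), "no_info")          # missing or out-of-range -> "no_info"
    return lab.str.replace(r"[^A-Za-z0-9]+", "_", regex=True).str.strip("_")

# 0.10 falls in (0.08, 0.12], 5.0 in (0.12, inf], None has no bin
toy_labels = bin_labels([0.02, 0.10, 5.0, None], [0, 0.04, 0.08, 0.12, np.inf])
```

Values outside the outermost edges also become `no_info`, which matters for `mths_since_last_delinq`: its edges stop at 900, so rows holding the 999 sentinel land in `no_info`.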

## 4.2 WOE segmentation analysis
Here is an example of how variables discriminate by bin, based on the WOE analysis:

In [12]:
woe_analyzer = WOEAnalysis()

iv_overview = woe_analyzer.show_woe_for_columns(
    train_final,
    target="target",
    num_cols=[f"{name}_bin" for name in num_bins.keys()],
    show_tables=True,
)

loan_to_income_bin


Unnamed: 0,Agrupacion,Total,G,B,%_bad_rate,WOE,IV_bin
0,0_001_0_04,28182,25416,2766,9.8%,0.645405,0.011138
1,0_04_0_08,81336,71718,9618,11.8%,0.436534,0.015812
2,0_08_0_12,109191,94917,14274,13.1%,0.321991,0.012014
3,0_12_0_16,118270,101013,17257,14.6%,0.19446,0.004957
4,0_16_0_22,163112,136279,26833,16.5%,0.0525,0.000522
5,0_22_0_3,162874,131258,31616,19.4%,-0.14907,0.004491
6,0_3_inf,182714,139748,42966,23.5%,-0.39314,0.037782


inq_last_6mths_bin


Unnamed: 0,Agrupacion,Total,G,B,%_bad_rate,WOE,IV_bin
0,0_001_1_0,747844,626104,121740,16.3%,0.065057,0.003663
1,1_0_2_0,68850,53203,15647,22.7%,-0.348736,0.011055
2,2_0_inf,28984,21041,7943,27.4%,-0.59839,0.014727
3,no_info,1,1,0,0.0%,0.356176,0.0


avg_cur_bal_bin


Unnamed: 0,Agrupacion,Total,G,B,%_bad_rate,WOE,IV_bin
0,0_001_11000_0,505251,407583,97668,19.3%,-0.143901,0.012961
1,11000_0_15000_0,72043,60539,11504,16.0%,0.088021,0.000641
2,15000_0_25000_0,128524,109635,18889,14.7%,0.186005,0.004942
3,25000_0_30000_0,39461,34133,5328,13.5%,0.284717,0.003438
4,30000_0_inf,100383,88444,11939,11.9%,0.429987,0.018977
5,no_info,17,15,2,11.8%,0.442331,3e-06


install_to_income_bin


Unnamed: 0,Agrupacion,Total,G,B,%_bad_rate,WOE,IV_bin
0,0_00101_0_02,46125,41620,4505,9.8%,0.650821,0.018502
1,0_02_0_04,136244,119751,16493,12.1%,0.409907,0.02357
2,0_04_0_07,252864,214941,37923,15.0%,0.162234,0.007457
3,0_07_0_1,188993,154670,34323,18.2%,-0.067094,0.001028
4,0_1_inf,221453,169367,52086,23.5%,-0.3934,0.045857


util_to_limit_bin


Unnamed: 0,Agrupacion,Total,G,B,%_bad_rate,WOE,IV_bin
0,0_00101_0_5,143109,120657,22452,15.7%,0.109,0.001939
1,0_5_1_0,685733,566003,119730,17.5%,-0.019212,0.000301
2,1_0_inf,16820,13674,3146,18.7%,-0.103207,0.000219
3,no_info,17,15,2,11.8%,0.442331,3e-06


balance_to_income_bin


Unnamed: 0,Agrupacion,Total,G,B,%_bad_rate,WOE,IV_bin
0,0_00101_0_4,155240,128354,26886,17.3%,-0.009385,1.6e-05
1,0_4_inf,690439,571995,118444,17.2%,0.002118,4e-06


delinq_2yrs_bin


Unnamed: 0,Agrupacion,Total,G,B,%_bad_rate,WOE,IV_bin
0,0_002_1_0,786162,652015,134147,17.1%,0.00856,6.8e-05
1,1_0_inf,59517,48334,11183,18.8%,-0.108831,0.000864


inq_last_12m_bin


Unnamed: 0,Agrupacion,Total,G,B,%_bad_rate,WOE,IV_bin
0,0_002_1_0,437231,373426,63805,14.6%,0.194317,0.018298
1,1_0_4_0,302600,245219,57381,19.0%,-0.120133,0.005369
2,4_0_inf,105848,81704,24144,22.8%,-0.353505,0.017488


mths_since_last_delinq_bin


Unnamed: 0,Agrupacion,Total,G,B,%_bad_rate,WOE,IV_bin
0,0_002_900_0,434849,357688,77161,17.7%,-0.038805,0.000784
1,no_info,410830,342661,68169,16.6%,0.04218,0.000852


revol_util_pct_bin


Unnamed: 0,Agrupacion,Total,G,B,%_bad_rate,WOE,IV_bin
0,0_00101_30_0,204805,175168,29637,14.5%,0.20415,0.009429
1,30_0_60_0,348424,286875,61549,17.7%,-0.033359,0.000464
2,60_0_90_0,247823,202003,45820,18.5%,-0.08901,0.00239
3,90_0_inf,43956,35724,8232,18.7%,-0.104778,0.00059
4,no_info,671,579,92,13.7%,0.266942,5.2e-05
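The `WOE` and `IV_bin` columns in the tables above are consistent with the usual definitions, where for bin $i$ the weight of evidence is $\ln\big((G_i/G)/(B_i/B)\big)$ and the bin contribution to IV is $(G_i/G - B_i/B)\cdot \mathrm{WOE}_i$. The sketch below recomputes them from the `G` and `B` counts of the `loan_to_income_bin` table. The function `woe_table` is hypothetical and is not the `WOEAnalysis` implementation; in particular it applies no smoothing, so it would fail on bins with zero goods or bads (the project class evidently handles that case somehow, since the `no_info` bin with `B = 0` has a finite WOE).

```python
import numpy as np
import pandas as pd

def woe_table(good, bad):
    """Per-bin WOE and IV from good/bad counts (no smoothing for empty cells)."""
    good = np.asarray(good, dtype=float)
    bad = np.asarray(bad, dtype=float)
    dist_good = good / good.sum()
    dist_bad = bad / bad.sum()
    woe = np.log(dist_good / dist_bad)
    iv_bin = (dist_good - dist_bad) * woe
    return pd.DataFrame({
        "G": good, "B": bad,
        "bad_rate": bad / (good + bad),
        "WOE": woe, "IV_bin": iv_bin,
    })

# G and B counts copied from the loan_to_income_bin table above
good_counts = [25416, 71718, 94917, 101013, 136279, 131258, 139748]
bad_counts = [2766, 9618, 14274, 17257, 26833, 31616, 42966]
woe_tbl = woe_table(good_counts, bad_counts)
iv_total = woe_tbl["IV_bin"].sum()
```

With this sign convention a positive WOE marks a bin with a lower-than-average bad rate, which matches the monotonic pattern in the table: WOE falls as `loan_to_income` rises.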


## 4.3 Information value (IV)

Next, the information value (IV) is calculated for all features. As a common rule of thumb, variables with an IV above 0.2 can have high predictive power in a scorecard model. For numerical variables, the bins built above are used; for the remaining ones, bins are constructed from percentiles of the distribution.


In [13]:
iv_overview = woe_analyzer.show_woe_for_columns(
    train_final,
    target="target",
    num_cols=train_final.columns.difference(["target"]).tolist(),
    cat_cols=[],
)
display(iv_overview.style.format({"iv_total": "{:.4f}"}))

Unnamed: 0,feature,iv_total,type
14,fico_mid,0.1167,numeric
20,install_to_income_bin,0.0964,categorical
21,loan_to_income_bin,0.0867,categorical
1,acc_open_past_24mths,0.0815,numeric
56,open_rv_24m,0.0684,numeric
74,verification_status,0.0631,categorical
49,num_tl_op_past_12m,0.0576,numeric
6,bc_open_to_buy,0.051,numeric
34,mths_since_recent_inq,0.0455,numeric
69,total_bc_limit,0.0433,numeric


## 4.4 Population Stability Index

To improve the stability of the model over time, it is better to choose variables whose distribution is stable in terms of the Population Stability Index (PSI). Here we calculate the PSI per variable for train vs. validation. A PSI > 0.25 could indicate that the feature is very unstable, so it can be dropped if its information value is not important.

In [14]:
stability = StabilityMetrics()

num_cols = [
    c for c in train_final.columns if c != "target" and pd.api.types.is_numeric_dtype(train_final[c])
]
cat_cols = [
    c for c in train_final.columns if c != "target" and not pd.api.types.is_numeric_dtype(train_final[c])
]

psi_results = []
for col in num_cols:
    psi_val = stability.calculate_psi_numeric(train_final[col], valid_final[col])
    psi_results.append({"feature": col, "psi": psi_val})

for col in cat_cols:
    psi_val = stability.calculate_psi_categorical(train_final[col], valid_final[col])
    psi_results.append({"feature": col, "psi": psi_val})

psi_df = pd.DataFrame(psi_results).sort_values("psi", ascending=False)

print("\nPopulation Stability (PSI - Train vs Valid):")
display(psi_df.head(15))


Population Stability (PSI - Train vs Valid):


Unnamed: 0,feature,psi
60,fico_mid,0.074319
26,bc_util,0.061995
53,percent_bc_gt_75,0.058687
20,all_util,0.050462
75,revol_util_pct_bin,0.050102
25,bc_open_to_buy,0.047286
1,mths_since_last_record,0.041057
3,pub_rec,0.033999
63,purpose,0.028618
37,mths_since_recent_revol_delinq,0.023051


# 5. Feature Selection

Next we drop features with `IV < 0.02` or `PSI > 0.25`, and we add an additional filter for strongly correlated features. To reduce noise, features with `correlation > 0.85` are analyzed and one of them (the one with the lower IV) is dropped.

IV_MIN = 0.02
PSI_MAX = 0.25
CORR_THRESHOLD = 0.85

iv_pass = set(iv_overview.loc[iv_overview.iv_total >= IV_MIN, "feature"])
psi_pass = set(psi_df.loc[psi_df.psi <= PSI_MAX, "feature"])
base_features = iv_pass & psi_pass

engineered = {"fico_mid", "emp_length_yrs"}
base_features |= engineered

unstable_features = {
    "max_bal_bc",
    "all_util",
    "total_bal_il",
    "il_util",
    "open_act_il",
    "mths_since_last_record",
}
base_features -= unstable_features

print(f"Features after IV/PSI filtering: {len(base_features)}")

numeric_candidates = [
    c for c in base_features if c in train_final.columns and pd.api.types.is_numeric_dtype(train_final[c])
]

corr_matrix = train_final[numeric_candidates].corr(method="spearman").abs()
iv_lookup = iv_overview.set_index("feature").iv_total.to_dict()

selected = []
dropped = set()

for feat in sorted(numeric_candidates, key=lambda x: -iv_lookup.get(x, 0)):
    if feat in dropped:
        continue
    selected.append(feat)
    highly_corr = corr_matrix.index[
        (corr_matrix[feat] >= CORR_THRESHOLD) & (corr_matrix.index != feat)
    ].tolist()
    dropped.update(highly_corr)

final_features = sorted((set(selected) | (base_features - set(numeric_candidates))) - dropped)

FINAL_FEATURE_SET = final_features


print(f"Number of final features: {len(final_features)}")
print(f"\nSelected Features: {final_features}")
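The correlation filter in the cell above is a greedy pass: features are visited in decreasing IV order, each visited feature is kept, and every remaining feature whose absolute Spearman correlation with it reaches the threshold is discarded. The sketch below isolates that logic in a function and runs it on synthetic data; the function name `prune_correlated`, the toy columns and the IV values are hypothetical.

```python
import numpy as np
import pandas as pd

def prune_correlated(df, iv_lookup, threshold=0.85):
    """Greedy pruning: keep the highest-IV feature of each highly correlated group."""
    corr = df.corr(method="spearman").abs()
    kept, dropped = [], set()
    for feat in sorted(df.columns, key=lambda c: -iv_lookup.get(c, 0)):
        if feat in dropped:
            continue
        kept.append(feat)
        partners = corr.index[(corr[feat] >= threshold) & (corr.index != feat)]
        dropped.update(partners)
    return kept, dropped

# Synthetic data (hypothetical): "a_copy" is a near duplicate of "a", "c" is independent
rng = np.random.default_rng(0)
a = rng.normal(size=500)
toy_corr = pd.DataFrame({
    "a": a,
    "a_copy": a + rng.normal(scale=0.01, size=500),
    "c": rng.normal(size=500),
})
kept, dropped = prune_correlated(toy_corr, {"a": 0.10, "a_copy": 0.08, "c": 0.03})
```

Because the pass is ordered by IV, the result depends on the IV estimates: two near-duplicate features with almost equal IV can swap places between runs on different samples.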

In [15]:
# Create lookup dictionaries
iv_lookup = iv_overview.set_index('feature')['iv_total'].to_dict()
psi_lookup = psi_df.set_index('feature')['psi'].to_dict()

# Simple tiered selection function
def passes_selection(feature, iv_lookup, psi_lookup):
    """
    Returns True if feature should be selected
    """
    iv = iv_lookup.get(feature, 0)
    psi = psi_lookup.get(feature, 999)
    
    # Tiered PSI thresholds based on IV
    if iv >= 0.10:
        return psi <= 0.40  # Strong predictors - relaxed PSI
    elif iv >= 0.05:
        return psi <= 0.30  # Medium predictors
    elif iv >= 0.02:
        return psi <= 0.25  # Weak predictors - original PSI threshold
    else:
        return False  # Too weak, don't use

# Select features
base_features = set()
for feature in iv_overview['feature']:
    if passes_selection(feature, iv_lookup, psi_lookup):
        base_features.add(feature)

# Always keep these critical features
base_features |= {"fico_mid", "emp_length_yrs"}

# Only drop features with extreme PSI (data quality issues)
extreme_psi_features = set(psi_df[psi_df['psi'] > 0.40]['feature'])
base_features -= extreme_psi_features

print(f"Features after IV/PSI filtering: {len(base_features)}")

# ============================================================================
# CORRELATION FILTERING (same logic as the previous cell)
# ============================================================================

numeric_candidates = [
    c for c in base_features 
    if c in train_final.columns and pd.api.types.is_numeric_dtype(train_final[c])
]

corr_matrix = train_final[numeric_candidates].corr(method="spearman").abs()

selected = []
dropped = set()

for feat in sorted(numeric_candidates, key=lambda x: -iv_lookup.get(x, 0)):
    if feat in dropped:
        continue
    selected.append(feat)
    highly_corr = corr_matrix.index[
        (corr_matrix[feat] >= 0.85) & (corr_matrix.index != feat)
    ].tolist()
    dropped.update(highly_corr)

final_features = sorted((set(selected) | (base_features - set(numeric_candidates))) - dropped)

FINAL_FEATURE_SET = final_features


Features after IV/PSI filtering: 32


In [16]:
FINAL_FEATURE_SET

['acc_open_past_24mths',
 'all_util',
 'avg_cur_bal_bin',
 'bc_open_to_buy',
 'dti',
 'emp_length_yrs',
 'fico_mid',
 'home_ownership',
 'il_util',
 'inq_fi',
 'inq_last_12m_bin',
 'inq_last_6mths_bin',
 'install_to_income_bin',
 'loan_to_income_bin',
 'max_bal_bc',
 'mo_sin_old_rev_tl_op',
 'mo_sin_rcnt_rev_tl_op',
 'mo_sin_rcnt_tl',
 'mort_acc',
 'mths_since_rcnt_il',
 'mths_since_recent_bc',
 'mths_since_recent_inq',
 'num_actv_rev_tl',
 'num_tl_op_past_12m',
 'open_acc_6m',
 'open_il_24m',
 'open_rv_12m',
 'open_rv_24m',
 'total_bc_limit',
 'verification_status']

In [17]:
psi_final = psi_df[psi_df.feature.isin(FINAL_FEATURE_SET)].sort_values("psi", ascending=False)
print("\nPSI for Final Features:")
display(psi_final)


PSI for Final Features:


Unnamed: 0,feature,psi
60,fico_mid,0.074319
20,all_util,0.050462
25,bc_open_to_buy,0.047286
62,verification_status,0.021561
30,mo_sin_old_rev_tl_op,0.020806
66,loan_to_income_bin,0.02075
0,dti,0.019871
57,total_bc_limit,0.018528
40,num_actv_rev_tl,0.014844
16,il_util,0.013235


# 6. Final Feature Storing

In [18]:
X_train = train_final[FINAL_FEATURE_SET].copy()
y_train = train_final["target"].values

X_valid = valid_final[FINAL_FEATURE_SET].copy()
y_valid = valid_final["target"].values

X_test = test_final[FINAL_FEATURE_SET].copy()
y_test = test_final["target"].values


In [19]:
out = Path("data/eda")
out.mkdir(parents=True, exist_ok=True)

# Write everything under the `out` directory defined above
X_train.to_parquet(out / "X_train.parquet", index=False)
X_valid.to_parquet(out / "X_valid.parquet", index=False)
X_test.to_parquet(out / "X_test.parquet", index=False)

pd.DataFrame({"target": y_train}).to_parquet(out / "y_train.parquet", index=False)
pd.DataFrame({"target": y_valid}).to_parquet(out / "y_valid.parquet", index=False)
pd.DataFrame({"target": y_test}).to_parquet(out / "y_test.parquet", index=False)

np.save(out / "y_train.npy", y_train)
np.save(out / "y_valid.npy", y_valid)
np.save(out / "y_test.npy", y_test)