<span style= "color: blue">

## **_Part II: Predict past_due loans using application_train dataset_**</span>

<span style="color:blue">

**_Overall objective_**:</span> To predict loans at risk of becoming past due or defaulting, using the Home Credit Group Kaggle dataset

<span style="color:blue">
    
**_Objectives for Notebook 02_**:</span>
* Model and compare the feature groups from the main dataset (_application_train_) that were defined in Notebook 01
* Identify the best model to carry forward to Notebook 03, which focuses on creating features from the other datasets to improve prediction

<span style="color:blue">
    
**_Summary of models attempted:_**</span> 

|Model features | Dataset | # of features | ROC-AUC (validation) |
|:----|:----:|:----:|:----:|
|(a) All features | _application_train_ | 128 | .7487 |
|(b) Features (a) without the multicollinear features (Kendall corr coef > .90) removed by SmartCorrelatedSelection | _application_train_ | 78 | .7420 |
|(c) Features (b) with phik corr coef with target > .25 | _application_train_ | 10 | .7181 |
|(d) Features (a) without a SUBSET of highly correlated features (those that refer to measurements of the applicant's residence) | _application_train_ | 106 | .7404 |

**Best model**: Model (b) was selected as the best model to move forward in the analysis, because even though it performs marginally worse than Model (a), it has substantially fewer features.
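To make the trade-off concrete, the short sketch below recomputes each model's AUC drop and share of features relative to Model (a). The numbers are copied from this notebook's outputs (feature counts from the `len(...)` cells, ROC-AUC from the validation reports); the dictionary layout and variable names are illustrative only.

```python
# ROC-AUC and feature counts as reported in this notebook's cell outputs.
results = {
    "a": {"n_features": 128, "roc_auc": 0.7487},
    "b": {"n_features": 78, "roc_auc": 0.7420},
    "c": {"n_features": 10, "roc_auc": 0.7181},
    "d": {"n_features": 106, "roc_auc": 0.7404},
}

baseline = results["a"]
comparison = {}
for name, res in results.items():
    comparison[name] = {
        # How much ROC-AUC is lost relative to the all-features model
        "auc_drop": round(baseline["roc_auc"] - res["roc_auc"], 4),
        # Fraction of the 128 features that the model still uses
        "share_of_features": round(res["n_features"] / baseline["n_features"], 3),
    }
    print(name, comparison[name])
```

On these figures, Model (b) gives up about 0.007 ROC-AUC while using roughly 61% of the features.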

**Next step**: Create features from other tables and add to Model (b) to improve performance (see Notebook 03)

In [1]:
from utils.libraries02_03 import *

from utils.functions import (
    process_dataset,
    create_preprocessing_pipeline,
    downsize_dtypes,
    ClampOutliersTransformer
)

current_dir = os.getcwd()
print(current_dir)

End of utils module
C:\Users\yashi\Dropbox\00 Turing Projects\_Turing repo clone\yaslin-ML.4.1


In [2]:
# Format settings
# %load_ext jupyter_black
# %load_ext autoreload
%reload_ext jupyter_black
pd.options.display.float_format = "{:,.2f}".format

set_config(transform_output="pandas")
pd.reset_option("display.max_columns")

In [3]:
train = joblib.load("data/train.joblib")
val = joblib.load("data/val.joblib")
test = joblib.load("data/test.joblib")
train.shape, val.shape, test.shape

((184506, 128), (61502, 128), (61503, 128))

<span style= "color: blue"> 

### **_2.1 Define feature groups for modelling_**

#### **_2.1.1 (a) All features_**

</span>

In [4]:
features_all_lst = joblib.load("data/features_all_lst.joblib")
len(features_all_lst)

128

<span style= "color: blue"> 

#### **_2.1.2 Subsets of features pruned of highly correlated features_** 

#### **_(b) All features without (pruned of) highly correlated multicollinear features_**</span>
* _"highly correlated" defined as Kendall correlation coefficient > .90_
* _feature selection via SmartCorrelatedSelection_ 
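As a rough illustration of what correlation-based pruning does, the sketch below computes Kendall's tau in plain Python and keeps a feature only if its absolute tau with every feature already kept is at or below the threshold. This is not the SmartCorrelatedSelection implementation (which groups correlated features and picks one per group by a configurable criterion); the greedy keep-first rule, the function names, and the toy columns are assumptions made for this example.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant) / total number of pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


def prune_correlated(columns, threshold=0.90):
    """Greedy pruning: keep a column only if it is not highly
    correlated (|tau| > threshold) with any column kept so far."""
    kept = []
    for name, values in columns.items():
        if all(abs(kendall_tau(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Toy data (invented for illustration, not taken from application_train)
toy = {
    "livingarea_avg": [1, 2, 3, 4, 5, 6],
    "livingarea_medi": [1, 2, 3, 4, 5, 7],  # same ordering as livingarea_avg
    "amt_income": [3, 1, 6, 2, 5, 4],
}

kept = prune_correlated(toy)
print(kept)  # ['livingarea_avg', 'amt_income']
```

Because the two `livingarea_*` toy columns rank the rows identically, their tau is 1.0 and the second one is dropped, while the weakly related third column (tau = 0.2) is kept.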

In [5]:
features_smartcorr_lst = joblib.load("data/features_smartcorr_lst.joblib")
len(features_smartcorr_lst)

78

<span style= "color: blue"> 

#### **_(c) Features from (b) that are correlated with the target_** </span>

* _"correlated" defined as > .25 phik correlation coefficient_

In [6]:
features_hi_phik_lst = joblib.load("data/features_hi_phik_lst.joblib")
len(features_hi_phik_lst)

10

<span style= "color: blue"> 

#### **_(d) All features without a SUBSET of features listed below:_**</span>
* _the excluded features all belong to groups of highly correlated features that represent estimated measurements of the loan applicant's residence_
* _within each correlated pair, the feature to keep was selected via SmartCorrelatedSelection_

In [7]:
hi_corr_lst = [
    "basementarea_medi",
    "commonarea_medi",
    "floorsmax_medi",
    "floorsmin_medi",
    "nonlivingapartments_medi",
    "years_build_mode",
    "elevators_mode",
    "entrances_mode",
    "landarea_mode",
    "livingarea_mode",
    "nonlivingarea_mode",
    "basementarea_avg",
    "commonarea_avg",
    "floorsmax_avg",
    "floorsmin_avg",
    "nonlivingapartments_avg",
    "years_build_avg",
    "elevators_avg",
    "entrances_avg",
    "landarea_avg",
    "livingarea_avg",
    "nonlivingarea_avg",
]
len(hi_corr_lst)

22

In [8]:
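# Build a list (not a set) so the original column order is preserved
# and the result can be used directly for column selection
hi_corr_set = set(hi_corr_lst)
features_smartcorr2_lst = [f for f in features_all_lst if f not in hi_corr_set]
len(features_smartcorr2_lst)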

106

In [9]:
joblib.dump(features_smartcorr2_lst, "data/features_smartcorr2_lst.joblib")

['data/features_smartcorr2_lst.joblib']

<span style= "color: blue">
    
### **_2.2 Test all features from application_train_**</span>

In [10]:
selected_features = features_all_lst
train.shape

(184506, 128)

In [11]:
X_train, y_train, X_train_num, X_train_num_lst, X_train_cat, X_train_cat_lst, train = (
    process_dataset(train, "target", 4, 0.5, selected_features)
)
train.shape

(37145, 128)

In [12]:
X_val, y_val, _, _, _, _, val = process_dataset(
    val, "target", 3, 0.4, selected_features
)

val.shape

(8028, 128)

<span style= "color: blue">
    
* **_To prepare for CatBoost, identify the categorical features with missing data (candidates for recoding as "NaN")_**</span>

In [13]:
cat_idx = np.array([X_train.columns.get_loc(col) for col in X_train_cat_lst], dtype=int)
cols_with_missing = np.array(
    [
        X_train.columns.get_loc(col)
        for col in X_train.columns
        if X_train[col].isna().any()
    ],
    dtype=int,
)
common_features_indices = np.intersect1d(cat_idx, cols_with_missing)
cat_features_w_msg = X_train.columns[common_features_indices].tolist()
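The cell above only identifies the affected columns. A minimal sketch of the recoding step itself is shown below on a toy frame, assuming that the literal string `"NaN"` is an acceptable placeholder category; the column values and the `X_recoded` name are invented for illustration and do not come from `application_train`.

```python
import numpy as np
import pandas as pd

# Toy frame with two categorical columns and one numerical column
X = pd.DataFrame(
    {
        "occupation_type": pd.Categorical(["Laborers", None, "Drivers"]),
        "name_contract_type": pd.Categorical(
            ["Cash loans", "Cash loans", "Revolving loans"]
        ),
        "amt_credit": [1.0, np.nan, 3.0],
    }
)
cat_cols = ["occupation_type", "name_contract_type"]

# Categorical columns that contain at least one missing value
cat_with_missing = [col for col in cat_cols if X[col].isna().any()]

# Recode missing categories as the string "NaN"; numerical columns are untouched
X_recoded = X.copy()
for col in cat_with_missing:
    X_recoded[col] = X_recoded[col].astype("object").fillna("NaN")

print(cat_with_missing)
print(X_recoded["occupation_type"].tolist())
```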

In [14]:
cat_idx

array([ 0,  1,  2,  3,  9, 10, 11, 12, 13, 26, 30, 38, 84, 85, 87, 88])

In [15]:
X_train.columns[112]

'doc20'

In [16]:
X_train.recent_credit_inquiries.unique()

array([0., 1., 2., 4., 3., 5., 6.], dtype=float32)

In [17]:
# Suggested by Phind; uses a count-frequency encoder for categorical features
def create_preprocessing_pipeline(num_columns, cat_columns):
    """Creates a preprocessing pipeline and balanced class weights.

    Note: this definition overrides the helper imported from utils.functions
    and reads `y_train` from the notebook's global scope.

    Args:
        num_columns (list): List of numerical column names.
        cat_columns (list): List of categorical column names.

    Returns:
        tuple: (ColumnTransformer preprocessing pipeline,
            dict mapping each class label to its balanced weight).
    """
    num_transform = Pipeline(
        steps=[
            ("clamp_outliers", ClampOutliersTransformer()),
            ("ss", StandardScaler()),
        ]
    )

    cat_transform = Pipeline(
        steps=[
            (
                "cf_encoder",
                CountFrequencyEncoder(
                    encoding_method="frequency", missing_values="ignore"
                ),
            ),
        ]
    )

    pp_transformer = ColumnTransformer(
        [("num", num_transform, num_columns), ("cat", cat_transform, cat_columns)],
        verbose_feature_names_out=False,
        remainder="drop",
    )

    weights = compute_class_weight(
        class_weight="balanced", classes=np.unique(y_train), y=np.ravel(y_train)
    )
    class_weights = dict(zip(np.unique(y_train), weights))

    return pp_transformer, class_weights


# X_train_num_lst, X_train_cat_lst and y_train come from the process_dataset call above
pp_transformer, class_weights = create_preprocessing_pipeline(
    X_train_num_lst, X_train_cat_lst
)

cb_estimator = CatBoostClassifier(
    loss_function="Logloss",
    verbose=100,
    # class_weights=class_weights,  # left disabled for this run; enabled in 2.3-2.5
    random_seed=42,
)

cb_pipe = Pipeline(
    [
        ("pp_transformer", pp_transformer),
        ("cb_estimator", cb_estimator),
    ]
)

# Fit the pipeline
cb_pipe.fit(X_train, y_train)

# Predict
y_pred = cb_pipe.predict(X_val)

Learning rate set to 0.048226
0:	learn: 0.6686723	total: 176ms	remaining: 2m 56s
100:	learn: 0.4304025	total: 5.69s	remaining: 50.6s
200:	learn: 0.4173530	total: 12.6s	remaining: 50.2s
300:	learn: 0.4049457	total: 20.3s	remaining: 47.2s
400:	learn: 0.3935124	total: 28.7s	remaining: 42.8s
500:	learn: 0.3836895	total: 35.1s	remaining: 34.9s
600:	learn: 0.3739829	total: 41.8s	remaining: 27.7s
700:	learn: 0.3641591	total: 48.8s	remaining: 20.8s
800:	learn: 0.3554366	total: 58s	remaining: 14.4s
900:	learn: 0.3471036	total: 1m 6s	remaining: 7.33s
999:	learn: 0.3388756	total: 1m 17s	remaining: 0us
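The categorical branch of the pipeline above uses frequency encoding, which replaces each category with its relative frequency in the training data. The plain-Python sketch below shows the idea on toy values. It is not the `CountFrequencyEncoder` implementation: computing frequencies over non-missing values only, and mapping missing or unseen categories to `None`, are assumptions of this sketch.

```python
from collections import Counter


def fit_frequency_map(train_values):
    """Learn category -> relative frequency from the training column."""
    counts = Counter(v for v in train_values if v is not None)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}


def transform_with_map(values, freq_map):
    """Replace each category by its learned frequency (None if unknown)."""
    return [freq_map.get(v) for v in values]


# Toy categorical column (invented for illustration)
train_col = ["Laborers", "Laborers", "Drivers", None, "Laborers"]
val_col = ["Drivers", "Laborers", None, "Managers"]

freq_map = fit_frequency_map(train_col)
encoded = transform_with_map(val_col, freq_map)
print(freq_map)  # {'Laborers': 0.75, 'Drivers': 0.25}
print(encoded)   # [0.25, 0.75, None, None]
```

The mapping is learned on the training column only and then reused on validation data, which is what fitting the encoder inside the pipeline achieves.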


In [18]:
report = classification_report(y_val, y_pred)
print(report)

roc_auc = roc_auc_score(y_val, cb_pipe.predict_proba(X_val)[:, 1])
print(f"ROC-AUC Score: {roc_auc:.4f}")

              precision    recall  f1-score   support

         0.0       0.78      0.97      0.86      6021
         1.0       0.64      0.18      0.28      2007

    accuracy                           0.77      8028
   macro avg       0.71      0.57      0.57      8028
weighted avg       0.74      0.77      0.72      8028

ROC-AUC Score: 0.7487
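The report above combines low recall for class 1 with a reasonable ROC-AUC. The two are compatible because the classification report scores hard predictions at a single decision threshold (assumed here to be 0.5), whereas ROC-AUC measures how well the predicted probabilities rank positives above negatives. The sketch below illustrates this with a pairwise-comparison definition of AUC on toy scores; the helper names and the numbers are invented for the example.

```python
def roc_auc_pairwise(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly;
    ties count as half a correct pair."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def recall_at_threshold(y_true, scores, threshold=0.5):
    """Recall of the positive class when predicting 1 for score >= threshold."""
    true_pos = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    n_pos = sum(1 for y in y_true if y == 1)
    return true_pos / n_pos


# Toy example: positives are ranked fairly well, but every score is below 0.5
y_true = [0, 0, 0, 1, 1]
scores = [0.10, 0.30, 0.45, 0.40, 0.48]

auc = roc_auc_pairwise(y_true, scores)
recall = recall_at_threshold(y_true, scores)
print(f"AUC: {auc:.4f}, recall at 0.5: {recall:.2f}")  # AUC: 0.8333, recall at 0.5: 0.00
```

In this toy case a lower threshold would recover the positives without changing the AUC, which is why ROC-AUC is the more stable metric for comparing the feature groups.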


<span style= "color: blue">
    
### **_2.3 Test features from application_train with highly correlated features removed_**
##### **_Kendall corr coef > .90_**</span>

In [19]:
train = joblib.load("data/train.joblib")
val = joblib.load("data/val.joblib")

selected_features = features_smartcorr_lst

In [21]:
X_train, y_train, X_train_num, X_train_num_lst, X_train_cat, X_train_cat_lst, train = (
    process_dataset(train, "target", 5, 0.5, selected_features)
)
train.shape

(44574, 78)

In [22]:
X_val, y_val, _, _, _, _, val = process_dataset(
    val, "target", 5, 0.5, selected_features
)
val.shape

(15048, 78)

In [23]:
%%time
pp_transformer, class_weights = create_preprocessing_pipeline(
    X_train_num_lst, X_train_cat_lst
)

cb_estimator = CatBoostClassifier(
    loss_function="Logloss",
    class_weights=class_weights,
    class_names=[0, 1],
    verbose=100,
    random_seed=42,
)

cb_pipe = Pipeline(
    [
        ("pp_transformer", pp_transformer),
        ("cb_estimator", cb_estimator),
    ]
)
cb_pipe.fit(X_train, y_train)
y_pred = cb_pipe.predict(X_val)

Learning rate set to 0.05213
0:	learn: 0.6867930	total: 65.4ms	remaining: 1m 5s
100:	learn: 0.5876157	total: 7.07s	remaining: 1m 2s
200:	learn: 0.5680880	total: 12.3s	remaining: 49.1s
300:	learn: 0.5473939	total: 17.6s	remaining: 40.9s
400:	learn: 0.5289600	total: 22.4s	remaining: 33.5s
500:	learn: 0.5119769	total: 27.4s	remaining: 27.3s
600:	learn: 0.4961056	total: 33.4s	remaining: 22.2s
700:	learn: 0.4814749	total: 38.6s	remaining: 16.5s
800:	learn: 0.4681776	total: 43.3s	remaining: 10.8s
900:	learn: 0.4551895	total: 48.6s	remaining: 5.34s
999:	learn: 0.4427432	total: 54.5s	remaining: 0us
CPU times: total: 2min 9s
Wall time: 56.2 s
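This run is the first to pass `class_weights` to CatBoost. With `class_weight="balanced"`, scikit-learn's `compute_class_weight` uses the formula `n_samples / (n_classes * count_of_class)`. The sketch below reproduces that arithmetic in plain Python for a 5:1 class ratio; the class counts are borrowed from the validation report in this section purely as an illustration, and the helper name is hypothetical.

```python
from collections import Counter


def balanced_class_weights(y):
    """Balanced weights: n_samples / (n_classes * count of each class)."""
    counts = Counter(y)
    n_samples = len(y)
    n_classes = len(counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in sorted(counts.items())
    }


# Illustrative 5:1 label vector (12,540 negatives, 2,508 positives)
y_toy = [0] * 12540 + [1] * 2508

weights = balanced_class_weights(y_toy)
print(weights)  # {0: 0.6, 1: 3.0}
```

With these weights, each positive example counts five times as much as a negative one in the loss, so both classes contribute equally in total.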


In [24]:
report = classification_report(y_val, y_pred)
print(report)

roc_auc = roc_auc_score(y_val, cb_pipe.predict_proba(X_val)[:, 1])
print(f"ROC-AUC Score: {roc_auc:.4f}")

              precision    recall  f1-score   support

         0.0       0.90      0.76      0.82     12540
         1.0       0.33      0.59      0.42      2508

    accuracy                           0.73     15048
   macro avg       0.62      0.68      0.62     15048
weighted avg       0.81      0.73      0.76     15048

ROC-AUC Score: 0.7420


<span style="color:blue">

#### **_Insights:_**</span>

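* As noted in the summary, Model (b) was selected as the best model to move forward in the analysis. Its ROC-AUC (.7420) is nearly as high as that of Model (a) with all application_train features (.7487), while using 78 rather than 128 features. The comparison is approximate, because Model (a) was fit without class weights and scored on a differently sampled validation set.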

<span style= "color: blue">
    
### **_2.4 Test features from application_train that were highly correlated with target (phik)_**</span>

In [25]:
train = joblib.load("data/train.joblib")
val = joblib.load("data/val.joblib")

selected_features = features_hi_phik_lst

In [26]:
X_train, y_train, X_train_num, X_train_num_lst, X_train_cat, X_train_cat_lst, train = (
    process_dataset(train, "target", 5, 0.5, selected_features)
)
train.shape

(44574, 10)

In [27]:
X_val, y_val, _, _, _, _, val = process_dataset(
    val, "target", 5, 0.9, selected_features
)
val.shape

(27090, 10)

In [28]:
%%time
pp_transformer, class_weights = create_preprocessing_pipeline(
    X_train_num_lst, X_train_cat_lst
)

cb_estimator = CatBoostClassifier(
    loss_function="Logloss",
    class_weights=class_weights,
    class_names=[0, 1],
    verbose=100,
    random_seed=42,
)

cb_pipe = Pipeline(
    [
        ("pp_transformer", pp_transformer),
        ("cb_estimator", cb_estimator),
    ]
)
cb_pipe.fit(X_train, y_train)
y_pred = cb_pipe.predict(X_val)

Learning rate set to 0.05213
0:	learn: 0.6863813	total: 31.9ms	remaining: 31.8s
100:	learn: 0.5996525	total: 3.42s	remaining: 30.4s
200:	learn: 0.5878754	total: 7.26s	remaining: 28.9s
300:	learn: 0.5762600	total: 10.9s	remaining: 25.3s
400:	learn: 0.5647347	total: 14.4s	remaining: 21.5s
500:	learn: 0.5540515	total: 18.1s	remaining: 18s
600:	learn: 0.5442358	total: 21.7s	remaining: 14.4s
700:	learn: 0.5352009	total: 25.4s	remaining: 10.8s
800:	learn: 0.5268860	total: 29s	remaining: 7.19s
900:	learn: 0.5186880	total: 32.7s	remaining: 3.59s
999:	learn: 0.5109306	total: 36.5s	remaining: 0us
CPU times: total: 1min 7s
Wall time: 37.2 s


In [29]:
%%time
report = classification_report(y_val, y_pred)
print(report)

roc_auc = roc_auc_score(y_val, cb_pipe.predict_proba(X_val)[:, 1])
print(f"ROC-AUC Score: {roc_auc:.4f}")

              precision    recall  f1-score   support

         0.0       0.90      0.70      0.79     22575
         1.0       0.29      0.61      0.40      4515

    accuracy                           0.69     27090
   macro avg       0.60      0.66      0.59     27090
weighted avg       0.80      0.69      0.72     27090

ROC-AUC Score: 0.7181
CPU times: total: 156 ms
Wall time: 162 ms


<span style= "color: blue">
    
### **_2.5 Test application_train features without SUBSET of highly correlated features_**</span>

In [31]:
train = joblib.load("data/train.joblib")
val = joblib.load("data/val.joblib")

selected_features = features_smartcorr2_lst

In [32]:
X_train, y_train, X_train_num, X_train_num_lst, X_train_cat, X_train_cat_lst, train = (
    process_dataset(train, "target", 5, 0.5, selected_features)
)
train.shape

(44574, 106)

In [33]:
X_val, y_val, _, _, _, _, val = process_dataset(
    val, "target", 5, 0.9, selected_features
)
val.shape

(27090, 106)

In [35]:
%%time
pp_transformer, class_weights = create_preprocessing_pipeline(
    X_train_num_lst, X_train_cat_lst
)

cb_estimator = CatBoostClassifier(
    loss_function="Logloss",
    class_weights=class_weights,
    class_names=[0, 1],
    verbose=100,
    random_seed=42,
)

cb_pipe = Pipeline(
    [
        ("pp_transformer", pp_transformer),
        ("cb_estimator", cb_estimator),
    ]
)
cb_pipe.fit(X_train, y_train)
y_pred = cb_pipe.predict(X_val)

Learning rate set to 0.05213
0:	learn: 0.6869149	total: 33.3ms	remaining: 33.3s
100:	learn: 0.5829430	total: 3.88s	remaining: 34.6s
200:	learn: 0.5621369	total: 7.33s	remaining: 29.1s
300:	learn: 0.5407117	total: 10.8s	remaining: 25.1s
400:	learn: 0.5205214	total: 14.6s	remaining: 21.8s
500:	learn: 0.5026442	total: 21.3s	remaining: 21.3s
600:	learn: 0.4862161	total: 30.1s	remaining: 20s
700:	learn: 0.4710341	total: 37s	remaining: 15.8s
800:	learn: 0.4570778	total: 43.2s	remaining: 10.7s
900:	learn: 0.4436626	total: 50.7s	remaining: 5.57s
999:	learn: 0.4304299	total: 56.3s	remaining: 0us
CPU times: total: 2min 9s
Wall time: 57.8 s


In [36]:
%%time
report = classification_report(y_val, y_pred)
print(report)

roc_auc = roc_auc_score(y_val, cb_pipe.predict_proba(X_val)[:, 1])
print(f"ROC-AUC Score: {roc_auc:.4f}")

              precision    recall  f1-score   support

         0.0       0.90      0.76      0.82     22575
         1.0       0.33      0.59      0.42      4515

    accuracy                           0.73     27090
   macro avg       0.62      0.67      0.62     27090
weighted avg       0.81      0.73      0.76     27090

ROC-AUC Score: 0.7404
CPU times: total: 344 ms
Wall time: 589 ms
