# Feature Engineering Notebook

This notebook builds and evaluates feature blocks from `MUST_AGG` tables for the Home Credit task.

## Goals
- Build **Active vs Closed** contract features
- Build **Time-windowed** aggregations (`<=1y`, `1-3y`, `>3y`)
- Build **DPD-conditional** aggregations (e.g., `DPD>=30`, `DPD>=90`)
- Merge blocks one-by-one and keep only blocks that improve validation AUC


## 1) Setup and Base Data

Initialize paths and load the base table used as the master join frame.


In [2]:
#imports and config
from pathlib import Path
import polars as pl

DATA_DIR = Path(r"C:\Users\darre\Downloads\home-credit-credit-risk-model-stability\csv_files\train")

MUST_AGG = [
    "tax_registry_a_1", "applprev_2", "debitcard_1", "deposit_1",
    "person_2", "credit_bureau_b_1", "applprev_1_1", "applprev_1_0", "person_1"
]

TABLES = {
    "base": DATA_DIR / "train_base.csv",
    "tax_registry_a_1": DATA_DIR / "train_tax_registry_a_1.csv",
    "applprev_2": DATA_DIR / "train_applprev_2.csv",
    "debitcard_1": DATA_DIR / "train_debitcard_1.csv",
    "deposit_1": DATA_DIR / "train_deposit_1.csv",
    "person_2": DATA_DIR / "train_person_2.csv",
    "credit_bureau_b_1": DATA_DIR / "train_credit_bureau_b_1.csv",
    "applprev_1_1": DATA_DIR / "train_applprev_1_1.csv",
    "applprev_1_0": DATA_DIR / "train_applprev_1_0.csv",
    "person_1": DATA_DIR / "train_person_1.csv",
}

print("base exists:", TABLES["base"].exists())



base exists: True


In [3]:
# load base
# Why: base table is the master frame that all aggregated features will join into via case_id.
# Keep only key columns needed for joins, time-based validation, and target label.
base = pl.read_csv(TABLES["base"])
base = base.select(["case_id", "WEEK_NUM", "date_decision", "target"])
print(base.shape)
base.head()


(1526659, 4)


case_id,WEEK_NUM,date_decision,target
i64,i64,str,i64
0,0,"""2019-01-03""",0
1,0,"""2019-01-03""",0
2,0,"""2019-01-04""",0
3,0,"""2019-01-03""",0
4,0,"""2019-01-04""",1


## 2) Table Profiling (Why Aggregation Is Needed)

We inspect schema and row-depth per `case_id` to confirm these are 1:many tables that must be aggregated.


In [4]:
#schema + depth profile
def table_schema(path):
    return pl.read_csv(path, n_rows=0).schema

def depth_stats(path):
    return (
        pl.scan_csv(path)
        .group_by("case_id")
        # .len() replaces the deprecated .count(); its result column is named "len"
        .len()
        .select(
            pl.col("len").mean().alias("avg_rows_per_case"),
            pl.col("len").median().alias("median_rows_per_case"),
            pl.col("len").max().alias("max_rows_per_case"),
        )
        .collect()
    )

for t in MUST_AGG:
    s = table_schema(TABLES[t])
    d = depth_stats(TABLES[t])
    print(f"\n=== {t} ===")
    print("n_cols:", len(s))
    print("first 12 cols:", list(s.keys())[:12])
    print(d)



=== tax_registry_a_1 ===
n_cols: 5
first 12 cols: ['case_id', 'amount_4527230A', 'name_4527232M', 'num_group1', 'recorddate_4527225D']
shape: (1, 3)
┌───────────────────┬──────────────────────┬───────────────────┐
│ avg_rows_per_case ┆ median_rows_per_case ┆ max_rows_per_case │
│ ---               ┆ ---                  ┆ ---               │
│ f64               ┆ f64                  ┆ u32               │
╞═══════════════════╪══════════════════════╪═══════════════════╡
│ 7.153367          ┆ 6.0                  ┆ 99                │
└───────────────────┴──────────────────────┴───────────────────┘



=== applprev_2 ===
n_cols: 6
first 12 cols: ['case_id', 'cacccardblochreas_147M', 'conts_type_509L', 'credacc_cards_status_52L', 'num_group1', 'num_group2']
shape: (1, 3)
┌───────────────────┬──────────────────────┬───────────────────┐
│ avg_rows_per_case ┆ median_rows_per_case ┆ max_rows_per_case │
│ ---               ┆ ---                  ┆ ---               │
│ f64               ┆ f64                  ┆ u32               │
╞═══════════════════╪══════════════════════╪═══════════════════╡
│ 11.522909         ┆ 8.0                  ┆ 79                │
└───────────────────┴──────────────────────┴───────────────────┘

=== debitcard_1 ===
n_cols: 6
first 12 cols: ['case_id', 'last180dayaveragebalance_704A', 'last180dayturnover_1134A', 'last30dayturnover_651A', 'num_group1', 'openingdate_857D']
shape: (1, 3)
┌───────────────────┬──────────────────────┬───────────────────┐
│ avg_rows_per_case ┆ median_rows_per_case ┆ max_rows_per_case │
│ ---               ┆ ---                  ┆ ---  

In [5]:
# Inspect status values to map active vs closed
ap0 = pl.read_csv(TABLES["applprev_1_0"])
ap1 = pl.read_csv(TABLES["applprev_1_1"])
cb1 = pl.read_csv(TABLES["credit_bureau_b_1"])

print("applprev_1_0 status_219L")
print(ap0.group_by("status_219L").len().sort("len", descending=True).head(20))

print("\napplprev_1_1 status_219L")
print(ap1.group_by("status_219L").len().sort("len", descending=True).head(20))

print("\ncredit_bureau_b_1 contractst_516M")
print(cb1.group_by("contractst_516M").len().sort("len", descending=True).head(20))


applprev_1_0 status_219L
shape: (12, 2)
┌─────────────┬─────────┐
│ status_219L ┆ len     │
│ ---         ┆ ---     │
│ str         ┆ u32     │
╞═════════════╪═════════╡
│ K           ┆ 1605077 │
│ D           ┆ 1563834 │
│ A           ┆ 431299  │
│ T           ┆ 263947  │
│ N           ┆ 15668   │
│ …           ┆ …       │
│ L           ┆ 470     │
│ H           ┆ 276     │
│ P           ┆ 39      │
│ null        ┆ 35      │
│ R           ┆ 8       │
└─────────────┴─────────┘

applprev_1_1 status_219L
shape: (12, 2)
┌─────────────┬─────────┐
│ status_219L ┆ len     │
│ ---         ┆ ---     │
│ str         ┆ u32     │
╞═════════════╪═════════╡
│ D           ┆ 1104119 │
│ K           ┆ 1053357 │
│ A           ┆ 284608  │
│ T           ┆ 177685  │
│ N           ┆ 14762   │
│ …           ┆ …       │
│ S           ┆ 302     │
│ H           ┆ 196     │
│ null        ┆ 31      │
│ P           ┆ 20      │
│ R           ┆ 15      │
└─────────────┴─────────┘

credit_bureau_b_1 contractst_516M


In [6]:
#prep base decision date
base_dates = (
    base.select(["case_id", "date_decision"])
    .with_columns(pl.col("date_decision").str.strptime(pl.Date, strict=False))
)
base_dates.head()

case_id,date_decision
i64,date
0,2019-01-03
1,2019-01-03
2,2019-01-04
3,2019-01-03
4,2019-01-04


## 3) Initial Feature Prototype

First pass: engineer features for one table (`credit_bureau_b_1`), test whether adding this block improves AUC, then repeat the same steps for `applprev_1_0` and `applprev_1_1`.


### Methodology: Active vs Closed Contract Features

How we define contract state in this project:
- **Active/Open** means the contract is still ongoing at `date_decision`.
- **Closed** means the contract ended before `date_decision`.

Implementation rule order:
1. **Date-based rule (preferred):** if end/maturity date exists, use comparison to `date_decision`.
2. **Status fallback:** if date is not available, use status-code mapping (heuristic).
3. **Presence-only fallback:** if neither exists, do not force active/closed lifecycle labels; use activity counts only.

Why this is valid: competition metadata repeatedly distinguishes predictors for active vs closed contracts.
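
The rule order above can be sketched on single records without any dataframe library. This is a minimal illustration, not the notebook's implementation: the function name `contract_state` and the `"unknown"` label are hypothetical, and the status sets simply mirror the heuristic mapping (`A` as active; `D`, `K`, `T` as closed) used for the `applprev` tables in this notebook.

```python
from datetime import date
from typing import Optional

# Heuristic status mapping, mirroring the one applied to status_219L in this notebook
ACTIVE_CODES = {"A"}
CLOSED_CODES = {"D", "K", "T"}

def contract_state(
    decision: date,
    end_date: Optional[date] = None,
    status: Optional[str] = None,
) -> str:
    """Return 'active', 'closed', or 'unknown' following the three-step rule order."""
    # 1) Date-based rule (preferred): compare end/maturity date to the decision date
    if end_date is not None:
        return "active" if end_date >= decision else "closed"
    # 2) Status fallback: only used when no end date is available
    if status in ACTIVE_CODES:
        return "active"
    if status in CLOSED_CODES:
        return "closed"
    # 3) Presence-only fallback: do not force a lifecycle label
    return "unknown"

decision = date(2019, 1, 3)
examples = [
    contract_state(decision, end_date=date(2020, 6, 1)),  # ends after decision
    contract_state(decision, end_date=date(2018, 6, 1)),  # ended before decision
    contract_state(decision, status="K"),                 # no date, status fallback
    contract_state(decision),                             # neither date nor status
]
print(examples)
```

Records that fall through to `"unknown"` would contribute only to plain row counts, which matches the presence-only fallback described above.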


In [7]:
#build credit_bureau_b_1 row-level flags
cb1 = pl.read_csv(TABLES["credit_bureau_b_1"])

cb1f = (
    cb1.join(base_dates, on="case_id", how="left")
    .with_columns([
        pl.col("contractdate_551D").str.strptime(pl.Date, strict=False),
        pl.col("contractmaturitydate_151D").str.strptime(pl.Date, strict=False),
        pl.col("lastupdate_260D").str.strptime(pl.Date, strict=False),
        pl.col("dpd_550P").cast(pl.Float64, strict=False),
        pl.col("dpd_733P").cast(pl.Float64, strict=False),
    ])
    .with_columns([
        # Time anchor for windowing
        pl.coalesce([pl.col("lastupdate_260D"), pl.col("contractdate_551D")]).alias("event_date"),

        # Inferred contract state from maturity vs decision date
        (pl.col("contractmaturitydate_151D") >= pl.col("date_decision")).fill_null(False).alias("is_active"),
        (pl.col("contractmaturitydate_151D") <  pl.col("date_decision")).fill_null(False).alias("is_closed"),

        # DPD condition flags
        (pl.col("dpd_550P") >= 1).fill_null(False).alias("dpd1_active"),
        (pl.col("dpd_550P") >= 30).fill_null(False).alias("dpd30_active"),
        (pl.col("dpd_550P") >= 90).fill_null(False).alias("dpd90_active"),

        (pl.col("dpd_733P") >= 1).fill_null(False).alias("dpd1_closed"),
        (pl.col("dpd_733P") >= 30).fill_null(False).alias("dpd30_closed"),
        (pl.col("dpd_733P") >= 90).fill_null(False).alias("dpd90_closed"),
    ])
    .with_columns([
        ((pl.col("date_decision") - pl.col("event_date")).dt.total_days() / 365.25).alias("age_years")
    ])
)

cb1f.select([
    pl.len().alias("rows"),
    pl.col("is_active").sum().alias("active_rows"),
    pl.col("is_closed").sum().alias("closed_rows"),
    pl.col("age_years").null_count().alias("age_nulls"),
]).head()


rows,active_rows,closed_rows,age_nulls
u32,u32,u32,u32
85791,79399,2313,3892


In [8]:
#aggregate to one row per case_id (active/closed + windows + DPD-conditional)
cb1_agg = (
    cb1f.group_by("case_id").agg([
        pl.len().alias("cb1_row_count"),

        pl.col("is_active").sum().alias("cb1_active_count"),
        pl.col("is_closed").sum().alias("cb1_closed_count"),

        # active DPD counts
        (pl.col("is_active") & pl.col("dpd30_active")).sum().alias("cb1_active_dpd30_count_all"),
        (pl.col("is_active") & pl.col("dpd90_active")).sum().alias("cb1_active_dpd90_count_all"),

        # closed DPD counts
        (pl.col("is_closed") & pl.col("dpd30_closed")).sum().alias("cb1_closed_dpd30_count_all"),
        (pl.col("is_closed") & pl.col("dpd90_closed")).sum().alias("cb1_closed_dpd90_count_all"),

        # <=1y window
        (pl.col("is_active") & (pl.col("age_years") >= 0) & (pl.col("age_years") <= 1)).sum().alias("cb1_active_count_le1y"),
        (pl.col("is_closed") & (pl.col("age_years") >= 0) & (pl.col("age_years") <= 1)).sum().alias("cb1_closed_count_le1y"),

        # 1-3y window
        (pl.col("is_active") & (pl.col("age_years") > 1) & (pl.col("age_years") <= 3)).sum().alias("cb1_active_count_1to3y"),
        (pl.col("is_closed") & (pl.col("age_years") > 1) & (pl.col("age_years") <= 3)).sum().alias("cb1_closed_count_1to3y"),

        # >3y window
        (pl.col("is_active") & (pl.col("age_years") > 3)).sum().alias("cb1_active_count_gt3y"),
        (pl.col("is_closed") & (pl.col("age_years") > 3)).sum().alias("cb1_closed_count_gt3y"),
    ])
)

print(cb1_agg.shape)
cb1_agg.head()


(36500, 14)


case_id,cb1_row_count,cb1_active_count,cb1_closed_count,cb1_active_dpd30_count_all,cb1_active_dpd90_count_all,cb1_closed_dpd30_count_all,cb1_closed_dpd90_count_all,cb1_active_count_le1y,cb1_closed_count_le1y,cb1_active_count_1to3y,cb1_closed_count_1to3y,cb1_active_count_gt3y,cb1_closed_count_gt3y
i64,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32
1264418,5,5,0,1,1,0,0,0,0,0,0,0,0
1744684,2,2,0,0,0,0,0,1,0,0,0,0,0
895812,3,3,0,0,0,0,0,0,0,0,0,0,0
1442936,3,3,0,0,0,0,0,1,0,0,0,0,0
994768,2,2,0,0,0,0,0,0,0,0,0,0,0


In [9]:
#merge this feature block into base
model_df = (
    base.join(cb1_agg, on="case_id", how="left")
    .with_columns(pl.col("^cb1_.*$").fill_null(0))
)

print(model_df.shape)
model_df.head()


(1526659, 17)


case_id,WEEK_NUM,date_decision,target,cb1_row_count,cb1_active_count,cb1_closed_count,cb1_active_dpd30_count_all,cb1_active_dpd90_count_all,cb1_closed_dpd30_count_all,cb1_closed_dpd90_count_all,cb1_active_count_le1y,cb1_closed_count_le1y,cb1_active_count_1to3y,cb1_closed_count_1to3y,cb1_active_count_gt3y,cb1_closed_count_gt3y
i64,i64,str,i64,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32
0,0,"""2019-01-03""",0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,"""2019-01-03""",0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,"""2019-01-04""",0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,"""2019-01-03""",0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,"""2019-01-04""",1,0,0,0,0,0,0,0,0,0,0,0,0,0


In [10]:
# time-aware split (important for this competition)
# Why: WEEK_NUM captures temporal drift; validating on later weeks is closer to real deployment behavior.
cut_week = model_df.select(pl.col("WEEK_NUM").quantile(0.80)).item()
train_df = model_df.filter(pl.col("WEEK_NUM") <= cut_week)
valid_df = model_df.filter(pl.col("WEEK_NUM") > cut_week)

print("cut_week:", cut_week)
print("train shape:", train_df.shape)
print("valid shape:", valid_df.shape)


cut_week: 60.0
train shape: (1235013, 17)
valid shape: (291646, 17)


In [11]:
#baseline vs +cb1 block
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import roc_auc_score

def run_auc(train_df, valid_df, feature_cols):
    X_train = train_df.select(feature_cols).to_pandas()
    y_train = train_df["target"].to_pandas()
    X_valid = valid_df.select(feature_cols).to_pandas()
    y_valid = valid_df["target"].to_pandas()

    clf = HistGradientBoostingClassifier(
        max_depth=6,
        learning_rate=0.05,
        max_iter=200,
        random_state=42
    )
    clf.fit(X_train, y_train)
    pred = clf.predict_proba(X_valid)[:, 1]
    return roc_auc_score(y_valid, pred)

baseline_cols = ["WEEK_NUM"]
cb1_cols = [c for c in model_df.columns if c.startswith("cb1_")]
plus_cb1_cols = baseline_cols + cb1_cols

auc_baseline = run_auc(train_df, valid_df, baseline_cols)
auc_plus_cb1 = run_auc(train_df, valid_df, plus_cb1_cols)

print("AUC baseline:", round(auc_baseline, 6))
print("AUC + cb1   :", round(auc_plus_cb1, 6))
print("Delta       :", round(auc_plus_cb1 - auc_baseline, 6))


AUC baseline: 0.5
AUC + cb1   : 0.521244
Delta       : 0.021244


### Result Interpretation
- Adding `cb1` raised validation AUC from 0.5 (baseline, `WEEK_NUM` only) to 0.521244, a delta of 0.021244, so this block adds predictive signal.
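
A note on the baseline value of exactly 0.5: every validation week lies after `cut_week`, so a tree model trained only on `WEEK_NUM` has no split that separates validation rows and, as far as we can tell, assigns them all the same score. The sketch below is a dependency-free illustration of why identical scores give an AUC of 0.5. The helper `roc_auc` is a hypothetical pairwise implementation written for this example, not scikit-learn's `roc_auc_score`, and the labels and scores are made up.

```python
def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

y = [0, 0, 1, 0, 1]  # toy labels, not from the dataset

# Every row gets the same score, as happens when the only feature cannot split the rows
auc_constant = roc_auc(y, [0.03] * len(y))

# Scores that rank both positives above every negative
auc_informative = roc_auc(y, [0.1, 0.2, 0.9, 0.3, 0.4])

print(auc_constant, auc_informative)
```

Under this reading, the 0.5 baseline is a floor rather than a meaningful competitor, so deltas against it mainly show that a block carries any signal at all.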


In [12]:
#applprev_1_0 prep + flags
ap0 = pl.read_csv(TABLES["applprev_1_0"])

ap0f = (
    ap0.join(base_dates, on="case_id", how="left")
    .with_columns([
        pl.col("actualdpd_943P").cast(pl.Float64, strict=False),
        pl.col("maxdpdtolerance_577P").cast(pl.Float64, strict=False),
        pl.col("dateactivated_425D").str.strptime(pl.Date, strict=False),
        pl.col("approvaldate_319D").str.strptime(pl.Date, strict=False),
        pl.col("creationdate_885D").str.strptime(pl.Date, strict=False),
    ])
    .with_columns([
        # use activated date first, then approval, then creation
        pl.coalesce([
            pl.col("dateactivated_425D"),
            pl.col("approvaldate_319D"),
            pl.col("creationdate_885D")
        ]).alias("event_date"),

        # temporary active/closed mapping from status code
        pl.col("status_219L").is_in(["A"]).alias("is_active"),
        pl.col("status_219L").is_in(["D","K","T"]).alias("is_closed"),

        (pl.col("actualdpd_943P") >= 1).fill_null(False).alias("dpd1"),
        (pl.col("actualdpd_943P") >= 30).fill_null(False).alias("dpd30"),
        (pl.col("actualdpd_943P") >= 90).fill_null(False).alias("dpd90"),
    ])
    .with_columns([
        ((pl.col("date_decision") - pl.col("event_date")).dt.total_days() / 365.25).alias("age_years")
    ])
)

ap0f.select([
    pl.len().alias("rows"),
    pl.col("is_active").sum().alias("active_rows"),
    pl.col("is_closed").sum().alias("closed_rows"),
    pl.col("age_years").null_count().alias("age_nulls"),
]).head()


rows,active_rows,closed_rows,age_nulls
u32,u32,u32,u32
3887684,431299,3432858,35


In [13]:
#aggregate applprev_1_0
ap0_agg = (
    ap0f.group_by("case_id").agg([
        pl.len().alias("ap0_row_count"),
        pl.col("is_active").sum().alias("ap0_active_count"),
        pl.col("is_closed").sum().alias("ap0_closed_count"),

        (pl.col("is_active") & pl.col("dpd30")).sum().alias("ap0_active_dpd30_count_all"),
        (pl.col("is_active") & pl.col("dpd90")).sum().alias("ap0_active_dpd90_count_all"),
        (pl.col("is_closed") & pl.col("dpd30")).sum().alias("ap0_closed_dpd30_count_all"),
        (pl.col("is_closed") & pl.col("dpd90")).sum().alias("ap0_closed_dpd90_count_all"),

        (pl.col("is_active") & (pl.col("age_years") >= 0) & (pl.col("age_years") <= 1)).sum().alias("ap0_active_count_le1y"),
        (pl.col("is_closed") & (pl.col("age_years") >= 0) & (pl.col("age_years") <= 1)).sum().alias("ap0_closed_count_le1y"),

        (pl.col("is_active") & (pl.col("age_years") > 1) & (pl.col("age_years") <= 3)).sum().alias("ap0_active_count_1to3y"),
        (pl.col("is_closed") & (pl.col("age_years") > 1) & (pl.col("age_years") <= 3)).sum().alias("ap0_closed_count_1to3y"),

        (pl.col("is_active") & (pl.col("age_years") > 3)).sum().alias("ap0_active_count_gt3y"),
        (pl.col("is_closed") & (pl.col("age_years") > 3)).sum().alias("ap0_closed_count_gt3y"),
    ])
)

print(ap0_agg.shape)
ap0_agg.head()


(782997, 14)


case_id,ap0_row_count,ap0_active_count,ap0_closed_count,ap0_active_dpd30_count_all,ap0_active_dpd90_count_all,ap0_closed_dpd30_count_all,ap0_closed_dpd90_count_all,ap0_active_count_le1y,ap0_closed_count_le1y,ap0_active_count_1to3y,ap0_closed_count_1to3y,ap0_active_count_gt3y,ap0_closed_count_gt3y
i64,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32
1277384,2,0,2,0,0,0,0,0,1,0,0,0,1
176961,8,0,8,0,0,0,0,0,1,0,2,0,5
1603507,1,1,0,0,0,0,0,0,0,1,0,0,0
2549778,1,0,1,0,0,0,0,0,0,0,1,0,0
1614401,6,0,6,0,0,0,0,0,1,0,1,0,4


In [14]:
#evaluate +ap0 on top of +cb1
model_df2 = (
    model_df.join(ap0_agg, on="case_id", how="left")
    .with_columns(pl.col("^ap0_.*$").fill_null(0))
)

train_df2 = model_df2.filter(pl.col("WEEK_NUM") <= cut_week)
valid_df2 = model_df2.filter(pl.col("WEEK_NUM") > cut_week)

ap0_cols = [c for c in model_df2.columns if c.startswith("ap0_")]
plus_cb1_ap0_cols = baseline_cols + cb1_cols + ap0_cols

auc_plus_cb1_ap0 = run_auc(train_df2, valid_df2, plus_cb1_ap0_cols)

print("AUC + cb1         :", round(auc_plus_cb1, 6))
print("AUC + cb1 + ap0   :", round(auc_plus_cb1_ap0, 6))
print("Delta vs +cb1     :", round(auc_plus_cb1_ap0 - auc_plus_cb1, 6))


AUC + cb1         : 0.521244
AUC + cb1 + ap0   : 0.521315
Delta vs +cb1     : 7.1e-05


### Result Interpretation
- `ap0` added only 0.000071 AUC on top of `cb1`, a gain that may be within run-to-run noise.
- Keep it as an optional, weak block and re-check its contribution in the final model.



In [16]:
# Cell A: build ap1_agg
ap1 = pl.read_csv(TABLES["applprev_1_1"])

ap1f = (
    ap1.join(base_dates, on="case_id", how="left")
    .with_columns([
        pl.col("actualdpd_943P").cast(pl.Float64, strict=False),
        pl.col("maxdpdtolerance_577P").cast(pl.Float64, strict=False),
        pl.col("dateactivated_425D").str.strptime(pl.Date, strict=False),
        pl.col("approvaldate_319D").str.strptime(pl.Date, strict=False),
        pl.col("creationdate_885D").str.strptime(pl.Date, strict=False),
    ])
    .with_columns([
        pl.coalesce([
            pl.col("dateactivated_425D"),
            pl.col("approvaldate_319D"),
            pl.col("creationdate_885D")
        ]).alias("event_date"),

        pl.col("status_219L").is_in(["A"]).alias("is_active"),
        pl.col("status_219L").is_in(["D", "K", "T"]).alias("is_closed"),

        (pl.col("actualdpd_943P") >= 30).fill_null(False).alias("dpd30"),
        (pl.col("actualdpd_943P") >= 90).fill_null(False).alias("dpd90"),
    ])
    .with_columns([
        ((pl.col("date_decision") - pl.col("event_date")).dt.total_days() / 365.25).alias("age_years")
    ])
)

ap1_agg = (
    ap1f.group_by("case_id").agg([
        pl.len().alias("ap1_row_count"),
        pl.col("is_active").sum().alias("ap1_active_count"),
        pl.col("is_closed").sum().alias("ap1_closed_count"),
        (pl.col("is_active") & pl.col("dpd30")).sum().alias("ap1_active_dpd30_count_all"),
        (pl.col("is_active") & pl.col("dpd90")).sum().alias("ap1_active_dpd90_count_all"),
        (pl.col("is_closed") & pl.col("dpd30")).sum().alias("ap1_closed_dpd30_count_all"),
        (pl.col("is_closed") & pl.col("dpd90")).sum().alias("ap1_closed_dpd90_count_all"),
    ])
)

print(ap1_agg.shape)
ap1_agg.head()


(438525, 8)


case_id,ap1_row_count,ap1_active_count,ap1_closed_count,ap1_active_dpd30_count_all,ap1_active_dpd90_count_all,ap1_closed_dpd30_count_all,ap1_closed_dpd90_count_all
i64,u32,u32,u32,u32,u32,u32,u32
213138,18,2,16,0,0,0,0
1766847,3,0,3,0,0,0,0
1776416,5,1,4,0,0,0,0
1806558,6,0,6,0,0,0,0
1845218,5,1,4,0,0,0,0


In [17]:
# Cell B: merge ap1 features
model_df3 = (
    model_df2.join(ap1_agg, on="case_id", how="left")
    .with_columns(pl.col("^ap1_.*$").fill_null(0))
)

train_df3 = model_df3.filter(pl.col("WEEK_NUM") <= cut_week)
valid_df3 = model_df3.filter(pl.col("WEEK_NUM") > cut_week)

ap1_cols = [c for c in model_df3.columns if c.startswith("ap1_")]
plus_cb1_ap0_ap1_cols = baseline_cols + cb1_cols + ap0_cols + ap1_cols

print("n ap1 cols:", len(ap1_cols))
print("total model cols:", len(plus_cb1_ap0_ap1_cols))


n ap1 cols: 7
total model cols: 34


In [18]:
# Cell C: evaluate
auc_plus_cb1_ap0_ap1 = run_auc(train_df3, valid_df3, plus_cb1_ap0_ap1_cols)

print("AUC + cb1 + ap0       :", round(auc_plus_cb1_ap0, 6))
print("AUC + cb1 + ap0 + ap1 :", round(auc_plus_cb1_ap0_ap1, 6))
print("Delta vs previous      :", round(auc_plus_cb1_ap0_ap1 - auc_plus_cb1_ap0, 6))


AUC + cb1 + ap0       : 0.521315
AUC + cb1 + ap0 + ap1 : 0.56944
Delta vs previous      : 0.048125


### Result Interpretation
- `ap1` raised AUC from 0.521315 to 0.569440 (delta 0.048125), the largest gain so far, and should be retained.


In [19]:
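# Sanity check: the feature list must not contain the target, the join key, or the raw decision date
[c for c in plus_cb1_ap0_ap1_cols if "target" in c.lower() or "case_id" in c.lower() or "date_decision" in c.lower()]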


[]

## 4) Leak-Safe Reusable Feature Engineering Function

Refactor into a reusable function that enforces decision-time safety (`age_years >= 0` when event dates exist).


### Methodology: Time-Windowed Aggregation

We transform event recency into windows using:
`age_years = (date_decision - event_date) / 365.25`

Chosen windows:
- `<=1y` (recent behavior)
- `1-3y` (mid-term behavior)
- `>3y` (older behavior)

Justification:
- Coarse windows capture recency while controlling feature explosion and overfitting risk.
- Drift-aware: weekly target behavior changes over time, so recency matters.
- This phase uses a practical baseline split; exhaustive window search was not performed.

Leak-safety:
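- When event dates exist, only rows with `age_years >= 0` are used (known at decision time).
- Rows with no event date have a null `age_years` and drop out of every masked count; they still appear in `*_row_count_all`.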
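
The windowing and leak-safety rules can be checked on a few dates by hand. The following is a small sketch in plain Python; the helper names `age_years` and `window_label` are hypothetical and the dates are made up, but the arithmetic and the window boundaries follow the definitions above.

```python
from datetime import date
from typing import Optional

def age_years(decision: date, event: Optional[date]) -> Optional[float]:
    """Years between the event and the decision date; None if the event date is missing."""
    if event is None:
        return None
    return (decision - event).days / 365.25

def window_label(age: Optional[float]) -> Optional[str]:
    """Map an age to its window; None means the row is excluded from windowed counts."""
    if age is None or age < 0:  # unknown date, or event after the decision (leak risk)
        return None
    if age <= 1:
        return "le1y"
    if age <= 3:
        return "1to3y"
    return "gt3y"

decision = date(2019, 1, 3)
events = [
    date(2018, 7, 1),  # about half a year before the decision
    date(2017, 1, 3),  # about two years before
    date(2015, 1, 3),  # about four years before
    date(2019, 2, 1),  # after the decision, must be excluded
    None,              # no event date available
]
labels = [window_label(age_years(decision, e)) for e in events]
print(labels)
```

The windows are half-open on the left (`1 < age <= 3`), so an event can never be counted in two windows.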


In [20]:
# Reusable leak-safe aggregation template
# What it does:
# 1) Reads a source table and joins date_decision from base
# 2) Creates active/closed flags and optional DPD flags
# 3) Builds decision-time-safe window features
# 4) Aggregates to one row per case_id for model training
def build_contract_features(
    table_path: Path,
    prefix: str,
    base_dates: pl.DataFrame,
    event_date_cols: list[str],        # ordered priority
    dpd_col: str | None = None,        # e.g., "actualdpd_943P"
    active_flag_expr: pl.Expr | None = None,
    closed_flag_expr: pl.Expr | None = None,
):
    df = pl.read_csv(table_path).join(base_dates, on="case_id", how="left")

    # Parse event dates
    parse_exprs = [pl.col(c).str.strptime(pl.Date, strict=False).alias(c) for c in event_date_cols if c in df.columns]
    cast_exprs = [pl.col(dpd_col).cast(pl.Float64, strict=False).alias(dpd_col)] if (dpd_col is not None and dpd_col in df.columns) else []

    df = df.with_columns(parse_exprs + cast_exprs)

    # Event date fallback by priority
    available_dates = [pl.col(c) for c in event_date_cols if c in df.columns]
    df = df.with_columns(pl.coalesce(available_dates).alias("event_date"))

    # Default flags if not provided
    if active_flag_expr is None:
        active_flag_expr = pl.lit(False)
    if closed_flag_expr is None:
        closed_flag_expr = pl.lit(False)

    df = df.with_columns([
        active_flag_expr.fill_null(False).alias("is_active"),
        closed_flag_expr.fill_null(False).alias("is_closed"),
        ((pl.col("date_decision") - pl.col("event_date")).dt.total_days() / 365.25).alias("age_years"),
    ])

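    # Leak-safe mask: only records known at decision time.
    # Rows with a null event_date give a null mask value, which sum() skips,
    # so they are left out of every masked count below.
    known_mask = (pl.col("age_years") >= 0)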

    # DPD flags if present
    if dpd_col is not None and dpd_col in df.columns:
        df = df.with_columns([
            (pl.col(dpd_col) >= 30).fill_null(False).alias("dpd30"),
            (pl.col(dpd_col) >= 90).fill_null(False).alias("dpd90"),
        ])
    else:
        df = df.with_columns([
            pl.lit(False).alias("dpd30"),
            pl.lit(False).alias("dpd90"),
        ])

    agg = df.group_by("case_id").agg([
        pl.len().alias(f"{prefix}_row_count_all"),
        known_mask.sum().alias(f"{prefix}_known_count"),

        (known_mask & pl.col("is_active")).sum().alias(f"{prefix}_active_count_all"),
        (known_mask & pl.col("is_closed")).sum().alias(f"{prefix}_closed_count_all"),

        (known_mask & pl.col("is_active") & pl.col("dpd30")).sum().alias(f"{prefix}_active_dpd30_count_all"),
        (known_mask & pl.col("is_active") & pl.col("dpd90")).sum().alias(f"{prefix}_active_dpd90_count_all"),
        (known_mask & pl.col("is_closed") & pl.col("dpd30")).sum().alias(f"{prefix}_closed_dpd30_count_all"),
        (known_mask & pl.col("is_closed") & pl.col("dpd90")).sum().alias(f"{prefix}_closed_dpd90_count_all"),

        # time windows
        (known_mask & pl.col("is_active") & (pl.col("age_years") <= 1)).sum().alias(f"{prefix}_active_count_le1y"),
        (known_mask & pl.col("is_closed") & (pl.col("age_years") <= 1)).sum().alias(f"{prefix}_closed_count_le1y"),

        (known_mask & pl.col("is_active") & (pl.col("age_years") > 1) & (pl.col("age_years") <= 3)).sum().alias(f"{prefix}_active_count_1to3y"),
        (known_mask & pl.col("is_closed") & (pl.col("age_years") > 1) & (pl.col("age_years") <= 3)).sum().alias(f"{prefix}_closed_count_1to3y"),

        (known_mask & pl.col("is_active") & (pl.col("age_years") > 3)).sum().alias(f"{prefix}_active_count_gt3y"),
        (known_mask & pl.col("is_closed") & (pl.col("age_years") > 3)).sum().alias(f"{prefix}_closed_count_gt3y"),
    ])

    # Optional rates (safe division)
    agg = agg.with_columns([
        pl.when(pl.col(f"{prefix}_active_count_all") > 0)
        .then(pl.col(f"{prefix}_active_dpd30_count_all") / pl.col(f"{prefix}_active_count_all"))
        .otherwise(0.0)
        .alias(f"{prefix}_active_dpd30_rate"),

        pl.when(pl.col(f"{prefix}_closed_count_all") > 0)
        .then(pl.col(f"{prefix}_closed_dpd30_count_all") / pl.col(f"{prefix}_closed_count_all"))
        .otherwise(0.0)
        .alias(f"{prefix}_closed_dpd30_rate"),
    ])

    return agg


In [21]:
# cb1: infer active/closed from maturity date vs decision date
cb1_agg_v2 = build_contract_features(
    table_path=TABLES["credit_bureau_b_1"],
    prefix="cb1",
    base_dates=base_dates,
    event_date_cols=["lastupdate_260D", "contractdate_551D"],
    dpd_col="dpd_550P",  # active-oriented DPD field
    active_flag_expr=(pl.col("contractmaturitydate_151D").str.strptime(pl.Date, strict=False) >= pl.col("date_decision")),
    closed_flag_expr=(pl.col("contractmaturitydate_151D").str.strptime(pl.Date, strict=False) < pl.col("date_decision")),
)

ap0_agg_v2 = build_contract_features(
    table_path=TABLES["applprev_1_0"],
    prefix="ap0",
    base_dates=base_dates,
    event_date_cols=["dateactivated_425D", "approvaldate_319D", "creationdate_885D"],
    dpd_col="actualdpd_943P",
    active_flag_expr=pl.col("status_219L").is_in(["A"]),
    closed_flag_expr=pl.col("status_219L").is_in(["D", "K", "T"]),
)

ap1_agg_v2 = build_contract_features(
    table_path=TABLES["applprev_1_1"],
    prefix="ap1",
    base_dates=base_dates,
    event_date_cols=["dateactivated_425D", "approvaldate_319D", "creationdate_885D"],
    dpd_col="actualdpd_943P",
    active_flag_expr=pl.col("status_219L").is_in(["A"]),
    closed_flag_expr=pl.col("status_219L").is_in(["D", "K", "T"]),
)

print(cb1_agg_v2.shape, ap0_agg_v2.shape, ap1_agg_v2.shape)


(36500, 17) (782997, 17) (438525, 17)


In [22]:
model_v2 = (
    base
    .join(cb1_agg_v2, on="case_id", how="left")
    .join(ap0_agg_v2, on="case_id", how="left")
    .join(ap1_agg_v2, on="case_id", how="left")
    .with_columns([
        pl.col("^cb1_.*$").fill_null(0),
        pl.col("^ap0_.*$").fill_null(0),
        pl.col("^ap1_.*$").fill_null(0),
    ])
)

cut_week_v2 = model_v2.select(pl.col("WEEK_NUM").quantile(0.80)).item()
tr = model_v2.filter(pl.col("WEEK_NUM") <= cut_week_v2)
va = model_v2.filter(pl.col("WEEK_NUM") > cut_week_v2)

base_cols = ["WEEK_NUM"]
cb1_cols_v2 = [c for c in model_v2.columns if c.startswith("cb1_")]
ap0_cols_v2 = [c for c in model_v2.columns if c.startswith("ap0_")]
ap1_cols_v2 = [c for c in model_v2.columns if c.startswith("ap1_")]

auc_base_v2 = run_auc(tr, va, base_cols)
auc_cb1_v2 = run_auc(tr, va, base_cols + cb1_cols_v2)
auc_cb1_ap0_v2 = run_auc(tr, va, base_cols + cb1_cols_v2 + ap0_cols_v2)
auc_all_v2 = run_auc(tr, va, base_cols + cb1_cols_v2 + ap0_cols_v2 + ap1_cols_v2)

print("AUC baseline:", round(auc_base_v2, 6))
print("AUC + cb1:", round(auc_cb1_v2, 6))
print("AUC + cb1 + ap0:", round(auc_cb1_ap0_v2, 6))
print("AUC + cb1 + ap0 + ap1:", round(auc_all_v2, 6))


AUC baseline: 0.5
AUC + cb1: 0.521059
AUC + cb1 + ap0: 0.521084
AUC + cb1 + ap0 + ap1: 0.613932


### Leak-Safe Rebuild Interpretation
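- With the leak-safe template, AUC for all three blocks is 0.613932, up from 0.569440 in the prototype. The rebuilt `ap1` block also adds time-window and rate features that the prototype `ap1` block lacked, so the two runs are not like-for-like.
- Because rows dated after `date_decision` are masked out and performance did not drop, the feature value is unlikely to come from post-decision leakage alone.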


### Methodology: DPD-Conditional Aggregation

DPD = **Days Past Due** (how many days a payment is late).

How we engineer DPD-conditional features:
1. Select DPD column(s) in the table (e.g., `actualdpd_943P`, `dpd_550P`).
2. Create threshold flags: `DPD>=30` and `DPD>=90`.
3. Aggregate by `case_id` to produce counts/rates, optionally split by active/closed and time windows.

Example outputs:
- `*_active_dpd30_count_all`
- `*_closed_dpd90_count_all`
- `*_active_dpd30_rate`
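
The three steps can be traced on a handful of toy rows. This sketch uses plain Python rather than Polars so the counting logic is explicit; the function `dpd_features`, the prefix `x`, and the row values are all hypothetical. It also simplifies by treating every row as either active or closed, whereas the notebook's flags allow rows that are neither.

```python
from collections import defaultdict

# Toy rows: (case_id, is_active, dpd). Values are illustrative, not from the dataset.
rows = [
    (1, True, 0.0),
    (1, True, 45.0),
    (1, False, 120.0),
    (2, True, None),   # missing DPD counts as "not past due", like fill_null(False)
    (2, False, 10.0),
]

def dpd_features(rows, prefix="x"):
    """Aggregate DPD threshold flags to one feature dict per case_id."""
    counts = defaultdict(lambda: defaultdict(int))
    for case_id, is_active, dpd in rows:
        state = "active" if is_active else "closed"
        c = counts[case_id]
        c[f"{state}_count"] += 1
        for thr in (30, 90):  # threshold flags: DPD>=30 and DPD>=90
            if dpd is not None and dpd >= thr:
                c[f"{state}_dpd{thr}_count"] += 1

    out = {}
    for case_id, c in counts.items():
        feats = {}
        for state in ("active", "closed"):
            n = c[f"{state}_count"]
            n30 = c[f"{state}_dpd30_count"]
            feats[f"{prefix}_{state}_dpd30_count_all"] = n30
            feats[f"{prefix}_{state}_dpd90_count_all"] = c[f"{state}_dpd90_count"]
            # Safe division: a case with no rows in this state gets rate 0.0
            feats[f"{prefix}_{state}_dpd30_rate"] = n30 / n if n > 0 else 0.0
        out[case_id] = feats
    return out

features = dpd_features(rows)
print(features[1])
```

Note that a rate of 0.0 is ambiguous here: it can mean "no late contracts" or "no contracts in this state", which is why the counts are kept alongside the rates.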


## 5) Incremental Block Evaluation

Add one table block at a time, retrain, and compare AUC deltas. Keep blocks that improve AUC and drop those that hurt performance.
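
The keep-or-drop loop can be written as a small greedy forward pass. What follows is a sketch under stated assumptions, not the procedure the cells below literally run: `select_blocks`, `min_delta`, and the stand-in scorer `fake_score` with its made-up gains are all hypothetical. In this notebook the scorer would wrap the existing `run_auc`, for example `lambda cols: run_auc(tr, va, cols)`.

```python
from typing import Callable, Dict, List, Tuple

def select_blocks(
    blocks: Dict[str, List[str]],
    base_cols: List[str],
    score_fn: Callable[[List[str]], float],
    min_delta: float = 0.0,
) -> Tuple[List[str], List[Tuple[str, float]]]:
    """Greedy forward pass: keep a block only if it improves the score by more than min_delta."""
    kept_cols = list(base_cols)
    best = score_fn(kept_cols)
    kept: List[str] = []
    log: List[Tuple[str, float]] = []
    for name, cols in blocks.items():
        trial = score_fn(kept_cols + cols)  # retrain with the candidate block added
        delta = trial - best
        log.append((name, delta))
        if delta > min_delta:
            kept.append(name)
            kept_cols = kept_cols + cols
            best = trial
    return kept, log

# Stand-in scorer with made-up per-column gains, used only to exercise the loop
FAKE_GAIN = {"a_1": 0.02, "b_1": 0.0, "c_1": -0.01}

def fake_score(cols: List[str]) -> float:
    return 0.5 + sum(FAKE_GAIN.get(c, 0.0) for c in cols)

kept, log = select_blocks(
    {"a": ["a_1"], "b": ["b_1"], "c": ["c_1"]},
    base_cols=["WEEK_NUM"],
    score_fn=fake_score,
)
print(kept)
```

The result depends on block order, since each block is judged against whatever was kept before it. Setting `min_delta` above zero would be one way to filter out very small gains, though the right threshold is a judgment call.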


In [23]:
# Build deposit_1 features 
dep_agg_v2 = build_contract_features(
    table_path=TABLES["deposit_1"],
    prefix="dep",
    base_dates=base_dates,
    event_date_cols=["openingdate_313D", "contractenddate_991D"],
    dpd_col=None,
    active_flag_expr=(pl.col("contractenddate_991D") >= pl.col("date_decision")),
    closed_flag_expr=(pl.col("contractenddate_991D") < pl.col("date_decision")),
)

print(dep_agg_v2.shape)
dep_agg_v2.head()


(105111, 17)


case_id,dep_row_count_all,dep_known_count,dep_active_count_all,dep_closed_count_all,dep_active_dpd30_count_all,dep_active_dpd90_count_all,dep_closed_dpd30_count_all,dep_closed_dpd90_count_all,dep_active_count_le1y,dep_closed_count_le1y,dep_active_count_1to3y,dep_closed_count_1to3y,dep_active_count_gt3y,dep_closed_count_gt3y,dep_active_dpd30_rate,dep_closed_dpd30_rate
i64,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,f64,f64
2683214,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0
926231,1,1,0,1,0,0,0,0,0,0,0,0,0,1,0.0,0.0
1487378,1,1,0,1,0,0,0,0,0,0,0,0,0,1,0.0,0.0
1536274,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0
2543743,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0


In [24]:
# Merge + evaluate dep block
model_v3 = (
    model_v2
    .join(dep_agg_v2, on="case_id", how="left")
    .with_columns(pl.col("^dep_.*$").fill_null(0))
)

tr3 = model_v3.filter(pl.col("WEEK_NUM") <= cut_week_v2)
va3 = model_v3.filter(pl.col("WEEK_NUM") > cut_week_v2)

dep_cols = [c for c in model_v3.columns if c.startswith("dep_")]
auc_all_v3 = run_auc(tr3, va3, base_cols + cb1_cols_v2 + ap0_cols_v2 + ap1_cols_v2 + dep_cols)

print("AUC prev (cb1+ap0+ap1):", round(auc_all_v2, 6))
print("AUC + dep             :", round(auc_all_v3, 6))
print("Delta                 :", round(auc_all_v3 - auc_all_v2, 6))


AUC prev (cb1+ap0+ap1): 0.613932
AUC + dep             : 0.618341
Delta                 : 0.004409


### Result Interpretation
- `deposit_1` raised AUC by 0.004409 and is retained.


In [25]:
# Build debitcard_1 features
db_agg_v2 = build_contract_features(
    table_path=TABLES["debitcard_1"],
    prefix="db",
    base_dates=base_dates,
    event_date_cols=["openingdate_857D"],
    dpd_col=None,
    # No explicit close date in this table; treat records as active history presence
    active_flag_expr=pl.lit(True),
    closed_flag_expr=pl.lit(False),
)

print(db_agg_v2.shape)
db_agg_v2.head()


(111772, 17)


case_id,db_row_count_all,db_known_count,db_active_count_all,db_closed_count_all,db_active_dpd30_count_all,db_active_dpd90_count_all,db_closed_dpd30_count_all,db_closed_dpd90_count_all,db_active_count_le1y,db_closed_count_le1y,db_active_count_1to3y,db_closed_count_1to3y,db_active_count_gt3y,db_closed_count_gt3y,db_active_dpd30_rate,db_closed_dpd30_rate
i64,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,f64,f64
1832249,1,1,1,0,0,0,0,0,0,0,0,0,1,0,0.0,0.0
1363371,2,2,2,0,0,0,0,0,0,0,0,0,2,0,0.0,0.0
2665925,1,1,1,0,0,0,0,0,0,0,0,0,1,0,0.0,0.0
1286158,1,1,1,0,0,0,0,0,0,0,0,0,1,0,0.0,0.0
1629987,1,1,1,0,0,0,0,0,0,0,0,0,1,0,0.0,0.0


In [26]:
# Merge + evaluate db block
model_v4 = (
    model_v3
    .join(db_agg_v2, on="case_id", how="left")
    .with_columns(pl.col("^db_.*$").fill_null(0))
)

tr4 = model_v4.filter(pl.col("WEEK_NUM") <= cut_week_v2)
va4 = model_v4.filter(pl.col("WEEK_NUM") > cut_week_v2)

db_cols = [c for c in model_v4.columns if c.startswith("db_")]
auc_all_v4 = run_auc(tr4, va4, base_cols + cb1_cols_v2 + ap0_cols_v2 + ap1_cols_v2 + dep_cols + db_cols)

print("AUC prev (+dep):", round(auc_all_v3, 6))
print("AUC + db       :", round(auc_all_v4, 6))
print("Delta          :", round(auc_all_v4 - auc_all_v3, 6))


AUC prev (+dep): 0.618341
AUC + db       : 0.617901
Delta          : -0.00044


### Result Interpretation
- `debitcard_1` reduced AUC by 0.00044, so this block is dropped.


In [27]:
# Build tax_registry_a_1 features
taxa_agg_v2 = build_contract_features(
    table_path=TABLES["tax_registry_a_1"],
    prefix="taxa",
    base_dates=base_dates,
    event_date_cols=["recorddate_4527225D"],
    dpd_col=None,
    # Not contract-status data; use presence as active-like signal
    active_flag_expr=pl.lit(True),
    closed_flag_expr=pl.lit(False),
)

print(taxa_agg_v2.shape)
taxa_agg_v2.head()


(457934, 17)


case_id,taxa_row_count_all,taxa_known_count,taxa_active_count_all,taxa_closed_count_all,taxa_active_dpd30_count_all,taxa_active_dpd90_count_all,taxa_closed_dpd30_count_all,taxa_closed_dpd90_count_all,taxa_active_count_le1y,taxa_closed_count_le1y,taxa_active_count_1to3y,taxa_closed_count_1to3y,taxa_active_count_gt3y,taxa_closed_count_gt3y,taxa_active_dpd30_rate,taxa_closed_dpd30_rate
i64,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,f64,f64
1738518,8,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0
881786,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0
1832237,5,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0
1682673,11,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0
2625928,7,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0


In [28]:
# Merge + evaluate taxa block (on top of model_v3, not model_v4)
model_v5 = (
    model_v3
    .join(taxa_agg_v2, on="case_id", how="left")
    .with_columns(pl.col("^taxa_.*$").fill_null(0))
)

tr5 = model_v5.filter(pl.col("WEEK_NUM") <= cut_week_v2)
va5 = model_v5.filter(pl.col("WEEK_NUM") > cut_week_v2)

taxa_cols = [c for c in model_v5.columns if c.startswith("taxa_")]
auc_all_v5 = run_auc(tr5, va5, base_cols + cb1_cols_v2 + ap0_cols_v2 + ap1_cols_v2 + dep_cols + taxa_cols)

print("AUC prev (without db):", round(auc_all_v3, 6))
print("AUC + taxa           :", round(auc_all_v5, 6))
print("Delta                :", round(auc_all_v5 - auc_all_v3, 6))


AUC prev (without db): 0.618341
AUC + taxa           : 0.630826
Delta                : 0.012485


### Result Interpretation
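- `tax_registry_a_1` raised AUC by 0.012485 and is retained.
- In the preview above, `taxa_known_count` is 0 for every row shown, so all date-dependent columns are zero there. This suggests `recorddate_4527225D` did not yield a usable age for those rows and that the lift comes mainly from `taxa_row_count_all`.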


In [29]:
# Build applprev_2 features
ap2 = pl.read_csv(TABLES["applprev_2"]).join(base_dates, on="case_id", how="left")

ap2_agg_v2 = (
    ap2.group_by("case_id").agg([
        pl.len().alias("ap2_row_count_all"),
        pl.len().alias("ap2_known_count"),  # no event date, treat as known
        pl.len().alias("ap2_active_count_all"),
        pl.lit(0).sum().cast(pl.Int64).alias("ap2_closed_count_all"),

        # no dpd in this table
        pl.lit(0).sum().cast(pl.Int64).alias("ap2_active_dpd30_count_all"),
        pl.lit(0).sum().cast(pl.Int64).alias("ap2_active_dpd90_count_all"),
        pl.lit(0).sum().cast(pl.Int64).alias("ap2_closed_dpd30_count_all"),
        pl.lit(0).sum().cast(pl.Int64).alias("ap2_closed_dpd90_count_all"),

        # no windows available
        pl.lit(0).sum().cast(pl.Int64).alias("ap2_active_count_le1y"),
        pl.lit(0).sum().cast(pl.Int64).alias("ap2_closed_count_le1y"),
        pl.lit(0).sum().cast(pl.Int64).alias("ap2_active_count_1to3y"),
        pl.lit(0).sum().cast(pl.Int64).alias("ap2_closed_count_1to3y"),
        pl.lit(0).sum().cast(pl.Int64).alias("ap2_active_count_gt3y"),
        pl.lit(0).sum().cast(pl.Int64).alias("ap2_closed_count_gt3y"),
    ])
    .with_columns([
        pl.lit(0.0).alias("ap2_active_dpd30_rate"),
        pl.lit(0.0).alias("ap2_closed_dpd30_rate"),
    ])
)

print(ap2_agg_v2.shape)
ap2_agg_v2.head()



(1221522, 17)


case_id,ap2_row_count_all,ap2_known_count,ap2_active_count_all,ap2_closed_count_all,ap2_active_dpd30_count_all,ap2_active_dpd90_count_all,ap2_closed_dpd30_count_all,ap2_closed_dpd90_count_all,ap2_active_count_le1y,ap2_closed_count_le1y,ap2_active_count_1to3y,ap2_closed_count_1to3y,ap2_active_count_gt3y,ap2_closed_count_gt3y,ap2_active_dpd30_rate,ap2_closed_dpd30_rate
i64,u32,u32,u32,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,f64,f64
1623175,6,6,6,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0
774961,5,5,5,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0
185363,15,15,15,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0
1335450,3,3,3,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0
1768410,10,10,10,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0


In [30]:
# Merge + evaluate ap2 block
model_v6 = (
    model_v5
    .join(ap2_agg_v2, on="case_id", how="left")
    .with_columns(pl.col("^ap2_.*$").fill_null(0))
)

tr6 = model_v6.filter(pl.col("WEEK_NUM") <= cut_week_v2)
va6 = model_v6.filter(pl.col("WEEK_NUM") > cut_week_v2)

ap2_cols = [c for c in model_v6.columns if c.startswith("ap2_")]
auc_all_v6 = run_auc(
    tr6, va6,
    base_cols + cb1_cols_v2 + ap0_cols_v2 + ap1_cols_v2 + dep_cols + taxa_cols + ap2_cols
)

print("AUC prev (+taxa):", round(auc_all_v5, 6))
print("AUC + ap2       :", round(auc_all_v6, 6))
print("Delta           :", round(auc_all_v6 - auc_all_v5, 6))


AUC prev (+taxa): 0.630826
AUC + ap2       : 0.646143
Delta           : 0.015317


### Result Interpretation
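- `applprev_2` raised AUC by 0.015317 and is retained.
- Because this block is built without an event date or DPD column, `ap2_row_count_all`, `ap2_known_count`, and `ap2_active_count_all` are identical and every other `ap2_*` column is constant zero, so the signal is the row count alone.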


In [31]:
# Cell 1: patch function to handle empty event_date_cols safely
def build_contract_features(
    table_path: Path,
    prefix: str,
    base_dates: pl.DataFrame,
    event_date_cols: list[str],
    dpd_col: str | None = None,
    active_flag_expr: pl.Expr | None = None,
    closed_flag_expr: pl.Expr | None = None,
):
    df = pl.read_csv(table_path).join(base_dates, on="case_id", how="left")

    parse_exprs = [pl.col(c).str.strptime(pl.Date, strict=False).alias(c) for c in event_date_cols if c in df.columns]
    cast_exprs = [pl.col(dpd_col).cast(pl.Float64, strict=False).alias(dpd_col)] if (dpd_col is not None and dpd_col in df.columns) else []
    df = df.with_columns(parse_exprs + cast_exprs)

    available_dates = [pl.col(c) for c in event_date_cols if c in df.columns]
    if len(available_dates) > 0:
        df = df.with_columns(pl.coalesce(available_dates).alias("event_date"))
        df = df.with_columns(((pl.col("date_decision") - pl.col("event_date")).dt.total_days() / 365.25).alias("age_years"))
        known_mask = (pl.col("age_years") >= 0)
    else:
        df = df.with_columns([
            pl.lit(None).cast(pl.Date).alias("event_date"),
            pl.lit(None).cast(pl.Float64).alias("age_years"),
        ])
        known_mask = pl.lit(True)

    if active_flag_expr is None:
        active_flag_expr = pl.lit(False)
    if closed_flag_expr is None:
        closed_flag_expr = pl.lit(False)

    df = df.with_columns([
        active_flag_expr.fill_null(False).alias("is_active"),
        closed_flag_expr.fill_null(False).alias("is_closed"),
    ])

    if dpd_col is not None and dpd_col in df.columns:
        df = df.with_columns([
            (pl.col(dpd_col) >= 30).fill_null(False).alias("dpd30"),
            (pl.col(dpd_col) >= 90).fill_null(False).alias("dpd90"),
        ])
    else:
        df = df.with_columns([
            pl.lit(False).alias("dpd30"),
            pl.lit(False).alias("dpd90"),
        ])

    agg = df.group_by("case_id").agg([
        pl.len().alias(f"{prefix}_row_count_all"),
        known_mask.sum().alias(f"{prefix}_known_count"),

        (known_mask & pl.col("is_active")).sum().alias(f"{prefix}_active_count_all"),
        (known_mask & pl.col("is_closed")).sum().alias(f"{prefix}_closed_count_all"),

        (known_mask & pl.col("is_active") & pl.col("dpd30")).sum().alias(f"{prefix}_active_dpd30_count_all"),
        (known_mask & pl.col("is_active") & pl.col("dpd90")).sum().alias(f"{prefix}_active_dpd90_count_all"),
        (known_mask & pl.col("is_closed") & pl.col("dpd30")).sum().alias(f"{prefix}_closed_dpd30_count_all"),
        (known_mask & pl.col("is_closed") & pl.col("dpd90")).sum().alias(f"{prefix}_closed_dpd90_count_all"),

        (known_mask & pl.col("is_active") & (pl.col("age_years") <= 1)).sum().alias(f"{prefix}_active_count_le1y"),
        (known_mask & pl.col("is_closed") & (pl.col("age_years") <= 1)).sum().alias(f"{prefix}_closed_count_le1y"),

        (known_mask & pl.col("is_active") & (pl.col("age_years") > 1) & (pl.col("age_years") <= 3)).sum().alias(f"{prefix}_active_count_1to3y"),
        (known_mask & pl.col("is_closed") & (pl.col("age_years") > 1) & (pl.col("age_years") <= 3)).sum().alias(f"{prefix}_closed_count_1to3y"),

        (known_mask & pl.col("is_active") & (pl.col("age_years") > 3)).sum().alias(f"{prefix}_active_count_gt3y"),
        (known_mask & pl.col("is_closed") & (pl.col("age_years") > 3)).sum().alias(f"{prefix}_closed_count_gt3y"),
    ])

    agg = agg.with_columns([
        pl.when(pl.col(f"{prefix}_active_count_all") > 0)
        .then(pl.col(f"{prefix}_active_dpd30_count_all") / pl.col(f"{prefix}_active_count_all"))
        .otherwise(0.0)
        .alias(f"{prefix}_active_dpd30_rate"),

        pl.when(pl.col(f"{prefix}_closed_count_all") > 0)
        .then(pl.col(f"{prefix}_closed_dpd30_count_all") / pl.col(f"{prefix}_closed_count_all"))
        .otherwise(0.0)
        .alias(f"{prefix}_closed_dpd30_rate"),
    ])

    return agg


In [32]:
# Cell 2: person_1 features
p1_agg_v2 = build_contract_features(
    table_path=TABLES["person_1"],
    prefix="p1",
    base_dates=base_dates,
    event_date_cols=["empl_employedfrom_271D", "birthdate_87D", "birth_259D"],
    dpd_col=None,
    active_flag_expr=(pl.col("num_group1") == 0),   # applicant
    closed_flag_expr=(pl.col("num_group1") != 0),   # related persons
)

print(p1_agg_v2.shape)
p1_agg_v2.head()


(1526659, 17)


case_id,p1_row_count_all,p1_known_count,p1_active_count_all,p1_closed_count_all,p1_active_dpd30_count_all,p1_active_dpd90_count_all,p1_closed_dpd30_count_all,p1_closed_dpd90_count_all,p1_active_count_le1y,p1_closed_count_le1y,p1_active_count_1to3y,p1_closed_count_1to3y,p1_active_count_gt3y,p1_closed_count_gt3y,p1_active_dpd30_rate,p1_closed_dpd30_rate
i64,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,f64,f64
241190,1,1,1,0,0,0,0,0,0,0,0,0,1,0,0.0,0.0
631682,2,2,1,1,0,0,0,0,1,0,0,0,0,1,0.0,0.0
1258636,1,1,1,0,0,0,0,0,0,0,0,0,1,0,0.0,0.0
1657783,1,1,1,0,0,0,0,0,0,0,0,0,1,0,0.0,0.0
1778896,1,1,1,0,0,0,0,0,0,0,0,0,1,0,0.0,0.0


In [33]:
# Cell 3: merge + evaluate person_1
model_v7 = (
    model_v6
    .join(p1_agg_v2, on="case_id", how="left")
    .with_columns(pl.col("^p1_.*$").fill_null(0))
)

tr7 = model_v7.filter(pl.col("WEEK_NUM") <= cut_week_v2)
va7 = model_v7.filter(pl.col("WEEK_NUM") > cut_week_v2)

p1_cols = [c for c in model_v7.columns if c.startswith("p1_")]

auc_all_v7 = run_auc(
    tr7, va7,
    base_cols + cb1_cols_v2 + ap0_cols_v2 + ap1_cols_v2 + dep_cols + taxa_cols + ap2_cols + p1_cols
)

print("AUC prev (+ap2):", round(auc_all_v6, 6))
print("AUC + p1       :", round(auc_all_v7, 6))
print("Delta          :", round(auc_all_v7 - auc_all_v6, 6))


AUC prev (+ap2): 0.646143
AUC + p1       : 0.646034
Delta          : -0.000108


### Result Interpretation
- `person_1` reduced AUC by 0.000108, so this block is dropped.


In [34]:
# Build person_2 features
p2_agg_v2 = build_contract_features(
    table_path=TABLES["person_2"],
    prefix="p2",
    base_dates=base_dates,
    event_date_cols=["empls_employedfrom_796D"],  # if missing, function handles it
    dpd_col=None,
    active_flag_expr=(pl.col("num_group1") == 0),
    closed_flag_expr=(pl.col("num_group1") != 0),
)

print(p2_agg_v2.shape)
p2_agg_v2.head()


(1435105, 17)


case_id,p2_row_count_all,p2_known_count,p2_active_count_all,p2_closed_count_all,p2_active_dpd30_count_all,p2_active_dpd90_count_all,p2_closed_dpd30_count_all,p2_closed_dpd90_count_all,p2_active_count_le1y,p2_closed_count_le1y,p2_active_count_1to3y,p2_closed_count_1to3y,p2_active_count_gt3y,p2_closed_count_gt3y,p2_active_dpd30_rate,p2_closed_dpd30_rate
i64,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,f64,f64
1248409,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0
1539159,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0
1736044,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0
1814159,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0
1636543,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0


In [35]:
# Merge + evaluate p2 (on model_v6, since p1 is dropped)
model_v8 = (
    model_v6
    .join(p2_agg_v2, on="case_id", how="left")
    .with_columns(pl.col("^p2_.*$").fill_null(0))
)

tr8 = model_v8.filter(pl.col("WEEK_NUM") <= cut_week_v2)
va8 = model_v8.filter(pl.col("WEEK_NUM") > cut_week_v2)

p2_cols = [c for c in model_v8.columns if c.startswith("p2_")]

auc_all_v8 = run_auc(
    tr8, va8,
    base_cols + cb1_cols_v2 + ap0_cols_v2 + ap1_cols_v2 + dep_cols + taxa_cols + ap2_cols + p2_cols
)

print("AUC prev (no p1):", round(auc_all_v6, 6))
print("AUC + p2        :", round(auc_all_v8, 6))
print("Delta           :", round(auc_all_v8 - auc_all_v6, 6))


AUC prev (no p1): 0.646143
AUC + p2        : 0.617628
Delta           : -0.028514


### Result Interpretation
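- `person_2` reduced AUC by 0.028514, so this block is dropped.
- In the preview above, `p2_known_count` is 0 for every row shown, so only `p2_row_count_all` carries information for those rows.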


## 6) Final Feature Set and Model

Assemble retained blocks only and report final AUC + feature count.


### Note: Intermediate Final (Superseded)
This section reflects an earlier checkpoint (`USE_AP0=True`).
Use **Section 8: Final Locked Configuration** as the latest final result.


In [36]:
# Final model with retained blocks (+ optional ap0)
USE_AP0 = True  # set False if you want to exclude weak block

# Build final feature list
final_feature_cols = (
    base_cols
    + cb1_cols_v2
    + ap1_cols_v2
    + dep_cols
    + taxa_cols
    + ap2_cols
)
if USE_AP0:
    final_feature_cols = final_feature_cols + ap0_cols_v2

# Build final dataframe from best "kept" model state
# model_v6 already has: cb1 + ap0 + ap1 + dep + taxa + ap2
# if not using ap0, just exclude ap0 columns at feature list level.
model_final = model_v6

tr_final = model_final.filter(pl.col("WEEK_NUM") <= cut_week_v2)
va_final = model_final.filter(pl.col("WEEK_NUM") > cut_week_v2)

auc_final = run_auc(tr_final, va_final, final_feature_cols)
print("Final AUC:", round(auc_final, 6))
print("Num features:", len(final_feature_cols))
print("Use ap0:", USE_AP0)


Final AUC: 0.646143
Num features: 97
Use ap0: True


### Final Outcome
- Final retained blocks: `cb1`, `ap1`, `dep`, `taxa`, `ap2` (+ optional `ap0`).
- At this checkpoint (`USE_AP0=True`), the final AUC is 0.646143 with 97 features.


## 7) Stability Validation (Rolling Time Splits)

Purpose:
- Validate feature choices across multiple time cut points, not only one split.
- Compare `with_ap0` vs `no_ap0` by `mean_auc` and `std_auc` over cuts `(50,60,70)`.
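
A rolling time split can be sketched in plain Python as follows. The toy frame (one row per week, weeks 0 to 91) and the function name `time_split` are assumptions for illustration; the notebook applies the same rule with `pl.col("WEEK_NUM")` filters.

```python
def time_split(rows, cut_week):
    """Split rows into train (WEEK_NUM <= cut_week) and validation (WEEK_NUM > cut_week)."""
    train = [r for r in rows if r["WEEK_NUM"] <= cut_week]
    valid = [r for r in rows if r["WEEK_NUM"] > cut_week]
    return train, valid

# Hypothetical toy frame: one row per week, weeks 0..91.
rows = [{"case_id": i, "WEEK_NUM": w} for i, w in enumerate(range(0, 92))]

sizes = {}
for cut in (50, 60, 70):
    train, valid = time_split(rows, cut)
    # Every validation week is strictly later than every training week.
    assert max(r["WEEK_NUM"] for r in train) < min(r["WEEK_NUM"] for r in valid)
    sizes[cut] = (len(train), len(valid))

print(sizes)
```

Later cuts give the model more training weeks and fewer validation weeks, so AUC differences across cuts mix a time effect with a sample-size effect.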


In [37]:
import numpy as np
import pandas as pd
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import roc_auc_score

def run_auc_for_cut(df, feature_cols, cut_week):
    tr = df.filter(pl.col("WEEK_NUM") <= cut_week)
    va = df.filter(pl.col("WEEK_NUM") > cut_week)

    Xtr = tr.select(feature_cols).to_pandas()
    ytr = tr["target"].to_pandas()
    Xva = va.select(feature_cols).to_pandas()
    yva = va["target"].to_pandas()

    clf = HistGradientBoostingClassifier(max_depth=6, learning_rate=0.05, max_iter=200, random_state=42)
    clf.fit(Xtr, ytr)
    p = clf.predict_proba(Xva)[:, 1]
    return roc_auc_score(yva, p)

def eval_stability(df, feature_cols, cuts=(50, 60, 70), label="model"):
    rows = []
    for c in cuts:
        auc = run_auc_for_cut(df, feature_cols, c)
        rows.append({"model": label, "cut_week": c, "auc": auc})
    out = pd.DataFrame(rows)
    summary = {
        "model": label,
        "mean_auc": out["auc"].mean(),
        "std_auc": out["auc"].std(ddof=0),
        "min_auc": out["auc"].min(),
        "max_auc": out["auc"].max(),
    }
    return out, pd.DataFrame([summary])


In [38]:
# Compare with and without ap0 under rolling stability
final_cols_with_ap0 = base_cols + cb1_cols_v2 + ap1_cols_v2 + dep_cols + taxa_cols + ap2_cols + ap0_cols_v2
final_cols_no_ap0 = base_cols + cb1_cols_v2 + ap1_cols_v2 + dep_cols + taxa_cols + ap2_cols

d1, s1 = eval_stability(model_v6, final_cols_with_ap0, cuts=(50, 60, 70), label="with_ap0")
d2, s2 = eval_stability(model_v6, final_cols_no_ap0, cuts=(50, 60, 70), label="no_ap0")

display(pd.concat([d1, d2], ignore_index=True))
display(pd.concat([s1, s2], ignore_index=True).sort_values("mean_auc", ascending=False))


Unnamed: 0,model,cut_week,auc
0,with_ap0,50,0.621022
1,with_ap0,60,0.646143
2,with_ap0,70,0.63867
3,no_ap0,50,0.625022
4,no_ap0,60,0.646517
5,no_ap0,70,0.635056


Unnamed: 0,model,mean_auc,std_auc,min_auc,max_auc
1,no_ap0,0.635532,0.008782,0.625022,0.646517
0,with_ap0,0.635278,0.010532,0.621022,0.646143


### Stability Interpretation
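- Rule: keep `no_ap0` if it has higher `mean_auc` and lower or equal `std_auc` than `with_ap0`.
- Result: `no_ap0` meets both conditions (mean 0.635532 vs 0.635278, std 0.008782 vs 0.010532), so `ap0` is dropped. The gain in mean is small; the main benefit is the lower spread across cuts.
- This selection criterion prioritizes both performance and temporal robustness.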
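
The comparison can be re-checked from the per-cut AUCs printed above using only the standard library. The helper names `summarize` and `prefer` are hypothetical; `pstdev` is used because `eval_stability` calls `std(ddof=0)`.

```python
from statistics import fmean, pstdev

# AUCs copied from the rolling-split output above (cuts 50, 60, 70).
aucs = {
    "with_ap0": [0.621022, 0.646143, 0.63867],
    "no_ap0": [0.625022, 0.646517, 0.635056],
}

def summarize(values):
    # pstdev is the population standard deviation, matching std(ddof=0).
    return {"mean_auc": fmean(values), "std_auc": pstdev(values)}

summary = {name: summarize(vals) for name, vals in aucs.items()}

def prefer(a, b, summary):
    """Return a if it has a higher mean and a lower-or-equal std than b, else None."""
    sa, sb = summary[a], summary[b]
    if sa["mean_auc"] > sb["mean_auc"] and sa["std_auc"] <= sb["std_auc"]:
        return a
    return None

print(prefer("no_ap0", "with_ap0", summary))
```

With only three cuts, the standard deviation is a rough indicator rather than a precise estimate, so small differences in `std_auc` should be read with caution.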


## 8) Final Locked Configuration (Latest)

This section is the **latest selected configuration** after stability-first validation and window-scheme comparison.

### Why this section exists
- Earlier sections show the full experiment trail and are intentionally preserved.
- This section summarizes the final choice to avoid confusion during presentation.


### Final Selection (Supersedes Earlier Final Snapshot)
- Retained blocks: `cb1`, `ap1w` (`ap1` rebuilt with window scheme `<=0.5y`, `0.5-2y`, `>2y`), `dep`, `taxa`, `ap2`
- Dropped blocks: `ap0`, `db`, `p1`, `p2`

### Final Reported Metrics
- Single-split AUC (`cut_week=60`): **0.646666**
- Rolling stability (cuts 50/60/70):
  - mean AUC: **0.638389**
  - std AUC: **0.008758**
  - min AUC: **0.626271**
  - max AUC: **0.646666**
- Final feature count: **74**
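
The feature counts can be reproduced from the block sizes. This is a small arithmetic check, assuming `base_cols` holds only `WEEK_NUM` (the feature dictionary lists no other base feature) and that every template block contributes 16 features, as the 17-column outputs (including `case_id`) indicate.

```python
# Block sizes inferred from the outputs in this notebook: each template block
# has 17 columns including case_id, so 16 features. The custom-window ap1w
# block has 3 overall counts plus 2 counts (active, closed) per window.
TEMPLATE_BLOCK = 16
AP1W_BLOCK = 3 + 2 * 3
BASE = 1  # assumption: base_cols holds only WEEK_NUM

# Intermediate final: cb1, ap0, ap1, dep, taxa, ap2
intermediate = BASE + TEMPLATE_BLOCK * 6
# Locked final: cb1, dep, taxa, ap2, plus ap1w
locked = BASE + TEMPLATE_BLOCK * 4 + AP1W_BLOCK

print(intermediate, locked)
```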


### How We Selected Best AP1 Window

We compared three AP1 window schemes using rolling time splits (`WEEK_NUM` cuts: 50, 60, 70).
Selection criterion:
- maximize `mean_auc`
- prefer lower `std_auc` for stability
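
As a sketch, the criterion is a two-key sort. The values below are copied from the window-comparison output reported in this notebook; the variable names are illustrative.

```python
# Summary values copied from the window-comparison output in this notebook.
results = [
    {"model": "w_1_3", "mean_auc": 0.632906, "std_auc": 0.009835},
    {"model": "w_0_5_2", "mean_auc": 0.638389, "std_auc": 0.008758},
    {"model": "w_1_2", "mean_auc": 0.632019, "std_auc": 0.009331},
]

# Highest mean_auc first; std_auc only breaks exact ties in mean_auc.
ranked = sorted(results, key=lambda r: (-r["mean_auc"], r["std_auc"]))
selected = ranked[0]["model"]
print(selected)
```

Because the sort is lexicographic, `std_auc` matters only when two schemes have exactly the same `mean_auc`. A criterion that trades mean against stability would need a combined score instead.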

Schemes compared:
- `w_1_3`: `<=1y`, `1-3y`, `>3y`
- `w_0_5_2`: `<=0.5y`, `0.5-2y`, `>2y`
- `w_1_2`: `<=1y`, `1-2y`, `>2y`
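
Each scheme is a tuple of `(lo, hi)` bounds in years. The sketch below mirrors the tagging and bucketing rule of `build_contract_features_custom_windows` in plain Python; `window_tag`, `assign_window`, and `OPEN_END` are hypothetical names introduced for illustration.

```python
OPEN_END = 999  # sentinel upper bound meaning "no upper limit"

def window_tag(lo, hi):
    """Column-name suffix for a window, e.g. (0, 0.5) -> '0to0.5', (2, 999) -> 'gt2'."""
    return f"{lo}to{hi}" if hi < OPEN_END else f"gt{lo}"

def assign_window(age_years, windows):
    """Return the tag of the window containing age_years, or None if none matches."""
    for lo, hi in windows:
        if hi < OPEN_END:
            inside = lo < age_years <= hi  # half-open interval (lo, hi]
        else:
            inside = age_years > lo        # open-ended last window
        if inside:
            return window_tag(lo, hi)
    return None

w_0_5_2 = ((0, 0.5), (0.5, 2), (2, 999))
for age in (0.0, 0.3, 0.5, 2.0, 5):
    print(age, assign_window(age, w_0_5_2))
```

The lower bound is exclusive, so a record dated exactly on the decision date (`age_years == 0`) lands in no window, even though it still counts toward the overall `*_count_all` columns.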


In [39]:
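# Helper for AP1 window-scheme testing (used in final locked section)
# Note: unlike build_contract_features, this helper emits no known_count, DPD, or rate columns; dpd_col is only cast.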
def build_contract_features_custom_windows(
    table_path: Path,
    prefix: str,
    base_dates: pl.DataFrame,
    event_date_cols: list[str],
    windows=((0,1),(1,3),(3,999)),
    dpd_col: str | None = None,
    active_flag_expr: pl.Expr | None = None,
    closed_flag_expr: pl.Expr | None = None,
):
    df = pl.read_csv(table_path).join(base_dates, on="case_id", how="left")

    parse_exprs = [pl.col(c).str.strptime(pl.Date, strict=False).alias(c) for c in event_date_cols if c in df.columns]
    cast_exprs = [pl.col(dpd_col).cast(pl.Float64, strict=False).alias(dpd_col)] if (dpd_col is not None and dpd_col in df.columns) else []
    df = df.with_columns(parse_exprs + cast_exprs)

    dates = [pl.col(c) for c in event_date_cols if c in df.columns]
    if len(dates) > 0:
        df = df.with_columns([
            pl.coalesce(dates).alias("event_date"),
            ((pl.col("date_decision") - pl.coalesce(dates)).dt.total_days() / 365.25).alias("age_years"),
        ])
        known = (pl.col("age_years") >= 0)
    else:
        df = df.with_columns([pl.lit(None).cast(pl.Float64).alias("age_years")])
        known = pl.lit(True)

    if active_flag_expr is None:
        active_flag_expr = pl.lit(False)
    if closed_flag_expr is None:
        closed_flag_expr = pl.lit(False)

    df = df.with_columns([
        active_flag_expr.fill_null(False).alias("is_active"),
        closed_flag_expr.fill_null(False).alias("is_closed"),
    ])

    aggs = [
        pl.len().alias(f"{prefix}_row_count_all"),
        (known & pl.col("is_active")).sum().alias(f"{prefix}_active_count_all"),
        (known & pl.col("is_closed")).sum().alias(f"{prefix}_closed_count_all"),
    ]

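    for lo, hi in windows:
        # hi == 999 marks an open-ended window; otherwise the window is (lo, hi].
        # The lower bound is exclusive, so rows with age_years == 0 fall outside the first window.
        tag = f"{lo}to{hi}" if hi < 999 else f"gt{lo}"
        cond = (pl.col("age_years") > lo) & (pl.col("age_years") <= hi) if hi < 999 else (pl.col("age_years") > lo)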
        aggs += [
            (known & pl.col("is_active") & cond).sum().alias(f"{prefix}_active_count_{tag}"),
            (known & pl.col("is_closed") & cond).sum().alias(f"{prefix}_closed_count_{tag}"),
        ]

    return df.group_by("case_id").agg(aggs)



In [40]:
# Reproducible AP1 window-scheme comparison
schemes = {
    "w_1_3": ((0,1),(1,3),(3,999)),
    "w_0_5_2": ((0,0.5),(0.5,2),(2,999)),
    "w_1_2": ((0,1),(1,2),(2,999)),
}

rows = []
for name, win in schemes.items():
    ap1_alt = build_contract_features_custom_windows(
        table_path=TABLES["applprev_1_1"],
        prefix="ap1w",
        base_dates=base_dates,
        event_date_cols=["dateactivated_425D", "approvaldate_319D", "creationdate_885D"],
        windows=win,
        dpd_col=None,
        active_flag_expr=pl.col("status_219L").is_in(["A"]),
        closed_flag_expr=pl.col("status_219L").is_in(["D", "K", "T"]),
    )

    # keep other retained blocks fixed and swap only AP1 window features
    model_tmp = (
        model_v6
        .drop([c for c in model_v6.columns if c.startswith("ap0_") or c.startswith("ap1_")])
        .join(ap1_alt, on="case_id", how="left")
        .with_columns(pl.col("^ap1w_.*$").fill_null(0))
    )

    ap1w_cols = [c for c in model_tmp.columns if c.startswith("ap1w_")]
    feat_cols = base_cols + cb1_cols_v2 + dep_cols + taxa_cols + ap2_cols + ap1w_cols

    _, s = eval_stability(model_tmp, feat_cols, cuts=(50,60,70), label=name)
    rows.append(s.iloc[0].to_dict())

window_comp = pd.DataFrame(rows).sort_values(["mean_auc", "std_auc"], ascending=[False, True])
display(window_comp)
print("Selected scheme:", window_comp.iloc[0]["model"])


Unnamed: 0,model,mean_auc,std_auc,min_auc,max_auc
1,w_0_5_2,0.638389,0.008758,0.626271,0.646666
0,w_1_3,0.632906,0.009835,0.622304,0.646003
2,w_1_2,0.632019,0.009331,0.621741,0.644325


Selected scheme: w_0_5_2


In [41]:
# Final locked code (reproducible): no_ap0 + best ap1 windows (<=0.5y, 0.5-2y, >2y)

# Rebuild ap1 with selected best windows
ap1_best = build_contract_features_custom_windows(
    table_path=TABLES["applprev_1_1"],
    prefix="ap1w",
    base_dates=base_dates,
    event_date_cols=["dateactivated_425D", "approvaldate_319D", "creationdate_885D"],
    windows=((0,0.5), (0.5,2), (2,999)),
    dpd_col=None,
    active_flag_expr=pl.col("status_219L").is_in(["A"]),
    closed_flag_expr=pl.col("status_219L").is_in(["D", "K", "T"]),
)

# Start from model_v6, remove superseded blocks ap0/ap1, then join new ap1w block
model_locked = (
    model_v6
    .drop([c for c in model_v6.columns if c.startswith("ap0_") or c.startswith("ap1_")])
    .join(ap1_best, on="case_id", how="left")
    .with_columns(pl.col("^ap1w_.*$").fill_null(0))
)

# Final feature list
ap1w_cols = [c for c in model_locked.columns if c.startswith("ap1w_")]
final_locked_cols = base_cols + cb1_cols_v2 + dep_cols + taxa_cols + ap2_cols + ap1w_cols

# Single-split metric (continuity with earlier reporting)
auc_single_locked = run_auc_for_cut(model_locked, final_locked_cols, 60)

# Rolling stability metrics
detail_locked, summary_locked = eval_stability(
    model_locked, final_locked_cols, cuts=(50, 60, 70), label="final_locked_w_0_5_2"
)

print("Single-split AUC (cut=60):", round(auc_single_locked, 6))
display(detail_locked)
display(summary_locked)
print("Num features:", len(final_locked_cols))



Single-split AUC (cut=60): 0.646666


Unnamed: 0,model,cut_week,auc
0,final_locked_w_0_5_2,50,0.626271
1,final_locked_w_0_5_2,60,0.646666
2,final_locked_w_0_5_2,70,0.642232


Unnamed: 0,model,mean_auc,std_auc,min_auc,max_auc
0,final_locked_w_0_5_2,0.638389,0.008758,0.626271,0.646666


Num features: 74


### Interpretation for Group Discussion
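- Stability-first validation showed that removing `ap0` lowered the std of AUC across cuts (0.010532 to 0.008782) while slightly raising the mean (0.635278 to 0.635532).
- Window tuning on `ap1` improved both mean AUC and std versus the `<=1y`, `1-3y`, `>3y` windows (mean 0.632906 to 0.638389, std 0.009835 to 0.008758).
- The final model is slightly stronger at `cut_week=60` (0.646666 vs 0.646143) and simpler (74 vs 97 features) than the earlier intermediate final.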


## Final Summary

### What was done
- Built case-level features from the `MUST_AGG` tables using:
  1. Active vs closed contract logic
  2. Time-windowed aggregation
  3. DPD-conditional aggregation
- Aggregated all engineered signals to one row per `case_id`.
- Evaluated feature blocks incrementally and kept only beneficial blocks.
- Applied stability-first validation using rolling time splits (`WEEK_NUM` cuts: 50, 60, 70).
- Tuned the AP1 window scheme and selected the best scheme by mean AUC, with lower std preferred.

### Final retained feature blocks
- `cb1`, `ap1w` (best windows: `<=0.5y`, `0.5-2y`, `>2y`), `dep`, `taxa`, `ap2`
- Dropped: `ap0`, `db`, `p1`, `p2`

### Final performance
- Single-split AUC (`cut=60`): **0.646666**
- Rolling stability AUC (cuts 50/60/70):
  - Mean: **0.638389**
  - Std: **0.008758**
  - Min: **0.626271**
  - Max: **0.646666**
- Final feature count: **74**

### Conclusion
Relative to the intermediate final (`with_ap0`), the locked feature set raises mean rolling AUC from 0.635278 to 0.638389, lowers its std from 0.010532 to 0.008758, and reduces the feature count from 97 to 74.


## Final Feature Dictionary (74 Features)

| # | Feature | Meaning |
|---:|---|---|
| 1 | `WEEK_NUM` | Application week index (time context). |
| 2 | `cb1_row_count_all` | Total `credit_bureau_b_1` rows for the case. |
| 3 | `cb1_known_count` | `cb1` rows usable at decision time. |
| 4 | `cb1_active_count_all` | Active/open `cb1` contract rows count. |
| 5 | `cb1_closed_count_all` | Closed `cb1` contract rows count. |
| 6 | `cb1_active_dpd30_count_all` | Active `cb1` rows with DPD >= 30. |
| 7 | `cb1_active_dpd90_count_all` | Active `cb1` rows with DPD >= 90. |
| 8 | `cb1_closed_dpd30_count_all` | Closed `cb1` rows with DPD >= 30. |
| 9 | `cb1_closed_dpd90_count_all` | Closed `cb1` rows with DPD >= 90. |
| 10 | `cb1_active_count_le1y` | Active `cb1` rows within <=1 year. |
| 11 | `cb1_closed_count_le1y` | Closed `cb1` rows within <=1 year. |
| 12 | `cb1_active_count_1to3y` | Active `cb1` rows in 1–3 years. |
| 13 | `cb1_closed_count_1to3y` | Closed `cb1` rows in 1–3 years. |
| 14 | `cb1_active_count_gt3y` | Active `cb1` rows >3 years old. |
| 15 | `cb1_closed_count_gt3y` | Closed `cb1` rows >3 years old. |
| 16 | `cb1_active_dpd30_rate` | Active DPD30 ratio in `cb1`. |
| 17 | `cb1_closed_dpd30_rate` | Closed DPD30 ratio in `cb1`. |
| 18 | `dep_row_count_all` | Total `deposit_1` rows for the case. |
| 19 | `dep_known_count` | `deposit_1` rows usable at decision time. |
| 20 | `dep_active_count_all` | Active deposit rows count. |
| 21 | `dep_closed_count_all` | Closed deposit rows count. |
| 22 | `dep_active_dpd30_count_all` | Active deposit rows with DPD >= 30 (structural placeholder). |
| 23 | `dep_active_dpd90_count_all` | Active deposit rows with DPD >= 90 (structural placeholder). |
| 24 | `dep_closed_dpd30_count_all` | Closed deposit rows with DPD >= 30 (structural placeholder). |
| 25 | `dep_closed_dpd90_count_all` | Closed deposit rows with DPD >= 90 (structural placeholder). |
| 26 | `dep_active_count_le1y` | Active deposit rows <=1 year. |
| 27 | `dep_closed_count_le1y` | Closed deposit rows <=1 year. |
| 28 | `dep_active_count_1to3y` | Active deposit rows in 1–3 years. |
| 29 | `dep_closed_count_1to3y` | Closed deposit rows in 1–3 years. |
| 30 | `dep_active_count_gt3y` | Active deposit rows >3 years. |
| 31 | `dep_closed_count_gt3y` | Closed deposit rows >3 years. |
| 32 | `dep_active_dpd30_rate` | Active DPD30 ratio for deposits (structural). |
| 33 | `dep_closed_dpd30_rate` | Closed DPD30 ratio for deposits (structural). |
| 34 | `taxa_row_count_all` | Total `tax_registry_a_1` rows for the case. |
| 35 | `taxa_known_count` | Tax rows usable at decision time. |
| 36 | `taxa_active_count_all` | Active-like tax rows count. |
| 37 | `taxa_closed_count_all` | Closed-like tax rows count. |
| 38 | `taxa_active_dpd30_count_all` | Active-like tax rows with DPD >= 30 (structural). |
| 39 | `taxa_active_dpd90_count_all` | Active-like tax rows with DPD >= 90 (structural). |
| 40 | `taxa_closed_dpd30_count_all` | Closed-like tax rows with DPD >= 30 (structural). |
| 41 | `taxa_closed_dpd90_count_all` | Closed-like tax rows with DPD >= 90 (structural). |
| 42 | `taxa_active_count_le1y` | Active-like tax rows <=1 year. |
| 43 | `taxa_closed_count_le1y` | Closed-like tax rows <=1 year. |
| 44 | `taxa_active_count_1to3y` | Active-like tax rows in 1–3 years. |
| 45 | `taxa_closed_count_1to3y` | Closed-like tax rows in 1–3 years. |
| 46 | `taxa_active_count_gt3y` | Active-like tax rows >3 years. |
| 47 | `taxa_closed_count_gt3y` | Closed-like tax rows >3 years. |
| 48 | `taxa_active_dpd30_rate` | Active-like DPD30 ratio in tax block (structural). |
| 49 | `taxa_closed_dpd30_rate` | Closed-like DPD30 ratio in tax block (structural). |
| 50 | `ap2_row_count_all` | Total `applprev_2` rows for the case. |
| 51 | `ap2_known_count` | `applprev_2` usable rows count (equals `ap2_row_count_all`; no event date is used). |
| 52 | `ap2_active_count_all` | Active-like `ap2` rows count (equals `ap2_row_count_all`). |
| 53 | `ap2_closed_count_all` | Closed-like `ap2` rows count. |
| 54 | `ap2_active_dpd30_count_all` | Active-like `ap2` rows with DPD >= 30 (structural). |
| 55 | `ap2_active_dpd90_count_all` | Active-like `ap2` rows with DPD >= 90 (structural). |
| 56 | `ap2_closed_dpd30_count_all` | Closed-like `ap2` rows with DPD >= 30 (structural). |
| 57 | `ap2_closed_dpd90_count_all` | Closed-like `ap2` rows with DPD >= 90 (structural). |
| 58 | `ap2_active_count_le1y` | Active-like `ap2` rows <=1 year (structural). |
| 59 | `ap2_closed_count_le1y` | Closed-like `ap2` rows <=1 year (structural). |
| 60 | `ap2_active_count_1to3y` | Active-like `ap2` rows in 1–3 years (structural). |
| 61 | `ap2_closed_count_1to3y` | Closed-like `ap2` rows in 1–3 years (structural). |
| 62 | `ap2_active_count_gt3y` | Active-like `ap2` rows >3 years (structural). |
| 63 | `ap2_closed_count_gt3y` | Closed-like `ap2` rows >3 years (structural). |
| 64 | `ap2_active_dpd30_rate` | Active-like DPD30 ratio in `ap2` (structural). |
| 65 | `ap2_closed_dpd30_rate` | Closed-like DPD30 ratio in `ap2` (structural). |
| 66 | `ap1w_row_count_all` | Total `applprev_1_1` rows for the case. |
| 67 | `ap1w_active_count_all` | Active `ap1` rows count. |
| 68 | `ap1w_closed_count_all` | Closed `ap1` rows count. |
| 69 | `ap1w_active_count_0to0.5` | Active `ap1` rows in 0–0.5 years. |
| 70 | `ap1w_closed_count_0to0.5` | Closed `ap1` rows in 0–0.5 years. |
| 71 | `ap1w_active_count_0.5to2` | Active `ap1` rows in 0.5–2 years. |
| 72 | `ap1w_closed_count_0.5to2` | Closed `ap1` rows in 0.5–2 years. |
| 73 | `ap1w_active_count_gt2` | Active `ap1` rows >2 years. |
| 74 | `ap1w_closed_count_gt2` | Closed `ap1` rows >2 years. |

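**Note:** The DPD columns in `dep_*`, `taxa_*`, and `ap2_*`, the closed-like columns in `taxa_*` and `ap2_*`, and the time-window columns in `ap2_*` are structural placeholders from the unified template. They are constant zero by construction (these blocks are built with no DPD column, no closed status, or no event date) and are kept only for a consistent schema.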
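
One way to confirm which columns are placeholders is to list the columns that hold a single value. The sketch below is a plain-Python illustration on a toy block; `constant_columns` and `toy_block` are hypothetical names, and the values are taken from the `ap2` preview rows shown earlier.

```python
def constant_columns(table):
    """Return names of columns whose values are all identical."""
    return [name for name, values in table.items() if len(set(values)) <= 1]

# Toy block shaped like ap2: the row count varies, placeholders are always 0.
toy_block = {
    "ap2_row_count_all": [6, 5, 15, 3, 10],
    "ap2_closed_count_all": [0, 0, 0, 0, 0],
    "ap2_active_dpd30_rate": [0.0, 0.0, 0.0, 0.0, 0.0],
}

print(constant_columns(toy_block))
```

A tree-based model cannot split on a constant column, so such columns add to the feature count without adding signal.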
