# KKBox Churn Prediction — Embedding + MLP (Hyperparameter Summary)

## 1) Reproducibility / Split
- `RANDOM_STATE`: **719**
- Split: **Train / Valid / Test = 0.70 / 0.15 / 0.15**
  - Step1: `train_test_split(test_size=0.30, stratify=y)`
  - Step2: `train_test_split(test_size=0.50, stratify=y_tmp)`  → valid/test 15%/15%
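
As a quick arithmetic check, the two-step split can be traced by hand. The sketch below assumes the held-out side of each split is rounded up, which is consistent with the sizes the notebook prints (602676 / 129145 / 129145 out of 860966 rows). It does not call scikit-learn.

```python
import math

n_total = 860966                     # row count printed by the notebook

# Step 1: hold out 30% as a temporary pool (assumed to be rounded up).
n_tmp = math.ceil(n_total * 0.30)
n_train = n_total - n_tmp

# Step 2: split the pool 50:50 into valid and test.
n_test = math.ceil(n_tmp * 0.50)
n_valid = n_tmp - n_test

print(n_train, n_valid, n_test)
print(round(n_train / n_total, 4), round(n_valid / n_total, 4), round(n_test / n_total, 4))
```

Because 0.30 × 0.50 = 0.15, the second call yields two 15% partitions of the full data set.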

## 2) Feature Set
- Target: `TARGET_COL = "is_churn"` (binary, 0/1)
- ID: `ID_COL = "msno"`
- Features: `FEATURE_COLS = CATEGORICAL_COLS + NUMERICAL_COLS`
  - Categorical (Embedding input): `CATEGORICAL_COLS = [...]`
  - Numerical (Scaler input): `NUMERICAL_COLS = [...]`
- Experiments (e0/e1/e2/e3/e3.1) are controlled by **commenting columns ON/OFF** in the lists above

## 3) Preprocess
### Numerical
- Missing value: **train median impute**
- Scaling: **StandardScaler** (fit on train, transform on valid/test)
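
The train-only fitting can be illustrated with a dependency-free sketch. The toy values and the helper names `fit_numeric` and `transform_numeric` are hypothetical. `statistics.pstdev` is used because StandardScaler divides by the population standard deviation.

```python
import statistics

def fit_numeric(train_col):
    """Learn median, mean and population std from the train column only."""
    observed = [v for v in train_col if v is not None]
    median = statistics.median(observed)
    filled = [median if v is None else v for v in train_col]
    mean = statistics.mean(filled)
    std = statistics.pstdev(filled)
    return median, mean, std

def transform_numeric(col, median, mean, std):
    """Apply the train statistics to any split (train, valid or test)."""
    return [((median if v is None else v) - mean) / std for v in col]

train_col = [2.0, None, 4.0, 6.0]   # toy data, None marks a missing value
valid_col = [None, 8.0]

median, mean, std = fit_numeric(train_col)
valid_scaled = transform_numeric(valid_col, median, mean, std)
print(median, mean, round(std, 4), [round(v, 4) for v in valid_scaled])
```

The missing validation value is filled with the train median and therefore lands at 0 after scaling here. No statistic is ever computed from the validation or test rows.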

### Categorical
- Encoding: **index mapping built from the train vocabulary**
  - UNK/NaN → **0**
  - seen category → **1..N** (N = number of distinct train categories)
- Input dtype: forced to `object` (avoids pandas `Categorical` errors)
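
A minimal sketch of this encoding in plain Python, assuming toy values and the hypothetical helper names `fit_vocab` and `encode` (the notebook's own pandas-based version appears in the Preprocess cell):

```python
def fit_vocab(train_values):
    """Map each distinct non-missing train value to an index starting at 1."""
    seen = sorted({v for v in train_values if v is not None}, key=str)
    return {v: i + 1 for i, v in enumerate(seen)}

def encode(values, vocab):
    """Unseen values and missing values both fall back to index 0 (UNK)."""
    return [vocab.get(v, 0) for v in values]

train_values = ["male", "female", None, "male"]   # toy data
valid_values = ["female", "other", None]          # "other" never appears in train

vocab = fit_vocab(train_values)
table_size = len(vocab) + 1                       # +1 row for the UNK index 0

print(vocab)
print(encode(valid_values, vocab), "table_size:", table_size)
```

The embedding table for a column therefore needs one more row than the number of train categories.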

## 4) Model Architecture (Embedding + MLP)
- Output: **1 logit** → `sigmoid(logit)` = `P(is_churn=1)`
- Hidden sizes: `HIDDEN = (256, 128, 64)`
- Dropout: `DROPOUT = 0.35`
- Normalization: **BatchNorm1d(in_dim)** (applied once, to the concatenated input)
- Activation: **ReLU**

### Embedding dimension rule (per categorical column)
- `d_i = min(50, max(2, round(sqrt(V_i) * 2)))`
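  - `V_i`: number of embedding rows for the column, i.e. the train category count **plus 1 for the UNK index** (the code passes `cat_sizes[c]`, which already includes UNK)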
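
As a worked example, the rule can be applied to the table sizes the notebook prints in `cat_sizes` (each of which includes the UNK slot). The helper name `emb_dim` is hypothetical, and `math.sqrt` stands in for `np.sqrt` to keep the sketch dependency-free.

```python
import math

def emb_dim(table_size):
    # Same rule as above: clamp round(2 * sqrt(size)) into [2, 50].
    return int(min(50, max(2, round(math.sqrt(table_size) * 2))))

# Table sizes as printed by the notebook (each includes the UNK slot).
cat_sizes = {
    "city": 22, "gender": 4, "registered_via": 6,
    "last_payment_method": 34, "has_ever_paid": 3, "has_ever_cancelled": 3,
}
dims = {col: emb_dim(size) for col, size in cat_sizes.items()}
in_dim = 44 + sum(dims.values())     # 44 numerical columns + embedding widths

print(dims)
print("in_dim:", in_dim)
```

The resulting widths (9, 4, 5, 12, 3, 3) and the total input width of 80 agree with the `Embedding(...)` and `BatchNorm1d(80, ...)` lines in the printed model.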

## 5) Optimization / Loss (Imbalance Handling)
- Loss: **BCEWithLogitsLoss**
- Class imbalance weight:
  - `pos_weight = n_neg / n_pos`  (computed automatically from the train split)
- Optimizer: **AdamW**
  - Learning rate: `LR = 2e-3`
  - Weight decay: `WEIGHT_DECAY = 1e-4`
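
For a sense of scale, `pos_weight` can be reproduced from the printed train size and positive rate. The counts below are reconstructed from that output (an assumption, not read from the data), and the weighted loss is written out by hand rather than calling PyTorch.

```python
import math

n_train = 602676
pos_rate = 0.09460141104009451            # train positive rate printed by the notebook

n_pos = round(n_train * pos_rate)         # reconstructed count of churners
n_neg = n_train - n_pos
pos_weight = n_neg / n_pos
print(n_pos, n_neg, round(pos_weight, 4))

def weighted_bce(logit, y, pos_weight):
    """Per-sample loss: -[pos_weight * y * log(p) + (1 - y) * log(1 - p)]."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(pos_weight * y * math.log(p) + (1 - y) * math.log(1.0 - p))

# At logit 0 (p = 0.5) a positive sample contributes pos_weight times
# the loss of a negative sample.
ratio = weighted_bce(0.0, 1, pos_weight) / weighted_bce(0.0, 0, pos_weight)
print(round(ratio, 4))
```

With roughly 9.5% positives, each churner is weighted about 9.57 times as heavily as a non-churner, which also means the predicted probabilities are shifted upward relative to the raw base rate.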

## 6) Training
- Batch size: `BATCH_SIZE = 4096`
- Max epochs: `MAX_EPOCHS = 30`
- Early stopping:
  - Metric: **Valid PR-AUC**
  - Patience: `PATIENCE = 5`
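
The stopping rule can be sketched as a small pure-Python function. The name `early_stop_trace` is hypothetical, and the `min_delta=1e-5` margin mirrors the `best_pr + 1e-5` comparison in the training cell.

```python
def early_stop_trace(scores, patience=5, min_delta=1e-5):
    """Return (best_epoch, best_score, stopped_epoch or None) for a score history."""
    best_score, best_epoch, bad_epochs = -1.0, None, 0
    for epoch, score in enumerate(scores, start=1):
        if score > best_score + min_delta:
            best_score, best_epoch, bad_epochs = score, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return best_epoch, best_score, epoch
    return best_epoch, best_score, None

# Valid PR-AUC for epochs 1-10 as printed in the notebook's training log.
history = [0.78297, 0.78772, 0.78866, 0.78978, 0.79060,
           0.79025, 0.79114, 0.79163, 0.79211, 0.79179]
result_log = early_stop_trace(history)
print(result_log)

# A plateau of 5 non-improving epochs triggers the stop.
result_plateau = early_stop_trace([0.70, 0.71, 0.71, 0.71, 0.71, 0.71, 0.71])
print(result_plateau)
```

On the logged history the best epoch so far is 9 and no stop has been triggered within the first ten epochs. The plateau example stops at epoch 7 with the best score from epoch 2.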

## 7) Evaluation
- Core metrics (same for Valid and Test):
  - **PR-AUC**, **ROC-AUC**, **LogLoss**, **Accuracy@0.5**
- Threshold-based report:
  - `threshold = 0.5`
  - Confusion Matrix / Classification Report @ 0.5

## 8) (Optional) Permutation Importance (PR-AUC drop)
- `RUN_PERM_IMPORTANCE`: True/False
- `PERM_TOP_N`: 30
- `PERM_MAX_SAMPLES`: 200000
- Importance definition:
  - drop = (baseline PR-AUC) − (PR-AUC after shuffling one feature)


In [32]:
# ============================================================
# 0) Imports & Seed
# ============================================================
import os
import random
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (
    roc_auc_score, average_precision_score, log_loss,
    accuracy_score, confusion_matrix, classification_report
)

import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

RANDOM_STATE = 719

def seed_everything(seed=RANDOM_STATE):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

seed_everything()

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("device:", device)


device: cpu


## 1) Config (edit in one place only)

- Manage the feature set by **commenting columns ON/OFF** below.
- Only `DATA_PATH` needs adjusting for your environment.


In [33]:
# ============================================================
# 1) Config
# ============================================================
DATA_PATH = "kkbox_train_feature_v3.parquet"   # adjust if needed

RANDOM_STATE = 719

ID_COL = "msno"
TARGET_COL = "is_churn"

CATEGORICAL_COLS = [
    "city", "gender", "registered_via", "last_payment_method",
    "has_ever_paid", "has_ever_cancelled",
    # "is_auto_renew_last",
    # "is_free_user",
]

NUMERICAL_COLS = [
    "reg_days",

    # ======================
    # w7
    # ======================
    "num_days_active_w7", "total_secs_w7", "avg_secs_per_day_w7", "std_secs_w7",
    "num_songs_w7", "avg_songs_per_day_w7", "num_unq_w7", "num_25_w7", "num_100_w7",
    "short_play_w7", "skip_ratio_w7", "completion_ratio_w7", "short_play_ratio_w7", "variety_ratio_w7",

    # ======================
    # w14
    # ======================
    "num_days_active_w14", "total_secs_w14", "avg_secs_per_day_w14", "std_secs_w14",
    "num_songs_w14", "avg_songs_per_day_w14", "num_unq_w14", "num_25_w14", "num_100_w14",
    "short_play_w14", "skip_ratio_w14", "completion_ratio_w14", "short_play_ratio_w14", "variety_ratio_w14",

    # ======================
    # w21
    # ======================
    "num_days_active_w21", "total_secs_w21", "avg_secs_per_day_w21", "std_secs_w21",
    "num_songs_w21", "avg_songs_per_day_w21", "num_unq_w21", "num_25_w21", "num_100_w21",
    "short_play_w21", "skip_ratio_w21", "completion_ratio_w21", "short_play_ratio_w21", "variety_ratio_w21",

    # ======================
    # w30  (OFF → commented out)
    # ======================
    # "num_days_active_w30", "total_secs_w30", "avg_secs_per_day_w30", "std_secs_w30",
    # "num_songs_w30", "avg_songs_per_day_w30", "num_unq_w30", "num_25_w30", "num_100_w30",
    # "short_play_w30", "skip_ratio_w30", "completion_ratio_w30", "short_play_ratio_w30", "variety_ratio_w30",

    # ======================
    # trend (note: depends on the windows enabled above)
    # ======================
    # w7–w14
    "days_trend_w7_w14",

    # w7–w30 / w14–w30 (turn OFF together when w30 is OFF)
    # "secs_trend_w7_w30", "secs_trend_w14_w30",
    # "days_trend_w7_w30",
    # "songs_trend_w7_w30", "songs_trend_w14_w30",
    # "skip_trend_w7_w30", "completion_trend_w7_w30",

    # ======================
    # transactions (OFF for logs-only experiments)
    # ======================
    # "days_since_last_payment", "days_since_last_cancel", "last_plan_days",
    # "total_payment_count", "total_amount_paid", "avg_amount_per_payment",
    # "unique_plan_count", "subscription_months_est",
    # "payment_count_last_30d", "payment_count_last_90d",
]

FEATURE_COLS = CATEGORICAL_COLS + NUMERICAL_COLS

# ------------------------------
# Training Hyperparams
# ------------------------------
BATCH_SIZE = 4096
MAX_EPOCHS = 30
PATIENCE = 5
LR = 2e-3
WEIGHT_DECAY = 1e-4

HIDDEN = (256, 128, 64)
DROPOUT = 0.35

# ------------------------------
# Optional: Permutation Importance
# ------------------------------
RUN_PERM_IMPORTANCE = True
PERM_TOP_N = 30          # print only the top N
PERM_MAX_SAMPLES = 200000 # reduce if it takes too long (e.g. 50000)

def print_config():
    print("CATEGORICAL_COLS:", len(CATEGORICAL_COLS))
    print("NUMERICAL_COLS  :", len(NUMERICAL_COLS))
    print("FEATURE_COLS    :", len(FEATURE_COLS))
    print("BATCH_SIZE      :", BATCH_SIZE)
    print("MAX_EPOCHS      :", MAX_EPOCHS, "PATIENCE:", PATIENCE)
    print("LR/WEIGHT_DECAY :", LR, WEIGHT_DECAY)
    print("HIDDEN/DROPOUT  :", HIDDEN, DROPOUT)

print_config()


CATEGORICAL_COLS: 6
NUMERICAL_COLS  : 44
FEATURE_COLS    : 50
BATCH_SIZE      : 4096
MAX_EPOCHS      : 30 PATIENCE: 5
LR/WEIGHT_DECAY : 0.002 0.0001
HIDDEN/DROPOUT  : (256, 128, 64) 0.35


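## 2) Load & Validate → build X, y

- If any specified column is missing, the run stops immediately with an error (experiment fairness/reproducibility).
- The `X = df[FEATURE_COLS]`, `y = df[TARGET_COL]` pattern is kept exactly as requested.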


In [34]:
# ============================================================
# 2) Load & Validate
# ============================================================
# (covers both local and sandbox execution)
candidate_paths = [DATA_PATH, f"/mnt/data/{os.path.basename(DATA_PATH)}", "/mnt/data/kkbox_train_feature_v3.parquet"]
for p in candidate_paths:
    if os.path.exists(p):
        DATA_PATH = p
        break

print("Using DATA_PATH:", DATA_PATH)

df = pd.read_parquet(DATA_PATH)

required_cols = [ID_COL, TARGET_COL] + FEATURE_COLS
missing = [c for c in required_cols if c not in df.columns]
if missing:
    raise ValueError(f"[ERROR] Missing columns ({len(missing)}): {missing}")

X = df[FEATURE_COLS].copy()
y = df[TARGET_COL].astype(int).copy()
ids = df[ID_COL].copy()

print("df shape:", df.shape)
print("X shape :", X.shape)
print("y shape :", y.shape, "pos rate:", float(y.mean()))


Using DATA_PATH: kkbox_train_feature_v3.parquet
df shape: (860966, 85)
X shape : (860966, 50)
y shape : (860966,) pos rate: 0.09460071594000227


## 3) Fixed Split (Train/Valid/Test)

In [35]:
# ============================================================
# 3) Fixed Split (70/15/15) - no file save
# ============================================================

idx_all = np.arange(len(df))

# 1) Train 70%, Temp 30%
tr_idx, tmp_idx = train_test_split(
    idx_all,
    test_size=0.30,
    stratify=y.values,
    random_state=RANDOM_STATE
)

# 2) Split the 30% Temp pool 50:50 into Valid 15% and Test 15%
va_idx, te_idx = train_test_split(
    tmp_idx,
    test_size=0.50,
    stratify=y.values[tmp_idx],
    random_state=RANDOM_STATE
)

X_tr, y_tr = X.iloc[tr_idx].copy(), y.iloc[tr_idx].copy()
X_va, y_va = X.iloc[va_idx].copy(), y.iloc[va_idx].copy()
X_te, y_te = X.iloc[te_idx].copy(), y.iloc[te_idx].copy()

print("split sizes:", len(tr_idx), len(va_idx), len(te_idx))
print("split ratios:", len(tr_idx)/len(df), len(va_idx)/len(df), len(te_idx)/len(df))
print("pos rate | train/valid/test:",
      float(y_tr.mean()), float(y_va.mean()), float(y_te.mean()))

split sizes: 602676 129145 129145
split ratios: 0.6999997677027897 0.15000011614860517 0.15000011614860517
pos rate | train/valid/test: 0.09460141104009451 0.09459909404158116 0.09459909404158116


## 4) Preprocess

- Numerical: median impute + StandardScaler
- Categorical: train-vocab mapping → index (unseen/NaN = 0)
  - `astype("object")` is forced to avoid `pandas.Categorical`-related errors.


In [36]:
# ============================================================
# 4) Preprocess
# ============================================================
# ---- Numeric: median + scaler (fit on train)
num_median = X_tr[NUMERICAL_COLS].median(numeric_only=True)

X_tr_num = X_tr[NUMERICAL_COLS].fillna(num_median)
X_va_num = X_va[NUMERICAL_COLS].fillna(num_median)
X_te_num = X_te[NUMERICAL_COLS].fillna(num_median)

scaler = StandardScaler()
X_tr_num = scaler.fit_transform(X_tr_num).astype(np.float32)
X_va_num = scaler.transform(X_va_num).astype(np.float32)
X_te_num = scaler.transform(X_te_num).astype(np.float32)

# ---- Categorical: mapping (fit on train) + transform
def fit_cat_map(train_series: pd.Series):
    s = train_series.astype("object")
    vals = s.dropna().unique().tolist()
    vals = sorted(vals, key=lambda v: str(v))
    mapping = {v: i + 1 for i, v in enumerate(vals)}  # 1..N
    size = len(vals) + 1  # +1 for UNK=0
    return mapping, size

def transform_cat(df_part: pd.DataFrame, col: str, mapping: dict) -> np.ndarray:
    s = df_part[col].astype("object")
    return s.map(mapping).fillna(0).astype(np.int64).values

cat_maps = {}
cat_sizes = {}
X_tr_cat_list, X_va_cat_list, X_te_cat_list = [], [], []

for c in CATEGORICAL_COLS:
    m, size = fit_cat_map(X_tr[c])
    cat_maps[c] = m
    cat_sizes[c] = size

    X_tr_cat_list.append(transform_cat(X_tr, c, m))
    X_va_cat_list.append(transform_cat(X_va, c, m))
    X_te_cat_list.append(transform_cat(X_te, c, m))

X_tr_cat = np.stack(X_tr_cat_list, axis=1)
X_va_cat = np.stack(X_va_cat_list, axis=1)
X_te_cat = np.stack(X_te_cat_list, axis=1)

print("Numeric shape:", X_tr_num.shape)
print("Cat shape    :", X_tr_cat.shape)
print("cat_sizes    :", cat_sizes)


Numeric shape: (602676, 44)
Cat shape    : (602676, 6)
cat_sizes    : {'city': 22, 'gender': 4, 'registered_via': 6, 'last_payment_method': 34, 'has_ever_paid': 3, 'has_ever_cancelled': 3}


## 5) Dataset / DataLoader


In [37]:
# ============================================================
# 5) Dataset / DataLoader
# ============================================================
class KKBoxDataset(Dataset):
    def __init__(self, X_num, X_cat, y):
        self.X_num = torch.from_numpy(X_num)                  # float32
        self.X_cat = torch.from_numpy(X_cat)                  # int64
        self.y = torch.from_numpy(y.values.astype(np.float32))# float32

    def __len__(self):
        return len(self.y)

    def __getitem__(self, idx):
        return self.X_num[idx], self.X_cat[idx], self.y[idx]

train_loader = DataLoader(KKBoxDataset(X_tr_num, X_tr_cat, y_tr), batch_size=BATCH_SIZE, shuffle=True, num_workers=0)
valid_loader = DataLoader(KKBoxDataset(X_va_num, X_va_cat, y_va), batch_size=BATCH_SIZE, shuffle=False, num_workers=0)
test_loader  = DataLoader(KKBoxDataset(X_te_num, X_te_cat, y_te), batch_size=BATCH_SIZE, shuffle=False, num_workers=0)


## 6) Model (Embedding + MLP)

- The output is a single `logit`; it is converted to a probability with `sigmoid(logit)`.
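
The logit-to-probability mapping can be illustrated without PyTorch. This is a minimal sketch with a hand-written, numerically stable `sigmoid`; the sample logits are arbitrary.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

for logit in (-2.0, 0.0, 2.0):
    p = sigmoid(logit)
    print(f"logit={logit:+.1f} -> P(is_churn=1)={p:.4f} -> predicted class {int(p >= 0.5)}")
```

Since `sigmoid(0) = 0.5`, thresholding the probability at 0.5 is equivalent to thresholding the logit at 0.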


In [38]:
# ============================================================
# 6) Model
# ============================================================
def choose_emb_dim(n_cat: int) -> int:
    # Simple heuristic: not too small, not too large
    return int(min(50, max(2, round(np.sqrt(n_cat) * 2))))

class EmbeddingMLP(nn.Module):
    def __init__(self, num_numeric, cat_sizes, hidden=HIDDEN, dropout=DROPOUT):
        super().__init__()
        self.cat_cols = list(cat_sizes.keys())
        self.emb_layers = nn.ModuleDict()

        emb_out_dim = 0
        for c in self.cat_cols:
            n_cat = cat_sizes[c]
            d = choose_emb_dim(n_cat)
            self.emb_layers[c] = nn.Embedding(num_embeddings=n_cat, embedding_dim=d)
            emb_out_dim += d

        in_dim = num_numeric + emb_out_dim

        layers = [nn.BatchNorm1d(in_dim)]
        prev = in_dim
        for h in hidden:
            layers += [nn.Linear(prev, h), nn.ReLU(), nn.Dropout(dropout)]
            prev = h
        layers += [nn.Linear(prev, 1)]
        self.mlp = nn.Sequential(*layers)

    def forward(self, x_num, x_cat):
        embs = []
        for i, c in enumerate(self.cat_cols):
            embs.append(self.emb_layers[c](x_cat[:, i]))
        x = torch.cat([x_num] + embs, dim=1)
        logit = self.mlp(x).squeeze(1)
        return logit

model = EmbeddingMLP(num_numeric=X_tr_num.shape[1], cat_sizes=cat_sizes).to(device)
print(model)


EmbeddingMLP(
  (emb_layers): ModuleDict(
    (city): Embedding(22, 9)
    (gender): Embedding(4, 4)
    (registered_via): Embedding(6, 5)
    (last_payment_method): Embedding(34, 12)
    (has_ever_paid): Embedding(3, 3)
    (has_ever_cancelled): Embedding(3, 3)
  )
  (mlp): Sequential(
    (0): BatchNorm1d(80, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (1): Linear(in_features=80, out_features=256, bias=True)
    (2): ReLU()
    (3): Dropout(p=0.35, inplace=False)
    (4): Linear(in_features=256, out_features=128, bias=True)
    (5): ReLU()
    (6): Dropout(p=0.35, inplace=False)
    (7): Linear(in_features=128, out_features=64, bias=True)
    (8): ReLU()
    (9): Dropout(p=0.35, inplace=False)
    (10): Linear(in_features=64, out_features=1, bias=True)
  )
)


## 7) Train (Early Stopping by PR-AUC)

- Imbalance handling: `pos_weight = #neg / #pos`
- Early stopping criterion: **Valid PR-AUC**


In [39]:
# ============================================================
# 7) Train
# ============================================================
n_pos = int((y_tr.values == 1).sum())
n_neg = int((y_tr.values == 0).sum())
pos_weight = torch.tensor([n_neg / max(1, n_pos)], dtype=torch.float32, device=device)
print("pos_weight:", float(pos_weight.item()))

criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)
optimizer = torch.optim.AdamW(model.parameters(), lr=LR, weight_decay=WEIGHT_DECAY)

@torch.no_grad()
def predict_proba(loader):
    model.eval()
    probs, ys = [], []
    for x_num, x_cat, yb in loader:
        x_num = x_num.to(device)
        x_cat = x_cat.to(device)

        logit = model(x_num, x_cat)
        p = torch.sigmoid(logit).detach().cpu().numpy()
        probs.append(p)
        ys.append(yb.numpy())
    return np.concatenate(ys), np.concatenate(probs)

def eval_core_metrics(y_true, y_prob, threshold=0.5):
    y_prob = np.clip(y_prob, 1e-7, 1 - 1e-7)
    y_pred = (y_prob >= threshold).astype(int)
    return {
        "PR_AUC": average_precision_score(y_true, y_prob),
        "ROC_AUC": roc_auc_score(y_true, y_prob),
        "LogLoss": log_loss(y_true, y_prob),
        "Accuracy@0.5": accuracy_score(y_true, y_pred),
    }

best_pr = -1.0
best_state = None
pat_cnt = 0

for epoch in range(1, MAX_EPOCHS + 1):
    model.train()
    total_loss, n_batches = 0.0, 0

    for x_num, x_cat, yb in train_loader:
        x_num = x_num.to(device)
        x_cat = x_cat.to(device)
        yb = yb.to(device)

        optimizer.zero_grad()
        logit = model(x_num, x_cat)
        loss = criterion(logit, yb)
        loss.backward()
        optimizer.step()

        total_loss += float(loss.item())
        n_batches += 1

    # valid
    yv, pv = predict_proba(valid_loader)
    m = eval_core_metrics(yv, pv, threshold=0.5)
    train_loss = total_loss / max(1, n_batches)

    print(f"[Epoch {epoch:02d}] train_loss={train_loss:.5f} | "
          f"valid PR_AUC={m['PR_AUC']:.5f} ROC_AUC={m['ROC_AUC']:.5f} "
          f"LogLoss={m['LogLoss']:.5f} Acc@0.5={m['Accuracy@0.5']:.5f}")

    if m["PR_AUC"] > best_pr + 1e-5:
        best_pr = m["PR_AUC"]
        best_state = {k: v.detach().cpu().clone() for k, v in model.state_dict().items()}
        pat_cnt = 0
    else:
        pat_cnt += 1
        if pat_cnt >= PATIENCE:
            print(f"Early stopping. Best PR_AUC={best_pr:.5f}")
            break

if best_state is not None:
    model.load_state_dict({k: v.to(device) for k, v in best_state.items()})


pos_weight: 9.570667266845703
[Epoch 01] train_loss=0.61835 | valid PR_AUC=0.78297 ROC_AUC=0.94116 LogLoss=0.29753 Acc@0.5=0.89608
[Epoch 02] train_loss=0.55407 | valid PR_AUC=0.78772 ROC_AUC=0.94173 LogLoss=0.28619 Acc@0.5=0.89823
[Epoch 03] train_loss=0.54934 | valid PR_AUC=0.78866 ROC_AUC=0.94280 LogLoss=0.26818 Acc@0.5=0.89982
[Epoch 04] train_loss=0.54626 | valid PR_AUC=0.78978 ROC_AUC=0.94303 LogLoss=0.26548 Acc@0.5=0.90129
[Epoch 05] train_loss=0.54513 | valid PR_AUC=0.79060 ROC_AUC=0.94319 LogLoss=0.29138 Acc@0.5=0.89854
[Epoch 06] train_loss=0.54212 | valid PR_AUC=0.79025 ROC_AUC=0.94336 LogLoss=0.29700 Acc@0.5=0.89184
[Epoch 07] train_loss=0.54256 | valid PR_AUC=0.79114 ROC_AUC=0.94326 LogLoss=0.28076 Acc@0.5=0.89496
[Epoch 08] train_loss=0.54025 | valid PR_AUC=0.79163 ROC_AUC=0.94370 LogLoss=0.27417 Acc@0.5=0.89502
[Epoch 09] train_loss=0.54202 | valid PR_AUC=0.79211 ROC_AUC=0.94378 LogLoss=0.27840 Acc@0.5=0.89681
[Epoch 10] train_loss=0.54053 | valid PR_AUC=0.79179 ROC_AUC=

## 8) Test Evaluation + Gap (Valid ↔ Test)

- Record both Valid and Test performance and compute the gap numerically.
- Confusion Matrix / Classification Report use threshold=0.5 (to be tuned later if needed).


In [40]:
# ============================================================
# 8) Eval (Valid/Test + gap) + Confusion/Report
# ============================================================
yv, pv = predict_proba(valid_loader)
yt, pt = predict_proba(test_loader)

mv = eval_core_metrics(yv, pv, threshold=0.5)
mt = eval_core_metrics(yt, pt, threshold=0.5)

gap = {f"gap_{k}": (mt[k] - mv[k]) for k in mv.keys()}

print("\n[VALID METRICS]")
for k, v in mv.items():
    print(f"{k:12s}: {v:.6f}")

print("\n[TEST METRICS]")
for k, v in mt.items():
    print(f"{k:12s}: {v:.6f}")

print("\n[GAP (TEST - VALID)]")
for k, v in gap.items():
    print(f"{k:12s}: {v:.6f}")

# Confusion Matrix + Classification Report (threshold=0.5)
y_pred = (pt >= 0.5).astype(int)

print("\n[Confusion Matrix @0.5]")
print(confusion_matrix(yt, y_pred))

print("\n[Classification Report @0.5]")
print(classification_report(yt, y_pred, digits=4, zero_division=0))



[VALID METRICS]
PR_AUC      : 0.793281
ROC_AUC     : 0.943961
LogLoss     : 0.284601
Accuracy@0.5: 0.892989

[TEST METRICS]
PR_AUC      : 0.788933
ROC_AUC     : 0.942006
LogLoss     : 0.285698
Accuracy@0.5: 0.890929

[GAP (TEST - VALID)]
gap_PR_AUC  : -0.004348
gap_ROC_AUC : -0.001955
gap_LogLoss : 0.001098
gap_Accuracy@0.5: -0.002060

[Confusion Matrix @0.5]
[[104688  12240]
 [  1846  10371]]

[Classification Report @0.5]
              precision    recall  f1-score   support

         0.0     0.9827    0.8953    0.9370    116928
         1.0     0.4587    0.8489    0.5956     12217

    accuracy                         0.8909    129145
   macro avg     0.7207    0.8721    0.7663    129145
weighted avg     0.9331    0.8909    0.9047    129145
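
The class-1 row of the report can be re-derived from the confusion matrix by hand. The sketch below only reuses the four counts printed above and assumes the usual layout (rows = true class, columns = predicted class).

```python
# Confusion matrix at threshold 0.5, copied from the output above:
# rows = true class (0, 1), columns = predicted class (0, 1).
tn, fp = 104688, 12240
fn, tp = 1846, 10371

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * tp / (2 * tp + fp + fn)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f} accuracy={accuracy:.4f}")
```

At this threshold the model recovers about 85% of churners, but more than half of the users it flags are false alarms, which is the expected trade-off when positives are up-weighted in the loss.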



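## 9) (Optional) Permutation Importance (PR-AUC drop)

- Unlike tree models, deep learning models do not provide `feature_importances_` out of the box,
  so **Permutation Importance** is used to measure how much PR-AUC falls (the drop) when a given feature is shuffled.
- The larger the drop, the more the model relies on that feature.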
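
The idea can be shown on a toy problem in plain Python. Everything here is hypothetical: `toy_model` stands in for the trained network, the data are random, and accuracy replaces PR-AUC purely to avoid external dependencies.

```python
import random

def accuracy(y_true, y_pred):
    return sum(int(a == b) for a, b in zip(y_true, y_pred)) / len(y_true)

def toy_model(row):
    # Hypothetical stand-in for the trained network: it only looks at feature 0.
    return int(row[0] > 0.5)

def permutation_drop(X, y, col, rng):
    """Baseline score minus the score after shuffling one feature column."""
    base = accuracy(y, [toy_model(r) for r in X])
    shuffled = [r[col] for r in X]
    rng.shuffle(shuffled)
    X_perm = [r[:col] + [v] + r[col + 1:] for r, v in zip(X, shuffled)]
    return base - accuracy(y, [toy_model(r) for r in X_perm])

rng = random.Random(719)
X = [[rng.random(), rng.random()] for _ in range(1000)]
y = [int(r[0] > 0.5) for r in X]          # the label depends on feature 0 only

drops = {col: permutation_drop(X, y, col, rng) for col in (0, 1)}
print(drops)
```

Shuffling the feature the model ignores leaves the score unchanged (drop of exactly 0), while shuffling the feature it depends on produces a large drop.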


In [41]:
# ============================================================
# 9) Permutation Importance (Optional)
# ============================================================
from sklearn.metrics import average_precision_score

@torch.no_grad()
def batched_predict_proba_from_arrays(X_num, X_cat, batch_size=8192):
    model.eval()
    probs = []
    n = X_num.shape[0]
    for i in range(0, n, batch_size):
        xb_num = torch.from_numpy(X_num[i:i+batch_size]).to(device)
        xb_cat = torch.from_numpy(X_cat[i:i+batch_size]).to(device)
        logit = model(xb_num, xb_cat)
        p = torch.sigmoid(logit).detach().cpu().numpy()
        probs.append(p)
    return np.concatenate(probs)

def permutation_importance_pr_auc(
    y_true, X_num, X_cat,
    numeric_cols, categorical_cols,
    base_prob=None,
    top_n=30,
    random_state=RANDOM_STATE,
):
    rng = np.random.default_rng(random_state)

    if base_prob is None:
        base_prob = batched_predict_proba_from_arrays(X_num, X_cat)
    base_pr = average_precision_score(y_true, base_prob)

    rows = []

    # numeric
    for j, fname in enumerate(numeric_cols):
        X_num_p = X_num.copy()
        X_num_p[:, j] = rng.permutation(X_num_p[:, j])
        p = batched_predict_proba_from_arrays(X_num_p, X_cat)
        pr = average_precision_score(y_true, p)
        rows.append({"feature": fname, "group": "NUMERIC", "base_pr_auc": base_pr, "pr_auc": pr, "drop": base_pr - pr})

    # categorical (embedding index)
    for j, fname in enumerate(categorical_cols):
        X_cat_p = X_cat.copy()
        X_cat_p[:, j] = rng.permutation(X_cat_p[:, j])
        p = batched_predict_proba_from_arrays(X_num, X_cat_p)
        pr = average_precision_score(y_true, p)
        rows.append({"feature": fname, "group": "CAT", "base_pr_auc": base_pr, "pr_auc": pr, "drop": base_pr - pr})

    imp_df = pd.DataFrame(rows).sort_values("drop", ascending=False).reset_index(drop=True)
    return imp_df.head(top_n)

if RUN_PERM_IMPORTANCE:
    # Cap the number of samples (for speed)
    n = X_te_num.shape[0]
    use_n = min(n, PERM_MAX_SAMPLES)
    Xn = X_te_num[:use_n]
    Xc = X_te_cat[:use_n]
    yy = y_te.values[:use_n]

    base_prob = batched_predict_proba_from_arrays(Xn, Xc)
    imp_df = permutation_importance_pr_auc(
        y_true=yy, X_num=Xn, X_cat=Xc,
        numeric_cols=NUMERICAL_COLS,
        categorical_cols=CATEGORICAL_COLS,
        base_prob=base_prob,
        top_n=PERM_TOP_N,
        random_state=RANDOM_STATE,
    )
    print("\n[Permutation Importance] Top (by PR-AUC drop)")
    display(imp_df)
else:
    print("RUN_PERM_IMPORTANCE=False (skipped)")



[Permutation Importance] Top (by PR-AUC drop)


|   | feature | group | base_pr_auc | pr_auc | drop |
|---|---|---|---|---|---|
| 0 | has_ever_cancelled | CAT | 0.788933 | 0.354401 | 0.434532 |
| 1 | has_ever_paid | CAT | 0.788933 | 0.432195 | 0.356738 |
| 2 | last_payment_method | CAT | 0.788933 | 0.64579 | 0.143143 |
| 3 | reg_days | NUMERIC | 0.788933 | 0.769726 | 0.019207 |
| 4 | num_days_active_w7 | NUMERIC | 0.788933 | 0.779369 | 0.009564 |
| 5 | registered_via | CAT | 0.788933 | 0.783266 | 0.005667 |
| 6 | city | CAT | 0.788933 | 0.784257 | 0.004676 |
| 7 | completion_ratio_w21 | NUMERIC | 0.788933 | 0.786745 | 0.002188 |
| 8 | num_days_active_w14 | NUMERIC | 0.788933 | 0.787428 | 0.001505 |
| 9 | gender | CAT | 0.788933 | 0.787707 | 0.001226 |
