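# Tabular GAD Experiments (ADBench + TabPFN + LLM Embedding)

- Build the following three representations and compare their anomaly detection / generalization performance.

1) Raw X: the original tabular features
2) TabPFN-Residual: use TabPFN to predict each column from the other columns → build a per-sample residual vector
3) LLM-ColumnEmb → RowPooling: embed a text summary of each column's distribution (Qwen3 Embedding) → pool the column embeddings, weighted by the row's values, into a row embedding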
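
The pooling step in representation 3 can be sketched as follows. This is only an illustration under assumptions: `col_emb` holds random vectors standing in for the Qwen3 embeddings of the column summaries, and the softmax-over-|value| weighting (with a hypothetical `temperature` parameter) is one possible choice, not necessarily the weighting used in the actual experiment.

```python
import numpy as np

def row_pooling(X_std, col_emb, temperature=1.0):
    """Pool column embeddings into one embedding per row.

    X_std   : (n, d) standardized feature values
    col_emb : (d, e) one embedding per column (placeholder for LLM embeddings)
    Weighting (an assumption): per-row softmax over |value| / temperature.
    """
    logits = np.abs(X_std) / temperature
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    w = np.exp(logits)
    w /= w.sum(axis=1, keepdims=True)             # (n, d), each row sums to 1
    return w @ col_emb                            # (n, e)

rng = np.random.default_rng(0)
X_std = rng.normal(size=(5, 9))       # 5 rows, 9 columns
col_emb = rng.normal(size=(9, 16))    # random stand-in for Qwen3 column embeddings
row_emb = row_pooling(X_std, col_emb)
print(row_emb.shape)
```

Because the output dimension depends only on the embedding size, rows from datasets with different numbers of columns end up in the same space.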

Evaluation:
- In-domain AD: on a single dataset, train (inlier-only) → test (mixed)
- Cross-domain GAD: apply an anomaly scorer trained on a source dataset to target datasets
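
The cross-domain protocol can be sketched on synthetic data as below. Everything here is hypothetical: `ZNormScorer` is a simple z-score stand-in (not IsolationForest or ResAD), `make_domain` generates toy data, and the shared dimension `dim` assumes both datasets have already been mapped to a representation of the same size. Raw X generally cannot be transferred this way because ADBench datasets have different numbers of features.

```python
import numpy as np

class ZNormScorer:
    """Stand-in scorer: mean squared z-score relative to the source inliers."""
    def fit(self, Z_inlier):
        self.mu = Z_inlier.mean(axis=0)
        self.sigma = Z_inlier.std(axis=0) + 1e-8
        return self

    def score(self, Z):
        # Higher = more anomalous
        return (((Z - self.mu) / self.sigma) ** 2).mean(axis=1)

def make_domain(rng, n_in, n_out, dim, shift):
    """Synthetic domain: inliers ~ N(0, I), anomalies ~ N(shift, I)."""
    Z = np.vstack([rng.normal(size=(n_in, dim)),
                   rng.normal(loc=shift, size=(n_out, dim))])
    y = np.r_[np.zeros(n_in, dtype=int), np.ones(n_out, dtype=int)]
    return Z, y

rng = np.random.default_rng(0)
dim = 6  # shared representation size across datasets (assumption)
Z_src, y_src = make_domain(rng, 300, 30, dim, shift=5.0)
Z_tgt, y_tgt = make_domain(rng, 200, 20, dim, shift=5.0)

# Fit on source inliers only
scorer = ZNormScorer().fit(Z_src[y_src == 0])
# Cross-domain: apply the frozen scorer to the target without refitting
s_tgt = scorer.score(Z_tgt)
print("target mean score  normal:", s_tgt[y_tgt == 0].mean(),
      " anomaly:", s_tgt[y_tgt == 1].mean())
```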

Default anomaly scorer: IsolationForest for now (higher score = more anomalous)
(Only the scorer will later be swapped out for ResAD.)


In [22]:
import os, glob, random
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score, average_precision_score
from sklearn.ensemble import IsolationForest

import torch
print("torch:", torch.__version__)

CLASSICAL_DIR = "/home/haeylee/main/Classical"
print("CLASSICAL_DIR =", CLASSICAL_DIR)


torch: 2.7.1+cu118
CLASSICAL_DIR = /home/haeylee/main/Classical


In [23]:
# List the available datasets
def list_npz_datasets(root=CLASSICAL_DIR):
    files = sorted(glob.glob(os.path.join(root, "*.npz")))
    names = [os.path.splitext(os.path.basename(f))[0] for f in files]
    return names, files

DATASET_NAMES, DATASET_FILES = list_npz_datasets()
print("Found datasets:", len(DATASET_NAMES))
print("Examples:", DATASET_NAMES[:15])



Found datasets: 47
Examples: ['10_cover', '11_donors', '12_fault', '13_fraud', '14_glass', '15_Hepatitis', '16_http', '17_InternetAds', '18_Ionosphere', '19_landsat', '1_ALOI', '20_letter', '21_Lymphography', '22_magic.gamma', '23_mammography']


In [None]:
# Loader, seeding, split, standardization

def set_seed(seed=0):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

def load_npz(dataset_name, root=CLASSICAL_DIR):
    path = os.path.join(root, f"{dataset_name}.npz")
    assert os.path.isfile(path), f"File not found: {path}"
    data = np.load(path, allow_pickle=True)
    X = np.asarray(data["X"], dtype=np.float32)
    y = np.asarray(data["y"]).astype(int)

    # Remap y when it is not {0,1} (ADBench usually uses 0=normal, 1=anomaly):
    # the smallest label is treated as normal, everything else as anomaly.
    uniq = np.unique(y)
    if not (len(uniq) == 2 and set(uniq) == {0,1}):
        y0 = np.min(uniq)
        y = (y != y0).astype(int)

    X = np.nan_to_num(X, nan=0.0, posinf=0.0, neginf=0.0)
    return X, y

def od_split_inlier_train(X, y, seed=0, test_size=0.2):
    # Stratified split over the full dataset
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, random_state=seed, stratify=y
    )
    # OD setting: train on normal samples only (inlier = 0)
    X_tr_in = X_tr[y_tr == 0]
    return X_tr_in, X_te, y_te

def standardize_fit_on_train(X_train_inlier, X_any):
    scaler = StandardScaler()
    Xtr = scaler.fit_transform(X_train_inlier)
    Xany = scaler.transform(X_any)
    return Xtr, Xany, scaler

def metrics(y_true, anomaly_score):
    return {
        "auroc": float(roc_auc_score(y_true, anomaly_score)),
        "auprc": float(average_precision_score(y_true, anomaly_score)),
    }
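
As a sanity check on what `metrics` reports, AUROC can be read as the probability that a randomly chosen anomaly receives a higher score than a randomly chosen normal sample, with ties counted as half. The pairwise sketch below uses a hypothetical helper `auroc_pairwise` and made-up toy scores; it is meant for intuition on small inputs, not as a replacement for `roc_auc_score`.

```python
import numpy as np

def auroc_pairwise(y_true, score):
    """AUROC as P(score_anomaly > score_normal), ties counted as 0.5.

    Cost is O(n_pos * n_neg), so this is only suitable for small inputs.
    """
    y_true = np.asarray(y_true)
    score = np.asarray(score, dtype=float)
    pos = score[y_true == 1]
    neg = score[y_true == 0]
    diff = pos[:, None] - neg[None, :]   # every anomaly/normal pair
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())

y_toy = [0, 0, 1, 0, 1]
s_toy = [0.1, 0.4, 0.35, 0.2, 0.8]
print(auroc_pairwise(y_toy, s_toy))
```

In this toy case 5 of the 6 anomaly/normal pairs are ordered correctly (only 0.35 vs 0.4 is inverted), so the value is 5/6.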


In [None]:
# Example

set_seed(0)
dataset_name = "4_breastw"
X, y = load_npz(dataset_name)

print("dataset:", dataset_name)
print("X shape:", X.shape, "dtype:", X.dtype)
print("y shape:", y.shape, "unique:", np.unique(y, return_counts=True))

# Peek at a few samples
print("\nX[0] first 10 dims:", X[0, :10])
print("y[0:20]:", y[:20])


dataset: 4_breastw
X shape: (683, 9) dtype: float32
y shape: (683,) unique: (array([0, 1]), array([444, 239]))

X[0] first 10 dims: [5. 1. 1. 1. 2. 1. 3. 1. 1.]
y[0:20]: [0 0 0 0 0 1 0 0 0 0 0 0 1 0 1 1 0 0 1 0]


In [30]:
Xtr_in, Xte, yte = od_split_inlier_train(X, y, seed=0, test_size=0.2)
print("Xtr_in (inlier-only) shape:", Xtr_in.shape)
print("Xte shape:", Xte.shape, " yte ratio:", yte.mean())

Xtr_in_s, Xte_s, scaler = standardize_fit_on_train(Xtr_in, Xte)
print("After standardize:")
print("  Xtr_in_s mean (first 5 dims):", Xtr_in_s.mean(axis=0)[:5])
print("  Xtr_in_s std  (first 5 dims):", Xtr_in_s.std(axis=0)[:5])
print("  Xte_s[0] first 10 dims:", Xte_s[0,:10])


Xtr_in (inlier-only) shape: (355, 9)
Xte shape: (137, 9)  yte ratio: 0.35036496350364965
After standardize:
  Xtr_in_s mean (first 5 dims): [ 1.08127864e-07  1.15851284e-08 -1.28275914e-07 -7.00144724e-08
  1.57238730e-07]
  Xtr_in_s std  (first 5 dims): [1.0000005  0.9999985  0.99999815 0.99999875 0.9999991 ]
  Xte_s[0] first 10 dims: [-1.1402882  -0.37100318 -0.45178962 -0.3580195  -1.285264   -0.28385895
  0.8608856  -0.26681283 -0.1348383 ]


In [31]:
# Baseline: IsolationForest on standardized raw features

def fit_iforest(X_train, seed=0):
    model = IsolationForest(
        n_estimators=500,
        contamination="auto",
        random_state=seed,
        n_jobs=-1,
    )
    model.fit(X_train)
    return model

def score_iforest(model, X_test):
    # score_samples is higher for normal points, so negate it to get an anomaly score
    return -model.score_samples(X_test)

m_if = fit_iforest(Xtr_in_s, seed=0)
s_if = score_iforest(m_if, Xte_s)
print("IForest metrics:", metrics(yte, s_if))
print("anomaly score sample:", s_if[:10])


IForest metrics: {'auroc': 0.9906367041198502, 'auprc': 0.9819594265365031}
anomaly score sample: [0.36572993 0.82083933 0.41430815 0.33062756 0.33316394 0.76056935
 0.7691908  0.33107951 0.33062756 0.43313967]


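#### Applying TabPFN to OD: building relation-based residuals
- Have TabPFN learn the relations between columns as a feature predictor, then measure for each sample **how strongly those relations are violated (the residual vector)**
- Fit on inlier training data only
- Choose K target columns (the code below ranks columns by variance) and predict each one from the remaining columns
- At test time, collect the prediction errors (residuals) into z(x) ∈ R^K
- Compute the anomaly score on z(x) with a scorer such as IForest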
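
The steps above can be illustrated without TabPFN. The sketch below swaps in ordinary least squares as the per-column predictor, purely as a stand-in to show how a broken column relation turns into a large residual; `linear_relation_residuals` and the toy data are hypothetical and are not part of the notebook's pipeline.

```python
import numpy as np

def linear_relation_residuals(X_train_inlier, X_test, targets):
    """For each target column j, fit least squares x_j ~ other columns on inliers,
    then return |x_j - prediction| for the test rows. Stand-in for TabPFN."""
    Z = np.zeros((X_test.shape[0], len(targets)))
    for t_i, j in enumerate(targets):
        A_tr = np.delete(X_train_inlier, j, axis=1)
        A_tr = np.c_[A_tr, np.ones(len(A_tr))]            # add intercept
        coef, *_ = np.linalg.lstsq(A_tr, X_train_inlier[:, j], rcond=None)
        A_te = np.c_[np.delete(X_test, j, axis=1), np.ones(len(X_test))]
        Z[:, t_i] = np.abs(X_test[:, j] - A_te @ coef)
    return Z

rng = np.random.default_rng(0)
# Toy inliers obey x2 = x0 + x1 (plus small noise)
base = rng.normal(size=(200, 2))
X_in = np.c_[base, base.sum(axis=1) + 0.01 * rng.normal(size=200)]
x_ok = np.array([[1.0, 2.0, 3.0]])     # relation holds
x_bad = np.array([[1.0, 2.0, -3.0]])   # relation broken
Z = linear_relation_residuals(X_in, np.vstack([x_ok, x_bad]), targets=[0, 1, 2])
print(Z.round(2))
```

The first row of `Z` stays close to zero while the second row is large in every target column, which is the signal a downstream scorer is meant to pick up.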

In [32]:
import tabpfn
print("tabpfn package version:", getattr(tabpfn, "__version__", "unknown"))

from tabpfn import TabPFNClassifier, TabPFNRegressor
from tabpfn.constants import ModelVersion

print("Has ModelVersion.V2_5 ?", hasattr(ModelVersion, "V2_5"))
print("ModelVersion members:", [m for m in dir(ModelVersion) if m.startswith("V")])


tabpfn package version: 6.0.6
Has ModelVersion.V2_5 ? True
ModelVersion members: ['V2', 'V2_5']


In [33]:
def pick_target_columns_by_variance(X_inlier, k=8, seed=0):
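    # Prefer high-variance columns (to avoid near-constant ones).
    # Note: if X_inlier is already standardized, every non-constant column has
    # variance close to 1, so this ranking is close to arbitrary.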
    rng = np.random.default_rng(seed)
    var = X_inlier.var(axis=0)
    idx = np.argsort(-var)
    idx = idx[var[idx] > 1e-8]
    if len(idx) < k:
        # Fallback: random columns
        all_idx = np.arange(X_inlier.shape[1])
        rng.shuffle(all_idx)
        return all_idx[:k].tolist()
    return idx[:k].tolist()

def tabpfn_regressor_v25():
    # API compatibility: the factory method name may differ across tabpfn versions, so probe for it
    if hasattr(TabPFNRegressor, "get_default_for_version"):
        return TabPFNRegressor.get_default_for_version(ModelVersion.V2_5)
    if hasattr(TabPFNRegressor, "create_default_for_version"):
        return TabPFNRegressor.create_default_for_version(ModelVersion.V2_5)
    # Last-resort fallback: in recent releases the default model may already be v2.5
    return TabPFNRegressor()

@torch.no_grad()
def predict_reg_rowwise(model, X):
    # Predict row by row to avoid, as far as possible, test-test (transductive) interaction
    preds = []
    for i in range(X.shape[0]):
        preds.append(float(model.predict(X[i:i+1])[0]))
    return np.array(preds, dtype=np.float32)

def build_tabpfn_relation_residual_repr(
    Xtr_in_s, Xte_s, k_targets=8, max_train_points=512, seed=0
):
    set_seed(seed)
    d = Xtr_in_s.shape[1]
    targets = pick_target_columns_by_variance(Xtr_in_s, k=k_targets, seed=seed)
    print("Selected target columns:", targets)

    # Subsample the training inliers (speed / stability)
    if Xtr_in_s.shape[0] > max_train_points:
        rng = np.random.default_rng(seed)
        idx = rng.choice(Xtr_in_s.shape[0], size=max_train_points, replace=False)
        Xtr_fit = Xtr_in_s[idx]
        print(f"Subsample train inliers: {Xtr_in_s.shape[0]} -> {Xtr_fit.shape[0]}")
    else:
        Xtr_fit = Xtr_in_s

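    # Allocate by the number of targets actually selected
    # (can be fewer than k_targets on datasets with few columns)
    Ztr = np.zeros((Xtr_in_s.shape[0], len(targets)), dtype=np.float32)
    Zte = np.zeros((Xte_s.shape[0], len(targets)), dtype=np.float32)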

    for t_i, j in enumerate(targets):
        # Input: X without column j; target: x_j
        Xtr_inp = np.delete(Xtr_fit, j, axis=1)
        ytr = Xtr_fit[:, j]
        Xtr_inp_full = np.delete(Xtr_in_s, j, axis=1)
        Xte_inp = np.delete(Xte_s, j, axis=1)

        reg = tabpfn_regressor_v25()

        print(f"\n[TabPFN v2.5] Fit regressor for col={j} with X shape {Xtr_inp.shape} -> y shape {ytr.shape}")
        reg.fit(Xtr_inp, ytr)

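        # Predict row by row (to be safe)
        # Note: rows of Xtr_in_s that were part of the fit context are predicted
        # in-sample, so their residuals may be smaller than those of unseen rows.
        yhat_tr = predict_reg_rowwise(reg, Xtr_inp_full)
        yhat_te = predict_reg_rowwise(reg, Xte_inp)

        # Residual (absolute error)
        Ztr[:, t_i] = np.abs(Xtr_in_s[:, j] - yhat_tr)
        Zte[:, t_i] = np.abs(Xte_s[:, j] - yhat_te)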

        print("  residual stats (test): mean", float(Zte[:,t_i].mean()), "max", float(Zte[:,t_i].max()))

    meta = {"targets": targets, "k_targets": k_targets, "max_train_points": max_train_points}
    return Ztr, Zte, meta


In [34]:
set_seed(0)

Ztr, Zte, meta = build_tabpfn_relation_residual_repr(
    Xtr_in_s, Xte_s,
    k_targets=6,          # 4-8 is a reasonable starting range
    max_train_points=256, # start small
    seed=0
)

print("\nZtr shape:", Ztr.shape, "Zte shape:", Zte.shape)
print("Zte[0]:", Zte[0])
print("meta:", meta)

# IForest in residual space
m_if_z = fit_iforest(Ztr, seed=0)
s_if_z = score_iforest(m_if_z, Zte)

print("\nTabPFN-residual + IForest metrics:", metrics(yte, s_if_z))
print("scores head:", s_if_z[:10])


Selected target columns: [8, 7, 0, 6, 4, 5]
Subsample train inliers: 355 -> 256

[TabPFN v2.5] Fit regressor for col=8 with X shape (256, 8) -> y shape (256,)
  residual stats (test): mean 0.9817519187927246 max 15.274371147155762

[TabPFN v2.5] Fit regressor for col=7 with X shape (256, 8) -> y shape (256,)
  residual stats (test): mean 1.6046957969665527 max 8.23944091796875

[TabPFN v2.5] Fit regressor for col=0 with X shape (256, 8) -> y shape (256,)
  residual stats (test): mean 1.0966465473175049 max 4.030312538146973

[TabPFN v2.5] Fit regressor for col=6 with X shape (256, 8) -> y shape (256,)
  residual stats (test): mean 1.3231803178787231 max 6.23377799987793

[TabPFN v2.5] Fit regressor for col=4 with X shape (256, 8) -> y shape (256,)
  residual stats (test): mean 0.9214044213294983 max 6.729058265686035

[TabPFN v2.5] Fit regressor for col=5 with X shape (256, 8) -> y shape (256,)
  residual stats (test): mean 1.6779112815856934 max 7.18416166305542

Ztr shape: (355, 6) Z

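##### Quick check for information mixing across test rows
- If TabPFN processes the test batch jointly at predict time and there is query-query attention, results can depend on the batch size. That is why the predictions above were made row-wise, to be safe.

The check below measures how much row-wise and one-shot (batch) predictions differ; a nonzero difference is a signal of possible interaction.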
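
To make the logic of this check concrete, the toy sketch below compares two hypothetical predictors: one that treats rows independently and one whose output is deliberately coupled to the rest of the batch. Neither models TabPFN's internals; the point is only that a batch-coupled predictor produces a clear gap between batch and row-wise outputs. Tiny gaps can also arise from floating-point effects, so comparing against a tolerance rather than exact zero is a reasonable convention (an assumption here, not something the notebook establishes).

```python
import numpy as np

def predict_independent(X):
    """Each row's prediction depends only on that row."""
    return X.sum(axis=1)

def predict_batch_coupled(X):
    """Toy query-query interaction: every prediction is shifted by the batch mean."""
    row_sum = X.sum(axis=1)
    return row_sum - row_sum.mean()

def max_batch_row_gap(predict, X):
    """Largest absolute difference between batch and row-by-row predictions."""
    batch = predict(X)
    rowwise = np.array([predict(X[i:i + 1])[0] for i in range(len(X))])
    return float(np.max(np.abs(batch - rowwise)))

rng = np.random.default_rng(0)
X_q = rng.normal(size=(8, 4))
gap_indep = max_batch_row_gap(predict_independent, X_q)
gap_coupled = max_batch_row_gap(predict_batch_coupled, X_q)
print("independent gap:", gap_indep)
print("coupled gap    :", gap_coupled)
```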

In [35]:
# Quick check on a single column (j)
j = meta["targets"][0]
Xtr_inp = np.delete(Xtr_in_s, j, axis=1)
ytr = Xtr_in_s[:, j]
Xte_inp = np.delete(Xte_s, j, axis=1)

reg = tabpfn_regressor_v25()
reg.fit(Xtr_inp[:256], ytr[:256])

# Batch prediction
yhat_batch = reg.predict(Xte_inp)

# Row-wise prediction
yhat_row = predict_reg_rowwise(reg, Xte_inp)

diff = np.max(np.abs(yhat_batch - yhat_row))
print("max |batch - rowwise| =", float(diff))
print("first 5 batch:", yhat_batch[:5])
print("first 5 row  :", yhat_row[:5])


max |batch - rowwise| = 0.002199709415435791
first 5 batch: [-0.13281395  0.8498919  -0.1249333  -0.132583   -0.13314563]
first 5 row  : [-0.13284117  0.84813434 -0.12503633 -0.13259807 -0.1331743 ]
