# Introduction
This notebook documents the V1 training experiment: inspecting the dataset and folds, a bake-off of frozen backbones, full embedding extraction, tertile classification evaluation, final training, and an inference demo. Run the cells in order; the explanation after each cell describes its purpose and the artifacts it produces.

In [None]:
import pandas as pd
from pathlib import Path

BASE = Path("/kaggle/input/traningv1")
df = pd.read_csv(BASE/"fi_v2_meta.with_feats.clean.folds.csv")

# --- Unify the name of the person column ---
if "group_id" in df.columns and "person_id" not in df.columns:
    df = df.rename(columns={"group_id": "person_id"})

# --- Make sure the fold column exists; if not, join it from person_folds.csv ---
if "fold" not in df.columns:
    folds = pd.read_csv(BASE/"person_folds.csv")
    if "group_id" in folds.columns and "person_id" not in folds.columns:
        folds = folds.rename(columns={"group_id": "person_id"})
    df = df.merge(folds[["person_id","fold"]], on="person_id", how="left")
    assert df["fold"].notna().all(), "Some persons have no fold; check person_folds.csv"

# --- Define targets and detect numeric features ---
target_cols = ["extraversion","agreeableness","conscientiousness","neuroticism","openness"]
meta_like = {"person_id","fold","raw_id","relpath","wav_path","duration_sec"}
num_cols = [c for c in df.select_dtypes("number").columns if c not in meta_like|set(target_cols)]

print("Targets:", target_cols)
print("Group column: person_id | Fold column: fold")
print("X:", len(num_cols), "features | Rows:", len(df))

# --- Leakage check: each person must belong to exactly one fold ---
assert (df.groupby("person_id")["fold"].nunique() == 1).all(), "LEAKAGE detected!"


Targets: ['extraversion', 'agreeableness', 'conscientiousness', 'neuroticism', 'openness']
Group column: person_id | Fold column: fold
X: 25 features | Rows: 5425


#### Explanation (Cell 1)
- Loads the cleaned feature dataset, standardizes the `person_id`/`fold` columns, and defines the targets and the numeric features.
- Validation: asserts that there is no leakage (each person belongs to exactly one fold) and prints the number of features and rows.
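
The leakage assertion only reports that something is wrong, not which persons are affected. A minimal sketch of a helper that lists them, assuming the same `person_id` and `fold` column names; `leaking_persons` and the toy frame are hypothetical and not part of the notebook:

```python
import pandas as pd

def leaking_persons(frame, group_col="person_id", fold_col="fold"):
    """Return the group ids that appear in more than one fold."""
    n_folds = frame.groupby(group_col)[fold_col].nunique()
    return sorted(n_folds[n_folds > 1].index.tolist())

# Toy data: person "c" is (wrongly) split across folds 2 and 3.
toy = pd.DataFrame({
    "person_id": ["a", "a", "b", "c", "c"],
    "fold":      [0,   0,   1,   2,   3],
})
leaks = leaking_persons(toy)
print(leaks)
```

An empty list corresponds to the case where the notebook's assertion passes.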

In [None]:
import pandas as pd, numpy as np

# Assumes df, target_cols, and num_cols already exist from the previous cell
print("Label ranges:")
print(df[target_cols].agg(['min','max','mean']).T)

print("\nDistribution per fold:")
byf = df.groupby('fold').agg(rows=('fold','size'),
                             persons=('person_id','nunique'))
byf['rows_per_person'] = (byf['rows']/byf['persons']).round(2)
display(byf)


Label ranges:
                        min       max      mean
extraversion       0.018692  0.925234  0.483551
agreeableness      0.000000  1.000000  0.555661
conscientiousness  0.000000  0.970874  0.531477
neuroticism        0.020833  0.979167  0.529956
openness           0.000000  1.000000  0.573542

Distribution per fold:


      rows  persons  rows_per_person
fold
0     1093     1093              1.0
1     1017     1017              1.0
2     1121     1121              1.0
3     1106     1106              1.0
4     1088     1088              1.0


#### Explanation (Cell 2)
- Shows the label ranges and the per-fold row/person counts to confirm the 0–1 scale and the balance of the data.

All labels lie in [0, 1], which is consistent across traits. The means of roughly 0.48–0.57 are plausible.

Fold balance: 1017–1121 rows per fold, so every fold is within about 6% of the mean fold size of 1085, which is acceptable.

rows_per_person = 1.0 means one sample per person; grouping by person_id is still safe (the effect is the same as a plain KFold, but it stays correct if multiple clips per person are added later).
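
The fold-balance figure can be checked with a few lines of arithmetic. A minimal sketch using the per-fold row counts printed by the cell above; the variable names are illustrative only:

```python
# Per-fold row counts copied from the cell output.
fold_rows = {0: 1093, 1: 1017, 2: 1121, 3: 1106, 4: 1088}

mean_size = sum(fold_rows.values()) / len(fold_rows)
# Signed relative deviation of each fold from the mean fold size.
rel_dev = {f: (n - mean_size) / mean_size for f, n in fold_rows.items()}
worst_fold = max(rel_dev, key=lambda f: abs(rel_dev[f]))

print("mean fold size:", mean_size)
print("largest deviation: fold", worst_fold, f"({100 * rel_dev[worst_fold]:.1f}%)")
```

The smallest fold (fold 1) is the one that deviates most from the mean.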

### **Check whether the per-fold label distributions match the global one**

In [None]:
fold_stats = df.groupby('fold')[target_cols].mean()
global_mean = df[target_cols].mean()
print("Max mean deviation per trait (per fold vs global):")
print((fold_stats - global_mean).abs().max())
fold_stats  # show the table of per-fold means

Max mean deviation per trait (per fold vs global):
extraversion         0.002532
agreeableness        0.004962
conscientiousness    0.008739
neuroticism          0.007431
openness             0.003056
dtype: float64


      extraversion  agreeableness  conscientiousness  neuroticism  openness
fold
0          0.48516       0.555372           0.540216     0.536339  0.576599
1         0.486082       0.560623            0.53041     0.530502  0.574588
2         0.481513       0.552048           0.531131     0.522525  0.571791
3          0.48299       0.556425           0.528889     0.532164  0.573739
4         0.482236       0.554258           0.526681     0.528445  0.571099


#### Explanation (Cell 3)
- Measures how far each fold's mean label deviates from the global mean to check that the targets are balanced across folds. The largest deviation is about 0.009 (conscientiousness), so the folds are well matched.

# Scenario 1: Pretrained Transformer (**Frozen**)

---

## Brief Summary

In this scenario we **do not retrain** the transformer backbone (Wav2Vec2/HuBERT/WavLM). Instead we **freeze** it, **extract audio embeddings**, and train a **lightweight head** (Ridge or a small MLP) that maps embeddings → **Big Five scores (0–1)**.
**Step-0** aims to pick the **best backbone** quickly and objectively before the full experiment.

---

## Goals

* Select the **single best pretrained backbone** ("base" size) for the frozen scenario.
* Fix the **initial configuration**: 16 kHz mono, **mean pooling**, **Ridge head**, **per-speaker folds**.
* Obtain a **performance reference** (MAE and lift vs the **null baseline**) plus the **extraction time**.

---

## Dataset & Validation Protocol

* **Data**: audio-only subset of First Impressions V2 (one row = one clip = one person).
* **Split**: 5-fold **speaker-independent** (column `fold`, group = `person_id`).
* **Evaluation**: use the **existing** folds; **fit** the scaler/model on the **train folds** only → evaluate on the **val fold** (anti-leak).
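
The anti-leak rule means that any statistic used for preprocessing comes from the train folds only. A minimal NumPy sketch of that idea follows; the notebook's later cells get the same behavior from a scikit-learn `Pipeline`, so `standardize_train_only` and the random data here are hypothetical stand-ins:

```python
import numpy as np

def standardize_train_only(X, folds, val_fold):
    """Scale train and val rows using mean/std computed on the train rows only."""
    trn = folds != val_fold
    val = ~trn
    mu = X[trn].mean(axis=0)
    sd = X[trn].std(axis=0)
    sd[sd == 0] = 1.0                      # guard against constant features
    return (X[trn] - mu) / sd, (X[val] - mu) / sd

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 4))   # toy features
folds = np.arange(100) % 5                          # toy 5-fold assignment

X_trn, X_val = standardize_train_only(X, folds, val_fold=0)
print(X_trn.shape, X_val.shape)
```

The validation rows are transformed with the train statistics, so their mean is close to, but not exactly, zero.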

---

## What "Frozen" Means (a short story)

Think of the backbone as an **"ear teacher"** that has already learned from hundreds to tens of thousands of hours of speech.
In **frozen** mode we **do not re-teach the teacher**; we only ask it to **listen** and produce **notes (embeddings)**. A **small assistant** (Ridge/MLP) then translates those notes into Big Five scores.
Technically: `model.eval()`, `torch.no_grad()`, all weights **without gradients** → **no backprop** into the backbone.

---

## Candidate Backbones (base size, lightweight, stable)

* `microsoft/wavlm-base-plus`
* `facebook/hubert-base-ls960`
* `facebook/wav2vec2-base-960h`

> Two or three candidates are enough for Step-0. The winner moves on to the full experiment.

---

## Initial Configuration (constant across all candidates)

* **Audio**: 16 kHz, mono (resample first so all clips are uniform).
* **Pooling**: **mean pooling** over `last_hidden_state` (**stats pooling** = mean ⨁ std can be added later).
* **Head**: **Ridge** (per trait).
* **Prediction**: **clip to [0, 1]** (bounded target).
* **Folds**: use the **predefined** 5 folds (speaker-independent).
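
The two pooling options differ only in what is computed over the time axis. A minimal NumPy sketch on a toy `[T, D]` frame matrix follows; `mean_pool` and `stats_pool` are illustrative names, and the sample standard deviation (`ddof=1`) is used here to mirror the default of `torch.std`:

```python
import numpy as np

def mean_pool(frames):
    """[T, D] frame features -> [D] clip embedding."""
    return frames.mean(axis=0)

def stats_pool(frames):
    """[T, D] frame features -> [2*D] embedding: mean concatenated with std."""
    return np.concatenate([frames.mean(axis=0), frames.std(axis=0, ddof=1)])

frames = np.arange(12, dtype=float).reshape(4, 3)   # toy clip: T=4 frames, D=3 dims
pooled_mean = mean_pool(frames)
pooled_stats = stats_pool(frames)
print(pooled_mean.shape, pooled_stats.shape)
```

Stats pooling doubles the embedding dimension, so the Ridge head sees twice as many features per clip.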

---

## Metrics & Baseline

* **Primary**: per-trait **MAE** plus the **mean MAE** over the 5 traits.
* **Additional**: RMSE, R² (optional in Step-0).
* **Null baseline**: predict the train-fold label **mean** → **MAE_null** (the minimum bar to beat).
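
The null baseline and the lift can be written out in a few lines. This is a sketch on toy labels, not the notebook's evaluation code; `mae`, `null_mae`, and the value of `mae_model` are hypothetical:

```python
import numpy as np

def mae(y_true, y_pred):
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def null_mae(y, folds):
    """Mean over folds of the MAE obtained by predicting the train-fold label mean."""
    scores = []
    for f in np.unique(folds):
        trn, val = folds != f, folds == f
        scores.append(mae(y[val], np.full(val.sum(), y[trn].mean())))
    return float(np.mean(scores))

y = np.array([0.2, 0.4, 0.6, 0.8])     # toy labels
folds = np.array([0, 0, 1, 1])         # toy fold assignment

mae_null = null_mae(y, folds)
mae_model = 0.3                        # hypothetical model MAE, for illustration only
lift_abs = mae_null - mae_model
lift_rel_pct = 100 * lift_abs / mae_null
print(mae_null, lift_abs, lift_rel_pct)
```

A positive lift means the model beats the constant prediction; a negative lift means it does worse than guessing the mean.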

---

## Acceptance Criteria (Step-0 Decision)

* Winner = the backbone with the **lowest mean MAE** and a **positive lift** (MAE_null − MAE_model) on every trait.
* If the gap is small (≤ ~0.002–0.003 MAE), prefer the **faster** one (seconds/clip) with the **smaller memory footprint**.
* (Optional) Quickly test **stats pooling** on the top 2 candidates: adopt it if it gives a **1–3%** relative improvement.

---

## Step-0 Flow (Bake-off)

1. **Pilot subset (about 20%)**: sample proportionally per fold to speed up the test.
2. **Extract embeddings** (frozen) for each backbone → **Parquet cache**
   `ssl_emb_<backbone>_mean_pilot.parquet` (columns: `person_id, fold, clip_id, …, emb_0..emb_D-1`).
3. **Quick training**: per-trait Ridge with the same splits → compute **MAE** and **MAE_null**.
4. **Record the extraction time** (seconds/clip) → efficiency.
5. **Compare and pick the winner** → proceed to **full-dataset** extraction.

---

## Suggested Output Structure

```
/emb/
  ssl_emb_<backbone>_mean_pilot.parquet
/results/
  results_frozen_<backbone>_mean_pilot.csv
  leaderboard_step0.csv                # summary per backbone
```

**Ideal result columns (per backbone):**
`backbone, emb_dim, sec_per_clip, mae_avg, lift_abs, lift_rel_%`

---

## Reproducibility Checklist

* Set the **seed** (numpy/torch) and **log package versions** (transformers/torch/sklearn).
* Use a **config** per run (small JSON: backbone, pooling, head, grid, seed, fold).
* **Cache the embeddings once** and **reuse** them for every head experiment.
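
A minimal sketch of the first two checklist items, assuming a flat JSON config; the field names, `set_seed`, and the temporary output path are illustrative choices rather than the notebook's actual layout:

```python
import json
import random
import tempfile
from pathlib import Path

import numpy as np

def set_seed(seed):
    """Seed the Python and NumPy RNGs."""
    random.seed(seed)
    np.random.seed(seed)
    # When torch is in use, torch.manual_seed(seed) would be called here as well.

run_config = {
    "backbone": "facebook/hubert-base-ls960",
    "pooling": "mean",
    "head": "ridge",
    "alpha_grid": [0.1, 0.3, 1, 3, 10, 30, 100],
    "seed": 42,
    "fold_col": "fold",
}

# Same seed, same draws.
set_seed(run_config["seed"])
first_draw = np.random.rand(3)
set_seed(run_config["seed"])
second_draw = np.random.rand(3)

# Persist the config next to the run's results and read it back.
config_path = Path(tempfile.mkdtemp()) / "run_config.json"
config_path.write_text(json.dumps(run_config, indent=2))
reloaded = json.loads(config_path.read_text())
print(reloaded["backbone"], reloaded["seed"])
```

Storing the config beside each results file makes it possible to tell later which settings produced which leaderboard row.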

---

## Risks & Mitigation

* **OOM** during extraction → reduce the batch size, or use **5 s windows with a 2–3 s stride** and **average** the window embeddings.
* **Predictions outside [0, 1]** → always **clip** (or apply a logit transform via `TransformedTargetRegressor`).
* **Small performance gaps** → try **stats pooling** and pick the **more efficient** option.
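
The windowing mitigation can be sketched independently of the backbone. In the example below, `embed_fn` is a hypothetical stand-in for a call into the frozen model, and the 5 s window with a 2.5 s stride is one choice within the range given above:

```python
import numpy as np

def window_bounds(n_samples, sr=16000, win_s=5.0, stride_s=2.5):
    """Return (start, end) sample indices of overlapping windows covering the clip."""
    win, stride = int(win_s * sr), int(stride_s * sr)
    if n_samples <= win:
        return [(0, n_samples)]
    starts = list(range(0, n_samples - win + 1, stride))
    if starts[-1] + win < n_samples:       # add a final window so the tail is covered
        starts.append(n_samples - win)
    return [(s, s + win) for s in starts]

def windowed_embedding(wave, embed_fn, **window_kwargs):
    """Embed each window separately, then average the window embeddings."""
    embs = [embed_fn(wave[a:b]) for a, b in window_bounds(len(wave), **window_kwargs)]
    return np.mean(embs, axis=0)

# Toy stand-in for the backbone: two summary statistics per window.
toy_embed = lambda w: np.array([w.mean(), w.std()])

wave = np.sin(np.linspace(0, 100, 12 * 16000))     # 12 s of toy audio at 16 kHz
bounds = window_bounds(len(wave))
emb = windowed_embedding(wave, toy_embed)
print(len(bounds), bounds[-1], emb.shape)
```

Peak memory then depends on the window length rather than the clip length. The final window overlaps its neighbor more than the others do, so a plain average gives the tail of the clip slightly more weight.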

---

## Next Steps (after Step-0)

1. Extract the **full dataset** with the winning backbone (**mean pooling**).
2. Try **stats pooling** on the winner to check for gains.
3. Train **Ridge** and a **small MLP** (multi-output) → choose the best setup for **S2 (fine-tune)** or **fusion**.


In [None]:
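import sys, importlib, torch

def _v(mod):
    """Return the module's __version__, or a marker when it cannot be imported."""
    try:
        m = importlib.import_module(mod)
        return getattr(m, "__version__", "unknown")
    except ImportError:
        return "not-installed"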

print("python :", sys.version)
print("torch  :", torch.__version__)
print("transformers:", _v("transformers"))
print("huggingface_hub:", _v("huggingface_hub"))
print("httpx :", _v("httpx"))
print("tokenizers:", _v("tokenizers"))


python : 3.11.13 (main, Jun  4 2025, 08:57:29) [GCC 11.4.0]
torch  : 2.6.0+cu124
transformers: 4.53.3
huggingface_hub: 1.0.0.rc2
httpx : 0.28.1
tokenizers: 0.21.2


#### Explanation (Cell 4)
- Audits the Python/Torch/Transformers/HF Hub versions to confirm the environment is suitable for backbone inference.

In [None]:
import os, pandas as pd
from huggingface_hub import hf_hub_download

# 1) connectivity and model access check (download one small file only)
repo = "microsoft/wavlm-base-plus"
cfg_path = hf_hub_download(repo_id=repo, filename="config.json")
print("HF OK →", cfg_path)

# 2) take one audio path from the CSV
base = "/kaggle/input/traningv1"
df0 = pd.read_csv(f"{base}/fi_v2_meta.with_feats.clean.folds.csv")
audio_col = "wav_path" if "wav_path" in df0.columns else "relpath"
one_path = os.path.join(base, df0[audio_col].iloc[0].lstrip("./"))
print("Audio sample:", one_path, "| exists:", os.path.exists(one_path))


HF OK → /root/.cache/huggingface/hub/models--microsoft--wavlm-base-plus/snapshots/4c66d4806a428f2e922ccfa1a962776e232d487b/config.json
Audio sample: /kaggle/input/traningv1/wav/J4GQm9j0JZ0.003.wav | exists: True


#### Explanation (Cell 5)
- Tests the Hugging Face Hub connection by downloading config.json and verifies that a sample local audio path exists.

In [None]:
import os, pathlib
from huggingface_hub import snapshot_download

# keep the cache under /kaggle/working to avoid conflicts
HF_CACHE = "/kaggle/working/hf_cache"
os.environ["HF_HOME"] = HF_CACHE
os.environ["HF_HUB_CACHE"] = HF_CACHE
pathlib.Path(HF_CACHE).mkdir(parents=True, exist_ok=True)

BACKBONE_ID = "microsoft/wavlm-base-plus"   # alternatives: "facebook/wav2vec2-base-960h" / "facebook/hubert-base-ls960"

local_dir = snapshot_download(
    repo_id=BACKBONE_ID,
    cache_dir=HF_CACHE,
    allow_patterns=[
        "config.json",
        "preprocessor_config.json", "feature_extractor.json",
        "pytorch_model*.bin", "model.safetensors", "*.json"
    ],
)
print("Local snapshot dir:", local_dir)


Local snapshot dir: /kaggle/working/hf_cache/models--microsoft--wavlm-base-plus/snapshots/4c66d4806a428f2e922ccfa1a962776e232d487b


#### Explanation (Cell 6)
- Downloads a snapshot of the chosen backbone into a local cache (/kaggle/working/hf_cache) so that embedding extraction can run offline and reproducibly.

In [None]:
import os, pandas as pd, numpy as np, librosa, torch
from transformers import AutoFeatureExtractor, AutoModel

# take one audio file from the CSV
BASE = "/kaggle/input/traningv1"
df0  = pd.read_csv(f"{BASE}/fi_v2_meta.with_feats.clean.folds.csv")
audio_col = "wav_path" if "wav_path" in df0.columns else "relpath"
one_path  = os.path.join(BASE, df0[audio_col].iloc[0].lstrip("./"))

SR = 16000
y, _ = librosa.load(one_path, sr=SR, mono=True)

# >>> note: load from local_dir with local_files_only=True <<<
fe    = AutoFeatureExtractor.from_pretrained(local_dir, local_files_only=True)
model = AutoModel.from_pretrained(local_dir, local_files_only=True)
device = "cuda" if torch.cuda.is_available() else "cpu"
model.eval().to(device)

with torch.no_grad():
    inp = fe(y, sampling_rate=SR, return_tensors="pt")
    inp = {k: v.to(device) for k, v in inp.items()}
    hs  = model(**inp).last_hidden_state     # [1, T, D]
    emb = hs.mean(dim=1).squeeze(0).cpu().numpy()

print("Embedding shape:", emb.shape, "| first 5 dims:", np.round(emb[:5], 4))



Embedding shape: (768,) | first 5 dims: [-0.0448 -0.0012 -0.0368  0.0174  0.0229]




#### Explanation (Cell 7)
- Loads one WAV file, prepares the local feature extractor and model, then computes a mean-pooled embedding as a sanity check and prints the vector shape.

In [None]:
from huggingface_hub import snapshot_download
import os, pathlib

HF_CACHE = "/kaggle/working/hf_cache"
os.environ["HF_HOME"] = HF_CACHE
os.environ["HF_HUB_CACHE"] = HF_CACHE
pathlib.Path(HF_CACHE).mkdir(parents=True, exist_ok=True)

BACKBONES = {
    "wavlm-base-plus": "microsoft/wavlm-base-plus",
    "hubert-base": "facebook/hubert-base-ls960",
    "wav2vec2-base": "facebook/wav2vec2-base-960h",
}
LOCAL_DIRS = {}
for name, repo in BACKBONES.items():
    LOCAL_DIRS[name] = snapshot_download(
        repo_id=repo, cache_dir=HF_CACHE,
        allow_patterns=["config.json","preprocessor_config.json","feature_extractor.json",
                        "pytorch_model*.bin","model.safetensors","*.json"]
    )
LOCAL_DIRS


{'wavlm-base-plus': '/kaggle/working/hf_cache/models--microsoft--wavlm-base-plus/snapshots/4c66d4806a428f2e922ccfa1a962776e232d487b',
 'hubert-base': '/kaggle/working/hf_cache/models--facebook--hubert-base-ls960/snapshots/dba3bb02fda4248b6e082697eee756de8fe8aa8a',
 'wav2vec2-base': '/kaggle/working/hf_cache/models--facebook--wav2vec2-base-960h/snapshots/22aad52d435eb6dbaf354bdad9b0da84ce7d6156'}

#### Explanation (Cell 8)
- Downloads the three candidate backbones (wavlm, hubert, wav2vec2) into the cache and stores their paths in LOCAL_DIRS for the comparison.

In [None]:
import pandas as pd, os

BASE = "/kaggle/input/traningv1"
df = pd.read_csv(f"{BASE}/fi_v2_meta.with_feats.clean.folds.csv")

audio_col = "wav_path" if "wav_path" in df.columns else "relpath"
df["audio_path"] = df[audio_col].apply(lambda p: os.path.join(BASE, p.lstrip("./")))

target_cols = ["extraversion","agreeableness","conscientiousness","neuroticism","openness"]

df_pilot = (df.groupby("fold", group_keys=False)
              .apply(lambda g: g.sample(frac=0.2, random_state=42))
              .reset_index(drop=True))
df_pilot.shape, df_pilot["fold"].value_counts().sort_index()



((1085, 39),
 fold
 0    219
 1    203
 2    224
 3    221
 4    218
 Name: count, dtype: int64)

#### Explanation (Cell 9)
- Adds an absolute audio_path column and builds a 20% per-fold pilot subset used for the quick bake-off.
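
The pilot fold sizes can be checked by hand. This sketch assumes that `sample(frac=0.2)` rounds `0.2 * n` to the nearest integer within each fold, which is consistent with the counts in the output above; the full fold sizes are the per-fold row counts printed earlier in the notebook:

```python
# Full per-fold row counts (from the earlier fold-distribution output).
fold_rows = {0: 1093, 1: 1017, 2: 1121, 3: 1106, 4: 1088}

# Expected pilot size per fold under the rounding assumption.
pilot_rows = {f: round(0.2 * n) for f, n in fold_rows.items()}
pilot_total = sum(pilot_rows.values())
print(pilot_rows, pilot_total)
```

The total matches the 1085 rows reported for `df_pilot`.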

In [None]:
import numpy as np, torch, time, pandas as pd
import librosa
from transformers import AutoFeatureExtractor, AutoModel

SR = 16000

def load_16k(path):  # simple loader
    y, _ = librosa.load(path, sr=SR, mono=True)
    return y

@torch.no_grad()
def extract_embed_mean_local_batch(paths, local_dir, batch_size=4):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    fe = AutoFeatureExtractor.from_pretrained(local_dir, local_files_only=True)
    model = AutoModel.from_pretrained(local_dir, local_files_only=True).to(device)
    model.eval()

    embs, t0 = [], time.time()
    for i in range(0, len(paths), batch_size):
        waves = [load_16k(p) for p in paths[i:i+batch_size]]
        inputs = fe(waves, sampling_rate=SR, return_tensors="pt", padding=True)
        inputs = {k: v.to(device) for k, v in inputs.items()}
        hs = model(**inputs).last_hidden_state   # [B, T, D]
        embs.append(hs.mean(dim=1).cpu().numpy())  # mean over time
    embs = np.vstack(embs)
    spc = (time.time() - t0) / max(len(paths), 1)
    return embs, spc


#### Explanation (Cell 10)
- Defines a batched embedding extraction function (mean pooling) that also returns the average time per clip as an efficiency indicator.
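
One caveat of the batched version: `padding=True` pads every clip to the longest clip in the batch, and `hs.mean(dim=1)` then averages over all frames, including the padded ones of shorter clips. A possible refinement is to average only over the real frames. The NumPy sketch below shows the idea on a toy batch; `masked_mean_pool` is a hypothetical helper. How `n_valid` (the number of real frames per clip) is derived from the sample lengths depends on the backbone's convolutional downsampling and is left as an assumption here.

```python
import numpy as np

def masked_mean_pool(hidden, n_valid):
    """Average over time using only the first n_valid[b] frames of each clip."""
    hidden = np.asarray(hidden, dtype=float)                         # [B, T, D]
    n_valid = np.asarray(n_valid)                                    # [B]
    mask = np.arange(hidden.shape[1])[None, :] < n_valid[:, None]    # [B, T]
    summed = (hidden * mask[:, :, None]).sum(axis=1)                 # [B, D]
    return summed / np.maximum(n_valid, 1)[:, None]

# Toy batch: clip 0 has 4 real frames; clip 1 has 2 real frames plus 2 padded frames.
hidden = np.zeros((2, 4, 2))
hidden[0, :, :] = 1.0
hidden[1, :2, :] = 2.0

naive = hidden.mean(axis=1)                       # averages the padded frames too
masked = masked_mean_pool(hidden, n_valid=[4, 2])
print(naive[1], masked[1])
```

In this toy batch the padded frames are zeros, so the naive mean for the shorter clip is halved. In a real model the hidden states at padded positions are generally not zero, so the direction and size of the distortion will differ.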

In [None]:
# --- NORMALIZE IDs (create person_id) ---
import os, pandas as pd, numpy as np

BASE = "/kaggle/input/traningv1"
df = pd.read_csv(f"{BASE}/fi_v2_meta.with_feats.clean.folds.csv")

audio_col = "wav_path" if "wav_path" in df.columns else "relpath"
df["audio_path"] = df[audio_col].apply(lambda p: os.path.join(BASE, p.lstrip("./")))

# if person_id missing → derive from group_id/raw_id/filename
if "person_id" not in df.columns:
    if "group_id" in df.columns:
        df["person_id"] = df["group_id"]
    elif "raw_id" in df.columns:
        # example raw_id:  J4GQm9j0JZ0.003.mp4  → take the part before the segment index
        df["person_id"] = df["raw_id"].str.split(".").str[0]
    else:
        df["person_id"] = df["audio_path"].apply(lambda p: os.path.basename(p).split(".")[0])

# (optional) clip_id for auditing
df["clip_id"] = df[audio_col].apply(lambda p: os.path.basename(p))

target_cols = ["extraversion","agreeableness","conscientiousness","neuroticism","openness"]
assert all(c in df.columns for c in target_cols), "Check target_cols against the CSV."

# --- PILOT: 20% PER FOLD ---
df_pilot = (df.groupby("fold", group_keys=False)
              .apply(lambda g: g.sample(frac=0.2, random_state=42))
              .reset_index(drop=True))
df_pilot.shape, df_pilot["fold"].value_counts().sort_index()



((1085, 41),
 fold
 0    219
 1    203
 2    224
 3    221
 4    218
 Name: count, dtype: int64)

#### Explanation (Cell 11)
- Normalizes person_id/clip_id when they are missing and rebuilds the pilot subset with the audit columns so it stays consistent across all experiments.

In [None]:
import os

os.makedirs("/kaggle/working/emb", exist_ok=True)

pilot_meta = []
for name, local_dir in LOCAL_DIRS.items():
    embs, sec_per_clip = extract_embed_mean_local_batch(
        df_pilot["audio_path"].tolist(), local_dir, batch_size=4
    )
    D = embs.shape[1]
    out_path = f"/kaggle/working/emb/ssl_emb_{name}_mean_pilot.parquet"

    cols = [f"emb_{i}" for i in range(D)]
    emb_df = pd.concat([
        df_pilot[["person_id","fold","clip_id","audio_path"] + target_cols].reset_index(drop=True),
        pd.DataFrame(embs, columns=cols)
    ], axis=1)
    emb_df.to_parquet(out_path, index=False)

    pilot_meta.append({"backbone": name, "emb_dim": D, "sec_per_clip": sec_per_clip, "parquet": out_path})

pd.DataFrame(pilot_meta)

Some weights of Wav2Vec2Model were not initialized from the model checkpoint at /kaggle/working/hf_cache/models--facebook--wav2vec2-base-960h/snapshots/22aad52d435eb6dbaf354bdad9b0da84ce7d6156 and are newly initialized: ['masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


          backbone  emb_dim  sec_per_clip                                            parquet
0  wavlm-base-plus      768      0.112399  /kaggle/working/emb/ssl_emb_wavlm-base-plus_me...
1      hubert-base      768      0.094765  /kaggle/working/emb/ssl_emb_hubert-base_mean_p...
2    wav2vec2-base      768      0.095511  /kaggle/working/emb/ssl_emb_wav2vec2-base_mean...


#### Explanation (Cell 12)
- Extracts pilot embeddings for each backbone, saves them to Parquet, and records the embedding dimension and seconds per clip.

### **SCORING BACKBONE**

In [None]:
import numpy as np, pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error

# these must already exist:
# pilot_meta -> list of {"backbone","emb_dim","sec_per_clip","parquet"}
# target_cols -> ["extraversion","agreeableness","conscientiousness","neuroticism","openness"]

def eval_ridge_cv(parquet_path, target_cols, fold_col="fold"):
    dat = pd.read_parquet(parquet_path)
    feat_cols = [c for c in dat.columns if c.startswith("emb_")]
    folds = dat[fold_col].values
    uniq = sorted(np.unique(folds))
    splits = [(np.where(folds != f)[0], np.where(folds == f)[0]) for f in uniq]

    rows = []
    for t in target_cols:
        y = dat[t].values

        # null baseline
        null_mae = np.mean([
            mean_absolute_error(y[val], np.full(len(val), y[trn].mean()))
            for trn, val in splits
        ])

        # Ridge (small alpha grid)
        best_mae, best_alpha = 1e9, None
        for a in [0.1, 0.3, 1, 3, 10, 30, 100]:
            fold_mae = []
            for trn, val in splits:
                pipe = Pipeline([("scaler", StandardScaler()),
                                 ("ridge", Ridge(alpha=a))])
                pipe.fit(dat.iloc[trn][feat_cols], y[trn])
                pred = pipe.predict(dat.iloc[val][feat_cols]).clip(0, 1)
                fold_mae.append(mean_absolute_error(y[val], pred))
            mae = float(np.mean(fold_mae))
            if mae < best_mae:
                best_mae, best_alpha = mae, a

        rows.append({"trait": t, "mae_model": best_mae, "mae_null": null_mae, "best_alpha": best_alpha})

    dfres = pd.DataFrame(rows)
    summary = {
        "mae_avg": dfres["mae_model"].mean(),
        "lift_abs": (dfres["mae_null"] - dfres["mae_model"]).mean(),
        "lift_rel_%": (100 * ((dfres["mae_null"] - dfres["mae_model"]) / dfres["mae_null"])).mean(),
    }
    return dfres, summary

leader = []
for m in pilot_meta:
    dfres, summ = eval_ridge_cv(m["parquet"], target_cols)
    dfres.to_csv(f"/kaggle/working/results_step0_{m['backbone']}_pertrait.csv", index=False)
    leader.append({
        "backbone": m["backbone"],
        "emb_dim": m["emb_dim"],
        "sec_per_clip": round(m["sec_per_clip"], 4),
        "mae_avg": round(summ["mae_avg"], 4),
        "lift_abs": round(summ["lift_abs"], 4),
        "lift_rel_%": round(summ["lift_rel_%"], 2),
        "pertrait_csv": f"/kaggle/working/results_step0_{m['backbone']}_pertrait.csv",
    })

leader_df = pd.DataFrame(leader).sort_values(["mae_avg", "sec_per_clip"]).reset_index(drop=True)
leader_df.to_csv("/kaggle/working/results_step0_leaderboard.csv", index=False)
leader_df


          backbone  emb_dim  sec_per_clip  mae_avg  lift_abs  lift_rel_%                                       pertrait_csv
0      hubert-base      768        0.0948   0.1125    0.0047        3.77  /kaggle/working/results_step0_hubert-base_pert...
1  wavlm-base-plus      768        0.1124    0.113    0.0042        3.35  /kaggle/working/results_step0_wavlm-base-plus_...
2    wav2vec2-base      768        0.0955   0.1183   -0.0011       -1.12  /kaggle/working/results_step0_wav2vec2-base_pe...


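#### Explanation (Cell 13)
- Evaluates the pilot backbones with per-fold Ridge CV, computes MAE and lift vs the baseline, then writes the leaderboard to results_step0_leaderboard.csv.
- Note that `best_alpha` is chosen on the same validation folds on which the MAE is reported, so the scores are slightly optimistic; this applies equally to all backbones.

Step-0 decision: use hubert-base as the winner for the full run. It has the lowest mean MAE (0.1125 vs 0.113 for wavlm-base-plus, a gap of 0.0005, which is within the tie threshold) and is also the fastest (0.0948 s/clip). wav2vec2-base has a negative lift, meaning it does worse than the null baseline.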
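
The acceptance rule can be applied mechanically to the leaderboard. The sketch below uses the leaderboard values shown above. Because the per-trait lifts are stored in the per-trait CSVs and not shown here, the average `lift_abs` is used as a stand-in for "positive lift on every trait"; that substitution, the 0.003 tie margin, and the name `pick_winner` are assumptions of this example:

```python
leaderboard = [
    {"backbone": "hubert-base",     "sec_per_clip": 0.0948, "mae_avg": 0.1125, "lift_abs": 0.0047},
    {"backbone": "wavlm-base-plus", "sec_per_clip": 0.1124, "mae_avg": 0.1130, "lift_abs": 0.0042},
    {"backbone": "wav2vec2-base",   "sec_per_clip": 0.0955, "mae_avg": 0.1183, "lift_abs": -0.0011},
]

def pick_winner(rows, tie_margin=0.003):
    """Lowest MAE among backbones with positive lift; near-ties go to the fastest."""
    eligible = [r for r in rows if r["lift_abs"] > 0]
    if not eligible:
        return None
    best_mae = min(r["mae_avg"] for r in eligible)
    tied = [r for r in eligible if r["mae_avg"] - best_mae <= tie_margin]
    return min(tied, key=lambda r: r["sec_per_clip"])["backbone"]

winner = pick_winner(leaderboard)
print(winner)
```

With these numbers hubert-base wins on both MAE and speed, so the tie-break does not change the outcome.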

## **1) Extract FULL embeddings (hubert-base)**

In [None]:
# === Set up metadata and columns ===
import os, pandas as pd, numpy as np, time, librosa, torch
from transformers import AutoFeatureExtractor, AutoModel

BASE = "/kaggle/input/traningv1"
df = pd.read_csv(f"{BASE}/fi_v2_meta.with_feats.clean.folds.csv")

audio_col = "wav_path" if "wav_path" in df.columns else "relpath"
df["audio_path"] = df[audio_col].apply(lambda p: os.path.join(BASE, p.lstrip("./")))
if "person_id" not in df.columns:
    df["person_id"] = df.get("group_id", df[audio_col].apply(lambda p: os.path.basename(p).split(".")[0]))
df["clip_id"] = df[audio_col].apply(lambda p: os.path.basename(p))

target_cols = ["extraversion","agreeableness","conscientiousness","neuroticism","openness"]
assert all(c in df.columns for c in target_cols)

# === Single extraction pass for 2 poolings: mean & stats ===
SR = 16000
def load_16k(p): 
    y, _ = librosa.load(p, sr=SR, mono=True); return y

def extract_full_two_poolings(local_dir, paths, batch_size=8):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    fe = AutoFeatureExtractor.from_pretrained(local_dir, local_files_only=True)
    model = AutoModel.from_pretrained(local_dir, local_files_only=True).to(device).eval()

    mean_chunks, stats_chunks = [], []
    t0 = time.time()
    with torch.no_grad():
        for i in range(0, len(paths), batch_size):
            waves = [load_16k(p) for p in paths[i:i+batch_size]]
            inputs = fe(waves, sampling_rate=SR, return_tensors="pt", padding=True)
            inputs = {k:v.to(device) for k,v in inputs.items()}
            hs = model(**inputs).last_hidden_state   # [B,T,D]
            mean = hs.mean(dim=1)                    # [B,D]
            std  = hs.std(dim=1)                     # [B,D]
            mean_chunks.append(mean.cpu().numpy())
            stats_chunks.append(torch.cat([mean, std], dim=1).cpu().numpy())
    spc = (time.time()-t0)/max(len(paths),1)
    mean_mat  = np.vstack(mean_chunks)
    stats_mat = np.vstack(stats_chunks)
    return mean_mat, stats_mat, spc, mean_mat.shape[1]

# Local path of the winning model
local_dir = LOCAL_DIRS["hubert-base"]  # from the earlier step (snapshot_download)

mean_mat, stats_mat, sec_per_clip, D = extract_full_two_poolings(local_dir, df["audio_path"].tolist(), batch_size=8)
print("sec/clip ~", round(sec_per_clip, 4), "| emb_dim =", D, "| rows =", len(df))

os.makedirs("/kaggle/working/emb", exist_ok=True)

# Save Parquet: mean pooling
cols_mean = [f"emb_{i}" for i in range(D)]
emb_mean_df = pd.concat([
    df[["person_id","fold","clip_id","audio_path"] + target_cols].reset_index(drop=True),
    pd.DataFrame(mean_mat, columns=cols_mean)
], axis=1)
PQ_MEAN = "/kaggle/working/emb/ssl_emb_hubert-base_mean_full.parquet"
emb_mean_df.to_parquet(PQ_MEAN, index=False)

# Save Parquet: stats pooling (mean ⨁ std)
cols_stats = [f"emb_{i}" for i in range(2*D)]
emb_stats_df = pd.concat([
    df[["person_id","fold","clip_id","audio_path"] + target_cols].reset_index(drop=True),
    pd.DataFrame(stats_mat, columns=cols_stats)
], axis=1)
PQ_STATS = "/kaggle/working/emb/ssl_emb_hubert-base_stats_full.parquet"
emb_stats_df.to_parquet(PQ_STATS, index=False)

PQ_MEAN, PQ_STATS


sec/clip ~ 0.0978 | emb_dim = 768 | rows = 5425


('/kaggle/working/emb/ssl_emb_hubert-base_mean_full.parquet',
 '/kaggle/working/emb/ssl_emb_hubert-base_stats_full.parquet')

#### Explanation (Cell 14)
- Runs the full extraction for the winning backbone (hubert-base) with both mean and stats pooling, and saves the Parquet files PQ_MEAN and PQ_STATS.

## **Classification evaluation (mean vs stats): macro-F1 & balanced accuracy**

In [None]:
# --- PROGRESS EVAL (detailed version) ---
import numpy as np, pandas as pd, time, json, os, warnings
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, balanced_accuracy_score
from tqdm.auto import tqdm
warnings.filterwarnings("ignore")

target_cols = ["extraversion","agreeableness","conscientiousness","neuroticism","openness"]

def eval_cls_tertile_from_parquet_progress(
    parquet_path,
    target_cols,
    fold_col="fold",
    n_classes=3,
    save_prefix="/kaggle/working/_progress",
    solver="saga",
    max_iter=2000,
    random_state=42,
    verbose=3,  # 0: silent, 1: per-trait only, 2: +running mean, 3: +per-fold log
):
    os.makedirs(save_prefix, exist_ok=True)
    base = os.path.basename(parquet_path)
    log_path = f"{save_prefix}/log_{base}.txt"

    def _log(msg):
        if verbose >= 3:
            tqdm.write(msg)
        with open(log_path, "a", encoding="utf-8") as f:
            f.write(msg + "\n")

    dat = pd.read_parquet(parquet_path)
    feat_cols = [c for c in dat.columns if c.startswith("emb_")]
    X = dat[feat_cols].values
    folds = dat[fold_col].values
    uniq = sorted(np.unique(folds))

    rows, t_global0 = [], time.time()
    trait_iter = tqdm(target_cols, desc=f"Traits ({base})", disable=(verbose==0))
    for t in trait_iter:
        # container for per-fold metrics (saved incrementally)
        fold_rows = []

        y_cont = dat[t].values
        f1s, bals, f1s_null, bals_null = [], [], [], []
        t0 = time.time()

        # progress bar per-fold
        fold_iter = tqdm(uniq, leave=False, desc=f"{t}: folds", disable=(verbose<2))
        for f in fold_iter:
            trn = np.where(folds != f)[0]
            val = np.where(folds == f)[0]

            # anti-leak thresholds computed from TRAIN only
            if n_classes == 3:
                bins = np.quantile(y_cont[trn], [1/3, 2/3])
            else:
                bins = [np.quantile(y_cont[trn], 0.5)]
            y_trn = np.digitize(y_cont[trn], bins, right=True)
            y_val = np.digitize(y_cont[val], bins, right=True)

            # model
            pipe = Pipeline([
                ("scaler", StandardScaler()),
                ("logreg", LogisticRegression(
                    max_iter=max_iter, solver=solver,
                    multi_class="multinomial", class_weight="balanced",
                    n_jobs=(-1 if solver in {"saga","sag"} else None),
                    random_state=random_state,
                ))
            ])

            t_fit0 = time.time()
            pipe.fit(X[trn], y_trn)
            y_hat = pipe.predict(X[val])
            sec_fit = time.time() - t_fit0

            # majority-class baseline
            maj = np.bincount(y_trn).argmax()
            y_null = np.full_like(y_val, maj)

            # metrics
            f1  = f1_score(y_val, y_hat, average="macro")
            bal = balanced_accuracy_score(y_val, y_hat)
            f1n = f1_score(y_val, y_null, average="macro")
            baln= balanced_accuracy_score(y_val, y_null)

            f1s.append(f1); bals.append(bal); f1s_null.append(f1n); bals_null.append(baln)

            # log detail per-fold
            dist = dict(zip(*np.unique(y_trn, return_counts=True)))
            _log(
                f"[{t}] fold={f} | trn={len(trn)} val={len(val)} | bins={np.round(bins,4).tolist()} "
                f"| train_dist={dist} | sec_fit={sec_fit:.2f} "
                f"| F1={f1:.3f} Bal={bal:.3f} | nullF1={f1n:.3f} nullBal={baln:.3f}"
            )

            fold_rows.append({
                "trait": t, "fold": int(f),
                "train_n": int(len(trn)), "val_n": int(len(val)),
                "bin_lo": float(bins[0]), "bin_hi": float(bins[-1]),
                "train_dist_0": int(dist.get(0,0)),
                "train_dist_1": int(dist.get(1,0)),
                "train_dist_2": int(dist.get(2,0)),
                "sec_fit": round(sec_fit,3),
                "f1": float(f1), "bal": float(bal),
                "f1_null": float(f1n), "bal_null": float(baln),
            })

            if verbose >= 2:
                fold_iter.set_postfix({
                    "F1(run)": f"{np.mean(f1s):.3f}",
                    "Bal(run)": f"{np.mean(bals):.3f}"
                })

        # save the per-fold metrics for this trait
        pd.DataFrame(fold_rows).to_csv(
            f"{save_prefix}/fold_metrics_{base}_{t}.csv", index=False
        )

        row = {
            "trait": t,
            "f1_macro": float(np.mean(f1s)),
            "bal_acc":  float(np.mean(bals)),
            "f1_macro_null": float(np.mean(f1s_null)),
            "bal_acc_null":  float(np.mean(bals_null)),
            "time_s": round(time.time() - t0, 2),
        }
        rows.append(row)

        # partial CSV per-trait (rolling)
        pd.DataFrame(rows).to_csv(f"{save_prefix}/partial_{base}.csv", index=False)

        if verbose >= 1:
            trait_iter.set_postfix({
                "last_trait": t,
                "F1": f"{row['f1_macro']:.3f}",
                "Bal": f"{row['bal_acc']:.3f}"
            })

    dfres = pd.DataFrame(rows)
    summary = {
        "f1_macro_avg": dfres["f1_macro"].mean(),
        "bal_acc_avg":  dfres["bal_acc"].mean(),
        # absolute gain over the majority baseline, in percentage points (not a relative %)
        "lift_f1_pct":  100*(dfres["f1_macro"].mean() - dfres["f1_macro_null"].mean()),
        "time_total_s": round(time.time() - t_global0, 2),
        "parquet": parquet_path,
        "n_rows": len(dat),
        "n_feats": len(feat_cols),
        "solver": solver,
        "max_iter": max_iter,
        "log_path": log_path,
        "partial_csv": f"{save_prefix}/partial_{base}.csv",
    }
    with open(f"{save_prefix}/summary_{base}.json","w") as f:
        json.dump(summary, f, indent=2)
    return dfres, summary


#### Explanation (Cell 15)
- Defines the tertile-classification evaluator (logistic regression with the SAGA solver). It logs every fold and writes per-fold and per-trait partial CSVs, so an interrupted long run leaves savepoints to continue from.
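
The leak-free binning used by the evaluator can be looked at in isolation. The sketch below uses a hypothetical helper `tertile_labels` and made-up scores (neither comes from the notebook): the cut points are taken from the training scores only and then applied unchanged to the validation scores.

```python
import numpy as np

def tertile_labels(y_train, y_val):
    """Bin continuous scores into classes 0/1/2 using cut points from y_train only."""
    bins = np.quantile(y_train, [1/3, 2/3])
    # right=True: a score equal to a cut point falls into the lower class
    lab_train = np.digitize(y_train, bins, right=True)
    lab_val = np.digitize(y_val, bins, right=True)
    return lab_train, lab_val, bins

# made-up scores, for illustration only
y_train = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])
y_val = np.array([0.05, 0.30, 0.45, 0.95])

lab_train, lab_val, bins = tertile_labels(y_train, y_val)
print("cut points:", np.round(bins, 4))
print("train labels:", lab_train.tolist())
print("val labels:", lab_val.tolist())
```

Because the validation scores never influence `bins`, the class boundaries seen at validation time are exactly the ones the model was trained against.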

In [None]:
# run the evaluation with progress bars and very detailed logging
res_mean, sum_mean = eval_cls_tertile_from_parquet_progress(
    PQ_MEAN, target_cols,
    save_prefix="/kaggle/working/_progress_mean",
    verbose=3  # print the per-fold log
)
res_stats, sum_stats = eval_cls_tertile_from_parquet_progress(
    PQ_STATS, target_cols,
    save_prefix="/kaggle/working/_progress_stats",
    verbose=3
)

pool_leader = pd.DataFrame([
    {"pooling":"mean",  **sum_mean},
    {"pooling":"stats", **sum_stats},
]).sort_values(["f1_macro_avg","bal_acc_avg"], ascending=[False, False]).reset_index(drop=True)

display(pool_leader)
res_mean.to_csv("/kaggle/working/results_cls_hubert_mean_pertrait.csv", index=False)
res_stats.to_csv("/kaggle/working/results_cls_hubert_stats_pertrait.csv", index=False)
pool_leader.to_csv("/kaggle/working/results_cls_hubert_pool_leader.csv", index=False)


Traits (ssl_emb_hubert-base_mean_full.parquet):   0%|          | 0/5 [00:00<?, ?it/s]

extraversion: folds:   0%|          | 0/5 [00:00<?, ?it/s]

[extraversion] fold=0 | trn=4332 val=1093 | bins=[0.4206, 0.5514] | train_dist={0: 1538, 1: 1357, 2: 1437} | sec_fit=151.31 | F1=0.469 Bal=0.471 | nullF1=0.172 nullBal=0.333
[extraversion] fold=1 | trn=4408 val=1017 | bins=[0.4206, 0.5514] | train_dist={0: 1556, 1: 1391, 2: 1461} | sec_fit=153.63 | F1=0.494 Bal=0.494 | nullF1=0.175 nullBal=0.333
[extraversion] fold=2 | trn=4304 val=1121 | bins=[0.4206, 0.5607] | train_dist={0: 1531, 1: 1437, 2: 1336} | sec_fit=149.89 | F1=0.453 Bal=0.461 | nullF1=0.171 nullBal=0.333
[extraversion] fold=3 | trn=4319 val=1106 | bins=[0.4206, 0.5514] | train_dist={0: 1516, 1: 1382, 2: 1421} | sec_fit=150.55 | F1=0.487 Bal=0.486 | nullF1=0.178 nullBal=0.333
[extraversion] fold=4 | trn=4337 val=1088 | bins=[0.4206, 0.5514] | train_dist={0: 1531, 1: 1381, 2: 1425} | sec_fit=150.96 | F1=0.459 Bal=0.458 | nullF1=0.175 nullBal=0.333


agreeableness: folds:   0%|          | 0/5 [00:00<?, ?it/s]

[agreeableness] fold=0 | trn=4332 val=1093 | bins=[0.5055, 0.6154] | train_dist={0: 1504, 1: 1418, 2: 1410} | sec_fit=152.83 | F1=0.404 Bal=0.407 | nullF1=0.168 nullBal=0.333
[agreeableness] fold=1 | trn=4408 val=1017 | bins=[0.5055, 0.6154] | train_dist={0: 1548, 1: 1446, 2: 1414} | sec_fit=155.68 | F1=0.403 Bal=0.404 | nullF1=0.161 nullBal=0.333
[agreeableness] fold=2 | trn=4304 val=1121 | bins=[0.5055, 0.6154] | train_dist={0: 1468, 1: 1421, 2: 1415} | sec_fit=149.68 | F1=0.406 Bal=0.407 | nullF1=0.177 nullBal=0.333
[agreeableness] fold=3 | trn=4319 val=1106 | bins=[0.5055, 0.6154] | train_dist={0: 1491, 1: 1439, 2: 1389} | sec_fit=150.32 | F1=0.428 Bal=0.431 | nullF1=0.171 nullBal=0.333
[agreeableness] fold=4 | trn=4337 val=1088 | bins=[0.5055, 0.6154] | train_dist={0: 1481, 1: 1448, 2: 1408} | sec_fit=150.79 | F1=0.393 Bal=0.393 | nullF1=0.177 nullBal=0.333


conscientiousness: folds:   0%|          | 0/5 [00:00<?, ?it/s]

[conscientiousness] fold=0 | trn=4332 val=1093 | bins=[0.466, 0.6019] | train_dist={0: 1516, 1: 1442, 2: 1374} | sec_fit=150.94 | F1=0.469 Bal=0.470 | nullF1=0.168 nullBal=0.333
[conscientiousness] fold=1 | trn=4408 val=1017 | bins=[0.466, 0.6019] | train_dist={0: 1521, 1: 1454, 2: 1433} | sec_fit=153.29 | F1=0.452 Bal=0.451 | nullF1=0.176 nullBal=0.333
[conscientiousness] fold=2 | trn=4304 val=1121 | bins=[0.466, 0.6019] | train_dist={0: 1503, 1: 1385, 2: 1416} | sec_fit=150.60 | F1=0.450 Bal=0.454 | nullF1=0.169 nullBal=0.333
[conscientiousness] fold=3 | trn=4319 val=1106 | bins=[0.466, 0.6019] | train_dist={0: 1501, 1: 1409, 2: 1409} | sec_fit=150.88 | F1=0.501 Bal=0.507 | nullF1=0.172 nullBal=0.333
[conscientiousness] fold=4 | trn=4337 val=1088 | bins=[0.466, 0.6019] | train_dist={0: 1499, 1: 1414, 2: 1424} | sec_fit=151.64 | F1=0.461 Bal=0.461 | nullF1=0.175 nullBal=0.333


neuroticism: folds:   0%|          | 0/5 [00:00<?, ?it/s]

[neuroticism] fold=0 | trn=4332 val=1093 | bins=[0.4688, 0.6042] | train_dist={0: 1508, 1: 1459, 2: 1365} | sec_fit=151.32 | F1=0.459 Bal=0.465 | nullF1=0.160 nullBal=0.333
[neuroticism] fold=1 | trn=4408 val=1017 | bins=[0.4688, 0.6042] | train_dist={0: 1500, 1: 1515, 2: 1393} | sec_fit=154.23 | F1=0.463 Bal=0.463 | nullF1=0.164 nullBal=0.333
[neuroticism] fold=2 | trn=4304 val=1121 | bins=[0.4688, 0.6042] | train_dist={0: 1448, 1: 1456, 2: 1400} | sec_fit=150.58 | F1=0.447 Bal=0.450 | nullF1=0.173 nullBal=0.333
[neuroticism] fold=3 | trn=4319 val=1106 | bins=[0.4688, 0.6042] | train_dist={0: 1484, 1: 1482, 2: 1353} | sec_fit=151.05 | F1=0.509 Bal=0.513 | nullF1=0.167 nullBal=0.333
[neuroticism] fold=4 | trn=4337 val=1088 | bins=[0.4688, 0.6042] | train_dist={0: 1472, 1: 1480, 2: 1385} | sec_fit=151.82 | F1=0.450 Bal=0.450 | nullF1=0.168 nullBal=0.333


openness: folds:   0%|          | 0/5 [00:00<?, ?it/s]

[openness] fold=0 | trn=4332 val=1093 | bins=[0.5111, 0.6333] | train_dist={0: 1457, 1: 1454, 2: 1421} | sec_fit=151.57 | F1=0.485 Bal=0.489 | nullF1=0.161 nullBal=0.333
[openness] fold=1 | trn=4408 val=1017 | bins=[0.5222, 0.6333] | train_dist={0: 1589, 1: 1350, 2: 1469} | sec_fit=154.26 | F1=0.464 Bal=0.464 | nullF1=0.176 nullBal=0.333
[openness] fold=2 | trn=4304 val=1121 | bins=[0.5222, 0.6444] | train_dist={0: 1540, 1: 1437, 2: 1327} | sec_fit=149.68 | F1=0.472 Bal=0.475 | nullF1=0.180 nullBal=0.333
[openness] fold=3 | trn=4319 val=1106 | bins=[0.5111, 0.6333] | train_dist={0: 1445, 1: 1448, 2: 1426} | sec_fit=150.57 | F1=0.497 Bal=0.497 | nullF1=0.166 nullBal=0.333
[openness] fold=4 | trn=4337 val=1088 | bins=[0.5222, 0.6444] | train_dist={0: 1555, 1: 1439, 2: 1343} | sec_fit=150.99 | F1=0.451 Bal=0.453 | nullF1=0.179 nullBal=0.333


Traits (ssl_emb_hubert-base_stats_full.parquet):   0%|          | 0/5 [00:00<?, ?it/s]

extraversion: folds:   0%|          | 0/5 [00:00<?, ?it/s]

[extraversion] fold=0 | trn=4332 val=1093 | bins=[0.4206, 0.5514] | train_dist={0: 1538, 1: 1357, 2: 1437} | sec_fit=298.76 | F1=0.464 Bal=0.465 | nullF1=0.172 nullBal=0.333
[extraversion] fold=1 | trn=4408 val=1017 | bins=[0.4206, 0.5514] | train_dist={0: 1556, 1: 1391, 2: 1461} | sec_fit=304.13 | F1=0.420 Bal=0.421 | nullF1=0.175 nullBal=0.333
[extraversion] fold=2 | trn=4304 val=1121 | bins=[0.4206, 0.5607] | train_dist={0: 1531, 1: 1437, 2: 1336} | sec_fit=301.45 | F1=0.458 Bal=0.460 | nullF1=0.171 nullBal=0.333
[extraversion] fold=3 | trn=4319 val=1106 | bins=[0.4206, 0.5514] | train_dist={0: 1516, 1: 1382, 2: 1421} | sec_fit=298.39 | F1=0.475 Bal=0.474 | nullF1=0.178 nullBal=0.333
[extraversion] fold=4 | trn=4337 val=1088 | bins=[0.4206, 0.5514] | train_dist={0: 1531, 1: 1381, 2: 1425} | sec_fit=299.58 | F1=0.438 Bal=0.437 | nullF1=0.175 nullBal=0.333


agreeableness: folds:   0%|          | 0/5 [00:00<?, ?it/s]

[agreeableness] fold=0 | trn=4332 val=1093 | bins=[0.5055, 0.6154] | train_dist={0: 1504, 1: 1418, 2: 1410} | sec_fit=299.40 | F1=0.421 Bal=0.424 | nullF1=0.168 nullBal=0.333
[agreeableness] fold=1 | trn=4408 val=1017 | bins=[0.5055, 0.6154] | train_dist={0: 1548, 1: 1446, 2: 1414} | sec_fit=304.55 | F1=0.393 Bal=0.393 | nullF1=0.161 nullBal=0.333
[agreeableness] fold=2 | trn=4304 val=1121 | bins=[0.5055, 0.6154] | train_dist={0: 1468, 1: 1421, 2: 1415} | sec_fit=297.27 | F1=0.417 Bal=0.417 | nullF1=0.177 nullBal=0.333
[agreeableness] fold=3 | trn=4319 val=1106 | bins=[0.5055, 0.6154] | train_dist={0: 1491, 1: 1439, 2: 1389} | sec_fit=298.41 | F1=0.433 Bal=0.432 | nullF1=0.171 nullBal=0.333
[agreeableness] fold=4 | trn=4337 val=1088 | bins=[0.5055, 0.6154] | train_dist={0: 1481, 1: 1448, 2: 1408} | sec_fit=299.80 | F1=0.391 Bal=0.391 | nullF1=0.177 nullBal=0.333


conscientiousness: folds:   0%|          | 0/5 [00:00<?, ?it/s]

[conscientiousness] fold=0 | trn=4332 val=1093 | bins=[0.466, 0.6019] | train_dist={0: 1516, 1: 1442, 2: 1374} | sec_fit=299.42 | F1=0.432 Bal=0.432 | nullF1=0.168 nullBal=0.333
[conscientiousness] fold=1 | trn=4408 val=1017 | bins=[0.466, 0.6019] | train_dist={0: 1521, 1: 1454, 2: 1433} | sec_fit=304.46 | F1=0.415 Bal=0.414 | nullF1=0.176 nullBal=0.333
[conscientiousness] fold=2 | trn=4304 val=1121 | bins=[0.466, 0.6019] | train_dist={0: 1503, 1: 1385, 2: 1416} | sec_fit=297.27 | F1=0.432 Bal=0.433 | nullF1=0.169 nullBal=0.333
[conscientiousness] fold=3 | trn=4319 val=1106 | bins=[0.466, 0.6019] | train_dist={0: 1501, 1: 1409, 2: 1409} | sec_fit=298.57 | F1=0.449 Bal=0.450 | nullF1=0.172 nullBal=0.333
[conscientiousness] fold=4 | trn=4337 val=1088 | bins=[0.466, 0.6019] | train_dist={0: 1499, 1: 1414, 2: 1424} | sec_fit=301.49 | F1=0.440 Bal=0.441 | nullF1=0.175 nullBal=0.333


neuroticism: folds:   0%|          | 0/5 [00:00<?, ?it/s]

[neuroticism] fold=0 | trn=4332 val=1093 | bins=[0.4688, 0.6042] | train_dist={0: 1508, 1: 1459, 2: 1365} | sec_fit=301.16 | F1=0.449 Bal=0.451 | nullF1=0.160 nullBal=0.333
[neuroticism] fold=1 | trn=4408 val=1017 | bins=[0.4688, 0.6042] | train_dist={0: 1500, 1: 1515, 2: 1393} | sec_fit=306.52 | F1=0.432 Bal=0.431 | nullF1=0.164 nullBal=0.333
[neuroticism] fold=2 | trn=4304 val=1121 | bins=[0.4688, 0.6042] | train_dist={0: 1448, 1: 1456, 2: 1400} | sec_fit=299.06 | F1=0.420 Bal=0.420 | nullF1=0.173 nullBal=0.333
[neuroticism] fold=3 | trn=4319 val=1106 | bins=[0.4688, 0.6042] | train_dist={0: 1484, 1: 1482, 2: 1353} | sec_fit=300.09 | F1=0.469 Bal=0.470 | nullF1=0.167 nullBal=0.333
[neuroticism] fold=4 | trn=4337 val=1088 | bins=[0.4688, 0.6042] | train_dist={0: 1472, 1: 1480, 2: 1385} | sec_fit=301.37 | F1=0.441 Bal=0.441 | nullF1=0.168 nullBal=0.333


openness: folds:   0%|          | 0/5 [00:00<?, ?it/s]

[openness] fold=0 | trn=4332 val=1093 | bins=[0.5111, 0.6333] | train_dist={0: 1457, 1: 1454, 2: 1421} | sec_fit=300.96 | F1=0.435 Bal=0.435 | nullF1=0.161 nullBal=0.333
[openness] fold=1 | trn=4408 val=1017 | bins=[0.5222, 0.6333] | train_dist={0: 1589, 1: 1350, 2: 1469} | sec_fit=306.02 | F1=0.425 Bal=0.425 | nullF1=0.176 nullBal=0.333
[openness] fold=2 | trn=4304 val=1121 | bins=[0.5222, 0.6444] | train_dist={0: 1540, 1: 1437, 2: 1327} | sec_fit=298.91 | F1=0.455 Bal=0.456 | nullF1=0.180 nullBal=0.333
[openness] fold=3 | trn=4319 val=1106 | bins=[0.5111, 0.6333] | train_dist={0: 1445, 1: 1448, 2: 1426} | sec_fit=299.97 | F1=0.446 Bal=0.446 | nullF1=0.166 nullBal=0.333
[openness] fold=4 | trn=4337 val=1088 | bins=[0.5222, 0.6444] | train_dist={0: 1555, 1: 1439, 2: 1343} | sec_fit=298.80 | F1=0.415 Bal=0.420 | nullF1=0.179 nullBal=0.333


| | pooling | f1_macro_avg | bal_acc_avg | lift_f1_pct | time_total_s | parquet | n_rows | n_feats | solver | max_iter | log_path | partial_csv |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | mean | 0.45703 | 0.458808 | 28.583792 | 3789.33 | `/kaggle/working/emb/ssl_emb_hubert-base_mean_f...` | 5425 | 768 | saga | 2000 | `/kaggle/working/_progress_mean/log_ssl_emb_hub...` | `/kaggle/working/_progress_mean/partial_ssl_emb...` |
| 1 | stats | 0.434651 | 0.435139 | 26.345903 | 7516.1 | `/kaggle/working/emb/ssl_emb_hubert-base_stats_...` | 5425 | 1536 | saga | 2000 | `/kaggle/working/_progress_stats/log_ssl_emb_hu...` | `/kaggle/working/_progress_stats/partial_ssl_em...` |


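#### Explanation (Cell 16)
- Runs the evaluation on the mean-pooled and stats-pooled embeddings and saves the per-trait results plus a pooling leaderboard. Mean pooling comes out ahead (macro-F1 0.457 vs 0.435) in about half the total run time, so it is the pooling carried forward.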
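
The constant `nullBal=0.333` in the log follows directly from the definitions: a majority-class predictor has recall 1 on one class and 0 on the other two, so the mean recall over three classes is 1/3. The sketch below reproduces this with toy labels and a hand-rolled metric helper, `macro_f1_and_bal_acc`, which is an assumption for illustration (the notebook itself uses scikit-learn's `f1_score` and `balanced_accuracy_score`).

```python
import numpy as np

def macro_f1_and_bal_acc(y_true, y_pred, n_classes=3):
    """Macro-F1 and balanced accuracy (mean per-class recall); empty denominators count as 0."""
    f1s, recalls = [], []
    for c in range(n_classes):
        tp = int(np.sum((y_true == c) & (y_pred == c)))
        fp = int(np.sum((y_true != c) & (y_pred == c)))
        fn = int(np.sum((y_true == c) & (y_pred != c)))
        f1s.append(2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0)
        recalls.append(tp / (tp + fn) if (tp + fn) else 0.0)
    return float(np.mean(f1s)), float(np.mean(recalls))

# toy labels: class 0 is the majority in training
y_train = np.array([0] * 40 + [1] * 30 + [2] * 30)
y_val = np.array([0] * 10 + [1] * 10 + [2] * 10)

majority = int(np.bincount(y_train).argmax())
y_null = np.full_like(y_val, majority)   # always predict the majority class

f1_null, bal_null = macro_f1_and_bal_acc(y_val, y_null)
print(majority, round(f1_null, 3), round(bal_null, 3))
```

The macro-F1 of this baseline depends on how large the majority class is in the validation fold, which is why `nullF1` varies slightly between folds while `nullBal` does not.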

In [None]:
# Check savepoint status (which traits have finished)

import os, glob, pandas as pd, json

def check_progress(prefix):
    part = glob.glob(f"{prefix}/partial_*.csv")
    summ = glob.glob(f"{prefix}/summary_*.json")
    fold = sorted(glob.glob(f"{prefix}/fold_metrics_*_*.csv"))
    print("==", prefix)
    print(" partial:", os.path.basename(part[0]) if part else "-",
          "| summary:", os.path.basename(summ[0]) if summ else "-",
          "| fold_files:", len(fold))
    if part:
        dfp = pd.read_csv(part[0])
        print(" traits_done:", dfp["trait"].tolist(), "| n:", len(dfp))
        display(dfp)

check_progress("/kaggle/working/_progress_mean")
check_progress("/kaggle/working/_progress_stats")


== /kaggle/working/_progress_mean
 partial: partial_ssl_emb_hubert-base_mean_full.parquet.csv | summary: summary_ssl_emb_hubert-base_mean_full.parquet.json | fold_files: 5
 traits_done: ['extraversion', 'agreeableness', 'conscientiousness', 'neuroticism', 'openness'] | n: 5


| | trait | f1_macro | bal_acc | f1_macro_null | bal_acc_null | time_s |
|---|---|---|---|---|---|---|
| 0 | extraversion | 0.472535 | 0.473862 | 0.174142 | 0.333333 | 756.38 |
| 1 | agreeableness | 0.406927 | 0.408189 | 0.170874 | 0.333333 | 759.34 |
| 2 | conscientiousness | 0.466526 | 0.468481 | 0.171963 | 0.333333 | 757.39 |
| 3 | neuroticism | 0.465642 | 0.468174 | 0.166478 | 0.333333 | 759.06 |
| 4 | openness | 0.473522 | 0.475336 | 0.172504 | 0.333333 | 757.12 |


== /kaggle/working/_progress_stats
 partial: partial_ssl_emb_hubert-base_stats_full.parquet.csv | summary: summary_ssl_emb_hubert-base_stats_full.parquet.json | fold_files: 5
 traits_done: ['extraversion', 'agreeableness', 'conscientiousness', 'neuroticism', 'openness'] | n: 5


| | trait | f1_macro | bal_acc | f1_macro_null | bal_acc_null | time_s |
|---|---|---|---|---|---|---|
| 0 | extraversion | 0.451267 | 0.451263 | 0.174142 | 0.333333 | 1502.37 |
| 1 | agreeableness | 0.410876 | 0.411497 | 0.170874 | 0.333333 | 1499.48 |
| 2 | conscientiousness | 0.433627 | 0.433968 | 0.171963 | 0.333333 | 1501.26 |
| 3 | neuroticism | 0.442015 | 0.442565 | 0.166478 | 0.333333 | 1508.25 |
| 4 | openness | 0.435472 | 0.436402 | 0.172504 | 0.333333 | 1504.71 |


#### Explanation (Cell 17)
- Checks the saved progress and summary files to confirm that no trait was left unfinished before moving on.
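
A rerun could consult the partial CSV to decide which traits still need work. The following is a minimal sketch under that assumption: `traits_remaining` is a hypothetical helper (not part of the notebook), and the demo file is written to a temporary directory rather than to the real `_progress_*` folders.

```python
import csv
import os
import tempfile

def traits_remaining(partial_csv, target_cols):
    """Return the traits in target_cols that are not yet listed in partial_csv."""
    if not os.path.exists(partial_csv):
        return list(target_cols)
    with open(partial_csv, newline="", encoding="utf-8") as f:
        done = {row["trait"] for row in csv.DictReader(f)}
    return [t for t in target_cols if t not in done]

target_cols = ["extraversion", "agreeableness", "conscientiousness",
               "neuroticism", "openness"]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "partial_demo.csv")
    missing_before = traits_remaining(path, target_cols)   # no savepoint yet

    # simulate a run that stopped after two traits (metric values are placeholders)
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["trait", "f1_macro"])
        writer.writeheader()
        writer.writerow({"trait": "extraversion", "f1_macro": 0.0})
        writer.writerow({"trait": "agreeableness", "f1_macro": 0.0})
    missing_after = traits_remaining(path, target_cols)

print(missing_before)
print(missing_after)
```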

In [None]:
# Lock in the winner:

POOLING_CHOSEN = "mean"
PARQUET = PQ_MEAN


#### Explanation (Cell 18)
- Locks in the pooling choice (mean) and stores the Parquet path used by all later stages.

In [None]:
# write the "official" CSVs and the leaderboard from the partial files (safe even if the final files were never written):

import glob, pandas as pd, numpy as np

pm = glob.glob("/kaggle/working/_progress_mean/partial_*.csv")[0]
ps = glob.glob("/kaggle/working/_progress_stats/partial_*.csv")[0]
res_mean  = pd.read_csv(pm); res_mean.to_csv("/kaggle/working/results_cls_hubert_mean_pertrait.csv", index=False)
res_stats = pd.read_csv(ps); res_stats.to_csv("/kaggle/working/results_cls_hubert_stats_pertrait.csv", index=False)

pool_leader = pd.DataFrame([
    {"pooling":"mean",  "f1_macro_avg": res_mean["f1_macro"].mean(),  "bal_acc_avg": res_mean["bal_acc"].mean(),
     "lift_f1_pct": 100*(res_mean["f1_macro"].mean() - res_mean["f1_macro_null"].mean())},
    {"pooling":"stats", "f1_macro_avg": res_stats["f1_macro"].mean(), "bal_acc_avg": res_stats["bal_acc"].mean(),
     "lift_f1_pct": 100*(res_stats["f1_macro"].mean() - res_stats["f1_macro_null"].mean())},
]).sort_values(["f1_macro_avg","bal_acc_avg"], ascending=[False, False])
pool_leader.to_csv("/kaggle/working/results_cls_hubert_pool_leader.csv", index=False)
pool_leader


| | pooling | f1_macro_avg | bal_acc_avg | lift_f1_pct |
|---|---|---|---|---|
| 0 | mean | 0.45703 | 0.458808 | 28.583792 |
| 1 | stats | 0.434651 | 0.435139 | 26.345903 |


#### Explanation (Cell 19)
- Rewrites the official results (mean/stats) and the leaderboard to CSV from the partial files, so the artifacts exist even if the earlier run stopped midway.
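
As a quick arithmetic check, the leaderboard row for mean pooling can be reproduced from the per-trait values printed by the savepoint check. The numbers below are copied from that output; the variable names are illustrative only.

```python
from statistics import mean

# per-trait values for mean pooling, copied from the savepoint output
f1_macro = [0.472535, 0.406927, 0.466526, 0.465642, 0.473522]
f1_macro_null = [0.174142, 0.170874, 0.171963, 0.166478, 0.172504]

f1_avg = mean(f1_macro)
null_avg = mean(f1_macro_null)

# "lift_f1_pct" in the notebook is an absolute difference scaled by 100,
# i.e. percentage points, not a relative improvement
lift_points = 100 * (f1_avg - null_avg)
relative_ratio = f1_avg / null_avg   # how many times the baseline F1

print(f"f1_macro_avg={f1_avg:.5f} lift_points={lift_points:.2f} ratio={relative_ratio:.2f}")
```

Reading `lift_f1_pct` as percentage points avoids mistaking it for a relative gain; in relative terms the model's macro-F1 is a bit more than 2.5 times the baseline's.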

## What has been done so far

* **Step-0 Env & Backbone check**: sorted out library versions and loaded the HF model from a local snapshot.
* **Bake-off pilot**: compared HuBERT / WavLM / W2V → winner: **HuBERT-base**.
* **Full embedding**: extracted **mean** and **stats** embeddings for all clips.
* **Classification evaluation (3 tertile classes, speaker-wise CV)** with progress bars and savepoints → **mean pooling wins**.

## Next (what is being done now)

1. **Lock in the winning pooling** (mean) and **train the final model** per trait on **all data**.
2. Save the artifacts: `.joblib` models, `thresholds.json`, `config.json`.
3. (Optional) build confusion matrices and a single-audio inference demo.

In [None]:
# check savepoints: see what has already been saved
import os, pandas as pd

target_cols = ["extraversion","agreeableness","conscientiousness","neuroticism","openness"]
done = [t for t in target_cols if os.path.exists(f"/kaggle/working/models/{t}_mean_logreg.joblib")]
todo = [t for t in target_cols if t not in done]

print("MODEL DONE:", done)
print("MODEL TODO:", todo)
for p in [
    "/kaggle/working/models/report_partial_mean.csv",
    "/kaggle/working/models/report_final_mean.csv",
    "/kaggle/working/models/config.json",
    "/kaggle/working/models/thresholds.json",
    "/kaggle/working/emb/ssl_emb_hubert-base_mean_full.parquet",
    "/kaggle/working/results_cls_hubert_mean_pertrait.csv",
    "/kaggle/working/results_cls_hubert_stats_pertrait.csv",
    "/kaggle/working/results_cls_hubert_pool_leader.csv",
]:
    print(os.path.basename(p), "→", os.path.exists(p))


MODEL DONE: []
MODEL TODO: ['extraversion', 'agreeableness', 'conscientiousness', 'neuroticism', 'openness']
report_partial_mean.csv → False
report_final_mean.csv → False
config.json → False
thresholds.json → False
ssl_emb_hubert-base_mean_full.parquet → True
results_cls_hubert_mean_pertrait.csv → True
results_cls_hubert_stats_pertrait.csv → True
results_cls_hubert_pool_leader.csv → True


#### Explanation (Cell 20)
- Checks which models are already trained and whether the critical files (embeddings, evaluation results) exist before the final training. At this point no model has been trained yet.

In [None]:
# document the environment (optional, but good for reproducibility)
import subprocess
subprocess.run("pip freeze > /kaggle/working/requirements.txt", shell=True, check=False)

CompletedProcess(args='pip freeze > /kaggle/working/requirements.txt', returncode=0)

#### Explanation (Cell 21)
- Freezes the package versions of the current runtime into requirements.txt for reproducibility.
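
If shelling out to `pip` is not desirable, a similar pin list can be built from inside Python. This is only a sketch of an alternative, not what the notebook does: `freeze_lines` is a hypothetical helper based on the standard-library `importlib.metadata`, and its output may differ in detail from `pip freeze` (for example for editable or VCS installs).

```python
from importlib.metadata import distributions

def freeze_lines():
    """Return sorted 'name==version' pins for the installed distributions."""
    pins = set()
    for dist in distributions():
        name = dist.metadata.get("Name")
        if name:                       # skip distributions with broken metadata
            pins.add(f"{name}=={dist.version}")
    return sorted(pins, key=str.lower)

lines = freeze_lines()
print(len(lines), "packages pinned")
# To persist the list (the path here is an assumption):
# with open("requirements.txt", "w", encoding="utf-8") as f:
#     f.write("\n".join(lines) + "\n")
```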

In [None]:
# --- Pack & Save Now ---
import os, zipfile, subprocess, shutil, glob, json, pandas as pd

# 1) document the environment
subprocess.run("pip freeze > /kaggle/working/requirements.txt", shell=True, check=False)

# 2) short list of the important contents
must_have = [
    "/kaggle/working/emb/ssl_emb_hubert-base_mean_full.parquet",
    "/kaggle/working/models",
    "/kaggle/working/_progress_mean",
    "/kaggle/working/_progress_stats",
    "/kaggle/working/results_cls_hubert_mean_pertrait.csv",
    "/kaggle/working/results_cls_hubert_stats_pertrait.csv",
    "/kaggle/working/results_cls_hubert_pool_leader.csv",
    "/kaggle/working/requirements.txt",
]
print("CHECK FILES:")
for p in must_have:
    print(("DIR " if os.path.isdir(p) else "FILE"), p, "→", os.path.exists(p))

# 3) optional: a compact archive to keep things tidy
zip_path = "/kaggle/working/TA_S1_checkpoint.zip"
with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as z:
    for p in must_have:
        if os.path.isdir(p):
            for root,_,files in os.walk(p):
                for f in files:
                    full = os.path.join(root,f)
                    z.write(full, full.replace("/kaggle/working/",""))
        elif os.path.exists(p):
            z.write(p, p.replace("/kaggle/working/",""))
print("ZIPPED →", zip_path)
print("\nNow click **Save Version** so that everything in /kaggle/working is saved as Output Files.")


CHECK FILES:
FILE /kaggle/working/emb/ssl_emb_hubert-base_mean_full.parquet → True
FILE /kaggle/working/models → False
DIR  /kaggle/working/_progress_mean → True
DIR  /kaggle/working/_progress_stats → True
FILE /kaggle/working/results_cls_hubert_mean_pertrait.csv → True
FILE /kaggle/working/results_cls_hubert_stats_pertrait.csv → True
FILE /kaggle/working/results_cls_hubert_pool_leader.csv → True
FILE /kaggle/working/requirements.txt → True
ZIPPED → /kaggle/working/TA_S1_checkpoint.zip

Now click **Save Version** so that everything in /kaggle/working is saved as Output Files.


#### Explanation (Cell 22)
- Packs the important artifacts (embeddings, progress folders, results, requirements.txt) into TA_S1_checkpoint.zip so they can be saved as Kaggle output. The models directory did not exist yet at this point, so it was skipped.
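
The archive step can be tried out away from Kaggle. The sketch below is a variation on the cell above, not the cell itself: `zip_paths` is a hypothetical helper that derives archive names with `os.path.relpath` instead of string replacement, and the demo runs in a temporary directory with placeholder files.

```python
import os
import tempfile
import zipfile

def zip_paths(paths, zip_path, root):
    """Archive the files/directories in `paths`, storing names relative to `root`."""
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as z:
        for p in paths:
            if os.path.isdir(p):
                for dirpath, _, files in os.walk(p):
                    for name in files:
                        full = os.path.join(dirpath, name)
                        z.write(full, os.path.relpath(full, root))
            elif os.path.isfile(p):
                z.write(p, os.path.relpath(p, root))
            # paths that do not exist are skipped, as in the cell above
    with zipfile.ZipFile(zip_path) as z:
        return sorted(z.namelist())

with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "models"))
    for rel in ("models/a.joblib", "results.csv"):        # placeholder files
        with open(os.path.join(root, rel), "w") as f:
            f.write("x")
    names = zip_paths(
        [os.path.join(root, "models"),
         os.path.join(root, "results.csv"),
         os.path.join(root, "missing_dir")],
        os.path.join(root, "checkpoint.zip"),
        root,
    )

print(names)
```

Listing the archive afterwards, as the helper does, is a cheap way to notice that an expected directory was silently skipped.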

# **RESUMING FROM CHECKPOINT 1 (dataset trainingv1-checkpoint1)**


In [2]:
# === Rehydrate from the previous version's output dataset ===
import os, shutil, glob, pandas as pd

BASE_IN = "/kaggle/input/trainingv1-checkpoint1"  # <-- change this if the dataset slug differs

# copy the important folders into working (so they can be written to / overwritten)
for d in ["models", "_progress_mean", "_progress_stats", "emb", "hf_cache"]:
    src = os.path.join(BASE_IN, d)
    if os.path.exists(src):
        shutil.copytree(src, f"/kaggle/working/{d}", dirs_exist_ok=True)

# embedding path (read directly from INPUT to save disk space)
PQ_MEAN  = f"{BASE_IN}/emb/ssl_emb_hubert-base_mean_full.parquet"

# Check which models already exist
TARGET_COLS = ["extraversion","agreeableness","conscientiousness","neuroticism","openness"]
done = [t for t in TARGET_COLS if os.path.exists(f"/kaggle/working/models/{t}_mean_logreg.joblib")]
todo = [t for t in TARGET_COLS if t not in done]
print("MODEL DONE:", done)
print("MODEL TODO:", todo)

# sanity check: files that must exist
for p in [PQ_MEAN,
          f"{BASE_IN}/results_cls_hubert_mean_pertrait.csv",
          f"{BASE_IN}/results_cls_hubert_stats_pertrait.csv",
          f"{BASE_IN}/results_cls_hubert_pool_leader.csv"]:
    print(os.path.basename(p), "→", os.path.exists(p))


MODEL DONE: []
MODEL TODO: ['extraversion', 'agreeableness', 'conscientiousness', 'neuroticism', 'openness']
ssl_emb_hubert-base_mean_full.parquet → True
results_cls_hubert_mean_pertrait.csv → True
results_cls_hubert_stats_pertrait.csv → True
results_cls_hubert_pool_leader.csv → True


#### Explanation (Cell 23)
- Reloads the artifacts from the checkpoint dataset (models, progress folders, embeddings) so the experiment can continue without a full re-run.

In [4]:
# === Final training with SAGA + per-epoch progress bar ===
import os, json, time, joblib, numpy as np, pandas as pd
from tqdm.auto import tqdm
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, balanced_accuracy_score

# ---- config ----
POOLING_CHOSEN = "mean"
PARQUET = PQ_MEAN                     # make sure PQ_MEAN is already defined
SAVE_DIR = "/kaggle/working/models"
TARGET_COLS = ["extraversion","agreeableness","conscientiousness","neuroticism","openness"]

SOLVER = "saga"                       # use saga so the fit proceeds in iterations
PENALTY = "l2"                        # alternatives: "l1" / "elasticnet" (then set L1_RATIO)
C_VAL   = 1.0
L1_RATIO = None                       # e.g. 0.5 for elasticnet
MAX_ITER = 2000                       # target total number of epochs
EPOCH_STEP = 50                       # progress-bar step (iterations per chunk)
CLASS_WEIGHT = "balanced"

os.makedirs(SAVE_DIR, exist_ok=True)

# ---- load embeddings + thresholds (global tertiles over all data) ----
dat = pd.read_parquet(PARQUET)
feat_cols = [c for c in dat.columns if c.startswith("emb_")]
X = dat[feat_cols].values
thresholds = {t: [float(b) for b in np.quantile(dat[t].values, [1/3,2/3])] for t in TARGET_COLS}

# write config/thresholds up front (safe)
with open(f"{SAVE_DIR}/config.json","w") as f:
    json.dump({"backbone":"hubert-base","pooling":POOLING_CHOSEN,"embedding_parquet":PARQUET,
               "targets":TARGET_COLS,"n_rows":int(len(dat)),"n_feats":int(len(feat_cols)),
               "solver":SOLVER,"penalty":PENALTY,"C":C_VAL,"l1_ratio":L1_RATIO,
               "max_iter":MAX_ITER,"class_weight":CLASS_WEIGHT}, f, indent=2)
with open(f"{SAVE_DIR}/thresholds.json","w") as f:
    json.dump(thresholds, f, indent=2)

def fit_with_epoch_bar(X, y, *, total_iter=2000, step=50):
    """Fit LogisticRegression in warm-started chunks; advance the tqdm bar after each chunk of up to 'step' iterations."""
    pipe = Pipeline([
        ("scaler", StandardScaler()),
        ("logreg", LogisticRegression(
            solver=SOLVER, penalty=PENALTY, C=C_VAL, l1_ratio=L1_RATIO,
            multi_class="multinomial", class_weight=CLASS_WEIGHT,
            warm_start=True, max_iter=step, n_jobs=-1, verbose=0, random_state=42
        ))
    ])
    bar = tqdm(total=total_iter, desc="epochs", leave=False)
    used = 0
    last_small = 0
    t0 = time.time()
    while used < total_iter:
        # iteration budget for the next chunk
        remain = total_iter - used
        chunk = min(step, remain)
        pipe.set_params(logreg__max_iter=chunk)
        pipe.fit(X, y)

        # iterations the solver actually used (can be < chunk if it converged)
        it = int(np.max(np.atleast_1d(pipe.named_steps["logreg"].n_iter_)))
        used += it
        bar.update(it)

        # quick training-set metrics for the bar postfix
        y_hat = pipe.predict(X)
        f1_tr = f1_score(y, y_hat, average="macro")
        bal_tr = balanced_accuracy_score(y, y_hat)
        bar.set_postfix({"f1_tr": f"{f1_tr:.3f}", "bal_tr": f"{bal_tr:.3f}"})

        # early stop once two consecutive chunks converge early (few iterations used)
        if it < step//2:
            last_small += 1
        else:
            last_small = 0
        if last_small >= 2:
            break
    bar.close()
    elapsed = time.time() - t0
    return pipe, used, elapsed

rows = []
todo = [t for t in TARGET_COLS if not os.path.exists(f"{SAVE_DIR}/{t}_{POOLING_CHOSEN}_logreg.joblib")]
outer = tqdm(todo, desc=f"Final train ({POOLING_CHOSEN})")
for t in outer:
    y = np.digitize(dat[t].values, thresholds[t], right=True)

    # progress bar epoch
    pipe, used_iter, sec = fit_with_epoch_bar(X, y, total_iter=MAX_ITER, step=EPOCH_STEP)

    # save model + report partial
    out_path = f"{SAVE_DIR}/{t}_{POOLING_CHOSEN}_logreg.joblib"
    joblib.dump(pipe, out_path)
    y_hat = pipe.predict(X)
    row = {
        "trait": t,
        "epochs_used": used_iter,
        "time_s": round(sec,2),
        "train_f1_macro": float(f1_score(y, y_hat, average="macro")),
        "train_bal_acc": float(balanced_accuracy_score(y, y_hat)),
        "model_path": out_path
    }
    rows.append(row)
    pd.DataFrame(rows).to_csv(f"{SAVE_DIR}/report_partial_{POOLING_CHOSEN}.csv", index=False)

# save the final report
if rows:
    pd.DataFrame(rows).to_csv(f"{SAVE_DIR}/report_final_{POOLING_CHOSEN}.csv", index=False)
    display(pd.DataFrame(rows))
else:
    print("No trait needs training (all models already exist).")


| | trait | epochs_used | time_s | train_f1_macro | train_bal_acc | model_path |
|---|---|---|---|---|---|---|
| 0 | extraversion | 2000 | 193.84 | 0.661482 | 0.662232 | `/kaggle/working/models/extraversion_mean_logre...` |
| 1 | agreeableness | 2000 | 194.63 | 0.611461 | 0.612339 | `/kaggle/working/models/agreeableness_mean_logr...` |
| 2 | conscientiousness | 2000 | 193.93 | 0.667658 | 0.669445 | `/kaggle/working/models/conscientiousness_mean_...` |
| 3 | neuroticism | 2000 | 194.75 | 0.658126 | 0.661284 | `/kaggle/working/models/neuroticism_mean_logreg...` |
| 4 | openness | 2000 | 194.36 | 0.645608 | 0.646382 | `/kaggle/working/models/openness_mean_logreg.jo...` |


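#### Explanation (Cell 24)
- Trains the final SAGA classifier per trait with chunked progress reporting; saves the models, thresholds, and partial/final reports, and skips any trait whose model file already exists. In the run shown, every trait used the full 2000-iteration budget, and the reported F1 and balanced accuracy are training-set scores, so they are higher than the cross-validated ones.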
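
The saved `thresholds.json` holds two cut points per trait, which is what turns a continuous score into one of the three classes. The sketch below shows one way such a file could be read back and applied. The cut values and the `LABELS` names are made up for illustration, and `score_to_class` is a hypothetical helper that uses the standard-library `bisect` in place of `np.digitize`.

```python
import bisect
import json
import os
import tempfile

LABELS = ["low", "medium", "high"]   # assumed names for classes 0/1/2

def score_to_class(score, cuts):
    """Map a continuous score to 0/1/2 given two increasing cut points.

    bisect_left gives the same result as np.digitize(score, cuts, right=True):
    a score equal to a cut point falls into the lower class.
    """
    return bisect.bisect_left(cuts, score)

# illustrative cut points, not the values written by the notebook
thresholds = {"extraversion": [0.42, 0.55]}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "thresholds.json")
    with open(path, "w") as f:
        json.dump(thresholds, f, indent=2)
    with open(path) as f:
        loaded = json.load(f)

classes = [score_to_class(s, loaded["extraversion"]) for s in (0.30, 0.42, 0.50, 0.70)]
print(classes, [LABELS[c] for c in classes])
```

Keeping the cut points next to the model files means a predicted class index can always be traced back to the score range it stands for.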

In [5]:
# == Final CV report (same setup as the tertile classification) ==
import numpy as np, pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, balanced_accuracy_score, confusion_matrix

PARQUET = PQ_MEAN
dat = pd.read_parquet(PARQUET)
X = dat[[c for c in dat.columns if c.startswith("emb_")]].values
folds = dat["fold"].values
traits = ["extraversion","agreeableness","conscientiousness","neuroticism","openness"]

rows, cms = [], {}
for t in traits:
    y = dat[t].values
    f1s, bals = [], []
    cm_sum = np.zeros((3,3), dtype=int)
    for f in sorted(np.unique(folds)):
        trn, val = np.where(folds!=f)[0], np.where(folds==f)[0]
        bins = np.quantile(y[trn], [1/3, 2/3])
        y_trn = np.digitize(y[trn], bins, right=True)
        y_val = np.digitize(y[val], bins, right=True)

        pipe = Pipeline([
            ("scaler", StandardScaler()),
            ("clf", LogisticRegression(
                solver="saga", penalty="l2", C=1.0, multi_class="multinomial",
                class_weight="balanced", max_iter=2000, n_jobs=-1
            ))
        ])
        pipe.fit(X[trn], y_trn)
        y_hat = pipe.predict(X[val])

        f1s.append(f1_score(y_val, y_hat, average="macro"))
        bals.append(balanced_accuracy_score(y_val, y_hat))
        cm_sum += confusion_matrix(y_val, y_hat, labels=[0,1,2])

    rows.append({"trait":t, "f1_macro":np.mean(f1s), "bal_acc":np.mean(bals)})
    cms[t] = cm_sum

df_cv = pd.DataFrame(rows).sort_values("f1_macro", ascending=False).reset_index(drop=True)
display(df_cv)
df_cv.to_csv("/kaggle/working/final_cv_metrics.csv", index=False)

# save the confusion matrix for each trait
for t, cm in cms.items():
    pd.DataFrame(cm, index=["true_L","true_M","true_H"], columns=["pred_L","pred_M","pred_H"])\
      .to_csv(f"/kaggle/working/confmat_{t}.csv")




| | trait | f1_macro | bal_acc |
|---|---|---|---|
| 0 | openness | 0.473522 | 0.475336 |
| 1 | extraversion | 0.472535 | 0.473862 |
| 2 | conscientiousness | 0.466526 | 0.468481 |
| 3 | neuroticism | 0.465803 | 0.468354 |
| 4 | agreeableness | 0.406927 | 0.408189 |


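#### Explanation (Cell 25)
- Computes the final CV report (macro F1 and balanced accuracy, averaged over folds) and one confusion matrix per trait summed over folds, then saves both to files for auditing. The tertile thresholds are recomputed from the training portion of each fold.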
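
The train-only tertile binning used in the fold loop above can be illustrated in isolation. This is a minimal sketch on made-up scores; the helper name `tertile_labels` and the toy arrays are assumptions for illustration, not part of the notebook.

```python
import numpy as np

def tertile_labels(y_train, y_other):
    """Bin both arrays into 0/1/2 (Low/Mid/High) using cut points from y_train only."""
    bins = np.quantile(y_train, [1/3, 2/3])          # two cut points -> three classes
    lab_train = np.digitize(y_train, bins, right=True)
    lab_other = np.digitize(y_other, bins, right=True)
    return lab_train, lab_other, bins

# Toy scores (hypothetical): six training values, three validation values.
y_train = np.array([0.10, 0.20, 0.30, 0.40, 0.50, 0.60])
y_val = np.array([0.05, 0.35, 0.95])

lab_trn, lab_val, bins = tertile_labels(y_train, y_val)
print("cut points:", bins)
print("train labels:", lab_trn.tolist())
print("val labels:", lab_val.tolist())
```

Because `right=True` is passed to `np.digitize`, a value exactly equal to a cut point falls into the lower class. The validation values never influence the cut points, which is what keeps the labels free of leakage.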

In [6]:
# === HELD-OUT TEST (by person) & evaluation ===
import numpy as np, pandas as pd, os, json
from sklearn.model_selection import GroupShuffleSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, balanced_accuracy_score, classification_report, confusion_matrix

PARQUET = PQ_MEAN  # from the previous step (mean pooling)
dat = pd.read_parquet(PARQUET)

assert "person_id" in dat.columns, "person_id is required to split by person"
X = dat[[c for c in dat.columns if c.startswith("emb_")]].values
groups = dat["person_id"].values
traits = ["extraversion","agreeableness","conscientiousness","neuroticism","openness"]

# 1) split off TEST once (reproducible)
gss = GroupShuffleSplit(n_splits=1, test_size=0.20, random_state=42)
trn_idx, tst_idx = next(gss.split(X, groups=groups))
Xtr, Xte = X[trn_idx], X[tst_idx]
dat_tr, dat_te = dat.iloc[trn_idx].reset_index(drop=True), dat.iloc[tst_idx].reset_index(drop=True)

# save which persons ended up in the test set (reproducibility)
os.makedirs("/kaggle/working/splits", exist_ok=True)
pd.Series(sorted(dat_te["person_id"].unique())).to_csv("/kaggle/working/splits/test_persons.csv", index=False, header=["person_id"])

# 2) evaluate per trait (fit on train, score on test)
rows = []
for t in traits:
    y = dat[t].values
    # thresholds from TRAIN only (prevents leakage)
    bins = np.quantile(y[trn_idx], [1/3, 2/3])
    ytr = np.digitize(y[trn_idx], bins, right=True)
    yte = np.digitize(y[tst_idx],  bins, right=True)

    pipe = Pipeline([
        ("scaler", StandardScaler()),
        ("clf", LogisticRegression(
            solver="saga", penalty="l2", C=1.0,
            multi_class="multinomial", class_weight="balanced",
            max_iter=2000, n_jobs=-1
        ))
    ])
    pipe.fit(Xtr, ytr)
    yhat = pipe.predict(Xte)

    f1m  = f1_score(yte, yhat, average="macro")
    balc = balanced_accuracy_score(yte, yhat)
    rpt  = classification_report(yte, yhat, labels=[0,1,2], output_dict=True, zero_division=0)

    # per-class F1
    f1_L = rpt["0"]["f1-score"]; f1_M = rpt["1"]["f1-score"]; f1_H = rpt["2"]["f1-score"]

    # baseline majority (train)
    maj  = np.bincount(ytr).argmax()
    y0   = np.full_like(yte, maj)
    f1m0 = f1_score(yte, y0, average="macro")
    balc0= balanced_accuracy_score(yte, y0)

    rows.append({
        "trait": t,
        "f1_macro": f1m, "bal_acc": balc,
        "f1_L": f1_L, "f1_M": f1_M, "f1_H": f1_H,
        "f1_macro_baseline": f1m0, "bal_acc_baseline": balc0,
        # absolute gain over the baseline in percentage points, not a relative percentage
        "lift_f1_pct": 100*(f1m - f1m0)
    })

    # save the confusion matrix
    cm = confusion_matrix(yte, yhat, labels=[0,1,2])
    pd.DataFrame(cm, index=["true_L","true_M","true_H"], columns=["pred_L","pred_M","pred_H"])\
      .to_csv(f"/kaggle/working/test_confmat_{t}.csv")

df_test = pd.DataFrame(rows).sort_values("f1_macro", ascending=False).reset_index(drop=True)
df_test.to_csv("/kaggle/working/test_metrics.csv", index=False)
display(df_test)

# aggregate over the 5 traits
agg = {
    "f1_macro_avg": df_test["f1_macro"].mean(),
    "bal_acc_avg":  df_test["bal_acc"].mean(),
    "lift_f1_pct_avg": df_test["lift_f1_pct"].mean()
}
pd.DataFrame([agg]).to_csv("/kaggle/working/test_metrics_aggregate.csv", index=False)
display(pd.DataFrame([agg]))




Unnamed: 0,trait,f1_macro,bal_acc,f1_L,f1_M,f1_H,f1_macro_baseline,bal_acc_baseline,lift_f1_pct
0,extraversion,0.476674,0.478437,0.560606,0.340491,0.528926,0.177928,0.333333,29.874623
1,conscientiousness,0.472121,0.475106,0.563776,0.310078,0.54251,0.178587,0.333333,29.353356
2,neuroticism,0.460403,0.462131,0.519894,0.336634,0.524683,0.167472,0.333333,29.293159
3,openness,0.454332,0.455783,0.52551,0.345781,0.491704,0.176935,0.333333,27.739679
4,agreeableness,0.416912,0.417316,0.451007,0.378531,0.421199,0.174603,0.333333,24.230923


Unnamed: 0,f1_macro_avg,bal_acc_avg,lift_f1_pct_avg
0,0.456089,0.457755,28.098348


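#### Explanation (Cell 26)
- Holds out a test set split by person (`GroupShuffleSplit`, `test_size=0.20`, `random_state=42`), fits one classifier per trait on the remaining rows with tertile thresholds taken from the training rows only, and scores it on the test set against a majority-class baseline.
- Saves the per-trait metrics, the aggregate over the five traits, the test confusion matrices, and the list of test persons. Note that `lift_f1_pct` is the absolute difference in macro F1 times 100 (percentage points).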
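
The baseline columns can be sanity-checked by hand. A predictor that always outputs one class has recall 1 for that class and precision equal to the class's share `p` of the test rows, so its F1 is `2p / (1 + p)`, while the other two classes score 0; balanced accuracy is 1/3 for three classes regardless of `p`. The sketch below reproduces that arithmetic. The function name is illustrative, and it assumes all three classes occur in the test set.

```python
def constant_predictor_scores(class_share, n_classes=3):
    """Macro F1 and balanced accuracy of a predictor that always outputs one class.

    class_share: fraction of test rows whose true label is the predicted class.
    Assumes every class occurs at least once in the test set.
    """
    precision = class_share      # only that share of the predictions is correct
    recall = 1.0                 # every true member of the class is predicted
    f1_predicted = 2 * precision * recall / (precision + recall)
    macro_f1 = f1_predicted / n_classes   # the other classes contribute F1 = 0
    balanced_acc = recall / n_classes     # recall is 1 for one class, 0 for the rest
    return macro_f1, balanced_acc

# Perfectly even tertiles: the predicted class holds one third of the test rows.
macro_f1, balanced_acc = constant_predictor_scores(1 / 3)
print(round(macro_f1, 4), round(balanced_acc, 4))
```

With perfectly even tertiles the macro-F1 baseline is 1/6 (about 0.167). The values between roughly 0.167 and 0.179 in the table are consistent with the majority class holding slightly more than a third of the test rows.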

In [7]:
# Summarize the important files and suggest committing
for p in [
    "/kaggle/working/final_cv_metrics.csv",
    "/kaggle/working/test_metrics.csv",
    "/kaggle/working/test_metrics_aggregate.csv",
    "/kaggle/working/splits/test_persons.csv",
    "/kaggle/working/models/config.json",
    "/kaggle/working/models/thresholds.json",
]:
    print(os.path.basename(p), "→", os.path.exists(p))
print("\nIf all are True, click **Save Version** so they become Output Files.")


final_cv_metrics.csv → True
test_metrics.csv → True
test_metrics_aggregate.csv → True
test_persons.csv → True
config.json → True
thresholds.json → True

If all are True, click **Save Version** so they become Output Files.


#### Explanation (Cell 27)
- Prints whether each key file (metrics, splits, config, thresholds) exists, as a checklist before saving a version.
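
If the checklist should fail loudly instead of printing booleans, the same check can return the missing paths. This is a sketch under that assumption; `missing_artifacts` is a hypothetical helper, and the second path in the example is deliberately nonexistent.

```python
from pathlib import Path

def missing_artifacts(paths):
    """Return the subset of paths that do not exist on disk, in input order."""
    return [str(p) for p in paths if not Path(p).exists()]

# "." always exists; the second path is a made-up name that should not exist.
missing = missing_artifacts([".", "/no/such/dir/final_cv_metrics.csv"])
print(missing)
```

In the notebook, one could follow this with `assert not missing, missing` before saving a version.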

In [8]:
# === Confusion matrices (VAL & TEST) for the 3-way split by person ===
import os, numpy as np, pandas as pd
from sklearn.model_selection import GroupShuffleSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

PARQUET = PQ_MEAN
dat = pd.read_parquet(PARQUET)
assert "person_id" in dat.columns, "person_id is required."

X = dat[[c for c in dat.columns if c.startswith("emb_")]].values
persons = dat["person_id"].astype(str).values
traits = ["extraversion","agreeableness","conscientiousness","neuroticism","openness"]

# -- Load the split from files if present; otherwise regenerate it (same seeds) --
split_dir = "/kaggle/working/splits"
if all(os.path.exists(f"{split_dir}/{n}_persons.csv") for n in ["train","val","test"]):
    train_p = set(pd.read_csv(f"{split_dir}/train_persons.csv")["person_id"].astype(str))
    val_p   = set(pd.read_csv(f"{split_dir}/val_persons.csv")["person_id"].astype(str))
    test_p  = set(pd.read_csv(f"{split_dir}/test_persons.csv")["person_id"].astype(str))
    train_idx = np.where(pd.Series(persons).isin(train_p))[0]
    val_idx   = np.where(pd.Series(persons).isin(val_p))[0]
    test_idx  = np.where(pd.Series(persons).isin(test_p))[0]
else:
    gss1 = GroupShuffleSplit(n_splits=1, test_size=0.20, random_state=42)
    trval_idx, test_idx = next(gss1.split(X, groups=persons))
    gss2 = GroupShuffleSplit(n_splits=1, test_size=0.20/0.80, random_state=43)
    tr_idx_sub, val_idx_sub = next(gss2.split(X[trval_idx], groups=persons[trval_idx]))
    train_idx = trval_idx[tr_idx_sub]
    val_idx   = trval_idx[val_idx_sub]

Xtr, Xva, Xte = X[train_idx], X[val_idx], X[test_idx]

os.makedirs("/kaggle/working/confmats", exist_ok=True)

def disc_from_train(y_all, tr_idx):
    bins = np.quantile(y_all[tr_idx], [1/3, 2/3])
    return np.digitize(y_all, bins, right=True), bins

for t in traits:
    y_all = dat[t].values
    y_disc, bins = disc_from_train(y_all, train_idx)
    ytr, yva, yte = y_disc[train_idx], y_disc[val_idx], y_disc[test_idx]

    clf = Pipeline([
        ("scaler", StandardScaler()),
        ("clf", LogisticRegression(
            solver="lbfgs", penalty="l2", multi_class="multinomial",
            class_weight="balanced", max_iter=2000
        ))
    ])

    # Train -> VAL CM
    clf.fit(Xtr, ytr)
    yva_hat = clf.predict(Xva)
    cm_val = confusion_matrix(yva, yva_hat, labels=[0,1,2])

    # Refit (TRAIN+VAL) -> TEST CM
    Xtrva = np.vstack([Xtr, Xva]); ytrva = np.concatenate([ytr, yva])
    clf.fit(Xtrva, ytrva)
    yte_hat = clf.predict(Xte)
    cm_test = confusion_matrix(yte, yte_hat, labels=[0,1,2])

    # Save both confusion matrices
    pd.DataFrame(cm_val,  index=["true_L","true_M","true_H"], columns=["pred_L","pred_M","pred_H"])\
      .to_csv(f"/kaggle/working/confmats/confmat_val_{t}.csv")
    pd.DataFrame(cm_test, index=["true_L","true_M","true_H"], columns=["pred_L","pred_M","pred_H"])\
      .to_csv(f"/kaggle/working/confmats/confmat_test_{t}.csv")

print("Saved confusion matrices to /kaggle/working/confmats/")
# Display the TEST confusion matrix of one trait as a quick check:
sample_t = traits[0]
display(pd.read_csv(f"/kaggle/working/confmats/confmat_test_{sample_t}.csv", index_col=0))


Saved confusion matrices to /kaggle/working/confmats/


Unnamed: 0,pred_L,pred_M,pred_H
true_L,219,108,68
true_M,127,118,109
true_H,58,90,188


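#### Explanation (Cell 28)
- Builds validation and test confusion matrices for the three-way person-level split (train/val/test), saves them per trait, and displays the test matrix of the first trait (extraversion) as a quick check.
- The validation matrix comes from a model fit on train only; the test matrix comes from a refit on train+val. This cell uses the `lbfgs` solver rather than the `saga` solver of the preceding evaluation cells.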
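
The displayed matrix can be read as per-class recall, which makes the weak spot visible at a glance. The sketch below copies the counts from the extraversion test matrix shown above; the variable names are illustrative.

```python
# Test confusion matrix displayed above (rows = true L/M/H, columns = predicted L/M/H).
cm = [
    [219, 108, 68],
    [127, 118, 109],
    [58, 90, 188],
]

# Recall per class = diagonal count / row total.
recalls = [row[i] / sum(row) for i, row in enumerate(cm)]

# Balanced accuracy is the unweighted mean of the per-class recalls.
balanced_acc = sum(recalls) / len(recalls)

for name, r in zip(["L", "M", "H"], recalls):
    print(f"recall_{name} = {r:.3f}")
print(f"balanced accuracy = {balanced_acc:.3f}")
```

The Mid class is recovered in exactly 118 of 354 cases, i.e. one third, which is chance level for three classes; Low and High are recovered in a little over half of their cases.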

In [15]:
# ==== OFFLINE INFERENCE DEMO: 1 WAV -> 5 traits (L/M/H) ====
import os, glob, json, joblib, numpy as np, pandas as pd, librosa, torch
from transformers import AutoFeatureExtractor, AutoModel

SR = 16000
MODELDIR = "/kaggle/working/models"

# 1) look for a local HuBERT snapshot in hf_cache (working dir or checkpoint dataset)
candidates = []
for base in [
    "/kaggle/working/hf_cache",
    "/kaggle/input/trainingv1-checkpoint1/hf_cache",  # when using the checkpoint dataset
]:
    hub_root = os.path.join(base, "models--facebook--hubert-base-ls960")
    snap_glob = os.path.join(hub_root, "snapshots", "*")
    candidates += glob.glob(snap_glob)

assert candidates, "No local hubert-base snapshot found in hf_cache"
HUBERT_DIR = candidates[0]
print("HUBERT_DIR:", HUBERT_DIR)

device = "cuda" if torch.cuda.is_available() else "cpu"

# 2) load feature extractor + backbone dari folder lokal (NO internet)
fe = AutoFeatureExtractor.from_pretrained(HUBERT_DIR, local_files_only=True)
backbone = AutoModel.from_pretrained(HUBERT_DIR, local_files_only=True).to(device).eval()

print("Loaded backbone from local snapshot.")

# 3) load the thresholds and the trained classification models (logreg)
with open(os.path.join(MODELDIR, "thresholds.json")) as f:
    thresholds = json.load(f)
traits = list(thresholds.keys())
models = {t: joblib.load(os.path.join(MODELDIR, f"{t}_mean_logreg.joblib")) for t in traits}

def embed_mean(wav_path: str):
    y, _ = librosa.load(wav_path, sr=SR, mono=True)
    with torch.no_grad():
        inputs = fe([y], sampling_rate=SR, return_tensors="pt", padding=True)
        inputs = {k: v.to(device) for k, v in inputs.items()}
        hs = backbone(**inputs).last_hidden_state  # [1,T,D]
        emb = hs.mean(dim=1).cpu().numpy()        # [1,D]
    return emb

def predict_personality(wav_path: str):
    emb = embed_mean(wav_path)
    out = {}
    for t in traits:
        clf = models[t]
        probs = clf.predict_proba(emb)[0]
        pred_idx = int(probs.argmax())
        label = ["Low", "Mid", "High"][pred_idx]
        out[t] = {
            "label": label,
            "pred_idx": pred_idx,
            "probs_LMH": probs.round(3).tolist(),
        }
    return out

# 4) example: use one clip from the original dataset
BASE_TR = "/kaggle/input/traningv1"
df_meta = pd.read_csv(f"{BASE_TR}/fi_v2_meta.with_feats.clean.folds.csv")
wav_col = "wav_path" if "wav_path" in df_meta.columns else "relpath"
# NOTE: lstrip("./") removes any leading "." and "/" characters, not only a literal "./" prefix
sample_path = os.path.join(BASE_TR, df_meta[wav_col].iloc[0].lstrip("./"))
print("SAMPLE WAV:", sample_path)

pred = predict_personality(sample_path)
pd.DataFrame(pred).T


HUBERT_DIR: /kaggle/working/hf_cache/models--facebook--hubert-base-ls960/snapshots/dba3bb02fda4248b6e082697eee756de8fe8aa8a
Loaded backbone from local snapshot.
SAMPLE WAV: /kaggle/input/traningv1/wav/J4GQm9j0JZ0.003.wav


Unnamed: 0,label,pred_idx,probs_LMH
extraversion,Mid,1,"[0.28299999237060547, 0.5070000290870667, 0.20..."
agreeableness,Low,0,"[0.42399999499320984, 0.19300000369548798, 0.3..."
conscientiousness,Mid,1,"[0.4480000138282776, 0.49300000071525574, 0.05..."
neuroticism,Mid,1,"[0.2669999897480011, 0.414000004529953, 0.3190..."
openness,Mid,1,"[0.16500000655651093, 0.7110000252723694, 0.12..."


#### Explanation (Cell 29)
- Offline inference demo: loads the backbone and the saved models from local files, processes one WAV file, and displays the L/M/H prediction with class probabilities for the five traits.
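
The pooling and label-mapping steps of `embed_mean` and `predict_personality` can be illustrated without the backbone. This is a minimal sketch with NumPy standing in for the torch tensors; the toy array, the helper names `mean_pool` and `to_label`, and the probability vector are made up for illustration.

```python
import numpy as np

LABELS = ["Low", "Mid", "High"]  # same index order as in predict_personality

def mean_pool(hidden_states):
    """Average frame-level features over time: [batch, T, D] -> [batch, D]."""
    return hidden_states.mean(axis=1)

def to_label(probs):
    """Map a length-3 probability vector to (label, class index) via argmax."""
    idx = int(np.argmax(probs))
    return LABELS[idx], idx

# Toy "hidden states": 1 clip, T=3 frames, D=4 features (hypothetical values).
hs = np.arange(12, dtype=np.float32).reshape(1, 3, 4)
emb = mean_pool(hs)

# Hypothetical class probabilities in L/M/H order.
label, idx = to_label([0.2, 0.5, 0.3])
print(emb.shape, emb.tolist(), label, idx)
```

Mean pooling discards the time axis, so clips of any length map to a vector of the same width, which is what lets a single logistic-regression model per trait consume them.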

## Conclusion
- The dataset and folds were validated with no leakage, the backbones were evaluated, and HuBERT-base with mean pooling was selected.
- The full embeddings were extracted, the classification evaluation was saved, and the final models were trained (together with the CV/test reports and confusion matrices).
- The main artifacts (embeddings, models, metrics, environment package) have been packaged so that the experiment can be reproduced and rerun from a checkpoint, or used for inference.