# User Activity Anomaly Detection (CLUE-style UID Swap + IsolationForest + SHAP)

This notebook organizes the pipeline into clearly separated stages:

1. **Configuration & setup**
2. **Reading JSON Lines (`.jsonl`) data as a stream**
3. **Cleaning & validating the key columns (`uid`, `time`, `type`)**
4. **Brief EDA/visualization (activity per year/month + unique users)**
5. **Sampling the target year + injecting synthetic anomalies (UID swap)**
6. **Daily feature engineering (count + ratio + log)**
7. **Modeling: IsolationForest + threshold tuning (F1)**
8. **Explainability: SHAP (global importance + top features)**
9. **Output: anomaly scores & predictions (daily level, joined back to events)**

> Note: Large data files should be processed as a **stream** (not loaded fully into memory).


In [None]:
# ============================================================
# 0) Setup: dependency & import
# ============================================================
import sys, subprocess, os, json, math
from collections import defaultdict
from datetime import datetime
from pathlib import Path

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.ensemble import IsolationForest
from sklearn.metrics import f1_score, confusion_matrix, classification_report
from sklearn.model_selection import ParameterGrid

from IPython.display import display

# (optional) install shap if it is not already available
try:
    import shap
except ImportError:
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", "shap"])
    import shap

shap.initjs()

pd.set_option("display.max_columns", 200)
pd.set_option("display.width", 120)


## 1) Configuration

Set `DATA_PATH` to the location of your JSON Lines file.

- Input format: **one event per line**, each a JSON object with at least:
  - `uid`  : user id
  - `time` : ISO timestamp (e.g. `"2020-01-01T12:34:56Z"`)
  - `type` : event type (e.g. `"login"`, `"file_accessed"`, etc.)

If your field names differ (e.g. `event_type`), the `normalize_record()` function below falls back to a few common alternatives (`user_id`/`userid`, `timestamp`/`ts`, `event_type`/`action`).
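
As a rough illustration of the input format and the field-name fallback, the sketch below parses two made-up lines, one of which uses alternative field names. The sample records and the `first_present` helper are assumptions for this example only and are not part of the notebook.

```python
import json
from datetime import datetime

# Hypothetical sample lines; the second one uses alternative field names.
lines = [
    '{"uid": 7, "time": "2020-01-01T12:34:56Z", "type": "login"}',
    '{"user_id": "u9", "timestamp": "2020-01-02T08:00:00+00:00", "event_type": "file_accessed"}',
]

def first_present(rec, *keys):
    """Return the value of the first key found in rec, or None."""
    for k in keys:
        if k in rec:
            return rec[k]
    return None

events = []
for line in lines:
    rec = json.loads(line)
    uid = first_present(rec, "uid", "user_id", "userid")
    ts = first_present(rec, "time", "timestamp", "ts")
    typ = first_present(rec, "type", "event_type", "action")
    # fromisoformat() in Python < 3.11 rejects a trailing "Z", hence the replace.
    dt = datetime.fromisoformat(ts.replace("Z", "+00:00"))
    events.append({"uid": str(uid), "time": dt, "type": str(typ)})

print(events)
```

Both records end up in the same `uid` / `time` / `type` schema, with `uid` coerced to a string and `time` as a timezone-aware `datetime`.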


In [None]:
# ============================================================
# 1) Configuration
# ============================================================
# If you are on Google Colab and the file lives on Drive, uncomment these 2 lines:
# from google.colab import drive
# drive.mount('/content/drive')

DATA_PATH = Path("/content/drive/MyDrive/Projek/clue.json")  # change to your file location

YEARS_INTEREST = list(range(2017, 2024))  # 2017–2023
TARGET_YEAR = 2020
N_SAMPLE = 50_000

# Anomaly injection (UID swap)
ANOM_N_PAIRS = 20
MIN_EVENTS_PER_USER = 80
MIN_OVERLAP_DAYS = 2
PREFER_DISSIMILAR = True

RANDOM_STATE = 42

assert TARGET_YEAR in range(min(YEARS_INTEREST), max(YEARS_INTEREST) + 1), "TARGET_YEAR is outside the YEARS_INTEREST range."
print("DATA_PATH:", DATA_PATH)


## 2) Reading data (streaming) + cleaning

Here we define helpers to:
- read the JSON Lines file line by line
- normalize the fields (`uid`, `time`, `type`)
- parse timestamps


In [None]:
# ============================================================
# 2) Helpers: iterate over jsonl & normalize records
# ============================================================
def iter_jsonl(path: Path):
    """Yield one dict per line from a JSON Lines (.jsonl) file."""
    path = Path(path)
    if not path.exists():
        raise FileNotFoundError(
            f"File not found: {path}\n"
            "Make sure DATA_PATH is correct, or upload the file to the runtime."
        )
    with path.open("r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                yield json.loads(line)
            except json.JSONDecodeError:
                continue  # silently skip malformed lines

def parse_iso_datetime(ts: str):
    """Fast ISO timestamp parsing (supports a trailing 'Z'). Returns a datetime or None."""
    if not ts:
        return None
    try:
        return datetime.fromisoformat(ts.replace("Z", "+00:00"))
    except Exception:
        return None

def normalize_record(rec: dict):
    """Map differing field names to the standard schema: uid, time, type."""
    uid = rec.get("uid", rec.get("user_id", rec.get("userid")))
    ts  = rec.get("time", rec.get("timestamp", rec.get("ts")))
    typ = rec.get("type", rec.get("event_type", rec.get("action")))
    if uid is None or ts is None or typ is None:
        return None
    dt = parse_iso_datetime(ts)
    if dt is None:
        return None
    return {
        "uid": str(uid),
        "time": dt,          # datetime (timezone-aware when the input carries 'Z' or a UTC offset)
        "type": str(typ)
    }

# quick sanity check: show the first 5 valid records
tmp = []
for rec in iter_jsonl(DATA_PATH):
    norm = normalize_record(rec)
    if norm is not None:
        tmp.append(norm)
    if len(tmp) >= 5:
        break
display(pd.DataFrame(tmp))


## 3) Streaming EDA: activity per year/month + unique users

This step computes:
- total events per year (2017–2023)
- unique users per year
- total events per (year, month) to visualize the monthly trend


In [None]:
# ============================================================
# 3) EDA Streaming
# ============================================================
def compute_activity_stats(path: Path, years_interest):
    activity_per_year = defaultdict(int)
    activity_per_year_month = defaultdict(int)  # (year, month) -> count
    userset_per_year = {y: set() for y in years_interest}

    for i, rec in enumerate(iter_jsonl(path), start=1):
        norm = normalize_record(rec)
        if norm is None:
            continue
        dt = norm["time"]
        y = dt.year
        if y in years_interest:
            activity_per_year[y] += 1
            activity_per_year_month[(y, dt.month)] += 1
            userset_per_year[y].add(norm["uid"])

        if i % 1_000_000 == 0:
            print(f"Processed {i:,} lines ...")

    return activity_per_year, activity_per_year_month, userset_per_year

activity_per_year, activity_per_year_month, userset_per_year = compute_activity_stats(DATA_PATH, YEARS_INTEREST)

years = YEARS_INTEREST
totals = [activity_per_year.get(y, 0) for y in years]
user_counts = [len(userset_per_year.get(y, set())) for y in years]

# Plot: total events per year
plt.figure(figsize=(10,4))
plt.bar(years, totals)
plt.title("Total Activity per Year (2017–2023)")
plt.xlabel("Year"); plt.ylabel("Number of events")
for y, v in zip(years, totals):
    plt.text(y, v, f"{v:,}", ha="center", va="bottom", fontsize=8, rotation=90)
plt.tight_layout(); plt.show()

# Plot: unique users per year
plt.figure(figsize=(10,4))
plt.bar(years, user_counts)
plt.title("Unique Users per Year (2017–2023)")
plt.xlabel("Year"); plt.ylabel("Number of unique users")
for y, v in zip(years, user_counts):
    plt.text(y, v, str(v), ha="center", va="bottom", fontsize=8)
plt.tight_layout(); plt.show()

# Extra: activity per month, one line per year
df_month = pd.DataFrame(
    [{"year": y, "month": m, "count": c} for (y,m), c in activity_per_year_month.items()]
)
if not df_month.empty:
    pivot_month = df_month.pivot_table(index="month", columns="year", values="count", fill_value=0).sort_index()
    plt.figure(figsize=(10,4))
    for y in pivot_month.columns:
        plt.plot(pivot_month.index, pivot_month[y], marker="o", label=str(y))
    plt.title("Activity per Month (by Year)")
    plt.xlabel("Month"); plt.ylabel("Number of events")
    plt.xticks(range(1,13))
    # `ncol` also works on Matplotlib versions older than 3.6, where `ncols` is not accepted
    plt.grid(True); plt.legend(ncol=4, fontsize=8)
    plt.tight_layout(); plt.show()

# Share of total activity per year
overall = sum(totals) if sum(totals) > 0 else 1
plt.figure(figsize=(7,7))
plt.pie([t/overall for t in totals], labels=years, autopct="%1.1f%%", startangle=140)
plt.title("Share of Activity per Year")
plt.tight_layout(); plt.show()


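## 4) Sampling the target year + basic cleaning

To keep modeling fast, we load at most `N_SAMPLE` events from `TARGET_YEAR`. Note that `load_year_sample()` keeps the first `N_SAMPLE` matching events in file order rather than a random sample, so if the file is sorted by time the subset only covers the start of the year.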
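
If a uniform random sample over the whole year is preferred to the first matching events, reservoir sampling is one option that still works on a stream. The sketch below is a generic illustration (Algorithm R), not what the notebook does; `reservoir_sample` is a hypothetical helper name.

```python
import random

def reservoir_sample(stream, k, seed=42):
    """Return up to k items drawn uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    sample = []
    for n, item in enumerate(stream, start=1):
        if len(sample) < k:
            sample.append(item)          # fill the reservoir first
        else:
            j = rng.randrange(n)         # uniform integer in [0, n)
            if j < k:
                sample[j] = item         # replace with probability k / n
    return sample

picked = reservoir_sample(range(1_000), k=10)
print(len(picked), picked)
```

The trade-off is that the whole file must be read once, whereas the early `break` in the loader stops as soon as `N_SAMPLE` rows are collected.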


In [None]:
# ============================================================
# 4) Load a TARGET_YEAR sample
# ============================================================
def load_year_sample(path: Path, year: int, n_sample: int):
    rows = []
    for rec in iter_jsonl(path):
        norm = normalize_record(rec)
        if norm is None:
            continue
        if norm["time"].year == year:
            rows.append(norm)
            if len(rows) >= n_sample:
                break

    df = pd.DataFrame(rows)
    if df.empty:
        raise ValueError(f"No data for year={year}. Check DATA_PATH / the timestamp format.")
    df["time"] = pd.to_datetime(df["time"], utc=True)
    df["date"] = df["time"].dt.date
    df["label"] = 1  # normal = +1
    return df

df_year = load_year_sample(DATA_PATH, TARGET_YEAR, N_SAMPLE)

print("Preview of NORMAL (+1) data before anomaly injection:")
display(df_year.head())
print("Distribution of type (top 10):")
display(df_year["type"].value_counts().head(10))


## 5) Synthetic anomaly injection: UID swap (CLUE-style)

Idea:
- pick pairs of active users
- find the days on which both users are active (overlap)
- set `ts` to the median overlap day
- **from `ts` onward**, swap the `uid` of the two users → the affected events are labeled **-1**
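
The steps above can be sketched on a toy event log with two users and three days. All values are made up for illustration; the real function in the next cell additionally filters candidate pairs by activity and similarity.

```python
import pandas as pd

# Toy event log: two users, three days (all values are made up).
df = pd.DataFrame({
    "uid":  ["a", "a", "a", "b", "b", "b"],
    "date": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03"] * 2).date,
    "type": ["login", "login", "upload", "download", "download", "login"],
    "label": 1,
})

days_a = set(df.loc[df["uid"] == "a", "date"])
days_b = set(df.loc[df["uid"] == "b", "date"])
overlap = sorted(days_a & days_b)
ts = overlap[len(overlap) // 2]          # middle overlap day

# Build both masks BEFORE writing, otherwise the second mask would
# also pick up the rows that were just renamed.
mask_a = (df["uid"] == "a") & (df["date"] >= ts)
mask_b = (df["uid"] == "b") & (df["date"] >= ts)
df.loc[mask_a, "uid"] = "b"
df.loc[mask_b, "uid"] = "a"
df.loc[mask_a | mask_b, "label"] = -1

print(df)
```

Events before `ts` keep their original `uid` and label; everything from `ts` onward is attributed to the other user and labeled -1.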


In [None]:
# ============================================================
# 5) Anomaly injection: UID swap
# ============================================================
def create_anomalies_by_user_swap(
    df_in: pd.DataFrame,
    n_pairs: int = 15,
    random_state: int = 42,
    min_events_per_user: int = 100,
    min_overlap_days: int = 2,
    prefer_dissimilar: bool = True,
    max_jaccard: float = 0.6
):
    rng = np.random.RandomState(random_state)
    df = df_in.copy()

    # candidate active users
    counts = df["uid"].value_counts()
    candidates = counts[counts >= min_events_per_user].index.tolist()
    if len(candidates) < 2:
        print("Too few active users. Lower min_events_per_user.")
        return df, []

    rng.shuffle(candidates)

    # precompute each user's set of event types and active days
    user_eventset = {}
    user_days = {}
    for u in candidates:
        sub = df[df["uid"] == u]
        user_eventset[u] = set(sub["type"].unique())
        user_days[u] = set(sub["date"].unique())

    def jaccard(a, b):
        if not a and not b:
            return 1.0
        inter = len(a & b)
        union = len(a | b)
        return inter / union if union > 0 else 0.0

    anomaly_idx = set()
    pairs = []

    i = 0
    while i + 1 < len(candidates) and len(pairs) < n_pairs:
        u1, u2 = candidates[i], candidates[i + 1]
        i += 2

        overlap = np.intersect1d(list(user_days[u1]), list(user_days[u2]))
        if len(overlap) < min_overlap_days:
            continue

        sim = jaccard(user_eventset[u1], user_eventset[u2])
        if prefer_dissimilar and sim > max_jaccard:
            continue

        ts = np.sort(overlap)[len(overlap) // 2]  # median overlap day
        mask1 = (df["uid"] == u1) & (df["date"] >= ts)
        mask2 = (df["uid"] == u2) & (df["date"] >= ts)
        if mask1.sum() == 0 or mask2.sum() == 0:
            continue

        anomaly_idx.update(df[mask1].index.tolist())
        anomaly_idx.update(df[mask2].index.tolist())

        # swap uid (from ts onward; masks were computed before any write)
        df.loc[mask1, "uid"] = u2
        df.loc[mask2, "uid"] = u1

        pairs.append((u1, u2, ts, sim))

    df.loc[list(anomaly_idx), "label"] = -1

    print(f"Total swapped pairs: {len(pairs)}")
    if pairs:
        print("Example pairs (u1, u2, ts, jaccard):", pairs[:5])
    print("Label distribution (after injection):")
    print(df["label"].value_counts())
    return df, pairs

df_anom, swap_pairs = create_anomalies_by_user_swap(
    df_year,
    n_pairs=ANOM_N_PAIRS,
    random_state=RANDOM_STATE,
    min_events_per_user=MIN_EVENTS_PER_USER,
    min_overlap_days=MIN_OVERLAP_DAYS,
    prefer_dissimilar=PREFER_DISSIMILAR,
)

print("Preview of data AFTER anomaly injection:")
display(df_anom.sample(10, random_state=RANDOM_STATE).sort_values("time")[["time","uid","type","label"]])


## 6) Daily vectorization: (uid, date) × event_type (count)

- Daily features = number of events per `type`.
- Daily label = **-1** if the day contains at least one anomalous event, otherwise **+1**.
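
A tiny worked example of both rules, using made-up events. For brevity this sketch uses `pd.crosstab` where the notebook uses `pivot_table`; the resulting count matrix has the same shape.

```python
import pandas as pd

events = pd.DataFrame({
    "uid":   ["a", "a", "a", "b", "b"],
    "date":  ["2020-01-01", "2020-01-01", "2020-01-02", "2020-01-01", "2020-01-01"],
    "type":  ["login", "upload", "login", "login", "login"],
    "label": [1, -1, 1, 1, 1],
})

# One row per (uid, date), one column per event type, cell = event count.
counts = pd.crosstab([events["uid"], events["date"]], events["type"])

# min() turns a single -1 event into a -1 day because -1 < +1.
day_label = events.groupby(["uid", "date"])["label"].min().reindex(counts.index)

print(counts)
print(day_label)
```

User `a` on 2020-01-01 has one normal and one anomalous event, so that whole day is labeled -1.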


In [None]:
# ============================================================
# 6) Daily vectorization
# ============================================================
def build_daily_vectors(df_in: pd.DataFrame):
    d = df_in.copy()

    X_counts = d.pivot_table(
        index=["uid", "date"],
        columns="type",
        values="time",
        aggfunc="count",
        fill_value=0
    ).sort_index()

    # daily label: min() because -1 < +1 → any -1 event makes the whole day -1
    y_day = d.groupby(["uid","date"])["label"].min().reindex(X_counts.index).astype(int)

    return X_counts, y_day

X_counts, y_day = build_daily_vectors(df_anom)

print("Daily vector shape (counts only):", X_counts.shape)
print("Daily label distribution (1 normal, -1 anomaly):")
print(y_day.value_counts())
display(X_counts.head())


## 7) Feature Engineering (lightweight but effective)

Additional features:
- `total_events`
- `unique_event_types`
- log transform of the counts: `log1p(count)`
- per-event-type ratio: `count / total_events`
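
To make the four feature groups concrete, here is a small sketch on a two-row count matrix. The numbers and the `day_1` / `day_2` index are invented for illustration; the second row is an empty day, included only to show why the division is guarded against zero totals.

```python
import numpy as np
import pandas as pd

counts = pd.DataFrame(
    {"login": [3, 0], "upload": [1, 0]},
    index=["day_1", "day_2"],   # day_2 has no events at all
)

total = counts.sum(axis=1)
features = pd.concat(
    [
        np.log1p(counts),                                       # dampens heavy-tailed counts
        total.rename("total_events"),
        (counts > 0).sum(axis=1).rename("unique_event_types"),
        # replace(0, NaN) avoids division by zero; fillna(0) maps those rows back to 0
        counts.div(total.replace(0, np.nan), axis=0).fillna(0).add_prefix("ratio_"),
    ],
    axis=1,
)
print(features.round(3))
```

The ratios describe the mix of event types independently of volume, while `log1p` keeps the volume signal without letting very active days dominate.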


In [None]:
# ============================================================
# 7) Feature engineering
# ============================================================
def make_aug_features(X_counts: pd.DataFrame) -> pd.DataFrame:
    X = X_counts.copy()

    extra = pd.DataFrame(index=X.index)
    extra["total_events"] = X.sum(axis=1)
    extra["unique_event_types"] = (X > 0).sum(axis=1)

    # log transform to reduce skew
    X_log = np.log1p(X)

    # share of each event type
    denom = extra["total_events"].replace(0, np.nan)
    ratio = X.div(denom, axis=0).fillna(0).add_prefix("ratio_")

    X_aug = pd.concat([X_log, extra, ratio], axis=1)
    return X_aug

X_aug = make_aug_features(X_counts)
X_train = X_aug[y_day == 1]          # fit on normal days only
y_true_full = y_day.copy()           # 1 normal, -1 anomaly
y_true_bin = (y_true_full == -1).astype(int)  # 1=anomaly, 0=normal

print("X_aug shape:", X_aug.shape)


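## 8) Modeling: IsolationForest + threshold tuning (F1)

Strategy:
- Fit IsolationForest on **normal** data only.
- Compute an anomaly score for all data: `scores = -decision_function(...)` (higher = more anomalous).
- Sweep percentile-based thresholds on the scores to maximize **F1**.

Because the hyperparameters and the threshold are selected against the injected labels on the same days that are scored, the reported F1 is an in-sample and therefore optimistic estimate.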
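
The percentile sweep on its own can be sketched without a model. The scores below are synthetic (drawn from two normal distributions purely for illustration), and `f1_binary` is a small stand-in written for this example instead of `sklearn.metrics.f1_score`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic scores (made up): 95 normal points and 5 anomalies with higher scores.
scores = np.concatenate([rng.normal(0.0, 1.0, 95), rng.normal(4.0, 1.0, 5)])
y_true = np.concatenate([np.zeros(95, dtype=int), np.ones(5, dtype=int)])

def f1_binary(y_true, y_pred):
    """F1 for the positive class; returns 0.0 when there are no true positives."""
    tp = int(((y_true == 1) & (y_pred == 1)).sum())
    fp = int(((y_true == 0) & (y_pred == 1)).sum())
    fn = int(((y_true == 1) & (y_pred == 0)).sum())
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

best_f1, best_t = -1.0, None
for p in np.linspace(50, 99, 100):
    t = np.percentile(scores, p)                 # candidate threshold
    f1 = f1_binary(y_true, (scores > t).astype(int))
    if f1 > best_f1:                             # keep the first best threshold
        best_f1, best_t = f1, t

print(f"best F1={best_f1:.3f} at threshold={best_t:.3f}")
```

Since a percentile threshold depends only on the ranking of the scores, adding a constant to every score does not change which points are flagged.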


In [None]:
# ============================================================
# 8) Modeling + tuning threshold
# ============================================================
def plot_confusion_matrix(cm, title="Confusion Matrix", classes=("Normal (0)", "Anomaly (1)")):
    fig, ax = plt.subplots(figsize=(5, 5))
    im = ax.imshow(cm, interpolation="nearest")
    ax.set_title(title)
    plt.colorbar(im, ax=ax)

    tick_marks = np.arange(len(classes))
    ax.set_xticks(tick_marks); ax.set_xticklabels(classes, rotation=45)
    ax.set_yticks(tick_marks); ax.set_yticklabels(classes)

    thresh = cm.max() / 2.0 if cm.size else 0
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            ax.text(j, i, format(cm[i, j], "d"),
                    ha="center", va="center",
                    color="white" if cm[i, j] > thresh else "black")

    ax.set_ylabel("True label")
    ax.set_xlabel("Predicted label")
    plt.tight_layout()
    plt.show()

def train_tune_iforest(X_aug: pd.DataFrame, X_train: pd.DataFrame, y_true_bin: np.ndarray, random_state: int = 42):
    param_grid = {
        "n_estimators": [300, 600],
        "max_samples": [0.6, 0.8, 1.0],
        "contamination": [0.05, 0.1],  # only shifts decision_function by a constant offset; the actual threshold comes from the sweep
    }
    percentiles = np.linspace(50, 99, 100)

    best = {"f1": -1, "params": None, "threshold": None, "pred": None, "model": None, "scores": None}

    for params in ParameterGrid(param_grid):
        model = IsolationForest(random_state=random_state, n_jobs=-1, **params)
        model.fit(X_train)

        scores = -model.decision_function(X_aug)  # higher score = more anomalous

        for p in percentiles:
            t = np.percentile(scores, p)
            y_pred_bin = (scores > t).astype(int)
            f1 = f1_score(y_true_bin, y_pred_bin)

            if f1 > best["f1"]:
                best.update({"f1": f1, "params": params, "threshold": t,
                             "pred": y_pred_bin, "model": model, "scores": scores})

    return best

best = train_tune_iforest(X_aug, X_train, y_true_bin, random_state=RANDOM_STATE)

print("=== Best IsolationForest (tuned) ===")
print("Params    :", best["params"])
print("Threshold :", best["threshold"])
print("Best F1   :", best["f1"])

cm = confusion_matrix(y_true_bin, best["pred"])
plot_confusion_matrix(cm, "Confusion Matrix - IsolationForest (tuned)")

print("\nClassification report:")
print(classification_report(y_true_bin, best["pred"], digits=4))

if best["f1"] < 0.70:
    print("\n[WARNING] F1 < 0.70 on this dataset.")
    print("Tips:")
    print("- Increase ANOM_N_PAIRS (inject more anomalies)")
    print("- Lower MIN_EVENTS_PER_USER (more candidate users)")
    print("- Loosen max_jaccard (if users are too similar)")
    print("- Add features (e.g. rolling mean/std per user)")


## 9) Explainability: SHAP (global importance + top features)

Computing SHAP values for a tree-based IsolationForest can be fairly expensive, so we sample rows from `X_aug`.


In [None]:
# ============================================================
# 9) SHAP
# ============================================================
n_sample_shap = min(2000, X_aug.shape[0])
rng = np.random.RandomState(RANDOM_STATE)
idx = rng.choice(np.arange(X_aug.shape[0]), size=n_sample_shap, replace=False)
X_shap = X_aug.iloc[idx]

explainer = shap.TreeExplainer(best["model"])
shap_values = explainer.shap_values(X_shap)

plt.figure(figsize=(10,6))
shap.summary_plot(shap_values, features=X_shap, feature_names=X_shap.columns, show=False)
plt.title("SHAP Summary Plot - IsolationForest (tuned)")
plt.tight_layout(); plt.show()

# Global importance: mean(|SHAP|)
mean_abs_shap = np.abs(shap_values).mean(axis=0)
imp = (pd.DataFrame({"feature": X_shap.columns, "mean_abs_shap": mean_abs_shap})
       .sort_values("mean_abs_shap", ascending=False)
       .reset_index(drop=True))

display(imp.head(20))

top3 = imp.head(3)
print("\n=== Top 3 Most Influential Features (mean |SHAP|) ===")
for i, row in top3.iterrows():
    print(f"{i+1}. {row['feature']} (mean|SHAP|={row['mean_abs_shap']:.6f})")


## 10) Output: anomaly scores & predictions

- Daily level: combine `scores` with the predictions
- Join back to the event level (`df_anom`) so that every event carries the `anomaly_score` and `pred_bin` of its (uid, date)
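
The join simply broadcasts each day's score to all events of that day. A minimal sketch with invented values is shown below; `validate="many_to_one"` is an optional safeguard added for this example and is not used in the notebook cell.

```python
import pandas as pd

events = pd.DataFrame({
    "uid":  ["a", "a", "b"],
    "date": ["2020-01-01", "2020-01-01", "2020-01-01"],
    "type": ["login", "upload", "login"],
})
daily = pd.DataFrame({
    "uid":  ["a", "b"],
    "date": ["2020-01-01", "2020-01-01"],
    "anomaly_score": [0.12, -0.03],
    "pred_bin": [1, 0],
})

# validate="many_to_one" raises if (uid, date) is not unique on the daily side,
# which would otherwise silently duplicate events.
out = events.merge(daily, on=["uid", "date"], how="left", validate="many_to_one")
print(out)
```

With `how="left"` the number of event rows is preserved, and both events of user `a` receive the same daily score and prediction.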


In [None]:
# ============================================================
# 10) Output + join
# ============================================================
scores_full = best["scores"]
pred_full = best["pred"]

df_daily = X_aug.copy()
df_daily["anomaly_score"] = scores_full
df_daily["pred_bin"] = pred_full           # 1=pred anomaly, 0=pred normal
df_daily["true_bin"] = y_true_bin.values   # 1=gt anomaly, 0=gt normal

df_daily = df_daily.reset_index()  # uid, date become columns

# join to the event level
df_out = df_anom.merge(
    df_daily[["uid","date","anomaly_score","pred_bin","true_bin"]],
    on=["uid","date"],
    how="left"
)

print("Preview output event-level:")
display(df_out.head())

# Optional: save the output (uncomment if needed)
# out_path = Path("output_with_scores.parquet")
# df_out.to_parquet(out_path, index=False)
# print("Saved:", out_path.resolve())


## 11) (Optional) Visualizing one IsolationForest tree

Useful for a quick look at the split structure, but **not** the primary means of interpretation (SHAP is the better tool for that).


In [None]:
# ============================================================
# 11) Visualize one tree from the IsolationForest
# ============================================================
from sklearn.tree import plot_tree

one_tree = best["model"].estimators_[0]
plt.figure(figsize=(20, 10))
plot_tree(
    one_tree,
    feature_names=list(X_aug.columns),  # a plain list is accepted across scikit-learn versions
    max_depth=3,
    filled=True,
    fontsize=6
)
plt.title("One IsolationForest tree (max_depth=3)")
plt.tight_layout()
plt.show()
