# Notebook: Sistem Pendukung Investigasi Serangan Siber Berdasarkan Log Server Web dan Bisecting K-Means

Notebook ini membangun sistem analisis log server web untuk investigasi serangan siber, dengan pipeline:
1. Import library & setup lingkungan
2. Membaca & membersihkan data log
3. Ekstraksi & tokenisasi URL
4. Pembuatan embedding URL dengan BERT
5. Clustering URL menggunakan Bisecting K-Means
6. Visualisasi & analisis hasil cluster

## 1. Import Library dan Setup Lingkungan

Impor semua library yang diperlukan (pandas, numpy, torch, transformers, sklearn, dsb). Lakukan setup device (CPU/GPU) dan inisialisasi model/tokenizer BERT.

In [10]:
import os
import re
import sys
import torch
import argparse
import numpy as np
import pandas as pd
from urllib.parse import urlparse, unquote
from transformers import BertTokenizer, BertModel
from sentence_transformers import SentenceTransformer
# from gensim.models import FastText
from sklearn.cluster import KMeans
from custom_bkm import VerboseBisectingKMeans, flush_print
from sklearn.preprocessing import normalize, OneHotEncoder, MinMaxScaler
from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from decoder import parse_dec_file_to_dataframe
from pprint import pprint
from tqdm import trange
from sklearn.feature_extraction.text import (
    HashingVectorizer,
    TfidfVectorizer,
    TfidfTransformer,
)
from scipy.sparse import csr_matrix, hstack, vstack, issparse

In [23]:
# Select GPU if available, otherwise fallback to CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Load tokenizer and model ONCE globally for efficiency
TOKENIZER = BertTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
MODEL = BertModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2").to(device)
MODEL.eval()

# Important Variables
input_log_file = "../inputs/sample_1000000.log"
output_dir = "../outputs/sample_1000000.csv"
num_clusters = 8

Using device: cuda


tokenizer_config.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

## 2. Membaca dan Membersihkan Data Log

Baca file log hasil decoding (menggunakan fungsi dari `decoder.py`), filter bot, dan parsing menjadi DataFrame yang siap diolah.

In [12]:
# Load and process log file
df = parse_dec_file_to_dataframe(input_log_file)
print(f"✅ Loaded {len(df)} rows from {input_log_file}")
df.head()

✅ Loaded 896359 rows from ../inputs/sample_1000000.log


Unnamed: 0,ip,time,method,url,protocol,status,size,referrer,user_agent,extra,no
0,5.217.188.22,2019-01-26 15:49:54+00:00,GET,/static/images/guarantees/warranty.png,HTTP/1.1,200,5807,https://www.zanbil.ir/m/index?utm_medium=26&ut...,Mozilla/5.0 (Linux; Android 8.1.0; SAMSUNG SM-...,-,1
1,89.199.139.223,2019-01-25 18:50:35+00:00,GET,/image/30970?name=vb-8320h-1.jpg&wh=200x200,HTTP/1.1,200,8788,-,Dalvik/2.1.0 (Linux; U; Android 7.0; SM-A510F ...,-,2
2,46.209.107.210,2019-01-26 12:28:46+00:00,GET,/image/61997/productModel/150x150,HTTP/1.1,200,2746,https://www.zanbil.ir/filter/p3?page=1,Mozilla/5.0 (Windows NT 6.1; rv:64.0) Gecko/20...,-,3
3,217.219.52.1,2019-01-24 05:43:19+00:00,GET,"/site/searchAutoComplete?f=p49&f=b63,stexists&...",HTTP/1.1,302,0,"https://www.zanbil.ir/filter?f=p49,b63,stexists",Mozilla/5.0 (Windows NT 6.1; Win64; x64) Apple...,-,4
4,91.185.151.126,2019-01-23 10:40:42+00:00,GET,/image/8165/productModel/150x150,HTTP/1.1,200,3247,https://www.zanbil.ir/browse/kitchen-sink/سینک...,Mozilla/5.0 (Windows NT 6.1; Win64; x64) Apple...,-,5


In [13]:
print("=== DataFrame Info ===")
df.info()
print("\n=== DataFrame Describe ===")
print(df.describe(include='all'))
print("\n=== DataFrame Columns ===")
print(df.columns)

=== DataFrame Info ===
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 896359 entries, 0 to 896358
Data columns (total 11 columns):
 #   Column      Non-Null Count   Dtype              
---  ------      --------------   -----              
 0   ip          896359 non-null  object             
 1   time        896359 non-null  datetime64[ns, UTC]
 2   method      896359 non-null  object             
 3   url         896359 non-null  object             
 4   protocol    896359 non-null  object             
 5   status      896359 non-null  int64              
 6   size        896359 non-null  int64              
 7   referrer    896359 non-null  object             
 8   user_agent  896359 non-null  object             
 9   extra       896359 non-null  object             
 10  no          896359 non-null  int64              
dtypes: datetime64[ns, UTC](1), int64(3), object(7)
memory usage: 75.2+ MB

=== DataFrame Describe ===
                     ip                                 time 

## 3. Ekstraksi dan Tokenisasi Fitur

Ekstrak fitur dari log, lakukan masking angka pada fitur URL, kategorisasi fitur status, dan tokenisasi path serta query string menjadi token-token teks.

In [24]:
def mask_numbers(url):
    """
    Replace numeric sequences in a URL with a placeholder token <NUM>.

    Args:
        url (str): A URL string that may contain numeric sequences.

    Returns:
        str: The URL string where all numeric sequences are replaced with <NUM>.
    """
    return re.sub(r'\d+', '<NUM>', url)



def split_url_tokens(url):
    """
    Tokenize a URL by splitting its path and query string.

    Args:
        url (str): A URL string to tokenize.

    Returns:
        list: List of token strings extracted from the URL.
    """
    parsed = urlparse(url)
    path = unquote(parsed.path)
    query = unquote(parsed.query)
    delimiters = r"[\/\-\_\=\&\?\.\+\(\)\[\]\<\>\{\}]"
    tokens = re.split(delimiters, path.strip("/")) + re.split(delimiters, query)
    return [tok for tok in tokens if tok]


def tokenize_user_agent(ua):
    """
    Tokenize a User-Agent string by splitting on common delimiters.

    Args:
        ua (str): User-Agent string.

    Returns:
        list: List of tokens such as browser, OS, and engine identifiers.
    """
    tokens = re.split(r"[ /;()]+", ua)
    return [tok for tok in tokens if tok]


def categorize_status(code):
    """
    Categorize HTTP status codes into standard ranges.

    Args:
        code (int): HTTP status code.

    Returns:
        str: Category label ("2xx", "3xx", "4xx", "5xx", or "other").
    """
    if 200 <= code < 300:
        return "2xx"
    elif 300 <= code < 400:
        return "3xx"
    elif 400 <= code < 500:
        return "4xx"
    elif 500 <= code < 600:
        return "5xx"
    else:
        return "other"

In [15]:
# --- Tokenisasi ---

# tokenized_urls = [" ".join(split_url_tokens(url)) for url in unique_urls]
tokenized_urls = [" ".join(split_url_tokens(mask_numbers(url))) for url in df['url']]
print("\n✅ tokenized_urls (head & tail)")
pprint(tokenized_urls[:5] + ["..."] + tokenized_urls[-5:])

# Method
methods = df['method'].tolist()
print("\n✅ methods (head)") 
pprint(methods[:5])

# Status kategori
status_categories = df['status'].apply(categorize_status).tolist()
print("\n✅ status_categories (head)")
pprint(status_categories[:5])

# Size mentah (akan dinormalisasi nanti)
sizes = df['size'].tolist()
print("\n✅ sizes (head)")
pprint(sizes[:5])

# User-Agent token
ua_tokens = [" ".join(tokenize_user_agent(ua)) for ua in df['user_agent']]
print("\n✅ user_agent tokens (head)")
pprint(ua_tokens[:3])


✅ tokenized_urls (head & tail)
['static images guarantees warranty png',
 'image NUM name vb NUM h NUM jpg wh NUM x NUM',
 'image NUM productModel NUM x NUM',
 'site searchAutoComplete f p NUM f b NUM ,stexists phrase چرخ خیاطی ژانومه '
 'مدل NUM',
 'image NUM productModel NUM x NUM',
 '...',
 'image NUM productModel NUM x NUM',
 'product NUM NUM صندلی مدیریتی نیلپر مدل SM NUM',
 'image NUM productModel NUM x NUM',
 'image NUM productModel NUM x NUM',
 'image NUM productTypeMenu']

✅ methods (head)
['GET', 'GET', 'GET', 'GET', 'GET']

✅ status_categories (head)
['2xx', '2xx', '2xx', '3xx', '2xx']

✅ sizes (head)
[5807, 8788, 2746, 0, 3247]

✅ user_agent tokens (head)
['Mozilla 5.0 Linux Android 8.1.0 SAMSUNG SM-J730F Build M1AJQ AppleWebKit '
 '537.36 KHTML, like Gecko SamsungBrowser 8.2 Chrome 63.0.3239.111 Mobile '
 'Safari 537.36',
 'Dalvik 2.1.0 Linux U Android 7.0 SM-A510F Build NRD90M',
 'Mozilla 5.0 Windows NT 6.1 rv:64.0 Gecko 20100101 Firefox 64.0']


## 4. Vektorisasi Fitur

Konversi token-token URL menjadi embedding vektor menggunakan berbagai metode, encoding metode dan status, normalisasi size, dan vektorisasi tfidf untuk user-agent.

In [25]:
def generate_url_hashing(url_list, n_features=1024, batch_size=50000):
    """
    Generate feature vectors for URLs using a hashing trick.

    Args:
        url_list (list): List of tokenized URL strings.
        n_features (int, optional): Number of output features. Defaults to 1024.
        batch_size (int, optional): Number of URLs per batch. Defaults to 50000.

    Returns:
        scipy.sparse.csr_matrix: L2-normalized sparse matrix of hashed URL features.
    """
    vectorizer = HashingVectorizer(
        n_features=n_features,
        alternate_sign=False,
        dtype=np.float32
    )

    X_batches = []
    for i in range(0, len(url_list), batch_size):
        batch = url_list[i:i + batch_size]
        X_batches.append(vectorizer.transform(batch))
    X = vstack(X_batches)
    return normalize(X, norm='l2', copy=False)


def generate_url_tfidf(url_list, n_features=1024, batch_size=50000):
    """
    Generate scalable TF-IDF vectors for URLs using a hashing + IDF approach.

    Args:
        url_list (list): List of tokenized URL strings.
        n_features (int, optional): Number of output features. Defaults to 1024.
        batch_size (int, optional): Number of URLs per batch. Defaults to 50000.

    Returns:
        scipy.sparse.csr_matrix: Sparse TF-IDF feature matrix for URLs.
    """
    hv = HashingVectorizer(n_features=n_features, alternate_sign=False, dtype=np.float32)

    tf_batches = []
    for i in range(0, len(url_list), batch_size):
        batch = url_list[i:i + batch_size]
        tf_batches.append(hv.transform(batch))
    X_tf = vstack(tf_batches)

    transformer = TfidfTransformer()
    X_tfidf = transformer.fit_transform(X_tf)
    return normalize(X_tfidf, norm='l2', copy=False)


def generate_url_bert(url_list, TOKENIZER, MODEL, device, batch_size=32, out_path=None):
    """
    Generate BERT embeddings for a list of preprocessed URL strings.

    Args:
        url_list (list): List of preprocessed URL strings.
        TOKENIZER (transformers tokenizer): Pre-trained tokenizer.
        MODEL (transformers model): Pre-trained model (e.g. BERT, MiniLM, etc.).
        device (torch.device): Computation device ("cuda" or "cpu").
        batch_size (int, optional): Batch size for embedding. Defaults to 32.
        out_path (str, optional): Path for memmap file to store embeddings.

    Returns:
        np.ndarray or np.memmap: Dense array or memory-mapped embeddings.
    """
    MODEL.eval()
    dim = MODEL.config.hidden_size

    if out_path:
        fp = np.memmap(out_path, dtype=np.float32, mode='w+', shape=(len(url_list), dim))
    else:
        fp = []

    for i in trange(0, len(url_list), batch_size, desc="Embedding URLs"):
        batch = url_list[i:i + batch_size]
        inputs = TOKENIZER(batch, return_tensors="pt", padding=True, truncation=True, max_length=64).to(device)
        with torch.no_grad():
            outputs = MODEL(**inputs)
        emb = outputs.last_hidden_state.mean(dim=1).cpu().numpy().astype(np.float32)

        if out_path:
            fp[i:i + len(batch)] = emb
        else:
            fp.append(emb)

        del emb, inputs, outputs
        torch.cuda.empty_cache()

    if out_path:
        del fp
        return np.memmap(out_path, dtype=np.float32, mode='r', shape=(len(url_list), dim))
    else:
        return np.vstack(fp)

In [26]:
def encode_methods(methods):
    """
    One-hot encode HTTP methods (e.g., GET, POST, PUT) as sparse vectors.

    Args:
        methods (list[str]): List of HTTP method strings.

    Returns:
        scipy.sparse.csr_matrix: One-hot encoded sparse matrix for methods.
    """
    enc = OneHotEncoder(sparse_output=True, dtype=np.float32)
    return enc.fit_transform(np.array(methods).reshape(-1, 1))


def encode_statuses(status_categories):
    """
    One-hot encode HTTP status code categories (e.g., 2xx, 3xx, 4xx, 5xx).

    Args:
        status_categories (list[str]): List of status category labels.

    Returns:
        scipy.sparse.csr_matrix: One-hot encoded sparse matrix for statuses.
    """
    enc = OneHotEncoder(sparse_output=True, dtype=np.float32)
    return enc.fit_transform(np.array(status_categories).reshape(-1, 1))


def normalize_sizes(sizes):
    """
    Normalize response sizes to the [0, 1] range and return as sparse matrix.

    Args:
        sizes (list[int]): List of response sizes.

    Returns:
        scipy.sparse.csr_matrix: Normalized response sizes.
    """
    scaler = MinMaxScaler()
    arr = scaler.fit_transform(np.array(sizes, dtype=np.float32).reshape(-1, 1))
    return csr_matrix(arr)


def vectorize_user_agents(ua_tokens, max_features=200):
    """
    Convert tokenized User-Agent strings into TF-IDF sparse vectors.

    Args:
        ua_tokens (list[str]): List of tokenized User-Agent strings.
        max_features (int, optional): Maximum vocabulary size. Defaults to 200.

    Returns:
        scipy.sparse.csr_matrix: TF-IDF feature matrix for User-Agent tokens.
    """
    vectorizer = TfidfVectorizer(max_features=max_features, dtype=np.float32)
    return vectorizer.fit_transform(ua_tokens)

In [27]:
def combine_features(*arrays):
    """
    Safely combine multiple feature matrices (sparse or dense) into one matrix.

    Automatically converts all dense matrices to sparse if any input is sparse,
    ensuring compatibility and efficient memory usage for large datasets.

    Args:
        *arrays: Variable-length list of matrices (np.ndarray or csr_matrix).

    Returns:
        scipy.sparse.csr_matrix or np.ndarray: Combined feature matrix.
    """
    arrays = [a for a in arrays if a is not None]
    if any(issparse(a) for a in arrays):
        arrays = [csr_matrix(a) if not issparse(a) else a for a in arrays]
        return hstack(arrays)
    return np.hstack(arrays)

In [28]:
# URL embeddings (pilih salah satu)
# url_embeddings = generate_url_hashing(tokenized_urls)
# url_embeddings = generate_url_tfidf(tokenized_urls)
url_embeddings = generate_url_bert(tokenized_urls, TOKENIZER, MODEL, device)

if not issparse(url_embeddings):
    url_embeddings = normalize(url_embeddings, norm='l2', copy=False)

# Fitur tambahan
method_enc = encode_methods(df["method"])
status_enc = encode_statuses(df["status"].apply(categorize_status))
size_enc = normalize_sizes(df["size"])
ua_tokens = [" ".join(tokenize_user_agent(ua)) for ua in df["user_agent"]]
ua_enc = vectorize_user_agents(ua_tokens)

# Gabungkan semua fitur jadi satu matriks
final_features = combine_features(url_embeddings, method_enc, status_enc, size_enc, ua_enc)
print(f"✅ final_features shape: {final_features.shape}")

Embedding URLs:   0%|          | 0/28012 [00:00<?, ?it/s]

Embedding URLs: 100%|██████████| 28012/28012 [09:10<00:00, 50.92it/s]


✅ final_features shape: (896359, 594)


## 5. Clustering Log Menggunakan Bisecting K-Means (Library dan Manual)

Implementasikan dan jalankan algoritma Bisecting K-Means hasil final fitur, lalu kelompokkan berdasarkan hasil cluster.

In [29]:
def evaluate_clusters(features, labels):
    """
    Evaluate clustering results using standard metrics (safe for large/sparse data).

    Args:
        features (np.ndarray or scipy.sparse matrix): Feature matrix.
        labels (np.ndarray): Cluster labels.

    Returns:
        dict: Dictionary with metric results.
    """
    n_labels = len(set(labels))
    if n_labels < 2:
        return {
            "silhouette": None,
            "davies_bouldin": None,
            "calinski_harabasz": None
        }

    results = {}

    # 1️⃣ Silhouette score (supports sparse input)
    try:
        results["silhouette"] = silhouette_score(features, labels, sample_size=20000 if len(labels) > 20000 else None)
    except Exception as e:
        results["silhouette"] = None
        print(f"⚠️ Silhouette score skipped: {e}")

    # 2️⃣ Davies–Bouldin and Calinski–Harabasz require dense data
    if issparse(features):
        # Avoid full toarray() for large datasets (could eat 10+ GB)
        if features.shape[0] * features.shape[1] < 50_000 * 1500:
            print("ℹ️ Converting sparse matrix to dense for small sample...")
            X_dense = features.toarray()
        else:
            print("⚠️ Skipping Davies-Bouldin & Calinski-Harabasz (too large or sparse).")
            results["davies_bouldin"] = None
            results["calinski_harabasz"] = None
            return results
    else:
        X_dense = features

    # 3️⃣ Compute remaining metrics safely
    try:
        results["davies_bouldin"] = davies_bouldin_score(X_dense, labels)
    except Exception as e:
        results["davies_bouldin"] = None
        print(f"⚠️ Davies-Bouldin skipped: {e}")

    try:
        results["calinski_harabasz"] = calinski_harabasz_score(X_dense, labels)
    except Exception as e:
        results["calinski_harabasz"] = None
        print(f"⚠️ Calinski-Harabasz skipped: {e}")

    return results


def visualize_clusters(features, labels, out_file="clusters.png", save_plot=True, title=None):
    """
    Visualize clusters using PCA 2D projection.

    Args:
        features (np.ndarray): Feature matrix.
        labels (np.ndarray): Cluster labels.
        out_file (str): Path to save PNG plot.
        save_plot (bool): If True, save plot to file.
        title (str): Optional title for the plot.
    """
    pca = PCA(n_components=2, random_state=42)
    reduced = pca.fit_transform(features)

    plt.figure(figsize=(8,6))
    scatter = plt.scatter(reduced[:,0], reduced[:,1], c=labels, cmap="tab10", alpha=0.6)
    plt.colorbar(scatter, label="Cluster ID")
    if title:
        plt.title(title)
    else:
        plt.title("Cluster Visualization (PCA)")
    if save_plot:
        plt.savefig(out_file)
        print(f"✅ Cluster visualization saved to {out_file}")
    plt.show()


In [30]:
def cluster_logs(df, features, out_path, n_clusters):
    """
    Cluster web server logs using feature embeddings and Bisecting KMeans,
    optimized for large-scale datasets with gzip-compressed output.

    Args:
        df (pd.DataFrame): DataFrame of log entries.
        features (np.ndarray or scipy.sparse matrix): Feature matrix for clustering.
        out_path (str): Output path ending with ".csv" (example: "outputs/result.csv").
        n_clusters (int): Number of clusters.

    Returns:
        None
    """

    # === 1. Clustering process ===
    print("🚀 Starting clustering process...")
    bkm = VerboseBisectingKMeans(
        n_clusters=n_clusters,
        random_state=42,
        init="k-means++",
        n_init=5
    )
    bkm.fit_verbose(features)
    labels = bkm.labels_
    print(f"✅ Clustering completed. Total clusters: {n_clusters}")

    # === 2. Attach cluster labels ===
    df_label = df.copy()
    df_label["cluster"] = labels

    # === 3. Define output paths ===
    base, _ = os.path.splitext(out_path)
    csv_path = f"{base}.csv.gz"
    txt_path = f"{base}.txt"

    # === 4. Save compressed CSV efficiently ===
    chunk_size = 100_000
    print(f"💾 Writing compressed CSV to: {csv_path}")
    with open(csv_path, "wb") as f:  # gzip stream write
        import gzip
        with gzip.open(f, "wt", encoding="utf-8", newline="") as gz:
            # Write header
            df_label.head(0).to_csv(gz, index=False)
            # Write in chunks
            for i in range(0, len(df_label), chunk_size):
                df_label.iloc[i:i+chunk_size].to_csv(gz, header=False, index=False)
    print(f"✅ Compressed CSV saved: {csv_path}")

    # === 5. Write summary TXT (truncated for huge data) ===
    max_per_cluster = 1000 if len(df_label) > 1_000_000 else None
    print(f"🧩 Writing text summary... -> {txt_path}")
    with open(txt_path, "w", encoding="utf-8") as f:
        for cluster_id in range(n_clusters):
            cluster_data = df_label[df_label["cluster"] == cluster_id]
            f.write(f"\nCluster {cluster_id} ({len(cluster_data)} entries):\n")
            if max_per_cluster:
                cluster_data = cluster_data.head(max_per_cluster)
                f.write(f"  [Showing first {max_per_cluster} entries]\n")
            for _, row in cluster_data.iterrows():
                f.write(f"  {row['method']} {row['url']} [{row['status']}]\n")
    print(f"✅ Cluster summaries saved: {txt_path}")

    # === 6. Evaluate clustering (sample-based for large data) ===
    if len(df_label) > 200_000:
        print("⚠️ Large dataset detected, evaluating on sample of 200k entries...")
        idx = np.random.choice(len(df_label), 200_000, replace=False)
        sample_features = features[idx] if not issparse(features) else features[idx, :]
        sample_labels = labels[idx]
        metrics = evaluate_clusters(sample_features, sample_labels)
    else:
        metrics = evaluate_clusters(features, labels)

    print("📈 Cluster Evaluation:", metrics)

    # === 7. Visualization (skip for huge data) ===
    if len(df_label) <= 200_000:
        base_png = f"{base}_plot.png"
        visualize_clusters(features, labels, base_png, reduce_dim=True)
        print(f"🖼️ Cluster plot saved: {base_png}")
    else:
        print("🖼️ Skipping visualization due to dataset size (>200k).")

    print("🎯 All done.")

    
cluster_logs(df, final_features, output_dir, num_clusters)

🚀 Starting clustering process...
STATUS: Starting BisectingKMeans initialization
PROGRESS: 7
PROGRESS: 14
STATUS: KMeans algorithm backend configured
PROGRESS: 21
STATUS: Data centering complete
PROGRESS: 29
STATUS: Root cluster created
STATUS: Bisecting cluster 1 of 7 (size=896359)
PROGRESS: 36
STATUS: Bisecting cluster 2 of 7 (size=731426)
PROGRESS: 43
STATUS: Bisecting cluster 3 of 7 (size=491421)
PROGRESS: 50
STATUS: Bisecting cluster 4 of 7 (size=319857)
PROGRESS: 57
STATUS: Bisecting cluster 5 of 7 (size=296085)
PROGRESS: 64
STATUS: Bisecting cluster 6 of 7 (size=180161)
PROGRESS: 71
STATUS: Bisecting cluster 7 of 7 (size=115924)
PROGRESS: 79
PROGRESS: 86
STATUS: Cluster labels and centers aggregated
PROGRESS: 93
PROGRESS: 93
STATUS: Clustering finished (total inertia=784458.688)
PROGRESS: 100
DONE
✅ Clustering completed. Total clusters: 8
💾 Writing compressed CSV to: ../outputs/sample_1000000.csv.gz
✅ Compressed CSV saved: ../outputs/sample_1000000.csv.gz
🧩 Writing text summary.