# Build Model

In [1]:
import pandas as pd
from sentence_transformers import SentenceTransformer
import numpy as np
from tqdm import tqdm
from typing import List, Optional, Tuple
from umap import UMAP
from hdbscan import HDBSCAN
from bertopic import BERTopic
import os
pd.set_option('display.max_colwidth', None)

In [2]:
allminiLM = "sentence-transformers/all-MiniLM-L6-v2"
distilroberta = "all-distilroberta-v1"
e5_base = "intfloat/e5-base-v2"
mpnet_base = "all-mpnet-base-v2"

ModelName = {
    allminiLM : "all-minilml6",
    distilroberta : "distilrobertav1",
    e5_base : "e5_basev2",
    mpnet_base : "mpnet_basev2"
}


In [3]:
CS = "cs"
MATH = "math"
STAT = "stat"
SUBJECT = CS
STRANSFORMER = mpnet_base
EMBEDING_DIM = 768

In [4]:
data = pd.read_csv(f"dataset/arxiv_{SUBJECT}_emb.csv")
data

Unnamed: 0,title,submitted_date,tag_text,text
0,Fault Detection using Immune-Based Systems and Formal Language Algorithms,2000-10-03,"Computational Engineering, Finance, and Science, Machine Learning","fault detection using immunebased systems and formal language algorithms. this paper describes two approaches for fault detection an immunebased mechanism and a formal language algorithm. the first one is based on the feature of immune systems in distinguish any foreign cell from the body own cell. the formal language approach assumes the system as a linguistic source capable of generating a certain language, characterised by a grammar. each algorithm has particular characteristics, which are analysed in the paper, namely in what cases they can be used with advantage. to test their practicality, both approaches were applied on the problem of fault detection in an induction motor."
1,Robust Classification for Imprecise Environments,2000-09-13,Machine Learning,"robust classification for imprecise environments. in realworld environments it usually is difficult to specify target operating conditions precisely, for example, target misclassification costs. this uncertainty makes building robust classification systems problematic. we show that it is possible to build a hybrid classifier that will perform at least as well as the best available classifier for any target conditions. in some cases, the performance of the hybrid actually can surpass that of the best known classifier. this robust performance extends across a wide variety of comparison frameworks, including the optimization of metrics such as accuracy, expected cost, lift, precision, recall, and workforce utilization. the hybrid also is efficient to build, to store, and to update. the hybrid is based on a method for the comparison of classifier performance that is robust to imprecise class distributions and misclassification costs. the roc convex hull rocch method combines techniques from roc analysis, decision analysis and computational geometry, and adapts them to the particulars of analyzing learned classifiers. the method is efficient and incremental, minimizes the management of classifier performance data, and allows for clear visual comparisons and sensitivity analyses. finally, we point to empirical evidence that a robust hybrid classifier indeed is needed for many realworld problems."
2,Tagger Evaluation Given Hierarchical Tag Sets,2000-08-09,Computation and Language,"tagger evaluation given hierarchical tag sets. we present methods for evaluating human and automatic taggers that extend current practice in three ways. first, we show how to evaluate taggers that assign multiple tags to each test instance, even if they do not assign probabilities. second, we show how to accommodate a common property of manually constructed gold standards that are typically used for objective evaluation, namely that there is often more than one correct answer. third, we show how to measure performance when the set of possible tags is treestructured in an isa hierarchy. to illustrate how our methods can be used to measure interannotator agreement, we show how to compute the kappa coefficient over hierarchical tag sets."
3,Description of GADEL,2000-03-07,"Artificial Intelligence, Logic in Computer Science",description of gadel. this article describes the first implementation of the gadel system a genetic algorithm for default logic. the goal of gadel is to compute extensions in reiters default logic. it accepts every kind of finite propositional default theories and is based on evolutionary principles of genetic algorithms. its first experimental results on certain instances of the problem show that this new approach of the problem can be successful.
4,The dynamics of iterated transportation simulations,2000-02-22,"Adaptation and Self-Organizing Systems, Computational Engineering, Finance, and Science","the dynamics of iterated transportation simulations. iterating between a router and a traffic microsimulation is an increasibly accepted method for doing traffic assignment. this paper, after pointing out that the analytical theory of simulationbased assignment todate is insufficient for some practical cases, presents results of simulation studies from a real world study. specifically, we look into the issues of uniqueness, variability, and robustness and validation. regarding uniqueness, despite some cautionary notes from a theoretical point of view, we find no indication of metastable states for the iterations. variability however is considerable. by variability we mean the variation of the simulation of a given plan set by just changing the random seed. we show then results from three different microsimulations under the same iteration scenario in order to test for the robustness of the results under different implementations. we find the results encouraging, also when comparing to reality and with a traditional assignment result. keywords dynamic traffic assignment dta traffic microsimulation transims largescale simulations urban planning"
...,...,...,...,...
159242,Mining the Long Tail: A Comparative Study of Data-Centric Criticality Metrics for Robust Offline Reinforcement Learning in Autonomous Motion Planning,2025-09-16,"Robotics, Artificial Intelligence, Machine Learning","mining the long tail a comparative study of datacentric criticality metrics for robust offline reinforcement learning in autonomous motion planning. offline reinforcement learning rl presents a promising paradigm for training autonomous vehicle av planning policies from largescale, realworld driving logs. however, the extreme data imbalance in these logs, where mundane scenarios vastly outnumber rare longtail events, leads to brittle and unsafe policies when using standard uniform data sampling. in this work, we address this challenge through a systematic, largescale comparative study of data curation strategies designed to focus the learning process on informationrich samples. we investigate six distinct criticality weighting schemes which are categorized into three families heuristicbased, uncertaintybased, and behaviorbased. these are evaluated at two temporal scales, the individual timestep and the complete scenario. we train seven goalconditioned conservative qlearning cql agents with a stateoftheart, attentionbased architecture and evaluate them in the highfidelity waymax simulator. our results demonstrate that all data curation methods significantly outperform the baseline. notably, datadriven curation using model uncertainty as a signal achieves the most significant safety improvements, reducing the collision rate by nearly threefold from 16.0 to 5.5. furthermore, we identify a clear tradeoff where timesteplevel weighting excels at reactive safety while scenariolevel weighting improves longhorizon planning. our work provides a comprehensive framework for data curation in offline rl and underscores that intelligent, nonuniform sampling is a critical component for building safe and reliable autonomous agents."
159243,Adaptive Data-Knowledge Alignment in Genetic Perturbation Prediction,2025-10-01,"Molecular Networks, Artificial Intelligence, Machine Learning","adaptive dataknowledge alignment in genetic perturbation prediction. the transcriptional response to genetic perturbation reveals fundamental insights into complex cellular systems. while current approaches have made progress in predicting genetic perturbation responses, they provide limited biological understanding and cannot systematically refine existing knowledge. overcoming these limitations requires an endtoend integration of datadriven learning and existing knowledge. however, this integration is challenging due to inconsistencies between data and knowledge bases, such as noise, misannotation, and incompleteness. to address this challenge, we propose aligned adaptive alignment for inconsistent genetic knowledge and data, a neurosymbolic framework based on the abductive learning abl paradigm. this endtoend framework aligns neural and symbolic components and performs systematic knowledge refinement. we introduce a balanced consistency metric to evaluate the predictions consistency against both data and knowledge. our results show that aligned outperforms stateoftheart methods by achieving the highest balanced consistency, while also rediscovering biologically meaningful knowledge. our work advances beyond existing methods to enable both the transparency and the evolution of mechanistic biological understanding."
159244,Sobolev Training of End-to-End Optimization Proxies,2025-05-16,"Machine Learning, Optimization and Control","sobolev training of endtoend optimization proxies. optimization proxies machine learning models trained to approximate the solution mapping of parametric optimization problems in a single forward pass offer dramatic reductions in inference time compared to traditional iterative solvers. this work investigates the integration of solver sensitivities into such end to end proxies via a sobolev training paradigm and does so in two distinct settings i fully supervised proxies, where exact solver outputs and sensitivities are available, and ii self supervised proxies that rely only on the objective and constraint structure of the underlying optimization problem. by augmenting the standard training loss with directional derivative information extracted from the solver, the proxy aligns both its predicted solutions and local derivatives with those of the optimizer. under lipschitz continuity assumptions on the true solution mapping, matching first order sensitivities is shown to yield uniform approximation error proportional to the training set covering radius. empirically, different impacts are observed in each studied setting. on three large alternating current optimal power flow benchmarks, supervised sobolev training cuts mean squared error by up to 56 percent and the median worst case constraint violation by up to 400 percent while keeping the optimality gap below 0.22 percent. for a mean variance portfolio task trained without labeled solutions, self supervised sobolev training halves the average optimality gap in the medium risk region standard deviation above 10 percent of budget and matches the baseline elsewhere. together, these results highlight sobolev training whether supervised or self supervised as a path to fast reliable surrogates for safety critical large scale optimization workloads."
159245,Beyond Unimodal Boundaries: Generative Recommendation with Multimodal Semantics,2025-03-30,"Information Retrieval, Artificial Intelligence, Computation and Language, Computer Vision and Pattern Recognition","beyond unimodal boundaries generative recommendation with multimodal semantics. generative recommendation gr has become a powerful paradigm in recommendation systems that implicitly links modality and semantics to item representation, in contrast to previous methods that relied on nonsemantic item identifiers in autoregressive models. however, previous research has predominantly treated modalities in isolation, typically assuming item content is unimodal usually text. we argue that this is a significant limitation given the rich, multimodal nature of realworld data and the potential sensitivity of gr models to modality choices and usage. our work aims to explore the critical problem of multimodal generative recommendation mgr, highlighting the importance of modality choices in gr nframeworks. we reveal that gr models are particularly sensitive to different modalities and examine the challenges in achieving effective gr when multiple modalities are available. by evaluating design strategies for effectively leveraging multiple modalities, we identify key challenges and introduce mgrlf, an enhanced late fusion framework that employs contrastive modality alignment and special tokens to denote different modalities, achieving a performance improvement of over 20 compared to singlemodality alternatives."


In [5]:
data["submitted_date"] = pd.to_datetime(data["submitted_date"], errors="coerce")
data["year"] = data["submitted_date"].dt.year
idx = data.index.to_numpy()

In [6]:
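def generate_mmap_embeddings(
    texts: List[str],
    embedding_model: SentenceTransformer,
    mmap_path: str,
    batch_size: int = 1024
) -> Optional[np.memmap]:
    """Encode texts in batches and write them to a raw float32 memmap file."""
    N = len(texts)
    if N == 0:
        print("Error: the list of texts is empty.")
        return None

    emb_dim = embedding_model.get_sentence_embedding_dimension()

    print(f"Total documents: {N}")
    print(f"Embedding dimension: {emb_dim}")
    print(f"Batch size: {batch_size}")
    print(f"Saving to: {mmap_path}")

    # np.memmap does not create missing parent directories.
    os.makedirs(os.path.dirname(mmap_path) or ".", exist_ok=True)

    # This is a raw memmap (no .npy header), so the same shape and dtype
    # must be passed again when the file is loaded.
    embs = np.memmap(
        mmap_path,
        dtype="float32",
        mode="w+",
        shape=(N, emb_dim)
    )

    for i in tqdm(range(0, N, batch_size)):
        batch_texts = texts[i:i + batch_size]

        batch_embeddings = embedding_model.encode(
            batch_texts,
            show_progress_bar=False,
            convert_to_numpy=True
        )

        # The last batch may be shorter than batch_size.
        embs[i:i + len(batch_texts)] = batch_embeddings

    # Make sure everything is written to disk before returning.
    embs.flush()
    return embs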

In [7]:
def load_mmap_embeddings(
    mmap_path: str,
    num_documents: int,
    embedding_dim: int,
    dtype: str = "float32"
) -> Optional[np.memmap]:
    
    try:
        embs = np.memmap(
            mmap_path,
            dtype=dtype,
            mode="r",
            shape=(num_documents, embedding_dim)
        )
        print("Embeddings loaded successfully.")
        return embs
    except FileNotFoundError:
        print(f"Error: file not found at path: {mmap_path}")
        return None
    except Exception as e:
        print(f"Error while loading the mmap file: {e}")
        return None
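
A raw `np.memmap` file stores only bytes, with no header describing shape or dtype, so the loader has to be given the same `(num_documents, embedding_dim)` and `dtype` that were used when writing. Below is a minimal sketch of that round trip on a temporary file with a small made-up array instead of the real embeddings. The `expected_bytes` check is an assumption added for illustration and is not part of the notebook's functions.

```python
import os
import tempfile

import numpy as np

# Hypothetical toy sizes; the notebook itself uses 159247 x 768.
n_docs, dim = 4, 3
path = os.path.join(tempfile.mkdtemp(), "toy_embeddings.npy")

# Write: mode="w+" creates the file with the requested shape.
writer = np.memmap(path, dtype="float32", mode="w+", shape=(n_docs, dim))
writer[:] = np.arange(n_docs * dim, dtype="float32").reshape(n_docs, dim)
writer.flush()
del writer

# Guard: a float32 memmap occupies exactly n_docs * dim * 4 bytes on disk.
expected_bytes = n_docs * dim * np.dtype("float32").itemsize
actual_bytes = os.path.getsize(path)

# Read: shape and dtype must be supplied again, since the file has no header.
reader = np.memmap(path, dtype="float32", mode="r", shape=(n_docs, dim))
print(actual_bytes == expected_bytes, reader[1].tolist())
```

Comparing the file size against the expected byte count before loading is one way to catch a stale file that was written for a different model or a different number of documents.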

In [8]:
def train_bertopic_model(
    documents: List[str],
    embeddings: np.ndarray,
    n_neighbors: int = 20,
    n_components: int = 5,
    min_dist: float = 0.0,
    min_cluster_size: int = 200,
    min_samples: int = 10,
    random_state: int = 42
) -> Tuple[BERTopic, List[int], Optional[np.ndarray]]:

    print(f"n_neighbors={n_neighbors}, min_dist={min_dist}, n_components={n_components}")
    umap_model = UMAP(
        n_neighbors=n_neighbors,
        n_components=n_components,
        metric="cosine",
        random_state=random_state,
        min_dist=min_dist,
        verbose=True
    )

    print(f"min_cluster_size={min_cluster_size}, min_samples={min_samples}")
    hdbscan_model = HDBSCAN(
        min_cluster_size=min_cluster_size,
        min_samples=min_samples,
        metric="euclidean",
        cluster_selection_method="eom",
        prediction_data=True,
        gen_min_span_tree=True
    )

    topic_model = BERTopic(
        umap_model=umap_model,
        hdbscan_model=hdbscan_model,
        calculate_probabilities=False,  # Set to False for faster training
        verbose=True
    )

    topics, probs = topic_model.fit_transform(documents, embeddings=embeddings)
    print(f"total topics found: {len(topic_model.get_topic_info()) - 1}")
    return topic_model, topics, probs

In [9]:
emb_path = f"model_results/bertopic/model_embedings/{ModelName[STRANSFORMER]}_{SUBJECT}.npy"
texts = data["text"].tolist()
if os.path.exists(emb_path):
    # Reuse cached embeddings; shape and dtype must match what was written.
    print("Embeddings exist")
    embs = load_mmap_embeddings(mmap_path=emb_path, num_documents=len(texts), embedding_dim=EMBEDING_DIM)
else:
    print("Start model embeddings")
    embedding_model = SentenceTransformer(STRANSFORMER)
    embs = generate_mmap_embeddings(texts=texts, embedding_model=embedding_model, mmap_path=emb_path)

Start model embeddings
Total documents: 159247
Embedding dimension: 768
Batch size: 1024
Saving to: model_results/bertopic/model_embedings/mpnet_basev2_cs.npy


100%|██████████| 156/156 [33:03<00:00, 12.72s/it]
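
The figures in the log above can be checked with a few lines of arithmetic: the number of batches is the document count divided by the batch size, rounded up, and a float32 memmap needs four bytes per value. This is only a sanity-check sketch; the constants are copied from the log, and the variable names are made up for illustration.

```python
import math

n_docs = 159_247       # document count reported in the log above
batch_size = 1024
emb_dim = 768
bytes_per_float32 = 4

# The last batch is partial, so round up.
n_batches = math.ceil(n_docs / batch_size)

# Size of the raw float32 memmap on disk.
file_bytes = n_docs * emb_dim * bytes_per_float32

print(n_batches)   # 156, the same total shown by the progress bar
print(file_bytes)  # 489206784 bytes
```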


In [10]:
# idx covers every row, so this copies all embeddings from the memmap into memory.
sample_embeddings = embs[idx]
topic_model, topics, probs = train_bertopic_model(data['text'].tolist(), sample_embeddings)

2025-11-19 19:30:19,265 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm


n_neighbors=20, min_dist=0.0, n_components=5
min_cluster_size=200, min_samples=10
UMAP(angular_rp_forest=True, metric='cosine', min_dist=0.0, n_components=5, n_jobs=1, n_neighbors=20, random_state=42, verbose=True)
Wed Nov 19 19:30:19 2025 Construct fuzzy simplicial set
Wed Nov 19 19:30:19 2025 Finding Nearest Neighbors
Wed Nov 19 19:30:19 2025 Building RP forest with 25 trees
Wed Nov 19 19:30:33 2025 NN descent for 17 iterations
	 1  /  17
	 2  /  17
	 3  /  17
	 4  /  17
	Stopping threshold met -- exiting after 4 iterations
Wed Nov 19 19:31:03 2025 Finished Nearest Neighbor Search
Wed Nov 19 19:31:06 2025 Construct embedding


Epochs completed:   0%|            0/200 [00:00]

	completed  0  /  200 epochs
	completed  20  /  200 epochs
	completed  40  /  200 epochs
	completed  60  /  200 epochs
	completed  80  /  200 epochs
	completed  100  /  200 epochs
	completed  120  /  200 epochs
	completed  140  /  200 epochs
	completed  160  /  200 epochs
	completed  180  /  200 epochs
Wed Nov 19 19:32:36 2025 Finished embedding


2025-11-19 19:32:37,853 - BERTopic - Dimensionality - Completed ✓
2025-11-19 19:32:37,855 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-11-19 19:32:42,479 - BERTopic - Cluster - Completed ✓
2025-11-19 19:32:42,495 - BERTopic - Representation - Fine-tuning topics using representation models.
2025-11-19 19:32:53,934 - BERTopic - Representation - Completed ✓


total topics found: 143
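
HDBSCAN labels documents it cannot assign to any cluster as topic `-1`, which is why the count printed above subtracts one from the number of rows in `get_topic_info()`. A minimal sketch of summarising that split follows, assuming `topics` is a list of integer labels as returned by `fit_transform`. The toy labels are made up and do not come from this run.

```python
from collections import Counter

# Made-up labels standing in for the `topics` list returned by fit_transform.
toy_topics = [0, 1, -1, 0, 2, -1, 0, 1]

counts = Counter(toy_topics)

# Topic -1 is the outlier bucket, not a real topic.
n_outliers = counts.get(-1, 0)
n_topics = len([t for t in counts if t != -1])
outlier_share = n_outliers / len(toy_topics)

print(n_topics, n_outliers, outlier_share)
```

Checking the outlier share alongside the topic count helps when tuning `min_cluster_size` and `min_samples`, since both influence how many documents end up unassigned.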


In [11]:
topic_model.save(f"model_results/bertopic/{ModelName[STRANSFORMER]}_{SUBJECT}")



Wed Nov 19 19:33:03 2025 Worst tree score: 0.41104071
Wed Nov 19 19:33:03 2025 Mean tree score: 0.42233562
Wed Nov 19 19:33:03 2025 Best tree score: 0.43516675
Wed Nov 19 19:33:06 2025 Forward diversification reduced edges from 3184940 to 1112790
Wed Nov 19 19:33:08 2025 Reverse diversification reduced edges from 1112790 to 1112790
Wed Nov 19 19:33:10 2025 Degree pruning reduced edges from 1299792 to 1299583
Wed Nov 19 19:33:10 2025 Resorting data and graph based on tree order
Wed Nov 19 19:33:10 2025 Building and compiling search function


# Visualization

In [14]:
loaded_model = BERTopic.load(f"model_results/bertopic/{ModelName[STRANSFORMER]}_{SUBJECT}")

Wed Nov 19 20:45:21 2025 Building and compiling search function


In [17]:
import plotly.express as px

docs = data['text'].tolist()
# Use the publication year of each document as its time bin.
timestamps = pd.to_datetime(data["submitted_date"]).dt.year
topics_over_time = loaded_model.topics_over_time(docs, timestamps)

26it [01:17,  2.99s/it]


In [18]:
df = topics_over_time.copy()
# Drop the outlier topic so proportions are relative to assigned documents only.
df = df[df['Topic'] != -1]

# Total frequency of all remaining topics per time bin.
totals = df.groupby("Timestamp")["Frequency"].sum().reset_index()
totals = totals.rename(columns={"Frequency": "Total"})

df = df.merge(totals, on="Timestamp")
# Share of each topic within its time bin, in the range 0 to 1.
df["Proportion"] = df["Frequency"] / df["Total"]

# Keep the ten topics with the highest overall frequency.
topic_totals = df.groupby("Topic")["Frequency"].sum()
top10 = topic_totals.sort_values(ascending=False).head(10).index.tolist()

df_top10 = df[df["Topic"].isin(top10)]

fig = px.line(
    df_top10,
    x="Timestamp",
    y="Proportion",
    color="Topic",
    hover_data=["Words", "Frequency", "Total"],
)

fig.update_layout(
    title="Topic Proportion Over Time",
    yaxis_title="Proportion",
    xaxis_title="Time",
)

fig.show()
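
The normalisation in the cell above divides each topic's frequency by the total frequency of all non-outlier topics in the same time bin. The same computation can be sketched with `groupby(...).transform("sum")`, which avoids the intermediate merge. The toy frame below is made up and only mimics the `Topic`, `Timestamp`, and `Frequency` columns of `topics_over_time`.

```python
import pandas as pd

# Made-up data with the same column names as the topics_over_time output.
toy = pd.DataFrame({
    "Topic":     [-1, 0, 1, 0, 1],
    "Timestamp": [2020, 2020, 2020, 2021, 2021],
    "Frequency": [50, 30, 10, 20, 20],
})

# Drop the outlier topic before normalising.
toy = toy[toy["Topic"] != -1].copy()

# transform("sum") broadcasts each year's total back onto its rows.
toy["Total"] = toy.groupby("Timestamp")["Frequency"].transform("sum")
toy["Proportion"] = toy["Frequency"] / toy["Total"]

print(toy[["Topic", "Timestamp", "Proportion"]].to_string(index=False))
```

Within each year the proportions sum to 1, which is a quick check that the normalisation was applied per time bin rather than over the whole frame.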

# Metrics

In [23]:
from gensim.utils import simple_preprocess
from gensim.corpora import Dictionary
from tqdm.notebook import tqdm
from gensim.models import CoherenceModel

In [24]:
def tokenize_for_coherence(text):
    # Lowercase, strip accents, and split into tokens.
    return simple_preprocess(str(text), deacc=True)
texts_for_coherence = [tokenize_for_coherence(text) for text in tqdm(data['text'], desc="Tokenizing for Coherence")]
dictionary_coherence = Dictionary(texts_for_coherence)

Tokenizing for Coherence:   0%|          | 0/159247 [00:00<?, ?it/s]

In [25]:
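def get_bertopic_topics(model, top_n=10):
    """Return the top_n words of every topic, excluding the outlier topic (-1)."""
    topics_list = []
    # Iterate over the actual topic ids instead of assuming a contiguous range.
    for topic_id in sorted(model.get_topics()):
        if topic_id == -1:
            continue
        words_scores = model.get_topic(topic_id)
        if words_scores:
            top_words = [word for word, score in words_scores[:top_n]]
            topics_list.append(top_words)
    return topics_list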
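
`get_topics()` returns a mapping from topic id to a list of `(word, score)` pairs, with the outlier topic stored under `-1`. The sketch below collects the top words per topic by iterating over those ids directly, so no real topic is dropped by an index offset. The dictionary is a made-up stand-in for a fitted model's output, and `top_words_per_topic` is a hypothetical helper, not a BERTopic function.

```python
from typing import Dict, List, Tuple


def top_words_per_topic(
    topics: Dict[int, List[Tuple[str, float]]],
    top_n: int = 10,
    skip_outliers: bool = True,
) -> List[List[str]]:
    """Return the top_n words of each topic, ordered by topic id."""
    result = []
    for topic_id in sorted(topics):
        # Topic -1 collects unassigned documents and is not a real topic.
        if skip_outliers and topic_id == -1:
            continue
        words = [word for word, _ in topics[topic_id][:top_n]]
        if words:
            result.append(words)
    return result


# Made-up mapping in the same shape as the output of get_topics().
toy_topics = {
    -1: [("the", 0.9), ("of", 0.8)],
    0: [("graph", 0.5), ("node", 0.4), ("edge", 0.3)],
    1: [("image", 0.6), ("pixel", 0.2)],
}

print(top_words_per_topic(toy_topics, top_n=2))
```

Passing `skip_outliers=False` keeps the `-1` entry, which makes it easy to compare coherence scores with and without the outlier topic.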

In [27]:
for m in ModelName.values():
    model = BERTopic.load(f"model_results/bertopic/{m}_{SUBJECT}")
    # Collect the word list of every topic. get_topics() also contains the
    # outlier topic (-1), so it is included in the score computed below.
    all_topic_ids = model.get_topics().keys()
    bertopic_topics_list = []
    for topic_id in all_topic_ids:
        topic_words = [word for word, _ in model.get_topic(topic_id)]
        bertopic_topics_list.append(topic_words)
    cm_bertopic = CoherenceModel(
        topics=bertopic_topics_list,
        texts=texts_for_coherence,
        dictionary=dictionary_coherence,
        coherence='c_v',
        processes=1
    )
    coherence_bertopic = cm_bertopic.get_coherence()
    print(f"\nBERTopic coherence score {m}: {coherence_bertopic:.4f}")

Wed Nov 19 21:00:52 2025 Building and compiling search function

BERTopic coherence score all-minilml6: 0.7199
Wed Nov 19 21:05:38 2025 Building and compiling search function

BERTopic coherence score distilrobertav1: 0.7253
Wed Nov 19 21:10:41 2025 Building and compiling search function

BERTopic coherence score e5_basev2: 0.7057
Wed Nov 19 21:15:11 2025 Building and compiling search function

BERTopic coherence score mpnet_basev2: 0.7088
