# CS5660 Final Project
## Topic Modeling on arXiv abstract data
Ryan Dielhenn  
Joe Jimenez  
Bohdan Hrotovytskyy  
Ryan Goshorn  
CalStateLA

## Topic Modeling

### Def 1.
**Topic modeling** is an **unsupervised machine learning technique** that automatically identifies the abstract topics present within a collection of documents. It assumes that each document is a mixture of a small number of topics and that each topic is characterized by a distribution over words. The goal of topic modeling is to uncover the hidden thematic structure in large textual datasets, facilitating tasks such as organization, summarization, and discovery of patterns without prior annotation.
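
The mixture assumption in this definition can be made concrete with a toy generative sketch. The two topics, their word probabilities, and the `sample_document` helper below are all made up for illustration; they are not taken from the project data or from any library.

```python
import random

# Toy topics: each topic is a probability distribution over words (made-up numbers).
TOPICS = {
    "vision": {"image": 0.5, "pixel": 0.3, "network": 0.2},
    "language": {"word": 0.5, "sentence": 0.3, "network": 0.2},
}


def sample_document(topic_mixture, n_words, rng):
    """Generate words by picking a topic for each word, then a word from that topic."""
    topic_names = list(topic_mixture)
    topic_weights = [topic_mixture[name] for name in topic_names]
    words = []
    for _ in range(n_words):
        topic = rng.choices(topic_names, weights=topic_weights, k=1)[0]
        vocab = list(TOPICS[topic])
        word_weights = [TOPICS[topic][w] for w in vocab]
        words.append(rng.choices(vocab, weights=word_weights, k=1)[0])
    return words


rng = random.Random(0)
# A document that is mostly about "vision" with a little "language".
doc = sample_document({"vision": 0.8, "language": 0.2}, n_words=10, rng=rng)
print(doc)
```

Topic modeling runs this story in reverse: given only the words, it tries to recover plausible topics and per-document mixtures.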

### Def 2.
* The problem of modeling text corpora and other collections of discrete data. The goal is to find short descriptions of the members of a collection that enable efficient processing of large collections while preserving the essential statistical relationships that are useful for basic tasks such as classification, novelty detection, summarization, and similarity and relevance judgments [1].


[1] Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan), 993-1022.


### Traditional Topic Models and Their Limitation

* Traditional Approaches:
    * Latent Dirichlet Allocation (LDA)
    * Non-Negative Matrix Factorization (NMF)

* Bag-of-Words Assumption
    * Treats documents as unordered collections of individual words (i.e., word order is ignored).

* Limitation:
    * Ignores the meaning and relationship between words.
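
A small sketch of this limitation, using only the standard library: two sentences with opposite meanings end up with the same bag-of-words. The `bag_of_words` helper is a hypothetical, deliberately naive tokenizer (lowercase plus whitespace split), not the one used later in the notebook.

```python
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, split on whitespace, and count each token."""
    return Counter(text.lower().split())


a = bag_of_words("the dog bit the man")
b = bag_of_words("the man bit the dog")

print(a)
print(a == b)  # True: word order is lost, so the two sentences look identical
```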

### What is a Bag-of-Words?
* A bag-of-words is a representation of text that describes the occurrence of words within a document. It involves two things:
    
    * A vocabulary of known words
    * A measure of the presence of those known words (e.g., a count per word).

# Project: Topic Modeling arXiv cs.AI with BERTopic and LLMs

## Goal
Discover the main research themes in the **cs.AI** category on arXiv by:
- Grouping similar paper abstracts into topics
- Automatically generating human-readable labels for each topic
- Visualizing how topics relate to each other

## Methods (High-Level)
- **Bag-of-Words demo:** Simple example to introduce topic modeling.
- **BERTopic:** Uses sentence embeddings + UMAP + HDBSCAN to create dense, meaningful clusters.
- **LLM labeling (Llama3 via Ollama):** Generates concise, human-style topic names from keywords and representative documents.
- **Visualization:**
  - Intertopic distance maps
  - Topic word score bar charts
  - Document map scatter plots
  - Final radial topic map (DataMapPlot)

## Dataset
- Source: arXiv API, category **cs.AI**
- Data: ~1000 paper abstracts (title + abstract text)
- Use case: Explore what kinds of AI research areas are most common in this category.


In [None]:
from tensorflow import keras
from typing import List
from tensorflow.keras.preprocessing.text import Tokenizer

sentence = ["John likes to watch movies. Mary likes movies too."]


def print_bow(sentence: List[str]) -> None:
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(sentence)
    sequences = tokenizer.texts_to_sequences(sentence)
    word_index = tokenizer.word_index
    bow = {}
    for key in word_index:
        bow[key] = sequences[0].count(word_index[key])

    print(f"Bag of words for sentence 1:\n{bow}")
    print(f"{word_index}")
    print(f"We found {len(word_index)} unique tokens.")


print_bow(sentence)

In [None]:
print("John likes to watch movies. Mary likes movies too.")

### BERTopic [(🔗)](https://maartengr.github.io/BERTopic/index.html)
BERTopic is a topic modeling technique that leverages 🤗 transformers and c-TF-IDF to create dense clusters allowing for easily interpretable topics whilst keeping important words in the topic descriptions.

### Visual Overview

BERTopic builds its topic representations from text through a four-step process:

1. **Embedding**: Each document is first transformed into a numerical vector using a pre-trained language model such as BERT. This step captures the semantic meaning and contextual nuances of the text.

2. **Dimensionality Reduction**: Because the resulting vectors are high-dimensional, a dimensionality reduction technique (e.g., UMAP) is applied to simplify the representation while preserving important structure, making clustering more efficient and effective.

3. **Clustering**: The reduced vectors are then clustered into groups, where each cluster corresponds to a potential topic.

4. **Topic Representation**: For each cluster, BERTopic applies a technique called class-based TF-IDF to identify the key words that best characterize the topic.

This end-to-end process enables BERTopic to generate clear, interpretable, and contextually rich topics, often outperforming traditional topic modeling methods.
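
Step 4 can be illustrated with a minimal pure-Python sketch of class-based TF-IDF. It assumes the weighting described in the BERTopic documentation: a term's frequency within a cluster multiplied by `log(1 + A / f_t)`, where `A` is the average number of words per cluster and `f_t` is the term's frequency across all clusters. The `c_tf_idf` helper and the toy clusters are made up for illustration; BERTopic's own implementation works on sparse matrices and differs in detail.

```python
import math
from collections import Counter


def c_tf_idf(clusters):
    """clusters maps a cluster id to the list of documents assigned to it."""
    # Treat each cluster as one long document and count its words.
    counts = {
        cid: Counter(" ".join(docs).lower().split())
        for cid, docs in clusters.items()
    }
    # Frequency of each word across all clusters.
    total_per_word = Counter()
    for cluster_counts in counts.values():
        total_per_word.update(cluster_counts)
    # Average number of words per cluster.
    avg_words = sum(sum(c.values()) for c in counts.values()) / len(counts)

    scores = {}
    for cid, cluster_counts in counts.items():
        n_words = sum(cluster_counts.values())
        scores[cid] = {
            word: (freq / n_words) * math.log(1 + avg_words / total_per_word[word])
            for word, freq in cluster_counts.items()
        }
    return scores


toy_clusters = {
    0: ["neural network model", "deep neural network"],
    1: ["robot planning model", "robot motion planning"],
}
scores = c_tf_idf(toy_clusters)
for cid, word_scores in scores.items():
    top = sorted(word_scores, key=word_scores.get, reverse=True)[:3]
    print(cid, top)
```

In this toy example "model" appears in both clusters, so it scores lower than "deep" even though both occur once in cluster 0. That is the behavior that keeps generic words out of the topic descriptions.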

![image.png](attachment:6c85ca13-6cfa-4a64-9443-bbe4b29e41c2.png)![image.png](attachment:4a208f14-75db-4b07-acb8-c0998be65fe9.png)

## Install required libraries

In [None]:
# BERTopic library
!pip install -q BERTopic

# Visualization Libraries
!pip install datamapplot matplotlib

# Tokenization and ollama for running llm locally
!pip install -q openai tiktoken ollama

# Cuda Drivers for running LLM on colab
!apt-get update && apt-get install -y pciutils cuda-drivers

## Import required packages

In [None]:
# Core
import os
import sys
import re
import time
import subprocess
import ast

# Data
import pandas as pd
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

# Topic modeling
from bertopic import BERTopic
from bertopic.representation import (
    KeyBERTInspired,
    MaximalMarginalRelevance,
    TextGeneration,
    OpenAI as RepresentationOpenAI
)
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN

# Visualization
import plotly.express as px
import plotly.io as pio
import matplotlib.pyplot as plt
import datamapplot

# External services
from google.colab import drive
from openai import OpenAI
import openai

# Mount Google Drive to access files
drive.mount('/content/drive')

# Utilities for fetching abstracts from the arXiv api
sys.path.insert(0, "/content/drive/MyDrive/CSULA/CS5660/arXiv_topic_modeling")
from utils import fetch_arxiv_abstracts
print("✅ Imports successful!")

## BERTopic Quick Start

### Loading the Dataset

In [None]:
docs, titles = fetch_arxiv_abstracts(category='cs.AI', max_results=2000)

In [None]:
docs[0]

`fetch_arxiv_abstracts` is a helper from `utils` that returns up to `max_results` abstracts (with their titles) from a given arXiv category, e.g., `cs.AI`. The arXiv API appears to rate-limit requests, so fetching more data in the future may require a delay between calls.
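
One way to add such a delay is to fetch in pages and pause between requests. The sketch below is a hypothetical pattern, not the code in `utils`: `fetch_in_batches`, the `fetch_page(start, size)` callable, and the 3-second default pause are all assumptions made for illustration. A fake page function stands in for the real API so the example runs offline.

```python
import time


def fetch_in_batches(fetch_page, total, batch_size=100, delay_seconds=3.0, sleep=time.sleep):
    """Call fetch_page(start, size) repeatedly, pausing between requests."""
    results = []
    for start in range(0, total, batch_size):
        size = min(batch_size, total - start)
        page = fetch_page(start, size)
        results.extend(page)
        if len(page) < size:
            break  # the source ran out of results early
        if start + size < total:
            sleep(delay_seconds)  # be polite before the next request
    return results


# Stand-in for a real API call (hypothetical); returns fake abstracts.
def fake_fetch_page(start, size):
    return [f"abstract {i}" for i in range(start, start + size)]


abstracts = fetch_in_batches(fake_fetch_page, total=250, batch_size=100, delay_seconds=0)
print(len(abstracts))  # 250
```

Passing the page function and the `sleep` function as arguments keeps the pacing logic testable without touching the network.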


### Building and Training the BERTopic Model

In [None]:
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)

After generating topics and their probabilities, we can access the frequent topics that were generated:

In [None]:
topic_model.get_topic_info()

* -1 refers to all outliers and should typically be ignored.
* Next, let's take a look at the most frequent topic that was generated, topic 0:

In [None]:
topic_model.get_topic(0)

Using `.get_document_info`, we can also extract information at the document level, such as each document's assigned topic, its probability, and whether it is a representative document for that topic.


In [None]:
topic_model.get_document_info(docs)

### Representation Models: Fine-tune Topic Representation

BERTopic uses a Bag-of-Words approach with class-based TF-IDF (c-TF-IDF) to quickly generate topic keywords without needing to re-train the model after clustering.
While this provides good initial topic representations, BERTopic also offers optional representation models for further fine-tuning.
These models can range from powerful GPT-like models to faster keyword extraction methods like KeyBERT, giving users flexibility to enhance topic quality as needed.

### LLM & Generative AI

Using LLMs such as GPT-4 or open-source alternatives, we can fine-tune topic representations to generate labels and summaries for each topic.

- Generate the set of keywords and documents that best describe a topic using BERTopic's c-TF-IDF.
- Pass the candidate keywords and documents to the text generation model and ask it to generate the output that best fits the topic.


#### Prompt


In [None]:
prompt = """
I have a topic that contains the following documents: \n[DOCUMENTS]
The topic is described by the following keywords: [KEYWORDS]

Based on the above information, can you give a short label of the topic?
"""

### Selecting Documents

Four of the most representative documents will be passed to `[DOCUMENTS]`.
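
As a rough illustration of how the placeholders get filled, the sketch below substitutes documents and keywords into a template with plain string replacement. `fill_prompt` is a hypothetical helper and the documents and keywords are made up; BERTopic performs its own substitution internally, which may format the text differently.

```python
def fill_prompt(template, documents, keywords):
    """Replace the [DOCUMENTS] and [KEYWORDS] tags with concrete text."""
    doc_block = "\n".join(f"- {doc}" for doc in documents)
    keyword_block = ", ".join(keywords)
    return template.replace("[DOCUMENTS]", doc_block).replace("[KEYWORDS]", keyword_block)


template = (
    "I have a topic that contains the following documents:\n[DOCUMENTS]\n"
    "The topic is described by the following keywords: [KEYWORDS]\n"
)

# Made-up representative documents and keywords for one topic.
filled = fill_prompt(
    template,
    documents=["Doc one.", "Doc two."],
    keywords=["agents", "planning"],
)
print(filled)
```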


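BERTopic is fairly straightforward. It consists of five sequential steps: embedding the documents, reducing the dimensionality of the embeddings, clustering the reduced embeddings, tokenizing the documents in each cluster, and finally extracting the best-representing words per topic. (The four-step overview earlier folds tokenization into the topic representation step.)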
<br>
<div>
<img src="https://github.com/MaartenGr/BERTopic/assets/25746895/e9b0d8cf-2e19-4bf1-beb4-4ff2d9fa5e2d" width="500"/>
</div>


In [None]:
!curl -fsSL https://ollama.ai/install.sh | sh

In [None]:
os.environ.update({'LD_LIBRARY_PATH': '/usr/lib64-nvidia'})

def run_ollama_serve():
    # Discard server output: an unread PIPE can fill up and stall the server process.
    subprocess.Popen(["nohup", "ollama", "serve"], stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)

run_ollama_serve()
time.sleep(5)
print("Ollama server started.")

In [None]:
!ollama pull llama3

In [None]:
# Configure the client to use the local Ollama server
client = OpenAI(
    base_url='http://localhost:11434/v1',
    api_key='ollama', # dummy API key required by the client library
)

# Use the model you pulled (e.g., "llama3")
model_name = "llama3"

print(f"Sending request to {model_name}...")

# Example using the standard OpenAI client chat completion
try:
    response = client.chat.completions.create(
        model=model_name,
        messages=[
            {"role": "user", "content": "Explain how to run an LLM locally in one sentence."}
        ],
        temperature=0.7,
    )
    print("\n--- Model Response ---")
    print(response.choices[0].message.content)
    print("----------------------")

except Exception as e:
    print(f"\nAn error occurred: {e}")
    print("Make sure the 'ollama serve' process is running in the background.")

# You can run !ollama ps again after this code executes to see the model usage
time.sleep(2)
!ollama ps

In [None]:
!curl http://localhost:11434/
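
The same health check can be done from Python and turned into a small polling loop, which is more robust than a fixed `time.sleep(5)` after starting the server. This is a sketch using only the standard library; `server_is_up` and `wait_for_server` are hypothetical helper names, and the example run uses a fake probe so it works without a running Ollama server.

```python
import time
import urllib.request


def server_is_up(url, timeout=2.0):
    """Return True if an HTTP GET to url succeeds with status 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:  # covers URLError, connection refused, and timeouts
        return False


def wait_for_server(url, attempts=10, pause_seconds=1.0, probe=server_is_up, sleep=time.sleep):
    """Poll the server until the probe succeeds or the attempts run out."""
    for _ in range(attempts):
        if probe(url):
            return True
        sleep(pause_seconds)
    return False


# Demonstration with a fake probe that succeeds on the third call.
calls = {"n": 0}


def flaky_probe(url):
    calls["n"] += 1
    return calls["n"] >= 3


ready = wait_for_server("http://localhost:11434/", probe=flaky_probe, sleep=lambda s: None)
print(ready, calls["n"])  # True 3
```

In the notebook, calling `wait_for_server("http://localhost:11434/")` with the default probe right after `run_ollama_serve()` would wait only as long as the server actually needs.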

In [None]:
# Assuming bertopic and its dependencies are installed
# If not, run this line first: !pip install bertopic sentence-transformers

prompt = """
I have a topic that contains the following documents:
[DOCUMENTS]

The topic is described by the following keywords: [KEYWORDS]

Generate a concise topic label (3-7 words) that captures the main theme.

CRITICAL INSTRUCTIONS:
- Output ONLY the topic label itself
- Do NOT include phrases like "Here is", "The topic is", "Topic:", or any preamble
- Do NOT add explanations or formatting
- Just output the label directly as plain text

Example output: "Neural Networks for Computer Vision"

Your label:"""

# Configure the client to use the local Ollama server
client = openai.OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama_key_placeholder",
)

# Use the model you pulled (e.g., "llama3")
OLLAMA_MODEL_NAME = "llama3"
ollama_representation_model = RepresentationOpenAI(client, prompt=prompt, model=OLLAMA_MODEL_NAME, delay_in_seconds=10)

print(f"Representation model configured using local Ollama model: {OLLAMA_MODEL_NAME}")

# You can now proceed with your BERTopic workflow:
# topic_model = BERTopic(representation_model=ollama_representation_model)
# documents = [...] # Your actual list of documents
# topics, probabilities = topic_model.fit_transform(documents)

# Verification using a simple prompt
try:
    response = client.chat.completions.create(
        model=OLLAMA_MODEL_NAME,
        messages=[
            {"role": "user", "content": "Confirm that you are running locally via Ollama."}
        ],
    )
    print("\n--- Verification Response ---")
    print(response.choices[0].message.content)
    print("-----------------------------")
except Exception as e:
    print(f"\nAn error occurred during verification: {e}")

## **Preparing Embeddings**

By pre-calculating the embeddings for each document, we can speed-up additional exploration steps and use the embeddings to quickly iterate over BERTopic's hyperparameters if needed.

🔥 **TIP**: You can find a great overview of good embeddings for clustering on the [MTEB Leaderboard](https://huggingface.co/spaces/mteb/leaderboard).

In [None]:
# Pre-calculate embeddings
embedding_model = SentenceTransformer("BAAI/bge-small-en")
embeddings = embedding_model.encode(docs, show_progress_bar=True)

In [None]:
#Pre-reduce embeddings for visualization purposes
reduced_embeddings = UMAP(n_neighbors=15, n_components=2, min_dist=0.0, metric='cosine', random_state=42).fit_transform(embeddings)

In [None]:
df_plot = pd.DataFrame({
    "x1": [point[0] for point in reduced_embeddings],
    "x2": [point[1] for point in reduced_embeddings],
    "docs": docs,
})

df_plot["docs_short"] = df_plot["docs"].str[:100] + "..."
df_plot.head(10)

In [None]:
pio.renderers.default = "colab"

total_docs = len(df_plot)
fig = px.scatter(df_plot, x="x1", y="x2",  hover_data=["docs_short"])
fig.update_traces(marker=dict(line=dict(width=0.5, color='white')))
fig.update_layout(
    title=f"arXiv abstracts from cs.AI - Document Map ({total_docs} documents)",
    title_font_size=20
)

fig.show()

## **Sub-models**

Next, we will define all sub-models in BERTopic and do some small tweaks to the number of clusters to be created, setting random states, etc.

In [None]:
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
hdbscan_model = HDBSCAN(min_cluster_size=15, metric='euclidean', cluster_selection_method='eom', prediction_data=True)

As a small bonus, we already reduced the embeddings to two dimensions earlier (`reduced_embeddings`), so we can reuse them for visualization once the topics have been created.

### **Representation Models**

One of the ways we are going to represent the topics is with Llama 3, which should give us a nice label. However, we might want additional representations to view a topic from multiple angles.

Here, we will be using c-TF-IDF as our main representation and [KeyBERT](https://maartengr.github.io/BERTopic/getting_started/representation/representation.html#keybertinspired), [MMR](https://maartengr.github.io/BERTopic/getting_started/representation/representation.html#maximalmarginalrelevance), and [Llama 3](https://maartengr.github.io/BERTopic/getting_started/representation/llm.html) as our additional representations.
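
To give a feel for what MMR and its `diversity` setting do, here is a minimal pure-Python sketch of Maximal Marginal Relevance over toy 2-D word vectors. It assumes the usual trade-off `(1 - diversity) * relevance - diversity * redundancy`; the `mmr` helper, the vectors, and the words are made up for illustration and are not BERTopic's implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def mmr(topic_vec, candidates, top_n=2, diversity=0.3):
    """Pick top_n words, balancing relevance to the topic against redundancy."""
    relevance = {w: cosine(vec, topic_vec) for w, vec in candidates.items()}
    # Start with the single most relevant word.
    selected = [max(relevance, key=relevance.get)]
    while len(selected) < min(top_n, len(candidates)):
        best_word, best_score = None, -math.inf
        for word, vec in candidates.items():
            if word in selected:
                continue
            # Redundancy: similarity to the closest already-selected word.
            redundancy = max(cosine(vec, candidates[s]) for s in selected)
            score = (1 - diversity) * relevance[word] - diversity * redundancy
            if score > best_score:
                best_word, best_score = word, score
        selected.append(best_word)
    return selected


# Toy embeddings: "learning" and "learn" are near-duplicates, "robot" is different.
topic_vec = [1.0, 0.3]
candidates = {
    "learning": [1.0, 0.1],
    "learn": [1.0, 0.12],
    "robot": [0.6, 0.8],
}

low_diversity = mmr(topic_vec, candidates, top_n=2, diversity=0.0)
high_diversity = mmr(topic_vec, candidates, top_n=2, diversity=0.7)
print(low_diversity)
print(high_diversity)
```

With no diversity penalty the two near-duplicate words are both kept; with a high penalty the second slot goes to the less similar word instead.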

In [None]:
# KeyBERT
keybert = KeyBERTInspired()

# MMR
mmr = MaximalMarginalRelevance(diversity=0.3)

# LLM labels: Llama 3 served by Ollama (configured above as ollama_representation_model)

# All representation models
representation_model = {
    "KeyBERT": keybert,
    "GPT-40": ollama_representation_model,  # key name is a leftover; the labels come from Llama 3
    "MMR": mmr,
}

## **Training**

Now that we have our models prepared, we can start training our topic model! We supply BERTopic with the sub-models of interest, run `.fit_transform`, and see what kind of topics we get.

## Multiple Representations
BERTopic can create many different types of topic representations, from keywords and phrases to summaries and custom labels, and there is a variety of techniques to choose from. A topic is more than just a single representation.

This is the idea behind multi-aspect topic modeling: during the `.fit` or `.fit_transform` stages, BERTopic generates and stores several different representations of each topic.

In [None]:
# To remove English stopwords
vectorizer_model = CountVectorizer(stop_words="english")

topic_model = BERTopic(

  # Sub-models
  embedding_model=embedding_model,
  umap_model=umap_model,
  hdbscan_model=hdbscan_model,
  representation_model=representation_model,
  vectorizer_model=vectorizer_model, # removes English stop words from the topic keywords

  # Hyperparameters
  top_n_words=10,
  verbose=True
)

# Train model
topics, probs = topic_model.fit_transform(docs, embeddings)

Now that we are done training our model, let's see what topics were generated:


In [None]:
# Show topics
topic_model.get_topic_info()

In [None]:
print(f"Renderer set to '{pio.renderers.default}'")
fig = topic_model.visualize_topics()
fig.show()

In [None]:
gpt4o_labels = [label[0][0].split("\n")[0] for label in topic_model.get_topics(full=True)["GPT-40"].values()]
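
The indexing in this comprehension is easier to follow with a toy stand-in for the returned structure. The sketch assumes that `get_topics(full=True)` returns a dictionary keyed by representation name, whose values map topic ids to lists of `(text, score)` pairs; the labels and scores below are made up.

```python
# Toy stand-in for topic_model.get_topics(full=True); all values are made up.
full_topics = {
    "Main": {0: [("agents", 0.05), ("planning", 0.04)]},
    "GPT-40": {
        -1: [("Miscellaneous AI Research\nExtra line", 1)],
        0: [("Multi-Agent Planning", 1), ("", 0)],
    },
}

# entry[0] is the first (text, score) pair; entry[0][0] is its text.
# split("\n")[0] keeps only the first line in case the LLM added more.
labels = [entry[0][0].split("\n")[0] for entry in full_topics["GPT-40"].values()]
print(labels)  # ['Miscellaneous AI Research', 'Multi-Agent Planning']
```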

In [None]:
def get_clean_label(raw_label_string):
    """Extracts clean label from list or string."""
    # If it's already a list, just take the first element
    if isinstance(raw_label_string, list):
        return raw_label_string[0] if raw_label_string else "Unlabeled Topic"

    # If it's a string, clean it up
    if isinstance(raw_label_string, str):
        cleaned = raw_label_string.strip()
        # Remove brackets and quotes if present
        cleaned = cleaned.strip("[]").strip().strip("'\"").strip()
        return cleaned if cleaned else "Unlabeled Topic"

    # Fallback for other types
    return str(raw_label_string)

In [None]:
# Get document info (includes one column per representation, e.g. "GPT-40")
document_info = topic_model.get_document_info(docs)

In [None]:
# First, let's inspect the raw content of the 'GPT-40' column
print("--- Raw GPT-40 labels (before cleaning) ---")
display(document_info["GPT-40"].head())

# Now, apply the cleaning function
all_labels = document_info["GPT-40"].apply(get_clean_label)

print("\n--- Cleaned Labels (after cleaning) ---")
display(all_labels.head())

In [None]:
fig = topic_model.visualize_barchart()
fig.show()

## Visualize Documents


In [None]:
df_plot = pd.DataFrame({
    "x1": [point[0] for point in reduced_embeddings],
    "x2": [point[1] for point in reduced_embeddings],
    "docs": docs,
    "label": all_labels
})
df_plot["docs_short"] = df_plot["docs"].str[:100] + "..."
df_plot.head(10)

In [None]:
fig = px.scatter(df_plot, x="x1", y="x2", color="label", hover_data=["docs_short"])

fig.update_layout(
    height=600,
    legend=dict(
        orientation="h",  # Change orientation to horizontal
        yanchor="bottom",
        y=1.02,           # Place the legend above the plot area
        xanchor="right",
        x=1
    )
)

fig.show()

Source: https://www.williampnicholson.com/2024-02-07-topic-modelling/

In [None]:
# Run the topic map visualization
datamapplot.create_plot(
    reduced_embeddings,
    all_labels,

    use_medoids=True,

    # Follows matplotlib's 'figsize' convention.
    # The actual size of the resulting plot (in pixels) will depend on the dots per inch (DPI)
    # setting in matplotlib.
    # By default that is set to 100 dots per inch for the standard backend, but it can vary.
    figsize=(12, 12),
    # If you really wish to have explicit control of the size of the resulting plot in pixels.
    dpi=100,

    title="arXiv cs.AI - Topic Analysis",
    sub_title="A Topic Map of arXiv's cs.AI sub-category based on abstracts from the arXiv api",

    # Takes a dictionary of keyword arguments that is passed through to
    # matplotlib's 'suptitle' 'fontdict' arguments.
    sub_title_keywords={
        "fontsize":18,
    },

    # Takes a list of text labels to be highlighted.
    # Note: these labels need to match the exact text from your labels array that you are passing in.
    highlight_labels=[
        "Retinopathy Prematurity Screening",
    ],
    # Takes a dictionary of keyword arguments to be applied when styling the labels.
    highlight_label_keywords={
        "fontsize": 12,
        "fontweight": "bold",
        "bbox": {"boxstyle":"round"}
    },

    # By default DataMapPlot tries to automatically choose a size for the text that will allow
    # all the labels to be laid out well with no overlapping text. The layout algorithm will try
    # to accommodate the size of the text you specify here.
    label_font_size=8,
    label_wrap_width=16,
    label_linespacing=1.25,
    # Default is 1.5. Generally, the values of 1.0 and 2.0 are the extremes.
    # With 1.0 you will have more labels at the top and bottom.
    # With 2.0 you will have more labels on the left and right.
    label_direction_bias=1.3,
    # Controls how large the margin is around the exact bounding box of a label, which is the
    # bounding box used by the algorithm for collision/overlap detection.
    # The default is 1.0, which means the margin is the same size as the label itself.
    # Generally, the fewer labels you have the larger you can make the margin.
    label_margin_factor=2.0,
    # Labels are placed in rings around the core data map. This controls the starting radius for
    # the first ring. Note: you need to provide a radius in data coordinates from the center of the
    # data map.
    # The default is selected from the data itself, based on the distance from the center of the
    # most outlying points. Experiment and let the DataMapPlot algorithm try to clean it up.
    label_base_radius=15.0,

    # By default anything over 100,000 points uses datashader to create the scatterplot, while
    # plots with fewer points use matplotlibâ€™s scatterplot.
    # If DataMapPlot is using datashader then the point-size should be an integer,
    # say 0, 1, 2, and possibly 3 at most. If however you are matplotlib scatterplot mode then you
    # have a lot more flexibility in the point-size you can use - and in general larger values will
    # be required. Experiment and see what works best.
    point_size=4,

    # Marker type. This is only supported if you are in matplotlib's scatterplot mode.
    # https://matplotlib.org/stable/api/markers_api.html
    marker_type="o",

    arrowprops={
        "arrowstyle":"wedge,tail_width=0.5",
        "connectionstyle":"arc3,rad=0.05",
        "linewidth":0,
        "fc":"#33333377"
    },

    add_glow=True,
    # Takes a dictionary of keywords that are passed to the 'add_glow_to_scatterplot' function.
    glow_keywords={
        "kernel_bandwidth": 0.75,  # controls how wide the glow spreads.
        "kernel": "cosine",        # controls the kernel type. Default is "gaussian". See https://scikit-learn.org/stable/modules/density.html#kernel-density.
        "n_levels": 32,            # controls how many "levels" there are in the contour plot.
        "max_alpha": 0.9,          # controls the translucency of the glow.
    },

    darkmode=False,
)

plt.tight_layout()

# Display the plot. To save it as a PDF, PNG, or SVG file, call plt.savefig(...) before plt.show().
plt.show()