# Topic Modeling

This notebook contains the code used for topic modeling our podcast and news corpora.

## Imports
Add necessary imports.

In [None]:
from bertopic import BERTopic # topic modeling
from sentence_transformers import SentenceTransformer # embeddings
import os # file handling
import pandas as pd # data exploration
import math # for chunking function
import pickle # for loading articles
import random # for sampling our data
import numpy as np # for exporting embeddings

## Preprocessing
Here, we preprocess our data for topic modeling.

### Transcripts
Transcripts are already clean, with no timestamps, speaker tags, or non-speech annotations. BERTopic also does not require stemming, lemmatizing, or stopword removal. 

Preprocessing is thus limited to removing those transcripts for podcasts published by traditional news outlets.

In [None]:
podcast_df = pd.read_csv("data/top_50_pods_USA_2025_Q1.csv", index_col="Rank")

# array of podcast titles to include
to_include = podcast_df[~podcast_df["Exclude"]]["Title"].values

In [None]:
metadata_table = pd.read_csv("data/podcasts/all_podcasts_metadata.csv")

# removing all podcasts not on this list
metadata_table = metadata_table[metadata_table["podcast_name"].isin(to_include)]

### News Articles
News articles do not require preprocessing.

## Text Chunking

Podcasts and news articles can be long and often cover multiple topics, so we break them into shorter documents, which makes it easier for the model to group similar ideas together. Chunking also gives us more data points, and therefore more evidence for each topic, which should make topic discovery more robust.

For our embeddings, we default to a `chunk_size` of 300 words, which tends to be the "sweet spot" (i.e. long enough to be meaningful, short enough to not mix topics too much). We can adjust this based on modeling results.

### General Chunking Function
Note: for consistency, this function makes chunks as even as possible, which avoids leaving a tiny chunk at the end of a document.

In [None]:
def chunk_text(text, chunk_size=300):
    """Split documents into even chunks of approximately chunk_size"""
    words = text.split()
    n_chunks = math.ceil(len(words) / chunk_size)
    if n_chunks == 0:
        return []
    avg = len(words) // n_chunks
    rem = len(words) % n_chunks
    chunks = []
    start = 0
    for i in range(n_chunks):
        # spread the remainder, one extra word each, over the first `rem` chunks
        extra = 1 if i < rem else 0
        end = start + avg + extra
        chunks.append(' '.join(words[start:end]))
        start = end
    return chunks
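
As a quick check of the even-split behaviour described above, the sketch below reproduces only the size arithmetic from `chunk_text`. The helper name `chunk_sizes` is hypothetical (it is not part of the notebook), and it assumes the same default of 300 words per chunk.

```python
import math

def chunk_sizes(n_words, chunk_size=300):
    """Word count of each chunk for a document of n_words (hypothetical helper)."""
    n_chunks = math.ceil(n_words / chunk_size)
    if n_chunks == 0:
        return []
    # same arithmetic as chunk_text: base size plus one extra word for the first `rem` chunks
    avg, rem = divmod(n_words, n_chunks)
    return [avg + 1 if i < rem else avg for i in range(n_chunks)]

print(chunk_sizes(650))  # [217, 217, 216]
print(chunk_sizes(301))  # [151, 150]
```

A 301-word document thus becomes two chunks of about 150 words, rather than one 300-word chunk followed by a single leftover word.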

### Chunking Transcripts
Here we apply chunking to our podcast transcripts. Our outputs are:

1. `pod_docs`: a list containing chunks from all transcripts.
2. `pod_chunk_metadata`: a list of dictionaries containing, for each chunk, its source type, source name, title, publish date, filename, and a chunk id that is unique within its document.

In [None]:
pod_docs = []
pod_chunk_metadata = []

for idx, row in metadata_table.iterrows():
    filepath = row["filename"]
    try:
        with open(filepath, "r", encoding="utf-8") as f:
            text = f.read()
            chunks = chunk_text(text, chunk_size=300)
            for i, chunk in enumerate(chunks):
                pod_docs.append(chunk)
                # Attach all the original metadata and give each chunk a unique id for tracking
                pod_chunk_metadata.append({
                    "filename": row["filename"],
                    "source_type": "podcast",
                    "source_name": row["podcast_name"],
                    "title": row["episode_title"],
                    "publish_date": row["publish_date"],
                    "chunk_id": i
                })
    except Exception as e:
        print(f"Error reading {filepath}: {e}")

### Chunking News Articles
Here, we do the same chunking on our news articles. 

First, we load our news articles, which have been exported from pandas DataFrames into `.pkl` files. We aggregate these into `news_articles`, a list of dictionaries in which each article's text (`content`) is paired with its `outlet_name`, `title`, `date`, and `filename`.

Then, we perform the same chunking we did on our podcast transcripts. Our outputs are:

1. `news_docs`: a list containing chunks from all news articles.
2. `news_chunk_metadata`: a list of dictionaries containing, for each chunk, its source type, source name, title, publish date, filename, and a chunk id that is unique within its document.

In [None]:
def get_article_filepaths():
    found_files = []
    for root, _, files in os.walk("data/article data"):
        for file in files:
            if file.endswith(".pkl") and "combined" in file:
                found_files.append(os.path.join(root, file))
    return found_files

In [None]:
def load_news_articles(filepaths):
    articles = []
    for fp in filepaths:
        outlet_name = "NPR" if "NPR" in fp else "PBS" if "PBS" in fp else "Unknown"
        df = pd.read_pickle(fp)
        for idx, row in df.iterrows():
            if row["content"]:
                articles.append({
                    "outlet_name": outlet_name,
                    "title": row["title"],
                    "date": row.get("date", None),
                    "content": " ".join(row["content"]) if isinstance(row["content"], list) else row["content"],  # join content if it's a list
                    "filename": fp
                })
    return articles

news_filepaths = get_article_filepaths()
news_articles = load_news_articles(news_filepaths)

In [None]:
news_docs = []
news_chunk_metadata = []

for article in news_articles:
    chunks = chunk_text(article["content"], chunk_size=300)
    for i, chunk in enumerate(chunks):
        news_docs.append(chunk)
        news_chunk_metadata.append({
            "source_type": "news",
            "source_name": article["outlet_name"],
            "title": article["title"],
            "publish_date": article["date"],
            "filename": article["filename"],
            "chunk_id": i
        })
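
Because `chunk_id` is only unique within a document, and every article loaded from the same `.pkl` file shares one `filename`, a chunk may need a composite key if it has to be identified across the whole corpus. A minimal sketch, assuming metadata dictionaries shaped like the ones built above; the helper name `chunk_key` and the sample values are hypothetical:

```python
def chunk_key(meta):
    """Build a composite key from a chunk metadata dict (hypothetical helper)."""
    return (meta["source_type"], meta["filename"], meta["title"], meta["chunk_id"])

# made-up metadata entries, shaped like those in news_chunk_metadata
example_metadata = [
    {"source_type": "news", "source_name": "NPR", "title": "Story A",
     "publish_date": None, "filename": "example_combined.pkl", "chunk_id": 0},
    {"source_type": "news", "source_name": "NPR", "title": "Story B",
     "publish_date": None, "filename": "example_combined.pkl", "chunk_id": 0},
]

keys = [chunk_key(m) for m in example_metadata]
# True when no two chunks share a key
print(len(set(keys)) == len(keys))
```

This assumes titles are distinct within a file; if they are not, an article index would have to be added to the key.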

## Running Topic Model
Here we run our topic model on our combined news and podcast corpus. For the sake of time, we also draw a random sample of 10,000 chunks (approximately 2% of our total) that can be used for a quicker run; the modeling cells below, as written, fit on the full corpus.

In [None]:
all_docs = pod_docs + news_docs # joining our document corpora
all_chunk_metadata = pod_chunk_metadata + news_chunk_metadata # joining our metadata

In [None]:
sample_indices = random.sample(range(len(all_docs)), 10000) # taking a random sample of 10,000 chunks

sampled_docs = [all_docs[i] for i in sample_indices]
sampled_metadata = [all_chunk_metadata[i] for i in sample_indices]
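
`random.sample` returns a different sample on each run and raises a `ValueError` if the corpus holds fewer than 10,000 chunks. If a repeatable sample is wanted, one option is a seeded generator with a capped sample size. This is a sketch rather than the notebook's method; the helper name `sample_indices_seeded` and the seed value are assumptions.

```python
import random

def sample_indices_seeded(n_items, k=10_000, seed=42):
    """Draw up to k distinct indices reproducibly (hypothetical helper)."""
    rng = random.Random(seed)  # private generator, leaves the global random state untouched
    k = min(k, n_items)        # avoid ValueError on small corpora
    return sorted(rng.sample(range(n_items), k))

# made-up corpus size for illustration
indices = sample_indices_seeded(n_items=50_000, k=10_000, seed=42)
print(len(indices))
```

The same indices can then be used to select from both `all_docs` and `all_chunk_metadata`, which keeps documents and metadata aligned.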

In [None]:
embedding_model = SentenceTransformer("all-mpnet-base-v2")
# full run on all_docs; swap in sampled_docs (in both calls) for a quick test run
embeddings = embedding_model.encode(all_docs, show_progress_bar=True)
topic_model = BERTopic(embedding_model=embedding_model, verbose=True)
topics, probs = topic_model.fit_transform(all_docs, embeddings)

## Exploring Topics

In [None]:
# dataframe containing topic ID, number of chunks assigned to the topic, and topic keywords
topic_model.get_topic_info().head(5)

In [None]:
# The `results` DataFrame contains our chunk metadata along with each chunk's most likely topic.
results = pd.DataFrame(all_chunk_metadata) # use sampled_metadata if the model was fit on sampled_docs
results["text"] = all_docs # use sampled_docs if the model was fit on sampled_docs
results["topic"] = topics
results["probability"] = probs
results.head()
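
The column assignments above only line up if the metadata, documents, topics, and probabilities all come from the same run (all full-corpus or all sampled). A small guard such as the following can catch a mismatch before the DataFrame is built; `check_aligned` is a hypothetical helper, shown here with made-up inputs.

```python
def check_aligned(**sequences):
    """Raise ValueError if the named sequences differ in length (hypothetical helper)."""
    lengths = {name: len(seq) for name, seq in sequences.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Length mismatch: {lengths}")
    return lengths

# made-up inputs standing in for chunk metadata, documents, and topic assignments
lengths = check_aligned(
    metadata=[{"chunk_id": 0}, {"chunk_id": 1}],
    docs=["first chunk", "second chunk"],
    topics=[3, -1],
)
print(lengths)
```

In this notebook, the call would look like `check_aligned(metadata=all_chunk_metadata, docs=all_docs, topics=topics, probs=probs)`.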

## Exporting Model and Topics

In [None]:
# Export our model
topic_model.save("models/bertopic_model_full")

# Export our embeddings
np.save("sampled_docs_embeddings.npy", embeddings)

# Export our results
results.to_csv("data/bertopic_results.csv", index=False)
topic_model.get_topic_info().to_csv("data/topic_info.csv", index=False)