# Topic Modeling

This notebook contains the code used for topic modeling our podcast and news corpora.

## Imports
Add necessary imports.

In [None]:
#%pip install --upgrade torch torchvision torchaudio
#%pip install bertopic
from bertopic import BERTopic # topic modeling
from sentence_transformers import SentenceTransformer # embeddings
import os # file handling
import pandas as pd # data exploration
import math # for chunking function
import pickle # for loading articles
import random # for sampling our data
import numpy as np # for exporting embeddings

## Preprocessing
Here, we preprocess our data for topic modeling.

### Transcripts
Transcripts are already clean, with no timestamps, speaker tags, or non-speech annotations. BERTopic also does not require stemming, lemmatizing, or stopword removal. 

Preprocessing is thus limited to removing transcripts of podcasts published by traditional news outlets.

In [None]:
podcast_df = pd.read_csv("data/top_50_pods_USA_2025_Q1.csv", index_col="Rank")

# array of podcast titles to include
to_include = podcast_df[~podcast_df["Exclude"]]["Title"].values

In [None]:
metadata_table = pd.read_csv("data/all_podcasts_metadata.csv")

# removing all podcasts not on this list
metadata_table = metadata_table[metadata_table["podcast_name"].isin(to_include)]
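
The exclusion step above can be checked on a toy example. The sketch below uses made-up podcast titles and assumes, as the cells above do, that `Exclude` is a boolean column; if it were read in as strings or contained missing values, `~` would not act as a logical NOT.

```python
import pandas as pd

# Hypothetical stand-in for the top-50 table; the titles are made up.
toy_podcasts = pd.DataFrame({
    "Title": ["Show A", "Show B", "Show C"],
    "Exclude": [False, True, False],
})

# Hypothetical stand-in for the episode-level metadata table.
toy_metadata = pd.DataFrame({
    "podcast_name": ["Show A", "Show B", "Show B", "Show C", "Show D"],
    "episode_title": ["a1", "b1", "b2", "c1", "d1"],
})

# Titles whose Exclude flag is False.
toy_include = toy_podcasts.loc[~toy_podcasts["Exclude"], "Title"].values

# Keep only episodes from included podcasts. "Show D" is dropped as well,
# because it does not appear in the top-50 table at all.
toy_filtered = toy_metadata[toy_metadata["podcast_name"].isin(toy_include)]
print(toy_filtered["podcast_name"].tolist())
```

Note that `isin` silently drops any podcast whose name is spelled differently in the two tables, so comparing the number of unique names before and after filtering is a cheap sanity check.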

### News Articles
News articles do not require preprocessing.

## Text Chunking

Podcasts and news articles can be long and often cover several topics, so we split them into shorter documents that are easier for the model to group by theme. Chunking also yields more data points, giving each topic more supporting evidence and making topic discovery more robust.

For our embeddings, we default to a `chunk_size` of 300 words, which we treat as a reasonable starting point: long enough to be meaningful, short enough not to mix too many topics. We can adjust this based on modeling results.

### General Chunking Function
Note: for consistency, this function makes chunks as even in length as possible, so a transcript never ends with a tiny leftover chunk.

In [None]:
def chunk_text(text, chunk_size=300):
    """Split documents into even chunks of approximately chunk_size"""
    words = text.split()
    n_chunks = math.ceil(len(words) / chunk_size)
    if n_chunks == 0:
        return []
    avg = len(words) // n_chunks
    rem = len(words) % n_chunks
    chunks = []
    start = 0
    for i in range(n_chunks):
        # spread the remainder, one extra word each, across the first `rem` chunks
        extra = 1 if i < rem else 0 
        end = start + avg + extra
        chunks.append(' '.join(words[start:end]))
        start = end
    return chunks
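
To make the even-split behavior concrete, the sketch below reproduces only the size arithmetic of `chunk_text`. `chunk_sizes` is a hypothetical helper introduced here for illustration; it is not part of the pipeline.

```python
import math

def chunk_sizes(n_words, chunk_size=300):
    """Word counts per chunk, mirroring the arithmetic in chunk_text (hypothetical helper)."""
    n_chunks = math.ceil(n_words / chunk_size)
    if n_chunks == 0:
        return []
    avg, rem = divmod(n_words, n_chunks)
    # the first `rem` chunks get one extra word each
    return [avg + 1 if i < rem else avg for i in range(n_chunks)]

print(chunk_sizes(700))  # three chunks of 234, 233 and 233 words
print(chunk_sizes(301))  # two chunks of 151 and 150 words, rather than 300 and 1
```

One consequence of this design is that chunks can be much shorter than `chunk_size`: a 301-word document becomes two chunks of about 150 words each.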

### Chunking Transcripts
Here we apply chunking to our podcast transcripts. Our outputs are:

1. `pod_docs`: a list containing chunks from all transcripts.
2. `pod_chunk_metadata`: a list of dictionaries containing, for each chunk, its source type, source name, title, publish date, filename, and a chunk id that is unique within the document.

In [None]:
metadata_table

In [None]:
pod_docs = []
pod_chunk_metadata = []

for idx, row in metadata_table.iterrows():
    filepath = row["filename"]
    try:
        with open(filepath, "r", encoding="utf-8") as f:
            text = f.read()
            chunks = chunk_text(text, chunk_size=300)
            for i, chunk in enumerate(chunks):
                pod_docs.append(chunk)
                # Attach all the original metadata and give each chunk a unique id for tracking
                pod_chunk_metadata.append({
                    "filename": row["filename"],
                    "source_type": "podcast",
                    "source_name": row["podcast_name"],
                    "title": row["episode_title"],
                    "publish_date": row["publish_date"],
                    "chunk_id": i
                })
    except Exception as e:
        print(f"Error reading {filepath}: {e}")

### Chunking News Articles
Here, we do the same chunking on our news articles. 

First, we load our news articles, which have been exported from Pandas DataFrames into `.pkl` files. We aggregate these into `news_articles`, a list of dictionaries where each article's text content is paired with keys for `outlet_name`, `title`, `date`, `filename`.

Then, we perform the same chunking we did on our podcast transcripts. Our outputs are:

1. `news_docs`: a list containing chunks from all news articles.
2. `news_chunk_metadata`: a list of dictionaries containing, for each chunk, its source type, source name, title, publish date, filename, and a chunk id that is unique within the document.

In [None]:
def get_article_filepaths():
    found_files = []
    for root, _, files in os.walk("data/article data"):
        for file in files:
            if file.endswith(".pkl") and "combined" in file:
                found_files.append(os.path.join(root, file))
    return found_files

In [None]:
def load_news_articles(filepaths):
    articles = []
    for fp in filepaths:
        outlet_name = "NPR" if "NPR" in fp else "PBS" if "PBS" in fp else "Unknown"
        df = pd.read_pickle(fp)
        for idx, row in df.iterrows():
            if row["content"]:
                articles.append({
                    "outlet_name": outlet_name,
                    "title": row["title"],
                    "date": row.get("date", None),
                    "content": " ".join(row["content"]) if isinstance(row["content"], list) else row["content"],  # join content if it's a list
                    "filename": fp
                })
    return articles

news_filepaths = get_article_filepaths()
news_articles = load_news_articles(news_filepaths)
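
The loader above accepts `content` as either a string or a list of strings. The sketch below isolates that normalization in a standalone function, with one assumed extension: missing values such as `None` or `NaN` are treated as empty text instead of being passed on to `chunk_text`. `normalize_content` is a hypothetical helper, not part of the original loader.

```python
def normalize_content(content):
    """Return article text as one string, or "" if nothing usable (hypothetical helper)."""
    if isinstance(content, (list, tuple)):
        # join paragraph lists, skipping any non-string entries
        return " ".join(part for part in content if isinstance(part, str))
    if isinstance(content, str):
        return content
    # None, NaN, or any other type: treat as missing
    return ""

print(normalize_content(["First paragraph.", "Second paragraph."]))
print(repr(normalize_content(float("nan"))))
```

Articles that normalize to an empty string can then be skipped with a plain truthiness check before chunking.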

In [None]:
news_docs = []
news_chunk_metadata = []

for article in news_articles:
    chunks = chunk_text(article["content"], chunk_size=300)
    for i, chunk in enumerate(chunks):
        news_docs.append(chunk)
        news_chunk_metadata.append({
            "source_type": "news",
            "source_name": article["outlet_name"],
            "title": article["title"],
            "publish_date": article["date"],
            "filename": article["filename"],
            "chunk_id": i
        })

## Running Topic Model
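Here we run our topic model on our combined news and podcast corpus. To save time, the model can instead be run on 10,000 randomly sampled chunks (approximately 2% of our total) by uncommenting the sampling cell below and passing the sampled lists to the later cells; as written, the cells that follow use the full corpus.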

In [None]:
all_docs = pod_docs + news_docs # joining our document corpora
all_chunk_metadata = pod_chunk_metadata + news_chunk_metadata # joining our metadata
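
Because `all_docs` and `all_chunk_metadata` are parallel lists, every later step relies on index `i` in one matching index `i` in the other. The sketch below, on made-up values, shows one way to check that alignment and to make the pairing explicit in a DataFrame; the `toy_` names are placeholders.

```python
import pandas as pd

# Toy stand-ins for the real lists; the values are made up.
toy_chunk_texts = ["chunk one text", "chunk two text", "news chunk text"]
toy_chunk_meta = [
    {"source_type": "podcast", "source_name": "Show A", "chunk_id": 0},
    {"source_type": "podcast", "source_name": "Show A", "chunk_id": 1},
    {"source_type": "news", "source_name": "NPR", "chunk_id": 0},
]

# The two lists must have the same length for index-based lookups to be valid.
assert len(toy_chunk_texts) == len(toy_chunk_meta)

# One row per chunk: metadata columns plus the chunk text.
toy_chunks_df = pd.DataFrame(toy_chunk_meta).assign(text=toy_chunk_texts)
print(toy_chunks_df.groupby("source_type").size())
```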

In [None]:
# sample_indices = random.sample(range(len(all_docs)), 10000) # taking a random sample of 10,000 chunks

# sampled_docs = [all_docs[i] for i in sample_indices]
# sampled_metadata = [all_chunk_metadata[i] for i in sample_indices]
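
If the sampled run is used, fixing the random seed makes the sample repeatable across sessions. The sketch below uses toy lists and an arbitrary seed of 42; both are assumptions for illustration only.

```python
import random

# Toy stand-ins; in the notebook these would be all_docs and all_chunk_metadata.
demo_docs = [f"doc {i}" for i in range(100)]
demo_meta = [{"chunk_id": i} for i in range(100)]

rng = random.Random(42)  # the seed value is an arbitrary choice
sample_size = min(10, len(demo_docs))  # guard against corpora smaller than the sample
sample_indices = rng.sample(range(len(demo_docs)), sample_size)

# Index both lists with the same indices so text and metadata stay paired.
sampled_docs = [demo_docs[i] for i in sample_indices]
sampled_meta = [demo_meta[i] for i in sample_indices]
print(len(sampled_docs), len(sampled_meta))
```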

### Custom Topics List

In [None]:
seeded_topics = ["Donald Trump",
"Kamala Harris",
"Tim Walz",
"JD Vance",
"Joe Biden",
"Bernie Sanders",
"Mike Johnson",
"Nancy Pelosi",
"Alexandria Ocasio-Cortez",
"Mitch McConnell",
"Pete Buttigieg",
"Xi Jinping",
"Vladimir Putin",
"Volodymyr Zelensky",
"Benjamin Netanyahu",
"Claudia Sheinbaum",
"Ali Khamenei",
"Diddy",
"Luigi Mangione",
"Pope",
"Kanye West",
"Jeff Bezos",
"Mark Zuckerberg",
"Tim Cook",
"Elon Musk",
"Sam Altman",
"Karoline Leavitt",
"Jeffrey Epstein",
"Mark Carney",
"Justin Trudeau",
"Robert F. Kennedy Jr",
"Pete Hegseth",
"Bob Menendez",
"Ron DeSantis",
"Kevin McCarthy",
"Chuck Schumer",
"Taylor Swift", 
"Caitlin Clark",
"China",
"Russia",
"Canada",
"United States",
"Mexico",
"El Salvador",
"Israel",
"Iran",
"United Kingdom",
"Taiwan",
"Saudi Arabia",
"India",
"Pakistan",
"ICE",
"Democratic Party",
"Republican Party",
"USAID",
"Texas floods",
"LGBTQ",
"Israel Gaza",
"India Pakistan",
"Russia Ukraine",
"FEMA",
"TikTok",
"Crypto",
"Tariffs",
"COVID-19",
"Taylor Swift Travis Kelce",
"Opioids"]

print(len(seeded_topics))
 

## Exporting Model and Topics Using Direct Classification 
We use the predefined topics in `seeded_topics` and assign each chunk to the topic whose embedding has the highest cosine similarity to the chunk's embedding. That similarity serves as the confidence score for the assignment.

In [None]:
# BERTopic setup
embedding_model = SentenceTransformer("all-mpnet-base-v2")
embeddings = embedding_model.encode(all_docs, show_progress_bar=True)
# seed_topic_list expects a list of lists of seed words, so split each topic string into words
seed_topic_list = [topic.split() for topic in seeded_topics]
topic_model = BERTopic(embedding_model=embedding_model, verbose=True, seed_topic_list=seed_topic_list)
topics, probs = topic_model.fit_transform(all_docs, embeddings)

# embeddings for direct classification
print("Encoding seeded topics...")
topic_embeddings = embedding_model.encode(seeded_topics, show_progress_bar=True)




## Direct Classification

In [None]:

custom_topics = []
similarity_scores = []

print(f"Using existing embeddings for {len(embeddings)} documents...")
print(f"Expected to process {len(all_docs)} documents...")


# if len(embeddings) != len(all_docs):
#     print(f"WARNING: Embeddings length ({len(embeddings)}) doesn't match documents length ({len(all_docs)})")
#     min_length = min(len(embeddings), len(all_docs))
#     print(f"Processing only first {min_length} documents")
#     process_count = min_length
# else:
#     process_count = len(embeddings)

# normalize the topic embeddings once, outside the loop
topic_embs_norm = topic_embeddings / np.linalg.norm(topic_embeddings, axis=1, keepdims=True)

process_count = len(embeddings)
for i in range(process_count):
    # unit-normalize the document embedding so the dot product is the cosine similarity
    doc_emb = embeddings[i]
    doc_emb_norm = doc_emb / np.linalg.norm(doc_emb)

    similarities = np.dot(topic_embs_norm, doc_emb_norm)
    most_similar_idx = np.argmax(similarities)

    custom_topics.append(seeded_topics[most_similar_idx])
    similarity_scores.append(similarities[most_similar_idx])

    if (i + 1) % 1000 == 0:
        print(f"Processed {i + 1}/{process_count} documents")

print(f"Direct classification complete. Processed {len(custom_topics)} documents.")
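
The loop above computes one cosine similarity vector per document. As a sketch of an equivalent vectorized form, the same result can be obtained with a single matrix product; the 2-D vectors and topic names below are made up for illustration and stand in for `embeddings`, `topic_embeddings`, and `seeded_topics`.

```python
import numpy as np

# Toy 2-D "embeddings"; real sentence embeddings have hundreds of dimensions.
toy_topic_names = ["topic_x", "topic_y"]
toy_topic_embs = np.array([[1.0, 0.0], [0.0, 2.0]])
toy_doc_embs = np.array([[3.0, 1.0], [0.5, 4.0], [1.0, 1.2]])

# Normalize rows to unit length so a dot product equals cosine similarity.
topic_unit = toy_topic_embs / np.linalg.norm(toy_topic_embs, axis=1, keepdims=True)
doc_unit = toy_doc_embs / np.linalg.norm(toy_doc_embs, axis=1, keepdims=True)

# (n_docs, n_topics) matrix of cosine similarities.
sim_matrix = doc_unit @ topic_unit.T

best_idx = sim_matrix.argmax(axis=1)  # index of the closest topic per document
best_score = sim_matrix.max(axis=1)   # cosine similarity to that topic
best_topic = [toy_topic_names[j] for j in best_idx]
print(best_topic)
```

For a corpus of several hundred thousand chunks, the full similarity matrix may be too large to hold in memory at once; processing the documents in batches of rows keeps the same logic.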



In [None]:

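def get_bertopic_probs(probs, n_docs):
    """Return one probability per document, whatever shape BERTopic returned."""
    if probs is None:
        return [0.0] * n_docs

    probs = np.asarray(probs)
    if probs.ndim == 1:
        # one probability per document
        return probs.tolist()
    elif probs.ndim == 2:
        # document-topic matrix: keep the highest probability per document
        return probs.max(axis=1).tolist()
    else:
        print(f"Warning: unexpected probs shape {probs.shape}; using zeros")
        return [0.0] * n_docs

bertopic_probs = get_bertopic_probs(probs, len(topics))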

# print(f"\nDebugging array lengths:")
# print(f"  all_docs: {len(all_docs)}")
# print(f"  topics: {len(topics)}")
# print(f"  bertopic_probs: {len(bertopic_probs)}")
# print(f"  custom_topics: {len(custom_topics)}")
# print(f"  similarity_scores: {len(similarity_scores)}")
# print(f"  embeddings: {len(embeddings)}")

expected_length = len(all_docs)
arrays_to_check = {
    'topics': topics,
    'bertopic_probs': bertopic_probs,
    'custom_topics': custom_topics,
    'similarity_scores': similarity_scores
}

for name, array in arrays_to_check.items():
    if len(array) != expected_length:
        print(f"ERROR: {name} has length {len(array)}, expected {expected_length}")
        if len(array) > expected_length:
            print(f"  Truncating {name} to {expected_length}")
            arrays_to_check[name] = array[:expected_length]
        else:
            print(f"  Padding {name} to {expected_length}")
            if name == 'topics':
                arrays_to_check[name] = list(array) + [-1] * (expected_length - len(array))
            elif name == 'bertopic_probs':
                arrays_to_check[name] = list(array) + [0.0] * (expected_length - len(array))
            elif name == 'custom_topics':
                arrays_to_check[name] = list(array) + ['Unknown'] * (expected_length - len(array))
            elif name == 'similarity_scores':
                arrays_to_check[name] = list(array) + [0.0] * (expected_length - len(array))

topics = arrays_to_check['topics']
bertopic_probs = arrays_to_check['bertopic_probs']
custom_topics = arrays_to_check['custom_topics']
similarity_scores = arrays_to_check['similarity_scores']

results_df = pd.DataFrame({
    'text': all_docs,
    'bertopic_topic': topics,
    'bertopic_prob': bertopic_probs,
    'custom_topic': custom_topics,
    'custom_topic_score': similarity_scores
})


results_df['custom_topic_confident'] = results_df['custom_topic_score'] > 0.3  # cosine similarity threshold; adjust as desired


custom_counts = pd.Series(custom_topics).value_counts()
print("\nDirect classification - Top 10 topics:")
for topic, count in custom_counts.head(10).items():
    avg_score = results_df[results_df['custom_topic'] == topic]['custom_topic_score'].mean()
    print(f"  {topic}: {count} documents (avg score: {avg_score:.3f})")
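
The 0.3 threshold above is a tunable choice. One way to see how sensitive the `custom_topic_confident` flag is to that choice is to compute the share of documents above several candidate thresholds. The scores below are made up; in the notebook the series would be `results_df['custom_topic_score']`.

```python
import pandas as pd

# Made-up cosine similarity scores for six documents.
toy_scores = pd.Series([0.12, 0.28, 0.31, 0.35, 0.47, 0.62])

# Share of documents that would count as confident at each candidate threshold.
confident_share = {t: float((toy_scores > t).mean()) for t in (0.2, 0.3, 0.4)}

for threshold, share in confident_share.items():
    print(f"threshold {threshold:.1f}: {share:.0%} of documents confident")
```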


In [None]:
print(len(results_df))

print(results_df.head())


In [None]:
results_df.to_csv('custom_topics.csv', index=False)