
# Topic Modeling with LDA (Gensim)

This notebook follows the required steps:

**(a) Article Selection & Topic Estimation**  
- Uses the attached news article (`Problem1News.txt`).  
- Manual topic estimate: **3 topics**, based on a quick read: (i) *Reagan ad vs. original speech*, (ii) *Tariff/trade-war impacts*, (iii) *US–Canada talks & reactions*.

**(b) Data Preparation**  
- Split the article by **paragraphs** (blank lines as delimiters).  
- Preprocess: lowercase, remove stopwords, and apply **stemming** (Porter).  
- Display samples of processed paragraphs.

**(c) LDA Model Implementation**  
- Train **LDA (gensim)** with `num_topics=3` on the preprocessed paragraphs.

**(d) Results Presentation**  
- Show **top 10 words** for each topic.  
- For each topic, show the **2 most associated paragraphs** (original text), and provide a short **2–3 word label**.

*Note:* The manual estimate of 3 topics is motivated by the article's structure: it moves between (1) the advert’s editing of Reagan’s 1987 speech, (2) arguments about tariffs/free trade and historical context, and (3) immediate political reactions and trade‑talk implications.


In [1]:
import os
import re

from gensim.corpora import Dictionary
from gensim.models import LdaModel
from gensim.parsing.preprocessing import STOPWORDS
from gensim.utils import simple_preprocess
from nltk.stem import PorterStemmer

ARTICLE_PATH = "Problem1News.txt"

assert os.path.exists(ARTICLE_PATH), f"File not found: {ARTICLE_PATH}"


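def split_into_paragraphs(text: str):
    """Split text into non-empty paragraphs, using blank lines as delimiters."""
    # Normalize Windows and old-Mac line endings to "\n"
    text = text.replace('\r\n', '\n').replace('\r', '\n')
    # One or more blank (or whitespace-only) lines separate paragraphs
    paras = re.split(r'\n\s*\n', text.strip())
    # Trim each paragraph and drop any that are empty
    return [p.strip() for p in paras if p.strip()]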


In [2]:
with open(ARTICLE_PATH, "r", encoding="utf-8", errors="ignore") as f:
    raw_text = f.read()

paragraphs = split_into_paragraphs(raw_text)
print(f"Total characters: {len(raw_text)}")
print(f"Total paragraphs: {len(paragraphs)}\n")

for i, para in enumerate(paragraphs[:2]):
    print(f"--- Paragraph {i} (original) ---")
    print(para[:800] + ("..." if len(para) > 800 else ""))
    print()

Total characters: 6515
Total paragraphs: 38

--- Paragraph 0 (original) ---
What's in Reagan advert that's caused US-Canada trade talks collapse?
3 hours ago

--- Paragraph 1 (original) ---
Share



In [3]:
# 2. Preprocessing: lowercase, remove stopwords, stemming
ps = PorterStemmer()

def preprocess_para(para: str):
    """Tokenize a paragraph, drop stopwords, and return Porter-stemmed tokens."""
    # simple_preprocess lowercases, strips accents (deacc=True), and tokenizes
    tokens = simple_preprocess(para, deacc=True, min_len=2, max_len=30)
    # remove stopwords
    tokens = [t for t in tokens if t not in STOPWORDS]
    # stemming
    tokens = [ps.stem(t) for t in tokens]
    return tokens

processed_docs = [preprocess_para(p) for p in paragraphs]

print("Sample processed paragraphs:")
for i in range(min(3, len(processed_docs))):
    print(f"--- Paragraph {i} (processed) ---")
    print(processed_docs[i])
    print()

Sample processed paragraphs:
--- Paragraph 0 (processed) ---
['reagan', 'advert', 'caus', 'canada', 'trade', 'talk', 'collaps', 'hour', 'ago']

--- Paragraph 1 (processed) ---
['share']

--- Paragraph 2 (processed) ---
['save', 'maia', 'davi', 'getti', 'imag', 'file', 'photo', 'ronald', 'reagan', 'wear', 'brown', 'suit', 'jacket', 'dark', 'red', 'tie', 'white', 'shirt', 'slick', 'brown', 'hair', 'speak', 'microphon', 'stand', 'flag', 'blue', 'curtain', 'getti', 'imag', 'radio', 'address', 'presid', 'ronald', 'reagan', 'focus', 'impact', 'tariff', 'presid', 'donald', 'trump', 'said', 'halt', 'trade', 'negoti', 'canada', 'immedi', 'advert', 'predecessor', 'ronald', 'reagan', 'say', 'tariff', 'hurt', 'american']
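
Paragraph 1 reduces to the single token `share`, which looks like page-widget residue rather than article text. One possible refinement, not part of the original pipeline, is to drop paragraphs with too few tokens before building the dictionary. The helper name `drop_short_docs` and the threshold `min_tokens=3` are illustrative assumptions, and the demo data is a toy subset modelled on the samples printed above.

```python
def drop_short_docs(paragraphs, processed_docs, min_tokens=3):
    """Keep only paragraphs whose processed form has at least `min_tokens` tokens.

    Returns the filtered (paragraphs, processed_docs) pair so that indices
    stay aligned between the original text and the token lists.
    """
    kept = [
        (para, tokens)
        for para, tokens in zip(paragraphs, processed_docs)
        if len(tokens) >= min_tokens
    ]
    kept_paragraphs = [para for para, _ in kept]
    kept_docs = [tokens for _, tokens in kept]
    return kept_paragraphs, kept_docs


# Toy data modelled on the printed samples (not the full article).
demo_paragraphs = ["Reagan advert headline", "Share"]
demo_docs = [
    ["reagan", "advert", "caus", "canada", "trade", "talk", "collaps", "hour", "ago"],
    ["share"],
]
kept_paragraphs, kept_docs = drop_short_docs(demo_paragraphs, demo_docs)
print(len(kept_docs))  # 1: the single-token "share" paragraph is dropped
```

Returning both lists together keeps `paragraphs[i]` and `processed_docs[i]` pointing at the same paragraph, which matters later when the original text is printed for each topic.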



In [4]:
# 3. LDA model implementation: prepare dictionary and corpus
dictionary = Dictionary(processed_docs)
# no_below=1 keeps every token regardless of rarity;
# no_above=0.7 drops tokens that appear in more than 70% of paragraphs
dictionary.filter_extremes(no_below=1, no_above=0.7)
corpus = [dictionary.doc2bow(doc) for doc in processed_docs]

print(f"Vocabulary size: {len(dictionary)}")
print(f"Sample BOW for paragraph 0: {corpus[0][:15] if corpus else []}")

Vocabulary size: 263
Sample BOW for paragraph 0: [(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1)]
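
With `no_below=1`, no token can be removed for being rare, since every token in the dictionary occurs in at least one paragraph; only `no_above=0.7` can prune. The pure-Python sketch below illustrates the documented meaning of the two parameters on toy data. It is an approximation for illustration, not gensim's implementation, and `filter_by_doc_freq` is a hypothetical helper.

```python
from collections import Counter


def filter_by_doc_freq(docs, no_below=1, no_above=0.7):
    """Return the sorted tokens whose document frequency is within bounds.

    A token is kept if it appears in at least `no_below` documents and in
    no more than `no_above` (a fraction) of all documents.
    """
    # Count each token once per document, not once per occurrence.
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)
    return sorted(
        token for token, df in doc_freq.items()
        if no_below <= df <= max_docs
    )


toy_docs = [
    ["tariff", "trade", "reagan"],
    ["tariff", "canada"],
    ["tariff", "trade", "talk"],
    ["tariff", "speech"],
]
# "tariff" is in 4 of 4 documents (more than 70%), so no_above removes it.
loose = filter_by_doc_freq(toy_docs, no_below=1, no_above=0.7)
# Raising no_below to 2 also removes tokens seen in only one document.
strict = filter_by_doc_freq(toy_docs, no_below=2, no_above=0.7)
print(loose)
print(strict)
```

On a corpus as small as this article, raising `no_below` above 1 can shrink the vocabulary sharply, so the setting is worth checking against the printed vocabulary size.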


In [5]:
# 4. Train LDA with the manual estimate of 3 topics
NUM_TOPICS = 3
lda = LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=NUM_TOPICS,
    random_state=42,
    passes=20,
    alpha='auto',
    eta='auto',
    per_word_topics=False
)

# 5. Show the 10 highest-probability words per topic
TOPN = 10
print("Top words per topic:")
topic_top_words = {}
for k in range(NUM_TOPICS):
    # show_topic returns (word, probability) pairs sorted by probability
    words = lda.show_topic(k, topn=TOPN)
    topic_top_words[k] = [w for w, _ in words]
    print(f"Topic {k}: {topic_top_words[k]}")

Top words per topic:
Topic 0: ['tariff', 'trade', 'reagan', 'high', 'long', 'minut', 'address', 'advert', 'barrier', 'run']
Topic 1: ['tariff', 'trade', 'say', 'trump', 'talk', 'canada', 'ad', 'depress', 'legisl', 'advert']
Topic 2: ['reagan', 'line', 'advert', 'trade', 'address', 'tariff', 'free', 'origin', 'say', 'speech']


In [6]:
# 6. For each paragraph, compute and store its full topic distribution
doc_topics = []
for doc_idx, bow in enumerate(corpus):
    topic_dist = lda.get_document_topics(bow, minimum_probability=0.0)
    topic_dict = {t: float(p) for t, p in topic_dist}
    doc_topics.append((doc_idx, topic_dict))

# For each topic, pick the top-2 paragraphs by probability of that topic
top_docs_by_topic = {}
for k in range(NUM_TOPICS):
    ranked = sorted(doc_topics, key=lambda x: x[1].get(k, 0.0), reverse=True)
    top_docs_by_topic[k] = ranked[:2]

# Heuristic label: capitalize and join the topic's two highest-probability words
def propose_label(words):
    return " ".join(w.capitalize() for w in words[:2])

topic_labels = {k: propose_label(topic_top_words[k]) for k in range(NUM_TOPICS)}

# Present results
for k in range(NUM_TOPICS):
    print("="*80)
    print(f"Topic {k} — Label: {topic_labels[k]}")
    print(f"Top words: {topic_top_words[k]}")
    print("\nMost associated paragraphs:")
    for rank, (doc_idx, tdist) in enumerate(top_docs_by_topic[k], start=1):
        score = tdist.get(k, 0.0)
        print(f"\n[{rank}] Paragraph #{doc_idx} (topic prob={score:.3f})")
        para_text = paragraphs[doc_idx]
        # Print up to 700 chars to keep it readable
        snippet = para_text[:700] + ("..." if len(para_text) > 700 else "")
        print(snippet)

Topic 0 — Label: Tariff Trade
Top words: ['tariff', 'trade', 'reagan', 'high', 'long', 'minut', 'address', 'advert', 'barrier', 'run']

Most associated paragraphs:

[1] Paragraph #2 (topic prob=0.998)
Save
Maia Davies
Getty Images A file photo of Ronald Reagan from the 1970s. He wears a brown suit jacket, a dark red tie, a white shirt, and has slicked back brown hair. He speaks into a microphone and stands in front of a US flag and a blue curtain.Getty Images
The radio address made by former President Ronald Reagan focused on the impact of tariffs
US President Donald Trump has said he will halt all trade negotiations with Canada immediately over an advert in which his predecessor Ronald Reagan says tariffs "hurt every American".

[2] Paragraph #23 (topic prob=0.997)
"What eventually occurs is: First, homegrown industries start relying on government protection in the form of high tariffs. They stop competing and stop making the innovative management and technological changes they need t


## Notes & Interpretation

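- The **labels** are heuristic: each is simply the first two top words of its topic, so Topics 0 and 1, which both begin with `tariff` and `trade`, receive the same label (`Tariff Trade`).  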
- Use the listed **most associated paragraphs** to refine the labels if needed.  
- You can re-run the LDA with a different `NUM_TOPICS` to see how topics regroup.
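
Because the label heuristic takes the first two top words, topics that share their leading words end up with identical labels. One possible alternative, sketched here as an assumption rather than the notebook's method, prefers words that appear in no other topic's top-word list. The function name `propose_distinct_label` is hypothetical; the word lists are copied from the "Top words per topic" output above.

```python
def propose_distinct_label(topic_top_words, k, n_words=2):
    """Label topic `k` with its first `n_words` top words unique to that topic.

    Falls back to the plain top words if too few unique words exist.
    """
    # Collect every top word that belongs to some other topic.
    other_words = {
        w
        for topic, words in topic_top_words.items()
        if topic != k
        for w in words
    }
    unique = [w for w in topic_top_words[k] if w not in other_words]
    # Pad with the ordinary top words when there are not enough unique ones.
    padding = [w for w in topic_top_words[k] if w not in unique]
    chosen = (unique + padding)[:n_words]
    return " ".join(w.capitalize() for w in chosen)


topic_top_words = {
    0: ['tariff', 'trade', 'reagan', 'high', 'long', 'minut', 'address', 'advert', 'barrier', 'run'],
    1: ['tariff', 'trade', 'say', 'trump', 'talk', 'canada', 'ad', 'depress', 'legisl', 'advert'],
    2: ['reagan', 'line', 'advert', 'trade', 'address', 'tariff', 'free', 'origin', 'say', 'speech'],
}
labels = {k: propose_distinct_label(topic_top_words, k) for k in topic_top_words}
print(labels)
```

The resulting labels are still stemmed tokens, so a final manual pass using the most associated paragraphs remains useful.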

**Manual Estimate Recap:** 3 topics — *Reagan ad vs. original speech*, *Tariff/trade-war impacts*, *US–Canada talks & reactions*.
