<a href="https://colab.research.google.com/github/swsewon3-ship-it/python-for-public-policy_2025-Fall/blob/main/Intro_Text_Analysis_TFIDF_LDA_Inaugurals.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# Intro to Text Analysis in Python: FreqDist → TF-IDF → Topic Modeling (U.S. Inaugural Addresses)

**Course**: Intro to Text Analysis for Public Policy  
**Format**: Live coding (~2.5 hours) + 60-min student-driven scavenger hunt  
**Dataset**: U.S. Presidential Inaugural Addresses (via NLTK)

### Learning Outcomes
- Load and lightly clean a real-world corpus.
- Contrast **raw frequency (FreqDist)** vs **TF-IDF** to understand term salience.
- Use **topic modeling (LDA, scikit-learn)** to uncover corpus-level themes.
- Compare and interpret outputs to make policy-relevant claims.


## 🔍 Comparing TF-IDF vs Topic Modeling in Policy Contexts

| Policy Context | What TF-IDF Reveals | What Topic Modeling Reveals | Example Insight |
|----------------|--------------------|-----------------------------|-----------------|
| 🏛 **Legislative & Political Communication** | Distinctive vocabulary by legislator or party (e.g., what makes one member's rhetoric unique) | Shared themes or issue clusters across speeches (e.g., "healthcare," "security," "immigration") | TF-IDF shows that one senator emphasizes "opioids" while another uses "cybersecurity"; LDA groups all health-related terms into a "public health" topic. |
| 🌐 **Diplomatic & Multilateral Statements** | Country-specific framing of an issue (what each nation stresses) | Global discourse patterns and alliances (how nations group around themes) | TF-IDF highlights Fiji's use of "loss and damage" vs. the U.S.'s "innovation"; LDA identifies a broader "climate adaptation" topic uniting small island states. |
| 🕊 **NGO & Think-Tank Reports** | Organization-specific keywords that signal focus or mandate | Latent themes that span organizations (e.g., "education policy," "macroeconomic reform") | TF-IDF shows UNICEF's "child rights" language; LDA uncovers cross-agency topics like "financing for development." |
| 📰 **Media Coverage of Global Policy** | Outlet-specific framing and language choices | Dominant topics in media discourse across sources or time | TF-IDF shows Fox News emphasizes "energy independence," The Guardian "climate justice"; LDA extracts topics like "energy transition," "policy negotiations." |
| ⚖️ **Comparative Policy Texts / Legislation** | Unique legal or regulatory phrasing in each country | Shared or evolving legal concepts across multiple texts | TF-IDF finds Germany stresses "Energiewende"; LDA surfaces a "renewable energy transition" topic appearing in multiple EU laws. |
| 💬 **Public Consultation & Citizen Feedback** | Stakeholder-specific concerns or jargon (e.g., NGOs vs. corporations) | Major themes emerging from thousands of comments | TF-IDF identifies NGOs' use of "pollution control" vs. industry's "innovation cost"; LDA clusters all responses into "economic impact," "environmental justice," etc. |
| 🧭 **Speeches & Strategic Messaging Over Time** | New or distinctive terms introduced in a given year or presidency | Long-term thematic evolution or cycles in national rhetoric | TF-IDF shows "pandemic" spikes in 2020; LDA reveals enduring topics like "foreign policy," "domestic economy," "national security." |

---

### 🧠 Summary

| Technique | Best For | Analytical Focus |
|------------|-----------|------------------|
| **TF-IDF** | Comparing documents or actors | *"What makes this text distinct?"* |
| **Topic Modeling (LDA)** | Discovering cross-document themes | *"What themes recur across the corpus?"* |

> ✅ Together, they bridge **micro-level distinctiveness** (TF-IDF) and **macro-level patterns** (LDA), enabling richer analysis of language in policy and diplomacy.



## 1) Environment Setup (Colab-friendly)
Run this once in Colab to install/upgrade packages and download NLTK data.


In [None]:

# In a fresh runtime (Runtime → Restart runtime), run:
!pip -q install "numpy==2.0.2" "scipy==1.14.1" "scikit-learn>=1.4"
!pip install nltk==3.9.2

import numpy, scipy, sklearn
print("NumPy:", numpy.__version__)     # → 2.0.2
print("SciPy:", scipy.__version__)     # → 1.14.x
print("sklearn:", sklearn.__version__) # ≥ 1.4


import nltk
nltk.download('inaugural')
nltk.download('stopwords')
nltk.download('punkt')
# Some environments expect punkt_tab as well:
nltk.download('punkt_tab')

print("✅ Setup complete.")



## 2) Imports
We use: `nltk` for data & preprocessing, `scikit-learn` for TF-IDF and LDA, `matplotlib/pandas` for exploration.


In [None]:

import re
from collections import Counter

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from nltk.corpus import inaugural, stopwords
from nltk import word_tokenize, FreqDist

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.decomposition import PCA, LatentDirichletAllocation


# Display all rows and columns (adjust numbers as needed)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

# Show full text in each cell (no truncation)
pd.set_option('display.max_colwidth', None)

# Expand the display width so wide tables don't wrap
pd.set_option('display.width', 0)

print("✅ Pandas display options set for full view.")

print("✅ Imports loaded.")



## 3) Load the U.S. Presidential Inaugural Addresses
We'll load speeches from NLTK's `inaugural` corpus. Each document is a speech like `1789-Washington.txt`.
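
As a minimal sketch of how such a file ID splits into metadata, assuming every ID follows the `YEAR-Name.txt` pattern shown above. The helper name `parse_fileid` is hypothetical and is not part of NLTK.

```python
from pathlib import PurePath

def parse_fileid(fileid):
    """Split an ID such as '1789-Washington.txt' into (year, president)."""
    stem = PurePath(fileid).stem          # drop the '.txt' extension
    year, president = stem.split("-", 1)  # split only on the first hyphen
    return int(year), president

print(parse_fileid("1789-Washington.txt"))
```

Splitting on the first hyphen only keeps the rest of the name intact if an ID ever contains a second hyphen.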


In [None]:

fileids = inaugural.fileids()
print(fileids)

records = []
for fid in fileids:
    raw = inaugural.raw(fid)  # .raw() returns the entire text of one file as a single string: no tokenization, no cleaning
    # File IDs look like '1789-Washington.txt': strip the extension, then split into year and president
    year, president = fid.replace('.txt', '').split('-', 1)
    records.append({'fileid': fid, 'year': int(year), 'president': president, 'text': raw})

df = pd.DataFrame(records).sort_values('year').reset_index(drop=True)
df.head(3)



## 4) Light Preprocessing
Simple, transparent steps:
- lowercase → tokenize → keep alphabetic tokens (len ≥ 3) → remove stopwords  
We also add a small custom stoplist of political boilerplate words.
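
The steps above can be tried on a single sentence before running them on the corpus. This is only a sketch: it uses a regular expression as a crude stand-in for `word_tokenize` and a tiny made-up stoplist (`DEMO_STOP`), so its output will not match NLTK's exactly on text with contractions or hyphens.

```python
import re

# Tiny illustrative stoplist; the notebook uses NLTK's full English list.
DEMO_STOP = {"the", "and", "of", "our", "for", "are"}

def demo_clean(text, stopwords=DEMO_STOP):
    """Lowercase, tokenize on letter runs, keep len >= 3, drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())  # stand-in for word_tokenize
    return [t for t in tokens if len(t) >= 3 and t not in stopwords]

sample = "We the People of the United States, in 1789, are FREE!"
print(demo_clean(sample))  # ['people', 'united', 'states', 'free']
```

Note how the number, the punctuation, and the short words disappear; that is the whole effect of the "light" cleaning.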


In [None]:

EN_STOP = set(stopwords.words('english'))
print("English stopwords:", EN_STOP)
CUSTOM_STOP = {
    # corpus artifacts / very generic political words (tweak in class)
    'applause', 'cheers', 'government', 'nation', 'people', 'states', 'united', 'american', 'america'
}
STOPWORDS = EN_STOP.union(CUSTOM_STOP)

def simple_clean_tokens(text):
    """
    1) Lowercase
    2) Tokenize
    3) Keep alphabetic tokens of length >= 3
    4) Remove stopwords
    """
    text = text.lower()
    tokens = word_tokenize(text)
    clean = [tok for tok in tokens if tok.isalpha() and len(tok) >= 3 and tok not in STOPWORDS]
    return clean

df['tokens'] = df['text'].apply(simple_clean_tokens)
df['text_clean'] = df['tokens'].apply(lambda toks: " ".join(toks))

print("Sample tokens:", df.loc[0, 'tokens'][:25])
df[['fileid','year','president','text_clean']].head(3)



## 5) Quick Exploration
A glance at token counts and frequent words gives intuition before modeling.


In [None]:

# Document lengths
df['n_tokens'] = df['tokens'].apply(len)
ax = df.plot(x='year', y='n_tokens', kind='bar', figsize=(12,4), legend=False)
ax.set_ylabel("Tokens per speech (after cleaning)")
ax.set_xlabel("Year of inaugural address")
ax.set_title("Document lengths")
plt.show()

# Global top terms (sanity check)
all_terms = [t for toks in df['tokens'] for t in toks]
top20 = Counter(all_terms).most_common(20)
pd.DataFrame(top20, columns=['term','count'])



## 5.5) Keyword Frequency (NLTK `FreqDist`) → Why TF-IDF?
`FreqDist` counts words across the **entire corpus**. High counts may reflect words that are common everywhere, not necessarily distinctive.

**Idea:** Use FreqDist to see the *loudest* words, then use TF-IDF to see the *most distinctive per document*.
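
To make the "loud vs distinctive" contrast concrete, here is a small sketch on three made-up token lists (the documents are invented for illustration). It compares the corpus-wide count of a word with the number of documents it appears in.

```python
from collections import Counter

docs = [
    ["peace", "war", "peace", "union"],
    ["peace", "economy", "jobs", "economy", "economy"],
    ["peace", "union", "liberty"],
]

# Corpus-wide counts: the kind of number FreqDist reports.
corpus_counts = Counter(tok for doc in docs for tok in doc)

# Document frequency: in how many documents does each word appear?
doc_freq = Counter(tok for doc in docs for tok in set(doc))

for word, count in corpus_counts.most_common(3):
    print(f"{word:10s} count={count}  in {doc_freq[word]} of {len(docs)} docs")
```

Here "peace" and "economy" have similar raw counts, but "peace" occurs in every document while "economy" is concentrated in one. That concentration is what TF-IDF rewards.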


In [None]:

all_tokens = [t for toks in df['tokens'] for t in toks]
fdist = FreqDist(all_tokens)

print("Top 20 most frequent words across all speeches:\n")
for word, freq in fdist.most_common(20):
    print(f"{word:15s} {freq}")

# Visualize (optional)
plt.figure(figsize=(10,4))
fdist.plot(20, cumulative=False)
plt.title("Most Frequent Words in Inaugural Speeches (Cleaned)")
plt.show()



## 6) TF-IDF with scikit-learn
**TF-IDF** = term frequency × inverse document frequency  
- Highlights terms that are frequent **and** specific to a document.  
- Downweights terms that appear in many documents.
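
The weighting can be worked through by hand on a toy corpus. The sketch below assumes the smoothed formula that scikit-learn documents for `TfidfVectorizer` defaults (`smooth_idf=True`, `norm='l2'`): idf = ln((1 + n) / (1 + df)) + 1, followed by scaling each document vector to unit length. The documents and helper names are invented for illustration.

```python
import math
from collections import Counter

docs = [
    ["peace", "war", "peace", "union"],
    ["peace", "economy", "jobs", "economy", "economy"],
    ["peace", "union", "liberty"],
]
n_docs = len(docs)
doc_freq = Counter(tok for doc in docs for tok in set(doc))

def smoothed_idf(term):
    # ln((1 + n) / (1 + df)) + 1, so a term found in every document gets idf = 1
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf_vector(doc):
    counts = Counter(doc)                                   # raw term frequency
    raw = {t: c * smoothed_idf(t) for t, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in raw.values()))      # L2 length
    return {t: v / norm for t, v in raw.items()}

weights = tfidf_vector(docs[1])
for term, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{term:10s} {w:.3f}")
```

In the second document, "economy" ranks first because it is both frequent there and rare elsewhere, while "peace" ranks last because it appears in every document.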


In [None]:

# TfidfVectorizer converts a collection of text documents (your speeches) into a matrix where:
#   each row    = one document (a speech)
#   each column = one term (a word)
#   each cell   = the TF-IDF weight of that term in that document
# min_df=2 → ignore words that appear in fewer than 2 documents
tfidf = TfidfVectorizer(min_df=2)

# fit_transform() feeds the cleaned text (the text_clean column) into the vectorizer.
# Two steps happen in one call:
#   .fit()       learns the vocabulary and the IDF (inverse document frequency) weights
#   .transform() applies the TF-IDF weighting to each document
# It returns a sparse matrix X_tfidf of shape (n_documents, n_terms).

X_tfidf = tfidf.fit_transform(df['text_clean'])

# get_feature_names_out() returns the vocabulary that the vectorizer kept.
# Wrapping it in a NumPy array makes indexing and sorting easy later,
# for example when finding the top TF-IDF terms for each speech.

terms = np.array(tfidf.get_feature_names_out())

# ((number of docs, number of unique terms), number of unique terms)
X_tfidf.shape, len(terms)


In [None]:

def top_tfidf_terms_for_doc(doc_idx, top_n=12):
    """Return the top-n (term, score) pairs by TF-IDF weight for one speech."""
    # Dense vector of TF-IDF scores for this speech; position j corresponds to terms[j].
    row = X_tfidf.getrow(doc_idx).toarray().ravel()
    # Positions of the most distinctive words in this speech, highest score first.
    top_idx = row.argsort()[::-1][:top_n]
    # terms[top_idx] gets the word strings; row[top_idx] gets their scores;
    # zip pairs them and list() turns the result into (word, score) tuples.
    return list(zip(terms[top_idx], row[top_idx]))

# Show a few speeches (early, middle, recent):
# the first (i = 0), the middle one (len(df)//2), and the last one (len(df)-1)
for i in [0, len(df)//2, len(df)-1]:
    print(f"\n=== {df.loc[i, 'year']} - {df.loc[i, 'president']} ===")  # header for the speech being examined
    for term, score in top_tfidf_terms_for_doc(i, top_n=12):  # top 12 terms for that document
        print(f"{term:15s} {score:.3f}")



### TF-IDF Similarity (Cosine)
Find which speeches are lexically similar using cosine similarity on TF-IDF vectors.


In [None]:

sim = cosine_similarity(X_tfidf) #Computes a cosine similarity matrix for all speeches
target = len(df) - 1  # Chooses the most recent speech (the last row in your DataFrame) as the target document.
pairs = [(i, sim[target, i]) for i in range(len(df)) if i != target] #Builds a list of tuples for every other speech, skips the target one
pairs_sorted = sorted(pairs, key=lambda x: x[1], reverse=True)[:5] #Sorts the list of (index, similarity) pairs by similarity score in descending order

print(f"Most similar to {df.loc[target,'year']} - {df.loc[target,'president']}:")
for idx, score in pairs_sorted: #Iterates over the top-5 most similar speeches
    print(f"  {df.loc[idx,'year']} - {df.loc[idx,'president']:12s}  (cosine={score:.3f})")

# Cosine similarity treats each speech as a high-dimensional vector (words as axes).
# The smaller the angle between two vectors, the more similar their language use, even if the speeches differ in length.
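
The length-invariance claim can be checked with a few lines of plain Python. This is an illustrative sketch with made-up three-term vectors; the notebook itself relies on scikit-learn's `cosine_similarity`.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

short_speech = [1, 2, 0]    # hypothetical term weights
long_speech = [10, 20, 0]   # same proportions, ten times "longer"
other_speech = [0, 1, 3]    # different emphasis

print(round(cosine(short_speech, long_speech), 3))   # 1.0
print(round(cosine(short_speech, other_speech), 3))  # 0.283
```

Scaling a vector changes its length but not its direction, so the first pair scores 1.0 despite the size difference.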


In [None]:

# 2D projection (small corpus → OK to densify)
pca = PCA(n_components=2, random_state=42)
coords = pca.fit_transform(X_tfidf.toarray())
fig, ax = plt.subplots(figsize=(6,6))
ax.scatter(coords[:,0], coords[:,1])
for i, row in df.iterrows():
    ax.annotate(str(row['year']), (coords[i,0], coords[i,1]), fontsize=8)
ax.set_title("Speeches in TF-IDF space (PCA projection)")
ax.set_xlabel("PC1"); ax.set_ylabel("PC2")
plt.show()


🎨 How to interpret the chart
1. Each point is a speech, represented by its overall word-usage pattern.

   * Two speeches close together → similar vocabularies (in terms of TF-IDF weighting).

   * Far apart → distinct word usage: different priorities, tone, or historical context.

2. The axes (PC1 and PC2) are abstract; they don't correspond to literal variables.

   * PCA components are linear combinations of all TF-IDF features (words).

   * You can't say "the x-axis means optimism vs war," but you can say: "Along PC1, speeches separate based on major vocabulary differences, perhaps early vs modern language."

   * You can interpret the axes qualitatively by checking which speeches cluster together.

3. Look for clusters, trends, or outliers.

   * Clusters of speeches from nearby years → continuity in rhetoric or themes.

   * Isolated points → outlier speeches (perhaps unusually short, poetic, or issue-focused).

   * You might see a chronological gradient, with the early 1800s on one side and the 2000s on the other, showing the evolution of presidential language.

| Concept   | Interpretation                                           |
| --------- | -------------------------------------------------------- |
| Distance  | How different two speeches' vocabularies are             |
| Clusters  | Shared themes, era, or rhetorical style                  |
| Outliers  | Unique speeches that break linguistic patterns           |
| PC1 / PC2 | Major axes of variation in word use, not literal topics  |
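
One way to put words on an abstract axis is to look at which terms carry the largest loadings on it. The sketch below does PCA by hand with NumPy's SVD on a made-up 4 × 3 weight matrix; `X`, `terms_demo`, and the numbers are assumptions for illustration, not output from the notebook.

```python
import numpy as np

# Hypothetical 4-document x 3-term weight matrix (rows = speeches).
X = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.1, 0.9, 0.2],
    [0.0, 0.8, 0.3],
])
terms_demo = np.array(["union", "freedom", "economy"])

X_centered = X - X.mean(axis=0)              # PCA operates on centered data
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

coords_demo = X_centered @ Vt[:2].T          # document coordinates on PC1, PC2
explained = S**2 / np.sum(S**2)              # share of variance per component

# Terms with the largest absolute loading drive PC1.
pc1_order = np.argsort(np.abs(Vt[0]))[::-1]
print("PC1 terms by |loading|:", terms_demo[pc1_order])
print("Variance explained:", np.round(explained, 3))
```

With the notebook's fitted model, the analogous inspection would sort `np.abs(pca.components_[0])` and index into `terms`. The sign of a component is arbitrary, so only the contrast between the two ends of an axis is meaningful.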


In [None]:
df[['year','president']].assign(PC1=coords[:,0], PC2=coords[:,1]).sort_values('PC1').head()


## 🎨 Visualizing TF-IDF: Word Cloud & Temporal Trend

Now that we've mapped speeches in abstract "TF-IDF space,"  
let's explore two other ways to *see* what TF-IDF tells us.

1. **Word Cloud**: visually emphasizes the distinctive words in one speech.  
   - Larger words = higher TF-IDF scores.  
   - Great for quick, qualitative insight into what stands out.

2. **Temporal Line Chart**: tracks how the importance of a given term changes over time.  
   - Example: does *"freedom"* rise or fall in salience across U.S. history?


In [None]:
# --- 1) Word Cloud for a Selected Speech ---
from wordcloud import WordCloud

# Pick a speech by index (0=earliest, -1=latest)
doc_idx = len(df) - 1  # last speech by default

# Generate dictionary of top TF-IDF terms
wc_data = dict(top_tfidf_terms_for_doc(doc_idx, top_n=100))

# Create and display the word cloud
wc = WordCloud(width=800, height=400, background_color='white')
wc.generate_from_frequencies(wc_data)

plt.figure(figsize=(10,5))
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.title(f"Word Cloud: {df.loc[doc_idx,'year']} - {df.loc[doc_idx,'president']}", fontsize=14)
plt.show()



In [None]:
# --- 2) Temporal Line Chart of a Word's TF-IDF Weight ---
# Choose a term to track over time
term = "freedom"  # try 'war', 'peace', 'liberty', etc. (custom stopwords such as 'america' were removed in Section 4)

if term in terms:
    term_idx = np.where(terms == term)[0][0]
    df[f"tfidf_{term}"] = X_tfidf[:, term_idx].toarray().ravel()

    plt.figure(figsize=(8,4))
    plt.plot(df['year'], df[f"tfidf_{term}"], marker='o', linestyle='-')
    plt.title(f'TF-IDF Weight of "{term}" Over Time', fontsize=14)
    plt.xlabel("Year of Inaugural Address")
    plt.ylabel("TF-IDF Score")
    plt.grid(alpha=0.3)
    plt.show()
else:
    print(f'Term "{term}" not found in vocabulary. Try another word.')


> ✅ **What TF-IDF tells us**: which words uniquely characterize each speech; which speeches use similar vocabularies.  
> ❌ **What it doesn't**: explicitly uncover *themes* shared across documents.



## 7) Topic Modeling with scikit-learn's LDA
**Latent Dirichlet Allocation (LDA)** models each document as a mixture of **topics** (word distributions).  
We'll build a bag-of-words matrix, fit an LDA model, inspect topics, and examine per-speech topic mixtures.
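
The "mixture of topics" idea can be made concrete before fitting anything. In the sketch below, two hand-written topics and one document mixture (all numbers invented) are combined into the word probabilities that the model would expect for that document. It illustrates the generative view only; it is not how scikit-learn estimates the topics.

```python
# Hypothetical topics: each is a probability distribution over the vocabulary.
topics = {
    "security": {"war": 0.5, "peace": 0.4, "jobs": 0.1},
    "economy":  {"war": 0.0, "peace": 0.1, "jobs": 0.9},
}

# Hypothetical topic mixture for one speech (proportions sum to 1).
doc_mixture = {"security": 0.75, "economy": 0.25}

def expected_word_probs(mixture, topics):
    """P(word | doc) = sum over topics of P(topic | doc) * P(word | topic)."""
    probs = {}
    for topic, share in mixture.items():
        for word, p in topics[topic].items():
            probs[word] = probs.get(word, 0.0) + share * p
    return probs

word_probs = expected_word_probs(doc_mixture, topics)
for word, p in sorted(word_probs.items(), key=lambda kv: -kv[1]):
    print(f"{word:6s} {p:.3f}")
```

Fitting LDA runs this logic in reverse: given only the observed word counts, it estimates both the topics and each document's mixture.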


In [None]:

# Bag-of-words for LDA
# CountVectorizer converts each document into a bag-of-words (word counts, not weights).
# min_df=2 means: ignore words that appear in fewer than 2 speeches (to reduce noise).
# fit_transform() builds the vocabulary and creates a document-term matrix:
#   rows = speeches; columns = unique words; values = how many times each word appears.
# vocab holds the list of all words (for displaying topic terms later).

cv = CountVectorizer(min_df=2)
X_counts = cv.fit_transform(df['text_clean'])
vocab = np.array(cv.get_feature_names_out())

# Train LDA
# K is the number of topics you want the model to find.
# This is not learned automatically; it's a parameter you choose.
# Try experimenting with different values:
#   K=5 → broader, more general themes (e.g., "war," "economy," "unity").
#   K=8 → more nuanced topics (e.g., "foreign policy," "domestic economy," "liberty").

K = 8  # adjust live (5, 8, 12)


# Initializes the LDA model from scikit-learn.
#   n_components=K → tells the model how many topics (components) to find.
#   learning_method="batch" → trains on the entire dataset at once.
#     (Alternative: "online" trains incrementally on chunks; "batch" is stable for small corpora like this.)
#   random_state=42 → ensures reproducible results (so every student gets the same topics).
#   max_iter=20 → number of passes over the data to improve the model; more iterations = more refined topics, but slower training.

lda = LatentDirichletAllocation(
    n_components=K,
    learning_method="batch",
    random_state=42,
    max_iter=20
)

# This line fits the model and simultaneously transforms the data into topic proportions
topic_mix = lda.fit_transform(X_counts)  # theta: (n_docs, K)

def show_topics(model, vocab, topn=12):
    """Print the top words for each topic.

    model: your trained LDA model (lda).
    vocab: array of all words in your vocabulary (from CountVectorizer).
    topn:  how many top words to display per topic (default = 12).
    """
    # model.components_ is a 2D NumPy array where:
    #   each row corresponds to a topic (Topic 0, Topic 1, ...),
    #   each column corresponds to a word in the vocabulary,
    #   each value is the importance (weight) of that word within the topic.
    # enumerate() loops through all topics (k) and their word-weight vectors (comp).
    for k, comp in enumerate(model.components_):
        # argsort() returns the indices that would sort the word weights in ascending order;
        # [::-1] reverses that to descending, and [:topn] keeps only the top n indices.
        top_idx = comp.argsort()[::-1][:topn]
        print(f"\nTopic {k}: " + ", ".join(vocab[top_idx]))

show_topics(lda, vocab, topn=12)

# Assemble per-document topic proportions
topic_df = pd.DataFrame(topic_mix, columns=[f"topic_{k}" for k in range(K)])
result_df = pd.concat([df[['fileid','year','president']], topic_df], axis=1)
result_df.head(5)


### Map the topics across a heatmap

🧠 How to Interpret the LDA Topic Heatmap

Each cell of the heatmap represents the proportion of a given topic within a specific speech.
Color encodes how strongly that topic appears. With the Plasma scale used in the plots, brighter (yellow) cells mean a higher proportion and darker (purple) cells mean a weaker presence.
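
Reading a heatmap row amounts to asking which topic takes the largest share of that speech. The sketch below does this for a made-up table of proportions; `theta_demo` and `dominant_topic` are hypothetical names, and the numbers are not model output.

```python
# Hypothetical topic proportions: one row per document, one column per topic.
theta_demo = {
    "doc_a": [0.70, 0.20, 0.10],
    "doc_b": [0.15, 0.05, 0.80],
    "doc_c": [0.40, 0.45, 0.15],
}

def dominant_topic(row):
    """Return (topic_index, proportion) for the largest entry in a row."""
    k = max(range(len(row)), key=lambda i: row[i])
    return k, row[k]

for doc, row in theta_demo.items():
    k, share = dominant_topic(row)
    print(f"{doc}: row sums to {sum(row):.2f}, dominant topic T{k} ({share:.2f})")
```

A row like `doc_c`, where two topics are nearly tied, is a reminder that the "dominant" label can hide a genuinely mixed speech.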

In [None]:
!pip -q install plotly

import numpy as np
import plotly.graph_objects as go


In [None]:
def topic_top_words(lda_model, vocab, topn=10):
    """Return:
       - topic_labels: list like ["T0: economy, growth, jobs", ...]
       - topic_words:  list of lists of the topn words per topic (for hover)
    """
    labels = []
    words_list = []
    for k, comp in enumerate(lda_model.components_):
        top_idx = comp.argsort()[::-1][:topn]
        words = vocab[top_idx].tolist()
        words_list.append(words)
        label = f"T{k}: " + ", ".join(words[:6])  # concise label for axis/hover
        labels.append(label)
    return labels, words_list

topic_labels, topic_words = topic_top_words(lda, vocab, topn=12)

# Columns in result_df that are topic proportions
topic_cols = [c for c in result_df.columns if c.startswith("topic_")]


In [None]:
# Pick rows you want to compare
rows = [0, len(result_df)//2, len(result_df)-1] #The list [0, len(result_df)//2, len(result_df)-1] = [first_speech, middle_speech, last_speech]
df_sel = result_df.iloc[rows].copy()

# This code converts the selected speeches' topic proportions into a NumPy matrix (Z) for plotting,
# creates x-axis labels (x) showing topic numbers (like "T0", "T1", ...), and builds y-axis labels (y)
# combining each speech's year and president name for the heatmap
Z = df_sel[topic_cols].to_numpy()
x = [f"T{int(c.split('_')[-1])}" for c in topic_cols]
y = [f"{r.year} - {r.president}" for _, r in df_sel.iterrows()]

# Build hovertext matrix: one string per cell
hovertext = []
for r_i, r in df_sel.iterrows():
    row_texts = []
    for t_i, col in enumerate(topic_cols):
        k = int(col.split('_')[-1])
        row_texts.append(
            f"<b>{int(r.year)} - {r.president}</b><br>"
            f"<b>Topic {k}</b><br>"
            f"Top words: {', '.join(topic_words[k][:10])}<br>"
            f"Proportion: {r[col]:.3f}"
        )
    hovertext.append(row_texts)

fig = go.Figure(
    data=go.Heatmap(
        z=Z,
        x=x,
        y=y,
        colorscale="Plasma",
        zmin=0.0, zmax=1.0,
        hoverinfo="text",
        text=hovertext
    )
)
fig.update_layout(
    title="Topic mixture (theta): selected speeches",
    xaxis_title="Topic",
    yaxis_title="Speech",
    height=300 + 40*len(rows),
    margin=dict(l=80, r=20, t=60, b=60)
)
fig.show()


In [None]:
# Ensure chronological order
df_sorted = result_df.sort_values("year").reset_index(drop=True)

# Reorder topics by global prevalence (more interpretable)
mean_by_topic = df_sorted[topic_cols].mean(axis=0).to_numpy()
order = np.argsort(mean_by_topic)[::-1]
ordered_cols = [topic_cols[i] for i in order]
ordered_x = [f"T{int(c.split('_')[-1])}" for c in ordered_cols]

A = df_sorted[ordered_cols].to_numpy()
y_all = df_sorted["year"].astype(str) + " - " + df_sorted["president"]

# Hovertext matrix for all speeches
hovertext_all = []
for r_i, r in df_sorted.iterrows():
    row_texts = []
    for c in ordered_cols:
        k = int(c.split('_')[-1])
        row_texts.append(
            f"<b>{int(r['year'])} - {r['president']}</b><br>"
            f"<b>Topic {k}</b><br>"
            f"Top words: {', '.join(topic_words[k][:10])}<br>"
            f"Proportion: {r[c]:.3f}"
        )
    hovertext_all.append(row_texts)

fig_all = go.Figure(
    data=go.Heatmap(
        z=A,
        x=ordered_x,
        y=y_all,
        colorscale="Plasma",
        zmin=0.0, zmax=1.0,
        hoverinfo="text",
        text=hovertext_all
    )
)
fig_all.update_layout(
    title="All speeches: topic mixture heatmap (topics ordered by prevalence)",
    xaxis_title="Topic",
    yaxis_title="Speech (year - president)",
    height=max(450, 14*len(df_sorted)),
    margin=dict(l=120, r=20, t=60, b=80)
)
fig_all.show()



## 8) TF-IDF vs LDA: Compare & Contrast
| Aspect | TF-IDF | LDA (Topics) |
|---|---|---|
| Unit | Terms per document | Topics (word dists); documents are mixtures |
| Great for | Keywording, distinctiveness, similarity | Thematic mapping across corpus |
| Limitations | No explicit themes | Needs K tuning; topics can blend/split |


## 9) 🎯 Student-Driven Policy Exploration (≈60 minutes)

Work in pairs. Your mission: **choose a policy area**, **build a small text corpus**, and **experiment** with TF-IDF and topic modeling to discover what language patterns define that space.

This is not a graded deliverable; it's a sandbox for exploration, pattern-finding, and discussion.

---

### 🧭 Part A: Choose a Policy Area
Pick an issue you care about. Examples:

- Climate policy & sustainability  
- Immigration & border security  
- Health care & public health  
- Economic growth & inequality  
- Civil rights & social justice  
- Foreign policy & diplomacy  

Then brainstorm: *Whose language represents this issue?*  
(e.g., presidents, UN leaders, legislators, NGOs, media outlets).

---

### 📚 Part B: Build Your Corpus

You'll need at least **10-20 short to medium speeches or statements**.

**Option 1: Use existing open archives**
- U.S. presidential speeches: [American Presidency Project](https://www.presidency.ucsb.edu/speeches)
- UN General Assembly statements: [UN Digital Library](https://digitallibrary.un.org/)
- EU or UK parliament debates: [Hansard](https://hansard.parliament.uk/), [Europarl](https://www.europarl.europa.eu/)
- NGO or think-tank reports: World Bank, IMF, WHO, Brookings, RAND, etc.

**Option 2: Scrape or collect your own (advanced)**
- Use `requests` + `BeautifulSoup` or a library such as `newspaper3k` to extract text.
- Or copy/paste short excerpts into `.txt` files and upload them to Colab.

📎 *Hint:* keep your text clean by removing headers, speaker names, and references.

See the scraping script at the end of this section (Steps 1-4) to get started.

---

### 🧩 Part C: Explore Frequency, TF-IDF, and Topics

1. **Frequency snapshot:**  
   Compute the top 15 most common words (`FreqDist`). Which ones are generic boilerplate (e.g., "people," "government")?

2. **Distinctiveness check:**  
   Run **TF-IDF** with `min_df=2` or `min_df=5`.  
   - Which words rise to the top?  
   - What do they reveal about your policy domain's unique framing?

3. **Similarity sleuthing:**  
   Using cosine similarity on TF-IDF vectors, find which two documents are most similar.  
   What links them: era, country, tone?

4. **Topic discovery:**  
   Train an **LDA model** (try `K=5`, `K=8`, `K=12`).  
   - Label each topic in 2-3 words.  
   - Which `K` feels most interpretable?  
   - Do your topics align with known sub-issues (e.g., "energy transition," "human rights," "trade policy")?

5. **Visualize:**  
   Create a PCA plot or a heatmap of your documents.  
   - What clusters appear?  
   - Does time, geography, or institution explain them?
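
For the similarity step, the most similar pair is the largest off-diagonal entry of the similarity matrix. A minimal sketch, assuming a small made-up symmetric matrix (`sim_demo`) and a hypothetical helper `most_similar_pair`:

```python
from itertools import combinations

labels = ["doc_a", "doc_b", "doc_c", "doc_d"]
# Hypothetical symmetric cosine-similarity matrix (diagonal = self-similarity).
sim_demo = [
    [1.00, 0.20, 0.65, 0.10],
    [0.20, 1.00, 0.30, 0.55],
    [0.65, 0.30, 1.00, 0.25],
    [0.10, 0.55, 0.25, 1.00],
]

def most_similar_pair(sim, labels):
    """Return (label_i, label_j, score) for the highest off-diagonal score."""
    pairs = combinations(range(len(labels)), 2)  # each unordered pair once, no self-pairs
    i, j = max(pairs, key=lambda p: sim[p[0]][p[1]])
    return labels[i], labels[j], sim[i][j]

print(most_similar_pair(sim_demo, labels))  # ('doc_a', 'doc_c', 0.65)
```

Iterating over unordered pairs skips the diagonal, which would otherwise always win with a score of 1.0.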

---

### 🕵️ Part D: Mini Scavenger Hunt Prompts

- **"Word Detective"**: Which words define your corpus when using TF-IDF vs raw frequency?  
- **"Similarity Sleuth"**: Which two documents look similar numerically but differ substantively?  
- **"Topic Whisperer"**: Choose one topic from your LDA output. Find two speeches that heavily feature it (> 0.3). What do they share?  
- **"Era Shift"**: Does any topic fade or grow over time? What might explain it?  
- **"Headline Writer"**: Summarize one document twice, once using TF-IDF terms and once using its dominant LDA topic. How do the headlines differ in tone?
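
For the "Topic Whisperer" prompt, the filter is a simple threshold on one topic's column. The sketch below uses invented shares and a hypothetical helper `heavy_documents`; in the notebook the same idea would be a boolean filter on one `topic_k` column of `result_df`.

```python
# Hypothetical shares of one chosen topic across five documents.
topic_share = {"doc_a": 0.42, "doc_b": 0.05, "doc_c": 0.31, "doc_d": 0.30, "doc_e": 0.12}

def heavy_documents(shares, threshold=0.3):
    """Documents whose share is strictly above the threshold, highest first."""
    hits = [(doc, s) for doc, s in shares.items() if s > threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)

print(heavy_documents(topic_share))  # [('doc_a', 0.42), ('doc_c', 0.31)]
```

The comparison is strict, so a document at exactly 0.30 is left out; loosen the threshold if too few documents qualify.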

---

### 🧠 Part E: Policy Reflection (Discussion, not submission)

Compare what each method tells you:

| Method | Reveals | Best for |
|---------|----------|----------|
| **TF-IDF** | Distinctive vocabulary per document | Comparing actors or countries |
| **LDA (Topic Modeling)** | Underlying shared themes | Tracking issue clusters and framing evolution |

> In your discussion:  
> - What language dominates your policy area?  
> - Whose framing or rhetoric stands out?  
> - How might these tools support evidence-based policy analysis?

---

✅ **Outcome:** You should be able to *talk through* what you learned,
not produce a written report. Your goal is pattern recognition, curiosity, and connecting computational text analysis to real policy discourse.


In [None]:
# =======================================================
# 🧭 STEP 1: Mount Google Drive
# =======================================================
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# =======================================================
# 🗂 STEP 2: Create a folder in Google Drive for text corpus
# =======================================================
import os

# Customize the folder name; each student can use their initials or topic
folder_name = "policy_corpus"
drive_path = "/content/drive/MyDrive"
corpus_dir = os.path.join(drive_path, folder_name)

os.makedirs(corpus_dir, exist_ok=True)
print(f"✅ Folder ready: {corpus_dir}")


In [None]:
# =======================================================
# 📰 STEP 3: Scrape Articles with newspaper3k and Save as .txt
# =======================================================

# trafilatura is included because extract_article() names it as the fallback extractor
!pip install newspaper3k lxml_html_clean trafilatura --quiet

# Import after successful install
import os
import time

import requests
from newspaper import Article, Config

HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/123.0.0.0 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
}

cfg = Config()
cfg.browser_user_agent = HEADERS["User-Agent"]
cfg.request_timeout = 20
cfg.memoize_articles = False

def extract_with_newspaper(url: str) -> str:
    """Try Newspaper with real UA; if download() gets 403, use requests + set_html()."""
    art = Article(url, config=cfg)
    try:
        art.download()              # may 403
        art.parse()
        return art.text.strip()
    except Exception:
        # Fallback: fetch with requests using real headers, then feed raw HTML to Newspaper
        r = requests.get(url, headers=HEADERS, timeout=30)
        r.raise_for_status()        # raises HTTPError on a 4xx/5xx response
        art = Article(url, config=cfg)
        art.set_html(r.text)
        art.parse()
        return art.text.strip()



import pathlib
SAVE_DIR = pathlib.Path("/content/drive/MyDrive/policy_corpus")
SAVE_DIR.mkdir(parents=True, exist_ok=True)

def safe_filename(title: str, i: int) -> str:
    base = "".join(c for c in title if c.isalnum() or c in (" ","_")).strip().replace(" ","_")
    if not base: base = f"article_{i}"
    return f"{i:02d}_{base[:60]}.txt"

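def extract_with_trafilatura(url: str) -> str:
    """Fallback extractor (missing from the original script; this version is an assumption).

    Requires the trafilatura package (pip install trafilatura).
    """
    import trafilatura
    html = trafilatura.fetch_url(url)                   # returns None if the download fails
    text = trafilatura.extract(html) if html else None  # returns None if no main text is found
    if not text:
        raise ValueError("trafilatura returned no text")
    return text.strip()

def extract_article(url: str, i: int) -> tuple[str, str]:
    """Return (text, extractor_name). Order: Newspaper (with requests fallback) → Trafilatura."""
    try:
        text = extract_with_newspaper(url)
        source = "newspaper3k"
    except Exception as e1:
        try:
            text = extract_with_trafilatura(url)
            source = "trafilatura"
        except Exception as e2:
            raise RuntimeError(f"Both extractors failed.\nNewspaper err: {e1}\nTrafilatura err: {e2}")
    return text, source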

def save_article(urls):
    import datetime
    for i, url in enumerate(urls, start=1):
        try:
            text, source = extract_article(url, i)
            # Use the last non-empty path segment (the article slug) as a filename hint
            title_hint = url.rstrip("/").split("/")[-1] or "article"
            fname = safe_filename(title_hint, i)
            fpath = SAVE_DIR / fname
            with open(fpath, "w", encoding="utf-8") as f:
                f.write(f"URL: {url}\n")
                f.write(f"SourceExtractor: {source}\n")
                f.write(f"SavedAtUTC: {datetime.datetime.utcnow().isoformat()}Z\n\n")
                f.write(text)
            print(f"✅ Saved ({source}): {fname}")
            time.sleep(1.0)  # be polite
        except Exception as e:
            print(f"⚠️ Skipped {url}: {e}")

# EXAMPLE URLS (swap in your policy-area links)
urls = [
  "https://www.un.org/sg/en/content/sg/statements/2025-11-08/secretary-generals-message-the-20th-conference-of-youth-climate-change",
    "https://www.un.org/sg/en/content/sg/statements/2025-11-07/secretary-generals-remarks-the-belem-climate-summit-energy-transition-roundtable-delivered",
    "https://www.un.org/sg/en/content/sg/statements/2025-11-06/secretary-generals-remarks-the-launch-of-the-tropical-forest-forever-facility-delivered",
]
save_article(urls)
print("Folder:", SAVE_DIR)


In [None]:
# =======================================================
# 🧾 STEP 4: Verify Saved Files
# =======================================================
import glob

files = sorted(glob.glob(os.path.join(corpus_dir, "*.txt")))
print(f"Found {len(files)} text files in Drive.")
for f in files:
    print("-", os.path.basename(f))
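
Files written by `save_article` begin with three metadata lines (`URL:`, `SourceExtractor:`, `SavedAtUTC:`) followed by a blank line. Left in place, those lines would leak into word counts. The sketch below shows one way to drop them when reading a file back; `read_body` is a hypothetical helper, and the demo writes a temporary file that imitates the saved layout rather than touching Drive.

```python
import pathlib
import tempfile

def read_body(path):
    """Return the article text, skipping the metadata block and its blank line."""
    content = pathlib.Path(path).read_text(encoding="utf-8")
    header, sep, body = content.partition("\n\n")
    # Only strip when the file really starts with the saved header.
    return body if sep and header.startswith("URL:") else content

# Demo with a temporary file that imitates the saved layout.
with tempfile.TemporaryDirectory() as tmp:
    demo = pathlib.Path(tmp) / "01_demo.txt"
    demo.write_text(
        "URL: https://example.org/a\nSourceExtractor: demo\nSavedAtUTC: 2025-01-01T00:00:00Z\n\n"
        "First paragraph.\n\nSecond paragraph.",
        encoding="utf-8",
    )
    body_text = read_body(demo)

print(body_text)
```

Applied to the `files` list above, `[read_body(f) for f in files]` would give a list of cleaned document strings ready for the same preprocessing used on the inaugural speeches.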



---

### Closing Thought
**FreqDist** shows what's loudest. **TF-IDF** shows what's distinctive. **LDA** shows what's thematic. Use all three to triangulate insights for public-policy questions.
