<a href="https://colab.research.google.com/github/swsewon3-ship-it/python-for-public-policy_2025-Fall/blob/main/Intro_Text_Analysis_TFIDF_LDA_Inaugurals.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# Intro to Text Analysis in Python: FreqDist → TF–IDF → Topic Modeling (U.S. Inaugural Addresses)

**Course**: Intro to Text Analysis for Public Policy  
**Format**: Live coding (~2.5 hours) + 60-min student-driven scavenger hunt  
**Dataset**: U.S. Presidential Inaugural Addresses (via NLTK)

### Learning Outcomes
- Load and lightly clean a real-world corpus.
- Contrast **raw frequency (FreqDist)** vs **TF–IDF** to understand term salience.
- Use **topic modeling (LDA, scikit-learn)** to uncover corpus-level themes.
- Compare and interpret outputs to make policy-relevant claims.


## 🔍 Comparing TF–IDF vs Topic Modeling in Policy Contexts

| Policy Context | What TF–IDF Reveals | What Topic Modeling Reveals | Example Insight |
|----------------|--------------------|-----------------------------|-----------------|
| 🏛 **Legislative & Political Communication** | Distinctive vocabulary by legislator or party (e.g., what makes one member’s rhetoric unique) | Shared themes or issue clusters across speeches (e.g., “healthcare,” “security,” “immigration”) | TF–IDF shows that one senator emphasizes “opioids” while another uses “cybersecurity”; LDA groups all health-related terms into a “public health” topic. |
| 🌐 **Diplomatic & Multilateral Statements** | Country-specific framing of an issue (what each nation stresses) | Global discourse patterns and alliances (how nations group around themes) | TF–IDF highlights Fiji’s use of “loss and damage” vs. the U.S.’s “innovation”; LDA identifies a broader “climate adaptation” topic uniting small island states. |
| 🕊 **NGO & Think-Tank Reports** | Organization-specific keywords that signal focus or mandate | Latent themes that span organizations (e.g., “education policy,” “macroeconomic reform”) | TF–IDF shows UNICEF’s “child rights” language; LDA uncovers cross-agency topics like “financing for development.” |
| 📰 **Media Coverage of Global Policy** | Outlet-specific framing and language choices | Dominant topics in media discourse across sources or time | TF–IDF shows Fox News emphasizes “energy independence,” The Guardian “climate justice”; LDA extracts topics like “energy transition,” “policy negotiations.” |
| ⚖️ **Comparative Policy Texts / Legislation** | Unique legal or regulatory phrasing in each country | Shared or evolving legal concepts across multiple texts | TF–IDF finds Germany stresses “Energiewende”; LDA surfaces a “renewable energy transition” topic appearing in multiple EU laws. |
| 💬 **Public Consultation & Citizen Feedback** | Stakeholder-specific concerns or jargon (e.g., NGOs vs. corporations) | Major themes emerging from thousands of comments | TF–IDF identifies NGOs’ use of “pollution control” vs. industry’s “innovation cost”; LDA clusters all responses into “economic impact,” “environmental justice,” etc. |
| 🧭 **Speeches & Strategic Messaging Over Time** | New or distinctive terms introduced in a given year or presidency | Long-term thematic evolution or cycles in national rhetoric | TF–IDF shows “pandemic” spikes in 2020; LDA reveals enduring topics like “foreign policy,” “domestic economy,” “national security.” |

---

### 🧠 Summary

| Technique | Best For | Analytical Focus |
|------------|-----------|------------------|
| **TF–IDF** | Comparing documents or actors | *“What makes this text distinct?”* |
| **Topic Modeling (LDA)** | Discovering cross-document themes | *“What themes recur across the corpus?”* |

> ✅ Together, they bridge **micro-level distinctiveness** (TF–IDF) and **macro-level patterns** (LDA), enabling richer analysis of language in policy and diplomacy.
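
To make the "distinctiveness" idea concrete, here is a minimal sketch that contrasts raw counts with a hand-rolled, textbook TF–IDF on three made-up mini speeches. The documents, the speaker names, and the `tfidf` helper are hypothetical, and the formula is the plain `tf * ln(N / df)` variant; scikit-learn's `TfidfVectorizer` applies IDF smoothing and L2 normalization by default, so its numbers will differ.

```python
import math
from collections import Counter

# Hypothetical toy "speeches"; not taken from the inaugural corpus.
docs = {
    "senator_a": "health funding health opioids health opioids crisis",
    "senator_b": "health funding health cybersecurity health cybersecurity networks",
    "senator_c": "health funding health schools health schools teachers",
}

tokenized = {name: text.split() for name, text in docs.items()}
n_docs = len(tokenized)

# Document frequency: in how many documents does each term appear?
doc_freq = Counter()
for tokens in tokenized.values():
    doc_freq.update(set(tokens))

def tfidf(tokens):
    """Textbook TF-IDF: (count / document length) * ln(N / df)."""
    counts = Counter(tokens)
    return {
        term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
        for term, count in counts.items()
    }

scores = {name: tfidf(tokens) for name, tokens in tokenized.items()}

for name, tokens in tokenized.items():
    top_raw = Counter(tokens).most_common(1)[0][0]
    top_tfidf = max(scores[name], key=scores[name].get)
    print(f"{name}: top raw term = {top_raw!r}, top TF-IDF term = {top_tfidf!r}")
```

In this toy setup "health" is the most frequent word in every speech, yet its TF–IDF weight is zero because it appears in all three documents; the highest-weighted term is instead the word that only that speaker uses.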



## 1) Environment Setup (Colab-friendly)
Run this once in Colab to install/upgrade packages and download NLTK data.


In [1]:

# In a fresh runtime (Runtime → Restart runtime), run:
# NumPy and SciPy are numerical/statistical libraries; scikit-learn is required for TF-IDF.

!pip -q install "numpy==2.0.2" "scipy==1.14.1" "scikit-learn>=1.4"
!pip install nltk==3.9.2

import numpy, scipy, sklearn
print("NumPy:", numpy.__version__)     # → 2.0.2
print("SciPy:", scipy.__version__)     # → 1.14.x
print("sklearn:", sklearn.__version__) # ≥ 1.4


import nltk
nltk.download('inaugural')  # the inaugural-address corpus itself
nltk.download('stopwords')  # stopword lists for filtering out very common words such as "a" and "the"
nltk.download('punkt')      # tokenizer models used to split text into sentences and words
# Some environments expect punkt_tab as well:
nltk.download('punkt_tab')

print("✅ Setup complete.")


Successfully installed nltk-3.9.2
NumPy: 2.0.2
SciPy: 1.14.1
sklearn: 1.6.1


[nltk_data] Downloading package inaugural to /root/nltk_data...
[nltk_data]   Unzipping corpora/inaugural.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


✅ Setup complete.



## 2) Imports
We use: `nltk` for data & preprocessing, `scikit-learn` for TF–IDF and LDA, `matplotlib/pandas` for exploration.


In [2]:

import re
from collections import Counter

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt   # charts

from nltk.corpus import inaugural, stopwords
from nltk import word_tokenize, FreqDist  # tokenizer and frequency distribution

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer  # TF-IDF and raw count matrices
from sklearn.metrics.pairwise import cosine_similarity  # similarity between document vectors (how alike two speeches' vocabularies are)
from sklearn.decomposition import PCA, LatentDirichletAllocation


# Display all rows and columns (adjust numbers as needed)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

# Show full text in each cell (no truncation)
pd.set_option('display.max_colwidth', None)

# Expand the display width so wide tables don't wrap
pd.set_option('display.width', 0)

print("✅ Pandas display options set for full view.")

print("✅ Imports loaded.")


✅ Pandas display options set for full view.
✅ Imports loaded.



## 3) Load the U.S. Presidential Inaugural Addresses
We’ll load speeches from NLTK’s `inaugural` corpus. Each document is a speech with a file ID like `1789-Washington.txt`.
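
Because each file ID follows a `<year>-<President>.txt` pattern, the year and name can be recovered from the ID alone. The sketch below shows one way to do that on a few IDs from the corpus; `parse_fileid` is a hypothetical helper name introduced here for illustration, and it runs without downloading any NLTK data.

```python
# A few file IDs in the corpus's "<year>-<President>.txt" pattern.
sample_fileids = ["1789-Washington.txt", "1861-Lincoln.txt", "1961-Kennedy.txt"]

def parse_fileid(fileid):
    """Split '1789-Washington.txt' into (1789, 'Washington')."""
    stem = fileid.removesuffix(".txt")     # str.removesuffix needs Python 3.9+
    year, president = stem.split("-", 1)   # split only on the first hyphen
    return int(year), president

for fid in sample_fileids:
    print(parse_fileid(fid))
```

Splitting on only the first hyphen is a defensive choice: a surname that itself contained a hyphen would stay in one piece.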


In [3]:

fileids = inaugural.fileids()   # list the file names in the corpus
print(fileids)

['1789-Washington.txt', '1793-Washington.txt', '1797-Adams.txt', '1801-Jefferson.txt', '1805-Jefferson.txt', '1809-Madison.txt', '1813-Madison.txt', '1817-Monroe.txt', '1821-Monroe.txt', '1825-Adams.txt', '1829-Jackson.txt', '1833-Jackson.txt', '1837-VanBuren.txt', '1841-Harrison.txt', '1845-Polk.txt', '1849-Taylor.txt', '1853-Pierce.txt', '1857-Buchanan.txt', '1861-Lincoln.txt', '1865-Lincoln.txt', '1869-Grant.txt', '1873-Grant.txt', '1877-Hayes.txt', '1881-Garfield.txt', '1885-Cleveland.txt', '1889-Harrison.txt', '1893-Cleveland.txt', '1897-McKinley.txt', '1901-McKinley.txt', '1905-Roosevelt.txt', '1909-Taft.txt', '1913-Wilson.txt', '1917-Wilson.txt', '1921-Harding.txt', '1925-Coolidge.txt', '1929-Hoover.txt', '1933-Roosevelt.txt', '1937-Roosevelt.txt', '1941-Roosevelt.txt', '1945-Roosevelt.txt', '1949-Truman.txt', '1953-Eisenhower.txt', '1957-Eisenhower.txt', '1961-Kennedy.txt', '1965-Johnson.txt', '1969-Nixon.txt', '1973-Nixon.txt', '1977-Carter.txt', '1981-Reagan.txt', '1985-Reaga

In [4]:
records = []
for fid in fileids:
    raw = inaugural.raw(fid)  # .raw() returns the full text of one file as a single string: no tokenization, no cleaning
    year, president = fid.replace('.txt', '').split('-', 1)  # e.g. '1789-Washington.txt' -> ('1789', 'Washington')
    # Each dict becomes one row; its keys become the DataFrame columns.
    records.append({'fileid': fid, 'year': int(year), 'president': president, 'text': raw})

df = pd.DataFrame(records).sort_values('year').reset_index(drop=True)
df.head(3)


Unnamed: 0,fileid,year,president,text
0,1789-Washington.txt,1789,Washington,"Fellow-Citizens of the Senate and of the House of Representatives:\n\nAmong the vicissitudes incident to life no event could have filled me with greater anxieties than that of which the notification was transmitted by your order, and received on the 14th day of the present month. On the one hand, I was summoned by my Country, whose voice I can never hear but with veneration and love, from a retreat which I had chosen with the fondest predilection, and, in my flattering hopes, with an immutable decision, as the asylum of my declining years -- a retreat which was rendered every day more necessary as well as more dear to me by the addition of habit to inclination, and of frequent interruptions in my health to the gradual waste committed on it by time. On the other hand, the magnitude and difficulty of the trust to which the voice of my country called me, being sufficient to awaken in the wisest and most experienced of her citizens a distrustful scrutiny into his qualifications, could not but overwhelm with despondence one who (inheriting inferior endowments from nature and unpracticed in the duties of civil administration) ought to be peculiarly conscious of his own deficiencies. In this conflict of emotions all I dare aver is that it has been my faithful study to collect my duty from a just appreciation of every circumstance by which it might be affected. 
All I dare hope is that if, in executing this task, I have been too much swayed by a grateful remembrance of former instances, or by an affectionate sensibility to this transcendent proof of the confidence of my fellow citizens, and have thence too little consulted my incapacity as well as disinclination for the weighty and untried cares before me, my error will be palliated by the motives which mislead me, and its consequences be judged by my country with some share of the partiality in which they originated.\n\nSuch being the impressions under which I have, in obedience to the public summons, repaired to the present station, it would be peculiarly improper to omit in this first official act my fervent supplications to that Almighty Being who rules over the universe, who presides in the councils of nations, and whose providential aids can supply every human defect, that His benediction may consecrate to the liberties and happiness of the people of the United States a Government instituted by themselves for these essential purposes, and may enable every instrument employed in its administration to execute with success the functions allotted to his charge. In tendering this homage to the Great Author of every public and private good, I assure myself that it expresses your sentiments not less than my own, nor those of my fellow citizens at large less than either. No people can be bound to acknowledge and adore the Invisible Hand which conducts the affairs of men more than those of the United States. 
Every step by which they have advanced to the character of an independent nation seems to have been distinguished by some token of providential agency; and in the important revolution just accomplished in the system of their united government the tranquil deliberations and voluntary consent of so many distinct communities from which the event has resulted can not be compared with the means by which most governments have been established without some return of pious gratitude, along with an humble anticipation of the future blessings which the past seem to presage. These reflections, arising out of the present crisis, have forced themselves too strongly on my mind to be suppressed. You will join with me, I trust, in thinking that there are none under the influence of which the proceedings of a new and free government can more auspiciously commence.\n\nBy the article establishing the executive department it is made the duty of the President ""to recommend to your consideration such measures as he shall judge necessary and expedient."" The circumstances under which I now meet you will acquit me from entering into that subject further than to refer to the great constitutional charter under which you are assembled, and which, in defining your powers, designates the objects to which your attention is to be given. It will be more consistent with those circumstances, and far more congenial with the feelings which actuate me, to substitute, in place of a recommendation of particular measures, the tribute that is due to the talents, the rectitude, and the patriotism which adorn the characters selected to devise and adopt them. 
In these honorable qualifications I behold the surest pledges that as on one side no local prejudices or attachments, no separate views nor party animosities, will misdirect the comprehensive and equal eye which ought to watch over this great assemblage of communities and interests, so, on another, that the foundation of our national policy will be laid in the pure and immutable principles of private morality, and the preeminence of free government be exemplified by all the attributes which can win the affections of its citizens and command the respect of the world. I dwell on this prospect with every satisfaction which an ardent love for my country can inspire, since there is no truth more thoroughly established than that there exists in the economy and course of nature an indissoluble union between virtue and happiness; between duty and advantage; between the genuine maxims of an honest and magnanimous policy and the solid rewards of public prosperity and felicity; since we ought to be no less persuaded that the propitious smiles of Heaven can never be expected on a nation that disregards the eternal rules of order and right which Heaven itself has ordained; and since the preservation of the sacred fire of liberty and the destiny of the republican model of government are justly considered, perhaps, as deeply, as finally, staked on the experiment entrusted to the hands of the American people.\n\nBesides the ordinary objects submitted to your care, it will remain with your judgment to decide how far an exercise of the occasional power delegated by the fifth article of the Constitution is rendered expedient at the present juncture by the nature of objections which have been urged against the system, or by the degree of inquietude which has given birth to them. 
Instead of undertaking particular recommendations on this subject, in which I could be guided by no lights derived from official opportunities, I shall again give way to my entire confidence in your discernment and pursuit of the public good; for I assure myself that whilst you carefully avoid every alteration which might endanger the benefits of an united and effective government, or which ought to await the future lessons of experience, a reverence for the characteristic rights of freemen and a regard for the public harmony will sufficiently influence your deliberations on the question how far the former can be impregnably fortified or the latter be safely and advantageously promoted.\n\nTo the foregoing observations I have one to add, which will be most properly addressed to the House of Representatives. It concerns myself, and will therefore be as brief as possible. When I was first honored with a call into the service of my country, then on the eve of an arduous struggle for its liberties, the light in which I contemplated my duty required that I should renounce every pecuniary compensation. 
From this resolution I have in no instance departed; and being still under the impressions which produced it, I must decline as inapplicable to myself any share in the personal emoluments which may be indispensably included in a permanent provision for the executive department, and must accordingly pray that the pecuniary estimates for the station in which I am placed may during my continuance in it be limited to such actual expenditures as the public good may be thought to require.\n\nHaving thus imparted to you my sentiments as they have been awakened by the occasion which brings us together, I shall take my present leave; but not without resorting once more to the benign Parent of the Human Race in humble supplication that, since He has been pleased to favor the American people with opportunities for deliberating in perfect tranquillity, and dispositions for deciding with unparalleled unanimity on a form of government for the security of their union and the advancement of their happiness, so His divine blessing may be equally conspicuous in the enlarged views, the temperate consultations, and the wise measures on which the success of this Government must depend. \n"
1,1793-Washington.txt,1793,Washington,"Fellow citizens, I am again called upon by the voice of my country to execute the functions of its Chief Magistrate. When the occasion proper for it shall arrive, I shall endeavor to express the high sense I entertain of this distinguished honor, and of the confidence which has been reposed in me by the people of united America.\n\nPrevious to the execution of any official act of the President the Constitution requires an oath of office. This oath I am now about to take, and in your presence: That if it shall be found during my administration of the Government I have in any instance violated willingly or knowingly the injunctions thereof, I may (besides incurring constitutional punishment) be subject to the upbraidings of all who are now witnesses of the present solemn ceremony.\n\n \n"
2,1797-Adams.txt,1797,Adams,"When it was first perceived, in early times, that no middle course for America remained between unlimited submission to a foreign legislature and a total independence of its claims, men of reflection were less apprehensive of danger from the formidable power of fleets and armies they must determine to resist than from those contests and dissensions which would certainly arise concerning the forms of government to be instituted over the whole and over the parts of this extensive country. Relying, however, on the purity of their intentions, the justice of their cause, and the integrity and intelligence of the people, under an overruling Providence which had so signally protected this country from the first, the representatives of this nation, then consisting of little more than half its present number, not only broke to pieces the chains which were forging and the rod of iron that was lifted up, but frankly cut asunder the ties which had bound them, and launched into an ocean of uncertainty.\n\nThe zeal and ardor of the people during the Revolutionary war, supplying the place of government, commanded a degree of order sufficient at least for the temporary preservation of society. The Confederation which was early felt to be necessary was prepared from the models of the Batavian and Helvetic confederacies, the only examples which remain with any detail and precision in history, and certainly the only ones which the people at large had ever considered. 
But reflecting on the striking difference in so many particulars between this country and those where a courier may go from the seat of government to the frontier in a single day, it was then certainly foreseen by some who assisted in Congress at the formation of it that it could not be durable.\n\nNegligence of its regulations, inattention to its recommendations, if not disobedience to its authority, not only in individuals but in States, soon appeared with their melancholy consequences -- universal languor, jealousies and rivalries of States, decline of navigation and commerce, discouragement of necessary manufactures, universal fall in the value of lands and their produce, contempt of public and private faith, loss of consideration and credit with foreign nations, and at length in discontents, animosities, combinations, partial conventions, and insurrection, threatening some great national calamity.\n\nIn this dangerous crisis the people of America were not abandoned by their usual good sense, presence of mind, resolution, or integrity. Measures were pursued to concert a plan to form a more perfect union, establish justice, insure domestic tranquillity, provide for the common defense, promote the general welfare, and secure the blessings of liberty. The public disquisitions, discussions, and deliberations issued in the present happy Constitution of Government.\n\nEmployed in the service of my country abroad during the whole course of these transactions, I first saw the Constitution of the United States in a foreign country. Irritated by no literary altercation, animated by no public debate, heated by no party animosity, I read it with great satisfaction, as the result of good heads prompted by good hearts, as an experiment better adapted to the genius, character, situation, and relations of this nation and country than any which had ever been proposed or suggested. 
In its general principles and great outlines it was conformable to such a system of government as I had ever most esteemed, and in some States, my own native State in particular, had contributed to establish. Claiming a right of suffrage, in common with my fellow-citizens, in the adoption or rejection of a constitution which was to rule me and my posterity, as well as them and theirs, I did not hesitate to express my approbation of it on all occasions, in public and in private. It was not then, nor has been since, any objection to it in my mind that the Executive and Senate were not more permanent. Nor have I ever entertained a thought of promoting any alteration in it but such as the people themselves, in the course of their experience, should see and feel to be necessary or expedient, and by their representatives in Congress and the State legislatures, according to the Constitution itself, adopt and ordain.\n\nReturning to the bosom of my country after a painful separation from it for ten years, I had the honor to be elected to a station under the new order of things, and I have repeatedly laid myself under the most serious obligations to support the Constitution. 
The operation of it has equaled the most sanguine expectations of its friends, and from an habitual attention to it, satisfaction in its administration, and delight in its effects upon the peace, order, prosperity, and happiness of the nation I have acquired an habitual attachment to it and veneration for it.\n\nWhat other form of government, indeed, can so well deserve our esteem and love?\n\nThere may be little solidity in an ancient idea that congregations of men into cities and nations are the most pleasing objects in the sight of superior intelligences, but this is very certain, that to a benevolent human mind there can be no spectacle presented by any nation more pleasing, more noble, majestic, or august, than an assembly like that which has so often been seen in this and the other Chamber of Congress, of a Government in which the Executive authority, as well as that of all the branches of the Legislature, are exercised by citizens selected at regular periods by their neighbors to make and execute laws for the general good. Can anything essential, anything more than mere ornament and decoration, be added to this by robes and diamonds? Can authority be more amiable and respectable when it descends from accidents or institutions established in remote antiquity than when it springs fresh from the hearts and judgments of an honest and enlightened people? For it is the people only that are represented. It is their power and majesty that is reflected, and only for their good, in every legitimate government, under whatever form it may appear. The existence of such a government as ours for any length of time is a full proof of a general dissemination of knowledge and virtue throughout the whole body of the people. And what object or consideration more pleasing than this can be presented to the human mind? 
If national pride is ever justifiable or excusable it is when it springs, not from power or riches, grandeur or glory, but from conviction of national innocence, information, and benevolence.\n\nIn the midst of these pleasing ideas we should be unfaithful to ourselves if we should ever lose sight of the danger to our liberties if anything partial or extraneous should infect the purity of our free, fair, virtuous, and independent elections. If an election is to be determined by a majority of a single vote, and that can be procured by a party through artifice or corruption, the Government may be the choice of a party for its own ends, not of the nation for the national good. If that solitary suffrage can be obtained by foreign nations by flattery or menaces, by fraud or violence, by terror, intrigue, or venality, the Government may not be the choice of the American people, but of foreign nations. It may be foreign nations who govern us, and not we, the people, who govern ourselves; and candid men will acknowledge that in such cases choice would have little advantage to boast of over lot or chance.\n\nSuch is the amiable and interesting system of government (and such are some of the abuses to which it may be exposed) which the people of America have exhibited to the admiration and anxiety of the wise and virtuous of all nations for eight years under the administration of a citizen who, by a long course of great actions, regulated by prudence, justice, temperance, and fortitude, conducting a people inspired with the same virtues and animated with the same ardent patriotism and love of liberty to independence and peace, to increasing wealth and unexampled prosperity, has merited the gratitude of his fellow-citizens, commanded the highest praises of foreign nations, and secured immortal glory with posterity.\n\nIn that retirement which is his voluntary choice may he long live to enjoy the delicious recollection of his services, the gratitude of mankind, the happy fruits 
of them to himself and the world, which are daily increasing, and that splendid prospect of the future fortunes of this country which is opening from year to year. His name may be still a rampart, and the knowledge that he lives a bulwark, against all open or secret enemies of his country's peace. This example has been recommended to the imitation of his successors by both Houses of Congress and by the voice of the legislatures and the people throughout the nation.\n\nOn this subject it might become me better to be silent or to speak with diffidence; but as something may be expected, the occasion, I hope, will be admitted as an apology if I venture to say that if a preference, upon principle, of a free republican government, formed upon long and serious reflection, after a diligent and impartial inquiry after truth; if an attachment to the Constitution of the United States, and a conscientious determination to support it until it shall be altered by the judgments and wishes of the people, expressed in the mode prescribed in it; if a respectful attention to the constitutions of the individual States and a constant caution and delicacy toward the State governments; if an equal and impartial regard to the rights, interest, honor, and happiness of all the States in the Union, without preference or regard to a northern or southern, an eastern or western, position, their various political opinions on unessential points or their personal attachments; if a love of virtuous men of all parties and denominations; if a love of science and letters and a wish to patronize every rational effort to encourage schools, colleges, universities, academies, and every institution for propagating knowledge, virtue, and religion among all classes of the people, not only for their benign influence on the happiness of life in all its stages and classes, and of society in all its forms, but as the only means of preserving our Constitution from its natural enemies, the spirit of sophistry, the 
spirit of party, the spirit of intrigue, the profligacy of corruption, and the pestilence of foreign influence, which is the angel of destruction to elective governments; if a love of equal laws, of justice, and humanity in the interior administration; if an inclination to improve agriculture, commerce, and manufacturers for necessity, convenience, and defense; if a spirit of equity and humanity toward the aboriginal nations of America, and a disposition to meliorate their condition by inclining them to be more friendly to us, and our citizens to be more friendly to them; if an inflexible determination to maintain peace and inviolable faith with all nations, and that system of neutrality and impartiality among the belligerent powers of Europe which has been adopted by this Government and so solemnly sanctioned by both Houses of Congress and applauded by the legislatures of the States and the public opinion, until it shall be otherwise ordained by Congress; if a personal esteem for the French nation, formed in a residence of seven years chiefly among them, and a sincere desire to preserve the friendship which has been so much for the honor and interest of both nations; if, while the conscious honor and integrity of the people of America and the internal sentiment of their own power and energies must be preserved, an earnest endeavor to investigate every just cause and remove every colorable pretense of complaint; if an intention to pursue by amicable negotiation a reparation for the injuries that have been committed on the commerce of our fellow-citizens by whatever nation, and if success can not be obtained, to lay the facts before the Legislature, that they may consider what further measures the honor and interest of the Government and its constituents demand; if a resolution to do justice as far as may depend upon me, at all times and to all nations, and maintain peace, friendship, and benevolence with all the world; if an unshaken confidence in the honor, 
spirit, and resources of the American people, on which I have so often hazarded my all and never been deceived; if elevated ideas of the high destinies of this country and of my own duties toward it, founded on a knowledge of the moral principles and intellectual improvements of the people deeply engraven on my mind in early life, and not obscured but exalted by experience and age; and, with humble reverence, I feel it to be my duty to add, if a veneration for the religion of a people who profess and call themselves Christians, and a fixed resolution to consider a decent respect for Christianity among the best recommendations for the public service, can enable me in any degree to comply with your wishes, it shall be my strenuous endeavor that this sagacious injunction of the two Houses shall not be without effect.\n\nWith this great example before me, with the sense and spirit, the faith and honor, the duty and interest, of the same American people pledged to support the Constitution of the United States, I entertain no doubt of its continuance in all its energy, and my mind is prepared without hesitation to lay myself under the most solemn obligations to support it to the utmost of my power.\n\nAnd may that Being who is supreme over all, the Patron of Order, the Fountain of Justice, and the Protector in all ages of the world of virtuous liberty, continue His blessing upon this nation and its Government and give it all possible success and duration consistent with the ends of His providence.\n"



## 4) Light Preprocessing
Simple, transparent steps:
- lowercase → tokenize → keep alphabetic tokens (len ≥ 3) → remove stopwords  
We also add a small custom stoplist of political boilerplate words.
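
To see what these steps do to a single sentence, here is a minimal, self-contained sketch. It is an illustration under stated assumptions, not the notebook's actual pipeline: `demo_clean_tokens`, `DEMO_EN_STOP`, and `DEMO_CUSTOM_STOP` are hypothetical names, the stoplists are tiny stand-ins for NLTK's English list and the custom list, and a regular expression replaces NLTK's `word_tokenize` so the snippet runs without downloading any data. One difference to keep in mind: the regex splits hyphenated words such as "fellow-citizens" into two tokens, whereas the notebook's approach appears to drop them, since the hyphenated token fails the alphabetic check.

```python
import re

# Hypothetical stand-ins for illustration only; the notebook itself uses
# NLTK's English stopword list plus a custom political stoplist.
DEMO_EN_STOP = {"the", "and", "of", "more", "for", "with"}
DEMO_CUSTOM_STOP = {"people", "states", "united"}
DEMO_STOP = DEMO_EN_STOP | DEMO_CUSTOM_STOP

def demo_clean_tokens(text, stop=DEMO_STOP):
    """Lowercase, tokenize with a regex, keep alphabetic tokens of length >= 3, drop stopwords."""
    # Assumption: a simple regex tokenizer instead of nltk.word_tokenize
    tokens = re.findall(r"[a-z]+", text.lower())
    return [tok for tok in tokens if len(tok) >= 3 and tok not in stop]

sample = "We, the People of the United States, seek a more perfect Union in 1789."
demo_tokens = demo_clean_tokens(sample)
print(demo_tokens)  # ['seek', 'perfect', 'union']
```

Short words ("we", "in", "a") fall to the length filter, the year falls to the alphabetic filter, and the rest are removed by the two stoplists. Changing `DEMO_CUSTOM_STOP` is a quick way to see how much a custom stoplist shapes what survives.
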


In [13]:

EN_STOP = set(stopwords.words('english'))
print("English stopwords:", EN_STOP)
CUSTOM_STOP = {
    # corpus artifacts / very generic political words (tweak in class)
    'applause', 'cheers', 'government', 'nation', 'people', 'states', 'united', 'american', 'america'
}
STOPWORDS = EN_STOP.union(CUSTOM_STOP)

def simple_clean_tokens(text):
    """
    1) Lowercase
    2) Tokenize
    3) Keep alphabetic tokens of length >= 3
    4) Remove stopwords
    """
    text = text.lower()
    # Tokenizing sets the unit of analysis; it can be done at the word level or the sentence level.
    tokens = word_tokenize(text)
    # Keep a token only if it comes from our text, is alphabetic, has at least 3 letters, and is not a stopword.
    clean = [tok for tok in tokens if tok.isalpha() and len(tok) >= 3 and tok not in STOPWORDS]
    return clean

df['tokens'] = df['text'].apply(simple_clean_tokens)
df['text_clean'] = df['tokens'].apply(lambda toks: " ".join(toks))
print("Sample tokens:", df.loc[0, 'tokens'][:25])
df[['fileid','year','president','text','text_clean']].head(3)



English stopwords: {'just', 'does', "we've", 'most', 'them', 'which', 'be', "needn't", "that'll", "mustn't", 'own', 'hadn', 'same', 'ourselves', 'any', 'been', 'how', 'doesn', 'do', "won't", "it'll", 'up', 'aren', 'his', 'down', 'having', "she's", 'did', 'there', 'on', 'very', 'out', 'yourselves', "it's", 'wasn', 'than', 'yourself', 'd', 'is', 'themselves', "isn't", 'by', 'more', "they'll", "she'd", 'above', 'to', 'too', "weren't", 'few', 'couldn', 'your', "they'd", 'other', 'while', 'of', 'i', "should've", "he'd", 'or', "you've", 'myself', 'herself', 'him', 'if', 'over', 've', "i'll", 'the', 'had', 'will', 'being', 'those', "we'll", 'these', 'against', 'what', 'a', 'was', 'won', 'in', "we're", 'nor', 'hers', 'then', 'we', "you're", 'until', 'll', 'at', 'an', 'our', "didn't", 'were', "we'd", "he'll", 'with', 'haven', 'didn', 'for', 'and', 'they', "hadn't", 'but', "shan't", 'it', 'as', 'mustn', 'during', 'here', 'can', 'my', 'after', "mightn't", 'before', 'from', "it'd", "she'll", 'ma',

Unnamed: 0,fileid,year,president,text,text_clean
0,1789-Washington.txt,1789,Washington,"Fellow-Citizens of the Senate and of the House of Representatives:\n\nAmong the vicissitudes incident to life no event could have filled me with greater anxieties than that of which the notification was transmitted by your order, and received on the 14th day of the present month. On the one hand, I was summoned by my Country, whose voice I can never hear but with veneration and love, from a retreat which I had chosen with the fondest predilection, and, in my flattering hopes, with an immutable decision, as the asylum of my declining years -- a retreat which was rendered every day more necessary as well as more dear to me by the addition of habit to inclination, and of frequent interruptions in my health to the gradual waste committed on it by time. On the other hand, the magnitude and difficulty of the trust to which the voice of my country called me, being sufficient to awaken in the wisest and most experienced of her citizens a distrustful scrutiny into his qualifications, could not but overwhelm with despondence one who (inheriting inferior endowments from nature and unpracticed in the duties of civil administration) ought to be peculiarly conscious of his own deficiencies. In this conflict of emotions all I dare aver is that it has been my faithful study to collect my duty from a just appreciation of every circumstance by which it might be affected. 
All I dare hope is that if, in executing this task, I have been too much swayed by a grateful remembrance of former instances, or by an affectionate sensibility to this transcendent proof of the confidence of my fellow citizens, and have thence too little consulted my incapacity as well as disinclination for the weighty and untried cares before me, my error will be palliated by the motives which mislead me, and its consequences be judged by my country with some share of the partiality in which they originated.\n\nSuch being the impressions under which I have, in obedience to the public summons, repaired to the present station, it would be peculiarly improper to omit in this first official act my fervent supplications to that Almighty Being who rules over the universe, who presides in the councils of nations, and whose providential aids can supply every human defect, that His benediction may consecrate to the liberties and happiness of the people of the United States a Government instituted by themselves for these essential purposes, and may enable every instrument employed in its administration to execute with success the functions allotted to his charge. In tendering this homage to the Great Author of every public and private good, I assure myself that it expresses your sentiments not less than my own, nor those of my fellow citizens at large less than either. No people can be bound to acknowledge and adore the Invisible Hand which conducts the affairs of men more than those of the United States. 
Every step by which they have advanced to the character of an independent nation seems to have been distinguished by some token of providential agency; and in the important revolution just accomplished in the system of their united government the tranquil deliberations and voluntary consent of so many distinct communities from which the event has resulted can not be compared with the means by which most governments have been established without some return of pious gratitude, along with an humble anticipation of the future blessings which the past seem to presage. These reflections, arising out of the present crisis, have forced themselves too strongly on my mind to be suppressed. You will join with me, I trust, in thinking that there are none under the influence of which the proceedings of a new and free government can more auspiciously commence.\n\nBy the article establishing the executive department it is made the duty of the President ""to recommend to your consideration such measures as he shall judge necessary and expedient."" The circumstances under which I now meet you will acquit me from entering into that subject further than to refer to the great constitutional charter under which you are assembled, and which, in defining your powers, designates the objects to which your attention is to be given. It will be more consistent with those circumstances, and far more congenial with the feelings which actuate me, to substitute, in place of a recommendation of particular measures, the tribute that is due to the talents, the rectitude, and the patriotism which adorn the characters selected to devise and adopt them. 
In these honorable qualifications I behold the surest pledges that as on one side no local prejudices or attachments, no separate views nor party animosities, will misdirect the comprehensive and equal eye which ought to watch over this great assemblage of communities and interests, so, on another, that the foundation of our national policy will be laid in the pure and immutable principles of private morality, and the preeminence of free government be exemplified by all the attributes which can win the affections of its citizens and command the respect of the world. I dwell on this prospect with every satisfaction which an ardent love for my country can inspire, since there is no truth more thoroughly established than that there exists in the economy and course of nature an indissoluble union between virtue and happiness; between duty and advantage; between the genuine maxims of an honest and magnanimous policy and the solid rewards of public prosperity and felicity; since we ought to be no less persuaded that the propitious smiles of Heaven can never be expected on a nation that disregards the eternal rules of order and right which Heaven itself has ordained; and since the preservation of the sacred fire of liberty and the destiny of the republican model of government are justly considered, perhaps, as deeply, as finally, staked on the experiment entrusted to the hands of the American people.\n\nBesides the ordinary objects submitted to your care, it will remain with your judgment to decide how far an exercise of the occasional power delegated by the fifth article of the Constitution is rendered expedient at the present juncture by the nature of objections which have been urged against the system, or by the degree of inquietude which has given birth to them. 
Instead of undertaking particular recommendations on this subject, in which I could be guided by no lights derived from official opportunities, I shall again give way to my entire confidence in your discernment and pursuit of the public good; for I assure myself that whilst you carefully avoid every alteration which might endanger the benefits of an united and effective government, or which ought to await the future lessons of experience, a reverence for the characteristic rights of freemen and a regard for the public harmony will sufficiently influence your deliberations on the question how far the former can be impregnably fortified or the latter be safely and advantageously promoted.\n\nTo the foregoing observations I have one to add, which will be most properly addressed to the House of Representatives. It concerns myself, and will therefore be as brief as possible. When I was first honored with a call into the service of my country, then on the eve of an arduous struggle for its liberties, the light in which I contemplated my duty required that I should renounce every pecuniary compensation. 
From this resolution I have in no instance departed; and being still under the impressions which produced it, I must decline as inapplicable to myself any share in the personal emoluments which may be indispensably included in a permanent provision for the executive department, and must accordingly pray that the pecuniary estimates for the station in which I am placed may during my continuance in it be limited to such actual expenditures as the public good may be thought to require.\n\nHaving thus imparted to you my sentiments as they have been awakened by the occasion which brings us together, I shall take my present leave; but not without resorting once more to the benign Parent of the Human Race in humble supplication that, since He has been pleased to favor the American people with opportunities for deliberating in perfect tranquillity, and dispositions for deciding with unparalleled unanimity on a form of government for the security of their union and the advancement of their happiness, so His divine blessing may be equally conspicuous in the enlarged views, the temperate consultations, and the wise measures on which the success of this Government must depend. 
\n",senate house representatives among vicissitudes incident life event could filled greater anxieties notification transmitted order received day present month one hand summoned country whose voice never hear veneration love retreat chosen fondest predilection flattering hopes immutable decision asylum declining years retreat rendered every day necessary well dear addition habit inclination frequent interruptions health gradual waste committed time hand magnitude difficulty trust voice country called sufficient awaken wisest experienced citizens distrustful scrutiny qualifications could overwhelm despondence one inheriting inferior endowments nature unpracticed duties civil administration ought peculiarly conscious deficiencies conflict emotions dare aver faithful study collect duty appreciation every circumstance might affected dare hope executing task much swayed grateful remembrance former instances affectionate sensibility transcendent proof confidence fellow citizens thence little consulted incapacity well disinclination weighty untried cares error palliated motives mislead consequences judged country share partiality originated impressions obedience public summons repaired present station would peculiarly improper omit first official act fervent supplications almighty rules universe presides councils nations whose providential aids supply every human defect benediction may consecrate liberties happiness instituted essential purposes may enable every instrument employed administration execute success functions allotted charge tendering homage great author every public private good assure expresses sentiments less fellow citizens large less either bound acknowledge adore invisible hand conducts affairs men every step advanced character independent seems distinguished token providential agency important revolution accomplished system tranquil deliberations voluntary consent many distinct communities event resulted compared means governments established without 
return pious gratitude along humble anticipation future blessings past seem presage reflections arising present crisis forced strongly mind suppressed join trust thinking none influence proceedings new free auspiciously commence article establishing executive department made duty president recommend consideration measures shall judge necessary expedient circumstances meet acquit entering subject refer great constitutional charter assembled defining powers designates objects attention given consistent circumstances far congenial feelings actuate substitute place recommendation particular measures tribute due talents rectitude patriotism adorn characters selected devise adopt honorable qualifications behold surest pledges one side local prejudices attachments separate views party animosities misdirect comprehensive equal eye ought watch great assemblage communities interests another foundation national policy laid pure immutable principles private morality preeminence free exemplified attributes win affections citizens command respect world dwell prospect every satisfaction ardent love country inspire since truth thoroughly established exists economy course nature indissoluble union virtue happiness duty advantage genuine maxims honest magnanimous policy solid rewards public prosperity felicity since ought less persuaded propitious smiles heaven never expected disregards eternal rules order right heaven ordained since preservation sacred fire liberty destiny republican model justly considered perhaps deeply finally staked experiment entrusted hands besides ordinary objects submitted care remain judgment decide far exercise occasional power delegated fifth article constitution rendered expedient present juncture nature objections urged system degree inquietude given birth instead undertaking particular recommendations subject could guided lights derived official opportunities shall give way entire confidence discernment pursuit public good assure whilst carefully 
avoid every alteration might endanger benefits effective ought await future lessons experience reverence characteristic rights freemen regard public harmony sufficiently influence deliberations question far former impregnably fortified latter safely advantageously promoted foregoing observations one add properly addressed house representatives concerns therefore brief possible first honored call service country eve arduous struggle liberties light contemplated duty required renounce every pecuniary compensation resolution instance departed still impressions produced must decline inapplicable share personal emoluments may indispensably included permanent provision executive department must accordingly pray pecuniary estimates station placed may continuance limited actual expenditures public good may thought require thus imparted sentiments awakened occasion brings together shall take present leave without resorting benign parent human race humble supplication since pleased favor opportunities deliberating perfect tranquillity dispositions deciding unparalleled unanimity form security union advancement happiness divine blessing may equally conspicuous enlarged views temperate consultations wise measures success must depend
1,1793-Washington.txt,1793,Washington,"Fellow citizens, I am again called upon by the voice of my country to execute the functions of its Chief Magistrate. When the occasion proper for it shall arrive, I shall endeavor to express the high sense I entertain of this distinguished honor, and of the confidence which has been reposed in me by the people of united America.\n\nPrevious to the execution of any official act of the President the Constitution requires an oath of office. This oath I am now about to take, and in your presence: That if it shall be found during my administration of the Government I have in any instance violated willingly or knowingly the injunctions thereof, I may (besides incurring constitutional punishment) be subject to the upbraidings of all who are now witnesses of the present solemn ceremony.\n\n \n",fellow citizens called upon voice country execute functions chief magistrate occasion proper shall arrive shall endeavor express high sense entertain distinguished honor confidence reposed previous execution official act president constitution requires oath office oath take presence shall found administration instance violated willingly knowingly injunctions thereof may besides incurring constitutional punishment subject upbraidings witnesses present solemn ceremony
2,1797-Adams.txt,1797,Adams,"When it was first perceived, in early times, that no middle course for America remained between unlimited submission to a foreign legislature and a total independence of its claims, men of reflection were less apprehensive of danger from the formidable power of fleets and armies they must determine to resist than from those contests and dissensions which would certainly arise concerning the forms of government to be instituted over the whole and over the parts of this extensive country. Relying, however, on the purity of their intentions, the justice of their cause, and the integrity and intelligence of the people, under an overruling Providence which had so signally protected this country from the first, the representatives of this nation, then consisting of little more than half its present number, not only broke to pieces the chains which were forging and the rod of iron that was lifted up, but frankly cut asunder the ties which had bound them, and launched into an ocean of uncertainty.\n\nThe zeal and ardor of the people during the Revolutionary war, supplying the place of government, commanded a degree of order sufficient at least for the temporary preservation of society. The Confederation which was early felt to be necessary was prepared from the models of the Batavian and Helvetic confederacies, the only examples which remain with any detail and precision in history, and certainly the only ones which the people at large had ever considered. 
But reflecting on the striking difference in so many particulars between this country and those where a courier may go from the seat of government to the frontier in a single day, it was then certainly foreseen by some who assisted in Congress at the formation of it that it could not be durable.\n\nNegligence of its regulations, inattention to its recommendations, if not disobedience to its authority, not only in individuals but in States, soon appeared with their melancholy consequences -- universal languor, jealousies and rivalries of States, decline of navigation and commerce, discouragement of necessary manufactures, universal fall in the value of lands and their produce, contempt of public and private faith, loss of consideration and credit with foreign nations, and at length in discontents, animosities, combinations, partial conventions, and insurrection, threatening some great national calamity.\n\nIn this dangerous crisis the people of America were not abandoned by their usual good sense, presence of mind, resolution, or integrity. Measures were pursued to concert a plan to form a more perfect union, establish justice, insure domestic tranquillity, provide for the common defense, promote the general welfare, and secure the blessings of liberty. The public disquisitions, discussions, and deliberations issued in the present happy Constitution of Government.\n\nEmployed in the service of my country abroad during the whole course of these transactions, I first saw the Constitution of the United States in a foreign country. Irritated by no literary altercation, animated by no public debate, heated by no party animosity, I read it with great satisfaction, as the result of good heads prompted by good hearts, as an experiment better adapted to the genius, character, situation, and relations of this nation and country than any which had ever been proposed or suggested. 
In its general principles and great outlines it was conformable to such a system of government as I had ever most esteemed, and in some States, my own native State in particular, had contributed to establish. Claiming a right of suffrage, in common with my fellow-citizens, in the adoption or rejection of a constitution which was to rule me and my posterity, as well as them and theirs, I did not hesitate to express my approbation of it on all occasions, in public and in private. It was not then, nor has been since, any objection to it in my mind that the Executive and Senate were not more permanent. Nor have I ever entertained a thought of promoting any alteration in it but such as the people themselves, in the course of their experience, should see and feel to be necessary or expedient, and by their representatives in Congress and the State legislatures, according to the Constitution itself, adopt and ordain.\n\nReturning to the bosom of my country after a painful separation from it for ten years, I had the honor to be elected to a station under the new order of things, and I have repeatedly laid myself under the most serious obligations to support the Constitution. 
The operation of it has equaled the most sanguine expectations of its friends, and from an habitual attention to it, satisfaction in its administration, and delight in its effects upon the peace, order, prosperity, and happiness of the nation I have acquired an habitual attachment to it and veneration for it.\n\nWhat other form of government, indeed, can so well deserve our esteem and love?\n\nThere may be little solidity in an ancient idea that congregations of men into cities and nations are the most pleasing objects in the sight of superior intelligences, but this is very certain, that to a benevolent human mind there can be no spectacle presented by any nation more pleasing, more noble, majestic, or august, than an assembly like that which has so often been seen in this and the other Chamber of Congress, of a Government in which the Executive authority, as well as that of all the branches of the Legislature, are exercised by citizens selected at regular periods by their neighbors to make and execute laws for the general good. Can anything essential, anything more than mere ornament and decoration, be added to this by robes and diamonds? Can authority be more amiable and respectable when it descends from accidents or institutions established in remote antiquity than when it springs fresh from the hearts and judgments of an honest and enlightened people? For it is the people only that are represented. It is their power and majesty that is reflected, and only for their good, in every legitimate government, under whatever form it may appear. The existence of such a government as ours for any length of time is a full proof of a general dissemination of knowledge and virtue throughout the whole body of the people. And what object or consideration more pleasing than this can be presented to the human mind? 
If national pride is ever justifiable or excusable it is when it springs, not from power or riches, grandeur or glory, but from conviction of national innocence, information, and benevolence.\n\nIn the midst of these pleasing ideas we should be unfaithful to ourselves if we should ever lose sight of the danger to our liberties if anything partial or extraneous should infect the purity of our free, fair, virtuous, and independent elections. If an election is to be determined by a majority of a single vote, and that can be procured by a party through artifice or corruption, the Government may be the choice of a party for its own ends, not of the nation for the national good. If that solitary suffrage can be obtained by foreign nations by flattery or menaces, by fraud or violence, by terror, intrigue, or venality, the Government may not be the choice of the American people, but of foreign nations. It may be foreign nations who govern us, and not we, the people, who govern ourselves; and candid men will acknowledge that in such cases choice would have little advantage to boast of over lot or chance.\n\nSuch is the amiable and interesting system of government (and such are some of the abuses to which it may be exposed) which the people of America have exhibited to the admiration and anxiety of the wise and virtuous of all nations for eight years under the administration of a citizen who, by a long course of great actions, regulated by prudence, justice, temperance, and fortitude, conducting a people inspired with the same virtues and animated with the same ardent patriotism and love of liberty to independence and peace, to increasing wealth and unexampled prosperity, has merited the gratitude of his fellow-citizens, commanded the highest praises of foreign nations, and secured immortal glory with posterity.\n\nIn that retirement which is his voluntary choice may he long live to enjoy the delicious recollection of his services, the gratitude of mankind, the happy fruits 
of them to himself and the world, which are daily increasing, and that splendid prospect of the future fortunes of this country which is opening from year to year. His name may be still a rampart, and the knowledge that he lives a bulwark, against all open or secret enemies of his country's peace. This example has been recommended to the imitation of his successors by both Houses of Congress and by the voice of the legislatures and the people throughout the nation.\n\nOn this subject it might become me better to be silent or to speak with diffidence; but as something may be expected, the occasion, I hope, will be admitted as an apology if I venture to say that if a preference, upon principle, of a free republican government, formed upon long and serious reflection, after a diligent and impartial inquiry after truth; if an attachment to the Constitution of the United States, and a conscientious determination to support it until it shall be altered by the judgments and wishes of the people, expressed in the mode prescribed in it; if a respectful attention to the constitutions of the individual States and a constant caution and delicacy toward the State governments; if an equal and impartial regard to the rights, interest, honor, and happiness of all the States in the Union, without preference or regard to a northern or southern, an eastern or western, position, their various political opinions on unessential points or their personal attachments; if a love of virtuous men of all parties and denominations; if a love of science and letters and a wish to patronize every rational effort to encourage schools, colleges, universities, academies, and every institution for propagating knowledge, virtue, and religion among all classes of the people, not only for their benign influence on the happiness of life in all its stages and classes, and of society in all its forms, but as the only means of preserving our Constitution from its natural enemies, the spirit of sophistry, the 
spirit of party, the spirit of intrigue, the profligacy of corruption, and the pestilence of foreign influence, which is the angel of destruction to elective governments; if a love of equal laws, of justice, and humanity in the interior administration; if an inclination to improve agriculture, commerce, and manufacturers for necessity, convenience, and defense; if a spirit of equity and humanity toward the aboriginal nations of America, and a disposition to meliorate their condition by inclining them to be more friendly to us, and our citizens to be more friendly to them; if an inflexible determination to maintain peace and inviolable faith with all nations, and that system of neutrality and impartiality among the belligerent powers of Europe which has been adopted by this Government and so solemnly sanctioned by both Houses of Congress and applauded by the legislatures of the States and the public opinion, until it shall be otherwise ordained by Congress; if a personal esteem for the French nation, formed in a residence of seven years chiefly among them, and a sincere desire to preserve the friendship which has been so much for the honor and interest of both nations; if, while the conscious honor and integrity of the people of America and the internal sentiment of their own power and energies must be preserved, an earnest endeavor to investigate every just cause and remove every colorable pretense of complaint; if an intention to pursue by amicable negotiation a reparation for the injuries that have been committed on the commerce of our fellow-citizens by whatever nation, and if success can not be obtained, to lay the facts before the Legislature, that they may consider what further measures the honor and interest of the Government and its constituents demand; if a resolution to do justice as far as may depend upon me, at all times and to all nations, and maintain peace, friendship, and benevolence with all the world; if an unshaken confidence in the honor, 
spirit, and resources of the American people, on which I have so often hazarded my all and never been deceived; if elevated ideas of the high destinies of this country and of my own duties toward it, founded on a knowledge of the moral principles and intellectual improvements of the people deeply engraven on my mind in early life, and not obscured but exalted by experience and age; and, with humble reverence, I feel it to be my duty to add, if a veneration for the religion of a people who profess and call themselves Christians, and a fixed resolution to consider a decent respect for Christianity among the best recommendations for the public service, can enable me in any degree to comply with your wishes, it shall be my strenuous endeavor that this sagacious injunction of the two Houses shall not be without effect.\n\nWith this great example before me, with the sense and spirit, the faith and honor, the duty and interest, of the same American people pledged to support the Constitution of the United States, I entertain no doubt of its continuance in all its energy, and my mind is prepared without hesitation to lay myself under the most solemn obligations to support it to the utmost of my power.\n\nAnd may that Being who is supreme over all, the Patron of Order, the Fountain of Justice, and the Protector in all ages of the world of virtuous liberty, continue His blessing upon this nation and its Government and give it all possible success and duration consistent with the ends of His providence.\n",first perceived early times middle course remained unlimited submission foreign legislature total independence claims men reflection less apprehensive danger formidable power fleets armies must determine resist contests dissensions would certainly arise concerning forms instituted whole parts extensive country relying however purity intentions justice cause integrity intelligence overruling providence signally protected country first representatives consisting little half 
present number broke pieces chains forging rod iron lifted frankly cut asunder ties bound launched ocean uncertainty zeal ardor revolutionary war supplying place commanded degree order sufficient least temporary preservation society confederation early felt necessary prepared models batavian helvetic confederacies examples remain detail precision history certainly ones large ever considered reflecting striking difference many particulars country courier may seat frontier single day certainly foreseen assisted congress formation could durable negligence regulations inattention recommendations disobedience authority individuals soon appeared melancholy consequences universal languor jealousies rivalries decline navigation commerce discouragement necessary manufactures universal fall value lands produce contempt public private faith loss consideration credit foreign nations length discontents animosities combinations partial conventions insurrection threatening great national calamity dangerous crisis abandoned usual good sense presence mind resolution integrity measures pursued concert plan form perfect union establish justice insure domestic tranquillity provide common defense promote general welfare secure blessings liberty public disquisitions discussions deliberations issued present happy constitution employed service country abroad whole course transactions first saw constitution foreign country irritated literary altercation animated public debate heated party animosity read great satisfaction result good heads prompted good hearts experiment better adapted genius character situation relations country ever proposed suggested general principles great outlines conformable system ever esteemed native state particular contributed establish claiming right suffrage common adoption rejection constitution rule posterity well hesitate express approbation occasions public private since objection mind executive senate permanent ever entertained thought promoting 
alteration course experience see feel necessary expedient representatives congress state legislatures according constitution adopt ordain returning bosom country painful separation ten years honor elected station new order things repeatedly laid serious obligations support constitution operation equaled sanguine expectations friends habitual attention satisfaction administration delight effects upon peace order prosperity happiness acquired habitual attachment veneration form indeed well deserve esteem love may little solidity ancient idea congregations men cities nations pleasing objects sight superior intelligences certain benevolent human mind spectacle presented pleasing noble majestic august assembly like often seen chamber congress executive authority well branches legislature exercised citizens selected regular periods neighbors make execute laws general good anything essential anything mere ornament decoration added robes diamonds authority amiable respectable descends accidents institutions established remote antiquity springs fresh hearts judgments honest enlightened represented power majesty reflected good every legitimate whatever form may appear existence length time full proof general dissemination knowledge virtue throughout whole body object consideration pleasing presented human mind national pride ever justifiable excusable springs power riches grandeur glory conviction national innocence information benevolence midst pleasing ideas unfaithful ever lose sight danger liberties anything partial extraneous infect purity free fair virtuous independent elections election determined majority single vote procured party artifice corruption may choice party ends national good solitary suffrage obtained foreign nations flattery menaces fraud violence terror intrigue venality may choice foreign nations may foreign nations govern govern candid men acknowledge cases choice would little advantage boast lot chance amiable interesting system abuses may exposed 
exhibited admiration anxiety wise virtuous nations eight years administration citizen long course great actions regulated prudence justice temperance fortitude conducting inspired virtues animated ardent patriotism love liberty independence peace increasing wealth unexampled prosperity merited gratitude commanded highest praises foreign nations secured immortal glory posterity retirement voluntary choice may long live enjoy delicious recollection services gratitude mankind happy fruits world daily increasing splendid prospect future fortunes country opening year year name may still rampart knowledge lives bulwark open secret enemies country peace example recommended imitation successors houses congress voice legislatures throughout subject might become better silent speak diffidence something may expected occasion hope admitted apology venture say preference upon principle free republican formed upon long serious reflection diligent impartial inquiry truth attachment constitution conscientious determination support shall altered judgments wishes expressed mode prescribed respectful attention constitutions individual constant caution delicacy toward state governments equal impartial regard rights interest honor happiness union without preference regard northern southern eastern western position various political opinions unessential points personal attachments love virtuous men parties denominations love science letters wish patronize every rational effort encourage schools colleges universities academies every institution propagating knowledge virtue religion among classes benign influence happiness life stages classes society forms means preserving constitution natural enemies spirit sophistry spirit party spirit intrigue profligacy corruption pestilence foreign influence angel destruction elective governments love equal laws justice humanity interior administration inclination improve agriculture commerce manufacturers necessity convenience defense spirit equity 
humanity toward aboriginal nations disposition meliorate condition inclining friendly citizens friendly inflexible determination maintain peace inviolable faith nations system neutrality impartiality among belligerent powers europe adopted solemnly sanctioned houses congress applauded legislatures public opinion shall otherwise ordained congress personal esteem french formed residence seven years chiefly among sincere desire preserve friendship much honor interest nations conscious honor integrity internal sentiment power energies must preserved earnest endeavor investigate every cause remove every colorable pretense complaint intention pursue amicable negotiation reparation injuries committed commerce whatever success obtained lay facts legislature may consider measures honor interest constituents demand resolution justice far may depend upon times nations maintain peace friendship benevolence world unshaken confidence honor spirit resources often hazarded never deceived elevated ideas high destinies country duties toward founded knowledge moral principles intellectual improvements deeply engraven mind early life obscured exalted experience age humble reverence feel duty add veneration religion profess call christians fixed resolution consider decent respect christianity among best recommendations public service enable degree comply wishes shall strenuous endeavor sagacious injunction two houses shall without effect great example sense spirit faith honor duty interest pledged support constitution entertain doubt continuance energy mind prepared without hesitation lay solemn obligations support utmost power may supreme patron order fountain justice protector ages world virtuous liberty continue blessing upon give possible success duration consistent ends providence



## 5) Quick Exploration
A glance at token counts and frequent words gives intuition before modeling.


In [None]:

# Document lengths
df['n_tokens'] = df['tokens'].apply(len)
ax = df.plot(x='year', y='n_tokens', kind='bar', figsize=(12,4), legend=False)
ax.set_ylabel("Tokens per speech (after cleaning)")
ax.set_xlabel("Inaugural year (chronological order)")
ax.set_title("Document lengths")
plt.show()

# Global top terms (sanity check)
all_terms = [t for toks in df['tokens'] for t in toks]
top20 = Counter(all_terms).most_common(20)
pd.DataFrame(top20, columns=['term','count'])



## 5.5) Keyword Frequency (NLTK `FreqDist`) → Why TF–IDF?
`FreqDist` tallies how often each token occurs in the list it is given; here we pass it every token from the **entire corpus**. High counts often reflect words that are common in every speech, which makes them loud but not necessarily distinctive.

**Idea:** Use FreqDist to see the *loudest* words, then use TF–IDF to see the *most distinctive per document*.
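
To make the "loud vs. distinctive" contrast concrete, here is a minimal sketch that compares corpus frequency with document frequency. The three token lists are hypothetical stand-ins for `df['tokens']`, and plain `collections.Counter` is used in place of `FreqDist` so the example runs without NLTK.

```python
from collections import Counter

# Toy stand-in for df['tokens']: three hypothetical, already-cleaned speeches.
docs = [
    ["government", "people", "union", "people"],
    ["government", "people", "war", "war", "war"],
    ["government", "people", "economy"],
]

# Corpus frequency: total occurrences across all speeches (what FreqDist reports).
corpus_freq = Counter(tok for doc in docs for tok in doc)

# Document frequency: number of speeches that contain the term at least once.
doc_freq = Counter(tok for doc in docs for tok in set(doc))

for term, count in corpus_freq.most_common():
    print(f"{term:12s} corpus_count={count}  doc_count={doc_freq[term]}")

# Terms that appear in every speech are loud but cannot tell speeches apart.
everywhere = sorted(t for t, n in doc_freq.items() if n == len(docs))
print("In every speech:", everywhere)
```

In this toy data "war" and "government" are equally loud (three occurrences each), but "war" is concentrated in a single speech. That concentration is the kind of signal TF–IDF is designed to reward.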


In [None]:

all_tokens = [t for toks in df['tokens'] for t in toks]
fdist = FreqDist(all_tokens)

print("Top 20 most frequent words across all speeches:\n")
for word, freq in fdist.most_common(20):
    print(f"{word:15s} {freq}")

# Visualize (optional)
plt.figure(figsize=(10,4))
fdist.plot(20, cumulative=False)
plt.title("Most Frequent Words in Inaugural Speeches (Cleaned)")
plt.show()



## 6) TF–IDF with scikit-learn
**TF–IDF** = term frequency × inverse document frequency  
- Highlights terms that are frequent **and** specific to a document.  
- Downweights terms that appear in many documents.
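
The weighting can be worked through by hand. The sketch below uses three hypothetical documents and, to my understanding, follows the formula scikit-learn documents for its defaults (`smooth_idf=True`, `norm='l2'`): `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, followed by L2 normalization of each document vector. It applies no `min_df` filter, and the helper names `idf` and `tfidf_vector` are illustrative only.

```python
import math
from collections import Counter

# Hypothetical cleaned documents (whitespace-separated tokens).
docs = [
    "government people union people",
    "government people war war war",
    "government people economy",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term occur?
df_counts = Counter(t for toks in tokenized for t in set(toks))

def idf(term):
    # Smoothed inverse document frequency (assumed to mirror smooth_idf=True).
    return math.log((1 + n_docs) / (1 + df_counts[term])) + 1

def tfidf_vector(tokens):
    tf = Counter(tokens)                                  # raw term counts
    raw = {t: tf[t] * idf(t) for t in tf}                 # tf x idf
    norm = math.sqrt(sum(w * w for w in raw.values()))    # L2 norm of the vector
    return {t: w / norm for t, w in raw.items()}

vec = tfidf_vector(tokenized[1])
for term, weight in sorted(vec.items(), key=lambda kv: -kv[1]):
    print(f"{term:12s} {weight:.3f}")
```

A term that occurs in every document gets an IDF of exactly 1, the minimum under this formula, so only its within-document count can lift it. A term concentrated in one document gets both a high count and a high IDF.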


In [None]:

# Convert the collection of speeches into a matrix where:
#   each row    = one document (a speech)
#   each column = one term (a word)
#   each cell   = the TF-IDF weight of that term in that document
# min_df=2 -> ignore words that appear in fewer than 2 documents
tfidf = TfidfVectorizer(min_df=2)

# Feed the cleaned text (the text_clean column) into the vectorizer.
# fit_transform() does two steps in one call:
#   .fit()       learns the vocabulary and the IDF (inverse document frequency) weights
#   .transform() applies the TF-IDF weighting to each document
# It returns a sparse matrix X_tfidf of shape (n_documents, n_terms).
X_tfidf = tfidf.fit_transform(df['text_clean'])

# Retrieve the vocabulary the vectorizer kept, as a NumPy array,
# so it can be indexed with sorted positions when finding the top terms per speech.
terms = np.array(tfidf.get_feature_names_out())

# Displays ((n_documents, n_terms), n_terms)
X_tfidf.shape, len(terms)


In [None]:

def top_tfidf_terms_for_doc(doc_idx, top_n=12):
    """Return the top_n (term, score) pairs with the highest TF-IDF weight in one speech."""
    # Dense vector of TF-IDF scores for this speech; position j corresponds to terms[j].
    row = X_tfidf.getrow(doc_idx).toarray().ravel()
    # Positions of the highest-scoring (most distinctive) terms, best first.
    top_idx = row.argsort()[::-1][:top_n]
    # Pair each term string with its score.
    return list(zip(terms[top_idx], row[top_idx]))

# Show three speeches: the first (0), the middle one (len(df)//2), and the last (len(df)-1).
# df.loc[i, ...] relies on df having the default 0..n-1 index.
for i in [0, len(df)//2, len(df)-1]:
    print(f"\n=== {df.loc[i, 'year']} - {df.loc[i, 'president']} ===")  # header for this speech
    for term, score in top_tfidf_terms_for_doc(i, top_n=12):
        print(f"{term:15s} {score:.3f}")



### TF–IDF Similarity (Cosine)
Find which speeches are lexically similar using cosine similarity on TF‚ÄìIDF vectors.


In [None]:

sim = cosine_similarity(X_tfidf) #Computes a cosine similarity matrix for all speeches
target = len(df) - 1  # Chooses the most recent speech (the last row in your DataFrame) as the target document.
pairs = [(i, sim[target, i]) for i in range(len(df)) if i != target] #Builds a list of tuples for every other speech, skips the target one
pairs_sorted = sorted(pairs, key=lambda x: x[1], reverse=True)[:5] #Sorts the list of (index, similarity) pairs by similarity score in descending order

print(f"Most similar to {df.loc[target,'year']} - {df.loc[target,'president']}:")
for idx, score in pairs_sorted: #Iterates over the top-5 most similar speeches
    print(f"  {df.loc[idx,'year']} - {df.loc[idx,'president']:12s}  (cosine={score:.3f})")

# Cosine similarity treats each speech as a high-dimensional vector (words as axes).
# The smaller the angle between two vectors, the more similar their language use, even if the speeches differ in length.
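
The length-invariance claim is easy to check by hand. This is a minimal pure-Python sketch with made-up count vectors over a hypothetical three-word vocabulary; the `cosine` helper is illustrative and is not the scikit-learn function used above.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical term-count vectors over the vocabulary [war, peace, economy].
short_speech = [2, 1, 0]
long_speech = [20, 10, 0]    # same proportions, ten times longer
other_speech = [0, 1, 4]     # different emphasis

print(round(cosine(short_speech, long_speech), 3))
print(round(cosine(short_speech, other_speech), 3))
```

The first pair points in the same direction, so its cosine is 1 despite the tenfold difference in length. The second pair shares only one term and scores much lower.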


In [None]:

# 2D projection (the corpus is small, so converting the sparse matrix to dense is fine)
pca = PCA(n_components=2, random_state=42)
coords = pca.fit_transform(X_tfidf.toarray())
fig, ax = plt.subplots(figsize=(6,6))
ax.scatter(coords[:,0], coords[:,1])
# enumerate() yields positions, which is what coords needs even if df's index is not 0..n-1
for pos, year in enumerate(df['year']):
    ax.annotate(str(year), (coords[pos,0], coords[pos,1]), fontsize=8)
ax.set_title("Speeches in TF-IDF space (PCA projection)")
ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
plt.show()


🎨 How to interpret the chart

1. Each point is a speech, represented by its overall word-usage pattern.
   * Two speeches close together use similar vocabularies (in terms of TF–IDF weighting).
   * Speeches far apart use distinct vocabularies, reflecting different priorities, tone, or historical context.

2. The axes (PC1 and PC2) are abstract; they don't correspond to literal variables.
   * PCA components are linear combinations of all TF–IDF features (words).
   * You can't say "the x-axis means optimism vs war," but you can say: "Along PC1, speeches separate based on major vocabulary differences, perhaps early vs modern language."
   * You can interpret the axes qualitatively by checking which speeches cluster together.

3. Look for clusters, trends, or outliers.
   * Clusters of speeches from nearby years suggest continuity in rhetoric or themes.
   * Isolated points are outlier speeches (perhaps unusually short, poetic, or issue-focused).
   * You might see a chronological gradient, with the early 1800s on one side and the 2000s on the other, showing the evolution of presidential language.

| Concept   | Interpretation                                           |
| --------- | -------------------------------------------------------- |
| Distance  | How different two speeches' vocabularies are             |
| Clusters  | Shared themes, era, or rhetorical style                  |
| Outliers  | Unique speeches that break linguistic patterns           |
| PC1 / PC2 | Major axes of variation in word use, not literal topics  |
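
One way to give an abstract axis some meaning is to look at which words carry the largest loadings on it. The sketch below computes PCA through NumPy's SVD on a small made-up matrix, as a stand-in for `X_tfidf.toarray()` and `terms`; with the notebook's fitted `PCA` object, the equivalent loadings should be available in `pca.components_`. All numbers are hypothetical.

```python
import numpy as np

# Hypothetical stand-ins for X_tfidf.toarray() and terms: 4 speeches x 5 terms.
terms = np.array(["liberty", "union", "economy", "jobs", "freedom"])
X = np.array([
    [0.7, 0.6, 0.0, 0.0, 0.4],
    [0.6, 0.7, 0.1, 0.0, 0.3],
    [0.1, 0.0, 0.7, 0.6, 0.3],
    [0.0, 0.1, 0.6, 0.7, 0.4],
])

# PCA via SVD of the mean-centred matrix; rows of Vt are the component directions.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
coords = U[:, :2] * S[:2]                # 2D coordinates of each speech
explained = (S ** 2) / (S ** 2).sum()    # share of variance per component

pc1 = Vt[0]                              # loading of every term on PC1
order = np.argsort(pc1)
print("PC1 share of variance:", round(float(explained[0]), 3))
print("One end of PC1:  ", terms[order[:2]].tolist())
print("Other end of PC1:", terms[order[-2:]].tolist())
```

The sign of a component is arbitrary, so which group lands on the positive side can flip between runs or libraries; what matters is which words sit at opposite ends. The share of variance is also a reminder that a 2D plot shows only part of the differences between speeches.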


In [None]:
df[['year','president']].assign(PC1=coords[:,0], PC2=coords[:,1]).sort_values('PC1').head()


## 🎨 Visualizing TF–IDF: Word Cloud & Temporal Trend

Now that we've mapped speeches in abstract "TF–IDF space,"  
let's explore two other ways to *see* what TF–IDF tells us.

1. **Word Cloud** – visually emphasizes the distinctive words in one speech.  
   - Larger words = higher TF–IDF scores.  
   - Great for quick, qualitative insight into what stands out.

2. **Temporal Line Chart** – tracks how the importance of a given term changes over time.  
   - Example: does *"freedom"* rise or fall in salience across U.S. history?


In [None]:
# --- 1) Word Cloud for a Selected Speech ---
from wordcloud import WordCloud

# Pick a speech by position (0 = earliest, len(df) - 1 = latest)
doc_idx = len(df) - 1  # last speech by default

# Build a {term: score} dictionary from the top TF-IDF terms
wc_data = dict(top_tfidf_terms_for_doc(doc_idx, top_n=100))

# Create and display the word cloud
wc = WordCloud(width=800, height=400, background_color='white')
wc.generate_from_frequencies(wc_data)

plt.figure(figsize=(10,5))
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.title(f"Word Cloud: {df.loc[doc_idx,'year']} – {df.loc[doc_idx,'president']}", fontsize=14)
plt.show()



In [None]:
# --- 2) Temporal Line Chart of a Word's TF–IDF Weight ---
# Choose a term to track over time
term = "freedom"  # try swapping to 'war', 'peace', 'america', etc.

if term in terms:
    term_idx = np.where(terms == term)[0][0]
    df[f"tfidf_{term}"] = X_tfidf[:, term_idx].toarray().ravel()

    plt.figure(figsize=(8,4))
    plt.plot(df['year'], df[f"tfidf_{term}"], marker='o', linestyle='-')
    plt.title(f'TF–IDF Weight of "{term}" Over Time', fontsize=14)
    plt.xlabel("Year of Inaugural Address")
    plt.ylabel("TF‚ÄìIDF Score")
    plt.grid(alpha=0.3)
    plt.show()
else:
    print(f'Term "{term}" not found in vocabulary. Try another word.')


> ✅ **What TF–IDF tells us**: which words uniquely characterize each speech; which speeches use similar vocabularies.  
> ❌ **What it doesn't**: explicitly uncover *themes* shared across documents.



## 7) Topic Modeling with scikit-learn's LDA
**Latent Dirichlet Allocation (LDA)** models each document as a mixture of **topics**, where each topic is a probability distribution over words.  
We'll build a bag-of-words matrix, fit an LDA model, inspect topics, and examine per-speech topic mixtures.
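
The "mixture of topics" idea reduces to one line of arithmetic: the probability of a word in a document is the topic-weighted average of that word's probability under each topic. The sketch below uses hand-picked, hypothetical numbers for two topics and one speech; a fitted model estimates these quantities from the data instead.

```python
# Hypothetical numbers for illustration only; a fitted model estimates these from data.
vocab = ["war", "peace", "economy", "jobs"]

# phi[k][w]: probability of word w under topic k (each row sums to 1).
phi = [
    [0.50, 0.40, 0.05, 0.05],   # a "conflict" topic
    [0.05, 0.05, 0.50, 0.40],   # an "economy" topic
]

# theta[k]: share of topic k in one speech (sums to 1).
theta = [0.75, 0.25]

# Word distribution implied for this speech: p(w | d) = sum_k theta[k] * phi[k][w]
p_word = [
    sum(theta[k] * phi[k][w] for k in range(len(theta)))
    for w in range(len(vocab))
]

for word, p in zip(vocab, p_word):
    print(f"{word:8s} {p:.4f}")
```

Because `theta` leans toward the first topic, "war" and "peace" dominate the implied word distribution, while the economy words still receive some probability from the second topic.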


In [None]:

# Bag-of-words for LDA
# CountVectorizer converts each document into raw word counts (not TF-IDF weights).
# min_df=2 ignores words that appear in fewer than 2 speeches (to reduce noise).
# fit_transform() builds the vocabulary and returns a document-term matrix:
#   rows = speeches, columns = unique words, values = how many times each word appears.
# vocab holds the list of all words (for displaying topic terms later).
cv = CountVectorizer(min_df=2)
X_counts = cv.fit_transform(df['text_clean'])
vocab = np.array(cv.get_feature_names_out())

# Train LDA
# K is the number of topics the model should find.
# It is not learned automatically; it is a parameter you choose.
# Try experimenting with different values:
#   K=5 -> broader, more general themes (e.g., "war," "economy," "unity").
#   K=8 -> more nuanced topics (e.g., "foreign policy," "domestic economy," "liberty").
K = 8  # adjust live (5, 8, 12)

# Initialize the LDA model from scikit-learn.
#   n_components=K          -> how many topics (components) to find.
#   learning_method="batch" -> every update uses the entire dataset.
#       (Alternative: "online" updates incrementally on mini-batches; "batch" is stable for small corpora like this.)
#   random_state=42         -> fixes the random initialization so reruns give the same topics
#       (given the same data and library versions).
#   max_iter=20             -> maximum number of passes over the data; more passes can refine topics but train more slowly.
lda = LatentDirichletAllocation(
    n_components=K,
    learning_method="batch",
    random_state=42,
    max_iter=20
)

# Fit the model and, in the same call, transform the data into per-document topic proportions.
topic_mix = lda.fit_transform(X_counts)  # theta: (n_docs, K)

def show_topics(model, vocab, topn=12):
    """Print the topn highest-weighted words for each topic.

    model: the fitted LDA model (lda).
    vocab: array of all words in the vocabulary (from CountVectorizer).
    topn:  how many top words to display per topic (default 12).
    """
    # model.components_ is a 2D NumPy array of shape (n_topics, n_words):
    # each row is a topic, each column a vocabulary word, and each value is the
    # (unnormalized) weight of that word within the topic.
    for k, comp in enumerate(model.components_):
        # argsort() returns the indices that sort the weights in ascending order;
        # [::-1] reverses to descending, and [:topn] keeps the topn highest.
        top_idx = comp.argsort()[::-1][:topn]
        print(f"\nTopic {k}: " + ", ".join(vocab[top_idx]))

show_topics(lda, vocab, topn=12)

# Assemble per-document topic proportions
topic_df = pd.DataFrame(topic_mix, columns=[f"topic_{k}" for k in range(K)])
result_df = pd.concat([df[['fileid','year','president']], topic_df], axis=1)
result_df.head(5)
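
Since K has to be chosen by hand, it can help to fit several values in a loop and compare them side by side. The sketch below does this on a tiny made-up corpus and records the training perplexity from scikit-learn's `perplexity()` method. Perplexity is an addition here rather than part of the notebook's workflow, and it is only a rough numeric aid: it measures fit to the counts, not how interpretable the topics are.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus standing in for df['text_clean'].
texts = [
    "war peace nations war treaty",
    "peace treaty nations alliance",
    "economy jobs growth economy wages",
    "jobs wages growth industry",
    "war economy nations jobs",
    "peace growth alliance industry",
]
X = CountVectorizer().fit_transform(texts)

perplexity_by_k = {}
for k in (2, 3, 4):
    model = LatentDirichletAllocation(
        n_components=k, learning_method="batch", random_state=42, max_iter=20
    )
    theta = model.fit_transform(X)   # per-document topic proportions, shape (n_docs, k)
    # Lower perplexity means a closer fit to these counts; it says nothing about readability.
    perplexity_by_k[k] = model.perplexity(X)
    print(f"K={k}: perplexity={perplexity_by_k[k]:.1f}, theta shape={theta.shape}")
```

In practice the printed top words for each K remain the main evidence: a K whose topics you can label in two or three words is usually the more useful choice, even if another K scores slightly better.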


### Map the topics across a heatmap

🧠 How to Interpret the LDA Topic Heatmap

Each cell of the heatmap represents the proportion of a given topic within a specific speech.
Color encodes how strongly that topic appears. With the "Plasma" colorscale used in the plotting code, bright yellow cells mark a high proportion and dark purple cells mark a weak presence.

In [None]:
!pip -q install plotly

import numpy as np
import plotly.graph_objects as go


In [None]:
def topic_top_words(lda_model, vocab, topn=10):
    """Return:
       - topic_labels: list like ["T0: economy, growth, jobs", ...]
       - topic_words:  list of lists of the topn words per topic (for hover)
    """
    labels = []
    words_list = []
    for k, comp in enumerate(lda_model.components_):
        top_idx = comp.argsort()[::-1][:topn]
        words = vocab[top_idx].tolist()
        words_list.append(words)
        label = f"T{k}: " + ", ".join(words[:6])  # concise label for axis/hover
        labels.append(label)
    return labels, words_list

topic_labels, topic_words = topic_top_words(lda, vocab, topn=12)

# Columns in result_df that are topic proportions
topic_cols = [c for c in result_df.columns if c.startswith("topic_")]


In [None]:
# Pick rows you want to compare
rows = [0, len(result_df)//2, len(result_df)-1]  # [first speech, middle speech, last speech]
df_sel = result_df.iloc[rows].copy()

# Convert the selected speeches' topic proportions into a NumPy matrix (Z) for plotting,
# create x-axis labels (x) showing topic numbers (like "T0", "T1", ...), and build y-axis labels (y)
# combining each speech's year and president name for the heatmap.
Z = df_sel[topic_cols].to_numpy()
x = [f"T{int(c.split('_')[-1])}" for c in topic_cols]
y = [f"{r.year} - {r.president}" for _, r in df_sel.iterrows()]

# Build hovertext matrix: one string per cell
hovertext = []
for r_i, r in df_sel.iterrows():
    row_texts = []
    for t_i, col in enumerate(topic_cols):
        k = int(col.split('_')[-1])
        row_texts.append(
            f"<b>{int(r.year)} - {r.president}</b><br>"
            f"<b>Topic {k}</b><br>"
            f"Top words: {', '.join(topic_words[k][:10])}<br>"
            f"Proportion: {r[col]:.3f}"
        )
    hovertext.append(row_texts)

fig = go.Figure(
    data=go.Heatmap(
        z=Z,
        x=x,
        y=y,
        colorscale="Plasma",
        zmin=0.0, zmax=1.0,
        hoverinfo="text",
        text=hovertext
    )
)
fig.update_layout(
    title="Topic mixture (theta): selected speeches",
    xaxis_title="Topic",
    yaxis_title="Speech",
    height=300 + 40*len(rows),
    margin=dict(l=80, r=20, t=60, b=60)
)
fig.show()


In [None]:
# Ensure chronological order
df_sorted = result_df.sort_values("year").reset_index(drop=True)

# Reorder topics by global prevalence (more interpretable)
mean_by_topic = df_sorted[topic_cols].mean(axis=0).to_numpy()
order = np.argsort(mean_by_topic)[::-1]
ordered_cols = [topic_cols[i] for i in order]
ordered_x = [f"T{int(c.split('_')[-1])}" for c in ordered_cols]

A = df_sorted[ordered_cols].to_numpy()
y_all = df_sorted["year"].astype(str) + " - " + df_sorted["president"]

# Hovertext matrix for all speeches
hovertext_all = []
for r_i, r in df_sorted.iterrows():
    row_texts = []
    for c in ordered_cols:
        k = int(c.split('_')[-1])
        row_texts.append(
            f"<b>{int(r['year'])} - {r['president']}</b><br>"
            f"<b>Topic {k}</b><br>"
            f"Top words: {', '.join(topic_words[k][:10])}<br>"
            f"Proportion: {r[c]:.3f}"
        )
    hovertext_all.append(row_texts)

fig_all = go.Figure(
    data=go.Heatmap(
        z=A,
        x=ordered_x,
        y=y_all,
        colorscale="Plasma",
        zmin=0.0, zmax=1.0,
        hoverinfo="text",
        text=hovertext_all
    )
)
fig_all.update_layout(
    title="All speeches: topic mixture heatmap (topics ordered by prevalence)",
    xaxis_title="Topic",
    yaxis_title="Speech (year - president)",
    height=max(450, 14*len(df_sorted)),
    margin=dict(l=120, r=20, t=60, b=80)
)
fig_all.show()



## 8) TF–IDF vs LDA: Compare & Contrast
| Aspect | TF–IDF | LDA (Topics) |
|---|---|---|
| Unit | Terms per document | Topics (word distributions); documents are mixtures of topics |
| Great for | Keywording, distinctiveness, similarity | Thematic mapping across corpus |
| Limitations | No explicit themes | Needs K tuning; topics can blend/split |


## 9) 🎯 Student-Driven Policy Exploration (≈60 minutes)

Work in pairs. Your mission: **choose a policy area**, **build a small text corpus**, and **experiment** with TF–IDF and topic modeling to discover what language patterns define that space.

This is not a graded deliverable; it's a sandbox for exploration, pattern-finding, and discussion.

---

### 🧭 Part A: Choose a Policy Area
Pick an issue you care about. Examples:

- Climate policy & sustainability  
- Immigration & border security  
- Health care & public health  
- Economic growth & inequality  
- Civil rights & social justice  
- Foreign policy & diplomacy  

Then brainstorm: *Whose language represents this issue?*  
(e.g., presidents, UN leaders, legislators, NGOs, media outlets).

---

### 📚 Part B: Build Your Corpus

You'll need roughly **10–20 short to medium speeches or statements**.

**Option 1 – Use existing open archives**
- U.S. presidential speeches: [American Presidency Project](https://www.presidency.ucsb.edu/speeches)
- UN General Assembly statements: [UN Digital Library](https://digitallibrary.un.org/)
- UK or EU parliament debates: [Hansard](https://hansard.parliament.uk/), [Europarl](https://www.europarl.europa.eu/)
- NGO or think-tank reports: World Bank, IMF, WHO, Brookings, RAND, etc.

**Option 2 – Scrape or collect your own (advanced)**
- Use `requests` + `BeautifulSoup` or a library such as `newspaper3k` to extract text.
- Or copy/paste short excerpts into `.txt` files and upload them to Colab.

📎 *Hint:* keep your text clean by removing headers, speaker names, and references.

See the scraping script at the end of this section (Steps 1–4) to get you started.

---

### 🧩 Part C: Explore Frequency, TF–IDF, and Topics

1. **Frequency snapshot:**  
   Compute the top 15 most common words (`FreqDist`). Which ones are generic boilerplate (e.g., "people," "government")?

2. **Distinctiveness check:**  
   Run **TF–IDF** with `min_df=2` or `min_df=5`.  
   - Which words rise to the top?  
   - What do they reveal about your policy domain's unique framing?

3. **Similarity sleuthing:**  
   Using cosine similarity on TF–IDF vectors, find which two documents are most similar.  
   What links them: era, country, tone?

4. **Topic discovery:**  
   Train an **LDA model** (try `K=5`, `K=8`, `K=12`).  
   - Label each topic in 2–3 words.  
   - Which `K` feels most interpretable?  
   - Do your topics align with known sub-issues (e.g., "energy transition," "human rights," "trade policy")?

5. **Visualize:**  
   Create a PCA scatter plot or a topic heatmap of your documents.  
   - What clusters appear?  
   - Does time, geography, or institution explain them?
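
For the similarity-sleuthing step, the main trick is to ignore the diagonal (every document is perfectly similar to itself) and to count each pair only once. A minimal sketch, assuming a small hand-written matrix in place of the output of `cosine_similarity(X_tfidf)`; the labels and values are made up:

```python
from itertools import combinations

# Hypothetical 4x4 cosine-similarity matrix, standing in for cosine_similarity(X_tfidf).
labels = ["doc_a", "doc_b", "doc_c", "doc_d"]
sim = [
    [1.00, 0.12, 0.55, 0.08],
    [0.12, 1.00, 0.20, 0.71],
    [0.55, 0.20, 1.00, 0.15],
    [0.08, 0.71, 0.15, 1.00],
]

# Compare each unordered pair once (i < j), which also skips the diagonal of 1.0s.
best_i, best_j = max(
    combinations(range(len(sim)), 2),
    key=lambda ij: sim[ij[0]][ij[1]],
)
print(f"Most similar pair: {labels[best_i]} and {labels[best_j]} "
      f"(cosine={sim[best_i][best_j]:.2f})")
```

The same indexing works on a NumPy similarity matrix, since `sim[i][j]` is valid for both nested lists and 2D arrays.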

---

### 🕵️ Part D: Mini Scavenger Hunt Prompts

- **"Word Detective"**: Which words define your corpus when using TF–IDF vs raw frequency?  
- **"Similarity Sleuth"**: Which two documents look similar numerically but differ substantively?  
- **"Topic Whisperer"**: Choose one topic from your LDA output. Find two speeches that heavily feature it (proportion > 0.3). What do they share?  
- **"Era Shift"**: Does any topic fade or grow over time? What might explain it?  
- **"Headline Writer"**: Summarize one document twice, once using TF–IDF terms and once using its dominant LDA topic. How do the headlines differ in tone?
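
The "Topic Whisperer" prompt comes down to a threshold filter on one column of the topic-mixture matrix. The sketch below uses made-up proportions and neutral labels in plain Python; with the notebook's `result_df`, the equivalent filter would be `result_df[result_df['topic_1'] > 0.3]`.

```python
# Hypothetical per-document topic proportions (each row sums to 1), standing in for topic_mix.
speeches = ["speech_A", "speech_B", "speech_C", "speech_D"]
theta = [
    [0.70, 0.20, 0.10],
    [0.15, 0.75, 0.10],
    [0.45, 0.10, 0.45],
    [0.05, 0.60, 0.35],
]

topic_k = 1        # topic to investigate
threshold = 0.3    # "heavily features" cutoff from the prompt

heavy = [
    (name, row[topic_k])
    for name, row in zip(speeches, theta)
    if row[topic_k] > threshold
]
# Strongest first, so the top two are natural candidates for close reading.
heavy.sort(key=lambda pair: pair[1], reverse=True)
for name, share in heavy:
    print(f"{name}: topic_{topic_k} = {share:.2f}")
```

The filter only tells you where to look. Reading the two selected documents side by side is what answers the "what do they share?" part of the prompt.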

---

### 🧠 Part E: Policy Reflection (Discussion, not submission)

Compare what each method tells you:

| Method | Reveals | Best for |
|---------|----------|----------|
| **TF–IDF** | Distinctive vocabulary per document | Comparing actors or countries |
| **LDA (Topic Modeling)** | Underlying shared themes | Tracking issue clusters and framing evolution |

> In your discussion:  
> - What language dominates your policy area?  
> - Whose framing or rhetoric stands out?  
> - How might these tools support evidence-based policy analysis?

---

✅ **Outcome:** You should be able to *talk through* what you learned,
not produce a written report. Your goal is pattern recognition, curiosity, and connecting computational text analysis to real policy discourse.


In [None]:
# =======================================================
# 🧭 STEP 1: Mount Google Drive
# =======================================================
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# =======================================================
# 🗂 STEP 2: Create a folder in Google Drive for text corpus
# =======================================================
import os

# Customize the folder name (each student can use their initials or topic)
folder_name = "policy_corpus"
drive_path = "/content/drive/MyDrive"
corpus_dir = os.path.join(drive_path, folder_name)

os.makedirs(corpus_dir, exist_ok=True)
print(f"✅ Folder ready: {corpus_dir}")


In [None]:
# =======================================================
# 📰 STEP 3: Scrape Articles with newspaper3k and Save as .txt
# =======================================================

# trafilatura is the second-choice extractor that extract_article() falls back to
!pip install newspaper3k lxml_html_clean trafilatura --quiet

# Import after successful install
import time, os, requests
from newspaper import Article, Config

HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/123.0.0.0 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
}

cfg = Config()
cfg.browser_user_agent = HEADERS["User-Agent"]
cfg.request_timeout = 20
cfg.memoize_articles = False

def extract_with_newspaper(url: str) -> str:
    """Try Newspaper with a browser-like User-Agent; if download() fails (e.g., HTTP 403), fetch with requests and pass the HTML to set_html()."""
    art = Article(url, config=cfg)
    try:
        art.download()              # may fail with 403 on some sites
        art.parse()
        return art.text.strip()
    except Exception:
        # Fallback: fetch with requests using browser-like headers, then feed the raw HTML to Newspaper
        r = requests.get(url, headers=HEADERS, timeout=30)
        r.raise_for_status()        # raises requests.HTTPError on a 4xx/5xx response
        art = Article(url, config=cfg)
        art.set_html(r.text)
        art.parse()
        return art.text.strip()



import pathlib
# Reuse the folder created in Step 2 so a customized folder_name is respected
SAVE_DIR = pathlib.Path(corpus_dir)
SAVE_DIR.mkdir(parents=True, exist_ok=True)

def safe_filename(title: str, i: int) -> str:
    base = "".join(c for c in title if c.isalnum() or c in (" ","_")).strip().replace(" ","_")
    if not base: base = f"article_{i}"
    return f"{i:02d}_{base[:60]}.txt"

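def extract_with_trafilatura(url: str) -> str:
    """Second-choice extractor; requires the trafilatura package (pip install trafilatura)."""
    import trafilatura  # imported lazily so a missing package surfaces as a normal extractor failure
    downloaded = trafilatura.fetch_url(url)
    text = trafilatura.extract(downloaded) if downloaded else None
    if not text:
        raise ValueError("trafilatura returned no text")
    return text.strip()

def extract_article(url: str, i: int) -> tuple[str, str]:
    """Return (text, extractor_name), trying Newspaper first and Trafilatura second."""
    try:
        text = extract_with_newspaper(url)
        source = "newspaper3k"
    except Exception as e1:
        try:
            text = extract_with_trafilatura(url)
            source = "trafilatura"
        except Exception as e2:
            raise RuntimeError(f"Both extractors failed.\nNewspaper err: {e1}\nTrafilatura err: {e2}")
    return text, source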

def save_article(urls):
    import datetime
    for i, url in enumerate(urls, start=1):
        try:
            text, source = extract_article(url, i)
            # Use the last non-empty path segment (the article slug) as a filename hint
            title_hint = url.rstrip("/").split("/")[-1] or "article"
            fname = safe_filename(title_hint, i)
            fpath = SAVE_DIR / fname
            with open(fpath, "w", encoding="utf-8") as f:
                f.write(f"URL: {url}\n")
                f.write(f"SourceExtractor: {source}\n")
                # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated since Python 3.12)
                f.write(f"SavedAtUTC: {datetime.datetime.now(datetime.timezone.utc).isoformat()}\n\n")
                f.write(text)
            print(f"✅ Saved ({source}): {fname}")
            time.sleep(1.0)  # be polite
        except Exception as e:
            print(f"⚠️ Skipped {url}: {e}")

# EXAMPLE URLS (swap in your policy-area links)
urls = [
  "https://www.un.org/sg/en/content/sg/statements/2025-11-08/secretary-generals-message-the-20th-conference-of-youth-climate-change",
    "https://www.un.org/sg/en/content/sg/statements/2025-11-07/secretary-generals-remarks-the-belem-climate-summit-energy-transition-roundtable-delivered",
    "https://www.un.org/sg/en/content/sg/statements/2025-11-06/secretary-generals-remarks-the-launch-of-the-tropical-forest-forever-facility-delivered",
]
save_article(urls)
print("Folder:", SAVE_DIR)


In [None]:
# =======================================================
# 🧾 STEP 4: Verify Saved Files
# =======================================================
import glob

files = sorted(glob.glob(os.path.join(corpus_dir, "*.txt")))
print(f"Found {len(files)} text files in Drive.")
for f in files:
    print("-", os.path.basename(f))
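
Once the files are saved, they still have to be read back into Python before any cleaning or modeling. The sketch below is one possible loader, not part of the original workflow. It assumes the layout written by `save_article()` (metadata lines starting with `URL:`, one blank line, then the body) and also accepts hand-pasted files with no header. The function name `load_corpus` and the demo files are hypothetical, and the demo uses a temporary folder so it runs without Google Drive.

```python
import pathlib
import tempfile

def load_corpus(folder):
    """Read every .txt file in folder and return a list of {'file', 'text'} dicts."""
    records = []
    for path in sorted(pathlib.Path(folder).glob("*.txt")):
        raw = path.read_text(encoding="utf-8")
        # Split once at the first blank line; everything after it is the body.
        header, sep, body = raw.partition("\n\n")
        if not (sep and header.startswith("URL:")):
            body = raw  # no metadata header (e.g., a hand-pasted file): keep the whole text
        records.append({"file": path.name, "text": body.strip()})
    return records

# Demo on a temporary folder with made-up files.
with tempfile.TemporaryDirectory() as tmp:
    tmp_path = pathlib.Path(tmp)
    (tmp_path / "01_scraped.txt").write_text(
        "URL: https://example.org/a\nSourceExtractor: newspaper3k\n"
        "SavedAtUTC: 2025-01-01T00:00:00Z\n\n"
        "Climate finance must scale up.\n\nAdaptation cannot wait.",
        encoding="utf-8",
    )
    (tmp_path / "02_pasted.txt").write_text("A hand-pasted statement.", encoding="utf-8")
    corpus = load_corpus(tmp_path)

for rec in corpus:
    print(rec["file"], "->", repr(rec["text"][:40]))
```

In the notebook, `pd.DataFrame(load_corpus(corpus_dir))` would give a DataFrame with a `text` column that can go through the same cleaning and vectorizing steps used for the inaugural addresses.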



---

### Closing Thought
**FreqDist** shows what's loudest. **TF–IDF** shows what's distinctive. **LDA** shows what's thematic. Use all three to triangulate insights for public-policy questions.
