<a href="https://colab.research.google.com/github/quiet-econ-lab/Text_Analysis_Final_Project/blob/main/final_project_RyotaroTsuchiya_with_plotly_link.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

```
Ryotaro Tsuchiya  (UNI: rt302)
MPA Candidate '27 | International Finance and Economic Policy
Columbia University – School of International and Public Affairs (SIPA)
```

---
# Quantifying the Federal Reserve’s Beige Book Using Text Analysis
---



## 1. Introduction

The Federal Reserve’s Beige Book is a qualitative summary of business conditions based on reports from firms, industry contacts, and organizations across the twelve Federal Reserve districts. Released roughly two weeks before each FOMC meeting, it provides timely descriptions of hiring, prices, wages, and demand that can signal turning points before they show up in standard macroeconomic indicators.

Despite its importance, the Beige Book is entirely narrative. Policymakers must rely on subjective judgment to determine whether conditions are improving or weakening, which limits its usefulness for empirical analysis. This project examines whether the Beige Book’s tone can be quantified using sentence-level sentiment analysis, and whether such a measure tracks real economic activity.

To investigate this, I scraped the full Beige Book archive, applied sentiment analysis to each sentence, constructed a monthly sentiment index, and compared it with the Philadelphia Fed’s Coincident Economic Activity Index. I also used TF-IDF to identify themes that drive positive and negative sentiment over time.


## 2. Data Collection

### 2.1 Scraping the Beige Book
[The Beige Book](https://www.minneapolisfed.org/region-and-community/regional-economic-indicators/beige-book-archive) is not published as a structured dataset, so I built a scraper that checks each month from 1970 onward and downloads all available releases. When a release is detected, the scraper collects the publication date, the regional section name, and the full narrative text. The resulting dataset contains several thousand district-level narratives describing business conditions across more than five decades.
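
The publication date is collected as free text (for example "August 18, 1971"), so it has to be parsed before releases can be ordered or joined to other series. The following is a minimal sketch of that step, assuming the date always follows the "Month D, YYYY" pattern seen in the sample output; the helper names `parse_release_date` and `month_key` are hypothetical and are not part of the scraper.

```python
from datetime import datetime

def parse_release_date(full_date):
    """Parse a string such as 'August 18, 1971'; return None if it does not match."""
    if not isinstance(full_date, str):
        return None
    try:
        return datetime.strptime(full_date.strip(), "%B %d, %Y")
    except ValueError:
        return None

def month_key(dt):
    """Collapse a datetime to the first day of its month (useful as a join key)."""
    return dt.replace(day=1)

parsed = parse_release_date("August 18, 1971")
print(parsed.date(), month_key(parsed).date())
print(parse_release_date("not a date"))
```

Returning `None` for unparseable strings keeps malformed pages visible instead of silently assigning them a wrong date.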

In [None]:
import requests
import pandas as pd
from datetime import datetime
import time

In [None]:
# Base URL for the Minneapolis Fed Beige Book archive
BASE = "https://www.minneapolisfed.org/beige-book-reports/"

# List of all 12 Federal Reserve districts (plus national summary "su")
regions = ["su", "at", "bo", "ch", "cl", "da", "kc", "mi", "ny", "ph", "ri", "sf", "sl"]

headers = {"User-Agent": "Mozilla/5.0"}

start_year = 1970
current_year = datetime.now().year
url_records = []

# The Beige Book is released about eight times per year,
# but the release months are not fixed and differ across years.
# Because the archive has no index of all release dates,
# we check all 12 months and detect which ones actually contain a release.

for year in range(start_year, current_year + 1):
    for month in range(1, 13):

        # First test if the release exists by checking the national summary page (su)
        test_url = f"{BASE}{year}/{year}-{month:02d}-su"
        try:
            r = requests.get(test_url, headers=headers, timeout=10)
        except requests.RequestException:
            print(f"{year}-{month:02d}: ERROR")
            continue

        if r.status_code == 200:
            print(f"{year}-{month:02d}: FOUND")

            # If the national summary exists, add URLs for all districts
            for region in regions:
                url = f"{BASE}{year}/{year}-{month:02d}-{region}"
                url_records.append({
                    "year": year,
                    "month": month,
                    "region": region,
                    "url": url,
                })

        elif r.status_code == 404:
            print(f"{year}-{month:02d}: NOT FOUND")
        else:
            print(f"{year}-{month:02d}: STATUS {r.status_code}")

        # Sleep briefly to avoid overwhelming the server
        time.sleep(0.1)

In [None]:
# Convert list to DataFrame
df_urls = pd.DataFrame(url_records, columns=["year", "month", "region", "url"])
print("Total URLs collected:", len(df_urls))
print(df_urls.sample(5).to_string())

Total URLs collected: 6032
      year  month region                                                                url
5988  2025      7     ny  https://www.minneapolisfed.org/beige-book-reports/2025/2025-07-ny
1673  1981     11     ph  https://www.minneapolisfed.org/beige-book-reports/1981/1981-11-ph
1387  1979      5     ph  https://www.minneapolisfed.org/beige-book-reports/1979/1979-05-ph
5780  2023      1     ny  https://www.minneapolisfed.org/beige-book-reports/2023/2023-01-ny
4063  2004     10     mi  https://www.minneapolisfed.org/beige-book-reports/2004/2004-10-mi


In [None]:
from bs4 import BeautifulSoup

In [None]:
### Scraping all Beige Book pages using the URL list created above ###

def scrape_beige_page(url):
    """
    Scrapes one Beige Book page and extracts:
        - year_month : e.g., "August 1970"
        - full_date  : e.g., "August 12, 1970" (exact publication date)
        - region_name: e.g., "National Summary", "Atlanta"
        - url        : source URL
        - content    : main economic narrative (merged paragraphs)

    Returns a dictionary if successful.
    Returns None if the page is missing or has unexpected structure.
    """

    # Attempt request
    try:
        r = requests.get(url, headers=headers, timeout=10)
    except requests.RequestException:
        return None

    # Skip pages that do not exist (404) or return unexpected status
    if r.status_code != 200:
        return None

    soup = BeautifulSoup(r.text, "html.parser")

    # ---------------------------
    # Extract title block <h1>
    # ---------------------------
    h1 = soup.find("h1", class_="i9-c-title-banner__title--title")
    if not h1:
        return None

    title_text = h1.get_text(strip=True)

    # Most pages follow the format:
    #   "<Region>: <Month Year>"
    # Example:
    #   "Boston: August 1970"
    if ":" in title_text:
        region_name, year_month = [x.strip() for x in title_text.split(":", 1)]
    else:
        region_name = title_text
        year_month = ""

    # ---------------------------
    # Extract main text block
    # ---------------------------
    div = soup.find("div", class_="i9-c-rich-text-area")
    if not div:
        return None

    # Extract full_date from <strong>
    strong_tag = div.find("strong")
    full_date = strong_tag.get_text(strip=True) if strong_tag else ""

    # Extract ALL paragraphs (including date paragraph)
    p_tags = div.find_all("p")
    paragraphs = [p.get_text(" ", strip=True) for p in p_tags]

    # Merge into content
    content = "\n\n".join(paragraphs)
    content = content.replace(full_date, "")

    return {
        "year_month": year_month,
        "full_date": full_date,
        "region_name": region_name,
        "url": url,
        "content": content,
    }

In [None]:
### -------------------------------------------------------------
### Loop over all URLs and scrape pages (with progress display)
### -------------------------------------------------------------

beige_records = []   # Will contain all text + metadata

for _, row in df_urls.iterrows():
    url    = row["url"]
    year   = row["year"]
    month  = row["month"]
    region = row["region"]

    # Progress display (important for debugging and long scraping runs)
    print(f"Scraping: {year}-{month:02d}  region={region}  URL={url}")

    page_info = scrape_beige_page(url)

    # Skip pages with missing or malformed structure
    if page_info is None:
        print(f"  → skipped (no usable content): {year}-{month:02d}  {region}")
        continue

    beige_records.append(page_info)

In [None]:
df_beige = pd.DataFrame(beige_records)

pd.set_option('display.max_colwidth', None)
pd.set_option('display.width', 0)

df_beige.sample(1)


Unnamed: 0,year_month,full_date,region_name,url,content
200,August 1971,"August 18, 1971",Dallas,https://www.minneapolisfed.org/beige-book-reports/1971/1971-08-da,"\n\nRespondents at a sample of retail stores in the Eleventh Federal \r\nReserve District expect business activity to pick up moderately over \r\nthe next six months. Relatively few, however, were even mildly \r\noptimistic about improvement in the employment picture in their \r\nareas for the near term. Recent sales at these stores were somewhat \r\ngreater than during the comparable period last year, with summer \r\nsales being greater than anticipated. Nevertheless, inventories are \r\nstill higher than this time last year and a little higher than some \r\nwould like. Virtually all indicated that wholesale prices for \r\ninventories, as well as retail prices they charge their customers, \r\nhave risen moderately over the last six months. Moreover, nearly all \r\nanticipate that both wholesale and retail prices of their goods will \r\nrise by about a like amount in the next six months.\n\nAbout two-thirds of the respondents reported that sales recently \r\nwere above those a year ago, with a few indicating that the increase \r\nhas been substantial. The increase was attributed by most to greater \r\ndemand, improved internal operations, and expanded facilities. On \r\nbalance, the importance of ""big ticket"" items and the most expensive \r\n""top of the line"" items in total sales has not changed materially \r\nsince a year ago. At least at these outlets, the consumer apparently \r\nhas not become more economy-minded in past months.\n\nAll of the stores surveyed ran special summer sales. A little over \r\n60 percent indicated that the sales volume exceeded their \r\nexpectations and was greater than during the summer sales last year. 
\r\nMost respondents indicated that the number of items on sale and the \r\nprice reductions this year were comparable with those last year, \r\nalthough about 30 percent reported greater number of items and \r\nlarger price reductions in this year's sales.\n\nEven though sales have been up recently, almost 80 percent have \r\ninventories equal to or greater than those a year ago. Moreover, \r\njust more than a third of the respondents felt that their current \r\ninventory levels are a little too high. Nevertheless, all of the \r\nrespondents are planning to keep their inventories at present levels \r\nor actually increase them slightly over the next six months \r\n(exclusive of changes for seasonal reasons). A few pointed out, \r\nhowever, that anticipated increases in inventories will probably \r\nreflect higher prices rather than greater unit volume.\n\nRegarding price increases, nearly all of the respondents experienced \r\na moderate rise in wholesale prices of inventories over the last six \r\nmonths. Moreover, the majority indicated that salaries in their \r\nstores have risen from 5 to 10 percent during the last year; and, on \r\nbalance, the number of people employed has also risen over this \r\nperiod.\n\nIn light of these increased costs, almost 90 percent of the stores \r\nsurveyed have raised their retail prices in recent months. And in \r\nview of the near unanimous feeling that wholesale prices will \r\ncontinue to rise about as rapidly in the next six months as in the \r\nlast six months, most respondents are planning to increase retail \r\nprices of their goods moderately further in the near term.\n\nDistrict data continue to show a modest recovery in economic \r\nactivity. Total nonagricultural wage and salary employment in the \r\nDistrict states rose 0.2 percent in June above the May level. Most \r\nof the growth came in manufacturing—an employment area that still \r\nlags behind year-ago levels. 
Car registrations in the four major \r\nreporting areas in Texas showed a sharp 21 percent gain for June and \r\nare running much stronger than a year ago. Similarly, department \r\nstore sales continue to show strength in consumer buying. The \r\nseasonally adjusted Texas industrial production index dropped \r\nslightly in June mainly due to a cut in crude production. Texas \r\ncrude production allowables were reduced to 66.2 percent for August,\r\nthe fourth consecutive month of such reductions. Recent rains have \r\nbroken the drought, and progress is being made in efforts to control \r\nthe outbreak of Venezuelan equine encephalomyelitis."


### 2.2 Economic Indicators from FRED

To compare Beige Book sentiment with economic activity, I retrieved the [Philadelphia Fed’s Coincident Economic Activity Index](https://fred.stlouisfed.org/series/USPHCI#) from the FRED API and converted it into year-over-year growth rates so it could be aligned with the Beige Book releases by month. This indicator summarizes the national business cycle using employment, income, manufacturing, and unemployment data, making it a useful reference point for evaluating whether qualitative sentiment contains timely economic information.
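
As a worked illustration of the year-over-year transformation, the sketch below computes the percentage change against the value twelve observations earlier using plain Python lists. The index levels are made-up numbers, and `yoy_pct_change` is a hypothetical helper that only mirrors the idea of a 12-period percentage change.

```python
def yoy_pct_change(values, periods=12):
    """Percentage change versus the value `periods` steps earlier; None where no base exists."""
    out = []
    for i, value in enumerate(values):
        if i < periods:
            out.append(None)  # no observation one year earlier
        else:
            base = values[i - periods]
            out.append((value / base - 1.0) * 100.0)
    return out

# Hypothetical index levels: 13 monthly observations rising from 100.0 to 103.0.
levels = [100.0 + 0.25 * i for i in range(13)]
growth = yoy_pct_change(levels)

print(growth[:12])          # the first 12 months have no year-earlier base
print(round(growth[12], 2)) # growth of month 13 relative to month 1
```

The first twelve entries are undefined by construction, which is why the first year of the growth series is empty.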

In [None]:
import os
from getpass import getpass

# Enter your FRED API key.
# If you do not have one, you can request it at:
# https://fred.stlouisfed.org/docs/api/api_key.html
os.environ["FRED_API_KEY"] = getpass("Paste your FRED API key: ")

# Simple check to ensure the key was set correctly
assert os.environ.get("FRED_API_KEY"), "No API key detected. Please run this cell again and enter your key."

Paste your FRED API key: ··········


In [None]:
# ------------------------------------------------------------
# Retrieve USPHCI (Philadelphia Fed's Coincident Economic Activity Index)
# from the FRED API and compute year-over-year percentage change.
# ------------------------------------------------------------

# Load API key stored in environment variable
fred_api_key = os.environ.get("FRED_API_KEY")

# FRED series ID for the national coincident index
series_id = "USPHCI"

# Construct API URL
url = (
    "https://api.stlouisfed.org/fred/series/observations"
    f"?series_id={series_id}&api_key={fred_api_key}&file_type=json"
)

# Request data from FRED (time out on a stalled connection, raise on HTTP errors)
r = requests.get(url, timeout=30)
r.raise_for_status()

# Extract the list of observations (each contains a date and value)
obs = r.json()["observations"]

# Convert to DataFrame
df_fred = pd.DataFrame(obs)

# Convert date to datetime and value to numeric
df_fred["date"] = pd.to_datetime(df_fred["date"])
df_fred["value"] = pd.to_numeric(df_fred["value"], errors="coerce")

# Keep only valid rows and sort chronologically
df_fred = df_fred[["date", "value"]].dropna()
df_fred = df_fred.sort_values("date").reset_index(drop=True)

# Compute year-over-year percentage change (12-month difference)
df_fred["pct_change"] = df_fred["value"].pct_change(12) * 100

# Display first two years of data
df_fred.head(24)

Unnamed: 0,date,value,pct_change
0,1979-01-01,44.91,
1,1979-02-01,45.05,
2,1979-03-01,45.3,
3,1979-04-01,45.36,
4,1979-05-01,45.59,
5,1979-06-01,45.72,
6,1979-07-01,45.84,
7,1979-08-01,45.89,
8,1979-09-01,45.98,
9,1979-10-01,46.07,


## 3. Data Cleaning and Preparation
The Beige Book narratives were cleaned by collapsing runs of whitespace (line breaks, tabs, repeated spaces) into single spaces. Each cleaned document was then split into individual sentences. This step was essential because many paragraphs include both positive and negative assessments, and sentence-level granularity lets each assessment be scored on its own rather than averaged across mixed passages. These sentences form the core dataset for the sentiment analysis.
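
To illustrate why sentence-level splitting matters, the sketch below separates a mixed paragraph into its positive and negative statements. It uses a naive regular-expression splitter as a stand-in for a trained sentence tokenizer; the rule and the example text are assumptions for illustration only.

```python
import re

def split_sentences(text):
    """Naive splitter: break after ., ! or ? when followed by whitespace and a capital letter."""
    text = re.sub(r"\s+", " ", text).strip()
    return [s for s in re.split(r"(?<=[.!?])\s+(?=[A-Z])", text) if s]

paragraph = (
    "Retail sales rose moderately over the period. "
    "However, manufacturers reported weaker orders and rising input costs."
)
for sentence in split_sentences(paragraph):
    print(sentence)

# Limitation: abbreviations are treated as sentence ends by this simple rule.
print(split_sentences("Home sales in the Minneapolis-St. Paul area rose."))
```

The last call shows the weakness of the naive rule: it breaks after "St.", which is one reason to prefer a trained tokenizer for text that is full of place names and abbreviations.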

To prepare text for identifying economic themes, a second cleaned version of each sentence was created specifically for TF-IDF. Based on the cleaning script in [Krisel (2023)](https://github.com/rskrisel/tfidf_topic_modeling/blob/main/Intro_Text_Analysis_TFIDF_LDA_Inaugurals.ipynb), this process included lowercasing, tokenization, and removal of English stopwords, years, month names, and common Beige Book terms such as “contacts” or “reported.” The purpose was to retain economically meaningful vocabulary while filtering out boilerplate phrasing that appears in nearly every report. The resulting tokenized sentences were then ready for TF-IDF modeling.
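
The effect of this cleaning can be sketched on a single sentence. The example below uses a regular-expression tokenizer and a deliberately tiny stopword set, both of which are simplifying assumptions; results can differ from a full tokenizer and stopword list, for instance on hyphenated words.

```python
import re

# Small illustrative stopword set (an assumption, far shorter than a full English list).
STOP = {"the", "that", "and", "were", "from", "contacts", "reported", "percent", "june"}

def clean_tokens(sentence):
    """Lowercase, keep alphabetic tokens of three or more letters, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return [t for t in tokens if len(t) >= 3 and t not in STOP]

print(clean_tokens("Contacts reported that car sales rose 17 percent from June 1987."))
```

Only the economically meaningful words survive; the reporting verbs, the month name, and the numbers are filtered out.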

In [None]:
# -------------------------------------------------------------
# Clean Beige Book text (normalize whitespace)
# -------------------------------------------------------------
import re

In [None]:
# Clean whitespace in content (remove newlines/tabs/double spaces)
def normalize_spaces(text):
    return re.sub(r"\s+", " ", text).strip() if isinstance(text, str) else ""

df_beige["content_clean"] = df_beige["content"].apply(normalize_spaces)

# Normalize year_month to the form "Month YYYY" (e.g., "August 1970")
df_beige["year_month"] = df_beige["year_month"].str.extract(r"([A-Za-z]+ \d{4})")

df_beige[["year_month", "full_date", "region_name", "content_clean"]].sample(1)

Unnamed: 0,year_month,full_date,region_name,content_clean
2373,August 1988,"August 2, 1988",Minneapolis,"General economic conditions have held firm in the Ninth District. Employment demand has remained high, and consumer spending has continued to grow moderately. Recent rains should eventually help some drought-stricken crop, livestock, and dairy operations. And agricultural bank conditions, while not improving, have not yet deteriorated either. Labor Markets The most recent statistics indicate that district labor markets have experienced some additional tightening. During May, Minnesota's unemployment rate dropped to 3.2 percent, its lowest level in 9 years. The unemployment rate in its Minneapolis-St. Paul metropolitan area fell to just 2.8 percent. In that area, temporary help agencies report some spot labor shortages and rising wages paid to clerical workers. Both the labor force and total employment reached record highs during May in South Dakota; its unemployment rate was only 3 percent. Also, North Dakota's unemployment rate dropped to 3.7 percent during May, almost one full percentage point below its level a year earlier. During past years, the Upper Peninsula of Michigan has experienced very high unemployment; this has eased in the 12 months ending in May: from 10.6 to 7.7 percent. Consumer Spending Retail spending has continued to grow moderately in the district. One chain reports that its department store sales were 7 percent higher this June than last. One chain plans to significantly expand two of its stores in the Minneapolis-St. Paul metro area. A chain with stores throughout the district reports much stronger growth during this period, but a higher market share probably accounts for much of that strength. Neither chain reports any inventory or credit problems. District sales of motor vehicles have continued to hold up well. One domestic manufacturer reports that its car sales during June rose 17 percent over their level a year earlier. 
A district manager for a popular domestic line reports that car sales during June were strong at virtually all its dealers. A recent arena sale in Sioux Falls, South Dakota, went quite well. But district truck sales have slowed relative to car sales, perhaps due to drought-induced buying resistance in farm-dependent areas. Still, vehicle inventories haven't risen above normal levels. Housing activity has held firm. Home sales in the Minneapolis-St. Paul area during May and June were 13 percent ahead of a year earlier. Residential building contracts In Minnesota were up 5 percent during May. But as has been true for some time, housing activity was stagnant in many cities and towns of Montana and North Dakota. District tourist spending has increased sharply this summer. Despite burning bans at campgrounds, all tourism industry representatives contacted report increased activity. For example, Independence Day weekend business was way above the expectations of industry sources in Michigan's Upper Peninsula. A source in northern Wisconsin says that tourism there has beers running 10 percent ahead of last year. And the Black Hills area of South Dakota has also done well. Agriculture Rain during July came too late to significantly help some of the district's wheat, barley, and oats crops but has helped its soybean and corn crops. For example, during the third week in July, the Minnesota Commissioner of Agriculture estimated that 80 percent of' Minnesota's soybean crop would survive. Soybeans are Minnesota's second-largest cash crop. Its largest crop, corn, is not expected to do so well: only 60 percent may survive. Crop insurance will help district farmers cover part of the lost output. Compared to last year, federally sponsored multiperil coverage is up 83 percent in South Dakota, 77 percent in North Dakota, 41 percent in Montana, and 31 percent in Minnesota. More help has come from prices of farm products, which have continued to rise. 
Mid-July products of corn and soybeans on the cash markets were over 70 percent higher than a year earlier, while barley and wheat prices were up around 50 percent. These high prices imply lower government deficiency payments to farmers under current law, but some form of disaster relief might be enacted to replace that loss. Furthermore, farmers with stored crops carried over from last year will benefit from sales at these high prices. Livestock and dairy operations have been significantly hampered by higher feed prices and a shortage of pasture growth. As a result, more stock has been sold than normal, which has lowered its price as much as 20 percent. Recent rains should help stimulate grass growth, though, and slow the sell-off in some parts of the district. Financial Conditions The safety and soundness of district banks do not seem to have been hurt by the drought yet. Members of this Bank's Advisory Council on Small Business, Agriculture, and Labor report that banks are still liquid and looking for good lending opportunities. A prominent banker in Sioux Falls, South Dakota, says that most serious farm debt problems won't be noticed until next year. This Bank's latest survey of district agricultural bankers does indicate that many bankers are expecting low farm income and slow repayment of farm debt during the third quarter. Still, the condition of district agricultural banks substantially improved during 1987, which should help them weather the drought."


In [None]:
# -------------------------------------------------------------
# Split cleaned text into individual sentences
# -------------------------------------------------------------
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords

nltk.download("punkt")
nltk.download('punkt_tab')
nltk.download("stopwords")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
# Split each cleaned Beige Book into sentences and store metadata + clean sentence
sent_rows = []

for row in df_beige.itertuples():
    for sent in nltk.sent_tokenize(row.content_clean):
        sent_rows.append({
            "year_month": row.year_month,
            "full_date": row.full_date,
            "region_name": row.region_name,
            "sentence_clean_sent": normalize_spaces(sent),
        })

df_sent = pd.DataFrame(sent_rows)

In [None]:
# -------------------------------------------------------------
# Create TF-IDF clean sentences + token lists
# -------------------------------------------------------------

# Prepare stopwords for TF-IDF
EN_STOP = set(stopwords.words("english"))

# Custom stopwords specific to Beige Book text
CUSTOM_STOP = {
    "contacts", "reported", "noted", "district", "year", "percent"
}

# Add years and month names to custom stopwords
years = {str(y) for y in range(start_year, current_year + 1)}
months = {
    "january","february","march","april","may","june",
    "july","august","september","october","november","december",
    "jan","feb","mar","apr","jun","jul","aug","sep","sept","oct","nov","dec"
}

CUSTOM_STOP = CUSTOM_STOP.union(years).union(months)
STOPWORDS = EN_STOP.union(CUSTOM_STOP)

# Clean tokens for TF-IDF (lowercase, alphabetic only, ≥3 letters)
def simple_clean_tokens(text):
    text = text.lower()
    tokens = word_tokenize(text)
    return [t for t in tokens if t.isalpha() and len(t) >= 3 and t not in STOPWORDS]

df_sent["tokens_clean_tfid"] = df_sent["sentence_clean_sent"].apply(simple_clean_tokens)
df_sent["sentence_clean_tfidf"] = df_sent["tokens_clean_tfid"].apply(lambda t: " ".join(t))

df_sent[
    ["year_month", "region_name", "sentence_clean_sent", "sentence_clean_tfidf", "tokens_clean_tfid"]
].head()

Unnamed: 0,year_month,region_name,sentence_clean_sent,sentence_clean_tfidf,tokens_clean_tfid
0,May 1970,National Summary,"This initial report of economic conditions in the 12 Federal Reserve Districts is based on information gathered from directors of the Reserve Banks, conversations with local bankers, businessmen and economists, regular monthly surveys of manufacturing and trade industries conducted by some of the Reserve Banks, and selected statistical measures of regional economic activity.",initial report economic conditions federal reserve districts based information gathered directors reserve banks conversations local bankers businessmen economists regular monthly surveys manufacturing trade industries conducted reserve banks selected statistical measures regional economic activity,"[initial, report, economic, conditions, federal, reserve, districts, based, information, gathered, directors, reserve, banks, conversations, local, bankers, businessmen, economists, regular, monthly, surveys, manufacturing, trade, industries, conducted, reserve, banks, selected, statistical, measures, regional, economic, activity]"
1,May 1970,National Summary,Reports from the Reserve Banks clearly indicate that the current overriding domestic concern is inflation.,reports reserve banks clearly indicate current overriding domestic concern inflation,"[reports, reserve, banks, clearly, indicate, current, overriding, domestic, concern, inflation]"
2,May 1970,National Summary,Businessmen contacted generally expect that prices will continue to increase at a rapid rate during the remainder of the year.,businessmen contacted generally expect prices continue increase rapid rate remainder,"[businessmen, contacted, generally, expect, prices, continue, increase, rapid, rate, remainder]"
3,May 1970,National Summary,There appears to be considerable skepticism regarding the ability of economic stabilization policies to achieve a significant reduction in the rate of inflation without generating an intolerable level of unemployment or a full-scale recession.,appears considerable skepticism regarding ability economic stabilization policies achieve significant reduction rate inflation without generating intolerable level unemployment recession,"[appears, considerable, skepticism, regarding, ability, economic, stabilization, policies, achieve, significant, reduction, rate, inflation, without, generating, intolerable, level, unemployment, recession]"
4,May 1970,National Summary,"Similarly, there is evidence of extensive concern about the persistence of strong upward wage pressures, despite some easing in labor markets.",similarly evidence extensive concern persistence strong upward wage pressures despite easing labor markets,"[similarly, evidence, extensive, concern, persistence, strong, upward, wage, pressures, despite, easing, labor, markets]"


## 4. Methods

### 4.1 Sentiment Classification
The sentiment analysis was conducted using VADER, a lexicon- and rule-based model originally developed for short, informal text such as social media posts. I computed the VADER compound score for every sentence. The compound score ranges from −1 (most negative) to +1 (most positive). Following the thresholds conventionally used with VADER, I classified sentences as positive when the compound score was 0.05 or higher, negative when it was −0.05 or lower, and neutral otherwise.

After labeling all sentences, I aggregated those belonging to the same monthly Beige Book release to construct a national sentiment measure. The Beige Book Sentiment Index for month *t* was defined as:

$$
\text{Index}_t = \frac{\text{\# positive}_t - \text{\# negative}_t}{\text{\# positive}_t + \text{\# negative}_t}
$$

This formulation captures the balance of optimistic and pessimistic assessments each month, producing an index that increases when positive evaluations become more frequent and decreases when negative assessments dominate.
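
A small worked example may make the formula concrete. The sketch below applies the ±0.05 thresholds to a list of hypothetical compound scores and computes the index; `sentiment_index` is an illustrative helper, and returning `None` when a month has no positive or negative sentences is a choice made for this sketch.

```python
def sentiment_index(compound_scores, threshold=0.05):
    """(#positive - #negative) / (#positive + #negative); neutral sentences are ignored."""
    n_pos = sum(1 for s in compound_scores if s >= threshold)
    n_neg = sum(1 for s in compound_scores if s <= -threshold)
    if n_pos + n_neg == 0:
        return None  # index is undefined when every sentence is neutral
    return (n_pos - n_neg) / (n_pos + n_neg)

# Hypothetical compound scores for the sentences of one release.
scores = [0.62, 0.34, 0.05, 0.0, 0.01, -0.05, -0.71, 0.44]
index = sentiment_index(scores)
print(round(index, 4))
```

Here four sentences are positive, two are negative, and two are neutral, so the index is (4 − 2) / (4 + 2). The index is bounded between −1 and +1, and neutral sentences affect neither the numerator nor the denominator.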

In [None]:
# -------------------------------------------------------------
# Compute Beige Book Sentiment Index using VADER
# -------------------------------------------------------------
from nltk.sentiment import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

In [None]:
sia = SentimentIntensityAnalyzer()

df_sent["vader_compound"] = df_sent["sentence_clean_sent"].apply(
    lambda x: sia.polarity_scores(x)["compound"]
)

def vader_label(score):
    if score >= 0.05:
        return "pos"
    elif score <= -0.05:
        return "neg"
    else:
        return "neu"

df_sent["vader_label"] = df_sent["vader_compound"].apply(vader_label)

In [None]:
# Top 3 positive sentences
df_sent.sort_values("vader_compound").tail(3)[["vader_compound", "sentence_clean_sent"]]

Unnamed: 0,vader_compound,sentence_clean_sent
4357,0.9712,"Surveys of businessmen and bankers in the Fifth District indicate general agreement on the following points: (1) some improvement in manufacturers' shipments, volume of new orders, and backlogs of orders; (2) significant further improvement in retail sales, including automobiles; (3) stability in the employment situation, but no clear evidence of improvement; (4) further reductions of prices in manufacturing, but not in retail goods and services; (5) sharp improvement in residential construction, and some increase in nonresidential construction; (6) substantial increases in mortgage loan demand, and slight increases in consumer loan demand, but no significant improvement in business loan demand; and (7) a generally more optimistic outlook regarding future business conditions."
540100,0.9719,"Contacts noted that workers are on track to receive bonuses this year, but bonuses are not expected to be overly generous given the softer labor market, though Wall Street bonuses are expected to be strong."
10685,0.9772,"However, a special survey of a cross-section of prominent businessmen located in the Atlanta, Nashville, and New Orleans areas yielded the following conclusions: support for the NEP remains strong, but has diminished somewhat over the past year; the NEP has been effective in checking at least some price increases; wage inflation has been effectively checked by the NEP, inequities have not been so great as to jeopardize the program; some form of controls should be continued beyond April 1973; inflationary expectations have diminished slightly; and the only economic resource that is in short supply is competent labor."


In [None]:
# Top 3 negative sentences
df_sent.sort_values("vader_compound").head(3)[["vader_compound", "sentence_clean_sent"]]

Unnamed: 0,vader_compound,sentence_clean_sent
189739,-0.9716,But adverse weather delayed or damaged crops in other districts and caused heavy livestock death losses and flood losses in the Minneapolis district.
456115,-0.9571,"In late August and early September, Hurricane Irene and Tropical Storm Lee left dozens of people injured or dead, damaged or destroyed thousands of homes, and cost hundreds of millions of dollars in disruption and damage throughout much of the Third District."
337617,-0.9524,Contacts also mentioned other industry changes resulting from the terrorist attacks such as separate terrorism and war clauses for policies (at additional cost) and closer scrutiny of the solvency of re-insurance providers.


In [None]:
# -------------------------------------------------------
# Aggregate sentiment by Beige Book release (year_month)
# -------------------------------------------------------
df_index = (
    df_sent
    .groupby("year_month", as_index=False)
    .agg(
        n_pos=("vader_label", lambda x: (x == "pos").sum()),
        n_neg=("vader_label", lambda x: (x == "neg").sum()),
        n_total=("vader_label", lambda x: ((x == "pos") | (x == "neg")).sum()),
        full_date=("full_date", "first")  # keep one date for reference
    )
)

# Compute Beige Book Sentiment Index: ( #positive − #negative ) / ( #positive + #negative )
df_index["vader_national"] = (
    (df_index["n_pos"] - df_index["n_neg"]) /
    (df_index["n_pos"] + df_index["n_neg"])
)
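
As a quick check on the formula, the index can be recomputed by hand from counts that appear in the `df_index` preview later in this notebook (April 1971: 154 positive and 120 negative sentences; April 1972: 220 positive and 70 negative). This is a minimal sketch rather than part of the pipeline; the helper name `sentiment_index` and the zero-count guard are assumptions added for illustration.

```python
def sentiment_index(n_pos, n_neg):
    """Net share of positive sentences among polar (pos/neg) sentences."""
    total = n_pos + n_neg
    if total == 0:
        # Assumed guard: a release with no polar sentences has no defined index
        return float("nan")
    return (n_pos - n_neg) / total


# Counts copied from the df_index preview for two releases
april_1971 = sentiment_index(154, 120)   # (154 - 120) / 274
april_1972 = sentiment_index(220, 70)    # (220 - 70) / 290

print(round(april_1971, 6), round(april_1972, 6))
```

Sentences labeled neither positive nor negative are left out of the denominator, so the index ranges from -1 (every polar sentence negative) to +1 (every polar sentence positive).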

### 4.2 TF-IDF Theme Identification
To understand what drives movements in sentiment, I applied TF-IDF separately to the positive and negative sentences of each monthly release. Because the vectorizer is fit on one month’s sentences at a time, a term’s weight reflects how distinctive it is across the sentences of that release rather than relative to the entire corpus; terms are then ranked by their average TF-IDF weight over that month’s sentences. This approach highlights the economic concepts that businesses emphasized most strongly in positive or negative reports. By examining these terms over time, it is possible to identify shifts in the themes that shape business sentiment, such as supply-chain issues, labor concerns, or demand conditions.
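
To make the ranking concrete, the sketch below computes mean TF-IDF weights over a small group of sentences in plain Python. It is not the notebook's implementation, which uses `TfidfVectorizer` in the cells that follow; the function `mean_tfidf` and the toy sentences are hypothetical, and the weighting assumes smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, with L2 normalization per sentence. Note that the IDF here is computed within the group only, just as fitting a vectorizer on one month's sentences would do.

```python
import math
from collections import Counter


def mean_tfidf(sentences):
    """Return {term: mean TF-IDF weight} across whitespace-tokenized sentences."""
    docs = [s.lower().split() for s in sentences]
    n_docs = len(docs)

    # Document frequency: number of sentences that contain each term
    df = Counter(term for doc in docs for term in set(doc))

    # Smoothed inverse document frequency (assumed weighting, see lead-in)
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}

    totals = Counter()
    for doc in docs:
        tf = Counter(doc)
        weights = {t: count * idf[t] for t, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        if norm == 0:
            continue  # empty sentence contributes nothing
        for t, w in weights.items():
            totals[t] += w / norm  # L2-normalized weight for this sentence

    return {t: total / n_docs for t, total in totals.items()}


# Hypothetical "negative" sentences for one release (illustration only)
toy_sentences = [
    "demand weak",
    "demand uncertain tariff",
    "tariff costs rising",
]

scores = mean_tfidf(toy_sentences)
top_terms = sorted(scores, key=scores.get, reverse=True)[:3]
print(top_terms)
```

Terms that recur across several short sentences, such as "demand" and "tariff" in this toy group, accumulate the highest average weight, which is why the monthly keyword lists tend to surface the themes mentioned most consistently.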

In [None]:
# -------------------------------------------------------
# Compute TF–IDF separately for positive and negative sentences
# -------------------------------------------------------
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:

def top_tfidf_words(group, n=10):
    """
    Compute TF–IDF for a group of sentences (one month)
    and return the top-n weighted terms.
    """
    # Take all TF-IDF-cleaned sentences for this month/label
    texts = group["sentence_clean_tfidf"].tolist()

    # If empty, return blank
    if len(texts) == 0:
        return ""

    # Vectorize using TF-IDF
    tfidf = TfidfVectorizer(max_features=2000)
    X_tfidf = tfidf.fit_transform(texts)

    # Compute average TF-IDF score across all sentences in this group
    scores = X_tfidf.mean(axis=0).A1
    terms = tfidf.get_feature_names_out()

    # Top-n term indices (descending score)
    top_idx = scores.argsort()[::-1][:n]

    # Return selected words as comma-separated string
    return ", ".join([terms[i] for i in top_idx])

In [None]:
# Compute TF–IDF themes for positive sentences
df_pos = df_sent[df_sent["vader_label"] == "pos"]

tfidf_pos = (
    df_pos
    .groupby("year_month")     # compute monthly themes
    .apply(top_tfidf_words, n=10)
    .reset_index(name="top_pos_terms")
)

In [None]:
# Compute TF–IDF themes for negative sentences
df_neg = df_sent[df_sent["vader_label"] == "neg"]

tfidf_neg = (
    df_neg
    .groupby("year_month")
    .apply(top_tfidf_words, n=10)
    .reset_index(name="top_neg_terms")
)

In [None]:
# Merge TF–IDF results into df_index
df_index = df_index.merge(tfidf_pos, on="year_month", how="left")
df_index = df_index.merge(tfidf_neg, on="year_month", how="left")

# Preview the result
df_index.head()

| | year_month | n_pos | n_neg | n_total | full_date | vader_national | top_pos_terms | top_neg_terms |
|---|---|---|---|---|---|---|---|---|
| 0 | April 1971 | 154 | 120 | 274 | April 6, 1971 | 0.124088 | construction, demand, rates, consumer, improvement, loans, directors, loan, residential, increased | demand, rates, loan, banks, unemployment, weak, levels, activity, consumer, cut |
| 1 | April 1972 | 220 | 70 | 290 | April 12, 1972 | 0.517241 | sales, demand, business, strong, construction, expected, strength, loans, loan, gains | demand, business, unemployment, cent, per, directors, new, phase, investment, respondents |
| 2 | April 1973 | 184 | 118 | 302 | April 11, 1973 | 0.218543 | strong, sales, business, loan, directors, increased, increase, construction, gains, rates | demand, shortages, increases, labor, prices, price, reports, new, construction, unemployment |
| 3 | April 1974 | 192 | 143 | 335 | April 10, 1974 | 0.146269 | business, sales, strong, prices, increase, increased, month, optimistic, rates, demand | shortages, demand, prices, steel, continue, loan, recent, business, however, also |
| 4 | April 1975 | 202 | 187 | 389 | April 9, 1975 | 0.03856 | sales, increase, however, construction, economy, months, business, new, consumer, one | sales, prices, demand, weak, loan, unemployment, recovery, still, capital, one |


## 5. Results

In [None]:
# ---------------------------------------------
# Build a dynamic Plotly graph
# ---------------------------------------------
import numpy as np
import plotly.graph_objs as go

In [None]:
df_index["date"] = pd.to_datetime(df_index["year_month"], format="%B %Y")
df_index = df_index.sort_values("date").reset_index(drop=True)
df_fred  = df_fred.sort_values("date").reset_index(drop=True)

# ---------------------------------------------
# Merge Beige Book index with FRED growth data
# (pct_change is used directly, no renaming)
# ---------------------------------------------
df_plot = pd.merge_asof(
    df_index.sort_values("date"),
    df_fred[["date", "pct_change"]].sort_values("date"),
    on="date",
    direction="backward"
).reset_index(drop=True)

# ---------------------------------------------
# Prepare custom hover data
# customdata columns:
#   0 = Beige Book Index
#   1 = USPHCI YoY (%)
#   2 = positive keywords
#   3 = negative keywords
# ---------------------------------------------
customdata = np.stack([
    df_plot["vader_national"].values,
    df_plot["pct_change"].values,
    df_plot["top_pos_terms"].fillna("").values,
    df_plot["top_neg_terms"].fillna("").values,
], axis=-1)

# ---------------------------------------------
# Build Plotly figure with dual y-axes
# ---------------------------------------------
fig = go.Figure()

# ----- Beige Book Sentiment Index (left axis) -----
fig.add_trace(
    go.Scatter(
        x=df_plot["date"],
        y=df_plot["vader_national"],
        mode="lines+markers",
        name="Beige Book Sentiment Index",
        yaxis="y1",
        customdata=customdata,
        hovertemplate=(
            "<b>%{x|%Y-%m}</b><br>"
            "Beige Book Index: %{customdata[0]:.3f}<br>"
            "USPHCI YoY: %{customdata[1]:.2f}%<br>"
            "<br><b>Positive keywords</b>: %{customdata[2]}<br>"
            "<b>Negative keywords</b>: %{customdata[3]}<br>"
            "<extra></extra>"
        ),
    )
)

# ----- USPHCI YoY (%) (right axis) -----
fig.add_trace(
    go.Scatter(
        x=df_plot["date"],
        y=df_plot["pct_change"],
        mode="lines+markers",
        name="Philly Fed Economic Activity Index YoY (%)",
        yaxis="y2",
        hovertemplate=(
            "<b>%{x|%Y-%m}</b><br>"
            "USPHCI YoY: %{y:.2f}%<br>"
            "<extra></extra>"
        ),
    )
)

# ---------------------------------------------
# Layout settings
# ---------------------------------------------
fig.update_layout(
    title="Beige Book Sentiment Index vs Economic Activity Index",
    xaxis=dict(title="Date"),
    yaxis=dict(title="Beige Book Sentiment Index"),
    yaxis2=dict(
        title="Philly Fed Economic Activity Index YoY (%)",
        overlaying="y",
        side="right",
        showgrid=False,
    ),
    template="plotly_white",
    hovermode="x unified",
)

fig.show()

> ### ⚠️ Interactive Plot of the Beige Book Index  
> GitHub cannot render interactive Plotly graphs.  
> Please click **[here](https://quiet-econ-lab.github.io/Text_Analysis_Final_Project/)** to view the fully interactive graph.

When the Beige Book Sentiment Index is compared with the year-over-year growth rate of the Philadelphia Fed’s Coincident Economic Activity Index, the two measures tend to move in the same direction. Periods of stronger sentiment are generally associated with stronger underlying economic conditions, whereas weaker sentiment coincides with slower growth. This alignment suggests that qualitative business narratives contain timely information about economic momentum, which is particularly valuable when official statistics are delayed or unavailable, as during the government shutdown at the time of writing.
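
One way to put a number on this co-movement would be a Pearson correlation between the two series, optionally with the growth series shifted forward to see whether sentiment leads activity. The sketch below is illustrative only: the helper names `pearson` and `lagged_corr` are hypothetical, and the two short series are made-up values, not results from this project.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def lagged_corr(sentiment, growth, lag):
    """Correlate sentiment at release t with growth at release t + lag (lag >= 0)."""
    if lag == 0:
        return pearson(sentiment, growth)
    return pearson(sentiment[:-lag], growth[lag:])


# Made-up series for illustration; not taken from df_plot
toy_sentiment = [0.30, 0.25, 0.10, -0.05, 0.05, 0.20]
toy_growth = [2.8, 2.9, 2.4, 1.5, 0.9, 1.4]

same_period = lagged_corr(toy_sentiment, toy_growth, lag=0)
one_ahead = lagged_corr(toy_sentiment, toy_growth, lag=1)
print(f"lag 0: {same_period:.2f}, lag 1: {one_ahead:.2f}")
```

In the notebook itself, the same check could be run on the merged frame with `df_plot["vader_national"].corr(df_plot["pct_change"])`, using `df_plot["pct_change"].shift(-1)` for a one-release lead.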

As the graph above shows, the sentiment index has softened in recent releases. The background for this decline becomes clearer when viewed alongside the TF-IDF results, which are accessible through the hover labels in the plot. In particular, the decline in May 2025 stands out. The most prominent negative terms for that month, “demand,” “uncertain,” and “tariff,” point to growing concerns about slowing demand and uncertainty related to trade policy.

A similar pattern is visible in November 2025, where negative sentences again emphasize “demand” and “uncertain.” This persistence indicates that caution among businesses has not eased and that uncertainty remains a major factor shaping economic expectations. In addition, “labor” appears as an important negative term. Given the Federal Reserve’s dual mandate of maximum employment and stable prices, the emergence of “labor” as a source of concern carries meaningful implications for monetary policy, as discussed in the next section.

## 6. Policy Implications and Conclusion
For monetary policymakers, a decline in Beige Book sentiment driven by weakening demand and heightened uncertainty provides an early signal of economic slowing. Such developments strengthen the case for policy easing. As noted above, concerns about the labor market are particularly important. When negative terms related to “labor” appear alongside broader signs of weakening demand, they suggest that the Federal Reserve’s goal of maximum employment may be at risk, creating a strong signal in favor of lowering interest rates. Indeed, the Federal Reserve has recently decided to cut rates, consistent with the deterioration in sentiment documented in the Beige Book.

For fiscal policymakers, the economic themes revealed through the TF-IDF analysis are especially valuable because fiscal policy, unlike monetary policy, can be targeted toward specific sectors or groups. For example, if “tariff” appears as a prominent negative term, this indicates that importers facing higher input costs and consumers affected by price pass-through may require targeted relief. In addition, the TF-IDF results for November 2025 show manufacturing among the positive keywords, suggesting that this sector remains relatively resilient. In such circumstances, the analysis also helps clarify how to finance targeted support. If manufacturing is performing comparatively well while importers and consumers are under strain, redistributing income from the stronger sector to those more adversely affected may represent an appropriate fiscal strategy.

In conclusion, this project demonstrates that the Federal Reserve’s qualitative Beige Book can be transformed into a useful numerical indicator through sentence-level sentiment analysis. The resulting index aligns with real economic activity, and the TF-IDF results help reveal the economic themes that drive month-to-month shifts in sentiment. Together, these methods show that narrative economic information, often viewed as anecdotal or subjective, can be converted into data that meaningfully informs monetary and fiscal decision-making.

Several extensions could further enhance the value of this approach. Applying the same methodology at the district level could highlight regional strengths and challenges, enabling not only the Federal Reserve and federal policymakers but also state and local governments to incorporate these insights into their policy processes. Although the smaller volume of text in district-level reports may increase the volatility of sentiment indices, this limitation could be mitigated by using FinBERT, a language model trained specifically on financial and economic text. While VADER was used in this project for its speed and low computational cost, FinBERT may offer more accurate and less volatile sentiment classification, especially when GPU-based parallel processing is available. With such refinements, the Beige Book could become an even more powerful quantitative resource for real-time economic monitoring and policy design.