[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](
https://colab.research.google.com/github/tunnel-ai/way/blob/main/notebooks/05_00_main.ipynb
)


# Module 5 — NLP: From Text to Perceived Agendas (Instructor Main)

## Act I: Why national agendas do (not?) emerge automatically

Our naive question/claim in this workbook: “If we cluster all scientific abstracts, do national research agendas emerge?”

In Act II: “Within a shared, globally contested topic, do regional differences in emphasis emerge?”



**Notebook:** `05_00_main`  
**Cadence:** magic helpers would be good here... but I want to try and keep the live-run narrative going, so the code looks a little less tidy.

**Big idea:** We will treat scientific abstracts as analytical artifacts, convert them into representations,
and examine how different representations can create *perceptions* of differences across coarse regions.

**Important:** A few notes on this... 

## 0) Colab-first setup

This notebook is designed to run in Google Colab.
- We fetch data from the **OpenAlex** public API (no keys).
- We cache a local CSV so you can re-run without re-fetching everything.
  

In [None]:
import os
import time
import json
import textwrap
import re
from dataclasses import dataclass

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
# If you care to set a random seed (I won't use one here)
RNG = np.random.default_rng(1955)

In [None]:
# Output paths (adjust if your repo uses a different structure)
DATA_DIR = "assets/data"
os.makedirs(DATA_DIR, exist_ok=True)

RAW_CACHE_PATH = os.path.join(DATA_DIR, "openalex_abstracts_sample.csv")

## 1) From affiliations to regions (a necessary proxy)

To explore how scientific agendas *appear* across different parts of the world, we need a way to group documents.
For this module, we use a **coarse, affiliation-based proxy** to assign each paper to a geographic region.

### What we are doing
- We query OpenAlex for works that have abstracts and institutional affiliations.
- We extract **institution country codes** from the metadata.
- We assign each work to a broad region (e.g., United States, Europe, China).
- If no clear mapping can be made, we assign **Unknown**.

### What this mapping is not
- Not author nationality
- Not a claim about funding sources or political priorities
- Not a unified "national agenda"
- Not a clean solution for multinational collaborations

Affiliation is a **proxy**, not a truth.

Our question is:
> What patterns appear when we impose structure on scientific writing and then aggregate those structures by region?

Keep in mind:
- If the mapping changes, the story may change.
- If the representation changes, the story may change.
- Visualization confidence is not interpretive certainty.

## 2) Fetch a sample of abstracts from OpenAlex (no keys)

OpenAlex provides a free REST API. We'll pull a *manageable sample* of works with:
- an abstract
- institutional affiliation(s) with country code(s)

We'll keep the sample modest to stay Colab-friendly.

In [None]:
import requests

In [None]:
@dataclass
class FetchConfig:
    per_page: int = 200
    max_works: int = 2000          # keep Colab-friendly; increase if needed
    from_year: int = 2022          # adjust to widen/narrow
    concept_id: str | None = None  # optional: focus on a concept
    mailto: str | None = None      # optional but polite: OpenAlex suggests adding a mailto

CFG = FetchConfig(
    per_page=200,
    max_works=2000,
    from_year=2022,
    concept_id=None,   # if we wanted to test certain concepts: "C154945302" (Artificial intelligence) (maybe verify first)
    mailto=None
)

In [None]:
BASE = "https://api.openalex.org/works"

In [None]:
def build_openalex_url(cursor="*"):
    params = {
        "per-page": CFG.per_page,
        "cursor": cursor,
        # basic filter: has an abstract and was published since from_year
        # (affiliations are not filtered here; country codes are extracted after fetching)
        "filter": f"has_abstract:true,from_publication_date:{CFG.from_year}-01-01",
        # sorting by recency tends to pull clearer modern language
        "sort": "publication_date:desc",
    }
    if CFG.concept_id:
        params["filter"] += f",concept.id:{CFG.concept_id}"
    if CFG.mailto:
        params["mailto"] = CFG.mailto
    return BASE, params

In [None]:
def inverted_index_to_text(inv):
    """
    OpenAlex stores abstracts as an 'inverted index' (token -> list of positions).
    We'll reconstruct an approximate text by placing tokens back at their positions.
    This preserves word content, but not necessarily punctuation/casing.
    """
    if inv is None or not isinstance(inv, dict) or len(inv) == 0:
        return None
    # Determine length (max position)
    max_pos = 0
    for token, positions in inv.items():
        if positions:
            max_pos = max(max_pos, max(positions))
    tokens = [""] * (max_pos + 1)
    for token, positions in inv.items():
        for p in positions:
            if 0 <= p < len(tokens) and tokens[p] == "":
                tokens[p] = token
    # Fill blanks with nothing (some positions can be empty)
    text = " ".join([t for t in tokens if t])
    return text if text.strip() else None
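
To make the inverted-index format concrete, here is a minimal sketch on a made-up index. The tokens and positions are invented for illustration, and `rebuild_abstract` is a hypothetical, compact stand-in for the function above rather than a replacement for it.

```python
# Toy inverted index: token -> list of positions (invented for illustration)
toy_index = {
    "we": [0, 4],
    "cluster": [1],
    "abstracts": [2],
    "and": [3],
    "plot": [5],
    "them": [6],
}

def rebuild_abstract(inv):
    """Place each token at its positions, then read the positions in order."""
    by_position = {p: token for token, positions in inv.items() for p in positions}
    return " ".join(by_position[p] for p in sorted(by_position))

rebuilt = rebuild_abstract(toy_index)
print(rebuilt)  # we cluster abstracts and we plot them
```

A token that occurs twice (here `we`) simply lists two positions, which is why the reconstruction has to be driven by position rather than by token.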

In [None]:
def extract_country_codes(work: dict) -> list[str]:
    """
    Pull country codes from institutions associated with the work's authorships.
    We allow multiple institutions -> multiple country codes.
    """
    codes = []
    for auth in work.get("authorships", []) or []:
        for inst in auth.get("institutions", []) or []:
            cc = inst.get("country_code")
            if cc:
                codes.append(cc.upper())
    return sorted(set(codes))

In [None]:
def fetch_openalex_sample():
    rows = []
    cursor = "*"
    fetched = 0

    while fetched < CFG.max_works:
        url, params = build_openalex_url(cursor=cursor)
        r = requests.get(url, params=params, timeout=60)
        r.raise_for_status()
        payload = r.json()

        for work in payload.get("results", []) or []:
            if fetched >= CFG.max_works:
                break

            inv = work.get("abstract_inverted_index")
            abstract = inverted_index_to_text(inv)
            if not abstract:
                continue

            country_codes = extract_country_codes(work)

            loc = work.get("primary_location") or {}
            src = loc.get("source") or {}

            rows.append({
                "openalex_id": work.get("id"),
                "doi": work.get("doi"),
                "title": work.get("title"),
                "publication_date": work.get("publication_date"),
                "primary_location": src.get("display_name"),
                "country_codes": "|".join(country_codes) if country_codes else "",
                "n_country_codes": len(country_codes),
                "abstract": abstract,
                "type": work.get("type"),
                "cited_by_count": work.get("cited_by_count"),
            })
            fetched += 1

        cursor = payload.get("meta", {}).get("next_cursor")
        if not cursor:
            break

        # be a good API citizen
        time.sleep(0.15)

    return pd.DataFrame(rows)

In [None]:
if os.path.exists(RAW_CACHE_PATH):
    df = pd.read_csv(RAW_CACHE_PATH)
    print(f"Loaded cached sample: {len(df):,} rows from {RAW_CACHE_PATH}")
else:
    df = fetch_openalex_sample()
    print(f"Fetched sample: {len(df):,} rows")
    df.to_csv(RAW_CACHE_PATH, index=False)
    print(f"Saved cache to {RAW_CACHE_PATH}")

In [None]:
df.head(3)

## 3) Quick data sanity checks

Before modeling: inspect what we actually have.

In [None]:
print("Rows:", len(df))
print("Missing abstracts:", df["abstract"].isna().sum())
print("Avg abstract length (chars):", int(df["abstract"].str.len().mean()))
print("Median # country codes:", int(df["n_country_codes"].median()))

In [None]:
# Look at a few titles + first ~250 chars of abstracts
for i in range(3):
    print("\n" + "—" * 40)  # one separator line, not 40 one-character lines
    print(df.loc[i, "title"])
    print(textwrap.shorten(df.loc[i, "abstract"], width=250, placeholder="…"))

## 4) Region mapping (coarse, imperfect, still useful)

We'll map *country codes* to broad regions:
- **US** (United States)
- **China**
- **Europe** (broadly: EU + UK + EFTA + nearby; we keep it simple. I know I know Brexit... but come on.)
- **Other**
- **Unknown**

**Note:** a paper can have multiple country codes. We'll assign:
- If **US** appears anywhere -> label "United States"
- Else if **CN** appears -> "China"
- Else if any European code appears -> "Europe"
- Else if any codes exist -> "Other"
- Else -> "Unknown"

This priority rule is arbitrary on purpose: it gives us a stable grouping for visualization, not truth.
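
To see how much the rule matters, the sketch below applies the priority rule to a handful of invented papers and compares it with one hypothetical alternative that flags cross-region collaborations as "Mixed". The `EUROPE` set is a tiny illustrative subset, and `mixed_label` is an assumption of mine, not something the notebook uses.

```python
from collections import Counter

EUROPE = {"DE", "FR", "GB", "NL"}  # tiny illustrative subset, not the full list

def priority_label(codes):
    """Priority rule from the text: US > CN > Europe > Other > Unknown."""
    codes = set(codes)
    if not codes:
        return "Unknown"
    if "US" in codes:
        return "United States"
    if "CN" in codes:
        return "China"
    if codes & EUROPE:
        return "Europe"
    return "Other"

def mixed_label(codes):
    """Hypothetical alternative: papers spanning several regions become 'Mixed'."""
    regions = {priority_label([c]) for c in codes}
    if not regions:
        return "Unknown"
    return regions.pop() if len(regions) == 1 else "Mixed"

# Invented country-code lists, one per paper
papers = [["US"], ["US", "CN"], ["CN", "DE"], ["DE", "FR"], ["BR"], []]

priority_counts = Counter(priority_label(p) for p in papers)
mixed_counts = Counter(mixed_label(p) for p in papers)
print("priority:", dict(priority_counts))
print("mixed:   ", dict(mixed_counts))
```

Under the priority rule, the US-China paper counts as "United States" and the China-Germany paper as "China"; under the alternative, both leave their regions entirely. Same papers, different regional totals.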

In [None]:
EUROPE_CODES = {
    # EU members + UK + EFTA + common European countries
    "AT","BE","BG","HR","CY","CZ","DK","EE","FI","FR","DE","GR","HU","IE","IT","LV","LT","LU",
    "MT","NL","PL","PT","RO","SK","SI","ES","SE",
    "GB","UK",  # GB is the ISO 3166-1 code; "UK" is included defensively. Brits are indecisive.
    "NO","CH","IS","LI",
    "UA","TR","RS","BA","ME","MK","AL","MD","BY","GE","AM","AZ",
}

def map_region(country_codes_str: str) -> str:
    if not isinstance(country_codes_str, str) or country_codes_str.strip() == "":
        return "Unknown"
    codes = {c.strip().upper() for c in country_codes_str.split("|") if c.strip()}
    if "US" in codes:
        return "United States"
    if "CN" in codes:
        return "China"
    if len(codes & EUROPE_CODES) > 0:
        return "Europe"
    return "Other"

df["region"] = df["country_codes"].apply(map_region)

In [None]:
df["region"].value_counts(dropna=False)

## 5) Minimal text cleaning (decisions, not perfection)

We'll keep cleaning light and transparent:
- normalize whitespace
- lowercase (for TF–IDF)
- remove obvious URLs
- remove non-letter characters *selectively* (keep hyphens and spaces)

Key principle:
> Cleaning is an argument about what information you consider irrelevant.
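
A quick way to see that argument in action is to run the cleaning steps on text where digits and non-ASCII letters carry meaning. The sketch below repeats the same regex steps in a standalone helper (`clean` is a hypothetical name, and the sample strings are invented).

```python
import re

def clean(s):
    """Same steps as the notebook's cleaning: strip, drop URLs, lowercase, keep a-z/space/hyphen."""
    s = re.sub(r"https?://\S+|www\.\S+", " ", s.strip()).lower()
    s = re.sub(r"[^a-z\s\-]", " ", s)
    return re.sub(r"\s+", " ", s).strip()

samples = [
    "COVID-19 vaccines (n=1,204); see https://example.org",
    "Na\u00efve Bayes on GPT-4 outputs",  # \u00ef is the letter i with a diaeresis
]

cleaned = [clean(s) for s in samples]
for raw, out in zip(samples, cleaned):
    print(repr(raw), "->", repr(out))
```

The first sample becomes `covid- vaccines n see` and the second `na ve bayes on gpt- outputs`: digits vanish, a stray `n` survives, and an accented letter splits a word in two. Whether that is acceptable depends on the question being asked.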

In [None]:
url_pat = re.compile(r"https?://\S+|www\.\S+")
multi_space_pat = re.compile(r"\s+")

def clean_text(s: str) -> str:
    if not isinstance(s, str):
        return ""
    s = s.strip()
    s = url_pat.sub(" ", s)
    s = s.lower()
    # keep letters, spaces, and hyphens; convert everything else to space
    s = re.sub(r"[^a-z\s\-]", " ", s)
    s = multi_space_pat.sub(" ", s).strip()
    return s

df["text"] = df["abstract"].apply(clean_text)

In [None]:
# Inspect a before/after example
idx = 0
print("TITLE:", df.loc[idx, "title"])
print("\nRAW:\n", textwrap.shorten(df.loc[idx, "abstract"], width=400, placeholder="…"))
print("\nCLEAN:\n", textwrap.shorten(df.loc[idx, "text"], width=400, placeholder="…"))

## 6) Representation 1: TF–IDF (sparse geometry)

TF–IDF is a classic baseline that is still conceptually powerful:
- You get a high-dimensional sparse vector for each document.
- Distance/similarity becomes a geometric question.

We will keep choices explicit:
- `min_df` / `max_df` control vocabulary inclusion.
- `ngram_range` decides whether phrases matter.
- stop words are a value judgment: we start with English stopwords.
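
For intuition about what the vectorizer computes, here is a hand-rolled sketch on three invented documents. It assumes the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, and L2 normalization, which to my understanding match scikit-learn's defaults; it ignores stop words, n-grams, and `min_df` / `max_df`.

```python
import math
from collections import Counter

# Three invented mini-documents
docs = [
    "privacy risk model",
    "privacy regulation policy",
    "protein folding model",
]

tokenized = [d.split() for d in docs]
n_docs = len(tokenized)
# Document frequency: in how many documents does each term appear?
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf(tokens):
    """Term frequency times smoothed IDF, then scale the vector to unit length."""
    tf = Counter(tokens)
    raw = {t: tf[t] * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0) for t in tf}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

def cosine(a, b):
    """Dot product of two unit-length sparse vectors stored as dicts."""
    return sum(a[t] * b[t] for t in a.keys() & b.keys())

vectors = [tfidf(tokens) for tokens in tokenized]
print("doc 0 weights:", {t: round(w, 3) for t, w in vectors[0].items()})
print("sim(0, 1):", round(cosine(vectors[0], vectors[1]), 3))
print("sim(1, 2):", round(cosine(vectors[1], vectors[2]), 3))
```

In document 0, the term that appears in only one document (`risk`) outweighs the two terms shared with other documents, and documents with no shared vocabulary have similarity zero. That is all "geometry" means here.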

In [None]:
vectorizer = TfidfVectorizer(
    stop_words="english",
    min_df=5,
    max_df=0.9,
    ngram_range=(1, 2),
)

X = vectorizer.fit_transform(df["text"])
print("TF–IDF matrix shape:", X.shape)

In [None]:
# Show a few top-weighted terms for one document
doc_i = 0
row = X[doc_i]
if row.nnz > 0:
    topk = 12
    inds = row.indices[np.argsort(row.data)[-topk:][::-1]]
    terms = [vectorizer.get_feature_names_out()[j] for j in inds]
    weights = np.sort(row.data)[-topk:][::-1]
    print(df.loc[doc_i, "title"])
    for t, w in zip(terms, weights):
        print(f"{t:<28} {w:.3f}")

## 7) A first "agenda lens": similarity search

We'll pick one abstract and retrieve its nearest neighbors (cosine similarity in TF–IDF space).


In [None]:
sim = cosine_similarity(X[0], X).ravel()
nn = np.argsort(sim)[::-1][:10]

print("Query document:")
print(" -", df.loc[0, "title"])
print(" - region:", df.loc[0, "region"])
print()

print("Nearest neighbors:")
for j in nn[1:]:
    print(f"sim={sim[j]:.3f} | {df.loc[j,'region']:<13} | {textwrap.shorten(str(df.loc[j,'title']), width=80, placeholder='…')}")

## 8) Visualizing structure: reduce dimensionality (TruncatedSVD)

TF–IDF lives in a huge space. To visualize structure, we project into 2D.

We'll use **TruncatedSVD** (works directly on sparse matrices).
This is not a "true map" — it's a view. Views can mislead.
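
One way to check how much a 2D view leaves out is to ask what share of the matrix's total "energy" (its squared Frobenius norm) the first two singular directions carry. The sketch below uses plain NumPy on a random nonnegative toy matrix as a stand-in for TF–IDF; the matrix and variable names are invented, and this share is related to, but not identical to, scikit-learn's `explained_variance_ratio_`.

```python
import numpy as np

rng = np.random.default_rng(0)
X_toy = rng.random((40, 25))  # nonnegative toy matrix standing in for TF-IDF

# Singular values only; no centering, as with an SVD applied directly to TF-IDF
singular_values = np.linalg.svd(X_toy, compute_uv=False)
energy = singular_values ** 2

share_first = energy[0] / energy.sum()
share_2d = energy[:2].sum() / energy.sum()
print(f"share carried by component 1:  {share_first:.2f}")
print(f"share carried by components 1-2: {share_2d:.2f}")
```

Because the data are nonnegative and uncentered, the first direction in this toy mostly captures overall magnitude rather than contrast between documents. It is worth checking whether Component 1 in the real plot is doing something similar before reading meaning into it.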

In [None]:
svd = TruncatedSVD(n_components=2, random_state=7)
Z = svd.fit_transform(X)

In [None]:
plt.figure(figsize=(8, 6))
for region, sub in df.assign(x=Z[:,0], y=Z[:,1]).groupby("region"):
    plt.scatter(sub["x"], sub["y"], s=12, alpha=0.6, label=region)
plt.title("TF–IDF → 2D projection (TruncatedSVD)")
plt.xlabel("Component 1")
plt.ylabel("Component 2")
plt.legend(markerscale=2)
plt.show()

## 9) How stable is the "agenda perception"?

We'll do two quick stress tests:

1) Change the representation slightly (unigrams only vs unigrams+bigrams)
2) Change vocabulary thresholds (`min_df`)

When you do that, you're not optimizing. You're checking how much the "story" that emerges depends on the choices you make.

In [None]:
vectorizer_uni = TfidfVectorizer(
    stop_words="english",
    min_df=5,
    max_df=0.9,
    ngram_range=(1, 1),
)
X_uni = vectorizer_uni.fit_transform(df["text"])

Z_uni = TruncatedSVD(n_components=2, random_state=7).fit_transform(X_uni)

In [None]:
plt.figure(figsize=(8, 6))
for region, sub in df.assign(x=Z_uni[:,0], y=Z_uni[:,1]).groupby("region"):
    plt.scatter(sub["x"], sub["y"], s=12, alpha=0.6, label=region)
plt.title("TF–IDF (unigrams only) → 2D projection")
plt.xlabel("Component 1")
plt.ylabel("Component 2")
plt.legend(markerscale=2)
plt.show()
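
The second stress test (vocabulary thresholds) can be sketched the same way. The toy corpus below is invented and `vocab_report` is a hypothetical helper; in the notebook you would swap `docs` for `df["text"]` and re-plot the projection.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Four invented mini-documents
docs = [
    "ai governance risk",
    "ai governance policy",
    "ai risk assessment",
    "protein folding model",
]

def vocab_report(min_df, ngram_range):
    """Return (vocabulary size, number of documents left with no features)."""
    vec = TfidfVectorizer(min_df=min_df, ngram_range=ngram_range)
    X_toy = vec.fit_transform(docs)
    empty_rows = int((X_toy.getnnz(axis=1) == 0).sum())
    return len(vec.vocabulary_), empty_rows

report = {
    (min_df, ngram_range): vocab_report(min_df, ngram_range)
    for min_df in (1, 2)
    for ngram_range in ((1, 1), (1, 2))
}
for (min_df, ngram_range), (vocab_size, empty_rows) in report.items():
    print(f"min_df={min_df} ngram_range={ngram_range}: vocab={vocab_size}, empty docs={empty_rows}")
```

Raising `min_df` from 1 to 2 shrinks the unigram vocabulary from 8 terms to 3 and leaves the protein paper with no features at all. It is still in the corpus, but the representation can no longer see it.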

### Reflection (live discussion)

- Do the apparent separations persist?
- Do the clouds rotate / smear / overlap?
- If your *interpretation* changes when you tweak `ngram_range`, what does that imply?

A useful conclusion is not "regions differ."
A useful conclusion is:
> The perception of difference is sensitive to representational choices.

## 10) Optional: a lightweight "topic lens" (top terms by cluster)

We'll do a simple KMeans clustering on the TF–IDF vectors, then interpret clusters by top terms.
This is *not* state-of-the-art topic modeling; it's a transparent, geometry-driven baseline.

In [None]:
from sklearn.cluster import KMeans

k = 8
kmeans = KMeans(n_clusters=k, random_state=7, n_init="auto")
labels = kmeans.fit_predict(X)

df["cluster"] = labels

In [None]:
df["cluster"].value_counts().sort_index()

In [None]:
# Show top terms per cluster (centroid weights)
terms = vectorizer.get_feature_names_out()
centroids = kmeans.cluster_centers_

topn = 12
for c in range(k):
    top_idx = np.argsort(centroids[c])[-topn:][::-1]
    top_terms = [terms[i] for i in top_idx]
    print(f"\nCluster {c} — top terms:")
    print(", ".join(top_terms))

### Region composition by cluster (a first aggregation)

This is where "perceived agendas" can appear.
Notice how quickly the table invites narrative — and how dependent it is on upstream decisions.

In [None]:
ct = pd.crosstab(df["cluster"], df["region"], normalize="index")
ct.round(3)

In [None]:
# pandas creates its own figure here, so a separate plt.figure() call would only leave an empty one behind
ct.plot(kind="bar", stacked=True, figsize=(10, 5))
plt.title("Cluster composition by region (row-normalized)")
plt.xlabel("Cluster")
plt.ylabel("Share within cluster")
plt.legend(title="Region", bbox_to_anchor=(1.02, 1), loc="upper left")
plt.tight_layout()
plt.show()
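
A row-normalized table shows which region is largest in each cluster, but not whether that share is any different from the region's share of the whole corpus. A minimal sketch of that comparison ("lift") on invented labels follows; the counts are made up for illustration and carry no empirical meaning.

```python
from collections import Counter

# Hypothetical (cluster, region) pairs, invented for illustration only
pairs = (
    [(0, "United States")] * 6 + [(0, "Europe")] * 3 + [(0, "China")] * 1
    + [(1, "United States")] * 10 + [(1, "Europe")] * 6 + [(1, "China")] * 4
)

region_totals = Counter(region for _, region in pairs)
cluster_totals = Counter(cluster for cluster, _ in pairs)
cell_counts = Counter(pairs)
n = len(pairs)

# Lift = share within the cluster divided by share in the whole corpus
lift = {}
for (cluster, region), count in cell_counts.items():
    share_in_cluster = count / cluster_totals[cluster]
    share_overall = region_totals[region] / n
    lift[(cluster, region)] = share_in_cluster / share_overall

for key in sorted(lift):
    print(key, round(lift[key], 2))
```

In this toy, "United States" is the largest slice of both clusters, yet its lift stays close to 1: the bars mostly restate how many papers each region contributed.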


Region labels mostly reflect coverage and metadata, not content.

Aggregation alone does not create meaning.

## Interlude: Why This Result Is So Very Unsatisfying (and Still Useful)

- The absence of clear national differentiation is not a failure of NLP.
- It reflects a deeper issue: we are asking a question that is too broad.
- Scientific agendas are not global properties of “science”. The question was too big to succeed.


### With no evidence (hand-waving assertion)

Topics with genuine international interaction share three properties:
- They emerge within shared, contested problem spaces.
- They span heterogeneous regulatory, ethical, and institutional contexts.
- They have enough volume to support substructure.

To see agenda differences, we must condition on topic!

## Act II: Conditioning on a Shared Topic — AI Governance

In Act I, we asked a deliberately broad question:
*If we cluster recent scientific abstracts, do national research agendas emerge?*

The answer was largely **no** — and that result is informative.

Scientific agendas are not global properties of “science.”
They emerge within **shared, contested problem spaces**.

To examine whether regional differences in emphasis appear at all, we now **condition the corpus** on a single topic that is:
- internationally active
- socially and institutionally contested
- plausibly shaped by regional context

We will focus on **AI governance, ethics, and regulation**.


### Defining the Topic

Rather than relying on metadata categories, we define AI governance using a filter (below) applied directly to the abstract text.

This is a choice, not a neutral fact. Deciding what to include or exclude deserves attention and justification.

The goal is not to capture the topic perfectly; this filter is too fast and loose for that. Rather, we are working toward a *shared object of inquiry* where differences in emphasis might *plausibly* emerge.


### Why We Are Changing the Question

Up to this point, we have treated **clusters** as the primary object of analysis.

That framing is no longer appropriate.

When studying research agendas, we are rarely interested in whether documents
fall into clean, separable groups.

Instead, agendas are better understood as **differences in emphasis within a shared topic**.

For the remainder of this notebook, we will treat:
- *clusters* as optional diagnostics
- *term salience and semantic emphasis* as the primary signals of interest


In [None]:
GOV_TERMS = [
    "governance",
    "ethics",
    "ethical",
    "fairness",
    "bias",
    "accountability",
    "transparency",
    "privacy",
    "regulation",
    "regulatory",
    "compliance",
    "risk",
    "responsible ai",
]

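# Plain substring match on the cleaned text: "risk" also matches "asterisk",
# and nothing here requires the abstract to mention AI at all.
pattern = "|".join(GOV_TERMS)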

df_gov = df[df["text"].str.contains(pattern, regex=True)].reset_index(drop=True)

print(f"Original corpus size: {len(df):,}")
print(f"AI governance subset: {len(df_gov):,}")
df_gov["region"].value_counts()
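
For comparison, a stricter variant could require whole-word matches plus at least one AI-related anchor term. The sketch below is one possible version under those assumptions: `AI_ANCHORS` is a hypothetical list I made up, and the three sample texts are invented.

```python
import re

GOV = [
    "governance", "ethics", "ethical", "fairness", "bias", "accountability",
    "transparency", "privacy", "regulation", "regulatory", "compliance",
    "risk", "responsible ai",
]
# Hypothetical anchor terms; choosing them is itself a judgment call
AI_ANCHORS = ["artificial intelligence", "machine learning", "ai", "algorithmic"]

loose_re = re.compile("|".join(GOV))  # substring match, as in the cell above
gov_re = re.compile(r"\b(?:" + "|".join(map(re.escape, GOV)) + r")\b")
ai_re = re.compile(r"\b(?:" + "|".join(map(re.escape, AI_ANCHORS)) + r")\b")

# Invented, already-cleaned sample texts
texts = [
    "we audit bias and transparency in machine learning systems",
    "cardiovascular risk factors in older adults",
    "an asterisk marks significant results",
]

loose = [bool(loose_re.search(t)) for t in texts]
strict = [bool(gov_re.search(t) and ai_re.search(t)) for t in texts]
print("loose: ", loose)   # [True, True, True]
print("strict:", strict)  # [True, False, False]
```

The trade-off runs both ways: whole-word matching also drops plurals such as "risks" or "biases" unless they are added to the list, so a stricter filter is a different choice, not a more neutral one.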


Before modeling, we quickly inspect what this conditioning step produced.

This helps answer two questions:
1. Did we meaningfully narrow the corpus?
2. Is the subset still internationally mixed?


In [None]:
for i in range(3):
    print("\n" + "—" * 40)  # one separator line, not 40 one-character lines
    print(df_gov.loc[i, "title"])
    print(df_gov.loc[i, "region"])
    print(df_gov.loc[i, "abstract"][:400], "…")


Importantly, we did not change the modeling pipeline.

We reuse:
- the same cleaning decisions
- the same TF–IDF configuration
- the same dimensionality reduction

Any differences we observe now come from **conditioning**, not from new machinery.


In [None]:
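# Note: this refits `vectorizer` in place, so its vocabulary now comes from the
# governance subset only. Re-running earlier cells afterwards will use that vocabulary.
X_gov = vectorizer.fit_transform(df_gov["text"])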
print("TF–IDF matrix (governance subset):", X_gov.shape)


We again project the TF–IDF space into two dimensions.

Remember this is not some sort of projection of "truth". It's a lens, and here we are zooming in by restricting the question.


In [None]:
Z_gov = TruncatedSVD(n_components=2, random_state=1955).fit_transform(X_gov)

plt.figure(figsize=(8, 6))
for region, sub in df_gov.assign(x=Z_gov[:,0], y=Z_gov[:,1]).groupby("region"):
    plt.scatter(sub["x"], sub["y"], s=14, alpha=0.7, label=region)

plt.title("AI Governance Abstracts — TF–IDF Projection")
plt.xlabel("Component 1")
plt.ylabel("Component 2")
plt.legend(markerscale=2)
plt.show()


At this point, the question changes.

We are no longer asking:
“Do clusters correspond to countries?”

Instead, we ask:
**Within a shared topic, how does emphasis differ by region?**

One simple way to examine emphasis is to compare
*which terms are most characteristic* of documents from each region.


In [None]:
feature_names = vectorizer.get_feature_names_out()

def top_terms_for_region(region, top_n=12):
    sub = df_gov[df_gov["region"] == region]
    if len(sub) == 0:
        return []
    X_sub = vectorizer.transform(sub["text"])
    mean_tfidf = X_sub.mean(axis=0).A1
    top_idx = mean_tfidf.argsort()[-top_n:][::-1]
    return [feature_names[i] for i in top_idx]

for r in ["United States", "Europe", "China", "Other"]:
    print(f"\nTop terms — {r}")
    print(top_terms_for_region(r))
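
Mean TF–IDF surfaces the terms that are most *salient* within a region, and those often overlap across regions. A term is *characteristic* only relative to the other regions. The sketch below contrasts the two on invented numbers; the regions "A" and "B", the terms, and the weights are all hypothetical.

```python
terms = ["model", "privacy", "safety", "audit"]

# Hypothetical TF-IDF rows (region label, weights per term), invented for illustration
rows = [
    ("A", [0.8, 0.7, 0.0, 0.4]),
    ("A", [0.7, 0.6, 0.1, 0.6]),
    ("B", [0.8, 0.1, 0.7, 0.1]),
    ("B", [0.7, 0.2, 0.6, 0.0]),
]

def mean_vector(vectors):
    """Column-wise mean of a list of equal-length weight vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def salient_and_distinctive(region):
    """Top term by mean weight, and top term by (in-region mean - out-of-region mean)."""
    inside = mean_vector([v for r, v in rows if r == region])
    outside = mean_vector([v for r, v in rows if r != region])
    salient = max(range(len(terms)), key=lambda i: inside[i])
    distinctive = max(range(len(terms)), key=lambda i: inside[i] - outside[i])
    return terms[salient], terms[distinctive]

results = {region: salient_and_distinctive(region) for region in ("A", "B")}
for region, (salient, distinctive) in results.items():
    print(f"region {region}: most salient = {salient}, most distinctive = {distinctive}")
```

Both toy regions share the same most salient term, while the contrast picks out a different term for each. A simple mean difference is only one of several possible contrast measures, and it inherits every upstream choice made so far.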


### What Changed (and a Little of Why It Matters)

When we analyzed all scientific abstracts, national agendas did not emerge. 

When we conditioned on a **shared, contested topic**, did differences in emphasis appear? And if they did, are they the same thing as national priorities? (Spoiler: no.)

What it does mean:
- conditioning choices shape what structure becomes visible
- representation choices shape how that structure appears
- aggregation turns emphasis into narrative

This tension between insight and overinterpretation
is central to how NLP is used in policy, strategy, and research analysis.
