[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/tunnel-ai/way/blob/main/notebooks/05_01_exercise_guided.ipynb)

# Module 5 — NLP: Exercise (Guided)
## `05_01_exercise_guided.ipynb`

**Goal:** Reproduce *Act II* from `05_00_main.ipynb` (AI governance subset), then perform **one controlled perturbation** to test whether the *agenda-as-emphasis* signal is robust.

### What you will do
1. Load the cached OpenAlex sample created in `05_00_main.ipynb`
2. Filter to **AI governance** abstracts (keyword conditioning)
3. Build a TF–IDF representation (**baseline: unigrams + bigrams**)
4. Compute **top characteristic terms by region**
5. Make **one change** (perturbation): **unigrams only**
6. Compare results and write a short reflection

### What you will not do
- You will not invent a new topic definition (that is for `05_02`)
- You will not chase metrics (this is about interpretability and robustness)


## 0) Colab-first setup

This notebook is designed to run in Colab from the `tunnel-ai/way` GitHub repo.

- The first cell clones the repo and sets the working directory.
- We then load the cached dataset produced by `05_00_main.ipynb`.

If you have not run `05_00_main.ipynb` yet, do that first so the cache exists.


In [None]:
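import os

# Clone the course repo once, so re-running this cell does not nest a second clone
if os.path.basename(os.getcwd()) != "way":
    if not os.path.isdir("way"):
        !git clone -q https://github.com/tunnel-ai/way.git
    # Move into the repo
    %cd way

# Remaining imports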
import re
import textwrap
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.feature_extraction.text import TfidfVectorizer


## 1) Load the cached OpenAlex sample

`05_00_main.ipynb` caches a CSV at:

- `assets/data/openalex_abstracts_sample.csv`

We will load that file here.

If the file is missing, run `05_00_main.ipynb` first.


In [None]:
DATA_PATH = "assets/data/openalex_abstracts_sample.csv"
assert os.path.exists(DATA_PATH), (
    f"Missing cache: {DATA_PATH}\n"
    "Fix: run notebooks/05_00_main.ipynb first to generate the cached sample."
)

df = pd.read_csv(DATA_PATH)
print("Loaded rows:", len(df))
df.head(3)
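Later cells read the `title`, `abstract`, and `region` columns, so it can help to fail early if the cache was built with a different schema. The following is a minimal sketch of such a check; `REQUIRED_COLUMNS` and `missing_columns` are hypothetical names, not part of the course repo.

```python
# Columns that later cells in this notebook read from the cached CSV.
REQUIRED_COLUMNS = ("title", "abstract", "region")

def missing_columns(columns, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent, in order."""
    present = set(columns)
    return [name for name in required if name not in present]

# Toy illustration with a made-up header.
example_missing = missing_columns(["title", "abstract", "year"])
print(example_missing)
```

In the notebook you would pass `df.columns` and stop if the returned list is not empty.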

## 2) Minimal cleaning (same philosophy as `05_00`)

Key principle:

> Cleaning is an argument about what information you consider irrelevant.

We keep cleaning light and transparent:
- lowercase
- remove URLs
- keep letters, spaces, and hyphens
- normalize whitespace


In [None]:
url_pat = re.compile(r"https?://\S+|www\.\S+")
multi_space_pat = re.compile(r"\s+")

def clean_text(s: str) -> str:
    if not isinstance(s, str):
        return ""
    s = s.strip()
    s = url_pat.sub(" ", s)
    s = s.lower()
    s = re.sub(r"[^a-z\s\-]", " ", s)     # keep letters, spaces, hyphens
    s = multi_space_pat.sub(" ", s).strip()
    return s

# Use the cached abstract column
df["text"] = df["abstract"].apply(clean_text)

# Quick peek
i = 0
print(df.loc[i, "title"])
print("\nRAW:\n", textwrap.shorten(str(df.loc[i, "abstract"]), width=320, placeholder="…"))
print("\nCLEAN:\n", textwrap.shorten(str(df.loc[i, "text"]), width=320, placeholder="…"))
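To make the "cleaning is an argument" principle concrete, here is a self-contained sketch that applies the same rules to two made-up strings. Note what is lost: digits, punctuation, and any letter outside `a-z`, including accented characters.

```python
import re

url_pat = re.compile(r"https?://\S+|www\.\S+")
multi_space_pat = re.compile(r"\s+")

def clean_text(s):
    """Same rules as the cell above: drop URLs, lowercase, keep a-z, spaces, hyphens."""
    if not isinstance(s, str):
        return ""
    s = url_pat.sub(" ", s.strip())
    s = s.lower()
    s = re.sub(r"[^a-z\s\-]", " ", s)
    return multi_space_pat.sub(" ", s).strip()

# Made-up inputs for illustration only.
cleaned = clean_text("See https://example.org for AI-driven Risk scores (2024)!")
accented = clean_text("Résumé")
print(cleaned)
print(accented)
```

The second string shows that non-English text is fragmented by the `a-z` rule, which is one of the choices this cleaning step commits you to.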

## 3) Condition on a shared topic: AI governance

We define an AI governance subset using a **transparent keyword filter**.

This is not perfect: terms are matched as substrings, so some off-topic abstracts will slip in. It is a deliberate, inspectable construction of a corpus.


In [None]:
GOV_TERMS = [
    "governance",
    "ethics",
    "ethical",
    "fairness",
    "bias",
    "accountability",
    "transparency",
    "privacy",
    "regulation",
    "regulatory",
    "compliance",
    "risk",
    "responsible ai",
]

pattern = "|".join(GOV_TERMS)

df_gov = df[df["text"].str.contains(pattern, regex=True)].reset_index(drop=True)

print(f"Original corpus size: {len(df):,}")
print(f"AI governance subset: {len(df_gov):,}")
df_gov["region"].value_counts(dropna=False)
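The filter above joins the terms into one regex alternation, which matches anywhere inside a word. The sketch below contrasts that with a hypothetical stricter variant that requires word boundaries. It uses a shortened term list and made-up sentences, and it is not the filter used in `05_00_main.ipynb`.

```python
import re

GOV_TERMS = ["bias", "risk", "ethics", "responsible ai"]  # shortened list for the demo

# Substring alternation, as in the cell above.
substring_pat = re.compile("|".join(GOV_TERMS))

# Hypothetical stricter variant: whole words or phrases only.
boundary_pat = re.compile(r"\b(?:" + "|".join(map(re.escape, GOV_TERMS)) + r")\b")

texts = [
    "we audit bias in hiring models",
    "values are marked with an asterisk",
    "biased estimators in small samples",
]

substring_hits = [bool(substring_pat.search(t)) for t in texts]
boundary_hits = [bool(boundary_pat.search(t)) for t in texts]
print(substring_hits)
print(boundary_hits)
```

Neither choice is neutral: the substring version admits `asterisk`, while the boundary version also drops `biased`. For this exercise, keep the original filter so your results stay comparable with `05_00`.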

### Quick inspection

Before modeling, inspect a few examples to confirm the subset “looks like” governance.

**TODO:** Read 2–3 abstracts. Do they seem on-topic?


In [None]:
for i in range(min(3, len(df_gov))):
    print("\n" + "—"*50)
    print(df_gov.loc[i, "title"])
    print("Region:", df_gov.loc[i, "region"])
    print(textwrap.shorten(str(df_gov.loc[i, "abstract"]), width=380, placeholder="…"))

## 4) Baseline representation: TF–IDF (unigrams + bigrams)

We reuse a simple TF–IDF baseline with explicit choices:
- English stopwords
- `min_df` / `max_df` to control vocabulary extremes
- **(1,2)** n-grams (unigrams + bigrams)

This is our *baseline* representation for the exercise.
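As a rough illustration of what `min_df` and `max_df` do, the sketch below applies the pruning rule to made-up document frequencies. It assumes an integer `min_df` is read as a document count and a float `max_df` as a proportion of documents, which is how scikit-learn documents these parameters; `keep_term` is a hypothetical helper.

```python
def keep_term(doc_freq, n_docs, min_df=5, max_df=0.9):
    """Keep a term if it appears in at least min_df docs and at most max_df * n_docs docs."""
    return min_df <= doc_freq <= max_df * n_docs

n_docs = 200  # hypothetical subset size
doc_freqs = {  # made-up document frequencies
    "governance": 195,
    "privacy": 60,
    "algorithmic accountability": 7,
    "homomorphic": 2,
}

kept = [term for term, count in doc_freqs.items() if keep_term(count, n_docs)]
print(kept)
```

Here `governance` is dropped for being nearly universal in the subset, and `homomorphic` is dropped for being too rare.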


In [None]:
vectorizer_base = TfidfVectorizer(
    stop_words="english",
    min_df=5,
    max_df=0.9,
    ngram_range=(1, 2),
)

X_base = vectorizer_base.fit_transform(df_gov["text"])
feature_names_base = vectorizer_base.get_feature_names_out()

print("TF–IDF baseline matrix:", X_base.shape)
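To see what features `ngram_range=(1, 2)` produces, here is a simplified standard-library sketch of word n-gram extraction. It uses scikit-learn's documented default token pattern, but the stop-word set is a tiny stand-in and `word_ngrams` is a hypothetical helper, not the library's implementation.

```python
import re

TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")  # default token pattern in scikit-learn
TOY_STOP_WORDS = {"and", "of", "the", "in", "for", "to"}  # stand-in, not the real list

def word_ngrams(text, ngram_range=(1, 2), stop_words=TOY_STOP_WORDS):
    """Tokenize, drop stop words, then emit all n-grams for n in the given range."""
    tokens = [t for t in TOKEN_PATTERN.findall(text) if t not in stop_words]
    lo, hi = ngram_range
    grams = []
    for n in range(lo, hi + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

text = "responsible ai and data-driven governance"
both = word_ngrams(text, (1, 2))
unigrams = word_ngrams(text, (1, 1))
print(both)
print(unigrams)
```

Two things are worth noticing. The hyphen kept during cleaning does not survive tokenization, so `data-driven` becomes `data` and `driven`. Bigrams are also built after stop-word removal, so `ai data` appears even though the two words were not adjacent in the text.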

## 5) Agenda-as-emphasis: characteristic terms by region (baseline)

We treat “agenda” as **differences in emphasis within a shared topic**.

One proxy for emphasis:
> which terms have the highest average TF–IDF weight within a region’s documents?

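**TODO:** Run the cell and scan the output. What differences do you notice? Because terms are ranked by their mean weight within each region, not by contrast with other regions, terms common to the whole subset can appear in every list; focus on what differs.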


In [None]:
def top_terms_by_region(df_sub, X_sub, feature_names, region, top_n=12):
    mask = (df_sub["region"] == region).to_numpy()
    if mask.sum() == 0:
        return []
    # np.asarray(...).ravel() works for both sparse matrices and sparse arrays
    mean_tfidf = np.asarray(X_sub[mask].mean(axis=0)).ravel()
    top_idx = mean_tfidf.argsort()[-top_n:][::-1]
    return [feature_names[i] for i in top_idx]

REGIONS = ["United States", "Europe", "China", "Other"]

baseline_terms = {}
for r in REGIONS:
    baseline_terms[r] = top_terms_by_region(df_gov, X_base, feature_names_base, r, top_n=12)
    print(f"\nBaseline top terms — {r}")
    print(baseline_terms[r])
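The ranking logic is easier to check on a matrix small enough to read. This standard-library sketch reproduces the same idea (mean weight per term over one region's rows, then the largest means) with made-up weights; `top_terms` is a hypothetical stand-in for `top_terms_by_region`.

```python
def top_terms(rows, labels, feature_names, region, top_n=2):
    """Rank features by their mean weight over the rows labelled `region`."""
    selected = [row for row, label in zip(rows, labels) if label == region]
    if not selected:
        return []
    means = [sum(col) / len(selected) for col in zip(*selected)]
    order = sorted(range(len(means)), key=lambda j: means[j], reverse=True)
    return [feature_names[j] for j in order[:top_n]]

feature_names = ["privacy", "risk", "fairness"]
rows = [  # made-up TF-IDF weights, one row per document
    [0.8, 0.1, 0.2],
    [0.6, 0.3, 0.0],
    [0.1, 0.9, 0.3],
]
labels = ["Europe", "Europe", "United States"]

europe_top = top_terms(rows, labels, feature_names, "Europe")
us_top = top_terms(rows, labels, feature_names, "United States")
china_top = top_terms(rows, labels, feature_names, "China")
print(europe_top, us_top, china_top)
```

A region with no documents returns an empty list, so an empty row in your output usually means the label in `REGIONS` does not match the values in the `region` column.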

## 6) Controlled perturbation: unigrams only

Now we make **one change**:

- Baseline: `ngram_range=(1,2)`
- Perturbation: `ngram_range=(1,1)`

Everything else stays fixed.

This tests whether the perceived “agenda” signal is robust to a reasonable representational change.
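Since `min_df` and `max_df` are unchanged, the same unigrams should survive in both vocabularies; what changes is that bigram columns disappear. That still shifts unigram weights, because `TfidfVectorizer` L2-normalizes each document by default. A minimal sketch with made-up raw weights:

```python
import math

def l2_normalize(weights):
    """Scale a document's weights so the vector has unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights] if norm else list(weights)

# Hypothetical raw weights for one document: two unigrams, then one bigram.
unigram_only = l2_normalize([3.0, 4.0])
with_bigram = l2_normalize([3.0, 4.0, 12.0])
print(unigram_only)
print(with_bigram)
```

The two unigrams keep their relative order within the document, but their absolute weights shrink when a heavily weighted bigram shares the row. Averaged over many documents, this can reorder a region's top terms.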


In [None]:
vectorizer_uni = TfidfVectorizer(
    stop_words="english",
    min_df=5,
    max_df=0.9,
    ngram_range=(1, 1),
)

X_uni = vectorizer_uni.fit_transform(df_gov["text"])
feature_names_uni = vectorizer_uni.get_feature_names_out()

print("TF–IDF unigram matrix:", X_uni.shape)

## 7) Recompute characteristic terms by region (perturbation)

**TODO:** Compare these lists to the baseline. What persisted? What changed?


In [None]:
perturbed_terms = {}
for r in REGIONS:
    perturbed_terms[r] = top_terms_by_region(df_gov, X_uni, feature_names_uni, r, top_n=12)
    print(f"\nUnigrams-only top terms — {r}")
    print(perturbed_terms[r])

## 8) Side-by-side comparison table

This table is the core artifact of the exercise.

**TODO:** Read the table row by row. Identify at least:
- 1 signal that persists across both representations
- 1 signal that appears representation-dependent


In [None]:
rows = []
for r in REGIONS:
    rows.append({
        "Region": r,
        "Baseline (1–2 grams)": ", ".join(baseline_terms.get(r, [])),
        "Perturbed (unigrams)": ", ".join(perturbed_terms.get(r, [])),
    })

comp = pd.DataFrame(rows)
comp
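If reading the table by eye feels subjective, one optional aid is a simple overlap score per region. The sketch below is an assumption about how you might quantify "persistence", not part of the original exercise. It compares only the unigrams, since bigrams cannot appear in the perturbed list, and `term_overlap` is a hypothetical helper.

```python
def term_overlap(baseline, perturbed):
    """Return shared unigrams and their Jaccard similarity across the two lists."""
    base_unigrams = {t for t in baseline if " " not in t}
    pert = set(perturbed)
    shared = base_unigrams & pert
    union = base_unigrams | pert
    jaccard = len(shared) / len(union) if union else 0.0
    return sorted(shared), jaccard

# Made-up term lists for illustration.
shared, score = term_overlap(
    ["privacy", "data protection", "risk", "regulation"],
    ["privacy", "risk", "data", "protection"],
)
print(shared, score)
```

In the notebook you could call it with `baseline_terms[r]` and `perturbed_terms[r]` for each region. A score is a starting point for the reflection, not a replacement for reading the terms.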

## 9) Reflection (short, structured)

Write short answers in markdown *in this notebook* (not in code).

### Prompt
1. In 2–3 sentences, describe the **baseline** differences in emphasis you observed across regions.
2. In 2–3 sentences, describe how those differences **changed** under the unigram-only perturbation.
3. In 2–3 sentences, assess whether the perceived agenda signal appears **robust, fragile, or representation-dependent**.

Remember:
- We are not claiming to discover “national agendas.”
- We are examining how *representations* shape the perception of difference within a shared topic.
