[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/tunnel-ai/way/blob/main/notebooks/05_03_extension_linguistic.ipynb)

# Module 5 — NLP (Optional Extension)
## `05_03_extension_linguistic.ipynb` — Agenda as Rhetorical Style

This optional notebook is designed as **accordion time**:
- If we move quickly, we use it to extend the module.
- If we move slowly, we skip it entirely.

### Big idea
In `05_00` and `05_01`, we treated “agenda” as differences in **semantic emphasis**.

Here we ask a different question:

> Do differences also appear in **how research is written** (rhetorical style), not just what it discusses?

We focus on **AI governance** abstracts and compute simple, interpretable linguistic features:
- lexical diversity (Type–Token Ratio)
- readability proxies (Flesch–Kincaid, Gunning Fog)
- average sentence length, average word length
- part-of-speech (POS) composition

### Standalone design
This notebook **does not assume** you already ran `05_00` or `05_01`.
It will:
1. Fetch a sample of abstracts from OpenAlex (no keys) and cache locally
2. Map affiliations → (very) coarse regions (United States / Europe / China / Other / Unknown)
3. Condition on an AI governance subset
4. Compute linguistic features and compare across regions

> **Caution:** These are descriptive proxies. They do not establish causality or intent.


## 0) Colab-first setup

Run this notebook in Google Colab.

- We clone the repo so relative paths work consistently.
- We install a few lightweight NLP utilities (`textstat`, `nltk`).


In [None]:
# Clone the course repo (Colab-friendly)
!git clone -q https://github.com/tunnel-ai/way.git
%cd way

# Lightweight installs (Colab-safe)
!pip -q install textstat nltk

import os
import re
import time
import textwrap
from dataclasses import dataclass

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import requests
import nltk
import textstat

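# NLTK assets for tokenization + POS tagging.
# Newer NLTK releases look for the "_tab" / "_eng" resource names, older ones
# for the originals; requesting both keeps the cell working on either.
for resource in ("punkt", "punkt_tab",
                 "averaged_perceptron_tagger", "averaged_perceptron_tagger_eng"):
    nltk.download(resource, quiet=True)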


## 1) Fetch a sample from OpenAlex (and cache it)

We fetch a manageable sample of works that have:
- an abstract
- institutional affiliation metadata (used later for region proxy)

We cache to:
- `assets/data/openalex_abstracts_sample.csv`

If the cache exists, we load it.
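
OpenAlex serves each abstract as an inverted index (every token mapped to the positions where it occurs) rather than as plain text, so the fetch code has to rebuild the word order. A minimal sketch of that reconstruction on a made-up index; the name `rebuild_abstract` is illustrative and not part of any library:

```python
def rebuild_abstract(inverted_index):
    """Rebuild plain text from a token -> positions mapping."""
    if not inverted_index:
        return None
    # Place every token at each position it occupies
    # (the first token seen wins if two claim the same slot).
    by_position = {}
    for token, positions in inverted_index.items():
        for p in positions:
            by_position.setdefault(p, token)
    # Read the tokens back in positional order.
    words = [by_position[p] for p in sorted(by_position)]
    return " ".join(words) or None


# Made-up example index (not real OpenAlex data).
toy_index = {"AI": [0, 3], "governance": [1], "shapes": [2], "policy": [4]}
print(rebuild_abstract(toy_index))  # AI governance shapes AI policy
```

Punctuation stays attached to its token in the index, which is why sentence boundaries can still be recovered later.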


In [None]:
DATA_DIR = "assets/data"
os.makedirs(DATA_DIR, exist_ok=True)
RAW_CACHE_PATH = os.path.join(DATA_DIR, "openalex_abstracts_sample.csv")

@dataclass
class FetchConfig:
    per_page: int = 200
    max_works: int = 2500         # modest, Colab-friendly
    from_year: int = 2022
    mailto: str | None = None     # optional

CFG = FetchConfig(per_page=200, max_works=2500, from_year=2022, mailto=None)

BASE = "https://api.openalex.org/works"

def build_openalex_url(cursor="*"):
    params = {
        "per-page": CFG.per_page,
        "cursor": cursor,
        "filter": f"has_abstract:true,from_publication_date:{CFG.from_year}-01-01",
        "sort": "publication_date:desc",
    }
    if CFG.mailto:
        params["mailto"] = CFG.mailto
    return BASE, params

def inverted_index_to_text(inv):
    if inv is None or not isinstance(inv, dict) or len(inv) == 0:
        return None
    max_pos = 0
    for token, positions in inv.items():
        if positions:
            max_pos = max(max_pos, max(positions))
    tokens = [""] * (max_pos + 1)
    for token, positions in inv.items():
        for p in positions:
            if 0 <= p < len(tokens) and tokens[p] == "":
                tokens[p] = token
    text = " ".join([t for t in tokens if t])
    return text if text.strip() else None

def extract_country_codes(work: dict) -> list[str]:
    codes = []
    for auth in work.get("authorships", []) or []:
        for inst in auth.get("institutions", []) or []:
            cc = inst.get("country_code")
            if cc:
                codes.append(cc.upper())
    return sorted(set(codes))

def fetch_openalex_sample():
    rows = []
    cursor = "*"
    fetched = 0
    while fetched < CFG.max_works:
        url, params = build_openalex_url(cursor=cursor)
        r = requests.get(url, params=params, timeout=60)
        r.raise_for_status()
        payload = r.json()

        for work in payload.get("results", []) or []:
            if fetched >= CFG.max_works:
                break

            abstract = inverted_index_to_text(work.get("abstract_inverted_index"))
            if not abstract:
                continue

            country_codes = extract_country_codes(work)

            # None-safe access for primary location
            loc = work.get("primary_location") or {}
            src = (loc.get("source") or {})

            rows.append({
                "openalex_id": work.get("id"),
                "doi": work.get("doi"),
                "title": work.get("title"),
                "publication_date": work.get("publication_date"),
                "primary_location": src.get("display_name"),
                "country_codes": "|".join(country_codes) if country_codes else "",
                "n_country_codes": len(country_codes),
                "abstract": abstract,
                "type": work.get("type"),
                "cited_by_count": work.get("cited_by_count"),
            })
            fetched += 1

        cursor = payload.get("meta", {}).get("next_cursor")
        if not cursor:
            break
        time.sleep(0.15)

    return pd.DataFrame(rows)

if os.path.exists(RAW_CACHE_PATH):
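    # keep_default_na=False stops pandas from reading Namibia's code "NA" as missing;
    # empty cells are still treated as NaN via na_values.
    df = pd.read_csv(RAW_CACHE_PATH, keep_default_na=False, na_values=[""])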
    print(f"Loaded cached sample: {len(df):,} rows from {RAW_CACHE_PATH}")
else:
    df = fetch_openalex_sample()
    print(f"Fetched sample: {len(df):,} rows")
    df.to_csv(RAW_CACHE_PATH, index=False)
    print(f"Saved cache to {RAW_CACHE_PATH}")

df.head(3)

## 2) From affiliations to regions (a proxy)

We map country codes to broad regions:
- United States
- China
- Europe (broad)
- Other
- Unknown

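**Reminder:** affiliation ≠ nationality; region ≠ agenda; this is a coarse proxy to enable comparison. A work with affiliations in several countries still gets a single label, checked in the order United States, China, Europe, Other.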


In [None]:
EUROPE_CODES = {
    "AT","BE","BG","HR","CY","CZ","DK","EE","FI","FR","DE","GR","HU","IE","IT","LV","LT","LU",
    "MT","NL","PL","PT","RO","SK","SI","ES","SE",
    "GB","UK",
    "NO","CH","IS","LI",
    "UA","TR","RS","BA","ME","MK","AL","MD","BY","GE","AM","AZ",
}

def map_region(country_codes_str: str) -> str:
    if not isinstance(country_codes_str, str) or country_codes_str.strip() == "":
        return "Unknown"
    codes = {c.strip().upper() for c in country_codes_str.split("|") if c.strip()}
    if "US" in codes:
        return "United States"
    if "CN" in codes:
        return "China"
    if len(codes & EUROPE_CODES) > 0:
        return "Europe"
    return "Other"

df["region"] = df["country_codes"].apply(map_region)
df["region"].value_counts(dropna=False)
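
Because `map_region` returns a single label per work, multi-country collaborations are folded into whichever region is checked first. One way to see how often that matters is to count how many coarse regions each work spans. The sketch below is an assumption-laden illustration: `regions_present` and the trimmed `EUROPE_DEMO` set are hypothetical names, and the sample rows are made up.

```python
from collections import Counter

# Assumption: a trimmed Europe set for illustration only; the notebook's
# EUROPE_CODES is larger.
EUROPE_DEMO = {"DE", "FR", "GB", "NL"}


def regions_present(country_codes_str):
    """Return every coarse region that appears in a 'US|DE|CN'-style string."""
    if not isinstance(country_codes_str, str) or not country_codes_str.strip():
        return {"Unknown"}
    codes = {c.strip().upper() for c in country_codes_str.split("|") if c.strip()}
    found = set()
    for code in codes:
        if code == "US":
            found.add("United States")
        elif code == "CN":
            found.add("China")
        elif code in EUROPE_DEMO:
            found.add("Europe")
        else:
            found.add("Other")
    return found


# Made-up rows standing in for df["country_codes"].
sample = ["US", "US|CN", "DE|FR", "CN|GB|BR", ""]
span_counts = Counter(len(regions_present(s)) for s in sample)
print(span_counts)  # 3 rows span one region, 1 spans two, 1 spans three
```

Rows that span more than one region are exactly the ones where the precedence order decides the label, so a large share of them would be a reason to read regional comparisons more cautiously.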

## 3) Minimal text cleaning

For linguistic features, we want something close to the original text, but we still:
- remove URLs
- normalize whitespace

We do **not** aggressively strip punctuation here, because sentence boundaries matter.
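
A small sketch of what this cleaning does to sentence boundaries, using the same two regular expressions on made-up strings (`clean_demo` is an illustrative name). It also shows one caveat: the `\S+` in the URL pattern consumes a period that directly follows a URL.

```python
import re

url_pat = re.compile(r"https?://\S+|www\.\S+")
multi_space_pat = re.compile(r"\s+")


def clean_demo(s):
    """Drop URLs, then collapse runs of whitespace into single spaces."""
    s = url_pat.sub(" ", s.strip())
    return multi_space_pat.sub(" ", s).strip()


# Ordinary case: punctuation and sentence boundaries survive.
ordinary = clean_demo("Code is at https://example.org/repo  and is open.\nWe test it.")
print(ordinary)

# Caveat: the period glued to the URL is removed along with it,
# so the two sentences merge.
merged = clean_demo("See https://example.org. We test it.")
print(merged)
```

For abstracts this edge case is probably rare, but it is worth knowing about when sentence counts feed into readability scores.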


In [None]:
url_pat = re.compile(r"https?://\S+|www\.\S+")
multi_space_pat = re.compile(r"\s+")

def clean_for_style(s: str) -> str:
    if not isinstance(s, str):
        return ""
    s = s.strip()
    s = url_pat.sub(" ", s)
    s = multi_space_pat.sub(" ", s).strip()
    return s

df["text"] = df["abstract"].apply(clean_for_style)

# quick peek
i = 0
print(df.loc[i, "title"])
print(df.loc[i, "region"])
print(textwrap.shorten(df.loc[i, "text"], width=420, placeholder="…"))

## 4) Condition on a shared topic: AI governance

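We define an AI governance subset using a transparent keyword filter: an abstract is kept if it contains any of the terms below, matched as case-insensitive substrings. The filter does not separately require an AI term, and short terms such as “risk” also match inside longer words, so expect some off-topic abstracts.

This is a **topic conditioning** step (same spirit as Act II in `05_00`).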
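
If the keyword subset turns out noisy, a stricter variant could anchor terms at word starts and additionally require an AI term. The sketch below is one possible tightening, not the notebook's method: `AI_TERMS_DEMO`, `term_pattern`, and `is_ai_governance` are hypothetical names, and the term lists are shortened for illustration.

```python
import re

GOV_TERMS_DEMO = ["governance", "ethics", "bias", "risk", "responsible ai"]
# Assumption: an illustrative AI vocabulary; the notebook defines no such list.
AI_TERMS_DEMO = ["artificial intelligence", "machine learning", "ai"]


def term_pattern(terms, whole_word=False):
    """Compile a case-insensitive pattern anchored at word starts.

    With whole_word=True the match must also end at a word boundary.
    """
    body = "|".join(re.escape(t) for t in terms)
    tail = r"\b" if whole_word else ""
    return re.compile(r"\b(?:" + body + r")" + tail, re.IGNORECASE)


# Leading boundary only: "risk" still matches "risks" but not "asterisk".
gov_re = term_pattern(GOV_TERMS_DEMO)
# Whole words: "ai" must stand alone, so "aim" or "air" do not count.
ai_re = term_pattern(AI_TERMS_DEMO, whole_word=True)


def is_ai_governance(text):
    """True when the text has both a governance term and an AI term."""
    return bool(gov_re.search(text)) and bool(ai_re.search(text))


examples = [
    "We study AI governance and regulatory risks.",
    "Each asterisk marks a significant result for the AI model.",
    "Cardiovascular risk factors in older adults.",
]
flags = [is_ai_governance(t) for t in examples]
print(flags)  # [True, False, False]
```

The trade-off is recall: anchoring at word starts also drops forms such as “unbiased”, so it is worth comparing subset sizes before and after any tightening.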


In [None]:
GOV_TERMS = [
    "governance",
    "ethics",
    "ethical",
    "fairness",
    "bias",
    "accountability",
    "transparency",
    "privacy",
    "regulation",
    "regulatory",
    "compliance",
    "risk",
    "responsible ai",
]

pattern = "|".join(GOV_TERMS)

df_gov = df[df["text"].str.lower().str.contains(pattern, regex=True)].reset_index(drop=True)

print(f"Original corpus: {len(df):,}")
print(f"AI governance subset: {len(df_gov):,}")
df_gov["region"].value_counts(dropna=False)

### Quick inspection (optional)

Skim a few abstracts to ensure the subset is on-topic.


In [None]:
for i in range(min(3, len(df_gov))):
    print("\n" + "—"*50)
    print(df_gov.loc[i, "title"])
    print("Region:", df_gov.loc[i, "region"])
    print(textwrap.shorten(df_gov.loc[i, "text"], width=420, placeholder="…"))

## 5) Linguistic feature extraction

We compute simple, interpretable features:

- **Type–Token Ratio (TTR):** lexical diversity proxy  
- **Readability (Textstat):** Flesch–Kincaid grade, Gunning Fog  
- **Avg sentence length:** words per sentence  
- **Avg word length:** characters per word  
- **POS composition (NLTK):** share of word tokens tagged as nouns, verbs, adjectives, adverbs

> These are proxies for rhetorical style. They are not “quality” measures.
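
To make the features concrete, here is a minimal sketch of the arithmetic behind three of them. The Flesch–Kincaid and Gunning Fog functions use the standard published formulas with hand-supplied counts; `textstat` does its own syllable and complex-word counting, so its scores may differ slightly. The function names and the whitespace tokenization are simplifications for illustration.

```python
def type_token_ratio(tokens):
    """Unique lowercased tokens divided by total tokens."""
    lowered = [t.lower() for t in tokens]
    return len(set(lowered)) / len(lowered) if lowered else float("nan")


def flesch_kincaid_grade(n_words, n_sentences, n_syllables):
    """Standard Flesch-Kincaid grade-level formula."""
    return 0.39 * (n_words / n_sentences) + 11.8 * (n_syllables / n_words) - 15.59


def gunning_fog(n_words, n_sentences, n_complex_words):
    """Standard Gunning Fog formula; complex words have three or more syllables."""
    return 0.4 * ((n_words / n_sentences) + 100 * (n_complex_words / n_words))


# Simplified whitespace tokenization (the notebook uses NLTK instead).
tokens = "the model audits the model and the data".split()
print(type_token_ratio(tokens))  # 5 unique / 8 total = 0.625

# Made-up counts: 100 words, 5 sentences, 150 syllables, 10 complex words.
print(round(flesch_kincaid_grade(100, 5, 150), 2))
print(round(gunning_fog(100, 5, 10), 2))
```

Note that TTR tends to fall as texts get longer, so comparing it across abstracts of very different lengths needs care.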


In [None]:
def safe_word_tokens(text: str):
    # Basic tokenization via NLTK
    return [t for t in nltk.word_tokenize(text) if re.search(r"[A-Za-z]", t)]

def compute_ttr(text: str) -> float:
    toks = [t.lower() for t in safe_word_tokens(text)]
    if len(toks) == 0:
        return np.nan
    return len(set(toks)) / len(toks)

def avg_sentence_length(text: str) -> float:
    sents = nltk.sent_tokenize(text)
    if len(sents) == 0:
        return np.nan
    lens = []
    for s in sents:
        toks = safe_word_tokens(s)
        if len(toks) > 0:
            lens.append(len(toks))
    return float(np.mean(lens)) if lens else np.nan

def avg_word_length(text: str) -> float:
    toks = safe_word_tokens(text)
    if len(toks) == 0:
        return np.nan
    return float(np.mean([len(t) for t in toks]))

# POS: coarse buckets
POS_BUCKETS = {
    "NOUN": {"NN", "NNS", "NNP", "NNPS"},
    "VERB": {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"},
    "ADJ":  {"JJ", "JJR", "JJS"},
    "ADV":  {"RB", "RBR", "RBS"},
}

def pos_bucket_shares(text: str):
    toks = safe_word_tokens(text)
    if len(toks) == 0:
        return {k: np.nan for k in POS_BUCKETS}
    tags = nltk.pos_tag(toks)
    counts = {k: 0 for k in POS_BUCKETS}
    total = 0
    for _, tag in tags:
        total += 1
        for bucket, tagset in POS_BUCKETS.items():
            if tag in tagset:
                counts[bucket] += 1
                break
    # normalize to shares
    return {k: (counts[k] / total if total else np.nan) for k in counts}

def readability_fk(text: str) -> float:
    try:
        return float(textstat.flesch_kincaid_grade(text))
    except Exception:
        return np.nan

def readability_gf(text: str) -> float:
    try:
        return float(textstat.gunning_fog(text))
    except Exception:
        return np.nan

# Compute features
df_feat = df_gov[["title", "region", "text"]].copy()

df_feat["ttr"] = df_feat["text"].apply(compute_ttr)
df_feat["fk_grade"] = df_feat["text"].apply(readability_fk)
df_feat["gunning_fog"] = df_feat["text"].apply(readability_gf)
df_feat["avg_sent_len"] = df_feat["text"].apply(avg_sentence_length)
df_feat["avg_word_len"] = df_feat["text"].apply(avg_word_length)

pos_shares = df_feat["text"].apply(pos_bucket_shares)
df_feat["noun_share"] = pos_shares.apply(lambda d: d["NOUN"])
df_feat["verb_share"] = pos_shares.apply(lambda d: d["VERB"])
df_feat["adj_share"]  = pos_shares.apply(lambda d: d["ADJ"])
df_feat["adv_share"]  = pos_shares.apply(lambda d: d["ADV"])

df_feat.head(3)

## 6) Compare style features across regions

We compare distributions by region. These plots are descriptive.

**Tip:** if the subset is small for a region (e.g., China), interpret differences cautiously.
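
One way to make “interpret cautiously” concrete is to attach an uncertainty interval to each regional mean. The sketch below uses a percentile bootstrap with the standard library only; `bootstrap_mean_ci` is a hypothetical helper and the values are made up, so treat it as an illustration rather than part of the notebook's pipeline.

```python
import random
import statistics


def bootstrap_mean_ci(values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    means = []
    for _ in range(n_boot):
        # Resample n observations with replacement and record the mean.
        resample = [values[rng.randrange(n)] for _ in range(n)]
        means.append(statistics.fmean(resample))
    means.sort()
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Made-up sentence-length values for a small and a larger group.
small_group = [18.0, 25.0, 31.0, 22.0, 27.0]
large_group = small_group * 20  # same values, 100 observations

small_lo, small_hi = bootstrap_mean_ci(small_group)
large_lo, large_hi = bootstrap_mean_ci(large_group)
print(f"n=5:   [{small_lo:.1f}, {small_hi:.1f}]")
print(f"n=100: [{large_lo:.1f}, {large_hi:.1f}]")
```

The interval for the five-observation group is much wider than for the larger group, which is the practical meaning of the tip above: with few abstracts in a region, an apparent gap between means may be mostly sampling noise.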


In [None]:
FEATURES = ["ttr", "fk_grade", "gunning_fog", "avg_sent_len", "avg_word_len",
            "noun_share", "verb_share", "adj_share", "adv_share"]

# Basic summary table
summary = df_feat.groupby("region")[FEATURES].agg(["count", "mean", "median"]).round(3)
summary

### Distribution plots (accordion-friendly)

We show a simple boxplot per feature.

If time is tight, you can skip this entire section.


In [None]:
regions = [r for r in ["United States", "Europe", "China", "Other", "Unknown"] if r in df_feat["region"].unique()]

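def boxplot_feature(feature):
    data = []
    labels = []
    for r in regions:
        vals = df_feat.loc[df_feat["region"] == r, feature].dropna().to_numpy()
        if len(vals) > 0:
            data.append(vals)
            labels.append(r)
    if not data:
        # Nothing to draw (e.g., the feature is NaN for every row)
        return
    plt.figure(figsize=(10, 4))
    plt.boxplot(data, showfliers=False)
    # Set tick labels separately: the `labels=` keyword of boxplot is
    # deprecated in recent Matplotlib releases.
    plt.xticks(range(1, len(labels) + 1), labels, rotation=15)
    plt.title(f"{feature} by region (AI governance subset)")
    plt.tight_layout()
    plt.show()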

for f in FEATURES:
    boxplot_feature(f)

## 7) Optional: A single “style profile” radar chart (quick glance)

This is a compact way to compare regions on multiple normalized features at once.

- We z-score each feature across the full governance subset.
- We then plot mean z-scores by region.

If you prefer to keep things simple, skip this section.
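
The two steps can be sketched on a handful of made-up values using only the standard library. `statistics.stdev` is the sample (n − 1) standard deviation, which matches the default of pandas `Series.std()`.

```python
import statistics
from collections import defaultdict

# Made-up (region, feature value) pairs standing in for one column of df_feat.
rows = [("United States", 20.0), ("United States", 24.0),
        ("Europe", 26.0), ("Europe", 30.0)]

values = [v for _, v in rows]
mu = statistics.fmean(values)
sd = statistics.stdev(values)  # sample standard deviation (n - 1)

# Step 1: z-score every document against the pooled mean and spread.
z_rows = [(region, (v - mu) / sd) for region, v in rows]

# Step 2: average the z-scores within each region.
by_region = defaultdict(list)
for region, z in z_rows:
    by_region[region].append(z)
profile = {region: statistics.fmean(zs) for region, zs in by_region.items()}
print(profile)
```

A regional mean z-score reads as “how many pooled standard deviations this region sits above or below the overall average”, which is what puts features with different units on one radar chart.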


In [None]:
# Z-score features (across all docs)
Z = df_feat.copy()
for f in FEATURES:
    col = Z[f]
    mu = col.mean(skipna=True)
    sd = col.std(skipna=True)
    Z[f + "_z"] = (col - mu) / sd if sd and sd > 0 else np.nan

Z_FEATURES = [f + "_z" for f in FEATURES]

profile = Z.groupby("region")[Z_FEATURES].mean().dropna(how="all")

# Keep a few major regions for readability
keep = [r for r in ["United States", "Europe", "China"] if r in profile.index]
profile_small = profile.loc[keep]

profile_small

In [None]:
# Simple radar plot with matplotlib (no styling beyond defaults)
import math

labels = FEATURES
num_vars = len(labels)

angles = [n / float(num_vars) * 2 * math.pi for n in range(num_vars)]
angles += angles[:1]

plt.figure(figsize=(7, 7))
ax = plt.subplot(111, polar=True)

for r in profile_small.index:
    values = profile_small.loc[r, Z_FEATURES].to_numpy().tolist()
    values += values[:1]
    ax.plot(angles, values, linewidth=2, label=r)
    ax.fill(angles, values, alpha=0.08)

ax.set_xticks(angles[:-1])
ax.set_xticklabels(labels, fontsize=9)
ax.set_title("Region style profile (mean z-scores) — AI governance subset", y=1.08)
ax.legend(loc="upper right", bbox_to_anchor=(1.25, 1.15))
plt.tight_layout()
plt.show()

## 8) Wrap-up: what this extension adds

- In the main notebooks, we treated “agenda” as **semantic emphasis**.
- Here we treated “agenda” as **rhetorical style**.

Useful applications of this approach take some imagination, and we leave that to you. In this deliberately contrived example, the takeaway is not “region X writes better.”

A useful takeaway might be:

> Different institutional contexts may shape *how research is framed* (complexity, density, lexical diversity, rhetorical structure).

If you want to go further, a natural next step is to compare AI governance abstracts with a purely technical AI subtopic (e.g., optimization) to see whether stylistic differences increase when policy context matters.
