<a href="https://colab.research.google.com/github/rskrisel/Named-Entity-Recognition-NER-Co-Mention-Network/blob/main/NER_Live_Coding_NER_to_Network.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# 🧪 Live Coding (90 min): Named Entity Recognition (NER) + Co-Mention Networks

**Goal:** Run spaCy NER on a DataFrame of Factiva news articles (one article per row), turn entities into tidy tables, and build a **person-to-person co-mention network** (with quick centrality + interactive plot).  
We’ll keep the **policy lens** throughout: who are the key actors, which coalitions appear, how do connections shift?

**Roadmap (approx.):**
- **00–10 min:** Why NER for policy (stakeholder mapping, influence networks, crisis response, sanctions/compliance, media framing).  
- **10–65 min:** Live coding: setup → NER → tidy tables → QA → co-mention network → centralities → Plotly graph → export.  
- **65–85 min:** Policy applications, pitfalls/bias, validation.  
- **85–90 min:** Mini scavenger hunt discussion.


## 🧠 What Is Named Entity Recognition (NER)?

| **Aspect** | **Explanation** |
| --- | --- |
| **Definition** | **Named Entity Recognition (NER)** is a sub-task of *Natural Language Processing (NLP)* that automatically identifies and labels key “entities” in text, such as **people, organizations, locations, dates, and events**. |
| **Goal** | To turn unstructured text into structured data by extracting *who*, *where*, and *what* from sentences. |
| **Origins** | Modern NER grew out of early information-extraction research funded by the **Defense Advanced Research Projects Agency (DARPA)** in the late 1980s–1990s. DARPA’s **Message Understanding Conferences (MUC)** created shared tasks and datasets for entity recognition, laying the foundation for today’s NLP models. |
| **Core Models Today** | Most NER systems use transformer or statistical models trained on labeled corpora (e.g., spaCy’s `en_core_web_md`, BERT-based models, etc.) to tag entity spans with categories like `PERSON`, `ORG`, or `GPE` (geo-political entity). |
| **Why It Matters** | NER transforms large text collections (news, policy documents, social media, reports) into analyzable networks of actors, institutions, and issues. |
| **Policy & International Affairs Applications** | • **Stakeholder mapping:** Identify government officials, NGOs, and corporations in policy debates.<br>• **Crisis response:** Detect which organizations and regions are repeatedly mentioned in disaster coverage.<br>• **Disinformation tracking:** Map who or what is being referenced across misinformation networks.<br>• **Diplomatic analysis:** Trace co-mentions of leaders across international press to understand alliances or tensions.<br>• **Regulatory research:** Extract firms and sectors from financial or environmental regulations.<br>• **Legislative studies:** Track which policymakers or agencies appear together across bills or hearings. |
| **Limitations** | NER models may confuse similar names, miss multilingual variants, or reflect bias from their training data, so manual validation and name disambiguation (as you’ll practice) are essential. |


> ### 💾 **Did You Know? NER Can Build Databases**
> Named Entity Recognition isn’t just for tagging text; it’s a bridge between *unstructured language* and *structured data*.  
>
> When you extract entities (like people, organizations, and places) from thousands of documents, you can:
>
> 1. **Normalize names** – merge duplicates or variants (e.g., “U.S.” → “United States”).  
> 2. **Store entities** – save them as rows in a database or DataFrame, with columns for `entity`, `type`, `source_doc`, `date`, or `context`.  
> 3. **Add relationships** – connect entities that co-occur in the same article, sentence, or policy section.  
> 4. **Query and analyze** – use SQL, pandas, or network tools to ask:  
>    - Who are the most frequently mentioned actors?  
>    - Which organizations are linked to specific policy issues?  
>    - How do these connections evolve over time?
>
> ➜ **In practice:** NER lets analysts turn raw text into a relational database: a structured map of *who, what, and where* across an entire corpus of policy or media documents.
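
The "store, then query" idea above can be sketched with Python's built-in `sqlite3` module. This is a minimal illustration, not part of the class pipeline: the `entities` table, its columns, and the sample rows are all hypothetical stand-ins for what an NER run might produce.

```python
import sqlite3

# Hypothetical entity rows: (entity, type, source_doc, date)
rows = [
    ("Janet Yellen", "PERSON", "doc1", "2024-10-01"),
    ("IMF", "ORG", "doc1", "2024-10-01"),
    ("Janet Yellen", "PERSON", "doc2", "2024-10-03"),
    ("World Bank", "ORG", "doc2", "2024-10-03"),
    ("IMF", "ORG", "doc3", "2024-10-05"),
    ("Janet Yellen", "PERSON", "doc3", "2024-10-05"),
]

# In-memory database; the table name and schema are assumptions for this sketch
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entities (entity TEXT, type TEXT, source_doc TEXT, date TEXT)"
)
conn.executemany("INSERT INTO entities VALUES (?, ?, ?, ?)", rows)

# "Who are the most frequently mentioned actors?"
# Count each document at most once per entity.
top_actors = conn.execute(
    """
    SELECT entity, type, COUNT(DISTINCT source_doc) AS n_docs
    FROM entities
    GROUP BY entity, type
    ORDER BY n_docs DESC, entity ASC
    """
).fetchall()

for entity, etype, n_docs in top_actors:
    print(f"{entity:15s} {etype:7s} {n_docs}")
```

The same question could be asked in pandas with a `groupby`; SQL is used here only to make the "relational database" framing concrete.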


In [None]:
# --- CLEAN & PIN (run this first) ---
%pip install -U "pip<24.3" setuptools wheel

# Remove packages that force or prefer NumPy 2.x (not needed for this class)
%pip uninstall -y pytensor opencv-python opencv-contrib-python opencv-python-headless numba cupy-cuda12x tensorflow

# Satisfy IPython's dependency
%pip install "jedi>=0.18.0"

# Pin a NumPy that is ABI-compatible with spaCy wheels
%pip install "numpy==1.26.4"

# Now install the libraries we actually need
%pip install "spacy==3.7.4" "pandas<2.2" "networkx>=3.2" "plotly>=5.18"

# Download a medium English model for better NER
!python -m spacy download en_core_web_md -q

print("✅ Clean install complete. NOW go to Runtime → Restart runtime, then run the next cell.")


In [None]:
# --- Imports (post-restart) ---
import sys, re, numpy as np, pandas as pd, spacy, networkx as nx, plotly.graph_objects as go
from itertools import combinations

# Load spaCy English model
nlp = spacy.load("en_core_web_md")
nlp.max_length = 2_000_000  # or 3_000_000 for extra headroom

# Configure pandas display
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', 50)

# Verify environment
print("✅ Environment ready")
print(f"Python: {sys.version.split()[0]}")
print(f"NumPy: {np.__version__}")
print(f"spaCy: {spacy.__version__}")
print(f"pandas: {pd.__version__}")
print(f"networkx: {nx.__version__}")



## 1) Load your Factiva articles

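- Expect a CSV/Parquet with one article per row and the article text in a single column; the code below reads it from **`CombinedText`**, so rename your column or adjust the code if yours differs.  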
- Optional helpful columns: `article_id`, `date`, `source`, `section`, `headline`.  
- Replace the demo data below with your real path.


In [None]:
# If using Colab, mount Drive (optional but recommended so outputs persist)
from google.colab import drive
drive.mount('/content/drive')

In [None]:

# 🔁 Replace this with your actual load (e.g., from Drive)
df = pd.read_csv("/content/drive/MyDrive/factiva_ner_project/factiva.csv")

# Demo placeholder (remove once real data is loaded)
# df = pd.DataFrame({
#     "article_id": [1,2,3],
#     "CombinedText": [
#         "President Biden met with Ursula von der Leyen in Washington to discuss trade and AI.",
#         "Elon Musk and Tim Cook appeared in Brussels at an EU competition hearing with Margrethe Vestager.",
#         "The IMF and World Bank met in Marrakech; Kristalina Georgieva spoke with Janet Yellen."
#     ],
#     "source": ["ExampleWire","ExampleWire","ExampleWire"],
#     "date": ["2024-09-10","2024-09-12","2024-10-01"]
# })
assert "CombinedText" in df.columns and df["CombinedText"].notna().all(), "Data must include a non-null 'CombinedText' column."
print(df.head(3))
print(f"Articles loaded: {len(df)}")



## 2) Minimal cleaning + run NER (spaCy)

We keep cleaning light for NER (avoid over-normalizing names). We’ll batch with `nlp.pipe(...)` for speed.


In [None]:

def clean_text(s: str) -> str:
    return re.sub(r"\s+", " ", str(s)).strip()

df["text_clean"] = df["CombinedText"].map(clean_text)

def iter_docs(texts, batch_size=32):
    for doc in nlp.pipe(texts, batch_size=batch_size, disable=["lemmatizer","textcat"]):
        yield doc

docs = list(iter_docs(df["text_clean"]))
print("Docs processed:", len(docs))



## 3) Extract entities → tidy tables

Keep `PERSON`, `ORG`, `GPE` for policy mapping; attach `article_id` for traceability.


In [None]:

KEEP = {"PERSON","ORG","GPE"}

rows = []
# Use article_id if present; else fallback to row index
if "article_id" not in df.columns:
    df["article_id"] = np.arange(1, len(df)+1)

for art_id, doc in zip(df["article_id"], docs):
    for ent in doc.ents:
        if ent.label_ in KEEP:
            rows.append({
                "article_id": art_id,
                "entity_raw": ent.text,
                "label": ent.label_,
                "start": ent.start_char,
                "end": ent.end_char
            })
ents_df = pd.DataFrame(rows)

# Normalize a little (preserve display form too)
ents_df["entity_norm"] = (ents_df["entity_raw"]
                          .str.strip()
                          .str.replace(r"\s+", " ", regex=True)
                          .str.replace(r"[’'`]", "'", regex=True))
ents_df["entity_key"] = ents_df["entity_norm"].str.lower()

# Canonical display casing (most frequent)
canonical = (ents_df.groupby("entity_key")["entity_norm"]
             .agg(lambda x: x.value_counts().idxmax())
             .rename("entity"))
ents_df = ents_df.merge(canonical, on="entity_key", how="left")

print("Entities extracted:", len(ents_df))
ents_df.head(10)



## 4) Quick QA / sanity checks

Look at frequent entities per type. Expect some noise (titles, partial names, acronyms).


In [None]:

def top_vals(df_, label, n=15):
    s = (df_[df_["label"]==label]["entity"]
         .value_counts()
         .head(n))
    print(f"\nTop {label} entities:")
    display(s)

for lab in ["PERSON","ORG","GPE"]:
    top_vals(ents_df, lab, n=10)



## 5) Build a PERSON–PERSON co-mention network

Two people are connected if they appear **in the same article**. (Extension: connect within the same sentence for tighter links.)


In [None]:

from itertools import combinations

persons = ents_df[ents_df["label"]=="PERSON"][["article_id","entity"]].drop_duplicates()

edge_rows = []
for art_id, group in persons.groupby("article_id"):
    people = sorted(group["entity"].unique())
    for a,b in combinations(people, 2):
        edge_rows.append((a,b,art_id))

edges_df = pd.DataFrame(edge_rows, columns=["src","dst","article_id"])
edge_weights = (edges_df.groupby(["src","dst"]).size()
                .reset_index(name="weight")
                .sort_values("weight", ascending=False))

print(edge_weights.head())
print(f"Edges (unique pairs): {len(edge_weights)} | Articles contributing: {edges_df['article_id'].nunique()}")

# Build graph
G = nx.Graph()
for p in persons["entity"].unique():
    G.add_node(p, type="PERSON")
for _, row in edge_weights.iterrows():
    G.add_edge(row["src"], row["dst"], weight=int(row["weight"]))

print(f"Graph -> Nodes: {G.number_of_nodes()}, Edges: {G.number_of_edges()}")
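
The sentence-level extension mentioned above can be sketched without rerunning spaCy. The snippet assumes each PERSON mention also carries a hypothetical `sent_id` (for instance, one you might derive by looping over `doc.sents` and reading `sent.ents`); the sample mentions are made up. It uses the same pair-counting logic as the cell above and changes only the grouping key.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical PERSON mentions: (article_id, sent_id, entity)
mentions = [
    (1, 0, "Elon Musk"), (1, 0, "Tim Cook"),
    (1, 1, "Margrethe Vestager"),
    (2, 0, "Tim Cook"), (2, 0, "Elon Musk"), (2, 0, "Margrethe Vestager"),
]

def co_mention_weights(mentions, key):
    """Count unordered person pairs that share the same grouping key."""
    groups = defaultdict(set)
    for row in mentions:
        groups[key(row)].add(row[2])  # de-duplicate people within a group
    weights = Counter()
    for people in groups.values():
        for a, b in combinations(sorted(people), 2):
            weights[(a, b)] += 1
    return weights

# Article-level ties: group by article_id only
article_level = co_mention_weights(mentions, key=lambda r: r[0])
# Sentence-level ties: group by (article_id, sent_id)
sentence_level = co_mention_weights(mentions, key=lambda r: (r[0], r[1]))

print("article :", dict(article_level))
print("sentence:", dict(sentence_level))
```

Note that the weights mean different things: the article-level weight counts articles, while the sentence-level weight counts sentences, so the two networks are not directly comparable edge by edge.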



## 6) Centrality: who matters? who bridges?

Degree centrality (connectivity), betweenness (bridging), and weighted degree (total co-mentions).


In [None]:

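deg = nx.degree_centrality(G)
# networkx reads the `weight` argument of betweenness as a path length (distance),
# so pass 1/weight: pairs co-mentioned more often then count as closer, not farther apart.
for _, _, d in G.edges(data=True):
    d["distance"] = 1.0 / d["weight"]
btw = nx.betweenness_centrality(G, normalized=True, weight="distance")
deg_w = {n: sum(d["weight"] for _,_,d in G.edges(n, data=True)) for n in G.nodes()}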

cent_df = (pd.DataFrame({
    "entity": list(G.nodes()),
    "degree_centrality": [deg[n] for n in G.nodes()],
    "betweenness": [btw[n] for n in G.nodes()],
    "weighted_degree": [deg_w[n] for n in G.nodes()],
}).sort_values(["weighted_degree","degree_centrality"], ascending=False)
  .reset_index(drop=True))

cent_df.head(10)
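
To make the measures concrete, here is a hand computation of degree centrality and weighted degree on a toy graph, written in plain Python. The four nodes and their weights are hypothetical. Degree centrality is taken as the number of neighbours divided by `n - 1`, which matches how `nx.degree_centrality` normalizes.

```python
# Hypothetical weighted co-mention edges: (src, dst, weight)
edges = [("A", "B", 5), ("A", "C", 1), ("B", "C", 1), ("C", "D", 2)]

nodes = sorted({n for a, b, _ in edges for n in (a, b)})
neighbors = {n: set() for n in nodes}
weighted_degree = {n: 0 for n in nodes}

for a, b, w in edges:
    neighbors[a].add(b)
    neighbors[b].add(a)
    weighted_degree[a] += w  # total co-mentions touching this node
    weighted_degree[b] += w

# Degree centrality: share of the other n-1 nodes this node is tied to
degree_centrality = {n: len(neighbors[n]) / (len(nodes) - 1) for n in nodes}

for n in nodes:
    print(n, round(degree_centrality[n], 3), weighted_degree[n])
```

In this toy case C is tied to everyone (degree centrality 1.0) yet has a lower weighted degree than A or B, whose single strong tie dominates. C is also the only route to D, which is the kind of bridging position betweenness is meant to pick up.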


In [None]:
import re
import unicodedata
import pandas as pd
import networkx as nx

# --- 1) Start from your entities table ---
# Expect ents_df with columns: ["article_id","entity","label", ...]
people = ents_df[ents_df["label"]=="PERSON"].copy()

# --- 2) Normalization helpers ---
HONORIFICS = r"(president|pres\.|gov\.|governor|sen\.|senator|rep\.|representative|mr\.|mrs\.|ms\.|dr\.|mayor)"
HONOR_RE = re.compile(rf"^\s*{HONORIFICS}\s+", re.IGNORECASE)

def strip_accents(s: str) -> str:
    return "".join(c for c in unicodedata.normalize("NFKD", s) if not unicodedata.combining(c))

def normalize_person(name: str) -> str:
    if not isinstance(name, str):
        return ""
    n = strip_accents(name).strip()
    n = HONOR_RE.sub("", n)                      # drop titles
    n = re.sub(r"[“”\"(),]", "", n)              # punctuation noise
    n = re.sub(r"\b[A-Z]\.(?=\s|$)", "", n)      # drop initials like "R." (lookahead, since no \b follows ". ")
    n = re.sub(r"\s+", " ", n).strip()
    # Title case but keep common particles intact
    n = " ".join(w.capitalize() if w.lower() not in {"von","van","de","del","di","la"} else w.lower()
                 for w in n.split())
    return n

people["name_norm"] = people["entity"].map(normalize_person)

# --- 3) Split into full vs last-only ---
def last_name(s: str) -> str:
    parts = s.split()
    return parts[-1] if parts else ""

def is_last_only(s: str) -> bool:
    return len(s.split()) == 1

people["last"] = people["name_norm"].map(last_name)
people["last_only"] = people["name_norm"].map(is_last_only)

# --- 4) Compute dominant full name per last name (with a dominance threshold) ---
fulls = people[~people["last_only"]].copy()
# count full-name mentions by last name
cand = (fulls
        .assign(full=lambda d: d["name_norm"])
        .groupby(["last","full"])
        .size()
        .reset_index(name="n"))

# per-last totals and dominant candidate
totals = cand.groupby("last")["n"].sum().rename("total")
dom = (cand.sort_values(["last","n"], ascending=[True, False])
           .groupby("last").head(1)  # top full per last
           .merge(totals, on="last"))
dom["share"] = dom["n"] / dom["total"]

# choose a threshold; 0.6 = “dominant enough”
DOMINANCE_THRESHOLD = 0.6
dom_map = (dom[dom["share"] >= DOMINANCE_THRESHOLD]
           .set_index("last")["full"]
           .to_dict())

# --- 5) Map last-only mentions to dominant full (when unambiguous) ---
def collapse_name(row):
    nm = row["name_norm"]
    if row["last_only"]:
        last = row["last"]
        if last in dom_map:
            return dom_map[last]   # map "Trump" -> "Donald Trump"
        else:
            return nm              # ambiguous last (e.g., "Bush") → keep as-is
    else:
        return nm                  # already full name

people["name_clean"] = people.apply(collapse_name, axis=1)

# (Optional) See what changed
# people.loc[people["name_norm"]!=people["name_clean"], ["name_norm","name_clean"]].drop_duplicates().head(20)
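
The dominance rule in the cell above is easier to see on a toy example. The sketch below re-implements the same idea in plain Python with hypothetical mention counts. It illustrates the threshold logic only and is not a replacement for the pandas version.

```python
from collections import Counter, defaultdict

# Hypothetical counts of full-name mentions in a corpus
full_name_counts = Counter({
    "Donald Trump": 8, "Ivanka Trump": 2,
    "George Bush": 5, "Jeb Bush": 5,
})
DOMINANCE_THRESHOLD = 0.6

# Group the full names by their last token
by_last = defaultdict(Counter)
for full, n in full_name_counts.items():
    by_last[full.split()[-1]][full] = n

# Keep a last name only if its top full name holds a large enough share
dom_map = {}
for last, counts in by_last.items():
    full, n = counts.most_common(1)[0]
    share = n / sum(counts.values())
    if share >= DOMINANCE_THRESHOLD:
        dom_map[last] = full

def collapse(name):
    """Expand a last-name-only mention when one full name dominates."""
    if len(name.split()) == 1:
        return dom_map.get(name, name)
    return name

print(collapse("Trump"))  # share 8/10 = 0.8 clears the threshold
print(collapse("Bush"))   # share 5/10 = 0.5 does not, so it stays "Bush"
```

Raising the threshold makes the mapping more conservative: fewer last-name-only mentions get merged, at the cost of leaving more duplicates in the graph.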


In [None]:
# Rebuild PERSON table with cleaned names
persons_clean = (people[["article_id","name_clean"]]
                 .rename(columns={"name_clean":"entity"})
                 .drop_duplicates())

# Recompute co-mentions with cleaned entities
from itertools import combinations
edge_rows = []
for art_id, grp in persons_clean.groupby("article_id"):
    ppl = sorted(grp["entity"].unique())
    for a, b in combinations(ppl, 2):
        edge_rows.append((a, b, art_id))

edges_df = pd.DataFrame(edge_rows, columns=["src","dst","article_id"])
edge_weights = (edges_df.groupby(["src","dst"]).size()
                .reset_index(name="weight")
                .sort_values("weight", ascending=False))

# Build graph and recalc centralities
G = nx.Graph()
for p in persons_clean["entity"].unique():
    G.add_node(p, type="PERSON")
for _, r in edge_weights.iterrows():
    G.add_edge(r["src"], r["dst"], weight=int(r["weight"]))

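deg = nx.degree_centrality(G)
# networkx reads the `weight` argument of betweenness as a path length (distance),
# so pass 1/weight: pairs co-mentioned more often then count as closer, not farther apart.
for _, _, d in G.edges(data=True):
    d["distance"] = 1.0 / d["weight"]
btw = nx.betweenness_centrality(G, normalized=True, weight="distance")
deg_w = {n: sum(d["weight"] for _,_,d in G.edges(n, data=True)) for n in G.nodes()}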

cent_df = (pd.DataFrame({
    "entity": list(G.nodes()),
    "degree_centrality": [deg[n] for n in G.nodes()],
    "betweenness": [btw[n] for n in G.nodes()],
    "weighted_degree": [deg_w[n] for n in G.nodes()],
}).sort_values(["weighted_degree","degree_centrality"], ascending=False)
  .reset_index(drop=True))

cent_df.head(10)



## 7) Interactive network (Plotly)

Small/medium corpora render fine inline. For larger projects, export to **Gephi**.


In [None]:
# === Show only the TOP-K strongest relationships (by edge "weight") ===
TOP_K = 20  # change as needed

# 1) Pick top-K edges by weight
edges_sorted = sorted(
    G.edges(data=True),
    key=lambda e: e[2].get("weight", 1),
    reverse=True
)
top_edges = edges_sorted[:TOP_K]

# 2) Build a subgraph with just those edges (and their incident nodes)
H = nx.Graph()
H.add_nodes_from(G.nodes(data=True))  # keep node attrs if any
for u, v, d in top_edges:
    H.add_edge(u, v, **d)

# Optional: remove isolated nodes (if any) that snuck in without edges
H.remove_nodes_from(list(nx.isolates(H)))

# 3) Recompute layout and centralities on the subgraph
pos = nx.spring_layout(H, k=0.6, seed=42, weight="weight")

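deg_cen_map = nx.degree_centrality(H)
# Betweenness treats its weight argument as a distance, so use 1/weight on the subgraph edges
for _, _, d in H.edges(data=True):
    d["distance"] = 1.0 / d["weight"]
btw_map = nx.betweenness_centrality(H, normalized=True, weight="distance")
wdeg_map = {n: sum(d["weight"] for _,_,d in H.edges(n, data=True)) for n in H.nodes()}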

# 4) Build Plotly traces (edges first)
edge_x, edge_y = [], []
for u, v, d in H.edges(data=True):
    x0, y0 = pos[u]; x1, y1 = pos[v]
    edge_x += [x0, x1, None]
    edge_y += [y0, y1, None]

edge_trace = go.Scatter(
    x=edge_x, y=edge_y, mode='lines',
    line=dict(width=0.5),
    hoverinfo='none'
)

# 5) Nodes
node_x = [pos[n][0] for n in H.nodes()]
node_y = [pos[n][1] for n in H.nodes()]
node_sizes = [8 + 12*deg_cen_map.get(n, 0) for n in H.nodes()]
node_text = [
    f"{n}<br>degree={deg_cen_map.get(n,0):.3f}"
    f"<br>betweenness={btw_map.get(n,0):.3f}"
    f"<br>w_degree={wdeg_map.get(n,0)}"
    for n in H.nodes()
]

node_trace = go.Scatter(
    x=node_x, y=node_y, mode='markers+text',
    text=[n for n in H.nodes()],
    textposition="top center",
    marker=dict(size=node_sizes),
    hovertext=node_text, hoverinfo='text'
)

fig = go.Figure(data=[edge_trace, node_trace])
fig.update_layout(
    title=f"Top {min(TOP_K, H.number_of_edges())} Person–Person Relationships (by co-mentions)",
    showlegend=False, height=640,
    margin=dict(l=20, r=20, t=50, b=20),
    xaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
    yaxis=dict(showgrid=False, zeroline=False, showticklabels=False)
)
fig.show()



## 8) Export tables for reuse (CSV) + Gephi (GEXF)

- Nodes: centralities per person  
- Edges: aggregated co-mentions (weights)  
- GEXF: open in **Gephi** for deeper analysis (layouts, communities).


In [None]:

# nodes_out = cent_df.copy()
# edges_out = edge_weights.copy()

# nodes_out.to_csv("ner_network_nodes.csv", index=False)
# edges_out.to_csv("ner_network_edges.csv", index=False)

# nx.write_gexf(G, "ner_network.gexf")

# print("Saved: ner_network_nodes.csv, ner_network_edges.csv, ner_network.gexf")



## 9) Policy applications (talk track)

- **Stakeholder mapping:** Who are central actors in AI governance, climate finance, or migration coverage this month?  
- **Influence & agenda setting:** Which people bridge communities (high betweenness) between industry and regulators?  
- **Sanctions/compliance:** Co-appearances of sanctioned individuals with firms/financial institutions.  
- **Crisis response:** During disasters or epidemics, who coordinates (centrality hotspots) across NGOs/governments?  
- **Media framing:** Compare networks by outlet/region to see different coalitions or narratives.
  
**Caveats & fairness:**  
- NER is imperfect (name variants, titles, acronyms). Validate high-stakes claims.  
- Prominence bias: centrality can reflect media attention, not true influence.  
- Consider sentence-level ties for stricter edges; add ORG/GPE for multiplex networks.



## 10) 🎯 Mini Scavenger Hunt (discussion, not graded)

1) **Central Player:** Who has the highest *weighted degree*? Does that match your expectations?  
2) **Bridge Builder:** Top-3 *betweenness* nodes: what communities might they connect?  
3) **Outlet Split (optional):** If you have a `source` column, build separate graphs per source. What changes?  
4) **Time Slice (optional):** Group articles by month/quarter and recompute centralities. Any event-driven shifts?  
5) **Tighten the link (stretch):** Redefine edges as co-mentions **within the same sentence** only; how does the network change?
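
For the time-slice question, one possible starting point is to bucket articles by month before counting co-mentions. The sketch below uses plain Python and hypothetical data, and assumes ISO-formatted `YYYY-MM-DD` date strings so that the first seven characters identify the month.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical (date, people mentioned in that article) pairs
articles = [
    ("2024-09-10", {"Janet Yellen", "Kristalina Georgieva"}),
    ("2024-09-12", {"Janet Yellen", "Ursula von der Leyen"}),
    ("2024-10-01", {"Janet Yellen", "Kristalina Georgieva"}),
    ("2024-10-03", {"Kristalina Georgieva", "Ursula von der Leyen"}),
]

# Weighted degree per person within each month
weighted_degree_by_month = defaultdict(Counter)
for date, people in articles:
    month = date[:7]  # "YYYY-MM" prefix of an ISO date string
    for a, b in combinations(sorted(people), 2):
        weighted_degree_by_month[month][a] += 1
        weighted_degree_by_month[month][b] += 1

for month in sorted(weighted_degree_by_month):
    top_person, score = weighted_degree_by_month[month].most_common(1)[0]
    print(month, top_person, score)
```

With real data you would likely build one graph per window and compare the full centrality tables; the per-month leader printed here is just a quick signal of whether anything shifts.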



## 11) Optional extensions

- Add **ORG** and **GPE** nodes for a **tripartite** person–org–place network.  
- Use `en_core_web_trf` (transformer model) for higher-accuracy NER (GPU recommended).  
- Deduplicate entity variants with light heuristics or entity linking (e.g., Wikipedia/Wikidata).  
- Community detection (e.g., Louvain) to find actor clusters.  
- Compare networks across **time windows** or **outlets** for framing analysis.
