# Assignment 11 – Part 1.2  
## Knowledge graph construction from cleaned professor profiles

This notebook:

1. Loads the `cleaned_professors.json` file produced in Notebook 1.
2. Builds a directed knowledge graph where:
   - Each professor is a `PROFESSOR` node.
   - Academic organizations, degrees, subjects, corporate organizations, and locations
     become typed entity nodes.
3. Creates labeled edges such as:
   - **Studied at** (professor → academic organization)  
   - **Has degree** (professor → degree)  
   - **Teaches** (professor → subject)  
   - **Worked at** (professor → corporate organization)  
   - **Worked in** (professor → corporate location)
4. Computes simple graph statistics to sanity-check the extraction.
5. Exports the graph in GraphML, GEXF, and CSV formats for inspection in Gephi (https://lite.gephi.org/v1.0.1/)
   and for use in the written report.
6. Generates small visualizations (ego network + top-degree subgraph) as a quick
   qualitative check on the resulting structure.


In [None]:
# Core libraries and environment setup
import os
import sys
import subprocess
import json

def ensure_package(pkg_name: str):
    """
    Import a package if available; otherwise install it with pip and then import.
    This mirrors Notebook 1 so that a fresh environment can run with 'Run All'.
    """
    try:
        __import__(pkg_name)
    except ImportError:
        print(f"Installing missing package: {pkg_name}")
        subprocess.check_call([sys.executable, "-m", "pip", "install", pkg_name])

# Ensure required libraries are available
for pkg in ["pandas", "networkx", "matplotlib"]:
    ensure_package(pkg)

import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt

# ------------------------------------------------------------------
# Load cleaned professor data generated by 01_load_and_clean.ipynb
# ------------------------------------------------------------------
data_path = "../data/processed/cleaned_professors.json"
df = pd.read_json(data_path)

print("Data loaded successfully from:", data_path)
print("Shape:", df.shape)
print("Columns:", list(df.columns))
df.head(2)


## Graph schema and construction strategy

Design decisions:

- Use a **directed graph (`DiGraph`)** where edges always point from `PROFESSOR`
  nodes to entity nodes. This keeps semantics simple and avoids multi-edge
  complications for the assignment.
- Each node carries:
  - `label`: human-readable text (e.g., `"Universidad Autónoma de Madrid"`).
  - `ntype`: node type (`PROFESSOR`, `ORG`, `EDU`, `LOC`, `SUBJECT`).
- Each edge carries:
  - `label` and `relation`: a concise relationship name (e.g., `"Studied at"`).
- We only use **strings and simple types** for node/edge attributes so that
  export to GraphML/GEXF (Gephi) is straightforward and does not require
  custom parsers.
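
The trade-off behind the `DiGraph` choice can be seen on a toy example. The sketch below is only an illustration: the node identifiers follow the `<ntype>:<name>` convention used in this notebook, but the values are made up. It contrasts how `DiGraph` and `MultiDiGraph` treat a second edge between the same pair of nodes.

```python
import networkx as nx

# Made-up identifiers following the "<ntype>:<name>" convention
prof, org = "PROF_0", "ORG:Example University"

G_simple = nx.DiGraph()
G_simple.add_edge(prof, org, relation="Studied at")
G_simple.add_edge(prof, org, relation="Worked at")  # same pair: attributes are overwritten

G_multi = nx.MultiDiGraph()
G_multi.add_edge(prof, org, relation="Studied at")
G_multi.add_edge(prof, org, relation="Worked at")   # same pair: a parallel edge is added

print(G_simple.number_of_edges(), G_simple.edges[prof, org]["relation"])
print(G_multi.number_of_edges())
```

With a `DiGraph`, only the last relation written for a given (professor, entity) pair survives, which matters whenever the same entity can be reached through two different relations.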


In [None]:
# ------------------------------------------------------------------
# Build nodes and labeled edges from professor_dict
# ------------------------------------------------------------------

G = nx.DiGraph()

# 1) Create professor nodes (one per dataframe row, keyed by the row index)
for i in df.index:
    G.add_node(f"PROF_{i}", label=f"Professor {i}", ntype="PROFESSOR")

# 2) Helper to create / reuse entity nodes
def add_entity_node(name: str, ntype: str):
    if not name:
        return None
    node_id = f"{ntype}:{name}"
    if not G.has_node(node_id):
        G.add_node(node_id, label=name, ntype=ntype)
    return node_id

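# 3) Helper to add a typed relation from professor to entity
def add_relation(prof_id: str, target_label: str, target_type: str, relation_label: str):
    target_id = add_entity_node(target_label, target_type)
    if target_id is None:
        return
    # A DiGraph keeps one edge per (source, target) pair, so a repeated add_edge
    # overwrites attributes. Academic and corporate organizations share the ORG
    # type, so "Studied at" and "Worked at" can hit the same edge: merge them.
    if G.has_edge(prof_id, target_id):
        relations = G.edges[prof_id, target_id]["relation"].split(" | ")
        if relation_label not in relations:
            relations.append(relation_label)
        relation_label = " | ".join(relations)
    G.add_edge(prof_id, target_id, label=relation_label, relation=relation_label)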

# 4) Iterate over the dataframe and add semantic edges
for i, row in df.iterrows():
    prof_id = f"PROF_{i}"
    d = row["professor_dict"]

    # Academic background
    for org in d.get("Academic Background - Organization", []):
        add_relation(prof_id, org, "ORG", "Studied at")
    for edu in d.get("Academic Background - Education", []):
        add_relation(prof_id, edu, "EDU", "Has degree")

    # Teaching / subjects
    for subj in d.get("Academic Experience - Subjects", []):
        add_relation(prof_id, subj, "SUBJECT", "Teaches")

    # Corporate experience
    for org in d.get("Corporate Experience - Organization", []):
        add_relation(prof_id, org, "ORG", "Worked at")
    for loc in d.get("Corporate Experience - Location", []):
        add_relation(prof_id, loc, "LOC", "Worked in")


## Sanity checks: node/edge counts and high-degree entities

Before exporting the graph, I compute basic statistics to confirm that:

- The expected node types are present (professors, organizations, locations, etc.).
- The total number of edges is reasonable given the dataset size.
- High-degree organizations and locations (e.g., major universities or “Spain”)
  show up as central hubs, which matches domain expectations.
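
A further check that fits the schema is to confirm that every edge runs from a `PROFESSOR` node to a non-professor entity. The helper below is a hypothetical sketch (`find_schema_violations` is not part of the notebook), shown on a toy graph with one deliberately wrong edge.

```python
import networkx as nx

def find_schema_violations(graph):
    """Return edges that do not run from a PROFESSOR node to a non-PROFESSOR node."""
    ntype = nx.get_node_attributes(graph, "ntype")
    return [
        (u, v)
        for u, v in graph.edges()
        if ntype.get(u) != "PROFESSOR" or ntype.get(v) == "PROFESSOR"
    ]

# Toy graph with one valid edge and one edge that breaks the schema
toy = nx.DiGraph()
toy.add_node("PROF_0", ntype="PROFESSOR")
toy.add_node("ORG:Example University", ntype="ORG")
toy.add_node("LOC:Madrid", ntype="LOC")
toy.add_edge("PROF_0", "ORG:Example University", relation="Studied at")
toy.add_edge("ORG:Example University", "LOC:Madrid", relation="Located in")  # invalid on purpose

violations = find_schema_violations(toy)
print(violations)
```

On the real graph, the same call would be `find_schema_violations(G)`, and an empty list would indicate that the construction step respected the schema.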


In [None]:
# ------------------------------------------------------------------
# Quick statistics on the constructed graph
# ------------------------------------------------------------------

node_types = pd.Series(nx.get_node_attributes(G, "ntype")).value_counts()
print("Node counts by type:\n" + node_types.to_string() + "\n")  # concatenate so the first row stays aligned
print("Total nodes:", G.number_of_nodes(), "  Total edges:", G.number_of_edges())

# Top connected organizations and locations
deg = dict(G.degree())
top_orgs = sorted(
    [(n, deg[n]) for n, d in G.nodes(data=True) if d.get("ntype") == "ORG"],
    key=lambda x: x[1],
    reverse=True,
)[:15]

top_locs = sorted(
    [(n, deg[n]) for n, d in G.nodes(data=True) if d.get("ntype") == "LOC"],
    key=lambda x: x[1],
    reverse=True,
)[:15]

print("\nTop ORGs by degree:")
labels = nx.get_node_attributes(G, "label")
for n, k in top_orgs:
    print(labels.get(n, n), "->", k)

print("\nTop LOCs by degree:")
for n, k in top_locs:
    print(labels.get(n, n), "->", k)


## Exporting the graph for Gephi and downstream analysis

I export the graph in multiple formats:

- **GraphML** and **GEXF** for Gephi and other graph tools.
- **nodes.csv** and **edges.csv** for tabular inspection and to reference in the report.

To keep these exports robust, I coerce all node and edge attributes to
simple scalar types (strings, numbers, or booleans) before writing. This
avoids GraphML errors about unsupported Python objects.
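
One way to confirm that string attributes survive the export is a write/read round trip. The sketch below uses a made-up two-node graph and a temporary file rather than the notebook's real output paths, so it is only an illustration of the idea.

```python
import os
import tempfile
import networkx as nx

# Made-up graph using the same attribute names as this notebook
toy = nx.DiGraph()
toy.add_node("PROF_0", label="Professor 0", ntype="PROFESSOR")
toy.add_node("ORG:Example University", label="Example University", ntype="ORG")
toy.add_edge("PROF_0", "ORG:Example University", label="Studied at", relation="Studied at")

# Write to a temporary GraphML file and read it straight back
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.graphml")
    nx.write_graphml(toy, path)
    restored = nx.read_graphml(path)

print(restored.is_directed(), restored.number_of_nodes(), restored.number_of_edges())
print(restored.nodes["PROF_0"]["ntype"])
print(restored.edges["PROF_0", "ORG:Example University"]["relation"])
```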


In [None]:
# ------------------------------------------------------------------
# Export graph artifacts (GraphML, GEXF, and CSV)
# ------------------------------------------------------------------

os.makedirs("../outputs/graph", exist_ok=True)

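def _coerce_for_graphml(v):
    # The GraphML writer only accepts plain scalar values
    if v is None:
        return ""  # None is rejected by the writer, so store an empty string
    if isinstance(v, (str, int, float, bool)):
        return v
    if isinstance(v, (set, frozenset)):
        v = sorted(v, key=str)  # sets are not JSON-serializable
    if isinstance(v, (list, dict, tuple)):
        # store as a JSON string so it is still readable if needed
        return json.dumps(v, ensure_ascii=False, default=str)
    return str(v)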

# Sanitize attributes in-place
for _, data in G.nodes(data=True):
    for k in list(data.keys()):
        data[k] = _coerce_for_graphml(data[k])

for _, _, data in G.edges(data=True):
    for k in list(data.keys()):
        data[k] = _coerce_for_graphml(data[k])

# GraphML and GEXF for Gephi
graphml_path = "../outputs/graph/prof_kg.graphml"
gexf_path    = "../outputs/graph/prof_kg.gexf"

nx.write_graphml(G, graphml_path)
nx.write_gexf(G, gexf_path)

# Node and edge tables for the report
nodes_df = pd.DataFrame(
    [(n, d.get("label"), d.get("ntype")) for n, d in G.nodes(data=True)],
    columns=["node_id", "label", "type"],
)

edges_df = pd.DataFrame(
    [(u, v, d.get("relation")) for u, v, d in G.edges(data=True)],
    columns=["source", "target", "relation"],
)

nodes_csv_path = "../outputs/graph/nodes.csv"
edges_csv_path = "../outputs/graph/edges.csv"

nodes_df.to_csv(nodes_csv_path, index=False)
edges_df.to_csv(edges_csv_path, index=False)

print("\nSaved graph artifacts:")
print(" -", graphml_path)
print(" -", gexf_path)
print(" -", nodes_csv_path)
print(" -", edges_csv_path)


## Visual sanity checks

To complement the numeric statistics, I plot:

1. A **1-hop ego network** around a sample professor to see if the local
   neighborhood matches the intended schema (professor → orgs, degrees,
   locations, subjects).
2. A **subgraph of the top-degree nodes** to inspect the global structure
   and visually confirm that major universities and locations act as hubs.

These plots are not meant to be publication-ready but help catch obvious
graph construction errors.
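
Because all edges point away from professors, the 1-hop ego network depends on which node is used as the center. The toy example below (made-up node names) sketches how `nx.ego_graph` behaves on a directed graph, with and without its `undirected` flag.

```python
import networkx as nx

toy = nx.DiGraph()
toy.add_edge("PROF_0", "ORG:Example University")
toy.add_edge("PROF_0", "SUBJECT:Statistics")
toy.add_edge("PROF_1", "ORG:Example University")

# Centered on a professor: follows outgoing edges to the linked entities
prof_ego = nx.ego_graph(toy, "PROF_0", radius=1)

# Centered on an entity: no outgoing edges, so only the center is returned
org_ego = nx.ego_graph(toy, "ORG:Example University", radius=1)

# Ignoring direction recovers the professors attached to that entity
org_ego_undirected = nx.ego_graph(toy, "ORG:Example University", radius=1, undirected=True)

print(sorted(prof_ego.nodes()))
print(sorted(org_ego.nodes()))
print(sorted(org_ego_undirected.nodes()))
```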


In [None]:
# ------------------------------------------------------------------
# Visual sanity checks: ego network and top-degree subgraph
# ------------------------------------------------------------------

# 1) Example professor ego network (1-hop)
try:
    sample_prof = next(n for n, d in G.nodes(data=True) if d.get("ntype") == "PROFESSOR")
    ego = nx.ego_graph(G, sample_prof, radius=1)

    plt.figure(figsize=(9, 7))
    pos = nx.spring_layout(ego, seed=42)

    type_colors = {
        "PROFESSOR": "tab:blue",
        "ORG": "tab:green",
        "LOC": "tab:orange",
        "EDU": "tab:purple",
        "SUBJECT": "tab:red",
    }

    node_colors = [
        type_colors.get(G.nodes[n].get("ntype"), "tab:gray") for n in ego.nodes()
    ]

    nx.draw(
        ego,
        pos,
        with_labels=True,
        labels=nx.get_node_attributes(ego, "label"),
        node_color=node_colors,
        node_size=400,
        font_size=8,
    )
    plt.title("Example professor ego network (1 hop)")
    plt.show()
except StopIteration:
    print("No professor nodes to visualize.")

# 2) Subgraph of top 200 most connected nodes
sub_nodes = sorted(G.degree, key=lambda x: x[1], reverse=True)[:200]
H = G.subgraph([n for n, _ in sub_nodes])

plt.figure(figsize=(12, 10))
pos = nx.spring_layout(H, seed=42)
nx.draw(
    H,
    pos,
    with_labels=False,
    node_size=30,
    edge_color="gray",
    alpha=0.6,
)
plt.title("Sample knowledge graph (top 200 nodes by degree)")
plt.show()


## Optional: export a smaller sample graph for faster Gephi exploration

The full graph can be somewhat heavy in Gephi. As an optional step,
I export a smaller subgraph containing only the 50 highest-degree nodes. This
makes it easier to experiment with layouts and styling while keeping the
core structure of the network.

To view the graph, open the exported file in Gephi (https://lite.gephi.org/v1.0.1/). I find this easier than viewing it within the IDE.
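
One thing to keep in mind when reading the sample: an induced subgraph keeps only the edges whose two endpoints are both in the selected set, so low-degree nodes and their edges drop out. The sketch below uses a hypothetical helper (`top_degree_subgraph`) and a made-up toy graph to show how much of the edge set remains.

```python
import networkx as nx

def top_degree_subgraph(graph, k):
    """Induced subgraph on the k highest-degree nodes, returned as an independent copy."""
    top = sorted(graph.degree, key=lambda item: item[1], reverse=True)[:k]
    return graph.subgraph([n for n, _ in top]).copy()

# Made-up graph: one well-connected professor, two shared hubs, three sparse professors
toy = nx.DiGraph()
toy.add_edges_from([
    ("PROF_0", "ORG:Example University"), ("PROF_0", "LOC:Madrid"),
    ("PROF_0", "EDU:PhD"), ("PROF_0", "SUBJECT:Statistics"), ("PROF_0", "SUBJECT:Finance"),
    ("PROF_1", "ORG:Example University"), ("PROF_1", "LOC:Madrid"),
    ("PROF_2", "ORG:Example University"), ("PROF_2", "LOC:Madrid"),
    ("PROF_3", "ORG:Example University"),
])

sample = top_degree_subgraph(toy, 3)
kept = sample.number_of_edges() / toy.number_of_edges()
print(sorted(sample.nodes()))
print(f"{kept:.0%} of edges kept")
```

Here the hubs survive but most professor-to-hub edges do not, so hub degrees in the sample understate their degrees in the full graph.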

In [None]:
# ------------------------------------------------------------------
# Export a smaller sample subgraph for quick Gephi experiments
# ------------------------------------------------------------------

sample_nodes = sorted(G.degree, key=lambda x: x[1], reverse=True)[:50]
H = G.subgraph([n for n, _ in sample_nodes]).copy()

os.makedirs("../outputs/graph", exist_ok=True)
sample_path = "../outputs/graph/prof_kg_sample.graphml"

# Attributes should already be GraphML-safe, but we coerce again defensively
for _, data in H.nodes(data=True):
    for k in list(data.keys()):
        data[k] = _coerce_for_graphml(data[k])

for _, _, data in H.edges(data=True):
    for k in list(data.keys()):
        data[k] = _coerce_for_graphml(data[k])

nx.write_graphml(H, sample_path)

print(f"Sample graph saved to: {sample_path}")
print(f"Sample nodes: {H.number_of_nodes()}   Sample edges: {H.number_of_edges()}")
