# Assignment 11 – Part 1  
## NER preprocessing and professor profile cleaning

This notebook:

1. Loads the raw `teachers_db_practice.csv` file.
2. Cleans and splits each professor profile into **Corporate Experience**, **Academic Experience**, and **Academic Background** sections.
3. Runs a Named Entity Recognition (NER) model to detect organizations and locations.
4. Normalizes and clusters entity strings (e.g. “ie university” → “IE”).
5. Builds a per-professor dictionary with:
   - Corporate organizations and locations  
   - Academic organizations  
   - Degrees / education  
   - Academic subjects
6. Saves a compact JSON file (`cleaned_professors.json`) used later to build the knowledge graph.


In [None]:
# Core libraries
import os
import re
import json
import unicodedata
from collections import Counter, defaultdict
import random

import pandas as pd

# NER + utilities
import sys
import subprocess

def ensure_package(pkg_name: str):
    """
    Import a package if available; otherwise install it with pip and then import.
    This allows running the notebook on a clean machine with just 'Run All'.
    """
    try:
        __import__(pkg_name)
    except ImportError:
        print(f"Installing missing package: {pkg_name}")
        subprocess.check_call([sys.executable, "-m", "pip", "install", pkg_name])

for pkg in ["transformers", "torch", "tqdm", "rapidfuzz", "pyarrow"]:
    ensure_package(pkg)

from transformers import pipeline
from tqdm import tqdm
from rapidfuzz import fuzz, process

tqdm.pandas()


In [None]:
# Path structure:
#   data/raw/teachers_db_practice.csv
#   data/processed/  (output folder used later)

raw_path = "../data/raw/teachers_db_practice.csv"
df = pd.read_csv(raw_path)

print("Loaded dataset:")
print("  shape:", df.shape)
print("  columns:", list(df.columns))

# Inspect a single raw profile to understand the HTML-ish structure
display(df.loc[0, ["full_info"]])


### Clean and normalize raw profile text

The `full_info` field is stored as an HTML-like blob with tags and entities (e.g., `<h4>`, `&amp;`).  
Before running NER, I normalize each profile by:

- Removing HTML tags and collapsing whitespace.
- Converting `&amp;` variants back to `&`.

This gives a stable, plain-text representation (`clean_text`) that is easier for the NER model and for later regex-based parsing.
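
As a quick illustration of the cleaning described above, here is a minimal sketch on a made-up string. The name `clean_html_demo` and the sample text are illustrative assumptions, not part of the dataset or the notebook's own function.

```python
import re

def clean_html_demo(html: str) -> str:
    """Toy version of the cleaning step: entities, tags, then whitespace."""
    text = re.sub(r"&\s*amp;?", "&", html or "", flags=re.I)  # "&amp;" / "& amp" -> "&"
    text = re.sub(r"<.*?>", " ", text, flags=re.S)            # drop tags, including multi-line ones
    return re.sub(r"\s+", " ", text).strip()                  # collapse runs of whitespace

# Made-up sample in the same HTML-ish style as `full_info`
sample = "<h4>CORPORATE EXPERIENCE</h4><p>Partner, A &amp; M   Studio,\n Madrid</p>"
cleaned = clean_html_demo(sample)
print(cleaned)  # CORPORATE EXPERIENCE Partner, A & M Studio, Madrid
```

Replacing each tag with a space (rather than an empty string) keeps words on either side of a tag from being glued together.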


In [None]:
# Clean the HTML-ish blob into a single text field per professor

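def clean_html_text(html: str) -> str:
    """Turn an HTML-ish blob into plain text; non-string input (e.g. NaN) becomes ""."""
    if not isinstance(html, str):
        return ""
    # collapse &amp; variants
    html = re.sub(r"&\s*amp;?", "&", html, flags=re.I)
    # remove all tags (re.S lets a tag span a line break)
    html = re.sub(r"<.*?>", " ", html, flags=re.S)
    # collapse whitespace
    html = re.sub(r"\s+", " ", html).strip()
    return html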

df["clean_text"] = df["full_info"].fillna("").apply(clean_html_text)

df["clean_text"].head(3).to_list()


### Split profiles into logical sections

Many CVs have explicit section headers: **CORPORATE EXPERIENCE**, **ACADEMIC EXPERIENCE**, and  
**ACADEMIC BACKGROUND**. I use regexes to:

- Detect these headers in the original HTML string.
- Slice the profile into three section-specific text fields.

If the headers are missing, the entire block is treated as corporate experience.  
Working per-section makes it easier to decide whether a given organization or location belongs to  
corporate vs. academic parts of the profile.
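
The slicing idea can be sketched on a toy string. This is a simplified illustration, not the notebook's function: `split_sections_demo` and the sample profile are made up, and unlike the code in the next cell this version excludes the header text itself from each slice.

```python
import re

HEADERS = ["CORPORATE EXPERIENCE", "ACADEMIC EXPERIENCE", "ACADEMIC BACKGROUND"]

def split_sections_demo(text: str) -> dict:
    """Slice `text` at each header; text without any header goes to the first bucket."""
    found = []
    for header in HEADERS:
        m = re.search(re.escape(header), text, flags=re.I)
        if m:
            found.append((m.start(), m.end(), header))
    found.sort()  # process headers in the order they appear in the text

    sections = {header: "" for header in HEADERS}
    if not found:
        sections[HEADERS[0]] = text.strip()
        return sections
    for j, (_, end, header) in enumerate(found):
        stop = found[j + 1][0] if j + 1 < len(found) else len(text)
        sections[header] = text[end:stop].strip()  # body only, header excluded
    return sections

demo = "Corporate Experience Analyst at Acme. Academic Background PhD in Economics."
parts = split_sections_demo(demo)
print(parts)
```

Sorting the header positions first matters because profiles do not always list the sections in the same order.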


In [None]:
# Split each profile into (possibly empty) sections:
#   - CORPORATE EXPERIENCE
#   - ACADEMIC EXPERIENCE
#   - ACADEMIC BACKGROUND

SECTION_HEADERS = {
    "corp": r"(?:<h4>\s*CORPORATE EXPERIENCE\s*</h4>|CORPORATE EXPERIENCE)",
    "acadexp": r"(?:<h4>\s*ACADEMIC EXPERIENCE\s*</h4>|ACADEMIC EXPERIENCE)",
    "acadbg": r"(?:<h4>\s*ACADEMIC BACKGROUND\s*</h4>|ACADEMIC BACKGROUND)",
}

def strip_html(s: str) -> str:
    # Same cleaning as clean_html_text, but tolerant of NaN / non-string input.
    return clean_html_text(s) if isinstance(s, str) else ""

def extract_sections(raw: str) -> pd.Series:
    text = raw if isinstance(raw, str) else ""

    def find_idx(pattern: str):
        m = re.search(pattern, text, flags=re.I)
        return m.start() if m else None

    i_corp    = find_idx(SECTION_HEADERS["corp"])
    i_acadexp = find_idx(SECTION_HEADERS["acadexp"])
    i_acadbg  = find_idx(SECTION_HEADERS["acadbg"])

    def slice_between(start, end):
        if start is None:
            return ""
        end = len(text) if end is None else end
        return strip_html(text[start:end])

    idxs = sorted(
        [(k, v) for k, v in [("corp", i_corp), ("acadexp", i_acadexp), ("acadbg", i_acadbg)] if v is not None],
        key=lambda x: x[1],
    )

    corp_txt = acadexp_txt = acadbg_txt = ""
    if idxs:
        for j, (label, start) in enumerate(idxs):
            end = idxs[j + 1][1] if j + 1 < len(idxs) else None
            chunk = slice_between(start, end)
            if label == "corp":
                corp_txt = chunk
            elif label == "acadexp":
                acadexp_txt = chunk
            elif label == "acadbg":
                acadbg_txt = chunk
    else:
        # if no headers, treat entire blob as "corporate" so we don't lose information
        corp_txt = strip_html(text)

    return pd.Series(
        {
            "corp_text": corp_txt,
            "acadexp_text": acadexp_txt,
            "acadbg_text": acadbg_txt,
        }
    )

sec_df = df["full_info"].apply(extract_sections)
df = pd.concat([df, sec_df], axis=1)

df[["corp_text", "acadexp_text", "acadbg_text"]].head(3)


### Named Entity Recognition (NER) on professor profiles

I use the HuggingFace `dslim/bert-base-NER` model to detect entities such as organizations (ORG)  
and locations (LOC) in each professor profile.

Design choices:

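- **Model**: `dslim/bert-base-NER` is a general-purpose English NER model fine-tuned on CoNLL-2003 news text (PER, ORG, LOC, MISC). It is a reasonable default for CV-style text, although fragments such as section headers can be mis-tagged, which is why later cells filter the output.
- **Truncation**: for each text field, I truncate to 2,000 characters to keep runtime reasonable while
  still covering the relevant parts of long profiles. Note that BERT models accept at most 512 word-piece
  tokens per call, so the effective limit can be lower than 2,000 characters for token-dense text.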
- **Per-section NER**: I run NER both on the full cleaned text and on each section separately, which later
  makes it easier to attach entities to “corporate” vs. “academic” buckets.
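
Truncation means that anything past the character limit is never tagged. One possible alternative, which this notebook does not use, is to process long texts in overlapping windows. The sketch below only shows the windowing; `chunk_text`, `size`, and `overlap` are hypothetical names, and merging the per-window entities is left out.

```python
def chunk_text(text: str, size: int = 2000, overlap: int = 200):
    """Yield (offset, chunk) windows of at most `size` characters."""
    if size <= overlap:
        raise ValueError("size must be larger than overlap")
    step = size - overlap
    for start in range(0, max(len(text), 1), step):
        chunk = text[start:start + size]
        if chunk:
            yield start, chunk
        if start + size >= len(text):
            break  # the current window already reaches the end of the text

windows = list(chunk_text("x" * 4500))
print([(offset, len(chunk)) for offset, chunk in windows])
# [(0, 2000), (1800, 2000), (3600, 900)]
```

The returned offset would let entity `start`/`end` positions from each window be shifted back into the coordinates of the full text.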


In [None]:
# NER model (dslim/bert-base-NER works well for general English entities)
ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")

def run_ner_safe(text: str, limit: int = 2000):
    """Run NER on at most `limit` characters; return [] for empty or non-string input."""
    text = text if isinstance(text, str) else ""
    if not text.strip():
        return []
    return ner(text[:limit])

# Quick sanity check: run on one profile (printed, since this is not the cell's last line)
sample_text = df.loc[1, "clean_text"]
print(run_ner_safe(sample_text, limit=500))

# Run NER on the whole cleaned profile (truncated for speed)
df["entities"] = df["clean_text"].progress_apply(run_ner_safe)

# Run NER on each section separately
df["entities_corp"]    = df["corp_text"].progress_apply(run_ner_safe)
df["entities_acadexp"] = df["acadexp_text"].progress_apply(run_ner_safe)
df["entities_acadbg"]  = df["acadbg_text"].progress_apply(run_ner_safe)


### Normalize raw entity strings

NER outputs can be noisy and inconsistent (e.g., `ie`, `IE University`, `U.K.`).  
To reduce duplication, I:

- Map common location variants (e.g., `us`, `u.s.`) to a canonical form (`United States`).
- Map known IE-related organizations and Spanish universities to a single form (e.g., `ie university` → `IE`).
- Drop obviously generic ORG tokens such as `Academic`, `School`, or `Faculty`.

The `normalize_entities` function converts the raw NER output into clean `(type, text)` pairs  
that are much better suited for graph construction.
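
To make the mapping concrete, here is a minimal sketch that runs on hand-written entity dictionaries instead of real model output. The `*_DEMO` maps and `normalize_demo` are cut-down, illustrative stand-ins for the full maps and function defined in the next cell.

```python
LOCATION_DEMO = {"u.k.": "United Kingdom", "usa": "United States"}
ORG_DEMO = {"ie university": "IE", "ie business school": "IE"}
GENERIC_DEMO = {"school", "faculty", "academic"}

def normalize_demo(entities):
    """Map raw NER dicts to (type, canonical_text) pairs, dropping generic ORGs."""
    pairs = []
    for ent in entities:
        raw = " ".join(ent["word"].lower().split())  # lowercase + collapse whitespace
        if ent["entity_group"] == "LOC":
            pairs.append(("LOC", LOCATION_DEMO.get(raw, raw.title())))
        elif ent["entity_group"] == "ORG":
            if raw in GENERIC_DEMO:
                continue  # e.g. a stray "Faculty" tagged as an organization
            pairs.append(("ORG", ORG_DEMO.get(raw, raw.title())))
    return pairs

# Hand-written stand-in for the pipeline output (only the two fields used here)
fake_ner_output = [
    {"entity_group": "ORG", "word": "IE  University"},
    {"entity_group": "ORG", "word": "Faculty"},
    {"entity_group": "LOC", "word": "U.K."},
    {"entity_group": "LOC", "word": "madrid"},
]
result = normalize_demo(fake_ner_output)
print(result)  # [('ORG', 'IE'), ('LOC', 'United Kingdom'), ('LOC', 'Madrid')]
```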


In [None]:
# Location and organization normalization maps

LOCATION_MAP = {
    "us": "United States",
    "usa": "United States",
    "u.s.": "United States",
    "u.k.": "United Kingdom",
    "uk": "United Kingdom",
    "spain": "Spain",
    "madrid": "Madrid",
    "mexico": "Mexico",
    "mexico city": "Mexico City",
    "london": "London",
    "paris": "Paris",
    "france": "France",
    "germany": "Germany",
    "italy": "Italy",
    "portugal": "Portugal",
    "barcelona": "Barcelona",
    "new york": "New York",
}

ORG_FIXES = {
    # IE ecosystem
    "ie": "IE",
    "ie university": "IE",
    "ie business school": "IE",
    "ie law school": "IE",
    "instituto de empresa": "IE",
    "ie school of global and public affairs": "IE",
    # Spanish universities
    "universidad autonoma de madrid": "Universidad Autónoma de Madrid",
    "uam": "Universidad Autónoma de Madrid",
    "universidad complutense de madrid": "Universidad Complutense de Madrid",
    "universidad carlos iii de madrid": "Universidad Carlos III de Madrid",
    "universidad politecnica de madrid": "Universidad Politécnica de Madrid",
    "universidad de navarra": "Universidad de Navarra",
    "universidad pontificia comillas": "Universidad Pontificia Comillas",
    "comillas university": "Universidad Pontificia Comillas",
    "icade": "Universidad Pontificia Comillas",
    "iese business school": "IESE Business School",
    # companies / misc
    "a & am": "A&M Studio",
    "a&m": "A&M Studio",
    "am studio": "A&M Studio",
}

GENERIC_ORG_WORDS = {
    "academic",
    "academic exp",
    "academ",
    "experience",
    "exp",
    "university",
    "universidad",
    "engineering",
    "design",
    "academy",
    "school",
    "faculty",
    "department",
    "college",
    "education",
    "institute",
    "business",
    "finance",
    "management",
    "administration",
    "economics",
    "marketing",
    "law",
    "science",
    "technology",
    "research",
    "professor",
    "lecturer",
}

# Lowercased canonical organization names, computed once instead of per entity.
KNOWN_ORGS = {v.lower() for v in ORG_FIXES.values()}

def normalize_entities(entities):
    """Normalize tokens produced by the NER model into (type, cleaned_text) pairs."""
    cleaned = []
    for ent in entities:
        text = ent["word"].lower().strip()
        text = re.sub(r"[^a-z0-9&.\sáéíóúüñ]", "", text)
        text = re.sub(r"\s+", " ", text).strip()
        group = ent["entity_group"]

        # Look up the maps before the length filter so that short keys
        # such as "us", "uk" and "ie" are not discarded.
        if group == "LOC" and text in LOCATION_MAP:
            cleaned.append((group, LOCATION_MAP[text]))
            continue
        if group == "ORG" and text in ORG_FIXES:
            cleaned.append((group, ORG_FIXES[text]))
            continue

        if len(text) < 3:
            continue

        text = text.title()
        if group == "ORG":
            # drop generic headings and one-word garbage
            if text.lower() in GENERIC_ORG_WORDS or (
                len(text.split()) == 1 and text.lower() not in KNOWN_ORGS
            ):
                continue

        cleaned.append((group, text))
    return cleaned

# Apply normalization
df["normalized_entities"] = df["entities"].apply(normalize_entities)
df["norm_corp"]    = df["entities_corp"].apply(normalize_entities)
df["norm_acadexp"] = df["entities_acadexp"].apply(normalize_entities)
df["norm_acadbg"]  = df["entities_acadbg"].apply(normalize_entities)

df["normalized_entities"].head(2).to_list()


### Fuzzy clustering of organizations and locations

Even after the manual maps, many entities still appear with small spelling differences  
(e.g., accents, pluralization, or extra words). To consolidate these, I:

- Normalize strings (lowercase, remove accents, drop generic words like “university”).
- Group similar entities using token-based fuzzy matching (`rapidfuzz`).
- Build an alias map so that near-duplicates are merged into a single canonical label.

This step reduces fragmentation in the knowledge graph (fewer separate nodes for essentially the same institution).
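
The matching idea can be illustrated with the standard library alone. This is a sketch under stated assumptions: `difflib.SequenceMatcher` is used here as a stand-in for `rapidfuzz`, so the scores are not identical to `fuzz.token_sort_ratio`, and `match_key` / `similarity` are illustrative names.

```python
import unicodedata
from difflib import SequenceMatcher

def strip_accents(s: str) -> str:
    """Decompose accented characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def match_key(name: str) -> str:
    """Lowercase, strip accents, and sort tokens so word order does not matter."""
    return " ".join(sorted(strip_accents(name).lower().split()))

def similarity(a: str, b: str) -> float:
    """0-100 similarity on the match keys (difflib stand-in for rapidfuzz)."""
    return 100 * SequenceMatcher(None, match_key(a), match_key(b)).ratio()

key_a = match_key("Universidad Politécnica de Madrid")
key_b = match_key("Universidad Politecnica De Madrid")
print(key_a == key_b)                                    # True
print(similarity("Universidad de Navarra", "Deloitte"))  # well below a 92 threshold
```

Two spellings that share a key can be merged directly; the fuzzy score is only needed for names whose keys differ slightly.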


In [None]:
def simplify_for_match(s: str) -> str:
    """Lowercase, remove accents and generic words to cluster similar entity names."""
    s0 = s.lower().strip()
    s0 = "".join(c for c in unicodedata.normalize("NFKD", s0) if not unicodedata.combining(c))
    s0 = s0.replace("&", "and")
    drop = {
        "university",
        "universidad",
        "universite",
        "università",
        "universita",
        "universidade",
        "school",
        "college",
        "institute",
        "instituto",
        "dept",
        "department",
    }
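    words = re.sub(r"[^a-z0-9\s]", " ", s0).split()
    tokens = [t for t in words if t not in drop]
    # If a name consists only of generic words, keep them all; otherwise every
    # such name would collapse to "" and be merged into a single cluster.
    return " ".join(tokens or words)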

# Collect all ORG / LOC surface forms
all_orgs, all_locs = [], []
for row in df["normalized_entities"]:
    for typ, val in row:
        if typ == "ORG":
            all_orgs.append(val)
        elif typ == "LOC":
            all_locs.append(val)

def build_alias_map(names, sim_threshold=92):
    """Cluster similar names and return alias -> canonical mapping."""
    # Count occurrences before de-duplicating; counting the de-duplicated list
    # would give every name a frequency of 1.
    freq = Counter(names)
    names = sorted(freq)  # unique names in a deterministic order
    buckets = defaultdict(list)
    for n in names:
        buckets[simplify_for_match(n)].append(n)

    alias_map = {}

    # Intra-bucket merge: the most frequent variant becomes the canonical label
    for variants in buckets.values():
        canonical = max(variants, key=lambda x: freq[x])
        for v in variants:
            alias_map[v] = canonical

    # Inter-bucket fuzzy merge
    canonicals = sorted(set(alias_map.values()))
    for c in list(canonicals):
        if c not in canonicals:
            continue  # already merged into another canonical
        matches = process.extract(c, canonicals, scorer=fuzz.token_sort_ratio, limit=5)
        for other, score, _ in matches:
            if other == c or score < sim_threshold or other not in canonicals:
                continue
            winner = c if freq[c] >= freq[other] else other
            loser = other if winner == c else c
            for k, v in list(alias_map.items()):
                if v == loser:
                    alias_map[k] = winner
            canonicals = [x for x in canonicals if x != loser]
            if loser == c:
                break  # c itself was merged away; stop matching against it

    return alias_map

ORG_ALIAS = build_alias_map(all_orgs, sim_threshold=92)
LOC_ALIAS = build_alias_map(all_locs, sim_threshold=95)

def apply_aliases(entities):
    out = []
    for typ, val in entities:
        if typ == "ORG":
            out.append((typ, ORG_ALIAS.get(val, val)))
        elif typ == "LOC":
            out.append((typ, LOC_ALIAS.get(val, val)))
        else:
            out.append((typ, val))
    return out

# Apply alias maps to global and section-wise entities
df["normalized_entities"] = df["normalized_entities"].apply(apply_aliases)
df["norm_corp"]    = df["norm_corp"].apply(apply_aliases)
df["norm_acadexp"] = df["norm_acadexp"].apply(apply_aliases)
df["norm_acadbg"]  = df["norm_acadbg"].apply(apply_aliases)

# Quick frequency summary
org_counts = Counter([e[1] for row in df["normalized_entities"] for e in row if e[0] == "ORG"])
loc_counts = Counter([e[1] for row in df["normalized_entities"] for e in row if e[0] == "LOC"])

print("Top ORGs:", org_counts.most_common(10))
print("Top LOCs:", loc_counts.most_common(10))


### Build structured per-professor dictionaries

Using the section-specific normalized entities, I construct a `professor_dict` for each row:

- **Corporate Experience – Organization / Location**
- **Academic Background – Organization**
- Placeholder lists for **Education** and **Academic Experience** (filled later).

This dictionary is the bridge between the free-text profiles and the structured knowledge graph
built in Notebook 2.


In [None]:
def to_lists(section_entities):
    orgs = [e[1] for e in section_entities if e[0] == "ORG"]
    locs = [e[1] for e in section_entities if e[0] == "LOC"]
    return orgs, locs

def build_professor_dict_row(row):
    corp_orgs, corp_locs       = to_lists(row["norm_corp"])
    acadexp_orgs, acadexp_locs = to_lists(row["norm_acadexp"])
    acadbg_orgs, acadbg_locs   = to_lists(row["norm_acadbg"])

    return {
        "Corporate Experience - Organization": corp_orgs,
        "Corporate Experience - Location": corp_locs,
        "Academic Background - Organization": acadbg_orgs,
        "Academic Background - Education": [],
        "Academic Experience - Courses": [],
        "Academic Experience - Subjects": [],
        # acadexp_locs could be stored as a separate key if needed
    }

df["professor_dict"] = df.apply(build_professor_dict_row, axis=1)
df["professor_dict"].head(2).to_dict()


### Extract degrees and subjects from Academic Background

Some degrees and fields of study appear as free text rather than as clean NER entities.  
To capture them, I:

- Use regex patterns to match common degree formats (PhD, MBA, “Master in …”, etc.).
- Search within the **Academic Background** slice of the profile.
- Move these matches into `Academic Background - Education`.
- Remove degree-like phrases that may have leaked into the organization list.

This helps separate “where they studied” (ORG) from “what they studied” (degree/subject).
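
A small worked example shows how a degree pattern separates the degree phrase from the institution. The pattern below is a simplified, hypothetical variant (`DEGREE_DEMO`), not the notebook's `DEGREE_PAT`: it adds a lookahead so the match stops before "from"/"at" or punctuation, and the sample sentence is made up.

```python
import re

DEGREE_DEMO = re.compile(
    r"\b((?:ph\.?d|mba|ll\.?m|master(?:'s)?|bachelor(?:'s)?)"  # degree keyword
    r"(?:\s+in\s+[A-Za-z&\- ]+?)?)"                            # optional "in <field>", lazy
    r"(?=\s+(?:from|at)\b|[,.;]|$)",                           # stop before the institution
    re.IGNORECASE,
)

text = "PhD in Economics from Universidad de Navarra. Master in Finance, IE Business School. MBA."
degrees = [m.group(1) for m in DEGREE_DEMO.finditer(text)]
print(degrees)  # ['PhD in Economics', 'Master in Finance', 'MBA']
```

Without a stopping condition, a greedy character class that includes spaces tends to swallow the institution name into the degree string.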


In [None]:
def get_section(text, start_key, stop_keys):
    """Return substring from `start_key` until the earliest of `stop_keys`."""
    t = text or ""
    t_low = t.lower()
    s = t_low.find(start_key.lower())
    if s == -1:
        return ""
    e_candidates = [t_low.find(k.lower(), s + 1) for k in stop_keys]
    e_candidates = [e for e in e_candidates if e != -1]
    e = min(e_candidates) if e_candidates else len(t)
    return t[s:e]

# Degree patterns (English + some Spanish)
DEGREE_PAT = re.compile(
    r"""
    \b(
        ph\.?d\.?|doctor(?:ate)?\s+of\s+[A-Za-zÁÉÍÓÚÜÑ&\-\s]+|
        m\.?b\.?a\.?|m\.?sc\.?|m\.?s\.?|m\.?a\.?|ll\.?m\.?|
        b\.?sc\.?|b\.?s\.?|b\.?a\.?|
        master(?:'s)?\s+in\s+[A-Za-zÁÉÍÓÚÜÑ&\-\s]+|
        bachelor(?:'s)?\s+in\s+[A-Za-zÁÉÍÓÚÜÑ&\-\s]+|
        licenciatura\s+en\s+[A-Za-zÁÉÍÓÚÜÑ&\-\s]+|
        grado\s+en\s+[A-Za-zÁÉÍÓÚÜÑ&\-\s]+
    )\b
    """,
    re.IGNORECASE | re.VERBOSE,
)

# Subject extractor: "... in X" or "... of X"
SUBJECT_PAT = re.compile(r"\b(?:in|of)\s+([A-Z][A-Za-zÁÉÍÓÚÜÑ&\-\s]{3,})")

def split_background_fields(row):
    bg_text = get_section(
        row.get("clean_text", ""),
        start_key="Academic Background",
        stop_keys=["Academic Experience", "Corporate Experience"],
    )

    degrees = [m.group(0).strip().rstrip(",.;") for m in DEGREE_PAT.finditer(bg_text)]
    subjects = [m.group(1).strip().rstrip(",.;") for m in SUBJECT_PAT.finditer(bg_text)]

    d = row["professor_dict"].copy()
    ab_orgs = d.get("Academic Background - Organization", [])
    cleaned_ab_orgs = []
    for org in ab_orgs:
        if DEGREE_PAT.search(org) or SUBJECT_PAT.search(org):
            continue
        cleaned_ab_orgs.append(org)

    d["Academic Background - Organization"] = cleaned_ab_orgs
    d["Academic Background - Education"] = sorted(
        set(d.get("Academic Background - Education", []) + degrees)
    )
    d["Academic Experience - Subjects"] = sorted(
        set(d.get("Academic Experience - Subjects", []) + subjects)
    )

    return d

df["professor_dict"] = df.apply(split_background_fields, axis=1)


### Clean and finalize professor dictionaries

The raw buckets can still contain noise (section headers, conjunctions, etc.).  
The `finalize_professor_dict` step:

- Removes short or obviously junk tokens.
- Normalizes common variants such as “I E University” → “IE”.
- De-duplicates each list.
- Applies a fallback: if all buckets are empty for a professor, it falls back to the global
  NER entities so that the profile still contributes to the graph.

This produces a compact but robust `professor_dict` for each profile.
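
The list-cleaning rules can be tried out on a toy list. This is a minimal sketch with a reduced junk set; `clean_list_demo` and `JUNK_DEMO` are illustrative names, and the input values are made up.

```python
import re

JUNK_DEMO = {"of", "and", "school", "university", "experience"}

def clean_list_demo(values):
    """Drop junk, canonicalize IE variants, and de-duplicate while keeping order."""
    out, seen = [], set()
    for v in values:
        v = re.sub(r"\s+", " ", v).strip()
        if v.lower().replace(" ", "") in {"ie", "ieuniversity"}:
            v = "IE"  # canonicalize first so the two-letter label survives the length filter
        elif len(v) < 3 or v.lower() in JUNK_DEMO:
            continue
        if v not in seen:
            seen.add(v)
            out.append(v)
    return out

cleaned = clean_list_demo(
    ["I E University", "IE", "University", "Banco  Santander", "of", "Banco Santander"]
)
print(cleaned)  # ['IE', 'Banco Santander']
```

Using a `seen` set next to the output list keeps the first-seen order, which a plain `set()` would not guarantee.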


In [None]:
JUNK_TOKENS = {
    "academic ba",
    "academic back",
    "of",
    "and",
    "school",
    "university",
    "ll. m",
    "& am",
    "& amp",
    "research",
    "experience",
    "business",
    "administration",
}
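# Prefixes that mark NER fragments of section headers or dangling conjunctions.
# Bare "school" / "university" are already handled by JUNK_TOKENS; they are not
# listed as prefixes (nor is "the"), because a prefix match would also drop real
# names such as "University Of Oxford" or "The World Bank".
JUNK_RE = re.compile(r"^(?:&|(?:of|and|section|experience|academic)\b)", re.I)

def _clean_list(vals):
    out, seen = [], set()
    for v in vals:
        v_norm = re.sub(r"\s+", " ", v).strip()
        v_low = v_norm.lower()
        # Canonicalize IE variants first: "IE" has two characters and would
        # otherwise be removed by the length filter below.
        if v_low.replace(" ", "") in {"ieuniversity", "ie"}:
            v_norm = "IE"
        elif len(v_norm) < 3:
            continue
        elif v_low in JUNK_TOKENS or JUNK_RE.match(v_norm):
            continue
        if v_norm not in seen:
            out.append(v_norm)
            seen.add(v_norm)
    return out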

def finalize_professor_dict(row):
    d = row["professor_dict"].copy()
    d["Corporate Experience - Organization"] = _clean_list(
        d.get("Corporate Experience - Organization", [])
    )
    d["Corporate Experience - Location"] = _clean_list(
        d.get("Corporate Experience - Location", [])
    )
    d["Academic Background - Organization"] = _clean_list(
        d.get("Academic Background - Organization", [])
    )
    d["Academic Background - Education"] = _clean_list(
        d.get("Academic Background - Education", [])
    )
    d["Academic Experience - Subjects"] = _clean_list(
        d.get("Academic Experience - Subjects", [])
    )
    d["Academic Experience - Courses"] = _clean_list(
        d.get("Academic Experience - Courses", [])
    )

    # Fallback: if everything is empty, use global NER buckets
    if not any(d[k] for k in d.keys()):
        ents = row["normalized_entities"]
        d["Corporate Experience - Organization"] = _clean_list([e[1] for e in ents if e[0] == "ORG"])
        d["Corporate Experience - Location"] = _clean_list([e[1] for e in ents if e[0] == "LOC"])
        d["Academic Background - Organization"] = _clean_list([e[1] for e in ents if e[0] == "ORG"])

    return d

df["professor_dict"] = df.apply(finalize_professor_dict, axis=1)


### Re-bucket entities based on section-specific text

To improve attribution, I run a final pass where I:

- Re-run NER on just the **Corporate Experience** and **Academic Background** slices.
- Use these section-specific entities to overwrite the earlier organization/location buckets.

This strengthens the link between each entity and the part of the CV where it appears.
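
One thing to watch with an overwrite step is that an empty section result can erase values collected earlier (for example, for profiles without section headers). The sketch below shows one hypothetical guard, replacing a bucket only when the section pass found something; `overwrite_if_found` and the sample values are illustrative.

```python
def overwrite_if_found(prof_dict, key, section_pairs, wanted_type):
    """Return a copy of prof_dict where `key` is replaced only if the section has matches."""
    values = sorted({name for typ, name in section_pairs if typ == wanted_type})
    updated = dict(prof_dict)  # shallow copy; the input dict is left untouched
    if values:
        updated[key] = values
    return updated

key = "Corporate Experience - Organization"
profile = {key: ["Acme Corp"]}

with_hits = overwrite_if_found(
    profile, key, [("ORG", "Globex"), ("LOC", "Madrid"), ("ORG", "Globex")], "ORG"
)
without_hits = overwrite_if_found(profile, key, [], "ORG")

print(with_hits[key])     # ['Globex']
print(without_hits[key])  # ['Acme Corp']
```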


In [None]:
def section_entities(text, start_key, stop_keys, maxlen=2000):
    s = get_section(text, start_key, stop_keys)
    if not s:
        return []
    ents = ner(s[:maxlen])
    # Apply the alias maps as well, so the rebuilt buckets use the same
    # canonical labels as the clustered entities.
    return apply_aliases(normalize_entities(ents))

def rebuild_from_sections(row):
    d = row["professor_dict"].copy()
    txt = row.get("clean_text", "")

    corp_ents = section_entities(
        txt, "Corporate Experience", ["Academic Background", "Academic Experience"]
    )
    acad_bg_ents = section_entities(
        txt, "Academic Background", ["Corporate Experience", "Academic Experience"]
    )

    new_values = {
        "Corporate Experience - Organization": {n for t, n in corp_ents if t == "ORG"},
        "Corporate Experience - Location": {n for t, n in corp_ents if t == "LOC"},
        "Academic Background - Organization": {n for t, n in acad_bg_ents if t == "ORG"},
    }
    # Overwrite a bucket only when the section pass found something; otherwise
    # profiles without headers would lose the fallback set in finalize_professor_dict.
    for key, values in new_values.items():
        if values:
            d[key] = sorted(values)
    return d

df["professor_dict"] = df.apply(rebuild_from_sections, axis=1)


### Final noise filtering

As a last step, I filter out leftover headings and geographic placeholders that slipped into
organization lists. This keeps the organization nodes focused on institutions and companies,
not on countries or section labels.
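
The filter can be sketched on a made-up organization list. `JUNK_DEMO` is a shortened, illustrative version of the pattern in the next cell, and `PLACES_DEMO` is a reduced place set.

```python
import re

JUNK_DEMO = re.compile(
    r"^(academic( back| exp).*|section \d+(st|nd|rd|th)|&\s*am?p?)$", re.IGNORECASE
)
PLACES_DEMO = {"Spain", "Madrid", "Europe"}

def filter_orgs_demo(orgs):
    """Keep only names that are neither header fragments nor bare place names."""
    return sorted({o for o in orgs if not JUNK_DEMO.search(o) and o not in PLACES_DEMO})

kept = filter_orgs_demo(
    ["Academic Background", "Section 2nd", "& Amp", "Spain", "Banco Santander", "McKinsey & Company"]
)
print(kept)  # ['Banco Santander', 'McKinsey & Company']
```

Anchoring the pattern at both ends keeps legitimate names that merely contain an ampersand or the word "section".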


In [None]:
JUNK_ORG = re.compile(
    r"^(academic( back| exp).*$|of excellence$|section \d+(st|nd|rd|th)$|"
    r"(journal|conference|school of|law school|business school)$|"
    r"&\s*am?p?$|watkins ll?$|ll\.?$)",
    re.IGNORECASE,
)
PLACE_WORDS = {"Spain", "Madrid", "Paris", "London", "Italy", "United States", "Europe"}

def clean_prof_dict(d):
    def filt_org(lst):
        out = []
        for x in lst:
            # "IE" is a valid two-letter canonical label; exempt it from the length filter
            if len(x) < 3 and x != "IE":
                continue
            # drop header fragments and bare place names from organization lists
            if JUNK_ORG.search(x) or x in PLACE_WORDS:
                continue
            out.append(x)
        return sorted(set(out))

    d["Corporate Experience - Organization"] = filt_org(
        d.get("Corporate Experience - Organization", [])
    )
    d["Academic Background - Organization"] = filt_org(
        d.get("Academic Background - Organization", [])
    )
    d["Academic Experience - Subjects"] = [
        s for s in d.get("Academic Experience - Subjects", []) if s not in PLACE_WORDS
    ]
    return d

df["professor_dict"] = df["professor_dict"].apply(clean_prof_dict)


### Sanity check and export cleaned data

Before exporting, I sample a subset of professors and print their dictionaries to verify that:

- Corporate and academic organizations look reasonable.
- Degrees and subjects are being captured correctly.
- Obvious noise has been removed.

Finally, I save:

- A full parquet file (`teachers_db_cleaned.parquet`) with all intermediate columns.
- A compact JSON file (`cleaned_professors.json`) containing only the cleaned text and `professor_dict`.

The JSON file is the input to Notebook 2, where I build the professor knowledge graph.
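
For orientation, this is roughly what one exported record looks like and how a downstream step might flatten it into (professor, relation, value) triples. The record is made up, and the triple layout is an assumption about how Notebook 2 could consume the file, not a description of it.

```python
import json

# Made-up record in the shape produced by to_json(orient="records")
records_json = """[
  {"clean_text": "Corporate Experience Analyst at Acme Corp, Madrid.",
   "professor_dict": {
     "Corporate Experience - Organization": ["Acme Corp"],
     "Corporate Experience - Location": ["Madrid"],
     "Academic Background - Organization": [],
     "Academic Background - Education": [],
     "Academic Experience - Courses": [],
     "Academic Experience - Subjects": []}}
]"""

records = json.loads(records_json)

# Flatten every non-empty bucket into (professor_index, relation, value) triples
edges = [
    (idx, key, value)
    for idx, rec in enumerate(records)
    for key, values in rec["professor_dict"].items()
    for value in values
]
print(edges)
```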


In [None]:
n_preview = min(30, len(df))  # sample() raises if n exceeds the number of rows
for i, dct in enumerate(df["professor_dict"].sample(n_preview, random_state=43).to_list(), 1):
    print(f"\n--- Professor {i} ---")
    for k, v in dct.items():
        print(f"{k}: {v}")


In [None]:
os.makedirs("../data/processed", exist_ok=True)

# Parquet with full intermediate data (useful for debugging or further analysis)
parquet_path = "../data/processed/teachers_db_cleaned.parquet"
df.to_parquet(parquet_path, index=False)
print(f"Saved cleaned parquet to {parquet_path}")

# Compact JSON for Notebook 2: only the text + structured dictionary
json_path = "../data/processed/cleaned_professors.json"
df[["clean_text", "professor_dict"]].to_json(
    json_path,
    orient="records",
    indent=2,
    force_ascii=False,
)
print(f"Saved cleaned JSON for graph building to {json_path}")
