# 01 — Training Pipeline (Adapted to Your CSVs)

This notebook is **tailored to the retrieved files**:

- **`./data/decompose.csv`** → Step 2 (complex → simple). We **explode** each row so that a *Summary sentence* maps to **multiple** simple *Factoids*.
- **`./data/triples.csv`** → Step 3 (simple → triple). We build targets in the format **`Object 1 | Relationship | Object 2`**.
- **`./data/reltype.csv`** → Step 4a (relation normalization). We create a training set of **raw relation strings** (from `triples.csv`) mapped to **canonical relation labels** (from `reltype.csv`) using **exact+fuzzy matching**.
- **`./data/classes.csv`** → Step 4b (class taxonomy mapping). We build a **taxonomy** from the hierarchical columns and map each **Entity** to a **leaf node**. The notebook emits a `taxonomy.csv` that the ingestion notebook uses.

We keep models small and finetune-ready:
- **T5-small** for Steps 2 & 3 (seq2seq)
- **DistilBERT** for Steps 4a & 4b (classification)

> Training calls are commented out by default; uncomment them to run training.


## 0) Environment Setup

If needed, install packages. If you're offline here, make sure the environment already has them.

In [None]:
# !pip install -U pip
# !pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
# !pip install transformers datasets accelerate sentencepiece scikit-learn evaluate
# !pip install pandas numpy tqdm matplotlib python-levenshtein rapidfuzz
# !pip install nltk spacy rdflib joblib
# !python -m spacy download en_core_web_sm


## 1) Config & Paths

We point directly at the CSVs uploaded under `./data/`. Trained models go to `./drive/MyDrive/MGR/trained_models/` relative to the working directory, and the generated `taxonomy.csv` is written back to `./data/`.

In [None]:
from pathlib import Path
import pandas as pd

DATA_DIR = Path("./data")

CSV_DECOMPOSE = DATA_DIR / "decompose.csv"   # Source, Summary sentence, Factoids
CSV_TRIPLES   = DATA_DIR / "triples.csv"     # Factoid, Triplet, Object 1, Relationship, Object 2
CSV_RELTYPE   = DATA_DIR / "reltype.csv"     # relationship, type, domain, range, otherCharacteristics
CSV_CLASSES   = DATA_DIR / "classes.csv"     # Class, Parent Class

OUT_DIR = Path("./drive/MyDrive/MGR/trained_models")
OUT_DIR.mkdir(parents=True, exist_ok=True)

MODEL_DIR_SIMPLIFIER   = OUT_DIR / "t5_simplifier_step2"
MODEL_DIR_TRIPLE       = OUT_DIR / "t5_triple_step3"
MODEL_DIR_REL_CLASSIF  = OUT_DIR / "distilbert_relation_step4a"
MODEL_DIR_CLASS_CLASSIF= OUT_DIR / "distilbert_class_step4b"

LABELMAP_DIR = OUT_DIR / "label_maps"
LABELMAP_DIR.mkdir(parents=True, exist_ok=True)

DATA_OUT = Path("./data")
DATA_OUT.mkdir(parents=True, exist_ok=True)
TAXONOMY_CSV = DATA_OUT / "taxonomy.csv"

BASE_NS = "http://example.org/telecom#"


## 2) Inspect & Parse Your CSVs

We auto-parse & normalize columns from four files:
- `decompose.csv` → pairs: **(Summary sentence → individual factoid)**
- `triples.csv` → pairs: **(Simple sentence → `Object 1 | Relationship | Object 2`)**
- `reltype.csv` → canonical **relation vocabulary** (with `domain`, `range`, `type`)
- `classes.csv` → hierarchical **class taxonomy** (we emit `taxonomy.csv` for later use)


In [None]:
from rapidfuzz import process, fuzz
import numpy as np

# ---------- Load raw CSVs ----------
df_dec = pd.read_csv(CSV_DECOMPOSE)
df_tri = pd.read_csv(CSV_TRIPLES)
df_rel = pd.read_csv(CSV_RELTYPE)
df_cls = pd.read_csv(CSV_CLASSES)

print("decompose.csv columns:", df_dec.columns.tolist(), "shape:", df_dec.shape)
print("triples.csv   columns:", df_tri.columns.tolist(), "shape:", df_tri.shape)
print("reltype.csv   columns:", df_rel.columns.tolist(), "shape:", df_rel.shape)
print("classes.csv   columns:", df_cls.columns.tolist(), "shape:", df_cls.shape)

# ---------- Step 2 dataset: explode Summary → Factoids ----------
def parse_factoids_cell(cell: str):
    if pd.isna(cell):
        return []
    txt = str(cell).strip().replace("\r", "\n")
    prelim = []
    for line in txt.split("\n"):
        # Allow comma-separated factoids on the same line.
        # Caveat: this also splits any factoid that itself contains a comma.
        parts = line.split(",")
        for piece in parts:
            t = piece.strip().strip('"').strip("'").strip()
            if t:
                prelim.append(t)
    out = []
    seen = set()
    for t in prelim:
        t2 = t.strip().strip(".;:").strip()
        # drop very short noise
        if len(t2) < 3:
            continue
        if t2 not in seen:
            seen.add(t2)
            out.append(t2)
    return out

dec_rows = []
sum_col = "Summary sentence"
fac_col = "Factoids"
for _, row in df_dec.iterrows():
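    summary = row.get(sum_col, "")
    if pd.isna(summary) or not str(summary).strip():
        continue  # skip rows without a summary sentence (str(NaN) would otherwise become "nan")
    complex_sent = str(summary).strip()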
    facts = parse_factoids_cell(row.get(fac_col, ""))
    for f in facts:
        dec_rows.append({"complex": complex_sent, "simple": f})

df_s2 = pd.DataFrame(dec_rows)
print("Step2 pairs (complex→simple):", df_s2.shape)
print(df_s2.head(5))

# ---------- Step 3 dataset: simple → triple_text ----------
# Use 'Factoid' as the simple input. Prefer prebuilt 'Triplet' if present and in A|R|B format.
def normalize_triplet_text(row):
    trip = str(row.get("Triplet", "")).strip()
    if trip and "|" in trip:
        # assume already formatted
        return " | ".join([p.strip() for p in trip.split("|")[:3]])
    # else, build from columns
    a = str(row.get("Object 1", "")).strip()
    r = str(row.get("Relationship", "")).strip()
    b = str(row.get("Object 2", "")).strip()
    return f"{a} | {r} | {b}"

df_s3 = pd.DataFrame({
    "simple": df_tri["Factoid"].astype(str).str.strip(),
    "triple_text": df_tri.apply(normalize_triplet_text, axis=1).astype(str)
})
print("Step3 pairs (simple→triple_text):", df_s3.shape)
print(df_s3.head(5))

# ---------- Step 4a dataset: raw_relation → normalized_relation ----------
canon_relations = df_rel["relationship"].dropna().astype(str).str.strip().unique().tolist()

def best_match(label, choices, score_cut=70):
    if label is None or str(label).strip()=="" or not choices:
        return None, 0
    label = str(label).strip()
    match, score, _ = process.extractOne(label, choices, scorer=fuzz.token_sort_ratio)
    if score >= score_cut:
        return match, score
    return None, score

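rel_pairs = []
canon_set = set(canon_relations)
for raw in df_tri["Relationship"].fillna("").astype(str).str.strip().tolist():
    if not raw:
        continue  # skip empty relation strings so they do not become a label
    if raw in canon_set:
        norm = raw  # exact match against the canonical vocabulary
    else:
        norm, _ = best_match(raw, canon_relations, score_cut=70)  # fuzzy match
        if norm is None:
            norm = raw  # leave as-is if no good match
    rel_pairs.append({"raw_relation": raw, "normalized_relation": norm})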

df_r = pd.DataFrame(rel_pairs).dropna()
print("Step4a pairs (raw→normalized):", df_r.shape)
print(df_r.head(5))

# ---------- Step 4b dataset & taxonomy from Class / Parent Class ----------
if not {"Class","Parent Class"}.issubset(df_cls.columns):
    raise ValueError("classes.csv must contain 'Class' and 'Parent Class' columns.")

def sanitize(s: str):
    s = str(s).strip()
    return "".join(ch if ch.isalnum() else "_" for ch in s).strip("_")

# Build node table & parent links
labels = set(df_cls["Class"].dropna().astype(str).tolist()) | set(df_cls["Parent Class"].dropna().astype(str).tolist())
labels = {l for l in labels if l and l.lower() != "nan"}
label2id = {lab: f"n{idx+1:05d}" for idx, lab in enumerate(sorted(labels))}

taxonomy_rows = []
for lab, nid in label2id.items():
    taxonomy_rows.append({"node_id": nid, "parent_id": None, "label": lab, "uri": f"{BASE_NS}{sanitize(lab)}"})

# Parent edges (we'll keep them in CSV via parent_id column for the child)
parent_map = dict(zip(df_cls["Class"].astype(str), df_cls["Parent Class"].astype(str)))
# Update parent_id for those with known parent
for row in taxonomy_rows:
    lab = row["label"]
    parent_label = parent_map.get(lab, None)
    if parent_label in label2id:
        row["parent_id"] = label2id[parent_label]

df_tax = pd.DataFrame(taxonomy_rows)
df_tax.to_csv(TAXONOMY_CSV, index=False)

# Class mappings (raw_class → its node_id)
class_labels = sorted(labels)  # fixed order so the table is reproducible across runs
df_c = pd.DataFrame({"raw_class": class_labels, "taxonomy_node": [label2id[lab] for lab in class_labels]})
print("Taxonomy nodes:", df_tax.shape)
print(df_tax.head(8))
print("Class mappings (raw→node_id):", df_c.shape)
print(df_c.head(8))
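
The introduction describes Step 4b as mapping entities to leaf nodes, whereas the cell above lists every class label. A minimal sketch, assuming rows shaped like `taxonomy_rows` (`node_id`, `parent_id`, `label`), of how leaf nodes could be picked out. The helper name `find_leaf_ids` and the toy rows are illustrative and not part of the notebook:

```python
def find_leaf_ids(rows):
    """Return node_ids that never appear as another node's parent_id."""
    parent_ids = {r["parent_id"] for r in rows if r["parent_id"] is not None}
    return sorted(r["node_id"] for r in rows if r["node_id"] not in parent_ids)

# Toy taxonomy (hypothetical labels): Device -> Router, Device -> Switch
toy_rows = [
    {"node_id": "n00001", "parent_id": None, "label": "Device"},
    {"node_id": "n00002", "parent_id": "n00001", "label": "Router"},
    {"node_id": "n00003", "parent_id": "n00001", "label": "Switch"},
]
leaf_ids = find_leaf_ids(toy_rows)
print(leaf_ids)
```

Restricting the class-mapping targets to these ids is one option if only leaves should be predicted; whether that is desirable depends on how the ingestion notebook consumes `taxonomy.csv`.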

## 3) Build Datasets for Training

In [None]:
from datasets import Dataset

def build_seq2seq_dataset(df, source_col, target_col):
    df = df[[source_col, target_col]].dropna().reset_index(drop=True)
    return Dataset.from_pandas(df.rename(columns={source_col: "input_text", target_col: "target_text"}))

def build_classification_dataset(df, text_col, label_col):
    df = df[[text_col, label_col]].dropna().reset_index(drop=True)
    labels = sorted(df[label_col].astype(str).unique())
    label2id = {lab:i for i, lab in enumerate(labels)}
    id2label = {i:lab for lab, i in label2id.items()}
    ds = Dataset.from_pandas(df.rename(columns={text_col:"text", label_col:"label"}))
    ds = ds.map(lambda x: {"label_id": label2id[str(x["label"])]})
    return ds, label2id, id2label

ds_s2 = build_seq2seq_dataset(df_s2, "complex", "simple")
ds_s3 = build_seq2seq_dataset(df_s3, "simple", "triple_text")
ds_rel, rel_label2id, rel_id2label = build_classification_dataset(df_r, "raw_relation", "normalized_relation")
ds_cls, cls_label2id, cls_id2label = build_classification_dataset(df_c, "raw_class", "taxonomy_node")

print("Datasets ready:",
      "\n - Step2 seq2seq:", len(ds_s2),
      "\n - Step3 seq2seq:", len(ds_s3),
      "\n - Step4a cls  :", len(ds_rel),
      "\n - Step4b cls  :", len(ds_cls))


## 4) Train Step 2 — Complex → Simple (T5-small)

In [None]:
from transformers import T5ForConditionalGeneration, T5TokenizerFast, DataCollatorForSeq2Seq, TrainingArguments, Trainer

tokenizer_t5 = T5TokenizerFast.from_pretrained("t5-small")
model_s2 = T5ForConditionalGeneration.from_pretrained("t5-small")

def preprocess_t5(batch, tokenizer, max_in=512, max_out=128):
    model_inputs = tokenizer(batch["input_text"], max_length=max_in, truncation=True)
    labels = tokenizer(text_target=batch["target_text"], max_length=max_out, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tok_s2 = ds_s2.map(lambda x: preprocess_t5(x, tokenizer_t5), batched=True, remove_columns=ds_s2.column_names)

args_s2 = TrainingArguments(
    output_dir=str(MODEL_DIR_SIMPLIFIER),
    learning_rate=3e-4,
    per_device_train_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_steps=50
)
# NOTE: We intentionally removed `evaluation_strategy` and `save_strategy` for older Transformers.
collator = DataCollatorForSeq2Seq(tokenizer_t5, model=model_s2)
trainer_s2 = Trainer(model=model_s2, args=args_s2, train_dataset=tok_s2, data_collator=collator, tokenizer=tokenizer_t5)

# trainer_s2.train()  # ← un-comment to train
# trainer_s2.save_model(MODEL_DIR_SIMPLIFIER)
# tokenizer_t5.save_pretrained(MODEL_DIR_SIMPLIFIER)
print("Prepared Step 2 trainer (complex→simple). Un-comment trainer_s2.train() to run.")

## 5) Train Step 3 — Simple → Triple (T5-small)

In [None]:
# Reuse same tokenizer for simplicity
model_s3 = T5ForConditionalGeneration.from_pretrained("t5-small")

tok_s3 = ds_s3.map(lambda x: preprocess_t5(x, tokenizer_t5), batched=True, remove_columns=ds_s3.column_names)

args_s3 = TrainingArguments(
    output_dir=str(MODEL_DIR_TRIPLE),
    learning_rate=3e-4,
    per_device_train_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_steps=50
)
# NOTE: We intentionally removed `evaluation_strategy` and `save_strategy` for older Transformers.
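collator_s3 = DataCollatorForSeq2Seq(tokenizer_t5, model=model_s3)  # bind the collator to this step's model
trainer_s3 = Trainer(model=model_s3, args=args_s3, train_dataset=tok_s3, data_collator=collator_s3, tokenizer=tokenizer_t5)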

# trainer_s3.train()  # ← un-comment to train
# trainer_s3.save_model(MODEL_DIR_TRIPLE)
# tokenizer_t5.save_pretrained(MODEL_DIR_TRIPLE)
print("Prepared Step 3 trainer (simple→triple). Un-comment trainer_s3.train() to run.")

## 6) Train Step 4a — Relation Normalizer (DistilBERT)

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments, DataCollatorWithPadding
import joblib

tok_rel = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model_rel = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=len(rel_label2id),
    id2label=rel_id2label,  # built alongside rel_label2id in Section 3
    label2id=rel_label2id
)

def tokenize_cls(batch, tokenizer, max_len=64):
    # No padding here: DataCollatorWithPadding pads each training batch dynamically.
    return tokenizer(batch["text"], truncation=True, max_length=max_len)

tok_ds_rel = ds_rel.map(lambda x: tokenize_cls(x, tok_rel), batched=True)
# Remove any leftover text/label columns and rename label_id→labels
drop_cols = [c for c in tok_ds_rel.column_names if c in ("text","label")]
tok_ds_rel = tok_ds_rel.remove_columns(drop_cols)
tok_ds_rel = tok_ds_rel.rename_column("label_id", "labels")

args_rel = TrainingArguments(
    output_dir=str(MODEL_DIR_REL_CLASSIF),
    learning_rate=3e-5,
    per_device_train_batch_size=16,
    num_train_epochs=8,
    weight_decay=0.01,
    logging_steps=50
)
collator_rel = DataCollatorWithPadding(tokenizer=tok_rel)
trainer_rel = Trainer(model=model_rel, args=args_rel, train_dataset=tok_ds_rel, data_collator=collator_rel, tokenizer=tok_rel)

# trainer_rel.train()
# trainer_rel.save_model(MODEL_DIR_REL_CLASSIF)
# tok_rel.save_pretrained(MODEL_DIR_REL_CLASSIF)
# joblib.dump(rel_label2id, LABELMAP_DIR / "relation_label2id.joblib")
# joblib.dump({v:k for k,v in rel_label2id.items()}, LABELMAP_DIR / "relation_id2label.joblib")
print("Prepared Step 4a trainer (relation normalization). Un-comment trainer_rel.train() to run.")

## 7) Train Step 4b — Class Taxonomy Mapper (DistilBERT)

Quick summary: when (and why) use BERT here

**When BERT is a good fit**

- We can reduce the label space (e.g., per vertical or per parent class) and have dozens or more examples per label.
- We implement hierarchical classification (predict parent → child), metric learning (bi-encoder + nearest label), or retrieval-augmented matching (encode labels, then pick the nearest).
- We need low latency, fixed cost, or offline/on-prem inference.

**Why BERT struggled in our case**

We have ~1.9k classes and extremely sparse per-class data. With a flat softmax, the loss of a uniform random guess is ln(1894) ≈ 7.55, so the training loss hovers at 7 or above without much more data or architectural tricks.
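
The baseline figure can be checked directly: a classifier that spreads probability uniformly over K classes has cross-entropy ln(K). A small check, using only the class count quoted above:

```python
import math

num_classes = 1894  # class count quoted in the paragraph above
# Cross-entropy of a uniform prediction: -ln(1/K) = ln(K)
uniform_loss = -math.log(1.0 / num_classes)
print(round(uniform_loss, 3))
```

A training loss that stays near this value suggests the model is doing no better than guessing among all classes.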

**Practical hybrid recipe**

Keep BERT for relations (4a) and/or for candidate retrieval (bi-encoder), then let the OpenAI model make the final pick among 20–50 candidates (what v7 does). The aim is to keep costs low, since the larger model only sees a short candidate list, while keeping accuracy high.
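
The recipe can be sketched as two stages: a cheap similarity shortlist, then a final choice restricted to that shortlist. The sketch below is a toy under stated assumptions: vectors are plain Python lists, `shortlist` and `pick_final` are hypothetical helpers, and `chooser` stands in for whatever model makes the final pick (no OpenAI call is made here).

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def shortlist(query_vec, label_vecs, k=20):
    """Stage 1: return the k labels most similar to the query, best first."""
    scored = sorted(label_vecs.items(), key=lambda kv: cosine(query_vec, kv[1]), reverse=True)
    return [label for label, _ in scored[:k]]

def pick_final(query_text, candidates, chooser):
    """Stage 2: let `chooser` decide, but only accept an answer from the shortlist."""
    choice = chooser(query_text, candidates)
    return choice if choice in candidates else candidates[0]

# Toy 2-d "embeddings" (made up for illustration)
label_vecs = {"Router": [1.0, 0.1], "Switch": [0.9, 0.3], "Invoice": [0.0, 1.0]}
cands = shortlist([1.0, 0.2], label_vecs, k=2)
final = pick_final("edge router", cands, chooser=lambda text, c: c[0])
print(cands, final)
```

Falling back to the top-ranked candidate when the chooser returns something outside the shortlist is one design choice; rejecting the mapping outright would be another.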

In [None]:
OPENAI_API_KEY="example"
# optional: pick a model
OPENAI_CLASS_MODEL="gpt-5"

In [None]:
# Uses (or builds) taxonomy embeddings to preview mappings for objects from triples.csv,
# computes simple score stats, suggests thresholds, and writes a debug CSV.
# If embeddings are missing, this cell will build them here (no need to run Section 5 first).
# Prereq: set OPENAI_API_KEY in your environment.

# !pip install -U openai  # uncomment if the SDK isn't available

import os, json, numpy as np, pandas as pd, re
from pathlib import Path

try:
    from openai import OpenAI
except ImportError as e:
    raise RuntimeError("OpenAI SDK not installed. Run `pip install -U openai` and re-run this cell.") from e

OPENAI_4B_DIR = Path("./drive/MyDrive/MGR/trained_models/openai_class_mapper_4b")
OPENAI_4B_DIR.mkdir(parents=True, exist_ok=True)
TAXO_EMB_NPY  = OPENAI_4B_DIR / "taxonomy_embeddings.npy"
TAXO_META_JSON= OPENAI_4B_DIR / "taxonomy_meta.json"
TAXONOMY_CSV  = Path("./data/taxonomy.csv")

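# Prefer a key already set in the environment; otherwise fall back to the value from the previous cell.
os.environ.setdefault("OPENAI_API_KEY", OPENAI_API_KEY)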
client = OpenAI()
emb_model = "text-embedding-3-large"

# --- Helpers ---
def l2norm(x): return x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-9)
def cosine_sim(a,b): return l2norm(a) @ l2norm(b).T

def batched(iterable, n=256):
    for i in range(0, len(iterable), n):
        yield iterable[i:i+n]

def embed_texts(texts, batch=256):
    out = []
    for chunk in batched(texts, batch):
        resp = client.embeddings.create(model=emb_model, input=chunk)
        out.extend([e.embedding for e in resp.data])
    return np.array(out, dtype="float32")

def boost_with_path(object_str, node_text):
    s = object_str.lower()
    t = node_text.lower()
    score = 0.0
    leaf = node_text.split("⟂")[0].strip().lower() if "⟂" in node_text else node_text.lower()
    if s == leaf:
        score += 0.06
    toks = [w for w in re.findall(r"[a-zA-Z0-9]+", s) if len(w) > 2]
    overlap = sum(1 for w in toks if w in t)
    if overlap:
        score += min(0.04, 0.01 * overlap)
    return score

# --- Ensure taxonomy is available ---
if 'df_tax' not in globals():
    assert TAXONOMY_CSV.exists(), "Missing ./data/taxonomy.csv — run earlier sections to build taxonomy first."
    df_tax = pd.read_csv(TAXONOMY_CSV)

label_by_id = {row["node_id"]: row["label"] for _, row in df_tax.iterrows()}
uri_by_id   = {row["node_id"]: row["uri"] for _, row in df_tax.iterrows()}
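# parent_id is None in memory but NaN after a CSV round-trip; normalize both to None.
parents     = {row["node_id"]: (row["parent_id"] if pd.notna(row["parent_id"]) else None)
               for _, row in df_tax.iterrows()}

def label_with_context(node_id):
    """Return 'leaf ⟂ parent ⟂ ... ⟂ root' for a node."""
    chain, seen = [], set()
    cur = node_id
    while cur and cur not in seen:  # the seen-set guards against cycles in classes.csv
        seen.add(cur)
        chain.append(label_by_id[cur])
        cur = parents.get(cur, None)
    return " ⟂ ".join(chain)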

# --- Build embeddings here if missing ---
if not (TAXO_EMB_NPY.exists() and TAXO_META_JSON.exists()):
    print("Embeddings not found. Building taxonomy embeddings now (Section 7 fallback)...")
    node_ids = sorted(label_by_id.keys())
    texts = [label_with_context(nid) for nid in node_ids]
    mat = embed_texts(texts, batch=256)
    np.save(TAXO_EMB_NPY, mat)
    with open(TAXO_META_JSON, "w", encoding="utf-8") as f:
        json.dump({"model": emb_model, "node_ids": node_ids, "texts": texts}, f, ensure_ascii=False, indent=2)
    print("Saved:", TAXO_EMB_NPY, "and", TAXO_META_JSON)

# --- Load embeddings ---
mat = np.load(TAXO_EMB_NPY)
with open(TAXO_META_JSON, "r", encoding="utf-8") as f:
    meta = json.load(f)
node_ids = meta["node_ids"]
texts    = meta["texts"]
assert mat.shape[0] == len(node_ids), "Embedding matrix vs node_ids mismatch"

# --- Preview mappings from triples.csv ---
if 'df_tri' not in globals():
    df_tri = pd.read_csv(Path('./data/triples.csv'))

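# Drop missing values before casting to str; otherwise NaN turns into the literal string "nan".
objects = pd.concat([df_tri["Object 1"], df_tri["Object 2"]]).dropna().astype(str).str.strip()
objects = objects[objects != ""].unique().tolist()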

obj_vecs = embed_texts(objects, batch=256)
sims = cosine_sim(obj_vecs, mat)
top_idx = sims.argsort(axis=1)[:, ::-1][:, :5]
top_scores = np.take_along_axis(sims, top_idx, axis=1)

# Apply small path-based boost and pick best
best_ids = []
best_scores = []
for i, obj in enumerate(objects):
    cand_idxs = top_idx[i]
    cand_scores = top_scores[i]
    adj = [cand_scores[j] + boost_with_path(obj, texts[cand_idxs[j]]) for j in range(len(cand_idxs))]
    j = int(np.argmax(adj))
    best_ids.append(node_ids[cand_idxs[j]])
    best_scores.append(float(adj[j]))

df_dbg = pd.DataFrame({"object_text": objects, "node_id": best_ids, "score": best_scores})
df_dbg["label"] = df_dbg["node_id"].map(lambda nid: label_by_id.get(nid, nid))
df_dbg["uri"]   = df_dbg["node_id"].map(lambda nid: uri_by_id.get(nid, nid))

# Suggest thresholds from distribution
p50, p75, p90 = np.percentile(df_dbg["score"], [50,75,90])
suggested = {"mid": float(max(0.70, p50)), "high": float(max(0.78, p75))}

df_dbg.to_csv("training_openai4b_mapping_preview.csv", index=False)
with open("training_openai4b_thresholds.json", "w") as f:
    json.dump({"p50": float(p50), "p75": float(p75), "p90": float(p90), "suggested": suggested}, f, indent=2)

print("Wrote training_openai4b_mapping_preview.csv with", len(df_dbg), "rows")
print("Suggested thresholds:", suggested)
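
How the suggested thresholds get used is not shown in this notebook. One plausible reading, offered as an assumption, is a three-way split: accept at or above `high`, flag for review between `mid` and `high`, and leave unmapped below `mid`. The function name `bucket_score` and the bucket names are hypothetical:

```python
def bucket_score(score, thresholds):
    """Map a similarity score to a decision bucket using 'mid'/'high' cut-offs."""
    if score >= thresholds["high"]:
        return "accept"
    if score >= thresholds["mid"]:
        return "review"
    return "unmapped"

# The floors used by the cell above when the percentiles come out lower
example_thresholds = {"mid": 0.70, "high": 0.78}
decisions = [bucket_score(s, example_thresholds) for s in (0.81, 0.74, 0.52)]
print(decisions)
```

Applying such a function to the `score` column of the preview CSV would give a quick count of how many objects fall into each bucket.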

## 8) (Optional) Simple Evaluation Examples

In [None]:
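# Colab only: mount Google Drive. OUT_DIR points under ./drive/MyDrive, so when working
# from /content this mount is what makes the saved models persist to Drive.
from google.colab import drive
drive.mount('/content/drive')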
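
The cell above only mounts Google Drive; no evaluation code is included. As a starting point, here is a minimal sketch of two metrics for Step 3 outputs, assuming predictions and references are strings in the `Object 1 | Relationship | Object 2` format. The helper names and toy strings are illustrative:

```python
def split_triple(text):
    """Split 'A | R | B' into three stripped, lower-cased parts (padded if short)."""
    parts = [p.strip().lower() for p in text.split("|")[:3]]
    return parts + [""] * (3 - len(parts))

def exact_match(preds, refs):
    """Fraction of predictions whose three slots all equal the reference."""
    hits = sum(split_triple(p) == split_triple(r) for p, r in zip(preds, refs))
    return hits / len(refs) if refs else 0.0

def slot_accuracy(preds, refs):
    """Per-slot accuracy for (object 1, relationship, object 2)."""
    totals = [0, 0, 0]
    for p, r in zip(preds, refs):
        for i, (a, b) in enumerate(zip(split_triple(p), split_triple(r))):
            totals[i] += int(a == b)
    return [t / len(refs) if refs else 0.0 for t in totals]

refs  = ["Router | connects to | Switch", "Cell | served by | Antenna"]
preds = ["router | connects to | switch", "Cell | uses | Antenna"]
em = exact_match(preds, refs)
slots = slot_accuracy(preds, refs)
print(em, slots)
```

Per-slot accuracy helps separate relation errors from entity errors, which matters here because relations are normalized separately in Step 4a. Comparison is case-insensitive by choice; drop the `.lower()` call if casing should count.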

## 9) Save Artifacts

Un-comment the save lines in Sections 4–6 after training. Section 7 writes its embeddings, preview CSV, and thresholds JSON on its own.

In [None]:
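# Folder to archive. The sections above save models under ./drive/MyDrive/MGR/trained_models,
# so point FOLDER there (or copy the models here first) if /content/trained_models does not exist.
FOLDER = "/content/trained_models"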
ARCHIVE = "/content/trained_models.zip"

# Zip it (recursive, store paths relative to the folder’s parent)
!cd "$(dirname "$FOLDER")" && zip -r "$(basename "$ARCHIVE")" "$(basename "$FOLDER")"

from google.colab import files
files.download(ARCHIVE)  # opens a browser download dialog