# Sarcasm Detection (RoBERTa-Large) — Full Colab Notebook (TPU-safe)

This notebook is **from scratch** and includes fixes for issues you hit:
- Loads two datasets (Kaggle SARC + optional HF conversation sarcasm dataset)
- Cleans/normalizes labels + context
- Stratified train/valid/test split (**ClassLabel** required)
- Trains **RoBERTa-Large** on **TPU v5e** using a **TPU-safe PyTorch/XLA loop** (avoids Trainer + fused AdamW issues)
- TPU-safe evaluation + classification report
- Threshold tuning (overall + hard/subtle subset)
- Optional **OpenAI gated ensemble** (only calls LLM when RoBERTa is uncertain; cached + quota-safe)
- Saves a final **ensemble package** (model + tokenizer + config + optional LLM cache)

> Run on TPU (Runtime → Change runtime type → Hardware accelerator → TPU).  
> OpenAI ensemble is optional and requires `OPENAI_API_KEY` with quota.


## 0) Install (TPU-friendly)

If you already installed conflicting packages, **restart runtime** after this cell.

In [None]:
!pip -q install -U "datasets>=2.20.0" "transformers>=4.40.0" "accelerate>=0.30.0" evaluate scikit-learn kagglehub
!pip -q install -U torch_xla[tpu] -f https://storage.googleapis.com/libtpu-releases/index.html
!pip -q install -U openai


## 1) Imports + TPU detection

In [None]:
import os, re, json, ast, hashlib, time, random
import numpy as np
import pandas as pd
import torch
from torch.utils.data import DataLoader

from datasets import Dataset, DatasetDict, ClassLabel, concatenate_datasets, load_dataset
from sklearn.metrics import confusion_matrix, classification_report, f1_score
from transformers import AutoTokenizer, AutoModelForSequenceClassification, get_linear_schedule_with_warmup, set_seed

# TPU (PyTorch/XLA)
try:
    import torch_xla.core.xla_model as xm
    import torch_xla.distributed.parallel_loader as pl
    TPU_AVAILABLE = True
except Exception as e:
    TPU_AVAILABLE = False
    xm = None
    pl = None
    print("TPU/XLA not available:", e)

SEED = 42
set_seed(SEED)
random.seed(SEED)
np.random.seed(SEED)

device = xm.xla_device() if TPU_AVAILABLE else torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Device:", device)


Device: xla:0




## 2) Download Reddit SARC from Kaggle (danofer/sarcasm)

In [None]:
import kagglehub
path = kagglehub.dataset_download("danofer/sarcasm")
print("Dataset path:", path)


Using Colab cache for faster access to the 'sarcasm' dataset.
Dataset path: /kaggle/input/sarcasm


## 3) Load + normalize SARC

In [None]:
import os
csv_path = os.path.join(path, "train-balanced-sarcasm.csv")
sarc = pd.read_csv(csv_path)

df_sarc = pd.DataFrame({
    "reply_text": sarc["comment"].astype(str),
    "context_text": sarc["parent_comment"].fillna("").astype(str),
    "label": sarc["label"].astype(int),
})

print("SARC rows:", len(df_sarc))
df_sarc.head()


SARC rows: 1010826


Unnamed: 0,reply_text,context_text,label
0,NC and NH.,"Yeah, I get that argument. At this point, I'd ...",0
1,You do know west teams play against west teams...,The blazers and Mavericks (The wests 5 and 6 s...,0
2,"They were underdogs earlier today, but since G...",They're favored to win.,0
3,"This meme isn't funny none of the ""new york ni...",deadass don't kill my buzz,0
4,I could use one of those tools.,Yep can confirm I saw the tool they use for th...,0


## 4) (Optional) Load conversation sarcasm dataset from Hugging Face

This dataset has columns like `label`, `response`, `context` and labels like `SARCASM`.

In [None]:
USE_HF_CONV = True  # set False to use only SARC

if USE_HF_CONV:
    conv_raw = load_dataset("shiv213/Automatic-Sarcasm-Detection-Twitter")["train"]

    def normalize_conv(example):
        ctx = example.get("context", "")
        if isinstance(ctx, str) and ctx.strip().startswith("["):
            try:
                ctx_list = ast.literal_eval(ctx)
                if isinstance(ctx_list, list):
                    ctx = " || ".join([str(x) for x in ctx_list])
            except Exception:
                pass

        lbl = example.get("label", 0)
        if isinstance(lbl, str):
            lbl = 1 if lbl.strip().upper() == "SARCASM" else 0
        elif isinstance(lbl, bool):
            lbl = 1 if lbl else 0
        else:
            lbl = int(lbl)

        return {"reply_text": str(example.get("response","")), "context_text": str(ctx), "label": lbl}

    conv = conv_raw.map(normalize_conv, remove_columns=conv_raw.column_names)
    print("Conv rows:", len(conv))
else:
    conv = None


Repo card metadata block was not found. Setting CardData to empty.


Conv rows: 5000
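
The two normalization steps inside `normalize_conv` can be checked in isolation, without downloading the dataset. The sketch below is a standalone, stdlib-only illustration: `join_context` and `to_binary_label` are hypothetical helper names, and the sample strings are made up for the example.

```python
import ast

def join_context(ctx):
    """Flatten a stringified Python list of turns into 'turn || turn'."""
    if isinstance(ctx, str) and ctx.strip().startswith("["):
        try:
            parsed = ast.literal_eval(ctx)
            if isinstance(parsed, list):
                return " || ".join(str(x) for x in parsed)
        except (ValueError, SyntaxError):
            pass  # not a valid Python literal: fall through and keep the raw string
    return str(ctx)

def to_binary_label(lbl):
    """Map 'SARCASM' / other strings, booleans, or ints to 1 / 0."""
    if isinstance(lbl, str):
        return 1 if lbl.strip().upper() == "SARCASM" else 0
    return int(lbl)

print(join_context("['first turn', 'second turn']"))
print(to_binary_label(" sarcasm "), to_binary_label("NOT_SARCASM"), to_binary_label(True))
```

Any string label other than `SARCASM` maps to 0, so a misspelled positive label would silently become a negative example; counting the label values after normalization is a cheap sanity check.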


## 5) Combine + cast label to ClassLabel + stratified split

In [None]:
SARC_N = 200_000  # lower for faster runs

df_sarc_sub = df_sarc.sample(n=min(SARC_N, len(df_sarc)), random_state=SEED).reset_index(drop=True)
ds_sarc = Dataset.from_pandas(df_sarc_sub)

base = concatenate_datasets([ds_sarc, conv]) if USE_HF_CONV else ds_sarc

base = base.cast_column("label", ClassLabel(names=["not_sarcasm", "sarcasm"]))

split1 = base.train_test_split(test_size=0.20, seed=SEED, stratify_by_column="label")
temp = split1["test"].train_test_split(test_size=0.50, seed=SEED, stratify_by_column="label")

ds = DatasetDict({"train": split1["train"], "validation": temp["train"], "test": temp["test"]})

def sarcasm_pct(d):
    y = np.array(d["label"])
    return float((y==1).mean()*100)

print("Train sarcasm %:", round(sarcasm_pct(ds["train"]),2))
print("Valid sarcasm %:", round(sarcasm_pct(ds["validation"]),2))
print("Test  sarcasm %:", round(sarcasm_pct(ds["test"]),2))


Train sarcasm %: 49.98
Valid sarcasm %: 49.98
Test  sarcasm %: 49.99
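
The two successive splits above give an 80/10/10 partition. As a quick arithmetic check, the sketch below reproduces the expected split sizes from the row counts reported earlier (200,000 sampled SARC rows plus 5,000 conversation rows). `split_sizes` is a hypothetical helper that uses plain integer arithmetic; it does not claim to mirror the library's exact rounding.

```python
def split_sizes(n_total, holdout_pct=20):
    """Sizes of an 80/10/10 split built from two successive splits."""
    n_holdout = n_total * holdout_pct // 100   # first split: 20% held out
    n_train = n_total - n_holdout
    n_test = n_holdout // 2                    # second split: half of the holdout
    n_valid = n_holdout - n_test
    return n_train, n_valid, n_test

n_total = 200_000 + 5_000  # SARC_N sampled rows + conversation rows
print(split_sizes(n_total))
```

The result, 164,000 / 20,500 / 20,500, agrees with the `num_rows` values reported after tokenization.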


## 6) Format multi-turn context + tokenize

In [None]:
MODEL_NAME = "roberta-large"
# MAX_LENGTH = 256       # try 384/512 if memory allows
# KEEP_LAST_TURNS = 5
MAX_LENGTH = 512
KEEP_LAST_TURNS = 7

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

def format_input(context_text: str, reply_text: str, keep_last_turns: int = KEEP_LAST_TURNS) -> str:
    ctx = (context_text or "").strip()
    rep = (reply_text or "").strip()
    turns = [t.strip() for t in ctx.split("||") if t.strip()]
    turns = turns[-keep_last_turns:]
    if not turns:
        ctx_block = "[NO_CONTEXT]"
    else:
        ctx_block = "\n".join([f"[TURN-{len(turns)-i}] {t}" for i, t in enumerate(turns)])
    return f"{ctx_block}\n[REPLY] {rep}"

def tokenize_batch(batch):
    texts = [format_input(c, r) for c, r in zip(batch["context_text"], batch["reply_text"])]
    enc = tokenizer(texts, truncation=True, max_length=MAX_LENGTH, padding="max_length")
    enc["labels"] = batch["label"]
    return enc

encoded = ds.map(tokenize_batch, batched=True, remove_columns=ds["train"].column_names)
encoded = encoded.cast_column("labels", ClassLabel(names=["not_sarcasm","sarcasm"]))
encoded.set_format(type="torch")
encoded


DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 164000
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 20500
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 20500
    })
})
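
To make the input layout concrete, the snippet below repeats `format_input` from the cell above (with `keep_last_turns` hard-coded to the notebook's value of 7) and prints the text that is handed to the tokenizer. The sample strings are illustrative only.

```python
def format_input(context_text, reply_text, keep_last_turns=7):
    ctx = (context_text or "").strip()
    rep = (reply_text or "").strip()
    turns = [t.strip() for t in ctx.split("||") if t.strip()]
    turns = turns[-keep_last_turns:]  # keep only the most recent turns
    if not turns:
        ctx_block = "[NO_CONTEXT]"
    else:
        # Turns are numbered backwards, so [TURN-1] is always the turn right before the reply
        ctx_block = "\n".join(f"[TURN-{len(turns) - i}] {t}" for i, t in enumerate(turns))
    return f"{ctx_block}\n[REPLY] {rep}"

with_context = format_input(
    "I waited 2 hours for this. || They said it would be quick.",
    "Wow, so efficient.",
)
print(with_context)
print("---")
print(format_input("", "Great."))
```

Note that the notebook never registers `[TURN-k]`, `[NO_CONTEXT]`, or `[REPLY]` as special tokens, so they are presumably tokenized as ordinary text and each marker costs several subword tokens out of the `MAX_LENGTH` budget.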

## 7) Train RoBERTa-Large with a TPU-safe PyTorch/XLA loop

In [None]:
# Earlier, faster settings: EPOCHS = 2, WARMUP_RATIO = 0.06
EPOCHS = 5
LR = 1e-5
WD = 0.01
WARMUP_RATIO = 0.10
TRAIN_BS = 16
EVAL_BS = 32

# Early stopping: stop after PATIENCE epochs without a validation-F1 improvement
PATIENCE = 2
patience_ctr = 0

model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2).to(device)

# TPU-safe optimizer
if TPU_AVAILABLE:
    import torch_xla.amp.syncfree as syncfree
    optimizer = syncfree.AdamW(model.parameters(), lr=LR, weight_decay=WD)
else:
    optimizer = torch.optim.AdamW(model.parameters(), lr=LR, weight_decay=WD)

train_loader = DataLoader(encoded["train"], batch_size=TRAIN_BS, shuffle=True)
valid_loader = DataLoader(encoded["validation"], batch_size=EVAL_BS, shuffle=False)

total_steps = EPOCHS * len(train_loader)
warmup_steps = int(WARMUP_RATIO * total_steps)
scheduler = get_linear_schedule_with_warmup(optimizer, warmup_steps, total_steps)

def run_eval(loader):
    model.eval()
    all_probs, all_preds, all_labels = [], [], []
    it = pl.ParallelLoader(loader, [device]).per_device_loader(device) if TPU_AVAILABLE else loader
    with torch.no_grad():
        for batch in it:
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            labels = batch["labels"].to(device)
            logits = model(input_ids=input_ids, attention_mask=attention_mask).logits
            probs = torch.softmax(logits, dim=-1)[:, 1]
            preds = torch.argmax(logits, dim=-1)
            all_probs.append(probs.detach().cpu())
            all_preds.append(preds.detach().cpu())
            all_labels.append(labels.detach().cpu())
    p = torch.cat(all_probs).numpy()
    yhat = torch.cat(all_preds).numpy()
    y = torch.cat(all_labels).numpy()
    return y, yhat, p

best_val_f1 = -1.0
best_state = None

for epoch in range(EPOCHS):
    model.train()
    running_loss = 0.0
    it = pl.ParallelLoader(train_loader, [device]).per_device_loader(device) if TPU_AVAILABLE else train_loader

    for step, batch in enumerate(it):
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["labels"].to(device)

        out = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        loss = out.loss
        loss.backward()

        if TPU_AVAILABLE:
            xm.optimizer_step(optimizer)
        else:
            optimizer.step()

        optimizer.zero_grad()
        scheduler.step()

        running_loss += loss.item()
        if (step + 1) % 200 == 0:
            msg = f"Epoch {epoch+1} step {step+1}/{len(train_loader)} loss {running_loss/(step+1):.4f}"
            if TPU_AVAILABLE: xm.master_print(msg)
            else: print(msg)

    yv, yvhat, _ = run_eval(valid_loader)
    val_f1 = f1_score(yv, yvhat)
    msg = f"Epoch {epoch+1}: train_loss={running_loss/len(train_loader):.4f} val_f1={val_f1:.4f}"
    if TPU_AVAILABLE: xm.master_print(msg)
    else: print(msg)

    if val_f1 > best_val_f1:
        best_val_f1 = val_f1
        # Keep a CPU copy of the best weights so they can be restored after training
        best_state = {k: v.detach().cpu().clone() for k, v in model.state_dict().items()}
        patience_ctr = 0
    else:
        patience_ctr += 1

    if patience_ctr >= PATIENCE:
        print("Early stopping")
        break



if best_state is not None:
    model.load_state_dict(best_state)

print("Best val F1:", best_val_f1)


Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-large and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1 step 200/10250 loss 0.6999
Epoch 1 step 400/10250 loss 0.6975
Epoch 1 step 600/10250 loss 0.6965
Epoch 1 step 800/10250 loss 0.6923
Epoch 1 step 1000/10250 loss 0.6815
Epoch 1 step 1200/10250 loss 0.6692
Epoch 1 step 1400/10250 loss 0.6576
Epoch 1 step 1600/10250 loss 0.6451
Epoch 1 step 1800/10250 loss 0.6370
Epoch 1 step 2000/10250 loss 0.6282
Epoch 1 step 2200/10250 loss 0.6199
Epoch 1 step 2400/10250 loss 0.6137
Epoch 1 step 2600/10250 loss 0.6087
Epoch 1 step 2800/10250 loss 0.6031
Epoch 1 step 3000/10250 loss 0.5976
Epoch 1 step 3200/10250 loss 0.5944
Epoch 1 step 3400/10250 loss 0.5891
Epoch 1 step 3600/10250 loss 0.5850
Epoch 1 step 3800/10250 loss 0.5819
Epoch 1 step 4000/10250 loss 0.5779
Epoch 1 step 4200/10250 loss 0.5747
Epoch 1 step 4400/10250 loss 0.5727
Epoch 1 step 4600/10250 loss 0.5704
Epoch 1 step 4800/10250 loss 0.5686
Epoch 1 step 5000/10250 loss 0.5667
Epoch 1 step 5200/10250 loss 0.5650
Epoch 1 step 5400/10250 loss 0.5631
Epoch 1 step 5600/10250 loss 0.5
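
For reference, the learning-rate schedule used above ramps linearly from 0 to `LR` over the warmup steps and then decays linearly back to 0. The sketch below is a standalone re-implementation of that multiplier written from its documented behaviour (`linear_warmup_decay` is a hypothetical name, not the library function), evaluated with this notebook's settings and the 10,250 steps per epoch shown in the log.

```python
def linear_warmup_decay(step, warmup_steps, total_steps):
    """LR multiplier: linear ramp 0 -> 1 during warmup, then linear decay 1 -> 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

EPOCHS, LR, WARMUP_RATIO = 5, 1e-5, 0.10
steps_per_epoch = 164_000 // 16                  # 10250, as shown in the training log
total_steps = EPOCHS * steps_per_epoch           # 51250
warmup_steps = int(WARMUP_RATIO * total_steps)   # 5125

for step in (0, warmup_steps // 2, warmup_steps, total_steps // 2, total_steps):
    print(step, LR * linear_warmup_decay(step, warmup_steps, total_steps))
```

With these values the peak learning rate is reached about halfway through the first epoch. If early stopping ends training before epoch 5, the decay never reaches zero.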

## 8) Test evaluation (TPU-safe) + report

In [None]:
test_loader = DataLoader(encoded["test"], batch_size=EVAL_BS, shuffle=False)
y_test, y_pred, p_test = run_eval(test_loader)

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=["not_sarcasm","sarcasm"]))


[[7857 2396]
 [1944 8303]]
              precision    recall  f1-score   support

 not_sarcasm       0.80      0.77      0.78     10253
     sarcasm       0.78      0.81      0.79     10247

    accuracy                           0.79     20500
   macro avg       0.79      0.79      0.79     20500
weighted avg       0.79      0.79      0.79     20500
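
The per-class numbers in the report follow directly from the confusion matrix. As a worked check, the snippet below recomputes the `sarcasm` row and the overall accuracy from the four counts printed above (rows are true classes, columns are predicted classes).

```python
# Confusion matrix from the test report: rows = true class, columns = predicted class
tn, fp = 7857, 2396   # true not_sarcasm: predicted not_sarcasm / sarcasm
fn, tp = 1944, 8303   # true sarcasm:     predicted not_sarcasm / sarcasm

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
```

This prints `precision=0.78 recall=0.81 f1=0.79 accuracy=0.79`, matching the `sarcasm` row and the accuracy line of the report.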



In [None]:
SAVE_DIR = "sarcasm_roberta_large_context"
# Hugging Face models expose save_pretrained(); save_model() is a Trainer method
model.save_pretrained(SAVE_DIR)
tokenizer.save_pretrained(SAVE_DIR)

import shutil
shutil.make_archive(SAVE_DIR, "zip", SAVE_DIR)

from google.colab import files
files.download(f"{SAVE_DIR}.zip")

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## 9) Threshold tuning (overall + hard/subtle subset)

In [None]:
yv, _, pv = run_eval(valid_loader)

best_t, best_f = 0.5, 0.0
for t in np.linspace(0.1, 0.9, 81):
    yhat = (pv >= t).astype(int)
    f = f1_score(yv, yhat)
    if f > best_f:
        best_f, best_t = f, t
print("Best threshold (val overall):", best_t, "F1:", best_f)

unc = np.abs(pv - 0.5)
hard_ids = np.argsort(unc)[:1000]
yv_h, pv_h = yv[hard_ids], pv[hard_ids]

best_t_h, best_f_h = 0.5, 0.0
for t in np.linspace(0.1, 0.9, 81):
    yhat = (pv_h >= t).astype(int)
    f = f1_score(yv_h, yhat)
    if f > best_f_h:
        best_f_h, best_t_h = f, t
print("Best threshold (val HARD):", best_t_h, "Hard F1:", best_f_h)

FINAL_THRESHOLD = float(best_t_h)  # choose best_t for overall, best_t_h for subtle
y_pred_thr = (p_test >= FINAL_THRESHOLD).astype(int)

print("\nTEST @ FINAL_THRESHOLD =", FINAL_THRESHOLD)
print(confusion_matrix(y_test, y_pred_thr))
print(classification_report(y_test, y_pred_thr, target_names=["not_sarcasm","sarcasm"]))


Best threshold (val overall): 0.37 F1: 0.7952297309790145
Best threshold (val HARD): 0.1 Hard F1: 0.6225895316804407

TEST @ FINAL_THRESHOLD = 0.1
[[6097 4156]
 [ 975 9272]]
              precision    recall  f1-score   support

 not_sarcasm       0.86      0.59      0.70     10253
     sarcasm       0.69      0.90      0.78     10247

    accuracy                           0.75     20500
   macro avg       0.78      0.75      0.74     20500
weighted avg       0.78      0.75      0.74     20500
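
The threshold sweep can be illustrated on a tiny example without scikit-learn. The sketch below is a pure-Python version of the same loop on made-up probabilities and labels (`f1_binary` and `best_threshold` are hypothetical helpers), with one extra check: whether the winning threshold sits on the edge of the search grid. In the run above the hard-subset optimum is 0.1, the lowest grid value, which may mean the true optimum lies outside the searched range.

```python
def f1_binary(y_true, y_pred):
    """F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def best_threshold(y_true, probs, grid):
    """Return the first threshold in `grid` with the highest F1, and that F1."""
    best_t, best_f = 0.5, 0.0
    for t in grid:
        preds = [1 if p >= t else 0 for p in probs]
        f = f1_binary(y_true, preds)
        if f > best_f:
            best_t, best_f = t, f
    return best_t, best_f

# Toy data, for illustration only
y_true = [0, 1, 0, 1, 1]
probs = [0.10, 0.30, 0.35, 0.60, 0.80]
grid = [i / 100 for i in range(10, 91)]  # 0.10 ... 0.90, same range as the notebook

t, f = best_threshold(y_true, probs, grid)
hits_edge = t in (grid[0], grid[-1])
print(t, round(f, 3), hits_edge)
```

When `hits_edge` is true, widening the grid or choosing the overall threshold instead is usually worth considering before fixing `FINAL_THRESHOLD`.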



## 10) Optional OpenAI gated ensemble (only calls LLM when RoBERTa uncertain)

Set `USE_OPENAI_ENSEMBLE=True` and ensure you have API quota.

In [None]:
USE_OPENAI_ENSEMBLE = True
UNCERTAINTY_GATE = 0.08
LLM_MODEL = "gpt-4o-mini"
LLM_CONF_GATE = 0.88
W_ROBERTA_CONF, W_LLM_CONF = 0.75, 0.25
W_ROBERTA_BASE, W_LLM_BASE = 0.90, 0.10
FINAL_THRESHOLD = 0.56
_llm_cache = {}

if USE_OPENAI_ENSEMBLE:
    from openai import OpenAI
    client = OpenAI()

    SYSTEM_PROMPT = (
        "You are a careful sarcasm detector for social media conversations.\n"
        "Output ONLY JSON: "
        '{"label":"SARCASTIC"|"NOT_SARCASTIC","confidence":0-1}.'
    )

    def _cache_key(ctx, rep):
        s = format_input(ctx, rep)
        return hashlib.sha256(s.encode("utf-8")).hexdigest()

    def llm_judge(ctx, rep, model_name=LLM_MODEL):
        k = _cache_key(ctx, rep)
        if k in _llm_cache:
            return _llm_cache[k]

        prompt = (
            "Decide whether the [REPLY] is sarcastic given the context.\n\n"
            f"{format_input(ctx, rep)}\n\nReturn JSON only."
        )

        try:
            resp = client.responses.create(
                model=model_name,
                input=[
                    {"role":"developer","content":SYSTEM_PROMPT},
                    {"role":"user","content":prompt},
                ],
                temperature=0.0,
            )
            text = (resp.output_text or "").strip()
            m = re.search(r"\{.*\}", text, flags=re.S)
            if not m:
                out = (0, 0.5)
            else:
                obj = json.loads(m.group(0))
                lab = str(obj.get("label","")).upper()
                conf = float(obj.get("confidence",0.5))
                conf = max(0.0, min(1.0, conf))
                out = (1 if lab=="SARCASTIC" else 0, conf)
        except Exception:
            out = (0, 0.5)

        _llm_cache[k] = out
        time.sleep(0.2)
        return out

    # roberta_prob_one(ctx, rep, max_length) is defined in the next cell;
    # run that cell before calling ensemble_predict_one.

    def ensemble_predict_one(ctx, rep):
        p_r = roberta_prob_one(ctx, rep, max_length=MAX_LENGTH)
        if abs(p_r - 0.5) >= UNCERTAINTY_GATE:
            pred = 1 if p_r >= FINAL_THRESHOLD else 0
            return pred, {"used_llm": False, "p_roberta": p_r, "p_final": p_r}

        llm_label, llm_conf = llm_judge(ctx, rep)
        p_l = llm_conf if llm_label==1 else (1.0-llm_conf)

        if llm_conf >= LLM_CONF_GATE:
            w_r, w_l = W_ROBERTA_CONF, W_LLM_CONF
        else:
            w_r, w_l = W_ROBERTA_BASE, W_LLM_BASE

        p_final = w_r*p_r + w_l*p_l
        pred = 1 if p_final >= FINAL_THRESHOLD else 0
        return pred, {"used_llm": True, "p_roberta": p_r, "p_llm": p_l, "p_final": p_final, "llm_conf": llm_conf}

    print("OpenAI ensemble enabled.")
else:
    print("OpenAI ensemble disabled.")


OpenAI ensemble enabled.
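
The gating and weighting rule can be exercised without any API call. The sketch below uses the constants from the cell above and takes the LLM verdict as plain arguments; `gated_blend` is a hypothetical helper that assumes the verdict is already available, whereas the notebook only requests it when RoBERTa is uncertain.

```python
UNCERTAINTY_GATE = 0.08
LLM_CONF_GATE = 0.88
W_CONF = (0.75, 0.25)   # (RoBERTa, LLM) weights when the LLM is confident
W_BASE = (0.90, 0.10)   # (RoBERTa, LLM) weights otherwise
FINAL_THRESHOLD = 0.56

def gated_blend(p_roberta, llm_label, llm_conf):
    """Return (prediction, p_final, used_llm) following the gating rule."""
    # RoBERTa far enough from 0.5: trust it and skip the LLM
    if abs(p_roberta - 0.5) >= UNCERTAINTY_GATE:
        return int(p_roberta >= FINAL_THRESHOLD), p_roberta, False
    # Convert the LLM's label + confidence into prob(sarcasm)
    p_llm = llm_conf if llm_label == 1 else 1.0 - llm_conf
    w_r, w_l = W_CONF if llm_conf >= LLM_CONF_GATE else W_BASE
    p_final = w_r * p_roberta + w_l * p_llm
    return int(p_final >= FINAL_THRESHOLD), p_final, True

print(gated_blend(0.90, 1, 0.95))  # confident RoBERTa: LLM verdict is ignored
print(gated_blend(0.52, 1, 0.95))  # uncertain RoBERTa + confident LLM
print(gated_blend(0.52, 1, 0.70))  # uncertain RoBERTa + hesitant LLM
```

With `p_roberta = 0.52`, a confident "sarcastic" verdict lifts the blended score to about 0.63 (above the threshold), while a hesitant one only reaches about 0.54, so the prediction stays `not_sarcasm`.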


In [None]:
# TPU-safe device getter
try:
    import torch_xla
    import torch_xla.core.xla_model as xm
    TPU_AVAILABLE = True
    DEVICE = torch_xla.device()   # recommended (no deprecation warning)
except Exception:
    TPU_AVAILABLE = False
    DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = model.to(DEVICE)
model.eval()

def roberta_prob_one(ctx, rep, max_length=256):
    text = format_input(ctx, rep)

    inputs = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        max_length=max_length,
        padding=False,
    )

    # IMPORTANT: move ALL tensors to DEVICE (XLA if TPU)
    inputs = {k: v.to(DEVICE) for k, v in inputs.items()}

    with torch.no_grad():
        logits = model(**inputs).logits
        p = torch.softmax(logits, dim=-1)[0, 1].detach().cpu().item()
    return float(p)


## 11) Save final package (RoBERTa-Large + config + optional LLM cache)

In [None]:
import shutil
import os

SAVE_DIR = "sarcasm_roberta_large_package"
os.makedirs(SAVE_DIR, exist_ok=True)

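# Save model/tokenizer in a CPU-friendly format.
# nn.Module.to() moves the model in place, so move it back to DEVICE afterwards;
# otherwise later XLA inference fails with "Expected XLA tensor".
model.to("cpu")
model.save_pretrained(SAVE_DIR)
tokenizer.save_pretrained(SAVE_DIR)
model.to(DEVICE)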

package_config = {
    "model_name": MODEL_NAME,
    "max_length": int(MAX_LENGTH),
    "keep_last_turns": int(KEEP_LAST_TURNS),
    "final_threshold": float(FINAL_THRESHOLD),
    "use_openai_ensemble": bool(USE_OPENAI_ENSEMBLE),
    "uncertainty_gate": float(UNCERTAINTY_GATE),
    "llm_model": LLM_MODEL,
    "llm_conf_gate": float(LLM_CONF_GATE),
    "weights": {
        "roberta_conf": float(W_ROBERTA_CONF),
        "llm_conf": float(W_LLM_CONF),
        "roberta_base": float(W_ROBERTA_BASE),
        "llm_base": float(W_LLM_BASE),
    },
}

with open(os.path.join(SAVE_DIR, "ensemble_config.json"), "w") as f:
    json.dump(package_config, f, indent=2)

# Check if _llm_cache is defined and save it
if '_llm_cache' in locals() or '_llm_cache' in globals():
    with open(os.path.join(SAVE_DIR, "llm_cache.json"), "w") as f:
        json.dump(_llm_cache, f)
else:
    print("Warning: _llm_cache not found. Skipping saving LLM cache.")

shutil.make_archive(SAVE_DIR, "zip", SAVE_DIR)
print("Saved:", f"{SAVE_DIR}.zip")
# Check size
p = os.path.join(SAVE_DIR, "model.safetensors")
print("New size (GB):", os.path.getsize(p)/1024**3)
print("Done ✅")

try:
    from google.colab import files
    files.download(f"{SAVE_DIR}.zip")
except Exception as e:
    print("Download helper not available:", e)


Saved: sarcasm_roberta_large_package.zip
New size (GB): 1.3238707706332207
Done ✅


In [None]:
# The model must be on the same device as the inputs built in roberta_prob_one
model = model.to(DEVICE)

ctx = "I waited 2 hours for this. || They said it would be quick."
rep = "Wow, so efficient."

pred, dbg = ensemble_predict_one(ctx, rep)
print("Pred:", "sarcasm" if pred==1 else "not_sarcasm")
print("Debug:", dbg)

In [None]:
from sklearn.metrics import confusion_matrix, classification_report
import numpy as np
FINAL_THRESHOLD = 0.56

def eval_ensemble_on_test(N=500):  # start with 200–1000 to control cost
    y_true, y_pred = [], []
    used_llm = 0

    for i in range(min(N, len(ds["test"]))):
        ex = ds["test"][i]
        pred, dbg = ensemble_predict_one(ex["context_text"], ex["reply_text"])
        y_true.append(int(ex["label"]))
        y_pred.append(int(pred))
        if dbg.get("used_llm"):
            used_llm += 1

    print("Evaluated:", len(y_true))
    print("LLM used for:", used_llm, "examples (", round(100*used_llm/len(y_true),2), "% )")

    print(confusion_matrix(y_true, y_pred))
    print(classification_report(y_true, y_pred, target_names=["not_sarcasm","sarcasm"]))

# Run (the model must be on the same device as the inputs)
model.to(DEVICE)
eval_ensemble_on_test(N=500)

In [None]:
import pandas as pd

test_df = pd.DataFrame(ds["test"])
test_df.to_csv("test_set.csv", index=False)

print("Saved test set with", len(test_df), "rows")


Saved test set with 20500 rows


In [None]:
import os
print(os.getcwd())
os.listdir()


/content


['.config',
 'sarcasm_roberta_large_package',
 'test_set.csv',
 'sarcasm_roberta_large_package.zip',
 'sample_data']

In [None]:
!find . -maxdepth 2 -type f -name "*.ipynb"


In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import json, os, torch

SAVE_DIR = "/content/sarcasm_roberta_large_package"  # or wherever you saved it

# from_pretrained() expects the package directory (config.json + model.safetensors),
# not the path to the weights file itself
tokenizer = AutoTokenizer.from_pretrained(SAVE_DIR)
model = AutoModelForSequenceClassification.from_pretrained(SAVE_DIR).to(device)
model.eval()

with open(os.path.join(SAVE_DIR, "ensemble_config.json"), "r") as f:
    cfg = json.load(f)

FINAL_THRESHOLD = cfg["final_threshold"]
MAX_LENGTH = cfg["max_length"]
KEEP_LAST_TURNS = cfg["keep_last_turns"]

UNCERTAINTY_GATE = cfg["uncertainty_gate"]
LLM_MODEL = cfg["llm_model"]
LLM_CONF_GATE = cfg["llm_conf_gate"]
W_ROBERTA_CONF = cfg["weights"]["roberta_conf"]
W_LLM_CONF = cfg["weights"]["llm_conf"]
W_ROBERTA_BASE = cfg["weights"]["roberta_base"]
W_LLM_BASE = cfg["weights"]["llm_base"]

In [None]:
eval_ensemble_on_test(N=200)

In [None]:
import os
p = "/content/drive/MyDrive/model.safetensors"
print("Size (bytes):", os.path.getsize(p))
print("Size (GB):", os.path.getsize(p)/1024**3)


Size (bytes): 1421495416
Size (GB): 1.3238707706332207
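
As a unit check on the size above: dividing by `1024**3` gives binary gibibytes (GiB), while decimal gigabytes use `1000**3`. The snippet below works through both and, assuming the weights are stored as float32 (an assumption, not something the notebook states), derives a rough parameter count from the file size.

```python
size_bytes = 1_421_495_416  # os.path.getsize() result printed above

size_gib = size_bytes / 1024**3   # binary gibibytes (what the notebook labels "GB")
size_gb = size_bytes / 1000**3    # decimal gigabytes
approx_values = size_bytes // 4   # assumes float32 (4 bytes each); ignores the small file header

print(f"{size_gib:.4f} GiB, {size_gb:.4f} GB, ~{approx_values / 1e6:.0f}M float32 values")
```

The estimate of roughly 355M values is consistent with the parameter count commonly quoted for RoBERTa-Large, which suggests the checkpoint was saved in full precision.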


## 12) Run the saved model in one go

In [None]:
import os, re, json, hashlib, time
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from openai import OpenAI

SAVE_DIR = "/content/drive/MyDrive/sarcasm_roberta_large_package"  # <-- use full path

print("Exists?", os.path.exists(SAVE_DIR))
print("Files:", os.listdir(SAVE_DIR)[:10])

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Device:", device)

tokenizer = AutoTokenizer.from_pretrained(SAVE_DIR, local_files_only=True)
model = AutoModelForSequenceClassification.from_pretrained(SAVE_DIR, local_files_only=True).to(device)
model.eval()

print("Loaded local model ✅")


# ====== 3) Load ensemble config ======
with open(os.path.join(SAVE_DIR, "ensemble_config.json"), "r") as f:
    cfg = json.load(f)

MAX_LENGTH       = int(cfg["max_length"])
KEEP_LAST_TURNS  = int(cfg["keep_last_turns"])
FINAL_THRESHOLD  = float(cfg["final_threshold"])

UNCERTAINTY_GATE = float(cfg["uncertainty_gate"])
LLM_MODEL        = cfg["llm_model"]
LLM_CONF_GATE    = float(cfg["llm_conf_gate"])

W_ROBERTA_CONF   = float(cfg["weights"]["roberta_conf"])
W_LLM_CONF       = float(cfg["weights"]["llm_conf"])
W_ROBERTA_BASE   = float(cfg["weights"]["roberta_base"])
W_LLM_BASE       = float(cfg["weights"]["llm_base"])

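USE_OPENAI_ENSEMBLE = True

# Manual overrides: these replace the values just loaded from ensemble_config.json.
# Delete this block to use the saved configuration unchanged.
UNCERTAINTY_GATE = 0.08
LLM_MODEL = "gpt-5.2"
LLM_CONF_GATE = 0.88
W_ROBERTA_CONF, W_LLM_CONF = 0.75, 0.25
W_ROBERTA_BASE, W_LLM_BASE = 0.90, 0.10
FINAL_THRESHOLD = 0.56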

print("Loaded config:", {
    "FINAL_THRESHOLD": FINAL_THRESHOLD,
    "UNCERTAINTY_GATE": UNCERTAINTY_GATE,
    "LLM_MODEL": LLM_MODEL
})

# ====== 4) Load optional LLM cache ======
cache_path = os.path.join(SAVE_DIR, "llm_cache.json")
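if os.path.exists(cache_path):
    with open(cache_path) as f:
        # JSON has no tuple type, so cached (label, conf) pairs load back as lists
        _llm_cache = json.load(f)
else:
    _llm_cache = {}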
print("LLM cache entries:", len(_llm_cache))

# ====== 5) Formatter ======
def format_input(context_text: str, reply_text: str, keep_last_turns: int = KEEP_LAST_TURNS) -> str:
    ctx = (context_text or "").strip()
    rep = (reply_text or "").strip()
    turns = [t.strip() for t in ctx.split("||") if t.strip()]
    turns = turns[-keep_last_turns:]
    if not turns:
        ctx_block = "[NO_CONTEXT]"
    else:
        ctx_block = "\n".join([f"[TURN-{len(turns)-i}] {t}" for i, t in enumerate(turns)])
    return f"{ctx_block}\n[REPLY] {rep}"

# ====== 6) RoBERTa prob ======
def roberta_prob_one(context_text: str, reply_text: str) -> float:
    text = format_input(context_text, reply_text)
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=MAX_LENGTH)
    inputs = {k: v.to(device) for k, v in inputs.items()}
    with torch.no_grad():
        logits = model(**inputs).logits
        p = torch.softmax(logits, dim=-1)[0, 1].item()  # prob(sarcasm)
    return float(p)

# ====== 7) OpenAI judge (with caching + safe fallback) ======
# Make sure OPENAI_API_KEY is set in your environment/secrets
client = OpenAI()

SYSTEM_PROMPT = (
    "You are a careful sarcasm detector for social media conversations.\n"
    "Sarcasm means intended meaning differs from literal meaning.\n"
    "Be conservative. Output ONLY strict JSON:\n"
    '{"label":"SARCASTIC"|"NOT_SARCASTIC","confidence":0-1}.'
)

def _cache_key(ctx: str, rep: str) -> str:
    s = format_input(ctx, rep)
    return hashlib.sha256(s.encode("utf-8")).hexdigest()

def llm_judge(ctx: str, rep: str, model_name: str = LLM_MODEL):
    k = _cache_key(ctx, rep)
    if k in _llm_cache:
        return _llm_cache[k]  # (label_int, conf_float)

    prompt = (
        "Decide whether the [REPLY] is sarcastic given the context.\n\n"
        f"{format_input(ctx, rep)}\n\n"
        "Return JSON only."
    )

    try:
        resp = client.responses.create(
            model=model_name,
            input=[
                {"role": "developer", "content": SYSTEM_PROMPT},
                {"role": "user", "content": prompt},
            ],
            temperature=0.0,
        )
        text = (resp.output_text or "").strip()
        m = re.search(r"\{.*\}", text, flags=re.S)
        if not m:
            out = (0, 0.5)
        else:
            obj = json.loads(m.group(0))
            lab = str(obj.get("label", "")).upper()
            conf = float(obj.get("confidence", 0.5))
            conf = max(0.0, min(1.0, conf))
            out = (1 if lab == "SARCASTIC" else 0, conf)
    except Exception:
        # quota errors etc -> safe fallback
        out = (0, 0.5)

    _llm_cache[k] = out
    # throttle a bit
    time.sleep(0.2)
    return out

# ====== 8) Ensemble predictor ======
def ensemble_predict_one(ctx: str, rep: str):
    p_r = roberta_prob_one(ctx, rep)

    # If RoBERTa is confident enough, don't call OpenAI
    if abs(p_r - 0.5) >= UNCERTAINTY_GATE:
        pred = 1 if p_r >= FINAL_THRESHOLD else 0
        return pred, {"used_llm": False, "p_roberta": p_r, "p_final": p_r}

    llm_label, llm_conf = llm_judge(ctx, rep)
    # Convert LLM label+confidence into prob(sarcasm)
    p_l = llm_conf if llm_label == 1 else (1.0 - llm_conf)

    # Weighting: trust LLM more only when it's confident
    if llm_conf >= LLM_CONF_GATE:
        w_r, w_l = W_ROBERTA_CONF, W_LLM_CONF
    else:
        w_r, w_l = W_ROBERTA_BASE, W_LLM_BASE

    p_final = w_r * p_r + w_l * p_l
    pred = 1 if p_final >= FINAL_THRESHOLD else 0
    return pred, {
        "used_llm": True,
        "p_roberta": p_r,
        "p_llm": p_l,
        "llm_conf": llm_conf,
        "p_final": p_final,
    }

# ====== 9) Try it ======
ctx = "Oh, what is that are you hurt?"
rep = "No /s"
pred, dbg = ensemble_predict_one(ctx, rep)
print("Pred:", "sarcasm" if pred==1 else "not_sarcasm")
print("Debug:", dbg)

# ====== 10) Save updated cache back to disk (optional) ======
with open(os.path.join(SAVE_DIR, "llm_cache.json"), "w") as f:
    json.dump(_llm_cache, f)
print("Cache saved.")


Exists? True
Files: ['model.safetensors', 'vocab.json', 'tokenizer_config.json', 'tokenizer.json', 'special_tokens_map.json', 'merges.txt', 'config.json', 'llm_cache.json', 'ensemble_config.json']
Device: cpu
Loaded local model ✅
Loaded config: {'FINAL_THRESHOLD': 0.56, 'UNCERTAINTY_GATE': 0.08, 'LLM_MODEL': 'gpt-5.2'}
LLM cache entries: 0
Pred: not_sarcasm
Debug: {'used_llm': False, 'p_roberta': 0.043070364743471146, 'p_final': 0.043070364743471146}
Cache saved.
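
One detail of the cache is worth seeing in isolation: entries are keyed by a SHA-256 hash of the formatted input, and the `(label, confidence)` tuples do not survive a JSON round trip as tuples. The stdlib-only sketch below simulates saving and reloading the cache in memory; `cache_key` is a simplified stand-in that takes the already formatted text, and the cached value is made up.

```python
import hashlib
import json

def cache_key(formatted_text):
    """Stable SHA-256 key for an already formatted (context, reply) input."""
    return hashlib.sha256(formatted_text.encode("utf-8")).hexdigest()

cache = {cache_key("[NO_CONTEXT]\n[REPLY] Wow, so efficient."): (1, 0.9)}

# Simulate writing llm_cache.json and reading it back
restored = json.loads(json.dumps(cache))
key = next(iter(restored))
value = restored[key]

print(len(key), type(value).__name__)  # 64-character hex key; the tuple came back as a list
label, conf = value                    # unpacking works for lists as well as tuples
print(label, conf)
```

Because `ensemble_predict_one` only unpacks the cached value, the tuple-to-list change is harmless here, but code that compares cached entries against tuples would need to account for it. Since the key depends on the formatted text, changing `KEEP_LAST_TURNS` or the turn markers also changes the keys, so older entries would no longer match.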


In [None]:
from sklearn.metrics import confusion_matrix, classification_report

def eval_ensemble(ds_test, N=300):
    y_true, y_pred = [], []
    used = 0
    for i in range(min(N, len(ds_test))):
        ex = ds_test[i]
        pred, dbg = ensemble_predict_one(ex["context_text"], ex["reply_text"])
        y_true.append(int(ex["label"]))
        y_pred.append(int(pred))
        used += int(dbg.get("used_llm", False))
    print("N:", len(y_true), "| LLM used:", used, f"({100*used/len(y_true):.1f}%)")
    print(confusion_matrix(y_true, y_pred))
    print(classification_report(y_true, y_pred, target_names=["not_sarcasm","sarcasm"]))

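# `ds` exists only in the training runtime; in a fresh session, evaluate from the
# saved CSV with eval_ensemble_df (defined in the next cell) instead.
if "ds" in globals():
    eval_ensemble(ds["test"], N=600)
else:
    print("`ds` is not defined here; use eval_ensemble_df on the saved test_set.csv.")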

In [None]:
from sklearn.metrics import confusion_matrix, classification_report

def eval_ensemble_df(test_df, N=500):
    y_true, y_pred = [], []
    used = 0

    for i in range(min(N, len(test_df))):
        row = test_df.iloc[i]
        pred, dbg = ensemble_predict_one(row["context_text"], row["reply_text"])
        y_true.append(int(row["label"]))
        y_pred.append(int(pred))
        if dbg.get("used_llm"):
            used += 1

    print("Evaluated:", len(y_true))
    print("LLM used for:", used, f"({100*used/len(y_true):.1f}%)")
    print(confusion_matrix(y_true, y_pred))
    print(classification_report(y_true, y_pred, target_names=["not_sarcasm","sarcasm"]))


In [None]:
import pandas as pd

test_df = pd.read_csv("/content/drive/MyDrive/test_set.csv")
print(test_df.shape)
test_df.head()

from datasets import load_dataset
import pandas as pd

# Load TweetEval Irony test split
external = load_dataset("cardiffnlp/tweet_eval", "irony")["test"]

# Convert to your standard dataframe format
df_external2 = pd.DataFrame({
    "reply_text": external["text"],
    "context_text": [""] * len(external),   # no parent context in this dataset
    "label": [int(x) for x in external["label"]],
})

print("External2 rows:", len(df_external2))
df_external2.head()


eval_ensemble_df(df_external2, N=700)


(20500, 3)
External2 rows: 784
Evaluated: 700
LLM used for: 73 (10.4%)
[[355  72]
 [111 162]]
              precision    recall  f1-score   support

 not_sarcasm       0.76      0.83      0.80       427
     sarcasm       0.69      0.59      0.64       273

    accuracy                           0.74       700
   macro avg       0.73      0.71      0.72       700
weighted avg       0.73      0.74      0.73       700

