<a href="https://colab.research.google.com/github/zelaneroz/ipvresearch25/blob/main/1_LLM_Eval/qwen2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Predicting IPV with Qwen2.5-7B-Instruct Using the DetectIPV Paper's Evaluation Pipeline
Key tasks:
* Binary IPV detection (IPV vs NOT_IPV)
* Multi-label subtype detection (Physical, Emotional, Sexual)

**Key Evaluation Components from the Paper**
| Figure/Table | What it Shows |
|---------------|----------------|
| **Fig. 2(a)** | ROC curve for *General Violence Model* |
| **Fig. 2(b–c)** | “Waterfall” plot: sentences sorted by predicted confidence; color = ground-truth subtype |
| **Fig. 3(a)** | AUROC of Type-Specific vs General Models |
| **Fig. 3(b–e)** | More waterfall plots for type-specific models |
| **Fig. 4** | Radial visualization of confidence in 3-D type space |
| **Table 4** | AUROC for models trained on different negative examples |

## 1. Prompt Design

To evaluate how prompt structure influences LLM performance on IPV detection, I used **five prompting strategies** for both binary (IPV vs. NOT_IPV) and multi-label (Physical, Emotional, Sexual) tasks.  
This allows testing how reasoning depth, context, and self-reflection affect accuracy, AUC, and interpretability.

| Prompt Type | Description | Advantage |
|--------------|-------------|------------|
| **Zero-Shot** | Direct question without examples or reasoning. | Tests model’s innate understanding of IPV cues. |
| **Few-Shot** | Includes short examples before prediction. | Leverages in-context learning and improves clarity. |
| **Chain-of-Thought** | Asks the model to reason through guiding questions silently before giving the final label. | Encourages more deliberate, robust predictions; the reasoning itself is not printed. |
| **Meta-Prompt** | Frames the model as an IPV analyst with reflective reasoning. | Encourages careful, context-aware decisions. |
| **Self-Consistency** | Asks the model to judge the sentence several times internally and report the majority label in a single response. | Intended to improve stability and reduce random errors. |

These prompt variants help identify which instruction style yields the **most reliable and interpretable results** for both general and type-specific IPV detection.
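The self-consistency prompts in this notebook ask the model to vote internally within one generation. Self-consistency is more commonly implemented by sampling several responses and voting outside the model. The sketch below shows only that voting step; `majority_vote` and its `tie_break` default are hypothetical names and choices, not part of the notebook.

```python
from collections import Counter


def majority_vote(labels, tie_break="NOT_IPV"):
    """Return the most frequent label, ignoring None (unparsed) votes.

    Falls back to `tie_break` when there are no usable votes or when the
    top two labels are tied. The NOT_IPV default is an assumption.
    """
    counts = Counter(label for label in labels if label is not None)
    if not counts:
        return tie_break
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return tie_break
    return ranked[0][0]


# Hypothetical labels parsed from five sampled generations for one sentence
votes = ["IPV", "NOT_IPV", "IPV", "IPV", None]
final_label = majority_vote(votes)
print(final_label)
```

Voting this way requires sampling (for example `do_sample=True` with a non-zero temperature), since greedy decoding would return the same response every time.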

### 1.1 Binary IPV Detection Prompts

In [42]:
# (1) Zero-Shot — final (no inline JSON examples)
prompt_binary_zeroshot = """
You are analyzing whether a sentence describes Intimate Partner Violence (IPV).

Definition:
Intimate Partner Violence includes any form of physical, emotional, or sexual harm
inflicted by a partner. This may involve control, manipulation, threats, humiliation,
or physical aggression.

Decide if the following sentence involves IPV or not.

Sentence: "{text}"

Return ONLY one JSON object, enclosed between <json> and </json>.
Format: a single key "label" with value "IPV" or "NOT_IPV".
Print nothing else.
""".strip()


# (2) Few-Shot — final (no inline JSON examples)
prompt_binary_fewshot = """
You are classifying sentences as describing Intimate Partner Violence (IPV) or NOT_IPV.

Definition:
IPV occurs when one partner harms or controls another physically, emotionally, or sexually.

Here are examples (for guidance only):
1. "He pushed me against the wall and shouted that I’m worthless." → IPV
2. "She often insults me and forbids me from seeing my friends." → IPV
3. "We argued but both apologized and moved on peacefully." → NOT_IPV
4. "He surprised me with flowers after work." → NOT_IPV

Now analyze this sentence:
"{text}"

Return ONLY one JSON object, enclosed between <json> and </json>.
Format: {"label": "IPV"} or {"label": "NOT_IPV"}.
Print nothing else.
""".strip()


# (3) Chain-of-Thought — final (no inline JSON examples)
prompt_binary_cot = """
You are reasoning step-by-step to decide if a sentence describes Intimate Partner Violence (IPV).

Guidelines:
- IPV includes threats, coercion, physical harm, humiliation, or emotional manipulation.
- NOT_IPV describes healthy, neutral, or unrelated situations.

Think internally (do NOT show your reasoning) about:
1. Does the sentence show any behavior that causes harm, fear, or control?
2. Is there a partner/relationship context?
3. Does it express affection or support instead of harm?

After thinking silently, return ONLY one JSON object, enclosed between <json> and </json>.
Format: {"label": "IPV"} or {"label": "NOT_IPV"}.
Print nothing else.

Sentence: "{text}"
""".strip()


# (4) Meta — final (no inline JSON examples)
prompt_binary_meta = """
You are a social-behavioral analyst evaluating sentences for signs of Intimate Partner Violence (IPV).

Your objective is to be accurate but cautious.
- If the sentence clearly involves harm, coercion, or control → label as IPV.
- If the sentence shows affection, neutrality, or uncertainty → label as NOT_IPV.

Reflect internally before answering; do NOT print your reasoning.

Sentence: "{text}"

Return ONLY one JSON object, enclosed between <json> and </json>.
Format: {"label": "IPV"} or {"label": "NOT_IPV"}.
Print nothing else.
""".strip()


# (5) Self-Consistency — final (no inline JSON examples)
prompt_binary_selfconsistency = """
You will internally evaluate the sentence for Intimate Partner Violence (IPV) multiple times
and choose the majority label as your final answer.

Guidelines:
- IPV → signs of physical, emotional, or sexual harm, threats, or coercion.
- NOT_IPV → supportive, neutral, or unrelated content.

Do NOT reveal thoughts or votes.

Sentence: "{text}"

Return ONLY one JSON object, enclosed between <json> and </json>.
Format: {"label": "IPV"} or {"label": "NOT_IPV"}.
Print nothing else.
""".strip()

### 1.2 Multi-label IPV Type Detection Prompts

In [57]:
# (1) Zero-Shot — strict JSON output
prompt_multilabel_zeroshot = """
You are classifying a sentence for Intimate Partner Violence (IPV) subtypes.

Valid labels (choose any subset): Physical, Emotional, Sexual, NOT_IPV.
If the sentence does not describe IPV, include only NOT_IPV.
If it shows multiple IPV types, include all that apply.

Return EXACTLY one JSON object wrapped in <json> and </json>.
Use a single key "labels" whose value is a list drawn ONLY from:
["Physical", "Emotional", "Sexual", "NOT_IPV"].
Do not include any other keys or text.

Sentence: "{text}"

<json>{"labels": []}</json>
""".strip()


# (2) Few-Shot — guided, strict JSON output
prompt_multilabel_fewshot = """
You are classifying a sentence for Intimate Partner Violence (IPV) subtypes.

Valid labels (choose any subset): Physical, Emotional, Sexual, NOT_IPV.

Guidance (for understanding only — do not copy these into your output):
- Physical: hitting, pushing, choking, restraining, use or threat of physical force.
- Emotional: humiliation, manipulation, isolation, threats, control, verbal abuse.
- Sexual: coercion, unwanted sexual acts, pressure, harassment.
- NOT_IPV: ordinary disagreement or neutral statement without violence or coercion.

If the sentence shows multiple IPV types, include all.
If it shows none, include only NOT_IPV.

Return EXACTLY one JSON object wrapped in <json> and </json>.
Use a single key "labels" whose value is a list drawn ONLY from:
["Physical", "Emotional", "Sexual", "NOT_IPV"].
Print nothing else.

Sentence: "{text}"

<json>{"labels": []}</json>
""".strip()


# (3) Chain-of-Thought — silent reasoning, strict JSON output
prompt_multilabel_cot = """
Decide which IPV subtype(s) apply to the sentence.
Think silently and do not reveal your reasoning.

Valid labels: Physical, Emotional, Sexual, NOT_IPV.
If none apply, include only NOT_IPV.

Return EXACTLY one JSON object wrapped in <json> and </json>.
Use a single key "labels" whose value is a list drawn ONLY from:
["Physical", "Emotional", "Sexual", "NOT_IPV"].
Print nothing else.

Sentence: "{text}"

<json>{"labels": []}</json>
""".strip()


# (4) Meta — expert framing, strict JSON output
prompt_multilabel_meta = """
You are an expert on Intimate Partner Violence (IPV) classification.
Reflect internally; do not show your reasoning.

Valid labels: Physical, Emotional, Sexual, NOT_IPV.
If none apply, include only NOT_IPV.

Return EXACTLY one JSON object wrapped in <json> and </json>.
Use a single key "labels" whose value is a list drawn ONLY from:
["Physical", "Emotional", "Sexual", "NOT_IPV"].
Print nothing else.

Sentence: "{text}"

<json>{"labels": []}</json>
""".strip()


# (5) Self-Consistency — internal deliberation, strict JSON output
prompt_multilabel_selfconsistency = """
Evaluate the sentence multiple times INTERNALLY and output a stable, consistent final set of labels.

Valid labels: Physical, Emotional, Sexual, NOT_IPV.
If none apply, include only NOT_IPV.

Return EXACTLY one JSON object wrapped in <json> and </json>.
Use a single key "labels" whose value is a list drawn ONLY from:
["Physical", "Emotional", "Sexual", "NOT_IPV"].
Print nothing else.

Sentence: "{text}"

<json>{"labels": []}</json>
""".strip()

## 2. System & Model Setup

In [3]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import pandas as pd
from tqdm import tqdm
from pathlib import Path
import json
import os
import re
import time

In [4]:
#FILENAMES
model_name = "Qwen/Qwen2.5-7B-Instruct"

#Load Model & Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    dtype=torch.bfloat16  # replaces `torch_dtype`, which this transformers release reports as deprecated
)


## 3. Generate Predictions

In [10]:
#Clone from git
!git clone https://github.com/zelaneroz/ipvresearch25
%cd ipvresearch25/1_LLM_Eval

Cloning into 'ipvresearch25'...
remote: Enumerating objects: 31, done.
remote: Counting objects: 100% (31/31), done.
remote: Compressing objects: 100% (26/26), done.
remote: Total 31 (delta 11), reused 22 (delta 5), pack-reused 0 (from 0)
Receiving objects: 100% (31/31), 5.52 MiB | 21.41 MiB/s, done.
Resolving deltas: 100% (11/11), done.
/content/ipvresearch25/1_LLM_Eval


In [12]:
#Load dataset
filename = "../Dataset/617points.csv"
df = pd.read_csv(filename)
print("Loaded dataset with", len(df), "rows")
df.head()

Loaded dataset with 618 rows


| Unnamed: 0.1 | Unnamed: 0 | items | Physical Abuse | Emotional Abuse | Sexual Abuse | Tag | type |
|--------------|------------|-------|----------------|-----------------|--------------|-----|------|
| 0 | 0 | I'm sitting here with a goofy smile and feelin... | False | False | False | False | soft |
| 1 | 1 | It's really boosting my confidence when he say... | False | False | False | False | soft |
| 2 | 2 | I never imagined that someone could make me s... | False | False | False | False | soft |
| 3 | 3 | He motivates me to become the best version of ... | False | False | False | False | soft |
| 4 | 4 | He’s like a best friend that I can also live w... | False | False | False | False | soft |
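For evaluation, the boolean subtype columns shown above need to be turned into the same label vocabulary the prompts use. A minimal sketch, assuming a sentence counts as IPV whenever at least one subtype column is true; the `Tag` column is left out because its meaning is not stated here. `gold_multilabel`, `gold_binary`, and the example rows are hypothetical.

```python
# Map dataset column names to the label names used in the prompts
SUBTYPE_COLUMNS = {
    "Physical Abuse": "Physical",
    "Emotional Abuse": "Emotional",
    "Sexual Abuse": "Sexual",
}


def gold_multilabel(row):
    """Return the subtype labels whose column is true, or ['NOT_IPV'] if none are."""
    labels = [name for column, name in SUBTYPE_COLUMNS.items() if bool(row[column])]
    return labels if labels else ["NOT_IPV"]


def gold_binary(row):
    """Assumption: any true subtype column makes the sentence IPV."""
    return "NOT_IPV" if gold_multilabel(row) == ["NOT_IPV"] else "IPV"


# Illustrative rows only; these are not taken from the dataset
example_rows = [
    {"items": "example sentence A", "Physical Abuse": False,
     "Emotional Abuse": False, "Sexual Abuse": False},
    {"items": "example sentence B", "Physical Abuse": True,
     "Emotional Abuse": True, "Sexual Abuse": False},
]

for row in example_rows:
    print(gold_binary(row), gold_multilabel(row))
```

Both functions index by column name, so they should also work on a pandas row, for example through `df.apply(gold_binary, axis=1)`.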


In [59]:
#Prompt dictionaries
binary_prompts = {
    "zeroshot": prompt_binary_zeroshot,
    "fewshot": prompt_binary_fewshot,
    "cot": prompt_binary_cot,
    "meta": prompt_binary_meta,
    "selfconsistency": prompt_binary_selfconsistency
}

multilabel_prompts = {
    "zeroshot": prompt_multilabel_zeroshot,
    "fewshot": prompt_multilabel_fewshot,
    "cot": prompt_multilabel_cot,
    "meta": prompt_multilabel_meta,
    "selfconsistency": prompt_multilabel_selfconsistency
}
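Most of these templates contain literal JSON braces (for example `{"label": "IPV"}`) next to the `{text}` placeholder, so `str.format` would misread the JSON as a replacement field. The sketch below shows why a plain `str.replace` on the placeholder is the safer way to fill them. `fill_prompt` and the shortened `template` are hypothetical stand-ins for the real prompts.

```python
# Shortened stand-in for one of the prompt templates above
template = 'Sentence: "{text}"\nFormat: {"label": "IPV"} or {"label": "NOT_IPV"}.'


def fill_prompt(template, sentence):
    """Substitute only the {text} placeholder, leaving other braces untouched."""
    return template.replace("{text}", str(sentence))


filled = fill_prompt(template, "He surprised me with flowers after work.")
print(filled)

# str.format treats the JSON braces as replacement fields and fails
try:
    template.format(text="example")
    format_failed = False
except (KeyError, IndexError, ValueError):
    format_failed = True
print("str.format failed:", format_failed)
```

Converting the sentence with `str()` also keeps the call from failing on non-string cells such as missing values.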

In [58]:
# ============================================
# Helper: Run model and return decoded text
# ============================================
from pathlib import Path
from tqdm import tqdm
import json
import re
from datetime import datetime


def run_llm(prompt_text):
    """Feed a prompt into the LLM and return only the generated portion (not echoed prompt)."""
    inputs = tokenizer(prompt_text, return_tensors="pt").to(model.device)
    output = model.generate(
        **inputs,
        max_new_tokens=64,
        do_sample=False    # greedy decoding is deterministic, so no temperature is needed
    )
    gen_tokens = output[0][inputs["input_ids"].shape[-1]:]
    text = tokenizer.decode(gen_tokens, skip_special_tokens=True)
    return text.strip()


# ============================================
# Robust label extractor
# ============================================
def extract_label(result_text, task_type="binary"):
    """
    Robustly extract a label or list of labels from the model response.
    Works for both binary (IPV/NOT_IPV) and multilabel (IPV subtype) tasks.
    """
    # collect <json>...</json> blocks
    matches = re.findall(r"<json[^>]*>\s*(.*?)\s*</json>", result_text, re.DOTALL | re.IGNORECASE)
    candidates = []

    for block in matches:
        block = block.strip()

        # Try to parse as JSON
        if block.startswith("{"):
            try:
                parsed = json.loads(block)
                val = parsed.get("label") or parsed.get("labels")
                if val:
                    candidates.append(val)
                    continue
            except json.JSONDecodeError:
                pass

        # Handle plain text cases
        if task_type == "binary":
            if "IPV" in block.upper() and "NOT_IPV" not in block.upper():
                candidates.append("IPV")
            elif "NOT_IPV" in block.upper():
                candidates.append("NOT_IPV")
        else:
            # multilabel subtypes: normalize spaces, case, punctuation
            IPV_TYPES = ["Physical", "Emotional", "Sexual", "NOT_IPV"]
            block_norm = re.sub(r"[^A-Za-z\s_]", "", block).lower()
            block_norm = block_norm.replace("_", " ")

            found = []
            for t in IPV_TYPES:
                # normalize the label the same way so "NOT_IPV" matches "not ipv"
                if t.lower().replace("_", " ") in block_norm:
                    found.append(t)
            if found:
                candidates.append(found)

    # ---- Return final label(s) ----
    if task_type == "binary":
        # same logic as before
        if candidates:
            if isinstance(candidates[-1], list):
                return ", ".join(candidates[-1])
            return str(candidates[-1])
        text_upper = result_text.upper()
        if "IPV" in text_upper and "NOT_IPV" not in text_upper:
            return "IPV"
        elif "NOT_IPV" in text_upper:
            return "NOT_IPV"
        return None

    else:  # multilabel
        # Flatten nested lists, deduplicate, and join
        all_labels = []
        for c in candidates:
            if isinstance(c, list):
                all_labels.extend(c)
            else:
                all_labels.append(c)
        unique_labels = sorted(set(all_labels))
        if unique_labels:
            return ", ".join(unique_labels)

        # fallback to raw text search; keep every label that appears, not just the first
        raw = result_text.lower()
        IPV_TYPES = ["Physical", "Emotional", "Sexual", "NOT_IPV"]
        found = [t for t in IPV_TYPES if t.lower() in raw]
        return ", ".join(sorted(found)) if found else None


# ============================================
# Flexible run function
# ============================================
def run_prompt(df, task_type="binary", prompt_type="zeroshot", n_samples=10):
    """
    Run one selected prompt (binary or multilabel) on a specified number of samples.
    Saves JSONL with clean responses: {id, prompt_type, response}.
    """
    import pandas as pd

    # Select correct prompt template
    if task_type == "binary":
        prompt_template = binary_prompts.get(prompt_type)
    elif task_type == "multilabel":
        prompt_template = multilabel_prompts.get(prompt_type)
    else:
        raise ValueError("task_type must be either 'binary' or 'multilabel'.")

    if not prompt_template:
        raise KeyError(f"Prompt type '{prompt_type}' not found in {task_type}_prompts dictionary.")

    # Subset data
    df_subset = df.head(n_samples) if n_samples else df
    print(f"\nRunning {task_type.upper()} → {prompt_type} on {len(df_subset)} samples")

    # Prepare output directory
    results_dir = Path("../1_LLM_Eval/results")
    results_dir.mkdir(parents=True, exist_ok=True)
    timestamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    output_file = results_dir / f"{task_type}_{prompt_type}_sample{len(df_subset)}.jsonl"

    # Main loop
    with open(output_file, "w", encoding="utf-8") as f_out:
        for i, row in tqdm(df_subset.iterrows(), total=len(df_subset)):
            # str() guards against non-string cells (e.g. NaN), which would break .replace()
            text = str(row["items"]) if "items" in df.columns else str(row.iloc[0])
            prompt_text = prompt_template.replace("{text}", text)

            try:
                result_text = run_llm(prompt_text)
            except Exception as e:
                result_text = f"ERROR: {e}"

            label = extract_label(result_text, task_type=task_type)

            record = {
                "id": int(i),
                "prompt_type": prompt_type,
                "response": label or "UNKNOWN"
            }
            f_out.write(json.dumps(record, ensure_ascii=False) + "\n")

    print(f"Cleaned outputs saved to {output_file}")
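Each run writes one JSON object per line, with multilabel responses stored as a comma-separated string. The sketch below reads such a file back and counts labels, which can help spot `UNKNOWN` rows before computing metrics. `load_responses`, `split_labels`, and the demo records are hypothetical; the demo writes to a temporary file rather than a real results file.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def load_responses(path):
    """Read a results JSONL file into a list of dicts, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def split_labels(response):
    """Turn 'Emotional, Physical' into ['Emotional', 'Physical']."""
    return [part.strip() for part in response.split(",") if part.strip()]


# Illustrative records in the {id, prompt_type, response} shape used above
demo_records = [
    {"id": 0, "prompt_type": "zeroshot", "response": "NOT_IPV"},
    {"id": 1, "prompt_type": "zeroshot", "response": "Emotional, Physical"},
    {"id": 2, "prompt_type": "zeroshot", "response": "UNKNOWN"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "multilabel_zeroshot_demo.jsonl"
    with open(demo_path, "w", encoding="utf-8") as f:
        for record in demo_records:
            f.write(json.dumps(record) + "\n")
    loaded = load_responses(demo_path)

label_counts = Counter(label for r in loaded for label in split_labels(r["response"]))
unknown_ids = [r["id"] for r in loaded if r["response"] == "UNKNOWN"]
print(label_counts)
print("Unparsed rows:", unknown_ids)
```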

In [60]:
# ============================================
# Example run: test 30 samples per task
# ============================================
# df = pd.read_csv("../Dataset/617points.csv")

# Run 30-sample zero-shot tests for the binary and multilabel tasks
run_prompt(df, task_type="binary", prompt_type="zeroshot", n_samples=30)
run_prompt(df, task_type="multilabel", prompt_type="zeroshot", n_samples=30)


Running BINARY → zeroshot on 30 samples


100%|██████████| 30/30 [00:17<00:00,  1.67it/s]


Cleaned outputs saved to ../1_LLM_Eval/results/binary_zeroshot_sample30.jsonl

Running MULTILABEL → zeroshot on 30 samples


100%|██████████| 30/30 [00:12<00:00,  2.32it/s]

Cleaned outputs saved to ../1_LLM_Eval/results/multilabel_zeroshot_sample30.jsonl





## 4. Evaluation Pipeline

1. **Prompt Comparison Phase**
   * Design multiple prompt variants for each task:
     - **Binary Prompts:** Different phrasings for IPV vs NOT_IPV classification.  
     - **Multi-label Prompts:** Different phrasings for type-specific labeling (Physical, Emotional, Sexual, NOT_IPV).
   * Run all prompts on the same dataset.
   * Compute for each prompt:
     - Accuracy, F1-score, ROC-AUC (per task or per subtype)
     - Average confidence calibration (mean predicted probability for positives/negatives)
   * Visualize:
     - **Prompt Comparison Table or Bar Plot** showing AUC/F1 across prompts
     - Identify the **best-performing prompt** for each task (Binary, Multi-label)

2. **Final Evaluation with Best Prompt**
   * **Binary (General IPV Model):**
     - Use the best binary prompt to compute final **ROC curve**, **AUC**, **Precision-Recall**, **Accuracy**, and **F1-score**.
     - Plot **ROC Curve** (replicating Fig. 2a).
     - Generate **Waterfall Plot** — sentences sorted by prediction confidence, color-coded by true label (Physical, Emotional, Sexual, Negative).

   * **Multi-label (Type-Specific Models):**
     - Use the best multi-label prompt to compute **per-type ROC/AUC** for Physical, Emotional, and Sexual abuse.
     - Compare type-specific vs. general model performance (bar chart similar to Fig. 3a).
     - Produce **Waterfall Plots** for each subtype showing confidence distribution and overlap between true labels and predictions.

3. **(Optional) Extended Visualization**
   * Create a **3D or Radial Plot** showing confidence magnitudes of all three type-specific predictions for interpretability.
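The metrics named in the pipeline can be sketched in plain Python as below. Accuracy and F1 need only hard labels, but ROC-AUC needs a per-sentence confidence score, and the generation step above records labels only. The `scores` list is therefore a hypothetical placeholder for whatever confidence is eventually extracted, and `binary_metrics` and `auroc` are illustrative names rather than the paper's implementation.

```python
def binary_metrics(y_true, y_pred, positive="IPV"):
    """Accuracy, precision, recall, and F1 for hard labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    accuracy = correct / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def auroc(y_true, scores, positive="IPV"):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative example.")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Illustrative values only; not results from this notebook
y_true = ["IPV", "IPV", "NOT_IPV", "NOT_IPV"]
y_pred = ["IPV", "NOT_IPV", "NOT_IPV", "IPV"]
scores = [0.9, 0.4, 0.2, 0.6]  # hypothetical confidence that each sentence is IPV

metrics = binary_metrics(y_true, y_pred)
auc = auroc(y_true, scores)
print(metrics)
print("AUROC:", auc)
```

The pairwise AUROC is quadratic in the number of examples, which should be fine for a dataset of a few hundred sentences. For the per-type evaluation, the same functions can be applied once per subtype by treating that subtype as the positive class.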