# Predicting IPV with Qwen2.5-7B-Instruct Using the DetectIPV Paper's Evaluation Pipeline
Key tasks:
* Binary IPV detection (IPV vs NOT_IPV)
* Multi-label subtype detection (Physical, Emotional, Sexual)

**Key Evaluation Components from the Paper**
| Figure/Table | What it Shows |
|---------------|----------------|
| **Fig. 2(a)** | ROC curve for *General Violence Model* |
| **Fig. 2(b–c)** | “Waterfall” plot: sentences sorted by predicted confidence; color = ground-truth subtype |
| **Fig. 3(a)** | AUROC of Type-Specific vs General Models |
| **Fig. 3(b–e)** | More waterfall plots for type-specific models |
| **Fig. 4** | Radial visualization of confidence in 3-D type space |
| **Table 4** | AUROC for models trained on different negative examples |
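
Most of these components reduce to two operations on per-sentence confidence scores: computing AUROC and sorting sentences by score for the waterfall plots. The sketch below shows both in plain Python. The function names `auroc` and `waterfall_order` are hypothetical, the descending sort order is an assumption, and the toy values are for illustration only (they are not results from the dataset).

```python
from typing import List, Sequence


def auroc(labels: Sequence[bool], scores: Sequence[float]) -> float:
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y]
    negatives = [s for y, s in zip(labels, scores) if not y]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative example.")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def waterfall_order(scores: Sequence[float]) -> List[int]:
    """Indices of sentences sorted from highest to lowest score (assumed plot order)."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Toy values for illustration only.
toy_labels = [True, True, False, True, False]
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.2]
print(round(auroc(toy_labels, toy_scores), 4))
print(waterfall_order(toy_scores))
```

The pairwise loop is quadratic in the number of sentences, which is fine for a few hundred rows; a rank-based implementation would be preferable for larger sets.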

## 1. Prompt Design

To evaluate how prompt structure influences LLM performance on IPV detection, I used **five prompting strategies** for both binary (IPV vs. NOT_IPV) and multi-label (Physical, Emotional, Sexual) tasks.  
This allows testing how reasoning depth, context, and self-reflection affect accuracy, AUC, and interpretability.

| Prompt Type | Description | Advantage |
|--------------|-------------|------------|
| **Zero-Shot** | Direct question without examples or reasoning. | Tests model’s innate understanding of IPV cues. |
| **Few-Shot** | Includes short examples before prediction. | Leverages in-context learning and improves clarity. |
| **Chain-of-Thought** | Prompts the model to reason step by step (silently, in these prompts) before the final label. | Encourages more deliberate, robust predictions; the reasoning itself is not printed. |
| **Meta-Prompt** | Frames model as an IPV analyst with reflective reasoning. | Encourages careful, context-aware decisions. |
| **Self-Consistency** | Asks the model to make several internal judgments and report the majority, all within a single generation. | Aims to improve stability and reduce random errors. |

These prompt variants help identify which instruction style yields the **most reliable and interpretable results** for both general and type-specific IPV detection.
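
The Self-Consistency prompt asks the model to vote internally within one generation. A common alternative is to sample several independent completions and take the majority label outside the model. The sketch below shows only the voting step, under stated assumptions: `majority_vote` and `self_consistent_label` are hypothetical helpers, `sample_fn` stands in for one sampled model call, and ties fall back to `NOT_IPV` to mirror the cautious policy in the Meta prompt.

```python
from collections import Counter
from typing import Callable, List


def majority_vote(labels: List[str], fallback: str = "NOT_IPV") -> str:
    """Return the most frequent label; use the fallback on ties or empty input."""
    if not labels:
        return fallback
    counts = Counter(labels)
    top = max(counts.values())
    winners = [label for label, count in counts.items() if count == top]
    return winners[0] if len(winners) == 1 else fallback


def self_consistent_label(sample_fn: Callable[[], str], n_samples: int = 5) -> str:
    """Call sample_fn n_samples times and majority-vote the returned labels."""
    return majority_vote([sample_fn() for _ in range(n_samples)])


# Stand-in for sampled model outputs (canned values, illustration only).
canned = iter(["IPV", "NOT_IPV", "IPV", "IPV", "NOT_IPV"])
final = self_consistent_label(lambda: next(canned), n_samples=5)
print(final)
```

For the samples to differ, the model would need a non-zero temperature; with greedy decoding every sample is identical and the vote is redundant.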

### 1.1 Binary IPV Detection Prompts

In [2]:
# (1) Zero-Shot — final (no inline JSON examples)
prompt_binary_zeroshot = """
You are analyzing whether a sentence describes Intimate Partner Violence (IPV).

Definition:
Intimate Partner Violence includes any form of physical, emotional, or sexual harm
inflicted by a partner. This may involve control, manipulation, threats, humiliation,
or physical aggression.

Decide if the following sentence involves IPV or not.

Sentence: "{text}"

Return ONLY one JSON object, enclosed between <json> and </json>.
Format: a single key "label" with value "IPV" or "NOT_IPV".
Print nothing else.
""".strip()


# (2) Few-Shot: final (output format given as inline JSON)
prompt_binary_fewshot = """
You are classifying sentences as describing Intimate Partner Violence (IPV) or NOT_IPV.

Definition:
IPV occurs when one partner harms or controls another physically, emotionally, or sexually.

Here are examples (for guidance only):
1. "He pushed me against the wall and shouted that I’m worthless." → IPV
2. "She often insults me and forbids me from seeing my friends." → IPV
3. "We argued but both apologized and moved on peacefully." → NOT_IPV
4. "He surprised me with flowers after work." → NOT_IPV

Now analyze this sentence:
"{text}"

Return ONLY one JSON object, enclosed between <json> and </json>.
Format: {"label": "IPV"} or {"label": "NOT_IPV"}.
Print nothing else.
""".strip()


# (3) Chain-of-Thought: final (silent reasoning, output format given as inline JSON)
prompt_binary_cot = """
You are reasoning step-by-step to decide if a sentence describes Intimate Partner Violence (IPV).

Guidelines:
- IPV includes threats, coercion, physical harm, humiliation, or emotional manipulation.
- NOT_IPV describes healthy, neutral, or unrelated situations.

Think internally (do NOT show your reasoning) about:
1. Does the sentence show any behavior that causes harm, fear, or control?
2. Is there a partner/relationship context?
3. Does it express affection or support instead of harm?

After thinking silently, return ONLY one JSON object, enclosed between <json> and </json>.
Format: {"label": "IPV"} or {"label": "NOT_IPV"}.
Print nothing else.

Sentence: "{text}"
""".strip()


# (4) Meta: final (output format given as inline JSON)
prompt_binary_meta = """
You are a social-behavioral analyst evaluating sentences for signs of Intimate Partner Violence (IPV).

Your objective is to be accurate but cautious.
- If the sentence clearly involves harm, coercion, or control → label as IPV.
- If the sentence shows affection, neutrality, or uncertainty → label as NOT_IPV.

Reflect internally before answering; do NOT print your reasoning.

Sentence: "{text}"

Return ONLY one JSON object, enclosed between <json> and </json>.
Format: {"label": "IPV"} or {"label": "NOT_IPV"}.
Print nothing else.
""".strip()


# (5) Self-Consistency: final (output format given as inline JSON)
prompt_binary_selfconsistency = """
You will internally evaluate the sentence for Intimate Partner Violence (IPV) multiple times
and choose the majority label as your final answer.

Guidelines:
- IPV → signs of physical, emotional, or sexual harm, threats, or coercion.
- NOT_IPV → supportive, neutral, or unrelated content.

Do NOT reveal thoughts or votes.

Sentence: "{text}"

Return ONLY one JSON object, enclosed between <json> and </json>.
Format: {"label": "IPV"} or {"label": "NOT_IPV"}.
Print nothing else.
""".strip()

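All five binary prompts ask for a single `"label"` key wrapped in `<json>` tags, so one parser can serve every variant. The notebook only defines a parser for the multi-label case; the sketch below is a hypothetical binary counterpart (`parse_binary_response` and `VALID_BINARY_LABELS` are assumed names), and normalizing the label to upper case is an assumption about how lenient the parsing should be.

```python
import json
import re

VALID_BINARY_LABELS = {"IPV", "NOT_IPV"}


def parse_binary_response(raw_text: str) -> str:
    """Extract the binary label from a '<json>{"label": ...}</json>' completion."""
    match = re.search(r"<json>(.*?)</json>", raw_text, flags=re.DOTALL | re.IGNORECASE)
    if not match:
        raise ValueError(f"Could not find <json> block in model output: {raw_text!r}")
    payload = json.loads(match.group(1).strip())  # JSONDecodeError subclasses ValueError
    if not isinstance(payload, dict):
        raise ValueError(f"Expected a JSON object, got: {payload!r}")
    label = str(payload.get("label", "")).strip().upper()
    if label not in VALID_BINARY_LABELS:
        raise ValueError(f"Unexpected label: {label!r}")
    return label


print(parse_binary_response('Sure. <json>{"label": "IPV"}</json>'))
print(parse_binary_response('<json>\n{"label": "not_ipv"}\n</json>'))
```

Raising `ValueError` on every failure path lets the caller handle malformed completions with a single `except ValueError` clause.
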
### 1.2 Multi-label IPV Type Detection Prompts

In [1]:
# (1) Zero-Shot — strict JSON output (no example line)
prompt_multilabel_zeroshot = """
You are classifying a sentence for Intimate Partner Violence (IPV) subtypes.

Valid labels (choose any subset): Physical, Emotional, Sexual, NOT_IPV.
If the sentence does not describe IPV, include only NOT_IPV.
If it shows multiple IPV types, include all that apply.

Return EXACTLY one JSON object wrapped in <json> and </json>.
Use a single key "labels" whose value is a list drawn ONLY from:
["Physical", "Emotional", "Sexual", "NOT_IPV"].
Do not include any other keys or text.

Sentence: "{text}"
""".strip()


# (2) Few-Shot — guided, strict JSON output (no example line)
prompt_multilabel_fewshot = """
You are classifying a sentence for Intimate Partner Violence (IPV) subtypes.

Valid labels (choose any subset): Physical, Emotional, Sexual, NOT_IPV.

Guidance (for understanding only — do not copy these into your output):
- Physical: hitting, pushing, choking, restraining, use or threat of physical force.
- Emotional: humiliation, manipulation, isolation, threats, control, verbal abuse.
- Sexual: coercion, unwanted sexual acts, pressure, harassment.
- NOT_IPV: ordinary disagreement or neutral statement without violence or coercion.

If the sentence shows multiple IPV types, include all.
If it shows none, include only NOT_IPV.

Return EXACTLY one JSON object wrapped in <json> and </json>.
Use a single key "labels" whose value is a list drawn ONLY from:
["Physical", "Emotional", "Sexual", "NOT_IPV"].
Print nothing else.

Sentence: "{text}"
""".strip()


# (3) Chain-of-Thought — silent reasoning, strict JSON output (no example line)
prompt_multilabel_cot = """
Decide which IPV subtype(s) apply to the sentence.
Think silently and do not reveal your reasoning.

Valid labels: Physical, Emotional, Sexual, NOT_IPV.
If none apply, include only NOT_IPV.

Return EXACTLY one JSON object wrapped in <json> and </json>.
Use a single key "labels" whose value is a list drawn ONLY from:
["Physical", "Emotional", "Sexual", "NOT_IPV"].
Print nothing else.

Sentence: "{text}"
""".strip()


# (4) Meta — expert framing, strict JSON output (no example line)
prompt_multilabel_meta = """
You are an expert on Intimate Partner Violence (IPV) classification.
Reflect internally; do not show your reasoning.

Valid labels: Physical, Emotional, Sexual, NOT_IPV.
If none apply, include only NOT_IPV.

Return EXACTLY one JSON object wrapped in <json> and </json>.
Use a single key "labels" whose value is a list drawn ONLY from:
["Physical", "Emotional", "Sexual", "NOT_IPV"].
Print nothing else.

Sentence: "{text}"
""".strip()


# (5) Self-Consistency — internal deliberation, strict JSON output (no example line)
prompt_multilabel_selfconsistency = """
Evaluate the sentence multiple times INTERNALLY and output a stable, consistent final set of labels.

Valid labels: Physical, Emotional, Sexual, NOT_IPV.
If none apply, include only NOT_IPV.

Return EXACTLY one JSON object wrapped in <json> and </json>.
Use a single key "labels" whose value is a list drawn ONLY from:
["Physical", "Emotional", "Sexual", "NOT_IPV"].
Print nothing else.

Sentence: "{text}"
""".strip()

## 2. System & Model Setup

In [10]:
from __future__ import annotations  # must be the first statement in the cell

import json
import math
import os
import re
import time
from datetime import datetime
from pathlib import Path
from typing import Dict, List, Optional, Tuple

import numpy as np
import pandas as pd
import torch
from tqdm import tqdm
from transformers import AutoModelForCausalLM, AutoTokenizer
# Try to import vllm; if unavailable, provide a lightweight compatibility shim
try:
    from vllm import LLM, SamplingParams
except Exception:
    from dataclasses import dataclass
    from types import SimpleNamespace

    @dataclass
    class SamplingParams:
        max_tokens: int = 128
        temperature: float = 0.0
        logprobs: int = 0         # accepted for API compatibility; ignored by the shim
        prompt_logprobs: int = 0  # accepted for API compatibility; ignored by the shim

    class LLM:
        """
        Minimal compatibility wrapper that mimics the shape of vllm.LLM.generate output.
        Uses transformers' AutoTokenizer/AutoModelForCausalLM already imported in this notebook.
        The shim does not compute logprobs, so confidence estimates fall back to 0.0.
        """

        def __init__(self, model: str, dtype: Optional[str] = None, device: Optional[str] = None):
            self.model_name = model
            # determine device
            self.device = device or ("cuda" if torch.cuda.is_available() else "cpu")

            # load tokenizer & model (use already-imported transformers objects)
            self.tokenizer = AutoTokenizer.from_pretrained(model)
            # Map string dtype like "bfloat16" to torch.<dtype> if possible
            torch_dtype = None
            if isinstance(dtype, str) and hasattr(torch, dtype):
                torch_dtype = getattr(torch, dtype)

            # Load model; avoid device_map on CPU to keep it simple
            kwargs = {}
            if torch_dtype is not None:
                kwargs["torch_dtype"] = torch_dtype
            self.model = AutoModelForCausalLM.from_pretrained(model, **kwargs)
            if self.device == "cuda":
                try:
                    self.model.to("cuda")
                except Exception:
                    # fallback: keep on CPU
                    pass

        def generate(self, prompts, sampling_params: Optional[SamplingParams] = None):
            """
            Accepts a single prompt string or a list of prompt strings.
            Returns one result per prompt, shaped like vLLM's output: each result has an
            'outputs' list whose first item exposes 'text' and 'token_ids'.
            """
            sp = sampling_params or SamplingParams()
            if isinstance(prompts, str):
                prompts = [prompts]

            # Forward temperature only when sampling; greedy decoding does not use it
            gen_kwargs = {"max_new_tokens": sp.max_tokens, "do_sample": sp.temperature > 0}
            if sp.temperature > 0:
                gen_kwargs["temperature"] = sp.temperature

            results = []
            for prompt in prompts:
                inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
                with torch.no_grad():
                    out = self.model.generate(**inputs, **gen_kwargs)
                # Keep only the newly generated tokens (drop the echoed prompt)
                gen_tokens = out[0][inputs["input_ids"].shape[-1]:]
                text = self.tokenizer.decode(gen_tokens, skip_special_tokens=True)
                completion = SimpleNamespace(
                    text=text,
                    token_ids=gen_tokens.tolist(),
                    cumulative_logprob=None,
                    logprobs=None,
                )
                results.append(SimpleNamespace(outputs=[completion]))

            return results

In [11]:
# #FILENAMES
# model_name = "Qwen/Qwen2.5-7B-Instruct"

# #Load Model & Tokenizer
# tokenizer = AutoTokenizer.from_pretrained(model_name)
# model = AutoModelForCausalLM.from_pretrained(
#     model_name,
#     device_map="auto",
#     torch_dtype=torch.bfloat16
# )

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", dtype="bfloat16")  # adapt to your hardware
params = SamplingParams(
    max_tokens=128,
    temperature=0.0,
    logprobs=5,            # top-k logprobs for generated tokens
    prompt_logprobs=0      # set >0 only if you need prompt token logprobs
)



## 3. Generate Predictions

In [None]:
#Clone from git
# !git clone https://github.com/zelaneroz/ipvresearch25
# %cd ipvresearch25/1_LLM_Eval

/content/ipvresearch25/1_LLM_Eval


In [12]:
#Load dataset
filename = "../Dataset/617points.csv"
df = pd.read_csv(filename)
print("Loaded dataset with", len(df), "rows")
df.head()

Loaded dataset with 618 rows


| Unnamed: 0.1 | Unnamed: 0 | items | Physical Abuse | Emotional Abuse | Sexual Abuse | Tag | type |
|--------------|------------|-------|----------------|-----------------|--------------|-----|------|
| 0 | 0 | I'm sitting here with a goofy smile and feelin... | False | False | False | False | soft |
| 1 | 1 | It's really boosting my confidence when he say... | False | False | False | False | soft |
| 2 | 2 | I never imagined that someone could make me s... | False | False | False | False | soft |
| 3 | 3 | He motivates me to become the best version of ... | False | False | False | False | soft |
| 4 | 4 | He’s like a best friend that I can also live w... | False | False | False | False | soft |


In [13]:
#Prompt dictionaries
binary_prompts = {
    "zeroshot": prompt_binary_zeroshot,
    "fewshot": prompt_binary_fewshot,
    "cot": prompt_binary_cot,
    "meta": prompt_binary_meta,
    "selfconsistency": prompt_binary_selfconsistency
}

multilabel_prompts = {
    "zeroshot": prompt_multilabel_zeroshot,
    "fewshot": prompt_multilabel_fewshot,
    "cot": prompt_multilabel_cot,
    "meta": prompt_multilabel_meta,
    "selfconsistency": prompt_multilabel_selfconsistency
}

In [14]:
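# Smoke test: fill the zero-shot template with one dataset sentence before generating
sample_text = str(df["items"].iloc[0])
prompt = binary_prompts['zeroshot'].replace("{text}", sample_text)
outs = llm.generate([prompt], params)

# vLLM returns one result per prompt: outs[0].outputs[0].text and .logprobs (list of per-token dicts)
gen = outs[0].outputs[0]
text = gen.text
token_logprobs = gen.logprobs  # list[dict]: at each step, {token_id_or_str: LogProb} (model/vers. dependent)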



In [None]:
from typing import Iterable, Sequence

ID_COLUMN_CANDIDATES: Sequence[str] = ("id", "ID", "sample_id", "entry_id", "idx", "index", "Unnamed: 0", "\ufeff")
TEXT_COLUMN_CANDIDATES: Sequence[str] = ("items", "text", "sentence", "content", "utterance", "description")


def _find_first_column(df: pd.DataFrame, candidates: Iterable[str]) -> str | None:
    for name in candidates:
        if name in df.columns:
            return name
    return None


def _resolve_text_column(df: pd.DataFrame) -> str:
    column = _find_first_column(df, TEXT_COLUMN_CANDIDATES)
    if column is None:
        raise ValueError("Could not infer the text column. Please set TEXT_COLUMN_CANDIDATES to include the column name.")
    return column


def _resolve_id_column(df: pd.DataFrame) -> str | None:
    return _find_first_column(df, ID_COLUMN_CANDIDATES)


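def _safe_prompt_fill(template: str, text: str) -> str:
    # Use str.replace rather than str.format: several templates contain literal JSON
    # braces (e.g. {"label": "IPV"}) that str.format would treat as replacement fields,
    # and braces inside the sentence itself need no escaping this way.
    return template.replace("{text}", text)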


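def _parse_multilabel_response(raw_text: str) -> list[str]:
    match = re.search(r"<json>(.*?)</json>", raw_text, flags=re.DOTALL | re.IGNORECASE)
    if not match:
        raise ValueError(f"Could not find <json> block in model output: {raw_text!r}")
    # json.JSONDecodeError subclasses ValueError, so callers can catch ValueError alone
    payload = json.loads(match.group(1).strip())
    if not isinstance(payload, dict):
        raise ValueError(f"Expected a JSON object with a 'labels' key, got: {payload!r}")
    labels = payload.get("labels", [])
    if isinstance(labels, str):
        labels = [labels]
    if not isinstance(labels, list):
        raise ValueError(f"Expected 'labels' to be a list, got: {labels!r}")
    return [str(label).strip() for label in labels if str(label).strip()]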


def _labels_to_flags(labels: list[str]) -> tuple[bool, bool, bool]:
    normalized = {label.lower() for label in labels}
    return (
        "emotional" in normalized,
        "physical" in normalized,
        "sexual" in normalized,
    )


def _estimate_confidence(gen_output) -> float:
    token_ids = getattr(gen_output, "token_ids", None) or []
    if not token_ids:
        return 0.0
    cumulative = getattr(gen_output, "cumulative_logprob", None)
    if cumulative is not None:
        avg_logprob = cumulative / max(len(token_ids), 1)
        return float(math.exp(avg_logprob))

    token_logprobs = getattr(gen_output, "logprobs", None) or []
    total = 0.0
    count = 0
    for tid, per_token in zip(token_ids, token_logprobs):
        if per_token is None:
            continue
        chosen = None
        if isinstance(per_token, dict):
            chosen = per_token.get(tid)
            if chosen is None:
                for candidate in per_token.values():
                    token_id = getattr(candidate, "token_id", None)
                    if token_id == tid:
                        chosen = candidate
                        break
        elif isinstance(per_token, (list, tuple)):
            for candidate in per_token:
                token_id = getattr(candidate, "token_id", None)
                if token_id == tid:
                    chosen = candidate
                    break
        if chosen is None:
            continue
        logprob = getattr(chosen, "logprob", None)
        if logprob is None:
            continue
        total += float(logprob)
        count += 1
    if count == 0:
        return 0.0
    return float(math.exp(total / count))
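
The confidence computed above is the exponential of the mean token log-probability, i.e. the geometric mean of the per-token probabilities of the generated text. It measures how certain the model was about the whole completion, not the probability of a particular class. A minimal worked example, using a hypothetical helper `mean_token_probability` and made-up token probabilities:

```python
import math
from typing import Sequence


def mean_token_probability(token_logprobs: Sequence[float]) -> float:
    """exp(mean log-probability): the geometric mean of the token probabilities."""
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))


# Made-up probabilities for a three-token completion (illustration only).
token_probs = [0.9, 0.8, 0.5]
token_logprobs = [math.log(p) for p in token_probs]

joint = math.prod(token_probs)                       # probability of the full sequence
confidence = mean_token_probability(token_logprobs)  # length-normalized version

print(round(joint, 4))
print(round(confidence, 4))
```

Normalizing by length keeps longer completions from being penalized merely for containing more tokens, which matters when multi-label outputs vary in length.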


In [None]:
multilabel_prompt_key = "zeroshot"  # change to other prompt variants as needed
batch_size = 8

text_column = _resolve_text_column(df)
id_column = _resolve_id_column(df)
output_records: list[dict] = []

for start in range(0, len(df), batch_size):
    batch = df.iloc[start:start + batch_size]
    prompts = []
    for _, row in batch.iterrows():
        text_value = str(row[text_column]) if not pd.isna(row[text_column]) else ""
        prompts.append(_safe_prompt_fill(multilabel_prompts[multilabel_prompt_key], text_value))

    batch_outputs = llm.generate(prompts, params)

    for (row_index, row), request_output in zip(batch.iterrows(), batch_outputs):
        gen_output = request_output.outputs[0]
        raw_completion = gen_output.text.strip()
        try:
            labels = _parse_multilabel_response(raw_completion)
        except ValueError as exc:
            print(f"[warn] row {row_index}: {exc}")
            labels = []

        emotional, physical, sexual = _labels_to_flags(labels)
        confidence = _estimate_confidence(gen_output)

        if id_column is not None:
            try:
                data_id = int(row[id_column])
            except (TypeError, ValueError):
                data_id = int(row_index)
        else:
            data_id = int(row_index)

        output_records.append(
            {
                "id": data_id,
                "emotional": bool(emotional),
                "physical": bool(physical),
                "sexual": bool(sexual),
                "confidence": float(confidence),
            }
        )

output_dir = Path("test_results")
output_dir.mkdir(parents=True, exist_ok=True)
output_path = output_dir / f"qwen_multilabel_{multilabel_prompt_key}.json"

with output_path.open("w", encoding="utf-8") as json_file:
    json.dump(output_records, json_file, indent=2)

print(f"Saved {len(output_records)} results to {output_path}")
print("Sample:")
print(json.dumps(output_records[:3], indent=2))
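
Each saved record holds one generation-level confidence plus three boolean flags, whereas an ROC curve needs a per-subtype score that increases with the likelihood of the positive class. One possible heuristic, shown below, uses the confidence when the subtype was predicted and one minus the confidence otherwise. This is an assumption for illustration, not the method of the DetectIPV paper, and `subtype_score` and `records_to_scores` are hypothetical names.

```python
from typing import Dict, List


def subtype_score(predicted_positive: bool, confidence: float) -> float:
    """Map (prediction, confidence) to a score in [0, 1] that is high for positives."""
    clipped = min(max(confidence, 0.0), 1.0)
    return clipped if predicted_positive else 1.0 - clipped


def records_to_scores(records: List[dict], subtype: str) -> Dict[int, float]:
    """Build {id: score} for one subtype key ('emotional', 'physical' or 'sexual')."""
    return {rec["id"]: subtype_score(rec[subtype], rec["confidence"]) for rec in records}


# Toy records in the same shape as output_records (values are illustrative).
toy_records = [
    {"id": 0, "emotional": True, "physical": False, "sexual": False, "confidence": 0.9},
    {"id": 1, "emotional": False, "physical": False, "sexual": False, "confidence": 0.6},
]
print(records_to_scores(toy_records, "emotional"))
print(records_to_scores(toy_records, "physical"))
```

Because the confidence is shared by all three flags of a record, the resulting scores are coarse; per-label token probabilities would give a finer ranking if the inference backend exposes them.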
