# Predicting IPV with Qwen2.5-7B-Instruct and the DetectIPV Paper's Evaluation Pipeline
Key tasks:
* Binary IPV detection (IPV vs NOT_IPV)
* Multi-label subtype detection (Physical, Emotional, Sexual)

**Key Evaluation Components from the Paper**

| Figure/Table | What it Shows |
|---------------|----------------|
| **Fig. 2(a)** | ROC curve for *General Violence Model* |
| **Fig. 2(b–c)** | “Waterfall” plot: sentences sorted by predicted confidence; color = ground-truth subtype |
| **Fig. 3(a)** | AUROC of Type-Specific vs General Models |
| **Fig. 3(b–e)** | More waterfall plots for type-specific models |
| **Fig. 4** | Radial visualization of confidence in 3-D type space |
| **Table 4** | AUROC for models trained on different negative examples |

## 1. Prompt Design

## 2. System & Model Setup

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import pandas as pd
from tqdm import tqdm
from pathlib import Path

In [None]:
# File names
model_name = "Qwen/Qwen2.5-7B-Instruct"
filename = "617points.csv"

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",           # place layers on the available device(s) automatically
    torch_dtype=torch.bfloat16,  # load weights in bfloat16 to reduce memory use
)

## 3. Generate Predictions

In [None]:
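def load_prompt(name):
    """Read a prompt template from prompts/<name>.txt."""
    with open(Path("prompts") / f"{name}.txt", "r", encoding="utf-8") as f:
        return f.read()

binary_prompt = load_prompt("binary_v1")

# Fill the template once per row; the text to classify is in the "items" column.
df = pd.read_csv(filename)
prompts = [binary_prompt.format(text=text) for text in df["items"]]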
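
Once the model has produced a response for a prompt, the free-text output has to be mapped back to the task labels. The sketch below shows one possible way to do that with regular expressions. The function names `parse_binary_label` and `parse_subtypes` are hypothetical, and the code assumes the prompts ask the model to answer with the literal labels `IPV` / `NOT_IPV` and the subtype names `Physical`, `Emotional`, `Sexual` (or `None`); the actual prompt files may require a different parser.

```python
import re

# Subtype names taken from the multi-label task description.
SUBTYPES = ("Physical", "Emotional", "Sexual")


def parse_binary_label(generated):
    """Map model output to 1 (IPV), 0 (NOT_IPV), or None if no label is found."""
    text = generated.strip().upper()
    # Check NOT_IPV first, because it contains "IPV" as a substring.
    if re.search(r"\bNOT[_\s-]?IPV\b", text):
        return 0
    if re.search(r"\bIPV\b", text):
        return 1
    return None


def parse_subtypes(generated):
    """Return a {subtype: 0 or 1} dict; all zeros corresponds to "None"."""
    text = generated.lower()
    return {
        name: int(re.search(rf"\b{name.lower()}\b", text) is not None)
        for name in SUBTYPES
    }


print(parse_binary_label("Label: NOT_IPV"))
print(parse_subtypes("Physical, Emotional"))
```

Keyword matching of this kind is deliberately simple: a response such as "not physical" would still be counted as Physical, so constraining the model to answer with labels only makes the parser more reliable.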

## 4. Evaluation Pipeline

1. **Prompt Comparison Phase**
   * Design multiple prompt variants for each task:
     - **Binary Prompts:** Different phrasings for IPV vs NOT_IPV classification.  
     - **Multi-label Prompts:** Different phrasings for type-specific labeling (Physical, Emotional, Sexual, None).
   * Run all prompts on the same dataset.
   * Compute for each prompt:
     - Accuracy, F1-score, ROC-AUC (per task or per subtype)
     - Mean predicted probability for ground-truth positives and for ground-truth negatives, as a rough check on confidence
   * Visualize:
     - **Prompt Comparison Table or Bar Plot** showing AUC/F1 across prompts
     - Identify the **best-performing prompt** for each task (Binary, Multi-label)

2. **Final Evaluation with Best Prompt**
   * **Binary (General IPV Model):**
     - Use the best binary prompt to compute the final **ROC curve**, **AUC**, **Precision-Recall**, **Accuracy**, and **F1-score**.
     - Plot the **ROC Curve** (replicating Fig. 2a).
     - Generate a **Waterfall Plot**: sentences sorted by prediction confidence and color-coded by true label (Physical, Emotional, Sexual, Negative).

   * **Multi-label (Type-Specific Models):**
     - Use the best multi-label prompt to compute **per-type ROC/AUC** for Physical, Emotional, and Sexual abuse.
     - Compare type-specific vs. general model performance (bar chart similar to Fig. 3a).
     - Produce **Waterfall Plots** for each subtype showing confidence distribution and overlap between true labels and predictions.

3. **(Optional) Extended Visualization**
   * Create a **3D or Radial Plot** showing confidence magnitudes of all three type-specific predictions for interpretability.
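
The per-prompt metrics and the waterfall ordering described above can be sketched in plain Python. This is a minimal illustration, not the paper's implementation. It assumes each sentence already has a ground-truth label (1 = IPV, 0 = NOT_IPV) and a predicted confidence in [0, 1]. The function names `auroc`, `prompt_metrics`, and `waterfall_order`, the 0.5 threshold, and the toy data are all hypothetical. In the notebook itself, `sklearn.metrics.roc_auc_score`, `f1_score`, and `accuracy_score` would be the usual choice; the pairwise AUROC below is a dependency-free stand-in.

```python
def auroc(y_true, y_score):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def prompt_metrics(y_true, y_score, threshold=0.5):
    """Accuracy, F1, AUROC, and mean confidence per class for one prompt."""
    y_pred = [int(s >= threshold) for s in y_score]
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
    correct = sum(1 for y, p in zip(y_true, y_pred) if y == p)
    denom = 2 * tp + fp + fn
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    return {
        "accuracy": correct / len(y_true),
        "f1": 2 * tp / denom if denom else 0.0,
        "auroc": auroc(y_true, y_score),
        "mean_conf_pos": sum(pos) / len(pos),
        "mean_conf_neg": sum(neg) / len(neg),
    }


def waterfall_order(scores, labels):
    """Pair scores with ground-truth labels, sorted by descending confidence."""
    return sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)


# Toy data for illustration only; not results from the dataset.
y_true = [1, 1, 0, 0]
y_score = [0.9, 0.4, 0.6, 0.1]
subtype = ["Physical", "Emotional", "Negative", "Negative"]

metrics = prompt_metrics(y_true, y_score)
print(metrics)
print(waterfall_order(y_score, subtype))
```

Running `prompt_metrics` once per prompt variant yields the rows of the prompt comparison table. The same function applied to each subtype's binary labels gives the per-type AUROC values, and the output of `waterfall_order` is the bar order for a waterfall plot, with the label determining each bar's color.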