# To see whether Overnorm changes hallucination

1. Setup & Imports
2. Base Model Loading
3. Dataset Preparation
4. Over-Normalization Probe Set (Set A)
5. Hallucination Evaluation Set (Set B)
6. Model Variants
7. Evaluation Pipeline
8. Results & Analysis
9. Conclusions

In [2]:
%pip install transformers datasets accelerate torch sentence-transformers faiss-cpu evaluate

Note: you may need to restart the kernel to use updated packages.





In [3]:
%pip install unsloth

Note: you may need to restart the kernel to use updated packages.





In [4]:
import torch
import numpy as np
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from tqdm import tqdm

In [5]:
import json
from datasets import Dataset

# Load the unified hallucination benchmark from local JSON file
dataset_path = "unified_hallucination_benchmark.jsonl"

with open(dataset_path, "r", encoding="utf-8") as f:
    unified_data = [json.loads(line) for line in f]

print(f"Loaded {len(unified_data)} samples from unified hallucination benchmark")

# Convert to HuggingFace Dataset
raw_dataset = Dataset.from_list(unified_data)

# Preprocess examples to match our pipeline format
def preprocess_example(example):
    """
    Convert unified schema to pipeline format:
    - prompt: the question/instruction
    - context: background information (if any)
    - reference: the correct/expected answer
    - knowledge: domain knowledge for grounding
    - hallucination_label: yes/no/unknown
    """
    return {
        "id": example["id"],
        "dataset_source": example["dataset"],
        "task_style": example["task_style"],
        "prompt": example["prompt"],
        "context": example["context"],
        "knowledge": example["knowledge"],
        "reference": example["reference_answer"],
        "model_response": example["model_response"],
        "hallucination_label": example["hallucination_label"]
    }

processed_dataset = raw_dataset.map(preprocess_example)

# Split by dataset source for analysis
datasets_by_source = {}
for source in ["DefAN", "HalluEval", "HalluDial"]:
    datasets_by_source[source] = processed_dataset.filter(lambda x: x["dataset_source"] == source)
    print(f"{source}: {len(datasets_by_source[source])} samples")

# Peek at samples from each dataset
print("\n=== Sample from each dataset ===")
for source, ds in datasets_by_source.items():
    if len(ds) > 0:
        print(f"\n{source} sample:")
        sample = ds[0]
        print(f"  Prompt: {sample['prompt'][:100]}...")
        print(f"  Reference: {sample['reference'][:100] if sample['reference'] else 'N/A'}...")
        print(f"  Hallucination Label: {sample['hallucination_label']}")

Loaded 100793 samples from unified hallucination benchmark


Map: 100%|██████████| 100793/100793 [00:09<00:00, 11012.43 examples/s]
Filter: 100%|██████████| 100793/100793 [00:01<00:00, 81699.16 examples/s]


DefAN: 70793 samples


Filter: 100%|██████████| 100793/100793 [00:01<00:00, 93246.91 examples/s]


HalluEval: 10000 samples


Filter: 100%|██████████| 100793/100793 [00:01<00:00, 91502.58 examples/s]

HalluDial: 20000 samples

=== Sample from each dataset ===

DefAN sample:
  Prompt: Which team won the 1930 FIFA World Cup?(Give me the name only)...
  Reference: Uruguay...
  Hallucination Label: unknown

HalluEval sample:
  Prompt: Which magazine was started first Arthur's Magazine or First for Women?...
  Reference: Arthur's Magazine...
  Hallucination Label: unknown

HalluDial sample:
  Prompt: Hi, how do you feel about Prince?...
  Reference: N/A...
  Hallucination Label: yes
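
The peek above shows only one sample per source. A tally of `hallucination_label` per `dataset_source` gives a fuller picture of which sources arrive pre-labeled. This is a minimal sketch, assuming each row is a dict with the two fields produced by `preprocess_example`; `label_counts` is a hypothetical helper and the rows below are toy data, not benchmark records.

```python
from collections import Counter

def label_counts(rows):
    """Count (dataset_source, hallucination_label) pairs across rows."""
    return Counter((r["dataset_source"], r["hallucination_label"]) for r in rows)

# Toy rows for illustration only (not taken from the benchmark)
toy_rows = [
    {"dataset_source": "DefAN", "hallucination_label": "unknown"},
    {"dataset_source": "HalluDial", "hallucination_label": "yes"},
    {"dataset_source": "HalluDial", "hallucination_label": "no"},
]

counts = label_counts(toy_rows)
for (source, label), n in sorted(counts.items()):
    print(f"{source:10s} {label:8s} {n}")
```

Iterating a Hugging Face `Dataset` yields one dict per row, so `label_counts(processed_dataset)` should work on the real data as well.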





# Baseline: running the benchmark without addressing OverNormalization

## Unified Hallucination Benchmark Dataset

This is a comprehensive benchmark dataset combining three major hallucination evaluation datasets into a unified format for testing LLM hallucination tendencies.

### Source Datasets

#### 1. **DefAN** (~424,760 samples; 70,793 in the file loaded above)
- **Type**: Question-Answer pairs
- **Domain**: Multi-domain (Sports, Math, General Knowledge)
- **Structure**: Simple factual questions with single correct answers
- **Examples**: "Which team won the 1930 FIFA World Cup?" → "Uruguay"
- **Hallucination Label**: Unknown (needs model evaluation)
- **Task Style**: `qa` (question-answering)

#### 2. **HalluEval** (~10,000 samples)
- **Type**: Knowledge-grounded QA
- **Domain**: General knowledge with explicit knowledge context
- **Structure**: Questions paired with background knowledge and both correct/hallucinated answers
- **Purpose**: Tests if models can distinguish factually correct from hallucinated responses
- **Hallucination Label**: Unknown (pre-labeled correct/incorrect answers available)
- **Task Style**: `knowledge_qa`

#### 3. **HalluDial** (from Hugging Face `FlagEval/HalluDial`)
- **Type**: Conversational dialogues
- **Domain**: Multi-turn conversations with knowledge grounding
- **Structure**: Dialogue history, knowledge base, model response, and hallucination judgment
- **Purpose**: Evaluates hallucinations in conversational contexts
- **Hallucination Label**: Pre-labeled (Yes/No)
- **Task Style**: `dialogue`

### Unified Schema

Each sample in the combined dataset has this structure:



In [6]:
{
  "id": "dataset_index",
  "dataset": "DefAN|HalluEval|HalluDial",
  "task_style": "qa|knowledge_qa|dialogue",
  "prompt": "The question or last user turn",
  "context": "Dialogue history (for HalluDial) or empty",
  "knowledge": "Background knowledge or reference answer",
  "reference_answer": "Correct/expected answer",
  "model_response": "Pre-generated response (HalluDial) or empty",
  "hallucination_label": "yes|no|unknown"
}

{'id': 'dataset_index',
 'dataset': 'DefAN|HalluEval|HalluDial',
 'task_style': 'qa|knowledge_qa|dialogue',
 'prompt': 'The question or last user turn',
 'context': 'Dialogue history (for HalluDial) or empty',
 'knowledge': 'Background knowledge or reference answer',
 'reference_answer': 'Correct/expected answer',
 'model_response': 'Pre-generated response (HalluDial) or empty',
 'hallucination_label': 'yes|no|unknown'}
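
A record can be checked against this schema before it enters the pipeline. The sketch below assumes the field names and allowed values are exactly those listed in the schema above; `validate_record` is a hypothetical helper, not part of any library.

```python
REQUIRED_FIELDS = (
    "id", "dataset", "task_style", "prompt", "context",
    "knowledge", "reference_answer", "model_response", "hallucination_label",
)

# Allowed values taken from the "a|b|c" alternatives in the schema
ALLOWED_VALUES = {
    "dataset": {"DefAN", "HalluEval", "HalluDial"},
    "task_style": {"qa", "knowledge_qa", "dialogue"},
    "hallucination_label": {"yes", "no", "unknown"},
}

def validate_record(record):
    """Return a list of problems found in one record (empty list if it is valid)."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in record]
    for name, allowed in ALLOWED_VALUES.items():
        if name in record and record[name] not in allowed:
            problems.append(f"unexpected {name}: {record[name]!r}")
    return problems

# Toy record for illustration only
good = {name: "" for name in REQUIRED_FIELDS}
good.update(dataset="DefAN", task_style="qa", hallucination_label="unknown")
bad = dict(good, task_style="summarization")
del bad["knowledge"]

print(validate_record(good))
print(validate_record(bad))
```

Returning a list of problems rather than raising makes it easy to scan the whole file once and report every malformed row.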



### Use Cases

1. **Evaluation**: Test LLMs across different hallucination scenarios (factual QA, knowledge-grounded, conversational)
2. **Training**: Fine-tune models to reduce hallucinations using labeled examples
3. **Analysis**: Compare hallucination patterns across task types and domains
4. **Benchmarking**: Standardized format allows consistent evaluation across different model architectures

### Output Formats

- **JSONL** (`unified_hallucination_benchmark.jsonl`): One JSON object per line for streaming
- **JSON Array** (`unified_hallucination_benchmark.json`): Single array for easy loading and inspection
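
The two formats hold the same records, so one can be derived from the other. The following is a minimal sketch using only the standard library; `jsonl_to_array` is a hypothetical helper, and in-memory streams stand in for the two files named above.

```python
import io
import json

def jsonl_to_array(jsonl_stream, json_stream):
    """Read one JSON object per line and write them out as a single JSON array."""
    records = [json.loads(line) for line in jsonl_stream if line.strip()]
    json.dump(records, json_stream, ensure_ascii=False, indent=2)
    return len(records)

# In-memory stand-ins for the .jsonl and .json files (toy content)
src = io.StringIO('{"id": "a"}\n{"id": "b"}\n')
dst = io.StringIO()
n = jsonl_to_array(src, dst)
print(f"Converted {n} records")
```

With real files, the same function can be called on two handles opened with `open(..., encoding="utf-8")`. Note that the array form loads everything into memory at once, which is the trade-off against line-by-line streaming of the JSONL file.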

In [12]:
%pip install --upgrade bitsandbytes

Note: you may need to restart the kernel to use updated packages.





In [16]:
import torch
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",          
    bnb_4bit_use_double_quant=True,             
    bnb_4bit_compute_dtype=torch.float16
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-1.5B",
    quantization_config=bnb_config,
    device_map={"": "cuda"}
)

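# Left-pad so that batched generation with a decoder-only model works correctly
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B", padding_side="left")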

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer
)

pipe("Who are you?", max_new_tokens=64)

AssertionError: Torch not compiled with CUDA enabled

In [18]:
%pip uninstall -y torch torchvision torchaudio


Found existing installation: torch 2.9.1
Uninstalling torch-2.9.1:
  Successfully uninstalled torch-2.9.1
Found existing installation: torchvision 0.24.1
Uninstalling torchvision-0.24.1:
  Successfully uninstalled torchvision-0.24.1
Note: you may need to restart the kernel to use updated packages.


You can safely remove it manually.
You can safely remove it manually.


In [19]:
%pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121


Looking in indexes: https://download.pytorch.org/whl/cu121
Note: you may need to restart the kernel to use updated packages.


ERROR: Could not find a version that satisfies the requirement torch (from versions: none)

ERROR: No matching distribution found for torch


In [None]:
import torch
print(torch.__version__)
print(torch.cuda.is_available())
print(torch.version.cuda)


2.9.1+cpu
False
None
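
The `+cpu` suffix in the version string above suggests a CPU-only build, which is consistent with `torch.cuda.is_available()` returning `False` and with the earlier `AssertionError`. The sketch below shows one way to read that suffix and to fall back to the CPU instead of hard-coding `"cuda"`; `torch_build_tag` and `choose_device` are hypothetical helpers, and the assumption that the local version tag follows a `+` is a convention of PyTorch wheels, not a guarantee.

```python
def torch_build_tag(version):
    """Return the local build tag after '+', e.g. 'cpu' or 'cu121', or None if absent."""
    _, sep, tag = version.partition("+")
    return tag if sep else None

def choose_device(cuda_available):
    """Pick 'cuda' only when the installed build can actually use it."""
    return "cuda" if cuda_available else "cpu"

# Values copied from the cell output above
tag = torch_build_tag("2.9.1+cpu")
device = choose_device(False)
print(tag, device)
```

In the notebook itself, the inputs would come from `torch.__version__` and `torch.cuda.is_available()`, and the result could be passed as `device_map={"": device}`. Note that 4-bit loading through bitsandbytes generally expects a GPU, so a CPU fallback may also require dropping `quantization_config`.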


In [8]:
processed_dataset

Dataset({
    features: ['id', 'dataset', 'task_style', 'prompt', 'context', 'knowledge', 'reference_answer', 'model_response', 'hallucination_label', 'dataset_source', 'reference'],
    num_rows: 100793
})

In [10]:
import random
import json
import torch
from tqdm import tqdm

# -----------------------------
# CONFIG (tuned for RTX 5050 laptop)
# -----------------------------
SAMPLE_SIZE = 1500
BATCH_SIZE = 8              # 6–8 is ideal for Qwen 2.5 on laptop GPUs
MAX_NEW_TOKENS = 256
OUTPUT_JSON = "qwen2.5_hallucination_sample_1500.json"

torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# -----------------------------
# 1. Randomly sample datapoints
# -----------------------------
total_size = len(processed_dataset)
random.seed(42)  # arbitrary fixed seed so the same datapoints are drawn on every run
sample_indices = random.sample(range(total_size), SAMPLE_SIZE)
sampled_data = [processed_dataset[i] for i in sample_indices]

print(f"Sampled {len(sampled_data)} datapoints")

# -----------------------------
# 2. Prompt builder (unchanged)
# -----------------------------
def build_prompt(sample):
    task = sample["task_style"]

    if task == "qa":
        return sample["prompt"].strip()

    elif task == "knowledge_qa":
        return (
            "Use the following knowledge to answer the question.\n\n"
            f"Knowledge:\n{sample['knowledge']}\n\n"
            f"Question:\n{sample['prompt']}"
        )

    elif task == "dialogue":
        return (
            f"{sample['context']}\n\n"
            "Answer the last user query using only the provided knowledge.\n\n"
            f"Knowledge:\n{sample['knowledge']}"
        )

    else:
        raise ValueError(f"Unknown task_style: {task}")

# -----------------------------
# 3. Batched generation (optimized)
# -----------------------------
@torch.inference_mode()
def generate_batch(samples):
    prompts = [build_prompt(s) for s in samples]

    outputs = pipe(
        prompts,
        max_new_tokens=MAX_NEW_TOKENS,
        do_sample=False,          # greedy decoding; temperature and top_p do not apply
        repetition_penalty=1.0,
        return_full_text=False,
        batch_size=len(prompts),
        pad_token_id=pipe.tokenizer.eos_token_id,
        eos_token_id=pipe.tokenizer.eos_token_id,
    )

    return [out[0]["generated_text"].strip() for out in outputs]

# -----------------------------
# 4. Run generation (GPU-safe loop)
# -----------------------------
generated_rows = []

for i in tqdm(range(0, len(sampled_data), BATCH_SIZE), desc="Generating"):
    batch = sampled_data[i:i + BATCH_SIZE]

    try:
        responses = generate_batch(batch)
    except RuntimeError as e:
        print(f"\n⚠️ GPU error at batch {i}: {e}")
        torch.cuda.empty_cache()
        continue

    for sample, response in zip(batch, responses):
        row = dict(sample)
        row["model_response"] = response
        generated_rows.append(row)

print(f"\n✓ Generated responses for {len(generated_rows)} samples")

# -----------------------------
# 5. Save to JSON (single array, written once at the end)
# -----------------------------
with open(OUTPUT_JSON, "w", encoding="utf-8") as f:
    json.dump(generated_rows, f, ensure_ascii=False, indent=2)

print(f"✓ Saved results to {OUTPUT_JSON}")

Sampled 1500 datapoints


Generating:   0%|          | 0/188 [00:00<?, ?it/s]A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
Generating:   0%|          | 0/188 [02:17<?, ?it/s]


KeyboardInterrupt: 
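
Once a run completes, the saved rows need some score before the baseline can be compared with a later variant. The sketch below is a simple lexical proxy, not the notebook's evaluation method: it assumes a response counts as a miss when the normalized `reference` string does not appear in the normalized `model_response`, and it skips rows with no reference (as in the HalluDial sample above). `normalize`, `contains_reference`, and `miss_rate` are hypothetical helpers, and the rows are toy data.

```python
import re

def normalize(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", (text or "").lower())
    return " ".join(text.split())

def contains_reference(response, reference):
    """True if the normalized reference appears inside the normalized response."""
    ref = normalize(reference)
    return bool(ref) and ref in normalize(response)

def miss_rate(rows):
    """Fraction of rows that have a reference but whose response does not contain it."""
    scored = [r for r in rows if normalize(r.get("reference"))]
    if not scored:
        return None
    misses = sum(not contains_reference(r["model_response"], r["reference"]) for r in scored)
    return misses / len(scored)

# Toy rows for illustration only (responses are made up)
toy_rows = [
    {"reference": "Uruguay", "model_response": "Uruguay won it."},
    {"reference": "Arthur's Magazine", "model_response": "First for Women"},
    {"reference": "", "model_response": "I like his music."},
]
rate = miss_rate(toy_rows)
print(rate)
```

Substring matching is crude: it penalizes correct paraphrases and accepts a response that mentions the reference while asserting something false, so it is best treated as a first-pass signal.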

# After performing DPO

In [28]:
import os
os.makedirs("./Overnorm", exist_ok=True)

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

base_model_id = "Qwen/Qwen2.5-1.5B"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,  # run matmuls in fp16 on top of the 4-bit weights
)

model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    quantization_config=bnb_config,
)

tokenizer = AutoTokenizer.from_pretrained(base_model_id)


In [None]:
from peft import PeftModel

model = PeftModel.from_pretrained(
    model,
    "DPO-qwen",   # folder with adapter_model.safetensors
)