# Hallucination Detection

1. Extract Keywords:
   1. Remove Stop Words:
      1. Chinese (Zh): jieba
      2. Arabic (Ar): Hugging Face (asafaya/bert-base-arabic)
      3. Hindi (Hi): indic-nlp-library
      4. Other Languages: spaCy model
   2. Recognize Named Entities (NER):
      1. Hugging Face Models:
         1. Arabic (Ar): asafaya/bert-base-arabic
         2. Other Languages: FacebookAI/xlm-roberta-large-finetuned-conll03-english
      2. For Unrecognized Content, Perform Tokenization (Extract Key Nouns if Possible):
         1. Chinese (Zh): jieba (tfidf-keywords)
         2. Hindi (Hi): indic_tokenize
         3. Arabic (Ar): Hugging Face (asafaya/bert-base-arabic)
         4. Other Languages: spaCy tokenize
2. Acquire External Knowledge:
   1. Use the Baidu Translate API to translate all extracted key phrases into English, as a fallback for retrieval.
   2. Retrieval Fallback Mechanism:
      1. First, search the Wikipedia API with the key phrases in the target language.
      2. If that search fails, retry with the translated English phrases.

      Note: During retrieval, Traditional Chinese redirects can cause errors. These redirects need to be cleared, and results in Traditional Chinese should be forcibly converted.
   3. Extract the first 200 characters from the search results.
3. Use (model_input, model_output_text, context) to detect hallucinated words and their probabilities via GPT-3.5.
4. Merge overlapping words and compute their probabilities as an overlap-length-weighted average raised to the power 1.2 (exponentiation).
5. Create Soft Labels: Identify hallucinated word positions in model_output_text and combine them with their computed probabilities.
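
The retrieval fallback described in the outline can be sketched as follows. This is only a minimal illustration: `retrieve_context`, `fake_search`, and `FAKE_INDEX` are hypothetical names, and the `search` callable stands in for the real Wikipedia lookup, which is not shown in this notebook.

```python
from typing import Callable, Optional


def retrieve_context(
    phrase: str,
    phrase_en: str,
    search: Callable[[str, str], Optional[str]],
    lang: str,
    max_chars: int = 200,
) -> str:
    """Search in the target language first, then fall back to English."""
    for query, query_lang in ((phrase, lang), (phrase_en, "en")):
        text = search(query, query_lang)
        if text:
            # Keep only the first `max_chars` characters of the result
            return text[:max_chars]
    return ""


# Hypothetical in-memory index standing in for the Wikipedia API
FAKE_INDEX = {
    ("Einstein", "en"): "Albert Einstein was a German-born theoretical physicist. " * 10,
}


def fake_search(query: str, lang: str) -> Optional[str]:
    return FAKE_INDEX.get((query, lang))


# The Chinese query misses, so the English translation is used instead
context = retrieve_context("爱因斯坦", "Einstein", fake_search, lang="zh")
print(len(context))
```

Passing the search function as a parameter keeps the fallback order testable without network access.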

In [None]:
from openai import OpenAI
import requests
import httpx
import json
import pandas as pd
from tqdm import tqdm
import ast
from scorer import recompute_hard_labels, load_jsonl_file_to_records, score_iou, score_cor, main
import numpy as np
from langdetect import detect, LangDetectException
import re
import os
import glob  # needed by process_dataset

In [None]:
# set OpenAI API and proxies
api_key = "sk-pr"

proxies = {
    "http": "http://127.0.0.1:10809",
    "https": "http://127.0.0.1:10809"
}

In [None]:
prompt_template = """
You are an AI model output evaluation expert, responsible for detecting hallucinated words in model output and assigning accurate probability scores to each hallucination.

Below is the input information:
- **Language**: {language} (e.g., en(English), ar(Arabic), es(Spanish), etc.)
- **Question**: {question}
- **Model Output**: {output}
- **Background Knowledge** (if available): {context}

### **Task**:
Your task is to:
1. **Identify hallucinated words or phrases** in the model output based on the question and background knowledge.
   - A word or phrase is considered a hallucination if it:
     - Contradicts the background knowledge.
     - Is unverifiable or fabricated.
     - Contains logical inconsistencies.
2. **Assign a probability score** to each hallucinated word or phrase according to the following criteria:
   - **Probability > 0.7**: Severe factual errors or contradictions.
   - **Probability 0.5 - 0.7**: Unverifiable or speculative content.
   - **Probability 0.3 - 0.5**: Minor inconsistencies or unverifiable details.
   - **Probability 0.1 - 0.3**: Minor inaccuracies or vague ambiguities.
   - **Do not label words with probability ≤ 0.1** (i.e., verifiable facts).

### **Additional Instructions**:
- Do **not** mark redundant or overly generic words (e.g., "the", "a", "and") as hallucinations unless they introduce factual errors.
- Pay special attention to:
  - **Numerical data** (e.g., dates, quantities, percentages).
  - **Named entities** (e.g., people, organizations, locations).
  - **Logical contradictions** (e.g., self-contradictions within the text).
- If background knowledge is absent, base your judgment solely on internal consistency.

### **Example**:
#### Input:
- **Question**: "What year did Einstein win the Nobel Prize?"
- **Model Output**: "Einstein won the Nobel Prize in Physics in 1922 for his discovery of the photoelectric effect."
- **Background Knowledge**: "Einstein won the Nobel Prize in Physics in 1921."

#### Output:
[
    {{"word": "1922", "prob": 0.9}}
]

### **Output Format**:
Return the result as a JSON array:
[
    {{"word": <example_word>, "prob": <probability>}},
    {{"word": <another_word>, "prob": <probability>}}
]

### Important:
- Provide precise word-level annotations.
- Do not include any text or explanations outside the JSON array.
"""


In [None]:
def evaluate_with_selfcheck(question, output, context="", language="en", n=5, retries=3):
    # Returns a list of {"word": ..., "prob": ...} dicts, or [] if every attempt fails

    if context is None:
        context = ""

    language = language.lower()

    prompt = prompt_template.format(question=question, output=output, context=context, language=language)

    for attempt in range(retries):
        try:
            response = requests.post(
                "https://api.openai.com/v1/chat/completions",
                headers={
                    "Authorization": f"Bearer {api_key}",
                    "Content-Type": "application/json"
                },
                json={
                    "model": "gpt-3.5-turbo",
                    "messages": [{"role": "user", "content": prompt}],
                    "n": n
                },
                proxies=proxies
            )
        except requests.exceptions.RequestException as e:
            print(f"Request failed: {e}")
            continue  # try again instead of giving up on the first error

        if response.status_code != 200:
            print(f"Request failed with status code: {response.status_code}")
            print(f"Response: {response.text}")
            continue

        # n completions are requested, but only the first one is used
        content = response.json()["choices"][0]["message"]["content"]
        try:
            return json.loads(content)
        except json.JSONDecodeError as e:
            print(f"Failed to parse JSON content: {content}. Error: {e}")
            # fall through to the next attempt

    print("Retry limit exceeded, returning empty result")
    return []
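
The model does not always return a clean JSON array (the run output further down shows replies with trailing text or malformed entries). One possible way to make parsing more tolerant is sketched here. `parse_hallucination_json` is a hypothetical helper, not part of the notebook, and it only recovers replies where a valid array is embedded in extra text; truly malformed JSON still yields an empty list.

```python
import json
import re


def parse_hallucination_json(content):
    """Return a list of {"word": str, "prob": float} dicts, or [] if nothing usable."""
    try:
        data = json.loads(content)
    except json.JSONDecodeError:
        # Fall back to the outermost [...] block found in the reply
        match = re.search(r"\[.*\]", content, flags=re.DOTALL)
        if not match:
            return []
        try:
            data = json.loads(match.group(0))
        except json.JSONDecodeError:
            return []
    if not isinstance(data, list):
        return []

    cleaned = []
    for item in data:
        if not isinstance(item, dict):
            continue
        word, prob = item.get("word"), item.get("prob")
        # Keep only entries with a non-empty word and a numeric probability
        if (
            isinstance(word, str)
            and word
            and isinstance(prob, (int, float))
            and not isinstance(prob, bool)
        ):
            cleaned.append({"word": word, "prob": float(prob)})
    return cleaned


reply = 'Here is the result:\n[{"word": "1922", "prob": 0.9}]\nDone.'
print(parse_hallucination_json(reply))
```

Validating each entry here also protects the later `item["prob"] > 0.1` filter from entries that lack a `prob` key.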

In [None]:
# Locate word positions in the original text
def locate_word_positions(words_with_probs, model_output_text):
    ranges = []
    for item in words_with_probs:
        word = item["word"]
        prob = item["prob"]
        if not word:
            # An empty string matches at every index, so the loop below would never end
            continue
        start_idx = model_output_text.find(word)
        while start_idx != -1:
            end_idx = start_idx + len(word)
            ranges.append((start_idx, end_idx, prob))
            start_idx = model_output_text.find(word, end_idx)
    return ranges

# Merge overlapping ranges
def merge_ranges(ranges):
    if not ranges:
        return []
    # Sort ranges by start position (on a copy, so the caller's list is not reordered)
    ranges = sorted(ranges, key=lambda x: x[0])
    merged = [ranges[0]]
    for current in ranges[1:]:
        last = merged[-1]
        if current[0] <= last[1]:  # Overlapping
            new_end = max(last[1], current[1])
            new_prob = (last[2] + current[2]) / 2  # Average probabilities
            merged[-1] = (last[0], new_end, new_prob)
        else:
            merged.append(current)
    return merged

# Compute average probabilities with enhanced overlap weighting
def compute_average_probability_v3(merged_ranges, all_ranges):
    avg_probs = []
    for m_start, m_end, _ in merged_ranges:
        total_prob = 0
        total_overlap_weight = 0

        for r_start, r_end, prob in all_ranges:
            # Calculate overlap length
            overlap_start = max(m_start, r_start)
            overlap_end = min(m_end, r_end)
            overlap_length = max(0, overlap_end - overlap_start)

            # Add weighted contribution (consider overlap frequency)
            if overlap_length > 0:
                weight = overlap_length  # Base weight is overlap length
                total_prob += prob * weight
                total_overlap_weight += weight

        # Length-weighted mean raised to the power 1.2. For values in (0, 1) this
        # lowers the score (e.g. 0.5 becomes about 0.435); 0 and 1 are unchanged.
        if total_overlap_weight > 0:
            final_prob = (total_prob / total_overlap_weight) ** 1.2
        else:
            final_prob = 0  # No overlap, probability is zero

        avg_probs.append(final_prob)
    return avg_probs

# Main function to process hallucination detection
def process_hallucination_detection(question, model_output_text, context, language):
    # Call GPT model to get hallucinated words and probabilities
    hallucinations = evaluate_with_selfcheck(question, model_output_text, context, language)
    # print("Hallucinations detected:", hallucinations)

    # Filter out hallucinations with probability <= 0.1
    hallucinations = [item for item in hallucinations if item["prob"] > 0.1]

    # Locate hallucination positions in the model output text
    hallucination_ranges = locate_word_positions(hallucinations, model_output_text)
    # print("Hallucination Ranges:", hallucination_ranges)

    # Merge overlapping ranges
    merged_ranges = merge_ranges(hallucination_ranges)
    # print("Merged Ranges:", merged_ranges)

    # Compute final probabilities for merged ranges
    final_probabilities = compute_average_probability_v3(merged_ranges, hallucination_ranges)

    # Prepare final output
    result = []
    for i, (start, end, _) in enumerate(merged_ranges):
        result.append({
            "start": start,
            "end": end,
            "prob": final_probabilities[i]
        })
    return result
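
To make the span arithmetic concrete, here is a small worked example of locating, merging, and scoring. It is a compact sketch with hypothetical helper names (`locate`, `merge`, `span_probability`) that mirror the functions above and assume the same 1.2 exponent; the example text and probabilities are made up.

```python
def locate(words, text):
    """Find every occurrence of each word and return (start, end, prob) tuples."""
    spans = []
    for item in words:
        word = item["word"]
        if not word:
            continue
        start = text.find(word)
        while start != -1:
            spans.append((start, start + len(word), item["prob"]))
            start = text.find(word, start + len(word))
    return spans


def merge(spans):
    """Union overlapping or touching spans into (start, end) pairs."""
    merged = []
    for start, end, _ in sorted(spans):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def span_probability(span, spans, power=1.2):
    """Overlap-length-weighted mean of the source probabilities, raised to `power`."""
    m_start, m_end = span
    weighted = total = 0.0
    for start, end, prob in spans:
        overlap = max(0, min(m_end, end) - max(m_start, start))
        weighted += prob * overlap
        total += overlap
    return (weighted / total) ** power if total else 0.0


text = "Einstein won the Nobel Prize in 1922."
words = [{"word": "in 1922", "prob": 0.6}, {"word": "1922", "prob": 0.9}]

spans = locate(words, text)   # "in 1922" covers 7 characters, "1922" covers 4
merged = merge(spans)         # both collapse into a single span
soft_labels = [
    {"start": s, "end": e, "prob": span_probability((s, e), spans)}
    for s, e in merged
]
print(soft_labels)
```

Here the merged span gets the weighted mean (0.6 × 7 + 0.9 × 4) / 11 ≈ 0.709, which the exponent then reduces to roughly 0.66.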

In [None]:
def process_dataset(input_folder, output_folder):
    os.makedirs(output_folder, exist_ok=True)
    input_files = glob.glob(os.path.join(input_folder, "*.jsonl"))

    with tqdm(total=len(input_files), desc="Processing Files", unit="file") as file_progress:
        for file_path in input_files:
            with open(file_path, 'r', encoding='utf-8') as f:
                data = [json.loads(line) for line in f]

            output_data = []

            with tqdm(total=len(data), desc=f"Processing {os.path.basename(file_path)}", unit="entry", leave=False) as entry_progress:
                for entry in data:
                    try:
                        question = entry.get("model_input", "")
                        model_output_text = entry.get("model_output_text", "")
                        context = entry.get("context_googlecse", "")
                        language = entry.get("lang", "").lower()

                        soft_labels = process_hallucination_detection(
                            question, model_output_text, context, language
                        )
                        hard_labels = recompute_hard_labels(soft_labels)

                        output_entry = {
                            "id": entry.get("id"),
                            "lang": entry.get("lang"),
                            "model_input": entry.get("model_input"),
                            "model_output_text": entry.get("model_output_text"),
                            "model_id": entry.get("model_id"),
                            "soft_labels": soft_labels,
                            "hard_labels": hard_labels,
                            "model_output_logits": entry.get("model_output_logits"),
                            "model_output_tokens": entry.get("model_output_tokens")
                        }

                        output_data.append(output_entry)

                    except Exception as e:
                        print(f"Error processing entry {entry.get('id')}: {e}")

                    entry_progress.update(1)

            output_file = os.path.join(output_folder, os.path.basename(file_path))
            with open(output_file, 'w', encoding='utf-8') as f:
                for item in output_data:
                    f.write(json.dumps(item, ensure_ascii=False) + '\n')

            file_progress.update(1)
            print(f"Processed and saved: {output_file}")
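
`recompute_hard_labels` comes from the external `scorer` module, which is not shown in this notebook. As a rough illustration of what deriving hard labels from soft labels can look like, the sketch below assumes a fixed 0.5 threshold and joins spans that overlap or touch; both the helper name `hard_labels_from_soft` and the threshold are assumptions, not the scorer's confirmed behavior.

```python
def hard_labels_from_soft(soft_labels, threshold=0.5):
    """Keep spans whose probability exceeds the threshold and join adjacent ones."""
    kept = sorted(
        (lbl["start"], lbl["end"]) for lbl in soft_labels if lbl["prob"] > threshold
    )
    hard = []
    for start, end in kept:
        if hard and start <= hard[-1][1]:
            # Overlapping or touching the previous span: extend it
            hard[-1][1] = max(hard[-1][1], end)
        else:
            hard.append([start, end])
    return hard


soft = [
    {"start": 0, "end": 4, "prob": 0.9},
    {"start": 4, "end": 9, "prob": 0.7},
    {"start": 20, "end": 25, "prob": 0.3},  # below the assumed threshold, dropped
]
print(hard_labels_from_soft(soft))  # [[0, 9]]
```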

In [None]:
input_folder = "data/val/val_new/"
output_folder = "data/val/detect_gpt2/"

process_dataset(input_folder, output_folder)

Processed and saved: data/val/detect_gpt2/mushroom.ar-val.v2.jsonl

Failed to parse JSON content: []
Word: 200
Probability: 0.9

Word: 300
Probability: 0.9. Error: Extra data: line 2 column 1 (char 3)

Processed and saved: data/val/detect_gpt2/mushroom.de-val.v2.jsonl

Processed and saved: data/val/detect_gpt2/mushroom.en-val.v2.jsonl

Request failed: HTTPSConnectionPool(host='api.openai.com', port=443): Max retries exceeded with url: /v1/chat/completions (Caused by SSLError(SSLZeroReturnError(6, 'TLS/SSL connection has been closed (EOF) (_ssl.c:1149)')))

Request failed: HTTPSConnectionPool(host='api.openai.com', port=443): Max retries exceeded with url: /v1/chat/completions (Caused by SSLError(SSLZeroReturnError(6, 'TLS/SSL connection has been closed (EOF) (_ssl.c:1149)')))

Request failed: HTTPSConnectionPool(host='api.openai.com', port=443): Max retries exceeded with url: /v1/chat/completions (Caused by SSLError(SSLZeroReturnError(6, 'TLS/SSL connection has been closed (EOF) (_ssl.c:1149)')))

Request failed: HTTPSConnectionPool(host='api.openai.com', port=443): Max retries exceeded with url: /v1/chat/completions (Caused by SSLError(SSLZeroReturnError(6, 'TLS/SSL connection has been closed (EOF) (_ssl.c:1149)')))

Processed and saved: data/val/detect_gpt2/mushroom.es-val.v2.jsonl

Request failed: HTTPSConnectionPool(host='api.openai.com', port=443): Max retries exceeded with url: /v1/chat/completions (Caused by SSLError(SSLZeroReturnError(6, 'TLS/SSL connection has been closed (EOF) (_ssl.c:1149)')))

Processed and saved: data/val/detect_gpt2/mushroom.fi-val.v2.jsonl

Failed to parse JSON content: [
    {"word": "Cihuacoatl", "prob": 0.8},
    {"word": "érA",
    {"word": "déesse", "prob": 0.6},
    {"word": "mort", "prob": 0.4}
]. Error: Expecting property name enclosed in double quotes: line 4 column 5 (char 67)

Processed and saved: data/val/detect_gpt2/mushroom.fr-val.v2.jsonl

Processed and saved: data/val/detect_gpt2/mushroom.hi-val.v2.jsonl

Processed and saved: data/val/detect_gpt2/mushroom.it-val.v2.jsonl

Processed and saved: data/val/detect_gpt2/mushroom.sv-val.v2.jsonl

Failed to parse JSON content: [
    {"word": "生态城", "prob": 0.8},
    {"word": "太阳能", "prob": 0.6},
    {"word": "风能", "prob": 0.6},
    {"word": "7. 为了提高能源效率，采用节能建筑设计和绿色建筑材料。", "prob": 0.8},
    {"word": " "使用 Python 的 Flas", "prob": 0.9},
    {"word": "Flast", "prob": 0.8},
    {"word": "pipinstall", "prob": 0.7},
    {"word": "Flass", "prob": 0.6}
]. Error: Expecting ',' delimiter: line 6 column 17 (char 178)

Processed and saved: data/val/detect_gpt2/mushroom.zh-val.v2.jsonl

## Evaluation
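
The IoU reported below is computed by the external `scorer` module, which is not reproduced here. As a rough mental model, a character-level intersection-over-union between reference and predicted hard labels can be sketched like this; `char_iou` is a hypothetical helper, and treating two empty label sets as a perfect match is an assumption.

```python
def char_iou(ref_spans, pred_spans):
    """Intersection over union of the character indices covered by two span lists."""
    ref = {i for start, end in ref_spans for i in range(start, end)}
    pred = {i for start, end in pred_spans for i in range(start, end)}
    if not ref and not pred:
        return 1.0  # assumed convention: nothing to find, nothing predicted
    return len(ref & pred) / len(ref | pred)


# Reference marks "1922" (4 chars); prediction marks "in 1922" (7 chars)
print(char_iou([[32, 36]], [[29, 36]]))
```

In this example the prediction covers the whole reference span plus three extra characters, so the score is 4/7.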

In [None]:
def evaluate_iou_and_cor(val_dir, detect_dir, output_file=None):
    """
    Evaluate IoU and Spearman correlation between the reference (val) and detected (detect) files.

    :param val_dir: Directory containing the ground truth files (e.g., data/val/val/)
    :param detect_dir: Directory containing the detected files (e.g., data/detect/)
    :param output_file: Path to save the evaluation results (optional)
    """
    # Compare files that share the same name (same language) in both directories
    for val_file in sorted(os.listdir(val_dir)):
        # Skip non-JSONL files
        if not val_file.endswith('.jsonl'):
            continue

        # os.listdir returns bare file names, so the detected file has the same name
        detect_file_path = os.path.join(detect_dir, val_file)

        if not os.path.exists(detect_file_path):
            print(f"Warning: {detect_file_path} not found, skipping.")
            continue

        # Load ground truth (val) and detected (detect) data
        ref_dicts = load_jsonl_file_to_records(os.path.join(val_dir, val_file))
        pred_dicts = load_jsonl_file_to_records(detect_file_path)

        # Calculate IoU and Spearman correlation
        ious, cors = main(ref_dicts, pred_dicts)

        # Print or save the results
        print(f"Results for {val_file}:")
        print(f"  Mean IoU: {ious.mean():.8f}")
        print(f"  Mean Spearman Correlation: {cors.mean():.8f}")

        # Optionally, save the results to a file
        if output_file:
            with open(output_file, 'a', encoding='utf-8') as f:
                f.write(f"Results for {val_file}:\n")
                f.write(f"  Mean IoU: {ious.mean():.8f}\n")
                f.write(f"  Mean Spearman Correlation: {cors.mean():.8f}\n\n")

val_dir = 'data/val/val/'
detect_dir = 'data/val/detect_gpt2/'
output_file = 'evaluation_results.txt'
evaluate_iou_and_cor(val_dir, detect_dir, output_file)

Results for mushroom.ar-val.v2.jsonl:
  Mean IoU: 0.39275158
  Mean Spearman Correlation: 0.42841966
Results for mushroom.de-val.v2.jsonl:
  Mean IoU: 0.41210710
  Mean Spearman Correlation: 0.48361960
Results for mushroom.en-val.v2.jsonl:
  Mean IoU: 0.32727742
  Mean Spearman Correlation: 0.39862525
Results for mushroom.es-val.v2.jsonl:
  Mean IoU: 0.33665798
  Mean Spearman Correlation: 0.28163105
Results for mushroom.fi-val.v2.jsonl:
  Mean IoU: 0.28771503
  Mean Spearman Correlation: 0.33508468
Results for mushroom.fr-val.v2.jsonl:
  Mean IoU: 0.25366424
  Mean Spearman Correlation: 0.30476055
Results for mushroom.hi-val.v2.jsonl:
  Mean IoU: 0.51374336
  Mean Spearman Correlation: 0.58168839
Results for mushroom.it-val.v2.jsonl:
  Mean IoU: 0.47239423
  Mean Spearman Correlation: 0.53218903
Results for mushroom.sv-val.v2.jsonl:
  Mean IoU: 0.32676697
  Mean Spearman Correlation: 0.38114322
Results for mushroom.zh-val.v2.jsonl:
  Mean IoU: 0.16656293
  Mean Spearman Correlation: 0