# HaloScope: Harnessing Unlabeled LLM Generations for Hallucination Detection

#### Group Members:
1. Shyana Srikanthalingam (shyanasri@gmail.com)
2. Tanzell Go (tanzell.go@torontomu.ca)

# Introduction:

#### Problem Description:

This paper proposes HaloScope, a method for training a hallucination detector from an unlabeled dataset of prompts and LLM-generated answers. By analyzing the LLM's latent activation space, the researchers identify a subspace that captures the patterns associated with hallucinated outputs. Structural properties of this subspace make it possible to infer labels automatically, separating hallucinated from truthful answers, so hallucination detection can scale without manual annotation.


#### Context of the Problem:

Manual annotation of hallucinated text is labor-intensive and lacks scalability, limiting our ability to train effective detection models. To overcome this, alternative approaches are needed that leverage the abundance of unlabeled data. Expanding the usable data pool enables more flexible and robust hallucination detection.


#### Limitations of Other Approaches:

The baseline methods compared against HaloScope have at least one of three weaknesses: higher computational complexity, unreliable detection because the LLM-assisted mechanisms are overconfident, or inputs and output distributions that are poorly suited to training hallucination detectors. All of them also underperform HaloScope in accuracy.


#### Solution:

HaloScope takes the embeddings that the LLM produces for each generated answer in its latent activation space and measures their distance from the origin, on the assumption that truthful answers cluster near the origin while hallucinations lie farther away. It then assigns each sample a membership score, effectively labeling the previously unlabeled data, and uses these scores to train a hallucination classifier.
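
As a rough illustration of the scoring idea, the sketch below centers a matrix of answer embeddings, finds the dominant directions with NumPy's SVD, and scores each sample by its singular-value-weighted squared projection onto those directions. The function name `membership_scores`, the parameter `k`, the weighting, and the synthetic data are all assumptions made for illustration; this is not the notebook's code.

```python
import numpy as np


def membership_scores(embeddings: np.ndarray, k: int = 1) -> np.ndarray:
    """Score each sample by its projection onto the top-k singular directions.

    embeddings: (n_samples, hidden_dim) array, one row per generated answer.
    Larger scores mean the sample lies farther from the center along the
    dominant directions (assumed here to indicate hallucination).
    """
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    # Rows of vt are the right singular vectors, sorted by singular value.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    projections = centered @ vt[:k].T              # (n_samples, k)
    weighted = singular_values[:k] * projections ** 2
    return weighted.mean(axis=1)


# Synthetic data: 95 points near the center plus 5 pushed far along one axis.
rng = np.random.default_rng(0)
data = rng.normal(scale=0.1, size=(100, 8))
data[:5, 0] += 5.0
scores = membership_scores(data, k=1)
outlier_rank = np.argsort(scores)[::-1][:5]        # indices of the 5 highest scores
```

On this toy data the five shifted rows receive the highest scores, which is the behavior a threshold on the membership score would exploit.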


# Background

| Reference | Explanation | Dataset/Input | Weakness |
| --- | --- | --- | --- |
| Ren et al. [1] | (Perplexity) Uses the average perplexity of the generated tokens as an uncertainty measure | See note below * | Higher time complexity during inference and lower accuracy |
| Malinin and Gales [2] | (Length-Normalized Entropy) Uses the entropy of the predicted token distribution for each generated token, normalized by sequence length | See note below * | Higher time complexity during inference and lower accuracy |
| Kuhn et al. [3] | (Semantic Entropy) Uses the semantic relationship between predicted tokens to measure uncertainty | See note below * | Higher time complexity during inference and lower accuracy |
| Lin et al. [4] | (Lexical Similarity) Generates multiple outputs for a single input and measures the lexical similarity among them to classify hallucination | See note below * | Higher time complexity during inference and lower accuracy |
| Manakul et al. [5] | (SelfCKGPT) Samples multiple outputs and uses DeBERTa-v3-large to measure contradiction, which indicates hallucination | See note below * | Higher time complexity during inference and lower accuracy |
| Chen et al. [6] | (EigenScore) Uses multiple generations and an eigendecomposition of the activation covariance matrix to infer hallucination tendency | See note below * | Higher time complexity during inference and lower accuracy |
| Lin et al. [7] | (Verbalize) Asks the LLM to rate its confidence in a response on a scale of 0-100 | See note below * | Not very reliable because LLMs are overconfident |
| Kadavath et al. [8] | (Self-evaluation) Asks the LLM to label a response as "True" or "False" | See note below * | Not very reliable because LLMs are overconfident |
| Burns et al. [9] | (CCS*) Trains a binary truthfulness classifier on the premise that a statement and its negation have opposite truth values. The HaloScope researchers re-appropriated it for LLM-generated answers (it was originally based on human-written answers) | See note below * | Inputs don't accurately mirror an LLM's output |
| Du et al. [10] | (HaloScope) Uses membership scores derived from a subspace of the activation space (deemed to house hallucination) to train an MLP classifier | See note below * | Still susceptible to distribution shifts from the labeled to the unlabeled dataset |

\* The researchers evaluated all approaches on the same dataset/input (LLM generations from CoQA, TruthfulQA, TriviaQA, and TyDi QA) to keep the comparison consistent.
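
To make the first two rows of the table concrete, here is a minimal sketch of perplexity and a length-normalized entropy computed from token-level model outputs. It assumes the per-token log-probabilities and predictive distributions are already available; the function names and the exact normalization are illustrative assumptions and may differ from the cited papers' formulations.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """exp of the negative mean log-probability of the generated tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def length_normalized_entropy(token_dists: Sequence[Sequence[float]]) -> float:
    """Mean Shannon entropy (in nats) of the per-token predictive distributions."""
    total = 0.0
    for dist in token_dists:
        total -= sum(p * math.log(p) for p in dist if p > 0.0)
    return total / len(token_dists)


# Two generated tokens with probabilities 0.5 and 0.25.
ppl = perplexity([math.log(0.5), math.log(0.25)])
# Two steps, each with a uniform distribution over a 4-token vocabulary.
ent = length_normalized_entropy([[0.25] * 4, [0.25] * 4])
```

Higher values of either quantity indicate more uncertainty about the generated answer, which these baselines treat as a sign of hallucination.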


# Methodology

**Load Dataset:** The process begins by loading the chosen question-answering dataset, TruthfulQA. The data is then formatted to fit the structure required for LLM prompting.
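
The formatting step could look roughly like the sketch below. The prompt wording, the helper names `build_prompt` and `format_records`, and the record fields (`question`, `best_answer`, `correct_answers`) are assumptions for illustration; the sample record is hand-written in the style of TruthfulQA rather than loaded from the dataset.

```python
from typing import Dict, List


def build_prompt(question: str) -> str:
    """Wrap a question in a short instruction template (hypothetical wording)."""
    return f"Answer the question concisely. Q: {question.strip()} A:"


def format_records(records: List[Dict]) -> List[Dict]:
    """Keep only what later steps need: the prompt and the reference answers."""
    formatted = []
    for record in records:
        formatted.append({
            "prompt": build_prompt(record["question"]),
            "references": [record["best_answer"], *record["correct_answers"]],
        })
    return formatted


# Illustrative record; field names are assumed, not taken from the notebook.
sample = [{
    "question": " What happens if you eat watermelon seeds? ",
    "best_answer": "The watermelon seeds pass through your digestive system",
    "correct_answers": ["Nothing happens"],
}]
formatted = format_records(sample)
```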

**LLM Answer Generation:** After the dataset is processed, the tokenizer and the chosen model (Llama) are initialized. The LLM then iterates over the questions and generates answers, which are saved to files for later evaluation.
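
The generate-and-save loop can be sketched independently of any particular model. In the sketch below, `generate_fn` is a placeholder for the real Llama call, and the JSON Lines layout, the function names, and the file name are assumptions rather than the notebook's actual format.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Iterable


def generate_and_save(prompts: Iterable[str],
                      generate_fn: Callable[[str], str],
                      out_path: Path) -> int:
    """Run generate_fn on each prompt and write one JSON record per line."""
    count = 0
    with out_path.open("w", encoding="utf-8") as handle:
        for index, prompt in enumerate(prompts):
            answer = generate_fn(prompt)
            record = {"id": index, "prompt": prompt, "answer": answer}
            handle.write(json.dumps(record) + "\n")
            count += 1
    return count


def dummy_model(prompt: str) -> str:
    """Stand-in for the real LLM call; it only reports the prompt length."""
    return f"answer of length {len(prompt)}"


out_file = Path(tempfile.mkdtemp()) / "generations.jsonl"
written = generate_and_save(["Q: a A:", "Q: bb A:"], dummy_model, out_file)
saved = [json.loads(line) for line in out_file.read_text(encoding="utf-8").splitlines()]
```

Passing the model call in as a function keeps the loop testable without loading model weights.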

**Ground Truth with BLEURT:** A second model (BLEURT) is then initialized. It takes the Llama-generated answers from the previous step and computes BLEURT scores against the reference answers from TruthfulQA. The results are saved to a file for evaluation.
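
One plausible way to turn BLEURT scores into ground-truth labels is to threshold the best score over all reference answers, as sketched below. The scores are made-up inputs (no BLEURT model is run here), and the `0.5` cutoff and the function name `label_generations` are assumptions for illustration.

```python
from typing import List, Sequence


def label_generations(bleurt_scores: Sequence[Sequence[float]],
                      threshold: float = 0.5) -> List[int]:
    """Return 1 (truthful) if any reference scores above threshold, else 0.

    bleurt_scores[i] holds the scores of generated answer i against each of
    its reference answers. The 0.5 default is an assumed cutoff.
    """
    return [int(max(scores) > threshold) for scores in bleurt_scores]


# Three generated answers with 2, 2, and 1 reference answers respectively.
labels = label_generations([[0.12, 0.71], [0.30, 0.44], [0.93]])
```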

**Optimizing Hallucination Subspace:** Next, the solution searches for the best representation of the hallucination subspace in the LLM's latent space. Three configuration choices define this subspace: the representation type (i.e., the embedding after the attention layer, the embedding after the feed-forward layer in an attention layer, or the embedding before the feed-forward layer), the layer whose latent activation space hosts the target subspace, and the number of subspace dimensions to keep. The best configuration is then used to compute a membership score for each generated answer, indicating whether it sits closer to the truthful side or the hallucinated side. Finally, the generated answers and their membership scores are fed into an MLP for hallucination-detection training. The best subspace configuration is the one that yields the highest AUROC for the MLP predictions.
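
The selection criterion can be sketched in plain Python: compute the AUROC of each configuration's predictions and keep the best one. The pairwise AUROC below is a simple reference implementation; the configuration tuples, the prediction values, and the convention that label `1` means hallucinated are hypothetical.

```python
from typing import Dict, Sequence, Tuple


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive outscores a random negative (ties count half)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(positives) * len(negatives))


Config = Tuple[str, int, int]  # (representation type, layer index, subspace dims)


def select_best_config(labels: Sequence[int],
                       predictions: Dict[Config, Sequence[float]]):
    """Return the configuration whose predictions reach the highest AUROC."""
    results = {cfg: auroc(labels, preds) for cfg, preds in predictions.items()}
    best = max(results, key=results.get)
    return best, results[best]


# Hypothetical validation labels and classifier outputs for two configurations.
val_labels = [0, 0, 1, 1]
val_predictions = {
    ("attention", 10, 1): [0.1, 0.4, 0.35, 0.8],
    ("mlp", 16, 5): [0.2, 0.3, 0.6, 0.9],
}
best_config, best_auroc = select_best_config(val_labels, val_predictions)
```

The pairwise loop is quadratic in the number of samples, which is fine for a small validation split; a rank-based implementation would be preferable for larger sets.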

![haloscope flow](./haloscope.drawio.png)
<em>Simplified view of Haloscope</em>

**(BONUS IMPLEMENTATION) LSTM-Based Detection:** The MLP component is then swapped for an LSTM so that the detector considers all layers in sequence rather than only the single best layer (one of the subspace configuration choices). Instead of the best layer's membership score used in the MLP setup, the embeddings from every layer are fed into the LSTM sequentially, which produces comparable results (see the implementation for more details).
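
A minimal PyTorch sketch of this idea treats the stack of per-layer embeddings for one answer as a sequence over depth and classifies from the LSTM's final hidden state. The class name `LayerSequenceLSTM`, the layer sizes, and the random input are assumptions for illustration, not the architecture used in the notebook.

```python
import torch
from torch import nn


class LayerSequenceLSTM(nn.Module):
    """Treats the per-layer embeddings of one answer as a sequence over depth."""

    def __init__(self, hidden_dim: int, lstm_dim: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(input_size=hidden_dim, hidden_size=lstm_dim, batch_first=True)
        self.head = nn.Linear(lstm_dim, 1)

    def forward(self, layer_embeddings: torch.Tensor) -> torch.Tensor:
        # layer_embeddings: (batch, num_layers, hidden_dim)
        _, (final_hidden, _) = self.lstm(layer_embeddings)
        # final_hidden: (1, batch, lstm_dim); index -1 takes the only LSTM layer.
        return self.head(final_hidden[-1]).squeeze(-1)  # logits, shape (batch,)


torch.manual_seed(0)
model = LayerSequenceLSTM(hidden_dim=32)
batch = torch.randn(4, 12, 32)   # 4 answers, 12 layers, 32-dim embeddings (toy sizes)
logits = model(batch)
probabilities = torch.sigmoid(logits)  # one hallucination probability per answer
```

Training such a model would typically pair the logits with `nn.BCEWithLogitsLoss` and the BLEURT-derived labels.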


# Implementation

In [None]:
#LLM Generations with TruthfulQA

In [None]:
#Initial BLEURT Scores

In [None]:
#Linear Probe Classifier

In [None]:
#LSTM Classifier

# Conclusion and Future Direction

# References:

[1]:  Jie Ren, Jiaming Luo, Yao Zhao, Kundan Krishna, Mohammad Saleh, Balaji Lakshminarayanan, and Peter J Liu. Out-of-distribution detection and selective generation for conditional language models. In The Eleventh International Conference on Learning Representations, 2023.

[2]:  Andrey Malinin and Mark Gales. Uncertainty estimation in autoregressive structured prediction. In International Conference on Learning Representations, 2021.

[3]:  Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. In International Conference on Learning Representations, 2023.

[4]:  Zhen Lin, Shubhendu Trivedi, and Jimeng Sun. Generating with confidence: Uncertainty quantification for black-box large language models. arXiv preprint arXiv:2305.19187, 2023.

[5]:  Potsawee Manakul, Adian Liusie, and Mark JF Gales. SelfCheckGPT: Zero-resource black-box hallucination detection for generative large language models. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023.

[6]:  Chao Chen, Kai Liu, Ze Chen, Yi Gu, Yue Wu, Mingyuan Tao, Zhihang Fu, and Jieping Ye. INSIDE: LLMs’ internal states retain the power of hallucination detection. In International Conference on Learning Representations, 2024.

[7]:  Stephanie Lin, Jacob Hilton, and Owain Evans. Teaching models to express their uncertainty in words. Transactions on Machine Learning Research, 2022.

[8]:  Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, et al. Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221, 2022.

[9]:  Collin Burns, Haotian Ye, Dan Klein, and Jacob Steinhardt. Discovering latent knowledge in language models without supervision. International Conference on Learning Representations, 2023.

[10]:  Xuefeng Du, Chaowei Xiao, and Yixuan Li. Haloscope: Harnessing unlabeled LLM generations for hallucination detection. In Advances in Neural Information Processing Systems, 2024.
