This repository contains research on how the number of occurrences of prompt and generation tokens in the training data affects LLM hallucinations. Experiments are run on Compute Canada.
The main idea is to systematically search for the n-gram occurrences of consecutive prompt and generation tokens within an LLM's training datasets. Because the data is large (over 1 trillion tokens), a suffix array is needed to allow O(log N) search time. Suffix array construction relies on the SA-IS algorithm and Google's Rust implementation.
As it is prohibitively expensive to train an LLM from scratch, we use the open-source RedPajama-INCITE-Instruct models and their associated pretraining dataset for the experiments.
This code base is designed to run on Compute Canada. There, first load the necessary modules with `module load StdEnv/2023 gcc arrow python rust`, then create a virtual environment and install the Python requirements with `pip install -r requirements.txt`, and finally compile the Rust source code into a Python module with `maturin develop`.
To download the RedPajama 1T data, go to a location with enough disk space (configurable in `config.ini`; the dataset is approximately 3 TB compressed, and creating the data arrays requires considerably more space) and run `datasets/redpajama/download_data.sh`.
To create searchable suffix arrays for the RedPajama data, run the following scripts in order:

- To create the concatenated arrays of RedPajama 1T data for suffix array creation, run `slurm_scripts/redpajama_1t_concatenate_data_arrays.sh` with enough parallel processes.
- To create the memory-mapped arrays of all concatenated RedPajama 1T data arrays, run `slurm_scripts/redpajama_1t_mmap_data_arrays.sh`.
- To create the final suffix arrays, run `slurm_scripts/redpajama_1t_make_suffix_arrays.sh` with enough CPU cores, or preferably run scripts that create parts of the suffix arrays separately due to time constraints (e.g., `slurm_scripts/redpajama_1t_make_suffix_arrays_cc2019.sh`).
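As a toy illustration of the concatenate-and-mmap steps above (the file name `tokens.bin` and the separator token id are assumptions for this sketch, not the pipeline's actual on-disk format):

```python
import mmap
import struct
from pathlib import Path

SEP = 0xFFFF  # hypothetical document-separator token id

def concatenate(docs: list[list[int]], out_path: Path) -> None:
    # Pack each document's uint16 token ids back to back, separated by SEP,
    # into one flat binary file.
    with out_path.open("wb") as f:
        for doc in docs:
            f.write(struct.pack(f"<{len(doc)}H", *doc))
            f.write(struct.pack("<H", SEP))

def open_mmap(path: Path) -> mmap.mmap:
    # Memory-map the concatenated array so downstream suffix-array
    # construction and search never load the whole file into RAM.
    f = path.open("rb")
    return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

out = Path("tokens.bin")
concatenate([[1, 2, 3], [4, 5]], out)
mm = open_mmap(out)
print(len(mm) // 2)  # 7 uint16 values: 3 + 1 + 2 + 1
```

At the 1T-token scale this is exactly why the slurm scripts are split into separate concatenate, mmap, and suffix-array stages: each stage streams over the flat file rather than holding it in memory.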
To download other test datasets, simply run the pipeline below. The datasets will be automatically downloaded from the Hugging Face Hub on first use, and loaded from disk on subsequent runs. Any analysis outputs or modifications to the datasets will also be saved to disk.
- Follow the instructions in RedPajama 1T to download the RedPajama dataset and create suffix arrays. This will take a long time. For testing purposes, you can first try the much smaller OpenWebText dataset with `slurm_scripts/openwebtext_concatenate_data_array.sh` and `slurm_scripts/openwebtext_make_suffix_arrays.sh` to get an idea of the process.
- Use the RedPajama model to generate answers for a test dataset. For TriviaQA, for example, run `slurm_scripts/triviaqa_generate.sh`.
- Count the occurrences of the prompts and generations of a test dataset. For TriviaQA, run `slurm_scripts/triviaqa_count_occ.sh`.
- Analyze the results (AUROC curves and classifier performance) with `analyses.py` to evaluate hallucination detection performance. This includes:
  - plotting AUROC curves of individual features (e.g., log probabilities, n-gram statistics);
  - training classifiers (decision trees or neural networks) that combine multiple features.

The script supports several configuration options:

| Option | Description | Values / Notes |
| --- | --- | --- |
| `--metric` | Metric used to define hallucination | `rougeL`, `exact match` |
| `--use_avg_occurrences` | Use the raw frequency model (average n-gram counts) instead of n-gram log probabilities | Flag (omit for the n-gram model) |
| `--filter_stopwords` | Filter n-grams where more than 66% of tokens are stopwords (raw frequency model), or n-grams ending in stopwords (n-gram model) | Flag |
| `--plot_auroc` | Plot AUROC curves for all features | Flag |
| `--classifier` | Classifier type for hallucination detection | `"decision tree"` or `"neural network"` |
| `--model` | Generation model used | e.g., `redpajama-7b` |
| `--dataset` | Dataset name | e.g., `triviaqa`, `nq_open` |

Example:

```
python analyses.py \
    --dataset triviaqa \
    --model redpajama-7b \
    --metric rougeL \
    --plot_auroc \
    --use_avg_occurrences \
    --classifier "decision tree"
```
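As background for the AUROC plots: for a single scalar feature, AUROC equals the probability that a randomly chosen positive (e.g., hallucinated) example scores higher than a randomly chosen negative one, with ties counting one half. A minimal sketch, independent of `analyses.py`:

```python
def auroc(scores: list[float], labels: list[int]) -> float:
    # Rank statistic (equivalent to the Mann-Whitney U): fraction of
    # positive/negative pairs where the positive outranks the negative.
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auroc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))  # 1.0
```

This pairwise form is O(n^2) and meant only to make the metric concrete; a library routine with sorted ranks is the practical choice.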