This repository contains research on how the number of occurrences of prompt and generation tokens in the training data affects LLM hallucinations. Experiments are run on Compute Canada.
The main idea is to systematically search for the n-gram occurrences of consecutive prompt and generation tokens within an LLM's training datasets. Because the data is large (over 1 trillion tokens), a suffix array is needed to allow O(log N) search time. Suffix array construction relies on the SA-IS algorithm and Google's Rust implementation.
As it is prohibitively expensive to train an LLM from scratch, we use the open-source RedPajama-INCITE-Instruct models and their associated pretraining dataset for the experiments.
This code base is designed to run on Compute Canada. There, first load the necessary modules with `module load StdEnv/2023 gcc arrow python rust`, then create a virtual environment and install the Python requirements with `pip install -r requirements.txt`, and finally compile the Rust source code into a Python module with `maturin develop`.
To download the RedPajama 1T data, go to a location with enough disk space (configurable in `config.ini`; the dataset is approximately 3 TB compressed, and creating the data arrays requires considerably more space) and run `datasets/redpajama/download_data.sh`.
To create searchable suffix arrays for the RedPajama data, run the following scripts in order:

- To create the concatenated arrays of RedPajama 1T data for suffix array creation, run `slurm_scripts/redpajama_1t_concatenate_data_arrays.sh` with enough parallel processes.
- To create the memory-mapped arrays of all concatenated RedPajama 1T data arrays, run `slurm_scripts/redpajama_1t_mmap_data_arrays.sh`.
- To create the final suffix arrays, run `slurm_scripts/redpajama_1t_make_suffix_arrays.sh` with enough CPU cores, or preferably run scripts that create parts of the suffix arrays separately due to time constraints (e.g., `slurm_scripts/redpajama_1t_make_suffix_arrays_cc2019.sh`).
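As a toy illustration of the concatenate-and-mmap steps above (the file name `tokens.bin` and the separator token id are assumptions for this sketch, not the pipeline's actual on-disk format):

```python
import mmap
import struct
from pathlib import Path

SEP = 0xFFFF  # hypothetical document-separator token id

def concatenate(docs: list[list[int]], out_path: Path) -> None:
    # Pack each document's uint16 token ids back to back, separated by SEP,
    # into one flat binary file.
    with out_path.open("wb") as f:
        for doc in docs:
            f.write(struct.pack(f"<{len(doc)}H", *doc))
            f.write(struct.pack("<H", SEP))

def open_mmap(path: Path) -> mmap.mmap:
    # Memory-map the concatenated array so downstream suffix-array
    # construction and search never load the whole file into RAM.
    f = path.open("rb")
    return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

out = Path("tokens.bin")
concatenate([[1, 2, 3], [4, 5]], out)
mm = open_mmap(out)
print(len(mm) // 2)  # 7 uint16 values: 3 + 1 + 2 + 1
```

At the 1T-token scale this is exactly why the slurm scripts are split into separate concatenate, mmap, and suffix-array stages: each stage streams over the flat file rather than holding it in memory.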
To download other test datasets, simply run the pipeline below. The datasets will be automatically downloaded from the Hugging Face Hub on first use, and loaded from disk on subsequent runs. Any analysis outputs or modifications to the datasets will also be saved to disk.
- Follow the instructions in RedPajama 1T to download the RedPajama dataset and create suffix arrays. This will take a long time. For testing purposes, you can first try the much smaller OpenWebText dataset with `slurm_scripts/openwebtext_concatenate_data_array.sh` and `slurm_scripts/openwebtext_make_suffix_arrays.sh` to get an idea of the process.
- Use the RedPajama model to generate answers for a test dataset. For TriviaQA, for example, run `slurm_scripts/triviaqa_generate.sh`.
- Count the occurrences of the prompts and generations of a test dataset. For TriviaQA, run `slurm_scripts/triviaqa_count_occ.sh`.
- Analyze the results (AUROC curves and classifier performance) with `analyses.py` to evaluate hallucination detection performance. This includes:
  - plotting AUROC curves of individual features (e.g., log probabilities, n-gram statistics);
  - training classifiers (decision trees or neural networks) that combine multiple features.

The script supports several configuration options:

| Option | Description | Values / Notes |
| --- | --- | --- |
| `--metric` | Metric used to define hallucination | `rougeL`, `exact match` |
| `--use_avg_occurrences` | Use the raw frequency model (average n-gram counts) instead of n-gram log probabilities | Flag (omit for the n-gram model) |
| `--filter_stopwords` | Filter n-grams where more than 66% of tokens are stopwords (raw frequency model), or n-grams ending in stopwords (n-gram model) | Flag |
| `--plot_auroc` | Plot AUROC curves for all features | Flag |
| `--classifier` | Classifier type for hallucination detection | `"decision tree"` or `"neural network"` |
| `--model` | Generation model used | e.g., `redpajama-7b` |
| `--dataset` | Dataset name | e.g., `triviaqa`, `nq_open` |

Example:

```
python analyses.py \
    --dataset triviaqa \
    --model redpajama-7b \
    --metric rougeL \
    --plot_auroc \
    --use_avg_occurrences \
    --classifier "decision tree"
```
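As background for the AUROC plots: for a single scalar feature, AUROC equals the probability that a randomly chosen positive (e.g., hallucinated) example scores higher than a randomly chosen negative one, with ties counting one half. A minimal sketch, independent of `analyses.py`:

```python
def auroc(scores: list[float], labels: list[int]) -> float:
    # Rank statistic (equivalent to the Mann-Whitney U): fraction of
    # positive/negative pairs where the positive outranks the negative.
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auroc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))  # 1.0
```

This pairwise form is O(n^2) and meant only to make the metric concrete; a library routine with sorted ranks is the practical choice.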