
GraphEval: Evaluating the Factuality of Large Language Models using Large-Scale Knowledge Graphs


We propose GraphEval to evaluate an LLM's factuality using a substantially large test dataset. Specifically, the test dataset is retrieved from a large knowledge graph containing more than 10 million facts, without expensive human effort. Unlike conventional methods that evaluate LLMs based on generated responses, GraphEval streamlines the evaluation process by training a judge model to estimate the correctness of the LLM's answers. Our experiments demonstrate that the judge model's factuality assessment aligns closely with the correctness of the LLM's generated outputs, while also substantially reducing evaluation costs. Moreover, our findings offer valuable insights into LLM performance across different metrics and highlight the potential for future improvements in ensuring the factual integrity of LLM outputs.

🔬 Dependencies

pip install -r requirement.txt
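If you prefer to keep the dependencies isolated, a standard virtual environment works; the requirement.txt name follows this repository:

python -m venv .venv
source .venv/bin/activate
pip install -r requirement.txt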


📚 Data Preparation

Please download mappingbased-objects_lang=en.ttl.bz2 from the DBpedia dataset and decompress it (the archive is bzip2-compressed, not zip). A program argument is provided to specify the path to the decompressed file.

The DBpedia dataset can be downloaded from here.
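For a quick sanity check of the data, the decompressed dump can be streamed fact by fact. This is a minimal sketch, not the loader collect.py actually uses; it assumes DBpedia's line-based serialization, where each line holds one complete triple:

import bz2

def iter_facts(path):
    # DBpedia dumps put one triple per line, so a simple line parse
    # suffices for inspection. Works on the .bz2 archive or on the
    # decompressed .ttl file.
    opener = bz2.open if path.endswith(".bz2") else open
    with opener(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            # "<subj> <pred> <obj> ." -> split off subject and predicate,
            # then drop the trailing " ." from the object
            subj, pred, obj = line.split(" ", 2)
            yield subj, pred, obj.rstrip(" .")

# Print the first three facts as a sanity check.
for i, fact in enumerate(iter_facts("mappingbased-objects_lang=en.ttl")):
    print(fact)
    if i >= 2:
        break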

🚀 Running the code

The three steps described in the paper are implemented in the following files:

  1. collect.py
  2. train.py
  3. eval.py

The code provides arguments to specify settings, paths, and hyperparameters. To see the arguments, run the following command:

python collect.py --help

The --help flag works the same way for train.py and eval.py.
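Putting the steps together, an end-to-end run might look like the following. The flag name below is illustrative, not necessarily the repository's actual argument; check each script's --help output for the real options:

python collect.py --kg-path mappingbased-objects_lang=en.ttl   # hypothetical flag for the DBpedia file path
python train.py
python eval.py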

🤝 Cite

Please consider citing our paper if you use the code or data from this work. Thanks a lot :)

@article{liu2024evaluating,
  title={Evaluating the Factuality of Large Language Models using Large-Scale Knowledge Graphs},
  author={Xiaoze Liu and Feijie Wu and Tianyang Xu and Zhuo Chen and Yichi Zhang and Xiaoqian Wang and Jing Gao},
  journal={arXiv preprint arXiv:2404.00942},
  year={2024}
}
