After the n evaluation runs, the resulting data is placed in the "dataset_ntimes" folder, organized by model name, with the following structure:
dataset_ntimes/
├── model_1/
│   └── model_1_nt.json
├── model_2/
│   └── model_2_nt.json
└── ...
Modify dir_path=path/to/dataset_ntimes in 1_Split_filename.py, then run the file to split the data into separate task files based on the "filename" field.
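The split step can be sketched as follows. The record layout and the exact value of the "filename" field are assumptions based on the description above, not the actual implementation of 1_Split_filename.py:

```python
import json
import os
from collections import defaultdict

def split_by_filename(dir_path, out_dir):
    """Group records from each model's *_nt.json by their "filename" field
    and write one JSON file per task.

    Assumes each input file holds a list of dicts carrying a "filename" key
    whose value is used directly as the output file stem.
    """
    os.makedirs(out_dir, exist_ok=True)
    for name in os.listdir(dir_path):
        if not name.endswith(".json"):
            continue
        with open(os.path.join(dir_path, name), encoding="utf-8") as f:
            records = json.load(f)
        groups = defaultdict(list)
        for rec in records:
            groups[rec["filename"]].append(rec)
        for task, recs in groups.items():
            out_path = os.path.join(out_dir, f"{task}.json")
            with open(out_path, "w", encoding="utf-8") as f:
                json.dump(recs, f, ensure_ascii=False, indent=2)
```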
Modify file_dir=path/to/dataset_ntimes in 2_Extract.py, then run the file to extract the answers from the model responses and generate JSONL files, which are placed in the result_ntimes folder with the following structure:
result_ntimes/
├── model_1/
│   ├── 0shot/
│   │   ├── 1t/
│   │   │   └── task_dir_1/
│   │   │       ├── task_1.json
│   │   │       ├── task_1.jsonl
│   │   │       ├── task_2.json
│   │   │       ├── task_2.jsonl
│   │   │       └── ...
│   │   ├── 2t/
│   │   │   └── task_dir_1/
│   │   │       ├── task_1.json
│   │   │       ├── task_1.jsonl
│   │   │       └── ...
│   │   └── ...
│   └── 3shot/
│       └── ...
└── ...
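A minimal sketch of what the extraction step does. The marker patterns and field names here are illustrative assumptions; the actual rules in 2_Extract.py will differ per task type:

```python
import json
import re

def extract_answer(response):
    """Pull the final answer out of a free-form model response.

    Looks for an "Answer:" / "the answer is" marker (an assumed pattern);
    falls back to the whole response when no marker is found.
    """
    m = re.search(r"(?:answer is|Answer:)\s*(.+)", response, re.IGNORECASE)
    return m.group(1).strip() if m else response.strip()

def to_jsonl(records, out_path):
    """Attach the extracted answer to each record and write one JSON object
    per line, producing the task_*.jsonl files described above."""
    with open(out_path, "w", encoding="utf-8") as f:
        for rec in records:
            rec["extracted"] = extract_answer(rec["response"])
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")
```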
Modify root_path=path/to/result_ntimes in 3_Evaluate.py, then run the file to obtain the evaluation results.
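Since each task is evaluated n times (the 1t/, 2t/, ... folders), the reported score is presumably an aggregate over the repeated runs. A minimal sketch, assuming one metric dict per run; this is not the actual logic of 3_Evaluate.py:

```python
from statistics import mean

def aggregate_runs(run_metrics):
    """Average per-task metrics across the n repeated runs.

    run_metrics: a list with one {task_name: score} dict per run
    (one entry per 1t/, 2t/, ... folder). Returns the per-task mean.
    """
    tasks = run_metrics[0].keys()
    return {t: mean(run[t] for run in run_metrics) for t in tasks}
```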
Modify folder_path and output_file in 1_prompt_chem.py, then run the file to submit the data for LLM evaluation.
Modify file_path in 2_L1_task_eval.py, then run the file to obtain metrics for the multiple-choice, true/false, fill-in-the-blank, short-answer, and calculation tasks. The results are written to an Excel file.
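For the objective tasks, the core metric is presumably exact-match accuracy between the extracted and gold answers. A minimal sketch, with the field names ("extracted", "answer") assumed; 2_L1_task_eval.py additionally exports the results to Excel (e.g. via pandas DataFrame.to_excel), which is omitted here:

```python
def objective_accuracy(records):
    """Case-insensitive exact-match accuracy for objective tasks
    (multiple-choice, true/false, fill-in-the-blank).

    Assumes each record carries an "extracted" model answer and a gold
    "answer" field; returns 0.0 for an empty record list.
    """
    if not records:
        return 0.0
    correct = sum(
        1 for r in records
        if str(r["extracted"]).strip().lower() == str(r["answer"]).strip().lower()
    )
    return correct / len(records)
```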
Modify folder_path and excel_path in 3_other_task_eval.py, then run the file to obtain metrics for the abstract writing, outlining, reaction-intermediate, single-step synthesis, multi-step synthesis, and physicochemical-property tasks. The results are written to an Excel file.
Run 1_prompt_chem.py to obtain the evaluation data from the LLM.
Run 2_LLM_evaluate.py to obtain the evaluation results from the LLM.
Run 3_code_evaluate.py to obtain the evaluation results computed by code.
https://huggingface.co/datasets/Ooo1/ChemEval
The ChemEval dataset is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Please cite our paper if you use our dataset.
@article{huang2024chemeval,
  title={ChemEval: A Comprehensive Multi-Level Chemical Evaluation for Large Language Models},
  author={Huang, Yuqing and Zhang, Rongyang and He, Xuesong and Zhi, Xuyang and Wang, Hao and Li, Xin and Xu, Feiyang and Liu, Deguang and Liang, Huadong and Li, Yi and others},
  journal={arXiv preprint arXiv:2409.13989},
  year={2024}
}