
Inference Script for C-Eval

This project evaluates the performance of the related models on the C-Eval benchmark. The test set consists of 12.3K multiple-choice questions covering 52 subjects.

The following describes how to generate predictions for the C-Eval dataset.

Data Preparation

Download the dataset from the official C-Eval repository and unzip it into the data folder:

wget https://huggingface.co/datasets/ceval/ceval-exam/resolve/main/ceval-exam.zip
unzip ceval-exam.zip -d data

Move the data folder to the scripts/ceval directory of this project.
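
As a quick sanity check, the Python sketch below lists how many subject files were extracted. It is an illustrative snippet that assumes the archive unpacks into dev/, val/ and test/ subfolders with one CSV per subject; adjust data_dir to where you unzipped the files.

# Quick sanity check of the unzipped data (illustrative; the dev/, val/, test/
# layout with per-subject CSV files is assumed, not guaranteed by this wiki).
import os

data_dir = "scripts/ceval/data"
for split in ("dev", "val", "test"):
    split_dir = os.path.join(data_dir, split)
    n_csv = len([f for f in os.listdir(split_dir) if f.endswith(".csv")])
    print(f"{split}: {n_csv} subject files")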

Running the Evaluation Script

Run the following script:

model_path=path/to/chinese_llama2_or_alpaca2
output_path=path/to/your_output_dir

cd scripts/ceval
python eval.py \
    --model_path ${model_path} \
    --cot False \
    --few_shot False \
    --with_prompt True \
    --constrained_decoding True \
    --temperature 0.2 \
    --n_times 1 \
    --ntrain 5 \
    --do_save_csv False \
    --do_test False \
    --output_dir ${output_path}

Arguments

  • model_path: Path to the model to be evaluated (the full Chinese-LLaMA-2 model or Chinese-Alpaca-2 model, not LoRA)

  • cot: Whether to use chain-of-thought

  • few_shot: Whether to use few-shot

  • ntrain: The number of few-shot demonstrations when few_shot=True (e.g., ntrain=5 for 5-shot); has no effect when few_shot=False

  • with_prompt: Whether the input to the model includes the instruction template of the Alpaca-2 models

  • constrained_decoding: Since the standard answer format for C-Eval is a single option 'A'/'B'/'C'/'D', we provide two methods for extracting answers from the model's outputs (see the sketch after this list):

    • constrained_decoding=True: Compute the probability that the first token generated by the model is 'A', 'B', 'C' or 'D', and choose the option with the highest probability as the answer

    • constrained_decoding=False: Extract the answer token from the model's outputs with regular expressions

  • temperature: Temperature for decoding

  • n_times: The number of repeated evaluations. A result folder is generated under output_dir for each run

  • do_save_csv: Whether to save the model outputs, extracted answers, etc. in csv files

  • output_dir: Output path of results

  • do_test: Whether to evaluate on the valid set or the test set: the valid set is used when do_test=False and the test set when do_test=True
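
To make the constrained_decoding=True option concrete, the snippet below sketches first-token scoring over the four option letters with Hugging Face transformers. It is a simplified illustration rather than the actual implementation in eval.py; model_path and the prompt text are placeholders.

# Illustrative sketch of first-token constrained decoding over the options A/B/C/D.
# Simplified example; eval.py's actual prompt construction and tokenization differ.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "path/to/chinese_llama2_or_alpaca2"   # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.float16, device_map="auto"
)

prompt = "...question and options here...答案:"   # placeholder C-Eval prompt
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    next_token_logits = model(**inputs).logits[0, -1]   # logits of the first generated token

choices = ["A", "B", "C", "D"]
# Map each option letter to a token id (heuristic: last id of its encoding).
choice_ids = [tokenizer.encode(c, add_special_tokens=False)[-1] for c in choices]
probs = torch.softmax(next_token_logits[choice_ids], dim=-1)
answer = choices[int(torch.argmax(probs))]
print(answer, probs.tolist())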

Evaluation Output

  • The evaluation script creates directories outputs/take* when the evaluation finishes, where * is a number ranging from 0 to n_times-1, storing the results of the n_times repeated evaluations respectively.

  • In each outputs/take* folder, there will be a submission.json and a summary.json. If do_save_csv=True, there will also be 52 csv files containing the model outputs, extracted answers, etc. for each subject.

  • submission.json stores generated answers in the official submission form, and can be submitted for evaluation:
{
    "computer_network": {
        "0": "A",
        "1": "B",
        ...
    },
    "marxism": {
        "0": "B",
        "1": "A",
        ...
    },
    ...
}
  • summary.json stores the model's evaluation results for the 52 subjects, the 4 broader categories, and the overall average. For instance, the 'All' key at the end of the JSON file shows the overall average score:

      "All": {
        "score": 0.35958395,
        "num": 1346,
      "correct": 484.0
    }

where score is the overall accuracy, num is the total number of evaluation examples, and correct is the number of correct predictions.
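
In the example above, score equals correct / num (484 / 1346 ≈ 0.3596). A minimal snippet for reading the overall result out of summary.json follows; the path is illustrative and should point to one of the outputs/take* folders described above.

# Read the overall result from a summary.json produced by the evaluation script.
import json

with open("outputs/take0/summary.json") as f:   # illustrative path
    summary = json.load(f)

overall = summary["All"]
print(f"accuracy={overall['score']:.4f} ({overall['correct']:.0f}/{overall['num']})")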

⚠️ Note that when evaluating on the test set (do_test=True), score and correct are 0 since no labels are available. Obtaining test set results requires submitting the submission.json file to the official C-Eval; for detailed instructions, please refer to the official submission process provided by C-Eval.
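
Before submitting, a quick structural check of submission.json (subject name → question index → option letter, as shown above) can catch formatting mistakes. This is only a sketch with an illustrative path, not part of the official submission process.

# Minimal structural check of submission.json before uploading (illustrative path).
import json

with open("outputs/take0/submission.json") as f:
    submission = json.load(f)

assert len(submission) == 52, "expected one entry per subject"
for subject, answers in submission.items():
    for idx, ans in answers.items():
        assert ans in {"A", "B", "C", "D"}, f"unexpected answer {ans!r} in {subject}[{idx}]"
print("submission.json looks well-formed")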
