
Inference Script for C-Eval

This project evaluates the performance of the related models on the C-Eval benchmark. The test set consists of 12.3K multiple-choice questions covering 52 subjects.

The following describes how to generate predictions for the C-Eval dataset.

Data Preparation

Download the dataset from the official C-Eval repository and unzip it into the data folder:

wget https://huggingface.co/datasets/ceval/ceval-exam/resolve/main/ceval-exam.zip
unzip ceval-exam.zip -d data

Move the data folder to the scripts/ceval directory of this project.
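
As a quick sanity check, the Python sketch below lists how many subject files were extracted. It is an illustrative snippet that assumes the archive unpacks into dev/, val/ and test/ subfolders with one CSV per subject; adjust data_dir to where you unzipped the files.

# Quick sanity check of the unzipped data (illustrative; the dev/, val/, test/
# layout with per-subject CSV files is assumed, not guaranteed by this wiki).
import os

data_dir = "scripts/ceval/data"
for split in ("dev", "val", "test"):
    split_dir = os.path.join(data_dir, split)
    n_csv = len([f for f in os.listdir(split_dir) if f.endswith(".csv")])
    print(f"{split}: {n_csv} subject files")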

Running the Evaluation Script

Run the following script:

model_path=path/to/chinese_llama2_or_alpaca2
output_path=path/to/your_output_dir

cd scripts/ceval
python eval.py \
    --model_path ${model_path} \
    --cot False \
    --few_shot False \
    --with_prompt True \
    --constrained_decoding True \
    --temperature 0.2 \
    --n_times 1 \
    --ntrain 5 \
    --do_save_csv False \
    --do_test False \
    --output_dir ${output_path}

Arguments

  • model_path: Path to the model to be evaluated (the full Chinese-LLaMA-2 model or Chinese-Alpaca-2 model, not LoRA)

  • cot: Whether to use chain-of-thought

  • few_shot: Whether to use few-shot

  • ntrain: The number of few-shot demonstrations when few_shot=True (e.g., ntrain=5 for 5-shot); has no effect when few_shot=False

  • with_prompt: Whether the input to the model includes the instruction template of the Alpaca-2 models

  • constrained_decoding: Since the standard answer format for C-Eval is a single option 'A'/'B'/'C'/'D', we provide two methods for extracting answers from the model's outputs (see the sketch after this list):

    • constrained_decoding=True: Compute the probability that the first token generated by the model is 'A', 'B', 'C' or 'D', and choose the option with the highest probability as the answer

    • constrained_decoding=False: Extract the answer token from the model's outputs with regular expressions

  • temperature: Temperature for decoding

  • n_times: The number of repeated evaluations. A result folder is generated under output_dir for each run

  • do_save_csv: Whether to save the model outputs, extracted answers, etc. in csv files

  • output_dir: Output path of results

  • do_test: Whether to evaluate on the valid set or the test set: the valid set is used when do_test=False and the test set when do_test=True
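
To make the constrained_decoding=True option concrete, the snippet below sketches first-token scoring over the four option letters with Hugging Face transformers. It is a simplified illustration rather than the actual implementation in eval.py; model_path and the prompt text are placeholders.

# Illustrative sketch of first-token constrained decoding over the options A/B/C/D.
# Simplified example; eval.py's actual prompt construction and tokenization differ.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "path/to/chinese_llama2_or_alpaca2"   # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.float16, device_map="auto"
)

prompt = "...question and options here...答案:"   # placeholder C-Eval prompt
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    next_token_logits = model(**inputs).logits[0, -1]   # logits of the first generated token

choices = ["A", "B", "C", "D"]
# Map each option letter to a token id (heuristic: last id of its encoding).
choice_ids = [tokenizer.encode(c, add_special_tokens=False)[-1] for c in choices]
probs = torch.softmax(next_token_logits[choice_ids], dim=-1)
answer = choices[int(torch.argmax(probs))]
print(answer, probs.tolist())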

Evaluation Output

  • The evaluation script creates directories outputs/take* when the evaluation finishes, where * is a number ranging from 0 to n_times-1, storing the results of the n_times repeated evaluations respectively.

  • In each outputs/take* folder, there will be a submission.json and a summary.json. If do_save_csv=True, there will also be 52 csv files containing the model outputs, extracted answers, etc. for each subject.

  • submission.json stores generated answers in the official submission form, and can be submitted for evaluation:
{
    "computer_network": {
        "0": "A",
        "1": "B",
        ...
    },
    "marxism": {
        "0": "B",
        "1": "A",
        ...
    },
    ...
}
  • summary.json stores the model's evaluation results for the 52 subjects, the 4 broader categories, and the overall average. For instance, the 'All' key at the end of the JSON file shows the overall average score:

      "All": {
        "score": 0.35958395,
        "num": 1346,
      "correct": 484.0
    }

where score is the overall accuracy, num is the total number of evaluation examples, and correct is the number of correct predictions.
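
In the example above, score equals correct / num (484 / 1346 ≈ 0.3596). A minimal snippet for reading the overall result out of summary.json follows; the path is illustrative and should point to one of the outputs/take* folders described above.

# Read the overall result from a summary.json produced by the evaluation script.
import json

with open("outputs/take0/summary.json") as f:   # illustrative path
    summary = json.load(f)

overall = summary["All"]
print(f"accuracy={overall['score']:.4f} ({overall['correct']:.0f}/{overall['num']})")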

⚠️ Note that when evaluating on the test set (do_test=True), score and correct are 0 since no labels are available. Obtaining test set results requires submitting the submission.json file to the official C-Eval; for detailed instructions, please refer to the official submission process provided by C-Eval.
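
Before submitting, a quick structural check of submission.json (subject name → question index → option letter, as shown above) can catch formatting mistakes. This is only a sketch with an illustrative path, not part of the official submission process.

# Minimal structural check of submission.json before uploading (illustrative path).
import json

with open("outputs/take0/submission.json") as f:
    submission = json.load(f)

assert len(submission) == 52, "expected one entry per subject"
for subject, answers in submission.items():
    for idx, ans in answers.items():
        assert ans in {"A", "B", "C", "D"}, f"unexpected answer {ans!r} in {subject}[{idx}]"
print("submission.json looks well-formed")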
