longbench_en

iMountTai edited this page Jan 30, 2024 · 4 revisions

Inference Script for LongBench

LongBench is a bilingual, multitask benchmark for comprehensive assessment of the long-context understanding capabilities of large language models. This project tested the performance of the related models on the LongBench dataset.

Preparation

Environment setup

Set up the environment according to requirements.txt; the dependencies are reproduced below:

datasets
tqdm
rouge
jieba
fuzzywuzzy
einops
torch>=2.0.1
transformers==4.37.2
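Since transformers is pinned exactly and torch has a version floor, it can be worth verifying the installed versions before running inference. The sketch below (the helper name and version table are illustrative, not part of the project) checks them with the standard library:

```python
# Sanity-check that the pinned dependencies are installed at suitable versions.
from importlib.metadata import version, PackageNotFoundError

REQUIRED = {
    "transformers": "4.37.2",  # pinned exactly in requirements.txt
    "torch": "2.0.1",          # minimum version
}

def check_environment(required=REQUIRED):
    """Return a dict mapping package name -> installed version, or None if missing."""
    found = {}
    for pkg in required:
        try:
            found[pkg] = version(pkg)
        except PackageNotFoundError:
            found[pkg] = None
    return found
```

Comparing the returned versions against the table (and reinstalling from requirements.txt on mismatch) avoids hard-to-diagnose failures later, since transformers==4.37.2 is the version the script was tested with.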

Dataset Preparation

The inference script will automatically download the dataset from 🤗 Datasets.

Running the Evaluation Script

Run the following script:

model_path=path/to/chinese_mixtral
output_dir=path/to/output_dir
data_class=zh
with_inst="true" # or "false" or "auto"
max_length=32256

cd scripts/longbench
python pred_mixtral.py \
    --model_path ${model_path} \
    --predict_on ${data_class} \
    --output_dir ${output_dir} \
    --with_inst ${with_inst} \
    --max_length ${max_length} \
    --load_in_4bit \
    --use_flash_attention_2

Arguments

  • --model_path ${model_path}: Path to the model to be evaluated (the full Chinese-Mixtral model or Chinese-Mixtral-Instruct model, not LoRA).

  • --predict_on ${data_class}: The tasks to predict on. Possible values are en, zh, code, or a combination of them, such as en,zh,code.

  • --output_dir ${output_dir}: Output directory for the predictions and logs.

  • --max_length ${max_length}: Maximum length of the instructions. Note that the lengths of the system prompt and task-related prompt are not included.

  • --with_inst ${with_inst}: Whether to use the system prompt and template of Chinese-Mixtral-Instruct when constructing the instructions:

    • true: Use the system prompt and template on all tasks
    • false: Use the system prompt and template on none of the tasks
    • auto: Use the system prompt and template on some tasks (the default strategy of the official LongBench code)

    We suggest setting --with_inst to false.

  • --gpus ${gpus}: Specify the GPUs to use, such as 0,1.

  • --e: Predict on the LongBench-E dataset. See the official documentation for details of LongBench-E.

  • --load_in_4bit: Load the model in 4-bit quantized form.

  • --use_flash_attention_2: Use FlashAttention-2 to accelerate inference; otherwise, SDPA is used.
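Because --max_length bounds the instruction length, over-long inputs must be truncated. The official LongBench code keeps the head and tail of the tokenized prompt and drops the middle, so that both the task description (at the start) and the question (at the end) survive. A minimal sketch of that strategy on token lists (the function name and the half-and-half split are illustrative):

```python
def truncate_middle(tokens, max_length):
    """Keep the first and last portions of a token sequence, dropping the middle.

    LongBench-style truncation: the head usually holds the task description
    and the tail holds the question, so both are preserved.
    """
    if len(tokens) <= max_length:
        return tokens
    half = max_length // 2
    # First `half` tokens plus the trailing remainder, totaling `max_length`.
    return tokens[:half] + tokens[len(tokens) - (max_length - half):]
```

This is why raising max_length toward the model's context window (32256 in the example script) generally helps: less of the middle of the document is discarded.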

When the script has finished running, the prediction files are stored under ${output_dir}/pred/ or ${output_dir}/pred_e/ (depending on whether you are testing on LongBench-E). Run the following command to compute metrics:

python eval.py --output_dir ${output_dir}

If testing on LongBench-E, provide -e when computing metrics:

python eval.py --output_dir ${output_dir} -e

The results are stored in ${output_dir}/result.json or ${output_dir}/pred_e/result.json. For example, the results of Chinese-Mixtral-Instruct on LongBench Chinese tasks (--predict_on zh) are:

{
    "lsht": 42.0,
    "multifieldqa_zh": 50.28,
    "passage_retrieval_zh": 89.5,
    "vcsum": 16.41,
    "dureader": 34.15
}
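The per-task scores in result.json can be aggregated into a single macro-average for quick comparison across runs. A small post-processing sketch (the averaging helper is not part of the project's scripts):

```python
import json

def macro_average(result_path):
    """Load a result.json produced by eval.py and average the task scores."""
    with open(result_path) as f:
        scores = json.load(f)
    return sum(scores.values()) / len(scores)
```

For the Chinese-task example above, this gives (42.0 + 50.28 + 89.5 + 16.41 + 34.15) / 5 ≈ 46.47.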