Long Is More for Alignment: A Simple but Tough-to-Beat Baseline for Instruction Fine-Tuning (ICML 2024)

Hao Zhao, Maksym Andriushchenko, Francesco Croce, Nicolas Flammarion (EPFL)

Paper: https://arxiv.org/abs/2402.04833 (accepted at ICML 2024)

TL;DR: We uncover the surprising effectiveness of fine-tuning only on the longest 1,000 instruction of large datasets to obtain aligned models.

With a lightweight refinement step, the quality of training instructions is notably improved, thus further enhancing the instruction-following capability of aligned models.

Through ablation studies and comprehensive evaluations, we demonstrate that the impressive performance of our method is not achieved by exploring length bias.

🎉 News

[May 02, 2024] Accepted at ICML 2024! See you in Vienna.
[Mar 08, 2024] The code and datasets for this project are released.
[Mar 04, 2024] Our paper was accepted to the ICLR 2024 Workshop on Data-centric Machine Learning Research.
[Feb 16, 2024] Thanks to @_lewtun, the idea of selecting 1,000 instructions with the longest response is demonstrated effective in the OpenHermes-2.5 dataset. Specifically, fine-tuning Mistral-7B on OpenHermes-2.5-1k-longest produces a chat model comparable in performance to training over the full ~1 million examples.

🚀 ToDo

Release the code
Release the data
Release the instruction fine-tuned models

Install

Training

The training code is mostly dependent on the FastChat platform. So we install the required packages via running

cd training
pip3 install --upgrade pip  # enable PEP 660 support
pip3 install -e ".[model_worker,webui]"
pip3 install -e ".[train]"

It may take a while to compile flash-attn. Alternatively, one can use xformers, which is also supported by FastChat. Xformers is seemed better because it supports more GPU architectures than flash-attention, including V100，while having similar memory footprint and flops compared to flash-attention.

💡 If you come across an issue of incompatible packages like us when utilizing fine-tuning on full model weights, we provide a particular version of transformers (v4.34.1) in training/transformers-4.34.1 that also supports NEFTune. To use it, please run

cd training/transformers-4.34.1
pip install -e .

Evaluation

One can find standardized code for evaluation we did in our paper from:

MT-Bench is a part of the FastChat platform. To support evaluation on MT-Bench, one can run:

cd training
pip3 install -e ".[llm_judge]"

Datasets

One can find instruction fine-tuning (IFT) datasets used to align base models under the data folder. In particular,

data/alpaca/filtered_alpaca_1k_longest.json is the filtered 1,000 training examples from the Alpaca-52k dataset. The format of data has been adjusted in accordance to the requirement of FastChat. We verified that examples selected by the length heuristics are largely different from that of selected by using ChatGPT as the quality evaluator.
data/alpaca/refined_alpaca_1k_longest.json is the refined version of filtered_alpaca_1k_longest.json after adopting our introspection-based refinement step. The format of data has been adjusted in accordance to the requirement of FastChat. A diverse set of open-ended evaluation results (e.g., pair-wise comparison, AlpacaEval 2.0, MT-Bench) show that instruction fine-tuning on refined_alpaca_1k_longest.json leads to a more powerful aligned model.
data/alpaca/filtered_alpaca_1k_score.json is the 1,000 training examples from the Alpaca-52k dataset filtered by the ChatGPT quality evaluator as used in AlpaGasus. The format of data has been adjusted in accordance to the requirement of FastChat. We used uniform sampling to help select examples with a score of 4.5.
data/evol-instruct/filtered_evol_instruct_1k_longest.json (data/evol-instruct/filtered_evol_instruct_1k_score.json) is the 1,000 training examples from the Evol-Instruct-70k dataset filtered by the length heuristics (ChatGPT quality evaluator). The format of data has been adjusted in accordance to the requirement of FastChat.

Experiments

We provide some scripts for training the open-sourced LLMs used in the paper. Working with FastChat enables us to conduct experiments with only 4x 80GB A100s.

This section contains commands for running training and inference on 7B and 13B models.

To do instruction fine-tuning on Llama2-7B, run the following command inside training:

cd training
wandb login
bash scripts/train_llama2_7b.sh

To do instruction fine-tuning on Llama2-13B, run the following command inside training:

cd training
wandb login
bash scripts/train_llama2_13b.sh

To do instruction fine-tuning on Mistral-7B, run the following command inside training:

cd training
wandb login
bash scripts/train_mistral_7b.sh

To generate new responses for questions from test datasets: Koala, LIMA, Self-Instruct, Vicuna, WizardLM, run the following command:

cd generation
python generate_test.py --model_name_or_path [MODEL-PATH] --test_path [DATA-PATH] --save_path [SAVE-PATH]

Before doing an evaluation with LLMs-as-a-judge, one should prepare data following a particular format: a list of dictionaries, where each dictionary contains an instruction and two responses generated by different models. This is a simple example:

[
      {
            "id": "XXX",
            "prompt": "XXX",
            "reference_model_name": "XXX",
            "target_model_name": "XXX",
      }
]

To carry out pair-wise comparisons with GPT4-as-a-judge or PaLM2-as-a-judge, run the following commands:

cd evaluation
# GPT4-as-a-judge
bash run_llm_eval_gpt4.sh

# PaLM2-as-a-judge
bash run_llm_eval_palm2.sh

To carry out MT-Bench evaluation, run the following commands(refer to the documentation in training/fastchat/llm_judge for more details):

# step 1: Generate model answers to MT-bench questions
python gen_model_answer.py --model-path [MODEL-PATH] --model-id [MODEL-ID]

# step 2: Generate GPT-4 judgments
export OPENAI_API_KEY=XXXXXX  # set the OpenAI API key
python gen_judgment.py --model-list [LIST-OF-MODEL-ID] --parallel [num-concurrent-api-call]

# step 3: Show MT-bench scores
python show_result.py --model-list [LIST-OF-MODEL-ID]

For more details about experimental setups, please refer to the Appendix A in our paper. Also, if you do not have enough computing resources, you could consider using deepspeed.

Citation

If you find this useful in your research, please consider citing:

@misc{zhao2024long,
      title={Long Is More for Alignment: A Simple but Tough-to-Beat Baseline for Instruction Fine-Tuning}, 
      author={Hao Zhao and Maksym Andriushchenko and Francesco Croce and Nicolas Flammarion},
      year={2024},
      eprint={2402.04833},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
data		data
evaluation		evaluation
figures		figures
generation		generation
refinement		refinement
training		training
utils		utils
README.md		README.md
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

data

data

evaluation

evaluation

figures

figures

generation

generation

refinement

refinement

training

training

utils

utils

README.md

README.md

requirements.txt

requirements.txt

Repository files navigation

Long Is More for Alignment: A Simple but Tough-to-Beat Baseline for Instruction Fine-Tuning (ICML 2024)

🎉 News

🚀 ToDo

Install

Training

Evaluation

Datasets

Experiments

Citation

About

Releases

Packages

Contributors 2

Languages

tml-epfl/long-is-more-for-alignment

Folders and files

Latest commit

History

Repository files navigation

Long Is More for Alignment: A Simple but Tough-to-Beat Baseline for Instruction Fine-Tuning (ICML 2024)

🎉 News

🚀 ToDo

Install

Training

Evaluation

Datasets

Experiments

Citation

About

Resources

Stars

Watchers

Forks

Languages