This code repository contains the code and models released for our paper SoLoPO: Unlocking Long-Context Capabilities in LLMs via Short-to-Long Preference Optimization. We propose a novel framework for preference optimization (PO) in long-context scenarios, which decouples long-context PO into short-context PO and short-to-long reward alignment (SoLo-RA). On various long-context benchmarks, SoLoPO outperforms the vanilla PO algorithms and significantly improves the efficiency of data construction and the training process.
We use Qwen2.5-7B-Instruct as an example.
Our code is primarily built on RULER, LLaMA-Factory, and vLLM for data construction, model training, and capability evaluation.
conda create --prefix <path> python=3.11
conda activate <path>
bash ./scripts/0_environment.sh
NOTE🌲: You can find the preprocessed training data in
./data_construction/data/llamafactory_format, which has already been registered in LLaMA-Factory (see ./training/LLaMA-Factory-0.9.1/data/dataset_info.json). Please copy all the data to ./training/LLaMA-Factory-0.9.1/data and then proceed to 2️⃣ Model Training.
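For reference, a minimal sketch of the copy step described in the note above (the *.json wildcard is an assumption about the folder contents; adjust it if the preprocessed files are organized differently):
# copy the preprocessed training data into LLaMA-Factory's data folder
cp ./data_construction/data/llamafactory_format/*.json ./training/LLaMA-Factory-0.9.1/data/
# sanity check: the copied datasets should already appear in dataset_info.json
grep -c "musique" ./training/LLaMA-Factory-0.9.1/data/dataset_info.json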
Execute our data construction pipeline, including: (1) synthesizing short-context data, (2) sampling model responses based on short contexts, (3) filtering preference pairs, (4) synthesizing long-context data, (5) constructing the Short-to-Long Dataset, and (6) converting it to the LLaMA-Factory format.
Our training data is based on the MuSiQue dataset. Please download it and place the musique_ans_v1.0_train.jsonl file in ./data_construction/data/my_qa/raw.
Before executing the following code, please configure BASE_PATH, RAW_MUSIQUE_TRAIN_FILE_PATH and MODEL_DIR_PATH to your own paths. For more details, please refer to the .sh file.
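For example, the path variables mentioned above might be set near the top of the .sh scripts like this (placeholder values only; the exact layout inside the scripts may differ):
# placeholder values for the path variables referenced above
BASE_PATH=/abs/path/to/SoLoPO
RAW_MUSIQUE_TRAIN_FILE_PATH=${BASE_PATH}/data_construction/data/my_qa/raw/musique_ans_v1.0_train.jsonl
MODEL_DIR_PATH=/abs/path/to/Qwen2.5-7B-Instruct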
# run our data construction pipeline
bash ./scripts/1_1_data_construction_from_short_context.sh
# copy the created long-context data to `./data_construction/data_example/my_qa/raw` and rename it (e.g., 8k_musique_reallong_context_example.jsonl).
bash ./scripts/1_2_data_construction_from_long_context.sh
# Rename the data and register it in the data folder of LLaMA-Factory
bash ./scripts/2_preparation_before_training.sh
The final data used for training will be located in ./data_construction/data/llamafactory_format and ./training/LLaMA-Factory-main/data, for example:
# short_context + response_from_short_context
8k_sft_short_musique_qwen.json
8k_po_short_musique_qwen.json
# long_context + response_from_short_context
8k_sft_long_musique_qwen.json
8k_po_long_musique_qwen.json
# long_context + response_from_long_context
8k_sft_long_musique_qwen_reallong.json
8k_po_long_musique_qwen_reallong.json
# short-to-long dataset
8k_s2l_short2long_short_musique_qwen.json
Please first modify the placeholders in the configuration files (in ./training/config/Qwen2.5-7B-Instruct) to your local paths, such as [OUTPUT_DIR], [DATASET_NAME], and [MODEL_PATH].
SoLoPO-Related Parameters (a schematic of how they combine is sketched after this list):
- add_short2long_loss: Whether to use SoLoPO. Default is true.
- kl_part: Type of SoLo-RA loss. Default is s2l_chosen. Optionally s2l_both for ablation studies (corresponds to Experiment 1 in Section 4.2).
- kl_penalty: Type of KL divergence loss. Default is null. Options include kl, abs, mse, low_var_kl, etc.
- kl_lambda: Coefficient for the KL divergence loss. Default is 1. Both kl_penalty and kl_lambda are used in Experiment 3 of Section 4.2.
- beta_sla: SoLo-RA coefficient, corresponding to $\alpha$ in Equation (9).
- sft_part: Whether to apply SFT loss. Default is null (only used with ORPO).
- efficiency: Whether to skip the forward computation for (long-context, rejected_response)-related information. Default is true. Enabling this improves speed but disables observation of reward or log-prob changes related to rejected_response.
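As a rough guide to how these options interact, the overall training objective can be sketched as (informal shorthand only; see Equation (9) in the paper for the exact form):

$$\mathcal{L}_{\text{SoLoPO}} \approx \mathcal{L}_{\text{PO}}(\text{short context}) + \alpha \cdot \mathcal{L}_{\text{SoLo-RA}} + \lambda_{\text{KL}} \cdot \mathcal{L}_{\text{KL}},$$

where $\alpha$ is beta_sla, $\lambda_{\text{KL}}$ is kl_lambda with the penalty type selected by kl_penalty (no KL term when kl_penalty is null), and $\mathcal{L}_{\text{SoLo-RA}}$ aligns the reward of the chosen response (kl_part = s2l_chosen) or of both responses (s2l_both) between the short and long contexts.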
Start training
bash ./scripts/3_3_model_training_s2l.sh
The trained model will be saved under ./output.
Please first download the required evaluation data to the specified paths:
- Download LongBenchV1, place it under ./evaluation/eval_by_LongBenchV1, and unzip data.zip.
- Download LongBenchV2 and place it under ./evaluation/eval_by_LongBenchV2/LongBench.
- Download NIAH-PLUS and place it under ./evaluation/eval_by_NIAH.
- The data required by RULER has already been placed in ./evaluation/eval_by_ruler/RULER/scripts/data/synthetic/json.
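For instance, the LongBenchV1 step above might look like this (assuming data.zip was downloaded directly into that folder):
# unpack the LongBenchV1 data in place
cd ./evaluation/eval_by_LongBenchV1 && unzip data.zip && cd -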
Start evaluation
- LongBenchV1 (QAs)
  bash ./scripts/4_1_eval_longbenchv1.sh
- RULER (QAs)
  - Configure your models in ./evaluation/eval_by_ruler/RULER/scripts/config_models.sh:
    [MODEL_NAME])
        MODEL_PATH=[MODEL_PATH]
        MODEL_TEMPLATE_TYPE="qwen2.5_wo_sys"  # or your own prompt template
        MODEL_FRAMEWORK="vllm"
        ;;
  - Evaluation:
    bash ./scripts/4_2_eval_ruler.sh
  - For models with a pre-trained context size shorter than the evaluation length, use ./output_yarn/cp.sh to copy the model into ./output_yarn and then enable YaRN.
- LongBenchV2
  bash ./scripts/4_3_eval_longbenchv2.sh
- NIAH-PLUS
  bash ./scripts/4_4_eval_niah.sh
The evaluation results can be found in the results folder within the corresponding directory.
Please cite our paper if you find the repo helpful in your work:
@misc{sun2025solopounlockinglongcontextcapabilities,
title={SoLoPO: Unlocking Long-Context Capabilities in LLMs via Short-to-Long Preference Optimization},
author={Huashan Sun and Shengyi Liao and Yansen Han and Yu Bai and Yang Gao and Cheng Fu and Weizhou Shen and Fanqi Wan and Ming Yan and Ji Zhang and Fei Huang},
year={2025},
eprint={2505.11166},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2505.11166},
}

