This code repository contains the code and models released for our paper SoLoPO: Unlocking Long-Context Capabilities in LLMs via Short-to-Long Preference Optimization. We propose a novel framework for preference optimization (PO) in long-context scenarios, which decouples long-context PO into short-context PO and short-to-long reward alignment (SoLo-RA). On various long-context benchmarks, SoLoPO outperforms the vanilla PO algorithms and significantly improves the efficiency of data construction and the training process.
We use Qwen2.5-7B-Instruct as an example.
Our code is primarily built on RULER, LLaMA-Factory, and vLLM for data construction, model training, and capability evaluation.
conda create --prefix <path> python=3.11
conda activate <path>
bash ./scripts/0_environment.sh
NOTE🌲: You can find the preprocessed training data in
./data_construction/data/llamafactory_format, which has already been registered in LLaMA-Factory (see ./training/LLaMA-Factory-0.9.1/data/dataset_info.json). Please copy all the data to ./training/LLaMA-Factory-0.9.1/data and then proceed to 2️⃣ Model Training.
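For reference, a minimal sketch of the copy step described in the note above (the *.json wildcard is an assumption about the folder contents; adjust it if the preprocessed files are organized differently):
# copy the preprocessed training data into LLaMA-Factory's data folder
cp ./data_construction/data/llamafactory_format/*.json ./training/LLaMA-Factory-0.9.1/data/
# sanity check: the copied datasets should already appear in dataset_info.json
grep -c "musique" ./training/LLaMA-Factory-0.9.1/data/dataset_info.json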
Execute our data construction pipeline, including: (1) synthesizing short-context data, (2) sampling model responses based on short contexts, (3) filtering preference pairs, (4) synthesizing long-context data, (5) constructing the Short-to-Long Dataset, and (6) converting it to the LLaMA-Factory format.
Our training data is based on the MuSiQue dataset. Please download it and place the musique_ans_v1.0_train.jsonl file in ./data_construction/data/my_qa/raw.
Before executing the following code, please configure BASE_PATH, RAW_MUSIQUE_TRAIN_FILE_PATH and MODEL_DIR_PATH to your own paths. For more details, please refer to the .sh file.
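For example, the path variables mentioned above might be set near the top of the .sh scripts like this (placeholder values only; the exact layout inside the scripts may differ):
# placeholder values for the path variables referenced above
BASE_PATH=/abs/path/to/SoLoPO
RAW_MUSIQUE_TRAIN_FILE_PATH=${BASE_PATH}/data_construction/data/my_qa/raw/musique_ans_v1.0_train.jsonl
MODEL_DIR_PATH=/abs/path/to/Qwen2.5-7B-Instruct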
# run our data construction pipeline
bash ./scripts/1_1_data_construction_from_short_context.sh
# copy the created long-context data to `./data_construction/data_example/my_qa/raw` and rename it (e.g., 8k_musique_reallong_context_example.jsonl).
bash ./scripts/1_2_data_construction_from_long_context.sh
# Rename the data and register it in the data folder of LLaMA-Factory
bash ./scripts/2_preparation_before_training.sh
The final data used for training will be located in ./data_construction/data/llamafactory_format and ./training/LLaMA-Factory-main/data, for example:
# short_context + response_from_short_context
8k_sft_short_musique_qwen.json
8k_po_short_musique_qwen.json
# long_context + response_from_short_context
8k_sft_long_musique_qwen.json
8k_po_long_musique_qwen.json
# long_context + response_from_long_context
8k_sft_long_musique_qwen_reallong.json
8k_po_long_musique_qwen_reallong.json
# short-to-long dataset
8k_s2l_short2long_short_musique_qwen.json
Please first modify the placeholders in the configuration files (in ./training/config/Qwen2.5-7B-Instruct) to your local paths, such as [OUTPUT_DIR], [DATASET_NAME], and [MODEL_PATH].
SoLoPO-Related Parameters (a schematic of how they combine is sketched after this list):
- add_short2long_loss: Whether to use SoLoPO. Default is true.
- kl_part: Type of SoLo-RA loss. Default is s2l_chosen. Optionally s2l_both for ablation studies (corresponds to Experiment 1 in Section 4.2).
- kl_penalty: Type of KL divergence loss. Default is null. Options include kl, abs, mse, low_var_kl, etc.
- kl_lambda: Coefficient for the KL divergence loss. Default is 1. Both kl_penalty and kl_lambda are used in Experiment 3 of Section 4.2.
- beta_sla: SoLo-RA coefficient, corresponding to $\alpha$ in Equation (9).
- sft_part: Whether to apply SFT loss. Default is null (only used with ORPO).
- efficiency: Whether to skip the forward computation for (long-context, rejected_response)-related information. Default is true. Enabling this improves speed but disables observation of reward or log-prob changes related to rejected_response.
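As a rough guide to how these options interact, the overall training objective can be sketched as (informal shorthand only; see Equation (9) in the paper for the exact form):

$$\mathcal{L}_{\text{SoLoPO}} \approx \mathcal{L}_{\text{PO}}(\text{short context}) + \alpha \cdot \mathcal{L}_{\text{SoLo-RA}} + \lambda_{\text{KL}} \cdot \mathcal{L}_{\text{KL}},$$

where $\alpha$ is beta_sla, $\lambda_{\text{KL}}$ is kl_lambda with the penalty type selected by kl_penalty (no KL term when kl_penalty is null), and $\mathcal{L}_{\text{SoLo-RA}}$ aligns the reward of the chosen response (kl_part = s2l_chosen) or of both responses (s2l_both) between the short and long contexts.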
Start training
bash ./scripts/3_3_model_training_s2l.sh
The trained model will be saved under ./output.
Please first download the required evaluation data to the specified paths:
- Download LongBenchV1, place it under ./evaluation/eval_by_LongBenchV1, and unzip data.zip.
- Download LongBenchV2 and place it under ./evaluation/eval_by_LongBenchV2/LongBench.
- Download NIAH-PLUS and place it under ./evaluation/eval_by_NIAH.
- The data required by RULER has already been placed in ./evaluation/eval_by_ruler/RULER/scripts/data/synthetic/json.
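For instance, the LongBenchV1 step above might look like this (assuming data.zip was downloaded directly into that folder):
# unpack the LongBenchV1 data in place
cd ./evaluation/eval_by_LongBenchV1 && unzip data.zip && cd -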
Start evaluation
- LongBenchV1 (QAs)
  bash ./scripts/4_1_eval_longbenchv1.sh
- RULER (QAs)
  - Configure your models in ./evaluation/eval_by_ruler/RULER/scripts/config_models.sh:
    [MODEL_NAME])
        MODEL_PATH=[MODEL_PATH]
        MODEL_TEMPLATE_TYPE="qwen2.5_wo_sys"  # or your own prompt template
        MODEL_FRAMEWORK="vllm"
        ;;
  - Evaluation:
    bash ./scripts/4_2_eval_ruler.sh
  - For models with a pre-trained context size shorter than the evaluation length, use ./output_yarn/cp.sh to copy the model into ./output_yarn and then enable YaRN.
- LongBenchV2
  bash ./scripts/4_3_eval_longbenchv2.sh
- NIAH-PLUS
  bash ./scripts/4_4_eval_niah.sh
The evaluation results can be found in the results folder within the corresponding directory.
Please cite our paper if you find the repo helpful in your work:
@misc{sun2025solopounlockinglongcontextcapabilities,
title={SoLoPO: Unlocking Long-Context Capabilities in LLMs via Short-to-Long Preference Optimization},
author={Huashan Sun and Shengyi Liao and Yansen Han and Yu Bai and Yang Gao and Cheng Fu and Weizhou Shen and Fanqi Wan and Ming Yan and Ji Zhang and Fei Huang},
year={2025},
eprint={2505.11166},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2505.11166},
}

