SCAR: Efficient Instruction-Tuning for Large Language Models via Style Consistency-Aware Response Ranking
SCAR is an innovative data selection method that enhances instruction-tuning for large language models. By leveraging style consistency-aware response ranking, SCAR identifies and selects the most beneficial training data for fine-tuning LLMs, ultimately improving their performance.
Ensure you have a Python 3.8+ environment. You can install SCAR using pip:
```bash
pip install scar-tool
```
SCAR requires the following dependencies: `torch>=2.3`, `transformers>=4.37`, `huggingface_hub>=0.23`, `scikit-learn`, `tqdm`, `nltk`, and `datasketch`. These will be installed automatically when you install SCAR via pip.
Here's a simple example of how to use the StyleRanker model with Hugging Face Transformers:
```python
import torch
from transformers import AutoTokenizer
from style_ranker.ranker.model import StyleRanker

# Load the model and tokenizer
model_path = "lizhuang144/scar-gte-base"
model = StyleRanker.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Prepare your data
instructions = ["Write a poem about spring", "Explain quantum computing"]
answers = ["I am sorry. Who are you? Why should I tell you anything about poem",
           "Quantum computing is a type of computation..."]

# Tokenize the inputs
max_length = 512
instruction_inputs = tokenizer(instructions, return_tensors='pt', padding=True,
                               truncation=True, max_length=max_length)
answer_inputs = tokenizer(answers, return_tensors='pt', padding=True,
                          truncation=True, max_length=max_length)

model.eval()

# Get the scores
with torch.no_grad():
    scores = model(
        instruction_inputs.input_ids,
        instruction_inputs.attention_mask,
        answer_inputs.input_ids,
        answer_inputs.attention_mask
    )

# Print the results
for instruction, answer, score in zip(instructions, answers, scores):
    print(f"Instruction: {instruction}")
    print(f"Answer: {answer}")
    print(f"Score: {score.item()}")
    print()
```
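Assuming the model returns one scalar score per instruction-answer pair (as the loop above implies), turning raw scores into an explicit ranking is a simple sort. A minimal sketch continuing from the variables above; the full pipeline introduced next packages this logic for you:

```python
# Order the pairs from most to least style-consistent (a sketch built on
# the variables above; rank_and_filter below wraps this kind of logic).
flat_scores = scores.view(-1)
for i in torch.argsort(flat_scores, descending=True).tolist():
    print(f"{flat_scores[i].item():.2f}  {instructions[i]}")
```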
SCAR's data filtering and ranking pipeline lets you fine-tune the selection process by choosing the top-k pairs with the highest scores, keeping a fixed ratio of the data, or applying a score threshold. The `rank_and_filter` function provides a convenient way to rank and filter instruction-answer pairs. Here's an example demonstrating its usage:
```python
import torch
from style_ranker.rank import rank_and_filter

# Path to the pre-trained SCAR ranker (rank_and_filter loads the model
# and tokenizer internally)
model_path = "lizhuang144/scar-gte-base"

# Prepare your data
instructions = ["Write a poem about spring", "Explain quantum computing", "Describe the water cycle"]
answers = ["I am sorry. Who are you? Why should I tell you anything about poem",
           "Quantum computing is a type of computation...",
           "The water cycle, also known as..."]

device = "cuda" if torch.cuda.is_available() else "cpu"

# Example 1: Using topk
topk_pairs = rank_and_filter(model_path, instructions, answers, topk=2, device=device)

# Example 2: Using threshold
threshold_pairs = rank_and_filter(model_path, instructions, answers, threshold=-2.0, device=device)

# Example 3: Using ratio
ratio_pairs = rank_and_filter(model_path, instructions, answers, ratio=0.5, device=device)

# Print results for each method
print("Top-k results:")
for instruction, answer, score in topk_pairs:
    print(f"Score: {score:.2f} | Instruction: {instruction}")

print("\nThreshold results:")
for instruction, answer, score in threshold_pairs:
    print(f"Score: {score:.2f} | Instruction: {instruction}")

print("\nRatio results:")
for instruction, answer, score in ratio_pairs:
    print(f"Score: {score:.2f} | Instruction: {instruction}")
```
Tip: Our models currently support only English and do not automatically remove duplicates. For best performance, exclude non-English examples and de-duplicate your data before applying our method.
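For de-duplication, the repository ships near-deduplication utilities (`style_ranker/dedup.py` and `examples/remove_dupes.py`). As a rough sketch of the underlying idea, here is a MinHash-LSH pass built directly on the `datasketch` dependency; the word-level shingling and the 0.9 similarity threshold are illustrative choices, not the repository's settings:

```python
from datasketch import MinHash, MinHashLSH

def minhash(text: str, num_perm: int = 128) -> MinHash:
    # Word-level shingles; character n-grams would also work
    m = MinHash(num_perm=num_perm)
    for token in text.lower().split():
        m.update(token.encode("utf-8"))
    return m

def near_dedup(texts, threshold: float = 0.9):
    """Keep the first occurrence of each near-duplicate cluster."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    kept = []
    for i, text in enumerate(texts):
        m = minhash(text)
        if not lsh.query(m):      # no near-duplicate indexed so far
            lsh.insert(str(i), m)
            kept.append(text)
    return kept

answers = ["The water cycle, also known as...",
           "The water cycle, also known as...",  # duplicate
           "Quantum computing is a type of computation..."]
print(near_dedup(answers))  # -> two unique answers
```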
We provide the following pre-trained SCAR models:

- `lizhuang144/scar-gte-base`: SCAR model trained using `Alibaba-NLP/gte-base-en-v1.5` as the representation encoder.
- `lizhuang144/scar-gte-large`: SCAR model trained using `Alibaba-NLP/gte-large-en-v1.5` as the representation encoder.
- `lizhuang144/scar-roberta-base`: SCAR model trained using `FacebookAI/roberta-base` as the representation encoder.

The models here are pre-trained on a dataset consisting primarily of open-domain data, available at `data/ranker_data/mix_code_open/gpt_35`.
SCAR demonstrates significant improvements in LLM performance when used for data filtering and selection. We evaluated our method using two LLMs: OLMo and StarCoder.
| Dataset Size | L.C. WinRate |
|---|---|
| Full dataset (320k) | 3.86 |
| SCAR-filtered 10k | 5.37 |
| SCAR-filtered 5k | 5.64 |
| SCAR-filtered 2.5k | 4.08 |
The official checkpoint allenai/OLMo-7B-SFT is trained on 320k examples from allenai/tulu-v2-sft-mixture. We evaluate models trained with SCAR-filtered data using 10k, 5k, and 2.5k instruction-answer pairs. The evaluation metric is L.C. WinRate (length-controlled win rate), which compares our model outputs against 'gpt-4-1106-preview' outputs on the AlpacaEval benchmark, with meta-llama/Llama-3-70b-chat-hf from Together AI as the judge.
| Dataset Size | HumanEval (Python) | MultiPL-E (Java) | MultiPL-E (JavaScript) | MultiPL-E (C++) |
|---|---|---|---|---|
| | Pass@1 / Pass@10 | Pass@1 / Pass@10 | Pass@1 / Pass@10 | Pass@1 / Pass@10 |
| Full dataset (13k) | 35.56 / 51.81 | 26.03 / 38.44 | 32.80 / 46.97 | 29.32 / 41.90 |
| SCAR-filtered 10k | 36.29 / 53.99 | 28.29 / 39.58 | 33.22 / 49.79 | 30.17 / 46.20 |
| SCAR-filtered 5k | 36.95 / 54.07 | 28.96 / 39.02 | 34.53 / 49.90 | 32.83 / 44.47 |
| SCAR-filtered 2.5k | 37.57 / 55.65 | 29.29 / 41.06 | 34.09 / 49.47 | 31.19 / 42.83 |
The official checkpoint 'bigcode/octocoder' is 'bigcode/starcoder' fine-tuned on 13k examples from 'bigcode/guanaco-commits'. We evaluated performance using the bigcode-evaluation-harness; the numbers for 'bigcode/octocoder' are taken from the 'bigcode/bigcode-models-leaderboard'. We evaluated models on four datasets in four programming languages (Python, Java, JavaScript, and C++) and report two execution accuracies (Pass@1 and Pass@10) for each dataset, for models trained with SCAR-filtered data of 10k, 5k, and 2.5k instruction-answer pairs.
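For reference, Pass@k in HumanEval-style benchmarks is conventionally computed with the unbiased estimator of Chen et al. (2021): generate n samples per problem, count the c that pass the unit tests, and estimate the probability that at least one of k drawn samples passes. A minimal sketch (the example numbers are made up):

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator (Chen et al., 2021): probability that at
    least one of k samples, drawn from n generations of which c are correct,
    passes the unit tests."""
    if n - c < k:
        return 1.0  # every size-k draw contains at least one correct sample
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# e.g. 50 generations for one problem, 18 of them passing:
print(f"{pass_at_k(50, 18, 1):.4f}")   # 0.3600
print(f"{pass_at_k(50, 18, 10):.4f}")  # close to 1 for k=10
```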
- StyleRanker: A model for ranking instruction-answer pairs based on style consistency and data quality.
- Data Filtering: Scripts for filtering and selecting high-quality instruction-answer pairs.
- LLM Training: Scripts for fine-tuning large language models using the selected data.
The `scripts/` directory contains bash scripts for various tasks:

- `data_synthesize.sh`: Synthesizes 'referenced' and 'direct' responses based on the human responses for training the ranker. Adjust the script arguments as needed.
- `quality_measure.sh`: Measures the quality of the collected responses using LLMs; the scores are used to train the ranker.
- `train_ranker.sh`: Trains the SCAR style ranker model. Update the script arguments as needed.
- `data_filter.sh`: Ranks and filters instruction-answer pairs. Update the script arguments as needed.
- `train_llm.sh`: Fine-tunes a large language model using the filtered data. Review and update the script arguments to ensure proper training. Training the LLM additionally requires `peft`, `trl`, `accelerate`, and `deepspeed`.
Ensure all dependencies are installed before running these scripts to achieve the best results.
The project is organized as follows:

- `data/`: Datasets for training and evaluation
  - `llm_sft_data/`: Training data for the large language model (code and open domain)
  - `ranker_data/`: Training data for the ranker (code and open domain)
- `style_ranker/`: Main package
  - `consts.py`
  - `dedup.py`: Near deduplication
  - `llm/`: LLM training (`train.py`)
  - `rank.py`: Ranking and filtering
  - `ranker/`: StyleRanker implementation
    - `config.py`, `dataset.py`, `model.py`
    - `quality.py`: Quality measure with LLMs like GPT-3.5-turbo
    - `train.py`: SCAR ranker training
  - `utils.py`
- `examples/`: Example Python scripts
  - `filter_pipeline.py`, `rank_pairs.py`, `remove_dupes.py`, `vicuna_converter.py`
- `scripts/`: Example Bash scripts
  - `data_synthesize.sh`, `data_filter.sh`, `quality_measure.sh`, `train_llm.sh`, `train_ranker.sh`
- `requirements.txt`: List of dependencies
- `setup.py`: Installation script
```bibtex
@article{li2024scar,
  title={SCAR: Efficient Instruction-Tuning for Large Language Models via Style Consistency-Aware Response Ranking},
  author={Li, Zhuang and Hua, Yuncheng and Vu, Thuy-Trang and Zhan, Haolan and Qu, Lizhen and Haffari, Gholamreza},
  journal={arXiv preprint arXiv:2406.10882},
  year={2024}
}
```