
sft_scripts_en


Instruction Fine-Tuning Script

Training Steps

Enter the scripts/training directory of the project and run bash run_sft.sh to launch instruction fine-tuning; a single GPU is used by default. Before running, users should modify the script and specify the relevant parameters; the parameter values in the script are for debugging reference only. The content of run_sft.sh is as follows:

######## Parameters ########
lr=1e-4
lora_rank=64
lora_alpha=128
lora_trainable="q_proj,v_proj,k_proj,o_proj,gate,w1,w2,w3"
modules_to_save="embed_tokens,lm_head"
lora_dropout=0.05

pretrained_model=path/to/hf/chinese-mixtral/dir/or/model_id
dataset_dir=path/to/sft/data/dir
per_device_train_batch_size=1
per_device_eval_batch_size=1
gradient_accumulation_steps=8
max_seq_length=1024
output_dir=output_dir
validation_file=validation_file_name

deepspeed_config_file=ds_zero2_no_offload.json

torchrun --nnodes 1 --nproc_per_node 1 run_clm_sft_with_peft.py \
    --deepspeed ${deepspeed_config_file} \
    --model_name_or_path ${pretrained_model} \
    --tokenizer_name_or_path ${pretrained_model} \
    --dataset_dir ${dataset_dir} \
    --per_device_train_batch_size ${per_device_train_batch_size} \
    --per_device_eval_batch_size ${per_device_eval_batch_size} \
    --do_train \
    --do_eval \
    --seed $RANDOM \
    --fp16 \
    --num_train_epochs 3 \
    --lr_scheduler_type cosine \
    --learning_rate ${lr} \
    --warmup_ratio 0.05 \
    --weight_decay 0.1 \
    --logging_strategy steps \
    --logging_steps 10 \
    --save_strategy steps \
    --save_total_limit 3 \
    --evaluation_strategy steps \
    --eval_steps 100 \
    --save_steps 200 \
    --gradient_accumulation_steps ${gradient_accumulation_steps} \
    --preprocessing_num_workers 8 \
    --max_seq_length ${max_seq_length} \
    --output_dir ${output_dir} \
    --overwrite_output_dir \
    --ddp_timeout 30000 \
    --logging_first_step True \
    --lora_rank ${lora_rank} \
    --lora_alpha ${lora_alpha} \
    --trainable ${lora_trainable} \
    --lora_dropout ${lora_dropout} \
    --modules_to_save ${modules_to_save} \
    --torch_dtype float16 \
    --validation_file ${validation_file} \
    --load_in_kbits 4 \
    --gradient_checkpointing \
    --ddp_find_unused_parameters False \
    --output_router_logits
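
The script above launches training on a single GPU (--nproc_per_node 1). To use several GPUs on one machine, it should be enough to increase --nproc_per_node and, if necessary, restrict the visible devices; below is a sketch for a hypothetical 2-GPU setup, with all other arguments left unchanged.

# Example: 2 GPUs on a single machine (device IDs are illustrative)
CUDA_VISIBLE_DEVICES=0,1 torchrun --nnodes 1 --nproc_per_node 2 run_clm_sft_with_peft.py \
    --deepspeed ${deepspeed_config_file} \
    ...   # remaining arguments identical to the command above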

Some parameters are self-explanatory. Explanations of selected parameters are given below:

  • --dataset_dir: Directory containing the instruction fine-tuning data: one or more Stanford Alpaca-format files with the .json extension (see the example layout after this list).
  • --validation_file: A single instruction fine-tuning file used as the validation set, also in Stanford Alpaca format with the .json extension.
  • --use_flash_attention_2: Enables FlashAttention-2 accelerated training (not set in the script above; add the flag to enable it).
  • --load_in_kbits: Valid options are 16/8/4, i.e., fp16 training or 8-bit/4-bit quantized training. The default is fp16 training.
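
For illustration, a possible data layout and the matching variable settings are sketched below; the file and directory names are hypothetical.

# sft_data/
# ├── alpaca_zh_part1.json    <- training files, Stanford Alpaca format, .json extension
# └── alpaca_zh_part2.json
# eval/
# └── alpaca_eval.json        <- a single validation file in the same format
dataset_dir=sft_data
validation_file=eval/alpaca_eval.json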

The other training-related hyperparameters listed above (especially the learning rate and the parameters that determine the total batch size) are for reference only; configure them according to your data and hardware when actually training. Note that the effective global batch size is per_device_train_batch_size × gradient_accumulation_steps × number of GPUs, i.e., 1 × 8 × 1 = 8 with the debug values above.

The Stanford Alpaca format is as follows:

[
  {"instruction" : ...,
   "input" : ...,
   "output" : ...},
  ...
]
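
For reference, a single record might look like the one below; the content is purely illustrative, and "input" may be an empty string when the instruction needs no additional context.

[
  {"instruction" : "Translate the following sentence into English.",
   "input" : "今天天气很好。",
   "output" : "The weather is nice today."}
]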

Multi-Machine and Multi-GPU Training

Please refer to the following launch method:

torchrun \
  --nnodes ${num_nodes} \
  --nproc_per_node ${num_gpu_per_node} \
  --node_rank ${node_rank} \
  --master_addr ${master_addr} \
  --master_port ${master_port} \
  run_clm_sft_with_peft.py \
    ...
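
As a concrete sketch, for two machines with 8 GPUs each (the address, port, and GPU count below are placeholders), the same command is run on every node, changing only --node_rank:

# On the first machine (node_rank 0); its IP must be reachable from all nodes
torchrun \
  --nnodes 2 \
  --nproc_per_node 8 \
  --node_rank 0 \
  --master_addr 192.168.1.10 \
  --master_port 29500 \
  run_clm_sft_with_peft.py \
    ...   # same training arguments as in run_sft.sh

# On the second machine, keep --master_addr and --master_port identical and set --node_rank 1.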