
Pre-training Scripts

Training Steps

Training script: scripts/training/run_clm_pt_with_peft.py

Go to the scripts/training directory of the project and run bash run_pt.sh to start pre-training. A single GPU is used by default. Before running, modify the script and set the relevant parameters; the parameter values in the script are for debugging reference only. The content of run_pt.sh is as follows:

########Parameter settings########
lr=1e-4
lora_rank=64
lora_alpha=128
lora_trainable="q_proj,v_proj,k_proj,o_proj,gate,w1,w2,w3"
modules_to_save="embed_tokens,lm_head"
lora_dropout=0.05

pretrained_model=path/to/hf/mixtral/dir
dataset_dir=path/to/pt/data/dir
data_cache=temp_data_cache_dir
per_device_train_batch_size=1
gradient_accumulation_steps=8
block_size=1024
output_dir=output_dir

deepspeed_config_file=ds_zero2_no_offload.json

torchrun --nnodes 1 --nproc_per_node 1 run_clm_pt_with_peft.py \
    --deepspeed ${deepspeed_config_file} \
    --model_name_or_path ${pretrained_model} \
    --tokenizer_name_or_path ${pretrained_model} \
    --dataset_dir ${dataset_dir} \
    --data_cache_dir ${data_cache} \
    --validation_split_percentage 0.001 \
    --per_device_train_batch_size ${per_device_train_batch_size} \
    --do_train \
    --seed $RANDOM \
    --fp16 \
    --num_train_epochs 1 \
    --lr_scheduler_type cosine \
    --learning_rate ${lr} \
    --warmup_ratio 0.05 \
    --weight_decay 0.1 \
    --logging_strategy steps \
    --logging_steps 10 \
    --save_strategy steps \
    --save_total_limit 3 \
    --save_steps 200 \
    --gradient_accumulation_steps ${gradient_accumulation_steps} \
    --preprocessing_num_workers 8 \
    --block_size ${block_size} \
    --output_dir ${output_dir} \
    --overwrite_output_dir \
    --ddp_timeout 30000 \
    --logging_first_step True \
    --lora_rank ${lora_rank} \
    --lora_alpha ${lora_alpha} \
    --trainable ${lora_trainable} \
    --lora_dropout ${lora_dropout} \
    --modules_to_save ${modules_to_save} \
    --torch_dtype float16 \
    --load_in_kbits 4 \
    --gradient_checkpointing \
    --ddp_find_unused_parameters False \
    --output_router_logits
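
The script above launches a single training process (one GPU). As a sketch that is not part of the original script, training on several GPUs of one machine only requires raising the --nproc_per_node value passed to torchrun (the GPU count of 4 below is an illustrative assumption); all other arguments stay unchanged:

# Single machine, 4 GPUs (illustrative value); the remaining arguments are identical to run_pt.sh.
torchrun --nnodes 1 --nproc_per_node 4 run_clm_pt_with_peft.py \
    --deepspeed ${deepspeed_config_file} \
    ...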

Explanations of some of the parameters are given below:

  • --dataset_dir: Directory of the pre-training data, which can contain multiple plain text files ending with .txt
  • --data_cache_dir: Directory for storing data cache files
  • --use_flash_attention_2: Enable FlashAttention-2 for training
  • --load_in_kbits: One of [16,8,4], i.e., train the model in fp16 or with 8-bit/4-bit quantization; the default is fp16 training.

The other training-related hyperparameters listed above, especially the learning rate and the parameters that determine the total batch size, are for reference only. Please configure them according to your data and hardware; the quick calculation below shows how the defaults combine into the effective batch size.
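
A minimal sketch of that calculation, assuming a single node and reusing the values from run_pt.sh (num_gpus is a hypothetical helper variable here, equal to --nnodes × --nproc_per_node):

# Effective global batch size = per-device batch size × gradient accumulation steps × number of GPU processes.
per_device_train_batch_size=1
gradient_accumulation_steps=8
num_gpus=1   # --nnodes × --nproc_per_node in run_pt.sh
echo $(( per_device_train_batch_size * gradient_accumulation_steps * num_gpus ))
# prints 8 with the default values above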

Multi-machine Multi-GPU Training

Please refer to the following launch method:

torchrun \
  --nnodes ${num_nodes} \
  --nproc_per_node ${num_gpu_per_node} \
  --node_rank ${node_rank} \
  --master_addr ${master_addr} \
  --master_port ${master_port} \
  run_clm_pt_with_peft.py \
    --deepspeed ${deepspeed_config_file} \
    ...
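
As a concrete usage example, a two-node launch could look like the following; the node count, GPU count, address, and port are illustrative assumptions, and the same command must be issued on every node with its own --node_rank:

# On node 0 (rank 0); 192.168.0.1:29500 stands in for the real master address and port.
torchrun \
  --nnodes 2 \
  --nproc_per_node 8 \
  --node_rank 0 \
  --master_addr 192.168.0.1 \
  --master_port 29500 \
  run_clm_pt_with_peft.py \
    --deepspeed ${deepspeed_config_file} \
    ...
# On node 1, run the same command with --node_rank 1.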