Hi there, thanks for your excellent contribution to the community. I tried to train Qwen2.5-7B with ds_config/stage3.json on an L20 GPU (I don't have a device with large VRAM like an A100), but the saved output appears to be empty:
```python
{'model.layers.0.self_attn.attn_gate.mask_linear_k.weight': tensor([], device='cuda:0', dtype=torch.bfloat16),
 'model.layers.1.self_attn.attn_gate.mask_linear_k.weight': tensor([], device='cuda:0', dtype=torch.bfloat16),
 'model.layers.2.self_attn.attn_gate.mask_linear_k.weight': tensor([], device='cuda:0', dtype=torch.bfloat16),
 'model.layers.3.self_attn.attn_gate.mask_linear_k.weight': tensor([], device='cuda:0', dtype=torch.bfloat16),
 'model.layers.4.self_attn.attn_gate.mask_linear_k.weight': tensor([], device='cuda:0', dtype=torch.bfloat16),
 'model.layers.5.self_attn.attn_gate.mask_linear_k.weight': tensor([], device='cuda:0', dtype=torch.bfloat16),
 'model.layers.6.self_attn.attn_gate.mask_linear_k.weight': tensor([], device='cuda:0', dtype=torch.bfloat16),
 'model.layers.7.self_attn.attn_gate.mask_linear_k.weight': tensor([], device='cuda:0', dtype=torch.bfloat16),
 'model.layers.8.self_attn.attn_gate.mask_linear_k.weight': tensor([], device='cuda:0', dtype=torch.bfloat16),
 'model.layers.9.self_attn.attn_gate.mask_linear_k.weight': tensor([], device='cuda:0', dtype=torch.bfloat16),
 'model.layers.10.self_attn.attn_gate.mask_linear_k.weight': tensor([], device='cuda:0', dtype=torch.bfloat16),
 'model.layers.11.self_attn.attn_gate.mask_linear_k.weight': tensor([], device='cuda:0', dtype=torch.bfloat16),
 'model.layers.12.self_attn.attn_gate.mask_linear_k.weight': tensor([], device='cuda:0', dtype=torch.bfloat16),
 'model.layers.13.self_attn.attn_gate.mask_linear_k.weight': tensor([], device='cuda:0', dtype=torch.bfloat16),
 'model.layers.14.self_attn.attn_gate.mask_linear_k.weight': tensor([], device='cuda:0', dtype=torch.bfloat16),
 'model.layers.15.self_attn.attn_gate.mask_linear_k.weight': tensor([], device='cuda:0', dtype=torch.bfloat16),
 'model.layers.16.self_attn.attn_gate.mask_linear_k.weight': tensor([], device='cuda:0', dtype=torch.bfloat16),
 'model.layers.17.self_attn.attn_gate.mask_linear_k.weight': tensor([], device='cuda:0', dtype=torch.bfloat16),
 'model.layers.18.self_attn.attn_gate.mask_linear_k.weight': tensor([], device='cuda:0', dtype=torch.bfloat16),
 'model.layers.19.self_attn.attn_gate.mask_linear_k.weight': tensor([], device='cuda:0', dtype=torch.bfloat16),
 'model.layers.20.self_attn.attn_gate.mask_linear_k.weight': tensor([], device='cuda:0', dtype=torch.bfloat16),
 'model.layers.21.self_attn.attn_gate.mask_linear_k.weight': tensor([], device='cuda:0', dtype=torch.bfloat16),
 'model.layers.22.self_attn.attn_gate.mask_linear_k.weight': tensor([], device='cuda:0', dtype=torch.bfloat16),
 'model.layers.23.self_attn.attn_gate.mask_linear_k.weight': tensor([], device='cuda:0', dtype=torch.bfloat16),
 'model.layers.24.self_attn.attn_gate.mask_linear_k.weight': tensor([], device='cuda:0', dtype=torch.bfloat16),
 'model.layers.25.self_attn.attn_gate.mask_linear_k.weight': tensor([], device='cuda:0', dtype=torch.bfloat16),
 'model.layers.26.self_attn.attn_gate.mask_linear_k.weight': tensor([], device='cuda:0', dtype=torch.bfloat16),
 'model.layers.27.self_attn.attn_gate.mask_linear_k.weight': tensor([], device='cuda:0', dtype=torch.bfloat16)}
```
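My guess (an assumption on my part, not verified against this repo) is that this is ZeRO stage-3 parameter partitioning: each rank only holds a shard of every weight, so calling `state_dict()` without gathering yields 0-element placeholder tensors like those above. A minimal sketch that detects such placeholders, with the gather step shown in a comment since it needs a live DeepSpeed engine:

```python
import torch

def find_empty_params(state_dict):
    """Return names of parameters whose saved tensor is a 0-element
    placeholder -- what ZeRO-3 exposes while weights are still partitioned."""
    return [name for name, t in state_dict.items() if t.numel() == 0]

# Simulated state dict mirroring the output in this issue: the gate weight
# is an empty placeholder while an ordinary weight keeps its full shape.
sd = {
    "model.layers.0.self_attn.attn_gate.mask_linear_k.weight":
        torch.empty(0, dtype=torch.bfloat16),
    "model.layers.0.self_attn.q_proj.weight": torch.randn(4, 4),
}
print(find_empty_params(sd))
# -> ['model.layers.0.self_attn.attn_gate.mask_linear_k.weight']

# With a live DeepSpeed ZeRO-3 engine, the partitioned weights could be
# materialized on rank 0 before saving (hedged sketch):
#   import deepspeed, torch.distributed as dist
#   with deepspeed.zero.GatheredParameters(list(model.parameters()), modifier_rank=0):
#       if dist.get_rank() == 0:
#           torch.save(model.state_dict(), "gate_weights.pt")
```

If that is the cause, the fix would be to gather (or use DeepSpeed's consolidated-checkpoint path) in whatever code writes these gate weights out.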
Below is my debug script (I changed the step count to 1 and substituted my model weight path and number of GPUs):
```bash
warmup_steps=${warmup_steps:-20}
training_max_length=${training_max_length:-32768}
lr=${lr:-1e-3}
weight_decay=${weight_decay:-0.0}
gate_type=${gate_type:-"Qavg_Kmaxminavg"}
bs=${bs:-16}
gpus=${gpus:-16}
total_data=${total_data:-524288000}
gate_loss_scale=${gate_loss_scale:-10.0}
steps=$(($total_data / $training_max_length / $bs))
steps=1  # override for debugging
gradient_accumulation_steps=$(($bs / $gpus))
base_model=${base_model:-"/models/Qwen2.5-7B-Instruct"}

run_name="${gate_type}_lr${lr}_maxlen${training_max_length}_warmup${warmup_steps}_bs${bs}_steps${steps}_gatelossscale${gate_loss_scale}"
echo $run_name

torchrun --nproc_per_node=$gpus --master_port=10003 distillation.py \
    --base_model ${base_model} \
    --seerattn_gate_type $gate_type \
    --bf16 True \
    --output_dir models/seer_attn_qwen2.5-7B/$run_name \
    --training_max_length $training_max_length \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps $gradient_accumulation_steps \
    --gate_loss_scale $gate_loss_scale \
    --evaluation_strategy "no" \
    --save_strategy "no" \
    --learning_rate $lr \
    --weight_decay $weight_decay \
    --warmup_steps $warmup_steps \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --deepspeed ds_config/stage3.json \
    --max_steps $steps
```