
Question about training with zero3 #10

Open
@cat538


Hi there, thanks for your excellent contribution to the community. I am trying to train Qwen2.5-7B with ds_config/stage3.json on an L20 (I don't have a device with large VRAM like an A100), but the saved output seems to be empty:

{'model.layers.0.self_attn.attn_gate.mask_linear_k.weight': tensor([], device='cuda:0', dtype=torch.bfloat16),
 'model.layers.1.self_attn.attn_gate.mask_linear_k.weight': tensor([], device='cuda:0', dtype=torch.bfloat16),
 'model.layers.2.self_attn.attn_gate.mask_linear_k.weight': tensor([], device='cuda:0', dtype=torch.bfloat16),
 'model.layers.3.self_attn.attn_gate.mask_linear_k.weight': tensor([], device='cuda:0', dtype=torch.bfloat16),
 'model.layers.4.self_attn.attn_gate.mask_linear_k.weight': tensor([], device='cuda:0', dtype=torch.bfloat16),
 'model.layers.5.self_attn.attn_gate.mask_linear_k.weight': tensor([], device='cuda:0', dtype=torch.bfloat16),
 'model.layers.6.self_attn.attn_gate.mask_linear_k.weight': tensor([], device='cuda:0', dtype=torch.bfloat16),
 'model.layers.7.self_attn.attn_gate.mask_linear_k.weight': tensor([], device='cuda:0', dtype=torch.bfloat16),
 'model.layers.8.self_attn.attn_gate.mask_linear_k.weight': tensor([], device='cuda:0', dtype=torch.bfloat16),
 'model.layers.9.self_attn.attn_gate.mask_linear_k.weight': tensor([], device='cuda:0', dtype=torch.bfloat16),
 'model.layers.10.self_attn.attn_gate.mask_linear_k.weight': tensor([], device='cuda:0', dtype=torch.bfloat16),
 'model.layers.11.self_attn.attn_gate.mask_linear_k.weight': tensor([], device='cuda:0', dtype=torch.bfloat16),
 'model.layers.12.self_attn.attn_gate.mask_linear_k.weight': tensor([], device='cuda:0', dtype=torch.bfloat16),
 'model.layers.13.self_attn.attn_gate.mask_linear_k.weight': tensor([], device='cuda:0', dtype=torch.bfloat16),
 'model.layers.14.self_attn.attn_gate.mask_linear_k.weight': tensor([], device='cuda:0', dtype=torch.bfloat16),
 'model.layers.15.self_attn.attn_gate.mask_linear_k.weight': tensor([], device='cuda:0', dtype=torch.bfloat16),
 'model.layers.16.self_attn.attn_gate.mask_linear_k.weight': tensor([], device='cuda:0', dtype=torch.bfloat16),
 'model.layers.17.self_attn.attn_gate.mask_linear_k.weight': tensor([], device='cuda:0', dtype=torch.bfloat16),
 'model.layers.18.self_attn.attn_gate.mask_linear_k.weight': tensor([], device='cuda:0', dtype=torch.bfloat16),
 'model.layers.19.self_attn.attn_gate.mask_linear_k.weight': tensor([], device='cuda:0', dtype=torch.bfloat16),
 'model.layers.20.self_attn.attn_gate.mask_linear_k.weight': tensor([], device='cuda:0', dtype=torch.bfloat16),
 'model.layers.21.self_attn.attn_gate.mask_linear_k.weight': tensor([], device='cuda:0', dtype=torch.bfloat16),
 'model.layers.22.self_attn.attn_gate.mask_linear_k.weight': tensor([], device='cuda:0', dtype=torch.bfloat16),
 'model.layers.23.self_attn.attn_gate.mask_linear_k.weight': tensor([], device='cuda:0', dtype=torch.bfloat16),
 'model.layers.24.self_attn.attn_gate.mask_linear_k.weight': tensor([], device='cuda:0', dtype=torch.bfloat16),
 'model.layers.25.self_attn.attn_gate.mask_linear_k.weight': tensor([], device='cuda:0', dtype=torch.bfloat16),
 'model.layers.26.self_attn.attn_gate.mask_linear_k.weight': tensor([], device='cuda:0', dtype=torch.bfloat16),
 'model.layers.27.self_attn.attn_gate.mask_linear_k.weight': tensor([], device='cuda:0', dtype=torch.bfloat16)}
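For context, this is the expected symptom under ZeRO stage 3: each parameter is partitioned across ranks, so reading param.data outside the engine's gather machinery returns an empty placeholder tensor, exactly as shown above. Below is a minimal sketch of how the gate weights could be materialized before dumping them; it uses DeepSpeed's standard deepspeed.zero.GatheredParameters context, while the function name, the "attn_gate" name filter, and the save path are illustrative assumptions, not this repo's API:

import torch
import deepspeed

def gather_gate_state_dict(model, save_path="gate_weights.pt"):
    # Collect only the attention-gate parameters we want to export
    # ("attn_gate" is an assumed name filter based on the dump above).
    gate_params = [(n, p) for n, p in model.named_parameters() if "attn_gate" in n]
    # GatheredParameters temporarily reassembles the ZeRO-3 partitions on
    # every rank; outside this context each p.data is an empty placeholder.
    with deepspeed.zero.GatheredParameters([p for _, p in gate_params], modifier_rank=None):
        state = {n: p.data.detach().clone().cpu() for n, p in gate_params}
    # Write from rank 0 only so the processes don't race on the same file.
    if torch.distributed.get_rank() == 0:
        torch.save(state, save_path)

If the weights are read without such a gather (or outside the DeepSpeed engine), the empty bfloat16 tensors above are what you get.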

Below is my debug script (I changed steps to 1 and substituted the model weight path and the number of GPUs):

warmup_steps=${warmup_steps:-20}
training_max_length=${training_max_length:-32768}
lr=${lr:-1e-3}
weight_decay=${weight_decay:-0.0}
gate_type=${gate_type:-"Qavg_Kmaxminavg"}
bs=${bs:-16}
gpus=${gpus:-16}
total_data=${total_data:-524288000}
gate_loss_scale=${gate_loss_scale:-10.0}
steps=$(($total_data / $training_max_length / $bs))
steps=1  # overridden to 1 for debugging
gradient_accumulation_steps=$(($bs / $gpus))
base_model=${base_model:-"/models/Qwen2.5-7B-Instruct"}
run_name="${gate_type}_lr${lr}_maxlen${training_max_length}_warmup${warmup_steps}_bs${bs}_steps${steps}_gatelossscale${gate_loss_scale}"

echo $run_name

torchrun --nproc_per_node=$gpus --master_port=10003 distillation.py  \
        --base_model ${base_model} \
        --seerattn_gate_type $gate_type \
        --bf16 True \
        --output_dir models/seer_attn_qwen2.5-7B/$run_name      \
        --training_max_length $training_max_length \
        --per_device_train_batch_size 1     \
        --gradient_accumulation_steps $gradient_accumulation_steps     \
        --gate_loss_scale $gate_loss_scale \
        --evaluation_strategy "no"     \
        --save_strategy "no"     \
        --learning_rate $lr     \
        --weight_decay $weight_decay     \
        --warmup_steps $warmup_steps     \
        --lr_scheduler_type "cosine"     \
        --logging_steps 1     \
        --deepspeed ds_config/stage3.json \
        --max_steps $steps
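Two standard DeepSpeed mechanisms (not specific to this repo, so treat them as hedged suggestions) can produce a consolidated state dict instead of empty shards: set "stage3_gather_16bit_weights_on_model_save": true inside "zero_optimization" in ds_config/stage3.json so a normal save emits gathered 16-bit weights, or save a ZeRO checkpoint and consolidate it offline. A sketch of the offline route, assuming a ZeRO-3 checkpoint directory was written by the trainer (the directory path and output filename here are illustrative):

import torch
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

# checkpoint_dir is the directory holding the ZeRO-3 shards
# (the global_step*/ folders plus the auto-generated zero_to_fp32.py).
checkpoint_dir = "models/seer_attn_qwen2.5-7B/<run_name>"

# Reassembles the partitioned shards into a single fp32 state dict on CPU.
state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir)
gate_state = {k: v for k, v in state_dict.items() if "attn_gate" in k}
torch.save(gate_state, "gate_weights_fp32.pt")

Note that with --save_strategy "no" the trainer never writes a checkpoint, so any manual dump of param.data at the end of training will see only the empty ZeRO-3 placeholders.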
