
Question about training with zero3 #10

Open
@cat538


Hi there, thanks for your excellent contribution to the community. I am trying to train Qwen2.5-7B with ds_config/stage3.json on an L20 (I don't have a device with large VRAM like an A100), but the saved output seems to be empty:

{'model.layers.0.self_attn.attn_gate.mask_linear_k.weight': tensor([], device='cuda:0', dtype=torch.bfloat16),
 'model.layers.1.self_attn.attn_gate.mask_linear_k.weight': tensor([], device='cuda:0', dtype=torch.bfloat16),
 'model.layers.2.self_attn.attn_gate.mask_linear_k.weight': tensor([], device='cuda:0', dtype=torch.bfloat16),
 'model.layers.3.self_attn.attn_gate.mask_linear_k.weight': tensor([], device='cuda:0', dtype=torch.bfloat16),
 'model.layers.4.self_attn.attn_gate.mask_linear_k.weight': tensor([], device='cuda:0', dtype=torch.bfloat16),
 'model.layers.5.self_attn.attn_gate.mask_linear_k.weight': tensor([], device='cuda:0', dtype=torch.bfloat16),
 'model.layers.6.self_attn.attn_gate.mask_linear_k.weight': tensor([], device='cuda:0', dtype=torch.bfloat16),
 'model.layers.7.self_attn.attn_gate.mask_linear_k.weight': tensor([], device='cuda:0', dtype=torch.bfloat16),
 'model.layers.8.self_attn.attn_gate.mask_linear_k.weight': tensor([], device='cuda:0', dtype=torch.bfloat16),
 'model.layers.9.self_attn.attn_gate.mask_linear_k.weight': tensor([], device='cuda:0', dtype=torch.bfloat16),
 'model.layers.10.self_attn.attn_gate.mask_linear_k.weight': tensor([], device='cuda:0', dtype=torch.bfloat16),
 'model.layers.11.self_attn.attn_gate.mask_linear_k.weight': tensor([], device='cuda:0', dtype=torch.bfloat16),
 'model.layers.12.self_attn.attn_gate.mask_linear_k.weight': tensor([], device='cuda:0', dtype=torch.bfloat16),
 'model.layers.13.self_attn.attn_gate.mask_linear_k.weight': tensor([], device='cuda:0', dtype=torch.bfloat16),
 'model.layers.14.self_attn.attn_gate.mask_linear_k.weight': tensor([], device='cuda:0', dtype=torch.bfloat16),
 'model.layers.15.self_attn.attn_gate.mask_linear_k.weight': tensor([], device='cuda:0', dtype=torch.bfloat16),
 'model.layers.16.self_attn.attn_gate.mask_linear_k.weight': tensor([], device='cuda:0', dtype=torch.bfloat16),
 'model.layers.17.self_attn.attn_gate.mask_linear_k.weight': tensor([], device='cuda:0', dtype=torch.bfloat16),
 'model.layers.18.self_attn.attn_gate.mask_linear_k.weight': tensor([], device='cuda:0', dtype=torch.bfloat16),
 'model.layers.19.self_attn.attn_gate.mask_linear_k.weight': tensor([], device='cuda:0', dtype=torch.bfloat16),
 'model.layers.20.self_attn.attn_gate.mask_linear_k.weight': tensor([], device='cuda:0', dtype=torch.bfloat16),
 'model.layers.21.self_attn.attn_gate.mask_linear_k.weight': tensor([], device='cuda:0', dtype=torch.bfloat16),
 'model.layers.22.self_attn.attn_gate.mask_linear_k.weight': tensor([], device='cuda:0', dtype=torch.bfloat16),
 'model.layers.23.self_attn.attn_gate.mask_linear_k.weight': tensor([], device='cuda:0', dtype=torch.bfloat16),
 'model.layers.24.self_attn.attn_gate.mask_linear_k.weight': tensor([], device='cuda:0', dtype=torch.bfloat16),
 'model.layers.25.self_attn.attn_gate.mask_linear_k.weight': tensor([], device='cuda:0', dtype=torch.bfloat16),
 'model.layers.26.self_attn.attn_gate.mask_linear_k.weight': tensor([], device='cuda:0', dtype=torch.bfloat16),
 'model.layers.27.self_attn.attn_gate.mask_linear_k.weight': tensor([], device='cuda:0', dtype=torch.bfloat16)}
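For context, this is the expected symptom under ZeRO stage 3: each parameter is partitioned across ranks, so reading param.data outside the engine's gather machinery returns an empty placeholder tensor, exactly as shown above. Below is a minimal sketch of how the gate weights could be materialized before dumping them; it uses DeepSpeed's standard deepspeed.zero.GatheredParameters context, while the function name, the "attn_gate" name filter, and the save path are illustrative assumptions, not this repo's API:

import torch
import deepspeed

def gather_gate_state_dict(model, save_path="gate_weights.pt"):
    # Collect only the attention-gate parameters we want to export
    # ("attn_gate" is an assumed name filter based on the dump above).
    gate_params = [(n, p) for n, p in model.named_parameters() if "attn_gate" in n]
    # GatheredParameters temporarily reassembles the ZeRO-3 partitions on
    # every rank; outside this context each p.data is an empty placeholder.
    with deepspeed.zero.GatheredParameters([p for _, p in gate_params], modifier_rank=None):
        state = {n: p.data.detach().clone().cpu() for n, p in gate_params}
    # Write from rank 0 only so the processes don't race on the same file.
    if torch.distributed.get_rank() == 0:
        torch.save(state, save_path)

If the weights are read without such a gather (or outside the DeepSpeed engine), the empty bfloat16 tensors above are what you get.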

Below is my debug script (I changed steps to 1 and substituted the model weight path and the number of GPUs):

warmup_steps=${warmup_steps:-20}
training_max_length=${training_max_length:-32768}
lr=${lr:-1e-3}
weight_decay=${weight_decay:-0.0}
gate_type=${gate_type:-"Qavg_Kmaxminavg"}
bs=${bs:-16}
gpus=${gpus:-16}
total_data=${total_data:-524288000}
gate_loss_scale=${gate_loss_scale:-10.0}
steps=$(($total_data / $training_max_length / $bs))
steps=1  # overridden to 1 for debugging
gradient_accumulation_steps=$(($bs / $gpus))
base_model=${base_model:-"/models/Qwen2.5-7B-Instruct"}
run_name="${gate_type}_lr${lr}_maxlen${training_max_length}_warmup${warmup_steps}_bs${bs}_steps${steps}_gatelossscale${gate_loss_scale}"

echo $run_name

torchrun --nproc_per_node=$gpus --master_port=10003 distillation.py  \
        --base_model ${base_model} \
        --seerattn_gate_type $gate_type \
        --bf16 True \
        --output_dir models/seer_attn_qwen2.5-7B/$run_name      \
        --training_max_length $training_max_length \
        --per_device_train_batch_size 1     \
        --gradient_accumulation_steps $gradient_accumulation_steps     \
        --gate_loss_scale $gate_loss_scale \
        --evaluation_strategy "no"     \
        --save_strategy "no"     \
        --learning_rate $lr     \
        --weight_decay $weight_decay     \
        --warmup_steps $warmup_steps     \
        --lr_scheduler_type "cosine"     \
        --logging_steps 1     \
        --deepspeed ds_config/stage3.json \
        --max_steps $steps
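Two standard DeepSpeed mechanisms (not specific to this repo, so treat them as hedged suggestions) can produce a consolidated state dict instead of empty shards: set "stage3_gather_16bit_weights_on_model_save": true inside "zero_optimization" in ds_config/stage3.json so a normal save emits gathered 16-bit weights, or save a ZeRO checkpoint and consolidate it offline. A sketch of the offline route, assuming a ZeRO-3 checkpoint directory was written by the trainer (the directory path and output filename here are illustrative):

import torch
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

# checkpoint_dir is the directory holding the ZeRO-3 shards
# (the global_step*/ folders plus the auto-generated zero_to_fp32.py).
checkpoint_dir = "models/seer_attn_qwen2.5-7B/<run_name>"

# Reassembles the partitioned shards into a single fp32 state dict on CPU.
state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir)
gate_state = {k: v for k, v in state_dict.items() if "attn_gate" in k}
torch.save(gate_state, "gate_weights_fp32.pt")

Note that with --save_strategy "no" the trainer never writes a checkpoint, so any manual dump of param.data at the end of training will see only the empty ZeRO-3 placeholders.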
