Same model with/without deepspeed different results #7322

BernardoPalmer · 2025-05-30T12:34:11Z

Maybe someone can enlighten me if I'm missing something.
I'm using a single RTX 4060 GPU with 8 GB of VRAM to train a model. My intention with DeepSpeed is to use CPU offload and eventually SSD/NVMe offload. Without DeepSpeed, my model reached about 90% accuracy on the top categories. I then wanted to check whether I would get similar results with DeepSpeed, using the exact same model architecture and hyperparameters, but the results were considerably worse.
deepspeed version 0.6.19
torch 2.6.0
cuda 12
ubuntu 24

ds_zero3_cpu.json:
{
  "train_batch_size": 128,
  "gradient_accumulation_steps": 16,
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": 1e-4,
      "betas": [0.9, 0.999],
      "eps": 1e-8,
      "weight_decay": 0.01
    }
  },
  "fp16": { "enabled": true },
  "zero_optimization": {
    "stage": 3,
    "offload_param": { "device": "cpu", "pin_memory": true },
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "overlap_comm": false,
    "contiguous_gradients": true
  }
}
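
For reference, here is a rough sketch of what the batch settings above work out to on one GPU. It relies on DeepSpeed's documented rule that train_batch_size = micro batch per GPU x gradient_accumulation_steps x number of GPUs. The helper name `micro_batch_per_gpu` is made up for illustration and is not part of DeepSpeed. If the non-DeepSpeed run used a different effective batch size, or ran in fp32 rather than fp16, the two runs might not be a like-for-like comparison.

```python
import json


def micro_batch_per_gpu(config: dict, world_size: int = 1) -> int:
    """Derive the per-GPU micro-batch size implied by a DeepSpeed config.

    Hypothetical helper (not a DeepSpeed API). It applies the rule
    train_batch_size = micro_batch * gradient_accumulation_steps * world_size.
    """
    total = config["train_batch_size"]
    accum = config.get("gradient_accumulation_steps", 1)
    micro, remainder = divmod(total, accum * world_size)
    if remainder:
        raise ValueError(
            "train_batch_size is not divisible by "
            "gradient_accumulation_steps * world_size"
        )
    return micro


# Only the fields relevant to the comparison, copied from ds_zero3_cpu.json.
config = json.loads(
    '{"train_batch_size": 128,'
    ' "gradient_accumulation_steps": 16,'
    ' "fp16": {"enabled": true}}'
)

micro = micro_batch_per_gpu(config, world_size=1)  # single RTX 4060
uses_fp16 = config.get("fp16", {}).get("enabled", False)

print(f"micro batch per GPU:  {micro}")
print(f"effective batch size: {config['train_batch_size']}")
print(f"fp16 enabled:         {uses_fp16}")
```

So each optimizer step here sees 128 samples, accumulated from 16 micro-batches of 8, with fp16 enabled.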

running with: deepspeed train.py --deepspeed --deepspeed_config ds_zero3_cpu.json

Please let me know if more context is needed.

Replies: 0 comments
