Same model with/without deepspeed different results #7322
Unanswered
BernardoPalmer
asked this question in
Q&A
Replies: 0 comments
Maybe someone can enlighten me if I'm missing something.
I'm using a single RTX 4060 GPU (8 GB VRAM) to train a model. My intention with DeepSpeed is to use CPU offload and eventually SSD/NVMe offload. Without DeepSpeed, my model reached about 90% accuracy on the top categories. I then wanted to test whether I would get something similar with DeepSpeed, using the exact same model architecture and hyperparameters, but the results were considerably worse.
deepspeed version 0.6.19
torch 2.6.0
cuda 12
ubuntu 24
ds_zero3_cpu.json:
{
  "train_batch_size": 128,
  "gradient_accumulation_steps": 16,
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": 1e-4,
      "betas": [0.9, 0.999],
      "eps": 1e-8,
      "weight_decay": 0.01
    }
  },
  "fp16": { "enabled": true },
  "zero_optimization": {
    "stage": 3,
    "offload_param": { "device": "cpu", "pin_memory": true },
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "overlap_comm": false,
    "contiguous_gradients": true
  }
}
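One thing worth checking when comparing against the non-DeepSpeed run is what this config implies per step. DeepSpeed documents the relation train_batch_size = train_micro_batch_size_per_gpu * gradient_accumulation_steps * number of GPUs, so on a single GPU the values above work out to a micro-batch of 8 with an optimizer step every 16 micro-batches. The config also enables fp16. The post does not say what batch size or precision the baseline run used, so the following is only a minimal sketch for reading those values out of the config. The helper name `implied_micro_batch_size` is hypothetical and not part of DeepSpeed.

```python
import json

# The config from the post, inlined so the sketch is self-contained.
DS_CONFIG = json.loads("""
{
  "train_batch_size": 128,
  "gradient_accumulation_steps": 16,
  "optimizer": {
    "type": "AdamW",
    "params": {"lr": 1e-4, "betas": [0.9, 0.999], "eps": 1e-8, "weight_decay": 0.01}
  },
  "fp16": {"enabled": true},
  "zero_optimization": {
    "stage": 3,
    "offload_param": {"device": "cpu", "pin_memory": true},
    "offload_optimizer": {"device": "cpu", "pin_memory": true},
    "overlap_comm": false,
    "contiguous_gradients": true
  }
}
""")


def implied_micro_batch_size(config, world_size=1):
    """Hypothetical helper: derive the per-GPU micro-batch size from
    train_batch_size = micro_batch * gradient_accumulation_steps * world_size.
    """
    total = config["train_batch_size"]
    accum = config.get("gradient_accumulation_steps", 1)
    micro, remainder = divmod(total, accum * world_size)
    if remainder:
        raise ValueError(
            "train_batch_size is not divisible by "
            "gradient_accumulation_steps * world_size"
        )
    return micro


# Assumption: one GPU, as described in the post.
micro = implied_micro_batch_size(DS_CONFIG, world_size=1)
print("micro-batch per GPU:", micro)
print("optimizer step every", DS_CONFIG["gradient_accumulation_steps"], "micro-batches")
print("fp16 enabled:", DS_CONFIG["fp16"]["enabled"])
```

If the baseline used a different effective batch size, a different learning rate for that batch size, or full fp32 precision, the two runs are not strictly like for like, which may account for some of the gap.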
Running with: deepspeed train.py --deepspeed --deepspeed_config ds_zero3_cpu.json
Please let me know if more context is needed.