Training is stuck at the beginning #4
It seems to be a PyTorch Lightning problem. We trained the model with batch_size=1 on A100 GPUs with 80GB of VRAM (using around 70+GB). Gradient checkpointing, AMP, and an 8-bit optimizer can greatly reduce the VRAM requirement. You can also set freeze_clip, freeze_blip, and freeze_resnet to True.
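For reference, here is a minimal sketch of what enabling AMP and an 8-bit optimizer could look like in a PyTorch Lightning setup. The module name, attribute layout, and learning rate are assumptions for illustration, not the repo's actual code:

```python
import bitsandbytes as bnb
import pytorch_lightning as pl


class ARLDMModule(pl.LightningModule):  # hypothetical module name, for illustration
    def configure_optimizers(self):
        # AdamW8bit keeps the optimizer state in 8 bits, which roughly
        # quarters the optimizer-state memory compared to fp32 Adam
        return bnb.optim.AdamW8bit(self.parameters(), lr=1e-5)  # lr is an assumption


# precision=16 turns on automatic mixed precision (AMP) in Lightning,
# so activations are stored in fp16 and use about half the memory
trainer = pl.Trainer(accelerator="gpu", devices=1, precision=16)
```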
Nice!
Yes, we tried freezing CLIP, BLIP, and ResNet in our very early experiments and the performance was still acceptable, but we did not run the whole experiment with this setting or test FID scores.
Hi, I want to know approximately how long it will take if I set all three to True.
@bibisbar Hi, with the unfreeze setting, the forward time for batch_size=1 on a single A100 GPU is 0.5s, and freezing will not change this time much. As for the backward pass, the original time is also 0.5s, and freezing the gradients will accelerate it; I guess it may reduce the cost by 50% at most. So freezing will only slightly shorten the training time, but it can reduce memory usage, which I think is more important (like parameter-efficient tuning).
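As a concrete illustration, freezing a submodule in PyTorch amounts to disabling gradients on its parameters; a minimal sketch below (the attribute names clip_model, blip_model, and resnet are assumptions about how the model names its submodules, following the freeze_clip/freeze_blip/freeze_resnet flags):

```python
def freeze(module):
    """Disable gradient computation for every parameter of a submodule.

    Frozen parameters need no gradient buffers and no optimizer state,
    which is where the memory savings come from; the forward pass
    still runs through them at full cost.
    """
    for p in module.parameters():
        p.requires_grad = False


# hypothetical submodule names, for illustration
freeze(model.clip_model)
freeze(model.blip_model)
freeze(model.resnet)
```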
Hi, but how do you enable gradient checkpointing in a PyTorch Lightning model? In a Hugging Face model it's as easy as model.enable_gradient_checkpointing(), but that doesn't seem to work for the ARLDM model...
@skywalker00001 Hi, I am sorry, it seems PyTorch Lightning does not support this setting. Lightning-AI/pytorch-lightning#49
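Since Lightning itself has no switch for this, one possible workaround is to call the underlying libraries' own gradient-checkpointing methods on the submodules inside the LightningModule. A sketch under the assumption that the module wraps a diffusers UNet and a transformers text encoder under these attribute names (the checkpoint path is also just an example):

```python
import pytorch_lightning as pl
from diffusers import UNet2DConditionModel
from transformers import CLIPTextModel


class ARLDMModule(pl.LightningModule):  # hypothetical wrapper, for illustration
    def __init__(self):
        super().__init__()
        # example Stable Diffusion checkpoint, an assumption about the setup
        self.unet = UNet2DConditionModel.from_pretrained(
            "runwayml/stable-diffusion-v1-5", subfolder="unet"
        )
        self.text_encoder = CLIPTextModel.from_pretrained(
            "runwayml/stable-diffusion-v1-5", subfolder="text_encoder"
        )
        # diffusers models: recompute activations during backward
        # instead of storing them, trading compute for memory
        self.unet.enable_gradient_checkpointing()
        # transformers models expose the same idea under a different name
        self.text_encoder.gradient_checkpointing_enable()
```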
Thanks. The other approaches (freezing ResNet, the CLIP embedding, and the BLIP embedding), together with AMP and the 8-bit optimizer, successfully reduced the VRAM usage to about 40GB on my A6000 for batch_size=1.
@skywalker00001 Great! |
freeze_clip=True, freeze_blip=True, freeze_resnet=True on a V100 doesn't work; still CUDA out of memory.
Hi, I'm a beginner in deep learning and I would appreciate it if you could tell me how to use AMP and an 8-bit optimizer to reduce VRAM usage. Also, I wonder: can I run this project on two 24GB 3090s after optimization? Looking forward to hearing from you.
I'm working on training this model on the FlintstonesSV dataset. I run the training script on a GPU server with 8x 3080 Ti (12GB VRAM each). Is this server able to train this model? What's the maximum memory usage during training?
The training process seems to be stuck at "trainer.fit(model, dataloader, ckpt_path=args.train_model_file)". Here is the log:
no signal after waiting for 30 min...
The config.yaml is: