Default configuration still runs out of memory with 24 GB of VRAM #4

Closed
MEIZU16 opened this issue Mar 22, 2023 · 19 comments

Comments

@MEIZU16

MEIZU16 commented Mar 22, 2023

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 64.00 MiB (GPU 0; 23.87 GiB total capacity; 23.08 GiB already allocated; 9.38 MiB free; 23.12 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
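Changing the batch size doesn't seem to help. In the GPU monitor, memory usage jumps from 12 GB to 24 GB in an instant and then the process dies.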
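
The error message recommends `max_split_size_mb` only when reserved memory is much larger than allocated memory. A minimal sketch for checking that gap from the message text follows; `parse_oom` is a hypothetical helper written for this thread, not part of PyTorch, and it assumes the message wording shown above.

```python
import re


def parse_oom(message):
    """Extract the GiB figures from a torch.cuda.OutOfMemoryError message."""
    patterns = {
        "total": r"([\d.]+) GiB total capacity",
        "allocated": r"([\d.]+) GiB already allocated",
        "reserved": r"([\d.]+) GiB reserved in total",
    }
    stats = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, message)
        if match is None:
            raise ValueError(f"could not find {name!r} in message")
        stats[name] = float(match.group(1))
    # Memory PyTorch holds from the driver but is not currently using.
    stats["cached_unused"] = round(stats["reserved"] - stats["allocated"], 2)
    return stats


msg = (
    "CUDA out of memory. Tried to allocate 64.00 MiB (GPU 0; 23.87 GiB total capacity; "
    "23.08 GiB already allocated; 9.38 MiB free; 23.12 GiB reserved in total by PyTorch)"
)
stats = parse_oom(msg)
print(stats)
```

For the message above, the gap between reserved and allocated memory is only a few hundredths of a GiB, so fragmentation is probably not the cause here and tuning `PYTORCH_CUDA_ALLOC_CONF` is unlikely to help much; the card appears to be genuinely full.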

@yuanzhoulvpi2017
Owner

Thanks for the report. Try setting both batch sizes to 1 and context_length=32. I'll keep testing other cases.

@zhaodice

zhaodice commented Mar 23, 2023

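> Thanks for the report. Try setting both batch sizes to 1 and context_length=32. I'll keep testing other cases.

Still out of memory after those changes. The GPU is an RTX 4090 with 24 GB. My configuration is below (if it helps, I can give you SSH access to investigate):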

context_length = 32

args = TrainingArguments(
    output_dir="test003",
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    evaluation_strategy="steps",
    eval_steps=100,
    logging_steps=100,
    gradient_accumulation_steps=8,
    num_train_epochs=1,
    weight_decay=0.1,
    warmup_steps=1_000,
    lr_scheduler_type="cosine",
    learning_rate=5e-4,
    save_steps=100,
    fp16=True,
    push_to_hub=False,
)
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 294.00 MiB (GPU 0; 23.65 GiB total capacity; 21.84 GiB already allocated; 152.56 MiB free; 22.11 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
  0%|                                                | 0/595 [00:01<?, ?it/s]
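
A rough back-of-the-envelope estimate may explain why lowering the batch size and context length does not help. This is a sketch under stated assumptions: a nominal 6e9 parameters (taken from the "6b" in the model name), fp16 weights and gradients (2 bytes each), and Adam-style optimizer state at 8 bytes per parameter. The function name and the byte counts are illustrative, not measured from this repo.

```python
GIB = 1024 ** 3


def training_memory_gib(n_params, weight_bytes=2, grad_bytes=2, optimizer_bytes=8):
    """Rough lower bound on training memory, ignoring activations and buffers."""
    weights = n_params * weight_bytes / GIB
    grads = n_params * grad_bytes / GIB
    optimizer = n_params * optimizer_bytes / GIB
    return {
        "weights": weights,
        "grads": grads,
        "optimizer": optimizer,
        "total": weights + grads + optimizer,
    }


est = training_memory_gib(6e9)  # assumed nominal parameter count
for name, gib in est.items():
    print(f"{name:>9}: {gib:6.1f} GiB")
```

Under these assumptions the weights alone take about 11 GiB and the gradients another 11 GiB, before any optimizer state or activations. That is consistent with the reports above of usage jumping from 12 GB to 24 GB and of the failure occurring inside `run_backward`: the cost scales with parameter count, not with batch size.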

@zhaodice

zhaodice commented Mar 23, 2023

I noticed my torch version was 2.0, so I tried switching to 1.13 (still out of memory):

    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 294.00 MiB (GPU 0; 23.65 GiB total capacity; 21.83 GiB already allocated; 52.56 MiB free; 22.10 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
  0%|                                                                          | 0/595 [00:01<?, ?it/s]
(venv) user@calculator:~/ext/zero_nlp/simple_thu_chatglm6b$ pip show torch
Name: torch
Version: 1.13.0
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: packages@pytorch.org
License: BSD-3
Location: /home/user/ext/zero_nlp/venv/lib/python3.10/site-packages
Requires: nvidia-cublas-cu11, nvidia-cuda-nvrtc-cu11, nvidia-cuda-runtime-cu11, nvidia-cudnn-cu11, typing-extensions
Required-by: accelerate, peft, pytorch-lightning, torchmetrics, torchvision, triton
(venv) user@calculator:~/ext/zero_nlp/simple_thu_chatglm6b$ 

@zhaodice

A model quantized to INT4 uses far less VRAM. Is it possible to fine-tune the INT4 model directly?

@yuanzhoulvpi2017
Owner

See this configuration: #5 (comment)

@Adherer

Adherer commented Mar 23, 2023

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 148.00 MiB (GPU 0; 22.38 GiB total capacity; 21.49 GiB already allocated; 87.94 MiB free; 21.52 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
I'm running on a P40 with 22 GB of VRAM and hit the same problem, with context_length set to 32.

@yuanzhoulvpi2017
Owner

yuanzhoulvpi2017 commented Mar 23, 2023 via email

@zhaodice

> See this configuration: #5 (comment)

Tried it; it didn't help.

@Adherer

Adherer commented Mar 23, 2023

> 22 GB of VRAM is not enough.

I looked at some other repos: with 8-bit quantization, fine-tuning fits in 16 GB of VRAM. There is currently no version that supports multi-GPU fine-tuning. So, are the following two optimizations planned?

  1. 8-bit/4-bit fine-tuning;
  2. Single-node multi-GPU or multi-node multi-GPU training.

If you have plans for either, I'd be glad to collaborate.

@yuanzhoulvpi2017
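Owner

I'm currently working in two directions:

  1. Using torch.utils.checkpoint to reduce VRAM pressure.
  2. Single-node multi-GPU training.

I've spent a day on it without much progress so far 😂. Still working on it.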
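
For readers unfamiliar with `torch.utils.checkpoint`, here is a minimal CPU-only sketch of the idea: activations inside each checkpointed block are discarded after the forward pass and recomputed during backward, trading extra compute for lower memory. The `Block` and `Stack` classes are toy stand-ins invented for illustration, not the ChatGLM layers or the owner's implementation, and `use_reentrant=False` assumes a PyTorch version that accepts that argument.

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint


class Block(nn.Module):
    """A small residual feed-forward block standing in for a transformer layer."""

    def __init__(self, dim):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        return x + self.ff(x)


class Stack(nn.Module):
    def __init__(self, dim=16, depth=4, use_checkpoint=True):
        super().__init__()
        self.blocks = nn.ModuleList([Block(dim) for _ in range(depth)])
        self.use_checkpoint = use_checkpoint

    def forward(self, x):
        for block in self.blocks:
            if self.use_checkpoint and self.training:
                # Do not keep this block's intermediate activations;
                # recompute them during the backward pass instead.
                x = checkpoint(block, x, use_reentrant=False)
            else:
                x = block(x)
        return x


def grads_for(use_checkpoint):
    torch.manual_seed(0)  # same weights and input in both runs
    model = Stack(use_checkpoint=use_checkpoint).train()
    x = torch.randn(2, 8, 16)
    model(x).sum().backward()
    return [p.grad.clone() for p in model.parameters()]


plain = grads_for(False)
ckpt = grads_for(True)
same = all(torch.allclose(a, b) for a, b in zip(plain, ckpt))
print("gradients match:", same)
```

The comparison at the end checks that checkpointing changes only when activations are computed, not the resulting gradients. Note that checkpointing reduces activation memory only; it does not shrink the weights, gradients, or optimizer state.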

@Adherer

Adherer commented Mar 23, 2023

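> I'm currently working in two directions:
>
>   1. Using torch.utils.checkpoint to reduce VRAM pressure.
>   2. Single-node multi-GPU training.
>
> I've spent a day on it without much progress so far 😂. Still working on it.

You could take a look at this code: https://github.com/mymusise/ChatGLM-Tuning
I got it running and it is training now. When I have time tomorrow I'll adapt it for training on Chinese data.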

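@zhaodice

> You could take a look at this code: https://github.com/mymusise/ChatGLM-Tuning
> I got it running and it is training now. When I have time tomorrow I'll adapt it for training on Chinese data.

I got that one running too, but I'm not sure whether I did something wrong: the training results aren't good, and the model seems to be producing nonsense.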

@yuanzhoulvpi2017
Owner

yuanzhoulvpi2017 commented Mar 23, 2023 via email

@yuanzhoulvpi2017
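Owner

Quite a few people have already gotten it running. I'm not sure what is going on with your setups; the only requirement is enough VRAM. #5 (comment)

As the screenshot shows, VRAM usage while it is running is 24330 MB.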

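@zhaodice

> Quite a few people have already gotten it running. I'm not sure what is going on with your setups; the only requirement is enough VRAM. #5 (comment)
>
> As the screenshot shows, VRAM usage while it is running is 24330 MB.

I'm not fond of Windows. I just dual-booted into it, a dialog asked whether to create a GPT partition table, I clicked OK, and the LVM partition holding my training set was destroyed. I have to set everything up again, so it's a good chance to retry.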

@Adherer

Adherer commented Mar 24, 2023

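> Quite a few people have already gotten it running. I'm not sure what is going on with your setups; the only requirement is enough VRAM. #5 (comment)
>
> As the screenshot shows, VRAM usage while it is running is 24330 MB.

The problem is solved: pruning the model is enough, and fine-tuning now works with a little over 13 GB of VRAM:

image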

@Adherer

Adherer commented Mar 24, 2023

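> I'm not fond of Windows. I just dual-booted into it, a dialog asked whether to create a GPT partition table, I clicked OK, and the LVM partition holding my training set was destroyed. I have to set everything up again, so it's a good chance to retry.

The problem is solved: pruning the model is enough, and fine-tuning now works with a little over 13 GB of VRAM:

image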

@zhaodice

> The problem is solved: pruning the model is enough, and fine-tuning now works with a little over 13 GB of VRAM:

How do you prune it…?

@yuanzhoulvpi2017
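Owner

Latest progress report

  1. Using torch.utils.checkpoint + LoRA, with fp16 and batch_size=1, VRAM usage drops to about 15 GB. I'm cleaning up the code and will release it later.
  2. This should let many more GPUs run the fine-tuning, and may even allow a larger batch size.

Screenshot 2023-03-24 21 32 10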
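
As a rough illustration of why LoRA reduces memory: it freezes the original weight matrix and trains only two low-rank factors, so gradients and optimizer state are needed for far fewer parameters. The sketch below counts trainable parameters for a single square projection. The dimension 4096 and rank 8 are assumed values for illustration; the thread does not say which layers or rank the owner used.

```python
def lora_trainable_params(d_in, d_out, rank):
    """Parameters in the two low-rank factors A (rank x d_in) and B (d_out x rank)."""
    return rank * d_in + d_out * rank


d = 4096  # assumed hidden size, for illustration only
r = 8     # assumed LoRA rank, for illustration only

full = d * d  # trainable parameters if the whole matrix is fine-tuned
lora = lora_trainable_params(d, d, r)

print(f"full fine-tuning: {full:,} trainable parameters")
print(f"LoRA (rank {r}):   {lora:,} trainable parameters")
print(f"ratio:            {lora / full:.4%}")
```

With these assumed sizes, the LoRA factors hold well under 1% of the parameters of the matrix they adapt, which is why the gradient and optimizer memory for adapted layers becomes almost negligible. The frozen fp16 weights still have to fit on the card.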
