
24G GPU out of memory #15

Closed
JohnZhuYX opened this issue Aug 3, 2023 · 19 comments

@JohnZhuYX

JohnZhuYX commented Aug 3, 2023

I used your demo and it won't even run; it blows out the GPU memory. Is the quantized version really the only option?
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.32 GiB (GPU 0; 23.65 GiB total capacity; 20.85 GiB already allocated; 1.26 GiB free; 20.86 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

@JustinLin610
Member

Could you try again? We just updated the code.

@Louis-y-nlp

Louis-y-nlp commented Aug 3, 2023

With the new code, GPU memory usage is 17150MiB / 32510MiB

@hutianyu2006

The real issue is that no official quantized model has been released... I opened #18 separately for that, hoping the maintainers will see it...

@logicwong
Member

I used your demo and it won't even run; it blows out the GPU memory. Is the quantized version really the only option? torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.32 GiB (GPU 0; 23.65 GiB total capacity; 20.85 GiB already allocated; 1.26 GiB free; 20.86 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Hi, the OOM is probably because fp32 precision is used by default. Could you try pulling our latest code and loading the model in fp16? Like this:

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", device_map="auto", trust_remote_code=True, fp16=True).eval()
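For reference, a fuller sketch of this fp16 path (a minimal example assuming the model.chat interface from the repo's remote code, as used in the demo script; not an official snippet):

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B",
    device_map="auto",
    trust_remote_code=True,
    fp16=True,  # load the weights in half precision so the 7B model fits on a 24G card
).eval()

# model.chat is provided by the repo's remote code (modeling_qwen.py)
response, history = model.chat(tokenizer, "你好", history=None)
print(response)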

@JustinLin610
Member

The real issue is that no official quantized model has been released... I opened #18 separately for that, hoping the maintainers will see it...

Wang Peng just raised the precision issue; enabling fp16 is one option. Quantization is covered in the README, see the quantization section: you only need to add a quantization_config.
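For illustration, a hedged sketch of what passing a quantization_config can look like with bitsandbytes (the exact settings recommended in the README's quantization section may differ; these values are only illustrative):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Illustrative 4-bit quantization config; check the README for the recommended values.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    device_map="auto",
    trust_remote_code=True,
    quantization_config=quantization_config,
).eval()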

@JohnZhuYX
Author

JohnZhuYX commented Aug 4, 2023

I re-downloaded the model, but it still fails. Is the quantized version really the only option?
Traceback (most recent call last):
File "/home/johnzyx/working/pythonprojects/LLaMA-Efficient-Tuning/src/zyx_QwenDemo.py", line 14, in
response, history = model.chat(tokenizer, "你好", history=None)
File "/home/johnzyx/.cache/huggingface/modules/transformers_modules/Qwen-7B-Chat/modeling_qwen.py", line 905, in chat
outputs = self.generate(
File "/home/johnzyx/.cache/huggingface/modules/transformers_modules/Qwen-7B-Chat/modeling_qwen.py", line 951, in generate
return super().generate(
File "/home/johnzyx/environment/anaconda-env/python3.10/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/johnzyx/environment/anaconda-env/python3.10/lib/python3.10/site-packages/transformers/generation/utils.py", line 1615, in generate
return self.sample(
File "/home/johnzyx/environment/anaconda-env/python3.10/lib/python3.10/site-packages/transformers/generation/utils.py", line 2737, in sample
outputs = self(
File "/home/johnzyx/environment/anaconda-env/python3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/johnzyx/environment/anaconda-env/python3.10/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/johnzyx/.cache/huggingface/modules/transformers_modules/Qwen-7B-Chat/modeling_qwen.py", line 842, in forward
lm_logits = self.lm_head(hidden_states)
File "/home/johnzyx/environment/anaconda-env/python3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/johnzyx/environment/anaconda-env/python3.10/lib/python3.10/site-packages/accelerate/hooks.py", line 160, in new_forward
args, kwargs = module._hf_hook.pre_forward(module, *args, **kwargs)
File "/home/johnzyx/environment/anaconda-env/python3.10/lib/python3.10/site-packages/accelerate/hooks.py", line 286, in pre_forward
set_module_tensor_to_device(
File "/home/johnzyx/environment/anaconda-env/python3.10/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 298, in set_module_tensor_to_device
new_value = value.to(device)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.32 GiB (GPU 0; 23.65 GiB total capacity; 20.83 GiB already allocated; 1.18 GiB free; 20.85 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

@Louis-y-nlp

Try adding this when initializing the model: torch_dtype=torch.float16
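That is, something along these lines (a sketch of the suggestion only; as noted later in the thread, the checkpoint's config may still decide the effective dtype):

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    device_map="auto",
    trust_remote_code=True,
    torch_dtype=torch.float16,  # request half-precision weights at load time
).eval()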

@JohnZhuYX
Author

If I add the fp16=True parameter, like the example above:
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", device_map="auto", trust_remote_code=True, fp16=True).eval()
it also errors out:
Warning: import flash_attn fail, please install FlashAttention https://github.com/Dao-AILab/flash-attention
Traceback (most recent call last):
File "/home/johnzyx/working/pythonprojects/LLaMA-Efficient-Tuning/src/zyx_QwenDemo.py", line 14, in
response, history = model.chat(tokenizer, "你好", history=None)
File "/home/johnzyx/.cache/huggingface/modules/transformers_modules/Qwen-7B-Chat/modeling_qwen.py", line 905, in chat
outputs = self.generate(
File "/home/johnzyx/.cache/huggingface/modules/transformers_modules/Qwen-7B-Chat/modeling_qwen.py", line 951, in generate
return super().generate(
File "/home/johnzyx/environment/anaconda-env/python3.10/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/johnzyx/environment/anaconda-env/python3.10/lib/python3.10/site-packages/transformers/generation/utils.py", line 1615, in generate
return self.sample(
File "/home/johnzyx/environment/anaconda-env/python3.10/lib/python3.10/site-packages/transformers/generation/utils.py", line 2750, in sample
next_token_scores = logits_processor(input_ids, next_token_logits)
File "/home/johnzyx/environment/anaconda-env/python3.10/lib/python3.10/site-packages/transformers/generation/logits_process.py", line 97, in call
scores = processor(input_ids, scores)
File "/home/johnzyx/.cache/huggingface/modules/transformers_modules/Qwen-7B-Chat/qwen_generation_utils.py", line 349, in call
scores[i, self.eos_token_id] = float(2
30)
RuntimeError: value cannot be converted to type at::Half without overflow

@Louis-y-nlp

Did you modify the config.json file? I can reproduce this error by modifying config.json. Re-download the latest files from HF.

@Louis-y-nlp

17250MiB / 32510MiB

@jackaihfia2334

Same error here. I already downloaded the latest config.json from Hugging Face,
but it still fails with RuntimeError: value cannot be converted to type at::Half without overflow

@trexliu

trexliu commented Aug 4, 2023

Same error as above, also with 24G of VRAM. I've tried every suggested approach; it's either OOM or overflow.

@JohnZhuYX
Author

Here is my config.json, please take a look:
{
  "activation": "swiglu",
  "apply_residual_connection_post_layernorm": false,
  "architectures": [
    "QWenLMHeadModel"
  ],
  "auto_map": {
    "AutoConfig": "configuration_qwen.QWenConfig",
    "AutoModelForCausalLM": "modeling_qwen.QWenLMHeadModel"
  },
  "attn_pdrop": 0.0,
  "bf16": false,
  "bias_dropout_fusion": true,
  "bos_token_id": 151643,
  "embd_pdrop": 0.1,
  "eos_token_id": 151643,
  "ffn_hidden_size": 22016,
  "fp16": false,
  "initializer_range": 0.02,
  "kv_channels": 128,
  "layer_norm_epsilon": 1e-05,
  "model_type": "qwen",
  "n_embd": 4096,
  "n_head": 32,
  "n_layer": 32,
  "n_positions": 6144,
  "no_bias": true,
  "onnx_safe": null,
  "padded_vocab_size": 151936,
  "params_dtype": "torch.bfloat16",
  "pos_emb": "rotary",
  "resid_pdrop": 0.1,
  "rotary_emb_base": 10000,
  "rotary_pct": 1.0,
  "scale_attn_weights": true,
  "seq_length": 2048,
  "tie_word_embeddings": false,
  "tokenizer_type": "QWenTokenizer",
  "transformers_version": "4.31.0",
  "use_cache": true,
  "use_flash_attn": true,
  "vocab_size": 151936,
  "use_dynamic_ntk": false,
  "use_logn_attn": false
}

@Louis-y-nlp

If you change bf16 to true, it will probably run. I just tested it: specifying torch_dtype=torch.float16 has no effect, the loaded weights are still bf16. Strangely, the V100 doesn't support bf16, so I'm not sure how it even runs on my machine.

@jackaihfia2334

Did you modify the config.json file? I can reproduce this error by modifying config.json. Re-download the latest files from HF.

Could you share your working config.json?

@Louis-y-nlp

This model seems to only run in bf16. So either set fp16 to false and bf16 to true in the config and pass nothing extra at initialization, or set both to false and pass torch_dtype=torch.bfloat16 when initializing. I tried both approaches and both run, with GPU memory usage under 20G either way (see the sketch after the config below).

{
  "activation": "swiglu",
  "apply_residual_connection_post_layernorm": false,
  "architectures": [
    "QWenLMHeadModel"
  ],  
  "auto_map": {
    "AutoConfig": "configuration_qwen.QWenConfig",
    "AutoModelForCausalLM": "modeling_qwen.QWenLMHeadModel"
  },  
  "attn_pdrop": 0.0,
  "bf16": true,
  "bias_dropout_fusion": true,
  "bos_token_id": 151643,
  "embd_pdrop": 0.1,
  "eos_token_id": 151643,
  "ffn_hidden_size": 22016,
  "fp16": false,
  "initializer_range": 0.02,
  "kv_channels": 128,
  "layer_norm_epsilon": 1e-05,
  "model_type": "qwen",
  "n_embd": 4096,
  "n_head": 32, 
  "n_layer": 32, 
  "n_positions": 6144,
  "no_bias": true,
  "onnx_safe": null,
  "padded_vocab_size": 151936,
  "params_dtype": "torch.bfloat16",
  "pos_emb": "rotary",
  "resid_pdrop": 0.1,
  "rotary_emb_base": 10000,
  "rotary_pct": 1.0,
  "scale_attn_weights": true,
  "seq_length": 2048,
  "tie_word_embeddings": false,
  "tokenizer_type": "QWenTokenizer",
  "transformers_version": "4.31.0",
  "use_cache": true,
  "use_flash_attn": true,
  "vocab_size": 151936,
  "use_dynamic_ntk": false,
  "use_logn_attn": false
}

This one should run.
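A minimal sketch of the second approach (both flags left false in config.json, dtype passed at load time); assuming the model.chat interface from the repo's remote code:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    device_map="auto",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,  # bf16 weights; the alternative is bf16: true in config.json with no extra kwarg
).eval()

response, history = model.chat(tokenizer, "你好", history=None)
print(response)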

@sevenold

sevenold commented Aug 4, 2023

Pulled the latest repo.

GPU: 4090 24G
use fp32: OOM
use fp16:
'''
scores[i, self.eos_token_id] = float(2**30)
RuntimeError: value cannot be converted to type at::Half without overflow

'''
use bf16: works fine, 17031MiB / 23.99GiB

@JohnZhuYX
Author

Confirmed: only bf16=True or the quantized model works; fp32 is not an option.

@logicwong
Member

If I add the fp16=True parameter, like the example above, model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", device_map="auto", trust_remote_code=True, fp16=True).eval() it also errors out: Warning: import flash_attn fail, please install FlashAttention https://github.com/Dao-AILab/flash-attention Traceback (most recent call last): File "/home/johnzyx/working/pythonprojects/LLaMA-Efficient-Tuning/src/zyx_QwenDemo.py", line 14, in <module> response, history = model.chat(tokenizer, "你好", history=None) File "/home/johnzyx/.cache/huggingface/modules/transformers_modules/Qwen-7B-Chat/modeling_qwen.py", line 905, in chat outputs = self.generate( File "/home/johnzyx/.cache/huggingface/modules/transformers_modules/Qwen-7B-Chat/modeling_qwen.py", line 951, in generate return super().generate( File "/home/johnzyx/environment/anaconda-env/python3.10/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context return func(*args, **kwargs) File "/home/johnzyx/environment/anaconda-env/python3.10/lib/python3.10/site-packages/transformers/generation/utils.py", line 1615, in generate return self.sample( File "/home/johnzyx/environment/anaconda-env/python3.10/lib/python3.10/site-packages/transformers/generation/utils.py", line 2750, in sample next_token_scores = logits_processor(input_ids, next_token_logits) File "/home/johnzyx/environment/anaconda-env/python3.10/lib/python3.10/site-packages/transformers/generation/logits_process.py", line 97, in __call__ scores = processor(input_ids, scores) File "/home/johnzyx/.cache/huggingface/modules/transformers_modules/Qwen-7B-Chat/qwen_generation_utils.py", line 349, in __call__ scores[i, self.eos_token_id] = float(2**30) RuntimeError: value cannot be converted to type at::Half without overflow

Thanks everyone for the feedback. This bug was caused by float(2**30) exceeding the fp16 range; the latest code fixes it. Please give it another try.
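For context, fp16 (at::Half) tops out around 65504, so assigning float(2**30) into a half-precision logits tensor overflows. A tiny standalone illustration (not the repo's code):

import torch

print(torch.finfo(torch.float16).max)  # 65504.0, far below 2**30

scores = torch.zeros(4, dtype=torch.float16)
try:
    scores[0] = float(2**30)  # same pattern as the old qwen_generation_utils.py line
except RuntimeError as err:
    print(err)  # value cannot be converted to type at::Half without overflow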
