24G GPU runs out of memory #15
Comments
Try again? The code was just updated.
With the new code, memory usage: 17150MiB / 32510MiB
The real problem is that the team has not released a quantized model either... I opened #18 for that, hoping the maintainers see it...
Hi, the OOM may be because fp32 precision is used by default. Could you pull our latest code and then load the model in fp16? Like this: model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", device_map="auto", trust_remote_code=True, fp16=True).eval()
Wang Peng just addressed the precision issue above; enabling fp16 is one option. Quantization is covered in the README (see the quantization section); you only need to pass quantization_config.
I re-downloaded the model and it still fails. Is the quantized version really the only option?...
Try adding this when initializing the model: torch_dtype=torch.float16
What if the fp16 parameter is added, like the one above?
Did you edit config.json? I can reproduce your error by modifying config.json. Re-download a fresh copy from Hugging Face.
17250MiB / 32510MiB |
Same error here, and I have already downloaded the latest config.json from Hugging Face.
Same error as above, also with 24G of VRAM; I have tried every suggested method. It is either OOM or an overflow.
My config.json
Setting bf16 to true will probably let it run; I just tested it. Specifying torch_dtype=torch.float16 has no effect: the loaded weights are still bf16. Oddly, the V100 does not support bf16, so I am not sure how this ran at all.
Could you share your working config.json?
This model seems to run only in bf16. So either set fp16 to false and bf16 to true in the config and pass nothing extra at initialization, or set both to false and pass torch_dtype=torch.bfloat16 at initialization. I tried both approaches; both run, with memory usage under 20G. {
"activation": "swiglu",
"apply_residual_connection_post_layernorm": false,
"architectures": [
"QWenLMHeadModel"
],
"auto_map": {
"AutoConfig": "configuration_qwen.QWenConfig",
"AutoModelForCausalLM": "modeling_qwen.QWenLMHeadModel"
},
"attn_pdrop": 0.0,
"bf16": true,
"bias_dropout_fusion": true,
"bos_token_id": 151643,
"embd_pdrop": 0.1,
"eos_token_id": 151643,
"ffn_hidden_size": 22016,
"fp16": false,
"initializer_range": 0.02,
"kv_channels": 128,
"layer_norm_epsilon": 1e-05,
"model_type": "qwen",
"n_embd": 4096,
"n_head": 32,
"n_layer": 32,
"n_positions": 6144,
"no_bias": true,
"onnx_safe": null,
"padded_vocab_size": 151936,
"params_dtype": "torch.bfloat16",
"pos_emb": "rotary",
"resid_pdrop": 0.1,
"rotary_emb_base": 10000,
"rotary_pct": 1.0,
"scale_attn_weights": true,
"seq_length": 2048,
"tie_word_embeddings": false,
"tokenizer_type": "QWenTokenizer",
"transformers_version": "4.31.0",
"use_cache": true,
"use_flash_attn": true,
"vocab_size": 151936,
"use_dynamic_ntk": false,
"use_logn_attn": false
}
Use this config and it should run.
Pulled the latest repo. GPU: 4090 24G
Confirmed: only bf16=True or quantization works; fp32 is not an option.
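The arithmetic backs this up. A rough sketch of weight memory alone (assuming roughly 7 billion parameters for Qwen-7B; the exact count differs slightly):

```python
# Back-of-envelope weight-memory estimate for a ~7B-parameter model.
# Parameter count is an assumption, not the exact figure.
params = 7_000_000_000
gib = 1024 ** 3
fp32_gib = params * 4 / gib  # 4 bytes per fp32 weight
bf16_gib = params * 2 / gib  # 2 bytes per bf16 weight
print(f"fp32 weights: {fp32_gib:.1f} GiB, bf16 weights: {bf16_gib:.1f} GiB")
```

Under these assumptions fp32 weights alone (~26 GiB) already exceed a 24 GiB card before activations or the KV cache, while bf16 (~13 GiB) leaves headroom, consistent with the ~17 GiB usage reported above.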
Thanks everyone for the feedback. The bug was that float(2**30) exceeds the representable range of fp16; the latest code fixes it. Please try again.
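The overflow described above is easy to reproduce. A minimal sketch using numpy's fp16 type (the model code itself uses torch tensors, but the dtype behavior is the same):

```python
import numpy as np

# fp16's largest finite value is 65504, so a large constant like
# float(2**30) overflows to inf when cast to fp16.
print(np.finfo(np.float16).max)  # 65504.0
print(np.float16(2.0 ** 30))     # inf
# bf16 keeps fp32's 8-bit exponent, so 2**30 stays representable there,
# which is why the model ran under bf16 but broke under fp16.
```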
Running your demo fails with out-of-GPU-memory. Is the quantized version really the only way?
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.32 GiB (GPU 0; 23.65 GiB total capacity; 20.85 GiB already allocated; 1.26 GiB free; 20.86 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF