
[Bug] Training on 13B causes loss to be 0, while 7B works fine #170

Open
NanoCode012 opened this issue Mar 25, 2023 · 28 comments

@NanoCode012
Contributor

NanoCode012 commented Mar 25, 2023

Hello,

Thank you for your work.

I ran into a strange issue when training LoRA. With the default settings and the cleaned dataset, I can train the 7B model successfully. However, if I switch to the 13B model (and tokenizer) by changing base_model from 7b to 13b, the training loss becomes 0. The only thing that changed was the model path/name.

This was tested on the latest commit d358124 and an older one, e04897b.

{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.03}

I have tried to change the micro batch size and context length, but the issue persists.

Is there something else I need to change?

Env:

  • torch: 2.0
  • cuda: 11.2
  • python: 3.8.8

Edit: I have asked about this in Discord. One suggested explanation was padding. However, I did not modify the original code; it was run with the default settings.
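For context on the padding theory, here is a rough sketch (from memory, not the repo's exact code) of how padding is normally masked out of the labels. If this masking goes wrong for the 13B tokenizer, the reported loss can become meaningless (0 or NaN).

from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("decapoda-research/llama-13b-hf")
tokenizer.pad_token_id = 0  # LLaMA ships without a pad token; 0 (unk) is the usual choice here

def tokenize(prompt, cutoff_len=256):
    result = tokenizer(prompt, truncation=True, max_length=cutoff_len, padding="max_length")
    # Padding positions are set to -100 so the cross-entropy loss ignores them.
    # If too many (or all) positions end up masked, the logged loss stops meaning anything.
    result["labels"] = [
        tok if tok != tokenizer.pad_token_id else -100
        for tok in result["input_ids"]
    ]
    return result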

@ElleLeonne
Contributor

[screenshot of training log]
Loss and learning rate drop to very small values pretty quickly; they're usually expressed in scientific notation (e-x).
Are you sure your terminal isn't just truncating the output?

@NanoCode012
Contributor Author

NanoCode012 commented Mar 26, 2023

@ElleLeonne, thank you for answering. I also see your loss hitting 0. Isn't that incorrect? I don't think it should go that low, right?

I attached a sample training loss plot. The minimum is 0, and there are also spikes to random high values.

It should not be a truncation issue for the plot, since it's not being logged to the terminal.
[training loss plot]

Edit: eval loss is NaN
[eval loss plot]

@ElleLeonne
Contributor

ElleLeonne commented Mar 26, 2023

[screenshot]
So I discovered an issue where, when switching to a new dataset, the masking code just sets the dictionary key for the output to "" before calling the generate_prompt function. If you're using a different dictionary to train, then all it's doing is adding a blank key that never gets called, resulting in the model learning that every reply should be empty, as the attention mask always equals the full prompt.

I tried running another pass, but was still unsuccessful. Will continue until I resolve my specific problem.
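A minimal sketch of the failure mode I mean (illustrative names only, not the repo's exact code): if the output field is blanked before the prompt is built, every training example teaches the model to produce an empty reply.

def generate_prompt(data_point):
    return (
        "### Instruction:\n" + data_point["instruction"] + "\n\n"
        "### Response:\n" + data_point["output"]
    )

data_point = {"instruction": "Summarize the text.", "output": "A short summary."}
data_point["output"] = ""             # the problematic overwrite
print(generate_prompt(data_point))    # the response section is now always empty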

@NanoCode012
Contributor Author

Hello @ElleLeonne , thanks for the reply.

when switching to a new dataset

I noticed this issue originally with a custom dataset but was also able to reproduce it with the cleaned dataset in this repo. Does it work for you on the original cleaned dataset?

@ElleLeonne
Contributor

Yes, the original cleaned version worked fine. After fixing the problem, Loss appears to stay steady for a single epoch.

@NanoCode012
Contributor Author

Yes, the original cleaned version worked fine. After fixing the problem, Loss appears to stay steady for a single epoch.

@ElleLeonne, could you clarify which model size you used?

If you're using a different dictionary to train, then all it's doing is adding a blank key that never gets called

Do you mean a custom dataset in the same instruction/response format, or a completely new key format?

@ElleLeonne
Contributor

ElleLeonne commented Mar 26, 2023 via email

@NanoCode012
Contributor Author

7B works with the cleaned alpaca dataset, and another dataset of mine that uses a similar, yet not identical, format with different key names.

@ElleLeonne , have you tried the 13B on either dataset? Does it work?

@randerzander

randerzander commented Mar 26, 2023

I had similar issues with training 13B on the cleaned dataset. Loss was zero immediately and eval_loss was NaN. After the first epoch, loss dropped to zero again.

{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.03}
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.05}
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.08}
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.1}
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.13}
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.15}
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.18}
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.21}
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.23}
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.26}
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.28}
{'loss': 1.8864, 'learning_rate': 2.6999999999999996e-05, 'epoch': 0.31}
{'loss': 2.2428, 'learning_rate': 5.6999999999999996e-05, 'epoch': 0.33}
{'loss': 2.5093, 'learning_rate': 8.699999999999999e-05, 'epoch': 0.36}
{'loss': 2.7587, 'learning_rate': 0.000117, 'epoch': 0.38}
{'loss': 1.7657, 'learning_rate': 0.000147, 'epoch': 0.41}
{'loss': 1.9844, 'learning_rate': 0.00017699999999999997, 'epoch': 0.44}
{'loss': 2.2091, 'learning_rate': 0.00020699999999999996, 'epoch': 0.46}
{'loss': 2.4903, 'learning_rate': 0.000237, 'epoch': 0.49}
{'loss': 2.7598, 'learning_rate': 0.000267, 'epoch': 0.51}
{'eval_loss': nan, 'eval_runtime': 303.7866, 'eval_samples_per_second': 6.584, 'eval_steps_per_second': 0.823, 'epoch': 0.51}

Finished result:

{'train_runtime': 55139.3637, 'train_samples_per_second': 2.717, 'train_steps_per_second': 0.021, 'train_loss': 0.1778865871266422, 'epoch': 3.0}                                                    
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1170/1170 [15:18:59<00:00, 47.13s/it]
                                                                                                                                                                                                     
 If there's a warning about missing keys above, please disregard :) 

But when running generate.py and attempting to pass any prompt, I get errors:

(alpaca-lora) dev@desktop:~/projects/alpaca-lora$ python generate.py --load_8bit --base_model 'decapoda-research/llama-13b-hf' --lora_weights 'lora-alpaca-13b'
                                                                                                                                                                                                     
===================================BUG REPORT===================================                                                                                                                     
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues                                                                      
================================================================================                                                                                                                     
/home/dev/miniconda/envs/alpaca-lora/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: /home/dev/miniconda/envs/alpaca-lora did not contain libcudart.so as expected! Searching further paths...
  warn(msg)                                                                                                                                                                                          
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so                                                                                                                              
CUDA SETUP: Highest compute capability among GPUs detected: 7.0                                                                                                                                      
CUDA SETUP: Detected CUDA version 116
/home/dev/miniconda/envs/alpaca-lora/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: Compute capability < 7.5 detected! Only slow 8-bit matmul is supported for your GPU!
  warn(msg)
CUDA SETUP: Loading binary /home/dev/miniconda/envs/alpaca-lora/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda116_nocublaslt.so...
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
The tokenizer class you load from this checkpoint is 'LLaMATokenizer'.                                                                                                                               
The class this function is called from is 'LlamaTokenizer'.                                                                                                                                          
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 41/41 [00:29<00:00,  1.39it/s]
/home/dev/miniconda/envs/alpaca-lora/lib/python3.10/site-packages/gradio/inputs.py:27: UserWarning: Usage of gradio.inputs is deprecated, and will not be supported in the future, please import your component from gradio.components
  warnings.warn(
/home/dev/miniconda/envs/alpaca-lora/lib/python3.10/site-packages/gradio/deprecation.py:40: UserWarning: `optional` parameter is deprecated, and it has no effect
  warnings.warn(value)                                                                                                                                                                               
/home/dev/miniconda/envs/alpaca-lora/lib/python3.10/site-packages/gradio/deprecation.py:40: UserWarning: `numeric` parameter is deprecated, and it has no effect                                     
  warnings.warn(value)                                                                                                                                                                               
Running on local URL:  http://0.0.0.0:7860                                                                                                                                                           
                                                                                                                                                                                                     
To create a public link, set `share=True` in `launch()`.                                                                                                                                             
Traceback (most recent call last):                                                                                                                                                                   
  File "/home/dev/miniconda/envs/alpaca-lora/lib/python3.10/site-packages/gradio/routes.py", line 393, in run_predict                                                                                
    output = await app.get_blocks().process_api(                                                                                                                                                     
  File "/home/dev/miniconda/envs/alpaca-lora/lib/python3.10/site-packages/gradio/blocks.py", line 1069, in process_api                                                                               
    result = await self.call_function(                                                                                                                                                               
  File "/home/dev/miniconda/envs/alpaca-lora/lib/python3.10/site-packages/gradio/blocks.py", line 878, in call_function                                                                              
    prediction = await anyio.to_thread.run_sync(                                                                                                                                                     
  File "/home/dev/miniconda/envs/alpaca-lora/lib/python3.10/site-packages/anyio/to_thread.py", line 31, in run_sync                                                                                  
    return await get_asynclib().run_sync_in_worker_thread(                                                                                                                                           
  File "/home/dev/miniconda/envs/alpaca-lora/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread                                                       
    return await future 
  File "/home/dev/miniconda/envs/alpaca-lora/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 867, in run    result = context.run(func, *args)
  File "/home/dev/projects/alpaca-lora/generate.py", line 103, in evaluate
    generation_output = model.generate(
  File "/home/dev/miniconda/envs/alpaca-lora/lib/python3.10/site-packages/peft/peft_model.py", line 581, in generate
    outputs = self.base_model.generate(**kwargs)
  File "/home/dev/miniconda/envs/alpaca-lora/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context    return func(*args, **kwargs)  File "/home/dev/miniconda/envs/alpaca-lora/lib/python3.10/site-packages/transformers/generation/utils.py", line 1490, in generate
    return self.beam_search(
  File "/home/dev/miniconda/envs/alpaca-lora/lib/python3.10/site-packages/transformers/generation/utils.py", line 2749, in beam_search
    outputs = self(
  File "/home/dev/miniconda/envs/alpaca-lora/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/dev/miniconda/envs/alpaca-lora/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/dev/miniconda/envs/alpaca-lora/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 765, in forward
    outputs = self.model(
  File "/home/dev/miniconda/envs/alpaca-lora/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/dev/miniconda/envs/alpaca-lora/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 614, in forward
    layer_outputs = decoder_layer(
  File "/home/dev/miniconda/envs/alpaca-lora/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/dev/miniconda/envs/alpaca-lora/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/dev/miniconda/envs/alpaca-lora/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 309, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/home/dev/miniconda/envs/alpaca-lora/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/dev/miniconda/envs/alpaca-lora/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/dev/miniconda/envs/alpaca-lora/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 209, in forward
    query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
  File "/home/dev/miniconda/envs/alpaca-lora/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/dev/miniconda/envs/alpaca-lora/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/dev/miniconda/envs/alpaca-lora/lib/python3.10/site-packages/peft/tuners/lora.py", line 522, in forward
    result = super().forward(x)
  File "/home/dev/miniconda/envs/alpaca-lora/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 242, in forward
    out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)
  File "/home/dev/miniconda/envs/alpaca-lora/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 488, in matmul
    return MatMul8bitLt.apply(A, B, out, bias, state)
  File "/home/dev/miniconda/envs/alpaca-lora/lib/python3.10/site-packages/torch/autograd/function.py", line 506, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/home/dev/miniconda/envs/alpaca-lora/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 360, in forward
    outliers = state.CB[:, state.idx.long()].clone()
TypeError: 'NoneType' object is not subscriptable

Is anyone able to successfully generate w/ 13b "out of the box"?

@NanoCode012
Contributor Author

@randerzander, I'm surprised your loss ended up fine. Mine reached 0, sometimes spiked to a few thousand, then went back to zero, with eval loss being NaN.

@Charleshhy

Charleshhy commented Mar 28, 2023

Facing the same issue: the loss is 0 for the 13B model and extremely large (more than 10,000 after 0.5 epochs) for the 7B model on the cleaned dataset. I am using a V100 32G. When training with a single RTX 3090 GPU, the loss seems fine for now.

Follow-up: RTX 3090 and A100 both work for me. But once I use the V100 32G, the same issue appears...

@xiaoyangmai

Facing the same issue: the loss is 0 for the 13B model and extremely large (more than 10,000 after 0.5 epochs) for the 7B model on the cleaned dataset. I am using a V100 32G. When training with a single RTX 3090 GPU, the loss seems fine for now.

I encountered the same problem with V100 32G.

@jaideep11061982

@ElleLeonne, what loss is used inside the model, and what evaluation metrics can one use here?

@BeHappyForMe

Yeah, the same problem!!

@Jiuzhouh

Bitsandbytes runs on hardware with 8-bit tensor core support, i.e. Turing and Ampere GPUs (RTX 20s, RTX 30s, A40-A100, T4+). I guess the V100 doesn't support INT8.
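A quick way to check this (assuming a CUDA build of PyTorch): bitsandbytes' fast int8 matmul needs compute capability 7.5 or higher, and the V100 reports 7.0, which matches the "Only slow 8-bit matmul" warning in the log above.

import torch

# Turing/Ampere (compute capability >= 7.5) have the int8 tensor cores that
# bitsandbytes' fast 8-bit matmul relies on; the V100 is 7.0.
major, minor = torch.cuda.get_device_capability(0)
print(f"compute capability: {major}.{minor}")
print("int8 tensor cores available:", (major, minor) >= (7, 5))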

@ghqing0310

So I discovered an issue where, when switching to a new dataset, the masking code just sets the dictionary key for the output to "" before calling the generate_prompt function. If you're using a different dictionary to train, then all it's doing is adding a blank key that never gets called, resulting in the model learning that every reply should be empty, as the attention mask always equals the full prompt.

I tried running another pass, but was still unsuccessful. Will continue until I resolve my specific problem.

I may have met the same problem. How did you fix this code?

@ElleLeonne
Contributor

ElleLeonne commented Apr 1, 2023 via email

@Dr-Corgi

Dr-Corgi commented Apr 7, 2023

Hi, I faced the same problem and fixed it by reinstalling the peft and transformers packages.

@LiuPearl1

Hi, I faced the same problem and fixed it by reinstalling the peft and transformers packages.

@Dr-Corgi Hi, my problem is that the training loss is always 0.0 when finetuning llama-13B on a V100. Is that the same as yours?
[screenshot of training log]

@zhihui-shao

My training loss on the V100 starts out very erratic and then goes to 0.

@NanoCode012
Contributor Author

Could one explanation be that fp16=True causes an overflow? I was wondering if anyone could test this by turning it off.

@qwjaskzxl

Facing the same issue: the loss is 0 for the 13B model and extremely large (more than 10,000 after 0.5 epochs) for the 7B model on the cleaned dataset. I am using a V100 32G. When training with a single RTX 3090 GPU, the loss seems fine for now.

Follow-up: RTX 3090 and A100 both work for me. But once I use the V100 32G, the same issue appears...

I encountered the same problem with A100 80G.

@lucasjinreal

The loss of 0 does not come from int8; it comes from the Hugging Face default of fp16=True. However, if you use int8 training on a V100, it is extremely slow, both because of the HF implementation (int8 is just very slow on a V100) and because the V100 itself has no int8 tensor cores to speed it up.

You have to enable DeepSpeed for loss scaling and gradient clipping, and you have to use fp16 training, otherwise it's very slow. A rough sketch is shown below.
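Roughly what that might look like (a sketch only; the values are DeepSpeed's documented defaults for dynamic loss scaling, not tuned settings):

# Minimal DeepSpeed config enabling dynamic fp16 loss scaling; pass it to the
# HF Trainer via TrainingArguments(deepspeed=ds_config, fp16=True, ...).
ds_config = {
    "fp16": {
        "enabled": True,
        "loss_scale": 0,            # 0 means dynamic loss scaling
        "initial_scale_power": 16,
        "loss_scale_window": 1000,
        "hysteresis": 2,
        "min_loss_scale": 1,
    },
    "gradient_clipping": 1.0,
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}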

@streamride

Sometimes the issue is a LoRA rank that is too low.
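For reference, a sketch of raising the rank with peft's LoraConfig (the default in this repo is r=8; the higher value below is just an example, not a recommendation):

from peft import LoraConfig

# A larger r gives the adapter more trainable parameters (at the cost of memory),
# which can help when the adapter is too small for the task.
config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)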

@TomasAndersonFang

Same problem

@TomasAndersonFang

Could one explanation be that fp16=True causes an overflow? I was wondering if anyone could test this by turning it off.

I tested LLaMA-7B with bf16 on an A100 and I also met the same problem.

@lyccyl1

lyccyl1 commented Dec 11, 2023

This may be due to hardware reasons. On some hardware, the quantized model is not compatible with fp16. You can try setting fp16=False; it works for me.

@zjulgc

zjulgc commented Jan 12, 2024

Facing the same issue: the loss is 0 for the 13B model and extremely large (more than 10,000 after 0.5 epochs) for the 7B model on the cleaned dataset. I am using a V100 32G. When training with a single RTX 3090 GPU, the loss seems fine for now.
Follow-up: RTX 3090 and A100 both work for me. But once I use the V100 32G, the same issue appears...

I encountered the same problem with A100 80G.

I solved it by setting bf16=True instead of fp16=True in TrainingArguments.
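A minimal sketch of that change (argument names are from transformers.TrainingArguments; the other values are placeholders):

from transformers import TrainingArguments

# bf16 has the same exponent range as fp32, so it avoids the fp16 overflow/underflow
# that can zero out the reported loss. It needs Ampere-or-newer GPUs (e.g. A100).
training_args = TrainingArguments(
    output_dir="lora-alpaca-13b",
    per_device_train_batch_size=4,
    learning_rate=3e-4,
    fp16=False,  # turn fp16 off ...
    bf16=True,   # ... and use bf16 where the hardware supports it
)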
