Loss not matching #344

Open
ghost opened this issue Apr 17, 2024 · 2 comments

ghost commented Apr 17, 2024

Hi team,
I tried QLoRA fine-tuning of a 30B Llama model with Unsloth and found that there is not much improvement in speed or memory usage. The details are as follows:
seq_length=8192
batch size=1
use flash attn=true
gradient_checkpointing=true

With unsloth:

0%|          | 0/52 [00:00<?, ?it/s]
  2%|▏         | 1/52 [00:38<32:45, 38.54s/it]
  4%|▍         | 2/52 [01:15<31:25, 37.71s/it]
  6%|▌         | 3/52 [01:52<30:37, 37.50s/it]
  8%|▊         | 4/52 [02:30<29:54, 37.39s/it]
 10%|▉         | 5/52 [03:07<29:13, 37.31s/it]
{'loss': **4.7581**, 'grad_norm': 3.063769578933716, 'learning_rate': 9.911436253643445e-05, 'epoch': 0.1, 'num_input_tokens_seen': 162198}

Without unsloth:

0%|          | 0/52 [00:00<?, ?it/s]
  2%|▏         | 1/52 [00:41<35:08, 41.35s/it]
  4%|▍         | 2/52 [01:21<33:59, 40.79s/it]
  6%|▌         | 3/52 [02:02<33:13, 40.69s/it]
  8%|▊         | 4/52 [02:42<32:30, 40.63s/it]
 10%|▉         | 5/52 [03:23<31:48, 40.60s/it]
{'loss': **0.8759**, 'grad_norm': 0.32929742336273193, 'learning_rate': 9.911436253643445e-05, 'epoch': 0.1, 'num_input_tokens_seen': 162198}

1. The speed improved by only about 3 s/it (≈37.3 s/it vs ≈40.6 s/it), which is far from the acceleration ratio mentioned in the documentation.
2. nvidia-smi shows 35 GB with unsloth vs 39 GB without, only about 10% less memory.
3. The loss value with unsloth is abnormal (4.7581 vs 0.8759 at the same step).

Here is the code:

```python
model, _ = FastLanguageModel.from_pretrained(
    model_name=model_kwargs['model_id_or_path'],
    max_seq_length=8192,
    dtype=None,
    load_in_4bit=True,
    low_cpu_mem_usage=True,
    device_map='auto',
    trust_remote_code=True,
    attn_implementation="flash_attention_2",
)
model = FastLanguageModel.get_peft_model(
    model,
    lora_alpha=model_args.lora_alpha,
    lora_dropout=model_args.lora_dropout,
    r=model_args.lora_r,
    target_modules=model_args.lora_target_modules.split(","),
    use_gradient_checkpointing=True,
    random_state=training_args.seed,
    max_seq_length=8192,
)
trainer = SFTTrainer(
    model=model,
    ...
)
```
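
For comparison, the non-Unsloth baseline follows the usual transformers + peft + bitsandbytes QLoRA pattern; a rough sketch is below (illustrative only, it may not match the exact baseline script, which is not shown here):

```python
# Rough sketch of a plain-HF QLoRA baseline for comparison purposes only.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_kwargs['model_id_or_path'],
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
    attn_implementation="flash_attention_2",
)
# Enable gradient checkpointing and prepare the 4-bit model for training.
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)
lora_config = LoraConfig(
    r=model_args.lora_r,
    lora_alpha=model_args.lora_alpha,
    lora_dropout=model_args.lora_dropout,
    target_modules=model_args.lora_target_modules.split(","),
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
```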

Is there some setting I'm missing? Looking forward to your reply.

@shimmyshimmer
Collaborator

Hey @mxjyst. Do you have a reproducible example for the non-Unsloth run? Have you tried our Colab notebooks to confirm?

Also, did you benchmark with Unsloth first and then HF in the same script? Unsloth patches the model code when it loads, so that ordering can affect the HF run.
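
Concretely, one way to keep the two measurements independent is to run each configuration in its own process, e.g. (a sketch; the script names here are hypothetical placeholders):

```python
# Run each benchmark in a separate Python process so that Unsloth's
# in-process patching cannot influence the plain-HF baseline run.
# The script names below are hypothetical placeholders.
import subprocess

for script in ["train_unsloth.py", "train_hf_baseline.py"]:
    subprocess.run(
        ["python", script,
         "--max_seq_length", "8192",
         "--per_device_train_batch_size", "1"],
        check=True,
    )
```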

All our benchmarking code is public for everyone to verify; see our HF blog post https://huggingface.co/blog/unsloth-trl, in which HF did third-party benchmarking. Likewise, LLaMA Factory and many others have confirmed our benchmarks.

See LLaMA Factory's research paper: https://twitter.com/danielhanchen/status/1770870732475469926, which shows the OSS package is the world's fastest by a large margin.

In terms of the loss diverging, that's very abnormal. Can you reproduce this via a Colab notebook?

danielhanchen changed the title from "No performance improvement with unsloth" to "Loss not matching" on Apr 17, 2024
@danielhanchen
Contributor

@mxjyst Interesting on the loss not matching - would you be able to provide a reproducible example via Colab?
