
NTK RoPE scaling. #115

Closed
alkeryn opened this issue Jun 29, 2023 · 24 comments

@alkeryn

alkeryn commented Jun 29, 2023

According to this post, there is a method of RoPE scaling that results in less perplexity loss and allows a larger scaling factor:
https://www.reddit.com/r/LocalLLaMA/comments/14lz7j5/ntkaware_scaled_rope_allows_llama_models_to_have/

The code can be found in this notebook:
https://colab.research.google.com/drive/1VI2nhlyKvd5cw4-zHvAIk00cAVj2lCCC#scrollTo=b80b3f37

and the change itself seems to be small:

# The method is just these three lines
max_position_embeddings = 16384
a = 8  # Alpha value
base = base * a ** (dim / (dim - 2))  # Base change formula

Maybe it would be nice to add that option to exllama as well; with this technique, finetuning for higher context may not even be necessary.
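
For illustration, here's a minimal self-contained sketch of how that base change feeds into the RoPE inverse frequencies (assuming a standard LLaMA-style rotary embedding; the function name and head_dim value are just illustrative):

import torch

def ntk_scaled_inv_freq(head_dim, alpha, base = 10000.0):
    # Base change formula: base' = base * alpha ** (dim / (dim - 2))
    if alpha != 1.0:
        base = base * alpha ** (head_dim / (head_dim - 2))
    # Standard RoPE inverse frequencies, built from the (possibly scaled) base
    return 1.0 / (base ** (torch.arange(0, head_dim, 2, dtype = torch.float32) / head_dim))

# Example: LLaMA head_dim = 128, alpha = 8 as in the notebook
inv_freq = ntk_scaled_inv_freq(128, alpha = 8)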

@Panchovix
Contributor

Panchovix commented Jun 29, 2023

This sounds pretty good! But I'm wondering how it would be implemented in exllama. compress_pos_emb is already a RoPE scaler.

There's

rotary_embedding_base

But it seems to be used for training purposes.

@alkeryn
Author

alkeryn commented Jun 29, 2023

@Panchovix Someone posted this code on 4chan; I haven't had time to verify it as I'm on the move, but maybe that's it.
https://boards.4chan.org/g/thread/94354163#p94356720

@Panchovix
Contributor

Panchovix commented Jun 29, 2023

@alkeryn Thanks! It seems to work.

a = 4  # Like compress_pos_emb: higher alpha means higher perplexity but allows more context
self.rotary_embedding_base = self.rotary_embedding_base * a ** (self.head_dim / (self.head_dim - 2))

max_seq_len should be set the same way as with SuperHOT models (via -l).
Maybe it can be set like this in model.py:

self.alpha_value = 1.0  # Like compress_pos_emb: higher alpha means higher perplexity but allows more context

And like this in model_init.py:

parser.add_argument("-a", "--alpha", type = float, help = "alpha for context size extension via embedding extension")

...

if args.alpha:
    model_config.alpha_value = args.alpha # not exactly like this, but with this logic
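
To make the wiring concrete, here is a rough sketch of how the alpha value could be hooked into the rotary base once the arguments are parsed (hypothetical class and method names loosely following the snippets above, not the actual ExLlama code):

import argparse

class ConfigSketch:
    def __init__(self):
        self.head_dim = 128                    # LLaMA head dimension
        self.rotary_embedding_base = 10000.0   # default RoPE base
        self.alpha_value = 1.0                 # 1.0 = no NTK scaling

    def apply_ntk_alpha(self):
        # Apply the base change once, before the sin/cos tables are built
        self.rotary_embedding_base *= self.alpha_value ** (self.head_dim / (self.head_dim - 2))

parser = argparse.ArgumentParser()
parser.add_argument("-a", "--alpha", type = float, default = 1.0,
                    help = "alpha for context size extension via embedding extension")
args = parser.parse_args()

model_config = ConfigSketch()
model_config.alpha_value = args.alpha
model_config.apply_ntk_alpha()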

@Panchovix
Contributor

Okay, I did an experimental PR to see if turbo wants to add it, or maybe to test it some other way.

#118

@turboderp
Owner

I'd like to see some results from finetuning before I go and add even more config options. If I built out ExLlama every time someone had an interesting idea on reddit it'd be an unmaintainable behemoth by now. It's already kind of unwieldy.

@laoda513

Okay, I did an experimental PR to see if turbo wants to add it, or maybe to test it some other way.

#118

So to use this feature, should we first tune the model with a LoRA or something similar?
Since exllama does not support training right now, should I first use AutoGPTQ LoRA?

@Panchovix
Contributor

Panchovix commented Jun 30, 2023

@laoda513 For NTK RoPE scaling, finetuning is not needed. But based on my tests, SuperHOT models work better with compress_pos_emb scaling and NTK (alpha) scaling combined.

For now, no loader supports NTK RoPE.

That PR adds experimental support for exllama only, at the moment.
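
For reference, a minimal sketch contrasting the two knobs being combined here (standard RoPE math; the helper name and arguments are just illustrative): compress_pos_emb linearly interpolates the position indices, while alpha changes the frequency base.

import torch

def rope_cos_sin(seq_len, head_dim, base = 10000.0, compress_pos_emb = 1.0, alpha = 1.0):
    # NTK-aware scaling: change the frequency base
    if alpha != 1.0:
        base = base * alpha ** (head_dim / (head_dim - 2))
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2, dtype = torch.float32) / head_dim))
    # Linear (SuperHOT-style) scaling: divide the position indices
    positions = torch.arange(seq_len, dtype = torch.float32) / compress_pos_emb
    freqs = torch.outer(positions, inv_freq)
    return torch.cos(freqs), torch.sin(freqs)

# Both knobs at 4, as in the SuperHOT tests in this thread
cos, sin = rope_cos_sin(8192, 128, compress_pos_emb = 4.0, alpha = 4.0)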

@alkeryn
Author

alkeryn commented Jun 30, 2023

@Panchovix I don't quite understand how it would work better with compress_pos_emb + NTK scaling combined, but that's interesting. So you set 4 for each?
Though I think once we have NTK finetunes, they'll probably outperform SuperHOT + RoPE scaling, or even the mix of both.
Still, being able to use any model at any context length without a finetune is already great!

@ottobunge

I have tested the change and get better results with compression at 4 and alpha at 4.

Using TheBloke_nous-hermes-13b-superhot-8k-GPTQ-4bit-128g: if I only have either compression or NTK RoPE enabled, it tells me it cannot find the secret messages I left embedded in the paper, but with alpha 4 and compression 4 it retrieves them correctly.

@alkeryn
Author

alkeryn commented Jun 30, 2023

@ottobunge Interesting, have you tried alpha 8 or more with no compression on a normal model?
It would still be interesting to see finetunes made for NTK.

@ottobunge

At 8k, on Neko Institute LLaMA 13B 4bit 32g with alpha 8 and compression 1, I get nonsense.

[image]

@ottobunge

Trying alpha 10, and then alpha 4 + compression 4, on this same model to see the differences.

@ottobunge

[image: Alpha 10]

@ottobunge

The failure mode is worse at compression 4 + alpha 4 on plain LLaMA.
This model is probably not great at the task xD

[image]

@alkeryn
Author

alkeryn commented Jun 30, 2023

@ottobunge That makes sense, since the model was trained for 8k RoPE.
But I was asking about alpha 8 on a non-8k-finetuned model with no compression.

@ottobunge

That would be this
#115 (comment)

@ottobunge

ottobunge commented Jun 30, 2023

I'm downloading a non-finetuned version, but on the finetuned one I can run no compression at alpha 10 and get good results.

In fact, it follows the formatting of the prompt better than compression 4 + alpha 4.

@ottobunge

TheBloke_airoboros-13B-gpt4-1.4-GPTQ, so a non-finetuned model, at alpha 10:
it got 3/4 of the passphrases, but in the wrong order.

The correct order is in the second image.

[image 1]
[image 2: correct order]

@ottobunge

This is the best answer I got.

If I shift the proportion more toward one or the other, it starts by misspelling milkshake as milshake, or fails altogether if I change the proportion too much, guessing cherry as the 4th, banana as the 3rd, and missing milkshake.

[image]

[image]

@Panchovix
Contributor

I have updated the PR.

Before, the alpha value wasn't being applied correctly (it stayed at 1.0). Now it is applied correctly, so just setting alpha for NTK RoPE scaling should be enough (without needing to set compress_pos_emb to the same value).

@ottobunge @alkeryn Can you guys test and see how it goes now? Results are WAY different, and IMO, better.

@Panchovix
Contributor

For tulu-30B-GPTQ (non-SuperHOT)

  • Perplexity at 2048 ctx (no compress_pos_emb, no alpha RoPE): 5.2153
  • Perplexity at 8192 ctx, compress_pos_emb = 4: 10.0813
  • Perplexity at 8192 ctx, alpha = 4: 5.3534
  • Perplexity at 8192 ctx, compress_pos_emb = 4, alpha = 4: 15.4406

For Tulu-30B-SuperHOT-8K-4bit-32g:

  • Perplexity at 2048 ctx (compress_pos_emb = 1, no alpha RoPE): 53.2788 (Basically, for <2048 ctx don't use SuperHOT models)
  • Perplexity at 8192 ctx, compress_pos_emb = 4: 5.8166
  • Perplexity at 8192 ctx, alpha = 4: 7.5073
  • Perplexity at 8192 ctx, compress_pos_emb = 4, alpha = 4: 6.0903

Basically, it seems that NTK RoPE scaling is better than we expected.

@laoda513

laoda513 commented Jul 1, 2023

How about the memory cost increase for inference and training? Is it linear? For example, 1x at 2k and 2x at 4k?

And I think this is very exciting and interesting! When I think more about it: if we can easily extend a model trained at 2k to 8k, does that mean we can extend a model trained at 512 to 2k? And I think this does not really extend the 'attention'; it just uses the same amount of attention over a longer context, right? Kind of like... a human reading quickly...

@Panchovix
Contributor

Panchovix commented Jul 1, 2023

How about the memory cost increase for inference and training? Is it linear? For example, 1x at 2k and 2x at 4k?

And I think this is very exciting and interesting! When I think more about it: if we can easily extend a model trained at 2k to 8k, does that mean we can extend a model trained at 512 to 2k? And I think this does not really extend the 'attention'; it just uses the same amount of attention over a longer context, right? Kind of like... a human reading quickly...

For training itself, sadly I'm not sure how it would be applied :(.

Also, thanks turbo for the PR merge!

Now NTK RoPE scaling can be used on exllama.

@alkeryn
Author

alkeryn commented Jul 7, 2023

Thank you everyone, I'm closing the issue! :)
