Any plans to support phi-2 finetuning? #85

Open
asmith26 opened this issue Jan 11, 2024 · 8 comments

@asmith26

Hi there,

Just thought I'd ask if there were any plans to support phi-2 finetuning, https://huggingface.co/microsoft/phi-2?

Many thanks for any help, and this amazing lib!

@danielhanchen
Contributor

@asmith26 On it! :)

@imrankh46


I tried fine-tuning phi-2, and the results were not good.
I was just using Hugging Face Transformers code, not Unsloth.

@danielhanchen
Contributor

@imrankh46 Yeah, smaller models sometimes can't follow instructions well. Could you give a rough example showing Phi not following instructions?

@imrankh46


Here is the code:

https://colab.research.google.com/drive/1a7rL3UzWfo5I7OPyVmTEnR6_tRqIOblg?usp=sharing

@cm2435

cm2435 commented Jan 16, 2024

@danielhanchen I'm also interested in contributing to this, let me know if you have space on your PR for another helping hand.

@danielhanchen
Contributor

@cm2435 Oh more than happy to collab if you're into it!! I actually took a look at Phi the other day:

  1. Phi seems to not use SwiGLU but just a plain f(gate) @ down, whereas SwiGLU is (f(gate) * up) @ down. This means the backprop for the Triton kernels has to be re-derived (a rough sketch follows right after this list).
  2. Phi uses gelu instead of swish, and I think the first step is to derive the gradient for gelu. Gelu is f(x) = 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * torch.pow(x, 3.0)))), maybe via Desmos or derived by hand (a sanity-check sketch is at the end of this comment).
  3. Phi uses a general LayerNorm, not an RMS LayerNorm, so again new kernels have to be written - although I think Triton's tutorials already have this done for us: https://triton-lang.org/main/getting-started/tutorials/05-layer-norm.html
  4. There is some dropout (0.1) after each attention and MLP module, I think. Hopefully this shouldn't be hard to implement.
  5. I don't know what partial_rotary_factor is in https://huggingface.co/microsoft/phi-2/blob/main/config.json. Need to investigate.
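
To make (1) concrete, here's a rough, shape-only sketch of the two MLP blocks in plain PyTorch. The layer names and sizes are just illustrative, not the actual Unsloth kernels:

```python
import torch
import torch.nn.functional as F

hidden, intermediate = 2560, 10240   # rough phi-2-like sizes, purely illustrative
x = torch.randn(2, hidden)

# Llama-style SwiGLU block: (swish(gate(x)) * up(x)) @ down
gate = torch.nn.Linear(hidden, intermediate, bias=False)
up   = torch.nn.Linear(hidden, intermediate, bias=False)
down = torch.nn.Linear(intermediate, hidden, bias=False)
swiglu_out = down(F.silu(gate(x)) * up(x))

# Phi-style MLP as described in (1): gelu(fc1(x)) @ fc2 -- no elementwise up-projection gate
fc1 = torch.nn.Linear(hidden, intermediate, bias=True)
fc2 = torch.nn.Linear(intermediate, hidden, bias=True)
phi_out = fc2(F.gelu(fc1(x), approximate="tanh"))
```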

In general Phi is possible; there are just a few blockers, especially (2) and (3). But again - if you want to take a crack at this @cm2435 - I'll be super grateful + I'll collab with you! :) Taking a stab at a step like (2), finding the derivative, might be the best first step :)
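
For (2), here's a minimal sketch of the hand-derived gradient for the tanh-approximation gelu, checked against torch.autograd. This is only a numerical sanity check for the math, not the Triton backward kernel itself:

```python
import math
import torch

SQRT_2_OVER_PI = math.sqrt(2.0 / math.pi)

def gelu_tanh(x):
    inner = SQRT_2_OVER_PI * (x + 0.044715 * x.pow(3.0))
    return 0.5 * x * (1.0 + torch.tanh(inner))

def gelu_tanh_grad(x):
    # d/dx [0.5 * x * (1 + tanh(u))] with u = sqrt(2/pi) * (x + 0.044715 * x^3)
    inner = SQRT_2_OVER_PI * (x + 0.044715 * x.pow(3.0))
    t = torch.tanh(inner)
    d_inner = SQRT_2_OVER_PI * (1.0 + 3.0 * 0.044715 * x.pow(2.0))
    return 0.5 * (1.0 + t) + 0.5 * x * (1.0 - t * t) * d_inner

x = torch.randn(4096, dtype=torch.float64, requires_grad=True)
gelu_tanh(x).sum().backward()
assert torch.allclose(x.grad, gelu_tanh_grad(x.detach()), atol=1e-10)
```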

@cm2435

cm2435 commented Jan 18, 2024

@danielhanchen Started a fork and opened a staging PR to work off. I can start by adding some unit test coverage around the kernels if we need to test their accuracy, something simple like

assert torch.allclose(triton_out, torch_out)

or I can try to contribute some of the other steps you mentioned - what's your preference?
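
For example, a minimal version of that check might look like the sketch below; triton_fn / torch_fn are placeholders for whichever kernel and PyTorch reference we're comparing, and the tolerances would need tuning for fp16:

```python
import torch

def check_kernel(triton_fn, torch_fn, shape=(4, 1024, 2560), dtype=torch.float16):
    # compare a Triton kernel against a PyTorch reference on random inputs
    x = torch.randn(*shape, dtype=dtype, device="cuda")
    triton_out = triton_fn(x)
    torch_out  = torch_fn(x)
    # fp16 kernels won't match the reference bit-for-bit, so use loose-ish tolerances
    assert torch.allclose(triton_out, torch_out, atol=1e-2, rtol=1e-2)
```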

danielhanchen added the currently fixing (Am fixing now!) and on roadmap (Feature request on roadmap) labels on Jan 27, 2024
@danielhanchen
Contributor

@cm2435 Oh lmao, just noticed I never responded on this thread, whoops! Well, I responded on your other threads already I guess - sorry!!!
