
Training configuration and hardware spec #8

Open
bangawayoo opened this issue Apr 18, 2024 · 3 comments

Comments

@bangawayoo

Hi!
Congratulations on the very interesting work, and thank you for releasing the code :)

I am running some experiments and would like to reproduce some results.
I had some questions regarding the training configurations.

  1. From reading the instructions, I assume you did full fine-tuning. Could you confirm this?

  2. When training the 7B model using LMFlow, I run into a CPU OOM on a server with 220GB of RAM. I believe this is abnormal and may be a problem on my side. If you recall how much CPU memory was required, could you tell me?

  3. Which LLaMA weights did you use? If you used the ones on Hugging Face, could you tell me the repo id?

Thanks.

@hanningzhang
Contributor

Thank you for your questions.

  1. Yes, we did full fine-tuning for all the models.
  2. Fine-tuning the 7B model usually consumes about 215GB of CPU memory, so 220GB leaves very little headroom. You may try ZeRO-2 if ZeRO-3 is consuming too much CPU memory.
  3. We are using huggyllama/llama-7b, huggyllama/llama-13b, and openlm-research/open_llama_3b on Hugging Face; a minimal loading sketch is below.
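
If it helps, here is a minimal loading sketch with `transformers` (just an illustration of the checkpoints listed above, not our exact training setup):

```python
# Minimal sketch: loading one of the listed checkpoints with Hugging Face transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "huggyllama/llama-7b"  # or "huggyllama/llama-13b", "openlm-research/open_llama_3b"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
```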

@bangawayoo
Author

Thanks for the reply.
I was able to run full fine-tuning by using ZeRO-2 without offload! Roughly the config I used is sketched below.
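
For reference, this is roughly the DeepSpeed config I ended up with, written out as a Python dict (a sketch from memory; the exact values are my own choices and may differ from LMFlow's shipped configs):

```python
# Sketch of a ZeRO-2 config without CPU offload (assumed values, not an official LMFlow file).
import json

ds_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,                  # ZeRO-2: shard optimizer states and gradients
        "overlap_comm": True,
        "contiguous_gradients": True,
        # no "offload_optimizer" block, so optimizer states stay on GPU
        # and host RAM usage stays low
    },
    "gradient_accumulation_steps": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}

with open("ds_config_zero2.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```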

@bangawayoo
Author

Hi,
I am trying to replicate the results following your reply, but I still need some help.

All the models were trained for 1 epoch using full fine-tuning with lr=2e-5.
For the 7b model on ParaRel-ID, I obtained an AP score of 0.84.

For the 3b model on the same dataset, the result was closer to that of the paper, with a score of 0.90.

Oddly, the data distribution obtained from the supervised identification strategy (Figure 6) seemed correct for the 3b model but slightly off for the 7b model. For 7b-ParaRel, I obtained 40.4% certain data, which is slightly lower than the 42% reported in the figure.

To estimate the confidence, the paper mentions a weighted average of the "{sure, unsure}" token probability and the token probability of the answer prediction. However, calculate_ap.py uses an equal-weight average (0.5 * sample[1] + 0.5 * sample[2]).
Is this the correct implementation? My reading of it is sketched below.
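
For clarity, this is how I am reading that part (a sketch with an assumed field order for `sample`: sample[0] as the correctness label, sample[1] and sample[2] as the two probabilities):

```python
# My reading of the scoring (not the actual calculate_ap.py):
# confidence = equal-weight average of the answer token probability and the
# "sure" token probability, then average precision over all samples.
from sklearn.metrics import average_precision_score

def compute_ap(samples):
    labels = [s[0] for s in samples]                     # 1 if the prediction is correct, else 0
    scores = [0.5 * s[1] + 0.5 * s[2] for s in samples]  # equal weights rather than a tuned weighting
    return average_precision_score(labels, scores)
```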

Do you have a guess as to what might be the cause?
I really appreciate the help!
