
Training configuration and hardware spec #8

Open
bangawayoo opened this issue Apr 18, 2024 · 3 comments

Comments

@bangawayoo

Hi!
Congratulations on the very interesting work, and thank you for releasing the code :)

I am running some experiments and would like to reproduce some results.
I had some questions regarding the training configurations.

  1. From reading the instructions, I assume you did full fine-tuning. Could you confirm this?

  2. When training the 7B model using LMFlow, I run into a CPU OOM on a server with 220GB of RAM. I believe this is abnormal and may be a problem on my side. If you recall how much CPU memory was required, could you tell me?

  3. Which LLaMA weights did you use? If you used the ones on Hugging Face, could you tell me the repo id?

Thanks.

@hanningzhang
Contributor

Thank you for your questions.

  1. Yes, we did full fine-tuning for all the models.
  2. Fine-tuning the 7B model usually consumes about 215GB of CPU memory, so 220GB leaves very little headroom. You may try ZeRO-2 if ZeRO-3 is consuming too much CPU memory.
  3. We are using huggyllama/llama-7b, huggyllama/llama-13b, and openlm-research/open_llama_3b on Hugging Face; a minimal loading sketch is below.
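
If it helps, here is a minimal loading sketch with `transformers` (just an illustration of the checkpoints listed above, not our exact training setup):

```python
# Minimal sketch: loading one of the listed checkpoints with Hugging Face transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "huggyllama/llama-7b"  # or "huggyllama/llama-13b", "openlm-research/open_llama_3b"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
```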

@bangawayoo
Author

Thanks for the reply.
I was able to run full fine-tuning by using ZeRO-2 without offload! Roughly the config I used is sketched below.
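
For reference, this is roughly the DeepSpeed config I ended up with, written out as a Python dict (a sketch from memory; the exact values are my own choices and may differ from LMFlow's shipped configs):

```python
# Sketch of a ZeRO-2 config without CPU offload (assumed values, not an official LMFlow file).
import json

ds_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,                  # ZeRO-2: shard optimizer states and gradients
        "overlap_comm": True,
        "contiguous_gradients": True,
        # no "offload_optimizer" block, so optimizer states stay on GPU
        # and host RAM usage stays low
    },
    "gradient_accumulation_steps": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}

with open("ds_config_zero2.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```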

@bangawayoo
Author

Hi,
I am trying to replicate the results following your reply, but I still need some help.

All the models were trained for 1 epoch using full fine-tuning with lr=2e-5.
For the 7b model on ParaRel-ID, I obtained an AP score of 0.84.

For the 3b model on the same dataset, the result was closer to that of the paper, with a score of 0.90.

Oddly, the data distribution obtained from the supervised identification strategy (Figure 6) seemed correct for the 3b model but slightly off for the 7b model. For 7b-ParaRel, I obtained 40.4% certain data, which is slightly lower than the 42% reported in the figure.

To estimate the confidence, the paper mentions a weighted average of the "{sure, unsure}" token probability and the token probability of the answer prediction. However, calculate_ap.py uses an equal-weight average (0.5 * sample[1] + 0.5 * sample[2]).
Is this the correct implementation? My reading of it is sketched below.
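
For clarity, this is how I am reading that part (a sketch with an assumed field order for `sample`: sample[0] as the correctness label, sample[1] and sample[2] as the two probabilities):

```python
# My reading of the scoring (not the actual calculate_ap.py):
# confidence = equal-weight average of the answer token probability and the
# "sure" token probability, then average precision over all samples.
from sklearn.metrics import average_precision_score

def compute_ap(samples):
    labels = [s[0] for s in samples]                     # 1 if the prediction is correct, else 0
    scores = [0.5 * s[1] + 0.5 * s[2] for s in samples]  # equal weights rather than a tuned weighting
    return average_precision_score(labels, scores)
```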

Do you have a guess as to what might be the cause?
I really appreciate the help!
