Peft sentiment #1335
Conversation
…arlm. This is hard to diagnose for the models which were not previously saved with this information
Just some clarification questions. See comments below; thanks in advance!
target_modules=["query", "value", "output.dense", "intermediate.dense"],  # self.config.lora_targets
lora_alpha=128,  # self.config.lora_alpha
lora_dropout=0.1,  # self.config.lora_dropout
modules_to_save=["pooler"],  # self.config.lora_fully_tune
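For context, a minimal sketch of how these hard-coded values might slot into a full `LoraConfig` (assuming the Hugging Face `peft` library; the rank `r` below is illustrative and not a value from this PR):

```python
from peft import LoraConfig

# Sketch only: r=64 is an assumed rank, not a setting taken from this PR.
lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    lora_dropout=0.1,
    target_modules=["query", "value", "output.dense", "intermediate.dense"],
    # Layers listed here are fully trained and saved alongside the LoRA adapters,
    # which is what makes the pooler question a size/accuracy tradeoff.
    modules_to_save=["pooler"],
)
```

Modules in `modules_to_save` are checkpointed in full rather than as low-rank adapters, so adding the pooler there grows the saved model.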
this doesn't exist for all languages' BERTs, and the NER results I reported didn't use this. how much of an improvement / size tradeoff does this confer?
Good question. I can run that experiment.
the experiments were run with electra, which doesn't have a pooler. listing it as fully trained doesn't make a difference there. can rerun the trial with roberta. i remember finetuning the pooler to be important with coref, although i don't have the results in front of me
Averaged over 4 runs with roberta-large, I got 0.7391 F1 w/ the pooler and 0.7422 w/o. Fully training the pooler increased the size from 163M to 167M. My conclusion is that we don't train the pooler for sentiment. I can always try pefting it instead of fully training it.
never mind on that, can't peft a pooler
…ft-based test for sentiment
Ran an experiment with 4 models where finetuning the pooler for roberta-large on sstplus got 0.7391 average F1, whereas not finetuning got 0.7422 average. Considering the model size increase (163M -> 167M), it seems not worthwhile to finetune this layer.
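The tradeoff above can be spelled out with a bit of arithmetic (numbers taken directly from the runs reported in this thread):

```python
# Average F1 from the 4-run roberta-large experiment on sstplus.
f1_with_pooler = 0.7391
f1_without_pooler = 0.7422

# Saved model sizes in millions of parameters.
size_with_pooler = 167
size_without_pooler = 163

# Dropping the pooler both improves F1 and shrinks the checkpoint.
f1_delta = f1_without_pooler - f1_with_pooler
size_increase_pct = (size_with_pooler - size_without_pooler) / size_without_pooler * 100

print(f"F1 change from dropping the pooler: +{f1_delta:.4f}")
print(f"size cost of fully training the pooler: {size_increase_pct:.1f}%")
```

So fully training the pooler costs roughly 2.5% more parameters while slightly hurting average F1, which supports the conclusion to leave it out.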
Add a PEFT wrapper for the Sentiment training.
Works quite well on English, actually, even without splitting the optimizer or implementing any form of scheduling.
With no finetuning, adding electra-large to the 3 class English dataset (SST plus a few other pieces) gets 70 Macro F1.
The base finetuning gets between 74-75 macro F1 on sstplus, but frequently fails to train successfully, getting somewhere around 60 F1.
Training with PEFT gets in the 74-75 F1 range each time, with no failures observed so far.
Adds a test to the sentiment training which starts the Pipeline with a peft-trained model.
Also included is a uses-charlm flag in the config, so that inadvertently passing a charlm (such as via Pipeline) to a sentiment model trained without a charlm doesn't blow up.
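A minimal sketch of how such a flag could guard charlm handling at load time (the function and the `uses_charlm` key are illustrative stand-ins, not the actual Stanza names):

```python
def maybe_attach_charlm(config, charlm_forward=None, charlm_backward=None):
    """Drop any charlms the Pipeline passes in if the model was trained without one.

    `uses_charlm` is a hypothetical stand-in for the flag added in this PR;
    the real key name and config layout may differ.
    """
    if not config.get("uses_charlm", False):
        # Model was trained without a charlm: ignore whatever the Pipeline
        # hands us instead of blowing up on a mismatched architecture.
        return None, None
    return charlm_forward, charlm_backward
```

A model saved before this flag existed simply defaults to "no charlm", which is the hard-to-diagnose case the PR description mentions.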