PEFT integration
Integrating PEFT into several different annotators
We integrate PEFT into our training pipeline for several different models. This greatly reduces the size of models with finetuned transformers, letting us make the finetuned versions of those models the default_accurate model.
The biggest gains observed are with the constituency parser and the sentiment classifier.
Previously, the default_accurate package used transformers where the head was trained but the transformer itself was not finetuned.
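As a minimal sketch of picking up these models, the PEFT-finetuned transformers are loaded by requesting the default_accurate package; the language and example text below are only illustrative:

```python
import stanza

# Download and load the default_accurate package, which now includes the
# finetuned / PEFT-finetuned transformer models.  "en" is just an example.
stanza.download("en", package="default_accurate")
nlp = stanza.Pipeline("en", package="default_accurate")

doc = nlp("Stanza now ships PEFT-finetuned transformers in its accurate package.")
print(doc)
```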
Model improvements
- POS trained with a split optimizer for the transformer and non-transformer components; unfortunately, we did not find settings which consistently improved results. #1320
- Sentiment trained with PEFT on the transformer: noticeably improves results for each model. SST scores go from 68 F1 with charlm, to 70 F1 with a transformer, to 74-75 F1 with a finetuned or PEFT-finetuned transformer (see the usage sketch after this list). #1335
- NER also trained with PEFT: unfortunately, no consistent improvements to scores. #1336
- depparse includes PEFT: no consistent improvements yet. #1337 #1344
- Dynamic oracle for the top-down constituency parser scheme: noticeable improvement in the scores for the top-down parser. #1341
- Constituency parser uses PEFT: this produces significant improvements, close to the full benefit of finetuning the entire transformer when training constituency parsers. Example improvement: 87.01 to 88.11 on the ID_ICON dataset. #1347
- Scripts to build a silver dataset for the constituency parser, filtering sentences by agreement among the sub-models of the ensembles used. Preliminary work indicates the silver trees become more beneficial, with more work needed to find the optimal parameters for building the silver dataset. #1348
- Lemmatizer ignores goeswith words when training: this eliminates words which are a single word labeled with a single lemma but split into two tokens in the UD training data. A typical example would be split email addresses in the EWT training set. #1346 #1345
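For reference, a short sketch of running the sentiment and constituency annotators mentioned above through the accurate package; the language, processor list, and example text are illustrative only:

```python
import stanza

# Constituency parsing needs POS tags; sentiment runs on tokenized sentences.
nlp = stanza.Pipeline("en",
                      processors="tokenize,pos,constituency,sentiment",
                      package="default_accurate")

doc = nlp("The new parser is a clear improvement.")
for sentence in doc.sentences:
    # sentiment is an integer class (0=negative, 1=neutral, 2=positive for the English SST model)
    print(sentence.sentiment)
    # constituency is the predicted parse tree for the sentence
    print(sentence.constituency)
```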
Features
- Include SpacesAfter annotations on words in the CoNLL output of documents (see the sketch after this list). #1315 #1322
- Lemmatizer operates in caseless mode if all of its training data was caseless. Most relevant to the UD Latin treebanks. #1331 #1330
- wandb support for coref #1338
- Coref annotator breaks length ties using POS if available #1326 c4c3de5
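A minimal sketch of producing the CoNLL output where whitespace annotations such as SpaceAfter / SpacesAfter appear in the MISC column; the filename and text are arbitrary, and the `"{:C}"` format shortcut is assumed to be available in recent versions:

```python
import stanza
from stanza.utils.conll import CoNLL

nlp = stanza.Pipeline("en", processors="tokenize,pos,lemma,depparse")
doc = nlp("Multiple   spaces  and\nnewlines show up in the MISC column.")

# Write the document in CoNLL-U format to a file.
CoNLL.write_doc2conll(doc, "document.conllu")

# The same CoNLL-U text as a string (assumed format shortcut).
print("{:C}".format(doc))
```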
Bugfixes
- Using a proxy with download_resources_json was broken: #1318 #1317 Thank you @ider-zh
- Fix deprecation warnings for escape sequences: #1321 #1293 Thank you @sterliakov
- Fixed a rounding error in coref training. #1342
- Top-down constituency models were broken for datasets which did not use ROOT as the top-level bracket; in practice this only affected DA_Arboretum. #1354
- First version of chopping longer texts into shorter pieces for the transformers, to get around length limits. It is not yet clear whether this produces reasonable results for words after the token limit. #1350 #1294
- Coref prediction had an off-by-one error for short sentences, falsely throwing an exception at sentence breaks: #1333 #1339 f1fbaaa
- Clarify error when a language is only partially handled: da01644 #1310