More transformer models & depparse improvements
New transformer packages used
New transformer-based POS and depparse packages for multiple languages. If a language you want to use does not already have a default_accurate package, more can be added. Please file an issue on our GitHub for that.
| Lang Code | Language | Transformer Model |
|---|---|---|
| bg | Bulgarian | rmihaylov/bert-base-bg |
| el | Greek | nlpaueb/bert-base-greek-uncased-v1 |
| hr | Croatian | classla/bcms-bertic |
| ka | Georgian | xlm-roberta-large |
| mt | Maltese | MaCoCu/MaltBERTa |
| nl | Dutch | DTAI-KULeuven/robbert-2023-dutch-large |
| ru | Russian | DeepPavlov/rubert-base-cased |
| sl | Slovenian | EMBEDDIA/crosloengual-bert |
| sr | Serbian | classla/bcms-bertic |
| sv | Swedish | KBLab/bert-base-swedish-cased |
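As a rough sketch of how one of these packages might be loaded, assuming the languages in the table above are served through the `default_accurate` package name and that the models can be downloaded on first use. The names `ACCURATE_TRANSFORMERS` and `build_accurate_pipeline` are illustrative only and are not part of Stanza:

```python
# Transformer models from the table above, keyed by language code.
ACCURATE_TRANSFORMERS = {
    "bg": "rmihaylov/bert-base-bg",
    "el": "nlpaueb/bert-base-greek-uncased-v1",
    "hr": "classla/bcms-bertic",
    "ka": "xlm-roberta-large",
    "mt": "MaCoCu/MaltBERTa",
    "nl": "DTAI-KULeuven/robbert-2023-dutch-large",
    "ru": "DeepPavlov/rubert-base-cased",
    "sl": "EMBEDDIA/crosloengual-bert",
    "sr": "classla/bcms-bertic",
    "sv": "KBLab/bert-base-swedish-cased",
}


def build_accurate_pipeline(lang):
    """Hypothetical helper: build a pipeline using the default_accurate package.

    Raises KeyError if `lang` is not one of the languages listed above.
    """
    transformer = ACCURATE_TRANSFORMERS[lang]
    # Imported lazily so the table can be inspected without Stanza installed.
    import stanza

    nlp = stanza.Pipeline(lang, package="default_accurate")
    return nlp, transformer
```

Calling `build_accurate_pipeline("sv")` would then return a pipeline along with the name of the transformer it is expected to use.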
Dependency parser improvements
- Depparse bug fix: incorporate the bias in the biaffine model and properly transpose the inputs. Surprisingly, this did not change scores on average. #308
  - Please note that this update invalidates all locally trained depparse models. If you need help rebuilding a model, or want your model added to our distribution, please let us know.
- Depparse can train with a silver dataset: 1f2828d This can also be used to train with two different datasets, either in equal weights or weighted via the `--silver_weight` flag.
- Depparse training option: can finetune only the last N layers of a transformer. e7245e3
- Improved depparse optimizer scheduling: 0cf8654 The code was previously released in v1.11.1, but all of the models have now been retrained with the new training scheme. A small sample of results from a few transformer-based depparse models, trained with either a less strict stopping threshold or the two-stage optimization, shows a clear improvement on the dev sets (test scores improve similarly):
| Treebank (dev LAS, avg of 5 models) | 1 stage | 1 stage 2k | 1 stage 4k | 2 stage |
|---|---|---|---|---|
| de_gsd | 89.03 | 89.50 | 89.71 | 89.83 |
| en_ewt | 93.47 | 93.69 | 93.74 | 93.89 |
| fi_tdt | 92.16 | 92.56 | 92.69 | 93.15 |
| it_vit | 90.12 | 90.37 | 90.44 | 90.60 |
| ta_ttb | 71.26 | 71.39 | 71.45 | 72.19 |
| zh-hans_gsdsimp | 85.47 | 85.69 | 85.76 | 85.89 |
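For readers unfamiliar with the biaffine bias fix listed above, here is a minimal pure-Python sketch of what "incorporating the bias" means in a generic biaffine scorer: a constant 1 is appended to both input vectors, so the weight matrix also carries linear and constant bias terms. This is not Stanza's implementation; the function names and the row/column convention (rows for dependents, columns for heads) are assumptions made for illustration.

```python
def biaffine_score(dep, head, weight):
    """Score one (dependent, head) pair as [dep; 1]^T W [head; 1].

    `weight` has (len(dep) + 1) rows and (len(head) + 1) columns; the extra
    row and column hold the bias terms.
    """
    dep_b = list(dep) + [1.0]
    head_b = list(head) + [1.0]
    return sum(
        dep_b[i] * weight[i][j] * head_b[j]
        for i in range(len(dep_b))
        for j in range(len(head_b))
    )


def score_matrix(deps, heads, weight):
    """Return scores[i][j]: the score of word j as the head of word i."""
    return [[biaffine_score(d, h, weight) for h in heads] for d in deps]


# With an identity weight, the appended 1s contribute exactly 1.0 of bias:
# 1*3 + 2*4 + 1*1 = 12.0 (it would be 11.0 without the bias term).
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
example = biaffine_score([1.0, 2.0], [3.0, 4.0], identity)
```

Swapping the roles of `deps` and `heads` transposes the score matrix, which is why the orientation of the inputs matters when the scores are later read as "head of word i".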
MWT and Lemmatizer improvements
- "Smooth" MWT training by including a small fraction of non-MWT words in the training. #1568 This solves the problem of Finnish MWT having "t" at the end, but not at the start or middle, so that natural words with "t" at the start would send the seq2seq model haywire. #1562
- Include in the default Finnish models a small snippet of sentences with non-MWT tokenization for certain non-MWT words. This addresses the fact that some words, such as `tollei`, were treated as MWT. 380aecd
- The lemmatizer now trains with silver tags (lower scores on gold, but better performance on real-world text). #1567 829f22e 14a9739 This update will be used for retraining against UD 2.18 when it is available.
Bugfixes
- Bugfix for training new MWT models from scratch: the UD to internal format converter was not working. 31df8e3
- `download` etc. no longer automatically reset the logging level. The level is only changed if the user specifically sets it in the function call. #1551 #1418 #1569 Thank you @haoyu-haoyu
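A minimal sketch of the logging behavior described above, assuming Stanza's logger is registered under the name `stanza` and that `stanza.download` accepts an optional `logging_level` argument. The helper names are hypothetical:

```python
import logging

# Configure the level once, up front.
stanza_logger = logging.getLogger("stanza")
stanza_logger.setLevel(logging.ERROR)


def quiet_download(lang):
    """Hypothetical helper: download without passing logging_level.

    With this release, the level configured above should be left untouched.
    """
    import stanza  # imported lazily so the sketch loads without Stanza

    stanza.download(lang)
    return logging.getLevelName(stanza_logger.level)


def verbose_download(lang):
    """Hypothetical helper: explicitly request a different logging level."""
    import stanza

    stanza.download(lang, logging_level="INFO")
    return logging.getLevelName(stanza_logger.level)
```

Only the second helper asks for a level change, so only that call is expected to alter the logger.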
Interface improvements
- Update usage of morphseg: the latest version has a cleaner interface to the underlying model. #1550 Thank you @TheWelcomer
- Multi-doc wrapper to `bulk_process`. Thank you @Rakshitha-Ireddi #1570
- Utils for processing the coref output format. Thank you @Rakshitha-Ireddi #1571