
More transformer models & depparse improvements


@AngledLuffa released this 14 May 23:26

New transformer packages used

New transformer-based POS and depparse packages are available for multiple languages. More can be added if a language you want to use does not already have a default_accurate package! Please file an issue on our GitHub for that.

Lang Code   Language    Transformer Model
bg          Bulgarian   rmihaylov/bert-base-bg
el          Greek       nlpaueb/bert-base-greek-uncased-v1
hr          Croatian    classla/bcms-bertic
ka          Georgian    xlm-roberta-large
mt          Maltese     MaCoCu/MaltBERTa
nl          Dutch       DTAI-KULeuven/robbert-2023-dutch-large
ru          Russian     DeepPavlov/rubert-base-cased
sl          Slovenian   EMBEDDIA/crosloengual-bert
sr          Serbian     classla/bcms-bertic
sv          Swedish     KBLab/bert-base-swedish-cased
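
As a rough usage sketch, the packages in the table can presumably be loaded through Stanza's `package="default_accurate"` option. The `TRANSFORMERS` dict and the `build_accurate_pipeline` helper below are illustrative names only and are not part of Stanza. Building a pipeline downloads models on first use, so it needs network access.

```python
# Language codes and transformer names copied from the table above.
TRANSFORMERS = {
    "bg": "rmihaylov/bert-base-bg",
    "el": "nlpaueb/bert-base-greek-uncased-v1",
    "hr": "classla/bcms-bertic",
    "ka": "xlm-roberta-large",
    "mt": "MaCoCu/MaltBERTa",
    "nl": "DTAI-KULeuven/robbert-2023-dutch-large",
    "ru": "DeepPavlov/rubert-base-cased",
    "sl": "EMBEDDIA/crosloengual-bert",
    "sr": "classla/bcms-bertic",
    "sv": "KBLab/bert-base-swedish-cased",
}


def build_accurate_pipeline(lang):
    """Hypothetical helper: build a pipeline from the default_accurate package.

    Assumes the language is one of those listed in this release.
    """
    if lang not in TRANSFORMERS:
        raise ValueError(f"{lang!r} is not listed in this release's table")
    # Imported lazily so the table can be inspected without stanza installed.
    import stanza
    return stanza.Pipeline(lang, package="default_accurate")


# Example (not run here, since it downloads models):
#   nlp = build_accurate_pipeline("nl")
#   doc = nlp("Dit is een zin.")
#   for word in doc.sentences[0].words:
#       print(word.text, word.upos, word.head, word.deprel)
```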

Dependency parser improvements

  • Depparse bug fix: incorporate the bias in the biaffine model. Also, properly transpose the inputs. This actually did not change scores on average, weirdly enough. #308

    • Please note that this update invalidates all locally trained depparse models. If you need help rebuilding a model, or want your model added to our distribution, please let us know.
  • Depparse can train with a silver dataset: 1f2828d. This can also be used to train with two different datasets, either in equal weights or weighted via the --silver_weight flag.

  • Depparse training option: can finetune only the last N layers of a transformer e7245e3

  • Improved depparse optimizer scheduling - 0cf8654 The code was previously released in v1.11.1, but now all of the models are retrained with the new training scheme. A small sample of results on a few transformer-based depparse models shows clear improvement in dev scores from either a less strict stopping threshold or the two stage optimization (test scores improve similarly):

5 model dev avg LAS   1 stage   1 stage 2k   1 stage 4k   2 stage
de_gsd                89.03     89.50        89.71        89.83
en_ewt                93.47     93.69        93.74        93.89
fi_tdt                92.16     92.56        92.69        93.15
it_vit                90.12     90.37        90.44        90.60
ta_ttb                71.26     71.39        71.45        72.19
zh-hans_gsdsimp       85.47     85.69        85.76        85.89
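
For readers unfamiliar with the biaffine bias fix mentioned at the top of this section, here is a minimal pure-Python sketch of biaffine scoring with bias terms. It is not Stanza's implementation; the function name and the toy weight matrix are assumptions for illustration. Appending a constant 1 to each input lets the last row and column of the weight matrix act as linear and constant bias terms, and because the matrix is not symmetric, swapping which vector is treated as head and which as dependent changes the score.

```python
def biaffine_score(head_vec, dep_vec, weight):
    """Score one (head, dependent) pair as [head; 1]^T W [dep; 1].

    `weight` has shape (len(head_vec) + 1) x (len(dep_vec) + 1).
    """
    h = list(head_vec) + [1.0]  # the extra 1 picks up the bias row of W
    d = list(dep_vec) + [1.0]   # the extra 1 picks up the bias column of W
    if len(weight) != len(h) or any(len(row) != len(d) for row in weight):
        raise ValueError("weight must be (len(head)+1) x (len(dep)+1)")
    return sum(
        h[i] * weight[i][j] * d[j]
        for i in range(len(h))
        for j in range(len(d))
    )


# Toy weights: the top-left 2x2 block is the bilinear part,
# the last row and column hold the bias terms.
W = [
    [1.0, 0.0, 0.5],
    [0.0, 2.0, 0.0],
    [0.0, 0.25, 3.0],
]

score = biaffine_score([1.0, 2.0], [3.0, 4.0], W)
swapped = biaffine_score([3.0, 4.0], [1.0, 2.0], W)
print(score, swapped)  # the two orientations give different scores
```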

MWT and Lemmatizer improvements

  • "Smooth" MWT training by including a small fraction of non-MWT words in the training. #1568 Solves the problem of Finnish MWT having "t" at the end, but not at the start or middle, so natural words with "t" at the start would lead to the seq2seq model going haywire. #1562

  • Include in the default Finnish models a small snippet of sentences with non-MWT tokenization for certain non-MWT words. Addresses the problem that some words, such as tollei, were treated as MWT. 380aecd

  • The lemmatizer trains with silver tags (lower scores on gold, but better performance against real world text) #1567 829f22e 14a9739 This update will be used for retraining against UD 2.18 when it is available.
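
The "smooth" MWT training idea above can be sketched roughly as follows: mix a small sample of ordinary words, each mapped to itself, into the MWT expansion training pairs so the seq2seq model also sees words that should be left alone. This is only an illustration under stated assumptions. The function name, the `fraction` parameter, and the data layout are hypothetical and do not reflect Stanza's actual training code.

```python
import random


def smooth_mwt_training(mwt_pairs, plain_words, fraction=0.1, seed=0):
    """Return the MWT pairs plus a sample of plain words mapped to themselves.

    `fraction` is the share of `plain_words` to include (an assumed knob).
    """
    rng = random.Random(seed)  # seeded so the sample is reproducible
    k = round(len(plain_words) * fraction)
    sampled = rng.sample(plain_words, k)
    # A non-MWT word "expands" to a single piece: itself.
    identity_pairs = [(word, [word]) for word in sampled]
    return list(mwt_pairs) + identity_pairs


# Illustrative data: one Finnish MWT and ten ordinary words.
mwt_pairs = [("ettei", ["että", "ei"])]
plain_words = ["tollei", "talo", "tie", "tuli", "tammi",
               "kissa", "koira", "vesi", "puu", "kivi"]

training = smooth_mwt_training(mwt_pairs, plain_words, fraction=0.2)
print(len(training))
```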

Bugfixes

  • Bugfix for training new MWT models from scratch - UD to internal format converter was not working 31df8e3

  • Make it so download etc. don't automatically reset the logging level. That only happens if the user specifically sets the logging level in the function call #1551 #1418 #1569 Thank you @haoyu-haoyu
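
The logging behavior described above follows a common pattern: leave the logger's level alone unless the caller explicitly passes one. The sketch below uses a stand-in `download` function and a hypothetical logger name to show the pattern; it is not Stanza's code.

```python
import logging

# Hypothetical package logger, standing in for the library's own logger.
logger = logging.getLogger("demo_nlp")


def download(lang, logging_level=None):
    """Stand-in for a download-style entry point.

    The logger level is changed only when the caller asks for it.
    """
    if logging_level is not None:
        logger.setLevel(logging_level)
    logger.info("downloading resources for %s", lang)
    return logger.level


logger.setLevel(logging.WARNING)             # user's own configuration
download("fi")                               # level stays at WARNING
download("fi", logging_level=logging.DEBUG)  # level changes only on request
```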

Interface improvements