Release More transformer models & depparse improvements · stanfordnlp/stanza

New transformer packages used

New transformer-based POS and depparse packages are available for multiple languages. If a language you want to use does not already have a default_accurate package, more can be added! Please file an issue on our GitHub for that.

Lang Code	Language	Transformer Model
bg	Bulgarian	`rmihaylov/bert-base-bg`
el	Greek	`nlpaueb/bert-base-greek-uncased-v1`
hr	Croatian	`classla/bcms-bertic`
ka	Georgian	`xlm-roberta-large`
mt	Maltese	`MaCoCu/MaltBERTa`
nl	Dutch	`DTAI-KULeuven/robbert-2023-dutch-large`
ru	Russian	`DeepPavlov/rubert-base-cased`
sl	Slovenian	`EMBEDDIA/crosloengual-bert`
sr	Serbian	`classla/bcms-bertic`
sv	Swedish	`KBLab/bert-base-swedish-cased`
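
As a rough sketch of how one of these packages might be loaded: the mapping below is copied from the table above, and `build_pipeline` is a hypothetical helper name. It assumes Stanza is installed and that the language's `default_accurate` package is the one backed by the listed transformer, which the table suggests but the snippet does not verify. Building the pipeline may download models on first use.

```python
# Language code -> transformer model, copied from the table above.
NEW_TRANSFORMERS = {
    "bg": "rmihaylov/bert-base-bg",
    "el": "nlpaueb/bert-base-greek-uncased-v1",
    "hr": "classla/bcms-bertic",
    "ka": "xlm-roberta-large",
    "mt": "MaCoCu/MaltBERTa",
    "nl": "DTAI-KULeuven/robbert-2023-dutch-large",
    "ru": "DeepPavlov/rubert-base-cased",
    "sl": "EMBEDDIA/crosloengual-bert",
    "sr": "classla/bcms-bertic",
    "sv": "KBLab/bert-base-swedish-cased",
}


def build_pipeline(lang):
    """Hypothetical helper: load the default_accurate package for one of the listed languages."""
    if lang not in NEW_TRANSFORMERS:
        raise ValueError(f"{lang!r} is not one of the languages listed in this release")
    # Imported lazily so the mapping can be inspected without Stanza installed.
    import stanza
    return stanza.Pipeline(lang, package="default_accurate")
```

Calling `build_pipeline("nl")` and then passing a string to the returned pipeline would produce a Stanza document with the usual annotations.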

Dependency parser improvements

Depparse bug fix: incorporate the bias in the biaffine model. Also, properly transpose the inputs. This actually did not change scores on average, weirdly enough. #308
- Please note that this update invalidates all locally trained depparse models. If you need help rebuilding a model, or want your model added to our distribution, please let us know.
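
For readers unfamiliar with the term, a biaffine score with bias can be sketched as follows. This is a generic pure-Python illustration, not Stanza's implementation: the function name and the convention of appending a constant 1 to each input vector are assumptions made for the example.

```python
def biaffine_score(head, dep, weight):
    """Score one (head, dep) pair as [head; 1]^T W [dep; 1].

    `weight` must have (len(head) + 1) rows and (len(dep) + 1) columns.
    The extra row and column act as linear bias terms for each input,
    and the bottom-right entry acts as a constant bias.
    """
    h = list(head) + [1.0]
    d = list(dep) + [1.0]
    if len(weight) != len(h) or any(len(row) != len(d) for row in weight):
        raise ValueError("weight must be (len(head) + 1) x (len(dep) + 1)")
    return sum(h[i] * weight[i][j] * d[j]
               for i in range(len(h))
               for j in range(len(d)))
```

Dropping the appended 1s reduces this to a plain bilinear form `head^T W dep`, which is the kind of difference a missing bias makes.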
Depparse can train with a silver dataset: 1f2828d This can also be used to train with two different datasets, weighted either equally or via the --silver_weight flag.
Depparse training option: can finetune only the last N layers of a transformer e7245e3
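
The idea of finetuning only the last N layers can be illustrated by selecting parameters by name. The sketch below assumes Hugging Face BERT-style parameter names containing `encoder.layer.<index>.`; the helper name is hypothetical, and this is not the code from the commit above.

```python
import re

# Matches the layer index in BERT-style parameter names (an assumption about naming).
_LAYER_RE = re.compile(r"encoder\.layer\.(\d+)\.")


def last_n_layer_params(param_names, num_layers, last_n):
    """Return the parameter names that belong to the last `last_n` encoder layers."""
    if not 0 <= last_n <= num_layers:
        raise ValueError("last_n must be between 0 and num_layers")
    first_trainable = num_layers - last_n
    selected = []
    for name in param_names:
        match = _LAYER_RE.search(name)
        # Embeddings, pooler, and earlier layers are left out (i.e. stay frozen).
        if match and int(match.group(1)) >= first_trainable:
            selected.append(name)
    return selected
```

In a PyTorch training loop, one would typically set `requires_grad = False` on every parameter not returned by such a helper.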
Improved depparse optimizer scheduling - 0cf8654 The code was previously released in v1.11.1, but now all of the models are retrained with the new training scheme. A small sample of results from a few transformer-based depparse models, trained with either a less strict stopping threshold or the two-stage optimization, shows clear improvement on the dev set (with similar improvement in test scores):

5 model dev avg LAS	1 stage	1 stage 2k	1 stage 4k	2 stage
de_gsd	89.03	89.50	89.71	89.83
en_ewt	93.47	93.69	93.74	93.89
fi_tdt	92.16	92.56	92.69	93.15
it_vit	90.12	90.37	90.44	90.60
ta_ttb	71.26	71.39	71.45	72.19
zh-hans_gsdsimp	85.47	85.69	85.76	85.89
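
To summarize the table, the gain from the 1 stage baseline to the 2 stage scheme can be computed directly from the figures above. The snippet only restates the table's numbers; the variable names are illustrative.

```python
# (1 stage, 2 stage) dev LAS, copied from the table above.
SCORES = {
    "de_gsd": (89.03, 89.83),
    "en_ewt": (93.47, 93.89),
    "fi_tdt": (92.16, 93.15),
    "it_vit": (90.12, 90.60),
    "ta_ttb": (71.26, 72.19),
    "zh-hans_gsdsimp": (85.47, 85.89),
}

# Per-treebank gain of the 2 stage scheme over the 1 stage baseline.
gains = {name: two_stage - one_stage
         for name, (one_stage, two_stage) in SCORES.items()}
mean_gain = sum(gains.values()) / len(gains)

for name, gain in gains.items():
    print(f"{name}: +{gain:.2f}")
print(f"mean: +{mean_gain:.2f}")
```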

MWT and Lemmatizer improvements

"Smooth" MWT training by including a small fraction of non-MWT words in the training. #1568 Solves the problem of Finnish MWT having "t" at the end, but not at the start or middle, so natural words with "t" at the start would lead to the seq2seq model going haywire. #1562
The default Finnish models now include a small set of sentences with non-MWT tokenization for certain non-MWT words. This addresses some words, such as tollei, being treated as MWT. 380aecd
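
The general idea of mixing a small fraction of non-MWT words into the MWT training data can be sketched as below. This is an illustration under stated assumptions, not the code from the linked changes: the helper name, the `fraction` default, and the choice to represent a non-MWT word as an example that expands to itself are all hypothetical.

```python
import random


def smooth_mwt_examples(mwt_pairs, plain_words, fraction=0.05, seed=0):
    """Append a small random sample of ordinary words as identity examples.

    `mwt_pairs` is a list of (token, expansion) training pairs.
    `plain_words` are non-MWT words; each sampled word w is added as (w, w),
    so the model also sees tokens that should be left unchanged.
    """
    rng = random.Random(seed)
    # Number of identity examples, capped by how many plain words are available.
    k = min(len(plain_words), round(fraction * len(mwt_pairs)))
    sampled = rng.sample(list(plain_words), k)
    return list(mwt_pairs) + [(word, word) for word in sampled]
```

For example, with illustrative pairs such as `("ettei", "että ei")` and a pool of plain words including `tollei`, a `fraction` of 0.5 over two MWT pairs adds one identity example.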
The lemmatizer trains with silver tags (lower scores on gold, but better performance against real world text) #1567 829f22e 14a9739 This update will be used for retraining against UD 2.18 when it is available.

Bugfixes

Bugfix for training new MWT models from scratch - UD to internal format converter was not working 31df8e3
download etc. no longer automatically reset the logging level. That now only happens if the user explicitly sets the logging level in the function call #1551 #1418 #1569 Thank you @haoyu-haoyu

Interface improvements

Update usage of morphseg: the latest version has a cleaner interface to the underlying model #1550 Thank you @TheWelcomer
Multi-doc wrapper to bulk_process Thank you @Rakshitha-Ireddi #1570
Utils for processing the coref output format Thank you @Rakshitha-Ireddi #1571
Add a human-readable coref output: #1560 19c2b07
