Stanza v1.13.0 Release Notes
Download Improvements
- Switch model downloads from raw `requests` calls to the `huggingface_hub` library when downloading from Hugging Face. This should be more reliable, take advantage of XET when available, and benefit from the HF local cache. Addresses #1619, a report of intermittent download failures. #1614
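As a rough illustration of what a cache-aware download through `huggingface_hub` looks like, here is a minimal sketch. It is not Stanza's actual download code: the helper names `hf_repo_id` and `fetch_model_file` are hypothetical, and the `stanfordnlp/stanza-<lang>` repo naming is an assumption. Only `hf_hub_download` is a real library call.

```python
from pathlib import Path


def hf_repo_id(lang: str, org: str = "stanfordnlp") -> str:
    """Build a Hub repo id for a language code (assumed naming scheme)."""
    return f"{org}/stanza-{lang}"


def fetch_model_file(lang: str, filename: str) -> Path:
    """Download one file from the Hub, reusing the local HF cache if present.

    Hypothetical helper: Stanza's own resource code may be organized differently.
    """
    # Imported lazily so this module can be loaded without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    # hf_hub_download returns the path of the cached file on disk and only
    # contacts the network when the file is missing from the cache or stale.
    local_path = hf_hub_download(repo_id=hf_repo_id(lang), filename=filename)
    return Path(local_path)


# Example call (needs network access and the huggingface_hub package);
# the filename is whatever path the file has inside the repo:
# path = fetch_model_file("sl", filename)
```

In normal use none of this is needed, since Stanza handles the download itself; the sketch only shows why repeated downloads can be served from the local cache.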
New Models
-
Add a Slovenian NER model trained on the UNER dataset, along with the scripts needed to process UNER data into Stanza's internal NER format. #1615
-
Update the default word vectors for Erzya (MYV) to use rootroo embeddings, and rebuild all MYV models accordingly. The rootroo vectors show clear improvements on POS (dev UPOS 90.81 vs 90.21 with mokha vectors) with a small gain on depparse as well. #1606
Constituency Parser
-
Fix a longstanding bug in the constituency parser output layer: the nonlinearity was missing between the last two linear layers. The buggy forward pass made those two layers mathematically fusable, so existing models have been condensed to 2 output layers with no loss in accuracy. #1610
-
Set the default number of output layers to 2. Experiments showed that 2- and 3-layer configurations perform equivalently (once the nonlinearity bug above is corrected), so models will now train with the smaller default. #1611
-
At the end of training, automatically condense output layer rows that weight decay has trained toward zero, shrinking the saved model and improving inference speed slightly. #1613
-
Several efficiency improvements to the parser state representation, improving throughput by roughly 20%. Changes include using `type()` instead of `isinstance()` for type checks (with appropriate guards), switching `TreeStack` from a namedtuple to `__slots__`, and storing transition scheme information as attributes on transition objects to avoid repeated accessor calls. #1603 #1612
-
Add a script for visualizing constituency parser model weights: outputs heatmaps of linear layer weights and time-series plots of LSTM gate statistics, useful for analyzing training behavior. #1605
-
Add support for controlling forget gate initialization and for applying a separate weight decay to LSTM biases, based on Jozefowicz et al. Experiments showed a small improvement, so this configuration is now the default and new models will be trained with it. Future work: apply these LSTM training changes to other models, especially depparse. #1609
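The output layer fix above relies on a simple fact: two linear layers with no nonlinearity between them compute W2(W1 x + b1) + b2 = (W2 W1) x + (W2 b1 + b2), which is itself a single linear layer. The sketch below checks this on tiny hand-written matrices. It is purely illustrative; the helper names and values are made up and are not taken from the Stanza code.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def matmul(A, B):
    """Multiply matrix A by matrix B (both as lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def linear(W, b, x):
    """Apply one linear layer: W x + b."""
    return [y + bias for y, bias in zip(matvec(W, x), b)]


def fuse(W1, b1, W2, b2):
    """Collapse two stacked linear layers (no nonlinearity between) into one."""
    W = matmul(W2, W1)
    b = [y + bias for y, bias in zip(matvec(W2, b1), b2)]
    return W, b


# Layer 1 maps 2 -> 3 dimensions, layer 2 maps 3 -> 2 dimensions.
W1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]
b1 = [0.5, -0.5, 1.0]
W2 = [[1.0, 0.0, 2.0], [-1.0, 1.0, 0.0]]
b2 = [0.0, 1.0]

x = [2.0, -1.0]
two_layer = linear(W2, b2, linear(W1, b1, x))  # buggy forward pass: no nonlinearity
W, b = fuse(W1, b1, W2, b2)
fused = linear(W, b, x)                        # single equivalent layer

print(two_layer, fused)
```

Both computations give the same output for every input, which is why the existing models could be condensed without any change in accuracy.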
Dependency Parser
-
Add a `--gradient_checkpointing` flag to the dependency parser training script, allowing fine-tuning of larger transformers under tighter memory constraints. #1592
-
Add a freeze → warmup → plateau learning rate scheduler (`WarmupThenPlateauScheduler`) for use in the dependency parser. This gives finer control over the stages of transformer fine-tuning. #1589
-
Detach transformer embeddings from the computation graph whenever the transformer is not actively being trained (e.g. during the frozen stage when `bert_finetune=True`). This reduces memory usage and speeds up the stages of training that don't update the transformer. #1590
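To make the freeze → warmup → plateau staging of the learning rate scheduler concrete, here is a minimal sketch of a learning rate multiplier with three stages. This is not the `WarmupThenPlateauScheduler` implementation: the function name, its parameters, and the choice to simply hold the rate constant in the final stage are assumptions for illustration. The real scheduler may behave differently in the last stage, for example by lowering the rate when dev scores stall.

```python
def lr_multiplier(step: int, freeze_steps: int, warmup_steps: int) -> float:
    """Return a factor applied to the base learning rate at a given step.

    Hypothetical three-stage schedule:
      1. freeze:  factor 0.0, so the transformer weights are not updated
      2. warmup:  factor rises linearly up to 1.0
      3. plateau: factor stays at 1.0 (assumed; see the note above)
    """
    if step < freeze_steps:
        return 0.0
    warm = step - freeze_steps
    if warm < warmup_steps:
        return (warm + 1) / warmup_steps
    return 1.0


# Two frozen steps, four warmup steps, then a constant rate.
schedule = [lr_multiplier(s, freeze_steps=2, warmup_steps=4) for s in range(8)]
print(schedule)  # [0.0, 0.0, 0.25, 0.5, 0.75, 1.0, 1.0, 1.0]
```

A function of this shape could be passed to a generic multiplicative scheduler; during the frozen stage the transformer receives no updates, which is also when detaching its embeddings saves memory.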
Training Infrastructure
-
Update language codes and treebank names throughout the codebase to align with UD 2.18. #1599
-
Add a `--additional_files` flag for building combined training datasets, and add the ability to construct a combined dataset for any language that has train/dev/test splits. #1598
-
Refactor `bert_layer_mix` into a trainable parameter that is passed directly into the `bert_embeddings` function, removing the need for each model to process the returned embeddings separately. #1597
-
Unify word embedding storage across models: models that were storing word vectors directly now use `PretrainedWordVocab` from the shared `Pretrain` object instead. This removes redundant storage and makes embedding handling consistent across annotators. #1600