v1.13.0 - Use huggingface_hub for downloads, conparser efficiency gains

@AngledLuffa released this 18 Jun 19:14

Stanza v1.13.0 Release Notes

Download Improvements

  • Use the huggingface_hub library instead of raw requests calls when downloading models from Hugging Face. This should be more reliable, take advantage of XET when available, and benefit from the HF local cache. Addresses #1619, a report of intermittent download failures. #1614
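
For readers unfamiliar with huggingface_hub, a download through the library looks roughly like the sketch below. hf_hub_download is the library's own function; the helper name fetch_model_file and the repo and file names in the usage comment are assumptions made for illustration and are not taken from Stanza's code.

```python
from typing import Callable, Optional


def fetch_model_file(repo_id: str, filename: str, revision: Optional[str] = None,
                     downloader: Optional[Callable[..., str]] = None) -> str:
    """Return a local path to `filename` from the Hub repo `repo_id`.

    Hypothetical helper, not Stanza's actual download code.
    """
    if downloader is None:
        # Imported lazily so the sketch can be exercised without the package installed.
        from huggingface_hub import hf_hub_download
        downloader = hf_hub_download
    # hf_hub_download stores the file in the local HF cache and returns its path;
    # a repeated call for the same revision is served from that cache.
    return downloader(repo_id=repo_id, filename=filename, revision=revision)


# Example call (repo and file names are assumed, not confirmed by the notes):
# path = fetch_model_file("stanfordnlp/stanza-sl", "models/default.zip")
```

The downloader argument exists only so the helper can be tried with a stand-in function; in normal use it is left unset and the real library call is made.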

New Models

  • Add a Slovenian NER model trained on the UNER dataset, along with the scripts needed to process UNER data into Stanza's internal NER format. #1615

  • Update the default word vectors for Erzya (MYV) to use rootroo embeddings, and rebuild all MYV models accordingly. The rootroo vectors show clear improvements on POS (dev UPOS 90.81 vs 90.21 with mokha vectors) with a small gain on depparse as well. #1606

Constituency Parser

  • Fix a longstanding bug in the constituency parser output layer: the nonlinearity was missing between the last two linear layers. The buggy forward pass made those two layers mathematically fusable, so existing models have been condensed to 2 output layers with no loss in accuracy. #1610

  • Set the default number of output layers to 2. Experiments showed that 2- and 3-layer configurations perform equivalently (once the nonlinearity bug above is corrected), so models will now train with the smaller default. #1611

  • At the end of training, automatically condense output layer rows that weight decay has trained toward zero, shrinking the saved model and improving inference speed slightly. #1613

  • Speed up the parser state representation with several efficiency changes, raising throughput by roughly 20%. Changes include using type() instead of isinstance() for type checks (with appropriate guards), switching TreeStack from a namedtuple to __slots__, and storing transition scheme information as attributes on transition objects to avoid repeated accessor calls. #1603 #1612

  • Add a script for visualizing constituency parser model weights: outputs heatmaps of linear layer weights and time-series plots of LSTM gate statistics, useful for analyzing training behavior. #1605

  • Add support for controlling forget gate initialization and applying a separate weight decay to LSTM biases, based on Jozefowicz et al. Experiments showed a small improvement, so new models will be trained with this configuration by default. Future work: apply these LSTM training changes to other models, especially depparse. #1609
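
The fusing argument behind the output layer fix (#1610) can be checked numerically. With no nonlinearity in between, W2 (W1 x + b1) + b2 equals (W2 W1) x + (W2 b1 + b2), so two layers collapse into one. The sketch below assumes plain affine layers with made-up sizes and random weights; it is not the parser's code, and all function names are illustrative.

```python
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def linear(W, b, x):
    """Affine layer: W x + b."""
    return [y + bi for y, bi in zip(matvec(W, x), b)]


def fuse(W1, b1, W2, b2):
    """Collapse two stacked affine layers that have no nonlinearity between them."""
    cols = list(zip(*W1))  # columns of W1
    W = [[sum(a * c for a, c in zip(row, col)) for col in cols] for row in W2]  # W2 W1
    b = [y + bi for y, bi in zip(matvec(W2, b1), b2)]                           # W2 b1 + b2
    return W, b


def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
W1, b1 = rand_matrix(4, 3), rand_matrix(1, 4)[0]  # layer mapping 3 -> 4
W2, b2 = rand_matrix(2, 4), rand_matrix(1, 2)[0]  # layer mapping 4 -> 2
x = rand_matrix(1, 3)[0]

stacked = linear(W2, b2, linear(W1, b1, x))  # two layers applied in sequence
W, b = fuse(W1, b1, W2, b2)
fused = linear(W, b, x)                      # single equivalent layer
max_diff = max(abs(s - f) for s, f in zip(stacked, fused))
```

Here max_diff differs from zero only by floating-point rounding. Inserting a nonlinearity such as ReLU between the two layers breaks this identity, which is why the corrected layers can no longer be merged.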

Dependency Parser

  • Add a --gradient_checkpointing flag to the dependency parser training script, allowing fine-tuning of larger transformers under tighter memory constraints. #1592

  • Add a freeze → warmup → plateau learning rate scheduler (WarmupThenPlateauScheduler) for use in the dependency parser. This gives finer control over the stages of transformer fine-tuning. #1589

  • Detach transformer embeddings from the computation graph whenever the transformer is not actively being trained (e.g. during the frozen stage when bert_finetune=True). This reduces memory usage and speeds up the stages of training that don't update the transformer. #1590
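
One way to picture a freeze → warmup → plateau schedule is as a function from training step to transformer learning rate. The sketch below is a hypothetical illustration, not the implementation of WarmupThenPlateauScheduler: the function name, its parameters, the linear ramp, and the reading of "plateau" as holding the rate constant are all assumptions, since the notes do not describe how each stage behaves.

```python
def transformer_lr(step: int, base_lr: float, freeze_steps: int, warmup_steps: int) -> float:
    """Learning rate for the transformer parameters at a given optimizer step.

    Hypothetical schedule for illustration only.
    """
    if step < freeze_steps:
        return 0.0                 # freeze: transformer weights are not updated
    if step < freeze_steps + warmup_steps:
        progress = (step - freeze_steps + 1) / warmup_steps
        return base_lr * progress  # warmup: linear ramp up to base_lr
    return base_lr                 # plateau (assumed): hold the rate steady


# Eight steps with 2 frozen steps followed by 4 warmup steps.
schedule = [transformer_lr(s, base_lr=1e-4, freeze_steps=2, warmup_steps=4) for s in range(8)]
```

During the steps where the rate is zero, the transformer output needs no gradient, which is the situation the detach change (#1590) is described as exploiting.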

Training Infrastructure

  • Update language codes and treebank names throughout the codebase to align with UD 2.18. #1599

  • Add a --additional_files flag for building combined training datasets, and add the ability to construct a combined dataset for any language that has train/dev/test splits. #1598

  • Refactor bert_layer_mix into a trainable parameter that is passed directly into the bert_embeddings function, removing the need for each model to process the returned embeddings separately. #1597

  • Unify word embedding storage across models: models that were storing word vectors directly now use PretrainedWordVocab from the shared Pretrain object instead. This removes redundant storage and makes embedding handling consistent across annotators. #1600