This is the same as v1.12.1, with the following important update: all models now enforce `weights_only=True` when loading. This may make some older models unloadable (all models distributed with Stanza are already patched). If that happens, please load and then resave your model with an older version of Stanza, such as v1.12.1.
Security / PyTorch compatibility
- Enforce `weights_only=True` when loading the lemma classifier, addressing part of the security advisory GHSA-v5jw-96jm-7h2c. This should already be the default in later versions of PyTorch, but is now explicitly enforced. #1584
- The code path that allowed `weights_only=False` has been removed from all models. Attempting to load a legacy model now throws an exception. If that happens, please load your model with an older version (such as 1.12.1) and resave it before proceeding. #1587
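
To illustrate why restricting what a loader may unpickle matters, here is a minimal standard-library sketch of an allowlist-based unpickler. It is not Stanza's or PyTorch's code: `SAFE_GLOBALS`, `AllowlistUnpickler`, `safe_loads`, and `Legacy` are hypothetical names for this example, and the allowlist here is far smaller than anything a real loader would use.

```python
import io
import pickle
from collections import OrderedDict

# Hypothetical allowlist for this sketch only.
SAFE_GLOBALS = {
    ("collections", "OrderedDict"): OrderedDict,
}


class AllowlistUnpickler(pickle.Unpickler):
    """Unpickler that refuses any global not on the allowlist."""

    def find_class(self, module, name):
        try:
            return SAFE_GLOBALS[(module, name)]
        except KeyError:
            raise pickle.UnpicklingError(
                f"global '{module}.{name}' is not allowed"
            ) from None


def safe_loads(data: bytes):
    """Deserialize bytes while rejecting non-allowlisted globals."""
    return AllowlistUnpickler(io.BytesIO(data)).load()


# Plain containers and primitive values load normally.
state = OrderedDict(weight=[0.1, 0.2], bias=0.0)
restored = safe_loads(pickle.dumps(state))


class Legacy:
    """Stand-in for a legacy payload that references an arbitrary callable."""

    def __reduce__(self):
        return (print, ("arbitrary code ran",))


# The payload is rejected before the callable is ever invoked.
try:
    safe_loads(pickle.dumps(Legacy()))
    rejected = False
except pickle.UnpicklingError:
    rejected = True
```

A model file that stores only tensors, primitive values, and plain containers passes this kind of check, which is why resaving a legacy model in that form makes it loadable again.
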
Tokenizer improvements
- Add control characters to the set of characters treated as whitespace when tokenizing, fixing a bug where certain Unicode control characters (such as "region end" markers) were incorrectly attached to words. #1573 Addresses #1257
- Add tokenizer augmentation that occasionally replaces commas with en-dashes or em-dashes, so that models trained on datasets that lack those characters learn to treat them similarly to commas. #1573
- Add regression tests for Spanish tokenization errors reported in #1257 and tests for the whitespace/control-character handling and tokenizer augmentations. #1573
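
The two tokenizer changes above can be sketched roughly as follows. This is an illustrative approximation, not the Stanza implementation: the function names, the use of Unicode category `Cc` to identify control characters, the per-comma replacement probability, and the simple one-character swap are all assumptions made for this example.

```python
import random
import unicodedata

EN_DASH = "\u2013"
EM_DASH = "\u2014"


def is_whitespace_like(ch: str) -> bool:
    """Treat ordinary whitespace and control characters (category Cc) as whitespace."""
    return ch.isspace() or unicodedata.category(ch) == "Cc"


def split_on_whitespace(text: str) -> list:
    """Split text into chunks, using control characters as separators too."""
    chunks, current = [], []
    for ch in text:
        if is_whitespace_like(ch):
            if current:
                chunks.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        chunks.append("".join(current))
    return chunks


def augment_commas(text: str, rng: random.Random, prob: float = 0.05) -> str:
    """With probability `prob` per comma, swap it for an en dash or em dash."""
    out = []
    for ch in text:
        if ch == "," and rng.random() < prob:
            out.append(rng.choice([EN_DASH, EM_DASH]))
        else:
            out.append(ch)
    return "".join(out)


# U+0087 is a C1 control character; str.split() alone would not separate on it.
words = split_on_whitespace("hola\u0087mundo adi\u00f3s")

# prob=1.0 forces every comma to be replaced, which is handy for testing.
augmented = augment_commas("uno, dos, tres", random.Random(0), prob=1.0)
```

Passing an explicit `random.Random` instance keeps the augmentation reproducible across training runs.
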
Lemmatizer improvements
- Enforce `weights_only=True` when loading the lemma classifier, avoiding a possible security risk. #1584
- The lemma classifier for `ja_gsd` is now also attached to `ja_combined`. #1584
- Train and attach two lemma classifiers to `en_combined`: both `'s` and `her` can be reliably classified from the available data. #1584
- Add end-to-end unit tests for `run_lemma.py`, including training a lemmatizer and attaching multiple lemma classifiers. #1586
Spanish model improvements
- Add a silver dataset covering `como_VERB` in Spanish to the combined Spanish training data, addressing #1440. Also adds a utility to print a confusion matrix of tagging results filtered by a word regex (e.g. `--upos_word_regex "^(?i:como)$"`), making it easier to isolate the effects of annotation changes. #1579 stanfordnlp/handparsed-treebank@d0c29a3
- Add silver training sentences covering unknown Spanish VERB lemmas to the combined Spanish lemmatizer, addressing #1255. Also includes a script to check lemmatizer results for a batch of word/POS combinations. #1580 stanfordnlp/handparsed-treebank@11327ef
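
The idea behind a word-filtered confusion matrix can be sketched as below. This is not the utility added in #1579: `filtered_confusion` and the toy data are hypothetical, and only the regex itself is taken from the example flag value above.

```python
import re
from collections import Counter


def filtered_confusion(gold, pred, word_regex):
    """Count (gold_tag, pred_tag) pairs for words matching word_regex.

    gold and pred are parallel lists of (word, upos) tuples.
    """
    pattern = re.compile(word_regex)
    counts = Counter()
    for (word, gold_tag), (_, pred_tag) in zip(gold, pred):
        if pattern.search(word):
            counts[(gold_tag, pred_tag)] += 1
    return counts


# Toy data, invented for illustration only.
gold = [("Como", "VERB"), ("pan", "NOUN"), ("como", "SCONJ"), ("como", "VERB")]
pred = [("Como", "SCONJ"), ("pan", "NOUN"), ("como", "SCONJ"), ("como", "VERB")]

# (?i:...) makes only the enclosed group case-insensitive (Python 3.6+).
matrix = filtered_confusion(gold, pred, r"^(?i:como)$")

for (gold_tag, pred_tag), n in sorted(matrix.items()):
    print(f"gold={gold_tag:<6} pred={pred_tag:<6} count={n}")
```

Restricting the counts to a single word form makes it easier to see whether a change in annotation moved that word's tags without the signal being drowned out by the rest of the corpus.
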
Italian model improvements
- Rebuild Italian models with additional training data to fix incorrect lemmatization of common words, including "violino" (was incorrectly mapped to "violare") #1563 stanfordnlp/handparsed-treebank@9c46db1 and "diversi" (was incorrectly split and mapped to "dire"), which was resolved by retraining with the more accurate models. #1564
English model improvements
- The long-standing issue of "can" being tagged as a modal verb (MD) rather than a noun (NN) in noun phrases like "trash can" and "soda can" is now resolved with the combined English models. #408
New / updated models
- Odia (Oriya) now uses the ODTB package as the default. Mixed POS and depparse training data is constructed from the Odia dataset combined with related Indic languages present in MuRIL-Large, following the approach used for Sindhi. The Odia NER model is now also connected to the default package. #1583
Demo and visualization improvements
- Rewrite `stanza-parseviewer.js` to use a proper constituency parse visualizer instead of a repurposed dependency parse visualizer, fixing the broken vertical striping. Also adds a table of morphological features to the visualization. #1581 Addresses #1358
- Various small improvements to the web demo: route all responses to `/`; templatize `stanza-brat.html` so the version number is sourced from `_version.py`; move the logo to the demo directory for easier serving; add favicon support to the pipeline demo; guard against empty POST requests. #1582
Documentation and code quality
- Add docstrings and unit tests for the confusion matrix module. #1577