Release v1.12.2 - weights_only security fix · stanfordnlp/stanza

This is the same as v1.12.1, with the following important update: all models now enforce weights_only=True when loading. This may make some older models impossible to load (all models distributed with Stanza are already patched). If that happens, please load and then resave your model with an older version of Stanza, such as v1.12.1.

Security / PyTorch compatibility

Enforce weights_only=True when loading the lemma classifier, addressing part of the security advisory GHSA-v5jw-96jm-7h2c. This should already be the default in later versions of PyTorch, but is now explicitly enforced. #1584
The code path that allowed weights_only=False has been removed for all models. Attempting to load a legacy model now raises an exception instead. If that happens, please load your model with an older version (such as 1.12.1) and resave it before proceeding. #1587
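To illustrate why restricting what a checkpoint may contain matters, here is a minimal sketch of allowlist-based unpickling using only the Python standard library. It is not PyTorch's or Stanza's implementation; the names ALLOWED_GLOBALS, AllowlistUnpickler and safe_loads are hypothetical, and Fraction merely stands in for an arbitrary Python object saved inside a legacy model file.

```python
import io
import pickle
from collections import OrderedDict
from fractions import Fraction

# Hypothetical allowlist for this sketch only; the real list used by
# PyTorch for weights_only=True is different and much larger.
ALLOWED_GLOBALS = {("collections", "OrderedDict")}


class AllowlistUnpickler(pickle.Unpickler):
    """Unpickler that refuses to import any global not on the allowlist."""

    def find_class(self, module, name):
        if (module, name) in ALLOWED_GLOBALS:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"global '{module}.{name}' is not allowed")


def safe_loads(data):
    """Unpickle bytes while permitting only allowlisted globals."""
    return AllowlistUnpickler(io.BytesIO(data)).load()


# A checkpoint made of plain containers and numbers loads normally.
plain_state = OrderedDict(weight=[0.1, 0.2])
restored = safe_loads(pickle.dumps(plain_state))

# A checkpoint holding a non-allowlisted object is rejected, which is
# analogous to a legacy model failing to load under weights_only=True.
legacy_state = {"config": Fraction(1, 2)}
try:
    safe_loads(pickle.dumps(legacy_state))
    legacy_rejected = False
except pickle.UnpicklingError:
    legacy_rejected = True
```

The same trade-off appears in this release: loading becomes safer, and files that depend on unrestricted unpickling must be resaved in a compatible form.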

Tokenizer improvements

Add control characters to the set of characters treated as whitespace when tokenizing, fixing a bug where certain Unicode control characters (such as "region end" markers) were incorrectly attached to words. #1573 Addresses #1257
Add tokenizer augmentation that occasionally replaces commas with en-dashes or em-dashes, so that models trained on datasets that lack those characters learn to treat them similarly to commas. #1573
Add regression tests for Spanish tokenization errors reported in #1257 and tests for the whitespace/control-character handling and tokenizer augmentations. #1573
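The two tokenizer changes above can be sketched roughly as follows. This is an illustration only, not Stanza's code: the function names, the use of Unicode category Cc to detect control characters, the replacement probability, and the choice of U+0097 as a sample control character are all assumptions made for the example.

```python
import random
import unicodedata

# En dash and em dash, written as escapes so the source stays ASCII.
DASHES = ("\u2013", "\u2014")


def is_whitespace_like(ch):
    """Treat ordinary whitespace and Unicode control characters (Cc) alike."""
    return ch.isspace() or unicodedata.category(ch) == "Cc"


def split_on_whitespace_like(text):
    """Split text wherever a whitespace-like character occurs."""
    tokens, current = [], []
    for ch in text:
        if is_whitespace_like(ch):
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


def replace_commas_with_dashes(text, rng, prob=0.05):
    """Replace each comma with a randomly chosen dash with probability prob."""
    out = []
    for ch in text:
        if ch == "," and rng.random() < prob:
            out.append(rng.choice(DASHES))
        else:
            out.append(ch)
    return "".join(out)


# A control character glued to a word no longer ends up inside the token.
tokens = split_on_whitespace_like("hola\u0097 mundo")

# With a seeded generator the augmentation is reproducible.
augmented = replace_commas_with_dashes("uno, dos, tres", random.Random(0), prob=1.0)
```

Passing the random generator in explicitly keeps the augmentation deterministic in tests, which is one reasonable way to write regression tests for it.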

Lemmatizer improvements

Enforce weights_only=True when loading the lemma classifier, avoiding a possible security risk. #1584
The lemma classifier for ja_gsd is now also attached to ja_combined. #1584
Train and attach two lemma classifiers to en_combined: both 's and her can be reliably classified from the available data. #1584
Add end-to-end unit tests for run_lemma.py, including training a lemmatizer and attaching multiple lemma classifiers. #1586

Spanish model improvements

Add a silver dataset covering como_VERB in Spanish to the combined Spanish training data, addressing #1440. Also adds a utility to print a confusion matrix of tagging results filtered by a word regex (e.g. --upos_word_regex "^(?i:como)$"), making it easier to isolate the effects of annotation changes. #1579 stanfordnlp/handparsed-treebank@d0c29a3
Add silver training sentences covering unknown Spanish VERB lemmas to the combined Spanish lemmatizer, addressing #1255. Also includes a script to check lemmatizer results for a batch of word/POS combinations. #1580 stanfordnlp/handparsed-treebank@11327ef
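As a rough picture of what a word-filtered confusion matrix computes, the sketch below counts (gold, predicted) tag pairs only for words matching a regex such as the one quoted above. It is not the Stanza utility: the function name, the row format, and the sample rows are invented for illustration, and using re.search is an assumption about how the pattern is applied.

```python
import re
from collections import Counter


def filtered_confusion(rows, word_regex):
    """Count (gold, predicted) tag pairs for words matching word_regex.

    rows is an iterable of (word, gold_tag, predicted_tag) tuples.
    """
    pattern = re.compile(word_regex)
    counts = Counter()
    for word, gold, pred in rows:
        if pattern.search(word):
            counts[(gold, pred)] += 1
    return counts


# Invented sample data, not real tagger output.
sample_rows = [
    ("Como", "VERB", "SCONJ"),
    ("como", "SCONJ", "SCONJ"),
    ("como", "VERB", "VERB"),
    ("comer", "VERB", "VERB"),
]

# The scoped flag (?i:...) makes only the enclosed group case-insensitive.
como_counts = filtered_confusion(sample_rows, r"^(?i:como)$")
```

Restricting the matrix to one word form makes it easier to see whether a change in annotation moved errors for that word specifically.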

Italian model improvements

Rebuild Italian models with additional training data to fix incorrect lemmatization of common words, including "violino" (previously mapped to "violare") #1563 stanfordnlp/handparsed-treebank@9c46db1 and "diversi" (previously split and mapped to "dire"), which was resolved by retraining with the more accurate models. #1564

English model improvements

The long-standing issue of "can" being tagged as a modal verb (MD) rather than a noun (NN) in noun phrases like "trash can" and "soda can" is now resolved with the combined English models. #408

New / updated models

Odia (Oriya) now uses the ODTB package as the default. Mixed POS and depparse training data is constructed from the Odia dataset combined with related Indic languages present in MuRIL-Large, following the approach used for Sindhi. The Odia NER model is now also connected to the default package. #1583

Demo and visualization improvements

Rewrite stanza-parseviewer.js to use a proper constituency parse visualizer instead of a repurposed dependency parse visualizer, fixing the broken vertical striping. Also adds a table of morphological features to the visualization. #1581 Addresses #1358
Various small improvements to the web demo: route all responses to /; templatize stanza-brat.html so the version number is sourced from _version.py; move the logo to the demo directory for easier serving; add favicon support to the pipeline demo; guard against empty POST requests. #1582

Documentation and code quality

Add docstrings and unit tests for the confusion matrix module. #1577
