v1.12.2 - weights_only security fix

@AngledLuffa released this 03 Jun 21:10

This is the same as v1.12.1, with the following important update: all models now enforce weights_only=True when loading. This may make some old models unloadable (all models distributed with Stanza are already patched). If that happens, please load and then resave your model with an older version of Stanza, such as v1.12.1.

Security / PyTorch compatibility

  • Enforce weights_only=True when loading the lemma classifier, addressing part of the security advisory GHSA-v5jw-96jm-7h2c. This should already be the default in later versions of PyTorch, but is now explicitly enforced. #1584

  • The code path that allowed weights_only=False has been removed from all models. Attempting to load a legacy model now throws an exception instead. If that happens, please load your model with an older version (such as 1.12.1) and resave it before proceeding. #1587
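
To give a rough sense of why legacy models can stop loading, the sketch below uses only Python's standard pickle module as an analogy. It is not Stanza's or PyTorch's actual code. PyTorch's weights_only=True restricts unpickling to tensors, primitive containers, and explicitly allowlisted types; the names AllowlistUnpickler, ALLOWED_GLOBALS, restricted_loads, and loads_or_none are hypothetical, and fractions.Fraction merely stands in for an arbitrary object saved inside an old checkpoint.

```python
import collections
import fractions
import io
import pickle

# Hypothetical allowlist for illustration; PyTorch maintains its own list.
ALLOWED_GLOBALS = {("collections", "OrderedDict")}


class AllowlistUnpickler(pickle.Unpickler):
    """Unpickler that refuses any global not explicitly allowed."""

    def find_class(self, module, name):
        if (module, name) in ALLOWED_GLOBALS:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"global {module}.{name} is not allowed")


def restricted_loads(data: bytes):
    """Deserialize bytes, rejecting globals outside the allowlist."""
    return AllowlistUnpickler(io.BytesIO(data)).load()


def loads_or_none(data: bytes):
    """Return the object, or None if the payload needs a disallowed global."""
    try:
        return restricted_loads(data)
    except pickle.UnpicklingError:
        return None


# Plain containers and allowlisted types load; anything else is rejected.
plain = pickle.dumps({"weights": [0.1, 0.2], "epoch": 3})
allowed = pickle.dumps(collections.OrderedDict(a=1))
legacy = pickle.dumps(fractions.Fraction(1, 2))  # stand-in for an arbitrary object

print(restricted_loads(plain))
print(restricted_loads(allowed))
print(loads_or_none(legacy))
```

A file that contains only plain data keeps loading under the restricted path, which is why resaving with an older version can help, assuming the resaved file no longer embeds arbitrary objects.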

Tokenizer improvements

  • Add control characters to the set of characters treated as whitespace when tokenizing, fixing a bug where certain Unicode control characters (such as "region end" markers) were incorrectly attached to words. #1573 Addresses #1257

  • Add tokenizer augmentation that occasionally replaces commas with en-dashes or em-dashes, so that models trained on datasets that lack those characters learn to treat them similarly to commas. #1573

  • Add regression tests for Spanish tokenization errors reported in #1257 and tests for the whitespace/control-character handling and tokenizer augmentations. #1573
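
The two tokenizer changes above can be pictured with a minimal standard-library sketch. This is not the code from #1573: the function names, the use of Unicode category "Cc" to detect control characters, and the replacement probability are all assumptions made for illustration.

```python
import random
import unicodedata


def is_space_like(ch: str) -> bool:
    """Treat ordinary whitespace and Unicode control characters (Cc) alike."""
    return ch.isspace() or unicodedata.category(ch) == "Cc"


def split_on_space_like(text: str) -> list:
    """Split text after mapping every space-like character to a plain space."""
    return "".join(" " if is_space_like(ch) else ch for ch in text).split()


def augment_commas(text: str, rng: random.Random, prob: float = 0.1) -> str:
    """With probability prob, replace all commas with an en-dash or em-dash.

    Hypothetical augmentation: the real one may handle spacing differently.
    """
    if "," not in text or rng.random() >= prob:
        return text
    dash = rng.choice(["\u2013", "\u2014"])  # en-dash or em-dash
    return text.replace(",", dash)


# U+0096 is a C1 control character that str.split() does not treat as a space.
print("hola\x96mundo".split())
print(split_on_space_like("hola\x96mundo"))
print(augment_commas("uno, dos, tres", random.Random(0), prob=1.0))
```

Passing an explicit random.Random instance keeps the augmentation reproducible in tests.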

Lemmatizer improvements

  • Enforce weights_only=True when loading the lemma classifier, avoiding a possible security risk. #1584

  • The lemma classifier for ja_gsd is now also attached to ja_combined. #1584

  • Train and attach two lemma classifiers to en_combined, one for 's and one for her, both of which can be reliably classified from the available data. #1584

  • Add end-to-end unit tests for run_lemma.py, including training a lemmatizer and attaching multiple lemma classifiers. #1586

Spanish model improvements

  • Add a silver dataset covering como_VERB in Spanish to the combined Spanish training data, addressing #1440. Also adds a utility to print a confusion matrix of tagging results filtered by a word regex (e.g. --upos_word_regex "^(?i:como)$"), making it easier to isolate the effects of annotation changes. #1579 stanfordnlp/handparsed-treebank@d0c29a3

  • Add silver training sentences covering unknown Spanish VERB lemmas to the combined Spanish lemmatizer, addressing #1255. Also includes a script to check lemmatizer results for a batch of word/POS combinations. #1580 stanfordnlp/handparsed-treebank@11327ef
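
As a rough illustration of the word-regex filter mentioned above, the sketch below counts gold/predicted tag pairs only for words matching ^(?i:como)$. It is not Stanza's confusion matrix utility: filtered_confusion and the toy rows are hypothetical, and the tags are made up for the example rather than taken from real model output.

```python
import re
from collections import Counter

# Same pattern as the example flag value; (?i:...) makes only that group
# case-insensitive.
WORD_RE = re.compile(r"^(?i:como)$")


def filtered_confusion(rows, word_re):
    """Count (gold, predicted) tag pairs for words matching word_re."""
    counts = Counter()
    for word, gold, pred in rows:
        if word_re.match(word):
            counts[(gold, pred)] += 1
    return counts


# Toy (word, gold tag, predicted tag) rows, invented for illustration.
rows = [
    ("como", "SCONJ", "SCONJ"),
    ("Como", "VERB", "SCONJ"),
    ("COMO", "VERB", "VERB"),
    ("cómo", "ADV", "ADV"),    # accented form does not match the regex
    ("comer", "VERB", "VERB"),  # different word, filtered out
]

counts = filtered_confusion(rows, WORD_RE)
for (gold, pred), n in sorted(counts.items()):
    print(f"gold={gold} pred={pred} count={n}")
```

Restricting the matrix to one word form makes a change such as the new como_VERB data visible without it being drowned out by the rest of the tag set.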

Italian model improvements

  • Rebuild Italian models with additional training data to fix incorrect lemmatization of common words, including "violino" (previously mapped to "violare") #1563 stanfordnlp/handparsed-treebank@9c46db1 and "diversi" (previously split and mapped to "dire"). The "diversi" error was resolved by retraining with the more accurate models. #1564

English model improvements

  • The long-standing issue of "can" being tagged as a modal verb (MD) rather than a noun (NN) in noun phrases like "trash can" and "soda can" is now resolved with the combined English models. #408

New / updated models

  • Odia (Oriya) now uses the ODTB package as the default. Mixed POS and depparse training data is constructed from the Odia dataset combined with related Indic languages present in MuRIL-Large, following the approach used for Sindhi. The Odia NER model is now also connected to the default package. #1583

Demo and visualization improvements

  • Rewrite stanza-parseviewer.js to use a proper constituency parse visualizer instead of a repurposed dependency parse visualizer, fixing the broken vertical striping. Also adds a table of morphological features to the visualization. #1581 Addresses #1358

  • Various small improvements to the web demo: route all responses to /; templatize stanza-brat.html so the version number is sourced from _version.py; move the logo to the demo directory for easier serving; add favicon support to the pipeline demo; guard against empty POST requests. #1582

Documentation and code quality

  • Add docstrings and unit tests for the confusion matrix module. #1577