v1.12.2 - weights_only security fix #1601
AngledLuffa announced in Announcements
This is the same as v1.12.1, with the following important update: all models now enforce weights_only=True when loading. This may render some old models obsolete (all models distributed with Stanza are already patched). If that happens, please load and then resave your model with an older version of Stanza, such as v1.12.1.
Security / PyTorch compatibility
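As background on why enforcing weights_only=True matters: PyTorch checkpoints are pickle-based, and unrestricted unpickling can run arbitrary code. With weights_only=True, PyTorch restricts the unpickler to an allow-list of types. The sketch below illustrates that general idea using only the Python standard library. It is not Stanza's or PyTorch's code; RestrictedUnpickler, ALLOWED_GLOBALS, restricted_loads, and Payload are hypothetical names chosen for this illustration.

```python
import io
import pickle

# Hypothetical allow-list for this illustration only; PyTorch keeps its own.
ALLOWED_GLOBALS = {("collections", "OrderedDict")}


class RestrictedUnpickler(pickle.Unpickler):
    """Unpickler that refuses to resolve any global not on the allow-list."""

    def find_class(self, module, name):
        if (module, name) in ALLOWED_GLOBALS:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"global '{module}.{name}' is not allowed")


def restricted_loads(data):
    """Load pickled bytes through the restricted unpickler."""
    return RestrictedUnpickler(io.BytesIO(data)).load()


class Payload:
    """Stand-in for a malicious object: unpickling it would call print()."""

    def __reduce__(self):
        return (print, ("arbitrary code ran during unpickling",))


# Plain containers and numbers need no globals, so they load fine.
safe_blob = pickle.dumps({"weights": [0.1, 0.2], "epoch": 3})
print(restricted_loads(safe_blob))

# The payload references builtins.print, which is not on the allow-list.
malicious_blob = pickle.dumps(Payload())
try:
    restricted_loads(malicious_blob)
except pickle.UnpicklingError as err:
    print("rejected:", err)
```

A legacy model that stored arbitrary Python objects would be rejected in the same way, which is why such models need to be loaded and resaved with an older version.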
Enforce weights_only=True when loading the lemma classifier, addressing part of the security advisory GHSA-v5jw-96jm-7h2c. This should already be the default in later versions of PyTorch, but is now explicitly enforced. #1584
The code path that allowed weights_only=False has been removed from all models. Instead, attempting to load a legacy model will throw an exception. If that happens, please load your model with an older version (such as 1.12.1) and resave it before proceeding. #1587
Tokenizer improvements
Add control characters to the set of characters treated as whitespace when tokenizing, fixing a bug where certain Unicode control characters (such as "region end" markers) were incorrectly attached to words. #1573 Addresses #1257
Add tokenizer augmentation that occasionally replaces commas with en-dashes or em-dashes, so that models trained on datasets that lack those characters learn to treat them similarly to commas. #1573
Add regression tests for Spanish tokenization errors reported in #1257 and tests for the whitespace/control-character handling and tokenizer augmentations. #1573
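The two tokenizer changes above can be sketched roughly as follows. This is a minimal illustration, not Stanza's implementation: the function names, the use of Unicode category Cc to identify control characters, and the replacement probability are all assumptions made for the example.

```python
import random
import unicodedata


def split_on_whitespace_and_controls(text):
    """Split text, treating Unicode control characters (category Cc) as whitespace."""
    cleaned = "".join(
        " " if unicodedata.category(ch) == "Cc" else ch for ch in text
    )
    return cleaned.split()


def augment_commas(text, rng, prob=0.05, dashes=("\u2013", "\u2014")):
    """With probability `prob`, replace each ', ' with a spaced en-dash or em-dash."""
    pieces = text.split(", ")
    out = [pieces[0]]
    for piece in pieces[1:]:
        if rng.random() < prob:
            out.append(" " + rng.choice(dashes) + " ")
        else:
            out.append(", ")
        out.append(piece)
    return "".join(out)


# U+0097 is a C1 control character; without special handling it would
# stay glued to the neighbouring word.
print(split_on_whitespace_and_controls("hola\u0097mundo"))

rng = random.Random(0)
print(augment_commas("uno, dos, tres", rng, prob=0.5))
```

Passing an explicit random.Random instance keeps the augmentation reproducible, which is convenient when writing regression tests for it.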
Lemmatizer improvements
Enforce weights_only=True when loading the lemma classifier, avoiding a possible security risk. #1584
The lemma classifier for ja_gsd is now also attached to ja_combined. #1584
Train and attach two lemma classifiers to en_combined: both 's and her can be reliably classified from the available data. #1584
Add end-to-end unit tests for run_lemma.py, including training a lemmatizer and attaching multiple lemma classifiers. #1586
Spanish model improvements
Add a silver dataset covering como_VERB in Spanish to the combined Spanish training data, addressing #1440. Also adds a utility to print a confusion matrix of tagging results filtered by a word regex (e.g. --upos_word_regex "^(?i:como)$"), making it easier to isolate the effects of annotation changes. #1579 stanfordnlp/handparsed-treebank@d0c29a3
Add silver training sentences covering unknown Spanish VERB lemmas to the combined Spanish lemmatizer, addressing #1255. Also includes a script to check lemmatizer results for a batch of word/POS combinations. #1580 stanfordnlp/handparsed-treebank@11327ef
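To show what a confusion matrix filtered by a word regex looks like, here is a small standalone sketch. It is not the Stanza utility itself: filtered_confusion is a hypothetical helper, and the (word, gold tag, predicted tag) rows are made-up data for illustration. The pattern is the one quoted above, using Python's scoped (?i:...) flag for case-insensitive matching.

```python
import re
from collections import Counter


def filtered_confusion(rows, word_regex):
    """Count (gold, predicted) tag pairs for words matching word_regex."""
    pattern = re.compile(word_regex)
    counts = Counter()
    for word, gold, pred in rows:
        if pattern.search(word):
            counts[(gold, pred)] += 1
    return counts


# Made-up tagging results: (word, gold UPOS, predicted UPOS).
rows = [
    ("como", "SCONJ", "SCONJ"),
    ("Como", "VERB", "SCONJ"),
    ("como", "VERB", "VERB"),
    ("comorbilidad", "NOUN", "NOUN"),
    ("comer", "VERB", "VERB"),
]

matrix = filtered_confusion(rows, r"^(?i:como)$")
for (gold, pred), n in sorted(matrix.items()):
    print(f"gold={gold:<6} pred={pred:<6} count={n}")
```

Anchoring the pattern with ^ and $ keeps words such as "comorbilidad" out of the counts, so only the tagging of "como" itself is reported.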
Italian model improvements
English model improvements
New / updated models
Demo and visualization improvements
Rewrite stanza-parseviewer.js to use a proper constituency parse visualizer instead of a repurposed dependency parse visualizer, fixing the broken vertical striping. Also adds a table of morphological features to the visualization. #1581 Addresses #1358
Various small improvements to the web demo: route all responses to /; templatize stanza-brat.html so the version number is sourced from _version.py; move the logo to the demo directory for easier serving; add favicon support to the pipeline demo; guard against empty POST requests. #1582
Documentation and code quality
This discussion was created from the release v1.12.2 - weights_only security fix.