v1.12.1 - PyTorch compatibility, Spanish and Lemmatizer improvements

@AngledLuffa AngledLuffa released this 28 May 02:10
· 36 commits to main since this release

Security / PyTorch compatibility

  • Enforce weights_only=True when loading the lemma classifier, addressing part of the security advisory GHSA-v5jw-96jm-7h2c. This should already be the default in later versions of PyTorch, but is now explicitly enforced. #1584
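
For context, `torch.load(..., weights_only=True)` restricts unpickling to an allowlist of types rather than executing whatever the file references. The sketch below illustrates that general idea using only the standard library `pickle` module; it is not PyTorch's or Stanza's implementation, and `SAFE_GLOBALS`, `AllowlistUnpickler`, and `safe_loads` are hypothetical names chosen for illustration.

```python
import io
import pickle
from collections import OrderedDict

# Hypothetical allowlist: the only globals the unpickler may resolve.
SAFE_GLOBALS = {("collections", "OrderedDict")}

class AllowlistUnpickler(pickle.Unpickler):
    """Unpickler that refuses any global not on the allowlist."""

    def find_class(self, module, name):
        if (module, name) in SAFE_GLOBALS:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"blocked global: {module}.{name}")

def safe_loads(data):
    """Load pickled bytes, rejecting references to non-allowlisted globals."""
    return AllowlistUnpickler(io.BytesIO(data)).load()

# Plain containers and allowlisted types load normally.
weights = OrderedDict(layer=[0.5, 1.5])
restored = safe_loads(pickle.dumps(weights))

# A payload that references an arbitrary callable is rejected.
try:
    safe_loads(pickle.dumps(print))
    blocked = False
except pickle.UnpicklingError:
    blocked = True
```

An unrestricted `pickle.loads` would resolve the second payload without complaint, which is why loading untrusted checkpoints without such a restriction is risky.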

Tokenizer improvements

  • Add control characters to the set of characters treated as whitespace when tokenizing, fixing a bug where certain Unicode control characters (such as "region end" markers) were incorrectly attached to words. #1573 Addresses #1257

  • Add tokenizer augmentation that occasionally replaces commas with en-dashes or em-dashes, so that models trained on datasets that lack those characters learn to treat them similarly to commas. #1573

  • Add regression tests for Spanish tokenization errors reported in #1257 and tests for the whitespace/control-character handling and tokenizer augmentations. #1573
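
The two tokenizer changes above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from #1573: the function names are hypothetical, "control character" is taken to mean Unicode category `Cc`, the tokenizer is a plain whitespace split, and U+0097 is used only as an example of a control character that Python's `str.split()` does not treat as whitespace.

```python
import random
import unicodedata

EN_DASH = "\u2013"
EM_DASH = "\u2014"

def control_chars_to_spaces(text):
    """Replace every Unicode control character (category Cc) with a space."""
    return "".join(" " if unicodedata.category(ch) == "Cc" else ch for ch in text)

def whitespace_tokenize(text):
    """Toy tokenizer: normalize control characters, then split on whitespace."""
    return control_chars_to_spaces(text).split()

def augment_commas(text, prob, rng):
    """With probability `prob`, replace each comma with an en-dash or em-dash."""
    out = []
    for ch in text:
        if ch == "," and rng.random() < prob:
            out.append(rng.choice([EN_DASH, EM_DASH]))
        else:
            out.append(ch)
    return "".join(out)

# Without normalization the control character stays glued to the word.
raw_tokens = "hola\u0097 mundo".split()
clean_tokens = whitespace_tokenize("hola\u0097 mundo")

# Augmentation with a seeded generator so the run is repeatable.
rng = random.Random(0)
augmented = augment_commas("uno, dos, tres", 1.0, rng)
```

In a real training setup the replacement probability would be small, so the model still sees mostly unmodified text.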

Lemmatizer improvements

  • Enforce weights_only=True when loading the lemma classifier, avoiding a possible security risk from unpickling untrusted model files. #1584

  • The lemma classifier for ja_gsd is now also attached to ja_combined. #1584

  • Train and attach two lemma classifiers to en_combined, one for 's and one for her, since both can be reliably classified from the available data. #1584

  • Add end-to-end unit tests for run_lemma.py, including training a lemmatizer and attaching multiple lemma classifiers. #1586

Spanish model improvements

  • Add a silver dataset covering como_VERB in Spanish to the combined Spanish training data, addressing #1440. Also adds a utility to print a confusion matrix of tagging results filtered by a word regex (e.g. --upos_word_regex "^(?i:como)$"), making it easier to isolate the effects of annotation changes. #1579 stanfordnlp/handparsed-treebank@d0c29a3

  • Add silver training sentences covering unknown Spanish VERB lemmas to the combined Spanish lemmatizer, addressing #1255. Also includes a script to check lemmatizer results for a batch of word/POS combinations. #1580 stanfordnlp/handparsed-treebank@11327ef
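
To show what a word-filtered confusion matrix looks like, here is a minimal sketch. It is not the utility added in #1579: `filtered_confusion` is a hypothetical helper, and the (word, gold tag, predicted tag) rows are made-up illustrative data. Only the regex is taken from the example above.

```python
import re
from collections import Counter

def filtered_confusion(rows, word_regex):
    """Count (gold, predicted) tag pairs for words matching word_regex."""
    pattern = re.compile(word_regex)
    counts = Counter()
    for word, gold, pred in rows:
        if pattern.search(word):
            counts[(gold, pred)] += 1
    return counts

# Made-up tagging results: (word, gold UPOS, predicted UPOS).
rows = [
    ("Como", "VERB", "SCONJ"),
    ("como", "SCONJ", "SCONJ"),
    ("como", "VERB", "VERB"),
    ("comer", "VERB", "VERB"),
]

# (?i:...) makes only the enclosed group case-insensitive.
matrix = filtered_confusion(rows, r"^(?i:como)$")
for (gold, pred), n in sorted(matrix.items()):
    print(f"gold={gold:6} pred={pred:6} count={n}")
```

Restricting the counts to a single word form keeps errors on that word from being swamped by the rest of the test set.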

Italian model improvements

  • Rebuild Italian models with additional training data to fix incorrect lemmatization of two common words. "violino" was incorrectly mapped to "violare" (#1563 stanfordnlp/handparsed-treebank@9c46db1). "diversi" was incorrectly split and mapped to "dire", which is resolved by retraining with the more accurate models (#1564).

English model improvements

  • The long-standing issue of "can" being tagged as a modal verb (MD) rather than a noun (NN) in noun phrases like "trash can" and "soda can" is now resolved with the combined English models. #408

New / updated models

  • Odia (Oriya) now uses the ODTB package as the default. Mixed POS and depparse training data is constructed from the Odia dataset combined with related Indic languages present in MuRIL-Large, following the approach used for Sindhi. The Odia NER model is now also connected to the default package. #1583

Demo and visualization improvements

  • Rewrite stanza-parseviewer.js to use a proper constituency parse visualizer instead of a repurposed dependency parse visualizer, fixing the broken vertical striping. Also adds a table of morphological features to the visualization. #1581 Addresses #1358

  • Various small improvements to the web demo: route all responses to /; templatize stanza-brat.html so the version number is sourced from _version.py; move the logo to the demo directory for easier serving; add favicon support to the pipeline demo; guard against empty POST requests. #1582

Documentation and code quality

  • Add docstrings and unit tests for the confusion matrix module. #1577