v1.12.1 - PyTorch compatibility, Spanish and Lemmatizer improvements
Security / PyTorch compatibility
- Enforce `weights_only=True` when loading the lemma classifier, addressing part of the security advisory GHSA-v5jw-96jm-7h2c. This should already be the default in later versions of PyTorch, but is now explicitly enforced. #1584
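To illustrate why restricting what a checkpoint loader will unpickle matters, here is a minimal standard-library sketch of an allowlisting unpickler. It is not PyTorch's or Stanza's implementation; `RestrictedUnpickler`, `restricted_loads`, and the `ALLOWED` table are hypothetical names chosen for this example, and the allowlist is deliberately tiny.

```python
import collections
import io
import pickle

# Hypothetical allowlist: the only global this loader agrees to resolve.
ALLOWED = {("collections", "OrderedDict"): collections.OrderedDict}


class RestrictedUnpickler(pickle.Unpickler):
    """Unpickler that refuses any global not explicitly allowlisted."""

    def find_class(self, module, name):
        try:
            return ALLOWED[(module, name)]
        except KeyError:
            raise pickle.UnpicklingError(
                f"global '{module}.{name}' is not allowed"
            ) from None


def restricted_loads(data):
    """Load pickled bytes, permitting only allowlisted globals."""
    return RestrictedUnpickler(io.BytesIO(data)).load()


class Payload:
    """Stand-in for a malicious object: unpickling it would call print()."""

    def __reduce__(self):
        return (print, ("arbitrary code ran",))


# Plain containers of numbers load normally.
weights = collections.OrderedDict(layer=[0.1, 0.2])
restored = restricted_loads(pickle.dumps(weights))

# The payload is rejected before its callable is ever resolved.
try:
    restricted_loads(pickle.dumps(Payload()))
    rejected = False
except pickle.UnpicklingError:
    rejected = True
```

An unrestricted `pickle.loads` would have executed the payload's callable, which is the class of risk that loading with a weights-only restriction is meant to close off.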
Tokenizer improvements
- Add control characters to the set of characters treated as whitespace when tokenizing, fixing a bug where certain Unicode control characters (such as "region end" markers) were incorrectly attached to words. #1573 Addresses #1257
- Add tokenizer augmentation that occasionally replaces commas with en-dashes or em-dashes, so that models trained on datasets that lack those characters learn to treat them similarly to commas. #1573
- Add regression tests for Spanish tokenization errors reported in #1257 and tests for the whitespace/control-character handling and tokenizer augmentations. #1573
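The two tokenizer changes above can be sketched in plain Python. This is only an illustration of the ideas, not Stanza's code: the function names, the choice of Unicode category `Cc` as the definition of "control character", and the replacement probability are all assumptions made for the example.

```python
import random
import unicodedata

EN_DASH = "\u2013"
EM_DASH = "\u2014"


def is_separator(ch):
    """True for ordinary whitespace and for Unicode control characters (category Cc)."""
    return ch.isspace() or unicodedata.category(ch) == "Cc"


def split_words(text):
    """Split text into words, treating control characters like whitespace."""
    cleaned = "".join(" " if is_separator(ch) else ch for ch in text)
    return cleaned.split()


def augment_commas(text, prob=0.05, rng=None):
    """Replace each comma with an en-dash or em-dash with probability `prob`."""
    if rng is None:
        rng = random.Random()
    out = []
    for ch in text:
        if ch == "," and rng.random() < prob:
            out.append(rng.choice((EN_DASH, EM_DASH)))
        else:
            out.append(ch)
    return "".join(out)


# U+0097 is a C1 control character that str.split() alone would not split on.
words = split_words("hola\u0097 mundo")

# With prob=1.0 every comma is swapped for one of the two dashes.
augmented = augment_commas("uno, dos, tres", prob=1.0, rng=random.Random(0))
```

Applying the augmentation only occasionally keeps most training sentences unchanged while still exposing the model to dashes in comma-like positions.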
Lemmatizer improvements
- Enforce `weights_only=True` when loading the lemma classifier, avoiding a possible security risk. #1584
- The lemma classifier for `ja_gsd` is now also attached to `ja_combined`. #1584
- Train and attach two lemma classifiers to `en_combined`: both `'s` and `her` can be reliably classified from the available data. #1584
- Add end-to-end unit tests for `run_lemma.py`, including training a lemmatizer and attaching multiple lemma classifiers. #1586
Spanish model improvements
- Add a silver dataset covering `como_VERB` in Spanish to the combined Spanish training data, addressing #1440. Also adds a utility to print a confusion matrix of tagging results filtered by a word regex (e.g. `--upos_word_regex "^(?i:como)$"`), making it easier to isolate the effects of annotation changes. #1579 stanfordnlp/handparsed-treebank@d0c29a3
- Add silver training sentences covering unknown Spanish VERB lemmas to the combined Spanish lemmatizer, addressing #1255. Also includes a script to check lemmatizer results for a batch of word/POS combinations. #1580 stanfordnlp/handparsed-treebank@11327ef
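As a rough picture of what a confusion matrix filtered by a word regex computes, the sketch below counts (gold, predicted) tag pairs only for words matching a pattern. It is a standalone illustration rather than the utility added in #1579; the function name and the toy rows are invented for the example, and only the regex is taken from the note above.

```python
import re
from collections import Counter


def filtered_confusion(rows, word_regex):
    """Count (gold, predicted) tag pairs for words matching `word_regex`.

    `rows` is an iterable of (word, gold_tag, predicted_tag) triples.
    """
    pattern = re.compile(word_regex)
    counts = Counter()
    for word, gold, pred in rows:
        if pattern.search(word):
            counts[(gold, pred)] += 1
    return counts


# Toy tagging results, made up for illustration.
rows = [
    ("como", "VERB", "SCONJ"),
    ("Como", "VERB", "VERB"),
    ("como", "SCONJ", "SCONJ"),
    ("comer", "VERB", "VERB"),
]

# The scoped (?i:...) flag makes the match case-insensitive, so "Como" is included.
matrix = filtered_confusion(rows, r"^(?i:como)$")

for (gold, pred), n in sorted(matrix.items()):
    print(f"gold={gold:6s} pred={pred:6s} count={n}")
```

Restricting the matrix to one word form makes it easier to see whether a change such as the `como_VERB` silver data moved that word's errors without the rest of the tag set drowning out the signal.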
Italian model improvements
- Rebuild Italian models with additional training data to fix incorrect lemmatization of common words, including "violino" (was incorrectly mapped to "violare") #1563 stanfordnlp/handparsed-treebank@9c46db1, and "diversi" (was incorrectly split and mapped to "dire"), which was resolved by retraining with the more accurate models #1564
English model improvements
- The long-standing issue of "can" being tagged as a modal verb (MD) rather than a noun (NN) in noun phrases like "trash can" and "soda can" is now resolved with the combined English models. #408
New / updated models
- Odia (Oriya) now uses the ODTB package as the default. Mixed POS and depparse training data is constructed from the Odia dataset combined with related Indic languages present in MuRIL-Large, following the approach used for Sindhi. The Odia NER model is now also connected to the default package. #1583
Demo and visualization improvements
- Rewrite `stanza-parseviewer.js` to use a proper constituency parse visualizer instead of a repurposed dependency parse visualizer, fixing the broken vertical striping. Also adds a table of morphological features to the visualization. #1581 Addresses #1358
- Various small improvements to the web demo: route all responses to `/`; templatize `stanza-brat.html` so the version number is sourced from `_version.py`; move the logo to the demo directory for easier serving; add favicon support to the pipeline demo; guard against empty POST requests. #1582
Documentation and code quality
- Add docstrings and unit tests for the confusion matrix module. #1577