Stanza v1.2.0

@AngledLuffa released this 29 Jan 20:05
9aa915e

Overview

All models other than NER and Sentiment were retrained with the new UD 2.7 release. Quite a few of them have data augmentation fixes for problems which arise in common use rather than when running an evaluation task. This release also features various enhancements, bug fixes, and performance improvements.

New features and enhancements

  • Models trained on combined datasets in English and Italian: The default models for English are now a combination of EWT and GUM. The default models for Italian now combine ISDT, VIT, TWITTIRO, PoSTWITA, and a custom dataset including MWT tokens.

  • NER transfer learning: Users can now fine-tune all or part of the parameters of a trained NER model on a new dataset. (#351, thanks to @gawy for the contribution)

  • Multi-document support: The Stanza Pipeline now supports multi-Document input! To process multiple documents without having to worry about document boundaries, simply pass a list of Stanza Document objects into the Pipeline. (#70 #577)

  • Added API links from token to sentence: It's now easier to access Stanza data objects from related ones. To access the sentence object of a token or a word, simply use token.sent or word.sent. (#533 #554)

  • New external tokenizer for Thai with PyThaiNLP: Try it out with, for example, stanza.Pipeline(lang='th', processors={'tokenize': 'pythainlp'}, package=None). (#567)

  • Faster tokenization: We have improved how the data pipeline works internally to reduce redundant data wrangling, which significantly speeds up the tokenization of long texts. If you have a really long line of text, you could see a speedup of 10x or more without changing anything. (#522)

  • Added a method for getting all the supported languages from the resources file: Wondering what languages Stanza supports and want to determine it programmatically? Wonder no more! Try stanza.resources.common.list_available_languages(). (#511 fa52f85)

  • Load mwt automagically if a model needs it: Multi-word token expansion is one of the most common things to miss from your Pipeline instantiation, and remembering to include it was a pain. The Pipeline now loads the mwt processor automatically when a model needs it. (#516 #515 and many others)

  • Vietnamese sentiment model based on VSFC: This is now part of the default language package for Vietnamese that you get from stanza.download("vi"). Enjoy!

  • More informative errors for missing models: Stanza now raises more helpful exceptions, with informative messages, when you are missing models. (#437 #430 ... #324 #438 ... #529 9539665 ... #575 #578)
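
Several of the API additions above (multi-document input, token.sent / word.sent, and list_available_languages()) can be tried together. The following is a minimal sketch rather than official usage: it assumes Stanza 1.2.0 or later is installed and the English models were fetched with stanza.download("en"). The helper names sentence_text_of, run_on_texts, and demo are hypothetical and exist only for this illustration.

```python
def sentence_text_of(unit):
    """Return the text of the sentence that contains a Stanza token or word.

    Relies only on the `.sent` back-link described in these release notes.
    """
    return unit.sent.text


def run_on_texts(texts, lang="en"):
    """Process several texts in one Pipeline call, one Document per text."""
    import stanza  # deferred import: the helper above does not need Stanza

    # Assumption: the models for `lang` are already downloaded.
    nlp = stanza.Pipeline(lang=lang, processors="tokenize")
    # Wrap each raw text in an unprocessed Document, then pass the whole list.
    in_docs = [stanza.Document([], text=t) for t in texts]
    return nlp(in_docs)


def demo():
    """Hypothetical driver; needs Stanza, its models, and network access."""
    from stanza.resources.common import list_available_languages

    print(list_available_languages())
    for doc in run_on_texts(["This is a test document.", "I wrote another one."]):
        first_token = doc.sentences[0].tokens[0]
        print(first_token.text, "->", sentence_text_of(first_token))
```

Calling demo() should print the supported language codes and then, for each input text, the first token followed by the sentence it belongs to. The Pipeline returns one Document per input Document, so document boundaries are preserved.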

Bugfixes

  • Fixed NER documentation for German to correctly point to the GermEval 2014 model for download. (4ee9f12 #559)

  • External tokenization library integration now respects no_ssplit, so you can use external tokenizers without disturbing your preferred sentence segmentation, just as with Stanza's own tokenizers. (#523 #556)

  • Telugu lemmatizer and tokenizer improvements: Telugu models now use the identity lemmatizer by default, and the tokenizer has been retrained to separate sentence-final punctuation. (#524 ba0aec3)

  • Spanish model would not tokenize "foo,bar": Now fixed. (#528 123d502)

  • Arabic model would not tokenize "asdf .": Now fixed. (#545 03b7cea)

  • Various tokenization models would split URLs and/or emails: URLs and emails are now robustly handled with regexes. (#539 #588)

  • Various parser and pos models would deterministically label "punct" for the final word: Resolved via data augmentation. (#471 #488 #491)

  • Norwegian tokenizers retrained to separate final punctuation: The fix is an upstream data fix. (#305 UniversalDependencies/UD_Norwegian-Bokmaal#5)

  • Bugfix for conll eval: Fixes an error in the data conversion from a Python Document object to CoNLL format. (#484 #483, thanks @m0re4u)

  • Less randomness in sentiment results: Fixes fluctuating predictions from the sentiment model. (#458 274474c)

  • Bugfix which should make it easier to use in Jupyter / Colab: This fixes the issue where Jupyter notebooks (and by extension Colab) do not accept sys.stderr as the stderr of popen. (#434 #431)

  • Misc fixes for training, concurrency, and edge cases in basic Pipeline usage

    • Fix for mwt training (#446)
    • Fix for race condition in seq2seq models (#463 #462)
    • Fix for race condition in CRF (#566 #561)
    • Fix for empty text in pipeline (#475 #474)
    • Fix for resources not freed when downloading (#502 #503)
    • Fix for Vietnamese pipeline not working (#531 #535)
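
To illustrate the idea behind the URL and email fix listed above, here is a minimal sketch of regex-based protection in a toy tokenizer. It is not Stanza's implementation: the patterns and the tokenize function are assumptions made for this example, and the URL pattern is deliberately naive (it would, for instance, absorb punctuation that directly follows a URL).

```python
import re

# Illustrative patterns only; the expressions Stanza actually uses may differ.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
# Spans matching either pattern are kept as a single token.
PROTECTED_RE = re.compile(f"{EMAIL_RE.pattern}|{URL_RE.pattern}")
# Fallback rule for everything else: runs of word characters, or one
# punctuation character at a time.
WORD_OR_PUNCT_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Split text into tokens while keeping URLs and emails whole."""
    tokens = []
    pos = 0
    for match in PROTECTED_RE.finditer(text):
        # Tokenize the ordinary text before the protected span.
        tokens.extend(WORD_OR_PUNCT_RE.findall(text[pos:match.start()]))
        # Emit the URL or email untouched.
        tokens.append(match.group())
        pos = match.end()
    # Tokenize whatever follows the last protected span.
    tokens.extend(WORD_OR_PUNCT_RE.findall(text[pos:]))
    return tokens


print(tokenize("Mail me@example.com or see https://example.org/a?b=1 now"))
```

Without the protected spans, the fallback rule alone would break the email at "@" and "." and the URL at every punctuation mark, which is the kind of splitting this fix addresses.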

BREAKING CHANGES

  • Renamed stanza.models.tokenize -> stanza.models.tokenization (#452). This stops the tokenize directory from shadowing a built-in library.
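
Code that imports from the old package path and has to run against Stanza releases on both sides of the rename could fall back from the new name to the old one. This is a minimal sketch of that approach. The helper first_importable is hypothetical and not part of Stanza; only the two module paths come from the note above.

```python
import importlib


def first_importable(module_names):
    """Import and return the first module in `module_names` that exists."""
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError:
            # Covers ModuleNotFoundError too; try the next candidate.
            continue
    raise ImportError(f"none of these modules could be imported: {module_names}")


def load_stanza_tokenization():
    """Return the Stanza tokenizer package under its new or old name."""
    return first_importable(
        ["stanza.models.tokenization", "stanza.models.tokenize"]
    )
```

Listing the new name first means current releases never touch the old path. The old name is tried only when the new one is missing.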