Stanza v1.2.1

@AngledLuffa AngledLuffa released this 17 Jun 17:12
· 2365 commits to main since this release
68aa426

Overview

All models other than NER and Sentiment were retrained with the new UD 2.8 release. All of the updates include the data augmentation fixes applied in 1.2.0, along with new augmentations for tokenization issues and end-of-sentence issues. This release also features various enhancements, bug fixes, and performance improvements, along with 4 new NER models.

Model improvements

  • Add Bulgarian, Finnish, Hungarian, Vietnamese NER models

    • The Bulgarian model is trained on BSNLP 2019 data.
    • The Finnish model is trained on the Turku NER data.
    • The Hungarian model is trained on a combination of the NYTK dataset and earlier business and criminal NER datasets.
    • The Vietnamese model is trained on the VLSP 2018 data.
    • Furthermore, the script for preparing the lang-uk NER data has been integrated (c1f0bee)
  • Use new word vectors for Armenian, including better coverage for the new Western Armenian dataset (d9e8301)

  • Add copy mechanism in the seq2seq model. This fixes some unusual Spanish multi-word token expansion errors and potentially improves lemmatization performance. (#692 #684)

  • Fix Spanish POS and depparse mishandling sentences where the leading ¿ is missing (#699 #698)

  • Fix tokenization breaking when a newline splits a Chinese token (#632 #531)

  • Fix tokenization of parentheses in Chinese (452d842)

  • Fix various issues with characters not present in UD training data, such as the ellipsis character or the Unicode apostrophe (db05552 f01a142 85898c5)

  • Fix a variety of issues with Vietnamese tokenization - remove a language-specific model improvement which gained roughly 1% F1 but caused numerous hard-to-track issues (3ccb132)

  • Fix Vietnamese words containing spaces not being found in the embedding used for POS and depparse (1972122)

  • Include UD_English-GUMReddit in the GUM models (9e6367c)

  • Add Pronouns & PUD to the mixed English models (various data improvements made this more appealing) (f74bef7)
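
To give a feel for what a copy mechanism in a seq2seq decoder does, here is a minimal sketch of one common formulation (pointer-generator style mixing): the decoder's vocabulary distribution is blended with its attention over the source symbols, so a symbol seen in the input can be produced even when the generator gives it little or no probability. The function name, arguments, and numbers are hypothetical, and the release notes do not say that Stanza's implementation works exactly this way.

```python
from collections import defaultdict


def mix_copy_distribution(vocab_probs, attention, source_tokens, p_gen):
    """Blend a generation distribution with a copy distribution.

    vocab_probs:   dict mapping output symbol -> generation probability
    attention:     attention weights over the source positions (sums to 1)
    source_tokens: the source symbols, aligned with `attention`
    p_gen:         probability of generating rather than copying (0..1)
    """
    if not 0.0 <= p_gen <= 1.0:
        raise ValueError("p_gen must be between 0 and 1")
    if len(attention) != len(source_tokens):
        raise ValueError("attention and source_tokens must align")

    mixed = defaultdict(float)
    # Generation path: scale the vocabulary distribution by p_gen.
    for symbol, prob in vocab_probs.items():
        mixed[symbol] += p_gen * prob
    # Copy path: add the attention mass of every source position
    # to the symbol found at that position.
    for symbol, weight in zip(source_tokens, attention):
        mixed[symbol] += (1.0 - p_gen) * weight
    return dict(mixed)


# Hypothetical character-level example: "l" is absent from the generator's
# distribution but can still be produced by copying it from the source.
mixed = mix_copy_distribution(
    vocab_probs={"d": 0.2, "e": 0.3, "a": 0.5},
    attention=[0.1, 0.8, 0.1],
    source_tokens=["d", "e", "l"],
    p_gen=0.5,
)
```

In this toy case the out-of-distribution symbol "l" ends up with nonzero probability, which is the kind of behavior that helps when an expansion or lemma mostly reuses characters from the input.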

Interface enhancements

  • Add ability to pass a Document to the pipeline in pretokenized mode (f88cd8c #696)

  • Track comments when reading and writing conll files (#676, originally from @danielhers in #155)

  • Add a proxy parameter for downloads to pass through to the requests module (#638)

  • Add sent_idx to tokens (ee6135c)
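
To illustrate what tracking comments in conll files means, here is a small standalone sketch that reads CoNLL-U text, keeps the "#" comment lines attached to each sentence, and writes them back unchanged. It does not use Stanza's own reader or writer; the function names and the sample sentence are hypothetical.

```python
def read_conllu(text):
    """Parse CoNLL-U text into a list of (comments, rows) pairs, one per sentence."""
    sentences = []
    comments, rows = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if comments or rows:
                sentences.append((comments, rows))
                comments, rows = [], []
        elif line.startswith("#"):
            # Comment lines (sent_id, text, ...) are kept verbatim.
            comments.append(line)
        else:
            rows.append(line.split("\t"))
    if comments or rows:
        sentences.append((comments, rows))
    return sentences


def write_conllu(sentences):
    """Serialize (comments, rows) pairs back to CoNLL-U text, comments first."""
    blocks = []
    for comments, rows in sentences:
        lines = list(comments) + ["\t".join(row) for row in rows]
        # Every sentence, including the last, is followed by a blank line.
        blocks.append("\n".join(lines) + "\n\n")
    return "".join(blocks)


SAMPLE = (
    "# sent_id = 1\n"
    "# text = Hello world\n"
    "1\tHello\thello\tINTJ\t_\t_\t0\troot\t_\t_\n"
    "2\tworld\tworld\tNOUN\t_\t_\t1\tvocative\t_\t_\n"
    "\n"
)

parsed = read_conllu(SAMPLE)
round_trip = write_conllu(parsed)
```

Because the comments travel with the sentence they belong to, metadata such as sent_id survives a read followed by a write.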

Bugfixes

  • Fix Windows encoding issues when reading conll documents, contributed by @yanirmr (b40379e #695)

  • Fix tokenization breaking when second batch is exactly eval_length (7263686 #634 #631)

Efficiency improvements

  • Bulk process for tokenization - greatly speeds up the use case of many small docs (5d2d39e)

  • Optimize MWT usage in pipeline & fix MWT bulk_process (#642 #643 #644)
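
The bulk processing path is meant for the many-small-documents case. A minimal sketch of how it might be used: wrap each text in a Document and hand the whole list to the pipeline in one call, rather than calling the pipeline once per text. The helper name bulk_tokenize and its document_cls argument are hypothetical (the class is passed in so the helper itself has no hard dependency on Stanza); the commented usage assumes Stanza and the English models are installed.

```python
def bulk_tokenize(nlp, texts, document_cls):
    """Tokenize many small texts with a single pipeline call.

    nlp:          a callable pipeline that accepts a list of documents
    texts:        an iterable of raw strings
    document_cls: the document class to wrap each text in
                  (stanza.Document in the usage below)
    """
    # One empty Document per text; the pipeline fills in the sentences.
    in_docs = [document_cls([], text=text) for text in texts]
    # Passing a list triggers processing of all documents together.
    out_docs = nlp(in_docs)
    # Return plain token strings: documents -> sentences -> tokens.
    return [
        [[token.text for token in sentence.tokens] for sentence in doc.sentences]
        for doc in out_docs
    ]


# Assumed usage (requires Stanza and downloaded English models):
#
#   import stanza
#   nlp = stanza.Pipeline(lang="en", processors="tokenize")
#   tokens = bulk_tokenize(nlp, ["Hello world.", "Stanza is fast."], stanza.Document)
```

The point of the single call is that the tokenizer can batch across documents instead of paying its per-call overhead for each short text.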

CoreNLP integration

  • Add a UD Enhancer tool which interfaces with CoreNLP's generic enhancer (#675)

  • Add an interface to CoreNLP tokensregex using stanza tokenization (#659)