
Stanza v1.4.1

@AngledLuffa released this 14 Sep 16:41

Stanza v1.4.1: Improvements to POS, constituency, and sentiment models; Jupyter visualization; and wider language coverage

Overview

We improve the quality of the POS, constituency, and sentiment models, add a displaCy integration for visualization, and add new models for a variety of languages.

New NER models

  • New Polish NER model based on NKJP, contributed by Karol Saputa and ryszardtuora
    #1070
    #1110

  • GermEval2014 is now the default German NER model, with an optional BERT version (see the usage sketch after this list)
    #1018
    #1022

  • Japanese conversion of GSD by Megagon
    #1038

  • Marathi NER model trained on the L3Cube dataset; includes a sentiment model as well
    #1043

  • Thai conversion of LST20
    555fc03

  • Kazakh conversion of KazNERD
    de6cd25
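
As a quick way to try one of the new models, here is a minimal sketch using the standard `stanza.download` / `stanza.Pipeline` entry points; it assumes the default `de` NER package now resolves to GermEval2014 as described above, and the example sentence is made up:

```python
import stanza

# Download the German models; the default NER package is now GermEval2014
stanza.download("de")

# Build a pipeline with tokenization and NER
nlp = stanza.Pipeline("de", processors="tokenize,ner")

doc = nlp("Angela Merkel besuchte gestern München.")
for ent in doc.ents:
    print(ent.text, ent.type)
```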

Other new models

  • Spanish sentiment model converted from TASS 2020
    #1104

  • VIT constituency dataset for Italian (see the usage sketch after this list)
    149f144
    ... and many subsequent updates

  • Combined UD models for Hebrew
    #1109
    e4fcf00

  • For UD treebanks with a small train set and a larger test set, the train and test splits are flipped:
    UD_Buryat-BDT, UD_Kazakh-KTB, UD_Kurmanji-MG, UD_Ligurian-GLT, UD_Upper_Sorbian-UFAL
    #1030
    9618d60

  • Spanish constituency parser trained from multiple sources: AnCora, LDC-NW, LDC-DF
    47740c6
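
The new Italian constituency model can be exercised through the regular constituency processor; a minimal sketch, assuming the VIT-trained parser is the package served for `it` (the example sentence is made up):

```python
import stanza

stanza.download("it")
# Constituency parsing needs tokenization, MWT expansion and POS tags first
nlp = stanza.Pipeline("it", processors="tokenize,mwt,pos,constituency")

doc = nlp("Questa è una frase di prova.")
for sentence in doc.sentences:
    print(sentence.constituency)
```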

Model improvements

  • Pretrained character LM integrated into the POS tagger. Gives a small to moderate gain for most languages at little additional cost
    #1086

  • Pretrained character LM integrated into the sentiment classifier. Improves English; other languages gain little
    #1025

  • Optional bidirectional LSTM and 2D max-pooling layers in the sentiment classifier,
    following the paper Text Classification Improved by Integrating Bidirectional LSTM with Two-dimensional Max Pooling
    #1098

  • Constituency parser training first learns with AdaDelta, then switches to another optimizer. Very helpful (see the sketch after this list)
    b1d10d3

  • Gradient clipping in constituency parser training
    365066a
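
For readers curious what the last two items look like in practice, here is a generic PyTorch sketch of that training schedule; it is not Stanza's actual training loop, and `model`, `loader`, and `compute_loss` are placeholders:

```python
import torch

def train(model, loader, compute_loss, switch_epoch=10, total_epochs=40, clip=5.0):
    # Stage 1: AdaDelta for the first epochs...
    optimizer = torch.optim.Adadelta(model.parameters())
    for epoch in range(total_epochs):
        # ...then switch to a second optimizer for the rest of training
        if epoch == switch_epoch:
            optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
        for batch in loader:
            optimizer.zero_grad()
            loss = compute_loss(model, batch)
            loss.backward()
            # Gradient clipping keeps early updates from exploding
            torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
            optimizer.step()
```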

Pipeline interface improvements

  • GPU memory savings: charlm reused between different processors in the same pipeline
    #1028

  • Word vectors are no longer saved inside the NER models. Saves bandwidth and disk space
    #1033

  • Functions to return the tagsets of NER and constituency parser models
    #1066
    #1073
    36b84db
    2db43c8

  • displaCy integration for visualizing NER results and dependency trees
    2071413
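
One way to use the displaCy integration from a notebook is to feed Stanza's NER spans into displaCy's manual rendering mode; a minimal sketch, with the conversion written out by hand rather than relying on a specific helper function:

```python
import stanza
from spacy import displacy

nlp = stanza.Pipeline("en", processors="tokenize,ner")
doc = nlp("Barack Obama was born in Hawaii.")

# Convert Stanza entity spans into displaCy's "manual" entity format
render_data = {
    "text": doc.text,
    "ents": [{"start": ent.start_char, "end": ent.end_char, "label": ent.type}
             for ent in doc.ents],
    "title": None,
}
displacy.render(render_data, style="ent", manual=True, jupyter=True)
```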

Bugfixes

  • Fixed tokenization taking extremely long on a single very long token (catastrophic backtracking in a regex)
    Thanks to Sk Adnan Hassan (VT) and Zainab Aamir (Stony Brook)
    #1056

  • Starting a new CoreNLP client without launching a server no longer waits for the server to become available
    Thanks to Mariano Crosetti
    #1059
    #1061

  • Read raw GloVe word vectors (they have no header line)
    #1074

  • Ensure that languages outside the permitted set are not chosen by the LangID model
    #1076
    #1077

  • Fix cache in Multilingual pipeline
    #1115
    cdf18d8

  • Fix loading of previously unseen languages in the Multilingual pipeline (see the sketch after this list)
    #1101
    e551ebe

  • Fix occasional NaN loss early in constituency parser training
    c4d7857
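
The multilingual pipeline covered by the two cache and loading fixes above can be exercised as follows; a minimal sketch, assuming the default `MultilingualPipeline` configuration (language identification plus tokenization) and made-up input text:

```python
from stanza.pipeline.multilingual import MultilingualPipeline

# Language is detected per document; models are downloaded and cached per language
nlp = MultilingualPipeline()
docs = nlp(["Hello, this is an English sentence.",
            "Bonjour, ceci est une phrase en français."])
for doc in docs:
    print(doc.lang, doc.sentences[0].text)
```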

Improved training tools