Stanza 1.3.0: LangID and Constituency Parser

AngledLuffa released this 06 Oct 06:28

Overview

Stanza 1.3.0 introduces a language identification (langid) model, a constituency parser, a dictionary feature in the tokenizer, and several other features and bugfixes.

New features

Langid model and multilingual pipeline
Based on "A reproduction of Apple's bi-directional LSTM models for language identification in short strings" by Toftrup et al. (2021).
(154b0e8)
Constituency parser
Based on "In-Order Transition-based Constituent Parsing" by Jiangming Liu and Yue Zhang. Currently an en_wsj model is available, with more to come.
(9031802)
Evalb interface to CoreNLP
Useful for evaluating the parser. Requires CoreNLP 4.3.0 or later.
Dictionary tokenizer feature
Noticeably improved performance for ZH, VI, TH
(#776)
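
To give a feel for the in-order transition system that the constituency parser above is based on, here is a minimal sketch of how a sequence of actions builds a tree. This is not Stanza's implementation: the names Tree, Projected, and parse are hypothetical, the actions are written by hand instead of being predicted by a model, and leaves are bare words without POS preterminals. In the in-order scheme, the first child of a constituent is built first, then the constituent's label is projected (PJ-X), then the remaining children are built, and REDUCE closes the constituent.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Tree:
    """A finished subtree; a leaf is a Tree with no children."""
    label: str
    children: List["Tree"] = field(default_factory=list)

    def __str__(self) -> str:
        if not self.children:
            return self.label
        return "(" + self.label + " " + " ".join(str(c) for c in self.children) + ")"


@dataclass
class Projected:
    """Marker for an open constituent whose first child sits just below it on the stack."""
    label: str


def parse(words: List[str], actions: List[str]) -> Tree:
    """Apply hand-written in-order actions (SHIFT, PJ-X, REDUCE) to a word list."""
    buffer = list(reversed(words))  # next word is at the end, so pop() is cheap
    stack = []
    for action in actions:
        if action == "SHIFT":
            # Move the next word from the buffer onto the stack as a leaf.
            stack.append(Tree(buffer.pop()))
        elif action.startswith("PJ-"):
            # Project label X: the item currently on top becomes X's first child.
            stack.append(Projected(action[3:]))
        elif action == "REDUCE":
            # Pop children built after the projection, then the marker,
            # then the first child that was built before the projection.
            later = []
            while not isinstance(stack[-1], Projected):
                later.append(stack.pop())
            marker = stack.pop()
            first = stack.pop()
            stack.append(Tree(marker.label, [first] + later[::-1]))
        else:
            raise ValueError("unknown action: " + action)
    if buffer or len(stack) != 1:
        raise ValueError("action sequence did not produce a single tree")
    return stack[0]


words = ["The", "dog", "barks"]
actions = ["SHIFT", "PJ-NP", "SHIFT", "REDUCE",
           "PJ-S", "SHIFT", "PJ-VP", "REDUCE", "REDUCE"]
tree = parse(words, actions)
print(tree)  # (S (NP The dog) (VP barks))
```

A real parser scores the possible actions at each step with a neural model and picks among them; the sketch only shows the bookkeeping that turns a chosen action sequence into a bracketed tree.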

Bugfixes / Reliability

HuggingFace integration
No more git issues complaining about unavailable models! (Hopefully)
(f7af504)
Sentiment processor crashes on certain inputs
(issue #804, fixed by e232f67)
