Release Stanford CoreNLP 4.0.0 · stanfordnlp/CoreNLP

Overview

Stanford CoreNLP 4.0.0 includes a major overhaul of tokenization and a large collection of new parsing and tagging models. There are also miscellaneous enhancements and fixes.

Enhancements

Tokenization now follows the UD v2.0 standard for English, French, German, and Spanish. For English, this means the "new" LDC tokenization (splitting on most hyphens); by default, parentheses are no longer escaped, and quotes and similar characters are no longer converted to ASCII sequences.
Upgraded options for normalizing special characters (quotes, parentheses, etc.) in PTBTokenizer
WhitespaceTokenizer now supports the same newline processing as PTBTokenizer
New mwt annotator for handling multiword tokens in French, German, and Spanish.
New models with more training data and better performance for tagging and parsing in English, French, German, and Spanish.
Add French NER
New Chinese segmentation based on CTB9
Improved handling of double codepoint characters
Easier syntax for specifying language-specific pipelines and NER pipeline properties
Improved CoNLL-U processing
Improved speed and memory performance for CRF training
Tregex support in CoreSentence
Updated library dependencies
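
Because the tokenization defaults changed, code that relied on the older behavior may need explicit tokenizer options. The following is a minimal sketch, not an official migration recipe. It assumes the pipeline property `tokenize.options` and the PTBTokenizer option names `splitHyphenated` and `normalizeParentheses`; check both against the PTBTokenizer Javadoc for your version. The class and method names are hypothetical. The sketch only builds the `Properties` object, so it runs without CoreNLP on the classpath.

```java
import java.util.Properties;

// Hypothetical helper: builds pipeline properties that ask the tokenizer
// for pre-4.0.0-style behavior. Option names are assumptions to verify
// against the PTBTokenizer documentation.
class LegacyTokenizerConfig {

    static Properties legacyTokenizerProps() {
        Properties props = new Properties();
        // Only tokenization and sentence splitting are needed to compare output.
        props.setProperty("annotators", "tokenize,ssplit");
        // Assumed options: do not split hyphenated words, and map
        // parentheses to -LRB- / -RRB- as earlier releases did by default.
        props.setProperty("tokenize.options",
                "splitHyphenated=false,normalizeParentheses=true");
        return props;
    }

    public static void main(String[] args) {
        // Print the properties in no particular order.
        legacyTokenizerProps().forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

With the CoreNLP jars and models available, these properties would typically be passed to `new StanfordCoreNLP(props)`; comparing the token output with and without `tokenize.options` shows which defaults affect a given corpus.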

Fixes

NPE while simultaneously tokenizing on whitespace and sentence splitting on newlines
NPE in EntityMentionsAnnotator during language check
NPE in CorefMentionAnnotator while aligning coref mentions with titles and entity mentions
NPE in NERCombinerAnnotator in certain configurations of models on/off
Incorrect handling of eolonly option in ArabicSegmenterAnnotator
Apply named entity granularity change prior to coref mention detection
Incorrect handling of keeping newline tokens when using Chinese segmenter on Windows
Incorrect handling of reading in German treebank files
SR parser crashes when given bad training input
New PTBTokenizer known abbreviations: "Tech.", "Amb.". Fix legacy tokenizer hack that special-cased 'Alex.' for 'Alex. Brown'
Fix ancient bug in printing constituency tree with multiple roots.
Fix parser failing on the word "STOP", which it treated as a special word
