Stanford CoreNLP 4.0.0

@J38 J38 released this 04 May 02:47

Overview

The latest release of Stanford CoreNLP includes a major overhaul of tokenization and a large collection of new parsing and tagging models. There are also miscellaneous enhancements and fixes.
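
To make the tokenization change concrete, the sketch below contrasts older PTB-style escaped tokens with the unescaped tokens that are now the default. It is only an illustration in plain Java, not CoreNLP code: the class `PtbEscapeSketch` and the helper `ptbEscape` are hypothetical names, and the mapping is deliberately simplified to parentheses and curly double quotes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PtbEscapeSketch {

    // Hypothetical helper (not a CoreNLP method): maps one already-split
    // token to its older PTB-style escaped form. Simplified on purpose.
    static String ptbEscape(String token) {
        switch (token) {
            case "(": return "-LRB-";
            case ")": return "-RRB-";
            case "\u201C": return "``"; // left double quotation mark
            case "\u201D": return "''"; // right double quotation mark
            default: return token;
        }
    }

    // Applies the escaping to every token in a list.
    static List<String> escapeAll(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            out.add(ptbEscape(t));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens =
                Arrays.asList("\u201C", "Hello", "\u201D", "(", "world", ")");
        // Tokens left as written, in the spirit of the new default.
        System.out.println("unescaped: " + tokens);
        // Tokens rewritten into ASCII sequences, as older output looked.
        System.out.println("PTB-style: " + escapeAll(tokens));
    }
}
```

Downstream code that matches on strings such as `-LRB-` may need to be revisited when moving to the new default.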

Enhancements

  • UD v2.0 tokenization standard for English, French, German, and Spanish. This means the "new" LDC tokenization for English (splitting on most hyphens) and, by default, no longer escaping parentheses or converting quotes and similar characters into ASCII sequences.
  • Upgrade options for normalizing special chars (quotes, parentheses, etc.) in PTBTokenizer
  • Have WhitespaceTokenizer support same newline processing as PTBTokenizer
  • New mwt annotator for handling multiword tokens in French, German, and Spanish.
  • New models with more training data and better performance for tagging and parsing in English, French, German, and Spanish.
  • Add French NER
  • New Chinese segmentation based on CTB9
  • Improved handling of double codepoint characters
  • Easier syntax for specifying language-specific pipelines and NER pipeline properties
  • Improved CoNLL-U processing
  • Improved speed and memory performance for CRF training
  • Tregex support in CoreSentence
  • Updated library dependencies
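
The item on double codepoint characters refers to characters outside the Basic Multilingual Plane, which Java strings store as two `char` values (a surrogate pair). The sketch below uses only the standard library to show why text-processing code has to iterate by code point rather than by `char`. It does not reproduce CoreNLP's implementation, and the class and method names are hypothetical.

```java
public class CodePointSketch {

    // Counts Unicode code points rather than UTF-16 code units.
    static int codePointLength(String s) {
        return s.codePointCount(0, s.length());
    }

    // Splits a string into one string per code point, so a surrogate
    // pair is never cut in half.
    static String[] splitByCodePoint(String s) {
        return s.codePoints()
                .mapToObj(cp -> new String(Character.toChars(cp)))
                .toArray(String[]::new);
    }

    public static void main(String[] args) {
        // 'a', U+1F600 (one code point stored as two chars), 'b'
        String text = "a\uD83D\uDE00b";
        System.out.println(text.length());                 // 4 code units
        System.out.println(codePointLength(text));         // 3 code points
        System.out.println(splitByCodePoint(text).length); // 3 pieces
    }
}
```

Character offsets computed with `length()` or `charAt()` can therefore disagree with offsets counted in code points, which matters when aligning tokens back to the original text.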

Fixes

  • NPE while simultaneously tokenizing on whitespace and sentence splitting on newlines
  • NPE in EntityMentionsAnnotator during language check
  • NPE in CorefMentionAnnotator while aligning coref mentions with titles and entity mentions
  • NPE in NERCombinerAnnotator in certain configurations of models on/off
  • Incorrect handling of eolonly option in ArabicSegmenterAnnotator
  • Apply named entity granularity change prior to coref mention detection
  • Incorrect handling of keeping newline tokens when using Chinese segmenter on Windows
  • Incorrect handling of reading in German treebank files
  • SR parser crashes when given bad training input
  • New PTBTokenizer known abbreviations: "Tech.", "Amb.". Fix legacy tokenizer hack special casing 'Alex.' for 'Alex. Brown'
  • Fix long-standing bug in printing constituency tree with multiple roots
  • Fix parser failing on the word "STOP" because it treated it as a special word
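
The "STOP" fix is an instance of a common class of bug, where a reserved marker shares its spelling with real input. The release notes do not describe how the parser was actually changed, so the following is only a generic sketch of the problem and one possible remedy. `SentinelSketch`, `BOUNDARY`, and both counting methods are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;

public class SentinelSketch {

    // Hypothetical boundary marker, compared by identity and never by text.
    static final Object BOUNDARY = new Object();

    // Fragile variant: any word spelled "STOP" is mistaken for the boundary.
    static int countWordsFragile(List<String> words) {
        int n = 0;
        for (String w : words) {
            if (w.equals("STOP")) {
                break;
            }
            n++;
        }
        return n;
    }

    // Safer variant: the boundary is a distinct object, so no real word
    // can collide with it.
    static int countWordsSafe(List<Object> items) {
        int n = 0;
        for (Object item : items) {
            if (item == BOUNDARY) {
                break;
            }
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // The fragile variant stops early on the real word "STOP".
        System.out.println(countWordsFragile(
                Arrays.asList("Please", "STOP", "now")));
        // The safe variant counts all three words before the marker.
        System.out.println(countWordsSafe(
                Arrays.<Object>asList("Please", "STOP", "now", BOUNDARY)));
    }
}
```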