Unexpected error thrown on tokenize #1298

yakivy · 2022-08-17T10:48:53Z

Description:
edu.stanford.nlp.pipeline.StanfordCoreNLP throws an error when tokenizing a string that contains every possible character separated by spaces ("... a b c d ..."). It is probably also worth mentioning that the same string without spaces between the characters ("...abcd...") is tokenized successfully.

Prerequisites:

java openjdk 17.0.2 2022-01-18
scala 2.13.8
lib ivy"edu.stanford.nlp:stanford-corenlp:4.5.0"

Minimal example:

import edu.stanford.nlp.pipeline.StanfordCoreNLP
import java.util.Properties
val pipeline = {
    val props = new Properties()
    props.setProperty("annotators", "tokenize")
    new StanfordCoreNLP(props)
}
val text = (Char.MinValue to Char.MaxValue).mkString(" ")
pipeline.processToCoreDocument(text)

Error:

java.lang.Error: Error: could not match input
  at edu.stanford.nlp.process.PTBLexer.zzScanError(PTBLexer.java:61605)
  at edu.stanford.nlp.process.PTBLexer.next(PTBLexer.java:63479)
  at edu.stanford.nlp.process.PTBTokenizer.getNext(PTBTokenizer.java:301)
  at edu.stanford.nlp.process.PTBTokenizer.getNext(PTBTokenizer.java:185)
  at edu.stanford.nlp.process.AbstractTokenizer.hasNext(AbstractTokenizer.java:69)
  at edu.stanford.nlp.process.AbstractTokenizer.tokenize(AbstractTokenizer.java:111)
  at edu.stanford.nlp.pipeline.TokenizerAnnotator.annotate(TokenizerAnnotator.java:420)
  at edu.stanford.nlp.pipeline.AnnotationPipeline.annotate(AnnotationPipeline.java:76)
  at edu.stanford.nlp.pipeline.StanfordCoreNLP.annotate(StanfordCoreNLP.java:744)
  at edu.stanford.nlp.pipeline.StanfordCoreNLP.process(StanfordCoreNLP.java:793)
  ...


AngledLuffa · 2022-08-18T05:58:37Z

To reproduce in Java:

import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

import java.util.*;

public class foo {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize");
    StanfordCoreNLP pipe = new StanfordCoreNLP(props);
    StringBuilder builder = new StringBuilder();
    for (int i = Character.MIN_VALUE; i <= Character.MAX_VALUE; ++i) {
      builder.append((char) i);
    }
    String text = builder.toString();
    System.out.println(text.length());
    Annotation ann = new Annotation(text);
    pipe.annotate(ann);
  }
}

However, is this a problem you have run into in the wild? Some of the characters you are adding with this are not valid text characters.

Which version of CoreNLP, anyway?

AngledLuffa · 2022-08-18T06:02:57Z

55296 in particular

AngledLuffa · 2022-08-18T06:16:36Z

0xd800 is not supposed to be a legal text character, so I suppose it's not too surprising. I'm not sure this will be easily fixed.
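
For readers following along: 55296 is 0xD800 in decimal, the first UTF-16 high surrogate. The sketch below is one way to locate the first unpaired surrogate in an input string using only the JDK. The class and method names (`SurrogateProbe`, `firstUnpairedSurrogate`) are hypothetical and not part of CoreNLP, and the `main` method rebuilds the space-separated input from the report as an assumption about what was tokenized.

```java
final class SurrogateProbe {
  // Returns the index of the first surrogate that is not part of a valid
  // high+low pair, or -1 if every surrogate in the text is properly paired.
  static int firstUnpairedSurrogate(String text) {
    for (int i = 0; i < text.length(); i++) {
      char c = text.charAt(i);
      if (Character.isHighSurrogate(c)) {
        if (i + 1 < text.length() && Character.isLowSurrogate(text.charAt(i + 1))) {
          i++; // skip the low half of a valid pair
        } else {
          return i; // high surrogate with no low surrogate after it
        }
      } else if (Character.isLowSurrogate(c)) {
        return i; // low surrogate with no high surrogate before it
      }
    }
    return -1;
  }

  public static void main(String[] args) {
    System.out.println(55296 == 0xD800);                          // prints "true"
    System.out.println(Character.isHighSurrogate((char) 55296));  // prints "true"

    // Rebuild the space-separated input from the original report.
    StringBuilder builder = new StringBuilder();
    for (int i = Character.MIN_VALUE; i <= Character.MAX_VALUE; i++) {
      if (i > 0) {
        builder.append(' ');
      }
      builder.append((char) i);
    }
    System.out.println(firstUnpairedSurrogate(builder.toString()));
  }
}
```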

yakivy · 2022-08-18T11:19:15Z

@AngledLuffa
Thanks for your response.
Please check the prerequisites: the lib version is 4.5.0.
Is there a finite set of rules for characters that can cause an error on tokenization, so I can mask them before tokenization?

0xd800 is not supposed to be a legal text character

By that you mean that a surrogate preceded by a space is not a valid character for CoreNLP, correct?
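
As a possible workaround on versions that still crash, unpaired surrogates could be masked before the text reaches the tokenizer. The following is a minimal sketch under stated assumptions: it handles only the unpaired-surrogate case discussed in this thread (nothing here confirms that this is the complete set of problematic characters), and `SurrogateMasker` and `maskUnpairedSurrogates` are hypothetical names, not CoreNLP API. Each unpaired surrogate is replaced with U+FFFD (the Unicode replacement character), so the string length and character offsets stay the same.

```java
final class SurrogateMasker {
  // Replaces every unpaired UTF-16 surrogate with U+FFFD and leaves
  // valid high+low pairs (supplementary code points) untouched.
  static String maskUnpairedSurrogates(String text) {
    StringBuilder out = new StringBuilder(text.length());
    int i = 0;
    while (i < text.length()) {
      char c = text.charAt(i);
      boolean validPair = Character.isHighSurrogate(c)
          && i + 1 < text.length()
          && Character.isLowSurrogate(text.charAt(i + 1));
      if (validPair) {
        out.append(c).append(text.charAt(i + 1)); // keep the full code point
        i += 2;
      } else if (Character.isSurrogate(c)) {
        out.append('\uFFFD'); // mask the half code point
        i++;
      } else {
        out.append(c);
        i++;
      }
    }
    return out.toString();
  }
}
```

The masked string can then be passed to the pipeline in place of the raw text, for example `pipe.annotate(new Annotation(SurrogateMasker.maskUnpairedSurrogates(text)))`.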

…rogates. Putting it in an | expression should make it so that full codepoints are preferred and half codepoints are only used in an emergency
Addresses #1298

AngledLuffa · 2022-08-18T18:11:12Z

By that you mean the surrogate prepended by space is not a valid character for CoreNLP, correct?

Well, two things... I don't think this character actually means anything by itself in any context, and more relevantly, it's currently not a valid character for CoreNLP considering it causes a crash.

However, I did just make a branch which I think has the fix to the problem. It doesn't crash any more, at least.

…rogates. Putting it in an | expression should make it so that full codepoints are preferred and half codepoints are only used in an emergency
Addresses #1298
Add a debug line for the fallthrough rule
Add a couple tests of the half codepoint fix
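
The actual change lives in the lexer grammar, which is not shown in this thread. As a rough analogy only, the idea of preferring full code points and falling back to half code points can be sketched in plain Java as below. `FallthroughSketch` and `splitPreferringFullCodePoints` are hypothetical names and this is not the CoreNLP implementation.

```java
import java.util.ArrayList;
import java.util.List;

final class FallthroughSketch {
  // Splits text into one piece per code point. A valid surrogate pair is
  // taken whole (first alternative); a lone surrogate is taken as a single
  // char only when no pair can be formed (fallback alternative).
  static List<String> splitPreferringFullCodePoints(String text) {
    List<String> pieces = new ArrayList<>();
    int i = 0;
    while (i < text.length()) {
      char c = text.charAt(i);
      boolean fullPair = Character.isHighSurrogate(c)
          && i + 1 < text.length()
          && Character.isLowSurrogate(text.charAt(i + 1));
      int width = fullPair ? 2 : 1;
      pieces.add(text.substring(i, i + width));
      i += width;
    }
    return pieces;
  }
}
```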

AngledLuffa · 2022-09-06T22:23:11Z

4.5.1 is available on GitHub and Maven.

yakivy · 2022-09-07T09:01:01Z

@AngledLuffa awesome, thank you!

AngledLuffa mentioned this issue Aug 18, 2022

Make the fallthrough character tokenization also capture unpaired sur… #1299

Merged

AngledLuffa added the fixed on dev label Aug 24, 2022

AngledLuffa closed this as completed Sep 6, 2022
