Unexpected error thrown on tokenize #1298

Closed
yakivy opened this issue Aug 17, 2022 · 7 comments

@yakivy

yakivy commented Aug 17, 2022

Description:
edu.stanford.nlp.pipeline.StanfordCoreNLP throws an error when tokenizing a string that contains every possible char value (Char.MinValue to Char.MaxValue) separated by spaces ("... a b c d ..."). It is probably also worth mentioning that the same characters without spaces between them ("...abcd...") are tokenized successfully.

Prerequisites:

  • java openjdk 17.0.2 2022-01-18
  • scala 2.13.8
  • lib ivy"edu.stanford.nlp:stanford-corenlp:4.5.0"

Minimal example:

import edu.stanford.nlp.pipeline.StanfordCoreNLP
import java.util.Properties
val pipeline = {
    val props = new Properties()
    props.setProperty("annotators", "tokenize")
    new StanfordCoreNLP(props)
}
val text = (Char.MinValue to Char.MaxValue).mkString(" ")
pipeline.processToCoreDocument(text)

Error:

java.lang.Error: Error: could not match input
  at edu.stanford.nlp.process.PTBLexer.zzScanError(PTBLexer.java:61605)
  at edu.stanford.nlp.process.PTBLexer.next(PTBLexer.java:63479)
  at edu.stanford.nlp.process.PTBTokenizer.getNext(PTBTokenizer.java:301)
  at edu.stanford.nlp.process.PTBTokenizer.getNext(PTBTokenizer.java:185)
  at edu.stanford.nlp.process.AbstractTokenizer.hasNext(AbstractTokenizer.java:69)
  at edu.stanford.nlp.process.AbstractTokenizer.tokenize(AbstractTokenizer.java:111)
  at edu.stanford.nlp.pipeline.TokenizerAnnotator.annotate(TokenizerAnnotator.java:420)
  at edu.stanford.nlp.pipeline.AnnotationPipeline.annotate(AnnotationPipeline.java:76)
  at edu.stanford.nlp.pipeline.StanfordCoreNLP.annotate(StanfordCoreNLP.java:744)
  at edu.stanford.nlp.pipeline.StanfordCoreNLP.process(StanfordCoreNLP.java:793)
  ...
@AngledLuffa
Contributor

To reproduce in Java:

import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

import java.util.*;

public class foo {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize");
    StanfordCoreNLP pipe = new StanfordCoreNLP(props);
    StringBuilder builder = new StringBuilder();
    for (int i = Character.MIN_VALUE; i <= Character.MAX_VALUE; ++i) {
      builder.append((char) i);
      // Separate the characters with spaces, as in the original report
      builder.append(' ');
    }
    String text = builder.toString();
    System.out.println(text.length());
    Annotation ann = new Annotation(text);
    pipe.annotate(ann);
  }
}

However, is this a problem you have run into in the wild? Some of the characters you are adding with this are not valid text characters.

Which version of CoreNLP, anyway?

@AngledLuffa
Contributor

55296 in particular

@AngledLuffa
Contributor

0xd800 is not supposed to be a legal text character, so I suppose it's not too surprising. I'm not sure this will be easily fixed.
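
For anyone who wants to check their own input for this, here is a minimal sketch in plain Java (no CoreNLP calls). 55296 is 0xD800, the first UTF-16 high surrogate, and it only means something when a low surrogate follows it. The class and method names are hypothetical, and the assumption that unpaired surrogates are the characters to look for comes only from the comments above, not from the lexer source.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of CoreNLP.
class UnpairedSurrogateFinder {

  /** Returns the indices of surrogate chars that are not part of a valid high+low pair. */
  static List<Integer> findUnpairedSurrogates(String text) {
    List<Integer> indices = new ArrayList<>();
    for (int i = 0; i < text.length(); i++) {
      char c = text.charAt(i);
      if (Character.isHighSurrogate(c)) {
        if (i + 1 < text.length() && Character.isLowSurrogate(text.charAt(i + 1))) {
          i++; // valid pair: skip over the low half
        } else {
          indices.add(i); // high surrogate with no low surrogate after it
        }
      } else if (Character.isLowSurrogate(c)) {
        indices.add(i); // low surrogate with no high surrogate before it
      }
    }
    return indices;
  }

  public static void main(String[] args) {
    System.out.println(Character.isHighSurrogate((char) 55296));    // true
    System.out.println(findUnpairedSurrogates("a \uD800 b"));       // [2]
    System.out.println(findUnpairedSurrogates("a \uD83D\uDE00 b")); // []
  }
}
```

The second call reports index 2 because the lone 0xD800 is followed by a space rather than a low surrogate; the third string contains a complete pair and reports nothing.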

@yakivy
Author

yakivy commented Aug 18, 2022

@AngledLuffa
Thanks for your response.
Please check the prerequisites: the library version is 4.5.0.
Is there a finite set of rules for characters that can cause an error on tokenization, so I can mask them before tokenization?

> 0xd800 is not supposed to be a legal text character

By that you mean the surrogate preceded by a space is not a valid character for CoreNLP, correct?
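
As a possible way to mask such characters before tokenization, here is a rough sketch that replaces every unpaired surrogate with U+FFFD (the Unicode replacement character). It assumes that unpaired surrogates are the only problematic characters, which this thread does not confirm. `SurrogateMasker` and `maskUnpairedSurrogates` are hypothetical names, not CoreNLP API.

```java
// Hypothetical pre-processing helper, not part of CoreNLP.
class SurrogateMasker {

  // U+FFFD REPLACEMENT CHARACTER, a single UTF-16 char
  static final char REPLACEMENT = '\uFFFD';

  /** Replaces each unpaired surrogate with U+FFFD and keeps valid pairs untouched. */
  static String maskUnpairedSurrogates(String text) {
    StringBuilder out = new StringBuilder(text.length());
    for (int i = 0; i < text.length(); i++) {
      char c = text.charAt(i);
      if (Character.isHighSurrogate(c)
          && i + 1 < text.length()
          && Character.isLowSurrogate(text.charAt(i + 1))) {
        // Valid pair: copy both halves unchanged
        out.append(c).append(text.charAt(i + 1));
        i++;
      } else if (Character.isSurrogate(c)) {
        // Lone high or low surrogate: mask it
        out.append(REPLACEMENT);
      } else {
        out.append(c);
      }
    }
    return out.toString();
  }

  public static void main(String[] args) {
    String raw = "a \uD800 b";
    String masked = maskUnpairedSurrogates(raw);
    System.out.println(masked.length() == raw.length()); // true
    System.out.println(masked.indexOf(REPLACEMENT));     // 2
  }
}
```

Each masked character is replaced one for one, so the string length and all character offsets stay the same. The masked string could then be passed to `pipeline.processToCoreDocument(...)` as in the minimal example.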

AngledLuffa added a commit that referenced this issue Aug 18, 2022
…rogates. Putting it in an | expression should make it so that full codepoints are preferred and half codepoints are only used in an emergency

Addresses #1298
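
To illustrate the idea in that commit message (prefer full code points, fall back to a half code point only when there is no valid pair), here is a small plain-Java sketch. It is not the PTBLexer grammar change itself; it only shows the same preference order using `String.codePointAt`, which returns the full code point for a valid pair and the bare surrogate value otherwise. `CodePointSplitter` and `splitByCodePoint` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: not the actual lexer rule from the commit.
class CodePointSplitter {

  /** Splits text into units: one full code point where possible, else a lone surrogate. */
  static List<String> splitByCodePoint(String text) {
    List<String> units = new ArrayList<>();
    int i = 0;
    while (i < text.length()) {
      // Full code point for a valid surrogate pair, otherwise the single char value
      int codePoint = text.codePointAt(i);
      int length = Character.charCount(codePoint); // 2 for a pair, 1 otherwise
      units.add(text.substring(i, i + length));
      i += length;
    }
    return units;
  }

  public static void main(String[] args) {
    // A valid pair, then a lone high surrogate, then an ordinary letter
    List<String> units = splitByCodePoint("\uD83D\uDE00\uD800x");
    System.out.println(units.size());          // 3
    System.out.println(units.get(0).length()); // 2
    System.out.println(units.get(1).length()); // 1
  }
}
```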
@AngledLuffa
Contributor

> By that you mean the surrogate preceded by a space is not a valid character for CoreNLP, correct?

Well, two things... I don't think this character actually means anything by itself in any context, and more relevantly, it's currently not a valid character for CoreNLP considering it causes a crash.

However, I did just make a branch which I think has the fix for the problem. It doesn't crash any more, at least.

AngledLuffa added a commit that referenced this issue Aug 24, 2022
…rogates. Putting it in an | expression should make it so that full codepoints are preferred and half codepoints are only used in an emergency

Addresses #1298

Add a debug line for the fallthrough rule

Add a couple tests of the half codepoint fix
@AngledLuffa
Contributor

4.5.1 is available on GitHub and Maven.

@yakivy
Author

yakivy commented Sep 7, 2022

@AngledLuffa awesome, thank you!
