Tokenizer splitHyphenated regression #1289

gangeli · 2022-07-27T18:51:20Z

The following snippet of code seems to correctly split on the hyphen in "year-end" in 3.9.2, but no longer in 4.4.0. Is this expected behavior?

public static void main(String[] args) {
    String text = "year-end";
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,ssplit");
    props.setProperty("tokenize.language", "en");
    props.setProperty("tokenize.options", "splitHyphenated=true,invertible,ptb3Escaping=true");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    Annotation ann = new Annotation(text);
    pipeline.annotate(ann);
    List<CoreLabel> tokens = ann.get(CoreAnnotations.TokensAnnotation.class);
    System.out.println(tokens.stream().map(CoreLabel::originalText).collect(Collectors.toList()));
}

Old output: [year, -, end]
New output: [year-end]

AngledLuffa · 2022-07-27T21:27:01Z

My man I do not see any issue here

import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;

import java.util.*;
import java.util.stream.*;

public class foo {
  public static void main(String[] args) {
    String text = "year-end";
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,ssplit");
    props.setProperty("tokenize.language", "en");
    props.setProperty("tokenize.options", "splitHyphenated=true,invertible,ptb3Escaping=true");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    Annotation ann = new Annotation(text);
    pipeline.annotate(ann);
    List<CoreLabel> tokens = ann.get(CoreAnnotations.TokensAnnotation.class);
    System.out.println(tokens.stream().map(CoreLabel::originalText).collect(Collectors.toList()));
  }
}

java foo
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit
[year, -, end]

AngledLuffa · 2022-07-27T21:28:16Z

That is what happens if I use v4.4.0 via git checkout v4.4.0 or if I use v4.5.0 in my git clone

gangeli · 2022-07-27T23:15:47Z

Well that's strange. Maybe some library interference? I've tried isolating the error as best as I can, and still get it:

# lib/main has all of our classpath entries
$ find lib/main -name "*.jar" | grep stanford 
lib/main/edu.stanford.nlp_stanford-corenlp_4.4.0.jar


$ unzip -p lib/main/edu.stanford.nlp_stanford-corenlp_4.4.0.jar META-INF/MANIFEST.MF
Manifest-Version: 1.0
Implementation-Version: 4.4.0
Built-Date: 2022-01-20
Created-By: Stanford JavaNLP (jebolton)
Main-class: edu.stanford.nlp.pipeline.StanfordCoreNLP


$ cat foo.java                                                                      
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;

import java.util.*;
import java.util.stream.*;

public class foo {
  public static void main(String[] args) {
    String text = "year-end";
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,ssplit");
    props.setProperty("tokenize.language", "en");
    props.setProperty("tokenize.options", "splitHyphenated=true,invertible,ptb3Escaping=true");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    Annotation ann = new Annotation(text);
    pipeline.annotate(ann);
    List<CoreLabel> tokens = ann.get(CoreAnnotations.TokensAnnotation.class);
    System.out.println(tokens.stream().map(CoreLabel::originalText).collect(Collectors.toList()));
  }
}


$ "$JAVA_HOME/bin/javac" foo.java
OpenJDK 64-Bit Server VM warning: .hotspot_compiler file is present but has been ignored.  Run with -XX:CompileCommandFile=.hotspot_compiler to load the file.


$ "$JAVA_HOME/bin/java" foo      
OpenJDK 64-Bit Server VM warning: .hotspot_compiler file is present but has been ignored.  Run with -XX:CompileCommandFile=.hotspot_compiler to load the file.
SLF4J: No SLF4J providers were found.
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#noProviders for further details.
[year-end]

Maybe it's an Antlr version issue? We have Antlr Runtime 4.7.2

gangeli · 2022-07-28T00:02:53Z

One step closer: apparently if I remove ptb3Escaping=true from the options then it works as expected. Gonna dig into the Lexer more, but it looks like ptb3Escaping has its own opinions about hyphenation, and there's some ordering indeterminacy around whose opinions matter more.

gangeli · 2022-07-28T00:13:47Z

Blarg, the decompiler is barfing on PTBLexer and not letting me set breakpoints, but I have pretty good evidence that this is indeed the case.

Consider the block of code in the PTBLexer.flex constructor starting here:

        Properties prop = StringUtils.stringToProperties(options);
        Set<Map.Entry<Object,Object>> props = prop.entrySet();
        for (Map.Entry<Object,Object> item : props) {
          String key = (String) item.getKey();
          String value = (String) item.getValue();
          boolean val = Boolean.parseBoolean(value);
          if ("".equals(key)) {
            // allow an empty item
//...
          } else if ("ptb3Escaping".equals(key)) {
//...
            splitHyphenated = ! val;
//...
          } else if ("ud".equals(key)) {
//...
            splitHyphenated=val;
//...
          } else if ("splitHyphenated".equals(key)) {
            splitHyphenated = val;
          }

If I inspect props (fortunately, StringUtils still decompiles) via props.entrySet().iterator().next() I get splitHyphenated -> true, which suggests that ptb3Escaping comes later in the property set and thus overwrites the splitHyphenated value.

Are ptb3Escaping and splitHyphenated truly incompatible or is this accidental?
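To make the suspected mechanism concrete, here is a minimal sketch of the override behavior. It is not the PTBLexer code: `OptionOrderDemo` and `resolveSplitHyphenated` are hypothetical names, and the loop keeps only the two branches quoted above. Insertion-ordered maps stand in for the two iteration orders a Properties object might produce.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class OptionOrderDemo {
  // Hypothetical stand-in for the lexer's option loop, reduced to the two
  // keys discussed in this thread. Whichever key is visited last wins.
  static boolean resolveSplitHyphenated(Map<String, String> options) {
    boolean splitHyphenated = false;
    for (Map.Entry<String, String> item : options.entrySet()) {
      boolean val = Boolean.parseBoolean(item.getValue());
      if ("ptb3Escaping".equals(item.getKey())) {
        splitHyphenated = !val;   // ptb3Escaping=true turns splitting off
      } else if ("splitHyphenated".equals(item.getKey())) {
        splitHyphenated = val;
      }
    }
    return splitHyphenated;
  }

  public static void main(String[] args) {
    // Order 1: splitHyphenated is visited first, ptb3Escaping overwrites it.
    Map<String, String> splitFirst = new LinkedHashMap<>();
    splitFirst.put("splitHyphenated", "true");
    splitFirst.put("ptb3Escaping", "true");

    // Order 2: ptb3Escaping is visited first, splitHyphenated overwrites it.
    Map<String, String> ptbFirst = new LinkedHashMap<>();
    ptbFirst.put("ptb3Escaping", "true");
    ptbFirst.put("splitHyphenated", "true");

    System.out.println(resolveSplitHyphenated(splitFirst)); // false
    System.out.println(resolveSplitHyphenated(ptbFirst));   // true
  }
}
```

The same two options give opposite results depending only on visit order, which would match the [year-end] versus [year, -, end] difference if the real iteration order differs between environments.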

AngledLuffa · 2022-07-28T01:11:43Z

Well this might wind up being horrible. I tried on a couple of different Java 8 installs and got the desired behavior in both, but with a Java 11 and a Java 14 install I got the same error you did. What Java version are you running? Maybe the string hash function changed between versions, and thus the keys are iterated in a different order? I guess the simplest fix in that case would be to make the later keys override the earlier ones in a deterministic order.
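One way to get a deterministic order, sketched under assumptions: parse the option string into a LinkedHashMap so that keys are visited in the order they were written. `OrderedOptions` and `parseOptions` are hypothetical names, not the CoreNLP API, and treating a bare key such as `invertible` as "true" is an assumption of this sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class OrderedOptions {
  // Hypothetical parser: splits "k1=v1,k2,k3=v3" into an insertion-ordered map.
  static Map<String, String> parseOptions(String options) {
    Map<String, String> result = new LinkedHashMap<>();
    for (String item : options.split(",")) {
      String trimmed = item.trim();
      if (trimmed.isEmpty()) {
        continue; // allow an empty item
      }
      int eq = trimmed.indexOf('=');
      String key = eq < 0 ? trimmed : trimmed.substring(0, eq).trim();
      // Assumption: a key without "=value" means "true".
      String value = eq < 0 ? "true" : trimmed.substring(eq + 1).trim();
      result.remove(key);     // a repeated key moves to the end, so the later one wins
      result.put(key, value);
    }
    return result;
  }

  public static void main(String[] args) {
    Map<String, String> opts =
        parseOptions("ptb3Escaping=true,invertible,splitHyphenated=true");
    System.out.println(opts.keySet()); // [ptb3Escaping, invertible, splitHyphenated]
  }
}
```

With this kind of ordering, the result no longer depends on the Java version, but it does depend on how the options are written: a later `splitHyphenated=true` would override an earlier `ptb3Escaping=true`, and not the other way around.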

AngledLuffa · 2022-07-28T02:15:19Z

I am now certain that the key order in the Properties object is causing this problem.

While we come up with a fix, you could set the splitHyphenated property of the Lexer to whatever value you need...

AngledLuffa · 2022-07-28T21:36:35Z

So, to what extent is this an issue where you would need a quick fix, versus being able to work around it (such as by setting the appropriate option in the Lexer after creating it) until the next release is made?

AngledLuffa · 2022-08-04T20:12:18Z

The fix for the tokenizer is now in dev branch. I would like to fix this in the Parser as well, but that requires serializing all the models again. Please leave this open in the meantime!

AngledLuffa added a commit that referenced this issue Jul 28, 2022

LinkedHashMap instead of Properties. Addresses #1289

011863a

AngledLuffa added a commit that referenced this issue Aug 4, 2022

LinkedHashMap instead of Properties. Addresses #1289

6550188

AngledLuffa added cleanup fixed on dev labels Aug 4, 2022
