Tokenizer splitHyphenated regression #1289

Open
gangeli opened this issue Jul 27, 2022 · 9 comments

Comments

@gangeli
Member

gangeli commented Jul 27, 2022

The following snippet of code seems to correctly split on the hyphen in "year-end" in 3.9.2, but no longer in 4.4.0. Is this expected behavior?

public static void main(String[] args) {
    String text = "year-end";
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,ssplit");
    props.setProperty("tokenize.language", "en");
    props.setProperty("tokenize.options", "splitHyphenated=true,invertible,ptb3Escaping=true");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    Annotation ann = new Annotation(text);
    pipeline.annotate(ann);
    List<CoreLabel> tokens = ann.get(CoreAnnotations.TokensAnnotation.class);
    System.out.println(tokens.stream().map(CoreLabel::originalText).collect(Collectors.toList()));
}

Old output: [year, -, end]
New output: [year-end]

@AngledLuffa
Contributor

My man I do not see any issue here

import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;

import java.util.*;
import java.util.stream.*;

public class foo {
  public static void main(String[] args) {
    String text = "year-end";
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,ssplit");
    props.setProperty("tokenize.language", "en");
    props.setProperty("tokenize.options", "splitHyphenated=true,invertible,ptb3Escaping=true");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    Annotation ann = new Annotation(text);
    pipeline.annotate(ann);
    List<CoreLabel> tokens = ann.get(CoreAnnotations.TokensAnnotation.class);
    System.out.println(tokens.stream().map(CoreLabel::originalText).collect(Collectors.toList()));
  }
}
java foo
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit
[year, -, end]

@AngledLuffa
Contributor

That is what happens whether I use v4.4.0 (via git checkout v4.4.0) or v4.5.0 in my git clone.

@gangeli
Member Author

gangeli commented Jul 27, 2022

Well that's strange. Maybe some library interference? I've tried isolating the error as best as I can, and still get it:

# lib/main has all of our classpath entries
$ find lib/main -name "*.jar" | grep stanford 
lib/main/edu.stanford.nlp_stanford-corenlp_4.4.0.jar


$ unzip -p lib/main/edu.stanford.nlp_stanford-corenlp_4.4.0.jar META-INF/MANIFEST.MF
Manifest-Version: 1.0
Implementation-Version: 4.4.0
Built-Date: 2022-01-20
Created-By: Stanford JavaNLP (jebolton)
Main-class: edu.stanford.nlp.pipeline.StanfordCoreNLP


$ cat foo.java                                                                      
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;

import java.util.*;
import java.util.stream.*;

public class foo {
  public static void main(String[] args) {
    String text = "year-end";
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,ssplit");
    props.setProperty("tokenize.language", "en");
    props.setProperty("tokenize.options", "splitHyphenated=true,invertible,ptb3Escaping=true");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    Annotation ann = new Annotation(text);
    pipeline.annotate(ann);
    List<CoreLabel> tokens = ann.get(CoreAnnotations.TokensAnnotation.class);
    System.out.println(tokens.stream().map(CoreLabel::originalText).collect(Collectors.toList()));
  }
}


$ "$JAVA_HOME/bin/javac" foo.java
OpenJDK 64-Bit Server VM warning: .hotspot_compiler file is present but has been ignored.  Run with -XX:CompileCommandFile=.hotspot_compiler to load the file.


$ "$JAVA_HOME/bin/java" foo      
OpenJDK 64-Bit Server VM warning: .hotspot_compiler file is present but has been ignored.  Run with -XX:CompileCommandFile=.hotspot_compiler to load the file.
SLF4J: No SLF4J providers were found.
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#noProviders for further details.
[year-end]

Maybe it's an Antlr version issue? We have Antlr Runtime 4.7.2

@gangeli
Member Author

gangeli commented Jul 28, 2022

One step closer: apparently if I remove ptb3Escaping=true from the options then it works as expected. Gonna dig into the Lexer more, but it looks like ptb3Escaping has its own opinions about hyphenation, and there's some ordering indeterminacy around whose opinions matter more.
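
The workaround described above amounts to leaving ptb3Escaping out of tokenize.options. A minimal sketch of that configuration, reusing the property keys from the original snippet. The class and method names are hypothetical, and the code only builds the Properties object, so it runs without CoreNLP on the classpath:

```java
import java.util.Properties;

final class TokenizerProps {
  /** Builds pipeline properties with splitHyphenated set and ptb3Escaping omitted. */
  static Properties withoutPtb3Escaping() {
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,ssplit");
    props.setProperty("tokenize.language", "en");
    // ptb3Escaping is left out so that it cannot override splitHyphenated.
    props.setProperty("tokenize.options", "splitHyphenated=true,invertible");
    return props;
  }

  public static void main(String[] args) {
    System.out.println(withoutPtb3Escaping().getProperty("tokenize.options"));
  }
}
```

The result can be passed to new StanfordCoreNLP(props) exactly as in the snippet at the top of the issue. Dropping ptb3Escaping also drops whatever else that option turns on, so this only helps if those other behaviors are not needed.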

@gangeli
Member Author

gangeli commented Jul 28, 2022

Blarg, the decompiler is barfing on PTBLexer and not letting me set breakpoints, but I have pretty good evidence that this is indeed the case.

Consider the block of code in the PTBLexer.flex constructor starting here:

        Properties prop = StringUtils.stringToProperties(options);
        Set<Map.Entry<Object,Object>> props = prop.entrySet();
        for (Map.Entry<Object,Object> item : props) {
          String key = (String) item.getKey();
          String value = (String) item.getValue();
          boolean val = Boolean.parseBoolean(value);
          if ("".equals(key)) {
            // allow an empty item
//...
          } else if ("ptb3Escaping".equals(key)) {
//...
            splitHyphenated = ! val;
//...
          } else if ("ud".equals(key)) {
//...
            splitHyphenated=val;
//...
          } else if ("splitHyphenated".equals(key)) {
            splitHyphenated = val;
          } 

If I inspect props (fortunately, StringUtils still decompiles) via props.entrySet().iterator().next() I get splitHyphenated -> true, which suggests that ptb3Escaping comes later in the property set and thus overwrites the splitHyphenated value.
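
To make the ordering effect concrete, here is a small standalone sketch that mirrors only the three branches quoted above. It is not the PTBLexer code: the class name is made up, the initial value of the flag is an assumed stand-in, and a LinkedHashMap is used so the iteration order is explicit (java.util.Properties is a Hashtable and makes no promise about iteration order).

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class OptionOrderDemo {
  /** Applies the options in iteration order; every matching key overwrites the flag. */
  static boolean resolveSplitHyphenated(Map<String, String> options) {
    boolean splitHyphenated = false; // assumed default, for illustration only
    for (Map.Entry<String, String> item : options.entrySet()) {
      String key = item.getKey();
      boolean val = Boolean.parseBoolean(item.getValue());
      if ("ptb3Escaping".equals(key)) {
        splitHyphenated = !val;
      } else if ("ud".equals(key)) {
        splitHyphenated = val;
      } else if ("splitHyphenated".equals(key)) {
        splitHyphenated = val;
      }
    }
    return splitHyphenated;
  }

  public static void main(String[] args) {
    Map<String, String> splitFirst = new LinkedHashMap<>();
    splitFirst.put("splitHyphenated", "true");
    splitFirst.put("ptb3Escaping", "true");

    Map<String, String> ptbFirst = new LinkedHashMap<>();
    ptbFirst.put("ptb3Escaping", "true");
    ptbFirst.put("splitHyphenated", "true");

    // Same options, different iteration order, different result.
    System.out.println("splitHyphenated first: " + resolveSplitHyphenated(splitFirst)); // false
    System.out.println("ptb3Escaping first: " + resolveSplitHyphenated(ptbFirst));      // true
  }
}
```

When splitHyphenated is visited first, the later ptb3Escaping=true entry resets the flag to false, which matches the [year-end] output.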

Are ptb3Escaping and splitHyphenated truly incompatible or is this accidental?

@AngledLuffa
Contributor

AngledLuffa commented Jul 28, 2022 via email

@AngledLuffa
Contributor

I am now certain that the key order in the Properties object is what causes this problem.

While we come up with some sort of fix, in the meantime, you could always set the splitHyphenated property of the Lexer to whatever value you need...
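
For illustration, one way a fix could remove the dependence on key order is to apply the preset-style options (ptb3Escaping, ud) first and the explicit splitHyphenated setting last. This is only a sketch under that assumption, not necessarily what the actual CoreNLP fix does; the class name and the default value are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class TwoPassOptions {
  /** Resolves the flag so that an explicit splitHyphenated always wins over presets. */
  static boolean resolveSplitHyphenated(Map<String, String> options) {
    boolean splitHyphenated = false; // assumed default, for illustration only

    // Pass 1: presets that set splitHyphenated as a side effect.
    for (Map.Entry<String, String> item : options.entrySet()) {
      boolean val = Boolean.parseBoolean(item.getValue());
      if ("ptb3Escaping".equals(item.getKey())) {
        splitHyphenated = !val;
      } else if ("ud".equals(item.getKey())) {
        splitHyphenated = val;
      }
    }

    // Pass 2: the explicit option overrides whatever the presets chose.
    String explicit = options.get("splitHyphenated");
    if (explicit != null) {
      splitHyphenated = Boolean.parseBoolean(explicit);
    }
    return splitHyphenated;
  }

  public static void main(String[] args) {
    Map<String, String> splitFirst = new LinkedHashMap<>();
    splitFirst.put("splitHyphenated", "true");
    splitFirst.put("ptb3Escaping", "true");

    Map<String, String> ptbFirst = new LinkedHashMap<>();
    ptbFirst.put("ptb3Escaping", "true");
    ptbFirst.put("splitHyphenated", "true");

    // Both orders now give the same answer.
    System.out.println(resolveSplitHyphenated(splitFirst)); // true
    System.out.println(resolveSplitHyphenated(ptbFirst));   // true
  }
}
```

Two presets given together (ptb3Escaping and ud) would still overwrite each other in iteration order in this sketch; only the explicit option is made order-independent.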

@AngledLuffa
Contributor

So, to what extent is this an issue where you would need a quick fix, versus being able to work around it (such as by setting the appropriate option in the Lexer after creating it) until the next release is made?

@AngledLuffa
Contributor

The fix for the tokenizer is now in dev branch. I would like to fix this in the Parser as well, but that requires serializing all the models again. Please leave this open in the meantime!
