Incorrect sentence splitting in German (and some other European languages) at dots after numbers (e.g. German: 1. Bundesliga) #380

Open
kno10 opened this issue Mar 10, 2017 · 7 comments


@kno10
Contributor

kno10 commented Mar 10, 2017

German (and some other European languages) use a dot to denote ordinals.

For example, instead of "1st place", German uses "1. Platz".
Instead of "July 28th", German uses "28. Juli".

Examples can be found en masse, for example:
dewiki:Fußball-Bundesliga (28. Juli, 2. Bundesliga, 1. Liga)
dewiki:9/11 (11. September)
dewiki:Stanford University (Der Grund und Boden wurde am 11. November 1885 von Leland Stanford zur Gründung der Universität gestiftet)

And the Duden, the "prescriptive source for German language spelling" (Wikipedia) uses:
Duden - Die deutsche Rechtschreibung, 26. Auflage

Unfortunately, CoreNLP will split all these sentences at the dot.

So CoreNLP currently cannot reliably split German sentences if they contain ordinal numbers or dates.

I am currently using the following workaround hack:

  private static class FilteredTokenizer implements Annotator {
    private TokenizerAnnotator inner;

    public FilteredTokenizer(TokenizerAnnotator inner) {
      this.inner = inner;
    }

    @Override
    public void annotate(Annotation annotation) {
      inner.annotate(annotation);
      List<CoreLabel> tokens = annotation.get(CoreAnnotations.TokensAnnotation.class);
      ArrayList<CoreLabel> filtered = new ArrayList<>(tokens.size());
      CoreLabel previous = null;
      for(CoreLabel t : tokens)
        if(previous == null || !updateAnnotation(previous, t))
          filtered.add(previous = t);
      annotation.set(CoreAnnotations.TokensAnnotation.class, filtered);
    }

    private boolean updateAnnotation(CoreLabel prev, CoreLabel curr) {
      int begin = curr.beginPosition(), end = curr.endPosition();
      if(begin + 1 != end || begin != prev.endPosition() || prev.beginPosition() == prev.endPosition())
        return false;
      String ct = curr.getString(CoreAnnotations.OriginalTextAnnotation.class);
      if(!".".equals(ct))
        return false;
      String pt = prev.getString(CoreAnnotations.OriginalTextAnnotation.class);
      for(int i = 0; i < pt.length(); i++)
        if(!Character.isDigit(pt.charAt(i)))
          return false;
      // We keep TextAnnotation unmodified, so that "1." gets labeled CARDINAL.
      prev.set(CoreAnnotations.OriginalTextAnnotation.class, pt + ct);
      prev.setEndPosition(end);
      return true;
    }

    @SuppressWarnings("rawtypes")
    @Override
    public Set<Class<? extends CoreAnnotation>> requirementsSatisfied() {
      return inner.requirementsSatisfied();
    }

    @SuppressWarnings("rawtypes")
    @Override
    public Set<Class<? extends CoreAnnotation>> requires() {
      return inner.requires();
    }
  }
@J38
Contributor

J38 commented Mar 10, 2017

Ok I'm going to try to figure out the easiest way to fix this, either by altering the tokenizer or sentence splitting.

Could you help me understand which German terms indicate a non-split? I can see months, e.g. Juli, as an example. Do you have any other terms you would suggest, such as Platz?

@kno10
Contributor Author

kno10 commented Mar 11, 2017

Unfortunately, I don't think there is a general rule.

German writers would likely simply avoid ending a sentence with a digit.

There are too many: 1. Platz, 1. Preis, 1. Mannschaft, 1. Bundesliga, 1. Liga, 1. Rang, 1. Kategorie, 1. Reise, 1. Tag, 52. Woche, 3. Monat, 1. Jahrhundert, 1. Buch Samuel, 42. Kongress, "1. Simmeringer Sport-Club", "1. FC Nürnberg". These are all grammatically correct uses. It's not specific to dates; that was just where I first noticed that this is really a major problem. Once you look out for them, they are literally everywhere.
Just go to http://de.wikipedia.org/ and type "1." into the search box. Even with "42." you get many suggestions.

So I believe my workaround uses the best available heuristic: assume that digits followed by a dot do not end a sentence (they could, but usually won't).
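The heuristic can be sketched independently of CoreNLP. The following is a minimal illustration in plain Java, not the CoreNLP implementation; the class name OrdinalAwareSplitter and its methods are hypothetical. It treats '.', '!' and '?' followed by whitespace as a sentence end, except when a '.' directly follows a token made up only of digits.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of CoreNLP: a naive splitter that applies
// the "digits + dot does not end a sentence" heuristic.
final class OrdinalAwareSplitter {

  /** Splits after '.', '!' or '?' when followed by whitespace or end of text. */
  static List<String> split(String text) {
    List<String> sentences = new ArrayList<>();
    int start = 0;
    for (int i = 0; i < text.length(); i++) {
      char c = text.charAt(i);
      if (c != '.' && c != '!' && c != '?')
        continue;
      boolean atEnd = i + 1 == text.length();
      if (!atEnd && !Character.isWhitespace(text.charAt(i + 1)))
        continue; // e.g. the dot inside "2.5" or "L.Douglass"
      if (c == '.' && followsDigitsOnly(text, i))
        continue; // ordinal such as "1." or "28.": assume no sentence end
      String sentence = text.substring(start, i + 1).trim();
      if (!sentence.isEmpty())
        sentences.add(sentence);
      start = i + 1;
    }
    // Whatever remains (including text ending in an ordinal) is the last sentence.
    String rest = text.substring(start).trim();
    if (!rest.isEmpty())
      sentences.add(rest);
    return sentences;
  }

  /** True if the token ending right before the dot consists only of digits. */
  static boolean followsDigitsOnly(String text, int dot) {
    int j = dot - 1;
    while (j >= 0 && Character.isDigit(text.charAt(j)))
      j--;
    boolean hasDigits = j < dot - 1;
    boolean tokenStart = j < 0 || Character.isWhitespace(text.charAt(j));
    return hasDigits && tokenStart;
  }
}
```

With this sketch, "Der 1. FC spielt am 28. Juli. Das Spiel beginnt um 15 Uhr." yields two sentences. The cost is that a sentence which really does end in a number followed by a dot is merged with the next one.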

There are other cases where the assumption that a dot always ends a sentence is incorrect - in particular abbreviations may or may not end a sentence. "Äpfel, Bananen, usw. sind Obst." is another example that is incorrectly split by CoreNLP. (usw. = etc.; Apples, bananas, etc. are fruit.)
But these are hard to define without already having POS tags and without massive learning/dictionary.
To get some samples, you could look at Wikipedia page titles:
https://de.wikipedia.org/w/index.php?title=Usw.&redirect=no
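A dictionary lookup for abbreviations could look roughly like the sketch below. It is only an illustration: the class name AbbreviationList is hypothetical, and the handful of German abbreviations in it is a sample, not a list taken from CoreNLP or any other tool.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper: checks whether the word before a dot is a known abbreviation.
final class AbbreviationList {

  // Illustrative sample only; a usable list would be far larger.
  static final Set<String> ABBREVIATIONS = new HashSet<>(
      Arrays.asList("usw", "bzw", "ca", "evtl", "ggf", "Nr", "Dr", "Prof"));

  /** True if the letters immediately before the dot form a listed abbreviation. */
  static boolean endsWithAbbreviation(String text, int dot) {
    int j = dot;
    while (j > 0 && Character.isLetter(text.charAt(j - 1)))
      j--;
    return j < dot && ABBREVIATIONS.contains(text.substring(j, dot));
  }
}
```

For "Äpfel, Bananen, usw. sind Obst." the check is true at the dot after "usw" and false at the final dot. It cannot tell whether an abbreviation also happens to end the sentence, which is the hard case described above.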

For the digits+(no space)+dot rule, I guess that not splitting the sentence is correct in over 90% of cases; but that is a wild guess. Linguists might be able to come up with a number, though.

@aadrian

aadrian commented Mar 11, 2017

"Linguists might be able to come up with a number, though."

LanguageTool (https://languagetool.org/) seems to have quite an extensive set of language rules and dictionaries:
https://github.com/languagetool-org/languagetool/tree/master/languagetool-language-modules/de

and it's also written in Java.

Maybe CoreNLP could reuse that?

@kno10
Contributor Author

kno10 commented Mar 14, 2017

Looking at some of the results for English Wikipedia, it may be well worth investigating sentence splitting further; e.g. abbreviation detection. For example on enwiki:I Represent (album):

6. Like You Badd (Feat. L.Douglass, N.Holsey, D.Lockhart)(Prod. by Fat Boy)

(You may also notice that in enumerations, English apparently also uses the dot after the number.)

Here, the sentence splitter will consider Prod. and Feat. to be end of sentences, i.e.

Like You Badd (Feat.
L.Douglass, N.Holsey, D.Lockhart)(Prod.
by Fat Boy)

A lowercase letter at the start of the next word is probably a good indicator of an abbreviation, and having only one word inside the parenthesis so far is a second signal against an end of sentence. Also, IMHO, when parentheses are balanced, a sentence should usually not include the opening parenthesis without the closing one.
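These signals can be expressed as two small checks on the raw text. The sketch below is plain Java with hypothetical names (AbbreviationSignals and its methods); it simplifies the parenthesis idea to "the dot lies inside a parenthesis that has not been closed yet" and does not count words.

```java
// Hypothetical helper: signals that a dot is probably not a sentence end.
final class AbbreviationSignals {

  /** True if the first non-whitespace character after the dot is lowercase. */
  static boolean nextWordIsLowerCase(String text, int dot) {
    int i = dot + 1;
    while (i < text.length() && Character.isWhitespace(text.charAt(i)))
      i++;
    return i < text.length() && Character.isLowerCase(text.charAt(i));
  }

  /** True if a '(' opened before the dot has not been closed yet. */
  static boolean insideOpenParenthesis(String text, int dot) {
    int depth = 0;
    for (int i = 0; i < dot; i++) {
      char c = text.charAt(i);
      if (c == '(')
        depth++;
      else if (c == ')' && depth > 0)
        depth--;
    }
    return depth > 0;
  }

  /** Combines both signals: either one suppresses a split at this dot. */
  static boolean suppressSplit(String text, int dot) {
    return nextWordIsLowerCase(text, dot) || insideOpenParenthesis(text, dot);
  }
}
```

On the track title above, the dot after "Feat" is caught by the parenthesis check (the next word starts with an uppercase "L"), while the dot after "Prod" is caught by both checks.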

@parajain

I am also facing the same issue. @kno10 Can you please give some details on how to use your quick fix until this is fixed?
Also, the digits+(no space)+dot rule seems reasonable for most cases.

@kno10
Contributor Author

kno10 commented Feb 13, 2018

Sorry, it is a bit too hackish to share right now, and may or may not work depending on which part of the API you call. What I do is subclass StanfordCoreNLP, override getAnnotatorImplementations(), and there override only Annotator tokenizer(Properties properties) with a filtered tokenizer wrapped around the CoreNLP tokenizer that undoes some of the default tokenization.

  private static class FilteredTokenizer implements Annotator {
    private Annotator inner;

    public FilteredTokenizer(Annotator inner) {
      this.inner = inner;
    }

    @Override
    public void annotate(Annotation annotation) {
      inner.annotate(annotation);
      List<CoreLabel> tokens = annotation.get(CoreAnnotations.TokensAnnotation.class);
      if(tokens.isEmpty())
        return;
      ArrayList<CoreLabel> filtered = new ArrayList<>(tokens.size());
      CoreLabel previous = null, cur = null, next = tokens.get(0);
      for(int i = 1; i <= tokens.size(); i++) {
        cur = next;
        next = i < tokens.size() ? tokens.get(i) : null;
        if(previous == null || keepNextToken(previous, cur, next))
          filtered.add(previous = cur);
      }
      annotation.set(CoreAnnotations.TokensAnnotation.class, filtered);
    }

    private boolean keepNextToken(CoreLabel prev, CoreLabel curr, CoreLabel next) {
      int begin = curr.beginPosition(), end = curr.endPosition();
      if(begin + 1 != end || begin != prev.endPosition() || prev.beginPosition() == prev.endPosition())
        return true;
      String ct = curr.getString(CoreAnnotations.OriginalTextAnnotation.class);
      // All code below will try to fix sentence splitter problems.
      if(!".".equals(ct))
        return true;
      String pt = prev.getString(CoreAnnotations.OriginalTextAnnotation.class);
      if(pt.isEmpty())
        return true;
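      // MAX_ARABIC_DIGITS and LIST (a set of known abbreviations) are defined elsewhere and not shown here.
      if((pt.length() <= MAX_ARABIC_DIGITS && isDigits(pt)) || isRomanDigits(pt) || LIST.contains(pt)) {
        // We keep TextAnnotation unmodified, so that "1." gets labeled CARDINAL.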
        // We only modify the original text annotation & adjust the end position
        prev.set(CoreAnnotations.OriginalTextAnnotation.class, pt + ct);
        prev.setEndPosition(end);
        return false; // Don't retain this token.
      }
      if(next != null) {
        String nt = next.getString(CoreAnnotations.OriginalTextAnnotation.class);
        if(!nt.isEmpty() && Character.isLowerCase(nt.charAt(0))) {
          prev.set(CoreAnnotations.OriginalTextAnnotation.class, pt + ct);
          prev.set(CoreAnnotations.TextAnnotation.class, prev.getString(CoreAnnotations.TextAnnotation.class) + curr.getString(CoreAnnotations.TextAnnotation.class));
          prev.setEndPosition(end);
          return false;
        }
      }
      return true;
    }

    private static boolean isDigits(String pt) {
      for(int i = 0; i < pt.length(); i++) {
        final char c = pt.charAt(i);
        if(!Character.isDigit(c) && c != '-' && c != '.')
          return false;
      }
      return true;
    }

    private static boolean isRomanDigits(String pt) {
      String numerals = Character.isLowerCase(pt.charAt(0)) ? "ivxlcdm" : "IVXLCDM";
      for(int i = 0; i < pt.length(); i++)
        if(numerals.indexOf(pt.charAt(i)) < 0)
          return false;
      return true;
    }

    @SuppressWarnings("rawtypes")
    @Override
    public Set<Class<? extends CoreAnnotation>> requirementsSatisfied() {
      return inner.requirementsSatisfied();
    }

    @SuppressWarnings("rawtypes")
    @Override
    public Set<Class<? extends CoreAnnotation>> requires() {
      return inner.requires();
    }
  }

This is very hackish. The Roman numeral code does not check at all whether the string is a valid Roman numeral, etc. LIST is a set of known abbreviations. I guess a similar list already exists in the regular tokenizer.
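If stricter Roman numeral detection is wanted, one option is the common regular expression for numerals in standard subtractive notation (1 to 3999). The sketch below is an assumption about how isRomanDigits could be tightened, not code from the workaround; the class name RomanNumerals is hypothetical.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical stricter replacement for isRomanDigits.
final class RomanNumerals {

  // Thousands, hundreds, tens, ones in standard subtractive notation.
  private static final Pattern STRICT = Pattern.compile(
      "M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})");

  /** True for well-formed numerals such as "XIV" or "xiv"; false for "IIII", "VX" or "". */
  static boolean isRomanNumeral(String s) {
    if (s == null || s.isEmpty())
      return false; // the pattern alone would also match the empty string
    String upper = s.toUpperCase(Locale.ROOT);
    String lower = s.toLowerCase(Locale.ROOT);
    // Accept all-uppercase or all-lowercase input, but not mixed case like "Xi".
    if (!s.equals(upper) && !s.equals(lower))
      return false;
    return STRICT.matcher(upper).matches();
  }
}
```

Even a strict check still accepts single letters such as "I", "C" or "M", so initials and the English pronoun "I" remain ambiguous.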

@aghasemi

Hi, any update here?

@manning manning modified the milestones: v.3.9.0, v.4.3 May 13, 2021