Incorrect sentence splitting in German (and some other European languages) at dots after numbers (e.g. German: `1. Bundesliga`) #380

kno10 · 2017-03-10T13:03:13Z

German (and some other European languages) uses a dot to denote ordinals.

For example, instead of "1st place", German uses "1. Platz".
Instead of "July 28th", German uses "28. Juli".

Examples can be found en masse, for example:
dewiki:Fußball-Bundesliga (28. Juli, 2. Bundesliga, 1. Liga)
dewiki:9/11 (11. September)
dewiki:Stanford University (Der Grund und Boden wurde am 11. November 1885 von Leland Stanford zur Gründung der Universität gestiftet)

And the Duden, the "prescriptive source for German language spelling" (Wikipedia) uses:
Duden - Die deutsche Rechtschreibung, 26. Auflage

Unfortunately, CoreNLP will split all these sentences at the dot.

So CoreNLP currently cannot reliably split German sentences if they contain ordinal numbers or dates.

I am currently using the following workaround hack:

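  // Imports needed by this snippet; they belong at the top of the enclosing file.
  import java.util.ArrayList;
  import java.util.List;
  import java.util.Set;
  import edu.stanford.nlp.ling.CoreAnnotation;
  import edu.stanford.nlp.ling.CoreAnnotations;
  import edu.stanford.nlp.ling.CoreLabel;
  import edu.stanford.nlp.pipeline.Annotation;
  import edu.stanford.nlp.pipeline.Annotator;
  import edu.stanford.nlp.pipeline.TokenizerAnnotator;

  // Wraps the CoreNLP tokenizer and merges a "." token into a preceding all-digit token.
  private static class FilteredTokenizer implements Annotator {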
    private TokenizerAnnotator inner;

    public FilteredTokenizer(TokenizerAnnotator inner) {
      this.inner = inner;
    }

    @Override
    public void annotate(Annotation annotation) {
      inner.annotate(annotation);
      List<CoreLabel> tokens = annotation.get(CoreAnnotations.TokensAnnotation.class);
      ArrayList<CoreLabel> filtered = new ArrayList<>(tokens.size());
      CoreLabel previous = null;
      for(CoreLabel t : tokens)
        if(previous == null || !updateAnnotation(previous, t))
          filtered.add(previous = t);
      annotation.set(CoreAnnotations.TokensAnnotation.class, filtered);
    }

    private boolean updateAnnotation(CoreLabel prev, CoreLabel curr) {
      int begin = curr.beginPosition(), end = curr.endPosition();
      if(begin + 1 != end || begin != prev.endPosition() || prev.beginPosition() == prev.endPosition())
        return false;
      String ct = curr.getString(CoreAnnotations.OriginalTextAnnotation.class);
      if(!".".equals(ct))
        return false;
      String pt = prev.getString(CoreAnnotations.OriginalTextAnnotation.class);
      for(int i = 0; i < pt.length(); i++)
        if(!Character.isDigit(pt.charAt(i)))
          return false;
      // We keep TextAnnotation unmodified, so "1." still gets labeled CARDINAL.
      prev.set(CoreAnnotations.OriginalTextAnnotation.class, pt + ct);
      prev.setEndPosition(end);
      return true;
    }

    @SuppressWarnings("rawtypes")
    @Override
    public Set<Class<? extends CoreAnnotation>> requirementsSatisfied() {
      return inner.requirementsSatisfied();
    }

    @SuppressWarnings("rawtypes")
    @Override
    public Set<Class<? extends CoreAnnotation>> requires() {
      return inner.requires();
    }
  }

J38 · 2017-03-10T23:52:01Z

OK, I'm going to try to figure out the easiest way to fix this, either by altering the tokenizer or the sentence splitter.

Could you help me understand which German terms indicate a non-split? I can see months, e.g. Juli, as an example. Are there any other terms you would suggest, such as Platz?

kno10 · 2017-03-11T16:39:07Z

Unfortunately, I don't think there is a general rule.

German writers would likely simply avoid ending a sentence with a digit.

There are too many: 1. Platz, 1. Preis, 1. Mannschaft, 1. Bundesliga, 1. Liga, 1. Rang, 1. Kategorie, 1. Reise, 1. Tag, 52. Woche, 3. Monat, 1. Jahrhundert, 1. Buch Samuel, 42. Kongress, "1. Simmeringer Sport-Club", "1. FC Nürnberg". These are all grammatically correct usage. It's not specific to dates; that was just where I first noticed that this is really a major problem. Once you look out for them, they are literally everywhere.
Just go to http://de.wikipedia.org/ and type "1." into the search box. Even with "42." you get many suggestions.

So I believe my workaround uses the best available heuristic: assume that a dot directly following digits does not end a sentence (it could, but it usually won't).

There are other cases where the assumption that a dot always ends a sentence is incorrect. In particular, a dot after an abbreviation may or may not end a sentence. "Äpfel, Bananen, usw. sind Obst." is another example that is incorrectly split by CoreNLP. (usw. = etc.; Apples, bananas, etc. are fruit.)
But these cases are hard to define without already having POS tags and without massive learning or a large dictionary.
To get some samples, you could look at Wikipedia page titles:
https://de.wikipedia.org/w/index.php?title=Usw.&redirect=no

For the digits+(no space)+dot rule, I guess that not splitting the sentence is correct in over 90% of cases; but that is a wild guess. Linguists might be able to come up with a number, though.
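
To make the digits+(no space)+dot rule concrete, here is a minimal standalone sketch in plain Java. It is not CoreNLP code and does not use the CoreNLP API; the class name `OrdinalAwareSplitter` and its `split` method are hypothetical and exist only for illustration. It assumes a deliberately naive splitter that breaks at `.`, `!` or `?` followed by whitespace, and adds one exception: a dot directly after a digit is treated as an ordinal marker unless it is the last character of the text.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not part of CoreNLP. */
final class OrdinalAwareSplitter {
  private OrdinalAwareSplitter() {}

  /**
   * Splits at '.', '!' or '?' when followed by whitespace or end of text,
   * except for a '.' that directly follows a digit (assumed to be an ordinal).
   */
  static List<String> split(String text) {
    List<String> sentences = new ArrayList<>();
    int start = 0;
    for (int i = 0; i < text.length(); i++) {
      char c = text.charAt(i);
      if (c != '.' && c != '!' && c != '?') {
        continue;
      }
      boolean atEnd = i + 1 == text.length();
      boolean beforeSpace = !atEnd && Character.isWhitespace(text.charAt(i + 1));
      if (!atEnd && !beforeSpace) {
        continue; // dot inside a token, e.g. "3.14" or "L.Douglass"
      }
      boolean afterDigit = c == '.' && i > 0 && Character.isDigit(text.charAt(i - 1));
      if (afterDigit && !atEnd) {
        continue; // digits + dot: assume an ordinal such as "28. Juli"
      }
      String sentence = text.substring(start, i + 1).trim();
      if (!sentence.isEmpty()) {
        sentences.add(sentence);
      }
      start = i + 1;
    }
    // Keep any trailing text that has no closing punctuation.
    String rest = text.substring(start).trim();
    if (!rest.isEmpty()) {
      sentences.add(rest);
    }
    return sentences;
  }
}
```

Calling `OrdinalAwareSplitter.split("Der Verein spielt in der 2. Bundesliga. Das Spiel ist am 28. Juli.")` keeps "2. Bundesliga" and "28. Juli" inside their sentences. The trade-off is the one described above: a sentence that really does end in a number mid-text would be merged with the following one.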

aadrian · 2017-03-11T16:53:21Z

> Linguists might be able to come up with a number, though.

This https://languagetool.org/ seems to have quite an extensive set of language rules and dictionaries:
https://github.com/languagetool-org/languagetool/tree/master/languagetool-language-modules/de

and it's also written in Java.

Maybe CoreNLP could reuse that?

kno10 · 2017-03-14T15:55:56Z

Looking at some of the results for English Wikipedia, it may be well worth investigating sentence splitting further; e.g. abbreviation detection. For example on enwiki:I Represent (album):

6. Like You Badd (Feat. L.Douglass, N.Holsey, D.Lockhart)(Prod. by Fat Boy)

(You may also notice that in enumerations, English apparently also uses the dot after the number.)

Here, the sentence splitter will treat Prod. and Feat. as sentence ends, i.e.

Like You Badd (Feat.
L.Douglass, N.Holsey, D.Lockhart)(Prod.
by Fat Boy)

A lowercase letter at the start of the next word is probably a good indicator of an abbreviation, and a single word inside the parentheses is a second negative signal for end-of-sentence. Also, IMHO, when parentheses are balanced, a sentence should usually not contain an opening parenthesis without its matching closing parenthesis.
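
Two of the signals above can be sketched in plain Java as follows. This is only an illustration under stated assumptions: the class `BoundaryHeuristics` and the method `isLikelySentenceEnd` are hypothetical names, not CoreNLP API, and the single-word-in-parentheses signal is not implemented. The check rejects a dot as a sentence end when an opening parenthesis before it is still unclosed, or when the next non-whitespace character is a lowercase letter.

```java
/** Hypothetical illustration only; not part of CoreNLP. */
final class BoundaryHeuristics {
  private BoundaryHeuristics() {}

  /**
   * Returns false when the dot at dotIndex is unlikely to end a sentence:
   * either a '(' before it has not been closed yet, or the next
   * non-whitespace character is a lowercase letter.
   */
  static boolean isLikelySentenceEnd(String text, int dotIndex) {
    // Signal 1: count parentheses that are still open before the dot.
    int depth = 0;
    for (int i = 0; i < dotIndex; i++) {
      char c = text.charAt(i);
      if (c == '(') {
        depth++;
      } else if (c == ')' && depth > 0) {
        depth--;
      }
    }
    if (depth > 0) {
      return false;
    }
    // Signal 2: a lowercase continuation suggests an abbreviation.
    int next = dotIndex + 1;
    while (next < text.length() && Character.isWhitespace(text.charAt(next))) {
      next++;
    }
    if (next < text.length() && Character.isLowerCase(text.charAt(next))) {
      return false;
    }
    return true;
  }
}
```

For the track listing above, the dots after "Feat" and "Prod" are both rejected because a parenthesis is still open at that point. The lowercase check alone would already reject "Prod. by", but not "Feat. L.Douglass", which is why the two signals are combined.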

parajain · 2018-01-31T12:59:47Z

I am also facing the same issue. @kno10 Can you please give some details on how to use your quick fix until this is fixed?
Also, the digits+(no space)+dot rule seems reasonable to me for most cases.

kno10 · 2018-02-13T13:16:14Z

Sorry, it is a bit too hackish to share right now, and may or may not work depending on which part of the API you call. What I do is subclass StanfordCoreNLP, override getAnnotatorImplementations(), and there override only Annotator tokenizer(Properties properties) with a filtered tokenizer wrapped around the CoreNLP tokenizer that undoes some of the default tokenization.

  private static class FilteredTokenizer implements Annotator {
    private Annotator inner;

    public FilteredTokenizer(Annotator inner) {
      this.inner = inner;
    }

    @Override
    public void annotate(Annotation annotation) {
      inner.annotate(annotation);
      List<CoreLabel> tokens = annotation.get(CoreAnnotations.TokensAnnotation.class);
      if(tokens.isEmpty())
        return;
      ArrayList<CoreLabel> filtered = new ArrayList<>(tokens.size());
      CoreLabel previous = null, cur = null, next = tokens.get(0);
      for(int i = 1; i <= tokens.size(); i++) {
        cur = next;
        next = i < tokens.size() ? tokens.get(i) : null;
        if(previous == null || keepNextToken(previous, cur, next))
          filtered.add(previous = cur);
      }
      annotation.set(CoreAnnotations.TokensAnnotation.class, filtered);
    }

    private boolean keepNextToken(CoreLabel prev, CoreLabel curr, CoreLabel next) {
      int begin = curr.beginPosition(), end = curr.endPosition();
      if(begin + 1 != end || begin != prev.endPosition() || prev.beginPosition() == prev.endPosition())
        return true;
      String ct = curr.getString(CoreAnnotations.OriginalTextAnnotation.class);
      // All code below will try to fix sentence splitter problems.
      if(!".".equals(ct))
        return true;
      String pt = prev.getString(CoreAnnotations.OriginalTextAnnotation.class);
      if(pt.isEmpty())
        return true;
      if((pt.length() <= MAX_ARABIC_DIGITS && isDigits(pt)) || isRomanDigits(pt) || LIST.contains(pt)) {
        // We keep TextAnnotation unmodified, so "1." still gets labeled CARDINAL.
        // We only modify the original text annotation & adjust the end position
        prev.set(CoreAnnotations.OriginalTextAnnotation.class, pt + ct);
        prev.setEndPosition(end);
        return false; // Don't retain this token.
      }
      if(next != null) {
        String nt = next.getString(CoreAnnotations.OriginalTextAnnotation.class);
        if(!nt.isEmpty() && Character.isLowerCase(nt.charAt(0))) {
          prev.set(CoreAnnotations.OriginalTextAnnotation.class, pt + ct);
          prev.set(CoreAnnotations.TextAnnotation.class, prev.getString(CoreAnnotations.TextAnnotation.class) + curr.getString(CoreAnnotations.TextAnnotation.class));
          prev.setEndPosition(end);
          return false;
        }
      }
      return true;
    }

    private static boolean isDigits(String pt) {
      for(int i = 0; i < pt.length(); i++) {
        final char c = pt.charAt(i);
        if(!Character.isDigit(c) && c != '-' && c != '.')
          return false;
      }
      return true;
    }

    private static boolean isRomanDigits(String pt) {
      String numerals = Character.isLowerCase(pt.charAt(0)) ? "ivxlcdm" : "IVXLCDM";
      for(int i = 0; i < pt.length(); i++)
        if(numerals.indexOf(pt.charAt(i)) < 0)
          return false;
      return true;
    }

    @SuppressWarnings("rawtypes")
    @Override
    public Set<Class<? extends CoreAnnotation>> requirementsSatisfied() {
      return inner.requirementsSatisfied();
    }

    @SuppressWarnings("rawtypes")
    @Override
    public Set<Class<? extends CoreAnnotation>> requires() {
      return inner.requires();
    }
  }

This is very hackish. The Roman numeral code does not check at all whether the token is a valid Roman numeral, and so on. LIST is a set of known abbreviations; I guess a similar list already exists in the regular tokenizer.
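
As a possible refinement of the Roman numeral check, here is a sketch of a stricter validator in plain Java. It is not the code used in the workaround: the class `RomanNumerals` and the method `isRomanNumeral` are hypothetical names, and the sketch assumes standard subtractive notation for values from 1 to 3999, written entirely in upper case or entirely in lower case.

```java
import java.util.Locale;
import java.util.regex.Pattern;

/** Hypothetical illustration only; not part of CoreNLP. */
final class RomanNumerals {
  // Thousands, hundreds, tens and units in standard subtractive notation.
  private static final Pattern STRICT =
      Pattern.compile("M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})");

  private RomanNumerals() {}

  /** Accepts well-formed numerals such as "XIV" or "xiv"; rejects "IIII", "VX" and mixed case. */
  static boolean isRomanNumeral(String s) {
    if (s.isEmpty()) {
      return false; // the pattern alone would also match the empty string
    }
    String upper = s.toUpperCase(Locale.ROOT);
    String lower = s.toLowerCase(Locale.ROOT);
    if (!s.equals(upper) && !s.equals(lower)) {
      return false; // mixed case such as "Xiv"
    }
    return STRICT.matcher(upper).matches();
  }
}
```

This rejects strings like "IIII" or "VX" that the character-set check accepts, but it cannot resolve real ambiguity: tokens such as "MIX", "DI" or "I" are valid numerals and also ordinary words, so context would still be needed.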

aghasemi · 2019-02-13T16:32:59Z

Hi, any update here?

J38 added severe-bug ssplit tokenize bug and removed severe-bug labels Sep 29, 2017

J38 added 3.9.0-fix and removed 3.9.0-fix labels Oct 27, 2017

J38 added this to the v.3.9.0 milestone Oct 31, 2017

manning modified the milestones: v.3.9.0, v.4.3 May 13, 2021
