
[QUESTION] German contraction of "an dem" to "am" #1369

Open
GeorgeS2019 opened this issue Mar 17, 2024 · 12 comments

GeorgeS2019 commented Mar 17, 2024

Am

“Am” is a contraction of “an” and “dem”.

  • It is used to mean “at the” for locations or times of day
  • For example, “am Wochenende” means “on the weekend”.

An dem

“An dem” is used when you want to keep “an” and “dem” separate for emphasis or clarity.

  • However, it’s not wrong to use “an dem” instead of “am”, but it might sound a bit unusual

How does Stanza handle them?

The single token "am" has a word id that references TWO additional words: "an" and "dem".

It is simpler to parse a plain int coming back from word.id.
Now, instead of an int, the id is an array referencing the two additional words.

The challenges:
The parent token has start_char and end_char, but the morphological features are carried by the child words, e.g. "dem".

Question

I wonder how best to handle this when parsing.

[1]

  • Please indicate which part of the Stanza code creates the additional words, and how Stanza handles them when they exist in a sentence.

[2]

  • Does something similar also happen in CoreNLP (Java)? Please indicate where the code that creates them is, and where the code that parses them successfully is.


{
    "id": [
      10,
      11
    ],
    "text": "am",
    "start_char": 56,
    "end_char": 58,
    "ner": "O",
    "multi_ner": [
      "O"
    ]
  },
  {
    "id": 10,
    "text": "an",
    "lemma": "an",
    "upos": "ADP",
    "xpos": "APPR",
    "head": 12,
    "deprel": "case"
  },
  {
    "id": 11,
    "text": "dem",
    "lemma": "der",
    "upos": "DET",
    "xpos": "ART",
    "feats": "Case=Dat|Definite=Def|Gender=Masc|Number=Sing|PronType=Art",
    "head": 12,
    "deprel": "det"
  }
GeorgeS2019 (Author)

I doubt spaCy would handle it this way; I am simply curious.

AngledLuffa (Collaborator) commented Mar 17, 2024

This is a complicated question which comes up frequently, and people never seem to like the answer. However, my impression is probably skewed the same way as the famous bullet holes in planes (survivorship bias): only the people who don't like the answer show up on GitHub.

This is what CoreNLP does:

edit: this whole German CoreNLP section was done with the wrong annotation pipeline, see below

NLP> Der Firma liegt genau am Ortseingang.

Sentence #1 (7 tokens):
Der Firma liegt genau am Ortseingang.

Tokens:
[Text=Der CharacterOffsetBegin=0 CharacterOffsetEnd=3 PartOfSpeech=NNP Lemma=Der NamedEntityTag=O]
[Text=Firma CharacterOffsetBegin=4 CharacterOffsetEnd=9 PartOfSpeech=NNP Lemma=Firma NamedEntityTag=O]
[Text=liegt CharacterOffsetBegin=10 CharacterOffsetEnd=15 PartOfSpeech=NN Lemma=liegt NamedEntityTag=O]
[Text=genau CharacterOffsetBegin=16 CharacterOffsetEnd=21 PartOfSpeech=NN Lemma=genau NamedEntityTag=O]
[Text=am CharacterOffsetBegin=22 CharacterOffsetEnd=24 PartOfSpeech=VBP Lemma=be NamedEntityTag=O]
[Text=Ortseingang CharacterOffsetBegin=25 CharacterOffsetEnd=36 PartOfSpeech=NNP Lemma=Ortseingang NamedEntityTag=PERSON]
[Text=. CharacterOffsetBegin=36 CharacterOffsetEnd=37 PartOfSpeech=. Lemma=. NamedEntityTag=O]

Dependency Parse (enhanced plus plus dependencies):
root(ROOT-0, Ortseingang-6)
compound(Firma-2, Der-1)
compound(genau-4, Firma-2)
compound(genau-4, liegt-3)
nsubj(Ortseingang-6, genau-4)
cop(Ortseingang-6, am-5)
punct(Ortseingang-6, .-7)

The original training data in the UD treebank was

# sent_id = train-s25
# text = Der Firma liegt genau am Ortseingang.
1       Der     der     DET     ART     Case=Nom|Definite=Def|Gender=Masc|Number=Sing|PronType=Art      2       det     _       _
2       Firma   Firma   NOUN    NN      Case=Nom|Gender=Masc|Number=Sing        3       nsubj   _       _
3       liegt   liegen  VERB    VVFIN   Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   0       root    _       _
4       genau   genau   ADV     ADV     _       7       advmod  _       _
5-6     am      _       _       _       _       _       _       _       _
5       an      an      ADP     APPR    _       7       case    _       _
6       dem     der     DET     ART     Case=Dat|Definite=Def|Gender=Masc|Number=Sing|PronType=Art      7       det     _       _
7       Ortseingang     Ortseingang     NOUN    NN      Case=Dat|Gender=Masc|Number=Sing        3       obl     _       SpaceAfter=No
8       .       .       PUNCT   $.      _       3       punct   _       _
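The `5-6` row above is how CoNLL-U marks a multi-word token: a range id on the surface form, followed by one row per syntactic word. A minimal sketch (plain Python, an illustration rather than Stanza's actual CoNLL-U reader) of how a consumer can recognize those ranges:

```python
# Minimal CoNLL-U reader sketch: rows whose ID column contains "-" are
# multi-word tokens (MWT) covering the word rows in that range.
# This is an illustration, not Stanza's actual reader.

CONLLU = (
    "5-6\tam\t_\t_\t_\t_\t_\t_\t_\t_\n"
    "5\tan\tan\tADP\tAPPR\t_\t7\tcase\t_\t_\n"
    "6\tdem\tder\tDET\tART\t"
    "Case=Dat|Definite=Def|Gender=Masc|Number=Sing|PronType=Art\t7\tdet\t_\t_\n"
)

def parse_rows(text):
    tokens = []   # surface tokens: (form, [covered word ids])
    words = {}    # word id -> (form, lemma, upos)
    for line in text.strip().splitlines():
        cols = line.split("\t")
        if "-" in cols[0]:                       # MWT range like "5-6"
            start, end = map(int, cols[0].split("-"))
            tokens.append((cols[1], list(range(start, end + 1))))
        else:
            words[int(cols[0])] = (cols[1], cols[2], cols[3])
    return tokens, words

tokens, words = parse_rows(CONLLU)
# tokens holds ("am", [5, 6]); words holds the analyses of "an" and "dem"
```

The surface token "am" carries no analysis of its own; all tags live on word rows 5 and 6.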

The thing with the CoreNLP representation is, am is not a copular verb as far as I know. Google translate says it means "at the". Also, it's completely missing that the liegt is the verb. Basically that representation sucks.

The problem is that am in fact represents two words at the same time - the adposition and the determiner. If you just implement one tag for the entire token, probably the adposition, leaving out the determiner, that would be a little weird. Even more awkward would be a combination tag of some kind (although to be fair some datasets have adopted that approach, such as the Korean UD treebanks)

The solution UD adopted for most languages is to represent the text as a single token, am in this case, and split the analysis into the two words, an and dem. It is true there are some inconveniences here as well, such as an does not correspond to an actual start & end character. However, it makes analysis of words such as am much easier, since now you can analyze both words that it represents in a proper manner.

This happens in other languages. In Spanish, the pronoun clitics get split from verbs - otherwise you'd have 10x as many verb forms to analyze. In English, there is the entire class of possessives, standard contractions such as can't, won't, it's, and colloquial contractions such as cannot, gonna, wanna. Then at the edges you can have 20-response-long threads on UD about kinda or mighta as possible additions to the splittable lexicon... (These kinds of threads alternate between amusing me every time I kick one off and discouraging me from asking in the first place about the best way for our software to analyze specific text)

Long story short, if all you want is the analysis of the pieces, you can either filter out of the json / dict representation any token whose id isn't just an int, or you can use doc.sentences[idx].words instead of the dict representation. That might be a little unsatisfying, since the words won't have character offsets in a language such as German, where the MWT don't split into easily understood pieces (compare to English, where we split cannot -> can not... how would you split am as text?). The Word objects each have a pointer to the enclosing Token, though, and the Token does have the start_char and end_char for the entire piece of text.
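The filtering approach can be sketched in plain Python against the dict-style output shown earlier in this thread (the entries below are abbreviated from that output; no Stanza call is needed for the filtering itself):

```python
# Sketch: split Stanza's dict output into surface tokens (list ids)
# and syntactic words (int ids), and map each word id back to the
# enclosing token's character offsets. Entry shapes as shown above.

entries = [
    {"id": [10, 11], "text": "am", "start_char": 56, "end_char": 58},
    {"id": 10, "text": "an", "upos": "ADP", "deprel": "case"},
    {"id": 11, "text": "dem", "upos": "DET", "deprel": "det"},
]

# Syntactic words have a plain int id.
words = [e for e in entries if isinstance(e["id"], int)]

# MWT entries have a list id; record their offsets for each covered word.
mwt_spans = {}  # word id -> (start_char, end_char) of the surface token
for e in entries:
    if isinstance(e["id"], list):
        for wid in e["id"]:
            mwt_spans[wid] = (e["start_char"], e["end_char"])
```

Both "an" and "dem" end up mapped to the character span of the surface token "am".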

As for spacy, it does

>>> doc = nlp("I don't know what spacy does with MWT")
>>> for token in doc:
...     print(token.text, token.pos_, token.dep_)
...
I PRON nsubj
do AUX aux
n't PART neg
know VERB ROOT
what PRON det
spacy NOUN nsubj
does VERB ccomp
with ADP prep
MWT PROPN pobj

>>> doc = nlp("I wanna lick Jennifer's antennae")
>>> for token in doc:
...     print(token.text, token.pos_, token.dep_)
...
I PRON nsubj
wanna VERB ROOT
lick PROPN compound
Jennifer PROPN poss
's PART case
antennae NOUN dobj

>>> nlp = spacy.load('en_core_web_trf')
>>> doc = nlp("I wanna lick Jennifer's antennae")
>>> for token in doc:
...    print(token.text, token.pos_, token.dep_)
...
I PRON nsubj
wanna AUX aux
lick VERB ROOT
Jennifer PROPN poss
's PART case
antennae NOUN dobj

>>> nlp = spacy.load('de_dep_news_trf')
>>> doc = nlp("Der Firma liegt genau am Ortseingang.")
>>> for token in doc:
...    print(token.text, token.pos_, token.dep_)
...
Der DET nk
Firma NOUN da
liegt VERB ROOT
genau ADV mo
am ADP mo
Ortseingang NOUN nk
. PUNCT punct

So they are treating contractions as single words (although they do split clitics). IDK, maybe people prefer that representation

GeorgeS2019 (Author)

First, thanks for taking the time to provide such an elaborate answer.

Many top tech companies are using Stanza and CoreNLP.
I saw the same mistake, and I am here to give feedback.

German is no doubt a very challenging language.

I am here to learn and to give feedback :-)

CoreNLP: lemma of "am" => "be"??

[Text=am CharacterOffsetBegin=22 CharacterOffsetEnd=24 PartOfSpeech=VBP Lemma=be NamedEntityTag=O]

==> VBP is unfortunately not correct.

From ChatGPT

The lemma of “am” would be “an” and "dem"

UD Treebank

Correct!

5-6     am      _       _       _       _       _       _       _       _
5       an      an      ADP     APPR    _       7       case    _       _
6       dem     der     DET     ART     Case=Dat|Definite=Def|Gender=Masc|Number=Sing|PronType=Art      7       det     _       _

an => ADP (preposition)
dem => DET (determiner)

Spacy

"an dem" => "an" is a preposition and "dem" is a determiner (article) in the dative form.

Therefore ADP (Preposition) is correct for "am"

GeorgeS2019 (Author)

Thanks for the tips on how to parse; really helpful.

AngledLuffa (Collaborator)

Ha ha, I accidentally used the English CoreNLP pipeline instead of the German one. Let me revise...

NLP> Der Firma liegt genau am Ortseingang.

Sentence #1 (8 tokens):
Der Firma liegt genau am Ortseingang.

Tokens:
[Text=Der CharacterOffsetBegin=0 CharacterOffsetEnd=3 PartOfSpeech=DET NamedEntityTag=O]
[Text=Firma CharacterOffsetBegin=4 CharacterOffsetEnd=9 PartOfSpeech=NOUN NamedEntityTag=O]
[Text=liegt CharacterOffsetBegin=10 CharacterOffsetEnd=15 PartOfSpeech=VERB NamedEntityTag=O]
[Text=genau CharacterOffsetBegin=16 CharacterOffsetEnd=21 PartOfSpeech=ADV NamedEntityTag=O]
[Text=an CharacterOffsetBegin=22 CharacterOffsetEnd=24 PartOfSpeech=ADP NamedEntityTag=O]
[Text=dem CharacterOffsetBegin=22 CharacterOffsetEnd=24 PartOfSpeech=DET NamedEntityTag=O]
[Text=Ortseingang CharacterOffsetBegin=25 CharacterOffsetEnd=36 PartOfSpeech=NOUN NamedEntityTag=O]
[Text=. CharacterOffsetBegin=36 CharacterOffsetEnd=37 PartOfSpeech=PUNCT NamedEntityTag=O]

Dependency Parse (enhanced plus plus dependencies):
root(ROOT-0, liegt-3)
det(Firma-2, Der-1)
nsubj(liegt-3, Firma-2)
advmod(Ortseingang-7, genau-4)
case(Ortseingang-7, an-5)
det(Ortseingang-7, dem-6)
obl:an(liegt-3, Ortseingang-7)
punct(liegt-3, .-8)

Okay, that's much better. It also splits am, then labels the start and end characters as the same (overlapping) text positions as the original word. So effectively it's the same design choice as made in Stanza, but without an explicit marker that it was a multi-word token.
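Since CoreNLP signals the split only through shared offsets, a downstream consumer can recover the grouping by bucketing tokens on their (begin, end) pair. A sketch in plain Python over the token fields printed above (not a CoreNLP API call):

```python
from collections import defaultdict

# Tokens as printed above: (text, begin, end). "an" and "dem" share
# the character offsets of the original surface token "am".
tokens = [("Der", 0, 3), ("Firma", 4, 9), ("liegt", 10, 15),
          ("genau", 16, 21), ("an", 22, 24), ("dem", 22, 24),
          ("Ortseingang", 25, 36), (".", 36, 37)]

groups = defaultdict(list)
for text, begin, end in tokens:
    groups[(begin, end)].append(text)

# Any offset span shared by more than one token came from one surface word.
split_tokens = {span: texts for span, texts in groups.items() if len(texts) > 1}
```

Only the span (22, 24) ends up with two tokens, recovering the "am" -> "an" + "dem" split without an explicit MWT marker.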

GeorgeS2019 (Author)

Regarding "a multi-word token": I have yet to appreciate the benefits of treating it as a multi-word token.

So far, I only know a very limited number of languages.

GeorgeS2019 (Author)

Stanza

I could be doing it wrong, but I doubt I get start_char and end_char for "an" and "dem" when iterating:

for word in sentence.words:
    ...

doc.sentences[idx].words
{
    "id": [
      10,
      11
    ],
    "text": "am",
    "start_char": 56,
    "end_char": 58,
    "ner": "O",
    "multi_ner": [
      "O"
    ]
  },
  {
    "id": 10,
    "text": "an",
    "lemma": "an",
    "upos": "ADP",
    "xpos": "APPR",
    "head": 12,
    "deprel": "case"
  },
  {
    "id": 11,
    "text": "dem",
    "lemma": "der",
    "upos": "DET",
    "xpos": "ART",
    "feats": "Case=Dat|Definite=Def|Gender=Masc|Number=Sing|PronType=Art",
    "head": 12,
    "deprel": "det"
  }

AngledLuffa (Collaborator)

True true. But what you can do is

>>> doc = pipe("Der Firma liegt genau am Ortseingang.")
>>> doc.sentences[0].words[4]
{
  "id": 5,
  "text": "an"
}
>>> doc.sentences[0].words[4].parent
[
  {
    "id": [
      5,
      6
    ],
    "text": "am",
    "start_char": 22,
    "end_char": 24
  },
  {
    "id": 5,
    "text": "an"
  },
  {
    "id": 6,
    "text": "dem"
  }
]
>>> doc.sentences[0].words[4].parent.start_char
22
>>> doc.sentences[0].words[4].parent.end_char
24
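That parent lookup is easy to wrap in a small helper: if a word has no offsets of its own (an MWT child), fall back to the enclosing token's span. The tiny stand-in classes below just mimic the word/parent attribute shape shown above; they are not Stanza classes:

```python
# Sketch of an offset helper with an MWT fallback.
# SimpleToken/SimpleWord are stand-ins mimicking the attributes above.

class SimpleToken:
    def __init__(self, start_char, end_char):
        self.start_char, self.end_char = start_char, end_char

class SimpleWord:
    def __init__(self, text, parent, start_char=None, end_char=None):
        self.text, self.parent = text, parent
        self.start_char, self.end_char = start_char, end_char

def char_span(word):
    """Return (start, end) for a word, using the enclosing token's
    offsets when the word itself (an MWT child) has none."""
    if word.start_char is not None and word.end_char is not None:
        return (word.start_char, word.end_char)
    return (word.parent.start_char, word.parent.end_char)

am = SimpleToken(22, 24)                       # surface token "am"
an = SimpleWord("an", parent=am)               # MWT child: no own offsets
ort = SimpleWord("Ortseingang", parent=SimpleToken(25, 36),
                 start_char=25, end_char=36)   # ordinary word
```

With this, MWT children like "an" and "dem" report the span of "am", while ordinary words report their own offsets.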

GeorgeS2019 (Author)

Thank you, valuable tip!

GeorgeS2019 (Author)

What could cause the wrong POS of "miaut" in "Der Hund bellt, die Katze miaut."?

"miaut" is not tagged as a verb by Stanza. I am curious how this could happen.

AngledLuffa (Collaborator)

It is a verb if you use the default_accurate models. That has the more accurate constituency parser, anyway, so I would suggest doing that if accurate constituency parses are desired

As for the root cause, that word doesn't show up in the training data, so all it has to go on are the embeddings and the context of the sentence. Sometimes it will get such a thing wrong

GeorgeS2019 (Author)

What I have learned over the last few weeks is that one may need to go deeper into the source and into how the training is done. Each approach seems to have more success in one case, while another approach does better in a different case. In many ways I see the merits of how Stanza approaches the subject.
