[QUESTION] German contraction of "an dem" to "am" #1369
Comments
I doubt spaCy would handle it this way; I am simply curious. |
This is a complicated question which comes up frequently, and people never seem to like the answer. However, my impression of that is probably skewed in the same way as the bullet holes in planes: only the people who don't like the answer show up on GitHub.
edit: this whole German CoreNLP section was done with the wrong annotation pipeline, see below
The original training data in the UD treebank was
The problem is that
The solution UD adopted for most languages is to represent the text as a single token,
This happens in other languages. In Spanish, the pronoun clitics get split from verbs - otherwise you'd have 10x as many verbs to analyze. In English, the entire class of possessives, standard contractions such as
Long story short, if all you want is the analysis of the pieces, you can either filter out from the json / dict representation any token whose id isn't just an int, or you can call
As for spaCy, it does
So they are treating contractions as single words (although they do split clitics). IDK, maybe people prefer that representation |
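The filtering suggested above (dropping any token whose id isn't just an int) can be sketched as follows. This is a minimal example, assuming the dict layout that `doc.to_dict()` produces in recent Stanza versions, where a multi-word token carries a list or tuple id such as `(3, 4)` and each syntactic word carries a plain int id. The sample sentence is hand-written for illustration, not real pipeline output, and `words_only` is a hypothetical helper name.

```python
# Hand-written sample in the shape of one sentence from doc.to_dict();
# the field values are illustrative, not actual Stanza output.
sentence = [
    {"id": 1, "text": "Ich", "upos": "PRON", "start_char": 0, "end_char": 3},
    {"id": 2, "text": "bin", "upos": "AUX", "start_char": 4, "end_char": 7},
    {"id": (3, 4), "text": "am", "start_char": 8, "end_char": 10},
    {"id": 3, "text": "an", "upos": "ADP"},
    {"id": 4, "text": "dem", "upos": "DET"},
    {"id": 5, "text": "Bahnhof", "upos": "NOUN", "start_char": 11, "end_char": 18},
]


def words_only(sentence):
    """Keep entries whose id is a plain int, dropping multi-word token rows."""
    return [entry for entry in sentence if isinstance(entry["id"], int)]


words = words_only(sentence)
print([w["text"] for w in words])  # ['Ich', 'bin', 'an', 'dem', 'Bahnhof']
```

The surface form "am" disappears from the result and only its pieces "an" and "dem" remain, which is what you want if you only care about the analysis of the pieces.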
First, thanks for taking the time to provide such an elaborate answer. Many top tech companies are using Stanza and CoreNLP. German is no doubt a very challenging language. I am here to learn and give feedback :-)
CoreNLP
lemma of "am" => "be" <=??
==> VBP is unfortunately not correct.
From ChatGPT
The lemma of “am” would be “an” and “dem”.
UD Treebank
Correct!
an => ADP (Preposition)
spaCy
"an dem" => "an" is a preposition and "dem" is a determiner (article) in the dative form. Therefore ADP (Preposition) is correct for "am". |
Thanks for the tips on how to parse. Really helpful. |
A ha ha I accidentally used the English CoreNLP instead of German. Let me revise...
Okay, that's much better. It also splits |
So far, I only know a very limited number of languages. |
Stanza
I could be doing it wrong. I doubt I can get start_char and end_char for "an" and "dem".
|
True true. But what you can do is
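A minimal sketch of one option, assuming that a Stanza word exposes its enclosing token as `word.parent` and that the token carries `start_char` and `end_char`: fall back to the parent token's span whenever the word has no offsets of its own. `char_span` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins so the sketch runs without downloading a model.

```python
from types import SimpleNamespace


def char_span(word):
    """Return (start_char, end_char) for a word, using its parent token if needed."""
    start = getattr(word, "start_char", None)
    end = getattr(word, "end_char", None)
    if start is not None and end is not None:
        return start, end
    # Words split out of a multi-word token such as "am" share the token's span.
    token = word.parent
    return token.start_char, token.end_char


# Stand-ins for Stanza objects (assumption: real words expose .parent similarly).
am = SimpleNamespace(text="am", start_char=8, end_char=10)
an = SimpleNamespace(text="an", start_char=None, end_char=None, parent=am)
dem = SimpleNamespace(text="dem", start_char=None, end_char=None, parent=am)

for word in (an, dem):
    print(word.text, char_span(word))
```

Both words map back to the same span, since the surface text only contains "am".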
|
Thank you, valuable tip!!! |
What could cause the wrong POS for "miaut" in "Der Hund bellt, die Katze miaut."? Stanza does not tag "miaut" as a verb. I am curious how this could happen. |
It is a
As for the root cause, that word doesn't show up in the training data, so all it has to go on are the embeddings and the context of the sentence. Sometimes it will get such a thing wrong. |
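To check whether a form such as "miaut" occurs in a treebank at all, one can count it in the CoNLL-U file. This is a minimal sketch, assuming the standard 10-column tab-separated CoNLL-U layout with the word form in the second column and the UPOS tag in the fourth. The sample fragment is made up for illustration (a real check would read the actual training file), and `count_forms` is a hypothetical helper.

```python
from collections import Counter

# Made-up CoNLL-U fragment for illustration only.
SAMPLE = "\n".join([
    "# text = Der Hund bellt.",
    "1\tDer\tder\tDET\t_\t_\t2\tdet\t_\t_",
    "2\tHund\tHund\tNOUN\t_\t_\t3\tnsubj\t_\t_",
    "3\tbellt\tbellen\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No",
    "4\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\t_",
    "",
])


def count_forms(conllu_text):
    """Count (lowercased form, UPOS) pairs over ordinary word rows."""
    counts = Counter()
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        # Skip multi-word token ranges like "3-4" and empty nodes like "2.1".
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        counts[(cols[1].lower(), cols[3])] += 1
    return counts


counts = count_forms(SAMPLE)
miaut_total = sum(n for (form, _), n in counts.items() if form == "miaut")
print(miaut_total)                 # 0
print(counts[("bellt", "VERB")])   # 1
```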
What I have learned over the last few weeks is that one may need to dig deeper into the source and into how the training is done. Each approach seems to do better on some cases and worse on others. I see in many ways the merits of how Stanza is approaching the subject. |
Am
“Am” is a contraction of “an” and “dem”.
An dem
“An dem” is used when you want to keep “an” and “dem” separate for emphasis or clarity.
How does Stanza handle them?
The single token "am" gets an id that references TWO additional words: "an" and "dem".
It would be simpler to just parse an int coming back from word.id.
Now, instead of an int, the id is an array referencing the TWO additional words.
The challenges:
The parent token has start_char and end_char, but the other morphological features are now carried by the child words, e.g. "dem".
Question
I wonder how best to handle this when parsing.
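One possible way to handle this when parsing, sketched under the assumption that the dict representation lists the multi-word token (with a list or tuple id such as `(3, 4)` and the character offsets) before its words (with int ids and the morphological analysis): walk the sentence once and attach the parent's span and surface text to each child word. The sample data, including the `feats` string, is hand-written for illustration, and `attach_spans` is a hypothetical helper.

```python
def attach_spans(sentence):
    """Return word-level entries, each with start_char, end_char, and surface text."""
    words = []
    parent_by_word_id = {}
    for entry in sentence:
        entry_id = entry["id"]
        if isinstance(entry_id, (list, tuple)):
            # Multi-word token: remember it for every word id in its range.
            for word_id in range(entry_id[0], entry_id[-1] + 1):
                parent_by_word_id[word_id] = entry
            continue
        # Ordinary words act as their own parent.
        parent = parent_by_word_id.get(entry_id, entry)
        word = dict(entry)
        word["start_char"] = parent["start_char"]
        word["end_char"] = parent["end_char"]
        word["surface"] = parent["text"]
        words.append(word)
    return words


# Hand-written sample in the shape of one sentence from a dict export.
sentence = [
    {"id": 1, "text": "Ich", "upos": "PRON", "start_char": 0, "end_char": 3},
    {"id": 2, "text": "bin", "upos": "AUX", "start_char": 4, "end_char": 7},
    {"id": (3, 4), "text": "am", "start_char": 8, "end_char": 10},
    {"id": 3, "text": "an", "upos": "ADP"},
    {"id": 4, "text": "dem", "upos": "DET", "feats": "Case=Dat|PronType=Art"},
    {"id": 5, "text": "Bahnhof", "upos": "NOUN", "start_char": 11, "end_char": 18},
]

result = attach_spans(sentence)
for w in result:
    print(w["id"], w["text"], w["surface"], w["start_char"], w["end_char"])
```

With this shape every word has an int id, its own morphological features, and the character span of the text it came from, so downstream code does not need to special-case "am".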