Mistagging of homographic, sentence-initial verbs #7

Open

vizzerdrix55 opened this issue Dec 7, 2019 · 0 comments

As mentioned by Horbach et al. (2015, p. 44), sentence-initial verbs are frequent in CMC data and are often mistagged as nouns by standard tools. I checked the behavior of SoMeWeTa with the german_web_social_media_2018-12-21.model and noted that it does a very good job at recognizing these kinds of verbs.
The example provided in Horbach et al. (2015, p. 44) is in fact a tricky one:

Blicke da auch nicht so richtig durch und habe Probleme damit

Blicke is homographic with the German plural of 'der Blick', but in the example it is meant as the first person singular of the German verb 'blicken'.
In this case, SoMeWeTa also mistags it as a noun. This seems to hold for some of the rare cases of homographic sentence-initial verbs. (To be precise: they have to be homographic with a token of another part-of-speech subcategory.)

#!/usr/bin/env python
# coding: utf-8
from somajo import Tokenizer, SentenceSplitter
from someweta import ASPTagger

# ## Settings for SoMeWeTa (PoS-Tagger)
# TODO: update the path to your model here
model = "german_web_social_media_2018-12-21.model"
asptagger = ASPTagger()
asptagger.load(model)

# ## Settings for SoMaJo (Tokenizer)
tokenizer = Tokenizer(split_camel_case=False,
                      token_classes=False, extra_info=False)
sentence_splitter = SentenceSplitter(is_tuple=False)
eos_tags = set(["post"])

# generate PoS tags for a paragraph of text
def get_pos_tags(content):
    tokens = tokenizer.tokenize_paragraph(content)
    sentences = sentence_splitter.split_xml(tokens, eos_tags)
    tagged_sentences = []
    for sentence in sentences:
        tagged_sentences.append(asptagger.tag_xml_sentence(sentence))
    return tagged_sentences

# test sentences (introspectively generated German sentences)
sentences = ["Blicke da auch nicht durch.",
             "Check ich auch nicht.",
             "Schau mir das morgen an.",
             "Trank kurz den Tee fertig."]

for sentence in sentences:
    tagged_sentences = get_pos_tags(sentence)
    tagged_sentence = tagged_sentences[0]
    print(tagged_sentence)

If you run the above code, the output is:

  • incorrect for: [('Blicke', 'NN'), ('da', 'ADV'), ('auch', 'ADV'), ('nicht', 'PTKNEG'), ('durch', 'PTKVZ'), ('.', '$.')]
  • correct for: [('Check', 'VVFIN'), ('ich', 'PPER'), ('auch', 'ADV'), ('nicht', 'PTKNEG'), ('.', '$.')]
  • correct for: [('Schau', 'VVIMP'), ('mir', 'PPER'), ('das', 'ART'), ('morgen', 'NN'), ('an', 'PTKVZ'), ('.', '$.')]
  • correct for: [('Trank', 'VVFIN'), ('kurz', 'ADJD'), ('den', 'ART'), ('Tee', 'NN'), ('fertig', 'ADJD'), ('.', '$.')]

The homographs in my examples are the following nouns: 'der Check', 'die Schau' and 'der Trank'.
As you can see from the output above, only the example sentence from Horbach et al. is affected; all other test sentences are tagged correctly. I have not yet found a systematic pattern behind the failure. Since this is not among the documented errors of SoMeWeTa (Proisl, 2018, p. 667), I am reporting it as an issue.
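A possible lightweight workaround, independent of the tagger itself, would be to post-check sentence-initial tokens tagged NN against a lexicon of finite verb forms. This is only a minimal sketch: the toy lexicon and the helper name are my own assumptions, not part of SoMeWeTa; a real lexicon could be derived from a morphological analyzer or a wordlist.

```python
# Toy lexicon of lowercased finite verb forms (assumption, not from SoMeWeTa).
VERB_FORMS = {"blicke", "checke", "schaue", "trinke"}

def is_suspicious_initial_noun(tagged_sentence):
    """Return True if the sentence-initial token is tagged NN but its
    lowercased form is also a known finite verb form, i.e. it is a
    candidate for re-tagging as a verb."""
    token, tag = tagged_sentence[0]
    return tag == "NN" and token.lower() in VERB_FORMS

# The mistagged example from above:
tagged = [("Blicke", "NN"), ("da", "ADV"), ("auch", "ADV"),
          ("nicht", "PTKNEG"), ("durch", "PTKVZ"), (".", "$.")]
print(is_suspicious_initial_noun(tagged))  # True
```

Such a heuristic would of course also flag genuine sentence-initial plural nouns, so it could only mark candidates for inspection, not correct them automatically.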

Sources:

  • Horbach, Andrea / Thater, Stefan / Steffen, Diana / Fischer, Peter M. / Witt, Andreas / Pinkal, Manfred (2015). Internet Corpora: A Challenge for Linguistic Processing. In: Datenbank-Spektrum, 15(1), 41–47. https://doi.org/10.1007/s13222-014-0172-z
  • Proisl, Thomas (2018). SoMeWeTa: A Part-of-Speech Tagger for German Social Media and Web Texts. In: Proceedings of the 11th Language Resources and Evaluation Conference (LREC 2018), pp. 665–670. Miyazaki, Japan: European Language Resources Association (ELRA). Retrieved from https://www.aclweb.org/anthology/L18-1106
tsproisl added a commit that referenced this issue Dec 10, 2019
Reported-by: vizzerdrix55