
Part of Speech

Marcel Heinz edited this page Aug 1, 2018 · 4 revisions

We provide more insights on using part-of-speech patterns.

Seed Exploration Data

Issues

While applying hypernym extraction to the computer language domain, we noticed several cases where it becomes unreliable. These cases are too infrequent to have a noticeable impact on the results. They form a long tail of minor issues, and handling each one does not seem worthwhile: doing so would amount to implementing a separate "pattern" for individual examples.

  • A side effect of the rather restrictive pattern is that we can exclude articles whose subjects are classifiers. Although 'Assembly' is in the seed, it actually describes a class of languages, as the text suggests: 'An assembly (or assembler) language, often abbreviated asm, is a low-level programming language.' We therefore improved the pattern extraction process by recording which first sentences start with 'a' or 'an' and ignoring those articles in the classification.
  • Our simplified hypernym extraction fails when two subjects are classified in the same sentence, as in 'BlooP and FlooP are simple programming languages[...]'. It also fails when languages are described collectively, as in 'Microsoft Office XML formats'.
  • The literature states that an article's first sentence defines the entity. For a few articles, such as '.bss', this is not the case.
  • The trained English model for the Stanford parser cannot cope with names containing a '.' character, as in 'Visual Basic .NET (VB.NET) is a multi-paradigm, object-oriented programming language'. The first sentence is cut off after the first '.' character.
  • Various kinds of grammatical errors prevent NLP parsers from producing useful part-of-speech tags and word dependencies. A notable example is 'Albatross (programming language)', which starts with 'Albatrossis a general purpose programming language [...]', with no space after the name. In other cases, a word such as 'a' is missing, e.g. in 'Image markup is language that attaches annotations to image files.'.
  • In other cases, the sentence in DBpedia may not be properly extracted because the markup is complex, e.g., in the article on 'Maximal pair'.
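
Some of the cases above can be detected with simple string checks before a sentence reaches the parser. The following is a minimal sketch, not the project's actual implementation: all function names are hypothetical, and the checks are rough heuristics written against the example sentences quoted in the list. Only the first check (sentences starting with 'a' or 'an') corresponds to a step we describe above; the other two are assumed diagnostics that merely flag suspicious sentences.

```python
import re

# All names below are hypothetical helpers for illustration only.

ARTICLE_START = re.compile(r"^(a|an)\s", re.IGNORECASE)
COPULA = re.compile(r"\b(is|are)\b")


def starts_with_indefinite_article(sentence: str) -> bool:
    """True if the sentence begins with 'a' or 'an' (subject is likely a class)."""
    return ARTICLE_START.match(sentence.lstrip()) is not None


def has_coordinated_subject(sentence: str) -> bool:
    """True if the text before the first 'is'/'are' contains 'and' and the
    copula is 'are', e.g. 'BlooP and FlooP are ...'. A rough heuristic."""
    copula = COPULA.search(sentence)
    if copula is None:
        return False
    subject = sentence[:copula.start()]
    return copula.group(1) == "are" and re.search(r"\band\b", subject) is not None


def title_fused_with_text(title: str, sentence: str) -> bool:
    """True if the sentence starts with the article name directly followed by
    a letter, e.g. 'Albatrossis a ...'. Assumes a trailing parenthetical in
    the title is a disambiguation suffix and strips it first."""
    name = re.sub(r"\s*\(.*\)$", "", title)
    if not name or not sentence.startswith(name):
        return False
    rest = sentence[len(name):]
    return bool(rest) and rest[0].isalpha()
```

Applied to the examples above, `starts_with_indefinite_article` flags the 'Assembly' sentence, `has_coordinated_subject` flags the 'BlooP and FlooP' sentence, and `title_fused_with_text` flags the 'Albatross' sentence. The last check also fires when one article name is a prefix of another word, so it is only suitable for flagging candidates for manual review.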
