Part of Speech
Marcel Heinz edited this page Aug 1, 2018 · 4 revisions
We provide more insights on using part-of-speech patterns.
While applying hypernym extraction to the computer language domain, we noticed several situations in which the extraction becomes unreliable. These situations are too infrequent to have any impact. Many different minor issues exist, and handling each of them does not seem worthwhile: doing so would amount to implementing a "pattern" for single examples.
- A side effect of the quite restrictive pattern is that we can exclude articles whose subjects are classifiers. Although 'Assembly' is in the seed, it actually describes a class of languages, as the text suggests: `An assembly (or assembler) language, often abbreviated asm, is a low-level programming language.'. We therefore improved the pattern extraction process by recording which first sentences start with 'a' or 'an' and ignoring those articles in the classification.
- Our simplified hypernym extraction fails when two subjects are classified in the same sentence, as in 'BlooP and FlooP are simple programming languages[...]'. It also fails when languages are described collectively, as in `Microsoft Office XML formats'.
- The literature states that an article's first sentence defines the entity. There are a few articles, such as `.bss', where this is not the case.
- The trained English model for the Stanford parser cannot cope with names containing a '.' character, as in `Visual Basic .NET (VB.NET) is a multi-paradigm, object-oriented programming language'. The first sentence is cut off after the first '.' character.
- Various kinds of grammatical errors prevent NLP parsers from extracting useful part-of-speech tags and dependencies between words. One striking example is 'Albatross (programming language)', which starts with 'Albatrossis a general purpose programming language [...]', where there is no space after the name. In other cases, a word such as 'a' may be missing, e.g. in `Image markup is language that attaches annotations to image files.'.
- In other cases, the sentence in DBpedia may not be properly extracted because the markup is complex, e.g., in the article on 'Maximal pair'.
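
Several of the cases above can be flagged with simple surface checks before a sentence reaches the parser. The following is only a minimal sketch under stated assumptions: the function `flag_first_sentence`, the flag names, and the regular expressions are hypothetical illustrations and not the pattern implementation used in this project. It also does not use the Stanford parser or any part-of-speech tags.

```python
import re
from typing import List

# Hypothetical surface heuristics; the rules and flag names are assumptions
# made for illustration, not the project's actual pattern.
ARTICLE_START = re.compile(r"^(a|an)\s", re.IGNORECASE)
COPULA = re.compile(r"\b(is|are)\b")
DETERMINER = re.compile(r"(a|an|the)\b", re.IGNORECASE)


def flag_first_sentence(sentence: str) -> List[str]:
    """Return the names of all heuristics that the sentence triggers."""
    flags = []
    s = sentence.strip()

    # Sentences such as "An assembly language ... is ..." describe a class.
    if ARTICLE_START.match(s):
        flags.append("starts_with_article")

    m = COPULA.search(s)
    if m is None:
        # Covers truncated sentences and fused words such as "Albatrossis".
        flags.append("no_copula")
        return flags

    subject = s[:m.start()]
    # Two subjects classified together, e.g. "BlooP and FlooP are ...".
    if m.group(1) == "are" or re.search(r"\band\b", subject):
        flags.append("multiple_subjects")

    # "is" not followed by a determiner, e.g. "Image markup is language ...".
    rest = s[m.end():].lstrip()
    if m.group(1) == "is" and not DETERMINER.match(rest):
        flags.append("missing_article")

    return flags
```

For example, `flag_first_sentence("Visual Basic .")` reports `no_copula`, which is how a sentence cut off after the first '.' character would show up. The `missing_article` check is deliberately crude and would also flag valid sentences such as "X is one of ...", so the flags are better treated as hints for manual inspection than as hard filters.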