This repository has been archived by the owner on Aug 30, 2022. It is now read-only.

Greetings should not be extracted #31

Closed
ruKurz opened this issue Jul 24, 2017 · 8 comments

Comments

@ruKurz
Collaborator

ruKurz commented Jul 24, 2017

[Screenshot: bildschirmfoto 2017-07-24 um 09 48 39]

@tkurz
Member

tkurz commented Aug 7, 2017

It would be best to enable configurable stopword lists.

@tkurz tkurz added this to the beta-sprint milestone Aug 7, 2017
@westei
Member

westei commented Aug 24, 2017

Checked why Hi is considered a search term:

OpenNLP classifies Hi as a Person

i.r.nlp.opennlp.OpenNlpNerProcessor : - Named Entity [0,2 | prob: 0.136, tag: PER, type: pers] Hi

NOTE: the low prob: 0.136 says nothing about the real confidence, as the OpenNLP model used reports all probabilities in the range of roughly [0.11..0.18]

The classification itself makes sense, as there could just as well be a person's name at that position in the sentence (e.g. "Rüdiger, wie kann ich eine Nachricht fett machen?", i.e. "Rüdiger, how can I make a message bold?").

The Stanford NLP model does NOT classify Hi as a person.

ParseTree: (ROOT (S (NP (NE Hi)) ($, ,) (PWAV wie) (VMFIN kann) (PPER ich) (VP (NP (ART eine) (NN Nachricht) (NN Fett)) (VVINF machen)) ($. ?)))

Regarding stopwords: the StopwordExtractor does NOT mark nouns, adjectives and verbs as stopwords. So even if hi were in the stopword list (which it is not), the Token would not be marked as a stopword, as it is (incorrectly) classified as a noun (it should be an interjection) by StanfordNLP.

@westei
Member

westei commented Aug 24, 2017

Cause: wrong classification of Hi by the POS tagger

Suggested Solution:

One possibility would be a component that filters Tokens based on predefined rules, in the simplest case a stopword list. However, one could also imagine more complex filters (e.g. rule-based, regex-based, token-type-based or token-hint-based filters).
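
A minimal sketch of what such a rule-based filter component could look like in the simplest (stopword list) case; the names TokenFilter and StopwordTokenFilter and the method signature are illustrative assumptions, not the actual Smarti API:

```java
import java.util.Locale;
import java.util.Set;

// Illustrative only: a rule-based token filter in its simplest form.
// The names and signatures here are assumptions, not the real API.
interface TokenFilter {
    /** @return true if the token should be dropped from the extracted search terms */
    boolean filter(String tokenValue, String lang);
}

class StopwordTokenFilter implements TokenFilter {
    private final Set<String> stopwords; // entries stored in lower case

    StopwordTokenFilter(Set<String> stopwords) {
        this.stopwords = stopwords;
    }

    @Override
    public boolean filter(String tokenValue, String lang) {
        return stopwords.contains(tokenValue.toLowerCase(Locale.ROOT));
    }
}
```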

@westei
Member

westei commented Aug 25, 2017

Implemented a TokenFilter infrastructure, currently with a single TokenFilter implementation that checks Token#getValue() against a stopword list.

Stopword lists are language-specific, including a default list that is used for every language (in addition to the language-specific words).

Matching uses smart case sensitivity, meaning that all-uppercase words are matched case-sensitively and all others case-insensitively.
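
A sketch of how such smart case matching could work, assuming the case rule applies to the entries of the stopword list; this is illustrative only, not the actual implementation:

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Illustrative: stopword entries that are entirely upper case (e.g. "IT") only
// match exactly; all other entries match case-insensitively.
final class SmartCaseStopwords {
    private final Set<String> exact = new HashSet<>();           // all-uppercase entries
    private final Set<String> caseInsensitive = new HashSet<>(); // other entries, lower-cased

    SmartCaseStopwords(Iterable<String> stopwords) {
        for (String word : stopwords) {
            if (word.equals(word.toUpperCase(Locale.ROOT))) {
                exact.add(word);
            } else {
                caseInsensitive.add(word.toLowerCase(Locale.ROOT));
            }
        }
    }

    boolean isStopword(String token) {
        return exact.contains(token)
                || caseInsensitive.contains(token.toLowerCase(Locale.ROOT));
    }
}
```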

Configuration:

A default list for de is included. Additional lists can be configured using Spring Resources via the following properties (an example is shown after the parameter list):

 processor.token.stopword.{lang}={resource}
 processor.token.stopword.default={resource}

where:

  • lang: the 2-letter ISO language code (e.g. de for German)
  • resource: the resource to load the stopword list from. This uses the Spring ResourceLoader infrastructure (e.g. file:/path/stopwords-de.txt or file:stopwords.txt for files, http://myhost.com/resource/path/stopwords-en.txt for URLs)
  • default: loads the stopword list that is used for every language.
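
For example (the paths and host name below are made up; file:, http: and classpath: are standard Spring resource prefixes):

```
processor.token.stopword.de=file:/etc/smarti/stopwords-de.txt
processor.token.stopword.en=http://myhost.com/resource/path/stopwords-en.txt
processor.token.stopword.default=classpath:stopwords-default.txt
```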

File Format

The file format is very simple (an example follows the list):

  • one word per line (lines are trimmed)
  • empty lines and lines starting with # are ignored
  • encoding is expected to be UTF-8
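
A minimal example file in this format (the entries are only illustrative):

```
# greetings that should not become search terms
hi
hallo
servus
danke
```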

@westei
Member

westei commented Aug 25, 2017

TODOs: this currently only supports a single global configuration. Client-specific configuration first requires support for client-specific analysis pipelines.

westei added a commit that referenced this issue Aug 25, 2017
…gle stop word based implementation. See issue for detailed comments and documentation
westei added a commit that referenced this issue Aug 25, 2017
@tkurz tkurz modified the milestones: Gamma Sprint, beta-sprint Aug 28, 2017
@ruKurz
Collaborator Author

ruKurz commented Sep 4, 2017

@westei did you add your comment to the documentation?

westei added a commit that referenced this issue Sep 6, 2017
@westei
Member

westei commented Sep 6, 2017

Added to the Stopword Token Filter section of the Smarti Configuration documentation.

@ruKurz
Collaborator Author

ruKurz commented Sep 12, 2017

Before shipping this, we need to set up a test scenario. I'm working on it.

@ruKurz ruKurz closed this as completed Sep 15, 2017
@ruKurz ruKurz removed the in review label Sep 15, 2017
mrsimpson added a commit that referenced this issue Sep 5, 2018