This repository has been archived by the owner on Aug 30, 2022. It is now read-only.

Greetings should not be extracted #31

Closed
ruKurz opened this issue Jul 24, 2017 · 8 comments

Comments

@ruKurz
Collaborator

ruKurz commented Jul 24, 2017

[Screenshot: bildschirmfoto 2017-07-24 um 09 48 39]

@tkurz
Member

tkurz commented Aug 7, 2017

It would be best to enable configurable stopword lists.

@tkurz tkurz added this to the beta-sprint milestone Aug 7, 2017
@westei
Member

westei commented Aug 24, 2017

Checked why Hi is considered a search term:

OpenNLP classifies Hi as a Person

i.r.nlp.opennlp.OpenNlpNerProcessor : - Named Entity [0,2 | prob: 0.136, tag: PER, type: pers] Hi

NOTE: the low prob: 0.136 says nothing about the real confidence, as the OpenNLP model used reports all probabilities in the range of roughly [0.11..0.18]

The classification itself makes sense, as there could just as well be a person's name at that position in the sentence (e.g. "Rüdiger, wie kann ich eine Nachricht fett machen?", i.e. "Rüdiger, how can I make a message bold?").

The Stanford NLP model does NOT classify Hi as a person.

ParseTree: (ROOT (S (NP (NE Hi)) ($, ,) (PWAV wie) (VMFIN kann) (PPER ich) (VP (NP (ART eine) (NN Nachricht) (NN Fett)) (VVINF machen)) ($. ?)))

Regarding stopwords: the StopwordExtractor does NOT mark nouns, adjectives and verbs as stopwords. So even if hi were in the stopword list (which it is not), the Token would not be marked as a stopword, as it is (incorrectly) classified as a noun (it should be an interjection) by StanfordNLP.

@westei
Member

westei commented Aug 24, 2017

Cause: wrong classification of Hi by the POS tagger

Suggested Solution:

One possibility would be a component that filters Tokens based on predefined rules, in the simplest case a stopword list. However, one could also imagine more complex filters (e.g. rule-based, regex-based, token-type-based or token-hint-based filters).
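
A minimal sketch of what such a rule-based filter component could look like in the simplest (stopword list) case; the names TokenFilter and StopwordTokenFilter and the method signature are illustrative assumptions, not the actual Smarti API:

```java
import java.util.Locale;
import java.util.Set;

// Illustrative only: a rule-based token filter in its simplest form.
// The names and signatures here are assumptions, not the real API.
interface TokenFilter {
    /** @return true if the token should be dropped from the extracted search terms */
    boolean filter(String tokenValue, String lang);
}

class StopwordTokenFilter implements TokenFilter {
    private final Set<String> stopwords; // entries stored in lower case

    StopwordTokenFilter(Set<String> stopwords) {
        this.stopwords = stopwords;
    }

    @Override
    public boolean filter(String tokenValue, String lang) {
        return stopwords.contains(tokenValue.toLowerCase(Locale.ROOT));
    }
}
```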

@westei
Member

westei commented Aug 25, 2017

Implemented a TokenFilter infrastructure, currently with a single TokenFilter implementation that checks Token#getValue() against a stopword list.

Stopword lists are language-specific, including a default list that is used for every language (in addition to the language-specific words).

Matching uses smart case sensitivity, meaning that all-uppercase words are matched case-sensitively and all others case-insensitively.
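
A sketch of how such smart case matching could work, assuming the case rule applies to the entries of the stopword list; this is illustrative only, not the actual implementation:

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Illustrative: stopword entries that are entirely upper case (e.g. "IT") only
// match exactly; all other entries match case-insensitively.
final class SmartCaseStopwords {
    private final Set<String> exact = new HashSet<>();           // all-uppercase entries
    private final Set<String> caseInsensitive = new HashSet<>(); // other entries, lower-cased

    SmartCaseStopwords(Iterable<String> stopwords) {
        for (String word : stopwords) {
            if (word.equals(word.toUpperCase(Locale.ROOT))) {
                exact.add(word);
            } else {
                caseInsensitive.add(word.toLowerCase(Locale.ROOT));
            }
        }
    }

    boolean isStopword(String token) {
        return exact.contains(token)
                || caseInsensitive.contains(token.toLowerCase(Locale.ROOT));
    }
}
```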

Configuration:

A default list for de is included. Additional lists can be configured using Spring Resources via the following properties (an example is shown after the parameter list):

 processor.token.stopword.{lang}={resource}
 processor.token.stopword.default={resource}

where:

  • lang: the 2-letter ISO language code (e.g. de for German)
  • resource: the resource to load the stopword list from. This uses the Spring ResourceLoader infrastructure (e.g. file:/path/stopwords-de.txt or file:stopwords.txt for files, http://myhost.com/resource/path/stopwords-en.txt for URLs)
  • default: loads the stopword list that is used for every language.
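
For example (the paths and host name below are made up; file:, http: and classpath: are standard Spring resource prefixes):

```
processor.token.stopword.de=file:/etc/smarti/stopwords-de.txt
processor.token.stopword.en=http://myhost.com/resource/path/stopwords-en.txt
processor.token.stopword.default=classpath:stopwords-default.txt
```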

File Format

The file format is very simple (an example follows the list):

  • one word per line (lines are trimmed)
  • empty lines and lines starting with # are ignored
  • encoding is expected to be UTF-8
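
A minimal example file in this format (the entries are only illustrative):

```
# greetings that should not become search terms
hi
hallo
servus
danke
```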

@westei
Member

westei commented Aug 25, 2017

TODOs: this currently only supports a single global configuration. Client-specific configuration first requires support for client-specific analysis pipelines.

westei added a commit that referenced this issue Aug 25, 2017
…gle stop word based implementation. See issue for detailed comments and documentation
westei added a commit that referenced this issue Aug 25, 2017
@tkurz tkurz modified the milestones: Gamma Sprint, beta-sprint Aug 28, 2017
@ruKurz
Collaborator Author

ruKurz commented Sep 4, 2017

@westei did you add your comment to the documentation?

westei added a commit that referenced this issue Sep 6, 2017
@westei
Member

westei commented Sep 6, 2017

Added to the Stopword Token Filter section of the Smarti Configuration documentation.

@ruKurz
Collaborator Author

ruKurz commented Sep 12, 2017

Before shipping this, we need to set up a test scenario. I'm working on it.

@ruKurz ruKurz closed this as completed Sep 15, 2017
@ruKurz ruKurz removed the in review label Sep 15, 2017
mrsimpson added a commit that referenced this issue Sep 5, 2018