Greetings should not be extracted #31
Comments
It would be best to make the stopword lists configurable. |
Checked why OpenNLP classifies
NOTE: the low what makes sense, as there could just as well be a Person at that position in the sentence (e.g. The Stanford NLP model does NOT classify ParseTree: Regarding Stopwords: the |
Cause: wrong classification of
Suggested Solution: One possibility would be a component that filters tokens based on predefined rules, in the simplest case a stopword list. However, one could also imagine more complex filters (e.g. rule-based, regex-based, token-type-based, token-hint-based ...). |
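As a rough illustration of the suggested component, the following is a minimal sketch of a token filter built from pluggable rules. The class name `TokenFilterSketch`, its method names, and the example words are all hypothetical and are not taken from the actual implementation; tokens are assumed to be plain strings.

```java
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.Predicate;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical sketch: none of these names come from the actual implementation.
final class TokenFilterSketch {

    // Rule that rejects tokens contained in a lower-cased stopword list.
    static Predicate<String> stopwords(Set<String> words) {
        return token -> words.contains(token.toLowerCase(Locale.ROOT));
    }

    // Rule that rejects tokens that fully match a regular expression.
    static Predicate<String> regex(String pattern) {
        Pattern compiled = Pattern.compile(pattern);
        return token -> compiled.matcher(token).matches();
    }

    // Keeps only the tokens that none of the rules reject.
    static List<String> filter(List<String> tokens, List<Predicate<String>> rejectRules) {
        return tokens.stream()
                .filter(token -> rejectRules.stream().noneMatch(rule -> rule.test(token)))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Predicate<String>> rules = List.of(
                stopwords(Set.of("hello", "hi")), // illustrative greeting words
                regex("\\d+"));                   // illustrative rule: drop pure numbers
        System.out.println(filter(List.of("Hello", "Anna", "42", "Berlin"), rules));
    }
}
```

Modelling each rule as a `Predicate<String>` keeps the stopword list as the simplest case while leaving room for token-type or hint-based rules later.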
Implemented a
Stopword lists are language-specific, with a default list that is used for any language (in addition to the language-specific words).
Matching uses smart case sensitivity: all-upper-case words are matched case-sensitively and all other words case-insensitively.
Configuration: A default list for
where:
File Format: The file format is really simple
|
TODOs: This currently supports only a single global configuration. Client-specific configuration first requires support for client-specific analysis pipelines. |
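The smart case matching described in this comment can be sketched roughly as follows. This is not the actual implementation: the class `SmartCaseStopwords` and its methods are hypothetical, and the sketch assumes that "upper case words" refers to the entries of the stopword list and that the default and language-specific lists are simply merged.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch of smart case stopword matching; names are illustrative only.
final class SmartCaseStopwords {
    private final Set<String> caseSensitive = new HashSet<>();   // all-upper-case entries, stored as-is
    private final Set<String> caseInsensitive = new HashSet<>(); // all other entries, stored lower-cased

    // Merges the default list with the list for one language.
    SmartCaseStopwords(List<String> defaultWords, List<String> languageWords) {
        defaultWords.forEach(this::add);
        languageWords.forEach(this::add);
    }

    private void add(String word) {
        if (isAllUpperCase(word)) {
            caseSensitive.add(word);
        } else {
            caseInsensitive.add(word.toLowerCase(Locale.ROOT));
        }
    }

    // True if the word contains cased letters and none of them is lower case.
    static boolean isAllUpperCase(String word) {
        return word.equals(word.toUpperCase(Locale.ROOT))
                && !word.equals(word.toLowerCase(Locale.ROOT));
    }

    // Upper-case entries must match exactly; all other entries match ignoring case.
    boolean isStopword(String token) {
        return caseSensitive.contains(token)
                || caseInsensitive.contains(token.toLowerCase(Locale.ROOT));
    }
}
```

With a default list containing `hello` and a language list containing `IT`, the tokens `Hello`, `HELLO` and `IT` would be treated as stopwords, while `it` would not.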
…gle stop word based implementation. See issue for detailed comments and documentation
@westei did you add your comment to the documentation? |
Added to the Stopword Token Filter section of the Smarti Configuration documentation. |
Before shipping this, we need to set up a test scenario. I'm working on it. |