
BreakIteratorSegmenter: New parameter for punctuation marks #16

Closed
GoogleCodeExporter opened this issue Apr 24, 2015 · 9 comments

Comments

@GoogleCodeExporter

I would appreciate a new boolean parameter in BreakIteratorSegmenter which
controls whether punctuation marks are emitted as tokens or not.
(If available, see Bug 851 in DKPro Semantics.)

Thanks in advance,
Marko

Original issue reported on code.google.com by black-c...@web.de on 19 May 2011 at 4:22

@GoogleCodeExporter

A patch for this issue.

Original comment by black-c...@web.de on 13 Jun 2011 at 6:41

Attachments:

@GoogleCodeExporter

I have had a look at the patch, and it seems to work for German and English, but
I am not quite sure what the side effects would be for other languages.
Additionally, opinions about what should be considered punctuation may differ.

Thus, I would rather suggest implementing this functionality as a subsequent
filtering step in which all annotations (the annotation type being a parameter)
that match some pattern (another parameter) are removed.
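Such a filtering step could look roughly like the sketch below. This is purely illustrative plain Java (no UIMA types); the class name PatternTokenFilter and its signature are invented for the example, with token strings standing in for annotations of a configurable type:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Illustrative sketch only: tokens are plain strings here, standing in for
// annotations of a configurable type; the regex stands in for the pattern
// parameter of the proposed filter.
public class PatternTokenFilter {

    // Returns all tokens that do NOT match the removal pattern.
    public static List<String> filter(List<String> tokens, String regex) {
        Pattern pattern = Pattern.compile(regex);
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!pattern.matcher(token).matches()) {
                kept.add(token);
            }
        }
        return kept;
    }
}
```

With a pattern such as `\p{Punct}+`, punctuation-only tokens like "," and "!" would be dropped while ordinary words are kept.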

Original comment by torsten....@gmail.com on 29 Sep 2011 at 10:09

@GoogleCodeExporter

I find it a bit strange that the patch looks at the last character of a token
to decide whether the token should be removed.

In terms of alternatives, we have a TokenFilter in the tokit module that
currently filters out tokens that are too long. That, however, does not handle
attached POS or Lemma annotations.
There is also the StopWordRemover, which is dictionary-based and configurable
with respect to types.

Maybe it would be useful to merge all of that into a single AnnotationFilter
which can do regexes or dictionaries - or maybe even dictionaries of regexes?
;) For the sake of speed, separate parameters for minimum and maximum length
might also be useful.
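A merged filter along these lines might combine the three criteria behind one interface. The sketch below is hypothetical (plain Java, no UIMA; the class name and constructor are invented) and only illustrates how regex, dictionary, and length parameters could work together:

```java
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical sketch of the merged AnnotationFilter idea: one component
// that drops tokens by regex, by dictionary lookup, or by length bounds.
public class AnnotationFilterSketch {
    private final Pattern pattern;        // may be null (no regex filtering)
    private final Set<String> dictionary; // may be null (no dictionary filtering)
    private final int minLength;
    private final int maxLength;

    public AnnotationFilterSketch(String regex, Set<String> dictionary,
                                  int minLength, int maxLength) {
        this.pattern = regex == null ? null : Pattern.compile(regex);
        this.dictionary = dictionary;
        this.minLength = minLength;
        this.maxLength = maxLength;
    }

    // Length checks come first, as they are the cheapest.
    private boolean remove(String token) {
        if (token.length() < minLength || token.length() > maxLength) {
            return true;
        }
        if (dictionary != null && dictionary.contains(token)) {
            return true;
        }
        return pattern != null && pattern.matcher(token).matches();
    }

    public List<String> apply(List<String> tokens) {
        return tokens.stream().filter(t -> !remove(t)).collect(Collectors.toList());
    }
}
```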

Original comment by richard.eckart on 29 Sep 2011 at 10:19

@GoogleCodeExporter

> I find it a bit strange that the patch looks at the last character of a token
> to decide if a token should be removed or not.

Indeed.

> Maybe it would be useful to merge all of that into a single AnnotationFilter
> which can do regexes or dictionaries - or maybe even dictionaries of regexes? ;)
> For sake of speed separate parameters for max length and min length might
> also be useful.

Sounds good. Do we aim for the whole thing (lists of dictionaries of regexes ;))
or rather start small?

Original comment by torsten....@gmail.com on 29 Sep 2011 at 10:31

@GoogleCodeExporter

Changing from the dictionary-based approach to dictionaries of regexes should
require little more than a parameter (PARAM_REGEX = true) and switching from
equals() to matches().
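That switch could be as small as the following hypothetical illustration suggests. PARAM_REGEX is modeled as a plain boolean, and the class name and method are invented for the example:

```java
import java.util.Set;

// Hypothetical illustration of the PARAM_REGEX idea: the same dictionary
// either matches literally (equals) or as regular expressions (matches).
public class DictionaryMatcher {
    private final Set<String> entries;
    private final boolean regexMode; // stands in for the suggested PARAM_REGEX

    public DictionaryMatcher(Set<String> entries, boolean regexMode) {
        this.entries = entries;
        this.regexMode = regexMode;
    }

    public boolean matches(String token) {
        for (String entry : entries) {
            if (regexMode ? token.matches(entry) : token.equals(entry)) {
                return true;
            }
        }
        return false;
    }
}
```

In literal mode the entry "[0-9]+" only matches the string "[0-9]+" itself; in regex mode it matches any run of digits.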

What's bugging me more is the question of how to deal with POS, Lemma, Stem,
and so on. Traditionally these are merely co-located, and there was no link in
Token referring to them - but that is awkward for programming and bad for
performance. Recently, the Token type gained explicit fields referring to POS
and Lemma (I am not sure about Stem).

It would be possible to override the removeFromIndexes() method in Token so
that it automatically also removes the associated POS, Lemma, and Stem
annotations - but this would only work for JCas. The other alternative is the
mechanism used in the StopWordRemover, which works but is much slower and needs
more configuration effort. Maybe a combination of both would be good: CAS-based
AEs could use the configurable method (which could become a convenience method
in uimaFIT's CasUtil), while JCas-based AEs could call Token.removeFromIndexes()
and have Lemma, POS, and Stem removed automatically.
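The cascading idea can be illustrated with a toy model. Everything below is invented for the example: Index stands in for the CAS index, and this Token is a stand-in for the real type, not the actual DKPro/UIMA implementation:

```java
import java.util.HashSet;
import java.util.Set;

// Toy model of the cascading-removal idea: removing a Token also removes
// the POS and Lemma annotations it refers to. Not real UIMA code.
public class CascadingRemoval {

    static class Index {
        final Set<String> entries = new HashSet<>();
    }

    static class Token {
        final String text, pos, lemma;

        Token(String text, String pos, String lemma) {
            this.text = text;
            this.pos = pos;
            this.lemma = lemma;
        }

        void addToIndexes(Index index) {
            index.entries.add(text);
            index.entries.add(pos);
            index.entries.add(lemma);
        }

        // The suggested override: removal cascades to POS and Lemma.
        void removeFromIndexes(Index index) {
            index.entries.remove(text);
            index.entries.remove(pos);
            index.entries.remove(lemma);
        }
    }
}
```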

What do you think?

Original comment by richard.eckart on 29 Sep 2011 at 11:03

@GoogleCodeExporter

As far as I can see, that would make the component quite dependent on our type
system, right?
This is not a problem in general, but I would prefer a type-system-agnostic
component, and perhaps additionally a more specialized one.

Original comment by torsten....@gmail.com on 29 Sep 2011 at 12:25

@GoogleCodeExporter

What do you mean by "more specialized"?

Original comment by richard.eckart on 29 Sep 2011 at 6:50

@GoogleCodeExporter

Original comment by richard.eckart on 8 Feb 2012 at 10:51

  • Added labels: Milestone-1.4.0

@GoogleCodeExporter

Looks like this issue has been superseded by issue 14 (rename and enhance
tokenfilter). We won't change the BreakIteratorSegmenter, because that would
imply we also need to change all other tokenizers that we have now or may have
in the future.

Original comment by richard.eckart on 7 Jun 2012 at 3:14

  • Changed state: WontFix
  • Added labels: Type-Enhancement
  • Removed labels: Milestone-1.4.0, Type-Defect
