
BreakIteratorSegmenter: New parameter for punctuation marks #16

Closed
GoogleCodeExporter opened this issue Apr 24, 2015 · 9 comments

Comments

@GoogleCodeExporter

I would appreciate a new boolean parameter in BreakIteratorSegmenter which
controls whether punctuation marks are emitted as tokens or not.
(If available, see Bug 851 in DKPro Semantics.)

Thanks in advance,
Marko

Original issue reported on code.google.com by black-c...@web.de on 19 May 2011 at 4:22

@GoogleCodeExporter

A patch for this issue.

Original comment by black-c...@web.de on 13 Jun 2011 at 6:41

Attachments:

@GoogleCodeExporter

I have had a look at the patch, and it seems to work for German and English, but
I am not quite sure what the side effects would be for other languages.
Additionally, opinions about what should be considered punctuation may differ.

Thus, I would rather suggest implementing this functionality as a subsequent
filtering step in which all annotations (the annotation type being a parameter)
that match some pattern (another parameter) are removed.
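Such a filtering step could look roughly like the sketch below. This is purely illustrative plain Java (no UIMA types); the class name PatternTokenFilter and its signature are invented for the example, with token strings standing in for annotations of a configurable type:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Illustrative sketch only: tokens are plain strings here, standing in for
// annotations of a configurable type; the regex stands in for the pattern
// parameter of the proposed filter.
public class PatternTokenFilter {

    // Returns all tokens that do NOT match the removal pattern.
    public static List<String> filter(List<String> tokens, String regex) {
        Pattern pattern = Pattern.compile(regex);
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!pattern.matcher(token).matches()) {
                kept.add(token);
            }
        }
        return kept;
    }
}
```

With a pattern such as `\p{Punct}+`, punctuation-only tokens like "," and "!" would be dropped while ordinary words are kept.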

Original comment by torsten....@gmail.com on 29 Sep 2011 at 10:09

@GoogleCodeExporter

I find it a bit strange that the patch looks at the last character of a token
to decide whether the token should be removed.

In terms of alternatives, we have a TokenFilter in the tokit module that
currently filters out tokens that are too long. That, however, does not handle
attached POS or Lemma annotations.
There is also the StopWordRemover, which is dictionary-based and configurable
with respect to types.

Maybe it would be useful to merge all of that into a single AnnotationFilter
which can do regexes or dictionaries - or maybe even dictionaries of regexes?
;) For the sake of speed, separate parameters for minimum and maximum length
might also be useful.
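A merged filter along these lines might combine the three criteria behind one interface. The sketch below is hypothetical (plain Java, no UIMA; the class name and constructor are invented) and only illustrates how regex, dictionary, and length parameters could work together:

```java
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical sketch of the merged AnnotationFilter idea: one component
// that drops tokens by regex, by dictionary lookup, or by length bounds.
public class AnnotationFilterSketch {
    private final Pattern pattern;        // may be null (no regex filtering)
    private final Set<String> dictionary; // may be null (no dictionary filtering)
    private final int minLength;
    private final int maxLength;

    public AnnotationFilterSketch(String regex, Set<String> dictionary,
                                  int minLength, int maxLength) {
        this.pattern = regex == null ? null : Pattern.compile(regex);
        this.dictionary = dictionary;
        this.minLength = minLength;
        this.maxLength = maxLength;
    }

    // Length checks come first, as they are the cheapest.
    private boolean remove(String token) {
        if (token.length() < minLength || token.length() > maxLength) {
            return true;
        }
        if (dictionary != null && dictionary.contains(token)) {
            return true;
        }
        return pattern != null && pattern.matcher(token).matches();
    }

    public List<String> apply(List<String> tokens) {
        return tokens.stream().filter(t -> !remove(t)).collect(Collectors.toList());
    }
}
```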

Original comment by richard.eckart on 29 Sep 2011 at 10:19

@GoogleCodeExporter

> I find it a bit strange that the patch looks at the last character of a token
> to decide if a token should be removed or not.

Indeed.

> Maybe it would be useful to merge all of that into a single AnnotationFilter
> which can do regexes or dictionaries - or maybe even dictionaries of regexes? ;)
> For sake of speed separate parameters for max length and min length might
> also be useful.

Sounds good. Do we aim for the whole thing (lists of dictionaries of regexes ;))
or rather start small?

Original comment by torsten....@gmail.com on 29 Sep 2011 at 10:31

@GoogleCodeExporter

Changing from the dictionary-based approach to dictionaries of regexes should
require little more than a parameter (PARAM_REGEX = true) and switching from
equals() to matches().
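That switch could be as small as the following hypothetical illustration suggests. PARAM_REGEX is modeled as a plain boolean, and the class name and method are invented for the example:

```java
import java.util.Set;

// Hypothetical illustration of the PARAM_REGEX idea: the same dictionary
// either matches literally (equals) or as regular expressions (matches).
public class DictionaryMatcher {
    private final Set<String> entries;
    private final boolean regexMode; // stands in for the suggested PARAM_REGEX

    public DictionaryMatcher(Set<String> entries, boolean regexMode) {
        this.entries = entries;
        this.regexMode = regexMode;
    }

    public boolean matches(String token) {
        for (String entry : entries) {
            if (regexMode ? token.matches(entry) : token.equals(entry)) {
                return true;
            }
        }
        return false;
    }
}
```

In literal mode the entry "[0-9]+" only matches the string "[0-9]+" itself; in regex mode it matches any run of digits.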

What's bugging me more is the question of how to deal with POS, Lemma, Stem,
and so on. Traditionally these are merely co-located, and there was no link in
Token referring to them - but that is awkward for programming and bad for
performance. Recently, the Token type gained explicit fields referring to POS
and Lemma (I am not sure about Stem).

It would be possible to override the removeFromIndexes() method in Token so
that it automatically also removes the associated POS, Lemma, and Stem
annotations - but this would only work for JCas. The other alternative is the
mechanism used in the StopWordRemover, which works but is much slower and needs
more configuration effort. Maybe a combination of both would be good: CAS-based
AEs could use the configurable method (which could become a convenience method
in uimaFIT's CasUtil), while JCas-based AEs could call Token.removeFromIndexes()
and have Lemma, POS, and Stem removed automatically.
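The cascading idea can be illustrated with a toy model. Everything below is invented for the example: Index stands in for the CAS index, and this Token is a stand-in for the real type, not the actual DKPro/UIMA implementation:

```java
import java.util.HashSet;
import java.util.Set;

// Toy model of the cascading-removal idea: removing a Token also removes
// the POS and Lemma annotations it refers to. Not real UIMA code.
public class CascadingRemoval {

    static class Index {
        final Set<String> entries = new HashSet<>();
    }

    static class Token {
        final String text, pos, lemma;

        Token(String text, String pos, String lemma) {
            this.text = text;
            this.pos = pos;
            this.lemma = lemma;
        }

        void addToIndexes(Index index) {
            index.entries.add(text);
            index.entries.add(pos);
            index.entries.add(lemma);
        }

        // The suggested override: removal cascades to POS and Lemma.
        void removeFromIndexes(Index index) {
            index.entries.remove(text);
            index.entries.remove(pos);
            index.entries.remove(lemma);
        }
    }
}
```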

What do you think?

Original comment by richard.eckart on 29 Sep 2011 at 11:03

@GoogleCodeExporter

As far as I can see, that would make the component quite dependent on our type
system, right?
This is not a problem in general, but I would prefer a type-system-agnostic
component, and perhaps additionally a more specialized one.

Original comment by torsten....@gmail.com on 29 Sep 2011 at 12:25

@GoogleCodeExporter

What do you mean by "more specialized"?

Original comment by richard.eckart on 29 Sep 2011 at 6:50

@GoogleCodeExporter

Original comment by richard.eckart on 8 Feb 2012 at 10:51

  • Added labels: Milestone-1.4.0

@GoogleCodeExporter

Looks like this issue has been superseded by issue 14 (rename and enhance
tokenfilter). We won't change the BreakIteratorSegmenter, because that would
imply we also need to change all other tokenizers that we have now or may have
in the future.

Original comment by richard.eckart on 7 Jun 2012 at 3:14

  • Changed state: WontFix
  • Added labels: Type-Enhancement
  • Removed labels: Milestone-1.4.0, Type-Defect
