"Default" word boundaries for Unicode data? #51504

RegEx4All · 2009-11-03T04:22:02Z

BPO	7255
Nosy	@loewis, @amauryfa, @ezio-melotti

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

assignee = None
closed_at = <Date 2014-11-24.16:16:58.489>
created_at = <Date 2009-11-03.04:22:01.699>
labels = ['expert-regex', 'type-feature']
title = '"Default" word boundaries for Unicode data?'
updated_at = <Date 2014-11-24.16:16:58.481>
user = 'https://bugs.python.org/RegEx4All'

bugs.python.org fields:

activity = <Date 2014-11-24.16:16:58.481>
actor = 'amaury.forgeotdarc'
assignee = 'none'
closed = True
closed_date = <Date 2014-11-24.16:16:58.489>
closer = 'amaury.forgeotdarc'
components = ['Regular Expressions']
creation = <Date 2009-11-03.04:22:01.699>
creator = 'RegEx4All'
dependencies = []
files = []
hgrepos = []
issue_num = 7255
keywords = []
message_count = 6.0
messages = ['94856', '94857', '113928', '113979', '113993', '231607']
nosy_count = 5.0
nosy_names = ['loewis', 'amaury.forgeotdarc', 'ezio.melotti', 'mrabarnett', 'RegEx4All']
pr_nums = []
priority = 'normal'
resolution = 'works for me'
stage = None
status = 'closed'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue7255'
versions = ['Python 2.7', 'Python 3.2']

RegEx4All · 2009-11-03T04:22:00Z

Regarding UTS #18 (Unicode Standards for RegEx Engines), which can be
found at:
http://www.unicode.org/reports/tr18/

Is there a plan or commitment for Python to implement at least "default
word boundaries" (a Level 2 feature), rather than the current "simple
word boundaries"? I don't believe that the algorithm for this is a
whole lot more complicated, but it certainly makes a huge difference for
processing non-Roman text.

For example, to match the whole word રત without matching the word રતા
(which has an additional vowel at the end, the vertical line), with
"default word boundary" recognition, you could use the pattern \bરત\b.
With Python's current "simple word boundary" recognition, however, the
\b assertion is pretty much useless here, and I have yet to see a decent
zero-width pattern that can take its place.
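The behaviour described above can be reproduced with the standard `re` module in Python 3. The sketch below assumes the cause is that the vowel sign (U+0ABE, general category Mc) does not count as a `\w` character, so `\b` sees a boundary in the middle of રતા. The helpers `is_word_like` and `find_whole_word` are hypothetical names for a rough workaround that treats combining marks as part of a word; they are not the UTS #18 default word boundary algorithm.

```python
import re
import unicodedata

WORD = "\u0ab0\u0aa4"          # રત
LONGER = "\u0ab0\u0aa4\u0abe"  # રતા (same letters plus vowel sign AA)


def simple_boundary_match(word, text):
    """Return True if re finds `word` delimited by \\b in `text`."""
    return re.search(r"\b" + re.escape(word) + r"\b", text) is not None


def is_word_like(ch):
    """Hypothetical helper: alphanumerics, underscore, and any mark (M*)."""
    return ch == "_" or ch.isalnum() or unicodedata.category(ch).startswith("M")


def find_whole_word(word, text):
    """Hypothetical workaround: start offsets where `word` is not glued
    to a neighbouring word-like character (including combining marks)."""
    starts = []
    for m in re.finditer(re.escape(word), text):
        before = text[m.start() - 1] if m.start() > 0 else ""
        after = text[m.end()] if m.end() < len(text) else ""
        if before and is_word_like(before):
            continue
        if after and is_word_like(after):
            continue
        starts.append(m.start())
    return starts


if __name__ == "__main__":
    # The simple \b boundary reports a match inside the longer word.
    print(simple_boundary_match(WORD, LONGER))
    # The neighbour check only accepts the standalone occurrence.
    print(find_whole_word(WORD, WORD + " " + LONGER))
```

This checks neighbours after matching rather than supplying a true zero-width assertion, so it is only a stopgap for simple whole-word searches.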

BTW, the ICU regex libraries do provide this level of Unicode support:
http://userguide.icu-project.org/strings/regexp
It seems to work perfectly on Indic text, based on the tests I've done.

Being open-source, it may be a helpful reference for the algorithm needed.

Dan

loewis · 2009-11-03T06:12:46Z

> Is there a plan or commitment for Python to implement at least "default
> word boundaries" (a Level 2 feature), rather than the current "simple
> word boundaries"?

No such plan exists at this time. Contributions are welcome.

mrabarnett · 2010-08-14T20:27:37Z

These have been added to the new 'regex' module. See issue bpo-2636 or PyPI at:

http://pypi.python.org/pypi/regex
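As a rough illustration of how this might be used: the third-party `regex` module documents a `WORD` flag (inline form `(?w)`) that switches `\b` and `\B` to default Unicode word boundaries. The sketch below assumes the module is installed and returns `None` when it is not; the function name `default_boundary_match` is made up for this example.

```python
try:
    import regex  # third-party package, not part of the standard library
except ImportError:
    regex = None


def default_boundary_match(word, text):
    """Return True/False for a \\b-delimited whole-word search using
    default Unicode word boundaries, or None if `regex` is unavailable."""
    if regex is None:
        return None
    pattern = r"\b" + regex.escape(word) + r"\b"
    # regex.WORD asks for default Unicode word boundaries for \b and \B.
    return regex.search(pattern, text, flags=regex.WORD) is not None


if __name__ == "__main__":
    word = "\u0ab0\u0aa4"          # રત
    longer = "\u0ab0\u0aa4\u0abe"  # રતા
    print(default_boundary_match(word, word))
    print(default_boundary_match(word, longer))
```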

RegEx4All · 2010-08-15T17:51:07Z

Woo-HOOO! Am very excited to hear this! Thanks, Matthew! This and also the related \w \W handling (bpo-1693050) should be extremely useful for processing Indic text. I'm a Python newbie, so I will need to find some help on how to compile, install, and use this source-file download, but if I can figure that out, I'd be very happy to test this against texts in a variety of Indic scripts. Way to go!

mrabarnett · 2010-08-15T18:47:44Z

If you're on Windows (x86, 32-bit) then compilation isn't necessary - just use the appropriate _regex.pyd.

amauryfa · 2014-11-24T16:16:58Z

Closing this old issue: either use the 'regex' module, or wait for bpo-2636.

RegEx4All mannequin added the topic-regex and type-feature labels Nov 3, 2009

amauryfa closed this as completed Nov 24, 2014

ezio-melotti transferred this issue from another repository Apr 10, 2022
