Unicode word boundries #72065

Closed
revo mannequin opened this issue Aug 27, 2016 · 2 comments
Labels
topic-regex type-bug An unexpected behavior, bug, or error

Comments

@revo
Mannequin

revo mannequin commented Aug 27, 2016

BPO 27878
Nosy @ezio-melotti

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:

assignee = None
closed_at = <Date 2016-08-27.14:56:48.761>
created_at = <Date 2016-08-27.14:36:09.938>
labels = ['expert-regex', 'type-bug', 'invalid']
title = 'Unicode word boundries'
updated_at = <Date 2016-08-27.14:56:48.749>
user = 'https://bugs.python.org/revo'

bugs.python.org fields:

activity = <Date 2016-08-27.14:56:48.749>
actor = 'SilentGhost'
assignee = 'none'
closed = True
closed_date = <Date 2016-08-27.14:56:48.761>
closer = 'SilentGhost'
components = ['Regular Expressions']
creation = <Date 2016-08-27.14:36:09.938>
creator = 'revo'
dependencies = []
files = []
hgrepos = []
issue_num = 27878
keywords = []
message_count = 2.0
messages = ['273782', '273783']
nosy_count = 4.0
nosy_names = ['ezio.melotti', 'mrabarnett', 'SilentGhost', 'revo']
pr_nums = []
priority = 'normal'
resolution = 'not a bug'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue27878'
versions = []

@revo
Mannequin Author

revo mannequin commented Aug 27, 2016

According to UAX #29 (Unicode word boundaries), rule WB5a, an apostrophe includes both U+0027 ( ' ) APOSTROPHE and U+2019 ( ’ ) RIGHT SINGLE QUOTATION MARK (the curly apostrophe).

However, the regex module only checks for U+0027; the second kind (U+2019) is missing:

/* Break between apostrophe and vowels (French, Italian). */
/* WB5a */
if (pos_m1 >= 0 && char_at(state->text, pos_m1) == '\'' &&
  is_unicode_vowel(char_at(state->text, text_pos)))
    return TRUE;

Source code
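
For illustration, here is a minimal sketch of what a check accepting both apostrophe forms might look like. It is not the regex module's actual code: `is_wb5a_apostrophe`, `is_ascii_vowel`, and `wb5a_break_before` are hypothetical names, the text is assumed to be a plain array of code points, and the vowel test is a simplified ASCII-only stand-in for the module's `is_unicode_vowel`.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true for either apostrophe form named in the report. */
static bool is_wb5a_apostrophe(uint32_t ch) {
    return ch == 0x0027 || ch == 0x2019;
}

/* Simplified stand-in for is_unicode_vowel(): ASCII vowels only (assumption). */
static bool is_ascii_vowel(uint32_t ch) {
    switch (ch) {
    case 'a': case 'e': case 'i': case 'o': case 'u':
    case 'A': case 'E': case 'I': case 'O': case 'U':
        return true;
    default:
        return false;
    }
}

/* Sketch of the WB5a test over an array of code points.
 * Returns true if the rule puts a break immediately before text[pos]. */
static bool wb5a_break_before(const uint32_t *text, size_t len, size_t pos) {
    if (pos == 0 || pos >= len)
        return false;
    return is_wb5a_apostrophe(text[pos - 1]) && is_ascii_vowel(text[pos]);
}
```

With the word l’avion stored as code points, this sketch reports a break before the "a" whether the apostrophe is U+0027 or U+2019, which is the behavior the report asks for.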

@revo revo mannequin added topic-regex type-bug An unexpected behavior, bug, or error labels Aug 27, 2016
@SilentGhost
Mannequin

SilentGhost mannequin commented Aug 27, 2016

The regex module is not in the standard library; on the latest 3.6 branch, the re module breaks on the curly apostrophe just fine. Perhaps try reporting this issue on the Bitbucket tracker?

@SilentGhost SilentGhost mannequin closed this as completed Aug 27, 2016
@SilentGhost SilentGhost mannequin added the invalid label Aug 27, 2016
@ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022