re.sub returns str when processing empty unicode string #45481

beda · 2007-09-10T06:37:19Z

BPO	1140
Nosy	@gvanrossum
Files	sre.diff sre.diff

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:

assignee = 'https://github.com/gvanrossum'
closed_at = <Date 2007-09-17.09:44:15.949>
created_at = <Date 2007-09-10.06:37:18.875>
labels = ['expert-regex', 'type-bug']
title = 're.sub returns str when processing empty unicode string'
updated_at = <Date 2007-09-17.09:44:15.948>
user = 'https://bugs.python.org/beda'

bugs.python.org fields:

activity = <Date 2007-09-17.09:44:15.948>
actor = 'jafo'
assignee = 'gvanrossum'
closed = True
closed_date = <Date 2007-09-17.09:44:15.949>
closer = 'jafo'
components = ['Regular Expressions']
creation = <Date 2007-09-10.06:37:18.875>
creator = 'beda'
dependencies = []
files = ['8415', '8416']
hgrepos = []
issue_num = 1140
keywords = []
message_count = 11.0
messages = ['55775', '55788', '55789', '55790', '55793', '55797', '55798', '55800', '55803', '55805', '55957']
nosy_count = 4.0
nosy_names = ['gvanrossum', 'effbot', 'jafo', 'beda']
pr_nums = []
priority = 'low'
resolution = 'accepted'
stage = None
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue1140'
versions = ['Python 2.5', 'Python 2.4']

beda · 2007-09-10T06:37:18Z

While re.sub normally returns unicode strings when processing unicode,
it returns a normal string when dealing with an empty unicode string.

Example:
>>> import re
>>> print type( re.sub( "XX", "", u""))
<type 'str'>
>>> print type( re.sub( "XX", "", u"A"))
<type 'unicode'>

This inconsistency could lead to annoying bugs (at least it did for me :)

gvanrossum · 2007-09-10T17:14:03Z

I agree. I wonder if it should return Unicode as soon as *any* of the
arguments are unicode???

beda · 2007-09-10T18:25:31Z

I would certainly expect it to return unicode when either the "modified"
string or the replacement are unicode. I don't think that the type of
the replaced string should influence the type of the result.

gvanrossum · 2007-09-10T18:42:55Z

Actually, it already implements the best possible rules, *except* for
the special case of an empty 3rd argument. (When there are no
substitutions, it normally returns the input unchanged; but somehow an
empty input is handled with a shortcut even before that point.) It ought
to be a simple fix.
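
The rule described here (a substitution that matches nothing hands back the input, so the result has the input's type) can be written down as a small check. This is only a sketch, not the test from the attached patch; the helper name `sub_result_types` is hypothetical. On the affected 2.4/2.5 releases the assertion would be expected to fail for the empty unicode input, as in the original report.

```python
import re

def sub_result_types(pattern, repl, inputs):
    """Hypothetical helper: type of re.sub's result for each input string."""
    return [type(re.sub(pattern, repl, text)) for text in inputs]

# "XX" never matches these inputs, so every call is a no-op substitution.
inputs = [u"", u"A"]
types = sub_result_types("XX", "", inputs)

# Each result should have the same type as its input, empty or not.
for text, result_type in zip(inputs, types):
    assert result_type is type(text), (text, result_type)
```

The empty input is the interesting case because, per the comment above, it took a shortcut before the usual "return the input unchanged" path.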

gvanrossum · 2007-09-10T20:37:41Z

Here's a patch.

gvanrossum · 2007-09-10T21:40:05Z

Here's a better patch that also fixes a few related issues.

gvanrossum · 2007-09-10T21:40:25Z

Fredrik, thoughts?

effbot · 2007-09-10T21:54:54Z

Looks good to me. I still subscribe to the idea that
robust code should accept 8-bit *ASCII* strings
anywhere it accepts Unicode (especially when the 8-bit
string is empty), but that's me.

Feel free to check this in (or assign back to you if
you don't have the time).

effbot · 2007-09-10T21:56:41Z

(is there a way to just add a comment in the new tracker, btw, or is
everything a "change note", even if nothing has changed?)

gvanrossum · 2007-09-10T22:03:42Z

Thanks, Fredrik.
Fixed in 2.6.
Committed revision 58098.
Someone else could backport to 2.5.
Shouldn't be merged into 3.0.

jafo · 2007-09-17T09:44:16Z

Applied as revision 58179 to 2.5 maintenance branch, passes tests.

beda mannequin added topic-regex and type-bug labels Sep 10, 2007

gvanrossum self-assigned this Sep 10, 2007

gvanrossum assigned effbot and unassigned gvanrossum Sep 10, 2007

effbot mannequin assigned gvanrossum and unassigned effbot Sep 10, 2007

jafo mannequin closed this as completed Sep 17, 2007

ezio-melotti transferred this issue from another repository Apr 10, 2022
