re.sub returns str when processing empty unicode string #45481

Closed
beda mannequin opened this issue Sep 10, 2007 · 11 comments
Labels
topic-regex type-bug An unexpected behavior, bug, or error

Comments

@beda
Mannequin

beda mannequin commented Sep 10, 2007

BPO 1140
Nosy @gvanrossum
Files
  • sre.diff
  • sre.diff
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = 'https://github.com/gvanrossum'
    closed_at = <Date 2007-09-17.09:44:15.949>
    created_at = <Date 2007-09-10.06:37:18.875>
    labels = ['expert-regex', 'type-bug']
    title = 're.sub returns str when processing empty unicode string'
    updated_at = <Date 2007-09-17.09:44:15.948>
    user = 'https://bugs.python.org/beda'

    bugs.python.org fields:

    activity = <Date 2007-09-17.09:44:15.948>
    actor = 'jafo'
    assignee = 'gvanrossum'
    closed = True
    closed_date = <Date 2007-09-17.09:44:15.949>
    closer = 'jafo'
    components = ['Regular Expressions']
    creation = <Date 2007-09-10.06:37:18.875>
    creator = 'beda'
    dependencies = []
    files = ['8415', '8416']
    hgrepos = []
    issue_num = 1140
    keywords = []
    message_count = 11.0
    messages = ['55775', '55788', '55789', '55790', '55793', '55797', '55798', '55800', '55803', '55805', '55957']
    nosy_count = 4.0
    nosy_names = ['gvanrossum', 'effbot', 'jafo', 'beda']
    pr_nums = []
    priority = 'low'
    resolution = 'accepted'
    stage = None
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue1140'
    versions = ['Python 2.5', 'Python 2.4']


    While re.sub normally returns unicode strings when processing unicode,
    it returns a normal string when dealing with an empty unicode string.

    Example:
    >>> print type( re.sub( "XX", "", u""))
    <type 'str'>
    >>> print type( re.sub( "XX", "", u"A"))
    <type 'unicode'>

    This inconsistency could lead to annoying bugs (at least it did for me :)
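A minimal sketch of a regression check for the behaviour reported above. The helper name `sub_keeps_subject_type` is hypothetical and not part of the `re` module. It reuses the pattern and replacement from the report and is written so that it should also parse on Python 3, where `u""` is an ordinary text string. On an interpreter that still has the reported inconsistency, the entry for the empty subject would presumably come out false.

```python
import re

def sub_keeps_subject_type(pattern, repl, subject):
    """Return True if re.sub gives back the same type as `subject`."""
    return type(re.sub(pattern, repl, subject)) is type(subject)

# The two subjects from the report, plus one where a substitution happens.
subjects = [u"", u"A", u"XXA"]
results = [sub_keeps_subject_type("XX", "", s) for s in subjects]
```

Each entry in `results` corresponds to one subject, so a false entry points at the input whose type was not preserved.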

    @beda beda mannequin added topic-regex type-bug An unexpected behavior, bug, or error labels Sep 10, 2007
    @gvanrossum
    Member

    I agree. I wonder if it should return Unicode as soon as *any* of the
    arguments is unicode?

    @beda
    Mannequin Author

    beda mannequin commented Sep 10, 2007

    I would certainly expect it to return unicode when either the "modified"
    string or the replacement are unicode. I don't think that the type of
    the replaced string should influence the type of the result.

    @gvanrossum
    Member

    Actually, it already implements the best possible rules, *except* for
    the special case of an empty 3rd argument. (When there are no
    substitutions, it normally returns the input unchanged; but somehow an
    empty input is handled with a shortcut even before that point.) It ought
    to be a simple fix.
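As a small illustration of the "no substitutions" case described above, `re.subn` can be used to observe that nothing was replaced and that the returned text equals the input. This is only a sketch of the visible behaviour. It makes no claim about the shortcut inside the C implementation, and it compares values rather than object identity.

```python
import re

# Subjects that contain no "XX", including the empty one from the report.
subjects = [u"", u"A"]

# re.subn returns a (new_string, number_of_substitutions) tuple.
observed = [re.subn("XX", "", s) for s in subjects]

for subject, (result, count) in zip(subjects, observed):
    # With no match, the count is zero and the text is unchanged.
    assert count == 0
    assert result == subject
```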

    @gvanrossum
    Member

    Here's a patch.

    @gvanrossum gvanrossum self-assigned this Sep 10, 2007
    @gvanrossum
    Member

    Here's a better patch that also fixes a few related issues.

    @gvanrossum
    Member

    Fredrik, thoughts?

    @gvanrossum gvanrossum assigned effbot and unassigned gvanrossum Sep 10, 2007
    @effbot
    Mannequin

    effbot mannequin commented Sep 10, 2007

    Looks good to me. I still subscribe to the idea that
    robust code should accept 8-bit *ASCII* strings
    anywhere it accepts Unicode (especially when the 8-bit
    string is empty), but that's me.

    Feel free to check this in (or assign back to you if
    you don't have the time).
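One possible reading of the "accept 8-bit ASCII strings anywhere it accepts Unicode" idea, sketched at the application level. The helper `as_text` is hypothetical. It is not taken from the attached patch and is not something `re` provides. It assumes an interpreter where the `bytes` name exists (Python 2.6 or later, where it is an alias for `str`, or Python 3).

```python
def as_text(value, encoding="ascii"):
    """Hypothetical helper: decode byte strings to text, pass text through."""
    if isinstance(value, bytes):
        # Non-ASCII bytes raise UnicodeDecodeError instead of being guessed at.
        return value.decode(encoding)
    return value

# Normalising inputs up front means later string handling sees one type only.
normalised = [as_text(b""), as_text(b"abc"), as_text(u"abc")]
```

Decoding strictly as ASCII keeps the helper from silently accepting 8-bit data in an unknown encoding.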

    @effbot effbot mannequin assigned gvanrossum and unassigned effbot Sep 10, 2007
    @effbot
    Mannequin

    effbot mannequin commented Sep 10, 2007

    (is there a way to just add a comment in the new tracker, btw, or is
    everything a "change note", even if nothing has changed?)

    @gvanrossum
    Member

    Thanks, Fredrik.
    Fixed in 2.6.
    Committed revision 58098.
    Someone else could backport to 2.5.
    Shouldn't be merged into 3.0.

    @jafo
    Mannequin

    jafo mannequin commented Sep 17, 2007

    Applied as revision 58179 to 2.5 maintenance branch, passes tests.

    @jafo jafo mannequin closed this as completed Sep 17, 2007
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022