str.translate() unexpectedly duplicates characters #70651
Python 3.5.1 x86-64, Windows 10

I created a translation map that translated some characters to None and others to strings, and found that in some cases str.translate() duplicates one of the untranslated characters in the returned string.

How to reproduce:

    table = str.maketrans({'a': None, 'b': 'cd'})
    'axb'.translate(table)

Expected result: 'xcd'
Actual result: 'xxcd'

Mapping 'a' to '' instead of None produces the desired result. |
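The report and its workaround can be compared side by side. This is only a minimal sketch: the variable names are made up for illustration, and the `'xxcd'` result is what the report describes for 3.5.1, not something a fixed interpreter will show.

```python
# Table from the report: delete 'a', expand 'b' to the two-character string 'cd'.
table_none = str.maketrans({'a': None, 'b': 'cd'})

# Workaround from the report: map 'a' to the empty string instead of None.
table_empty = str.maketrans({'a': '', 'b': 'cd'})

result_none = 'axb'.translate(table_none)
result_empty = 'axb'.translate(table_empty)

# On an interpreter with the fix both lines print 'xcd'.
# The report saw 'xxcd' for the None variant on Python 3.5.1.
print(repr(result_none))
print(repr(result_empty))
```

Both tables describe the same translation, so any difference between the two results points at the interpreter rather than at the table.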
It duplicates translated characters as well. For example:

    >>> table = str.maketrans('mnopqrb', 'rqponm\u0100', 'a')
    >>> 'aaaaaamnopqrb'.translate(table)
    'rqponmrqponmĀ'

3.4 returns the correct result:

    >>> table = str.maketrans('mnopqrb', 'rqponm\u0100', 'a')
    >>> 'aaaaaamnopqrb'.translate(table)
    'rqponmĀ'

The problem is that the new fast path for one-to-one ASCII mapping (unicode_fast_translate in Objects/unicodeobject.c) doesn't have a way to return the current input position in order to resume processing the translation. _PyUnicode_TranslateCharmap assumes it's the same as the current writer position, which is wrong when input characters have been deleted. |
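The position mix-up described above can be modelled in pure Python. This is a toy sketch, not the C code from Objects/unicodeobject.c: `fast_translate`, `slow_translate`, and `translate_model` are hypothetical names, and the model only assumes what the comment states, namely that a fast path handles deletions and one-to-one ASCII replacements and that the caller must then resume at the right input index.

```python
def fast_translate(text, table, out):
    """Toy fast path: handles deletions and 1:1 ASCII replacements.

    Appends to `out` and returns the INPUT index at which it stopped.
    """
    for i, ch in enumerate(text):
        repl = table.get(ord(ch), ch)
        if repl is None:
            continue  # deleted: consumes input, writes nothing
        if isinstance(repl, int):
            repl = chr(repl)
        if len(repl) != 1 or ord(repl) >= 128 or ord(ch) >= 128:
            return i  # not a simple ASCII mapping: hand over to the slow path
        out.append(repl)
    return len(text)


def slow_translate(text, table, out, start):
    """Toy general path: translates text[start:] one character at a time."""
    for ch in text[start:]:
        repl = table.get(ord(ch), ch)
        if repl is None:
            continue
        out.append(chr(repl) if isinstance(repl, int) else repl)


def translate_model(text, table, buggy):
    out = []
    stopped_at = fast_translate(text, table, out)
    # Buggy variant: resume at the OUTPUT length, as if nothing had been deleted.
    resume = len(out) if buggy else stopped_at
    slow_translate(text, table, out, resume)
    return ''.join(out)


table1 = str.maketrans({'a': None, 'b': 'cd'})
table2 = str.maketrans('mnopqrb', 'rqponm\u0100', 'a')

print(translate_model('axb', table1, buggy=True))             # xxcd
print(translate_model('axb', table1, buggy=False))            # xcd
print(translate_model('aaaaaamnopqrb', table2, buggy=True))   # rqponmrqponmĀ
print(translate_model('aaaaaamnopqrb', table2, buggy=False))  # rqponmĀ
```

In the model, every deleted character makes the output position fall one step behind the input position, so resuming from the output position re-reads that many already translated characters.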
Oh... I see. It's a bug introduced by the optimization for ASCII strings that replaces one character with another ASCII character or deletes a character: unicode_fast_translate(). See change cca6b056236a of issue bpo-21118.

The code confuses the input position with the output position. "i = writer.pos;" is used in the caller to continue when unicode_fast_translate() was interrupted (because a translation uses a non-ASCII character or a string longer than 1 character), but writer.pos is the position in the *output* string, not in the *input* string :-/

I see that I added unit tests on translate, but they lack a test for fast translation that starts with ignored characters and then switches to regular translation.

The attached patch should fix the issue. It adds unit tests. |
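A regression test for the missing case (deleted characters first, then a switch to the regular translation path) might look like the sketch below. It is not the test from the attached patch; the class and method names are hypothetical, and the inputs are simply the two examples from this issue.

```python
import unittest


class TranslateResumeTest(unittest.TestCase):
    """Hypothetical tests: deletions followed by a mapping the ASCII
    fast path cannot handle."""

    def test_delete_then_multichar_replacement(self):
        # 'a' is deleted, then 'b' maps to a string longer than 1 character.
        table = str.maketrans({'a': None, 'b': 'cd'})
        self.assertEqual('axb'.translate(table), 'xcd')

    def test_delete_then_non_ascii_replacement(self):
        # Six 'a' characters are deleted, then 'b' maps to a non-ASCII character.
        table = str.maketrans('mnopqrb', 'rqponm\u0100', 'a')
        self.assertEqual('aaaaaamnopqrb'.translate(table), 'rqponm\u0100')
```

Saved in a module, the tests can be run with `python -m unittest`. According to the results quoted in this issue, both would fail on 3.5.1 and pass on 3.4.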
The bug was introduced in Python v3.5.0a1. |
LGTM. |
New changeset 27ba9ba5deb1 by Victor Stinner in branch '3.5': |
Thanks for the review. I pushed my fix. Sorry for the regression, I hate being responsible for a regression in a core feature :-/ It may even deserve a release, but Python doesn't have the habit of "release often" yet :-( |
New changeset 6643c5cc9797 by Victor Stinner in branch '3.5': |