[patch] improve unicode methods: split() rsplit() and replace() #51871
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
assignee = None
closed_at = <Date 2010-01-13.08:09:56.647>
created_at = <Date 2010-01-03.17:09:44.179>
labels = ['interpreter-core', 'expert-unicode', 'performance']
title = '[patch] improve unicode methods: split() rsplit() and replace()'
updated_at = <Date 2010-01-13.08:09:56.646>
user = 'https://github.com/florentx'
activity = <Date 2010-01-13.08:09:56.646>
actor = 'pitrou'
assignee = 'none'
closed = True
closed_date = <Date 2010-01-13.08:09:56.647>
closer = 'pitrou'
components = ['Interpreter Core', 'Unicode']
creation = <Date 2010-01-03.17:09:44.179>
creator = 'flox'
dependencies =
files = ['15736', '15744', '15749', '15750']
hgrepos =
issue_num = 7622
keywords = ['patch']
message_count = 24.0
messages = ['97168', '97172', '97173', '97174', '97184', '97194', '97197', '97204', '97208', '97211', '97212', '97213', '97214', '97215', '97216', '97218', '97219', '97220', '97224', '97232', '97267', '97280', '97281', '97698']
nosy_count = 5.0
nosy_names = ['lemburg', 'pitrou', 'eric.smith', 'ezio.melotti', 'flox']
pr_nums =
priority = 'normal'
resolution = 'fixed'
stage = 'patch review'
status = 'closed'
superseder = None
type = 'performance'
url = 'https://bugs.python.org/issue7622'
versions = ['Python 2.7', 'Python 3.2']
Content of the patch:
Benchmark coming soon...
The patch looks wrong for bytearrays. They are mutable, so you shouldn't return the original object as an optimization. Here is the current (unpatched) behaviour:
>>> a = bytearray(b"abc")
>>> b, = a.split()
>>> b is a
False
On the other hand, you aren't doing this optimization at all in the general case of stringlib_split() and stringlib_rsplit(), while it could be done.
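To illustrate why the mutable case matters, here is a minimal sketch using only the built-in bytearray type. It shows that the piece returned by split() is a separate object, so mutating it leaves the original untouched; returning the original object as a shortcut would break this.

```python
# A bytearray is mutable, so split() must hand back fresh objects.
a = bytearray(b"abc")
(b,) = a.split()      # no whitespace in the input: exactly one piece

# The piece is equal to the input but is a distinct object.
print(b == a, b is a)

b[0] = ord("X")       # mutate the piece in place
print(a)              # the original still holds b"abc"
print(b)              # only the piece now holds b"Xbc"
```

For immutable types such as bytes or str, sharing the original object is safe because nobody can modify it afterwards.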
A few comments on coding style:
count = countstring(self_s, self_len, from_s, from_len, 0, self_len, FORWARD, maxcount);
/* helper macro to fixup start/end slice values */
+#define ADJUST_INDICES(start, end, len) \
and use similar formatting for the replacement function
#define LONG_BITMASK (LONG_BIT-1)
#define BLOOM(mask, ch) ((mask & (1 << ((ch) & LONG_BITMASK))))
LONG_BITMASK has a value of 0x1f (31) - that's a single byte, not
When adjusting the value to be platform dependent, please check
Note that you don't need to expose that value separately if
Py_ssize_t i, j, count=0;
PyObject *list = PyList_New(PREALLOC_SIZE(maxcount)), *sub;
instead use this style:
Py_ssize_t i, j;
Py_ssize_t count=0;
PyObject *list = PyList_New(PREALLOC_SIZE(maxcount));
PyObject *sub;
Thank you for your remarks. I will update the patch accordingly.
Since the same value is used to build the mask, I assume it's better to keep the value around (or use (LONG_BIT-1) directly?).
Would renaming it (s/LONG_BITMASK/BLOOM_BITMASK/) be less confusing?
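For readers unfamiliar with the mask being discussed, the following is a rough Python sketch of the idea, not the patch's C code. It assumes a 32-bit mask (so the bit selector is 0x1f, as quoted above); the names BLOOM_WIDTH, make_bloom_mask and bloom are hypothetical and chosen only for illustration. The same bit-selector value is used both when building the mask and when testing it, which is why keeping it in one named constant is convenient.

```python
BLOOM_WIDTH = 32                  # assumed number of bits in the mask
BLOOM_BITMASK = BLOOM_WIDTH - 1   # 0x1f: keeps the low five bits of a code point

def make_bloom_mask(chars):
    """Set one bit per character, chosen by the character's low bits."""
    mask = 0
    for ch in chars:
        mask |= 1 << (ord(ch) & BLOOM_BITMASK)
    return mask

def bloom(mask, ch):
    """Return False if ch is definitely absent, True if it may be present."""
    return bool(mask & (1 << (ord(ch) & BLOOM_BITMASK)))

mask = make_bloom_mask("abc")
print(bloom(mask, "a"))   # True: 'a' was added
print(bloom(mask, "z"))   # False: 'z' maps to a bit that was never set
print(bloom(mask, "A"))   # True: false positive, 'A' shares its low bits with 'a'
```

A False answer lets a search skip ahead without a full comparison, while a True answer still needs a real check.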
I copied the style of "stringlib/partition.h" for this part.
No, it's ok for stringlib to have its own consistent style and there's no reason to change it IMO.
More interesting would be benchmark results showing how much this improves the various methods :-)
And now, the figures.
There's no gain for the string methods.
Most significant results:
--- bench_slow.log Trunk
========== late match, 100 characters
-13.30 20.51 s="ABC"*33; ("E"+s+("D"+s)*500).rsplit("E"+s, 1) (*100)
-16.12 29.88 s="ABC"*33; ((s+"D")*500+s+"E").split(s+"E", 1) (*100)
+13.27 14.38 s="ABC"*33; ("E"+s+("D"+s)*500).rsplit("E"+s, 1) (*100)
+16.19 17.61 s="ABC"*33; ((s+"D")*500+s+"E").split(s+"E", 1) (*100)
========== quick replace multiple character match
========== quick replace single character match
(full benchmark diff is attached)
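The two "late match" statements quoted above can be re-timed with the standard timeit module. This is only a sketch, not the benchmark script used for the attached logs: it assumes "(*100)" means 100 executions per measurement, and on Python 3 it exercises str only, which corresponds to the unicode type of Python 2.

```python
import timeit

# Setup and statements copied from the "late match, 100 characters" rows.
setup = 's = "ABC" * 33'
statements = {
    "rsplit, late match": '("E" + s + ("D" + s) * 500).rsplit("E" + s, 1)',
    "split, late match": '((s + "D") * 500 + s + "E").split(s + "E", 1)',
}

results = {}
for label, stmt in statements.items():
    # Best of 3 measurements, each running the statement 100 times.
    best = min(timeit.repeat(stmt, setup=setup, repeat=3, number=100))
    results[label] = best * 1000.0  # seconds -> milliseconds
    print(f"{label}: {results[label]:.2f} ms per 100 calls")
```

Absolute numbers will differ from the figures in this issue, since they depend on the machine and interpreter version; only the relative change between two builds is meaningful.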
And in total, the patch saves 1000 lines of code.
Florent Xicluna wrote:
I'd prefer if you change the coding style to what we use elsewhere
See http://www.python.org/dev/peps/pep-0007/ for more C coding
Eric Smith wrote:
For any new files added, PEP-7 should always be used.
For PEP-7-ifying the existing code, we could open a separate ticket or just apply the change as a separate patch.