
improved allocation of PyUnicode objects #46235

Closed
pitrou opened this issue Jan 26, 2008 · 96 comments
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) topic-unicode type-feature A feature request or enhancement

Comments

@pitrou
Member

pitrou commented Jan 26, 2008

BPO 1943
Nosy @malemburg, @gvanrossum, @rhettinger, @terryjreedy, @amauryfa, @mdickinson, @pitrou, @vstinner, @ericvsmith, @devdanzin, @benjaminp, @orivej, @ezio-melotti
Files
  • stringbench3k.py: a quick port to py3k of stringbench.py
  • unialloc4.patch
  • freelists2.patch
  • unialloc5.patch
  • unialloc6.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2011-09-29.20:03:47.256>
    created_at = <Date 2008-01-26.22:44:34.097>
    labels = ['interpreter-core', 'type-feature', 'expert-unicode']
    title = 'improved allocation of PyUnicode objects'
    updated_at = <Date 2011-09-29.20:03:47.254>
    user = 'https://github.com/pitrou'

    bugs.python.org fields:

    activity = <Date 2011-09-29.20:03:47.254>
    actor = 'vstinner'
    assignee = 'none'
    closed = True
    closed_date = <Date 2011-09-29.20:03:47.256>
    closer = 'vstinner'
    components = ['Interpreter Core', 'Unicode']
    creation = <Date 2008-01-26.22:44:34.097>
    creator = 'pitrou'
    dependencies = []
    files = ['9297', '9789', '9790', '14048', '19142']
    hgrepos = []
    issue_num = 1943
    keywords = ['patch']
    message_count = 96.0
    messages = ['61728', '61730', '61732', '61735', '61738', '61740', '61741', '61743', '61747', '61832', '61862', '61867', '62332', '62467', '64108', '64112', '64161', '64163', '64165', '64166', '64167', '64168', '64169', '64170', '64194', '64197', '64215', '64216', '64320', '87912', '87917', '87918', '88254', '88287', '88288', '88290', '88291', '88307', '88309', '88310', '88312', '88313', '88584', '88585', '88744', '88745', '88746', '88751', '88753', '88767', '88768', '88774', '88775', '88777', '88779', '88801', '88802', '88804', '88806', '88816', '88819', '88824', '88830', '88879', '88880', '88881', '88883', '88885', '88928', '88936', '88946', '88951', '88953', '88983', '89069', '97545', '97553', '97554', '97555', '97559', '97581', '98656', '98657', '98662', '98668', '98669', '98672', '98676', '98677', '98686', '110599', '116937', '116950', '118087', '134406', '144625']
    nosy_count = 19.0
    nosy_names = ['lemburg', 'gvanrossum', 'collinwinter', 'rhettinger', 'terry.reedy', 'jafo', 'jimjjewett', 'amaury.forgeotdarc', 'mark.dickinson', 'Rhamphoryncus', 'pitrou', 'vstinner', 'eric.smith', 'ferringb', 'ajaksu2', 'benjamin.peterson', 'orivej', 'ezio.melotti', 'BreamoreBoy']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'patch review'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue1943'
    versions = ['Python 3.2']

    @pitrou
    Member Author

    pitrou commented Jan 26, 2008

    This is an attempt at improving allocation of str (PyUnicode) objects.
    For that it does two things:

    1. make str objects PyVarObjects, so that the object is allocated in one
      pass rather than two
    2. maintain an array of freelists, each for a given str length, so that
      many str allocations can bypass actual memory allocation calls

    There is a ~10% speedup in stringbench, and a slight improvement in
    pybench (as invoked with "./python Tools/pybench/pybench.py -t String").
    Also, memory consumption is a bit lower when running those benchmarks.
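
    For illustration, here is a minimal sketch of what the per-length
    freelists could look like (the constants and names are invented for this
    example and are not taken from the patch):

        /* One freelist per string length, up to a small cutoff. A freed
           object of length n is parked in freelist[n] and reused by the
           next allocation asking for exactly n characters, skipping the
           malloc/free round-trip entirely. */
        #define MAX_FREELIST_LEN   64   /* lengths served from freelists */
        #define MAX_FREELIST_DEPTH 80   /* objects kept per length */

        static PyUnicodeObject *freelist[MAX_FREELIST_LEN + 1];
        static int numfree[MAX_FREELIST_LEN + 1];

        /* allocation:   if (len <= MAX_FREELIST_LEN && freelist[len] != NULL)
                             pop an object from freelist[len] and reuse it;
                         else allocate the PyVarObject with one malloc call. */
        /* deallocation: if (len <= MAX_FREELIST_LEN &&
                             numfree[len] < MAX_FREELIST_DEPTH)
                             push the object onto freelist[len];
                         else free it. */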

    @pitrou pitrou added interpreter-core (Objects, Python, Grammar, and Parser dirs) type-feature A feature request or enhancement labels Jan 26, 2008
    @malemburg
    Member

    The Unicode object was designed not to be a PyVarObject (in contrast to
    the string object), so I'm not sure why you would want to break that
    design in return for a mere 10% speedup.

    Note that turning the objects into PyVarObjects removes the ability to
    extend the type at C level. It also introduces lots of problems if you
    want to resize a Unicode object, due to the fact that the memory address
    of the whole PyUnicodeObject can change. Resizing is done quite often in
    codecs.

    Regarding your second point: Unicode objects already use a free list and
    even keep the allocated Py_UNICODE buffer allocated (if it's not too
    large). Furthermore, the allocation is handled by pymalloc, so you only
    rarely need to go to the system malloc() to allocate more space.

    Tuning the KEEPALIVE_SIZE_LIMIT will likely have similar effects with
    respect to speedup.

    @pitrou
    Member Author

    pitrou commented Jan 27, 2008

    I just tried bumping KEEPALIVE_SIZE_LIMIT to 200. It makes up for a bit
    of the speedup, but the PyVarObject version is still faster. Moreover,
    and perhaps more importantly, it eats less memory (800KB less in
    pybench, 3MB less when running the whole test suite).

    (then there are of course microbenchmarks. For example:
    # With KEEPALIVE_SIZE_LIMIT = 200
    ./python -m timeit -s "s=open('LICENSE', 'r').read()" "s.split()"
    1000 loops, best of 3: 556 usec per loop
    # With this patch
    ./python -m timeit -s "s=open('LICENSE', 'r').read()" "s.split()"
    1000 loops, best of 3: 391 usec per loop
    )

    I don't understand the argument for codecs having to resize the unicode
    object. Since codecs also have to resize the bytes object when encoding,
    the argument should apply to bytes objects as well, yet bytes (and str
    in 2.6) is a PyVarObject.

    I admit I don't know the exact reasons for PyUnicode's design. I just
    know that the patch doesn't break the API, and runs the test suite fine.
    Are there any specific things to look for?

    @malemburg
    Member

    Your microbenchmark is biased towards your patched version. The
    KEEPALIVE_SIZE_LIMIT will only cut in when you deallocate and then
    reallocate Unicode objects. The free list used for Unicode objects is
    also limited to 1024 objects - which isn't all that much. You could tune
    MAX_UNICODE_FREELIST_SIZE as well.

    Regarding memory usage: this is difficult to measure in Python, since
    pymalloc will keep memory chunks allocated even if they are not in use
    by Python. However, this is a feature of pymalloc and not specific to
    the Unicode implementation. It can be tuned in pymalloc. To get more
    realistic memory measurements, you'd have to switch off pymalloc
    altogether and then run a separate process that consumes lots of
    memory, forcing the OS to allocate only the memory that's really
    needed by the process you're measuring. Of course,
    keeping objects alive in a free list will always use more memory than
    freeing them altogether and returning the memory to the OS. It's a
    speed/space tradeoff. The RAM/CPU costs ratio has shifted a lot towards
    RAM nowadays, so using more RAM is usually more efficient than using
    more CPU time.

    Regarding resize: you're right - the string object is a PyVarObject as
    well and couldn't be changed at the time due to backwards compatibility
    reasons. You should also note that when I added Unicode to Python 1.6,
    it was a new and not commonly used type. Codecs were not used much
    either, so there was no incentive to make resizing strings work better.
    Later on, other optimizations were added to the Unicode implementation
    that caused the PyUnicode_Resize() API to also require being able to
    change the object address. Still, in the common case, it doesn't change
    the object address.

    The reason for using an external buffer for the Unicode object was to be
    able to do further optimizations, such as share buffers between Unicode
    objects. We never ended up using this, but there's still a lot
    of room for speedups and more memory efficiency because of this design.

    Like I already mentioned, PyObjects are also easier to extend at the C
    level: adding new variables to the object at the end is easy with
    PyObjects. It's difficult for PyVarObjects, since you always have to
    take the current size of the object into account and you always have to
    use indirection to get at the extra variables due to the undefined
    offset of the variables.
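
    To make the objection concrete, here is a rough sketch (struct names
    invented for this example) of why a fixed-size object is easier to
    extend at the C level than an inlined PyVarObject:

        /* Pointer-based design: a C subclass appends fields at a fixed,
           compile-time offset. */
        typedef struct {
            PyUnicodeObject base;   /* header + pointer to external buffer */
            PyObject *extra;        /* always at the same offset */
        } MyUnicodeSubclassObject;

        /* Inlined PyVarObject design: the character data follows the header
           directly, so extra fields would sit after a block whose size
           depends on ob_size; their offset must be computed at runtime, and
           code assuming the base layout cannot be reused as-is. */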

    How much speedup do you get when you compare the pybench test with
    KEEPALIVE_SIZE_LIMIT = 200 compared to your patched version ?

    @pitrou
    Member Author

    pitrou commented Jan 27, 2008

    With KEEPALIVE_SIZE_LIMIT = 200, the pybench runtime is basically the
    same as with my patched version. stringbench remains a bit faster though
    (~8%).

    You say that RAM size is cheaper than CPU power today, which is true but
    misses one part of the picture: the importance of CPU caches, and thus
    of working set size. There are also probably people wanting to use
    Python in memory-constrained environments (embedded).

    I understand the argument about possible optimizations with an external
    buffer, but are such optimizations likely to be implemented? (see
    bpo-1590352 and bpo-1629305). If they really are, then I'm happy with the
    unicode type remaining a plain PyObject!

    @malemburg
    Member

    I don't really see the connection with bpo-1629305.

    An optimization that would be worth checking is hooking up the
    Py_UNICODE pointer to interned Unicode objects if the contents match
    (e.g. do a quick check based on the hash value and then a memcmp). That
    would save memory and the call to the pymalloc allocator.

    Another strategy could involve a priority queue style cache with the aim
    of identifying often used Unicode strings and then reusing them.

    This could also be enhanced using an offline approach: you first run an
    application with an instrumented Python interpreter to find the most
    often used strings and then pre-fill the cache or interned dictionary on
    the production Python interpreter at startup time.

    Coming from a completely different angle, you could also use the
    Py_UNICODE pointer to share slices of a larger data buffer. A Unicode
    sub-type could handle this case, keeping a PyObject* reference to the
    larger buffer, so that it doesn't get garbage collected before the
    Unicode slice.
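
    A minimal sketch of such a buffer-sharing subclass (field names are
    hypothetical; this is not the actual uniref/unicoderef code):

        typedef struct {
            PyUnicodeObject base;   /* base.str points into owner's buffer,
                                       base.length is the slice length */
            PyObject *owner;        /* strong reference keeping the object
                                       that owns the larger buffer alive */
        } UnicodeSliceObject;

        /* Slicing then only adjusts base.str/base.length and INCREFs owner;
           no character data is copied. */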

    Regarding memory constrained environments: these should simply switch
    off all free lists and pymalloc. OTOH, even mobile phones come with
    gigabytes of RAM nowadays, so it's not really worth the trouble, IMHO.

    @pitrou
    Member Author

    pitrou commented Jan 27, 2008

    All of those proposals are much heavier to implement; they also make the
    unicode type more complicated and difficult to maintain probably (while
    turning it into a PyVarObject actually shortens the implementation a
    bit). In any case, it's not something I want to tackle.

    The reason that I mentioned bpo-1629305 was that it was such an
    optimization which complicated the unicode type while bringing very
    significant speedups.

    @malemburg
    Member

    Agreed, those optimizations do make the implementation more complicated.
    It's also not clear whether they would really be worth it.

    bpo-1629305 only provided speedups for the case where you write s += 'abc'.
    The usual idiom for this is to use a list and then concatenate in one
    go. If you want a really fast approach, you'd use cStringIO or perhaps
    the bufferarray. I don't think that optimizing for just one particular
    use case warrants making the code more complicated or changing the C
    interface radically.

    In your case, I think that closing the door on being able to easily
    extend the type at the C level is the main argument against it. The
    speedups are only marginal and can also be achieved (to some extent) by
    tuning the existing implementation's parameters.

    @pitrou
    Member Author

    pitrou commented Jan 27, 2008

    I know it's not the place to discuss bpo-1629305, but the join() solution
    is not always faster. Why? Because 1) there's the list construction and
    method call overhead 2) ceval.c has some bytecode hackery to try and
    make plain concatenations somewhat less slow.

    As for cStringIO, I actually observed in some experiments that it was
    slower than the other alternatives, at least for short strings. All in
    all, choosing between the three idioms is far from obvious and needs
    case-by-case analysis.

    @pitrou
    Member Author

    pitrou commented Jan 29, 2008

    FWIW, I tried using the freelist scheme introduced in my patch without
    making PyUnicode a PyVarObject and, although it's better than the
    standard version, it's still not as good as the PyVarObject version.
    Would you be interested in that patch?

    @malemburg
    Member

    Yes, definitely.

    Some comments on style in your first patch:

    • please use unicode->length instead of the macro LENGTH you added
    • indents in unicodeobject.c are 4 spaces
    • line length should stay below 80

    @pitrou
    Member Author

    pitrou commented Jan 30, 2008

    After some more tests I must qualify what I said. The freelist patch is
    an improvement in some situations. In others it does not really have any
    impact. On the other hand, the PyVarObject version handles memory-bound
    cases dramatically better, see below.

    With a small string:
    ./python -m timeit -s "s=open('INTBENCH', 'r').read()" "s.split()"
    -> Unpatched py3k: 26.4 usec per loop
    -> Freelist patch: 21.5 usec per loop
    -> PyVarObject patch: 20.2 usec per loop

    With a medium-sized string:
    ./python -m timeit -s "s=open('LICENSE', 'r').read()" "s.split()"
    -> Unpatched py3k: 458 usec per loop
    -> Freelist patch: 408 usec per loop
    -> PyVarObject patch: 316 usec per loop

    With a long string:
    ./python -m timeit -s "s=open('Misc/HISTORY', 'r').read()" "s.split()"
    -> Unpatched py3k: 31.3 msec per loop
    -> Freelist patch: 32.7 msec per loop
    -> PyVarObject patch: 17.8 msec per loop

    (the numbers are better than in my previous posts because the
    split-accelerating patch has been integrated)

    Also, given those results, it is also clear that neither pybench nor
    stringbench really stress memory efficiency of strings, only processing
    algorithms.

    That said, the freelist patch is attached.

    @pitrou
    Member Author

    pitrou commented Feb 12, 2008

    Here is an updated patch against the current py3k branch, and with
    spaces instead of tabs for indentation.

    @pitrou
    Member Author

    pitrou commented Feb 16, 2008

    Here is an updated patch, to comply with the introduction of the
    PyUnicode_ClearFreeList() function.

    @jafo
    Mannequin

    jafo mannequin commented Mar 19, 2008

    Marc-Andre: With the updated patches, is this a set of patches we can accept?

    @jafo jafo mannequin assigned malemburg Mar 19, 2008
    @pitrou
    Member Author

    pitrou commented Mar 19, 2008

    Thanks for your interest Sean :)
    By the way, on python-3000 both GvR and Martin von Löwis were ok with the
    proposed design change, although they did not review the patch itself.
    http://mail.python.org/pipermail/python-3000/2008-February/012076.html

    @malemburg
    Member

    Antoine, as I've already mentioned in my other comments, I'm -1 on
    changing the Unicode object to a variable size object.

    I also don't think that the micro-benchmarks you are applying really do
    test the implementation in real-life situations. The only case where
    your patch appears significantly faster is the "long string" case. If
    you look at the distribution of the Unicode strings generated by this
    case, you'll find that most strings have less than 10-20 characters.
    Optimizing pymalloc for these cases and tuning the parameters in the
    Unicode implementation will likely give you the same performance
    increase without having to sacrifice the advantages of using a pointer
    in the object rather than inlining the data.

    I'm +1 on the free list changes, though, in the long run, I think that
    putting more efforts into improving pymalloc and removing the free lists
    altogether would be better.

    BTW: Unicode slices would be a possible and fairly attractive target for
    a C level subclass of Unicode objects. The pointer in the Unicode object
    implementation could then point to the original Unicode object's buffer
    and the subclass would add an extra pointer to the original object
    itself (in order to keep it alive). The Unicode type (written by Fredrik
    Lundh) I used as basis for the Unicode implementation worked with this idea.

    @amauryfa
    Member

    Marc-Andre: don't all your objections also apply to the 8bit string
    type, which is already a variable-size structure?
    Is extending unicode more common than extending str?

    With python 3.0, all strings are unicode. Shouldn't this type be
    optimized for the same use cases of the previous str type?

    @malemburg
    Member

    Yes, all those objections apply to the string type as well. The fact
    that strings are variable length objects makes it impossible to apply
    any of the possible optimizations I mentioned. If strings were a fixed
    length object, it would have been possible to write a true basestring
    object from which both 8-bit strings and Unicode could subclass, making a lot
    of things easier.

    BTW: Please also see ticket bpo-2321 to see how the change affects your
    benchmarks.

    @pitrou
    Member Author

    pitrou commented Mar 20, 2008

    Hi,

    Marc-André, I'm all for "real-life" benchmarks if someone proposes some.
    Until that we have to live with micro-benchmarks, which seems to be the
    method used for other CPython optimizations by the way.

    You are talking about slicing optimizations but you forget that the
    "lazy slices" idea has been shot down by Guido and others when proposed
    by Larry Hastings (with a patch) some time ago. Also the "lazy slices"
    patch was for 8-bit strings, which *are* variable-size objects, which
    seems to counter your argument that variable-size Unicode objects would
    prevent making such optimizations.

    As I said the freelist changes actually have mixed consequences, and in
    general don't improve much (and that improvement is just backed by
    micro-benchmarks after all... why would it be more convincing than the
    far greater improvement brought by making Unicode objects variable-size
    objects?).

    Why wouldn't you express your arguments in the python-3000 thread I
    launched one month ago, so that at least there is a clear picture of the
    different arguments for and against this approach?

    @malemburg
    Member

    Regarding benchmarks: It's difficult to come up with decent benchmarks
    for things like this. A possible strategy is to use an instrumented
    interpreter that records which Unicode objects are created and when they
    are deleted. If you then run this instrumented interpreter against a few
    larger applications such as Zope for a while, you'll get a data set that
    can be used as input for a more real-life like benchmark. I've done this
    in the past for Python byte codes to see which are used how often -
    based on such data you can create optimizations that have a real-life
    effect.

    Regarding the lazy slice patches: those were not using subclassing, they
    were patches to the existing type implementations. With subclassing you
    don't affect all uses of an object, but instead let the user control
    which uses should take advantage of the slice operations. Since the user
    also controls the objects which are kept alive this way, the main
    argument against Unicode slice objects goes away. This is different than
    patching the type itself and doesn't change the main object
    implementation in any way. Furthermore, it's possible to implement and
    provide such subclasses outside the Python core, which gives developers
    a lot more freedom.

    Regarding discussions on the py3k list: I'm not on that list, since I
    already get more email than I can manage. I am on the python-dev list,
    if you want to take up the discussions there.

    @pitrou
    Member Author

    pitrou commented Mar 20, 2008

    Well I'm not subscribed to the python-3k list either - too much traffic
    indeed. You can read and post into it with gmane for example:
    http://thread.gmane.org/gmane.comp.python.python-3000.devel/11768
    (there is probably an NNTP gateway too)

    As for instrumenting the interpreter, this would tell us when and which
    strings are allocated, but not the precise effect it has on a modern
    CPU's memory subsystem (which is quite a complicated thing). Also, this
    is Py3k - we can't test any real-life applications until they are ported
    to it...
    (someone really motivated would have to learn Intel VTune or AMD
    CodeAnalyst and run such a "py3k real-life application" with it :-))

    As for the explicit slicing approach, "explicit string views" have been
    discussed in 2006 on the python-3k list, including a proof-of-concept
    patch by Josiah Carlson - without any definitive pronouncement it seems.
    See the subthread here:
    http://mail.python.org/pipermail/python-3000/2006-August/003280.html

    The reason I'm bringing in those previous discussions is that, in
    theory, I'm all for much faster concatenation and slicing thanks to
    buffer sharing etc., but realistically the previous attempts have failed
    for various reasons (either technical or political). And since those
    optimizations are far from simple to implement and maintain, chances are
    they will not be attempted again, let alone accepted. I'd be happy to be
    proven wrong :)

    regards

    Antoine.

    @malemburg
    Member

    I've read the comments from Guido and Martin, but they don't convince me
    in changing my -1.

    As you say: it's difficult to get support for optimizations such as
    slicing and concatenation into the core. And that's exactly why I want
    to keep the door open for developers to add such support via extension
    modules. With the 8-bit string implementation this was never possible.

    With the Unicode implementation and the subclassing support for builtin
    types, it is possible to add support without having to go through all
    the hoops and discussions on python-dev or py3k lists.

    I'm also for making Python faster, but not if it limits future
    optimizations and produces dead-ends. The fact that such optimizations
    haven't been implemented yet doesn't really mean anything either -
    people are not yet using Unicode intensively enough to let the need arise.

    @pitrou
    Member Author

    pitrou commented Mar 20, 2008

    Well, I'm not gonna try to defend my patch eternally :)
    I understand your opinion even if I find it a bit disturbing that we refuse
    a concrete, actual optimization on the basis of future hypothetical ones.

    Since all the arguments have been laid down, I'll let other developers
    decide whether further discussion of this proposal is useful or not.

    @malemburg
    Member

    Regarding the benchmark: You can instrument a 2.x version of the
    interpreter to build the data set and then have the test load this data
    set in Py3k and have it replay the allocation/deallocation in the same
    way it was done on the 2.x system.

    I also expect that patch bpo-2321 will have an effect on the performance
    since that patch changed the way memory allocation of the buffer was
    done (from using the system malloc() to using pymalloc's allocator for
    the buffer as well).

    @pitrou
    Member Author

    pitrou commented Mar 20, 2008

    You are right, bpo-2321 made the numbers a bit tighter:

    With a small string:
    ./python -m timeit -s "s=open('INTBENCH', 'r').read()" "s.split()"
    -> Unpatched py3k: 23.1 usec per loop
    -> Freelist patch: 21.3 usec per loop
    -> PyVarObject patch: 20.5 usec per loop

    With a medium-sized string:
    ./python -m timeit -s "s=open('LICENSE', 'r').read()" "s.split()"
    -> Unpatched py3k: 406 usec per loop
    -> Freelist patch: 353 usec per loop
    -> PyVarObject patch: 314 usec per loop

    With a long string:
    ./python -m timeit -s "s=open('Misc/HISTORY', 'r').read()" "s.split()"
    -> Unpatched py3k: 22.7 msec per loop
    -> Freelist patch: 24 msec per loop
    -> PyVarObject patch: 20.6 msec per loop

    stringbench3k:
    -> Unpatched py3k: 266 seconds
    -> Freelist patch: 264 seconds
    -> PyVarObject patch: 249 seconds

    Regarding your benchmarking suggestion, this would certainly be an
    interesting thing to do, but I fear it is also much more than I'm
    willing to do...

    I'm going to post the updated patches.

    @malemburg
    Member

    Thanks for running the tests again. The use of pymalloc for the buffer
    made a significant difference indeed. I expect that more can be had by
    additionally tweaking KEEPALIVE_SIZE_LIMIT.

    It is interesting to see that the free list patch only appears to
    provide better timings for the "smaller" strings tests. I couldn't find
    the INTBENCH file anywhere in the Python source tree, but here's the
    distribution of word lengths in the latter two files:

    Dev-Python/LICENSE:
    ------------------------------------------------------------------------
    Length: [Count] 0----1----2----3----4----5----6----7----8----9----10 * 10%
    1: [ 28] ===
    2: [ 350] =================================================
    3: [ 354] ==================================================
    4: [ 190] ==========================
    5: [ 191] ==========================
    6: [ 158] ======================
    7: [ 164] =======================
    8: [ 132] ==================
    9: [ 127] =================
    10: [ 102] ==============
    11: [ 39] =====
    12: [ 34] ====
    13: [ 21] ==
    14: [ 10] =
    15: [ 10] =
    16: [ 0]
    17: [ 0]
    18: [ 1]
    19: [ 0]
    20: [ 0]
    21: [ 1]
    22: [ 0]
    23: [ 0]
    24: [ 0]
    25: [ 1]
    26: [ 1]
    27: [ 1]
    28: [ 0]
    29: [ 1]
    30: [ 0]
    31: [ 0]
    32: [ 0]
    33: [ 0]
    34: [ 0]
    35: [ 0]
    36: [ 2]
    37: [ 0]
    38: [ 0]
    39: [ 1]
    40: [ 0]
    41: [ 0]
    42: [ 0]
    43: [ 1]
    44: [ 1]
    45: [ 0]
    46: [ 0]
    47: [ 0]
    48: [ 0]
    49: [ 0]
    50: [ 1]
    51: [ 0]
    52: [ 0]
    53: [ 0]
    54: [ 0]
    55: [ 0]
    56: [ 0]
    57: [ 0]
    58: [ 0]
    59: [ 0]
    60: [ 0]
    61: [ 0]
    62: [ 0]
    63: [ 1]

    Dev-Python/Misc/HISTORY:
    ------------------------------------------------------------------------
    Length: [Count] 0----1----2----3----4----5----6----7----8----9----10 * 10%
    1: [ 6853] ==================
    2: [13920] =====================================
    3: [18401] ==================================================
    4: [12626] ==================================
    5: [ 9545] =========================
    6: [ 9348] =========================
    7: [ 9625] ==========================
    8: [ 7351] ===================
    9: [ 5353] ==============
    10: [ 3266] ========
    11: [ 1947] =====
    12: [ 1336] ===
    13: [ 983] ==
    14: [ 638] =
    15: [ 408] =
    16: [ 288]
    17: [ 286]
    18: [ 216]
    19: [ 176]
    20: [ 147]
    21: [ 120]
    22: [ 116]
    23: [ 85]
    24: [ 89]
    25: [ 70]
    26: [ 44]
    27: [ 59]
    28: [ 39]
    29: [ 32]
    30: [ 65]
    31: [ 23]
    32: [ 26]
    33: [ 28]
    34: [ 19]
    35: [ 18]
    36: [ 9]
    37: [ 18]
    38: [ 5]
    39: [ 10]
    40: [ 9]
    41: [ 9]
    42: [ 1]
    43: [ 1]
    44: [ 8]
    45: [ 4]
    46: [ 5]
    47: [ 5]
    48: [ 3]
    49: [ 3]
    50: [ 4]
    51: [ 0]
    52: [ 0]
    53: [ 2]
    54: [ 0]
    55: [ 0]
    56: [ 4]
    57: [ 2]
    58: [ 4]
    59: [ 1]
    60: [ 0]
    61: [ 1]
    62: [ 0]
    63: [ 1]
    64: [ 1]
    65: [ 9]
    66: [ 1]
    67: [ 1]
    68: [ 0]
    69: [ 0]
    70: [ 16]
    71: [ 1]
    72: [ 0]
    73: [ 1]

    Compare that to a typical Python module source file...

    Dev-Python/Lib/urllib.py:
    ------------------------------------------------------------------------
    Length: [Count] 0----1----2----3----4----5----6----7----8----9----10 * 10%
    1: [ 806] ==================================================
    2: [ 672] =========================================
    3: [ 675] =========================================
    4: [ 570] ===================================
    5: [ 531] ================================
    6: [ 501] ===============================
    7: [ 296] ==================
    8: [ 393] ========================
    9: [ 246] ===============
    10: [ 147] =========
    11: [ 150] =========
    12: [ 102] ======
    13: [ 90] =====
    14: [ 116] =======
    15: [ 61] ===
    16: [ 51] ===
    17: [ 45] ==
    18: [ 38] ==
    19: [ 31] =
    20: [ 39] ==
    21: [ 24] =
    22: [ 18] =
    23: [ 18] =
    24: [ 23] =
    25: [ 17] =
    26: [ 15]
    27: [ 13]
    28: [ 14]
    29: [ 11]
    30: [ 9]
    31: [ 7]
    32: [ 1]
    33: [ 6]
    34: [ 10]
    35: [ 2]
    36: [ 4]
    37: [ 3]
    38: [ 6]
    39: [ 1]
    40: [ 0]
    41: [ 1]
    42: [ 5]
    43: [ 0]
    44: [ 0]
    45: [ 0]
    46: [ 0]
    47: [ 0]
    48: [ 2]
    49: [ 0]
    50: [ 1]
    51: [ 1]
    52: [ 2]

    The distributions differ a lot, but they both show that typical strings
    (both in English text and Python program code) have a length of < 25
    characters.

    Setting KEEPALIVE_SIZE_LIMIT to 32 should cover most of those cases
    while still keeping the memory load within bounds. Raising the free list
    limit to e.g. 2048 would cause at most 64kB to be used by the free list,
    which is well within bounds of any modern CPU L2 cache.

    @malemburg
    Member

    Antoine Pitrou wrote:

    Antoine Pitrou <pitrou@free.fr> added the comment:

    Raymond suggested the patch be committed in 3.1, so as to minimize
    disruption between 3.1 and 3.2. Benjamin, what do you think?

    Has Guido pronounced on this already ?

    @gvanrossum
    Member

    On Fri, Jun 5, 2009 at 4:06 AM, Marc-Andre Lemburg
    <report@bugs.python.org> wrote:

    Marc-Andre Lemburg <mal@egenix.com> added the comment:

    Antoine Pitrou wrote:
    > Antoine Pitrou <pitrou@free.fr> added the comment:
    >
    > Raymond suggested the patch be committed in 3.1, so as to minimize
    > disruption between 3.1 and 3.2. Benjamin, what do you think?

    Has Guido pronounced on this already ?

    I don't want it added to 3.1 unless we start the beta cycle afresh.
    It's too subtle for submitting after rc1. Talk to Benjamin if you
    disagree.

    I think it's fine to wait for 3.2. Maybe add something to the docs
    about not subclassing unicode in C.

    @malemburg
    Member

    Guido van Rossum wrote:

    I think it's fine to wait for 3.2. Maybe add something to the docs
    about not subclassing unicode in C.

    We should have a wider discussion about this on python-dev.

    I'll publish the unicoderef extension and then we can see
    whether users want this or not.

    Antoine's patch makes such extensions impossible (provided you
    don't want to copy over the complete unicodeobject.c
    implementation in order to change the memory allocation
    scheme).

    Note that in Python 2.x you don't have such issues because
    there, most tools for text processing will happily work on
    any sort of buffer, so you don't need a string sub-type
    in order to implement e.g. references into another string
    (the buffer type will allow you to do this easily).

    @pitrou
    Member Author

    pitrou commented Jun 5, 2009

    Note that in Python 2.x you don't have such issues because
    there, most tools for text processing will happily work on
    any sort of buffer, so you don't need a string sub-type
    in order to implement e.g. references into another string
    (the buffer type will allow you to do this easily).

    The new buffer API has a provision for type flags, although none of them
    had a chance to be implemented by the original author before he ran
    away...
    There could be a type flag for unicode characters and then its support
    could be implemented in the memoryview object.

    @terryjreedy
    Member

    In the interest of possibly improving the imminent 3.1 release,
    I opened bpo-6216
    Raise Unicode KEEPALIVE_SIZE_LIMIT from 9 to 32?

    I wonder if it is possible to make it generically easier to subclass
    PyVarObjects (but my C knowledge is getting too faded to have any ideas).

    @malemburg
    Member

    Terry J. Reedy wrote:

    In the interest of possibly improving the imminent 3.1 release,
    I opened bpo-6216
    Raise Unicode KEEPALIVE_SIZE_LIMIT from 9 to 32?

    Thanks for opening that ticket.

    I wonder if it is possible to make it generically easier to subclass
    PyVarObjects (but my C knowledge is getting too faded to have any ideas).

    Even if we were to add some pointer arithmetic tricks to at least
    hide away the complexities, you'd no longer be able to change the
    way the data memory allocation works.

    The reason is simple: subclassing is about reusing existing method
    implementations and only adding/changing a few of them.

    If you want to change the way the allocation works, you'd have
    to replace all of them.

    Furthermore, using your subclassed objects with the existing APIs
    would no longer be safe, since these would still assume the
    original base class memory allocation scheme.

    In summary:

    Implementations like the unicoderef type I posted
    and most of the other use cases I mentioned are no longer
    possible, ie. you will *always* have to copy the data in order
    to work with the existing Unicode APIs on it.

    The current implementation has no problem with working on referenced
    data, since support for this was built in right from the start.

    That's what I meant by closing the door on future enhancements
    that would make a huge difference if used right, for a mere
    10% performance increase.

    @Rhamphoryncus
    Mannequin

    Rhamphoryncus mannequin commented Jan 10, 2010

    Points against the subclassing argument:

    • We have a null-termination invariant. For byte strings this was part of the public API, and I'm not sure that's changed for unicode strings; aren't you arguing that we should maximize how much of our implementation is a public API? This prevents lazy slicing.

    • UTF-16 and UTF-32 are rarely used encodings, especially for longer strings (ie files). For shorter strings (APIs) the unicode object overhead is more significant and we'd need a way to slave the buffer's lifetime to that of the unicode object (hard to do). For longer strings UTF-8 would be much more useful, but that's been shot down before.

    • subclassing unicode so you can change the meaning of the fields (ie allocating your own buffer) is a gross hack. It relies far too much on fine details of the implementation and is fragile (what if you miss the dummy byte needed by fastsearch?) Most of the possible options could be, if they function correctly, applied directly to the basetype as a patch, so it's moot.

    • If you dislike PyVarObject in general (I think the API is ugly too) you should argue for a general policy discouraging future use of it, not just get in the way of the one place where it's most appropriate

    Terry: PyVarObjects would be much easier to subclass if the type object stored an offset to the beginning of the variable section, so it could be automatically recalculated for subclasses based on the size of the struct. This'd mean the PyBytesObject struct would no longer end with a char ob_sval[1]. The down side is a tiny bit more math when accessing the variable section (as the offset is no longer constant).
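
    A rough sketch of that suggestion (tp_varoffset is a hypothetical slot,
    not something the type object currently has):

        /* Each type records where its variable-length section starts; a
           subclass that appends fixed-size fields stores a larger offset,
           and accessors compute the address at runtime. */
        #define PyVar_DATA(op) \
            ((char *)(op) + Py_TYPE(op)->tp_varoffset)

        /* e.g. bytes would read its characters via PyVar_DATA(self) instead
           of self->ob_sval, at the cost of one extra addition per access. */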

    @malemburg
    Member

    Adam Olsen wrote:

    Adam Olsen <rhamph@gmail.com> added the comment:

    Points against the subclassing argument:

    • We have a null-termination invariant. For byte strings this was part of the public API, and I'm not sure that's changed for unicode strings; aren't you arguing that we should maximize how much of our implementation is a public API? This prevents lazy slicing.

    Base type Unicode buffers end with a null-Py_UNICODE termination,
    but this is not used anywhere, AFAIK. We could probably remove
    that overallocation at some point.

    There's no such thing as a null-termination invariant for Unicode.

    • subclassing unicode so you can change the meaning of the fields (ie allocating your own buffer) is a gross hack. It relies far too much on fine details of the implementation and is fragile (what if you miss the dummy byte needed by fastsearch?) Most of the possible options could be, if they function correctly, applied directly to the basetype as a patch, so it's moot.

    Actually, Unicode objects were designed to be subclassable right
    from the start and adjusting the buffer to point e.g. into some
    other already allocated string was too. I removed this feature from
    Fredrik's type implementation with the intent to re-add it later on as
    a subclass.

    See the prototype implementation of such a subclass uniref that I've
    written to show how easy it is to add a subclass which can be used
    to slice large Unicode objects without having to reallocate new
    buffers all the time.

    BTW, I'm not aware of any changes to the PyUnicodeObject by some
    fastsearch implementation. Could you point me to this ?

    @amauryfa
    Member

    Base type Unicode buffers end with a null-Py_UNICODE termination,
    but this is not used anywhere, AFAIK
    On Windows, code like
    CreateDirectoryW(PyUnicode_AS_UNICODE(po), NULL)
    is very common, at least in posixmodule.c.

    @mdickinson
    Member

    I find that the null termination for 8-bit strings makes low-level parsing operations (e.g., parsing a numeric string) safer and easier: for example, it makes skipping a series of digits with something like:

    while (isdigit(*s)) ++s;

    safe. I'd imagine that null terminated PyUNICODE arrays would have similar benefits.
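
    For instance, with a 0-terminated Py_UNICODE buffer the same idiom would
    presumably carry over (sketch):

        while (Py_UNICODE_ISDIGIT(*p)) ++p;   /* stops at the terminating 0,
                                                  which is not a digit */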

    @pitrou
    Member Author

    pitrou commented Jan 10, 2010

    I find that the null termination for 8-bit strings makes low-level
    parsing operations (e.g., parsing a numeric string) safer and easier:

    Not to mention faster. The new IO library makes use of it (for newline
    detection), on both bytestrings and unicode strings.

    @Rhamphoryncus
    Mannequin

    Rhamphoryncus mannequin commented Jan 11, 2010

    On Sun, Jan 10, 2010 at 14:59, Marc-Andre Lemburg
    <report@bugs.python.org> wrote:

    BTW, I'm not aware of any changes to the PyUnicodeObject by some
    fastsearch implementation. Could you point me to this ?

    /* We allocate one more byte to make sure the string is Ux0000 terminated.
       The overallocation is also used by fastsearch, which assumes that it's
       safe to look at str[length] (without making any assumptions about what
       it contains). */
    

    @malemburg
    Member

    Antoine Pitrou wrote:

    Antoine Pitrou <pitrou@free.fr> added the comment:

    > I find that the null termination for 8-bit strings makes low-level
    > parsing operations (e.g., parsing a numeric string) safer and easier:

    Not to mention faster. The new IO library makes use of it (for newline
    detection), on both bytestrings and unicode strings.

    I'd consider that a bug. Esp. the IO lib should be 8-bit clean
    in the sense that it doesn't add any special meaning to NUL
    characters or code points.

    Besides, using a for-loop with a counter is both safer and faster
    than checking each and every character for NUL.

    Just think of what can happen if you have buggy code that overwrites
    the NUL byte in some corner case situation and then use the assumption
    of having the NUL byte as terminator - a classical buffer overrun.

    If you're lucky, you get a segfault. If not, you end up with
    data corruption or manipulation of data which could lead to
    unwanted code execution.

    The Python Unicode API deliberately tries to always use the combination
    of a Py_UNICODE* pointer and a length integer to avoid such issues.

    @pitrou
    Member Author

    pitrou commented Feb 1, 2010

    I'd consider that a bug. Esp. the IO lib should be 8-bit clean
    in the sense that it doesn't add any special meaning to NUL
    characters or code points.

    It doesn't add any special meaning to them. It just relies on a NUL
    being present after the end of the string. It doesn't care about other
    NULs.

    Besides, using a for-loop with a counter is both safer and faster
    than checking each and every character for NUL.

    It's slower, since it has one more condition to check.
    Newline detection as it is written has a fast path in the form of:
    while (*c++ >= 0x20);

    Just think of what can happen if you have buggy code that overwrites
    the NUL byte in some corner case situation and then use the assumption
    of having the NUL byte as terminator - a classical buffer overrun.

    Well, buggy code leads to bugs :)

    @amauryfa
    Member

    amauryfa commented Feb 1, 2010

    Again, on Windows there are many many usages of PyUnicode_AS_UNICODE() that pass the result to various Windows API functions, expecting a nul-terminated array of WCHARs. Please don't change this!

    @malemburg
    Member

    Amaury Forgeot d'Arc wrote:

    Amaury Forgeot d'Arc <amauryfa@gmail.com> added the comment:

    > Base type Unicode buffers end with a null-Py_UNICODE termination,
    > but this is not used anywhere, AFAIK
    On Windows, code like
    CreateDirectoryW(PyUnicode_AS_UNICODE(po), NULL)
    is very common, at least in posixmodule.c.

    The above usage is clearly wrong. PyUnicode_AS_UNICODE() should
    only be used to access Py_UNICODE data directly when
    working on Python Unicode objects. The macro is not meant
    to be used directly in external APIs.

    For such uses, the Unicode conversion APIs need to be used,
    e.g. the PyUnicode_AsWideChar() API. These will then also
    apply any 0-termination as necessary.

    Note that Python is free to change the meaning of Py_UNICODE
    (e.g. to use UCS4 on all platforms) or Unicode implementation
    details (such as e.g. the 0-termination) and this would then break
    any use such as the one you quoted.

    @amauryfa
    Member

    amauryfa commented Feb 1, 2010

    Then there are many places to change, in core python as well as in third-party code. And PyArg_ParseTuple("u") would not work any more.

    @amauryfa
    Member

    amauryfa commented Feb 1, 2010

    Note that Python is free to change the meaning of Py_UNICODE
    (e.g. to use UCS4 on all platforms)

    Python-UCS4 has never worked on Windows. Most developers on Windows, following the example of core Python source code, implicitly assumed that HAVE_USABLE_WCHAR_T is true, and use the PyUnicode* API the same way they use the PyString* functions.

    PyString_AsString is documented to return a nul-terminated array. If PyUnicode_AsUnicode starts to behave differently, people will have more trouble to port their modules to py3k.
    This is not an implementation detail.

    @malemburg
    Member

    > If PyUnicode_AsUnicode starts to behave differently, people will have
    > more trouble to port their modules to py3k.
    >
    > This is not an implementation detail.

    It is, otherwise I would have documented it. The fact that some
    developers are not using those APIs correctly doesn't change that.

    Note that PyUnicode_AsUnicode() only returns a pointer to the
    Py_UNICODE buffer. It makes no guarantees on the 0-termination.
    Developers need to use PyUnicode_GetSize() to access the size
    of the Unicode string.

    But no worries: We're not going to change it. It's too late
    after 10 years in the wild.

    Still, developers will have to be aware that 0-termination is not a
    guaranteed Unicode feature and should stop making that assumption;
    it will not necessarily hold or be guaranteed for Unicode subclasses.

    @pitrou
    Member Author

    pitrou commented Feb 1, 2010

    On Monday, 1 February 2010 at 19:21 +0000, Marc-Andre Lemburg wrote:

    > This is not an implementation detail.

    It is, otherwise I would have documented it.

    Ok, so the current allocation scheme of unicode objects is an
    implementation detail as well, right?

    @terryjreedy
    Member

    It is, otherwise I would have documented it. The fact that some
    developers are not using those APIs correctly doesn't change that.

    If, as Antoine claimed, 'it' is a documented feature of str strings, and Py3 says str = Unicode, it is a plausible inference.

    @BreamoreBoy
    Mannequin

    BreamoreBoy mannequin commented Jul 17, 2010

    @antoine: do you wish to try and take this forward?

    @gvanrossum gvanrossum removed their assignment Aug 21, 2010
    @BreamoreBoy
    Mannequin

    BreamoreBoy mannequin commented Sep 20, 2010

    No reply to msg110599, I'll close this in a couple of weeks unless anyone objects.

    @benjaminp
    Contributor

    2010/9/20 Mark Lawrence <report@bugs.python.org>:

    Mark Lawrence <breamoreboy@yahoo.co.uk> added the comment:

    No reply to msg110599, I'll close this in a couple of weeks unless anyone objects.

    Please don't. This is still a valid issue.

    @pitrou
    Member Author

    pitrou commented Oct 6, 2010

    Updated patch against current py3k.

    @amauryfa
    Member

    I just found that the extension zope.i18nmessageid: http://pypi.python.org/pypi/zope.i18nmessageid subclasses unicode at the C level:
    http://svn.zope.org/zope.i18nmessageid/trunk/src/zope/i18nmessageid/_zope_i18nmessageid_message.c?rev=120914&view=markup

    Notably, the Message structure is defined this way:
    typedef struct {
      PyUnicodeObject base;
      PyObject *domain;
      PyObject *default_;
      PyObject *mapping;
    } Message;

    How would such an extension type behave after the patch? Is there a workaround we can propose?

    @vstinner
    Member

    PEP-393 is based on the idea proposed in this issue (use only one memory block, not two), but also enhances it to reduce memory further using other techniques:

    • use a different character type depending on the maximum character,
    • use a shorter structure for ASCII-only strings

    PEP-393 has been accepted and merged into Python 3.3, so I consider this issue done.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022