
improved allocation of PyUnicode objects #46235

Closed
pitrou opened this issue Jan 26, 2008 · 96 comments
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) topic-unicode type-feature A feature request or enhancement

Comments

@pitrou
Member

pitrou commented Jan 26, 2008

BPO 1943
Nosy @malemburg, @gvanrossum, @rhettinger, @terryjreedy, @amauryfa, @mdickinson, @pitrou, @vstinner, @ericvsmith, @devdanzin, @benjaminp, @orivej, @ezio-melotti
Files
  • stringbench3k.py: a quick port to py3k of stringbench.py
  • unialloc4.patch
  • freelists2.patch
  • unialloc5.patch
  • unialloc6.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2011-09-29.20:03:47.256>
    created_at = <Date 2008-01-26.22:44:34.097>
    labels = ['interpreter-core', 'type-feature', 'expert-unicode']
    title = 'improved allocation of PyUnicode objects'
    updated_at = <Date 2011-09-29.20:03:47.254>
    user = 'https://github.com/pitrou'

    bugs.python.org fields:

    activity = <Date 2011-09-29.20:03:47.254>
    actor = 'vstinner'
    assignee = 'none'
    closed = True
    closed_date = <Date 2011-09-29.20:03:47.256>
    closer = 'vstinner'
    components = ['Interpreter Core', 'Unicode']
    creation = <Date 2008-01-26.22:44:34.097>
    creator = 'pitrou'
    dependencies = []
    files = ['9297', '9789', '9790', '14048', '19142']
    hgrepos = []
    issue_num = 1943
    keywords = ['patch']
    message_count = 96.0
    messages = ['61728', '61730', '61732', '61735', '61738', '61740', '61741', '61743', '61747', '61832', '61862', '61867', '62332', '62467', '64108', '64112', '64161', '64163', '64165', '64166', '64167', '64168', '64169', '64170', '64194', '64197', '64215', '64216', '64320', '87912', '87917', '87918', '88254', '88287', '88288', '88290', '88291', '88307', '88309', '88310', '88312', '88313', '88584', '88585', '88744', '88745', '88746', '88751', '88753', '88767', '88768', '88774', '88775', '88777', '88779', '88801', '88802', '88804', '88806', '88816', '88819', '88824', '88830', '88879', '88880', '88881', '88883', '88885', '88928', '88936', '88946', '88951', '88953', '88983', '89069', '97545', '97553', '97554', '97555', '97559', '97581', '98656', '98657', '98662', '98668', '98669', '98672', '98676', '98677', '98686', '110599', '116937', '116950', '118087', '134406', '144625']
    nosy_count = 19.0
    nosy_names = ['lemburg', 'gvanrossum', 'collinwinter', 'rhettinger', 'terry.reedy', 'jafo', 'jimjjewett', 'amaury.forgeotdarc', 'mark.dickinson', 'Rhamphoryncus', 'pitrou', 'vstinner', 'eric.smith', 'ferringb', 'ajaksu2', 'benjamin.peterson', 'orivej', 'ezio.melotti', 'BreamoreBoy']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'patch review'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue1943'
    versions = ['Python 3.2']

    @pitrou
    Member Author

    pitrou commented Jan 26, 2008

    This is an attempt at improving allocation of str (PyUnicode) objects.
    For that it does two things:

    1. make str objects PyVarObjects, so that the object is allocated in one
      pass rather than two
    2. maintain an array of freelists, each for a given str length, so that
      many str allocations can bypass actual memory allocation calls

    There is a ~10% speedup in stringbench, and a slight improvement in
    pybench (as invoked with "./python Tools/pybench/pybench.py -t String").
    Also, memory consumption is a bit lower when running those benchmarks.
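
    For illustration, here is a minimal sketch of what the per-length
    freelists could look like (the constants and names are invented for this
    example and are not taken from the patch):

        /* One freelist per string length, up to a small cutoff. A freed
           object of length n is parked in freelist[n] and reused by the
           next allocation asking for exactly n characters, skipping the
           malloc/free round-trip entirely. */
        #define MAX_FREELIST_LEN   64   /* lengths served from freelists */
        #define MAX_FREELIST_DEPTH 80   /* objects kept per length */

        static PyUnicodeObject *freelist[MAX_FREELIST_LEN + 1];
        static int numfree[MAX_FREELIST_LEN + 1];

        /* allocation:   if (len <= MAX_FREELIST_LEN && freelist[len] != NULL)
                             pop an object from freelist[len] and reuse it;
                         else allocate the PyVarObject with one malloc call. */
        /* deallocation: if (len <= MAX_FREELIST_LEN &&
                             numfree[len] < MAX_FREELIST_DEPTH)
                             push the object onto freelist[len];
                         else free it. */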

    @pitrou pitrou added interpreter-core (Objects, Python, Grammar, and Parser dirs) type-feature A feature request or enhancement labels Jan 26, 2008
    @malemburg
    Member

    The Unicode object was designed not to be a PyVarObject (in contrast to
    the string object), so I'm not sure why you would want to break that
    design in return for a mere 10% speedup.

    Note that turning the objects into PyVarObjects removes the ability to
    extend the type at C level. It also introduces lots of problems if you
    want to resize a Unicode object, due to the fact that the memory address
    of the whole PyUnicodeObject can change. Resizing is done quite often in
    codecs.

    Regarding your second point: Unicode objects already use a free list and
    even keep the allocated Py_UNICODE buffer allocated (if it's not too
    large). Furthermore, the allocation is handled by pymalloc, so you only
    rarely need to go to the system malloc() to allocate more space.

    Tuning the KEEPALIVE_SIZE_LIMIT will likely have similar effects with
    respect to speedup.

    @pitrou
    Member Author

    pitrou commented Jan 27, 2008

    I just tried bumping KEEPALIVE_SIZE_LIMIT to 200. It makes up for a bit
    of the speedup, but the PyVarObject version is still faster. Moreover,
    and perhaps more importantly, it eats less memory (800KB less in
    pybench, 3MB less when running the whole test suite).

    (then there are of course microbenchmarks. For example:
    # With KEEPALIVE_SIZE_LIMIT = 200
    ./python -m timeit -s "s=open('LICENSE', 'r').read()" "s.split()"
    1000 loops, best of 3: 556 usec per loop
    # With this patch
    ./python -m timeit -s "s=open('LICENSE', 'r').read()" "s.split()"
    1000 loops, best of 3: 391 usec per loop
    )

    I don't understand the argument for codecs having to resize the unicode
    object. Since codecs also have to resize the bytes object when encoding,
    the argument should apply to bytes objects as well, yet bytes (and str
    in 2.6) is a PyVarObject.

    I admit I don't know the exact reasons for PyUnicode's design. I just
    know that the patch doesn't break the API, and runs the test suite fine.
    Are there any specific things to look for?

    @malemburg
    Member

    Your microbenchmark is biased towards your patched version. The
    KEEPALIVE_SIZE_LIMIT will only cut in when you deallocate and then
    reallocate Unicode objects. The free list used for Unicode objects is
    also limited to 1024 objects - which isn't all that much. You could tune
    MAX_UNICODE_FREELIST_SIZE as well.

    Regarding memory usage: this is difficult to measure in Python, since
    pymalloc will keep memory chunks allocated even if they are not in use
    by Python. However, this is a feature of pymalloc and not specific to
    the Unicode implementation. It can be tuned in pymalloc. To get more
    realistic memory measurements, you'd have to switch off pymalloc
    altogether and then run a separate process that consumes lots of
    memory, forcing the OS to allocate only the memory that's really
    needed by the process you're measuring. Of course,
    keeping objects alive in a free list will always use more memory than
    freeing them altogether and returning the memory to the OS. It's a
    speed/space tradeoff. The RAM/CPU costs ratio has shifted a lot towards
    RAM nowadays, so using more RAM is usually more efficient than using
    more CPU time.

    Regarding resize: you're right - the string object is a PyVarObject as
    well and couldn't be changed at the time due to backwards compatibility
    reasons. You should also note that when I added Unicode to Python 1.6,
    it was a new and not commonly used type. Codecs were not used much
    either, so there was no incentive to make resizing strings work better.
    Later on, other optimizations were added to the Unicode implementation
    that caused the PyUnicode_Resize() API to also require being able to
    change the object address. Still, in the common case, it doesn't change
    the object address.

    The reason for using an external buffer for the Unicode object was to be
    able to do further optimizations, such as share buffers between Unicode
    objects. We never ended up using this, but there's still a lot
    of room for speedups and more memory efficiency because of this design.

    Like I already mentioned, PyObjects are also easier to extend at the C
    level: adding new variables to the object at the end is easy with
    PyObjects. It's difficult for PyVarObjects, since you always have to
    take the current size of the object into account and you always have to
    use indirection to get at the extra variables due to the undefined
    offset of the variables.
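
    To make the objection concrete, here is a rough sketch (struct names
    invented for this example) of why a fixed-size object is easier to
    extend at the C level than an inlined PyVarObject:

        /* Pointer-based design: a C subclass appends fields at a fixed,
           compile-time offset. */
        typedef struct {
            PyUnicodeObject base;   /* header + pointer to external buffer */
            PyObject *extra;        /* always at the same offset */
        } MyUnicodeSubclassObject;

        /* Inlined PyVarObject design: the character data follows the header
           directly, so extra fields would sit after a block whose size
           depends on ob_size; their offset must be computed at runtime, and
           code assuming the base layout cannot be reused as-is. */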

    How much speedup do you get when you compare the pybench test with
    KEEPALIVE_SIZE_LIMIT = 200 compared to your patched version ?

    @pitrou
    Member Author

    pitrou commented Jan 27, 2008

    With KEEPALIVE_SIZE_LIMIT = 200, the pybench runtime is basically the
    same as with my patched version. stringbench remains a bit faster though
    (~8%).

    You say that RAM size is cheaper than CPU power today, which is true but
    misses one part of the picture: the importance of CPU caches, and thus
    of working set size. There are also probably people wanting to use
    Python in memory-constrained environments (embedded).

    I understand the argument about possible optimizations with an external
    buffer, but are such optimizations likely to be implemented? (see
    bpo-1590352 and bpo-1629305). If they really are, then I'm happy with the
    unicode type remaining a plain PyObject!

    @malemburg
    Member

    I don't really see the connection with bpo-1629305.

    An optimization that would be worth checking is hooking up the
    Py_UNICODE pointer to interned Unicode objects if the contents match
    (e.g. do a quick check based on the hash value and then a memcmp). That
    would save memory and the call to the pymalloc allocator.

    Another strategy could involve a priority queue style cache with the aim
    of identifying often used Unicode strings and then reusing them.

    This could also be enhanced using an offline approach: you first run an
    application with an instrumented Python interpreter to find the most
    often used strings and then pre-fill the cache or interned dictionary on
    the production Python interpreter at startup time.

    Coming from a completely different angle, you could also use the
    Py_UNICODE pointer to share slices of a larger data buffer. A Unicode
    sub-type could handle this case, keeping a PyObject* reference to the
    larger buffer, so that it doesn't get garbage collected before the
    Unicode slice.
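
    A minimal sketch of such a buffer-sharing subclass (field names are
    hypothetical; this is not the actual uniref/unicoderef code):

        typedef struct {
            PyUnicodeObject base;   /* base.str points into owner's buffer,
                                       base.length is the slice length */
            PyObject *owner;        /* strong reference keeping the object
                                       that owns the larger buffer alive */
        } UnicodeSliceObject;

        /* Slicing then only adjusts base.str/base.length and INCREFs owner;
           no character data is copied. */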

    Regarding memory constrained environments: these should simply switch
    off all free lists and pymalloc. OTOH, even mobile phones come with
    gigabytes of RAM nowadays, so it's not really worth the trouble, IMHO.

    @pitrou
    Member Author

    pitrou commented Jan 27, 2008

    All of those proposals are much heavier to implement; they also make the
    unicode type more complicated and difficult to maintain probably (while
    turning it into a PyVarObject actually shortens the implementation a
    bit). In any case, it's not something I want to tackle.

    The reason that I mentioned bpo-1629305 was that it was such an
    optimization which complicated the unicode type while bringing very
    significant speedups.

    @malemburg
    Member

    Agreed, those optimizations do make the implementation more complicated.
    It's also not clear whether they would really be worth it.

    bpo-1629305 only provided speedups for the case where you write s += 'abc'.
    The usual idiom for this is to use a list and then concatenate in one
    go. If you want a really fast approach, you'd use cStringIO or perhaps
    the bufferarray. I don't think that optimizing for just one particular
    use case warrants making the code more complicated or changing the C
    interface radically.

    In your case, I think that closing the door on being able to easily
    extend the type at the C level is the main argument against it. The
    speedups are only marginal and can also be achieved (to some extent) by
    tuning the existing implementation's parameters.

    @pitrou
    Member Author

    pitrou commented Jan 27, 2008

    I know it's not the place to discuss bpo-1629305, but the join() solution
    is not always faster. Why? Because 1) there's the list construction and
    method call overhead 2) ceval.c has some bytecode hackery to try and
    make plain concatenations somewhat less slow.

    As for cStringIO, I actually observed in some experiments that it was
    slower than the other alternatives, at least for short strings. All in
    all, choosing between the three idioms is far from obvious and needs
    case-by-case analysis.

    @pitrou
    Member Author

    pitrou commented Jan 29, 2008

    FWIW, I tried using the freelist scheme introduced in my patch without
    making PyUnicode a PyVarObject and, although it's better than the
    standard version, it's still not as good as the PyVarObject version.
    Would you be interested in that patch?

    @malemburg
    Member

    Yes, definitely.

    Some comments on style in your first patch:

    • please use unicode->length instead of the macro LENGTH you added
    • indents in unicodeobject.c are 4 spaces
    • line length should stay below 80

    @pitrou
    Member Author

    pitrou commented Jan 30, 2008

    After some more tests I must qualify what I said. The freelist patch is
    an improvement in some situations. In others it does not really have any
    impact. On the other hand, the PyVarObject version handles memory-bound
    cases dramatically better, see below.

    With a small string:
    ./python -m timeit -s "s=open('INTBENCH', 'r').read()" "s.split()"
    -> Unpatched py3k: 26.4 usec per loop
    -> Freelist patch: 21.5 usec per loop
    -> PyVarObject patch: 20.2 usec per loop

    With a medium-sized string:
    ./python -m timeit -s "s=open('LICENSE', 'r').read()" "s.split()"
    -> Unpatched py3k: 458 usec per loop
    -> Freelist patch: 408 usec per loop
    -> PyVarObject patch: 316 usec per loop

    With a long string:
    ./python -m timeit -s "s=open('Misc/HISTORY', 'r').read()" "s.split()"
    -> Unpatched py3k: 31.3 msec per loop
    -> Freelist patch: 32.7 msec per loop
    -> PyVarObject patch: 17.8 msec per loop

    (the numbers are better than in my previous posts because the
    split-accelerating patch has been integrated)

    Also, given those results, it is also clear that neither pybench nor
    stringbench really stress memory efficiency of strings, only processing
    algorithms.

    That said, the freelist patch is attached.

    @pitrou
    Member Author

    pitrou commented Feb 12, 2008

    Here is an updated patch against the current py3k branch, and with
    spaces instead of tabs for indentation.

    @pitrou
    Member Author

    pitrou commented Feb 16, 2008

    Here is an updated patch, to comply with the introduction of the
    PyUnicode_ClearFreeList() function.

    @jafo
    Mannequin

    jafo mannequin commented Mar 19, 2008

    Marc-Andre: With the updated patches, is this a set of patches we can accept?

    @jafo jafo mannequin assigned malemburg Mar 19, 2008
    @pitrou
    Member Author

    pitrou commented Mar 19, 2008

    Thanks for your interest Sean :)
    By the way, on python-3000 both GvR and Martin von Löwis were ok with the
    proposed design change, although they did not review the patch itself.
    http://mail.python.org/pipermail/python-3000/2008-February/012076.html

    @malemburg
    Member

    Antoine, as I've already mentioned in my other comments, I'm -1 on
    changing the Unicode object to a variable size object.

    I also don't think that the micro-benchmarks you are applying really do
    test the implementation in real-life situations. The only case where
    your patch appears significantly faster is the "long string" case. If
    you look at the distribution of the Unicode strings generated by this
    case, you'll find that most strings have less than 10-20 characters.
    Optimizing pymalloc for these cases and tuning the parameters in the
    Unicode implementation will likely give you the same performance
    increase without having to sacrifice the advantages of using a pointer
    in the object rather than inlining the data.

    I'm +1 on the free list changes, though, in the long run, I think that
    putting more efforts into improving pymalloc and removing the free lists
    altogether would be better.

    BTW: Unicode slices would be a possible and fairly attractive target for
    a C level subclass of Unicode objects. The pointer in the Unicode object
    implementation could then point to the original Unicode object's buffer
    and the subclass would add an extra pointer to the original object
    itself (in order to keep it alive). The Unicode type (written by Fredrik
    Lundh) I used as basis for the Unicode implementation worked with this idea.

    @amauryfa
    Member

    Marc-Andre: don't all your objections also apply to the 8bit string
    type, which is already a variable-size structure?
    Is extending unicode more common than extending str?

    With python 3.0, all strings are unicode. Shouldn't this type be
    optimized for the same use cases of the previous str type?

    @malemburg
    Member

    Yes, all those objections apply to the string type as well. The fact
    that strings are variable length objects makes it impossible to apply
    any of the possible optimizations I mentioned. If strings were a fixed
    length object, it would have been possible to write a true basestring
    object from which both 8-bit strings and Unicode could subclass, making a lot
    of things easier.

    BTW: Please also see ticket bpo-2321 to see how the change affects your
    benchmarks.

    @pitrou
    Member Author

    pitrou commented Mar 20, 2008

    Hi,

    Marc-André, I'm all for "real-life" benchmarks if someone proposes some.
    Until that we have to live with micro-benchmarks, which seems to be the
    method used for other CPython optimizations by the way.

    You are talking about slicing optimizations but you forget that the
    "lazy slices" idea has been shot down by Guido and others when proposed
    by Larry Hastings (with a patch) some time ago. Also the "lazy slices"
    patch was for 8-bit strings, which *are* variable-size objects, which
    seems to counter your argument that variable-size Unicode objects would
    prevent making such optimizations.

    As I said the freelist changes actually have mixed consequences, and in
    general don't improve much (and that improvement is just backed by
    micro-benchmarks after all... why would it be more convincing than the
    far greater improvement brought by making Unicode objects variable-size
    objects?).

    Why wouldn't you express your arguments in the python-3000 thread I
    launched one month ago, so that at least there is a clear picture of the
    different arguments for and against this approach?

    @malemburg
    Member

    Regarding benchmarks: It's difficult to come up with decent benchmarks
    for things like this. A possible strategy is to use an instrumented
    interpreter that records which Unicode objects are created and when they
    are deleted. If you then run this instrumented interpreter against a few
    larger applications such as Zope for a while, you'll get a data set that
    can be used as input for a more real-life like benchmark. I've done this
    in the past for Python byte codes to see which are used how often -
    based on such data you can create optimizations that have a real-life
    effect.

    Regarding the lazy slice patches: those were not using subclassing, they
    were patches to the existing type implementations. With subclassing you
    don't affect all uses of an object, but instead let the user control
    which uses should take advantage of the slice operations. Since the user
    also controls the objects which are kept alive this way, the main
    argument against Unicode slice objects goes away. This is different than
    patching the type itself and doesn't change the main object
    implementation in any way. Furthermore, it's possible to implement and
    provide such subclasses outside the Python core, which gives developers
    a lot more freedom.

    Regarding discussions on the py3k list: I'm not on that list, since I
    already get more email than I can manage. I am on the python-dev list,
    if you want to take up the discussions there.

    @pitrou
    Member Author

    pitrou commented Mar 20, 2008

    Well I'm not subscribed to the python-3k list either - too much traffic
    indeed. You can read and post into it with gmane for example:
    http://thread.gmane.org/gmane.comp.python.python-3000.devel/11768
    (there is probably an NNTP gateway too)

    As for instrumenting the interpreter, this would tell us when and which
    strings are allocated, but not the precise effect it has on a modern
    CPU's memory subsystem (which is quite a complicated thing). Also, this
    is Py3k - we can't test any real-life applications until they are ported
    to it...
    (someone really motivated would have to learn Intel VTune or AMD
    CodeAnalyst and run such a "py3k real-life application" with it :-))

    As for the explicit slicing approach, "explicit string views" have been
    discussed in 2006 on the python-3k list, including a proof-of-concept
    patch by Josiah Carlson - without any definitive pronouncement it seems.
    See the subthread here:
    http://mail.python.org/pipermail/python-3000/2006-August/003280.html

    The reason I'm bringing in those previous discussions is that, in
    theory, I'm all for much faster concatenation and slicing thanks to
    buffer sharing etc., but realistically the previous attempts have failed
    for various reasons (either technical or political). And since those
    optimizations are far from simple to implement and maintain, chances are
    they will not be attempted again, let alone accepted. I'd be happy to be
    proven wrong :)

    regards

    Antoine.

    @malemburg
    Member

    I've read the comments from Guido and Martin, but they don't convince me
    in changing my -1.

    As you say: it's difficult to get support for optimizations such as
    slicing and concatenation into the core. And that's exactly why I want
    to keep the door open for developers to add such support via extension
    modules. With the 8-bit string implementation this was never possible.

    With the Unicode implementation and the subclassing support for builtin
    types, it is possible to add support without having to go through all
    the hoops and discussions on python-dev or py3k lists.

    I'm also for making Python faster, but not if it limits future
    optimizations and produces dead-ends. The fact that such optimizations
    haven't been implemented yet doesn't really mean anything either -
    people are not yet using Unicode intensively enough to let the need arise.

    @pitrou
    Member Author

    pitrou commented Mar 20, 2008

    Well, I'm not gonna try to defend my patch eternally :)
    I understand your opinion even if I find it a bit disturbing that we refuse
    a concrete, actual optimization on the basis of future hypothetical ones.

    Since all the arguments have been laid down, I'll let other developers
    decide whether further discussion of this proposal is useful or not.

    @malemburg
    Member

    Regarding the benchmark: You can instrument a 2.x version of the
    interpreter to build the data set and then have the test load this data
    set in Py3k and have it replay the allocation/deallocation in the same
    way it was done on the 2.x system.

    I also expect that patch bpo-2321 will have an effect on the performance
    since that patch changed the way memory allocation of the buffer was
    done (from using the system malloc() to using pymalloc's allocator for
    the buffer as well).

    @pitrou
    Member Author

    pitrou commented Mar 20, 2008

    You are right, bpo-2321 made the numbers a bit tighter:

    With a small string:
    ./python -m timeit -s "s=open('INTBENCH', 'r').read()" "s.split()"
    -> Unpatched py3k: 23.1 usec per loop
    -> Freelist patch: 21.3 usec per loop
    -> PyVarObject patch: 20.5 usec per loop

    With a medium-sized string:
    ./python -m timeit -s "s=open('LICENSE', 'r').read()" "s.split()"
    -> Unpatched py3k: 406 usec per loop
    -> Freelist patch: 353 usec per loop
    -> PyVarObject patch: 314 usec per loop

    With a long string:
    ./python -m timeit -s "s=open('Misc/HISTORY', 'r').read()" "s.split()"
    -> Unpatched py3k: 22.7 msec per loop
    -> Freelist patch: 24 msec per loop
    -> PyVarObject patch: 20.6 msec per loop

    stringbench3k:
    -> Unpatched py3k: 266 seconds
    -> Freelist patch: 264 seconds
    -> PyVarObject patch: 249 seconds

    Regarding your benchmarking suggestion, this would certainly be an
    interesting thing to do, but I fear it is also much more than I'm
    willing to do...

    I'm going to post the updated patches.

    @malemburg
    Member

    Thanks for running the tests again. The use of pymalloc for the buffer
    made a significant difference indeed. I expect that more can be had by
    additionally tweaking KEEPALIVE_SIZE_LIMIT.

    It is interesting to see that the free list patch only appears to
    provide better timings for the "smaller" strings tests. I couldn't find
    the INTBENCH file anywhere in the Python source tree, but here's the
    distribution of word lengths in the latter two files:

    Dev-Python/LICENSE:
    ------------------------------------------------------------------------
    Length: [Count] 0----1----2----3----4----5----6----7----8----9----10 * 10%
    1: [ 28] ===
    2: [ 350] =================================================
    3: [ 354] ==================================================
    4: [ 190] ==========================
    5: [ 191] ==========================
    6: [ 158] ======================
    7: [ 164] =======================
    8: [ 132] ==================
    9: [ 127] =================
    10: [ 102] ==============
    11: [ 39] =====
    12: [ 34] ====
    13: [ 21] ==
    14: [ 10] =
    15: [ 10] =
    16: [ 0]
    17: [ 0]
    18: [ 1]
    19: [ 0]
    20: [ 0]
    21: [ 1]
    22: [ 0]
    23: [ 0]
    24: [ 0]
    25: [ 1]
    26: [ 1]
    27: [ 1]
    28: [ 0]
    29: [ 1]
    30: [ 0]
    31: [ 0]
    32: [ 0]
    33: [ 0]
    34: [ 0]
    35: [ 0]
    36: [ 2]
    37: [ 0]
    38: [ 0]
    39: [ 1]
    40: [ 0]
    41: [ 0]
    42: [ 0]
    43: [ 1]
    44: [ 1]
    45: [ 0]
    46: [ 0]
    47: [ 0]
    48: [ 0]
    49: [ 0]
    50: [ 1]
    51: [ 0]
    52: [ 0]
    53: [ 0]
    54: [ 0]
    55: [ 0]
    56: [ 0]
    57: [ 0]
    58: [ 0]
    59: [ 0]
    60: [ 0]
    61: [ 0]
    62: [ 0]
    63: [ 1]

    Dev-Python/Misc/HISTORY:
    ------------------------------------------------------------------------
    Length: [Count] 0----1----2----3----4----5----6----7----8----9----10 * 10%
    1: [ 6853] ==================
    2: [13920] =====================================
    3: [18401] ==================================================
    4: [12626] ==================================
    5: [ 9545] =========================
    6: [ 9348] =========================
    7: [ 9625] ==========================
    8: [ 7351] ===================
    9: [ 5353] ==============
    10: [ 3266] ========
    11: [ 1947] =====
    12: [ 1336] ===
    13: [ 983] ==
    14: [ 638] =
    15: [ 408] =
    16: [ 288]
    17: [ 286]
    18: [ 216]
    19: [ 176]
    20: [ 147]
    21: [ 120]
    22: [ 116]
    23: [ 85]
    24: [ 89]
    25: [ 70]
    26: [ 44]
    27: [ 59]
    28: [ 39]
    29: [ 32]
    30: [ 65]
    31: [ 23]
    32: [ 26]
    33: [ 28]
    34: [ 19]
    35: [ 18]
    36: [ 9]
    37: [ 18]
    38: [ 5]
    39: [ 10]
    40: [ 9]
    41: [ 9]
    42: [ 1]
    43: [ 1]
    44: [ 8]
    45: [ 4]
    46: [ 5]
    47: [ 5]
    48: [ 3]
    49: [ 3]
    50: [ 4]
    51: [ 0]
    52: [ 0]
    53: [ 2]
    54: [ 0]
    55: [ 0]
    56: [ 4]
    57: [ 2]
    58: [ 4]
    59: [ 1]
    60: [ 0]
    61: [ 1]
    62: [ 0]
    63: [ 1]
    64: [ 1]
    65: [ 9]
    66: [ 1]
    67: [ 1]
    68: [ 0]
    69: [ 0]
    70: [ 16]
    71: [ 1]
    72: [ 0]
    73: [ 1]

    Compare that to a typical Python module source file...

    Dev-Python/Lib/urllib.py:
    ------------------------------------------------------------------------
    Length: [Count] 0----1----2----3----4----5----6----7----8----9----10 * 10%
    1: [ 806] ==================================================
    2: [ 672] =========================================
    3: [ 675] =========================================
    4: [ 570] ===================================
    5: [ 531] ================================
    6: [ 501] ===============================
    7: [ 296] ==================
    8: [ 393] ========================
    9: [ 246] ===============
    10: [ 147] =========
    11: [ 150] =========
    12: [ 102] ======
    13: [ 90] =====
    14: [ 116] =======
    15: [ 61] ===
    16: [ 51] ===
    17: [ 45] ==
    18: [ 38] ==
    19: [ 31] =
    20: [ 39] ==
    21: [ 24] =
    22: [ 18] =
    23: [ 18] =
    24: [ 23] =
    25: [ 17] =
    26: [ 15]
    27: [ 13]
    28: [ 14]
    29: [ 11]
    30: [ 9]
    31: [ 7]
    32: [ 1]
    33: [ 6]
    34: [ 10]
    35: [ 2]
    36: [ 4]
    37: [ 3]
    38: [ 6]
    39: [ 1]
    40: [ 0]
    41: [ 1]
    42: [ 5]
    43: [ 0]
    44: [ 0]
    45: [ 0]
    46: [ 0]
    47: [ 0]
    48: [ 2]
    49: [ 0]
    50: [ 1]
    51: [ 1]
    52: [ 2]

    The distributions differ a lot, but they both show that typical strings
    (both in English text and Python program code) have a length of < 25
    characters.

    Setting KEEPALIVE_SIZE_LIMIT to 32 should cover most of those cases
    while still keeping the memory load within bounds. Raising the free list
    limit to e.g. 2048 would cause at most 64kB to be used by the free list,
    which is well within bounds of any modern CPU L2 cache.

    @malemburg
    Member

    Antoine Pitrou wrote:

    Antoine Pitrou <pitrou@free.fr> added the comment:

    Raymond suggested the patch be committed in 3.1, so as to minimize
    disruption between 3.1 and 3.2. Benjamin, what do you think?

    Has Guido pronounced on this already ?

    @gvanrossum
    Member

    On Fri, Jun 5, 2009 at 4:06 AM, Marc-Andre Lemburg
    <report@bugs.python.org> wrote:

    Marc-Andre Lemburg <mal@egenix.com> added the comment:

    Antoine Pitrou wrote:
    > Antoine Pitrou <pitrou@free.fr> added the comment:
    >
    > Raymond suggested the patch be committed in 3.1, so as to minimize
    > disruption between 3.1 and 3.2. Benjamin, what do you think?

    Has Guido pronounced on this already ?

    I don't want it added to 3.1 unless we start the beta cycle afresh.
    It's too subtle for submitting after rc1. Talk to Benjamin if you
    disagree.

    I think it's fine to wait for 3.2. Maybe add something to the docs
    about not subclassing unicode in C.

    @malemburg
    Member

    Guido van Rossum wrote:

    I think it's fine to wait for 3.2. Maybe add something to the docs
    about not subclassing unicode in C.

    We should have a wider discussion about this on python-dev.

    I'll publish the unicoderef extension and then we can see
    whether users want this or not.

    Antoine's patch makes such extensions impossible (provided you
    don't want to copy over the complete unicodeobject.c
    implementation in order to change the memory allocation
    scheme).

    Note that in Python 2.x you don't have such issues because
    there, most tools for text processing will happily work on
    any sort of buffer, so you don't need a string sub-type
    in order to implement e.g. references into another string
    (the buffer type will allow you to do this easily).

    @pitrou
    Member Author

    pitrou commented Jun 5, 2009

    Note that in Python 2.x you don't have such issues because
    there, most tools for text processing will happily work on
    any sort of buffer, so you don't need a string sub-type
    in order to implement e.g. references into another string
    (the buffer type will allow you to do this easily).

    The new buffer API has a provision for type flags, although none of them
    had a chance to be implemented by the original author before he ran
    away...
    There could be a type flag for unicode characters and then its support
    could be implemented in the memoryview object.

    @terryjreedy
    Member

    In the interest of possibly improving the imminent 3.1 release,
    I opened bpo-6216
    Raise Unicode KEEPALIVE_SIZE_LIMIT from 9 to 32?

    I wonder if it is possible to make it generically easier to subclass
    PyVarObjects (but my C knowledge is getting too faded to have any ideas).

    @malemburg
    Member

    Terry J. Reedy wrote:

    In the interest of possibly improving the imminent 3.1 release,
    I opened bpo-6216
    Raise Unicode KEEPALIVE_SIZE_LIMIT from 9 to 32?

    Thanks for opening that ticket.

    I wonder if it is possible to make it generically easier to subclass
    PyVarObjects (but my C knowledge is getting too faded to have any ideas).

    Even if we were to add some pointer arithmetic tricks to at least
    hide away the complexities, you'd no longer be able to change the
    way the data memory allocation works.

    The reason is simple: subclassing is about reusing existing method
    implementations and only adding/changing a few of them.

    If you want to change the way the allocation works, you'd have
    to replace all of them.

    Furthermore, using your subclassed objects with the existing APIs
    would no longer be safe, since these would still assume the
    original base class memory allocation scheme.

    In summary:

    Implementations like the unicoderef type I posted
    and most of the other use cases I mentioned are no longer
    possible, ie. you will *always* have to copy the data in order
    to work with the existing Unicode APIs on it.

    The current implementation has no problem with working on referenced
    data, since support for this was built in right from the start.

    That's what I meant by closing the door on future enhancements
    that would make a huge difference if used right, for a mere
    10% performance increase.

    @Rhamphoryncus
    Mannequin

    Rhamphoryncus mannequin commented Jan 10, 2010

    Points against the subclassing argument:

    • We have a null-termination invariant. For byte strings this was part of the public API, and I'm not sure that's changed for unicode strings; aren't you arguing that we should maximize how much of our implementation is a public API? This prevents lazy slicing.

    • UTF-16 and UTF-32 are rarely used encodings, especially for longer strings (ie files). For shorter strings (APIs) the unicode object overhead is more significant and we'd need a way to slave the buffer's lifetime to that of the unicode object (hard to do). For longer strings UTF-8 would be much more useful, but that's been shot down before.

    • subclassing unicode so you can change the meaning of the fields (ie allocating your own buffer) is a gross hack. It relies far too much on fine details of the implementation and is fragile (what if you miss the dummy byte needed by fastsearch?) Most of the possible options could be, if they function correctly, applied directly to the basetype as a patch, so it's moot.

    • If you dislike PyVarObject in general (I think the API is ugly too) you should argue for a general policy discouraging future use of it, not just get in the way of the one place where it's most appropriate

    Terry: PyVarObjects would be much easier to subclass if the type object stored an offset to the beginning of the variable section, so it could be automatically recalculated for subclasses based on the size of the struct. This'd mean the PyBytesObject struct would no longer end with a char ob_sval[1]. The down side is a tiny bit more math when accessing the variable section (as the offset is no longer constant).
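
    A rough sketch of that suggestion (tp_varoffset is a hypothetical slot,
    not something the type object currently has):

        /* Each type records where its variable-length section starts; a
           subclass that appends fixed-size fields stores a larger offset,
           and accessors compute the address at runtime. */
        #define PyVar_DATA(op) \
            ((char *)(op) + Py_TYPE(op)->tp_varoffset)

        /* e.g. bytes would read its characters via PyVar_DATA(self) instead
           of self->ob_sval, at the cost of one extra addition per access. */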

    @malemburg
    Member

    Adam Olsen wrote:

    Adam Olsen <rhamph@gmail.com> added the comment:

    Points against the subclassing argument:

    • We have a null-termination invariant. For byte strings this was part of the public API, and I'm not sure that's changed for unicode strings; aren't you arguing that we should maximize how much of our implementation is a public API? This prevents lazy slicing.

    Base type Unicode buffers end with a null-Py_UNICODE termination,
    but this is not used anywhere, AFAIK. We could probably remove
    that overallocation at some point.

    There's no such thing as a null-termination invariant for Unicode.

    • subclassing unicode so you can change the meaning of the fields (ie allocating your own buffer) is a gross hack. It relies far too much on fine details of the implementation and is fragile (what if you miss the dummy byte needed by fastsearch?) Most of the possible options could be, if they function correctly, applied directly to the basetype as a patch, so it's moot.

    Actually, Unicode objects were designed to be subclassable right
    from the start and adjusting the buffer to point e.g. into some
    other already allocated string was too. I removed this feature from
    Fredrik's type implementation with the intent to re-add it later on as
    a subclass.

    See the prototype implementation of such a subclass uniref that I've
    written to show how easy it is to add a subclass which can be used
    to slice large Unicode objects without having to reallocate new
    buffers all the time.

    BTW, I'm not aware of any changes to the PyUnicodeObject by some
    fastsearch implementation. Could you point me to this ?

    @amauryfa
    Member

    Base type Unicode buffers end with a null-Py_UNICODE termination,
    but this is not used anywhere, AFAIK
    On Windows, code like
    CreateDirectoryW(PyUnicode_AS_UNICODE(po), NULL)
    is very common, at least in posixmodule.c.

    @mdickinson
    Member

    I find that the null termination for 8-bit strings makes low-level parsing operations (e.g., parsing a numeric string) safer and easier: for example, it makes skipping a series of digits with something like:

    while (isdigit(*s)) ++s;

    safe. I'd imagine that null terminated PyUNICODE arrays would have similar benefits.
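
    For instance, with a 0-terminated Py_UNICODE buffer the same idiom would
    presumably carry over (sketch):

        while (Py_UNICODE_ISDIGIT(*p)) ++p;   /* stops at the terminating 0,
                                                  which is not a digit */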

    @pitrou
    Member Author

    pitrou commented Jan 10, 2010

    I find that the null termination for 8-bit strings makes low-level
    parsing operations (e.g., parsing a numeric string) safer and easier:

    Not to mention faster. The new IO library makes use of it (for newline
    detection), on both bytestrings and unicode strings.

    @Rhamphoryncus
    Mannequin

    Rhamphoryncus mannequin commented Jan 11, 2010

    On Sun, Jan 10, 2010 at 14:59, Marc-Andre Lemburg
    <report@bugs.python.org> wrote:

    BTW, I'm not aware of any changes to the PyUnicodeObject by some
    fastsearch implementation. Could you point me to this ?

    /* We allocate one more byte to make sure the string is Ux0000 terminated.
       The overallocation is also used by fastsearch, which assumes that it's
       safe to look at str[length] (without making any assumptions about what
       it contains). */
    

    @malemburg
    Member

    Antoine Pitrou wrote:

    Antoine Pitrou <pitrou@free.fr> added the comment:

    > I find that the null termination for 8-bit strings makes low-level
    > parsing operations (e.g., parsing a numeric string) safer and easier:

    Not to mention faster. The new IO library makes use of it (for newline
    detection), on both bytestrings and unicode strings.

    I'd consider that a bug. Esp. the IO lib should be 8-bit clean
    in the sense that it doesn't add any special meaning to NUL
    characters or code points.

    Besides, using a for-loop with a counter is both safer and faster
    than checking each and every character for NUL.

    Just think of what can happen if you have buggy code that overwrites
    the NUL byte in some corner case situation and then use the assumption
    of having the NUL byte as terminator - a classical buffer overrun.

    If you're lucky, you get a segfault. If not, you end up with
    data corruption or manipulation of data which could lead to
    unwanted code execution.

    The Python Unicode API deliberately tries to always use the combination
    of a Py_UNICODE* pointer and a length integer to avoid such issues.

    @pitrou
    Member Author

    pitrou commented Feb 1, 2010

    I'd consider that a bug. Esp. the IO lib should be 8-bit clean
    in the sense that it doesn't add any special meaning to NUL
    characters or code points.

    It doesn't add any special meaning to them. It just relies on a NUL
    being present after the end of the string. It doesn't care about other
    NULs.

    Besides, using a for-loop with a counter is both safer and faster
    than checking each and every character for NUL.

    It's slower, since it has one more condition to check.
    Newline detection as it is written has a fast path in the form of:
    while (*c++ >= 0x20);

    Just think of what can happen if you have buggy code that overwrites
    the NUL byte in some corner case situation and then use the assumption
    of having the NUL byte as terminator - a classical buffer overrun.

    Well, buggy code leads to bugs :)

    @amauryfa
    Member

    amauryfa commented Feb 1, 2010

    Again, on Windows there are many many usages of PyUnicode_AS_UNICODE() that pass the result to various Windows API functions, expecting a nul-terminated array of WCHARs. Please don't change this!

    @malemburg
    Member

    Amaury Forgeot d'Arc wrote:

    Amaury Forgeot d'Arc <amauryfa@gmail.com> added the comment:

    > Base type Unicode buffers end with a null-Py_UNICODE termination,
    > but this is not used anywhere, AFAIK
    On Windows, code like
    CreateDirectoryW(PyUnicode_AS_UNICODE(po), NULL)
    is very common, at least in posixmodule.c.

    The above usage is clearly wrong. PyUnicode_AS_UNICODE() should
    only be used to access Py_UNICODE data directly when
    working on Python Unicode objects. The macro is not meant
    to be used directly in external APIs.

    For such uses, the Unicode conversion APIs need to be used,
    e.g. the PyUnicode_AsWideChar() API. These will then also
    apply any 0-termination as necessary.

    Note that Python is free to change the meaning of Py_UNICODE
    (e.g. to use UCS4 on all platforms) or Unicode implementation
    details (such as e.g. the 0-termination) and this would then break
    any use such as the one you quoted.

    @amauryfa
    Member

    amauryfa commented Feb 1, 2010

    Then there are many places to change, in core python as well as in third-party code. And PyArg_ParseTuple("u") would not work any more.

    @amauryfa
    Member

    amauryfa commented Feb 1, 2010

    Note that Python is free to change the meaning of Py_UNICODE
    (e.g. to use UCS4 on all platforms)

    Python-UCS4 has never worked on Windows. Most developers on Windows, following the example of core Python source code, implicitly assumed that HAVE_USABLE_WCHAR_T is true, and use the PyUnicode* API the same way they use the PyString* functions.

    PyString_AsString is documented to return a nul-terminated array. If PyUnicode_AsUnicode starts to behave differently, people will have more trouble to port their modules to py3k.
    This is not an implementation detail.

    @malemburg
    Member

    > If PyUnicode_AsUnicode starts to behave differently, people will have
    > more trouble to port their modules to py3k.
    >
    > This is not an implementation detail.

    It is, otherwise I would have documented it. The fact that some
    developers are not using those APIs correctly doesn't change that.

    Note that PyUnicode_AsUnicode() only returns a pointer to the
    Py_UNICODE buffer. It makes no guarantees on the 0-termination.
    Developers need to use PyUnicode_GetSize() to access the size
    of the Unicode string.

    But no worries: We're not going to change it. It's too late
    after 10 years in the wild.

    Still, developers will have to be aware that 0-termination is not a
    guaranteed Unicode feature and should stop making that assumption;
    it will not necessarily hold or be guaranteed for Unicode subclasses.

    @pitrou
    Member Author

    pitrou commented Feb 1, 2010

    On Monday, 1 February 2010 at 19:21 +0000, Marc-Andre Lemburg wrote:

    > This is not an implementation detail.

    It is, otherwise I would have documented it.

    Ok, so the current allocation scheme of unicode objects is an
    implementation detail as well, right?

    @terryjreedy
    Member

    It is, otherwise I would have documented it. The fact that some
    developers are not using those APIs correctly doesn't change that.

    If, as Antoine claimed, 'it' is a documented feature of str strings, and Py3 says str = Unicode, it is a plausible inference.

    @BreamoreBoy
    Mannequin

    BreamoreBoy mannequin commented Jul 17, 2010

    @antoine: do you wish to try and take this forward?

    @gvanrossum gvanrossum removed their assignment Aug 21, 2010
    @BreamoreBoy
    Mannequin

    BreamoreBoy mannequin commented Sep 20, 2010

    No reply to msg110599, I'll close this in a couple of weeks unless anyone objects.

    @benjaminp
    Contributor

    2010/9/20 Mark Lawrence <report@bugs.python.org>:

    Mark Lawrence <breamoreboy@yahoo.co.uk> added the comment:

    No reply to msg110599, I'll close this in a couple of weeks unless anyone objects.

    Please don't. This is still a valid issue.

    @pitrou
    Member Author

    pitrou commented Oct 6, 2010

    Updated patch against current py3k.

    @amauryfa
    Member

    I just found that the extension zope.i18nmessageid: http://pypi.python.org/pypi/zope.i18nmessageid subclasses unicode at the C level:
    http://svn.zope.org/zope.i18nmessageid/trunk/src/zope/i18nmessageid/_zope_i18nmessageid_message.c?rev=120914&view=markup

    Notably, the Message structure is defined this way:
    typedef struct {
      PyUnicodeObject base;
      PyObject *domain;
      PyObject *default_;
      PyObject *mapping;
    } Message;

    How would such an extension type behave after the patch? Is there a workaround we can propose?

    @vstinner
    Member

    PEP-393 is based on the idea proposed in this issue (use only one memory block, not two), but also enhances it to reduce memory further using other techniques:

    • use a different character type depending on the maximum character,
    • use a shorter structure for ASCII-only strings

    PEP-393 has been accepted and merged into Python 3.3, so I consider this issue done.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022