
Marshal: performance regression with versions 3 and 4 #64615

Closed
vstinner opened this issue Jan 28, 2014 · 25 comments
Labels
interpreter-core Interpreter core (Objects, Python, Grammar, and Parser dirs)
performance Performance or resource usage

vstinner commented Jan 28, 2014

BPO 20416
Nosy @loewis, @pitrou, @kristjanvalur, @vstinner, @larryhastings, @serhiy-storchaka, @vajrasky
Files
  • marshal3_numbers.patch
  • bench.py
  • marshal_small_ints_refs.patch
  • marshal_refs_by_value.patch
  • bench_issue20416.py
  • marshal_refs_by_value_3.patch
  • marshal_hashtable.patch
  • marshal_hashtable_2.patch
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/serhiy-storchaka'
    closed_at = <Date 2015-02-11.15:53:15.720>
    created_at = <Date 2014-01-28.10:10:48.757>
    labels = ['interpreter-core', 'performance']
    title = 'Marshal: performance regression with versions 3 and 4'
    updated_at = <Date 2015-02-11.15:53:15.719>
    user = 'https://github.com/vstinner'

    bugs.python.org fields:

    activity = <Date 2015-02-11.15:53:15.719>
    actor = 'serhiy.storchaka'
    assignee = 'serhiy.storchaka'
    closed = True
    closed_date = <Date 2015-02-11.15:53:15.720>
    closer = 'serhiy.storchaka'
    components = ['Interpreter Core']
    creation = <Date 2014-01-28.10:10:48.757>
    creator = 'vstinner'
    dependencies = []
    files = ['33770', '33771', '34071', '37894', '37895', '38012', '38013', '38088']
    hgrepos = []
    issue_num = 20416
    keywords = ['patch']
    message_count = 25.0
    messages = ['209524', '209525', '209526', '209528', '209530', '209531', '209532', '209533', '209534', '209535', '211164', '211173', '211691', '234903', '235050', '235384', '235385', '235469', '235479', '235485', '235713', '235721', '235722', '235746', '235753']
    nosy_count = 8.0
    nosy_names = ['loewis', 'pitrou', 'kristjan.jonsson', 'vstinner', 'larry', 'python-dev', 'serhiy.storchaka', 'vajrasky']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue20416'
    versions = ['Python 3.5']


    vstinner commented Jan 28, 2014

The attached patch disables references for the int and float types.

    @vstinner vstinner added the performance Performance or resource usage label Jan 28, 2014

    vstinner commented Jan 28, 2014

Use the attached bench.py to compare performance.

    Without the patch:
    ---
    dumps v0: 389.6 ms
    data size v0: 45582.9 kB
    loads v0: 573.3 ms

    dumps v1: 391.4 ms
    data size v1: 45582.9 kB
    loads v1: 558.0 ms

    dumps v2: 166.9 ms
    data size v2: 41395.4 kB
    loads v2: 482.2 ms

    dumps v3: 431.2 ms
    data size v3: 41395.4 kB
    loads v3: 543.8 ms

    dumps v4: 361.8 ms
    data size v4: 37000.9 kB
    loads v4: 560.4 ms
    ---

    With the patch:
    ---
    dumps v0: 391.4 ms
    data size v0: 45582.9 kB
    loads v0: 578.2 ms

    dumps v1: 392.3 ms
    data size v1: 45582.9 kB
    loads v1: 556.8 ms

    dumps v2: 167.7 ms
    data size v2: 41395.4 kB
    loads v2: 484.6 ms

    dumps v3: 170.3 ms
    data size v3: 41395.4 kB
    loads v3: 467.0 ms

    dumps v4: 122.8 ms
    data size v4: 37000.9 kB
    loads v4: 468.9 ms
    ---

    dumps v3 is 60% faster, loads v3 is also 14% *faster*.

    dumps v4 is 66% faster, loads v4 is 16% faster.
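bench.py itself is an attachment and is not reproduced here; a hypothetical stand-in that prints results in the same shape (the data set below is invented for illustration, not the one bench.py uses) could look like:

```python
# Hypothetical stand-in for the attached bench.py: time dumps()/loads()
# for each marshal format version and report the serialized size.
import marshal
import time

def bench(data):
    for ver in range(marshal.version + 1):  # versions 0..4 on Python 3.4+
        t0 = time.perf_counter()
        buf = marshal.dumps(data, ver)
        t1 = time.perf_counter()
        marshal.loads(buf)
        t2 = time.perf_counter()
        print("dumps v%d: %.1f ms" % (ver, (t1 - t0) * 1e3))
        print("data size v%d: %.1f kB" % (ver, len(buf) / 1024))
        print("loads v%d: %.1f ms" % (ver, (t2 - t1) * 1e3))

if __name__ == "__main__":
    # Invented sample data; the real bench.py's data set is an attachment.
    bench([(i, float(i), str(i)) for i in range(100000)])
```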


    vstinner commented Jan 28, 2014

    Performance of Python 3.3.3+:
    ---
    dumps v0: 374.8 ms
    data size v0: 45582.9 kB
    loads v0: 625.3 ms

    dumps v1: 374.6 ms
    data size v1: 45582.9 kB
    loads v1: 605.1 ms

    dumps v2: 152.9 ms
    data size v2: 41395.4 kB
    loads v2: 556.5 ms
    ---

    So with the patch, the Python 3.4 default version (4) is *faster* (dump 20% faster, load 16% faster) and produces *smaller files* (10% smaller).


    vstinner commented Jan 28, 2014

Oh by the way, on Python 3.4, the file size (with versions 3 and 4) is unchanged by my patch. Writing a reference takes exactly the same number of bytes as writing an integer.


    vajrasky mannequin commented Jan 28, 2014

    I am doing clinic conversion for marshal module so I am adding myself to nosy list to make sure both tickets are synchronized.

    http://bugs.python.org/issue20185


    vstinner commented Jan 28, 2014

    I am doing clinic conversion for marshal module so I am adding myself to nosy list to make sure both tickets are synchronized.

    The Derby is suspended until the release of Python 3.4 final.

    I consider this issue as an important performance regression that should be fixed before Python 3.4 final.


    serhiy-storchaka commented Jan 28, 2014

Did you test with numerous shared ints and floats, e.g. [1000] * 1000000 and [1000.0] * 1000000? AFAIK these were important use cases for adding versions 3 and 4.
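The sharing being asked about here can be observed directly: versions 3 and 4 write a repeated object once and then emit back-references, which for duplicated floats also shrinks the output relative to version 2. A quick check (assuming CPython 3.4 or later):

```python
import marshal

dup = [1000.0] * 1000  # the same float object, repeated
v2 = marshal.dumps(dup, 2)  # each element written in full (9 bytes per float)
v4 = marshal.dumps(dup, 4)  # first occurrence in full, the rest as references
print(len(v2), len(v4))  # v4 is noticeably smaller here
```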


    vstinner commented Jan 28, 2014

Did you test with numerous shared ints and floats, e.g. [1000] * 1000000 and [1000.0] * 1000000? AFAIK these were important use cases for adding versions 3 and 4.

    Here are new benchmarks on Python 3.4 with:

    Integers: [1000] * 1000000
    Floats: [1000.0] * 1000000
    

    Integers, without the patch:

    dumps v3: 62.8 ms
    data size v3: 4882.8 kB
    loads v3: 10.7 ms
    

    Integers, with the patch:

    dumps v3: 18.6 ms (-70%)
    data size v3: 4882.8 kB (same size)
    loads v3: 27.7 ms (+158%)
    

    Floats, without the patch:

    dumps v3: 62.5 ms
    data size v3: 4882.8 kB
    loads v3: 11.0 ms
    

    Floats, with the patch:

    dumps v3: 29.3 ms (-53%)
    data size v3: 8789.1 kB (+80%)
    loads v3: 25.5 ms (+132%)
    

    The version 3 was added by:
    ---
    changeset: 82816:01372117a5b4
    user: Kristján Valur Jónsson <sweskman@gmail.com>
    date: Tue Mar 19 18:02:10 2013 -0700
    files: Doc/library/marshal.rst Include/marshal.h Lib/test/test_marshal.py Misc/NEWS Python/marshal.c
    description:
    Issue bpo-16475: Support object instancing, recursion and interned strings in marshal
    ---

That issue talks about "sharing string constants, common tuples, even common code objects", not about sharing numbers.

    For real data, here are interesting numbers:
    http://bugs.python.org/issue16475#msg176013

Integers only represent 4.8% of serialized data, and only 8.2% of these integers can be shared. (Floats represent 0.29%.) Whereas strings represent 58%, and 57% of them can be shared.


    loewis mannequin commented Jan 28, 2014

    For the record, format 3 was added through bpo-16475, format 4 was added through bpo-19219.

    In msg175962, Kristjan argued that there is no reason _not_ to share int objects, e.g. across multiple code objects. Now it seems that this argument is flawed: there is a reason, namely the performance impact.

OTOH, I consider both use cases (marshaling a large number of integers, and desiring to share ints across code objects) equally obscure: you shouldn't worry about marshal performance too much if you have loads of tiny int objects, and you shouldn't worry whether these ints get shared or not.

    As a compromise, we could suppress the sharing for small int objects, since they are singletons, anyway. This would allow marshal to preserve/copy the object graph, while not impacting the use case that the original poster on python-dev presented.
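Martin's point about small ints being singletons can be checked directly; a minimal sketch, assuming CPython's small-int cache (-5..256):

```python
# CPython keeps one shared object per small int (-5..256), so these come
# back as the same object regardless of whether marshal recorded references.
import marshal

lst = marshal.loads(marshal.dumps([100, 100], 2))  # version 2: no references
print(lst[0] is lst[1])  # True: the small-int singleton is reused anyway
```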


    pitrou commented Jan 28, 2014

    Integers, without the patch:

    dumps v3: 62.8 ms
    data size v3: 4882.8 kB
    loads v3: 10.7 ms
    

    Integers, with the patch:

    dumps v3: 18.6 ms (-70%)
    data size v3: 4882.8 kB (same size)
    loads v3: 27.7 ms (+158%)
    

As I wrote on python-dev, dumps performance isn't important for the pyc use case, but loads performance is. Therefore it appears this patch goes in the wrong direction.

    You are also ignoring the *runtime* benefit of sharing objects: smaller
    memory footprint of the actual Python process.


    serhiy-storchaka commented Feb 13, 2014

    As a compromise, we could suppress the sharing for small int objects, since they are singletons, anyway.

This is an implementation detail.

But we can use a more efficient way of memoizing small int objects.

I also suppose that an efficient C implementation of IdentityDict would significantly decrease the overhead (maybe by a factor of 2). Since this is probably the first case for which IdentityDict gives a significant benefit, I'll provide an implementation for 3.5.


    serhiy-storchaka commented Feb 13, 2014

    Here is a patch which special cases small integers. It decreases v3 and v4 dump time of bench.py by 10% without affecting load time. Of course in real cases the benefit will be much less.


    larryhastings commented Feb 20, 2014

    This is not a release blocker. You guys can play with this for 3.5. FWIW I prefer Serhiy's approach.


    serhiy-storchaka commented Jan 28, 2015

Here is a more general solution. For simple values (ints, floats, complex numbers, short strings) it is faster to use the value itself as a key than to create a new integer object (the id).
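The idea can be sketched in Python (the actual patch is C code in Python/marshal.c; the names memo_key and w_ref below are made up for illustration):

```python
# Sketch of memoizing by value instead of by identity: key the memo table on
# the value itself for simple immutable types, and fall back to id() for
# everything else, as the unpatched code does for all objects.
def memo_key(obj):
    # type(obj) is part of the key so that e.g. 1000 and 1000.0 don't
    # collide; edge cases the real code must consider (0.0 vs -0.0, NaN)
    # are ignored in this sketch.
    if type(obj) in (int, float, complex, bytes, str):
        return (type(obj), obj)
    return id(obj)

memo = {}

def w_ref(obj):
    """Return a back-reference index if obj was seen before, else record
    it and return None, meaning: write the value in full."""
    key = memo_key(obj)
    if key in memo:
        return memo[key]
    memo[key] = len(memo)
    return None

print(w_ref(1000))         # None: first occurrence, written in full
print(w_ref(int("1000")))  # 0: an equal but distinct int shares the reference
```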

    Without the patch:

    data ver. dumps(ms) loads(ms) size(KiB)

    genData 2 103.9 186.4 4090.7
    genData 3 250.3 196.8 4090.7
    genData 4 223.5 182.5 3651.3

    [1000]*10**6 2 98.6 134.8 4882.8
    [1000]*10**6 3 491.1 62.2 4882.8
    [1000]*10**6 4 494.9 62.1 4882.8

    [1000.0]*10**6 2 173.5 158.4 8789.1
    [1000.0]*10**6 3 494.8 62.2 4882.8
    [1000.0]*10**6 4 491.4 62.8 4882.8

    [1000.0j]*10**6 2 288.8 221.4 16601.6
    [1000.0j]*10**6 3 493.6 62.4 4882.8
    [1000.0j]*10**6 4 489.2 62.0 4882.8

    20 pydecimals 2 85.0 114.7 3936.5
    20 pydecimals 3 97.2 44.3 3373.4
    20 pydecimals 4 86.2 40.0 3297.5

    With marshal3_numbers.patch:

    data ver. dumps(ms) loads(ms) size(KiB)

    genData 3 108.4 187.5 4090.7
    genData 4 83.0 179.3 3651.3

    [1000]*10**6 3 104.2 145.8 4882.8
    [1000]*10**6 4 103.9 147.0 4882.8

    [1000.0]*10**6 3 177.4 154.5 8789.1
    [1000.0]*10**6 4 177.1 164.2 8789.1

    [1000.0j]*10**6 3 501.6 61.1 4882.8
    [1000.0j]*10**6 4 501.6 62.3 4882.8

    20 pydecimals 3 95.2 41.9 3373.4
    20 pydecimals 4 83.5 38.5 3297.5

    With marshal_refs_by_value.patch:

    data ver. dumps(ms) loads(ms) size(KiB)

    genData 3 150.4 197.0 4090.7
    genData 4 122.1 184.8 3651.3

    [1000]*10**6 3 169.1 62.3 4882.8
    [1000]*10**6 4 169.2 62.2 4882.8

    [1000.0]*10**6 3 313.5 62.2 4882.8
    [1000.0]*10**6 4 312.8 62.3 4882.8

    [1000.0j]*10**6 3 410.6 62.5 4882.8
    [1000.0j]*10**6 4 410.5 62.3 4882.8

    20 pydecimals 3 68.5 41.1 3373.4
    20 pydecimals 4 57.5 39.8 3297.5

The marshal_refs_by_value.patch produces data as compact as the unpatched code does, but dumps faster. It usually dumps slower than marshal3_numbers.patch, but may produce smaller data and load faster. Surprisingly, it dumps the code of the _pydecimal module faster.

As a side effect, the patch can turn some equal simple values into identical objects.

    @serhiy-storchaka serhiy-storchaka added the interpreter-core Interpreter core (Objects, Python, Grammar, and Parser dirs) label Jan 28, 2015

    serhiy-storchaka commented Jan 30, 2015

    Here are results of the benchmark which measures dump and load time for all pyc files in the stdlib (including tests).

    https://bitbucket.org/storchaka/cpython-stuff/src/default/marshal/marshalbench.py

    $ find * -name __pycache__ -exec rm -rf '{}' +
    $ ./python -m compileall -qq -r 99 Lib
    $ find Lib -name '*.pyc' | xargs ./python marshalbench.py

    Without the patch:

    ver. dumps loads size
    770.3 19941.2
    0 695.7 1178.4 19941.2
    1 680.4 1180.9 19941.2
    2 635.9 1115.9 19941.2
    3 896.3 757.3 19941.2
    4 748.0 700.6 19941.2

    With marshal3_numbers.patch:

    ver. dumps loads size
    750.5 19946.1
    0 690.2 1173.7 19946.1
    1 678.2 1166.6 19946.1
    2 630.9 1105.2 19946.1
    3 873.2 751.5 19946.1
    4 724.6 687.6 19946.1

    With marshal_refs_by_value.patch:

    ver. dumps loads size
    780.2 19935.8
    0 712.7 1202.6 19935.8
    1 713.8 1195.1 19935.8
    2 676.5 1126.0 19935.8
    3 686.1 762.7 19935.8
    4 515.1 697.6 19935.8


    serhiy-storchaka commented Feb 4, 2015

Here is a patch which adds separate dicts for interned strings (otherwise they can become uninterned) and for bytes. It also slightly simplifies the code.


    serhiy-storchaka commented Feb 4, 2015

And here is an alternative patch which uses a hashtable.

Both patches have about the same performance for *.pyc files, but marshal_hashtable.patch is much faster for duplicated values. Marshalling [1000]*10**6, [1000.0]*10**6 and [1000.0j]*10**6 with versions 3 and 4 is as fast as marshalling [1000]*10**6 with version 2 (i.e. 5 times faster than the current implementation).

    data ver. dumps(ms) loads(ms) size(KiB)

    genData 2 99.9 188.9 4090.7
    genData 3 148.2 189.1 4090.7
    genData 4 121.4 177.4 3651.3

    [1000]*10**6 2 97.7 131.6 4882.8
    [1000]*10**6 3 95.1 63.1 4882.8
    [1000]*10**6 4 95.1 64.4 4882.8

    [1000.0]*10**6 2 172.9 153.5 8789.1
    [1000.0]*10**6 3 97.4 61.9 4882.8
    [1000.0]*10**6 4 95.7 61.6 4882.8

    [1000.0j]*10**6 2 288.6 228.2 16601.6
    [1000.0j]*10**6 3 94.9 61.6 4882.8
    [1000.0j]*10**6 4 95.1 62.2 4882.8

    20 pydecimals 2 88.0 111.4 3929.6
    20 pydecimals 3 57.0 51.4 3368.5
    20 pydecimals 4 46.6 39.9 3292.8


    serhiy-storchaka commented Feb 6, 2015

    What are your thoughts Victor?

    @serhiy-storchaka serhiy-storchaka changed the title Marshal: special case int and float, don't use references Marshal: performance regression with versions 3 and 4 Feb 6, 2015

    pitrou commented Feb 6, 2015

    Can you post the results of your hashtable patch with marshalbench.py?

    PS: we don't care about versions older than 4.


    serhiy-storchaka commented Feb 6, 2015

    Here are new results with corrected marshalbench.py (more precise and with recalculated total size):

    Without the patch:
    ver. dumps loads size
    746.5 19971.2
    0 669.1 1149.2 26202.9
    1 682.0 1130.1 26202.9
    2 652.0 1134.8 26202.4
    3 920.8 757.9 21316.7
    4 771.2 691.4 19971.2

    With marshal3_numbers.patch:
    3 894.6 778.4 21321.6
    4 733.1 704.2 19976.2

    With marshal_refs_by_value_3.patch:
    3 687.6 777.7 21310.3
    4 535.5 701.5 19966.3

    With marshal_hashtable.patch:
    3 656.0 764.2 21316.7
    4 512.9 696.6 19971.2


    serhiy-storchaka commented Feb 10, 2015

The updated patch addresses Victor's comments.


    vstinner commented Feb 11, 2015

Except for the minor suggestion that I added on the review, the patch looks good to me. It's quite simple and makes dumps() 34% faster (for protocol 4).


    vstinner commented Feb 11, 2015

(I didn't reproduce the benchmark, I just used Serhiy's numbers. I guess that using memcpy() doesn't make the code slower.)


    python-dev mannequin commented Feb 11, 2015

    New changeset 012364df2712 by Serhiy Storchaka in branch 'default':
    Issue bpo-20416: marshal.dumps() with protocols 3 and 4 is now 40-50% faster on
    https://hg.python.org/cpython/rev/012364df2712


    serhiy-storchaka commented Feb 11, 2015

    Thanks for your review Antoine and Victor.

    @serhiy-storchaka serhiy-storchaka self-assigned this Feb 11, 2015
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022