Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The protocol > 0 of cPickle does not given stable dictionary values #57944

Closed
kayhayen mannequin opened this issue Jan 8, 2012 · 9 comments
Closed

The protocol > 0 of cPickle does not given stable dictionary values #57944

kayhayen mannequin opened this issue Jan 8, 2012 · 9 comments
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@kayhayen
Copy link
Mannequin

kayhayen mannequin commented Jan 8, 2012

BPO 13735
Nosy @jcea, @pitrou
Files
  • stream.py
  • stream.py
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2012-01-08.19:04:29.737>
    created_at = <Date 2012-01-08.03:38:30.117>
    labels = ['interpreter-core', 'type-bug', 'library']
    title = 'The protocol > 0 of cPickle does not given stable dictionary values'
    updated_at = <Date 2012-01-09.14:37:14.291>
    user = 'https://bugs.python.org/kayhayen'

    bugs.python.org fields:

    activity = <Date 2012-01-09.14:37:14.291>
    actor = 'jcea'
    assignee = 'none'
    closed = True
    closed_date = <Date 2012-01-08.19:04:29.737>
    closer = 'pitrou'
    components = ['Interpreter Core', 'Library (Lib)']
    creation = <Date 2012-01-08.03:38:30.117>
    creator = 'kayhayen'
    dependencies = []
    files = ['24170', '24180']
    hgrepos = []
    issue_num = 13735
    keywords = []
    message_count = 9.0
    messages = ['150841', '150871', '150891', '150892', '150893', '150894', '150895', '150896', '150897']
    nosy_count = 3.0
    nosy_names = ['jcea', 'pitrou', 'kayhayen']
    pr_nums = []
    priority = 'normal'
    resolution = 'wont fix'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue13735'
    versions = ['Python 2.7']

    @kayhayen
    Copy link
    Mannequin Author

    kayhayen mannequin commented Jan 8, 2012

    Hello,

    I am implementing a Python compiler (Nuitka) that is testing if when it compiles itself, it gives the same output.

    I have been using "protocol = 0" ever since with "pickle" module for historic reasons (gcc bug with raw strings lead me to believe it's better) and lately, I have changed to "protocol = 2" and cPickle. But then I noticed that my compile itself test now fail to give same code from pickling of dictionary constants.

    Imanaged and isolated the issue, and it's a Python2.7 regression, Python2.6 is fine:

    Observe this output from "cPickle.dumps" for a constant dictionary with one element:

    Protocol 0 :
    Dumping read const const stream "(dp1\nS'modules'\np2\nNs."
    Dumping load const const stream "(dp1\nS'modules'\np2\nNs."
    Dumping load const const stream "(dp1\nS'modules'\np2\nNs."
    Protocol 1 :
    Dumping read const const stream '}q\x01U\x07modulesq\x02Ns.'
    Dumping load const const stream '}q\x01U\x07modulesNs.'
    Dumping load const const stream '}q\x01U\x07modulesNs.'
    Protocol 2 :
    Dumping read const const stream '\x80\x02}q\x01U\x07modulesq\x02Ns.'
    Dumping load const const stream '\x80\x02}q\x01U\x07modulesNs.'
    Dumping load const const stream '\x80\x02}q\x01U\x07modulesNs.'

    It seems that cPickle as of CPython2.7 does give a better stream for dictionaries it itself emitted. With CPython2.6 I observe no difference.

    My work-around is to "re-stream", "dumps" -> "loads" -> "dumps" with CPython2.7 for the time being.

    Can you either: Fix cPickle to treat the dictionaries the same, or enhance to core to produce the same dict as cPickle does? It appears at least some kind of efficiency might be missed out for marshall as well.

    @kayhayen kayhayen mannequin added interpreter-core (Objects, Python, Grammar, and Parser dirs) stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels Jan 8, 2012
    @pitrou
    Copy link
    Member

    pitrou commented Jan 8, 2012

    • Can you use pickletools.dis to examine what differs in the pickles' bytecode?
    • Does it also apply to 3.3?

    It appears at least some kind of efficiency might be missed out for
    marshall as well.

    marshal doesn't use the pickle protocols, so it's unlikely to be affected.

    @kayhayen
    Copy link
    Mannequin Author

    kayhayen mannequin commented Jan 8, 2012

    It seems that there is an extra "BINPUT 2", whatever it does. I am attaching a variant that does pickletools.dis on the 3 dumps.

    Protocol 2 :
    Dumping read const const stream '\x80\x02}q\x01U\x07modulesq\x02Ns.'
    0: \x80 PROTO 2
    2: } EMPTY_DICT
    3: q BINPUT 1
    5: U SHORT_BINSTRING 'modules'
    14: q BINPUT 2
    16: N NONE
    17: s SETITEM
    18: . STOP
    highest protocol among opcodes = 2
    Dumping load const const stream '\x80\x02}q\x01U\x07modulesNs.'
    0: \x80 PROTO 2
    2: } EMPTY_DICT
    3: q BINPUT 1
    5: U SHORT_BINSTRING 'modules'
    14: N NONE
    15: s SETITEM
    16: . STOP
    highest protocol among opcodes = 2
    Dumping load const const stream '\x80\x02}q\x01U\x07modulesNs.'
    0: \x80 PROTO 2
    2: } EMPTY_DICT
    3: q BINPUT 1
    5: U SHORT_BINSTRING 'modules'
    14: N NONE
    15: s SETITEM
    16: . STOP
    highest protocol among opcodes = 2

    @pitrou
    Copy link
    Member

    pitrou commented Jan 8, 2012

    Ah, right. BINPUT is used to memoize objects in case a restructive structure happens. You can use pickletools.optimize() to remove the useless BINPUTs out of a pickle stream.

    Note that 3.x is more consistent and always emits the BINPUTs. It seems 2.x's cPickle (which is really a weird piece of code) may do some premature optimization here.

    @kayhayen
    Copy link
    Mannequin Author

    kayhayen mannequin commented Jan 8, 2012

    Sending my attached file "stream.py" through "2to3.py" it shows that CPython 3.2 doesn't exihibit the issue for either protocol, which may be because it's now "unicode" key, but as it's the only value I tried, I can't tell.

    Hope this helps.

    Regarding the "marshal", I presume, that somehow the dictionary when created via "marshal" (or compile, if no ".pyc" is involved?) first time is somehow less efficient to determine/stream that the one "cPickle" created.

    My assumption is that somehow, when creating the dictionary from cPickle, it is different (has to be), and that's interesting. And of course the easier to stream dictionary may be the nicer one at runtime as well. But that is just my speculation.

    Yours,
    Kay

    @pitrou
    Copy link
    Member

    pitrou commented Jan 8, 2012

    I think the "premature" optimization comes down to that code:

    """
    static int
    put(Picklerobject *self, PyObject *ob)
    {
        if (Py_REFCNT(ob) < 2 || self->fast)
            return 0;
    return put2(self, ob);
    

    }
    """

    (put2() being the function which emits BINPUT)

    When you unpickle a dict, its contents are created anew and therefore the refcount is 1 => next pickling avoids emitting a BINPUT.

    @pitrou
    Copy link
    Member

    pitrou commented Jan 8, 2012

    Regarding the "marshal", I presume, that somehow the dictionary when
    created via "marshal" (or compile, if no ".pyc" is involved?) first
    time is somehow less efficient to determine/stream that the one
    "cPickle" created.

    I don't think so. marshal uses a much simpler and less general
    serialization protocol, so it should theoretically (!) be faster as
    well.
    For example, I don't think marshal supports recursive structures.

    @kayhayen
    Copy link
    Mannequin Author

    kayhayen mannequin commented Jan 8, 2012

    I see, if it's refcount dependent, that explains why it changes from interpreter provided dictionary and self-created one.

    So, I take, I should always call "pickletools.optimize( cPickle.dumps( value ))" then.

    Thanks,
    Kay

    @pitrou
    Copy link
    Member

    pitrou commented Jan 8, 2012

    Closing the issue as it's really a quirk rather than a bug. Thanks for reporting, though.

    @pitrou pitrou closed this as completed Jan 8, 2012
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    interpreter-core (Objects, Python, Grammar, and Parser dirs) stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    1 participant