Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

streaming struct unpacking #62004

Closed
pitrou opened this issue Apr 20, 2013 · 19 comments
Closed

streaming struct unpacking #62004

pitrou opened this issue Apr 20, 2013 · 19 comments
Labels
stdlib Python modules in the Lib dir type-feature A feature request or enhancement

Comments

@pitrou
Copy link
Member

pitrou commented Apr 20, 2013

BPO 17804
Nosy @rhettinger, @terryjreedy, @mdickinson, @pitrou, @skrah, @meadori, @serhiy-storchaka, @phmc
Files
  • iter_unpack.patch
  • bench_unpack.py
  • struct_array_view.py
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2013-04-26.22:21:03.173>
    created_at = <Date 2013-04-20.20:33:23.478>
    labels = ['type-feature', 'library']
    title = 'streaming struct unpacking'
    updated_at = <Date 2013-04-26.22:21:03.172>
    user = 'https://github.com/pitrou'

    bugs.python.org fields:

    activity = <Date 2013-04-26.22:21:03.172>
    actor = 'pitrou'
    assignee = 'none'
    closed = True
    closed_date = <Date 2013-04-26.22:21:03.173>
    closer = 'pitrou'
    components = ['Library (Lib)']
    creation = <Date 2013-04-20.20:33:23.478>
    creator = 'pitrou'
    dependencies = []
    files = ['29953', '29955', '29967']
    hgrepos = []
    issue_num = 17804
    keywords = ['patch']
    message_count = 19.0
    messages = ['187455', '187457', '187465', '187467', '187468', '187469', '187470', '187471', '187472', '187499', '187508', '187509', '187554', '187602', '187609', '187682', '187875', '187880', '187881']
    nosy_count = 10.0
    nosy_names = ['rhettinger', 'terry.reedy', 'mark.dickinson', 'pitrou', 'skrah', 'meador.inge', 'python-dev', 'serhiy.storchaka', 'pconnell', 'isoschiz']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue17804'
    versions = ['Python 3.4']

    @pitrou
    Copy link
    Member Author

    pitrou commented Apr 20, 2013

    For certain applications, you want to unpack repeatedly the same pattern. This came in bpo-17618 (base85 decoding), where you want to unpack a stream of bytes as 32-bit big-endian unsigned ints. The solution adopted in bpo-17618 patch (struct.Struct("!{}I")) is clearly suboptimal.

    I would suggest something like a iter_unpack() function which would repeatedly yield tuples until the bytes object is over.

    @pitrou pitrou added stdlib Python modules in the Lib dir type-feature A feature request or enhancement labels Apr 20, 2013
    @pitrou
    Copy link
    Member Author

    pitrou commented Apr 20, 2013

    (my initial intuition here was to use memoryview.cast() but it doesn't support non-native formats)

    @pitrou
    Copy link
    Member Author

    pitrou commented Apr 20, 2013

    Here is a patch (still lacking docs). Comments welcome.

    @serhiy-storchaka
    Copy link
    Member

    serhiy-storchaka commented Apr 20, 2013

    Perhaps we need not iter_unpack(), but a grouper (some sort of)?

    def grouper(seq, size):
        for i in range(0, len(seq), size):
            yield seq[i: i + size]
    
    unpack = struct.Struct('!I').unpack
    for chunk in grouper(data, 4):
        word, = unpack(chunk)
        ...

    @pitrou
    Copy link
    Member Author

    pitrou commented Apr 20, 2013

    Well, according to a quick benchmark, iter_unpack() is 3x to 6x faster than the grouper() + unpack() recipe.
    (it's also a bit more user-friendly)

    @pitrou
    Copy link
    Member Author

    pitrou commented Apr 20, 2013

    For the record, here is the benchmark script.

    @isoschiz
    Copy link
    Mannequin

    isoschiz mannequin commented Apr 20, 2013

    I like the idea of this. Two comments:

    • I'm no expert on the C API, but in s_iter_unpack do you not need to check for failure of PyType_GenericAlloc before calling PyObject_GetBuffer?

    • I'm not a fan of separate iter_ functions (and there seemed to be a general move away from them elsewhere in Python3; obviously here we have to maintain backwards compat though). Perhaps a boolean keyword "asiter" arg to the regular unpack() instead?

    @isoschiz
    Copy link
    Mannequin

    isoschiz mannequin commented Apr 20, 2013

    On 20 Apr 2013, at 23:01, Martin Morrison <report@bugs.python.org> wrote:

    • I'm not a fan of separate iter_ functions (and there seemed to be a general move away from them elsewhere in Python3; obviously here we have to maintain backwards compat though). Perhaps a boolean keyword "asiter" arg to the regular unpack() instead?

    Thinking about this more, the functionality is probably too radically different to overload the same function, so I withdraw the suggestion.

    @pitrou
    Copy link
    Member Author

    pitrou commented Apr 20, 2013

    • I'm no expert on the C API, but in s_iter_unpack do you not need to
      check for failure of PyType_GenericAlloc before calling
      PyObject_GetBuffer?

    Yes, good catch.

    • I'm not a fan of separate iter_ functions (and there seemed to be a
      general move away from them elsewhere in Python3; obviously here we
      have to maintain backwards compat though). Perhaps a boolean keyword
      "asiter" arg to the regular unpack() instead?

    We generally consider it bad API design when a parameter changes the
    return *type* of the function. "iter_unpack" may not be terrific as a
    name but it describes the semantics quite clearly (and it's not too
    annoying to type).

    @mdickinson
    Copy link
    Member

    mdickinson commented Apr 21, 2013

    This seems like an attractive idea. There's definitely a need for repeated unpacking with the same pattern, and I agree that putting the repetition into the pattern is suboptimal (not least from the point of view of caching structs).

    One thing that feels a bit unorthogonal is that this is doing two things at once: both allowing for repetition of a pattern, and also adding the lazy iteration. I'd guess that there's also a use-case for allowing repetition but not returning an iterator; but then that's easily covered by list(iter_unpack).

    +1 from me.

    Hmm; the name. 'iterunpack'? 'iter_unpack'? 'unpack_stream'? 'unpack_all'?

    Would we want something similar for packing, too? I guess that's effectively covered by b''.join(s.pack(item) for item in ...).

    @serhiy-storchaka
    Copy link
    Member

    serhiy-storchaka commented Apr 21, 2013

    Well, according to a quick benchmark, iter_unpack() is 3x to 6x faster than the grouper() + unpack() recipe.
    (it's also a bit more user-friendly)

    Yes, It's mainly because a grouper written on Python. When it will be implemented in C, the difference will be less. This function will be useful beside struct. Note that in my patch for bpo-17618 struct.Struct("!{}I") is not used.

    As for extending Struct, what you think about a more powerful feature? About a method which returns not an iterator, but an iterable and indexable sequence. Here is a sample Python implementation.

    @pitrou
    Copy link
    Member Author

    pitrou commented Apr 21, 2013

    Yes, It's mainly because a grouper written on Python. When it will be
    implemented in C, the difference will be less. This function will be
    useful beside struct.

    I'm not against adding useful C tools to itertools, but you may have to
    convince Raymond ;)

    As for extending Struct, what you think about a more powerful feature?
    About a method which returns not an iterator, but an iterable and
    indexable sequence. Here is a sample Python implementation.

    I'll take a look, but the question is how complex a C implementation
    would be.

    @serhiy-storchaka
    Copy link
    Member

    serhiy-storchaka commented Apr 22, 2013

    I'm not against adding useful C tools to itertools, but you may have to
    convince Raymond ;)

    C implementation speeds up the benchmark only 1.5x. Granting the fact that this idiom is used in stdlib less than two dozens times (without tests and iobench), I do not think more this is a worthful idea.

    I'll take a look, but the question is how complex a C implementation
    would be.

    Definitely it will be more complex than for iter_unpack. ;)

    @meadori
    Copy link
    Member

    meadori commented Apr 23, 2013

    This seems reasonable to me to. So +1.

    Small bikeshed on the name: I think 'unpack_iter' would be more
    consistent with what is already there, e.g. 'unpack' and 'unpack_from'.
    In fact, when experimenting with this patch I found myself typing
    'unpack_iter' several times.

    @pitrou
    Copy link
    Member Author

    pitrou commented Apr 23, 2013

    Small bikeshed on the name: I think 'unpack_iter' would be more
    consistent with what is already there, e.g. 'unpack' and 'unpack_from'.
    In fact, when experimenting with this patch I found myself typing
    'unpack_iter' several times.

    I thought so, but "unpack_iter" would mean we are unpacking an iterator,
    while we are unpacking *as* an iterator (like iterkeys() and friends in
    Python 2).

    @rhettinger
    Copy link
    Contributor

    rhettinger commented Apr 24, 2013

    iter_unpack() is closer to how we name other iter variants, so I think you ought to stick with that. The other names are too creative (and hence less guessable).

    @terryjreedy
    Copy link
    Member

    terryjreedy commented Apr 26, 2013

    I think 'iter_unpack' is deceptive and wrong for the following reason. Up to now, 'ixyz' or 'iterxyz' or 'iter_xyz' has meant a version of 'xyz' that presents items one at a time rather than all at once in a collection object (usually in a list). Unpack returns a tuple, but the new function would *not* present the members of the tuple one at time. Instead, it would apply unpack multiple times, yielding multiple tuples. I would call the new thing 'unpack_all' or 'unpacker' (the latter works especially well for an iterator class). An unpacker is a person or machine that repeatedly unpacks. (I was once a bottle unpacker for a college summer job ;-).

    struct.unpacker(fmt, buffer)
        Return an iterator that repeatedly unpacks successive buffer slices of size calcsize(fmt) into tuples according to the format string fmt. The buffer length must be an exact multiple of calcsize(fmt)). (? Not clear from text description. Add param to allow remainder?)

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Apr 26, 2013

    New changeset d232cff25bbd by Antoine Pitrou in branch 'default':
    Issue bpo-17804: New function struct.iter_unpack allows for streaming struct unpacking.
    http://hg.python.org/cpython/rev/d232cff25bbd

    @pitrou
    Copy link
    Member Author

    pitrou commented Apr 26, 2013

    Ok, I don't want the bikeshedding to last too long, so I committed the patch with docs. Thanks everyone!

    @pitrou pitrou closed this as completed Apr 26, 2013
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    stdlib Python modules in the Lib dir type-feature A feature request or enhancement
    Projects
    None yet
    Development

    No branches or pull requests

    6 participants