Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

struct module documentation should have more predictable examples/warnings #99146

Closed
smontanaro opened this issue Nov 5, 2022 · 15 comments
Closed
Labels
docs Documentation in the Doc dir

Comments

@smontanaro
Copy link
Contributor

smontanaro commented Nov 5, 2022

Documentation

The documentation for the struct module isn't explicit about what's expected of the various examples. Working through that in a PR...

@mdickinson
Copy link
Member

+1 for the changes. The original text here, and the choice to use big-endian for the examples, dates from over 28 years ago. These days I suspect that it's rather rare that the endianness of the machine you currently happen to be working on is relevant to the data manipulation task at hand. As such, I'd consider it a best practice to always use a struct "sigil" as the first character in your format string, and if we follow that best practice in the docs then it'll help it propagate to struct users.

@smontanaro
Copy link
Contributor Author

Thanks @mdickinson. That suggests that I should do a bit more tweaking of both the text and the examples in my PR.

@smontanaro
Copy link
Contributor Author

I think some consensus seems to have been reached in this thread. In particular, most examples should be explicit in their layout definitions.

@cameron-simpson's suggestion sums things up nicely:

... an example with
native byte order (no < or >) cannot work “as is” on all
platforms. And further, having it “just work” on the commonest
platform is actively misleading. I am AGAINST that.

I think the “just works” examples should all use < or >.

I think there needs to be at least one “native” example, and it should
be prefaced clearly that this may well not work identically on a
user’s machine because it is machine type (and compiler type) dependent.

And then it should be presented, with commentary.

I’d even advocate presenting the existing hhl example, with
contradicting example outputs from different platforms. So:

keep the existing output, and explain the source platform and its unpadded behaviour
add a current example (yours or any of mine) and explain its padding behaviour

@smontanaro
Copy link
Contributor Author

smontanaro commented Nov 6, 2022

As I'm working through some examples (and trying to update the documentation text), I find myself confused by some of the behavior. Consider these four similar struct.pack() examples (on an Apple M1 processor):

>>> sys.byteorder
'little'
>>> struct.pack('qqh', 1, 2, 3)
b'\x01\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00\x03\x00'
>>> struct.pack('qqh0q', 1, 2, 3)
b'\x01\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00'
>>> struct.pack('>qqh0q', 1, 2, 3)
b'\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x02\x00\x03'
>>> struct.pack('<qqh0q', 1, 2, 3)
b'\x01\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00\x03\x00'

I would have expected the 0q suffix on the format string to force padding of the output byte string to always be a multiple of eight bytes, but that's not the case:

>>> len(struct.pack('qqh', 1, 2, 3))
18
>>> len(struct.pack('qqh0q', 1, 2, 3))
24
>>> len(struct.pack('<qqh0q', 1, 2, 3))
18
>>> len(struct.pack('>qqh0q', 1, 2, 3))
18

If I don't understand what's going on here, anything I write will be gibberish...

(Edit: update my expectation to be more forceful)

@smontanaro
Copy link
Contributor Author

I switched to my Raspberry Pi, which is a little endian 32-bit ARM machine. I get confusing (to me) results there as well.

>>> sys.byteorder
'little'
>>> sys.maxsize
2147483647
>>> struct.pack('llh0l', 1, 2, 3)
b'\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00'
>>> struct.pack('>llh0l', 1, 2, 3)
b'\x00\x00\x00\x01\x00\x00\x00\x02\x00\x03'
>>> struct.pack('<llh0l', 1, 2, 3)
b'\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00'

If my reading of the documentation is correct:

To align the end of a structure to the alignment requirement of a particular type, end the format with the code for that type with a repeat count of zero.

the '0l' at the end of the format strings should force padding at the end of the byte string to that necessary for long (four bytes on the Raspberry Pi). It seems not to be working. What about after the 'h' but before another 'l'?

>>> struct.pack('<llh0ll', 1, 2, 3, 4)
b'\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x04\x00\x00\x00'
>>> len(struct.pack('<llh0ll', 1, 2, 3, 4))
14

Again, I see nothing to suggest the growing byte string was padded out to a four-byte boundary before the trailing b'\x04\x00\x00\x00' was appended.

In general, the 0<char> format doesn't seem to do what the docs say it should. Looking at the code in Modules/struct.c, nothing jumped out at me that suggested it was handing that case, though I haven't convinced myself I've looked everywhere.

@mdickinson
Copy link
Member

So my understanding of what's going on is that for non-native packing and unpacking (e.g., your examples with < and >), alignment simply doesn't come into play at all (which is why it says "None" in the "Alignment" column) in the table here.

E.g., we get a simple non-padded 9-byte output from the following (on any machine, in theory, but here on macOS / 64-bit Intel)

>>> import struct
>>> struct.pack('>lbl', 1, 2, 3)
b'\x00\x00\x00\x01\x02\x00\x00\x00\x03'

For the native examples, it looks to me as though you're getting the padding that you'd expect.

@mdickinson
Copy link
Member

Possibly we could reword note 3 to emphasize the contrast with the "non-native size and alignment" comment in note 2? Something like:

3. To pad the end of a structure to the alignment requirement of a particular type when using native size and alignment, end the format with the code for that type with a repeat count of zero. See [Examples](https://docs.python.org/3.11/library/struct.html#struct-examples).

@smontanaro
Copy link
Contributor Author

Thanks @mdickinson. If I understand correctly, struct is working as it should, and users don't have complete control over endianness and padding. That's unfortunate, as that would seem to eliminate one of the stated data sources, network connections, since there's no guarantee such data would match your C compiler's struct layout.

@smontanaro
Copy link
Contributor Author

no guarantee such data would match your C compiler's struct layout.

More correctly, the programmer couldn't precisely control the format definition to match the network data layout.

@mdickinson
Copy link
Member

mdickinson commented Nov 8, 2022

There's some control over padding for the non-native modes, via the x format code (which adds a byte of padding). E.g.,

>>> import struct
>>> struct.pack('>lb3xl', 1, 2, 3)
b'\x00\x00\x00\x01\x02\x00\x00\x00\x00\x00\x00\x03'

But I agree it's not ideal.

network connections, since there's no guarantee such data would match your C compiler's struct layout.

Wouldn't data streamed over the network be expected to follow a machine-agnostic format, in general? I'm not sure when one would expect it to match the layout for whatever the C compiler happens to be on your local machine.

@smontanaro
Copy link
Contributor Author

Thanks @mdickinson. Apologies for the late reply. Been away.

The longer this thread continues, the more I'm beginning to believe there are two distinct use cases.

  1. Map data into and out of the format dictated by the compiler used to build Python. This is typified by the @ prefix and zero-repeat format codes. Something like llh0l is meaningful in this context.
  2. Map data into and out of specific formats needed for other uses (like network communication). For this, @ and zero-repeat formats are useless and you need to resort to more explicit format characters (endianness, specific padding with x). Assuming my compiler wants longs aligned on four bytes and I need to send/receive big-endian data across the net, the above format string would be replaced by >llhxx.

Have I got that about right? If I'm starting to think about this in the right way, I think the documentation should be explicit about these distinct use cases.

@cameron-simpson
Copy link

cameron-simpson commented Nov 11, 2022 via email

@smontanaro
Copy link
Contributor Author

I updated the PR and marked it ready for review, should anyone be interested.

@mdickinson
Copy link
Member

@smontanaro

the more I'm beginning to believe there are two distinct use cases [...]

I can't speak for other struct users, but that matches the way I think I about it.

mdickinson pushed a commit that referenced this issue Nov 22, 2022
…mples/warnings (GH-99141)

* nail down a couple examples to have more predictable output

* update a number of things, but this is really just a stash...

* added an applications section to describe typical uses for native and machine-independent formats

* make sure all format strings use a format prefix character

* responding to comments from @gpshead. Not likely finished yet.

* This got more involved than I expected...

* respond to several PR comments
* a lot of wordsmithing
* try and be more consistent in use of ``x`` vs ``'x'``
* expand examples a bit
* update the "see also" to be more up-to-date
* original examples relied on import * so present all examples as if

* reformat based on @gpshead comment (missed before)

* responding to comments

* missed this

* one more suggested edit

* wordsmithing
miss-islington pushed a commit to miss-islington/cpython that referenced this issue Nov 22, 2022
…le examples/warnings (pythonGH-99141)

* nail down a couple examples to have more predictable output

* update a number of things, but this is really just a stash...

* added an applications section to describe typical uses for native and machine-independent formats

* make sure all format strings use a format prefix character

* responding to comments from @gpshead. Not likely finished yet.

* This got more involved than I expected...

* respond to several PR comments
* a lot of wordsmithing
* try and be more consistent in use of ``x`` vs ``'x'``
* expand examples a bit
* update the "see also" to be more up-to-date
* original examples relied on import * so present all examples as if

* reformat based on @gpshead comment (missed before)

* responding to comments

* missed this

* one more suggested edit

* wordsmithing
(cherry picked from commit 22d91c1)

Co-authored-by: Skip Montanaro <skip.montanaro@gmail.com>
miss-islington pushed a commit to miss-islington/cpython that referenced this issue Nov 22, 2022
…le examples/warnings (pythonGH-99141)

* nail down a couple examples to have more predictable output

* update a number of things, but this is really just a stash...

* added an applications section to describe typical uses for native and machine-independent formats

* make sure all format strings use a format prefix character

* responding to comments from @gpshead. Not likely finished yet.

* This got more involved than I expected...

* respond to several PR comments
* a lot of wordsmithing
* try and be more consistent in use of ``x`` vs ``'x'``
* expand examples a bit
* update the "see also" to be more up-to-date
* original examples relied on import * so present all examples as if

* reformat based on @gpshead comment (missed before)

* responding to comments

* missed this

* one more suggested edit

* wordsmithing
(cherry picked from commit 22d91c1)

Co-authored-by: Skip Montanaro <skip.montanaro@gmail.com>
mdickinson pushed a commit that referenced this issue Nov 22, 2022
…ble examples/warnings (GH-99141) (GH-99702)

gh-99146 struct module documentation should have more predictable examples/warnings (GH-99141)

* nail down a couple examples to have more predictable output

* update a number of things, but this is really just a stash...

* added an applications section to describe typical uses for native and machine-independent formats

* make sure all format strings use a format prefix character

* responding to comments from @gpshead. Not likely finished yet.

* This got more involved than I expected...

* respond to several PR comments
* a lot of wordsmithing
* try and be more consistent in use of ``x`` vs ``'x'``
* expand examples a bit
* update the "see also" to be more up-to-date
* original examples relied on import * so present all examples as if

* reformat based on @gpshead comment (missed before)

* responding to comments

* missed this

* one more suggested edit

* wordsmithing
(cherry picked from commit 22d91c1)

Co-authored-by: Skip Montanaro <skip.montanaro@gmail.com>

Co-authored-by: Skip Montanaro <skip.montanaro@gmail.com>
mdickinson pushed a commit that referenced this issue Nov 22, 2022
…ble examples/warnings (GH-99141) (GH-99703)

gh-99146 struct module documentation should have more predictable examples/warnings (GH-99141)

* nail down a couple examples to have more predictable output

* update a number of things, but this is really just a stash...

* added an applications section to describe typical uses for native and machine-independent formats

* make sure all format strings use a format prefix character

* responding to comments from @gpshead. Not likely finished yet.

* This got more involved than I expected...

* respond to several PR comments
* a lot of wordsmithing
* try and be more consistent in use of ``x`` vs ``'x'``
* expand examples a bit
* update the "see also" to be more up-to-date
* original examples relied on import * so present all examples as if

* reformat based on @gpshead comment (missed before)

* responding to comments

* missed this

* one more suggested edit

* wordsmithing
(cherry picked from commit 22d91c1)

Co-authored-by: Skip Montanaro <skip.montanaro@gmail.com>

Co-authored-by: Skip Montanaro <skip.montanaro@gmail.com>
@hauntsaninja
Copy link
Contributor

Thanks, looks like the docs PRs have been merged and backported!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
docs Documentation in the Doc dir
Projects
None yet
Development

No branches or pull requests

4 participants