Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bpo-40302: Add _Py_bswap32() function to pyport.h #19552

Merged
merged 3 commits into from
Apr 17, 2020
Merged

bpo-40302: Add _Py_bswap32() function to pyport.h #19552

merged 3 commits into from
Apr 17, 2020

Conversation

vstinner
Copy link
Member

@vstinner vstinner commented Apr 16, 2020

Add the following static inline functions to pyport.h:

  • _Py_bswap16()
  • _Py_bswap32()
  • _Py_bswap64()

Use these functions in _ctypes, sha256 and sha512 modules.

https://bugs.python.org/issue40302

@vstinner
Copy link
Member Author

cc @serhiy-storchaka

Include/pyport.h Outdated
#ifdef _PY_HAVE_BUILTIN_BSWAP
return __builtin_bswap64(word);
#elif defined(_MSC_VER)
return ( ((word & 0x00000000000000FFUL) << 56)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It can be done with 4 &s, 6 shifts and 3 |s instead of 8 &s, 8 shifts and 7 |s.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I copied the code from _ctypes. I cannot guess which code you would like.

In practice, efficient builtin functions are used on GCC, clang and Windows (MSC).

Include/pyport.h Outdated
#ifdef _PY_HAVE_BUILTIN_BSWAP
return __builtin_bswap32(word);
#else
word = ( (word & 0xFF00FF00UL) >> 8)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure, but it may be a benefit from using the same constant: (word >> 8) & 0x00FF00FFUL. It may also be a benefit from using & and >> in different orders. I did not tested this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I let you test it. I don't really care of the performance on this code path, since efficient builtin functions are used on most platforms.

@vstinner vstinner requested a review from a team as a code owner April 16, 2020 17:06
@vstinner
Copy link
Member Author

PR updated:

  • Use MSC builtin functions
  • Add basic unit tests
  • Fix _ctypes code
  • Leave pyport.h unchanged: add a new internal pycore_bswap.h header instead. Is it what you wanted @pitrou?

In Objects/stringlib/codecs.h, I left the following macro unchanged since I don't understand if it uses uint16_t or uint32_t:

#define SWAB2(CH)  (((CH) << 8) | ((CH) >> 8))

My _ctypes/cfield.c static inline functions look overkill. Maybe using _Py_bswapXX() directly would be enough? I'm not sure about conversions between signed and unsigned integers.

Move new functions to an internal header means that sha256, sha512 and _ctypes modules must now be compiled with the internal C API to access this header.

@pitrou
Copy link
Member

pitrou commented Apr 16, 2020

Perhaps pyport_internal.h? A dedicated header for bswap is overkill.

@vstinner
Copy link
Member Author

I looked at Linux /usr/include/bits/byteswap.h and decided to use a simpler implementation: don't expect "a << b" or "a >> b" to be circular.

@vstinner
Copy link
Member Author

Perhaps pyport_internal.h? A dedicated header for bswap is overkill.

I can come up with a better name if needed. But first, do you care of having these functions in the public C API or are you ok to have it in the internal C API?

Public C API: private function prefixed by _Py, and it would be a new header file, since it adds a new include on Windows.

@pitrou
Copy link
Member

pitrou commented Apr 17, 2020

I don't think they need to be in the public API. It's just internal helpers for CPython.

@vstinner
Copy link
Member Author

I don't think they need to be in the public API. It's just internal helpers for CPython.

Ok, so let's start with pycore_byteswap.h. If tomorrow, we add more bit and byte utilities, we can rename the header. Since it's the internal C API, we don't have to bother with the backward compatibility.

"byteswap.h" name comes from GNU byteswap.h header name:

NAME
       bswap_16, bswap_32, bswap_64 - reverse order of bytes

SYNOPSIS
       #include <byteswap.h>

       bswap_16(x);
       bswap_32(x);
       bswap_64(x);

Add a new internal pycore_byteswap.h header file with the following
functions:

* _Py_bswap16()
* _Py_bswap32()
* _Py_bswap64()

Use these functions in _ctypes, sha256 and sha512 modules.
Also use it in the UTF-32 encoder.

sha256, sha512 and _ctypes modules are now built with the internal C
API.
@vstinner
Copy link
Member Author

The glibc managed to unify all of its byteswap.h implementations into a single header file thanks to GCC builtin functions: https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=0d40d0ecba3b1e5b8c3b8da01c708fea3948e193 Previously, it had one implementation per architecture using assembly code!

I checked how _sha256 module is compiled with gcc -O3. longReverse() is inlined and bswap x86 instruction is used (work on 32-bit registers). It sems like even the loop is unrolled. Example:

(gdb) disassemble longReverse 
Dump of assembler code for function sha_transform:
   0x00007fffeaaa92b0 <+0>:	push   r15
   0x00007fffeaaa92b2 <+2>:	mov    r9d,0x10
   0x00007fffeaaa92b8 <+8>:	push   r14
   0x00007fffeaaa92ba <+10>:	push   r13
   0x00007fffeaaa92bc <+12>:	push   r12
   0x00007fffeaaa92be <+14>:	push   rbp
   0x00007fffeaaa92bf <+15>:	push   rbx
   0x00007fffeaaa92c0 <+16>:	sub    rsp,0xc0
   0x00007fffeaaa92c7 <+23>:	movdqu xmm1,XMMWORD PTR [rdi+0x38]
   0x00007fffeaaa92cc <+28>:	mov    esi,DWORD PTR [rdi+0x48]
   0x00007fffeaaa92cf <+31>:	movdqu xmm2,XMMWORD PTR [rdi+0x48]
   0x00007fffeaaa92d4 <+36>:	mov    eax,DWORD PTR [rdi+0x38]
   0x00007fffeaaa92d7 <+39>:	movaps XMMWORD PTR [rsp-0x48],xmm1
=> 0x00007fffeaaa92dc <+44>:	bswap  esi
   0x00007fffeaaa92de <+46>:	mov    r8d,DWORD PTR [rsp-0x44]
   0x00007fffeaaa92e3 <+51>:	mov    ecx,DWORD PTR [rdi+0x58]
   0x00007fffeaaa92e6 <+54>:	movaps XMMWORD PTR [rsp-0x38],xmm2
   0x00007fffeaaa92eb <+59>:	movdqu xmm3,XMMWORD PTR [rdi+0x58]
   0x00007fffeaaa92f0 <+64>:	movdqu xmm4,XMMWORD PTR [rdi+0x68]
   0x00007fffeaaa92f5 <+69>:	bswap  eax
   0x00007fffeaaa92f7 <+71>:	mov    DWORD PTR [rsp-0x38],esi
   0x00007fffeaaa92fb <+75>:	mov    esi,DWORD PTR [rsp-0x34]
   0x00007fffeaaa92ff <+79>:	bswap  r8d
   0x00007fffeaaa9302 <+82>:	bswap  ecx
   0x00007fffeaaa9304 <+84>:	mov    DWORD PTR [rsp-0x44],r8d
   0x00007fffeaaa9309 <+89>:	mov    r8d,DWORD PTR [rsp-0x40]
   0x00007fffeaaa930e <+94>:	bswap  esi
   0x00007fffeaaa9310 <+96>:	mov    DWORD PTR [rsp-0x34],esi
   0x00007fffeaaa9314 <+100>:	mov    esi,DWORD PTR [rsp-0x30]
   0x00007fffeaaa9318 <+104>:	bswap  r8d
   0x00007fffeaaa931b <+107>:	mov    DWORD PTR [rsp-0x40],r8d
   0x00007fffeaaa9320 <+112>:	mov    r8d,DWORD PTR [rsp-0x3c]
   0x00007fffeaaa9325 <+117>:	bswap  esi
(...)

Oh. GCC looks clever. Without this PR, it was already able to recognize that longReverse() implements bytecode and it already used bswap!

Dump of assembler code for function sha_transform:
   0x00007fffeaaa9230 <+0>:	push   r15
   0x00007fffeaaa9232 <+2>:	mov    r9d,0x10
   0x00007fffeaaa9238 <+8>:	push   r14
   0x00007fffeaaa923a <+10>:	push   r13
   0x00007fffeaaa923c <+12>:	push   r12
   0x00007fffeaaa923e <+14>:	push   rbp
   0x00007fffeaaa923f <+15>:	push   rbx
   0x00007fffeaaa9240 <+16>:	sub    rsp,0xc0
   0x00007fffeaaa9247 <+23>:	movdqu xmm1,XMMWORD PTR [rdi+0x38]
   0x00007fffeaaa924c <+28>:	mov    esi,DWORD PTR [rdi+0x48]
   0x00007fffeaaa924f <+31>:	movdqu xmm2,XMMWORD PTR [rdi+0x48]
   0x00007fffeaaa9254 <+36>:	mov    eax,DWORD PTR [rdi+0x38]
   0x00007fffeaaa9257 <+39>:	movaps XMMWORD PTR [rsp-0x48],xmm1
   0x00007fffeaaa925c <+44>:	bswap  esi
   0x00007fffeaaa925e <+46>:	mov    r8d,DWORD PTR [rsp-0x44]
   0x00007fffeaaa9263 <+51>:	mov    ecx,DWORD PTR [rdi+0x58]
   0x00007fffeaaa9266 <+54>:	movaps XMMWORD PTR [rsp-0x38],xmm2
   0x00007fffeaaa926b <+59>:	movdqu xmm3,XMMWORD PTR [rdi+0x58]
   0x00007fffeaaa9270 <+64>:	movdqu xmm4,XMMWORD PTR [rdi+0x68]
   0x00007fffeaaa9275 <+69>:	bswap  eax
   0x00007fffeaaa9277 <+71>:	mov    DWORD PTR [rsp-0x38],esi
   0x00007fffeaaa927b <+75>:	mov    esi,DWORD PTR [rsp-0x34]
   0x00007fffeaaa927f <+79>:	bswap  r8d
   0x00007fffeaaa9282 <+82>:	bswap  ecx
   0x00007fffeaaa9284 <+84>:	mov    DWORD PTR [rsp-0x44],r8d
   0x00007fffeaaa9289 <+89>:	mov    r8d,DWORD PTR [rsp-0x40]
   0x00007fffeaaa928e <+94>:	bswap  esi
   0x00007fffeaaa9290 <+96>:	mov    DWORD PTR [rsp-0x34],esi
   0x00007fffeaaa9294 <+100>:	mov    esi,DWORD PTR [rsp-0x30]
   0x00007fffeaaa9298 <+104>:	bswap  r8d
   0x00007fffeaaa929b <+107>:	mov    DWORD PTR [rsp-0x40],r8d
   0x00007fffeaaa92a0 <+112>:	mov    r8d,DWORD PTR [rsp-0x3c]
   0x00007fffeaaa92a5 <+117>:	bswap  esi
   0x00007fffeaaa92a7 <+119>:	mov    DWORD PTR [rsp-0x30],esi
   0x00007fffeaaa92ab <+123>:	mov    esi,DWORD PTR [rsp-0x2c]
   0x00007fffeaaa92af <+127>:	movaps XMMWORD PTR [rsp-0x18],xmm4
   0x00007fffeaaa92b4 <+132>:	bswap  r8d
(...)

I added volatile in longReverse() of the master branch, to see how unoptimized looks alike:

Dump of assembler code for function sha_transform:
(...)
   0x00007fffeaaa836a <+106>:	mov    edx,DWORD PTR [rsp-0x5c]
   0x00007fffeaaa836e <+110>:	mov    eax,DWORD PTR [rsp-0x5c]
   0x00007fffeaaa8372 <+114>:	shl    edx,0x10
   0x00007fffeaaa8375 <+117>:	shr    eax,0x10
   0x00007fffeaaa8378 <+120>:	or     eax,edx
   0x00007fffeaaa837a <+122>:	mov    edx,DWORD PTR [rsp-0x34]
(...)

C code:

        value = *buffer;
        value = ( ( value & 0xFF00FF00L ) >> 8  ) | \
                ( ( value & 0x00FF00FFL ) << 8 );
        *buffer++ = ( value << 16 ) | ( value >> 16 );

GCC is quite smart nowadays!

@vstinner vstinner merged commit 1ae035b into python:master Apr 17, 2020
@vstinner vstinner deleted the bswap branch April 17, 2020 15:47
@vstinner
Copy link
Member Author

I know that the "portable" implementation may not be the most efficient, but nowadays C compilers are clever and implement crazy optimizations (see my previous comment). Moreover, this change now uses efficient builtin functions when the C compiler provides them. Overall, this change shouldn't have any impact on performance, but new functions have better defined API: static inline function with wel defined input and output types.

CuriousLearner added a commit to CuriousLearner/cpython that referenced this pull request May 29, 2020
* master: (1985 commits)
  bpo-40179: Fix translation of #elif in Argument Clinic (pythonGH-19364)
  bpo-35967: Skip test with `uname -p` on Android (pythonGH-19577)
  bpo-40257: Improve help for the typing module (pythonGH-19546)
  Fix two typos in multiprocessing (pythonGH-19571)
  bpo-40286: Use random.randbytes() in tests (pythonGH-19575)
  bpo-40286: Makes simpler the relation between randbytes() and getrandbits() (pythonGH-19574)
  bpo-39894: Route calls from pathlib.Path.samefile() to os.stat() via the path accessor (pythonGH-18836)
  bpo-39897: Remove needless `Path(self.parent)` call, which makes `is_mount()` misbehave in `Path` subclasses. (pythonGH-18839)
  bpo-40282: Allow random.getrandbits(0) (pythonGH-19539)
  bpo-40302: UTF-32 encoder SWAB4() macro use a|b rather than a+b (pythonGH-19572)
  bpo-40302: Replace PY_INT64_T with int64_t (pythonGH-19573)
  bpo-40286: Add randbytes() method to random.Random (pythonGH-19527)
  bpo-39901: Move `pathlib.Path.owner()` and `group()` implementations into the path accessor. (pythonGH-18844)
  bpo-40300: Allow empty logging.Formatter.default_msec_format. (pythonGH-19551)
  bpo-40302: Add pycore_byteswap.h header file (pythonGH-19552)
  bpo-40287: Fix SpooledTemporaryFile.seek() return value (pythonGH-19540)
  Minor modernization and readability improvement to the tokenizer example (pythonGH-19558)
  bpo-40294: Fix _asyncio when module is loaded/unloaded multiple times (pythonGH-19542)
  Fix parameter names in assertIn() docs (pythonGH-18829)
  bpo-39793: use the same domain on make_msgid tests (python#18698)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants