functable without TLS #1609

Merged
merged 2 commits into from
Dec 25, 2023

Conversation

phprus
Contributor

@phprus phprus commented Dec 12, 2023

Previous discussion in issue #1608.

@phprus phprus mentioned this pull request Dec 12, 2023

codecov bot commented Dec 12, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (a0356fa) 83.02% compared to head (ea81350) 83.03%.

Additional details and impacted files
@@           Coverage Diff            @@
##           develop    #1609   +/-   ##
========================================
  Coverage    83.02%   83.03%           
========================================
  Files          133      133           
  Lines        10895    10896    +1     
  Branches      2816     2816           
========================================
+ Hits          9046     9047    +1     
- Misses        1146     1147    +1     
+ Partials       703      702    -1     
Flag Coverage Δ
macos_clang 42.97% <ø> (ø)
macos_gcc 74.51% <100.00%> (+<0.01%) ⬆️
ubuntu_clang 81.91% <100.00%> (+<0.01%) ⬆️
ubuntu_clang_debug 81.58% <100.00%> (+<0.01%) ⬆️
ubuntu_clang_inflate_allow_invalid_dist 81.57% <100.00%> (+<0.01%) ⬆️
ubuntu_clang_inflate_strict 81.91% <100.00%> (+<0.01%) ⬆️
ubuntu_clang_mmap 82.24% <100.00%> (+<0.01%) ⬆️
ubuntu_clang_pigz 13.72% <0.00%> (+0.03%) ⬆️
ubuntu_clang_pigz_no_optim 11.36% <100.00%> (+0.06%) ⬆️
ubuntu_clang_pigz_no_threads 13.63% <0.00%> (-0.01%) ⬇️
ubuntu_clang_reduced_mem 82.31% <100.00%> (+<0.01%) ⬆️
ubuntu_clang_toolchain_riscv ∅ <ø> (∅)
ubuntu_gcc 75.08% <100.00%> (+<0.01%) ⬆️
ubuntu_gcc_aarch64 77.25% <100.00%> (+<0.01%) ⬆️
ubuntu_gcc_aarch64_compat_no_opt 75.48% <100.00%> (+<0.01%) ⬆️
ubuntu_gcc_aarch64_no_acle 76.00% <100.00%> (+<0.01%) ⬆️
ubuntu_gcc_aarch64_no_neon 76.00% <100.00%> (+<0.01%) ⬆️
ubuntu_gcc_armhf 77.04% <100.00%> (+<0.01%) ⬆️
ubuntu_gcc_armhf_compat_no_opt 75.44% <100.00%> (+<0.01%) ⬆️
ubuntu_gcc_armhf_no_acle 76.96% <100.00%> (+<0.01%) ⬆️
ubuntu_gcc_armhf_no_neon 77.11% <100.00%> (+<0.01%) ⬆️
ubuntu_gcc_armsf 74.43% <100.00%> (+<0.01%) ⬆️
ubuntu_gcc_armsf_compat_no_opt 73.89% <100.00%> (-0.02%) ⬇️
ubuntu_gcc_benchmark 73.21% <100.00%> (+<0.01%) ⬆️
ubuntu_gcc_compat_no_opt 76.68% <100.00%> (+<0.01%) ⬆️
ubuntu_gcc_compat_sprefix 73.55% <100.00%> (+<0.01%) ⬆️
ubuntu_gcc_m32 73.20% <100.00%> (+<0.01%) ⬆️
ubuntu_gcc_mingw_i686 73.46% <100.00%> (+<0.01%) ⬆️
ubuntu_gcc_mingw_x86_64 73.47% <100.00%> (+<0.01%) ⬆️
ubuntu_gcc_mips 74.76% <100.00%> (+<0.01%) ⬆️
ubuntu_gcc_mips64 74.78% <100.00%> (+<0.01%) ⬆️
ubuntu_gcc_no_avx2 74.15% <100.00%> (+<0.01%) ⬆️
ubuntu_gcc_no_ctz 74.45% <100.00%> (+<0.01%) ⬆️
ubuntu_gcc_no_ctzll 74.44% <100.00%> (+<0.01%) ⬆️
ubuntu_gcc_no_pclmulqdq 74.06% <100.00%> (+<0.01%) ⬆️
ubuntu_gcc_no_sse2 74.33% <100.00%> (+<0.01%) ⬆️
ubuntu_gcc_no_sse42 74.01% <100.00%> (+<0.01%) ⬆️
ubuntu_gcc_o1 73.97% <100.00%> (+<0.01%) ⬆️
ubuntu_gcc_osb ∅ <ø> (∅)
ubuntu_gcc_pigz 37.86% <100.00%> (+0.21%) ⬆️
ubuntu_gcc_pigz_aarch64 38.79% <100.00%> (-0.12%) ⬇️
ubuntu_gcc_ppc 73.72% <100.00%> (+<0.01%) ⬆️
ubuntu_gcc_ppc64 74.17% <100.00%> (+<0.01%) ⬆️
ubuntu_gcc_ppc64_power9 74.35% <100.00%> (+<0.01%) ⬆️
ubuntu_gcc_ppc64le 74.24% <100.00%> (+<0.01%) ⬆️
ubuntu_gcc_ppc64le_novsx 74.55% <100.00%> (+<0.01%) ⬆️
ubuntu_gcc_ppc64le_power9 74.12% <100.00%> (+<0.01%) ⬆️
ubuntu_gcc_ppc_no_power8 74.44% <100.00%> (+<0.01%) ⬆️
ubuntu_gcc_s390x 74.60% <100.00%> (+<0.01%) ⬆️
ubuntu_gcc_s390x_dfltcc 71.71% <100.00%> (+<0.01%) ⬆️
ubuntu_gcc_s390x_dfltcc_compat 73.82% <100.00%> (+<0.01%) ⬆️
ubuntu_gcc_s390x_no_crc32 74.39% <100.00%> (+<0.01%) ⬆️
ubuntu_gcc_sparc64 74.58% <100.00%> (+<0.01%) ⬆️
ubuntu_gcc_sprefix 73.21% <100.00%> (+<0.01%) ⬆️
win64_gcc 73.85% <100.00%> (+<0.01%) ⬆️
win64_gcc_compat_no_opt 74.51% <100.00%> (+<0.01%) ⬆️


@KungFuJesus
Contributor

KungFuJesus commented Dec 12, 2023

I think powerpc has 32 bit atomics but not necessarily 64 bit. This is just a vague recollection, though.

Hmm, quick bit of googling suggests that 32 bit powerpc didn't have 64 bit atomics, so, maybe not a big deal? Your PR also relies on the implicit 64 bit writes being untearable, rather than something explicit. Maybe we could use explicit atomic integer storage for the pointers? The compiler is likely to convert them to plain old stores on architectures that do have this implicit behavior (at least it does with C++'s std::atomic, I would assume as much for the compiler intrinsics in C).

@KungFuJesus
Contributor

KungFuJesus commented Dec 12, 2023

https://en.cppreference.com/w/c/atomic/atomic_store

Requires C11, but a possible solution.

There's also these, though they are more GCC specific. Maybe we could wrap them in something to shim for everything we currently support:
https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html

Uglier, but, these sorts of things guarantee correct behavior for weaker memory ordering and more importantly, torn stores. We'd only need to wrap the stores, the loads of course don't need any sync semantics of any kind because even if not cache coherent, it's either valid or it's null and if it's null, it can just reassign the pointer to the same address on its own.
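
A minimal sketch of the "wrap only the stores" idea, assuming a GCC- or Clang-compatible compiler. The names `sum_fn`, `sum_slot`, `publish_sum`, and `sum_dispatch` are hypothetical and do not come from zlib-ng; `__atomic_store` is the generic builtin from the GCC page linked above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical signature for a dispatched routine. */
typedef unsigned (*sum_fn)(const unsigned char *buf, size_t len);

/* Hypothetical implementation the slot may end up pointing to. */
static unsigned sum_bytes(const unsigned char *buf, size_t len) {
    unsigned total = 0;
    for (size_t i = 0; i < len; i++)
        total += buf[i];
    return total;
}

/* Shared slot; NULL means "not selected yet". */
static sum_fn sum_slot = NULL;

/* Publish a pointer with a single untearable store. Relaxed ordering is
 * used because only the atomicity of the write is being illustrated. */
static void publish_sum(sum_fn chosen) {
    assert(chosen != NULL);
    __atomic_store(&sum_slot, &chosen, __ATOMIC_RELAXED);
}

/* Readers use a plain load, as the comment proposes: they observe either
 * NULL or a complete pointer, and on NULL they select the same target
 * themselves and publish it again. */
static unsigned sum_dispatch(const unsigned char *buf, size_t len) {
    sum_fn fn = sum_slot;
    if (fn == NULL) {
        fn = sum_bytes;
        publish_sum(fn);
    }
    return fn(buf, len);
}
```

Only the writer side is wrapped here, which is the trade-off being discussed: the call path stays a plain indirect call.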

@phprus
Contributor Author

phprus commented Dec 12, 2023

@KungFuJesus
Not 64-bit: only a pointer-sized atomic write is needed (4 bytes on a 32-bit platform or 8 bytes on a 64-bit platform).

C11 atomics require "atomic object types". For this reason I use the builtin functions of GCC and Clang.

@phprus
Contributor Author

phprus commented Dec 12, 2023

functable is initialized with stub functions.
Each stub selects the platform-specific functions, writes them to functable, and then calls the selected platform-specific function directly. Calling it directly avoids recursion if a CPU with a weak memory model reorders loads and stores. Stubs do not read functable.

For this PR, the platform must provide a single guarantee: atomic stores of pointer-sized variables.
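
The stub scheme described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's code: the two-entry table, `select_impls`, `publish`, and all function names are hypothetical, and the GCC/Clang generic `__atomic_store` builtin stands in for whatever store primitive the PR uses.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical two-entry table; names do not come from zlib-ng. */
typedef unsigned (*binop_fn)(unsigned a, unsigned b);

struct ftable_s {
    binop_fn add;
    binop_fn mul;
};

static unsigned add_impl(unsigned a, unsigned b) { return a + b; }
static unsigned mul_impl(unsigned a, unsigned b) { return a * b; }

static unsigned add_stub(unsigned a, unsigned b);
static unsigned mul_stub(unsigned a, unsigned b);

/* The shared table starts out pointing at the stubs. */
static struct ftable_s ftable = { add_stub, mul_stub };

/* Stand-in for CPU feature detection: fill a caller-owned table. */
static void select_impls(struct ftable_s *out) {
    out->add = add_impl;
    out->mul = mul_impl;
}

/* Publish every entry with a pointer-sized atomic store. */
static void publish(struct ftable_s *chosen) {
    assert(chosen->add != NULL && chosen->mul != NULL);
    __atomic_store(&ftable.add, &chosen->add, __ATOMIC_RELAXED);
    __atomic_store(&ftable.mul, &chosen->mul, __ATOMIC_RELAXED);
}

static unsigned add_stub(unsigned a, unsigned b) {
    struct ftable_s local;      /* per-call copy, never shared between threads */
    select_impls(&local);
    publish(&local);
    return local.add(a, b);     /* call through the local copy, not ftable */
}

static unsigned mul_stub(unsigned a, unsigned b) {
    struct ftable_s local;
    select_impls(&local);
    publish(&local);
    return local.mul(a, b);
}
```

Because each stub calls through its local copy, it never depends on its own stores having become visible, so a stub cannot re-enter itself through the shared table.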

functable.c Outdated
}

#define STUB_BODY(RET, FUNC_NAME, ...) \
struct functable_s ft; \

Member

Should this be static? Am I missing something, but wouldn't this evaluate the entire functable for each function?

Contributor Author

No.
A stub can be called more than once only if the first calls of the versioned functions happen in parallel. To avoid a race, we need a local variable.

@KungFuJesus
Contributor

@KungFuJesus Not 64-bit: only a pointer-sized atomic write is needed (4 bytes on a 32-bit platform or 8 bytes on a 64-bit platform).

C11 atomics require "atomic object types". For this reason I use the builtin functions of GCC and Clang.

Right, it doesn't much matter on ppc32 if your pointer is 32 bit. The C11 bits are portable, at least to C11 capable compilers, anyway. The atomic_store macro approach here should do the job just fine, but it would involve wrapping it seven ways to Sunday for other platforms. MSVC, GCC, and Clang cover a lot, but there's also Solaris Studio, ICC, XL, Open64, etc. Admittedly obscure, but if we could rely on C11 for those, it might be multiple birds with one stone.

@phprus
Contributor Author

phprus commented Dec 12, 2023

Admittedly obscure, but if we could rely on C11 for those, it might be multiple birds with one stone.

Quick reply: No.
C11 atomics require a special type. Using a C11 atomic pointer would require rewriting every versioned function call to atomic_load the pointer and then call the function at that pointer (overhead and unnecessary atomic operations).
For this PR we need hardware guarantees, not compiler guarantees.

If ppc32 guarantees 32-bit write atomicity and ppc64 guarantees 64-bit write atomicity, then PPC support can be added.
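
The C11 objection above can be illustrated with a small sketch. It assumes a compiler that ships `<stdatomic.h>`; `op_fn`, `op_slot`, `op_install`, and `op_call` are hypothetical names, and the memory orders are one possible choice rather than anything the PR prescribes.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

typedef int (*op_fn)(int);

/* Hypothetical target function. */
static int twice(int x) { return 2 * x; }

/* With C11 atomics the slot itself must have an atomic type. */
static _Atomic(op_fn) op_slot;

static void op_install(op_fn fn) {
    assert(fn != NULL);
    atomic_store_explicit(&op_slot, fn, memory_order_release);
}

/* Every call site then has to load the pointer before calling it,
 * instead of making a plain indirect call through a struct member. */
static int op_call(int x) {
    op_fn fn = atomic_load_explicit(&op_slot, memory_order_acquire);
    return fn(x);
}
```

On many targets the load compiles to an ordinary load, but the change to the slot's type and to every call site is exactly the rewrite the comment wants to avoid.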

@KungFuJesus
Contributor

Oh sorry I followed the link and see what you're getting at. We'd need to decorate everything with that atomic keyword, which would implicitly leverage the atomic loads.

If ppc32 guarantees 32-bit write atomicity and ppc64 guarantees 64-bit write atomicity, then PPC support can be added.

I think that it does.

For this PR we need hardware guarantees, not compiler guarantees.

Right, but, with the compiler guarantees, the hardware atomicity would come with it. The compiler should be doing the right thing under the hood (for most ISAs, nothing at all, for others, using whatever atomic store is available with cache aware compare and exchanges / mutexes if needed). The GCC and MSVC paths you have in that macro do, at the cost of only supporting 3 compilers. I agree though that we should be able to avoid the atomic load, the C11 approach might make that difficult.

@phprus
Contributor Author

phprus commented Dec 12, 2023

The GCC and MSVC paths you have in that macro do, at the cost of only supporting 3 compilers.

GCC-compatible compilers, Clang-compatible compilers, and MSVC. That covers most modern compilers.
For example, IBM XL C/C++ compiler defines the __clang__ macro (https://www.ibm.com/docs/ru/xl-c-and-cpp-linux/16.1.0?topic=macros-identify-xl-cc-compiler).

@KungFuJesus
Contributor

No doubt GCCisms are widely supported. I can think of one tiny minor contribution I made for something that didn't:
libuv/libuv@ef47e8b

@KungFuJesus
Contributor

I'm also wondering, however difficult, if we could write a test case that tries to induce a torn store to make sure. I have SPARC, PowerPC, and ARM hardware I can test against for it.

Unfortunately reliably producing a race with UB is probably impossible to do deterministically.

@phprus
Contributor Author

phprus commented Dec 12, 2023

CI errors:

[2023-12-12T19:25:57.298Z] ['error'] There was an error running the uploader: Error uploading to https://codecov.io: Error: There was an error fetching the storage URL during POST: 502 -

@KungFuJesus
Contributor

CI errors:

[2023-12-12T19:25:57.298Z] ['error'] There was an error running the uploader: Error uploading to https://codecov.io: Error: There was an error fetching the storage URL during POST: 502 -

Yep, been getting that one a lot recently. Trying again usually resolves it.

@phprus phprus marked this pull request as draft December 13, 2023 06:46
@phprus
Contributor Author

phprus commented Dec 13, 2023

Converted to draft.
I need to fix one inaccuracy in the macros.

@phprus phprus force-pushed the functable-atomic-1 branch 2 times, most recently from 17f00a6 to ac98028 Compare December 13, 2023 16:10
@phprus phprus marked this pull request as ready for review December 13, 2023 17:08
@Dead2
Member

Dead2 commented Dec 13, 2023

I did a few quick benchmarks just to make sure there are no surprises.

On x86-64 it shows a 0.3% average compression speedup, practically no change for decompression.
On Aarch64 it shows a 0.13% compression speedup, well inside margin of error. Also almost a 1% decompression speedup.

Since these speedups are marginal and the tests were not extensive, the most interesting part is that they are not slower and that they always seem to err on the side of faster. I'd say that bodes well for this PR.
I also did a couple variations to provoke different cache alignments, they too turned out to be similarly faster overall.

## Baseline 5af152a6a377b45619c8860e2a25c68fa2199ed9  x86-64
 Level   Comp   Comptime min/avg/max/stddev  Decomptime min/avg/max/stddev  Compressed size
 1     44.409%      1.722/1.741/1.753/0.011        0.435/0.451/0.459/0.008       94,127,290
 2     35.518%      2.893/2.906/2.917/0.008        0.445/0.450/0.460/0.005       75,282,961
 3     33.882%      3.430/3.471/3.499/0.024        0.414/0.424/0.433/0.007       71,816,478
 4     33.174%      3.761/3.792/3.827/0.024        0.405/0.419/0.427/0.008       70,315,668
 5     32.660%      4.119/4.177/4.255/0.045        0.402/0.409/0.417/0.006       69,225,542
 6     32.508%      4.657/4.724/4.778/0.047        0.398/0.402/0.405/0.003       68,902,222
 7     32.255%      6.033/6.122/6.195/0.064        0.389/0.405/0.411/0.007       68,366,800
 8     32.167%      8.751/8.851/8.966/0.082        0.401/0.410/0.416/0.005       68,180,776
 9     31.887%   12.260/12.364/12.423/0.050        0.393/0.401/0.409/0.007       67,586,442

 avg1  34.273%                        5.350                          0.419
 tot                                385.190                         30.175      653,804,179

   text    data     bss     dec     hex filename
 145452    1384       8  146844   23d9c libz-ng.so.2

## PR #1609
  Level   Comp   Comptime min/avg/max/stddev  Decomptime min/avg/max/stddev  Compressed size
 1     44.409%      1.695/1.706/1.716/0.007        0.439/0.448/0.454/0.005       94,127,290
 2     35.518%      2.840/2.851/2.858/0.006        0.439/0.450/0.456/0.005       75,282,961
 3     33.882%      3.392/3.432/3.462/0.023        0.415/0.425/0.430/0.005       71,816,478
 4     33.174%      3.750/3.782/3.838/0.034        0.412/0.418/0.424/0.004       70,315,668
 5     32.660%      4.092/4.131/4.163/0.022        0.402/0.408/0.418/0.007       69,225,542
 6     32.508%      4.587/4.647/4.696/0.041        0.390/0.405/0.414/0.008       68,902,222
 7     32.255%      6.104/6.161/6.226/0.039        0.399/0.408/0.415/0.005       68,366,800
 8     32.167%      8.734/8.854/8.962/0.076        0.397/0.404/0.412/0.005       68,180,776
 9     31.887%   12.329/12.444/12.666/0.115        0.393/0.405/0.412/0.006       67,586,442

 avg1  34.273%                        5.334                          0.419
 tot                                384.062                         30.163      653,804,179

   text    data     bss     dec     hex filename
 145444    1576       8  147028   23e54 libz-ng.so.2

## Baseline 5af152a6a377b45619c8860e2a25c68fa2199ed9  aarch64
 Level   Comp   Comptime min/avg/max/stddev  Decomptime min/avg/max/stddev  Compressed size
 0    100.008%      0.012/0.021/0.024/0.005        0.028/0.035/0.040/0.004       15,737,543
 1     54.165%      0.473/0.514/0.544/0.027        0.117/0.139/0.159/0.016        8,523,602
 2     43.877%      0.865/0.885/0.903/0.012        0.123/0.168/0.204/0.031        6,904,649
 3     42.389%      1.146/1.175/1.188/0.015        0.129/0.148/0.173/0.014        6,670,476
 4     41.647%      1.285/1.305/1.317/0.011        0.111/0.131/0.163/0.017        6,553,763
 5     41.217%      1.409/1.435/1.446/0.014        0.125/0.135/0.141/0.006        6,485,981
 6     41.037%      1.645/1.666/1.680/0.012        0.116/0.134/0.148/0.012        6,457,780
 7     40.778%      2.096/2.132/2.150/0.018        0.119/0.132/0.143/0.011        6,416,924
 8     40.704%      2.605/2.627/2.643/0.014        0.107/0.121/0.149/0.015        6,405,244
 9     40.409%      3.171/3.188/3.201/0.011        0.120/0.132/0.144/0.011        6,358,951

 avg1  48.623%                        1.495                          0.128
 avg2  54.026%                        1.661                          0.142
 tot                                119.582                         10.202       76,514,913

   text    data     bss     dec     hex filename
 119449    1616       8  121073   1d8f1 libz-ng.so.2

## PR #1609
 Level   Comp   Comptime min/avg/max/stddev  Decomptime min/avg/max/stddev  Compressed size
 0    100.008%      0.008/0.019/0.024/0.006        0.016/0.031/0.040/0.008       15,737,543
 1     54.165%      0.494/0.524/0.538/0.015        0.101/0.130/0.163/0.021        8,523,602
 2     43.877%      0.862/0.880/0.899/0.012        0.119/0.159/0.186/0.026        6,904,649
 3     42.389%      1.156/1.173/1.185/0.011        0.103/0.139/0.161/0.018        6,670,476
 4     41.647%      1.286/1.305/1.316/0.010        0.104/0.127/0.146/0.017        6,553,763
 5     41.217%      1.384/1.413/1.432/0.017        0.120/0.137/0.161/0.012        6,485,981
 6     41.037%      1.650/1.673/1.695/0.015        0.129/0.138/0.144/0.006        6,457,780
 7     40.778%      2.104/2.127/2.155/0.014        0.109/0.136/0.154/0.016        6,416,924
 8     40.704%      2.600/2.618/2.632/0.013        0.123/0.131/0.141/0.006        6,405,244
 9     40.409%      3.176/3.197/3.209/0.011        0.112/0.135/0.160/0.018        6,358,951

 avg1  48.623%                        1.493                          0.126
 avg2  54.026%                        1.659                          0.140
 tot                                119.428                         10.107       76,514,913

   text    data     bss     dec     hex filename
 118281    1544       8  119833   1d419 libz-ng.so.2
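
For readers who want to check the quoted percentages against the tables, here is a small helper. It assumes the figures were derived from the "tot" rows, which the comment does not state; `speedup_percent` is a hypothetical name.

```c
#include <assert.h>

/* Relative time saved by the PR build, as a percentage of the baseline.
 * A positive result means the PR build was faster. */
static double speedup_percent(double baseline_s, double pr_s) {
    assert(baseline_s > 0.0);
    return 100.0 * (baseline_s - pr_s) / baseline_s;
}
```

With the "tot" compression times above, `speedup_percent(385.190, 384.062)` comes out at roughly 0.29 for x86-64 and `speedup_percent(119.582, 119.428)` at roughly 0.13 for aarch64. The aarch64 decompression totals, `speedup_percent(10.202, 10.107)`, give roughly 0.93.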

@phprus phprus marked this pull request as draft December 15, 2023 12:35
@phprus phprus marked this pull request as ready for review December 15, 2023 14:23
@phprus
Contributor Author

phprus commented Dec 15, 2023

All modern architectures implement atomic assignment of pointers.
We can remove Z_TLS completely.

@phprus
Contributor Author

phprus commented Dec 15, 2023

Rebased.

@phprus
Contributor Author

phprus commented Dec 15, 2023

CI error:

[2023-12-15T18:04:14.751Z] ['verbose'] The error stack is: Error: Error uploading to https://codecov.io: Error: There was an error fetching the storage URL during POST: 502

Again...

@nmoinvaz
Member

Unfortunately there is not much we can do about that, except report it to Codecov.

@Dead2
Member

Dead2 commented Dec 15, 2023

Re-queued the failed run, all clear now.

functable.c Outdated
# define FUNCTABLE_ASSIGN(VAR, FUNC_NAME) \
_InterlockedExchangePointer((void * volatile *)&(functable.FUNC_NAME), (void *)(VAR->FUNC_NAME))
#else
(void * volatile)(functable.FUNC_NAME) = (void *)(VAR->FUNC_NAME)

Member

Do you have a reference that confirms that this is safe?

Contributor Author

Sorry, I don't have a public link to the discussion :(

@phprus phprus marked this pull request as draft December 16, 2023 12:33
Signed-off-by: Vladislav Shchapov <vladislav@shchapov.ru>
Signed-off-by: Vladislav Shchapov <vladislav@shchapov.ru>
@phprus
Contributor Author

phprus commented Dec 21, 2023

New version:

  1. Atomic assignment of pointers.
  2. A full memory barrier after all assignments.
  3. For unsupported compilers (anything other than MSVC, Clang, and GNU-compatible ones), volatile assignment and a printed warning.
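
The three points above might look roughly like this in code. This is a sketch under assumptions, not the merged implementation: `FT_ASSIGN`, `FT_BARRIER`, `ft_shared`, and the `hash_*` names are hypothetical, the MSVC branch is left out (the excerpt reviewed earlier in this thread used `_InterlockedExchangePointer` there), and `#pragma message` is just one way a compile-time warning could be printed.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*hash_fn)(int);

struct ft_s {
    hash_fn hash_a;
    hash_fn hash_b;
};

/* Hypothetical implementations chosen by some detection step. */
static int hash_a_impl(int x) { return x ^ 0x55; }
static int hash_b_impl(int x) { return x + 7; }

static struct ft_s ft_shared;

#if defined(__GNUC__) || defined(__clang__)
/* 1. Atomic assignment of each pointer via the generic builtin. */
#  define FT_ASSIGN(SRC, NAME) \
      __atomic_store(&(ft_shared.NAME), &((SRC)->NAME), __ATOMIC_SEQ_CST)
/* 2. Full memory barrier, issued once after all assignments. */
#  define FT_BARRIER() __atomic_thread_fence(__ATOMIC_SEQ_CST)
#else
/* 3. Unknown compiler: volatile store, no barrier, and a build-time note. */
#  pragma message("no atomic intrinsics detected; using volatile stores")
#  define FT_ASSIGN(SRC, NAME) \
      (*(hash_fn volatile *)&(ft_shared.NAME) = (SRC)->NAME)
#  define FT_BARRIER() do { } while (0)
#endif

static void ft_publish(struct ft_s *chosen) {
    assert(chosen != NULL);
    FT_ASSIGN(chosen, hash_a);
    FT_ASSIGN(chosen, hash_b);
    FT_BARRIER();
}
```

Issuing the barrier once after the last store, rather than after each one, keeps the one-time initialization cheap while still ordering the whole table before any later code runs.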

@phprus phprus marked this pull request as ready for review December 21, 2023 17:19
@nmoinvaz
Member

nmoinvaz commented Dec 23, 2023

It seems OK to me. It looks like it is just trading an implementation that is compiler-specific for one that is platform-specific.

Member

@Dead2 Dead2 left a comment

LGTM

@Dead2
Member

Dead2 commented Dec 24, 2023

https://en.cppreference.com/w/c/atomic/atomic_store

Requires C11, but a possible solution.

Just a note here, zlib-ng already requires minimum C11 from version 2.1.0 and up (it was upped from C99 to C11 because of _Thread_local used in Z_TLS).

So perhaps this is all that is really needed and the selection could be simplified? It is nice to have a fallback in any case though, you never know what some compilers support. But possibly this could also be used for MSVC and M_ARM, if they really do support C11 properly.

Ref: https://github.com/zlib-ng/zlib-ng/wiki

@Dead2
Member

Dead2 commented Dec 24, 2023

To answer my own question; Yes MSVC supports it, but only added that support in 2022.
https://devblogs.microsoft.com/cppblog/c11-atomics-in-visual-studio-2022-version-17-5-preview-2/

It would be nice if we detected this stdatomic.h support and used it where available on all platforms, only falling back to more proprietary solutions if that support is missing. That way every platform will behave as closely to the others as possible while using a common modern standard, reducing the testing spread and lowering the risk of corner-case problems.

Anyone have thoughts around this?
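
One way the proposed detection could be sketched, assuming the standard `__STDC_VERSION__` and `__STDC_NO_ATOMICS__` macros are a good enough signal. `HAVE_C11_ATOMICS`, `ptr_slot`, `slot_store`, `slot_load`, and `demo_slot` are hypothetical names, and the slot holds a `void *` only to keep the example short.

```c
#include <assert.h>
#include <stddef.h>

/* C11 allows an implementation to opt out of atomics by defining
 * __STDC_NO_ATOMICS__, so both macros are checked. */
#if defined(__STDC_VERSION__) && __STDC_VERSION__ >= 201112L && \
    !defined(__STDC_NO_ATOMICS__)
#  include <stdatomic.h>
#  define HAVE_C11_ATOMICS 1
#else
#  define HAVE_C11_ATOMICS 0
#endif

#if HAVE_C11_ATOMICS
/* Preferred path: standard C11 atomics. */
typedef _Atomic(void *) ptr_slot;
static void slot_store(ptr_slot *s, void *v) { atomic_store(s, v); }
static void *slot_load(ptr_slot *s) { return atomic_load(s); }
#elif defined(__GNUC__) || defined(__clang__)
/* Fallback: GCC/Clang builtins. */
typedef void *ptr_slot;
static void slot_store(ptr_slot *s, void *v) {
    __atomic_store_n(s, v, __ATOMIC_SEQ_CST);
}
static void *slot_load(ptr_slot *s) {
    return __atomic_load_n(s, __ATOMIC_SEQ_CST);
}
#else
/* Last resort: volatile access, relying on the hardware not to tear
 * pointer-sized writes. */
typedef void *volatile ptr_slot;
static void slot_store(ptr_slot *s, void *v) { *s = v; }
static void *slot_load(ptr_slot *s) { return *s; }
#endif

/* Zero-initialized shared slot used for demonstration. */
static ptr_slot demo_slot;
```

Note that in the first branch the slot's type changes to an atomic type, which is the cost raised elsewhere in this thread.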

@phprus
Contributor Author

phprus commented Dec 24, 2023

@Dead2
C11 atomics require "atomic object types" (https://en.cppreference.com/w/c/language/atomic).
We would need to mark all function pointers with the _Atomic keyword.
However, that may add overhead when calling functions through those pointers.

@Dead2 Dead2 merged commit dfced56 into zlib-ng:develop Dec 25, 2023
135 checks passed
@Dead2 Dead2 mentioned this pull request Jan 7, 2024
@pps83
Contributor

pps83 commented Feb 13, 2024

Finally this TLS nonsense is removed. Thank you phprus!
I tried to remove it a couple of years ago in #1275 and got pushback from mtl1979, based on his belief that this code was needed, although I haven't seen any library with similar run-time CPU detection use TLS.

6 participants