Proposal Details
Background
The current implementations of the base32 and base64 codecs are not ideal for cryptographic use cases. The linked document describes practical timing side-channels against general-purpose RFC 4648 codecs when they are used to encode/decode cryptographic secrets. You can verify that the current implementation uses a table look-up in both directions, which introduces the risk of cache-timing attacks.
Specialized implementations of these codecs exist. @Sc00bz wrote a constant-time implementation of base64 in C, which I used as a basis for both base32 and base64 in PHP.
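For illustration, the branchless range-adjustment technique that such implementations use can be sketched in Go. This is my adaptation of the idea behind Sc00bz's C code, not the exact code:

```go
package main

import "fmt"

// encodeChar maps a 6-bit value (0..63) to its character in the
// standard base64 alphabet using only arithmetic on the value, never
// a secret-dependent table index. Each line conditionally adjusts the
// offset for one range of the alphabet; (x >> 8) turns the sign of a
// small value x into an all-ones or all-zeros mask without branching.
func encodeChar(v int) byte {
	c := v + 'A'              // 0..25  -> 'A'..'Z'
	c += ((25 - v) >> 8) & 6  // 26..51 -> 'a'..'z' ('a'-'Z'-1 == 6)
	c -= ((51 - v) >> 8) & 75 // 52..61 -> '0'..'9'
	c -= ((61 - v) >> 8) & 15 // 62     -> '+'
	c += ((62 - v) >> 8) & 3  // 63     -> '/'
	return byte(c)
}

func main() {
	for v := 0; v < 64; v++ {
		fmt.Printf("%c", encodeChar(v))
	}
	fmt.Println()
}
```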
Proposed Change
New functions in encoding/base32 and encoding/base64:
EncodeToStringConstTime as an alternative to EncodeToString
DecodeStringConstTime as an alternative to DecodeString
These functions could then be used in other packages that handle cryptographic secrets and auditors can be absolutely sure that timing leaks are not present.
(I'm happy to provide a patch for this, if accepted.)
mateusz834 commentedon May 28, 2025
CC @golang/security
tob-scott-a commentedon May 28, 2025
Just in case there's any confusion: I do not think this qualifies as a security vulnerability (or I would have reported it as one, rather than filing an issue with the intent to follow up with a patch).
Jorropo commentedon May 28, 2025
👍 However, do we want new functions? It would be simpler for everyone if we made the current functions constant-time.
If you could send your patch already, that would be nice; I would like to benchmark it.
tob-scott-a commentedon May 28, 2025
A constant-time alternative will certainly be slower than the general-purpose functions: It replaces a map look-up with many operations.
In the context of handling cryptographic secrets, this trade-off is worth it. However, making all use-cases for a base64 codec constant-time would slow most Go programs down unnecessarily.
It is for this reason that my opening offer was to implement new functions rather than change the current behavior.
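To give a sense of what "many operations" means in practice, here is a Go sketch of constant-time decoding of a single character, translated from the range-check arithmetic in Sc00bz's public C implementation (my translation, for illustration only):

```go
package main

import "fmt"

// decodeChar maps a standard-alphabet base64 character back to its
// 6-bit value, or returns -1 for an invalid character, without a
// secret-dependent table index. Each line tests one character range:
// (lo - ch) & (ch - hi) is negative only when lo < ch < hi, so the
// arithmetic right shift yields an all-ones mask exactly in range,
// and the masked term adds (index + 1) to the initial -1.
func decodeChar(src byte) int {
	ch := int(src)
	ret := -1
	ret += (((0x40 - ch) & (ch - 0x5b)) >> 8) & (ch - 64) // 'A'..'Z' -> 0..25
	ret += (((0x60 - ch) & (ch - 0x7b)) >> 8) & (ch - 70) // 'a'..'z' -> 26..51
	ret += (((0x2f - ch) & (ch - 0x3a)) >> 8) & (ch + 5)  // '0'..'9' -> 52..61
	ret += (((0x2a - ch) & (ch - 0x2c)) >> 8) & 63        // '+'      -> 62
	ret += (((0x2e - ch) & (ch - 0x30)) >> 8) & 64        // '/'      -> 63
	return ret
}

func main() {
	fmt.Println(decodeChar('A'), decodeChar('z'), decodeChar('/'), decodeChar('!'))
}
```

Five range checks, each several arithmetic operations, versus a single indexed load in the current decoder: that is the trade-off being discussed.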
Jorropo commentedon May 28, 2025
A single map lookup is far from cheap. I don't want to risk guessing the performance impact; a benchmark would really help 🙂.
Also, in Go we often don't mind trading a few % of performance if it means making the API easier to use.
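For reference, a benchmark of the shape being asked for might look like this (a sketch using the standard testing package, not taken from any CL in this thread; normally these would live in a _test.go file, and testing.Benchmark is used here only so the sketch runs standalone):

```go
package main

import (
	"encoding/base64"
	"fmt"
	"testing"
)

// Baseline: encode and decode 1 KiB through the standard alphabet, so
// a constant-time patch can be compared against the current
// implementation (e.g. with benchstat).
var buf = make([]byte, 1024)

func BenchmarkEncodeToString(b *testing.B) {
	b.SetBytes(int64(len(buf)))
	for i := 0; i < b.N; i++ {
		base64.StdEncoding.EncodeToString(buf)
	}
}

func BenchmarkDecodeString(b *testing.B) {
	s := base64.StdEncoding.EncodeToString(buf)
	b.SetBytes(int64(len(s)))
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		if _, err := base64.StdEncoding.DecodeString(s); err != nil {
			b.Fatal(err)
		}
	}
}

func main() {
	fmt.Println("encode:", testing.Benchmark(BenchmarkEncodeToString))
	fmt.Println("decode:", testing.Benchmark(BenchmarkDecodeString))
}
```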
mateusz834 commentedon May 28, 2025
We can always do this in reverse, make the current API constant-time, but add unsafe variants (if needed) for speed. Then folks would actually use the constant-time APIs, otherwise we might have low usage of such (new) APIs.
renthraysk commentedon May 28, 2025
Doesn't the fact that the stdlib has configurable alphabets for base64/base32 rule out replacing them?
tob-scott-a commentedon May 29, 2025
@renthraysk It's not actually a total show-stopper (although it does make the patch a bit trickier to write). See #73909 for a first pass at a patch that adheres to the suggestions provided in this discussion so far.
gopherbot commentedon May 29, 2025
Change https://go.dev/cl/676917 mentions this issue:
encoding/base64: add constant-time behavior, enabled by default
rolandshoemaker commentedon May 29, 2025
I don't think we'd want to turn on constant-time behavior by default for all uses of encoding/{base32,base64}. I suspect the vast majority of their usage is for entirely benign inputs. That said, we probably would want to use it by default for encoding/pem, where the contents are much more likely to be sensitive (of course, we should probably still add some way to disable this, for people who, e.g., need to parse large numbers of certificates).
I haven't thought through this too deeply, but based on my understanding of how people typically do constant-time base64, could we add a generic decoder/encoder that generates the required bit-shift offsets based on the alphabet, rather than needing to implement per-alphabet encoders and decoders? We'd then just need to add a WithConstantTime (or something) method to Encoder.
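One alphabet-agnostic approach (not necessarily the offset-generation idea above, and not from any CL in this thread, just a common constant-time pattern for comparison) is to scan the whole alphabet and select the wanted byte with a mask, so the memory access pattern is independent of the secret index:

```go
package main

import "fmt"

// ctLookup returns table[idx] while touching every byte of table in
// the same order regardless of idx, so the access pattern leaks
// nothing about idx. It works for any alphabet string; the cost is a
// full scan per output byte.
func ctLookup(table string, idx int) byte {
	var out byte
	for i := 0; i < len(table); i++ {
		// mask is 0xFF when i == idx, else 0x00, computed branchlessly:
		// i ^ idx is zero only on a match, and (0 - 1) >> 8 is all ones.
		mask := byte(((i ^ idx) - 1) >> 8)
		out |= table[i] & mask
	}
	return out
}

func main() {
	const urlAlphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_"
	fmt.Printf("%c%c\n", ctLookup(urlAlphabet, 62), ctLookup(urlAlphabet, 63))
}
```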
tob-scott-a commentedon Jul 10, 2025
@aclements Good point. I recall @Jorropo was planning to run their own benchmarks (based on this comment). If something changes and I need to write and share a small program (and the data collected from my own machines), I'm happy to do so.
rolandshoemaker commentedon Jul 10, 2025
Benchmark based on https://go.dev/cl/676917:
Does not look great. I don't think we could reasonably accept slowing down non-sensitive base64 usages this much by default.
I didn't check the perf of @Sc00bz's impl, nor did I do much investigation into whether there are easy perf optimizations still available for https://go.dev/cl/676917.
aclements commentedon Jul 10, 2025
Thanks for running those. I'm not surprised that adding an indirect function call for every byte, as https://go.dev/cl/676917 does, slows things down dramatically. I think we'd have to find a different way to do this.
Just as an experiment, what happens if you inline that logic for a particular alphabet? (It might work to just use a direct function call, but be sure to check that the compiler actually inlines it if you try that.)
rolandshoemaker commentedon Jul 10, 2025
Inlining the calls improves the performance, but not massively:
Because we are operating on 8-byte chunks at a time, I do wonder if SIMD (or another multi-op packing technique) could provide a significantly more performant implementation.
randall77 commentedon Jul 10, 2025
For encoding at least, you only need a 64-byte lookup table. That can all fit on a single cache line. I would think if you properly aligned the lookup table it would be constant time as far as cache timing attacks go.
randall77 commentedon Jul 10, 2025
Decoding is not too hard to fit into a 64-byte table either. See CL 687396.
rolandshoemaker commentedon Jul 11, 2025
@randall77 I quite like the idea of using the fact that the LUT fits in a single cache line in theory, but in practice I'm not entirely comfortable with relying on CPU caching semantics for constant-time properties. (I'm not sure I've ever seen a case where somebody has exploited mismatches between documented behavior and implemented behavior here, but it also wouldn't entirely shock me.) If anything, the last four years or so have made me less willing to rely on CPU designers not to silently break it.
Sc00bz commentedon Jul 12, 2025
Here's base64 encode/decode of 8 characters at a time, SIMD-style, using standard 64-bit integers.
TL;DR: I'm doing eight 7-bit math operations in a 64-bit integer. There's a separator bit to prevent negative numbers from bleeding over; the separator bit is also used to build the eight 7-bit conditional masks.
I didn't test all the possible encodes/decodes (2**48), but I tested all 3-byte values (2**24) left-shifted by 0, 6, 12, 18, and 24.
SSE2/AVX2/NEON will be faster. SSE2, AVX2, and NEON have compares for bytes equal and signed greater than: _mm_cmpeq_epi8/_mm256_cmpeq_epi8/vceqq_s8 (PCMPEQB/VPCMPEQB/CMEQ) and _mm_cmpgt_epi8/_mm256_cmpgt_epi8/vcgtq_s8 (PCMPGTB/VPCMPGTB/CMGT). Thus a simple conditional add is 3 SIMD instructions (compare, and, add), but adding actual SIMD seems like overkill.
I'll benchmark this tomorrow. If someone else hasn't.
The base32 version is kind of straightforward. For decode, you'll get an error on the DIFFs where you need to flip the sign. For encode, you need to manually check for overflows and flip the appropriate signs.
I'll submit a pull request assuming these are faster. Oh, I just realized I need to make a change to save time on partial blocks: the input to the encode function should be expanded before calling, and similarly for the output of the decode function.
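The lane-parallel (SWAR) idea can be illustrated in Go with plain 8-bit lanes in a uint64. This is a simplified sketch: it omits the 7-bit-lane/separator-bit packing described above and assumes each input lane already holds one 6-bit value, so it shows the shape of the technique rather than transcribing that code:

```go
package main

import "fmt"

const (
	lsb uint64 = 0x0101010101010101 // low bit of each byte lane
	msb uint64 = 0x8080808080808080 // high bit of each byte lane
)

// encode8 turns eight 6-bit values, one per byte lane of v, into the
// eight corresponding standard-alphabet base64 ASCII bytes, all lanes
// in parallel, with no table lookups or branches.
func encode8(v uint64) uint64 {
	v &= 0x3f * lsb // keep each lane to 6 bits so lane math can't carry
	// gt(k) has 0xFF in every lane whose value exceeds k: adding
	// 128-(k+1) pushes exactly those lanes past bit 7, then the sign
	// bits are spread into full-lane masks.
	gt := func(k uint64) uint64 {
		return (((v + (128-(k+1))*lsb) & msb) >> 7) * 0xFF
	}
	c := v + 'A'*lsb         // 0..25  -> 'A'..'Z'
	c += gt(25) & (6 * lsb)  // 26..51 -> 'a'..'z'
	c -= gt(51) & (75 * lsb) // 52..61 -> '0'..'9'
	c -= gt(61) & (15 * lsb) // 62     -> '+'
	c += gt(62) & (3 * lsb)  // 63     -> '/'
	return c
}

func main() {
	// "Man" splits into the 6-bit indices 19, 22, 5, 46 -> "TWFu".
	out := encode8(19 | 22<<8 | 5<<16 | 46<<24)
	fmt.Printf("%c%c%c%c\n", byte(out), byte(out>>8), byte(out>>16), byte(out>>24))
}
```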
Sc00bz commentedon Jul 14, 2025
It's slower. I'll try a few things, but I think this will be slower without SIMD. Also, I'm pretty sure this won't change much, but the benchmark is testing encode and decode of zero bytes, so it's always hitting the same table elements: encode[0] and decodeMap['A'].
rolandshoemaker commentedon Jul 16, 2025
If we can get a just as fast (likely faster) CT SIMD assembly implementation, then I think the question about whether to add a CT implementation alongside the VT one somewhat goes out of the window (if it's just faster, why not always use it). That said, if it's significantly more complex to write an assembly implementation that is agnostic to alphabets, I think the question still remains.
apparentlymart commentedon Jul 16, 2025
Would it be acceptable to say that constant-time is guaranteed only for the standard alphabet, while other alphabets might use a more general implementation that is not constant-time, or does that seem too subtle for a security-related feature? 🤔
(I'm asking the above question primarily from an API design standpoint. I acknowledge that it also implies still maintaining two implementations rather than only one, but that was already true for the compromise of exposing constant-time and not-constant-time as explicitly separate functions.)
rolandshoemaker commentedon Jul 16, 2025
I think regardless of the underlying implementation we'll want to retain an API such as the WithConstantTime suggested in #73901 (comment). If we decide we can only enable the CT implementation for standard alphabets, then that can return an error for non-standard alphabets.
If we can come up with a fast SIMD CT implementation that can easily be extended to arbitrary alphabets, that would be nice, but I don't think it's a sticking point. Really we only want this for standard alphabets (for use in encoding/pem), and if we have to carry a slower variable-time implementation for non-standard alphabets, that's probably fine.
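To make that shape concrete, here is a toy model of the discussed API: WithConstantTime succeeds only for the standard alphabet and errors otherwise. The name comes from the thread, but the type, fields, and behavior here are hypothetical stand-ins, not the real encoding/base64 types:

```go
package main

import (
	"errors"
	"fmt"
)

const stdAlphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

// Encoding is a toy stand-in for base64.Encoding, with just enough
// state to sketch the proposed option.
type Encoding struct {
	alphabet     string
	constantTime bool
}

// WithConstantTime returns a copy of the encoding that would use the
// constant-time codec, or an error if only the standard alphabet is
// supported, as discussed in the thread.
func (e Encoding) WithConstantTime() (Encoding, error) {
	if e.alphabet != stdAlphabet {
		return e, errors.New("constant-time codec unavailable for non-standard alphabets")
	}
	e.constantTime = true
	return e, nil
}

func main() {
	std := Encoding{alphabet: stdAlphabet}
	ct, err := std.WithConstantTime()
	fmt.Println(ct.constantTime, err)
}
```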
aclements commentedon Jul 16, 2025
FWIW, the fast AVX-512 algorithms for base64 encode and decode in https://arxiv.org/pdf/1910.05109 work with arbitrary alphabets and should be straightforward to extend to base32. However, at their core is a table lookup that only AVX-512 can pull off. I haven't thought much about how to do this with smaller vectors, but will note that SIMD by its nature is generally good at turning branchy code into branchless code.
Do people use alphabets that differ in anything other than the last two elements?
magical commentedon Jul 16, 2025
Yes, there are certain password-hashing schemes that use variant alphabets: most notably bcrypt, which puts the punctuation characters ./ at the start of the alphabet instead of at the end.
https://cs.opensource.google/go/x/crypto/+/refs/tags/v0.40.0:bcrypt/base64.go;l=9
Sc00bz commentedon Jul 17, 2025
You need AVX-512's VBMI for _mm512_permutex2var_epi8 (VPERMI2B). It's also in AVX10.1, which is the first AVX10 version. (If you're not familiar, AVX10 is a consolidation of the AVX-512 instruction subsets. There's also AVX10/256, which limits register width to 256 bits, plus a promise that future versions will include all instructions from previous versions.)
AMD's Zen 4 and Zen 5 have it. Intel's Cannon Lake was the first to have it. At least Ice Lake (client), Ice Lake Xeon, and Sapphire Rapids have it, according to the Intel Intrinsics Guide and Wikipedia's compatibility table.
It might be better to target SSSE3*, AVX, and AVX2's _mm_shuffle_epi8/_mm256_shuffle_epi8 (PSHUFB/VPSHUFB). It will need 2 (base32) or 4 (base64) calls to encode and 16 calls to decode, because it's limited to a 16-byte lookup table. Newer Intel CPUs have a throughput of 2 instructions/clock, which is nice.
NEON (ARM64) has vqtbl4q_s8 (TBL), which can do a 64-byte lookup table (across four 16-byte registers). It's also easier than PSHUFB/VPSHUFB because anything out of bounds is set to 0, rather than only when the high bit is set. For ARMv7/ARM32 there's vtbl4_u8 (TBL), which can do a 32-byte lookup table (across four 8-byte registers).
* Maybe not SSSE3 since it's old and Intel only.