Proposal Details
Background
The current implementations of the base32 and base64 codecs are not ideal for cryptographic use cases. The linked document describes practical timing side-channels against general-purpose RFC 4648 codecs when they are used to encode/decode cryptographic secrets. You can verify that the current implementation uses a table look-up in both directions, which introduces the risk of cache-timing attacks.
Specialized implementations of these codecs exist. @Sc00bz wrote a constant-time implementation of base64 in C, which I used as a basis for both base32 and base64 in PHP.
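For illustration, the branchless range-adjustment technique that such implementations use can be sketched in Go. This is my adaptation of the idea behind Sc00bz's C code, not the exact code:

```go
package main

import "fmt"

// encodeChar maps a 6-bit value (0..63) to its character in the
// standard base64 alphabet using only arithmetic on the value, never
// a secret-dependent table index. Each line conditionally adjusts the
// offset for one range of the alphabet; (x >> 8) turns the sign of a
// small value x into an all-ones or all-zeros mask without branching.
func encodeChar(v int) byte {
	c := v + 'A'              // 0..25  -> 'A'..'Z'
	c += ((25 - v) >> 8) & 6  // 26..51 -> 'a'..'z' ('a'-'Z'-1 == 6)
	c -= ((51 - v) >> 8) & 75 // 52..61 -> '0'..'9'
	c -= ((61 - v) >> 8) & 15 // 62     -> '+'
	c += ((62 - v) >> 8) & 3  // 63     -> '/'
	return byte(c)
}

func main() {
	for v := 0; v < 64; v++ {
		fmt.Printf("%c", encodeChar(v))
	}
	fmt.Println()
}
```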
Proposed Change
New functions in encoding/base32 and encoding/base64:
EncodeToStringConstTime as an alternative to EncodeToString
DecodeStringConstTime as an alternative to DecodeString
These functions could then be used in other packages that handle cryptographic secrets and auditors can be absolutely sure that timing leaks are not present.
(I'm happy to provide a patch for this, if accepted.)
mateusz834 commentedon May 28, 2025
CC @golang/security
tob-scott-a commentedon May 28, 2025
Just in case there's any confusion: I do not think this qualifies as a security vulnerability (or I would have reported it as one, rather than filing an issue with the intent to follow up with a patch).
Jorropo commentedon May 28, 2025
👍 However, do we want new functions? It would be simpler for everyone if we made the current functions constant-time.
If you could send your patch already, that would be nice; I would like to benchmark it.
tob-scott-a commentedon May 28, 2025
A constant-time alternative will certainly be slower than the general-purpose functions: It replaces a map look-up with many operations.
In the context of handling cryptographic secrets, this trade-off is worth it. However, making all use-cases for a base64 codec constant-time would slow most Go programs down unnecessarily.
It is for this reason that my opening offer was to implement new functions rather than change the current behavior.
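To give a sense of what "many operations" means in practice, here is a Go sketch of constant-time decoding of a single character, translated from the range-check arithmetic in Sc00bz's public C implementation (my translation, for illustration only):

```go
package main

import "fmt"

// decodeChar maps a standard-alphabet base64 character back to its
// 6-bit value, or returns -1 for an invalid character, without a
// secret-dependent table index. Each line tests one character range:
// (lo - ch) & (ch - hi) is negative only when lo < ch < hi, so the
// arithmetic right shift yields an all-ones mask exactly in range,
// and the masked term adds (index + 1) to the initial -1.
func decodeChar(src byte) int {
	ch := int(src)
	ret := -1
	ret += (((0x40 - ch) & (ch - 0x5b)) >> 8) & (ch - 64) // 'A'..'Z' -> 0..25
	ret += (((0x60 - ch) & (ch - 0x7b)) >> 8) & (ch - 70) // 'a'..'z' -> 26..51
	ret += (((0x2f - ch) & (ch - 0x3a)) >> 8) & (ch + 5)  // '0'..'9' -> 52..61
	ret += (((0x2a - ch) & (ch - 0x2c)) >> 8) & 63        // '+'      -> 62
	ret += (((0x2e - ch) & (ch - 0x30)) >> 8) & 64        // '/'      -> 63
	return ret
}

func main() {
	fmt.Println(decodeChar('A'), decodeChar('z'), decodeChar('/'), decodeChar('!'))
}
```

Five range checks, each several arithmetic operations, versus a single indexed load in the current decoder: that is the trade-off being discussed.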
Jorropo commentedon May 28, 2025
A single map lookup is far from cheap. I don't want to risk guessing the performance impact; a benchmark would really help 🙂.
Also, in Go we often don't mind trading a few % of performance if it means making the API easier to use.
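For reference, a benchmark of the shape being asked for might look like this (a sketch using the standard testing package, not taken from any CL in this thread; normally these would live in a _test.go file, and testing.Benchmark is used here only so the sketch runs standalone):

```go
package main

import (
	"encoding/base64"
	"fmt"
	"testing"
)

// Baseline: encode and decode 1 KiB through the standard alphabet, so
// a constant-time patch can be compared against the current
// implementation (e.g. with benchstat).
var buf = make([]byte, 1024)

func BenchmarkEncodeToString(b *testing.B) {
	b.SetBytes(int64(len(buf)))
	for i := 0; i < b.N; i++ {
		base64.StdEncoding.EncodeToString(buf)
	}
}

func BenchmarkDecodeString(b *testing.B) {
	s := base64.StdEncoding.EncodeToString(buf)
	b.SetBytes(int64(len(s)))
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		if _, err := base64.StdEncoding.DecodeString(s); err != nil {
			b.Fatal(err)
		}
	}
}

func main() {
	fmt.Println("encode:", testing.Benchmark(BenchmarkEncodeToString))
	fmt.Println("decode:", testing.Benchmark(BenchmarkDecodeString))
}
```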
mateusz834 commentedon May 28, 2025
We can always do this in reverse, make the current API constant-time, but add unsafe variants (if needed) for speed. Then folks would actually use the constant-time APIs, otherwise we might have low usage of such (new) APIs.
renthraysk commentedon May 28, 2025
Doesn't the fact that the stdlib has configurable alphabets for base64/base32 rule out replacing them?
tob-scott-a commentedon May 29, 2025
@renthraysk It's not actually a total show-stopper (although it does make the patch a bit trickier to write). See #73909 for a first pass at a patch that adheres to the suggestions provided in this discussion so far.
gopherbot commentedon May 29, 2025
Change https://go.dev/cl/676917 mentions this issue:
encoding/base64: add constant-time behavior, enabled by default
rolandshoemaker commentedon May 29, 2025
I don't think we'd want to turn on constant-time behavior by default for all uses of encoding/{base32,base64}. I suspect the vast majority of their usage is for entirely benign inputs. That said, we probably would want to use it by default for encoding/pem, where the contents are much more likely to be sensitive (of course, we should probably still add some way to disable this, for people who, e.g., need to parse large numbers of certificates).
I haven't thought through this too deeply, but based on my understanding of how people typically do constant-time base64, could we add a generic decoder/encoder that generates the required bit-shift offsets based on the alphabet, rather than needing to implement per-alphabet encoders and decoders? We'd then just need to add a WithConstantTime (or something) method to Encoder.
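One alphabet-agnostic approach (not necessarily the offset-generation idea above, and not from any CL in this thread, just a common constant-time pattern for comparison) is to scan the whole alphabet and select the wanted byte with a mask, so the memory access pattern is independent of the secret index:

```go
package main

import "fmt"

// ctLookup returns table[idx] while touching every byte of table in
// the same order regardless of idx, so the access pattern leaks
// nothing about idx. It works for any alphabet string; the cost is a
// full scan per output byte.
func ctLookup(table string, idx int) byte {
	var out byte
	for i := 0; i < len(table); i++ {
		// mask is 0xFF when i == idx, else 0x00, computed branchlessly:
		// i ^ idx is zero only on a match, and (0 - 1) >> 8 is all ones.
		mask := byte(((i ^ idx) - 1) >> 8)
		out |= table[i] & mask
	}
	return out
}

func main() {
	const urlAlphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_"
	fmt.Printf("%c%c\n", ctLookup(urlAlphabet, 62), ctLookup(urlAlphabet, 63))
}
```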
tob-scott-a commentedon Jul 10, 2025
@aclements Good point. I recall @Jorropo was planning to run their own benchmarks (based on this comment). If something changes and I need to write and share a small program (and the data collected from my own machines), I'm happy to do so.
rolandshoemaker commentedon Jul 10, 2025
Benchmark based on https://go.dev/cl/676917:
Does not look great. I don't think we could reasonably accept slowing down non-sensitive base64 usages this much by default.
I didn't check the perf of @Sc00bz's impl, nor did I do much investigation into whether there are easy perf optimizations still available for https://go.dev/cl/676917.
aclements commentedon Jul 10, 2025
Thanks for running those. I'm not surprised that adding an indirect function call for every byte, as https://go.dev/cl/676917 does, slows things down dramatically. I think we'd have to find a different way to do this.
Just as an experiment, what happens if you inline that logic for a particular alphabet? (It might work to just use a direct function call, but be sure to check that the compiler actually inlines it if you try that.)
rolandshoemaker commentedon Jul 10, 2025
Inlining the calls improves the performance, but not massively:
Because we are operating on 8-byte chunks at a time, I do wonder if SIMD (or another multi-op packing technique) could provide a significantly more performant implementation.
randall77 commentedon Jul 10, 2025
For encoding at least, you only need a 64-byte lookup table. That can all fit on a single cache line. I would think if you properly aligned the lookup table it would be constant time as far as cache timing attacks go.
randall77 commentedon Jul 10, 2025
Decoding is not too hard to fit into a 64-byte table either. See CL 687396.
rolandshoemaker commentedon Jul 11, 2025
@randall77 I quite like the idea of using the fact that the LUT fits in a single cache line in theory, but in practice I'm not entirely comfortable with relying on CPU caching semantics for constant-time properties. (I'm not sure I've ever seen a case where somebody has exploited mismatches between documented behavior and implemented behavior here, but it also wouldn't entirely shock me.) If anything, the last four years or so have made me less willing to rely on CPU designers not to silently break it.
Sc00bz commentedon Jul 12, 2025
Here's base64 encode/decode of 8 characters at a time, SIMD-style, using standard 64-bit integers.
TL;DR: I'm doing eight 7-bit math operations in a 64-bit integer. There's a separator bit to prevent negative numbers from bleeding over; the separator bit is also used to build the eight 7-bit conditional masks.
I didn't test all the possible encodes/decodes (2**48), but I tested all 3-byte values (2**24) left-shifted by 0, 6, 12, 18, and 24.
SSE2/AVX2/NEON will be faster. SSE2, AVX2, and NEON have compares for bytes equal and signed greater than: _mm_cmpeq_epi8/_mm256_cmpeq_epi8/vceqq_s8 (PCMPEQB/VPCMPEQB/CMEQ) and _mm_cmpgt_epi8/_mm256_cmpgt_epi8/vcgtq_s8 (PCMPGTB/VPCMPGTB/CMGT). Thus a simple conditional add is 3 SIMD instructions (compare, and, add), but adding actual SIMD seems like overkill.
I'll benchmark this tomorrow. If someone else hasn't.
The base32 version is kind of straightforward. For decode, you'll get an error on the DIFFs where you need to flip the sign. For encode, you need to manually check for overflows and flip the appropriate signs.
I'll submit a pull request assuming these are faster. Oh, I just realized I need to make a change to save time on partial blocks: the input to the encode function should be expanded before calling, and similarly for the output of the decode function.
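The lane-parallel (SWAR) idea can be illustrated in Go with plain 8-bit lanes in a uint64. This is a simplified sketch: it omits the 7-bit-lane/separator-bit packing described above and assumes each input lane already holds one 6-bit value, so it shows the shape of the technique rather than transcribing that code:

```go
package main

import "fmt"

const (
	lsb uint64 = 0x0101010101010101 // low bit of each byte lane
	msb uint64 = 0x8080808080808080 // high bit of each byte lane
)

// encode8 turns eight 6-bit values, one per byte lane of v, into the
// eight corresponding standard-alphabet base64 ASCII bytes, all lanes
// in parallel, with no table lookups or branches.
func encode8(v uint64) uint64 {
	v &= 0x3f * lsb // keep each lane to 6 bits so lane math can't carry
	// gt(k) has 0xFF in every lane whose value exceeds k: adding
	// 128-(k+1) pushes exactly those lanes past bit 7, then the sign
	// bits are spread into full-lane masks.
	gt := func(k uint64) uint64 {
		return (((v + (128-(k+1))*lsb) & msb) >> 7) * 0xFF
	}
	c := v + 'A'*lsb         // 0..25  -> 'A'..'Z'
	c += gt(25) & (6 * lsb)  // 26..51 -> 'a'..'z'
	c -= gt(51) & (75 * lsb) // 52..61 -> '0'..'9'
	c -= gt(61) & (15 * lsb) // 62     -> '+'
	c += gt(62) & (3 * lsb)  // 63     -> '/'
	return c
}

func main() {
	// "Man" splits into the 6-bit indices 19, 22, 5, 46 -> "TWFu".
	out := encode8(19 | 22<<8 | 5<<16 | 46<<24)
	fmt.Printf("%c%c%c%c\n", byte(out), byte(out>>8), byte(out>>16), byte(out>>24))
}
```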
Sc00bz commentedon Jul 14, 2025
It's slower. I'll try a few things, but I think this will be slower without SIMD. Also, I'm pretty sure this won't change much, but the benchmark is testing encode and decode of zero bytes, so it's always hitting the same table elements: encode[0] and decodeMap['A'].
rolandshoemaker commentedon Jul 16, 2025
If we can get a just as fast (likely faster) CT SIMD assembly implementation, then I think the question about whether to add a CT implementation alongside the VT one somewhat goes out of the window (if it's just faster, why not always use it). That said, if it's significantly more complex to write an assembly implementation that is agnostic to alphabets, I think the question still remains.
apparentlymart commentedon Jul 16, 2025
Would it be acceptable to say that constant-time is guaranteed only for the standard alphabet, while other alphabets might use a more general implementation that is not constant-time, or does that seem too subtle for a security-related feature? 🤔
(I'm asking the above question primarily from an API design standpoint. I acknowledge that it also implies still maintaining two implementations rather than only one, but that was already true for the compromise of exposing constant-time and not-constant-time as explicitly separate functions.)
rolandshoemaker commentedon Jul 16, 2025
I think regardless of the underlying implementation we'll want to retain an API such as the WithConstantTime suggested in #73901 (comment). If we decide we can only enable the CT implementation for standard alphabets, then that can return an error for non-standard alphabets.
If we can come up with a fast SIMD CT implementation that can easily be extended to arbitrary alphabets, that would be nice, but I don't think it's a sticking point. Really we only want this for standard alphabets (for use in encoding/pem), and if we have to carry a slower variable-time implementation for non-standard alphabets, that's probably fine.
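To make that shape concrete, here is a toy model of the discussed API: WithConstantTime succeeds only for the standard alphabet and errors otherwise. The name comes from the thread, but the type, fields, and behavior here are hypothetical stand-ins, not the real encoding/base64 types:

```go
package main

import (
	"errors"
	"fmt"
)

const stdAlphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

// Encoding is a toy stand-in for base64.Encoding, with just enough
// state to sketch the proposed option.
type Encoding struct {
	alphabet     string
	constantTime bool
}

// WithConstantTime returns a copy of the encoding that would use the
// constant-time codec, or an error if only the standard alphabet is
// supported, as discussed in the thread.
func (e Encoding) WithConstantTime() (Encoding, error) {
	if e.alphabet != stdAlphabet {
		return e, errors.New("constant-time codec unavailable for non-standard alphabets")
	}
	e.constantTime = true
	return e, nil
}

func main() {
	std := Encoding{alphabet: stdAlphabet}
	ct, err := std.WithConstantTime()
	fmt.Println(ct.constantTime, err)
}
```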
aclements commentedon Jul 16, 2025
FWIW, the fast AVX-512 algorithms for base64 encode and decode in https://arxiv.org/pdf/1910.05109 work with arbitrary alphabets and should be straightforward to extend to base32. However, at their core is a table lookup that only AVX-512 can pull off. I haven't thought much about how to do this with smaller vectors, but will note that SIMD by its nature is generally good at turning branchy code into branchless code.
Do people use alphabets that differ in anything other than the last two elements?
magical commentedon Jul 16, 2025
Yes, there are certain password-hashing schemes that use variant alphabets: most notably bcrypt, which puts the punctuation characters ./ at the start of the alphabet instead of at the end.
https://cs.opensource.google/go/x/crypto/+/refs/tags/v0.40.0:bcrypt/base64.go;l=9
Sc00bz commentedon Jul 17, 2025
You need AVX-512's VBMI for _mm512_permutex2var_epi8 (VPERMI2B). It's also in AVX10.1, which is the first AVX10 version. (If you're not familiar, AVX10 is a consolidation of the AVX-512 instruction subsets. There's also AVX10/256, which limits register width to 256 bits, plus a promise that future versions will include all instructions from previous versions.)
AMD's Zen 4 and Zen 5 have it. Intel's Cannon Lake was the first to have it. At least Ice Lake (client), Ice Lake Xeon, and Sapphire Rapids have it, according to the Intel Intrinsics Guide and Wikipedia's compatibility table.
It might be better to target SSSE3*, AVX, and AVX2's _mm_shuffle_epi8/_mm256_shuffle_epi8 (PSHUFB/VPSHUFB). It will need 2 (base32) or 4 (base64) calls to encode and 16 calls to decode, because it's limited to a 16-byte lookup table. Newer Intel CPUs have a throughput of 2 instructions/clock, which is nice.
NEON (ARM64) has vqtbl4q_s8 (TBL), which can do a 64-byte lookup table (across four 16-byte registers). It's also easier than PSHUFB/VPSHUFB because anything out of bounds is set to 0, rather than only when the high bit is set. For ARMv7/ARM32 there's vtbl4_u8 (TBL), which can do a 32-byte lookup table (across four 8-byte registers).
* Maybe not SSSE3 since it's old and Intel only.