@JukkaL JukkaL commented Nov 19, 2025

Decoding can be 10x faster than stdlib if the input is valid base64, or if the input has extra non-base64 characters only at the end. Similar to the base64 encode implementation I added recently, this uses SIMD instructions when available.

The implementation first tries to decode the input optimistically, assuming valid base64. If this fails, we fall back to a slow path: a preprocessing step removes extra characters, and we then perform a strict base64 decode on the cleaned-up input.
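The two-phase strategy described above can be sketched roughly as follows. This is a minimal scalar illustration, not the code in this PR: the names `b64_value`, `strict_decode`, and `decode_with_fallback` are hypothetical, there is no SIMD, and the cleanup step simply keeps every alphabet character and every `=`, which is a simplification of how stray padding is actually handled.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Map one base64 character to its 6-bit value, or -1 if it is not in the
   alphabet. Hypothetical helper for this sketch. */
static int b64_value(unsigned char c) {
    if (c >= 'A' && c <= 'Z') return c - 'A';
    if (c >= 'a' && c <= 'z') return c - 'a' + 26;
    if (c >= '0' && c <= '9') return c - '0' + 52;
    if (c == '+') return 62;
    if (c == '/') return 63;
    return -1;
}

/* Strict decode: the length must be a multiple of 4 and '=' may only appear
   as one or two trailing padding characters. Returns the number of bytes
   written to dst, or -1 on invalid input. */
static long strict_decode(const char *src, size_t len, unsigned char *dst) {
    if (len % 4 != 0) return -1;
    size_t pad = 0;
    if (len >= 4 && src[len - 1] == '=') {
        pad = (src[len - 2] == '=') ? 2 : 1;
    }
    size_t data_len = len - pad;
    size_t out = 0;
    uint32_t acc = 0;
    int bits = 0;
    for (size_t i = 0; i < data_len; i++) {
        int v = b64_value((unsigned char)src[i]);
        if (v < 0) return -1;            /* includes '=' in the middle */
        acc = (acc << 6) | (uint32_t)v;
        bits += 6;
        if (bits >= 8) {
            bits -= 8;
            dst[out++] = (unsigned char)((acc >> bits) & 0xFF);
        }
    }
    return (long)out;
}

/* Optimistic strict pass first; if it fails, strip characters outside the
   alphabet and retry strictly on the cleaned copy. dst must have room for
   at least len bytes. Returns bytes written, or -1 on failure. */
long decode_with_fallback(const char *src, size_t len, unsigned char *dst) {
    long n = strict_decode(src, len, dst);
    if (n >= 0) return n;                /* fast path succeeded */

    char *clean = malloc(len + 1);       /* +1 avoids a zero-size request */
    if (clean == NULL) return -1;
    size_t m = 0;
    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)src[i];
        if (b64_value(c) >= 0 || c == '=') clean[m++] = (char)c;
    }
    n = strict_decode(clean, m, dst);
    free(clean);
    return n;
}
```

With this sketch, `"aGVsbG8="` decodes on the first pass, while `"aGVs\nbG8=\n"` fails the strict pass because of the newlines and succeeds only after cleanup. The valid case never pays for the extra copy.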

The semantics aren't 100% compatible with stdlib. First, we raise ValueError on invalid padding instead of binascii.Error, since I don't want a runtime dependency on the unrelated binascii module. This needs to be documented, but stdlib can already raise ValueError under other conditions, so the deviation is not huge. Also, some invalid inputs are checked more strictly for padding violations. The stdlib implementation has some mysterious behaviors with invalid inputs that didn't seem worth replicating.

The function only accepts a single ASCII str or bytes argument for now, since that seems to be by far the most common use case. The stdlib function also accepts buffer objects and a validate argument.

The slow path is still somewhat faster than stdlib (on the order of 1.3x to 2x for longer inputs), at least if the input is much smaller than L1 cache size.

Got the initial fast path implementation from ChatGPT, but did a bunch of manual edits afterwards and reviewed carefully.

Comment on lines 73 to 74
return ((c >= 'A' && c <= 'Z') | (c >= 'a' && c <= 'z') |
(c >= '0' && c <= '9') | (c == '+') | (c == '/') | (allow_padding && c == '='));
Collaborator

is there a reason to mix logical && and bitwise |?

Collaborator Author

No particularly good reason, mainly to highlight that these don't need branches at runtime (we don't want many mispredicted branches). I think this was from ChatGPT output and I thought it was okay in this use case. The semantics are identical so it's probably better to use && consistently though.

Collaborator Author

I did some experiments, and using || might help compilers generate faster code, so I will use it.
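For reference, the two spellings discussed in this thread can be put side by side. This is only a sketch: the function names `is_b64_char_bitwise` and `is_b64_char_logical` are hypothetical, and the parameter types are assumed. Each operand is a side-effect-free comparison that yields 0 or 1, so both forms return the same result for every input; any difference lies in the code the compiler generates, which the sketch does not measure.

```c
#include <stdbool.h>

/* Form from the reviewed snippet: range checks use &&, and the results are
   combined with bitwise |, so every comparison is evaluated. */
static bool is_b64_char_bitwise(unsigned char c, bool allow_padding) {
    return ((c >= 'A' && c <= 'Z') | (c >= 'a' && c <= 'z') |
            (c >= '0' && c <= '9') | (c == '+') | (c == '/') |
            (allow_padding && c == '=')) != 0;
}

/* Same predicate written with logical || throughout, which permits
   short-circuit evaluation. */
static bool is_b64_char_logical(unsigned char c, bool allow_padding) {
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
           (c >= '0' && c <= '9') || (c == '+') || (c == '/') ||
           (allow_padding && c == '=');
}
```

Checking all 256 byte values with `allow_padding` both on and off is a cheap way to confirm the two forms agree before switching from one to the other.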

@github-actions
Contributor

According to mypy_primer, this change doesn't affect type check results on a corpus of open source code. ✅

@JukkaL JukkaL merged commit 35e843c into master Nov 19, 2025
21 checks passed
@JukkaL JukkaL deleted the mypyc-base64-4-decode branch November 19, 2025 17:18
JukkaL added a commit that referenced this pull request Nov 20, 2025
JukkaL added a commit that referenced this pull request Nov 20, 2025