New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make ActiveStorage::Blob keys lowercase #34818
Make ActiveStorage::Blob keys lowercase #34818
Conversation
This also merits a changelog entry. Can you please add one? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, but I think base58.downcase
is not really the right way to fix this.
The reason is that by generating base58 string, we could ended up with Aaaaa...
and aaaaa...
which when downcase will cause a key collision.
If we want to make it case insensitive, that means we have to throw away all lowercase or all uppercase characters altogether, which means we need to switch to something else. I was looking around and seems like maybe we can implement Base32
then downcase, or z-base-32
instead.
I'll let @georgeclaghorn decide what we should do next, but base58.downcase
is definitely not what we're looking for for sure.
@sikachu @georgeclaghorn thank you for your comments. @sikachu as it stands now ActiveStorage::Blob keys are not immune to collisions and this is actually a severe issue, see #34805 - but I do get your point that we should probably use the entropy source better. |
I think it's fine to add However, I'm not really sure that I actually found this implementation of base32 here which is under MIT license, so I think we may be able to use as a reference: https://github.com/stesla/base32/blob/a65a6cded0b37a61d10c5a1bef400464eb353cb1/lib/base32.rb#L56-L62 |
instead of downcasing the Base58 variant
Why would you want the key to be base64 and/or base32 conformant? It does not give you extra guarantees of uniqueness or the like, but it does force you to do bookkeeping to have N-complement key lengths (28 is 4-complement) and creates those |
Ah, I agree with you that this is a bag of key situation and hence it does not really matter. I just thought that base32 is closest to what we want to archive (single-case alphabets and numeric) and is one of the standard so adding this to the core would have a high probability of it being reused, despite this is its first usage. I think I'm going to have to ask @georgeclaghorn if "omitting the similar looking characters" is one of the feature we want or is it just that base58 is already available so we just use it. If it doesn't matter, then maybe base36 will be fine, otherwise I think base32 will be a great addition to the core as well. |
Base36 sounds fine to me, but the implementation as it stands now needs to move to the SecureRandom extension, where it can live alongside |
Braces are to avoid warnings when running tests as we start the arguments with a /
🎉 please squash-merge since there are quite a few interim commits |
Thank you! ❤️ |
# p SecureRandom.base36(24) # => "77tmhrhjfvfdwodq8w7ev2m7" | ||
def self.base36(n = 16) | ||
SecureRandom.random_bytes(n).unpack("C*").map do |byte| | ||
idx = byte % 64 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This has a nearly 50% chance of being unable to use the random byte and having to fall back to random number... it'd presumably be worth benchmarking before spending any real effort on it, but a quick shot at something that should be cheaper: (Someone want to check my maths?)
idx = byte / 7
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be, I would actually rewrite both base58 and base36 to use a much more pedestrian
n.times.map { alphabet[SecureRandom.random_number(alphabet.length) }.join
More method calls but probably "less clever" is better in this case. I didn't touch the implementation because I was thinking "this byte trickery for obtaining bytes from a potentially-slow SecureRandom is here for a reason"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just didn't want to touch it since this is not what the PR is about and this would need microbenchmarks etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was curious enough to follow through, so:
Benchmark: https://gist.github.com/matthewd/6827623f4e0308943e8c82db490cc118
We do generally avoid programming to microbenchmarks in high level library code, but I think the other side of that policy is being more accepting of minor ugliness when we're building lower level primitives.
(As a point of comparison, it'd be entirely silly to switch to base32 just because we can generate them twice as fast. This is purely about "there's a measurably-faster way to achieve the same thing for a similar amount of code".)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Noted, I'll open a patch for this once I'm done with #34827
Make ActiveStorage::Blob keys lowercase rails/rails#34818
Summary
When used with case-insensitive filesystems ActiveStorage blobs might cause non-portable Disk storage directories to be created, and the UNIQUE index that gets configured on the
active_storage_blobs
table is database-dependent in terms of case-sensitivity. If the database index is case-sensitive, but the underlying filesystem is not, blobs might be overwritten and the available key space for uploaded blobs shrinks drastically. This does happen on macOS filesystems already.This patch changes the ActiveStorage key generation from mixed-case base-58 to lowercase base-36, and increases the key size from 24 bytes to 28. When using base58 there are
58**24
possible key values, and to have the same or more possible keys with base36 we need to have36**28
, increasing the storage requirement for the key by 4 bytes. Since the database migrations for the blobs table are not CHAR()-size-restricted (and the fix worked for me both on SQLite and MySQL) I believe this is a reasonable compromise.Note that this will not alter the storage keys that have already been generated, their treatment wills tay unchanged. Neither will it correct the storage directory structures that were created using mixed-case keys on macOS since you need to "rekey" the stored objects if you want the storage directory to be portable to a case-sensitive filesystem.
Closes #34804 , addresses the underlying issue in #33864 for freshly-generated uploaded blobs. Unfortunately it cannot be excluded that for some people some blobs already did get overwritten if SQLite was used in combination with a case-insensitive filesystem and the Disk storage service. Neither does this change existing blob keys.
This change does have an implication that for existing S3 buckets and for other services that potentially use "naive partitioning" (first N bytes of the given object store key) partition usage is going to become less "fair". However, AWS recently dropped the recommendation on manual key distribution for the key prefix (https://www.infoq.com/news/2018/10/amazon-s3-performance-increase) as it seems they have adopted hashing for the keys the user provides, so this would not be a problem for buckets that get created.
The advantages of using a lowercase-only key: