Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make ActiveStorage::Blob keys lowercase #34818

Merged
merged 12 commits into from Dec 30, 2018

Conversation

julik
Copy link
Contributor

@julik julik commented Dec 28, 2018

Summary

When used with case-insensitive filesystems ActiveStorage blobs might cause non-portable Disk storage directories to be created, and the UNIQUE index that gets configured on the active_storage_blobs table is database-dependent in terms of case-sensitivity. If the database index is case-sensitive, but the underlying filesystem is not, blobs might be overwritten and the available key space for uploaded blobs shrinks drastically. This does happen on macOS filesystems already.

This patch changes the ActiveStorage key generation from mixed-case base-58 to lowercase base-36, and increases the key size from 24 bytes to 28. When using base58 there are 58**24 possible key values, and to have the same or more possible keys with base36 we need to have 36**28, increasing the storage requirement for the key by 4 bytes. Since the database migrations for the blobs table are not CHAR()-size-restricted (and the fix worked for me both on SQLite and MySQL) I believe this is a reasonable compromise.

Note that this will not alter the storage keys that have already been generated, their treatment wills tay unchanged. Neither will it correct the storage directory structures that were created using mixed-case keys on macOS since you need to "rekey" the stored objects if you want the storage directory to be portable to a case-sensitive filesystem.

Closes #34804 , addresses the underlying issue in #33864 for freshly-generated uploaded blobs. Unfortunately it cannot be excluded that for some people some blobs already did get overwritten if SQLite was used in combination with a case-insensitive filesystem and the Disk storage service. Neither does this change existing blob keys.

This change does have an implication that for existing S3 buckets and for other services that potentially use "naive partitioning" (first N bytes of the given object store key) partition usage is going to become less "fair". However, AWS recently dropped the recommendation on manual key distribution for the key prefix (https://www.infoq.com/news/2018/10/amazon-s3-performance-increase) as it seems they have adopted hashing for the keys the user provides, so this would not be a problem for buckets that get created.

The advantages of using a lowercase-only key:

  • Less potential issues with collation-sensitive database indices
  • Removes issues with potentially case-insensitive blob stores
  • Makes it easier to download blobs from a remote store to a case-insensitive filesystem for debugging

@georgeclaghorn georgeclaghorn self-assigned this Dec 30, 2018
@georgeclaghorn
Copy link
Contributor

This also merits a changelog entry. Can you please add one?

Copy link
Member

@sikachu sikachu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, but I think base58.downcase is not really the right way to fix this.

The reason is that by generating base58 string, we could ended up with Aaaaa... and aaaaa... which when downcase will cause a key collision.

If we want to make it case insensitive, that means we have to throw away all lowercase or all uppercase characters altogether, which means we need to switch to something else. I was looking around and seems like maybe we can implement Base32 then downcase, or z-base-32 instead.

I'll let @georgeclaghorn decide what we should do next, but base58.downcase is definitely not what we're looking for for sure.

@julik
Copy link
Contributor Author

julik commented Dec 30, 2018

@sikachu @georgeclaghorn thank you for your comments. base58(k).downcase is a cheap solution indeed, because otherwise I would have to add a base36 implementation to SecureRandom which would increase the API surface there (while not reused anywhere else). I can copy-paste the implementation for base58 with base36 into the Blob class proper if that will be the preference.

@sikachu as it stands now ActiveStorage::Blob keys are not immune to collisions and this is actually a severe issue, see #34805 - but I do get your point that we should probably use the entropy source better.

@sikachu
Copy link
Member

sikachu commented Dec 30, 2018

I think it's fine to add base32 just for Active Storage. base58 was added only for Active Record's secure token as well. I think making SecureRandom generate base32 will still be safer than base58 and downcase.

However, I'm not really sure that base32 or base36 should be used here. Base58 actually exclude some characters that are ambiguous, same goes with base32. We may be able to use base32 and conform to RFC 4648, then just make sure that the length we choose is multiplication of 8 so we don't have to add padding.

I actually found this implementation of base32 here which is under MIT license, so I think we may be able to use as a reference: https://github.com/stesla/base32/blob/a65a6cded0b37a61d10c5a1bef400464eb353cb1/lib/base32.rb#L56-L62

@julik
Copy link
Contributor Author

julik commented Dec 30, 2018

Why would you want the key to be base64 and/or base32 conformant? It does not give you extra guarantees of uniqueness or the like, but it does force you to do bookkeeping to have N-complement key lengths (28 is 4-complement) and creates those =s. Base58 does omit chars that look similar ("lowercase-L versus 1" etc) but I don't know if this has value when you are dealing with a "bag of keys" situation at all.

@sikachu
Copy link
Member

sikachu commented Dec 30, 2018

Ah, I agree with you that this is a bag of key situation and hence it does not really matter. I just thought that base32 is closest to what we want to archive (single-case alphabets and numeric) and is one of the standard so adding this to the core would have a high probability of it being reused, despite this is its first usage.

I think I'm going to have to ask @georgeclaghorn if "omitting the similar looking characters" is one of the feature we want or is it just that base58 is already available so we just use it. If it doesn't matter, then maybe base36 will be fine, otherwise I think base32 will be a great addition to the core as well.

@georgeclaghorn
Copy link
Contributor

Base36 sounds fine to me, but the implementation as it stands now needs to move to the SecureRandom extension, where it can live alongside SecureRandom.base58.

Braces are to avoid warnings when running tests as we start the arguments with a /
@julik
Copy link
Contributor Author

julik commented Dec 30, 2018

🎉 please squash-merge since there are quite a few interim commits

@georgeclaghorn georgeclaghorn merged commit e5f4162 into rails:master Dec 30, 2018
@georgeclaghorn
Copy link
Contributor

Thank you! ❤️

# p SecureRandom.base36(24) # => "77tmhrhjfvfdwodq8w7ev2m7"
def self.base36(n = 16)
SecureRandom.random_bytes(n).unpack("C*").map do |byte|
idx = byte % 64
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has a nearly 50% chance of being unable to use the random byte and having to fall back to random number... it'd presumably be worth benchmarking before spending any real effort on it, but a quick shot at something that should be cheaper: (Someone want to check my maths?)

idx = byte / 7

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be, I would actually rewrite both base58 and base36 to use a much more pedestrian

n.times.map { alphabet[SecureRandom.random_number(alphabet.length) }.join

More method calls but probably "less clever" is better in this case. I didn't touch the implementation because I was thinking "this byte trickery for obtaining bytes from a potentially-slow SecureRandom is here for a reason"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just didn't want to touch it since this is not what the PR is about and this would need microbenchmarks etc.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was curious enough to follow through, so:

Benchmark: https://gist.github.com/matthewd/6827623f4e0308943e8c82db490cc118

We do generally avoid programming to microbenchmarks in high level library code, but I think the other side of that policy is being more accepting of minor ugliness when we're building lower level primitives.

(As a point of comparison, it'd be entirely silly to switch to base32 just because we can generate them twice as fast. This is purely about "there's a measurably-faster way to achieve the same thing for a similar amount of code".)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noted, I'll open a patch for this once I'm done with #34827

@julik julik deleted the activestorage-lowercase-keys branch December 31, 2018 01:36
suketa added a commit to suketa/rails_sandbox that referenced this pull request Jul 6, 2019
Make ActiveStorage::Blob keys lowercase
rails/rails#34818
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Case sensitivity of Activestorage keys creates undefined behavior
4 participants