Encryption: support support_unencrypted_data at a per-attribute level #49072

Merged
merged 1 commit into from Sep 13, 2023

Conversation

ghiculescu
Member

@ghiculescu ghiculescu commented Aug 29, 2023

If ActiveRecord::Encryption.config.support_unencrypted_data == true, this allows you to do:

```ruby
class User < ActiveRecord::Base
  encrypts :name, deterministic: true, support_unencrypted_data: false
  encrypts :email, deterministic: true
end
```

Now only the email column will allow unencrypted data (and if extend_queries is true, only email queries will get extended). Here is some back story on why you might want this.
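
A minimal sketch of how the per-attribute option might combine with the global flag, written in plain Ruby so it runs without Rails. The method name `effective_support_unencrypted_data` is hypothetical, and the rule it encodes (the global flag must be on, and an attribute can only opt out) is an assumption based on the description in this PR, not a copy of the Rails implementation.

```ruby
# Hypothetical helper, not part of Rails: resolves whether one attribute
# should accept unencrypted data.
#
# global        - value of the application-wide flag (true/false)
# per_attribute - value passed to `encrypts`, or nil when the option is omitted
def effective_support_unencrypted_data(global:, per_attribute: nil)
  # Assumption: with the global flag off, no attribute accepts plaintext.
  return false unless global

  # An omitted option falls back to the global setting (true at this point).
  per_attribute.nil? ? true : per_attribute
end

# Mirrors the User example: name opts out, email inherits the global flag.
puts "name:  #{effective_support_unencrypted_data(global: true, per_attribute: false)}"
puts "email: #{effective_support_unencrypted_data(global: true)}"
```

In this sketch the attribute-level value only matters while the global flag is true, which matches the opt-out scenario described here.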

@ghiculescu ghiculescu force-pushed the extend-queries-per-attribute branch 3 times, most recently from 5665931 to a7fa679 Compare August 29, 2023 01:13
Contributor

@jorgemanrubia jorgemanrubia left a comment

I have mixed feelings about this one @ghiculescu.

The query extension system is meant to make deterministically encrypted attributes work in queries, which is a core behavior you want in place if you use encryption. I think a global configuration flag makes sense, in case you want to completely disable the system. But I don't see the fine-grained configuration option justified. The case of gradually enabling this on a per-column basis doesn't feel like enough justification to me. What's the benefit compared to enabling the flag globally?

@ghiculescu
Member Author

ghiculescu commented Aug 29, 2023

You should extend queries if you know you have a mixture of encrypted and unencrypted data in a column you’re querying by.

If you know all data in a column is encrypted then you don’t need extended queries for that column. But if you enable the configuration globally then you still have the extra overhead added to all your queries.

This PR lets you enable the config globally and then selectively opt out of it on specific columns that don’t need it. For example if you backfill a column so that it’s fully encrypted then you can stop extended queries for just that column.
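
To make the overhead concrete, here is a rough sketch of what "extending" a query amounts to: the lookup has to match both the encrypted form and the original plaintext. Everything here is illustrative. `fake_deterministic_encrypt` and `query_values_for` are hypothetical names, and a SHA-256 digest stands in for real deterministic encryption only to keep the example dependency-free.

```ruby
require "digest"

# Stand-in for deterministic encryption: the same input always produces the
# same output. Real encryption is reversible; a digest is used here only so
# the sketch has no dependencies.
def fake_deterministic_encrypt(value)
  "enc:#{Digest::SHA256.hexdigest(value)[0, 12]}"
end

# Hypothetical: the values a WHERE clause must match for one query value.
def query_values_for(value, support_unencrypted_data:)
  values = [fake_deterministic_encrypt(value)]
  # Rows written before encryption was enabled still hold the plaintext,
  # so the query must also match the raw value.
  values << value if support_unencrypted_data
  values
end

puts query_values_for("a@example.com", support_unencrypted_data: true).inspect
puts query_values_for("a@example.com", support_unencrypted_data: false).inspect
```

With the option off for a fully encrypted column, the lookup only has to match a single value instead of two.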

(Maybe I made it confusing by talking about opting-in per column in the PR body. That doesn’t do anything different to how it works now.)

@jorgemanrubia
Contributor

Thanks, I understand the scenario you have in mind better with your last comment. Notice that the system is also used when you use different encryption schemes (the previous: option lets you define past encryption properties for the attribute).

My main concern is: the new option adds a little bit of additional complexity, both for people seeing how to use the framework and internally. Does the overhead-saving gain justify the new config option? I haven't measured, but it would be surprising to me. Also, I think that situations where you manage encrypted/unencrypted data should be transient. But, even if it were permanent, would the gain be justified?

@ghiculescu
Member Author

I’m not sure if that question is rhetorical, but for the record, yeah I think the gain is worth it.

As you say it’s hopefully a transient state, but some backfills can take a very long time or not happen at all for whatever reason.

The slightly more complex API is worth it because it makes for a smoother process of gradually encrypting everything. It makes the transition to fully using encryption less scary.

Contributor

@joshuay03 joshuay03 left a comment

Looking great! Just some minor comments and questions.

@jorgemanrubia
Contributor

I’m not sure if that question is rhetorical, but for the record, yeah I think the gain is worth it.

@ghiculescu not rhetorical. Could you elaborate why you think it's worth it? By overhead you mean a performance overhead, right? What's the performance gain?

@jorgemanrubia
Contributor

It makes the transition to fully using encryption less scary.

Also interested in understanding how this option makes it less scary. Is it related to the referred overhead? Or to the psychological element in seeing the queries altered when using deterministic encryption + support for unencrypted data?

@ghiculescu
Member Author

ghiculescu commented Aug 29, 2023

@jorgemanrubia The backstory for this is that we have a big application that's over a decade old. I've been gradually adding encryption to parts of it over the last year. Other teams have also been adding encryption, both to existing features and new ones.

The current state is we have a reasonable number of encrypted columns (over 30, and I still have more I want to do). About a quarter of these are deterministic. Some are fully encrypted, because encryption was enabled when the column was added. Some are fully encrypted, because all the data has been backfilled. And some are not fully encrypted.

The eventual goal is to have every encrypted column be fully encrypted, and to turn support_unencrypted_data off. But this seems like it won't happen for a long time, because we continue to add encryption to existing columns and then gradually backfill them. So I'm stuck with support_unencrypted_data and extend_queries being true for the foreseeable future.

This leaves me with a few issues:

  1. Without connecting to the production database and querying every row, I don't have a good source of truth on which columns are fully encrypted and which are not.
  2. Once I do backfill a column, I'm still incurring the overhead that extend_queries adds for every query to that column. This isn't much (your tests ensure it's not more than 20% slower), but it adds up, particularly on heavily used columns.

With this PR, I can tick off both of these issues.

  1. As soon as a backfill is complete, I can disable extend_queries just for that column. Now I have the code as a source of truth for this.
  2. Disabling extend_queries speeds up querying of that column. (I'm not sure how much by, but it can't slow it down.) This is also handy for columns that are fully encrypted from day one and never need a backfill.

All this leads to my comment about scariness. Let's say one day we decide there are no more deterministic columns to encrypt, and we think we've backfilled everything. We deploy a change to production that turns extend_queries off at the application level. This is a scary change, because it's hard to know ahead of time if it will break things. By moving extend_queries to the column level we can shrink the blast radius; once every deterministic column has extend_queries: false, we know it's safe to turn the application configuration off.
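
The "code as a source of truth" idea could be checked mechanically. The sketch below is a plain-Ruby illustration, not something from the PR: the `DETERMINISTIC_ATTRIBUTES` inventory and the `attributes_blocking_global_disable` helper are hypothetical, and the option is written under the name the PR eventually settled on, `support_unencrypted_data`.

```ruby
# Hypothetical inventory of deterministic attributes and the per-attribute
# option each one declares (nil means the option was omitted).
DETERMINISTIC_ATTRIBUTES = {
  "User#name"  => false,
  "User#email" => nil,
  "Order#ref"  => false
}.freeze

# Attributes that have not explicitly opted out still depend on the
# application-level flag, so they block turning it off.
def attributes_blocking_global_disable(attributes)
  attributes.reject { |_name, option| option == false }.keys
end

blocking = attributes_blocking_global_disable(DETERMINISTIC_ATTRIBUTES)
if blocking.empty?
  puts "Safe to turn the application-level flag off"
else
  puts "Still relying on the application-level flag: #{blocking.join(', ')}"
end
```

A check along these lines could run in a test suite, so the list of columns that still need a backfill lives in the code rather than in someone's memory of the production data.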

ps. I think I should have led with this, my PR body was pretty basic. Sorry for the lack of context, and thanks for bearing with me.

@jorgemanrubia
Contributor

Thanks @ghiculescu, I understand the context much better now 🙏. I agree that such support would help to gradually reach the situation where you can disable support for unencrypted data. I see the value in that.

I'd suggest the following:

  • Instead of naming the option extend_queries make it support_unencrypted_data.
  • When the option is set (not nil), use it here. That will make EncryptedAttributeType#previous_types exclude the clean_text_scheme, so the system that extends deterministic queries would leave the query unaltered. As a side benefit, this would play well with the case where there are previous encryption schemes.
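
The second suggestion can be pictured with a small plain-Ruby sketch. This is an illustration under assumptions, not the Rails source: schemes are represented by symbols, `CLEAN_TEXT_SCHEME` is a stand-in for the clean text scheme mentioned above, and `schemes_to_try` is a hypothetical helper.

```ruby
# Stand-in for the clean text (unencrypted) scheme; only the name matters here.
CLEAN_TEXT_SCHEME = :clean_text

# previous_schemes - older encryption schemes declared for the attribute
# Returns the list of fallback schemes considered when reading or querying.
def schemes_to_try(previous_schemes, support_unencrypted_data:)
  schemes = previous_schemes.dup
  # The clean text scheme is only a candidate while plaintext rows may exist.
  schemes << CLEAN_TEXT_SCHEME if support_unencrypted_data
  schemes
end

puts schemes_to_try([:old_key], support_unencrypted_data: true).inspect
puts schemes_to_try([:old_key], support_unencrypted_data: false).inspect
```

Previous encryption schemes stay in the list either way, which is the side benefit mentioned: only the plaintext fallback is dropped.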

@ghiculescu
Member Author

I’ll make these follow ups tomorrow 👍

@ghiculescu ghiculescu changed the title from "Encryption: support extend_queries being set at a per-attribute level" to "Encryption: support support_unencrypted_data at a per-attribute level" Aug 30, 2023
@ghiculescu ghiculescu force-pushed the extend-queries-per-attribute branch 4 times, most recently from 909bdeb to 02554dc Compare August 30, 2023 00:44
@ghiculescu ghiculescu force-pushed the extend-queries-per-attribute branch 2 times, most recently from 718a35d to 6cc63f0 Compare August 30, 2023 04:44
Contributor

@jorgemanrubia jorgemanrubia left a comment

Looks great @ghiculescu 👏

Contributor

@jorgemanrubia jorgemanrubia left a comment

Great work @ghiculescu, this is a nice addition 👏

@@ -23,8 +23,9 @@ module Encryption
# some prepared statements caching. That's why we need to intercept +ActiveRecord::Base+ as soon
# as it's invoked (so that the proper prepared statement is cached).
#
# When modifying this file run performance tests in +test/performance/extended_deterministic_queries_performance_test.rb+ to
# make sure performance overhead is acceptable.
# When modifying this file run performance tests in
Member

Why is this in the public documentation of this module? And why do we have @todo on it? Let's remove those from the public documentation.

Member Author

I'm not sure if this module should be public API at all. Could we make the whole thing :nodoc:?

I will move the TODOs and internal notes outside of the public docs for now.

Member Author

@jorgemanrubia do you have any thoughts on the public/private state of the docs here?

Contributor

Yes, those notes are meant to be internal comments, not part of public API. I think your change here was fine.

Member Author

But should this module be public at all?

Contributor

yeah I think it should

@ghiculescu ghiculescu force-pushed the extend-queries-per-attribute branch 2 times, most recently from ad02964 to b1acf2a Compare September 3, 2023 23:39
@ghiculescu ghiculescu force-pushed the extend-queries-per-attribute branch 2 times, most recently from c0cf5ad to 3c1c7c9 Compare September 7, 2023 02:42
@ghiculescu
Member Author

@rafaelfranca Could you please take another look at this one?

@rafaelfranca rafaelfranca merged commit aa268ba into rails:main Sep 13, 2023
4 checks passed
@ghiculescu ghiculescu deleted the extend-queries-per-attribute branch September 13, 2023 00:24
ghiculescu added a commit to ghiculescu/rails that referenced this pull request Feb 16, 2024
…ibute level

rails#49072 allowed you to turn `support_unencrypted_data` on at a global level, then turn it off for specific attributes. But it didn't allow the inverse: you couldn't turn the config off globally, and then turn it on for a specific attribute.

This PR adds support for that.