Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Rails generated index name being too long #47753

Merged
merged 1 commit into from
Mar 29, 2023

Conversation

mscoutermarsh
Copy link
Contributor

@mscoutermarsh mscoutermarsh commented Mar 23, 2023

Motivation / Background

Frequently I find myself hitting my database's "is too long" error when adding a compound index.

When this happens, I need to manually set a shorter name to run the migration.

Detail

This updates the index name generation to always create a valid index name if one is not passed by the user.

I set the limit to 62 to ensure it works for the default configurations of sqlite, mysql & postgres.

MySQL: 64
Postgres: 63
Sqlite: 62

I considered using index_name_length. But that value could change, resulting in inconsistent index names when migrations are run.

Checklist

Before submitting the PR make sure the following are checked:

  • This Pull Request is related to one change. Changes that are unrelated should be opened in separate PRs.
  • Commit message has a detailed description of what changed and why. If this PR fixes a related issue include it in the commit message. Ex: [Fix #issue-number]
  • Tests are added or updated if you fix a bug or add a feature.
  • CHANGELOG files are updated for the changed libraries if there is a behavior change or additional feature. Minor bug fixes and documentation changes should not be included.

@mscoutermarsh
Copy link
Contributor Author

I believe the test failures are unrelated/flakes. 😄

@p8
Copy link
Member

p8 commented Mar 23, 2023

I'm wondering if we should always show the name for the generated migration (or maybe only when the default is too long?).
This would allow changing the shortened name and maybe making it less magical?

@andrewculver
Copy link
Contributor

andrewculver commented Mar 23, 2023

One really nice thing about this is even if the user doesn't like the shorter auto-generated index name, they don't have to go search for the syntax to specify the index name in the migration file... it's been added for them! All they have to do is name the index whatever they want. 👌 It's a huge improvement on what happens today (searching the error on Google, reading StackOverflow, figuring out the appropriate option for specifying an index name, coming up with an index name.)

@mscoutermarsh
Copy link
Contributor Author

Love that idea. I could look into updating the generator to add name to the add_index calls. Not sure if that makes sense to include in this PR, or separately?

@byroot
Copy link
Member

byroot commented Mar 24, 2023

👍 on principle.

@zzak
Copy link
Member

zzak commented Mar 24, 2023

I think the limit (on postgres at least) is 63 bytes, not characters. 🤔

Isn't this also a global limit for all columns? If so, shouldn't we try to enforce that everywhere?

@p8
Copy link
Member

p8 commented Mar 24, 2023

Should this check for existence of index names?
The following would be shortened to the same names:

"index_on_attribute1_with_a_very_long_name_attribute2_with_a_very_long_name_some_attribute"[..62]
=> "index_on_attribute1_with_a_very_long_name_attribute2_with_a_ver"
"index_on_attribute1_with_a_very_long_name_attribute2_with_a_very_long_name_another_attribute"[..62]
=> "index_on_attribute1_with_a_very_long_name_attribute2_with_a_ver"

Or maybe the attribute names should be shortened?
"index_on_attribute1_attribute2_another_attribute"
On the other hand this might be something for the developer to decide.

@mscoutermarsh
Copy link
Contributor Author

@zzak I just did some reading on the postgres limit and pushed a commit to swap the check to bytes, thank you! It's now 62 bytes which should safely satisfy all 3 databases.

@mscoutermarsh
Copy link
Contributor Author

Should this check for existence of index names?

@p8 Thinking about this scenario. One potential downside of checking for index names would be: It could make the index name generator non-deterministic? If for example, a developer manually added an index to their database that isn't captured in the migration files.

Might be safer to let the migration fail in that case and have them manually update it.

@byroot
Copy link
Member

byroot commented Mar 24, 2023

Yes agreed. The index name shouldn't be derived from that database state, otherwise running migration in a different order may cause different index names.

@Fryguy
Copy link
Contributor

Fryguy commented Mar 25, 2023

Reminds me of this old discussion from 2011! 😮

#1993

@rafaelfranca
Copy link
Member

I think adding new heuristics is ok, but maybe we should also take the opportunity to improve what the error message is? That way there is no need to search on google, just tell users what they need to do. This would help people to for example know what to do if the new heuristics generate an index that clashes with another one.

@matthewd
Copy link
Member

Unlike in 2011 😅, we have migration versioning, so this is safe to do without changing behaviour of existing migrations. (Please use migration versioning to maintain existing behaviour for old migrations.)

If we're going to automatically hard-truncate (and immediately drop the table name, at huge risk of moving from "name too long" to "name not unique"), what if we start with the nice version, but then jump to e.g. "ix_#{truncated}_#{hash}", a la FKs? https://github.com/rails/rails/blob/15c53ac37188e0ef45e0e4bf981b82ad32ae3569/activerecord/lib/active_record/connection_adapters/abstract/schema_statements.rb#L1676-L1678

We've previously chosen not to apply the same "meaningless mess" naming convention to indexes because it's useful to know what they're doing in EXPLAIN / hints / etc., but if it's acceptable to chop some detail in general, trading a couple of characters for automatic uniqueness seems worthwhile?

@mscoutermarsh
Copy link
Contributor Author

@matthewd Nice, I'm playing around with that idea a bit now.

This looks pretty good: index_on_foo_bar_first_name_last_name_very_long_col_f807dfa58f. The "mess" does seem worth the benefit of uniqueness. I'll push up code soon.

@mscoutermarsh
Copy link
Contributor Author

Just pushed a commit to handle the uniqueness case 😄 . Thanks @matthewd

@matthewd
Copy link
Member

So I was thinking that we'd immediately employ the digest as soon as we remove the table name. On Postgres, index names are database-global, so indexes without a table qualification are likely to clash on overlapping column names even without string truncation getting involved.

@mscoutermarsh
Copy link
Contributor Author

Updated to include the hash as soon as the table name is removed. Will now be unique database wide.

@byroot
Copy link
Member

byroot commented Mar 26, 2023

Please use migration versioning to maintain existing behaviour for old migrations.

@mscoutermarsh can you have a look at migration versioning? https://github.com/rails/rails/blob/89bd41201a060e8e6ee39591430d136e9e012c34/activerecord/lib/active_record/migration/compatibility.rb

This new behavior should only apply to migrations generated in 7.1.

@mscoutermarsh mscoutermarsh force-pushed the auto-gen-index-name branch 2 times, most recently from b1d9ce9 to 1e4139c Compare March 26, 2023 21:31
@mscoutermarsh
Copy link
Contributor Author

Thanks @byroot. Pushed an update, I think I have the migration versioning right. Tried to copy the patterns already there.

Copy link
Member

@byroot byroot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The versioning looks good to me.

As for the formats, IMO we should try to original format, and if it's too long directly digest it. Adding more intermediate formats would just be more hard to predict the behavior.

Also once you think you are done, please squash your commits.

Comment on lines 1539 to 1541
# Remove _and_ between columns
name = "index_#{table_name}_on_#{Array(column) * '_'}"
return name if name.bytesize <= MAX_GENERATED_INDEX_NAME_BYTES
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this "semi-fallback" is really worth it. The behavior may be more easy to understand by users if we limit ourselves to just two formats.

# Remove table_name, add hash for uniqueness
hashed_identifier = OpenSSL::Digest::SHA256.hexdigest(name).first(10)

name = "index_on_#{Array(column) * '_'}"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
name = "index_on_#{Array(column) * '_'}"
name = ""index_#{table_name}_on_#{Array(column) * '_'}"

If we're gonna truncate anyway, I'd say the table name is the part that is the more relevant to try to include?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO in most interesting contexts where you see indexes mentioned, they're inherently table-scoped: if you're looking at which index is being used to scan the users table (or writing a hint to force it), the part that distinguishes this index from its peers is which columns it covers.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, but then I'd rather never include the table name?

I dunno, I don't like the idea of multiplying the patterns that can be use. I'm OK with having one nice human readable one, and one fallback one, but having two "human readable" ones seems wrong to me.

Not a strong opinion though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, my opinion is that we have exactly two patterns: the "there's plenty of room" historical default that includes table + columns, and the "rest of the time" fallback that only lists columns plus a digest (protecting against inter-table column collisions, plus the edge case of collisions caused by truncation).

(I'm also still tempted to give it a different shorter prefix like ix_, both to emphasize that this is a Mode Two name, and to conserve name chars given they're apparently not plentiful... also not strong on that though.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I pushed an update based on this convo. I like the idea of keeping it simple with only 2 versions.

Here's how it works now:
If the original version is too long, we immediately fallback to the "short version"

short format example:

ix_on_foo_bar_first_name_last_name_administrator_5939248142

@@ -1510,6 +1530,25 @@ def valid_primary_key_options # :nodoc:
end

private
MAX_GENERATED_INDEX_NAME_BYTES = 62
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps not for this PR, but is it possible to ask the adapter what the max length is? Different DBs have different restrictions, so it makes sense that each adapter declare the max length.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was looking into that. I believe the risk here is that if that value does change for the DB, then running past migrations could result in different names.

That does feel pretty unlikely to happen. But mitigating that possibility is the main benefit here of hard coding it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah we definitely don't want it to be per-database dynamic, but it does seem reasonable to make it static per DB adapter class. While I don't particularly care to claw back the extra one or two characters (per your analysis of consistent max lengths across the platforms we support), it does maybe make sense to accomodate other platforms that might have notably different (especially shorter) lengths.

Just to write the words: I don't think it's worth considering situations where the max length invalidates our assumptions around fixed-length parts of our generated name. But apparently Oracle's max length was 30 not terribly long ago, and now it and MSSQL are both at 128 -- which is high enough to legitimately raise an eyebrow if we start squishing to fit an imagined lower limit.

Making it easier for a database adapter to vary the maximum, either dynamically or merely across versions, does create more possibility for problems that a totally static value would avoid... but realistically I think there are plenty of other ways a custom adapter can create misadventure, so it's probably fine?

hashed_identifier = OpenSSL::Digest::SHA256.hexdigest(name).first(10)

name = "index_on_#{Array(column) * '_'}"
short_name = name.mb_chars.limit(MAX_GENERATED_INDEX_NAME_BYTES - 11).to_s.chomp("_")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be possible to move this 11 into a constant with a comment (or perhaps also the 10 above)? A future reader may not understand where this magic number comes from or that it's related to the hash length plus underscore.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For sure, thanks!

@@ -54,6 +54,10 @@ def add_column(table_name, column_name, type, **options)
super
end

def add_index(table_name, column_name, **options)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't re-read enough context to check, but can this be def generate_index_name(table_name, column) with the body from legacy_index_name?

"Hidden" options like this are an available fallback when all other options fail, but they should be a last resort: they push the legacy behaviour into the modern implementation instead of isolating it in the compatibility handling -- plus it introduces a singular "legacy" state, losing the version-specific scoping we start with.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be much nicer! I originally tried that approach. It might be possible but I was struggling to find a way to make it work.

The challenge was, it looks to me like the migrator will call the overridden method if available (such as add_index). If not, it falls back to method_missing.

When I tried overriding generate_index_name from Compatibility, it was never being called. I believe because the migrator is never directly calling it.

Possible idea:
Maybe I could do something like...

# in compatibility
def add_index(...)
  options[:name] = generate_index_name(...) if options[:name].nil?
  super
end

Then it would be setting the name explicitly using the "legacy" method as if that is what the user passed in.

I can play around with this more later, just sharing the context and wondering your opinion. 😄 Thanks!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yeah, that makes sense.

Yeah, that :name approach sounds viable 👍

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it working! No more hidden option.

@mscoutermarsh mscoutermarsh force-pushed the auto-gen-index-name branch 2 times, most recently from e5f7cea to 7847f15 Compare March 27, 2023 15:29
@byroot
Copy link
Member

byroot commented Mar 28, 2023

Looking good. I think the last outstanding thing is whether to move that constant into adapters.

I think we can just define

def max_index_name_size
  62
end

And call that.

@mscoutermarsh
Copy link
Contributor Author

Thanks @byroot! Just pushed an update with that change.

I'm guessing when people implement a custom adapter, they could now override the value if they wanted?

@byroot
Copy link
Member

byroot commented Mar 28, 2023

I'm guessing when people implement a custom adapter, they could now override the value if they wanted?

Exactly.

Now everything looks good to me. I'll wait a bit to see if Matthew has any other concerns, but if not I'll merge soon.

Thanks for the feature.

@byroot
Copy link
Member

byroot commented Mar 28, 2023

@mscoutermarsh can you squash to avoid that Merge commit please? We try to keep a clean history.

This updates the index name generation to always create a valid index name if one is not passed by the user.

Set the limit to 62 bytes to ensure it works for the default configurations of Sqlite, mysql & postgres.

MySQL: 64
Postgres: 63
Sqlite: 62

When over the limit, we fallback to a "short format" that includes a hash to guarantee uniqueness in the generated index name.
@mscoutermarsh
Copy link
Contributor Author

@byroot squashed ✅

@byroot
Copy link
Member

byroot commented Mar 29, 2023

I just chatted with Matthew, he's 👍

@byroot byroot merged commit c792c71 into rails:main Mar 29, 2023
byroot added a commit that referenced this pull request Mar 29, 2023
@byroot
Copy link
Member

byroot commented Mar 29, 2023

NB: I changed the ix_ prefix for idx_: 59f2b06

@mscoutermarsh
Copy link
Contributor Author

Thanks everyone for your help on this! ❤️

@cheshire137
Copy link
Contributor

Thanks Mike! This is going to be very helpful.

@matthewd
Copy link
Member

matthewd commented Apr 3, 2023

I think we need compatibility handling on remove_index too 🤔

@mscoutermarsh
Copy link
Contributor Author

@matthewd I can look into it! 😄

@mscoutermarsh
Copy link
Contributor Author

@matthewd I looked into this and found it already works without changes.

Check this out:

if column_names.present? && !(options.key?(:name) && expression_column_name?(column_names))
checks << lambda { |i| index_name(table_name, i.columns) == index_name(table_name, column_names) }
end

Took me quite a while to understand what is happening here.

The remove_index code does not use directly use index name generation to determine which index name to remove. It instead checks if based on the table/column name, the generated index name is the same. If so, then it uses the name of the index set on those columns already.

Found a test that covers this behavior as well:

def test_remove_named_index
connection.add_index :testings, :foo, name: "index_testings_on_custom_index_name"
assert connection.index_exists?(:testings, :foo)
assert_raise(ArgumentError) { connection.remove_index(:testings, "custom_index_name") }
connection.remove_index :testings, :foo
assert_not connection.index_exists?(:testings, :foo)
end

aharpole added a commit to aharpole/rails that referenced this pull request Apr 4, 2023
Previously, if you were using a pre-7.1 migration and you were adding an index using `create_table`, the index would be named using the new index naming functionality introduced in  rails#47753.
davinlagerroos pushed a commit to umn-asr/oracle-enhanced that referenced this pull request Feb 21, 2024
ActiveRecord added max_index_name_size in rails/rails#47753
and picked a default of 62. Oracle 12.2 and greater supports 128 and
the difference broke spec/active_record/connection_adapters/oracle_enhanced/schema_statements_spec.rb:330
since our test index has a bytesize of 64, too long for active record's
defaults which meant the call to super in index_name was returning
a shortened version. The fix is to define max_index_name_size in
oracle enhanced with the 128 limit. Althought active record defines it
in SchemaStatements, I put it in OracleEnhancedAdapter were we define
other methods based on the max_identifier_length.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

9 participants