Use single query to retrieve indexes in PostgreSQL #45381
Conversation
```sql
pg_catalog.obj_description(i.oid, 'pg_class') AS comment, d.indisvalid,
ARRAY(
  SELECT pg_get_indexdef(d.indexrelid, k + 1, true)
  FROM generate_subscripts(d.indkey, 1) AS k
```
I'm 30% sure this needs an `ORDER BY k` to be Technically Correct 🤔
This also feels like a behaviour change for non-column-reference expressions. Is that true?
I think the `ORDER BY k` is not needed. `indkey` returns a vector (in Postgres terminology) of column numbers from the table. So for the table `users(id, name, email, created_at)` and an index on `(created_at, email)`, it returns `[4, 3]`.
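A minimal sketch of this behaviour (the table and index names here are illustrative, matching the example above, not taken from the PR):

```sql
-- Hypothetical schema matching the example above
CREATE TABLE users (id int, name text, email text, created_at timestamptz);
CREATE INDEX index_users_on_created_at_and_email ON users (created_at, email);

-- indkey lists table column numbers in index-column order
SELECT indkey FROM pg_index
WHERE indexrelid = 'index_users_on_created_at_and_email'::regclass;
-- indkey: "4 3"  (created_at is column 4, email is column 3)

-- pg_get_indexdef(oid, n, true) returns the definition of the n-th index column
SELECT pg_get_indexdef(indexrelid, 1, true) AS col1,
       pg_get_indexdef(indexrelid, 2, true) AS col2
FROM pg_index
WHERE indexrelid = 'index_users_on_created_at_and_email'::regclass;
-- col1: created_at, col2: email
```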
> This also feels like a behaviour change for non-column-reference expressions.
Can you provide an example?
My concern was for an index that mixed columns and non-column expressions, but per Discord conversation, I now see that the changed codepath (around line 125 below) is only taken for indexes that consist exclusively of table columns: as soon as any expression is present, we hit `if indkey.include?(0)`, and things diverge.
(It separately seems less-than-ideal that we skip over the "richer" behaviour for those index columns that are non-expression references to table columns, but that's not relevant to this change.)
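For reference, a zero entry in `indkey` is what marks an expression column, which is what the `indkey.include?(0)` check detects. A hedged sketch (index name illustrative, reusing the hypothetical `users` table above):

```sql
-- An index mixing a plain column and an expression
CREATE INDEX index_users_on_id_and_lower_email ON users (id, lower(email));

SELECT indkey FROM pg_index
WHERE indexrelid = 'index_users_on_id_and_lower_email'::regclass;
-- indkey: "1 0"  -- the 0 stands for the lower(email) expression
```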
(I'm also happy to assume that `generate_subscripts` produces rows in a defined order within the immediate subquery, even without an explicit `ORDER BY`, just on the basis that set-returning functions are weird.)
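A quick illustration of `generate_subscripts`: in practice it emits subscripts in ascending order, though an explicit `ORDER BY` is what makes that guaranteed rather than assumed:

```sql
SELECT k FROM generate_subscripts(ARRAY[40, 30, 20], 1) AS k ORDER BY k;
-- k: 1, 2, 3 (one subscript per array element, regardless of element values)
```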
Added `ORDER BY k` to be sure. I saw it used in other examples on the internets, but I didn't get why we need it, because it seems like the `generate_subscripts` result should already be sorted.
> (It separately seems less-than-ideal that we skip over the "richer" behaviour for those index columns that are non-expression references to table columns, but that's not relevant to this change.)
Probably I should go to sleep, but I didn't get what the problem is here 🤔 I would appreciate it if you could provide a concrete example where the new approach won't work.
Is getting all indexes the most common usage? Can't we make it into 1 query and iterate over the result safely (in batches)?

I didn't get it. We do not get all indexes in a single query, we get all indexes during the execution of the program (like in

Is for example

You can disable specific tables for specific checks, but usually all tables in the db are checked.

@yahonda Can you, please, take a look at this PR?

I'd like https://github.com/rails/rails/pull/45381/files#r899190812 discussion to be resolved between @fatkodima and @matthewd .
Force-pushed from 0f961b2 to 758ad99
Previously, to get all the indexes for a database in PostgreSQL, `# of tables + # of indexes` SQL queries were needed. Now, only `# of tables` SQL queries are needed. For example, for GitLab's source base, which has 500 tables and 2000 indexes, 2500 queries were needed; now 500.

This PR greatly reduces the time needed to generate a `schema.rb` file, for example, or to run other tools that retrieve all the indexes from the database. For example, I was able to speed up `active_record_doctor` by 3x (from 126 seconds to 39 seconds) using schema caching (gregnavis/active_record_doctor#101), and with this PR, execution time drops by another 8 seconds.