Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use single query to retrieve indexes in PostgreSQL #45381

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

fatkodima
Copy link
Member

Previously, to get all indexes for the database in PostgreSQL, it was needed # of tables + # of indexes sql queries. Now, it is only # of tables sql queries.

For example, for gitlab's source base, which has 500 tables and 2000 indexes, it was needed 2500 queries, now 500.

This PR greatly reduces the time needed to generate a schema.rb file, for example. Or in other tools, where it is needed to retrieve all the indexes from the database. For example, I was able to speedup active_record_doctor by 3x (from 126 seconds to 39 seconds) using schema caching - gregnavis/active_record_doctor#101. And with this PR, it reduces execution time by another 8 seconds.

pg_catalog.obj_description(i.oid, 'pg_class') AS comment, d.indisvalid,
ARRAY(
SELECT pg_get_indexdef(d.indexrelid, k + 1, true)
FROM generate_subscripts(d.indkey, 1) AS k
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm 30% sure this needs an ORDER BY k to be Technically Correct 🤔

This also feels like a behaviour change for non-column-reference expressions. Is that true?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the ORDER BY k is not needed. indkey returns a vector (in postgres terminology) with column numbers from the table. So for table users(id, name, email, created_at) and index on (created_at, email) it returns [4, 3].

This also feels like a behaviour change for non-column-reference expressions.

Can you provide an example?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My concern was for an index that mixed columns and non-column expressions, but per Discord conversation, I now see that the changed codepath (around line 125 below) is only taken for indexes that consist exclusively of table columns: as soon as any expression is present, we hit if indkey.include?(0), and things diverge.

(It separately seems less-than-ideal that we skip over the "richer" behaviour for those index columns that are non-expression references to table columns, but that's not relevant to this change.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I'm also happy to assume that generate_subscripts produces rows in a defined order within the immediate subquery, even without an explicit ORDER BY, just on the basis that set returning functions are weird.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added ORDER BY k to be sure. I see that was used in other examples on the internets, but I did not get it why we need to use it, because seems like generate_subscripts result should already be sorted.

(It separately seems less-than-ideal that we skip over the "richer" behaviour for those index columns that are non-expression references to table columns, but that's not relevant to this change.)

Probably I should got to sleep, but I did not get what is the problem here? 🤔 I would appreciate if you can provide a concrete example where the new approach won't work.

@simi
Copy link
Contributor

simi commented Jun 16, 2022

Is getting all indexes the most common usage? Can't we make it into 1 query and iterate over the result safely (in batches)?

@fatkodima
Copy link
Member Author

fatkodima commented Jun 16, 2022

Is getting all indexes the most common usage? Can't we make it into 1 query and iterate over the result safely (in batches)?

I didn't get it. We do not get all indexes in a single query, we get all indexes during the execution of the program (like in active_record_doctor, mentioned in the PR description). We get all indexes per table at a time.

@simi
Copy link
Contributor

simi commented Jun 16, 2022

Is getting all indexes the most common usage? Can't we make it into 1 query and iterate over the result safely (in batches)?

I didn't get it. We do not get all indexes in a single query, we get all indexes during the execution of the program (like in active_record_doctor, mentioned in the PR description). We get all indexes per table at a time.

Is for example active_record_doctor getting info for all tables?

@fatkodima
Copy link
Member Author

You can disable specific tables for specific checks, but usually all tables in the db are checked.

@fatkodima
Copy link
Member Author

@yahonda Can you, please, take a look at this PR?

@yahonda
Copy link
Member

yahonda commented Aug 23, 2022

I'd like https://github.com/rails/rails/pull/45381/files#r899190812 discussion to be resolved between @fatkodima and @matthewd .

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants