Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix MySQL loadTables performance #6820

Closed

Conversation

ronits-cx
Copy link
Contributor

@ronits-cx ronits-cx commented Oct 1, 2020

This change will fix a performance issue in MySQL loadTables, which is especially bad in multi-tenanted environments.

loadTables performs queries to INFORMATION_SCHEMA, which is not optimized.
MySQL recommends adding a where clause with the schema and table for better performance - https://dev.mysql.com/doc/refman/5.7/en/information-schema-optimization.html.

There are 2 queries with a JOIN that have a where clause on the schema and table, but the where clause is only for one of the tables in the join.

For large databases with many schemas, this can be a crucial improvement.
In our database, for example, it improved each query from about 7.5 seconds to around 200 milliseconds.

@imnotjames
Copy link
Contributor

Is it possible to split these two changes apart? The performance vs the FK query check?

It's nicer to have that from a changelog perspective & so if we have to roll things back we can isolate what we're rolling back.

@ronits-cx
Copy link
Contributor Author

ronits-cx commented Oct 2, 2020

@imnotjames Sure, I split only the foreign key fix into another PR

@imnotjames
Copy link
Contributor

imnotjames commented Oct 4, 2020

Is the subquery needed if we're already doing this in the where clause in the main query?

Does the EXPLAIN show this problem? If so, can you post them as a reply - just for documentation purposes if we ever come back to the change via a git blame, get to this PR & wonder "huh - what was this trying to fix exactly?"

Copy link
Contributor

@imnotjames imnotjames left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the test for this a duplicate of the one for 6168?

@ronits-cx
Copy link
Contributor Author

ronits-cx commented Oct 5, 2020

The subquery is needed for the indices because it's a left join. If it were an inner join, like the foreign keys, then we could use the main where clause, but since it's a left join we also need in the results the rows where REFERENTIAL_CONSTRAINTS columns are null. I also tried adding it in the main where clause with an OR IS NULL, but that doesn't get the performance improvement.

The explain shows the problem. For example for the foreign key query it has 2 rows, one for kcu and one for rc.
For kcu we get: Using where; Open_full_table; Scanned 0 databases
For rc we get: Using where; Open_full_table; **Scanned all databases**; Using join buffer (Block Nested Loop)
After adding the where clause for the table, we get: Using where; Open_full_table; **Scanned 1 database**; Using join buffer (Block Nested Loop)
I'll add screenshots of both explains

@ronits-cx
Copy link
Contributor Author

ronits-cx commented Oct 5, 2020

The test isn't exactly a duplicate, because the 6168 PR only touched the foreign keys, so I only tested the foreign keys there. Here I also added index checks to see they're unaffected by the change.
For example the index checks will fail if we add to the main where clause instead of a subquery, because no indices will be found.

@ronits-cx
Copy link
Contributor Author

Before - scanned all databases:
Before - scanned all databases

After - scanned 1 database:
After - scanned 1 database

@ronits-cx
Copy link
Contributor Author

@imnotjames @pleerock Hi, I noticed you merged this refactor yesterday.
I think adding the condition in the ON will make it impossible to have a performance improvement, even with the extra conditions in the WHERE clause (not to mention they're duplicate of one another).

See explain only with the WHERE clause conditions (rc scanned 0 databases):
Screen Shot 2020-10-06 at 9 36 42

Explain with both WHERE clause and ON conditions (rc scanned all databases):
Screen Shot 2020-10-06 at 9 36 19

However I did just notice that I used KEY_COLUMN_USAGE's columns TABLE_SCHEMA and CONSTRAINT_SCHEMA interchangeably, which is a possible bug. If TABLE_SCHEMA !== CONSTRAINT_SCHEMA then it's wrong to add the rc constraints inside the foreignKeysCondition like I did here. So I'm not sure how it's possible to add this performance improvement...

@imnotjames imnotjames self-requested a review October 9, 2020 09:36
@imnotjames
Copy link
Contributor

imnotjames commented Oct 11, 2020

I'm finding that having an OR in the WHERE clause at all forces a data directory scan. This is even with a simpler query like:

        SELECT
            *
        FROM `INFORMATION_SCHEMA`.`KEY_COLUMN_USAGE` `kcu`
        WHERE
            (`kcu`.`TABLE_SCHEMA` = 'test' AND `kcu`.`TABLE_NAME` = 'question')
           OR
            (`kcu`.`TABLE_SCHEMA` = 'test' AND `kcu`.`TABLE_NAME` = 'category')

Causes:

Using where; Open_full_table; Scanned all databases

However,

EXPLAIN SELECT
  *
FROM `INFORMATION_SCHEMA`.`KEY_COLUMN_USAGE` `kcu`
WHERE
    `kcu`.`TABLE_SCHEMA` = 'test' AND `kcu`.`TABLE_NAME` = 'question'

Leaves me with

Using where; Open_full_table; Scanned 0 databases

Hypothesis: It might be faster to either run multiple queries - one per database & schema - or to construct a nightmarish frankenstein-query that unions a bunch of these together with single lookups.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants