
Optimize data sources sql for MySQL #45504

Merged

Conversation

fatkodima
Member

Fixes #45503.

This uses queries like the following (note the IN):

SELECT table_name FROM information_schema.tables
WHERE table_schema = database()
AND table_name = "schema_migrations"
AND table_name IN (
  SELECT table_name FROM information_schema.tables
  WHERE table_schema = database()
)

With these changes for 1000 runs I got:

  1. For the original query: 72s
  2. For the query before the changes in #39712 (Use a subquery when filtering information_schema.tables by table_name): 1.53s
  3. For this PR: 6.8s
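For reference, a rough sketch of how such a comparison could be timed from Ruby. Everything here is an assumption for illustration: run_query is a no-op placeholder, and in practice each statement would be executed against MySQL (for example via the mysql2 gem). The pre-#39712 join form of the query is not shown in this thread, so only the two known forms appear below.

```ruby
require "benchmark"

# Placeholder stand-in (an assumption for this sketch): in practice this
# would execute the SQL against MySQL, e.g. with the mysql2 gem. It is a
# no-op here so the sketch runs anywhere.
def run_query(sql)
  sql
end

queries = {
  "subquery form (this PR)" => <<~SQL,
    SELECT table_name FROM information_schema.tables
    WHERE table_schema = database()
      AND table_name = 'schema_migrations'
      AND table_name IN (SELECT table_name FROM information_schema.tables
                         WHERE table_schema = database())
  SQL
  "simple form (no subquery)" => <<~SQL
    SELECT table_name FROM information_schema.tables
    WHERE table_schema = database()
      AND table_name = 'schema_migrations'
  SQL
}

# Time 1000 runs of each form, as in the numbers quoted above.
timings = queries.transform_values do |sql|
  Benchmark.realtime { 1000.times { run_query(sql) } }
end
timings.each { |label, seconds| puts format("%s: %.2fs", label, seconds) }
```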

@blowfishpro Can you verify these changes improved the case you provided?

@eileencodes eileencodes merged commit 31b1403 into rails:main Jul 1, 2022
@blowfishpro

This does appear to help a lot. I think it's a reasonable compromise between the various MySQL bugs we're fighting against.

@mfoo

mfoo commented Sep 12, 2022

Hello! Just to say that on an upgrade from Rails 5.2 to 6 this issue has bitten us quite badly. Our application, for better or for worse, stores batch experiment results in MySQL, and each batch has a different schema, for which we use a different table. We're halfway through a MySQL 5.6->5.7 migration and some databases have hundreds of thousands of tables in them.

Deploying this fix as a monkey patch resolved the issue, thanks!

Note that only our 5.6 databases were affected. The 5.7 hosts, which had not switched off derived_merge, were fine.

@mfoo

mfoo commented Nov 16, 2022

Hi - revisiting this since it hit us again just after upgrading one of the last 5.6 servers to 5.7. The application immediately started experiencing latency issues, and the query produced by the patch was taking in the region of 4 seconds.

Taking a closer look at the patch, the new query being produced looked like this:

SELECT table_name FROM information_schema.tables
WHERE table_schema = database()
AND table_name = 'my_table'
AND table_name IN (
  SELECT table_name FROM information_schema.tables
  WHERE table_schema = database()
)
AND table_type = 'BASE TABLE'

In this case the inner select was causing the original performance issue, and removing it fixed it. This seems safe, since the AND table_name IN (...) section of the query appears entirely superfluous. Wouldn't the query output be identical if we simply removed sql << " AND table_name IN (SELECT table_name FROM information_schema.tables WHERE table_schema = #{scope[:schema]})"?

Perhaps there's an intended performance benefit to having it there that I'm not seeing.

Any comments would be useful, especially since this is in rails:main now; we may have to keep the patch after upgrading Rails. Thanks!
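For anyone following along, the kind of monkey patch being discussed could be sketched roughly as below. This is not the exact Rails internals: the module name and the prepend mechanics are assumptions for illustration (data_source_sql itself is the method named in the linked issue), and the patch simply post-processes the generated SQL to strip the redundant subquery clause.

```ruby
# Hypothetical sketch: strip the redundant `AND table_name IN (SELECT ...)`
# clause from the SQL produced by the adapter. Module name and override
# mechanics are assumptions for illustration, not the actual Rails source.
module StripTableNameSubquery
  # Matches the subquery clause, allowing one level of nested parentheses
  # so the database() call inside the subquery is handled correctly.
  SUBQUERY_CLAUSE =
    /\s+AND table_name IN \(SELECT[^()]*(?:\([^()]*\)[^()]*)*\)/

  def data_source_sql(*args, **kwargs)
    super.gsub(SUBQUERY_CLAUSE, "")
  end
end

# One might then prepend this into the MySQL adapter from an initializer, e.g.:
# ActiveRecord::ConnectionAdapters::Mysql2Adapter.prepend(StripTableNameSubquery)
```

Using prepend (rather than reopening the class) keeps the original method reachable via super, so the patch survives upstream changes to how the rest of the query is built.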

@blowfishpro

@mfoo I think the reason the subquery is needed is because of #39712

If you don't use table-level permissions it should be safe to monkey-patch out the subquery, but I can see why Rails needs to maintain it.

@blowfishpro

Although maybe this goes back to what I suggested in #45503: that the behavior should be configurable.

@eileencodes
Member

@mfoo, I am having trouble following your report and am not sure what you're asking for. In which version of Rails is there an issue? Are you asking us whether we're willing to backport this fix to 6-0-stable / 6-1-stable? Or are you asking for something else?

@mfoo

mfoo commented Dec 6, 2022

Hi, sorry for the delay.

No backporting is necessary, it's not the Rails project's fault that we're behind on version updates!

The MySQL 5.7 server in question has around one million InnoDB tables in it, which makes the subquery method extremely slow. We have not disabled derived_merge, which others have mentioned as the trigger for the issue they experienced. We have monkey-patched out the subquery and things are back to normal.

I acknowledge that this is a fairly high number of tables (although InnoDB permits up to 4 billion), and I don't think Rails/ActiveRecord should necessarily be making changes to support such an unusual use case, but potentially we can find a solution that works everywhere.

Just for clarity, the sequence of events we experienced: the monkey patch here (moving from a join to a subquery) fixed some serious performance regressions in Rails 6 on MySQL 5.6, but it introduced performance regressions for us on 5.7 instances with more tables. We just didn't initially notice, because the instances using 5.7 were smaller. Upon upgrading the remaining 5.6 instances we hit the regression and had to monkey-patch out the inner select.

I think I understand the inconsistency in MySQL < 8 raised in #39712, but what I don't understand is why it's a problem for the Rails project. What's the use case where Rails needs to know about the existence of tables it can't read data from? With my current understanding it feels more correct to revert to the simple version without either the join or the subquery (which is very fast):

SELECT table_name FROM information_schema.tables WHERE table_schema = database() AND table_name = 'some_table'

and to intentionally rely on MySQL's behaviour (hiding tables you don't have SELECT permission granted for).
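The simple version above could be sketched as a hypothetical helper. The method name and the inline quoting are assumptions for illustration; real adapter code would use the connection's own quote method rather than hand-rolled escaping.

```ruby
# A sketch, not the actual Rails source: build the simple form of the
# lookup, without the join or the subquery. The doubled-single-quote
# escaping below is a stand-in for the adapter's string-literal quoting.
def simple_data_source_sql(table_name)
  quoted = "'#{table_name.gsub("'", "''")}'"
  "SELECT table_name FROM information_schema.tables" \
    " WHERE table_schema = database() AND table_name = #{quoted}"
end

puts simple_data_source_sql("some_table")
```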

Successfully merging this pull request may close these issues.

data_source_sql query is unnecessarily slow on MySQL with many tables and derived_merge=off