
Optimize data sources sql for MySQL #45504

Merged

Conversation

fatkodima
Member

Fixes #45503.

This uses queries like the following (note the IN):

SELECT table_name FROM information_schema.tables
WHERE table_schema = database()
AND table_name = "schema_migrations"
AND table_name IN (
  SELECT table_name FROM information_schema.tables
  WHERE table_schema = database()
)

With these changes for 1000 runs I got:

  1. For the original query: 72s
  2. For the query before the changes in #39712 (Use a subquery when filtering information_schema.tables by table_name): 1.53s
  3. For this PR: 6.8s
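For reference, a rough sketch of how such a comparison could be timed from Ruby. Everything here is an assumption for illustration: run_query is a no-op placeholder, and in practice each statement would be executed against MySQL (for example via the mysql2 gem). The pre-#39712 join form of the query is not shown in this thread, so only the two known forms appear below.

```ruby
require "benchmark"

# Placeholder stand-in (an assumption for this sketch): in practice this
# would execute the SQL against MySQL, e.g. with the mysql2 gem. It is a
# no-op here so the sketch runs anywhere.
def run_query(sql)
  sql
end

queries = {
  "subquery form (this PR)" => <<~SQL,
    SELECT table_name FROM information_schema.tables
    WHERE table_schema = database()
      AND table_name = 'schema_migrations'
      AND table_name IN (SELECT table_name FROM information_schema.tables
                         WHERE table_schema = database())
  SQL
  "simple form (no subquery)" => <<~SQL
    SELECT table_name FROM information_schema.tables
    WHERE table_schema = database()
      AND table_name = 'schema_migrations'
  SQL
}

# Time 1000 runs of each form, as in the numbers quoted above.
timings = queries.transform_values do |sql|
  Benchmark.realtime { 1000.times { run_query(sql) } }
end
timings.each { |label, seconds| puts format("%s: %.2fs", label, seconds) }
```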

@blowfishpro Can you verify these changes improved the case you provided?

@eileencodes eileencodes merged commit 31b1403 into rails:main Jul 1, 2022
@blowfishpro

This does appear to help a lot. I think it's a reasonable compromise between the various MySQL bugs we're fighting against.

@mfoo

mfoo commented Sep 12, 2022

Hello! Just to say that on an upgrade from Rails 5.2 to 6 this issue has bitten us quite badly. Our application, for better or for worse, stores batch experiment results in MySQL, and each batch has a different schema, for which we use a different table. We're halfway through a MySQL 5.6->5.7 migration and some databases have hundreds of thousands of tables in them.

Deploying this fix as a monkey patch resolved the issue, thanks!

Note that only our 5.6 databases were affected. The 5.7 hosts, which had not switched off derived_merge, were fine.

@mfoo

mfoo commented Nov 16, 2022

Hi - revisiting this since it hit us again just after upgrading one of the last 5.6 servers to 5.7. The application immediately started experiencing latency issues, and the query produced by the patch was taking in the region of 4 seconds.

Taking a closer look at the patch, the new query being produced looked like this:

SELECT table_name FROM information_schema.tables
WHERE table_schema = database()
AND table_name = 'my_table'
AND table_name IN (
  SELECT table_name FROM information_schema.tables
  WHERE table_schema = database()
)
AND table_type = 'BASE TABLE'

In this case the inner select was causing the original performance issue, and removing it fixed it. This seems safe, since the AND table_name IN (...) section of the query appears entirely superfluous. Wouldn't the query output be identical if we simply removed sql << " AND table_name IN (SELECT table_name FROM information_schema.tables WHERE table_schema = #{scope[:schema]})"?

Perhaps there's an intended performance benefit to having it there that I'm not seeing.

Any comments would be useful, especially since this is in rails:main now; we may have to keep the patch after upgrading Rails. Thanks!
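For anyone following along, the kind of monkey patch being discussed could be sketched roughly as below. This is not the exact Rails internals: the module name and the prepend mechanics are assumptions for illustration (data_source_sql itself is the method named in the linked issue), and the patch simply post-processes the generated SQL to strip the redundant subquery clause.

```ruby
# Hypothetical sketch: strip the redundant `AND table_name IN (SELECT ...)`
# clause from the SQL produced by the adapter. Module name and override
# mechanics are assumptions for illustration, not the actual Rails source.
module StripTableNameSubquery
  # Matches the subquery clause, allowing one level of nested parentheses
  # so the database() call inside the subquery is handled correctly.
  SUBQUERY_CLAUSE =
    /\s+AND table_name IN \(SELECT[^()]*(?:\([^()]*\)[^()]*)*\)/

  def data_source_sql(*args, **kwargs)
    super.gsub(SUBQUERY_CLAUSE, "")
  end
end

# One might then prepend this into the MySQL adapter from an initializer, e.g.:
# ActiveRecord::ConnectionAdapters::Mysql2Adapter.prepend(StripTableNameSubquery)
```

Using prepend (rather than reopening the class) keeps the original method reachable via super, so the patch survives upstream changes to how the rest of the query is built.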

@blowfishpro

@mfoo I think the reason the subquery is needed is because of #39712

If you don't use table-level permissions it should be safe to monkey-patch out the subquery, but I can see why Rails needs to maintain it.

@blowfishpro

Although maybe this goes back to what I suggested in #45503: that the behavior should be configurable.

@eileencodes
Member

@mfoo, I am having trouble following your report and am not sure what you're asking for. In which version of Rails is there an issue? Are you asking us whether we're willing to backport this fix to 6-0-stable / 6-1-stable? Or are you asking for something else?

@mfoo

mfoo commented Dec 6, 2022

Hi, sorry for the delay.

No backporting is necessary, it's not the Rails project's fault that we're behind on version updates!

The MySQL 5.7 server in question has around one million InnoDB tables in it, which makes the subquery method extremely slow. We have not disabled derived_merge, which others have mentioned as the trigger for the issue they experienced. We have monkey-patched out the subquery and things are back to normal.

I acknowledge that this is a fairly high number of tables (although InnoDB permits up to 4 billion), and I don't think Rails/ActiveRecord should necessarily be making changes to support such an unusual use case, but potentially we can find a solution that works everywhere.

Just for clarity, the sequence of events we experienced: the monkey patch here (moving from a join to a subquery) fixed some serious performance regressions in Rails 6 on MySQL 5.6, but it introduced performance regressions for us on 5.7 instances with more tables. We just didn't initially notice, because the instances using 5.7 were smaller. Upon upgrading the remaining 5.6 instances we hit the regression and had to monkey-patch out the inner select.

I think I understand the inconsistency in MySQL < 8 raised in #39712, but what I don't understand is why it's a problem for the Rails project. What's the use case where Rails needs to know about the existence of tables it can't read data from? With my current understanding it feels more correct to revert to the simple version without either the join or the subquery (which is very fast):

SELECT table_name FROM information_schema.tables WHERE table_schema = database() AND table_name = 'some_table'

and to intentionally rely on MySQL's behaviour (hiding tables you don't have SELECT permission granted for).
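The simple version above could be sketched as a hypothetical helper. The method name and the inline quoting are assumptions for illustration; real adapter code would use the connection's own quote method rather than hand-rolled escaping.

```ruby
# A sketch, not the actual Rails source: build the simple form of the
# lookup, without the join or the subquery. The doubled-single-quote
# escaping below is a stand-in for the adapter's string-literal quoting.
def simple_data_source_sql(table_name)
  quoted = "'#{table_name.gsub("'", "''")}'"
  "SELECT table_name FROM information_schema.tables" \
    " WHERE table_schema = database() AND table_name = #{quoted}"
end

puts simple_data_source_sql("some_table")
```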

Successfully merging this pull request may close these issues.

data_source_sql query is unnecessarily slow on MySQL with many tables and derived_merge=off