New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Constraint information is not being cached #101

Closed
atrigent opened this Issue Nov 8, 2016 · 9 comments

Comments

Projects
None yet
2 participants
@atrigent

atrigent commented Nov 8, 2016

In RedshiftDialect, there are three methods, _get_all_relation_info, _get_all_column_info, and _get_all_constraint_info, which query information about the whole cluster. These methods are called by several other methods which pass their info_cache argument down to one of the above methods. These methods, in turn, are called by other methods which, in most cases, pass down all of their keyword arguments. The one exception is _get_redshift_constraints - this method is never called with the keyword arguments of its caller, and therefore never gets the info_cache object that was passed in by SQLAlchemy. The result of this is that constraint information is never cached. When running something like alembic which does a lot of reflection, this can slow things down considerably because the dialect winds up querying information about the entire database every time alembic asks for anything.

@jklukas

This comment has been minimized.

Show comment
Hide comment
@jklukas

jklukas Nov 8, 2016

Collaborator

Thanks for catching the lack of caching on _get_redshift_constraints. Sounds like an easy optimization.

When running something like alembic which does a lot of reflection, this can slow things down considerably because the dialect winds up querying information about the entire database every time alembic asks for anything

I believe there's a fundamental constraint in SQLAlchemy's API for reflection that allows caching only within the context of reflecting a single table. I originally wrote the reflection code in this dialect with bulk queries like _get_all_relation_info hoping that we could reconstruct the entire database schema with just a few queries, but I don't believe this is possible.

I do hope that using cache_info for the constraints query will speed things up a bit, but reflection across many tables is likely going to continue to be slow.

Collaborator

jklukas commented Nov 8, 2016

Thanks for catching the lack of caching on _get_redshift_constraints. Sounds like an easy optimization.

When running something like alembic which does a lot of reflection, this can slow things down considerably because the dialect winds up querying information about the entire database every time alembic asks for anything

I believe there's a fundamental constraint in SQLAlchemy's API for reflection that allows caching only within the context of reflecting a single table. I originally wrote the reflection code in this dialect with bulk queries like _get_all_relation_info hoping that we could reconstruct the entire database schema with just a few queries, but I don't believe this is possible.

I do hope that using cache_info for the constraints query will speed things up a bit, but reflection across many tables is likely going to continue to be slow.

jklukas added a commit that referenced this issue Nov 8, 2016

@jklukas jklukas referenced this issue Nov 8, 2016

Merged

Pass in info_cache when querying constraints. #102

4 of 4 tasks complete
@jklukas

This comment has been minimized.

Show comment
Hide comment
@jklukas

jklukas Nov 8, 2016

Collaborator

PR for this in #102

Collaborator

jklukas commented Nov 8, 2016

PR for this in #102

@jklukas

This comment has been minimized.

Show comment
Hide comment
@jklukas

jklukas Nov 8, 2016

Collaborator

Also see #103 for more thoughts on persisting schema info across multiple tables.

Collaborator

jklukas commented Nov 8, 2016

Also see #103 for more thoughts on persisting schema info across multiple tables.

@atrigent

This comment has been minimized.

Show comment
Hide comment
@atrigent

atrigent Nov 8, 2016

I think your approach of doing just a few queries actually does work. You're right that the reflection methods that make up the public interface of the dialect generally require a table name, and that caching the result of those methods by itself won't reduce the number of queries that you have to make, but caching the result of your _get_all_* methods, which return information about the whole database, does definitely speed things up considerably. With the change you made in PR #102, a test that we have for making sure that our migrations are up to date goes from 2+ minutes to about 3 seconds.

atrigent commented Nov 8, 2016

I think your approach of doing just a few queries actually does work. You're right that the reflection methods that make up the public interface of the dialect generally require a table name, and that caching the result of those methods by itself won't reduce the number of queries that you have to make, but caching the result of your _get_all_* methods, which return information about the whole database, does definitely speed things up considerably. With the change you made in PR #102, a test that we have for making sure that our migrations are up to date goes from 2+ minutes to about 3 seconds.

@jklukas

This comment has been minimized.

Show comment
Hide comment
@jklukas

jklukas Nov 8, 2016

Collaborator

a test that we have for making sure that our migrations are up to date goes from 2+ minutes to about 3 seconds

Wow. That is much more improvement than I expected.

Collaborator

jklukas commented Nov 8, 2016

a test that we have for making sure that our migrations are up to date goes from 2+ minutes to about 3 seconds

Wow. That is much more improvement than I expected.

@atrigent

This comment has been minimized.

Show comment
Hide comment
@atrigent

atrigent Nov 8, 2016

Well, previously that query in _get_all_constraint_info was being run over and over and over again. Now it only gets run once.

atrigent commented Nov 8, 2016

Well, previously that query in _get_all_constraint_info was being run over and over and over again. Now it only gets run once.

@jklukas

This comment has been minimized.

Show comment
Hide comment
@jklukas

jklukas Nov 8, 2016

Collaborator

Thank you again for inspecting the code enough to figure out what was going on. I look forward to seeing whether this significantly improves some of my own use cases.

Collaborator

jklukas commented Nov 8, 2016

Thank you again for inspecting the code enough to figure out what was going on. I look forward to seeing whether this significantly improves some of my own use cases.

@atrigent

This comment has been minimized.

Show comment
Hide comment
@atrigent

atrigent Nov 8, 2016

And thank you for the quick response. :)

It's an especially large gain for us because we often have a bunch of stale schemas in this cluster. Therefore, the _get_all_constraint_info query slows down slightly, and the cumulative effect of running that slightly slower query over and over again can cause the test that I mentioned to take 2 minutes or longer. I guess you probably won't notice as much of a speedup, but probably still some.

atrigent commented Nov 8, 2016

And thank you for the quick response. :)

It's an especially large gain for us because we often have a bunch of stale schemas in this cluster. Therefore, the _get_all_constraint_info query slows down slightly, and the cumulative effect of running that slightly slower query over and over again can cause the test that I mentioned to take 2 minutes or longer. I guess you probably won't notice as much of a speedup, but probably still some.

@jklukas

This comment has been minimized.

Show comment
Hide comment
@jklukas

jklukas Nov 8, 2016

Collaborator

This change is now merged to master.

Collaborator

jklukas commented Nov 8, 2016

This change is now merged to master.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment