
propose include_name hooks for autogenerate, allow schemas/tables to be included by name, API + CLI #650

Closed
eeshugerman opened this issue Jan 31, 2020 · 10 comments

@eeshugerman

I use Alembic to manage a Redshift database with many columns (~13,000, across ~400 tables). alembic revision --autogenerate is (understandably) quite slow -- it usually takes about five minutes to complete.

I imagine most of this time is spent waiting for responses from the database as Alembic does its comparison. Would it be possible to run some of these queries concurrently? Perhaps concurrent.futures.ThreadPoolExecutor would be appropriate here.

@eeshugerman
Author

Or would this need to be implemented in SQLAlchemy?

@zzzeek
Member

zzzeek commented Feb 1, 2020

hi there -

it depends on whether you actually have 400 Table metadata objects in your source code with all 13,000 columns physically defined. That would already be unusual, and the discussion at #580 covers pretty much the same issue, with the conclusion that a MetaData model usually does not represent such a large schema at once; the real issue is being able to skip all the ignored tables without having to load them first.

If you do in fact intend to skip the vast majority of these tables or columns, #580 proposes new "include_name" hooks that would run before the "include_object" hook and would be oriented towards not actually loading a Table object if it isn't needed. That issue was closed because the user didn't need it, but implementing "include_name" to skip the majority of tables that aren't part of the model is still the best first option.
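(For context, SQLAlchemy's MetaData.reflect() already supports this kind of skip-by-name via its only argument; the sketch below, with a placeholder connection URL and table names, shows the spirit of what the proposed hooks would generalize.)

```python
# Illustrative only -- the URL and table names are placeholders.
from sqlalchemy import MetaData, create_engine

engine = create_engine("postgresql://scott:tiger@localhost/mydb")
metadata = MetaData()

# Only the named tables are reflected; everything else is skipped
# before any per-table reflection queries are issued.  "only" may
# also be a callable receiving (table_name, metadata).
metadata.reflect(bind=engine, only=["users", "orders"])
```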

As far as reflection slowness overall, on the SQLAlchemy side there is sqlalchemy/sqlalchemy#4379, however this is specific to the metadata.reflect() method and would involve making the reflection queries themselves work across multiple tables at once. If this were to be implemented, Alembic would still require changes since it currently reflects individual Table objects one at a time.

Multiprocessing is an option as well, but I don't think that's a first resort: we mostly need better ways to just avoid reflecting thousands of tables in the first place, since metadata models don't actually look like this. For those tables we do want to reflect, implementing 4379 above and then reworking Alembic to use this new API would be best (if multiprocessing were supported on the SQLAlchemy side, it would be the same API in any case; that is, Alembic would feed a list of table names at once to a consumer). The reflection operation is going to be primarily CPU bound on most backends, and while multiprocessing can help with CPU-bound issues in general, this data still needs to be aggregated into a migration structure in a single process, which would remain a performance bottleneck once pickling and such are added; the merge process would also add a lot of new complexity. It additionally would not work at all with the current execution model for Alembic, where all database-related operations proceed within a single transaction established for a single connection inside env.py, so this would require major conceptual / API changes at all levels and doesn't really seem worth it.

All of that is then complicated by a specific element of your issue: within these 400 tables there are 13,000 columns, and depending on how those columns are distributed through the tables there may be no clear solution, as there's no way in SQLAlchemy to skip columns when reflecting a table. I'm not really sure what to do about that, if that's the source of the performance issue.

All of that said, at the end of the day, assuming your metadata model really is all 400 tables and 13,000 columns, an autogenerate run is doing a huge diff for you. Ideally we'd want to give you a way to tell --autogenerate, on the command line, the list of tables you care about: if you worked on the model in some specific area of the application, it shouldn't have to scan 400 tables to find the handful of columns you know you just modified.

@zzzeek added the "question" label Feb 1, 2020
@zzzeek
Member

zzzeek commented Feb 1, 2020

TL;DR: I think "include_name" hooks are the way to go here, with additional options on the --autogenerate command line to affect include_name dynamically when you know which tables/columns you are trying to autogenerate for.

@eeshugerman
Author

eeshugerman commented Feb 1, 2020

Hi, thanks for the quick and thorough response!

We do indeed have all of these tables and columns defined in our metadata / source code.

Batching queries (sqlalchemy#4379) does sound promising, and it makes sense that it would require changes on the Alembic side to take advantage of it (and that the same changes would be required to take advantage of multiprocessing on the SQLAlchemy side).

> The reflection operation is going to be primarily CPU bound on most backends

This is a surprise to me; I assumed it was IO bound, hence my threading suggestion. (EDIT: Come to think of it, this probably depends on whether the database is on the same machine as Alembic or is remote. Mine is remote (Redshift), so there's a fair bit of overhead per response.)

> It additionally would not work at all with the current execution model for Alembic, where all database-related operations proceed within a single transaction established for a single connection inside env.py, so this would require major conceptual / API changes at all levels and doesn't really seem worth it.

Ah, yes, in that case... agreed.

> I think "include_name" hooks are the way to go here, with additional options on the --autogenerate command line to affect include_name dynamically when you know which tables/columns you are trying to autogenerate for.

True, that would work too! Sounds like both this and batching of queries would be worthwhile. The latter would be trickier to implement and less effective for this particular issue (slow autogenerate), but perhaps beneficial elsewhere too (I'm not sure what else reflection is used for). I may take a stab at the former :).

Feel free to close this, or whatever you see fit.

@zzzeek changed the title from "Autogenerate is slow for large databases" to "propose include_name hooks for autogenerate, allow schemas/tables to be included by name, API + CLI" Feb 1, 2020
@zzzeek
Member

zzzeek commented Feb 1, 2020

The basic idea of include_name is that someone can write callable functions that work in a similar way as "include_object", but run before the Table object is reflected. Similarly, one could specify include_name on the command line; I think the best form is a... regex? An expression that matches "schema.tablename", or "tablename" for tables that are referred to without being schema qualified:

alembic revision -m "foo" --autogenerate --include_name="myschema.*"
alembic revision -m "foo" --autogenerate --include_name='some_table|some_other_table'

The include_name API hook would also take effect even if the include_name CLI option were used: the CLI option runs first, then the include_name hook in env.py.
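(A minimal sketch of how such a regex might drive a name filter; the hook signature and its wiring here are hypothetical, written against the proposal above rather than a released API.)

```python
import re

# e.g. from a hypothetical --include_name="myschema.*" option
pattern = re.compile(r"myschema\..*")

def include_name(name, type_, schema=None):
    # Match "schema.tablename", or plain "tablename" when the table
    # is referred to without being schema qualified.
    if type_ == "table":
        qualified = "%s.%s" % (schema, name) if schema else name
        return bool(pattern.fullmatch(qualified))
    return True  # let other object types through to later hooks
```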

@zzzeek added the "use case" label and removed the "question" label Feb 1, 2020
@zzzeek added the "high priority" and "motivated volunteers" labels Sep 3, 2020
@odesenfans

Thanks for your answer on #733. Question on this: is there really a need to have two separate filters, one before and one after table reflection? Why not change the spec of include_object so that it runs before the reflection step?
Or is it better in your opinion to have an inclusive filter with include_name early on and an exclusive filter with include_object?

@zzzeek
Member

zzzeek commented Sep 3, 2020

> Thanks for your answer on #733. Question on this: is there really a need to have two separate filters, one before and one after table reflection? Why not change the spec of include_object so that it runs before the reflection step?

SQLAlchemy doesn't have a "Schema" object, so it's not clear what would be passed to this function when we wish to exclude an entire schema. The contract of this function is that the whole reflected object is passed in; if we hadn't reflected the Table yet, there would be nothing to call this function with... except the table's name (and schema), which would be the new API include_name :)
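(To make the contrast concrete: the first signature below is Alembic's documented include_object hook, which receives the reflected object itself; the second is a hypothetical name-only counterpart sketched from this discussion, not a released API.)

```python
# Existing hook: called with the fully reflected object in hand.
def include_object(object_, name, type_, reflected, compare_to):
    return not (type_ == "table" and name == "legacy_table")

# Hypothetical name-level hook: called before any reflection, with
# nothing but strings -- enough to skip an entire schema.
def include_name(name, type_):
    if type_ == "schema":
        return name in (None, "myschema")
    return True
```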

> Or is it better in your opinion to have an inclusive filter with include_name early on and an exclusive filter with include_object?

These filters cost nothing to have in the internals, so having just "include_name" and "include_object", as opposed to individual arguments for "schemas", "tablenames", etc., is still a more succinct, generalized approach to the problem. "include_name" would be the name of the thing we're about to work with, before we've done any DB work to get it, and "include_object" is the afterwards hook. Seems pretty reasonable.

@RamonWill

Hi @zzzeek ,

Happy to look into this issue.

From my current understanding of the comments (on #580, #733, and this issue), users have large databases, and when autogenerate is run without any "include_" hooks it goes through all of the tables in the database, taking a considerable amount of time.

So from this understanding, you would like a hook called include_name, which will be a list of object names that the user would like included in the autogenerate (all other object names will be ignored). Then, after the table name is captured, the existing include_object hook is used to capture the object's type, so that only objects of name "x" and type "x" will be used for autogeneration. If this understanding is correct, how would include_name differ from the deprecated include_symbol?

Kind Regards,
Ramon

@zzzeek
Member

zzzeek commented Sep 22, 2020

hi @RamonWill -

include_name would be a function that is passed a single name at a time, like a filter, which is given the names of schemas, tables, and indexes before the work is done to reflect the objects for those names.

It does in fact look a lot like include_symbol, but with these differences (sketched in code after the list):

  1. critically, it runs before the thing it names is reflected and should always prevent additional work from taking place for the given object when filtered.

  2. it is passed the same "type" object that's passed to include_object so that you know if the thing you're looking at is a table, index, schema, or other thing.

  3. it probably should be passed a "schema" argument, so that schema is present for all tables / indexes etc.

  4. it should probably be passed "reflected", the boolean that indicates if we're about to work on this name from a reflection standpoint or a model standpoint.
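(Putting those four points together, a hypothetical hook might look like the sketch below; the parameter names follow the discussion, not a released API.)

```python
# Sketch only: an include_name callable combining points 1-4 above.
def include_name(name, type_, schema, reflected):
    if type_ == "schema":
        # Runs before anything in the schema is loaded (point 1).
        return name in (None, "myschema")
    if type_ == "table" and reflected:
        # "reflected" distinguishes names found in the database from
        # names found in the model (point 4).
        return not name.startswith("legacy_")
    return True

# In env.py this would presumably be wired up via context.configure(),
# alongside the existing include_object hook, e.g.:
#   context.configure(connection=connection, include_name=include_name, ...)
```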

@zzzeek removed the "motivated volunteers" label Jan 18, 2021
@zzzeek self-assigned this Jan 18, 2021
@sqla-tester
Collaborator

Mike Bayer has proposed a fix for this issue in the master branch:

implement include_name hook https://gerrit.sqlalchemy.org/c/sqlalchemy/alembic/+/2485
