propose include_name hooks for autogenerate, allow schemas/tables to be included by name, API + CLI #650
I use Alembic to manage a Redshift database with many columns (~13,000, across ~400 tables). As a result, alembic revision --autogenerate is (understandably) quite slow: it usually takes about five minutes to complete. I imagine most of this time is spent waiting for responses from the database as Alembic does its comparison. Would it be possible to run some of these queries concurrently? Perhaps concurrent.futures.ThreadPoolExecutor would be appropriate here. Or would this need to be implemented in SQLAlchemy?
hi there - it depends on whether you actually have 400 Table metadata objects in your source code with all 13,000 columns physically defined. That would already be unusual, and the discussion at #580 covers pretty much the same issue, with the conclusion that a MetaData model usually does not represent such a large schema at once; the real issue is being able to skip all the ignored tables without having to load them first. If you do intend to skip the vast majority of these tables or columns, #580 proposes new "include_name" hooks that would run before the "include_object" hook does and would be oriented towards not actually loading a Table object if it isn't needed. That issue was closed because the user didn't need it, but the option to implement "include_name" to skip the majority of tables that aren't part of a model is still the best first choice.

As far as reflection slowness overall, on the SQLAlchemy side there is sqlalchemy/sqlalchemy#4379; however, this is specific to the metadata.reflect() method and would involve making the reflection queries themselves work across multiple tables at once. Even if this were implemented, Alembic would still require changes, since it currently reflects individual Table objects one at a time.

Multiprocessing is an option as well, but I don't think it's a first resort, because we mostly need better ways to just avoid reflecting thousands of tables in the first place - metadata models don't actually look like this - and for those tables we do want to reflect, implementing 4379 above and then reworking Alembic to use this new API would be best (if multiprocessing were supported on the SQLAlchemy side, it would be the same API in any case, that is, Alembic would feed a list of table names at once to a consumer). The reflection operation is going to be primarily CPU bound on most backends, and while multiprocessing can help with CPU-bound issues in general, this data still needs to be aggregated into a migration structure in a single process, which would remain a performance bottleneck once pickling and such are added; this merge process would also be a lot of new complexity. It additionally would not work at all with the current execution model for Alembic, which is that all database-related operations proceed within a single transaction established for a single connection inside of env.py, so this would require major conceptual / API changes at all levels and doesn't really seem worth it.

All of that is then complicated by a specific element of your issue, which is that within these 400 tables there are 13,000 columns. Depending on how those columns are distributed through the tables, there's no clear solution to that, as there's no way in SQLAlchemy to skip columns when reflecting a table. Not really sure what to do with that, if that's the source of the performance issue.

All that said, at the end of the day, assuming your metadata model is actually all 400 tables + 13,000 columns, an autogenerate for you is doing a huge diff, and we'd ideally want to give you some way to just tell --autogenerate a list of tables you care about on the command line, because if you worked on the model in some specific area of the application, it shouldn't really have to scan 400 tables to search for the handful of columns you know you just modified.
TL;DR: I think "include_name" hooks are the way to go here, with additional options on the --autogenerate command line to affect include_name dynamically when you know which tables/columns you are trying to autogenerate for.
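For reference, the existing include_object hook that this proposal builds on runs only after reflection; a minimal env.py sketch using the documented signature (the "legacy_" table prefix here is an invented example):

```python
from alembic import context

def include_object(object, name, type_, reflected, compare_to):
    # runs AFTER reflection: the Table (or other object) has already
    # been loaded from the database by the time this is called
    if type_ == "table" and name.startswith("legacy_"):
        return False
    return True

context.configure(
    # ... connection / target_metadata as usual ...
    include_object=include_object,
)
```

The proposed include_name hook would sit in front of this, vetoing names before any reflection query is issued.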
Hi, thanks for the quick and thorough response! We do indeed have all of these tables and columns defined in our metadata / source code. Batching queries (sqlalchemy#4379) does sound promising, and it makes sense that it would require changes on the Alembic side to take advantage of it (and that the same changes would be required to take advantage of multiprocessing on the SQLAlchemy side).
This is a surprise to me; I assumed it was I/O bound, hence my threading suggestion. (EDIT: Come to think of it, this probably depends on whether the database is on the same machine as Alembic or is remote. Mine is remote (Redshift), so there's a fair bit of overhead per response.)
Ah, yes, in that case... agreed.
True, that would work too! Sounds like both this and batching of queries would be worthwhile. The latter would be a bit trickier to implement and less effective for this particular issue (slow autogenerate), but perhaps beneficial elsewhere too (not sure what else reflection is used for)? I may take a stab at the former :). Feel free to close this, or whatever you see fit.
The basic idea of include_name is that someone can write callable functions that do "include_name" filtering in a similar way to "include_object", except they run before the Table object is reflected. Similarly, one could specify include_name on the command line, and I think the best way is as a regex expression that matches "schema.tablename", or just "tablename" for tables that are referred to without being schema qualified:
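A sketch of those matching semantics, where the --include-name flag and the billing pattern are purely hypothetical (no such command line option existed at the time of this discussion):

```python
import re

# Hypothetical command line usage (illustrative only):
#   alembic revision --autogenerate --include-name "billing\..*"
pattern = re.compile(r"billing\..*")

def name_matches(schema, tablename):
    # schema-qualified tables are tested as "schema.tablename";
    # unqualified tables are tested as just "tablename"
    key = "%s.%s" % (schema, tablename) if schema else tablename
    return pattern.match(key) is not None

assert name_matches("billing", "invoices")
assert not name_matches(None, "invoices")
```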
The include_name API hook would also take effect even if the include_name CLI option were used: the CLI option runs first, then the include_name hook in env.py.
Thanks for your answer on #733. Question on this: is there really a need to have two separate filters, one before and one after table reflection? Why not change the spec of include_object instead?
SQLAlchemy doesn't have a "Schema" object, so it's not clear what would be passed to this function when we wish to exclude an entire schema. The contract of include_object is that the whole reflected object is passed in; if we hadn't reflected the Table yet, there would be nothing to call this function with... except the table's name (and schema), which would be the new API, include_name :)
These filters cost nothing to have in the internals, so having just "include_name" and "include_object", as opposed to individual arguments for "schemas", "tablenames", etc., is still a more succinct, generalized approach to the problem. "include_name" would be the name of the thing we're about to work with, before we've done any DB work to get it, and "include_object" is the afterwards hook. Seems pretty reasonable.
Hi @zzzeek, happy to look into this issue. From my current understanding reading the comments (on #580, #733, and this issue), users have large databases, and when autogenerate is run without any "include_" hooks it goes through all of the tables in the database, taking a considerable amount of time. So from this understanding, you would like a hook called include_name that filters on names before any reflection happens? Kind regards,
hi @RamonWill - include_name would be a function that is passed a single name at a time, like a filter, which is given the names of schemas, tables, and indexes before the work is done to reflect the objects for those names. It does in fact look a lot like include_symbol, though with some differences.
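A sketch of what such a hook could look like in env.py; the (name, type_, parent_names) signature follows what the hook was eventually documented with, and the schema name and model import are invented:

```python
from alembic import context

# the usual MetaData object referenced in env.py (hypothetical import)
from myapp.models import metadata as target_metadata

def include_name(name, type_, parent_names):
    # runs BEFORE reflection: only the name is available here, so a
    # vetoed table is never loaded from the database at all
    if type_ == "schema":
        # None refers to the default schema
        return name in [None, "billing"]
    elif type_ == "table":
        # only reflect tables that are actually part of the model
        return name in target_metadata.tables
    else:
        return True

context.configure(
    # ... connection etc. as usual ...
    target_metadata=target_metadata,
    include_name=include_name,
    include_schemas=True,
)
```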
Mike Bayer has proposed a fix for this issue in the master branch: implement include_name hook https://gerrit.sqlalchemy.org/c/sqlalchemy/alembic/+/2485