Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[sqlalchemy] Improve the performance of TrinoDialect.get_view_names #267

Merged
merged 1 commit into from
Nov 2, 2022

Conversation

john-bodley
Copy link
Contributor

@john-bodley john-bodley commented Oct 13, 2022

Description

Querying the information_schema.views table seems to be sub-performant for determining which entities are views—I speculate this is because all the view definitions need to be extracted/compiled. A more performant approach is to use information_schema.tables with the appropriate type filter.

For example for a very large schema at Airbnb (comprising of over 100k entities) the following query,

SELECT
    table_name
FROM 
    information_schema.tables
WHERE 
    table_schema = '<schema>' AND 
    table_type = 'VIEW'

took ~ 5 seconds, whereas:

SELECT
    table_name
FROM
    information_schema.views
WHERE    
    table_schema = '<schema>'

was still running after 10 minutes.

Note there were no prior integration tests for the TrinoDialect.get_view_names method so I added some to cover this logic and threw in a few extra for free which should cover all the various scenarios.

Non-technical explanation

Release notes

( ) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
(x) Release notes are required, with the following suggested text:

* Improve the performance of get_view_names in SQLAlchemy. ({issue}`267`)

@cla-bot cla-bot bot added the cla-signed label Oct 13, 2022
@john-bodley john-bodley marked this pull request as ready for review October 13, 2022 19:19
@john-bodley john-bodley force-pushed the john-bodley--fix-get-view-names branch 17 times, most recently from a5a6c75 to e0de7b9 Compare October 14, 2022 00:56
@john-bodley john-bodley changed the title [sqlachemy] Improve the performance of get_view_names [sqlachemy] Improve the performance of TrinoDialect.get_view_names Oct 14, 2022
@john-bodley john-bodley changed the title [sqlachemy] Improve the performance of TrinoDialect.get_view_names [sqlalchemy] Improve the performance of TrinoDialect.get_view_names Oct 14, 2022
f"trino://test@{host}:{port}/{request.param}",
connect_args={
"max_attempts": 1,
"schema": "test",
Copy link
Contributor Author

@john-bodley john-bodley Oct 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I needed to explicitly set the schema as part of the engine otherwise Trino barfed when trying to create the views with the

Schema must be specified when session schema is not set

error. I tried a number of things including explicitly including i) executing the USE test statement, and ii) including the schema name in the view name, i.e., test.my_view however this failed like because this was escaped to be "test.my_view" rather than "test"."my_view".

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On first sight it can be done by doing something like:

create_view(
            ...,
            schema="test")

See https://github.com/kvesteri/sqlalchemy-utils/blob/1f6e2d6c029f5c578be0e48c7baa98e6984b7bf8/sqlalchemy_utils/view.py#L63

I also think you could include the schema in the pytest param @pytest.mark.parametrize('trino_connection', ['memory/test'], indirect=True) but personally I would prefer to have it in the create_view call.

Copy link
Contributor Author

@john-bodley john-bodley Oct 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mdesmet the create_view method does not support setting an explicit schema. Regrettably also defining,

metadata = sqla.MetaData(schema=schema)

did not work and thus the only solution I was able to find—I spent hours trying to get this to work—was to explicitly set the schema at the time of the engine creation.

Additionally setting the schema in the engine ensures that _get_default_schema_name returns a non-None value which is necessary for the additional tests I added to test existing logic which wasn't previously tested.

@john-bodley john-bodley requested review from mdesmet and removed request for hovaesco October 14, 2022 01:19
@hovaesco
Copy link
Member

@hashhar are you aware from where the performance difference comes from? It might be that query to information_schema.tables do some kind of predicate pushdown due to table_type = 'VIEW'?

@@ -368,3 +379,45 @@ def test_get_table_comment(trino_connection):
assert actual['text'] is None
finally:
metadata.drop_all(engine)


@pytest.mark.parametrize('trino_connection', ['memory'], indirect=True)
Copy link
Contributor

@mdesmet mdesmet Oct 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of overriding the schema for any test you can just do

Suggested change
@pytest.mark.parametrize('trino_connection', ['memory'], indirect=True)
@pytest.mark.parametrize('trino_connection', ['memory/test'], indirect=True)

metadata.drop_all(engine)


@patch('trino.sqlalchemy.dialect.TrinoDialect._get_default_schema_name')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you remove the default schema from the fixture you can rid yourself of this @patch

@john-bodley john-bodley force-pushed the john-bodley--fix-get-view-names branch from e0de7b9 to a11c234 Compare October 19, 2022 16:24
@john-bodley
Copy link
Contributor Author

john-bodley commented Oct 19, 2022

@mdesmet thanks for the review. I've addressed your comments.

@john-bodley john-bodley force-pushed the john-bodley--fix-get-view-names branch 2 times, most recently from 71ebc98 to 4662c06 Compare October 20, 2022 19:53
@mdesmet
Copy link
Contributor

mdesmet commented Oct 27, 2022

@hashhar : PTAL I think we should make this change based on your feedback on Slack

hashhar
For views we call metadata.getViews in InformationSchemaPageSource - that methods ends up retrieving view definitions as well.
For tables we call listTables + listViews (in InformationSchemaPageSource) - neither of which need view definitions. So actually the tables is faster since it just lists things.

Copy link
Member

@ebyhr ebyhr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please fix the commit title as "Improve the performance of get_view_names in SQLAlchemy"

WHERE "table_schema" = :schema
AND "table_type" = 'VIEW'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth leaving code comment why this query uses information_schema.tables.

setup.py Outdated
@@ -39,6 +39,7 @@
"pytest",
"pytest-runner",
"click",
"sqlalchemy-utils",
Copy link
Member

@hashhar hashhar Oct 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note that we're now starting to test integration with a non-SQLAlchemy library. This is one of many other alternative libraries which enhance SQLAlchemy.

(This is needed because SQLAlchemy itself doesn't provide inbuilt functionality to create views - it does provide extension points which the sqlalchemy-utils library implements).

@john-bodley john-bodley force-pushed the john-bodley--fix-get-view-names branch 2 times, most recently from 99ca619 to 5f8d4e2 Compare November 2, 2022 00:14
@ebyhr ebyhr force-pushed the john-bodley--fix-get-view-names branch from 5f8d4e2 to 08f7a1c Compare November 2, 2022 00:26
@ebyhr ebyhr merged commit b008848 into trinodb:master Nov 2, 2022
@ebyhr ebyhr mentioned this pull request Nov 2, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

5 participants