Scan can take DAYS on large database clusters #190

Open
LiamStorkey opened this issue Jun 21, 2022 · 2 comments
Labels
enhancement New feature or request

Comments

LiamStorkey commented Jun 21, 2022

I have recently tried to integrate the piicatcher library, in API mode, into my workflow to identify any PII data in our databases. I am running in AWS RDS clusters, which can have up to 660 databases each. When running PIICatcher in this environment, the library queries the INFORMATION_SCHEMA tables, which hold information about every other database in the RDS cluster as well. Here is the exact query it runs:

SELECT
        lower(c.column_name) AS col_name,
        c.column_comment AS col_description,
        lower(c.data_type) AS col_type,
        lower(c.ordinal_position) AS col_sort_order,
        c.table_catalog AS cluster,
        lower(c.table_schema) AS "schema",
        lower(c.table_name) AS name,
        t.table_comment AS description,
        case when lower(t.table_type) = "view" then "true" else "false" end AS is_view
        FROM
        INFORMATION_SCHEMA.COLUMNS AS c
        LEFT JOIN
        INFORMATION_SCHEMA.TABLES t
            ON c.TABLE_NAME = t.TABLE_NAME
            AND c.TABLE_SCHEMA = t.TABLE_SCHEMA

        WHERE
            c.table_schema NOT IN ('information_schema', 'performance_schema', 'sys', 'mysql')

        ORDER by cluster, "schema", name, col_sort_order

This query still runs in full even when you are scanning a single database and pass the include/exclude schema flags.
When you have a large number of databases, it can take 100+ hours to complete.

I have come up with a more efficient query that completes in roughly 10 minutes, even with a huge INFORMATION_SCHEMA:

SELECT
    c.column_name AS col_name,
    c.column_comment AS col_description,
    c.data_type AS col_type,
    c.ordinal_position AS col_sort_order,
    c.table_catalog AS cluster,
    c.table_schema AS "schema",
    c.table_name AS name,
    t.table_comment AS description,
    CASE WHEN LOWER(t.table_type) = "view" THEN "true" ELSE "false" END AS is_view
FROM
    INFORMATION_SCHEMA.TABLES AS t
LEFT OUTER JOIN INFORMATION_SCHEMA.COLUMNS AS c
    ON c.TABLE_SCHEMA = t.TABLE_SCHEMA
    AND c.TABLE_NAME = t.TABLE_NAME
WHERE c.TABLE_SCHEMA = "<insert_requested_db_name>";

I believe this should be implemented in the library to speed up scanning databases in cluster configurations (like RDS). It would be a huge time saver, and I can see it increasing adoption of the project. I would like to contribute this, however I have no idea where the change needs to be made. I have looked through the PIICatcher code and could not find where this query comes from. Does it come from a dependency (SQLAlchemy, perhaps)?

tokernjan commented Jun 22, 2022

Hi @LiamStorkey,

I believe this query is being generated by the databuilder package from the Amundsen project: https://github.com/amundsen-io/amundsen/blob/main/databuilder/databuilder/extractor/mysql_metadata_extractor.py#L28-L47

The databuilder metadata extractor is used via the dbcat package, which is part of Tokern: https://github.com/tokern/dbcat/blob/main/dbcat/catalog/db.py#L242-L263

The only way to customise that query is via the where_clause_suffix, which is set in the dbcat _create_mysql_extractor method. It defaults to:

WHERE
            c.table_schema NOT IN ('information_schema', 'performance_schema', 'sys', 'mysql')

Maybe you can try changing that clause to filter on only the selected DB names? I'm not sure if that would yield the same performance improvement as the modified query you suggested, though.
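To make the suggestion above concrete, here is a minimal sketch of how such a suffix could be built from the schemas the user asked piicatcher to include. `build_where_clause_suffix` is a hypothetical helper, not part of dbcat or databuilder; it only constructs the SQL fragment that would be passed as `where_clause_suffix`.

```python
def build_where_clause_suffix(include_schemas):
    """Build a WHERE clause that restricts the INFORMATION_SCHEMA query
    to the requested schemas, instead of only excluding system schemas.

    `include_schemas` is a list of database/schema names the user asked
    to scan. Single quotes in names are doubled as a naive escape; a
    real implementation should prefer driver-level bind parameters.
    """
    if not include_schemas:
        # Fall back to the current default: exclude system schemas only.
        return (
            "WHERE c.table_schema NOT IN "
            "('information_schema', 'performance_schema', 'sys', 'mysql')"
        )
    quoted = ", ".join(
        "'{}'".format(s.replace("'", "''")) for s in include_schemas
    )
    return "WHERE c.table_schema IN ({})".format(quoted)
```

For example, `build_where_clause_suffix(["mydb"])` yields `WHERE c.table_schema IN ('mydb')`, which lets MySQL prune every other database's metadata before the join instead of after it.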

vrajat (Member) commented Jun 23, 2022

This is a useful optimization. @tokernjan is right about the control flow. Another possibility is to implement a custom extractor, similar to mysql_metadata_extractor, that rewrites the query against the information schema.

For example, tokern/dbcat implements https://github.com/tokern/dbcat/blob/main/dbcat/catalog/sqlite_extractor.py

Then the fast extractor can be instantiated using a flag in https://github.com/tokern/dbcat/blob/main/dbcat/catalog/db.py#L242-L263
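As a rough illustration of that custom-extractor idea, here is a hypothetical sketch in the spirit of dbcat's sqlite_extractor. The class name, methods, and SQL template are illustrative, not real dbcat or databuilder API; it scopes the metadata query to one schema and binds the schema name as a parameter rather than interpolating it into the SQL text.

```python
# Hypothetical "fast" MySQL metadata extractor: queries INFORMATION_SCHEMA
# for a single schema, using the per-schema query proposed in this issue.

FAST_MYSQL_SQL = """
SELECT
    c.column_name      AS col_name,
    c.column_comment   AS col_description,
    c.data_type        AS col_type,
    c.ordinal_position AS col_sort_order,
    c.table_catalog    AS cluster,
    c.table_schema     AS `schema`,
    c.table_name       AS name,
    t.table_comment    AS description,
    CASE WHEN LOWER(t.table_type) = 'view' THEN 'true' ELSE 'false' END AS is_view
FROM INFORMATION_SCHEMA.TABLES AS t
LEFT OUTER JOIN INFORMATION_SCHEMA.COLUMNS AS c
    ON  c.table_schema = t.table_schema
    AND c.table_name   = t.table_name
WHERE c.table_schema = %s
ORDER BY name, col_sort_order
"""


class FastMysqlMetadataExtractor:
    """Yields column-metadata rows for a single schema.

    `connection` is any DB-API 2.0 connection (e.g. from PyMySQL);
    no driver is imported here, so query construction can be tested
    without a database.
    """

    def __init__(self, connection, schema):
        self._connection = connection
        self._schema = schema

    def query_and_params(self):
        # The driver binds the schema name, avoiding string
        # interpolation of user input into the SQL text.
        return FAST_MYSQL_SQL, (self._schema,)

    def extract(self):
        sql, params = self.query_and_params()
        cursor = self._connection.cursor()
        cursor.execute(sql, params)
        for row in cursor:
            yield row
```

A flag in db.py could then choose between this and the default databuilder extractor, as suggested above.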

@jhecking jhecking added the enhancement New feature or request label May 12, 2023