Scan can take DAYS on large database clusters #190

Open
LiamStorkey opened this issue Jun 21, 2022 · 2 comments
Labels
enhancement New feature or request

Comments

LiamStorkey commented Jun 21, 2022

I have recently tried to integrate the piicatcher library, in API mode, into my workflow to identify any PII data in our databases. I am running in AWS RDS clusters, which can have up to 660 databases each. When running PIICatcher in this environment, the library queries the INFORMATION_SCHEMA tables, which hold information about every other database in the RDS cluster as well. Here is the exact query it runs:

SELECT
        lower(c.column_name) AS col_name,
        c.column_comment AS col_description,
        lower(c.data_type) AS col_type,
        lower(c.ordinal_position) AS col_sort_order,
        c.table_catalog AS cluster,
        lower(c.table_schema) AS "schema",
        lower(c.table_name) AS name,
        t.table_comment AS description,
        case when lower(t.table_type) = "view" then "true" else "false" end AS is_view
        FROM
        INFORMATION_SCHEMA.COLUMNS AS c
        LEFT JOIN
        INFORMATION_SCHEMA.TABLES t
            ON c.TABLE_NAME = t.TABLE_NAME
            AND c.TABLE_SCHEMA = t.TABLE_SCHEMA

        WHERE
            c.table_schema NOT IN ('information_schema', 'performance_schema', 'sys', 'mysql')

        ORDER by cluster, "schema", name, col_sort_order

This query still runs in full even when you are scanning a single database and pass the include/exclude schema flags.
When you have a large number of databases, it can take 100+ hours to complete.

I have come up with a more efficient query that completes in roughly 10 minutes, even with a huge INFORMATION_SCHEMA:

SELECT
    c.column_name AS col_name,
    c.column_comment AS col_description,
    c.data_type AS col_type,
    c.ordinal_position AS col_sort_order,
    c.table_catalog AS cluster,
    c.table_schema AS "schema",
    c.table_name AS name,
    t.table_comment AS description,
    CASE WHEN LOWER(t.table_type) = "view" THEN "true" ELSE "false" END AS is_view
FROM
    INFORMATION_SCHEMA.TABLES AS t
LEFT OUTER JOIN INFORMATION_SCHEMA.COLUMNS AS c
    ON c.TABLE_SCHEMA = t.TABLE_SCHEMA
    AND c.TABLE_NAME = t.TABLE_NAME
WHERE c.TABLE_SCHEMA = "<insert_requested_db_name>";

I believe this should be implemented in the library to speed up scanning databases in cluster configurations (like RDS). It would be a huge time saver, and I can see it increasing adoption of the project. I would like to contribute this, however I have no idea where the change needs to be made. I have looked through the PIICatcher code and could not find where this query comes from. Does it come from a dependency (SQLAlchemy, perhaps)?

tokernjan commented Jun 22, 2022

Hi @LiamStorkey,

I believe this query is being generated by the databuilder package from the Amundsen project: https://github.com/amundsen-io/amundsen/blob/main/databuilder/databuilder/extractor/mysql_metadata_extractor.py#L28-L47

The databuilder metadata extractor is used via the dbcat package, which is part of Tokern: https://github.com/tokern/dbcat/blob/main/dbcat/catalog/db.py#L242-L263

The only way to customise that query is via the where_clause_suffix, which is set in the dbcat _create_mysql_extractor method. It defaults to:

WHERE
            c.table_schema NOT IN ('information_schema', 'performance_schema', 'sys', 'mysql')

Maybe you can try changing that clause to filter on only the selected DB names? I'm not sure if that would yield the same performance improvement as the modified query you suggested, though.
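To make the suggestion above concrete, here is a minimal sketch of how such a suffix could be built from the schemas the user asked piicatcher to include. `build_where_clause_suffix` is a hypothetical helper, not part of dbcat or databuilder; it only constructs the SQL fragment that would be passed as `where_clause_suffix`.

```python
def build_where_clause_suffix(include_schemas):
    """Build a WHERE clause that restricts the INFORMATION_SCHEMA query
    to the requested schemas, instead of only excluding system schemas.

    `include_schemas` is a list of database/schema names the user asked
    to scan. Single quotes in names are doubled as a naive escape; a
    real implementation should prefer driver-level bind parameters.
    """
    if not include_schemas:
        # Fall back to the current default: exclude system schemas only.
        return (
            "WHERE c.table_schema NOT IN "
            "('information_schema', 'performance_schema', 'sys', 'mysql')"
        )
    quoted = ", ".join(
        "'{}'".format(s.replace("'", "''")) for s in include_schemas
    )
    return "WHERE c.table_schema IN ({})".format(quoted)
```

For example, `build_where_clause_suffix(["mydb"])` yields `WHERE c.table_schema IN ('mydb')`, which lets MySQL prune every other database's metadata before the join instead of after it.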

vrajat (Member) commented Jun 23, 2022

This is a useful optimization. @tokernjan is right about the control flow. Another possibility is to implement a custom extractor, similar to mysql_metadata_extractor, that rewrites the query against the information schema.

For example, tokern/dbcat implements https://github.com/tokern/dbcat/blob/main/dbcat/catalog/sqlite_extractor.py

Then the fast extractor can be instantiated using a flag in https://github.com/tokern/dbcat/blob/main/dbcat/catalog/db.py#L242-L263
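As a rough illustration of that custom-extractor idea, here is a hypothetical sketch in the spirit of dbcat's sqlite_extractor. The class name, methods, and SQL template are illustrative, not real dbcat or databuilder API; it scopes the metadata query to one schema and binds the schema name as a parameter rather than interpolating it into the SQL text.

```python
# Hypothetical "fast" MySQL metadata extractor: queries INFORMATION_SCHEMA
# for a single schema, using the per-schema query proposed in this issue.

FAST_MYSQL_SQL = """
SELECT
    c.column_name      AS col_name,
    c.column_comment   AS col_description,
    c.data_type        AS col_type,
    c.ordinal_position AS col_sort_order,
    c.table_catalog    AS cluster,
    c.table_schema     AS `schema`,
    c.table_name       AS name,
    t.table_comment    AS description,
    CASE WHEN LOWER(t.table_type) = 'view' THEN 'true' ELSE 'false' END AS is_view
FROM INFORMATION_SCHEMA.TABLES AS t
LEFT OUTER JOIN INFORMATION_SCHEMA.COLUMNS AS c
    ON  c.table_schema = t.table_schema
    AND c.table_name   = t.table_name
WHERE c.table_schema = %s
ORDER BY name, col_sort_order
"""


class FastMysqlMetadataExtractor:
    """Yields column-metadata rows for a single schema.

    `connection` is any DB-API 2.0 connection (e.g. from PyMySQL);
    no driver is imported here, so query construction can be tested
    without a database.
    """

    def __init__(self, connection, schema):
        self._connection = connection
        self._schema = schema

    def query_and_params(self):
        # The driver binds the schema name, avoiding string
        # interpolation of user input into the SQL text.
        return FAST_MYSQL_SQL, (self._schema,)

    def extract(self):
        sql, params = self.query_and_params()
        cursor = self._connection.cursor()
        cursor.execute(sql, params)
        for row in cursor:
            yield row
```

A flag in db.py could then choose between this and the default databuilder extractor, as suggested above.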

@jhecking jhecking added the enhancement New feature or request label May 12, 2023