Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Resolving table/column for UNION queries #13

Closed
john-bodley opened this issue Mar 24, 2017 · 3 comments
Closed

Resolving table/column for UNION queries #13

john-bodley opened this issue Mar 24, 2017 · 3 comments

Comments

@john-bodley
Copy link

@brianv0 I was leveraging the logic you provided in the examples/gather_columns.py script, and agree with your logic for a query of the form,

SELECT foo, b.bar FROM a JOIN b ON a.id = b.id

which results in a table/column resolution of:

SingleColumn(expression=QualifiedNameReference(name=QualifiedName(foo))): [Table(name=QualifiedName(a)), Table(name=QualifiedName(b))]
SingleColumn(expression=QualifiedNameReference(name=QualifiedName(b.bar))): [Table(name=QualifiedName(b))]

i.e., from the query one cannot decipher whether the column foo is from a or b, however for a UNION type query

SELECT foo AS bar FROM a UNION SELECT bar FROM b

the table/column resolution is:

SingleColumn(alias='bar', expression=QualifiedNameReference(name=QualifiedName(foo))): [Table(name=QualifiedName(b)), Table(name=QualifiedName(a))]
SingleColumn(expression=QualifiedNameReference(name=QualifiedName(bar))): [Table(name=QualifiedName(b)), Table(name=QualifiedName(a))]

although it's apparent from the query that the foo column is from table a and the bar column is from table b. Do you know of any resolution to this issue?

@brianv0
Copy link
Member

brianv0 commented Mar 24, 2017

The visitor needs to be made more complicated so that only the tables for the query in context are checked. I've tested the following snippet out a bit. Some of the logic from print_column_resolution_order could be put into the visitor, I did some of that myself.

Another note: This doesn't currently handle subqueries, though it could (and probably should). I maybe should work on adding a few other examples and clarifying the scope of each use case.

def check_extracted_columns(query, with_resolution=False):

    class ColumnGroup(object):
        def __init__(self, columns, tables):
            self.columns = columns
            self.tables = tables

        def __repr__(self):
            return repr(self.columns)

        def __str__(self):
            return str(self.columns)

    class TableAndColumnExtractor(DefaultTraversalVisitor):

        def __init__(self):
            self.column_groups = []

        def visit_query_specification(self, node, context):
            columns = []
            maybe_tables = []
            tables = OrderedDict()

            if node.from_:
                if isinstance(node.from_, Join):
                    relation = node.from_
                    maybe_tables.append(relation.right)
                    while isinstance(relation.left, Join):
                        relation = relation.left
                        maybe_tables.append(relation.right)
                    maybe_tables.append(relation.left)
                else:
                    maybe_tables.append(node.from_)
                maybe_tables.reverse()

            # Make it easy to refer to tables
            for table in maybe_tables:
                handle, table = self._handle_and_table(table)
                if handle:
                    tables[handle] = table
        
            for item in node.select.select_items:
                if isinstance(item, SingleColumn):
                    columns.append(item)

            self.column_groups.append(ColumnGroup(columns, tables))

        def _handle_and_table(self, relation):
            handle = None
            table = None
            if isinstance(relation, AliasedRelation):
                if isinstance(relation.relation, Table):
                    handle = relation.alias
                    table = relation.relation
                else:
                    print("WARNING: Aliased Relation is not a table and "
                          "is omitted")
            else:
                handle = ".".join(relation.name.parts)
                table = relation
            return handle, table

    def print_column_resolution_order(column_group):
        print("\nTable Column Resolution for column group:")
        tables = column_group.tables
        for i in range(len(column_group.columns)):
            column = column_group.columns[i]
            if not isinstance(column.expression, QualifiedNameReference):
                print("Warning: Skipping column at ordinal %d" % i)
                continue
            names = column.expression.name.parts
            resolution = []
            if len(names) > 1:
                qualified_table_name = ".".join(names[:-1])
                if qualified_table_name in tables:
                    resolution.append(tables[qualified_table_name])
            else:
                resolution = [v for v in tables.values()]
            print(repr(column) + ": " + str(resolution))

    print("\n\nChecking query:\n" + query)
    visitor = TableAndColumnExtractor()
    visitor.process(parser.parse(query), None)
    print(visitor.column_groups)
    if with_resolution:
        for column_group in visitor.column_groups:
            print_column_resolution_order(column_group)

@john-bodley
Copy link
Author

Thanks @brianv0 this was super helpful.

@brianv0
Copy link
Member

brianv0 commented Mar 31, 2017

I'm going to close this for now. I've created a new issue as a placeholder to help determine if we need an simple API that can perform a canned set of query analysis jobs (like extracting columns) and how that might be implemented.

@brianv0 brianv0 closed this as completed Mar 31, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants