Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UNION ALL not supported #320

Open
commercial-hippie opened this issue Nov 29, 2018 · 16 comments
Open

UNION ALL not supported #320

commercial-hippie opened this issue Nov 29, 2018 · 16 comments
Assignees

Comments

@commercial-hippie
Copy link
Contributor

I saw that there was a previous task created for this but I think that Verdict has been reworked a lot and support for unions was probably lost somewhere along the way.

Here are some tests I ran.. Excuse the formatting it's from a proxy that returns JSON.

I currently have a workaround implemented for our use case but this would be awesome to have in the core. 😄

-- "SELECT count(distinct mycolumn) FROM table_b_orc"
{
    "results": [
        {
            "cd2": "4.0"
        }
    ]
}

-- "SELECT count(distinct mycolumn) FROM table_a_orc"
{
    "results": [
        {
            "cd2": "6.0"
        }
    ]
}

-- "SELECT count(distinct mycolumn) FROM table_b_orc UNION ALL SELECT count(distinct mycolumn) FROM table_a_orc"
{
    "results": [
        {
            "cd2": "4.0"
        }
    ]
}

-- "BYPASS SELECT count(distinct mycolumn) FROM table_a_orc UNION ALL SELECT count(distinct mycolumn) FROM table_b_orc"
{
    "results": [
        {
            "_col0": "4"
        },
        {
            "_col0": "6"
        }
    ]
}
@commercial-hippie commercial-hippie changed the title UNION ALL not supporte UNION ALL not supported Nov 29, 2018
@pyongjoo
Copy link
Member

@Beastjoe Could you take a look at this issue (add relevant test cases too)? I think this should be a simple logical issue.

@Beastjoe
Copy link
Contributor

@pyongjoo Previously, I think we haven't considered the case of set operation inside user query. You can check visitQuery_expression() method in RelationGen class. It may be a bit complex to fix this. One solution is to treat the SetOperationRelation as a SelectQuery, but I am not sure what to do when scramble tables are involved. Do you have any suggestions? Thanks.

@pyongjoo
Copy link
Member

SetOperationRelation must not be SelectQuery simply because it is not. SetOperationRelation should just be a subclass of AbstractRelation.

Doesn't the antlr grammar already cover the case that a relation can include UNION? Otherwise, we can simply extend it.

@Beastjoe
Copy link
Contributor

Yes, the grammer has already covered the case. QueryNode only handles SelectQuery, I think some similar classes need to be created to handle the set operation inside user query.

Besides, I am not sure what should we do when both relations in set operation involves scramble tables, such as

SELECT count(distinct l_suppkey) FROM lineitem UNION ALL SELECT count(distinct s_suppkey) FROM supplier

or

SELECT count(distinct l_suppkey) FROM lineitem UNION ALL SELECT count(distinct l_suppkey) FROM lineitem 

@pyongjoo
Copy link
Member

Hmm. I thought the query is like

select count(distinct col) from
( select * from tableA UNION ALL select * from tableB) t

which was the reason I said the union must not be treated as a select query. Let me double-check the OP’s another query example and let you know.

In this case, however, I agree that the union query should also be a select query.

@pyongjoo
Copy link
Member

@commercial-hippie I just noticed that the example query you posted here is different from the one I saw before. The one I saw before was:

select count(distinct some_id)
from (
  (select some_id
   from table1)
  UNION ALL
  (select some_id
   from table2)
)

Can you confirm which one we should focus on? The query you posted here is nothing but two select queries.

@commercial-hippie
Copy link
Contributor Author

@pyongjoo I think the query you posted is probably the way to go..

@pyongjoo
Copy link
Member

pyongjoo commented Dec 5, 2018

@Beastjoe Can you quickly test if the query as follows runs without errors?

select count(distinct some_id)
from (
  (select some_id
   from table1)
  UNION ALL
  (select some_id
   from table2)
)

@Beastjoe
Copy link
Contributor

Beastjoe commented Dec 5, 2018

No. VerdictDB currently neglects the right side of the union all operation and only takes the leftmost query. I will start to fix this.

@pyongjoo
Copy link
Member

pyongjoo commented Dec 6, 2018

Thanks!

@pyongjoo
Copy link
Member

pyongjoo commented Dec 7, 2018

@Beastjoe Can you comment on the current status? No rush, but just wondering. Thanks.

@Beastjoe
Copy link
Contributor

Beastjoe commented Dec 7, 2018

I have finished standardizing query containing the set operations. The next step is to add support of creating AsyncAggExecutionPlan for set operations.

For instance, if a query is

select count(distinct some_id)
from (
  (select some_id
   from table1)
  UNION ALL
  (select some_id
   from table2)
)

my implementation will change the query to

select count(distinct some_id) 
from (
  select some_id 
  from (
    (select some_id
     from table1)
    UNION ALL
    (select some_id
     from table2)
))

If the set operation contain some scramble tables, we can then add the constraint in the inner query.

@pyongjoo
Copy link
Member

pyongjoo commented Dec 7, 2018

Instead of adding an extra depth for the inner query, could we simply make UNION as an instance of SelectQuery?

@Beastjoe
Copy link
Contributor

Beastjoe commented Dec 7, 2018

I think it is feasible.
For instance,

select count(distinct some_id)
from (
  (select some_id
   from table1)
  UNION ALL
  (select some_id
   from table2)
)

If table1 and table2 are scramble tables, and we wish to add the filter later, such as verdictdbblock, is it correct to add them directly in the set operation queries?

@pyongjoo
Copy link
Member

pyongjoo commented Dec 7, 2018

I think SetOperationQuery can be an instance of SelectQuery.

I was wrong previously.

@pyongjoo
Copy link
Member

@Beastjoe I believe these are the todos we discussed. Please pursue these items as separate pull requests.

  1. No automatic table name replacements; simply require users to explicitly specify scrambles in place of the original tables.
  2. Ensure that all unioned tables have the same sampling probabilities.
  3. Implement async aggregation plan for unioned tables.

@pyongjoo pyongjoo pinned this issue Dec 27, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants