Skip to content

mutate() is incorrectly inlined after distinct() #1119

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
vadim-cherepanov opened this issue Feb 1, 2023 · 5 comments · Fixed by #1120
Closed

mutate() is incorrectly inlined after distinct() #1119

vadim-cherepanov opened this issue Feb 1, 2023 · 5 comments · Fixed by #1120
Milestone

Comments

@vadim-cherepanov
Copy link

I just updated to dplyr v1.1.0 + dbplyr 2.3.0, and my code broke. Upon investigation it seems that now a wrong SQL query is generated. Namely, in my case, I have distinct prior to the group_by block.

> remote_table %>% distinct(x, y) %>% count(y) %>% filter(y == "YYY") 
# Source:     SQL [1 x 2]
# Database:   postgres  [.../postgres]
# Ordered by: x, y
  y        n
  <chr>  <int64>
1 YYY       1
> remote_table %>% distinct(x, y) %>% group_by(y) %>% tally() %>% ungroup() %>% filter(y == "YYY") 
# Source:     SQL [1 x 2]
# Database:   postgres  [.../postgres]
# Ordered by: x, y
  y        n
  <chr>  <int64>
1 YYY       1
> remote_table %>% distinct(x, y) %>% group_by(y) %>% mutate(n = n()) %>% ungroup() %>% filter(y == "YYY") 
# Source:     SQL [1 x 3]
# Database:   postgres  [.../postgres]
# Ordered by: x, y
  x y        n
      <int64> <chr>  <int64>
1        XXX YYY       2

If I add compute between them, i.e. the result of distinct is actually computed and stored in a temporary table, n() returns values as expected.

> remote_table %>% distinct(x, y) %>% compute() %>% group_by(y) %>% mutate(n = n()) %>% ungroup() %>% filter(y == "YYY") 
# Source:     SQL [1 x 3]
# Database:   postgres  [.../postgres]
# Ordered by: x, y
  x y        n
      <int64> <chr>  <int64>
1        XXX YYY       1

This is the SQL query generated by dplyr v1.1.0 + dbplyr 2.3.0:

> remote_table %>% distinct(x, y) %>% group_by(y) %>% mutate(n = n()) %>% ungroup() %>% filter(y == "YYY") %>% show_query() 
<SQL>
SELECT DISTINCT
  "x",
  "y",
  COUNT(*) OVER (PARTITION BY "y") AS "n"
FROM "dbplyr_026"
WHERE ("y" = 'YYY')

Note how it extended distinct onto the subsequent query.

And this one is generated (as expected) by dplyr v1.0.10 + dbplyr 2.2.1:

> remote_table %>% distinct(x, y) %>% group_by(y) %>% mutate(n = n()) %>% ungroup() %>% filter(y == "YYY") %>% show_query() 
<SQL>
SELECT *
FROM (
  SELECT *, COUNT(*) OVER (PARTITION BY "y") AS "n"
  FROM (
    SELECT DISTINCT "x", "y"
    FROM "dbplyr_026"
  ) "q01"
) "q02"
WHERE ("y" = 'YYY')
@hadley hadley transferred this issue from tidyverse/dplyr Feb 1, 2023
@hadley
Copy link
Member

hadley commented Feb 1, 2023

Moved to dbplyr

@mgirlich mgirlich added this to the 2.3.1 milestone Feb 2, 2023
@mgirlich mgirlich changed the title [bug] Wrong SQL generated by distinct + group_by in dplyr v1.1.0 + dbplyr 2.3.0. mutate() is incorrectly inlined after distinct() Feb 2, 2023
@mgirlich
Copy link
Collaborator

mgirlich commented Feb 2, 2023

More minimal reprex

library(dplyr, warn.conflicts = FALSE)
library(dbplyr, warn.conflicts = FALSE)

memdb_frame(x = 1:2) %>% 
  distinct(x) %>% 
  mutate(x = 0) %>% 
  collect()
#> # A tibble: 1 × 1
#>       x
#>   <dbl>
#> 1     0

Created on 2023-02-02 with reprex v2.0.2

@mgirlich
Copy link
Collaborator

mgirlich commented Feb 3, 2023

@vadim-cherepanov Thanks for filing the issue. Should be fixed in the dev version 😄

@vadim-cherepanov
Copy link
Author

Thank you for taking care of this. I am just wondering what changes behind the scene caused this inlining of mutate, and are we sure the same changes did not break something else in other cases.

@mgirlich
Copy link
Collaborator

mgirlich commented Feb 9, 2023

The queries generated by dbplyr used to consist of a lot of subqueries. For version 2.3.0 we tried to reduce the number of subqueries. This improves the readability, often quite a lot, but can also result in faster queries (e.g. in the case of multiple joins).
Unfortunately, this caused a couple of issues but with the help of the community we should be relatively quick to identify and fix them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants