Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Snowflake translation error: dropped filter with anti_join #1474

Closed
fh-afrachioni opened this issue Mar 7, 2024 · 3 comments · Fixed by #1475
Closed

Snowflake translation error: dropped filter with anti_join #1474

fh-afrachioni opened this issue Mar 7, 2024 · 3 comments · Fixed by #1475

Comments

@fh-afrachioni
Copy link
Contributor

fh-afrachioni commented Mar 7, 2024

Greetings, dbplyr friends. I'd like to report an issue we're experiencing with Snowflake translations, related to filter() criteria dropped from lazy tables when semi_join()ed. This seems specifically limited to filter() applied to columns which result from summarize(), that are not selected for inclusion in the result.

Here's a small example, reproducing the behavior:

sim_1 <- dbplyr::lazy_frame(a = 5, b = 5, con = dbplyr::simulate_snowflake())
sim_2 <- dbplyr::lazy_frame(a = 5, b = 5, con = dbplyr::simulate_snowflake())

sim_2_transformed <- sim_2 %>% 
  group_by(a) %>% 
  summarize(group_count = n()) %>% 
  filter(group_count > 3) %>% 
  select(a)

join_result <- semi_join(sim_1, sim_2_transformed, by = 'a')

If we inspect the query produced for sim_2_transformed, it looks correct:

> sim_2_transformed
<SQL>
SELECT `a`
FROM `df`
GROUP BY `a`
HAVING (COUNT(*) > 3.0)

However, the join drops the HAVING criterion leading to an incorrect result; join_result is

> join_result
<SQL>
SELECT `df_LHS`.*
FROM `df` AS `df_LHS`
WHERE EXISTS (
  SELECT 1 FROM `df` AS `df_RHS`
  WHERE (`df_LHS`.`a` = `df_RHS`.`a`)
)

Notably, if I remove select(a) from the definition of sim_2_transformed, the HAVING clause is included as expected:

> join_result_without_final_select
<SQL>
SELECT `df`.*
FROM `df`
WHERE EXISTS (
  SELECT 1 FROM (
  SELECT `a`, COUNT(*) AS `group_count`
  FROM `df`
  GROUP BY `a`
  HAVING (COUNT(*) > 3.0)
) AS `RHS`
  WHERE (`df`.`a` = `RHS`.`a`)
)

Thanks for the continuing support of Snowflake backends; I'm happy to help with any testing that might be valuable.

Release version of dbplyr, 2.4.0. cc @fh-mthomson

@ejneer
Copy link
Contributor

ejneer commented Mar 7, 2024

Does devtools::install_github("tidyverse/dbplyr#1475") solve your issue? It seems to work for your reprex / my local tests.

@fh-afrachioni
Copy link
Contributor Author

That fixes it!

> devtools::install_github("tidyverse/dbplyr#1475")
> # ... as above ...
> join_result
<SQL>
SELECT `df`.*
FROM `df`
WHERE EXISTS (
  SELECT 1 FROM (
  SELECT `a`
  FROM `df`
  GROUP BY `a`
  HAVING (COUNT(*) > 3.0)
) AS `RHS`
  WHERE (`df`.`a` = `RHS`.`a`)
)

I really appreciate the quick update here, and am hopeful this will make it into the upcoming release!

@fh-mthomson
Copy link
Contributor

Adding that (1) the behavior isn't backend-specific and (2) the impact is that semi_join() and inner_join() can have alarmingly-different results:

library(dplyr, warn.conflicts = FALSE)

table <- tibble::tibble(a = c("x", "y"), b = c(1, 2))

con <- DBI::dbConnect(RSQLite::SQLite())
copy_to(con, table, "table")
tbl_lazy <- tbl(con, "table")

# only `y` meets condition:
filtered <- tbl_lazy %>% 
  group_by(a) %>% 
  summarize(include = all(b > 1)) %>% 
  filter(include) %>% 
  select(a)

# correct: only `y` returned
tbl_lazy %>% inner_join(filtered, by = "a")
#> # Source:   SQL [1 x 2]
#> # Database: sqlite 3.45.0 []
#>   a         b
#>   <chr> <dbl>
#> 1 y         2

# incorrect: both returned
tbl_lazy %>% semi_join(filtered, by = "a")
#> # Source:   SQL [2 x 2]
#> # Database: sqlite 3.45.0 []
#>   a         b
#>   <chr> <dbl>
#> 1 x         1
#> 2 y         2

Created on 2024-03-15 with reprex v2.1.0

Confirmed that the PR resolves the issue:

tbl_lazy %>% inner_join(filtered, by = "a")
#> # Source:   SQL [1 x 2]
#> # Database: sqlite 3.45.0 []
#>   a         b
#>   <chr> <dbl>
#> 1 y         2

# incorrect: both returned
tbl_lazy %>% semi_join(filtered, by = "a")
#> # Source:   SQL [1 x 2]
#> # Database: sqlite 3.45.0 []
#>   a         b
#>   <chr> <dbl>
#> 1 y         2

@hadley @mgirlich would it be possible to include #1475 in the next release?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants