Skip to content

Malfunctioning SQL generation with dbplyr 2.3.0 #1101

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
lschneiderbauer opened this issue Jan 18, 2023 · 5 comments · Fixed by #1102
Closed

Malfunctioning SQL generation with dbplyr 2.3.0 #1101

lschneiderbauer opened this issue Jan 18, 2023 · 5 comments · Fixed by #1102

Comments

@lschneiderbauer
Copy link

lschneiderbauer commented Jan 18, 2023

Good day,

with dbplyr 2.3.0 I managed to construct a statement which does not correctly translate to SQL.

Consider the following example:

library(dplyr, warn.conflicts = FALSE)
library(dbplyr, warn.conflicts = FALSE)

table1 <- tbl_lazy(tibble(key = 1), dbplyr::simulate_hana()) %>%
  rename_with(~ paste0("table1.", .))
table2 <- tbl_lazy(tibble(key = 2, parent = 3), dbplyr::simulate_hana()) %>%
  rename_with(~ paste0("table2.", .))
table3 <- tbl_lazy(tibble(key = 3, parent = 4), dbplyr::simulate_hana()) %>%
  rename_with(~ paste0("table3.", .))
table4 <- tbl_lazy(tibble(key = 4, parent = 5), dbplyr::simulate_hana()) %>%
  rename_with(~ paste0("table4.", .))

table1 %>%
  inner_join(table2, by = c("table1.key" = "table2.parent")) %>%
  inner_join(table3, by = c("table2.key" = "table3.parent")) %>%
  inner_join(table4, by = c("table3.key" = "table4.parent"))
#> <SQL>
#> SELECT
#>   `df...1`.`key` AS `table1.key`,
#>   `df...2`.`key` AS `table2.key`,
#>   `df...3`.`key` AS `table3.key`,
#>   `df...3`.`key` AS `table4.key`
#> FROM `df` AS `df...1`
#> INNER JOIN `df` AS `df...2`
#>   ON (`df...1`.`key` = `df...2`.`parent`)
#> INNER JOIN `df` AS `df...3`
#>   ON (`df...2`.`key` = `df...3`.`parent`)
#> INNER JOIN `df` AS `df...4`
#>   ON (`df...3`.`key` = `df...4`.`parent`)

Created on 2023-01-18 with reprex v2.0.2

Notice the fourth line after SELECT: df...3.key AS table4.key. I would expect it to read df...4.key AS table4.key (4 instead of 3), i.e.

#> SELECT
#>   `df...1`.`key` AS `table1.key`,
#>   `df...2`.`key` AS `table2.key`,
#>   `df...3`.`key` AS `table3.key`,
#>   `df...4`.`key` AS `table4.key`
#> FROM `df` AS `df...1`

Somehow the same column is selected twice with different names.
But I really want "table4.key" from "table4" to be selected.

Unfortunately this behavior seems to break some of my queries which were working before.

@lschneiderbauer
Copy link
Author

Very similarly,
this works:

library(dplyr, warn.conflicts = FALSE)
library(dbplyr, warn.conflicts = FALSE)

table1 <- tbl_lazy(tibble(key1 = 1), dbplyr::simulate_hana())
table2 <- tbl_lazy(tibble(key2 = 2, parent2 = 3), dbplyr::simulate_hana())
table3 <- tbl_lazy(tibble(key3 = 3, parent3 = 4), dbplyr::simulate_hana())
table4 <- tbl_lazy(tibble(key4 = 4, parent4 = 5), dbplyr::simulate_hana())

table1 %>%
  inner_join(table2, by = c("key1" = "parent2")) %>%
  inner_join(table3, by = c("key2" = "parent3")) %>%
  inner_join(table4, by = c("key3" = "parent4"))
#> <SQL>
#> SELECT `key1`, `key2`, `key3`, `key4`
#> FROM `df` AS `df...1`
#> INNER JOIN `df` AS `df...2`
#>   ON (`df...1`.`key1` = `df...2`.`parent2`)
#> INNER JOIN `df` AS `df...3`
#>   ON (`df...2`.`key2` = `df...3`.`parent3`)
#> INNER JOIN `df` AS `df...4`
#>   ON (`df...3`.`key3` = `df...4`.`parent4`)

Created on 2023-01-18 with reprex v2.0.2

While this does not work: (changing column name "key4" to "key3"

library(dplyr, warn.conflicts = FALSE)
library(dbplyr, warn.conflicts = FALSE)

table1 <- tbl_lazy(tibble(key1 = 1), dbplyr::simulate_hana())
table2 <- tbl_lazy(tibble(key2 = 2, parent2 = 3), dbplyr::simulate_hana())
table3 <- tbl_lazy(tibble(key3 = 3, parent3 = 4), dbplyr::simulate_hana())
table4 <- tbl_lazy(tibble(key3 = 4, parent4 = 5), dbplyr::simulate_hana())

table1 %>%
  inner_join(table2, by = c("key1" = "parent2")) %>%
  inner_join(table3, by = c("key2" = "parent3")) %>%
  inner_join(table4, by = c("key3" = "parent4"))
#> <SQL>
#> SELECT `key1`, `key2`, `df...3`.`key3` AS `key3.x`, `df...3`.`key3` AS `key3.y`
#> FROM `df` AS `df...1`
#> INNER JOIN `df` AS `df...2`
#>   ON (`df...1`.`key1` = `df...2`.`parent2`)
#> INNER JOIN `df` AS `df...3`
#>   ON (`df...2`.`key2` = `df...3`.`parent3`)
#> INNER JOIN `df` AS `df...4`
#>   ON (`df...3`.`key3` = `df...4`.`parent4`)

Created on 2023-01-18 with reprex v2.0.2

@mgirlich
Copy link
Collaborator

I think I found the issue and will fix this soon.

@mgirlich
Copy link
Collaborator

Wow, this is not that easy to trigger 😄

You need:

  1. at least three joins
  2. a variable that
    • appears in at least two tables
    • is renamed before the join
    • and the variable is not a join key

Could you install the dev version install_github("tidyverse/dbplyr#1102") and confirm this fixes the issue.

@lschneiderbauer
Copy link
Author

You are fast! :)
Thank you, I tested it and this is working now as intended. (although another ticket might come soon, so you will not get bored. ;) )

@mgirlich
Copy link
Collaborator

After such big changes to the SQL generation I expected some blocker bugs to show up... So I also try to fix them fast 😄
A good reprex like in this issue also helps a lot, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants