Failure to handle column renaming before grouping #928

jasonmhoule · 2022-06-30T14:18:11Z

Hi, this is my first issue so hopefully I have provided enough of a reprex to work with - please let me know if I can provide additional information.

Brief description of the problem

Rendering a query for Oracle from dplyr, I rename columns in a SELECT statement before doing a GROUP BY. The SELECT is tricky because I rename my grouping column to take the name of one of the non-grouping columns (which is also renamed). This appears to confuse dbplyr: it groups on what should be the non-grouping column and drops my intended grouping column.

Expected output

I would expect the SQL for dbplyr to handle the renaming before the grouping, and thus to group by (and retain when summarizing) the just-renamed column. This is the behavior observed in the non-remote dplyr example on mtcars below.

Please note that as far as I can tell, the query ran as expected with version 2.1.1 of dbplyr.

library(tidyverse)
library(odbc)
library(DBI)
library(dbplyr)
#> 
#> Attaching package: 'dbplyr'
#> The following objects are masked from 'package:dplyr':
#> 
#>     ident, sql
library(dplyr)

packageVersion("tidyverse")
#> [1] '1.3.1'
packageVersion("odbc")
#> [1] '1.3.2'
packageVersion("DBI")
#> [1] '1.1.1'
packageVersion("dbplyr")
#> [1] '2.2.1'
packageVersion("dplyr")
#> [1] '1.0.9'

# Query run on Oracle DB

con <- DBI::dbConnect(odbc::odbc(),
                      Driver = "Oracle in OraClient64_home1",
                      DBQ   = *redacted*,
                      UID    = *redacted*,
                      PWD    = *redacted*)

licsfull <- tbl(con, in_schema("GLBL_MAIN", "G_MATL_LIC")) %>%
  select(MATL_ID = REF_MATL_ID, JOIN_ID = MATL_ID, VALID_TO_DT) %>%
  group_by(MATL_ID) %>%
  summarise(JOIN_ID = max(JOIN_ID, na.rm = TRUE),
            VALID_TO_DT = max(VALID_TO_DT, na.rm = TRUE))

# The generated SQL drops the desired grouping column and joins on a different column

licsfull %>% show_query()
#> <SQL>
#> SELECT MAX("JOIN_ID") AS "JOIN_ID", MAX("VALID_TO_DT") AS "VALID_TO_DT"
#> FROM (
#>   SELECT "REF_MATL_ID" AS "MATL_ID", "MATL_ID" AS "JOIN_ID", "VALID_TO_DT"
#>   FROM ("GLBL_MAIN"."G_MATL_LIC") 
#> ) "q01"
#> GROUP BY "JOIN_ID"

# Note that this is different from the behavior in vanilla dplyr on a local tibble:

mtcars %>% 
  select(vs = cyl, new_vs = vs, gear) %>%
  group_by(vs) %>% 
  summarise(new_vs = max(new_vs),
            gear = max(gear))
#> # A tibble: 3 x 3
#>      vs new_vs  gear
#>   <dbl>  <dbl> <dbl>
#> 1     4      1     5
#> 2     6      1     5
#> 3     8      0     5

^{Created on 2022-06-30 by the reprex package (v2.0.1)}

The text was updated successfully, but these errors were encountered:

mgirlich · 2022-07-19T12:29:19Z

Thanks for catching this!

mgirlich mentioned this issue Jul 19, 2022

Fix grouping by renamed columns #943

Merged

mgirlich closed this as completed in #943 Jul 19, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Failure to handle column renaming before grouping #928

Failure to handle column renaming before grouping #928

jasonmhoule commented Jun 30, 2022 •

edited

Loading

mgirlich commented Jul 19, 2022

Failure to handle column renaming before grouping #928

Failure to handle column renaming before grouping #928

Comments

jasonmhoule commented Jun 30, 2022 • edited Loading

mgirlich commented Jul 19, 2022

jasonmhoule commented Jun 30, 2022 •

edited

Loading