Skip to content

do on tbl_sql duplicates grouping column #673

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
kismsu opened this issue Oct 9, 2014 · 3 comments
Closed

do on tbl_sql duplicates grouping column #673

kismsu opened this issue Oct 9, 2014 · 3 comments
Assignees
Labels
bug an unexpected problem or unintended behavior
Milestone

Comments

@kismsu
Copy link

kismsu commented Oct 9, 2014

I've updated to dplyr_0.3.0.1 and noticed change in behaviour of do function on tbl_sql objects. Function creates the copy of grouping column.
For SQLite

library(nycflights13)
library(dplyr)
my_db <- src_sqlite("my_db.sqlite3", create = T)
flights_sqlite <- copy_to(my_db, flights, temporary = FALSE)

flights_sqlite %>% 
  group_by(tailnum) %>%
  do(
    data.frame(head(.))
    )
Source: local data frame [22,656 x 17]
Groups: tailnum

   tailnum year month day dep_time dep_delay arr_time arr_delay carrier tailnum.1 flight origin dest
1          2013     1   2       NA        NA       NA        NA      AA              133    JFK  LAX
2          2013     1   2       NA        NA       NA        NA      UA              623    EWR  ORD
3          2013     1   3       NA        NA       NA        NA      UA              714    EWR  MIA
4          2013     1   3       NA        NA       NA        NA      UA              719    EWR  DFW
5          2013     1   4       NA        NA       NA        NA      9E             3405    JFK  DCA
6          2013     1   4       NA        NA       NA        NA      9E             3716    EWR  DTW
7   D942DN 2013     2  11     1508        68     1807        91      DL    D942DN   2247    LGA  ATL
8   D942DN 2013     3  23     1340        40     1638        44      DL    D942DN   1685    LGA  MCO
9   D942DN 2013     3  24      859        24     1142         2      DL    D942DN   1959    JFK  MCO
10  D942DN 2013     7   5     1253        -6     1518       -11      DL    D942DN    781    LGA  ATL
..     ...  ...   ... ...      ...       ...      ...       ...     ...       ...    ...    ...  ...
Variables not shown: air_time (dbl), distance (dbl), hour (dbl), minute (dbl)

For PortgeSQL

pg_db <- src_postgres()

flights_pg <- copy_to(pg_db, flights, temporary = FALSE)

flights_pg %>% 
  group_by(tailnum) %>%
  do(
    data.frame(head(.))
  )
ource: local data frame [22,656 x 17]
Groups: tailnum

   tailnum year month day dep_time dep_delay arr_time arr_delay carrier tailnum.1 flight origin dest air_time distance hour minute
1          2013     9  12       NA        NA       NA        NA      UA              689    LGA  ORD       NA      733   NA     NA
2          2013     2   9       NA        NA       NA        NA      US             1036    EWR  CLT       NA      529   NA     NA
3          2013     8   9       NA        NA       NA        NA      9E             3385    JFK  BOS       NA      187   NA     NA
4          2013     7   9       NA        NA       NA        NA      US             2189    LGA  DCA       NA      214   NA     NA
5          2013     8   9       NA        NA       NA        NA      9E             3807    EWR  CVG       NA      569   NA     NA
6          2013     2   9       NA        NA       NA        NA      US               35    JFK  PHX       NA     2153   NA     NA
7   D942DN 2013     2  11     1508        68     1807        91      DL    D942DN   2247    LGA  ATL      131      762   15      8
8   D942DN 2013     7   5     1253        -6     1518       -11      DL    D942DN    781    LGA  ATL      112      762   12     53
9   D942DN 2013     3  24      859        24     1142         2      DL    D942DN   1959    JFK  MCO      143      944    8     59
10  D942DN 2013     3  23     1340        40     1638        44      DL    D942DN   1685    LGA  MCO      153      950   13     40
..     ...  ...   ... ...      ...       ...      ...       ...     ...       ...    ...    ...  ...      ...      ...  ...    ...

As you can see tailnum appears twice.

Is this a planed thing or just a bug?

@kismsu
Copy link
Author

kismsu commented Oct 13, 2014

I've made more investigation and find out that if you apply do to derived table than you'll get identical column names

g <- function(x) {
    browser()
}

flights_pg %>% 
    select(origin, flight, distance, dep_delay, hour, minute) %>%
    group_by(flight, distance, dep_delay) %>%
    do(
        data.frame(g(.))
    )

Inside g() call head(x) gives

flight distance dep_delay origin flight distance dep_delay hour minute
1      1      719         1    EWR      1      719         1   14      0

I believe this comes from the following part of do.tbl_sql

  # Create ungrouped data frame suitable for chunked retrieval
  chunky <- update(.data,
    select = c(group_by, .data$select),
    order_by = c(unname(group_by), .data$order_by),
    group_by = NULL
  )

Here you select all grouped columns and selected, if there are the same, databases gives them twices. Should it has unique()?

@kismsu
Copy link
Author

kismsu commented Oct 15, 2014

I've checked changes in update.tbl_sql function, you've replaced

if (is.ident(object$from)) {
    var_names <- table_fields(object$src$con, object$from)      
} else {        
    var_names <- qry_fields(object$src$con, object$from)        
}

by

var_names <- db_query_fields(object$src$con, object$from) 

where

db_query_fields.DBIConnection <- function(con, sql, ...) {
  fields <- build_sql("SELECT * FROM ", sql, " WHERE 0=1", con = con)

  qry <- dbSendQuery(con, fields)
  on.exit(dbClearResult(qry))

  dbGetInfo(qry)$fieldDescription[[1]]$name
}

Could this be a reason?

@hadley hadley added the bug an unexpected problem or unintended behavior label Oct 30, 2014
@hadley hadley added this to the 0.3.1 milestone Oct 30, 2014
@hadley hadley self-assigned this Oct 30, 2014
@hadley
Copy link
Member

hadley commented Dec 1, 2014

Here's a somewhat faster test case

airlines <- nycflights13_sqlite() %>% tbl("airlines")
airlines %>% group_by(carrier) %>% do(data.frame(.))
airlines %>% group_by(carrier) %>% select(carrier, name) %>% do(data.frame(.))

@hadley hadley closed this as completed in c4d2c20 Dec 1, 2014
@lock lock bot locked as resolved and limited conversation to collaborators Jun 10, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug an unexpected problem or unintended behavior
Projects
None yet
Development

No branches or pull requests

2 participants