Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

as.data.frame() does not respect column order for DB backends #3216

Closed
JohnMount opened this issue Nov 21, 2017 · 1 comment · Fixed by tidyverse/dbplyr#62
Closed

as.data.frame() does not respect column order for DB backends #3216

JohnMount opened this issue Nov 21, 2017 · 1 comment · Fixed by tidyverse/dbplyr#62
Labels
bug an unexpected problem or unintended behavior

Comments

@JohnMount
Copy link

JohnMount commented Nov 21, 2017

as.data.frame() does not respect column order for DB backends (or at least is not stable, or does not use the same column order as colnames()). A lot of code depends on column order, so I would consider this a bug (regardless of the documented or not-documented invariants of as.data.frame()). The issue occurs with more than one DB data provider (RSQlite and Sparklyr) and seems to change if a compute() is issued.

With the issue any code of the following form may be wrong.

cn <- colnames(d)
dL <- as.data.frame(d)

As it will not be equivalent to:

dL <- as.data.frame(d)
cn <- colnames(dL)

knitr currently has such code, and is producing erroneous reports.

Right or wrong I think users expect stable column order.

Included reprex is long as the issue is subtle, yet important (and we need to establish issue on more than one backend).

reprex::reprex_info()
#> Created by the reprex package v0.1.1.9000 on 2017-11-21

packageVersion("knitr")
#> [1] '1.17'
packageVersion("dplyr")
#> [1] '0.7.4'
packageVersion("dbplyr")
#> [1] '1.1.0'
packageVersion("sparklyr")
#> [1] '0.6.4'
packageVersion("rlang")
#> [1] '0.1.4'
packageVersion("tidyselect")
#> [1] '0.2.3'
suppressPackageStartupMessages(library("dplyr"))


# Sparklyr example

rm(list=ls())

sc <- sparklyr::spark_connect(version = '2.2.0', 
                              master = "local")
d <- starwars %>% 
  select(., name, height, mass,
         hair_color ) %>% 
  copy_to(sc, ., name = 'starwars')

res <- d %>% 
  mutate(., 
         condition1 := height>=150,
         mass := ifelse(condition1, 
                        mass + 10, mass),
         hair_color := ifelse(condition1, 
                              'brown', hair_color) ) %>%
  head()

class(res)
#> [1] "tbl_spark" "tbl_sql"   "tbl_lazy"  "tbl"

# look at order of columns
colnames(res)
#> [1] "name"       "height"     "mass"       "hair_color" "condition1"

# look at order of columns
res
#> # Source:   lazy query [?? x 5]
#> # Database: spark_connection
#>             name height condition1  mass hair_color
#>            <chr>  <int>      <lgl> <dbl>      <chr>
#> 1 Luke Skywalker    172       TRUE    87      brown
#> 2          C-3PO    167       TRUE    85      brown
#> 3          R2-D2     96      FALSE    32       <NA>
#> 4    Darth Vader    202       TRUE   146      brown
#> 5    Leia Organa    150       TRUE    59      brown
#> 6      Owen Lars    178       TRUE   130      brown

# look at order of columns
as.data.frame(res)
#>             name height condition1 mass hair_color
#> 1 Luke Skywalker    172       TRUE   87      brown
#> 2          C-3PO    167       TRUE   85      brown
#> 3          R2-D2     96      FALSE   32       <NA>
#> 4    Darth Vader    202       TRUE  146      brown
#> 5    Leia Organa    150       TRUE   59      brown
#> 6      Owen Lars    178       TRUE  130      brown

# look at order of columns
as.data.frame(compute(res))
#>             name height mass hair_color condition1
#> 1 Luke Skywalker    172   87      brown       TRUE
#> 2          C-3PO    167   85      brown       TRUE
#> 3          R2-D2     96   32       <NA>      FALSE
#> 4    Darth Vader    202  146      brown       TRUE
#> 5    Leia Organa    150   59      brown       TRUE
#> 6      Owen Lars    178  130      brown       TRUE


sparklyr::spark_disconnect(sc)


# RSQlite example

rm(list=ls())

my_db <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

d <- starwars %>% 
  select(., name, height, mass,
         hair_color ) %>% 
  copy_to(my_db, ., name = 'starwars')

res <- d %>% 
  mutate(., 
         condition1 := height>=150,
         mass := ifelse(condition1, 
                        mass + 10, mass),
         hair_color := ifelse(condition1, 
                              'brown', hair_color) ) %>%
  head()

class(res)
#> [1] "tbl_dbi"  "tbl_sql"  "tbl_lazy" "tbl"

# look at order of columns
colnames(res)
#> [1] "name"       "height"     "mass"       "hair_color" "condition1"

# look at order of columns
res
#> # Source:   lazy query [?? x 5]
#> # Database: sqlite 3.19.3 [:memory:]
#>             name height condition1  mass hair_color
#>            <chr>  <int>      <int> <dbl>      <chr>
#> 1 Luke Skywalker    172          1    87      brown
#> 2          C-3PO    167          1    85      brown
#> 3          R2-D2     96          0    32       <NA>
#> 4    Darth Vader    202          1   146      brown
#> 5    Leia Organa    150          1    59      brown
#> 6      Owen Lars    178          1   130      brown

# look at order of columns
as.data.frame(res)
#>             name height condition1 mass hair_color
#> 1 Luke Skywalker    172          1   87      brown
#> 2          C-3PO    167          1   85      brown
#> 3          R2-D2     96          0   32       <NA>
#> 4    Darth Vader    202          1  146      brown
#> 5    Leia Organa    150          1   59      brown
#> 6      Owen Lars    178          1  130      brown

# look at order of columns
as.data.frame(compute(res))
#>             name height mass hair_color condition1
#> 1 Luke Skywalker    172   87      brown          1
#> 2          C-3PO    167   85      brown          1
#> 3          R2-D2     96   32       <NA>          0
#> 4    Darth Vader    202  146      brown          1
#> 5    Leia Organa    150   59      brown          1
#> 6      Owen Lars    178  130      brown          1

DBI::dbDisconnect(my_db)
@JohnMount JohnMount changed the title Column names mis-ordered when passing dbplyr data to knitr::kable() as.data.frame() does not respect column order Nov 21, 2017
@JohnMount JohnMount changed the title as.data.frame() does not respect column order as.data.frame() does not respect column order for DB backends Nov 21, 2017
@hadley
Copy link
Member

hadley commented Nov 21, 2017

Minimal reprex

library(dplyr, warn.conflicts = FALSE)

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

df <- data.frame(x = 1, y = 2)
db <- copy_to(con, df)
db2 <- mutate(db, x1 = x + 1, y = x1 + y)

colnames(db2)
#> [1] "x"  "y"  "x1"
colnames(collect(db2))
#> [1] "x"  "x1" "y"

Due to mutate() potential having to create multiple subqueries. Likely to be bug in op_vars.op_mutate

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug an unexpected problem or unintended behavior
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants