Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Missing columns cause errors with union_all in postgres #183

Closed
karldw opened this issue Nov 14, 2018 · 7 comments
Milestone

Comments

@karldw
Copy link
Contributor

@karldw karldw commented Nov 14, 2018

When I try to use union_all to append two remote postgres tables with mismatched column names, I get an error. The error message is "Cannot pass NA to dbQuoteIdentifier()".

I expected the missing column to be filled in with NAs, as happens with SQLite. Or, if that's not possible, it might be helpful to have a clearer error message.

library(dplyr)
con <- DBI::dbConnect(RPostgres::Postgres())
DBI::dbWriteTable(con, "iris", iris, temporary=TRUE)
iris_db <- tbl(con, "iris")
union_all(select(iris_db, -Species), iris_db)
# Error: Cannot pass NA to dbQuoteIdentifier()
@hadley hadley added this to the v1.4.0 milestone Jan 10, 2019
@hadley

This comment has been minimized.

Copy link
Member

@hadley hadley commented Jan 10, 2019

Database free reprex:

library(dplyr, warn.conflicts = FALSE)
iris_db <- dbplyr::tbl_lazy(iris)

select(iris_db, -Species) %>% 
  union_all(iris_db) %>% 
  show_query()
#> Error: Cannot pass NA to dbQuoteIdentifier()

Created on 2019-01-10 by the reprex package (v0.2.1.9000)

@hadley

This comment has been minimized.

Copy link
Member

@hadley hadley commented Jan 10, 2019

If I hack around the proximate cause of the failure, I get this SQL:

(SELECT "Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "" AS "Species"
  FROM (SELECT "Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"
    FROM (SELECT * FROM "df") "dbplyr_ccjmuxwkcj") "dbplyr_wrovrqwfqa")

UNION ALL

(SELECT "Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species"
FROM (SELECT * FROM "df") "dbplyr_iabrmbvcsk")
@hadley

This comment has been minimized.

Copy link
Member

@hadley hadley commented Jan 10, 2019

I think the problem is that I'm missing a step somewhere — if you UNION to tables together and the columns don't match, we need to fill in extra columns with missing values/NULLs. It looks like I'm trying to do that in sql_build.op_set_op(), but I think that's too late, which is what's causing the noisy SQL.

@hadley

This comment has been minimized.

Copy link
Member

@hadley hadley commented Jan 10, 2019

I'm now generating:

<SQL> (SELECT "Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species"
FROM (SELECT "Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", NULL AS "Species"
FROM (SELECT "Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"
FROM (SELECT *
FROM "df") "dbplyr_zpdajlqbkx") "dbplyr_sovivecxvg") "dbplyr_astyzxyszk")
UNION ALL
(SELECT *
FROM "df")

which looks correct (and #193 + #213 will reduce the verbosity of the SQL)

@hadley

This comment has been minimized.

Copy link
Member

@hadley hadley commented Jan 10, 2019

library(dplyr, warn.conflicts = FALSE)
iris_db <- copy_to(dbplyr::src_memdb(), iris)
union_all(select(iris_db, -Species), iris_db) 
#> # Source:   lazy query [?? x 5]
#> # Database: sqlite 3.25.3 [:memory:]
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>           <dbl>       <dbl>        <dbl>       <dbl> <chr>  
#>  1          5.1         3.5          1.4         0.2 <NA>   
#>  2          4.9         3            1.4         0.2 <NA>   
#>  3          4.7         3.2          1.3         0.2 <NA>   
#>  4          4.6         3.1          1.5         0.2 <NA>   
#>  5          5           3.6          1.4         0.2 <NA>   
#>  6          5.4         3.9          1.7         0.4 <NA>   
#>  7          4.6         3.4          1.4         0.3 <NA>   
#>  8          5           3.4          1.5         0.2 <NA>   
#>  9          4.4         2.9          1.4         0.2 <NA>   
#> 10          4.9         3.1          1.5         0.1 <NA>   
#> # … with more rows

Created on 2019-01-10 by the reprex package (v0.2.1.9000)

@hadley

This comment has been minimized.

Copy link
Member

@hadley hadley commented Jan 10, 2019

Hmmm, it works on other database backends but only sql_quote() turns NA into NULL which does seem a bit risky when working with identifiers. I think my new approach is safer, but I can't figure out a simple unit test to illustrate it

@hadley hadley closed this in ed31b05 Jan 10, 2019
@hadley

This comment has been minimized.

Copy link
Member

@hadley hadley commented Feb 6, 2019

With experimental #213, I now see:

library(dbplyr)
library(dplyr, warn.conflicts = FALSE)
lf <- lazy_frame(g = 0, x = 1, y = 2)

lf %>% 
  union(lf %>% select(-y)) %>% 
  show_query()
#> <SQL>
#> (SELECT *
#> FROM `df`)
#> UNION
#> (SELECT `g`, `x`, NULL AS `y`
#> FROM `df`)

🎉 🎉 🎉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants
You can’t perform that action at this time.