Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dbplyr on HIVE produces invalid SQL when using select_*() or select helpers on an in_schema() object #3200

Closed
sz-cgt opened this issue Nov 9, 2017 · 13 comments
Assignees

Comments

@sz-cgt
Copy link

sz-cgt commented Nov 9, 2017

Simple example

db <- dbConnect(odbc::odbc(), "PRD")
db %>% 
  tbl(dbplyr::in_schema("1001_2", "dxcg_raw")) %>% 
  select_all() %>% 
  show_query()

produces

<SQL>
SELECT `zzz10.mcidx` AS `zzz10.mcidx`, `zzz10.dcg_dmgrphc_risk_nbr` AS `zzz10.dcg_dmgrphc_risk_nbr`, `zzz10.dcg_unwgtd_retrospctv_risk_nbr` AS `zzz10.dcg_unwgtd_retrospctv_risk_nbr`, `zzz10.dcg_wgtd_retrospctv_risk_nbr` AS `zzz10.dcg_wgtd_retrospctv_risk_nbr`, `zzz10.dcg_unwgtd_prsptv_risk_nbr` AS `zzz10.dcg_unwgtd_prsptv_risk_nbr`, `zzz10.dcg_wgtd_prsptv_risk_nbr` AS `zzz10.dcg_wgtd_prsptv_risk_nbr`, `zzz10.dcg_incrd_prd_bgn_dt` AS `zzz10.dcg_incrd_prd_bgn_dt`, `zzz10.dcg_incrd_prd_end_dt` AS `zzz10.dcg_incrd_prd_end_dt`, `zzz10.file_name` AS `zzz10.file_name`, `zzz10.load_id` AS `zzz10.load_id`
FROM 1001_2.dxcg_raw

Note the missing table alias zzz10

Remove the schema reference and it works

db <- dbConnect(odbc::odbc(), "PRD")
odbc::dbSendStatement(db2, "use 1001_2")
db %>% 
  tbl("dxcg_raw") %>% 
  select_all() %>% 
  show_query()

produces

<SQL>
SELECT `mcidx` AS `mcidx`, `dcg_dmgrphc_risk_nbr` AS `dcg_dmgrphc_risk_nbr`, `dcg_unwgtd_retrospctv_risk_nbr` AS `dcg_unwgtd_retrospctv_risk_nbr`, `dcg_wgtd_retrospctv_risk_nbr` AS `dcg_wgtd_retrospctv_risk_nbr`, `dcg_unwgtd_prsptv_risk_nbr` AS `dcg_unwgtd_prsptv_risk_nbr`, `dcg_wgtd_prsptv_risk_nbr` AS `dcg_wgtd_prsptv_risk_nbr`, `dcg_incrd_prd_bgn_dt` AS `dcg_incrd_prd_bgn_dt`, `dcg_incrd_prd_end_dt` AS `dcg_incrd_prd_end_dt`, `file_name` AS `file_name`, `load_id` AS `load_id`
FROM `dxcg_raw`

Swap to PostgreSQL and it works

con <- dbConnect(odbc::odbc(), "Validation")
con %>% 
  tbl(dbplyr::in_schema("c1022_1", "clmdx")) %>% 
  select_all() %>% 
  show_query()

produces

<SQL>
SELECT "claim_id" AS "claim_id", "sv_stat" AS "sv_stat", "serviceyearmonth" AS "serviceyearmonth", "claim_in_network" AS "claim_in_network", "dx" AS "dx"
FROM c1022_1.clmdx

Here's the connection information for the two systems (minus any personal details of course)

> odbc::dbGetInfo(db)
$dbname
[1] "HIVE"
$dbms.name
[1] "Hive"
$db.version
[1] "1.2.1.2.3.4.7-4"
$sourcename
[1] "PRD"
$servername
[1] "Hive"
$drivername
[1] "Hortonworks Hive ODBC Driver"
$odbc.version
[1] "03.80.0000"
$driver.version
[1] "2.1.10.1014"
$odbcdriver.version
[1] "03.80"
$supports.transactions
[1] FALSE
attr(,"class")
[1] "Hive"        "driver_info" "list"
> odbc::dbGetInfo(con)
$dbname
[1] "validation"
$dbms.name
[1] "PostgreSQL"
$db.version
[1] "9.6.3"
$sourcename
[1] "Validation"
$servername
[1] "lprdawswks0001.cedargatepartners.pvc"
$drivername
[1] "PSQLODBC35W.DLL"
$odbc.version
[1] "03.80.0000"
$driver.version
[1] "10.00.0000"
$odbcdriver.version
[1] "03.51"
$supports.transactions
[1] TRUE
attr(,"class")
[1] "PostgreSQL"  "driver_info" "list"       
@sz-cgt
Copy link
Author

sz-cgt commented Nov 9, 2017

A little more digging suggests that dbplyr is using select * from schema.table as alias to retrieve column names from tables

The problem with this approach is the hive parameter hive.resultset.use.unique.column.names, which forces result set names to be unique by pre-pending the table name or table alias to all field names when the * operator is used.

As this field is set true by default (see the TEZ documentation and HIVE-6687, dbplyr will need to work around this somehow if it wants to effectively support Hive (which seems to be part of RStudio's plans).

@edgararuiz-zz
Copy link

Hi, what happens if we remove the select_all() step, not sure what it adds to the example:

db <- dbConnect(odbc::odbc(), "PRD")
db %>% 
  tbl(dbplyr::in_schema("1001_2", "dxcg_raw"))

@sz-cgt
Copy link
Author

sz-cgt commented Nov 10, 2017

You mean this?

db <- dbConnect(odbc::odbc(), "PRD")
db %>% 
  tbl(dbplyr::in_schema("1001_2", "dxcg_raw")) %>% 
  select_all() %>% 
  show_query()

That gives select * from 1001_2.dxcg_raw because dplyr is making no attempt to discover the field names. Swapping to collect() gives a tible with alias.column field names. That's also not the use case.

To summarise:

  • no field level activities prior to collect() works. It just gives the returned tibble weird column names
  • select() works because the user is supplying the list of fields.
  • select_all(), select_if() and select_at() all have the same problem because they're interrogating the database to get the list of fields.

I suspect all *_all(), *_if(), *_at() functions would fail in a similar way, but I have not tried them.

@edgararuiz-zz
Copy link

Ok, can you help me with clarifying the first point in the summarize list?

The field names on a quick query I ran look fine to me:

> t <- tbl(con, "customers") %>%
+   head(10) %>%
+   collect()
> 
> colnames(t)
[1] "id"                "name"              "email_preferences" "addresses"         "orders"     

@sz-cgt
Copy link
Author

sz-cgt commented Nov 10, 2017

Two questions/observations:

  1. you didn't include a schema in your example. That's the same as my second scenario in my original post
  2. you don't mention what database back-end you're using. As I noted originally, this is a Hive-specific problem. It doesn't occur in PostgreSQL for example

@edgararuiz-zz
Copy link

You're correct, sorry, it is a Hive connection. I just added a new DB, and a new table to that DB, here's the code and its results:

> con
<OdbcConnection> cloudera@Hive
  Database: HIVE
  Hive Version: 1.1.0-cdh5.7.0
> 
> t <- tbl(con, in_schema("newdb", "states")) %>%
+   head(10) %>%
+   collect()
> 
> colnames(t)
[1] "abbrev" "match1" "match2" "file"  

@sz-cgt
Copy link
Author

sz-cgt commented Nov 10, 2017

what does dbGetQuery(db, "set") %>% filter(grepl(pattern = "unique.column", set)) give if you?

As I mentioned in my second post, the parameter hive.resultset.use.unique.column.names appears to be the issue. When it's true (the default according to the Hive documentation), you get the following:

hive.resultset.use.unique.column.names
Default Value: true
Added In: Hive 0.13.0 with HIVE-6687
Make column names unique in the result set by qualifying column names with table alias if needed. Table alias will be added to column names for queries of type "select *" or if query explicitly uses table alias "select r1.x..".

@hadley
Copy link
Member

hadley commented May 20, 2018

@edgararuiz can you please close the loop here?

@edgararuiz-zz
Copy link

Yes, I'll be glad to. My Hive service is down now, so I'll circle back as soon as I have somewhere to test @sz-cgt suggestion

@edgararuiz-zz edgararuiz-zz self-assigned this May 23, 2018
@edgararuiz-zz
Copy link

@sz-cgt - here is the result:

> dbGetQuery(con, "set") %>% 
+   filter(grepl(pattern = "unique.column", set))
                                          set
1 hive.resultset.use.unique.column.names=true

Do you think it is because we're using different versions of Hive?

This is what I'm using:

> dbGetInfo(con)$db.version
[1] "1.1.0-cdh5.7.0"

This is what I see yours is (copy-pasted from you previous post):

> odbc::dbGetInfo(db)
$dbname
[1] "HIVE"
$dbms.name
[1] "Hive"
$db.version
[1] "1.2.1.2.3.4.7-4"

@sz-cgt
Copy link
Author

sz-cgt commented Jun 3, 2018

OK. so you have the default. That means this query select * from some_table should give back results like the following:

some_table.foo some_table.bar ...
a b ...

If you put an alias on the end, like this select * from some_table asdf, you get

asdf.foo asdf.bar ...
a b ...

Is that what happens or do you get undecorated column names?

@ghost
Copy link

ghost commented Jun 25, 2018

This issue was moved by krlmlr to tidyverse/dbplyr/issues/110.

@ghost ghost closed this as completed Jun 25, 2018
@lock
Copy link

lock bot commented Dec 22, 2018

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

@lock lock bot locked and limited conversation to collaborators Dec 22, 2018
This issue was closed.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants