Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Teradata Translation - Feature Request #3040

Closed
happyshows opened this issue Aug 22, 2017 · 21 comments
Closed

Teradata Translation - Feature Request #3040

happyshows opened this issue Aug 22, 2017 · 21 comments
Labels
feature a feature request or enhancement wip work in progress

Comments

@happyshows
Copy link


Adding Teradata SQL Translation to dbplyr package, will start to work with Edgar on this.

@hadley hadley added database feature a feature request or enhancement labels Aug 23, 2017
@edgararuiz-zz
Copy link

Hi @happyshows ,

Thank you for looking into this.

The translation should be a new R script in the dbplyr repo called db-odbc-teradata.R: https://github.com/tidyverse/dbplyr. I'd suggest to start with a copy of the MSSQL translation
(https://github.com/tidyverse/dbplyr/blob/master/R/db-odbc-mssql.R) and modify it to fit Teradata's SQL syntax. I suggest MSSQL because it looks like the syntax to select the top rows are similar for both variances.

For more detailed info about how translations work, please also review the SQL Translation article: http://db.rstudio.com/translation

I'm curious, at this point, when you attempt to use dplyr with a Teradata connection, where does it fail? It seems that the SQL statements are standard enough where some operations should work, is that correct?

@happyshows
Copy link
Author

I don't have a way to dig deeper as the tbl command failed in the first place.

I believe the S3 dispatch assigned the tbl function to default with uses LIMIT syntax.

> tbl(conn,'FISCAL_DAY')
Error in new_result(connection@ptr, statement) : 
  nanodbc/nanodbc.cpp:1344: 42000: [Teradata][ODBC Teradata Driver][Teradata Database] Syntax error: expected something between the word 'FISCAL_DAY' and the 'LIMIT' keyword.  

@happyshows
Copy link
Author

happyshows commented Aug 23, 2017

I also tried to trick a way to pass the first step but no luck

> class(conn)
[1] "Teradata"
attr(,"package")
[1] ".GlobalEnv"
> class(conn)<-'Microsoft SQL Server'
> class(conn)
[1] "Microsoft SQL Server"
> tbl(conn,'FISCAL_DAY')
Error in UseMethod("tbl") : 
  no applicable method for 'tbl' applied to an object of class "Microsoft SQL Server"

@edgararuiz-zz
Copy link

Ok, yeah, the resulting SQL is something like SELECT * FROM FISCAL_DAY LIMIT 6.

I think pointing MSSQL's select function to Teradata may let us run some queries

sql_select.Teradata<- `sql_select.Microsoft SQL Server`
tbl(conn,'FISCAL_DAY')

@happyshows
Copy link
Author

After importing many internal functions in dbplyr, I can run the tbl command now. So what's the suggested next step? Should I folk the dbplyr and create the R file. Any test steps I need to follow?

@edgararuiz-zz
Copy link

Yes, the two main functions to customize are sql_translate_env and sql_select.

As far as testing goes, I usually just have an RMarkdown with multiple dplyr code chunks to make sure it's working. For more automated testing you can use the https://github.com/rstudio/dbtest package. The package is still wip, and creates a table in the database is testing against, so you will need write access to it.

Thank you again for working on this!

@olwagees
Copy link

olwagees commented Sep 5, 2017

Hello,

@happyshows

I ran into this same error trying to work with teradata today and am wondering if you could help. I'm fairly new to this.

Thank you!

@happyshows
Copy link
Author

@olwagees
I'm trying to get sometime to work on this. Bascially follow edgar's 1st post. Download the package, create a teradata file and rename the function based on sqlserver, it should get you connect to td fine.

@edgararuiz-zz
Copy link

Hi @happyshows , if your version is working fine, would you mind sending a PR our way so we can make it part of the package? I can help with the testing if that's what's holding you up.

@happyshows
Copy link
Author

@edgararuiz what I did was changing the function name so that dispatch will fine correctly, but I remember some basic translation has problem, will get back to you on this next week.

@edgararuiz-zz
Copy link

Sounds good, thanks

@dfalbel
Copy link

dfalbel commented Sep 26, 2017

The only problem I found was LIMIT keyword does not exist in Teradata, so we should use SAMPLE.
Following the same thing that was done for oracle I reimplemented sql_select for Teradata like this:

sql_select.Teradata<- function(con, select, from, where = NULL,
                             group_by = NULL, having = NULL,
                             order_by = NULL,
                             limit = NULL,
                             distinct = FALSE,
                             ...) {
  out <- vector("list", 7)
  names(out) <- c("select", "from", "where", "group_by", "having", "order_by",
                  "limit")
  
  out$select    <- dbplyr:::sql_clause_select(select, con, distinct)
  out$from      <- dbplyr:::sql_clause_from(from, con)
  out$where     <- dbplyr:::sql_clause_where(where, con)
  out$group_by  <- dbplyr:::sql_clause_group_by(group_by, con)
  out$having    <- dbplyr:::sql_clause_having(having, con)
  out$order_by  <- dbplyr:::sql_clause_order_by(order_by, con)
  
  # Using Sample instead of limit
  if (!is.null(limit) && !identical(limit, Inf)) {
    assertthat::assert_that(is.numeric(limit), length(limit) == 1L, limit > 0)
    out$limit <- build_sql(
      "SAMPLE ", sql(format(trunc(limit), scientific = FALSE)),
      con = con
    )
  }
  
  escape(unname(dbplyr:::compact(out)), collapse = "\n", parens = FALSE, con = con)
}

And everything works fine... @edgararuiz any chance this could be added to dbplyr?

@bogdanrau
Copy link

Also interested in Teradata. Following.

@hadley hadley added the wip work in progress label Oct 23, 2017
hadley pushed a commit to tidyverse/dbplyr that referenced this issue Oct 25, 2017
@weixing777
Copy link

Hi, thank you for adding Teradata translation. I installed the development version of dbplyr (1.1.0.9000), and it fixed the head translation I had issue with. However, I don't think "not equal to" has been properly translated.

For instance, if I run the following code:

f_app_dt_flat %>%
    filter(rec_sys_id!=3, app_sbmt_dt=="2017-12-01") %>%
    select(app_num , loan_num) %>%
    show_query()

Here is the query it generated:

<SQL>
SELECT "app_num", "loan_num"
FROM "f_app_dt_flat"
WHERE (("rec_sys_id" != 3.0) AND ("app_sbmt_dt" = '2017-12-01'))

I don't think Teradata accept "!=" as "not equal to".

@edgararuiz-zz
Copy link

Hi @weixing777 , I have a fix for this in this branch: devtools::install_github("edgararuiz/dbplyr", ref = "fix-ter") can you take a look to confirm it works on your side before a sent a PR over? Thanks for reporting it!

@weixing777
Copy link

@edgararuiz Thank you for the quick response! It works!

@edgararuiz-zz
Copy link

Awesome, I just opened the PR for it: tidyverse/dbplyr#61

@jakefrost
Copy link

Hi Edgar, thanks for all your work on the Teradata translation. One issue I've come across is that PARTITION BY window functions generated by dbplyr don't include an ORDER BY clause, but Teradata requires one for the query to run.

For example, if I run this code:

flights %>% 
  group_by(record_id, record_create_dt, acct_num) %>% 
  mutate(rn = row_number()) %>% 
  show_query()

this is SQL generated:

SELECT "record_id", "record_create_dt", "acct_num", "dep_dt", "origin"
	,row_number() OVER (PARTITION BY "record_id", "record_create_dt", "acct_num", "dep_dt") AS "rn"
FROM (SELECT "record_id", "record_create_dt", "acct_num", "dep_dt", "origin"
FROM cdw.flights) "iljxeikdep"
WHERE (("dep_dt" = '2017-01-21') AND ("origin" = 'DEN'))

However, it runs with the following error:

Error in new_result(connection@ptr, statement) : 
  nanodbc/nanodbc.cpp:1344: 42000: [Teradata][ODBC Teradata Driver][Teradata Database] Syntax error: expected something between the word 'dep_dt' and ')'.

I then tweaked the generated SQL to add an ORDER BY clause and ran it directly in Teradata:

SELECT "record_id", "record_create_dt", "acct_num", "dep_dt", "origin"
	,row_number() OVER (PARTITION BY "record_id", "record_create_dt", "acct_num", "dep_dt" ORDER BY "record_id", "record_create_dt") AS "rn"
FROM (SELECT "record_id", "record_create_dt", "acct_num", "dep_dt", "origin"
FROM cdw.flights) "iljxeikdep"
WHERE (("dep_dt" = '2017-01-21') AND ("origin" = 'DEN'))

and it ran without error.

Could an update be made where an ORDER BY clause is generated within PARTITION BY window functions, at least for tbl_teradata objects?

@happyshows
Copy link
Author

@jakefrost can you file a separate issue for better visibility? Also I don't believe ORDER BY is required to pair with Partition by...

@jakefrost
Copy link

@happyshows Sure thing--I opened a new issue here. I did a bit more research, and you're right about it not being a requirement of Partition by. Maybe it has to do with row_number()? I'm not sure. Thanks for the response!

@lock
Copy link

lock bot commented Aug 7, 2018

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

@lock lock bot locked and limited conversation to collaborators Aug 7, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
feature a feature request or enhancement wip work in progress
Projects
None yet
Development

No branches or pull requests

8 participants