Skip to content

Add support for Databricks backend #1377

@edgararuiz

Description

@edgararuiz

Thanks to the default "DBI" translation, most of the dplyr/dbplyr operations work. There are Databricks specific operations, for example var(), currently marked as not-supported by dbplyr, due to the fact that var() is not in the default translation. Databricks uses "Spark SQL", which is underpinned by the Hive SQL syntax. My suggestion would be to use the same SQL variance currently in use for the Hive backend support.

The second issue uncovered, is that copy_to() does not work. The Databricks back-end does not support transactions. So I think a custom db_copy_to() function will be needed here.

Reprex

The following reprex contains the specific code that can be used to connect to Databricks using a PAT. It also contains confirmation that var() is supported in Databricks. The confirmation is in calling SQL directly via DBI, and then attempting
to do the same via dplyr, which errors out. It also contains the error received when trying to use copy_to()

# https://docs.databricks.com/en/sql/language-manual/index.html

library(dbplyr)
library(dplyr)
library(DBI)

con <- dbConnect(
  odbc::odbc(),
  Driver = "/Library/simba/spark/lib/libsparkodbc_sb64-universal.dylib",
  Host = "rstudio-partner-posit-default.cloud.databricks.com",
  Port = 443,
  AuthMech = 3,
  HTTPPath = "/sql/1.0/warehouses/300bd24ba12adf8e",
  Protocol = "https",
  ThriftTransport = 2,
  SSL = 1,
  UID = "token",
  PWD = Sys.getenv("DATABRICKS_TOKEN")
)

# https://docs.databricks.com/en/sql/language-manual/functions/variance.html

dbGetQuery(con, "Select variance(trip_distance) as var1 from samples.nyctaxi.trips where trip_distance > 20")
#>       var1
#> 1 3.369306

trips <- tbl(con, in_catalog("samples", "nyctaxi", "trips"))

trips %>% 
  filter(trip_distance > 20) %>% 
  summarise(var1 = var(trip_distance, na.rm = TRUE))
#> Error in `var()`:
#> ! `var()` is not available in this SQL variant.
#> Backtrace:
#>      ▆
#>   1. ├─base::tryCatch(...)
#>   2. │ └─base (local) tryCatchList(expr, classes, parentenv, handlers)
#>   3. │   ├─base (local) tryCatchOne(...)
#>   4. │   │ └─base (local) doTryCatch(return(expr), name, parentenv, handler)
#>   5. │   └─base (local) tryCatchList(expr, names[-nh], parentenv, handlers[-nh])
#>   6. │     └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
#>   7. │       └─base (local) doTryCatch(return(expr), name, parentenv, handler)
#>   8. ├─base::withCallingHandlers(...)
#>   9. ├─base::saveRDS(...)
#>  10. ├─base::do.call(...)
#>  11. ├─base (local) `<fn>`(...)
#>  12. ├─global `<fn>`(input = base::quote("hexed-tuna_reprex.R"))
#>  13. │ └─rmarkdown::render(input, quiet = TRUE, envir = globalenv(), encoding = "UTF-8")
#>  14. │   └─knitr::knit(knit_input, knit_output, envir = envir, quiet = quiet)
#>  15. │     └─knitr:::process_file(text, output)
#>  16. │       ├─base::withCallingHandlers(...)
#>  17. │       ├─base::withCallingHandlers(...)
#>  18. │       ├─knitr:::process_group(group)
#>  19. │       └─knitr:::process_group.block(group)
#>  20. │         └─knitr:::call_block(x)
#>  21. │           └─knitr:::block_exec(params)
#>  22. │             └─knitr:::eng_r(options)
#>  23. │               ├─knitr:::in_input_dir(...)
#>  24. │               │ └─knitr:::in_dir(input_dir(), expr)
#>  25. │               └─knitr (local) evaluate(...)
#>  26. │                 └─evaluate::evaluate(...)
#>  27. │                   └─evaluate:::evaluate_call(...)
#>  28. │                     ├─evaluate (local) handle(...)
#>  29. │                     │ └─base::try(f, silent = TRUE)
#>  30. │                     │   └─base::tryCatch(...)
#>  31. │                     │     └─base (local) tryCatchList(expr, classes, parentenv, handlers)
#>  32. │                     │       └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
#>  33. │                     │         └─base (local) doTryCatch(return(expr), name, parentenv, handler)
#>  34. │                     ├─base::withCallingHandlers(...)
#>  35. │                     ├─base::withVisible(value_fun(ev$value, ev$visible))
#>  36. │                     └─knitr (local) value_fun(ev$value, ev$visible)
#>  37. │                       └─knitr (local) fun(x, options = options)
#>  38. │                         ├─base::withVisible(knit_print(x, ...))
#>  39. │                         ├─knitr::knit_print(x, ...)
#>  40. │                         └─rmarkdown:::knit_print.tbl_sql(x, ...)
#>  41. │                           ├─context$df_print(x)
#>  42. │                           └─dbplyr:::print.tbl_sql(x)
#>  43. │                             ├─dbplyr:::cat_line(format(x, ..., n = n, width = width, n_extra = n_extra))
#>  44. │                             │ ├─base::cat(paste0(..., "\n"), sep = "")
#>  45. │                             │ └─base::paste0(..., "\n")
#>  46. │                             ├─base::format(x, ..., n = n, width = width, n_extra = n_extra)
#>  47. │                             └─pillar:::format.tbl(x, ..., n = n, width = width, n_extra = n_extra)
#>  48. │                               └─pillar:::format_tbl(...)
#>  49. │                                 └─pillar::tbl_format_setup(...)
#>  50. │                                   ├─pillar:::tbl_format_setup_dispatch(...)
#>  51. │                                   └─pillar:::tbl_format_setup.tbl(...)
#>  52. │                                     └─pillar:::df_head(x, n + 1)
#>  53. │                                       ├─base::as.data.frame(head(x, n))
#>  54. │                                       └─dbplyr:::as.data.frame.tbl_sql(head(x, n))
#>  55. │                                         ├─base::as.data.frame(collect(x, n = n))
#>  56. │                                         ├─dplyr::collect(x, n = n)
#>  57. │                                         └─dbplyr:::collect.tbl_sql(x, n = n)
#>  58. │                                           ├─dbplyr::db_sql_render(x$src$con, x, cte = cte)
#>  59. │                                           └─dbplyr:::db_sql_render.DBIConnection(x$src$con, x, cte = cte)
#>  60. │                                             ├─dbplyr::sql_render(sql, con = con, ..., cte = cte)
#>  61. │                                             └─dbplyr:::sql_render.tbl_lazy(sql, con = con, ..., cte = cte)
#>  62. │                                               ├─dbplyr::sql_render(...)
#>  63. │                                               └─dbplyr:::sql_render.lazy_query(...)
#>  64. │                                                 ├─dbplyr::sql_build(query, con = con, ...)
#>  65. │                                                 └─dbplyr:::sql_build.lazy_select_query(query, con = con, ...)
#>  66. │                                                   └─dbplyr:::get_select_sql(...)
#>  67. │                                                     └─dbplyr::translate_sql_(select_expr, con, window = FALSE, context = list(clause = "SELECT"))
#>  68. │                                                       └─base::lapply(...)
#>  69. │                                                         └─dbplyr (local) FUN(X[[i]], ...)
#>  70. │                                                           ├─dbplyr::escape(eval_tidy(x, mask), con = con)
#>  71. │                                                           └─rlang::eval_tidy(x, mask)
#>  72. └─dbplyr (local) var(trip_distance, na.rm = TRUE)
#>  73.   └─cli::cli_abort("{.fun {f}} is not available in this SQL variant.")
#>  74.     └─rlang::abort(...)

# Copying data does not work

tbl_mtcars <- copy_to(con, mtcars)
#> Error in eval(expr, envir, enclos): nanodbc/nanodbc.cpp:1296: 00000: [Simba][ODBC] (11470) Transactions are not supported.

dbDisconnect(con)

Created on 2023-10-24 with reprex v2.0.2

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions