Apache Spark SQL has a Thrift JDBC/ODBC server mode which implements HiveServer2 from Hive 0.13. Please see the Apache Spark documentation, Distributed SQL Engine - Running the Thrift JDBC/ODBC server, for details.
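Before connecting from R, the Thrift server must be running; in the Spark distribution it is typically started with `./sbin/start-thriftserver.sh` from the Spark home directory, and by default it listens on `localhost:10000`.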
dplyrSparkSQL is an experimental project to build a Spark SQL backend for dplyr.
- Download the prebuilt binary manually from https://spark.apache.org/downloads.html.
- Check out https://github.com/wush978/dplyrSparkSQL to `dplyrSparkSQL`.
- Extract the `.tgz` file and copy the jars in `<spark-home>/lib` to `dplyrSparkSQL/inst/drv`.
- Install the package from `dplyrSparkSQL`.
Alternatively, install the package directly from GitHub with devtools:

```r
library(devtools)
install_github("bridgewell/dplyrSparkSQL")
```
If you install it this way, dplyrSparkSQL will try to download the Spark binaries automatically to retrieve the driver.
Connect to the Thrift JDBC/ODBC server via `src_spark_sql`:

```r
src <- src_spark_sql(host = "localhost", port = "10000", user = Sys.info()["nodename"])
```
Please change the `host`, `port`, and `user` accordingly.
The following command create the table people
from JSON.
db_create_table(src, "people", stored_as = "JSON", temporary = TRUE,
location = sprintf("file://%s", system.file(file.path("examples", "people.json"), package = "dplyrSparkSQL")))
Note that the `src` here connects to Spark in local mode, so it can access files on the local file system. If we connect to a real Spark cluster, the `location` should be a directory or a file on HDFS.
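For a cluster, the call would look something like the following sketch; the HDFS URI is hypothetical and should point at data readable by the Spark workers:

```r
# Hypothetical HDFS location; replace the namenode, port, and path
# with values from your cluster.
db_create_table(src, "people", stored_as = "JSON", temporary = TRUE,
  location = "hdfs://namenode:8020/data/people")
```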
The following command creates the table `users` from Parquet.
db_create_table(src, "users", stored_as = "PARQUET", temporary = TRUE,
location = sprintf("file://%s", system.file("examples/users.parquet", package = "dplyrSparkSQL")))
dplyr obtains the `tbl` objects via:
```r
people <- tbl(src, "people")
users <- tbl(src, "users")
```
We can apply dplyr verbs to `people` and `users`:
```r
people                                      # print a preview of the remote table
nrow(people)                                # number of rows
filter(people, age < 20)
select(users, name, favorite_color)
mutate(users, test_column = 1)
mutate(users, test_column = 1) %>% collect  # collect() pulls the result into R
```