Apache Spark SQL has a Thrift JDBC/ODBC server mode which implements HiveServer2 from Hive 0.13. Please see the Apache Spark documentation, Distributed SQL Engine - Running the Thrift JDBC/ODBC server, for details.
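Before connecting from R, the Thrift server must be running; in the Spark distribution it is typically started with `./sbin/start-thriftserver.sh` from the Spark home directory, and by default it listens on `localhost:10000`.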
dplyrSparkSQL is an experimental project to build a Spark SQL backend for dplyr.
- Download the prebuilt binary manually from https://spark.apache.org/downloads.html.
- Check out https://github.com/wush978/dplyrSparkSQL to `dplyrSparkSQL`.
- Extract the `.tgz` file and copy the jars in `<spark-home>/lib` to `dplyrSparkSQL/inst/drv`.
- Install the package from `dplyrSparkSQL`.
Alternatively, install the package directly from GitHub with devtools:

```r
library(devtools)
install_github("bridgewell/dplyrSparkSQL")
```
If you install it this way, dplyrSparkSQL will try to download the Spark binaries automatically to retrieve the driver.
Connect to the Thrift JDBC/ODBC server via `src_spark_sql`:

```r
src <- src_spark_sql(host = "localhost", port = "10000", user = Sys.info()["nodename"])
```
Please change the `host`, `port`, and `user` accordingly.
The following command create the table people
from JSON.
db_create_table(src, "people", stored_as = "JSON", temporary = TRUE,
location = sprintf("file://%s", system.file(file.path("examples", "people.json"), package = "dplyrSparkSQL")))
Note that the `src` here connects to Spark in local mode, so it can access files on the local file system. If we connect to a real Spark cluster, the `location` should be a directory or a file on HDFS.
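For a cluster, the call would look something like the following sketch; the HDFS URI is hypothetical and should point at data readable by the Spark workers:

```r
# Hypothetical HDFS location; replace the namenode, port, and path
# with values from your cluster.
db_create_table(src, "people", stored_as = "JSON", temporary = TRUE,
  location = "hdfs://namenode:8020/data/people")
```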
The following command creates the table `users` from Parquet.
db_create_table(src, "users", stored_as = "PARQUET", temporary = TRUE,
location = sprintf("file://%s", system.file("examples/users.parquet", package = "dplyrSparkSQL")))
dplyr obtains the `tbl` objects via:
```r
people <- tbl(src, "people")
users <- tbl(src, "users")
```
We can apply dplyr verbs to `people` and `users`:
```r
people                                      # print a preview of the remote table
nrow(people)                                # number of rows
filter(people, age < 20)
select(users, name, favorite_color)
mutate(users, test_column = 1)
mutate(users, test_column = 1) %>% collect  # collect() pulls the result into R
```