# dplyr backend for Spark SQL


## Introduction

Apache Spark SQL provides a Thrift JDBC/ODBC server mode that implements HiveServer2 from Hive 0.13. See the Apache Spark documentation, "Distributed SQL Engine - Running the Thrift JDBC/ODBC server," for details.
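The Thrift server is started from the Spark distribution itself; a minimal sketch (the `SPARK_HOME` location and the port are assumptions about your setup):

```shell
# Start the Thrift JDBC/ODBC server bundled with Spark.
# The port below matches the default used in the connection example later.
$SPARK_HOME/sbin/start-thriftserver.sh \
  --hiveconf hive.server2.thrift.port=10000
```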

dplyrSparkSQL is an experimental project to build a Spark SQL backend for dplyr.

## Getting Started

### Install manually

1. Download the prebuilt binary from https://spark.apache.org/downloads.html.
2. Check out https://github.com/wush978/dplyrSparkSQL to `dplyrSparkSQL`.
3. Extract the `.tgz` file and copy the jars from `<spark-home>/lib` to `dplyrSparkSQL/inst/drv`.
4. Install the package from `dplyrSparkSQL`.
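The steps above can be sketched in the shell; the release name below is a placeholder (an assumption), so substitute the archive you actually downloaded:

```shell
# Hypothetical release name -- replace with the file you downloaded in step 1.
SPARK_TGZ=spark-1.6.0-bin-hadoop2.6.tgz
tar -xzf "$SPARK_TGZ"
git clone https://github.com/wush978/dplyrSparkSQL dplyrSparkSQL
# Copy the Spark driver jars into the package's driver directory.
cp spark-1.6.0-bin-hadoop2.6/lib/*.jar dplyrSparkSQL/inst/drv/
R CMD INSTALL dplyrSparkSQL
```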

### Install via devtools

```r
library(devtools)
install_github("bridgewell/dplyrSparkSQL")
```

When installed this way, dplyrSparkSQL will try to download the Spark binaries automatically to retrieve the driver.

## Connect to the Spark Thrift Server

```r
src <- src_spark_sql(host = "localhost", port = "10000", user = Sys.info()["nodename"])
```

Please change the `host`, `port`, and `user` arguments accordingly.

## Create Table

The following command creates the table `people` from JSON.

```r
db_create_table(src, "people", stored_as = "JSON", temporary = TRUE,
                location = sprintf("file://%s", system.file(file.path("examples", "people.json"), package = "dplyrSparkSQL")))
```

Note that the `src` here connects to Spark in local mode, so it can access files on the local file system. When connecting to a real Spark cluster, the `location` should instead be a directory or file on HDFS.
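For a cluster, the call looks the same with an HDFS URI; the path below is hypothetical (an assumption about where your data lives):

```r
# Hypothetical hdfs:// location -- use a path readable by your Spark cluster.
db_create_table(src, "people", stored_as = "JSON", temporary = TRUE,
                location = "hdfs:///user/me/people.json")
```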

The following command creates a table `users` from Parquet.

```r
db_create_table(src, "users", stored_as = "PARQUET", temporary = TRUE,
                location = sprintf("file://%s", system.file("examples/users.parquet", package = "dplyrSparkSQL")))
```

dplyr obtains the `tbl` objects via:

```r
people <- tbl(src, "people")
users <- tbl(src, "users")
```

## dplyr features

We can apply the verbs of dplyr to `people` and `users`:

```r
people
nrow(people)
filter(people, age < 20)
select(users, name, favorite_color)
mutate(users, test_column = 1)
mutate(users, test_column = 1) %>% collect()
```
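The verbs also compose with the pipe, as in ordinary dplyr; a small sketch, assuming the `people` table created above is loaded (column names are those from the example JSON):

```r
# Build a lazy query on the Spark side, then pull the result into R.
people %>%
  filter(age < 20) %>%
  select(name) %>%
  collect()
```

Until `collect()` is called, the pipeline stays as a query against Spark SQL rather than data in R.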
