This package implements
hive backends for the
dplyr package, providing a powerful and intuitive DSL to manipulate large datasets on two powerful big data platforms. It is a simple package: simple to learn if you have any familiarity with
dplyr or even just R and SQL, simple to deploy: just a few packages to install on a single machine, as long as your Spark or Hive installations comes with JDBC support.
The current state of the project is:
- adds some
spark-specific goodies, like caching tables.
- can go succesfully through tutorials for
dplyrlike any other database backend.
- test suite is work in progress. Nonetheless we expect it to inherit much of its correctness, scalability and robustness from its main dependencies,
- we don't recommend production use yet
cd <spark root> build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -DskipTests -Phive -Phive-thriftserver clean package
It may work with other hadoop versions, but we need the hive and hive-thriftserver support. Spark 1.5 and later is highly recommended because of bugs that affect this package. The package is able to start the thrift server but can also connect to a running one.
For Hive, any recent Hadoop distribution should do. We did some testing with the latest HDP as provided in the Hortonworks sandbox.
rJava. Make sure that you have
rJava working with:
install.packages("rJava") library(rJava) .jinit()
This is only a test, in general you don't need to repeat it before loading
On the mac
rJava required two different versions of java installed, for real, and in particular this shell variable set
The specific path may be different, particularly the version numbers. To start Rstudio (optional, you can use a different GUI or none at all), which doesn't read environment variables, you can enter the following command:
DYLD_FALLBACK_LIBRARY_PATH=/Library/Java/JavaVirtualMachines/jdk1.8.0_51.jdk/Contents/Home/jre/lib/server/ open -a rstudio
HADOOP_JAR environment variable needs to be set to the main hadoop JAR file, something like
"<spark home>/assembly/target/scala-2.10/spark-assembly-*-hadoop*.jar". The packaging of Spark is under review (SPARK-11157) so expect changes here. We then may be able to bundle the necessary jars with this package, at least for the most recent versions, thus removing the need to install spark on client machines. For hive users, the path will be a little different but the principle is the same. This is needed for the instantiation of a JDBC driver, which needs to be able to find class
org.apache.hive.jdbc.HiveDriver fot both Spark and Hive, which share some of these components.
To start the thrift server from R, you need one more variable set,
SPARK_HOME, as the name suggests pointing to the root of the Spark installation. If you are connecting with a running server, you just need host and port information. Those can be stored in environment variable as well, see help documentation. The option to start a server from R is not available for Hive.
The current version is 0.6.0 .
To install the latest release and its dependencies, enter the following expression:
install.packages("dplyr.spark.hive", repos = c("http://archive.rzilla.org", unlist(options("repos"))))
You can find a number of examples derived from @hadley's own tutorials for dplyr looking under the tests directory, files
For new releases, subscribe to
dplyr.spark.hive's Release notes feed.
If you notice any problems, please create an item in the issue tracker.
Backend specific features
For highly technical reasons which I won't bother you with,
copy_to is not supported on Spark and slow on Hive. To make up, here are more useful features available only with these backends:
: create a table in spark or hive from an external data set, such as an hdfs file or JDBC connection. More extensive capabilities on Spark, check the manual. Scalable, unlike
: keep a data set in distributed memory as far as possible, spark-only.
: create a
tbl object for each table in SparkSQL or Hive.