R Markdown with Scala/Spark

This shows you how to use a custom knitr language engine to bring Spark data analytics into R Markdown documents.

R Markdown is an authoring format that enables easy creation of dynamic documents, presentations, and reports from R.

Apache Spark™ is a fast and general engine for large-scale data processing.

Requirements

jvmr

We use the wonderful jvmr package from David B. Dahl. It allows you to evaluate Java/Scala expressions and provides an interpreter interface built upon rJava.

While CRAN has archived jvmr, you can still install it with:

devtools::install_github("cran/jvmr")

The latest version uses Scala 2.11.2. Since Spark still compiles against Scala 2.10 by default, you might want to install an older jvmr with:

devtools::install_url("http://cran.r-project.org/src/contrib/Archive/jvmr/jvmr_1.0.4.tar.gz")

Java/Spark

Of course you will need Java and Spark as usual. We pick up the Spark location from the SPARK_HOME environment variable. SparkR might make this part easier and more powerful in the future.
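For that lookup to succeed, SPARK_HOME must be set before knitting; a minimal sketch (the path below is only an illustration — adjust it to wherever Spark is installed on your machine):

```shell
# Point SPARK_HOME at your Spark installation (path is illustrative)
export SPARK_HOME="$HOME/spark"
echo "$SPARK_HOME"
```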

Overview

The motivation behind this is that R remains great for data analytics and statistical modeling, but its inability to handle large datasets is rooted in the design of the language. Aside from its analytics capabilities, R is also wonderful for visualization and interactivity. Spark excels at scalability but lacks interactivity. We wondered whether we could get the best of both worlds.

It turns out this can be as simple as:

library(jvmr)
library(knitr)

# Start a Scala interpreter (on the JVM) via jvmr
scala <- scalaInterpreter()

# Register a custom "scalar" engine: pipe the chunk's code into the
# interpreter and capture its printed output
knit_engines$set(scalar = function(options) {
  code <- paste(options$code, collapse = "\n")
  output <- capture.output(interpret(scala, code, echo.output = TRUE))
  engine_output(options, options$code, output)
})

This pipes Scala expressions into the interpreter and returns the results, which can potentially be coerced into R data types.
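With the engine registered, Scala chunks can be embedded in an R Markdown document like any other chunk. The chunk below is an illustrative sketch, not taken from scala.Rmd:

````markdown
```{scalar}
val xs = List(1, 2, 3)
xs.map(_ * 2)
```
````

When the document is knitted, the chunk's source and the interpreter's printed output are rendered together by engine_output.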

Examples

scala.Rmd

A quick example of how this works with a regular Scala interpreter.

spark-textclass.Rmd

Reproduces the results from SimpleTextClassificationPipeline and plots a ROC curve.

Note: This doesn't work just yet. jvmr and rJava either hang indefinitely or throw a java.lang.OutOfMemoryError: PermGen space exception.

Related Work