# SparkR Basics

RDD
---
Spark’s primary abstraction is a distributed collection of items called a Resilient Distributed Dataset (RDD). RDDs can be created from Hadoop InputFormats (such as HDFS files) or by transforming other RDDs. Let’s make a new RDD from the text of the README file in the SparkR-pkg source directory:
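A minimal sketch of this step, assuming the SparkR-pkg functions `sparkR.init` and `textFile`, a local master, and an R session started in the SparkR-pkg source directory so that `README.md` resolves. The variable names `sc` and `lines` are illustrative choices.

```r
library(SparkR)

# Assumption: run Spark in local mode; use a cluster URL instead when one is available.
sc <- sparkR.init(master = "local")

# Assumption: the working directory is the SparkR-pkg source directory.
# textFile() creates an RDD with one item per line of the file.
lines <- textFile(sc, "README.md")
```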

RDDs have actions, which return values, and transformations, which return pointers to new RDDs. Let’s start with a few actions:
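As a hedged illustration, the sketch below calls two actions, `count` and `take`, on an RDD built from the README. The local master, the file path, and the variable names are assumptions.

```r
library(SparkR)

# Assumptions: local mode, and README.md in the current working directory.
sc <- sparkR.init(master = "local")
lines <- textFile(sc, "README.md")

# count() is an action: it returns the number of items (here, lines) in the RDD.
numLines <- count(lines)

# take() is an action: it returns the first n items as an R list.
firstLine <- take(lines, 1)

print(numLines)
print(firstLine)
```

Both calls return ordinary R values to the driver, which is what distinguishes actions from transformations.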

Now let’s use a transformation. We will use the `filterRDD` transformation to return a new RDD with a subset of the items in the file.
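One possible version of this transformation is sketched below. The predicate (keep lines containing the string "Spark") is an assumption suggested by the `linesWithSpark` name used later in this page; the setup lines are assumptions too.

```r
library(SparkR)

# Assumptions: local mode, and README.md in the current working directory.
sc <- sparkR.init(master = "local")
lines <- textFile(sc, "README.md")

# filterRDD() is a transformation: it returns a new RDD holding only the items
# for which the function returns TRUE. Nothing is returned to the driver yet.
linesWithSpark <- filterRDD(lines, function(line) {
  grepl("Spark", line)
})
```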

We can chain together transformations and actions:
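For instance, a filter followed by a count can be written as one nested expression. This is a sketch under the same assumptions as before (local master, `README.md` in the working directory, "Spark" as the search string).

```r
library(SparkR)

# Assumptions: local mode, and README.md in the current working directory.
sc <- sparkR.init(master = "local")
lines <- textFile(sc, "README.md")

# Transformation (filterRDD) chained directly into an action (count):
# how many lines contain the string "Spark"?
numSparkLines <- count(filterRDD(lines, function(line) grepl("Spark", line)))
print(numSparkLines)
```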

RDD Operations
---

RDD actions and transformations can be used for more complex computations. Let’s say we want to find the line with the most words:
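A sketch of that computation, assuming words are separated by single spaces and using the SparkR-pkg `lapply` and `reduce` functions on an RDD. The setup lines and variable names are assumptions.

```r
library(SparkR)

# Assumptions: local mode, and README.md in the current working directory.
sc <- sparkR.init(master = "local")
lines <- textFile(sc, "README.md")

# Inner call: lapply() maps every line to its word count, giving a new RDD of integers.
# Outer call: reduce() keeps the larger of two counts until one value remains.
mostWords <- reduce(
  lapply(lines, function(line) {
    length(strsplit(line, " ")[[1]])
  }),
  function(a, b) {
    if (a > b) a else b
  }
)
print(mostWords)
```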

There are two functions here: `lapply` and `reduce`. The inner function (`lapply`) maps each line to an integer value, its word count, creating a new RDD. The outer function (`reduce`) is called on the new RDD to find the largest word count. In this case, the arguments to both functions are passed as anonymous functions, but we can also define R functions beforehand and pass them as arguments to the RDD functions. For example, we’ll define a max function to make this code easier to understand:
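The same computation with named functions might look like the sketch below. The names `countWords` and `maxOfTwo` are hypothetical, chosen here to avoid masking base R's `max`; the setup lines are assumptions.

```r
library(SparkR)

# Assumptions: local mode, and README.md in the current working directory.
sc <- sparkR.init(master = "local")
lines <- textFile(sc, "README.md")

# Hypothetical helper: number of space-separated words in one line.
countWords <- function(line) {
  length(strsplit(line, " ")[[1]])
}

# Hypothetical helper: the larger of two values.
maxOfTwo <- function(a, b) {
  if (a > b) a else b
}

# Named functions are passed to the RDD functions exactly like anonymous ones.
mostWords <- reduce(lapply(lines, countWords), maxOfTwo)
print(mostWords)
```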

One common data flow pattern is MapReduce, as popularized by Hadoop. MapReduce flows are easily implemented in SparkR:
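A word count is the usual illustration. The sketch below assumes the SparkR-pkg functions `flatMap`, `lapply`, and `reduceByKey`, single-space word separation, and two output partitions; all variable names are illustrative.

```r
library(SparkR)

# Assumptions: local mode, and README.md in the current working directory.
sc <- sparkR.init(master = "local")
lines <- textFile(sc, "README.md")

# Map phase 1: split every line into words; flatMap() flattens the results into one RDD.
words <- flatMap(lines, function(line) {
  strsplit(line, " ")[[1]]
})

# Map phase 2: turn every word into a (word, 1) key-value pair.
wordPairs <- lapply(words, function(word) {
  list(word, 1L)
})

# Reduce phase: sum the values for each key, using 2 partitions (an arbitrary choice).
wordCounts <- reduceByKey(wordPairs, "+", 2L)
```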

Here, we combined the `flatMap`, `lapply` and `reduceByKey` transformations to compute the per-word counts in the file as an RDD of (string, int) pairs. To collect the word counts in our shell, we can use the `collect` action:
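A sketch of the collection step follows. It rebuilds the word-count RDD so that it runs on its own, under the same assumptions (local master, `README.md` in the working directory, single-space separation, two partitions).

```r
library(SparkR)

# Assumptions: local mode, and README.md in the current working directory.
sc <- sparkR.init(master = "local")
lines <- textFile(sc, "README.md")

# Word-count pipeline: words -> (word, 1) pairs -> per-word sums.
words <- flatMap(lines, function(line) strsplit(line, " ")[[1]])
wordCounts <- reduceByKey(lapply(words, function(word) list(word, 1L)), "+", 2L)

# collect() is an action: it brings every (word, count) pair back to the driver as an R list.
output <- collect(wordCounts)

# Print each pair; pair[[1]] is the word and pair[[2]] is its count.
for (pair in output) {
  cat(pair[[1]], ":", pair[[2]], "\n")
}
```

Because `collect` moves the whole RDD to the driver, it is best reserved for results known to be small.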

Caching
---

Spark also supports pulling datasets into a cluster-wide in-memory cache. This is very useful when data is accessed repeatedly, such as when querying a small “hot” dataset or when running an iterative algorithm like PageRank. As a simple example, let’s mark our `linesWithSpark` dataset to be cached:
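A hedged sketch of caching, assuming the SparkR-pkg `cache` function and a `linesWithSpark` RDD built by filtering the README for the string "Spark". The setup lines are assumptions.

```r
library(SparkR)

# Assumptions: local mode, and README.md in the current working directory.
sc <- sparkR.init(master = "local")
lines <- textFile(sc, "README.md")
linesWithSpark <- filterRDD(lines, function(line) grepl("Spark", line))

# cache() only marks the RDD for in-memory storage; no computation happens here.
cache(linesWithSpark)

# The first action computes the RDD and populates the cache.
firstCount <- count(linesWithSpark)

# Later actions on the same RDD can reuse the cached data instead of re-reading the file.
secondCount <- count(linesWithSpark)
```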

It may seem silly to use Spark to explore and cache a 100-line text file. The interesting part is that these same functions can be used on very large datasets, even when they are striped across tens or hundreds of nodes. You can also do this interactively by connecting the SparkR shell to a cluster; an example is described on the [SparkR on EC2 wiki page](https://github.com/amplab-extras/SparkR-pkg/wiki/SparkR-on-EC2).

Standalone Applications
---

Now we'll walk through the process of writing and executing a standalone application in SparkR. As an example, we'll create a simple R script, `SimpleApp.R`:
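One possible shape for such a script is sketched below. The log file path is a placeholder assumption, as are the local master and the variable names; adjust them to your setup.

```r
# SimpleApp.R (illustrative sketch)
library(SparkR)

# Initialize a SparkContext. Assumption: local mode.
sc <- sparkR.init(master = "local")

# Assumption: placeholder path; point this at any text file on your system.
logFile <- "README.md"

# Cache the RDD because it is scanned twice below.
logData <- cache(textFile(sc, logFile))

# Count the lines containing "a" and the lines containing "b".
numAs <- count(filterRDD(logData, function(s) grepl("a", s)))
numBs <- count(filterRDD(logData, function(s) grepl("b", s)))

# Report both counts as a single string.
cat(paste("Lines with a: ", numAs, ", Lines with b: ", numBs, "\n", sep = ""))
```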

This program counts the number of lines containing "a" and the number containing "b" in a text file, and prints the counts as a string on the command line. In this application, we use the `sparkR.init()` function to initialize a SparkContext, which is then used to create RDDs. We can pass R functions to Spark, where they are automatically serialized along with any variables they reference.

To run this application, pass the script to the `./sparkR` launcher from the SparkR-pkg directory, using the form `./sparkR <filename> <args>`.

Other Examples
---

In addition, SparkR includes several samples in the `examples` directory. To run one of them, use `./sparkR <filename> <args>`.