# RDDs, Spark's Distributed Collection

RDDs seem a lot like immutable sequential or parallel Scala collections. 

```scala
abstract class RDD[T] {
  def map[U](f: T => U): RDD[U] = ...
  def flatMap[U](f: T => TraversableOnce[U]): RDD[U] = ...
  def filter(f: T => Boolean): RDD[T] = ...
  def reduce(f: (T, T) => T): T = ...
}
```

Only a few of the available operations are shown here.


Most operations on RDDs, like those on Scala's immutable `List` and on Scala's parallel collections, are higher-order functions.

That is, they are methods on RDDs that take a function as an argument and typically return another RDD.

While their signatures differ a bit, their semantics (macroscopically) are
the same:

```scala
map[B](f: A => B): List[B] // Scala List
map[B](f: A => B): RDD[B] // Spark RDD
flatMap[B](f: A => TraversableOnce[B]): List[B] // Scala List
flatMap[B](f: A => TraversableOnce[B]): RDD[B] // Spark RDD
filter(pred: A => Boolean): List[A] // Scala List
filter(pred: A => Boolean): RDD[A] // Spark RDD
reduce(op: (A, A) => A): A // Scala List
reduce(op: (A, A) => A): A // Spark RDD
fold(z: A)(op: (A, A) => A): A // Scala List
fold(z: A)(op: (A, A) => A): A // Spark RDD
aggregate[B](z: => B)(seqop: (B, A) => B, combop: (B, B) => B): B // Scala
aggregate[B](z: B)(seqop: (B, A) => B, combop: (B, B) => B): B // Spark RDD
```
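
As a rough illustration of how closely the two APIs line up, the sketch below runs the same `map`/`filter`/`reduce` pipeline on a local `List` and on an RDD. It assumes Spark is on the classpath and runs in local mode; the object name `ListVsRdd`, the helper `sumOfEvenSquares`, and the sample numbers are made up for this example.

```scala
import org.apache.spark.sql.SparkSession

object ListVsRdd {
  // The pipeline written against a plain Scala List.
  def sumOfEvenSquares(xs: List[Int]): Int =
    xs.map(x => x * x).filter(_ % 2 == 0).reduce(_ + _)

  def main(args: Array[String]): Unit = {
    // Assumption: local mode, for experimentation only.
    val spark = SparkSession.builder().appName("ListVsRdd").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    val numbers = List(1, 2, 3, 4, 5, 6)
    val local = sumOfEvenSquares(numbers)

    // The same three calls, this time on an RDD.
    val distributed = sc.parallelize(numbers).map(x => x * x).filter(_ % 2 == 0).reduce(_ + _)

    println(s"List: $local, RDD: $distributed") // both are 4 + 16 + 36 = 56
    spark.stop()
  }
}
```

The calls read the same in both cases; what differs is where the work happens.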

Using RDDs in Spark feels a lot like normal Scala sequential/parallel
collections, with the added knowledge that your data is distributed across
several machines.


## How to Create an RDD?
RDDs can be created in two ways:
- By transforming an existing RDD.
- From a `SparkContext` (or `SparkSession`) object.

Transforming an existing `RDD`. Just like a call to map on a `List` returns a new `List`, many higher-order functions defined on `RDD` return a new `RDD`. 

From a `SparkContext` (or `SparkSession`) object.
The `SparkContext` object can be thought of as your handle to the Spark cluster. It represents the connection between the
Spark cluster and your running application. Since Spark 2.0, the usual entry point is a `SparkSession`, which wraps a `SparkContext` and exposes it as `sparkContext`. The `SparkContext` defines a handful of methods which can be used to create and populate a new RDD:

- `parallelize`: converts a local Scala collection to an RDD.
- `textFile`: reads a text file from HDFS or a local file system and returns an `RDD[String]`, one element per line.
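
The sketch below shows both ways of creating an RDD side by side. It is a minimal example under stated assumptions: Spark runs in local mode, the object name `CreateRdds` and the helper `wordLengths` are invented here, and `data/input.txt` is a hypothetical path that you would replace with a real file or HDFS URI.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object CreateRdds {
  // parallelize turns a local collection into an RDD;
  // map then derives a second RDD from the first one.
  def wordLengths(sc: SparkContext, words: Seq[String]): RDD[Int] = {
    val wordsRdd: RDD[String] = sc.parallelize(words)
    wordsRdd.map(_.length)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("CreateRdds").master("local[*]").getOrCreate()
    val sc = spark.sparkContext // the SparkContext behind the session

    val lengthsRdd = wordLengths(sc, List("spark", "rdd", "scala"))
    println(lengthsRdd.collect().mkString(", ")) // 5, 3, 5

    // textFile yields one RDD element per line of the file.
    // Hypothetical path; no action is called on linesRdd, so the file is not read here.
    val linesRdd: RDD[String] = sc.textFile("data/input.txt")

    spark.stop()
  }
}
```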

# Transformations and Actions 

Operations on RDDs come in two kinds.

Recall transformers and accessors from Scala's sequential and parallel
collections.

Transformers: return new collections as results (not single values).
Examples: `map`, `filter`, `flatMap`, `groupBy`.

```scala
map(f: A => B): Traversable[B]
```
Accessors: return single values as results (not collections).
Examples: `reduce`, `fold`, `aggregate`.
```scala
reduce(op: (A, A) => A): A
```

Similarly, Spark defines transformations and actions on RDDs.

Transformations: return new RDDs as results. They are lazy: the result RDD is not computed immediately.

Actions: compute a result based on an RDD, which is either returned to the driver or saved to an external storage system (e.g., HDFS). They are eager: the result is computed immediately.

Laziness and eagerness are how Spark limits network communication through the programming model.


## Example
Consider the following simple example:
```scala
val largeList: List[String] = ...
val wordsRdd = sc.parallelize(largeList)
val lengthsRdd = wordsRdd.map(_.length)
```
What has happened on the cluster at this point?
Nothing. Execution of `map` (a transformation) is deferred.
To kick off the computation and wait for its result, an action has to be called on `lengthsRdd`.
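
For instance, adding a `reduce` at the end forces the deferred `map` to run. The following is a minimal sketch assuming Spark in local mode; the object name `LazyThenEager`, the helper `totalChars`, and the three sample words are stand-ins chosen for this example.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

object LazyThenEager {
  def totalChars(sc: SparkContext, largeList: Seq[String]): Int = {
    val wordsRdd = sc.parallelize(largeList)
    val lengthsRdd = wordsRdd.map(_.length) // transformation: nothing runs yet

    // reduce is an action: it triggers the map above
    // and returns a plain Int to the driver.
    lengthsRdd.reduce(_ + _)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("LazyThenEager").master("local[*]").getOrCreate()
    val total = totalChars(spark.sparkContext, List("resilient", "distributed", "dataset"))
    println(total) // 9 + 11 + 7 = 27
    spark.stop()
  }
}
```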



<table class="table">
<tbody><tr><th style="width:25%">Transformation</th><th>Meaning</th></tr>
<tr>
  <td> <b>map</b>(<i>func</i>) </td>
  <td> Return a new distributed dataset formed by passing each element of the source through a function <i>func</i>. </td>
</tr>
<tr>
  <td> <b>filter</b>(<i>func</i>) </td>
  <td> Return a new dataset formed by selecting those elements of the source on which <i>func</i> returns true. </td>
</tr>
<tr>
  <td> <b>flatMap</b>(<i>func</i>) </td>
  <td> Similar to map, but each input item can be mapped to 0 or more output items (so <i>func</i> should return a Seq rather than a single item). </td>
</tr>
<tr>
  <td> <b>mapPartitions</b>(<i>func</i>) </td>
  <td> Similar to map, but runs separately on each partition (block) of the RDD, so <i>func</i> must be of type
    Iterator[T] =&gt; Iterator[U] when running on an RDD of type T. </td>
</tr>
<tr>
  <td> <b>mapPartitionsWithIndex</b>(<i>func</i>) </td>
  <td> Similar to mapPartitions, but also provides <i>func</i> with an integer value representing the index of
  the partition, so <i>func</i> must be of type (Int, Iterator[T]) =&gt; Iterator[U] when running on an RDD of type T.
  </td>
</tr>
<tr>
  <td> <b>sample</b>(<i>withReplacement</i>, <i>fraction</i>, <i>seed</i>) </td>
  <td> Sample a fraction <i>fraction</i> of the data, with or without replacement, using a given random number generator seed. </td>
</tr>
<tr>
  <td> <b>union</b>(<i>otherDataset</i>) </td>
  <td> Return a new dataset that contains the union of the elements in the source dataset and the argument. </td>
</tr>
<tr>
  <td> <b>distinct</b>([<i>numTasks</i>]) </td>
  <td> Return a new dataset that contains the distinct elements of the source dataset.</td>
</tr>
<tr>
  <td> <b>groupByKey</b>([<i>numTasks</i>]) </td>
  <td> When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable[V]) pairs. <br>
<b>Note:</b> By default, the level of parallelism in the output depends on the number of partitions of the parent RDD. You can pass an optional <code>numTasks</code> argument to set a different number of tasks.
</td>
</tr>
<tr>
  <td> <b>reduceByKey</b>(<i>func</i>, [<i>numTasks</i>]) </td>
  <td> When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function. Like in <code>groupByKey</code>, the number of reduce tasks is configurable through an optional second argument. </td>
</tr>
<tr>
  <td> <b>sortByKey</b>([<i>ascending</i>], [<i>numTasks</i>]) </td>
  <td> When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified in the boolean <code>ascending</code> argument.</td>
</tr>
<tr>
  <td> <b>join</b>(<i>otherDataset</i>, [<i>numTasks</i>]) </td>
  <td> When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. </td>
</tr>
<tr>
  <td> <b>cogroup</b>(<i>otherDataset</i>, [<i>numTasks</i>]) </td>
  <td> When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (Iterable[V], Iterable[W])) tuples. This operation is also called <code>groupWith</code>. </td>
</tr>
<tr>
  <td> <b>cartesian</b>(<i>otherDataset</i>) </td>
  <td> When called on datasets of types T and U, returns a dataset of (T, U) pairs (all pairs of elements). </td>
</tr>
</tbody></table>
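
Several of the transformations in the table can be chained, as in the word-count sketch below. It is an illustration only, assuming Spark in local mode; the object name `PairTransformations`, the helper `labelledCounts`, and the sample lines and labels are invented for this example.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

object PairTransformations {
  def labelledCounts(sc: SparkContext): Array[(String, (Int, String))] = {
    val lines = sc.parallelize(Seq("a b a", "b c"))

    // flatMap: each line yields zero or more words.
    val words = lines.flatMap(_.split(" "))

    // map to (K, V) pairs, then reduceByKey sums the values for each key.
    val counts = words.map(w => (w, 1)).reduceByKey(_ + _)

    // join pairs up values that share a key across two pair RDDs;
    // keys present on only one side (here "c") are dropped.
    val labels = sc.parallelize(Seq(("a", "first"), ("b", "second")))
    val joined = counts.join(labels)

    // sortByKey orders by key; collect (an action) finally runs the chain.
    joined.sortByKey().collect()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("PairTransformations").master("local[*]").getOrCreate()
    labelledCounts(spark.sparkContext).foreach(println)
    // (a,(2,first))
    // (b,(2,second))
    spark.stop()
  }
}
```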

## Another Example

Let's assume that we have an `RDD[String]` which contains gigabytes of
logs collected over the previous year. Each element of this `RDD` represents
one line of logging.
Assume that dates come in the form `YYYY-MM-DD:HH:MM:SS`, and that errors
are logged with a prefix that includes the word "error".

How would you determine the number of errors that were logged in
December 2016?
```scala
val lastYearsLogs: RDD[String] = ...
val numDecErrorLogs =
  lastYearsLogs
    .filter(lg => lg.contains("2016-12") && lg.contains("error"))
    .count()
```
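
To try this out without a year of real logs, the same filter can be run on a handful of made-up lines. The sketch below assumes Spark in local mode; the object name `DecemberErrors`, the helpers `isDec2016Error` and `countDecErrors`, and the four sample log lines are hypothetical.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object DecemberErrors {
  // Hypothetical helper: true for lines that mention both "2016-12" and "error".
  def isDec2016Error(line: String): Boolean =
    line.contains("2016-12") && line.contains("error")

  // filter is a transformation; count is the action that runs it.
  def countDecErrors(logs: RDD[String]): Long =
    logs.filter(isDec2016Error).count()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("DecemberErrors").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // Made-up sample data standing in for last year's logs.
    val logs = sc.parallelize(Seq(
      "2016-12-01:10:15:00 error: disk full",
      "2016-12-02:08:00:00 info: started",
      "2016-11-30:23:59:59 error: timeout",
      "2016-12-24:12:00:00 error: connection lost"
    ))

    println(countDecErrors(logs)) // 2
    spark.stop()
  }
}
```

Note that `contains("2016-12")` matches the substring anywhere in the line, so a stricter check (for example `startsWith`) may be preferable if the message text can itself contain dates.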

Spark computes RDDs the first time they are used in an action.
This helps when processing large amounts of data.
Example:

```scala
val lastYearsLogs: RDD[String] = ...
val firstLogsWithErrors = lastYearsLogs.filter(_.contains("ERROR")).take(10)
```
The execution of `filter` is deferred until the `take` action is applied.
Spark leverages this by analyzing and optimizing the chain of operations before
executing it.

Spark will not compute intermediate RDDs in full. Instead, as soon as 10 elements of the
filtered RDD have been computed, `firstLogsWithErrors` is done. At this point Spark
stops working, saving the time and space that would otherwise be spent computing the unused remainder of the `filter` result.
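
One way to observe this behaviour is to count how many lines `filter` actually examines. The sketch below does so with a `LongAccumulator`; it assumes Spark 2.0 or later in local mode, and the object name `TakeIsLazy`, the helper `firstErrors`, and the synthetic log lines are made up for this example. Accumulator updates made inside a transformation are used here purely for observation, since Spark only guarantees exactly-once updates for accumulators used in actions.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

object TakeIsLazy {
  // Returns the first 10 "ERROR" lines plus the number of lines filter examined.
  def firstErrors(sc: SparkContext, totalLines: Int): (Array[String], Long) = {
    // Synthetic stand-in for a large log: every second line is an error.
    val logs = sc.parallelize(1 to totalLines, numSlices = 100)
      .map(i => if (i % 2 == 0) s"ERROR line $i" else s"INFO line $i")

    // Accumulator used only to observe how much work filter does.
    val examined = sc.longAccumulator("examined")

    val firstLogsWithErrors = logs
      .filter { line => examined.add(1); line.contains("ERROR") }
      .take(10) // action: triggers the computation

    (firstLogsWithErrors, examined.value.longValue())
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("TakeIsLazy").master("local[*]").getOrCreate()
    val (errors, examined) = firstErrors(spark.sparkContext, 1000000)
    println(s"found ${errors.length} errors after examining $examined of 1000000 lines")
    spark.stop()
  }
}
```

On this data the number of examined lines should come out far below the total, because `take(10)` stops pulling elements once it has enough.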

<table class="table">
<tbody><tr><th>Action</th><th>Meaning</th></tr>
<tr>
  <td> <b>reduce</b>(<i>func</i>) </td>
  <td> Aggregate the elements of the dataset using a function <i>func</i> (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel. </td>
</tr>
<tr>
  <td> <b>collect</b>() </td>
  <td> Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data. </td>
</tr>
<tr>
  <td> <b>count</b>() </td>
  <td> Return the number of elements in the dataset. </td>
</tr>
<tr>
  <td> <b>first</b>() </td>
  <td> Return the first element of the dataset (similar to take(1)). </td>
</tr>
<tr>
  <td> <b>take</b>(<i>n</i>) </td>
  <td> Return an array with the first <i>n</i> elements of the dataset. Note that this is currently not executed in parallel. Instead, the driver program computes all the elements. </td>
</tr>
<tr>
  <td> <b>takeSample</b>(<i>withReplacement</i>, <i>num</i>, <i>seed</i>) </td>
  <td> Return an array with a random sample of <i>num</i> elements of the dataset, with or without replacement, using the given random number generator seed. </td>
</tr>
<tr>
  <td> <b>saveAsTextFile</b>(<i>path</i>) </td>
  <td> Write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark will call toString on each element to convert it to a line of text in the file. </td>
</tr>
<tr>
  <td> <b>saveAsSequenceFile</b>(<i>path</i>) </td>
  <td> Write the elements of the dataset as a Hadoop SequenceFile in a given path in the local filesystem, HDFS or any other Hadoop-supported file system. This is only available on RDDs of key-value pairs that either implement Hadoop's Writable interface or are implicitly convertible to Writable (Spark includes conversions for basic types like Int, Double, String, etc). </td>
</tr>
<tr>
  <td> <b>countByKey</b>() </td>
  <td> Only available on RDDs of type (K, V). Returns a <code>Map</code> of (K, Long) pairs with the count of each key. </td>
</tr>
<tr>
  <td> <b>foreach</b>(<i>func</i>) </td>
  <td> Run a function <i>func</i> on each element of the dataset. This is usually done for side effects such as updating an accumulator variable (see below) or interacting with external storage systems. </td>
</tr>
</tbody></table>
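
A few of these actions are exercised in the sketch below on tiny in-memory RDDs. It assumes Spark in local mode; the object name `ActionsDemo` and the sample data are invented for illustration.

```scala
import org.apache.spark.sql.SparkSession

object ActionsDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ActionsDemo").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    val nums = sc.parallelize(Seq(5, 3, 8, 1))

    // Each call below is an action and returns a plain value to the driver.
    println(nums.reduce(_ + _))            // 17
    println(nums.count())                  // 4
    println(nums.first())                  // 5
    println(nums.take(2).mkString(", "))   // 5, 3
    println(nums.collect().mkString(", ")) // 5, 3, 8, 1

    // countByKey is only available on RDDs of pairs.
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
    val perKey = pairs.countByKey()
    println(perKey("a"))                   // 2

    spark.stop()
  }
}
```

`collect` and `countByKey` bring their whole result to the driver, so they are best reserved for results known to be small.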