# Introduction to Spark

This course has been designed with Spark 2.0.1 (Oct 2016).

**Objectives**

- Describe the advantages/disadvantages of Spark compared to Hadoop MapReduce
- Define what an RDD is, by its properties and operations
- Explain the difference between transformations and actions on an RDD
- Implement the different transformations through use cases
- Describe what persisting/caching an RDD means, and situations where this is useful

# 1. Key Concepts

## 1.1. MapReduce vs Spark

<img src="images/apache_hadoop_ecosystem.jpg" width="500">

**Hadoop MapReduce limits:**
- your job has to fit the `<key,value>` paradigm
- no interactions (except by programming)
- each job reads from disk: a problem for iterative algorithms (machine learning)

**How Spark answers this:**
- Spark proposes other processing workflows than MapReduce
- highly efficient distributed operations
- Spark runs in memory and on disk
- According to the Spark project's own claims, it can be up to 100x faster than Hadoop MapReduce in memory, and 10x faster on disk.
- Spark keeps everything in memory when possible, and uses a lot of it.

<img src="images/spark_ecosystem.png" width="500">


## 1.2. Resilient Distributed Datasets (RDD)

<img src="images/rdd_on_cluster.png" width="200" align="right">
\[[Image Source](http://horicky.blogspot.com/2015/02/big-data-processing-in-spark.html)\]

- created from HDFS, S3, HBase, JSON, text, local... or transformed from another RDD
- distributed across the cluster, partitioned (atomic chunks of data)
- can recover from errors (node failure, slow process)
- traceability of each partition, can re-run the processing
- **immutable** : you cannot *modify* an RDD in place

## 1.3. A "functional programming paradigm" and DAGs

RDDs are **immutable** ! You can only **transform** an existing RDD into another one.

Spark provides many transformation functions. By chaining these functions, you construct a **Directed Acyclic Graph** (DAG).

<img src="images/dag.png">
\[[Image Source]()\]

When you use them, these functions are passed from the **client** to the **master**, which then distributes them to the workers, which apply them across their partitions of the RDD.
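
To make immutability and the recorded chain of transformations concrete, here is a minimal plain-Python sketch. `ToyRDD` is a hypothetical stand-in written for this illustration, not Spark code: it only mimics the idea that each transformation returns a new object carrying its lineage, and that nothing runs until a result is requested. A linear chain like this one is the simplest possible DAG.

```python
class ToyRDD(object):
    """Hypothetical stand-in for an RDD: fixed data plus a recorded lineage."""

    def __init__(self, source, lineage=()):
        self._source = tuple(source)    # never modified after creation
        self._lineage = tuple(lineage)  # chain of (name, func) steps

    def map(self, func):
        # returns a NEW ToyRDD; self is left untouched
        return ToyRDD(self._source, self._lineage + (("map", func),))

    def filter(self, func):
        return ToyRDD(self._source, self._lineage + (("filter", func),))

    def collect(self):
        # only now is the recorded chain actually executed
        rows = list(self._source)
        for name, func in self._lineage:
            if name == "map":
                rows = [func(row) for row in rows]
            else:
                rows = [row for row in rows if func(row)]
        return rows


numbers = ToyRDD([1, 2, 3, 4, 5])
evens_squared = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

print(numbers.collect())        # the original dataset is unchanged
print(evens_squared.collect())  # the two recorded steps run only here
```

Because the lineage is kept alongside the data, a lost result can always be recomputed from the source, which is the idea behind RDD fault recovery.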


## 1.4. Spark architecture : from your coding hands to the cluster

<img src="images/from_rdd_to_cluster.png">
\[[Image Source]()\]

You construct your sequence of transformations in Python.
Spark's functional programming interface builds up a **DAG**.
This DAG is sent by the **driver** to the **cluster manager** for execution.

## 1.5. Jargon

Excerpt taken from \[[Arush Kharbanda](https://www.quora.com/What-exactly-is-Apache-Spark-and-how-does-it-work) on Quora\]

**Job**: A piece of code which reads some input  from HDFS or local, performs some computation on the data and writes some output data.

**Stages**: Jobs are divided into stages. Stages are classified as a Map or reduce stages(Its easier to understand if you have worked on Hadoop and want to correlate). Stages are divided based on computational boundaries, all computations(operators) cannot be Updated in a single Stage. It happens over many stages.

**Tasks**: Each stage has some tasks, one task per partition. One task is executed on one partition of data on one executor(machine).

**DAG**: DAG stands for Directed Acyclic Graph, in the present context its a DAG of operators.

**Executor**: The process responsible for executing a task.

**Driver**: The program/process responsible for running the Job over the Spark Engine

**Master**: The machine on which the Driver program runs

**Slave/Worker**: The machine on which the Executor program runs

# 2. Operational Spark in Python

<img src="images/spark_flow.png" width="500">

We'll proceed along the usual Spark flow (see above).
1. create the environment to run Spark from Python
2. extract RDDs from files
3. run some transformations
4. execute actions to obtain values (local objects in python)

**Brainstorming**: So, let's suppose you have this thing called an RDD, which is just basically a dataset made of rows and values. What are all the operations you'd like to do to that RDD ?

In [None]:
# put your ideas here...


## 2.1. Initializing a `SparkContext` in Python

IPython / IPython notebook can be a *client* to interact with the *master*.

The client will have a `SparkContext` that:

1. Acts as a gateway between the client and Spark master
2. Sends code/data from IPython to the master (who then sends it to the workers)

<img src="images/spark_driver_etc.png"/>

Using:

```python
import pyspark as ps
sc = ps.SparkContext('local[4]')
```

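will start Spark in *"local"* mode: the driver and the executors run on your machine, using 4 worker threads (use `'local[*]'` to get as many threads as there are cores).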

In [None]:
import pyspark as ps    # for the pyspark suite
import warnings         # for displaying warnings

In [None]:
try:
    # we try to create a SparkContext to work locally with 4 worker threads
    sc = ps.SparkContext('local[4]')
    print("Just created a SparkContext")
except ValueError:
    # give a warning if SparkContext already exists (for use inside pyspark)
    warnings.warn("SparkContext already exists in this scope")

## 2.2. Creating an RDD (from files)

RDDs are **immutable**. Once created, you cannot modify them directly. You can only transform them into another RDD. 

Functions for creating an RDD from an external source are methods of the SparkContext object `sc`.

| Method | Description |
| - | - |
| [`sc.parallelize(array)`]() | Create an RDD from a python array or list |
| [`sc.textFile(path)`]() | Create an RDD from a text file |
| [`sc.pickleFile(path)`]() | Create an RDD from a pickle file |

### 2.2.1. Creating RDDs from local files

#### `sc.parallelize()` : create an RDD from a python array/list

In [None]:
# creating an adhoc list
data_array = [['matthew', 4],
              ['jorge', 8],
              ['josh', 15],
              ['evangeline', 16],
              ['emilie', 23],
              ['yunjin', 42]]

# reading the array/list using SparkContext
rdd = sc.parallelize(data_array)

# to output the content in python [irl, use with great care]
rdd.collect()

#### `sc.textFile()` : from a text file !

The import will give you an rdd made of **strings which are lines of the text file**.

In [None]:
# displaying the content of the file in stdout
with open('data/toy_data.txt', 'r') as fin:
    print(fin.read())

# reading the file using SparkContext
rdd = sc.textFile('data/toy_data.txt')

# to output the content in python [irl, use collect() with great care]
rdd.collect()

#### <span style="color:red">`sc.pickleFile()` : from an HDFS pickle file

The import will give you an rdd composed of whatever table was stored in that file.</span>

In [None]:
%ls data/

In [None]:
# reading the file using SparkContext
rdd = sc.pickleFile('data/toy_data.pkl')

# to output the content in python [irl, use with great care]
rdd.collect()

### 2.2.2. Creating RDDs from S3

The two file-reading functions above (`sc.textFile()` and `sc.pickleFile()`) can load from an S3 repository too ! Effortless.

<span style="color:red">Warning: don't .collect() that, or you'll break the internet !</span>

**Note**: in order to do that, you need to have launched jupyter with the `--packages` option for aws and hadoop.

In [None]:
import os

# obtaining your credentials from your environment variables
ACCESS_KEY = os.environ['AWS_ACCESS_KEY_ID']
SECRET_KEY = os.environ['AWS_SECRET_ACCESS_KEY']

# link to the S3 repository
link = 's3n://{}:{}@mortar-example-data/airline-data'.format(ACCESS_KEY, SECRET_KEY)

# creating an RDD...
rdd = sc.textFile(link)

Wow, this repository has like 5 million rows, but this was so fast ! right ?

Not really, Spark is just **lazy**: it only executes the operations when necessary. For instance, when we call for an Action (see below).

In [None]:
# find out how many partitions there are...
rdd.getNumPartitions()

In [None]:
rdd.count()

## 2.3. Transformations : transforming an RDD into another

- They are **lazy**: Spark doesn't apply the transformation right away, it just builds on the **DAG**
- They transform an RDD into another RDD because RDDs are **immutable**.
- They can be **wide** or **narrow** (whether they shuffle partitions or not).
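
The narrow/wide distinction can be sketched in plain Python, with a list of lists standing in for the partitions of an RDD. The `bucket` function below is a made-up partitioner used only for this illustration; Spark's own partitioning logic is not reproduced here.

```python
# three "partitions" of <k,v> rows (plain lists, no Spark involved)
partitions = [[("CA", 1), ("WA", 1)],
              [("CA", 2), ("OR", 1)],
              [("CA", 5), ("OR", 1)]]

# narrow: each output partition depends on exactly one input partition,
# so every partition can be processed where it already lives
narrow = [[(state, amount * 10) for state, amount in part]
          for part in partitions]

# wide: rows must be regrouped by key, so an output partition may need
# rows from every input partition (this regrouping is the shuffle)
num_out = 2

def bucket(key):
    # hypothetical partitioner: always sends a given key to the same bucket
    return sum(ord(char) for char in key) % num_out

shuffled = [[] for _ in range(num_out)]
for part in partitions:
    for state, amount in part:
        shuffled[bucket(state)].append((state, amount))

print(narrow)
print(shuffled)
```

After the shuffle, all rows sharing a key sit in the same partition, which is what key-based transformations such as `.reduceByKey()` or `.join()` rely on.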

<img src="images/rdd_narrow_vs_wide_transformations.png" width="400"/>
\[[Image Source](http://horicky.blogspot.com/2013/12/spark-low-latency-massively-parallel.html)\]



| Method | Type | Category | Description |
| - | - | - | - |
| [`.map(func)`](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.map) | transformation | mapping | Return a new RDD by applying a function to each element of this RDD. |
| [`.flatMap(func)`](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.flatMap) | transformation | mapping | Return a new RDD by first applying a function to all elements of this RDD, and then flattening the results. |
| [`.filter(func)`](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.filter) | transformation | reduction |  Return a new RDD containing only the elements that satisfy a predicate. |
| [`.sample()`](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.sample) | transformation | reduction | Return a sampled subset of this RDD. |
| [`.distinct()`](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.distinct) | transformation | reduction |  Return a new RDD containing the distinct elements in this RDD. |
| [`.keys()`](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.keys) | transformation | `<k,v>` | Return an RDD with the keys of each tuple. |
| [`.values()`](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.values) | transformation | `<k,v>` | Return an RDD with the values of each tuple. |
| [`.join(rddB)`](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.join) | transformation | `<k,v>` | Return an RDD containing all pairs of elements with matching keys in self and other. Each pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in self and (k, v2) is in other. |
| [`.reduceByKey()`](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.reduceByKey) | transformation | `<k,v>` | Merge the values for each key using an associative and commutative reduce function. |
| [`.groupByKey()`](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.groupByKey) | transformation | `<k,v>` | Group the values for each key in the RDD into a single sequence, which you can then process with any function (even a non-associative one, like mean). |
| [`.sortBy(keyfunc)`](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.sortBy) | transformation | sorting |  Sorts this RDD by the given keyfunc. |
| [`.sortByKey()`](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.sortByKey) | transformation | sorting/`<k,v>` | Sorts this RDD, which is assumed to consist of (key, value) pairs. |



### 2.3.1. Applying transformations and chaining them

Recall the spark flow:

<img src="images/spark_flow.png" width="500">

In the cells below, we will chain the following steps:
1. read an RDD from a text file
2. transform by applying `split`
3. transform by filtering
4. transform by casting some columns to their corresponding type.
5. use an action to output the results

Each transformation is a method of an RDD, and returns another RDD.

In [None]:
# displaying the content of the file in stdout
with open('data/sales.txt', 'r') as fin:
    print(fin.read())

*Recall: Input functions, reading RDDs from files, are functions of the SparkContext.*

In [None]:
# reads a text file line by line
rdd1 = sc.textFile('data/sales.txt')

rdd1.collect()

In [None]:
# applies split() to each row
rdd2 = rdd1.map(lambda rowstr : rowstr.split())

rdd2.collect()

In [None]:
# filters rows
rdd3 = rdd2.filter(lambda row: not row[0].startswith('#'))

rdd3.collect()

In [None]:
def casting_function((id, date, store, state, product, amount)):
    return((int(id), date, int(store), state, int(product), float(amount)))

# applies casting_function to rows
rdd4 = rdd3.map(casting_function)

# shows the result
rdd4.collect()

**Now, let's see the canonical way to write that in Python...**

In [None]:
rdd_sales = sc.textFile('data/sales.txt')

rdd_sales.collect()

In [None]:
rdd_sales = sc.textFile('data/sales.txt')\
        .map(lambda rowstr : rowstr.split())   # <= JUST ADDED THIS HERE

rdd_sales.collect()

In [None]:
rdd_sales = sc.textFile('data/sales.txt')\
        .map(lambda rowstr : rowstr.split())\
        .filter(lambda row: not row[0].startswith('#'))    # <= JUST ADDED THIS HERE

rdd_sales.collect()

In [None]:
def casting_function((id, date, store, state, product, amount)):
    return((int(id), date, int(store), state, int(product), float(amount)))

rdd_sales = sc.textFile('data/sales.txt')\
        .map(lambda rowstr : rowstr.split())\
        .filter(lambda row: not row[0].startswith('#'))\
        .map(casting_function)   # <= JUST ADDED THIS HERE

rdd_sales.collect()

<span style="color:red">FROM NOW ON WE'LL RELY ON THESE TWO RDDs</span>

In [None]:
# creating an adhoc list
data_array = [['matthew', 4],
              ['jorge', 8],
              ['josh', 15],
              ['evangeline', 16],
              ['emilie', 23],
              ['yunjin', 42]]

# reading the array/list using SparkContext
rdd_names = sc.parallelize(data_array)

# to output the content in python [irl, use with great care]
rdd_names.collect()

In [None]:
def casting_function((id, date, store, state, product, amount)):
    return((int(id), date, int(store), state, int(product), float(amount)))

rdd_sales = sc.textFile('data/sales.txt')\
        .map(lambda x : x.split())\
        .filter(lambda x: not x[0].startswith('#'))\
        .map(casting_function)

rdd_sales.collect()

### 2.3.2. Mapping

#### `.map(func)` : applying a function on every row

In [None]:
# applying a lambda function to an rdd
rddout = rdd_names.map(lambda x : len(x[0]))

# print out the original rdd
print("before: {}".format(rdd_names.collect()))

# print out the new rdd generated
print("after: {}".format(rddout.collect()))

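When using lambda functions, use **argument unpacking** to provide a more readable transformation. Note that unpacking a tuple directly in the parameter list, as done below, is Python 2 syntax: it was removed in Python 3 (PEP 3113).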

In [None]:
# applying a lambda function to an rdd
rddout = rdd_names.map(lambda (name,number) : len(name))

# print out the original rdd
print("before: {}".format(rdd_names.collect()))

# print out the new rdd generated
print("after: {}".format(rddout.collect()))
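
The `lambda (name, number): ...` form only works in Python 2. As a sketch of two alternatives that work in both Python 2 and Python 3, the example below uses the built-in `map` on a plain list as a stand-in for `rdd.map()`; the helper `name_length` is a name chosen for this illustration.

```python
rows = [['matthew', 4], ['jorge', 8], ['josh', 15]]

# option 1: index into the row (compact, but less readable)
by_index = list(map(lambda row: len(row[0]), rows))

# option 2: a named function that unpacks the row inside its body
def name_length(row):
    name, number = row
    return len(name)

by_function = list(map(name_length, rows))

print(by_index)
print(by_function)
```

Either function could be passed to `rdd_names.map(...)` in place of the tuple-unpacking lambda.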

#### `.flatMap(func)` : applying a function on every row and flattening the resulting lists

In [None]:
# applying a lambda function to an rdd (because why not)
rddout = rdd_names.flatMap(lambda (name,number) : [number, number+2, number+len(name)])

# print out the original rdd
print("before: {}".format(rdd_names.collect()))

# print out the new rdd generated
print("after: {}".format(rddout.collect()))
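
To see what "flattening" means, this plain-Python sketch applies the same function once in the style of `.map()` and once in the style of `.flatMap()`. It runs on a regular list, with `itertools.chain` playing the role of the flattening step; `expand` is a helper named for this illustration.

```python
from itertools import chain

rows = [['matthew', 4], ['jorge', 8]]

def expand(row):
    # same logic as the lambda above, written without tuple parameters
    name, number = row
    return [number, number + 2, number + len(name)]

# map-like: one output element (here, a list) per input row
mapped = [expand(row) for row in rows]

# flatMap-like: the per-row lists are concatenated into one flat sequence
flat_mapped = list(chain.from_iterable(expand(row) for row in rows))

print(mapped)       # [[4, 6, 11], [8, 10, 13]]
print(flat_mapped)  # [4, 6, 11, 8, 10, 13]
```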

### 2.3.3. Row reduction

#### `.filter(func)`: filters an RDD using a function that returns boolean values

In [None]:
# filtering an rdd
rddout = rdd_sales.filter(lambda (i, date, store, state, pdt, amnt): (state == 'CA'))

# print out the original rdd
print("before: {}".format(rdd_sales.collect()))

# print out the new rdd generated
print("after: {}".format(rddout.collect()))

#### `.sample(withReplacement, fraction, seed)`: sampling an RDD !!

In [None]:
# sampling an rdd
rddout = rdd_sales.sample(True, 0.4)

# print out the original rdd
print("before: {}".format(rdd_sales.collect()))

# print out the new rdd generated
print("after: {}".format(rddout.collect()))
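
The `fraction` argument is an expected proportion, not an exact one, so the size of the sample can vary from run to run. The sketch below illustrates this for sampling *without* replacement, under the assumption that each row is kept independently with probability `fraction`; `bernoulli_sample` is a hypothetical helper in plain Python, not Spark's sampler.

```python
import random

def bernoulli_sample(rows, fraction, seed):
    # keep each row independently with probability `fraction`
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]

rows = list(range(100))
sample_a = bernoulli_sample(rows, 0.4, seed=1)
sample_b = bernoulli_sample(rows, 0.4, seed=2)

# sizes are close to 40 but usually not exactly 40
print("sizes: {} and {}".format(len(sample_a), len(sample_b)))

# the same seed always gives the same sample
print(bernoulli_sample(rows, 0.4, seed=1) == sample_a)
```

Passing the `seed` argument to `.sample()` serves the same purpose: it makes a sampled RDD reproducible.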

#### `.distinct()`: obtaining distinct rows

In [None]:
# obtaining distinct values of the "state" column of rdd_sales
rddout = rdd_sales.map(lambda (i, date, store, state, pdt, amnt): state)\
                    .distinct()

# print out the original rdd
print("before: {}".format(rdd_sales.collect()))

# print out the new rdd generated
print("after: {}".format(rddout.collect()))

### 2.3.4. Methods with a `<k,v>` paradigm

#### `.values()`: returns the values of a RDD made of `<k,v>` pairs

In [None]:
# extracting the values of each <k,v> pair
rddout = rdd_names.values()

# print out the original rdd
print("before: {}".format(rdd_names.collect()))

# print out the new rdd generated
print("after: {}".format(rddout.collect()))

#### `.keys()`: returns the keys of a RDD made of `<k,v>` pairs

In [None]:
# extracting the keys of each <k,v> pair
rddout = rdd_names.keys()

# print out the original rdd
print("before: {}".format(rdd_names.collect()))

# print out the new rdd generated
print("after: {}".format(rddout.collect()))

#### `rddA.join(rddB)`: join another RDD

In [None]:
rdd_salesperstate = rdd_sales.map(lambda (i, date, store, state, pdt, amnt): (state,amnt))

rdd_salesperstate.collect()

In [None]:
# creating an adhoc list of managers for each state
data_array = [['CA', 'matthew'],
              ['OR', 'jorge'],
              ['WA','matthew'],
              ['TX', 'emilie']]

# reading the array/list using SparkContext
rdd_managers = sc.parallelize(data_array)

# joining sales and managers on their key (the state), then outputting the content in python
rdd_salesperstate.join(rdd_managers).collect()
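
The join above is an inner join: only keys present in both RDDs survive, and each match is returned as `(k, (v1, v2))`. Here is a plain-Python sketch of that behaviour; `inner_join` is a hypothetical helper and the sales amounts are invented sample values.

```python
def inner_join(left, right):
    # index the right-hand side by key
    right_by_key = {}
    for key, value in right:
        right_by_key.setdefault(key, []).append(value)
    # emit one (k, (v1, v2)) tuple per matching pair of rows
    return [(key, (left_value, right_value))
            for key, left_value in left
            for right_value in right_by_key.get(key, [])]

# invented sample data, shaped like rdd_salesperstate and rdd_managers
sales = [('CA', 10.0), ('WA', 5.0), ('NY', 7.0), ('CA', 2.5)]
managers = [('CA', 'matthew'), ('OR', 'jorge'), ('WA', 'matthew'), ('TX', 'emilie')]

joined = inner_join(sales, managers)
print(joined)
```

`NY` has no manager and `OR`/`TX` have no sales in this sample, so none of them appear in the result. In Spark, the order of the joined rows is not guaranteed.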

#### `.reduceByKey(func)`: reduce `v`s by their `k` by applying func (what ?)

The `func` here needs to be associative and commutative... can you guess why ?

In [None]:
# creating an adhoc list
data_array = [['CA', 1],
              ['WA', 1],
              ['CA', 2],
              ['OR', 1],
              ['CA', 5],
              ['OR', 1]]

# reading the array/list using SparkContext
rdd = sc.parallelize(data_array)

# to output the content in python [irl, use with great care]
rdd.collect()

In [None]:
rdd.reduceByKey(lambda v1,v2 : v1+v2).collect()
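
One way to see why `func` must be associative and commutative: values are combined within each partition first, and the partial results are merged afterwards, so the grouping and order of the calls depend on how the data happens to be partitioned. The plain-Python sketch below mimics this two-step reduction; `reduce_by_key` and the two splits are written for this illustration only.

```python
def reduce_by_key(partitions, func):
    """Toy two-step reduction: inside each partition, then across partitions."""
    merged = {}
    for part in partitions:
        local = {}
        for key, value in part:
            local[key] = func(local[key], value) if key in local else value
        for key, value in local.items():
            merged[key] = func(merged[key], value) if key in merged else value
    return merged

rows = [('CA', 1), ('WA', 1), ('CA', 2), ('OR', 1), ('CA', 5), ('OR', 1)]

# the same rows, partitioned in two different ways
split_a = [rows[:3], rows[3:]]
split_b = [rows[:1], rows[1:5], rows[5:]]

def add(v1, v2):
    return v1 + v2

def subtract(v1, v2):
    return v1 - v2

# addition: same answer whatever the partitioning
print(reduce_by_key(split_a, add) == reduce_by_key(split_b, add))

# subtraction: the answer depends on the partitioning
print(reduce_by_key(split_a, subtract))
print(reduce_by_key(split_b, subtract))
```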

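#### `.groupByKey()`: group `v`s by their `k` (again ?)

`.groupByKey()` takes no function: it gathers all the values of each key into an iterable, which you can then process with any function, including non-associative ones such as the mean. Because every value has to be shuffled, prefer `.reduceByKey()` when your function is associative and commutative.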

In [None]:
# creating an adhoc list
data_array = [['CA', 1],
              ['WA', 1],
              ['CA', 2],
              ['OR', 1],
              ['CA', 5],
              ['OR', 1]]

# reading the array/list using SparkContext
rdd = sc.parallelize(data_array)

# to output the content in python [irl, use with great care]
rdd.collect()

In [None]:
def mean(iterator):
    total = 0.0; count = 0
    for x in iterator:
        total += x; count += 1
    return total / count

rdd.groupByKey()\
    .map(lambda (state, iterator): (state, mean(iterator)))\
    .collect()
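
The mean itself is not associative, but it can be rebuilt from two quantities that are: a running sum and a running count. The plain-Python sketch below shows that idea on the same data; `combine` is a helper named for this illustration. In PySpark, the same pattern could be written with `.mapValues()` followed by `.reduceByKey()`, which avoids grouping all the values of a key in one place.

```python
rows = [['CA', 1], ['WA', 1], ['CA', 2], ['OR', 1], ['CA', 5], ['OR', 1]]

def combine(pair_a, pair_b):
    # adding (sum, count) pairs is associative and commutative
    return (pair_a[0] + pair_b[0], pair_a[1] + pair_b[1])

totals = {}
for state, value in rows:
    pair = (float(value), 1)
    totals[state] = combine(totals[state], pair) if state in totals else pair

# the division happens once per key, at the very end
means = dict((state, total / count) for state, (total, count) in totals.items())

print(means)
```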

### 2.3.5. Sorting methods

#### `.sortBy(keyfunc)`: sorting by the value of a function on rows

In [None]:
# sorting by any function (because why not?)
rddout = rdd_names.sortBy(lambda (name,number) : (13-number)**2, ascending=True)

# print out the original rdd
print(rdd_names.collect())

# print out the new rdd generated
print(rddout.collect())

#### `.sortByKey()`: sorting by key on a `<k,v>` RDD

In [None]:
# sorting k,v pairs by key
rddout = rdd_names.sortByKey(ascending=False)

# print out the original rdd
print(rdd_names.collect())

# print out the new rdd generated
print(rddout.collect())

## 2.4. Actions : turning your RDD into something else (local object)

Actions are specific methods of an RDD object; they are usually designed to turn an RDD into something else (a Python object, or a statistic).

When used/executed in IPython or in a notebook, they **launch the processing of the DAG**. This is where Spark stops being **lazy**. This is where your script will take time to execute.

| Method | Type | Description |
| - | - | - |
| [`.collect()`](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.collect) | action | Return a list that contains all of the elements in this RDD. Note that this method should only be used if the resulting array is expected to be small, as all the data is loaded into the driver’s memory. |
| [`.count()`](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.count) | action | Return the number of elements in this RDD. |
| [`.take(n)`](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.take) | action | Take the first `n` elements of the RDD. |
| [`.top(n)`](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.top) | action | Get the top `n` elements from a RDD. It returns the list sorted in descending order. |
| [`.first()`](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.first) | action | Return the first element in a RDD. |
| [`.sum()`](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.sum) | action | Add up the elements in this RDD. |
| [`.mean()`](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.mean) | action | Compute the mean of this RDD’s elements. |
| [`.stdev()`](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.stdev) | action | Compute the standard deviation of this RDD’s elements. |

In [None]:
# creating an adhoc list
data_array = [['matthew', 4],
              ['jorge', 8],
              ['josh', 15],
              ['evangeline', 16],
              ['emilie', 23],
              ['yunjin', 42]]

# reading the array/list using SparkContext
rdd_names = sc.parallelize(data_array)

### 2.4.1. Actions that return portions of an RDD

#### `.collect()` : returning the *full* content of an RDD to "python space"

Returns the rows of an RDD as a list. This can be a bad idea if your RDD is gigantic, because `.collect()` will return everything and put it in the driver's memory for Python to process.

In [None]:
# to output the content in python
collected = rdd_names.collect()

# let's check the type of RDD
print("type of rdd: {}".format(type(rdd_names)))

# let's check the type of what's collected
print("type of rdd_collected: {}".format(type(collected)))

# let's print the collected content
print(collected)

#### `.take(n)` : returning the first n lines of an RDD

Returns the first `n` rows of an RDD as a list. These `n` rows are not randomly selected: Spark scans the partitions in order and stops as soon as it has collected `n` rows.

In [None]:
# to output the content in python
taken = rdd_names.take(2)

# let's check the type of what's collected
print("type of rdd_taken: {}".format(type(taken)))

# let's print the collected content
print(taken)

#### `.first()` : returning the first line of an RDD

In [None]:
print(rdd_names.first())

### 2.4.2. Actions that compute some statistics

#### `.count()` : count the number of lines

In [None]:
print(rdd_names.count())

#### `.sum()`: summing every line in an RDD

(The RDD must contain summable values)

In [None]:
print(rdd_names.values().sum())

#### `.mean()`: averaging every line in an RDD

(The RDD must contain summable values)

In [None]:
print(rdd_names.values().mean())

#### `.stdev()`: you get that right ?

In [None]:
print(rdd_names.values().stdev())
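
A detail worth knowing: as described in the PySpark API documentation, `.stdev()` is the population standard deviation (it divides by N), while `.sampleStdev()` divides by N-1. The plain-Python sketch below computes both on the values of `rdd_names`, so you can check which one a given result corresponds to.

```python
import math

values = [4, 8, 15, 16, 23, 42]

mean = sum(values) / float(len(values))
squared_deviations = sum((value - mean) ** 2 for value in values)

# population standard deviation: divide by N
population_stdev = math.sqrt(squared_deviations / len(values))

# sample standard deviation: divide by N - 1
sample_stdev = math.sqrt(squared_deviations / (len(values) - 1))

print("population: {:.4f}".format(population_stdev))
print("sample:     {:.4f}".format(sample_stdev))
```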

# 3. Let's design chains of transformations together !

## 3.1. Computing sales per state

### Input RDD

In [None]:
def casting_function((id, date, store, state, product, amount)):
    return((int(id), date, int(store), state, int(product), float(amount)))

rdd_sales = sc.textFile('data/sales.txt')\
        .map(lambda x : x.split())\
        .filter(lambda x: not x[0].startswith('#'))\
        .map(casting_function)

rdd_sales.collect()

### Task

You want to obtain an RDD of the states, sorted by total sales amount, with the highest-selling state first.

What transformations do you need to apply ?
What would the workflow of these transformations look like if you had to draw it ?

### Code

In [None]:
rddout = rdd_sales # apply transformation here...

rddout.collect()

## Solution (use your mouse to uncover)

<span style="color:white;font-family:'Courier New'"><br/>
rddout = rdd_sales.map(lambda x: (x[3],x[5]))\<br/>
    .reduceByKey(lambda amount1,amount2: amount1+amount2)\<br/>
    .sortBy(lambda state_amount:state_amount[1],ascending=False)<br/>
<br/>
rddout.collect()<br/>
</span>

## 3.2. Word count (again)

### Input RDD

In [None]:
# displaying the content of the file in stdout
with open('data/input.txt', 'r') as fin:
    print(fin.read())

# reading the file using SparkContext

rdd = sc.textFile('data/input.txt')

### Task
What transformations do you need to apply ?
What would the workflow of these transformations look like if you had to draw it ?

### Code

In [None]:
rddout = rdd # apply transformation here...

# collect the result
rddout.collect()

## Solution (use your mouse to uncover)

<span style="color:white;font-family:'Courier New'">
rddout = rdd.flatMap(lambda str : str.split())\<br/>
            .map(lambda word: (word,1))\<br/>
            .reduceByKey(lambda v1,v2: v1+v2)<br/>
<br/>
rddout.collect()<br/>
</span>

## 3.3. Find the date on which AAPL's stock price was the highest

### Input RDD

In [None]:
import urllib2

link = "http://chart.finance.yahoo.com/table.csv?s=AAPL&a=9&b=25&c=2015&d=9&e=25&f=2016&g=d&ignore=.csv"
response = urllib2.urlopen(link)
csv_raw = response.read()
csv_lines = csv_raw.split('\n')

In [None]:
# let's see the first 5 lines
csv_lines[0:5]

In [None]:
# let's see the 5 last lines
csv_lines[-5:]

In [None]:
rdd_aapl = sc.parallelize(csv_lines)

print("lines in file: {}".format(rdd_aapl.count()))

rdd_aapl.take(5)

### Task

Now, design a pipeline that would :
1. filter out headers and last line
2. split each line based on comma
3. keep only fields for Date (col 0) and Close (col 4)
4. order by Close in descending order

### Code

In [None]:
rddout = rdd_aapl # apply transformation here...

rddout.take(5)

In [None]:
# keep (Close, Date) pairs and sort them by Close, highest first
rddout = rdd_aapl.filter(lambda line: not line.startswith("Date") and (len(line) > 0))\
        .map(lambda line: line.split(","))\
        .map(lambda fields: (float(fields[4]), fields[0]))\
        .sortBy(lambda (close, date): close, ascending=False)

rddout.collect()

### Solution

<span style="color:white;font-family:'Courier New'">
rddout = rdd_aapl.filter(lambda line: not line.startswith("Date") and (len(line) > 0))\<br/>
.map(lambda line: line.split(","))\<br/>
.map(lambda fields: (float(fields[4]),fields[0]))\<br/>
.sortBy(lambda (close, date): close, ascending=False)
<br/>
rddout.collect()<br/>
</span>


# 4. Caching / Persistency

- An RDD does no work until an action is called. When an action is called, Spark computes the answer and then throws away all the intermediate data.
- If you are going to reuse an RDD in your computation, you can call `cache()` to make Spark keep it.
- This is especially useful if you have to run the same computation over and over again on one RDD: one use case ? oh I don't know maybe... **MACHINE LEARNING !!!**

## 4.1. Caching

Consider the following job...

In [None]:
import random
num_count = 500*1000
num_list = [random.random() for i in xrange(num_count)]
rdd1 = sc.parallelize(num_list)
rdd2 = rdd1.sortBy(lambda num: num)

In [None]:
%time rdd2.count()
%time rdd2.count()
%time rdd2.count()

- Let's cache it and try again.

In [None]:
rdd2.cache()
%time rdd2.count()
%time rdd2.count()
%time rdd2.count()

- Caching the RDD speeds up the job because the RDD does not have to be computed from scratch again.
- Calling cache() flips a flag on the RDD.
- The data is not cached until an action is called.
- You can uncache an RDD using unpersist()
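
The points above can be mimicked in plain Python. `ToyCachedDataset` is a hypothetical class written for this illustration (it is not how Spark implements caching): `cache()` only flips a flag, the first action after it fills the cache, and later actions reuse the stored rows.

```python
class ToyCachedDataset(object):
    """Hypothetical stand-in for an RDD whose result can be cached."""

    def __init__(self, compute):
        self._compute = compute   # function that rebuilds the rows from scratch
        self._use_cache = False
        self._cached = None
        self.runs = 0             # how many times the rows were recomputed

    def cache(self):
        self._use_cache = True    # only flips a flag; nothing is computed yet
        return self

    def unpersist(self):
        self._use_cache = False
        self._cached = None
        return self

    def count(self):
        if self._use_cache and self._cached is not None:
            return len(self._cached)   # reuse the stored rows
        self.runs += 1
        rows = self._compute()
        if self._use_cache:
            self._cached = rows        # the first action fills the cache
        return len(rows)


dataset = ToyCachedDataset(lambda: sorted([3, 1, 2]))

dataset.count()
dataset.count()
runs_without_cache = dataset.runs      # recomputed at every action

dataset.cache()
runs_after_cache_call = dataset.runs   # calling cache() computed nothing

dataset.count()
dataset.count()
dataset.count()
runs_with_cache = dataset.runs - runs_without_cache  # computed only once

print(runs_without_cache, runs_after_cache_call, runs_with_cache)
```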

## 4.2. Persist

- `persist()` lets you choose where the RDD is stored (memory, disk, or both), whereas `cache()` always uses memory only.
- The available storage levels are listed below.

| Level | Meaning |
| - | - |
| MEMORY_ONLY | Same as cache() |
| MEMORY_AND_DISK | Cache in memory then overflow to disk |
| MEMORY_AND_DISK_SER | Like above; in cache keep objects serialized instead of live |
| DISK_ONLY | Cache to disk not to memory |