# Walkthrough: Spark/RDD in Python

<img src="images/spark_flow.png" width="500">

We'll proceed along the usual Spark flow (see above):

1. create the environment to run Spark from Python
2. extract RDDs from files
3. run some transformations
4. execute actions to obtain values (local objects in Python)

## 1.1. Initializing a `SparkSession`

IPython / IPython notebook can be a *client* to interact with the *master*.

The client holds a `SparkSession` that:

1. Acts as a gateway between the client and Spark master
2. Sends code/data from IPython to the master (who then sends it to the workers)

<img src="images/spark_driver_etc.png"/>

Using:

```python
import pyspark as ps

spark = ps.sql.SparkSession.builder \
            .master("local[4]") \
            .appName("df lecture") \
            .getOrCreate()
```

will create a *"local"* cluster: the driver runs Spark in a single process with 4 worker threads (`local[4]`), typically one per core on a 4-core machine.

In [1]:
import pyspark as ps    # for the pyspark suite

spark = ps.sql.SparkSession.builder \
            .master("local[4]") \
            .appName("df lecture") \
            .getOrCreate()

sc = spark.sparkContext  # for the pre-2.0 sparkContext

## 1.2. Creating an RDD (from files)

RDDs are **immutable**. Once created, you cannot modify them directly. You can only transform them into another RDD. 

Functions for creating an RDD from an external source are methods of the SparkContext object `sc`.

| Method | Description |
| - | - |
| [`sc.parallelize(array)`]() | Create an RDD from a python array or list |
| [`sc.textFile(path)`]() | Create an RDD from a text file |
| [`sc.pickleFile(path)`]() | Create an RDD from a pickle file |

### 1.2.1. Creating RDDs from local files

#### `sc.parallelize()` : create an RDD from a python array/list

In [2]:
# creating an adhoc list
data_array = [['matthew', 4],
              ['jorge', 8],
              ['josh', 15],
              ['evangeline', 16],
              ['emilie', 23],
              ['yunjin', 42]]

# reading the array/list using SparkContext
rdd = sc.parallelize(data_array)

# to output the content in python [irl, use with great care]
rdd.collect()

[['matthew', 4],
 ['jorge', 8],
 ['josh', 15],
 ['evangeline', 16],
 ['emilie', 23],
 ['yunjin', 42]]

#### `sc.textFile()` : from a text file !

The result is an RDD of **strings, one per line of the text file**.

In [3]:
# displaying the content of the file in stdout
with open('data/toy_data.txt', 'r') as fin:
    print(fin.read())

# reading the file using SparkContext
rdd = sc.textFile('data/toy_data.txt')

# to output the content in python [irl, use collect() with great care]
rdd.collect()

matthew,4
jorge,8
josh,15
evangeline,16
emilie,23
yunjin,42



[u'matthew,4',
 u'jorge,8',
 u'josh,15',
 u'evangeline,16',
 u'emilie,23',
 u'yunjin,42']

#### <span style="color:red">`sc.pickleFile()` : from an HDFS pickle file

The result is an RDD composed of whatever objects were saved into that file (typically with `rdd.saveAsPickleFile()`).</span>

In [4]:
%ls data/toy_data.pkl

_SUCCESS    part-00000  part-00001


In [5]:
# reading the file using SparkContext
rdd = sc.pickleFile('data/toy_data.pkl')

# to output the content in python [irl, use with great care]
rdd.collect()

[u'matthew,4',
 u'jorge,8',
 u'josh,15',
 u'evangeline,16',
 u'emilie,23',
 u'yunjin,42']

### 1.2.2. Creating RDDs from S3

The functions above can also load data from an S3 bucket, with no extra effort.

<span style="color:red">Warning: don't .collect() that, or you'll break the internet !</span>

**Note**: for this to work, Jupyter must have been launched with the `--packages` option for the AWS and Hadoop packages.

In [58]:
#!source ~/.bash_profile

In [6]:
# link to the S3 repository
#link = 's3n://mortar-example-data/airline-data'
#import os
#ACCESS_KEY = os.environ['AWS_ACCESS_KEY_ID']
#SECRET_KEY = os.environ['AWS_SECRET_ACCESS_KEY']

link = 's3a://mortar-example-data/airline-data'
# creating an RDD...
rdd = sc.textFile(link)

Wow, this bucket holds about 5 million rows, and yet that was fast, right?

Not really. Spark is just **lazy**: it only executes operations when necessary, for instance when we call an action (see below).
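
The same behaviour can be illustrated without Spark at all. The sketch below is a plain-Python analogy built on generators, not the Spark API: `lazy_map` and `parse` are hypothetical helpers, and consuming the generator plays the role of an action.

```python
# Plain-Python analogy for lazy evaluation (not Spark code).
calls = []  # records every time the mapped function actually runs

def parse(line):
    calls.append(line)
    return line.split(',')

def lazy_map(func, items):
    # hypothetical helper: a generator function does no work until iterated
    for item in items:
        yield func(item)

lines = ['matthew,4', 'jorge,8', 'josh,15']

pipeline = lazy_map(parse, lines)   # "transformation": returns immediately
calls_before_action = len(calls)    # nothing has been parsed yet

result = list(pipeline)             # "action": forces the computation
calls_after_action = len(calls)     # one call per line
```

Building `pipeline` costs nothing; all the parsing happens inside `list(...)`, just as all the reading from S3 happens inside `rdd.count()`.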

In [7]:
# find out how many partitions there are...
rdd.getNumPartitions()

11

In [8]:
rdd.count()

5113194

## 1.3. Transformations : transforming an RDD into another

- They are **lazy**: Spark doesn't apply the transformation right away, it just adds it to the **DAG**.
- They transform an RDD into another RDD, because RDDs are **immutable**.
- They can be **wide** or **narrow** (depending on whether or not they shuffle data across partitions).

<img src="images/rdd_narrow_vs_wide_transformations.png" width="400"/>
\[[Image Source](http://horicky.blogspot.com/2013/12/spark-low-latency-massively-parallel.html)\]
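
To make the narrow/wide distinction concrete, here is a small plain-Python model of partitions (not Spark code). The helpers `narrow_map` and `shuffle_by_key` and the toy partitioner are assumptions made for this illustration; Spark's own partitioner works differently in detail.

```python
# Plain-Python model of partitions (not Spark code).
partitions = [
    [('CA', 1), ('WA', 1)],
    [('CA', 2), ('OR', 1)],
    [('CA', 5), ('OR', 1)],
]

def narrow_map(func, parts):
    # Narrow: each output partition depends on exactly one input partition.
    return [[func(record) for record in part] for part in parts]

def shuffle_by_key(parts, num_out):
    # Wide: records sharing a key must land in the same output partition,
    # so any input partition may send records to any output partition.
    out = [[] for _ in range(num_out)]
    for part in parts:
        for key, value in part:
            target = sum(key.encode()) % num_out  # deterministic toy partitioner
            out[target].append((key, value))
    return out

doubled = narrow_map(lambda kv: (kv[0], kv[1] * 2), partitions)
shuffled = shuffle_by_key(partitions, 2)
```

In `doubled`, the partition layout is untouched. In `shuffled`, every `CA` record has been moved into a single partition, which is the data movement that makes wide transformations expensive.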



| Method | Type | Category | Description |
| - | - | - | - |
| [`.map(func)`](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.map) | transformation | mapping | Return a new RDD by applying a function to each element of this RDD. |
| [`.flatMap(func)`](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.flatMap) | transformation | mapping | Return a new RDD by first applying a function to all elements of this RDD, and then flattening the results. |
| [`.filter(func)`](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.filter) | transformation | reduction |  Return a new RDD containing only the elements that satisfy a predicate. |
| [`.sample()`](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.sample) | transformation | reduction | Return a sampled subset of this RDD. |
| [`.distinct()`](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.distinct) | transformation | reduction |  Return a new RDD containing the distinct elements in this RDD. |
| [`.keys()`](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.keys) | transformation | `<k,v>` | Return an RDD with the keys of each tuple. |
| [`.values()`](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.values) | transformation | `<k,v>` | Return an RDD with the values of each tuple. |
| [`.join(rddB)`](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.join) | transformation | `<k,v>` | Return an RDD containing all pairs of elements with matching keys in self and other. Each pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in self and (k, v2) is in other. |
| [`.reduceByKey()`](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.reduceByKey) | transformation | `<k,v>` | Merge the values for each key using an associative and commutative reduce function. |
| [`.groupByKey()`](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.groupByKey) | transformation | `<k,v>` | Group the values for each key into a single sequence. Any function, even a non-associative one such as a mean, can then be applied to each group. |
| [`.sortBy(keyfunc)`](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.sortBy) | transformation | sorting |  Sorts this RDD by the given keyfunc. |
| [`.sortByKey()`](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.sortByKey) | transformation | sorting/`<k,v>` | Sorts this RDD, which is assumed to consist of (key, value) pairs. |



### 1.3.1. Applying transformations and chaining them

Recall the spark flow:

<img src="images/spark_flow.png" width="500">

In the cells below, we will successively:
1. read an RDD from a text file
2. transform by applying `split`
3. transform by filtering
4. transform by casting some columns to their corresponding type.
5. use an action to output the results

Each transformation is a method of an RDD, and returns another RDD.

In [9]:
# displaying the content of the file in stdout
with open('data/sales.txt', 'r') as fin:
    print(fin.read())

#ID    Date           Store   State  Product    Amount
101    11/13/2014     100     WA     331        300.00
104    11/18/2014     700     OR     329        450.00
102    11/15/2014     203     CA     321        200.00
106    11/19/2014     202     CA     331        330.00
103    11/17/2014     101     WA     373        750.00
105    11/19/2014     202     CA     321        200.00



*Recall: Input functions, reading RDDs from files, are functions of the SparkContext.*

In [10]:
# reads a text file line by line
rdd1 = sc.textFile('data/sales.txt')

rdd1.collect()

[u'#ID    Date           Store   State  Product    Amount',
 u'101    11/13/2014     100     WA     331        300.00',
 u'104    11/18/2014     700     OR     329        450.00',
 u'102    11/15/2014     203     CA     321        200.00',
 u'106    11/19/2014     202     CA     331        330.00',
 u'103    11/17/2014     101     WA     373        750.00',
 u'105    11/19/2014     202     CA     321        200.00']

In [11]:
# applies split() to each row
rdd2 = rdd1.map(lambda rowstr : rowstr.split())

rdd2.collect()

[[u'#ID', u'Date', u'Store', u'State', u'Product', u'Amount'],
 [u'101', u'11/13/2014', u'100', u'WA', u'331', u'300.00'],
 [u'104', u'11/18/2014', u'700', u'OR', u'329', u'450.00'],
 [u'102', u'11/15/2014', u'203', u'CA', u'321', u'200.00'],
 [u'106', u'11/19/2014', u'202', u'CA', u'331', u'330.00'],
 [u'103', u'11/17/2014', u'101', u'WA', u'373', u'750.00'],
 [u'105', u'11/19/2014', u'202', u'CA', u'321', u'200.00']]

In [12]:
# filters rows
rdd3 = rdd2.filter(lambda row: not row[0].startswith('#'))

rdd3.collect()

[[u'101', u'11/13/2014', u'100', u'WA', u'331', u'300.00'],
 [u'104', u'11/18/2014', u'700', u'OR', u'329', u'450.00'],
 [u'102', u'11/15/2014', u'203', u'CA', u'321', u'200.00'],
 [u'106', u'11/19/2014', u'202', u'CA', u'331', u'330.00'],
 [u'103', u'11/17/2014', u'101', u'WA', u'373', u'750.00'],
 [u'105', u'11/19/2014', u'202', u'CA', u'321', u'200.00']]

In [13]:
def casting_function(row):
    id, date, store, state, product, amount = row
    return((int(id), date, int(store), state, int(product), float(amount)))

# applies casting_function to rows
rdd4 = rdd3.map(casting_function)

# shows the result
rdd4.collect()

[(101, u'11/13/2014', 100, u'WA', 331, 300.0),
 (104, u'11/18/2014', 700, u'OR', 329, 450.0),
 (102, u'11/15/2014', 203, u'CA', 321, 200.0),
 (106, u'11/19/2014', 202, u'CA', 331, 330.0),
 (103, u'11/17/2014', 101, u'WA', 373, 750.0),
 (105, u'11/19/2014', 202, u'CA', 321, 200.0)]

**Now, let's see the canonical way to write that in Python...**

In [14]:
def casting_function(row):
    id, date, store, state, product, amount = row
    return((int(id), date, int(store), state, int(product), float(amount)))

rdd_sales = sc.textFile('data/sales.txt')\
        .map(lambda rowstr : rowstr.split())\
        .filter(lambda row: not row[0].startswith('#'))\
        .map(casting_function)   # <= JUST ADDED THIS HERE

rdd_sales.collect()

[(101, u'11/13/2014', 100, u'WA', 331, 300.0),
 (104, u'11/18/2014', 700, u'OR', 329, 450.0),
 (102, u'11/15/2014', 203, u'CA', 321, 200.0),
 (106, u'11/19/2014', 202, u'CA', 331, 330.0),
 (103, u'11/17/2014', 101, u'WA', 373, 750.0),
 (105, u'11/19/2014', 202, u'CA', 321, 200.0)]

<span style="color:red">FROM NOW ON WE'LL RELY ON THESE TWO RDDs</span>

In [10]:
# creating an adhoc list
data_array = [['matthew', 4],
              ['jorge', 8],
              ['josh', 15],
              ['evangeline', 16],
              ['emilie', 23],
              ['yunjin', 42]]

# reading the array/list using SparkContext
rdd_names = sc.parallelize(data_array)

# to output the content in python [irl, use with great care]
rdd_names.collect()

[['matthew', 4],
 ['jorge', 8],
 ['josh', 15],
 ['evangeline', 16],
 ['emilie', 23],
 ['yunjin', 42]]

In [16]:
def casting_function(row):
    id, date, store, state, product, amount = row
    return((int(id), date, int(store), state, int(product), float(amount)))

rdd_sales = sc.textFile('data/sales.txt')\
        .map(lambda x : x.split())\
        .filter(lambda x: not x[0].startswith('#'))\
        .map(casting_function)

rdd_sales.collect()

[(101, u'11/13/2014', 100, u'WA', 331, 300.0),
 (104, u'11/18/2014', 700, u'OR', 329, 450.0),
 (102, u'11/15/2014', 203, u'CA', 321, 200.0),
 (106, u'11/19/2014', 202, u'CA', 331, 330.0),
 (103, u'11/17/2014', 101, u'WA', 373, 750.0),
 (105, u'11/19/2014', 202, u'CA', 321, 200.0)]

### 1.3.2. Mapping

#### `.map(func)` : applying a function on every row

In [17]:
# applying a lambda function to an rdd
rddout = rdd_names.map(lambda x : len(x[0]))

# print out the original rdd
print("before: {}".format(rdd_names.collect()))

# print out the new rdd generated
print("after: {}".format(rddout.collect()))

before: [['matthew', 4], ['jorge', 8], ['josh', 15], ['evangeline', 16], ['emilie', 23], ['yunjin', 42]]
after: [7, 5, 4, 10, 6, 6]


#### `.flatMap(func)` : applying a function on every row and flattening the resulting lists

In [18]:
# applying a lambda function to an rdd (because why not)
rddout = rdd_names.flatMap(lambda row : [row[1], row[1]+2, row[1]+len(row[0])])

# print out the original rdd
print("before: {}".format(rdd_names.collect()))

# print out the new rdd generated
print("after: {}".format(rddout.collect()))

before: [['matthew', 4], ['jorge', 8], ['josh', 15], ['evangeline', 16], ['emilie', 23], ['yunjin', 42]]
after: [4, 6, 11, 8, 10, 13, 15, 17, 19, 16, 18, 26, 23, 25, 29, 42, 44, 48]
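
For comparison, the difference between `.map()` and `.flatMap()` can be reproduced on a local list. This is a plain-Python sketch using `itertools.chain`, not Spark code, and it only uses the first three rows of the data above.

```python
from itertools import chain

rows = [['matthew', 4], ['jorge', 8], ['josh', 15]]
func = lambda row: [row[1], row[1] + 2, row[1] + len(row[0])]

mapped = [func(row) for row in rows]            # like .map(): one list per row
flattened = list(chain.from_iterable(mapped))   # like .flatMap(): lists concatenated

print(mapped)      # [[4, 6, 11], [8, 10, 13], [15, 17, 19]]
print(flattened)   # [4, 6, 11, 8, 10, 13, 15, 17, 19]
```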


### 1.3.3. Row reduction

#### `.filter(func)`: filters an RDD using a function that returns boolean values

In [19]:
# filtering an rdd
rddout = rdd_sales.filter(lambda row: (row[3] == 'CA'))

# print out the original rdd
print("before: {}".format(rdd_sales.collect()))

# print out the new rdd generated
print("after: {}".format(rddout.collect()))

before: [(101, u'11/13/2014', 100, u'WA', 331, 300.0), (104, u'11/18/2014', 700, u'OR', 329, 450.0), (102, u'11/15/2014', 203, u'CA', 321, 200.0), (106, u'11/19/2014', 202, u'CA', 331, 330.0), (103, u'11/17/2014', 101, u'WA', 373, 750.0), (105, u'11/19/2014', 202, u'CA', 321, 200.0)]
after: [(102, u'11/15/2014', 203, u'CA', 321, 200.0), (106, u'11/19/2014', 202, u'CA', 331, 330.0), (105, u'11/19/2014', 202, u'CA', 321, 200.0)]


#### `.sample(withReplacement, fraction, seed)`: sampling an RDD !!

In [20]:
# sampling an rdd
rddout = rdd_sales.sample(True, 0.4)

# print out the original rdd
print("before: {}".format(rdd_sales.collect()))

# print out the new rdd generated
print("after: {}".format(rddout.collect()))

before: [(101, u'11/13/2014', 100, u'WA', 331, 300.0), (104, u'11/18/2014', 700, u'OR', 329, 450.0), (102, u'11/15/2014', 203, u'CA', 321, 200.0), (106, u'11/19/2014', 202, u'CA', 331, 330.0), (103, u'11/17/2014', 101, u'WA', 373, 750.0), (105, u'11/19/2014', 202, u'CA', 321, 200.0)]
after: [(102, u'11/15/2014', 203, u'CA', 321, 200.0), (105, u'11/19/2014', 202, u'CA', 321, 200.0)]


#### `.distinct()`: obtaining distinct rows

In [21]:
# obtaining distinct values of the "state" column of rdd_sales
rddout = rdd_sales.map(lambda row: row[3])\
                    .distinct()

# print out the original rdd
print("before: {}".format(rdd_sales.collect()))

# print out the new rdd generated
print("after: {}".format(rddout.collect()))

before: [(101, u'11/13/2014', 100, u'WA', 331, 300.0), (104, u'11/18/2014', 700, u'OR', 329, 450.0), (102, u'11/15/2014', 203, u'CA', 321, 200.0), (106, u'11/19/2014', 202, u'CA', 331, 330.0), (103, u'11/17/2014', 101, u'WA', 373, 750.0), (105, u'11/19/2014', 202, u'CA', 321, 200.0)]
after: [u'CA', u'WA', u'OR']


### 1.3.4. Methods with a `<k,v>` paradigm

#### `.values()`: returns the values of a RDD made of `<k,v>` pairs

In [22]:
# extracting the value of each <k,v> pair
rddout = rdd_names.values()

# print out the original rdd
print("before: {}".format(rdd_names.collect()))

# print out the new rdd generated
print("after: {}".format(rddout.collect()))

before: [['matthew', 4], ['jorge', 8], ['josh', 15], ['evangeline', 16], ['emilie', 23], ['yunjin', 42]]
after: [4, 8, 15, 16, 23, 42]


#### `.keys()`: returns the keys of a RDD made of `<k,v>` pairs

In [23]:
# extracting the key of each <k,v> pair
rddout = rdd_names.keys()

# print out the original rdd
print("before: {}".format(rdd_names.collect()))

# print out the new rdd generated
print("after: {}".format(rddout.collect()))

before: [['matthew', 4], ['jorge', 8], ['josh', 15], ['evangeline', 16], ['emilie', 23], ['yunjin', 42]]
after: ['matthew', 'jorge', 'josh', 'evangeline', 'emilie', 'yunjin']


#### `rddA.join(rddB)`: join another RDD

In [24]:
rdd_salesperstate = rdd_sales.map(lambda row: (row[3],row[5]))

rdd_salesperstate.collect()

[(u'WA', 300.0),
 (u'OR', 450.0),
 (u'CA', 200.0),
 (u'CA', 330.0),
 (u'WA', 750.0),
 (u'CA', 200.0)]

In [26]:
# creating an adhoc list of managers for each state
data_array = [['CA', 'matthew'],
              ['OR', 'jorge'],
              ['WA','matthew'],
              ['TX', 'emilie']]

# reading the array/list using SparkContext
rdd_managers = sc.parallelize(data_array)

# joining on the key (state) and collecting the result [irl, use collect() with great care]
rdd_salesperstate.join(rdd_managers).collect()

[(u'WA', (300.0, 'matthew')),
 (u'WA', (750.0, 'matthew')),
 (u'CA', (200.0, 'matthew')),
 (u'CA', (330.0, 'matthew')),
 (u'CA', (200.0, 'matthew')),
 (u'OR', (450.0, 'jorge'))]

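#### `.reduceByKey(func)`: reduce the `v`s sharing the same `k` by applying `func`

The `func` here needs to be associative and commutative... can you guess why?

https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html

This works like a combiner in the MapReduce framework: values are first reduced within each partition, then the partial results are merged across partitions, so neither the grouping nor the order of the calls to `func` is fixed.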

In [31]:
# creating an adhoc list
data_array = [['CA', 1],
              ['WA', 1],
              ['CA', 2],
              ['OR', 1],
              ['CA', 5],
              ['OR', 1]]

# reading the array/list using SparkContext
rdd = sc.parallelize(data_array)

# to output the content in python [irl, use with great care]
rdd.collect()

[['CA', 1], ['WA', 1], ['CA', 2], ['OR', 1], ['CA', 5], ['OR', 1]]

In [28]:
rdd.reduceByKey(lambda v1,v2 : v1+v2).collect()

[('OR', 2), ('CA', 8), ('WA', 1)]
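
One way to see why `func` must be associative and commutative is to model what happens across partitions. The sketch below is plain Python, not Spark: `reduce_by_key` is a hypothetical helper that reduces inside each partition first and then merges the partial results, and the two partition layouts are made up for illustration.

```python
def reduce_by_key(parts, func):
    # Step 1: combine values per key inside each partition (combiner-style).
    partials = []
    for part in parts:
        local = {}
        for key, value in part:
            local[key] = func(local[key], value) if key in local else value
        partials.append(local)
    # Step 2: merge the partial results across partitions.
    merged = {}
    for local in partials:
        for key, value in local.items():
            merged[key] = func(merged[key], value) if key in merged else value
    return merged

data = [('CA', 1), ('WA', 1), ('CA', 2), ('OR', 1), ('CA', 5), ('OR', 1)]
layout_a = [data[:3], data[3:]]   # two partitions of three rows
layout_b = [data[:1], data[1:]]   # one row, then the other five

add = lambda v1, v2: v1 + v2
sub = lambda v1, v2: v1 - v2      # neither associative nor commutative

add_a, add_b = reduce_by_key(layout_a, add), reduce_by_key(layout_b, add)
sub_a, sub_b = reduce_by_key(layout_a, sub), reduce_by_key(layout_b, sub)
```

Addition gives the same totals whatever the layout, while subtraction gives a different answer for `CA` depending on how the rows were split. Since you do not control the partitioning, only associative and commutative functions give a well-defined result.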

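#### `.groupByKey()`: group the `v`s sharing the same `k`

Unlike `.reduceByKey()`, it takes no function: each key is paired with an iterable of all its values, to which you can then apply any function, even one that is neither associative nor commutative (such as a mean).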

In [29]:
# creating an adhoc list
data_array = [['CA', 1],
              ['WA', 1],
              ['CA', 2],
              ['OR', 1],
              ['CA', 5],
              ['OR', 1]]

# reading the array/list using SparkContext
rdd = sc.parallelize(data_array)

# to output the content in python [irl, use with great care]
rdd.collect()

[['CA', 1], ['WA', 1], ['CA', 2], ['OR', 1], ['CA', 5], ['OR', 1]]

In [30]:
def mean(args):
    key,iterator = args
    total = 0.0; count = 0
    for x in iterator:
        total += x; count += 1
    return total / count

rdd.groupByKey()\
    .map(mean)\
    .collect()

[1.0, 2.6666666666666665, 1.0]
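
A per-key mean can also be expressed with `.reduceByKey()` by carrying `(sum, count)` pairs, because adding pairs element-wise is associative and commutative. The helper names `to_pair`, `merge_pairs` and `finish` below are hypothetical; the Spark call chain is shown only as a comment, and the logic is exercised on a local dictionary instead.

```python
def to_pair(value):
    return (value, 1)                      # (running sum, running count)

def merge_pairs(a, b):
    return (a[0] + b[0], a[1] + b[1])      # element-wise addition

def finish(pair):
    total, count = pair
    return total / float(count)

# With a live SparkContext, the chain would look like this (not run here):
#   rdd.mapValues(to_pair).reduceByKey(merge_pairs).mapValues(finish).collect()

# Local check of the same logic on the data above.
data = [['CA', 1], ['WA', 1], ['CA', 2], ['OR', 1], ['CA', 5], ['OR', 1]]
acc = {}
for key, value in data:
    pair = to_pair(value)
    acc[key] = merge_pairs(acc[key], pair) if key in acc else pair
means = {key: finish(pair) for key, pair in acc.items()}
```

Unlike the `mean` function above, which returns only the averages, this version keeps each key next to its mean.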

### 1.3.5. Sorting methods

#### `.sortBy(keyfunc)`: sorting by the value of a function on rows

In [32]:
# sorting by any function (because why not?)
rddout = rdd_names.sortBy(lambda row : (13-row[1])**2, ascending=True) #how far is the value from 13?

# print out the original rdd
print(rdd_names.collect())

# print out the new rdd generated
print(rddout.collect())

[['matthew', 4], ['jorge', 8], ['josh', 15], ['evangeline', 16], ['emilie', 23], ['yunjin', 42]]
[['josh', 15], ['evangeline', 16], ['jorge', 8], ['matthew', 4], ['emilie', 23], ['yunjin', 42]]


#### `.sortByKey()`: sorting by key on a `<k,v>` RDD

In [33]:
# sorting k,v pairs by key
rddout = rdd_names.sortByKey(ascending=False)

# print out the original rdd
print(rdd_names.collect())

# print out the new rdd generated
print(rddout.collect())

[['matthew', 4], ['jorge', 8], ['josh', 15], ['evangeline', 16], ['emilie', 23], ['yunjin', 42]]
[('yunjin', 42), ('matthew', 4), ('josh', 15), ('jorge', 8), ('evangeline', 16), ('emilie', 23)]


## 1.4. Actions : turning your RDD into something else (local object)

Actions are methods of an RDD object that turn an RDD into something else (a Python object or a statistic).

When executed in IPython or in a notebook, they **launch the processing of the DAG**. This is where Spark stops being **lazy**, and where your script will take time to execute.

| Method | Type | Description |
| - | - | - |
| [`.collect()`](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.collect) | action | Return a list that contains all of the elements in this RDD. Note that this method should only be used if the resulting array is expected to be small, as all the data is loaded into the driver’s memory. |
| [`.count()`](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.count) | action | Return the number of elements in this RDD. |
| [`.take(n)`](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.take) | action | Take the first `n` elements of the RDD. |
| [`.top(n)`](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.top) | action | Get the top `n` elements from a RDD. It returns the list sorted in descending order. |
| [`.first()`](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.first) | action | Return the first element in a RDD. |
| [`.sum()`](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.sum) | action | Add up the elements in this RDD. |
| [`.mean()`](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.mean) | action | Compute the mean of this RDD’s elements. |
| [`.stdev()`](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.stdev) | action | Compute the standard deviation of this RDD’s elements. |

In [34]:
# creating an adhoc list
data_array = [['matthew', 4],
              ['jorge', 8],
              ['josh', 15],
              ['evangeline', 16],
              ['emilie', 23],
              ['yunjin', 42]]

# reading the array/list using SparkContext
rdd_names = sc.parallelize(data_array)

### 1.4.1. Actions that return portions of an RDD

#### `.collect()` : returning the *full* content of an RDD to "python space"

Returns the rows of an RDD as a list. This can be a bad idea if your RDD is gigantic, because `.collect()` returns everything and loads it into the driver's memory for Python to process.

In [35]:
# to output the content in python
collected = rdd_names.collect()

# let's check the type of RDD
print("type of rdd: {}".format(type(rdd_names)))

# let's check the type of what's collected
print("type of rdd_collected: {}".format(type(collected)))

# let's print the collected content
print(collected)

type of rdd: <class 'pyspark.rdd.RDD'>
type of rdd_collected: <type 'list'>
[['matthew', 4], ['jorge', 8], ['josh', 15], ['evangeline', 16], ['emilie', 23], ['yunjin', 42]]


#### `.take(n)` : returning the first n lines of an RDD

Returns the first `n` rows of an RDD as a list. These `n` rows are not randomly selected: Spark scans the partitions in order and stops as soon as it has collected `n` elements.

In [36]:
# to output the content in python
taken = rdd_names.take(2)

# let's check the type of what's collected
print("type of rdd_taken: {}".format(type(taken)))

# let's print the collected content
print(taken)

type of rdd_taken: <type 'list'>
[['matthew', 4], ['jorge', 8]]


#### `.first()` : returning the first line of an RDD

In [None]:
print(rdd_names.first())

### 1.4.2. Actions that compute some statistics

#### `.count()` : count the number of lines

In [None]:
print(rdd_names.count())

#### `.sum()`: summing every line in an RDD

(The RDD must contain summable values.)

In [39]:
print(rdd_names.values().sum())

108


#### `.mean()`: averaging every line in an RDD

(The RDD must contain summable values.)

In [40]:
print(rdd_names.values().mean())

18.0


#### `.stdev()`: standard deviation of the lines in an RDD

In [41]:
print(rdd_names.values().stdev())

12.3153021346
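
Note that the value above is the population standard deviation (dividing by `n`); PySpark exposes the sample version (dividing by `n - 1`) as `.sampleStdev()`. A quick local cross-check with the standard `statistics` module, on the same six values:

```python
import statistics

values = [4, 8, 15, 16, 23, 42]

population_sd = statistics.pstdev(values)  # divides by n, matches .stdev() above
sample_sd = statistics.stdev(values)       # divides by n - 1, always a bit larger

print(round(population_sd, 4))   # 12.3153
```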


# 2. Let's design chains of transformations together !

## 2.1. Computing sales per state

### Input RDD

In [42]:
def casting_function(row):
    id, date, store, state, product, amount = row
    return((int(id), date, int(store), state, int(product), float(amount)))

rdd_sales = sc.textFile('data/sales.txt')\
        .map(lambda x : x.split())\
        .filter(lambda x: not x[0].startswith('#'))\
        .map(casting_function)

rdd_sales.collect()

[(101, u'11/13/2014', 100, u'WA', 331, 300.0),
 (104, u'11/18/2014', 700, u'OR', 329, 450.0),
 (102, u'11/15/2014', 203, u'CA', 321, 200.0),
 (106, u'11/19/2014', 202, u'CA', 331, 330.0),
 (103, u'11/17/2014', 101, u'WA', 373, 750.0),
 (105, u'11/19/2014', 202, u'CA', 321, 200.0)]

### Task

You want to obtain an RDD of the states sorted by decreasing cumulative sales.

What transformations do you need to apply?

How would you draw the workflow of these transformations?

### Code

In [43]:
rddout = rdd_sales  ### put your transformations here...

rddout.collect()

[(101, u'11/13/2014', 100, u'WA', 331, 300.0),
 (104, u'11/18/2014', 700, u'OR', 329, 450.0),
 (102, u'11/15/2014', 203, u'CA', 321, 200.0),
 (106, u'11/19/2014', 202, u'CA', 331, 330.0),
 (103, u'11/17/2014', 101, u'WA', 373, 750.0),
 (105, u'11/19/2014', 202, u'CA', 321, 200.0)]

### Solution


<details>
  <summary>Click here to see the solution below</summary>

```python
rddout = rdd_sales.map(lambda x: (x[3], x[5]))\
    .reduceByKey(lambda amount1, amount2: amount1 + amount2)\
    .sortBy(lambda state_amount: state_amount[1], ascending=False)

rddout.collect()
```
</details>


## 2.2. Word count

### Input RDD

In [44]:
# displaying the content of the file in stdout
with open('data/input.txt', 'r') as fin:
    print(fin.read())

# reading the file using SparkContext
rdd_text = sc.textFile('data/input.txt')

hello world
another line
yet another line
yet another another line



### Task
You want to create a table of unique words and their number of occurrences.

What transformations do you need to apply?

How would you draw the workflow of these transformations?

### Code

In [None]:
rddout = rdd_text  # put your transformations here...

rddout.collect()

### Solution

<details>
  <summary>Click here to see the solution below</summary>

```python
rddout = rdd_text.flatMap(lambda line: line.split())\
            .map(lambda word: (word, 1))\
            .reduceByKey(lambda v1, v2: v1 + v2)

rddout.collect()
```
</details>
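
On an input this small, a Spark result can be checked against a local computation. The sketch below uses `collections.Counter` on the four lines of `data/input.txt` shown above, typed in by hand rather than read from the file.

```python
from collections import Counter

lines = ['hello world',
         'another line',
         'yet another line',
         'yet another another line']

# split every line into words and count them locally
expected = Counter(word for line in lines for word in line.split())

print(sorted(expected.items()))
# [('another', 4), ('hello', 1), ('line', 3), ('world', 1), ('yet', 2)]
```

Comparing `dict(rddout.collect())` with `expected` is a simple way to validate a chain of transformations before running it on a large file.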

## 2.3. Find the date on which AAPL's stock price was the highest

### Input RDD

In [45]:
rdd_aapl_raw = sc.textFile('data/aapl.csv')

print("lines in file: {}".format(rdd_aapl_raw.count()))

rdd_aapl_raw.take(5)

lines in file: 254


[u'Date,Open,High,Low,Close,Volume,Adj Close',
 u'2016-10-25,117.949997,118.360001,117.309998,118.25,39190300,118.25',
 u'2016-10-24,117.099998,117.739998,117.00,117.650002,23538700,117.650002',
 u'2016-10-21,116.809998,116.910004,116.279999,116.599998,23192700,116.599998',
 u'2016-10-20,116.860001,117.379997,116.330002,117.059998,24125800,117.059998']

### Task

Now, design a pipeline that would :
1. filter out headers
2. split each line based on comma
3. keep only fields for Date (col 0) and Close (col 4)
4. order by Close in descending order

### Code

In [49]:
rddout = rdd_aapl_raw # apply transformation here...

rddout.take(5)

[(122.57, u'2015-11-03'),
 (122.0, u'2015-11-04'),
 (121.18, u'2015-11-02'),
 (121.059998, u'2015-11-06'),
 (120.919998, u'2015-11-05'),
 (120.57, u'2015-11-09'),
 (120.529999, u'2015-10-29'),
 (119.5, u'2015-10-30'),
 (119.300003, u'2015-11-20'),
 (119.269997, u'2015-10-28')]

### Solution

<details>
  <summary>Click here to see the solution below</summary>

```python
rddout = rdd_aapl_raw.filter(lambda line: not line.startswith("Date"))\
    .map(lambda line: line.split(","))\
    .map(lambda fields: (float(fields[4]), fields[0]))\
    .sortBy(lambda row: row[0], ascending=False)

rddout.collect()
```
</details>

# 3. Caching / Persistency

- An RDD does no work until an action is called. When an action is called, Spark computes the answer and then throws away all the intermediate data.
- If you have an RDD that you are going to reuse in your computation, you can use `cache()` to make Spark keep it around.
- This is especially useful if you have to run the same computation over and over again on one RDD. One use case? Iterative algorithms, as in **machine learning**.

## 3.1. Caching

Consider the following job...

In [50]:
import random
num_count = 500*1000
num_list = [random.random() for i in range(num_count)]
rdd1 = sc.parallelize(num_list)
rdd2 = rdd1.sortBy(lambda num: num)

In [51]:
%time rdd2.count()
%time rdd2.count()
%time rdd2.count()

CPU times: user 16.3 ms, sys: 4.84 ms, total: 21.2 ms
Wall time: 1.73 s
CPU times: user 8.86 ms, sys: 2.54 ms, total: 11.4 ms
Wall time: 668 ms
CPU times: user 9.88 ms, sys: 2.89 ms, total: 12.8 ms
Wall time: 702 ms


500000

- Let's cache it and try again.

In [52]:
rdd2.cache()
%time rdd2.count()
%time rdd2.count()
%time rdd2.count()

CPU times: user 7.94 ms, sys: 2.36 ms, total: 10.3 ms
Wall time: 812 ms
CPU times: user 7.48 ms, sys: 2.3 ms, total: 9.78 ms
Wall time: 96.3 ms
CPU times: user 8.78 ms, sys: 2.66 ms, total: 11.4 ms
Wall time: 104 ms


500000

- Caching the RDD speeds up the job because the RDD does not have to be computed from scratch again.
- Calling cache() flips a flag on the RDD.
- The data is not cached until an action is called.
- You can uncache an RDD using unpersist()
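
The "flag now, store on first action" behaviour can be mimicked with a small toy class. This is a plain-Python model, not how Spark is implemented: `LazyDataset` and its attributes are hypothetical names chosen for the illustration.

```python
class LazyDataset:
    """Toy stand-in for an RDD: recomputes on every action unless cached."""

    def __init__(self, compute):
        self._compute = compute      # function that builds the data
        self._use_cache = False      # flag flipped by cache()
        self._stored = None          # filled by the first action after cache()
        self.computations = 0        # how many times the data was rebuilt

    def cache(self):
        self._use_cache = True       # no work happens here
        return self

    def unpersist(self):
        self._use_cache = False
        self._stored = None
        return self

    def _materialize(self):
        if self._use_cache and self._stored is not None:
            return self._stored
        self.computations += 1
        data = self._compute()
        if self._use_cache:
            self._stored = data
        return data

    def count(self):                 # an "action"
        return len(self._materialize())


ds = LazyDataset(lambda: sorted([3, 1, 2]))
ds.count()
ds.count()
uncached_runs = ds.computations          # rebuilt on each action

ds.cache()
runs_after_cache_call = ds.computations  # cache() alone computes nothing
ds.count()
ds.count()
ds.count()
cached_runs = ds.computations            # only one extra rebuild
```

This mirrors the timings above: the first `count()` after `cache()` still pays the full cost, and only the following ones are fast.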

## 3.2. Persist

- `persist()` lets you choose where an RDD is kept, for example on disk instead of in memory.
- You can cache RDDs at different storage levels.

| Level | Meaning |
| - | - |
| `MEMORY_ONLY` | Same as `cache()` |
| `MEMORY_AND_DISK` | Cache in memory, then overflow to disk |
| `MEMORY_AND_DISK_SER` | Like above, but keep cached objects serialized instead of live |
| `DISK_ONLY` | Cache to disk, not to memory |

In [11]:
sc.stop()
spark.stop()