# Spark Basics 1

This notebook introduces two fundamental objects in Spark:
* The Spark Context
* The Resilient Distributed Dataset, or RDD

The following code imports all the tools and exercises needed to complete this document:

In [1]:
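import sys
import os
import pickle
import numpy as np

# The Tester directory is a sibling of the current working directory
testPath = os.path.join(os.path.dirname(os.getcwd()), "Tester")
sys.path.insert(0, testPath)
pickleFile = os.path.join(testPath, "SparkBasics1.pkl")

# Imported after the path update so that modules in testPath can be found
import Tester
import SparkBasics1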

### Spark Context
* Spark is complex distributed software.
* The Python interface to Spark is called **pyspark**.
* **SparkContext** is a Python class, defined as part of **pyspark**, which manages the communication between the user's program and Spark.

We start by creating a **SparkContext** object named **sc**. In this case we create a Spark context that runs locally with 4 worker threads, which play the role of *executors*.

In [2]:
from pyspark import SparkContext
sc = SparkContext(master="local[4]")
sc

<pyspark.context.SparkContext at 0x109e85a10>

#### Teacher Stuff

In [3]:
# we generate the student's test code below, we can check that it works by running the student's tests
import SparkBasics1_MASTER

In [4]:
#SparkBasics1_MASTER.gen_exercise1_1(pickleFile, sc)
#SparkBasics1_MASTER.gen_exercise1_2(pickleFile, sc)
#SparkBasics1_MASTER.gen_exercise1_3(pickleFile, sc)
#SparkBasics1_MASTER.gen_exercise1_4(pickleFile, sc)
SparkBasics1_MASTER.gen_exercise1_5(pickleFile, sc)

In [None]:
#SparkBasics1_MASTER.exercise1_1(pickleFile, SparkBasics1_MASTER.func_ex1_1,sc)
#SparkBasics1_MASTER.exercise1_2(pickleFile, SparkBasics1_MASTER.func_ex1_2,sc)
#SparkBasics1_MASTER.exercise1_3(pickleFile, SparkBasics1_MASTER.func_ex1_3,sc)
#SparkBasics1_MASTER.exercise1_4(pickleFile, SparkBasics1_MASTER.func_ex1_4,sc)
SparkBasics1_MASTER.exercise1_5(pickleFile, SparkBasics1_MASTER.func_ex1_5,sc)

In [None]:
#SparkBasics1.exercise1_1(pickleFile, SparkBasics1_MASTER.func_ex1_1, sc)
#SparkBasics1.exercise1_2(pickleFile, SparkBasics1_MASTER.func_ex1_2, sc)
#SparkBasics1.exercise1_3(pickleFile, SparkBasics1_MASTER.func_ex1_3, sc)
#SparkBasics1.exercise1_4(pickleFile, SparkBasics1_MASTER.func_ex1_4,sc)
SparkBasics1.exercise1_5(pickleFile, SparkBasics1_MASTER.func_ex1_5,sc)

### Only one sparkContext at a time!
When you run Spark in local mode, you can have only a single context at a time. Therefore, if you want to use Spark in a second notebook, you should first stop the context you are using here. This is what the method `.stop()` is for.

In [None]:
# sc.stop() #commented out so that you don't stop your context by mistake

<h3>RDDs</h3>

<p>An RDD (Resilient Distributed Dataset) is the main novel data structure in Spark. You can think of it as a list whose elements are stored on several computers.</p>

<p><img alt="" src="Figures/SparkContextAndRDD.jpg" style="height:324px; width:900px" /></p>


The elements of each `RDD` are distributed across the **worker nodes**, which are the nodes that perform the actual computations. This notebook, however, is running on the **driver node**. As the RDD is not stored on the driver node, you cannot access it directly. The variable name `RDD` is really just a pointer to a Python object which holds the information regarding the actual location of the elements.

#### Parallelize 
* Simplest way to create an RDD.
* The call `A=sc.parallelize(L)` creates an RDD named `A` from the list `L`.
* `A` is an RDD of type `ParallelCollectionRDD`.

In [3]:
A=sc.parallelize(range(3))
A

ParallelCollectionRDD[0] at parallelize at PythonRDD.scala:423

#### Collect

* RDD content is distributed among all executors.
* `collect()` is the inverse of `parallelize()`.
* It collects the elements of the RDD on the driver.
* It returns a list.


In [4]:
L=A.collect()
print(type(L))
print(L)

<type 'list'>
[0, 1, 2]
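
To make the round trip between `parallelize()` and `collect()` concrete, here is a minimal pure-Python sketch that mimics the idea without Spark. The helpers `split_into_partitions` and `gather` are hypothetical names invented for this illustration, and the slicing scheme is an assumption, not necessarily how Spark assigns elements to partitions.

```python
def split_into_partitions(data, num_partitions):
    """Hypothetical stand-in for parallelize: cut a list into contiguous chunks."""
    data = list(data)
    n = len(data)
    # Chunk boundaries are spread as evenly as possible.
    bounds = [i * n // num_partitions for i in range(num_partitions + 1)]
    return [data[bounds[i]:bounds[i + 1]] for i in range(num_partitions)]


def gather(partitions):
    """Hypothetical stand-in for collect: concatenate the chunks into one list."""
    return [x for part in partitions for x in part]


partitions = split_into_partitions(range(3), 4)
print(partitions)          # [[], [0], [1], [2]]
print(gather(partitions))  # [0, 1, 2]
```

With 3 elements and 4 partitions, one partition stays empty. Gathering the chunks in order recovers the original list.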


### Using `.collect()` eliminates the benefits of parallelism
It is often tempting to `.collect()` an RDD, make it into a list, and then process the list using standard Python. However, this means that only the driver node performs the computation, so you get no benefit from Spark's parallelism.

Using RDD operations, as described below, **will** make use of all of the computers at your disposal.

### Map
* Applies a given operation to each element of an RDD.
* The parameter is the function defining the operation.
* Returns a new RDD.
* The operation is performed in parallel on all executors.
* Each executor operates on the data **local** to it.

In [5]:
A.map(lambda x: x*x).collect()

[0, 1, 4]

**Note:** Here we are using **lambda** functions. Later we will see that regular functions can also be used.

For more on lambda functions, see [here](http://www.secnetix.de/olli/Python/lambda_functions.hawk).
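
The idea that `map` works on each executor's local data can be mimicked in plain Python. This is only a sketch: the partition layout is assumed for illustration, and `map_sketch` is a hypothetical helper, not a Spark API.

```python
def map_sketch(func, partitions):
    """Apply func to every element, one partition at a time.

    In Spark each partition would be handled by an executor in parallel;
    here the partitions are simply processed one after another.
    """
    return [[func(x) for x in part] for part in partitions]


# Assumed layout: the elements 0, 1, 2 spread over two partitions.
partitions = [[0], [1, 2]]
squared = map_sketch(lambda x: x * x, partitions)
print(squared)                                # [[0], [1, 4]]
print([x for part in squared for x in part])  # [0, 1, 4]
```

Each partition is transformed independently of the others, which is why no data has to move between executors during a `map`.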

## Properties of reduce operations

* Reduce operations **must not depend on the order**
  * Order of operands should not matter
  * Order of application of reduce operator should not matter

* Multiplication and summation are good:

```
                1 + 3 + 5 + 2 = 11                 5 + 3 + 1 + 2 = 11
```

* Division and subtraction are bad:

```
                ((1 - 3) - 5) - 2 = -9             (1 - 3) - (5 - 2) = -5
```

### Why must reordering not change the result?

You can think about the reduce operation as a binary tree where the leaves are the elements of the list and the root is the final result. Each triplet of the form (parent, child1, child2) corresponds to a single application of the reduce function. 

The order in which the reduce operation is applied is **determined at run time** and depends on how the RDD is partitioned across the cluster.
There are many different orders in which the reduce operation can be applied.

If we want the input RDD to uniquely determine the reduced value, **all evaluation orders must yield the same final result**. In addition, the order of the elements in the list must not change the result. In particular, reversing the order of the operands in a reduce function must not change the outcome.

For example the arithmetic operations multiply `*` and add `+` can be used in a reduce, but the operations subtract `-` and divide `/` should not.

Doing so will not raise an error, but the result is unpredictable.

In [6]:
B=sc.parallelize([1,3,5,2])
B.reduce(lambda x,y: x-y)

-9

Which of the following orders was executed?
* $$((1-3)-5)-2$$ or
* $$(1-3)-(5-2)$$
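
Both groupings can be evaluated locally to see that they disagree. The sketch below uses plain Python and `functools.reduce`. The two-partition split is an assumption made for illustration, since the actual split depends on how Spark partitions `B`.

```python
from functools import reduce

subtract = lambda x, y: x - y

# Grouping 1: a single left-to-right pass, ((1-3)-5)-2
sequential = reduce(subtract, [1, 3, 5, 2])

# Grouping 2: reduce two assumed partitions separately, then combine,
# which corresponds to (1-3)-(5-2)
partials = [reduce(subtract, part) for part in [[1, 3], [5, 2]]]
combined = reduce(subtract, partials)

print(sequential)  # -9
print(combined)    # -5
```

The value -9 returned by the cell above is consistent with the first grouping, but a different partitioning of the same data could produce a different number.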

### Reduce

* Takes RDD as input, returns a single value.
* The **reduce operator** takes **two** elements as input and returns **one** as output.
* `reduce` repeatedly applies the **reduce operator**.
* Each executor reduces the data local to it.
* The results from all executors are combined.

The simplest example of a 2-to-1 operation is the sum:

In [7]:
A.reduce(lambda x,y: x+y)

3
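
The two stages listed above, a local reduction on each executor followed by a combination of the partial results, can be mimicked in plain Python. The partition layouts and the helper name `reduce_sketch` are assumptions for illustration only.

```python
from functools import reduce


def reduce_sketch(op, partitions):
    """Reduce each non-empty partition locally, then combine the partial results."""
    partials = [reduce(op, part) for part in partitions if part]
    return reduce(op, partials)


add = lambda x, y: x + y

# Two assumed layouts of the same elements 0, 1, 2
print(reduce_sketch(add, [[0], [1, 2]]))      # 3
print(reduce_sketch(add, [[0, 1], [], [2]]))  # 3
```

Because addition does not depend on order or grouping, both layouts give the same sum.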

Here is an example of a reduce operation that finds the shortest string in an RDD of strings.

In [8]:
words=['this','is','the','best','mac','ever']
wordRDD=sc.parallelize(words)
wordRDD.reduce(lambda w,v: w if len(w)<len(v) else v)

'is'

### Using regular functions instead of lambda functions

* Lambda functions are short and sweet.
* But sometimes it's hard to use just one line.
* We can use full-fledged functions instead.

Suppose we want to find the last word in lexicographical order among the longest words in the list.

We could achieve that as follows:

In [9]:
def largerThan(x,y):
    if len(x)>len(y): return x
    elif len(y)>len(x): return y
    else:  #lengths are equal, compare lexicographically
        if x>y: 
            return x
        else: 
            return y
        
wordRDD.reduce(largerThan)

'this'
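
One way to gain some confidence that a reduce function is safe is to try it locally on many orderings of the input before handing it to Spark. The sketch below does this with `functools.reduce` and `itertools.permutations`. Note the assumption: it only exercises left-to-right evaluation over different element orders, not every possible grouping, so it is a sanity check rather than a proof.

```python
from functools import reduce
from itertools import permutations


def largerThan(x, y):
    """Return the longer word; break ties by lexicographic order."""
    if len(x) > len(y):
        return x
    elif len(y) > len(x):
        return y
    return x if x > y else y


words = ['this', 'is', 'the', 'best', 'mac', 'ever']

# Apply the function to every ordering of the words and keep the distinct results.
results = {reduce(largerThan, p) for p in permutations(words)}
print(results)
```

If the function is order-independent, the set holds a single value. Here that value is `'this'`, matching the result above.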

# Exercises 

## Exercise 1

Write a function called `mapcos` that has a single parameter: an RDD of numbers. Use `map` to return an RDD that contains the `cos()` (cosine) of each element of the input.

`mapcos(A)` should produce output similar to:
    
```
    PythonRDD[14] at RDD at PythonRDD.scala:48
```
`mapcos(A).collect()` should produce:
```
    [1.0, 0.54030..., -0.41614...]
```

In [None]:
def mapcos(A):
    # Write code that will perform the task outlined in exercise 1
    return "returns a spark RDD, so to get a list do: mapcos(A).collect() "

In [None]:
import SparkBasics1
SparkBasics1.exercise1_1(pickleFile, mapcos ,sc)

## Exercise 2

Write a function called `mapwords` that has a single parameter: an RDD of strings, and returns an RDD that contains a list of words for each string.

`stringRDD=sc.parallelize(["Spring quarter", "Learning spark basics", "Big data analytics with Spark"])`

`mapwords(stringRDD).collect()` 
```
output: 
[['Spring', 'quarter'], ['Learning', 'spark', 'basics'], ['Big', 'data', 'analytics', 'with', 'Spark']]
```

In [None]:
def mapwords(stringRDD):
    # Write code that will perform the task outlined in exercise 2
    return "return spark RDD, so to get a list run: mapwords(stringRDD).collect() "

In [None]:
SparkBasics1.exercise1_2(pickleFile, mapwords, sc)

## Exercise 3

Write a function `getMax` that uses `reduce` to find the maximum number in an RDD of numbers. Your function should produce the following:

`RDD=sc.parallelize([0,2,1])`

`getMax(RDD)`

```
Output:  2
```

### Exercise 3: Teacher Stuff

In [None]:
def getMax(C):
    return C.reduce(max)

In [None]:
inputs = [ sc.parallelize([0,4,2,3,1]),
           sc.parallelize([-3.2,-3.233,-3.1,-3.9]),
           sc.parallelize([2,2,2,2,2,2]) ]

Tester.GenPickle(getMax, inputs, pickleFile, "ex3", isRDD=False )

In [None]:
# after creating the code in SparkBasics1_Teacher, we check that it works below
SparkBasics1_Teacher.exercise3(testPath, getMax, sc)

### Student Stuff

In [None]:
def getMax(C):
    # Write the code that will perform the task outlined in exercise 3
    return "returns a number, so to get it run: getMax(C)"

In [None]:
SparkBasics1.exercise3(pickleFile, getMax, sc)

## Exercise 4

Write a function called `reducewords` that uses `reduce` to create a single string which is the concatenation of all the strings in `stringRDD` (with a space between consecutive strings). Example:


`stringRDD=sc.parallelize(["Spring quarter", "Learning spark basics", "Big data analytics with Spark"])`

`reducewords(stringRDD)`
```
Output: 'Spring quarter Learning spark basics Big data analytics with Spark'
```

### Teacher Stuff

In [None]:
def reducewords(A):
    return A.reduce(lambda x,y: x+" "+y)

In [None]:
inputs = [ sc.parallelize(["Spring quarter", "Learning spark basics", "Big data analytics with Spark"]),
           sc.parallelize(["Do not go gentle", "into that good night", "old age should burn and rave"]),
           sc.parallelize(["do","I dare disturb","the universe","there will be time there will be","time"]) ]

Tester.GenPickle(reducewords, inputs, pickleFile, "ex4", isRDD=False )

In [None]:
SparkBasics1_Teacher.exercise4(testPath, reducewords, sc)

### Student Stuff

In [None]:
def reducewords(stringRDD):
    # This function takes an RDD of strings as its input
    # Apply one Spark function to it so that you return a single string
    return "return something like: stringRDD.somefunction(...)"

In [None]:
SparkBasics1.exercise4(pickleFile, reducewords, sc)

## Exercise 5

Write a non-Spark function `maxFunc` that, when called by the `reduce` command, outputs the maximum element from a set of lists. Example:


`listRDD=sc.parallelize([[3,4],[2,1],[7,9]])`

`listRDD.reduce(maxFunc)`

```
Output: [9]
```
     
(Note: The output is a list containing a single number rather than just a single number.)

### Teacher Stuff

In [None]:
def maxFunc(x,y):
    return [max(x+y)]
def func5_true(A):
    return A.reduce(maxFunc)

In [None]:
inputs = [ sc.parallelize([[15,20],[21,14],[18,4,20]]),
           sc.parallelize([[3,4,5,-3,19],[19.1],[7,-11]]),
           sc.parallelize([[-3.2,-3.233,-3.9],[-4],[-3,-5]]) ]

Tester.GenPickle(func5_true, inputs, pickleFile, "ex5", isRDD=False )

In [None]:
SparkBasics1_Teacher.exercise5(testPath, maxFunc, sc)

### Student Stuff

In [None]:
def maxFunc(x,y):
    # x,y are lists of numbers
    # write code here for exercise 5
    return "returns a list to be used possibly again by the reduce command"

In [None]:
SparkBasics1.exercise5(pickleFile, maxFunc, sc)