## Spark Basics

Spark requires communication and synchronization between a number of computers connected through an Ethernet connection. The data structure that manages this communication is a **SparkContext**. A program needs only one SparkContext to run. In a stand-alone pyspark script, the SparkContext object has to be initialized explicitly.
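For a stand-alone script, the explicit initialization might look like the sketch below. It assumes that pyspark is installed and that a local master (`local[*]`) is acceptable. The helper name `create_context` and the application name are illustrative and not part of the original notebook.

```python
def create_context(app_name="spark-basics", master="local[*]"):
    """Return a SparkContext, creating one only if none exists yet."""
    # Imported lazily so that merely defining this helper does not
    # require pyspark to be installed.
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName(app_name).setMaster(master)
    # getOrCreate reuses a running context, since a program needs only one.
    return SparkContext.getOrCreate(conf)


# Typical use in a script (left commented out so that the helper can be
# defined without starting Spark):
# sc = create_context()
# print(sc.parallelize([0, 1, 2]).collect())
# sc.stop()
```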

When running pyspark in a notebook, the notebook manager initializes a SparkContext named **sc**.

In [1]:
sc

<pyspark.context.SparkContext at 0x107025950>

### RDDs
RDDs (Resilient Distributed Datasets) are the fundamental data structure used in Spark. You can think of an RDD as an array whose elements are stored on several computers.

The simplest way of creating an RDD is to initialize it from a regular list using the method `sc.parallelize`.

(In this notebook we use very tiny RDDs to aid understanding. The real utility of RDDs is for lists with millions or billions of items, which do not fit in the memory of a single computer.)

In [2]:
RDD=sc.parallelize([0,1,2])
RDD

ParallelCollectionRDD[0] at parallelize at PythonRDD.scala:423

An RDD can be converted back to a regular list residing on the head node using the method `.collect`.

In [2]:
RDD.collect()

[0, 1, 2]
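To make the picture of "an array whose elements are stored on several computers" concrete, here is a plain-Python sketch that needs no Spark. The helpers `split_into_partitions` and `collect` are hypothetical and only model the idea; the way Spark actually slices a list may differ.

```python
def split_into_partitions(items, num_partitions):
    """Split a list into num_partitions contiguous chunks (a toy model)."""
    n = len(items)
    bounds = [(i * n) // num_partitions for i in range(num_partitions + 1)]
    return [items[bounds[i]:bounds[i + 1]] for i in range(num_partitions)]


def collect(partitions):
    """Concatenate the chunks back into one list, like .collect()."""
    return [x for part in partitions for x in part]


partitions = split_into_partitions([0, 1, 2], 2)
print(partitions)           # each inner list stands for one worker's share
print(collect(partitions))  # the original list, reassembled in order
```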

### Map
The map method applies a given operation to each element in the RDD, thus creating a new RDD.

In [3]:
RDD.map(lambda x: x*x).collect()

[0, 1, 4]

#### Exercise 1

1. Write a `map` command that computes the `cos` of each entry. Your command should produce the following output:
    
    ```
    [1.0, 0.5403023058681398, -0.4161468365471424]
    ```

2. Consider the following RDD: 

    ```stringRDD=sc.parallelize(["Spring quarter", "Learning spark basics", "Big data analytics with Spark"])```
    
    Write a `map` command that produces a list of words for each string. Your command should produce the following output:
    
    ``` [['Spring', 'quarter'], ['Learning', 'spark', 'basics'], ['Big', 'data', 'analytics', 'with', 'Spark']]```

In [5]:
import math
RDD.map(lambda x: math.cos(x)).collect()

[1.0, 0.5403023058681398, -0.4161468365471424]

In [6]:
stringRDD=sc.parallelize(["Spring quarter", "Learning spark basics", "Big data analytics with Spark"])
stringRDD.map(lambda x: x.split(" ")).collect()

[['Spring', 'quarter'],
 ['Learning', 'spark', 'basics'],
 ['Big', 'data', 'analytics', 'with', 'Spark']]

### Reduce
The reduce operation takes an RDD as input and repeatedly applies a 2-to-1 operation, such as summation or max, until a single item remains.
A 2-to-1 operation takes two items of some type as input and outputs a single item of the same type.

The simplest example of a 2-to-1 operation is the sum:

In [4]:
RDD.reduce(lambda x,y: x+y)

3

Here is an example of a reduce operation that finds the shortest string in an RDD of strings.

In [5]:
words=['this','is','the','best','mac','ever']
wordRDD=sc.parallelize(words)
wordRDD.reduce(lambda w,v: w if len(w)<len(v) else v)

'is'

#### Exercise 2

1. Write a `reduce` command that outputs the maximum number from a list of numbers. Your command should produce the following output on ` RDD=sc.parallelize([0,2,1]) `:

   Output: ``` 2 ```
   

2. Consider the stringRDD defined in Exercise 1. Write a `reduce` command to produce a single string which is the concatenation of all the strings in stringRDD (with a space between each string). Your output should look like:

    Output: ``` 'Spring quarter Learning spark basics Big data analytics with Spark' ```

In [9]:
RDD.reduce(lambda x,y: max(x, y))

2

In [10]:
stringRDD.reduce(lambda x,y: x + " " + y)

'Spring quarter Learning spark basics Big data analytics with Spark'

### Using regular functions instead of lambda functions
Lambda functions can produce compact code. However, you can use regular functions instead. This is a better choice when the operation is too complex to fit in a single line.

For example, suppose that we want to find, among the longest words in the list, the one that comes last in lexicographical order.
We could achieve that as follows:

In [6]:
def largerThan(x,y):
    if len(x)>len(y): return x
    elif len(y)>len(x): return y
    else:  #lengths are equal, compare lexicographically
        if x>y: 
            return x
        else: 
            return y
        
wordRDD.reduce(largerThan)

'this'
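As a sanity check on a small input, the plain-Python sketch below (no Spark required) applies `largerThan` with `functools.reduce` to every ordering of the word list and compares the result with `max` under the key `(len(word), word)`. The key-based formulation is an alternative introduced here for comparison and is not part of the original notebook.

```python
from functools import reduce
from itertools import permutations


def largerThan(x, y):
    # Prefer the longer word; break ties lexicographically.
    if len(x) > len(y):
        return x
    elif len(y) > len(x):
        return y
    return x if x > y else y


words = ['this', 'is', 'the', 'best', 'mac', 'ever']

# Equivalent single-machine formulation: maximise the pair (length, word).
expected = max(words, key=lambda w: (len(w), w))

# Collect the reduce result for every ordering of the input words.
results = {reduce(largerThan, order) for order in permutations(words)}
print(expected, results)
```

If `results` contained more than one word, the function would be sensitive to the order of its inputs and would not be safe to use in a reduce.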

#### Exercise 3

1. Consider the following RDD:

    ``` listRDD=sc.parallelize([[3,4],[2,1],[7,9]]) ```
 
     Write a regular function and use it with the `reduce` command to output the maximum element over all of the lists. Your output should look like:
     
     Output: ```[9]```
     
     (Note: The output is a list containing a single number rather than just a single number)

In [12]:
def maximum(x, y):
    xMax = max(x)
    yMax = max(y)
    if xMax > yMax:
        return [xMax]
    else:
        return [yMax]
    
listRDD=sc.parallelize([[3,4],[2,1],[7,9]])
listRDD.reduce(maximum)    

[9]

### Reduce operations **must not depend on the order of application**

You can think of the reduce operation as a binary tree in which the leaves are the elements of the list and the root is the final result. Each triplet of the form (parent, child1, child2) corresponds to a single application of the reduce function. There are many ways of arranging this binary tree, and **all of them have to yield the same final result**. In addition, the order of the elements in the list must not change the result. In particular, reversing the order of the operands in a reduce function must not change the outcome. In other words, the operation must be both associative and commutative.

For example, the arithmetic operations multiply `*` and add `+` can be used in a reduce, but the operations subtract `-` and divide `/` cannot.

Using subtract or divide in a reduce will not raise an error, but the result is unpredictable.

In [13]:
print(RDD.collect())
RDD.reduce(lambda x,y: x-y)

[0, 1, 2]


1
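The effect can be reproduced without Spark. The sketch below is a plain-Python model, built around a hypothetical helper `reduce_by_partitions`, in which each partition is reduced on its own and the partial results are then combined. How Spark actually partitions `[0, 1, 2]` depends on the configuration, so the groupings listed are assumptions chosen for illustration.

```python
from functools import reduce
import operator


def reduce_by_partitions(func, partitions):
    """Reduce each partition separately, then reduce the partial results."""
    partials = [reduce(func, part) for part in partitions if part]
    return reduce(func, partials)


groupings = [
    [[0, 1, 2]],      # one partition:  (0 - 1) - 2
    [[0], [1, 2]],    # two partitions: 0 - (1 - 2)
    [[0, 1], [2]],    # two partitions: (0 - 1) - 2
]

sums = []
differences = []
for parts in groupings:
    sums.append(reduce_by_partitions(operator.add, parts))
    differences.append(reduce_by_partitions(operator.sub, parts))

print('sums:', sums)                # identical for every grouping
print('differences:', differences)  # depends on the grouping
```

Addition gives the same answer for every grouping, while subtraction does not. The grouping `[0] | [1, 2]` is one that would produce the value 1 printed above.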

### Combining operations

The method `map` takes an RDD as input and returns an RDD, while the method `reduce` takes an RDD as input and returns a single element. We can therefore cascade map and reduce operations to perform more complex computations in one line.

Suppose we want to compute the sum of the squares of the elements in the RDD. We can either write it in the traditional way:

In [14]:
#separate commands
Squares=RDD.map(lambda x:x*x)
Squares.reduce(lambda x,y:x+y)

5

Or we can combine them into a single cascaded command:

In [16]:
#cascaded commands
RDD.map(lambda x:x*x).reduce(lambda x,y:x+y)

5

These two expressions are equivalent. We might expect that the first form, where the commands are separate, is the more basic one, and that the cascaded command is translated internally into the separate commands.

It turns out that the opposite is true: the cascaded form is closer to what Spark actually executes, and Spark identifies cascading operations even when they are expressed in a non-cascaded way.

The explanation of this surprising behaviour is related to the notion of lazy evaluation in Spark: a transformation such as `map` is not computed until an action such as `reduce` asks for a result. This is explained in the [Spark programming guide/RDD operations](http://spark.apache.org/docs/latest/programming-guide.html#rdd-operations).
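Lazy evaluation can be illustrated in plain Python 3, whose built-in `map` is also lazy. This is only an analogy for the way Spark defers transformations, not a description of Spark's internals; the names `square` and `calls` are made up for the sketch.

```python
from functools import reduce

calls = []


def square(x):
    calls.append(x)   # record each evaluation so laziness becomes visible
    return x * x


data = [0, 1, 2]

# Step 1: describe the transformation. Python 3's built-in map returns a
# lazy iterator, so no element has been squared yet.
squares = map(square, data)
calls_before = len(calls)

# Step 2: the reduction pulls elements through the pipeline one at a time.
total = reduce(lambda x, y: x + y, squares)
calls_after = len(calls)

print(calls_before, calls_after, total)
```

Writing the two steps on separate lines does not make the squares exist any earlier: the work happens only when the reduction asks for values, just as in the cascaded form.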

### An instructive mistake
Here is another way to compute the sum of the squares using a single reduce command. What is wrong with it?

In [17]:
RDD.reduce(lambda x,y: x*x+y*y)


25
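One way to investigate is to evaluate the same function under different groupings in plain Python. The groupings below are assumptions for illustration; which one Spark uses depends on how the RDD is partitioned.

```python
from functools import reduce


def bad(x, y):
    # Squares both operands, including partial results that are
    # already sums of squares.
    return x * x + y * y


data = [0, 1, 2]

left_to_right = bad(bad(0, 1), 2)     # grouping ((0, 1), 2)
right_grouped = bad(0, bad(1, 2))     # grouping (0, (1, 2))

# Squaring in a map first leaves a plain sum for the reduce step, and
# addition gives the same result under any grouping.
correct = reduce(lambda x, y: x + y, map(lambda x: x * x, data))

print(left_to_right, right_grouped, correct)
```

The function is not associative, so the two groupings disagree. The left-to-right grouping happens to match the correct value on this data only because squaring the partial result 1 leaves it unchanged.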

#### Exercise 4

1. Consider the listRDD given in Exercise 3. Find the sum of the maximum numbers of all the lists. Your output should be:

    Output: ``` 15 ```

In [18]:
listRDD.map(lambda x: max(x)).reduce(lambda x,y: x+y)

15

### Getting some information about an RDD
RDDs typically have hundreds of thousands of elements. It usually makes no sense to print out the content of a whole RDD. Here are some ways to get manageable amounts of information about an RDD.

In [19]:
n=10000
B=sc.parallelize(range(n))

In [20]:
#find the number of elements in the RDD
B.count()

10000

In [21]:
# get the first few elements of an RDD
print('first element=',B.first())
print('first 5 elements = ',B.take(5))

first element= 0
first 5 elements =  [0, 1, 2, 3, 4]


#### Sampling an RDD
The method `RDD.sample(withReplacement,p)` generates a sample of the elements of the RDD, where
- `withReplacement` is a boolean flag indicating whether or not an element in the RDD can be sampled more than once.
- `p` is the probability of accepting each element into the sample. Note that because the sampling is random and is performed independently in each partition, the number of elements in the sample changes from sample to sample.

In [23]:
# get a sample whose expected size is m
m=5.
B.sample(False,m/n).collect()

[999, 1483, 7027]
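The variation in sample size can be explored without Spark. The sketch below models sampling without replacement as one independent coin flip per element; the helper `bernoulli_sample` is hypothetical and is not Spark's implementation, and the seeds are arbitrary.

```python
import random


def bernoulli_sample(items, p, seed):
    """Keep each item independently with probability p (no replacement)."""
    rng = random.Random(seed)
    return [x for x in items if rng.random() < p]


n = 10000
m = 5.0   # expected sample size

# Draw 200 samples with different seeds and record their sizes.
sizes = [len(bernoulli_sample(range(n), m / n, seed)) for seed in range(200)]

print('smallest sample:', min(sizes))
print('largest sample:', max(sizes))
print('average size:', sum(sizes) / len(sizes))
```

The sizes scatter around `m`, which is why the sample shown above has 3 elements even though the expected size is 5.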

#### Filtering an RDD
The method `RDD.filter(func)` returns a new RDD formed by selecting those elements of the source on which `func` returns true.


In [24]:
# How many positive numbers?
B.filter(lambda n: n > 0).count()

9999

#### Exercise 5

1. Write a `filter` command to output elements whose cosine is positive. Your command should produce the following output on ` RDD=sc.parallelize([0,2,1]) `:

    ` [0,1] `
    
    
2. Write a `filter` command to output all words whose length is greater than or equal to 4. Your command should produce the following output on ` wordRDD=sc.parallelize(['this','is','the','best','mac','ever']) `:

    ` ['this', 'best', 'ever'] `

In [25]:
RDD.filter(lambda x: math.cos(x) > 0).collect()

[0, 1]

In [26]:
wordRDD=sc.parallelize(['this','is','the','best','mac','ever'])
wordRDD.filter(lambda x: len(x) >= 4).collect()

['this', 'best', 'ever']

#### Remove duplicate elements in RDD
The method `RDD.distinct(numPartitions=None)` returns a new RDD that contains the distinct elements of the source RDD.

* The number of partitions is specified through the **numPartitions** argument. Each of these partitions is potentially on a different machine.


In [27]:
# Remove the duplicate elements in DuplicateRDD to obtain DistinctRDD
DuplicateRDD = sc.parallelize([1,1,2,2,3,3])
DistinctRDD = DuplicateRDD.distinct()
DistinctRDD.collect()


[2, 1, 3]

#### flatMap on an RDD
The method `RDD.flatMap(func)` is similar to map, but each input item can be mapped to 0 or more output items (so `func` should return a sequence, such as a list, rather than a single item).

In [28]:
text=["you are my sunshine","my only sunshine"]
text_file = sc.parallelize(text)
# map each line in text to a list of words
print('map:',text_file.map(lambda line: line.split(" ")).collect())
# create a single list of words by combining the words from all of the lines
print('flatmap:',text_file.flatMap(lambda line: line.split(" ")).collect())

map: [['you', 'are', 'my', 'sunshine'], ['my', 'only', 'sunshine']]
flatmap: ['you', 'are', 'my', 'sunshine', 'my', 'only', 'sunshine']
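The difference between the two methods can be mirrored in plain Python, with `itertools.chain.from_iterable` playing the role of the flattening step. This is a single-machine analogy and not how Spark implements `flatMap`; the last step, which keeps only words longer than three letters, is an extra illustration of the "0 output items" case.

```python
from itertools import chain

text = ["you are my sunshine", "my only sunshine"]

# map: one output item (a list of words) per input line
mapped = [line.split(" ") for line in text]

# flatMap: the per-line lists are concatenated into a single flat list
flat_mapped = list(chain.from_iterable(line.split(" ") for line in text))

# A function that returns an empty list drops the element entirely,
# which is how flatMap can produce 0 output items for an input.
long_words = list(chain.from_iterable(
    [w] if len(w) > 3 else [] for w in flat_mapped))

print(mapped)
print(flat_mapped)
print(long_words)
```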


#### Exercise 6

1. Write a `flatMap` command to collect all the elements from 1 to x for each element x in a list. Your command should produce the following output on `RDD=sc.parallelize([2,3,5])`:

    ``` [1, 2, 1, 2, 3, 1, 2, 3, 4, 5] ```

In [29]:
RDD=sc.parallelize([2,3,5])
RDD.flatMap(lambda x: range(1, x + 1)).collect()

[1, 2, 1, 2, 3, 1, 2, 3, 4, 5]

### Set operations
In this part, we explore the set operations **union**, **intersection**, **subtract**, and **cartesian** in pyspark:
1. union(other)
 * Return the union of this RDD and another one.
2. intersection(other)
 * Return the intersection of this RDD and another one. The output will not contain any duplicate elements, even if the input RDDs did. Note that this method performs a shuffle internally.
3. subtract(other, numPartitions=None)
 * Return each value in self that is not contained in other.
4. cartesian(other)
 * Return the Cartesian product of this RDD and another one, that is, the RDD of all pairs of elements (a, b) where **a** is in **self** and **b** is in **other**.


In [30]:
rdd1 = sc.parallelize([1, 1, 2, 3])
rdd2 = sc.parallelize([1, 3, 4, 5])
print(rdd1.union(rdd2).collect())
print(rdd1.intersection(rdd2).collect())
print(rdd1.subtract(rdd2).collect())
print(rdd1.cartesian(rdd2).collect())

[1, 1, 2, 3, 1, 3, 4, 5]
[1, 3]
[2]
[(1, 1), (1, 3), (1, 1), (1, 3), (1, 4), (1, 5), (1, 4), (1, 5), (2, 1), (2, 3), (3, 1), (3, 3), (2, 4), (2, 5), (3, 4), (3, 5)]
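For comparison, the same four operations can be modelled on ordinary Python lists. This is a sketch of the semantics described above rather than of Spark's implementation, and the order of the elements returned by Spark can differ because it depends on partitioning.

```python
from itertools import product

list1 = [1, 1, 2, 3]
list2 = [1, 3, 4, 5]

# union: plain concatenation, duplicates are kept
union = list1 + list2

# intersection: elements present in both, without duplicates
intersection = sorted(set(list1) & set(list2))

# subtract: values of list1 that do not occur in list2
exclude = set(list2)
subtract = [x for x in list1 if x not in exclude]

# cartesian: every pair (a, b) with a from list1 and b from list2
cartesian = list(product(list1, list2))

print(union)
print(intersection)
print(subtract)
print(len(cartesian))
```

Note that `union` behaves like concatenation rather than a mathematical set union, which is why the value 1 appears three times in the result.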


#### Exercise 7

Consider the following RDDs: 

` RDD1=sc.parallelize(["spark basics", "big data analysis", "spring"]) `

` RDD2=sc.parallelize(["spark using pyspark", "big data"]) `

Use the set operations to produce the following outputs:

* ` ['spark', 'basics', 'big', 'data', 'analysis', 'spring', 'spark', 'using', 'pyspark', 'big', 'data'] `
* ` ['data', 'big', 'spark'] `
* ` ['spring', 'analysis', 'basics'] `
* ` [('spark', 'spark'), ('spark', 'using'), ('spark', 'pyspark'), ('basics', 'spark'), ('basics', 'using'), ('basics', 'pyspark'), ('spark', 'big'), ('spark', 'data'), ('basics', 'big'), ('basics', 'data'), ('big', 'spark'), ('big', 'using'), ('big', 'pyspark'), ('data', 'spark'), ('analysis', 'spark'), ('data', 'using'), ('data', 'pyspark'), ('analysis', 'using'), ('analysis', 'pyspark'), ('spring', 'spark'), ('spring', 'using'), ('spring', 'pyspark'), ('big', 'big'), ('big', 'data'), ('data', 'big'), ('analysis', 'big'), ('data', 'data'), ('analysis', 'data'), ('spring', 'big'), ('spring', 'data')] `

In [36]:
RDD1=sc.parallelize(["spark basics", "big data analysis", "spring"])
RDD2=sc.parallelize(["spark using pyspark", "big data"])
RDD11 = RDD1.flatMap(lambda x: x.split(" "))
RDD21 = RDD2.flatMap(lambda x: x.split(" "))
RDD11.collect(), RDD21.collect()

(['spark', 'basics', 'big', 'data', 'analysis', 'spring'],
 ['spark', 'using', 'pyspark', 'big', 'data'])

In [38]:
print(RDD11.union(RDD21).collect())
print(RDD11.intersection(RDD21).collect())
print(RDD11.subtract(RDD21).collect())
print(RDD11.cartesian(RDD21).collect())

['spark', 'basics', 'big', 'data', 'analysis', 'spring', 'spark', 'using', 'pyspark', 'big', 'data']
['data', 'big', 'spark']
['spring', 'analysis', 'basics']
[('spark', 'spark'), ('spark', 'using'), ('spark', 'pyspark'), ('basics', 'spark'), ('basics', 'using'), ('basics', 'pyspark'), ('spark', 'big'), ('spark', 'data'), ('basics', 'big'), ('basics', 'data'), ('big', 'spark'), ('big', 'using'), ('big', 'pyspark'), ('data', 'spark'), ('analysis', 'spark'), ('data', 'using'), ('data', 'pyspark'), ('analysis', 'using'), ('analysis', 'pyspark'), ('spring', 'spark'), ('spring', 'using'), ('spring', 'pyspark'), ('big', 'big'), ('big', 'data'), ('data', 'big'), ('analysis', 'big'), ('data', 'data'), ('analysis', 'data'), ('spring', 'big'), ('spring', 'data')]
