# Spark Basics 2

### Combining operations

The method `map` takes an RDD as input and returns an RDD, whereas the method `reduce` takes an RDD as input and returns a single element.

We can combine map and reduce operations to perform more complex operations.

Suppose we want to compute the sum of the squares
$$ \sum_{i=1}^n x_i^2 $$
where the elements $x_i$ are stored in an RDD.

Traditional syntax: 
* perform the map
* store the intermediate result in a variable
* perform the reduce

In [1]:
from pyspark import SparkContext
sc = SparkContext(master="local[3]")
sc

<pyspark.context.SparkContext at 0x1058f2b90>

In [2]:
B=sc.parallelize(range(4))

In [3]:
#separate commands
Squares=B.map(lambda x:x*x)
Squares.reduce(lambda x,y:x+y)

14

Or we can combine them into a single cascaded command:

In [4]:
#cascaded commands
B.map(lambda x:x*x)\
   .reduce(lambda x,y:x+y)

14

These two expressions are equivalent. We might expect the first form, where the commands are separate, to be the more basic one, and the cascaded form to be translated into it before execution.

It turns out that the opposite is true: the cascaded form is closer to what Spark actually executes, and Spark identifies cascading operations even when they are expressed in a non-cascaded way.

The explanation of this surprising behaviour is lazy evaluation in Spark: a transformation such as `map` is only recorded, and nothing is computed until an action such as `reduce` asks for a result. This is explained in the [Spark programming guide/RDD operations](http://spark.apache.org/docs/latest/programming-guide.html#rdd-operations).
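Lazy evaluation can be mimicked in plain Python 3, where the built-in `map` is also lazy. The sketch below is only an analogy and not how Spark is implemented; `square`, `calls`, and the other names are made up for illustration, and a plain `range` stands in for the RDD.

```python
from functools import reduce

calls = []  # records every time the squaring function actually runs

def square(x):
    calls.append(x)
    return x * x

# Plain-Python stand-in for B = sc.parallelize(range(4)); this is not an RDD.
data = range(4)

# Like an RDD transformation, Python 3's map() only records what to do.
squares = map(square, data)
calls_before = list(calls)   # still empty: nothing has been squared yet

# Like an RDD action, reduce() pulls the values through the pending map.
total = reduce(lambda x, y: x + y, squares)
calls_after = list(calls)    # now holds 0, 1, 2, 3, and total is 14
```

Storing the intermediate result in `squares` did not force any work; the squaring happened only when `reduce` asked for values, which is why the separate and cascaded forms end up doing the same thing.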

### An instructive mistake
Here is another way to compute the sum of the squares using a single reduce command. What is wrong with it?

In [5]:
C=sc.parallelize([1,1,1])
C.reduce(lambda x,y: x*x+y*y)


5

<h1>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 1 &nbsp; &nbsp; 1 &nbsp; &nbsp; 1<br />
&nbsp;</h1>
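One way to see the problem is to replay the reduction by hand. The sketch below uses Python's `functools.reduce` on ordinary lists instead of an RDD, so it assumes a strict left-to-right order that Spark does not guarantee; the name `bad` is made up for illustration.

```python
from functools import reduce

def bad(x, y):
    # The reducer from the cell above: it squares both arguments,
    # including partial results that were already sums of squares.
    return x * x + y * y

ones = [1, 1, 1]
wrong = reduce(bad, ones)   # bad(bad(1, 1), 1) = 2*2 + 1*1 = 5
right = reduce(lambda x, y: x + y, [x * x for x in ones])   # 1 + 1 + 1 = 3

# The reducer is also not associative, so the grouping changes the answer.
left_grouping = bad(bad(1, 2), 3)    # bad(5, 3)  = 25 + 9  = 34
right_grouping = bad(1, bad(2, 3))   # bad(1, 13) = 1 + 169 = 170
```

Because `reduce` combines partial results in an order that depends on how the data is partitioned, the function passed to it should be associative and commutative. Squaring belongs in a `map` step, and the `reduce` step should only add.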


#### Exercise 4.

1. Consider the listRDD given in Exercise 3. Find the sum of the maximum elements of all the lists. Your output should be:

    Output: ``` 15 ```

### Getting information about an RDD
RDDs typically have hundreds of thousands of elements. It usually makes no sense to print out the content of a whole RDD. Here are some ways to get manageable amounts of information about an RDD.

In [6]:
n=1000000
B=sc.parallelize([0,0,1,0]*(n//4))

In [7]:
#find the number of elements in the RDD
B.count()

1000000

In [8]:
# get the first few elements of an RDD
print('first element=',B.first())
print('first 5 elements = ',B.take(5))

first element= 0
first 5 elements =  [0, 0, 1, 0, 0]


#### Sampling an RDD
* RDDs are often very large.
* Aggregates, such as averages, can be approximated efficiently by using a sample.
* Sampling is done in parallel and it keeps the data local.

The method `RDD.sample(withReplacement,p)` generates a sample of the elements of the RDD, where
- `withReplacement` is a boolean flag indicating whether or not an element in the RDD can be sampled more than once.
- `p` is the probability of accepting each element into the sample. Because each element is accepted or rejected at random, and the sampling is performed independently in each partition, the number of elements in the sample changes from sample to sample.
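The variation in sample size can be seen in a plain-Python sketch. It assumes sampling without replacement behaves as described above, with each element kept independently with probability `p`; the helper `bernoulli_sample` and its `seed` argument are hypothetical and are not part of the Spark API.

```python
import random

def bernoulli_sample(elements, p, seed):
    """Keep each element independently with probability p (no replacement)."""
    rng = random.Random(seed)
    return [x for x in elements if rng.random() < p]

data = [0, 0, 1, 0] * 2500   # 10,000 elements, a quarter of them are 1
p = 0.01                     # expected sample size is len(data) * p = 100

# Five samples drawn with different seeds: the sizes scatter around 100.
sizes = [len(bernoulli_sample(data, p, seed)) for seed in range(5)]
```

Only the expected size is fixed by `p`, so a sample of exactly `len(data) * p` elements is the exception rather than the rule.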

In [9]:
# get a sample whose expected size is m
m=5.
B.sample(False,m/n).collect()

[0, 0, 0, 0]

In [10]:
#compute the average exactly:
exact=B.reduce(lambda x,y:x+y)/(n+0.0)
print('exact average', exact)
#compute the average by sampling 1% of the elements
p=0.01
approx=B.sample(False,p).reduce(lambda x,y:x+y)/(n*p)
print('approximate average',approx)
print('error=',approx-exact)


exact average 0.25
approximate average 0.2513
error= 0.0013


#### Things to note and think about
* Each time you run the previous cell, you get a different estimate.
* The accuracy of the estimate is determined by the expected size of the sample, $n p$.
* See how the error changes as you vary $p$.
* Can you give a formula that relates the variance of the estimate to $n p$? (The answer is in the Probability and Statistics course.)

#### Filtering an RDD
The method `RDD.filter(func)` returns a new RDD formed by selecting those elements of the source on which `func` returns true.


In [11]:
# How many positive numbers?
B.filter(lambda n: n > 0).count()

250000

#### Exercise 5

1. Write a `filter` command to output elements whose cosine is positive. Your command should produce the following output on ` RDD=sc.parallelize([0,2,1]) `:

    ` [0,1] `
    
    
2. Write a `filter` command to output all words whose length is greater than or equal to 4. Your command should produce the following output on ` wordRDD=sc.parallelize(['this','is','the','best','mac','ever']) `:

    ` ['this', 'best', 'ever'] `

#### Removing duplicate elements from an RDD
The method `RDD.distinct(numPartitions=None)` returns a new RDD that contains the distinct elements of the source RDD.

* The number of partitions of the result is specified through the **numPartitions** argument. Each of these partitions is potentially on a different machine.


In [12]:
# Remove the duplicate elements of DuplicateRDD to get DistinctRDD
DuplicateRDD = sc.parallelize([1,1,2,2,3,3])
DistinctRDD = DuplicateRDD.distinct()
DistinctRDD.collect()


[3, 1, 2]

#### flatMap on an RDD
The method `RDD.flatMap(func)` is similar to `map`, but each input item can be mapped to 0 or more output items (so `func` should return a sequence, such as a list, rather than a single item).

In [13]:
text=["you are my sunshine","my only sunshine"]
text_file = sc.parallelize(text)
# map each line in text to a list of words
print('map:',text_file.map(lambda line: line.split(" ")).collect())
# create a single list of words by combining the words from all of the lines
print('flatmap:',text_file.flatMap(lambda line: line.split(" ")).collect())

map: [['you', 'are', 'my', 'sunshine'], ['my', 'only', 'sunshine']]
flatmap: ['you', 'are', 'my', 'sunshine', 'my', 'only', 'sunshine']
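The "0 or more output items" part is easiest to see when some input produces nothing. The following is a plain-Python sketch of the same idea on an ordinary list; `flat_map` is a hypothetical helper written for illustration and is not the Spark method.

```python
from itertools import chain

def flat_map(func, items):
    """Apply func to every item and concatenate the iterables it returns."""
    return list(chain.from_iterable(func(x) for x in items))

lines = ["you are my sunshine", "", "my only sunshine"]

# str.split() with no argument returns [] for the empty line,
# so that line contributes no words to the result.
words = flat_map(lambda line: line.split(), lines)
# words == ['you', 'are', 'my', 'sunshine', 'my', 'only', 'sunshine']
```

Three input lines become seven output words: the function returned four, zero, and three items respectively, and the results were concatenated into one flat list.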


#### Exercise 6

1. Write a `flatMap` command to collect all the elements from 1 to x for each element x in a list. Your command should produce the following output on `RDD=sc.parallelize([2,3,5])`:

    ``` [1, 2, 1, 2, 3, 1, 2, 3, 4, 5] ```

### Set operations
In this part, we explore set operations in PySpark, including **union**, **intersection**, **subtract**, and **cartesian**.

In [14]:
rdd1 = sc.parallelize([1, 1, 2, 3])
rdd2 = sc.parallelize([1, 3, 4, 5])


1. union(other)
 * Return the union of this RDD and another one. Unlike a set union, duplicate elements are kept.

 `rdd1.union(rdd2).collect()`

2. intersection(other)
 * Return the intersection of this RDD and another one. The output will not contain any duplicate elements, even if the input RDDs did. Note that this method performs a shuffle internally.

In [15]:
rdd1.intersection(rdd2).collect()

[1, 3]

3. subtract(other, numPartitions=None)
 * Return each value in self that is not contained in other.

In [16]:
rdd1.subtract(rdd2).collect()

[2]

4. cartesian(other)
 * Return the Cartesian product of this RDD and another one, that is, the RDD of all pairs of elements (a, b) where **a** is in **self** and **b** is in **other**.

In [17]:
print(rdd1.cartesian(rdd2).collect())

[(1, 1), (1, 3), (1, 4), (1, 5), (1, 1), (1, 3), (1, 4), (1, 5), (2, 1), (3, 1), (2, 3), (3, 3), (2, 4), (2, 5), (3, 4), (3, 5)]
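As a rough mental model, the four operations can be imitated on ordinary Python lists. This sketch ignores partitioning and the order in which Spark returns elements, so it only reproduces the contents of the results; the variable names are illustrative.

```python
from itertools import product

a = [1, 1, 2, 3]   # same elements as rdd1
b = [1, 3, 4, 5]   # same elements as rdd2

union = a + b                                  # duplicates are kept
intersection = sorted(set(a) & set(b))         # duplicates are dropped
subtract = [x for x in a if x not in set(b)]   # elements of a that are absent from b
cartesian = list(product(a, b))                # every pair (x, y): 4 * 4 = 16 pairs
```

Here `intersection` is `[1, 3]` and `subtract` is `[2]`, matching the outputs above, and `cartesian` holds the same 16 pairs as the Spark result in a different order.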


#### Exercise 7

Consider the following RDDs: 

` RDD1=sc.parallelize(["spark basics", "big data analysis", "spring"]) `

` RDD2=sc.parallelize(["spark using pyspark", "big data"]) `

Use the set operations to produce the following outputs:

* ` ['spark', 'basics', 'big', 'data', 'analysis', 'spring', 'spark', 'using', 'pyspark', 'big', 'data'] `
* ` ['data', 'big', 'spark'] `
* ` ['spring', 'analysis', 'basics'] `
* ` [('spark', 'spark'), ('spark', 'using'), ('spark', 'pyspark'), ('basics', 'spark'), ('basics', 'using'), ('basics', 'pyspark'), ('spark', 'big'), ('spark', 'data'), ('basics', 'big'), ('basics', 'data'), ('big', 'spark'), ('big', 'using'), ('big', 'pyspark'), ('data', 'spark'), ('analysis', 'spark'), ('data', 'using'), ('data', 'pyspark'), ('analysis', 'using'), ('analysis', 'pyspark'), ('spring', 'spark'), ('spring', 'using'), ('spring', 'pyspark'), ('big', 'big'), ('big', 'data'), ('data', 'big'), ('analysis', 'big'), ('data', 'data'), ('analysis', 'data'), ('spring', 'big'), ('spring', 'data')]
 `    