<font color=red>
Spark_Version:2.0.0<br/>
Python_Version:Python 3.5.2 | Anaconda4.1.1(64-bit)<br/>
Jupyter_Version:4.2.1<br/>
System:Ubuntu 16.04 LTS(64-bit)
</font>

In [1]:
import platform
print("Spark_Version:",sc.version)
print("Python_Version:",platform.python_version())
print("System:",platform.system())

Spark_Version: 2.0.0
Python_Version: 3.5.2
System: Linux


# Creating RDDs
In this section, we will use both a Python shell (PySpark) and a Scala shell (Spark-Shell) to create an RDD. Both of these shells have a predefined, interpreter-aware SparkContext that is assigned to a variable sc.

In [2]:
type(sc)

pyspark.context.SparkContext

In [3]:
#Pass the file path to create an RDD 
#from the local file system
fileRDD = sc.textFile('README.md')

In [4]:
type(fileRDD)

pyspark.rdd.RDD

In [5]:
#first() is an action: it evaluates the RDD's DAG
#and returns the first item in the RDD
fileRDD.first()

'# SparkForDataScience'

We have completed the whole cycle of initiating a Spark application (shell), creating an RDD, and consuming it. Because fileRDD is not persisted in memory or on disk, it is recomputed every time an action is executed. This lets Spark optimize the sequence of steps and run only the work that an action needs. <strong>In fact, in the previous example, Spark would have read just one partition of the input file, because first() does not require a complete file scan</strong>.
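
The point about `first()` can be illustrated without Spark. The following is a minimal plain-Python sketch, not Spark's implementation: `partitions` is a hypothetical stand-in for a file split into three partitions, and `first_item` is an illustrative helper that records which partitions it had to read.

```python
def first_item(partitions, touched):
    """Return the first element, scanning partitions only as far as needed."""
    for index, partition in enumerate(partitions):
        touched.append(index)      # record that this partition was read
        for item in partition:
            return item            # stop as soon as one element is found
    raise ValueError("dataset is empty")


# Hypothetical contents; only the first line comes from the example above.
partitions = [["# SparkForDataScience", ""], ["line 3"], ["line 4", "line 5"]]
touched = []
result = first_item(partitions, touched)
print(result, touched)  # only partition 0 is read
```

If the first partition were empty, the helper would move on to the next one, so the amount of data read depends on where the first element lives.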

The following example creates an RDD by passing a Python/Scala list with the parallelize function:

In [6]:
numRDD = sc.parallelize([1,2,3,4], 2)
type(numRDD)

pyspark.rdd.RDD

In [7]:
numRDD

ParallelCollectionRDD[3] at parallelize at PythonRDD.scala:475

In [8]:
numRDD.map(lambda x: x*x).collect()

[1, 4, 9, 16]

In [9]:
numRDD.map(lambda x: x*x).reduce(lambda a,b: a+b)

30

RDD creation by passing in-memory collections is simple but may not work very well for large collections, <strong>because the input collection should fit completely in the driver node's memory</strong>.

Writing a Spark program usually consists of transformations and actions. Transformations are lazy operations that define how to build an RDD, and most of them accept a single function argument. Each transformation converts one dataset into another: every time you apply a transformation to an RDD, a new RDD is generated, even for a small change. This is because RDDs are immutable (read-only) abstractions by design. The output of an action can either be written back to the storage system or returned to the driver program for local computation, if that is needed to produce the final output.
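
To make laziness and immutability concrete, here is a toy model in plain Python. `LazyDataset` is a hypothetical class written for this illustration and is not how Spark implements RDDs; it only shows that transformations record work and return a new object, while an action triggers the computation.

```python
class LazyDataset:
    """Toy model of an immutable, lazily evaluated dataset (not Spark's RDD)."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)  # recorded transformations, never mutated

    def map(self, func):
        # A transformation returns a NEW object; nothing is computed yet.
        return LazyDataset(self._source, self._steps + (("map", func),))

    def filter(self, pred):
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def collect(self):
        # An action replays the recorded steps over the source data.
        items = iter(self._source)
        for kind, func in self._steps:
            items = map(func, items) if kind == "map" else filter(func, items)
        return list(items)


calls = []  # records every invocation of square()


def square(x):
    calls.append(x)
    return x * x


base = LazyDataset([1, 2, 3, 4])
squares = base.map(square)                    # no call to square() yet
evens = squares.filter(lambda x: x % 2 == 0)
calls_before_action = list(calls)             # still empty
result = evens.collect()                      # the action runs the pipeline
print(calls_before_action, result)
```

Calling `base.collect()` afterwards still returns the original four numbers, because each transformation left the dataset it was applied to unchanged.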

## The filter operation
The `filter` operation returns an RDD with only those elements that satisfy a filter condition, similar to the `WHERE` condition in `SQL`.

In [10]:
a = sc.parallelize([1,2,3,4,5,6], 3)
b = a.filter(lambda x: x % 3 == 0)
b.collect()

[3, 6]

## The distinct operation
The `distinct([numTasks])` operation returns a new RDD with duplicates eliminated:

In [11]:
c = sc.parallelize(['John', 'Jack', 'Mike', 'Jack'], 2)
c.distinct().collect()

['Jack', 'John', 'Mike']

## The intersection operation
The intersection operation takes another dataset as input. It returns a dataset that contains common elements:

In [12]:
x = sc.parallelize([1,2,3,4,5,6,7,8,9,10])
y = sc.parallelize([5,6,7,8,9,10,11,12,13,14,15])
z = x.intersection(y)
z.collect()

[8, 5, 9, 10, 6, 7]

## The union operation
The union operation takes another dataset as input. It returns a dataset that contains elements of itself and the input dataset supplied to it. If there are common values in both sets, then they will appear as duplicate values in the resulting set after union:

In [13]:
a = sc.parallelize([3,4,5,6,7], 1)
b = sc.parallelize([7,8,9], 1)
c = a.union(b)
c.collect()

[3, 4, 5, 6, 7, 7, 8, 9]
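
The different handling of duplicates in union and intersection can be modeled with ordinary lists and sets. This is a plain-Python sketch of the semantics only; the `sorted` calls are there for readable output, since Spark does not guarantee element order after a shuffle.

```python
a = [3, 4, 5, 6, 7]
b = [7, 8, 9]

union_all = a + b                          # like union(): duplicates are kept
union_distinct = sorted(set(a) | set(b))   # like union() followed by distinct()
common = sorted(set(a) & set(b))           # like intersection(): no duplicates

print(union_all)
print(union_distinct)
print(common)
```

In SQL terms, union behaves like `UNION ALL`, so a `distinct()` call is needed afterwards when set semantics are wanted.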

## The map operation
The map operation returns a distributed dataset formed by executing an input function on each of the elements in the input dataset:

In [14]:
a = sc.parallelize(["animal", "human", "bird", "rat"], 3)
b = a.map(lambda x: len(x))
c = a.zip(b)
c.collect()

[('animal', 6), ('human', 5), ('bird', 4), ('rat', 3)]

In [15]:
c.keys().collect()

['animal', 'human', 'bird', 'rat']

## The flatMap operation
The flatMap operation is similar to the map operation. While map returns one element per input element, flatMap returns a list of zero or more elements for each input element, and these lists are flattened into a single dataset:

In [16]:
a = sc.parallelize([1,2,3,4,5], 4)
a.flatMap(lambda x: range(1, x+1)).collect()

[1, 1, 2, 1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 5]

In [17]:
sc.parallelize([5, 10, 20], 2).flatMap(lambda x: [x, x, x]).collect()

[5, 5, 5, 10, 10, 10, 20, 20, 20]
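
The difference between the two operations is easiest to see when the same function is applied both ways. The sketch below uses plain Python lists as a stand-in for RDDs; `triple` is an illustrative name for the function used in the previous cell.

```python
from itertools import chain

data = [5, 10, 20]
triple = lambda x: [x, x, x]

# map-like: exactly one result per input element, so the lists stay nested.
mapped = [triple(x) for x in data]

# flatMap-like: each returned list is flattened into the output.
flat_mapped = list(chain.from_iterable(triple(x) for x in data))

# Returning an empty list drops the input element entirely.
dropped = list(chain.from_iterable([] for x in data))

print(mapped)
print(flat_mapped)
print(dropped)
```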

## The keys operation
The keys operation returns an RDD with the key of each tuple:

In [18]:
a = sc.parallelize(['black', 'blue', 'white', 'green', 'grey'], 2)
b = a.map(lambda x: (len(x), x))
c = b.keys()
c.collect()

[5, 4, 5, 5, 4]

## The cartesian operation
The cartesian operation takes another dataset as an argument and returns the Cartesian product of both datasets. This can be an expensive operation, returning a dataset of size m × n, where m and n are the sizes of the input datasets:

In [19]:
x = sc.parallelize([1, 2, 3])
y = sc.parallelize([10, 11, 12])
x.cartesian(y).collect()

[(1, 10),
 (1, 11),
 (1, 12),
 (2, 10),
 (3, 10),
 (2, 11),
 (2, 12),
 (3, 11),
 (3, 12)]
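
The size claim can be checked with `itertools.product`, which computes the same set of pairs on local lists. This is only a local model; the order of the pairs in Spark's output depends on partitioning, as the result above shows.

```python
from itertools import product

x = [1, 2, 3]
y = [10, 11, 12]

pairs = list(product(x, y))
print(len(pairs), len(x) * len(y))  # both sizes are equal
```

Because the output grows with the product of the input sizes, two inputs of a million elements each would already yield a trillion pairs.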

## Transformations on pair RDDs
Some Spark operations are available only on RDDs of key value pairs. <strong>Note that most of these operations, except counting operations, usually involve shuffling, because the data related to a key may not always reside on a single partition</strong>.
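
The need for shuffling can be sketched in plain Python. In the hypothetical layout below, records with key 5 sit in both partitions, so they must be moved before they can be processed together. `partition_by_key` is an illustrative helper that assumes simple hash partitioning; Spark's actual partitioner and hash function may differ.

```python
def partition_by_key(records, num_partitions):
    """Route each (key, value) record to the partition chosen by hashing its key."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        # hash() of a small int is the int itself, so this is deterministic here.
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions


# Assumed layout before the shuffle: keys 4 and 5 are spread over both partitions.
before = [
    [(5, "black"), (4, "blue")],
    [(5, "white"), (5, "green"), (4, "grey")],
]
records = [record for partition in before for record in partition]
after = partition_by_key(records, 2)
print(after)
```

After repartitioning, every record for a given key lives in exactly one partition, which is what key-based operations rely on.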

## The groupByKey operation
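Similar to the SQL GROUP BY clause, this groups input data based on the key, and you can use aggregateByKey or reduceByKey to perform aggregate operations. The following example uses the related groupBy operation, which derives the key of each element from a function: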

In [20]:
a = sc.parallelize(["black", "blue", "white", "green", "grey"], 2)
b = a.groupBy(lambda x: len(x)).collect()
sorted([(x, sorted(y)) for (x,y) in b])

[(4, ['blue', 'grey']), (5, ['black', 'green', 'white'])]
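
For comparison, grouping by an explicit key can be modeled in plain Python on (key, value) pairs. `group_by_key` is an illustrative helper, not Spark code; in PySpark the corresponding pipeline would be along the lines of `a.map(lambda x: (len(x), x)).groupByKey()`.

```python
from collections import defaultdict


def group_by_key(pairs):
    """Collect all values that share a key into one list per key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


words = ["black", "blue", "white", "green", "grey"]
pairs = [(len(word), word) for word in words]
grouped = group_by_key(pairs)
print(grouped)
```

Unlike the cell above, the values here keep their input order because they are not sorted within each group.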

## The join operation
The join operation takes another dataset as input. Both datasets should be of the <strong>key value</strong> pairs type. The resulting dataset is yet another key value dataset having keys and values from both datasets:

In [21]:
a = sc.parallelize(["blue", "green", "orange"], 3)
b = a.keyBy(lambda x: len(x))
b.collect()

[(4, 'blue'), (5, 'green'), (6, 'orange')]

In [22]:
c = sc.parallelize(['black', 'white', 'grey'], 3)
d = c.keyBy(lambda x: len(x))
d.collect()

[(5, 'black'), (5, 'white'), (4, 'grey')]

In [23]:
b.join(d).collect()

[(4, ('blue', 'grey')), (5, ('green', 'black')), (5, ('green', 'white'))]

In [24]:
#leftOuterJoin
b.leftOuterJoin(d).collect()

[(6, ('orange', None)),
 (4, ('blue', 'grey')),
 (5, ('green', 'black')),
 (5, ('green', 'white'))]

In [25]:
#rightOuterJoin
b.rightOuterJoin(d).collect()

[(4, ('blue', 'grey')), (5, ('green', 'black')), (5, ('green', 'white'))]

In [26]:
#fullOuterJoin
b.fullOuterJoin(d).collect()

[(6, ('orange', None)),
 (4, ('blue', 'grey')),
 (5, ('green', 'black')),
 (5, ('green', 'white'))]
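
In the cells above, the right outer join returns the same result as the inner join, and the full outer join the same as the left outer join, because every key in `d` also occurs in `b`. The plain-Python model below makes the difference visible by adding a pair with an unmatched key. The `join` helper and the extra pair `(7, "crimson")` are hypothetical and only illustrate the semantics.

```python
from collections import defaultdict


def join(left, right, how="inner"):
    """Plain-Python model of joining two lists of (key, value) pairs."""
    left_groups, right_groups = defaultdict(list), defaultdict(list)
    for key, value in left:
        left_groups[key].append(value)
    for key, value in right:
        right_groups[key].append(value)

    if how == "inner":
        keys = set(left_groups) & set(right_groups)
    elif how == "left":
        keys = set(left_groups)
    elif how == "right":
        keys = set(right_groups)
    else:  # "full"
        keys = set(left_groups) | set(right_groups)

    result = []
    for key in sorted(keys):
        # A side with no match contributes a single None placeholder.
        for left_value in left_groups.get(key, [None]):
            for right_value in right_groups.get(key, [None]):
                result.append((key, (left_value, right_value)))
    return result


b = [(4, "blue"), (5, "green"), (6, "orange")]
d = [(5, "black"), (5, "white"), (4, "grey")]
d_extra = d + [(7, "crimson")]  # hypothetical pair with no match in b

print(join(b, d))
print(join(b, d_extra, "right"))
print(join(b, d_extra, "full"))
```

With `d_extra`, the right and full outer joins gain a `(7, (None, 'crimson'))` entry, while the inner and left outer joins are unchanged.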

## The reduceByKey operation
The reduceByKey operation merges the values for each key using an associative and commutative reduce function. It also performs the merging locally on each mapper before sending results to a reducer, and it produces hash-partitioned output:

In [27]:
a = sc.parallelize(["black", "blue", "white", "green", "grey"], 2)
b = a.map(lambda x: (len(x), x))
b.collect()

[(5, 'black'), (4, 'blue'), (5, 'white'), (5, 'green'), (4, 'grey')]

In [28]:
b.reduceByKey(lambda x, y: x+y).collect()

[(4, 'bluegrey'), (5, 'blackwhitegreen')]

In [29]:
a = sc.parallelize(["black", "blue", "white", "orange"], 2)
b = a.map(lambda x: (len(x), x))
b.reduceByKey(lambda x,y: x + y).collect()

[(4, 'blue'), (6, 'orange'), (5, 'blackwhite')]
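
The local merging step can be sketched in plain Python. The helpers below are illustrative and assume the five pairs from the first example are split across two partitions as shown; the real split and merge order are decided by Spark.

```python
def combine_locally(partition, func):
    """Merge values per key inside one partition (the map-side combine)."""
    merged = {}
    for key, value in partition:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged


def reduce_by_key(partitions, func):
    """Combine within each partition first, then merge the partial results."""
    total = {}
    for partition in partitions:
        for key, value in combine_locally(partition, func).items():
            total[key] = func(total[key], value) if key in total else value
    return sorted(total.items())


# Assumed split of the five pairs across two partitions.
partitions = [
    [(5, "black"), (4, "blue")],
    [(5, "white"), (5, "green"), (4, "grey")],
]
local = [combine_locally(p, lambda x, y: x + y) for p in partitions]
result = reduce_by_key(partitions, lambda x, y: x + y)
print(local)
print(result)
```

The second partition sends one value per key instead of three records, which is the saving that the local merge provides. String concatenation is associative but not commutative, so the concatenated order in a real job depends on how the partial results arrive.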

## The aggregate operation
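The aggregate operation aggregates the elements of each partition, and then the results of all the partitions, using an initial zero value and two functions. The first function folds an element into the running result of a partition; the second merges the results of two partitions. Unlike the preceding operations, aggregate is an action: it returns a value to the driver program rather than an RDD: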

In [30]:
z = sc.parallelize([1, 2, 7, 4, 30, 6], 2)
z.aggregate(0,(lambda x, y: max(x, y)), (lambda x, y: x+y))

37

In [31]:
z = sc.parallelize(['a', 'b', 'c', 'd'], 2)
z.aggregate('', (lambda x, y: x + y), (lambda x, y: x + y))

'abcd'

In [32]:
z.reduce(lambda x, y: x + y)

'abcd'

In [33]:
z.aggregate('s', (lambda x, y: x + y), (lambda x, y: x + y))

'ssabscd'

In [34]:
z = sc.parallelize(["12","234","345","56789"],2)
z.aggregate('', (lambda x, y: str(max(len(str(x)), len(str(y))))), (lambda x,y: str(y) + str(x)))

'53'

In [35]:
z.aggregate('', \
            (lambda x, y: str(min(len(str(x)), len(str(y))))), \
            (lambda x,y: str(y) + str(x)))

'11'

In [36]:
z = sc.parallelize(["12","234","345",""],2)
z.aggregate("", (lambda x, y: str(min(len(str(x)), len(str(y))))),
            (lambda x, y: str(y) + str(x)))

'01'
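
The results above are easier to follow with a plain-Python model of the two-stage fold. The `aggregate` function below is a sketch that assumes each list is split evenly into two partitions in order; the names `shortest`, `newest_first`, and `concat` are illustrative.

```python
from functools import reduce
from operator import add


def aggregate(partitions, zero, seq_op, comb_op):
    """Fold each partition with seq_op, then fold the partial results with comb_op."""
    partials = [reduce(seq_op, partition, zero) for partition in partitions]
    return reduce(comb_op, partials, zero)


shortest = lambda x, y: str(min(len(str(x)), len(str(y))))
newest_first = lambda x, y: str(y) + str(x)
concat = lambda x, y: x + y

# Partition maxima are 7 and 30; they are then added to the zero value 0.
r_max_sum = aggregate([[1, 2, 7], [4, 30, 6]], 0, max, add)

# The zero value 's' is used once per partition and once more in the final merge.
r_letters = aggregate([["a", "b"], ["c", "d"]], "s", concat, concat)

# Both partitions reduce to '1': the zero value '' has length 0, so the first
# step yields '0', whose own length 1 is then the minimum.
r_min_a = aggregate([["12", "234"], ["345", "56789"]], "", shortest, newest_first)

# The empty string in the second partition brings that partial result back to '0'.
r_min_b = aggregate([["12", "234"], ["345", ""]], "", shortest, newest_first)

print(r_max_sum, r_letters, r_min_a, r_min_b)
```

Because the zero value enters every partition and the final merge, a zero value that is not neutral for both functions makes the result depend on the number of partitions.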

# Actions
Once an RDD has been created, its transformations are executed only when an action is performed on it. The result of an action can either be written back to the storage system or returned to the driver program that initiated it, for further local computation to produce the final result.

## The collect() function
The `collect()` function returns all the results of an RDD operation as an array to the driver program. This is usually useful for operations that produce sufficiently small datasets. Ideally, the result should easily fit in the memory of the system that's hosting the driver program.
## The count() function
This returns the number of elements in a dataset or the resulting output of an RDD operation.
## The take(n) function
The `take(n)` function returns the first `n` elements of a dataset or the resulting output of an RDD operation.
## The first() function
The `first()` function returns the first element of the dataset or the resulting output of an RDD operation. It works similarly to the `take(1)` function.

In [37]:
sc.parallelize([2, 3, 4]).count()

3

In [38]:
sc.parallelize([2, 3, 4]).collect()

[2, 3, 4]

In [39]:
sc.parallelize([2, 3, 4]).first()

2

In [40]:
sc.parallelize([2, 3, 4]).take(2)

[2, 3]
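
One practical difference between `first()` and `take(1)` shows up on an empty dataset. The sketch below models it with plain Python iterables; the helper names are illustrative. It reflects the PySpark behavior as I understand it, where `take(1)` returns an empty list and `first()` raises a `ValueError`.

```python
from itertools import islice


def take(items, n):
    """Return at most n leading elements as a list."""
    return list(islice(items, n))


def first(items):
    """Return the leading element, or fail loudly if there is none."""
    head = take(items, 1)
    if not head:
        raise ValueError("dataset is empty")
    return head[0]


print(take([2, 3, 4], 2))
print(first([2, 3, 4]))
print(take([], 1))  # an empty list, not an error

try:
    first([])
    first_failed = False
except ValueError:
    first_failed = True
print(first_failed)
```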

## The takeSample() function
The `takeSample(withReplacement, num, [seed])` function returns an array with a random sample of elements from a dataset. It has three arguments as follows:
<ul><li>`withReplacement`: This indicates whether to sample with or without replacement, that is, whether a sampled element is put back into the set before the next element is drawn. Pass True to sample with replacement and False otherwise.</li>
<li>`num`: This indicates the number of elements in the sample.</li>
<li>`seed`: This is a random number generator seed (optional).</li>
</ul>

In [41]:
rdd = sc.parallelize(range(1,10))
rdd.takeSample(True, 20, 1)

[5, 4, 1, 7, 9, 2, 6, 4, 6, 9, 4, 4, 3, 1, 1, 4, 4, 6, 9, 8]

In [42]:
rdd.takeSample(False, 20, 1)

[6, 7, 8, 5, 4, 1, 9, 2, 3]

In [43]:
rdd.takeSample(False, 4, 1)

[6, 7, 8, 5]

In [44]:
rdd.takeSample(True, 4, 1)

[8, 6, 5, 5]
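
The outputs above show that sampling without replacement can never return more elements than the dataset holds, while sampling with replacement can. The plain-Python sketch below models this with the standard `random` module. `take_sample` is an illustrative helper, and its random numbers will not match Spark's even with the same seed.

```python
import random


def take_sample(data, with_replacement, num, seed=None):
    """Model of takeSample on a local list."""
    rng = random.Random(seed)
    if with_replacement:
        # Each draw is independent, so elements may repeat.
        return [rng.choice(data) for _ in range(num)]
    # Without replacement, the sample is capped at the dataset size.
    return rng.sample(data, min(num, len(data)))


data = list(range(1, 10))
with_repl = take_sample(data, True, 20, seed=1)
without_repl = take_sample(data, False, 20, seed=1)
print(len(with_repl), len(without_repl))
```

Passing the same seed twice reproduces the same sample, which is useful for repeatable experiments.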

## The countByKey() function
The `countByKey()` function is available only on RDDs of type <strong>key value</strong>. It returns a dictionary of <strong>(K, Int)</strong> pairs with the count of each key.

In [45]:
rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
rdd.countByKey()

defaultdict(int, {'a': 2, 'b': 1})

In [46]:
rdd.countByKey().items()

dict_items([('a', 2), ('b', 1)])
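
Note that `countByKey()` counts how often each key occurs and ignores the values. A plain-Python equivalent using `collections.Counter` makes this explicit; the second list is a hypothetical input chosen to show that larger values do not change the counts.

```python
from collections import Counter

pairs = [("a", 1), ("b", 1), ("a", 1)]
counts = Counter(key for key, _ in pairs)

# Hypothetical input: the values 10 and 20 play no role in the count.
other_counts = Counter(key for key, _ in [("a", 10), ("a", 20)])

print(dict(counts), dict(other_counts))
```

Summing the values per key instead would call for `reduceByKey` with an addition function.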