Visual diagrams depicting the Spark API. Created by Jeff Thompson, https://github.com/jkthompson/pyspark-pictures.

In [1]:
import findspark
findspark.init()
from pyspark import SparkContext
from pyspark.sql import SparkSession
sc = SparkContext("local")
spark = SparkSession.builder.getOrCreate()

# Collecting the data

<a href="http://spark.apache.org/docs/1.2.0/api/python/pyspark.html#pyspark.RDD.collect">
<img align=left src="images/pyspark-page22.svg" width=360 height=203 />
</a>

**collect()**. Return a list that contains all of the elements in this RDD. The data is transferred from the Spark Java core to the Python environment, which makes this an expensive operation.

Note: This method should only be used if the resulting array is expected to be small, as all the data is loaded into the driver’s memory.

In [2]:
# collect
x = sc.parallelize([1,2,3])
y = x.collect()
print(x)  # distributed
print(y)  # not distributed

ParallelCollectionRDD[0] at readRDDFromFile at PythonRDD.scala:274
[1, 2, 3]


<a href="http://spark.apache.org/docs/1.2.0/api/python/pyspark.html#pyspark.RDD.top">
<img align=left src="images/pyspark-page37.svg" width=360 height=203 />
</a>

**top(num, key=None)**. Get the top N elements from an RDD. Note: It returns the list sorted in descending order.

In [3]:
# top
x = sc.parallelize([1,3,1,2,3])
y = x.top(num = 3)
print(x.collect())
print(y)

[1, 3, 1, 2, 3]
[3, 3, 2]


<a href="http://spark.apache.org/docs/1.2.0/api/python/pyspark.html#pyspark.RDD.takeOrdered">
<img align=left src="images/pyspark-page38.svg" width=360 height=203 />
</a>

**takeOrdered(num, key=None)**. Get the first N elements from an RDD, ordered in ascending order or as specified by the optional key function.

In [4]:
# takeOrdered
x = sc.parallelize([1,3,1,2,3])
y = x.takeOrdered(num = 3)
print(x.collect())
print(y)

[1, 3, 1, 2, 3]
[1, 1, 2]
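
Both `top` and `takeOrdered` accept a `key` function. As a rough plain-Python analogy (no Spark required; `data` is a hypothetical stand-in for the RDD contents), the standard-library `heapq` helpers select elements in the same way:

```python
import heapq

data = [1, 3, 1, 2, 3]  # hypothetical stand-in for the RDD contents

largest = heapq.nlargest(3, data)    # comparable to x.top(3)
smallest = heapq.nsmallest(3, data)  # comparable to x.takeOrdered(3)
# A key that negates each value reverses the ordering used for selection
smallest_by_neg = heapq.nsmallest(3, data, key=lambda v: -v)

print(largest)          # [3, 3, 2]
print(smallest)         # [1, 1, 2]
print(smallest_by_neg)  # [3, 3, 2]
```

Passing `key=lambda v: -v` to `takeOrdered` is therefore one way to obtain the same elements that `top` would return.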


<a href="http://spark.apache.org/docs/1.2.0/api/python/pyspark.html#pyspark.RDD.take">
<img align=left src="images/pyspark-page39.svg" width=360 height=203 />
</a>

**take(num)**. Take the first num elements of the RDD. It works by first scanning one partition, and then uses the results from that partition to estimate the number of additional partitions needed to satisfy the limit.

In [5]:
# take
x = sc.parallelize([1,3,1,2,3])
y = x.take(num = 3)
print(x.collect())
print(y)

[1, 3, 1, 2, 3]
[1, 3, 1]
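
The partition-by-partition scan can be pictured with a simplified plain-Python sketch. The function `take_sketch` and the three-partition layout are hypothetical, and the sketch scans exactly one partition per step rather than estimating how many to scan next, as the description above says Spark does:

```python
def take_sketch(partitions, num):
    """Collect up to `num` elements, scanning partitions in order.

    Simplification: one partition per step, no estimation of how many
    further partitions are needed.
    """
    taken = []
    scanned = 0
    for part in partitions:
        if len(taken) >= num:
            break  # limit satisfied, remaining partitions are never read
        taken.extend(part[: num - len(taken)])
        scanned += 1
    return taken, scanned


partitions = [[1, 3], [1, 2], [3]]  # hypothetical layout of [1, 3, 1, 2, 3]
result, scanned = take_sketch(partitions, 3)
print(result)   # [1, 3, 1]
print(scanned)  # 2
```

The point of the design is that `take` can stop early, whereas `collect` always reads every partition.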


<a href="http://spark.apache.org/docs/1.2.0/api/python/pyspark.html#pyspark.RDD.first">
<img align=left src="images/pyspark-page40.svg" width=360 height=203 />
</a>

**first()**. Return the first element in this RDD.

In [6]:
# first
x = sc.parallelize([1,3,1,2,3])
y = x.first()
print(x.collect())
print(y)

[1, 3, 1, 2, 3]
1


<a href="http://spark.apache.org/docs/1.2.0/api/python/pyspark.html#pyspark.RDD.collectAsMap">
<img align=left src="images/pyspark-page41.svg" width=360 height=203 />
</a>

**collectAsMap()**. Return the key-value pairs in this RDD to the master as a dictionary.

In [7]:
# collectAsMap
x = sc.parallelize([('C',3),('A',1),('B',2)])
y = x.collectAsMap()
print(x.collect())
print(y)

[('C', 3), ('A', 1), ('B', 2)]
{'C': 3, 'A': 1, 'B': 2}


# Mapping

<a href="http://spark.apache.org/docs/1.2.0/api/python/pyspark.html#pyspark.RDD.map">
<img align=left src="images/pyspark-page3.svg" width=360 height=203 />
</a>

**map(f, preservesPartitioning=False)**. Return a new distributed dataset formed by passing each element of the source through the function f.

In [10]:
x = sc.parallelize(["b", "a", "c"])
y = x.map(lambda x: (x, 1))
print(x.collect())  # collect copies RDD elements to a list on the driver
print(y.collect())

['b', 'a', 'c']
[('b', 1), ('a', 1), ('c', 1)]


In [11]:
# map
x = sc.parallelize([1,2,3]) # sc = spark context, parallelize creates an RDD from the passed object
y = x.map(lambda x: (x,x**2))
print(x.collect())  # collect copies RDD elements to a list on the driver
print(y.collect())

[1, 2, 3]
[(1, 1), (2, 4), (3, 9)]


<a href="http://spark.apache.org/docs/1.2.0/api/python/pyspark.html#pyspark.RDD.flatMap">
<img align=left src="images/pyspark-page4.svg" width=360 height=203 /></a>
<br>

**flatMap(f, preservesPartitioning=False)**. Return a new RDD by first applying a function to all elements of this RDD, and then flattening the results.

In [3]:
# flatMap
x = sc.parallelize([1,2,3])
y = x.flatMap(lambda x: (x, 100*x, x**2))
print(x.collect())
print(y.collect())

[1, 2, 3]
[1, 100, 1, 2, 200, 4, 3, 300, 9]


In [4]:
# Map
x = sc.parallelize([1,2,3])
y = x.map(lambda x: (x, 100*x, x**2))
print(x.collect())
print(y.collect())

[1, 2, 3]
[(1, 100, 1), (2, 200, 4), (3, 300, 9)]
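
The difference between the two cells above can be reproduced on an ordinary Python list. This is only an analogy (the names `mapped` and `flat_mapped` are illustrative): `map` keeps one output per input, while `flatMap` concatenates the iterables returned by the function.

```python
from itertools import chain

data = [1, 2, 3]


def f(v):
    """Return three derived values for one input element."""
    return (v, 100 * v, v ** 2)


# map: one output element (here a tuple) per input element
mapped = [f(v) for v in data]

# flatMap: the returned tuples are flattened into a single sequence
flat_mapped = list(chain.from_iterable(f(v) for v in data))

print(mapped)       # [(1, 100, 1), (2, 200, 4), (3, 300, 9)]
print(flat_mapped)  # [1, 100, 1, 2, 200, 4, 3, 300, 9]
```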


In [13]:
x = sc.parallelize([2, 3, 4])
y = x.flatMap(lambda x: range(1, x))
print(x.collect())
print(y.collect())

[2, 3, 4]
[1, 1, 2, 1, 2, 3]


In [14]:
# Split sentence into words
lines = sc.parallelize([
    "Apache Spark is a unified analytics engine for large-scale data processing.",
    "It provides high-level APIs in Java, Scala, Python and R",
    "It also supports a rich set of higher-level tools including Spark SQL",
    "MLlib for machine learning",
    "GraphX for graph processing",
    "Structured Streaming for incremental computation and stream processing"
 ])
words = lines.flatMap(lambda x: x.split(' '))
print(lines.collect())
print(words.collect())

['Apache Spark is a unified analytics engine for large-scale data processing.', 'It provides high-level APIs in Java, Scala, Python and R', 'It also supports a rich set of higher-level tools including Spark SQL', 'MLlib for machine learning', 'GraphX for graph processing', 'Structured Streaming for incremental computation and stream processing']
['Apache', 'Spark', 'is', 'a', 'unified', 'analytics', 'engine', 'for', 'large-scale', 'data', 'processing.', 'It', 'provides', 'high-level', 'APIs', 'in', 'Java,', 'Scala,', 'Python', 'and', 'R', 'It', 'also', 'supports', 'a', 'rich', 'set', 'of', 'higher-level', 'tools', 'including', 'Spark', 'SQL', 'MLlib', 'for', 'machine', 'learning', 'GraphX', 'for', 'graph', 'processing', 'Structured', 'Streaming', 'for', 'incremental', 'computation', 'and', 'stream', 'processing']


In [15]:
x = sc.parallelize([1, 1, 2, 3])
y =  x.union(x)
print(x.collect())
print(y.collect())

[1, 1, 2, 3]
[1, 1, 2, 3, 1, 1, 2, 3]


<a href="http://spark.apache.org/docs/1.2.0/api/python/pyspark.html#pyspark.RDD.mapValues">
<img align=left src="images/pyspark-page56.svg" width=360 height=203 />
</a>

**mapValues(f)**. Pass each value in the key-value pair RDD through a map function without changing the keys; this also retains the original RDD’s partitioning.

In [16]:
# mapValues
x = sc.parallelize([('A',(1,2,3)),('B',(4,5))])
y = x.mapValues(lambda x: [i**2 for i in x]) # function is applied to entire value
print(x.collect())
print(y.collect())

[('A', (1, 2, 3)), ('B', (4, 5))]
[('A', [1, 4, 9]), ('B', [16, 25])]


<a href="http://spark.apache.org/docs/1.2.0/api/python/pyspark.html#pyspark.RDD.filter">
<img align=left src="images/pyspark-page8.svg" width=360 height=203 />
</a>

**filter(f)**. Return a new dataset formed by selecting those elements of the source on which f returns true.

In [17]:
# filter
x = sc.parallelize([1,2,3])
y = x.filter(lambda x: x%2 == 1)  # filters out even elements
print(x.collect())
print(y.collect())

[1, 2, 3]
[1, 3]


<a href="http://spark.apache.org/docs/1.2.0/api/python/pyspark.html#pyspark.RDD.distinct">
<img align=left src="images/pyspark-page9.svg" width=360 height=203 />
</a>

**distinct(numPartitions=None)**. Return a new RDD containing the distinct elements in this RDD.

In [18]:
# distinct
x = sc.parallelize(['A','A','B'])
y = x.distinct()
print(x.collect())
print(y.collect())

['A', 'A', 'B']
['A', 'B']


<a href="http://spark.apache.org/docs/1.2.0/api/python/pyspark.html#pyspark.RDD.keys">
<img align=left src="images/pyspark-page42.svg" width=360 height=203 />
</a>

**keys()**. Return an RDD with the keys of each tuple.

In [None]:
# keys
x = sc.parallelize([('C',3),('A',1),('B',2)])
y = x.keys()
print(x.collect())
print(y.collect())

[('C', 3), ('A', 1), ('B', 2)]
['C', 'A', 'B']


<a href="http://spark.apache.org/docs/1.2.0/api/python/pyspark.html#pyspark.RDD.values">
<img align=left src="images/pyspark-page43.svg" width=360 height=203 />
</a>

**values()**. Return an RDD with the values of each tuple.

In [None]:
# values
x = sc.parallelize([('C',3),('A',1),('B',2)])
y = x.values()
print(x.collect())
print(y.collect())

[('C', 3), ('A', 1), ('B', 2)]
[3, 1, 2]


# Partitions

<a href="http://spark.apache.org/docs/1.2.0/api/python/pyspark.html#pyspark.RDD.getNumPartitions">
<img align=left src="images/pyspark-page7.svg" width=360 height=203 />
</a>

**getNumPartitions()**. Return the number of partitions in this RDD.

In [19]:
# getNumPartitions
x = sc.parallelize([1,2,3,4,5,6,7,8,9,10,11,12,13,14], 4)
y = x.getNumPartitions()
print(x.glom().collect())
print(y)

[[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12, 13, 14]]
4


<a href="http://spark.apache.org/docs/1.2.0/api/python/pyspark.html#pyspark.RDD.repartition">
<img align=left src="images/pyspark-page63.svg" width=360 height=203 />
</a>

**repartition(numPartitions)**. Return a new RDD that has exactly numPartitions partitions. Can increase or decrease the level of parallelism in this RDD. Internally, this uses a shuffle to redistribute data. If you are decreasing the number of partitions in this RDD, consider using coalesce, which can avoid performing a shuffle.

In [20]:
# repartition
x = sc.parallelize([1,2,3,4,5],2)
y = x.repartition(numPartitions=3)
print(x.glom().collect())
print(y.glom().collect())

[[1, 2], [3, 4, 5]]
[[], [1, 2], [3, 4, 5]]


<a href="http://spark.apache.org/docs/1.2.0/api/python/pyspark.html#pyspark.RDD.coalesce">
<img align=left src="images/pyspark-page64.svg" width=360 height=203 />
</a>

**coalesce(numPartitions, shuffle=False)**. Return a new RDD that is reduced into numPartitions partitions.

In [21]:
# coalesce
x = sc.parallelize([1,2,3,4,5,6,7,8,9,10,11,12,13,14], 4)
y = x.coalesce(numPartitions=2)
print(x.glom().collect())
print(y.glom().collect())

[[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12, 13, 14]]
[[1, 2, 3, 4, 5, 6], [7, 8, 9, 10, 11, 12, 13, 14]]
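
One way to picture a shuffle-free `coalesce` is as merging neighbouring partitions without moving individual elements between them. The sketch below is an assumption made for illustration (`coalesce_sketch` is a hypothetical helper, and Spark's real grouping may also take data locality into account); it reproduces the output shown above.

```python
def coalesce_sketch(partitions, num_partitions):
    """Merge contiguous partitions into `num_partitions` groups (no shuffle)."""
    n = len(partitions)
    if num_partitions >= n:
        # Without a shuffle the partition count cannot grow
        return [list(p) for p in partitions]
    merged = []
    for i in range(num_partitions):
        start = i * n // num_partitions
        end = (i + 1) * n // num_partitions
        # Concatenate whole partitions; element order is preserved
        merged.append([el for p in partitions[start:end] for el in p])
    return merged


before = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12, 13, 14]]
after = coalesce_sketch(before, 2)
print(after)
```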


<a href="http://spark.apache.org/docs/1.2.0/api/python/pyspark.html#pyspark.RDD.glom">
<img align=left src="images/pyspark-page16.svg" width=360 height=203 />
</a>

**glom()**. Return an RDD created by coalescing all elements within each partition into a list.

In [22]:
# glom
x = sc.parallelize(['C','B','A'], 2)
y = x.glom()
print(x.collect()) 
print(y.collect())

['C', 'B', 'A']
[['C'], ['B', 'A']]


# Sampling

<a href="http://spark.apache.org/docs/1.2.0/api/python/pyspark.html#pyspark.RDD.takeSample">
<img align=left src="images/pyspark-page11.svg" width=360 height=203 />
</a>

**takeSample(withReplacement, num, seed=None)**. Return a fixed-size sampled subset of this RDD (currently requires numpy).

In [24]:
# takeSample
x = sc.parallelize(range(7))
ylist = [x.takeSample(withReplacement=False, num=3) for i in range(5)]  # call takeSample 5 times
print('x = ' + str(x.collect()))
for cnt,y in zip(range(len(ylist)), ylist):
    print('sample:' + str(cnt) + ' y = ' +  str(y))  # no collect on y

x = [0, 1, 2, 3, 4, 5, 6]
sample:0 y = [2, 0, 5]
sample:1 y = [1, 3, 0]
sample:2 y = [6, 5, 4]
sample:3 y = [6, 1, 3]
sample:4 y = [5, 0, 6]


# Set operations

<a href="http://spark.apache.org/docs/1.2.0/api/python/pyspark.html#pyspark.RDD.union">
<img align=left src="images/pyspark-page12.svg" width=360 height=203 />
</a>

**union(other)**. Return the union of this RDD and another one.

In [26]:
# union
x = sc.parallelize(['A','A','B'])
y = sc.parallelize(['D','C','A'])
z = x.union(y)
print(x.collect())
print(y.collect())
print(z.collect())

['A', 'A', 'B']
['D', 'C', 'A']
['A', 'A', 'B', 'D', 'C', 'A']


<a href="http://spark.apache.org/docs/1.2.0/api/python/pyspark.html#pyspark.RDD.intersection">
<img align=left src="images/pyspark-page13.svg" width=360 height=203 />
</a>

**intersection(other)**. Return the intersection of this RDD and another one. The output will not contain any duplicate elements, even if the input RDDs did. Note that this method performs a shuffle internally.

In [27]:
# intersection
x = sc.parallelize(['A','A','B'])
y = sc.parallelize(['A','C','D'])
z = x.intersection(y)
print(x.collect())
print(y.collect())
print(z.collect())

['A', 'A', 'B']
['A', 'C', 'D']
['A']
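
Note the asymmetry between the two operations: `union` keeps duplicates, while `intersection` removes them. A plain-Python analogy (lists standing in for RDDs, names chosen for illustration):

```python
x = ['A', 'A', 'B']
y = ['A', 'C', 'D']

# union behaves like list concatenation: duplicates survive
union_like = x + y

# intersection behaves like a set intersection: duplicates are dropped
intersection_like = sorted(set(x) & set(y))

print(union_like)         # ['A', 'A', 'B', 'A', 'C', 'D']
print(intersection_like)  # ['A']
```

To get a duplicate-free union in Spark, `distinct()` can be chained after `union()`.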


# Aggregate

<a href="http://spark.apache.org/docs/1.2.0/api/python/pyspark.html#pyspark.RDD.reduce">
<img align=left src="images/pyspark-page23.svg" width=360 height=203 />
</a>

**reduce(f)**. Reduces the elements of this RDD using the specified commutative and associative binary operator. Currently reduces partitions locally.

In [64]:
# reduce
x = sc.parallelize([1,2,3])
y = x.reduce(lambda x, y: x + y)  # computes the sum of all elements
print(x.collect())
print(y)

[1, 2, 3]
6
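
The operator must be commutative and associative because partial results are computed per partition and then merged. The plain-Python sketch below (`reduce_sketch` and the partition layouts are hypothetical) shows that addition is insensitive to the partitioning, while subtraction is not:

```python
from functools import reduce


def reduce_sketch(partitions, op):
    """Reduce each non-empty partition locally, then reduce the partials."""
    partials = [reduce(op, part) for part in partitions if part]
    return reduce(op, partials)


def add(a, b):
    return a + b


def sub(a, b):
    return a - b


one_partition = [[1, 2, 3]]
two_partitions = [[1], [2, 3]]

print(reduce_sketch(one_partition, add), reduce_sketch(two_partitions, add))  # 6 6
print(reduce_sketch(one_partition, sub), reduce_sketch(two_partitions, sub))  # -4 2
```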


<a href="http://spark.apache.org/docs/1.2.0/api/python/pyspark.html#pyspark.RDD.fold">
<img align=left src="images/pyspark-page24.svg" width=360 height=203 />
</a>

**fold(zeroValue, op)**. Aggregate the elements of each partition, and then the results for all the partitions, using a given associative function and a neutral “zero value”. The function op(t1, t2) is allowed to modify t1 and return it as its result value to avoid object allocation; however, it should not modify t2.

In [65]:
# fold
x = sc.parallelize([1,2,3])
neutral_zero_value = 0  # 0 for sum, 1 for multiplication
y = x.fold(neutral_zero_value, lambda accumulated, obj: accumulated + obj)  # computes the sum of all elements
print(x.collect())
print(y)

[1, 2, 3]
6
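
The zero value has to be neutral because it is used more than once. The plain-Python sketch below assumes, following the description above, that the zero value seeds every partition and also the final merge; `fold_sketch` and the partition layout are hypothetical.

```python
from functools import reduce


def fold_sketch(partitions, zero, op):
    """Fold each partition starting from `zero`, then fold the partials."""
    partials = [reduce(op, part, zero) for part in partitions]
    return reduce(op, partials, zero)


def add(a, b):
    return a + b


partitions = [[1], [2, 3]]  # hypothetical 2-partition layout of [1, 2, 3]

neutral = fold_sketch(partitions, 0, add)      # 0 is neutral for addition
non_neutral = fold_sketch(partitions, 1, add)  # 1 is added 3 times: 2 partitions + merge

print(neutral)      # 6
print(non_neutral)  # 9
```

With a non-neutral zero value, the result depends on the number of partitions, which is rarely what is wanted.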


<a href="http://spark.apache.org/docs/1.2.0/api/python/pyspark.html#pyspark.RDD.aggregate">
<img align=left src="images/pyspark-page25.svg" width=360 height=203 />
</a>

**aggregate(zeroValue, seqOp, combOp)**. Aggregate the elements of each partition, and then the results for all the partitions, using the given combine functions and a neutral “zero value”.

The functions op(t1, t2) are allowed to modify t1 and return it as their result value to avoid object allocation; however, they should not modify t2. The first function (seqOp) can return a different result type, U, than the type of this RDD, T.

Thus, we need one operation for merging a T into a U and one operation for merging two U's.

In [66]:
# aggregate
x = sc.parallelize([2,3,4])
neutral_zero_value = (0,1) # sum: x+0 = x, product: 1*x = x
seqOp = (lambda aggregated, el: (aggregated[0] + el, aggregated[1] * el)) 
combOp = (lambda aggregated, el: (aggregated[0] + el[0], aggregated[1] * el[1]))
y = x.aggregate(neutral_zero_value,seqOp,combOp)  # computes (sum, product) of all elements
print(x.collect())
print(y)

[2, 3, 4]
(9, 24)
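
The two roles of `seqOp` and `combOp` can be sketched in plain Python. `aggregate_sketch` and the partition layout are hypothetical; the second call shows a result type U, a `(sum, count)` pair, that differs from the element type T, an int.

```python
from functools import reduce


def aggregate_sketch(partitions, zero, seq_op, comb_op):
    """seq_op merges a T into a U within a partition; comb_op merges two U's."""
    partials = [reduce(seq_op, part, zero) for part in partitions]
    return reduce(comb_op, partials, zero)


partitions = [[2], [3, 4]]  # hypothetical 2-partition layout of [2, 3, 4]

# Same computation as the cell above: (sum, product)
sum_and_product = aggregate_sketch(
    partitions,
    (0, 1),
    lambda acc, el: (acc[0] + el, acc[1] * el),
    lambda a, b: (a[0] + b[0], a[1] * b[1]),
)

# A (sum, count) accumulator, from which the mean follows
total, count = aggregate_sketch(
    partitions,
    (0, 0),
    lambda acc, el: (acc[0] + el, acc[1] + 1),
    lambda a, b: (a[0] + b[0], a[1] + b[1]),
)

print(sum_and_product)  # (9, 24)
print(total / count)    # 3.0
```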


<a href="http://spark.apache.org/docs/1.2.0/api/python/pyspark.html#pyspark.RDD.reduceByKey">
<img align=left src="images/pyspark-page44.svg" width=360 height=203 />
</a>

**reduceByKey(func, numPartitions=None)**. Merge the values for each key using an associative reduce function.

In [34]:
# reduceByKey
x = sc.parallelize([('B',1),('B',2),('A',3),('A',4),('A',5)])
y = x.reduceByKey(lambda agg, obj: agg + obj)
print(x.collect())
print(y.collect())

[('B', 1), ('B', 2), ('A', 3), ('A', 4), ('A', 5)]
[('B', 3), ('A', 12)]


<a href="http://spark.apache.org/docs/1.2.0/api/python/pyspark.html#pyspark.RDD.foldByKey">
<img align=left src="images/pyspark-page53.svg" width=360 height=203 />
</a>

**foldByKey(zeroValue, func, numPartitions=None)**. Merge the values for each key using an associative function “func” and a neutral “zeroValue” which may be added to the result an arbitrary number of times, and must not change the result (e.g., 0 for addition, or 1 for multiplication).

In [36]:
# foldByKey
x = sc.parallelize([('B',1),('B',2),('A',3),('A',4),('A',5)])
zeroValue = 1 # one is 'zero value' for multiplication
y = x.foldByKey(zeroValue,lambda agg,x: agg*x )  # computes the product of the values within each key
print(x.collect())
print(y.collect())

[('B', 1), ('B', 2), ('A', 3), ('A', 4), ('A', 5)]
[('B', 2), ('A', 60)]


<a href="http://spark.apache.org/docs/1.2.0/api/python/pyspark.html#pyspark.RDD.aggregateByKey">
<img align=left src="images/pyspark-page52.svg" width=360 height=203 />
</a>

**aggregateByKey(zeroValue, seqFunc, combFunc, numPartitions=None)**. Aggregate the values of each key, using given combine functions and a neutral “zero value”. This function can return a different result type, U, than the type of the values in this RDD, V. Thus, we need one operation for merging a V into a U and one operation for merging two U’s. The former operation is used for merging values within a partition, and the latter is used for merging values between partitions. To avoid memory allocation, both of these functions are allowed to modify and return their first argument instead of creating a new U.

In [35]:
# aggregateByKey
x = sc.parallelize([('B',1),('B',2),('A',3),('A',4),('A',5)])
zeroValue = [] # empty list is 'zero value' for append operation
mergeVal = (lambda aggregated, el: aggregated + [(el,el**2)])
mergeComb = (lambda agg1,agg2: agg1 + agg2 )
y = x.aggregateByKey(zeroValue,mergeVal,mergeComb)
print(x.collect())
print(y.collect())

[('B', 1), ('B', 2), ('A', 3), ('A', 4), ('A', 5)]
[('B', [(1, 1), (2, 4)]), ('A', [(3, 9), (4, 16), (5, 25)])]
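
The within-partition and between-partition steps described above can be sketched in plain Python. Everything here is illustrative: `aggregate_by_key_sketch` is a hypothetical helper, the partition layout is assumed, and a `zero_factory` callable replaces Spark's `zeroValue` so that each key starts from a fresh accumulator.

```python
def aggregate_by_key_sketch(partitions, zero_factory, seq_func, comb_func):
    """Per-key aggregation: seq_func within partitions, comb_func across them."""
    # Step 1: inside each partition, merge every value V into its key's U
    per_partition = []
    for part in partitions:
        accs = {}
        for key, value in part:
            current = accs[key] if key in accs else zero_factory()
            accs[key] = seq_func(current, value)
        per_partition.append(accs)

    # Step 2: merge the per-partition accumulators (two U's) for each key
    merged = {}
    for accs in per_partition:
        for key, acc in accs.items():
            merged[key] = comb_func(merged[key], acc) if key in merged else acc
    return merged


# Hypothetical layout: key 'A' is split across both partitions
partitions = [[('B', 1), ('B', 2), ('A', 3)], [('A', 4), ('A', 5)]]
result = aggregate_by_key_sketch(
    partitions,
    list,                                      # fresh empty list per key
    lambda acc, el: acc + [(el, el ** 2)],     # merge a value into a list
    lambda acc1, acc2: acc1 + acc2,            # merge two lists
)
print(result)
```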


<a href="http://spark.apache.org/docs/1.2.0/api/python/pyspark.html#pyspark.RDD.groupByKey">
<img align=left src="images/pyspark-page54.svg" width=360 height=203 />
</a>

**groupByKey(numPartitions=None)**. Group the values for each key in the RDD into a single sequence. Note: If you are grouping in order to perform an aggregation (such as a sum or average) over each key, reduceByKey or aggregateByKey usually performs much better.

In [37]:
# groupByKey
x = sc.parallelize([('B',5),('B',4),('A',3),('A',2),('A',1)])
y = x.groupByKey()
print(x.collect())
print([(j[0],[i for i in j[1]]) for j in y.collect()])

[('B', 5), ('B', 4), ('A', 3), ('A', 2), ('A', 1)]
[('B', [5, 4]), ('A', [3, 2, 1])]


<a href="http://spark.apache.org/docs/1.2.0/api/python/pyspark.html#pyspark.RDD.countByKey">
<img align=left src="images/pyspark-page46.svg" width=360 height=203 />
</a>

**countByKey()**. Count the number of elements for each key, and return the result to the master as a dictionary.

In [32]:
# countByKey
x = sc.parallelize([('B',1),('B',2),('A',3),('A',4),('A',5)])
y = x.countByKey()
print(x.collect())
print(y)

[('B', 1), ('B', 2), ('A', 3), ('A', 4), ('A', 5)]
defaultdict(<class 'int'>, {'B': 2, 'A': 3})


<a href="http://spark.apache.org/docs/1.2.0/api/python/pyspark.html#pyspark.RDD.countByValue">
<img align=left src="images/pyspark-page36.svg" width=360 height=203 />
</a>

**countByValue()**. Return the count of each unique value in this RDD as a dictionary of (value, count) pairs.

In [33]:
# countByValue
x = sc.parallelize([1,3,1,2,3])
y = x.countByValue()
print(x.collect())
print(y)

[1, 3, 1, 2, 3]
defaultdict(<class 'int'>, {1: 2, 3: 2, 2: 1})
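
Both counting actions return a plain dictionary on the driver, so the result should be small. As a local analogy (lists standing in for the RDDs), `collections.Counter` produces the same tallies:

```python
from collections import Counter

pairs = [('B', 1), ('B', 2), ('A', 3), ('A', 4), ('A', 5)]
values = [1, 3, 1, 2, 3]

# countByKey: only the key of each pair is counted, values are ignored
by_key = Counter(key for key, _ in pairs)

# countByValue: each whole element is counted
by_value = Counter(values)

print(dict(by_key))    # {'B': 2, 'A': 3}
print(dict(by_value))  # {1: 2, 3: 2, 2: 1}
```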


# Statistics

<a href="http://spark.apache.org/docs/1.2.0/api/python/pyspark.html#pyspark.RDD.max">
<img align=left src="images/pyspark-page26.svg" width=360 height=203 />
</a>

**max(key=None)**. Find the maximum item in this RDD.

Parameters:	key – A function used to generate key for comparing

In [67]:
# max
x = sc.parallelize([1,3,2,11])
y = x.max()
z = x.max(key=str)
print(x.collect())
print(y)
print(z)

[1, 3, 2, 11]
11
3
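
The result `3` for `key=str` follows from string comparison, which orders character by character, so `'3'` sorts after `'11'`. Python's built-in `max` behaves the same way on a local list:

```python
data = [1, 3, 2, 11]

numeric_max = max(data)           # compares the integers themselves
string_max = max(data, key=str)   # compares '1', '3', '2', '11' as strings

print(numeric_max)  # 11
print(string_max)   # 3
print(sorted(data, key=str))  # [1, 11, 2, 3]
```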


<a href="http://spark.apache.org/docs/1.2.0/api/python/pyspark.html#pyspark.RDD.min">
<img align=left src="images/pyspark-page27.svg" width=360 height=203 />
</a>

**min(key=None)**. Find the minimum item in this RDD.

Parameters:	key – A function used to generate key for comparing

In [40]:
# min
x = sc.parallelize([1,3,2])
y = x.min()
print(x.collect())
print(y)

[1, 3, 2]
1


<a href="http://spark.apache.org/docs/1.2.0/api/python/pyspark.html#pyspark.RDD.sum">
<img align=left src="images/pyspark-page28.svg" width=360 height=203 />
</a>

**sum()**. Add up the elements in this RDD.

In [41]:
# sum
x = sc.parallelize([1,3,2])
y = x.sum()
print(x.collect())
print(y)

[1, 3, 2]
6


<a href="http://spark.apache.org/docs/1.2.0/api/python/pyspark.html#pyspark.RDD.count">
<img align=left src="images/pyspark-page29.svg" width=360 height=203 />
</a>

**count()**. Return the number of elements in this RDD.

In [42]:
# count
x = sc.parallelize([1,3,2])
y = x.count()
print(x.collect())
print(y)

[1, 3, 2]
3


<a href="http://spark.apache.org/docs/1.2.0/api/python/pyspark.html#pyspark.RDD.mean">
<img align=left src="images/pyspark-page31.svg" width=360 height=203 />
</a>

**mean()**. Compute the mean of this RDD’s elements.

In [45]:
# mean
x = sc.parallelize([1,3,2])
y = x.mean()
print(x.collect())
print(y)

[1, 3, 2]
2.0


<a href="http://spark.apache.org/docs/1.2.0/api/python/pyspark.html#pyspark.RDD.variance">
<img align=left src="images/pyspark-page32.svg" width=360 height=203 />
</a>

**variance()**. Compute the variance of this RDD’s elements.

In [46]:
# variance
x = sc.parallelize([1,3,2])
y = x.variance()  # divides by N
print(x.collect())
print(y)

[1, 3, 2]
0.6666666666666666


<a href="http://spark.apache.org/docs/1.2.0/api/python/pyspark.html#pyspark.RDD.stdev">
<img align=left src="images/pyspark-page33.svg" width=360 height=203 />
</a>

**stdev()**. Compute the standard deviation of this RDD’s elements.

In [47]:
# stdev
x = sc.parallelize([1,3,2])
y = x.stdev()  # divides by N
print(x.collect())
print(y)

[1, 3, 2]
0.816496580927726
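
As the code comments note, `variance()` and `stdev()` divide by N (population statistics). The standard-library `statistics` module makes the difference from the N-1 (sample) versions easy to check locally; PySpark exposes the N-1 versions as `sampleVariance()` and `sampleStdev()`.

```python
import statistics

data = [1.0, 3.0, 2.0]

population_variance = statistics.pvariance(data)  # divides by N
population_stdev = statistics.pstdev(data)        # sqrt of the above
sample_variance = statistics.variance(data)       # divides by N - 1

print(population_variance)  # approximately 0.6667, matches x.variance()
print(population_stdev)     # approximately 0.8165, matches x.stdev()
print(sample_variance)      # 1.0
```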


# Join and combine RDDs

<a href="http://spark.apache.org/docs/1.2.0/api/python/pyspark.html#pyspark.RDD.join">
<img align=left src="images/pyspark-page47.svg" width=360 height=203 />
</a>

**join(other, numPartitions=None)**. Return an RDD containing all pairs of elements with matching keys in self and other. Each pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in self and (k, v2) is in other. Performs a hash join across the cluster.

In [50]:
# join
x = sc.parallelize([('C',4),('B',3),('A',2),('A',1)])
y = sc.parallelize([('A',8),('B',7),('A',6),('D',5)])
z = x.join(y)
print(x.collect())
print(y.collect())
print(z.collect())

[('C', 4), ('B', 3), ('A', 2), ('A', 1)]
[('A', 8), ('B', 7), ('A', 6), ('D', 5)]
[('B', (3, 7)), ('A', (2, 8)), ('A', (2, 6)), ('A', (1, 8)), ('A', (1, 6))]


<a href="http://spark.apache.org/docs/1.2.0/api/python/pyspark.html#pyspark.RDD.leftOuterJoin">
<img align=left src="images/pyspark-page48.svg" width=360 height=203 />
</a>

**leftOuterJoin(other, numPartitions=None)**. Perform a left outer join of self and other. For each element (k, v) in self, the resulting RDD will either contain all pairs (k, (v, w)) for w in other, or the pair (k, (v, None)) if no elements in other have key k.

In [52]:
# leftOuterJoin
x = sc.parallelize([('C',4),('B',3),('A',2),('A',1)])
y = sc.parallelize([('A',8),('B',7),('A',6),('D',5)])
z = x.leftOuterJoin(y)
print(x.collect())
print(y.collect())
print(z.collect())

[('C', 4), ('B', 3), ('A', 2), ('A', 1)]
[('A', 8), ('B', 7), ('A', 6), ('D', 5)]
[('C', (4, None)), ('B', (3, 7)), ('A', (2, 8)), ('A', (2, 6)), ('A', (1, 8)), ('A', (1, 6))]


<a href="http://spark.apache.org/docs/1.2.0/api/python/pyspark.html#pyspark.RDD.rightOuterJoin">
<img align=left src="images/pyspark-page49.svg" width=360 height=203 />
</a>

**rightOuterJoin(other, numPartitions=None)**. Perform a right outer join of self and other. For each element (k, w) in other, the resulting RDD will either contain all pairs (k, (v, w)) for v in self, or the pair (k, (None, w)) if no elements in self have key k.

In [53]:
# rightOuterJoin
x = sc.parallelize([('C',4),('B',3),('A',2),('A',1)])
y = sc.parallelize([('A',8),('B',7),('A',6),('D',5)])
z = x.rightOuterJoin(y)
print(x.collect())
print(y.collect())
print(z.collect())

[('C', 4), ('B', 3), ('A', 2), ('A', 1)]
[('A', 8), ('B', 7), ('A', 6), ('D', 5)]
[('B', (3, 7)), ('A', (2, 8)), ('A', (2, 6)), ('A', (1, 8)), ('A', (1, 6)), ('D', (None, 5))]
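
The three join variants differ only in which keys are kept and how a missing side is filled. A plain-Python sketch of that logic follows; `index_by_key` and `join_sketch` are hypothetical helpers, and the output is sorted by key here only to make it deterministic, whereas Spark gives no ordering guarantee.

```python
from collections import defaultdict


def index_by_key(pairs):
    """Group values by key: [(k, v), ...] -> {k: [v, ...]}."""
    index = defaultdict(list)
    for key, value in pairs:
        index[key].append(value)
    return index


def join_sketch(left, right, how="inner"):
    """Join two lists of (key, value) pairs; how is 'inner', 'left' or 'right'."""
    left_idx, right_idx = index_by_key(left), index_by_key(right)
    if how == "inner":
        keys = set(left_idx) & set(right_idx)
    elif how == "left":
        keys = set(left_idx)
    elif how == "right":
        keys = set(right_idx)
    else:
        raise ValueError("unknown join type: " + how)

    out = []
    for key in sorted(keys):
        vs = left_idx.get(key) or [None]   # missing side becomes None
        ws = right_idx.get(key) or [None]
        out.extend((key, (v, w)) for v in vs for w in ws)
    return out


x = [('C', 4), ('B', 3), ('A', 2), ('A', 1)]
y = [('A', 8), ('B', 7), ('A', 6), ('D', 5)]
print(join_sketch(x, y))
print(join_sketch(x, y, how="left"))
print(join_sketch(x, y, how="right"))
```

Key `'A'` appears twice on each side, so every variant emits four `'A'` pairs: the matching values are combined as a per-key cross product.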


<a href="http://spark.apache.org/docs/1.2.0/api/python/pyspark.html#pyspark.RDD.cartesian">
<img align=left src="images/pyspark-page17.svg" width=360 height=203 />
</a>

**cartesian(other)**. Return the Cartesian product of this RDD and another one, that is, the RDD of all pairs of elements (a, b) where a is in self and b is in other.

In [55]:
# cartesian
x = sc.parallelize(['A','B'])
y = sc.parallelize(['C','D'])
z = x.cartesian(y)
print(x.collect())
print(y.collect())
print(z.collect())

['A', 'B']
['C', 'D']
[('A', 'C'), ('A', 'D'), ('B', 'C'), ('B', 'D')]


<a href="http://spark.apache.org/docs/1.2.0/api/python/pyspark.html#pyspark.RDD.subtract">
<img align=left src="images/pyspark-page61.svg" width=360 height=203 />
</a>

**subtract(other, numPartitions=None)**. Return each value in self that is not contained in other.

In [57]:
# subtract
x = sc.parallelize([('C',4),('B',3),('A',2),('A',1)])
y = sc.parallelize([('C',8),('A',2),('D',1)])
z = x.subtract(y)
print(x.collect())
print(y.collect())
print(z.collect())

[('C', 4), ('B', 3), ('A', 2), ('A', 1)]
[('C', 8), ('A', 2), ('D', 1)]
[('C', 4), ('B', 3), ('A', 1)]
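
Note that `subtract` compares whole elements, not just keys: `('C', 4)` survives even though `y` contains a pair with key `'C'`. A local sketch with lists (names are illustrative), contrasted with key-only removal, which is what PySpark's `subtractByKey` does:

```python
x = [('C', 4), ('B', 3), ('A', 2), ('A', 1)]
y = [('C', 8), ('A', 2), ('D', 1)]

# subtract: drop elements of x that appear, as whole tuples, in y
other_elements = set(y)
subtract_like = [el for el in x if el not in other_elements]

# key-only comparison: drop elements of x whose key appears in y
other_keys = {key for key, _ in y}
subtract_by_key_like = [(k, v) for k, v in x if k not in other_keys]

print(subtract_like)         # [('C', 4), ('B', 3), ('A', 1)]
print(subtract_by_key_like)  # [('B', 3)]
```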


<a href="http://spark.apache.org/docs/1.2.0/api/python/pyspark.html#pyspark.RDD.zip">
<img align=left src="images/pyspark-page65.svg" width=360 height=203 />
</a>

**zip(other)**. Zips this RDD with another one, returning key-value pairs with the first element in each RDD, second element in each RDD, etc. Assumes that the two RDDs have the same number of partitions and the same number of elements in each partition (e.g. one was made through a map on the other).

In [58]:
# zip
x = sc.parallelize(['B','A','A'])
y = x.map(lambda x: ord(x))  # zip expects x and y to have same #partitions and #elements/partition
z = x.zip(y)
print(x.collect())
print(y.collect())
print(z.collect())

['B', 'A', 'A']
[66, 65, 65]
[('B', 66), ('A', 65), ('A', 65)]


<a href="http://spark.apache.org/docs/1.2.0/api/python/pyspark.html#pyspark.RDD.zipWithIndex">
<img align=left src="images/pyspark-page66.svg" width=360 height=203 />
</a>

**zipWithIndex()**. Zips this RDD with its element indices. The ordering is first based on the partition index and then the ordering of items within each partition. So the first item in the first partition gets index 0, and the last item in the last partition receives the largest index.

In [59]:
# zipWithIndex
x = sc.parallelize(['B','A','A'],2)
y = x.zipWithIndex()
print(x.glom().collect())
print(y.collect())

[['B'], ['A', 'A']]
[('B', 0), ('A', 1), ('A', 2)]
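
The index ordering described above (partition index first, then position within the partition) amounts to a running offset across partitions. A plain-Python sketch, with `zip_with_index_sketch` as a hypothetical helper:

```python
def zip_with_index_sketch(partitions):
    """Pair each element with a global index that runs across partitions."""
    out = []
    offset = 0  # number of elements in all earlier partitions
    for part in partitions:
        out.extend((el, offset + i) for i, el in enumerate(part))
        offset += len(part)
    return out


partitions = [['B'], ['A', 'A']]  # same layout as the glom() output above
indexed = zip_with_index_sketch(partitions)
print(indexed)  # [('B', 0), ('A', 1), ('A', 2)]
```

The offset of each partition depends on the sizes of all earlier partitions, so those sizes have to be known before any index can be assigned.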


# Other functions

<a href="http://spark.apache.org/docs/1.2.0/api/python/pyspark.html#pyspark.RDD.sortByKey">
<img align=left src="images/pyspark-page14.svg" width=360 height=203 />
</a>

**sortByKey(ascending=True, numPartitions=None, keyfunc=lambda x: x)**. Sorts this RDD, which is assumed to consist of (key, value) pairs.

In [None]:
# sortByKey
x = sc.parallelize([('B',1),('A',2),('C',3)])
y = x.sortByKey()
print(x.collect())
print(y.collect())

[('B', 1), ('A', 2), ('C', 3)]
[('A', 2), ('B', 1), ('C', 3)]
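
The `ascending` and `keyfunc` parameters play the same role as `reverse` and `key` in Python's built-in `sorted`, with `keyfunc` applied to the key of each pair. A local analogy on a list (the mixed-case data is made up for illustration):

```python
pairs = [('b', 1), ('A', 2), ('C', 3)]  # hypothetical mixed-case keys

# Default ordering: uppercase letters sort before lowercase ones
default_order = sorted(pairs, key=lambda kv: kv[0])

# A key function that lowercases the key gives a case-insensitive sort
case_insensitive = sorted(pairs, key=lambda kv: kv[0].lower())

# Descending order, comparable to ascending=False
descending = sorted(pairs, key=lambda kv: kv[0].lower(), reverse=True)

print(default_order)     # [('A', 2), ('C', 3), ('b', 1)]
print(case_insensitive)  # [('A', 2), ('b', 1), ('C', 3)]
print(descending)        # [('C', 3), ('b', 1), ('A', 2)]
```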


<a href="http://spark.apache.org/docs/1.2.0/api/python/pyspark.html#pyspark.RDD.foreach">
<img align=left src="images/pyspark-page20.svg" width=360 height=203 />
</a>

**foreach(f)**. Applies a function to all elements of this RDD.

In [60]:
# foreach
from __future__ import print_function
x = sc.parallelize([1,2,3])
def f(el):
    '''side effect: append the current RDD element to a file'''
    with open("./foreachExample.txt", 'a+') as f1:  # context manager closes the file
        print(el, file=f1)

open('./foreachExample.txt', 'w').close()  # first clear the file contents

y = x.foreach(f) # writes into foreachExample.txt

print(x.collect())
print(y) # foreach returns 'None'
# print the contents of foreachExample.txt
with open("./foreachExample.txt", "r") as foreachExample:
    print(foreachExample.read())

[1, 2, 3]
None
1
2
3

