#  Spark RDD and Beyond

## Types of RDD

- ParallelCollectionRDD — An RDD created by SparkContext by parallelizing an existing collection, e.g. sc.parallelize, sc.makeRDD.
- CoGroupedRDD — An RDD that cogroups its parents. For each key k in the parent RDDs, the resulting RDD contains a tuple with the list of values for that key.
- HadoopRDD — An RDD that provides core functionality for reading data stored in HDFS using the older MapReduce API. The most notable use case is the RDD returned by SparkContext.textFile.
- MapPartitionsRDD — An RDD that applies the provided function to every partition of the parent RDD. The result of calling operations like map, flatMap, filter, mapPartitions, etc.
- CoalescedRDD — The result of repartition or coalesce transformations.
- ShuffledRDD — The result of shuffling, e.g. after repartition or coalesce transformations.
- PipedRDD — An RDD created by piping elements to a forked external process.
- SequenceFileRDD — An RDD that can be saved as a SequenceFile.

[Source](https://medium.com/knoldus/rdd-sparks-fault-tolerant-in-memory-weapon-130f8df2f996)

# Let's practice

In [1]:
import findspark
import pyspark
findspark.find() 
findspark

<module 'findspark' from '/Users/nics/miniforge3/lib/python3.9/site-packages/findspark.py'>

In [2]:
conf = pyspark.SparkConf().setAppName('Tap').setMaster('local')
sc = pyspark.SparkContext(conf=conf)
sc

23/05/02 16:30:22 WARN Utils: Your hostname, NicsM1.local resolves to a loopback address: 127.0.0.1; using 151.97.58.67 instead (on interface en0)
23/05/02 16:30:22 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/05/02 16:30:22 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


## Create an RDD using parallelize

In [3]:
# Create an RDD from a local Python list
digits = sc.parallelize([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
digits

ParallelCollectionRDD[0] at readRDDFromFile at PythonRDD.scala:287

**RDD.collect()**

Return a list that contains all of the elements in this RDD.

**Notes**

This method should only be used if the resulting array is expected to be small, as all the data is loaded into the driver’s memory.

In [4]:
digits.collect()

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

[▶![](http://img.youtube.com/vi/MVkjC-doVP4/0.jpg)](https://www.youtube.com/watch?v=MVkjC-doVP4&start=120)

## RDD Operations

RDDs support two types of operations: 
- **transformations**, which create a new dataset from an existing one
- **actions**, which return a value to the driver program after running a computation on the dataset. 

For example, *map* is a transformation that passes each dataset element through a function and returns a new RDD representing the results.

On the other hand, *reduce* is an action that aggregates all the elements of the RDD using some function and returns the final result to the driver program (although there is also a parallel reduceByKey that returns a distributed dataset).

All transformations in Spark are **lazy**, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). 

The transformations are only computed when an action requires a result to be returned to the driver program.

This design enables Spark to run more **efficiently**. For example, Spark can recognize that a dataset created through map will be used in a reduce and return only the result of the reduce to the driver, rather than the larger mapped dataset.
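Laziness can be illustrated without a cluster. The following is a minimal sketch in plain Python, not Spark code: a generator stands in for a transformation and `sum` stands in for an action. The names `traced_square` and `calls` are made up for this illustration.

```python
calls = []  # records every element the mapping function actually touches


def traced_square(x):
    calls.append(x)
    return x * x


# "Transformation": building the generator does no work yet.
lazy_squares = (traced_square(x) for x in range(10))
calls_before_action = len(calls)

# "Action": consuming the pipeline triggers the computation.
total = sum(lazy_squares)
calls_after_action = len(calls)

print(calls_before_action, calls_after_action, total)
```

Nothing is squared until the result is actually requested, which is the same idea Spark applies to a whole chain of transformations.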

![](https://media.licdn.com/dms/image/C5612AQHZTuPnWQT6KQ/article-cover_image-shrink_600_2000/0/1520128748970?e=2147483647&v=beta&t=lJDV7DWofBg1ijqPLYGbGDr-FVX_pM6nks-RWQQgRZo)

By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory using the persist (or cache) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. There is also support for persisting RDDs on disk, or replicated across multiple nodes.
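The difference between recomputing and persisting can be sketched with a toy class in plain Python. `TinyDataset` and its `evaluations` counter are hypothetical names for this illustration only; real Spark caches per partition on the executors and supports several storage levels.

```python
class TinyDataset:
    """Toy stand-in for an RDD: recomputes on every action unless persisted."""

    def __init__(self, source, func):
        self.source = list(source)
        self.func = func
        self.evaluations = 0      # how many times the full computation ran
        self._persisted = False
        self._cache = None

    def persist(self):
        self._persisted = True
        return self

    def collect(self):
        if self._cache is not None:
            return list(self._cache)          # served from the cache
        self.evaluations += 1
        result = [self.func(x) for x in self.source]
        if self._persisted:
            self._cache = result              # keep the result for later actions
        return list(result)


plain = TinyDataset(range(10), lambda x: x * x)
plain.collect()
plain.collect()

cached = TinyDataset(range(10), lambda x: x * x).persist()
cached.collect()
cached.collect()

print(plain.evaluations, cached.evaluations)
```

In this model the unpersisted dataset runs its function twice for two actions, while the persisted one runs it once.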

![](https://cdn.analyticsvidhya.com/wp-content/uploads/2020/11/persist.jpg)

[Source](https://www.analyticsvidhya.com/blog/2020/11/8-must-know-spark-optimization-tips-for-data-engineering-beginners/)

# Transformations
https://spark.apache.org/docs/latest/rdd-programming-guide.html#transformations

![](https://i.ytimg.com/vi/2IQxywxlMO8/hqdefault.jpg)

## map(func)	
Return a new distributed dataset formed by passing each element of the source through a function func. 

In [10]:
squares=digits.map(lambda x: x*x)

In [6]:
squares

PythonRDD[1] at RDD at PythonRDD.scala:53

In [7]:
squares.collect()

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

## filter(func)	
Return a new dataset formed by selecting those elements of the source on which func returns true.

In [11]:
evens = digits.filter(lambda x: x % 2 == 0)

In [12]:
evens.collect()

[0, 2, 4, 6, 8]

## flatMap(func)	
Similar to map, but each input item can be mapped to 0 or more output items (so func should return a sequence, in PySpark any iterable such as a list, rather than a single item).

In [19]:
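def factors(nr):
    """Return the prime factors of nr, with multiplicity."""
    i = 2
    result = []
    while i <= nr:
        if nr % i == 0:
            result.append(i)
            nr = nr // i  # integer division keeps nr an int
        else:
            i = i + 1
    return result

# flatMap flattens the per-element lists into a single RDD
primes = digits.flatMap(factors)
primes.collect()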

[2, 3, 2, 2, 5, 2, 3, 7, 2, 2, 2, 3, 3]
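To see what flatMap adds over map, here is a small sketch in plain Python (no Spark involved) that applies the same kind of factorization both ways. `prime_factors` is a local helper written for this example.

```python
from itertools import chain


def prime_factors(n):
    """Return the prime factors of n, with multiplicity."""
    result, i = [], 2
    while i <= n:
        if n % i == 0:
            result.append(i)
            n //= i
        else:
            i += 1
    return result


digits = range(10)

# map-like: exactly one output (a list) per input element
mapped = [prime_factors(n) for n in digits]

# flatMap-like: the per-element lists are concatenated into one flat sequence
flat_mapped = list(chain.from_iterable(mapped))

print(mapped)
print(flat_mapped)
```

The map-like version keeps ten lists, including empty ones for 0 and 1, while the flatMap-like version yields the flat list of thirteen factors.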

## distinct()
Return a new RDD that contains the distinct elements of this RDD.

In [20]:
primes=primes.distinct()
primes.collect()

[2, 3, 5, 7]

## sample()
Return a sampled subset of this RDD.

Parameters
- withReplacement – whether elements can be sampled multiple times (replaced when sampled out)
- fraction – expected size of the sample as a fraction of this RDD’s size
  - without replacement: probability that each element is chosen; fraction must be in [0, 1]
  - with replacement: expected number of times each element is chosen; fraction must be >= 0
- seed – seed for the random number generator

Note that the sample is not guaranteed to contain exactly fraction × size elements.

In [42]:
digits.sample(False,0.5).collect()

[0, 1, 2, 4, 5, 6, 8]
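The output above has 7 elements rather than 5 because, without replacement, fraction is a per-element probability and not an exact size. The sketch below models that rule in plain Python; `bernoulli_sample` is a hypothetical helper for this example, not Spark's sampler.

```python
import random


def bernoulli_sample(items, fraction, seed):
    """Keep each element independently with probability `fraction`."""
    rng = random.Random(seed)
    return [x for x in items if rng.random() < fraction]


# Draw 200 samples with different seeds and look at how the size varies.
sizes = [len(bernoulli_sample(range(10), 0.5, seed)) for seed in range(200)]
average_size = sum(sizes) / len(sizes)

print(sorted(set(sizes)), average_size)
```

Individual sample sizes vary from run to run, while the average stays close to fraction × size.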

## union()
Return a new dataset that contains the union of the elements in the source dataset and the argument.

In [43]:
odds = digits.filter(lambda x: x % 2 == 1)
yy = evens.union(odds)
yy.collect()

[0, 2, 4, 6, 8, 1, 3, 5, 7, 9]

## intersection()
Return a new RDD that contains the intersection of elements in the source dataset and the argument.

In [44]:
intersects=evens.intersection(squares)
intersects.collect()

[0, 4]

## cartesian()
When called on datasets of types T and U, returns a dataset of (T, U) pairs (all pairs of elements).

In [45]:
cart=evens.cartesian(odds)
cart.collect()

[(0, 1),
 (0, 3),
 (0, 5),
 (0, 7),
 (0, 9),
 (2, 1),
 (4, 1),
 (2, 3),
 (2, 5),
 (4, 3),
 (4, 5),
 (2, 7),
 (2, 9),
 (4, 7),
 (4, 9),
 (6, 1),
 (8, 1),
 (6, 3),
 (6, 5),
 (8, 3),
 (8, 5),
 (6, 7),
 (6, 9),
 (8, 7),
 (8, 9)]

## groupByKey()
When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs.

In [46]:
cartgroup=cart.groupByKey()
cartgroup.collect()

[(0, <pyspark.resultiterable.ResultIterable at 0x113dfe430>),
 (2, <pyspark.resultiterable.ResultIterable at 0x113dfe610>),
 (4, <pyspark.resultiterable.ResultIterable at 0x113dfe490>),
 (6, <pyspark.resultiterable.ResultIterable at 0x113dfe160>),
 (8, <pyspark.resultiterable.ResultIterable at 0x113dfe190>)]

![](https://media.makeameme.org/created/nice-235c1c3f2e.jpg)
[Source](https://makeameme.org/meme/nice-235c1c3f2e)

The grouped values are hidden inside ResultIterable objects. Let's do something better and convert them to lists:

In [47]:
cartgroup.map(lambda x : (x[0], list(x[1]))).collect()

[(0, [1, 3, 5, 7, 9]),
 (2, [1, 3, 5, 7, 9]),
 (4, [1, 3, 5, 7, 9]),
 (6, [1, 3, 5, 7, 9]),
 (8, [1, 3, 5, 7, 9])]

## join(otherDataset, [numPartitions])
When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin, and fullOuterJoin.

In [48]:
cart2=squares.cartesian(odds)
cart2.collect()

[(0, 1),
 (0, 3),
 (0, 5),
 (0, 7),
 (0, 9),
 (1, 1),
 (4, 1),
 (1, 3),
 (1, 5),
 (4, 3),
 (4, 5),
 (1, 7),
 (1, 9),
 (4, 7),
 (4, 9),
 (9, 1),
 (16, 1),
 (25, 1),
 (36, 1),
 (9, 3),
 (9, 5),
 (16, 3),
 (16, 5),
 (25, 3),
 (25, 5),
 (36, 3),
 (36, 5),
 (9, 7),
 (9, 9),
 (16, 7),
 (16, 9),
 (25, 7),
 (25, 9),
 (36, 7),
 (36, 9),
 (49, 1),
 (64, 1),
 (81, 1),
 (49, 3),
 (49, 5),
 (64, 3),
 (64, 5),
 (81, 3),
 (81, 5),
 (49, 7),
 (49, 9),
 (64, 7),
 (64, 9),
 (81, 7),
 (81, 9)]

In [49]:
cart.join(cart2).collect()

[(0, (1, 1)),
 (0, (1, 3)),
 (0, (1, 5)),
 (0, (1, 7)),
 (0, (1, 9)),
 (0, (3, 1)),
 (0, (3, 3)),
 (0, (3, 5)),
 (0, (3, 7)),
 (0, (3, 9)),
 (0, (5, 1)),
 (0, (5, 3)),
 (0, (5, 5)),
 (0, (5, 7)),
 (0, (5, 9)),
 (0, (7, 1)),
 (0, (7, 3)),
 (0, (7, 5)),
 (0, (7, 7)),
 (0, (7, 9)),
 (0, (9, 1)),
 (0, (9, 3)),
 (0, (9, 5)),
 (0, (9, 7)),
 (0, (9, 9)),
 (4, (1, 1)),
 (4, (1, 3)),
 (4, (1, 5)),
 (4, (1, 7)),
 (4, (1, 9)),
 (4, (3, 1)),
 (4, (3, 3)),
 (4, (3, 5)),
 (4, (3, 7)),
 (4, (3, 9)),
 (4, (5, 1)),
 (4, (5, 3)),
 (4, (5, 5)),
 (4, (5, 7)),
 (4, (5, 9)),
 (4, (7, 1)),
 (4, (7, 3)),
 (4, (7, 5)),
 (4, (7, 7)),
 (4, (7, 9)),
 (4, (9, 1)),
 (4, (9, 3)),
 (4, (9, 5)),
 (4, (9, 7)),
 (4, (9, 9))]
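The shape of this result can be reproduced with a small inner-join model in plain Python. This is only a sketch of the semantics; `inner_join` is a name invented here, and the ordering of a real Spark result is not guaranteed to match.

```python
from collections import defaultdict
from itertools import product


def inner_join(left, right):
    """Pair every left value with every right value that shares its key."""
    right_by_key = defaultdict(list)
    for key, w in right:
        right_by_key[key].append(w)
    return [(key, (v, w)) for key, v in left for w in right_by_key.get(key, [])]


evens = [0, 2, 4, 6, 8]
odds = [1, 3, 5, 7, 9]
squares = [x * x for x in range(10)]

cart = list(product(evens, odds))      # keys 0, 2, 4, 6, 8
cart2 = list(product(squares, odds))   # keys 0, 1, 4, 9, ..., 81

joined = inner_join(cart, cart2)
print(len(joined), sorted({key for key, _ in joined}))
```

Only keys 0 and 4 appear on both sides, and each contributes 5 × 5 value pairs, which gives the 50 rows shown above.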

## cogroup(otherDataset, [numPartitions])
When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (Iterable<V>, Iterable<W>)) tuples. This operation is also called groupWith.

In [50]:
cart.cogroup(cart2).map(lambda x : (x[0], list(x[1][0])+list(x[1][1]))).collect()

[(0, [1, 3, 5, 7, 9, 1, 3, 5, 7, 9]),
 (2, [1, 3, 5, 7, 9]),
 (4, [1, 3, 5, 7, 9, 1, 3, 5, 7, 9]),
 (6, [1, 3, 5, 7, 9]),
 (8, [1, 3, 5, 7, 9]),
 (16, [1, 3, 5, 7, 9]),
 (36, [1, 3, 5, 7, 9]),
 (64, [1, 3, 5, 7, 9]),
 (1, [1, 3, 5, 7, 9]),
 (9, [1, 3, 5, 7, 9]),
 (25, [1, 3, 5, 7, 9]),
 (49, [1, 3, 5, 7, 9]),
 (81, [1, 3, 5, 7, 9])]
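Unlike join, cogroup keeps keys that appear on only one side. A minimal plain-Python model of that behaviour (the function name `cogroup` here is a local helper, not the Spark method) could look like this:

```python
from itertools import product


def cogroup(left, right):
    """Map each key to a pair of lists: (values from left, values from right)."""
    grouped = {}
    for key, v in left:
        grouped.setdefault(key, ([], []))[0].append(v)
    for key, w in right:
        grouped.setdefault(key, ([], []))[1].append(w)
    return grouped


evens = [0, 2, 4, 6, 8]
odds = [1, 3, 5, 7, 9]
squares = [x * x for x in range(10)]

grouped = cogroup(product(evens, odds), product(squares, odds))
print(len(grouped), grouped[2], grouped[0])
```

Key 2 exists only on the left, so its right-hand list is empty, while keys 0 and 4 carry values from both sides. That is why those two rows are twice as long in the output above.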

## pipe(command, [envVars])
Pipe each partition of the RDD through a shell command, e.g. a Perl or bash script. RDD elements are written to the process's stdin and lines output to its stdout are returned as an RDD of strings.


In [51]:
digits.pipe("tee").collect()

['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']

# Actions
https://spark.apache.org/docs/latest/rdd-programming-guide.html#actions

![](https://i.imgflip.com/315qmr.jpg)
[Source](https://imgflip.com/i/315qmr)

## reduce(func)
Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). 

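The function should be commutative and associative so that it can be computed correctly in parallel.

Subtraction is neither, so in the next cell the result depends on how the data is split into partitions:

In [83]:
# NOTE: the argument to range() is missing in the original cell;
# range(100) is an assumed placeholder, so the output recorded below may differ.
h = sc.parallelize(range(100), 50)

In [85]:
hd = h.reduce(lambda a, b: a - b)
hd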

1

In [55]:
sumOfDigits=digits.reduce(lambda a,b:a+b)
sumOfDigits

45
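Why the function must be commutative and associative can be shown with a plain-Python model of a partitioned reduce. The contiguous chunking used here is an assumption made for the sketch, and `partitioned_reduce` is a hypothetical helper.

```python
from functools import reduce


def partitioned_reduce(data, num_partitions, func):
    """Reduce each contiguous chunk first, then reduce the partial results."""
    size = len(data)
    bounds = [i * size // num_partitions for i in range(num_partitions + 1)]
    chunks = [data[bounds[i]:bounds[i + 1]] for i in range(num_partitions)]
    partials = [reduce(func, chunk) for chunk in chunks if chunk]
    return reduce(func, partials)


data = list(range(10))

# Addition is associative and commutative: same answer for any partitioning.
add_results = {p: partitioned_reduce(data, p, lambda a, b: a + b) for p in (1, 2, 5)}

# Subtraction is neither: the answer changes with the number of partitions.
sub_results = {p: partitioned_reduce(data, p, lambda a, b: a - b) for p in (1, 2, 5)}

print(add_results)
print(sub_results)
```

With addition every partitioning yields 45, whereas subtraction gives a different value for 1, 2 and 5 partitions.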

## count()
Return the number of elements in the dataset.

In [86]:
digits.count()

10

## collect()
Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.

In [87]:
cart.collect()

[(0, 1),
 (0, 3),
 (0, 5),
 (0, 7),
 (0, 9),
 (2, 1),
 (4, 1),
 (2, 3),
 (2, 5),
 (4, 3),
 (4, 5),
 (2, 7),
 (2, 9),
 (4, 7),
 (4, 9),
 (6, 1),
 (8, 1),
 (6, 3),
 (6, 5),
 (8, 3),
 (8, 5),
 (6, 7),
 (6, 9),
 (8, 7),
 (8, 9)]

## take(n)
Return an array with the first n elements of the dataset.

In [88]:
digits.take(2)

[0, 1]

## takeSample(withReplacement, num, [seed])
Return an array with a random sample of num elements of the dataset, with or without replacement, optionally pre-specifying a random number generator seed.

In [90]:
digits.takeSample(False,5)

[6, 4, 0, 9, 2]

## takeOrdered(n, [ordering])
Return the first n elements of the RDD using either their natural order or a custom comparator.


In [91]:
digits.takeOrdered(5)

[0, 1, 2, 3, 4]
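In PySpark the optional ordering is passed as a key function, for example `takeOrdered(5, key=lambda x: -x)`. The effect of such a key can be previewed locally with `heapq.nsmallest`; this is a plain-Python sketch of the semantics rather than a Spark call.

```python
import heapq

digits = list(range(10))

# Natural order: the five smallest elements.
smallest = heapq.nsmallest(5, digits)

# Custom key: negating the value turns "smallest" into "largest".
largest = heapq.nsmallest(5, digits, key=lambda x: -x)

print(smallest, largest)
```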

## first()
Return the first element of the dataset (similar to take(1)).

In [92]:
digits.first()

0

## countByKey()
Only available on RDDs of type (K, V). Returns a hashmap of (K, Int) pairs with the count of each key.

In [93]:
cart.countByKey()

defaultdict(int, {0: 5, 2: 5, 4: 5, 6: 5, 8: 5})

## histogram(buckets)
Compute a histogram using the provided buckets. The buckets are all open to the right except for the last which is closed. e.g. [1,10,20,50] means the buckets are [1,10) [10,20) [20,50], which means 1<=x<10, 10<=x<20, 20<=x<=50. And on the input of 1 and 50 we would have a histogram of 1,0,1.

In [94]:
rdd = sc.parallelize(range(51))

In [96]:
rdd.histogram(2)

([0, 25, 50], [25, 26])

In [97]:
rdd.histogram([0, 5, 25, 50])

([0, 5, 25, 50], [5, 20, 26])

In [99]:
rdd.collect()

[0,
 1,
 2,
 3,
 4,
 5,
 6,
 7,
 8,
 9,
 10,
 11,
 12,
 13,
 14,
 15,
 16,
 17,
 18,
 19,
 20,
 21,
 22,
 23,
 24,
 25,
 26,
 27,
 28,
 29,
 30,
 31,
 32,
 33,
 34,
 35,
 36,
 37,
 38,
 39,
 40,
 41,
 42,
 43,
 44,
 45,
 46,
 47,
 48,
 49,
 50]

In [100]:
# Works also with strings
rdd = sc.parallelize(["ab", "ac", "b", "bd", "ef"])
rdd.histogram(("a", "b", "c"))

(('a', 'b', 'c'), [2, 2])
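The bucket rule (every bucket open on the right, except the last one, which is closed) can be checked with a short plain-Python model. `bucket_counts` is a hypothetical helper for this sketch and only covers explicit bucket boundaries.

```python
from bisect import bisect_right


def bucket_counts(values, buckets):
    """Count values per bucket; the last bucket also includes its upper bound."""
    counts = [0] * (len(buckets) - 1)
    for v in values:
        if v < buckets[0] or v > buckets[-1]:
            continue                      # outside all buckets
        if v == buckets[-1]:
            counts[-1] += 1               # closed upper end of the last bucket
        else:
            counts[bisect_right(buckets, v) - 1] += 1
    return counts


numeric = bucket_counts(range(51), [0, 5, 25, 50])
edges = bucket_counts([1, 50], [1, 10, 20, 50])
strings = bucket_counts(["ab", "ac", "b", "bd", "ef"], ("a", "b", "c"))

print(numeric, edges, strings)
```

The model reproduces the counts shown in the cells above, including the string case, where "ef" falls outside the last bucket and is not counted.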

## stats()
Return a StatCounter object that captures the mean, variance and count of the RDD’s elements in one operation.

In [101]:
digits.stats()

(count: 10, mean: 4.5, stdev: 2.8722813232690143, max: 9.0, min: 0.0)
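The stdev reported here is the population standard deviation (dividing by n), not the sample one (dividing by n - 1). A quick check with Python's standard `statistics` module, independent of Spark, makes the difference visible:

```python
import statistics

digits = list(range(10))

mean = statistics.mean(digits)
population_stdev = statistics.pstdev(digits)  # divides by n
sample_stdev = statistics.stdev(digits)       # divides by n - 1

print(mean, population_stdev, sample_stdev)
```

The population value matches the figure in the output above, while the sample value is slightly larger.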

## foreach(func)
Run a function func on each element of the dataset. This is usually done for side effects such as updating an Accumulator or interacting with external storage systems. 
Note: modifying variables other than Accumulators outside of the foreach() may result in undefined behavior. See Understanding closures for more details.

In [110]:
countdown=sc.parallelize(range(11))
countdown.foreach(lambda x: print("[%-10s] %d%%" % ('='*x, 10*x)))

[          ] 0%
[=         ] 10%
[==        ] 20%
[===       ] 30%
[====      ] 40%
[=====     ] 50%


In [111]:
sc.stop()