# Spark RDD and Beyond

## Types of RDD

- ParallelCollectionRDD: an RDD created by SparkContext by parallelizing an existing collection, e.g. with sc.parallelize or sc.makeRDD.
- CoGroupedRDD: an RDD that cogroups its parents. For each key k in the parent RDDs, the resulting RDD contains a tuple with the list of values for that key.
- HadoopRDD: an RDD that provides core functionality for reading data stored in HDFS using the older MapReduce API. The most notable use case is SparkContext.textFile, which reads its input through a HadoopRDD.
- MapPartitionsRDD: an RDD that applies the provided function to every partition of the parent RDD. It is the result of operations like map, flatMap, filter and mapPartitions.
- CoalescedRDD: the result of repartition or coalesce transformations.
- ShuffledRDD: the result of shuffling, e.g. after repartition, or after coalesce when shuffling is enabled.
- PipedRDD: an RDD created by piping elements to a forked external process.
- SequenceFileRDD: an RDD that can be saved as a SequenceFile.

[Source](https://medium.com/knoldus/rdd-sparks-fault-tolerant-in-memory-weapon-130f8df2f996)

# Let's practice

In [34]:
import findspark
import pyspark
findspark.find() 
findspark

<module 'findspark' from '/Users/nics/miniforge3/lib/python3.9/site-packages/findspark.py'>

In [37]:
conf = pyspark.SparkConf().setAppName('Tap').setMaster('local')
sc = pyspark.SparkContext(conf=conf)
sc

22/05/02 14:38:24 WARN Utils: Service 'sparkDriver' could not bind on a random free port. You may check whether configuring an appropriate binding address.

Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.net.BindException: Can't assign requested address: Service 'sparkDriver' failed after 16 retries (on a random free port)! Consider explicitly setting the appropriate binding address for the service 'sparkDriver' (for example spark.driver.bindAddress for SparkDriver) to the correct binding address.
	at sun.nio.ch.Net.bind0(Native Method)
	at sun.nio.ch.Net.bind(Net.java:461)
	at sun.nio.ch.Net.bind(Net.java:453)
	at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:222)
	at io.netty.channel.socket.nio.NioServerSocketChannel.doBind(NioServerSocketChannel.java:134)
	at io.netty.channel.AbstractChannel$AbstractUnsafe.bind(AbstractChannel.java:562)
	at io.netty.channel.DefaultChannelPipeline$HeadContext.bind(DefaultChannelPipeline.java:1334)
	at io.netty.channel.AbstractChannelHandlerContext.invokeBind(AbstractChannelHandlerContext.java:506)
	at io.netty.channel.AbstractChannelHandlerContext.bind(AbstractChannelHandlerContext.java:491)
	at io.netty.channel.DefaultChannelPipeline.bind(DefaultChannelPipeline.java:973)
	at io.netty.channel.AbstractChannel.bind(AbstractChannel.java:260)
	at io.netty.bootstrap.AbstractBootstrap$2.run(AbstractBootstrap.java:356)
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:469)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.lang.Thread.run(Thread.java:750)
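
The `BindException` above usually means the driver could not bind to the address derived from the machine's hostname. One common workaround, sketched here under the assumption that running everything on the loopback interface is acceptable, is to set `spark.driver.bindAddress` (the property named in the error message) and `spark.driver.host` explicitly. The names `LOOPBACK_SETTINGS` and `build_local_conf` are illustrative and not part of the original notebook.

```python
# Assumption: a purely local run, so the loopback address is a valid choice.
LOOPBACK_SETTINGS = {
    "spark.driver.bindAddress": "127.0.0.1",
    "spark.driver.host": "127.0.0.1",
}


def build_local_conf(app_name="Tap", master="local"):
    """Return a SparkConf that pins the driver to the loopback interface."""
    import pyspark  # imported here so the settings can be inspected without Spark

    conf = pyspark.SparkConf().setAppName(app_name).setMaster(master)
    for key, value in LOOPBACK_SETTINGS.items():
        conf = conf.set(key, value)
    return conf
```

With this helper, the context would be created as `sc = pyspark.SparkContext(conf=build_local_conf())`. Whether this resolves the error depends on the local network setup.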


In [29]:
# Create an RDD
digits = sc.parallelize([0,1,2,3,4,5,6,7,8,9])
digits

ParallelCollectionRDD[0] at readRDDFromFile at PythonRDD.scala:274

[▶![](http://img.youtube.com/vi/MVkjC-doVP4/0.jpg)](https://www.youtube.com/watch?v=MVkjC-doVP4&start=120)

**RDD.collect()**

Return a list that contains all of the elements in this RDD.

**Notes**

This method should only be used if the resulting array is expected to be small, as all the data is loaded into the driver’s memory.

In [5]:
digits.collect()

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

# Transformations
https://spark.apache.org/docs/latest/rdd-programming-guide.html#transformations

![](https://i.ytimg.com/vi/2IQxywxlMO8/hqdefault.jpg)

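A transformation returns a new RDD. Transformations are lazy: nothing is computed until an action such as collect() asks for a result.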
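
Spark evaluates transformations lazily, which is easy to miss when reading the cells below. The following plain-Python analogy (it does not use Spark; the built-in `map` stands in for a transformation and `list` for an action) shows that the function only runs once a result is requested.

```python
calls = []  # records every time the function is actually executed


def square(x):
    calls.append(x)
    return x * x


digits_local = range(10)
squares_lazy = map(square, digits_local)  # like a transformation: nothing runs yet
calls_before_action = len(calls)

result = list(squares_lazy)               # like an action: forces the computation
calls_after_action = len(calls)
```

Here `calls_before_action` is 0 and `calls_after_action` is 10, mirroring how `digits.map(...)` only describes work that `collect()` later triggers.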

## map(func)	
Return a new distributed dataset formed by passing each element of the source through a function func. 

In [7]:
squares=digits.map(lambda x: x*x)

In [8]:
squares

PythonRDD[1] at RDD at PythonRDD.scala:53

In [9]:
squares.collect()

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

## filter(func)	
Return a new dataset formed by selecting those elements of the source on which func returns true.

In [10]:
evens = digits.filter(lambda x: x % 2 == 0)

In [11]:
evens.collect()

[0, 2, 4, 6, 8]

## flatMap(func)
Similar to map, but each input item can be mapped to 0 or more output items, so func should return a sequence (in Python, any iterable such as a list) rather than a single item.

In [12]:
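def factors(nr):
    """Return the prime factors of nr, with multiplicity."""
    i = 2
    result = []
    while i <= nr:
        if nr % i == 0:
            result.append(i)
            nr //= i  # integer division keeps nr an int
        else:
            i += 1
    return result

# flatMap flattens the per-element lists into a single RDD
primes = digits.flatMap(factors)
primes.collect()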

[2, 3, 2, 2, 5, 2, 3, 7, 2, 2, 2, 3, 3]
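
To make the difference between map and flatMap concrete, here is a plain-Python sketch (no Spark involved; `prime_factors` is a local re-implementation of the function in the cell above). A list comprehension plays the role of map and `itertools.chain.from_iterable` plays the role of the flattening step.

```python
from itertools import chain


def prime_factors(n):
    """Return the prime factors of n, with multiplicity."""
    result, i = [], 2
    while i <= n:
        if n % i == 0:
            result.append(i)
            n //= i
        else:
            i += 1
    return result


digits_local = list(range(10))

# map-like: one output (a list) per input element
mapped = [prime_factors(d) for d in digits_local]

# flatMap-like: the per-element lists are concatenated into one sequence
flat = list(chain.from_iterable(mapped))
```

`mapped` keeps the empty lists for 0 and 1, while `flat` matches the flatMap output shown above.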

## distinct()
Return a new RDD that contains the distinct elements of this RDD.

In [13]:
primes=primes.distinct()
primes.collect()

[2, 3, 5, 7]

## sample()
Return a sampled subset of this RDD.

Parameters
- withReplacement – whether elements can be sampled multiple times (replaced when sampled out)
- fraction – expected size of the sample as a fraction of this RDD’s size
  - without replacement: probability that each element is chosen; fraction must be in [0, 1]
  - with replacement: expected number of times each element is chosen; fraction must be >= 0
- seed – seed for the random number generator

In [23]:
digits.sample(False,0.2).collect()

[4, 8, 9]
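
Note that the result above has three elements even though 0.2 of ten elements is two: without replacement, fraction is a per-element probability, not an exact size. The plain-Python sketch below illustrates that idea. It is an analogy only; `bernoulli_sample` is a hypothetical helper and does not reproduce Spark's actual random number generation.

```python
import random


def bernoulli_sample(items, fraction, seed=None):
    """Keep each item independently with probability `fraction`."""
    rng = random.Random(seed)
    return [x for x in items if rng.random() < fraction]


# The sample size varies from seed to seed; only its expected value is fixed.
sizes = [len(bernoulli_sample(range(10), 0.2, seed=s)) for s in range(5)]
```

Passing the same seed twice yields the same sample, which is what the seed parameter is for.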

## union()
Return a new dataset that contains the union of the elements in the source dataset and the argument.

In [24]:
odds = digits.filter(lambda x: x % 2 == 1)
yy = evens.union(odds)
yy.collect()

[0, 2, 4, 6, 8, 1, 3, 5, 7, 9]

## intersection()
Return a new RDD that contains the intersection of elements in the source dataset and the argument.

In [25]:
intersects=evens.intersection(squares)
intersects.collect()

[0, 4]
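
One difference between the two operations is not visible in the examples above: union keeps duplicates, while intersection never returns them. A plain-Python sketch of that behaviour (lists and sets stand in for RDDs, so ordering and partitioning are not modelled):

```python
evens_local = [0, 2, 4, 6, 8]
squares_local = [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

# union-like: simple concatenation, so 0 and 4 appear twice
union_like = evens_local + squares_local

# intersection-like: common elements, each reported once
intersection_like = sorted(set(evens_local) & set(squares_local))
```

In Spark, calling distinct() on the result of union removes the repeated elements.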

## cartesian()
When called on datasets of types T and U, returns a dataset of (T, U) pairs (all pairs of elements).

In [26]:
cart=evens.cartesian(odds)
cart.collect()

[(0, 1),
 (0, 3),
 (0, 5),
 (0, 7),
 (0, 9),
 (2, 1),
 (4, 1),
 (2, 3),
 (2, 5),
 (4, 3),
 (4, 5),
 (2, 7),
 (2, 9),
 (4, 7),
 (4, 9),
 (6, 1),
 (8, 1),
 (6, 3),
 (6, 5),
 (8, 3),
 (8, 5),
 (6, 7),
 (6, 9),
 (8, 7),
 (8, 9)]
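
For comparison, the same set of pairs can be produced in plain Python with `itertools.product`. This is only a sketch of the semantics: the element order differs from the Spark output above, where the order depends on how the data is partitioned.

```python
from itertools import product

evens_local = [0, 2, 4, 6, 8]
odds_local = [1, 3, 5, 7, 9]

# Every element of the first collection paired with every element of the second
pairs = list(product(evens_local, odds_local))
```

The result has 5 x 5 = 25 pairs, which is why cartesian becomes expensive quickly on large RDDs.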

## groupByKey()
When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs.

In [27]:
cartgroup=cart.groupByKey()
cartgroup.collect()

[(0, <pyspark.resultiterable.ResultIterable at 0x107ef0e80>),
 (2, <pyspark.resultiterable.ResultIterable at 0x107e818e0>),
 (4, <pyspark.resultiterable.ResultIterable at 0x107e81dc0>),
 (6, <pyspark.resultiterable.ResultIterable at 0x107e81c10>),
 (8, <pyspark.resultiterable.ResultIterable at 0x107ef7520>)]

![](https://media.makeameme.org/created/nice-235c1c3f2e.jpg)
[Source](https://makeameme.org/meme/nice-235c1c3f2e)

The raw ResultIterable objects are not readable, so let's convert each one to a list:

In [28]:
cartgroup.map(lambda x : (x[0], list(x[1]))).collect()

[(0, [1, 3, 5, 7, 9]),
 (2, [1, 3, 5, 7, 9]),
 (4, [1, 3, 5, 7, 9]),
 (6, [1, 3, 5, 7, 9]),
 (8, [1, 3, 5, 7, 9])]
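
What groupByKey computes can be sketched in plain Python with a `defaultdict`. No Spark is involved, and the pairs are rebuilt locally rather than taken from `cart`.

```python
from collections import defaultdict
from itertools import product

pairs = list(product([0, 2, 4, 6, 8], [1, 3, 5, 7, 9]))

# Collect all values that share a key into one list per key
groups = defaultdict(list)
for key, value in pairs:
    groups[key].append(value)

grouped = sorted(groups.items())
```

In PySpark itself, `cartgroup.mapValues(list).collect()` should give the same result as the `map` call above with slightly less code.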

# Actions
https://spark.apache.org/docs/latest/rdd-programming-guide.html#actions

## reduce(func)
Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). 

The function should be commutative and associative so that it can be computed correctly in parallel.


In [29]:
sumOfDigits=digits.reduce(lambda a,b:a+b)
sumOfDigits

45
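
The requirement that the function be commutative and associative can be illustrated without Spark. The sketch below simulates a parallel reduce by splitting a list into chunks, reducing each chunk, and then combining the partial results. `partitioned_reduce` is a hypothetical helper, and real Spark partitioning may split the data differently.

```python
from functools import reduce
from operator import add, sub


def partitioned_reduce(func, data, num_partitions):
    """Reduce each chunk separately, then reduce the partial results."""
    size = -(-len(data) // num_partitions)  # ceiling division
    parts = [data[i:i + size] for i in range(0, len(data), size)]
    partials = [reduce(func, part) for part in parts]
    return reduce(func, partials)


digits_local = list(range(10))

# Addition is associative and commutative: the partitioning does not matter
sums = [partitioned_reduce(add, digits_local, n) for n in (1, 2, 5)]

# Subtraction is neither: the result changes with the partitioning
diffs = [partitioned_reduce(sub, digits_local, n) for n in (1, 2, 5)]
```

`sums` is `[45, 45, 45]`, while `diffs` is `[-45, 15, 3]`, so a function like subtraction would give results that depend on how the RDD happens to be partitioned.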

## count()
Return the number of elements in the dataset.

In [30]:
digits.count()

10

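## countByKey()
Only available on RDDs of (K, V) pairs. Returns a dictionary of (K, int) pairs with the count of each key.

In [31]: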
cart.countByKey()

defaultdict(int, {0: 5, 2: 5, 4: 5, 6: 5, 8: 5})
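
The same counts can be reproduced in plain Python with `collections.Counter`, which may help when checking results on a small sample. The pairs are rebuilt locally here instead of being read from `cart`.

```python
from collections import Counter
from itertools import product

pairs = list(product([0, 2, 4, 6, 8], [1, 3, 5, 7, 9]))

# Count how many pairs share each key, ignoring the values
key_counts = Counter(key for key, _ in pairs)
```

Like collect(), countByKey brings its result back to the driver, so it suits cases where the number of distinct keys is small.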

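## histogram(buckets)
Compute a histogram of the RDD's values. When buckets is an integer, the range between the minimum and the maximum is split into that many evenly spaced buckets. The result is a pair: the bucket boundaries and the count per bucket.

In [34]: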
digits.histogram(20)

([0.0,
  0.45,
  0.9,
  1.35,
  1.8,
  2.25,
  2.7,
  3.15,
  3.6,
  4.05,
  4.5,
  4.95,
  5.4,
  5.8500000000000005,
  6.3,
  6.75,
  7.2,
  7.65,
  8.1,
  8.55,
  9],
 [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1])
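
The output above can be understood with a small plain-Python sketch of an evenly spaced histogram. `evenly_spaced_histogram` is a hypothetical helper written for illustration and is not PySpark's implementation. It assumes the data contains at least two distinct values and treats the last bucket as closed on the right, so the maximum falls into it.

```python
def evenly_spaced_histogram(values, num_buckets):
    """Return (edges, counts) for num_buckets equal-width buckets."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_buckets  # assumes hi > lo
    edges = [lo + i * width for i in range(num_buckets)] + [hi]
    counts = [0] * num_buckets
    for v in values:
        # Clamp so that the maximum value lands in the last bucket
        idx = min(int((v - lo) / width), num_buckets - 1)
        counts[idx] += 1
    return edges, counts


edges, counts = evenly_spaced_histogram(list(range(10)), 20)
```

With ten values spread over twenty buckets of width 0.45, each value gets its own bucket and the remaining ten buckets stay empty, which matches the counts returned by `digits.histogram(20)`.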

In [35]:
sc.stop()