# Transformations in RDDs
A transformation in Spark is a function that takes an existing RDD (the source RDD),
applies an operation to it, and returns a new RDD (the target RDD). The source RDD is left unchanged, because RDDs are immutable.

> RDDs are not evaluated until an action is performed on them: this means that transformations are lazily evaluated.
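
As a rough analogy only (this is plain Python, not Spark), the built-in `filter` is also lazy: the predicate does not run until something consumes the result. The helper `keep_positive` and the `calls` list are hypothetical names introduced for this sketch.

```python
# Plain-Python analogy for lazy evaluation (no Spark involved).
calls = []  # records every element the predicate actually sees


def keep_positive(pair):
    calls.append(pair)
    return pair[1] > 0


data = [('A', 7), ('A', -4), ('B', 3)]

lazy = filter(keep_positive, data)  # like a transformation: nothing runs yet
calls_before = len(calls)

result = list(lazy)                 # like an action: forces the evaluation
calls_after = len(calls)

print(calls_before, calls_after, result)
```

In Spark the same idea applies at cluster scale: `filter()` only records what should be done, and `collect()` triggers the actual work.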

In [1]:
# import required Spark class
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()



The session's **sparkContext** is used to create RDDs

In [2]:
tuples = [('A', 7), ('A', 8), ('A', -4),
('B', 3), ('B', 9), ('B', -1),
('C', 1), ('C', 5)]
rdd = spark.sparkContext.parallelize(tuples)
rdd.collect() # this is an action

In [4]:
# keep only positive values (drops negatives and zeros)
positives = rdd.filter(lambda x: x[1] > 0)  # transformation
positives.collect()  # action


[('A', 7), ('A', 8), ('B', 3), ('B', 9), ('C', 1), ('C', 5)]

In [5]:
# goal: find sum and average per key using groupByKey()
# groupByKey() groups the values by key (the first element of each tuple)
# any later function is then applied to the grouped VALUES (the second element)
grouped = positives.groupByKey()
grouped.collect()


[('B', <pyspark.resultiterable.ResultIterable at 0x7bab4734fcd0>),
 ('A', <pyspark.resultiterable.ResultIterable at 0x7bab3ff9e750>),
 ('C', <pyspark.resultiterable.ResultIterable at 0x7bab3f54ded0>)]

> You haven't applied a function to the groups yet, which is why only `ResultIterable` objects are shown. You need to tell Spark how you want to merge the values for each key.

**Example 1**
This is one way to make Spark show you what is in the groups:

In [6]:
grouped.mapValues(list).collect()

[('B', [3, 9]), ('A', [7, 8]), ('C', [1, 5])]

Find sum and average per key using <mark>groupByKey()</mark>

In [7]:
# 2 operations, SUM and AVG in one step
sum_and_avg = grouped.mapValues(lambda v: (sum(v), float(sum(v))/len(v)))
sum_and_avg.collect()

[('B', (12, 6.0)), ('A', (15, 7.5)), ('C', (6, 3.0))]

Find sum and average per key using <mark>reduceByKey()</mark>
> **reduceByKey:** Merges the values for each key using an associative and commutative reduce function.

In [9]:
positives.collect()

[('A', 7), ('A', 8), ('B', 3), ('B', 9), ('C', 1), ('C', 5)]

In [8]:

# 1. map each value to a (value, 1) pair so that sum and count can be accumulated together
sum_count = positives.mapValues(lambda v: (v, 1))
sum_count.collect()

[('A', (7, 1)),
 ('A', (8, 1)),
 ('B', (3, 1)),
 ('B', (9, 1)),
 ('C', (1, 1)),
 ('C', (5, 1))]

> **mapValues:** Pass each value in the key-value pair RDD through a map function without changing the keys; this also retains the original RDD’s partitioning.

In [10]:
# 2. aggregate (sum, count) per key
sum_count_agg = sum_count.reduceByKey(lambda x, y: (x[0]+y[0], x[1]+y[1]))
sum_count_agg.collect()

[('B', (12, 2)), ('A', (15, 2)), ('C', (6, 2))]

> Example for key B: ('B', (3, 1)) and ('B', (9, 1)) --> (3+9, 1+1) --> ('B', (12, 2))
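
The merge for a single key can be traced in plain Python with `functools.reduce`. This is only a sketch of what the reduce function does to the values of key B, not of how Spark schedules the work; `merge` and `pairs_for_b` are names introduced here for illustration.

```python
from functools import reduce


def merge(x, y):
    # same logic as the lambda passed to reduceByKey():
    # add the partial sums and add the partial counts
    return (x[0] + y[0], x[1] + y[1])


pairs_for_b = [(3, 1), (9, 1)]           # the (value, 1) pairs for key 'B'
total, count = reduce(merge, pairs_for_b)
average = total / count

print(total, count, average)
```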

In [11]:
# 3. finalize sum and average per key
sum_and_avg = sum_count_agg.mapValues(lambda v: (v[0], float(v[0])/v[1]))
sum_and_avg.collect()

[('B', (12, 6.0)), ('A', (15, 7.5)), ('C', (6, 3.0))]

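> This version is more verbose, but `reduceByKey()` merges values within each partition before the shuffle, so it typically moves less data across the cluster than `groupByKey()`.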
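
To see why the reduce function has to be associative and commutative, here is a plain-Python sketch of merging values per partition first and then merging the partial results. The two-partition layout is an assumption made up for this example (Spark decides the real layout), and `merge` and `combine_locally` are hypothetical helper names.

```python
def merge(x, y):
    # same reduce function as in the reduceByKey() step
    return (x[0] + y[0], x[1] + y[1])


def combine_locally(records):
    # fold all values that share a key into one value per key
    acc = {}
    for key, value in records:
        acc[key] = merge(acc[key], value) if key in acc else value
    return acc


# assumed layout: the six (key, (value, 1)) records split over two partitions
partitions = [
    [('A', (7, 1)), ('B', (3, 1)), ('A', (8, 1))],
    [('B', (9, 1)), ('C', (1, 1)), ('C', (5, 1))],
]

# step 1: each partition merges its own records
local = [combine_locally(p) for p in partitions]

# step 2: only the partial results are brought together and merged again
partials = [item for part in local for item in part.items()]
final = combine_locally(partials)

records_before = sum(len(p) for p in partitions)
records_after = len(partials)
print(final, records_before, records_after)
```

Because the partial results can arrive in any order and grouping, the final answer is only well defined when the function gives the same result regardless of order and grouping.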

# Actions
* reduce(): Applies a function to deliver a single value, such as adding values for a given RDD[Integer]
* collect(): Returns the elements of an RDD[T] to the driver as a list of type T
* count(): Finds the number of elements in a given RDD
* saveAsTextFile(): Saves RDD elements to disk as text files
* collectAsMap(): Returns RDD[(K, V)] elements to the driver as a dict[K, V]

In [None]:
spark.stop()