#### RDD Transformations - Wide Transformations

- distinct
- sortBy
- groupBy
- partitionBy
- sortByKey
- groupByKey
- reduceByKey
- repartition
- coalesce (narrow by default; wide only when called with `shuffle=True`)


In [None]:
%run "0.RDD-Sample-Data.ipynb"

In [None]:
rddWords.collect()

In [None]:
sc

**distinct**

In [None]:
rdd1.glom().collect()

In [None]:
#rdd1.distinct().glom().collect()
#rdd1.distinct(2).glom().collect()
#rddWords.distinct().glom().collect()

sc \
.textFile("E:\\Spark\\wordcount.txt", 4) \
.flatMap(lambda x: x.split()) \
.distinct(2) \
.glom() \
.collect()

In [None]:
rdd2.distinct().collect()

In [None]:
rddWords.distinct().collect()

In [None]:
rddFile.flatMap(lambda x: x).distinct().collect()
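
Why `distinct` needs a shuffle can be sketched in plain Python. To my understanding, PySpark builds `distinct` from a map to `(element, None)` pairs, a `reduceByKey`, and a map back to the keys. The block below only models those three steps locally; the list `data` is a made-up sample, and no Spark API is called.

```python
# Plain-Python model of distinct(); `data` is a made-up sample, not notebook data.
data = [3, 1, 3, 2, 1, 3]

# Step 1: map each element to an (element, None) pair.
pairs = [(x, None) for x in data]

# Step 2: reduce by key, keeping one value per key (the shuffle step in Spark).
reduced = {}
for key, value in pairs:
    reduced.setdefault(key, value)

# Step 3: map back to the keys only. Spark gives no ordering guarantee;
# sorting here just makes the printed result easy to read.
unique = sorted(reduced.keys())
print(unique)
```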

**sortBy**

- Sorts the elements of the RDD by the value the key function returns for each element.
- Parameters: keyfunc (U -> V); optional: ascending (True/False, default True), numPartitions
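
The effect of the key function and the `ascending` flag can be modeled locally with Python's `sorted`. This is a minimal sketch, not Spark code: the list `data` and the helper `sort_by` are made up for illustration, and the real `sortBy` also spreads the sorted output over `numPartitions` partitions.

```python
# Plain-Python model of sortBy(keyfunc, ascending); `data` is a made-up sample.
data = [5, 3, 8, 1, 9, 2]

def sort_by(items, keyfunc, ascending=True):
    # Order the elements by the value keyfunc returns for each of them.
    return sorted(items, key=keyfunc, reverse=not ascending)

# Sort by the element itself, largest first.
descending = sort_by(data, lambda x: x, ascending=False)

# Sort by the remainder modulo 3. Python's sort keeps ties in input order;
# Spark makes no such promise for elements with equal keys.
by_remainder = sort_by(data, lambda x: x % 3)

print(descending)
print(by_remainder)
```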

In [None]:
rdd1.glom().collect()

In [None]:
rdd1.sortBy(lambda x: x, False).collect()

In [None]:
#rdd1.sortBy(lambda x: x%3).glom().collect()
#rdd1.sortBy(lambda x: x%2, False, 2).glom().collect()
rdd1.sortBy(lambda x: x > 4, False, 1).glom().collect()

In [None]:
#rdd2.sortBy(sum, False).collect()
#rdd2.sortBy(lambda x: x[1]).glom().collect()
rdd2.sortBy(lambda x: x).glom().collect()

In [None]:
#rddWords.sortBy(len, False, 10).glom().collect()
#rddWords.sortBy(lambda x: x[-1], True, 3).glom().collect()
rddWords.sortBy(lambda x: x).glom().collect()

**groupBy**

Returns a pair RDD where:
- key: each unique value returned by the function
- value: a **ResultIterable** holding the RDD elements for which the function returned that key
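
The shape of the result can be sketched without Spark. In this local model the helper `group_by` and the list `words` are made up for illustration; plain lists stand in for the ResultIterable values, which is roughly what `mapValues(list)` produces in the cells that follow.

```python
from collections import defaultdict

# Made-up sample words; the notebook's rddWords may contain different data.
words = ["spark", "rdd", "scala", "py", "hive", "java"]

def group_by(items, func):
    # Collect every element under the key that func returns for it.
    groups = defaultdict(list)
    for item in items:
        groups[func(item)].append(item)
    # Sort by key only to make the output deterministic and readable.
    return sorted(groups.items())

grouped = group_by(words, len)

# Counting the members of each group, like mapValues(len) on the grouped RDD.
counts = [(key, len(members)) for key, members in grouped]

print(grouped)
print(counts)
```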

In [None]:
rdd1.collect()

In [None]:
#rdd1.groupBy(lambda x: x%2).map(lambda p: (p[0], list(p[1]))).collect()
#rdd1.groupBy(lambda x: x%2).mapValues(list).collect()
rdd1.groupBy(lambda x: x > 4, 2).mapValues(list).glom().collect()

In [None]:
#rdd2.groupBy(lambda x: x[1]).mapValues(list).collect()
rdd2.groupBy(sum).mapValues(list).collect()

In [None]:
#rddWords.groupBy(lambda x: x[0]).mapValues(list).collect()
rddWords.groupBy(len).mapValues(list).collect()

- Count how many words of each length there are in the *rddWords* RDD:

In [None]:
rddWords.groupBy(len).mapValues(len).sortBy(lambda x: x).collect()


**Transformations that can be applied only to Pair RDDs**
- partitionBy
- sortByKey
- groupByKey
- reduceByKey
- mapValues

**partitionBy**

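- Applies only to pair RDDs
- Each pair is placed in the partition given by its key: `partitionFunc(key) % numPartitions` (the default `partitionFunc` is a hash of the key)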
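
Key-based partitioning can be sketched in plain Python. The model below assumes that the target partition is the partition function's result modulo the number of partitions, which matches my reading of PySpark's `partitionBy`. The helper `partition_by` and the sample `pairs` are made up for illustration.

```python
def partition_by(pairs, num_partitions, partition_func):
    # One bucket per target partition.
    buckets = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        # Only the key decides where the pair goes.
        buckets[partition_func(key) % num_partitions].append((key, value))
    return buckets

# Made-up sample pairs with integer keys.
pairs = [(1, "a"), (2, "b"), (6, "c"), (7, "d")]

# Custom partition function, in the style of lambda x: x + 17 with 5 partitions.
buckets = partition_by(pairs, 5, lambda k: k + 17)
print(buckets)
```

Keys 1 and 6 land together because 18 and 23 leave the same remainder when divided by 5; the same holds for keys 2 and 7.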

In [None]:
rdd2.glom().collect()

In [None]:
#rdd_partitioned = rdd2.partitionBy(5)
rdd_partitioned = rdd2.partitionBy(5, lambda x: x + 17)
rdd_partitioned.glom().collect()

In [None]:
transactions = [
    {'name': 'Raju', 'amount': 100, 'city': 'Chennai'},
    {'name': 'Mahesh', 'amount': 15, 'city': 'Hyderabad'},
    {'name': 'Madhu', 'amount': 51, 'city': 'Hyderabad'},
    {'name': 'Revati', 'amount': 200, 'city': 'Chennai'},
    {'name': 'Amrita', 'amount': 75, 'city': 'Pune'},
    {'name': 'Aditya', 'amount': 175, 'city': 'Bangalore'},
    {'name': 'Keertana', 'amount': 105, 'city': 'Pune'},
    {'name': 'Keertana', 'amount': 105, 'city': 'Vijayawada'},
    {'name': 'Amrita', 'amount': 75, 'city': 'Pune'},
    {'name': 'Aditya', 'amount': 175, 'city': 'Bangalore'},
    {'name': 'Keertana', 'amount': 105, 'city': 'Pune'},
    {'name': 'Keertana', 'amount': 105, 'city': 'Vijayawada'}]

In [None]:
def city_partitioner(city):
    # Map each city to a fixed partition index; unlisted cities go to partition 3.
    if city in ("Hyderabad", "Vijayawada"):
        return 0
    elif city == "Bangalore":
        return 1
    elif city == "Chennai":
        return 2
    else:
        return 3

In [None]:
transactions_rdd = sc.parallelize(transactions, 3) \
                    .map(lambda d: (d['city'], d)) \
                    .partitionBy(4, city_partitioner) \
                    .map(lambda p: p[1])

transactions_rdd.glom().collect()

**sortByKey**

- Applies only to pair RDDs
- Sorts the pairs by key; optional: ascending (True/False, default True), numPartitions

In [None]:
#rdd2.sortByKey().collect()
#rdd2.sortByKey(False).glom().collect()
rdd2.sortByKey(True, 6).glom().collect()

In [None]:
rddPairs = rddWords.map(lambda x: (x, 1))
rddPairs.collect()

In [None]:
rddPairs.sortByKey().collect()

**groupByKey**

- Applies only to pair RDDs
- Returns a Pair RDD where:
	- key: each unique key of the (K, V) pairs
	- value: *ResultIterable*. Grouped values with the same key.
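
A local sketch of the grouping may help. The helper `group_by_key` and the sample `pairs` are made up for illustration, and plain lists stand in for the ResultIterable values.

```python
from collections import defaultdict

# Made-up sample of (K, V) pairs.
pairs = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("c", 5)]

def group_by_key(pairs):
    # Gather all values that share a key; the values are kept as they are.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Sort by key only to make the output deterministic.
    return sorted(groups.items())

grouped = group_by_key(pairs)
print(grouped)
```

Every value is kept, so in Spark all of them have to travel through the shuffle before they can be grouped.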

In [None]:
#rdd2.groupByKey().mapValues(list).glom().collect()
rdd2.groupByKey(4).mapValues(list).glom().collect()

In [None]:
rddPairs.groupByKey().mapValues(list).collect()

**reduceByKey**

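- Reduces all the values of each unique key to a single value by iteratively applying the reduce function. The function should be associative and commutative, because values are first combined within each partition and the partial results are then merged across partitions.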
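
The two-stage reduction can be sketched in plain Python. This is a model under stated assumptions, not Spark's implementation: the nested list `partitions` is made-up sample data standing in for an RDD with two partitions, and `combine` and `reduce_by_key` are hypothetical helpers.

```python
# Made-up sample: two partitions of (word, 1) pairs.
partitions = [
    [("a", 1), ("b", 1), ("a", 1)],
    [("b", 1), ("a", 1)],
]

def combine(pairs, func):
    # Fold the values of each key together with func.
    acc = {}
    for key, value in pairs:
        acc[key] = func(acc[key], value) if key in acc else value
    return acc

def reduce_by_key(partitions, func):
    # Stage 1: combine inside each partition, before anything would be shuffled.
    partials = [combine(part, func) for part in partitions]
    # Stage 2: merge the per-partition results for each key.
    merged = combine([kv for part in partials for kv in part.items()], func)
    # Sort by key only to make the output deterministic.
    return sorted(merged.items())

word_counts = reduce_by_key(partitions, lambda x, y: x + y)
print(word_counts)
```

Only one partial result per key and partition reaches the second stage, which is the usual reason to prefer `reduceByKey` over `groupByKey` followed by a reduction.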

In [None]:
rdd2.reduceByKey(lambda x, y: x + y, 2).glom().collect()

In [None]:
rddPairs.reduceByKey(lambda x, y: x + y).collect()

**repartition**
- Increases or decreases the number of partitions
- Always performs a full shuffle of the data
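
What a global shuffle means for partition sizes can be sketched locally. In this model every element may move, and elements are dealt round-robin into the new partitions. That dealing rule is an assumption made for illustration; Spark's real assignment differs in detail and the resulting sizes need not be perfectly even. The helper `repartition_model` and the list `old` are made up.

```python
def repartition_model(partitions, n):
    # Flatten first: after a full shuffle an element can end up anywhere.
    elements = [x for part in partitions for x in part]
    new_parts = [[] for _ in range(n)]
    for i, x in enumerate(elements):
        # Round-robin dealing (an illustrative assumption, see above).
        new_parts[i % n].append(x)
    return new_parts

# Made-up sample: three unevenly filled partitions, 12 elements in total.
old = [[1, 2, 3, 4], [5, 6], [7, 8, 9, 10, 11, 12]]

sizes_up = [len(p) for p in repartition_model(old, 6)]
sizes_down = [len(p) for p in repartition_model(old, 2)]
print(sizes_up)
print(sizes_down)
```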

In [None]:
def partitions(rdd):
    # Print the number of elements in each partition of the RDD.
    print(rdd.glom().map(len).collect())

In [None]:
partitions(rddWords)

In [None]:
rddWords6 = rddWords.repartition(6)
partitions(rddWords6)

In [None]:
rddWords3 = rddWords6.repartition(3)
partitions(rddWords3)

In [None]:
rddWords4 = rddWords6.repartition(4)
partitions(rddWords4)

**coalesce**
- By default (`shuffle=False`) it can only decrease the number of partitions
- Merges existing partitions instead of performing a full shuffle
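
Partition merging can be sketched in plain Python. In this model whole partitions are appended to a target partition and are never split, and a request for more partitions than exist leaves the layout unchanged. Which partitions get merged together is an assumption made here for illustration; Spark chooses the groups itself. The helper `coalesce_model` and the list `old` are made up.

```python
def coalesce_model(partitions, n):
    # Without a shuffle the partition count cannot grow.
    if n >= len(partitions):
        return [list(part) for part in partitions]
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        # Neighbouring partitions share a target; a partition is never split.
        merged[i * n // len(partitions)].extend(part)
    return merged

# Made-up sample: six partitions holding ten elements.
old = [[1, 2], [3], [4, 5, 6], [7], [8, 9], [10]]

sizes_3 = [len(p) for p in coalesce_model(old, 3)]
sizes_4 = [len(p) for p in coalesce_model(old, 4)]
sizes_8 = [len(p) for p in coalesce_model(old, 8)]
print(sizes_3)
print(sizes_4)
print(sizes_8)
```

Because partitions are only merged, the resulting sizes can stay uneven, unlike after a full shuffle.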


In [None]:
partitions(rddWords6)

In [None]:
rddWords3 = rddWords6.coalesce(3)
partitions(rddWords3)

In [None]:
rddWords4 = rddWords6.coalesce(4)
partitions(rddWords4)