# Working with Pair RDDs

From your home directory, clone the Spark in Action GitHub repository with the following command:
   
git clone https://github.com/spark-in-action/first-edition

You might also want to keep the [official Spark documentation](https://spark.apache.org/docs/2.4.0/api/python/pyspark.html#pyspark.SparkContext) open.

First create a Spark context:

In [1]:
import findspark
findspark.init()

import pyspark
sc = pyspark.SparkContext(master='local[*]', appName="Spark course Pair RDDs")

The file `first-edition/ch04/ch04_data_transactions.txt` contains data about the transactions your customers made. These transactions have not been fully processed yet, so you can still influence the final amounts and prices by changing the loaded values. Each line of the file contains these hash-delimited (`#`) fields:
* transaction date
* transaction time
* customer ID
* product ID
* quantity
* product price   
   
Load the file into the `transactions` RDD so that each element of the RDD is a list of strings (one line of the file split on the "#" sign).

In [2]:
transactions = sc.textFile("/home/spark/first-edition/ch04/ch04_data_transactions.txt").map(lambda x: x.split("#"))

Do not execute the following cell! 

PySpark in Jupyter notebooks has trouble serializing classes declared inside the same notebook, so to use the class defined below we need to resort to a trick.

Copy the `Transaction` class declaration into the file `/home/spark/first-edition/transaction.py` (using vi, for example).

In [24]:
# DO NOT EXECUTE
class Transaction:
    def __init__(self, date, time, customer, product, quantity, price):
        self.date = date
        self.time = time
        self.customer = customer
        self.product = product
        self.quantity = quantity
        self.price = price
        
    def __repr__(self):
        return "[%s, %s, %d, %d, %.3f, %.2f]" % (self.date, self.time, self.customer, self.product, self.quantity, self.price)

Now execute the following cell in order to add that file to this PySpark runtime instance and import the class.

In [3]:
sc.addPyFile('/home/spark/first-edition/transaction.py')
from transaction import Transaction

Now you can use the declared class. Map each transaction from the `transactions` RDD into a `Transaction` object. Parse `customer` and `product` fields as `int`s and `quantity` and `price` as `float`s. 

Then create a `transByCust` Pair RDD whose keys are customer IDs and values are the original `Transaction` objects.

In [4]:
transByCust = transactions.map(lambda tran: Transaction(tran[0], tran[1], int(tran[2]), int(tran[3]), float(tran[4]), float(tran[5]))).\
    map(lambda t: (t.customer, t))
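
The parsing step can be tried on a single line without Spark. The sketch below is only an illustration: the sample line is reconstructed from the first row of the output shown further down (its exact formatting in the file is an assumption), and `Txn` is a hypothetical stand-in for the `Transaction` class.

```python
from collections import namedtuple

# Hypothetical stand-in for the Transaction class from transaction.py.
Txn = namedtuple("Txn", "date time customer product quantity price")

def parse_line(line):
    """Split one hash-delimited line and convert the numeric fields."""
    f = line.split("#")
    return Txn(f[0], f[1], int(f[2]), int(f[3]), float(f[4]), float(f[5]))

# Sample line; its exact formatting in the file is an assumption.
sample = "2015-03-30#6:55 AM#51#68#1#9506.21"
t = parse_line(sample)
pair = (t.customer, t)  # key = customer ID, value = the whole transaction
print(pair)
```

The resulting pair has the same shape as the elements of `transByCust`: a key followed by the full record.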

Examine the first couple of elements of `transByCust` to see if everything went smoothly.

In [5]:
transByCust.take(10)

[(51, [2015-03-30, 6:55 AM, 51, 68, 1.000, 9506.21]),
 (99, [2015-03-30, 7:39 PM, 99, 86, 5.000, 4107.59]),
 (79, [2015-03-30, 11:57 AM, 79, 58, 7.000, 2987.22]),
 (51, [2015-03-30, 12:46 AM, 51, 50, 6.000, 7501.89]),
 (86, [2015-03-30, 11:39 AM, 86, 24, 5.000, 8370.20]),
 (63, [2015-03-30, 10:35 AM, 63, 19, 5.000, 1023.57]),
 (23, [2015-03-30, 2:30 AM, 23, 77, 7.000, 5892.41]),
 (49, [2015-03-30, 7:41 PM, 49, 58, 4.000, 9298.18]),
 (97, [2015-03-30, 9:18 AM, 97, 86, 8.000, 9462.89]),
 (94, [2015-03-30, 10:06 PM, 94, 26, 4.000, 4199.15])]

Now we can start analyzing the data.

First, find out how many customers have made a purchase. (Hint: this is equal to the number of distinct keys.)

In [6]:
transByCust.keys().distinct().count()

100

Which customer (ID) made the largest number of purchases? (Hints: Use Python's `sorted` function with the `key` option. A Python `dict` can be converted into a list of tuples using its `items()` method.)

In [15]:
sorted(transByCust.countByKey().items(), key=lambda kc: kc[1])[-1]

(53, 19)
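
Because `countByKey` returns a dictionary-like object to the driver, picking the busiest customer is ordinary Python from that point on. As a minimal sketch with made-up customer IDs (not taken from the data file), `collections.Counter` plays the role of `countByKey`, and `max` with a `key` function is used as an alternative to sorting the whole list:

```python
from collections import Counter

# Made-up customer IDs standing in for the keys of transByCust.
keys = [51, 99, 51, 53, 53, 53, 99]

counts = Counter(keys)  # same shape as the countByKey() result: {key: count}
top = max(counts.items(), key=lambda kc: kc[1])  # pair with the highest count
print(top)
```

Both approaches give the same answer when the highest count is unique; `max` simply avoids ordering elements you do not need.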

Which transactions did that customer make?

In [17]:
transByCust.lookup(53)

[[2015-03-30, 6:18 AM, 53, 42, 5.000, 2197.85],
 [2015-03-30, 4:42 AM, 53, 44, 6.000, 9182.08],
 [2015-03-30, 2:51 AM, 53, 59, 5.000, 3154.43],
 [2015-03-30, 5:57 PM, 53, 31, 5.000, 6649.27],
 [2015-03-30, 6:11 AM, 53, 33, 10.000, 2353.72],
 [2015-03-30, 9:46 PM, 53, 93, 1.000, 2889.03],
 [2015-03-30, 4:15 PM, 53, 72, 7.000, 9157.55],
 [2015-03-30, 2:42 PM, 53, 94, 1.000, 921.65],
 [2015-03-30, 8:30 AM, 53, 38, 5.000, 4000.92],
 [2015-03-30, 6:06 AM, 53, 12, 6.000, 2174.02],
 [2015-03-30, 3:44 AM, 53, 47, 1.000, 7556.32],
 [2015-03-30, 10:25 AM, 53, 30, 2.000, 5107.00],
 [2015-03-30, 1:48 AM, 53, 58, 4.000, 718.93],
 [2015-03-30, 9:31 AM, 53, 18, 4.000, 8214.79],
 [2015-03-30, 9:04 AM, 53, 68, 4.000, 9246.59],
 [2015-03-30, 1:51 AM, 53, 40, 1.000, 4095.50],
 [2015-03-30, 1:53 PM, 53, 85, 9.000, 1630.24],
 [2015-03-30, 6:51 PM, 53, 100, 1.000, 1694.52],
 [2015-03-30, 7:39 PM, 53, 100, 8.000, 7885.35]]

Lower the price by 5% for each purchase of two or more products with ID 25. (Hint: Create a new RDD by executing `mapValues` on `transByCust`.)

In [18]:
def lowerPriceFor25(tran):
    if tran.product == 25 and tran.quantity > 1:
        tran.price *= 0.95
    return tran

transByCust2 = transByCust.mapValues(lowerPriceFor25)
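
To see what `mapValues` does to each pair, here is a plain-Python sketch that needs no Spark. `map_values` is a hypothetical helper written for this illustration, `SimpleNamespace` stands in for `Transaction`, and the sample records are assumptions (the first one borrows its price from the output shown below). The function works on a copy so that the input list stays unchanged in this local setting.

```python
import copy
from types import SimpleNamespace

def map_values(pairs, f):
    """Plain-Python analogue of mapValues: keys stay, f is applied to values."""
    return [(k, f(v)) for k, v in pairs]

def lower_price_for_25(tran):
    # Work on a copy so the input object is left untouched in this sketch.
    tran = copy.copy(tran)
    if tran.product == 25 and tran.quantity > 1:
        tran.price *= 0.95
    return tran

# Sample records; SimpleNamespace is a stand-in for Transaction.
pairs = [
    (17, SimpleNamespace(product=25, quantity=6.0, price=7193.11)),
    (17, SimpleNamespace(product=25, quantity=1.0, price=100.0)),
    (51, SimpleNamespace(product=68, quantity=1.0, price=9506.21)),
]
discounted = map_values(pairs, lower_price_for_25)
for (k, old), (_, new) in zip(pairs, discounted):
    print(k, old.price, round(new.price, 2))
```

Only the first record qualifies for the discount: the second has a quantity of one and the third is a different product.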

Find the matching transactions (in the old and the new RDD) and check if the change has been made. (Hint: Use `filter` on both RDDs and print out the results for visual inspection.)

In [22]:
for t in transByCust.filter(lambda tran: tran[1].product == 25 and tran[1].quantity > 1).map(lambda x: x[1]).collect():
    print(t)
print("TRANS NEW:")
for t in transByCust2.filter(lambda tran: tran[1].product == 25 and tran[1].quantity > 1).map(lambda x: x[1]).collect():
    print(t)

[2015-03-30, 6:26 PM, 17, 25, 6.000, 7193.11]
[2015-03-30, 7:27 AM, 93, 25, 7.000, 2749.15]
[2015-03-30, 1:07 AM, 68, 25, 9.000, 8391.61]
[2015-03-30, 1:23 AM, 59, 25, 5.000, 5296.69]
[2015-03-30, 9:45 AM, 42, 25, 10.000, 1363.97]
[2015-03-30, 10:40 PM, 77, 25, 3.000, 3345.81]
[2015-03-30, 12:53 PM, 22, 25, 9.000, 6996.42]
[2015-03-30, 12:44 AM, 32, 25, 8.000, 8849.50]
[2015-03-30, 5:30 PM, 75, 25, 10.000, 3557.01]
TRANS NEW:
[2015-03-30, 6:26 PM, 17, 25, 6.000, 6833.45]
[2015-03-30, 7:27 AM, 93, 25, 7.000, 2611.69]
[2015-03-30, 1:07 AM, 68, 25, 9.000, 7972.03]
[2015-03-30, 1:23 AM, 59, 25, 5.000, 5031.86]
[2015-03-30, 9:45 AM, 42, 25, 10.000, 1295.77]
[2015-03-30, 10:40 PM, 77, 25, 3.000, 3178.52]
[2015-03-30, 12:53 PM, 22, 25, 9.000, 6646.60]
[2015-03-30, 12:44 AM, 32, 25, 8.000, 8407.02]
[2015-03-30, 5:30 PM, 75, 25, 10.000, 3379.16]


Give a complimentary item with ID 70 to every customer who bought five or more products with ID 81 in a single transaction. Create a new transaction for this (using Python's `copy.copy()` function) that keeps the original date and time but has a quantity of one and a price of zero.

(Hint: use `flatMapValues`)

In [31]:
def compl70(tran):
    import copy
    if tran.product == 81 and tran.quantity >= 5:
        newtran = copy.copy(tran)
        newtran.price = 0.00
        newtran.product = 70
        newtran.quantity = 1
        return [tran, newtran]
    else:
        return [tran]

transByCust3 = transByCust2.flatMapValues(compl70)
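
The difference from `mapValues` is that the function returns a list, and every element of that list becomes its own pair under the original key. A minimal Spark-free sketch of that behaviour follows; `flat_map_values` is a hypothetical helper, and transactions are simplified to `(product, quantity, price)` tuples with partly made-up values.

```python
def flat_map_values(pairs, f):
    """Plain-Python analogue of flatMapValues: f returns a list per value,
    and every element of that list is paired with the original key."""
    return [(k, out) for k, v in pairs for out in f(v)]

def add_free_item(tran):
    # tran is a simplified (product, quantity, price) tuple in this sketch.
    product, quantity, price = tran
    if product == 81 and quantity >= 5:
        return [tran, (70, 1.0, 0.0)]  # original plus the complimentary item
    return [tran]

pairs = [(77, (81, 5.0, 7798.28)), (77, (49, 6.0, 102.77)), (10, (81, 2.0, 50.0))]
result = flat_map_values(pairs, add_free_item)
print(result)
```

Three input pairs produce four output pairs: only the first one meets the condition, and its complimentary item is emitted right after it with the same key.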

Check whether the change has been made for one of the matching customers (customer 77 in the cells below).

In [33]:
transByCust2.filter(lambda tran: tran[1].product == 81 and tran[1].quantity >= 5).map(lambda t: t[1].customer).collect()

[85, 77, 82, 47, 16, 10]

In [36]:
for t in transByCust2.filter(lambda t: t[1].customer == 77).map(lambda t: t[1]).collect():
    print(t)
print("NEW")
for t in transByCust3.filter(lambda t: t[1].customer == 77).map(lambda t: t[1]).collect():
    print(t)

[2015-03-30, 10:49 AM, 77, 33, 8.000, 5440.16]
[2015-03-30, 5:03 PM, 77, 18, 9.000, 5578.06]
[2015-03-30, 11:08 PM, 77, 1, 3.000, 5188.87]
[2015-03-30, 9:13 PM, 77, 29, 10.000, 7363.10]
[2015-03-30, 4:00 PM, 77, 87, 8.000, 1343.98]
[2015-03-30, 9:32 AM, 77, 4, 8.000, 3732.02]
[2015-03-30, 9:34 PM, 77, 81, 5.000, 7798.28]
[2015-03-30, 1:30 PM, 77, 49, 6.000, 102.77]
[2015-03-30, 7:51 PM, 77, 58, 7.000, 7566.47]
[2015-03-30, 10:40 PM, 77, 25, 3.000, 3178.52]
[2015-03-30, 7:08 PM, 77, 61, 2.000, 4382.38]
NEW
[2015-03-30, 10:49 AM, 77, 33, 8.000, 5440.16]
[2015-03-30, 5:03 PM, 77, 18, 9.000, 5578.06]
[2015-03-30, 11:08 PM, 77, 1, 3.000, 5188.87]
[2015-03-30, 9:13 PM, 77, 29, 10.000, 7363.10]
[2015-03-30, 4:00 PM, 77, 87, 8.000, 1343.98]
[2015-03-30, 9:32 AM, 77, 4, 8.000, 3732.02]
[2015-03-30, 9:34 PM, 77, 81, 5.000, 7798.28]
[2015-03-30, 9:34 PM, 77, 70, 1.000, 0.00]
[2015-03-30, 1:30 PM, 77, 49, 6.000, 102.77]
[2015-03-30, 7:51 PM, 77, 58, 7.000, 7566.47]
[2015-03-30, 10:40 PM, 77, 25, 3.000, 3178.52]
[2015-03-30, 7:08 PM, 77, 61, 2.000, 4382.38]

Which customer has spent the most? (Hint: use `reduceByKey` to find the total amount, which is quantity times price, collect the results and then sort them using `sorted`)

In [37]:
sumAmountsByCust = transByCust2.mapValues(lambda t: t.quantity * t.price).reduceByKey(lambda v1, v2: v1 + v2)

In [38]:
sorted(sumAmountsByCust.collect(), key=lambda x: x[1])[-1]

(56, 597122.3099999999)
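
`reduceByKey` merges all values that share a key using the function you pass in, which Spark expects to be associative and commutative. The following plain-Python sketch mimics that on made-up amounts; `reduce_by_key` is a hypothetical helper, not part of PySpark.

```python
def reduce_by_key(pairs, f):
    """Plain-Python analogue of reduceByKey: merge all values of a key with f."""
    totals = {}
    for k, v in pairs:
        totals[k] = f(totals[k], v) if k in totals else v
    return list(totals.items())

# Made-up (customer, quantity * price) pairs.
amounts = [(56, 100.0), (53, 40.0), (56, 25.5), (53, 10.0), (17, 90.0)]
sums = reduce_by_key(amounts, lambda a, b: a + b)
best = max(sums, key=lambda kv: kv[1])  # pair with the largest total
print(best)
```

Here `max` replaces the sort-and-take-last step; either works once the totals have been collected.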

## Joining, sorting and grouping

Let's say you now need to create a report with product names and IDs and the total amounts sold. The product IDs and the amounts will come from the RDDs you created previously, but you still need product names.

First, load the `/home/spark/first-edition/ch04/ch04_data_products.txt` file. For each product available in storage, it contains the following fields, separated with hash signs (`#`):
- product ID
- product name
- price
- number of items

Create a Pair RDD whose keys are product IDs (as `int`s), and values are product names.

In [41]:
products = sc.textFile("/home/spark/first-edition/ch04/ch04_data_products.txt").\
  map(lambda line: line.split("#")).\
  map(lambda p: (int(p[0]), p[1]))

Now use this RDD and the RDD with transactions to obtain an RDD that contains tuples of product IDs, product names, and total amounts sold. (Hint: You first need to create adequate Pair RDDs so that they can be joined by product ID. Don't forget to sum up all the amounts for each product using `reduceByKey`.)

In [50]:
totalsByProd = transByCust3.map(lambda tran: (tran[1].product, tran[1].quantity * tran[1].price)).\
    reduceByKey(lambda a1, a2: a1 + a2)

In [51]:
totalsByProdNames = products.join(totalsByProd).map(lambda tbp: (tbp[0], tbp[1][0], tbp[1][1]))
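
The nested indexing in the last `map` follows from the shape `join` produces: `(key, (left_value, right_value))` for every key present in both RDDs. A minimal sketch of that shape in plain Python is shown below; `inner_join` is a hypothetical helper, the product names are borrowed from the sample output further down, and the totals are made up.

```python
def inner_join(left, right):
    """Plain-Python analogue of join on pair lists: yields (k, (lv, rv))
    for every key present on both sides."""
    right_by_key = {}
    for k, v in right:
        right_by_key.setdefault(k, []).append(v)
    return [(k, (lv, rv)) for k, lv in left for rv in right_by_key.get(k, [])]

# Names borrowed from the sample output; totals are made up.
sample_products = [(4, "Bear doll"), (8, "LEGO Star Wars"), (16, "LEGO Classic")]
sample_totals = [(4, 300.0), (8, 120.5)]

joined = inner_join(sample_products, sample_totals)
# Flatten (id, (name, total)) into (id, name, total), as the map above does.
flat = [(k, name, total) for k, (name, total) in joined]
print(flat)
```

Product 16 has no total in this sample, so it drops out of the result, which is what an inner join does.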

See if the results are as expected.

In [53]:
totalsByProdNames.take(10)

[(4, 'Bear doll', 352329.38),
 (8, 'LEGO Star Wars', 205839.83),
 (12, 'LEGO Hero Factory', 257001.33000000002),
 (16, 'LEGO Classic', 362672.58999999997),
 (24, 'LEGO Pirates', 263908.14999999997),
 (28, 'Far Cry 4 Limited Edition for Xbox One', 409954.8599999999),
 (32, 'Intel Core i7 3770K', 181640.36999999997),
 (36, 'GAM X360 Hitman Absolution X360', 81603.95),
 (40, 'Might and Magic: Clash Of Heroes PC', 156079.9),
 (44, 'SAMSUNG LED TV 39F5500, Full HD, USB', 750864.47)]

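Now save them under the name `products_sold.report`. Note that `saveAsTextFile` creates a directory with that name containing one part file per partition, not a single file.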

In [54]:
totalsByProdNames.saveAsTextFile('products_sold.report')

## Aggregating by key

As the last task, use `aggregateByKey` to find the average, minimum and maximum price of the products bought by each customer (you can use the `transByCust` RDD for this). (Hint: your "zero value" will need to contain four elements. In the solution below they are count, minimum, maximum and sum, in that order.)

In [58]:
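# Accumulator layout: [count, minimum, maximum, sum]
def transFunc(counters, t):
    # Fold one transaction into the accumulator of its partition.
    p = t.price
    return [counters[0]+1, min(counters[1], p), max(counters[2], p), counters[3] + p]

def mergeFunc(cntr1, cntr2):
    # Merge the accumulators of two partitions.
    return [cntr1[0] + cntr2[0], min(cntr1[1], cntr2[1]), max(cntr1[2], cntr2[2]), cntr1[3] + cntr2[3]]

# Infinite sentinels make the zero value neutral for min and max.
avgMinMaxPerCust = transByCust.aggregateByKey([0, float("inf"), float("-inf"), 0.0], transFunc, mergeFunc).\
    mapValues(lambda m: (m[3]/m[0], m[1], m[2]))
avgMinMaxPerCust.take(10)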

[(20, (4124.725, 416.72, 9463.01)),
 (56, (5053.349411764706, 102.47, 9826.83)),
 (8, (3860.381, 58.62, 8616.57)),
 (84, (5891.179999999999, 1312.06, 9874.56)),
 (40, (5221.209, 1944.72, 9307.77)),
 (80, (4542.088571428571, 1167.38, 8460.09)),
 (52, (6483.113333333334, 2786.88, 9995.81)),
 (4, (3800.122727272727, 568.64, 9325.95)),
 (100, (5010.055, 1257.63, 9210.55)),
 (12, (5074.478571428571, 140.46, 9341.65))]
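
`aggregateByKey` takes two functions because values are first folded into an accumulator within each partition, and the per-partition accumulators are then merged. The sketch below imitates that two-stage process in plain Python on made-up prices. `aggregate_by_key` is a hypothetical helper, and the explicit list of partitions is an assumption made purely for illustration.

```python
def aggregate_by_key(partitions, zero, seq_op, comb_op):
    """Plain-Python analogue of aggregateByKey over a list of partitions."""
    partials = []
    for part in partitions:
        acc = {}
        for k, v in part:
            acc[k] = seq_op(acc.get(k, zero), v)  # fold values within a partition
        partials.append(acc)
    merged = {}
    for acc in partials:
        for k, a in acc.items():
            merged[k] = comb_op(merged[k], a) if k in merged else a  # merge partitions
    return merged

# Accumulator layout: (count, minimum, maximum, sum).
zero = (0, float("inf"), float("-inf"), 0.0)

def seq_op(c, price):
    return (c[0] + 1, min(c[1], price), max(c[2], price), c[3] + price)

def comb_op(a, b):
    return (a[0] + b[0], min(a[1], b[1]), max(a[2], b[2]), a[3] + b[3])

# Made-up (customer, price) pairs spread over two partitions.
partitions = [[(20, 10.0), (56, 5.0), (20, 30.0)], [(20, 20.0), (56, 15.0)]]
merged = aggregate_by_key(partitions, zero, seq_op, comb_op)
stats = {k: (m[3] / m[0], m[1], m[2]) for k, m in merged.items()}
print(stats)
```

Carrying count and sum separately is what makes the average mergeable: averages of partitions cannot be combined directly, but counts and sums can.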