# Working with Pair RDDs

[Full Slides](https://pages.github.umn.edu/deliu/bigdata19/13-PairRDD/Spark4-PairRDDs.pdf)

Why?  
- Better suited to joins, sorting, counting, and other aggregations
- Fits the MapReduce model: map each record to zero or more key-value pairs, then reduce by aggregating the values for each key

`Pair RDDs are how Spark implements MapReduce`

<img src=https://i.imgur.com/RfclYM0.png width="400" height="340" align="left">


---

## Create pairs

[Create Pairs Slides](file:///C:/Users/Sam/Desktop/Big%20Data/Hive/create%20pairs.pdf)  
https://www.youtube.com/watch?v=eOKeIUE5S1w


**Pair RDD - key value pairs**  
What should the keys/values be?  

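- `map` and `flatMap` work the same as before; return a `(key, value)` tuple from the function to get a pair RDD  
`users = sc.textFile(file) \
.map(lambda line: line.split('\t')) \
.map(lambda fields: (fields[0],fields[1]))`


- `flatMapValues` keeps the keys and maps each value to zero or more new values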


- `keyBy` keeps each record as the value and adds a key computed from it  
`sc.textFile(logfile) \
.keyBy(lambda line: line.split(' ')[2])`


- complex values (here the value is itself a tuple)  
`sc.textFile(file) \
.map(lambda line: line.split('\t')) \
.map(lambda fields: (fields[0],(fields[1],fields[2])))`


- map a single value to multiple pairs  
`sc.textFile(file) \
.map(lambda line: line.split('\t')) \
.map(lambda fields: (fields[0],fields[1])) \
.flatMapValues(lambda skus: skus.split(':'))`
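
To make the shapes of the results concrete, here is a minimal plain-Python model of `keyBy` and `flatMapValues` that runs without Spark. The helper names `key_by` and `flat_map_values` and the sample records are made up for illustration; they only imitate what the RDD methods return on a small in-memory list.

```python
# Plain-Python model of two pair-creation operations (no Spark required).
# Helper names and sample data are hypothetical, for illustration only.

def key_by(records, key_func):
    """Model of rdd.keyBy(f): pair each record with the key f(record)."""
    return [(key_func(record), record) for record in records]

def flat_map_values(pairs, func):
    """Model of rdd.flatMapValues(f): emit one (key, x) pair per x in f(value)."""
    return [(key, x) for key, value in pairs for x in func(value)]

# keyBy: the whole line stays as the value, the third field becomes the key
log_lines = ["2024-01-01 10:00 alice GET", "2024-01-01 10:05 bob POST"]
by_user = key_by(log_lines, lambda line: line.split(' ')[2])

# flatMapValues: one order with several SKUs becomes several pairs
orders = [("o1", "sku1:sku2"), ("o2", "sku3")]
order_skus = flat_map_values(orders, lambda skus: skus.split(':'))

print(by_user)
print(order_skus)
```

Note that `flatMapValues` never touches the key, so an order with two SKUs simply yields two pairs that share the same key.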


---

## Pair RDD operations

[Examples in Slides](file:///C:/Users/Sam/Desktop/Big%20Data/Hive/pair%20ops.pdf)  
https://www.youtube.com/watch?v=6CzJfrz3yJk

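- `reduceByKey` (transformation) merges the values for each key with the given function

- `countByKey` (action) counts the elements for each key and returns the counts to the driver

- `groupByKey` (transformation) groups all values for a key into a single sequence

- `sortByKey` (transformation) sorts the pairs by key

- `join` (transformation) returns an RDD containing all pairs with matching keys, as `(key, (leftValue, rightValue))`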
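
The operations above can be sketched in plain Python to show what each one computes. This is only a model on in-memory lists, assuming made-up helper names and sample data; real RDDs are distributed, and apart from `sortByKey` Spark makes no promise about the order of the output.

```python
from collections import defaultdict

# Plain-Python models of the pair operations (hypothetical helper names).

def reduce_by_key(pairs, func):
    """Model of reduceByKey: fold all values of a key together with func."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return list(merged.items())

def count_by_key(pairs):
    """Model of countByKey: number of elements per key, as a dict."""
    counts = defaultdict(int)
    for key, _ in pairs:
        counts[key] += 1
    return dict(counts)

def group_by_key(pairs):
    """Model of groupByKey: collect every value of a key into one list."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return list(groups.items())

def sort_by_key(pairs):
    """Model of sortByKey: order pairs by key (ascending)."""
    return sorted(pairs, key=lambda kv: kv[0])

def join(left, right):
    """Model of join: one output pair per matching (left, right) combination."""
    right_groups = defaultdict(list)
    for key, value in right:
        right_groups[key].append(value)
    return [(k, (v, w)) for k, v in left for w in right_groups.get(k, [])]

sales = [("b", 2), ("a", 1), ("b", 3)]
names = [("a", "apple"), ("c", "cherry")]

totals = reduce_by_key(sales, lambda v1, v2: v1 + v2)
counts = count_by_key(sales)
grouped = group_by_key(sales)
ordered = sort_by_key(sales)
joined = join(sales, names)
```

Keys that appear on only one side (`"b"` and `"c"` here) drop out of the `join` result, which is what distinguishes it from the outer joins.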

---

**Word counts example**  
`counts = sc.textFile(file) \
.flatMap(lambda line: line.split()) \` to get the list of words  
`.map(lambda word: (word,1)) \` to initialize the count  
`.reduceByKey(lambda v1,v2: v1+v2)` to add up the counts per word
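
The three steps of the word count can be traced on a tiny input without Spark. The sketch below is a plain-Python imitation with made-up sample lines, intended only to show the intermediate data at each stage.

```python
# Plain-Python trace of the word count pipeline (sample lines are made up).
lines = ["to be or", "not to be"]

# flatMap(lambda line: line.split()): every line becomes several words
words = [word for line in lines for word in line.split()]

# map(lambda word: (word, 1)): start each word with a count of 1
pairs = [(word, 1) for word in words]

# reduceByKey(lambda v1, v2: v1 + v2): add up the 1s for each word
counts = {}
for word, n in pairs:
    counts[word] = counts.get(word, 0) + n

print(words)
print(pairs)
print(counts)
```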


---

## Other pair operations

[Slides](file:///C:/Users/Sam/Desktop/Big%20Data/Hive/Arc.pdf)  
https://www.youtube.com/watch?v=jArlIctlb7w

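- `keys` - return an RDD of just the keys
- `values` - return an RDD of just the values
- `lookup(key)` - return the list of values for the given key
- `leftOuterJoin` - join that keeps every key from the left RDD; keys with no match on the right get `None` as the right value
- `mapValues, flatMapValues` - leave the keys unchanged, operate only on the values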
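
As a rough illustration of the list above, the following plain-Python helpers imitate these operations on an in-memory list of pairs. The function names and sample data are hypothetical, not part of the Spark API.

```python
from collections import defaultdict

# Plain-Python models of the remaining pair operations (hypothetical names).

def keys(pairs):
    """Model of rdd.keys(): keep only the keys."""
    return [key for key, _ in pairs]

def values(pairs):
    """Model of rdd.values(): keep only the values."""
    return [value for _, value in pairs]

def lookup(pairs, wanted):
    """Model of rdd.lookup(key): list of all values stored under that key."""
    return [value for key, value in pairs if key == wanted]

def map_values(pairs, func):
    """Model of rdd.mapValues(f): apply f to each value, keys unchanged."""
    return [(key, func(value)) for key, value in pairs]

def left_outer_join(left, right):
    """Model of leftOuterJoin: keep every left pair, None when right has no match."""
    right_groups = defaultdict(list)
    for key, value in right:
        right_groups[key].append(value)
    result = []
    for key, value in left:
        matches = right_groups.get(key)
        if matches:
            result.extend((key, (value, w)) for w in matches)
        else:
            result.append((key, (value, None)))
    return result

users = [("u1", "alice"), ("u2", "bob"), ("u1", "al")]
cities = [("u1", "Paris")]

user_ids = keys(users)
user_names = values(users)
u1_names = lookup(users, "u1")
upper = map_values(users, str.upper)
with_city = left_outer_join(users, cities)
```

`lookup` returns a list because a pair RDD may hold several pairs with the same key, as `"u1"` does here.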

