# PairRDD 
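- Represents data as key/value pairs.
- Key: a simple object (an integer, a string, etc.) or a compound object (such as a tuple). Keys must be hashable for shuffle-based operations such as reduceByKey().
- Value: a scalar, list, tuple, dictionary, etc.
- Main PairRDD transformations and actions:
    - keys(), values(), keyBy(), mapValues(), flatMapValues(), groupByKey(), reduceByKey(), foldByKey(), sortByKey()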

In [1]:
from pyspark.sql import SparkSession
spark_session = SparkSession.builder.master("local").appName("pairrdd-op-test1").getOrCreate()

In [2]:
shakespeare = spark_session.sparkContext.textFile("./data/shakespeare.txt")
shakespeare.take(5)

["A MIDSUMMER-NIGHT'S DREAM",
 '',
 'Now , fair Hippolyta , our nuptial hour ',
 'Draws on apace : four happy days bring in ',
 'Another moon ; but O ! methinks how slow ']

### keys()
- Returns the keys of the key/value pairs.

In [3]:
kvrdd = spark_session.sparkContext.parallelize([('city', 'chungju'), ('state', 'chungbuk')])
kvrdd.keys().collect()

['city', 'state']

### values()
- Returns the values of the key/value pairs.

In [4]:
kvrdd.values().collect()

['chungju', 'chungbuk']

### keyBy(function)
- Applies the function to each element to compute its key, producing (key, element) pairs.

In [5]:
kvrdd = spark_session.sparkContext.parallelize([('city', 'chungju', 1), ('state', 'chungbuk', 2)])
kvrdd1 = kvrdd.keyBy(lambda x: x[2])
kvrdd1.collect()

[(1, ('city', 'chungju', 1)), (2, ('state', 'chungbuk', 2))]

### mapValues(function), flatMapValues(function)
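- mapValues() applies the function to the value only and leaves the key unchanged.
- flatMapValues() applies a function that returns an iterable and emits one (key, element) pair per element.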

In [6]:
loc_rdd = spark_session.sparkContext.parallelize(['Hayward, 71|69|71|71|72',
                                               'Baumholder, 46|42|40|37|39',
                                               'Alexandria, 50|48|51|53|44'])
kv_loc_rdd = loc_rdd.map(lambda x: x.split(','))
kv_loc_rdd.take(2)

[['Hayward', ' 71|69|71|71|72'], ['Baumholder', ' 46|42|40|37|39']]

In [7]:
temp_rdd1 = kv_loc_rdd.mapValues(lambda x: x.split('|'))
temp_rdd1.take(2)

[('Hayward', [' 71', '69', '71', '71', '72']),
 ('Baumholder', [' 46', '42', '40', '37', '39'])]

In [8]:
temp_rdd2 = temp_rdd1.mapValues(lambda x: [int(s) for s in x])
temp_rdd2.take(2)

[('Hayward', [71, 69, 71, 71, 72]), ('Baumholder', [46, 42, 40, 37, 39])]

In [9]:
temp_rdd3 = kv_loc_rdd.flatMapValues(lambda x: x.split('|'))
temp_rdd3.take(2)

[('Hayward', ' 71'), ('Hayward', '69')]

temp_rdd4 = temp_rdd3.map(lambda x: (x[0], int(x[1])))
temp_rdd4.take(2)

### groupByKey(numPartitions, partitionFunc)
- Groups the values of a key/value RDD by key.

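### reduceByKey(func, numPartitions, partitionFunc), foldByKey(zeroValue, func, numPartitions, partitionFunc)
- Merges the values of each key with the given function; foldByKey() additionally takes an initial zero value.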

In [18]:
data = spark_session.sparkContext.parallelize( [('panda', 0), ('pink', 3), ('pirate', 3), ('panda', 1), ('pink', 4)] )
kvRdd = data.mapValues(lambda x: (x,1))
print(kvRdd.collect())
sumCount = kvRdd.reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1]))
# Equivalent one-liner:
# sumCount = data.mapValues(lambda x: (x, 1)).reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1]))
print(sumCount.collect())
averageByKey = sumCount.map(lambda x: (x[0], x[1][0] / x[1][1]))
print(averageByKey.collectAsMap())

[('panda', (0, 1)), ('pink', (3, 1)), ('pirate', (3, 1)), ('panda', (1, 1)), ('pink', (4, 1))]
[('panda', (1, 2)), ('pink', (7, 2)), ('pirate', (3, 1))]
{'panda': 0.5, 'pink': 3.5, 'pirate': 3.0}
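foldByKey() is named in the heading of this section but not exercised by the cell above. The sketch below is a plain-Python model of what it computes on a single partition, not Spark's implementation; the helper name `fold_by_key` is hypothetical. In PySpark the corresponding call would be `kvRdd.foldByKey((0, 0), func)`, where Spark applies the zero value once per key in each partition.

```python
def fold_by_key(pairs, zero_value, func):
    """Hypothetical single-partition model of foldByKey(zeroValue, func)."""
    result = {}
    for key, value in pairs:
        # The first value seen for a key is folded into zero_value.
        result[key] = func(result.get(key, zero_value), value)
    return result


data = [('panda', 0), ('pink', 3), ('pirate', 3), ('panda', 1), ('pink', 4)]
kv_pairs = [(key, (value, 1)) for key, value in data]  # same as mapValues(lambda x: (x, 1))

# Accumulate (sum, count) per key, starting from (0, 0).
sum_count = fold_by_key(kv_pairs, (0, 0), lambda x, y: (x[0] + y[0], x[1] + y[1]))
average_by_key = {key: total / count for key, (total, count) in sum_count.items()}

print(sum_count)
print(average_by_key)
```

Because the zero value (0, 0) is the identity for this function, the result matches the reduceByKey() output shown above.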


In [19]:
group = data.groupByKey()
print(group.collectAsMap())
group1 = group.map(lambda x: (x[0], list(x[1])))
print(group1.collect())

{'panda': <pyspark.resultiterable.ResultIterable object at 0x7f0dc3b39490>, 'pink': <pyspark.resultiterable.ResultIterable object at 0x7f0dc3b2bbd0>, 'pirate': <pyspark.resultiterable.ResultIterable object at 0x7f0dc3b3bb90>}
[('panda', [0, 1]), ('pink', [3, 4]), ('pirate', [3])]


### sortByKey(ascending, numPartitions, keyfunc)
- Sorts key/value data by key.

In [20]:
sortedKV = data.sortByKey()
print(sortedKV.take(4))

[('panda', 0), ('panda', 1), ('pink', 3), ('pink', 4)]
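The `ascending` and `keyfunc` parameters in the heading are not used in the cell above. As a rough plain-Python model (the helper `sort_by_key` is hypothetical and ignores partitioning), `keyfunc` transforms each key before comparison and `ascending=False` reverses the order. In PySpark the corresponding calls would be `data.sortByKey(ascending=False)` and `data.sortByKey(keyfunc=len)`; Spark does not promise any particular order among records with equal keys.

```python
def sort_by_key(pairs, ascending=True, keyfunc=lambda k: k):
    """Hypothetical local model of sortByKey(ascending, numPartitions, keyfunc)."""
    # Only the key (kv[0]) takes part in the comparison.
    return sorted(pairs, key=lambda kv: keyfunc(kv[0]), reverse=not ascending)


data = [('panda', 0), ('pink', 3), ('pirate', 3), ('panda', 1), ('pink', 4)]

descending = sort_by_key(data, ascending=False)  # reverse alphabetical key order
by_key_length = sort_by_key(data, keyfunc=len)   # shortest key first

print(descending)
print(by_key_length)
```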


### join(otherRDD, numPartitions)
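- Inner join: combines records from two key/value RDDs whose keys match.
- numPartitions: the number of partitions in the RDD produced by the join.
- leftOuterJoin(), rightOuterJoin(), and fullOuterJoin() also keep unmatched keys from the left, right, or both RDDs, filling the missing side with None.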

In [21]:
names1 = spark_session.sparkContext.parallelize(("abe", "abby", "apple")).map(lambda a: (a, 1))
names2 = spark_session.sparkContext.parallelize(("apple", "beatty", "beatrice")).map(lambda a: (a, 1))
print(names1.join(names2).collect())
print(names1.leftOuterJoin(names2).collect())
print(names1.rightOuterJoin(names2).collect())
print(names1.fullOuterJoin(names2).collect())

[('apple', (1, 1))]
[('abby', (1, None)), ('apple', (1, 1)), ('abe', (1, None))]
[('apple', (1, 1)), ('beatty', (None, 1)), ('beatrice', (None, 1))]
[('abby', (1, None)), ('apple', (1, 1)), ('abe', (1, None)), ('beatty', (None, 1)), ('beatrice', (None, 1))]


### cogroup(otherRDD, numPartitions)
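- Groups multiple RDDs by key.
- Similar to fullOuterJoin() in that every key from either RDD appears, but each key maps to a tuple of iterables (one per input RDD) rather than one record per matching value pair.
- Output: iterable objects (ResultIterable); convert them with list() to see their contents.
- In PySpark, cogroup() takes a single other RDD; to group three or more RDDs, use groupWith().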

In [22]:
print(names1.cogroup(names2).collect())

[('abby', (<pyspark.resultiterable.ResultIterable object at 0x7f0dc3b54390>, <pyspark.resultiterable.ResultIterable object at 0x7f0dc3b54a90>)), ('apple', (<pyspark.resultiterable.ResultIterable object at 0x7f0dc3b54410>, <pyspark.resultiterable.ResultIterable object at 0x7f0dc3b54690>)), ('abe', (<pyspark.resultiterable.ResultIterable object at 0x7f0dc3b56b90>, <pyspark.resultiterable.ResultIterable object at 0x7f0dc3b57650>)), ('beatty', (<pyspark.resultiterable.ResultIterable object at 0x7f0de66eb3d0>, <pyspark.resultiterable.ResultIterable object at 0x7f0d6624da90>)), ('beatrice', (<pyspark.resultiterable.ResultIterable object at 0x7f0dc3b68cd0>, <pyspark.resultiterable.ResultIterable object at 0x7f0dc3b691d0>))]


In [28]:
cogroupRDD = names1.cogroup(names2)
print(cogroupRDD.mapValues(lambda x: [item for sublist in x for item in sublist]).collect())
print(cogroupRDD.mapValues(lambda x: [sublist for sublist in x]).collect())

[('abby', [1]), ('apple', [1, 1]), ('abe', [1]), ('beatty', [1]), ('beatrice', [1])]
[('abby', [<pyspark.resultiterable.ResultIterable object at 0x7f0d66249650>, <pyspark.resultiterable.ResultIterable object at 0x7f0dc3b28bd0>]), ('apple', [<pyspark.resultiterable.ResultIterable object at 0x7f0d66461b90>, <pyspark.resultiterable.ResultIterable object at 0x7f0dc3b2ab10>]), ('abe', [<pyspark.resultiterable.ResultIterable object at 0x7f0dc3b298d0>, <pyspark.resultiterable.ResultIterable object at 0x7f0dc3b28b10>]), ('beatty', [<pyspark.resultiterable.ResultIterable object at 0x7f0dc3b286d0>, <pyspark.resultiterable.ResultIterable object at 0x7f0dc3b2a590>]), ('beatrice', [<pyspark.resultiterable.ResultIterable object at 0x7f0dc3b28850>, <pyspark.resultiterable.ResultIterable object at 0x7f0dc3b28510>])]
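The printed ResultIterable objects hide what cogroup() actually holds for each key. The following is a plain-Python model of the result shape, assuming each key maps to one list of values per input; the function below is a hypothetical stand-in and not Spark's implementation.

```python
from collections import defaultdict


def cogroup(left, right):
    """Hypothetical local model: key -> (values from left, values from right)."""
    grouped = defaultdict(lambda: ([], []))
    for key, value in left:
        grouped[key][0].append(value)
    for key, value in right:
        grouped[key][1].append(value)
    return dict(grouped)


names1 = [(name, 1) for name in ("abe", "abby", "apple")]
names2 = [(name, 1) for name in ("apple", "beatty", "beatrice")]

result = cogroup(names1, names2)
for key, (left_values, right_values) in result.items():
    print(key, left_values, right_values)
```

A key present in only one input gets an empty list on the other side, where fullOuterJoin() would put None.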


### cartesian(otherRDD)
- Performs a cross join.
- Builds an RDD containing every possible combination of the elements of the two RDDs.
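As a small illustration of the cross join described above, `itertools.product` yields the same set of combinations locally. The sample data is made up for this sketch; in PySpark the corresponding call would be `left_rdd.cartesian(right_rdd)`, and the order of the resulting pairs may differ across partitions.

```python
from itertools import product

# Made-up sample data for illustration.
left = [('a', 1), ('b', 2)]
right = [('x', 10), ('y', 20)]

# Every element of left paired with every element of right.
pairs = list(product(left, right))

for pair in pairs:
    print(pair)
```

The result has len(left) * len(right) elements, so cartesian() grows quickly on large RDDs.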

### Set operations
- union(), intersection(), subtract(), subtractByKey()
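The four operations differ in how they treat duplicates and whether they compare whole elements or only keys. The sketch below models them on plain Python lists with made-up data; it mirrors the documented behavior (union() keeps duplicates, intersection() removes them) but is not Spark code.

```python
# Made-up sample data for illustration.
left = [('a', 1), ('b', 2), ('b', 2), ('c', 3)]
right = [('b', 2), ('c', 9)]

right_set = set(right)
right_keys = {key for key, _ in right}

# union(): concatenation, duplicates are kept.
union = left + right
# intersection(): elements present in both, without duplicates.
intersection = sorted(set(left) & right_set)
# subtract(): elements of left that do not appear in right (whole pair compared).
subtract = [pair for pair in left if pair not in right_set]
# subtractByKey(): pairs of left whose key does not appear in right.
subtract_by_key = [pair for pair in left if pair[0] not in right_keys]

print(union)
print(intersection)
print(subtract)
print(subtract_by_key)
```

Note that ('c', 3) survives subtract() because its value differs from ('c', 9), but not subtractByKey(), which looks only at the key.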

### Numeric RDD operations
- min(), max(), mean(), sum(), stdev(), variance(), stats()
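For a sense of what these actions return, the sketch below computes the same quantities locally with the standard library, reusing the Hayward temperatures from earlier in this notebook. One point worth knowing: in PySpark, stdev() and variance() are the population versions (divide by n), while sampleStdev() and sampleVariance() divide by n - 1. stats() returns the count, mean, stdev, max, and min in a single pass.

```python
import statistics

values = [71, 69, 71, 71, 72]  # Hayward temperatures from the mapValues example

summary = {
    'min': min(values),
    'max': max(values),
    'sum': sum(values),
    'mean': statistics.mean(values),
    'variance': statistics.pvariance(values),  # population variance, like RDD.variance()
    'stdev': statistics.pstdev(values),        # population stdev, like RDD.stdev()
}

for name, value in summary.items():
    print(name, value)
```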