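### Pair RDD

- Spark provides dedicated operations, such as aggregation and grouping, for RDDs whose elements are key-value pairs.

### Creating a Pair RDD

#### Create key-value pairs using the first word of each line as the key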

```python
pairs = lines.map(lambda x: (x.split(' ')[0], x))
```

```scala
val pairs = lines.map(x => (x.split(" ")(0), x))
```

In [1]:
from pyspark import SparkConf, SparkContext

In [3]:
sc = SparkContext()

In [4]:
nums = sc.parallelize([(1, 2), (3, 4), (3, 6)])  # a list keeps the input order; a set does not guarantee it

## Operations on a single Pair RDD

#### reduceByKey(func): merge the values of each key with func (here, by summing them)

In [10]:
result = nums.reduceByKey(lambda x, y: x+y)

result.collect()

[(1, 2), (3, 10)]
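
To make the semantics concrete, here is a plain-Python sketch of what `reduceByKey` computes, with no Spark required. The helper name `reduce_by_key` is hypothetical, and the sketch ignores partitioning and the per-partition pre-aggregation that Spark performs; it only models the result.

```python
from operator import add

def reduce_by_key(pairs, func):
    """Plain-Python model of reduceByKey: fold the values of each key with func."""
    merged = {}
    for key, value in pairs:
        # The first value seen for a key is kept as-is; later ones are folded in.
        merged[key] = func(merged[key], value) if key in merged else value
    return sorted(merged.items())

nums = [(1, 2), (3, 4), (3, 6)]
summed = reduce_by_key(nums, add)   # same function as lambda x, y: x + y
largest = reduce_by_key(nums, max)  # any two-argument merge function works
print(summed)
print(largest)
```

Summing is only one choice of `func`; passing `max` keeps the largest value per key instead.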

#### groupByKey(): group the values that share the same key

In [11]:
result = nums.groupByKey()
result.collect()

[(1, <pyspark.resultiterable.ResultIterable at 0x7fad2d1c42e8>),
 (3, <pyspark.resultiterable.ResultIterable at 0x7fad2d1c4358>)]
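
The `ResultIterable` objects printed above wrap the values of each group, and their default representation does not show the contents. In PySpark a common way to inspect them is to turn each group into a list, for example `nums.groupByKey().mapValues(list).collect()`. The plain-Python sketch below models the grouping itself; `group_by_key` is a hypothetical helper, not part of PySpark.

```python
from collections import defaultdict

def group_by_key(pairs):
    """Plain-Python model of groupByKey: collect all values of each key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Sort by key only to make the printed result deterministic.
    return sorted(groups.items(), key=lambda kv: kv[0])

nums = [(1, 2), (3, 4), (3, 6)]
grouped = group_by_key(nums)
print(grouped)
```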

#### mapValues(func): apply func to each value of the Pair RDD, leaving the keys unchanged

In [12]:
result = nums.mapValues(lambda x : x*2)
result.collect()

[(1, 4), (3, 8), (3, 12)]

#### flatMapValues(func): apply func to each value of the Pair RDD and emit one key-value record, under the original key, for every element that func returns

In [14]:
result = nums.flatMapValues(lambda x : [x,x,x])
result.collect()

[(1, 2), (1, 2), (1, 2), (3, 4), (3, 4), (3, 4), (3, 6), (3, 6), (3, 6)]
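
The difference between `mapValues` and `flatMapValues` is easiest to see when `func` returns a sequence. The following plain-Python sketch models both; `map_values` and `flat_map_values` are hypothetical helpers that only mirror the results, not Spark's distributed execution.

```python
def map_values(pairs, func):
    """Model of mapValues: one output record per input record."""
    return [(key, func(value)) for key, value in pairs]

def flat_map_values(pairs, func):
    """Model of flatMapValues: one output record per element func returns."""
    return [(key, item) for key, value in pairs for item in func(value)]

nums = [(1, 2), (3, 4), (3, 6)]
nested = map_values(nums, lambda x: [x, x + 1])          # the list stays nested
flattened = flat_map_values(nums, lambda x: [x, x + 1])  # the list is flattened
print(nested)
print(flattened)
```

With the same `func`, `map_values` keeps three records whose values are lists, while `flat_map_values` produces six records with scalar values.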

#### keys(): return an RDD containing only the keys of the Pair RDD

In [15]:
keys = nums.keys()
keys.collect()

[1, 3, 3]

#### values(): return an RDD containing only the values of the Pair RDD

In [16]:
values = nums.values()
values.collect()

[2, 4, 6]

#### sortByKey(): return an RDD sorted by key

In [17]:
result = nums.sortByKey()
result.collect()

[(1, 2), (3, 4), (3, 6)]
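
In PySpark, `sortByKey` also accepts an `ascending` flag and a `keyfunc` that transforms each key before comparison. The sketch below models that behaviour in plain Python; `sort_by_key` is a hypothetical helper. Python's `sorted` is stable, whereas Spark makes no promise about the relative order of records that share a key.

```python
def sort_by_key(pairs, ascending=True, keyfunc=lambda k: k):
    """Plain-Python model of sortByKey: order records by (transformed) key."""
    return sorted(pairs, key=lambda kv: keyfunc(kv[0]), reverse=not ascending)

nums = [(3, 4), (1, 2), (3, 6)]
asc = sort_by_key(nums)
desc = sort_by_key(nums, ascending=False)
print(asc)
print(desc)
```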

## Operations on two Pair RDDs

#### subtractByKey(): return the pairs of RDD1 whose keys do not appear in RDD2

In [5]:
rdd1 = sc.parallelize([(1, 2), (3, 4), (3, 6)])
rdd2 = sc.parallelize([(3, 9)])

In [21]:
result = rdd1.subtractByKey(rdd2)

result.collect()

[(1, 2)]
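
Note that only the keys are compared: both `(3, 4)` and `(3, 6)` are dropped because key `3` occurs in the second RDD, even though the value `9` matches neither. A plain-Python sketch of this rule, using a hypothetical helper `subtract_by_key`:

```python
def subtract_by_key(left, right):
    """Model of subtractByKey: keep left records whose key is absent from right."""
    right_keys = {key for key, _ in right}
    return [(key, value) for key, value in left if key not in right_keys]

rdd1 = [(1, 2), (3, 4), (3, 6)]
rdd2 = [(3, 9)]
remaining = subtract_by_key(rdd1, rdd2)
print(remaining)
```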

#### join(): inner join of two Pair RDDs

In [22]:
result = rdd1.join(rdd2)
result.collect()

[(3, (4, 9)), (3, (6, 9))]

#### rightOuterJoin(): right outer join of two Pair RDDs

In [23]:
result = rdd1.rightOuterJoin(rdd2)
result.collect()

[(3, (4, 9)), (3, (6, 9))]

#### leftOuterJoin(): left outer join of two Pair RDDs

In [24]:
result = rdd1.leftOuterJoin(rdd2)
result.collect()

[(1, (2, None)), (3, (4, 9)), (3, (6, 9))]
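
The right outer join above returns the same rows as the inner join because every key of the right-hand RDD (only `3`) has a match on the left. The plain-Python sketch below models the three joins and adds a hypothetical third dataset, `rdd3`, with an unmatched key so that the difference becomes visible. All helper names are assumptions made for illustration.

```python
def index_by_key(pairs):
    """Build {key: [values]} so matches can be looked up by key."""
    index = {}
    for key, value in pairs:
        index.setdefault(key, []).append(value)
    return index

def join(left, right):
    """Inner join: only keys present on both sides."""
    idx = index_by_key(right)
    return [(k, (lv, rv)) for k, lv in left for rv in idx.get(k, [])]

def left_outer_join(left, right):
    """Keep every left record; a missing right side becomes None."""
    idx = index_by_key(right)
    return [(k, (lv, rv)) for k, lv in left for rv in idx.get(k, [None])]

def right_outer_join(left, right):
    """Keep every right record; a missing left side becomes None."""
    idx = index_by_key(left)
    return [(k, (lv, rv)) for k, rv in right for lv in idx.get(k, [None])]

rdd1 = [(1, 2), (3, 4), (3, 6)]
rdd2 = [(3, 9)]
rdd3 = [(3, 9), (5, 1)]  # hypothetical: key 5 has no match in rdd1

inner = join(rdd1, rdd2)
left = left_outer_join(rdd1, rdd2)
right = right_outer_join(rdd1, rdd3)
print(inner)
print(left)
print(right)
```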

#### cogroup(): group the data of both RDDs that share the same key

In [6]:
result = rdd1.cogroup(rdd2)
result.collect()

[(1,
  (<pyspark.resultiterable.ResultIterable at 0x7f3288b4e390>,
   <pyspark.resultiterable.ResultIterable at 0x7f3288b4e438>)),
 (3,
  (<pyspark.resultiterable.ResultIterable at 0x7f3288b4e400>,
   <pyspark.resultiterable.ResultIterable at 0x7f3288b4e4a8>))]

In [13]:
a = result.first()

In [14]:
a.collect()

AttributeError: 'tuple' object has no attribute 'collect'
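
The `AttributeError` occurs because `first()` returns an ordinary Python tuple of the form `(key, (values_from_rdd1, values_from_rdd2))`, not an RDD, so there is nothing to `collect()`. In PySpark the tuple can be unpacked directly, for example `key, (left, right) = result.first()`, and each side converted with `list(...)`. The plain-Python sketch below models the shape of the `cogroup` result; the `cogroup` function defined here is a hypothetical stand-in, not the Spark method.

```python
def cogroup(left, right):
    """Model of cogroup: for each key, the values from left and from right."""
    keys = sorted({k for k, _ in left} | {k for k, _ in right})
    return [
        (key,
         ([v for k, v in left if k == key],
          [v for k, v in right if k == key]))
        for key in keys
    ]

rdd1 = [(1, 2), (3, 4), (3, 6)]
rdd2 = [(3, 9)]
result = cogroup(rdd1, rdd2)

# Analogous to result.first(): a plain tuple that is unpacked, not collected.
key, (left_values, right_values) = result[0]
print(result)
print(key, left_values, right_values)
```

A key that appears in only one dataset still shows up in the result, paired with an empty group for the other side.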