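# Key-Value Pair Operations
### 1. Motivation
- pair RDD: exposes operations that act on each key in parallel or regroup data across nodes.
- reduceByKey: reduces the values of each key separately.
- join: merges two RDDs into one by combining the elements that share a key.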

### 2. Creating a PairRDD
```
# Use the first word of each line as the key and the remaining words as the value.
rawRDD = sc.parallelize(["打招呼 你好啊~~吃了吗??", "约 约吗美女?"])
pairRDD = rawRDD.map(lambda line: (line.split(' ')[0], line.split(' ')[1:]))
pairRDD.collect()
```
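### 3. PairRDD Transformations
#### 0. Basic functions: see the code cells
Omitted here.
#### 1. Aggregation
- Aggregations on a basic RDD: reduce(), fold(), aggregate()
- Aggregations on a PairRDD: reduceByKey(), combineByKey(), foldByKey()
- Grouping: groupByKey(). groupByKey() followed by mapValues() gives the same result as reduceByKey(), but it is slower because every value is shuffled before anything is combined, whereas reduceByKey() first combines values inside each partition.
- Joins: join(), leftOuterJoin(), rightOuterJoin()
- Sorting: sortByKey()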
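
The cost difference between the two grouping approaches can be illustrated without Spark. The following is a minimal plain-Python sketch, not Spark's implementation: the two-partition data set and the helper names `group_then_sum` and `reduce_by_key` are assumptions made for illustration, and "shuffled" simply counts the records that would have to leave their partition.

```python
from collections import defaultdict

# Hypothetical data: two partitions of (key, value) records.
partitions = [
    [("a", 1), ("b", 2), ("a", 3)],
    [("a", 4), ("b", 5), ("b", 6)],
]

def group_then_sum(parts):
    """Model of groupByKey().mapValues(sum): every record is shuffled."""
    shuffled = [kv for part in parts for kv in part]
    groups = defaultdict(list)
    for key, value in shuffled:
        groups[key].append(value)
    return {k: sum(vs) for k, vs in groups.items()}, len(shuffled)

def reduce_by_key(parts, func):
    """Model of reduceByKey(func): combine inside each partition first."""
    shuffled = []
    for part in parts:
        local = {}
        for key, value in part:
            local[key] = func(local[key], value) if key in local else value
        shuffled.extend(local.items())  # one record per key per partition
    merged = {}
    for key, value in shuffled:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged, len(shuffled)

grouped_result, grouped_shuffled = group_then_sum(partitions)
reduced_result, reduced_shuffled = reduce_by_key(partitions, lambda x, y: x + y)
print(grouped_result, grouped_shuffled)   # {'a': 8, 'b': 13} 6
print(reduced_result, reduced_shuffled)   # {'a': 8, 'b': 13} 4
```

Both models return the same sums, but the reduce-style version moves four records instead of six. The gap grows with the number of values per key.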
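### 4. PairRDD Actions
- countByKey(): counts the elements for each key.
- countByValue(): counts how many times each distinct element occurs in the RDD (for a PairRDD, each whole (key, value) pair); a shortcut for word count.
- collectAsMap(): returns the result as a dictionary.
- lookup(key): returns all values associated with the given key.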
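
To make the difference between these actions concrete, here is a small plain-Python sketch of what each one returns for a list of pairs. It only models the results on a local list (the variable names are illustrative, not Spark API), so it runs without a SparkContext.

```python
from collections import Counter

# Hypothetical pair data, standing in for the contents of a PairRDD.
pairs = [(1, 2), (3, 4), (3, 6)]

# countByKey(): number of elements per key.
count_by_key = Counter(key for key, _ in pairs)

# countByValue(): number of occurrences of each whole element,
# which for pair data means each (key, value) tuple.
count_by_value = Counter(pairs)

# collectAsMap(): a dictionary, so only one value survives per key.
collect_as_map = dict(pairs)

# lookup(3): every value stored under key 3.
lookup_3 = [value for key, value in pairs if key == 3]

print(dict(count_by_key))   # {1: 1, 3: 2}
print(collect_as_map)       # {1: 2, 3: 6}
print(lookup_3)             # [4, 6]
```

Note that the dictionary form drops one of the two values for key 3, so collectAsMap() is best suited to data with unique keys.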
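### 5. Data Partitioning (Advanced)
Custom partitioning improves efficiency when a table that does not change is used repeatedly: partitioning it once, for example with rdd.partitionBy(), and persisting the result avoids the time spent reshuffling it on every operation.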
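
The idea behind hash partitioning can be sketched in plain Python. This is a simplified model under stated assumptions: `partition_by`, `user_table`, and `events` are hypothetical names, and the model uses Python's built-in `hash` on small integer keys rather than the hash function PySpark applies. It shows that once two data sets are partitioned the same way, matching keys always sit in the same partition index, so a join can proceed partition by partition.

```python
def partition_by(records, num_partitions, partition_func=hash):
    """Place each (key, value) record in partition hash(key) % num_partitions."""
    parts = [[] for _ in range(num_partitions)]
    for key, value in records:
        parts[partition_func(key) % num_partitions].append((key, value))
    return parts

# Hypothetical data: a large table that never changes and a small batch of events.
user_table = [(1, "alice"), (2, "bob"), (3, "carol"), (4, "dave")]
events = [(3, "click"), (1, "view"), (3, "buy")]

user_parts = partition_by(user_table, 2)   # done once, then reused for every batch
event_parts = partition_by(events, 2)      # only the small side is moved each time

# Join partition by partition: equal keys are guaranteed to share an index.
joined = []
for users, batch in zip(user_parts, event_parts):
    names = dict(users)
    joined.extend((key, (names[key], event)) for key, event in batch if key in names)

print(user_parts)      # [[(2, 'bob'), (4, 'dave')], [(1, 'alice'), (3, 'carol')]]
print(sorted(joined))  # [(1, ('alice', 'view')), (3, ('carol', 'buy')), (3, ('carol', 'click'))]
```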



In [2]:
from pyspark import SparkContext,SparkConf
sc = SparkContext("local","PairRDD")

In [103]:
# WordCount 
import os
words = sc.textFile("file://" + os.path.abspath(".") + "/quickstart.txt").flatMap(lambda line : line.split(" "))
wordCount = words.countByValue()
# wordCount = words.map(lambda x: (x,1)).reduceByKey(lambda x,y : x + y)
wordCount

defaultdict(int,
            {'': 71,
             '")).map(word': 1,
             '").size).reduce((a,': 2,
             '"1.0"': 1,
             '"1.6.2"': 1,
             '"2.10.5"': 1,
             '"Simple': 1,
             '"SimpleApp"': 1,
             '"Spark"?': 1,
             '"YOUR_SPARK_HOME/README.md"': 1,
             '"org.apache.spark"': 1,
             '"spark-core"': 1,
             '#': 7,
             '$': 3,
             '%': 1,
             '%%': 1,
             '%s".format(numAs,': 1,
             '%s,': 1,
             '(Because,1),': 1,
             '(Python,2),': 1,
             '(RDD).': 1,
             '(Scala,': 1,
             '(String,': 1,
             '(a': 1,
             '(agree,1),': 1,
             '(closures),': 1,
             '(cluster.,1),': 1,
             '(in': 1,
             '(such': 1,
             '(this,3),': 1,
             '(under,2),': 1,
             '(which': 1,
             '(with': 2,
             '(word,': 1,
             '*/': 

In [27]:
rawRDD = sc.parallelize(["打招呼 你好啊~~吃了吗??","约 约吗美女?"])
pairRDD = rawRDD.map(lambda line : (line.split(' ')[0],line.split(' ')[1:]))
pairRDD.collect()


[('打招呼', ['你好啊~~吃了吗??']), ('约', ['约吗美女?'])]

In [71]:
# PairRDD transformations
pairRDD = sc.parallelize([(1, 2), (3, 4), (3, 6), (4, 6)])
# 1. reduceByKey: merge the values that share a key
pair1 = pairRDD.reduceByKey(lambda x, y: x + y)
# 2. groupByKey: group the values that share a key
pair2 = pairRDD.groupByKey()
# 3. mapValues: apply a function to every value without changing the key
pair3 = pairRDD.mapValues(lambda x: x + 5)
# 4. flatMapValues: apply a function that returns an iterable to every value
#    and emit one (key, item) record for each item produced
pair4 = pairRDD.flatMapValues(lambda x: range(x))
# 5. keys: an RDD of the keys
keysRDD = pairRDD.keys()
# 6. values: an RDD of the values
valuesRDD = pairRDD.values()
# 7. sortByKey: sort by key
sortedRDD = pairRDD.sortByKey()
sortedRDD.collect()
# Transformations on two RDDs
rdd1 = sc.parallelize([(1, 2), (3, 4), (3, 6)])
rdd2 = sc.parallelize([(3, 9), (4, 5)])
# 1. subtractByKey: drop the elements of rdd1 whose key also appears in rdd2
subRDD = rdd1.subtractByKey(rdd2)
# 2. join: inner join
joinRDD = rdd1.join(rdd2)
# 3. rightOuterJoin: right outer join, every key of the second RDD is kept
rightOuterRDD = rdd1.rightOuterJoin(rdd2)
# 4. leftOuterJoin: left outer join, every key of the first RDD is kept
leftOuterRDD = rdd1.leftOuterJoin(rdd2)
# 5. cogroup: group the data from both RDDs that shares a key
coRDD = rdd1.cogroup(rdd2)
# joinRDD.collect()
# rightOuterRDD.collect()
# leftOuterRDD.collect()
coRDD.collect()


[(4,
  (<pyspark.resultiterable.ResultIterable at 0x1069162e8>,
   <pyspark.resultiterable.ResultIterable at 0x10694a400>)),
 (1,
  (<pyspark.resultiterable.ResultIterable at 0x106916048>,
   <pyspark.resultiterable.ResultIterable at 0x10694acf8>)),
 (3,
  (<pyspark.resultiterable.ResultIterable at 0x1069160b8>,
   <pyspark.resultiterable.ResultIterable at 0x10694afd0>))]
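
The cogroup output above only shows opaque ResultIterable objects, so the contents of each result are easier to see in a plain-Python model of the same operations. This is a sketch of the semantics on local lists, not Spark code; the helper functions `cogroup` and `join` below are illustrative re-implementations, and the results are sorted by key for readability, whereas Spark does not guarantee an order.

```python
rdd1 = [(1, 2), (3, 4), (3, 6)]
rdd2 = [(3, 9), (4, 5)]

def cogroup(left, right):
    """Map every key to ([values from left], [values from right])."""
    grouped = {}
    for key, value in left:
        grouped.setdefault(key, ([], []))[0].append(value)
    for key, value in right:
        grouped.setdefault(key, ([], []))[1].append(value)
    return grouped

def join(left, right, keep_left=False, keep_right=False):
    """Inner join by default; keep_left / keep_right pad the missing side with None."""
    out = []
    for key, (lvals, rvals) in sorted(cogroup(left, right).items()):
        lvals = lvals or ([None] if keep_right else [])
        rvals = rvals or ([None] if keep_left else [])
        out.extend((key, (l, r)) for l in lvals for r in rvals)
    return out

right_keys = {key for key, _ in rdd2}
print([kv for kv in rdd1 if kv[0] not in right_keys])  # subtractByKey: [(1, 2)]
print(join(rdd1, rdd2))                   # [(3, (4, 9)), (3, (6, 9))]
print(join(rdd1, rdd2, keep_left=True))   # [(1, (2, None)), (3, (4, 9)), (3, (6, 9))]
print(join(rdd1, rdd2, keep_right=True))  # [(3, (4, 9)), (3, (6, 9)), (4, (None, 5))]
print(cogroup(rdd1, rdd2))                # {1: ([2], []), 3: ([4, 6], [9]), 4: ([], [5])}
```

In the notebook itself, the contents of each ResultIterable can be inspected by converting it to a list, for example with mapValues.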

In [80]:
# Filter the pair RDD by key: keep the pairs whose key is longer than 7 characters
lines = sc.textFile("file://" + os.path.abspath(".")+ "/quickstart.txt")
pairsRDD = lines.map(lambda line : (line.split(" ")[0],line))
limitLengthRDD = pairsRDD.filter(lambda keyValue : len(keyValue[0]) > 7)
limitLengthRDD.collect()

[('Interactive', 'Interactive Analysis with the Spark Shell'),
 ('Self-Contained', 'Self-Contained Applications'),
 ('Interactive', 'Interactive Analysis with the Spark Shell'),
 ('./bin/spark-shell', './bin/spark-shell'),
 ('textFile:', 'textFile: spark.RDD[String] = spark.MappedRDD@2ee9b6e3'),
 ('linesWithSpark:',
  'linesWithSpark: spark.RDD[String] = spark.FilteredRDD@7dd4af09'),
 ('wordCounts:',
  'wordCounts: spark.RDD[(String, Int)] = spark.ShuffledAggregatedRDD@71f027b8'),
 ('Self-Contained', 'Self-Contained Applications'),
 ('scalaVersion', 'scalaVersion := "2.10.5"'),
 ('libraryDependencies',
  'libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.2"'),
 ('./simple.sbt', './simple.sbt'),
 ('./src/main', './src/main'),
 ('./src/main/scala', './src/main/scala'),
 ('./src/main/scala/SimpleApp.scala', './src/main/scala/SimpleApp.scala'),
 ('Congratulations',
  'Congratulations on running your first Spark application!'),
 ('Finally,',
  'Finally, Spark includes several

In [96]:
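# Aggregation: compute the average value for each key
pairRDD = sc.parallelize([('a', 1), ('b', 2), ('c', 3), ('a', 2), ('b', 0), ('c', 10)])
# Pair each value with a count of 1, add up the (sum, count) pairs per key, then divide.
aveRDD = (pairRDD.mapValues(lambda x: (x, 1))
                 .reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1]))
                 .mapValues(lambda x: x[0] / x[1]))
aveRDD.collect()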


[('a', 1.5), ('b', 1.0), ('c', 6.5)]

In [101]:
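# combineByKey builds a (sum, count) pair for each key from three functions:
#   1. createCombiner: turn the first value seen for a key in a partition into (value, 1)
#   2. mergeValue: fold another value from the same partition into the pair
#   3. mergeCombiners: merge the pairs built for the same key in different partitions
sumCount = pairRDD.combineByKey((lambda x: (x, 1)),
                                (lambda x, y: (x[0] + y, x[1] + 1)),
                                (lambda x, y: (x[0] + y[0], x[1] + y[1])))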
sumCount.getNumPartitions()

1
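
With a single partition, as reported above, the third function passed to combineByKey is never needed. The following plain-Python sketch models how the three functions cooperate when the data is spread over several partitions. It is an illustration only, not Spark's implementation: `combine_by_key` is a hypothetical helper, and the split of the six records into two partitions is an assumption.

```python
def combine_by_key(parts, create_combiner, merge_value, merge_combiners):
    """Model of combineByKey over a list of partitions."""
    per_partition = []
    for part in parts:
        acc = {}
        for key, value in part:
            if key in acc:
                acc[key] = merge_value(acc[key], value)   # key already seen here
            else:
                acc[key] = create_combiner(value)         # first value for this key
        per_partition.append(acc)
    result = {}
    for acc in per_partition:
        for key, combiner in acc.items():
            if key in result:
                result[key] = merge_combiners(result[key], combiner)
            else:
                result[key] = combiner
    return result

# The same records as in the notebook, split into two assumed partitions.
partitions = [
    [('a', 1), ('b', 2), ('c', 3)],
    [('a', 2), ('b', 0), ('c', 10)],
]

sum_count = combine_by_key(
    partitions,
    lambda x: (x, 1),                          # createCombiner
    lambda acc, x: (acc[0] + x, acc[1] + 1),   # mergeValue
    lambda a, b: (a[0] + b[0], a[1] + b[1]),   # mergeCombiners
)
averages = {key: total / count for key, (total, count) in sum_count.items()}
print(sum_count)   # {'a': (3, 2), 'b': (2, 2), 'c': (13, 2)}
print(averages)    # {'a': 1.5, 'b': 1.0, 'c': 6.5}
```

The averages match the result of the mapValues and reduceByKey version shown earlier.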