# Spark RDDs

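Introduction (reference: https://www.shiyanlou.com/courses/536)  

Spark's core abstraction is the Resilient Distributed Dataset (RDD): an immutable, partitioned collection whose elements can be computed on in parallel. An RDD supports two basic kinds of operations, transformations and actions:  
A transformation turns one RDD into another RDD, for example map(func), filter(func), flatMap(func), union(otherDataset) and sample(withReplacement, fraction, seed).  
An action triggers execution. Transformations are lazy, so a piece of Spark code needs at least one action before anything is computed. An action returns a computed result such as a number or an array, for example reduce and countByKey().  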
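
The split between transformations and actions can be illustrated without Spark. The following is a minimal plain-Python sketch, not Spark code: `LazyDataset` and `traced_double` are hypothetical names invented for this illustration. Transformations only record what should be done, and the work happens when an action such as `reduce` is called.

```python
from functools import reduce


class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them only on an action."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = steps  # tuple of ("map" | "filter", function) pairs

    # Transformations: return a new dataset, compute nothing.
    def map(self, func):
        return LazyDataset(self._source, self._steps + (("map", func),))

    def filter(self, pred):
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def _run(self):
        # Chain the recorded steps as lazy iterators over the source.
        items = iter(self._source)
        for kind, func in self._steps:
            items = map(func, items) if kind == "map" else filter(func, items)
        return items

    # Actions: consume the pipeline and return a plain Python value.
    def reduce(self, func):
        return reduce(func, self._run())

    def collect(self):
        return list(self._run())


calls = []


def traced_double(x):
    calls.append(x)  # remember when the function really runs
    return x * 2


ds = LazyDataset([1, 2, 3, 4]).map(traced_double).filter(lambda x: x > 4)
calls_before_action = list(calls)    # still empty: nothing has run yet
total = ds.reduce(lambda a, b: a + b)

print(calls_before_action)  # []
print(total)                # 14
```

Each transformation returns a new `LazyDataset` and leaves the one it was called on unchanged, which mirrors the immutability described above.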
 
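Narrow dependency: each partition of the parent RDD is used by at most one partition of the child RDD. Operations such as map, filter and union produce narrow dependencies. (Think of an only child.)   
Wide dependency: a partition of the parent RDD is used by multiple partitions of the child RDD. Operations such as groupByKey, reduceByKey and sortByKey produce wide dependencies. (Think of a parent with many children.)  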
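
To make the two dependency types concrete, here is a small plain-Python sketch, not Spark code. It models an RDD as a list of partitions; `map_partitions`, `group_by_key` and the byte-sum partitioner are hypothetical helpers made up for this illustration.

```python
from collections import defaultdict

# A toy "RDD": a list of partitions, each a list of (key, value) pairs.
parent = [
    [("PS", 23), ("AA", 5)],
    [("PS", 14), ("UA", -1)],
]


def map_partitions(partitions, func):
    """Narrow: child partition i is computed from parent partition i alone."""
    return [[func(item) for item in part] for part in partitions]


def group_by_key(partitions, num_partitions):
    """Wide: any parent partition may send records to any child partition."""
    children = [defaultdict(list) for _ in range(num_partitions)]
    for part in partitions:
        for key, value in part:
            # Deterministic toy partitioner: byte sum of the key.
            target = sum(key.encode()) % num_partitions
            children[target][key].append(value)
    return [dict(child) for child in children]


doubled = map_partitions(parent, lambda kv: (kv[0], kv[1] * 2))
grouped = group_by_key(parent, 2)

print(doubled)  # [[('PS', 46), ('AA', 10)], [('PS', 28), ('UA', -2)]]
print(grouped)  # [{'AA': [5], 'UA': [-1]}, {'PS': [23, 14]}]
```

In `grouped`, the first parent partition contributes to both child partitions ("AA" goes to child 0, "PS" to child 1). That is the wide case: records have to move between partitions. `doubled` keeps every record in the partition it started in.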

# DataFrame
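Anyone who has done data programming in Python or R will be familiar with the DataFrame: a structure that organizes data into columns so that it is convenient to process during development. Spark's DataFrame was inspired by the one in Python, although using it differs considerably from pandas, since it is built on RDDs and is distributed.  
At a high level, the DataFrame exists to help build out the Spark ecosystem; it is an extension of the core RDD abstraction. Spark also provides APIs for R, Java, Scala and Python, so Python developers can use pyspark to work with DataFrames for data analysis. As a rule of thumb, pandas handles small local datasets (say, under 100 MB), while Spark DataFrames can handle large distributed datasets.  
With that background, we can see that a DataFrame is built on RDDs, follows the RDD's rules, and adds functionality in the form of a schema and optimizations.  
DataFrame = RDD + schema + optimization  
There are two ways to work with a DataFrame. One is the regular RDD-style interface, for example df.select('year').count(), a transformation followed by an action. The other is the Spark SQL interface: first register the DataFrame as a temporary table (for example with createOrReplaceTempView; older code went through SQLContext), then query it as you would in SQL, for example select * from df.

Summary
- It is the programming abstraction in SparkSQL.
- It supports a wide range of data formats and storage systems.
- It can be programmed in Scala, Python, Java and R.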

The project dataset comes from the 2009 Data Expo, Airline on-time performance. The 1987 flight data can be downloaded from http://stat-computing.org/dataexpo/2009/1987.csv.bz2   
Alternative download:
wget http://labfile.oss.aliyuncs.com/courses/536/1987.csv.bz2
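
The download is a bzip2 archive, while the notebook reads a file named 1987.csv, so the archive presumably has to be unpacked first. Below is a minimal sketch using only the Python standard library. `decompress_bz2` is a hypothetical helper name, and the file names are assumed to sit in the working directory. Spark can usually read .bz2 text files directly, so this step may be optional.

```python
import bz2
import shutil
from pathlib import Path


def decompress_bz2(src, dst):
    """Stream-decompress src (.bz2) into dst without loading it all into memory."""
    with bz2.open(src, "rb") as compressed, open(dst, "wb") as out:
        shutil.copyfileobj(compressed, out)
    return Path(dst)


if __name__ == "__main__":
    archive = Path("1987.csv.bz2")  # assumed location of the downloaded file
    if archive.exists():
        decompress_bz2(archive, "1987.csv")
```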
First, look at which columns the data contains and what each one means.
![image.png](attachment:image.png)

In [4]:
# Read the data
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("dataframe").getOrCreate()
sc = spark.sparkContext  # reuse the session's SparkContext (the SparkContext class was never imported)
data = spark.read.csv(r"1987.csv", header=True, inferSchema=True)

In [6]:
print(data.head())

Row(Year=1987, Month=10, DayofMonth=14, DayOfWeek=3, DepTime='741', CRSDepTime=730, ArrTime='912', CRSArrTime=849, UniqueCarrier='PS', FlightNum=1451, TailNum='NA', ActualElapsedTime='91', CRSElapsedTime=79, AirTime='NA', ArrDelay='23', DepDelay='11', Origin='SAN', Dest='SFO', Distance='447', TaxiIn='NA', TaxiOut='NA', Cancelled=0, CancellationCode='NA', Diverted=0, CarrierDelay='NA', WeatherDelay='NA', NASDelay='NA', SecurityDelay='NA', LateAircraftDelay='NA')


In [7]:
data.take(5)

[Row(Year=1987, Month=10, DayofMonth=14, DayOfWeek=3, DepTime='741', CRSDepTime=730, ArrTime='912', CRSArrTime=849, UniqueCarrier='PS', FlightNum=1451, TailNum='NA', ActualElapsedTime='91', CRSElapsedTime=79, AirTime='NA', ArrDelay='23', DepDelay='11', Origin='SAN', Dest='SFO', Distance='447', TaxiIn='NA', TaxiOut='NA', Cancelled=0, CancellationCode='NA', Diverted=0, CarrierDelay='NA', WeatherDelay='NA', NASDelay='NA', SecurityDelay='NA', LateAircraftDelay='NA'), Row(Year=1987, Month=10, DayofMonth=15, DayOfWeek=4, DepTime='729', CRSDepTime=730, ArrTime='903', CRSArrTime=849, UniqueCarrier='PS', FlightNum=1451, TailNum='NA', ActualElapsedTime='94', CRSElapsedTime=79, AirTime='NA', ArrDelay='14', DepDelay='-1', Origin='SAN', Dest='SFO', Distance='447', TaxiIn='NA', TaxiOut='NA', Cancelled=0, CancellationCode='NA', Diverted=0, CarrierDelay='NA', WeatherDelay='NA', NASDelay='NA', SecurityDelay='NA', LateAircraftDelay='NA'), Row(Year=1987, Month=10, DayofMonth=17, DayOfWeek=6, DepTime='741

In [12]:
# Check the schema to make sure the column types are correct
data.printSchema()

root
 |-- Year: integer (nullable = true)
 |-- Month: integer (nullable = true)
 |-- DayofMonth: integer (nullable = true)
 |-- DayOfWeek: integer (nullable = true)
 |-- DepTime: string (nullable = true)
 |-- CRSDepTime: integer (nullable = true)
 |-- ArrTime: string (nullable = true)
 |-- CRSArrTime: integer (nullable = true)
 |-- UniqueCarrier: string (nullable = true)
 |-- FlightNum: integer (nullable = true)
 |-- TailNum: string (nullable = true)
 |-- ActualElapsedTime: string (nullable = true)
 |-- CRSElapsedTime: integer (nullable = true)
 |-- AirTime: string (nullable = true)
 |-- ArrDelay: string (nullable = true)
 |-- DepDelay: string (nullable = true)
 |-- Origin: string (nullable = true)
 |-- Dest: string (nullable = true)
 |-- Distance: string (nullable = true)
 |-- TaxiIn: string (nullable = true)
 |-- TaxiOut: string (nullable = true)
 |-- Cancelled: integer (nullable = true)
 |-- CancellationCode: string (nullable = true)
 |-- Diverted: integer (nullable = true)
 |-- Car

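## Data type conversion
From the data description, apart from the code-like columns, which are string identifiers, most columns are time values measured in minutes, so we can convert them to integer.  
Because RDDs are immutable, a transformation cannot modify the original dataset in place. withColumn therefore returns a new DataFrame containing the converted column, which we assign back to data.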
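
The numeric-looking columns were most likely inferred as strings because they contain the placeholder 'NA', as the rows printed earlier show. The sketch below is plain Python, not Spark. It mimics the two ideas in this section with hypothetical helpers `cast_to_int` and `with_column`: a cast turns unparseable text into a missing value, and adding a column yields new rows instead of changing the old ones. It assumes Spark's non-ANSI cast behaviour, where a failed cast gives null; with ANSI mode on, Spark raises an error instead.

```python
from typing import Optional


def cast_to_int(value: Optional[str]) -> Optional[int]:
    """Mimic a string-to-integer cast: unparseable text becomes None."""
    if value is None:
        return None
    try:
        return int(value.strip())
    except ValueError:
        return None


def with_column(rows, name, func):
    """Return new rows with column `name` replaced; the input rows stay untouched."""
    return [{**row, name: func(row[name])} for row in rows]


# Sample values taken from the rows printed earlier in the notebook.
rows = [
    {"DepTime": "741", "ArrDelay": "23", "AirTime": "NA"},
    {"DepTime": "729", "ArrDelay": "14", "AirTime": "NA"},
]

converted = rows
for name in ("DepTime", "ArrDelay", "AirTime"):
    converted = with_column(converted, name, cast_to_int)

print(converted[0])  # {'DepTime': 741, 'ArrDelay': 23, 'AirTime': None}
print(rows[0])       # {'DepTime': '741', 'ArrDelay': '23', 'AirTime': 'NA'}
```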


In [14]:
from pyspark.sql.types import IntegerType
data = data.withColumn("DepTime", data["DepTime"].cast(IntegerType()))
data.printSchema() 

root
 |-- Year: integer (nullable = true)
 |-- Month: integer (nullable = true)
 |-- DayofMonth: integer (nullable = true)
 |-- DayOfWeek: integer (nullable = true)
 |-- DepTime: integer (nullable = true)
 |-- CRSDepTime: integer (nullable = true)
 |-- ArrTime: string (nullable = true)
 |-- CRSArrTime: integer (nullable = true)
 |-- UniqueCarrier: string (nullable = true)
 |-- FlightNum: integer (nullable = true)
 |-- TailNum: string (nullable = true)
 |-- ActualElapsedTime: string (nullable = true)
 |-- CRSElapsedTime: integer (nullable = true)
 |-- AirTime: string (nullable = true)
 |-- ArrDelay: string (nullable = true)
 |-- DepDelay: string (nullable = true)
 |-- Origin: string (nullable = true)
 |-- Dest: string (nullable = true)
 |-- Distance: string (nullable = true)
 |-- TaxiIn: string (nullable = true)
 |-- TaxiOut: string (nullable = true)
 |-- Cancelled: integer (nullable = true)
 |-- CancellationCode: string (nullable = true)
 |-- Diverted: integer (nullable = true)
 |-- Ca

In [16]:
# Cast the remaining numeric-looking string columns to integer
trans_col = ['ArrTime', 'ActualElapsedTime', 'AirTime', 'ArrDelay', 'DepDelay',
             'TaxiIn', 'TaxiOut', 'CarrierDelay', 'WeatherDelay', 'NASDelay', 'SecurityDelay', 'LateAircraftDelay']
for name in trans_col:
    data = data.withColumn(name, data[name].cast(IntegerType()))
data.printSchema() 

root
 |-- Year: integer (nullable = true)
 |-- Month: integer (nullable = true)
 |-- DayofMonth: integer (nullable = true)
 |-- DayOfWeek: integer (nullable = true)
 |-- DepTime: integer (nullable = true)
 |-- CRSDepTime: integer (nullable = true)
 |-- ArrTime: integer (nullable = true)
 |-- CRSArrTime: integer (nullable = true)
 |-- UniqueCarrier: string (nullable = true)
 |-- FlightNum: integer (nullable = true)
 |-- TailNum: string (nullable = true)
 |-- ActualElapsedTime: integer (nullable = true)
 |-- CRSElapsedTime: integer (nullable = true)
 |-- AirTime: integer (nullable = true)
 |-- ArrDelay: integer (nullable = true)
 |-- DepDelay: integer (nullable = true)
 |-- Origin: string (nullable = true)
 |-- Dest: string (nullable = true)
 |-- Distance: string (nullable = true)
 |-- TaxiIn: integer (nullable = true)
 |-- TaxiOut: integer (nullable = true)
 |-- Cancelled: integer (nullable = true)
 |-- CancellationCode: string (nullable = true)
 |-- Diverted: integer (nullable = true)


In [17]:
data = data.withColumn('Distance', data['Distance'].cast('float'))

In [18]:
data.printSchema()

root
 |-- Year: integer (nullable = true)
 |-- Month: integer (nullable = true)
 |-- DayofMonth: integer (nullable = true)
 |-- DayOfWeek: integer (nullable = true)
 |-- DepTime: integer (nullable = true)
 |-- CRSDepTime: integer (nullable = true)
 |-- ArrTime: integer (nullable = true)
 |-- CRSArrTime: integer (nullable = true)
 |-- UniqueCarrier: string (nullable = true)
 |-- FlightNum: integer (nullable = true)
 |-- TailNum: string (nullable = true)
 |-- ActualElapsedTime: integer (nullable = true)
 |-- CRSElapsedTime: integer (nullable = true)
 |-- AirTime: integer (nullable = true)
 |-- ArrDelay: integer (nullable = true)
 |-- DepDelay: integer (nullable = true)
 |-- Origin: string (nullable = true)
 |-- Dest: string (nullable = true)
 |-- Distance: float (nullable = true)
 |-- TaxiIn: integer (nullable = true)
 |-- TaxiOut: integer (nullable = true)
 |-- Cancelled: integer (nullable = true)
 |-- CancellationCode: string (nullable = true)
 |-- Diverted: integer (nullable = true)
 