# DataFrame
---

In [1]:
# Create the entry points to Spark
try:
    sc.stop()  # stop a SparkContext left over from an earlier run, if any
except Exception:
    pass  # no SparkContext was defined yet
from pyspark import SparkContext
from pyspark.sql import SparkSession
sc = SparkContext()
spark = SparkSession(sparkContext=sc)

## Create a DataFrame

### Create a DataFrame by reading a file

In [2]:
mtcars = spark.read.csv(path='../../data/mtcars.csv',
                        sep=',',
                        encoding='UTF-8',
                        comment=None,
                        header=True, 
                        inferSchema=True)
mtcars.show(n=5, truncate=False)

+-----------------+----+---+-----+---+----+-----+-----+---+---+----+----+
|_c0              |mpg |cyl|disp |hp |drat|wt   |qsec |vs |am |gear|carb|
+-----------------+----+---+-----+---+----+-----+-----+---+---+----+----+
|Mazda RX4        |21.0|6  |160.0|110|3.9 |2.62 |16.46|0  |1  |4   |4   |
|Mazda RX4 Wag    |21.0|6  |160.0|110|3.9 |2.875|17.02|0  |1  |4   |4   |
|Datsun 710       |22.8|4  |108.0|93 |3.85|2.32 |18.61|1  |1  |4   |1   |
|Hornet 4 Drive   |21.4|6  |258.0|110|3.08|3.215|19.44|1  |0  |3   |1   |
|Hornet Sportabout|18.7|8  |360.0|175|3.15|3.44 |17.02|0  |0  |3   |2   |
+-----------------+----+---+-----+---+----+-----+-----+---+---+----+----+
only showing top 5 rows



### Create a DataFrame with the `createDataFrame` function

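#### From an RDD

Each element of the RDD should be a row-like record. Using `Row` objects, as here, lets Spark take the column names from the `Row` fields.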

In [3]:
from pyspark.sql import Row
rdd = sc.parallelize([
    Row(x=[1,2,3], y=['a','b','c']),
    Row(x=[4,5,6], y=['e','f','g'])
])
rdd.collect()

[Row(x=[1, 2, 3], y=['a', 'b', 'c']), Row(x=[4, 5, 6], y=['e', 'f', 'g'])]

In [4]:
df = spark.createDataFrame(rdd)
df.show()

+---------+---------+
|        x|        y|
+---------+---------+
|[1, 2, 3]|[a, b, c]|
|[4, 5, 6]|[e, f, g]|
+---------+---------+



#### From a pandas DataFrame

In [5]:
import pandas as pd
pdf = pd.DataFrame({
    'x': [[1,2,3], [4,5,6]],
    'y': [['a','b','c'], ['e','f','g']]
})
pdf

           x          y
0  [1, 2, 3]  [a, b, c]
1  [4, 5, 6]  [e, f, g]


In [6]:
df = spark.createDataFrame(pdf)
df.show()

+---------+---------+
|        x|        y|
+---------+---------+
|[1, 2, 3]|[a, b, c]|
|[4, 5, 6]|[e, f, g]|
+---------+---------+



#### From a list

Each element of the list becomes one row of the DataFrame.

In [7]:
my_list = [['a', 1], ['b', 2]]
df = spark.createDataFrame(my_list, ['letter', 'number'])
df.show()

+------+------+
|letter|number|
+------+------+
|     a|     1|
|     b|     2|
+------+------+



In [8]:
df.dtypes

[('letter', 'string'), ('number', 'bigint')]

In [9]:
my_list = [['a', 1], ['b', 2]]
df = spark.createDataFrame(my_list, ['my_column'])
df.show()

+---------+---+
|my_column| _2|
+---------+---+
|        a|  1|
|        b|  2|
+---------+---+



In [10]:
df.dtypes

[('my_column', 'string'), ('_2', 'bigint')]

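The following code generates a DataFrame with two columns, each of which is an array column.

Why are array columns generated here? In this example the list **my_list** has only one element, a tuple, so the DataFrame has only one row. The tuple has two elements, so the DataFrame has two columns. Each element of the tuple is a list, so each resulting column is an array column.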

In [11]:
my_list = [(['a', 1], ['b', 2])]
df = spark.createDataFrame(my_list, ['x', 'y'])
df.show()

+------+------+
|     x|     y|
+------+------+
|[a, 1]|[b, 2]|
+------+------+
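
The row and column counts described above follow from the nesting of the Python list alone, so they can be checked without Spark. The following is a minimal sketch in plain Python; `shape_of` is a hypothetical helper introduced only for illustration and is not part of PySpark.

```python
def shape_of(rows):
    """Return (n_rows, n_cols) for a list of row-like records."""
    n_rows = len(rows)
    n_cols = len(rows[0]) if rows else 0
    return n_rows, n_cols

# One tuple holding two lists: one row, two columns, each cell is a list.
nested = [(['a', 1], ['b', 2])]
# Two lists of scalars: two rows, two columns, each cell is a scalar.
flat = [['a', 1], ['b', 2]]

nested_shape = shape_of(nested)
flat_shape = shape_of(flat)
print(nested_shape, flat_shape)
```

The only difference between the two inputs is the extra tuple wrapped around the inner lists, which turns each inner list from a row into a single cell.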





## Columns and related methods

There are two ways to create a column instance:

1. Directly from a *DataFrame*: `df.colName`
2. From a column expression: `df.colName + 1`

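The `Column` class comes with methods that operate on a column instance. ***In addition, almost all functions in the `pyspark.sql.functions` module take one or more column instances as arguments.*** These functions are essential for data manipulation.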

### Methods that take column names as arguments

* `corr(col1, col2)`: two column names.
* `cov(col1, col2)`: two column names.
* `crosstab(col1, col2)`: two column names.
* `describe(*cols)`: ***`*cols` refers to only column names (strings).***

### Methods that take column names, column expressions, or both as arguments

* `cube(*cols)`: column names (string) or column expressions or **both**.
* `drop(*cols)`: ***a list of column names OR a single column expression.***
* `groupBy(*cols)`: column name (string) or column expression or **both**.
* `rollup(*cols)`: column name (string) or column expression or **both**.
* `select(*cols)`: column name (string) or column expression or **both**.
* `sort(*cols, **kwargs)`: column name (string) or column expression or **both**.
* `sortWithinPartitions(*cols, **kwargs)`: column name (string) or column expression or **both**.
* `orderBy(*cols, **kwargs)`: column name (string) or column expression or **both**.
* `sampleBy(col, fractions, seed=None)`: a column name.
* `toDF(*cols)`: **a list of column names (string).**
* `withColumn(colName, col)`: `colName` refers to column name; `col` refers to a column expression.
* `withColumnRenamed(existing, new)`: takes column names as arguments.
* `filter(condition)`: **`condition`** refers to a column expression that evaluates to `types.BooleanType` values, or to a SQL expression string.