# Chapter 7: Data Wrangling: Clean, Transform, Merge, Reshape

# Contents
1. Handling missing data  
    1.1 Filtering out missing values: dropna()  
    1.2 Filling in missing values: fillna()

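pandas is well suited to data cleaning.

# 1. Handling missing data

For numeric data, pandas uses the floating-point value NaN (Not a Number) to represent missing values. NaN acts as a sentinel value that the isnull() function can detect easily.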

In [1]:
import pandas as pd
import numpy as np

In [2]:
string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])
string_data

0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object

In [3]:
string_data.isnull()

0    False
1    False
2     True
3    False
dtype: bool

The built-in Python None value is also treated as NA:

In [4]:
string_data[0] = None

In [5]:
string_data.isnull()

0     True
1    False
2     True
3    False
dtype: bool
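
The same handling applies to numeric data. The following is a minimal sketch (the variable names `numeric` and `nan_equals_itself` are made up for illustration) showing that None placed in a numeric Series is stored as NaN, and why isnull() is needed instead of an equality test:

```python
import numpy as np
import pandas as pd

# None inside numeric data is converted to NaN, so the dtype becomes float64.
numeric = pd.Series([1, None, 3])
print(numeric.dtype)
print(numeric.isnull().tolist())

# NaN does not compare equal to itself, so == cannot be used to find it.
nan_equals_itself = np.nan == np.nan
print(nan_equals_itself)
```

Because the equality comparison is always False for NaN, isnull() and notnull() are the reliable way to locate missing values.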

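The functions for handling missing values are:  
1. dropna: drops missing values
2. fillna: fills in missing values with a given value or a fill method
3. isnull: returns a boolean mask marking which values are missing
4. notnull: the negation of isnull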
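
As a small illustration of how the detection functions in this list relate to each other, the sketch below counts missing values and checks that notnull is the complement of isnull. The Series `s` and the helper variable names are hypothetical example values, not part of the chapter's data:

```python
import numpy as np
import pandas as pd

s = pd.Series(["a", np.nan, None, "d"])

# isnull() gives a boolean mask; summing it counts the missing values.
n_missing = int(s.isnull().sum())

# notnull() is the element-wise negation of isnull().
is_complement = s.notnull().equals(~s.isnull())

# isna() is an alias of isnull() and returns the same mask.
same_as_alias = s.isna().equals(s.isnull())

print(n_missing, is_complement, same_as_alias)
```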

# 1.1 Filtering out missing values: dropna()

# Handling missing values in a Series

In [6]:
from numpy import nan as NA

In [7]:
data = pd.Series([1, NA, 3.5, NA, 7])

In [8]:
data.dropna()

0    1.0
2    3.5
4    7.0
dtype: float64

You can also filter with notnull by using it as a boolean index:

In [11]:
data[data.notnull()]

0    1.0
2    3.5
4    7.0
dtype: float64

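# Handling missing values in a DataFrame
1. By default, dropna drops every row that contains a missing value
2. Setting how='all' drops only rows that are entirely NA
3. axis selects whether rows or columns are dropped
4. thresh sets a minimum number of non-NA values

With a DataFrame things are a little more complicated: you may want to drop rows or columns that contain NA. By default, dropna drops any row that contains a missing value.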

In [12]:
data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],
                     [NA, NA, NA], [NA, 6.5, 3.]])
data

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


In [13]:
data1 = data.dropna()

In [14]:
data1

Unnamed: 0,0,1,2
0,1.0,6.5,3.0


Passing how='all' drops only the rows in which every value is NA:

In [15]:
data2 = data.dropna(how='all')

In [16]:
data2

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
3,,6.5,3.0


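To drop columns instead of rows, set axis=1. Combined with how='all', only columns that are entirely NA are dropped; this data has no such column, so the result is unchanged:

In [19]:
data3 = data.dropna(axis=1, how='all')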

In [20]:
data3

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0
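
To make the difference between the two `how` settings along axis=1 concrete, here is a minimal sketch that rebuilds the same data under the assumed name `frame` and adds a hypothetical all-NA column labeled 4:

```python
import numpy as np
import pandas as pd

NA = np.nan
frame = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],
                      [NA, NA, NA], [NA, 6.5, 3.]])

# Default how='any': every column contains at least one NA, so all are dropped.
any_dropped = frame.dropna(axis=1)

# how='all': no column is entirely NA, so nothing is dropped.
all_dropped = frame.dropna(axis=1, how='all')

# Add a column that is entirely NA; how='all' now removes only that column.
with_empty = frame.copy()
with_empty[4] = NA
cleaned = with_empty.dropna(axis=1, how='all')

print(any_dropped.shape, all_dropped.shape, list(cleaned.columns))
```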


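The thresh parameter of dropna sets a threshold: rows with fewer than thresh non-NA values are removed. With axis=1, the same rule is applied to columns.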

In [42]:
df = pd.DataFrame(np.random.randn(7, 3))
df.iloc[:4, 1] = NA
df.iloc[:2, 2] = NA
df

Unnamed: 0,0,1,2
0,-0.293006,,
1,-0.359534,,
2,-0.404661,,0.658974
3,0.408151,,-1.247985
4,-0.593691,1.018129,2.301734
5,-0.042249,-1.286548,0.442065
6,-1.290267,0.692825,1.698135


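In [22]:
# Note: this output was captured before df was regenerated in In [42],
# so the random values differ from the table above.
df.dropna()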

Unnamed: 0,0,1,2
4,0.39729,0.64585,-0.665025
5,-0.712725,-1.562976,2.103419
6,-0.202565,0.174872,0.090411


In [37]:
# Keep only the columns that have at least 4 non-NA values
df.dropna(thresh=4, axis=1)

Unnamed: 0,0,2
0,0.378179,
1,0.012029,
2,0.586739,1.214814
3,0.327175,-0.809736
4,0.39729,-0.665025
5,-0.712725,2.103419
6,-0.202565,0.090411
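
Because the frames above are built from random numbers, a deterministic sketch may make the thresh rule easier to verify by hand. The frame `small` and its column names are invented for this example:

```python
import numpy as np
import pandas as pd

small = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0],
    "b": [np.nan, 2.0, np.nan, 4.0],
    "c": [np.nan, np.nan, np.nan, 4.0],
})

# Rows: keep those with at least 2 non-NA values.
# Row 0 has 1, row 1 has 2, row 2 has 0, row 3 has 3.
rows_kept = small.dropna(thresh=2)

# Columns: keep those with at least 2 non-NA values.
# Column a has 3, column b has 2, column c has 1.
cols_kept = small.dropna(thresh=2, axis=1)

print(list(rows_kept.index), list(cols_kept.columns))
```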


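# 1.2 Filling in missing values: fillna()
1. Calling with a constant
2. Calling with a dict
3. Returns a new object; pass inplace=True to modify the original
4. Filling with a given number, the mean, or the median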

In [38]:
df.fillna(0)

Unnamed: 0,0,1,2
0,0.378179,0.0,0.0
1,0.012029,0.0,0.0
2,0.586739,0.0,1.214814
3,0.327175,0.0,-0.809736
4,0.39729,0.64585,-0.665025
5,-0.712725,-1.562976,2.103419
6,-0.202565,0.174872,0.090411


In [39]:
df.fillna(999)

Unnamed: 0,0,1,2
0,0.378179,999.0,999.0
1,0.012029,999.0,999.0
2,0.586739,999.0,1.214814
3,0.327175,999.0,-0.809736
4,0.39729,0.64585,-0.665025
5,-0.712725,-1.562976,2.103419
6,-0.202565,0.174872,0.090411


You can also pass a dict to fillna to use a different fill value for each column:

In [40]:
df.fillna({1: 0.5, 2: 0})
# Fill the column labeled 1 with 0.5 and the column labeled 2 with 0

Unnamed: 0,0,1,2
0,0.378179,0.5,0.0
1,0.012029,0.5,0.0
2,0.586739,0.5,1.214814
3,0.327175,0.5,-0.809736
4,0.39729,0.64585,-0.665025
5,-0.712725,-1.562976,2.103419
6,-0.202565,0.174872,0.090411


fillna returns a new object, but you can pass inplace=True to modify the existing data directly:

In [41]:
df.fillna(0, inplace=True)
df

Unnamed: 0,0,1,2
0,0.378179,0.0,0.0
1,0.012029,0.0,0.0
2,0.586739,0.0,1.214814
3,0.327175,0.0,-0.809736
4,0.39729,0.64585,-0.665025
5,-0.712725,-1.562976,2.103419
6,-0.202565,0.174872,0.090411
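
The contrast between the two calling styles can be sketched as follows; the Series `original` and the other names are hypothetical and only serve to show which object ends up modified:

```python
import numpy as np
import pandas as pd

original = pd.Series([1.0, np.nan, 3.0])

# Without inplace, fillna builds a new Series and leaves the original alone.
filled = original.fillna(0)
still_missing = int(original.isnull().sum())

# With inplace=True, the original object itself is modified.
original.fillna(0, inplace=True)
missing_after = int(original.isnull().sum())

print(still_missing, missing_after, filled.tolist())
```

Keeping the returned object instead of using inplace=True makes it easier to compare the data before and after filling.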


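To replace NaN with the previous value, use method='ffill';
to use the next value instead, use method='bfill'.  
The limit parameter caps how many consecutive NaN values are filled in each column.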

In [46]:
df = pd.DataFrame(np.random.randn(6, 3))
df.iloc[2:, 1] = NA
df.iloc[4:, 2] = NA
df

Unnamed: 0,0,1,2
0,-0.833254,0.043804,-1.140954
1,1.146979,1.21273,-0.158403
2,-0.461293,,0.958895
3,0.031554,,-0.207975
4,-2.672036,,
5,-0.273125,,


In [51]:
# Fill each missing value with the previous value in the same column
df.fillna(method='ffill')

Unnamed: 0,0,1,2
0,-0.833254,0.043804,-1.140954
1,1.146979,1.21273,-0.158403
2,-0.461293,1.21273,0.958895
3,0.031554,1.21273,-0.207975
4,-2.672036,1.21273,-0.207975
5,-0.273125,1.21273,-0.207975


In [50]:
df.fillna(method='ffill', limit=2)

Unnamed: 0,0,1,2
0,-0.833254,0.043804,-1.140954
1,1.146979,1.21273,-0.158403
2,-0.461293,1.21273,0.958895
3,0.031554,1.21273,-0.207975
4,-2.672036,,-0.207975
5,-0.273125,,-0.207975
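
Recent pandas releases (2.1 and later, as far as I am aware) deprecate passing method= to fillna in favor of the dedicated ffill() and bfill() methods. The sketch below uses those methods on a small made-up frame named `gaps` to show forward fill with a limit and backward fill:

```python
import numpy as np
import pandas as pd

gaps = pd.DataFrame({"x": [1.0, np.nan, np.nan, np.nan, 5.0]})

# Forward fill, but fill at most 2 consecutive NaN values:
# the third NaN in the run stays missing.
forward = gaps.ffill(limit=2)

# Backward fill with no limit: every NaN takes the next valid value (5.0).
backward = gaps.bfill()

print(forward["x"].tolist())
print(backward["x"].tolist())
```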


Filling with the mean:

In [52]:
data = pd.Series([1., NA, 3.5, NA, 7])
data.fillna(data.mean())

0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64
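
The median works the same way, and for a DataFrame the per-column means can be passed in one call, since fillna matches a Series of fill values to columns by label. A minimal sketch, with `values` and `table` as invented example data:

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, np.nan, 3.5, np.nan, 7.0])

# The median of the non-missing values (1.0, 3.5, 7.0) is 3.5.
median_filled = values.fillna(values.median())

table = pd.DataFrame({"a": [1.0, np.nan, 3.0],
                      "b": [np.nan, 10.0, 20.0]})

# table.mean() is a Series indexed by column name, so each column
# is filled with its own mean (2.0 for a, 15.0 for b).
mean_filled = table.fillna(table.mean())

print(median_filled.tolist())
print(mean_filled)
```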

Some of the parameters of fillna are listed below:

In [54]:
from IPython.display import Image
Image(url= "ch7/1.png",width=500, height=300)