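Missing data occurs in many data analysis applications, and one of the goals of pandas is to make working with missing values as painless as possible.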

For example, all descriptive statistics on pandas objects exclude missing values by default.
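As a small illustration of that default, the sketch below uses a made-up three-element Series (the values and variable names are assumptions for illustration, not part of the original notebook). It compares `mean()` with and without `skipna`, and shows that `count()` only counts non-missing entries.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values, chosen only for illustration.
s = pd.Series([1.0, np.nan, 3.0])

mean_skipping_na = s.mean()            # NaN is ignored: (1.0 + 3.0) / 2
mean_with_na = s.mean(skipna=False)    # NaN propagates into the result
non_missing_count = s.count()          # counts only the non-missing entries

print(mean_skipping_na, mean_with_na, non_missing_count)
```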

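The way pandas objects represent missing values is not perfect, but it works well for most users.
* For numeric data, pandas uses the floating-point value NaN (Not a Number) to represent missing values. NaN serves as a sentinel value that is easy to detect.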


In [1]:
import pandas as pd 
import numpy as np

In [2]:
string_data = pd.Series(['beijing','shanghai',np.nan,'chengdu'])

In [3]:
string_data

0     beijing
1    shanghai
2         NaN
3     chengdu
dtype: object

In [4]:
string_data.isnull()

0    False
1    False
2     True
3    False
dtype: bool
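One reason a dedicated check such as `isnull` is useful: NaN never compares equal to anything, including itself, so an equality test cannot find it. The following is a minimal sketch with made-up values; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# NaN is not equal to itself under floating-point comparison rules.
nan_equals_itself = (np.nan == np.nan)

# Hypothetical sample values.
s = pd.Series([1.0, np.nan, 3.0])

mask_by_equality = (s == np.nan)   # every entry is False, so the NaN is missed
mask_by_isnull = s.isnull()        # True only where the value is missing

print(nan_equals_itself)
print(mask_by_equality.tolist())
print(mask_by_isnull.tolist())
```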

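**pandas adopts the convention from the R programming language of referring to missing data as NA, which stands for not available. In statistics applications, NA data may be either data that does not exist or data that exists but was not observed (for example, because of problems during data collection). When cleaning data for analysis, it is often important to analyze the missing data itself, in order to identify data collection problems or potential biases caused by the missing data.**
* Python's built-in None value is also treated as NA in object arrays.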

In [5]:
string_data[0] = None

In [6]:
string_data

0        None
1    shanghai
2         NaN
3     chengdu
dtype: object

In [7]:
string_data.isnull()

0     True
1    False
2     True
3    False
dtype: bool
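To make the handling of `None` a little more concrete, here is a hedged sketch with made-up values. It assumes an explicit `dtype=object` for the string Series so that `None` is stored as-is, and contrasts that with a float Series, where `None` is converted to NaN on construction. Both are reported as missing by `isnull`.

```python
import pandas as pd

# dtype=object is passed explicitly so None is kept as a Python object.
obj_series = pd.Series(["shanghai", None, "chengdu"], dtype=object)

# In a float Series, None is converted to NaN when the Series is built.
float_series = pd.Series([1.0, None, 3.0])

print(obj_series.isnull().tolist())
print(float_series.dtype, float_series.isnull().tolist())
```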

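### Table 7-1: NA handling methods
* dropna         Filter axis labels based on whether the values for each label contain missing data, with a threshold for how much missing data to tolerate
* fillna         Fill in missing data with some value or with an interpolation method such as 'ffill' or 'bfill'
* isnull         Return Boolean values indicating which values are missing
* notnull        Negation of isnull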
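The four methods in the table can be tried side by side on one small Series. This is only a sketch with illustrative values and variable names:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.5, np.nan, 7.0])

missing_mask = s.isnull()            # True where a value is missing
present_mask = s.notnull()           # the element-wise negation of isnull
n_missing = int(missing_mask.sum())  # True counts as 1 when summed
dropped = s.dropna()                 # missing entries removed
filled = s.fillna(0.0)               # missing entries replaced by 0.0

print(n_missing)
print(dropped.tolist())
print(filled.tolist())
```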

## 7.1.1 Filtering Out Missing Data


In [8]:
from numpy import nan as NA

In [9]:
data = pd.Series([1,NA,3.5,NA,7])

In [10]:
data.dropna()

0    1.0
2    3.5
4    7.0
dtype: float64

In [12]:
data[data.notnull()]

0    1.0
2    3.5
4    7.0
dtype: float64

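With DataFrame objects, things are a bit more complex, because you may want to drop rows or columns that are entirely NA, or only those that contain at least one NA.

**By default, dropna drops any row that contains a missing value.**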

In [14]:
data = pd.DataFrame([[1,6.5,3.],[1.,NA,NA],[NA,NA,NA],[NA,6.5,5]])

In [15]:
data

     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  5.0


In [16]:
cleaned = data.dropna()

In [18]:
cleaned  # by default, dropna drops every row that contains any NA

     0    1    2
0  1.0  6.5  3.0


In [20]:
data.dropna(how='all')  # how='all' drops only the rows whose values are all NA

     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
3  NaN  6.5  5.0


To drop columns in the same way, pass axis=1.

In [21]:
data[4]=NA

In [22]:
data

     0    1    2   4
0  1.0  6.5  3.0 NaN
1  1.0  NaN  NaN NaN
2  NaN  NaN  NaN NaN
3  NaN  6.5  5.0 NaN


In [23]:
data.dropna(how='all',axis=1)

     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  5.0
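The `how` and `axis` options can be compared on one deterministic frame. The sketch below uses made-up data and column names (`a`, `b`, `c` are illustrative assumptions), where column `c` is entirely NA and the last row is entirely NA.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [2.0, 5.0, np.nan],
    "c": [np.nan, np.nan, np.nan],
})

rows_any = frame.dropna()                    # every row has some NA, so nothing is left
rows_all = frame.dropna(how="all")           # drops only the all-NA row (label 2)
cols_all = frame.dropna(axis=1, how="all")   # drops only the all-NA column "c"

# Dropping the empty column first lets the default row filter keep row 0.
chained = frame.dropna(axis=1, how="all").dropna()

print(len(rows_any), list(rows_all.index), list(cols_all.columns), list(chained.index))
```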


In [25]:
df = pd.DataFrame(np.random.randn(7,3))

In [26]:
df

          0         1         2
0  1.048948 -0.525648 -0.955399
1  0.927433 -0.373156 -1.680487
2  1.688831  1.219370 -0.331118
3  0.383837  0.840717 -1.270647
4 -0.621212  1.218101 -0.155079
5 -0.460994  0.659049  0.414698
6 -0.197211 -0.871988 -1.763401


In [27]:
df.iloc[:4,1]=NA

In [28]:
df.iloc[:2,2]=NA

In [29]:
df

          0         1         2
0  1.048948       NaN       NaN
1  0.927433       NaN       NaN
2  1.688831       NaN -0.331118
3  0.383837       NaN -1.270647
4 -0.621212  1.218101 -0.155079
5 -0.460994  0.659049  0.414698
6 -0.197211 -0.871988 -1.763401


In [30]:
df.dropna()

          0         1         2
4 -0.621212  1.218101 -0.155079
5 -0.460994  0.659049  0.414698
6 -0.197211 -0.871988 -1.763401


In [37]:
df.dropna(thresh=2)  # keep only the rows that have at least 2 non-NA values

          0         1         2
2  1.688831       NaN -0.331118
3  0.383837       NaN -1.270647
4 -0.621212  1.218101 -0.155079
5 -0.460994  0.659049  0.414698
6 -0.197211 -0.871988 -1.763401
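One way to read `thresh` is as a minimum count of non-NA values per row. The sketch below, on made-up data, checks that `dropna(thresh=2)` keeps the same rows as an explicit Boolean mask built from `notnull()`. The mask version is only an illustration of the rule, not how pandas implements it.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame([
    [1.0, np.nan, np.nan],   # 1 non-NA value: dropped
    [1.0, 2.0, np.nan],      # 2 non-NA values: kept
    [1.0, 2.0, 3.0],         # 3 non-NA values: kept
])

kept_by_thresh = frame.dropna(thresh=2)

# Equivalent filter written out by hand.
non_na_per_row = frame.notnull().sum(axis=1)
kept_by_mask = frame[non_na_per_row >= 2]

print(list(kept_by_thresh.index))
```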


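## 7.1.2 Filling In Missing Data
In most cases, the fillna method is used to fill in missing values. Calling fillna with a constant replaces every missing value with that constant.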

In [39]:
df.fillna(0)  # replace missing values with the constant 0

          0         1         2
0  1.048948  0.000000  0.000000
1  0.927433  0.000000  0.000000
2  1.688831  0.000000 -0.331118
3  0.383837  0.000000 -1.270647
4 -0.621212  1.218101 -0.155079
5 -0.460994  0.659049  0.414698
6 -0.197211 -0.871988 -1.763401


**Calling fillna with a dict lets you use a different fill value for each column.**

In [42]:
df.fillna({1:0.5,2:0})  # the dict keys are column labels; each value is the fill value for that column

          0         1         2
0  1.048948  0.500000  0.000000
1  0.927433  0.500000  0.000000
2  1.688831  0.500000 -0.331118
3  0.383837  0.500000 -1.270647
4 -0.621212  1.218101 -0.155079
5 -0.460994  0.659049  0.414698
6 -0.197211 -0.871988 -1.763401
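A detail worth checking with the dict form is what happens to columns that are not listed as keys. In the sketch below (made-up data, with hypothetical column names `x`, `y`, `z`), only `x` and `y` are filled, and the missing value in `z` is left untouched.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    "x": [np.nan, 1.0],
    "y": [np.nan, 2.0],
    "z": [np.nan, 3.0],
})

# Only the columns named in the dict are filled.
filled = frame.fillna({"x": 0.5, "y": 0.0})

print(filled["x"].tolist(), filled["y"].tolist(), filled["z"].isnull().tolist())
```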


fillna returns a new object, but you can also modify the existing object in place.

In [43]:
df.fillna(0,inplace=True)

In [45]:
df  # unlike the df.fillna() calls above, inplace=True modifies df itself

          0         1         2
0  1.048948  0.000000  0.000000
1  0.927433  0.000000  0.000000
2  1.688831  0.000000 -0.331118
3  0.383837  0.000000 -1.270647
4 -0.621212  1.218101 -0.155079
5 -0.460994  0.659049  0.414698
6 -0.197211 -0.871988 -1.763401
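The difference between the two calling styles can be verified by counting missing values before and after each call. This is a minimal sketch on made-up data; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"a": [1.0, np.nan], "b": [np.nan, 2.0]})

# Without inplace, the result is a new object and frame is unchanged.
filled_copy = frame.fillna(0)
missing_before = int(frame.isnull().sum().sum())

# With inplace=True, frame itself is modified.
frame.fillna(0, inplace=True)
missing_after = int(frame.isnull().sum().sum())

print(missing_before, missing_after)
```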


The same interpolation methods that are available for reindexing can also be used with fillna.

In [46]:
df = pd.DataFrame(np.random.randn(6,3))

In [47]:
df.iloc[2:,1] =NA

In [48]:
df.iloc[4:,2]=NA

In [50]:
df

          0         1         2
0  1.439808 -0.124123 -0.489843
1 -1.377900 -0.934427  0.060328
2 -0.936085       NaN -1.645680
3 -1.205642       NaN  0.324014
4 -0.603823       NaN       NaN
5  1.162737       NaN       NaN


In [51]:
df.fillna(method='ffill')  # forward fill; newer pandas versions deprecate the method argument in favor of df.ffill()

          0         1         2
0  1.439808 -0.124123 -0.489843
1 -1.377900 -0.934427  0.060328
2 -0.936085 -0.934427 -1.645680
3 -1.205642 -0.934427  0.324014
4 -0.603823 -0.934427  0.324014
5  1.162737 -0.934427  0.324014


In [57]:
df.fillna(method='ffill',limit=2)  # limit caps how many consecutive missing values are filled

          0         1         2
0  1.439808 -0.124123 -0.489843
1 -1.377900 -0.934427  0.060328
2 -0.936085 -0.934427 -1.645680
3 -1.205642 -0.934427  0.324014
4 -0.603823       NaN  0.324014
5  1.162737       NaN  0.324014
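Because the frames above are random, the effect of `limit` is easier to see on fixed values. The sketch below uses the `ffill()` and `bfill()` methods, which are assumed here as the equivalents of `fillna(method='ffill')` and `fillna(method='bfill')`; the data is made up for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

forward = s.ffill()                 # carry the last valid value forward
forward_limited = s.ffill(limit=2)  # fill at most 2 consecutive gaps
backward = s.bfill()                # pull the next valid value backward

print(forward.tolist())
print(forward_limited.isnull().tolist())
print(backward.tolist())
```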


In [58]:
data = pd.Series([1.,NA,3.5,NA,7])

In [60]:
data.fillna(data.mean())   # fill a Series with its own mean; mean() skips NA values

0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64

In [63]:
df.fillna({1:df.iloc[:2,1].mean(),2:df.iloc[:4,2].mean()})   # a DataFrame can likewise be filled with per-column means via a dict

          0         1         2
0  1.439808 -0.124123 -0.489843
1 -1.377900 -0.934427  0.060328
2 -0.936085 -0.529275 -1.645680
3 -1.205642 -0.529275  0.324014
4 -0.603823 -0.529275 -0.437795
5  1.162737 -0.529275 -0.437795
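Since `mean()` already skips NA values, slicing out the non-missing rows by position is not strictly needed. A shorter variant, sketched here on made-up data with hypothetical column names, passes the Series of column means straight to `fillna`, which matches the means to columns by label.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],   # mean of the non-missing values is 2.0
    "b": [np.nan, 4.0, 8.0],   # mean of the non-missing values is 6.0
})

column_means = frame.mean()          # one mean per column, NA values skipped
filled = frame.fillna(column_means)  # each column is filled with its own mean

print(filled["a"].tolist(), filled["b"].tolist())
```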


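### Table 7-2: fillna function arguments
* value   Scalar value or dict-like object used to fill missing values
* method  Fill method, such as 'ffill' or 'bfill'; the default is None, so either value or method must be supplied
* axis    Axis to fill along; the default is axis=0
* inplace Modify the calling object instead of producing a copy. The default is to return a copy; set inplace=True to modify the calling object
* limit   For forward and backward filling, the maximum number of consecutive missing values to fill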
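The `axis` argument is the one option in this table that the cells above do not exercise. The following sketch shows the idea on made-up data, using the `ffill()` method as an assumed stand-in for `fillna(method='ffill')`: with `axis=0` values are carried down each column, and with `axis=1` they are carried across each row.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame([
    [1.0, np.nan, np.nan],
    [np.nan, 2.0, np.nan],
])

down_rows = frame.ffill(axis=0)        # fill each column from the row above
across_columns = frame.ffill(axis=1)   # fill each row from the column to the left

print(down_rows.iloc[:, 0].tolist())
print(across_columns.iloc[0].tolist())
```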