## Handling Missing Data

### 1. Filtering Out Missing Data
`dropna` 
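- On a Series, it returns a new Series holding only the non-null values together with their index labels
- On a DataFrame, it drops by default every row that contains at least one missing value  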
  
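Parameters:  
`how='all'` drops only the rows in which every value is NA  
`axis=1` drops columns instead of rows  
`thresh=n` (n is an integer) keeps only the rows/columns that have at least n non-NaN values  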

In [1]:
import numpy as np
import pandas as pd
from numpy import nan as NA

In [2]:
data = pd.Series([1, NA, 4, NA, 9])  # dropna on a Series
data.dropna()  # equivalent to data[data.notnull()]

0    1.0
2    4.0
4    9.0
dtype: float64
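
The equivalence noted in the comment above can be checked directly. The following is a minimal sketch; the names `s`, `dropped` and `masked` are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, np.nan, 4, np.nan, 9])

dropped = s.dropna()        # drop missing values
masked = s[s.notnull()]     # boolean-mask selection of non-null values

# Both approaches keep the same values under the same index labels.
print(dropped.equals(masked))
print(list(dropped.index))

# dropna returns a new object; the original Series still has 5 entries.
print(len(s))
```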

In [3]:
data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA], [NA, NA, NA], [NA, 6.5, 3.]])
data

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


In [4]:
data.dropna()  # keep only the rows that contain no NA

Unnamed: 0,0,1,2
0,1.0,6.5,3.0


In [5]:
data.dropna(how='all')  # drop only the rows that are entirely NA

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
3,,6.5,3.0


In [6]:
data[3] = NA
data

Unnamed: 0,0,1,2,3
0,1.0,6.5,3.0,
1,1.0,,,
2,,,,
3,,6.5,3.0,


In [7]:
data.dropna(axis=1, how='all')  # drop only the columns that are entirely NA

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


In [8]:
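# Keep the rows with at least 2 non-NaN values: row 1 has 1 and row 2 has 0, so both are dropped,
# while rows 0 and 3 have 3 and 2 non-NaN values respectively, so both are kept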
data.dropna(thresh=2)

Unnamed: 0,0,1,2,3
0,1.0,6.5,3.0,
3,,6.5,3.0,
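
One way to read `thresh` is as a filter on the per-row count of non-NA values. The sketch below rebuilds the same frame under the illustrative name `df` and compares `dropna(thresh=2)` with an explicit mask; it is meant as an illustration of the rule, not as a description of how pandas implements it.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1., 6.5, 3.],
                   [1., np.nan, np.nan],
                   [np.nan, np.nan, np.nan],
                   [np.nan, 6.5, 3.]])
df[3] = np.nan  # extra column that is entirely NA

# Count the non-NA values in each row.
non_na_per_row = df.notna().sum(axis=1)
print(non_na_per_row.tolist())

# Keeping the rows whose count is >= 2 should match dropna(thresh=2).
by_mask = df[non_na_per_row >= 2]
by_thresh = df.dropna(thresh=2)
print(by_thresh.equals(by_mask))
```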


### 2. Filling In Missing Data
`fillna`
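- Called with a constant, it replaces every missing value with that constant
- Called with a dict, it fills each column with its own value; keys that do not match an existing column are ignored
- It returns a new object by default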
  
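Parameters  
`axis`: the axis along which to fill, default 0/'index' (fill down the rows within each column); 1/'columns' fills across the columns within each row  
`method`: the fill method, default `None`; `'ffill'` propagates the last valid value forward and `'bfill'` fills backward from the next valid value, much like the `method` argument of `reindex`  
`limit`: when `method` is given, the maximum number of consecutive missing values to fill in each gap, **not the total number of values that may be filled in a row/column**  
`inplace`: whether to modify the object directly instead of creating a copy, default `False`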
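
As far as I know, recent pandas releases (2.1 and later) emit a deprecation warning when `method=` is passed to `fillna` and recommend the dedicated `ffill()` and `bfill()` methods instead. The sketch below shows those two calls on a small Series; the variable names are illustrative, and the exact version in which the warning appears is an assumption worth checking against your installed pandas.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])

# Forward fill: each gap takes the last valid value before it.
forward = s.ffill()

# Backward fill: each gap takes the next valid value after it.
backward = s.bfill()

print(forward.tolist())
print(backward.tolist())
```

Both methods also accept `axis` and `limit`, with the same meaning as described above.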

In [9]:
data = pd.DataFrame(np.random.randint(50, size=35).reshape(
    7, 5), index=list('abcdefg'), columns=list('ABCDE'))
data.loc['c':'e', 'B'], data.loc['b':'d', 'C':'D'] = NA, NA
data.loc['b':'c', 'A'], data.loc['e':'g', 'A'] = NA, NA
data

Unnamed: 0,A,B,C,D,E
a,35.0,30.0,38.0,7.0,33
b,,28.0,,,15
c,,,,,10
d,5.0,,,,9
e,,,16.0,40.0,34
f,,44.0,11.0,7.0,35
g,,20.0,45.0,14.0,36


In [10]:
data.fillna(0)

Unnamed: 0,A,B,C,D,E
a,35.0,30.0,38.0,7.0,33
b,0.0,28.0,0.0,0.0,15
c,0.0,0.0,0.0,0.0,10
d,5.0,0.0,0.0,0.0,9
e,0.0,0.0,16.0,40.0,34
f,0.0,44.0,11.0,7.0,35
g,0.0,20.0,45.0,14.0,36


In [11]:
data.fillna({'B': 99, 'D': 55, 'X': 88})  # fill from a dict; the key for the nonexistent column 'X' is ignored

Unnamed: 0,A,B,C,D,E
a,35.0,30.0,38.0,7.0,33
b,,28.0,,55.0,15
c,,99.0,,55.0,10
d,5.0,99.0,,55.0,9
e,,99.0,16.0,40.0,34
f,,44.0,11.0,7.0,35
g,,20.0,45.0,14.0,36
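
Because the frame above is built from random integers, the dict behaviour may be easier to verify on a small fixed example. This is a minimal sketch with a hypothetical two-column frame `df`; the key `'X'` deliberately matches no column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, np.nan],
                   'B': [np.nan, 2.0]})

# 'B' gets its own fill value; 'X' matches no column and is ignored;
# 'A' has no entry in the dict, so its missing value stays missing.
filled = df.fillna({'B': 99, 'X': 88})

print(list(filled.columns))
print(filled['B'].tolist())
print(filled['A'].isna().tolist())
```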


In [12]:
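# Forward fill, with at most 2 consecutive values filled
# Note that in column A the values from rows a and d each fill the 2 entries that follow them; the limit is not 2 fills for the whole column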
data.fillna(method='ffill', limit=2)

Unnamed: 0,A,B,C,D,E
a,35.0,30.0,38.0,7.0,33
b,35.0,28.0,38.0,7.0,15
c,35.0,28.0,38.0,7.0,10
d,5.0,28.0,,,9
e,5.0,,16.0,40.0,34
f,5.0,44.0,11.0,7.0,35
g,,20.0,45.0,14.0,36
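
The per-gap meaning of `limit` can be reproduced on a fixed Series that mirrors column A above. This sketch uses `ffill(limit=2)`, the method form of the same operation, and illustrative variable names.

```python
import numpy as np
import pandas as pd

# Two gaps of three missing values each, after the values 1.0 and 5.0.
s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0, np.nan, np.nan, np.nan])

filled = s.ffill(limit=2)

# In each gap the first two entries are filled and the third stays missing,
# so 4 values are filled in total even though limit is 2.
print(filled.isna().tolist())
print(int(s.isna().sum() - filled.isna().sum()))
```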


In [14]:
# Backward fill across the columns (within each row), with at most 2 consecutive values filled
data.fillna(axis=1, method='bfill', limit=2)

Unnamed: 0,A,B,C,D,E
a,35.0,30.0,38.0,7.0,33.0
b,28.0,28.0,15.0,15.0,15.0
c,,,10.0,10.0,10.0
d,5.0,,9.0,9.0,9.0
e,16.0,16.0,16.0,40.0,34.0
f,44.0,44.0,11.0,7.0,35.0
g,20.0,20.0,45.0,14.0,36.0
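
The row-wise backward fill can likewise be checked on a small fixed frame. The sketch below uses `bfill(axis=1, limit=2)`, the method form of the call above, on a hypothetical frame `df`; note that a trailing missing value has nothing to its right to fill from, so it stays missing.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[np.nan, np.nan, np.nan, 4.0],
                   [1.0, np.nan, 3.0, np.nan]],
                  columns=list('ABCD'))

filled = df.bfill(axis=1, limit=2)

# Row 0: D=4.0 fills C and B (2 entries), while A is beyond the limit.
print(filled.iloc[0].isna().tolist())

# Row 1: B is filled from C; D has no value after it and stays missing.
print(filled.iloc[1].isna().tolist())
```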
