Handling Missing Data

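The way pandas represents missing data has some shortcomings, but it works well enough for most users. For numeric data, pandas uses the floating-point value NaN (Not a Number) to represent missing values. We call this a sentinel value, because it can be easily detected: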

In [1]:
import pandas as pd
import numpy as np

In [2]:
string_data=pd.Series(['aardvark','artichoke',np.nan,'avocado'])
string_data

0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object

In [3]:
string_data.isnull()

0    False
1    False
2     True
3    False
dtype: bool
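One reason to detect the sentinel with isnull rather than an equality test is that NaN never compares equal to anything, including itself. The following is a minimal sketch of that behavior; the Series `values` is invented for illustration and is not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Illustrative data (not from the notebook above)
values = pd.Series([1.0, np.nan, 3.0])

# NaN != NaN, so an equality comparison never finds the missing entry
eq_mask = values == np.nan

# isnull checks for the sentinel directly
null_mask = values.isnull()

print(eq_mask.tolist())    # [False, False, False]
print(null_mask.tolist())  # [False, True, False]
```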

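pandas adopts a convention from the R language and refers to missing values as NA (not available). In statistical applications, NA data may be either data that does not exist or data that exists but was not observed. When cleaning data, it is important to analyze the missing values themselves, to determine whether they point to a data collection problem or could introduce bias.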
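As a rough illustration of what analyzing missing values can look like in practice, the sketch below counts NA entries per column and computes the share of rows affected. The `survey` frame and its column names are hypothetical, made up for this example only.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data, invented for illustration
survey = pd.DataFrame({
    "age": [34, np.nan, 29, 41],
    "income": [np.nan, np.nan, 52000.0, 61000.0],
    "city": ["Lyon", "Oslo", None, "Kyoto"],
})

# Number of missing entries in each column
missing_count = survey.isnull().sum()

# Fraction of rows that are missing in each column
missing_share = survey.isnull().mean()

print(missing_count)
print(missing_share)
```

A column with an unusually high share of missing entries is a reasonable first place to look for a collection problem.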

The built-in Python None value is also treated as NA:


In [4]:
string_data[0]=None

In [5]:
string_data.isnull()

0     True
1    False
2     True
3    False
dtype: bool
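The same treatment applies to numeric data. As a small sketch (the Series `numeric` is an assumption for illustration), a None placed among floats is stored as NaN and the Series keeps a float dtype:

```python
import pandas as pd

# None among floats is converted to NaN when the Series is built
numeric = pd.Series([1.0, None, 3.0])

print(numeric.dtype)              # float64
print(numeric.isnull().tolist())  # [False, True, False]
```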

1 Filtering Out Missing Data

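There are several ways to filter out missing data. You can do it by hand with pandas.isnull and boolean indexing, or you can use dropna. On a Series, dropna returns only the non-null data and the corresponding index values: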

In [6]:
from numpy import nan as NA

In [7]:
data=pd.Series([1,NA,3.5,NA,7])

In [8]:
data.dropna()

0    1.0
2    3.5
4    7.0
dtype: float64

In [9]:
data[data.notnull()]

0    1.0
2    3.5
4    7.0
dtype: float64
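The two approaches above give the same result, and neither modifies the original Series. A minimal check, rebuilding the same `data` Series with `np.nan` in place of the `NA` alias:

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 3.5, np.nan, 7])

dropped = data.dropna()          # built-in filter
masked = data[data.notnull()]    # boolean indexing by hand

# Same values, same index, same dtype
print(dropped.equals(masked))    # True

# The original Series still holds all five entries
print(len(data), len(dropped))   # 5 3
```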

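With a DataFrame, things are a bit more complex. You may want to drop rows or columns that contain NA values. By default, dropna drops any row that contains a missing value: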

In [10]:
data=pd.DataFrame([[1.,6.5,3.],[1.,NA,NA],[NA,NA,NA],[NA,6.5,3.]])

In [11]:
data

     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0


In [12]:
cleaned=data.dropna()

In [13]:
cleaned

     0    1    2
0  1.0  6.5  3.0


Passing how='all' drops only the rows in which every value is NA:

In [14]:
data.dropna(how='all')

     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
3  NaN  6.5  3.0
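To make the difference between the default (how='any') and how='all' concrete, the sketch below reproduces both with explicit boolean masks. It rebuilds the same `data` frame using `np.nan`; the mask-based versions are only an illustration of the semantics, not how pandas implements dropna.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame([[1., 6.5, 3.],
                     [1., np.nan, np.nan],
                     [np.nan, np.nan, np.nan],
                     [np.nan, 6.5, 3.]])

any_dropped = data.dropna()            # default: drop rows with any NA
all_dropped = data.dropna(how="all")   # drop rows only if all values are NA

# Equivalent selections written as boolean masks
complete_rows = data[data.notnull().all(axis=1)]   # every value present
nonempty_rows = data[data.notnull().any(axis=1)]   # at least one value present

print(any_dropped.index.tolist())   # [0]
print(all_dropped.index.tolist())   # [0, 1, 3]
```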


In [15]:
# To drop columns in the same way, pass axis=1:

In [16]:
data[4]=NA

In [17]:
data

     0    1    2   4
0  1.0  6.5  3.0 NaN
1  1.0  NaN  NaN NaN
2  NaN  NaN  NaN NaN
3  NaN  6.5  3.0 NaN


In [18]:
data.dropna(axis=1,how='all')

     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0


In [19]:
df=pd.DataFrame(np.random.randn(7,3))

In [20]:
df

          0         1         2
0  1.485574  0.895165 -0.772143
1  1.239477  0.258548  0.812189
2 -0.564295 -0.901558 -0.651904
3  1.429559  1.494176 -0.309991
4  0.773164 -0.123700 -0.317100
5  0.268920 -0.416628  0.981193
6 -0.492067  0.381191 -0.997218


In [21]:
df.iloc[:4,1]=NA

In [23]:
df

          0         1         2
0  1.485574       NaN -0.772143
1  1.239477       NaN  0.812189
2 -0.564295       NaN -0.651904
3  1.429559       NaN -0.309991
4  0.773164 -0.123700 -0.317100
5  0.268920 -0.416628  0.981193
6 -0.492067  0.381191 -0.997218


In [24]:
df.iloc[:2,2]=NA

In [25]:
df

          0         1         2
0  1.485574       NaN       NaN
1  1.239477       NaN       NaN
2 -0.564295       NaN -0.651904
3  1.429559       NaN -0.309991
4  0.773164 -0.123700 -0.317100
5  0.268920 -0.416628  0.981193
6 -0.492067  0.381191 -0.997218


In [26]:
df.dropna()

          0         1         2
4  0.773164 -0.123700 -0.317100
5  0.268920 -0.416628  0.981193
6 -0.492067  0.381191 -0.997218


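To keep only the rows that contain at least a given number of non-NA values, pass the thresh argument:

In [28]: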
df.dropna(thresh=2)

          0         1         2
2 -0.564295       NaN -0.651904
3  1.429559       NaN -0.309991
4  0.773164 -0.123700 -0.317100
5  0.268920 -0.416628  0.981193
6 -0.492067  0.381191 -0.997218
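The thresh argument counts non-NA values per row, which is why rows 2 and 3 survive above while rows 0 and 1 do not. The sketch below shows the same rule on a small frame with fixed values; `frame` and its contents are invented for illustration so the result is reproducible without random numbers.

```python
import numpy as np
import pandas as pd

# Small fixed example (not the random frame from the notebook)
frame = pd.DataFrame({
    0: [1.0, 2.0, 3.0, 4.0],
    1: [np.nan, np.nan, 0.5, 0.7],
    2: [np.nan, 1.5, np.nan, 2.5],
})

# Keep rows with at least 2 non-NA values
kept = frame.dropna(thresh=2)

# The same rule written out by hand
non_na_counts = frame.notnull().sum(axis=1)
manual = frame[non_na_counts >= 2]

print(non_na_counts.tolist())  # [1, 2, 2, 3]
print(kept.index.tolist())     # [1, 2, 3]
```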


2 Filling In Missing Data

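Instead of dropping missing values, you can fill them in with some value. For most purposes, fillna is the method to use. Calling fillna with a constant replaces every missing value with that constant: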


In [29]:
df.fillna(0)

          0         1         2
0  1.485574  0.000000  0.000000
1  1.239477  0.000000  0.000000
2 -0.564295  0.000000 -0.651904
3  1.429559  0.000000 -0.309991
4  0.773164 -0.123700 -0.317100
5  0.268920 -0.416628  0.981193
6 -0.492067  0.381191 -0.997218


Passing a dict to fillna lets you use a different fill value for each column:

In [30]:
df.fillna({1:0.5,2:0})

          0         1         2
0  1.485574  0.500000  0.000000
1  1.239477  0.500000  0.000000
2 -0.564295  0.500000 -0.651904
3  1.429559  0.500000 -0.309991
4  0.773164 -0.123700 -0.317100
5  0.268920 -0.416628  0.981193
6 -0.492067  0.381191 -0.997218
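A Series indexed by column label works much like the dict above, which makes it easy to fill each column with a statistic computed from that column. The following sketch fills with column means; the frame `frame` is a made-up example, and mean imputation is shown only as one possible choice of fill value.

```python
import numpy as np
import pandas as pd

# Illustrative data (not from the notebook above)
frame = pd.DataFrame({"a": [1.0, np.nan, 3.0],
                      "b": [np.nan, 4.0, 8.0]})

# mean() skips NaN by default, giving one value per column
col_means = frame.mean()

# Each column's missing entries are filled with that column's mean
filled = frame.fillna(col_means)

print(filled["a"].tolist())  # [1.0, 2.0, 3.0]
print(filled["b"].tolist())  # [6.0, 4.0, 8.0]
```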


fillna returns a new object, but you can modify the existing object in place by passing inplace=True:

In [31]:
_=df.fillna(0,inplace=True)

In [32]:
df

          0         1         2
0  1.485574  0.000000  0.000000
1  1.239477  0.000000  0.000000
2 -0.564295  0.000000 -0.651904
3  1.429559  0.000000 -0.309991
4  0.773164 -0.123700 -0.317100
5  0.268920 -0.416628  0.981193
6 -0.492067  0.381191 -0.997218
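For comparison, here is a small sketch of the default, non-in-place behavior: the original object is left untouched, and the usual alternative to inplace=True is to rebind the name to the returned object. The frame `frame` is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative data (not from the notebook above)
frame = pd.DataFrame({"a": [1.0, np.nan], "b": [np.nan, 2.0]})

# Without inplace=True, fillna returns a new object
filled = frame.fillna(0)

# The original still contains its two missing values
missing_before = int(frame.isnull().sum().sum())
missing_in_copy = int(filled.isnull().sum().sum())
print(missing_before, missing_in_copy)  # 2 0

# Rebinding the name is a common alternative to inplace=True
frame = frame.fillna(0)
missing_after = int(frame.isnull().sum().sum())
print(missing_after)  # 0
```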


The same interpolation methods that are available for reindexing can also be used with fillna: