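## Representing missing values

pandas uses two sentinels to represent missing values: None and np.nan. None is Python's built-in null object (the single instance of NoneType), while np.nan is provided by NumPy and is an ordinary floating-point value (an IEEE 754 NaN).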

In [1]:
import pandas as pd
import numpy as np
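
To make the difference between the two sentinels concrete, here is a minimal sketch. The variable names are illustrative only and do not come from the original notebook.

```python
import numpy as np
import pandas as pd

# None is Python's null object; np.nan is a plain float
none_type = type(None).__name__   # 'NoneType'
nan_type = type(np.nan).__name__  # 'float'

# NaN never compares equal to itself, so == cannot be used to detect it
nan_equals_itself = (np.nan == np.nan)

# pd.isna() recognizes both sentinels
both_missing = pd.isna(None) and pd.isna(np.nan)

# A Series of strings also reports None as missing
str_mask = pd.Series(["a", None, "c"]).isna().tolist()

print(none_type, nan_type, nan_equals_itself, both_missing, str_mask)
```

Because NaN is not equal to itself, missing values should be located with isnull()/notnull() rather than with an equality comparison.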

## Detecting missing values

isnull(), notnull()

In [2]:
# When the Series holds numeric values, None is automatically converted to np.nan
ser = pd.Series([1, None, 3, np.nan, 5])
ser

0    1.0
1    NaN
2    3.0
3    NaN
4    5.0
dtype: float64

isnull() returns a boolean Series: True where the element is None or np.nan, False otherwise.

In [3]:
ser.isnull()

0    False
1     True
2    False
3     True
4    False
dtype: bool

Use the boolean mask to select the non-missing values.

In [4]:
ser[ser.notnull()]

0    1.0
2    3.0
4    5.0
dtype: float64
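
Since the mask is boolean, it can also be aggregated to count missing values. The following is a small sketch along those lines; the names n_missing, share_missing and same_mask are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

ser = pd.Series([1, None, 3, np.nan, 5])

# True counts as 1, so summing the mask counts the missing entries
n_missing = int(ser.isnull().sum())

# The mean of the mask is the fraction of missing entries
share_missing = float(ser.isnull().mean())

# isna() is an alias of isnull() and yields the same mask
same_mask = ser.isna().equals(ser.isnull())

print(n_missing, share_missing, same_mask)
```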

## Dropping missing values

dropna()

In [5]:
arr = np.array([
    [1, 2, np.nan],
    [np.nan, np.nan, np.nan],
    [np.nan, 4, 5],
    [6, 7, 8]
])
df = pd.DataFrame(arr, columns=["A", "B", "C"])

df

     A    B    C
0  1.0  2.0  NaN
1  NaN  NaN  NaN
2  NaN  4.0  5.0
3  6.0  7.0  8.0


By default, dropna() drops every row that contains at least one missing value.

In [6]:
df.dropna()

     A    B    C
3  6.0  7.0  8.0


With how="all", only rows in which every value is missing are dropped.

In [7]:
df.dropna(how="all")

     A    B    C
0  1.0  2.0  NaN
2  NaN  4.0  5.0
3  6.0  7.0  8.0
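
dropna() accepts a few other parameters that control what gets dropped. The sketch below rebuilds the same DataFrame and tries axis, thresh and subset; the result variable names are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1, 2, np.nan],
     [np.nan, np.nan, np.nan],
     [np.nan, 4, 5],
     [6, 7, 8]],
    columns=["A", "B", "C"],
)

# axis=1 drops columns instead of rows; every column here has a NaN
no_nan_cols = df.dropna(axis=1)

# thresh=2 keeps only rows with at least 2 non-missing values
at_least_two = df.dropna(thresh=2)

# subset restricts the check to the listed columns
a_present = df.dropna(subset=["A"])

print(list(no_nan_cols.columns))
print(list(at_least_two.index))
print(list(a_present.index))
```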


## Filling missing values

fillna()

In [8]:
df

     A    B    C
0  1.0  2.0  NaN
1  NaN  NaN  NaN
2  NaN  4.0  5.0
3  6.0  7.0  8.0


Fill with a single value, for example 0.

In [9]:
df.fillna(0)

     A    B    C
0  1.0  2.0  0.0
1  0.0  0.0  0.0
2  0.0  4.0  5.0
3  6.0  7.0  8.0


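Forward fill or backward fill: forward fill carries the last valid value in each column down into the gaps, and backward fill pulls the next valid value up. A leading gap (forward) or a trailing gap (backward) stays NaN.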

In [10]:
# fillna(method="ffill") is deprecated since pandas 2.1; ffill() gives the same result
df.ffill()

     A    B    C
0  1.0  2.0  NaN
1  1.0  2.0  NaN
2  1.0  4.0  5.0
3  6.0  7.0  8.0


In [11]:
# fillna(method="bfill") is deprecated since pandas 2.1; bfill() gives the same result
df.bfill()

     A    B    C
0  1.0  2.0  5.0
1  6.0  4.0  5.0
2  6.0  4.0  5.0
3  6.0  7.0  8.0
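
Directional filling can be capped so that long gaps are not filled completely. Here is a minimal sketch using the limit parameter of ffill() on the same data; the name capped is illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1, 2, np.nan],
     [np.nan, np.nan, np.nan],
     [np.nan, 4, 5],
     [6, 7, 8]],
    columns=["A", "B", "C"],
)

# limit=1 fills at most one consecutive NaN per gap.
# Column A has a two-row gap (rows 1 and 2), so only row 1 is filled.
capped = df.ffill(limit=1)

print(capped)
```

Column C still starts with NaN, because there is no earlier value to carry forward.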


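Fill with a value inferred from each variable (column), for example its mean. df.mean() skips NaN by default, so each column is filled with the mean of its observed values.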

In [12]:
df.fillna(df.mean())

     A         B    C
0  1.0  2.000000  6.5
1  3.5  4.333333  6.5
2  3.5  4.000000  5.0
3  6.0  7.000000  8.0
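
fillna() also accepts a dict that maps column names to fill values, so each column can get its own strategy. The sketch below is one possible combination; the choice of median for A and 0 for C is an assumption made purely for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1, 2, np.nan],
     [np.nan, np.nan, np.nan],
     [np.nan, 4, 5],
     [6, 7, 8]],
    columns=["A", "B", "C"],
)

# Per-column fill values: median for A, constant 0 for C.
# Columns missing from the dict (here B) are left untouched.
per_column = df.fillna({"A": df["A"].median(), "C": 0})

print(per_column)
```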
