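# Handling Missing Data with pandas
- Handling missing data
    - isnull() returns a boolean Series (or a boolean DataFrame when called on a DataFrame)
- Filtering out missing data
    - SeriesObj.dropna(); on a DataFrame, dropna also accepts thresh, e.g. thresh=3
    - SeriesObj[SeriesObj.notnull()] is equivalent to SeriesObj.dropna()
    - For a DataFrame: dropna(axis=1, how='all')
- Filling in missing data
    - fillna(0): fill with zero
    - fillna({1: 0.5, 3: -1, 2: 0}): the dict keys are column names
    - fillna(0, inplace=True)
    - fillna(method='ffill', limit=2)
    - fillna(SeriesObj.mean()): fill with the mean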

In [1]:
# coding:utf-8
from pandas import Series, DataFrame
import pandas as pd
import numpy as np

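## Handling missing data
For numeric data, pandas uses the floating-point value `NaN` (Not a Number) to represent missing data. We call this a sentinel value: a marker that can be easily detected. The same marker can also appear in object arrays, such as the Series of strings below: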

In [2]:
string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])
string_data

0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object

In [3]:
string_data.isnull()

0    False
1    False
2     True
3    False
dtype: bool

The built-in Python `None` value is also treated as NA in object arrays:

In [4]:
string_data[0] = None
string_data.isnull()

0     True
1    False
2     True
3    False
dtype: bool
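
As a small sketch of the two cases described above (the variable names `labels` and `values` are made up for illustration), the following shows that `isnull` flags both `None` and `NaN` in an object Series, and that `None` is stored as `NaN` when a float Series is built:

```python
import numpy as np
import pandas as pd

# Object-dtype Series: both None and np.nan count as missing.
labels = pd.Series(['aardvark', None, np.nan, 'avocado'])
missing_mask = labels.isnull()

# Float Series: None is converted to NaN when the Series is constructed.
values = pd.Series([1.0, None, 3.0])

print(missing_mask.tolist())         # [False, True, True, False]
print(values.dtype)                  # float64
print(int(values.isnull().sum()))    # 1
```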

Table 7-1. NA handling methods

Method | Description
---------|------------
dropna | Filter axis labels based on whether values for each label have missing data, with varying thresholds for how much missing data to tolerate.
fillna | Fill in missing data with some value or using an interpolation method such as 'ffill' or 'bfill'.
isnull | Return like-type object containing boolean values indicating which values are missing / NA.
notnull | Negation of isnull.

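## Filtering out missing data
Series: You can always filter out missing values by hand with `pandas.isnull` and boolean indexing, but `dropna` is more convenient. On a Series, it returns the Series with only the non-null data and index values: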

In [5]:
from numpy import nan as NA

data = pd.Series([1, NA, 3.5, NA, 7])
data.dropna()

0    1.0
2    3.5
4    7.0
dtype: float64

This is equivalent to:

In [6]:
data[data.notnull()]

0    1.0
2    3.5
4    7.0
dtype: float64

DataFrame: by default, `dropna` drops any row containing a missing value:

In [8]:
data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],
                     [NA, NA, NA], [NA, 6.5, 3.]])
data

     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0


In [9]:
cleaned = data.dropna()
cleaned

     0    1    2
0  1.0  6.5  3.0


Passing `how='all'` will only drop rows that are all NA:

In [10]:
data.dropna(how='all')

     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
3  NaN  6.5  3.0


To drop columns in the same way, pass `axis=1`:

In [11]:
data[4] = NA
data

     0    1    2   4
0  1.0  6.5  3.0 NaN
1  1.0  NaN  NaN NaN
2  NaN  NaN  NaN NaN
3  NaN  6.5  3.0 NaN


In [12]:
data.dropna(axis=1, how='all')

     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0
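
To see how `how` and `axis` combine, here is a minimal sketch on a small made-up frame (the name `frame` and the column names `a`, `b`, `empty` are illustrative assumptions, not part of the original notebook):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    'a': [1.0, 1.0, np.nan, np.nan],
    'b': [6.5, np.nan, np.nan, 6.5],
    'empty': [np.nan] * 4,
})

rows_any = frame.dropna()                    # default: how='any', axis=0
rows_all = frame.dropna(how='all')           # drop rows that are entirely NA
cols_all = frame.dropna(axis=1, how='all')   # drop columns that are entirely NA

print(len(rows_any))                # 0, because every row has at least one NA
print(rows_all.index.tolist())      # [0, 1, 3]
print(cols_all.columns.tolist())    # ['a', 'b']
```

Because the `empty` column is NA everywhere, the default `how='any'` removes every row, which is a common surprise on wide tables.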


Suppose you want to keep only rows containing at least a certain number of non-NA values. You can indicate this with the `thresh` argument; `thresh=3` keeps the rows that have three or more non-NA values:

In [13]:
df = pd.DataFrame(np.random.randn(7, 3))
df.iloc[:4, 1] = NA; df.iloc[:2, 2] = NA
df

          0         1         2
0 -0.250143       NaN       NaN
1 -1.336657       NaN       NaN
2  0.964016       NaN -0.231448
3 -0.013446       NaN -0.261505
4  0.768341 -1.103000  0.956387
5  0.323163 -0.073481 -0.508503
6  1.033444 -1.451471 -0.676776


In [14]:
df.dropna(thresh=3)

          0         1         2
4  0.768341 -1.103000  0.956387
5  0.323163 -0.073481 -0.508503
6  1.033444 -1.451471 -0.676776
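
One way to read `thresh` is as a minimum count of non-NA values per row. The sketch below, on a made-up frame with hypothetical names (`frame`, `min_non_na`), compares `dropna(thresh=...)` with the same filter written by hand:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    'x': [1.0, 2.0, 3.0, 4.0],
    'y': [np.nan, np.nan, 0.5, 0.7],
    'z': [np.nan, 1.5, np.nan, 2.5],
})

min_non_na = 2
kept = frame.dropna(thresh=min_non_na)

# The same filter by hand: count non-NA values in each row, then select.
non_na_per_row = frame.notnull().sum(axis=1)
by_hand = frame[non_na_per_row >= min_non_na]

print(kept.index.tolist())     # [1, 2, 3]
print(kept.equals(by_hand))    # True
```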


## Filling in missing data
Calling `fillna` with a constant replaces missing values with that value:

In [15]:
df.fillna(0)

          0         1         2
0 -0.250143  0.000000  0.000000
1 -1.336657  0.000000  0.000000
2  0.964016  0.000000 -0.231448
3 -0.013446  0.000000 -0.261505
4  0.768341 -1.103000  0.956387
5  0.323163 -0.073481 -0.508503
6  1.033444 -1.451471 -0.676776


Calling `fillna` with a dict lets you use a different fill value for each column. The keys are column labels; here the key `3` matches no column in `df`, so it has no effect:

In [16]:
df.fillna({1: 0.5, 3: -1, 2: 0})

          0         1         2
0 -0.250143  0.500000  0.000000
1 -1.336657  0.500000  0.000000
2  0.964016  0.500000 -0.231448
3 -0.013446  0.500000 -0.261505
4  0.768341 -1.103000  0.956387
5  0.323163 -0.073481 -0.508503
6  1.033444 -1.451471 -0.676776
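
Per-column filling is easier to follow with named columns. The following sketch uses a made-up frame (`frame`, `price`, and `qty` are illustrative names) and also shows that a Series indexed by column label, such as the result of `frame.mean()`, can stand in for the dict:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    'price': [10.0, np.nan, 30.0],
    'qty': [1.0, 2.0, np.nan],
})

# A different constant for each column.
by_dict = frame.fillna({'price': 0.0, 'qty': 1.0})

# frame.mean() is a Series indexed by column label, so each column
# is filled with its own mean.
by_mean = frame.fillna(frame.mean())

print(by_dict['price'].tolist())   # [10.0, 0.0, 30.0]
print(by_mean['price'].tolist())   # [10.0, 20.0, 30.0]
print(by_mean['qty'].tolist())     # [1.0, 2.0, 1.5]
```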


`fillna` returns a new object by default, but you can modify the existing object in place with `inplace=True`:

In [17]:
# with inplace=True, df itself is modified; pandas 1.x and 2.x return None here
_ = df.fillna(0, inplace=True)
df

          0         1         2
0 -0.250143  0.000000  0.000000
1 -1.336657  0.000000  0.000000
2  0.964016  0.000000 -0.231448
3 -0.013446  0.000000 -0.261505
4  0.768341 -1.103000  0.956387
5  0.323163 -0.073481 -0.508503
6  1.033444 -1.451471 -0.676776


The same interpolation methods available for reindexing can be used with `fillna`:

In [18]:
df = pd.DataFrame(np.random.randn(6, 3))
df.iloc[2:, 1] = NA; df.iloc[4:, 2] = NA
df

          0         1         2
0 -0.848459  0.583188 -0.272467
1 -0.657926  0.451103  0.568698
2  0.272880       NaN -1.428285
3 -0.310357       NaN -1.619695
4 -0.213444       NaN       NaN
5  0.512265       NaN       NaN


In [19]:
# pandas 2.1+ deprecates the `method` argument; df.ffill() gives the same result
df.fillna(method='ffill')

          0         1         2
0 -0.848459  0.583188 -0.272467
1 -0.657926  0.451103  0.568698
2  0.272880  0.451103 -1.428285
3 -0.310357  0.451103 -1.619695
4 -0.213444  0.451103 -1.619695
5  0.512265  0.451103 -1.619695


In [20]:
# on pandas 2.1+, the equivalent call is df.ffill(limit=2)
df.fillna(method='ffill', limit=2)

          0         1         2
0 -0.848459  0.583188 -0.272467
1 -0.657926  0.451103  0.568698
2  0.272880  0.451103 -1.428285
3 -0.310357  0.451103 -1.619695
4 -0.213444       NaN -1.619695
5  0.512265       NaN -1.619695
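
The forward fill above also exists as the dedicated `ffill` method, with `bfill` as its backward counterpart. A minimal sketch on a made-up Series (the name `s` is an assumption for illustration) shows how `limit` caps the number of consecutive values filled:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

forward = s.ffill()                  # carry the last valid value forward
forward_limited = s.ffill(limit=2)   # fill at most 2 consecutive NaNs
backward = s.bfill(limit=1)          # pull the next valid value back by 1

print(forward.tolist())           # [1.0, 1.0, 1.0, 1.0, 5.0]
print(forward_limited.tolist())   # [1.0, 1.0, 1.0, nan, 5.0]
print(backward.tolist())          # [1.0, nan, nan, 5.0, 5.0]
```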


You can also fill with a statistic of the data, such as the mean or median value of a Series:

In [21]:
data = pd.Series([1., NA, 3.5, NA, 7])
data.fillna(data.mean())

0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64
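
The median variant mentioned above works the same way. In this sketch the data is invented, with one large value added deliberately to show why the median may be the safer fill when outliers are present (`readings` is a hypothetical name):

```python
import numpy as np
import pandas as pd

readings = pd.Series([1.0, np.nan, 3.5, np.nan, 7.0, 100.0])

# mean() and median() skip NaN by default.
mean_filled = readings.fillna(readings.mean())
median_filled = readings.fillna(readings.median())

print(readings.mean())       # 27.875
print(readings.median())     # 5.25
print(median_filled.tolist())  # [1.0, 5.25, 3.5, 5.25, 7.0, 100.0]
```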

Table 7-2. fillna function arguments

Argument | Description
---------|------------
value | Scalar value or dict-like object to use to fill missing values
method | Interpolation method, 'ffill' or 'bfill'; deprecated since pandas 2.1 in favor of the `ffill()` and `bfill()` methods
axis | Axis to fill on, default axis=0
inplace | Modify the calling object without producing a copy
limit | For forward and backward filling, maximum number of consecutive periods to fill