# pandas documentation notes: working with missing data

https://pandas.pydata.org/pandas-docs/dev/user_guide/missing_data.html

`NA` stands for "not available".

## values considered missing

A single option makes pandas treat `inf` and `-inf` as NA:

`pandas.options.mode.use_inf_as_na = True`
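As a possible alternative to the global option (which newer pandas releases deprecate, so it is worth checking against the installed version), infinities can be converted to `NaN` explicitly. The sketch below uses made-up values and the name `cleaned` purely for illustration.

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, np.inf, -np.inf, np.nan])

# Convert infinities to NaN explicitly instead of relying on a global option.
cleaned = values.replace([np.inf, -np.inf], np.nan)

print(values.isna().sum())   # counts only the literal NaN
print(cleaned.isna().sum())  # infinities are now counted as missing too
```

The explicit conversion keeps the behavior local to one object rather than changing how every `isna` call in the session behaves.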

In [4]:
import pandas as pd
import numpy as np

In [41]:
df = pd.DataFrame(
    np.random.randn(5, 3),
    index=["a", "c", "e", "f", "h"],
    columns=["one", "two", "three"],
)

In [42]:
df

Unnamed: 0,one,two,three
a,-0.486674,-0.907531,0.682221
c,-1.121301,1.992182,1.128709
e,-0.097348,0.269988,-0.536821
f,0.840448,-0.425848,0.021044
h,1.208747,0.977869,0.474883


In [43]:
df["four"] = "bar"

df["five"] = df["one"] > 0

In [44]:
df

Unnamed: 0,one,two,three,four,five
a,-0.486674,-0.907531,0.682221,bar,False
c,-1.121301,1.992182,1.128709,bar,False
e,-0.097348,0.269988,-0.536821,bar,False
f,0.840448,-0.425848,0.021044,bar,True
h,1.208747,0.977869,0.474883,bar,True


In [9]:
df2 = df.reindex(["a", "b", "c", "d", "e", "f", "g", "h"])

df2

Unnamed: 0,one,two,three,four,five
a,-1.204231,-0.97181,-2.405019,bar,False
b,,,,,
c,-0.442511,-0.117058,-0.118704,bar,False
d,,,,,
e,1.15651,0.027709,1.986025,bar,True
f,0.980257,0.686309,-1.614396,bar,True
g,,,,,
h,-0.162094,0.01017,-0.282719,bar,False


In [10]:
pd.isna(df2['one'])

a    False
b     True
c    False
d     True
e    False
f    False
g     True
h    False
Name: one, dtype: bool

In [11]:
df2['four'].notna()

a     True
b    False
c     True
d    False
e     True
f     True
g    False
h     True
Name: four, dtype: bool

In [12]:
df2.isna()

Unnamed: 0,one,two,three,four,five
a,False,False,False,False,False
b,True,True,True,True,True
c,False,False,False,False,False
d,True,True,True,True,True
e,False,False,False,False,False
f,False,False,False,False,False
g,True,True,True,True,True
h,False,False,False,False,False


Comparing `NaN` and `None` values:

In [13]:
None == None

True

In [14]:
np.nan == np.nan

False

In [17]:
df2['one']

a   -1.204231
b         NaN
c   -0.442511
d         NaN
e    1.156510
f    0.980257
g         NaN
h   -0.162094
Name: one, dtype: float64

In [18]:
df2['one'] == np.nan

a    False
b    False
c    False
d    False
e    False
f    False
g    False
h    False
Name: one, dtype: bool
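Because `np.nan == np.nan` is `False`, an equality test never finds missing values. A minimal sketch with illustrative data, contrasting the equality test with `isna()`:

```python
import numpy as np
import pandas as pd

one = pd.Series([-1.2, np.nan, 0.98], index=["a", "b", "c"])

by_equality = one == np.nan  # always False, because NaN never equals NaN
by_isna = one.isna()         # True exactly where the value is missing

print(by_equality.tolist())
print(by_isna.tolist())
```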

### integer dtypes and missing data

`NaN` is a float, so an integer column that contains it is automatically cast to float.

In [19]:
pd.Series([1, 2, np.nan, 4])

0    1.0
1    2.0
2    NaN
3    4.0
dtype: float64

In [20]:
pd.Series([1, 2, np.nan, 4], dtype=pd.Int64Dtype())

0       1
1       2
2    <NA>
3       4
dtype: Int64

In [21]:
pd.Series([1, 2, np.nan, 4], dtype='Int64')

0       1
1       2
2    <NA>
3       4
dtype: Int64

In [23]:
# pd.Series([1, 2, np.nan, 4], dtype='int64')
# Raises an error:
# cannot convert float NaN to integer

---
Notes on `NA` for missing integer values:
https://pandas.pydata.org/pandas-docs/dev/user_guide/gotchas.html#gotchas-intna

In [27]:
s = pd.Series([1, 2, 3, 4, 5], index=list("abcde"))

In [28]:
s

a    1
b    2
c    3
d    4
e    5
dtype: int64

In [29]:
s.dtype

dtype('int64')

In [30]:
s2 = s.reindex(["a", "b", "c", "f", "u"])
s2

a    1.0
b    2.0
c    3.0
f    NaN
u    NaN
dtype: float64

In [31]:
s2.dtype

dtype('float64')

pandas provides several nullable integer dtypes, including:

- Int8Dtype
- Int16Dtype
- Int32Dtype
- Int64Dtype

In [34]:
s_int = pd.Series([1, 2, 3, 4, 5], index=list("abcde"), dtype=pd.Int64Dtype())
s_int
# Note the capital "I" in this dtype name

a    1
b    2
c    3
d    4
e    5
dtype: Int64

In [35]:
s_int.dtype

Int64Dtype()

In [36]:
s2_int = s_int.reindex(["a", "b", "c", "f", "u"])
s2_int

a       1
b       2
c       3
f    <NA>
u    <NA>
dtype: Int64

In [38]:
s2_int.dtype

Int64Dtype()

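When missing values are introduced into a NumPy-backed column (for example by `reindex`), float and object dtypes stay unchanged, while integer is cast to float and boolean is cast to object.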
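The casting rule can be checked directly. This sketch reindexes four small Series onto an index with one new label and records the resulting dtype; the names `samples` and `promoted` are illustrative only.

```python
import pandas as pd

idx = ["a", "b"]
new_index = ["a", "b", "z"]  # "z" is not in the original index, so it becomes missing

samples = {
    "integer": pd.Series([1, 2], index=idx, dtype="int64"),
    "boolean": pd.Series([True, False], index=idx, dtype="bool"),
    "floating": pd.Series([1.5, 2.5], index=idx, dtype="float64"),
    "object": pd.Series(["x", "y"], index=idx, dtype=object),
}

# Record the dtype of each Series after a missing value has been introduced.
promoted = {name: str(ser.reindex(new_index).dtype) for name, ser in samples.items()}
print(promoted)
```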

---
This example comes from the cookbook:

https://pandas.pydata.org/pandas-docs/dev/user_guide/cookbook.html#cookbook-missing-data

In [24]:
df = pd.DataFrame(
    np.random.randn(6, 1),
    index=pd.date_range("2013-08-01", periods=6, freq="B"),
    columns=list("A"),
)

df

Unnamed: 0,A
2013-08-01,0.150991
2013-08-02,-0.042041
2013-08-05,0.549513
2013-08-06,1.46763
2013-08-07,0.677292
2013-08-08,-0.73647


In [25]:
df.loc[df.index[3], 'A'] = np.nan
df

Unnamed: 0,A
2013-08-01,0.150991
2013-08-02,-0.042041
2013-08-05,0.549513
2013-08-06,
2013-08-07,0.677292
2013-08-08,-0.73647


In [26]:
df.bfill()

Unnamed: 0,A
2013-08-01,0.150991
2013-08-02,-0.042041
2013-08-05,0.549513
2013-08-06,0.677292
2013-08-07,0.677292
2013-08-08,-0.73647


---

### datetimes

In [45]:
df2 = df.copy()
df2["timestamp"] = pd.Timestamp("20120101")
df2

Unnamed: 0,one,two,three,four,five,timestamp
a,-0.486674,-0.907531,0.682221,bar,False,2012-01-01
c,-1.121301,1.992182,1.128709,bar,False,2012-01-01
e,-0.097348,0.269988,-0.536821,bar,False,2012-01-01
f,0.840448,-0.425848,0.021044,bar,True,2012-01-01
h,1.208747,0.977869,0.474883,bar,True,2012-01-01


In [46]:
df2.loc[["a", "c", "h"], ["one", "timestamp"]] = np.nan

In [47]:
df2

Unnamed: 0,one,two,three,four,five,timestamp
a,,-0.907531,0.682221,bar,False,NaT
c,,1.992182,1.128709,bar,False,NaT
e,-0.097348,0.269988,-0.536821,bar,False,2012-01-01
f,0.840448,-0.425848,0.021044,bar,True,2012-01-01
h,,0.977869,0.474883,bar,True,NaT


In [48]:
df2.dtypes.value_counts()

float64           3
object            1
bool              1
datetime64[ns]    1
dtype: int64

pandas uses `NaT` to represent missing values in `datetime64[ns]` data. Assigning `np.nan` to such a column stores `NaT`, and the dtype stays `datetime64[ns]`.
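A small sketch of the `NaT` behavior on a standalone datetime Series (the dates are arbitrary). Both `np.nan` and `None` are stored as `NaT`, and the column keeps a datetime dtype:

```python
import numpy as np
import pandas as pd

stamps = pd.Series(pd.to_datetime(["2012-01-01", "2012-01-02", "2012-01-03"]))

stamps.iloc[0] = np.nan  # stored as NaT
stamps.iloc[1] = None    # also stored as NaT

print(stamps)
print(stamps.isna().tolist())
```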

## inserting missing data

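Assigning `None` or `np.nan` inserts a missing value; how it is stored depends on the dtype of the container.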

In [49]:
s = pd.Series([1, 2, 3])
s

0    1
1    2
2    3
dtype: int64

In [50]:
s.loc[0] = None
s

0    NaN
1    2.0
2    3.0
dtype: float64

In [51]:
s = pd.Series(["a", "b", "c"])
s

0    a
1    b
2    c
dtype: object

In [52]:
s.loc[0] = None
s.loc[1] = np.nan
s

0    None
1     NaN
2       c
dtype: object

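## Calculations with missing data

When summing, missing values are treated as 0. Cumulative sum and cumulative product skip missing values by default, but keep them in the result.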
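The effect of `skipna` is easier to see on a three-element Series. A minimal sketch with illustrative values:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 2.0])

total = s.sum()                          # NaN is skipped
strict_total = s.sum(skipna=False)       # NaN propagates to the result
running = s.cumsum()                     # NaN stays in place, the total continues after it
strict_running = s.cumsum(skipna=False)  # everything from the first NaN onward is NaN

print(total, strict_total)
print(running.tolist())
print(strict_running.tolist())
```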

In [60]:
df2['one']

a         NaN
c         NaN
e   -0.097348
f    0.840448
h         NaN
Name: one, dtype: float64

In [59]:
df2['one'].sum()

0.7430998752918363

In [61]:
df2.cumsum()

Unnamed: 0,one,two,three,four,five,timestamp
a,,-0.907531,0.682221,bar,0,NaT
c,,1.084652,1.81093,barbar,0,NaT
e,-0.097348,1.35464,1.274109,barbarbar,0,2012-01-01
f,0.7431,0.928792,1.295154,barbarbarbar,1,2053-12-31
h,,1.906661,1.770036,barbarbarbarbar,2,NaT


In [62]:
df2.cumsum(skipna=False)

Unnamed: 0,one,two,three,four,five,timestamp
a,,-0.907531,0.682221,bar,0,NaT
c,,1.084652,1.81093,barbar,0,1970-01-01 00:00:00.000000000
e,,1.35464,1.274109,barbarbar,0,2012-01-01 00:00:00.000000000
f,,0.928792,1.295154,barbarbarbar,1,2053-12-31 00:00:00.000000000
h,,1.906661,1.770036,barbarbarbarbar,2,1761-09-21 00:12:43.145224192


Sum and product of an empty Series or a Series that contains only missing values:

In [63]:
pd.Series([np.nan]).sum()

0.0

In [64]:
pd.Series([], dtype="float64").sum()

0.0

In [65]:
pd.Series([np.nan]).prod()

1.0

In [66]:
pd.Series([], dtype="float64").prod()

1.0

In [67]:
pd.Series([], dtype="int64").prod()

1
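If a result of 0 or 1 for all-missing data is not wanted, the `min_count` parameter of `sum` and `prod` sets how many valid values are required. A short sketch:

```python
import numpy as np
import pandas as pd

all_missing = pd.Series([np.nan])

default_sum = all_missing.sum()               # no valid values are required
guarded_sum = all_missing.sum(min_count=1)    # at least one valid value is required
guarded_prod = all_missing.prod(min_count=1)  # same rule for the product

print(default_sum, guarded_sum, guarded_prod)
```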

## NA in GroupBy

NA groups are excluded automatically.

Pass `dropna=False` to treat NA as its own group.
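A compact sketch of the difference, using a small illustrative frame with a numeric key column:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    {"key": [1.0, np.nan, 1.0, np.nan, 2.0], "value": [1, 2, 3, 4, 5]}
)

default = frame.groupby("key")["value"].sum()                # rows with a NaN key are dropped
with_na = frame.groupby("key", dropna=False)["value"].sum()  # NaN forms its own group

print(default)
print(with_na)
```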

In [71]:
df2

Unnamed: 0,one,two,three,four,five,timestamp
a,,-0.907531,0.682221,bar,False,NaT
c,,1.992182,1.128709,bar,False,NaT
e,-0.097348,0.269988,-0.536821,bar,False,2012-01-01
f,0.840448,-0.425848,0.021044,bar,True,2012-01-01
h,,0.977869,0.474883,bar,True,NaT


In [72]:
df2.groupby('one').mean()

one,two,three,five
-0.097348,0.269988,-0.536821,0.0
0.840448,-0.425848,0.021044,1.0


In [73]:
df2.groupby('one', dropna=False).mean()

one,two,three,five
-0.097348,0.269988,-0.536821,0.0
0.840448,-0.425848,0.021044,1.0
,0.687507,0.761938,0.333333


## Filling missing values with fillna

In [74]:
df2

Unnamed: 0,one,two,three,four,five,timestamp
a,,-0.907531,0.682221,bar,False,NaT
c,,1.992182,1.128709,bar,False,NaT
e,-0.097348,0.269988,-0.536821,bar,False,2012-01-01
f,0.840448,-0.425848,0.021044,bar,True,2012-01-01
h,,0.977869,0.474883,bar,True,NaT


In [76]:
df2.fillna(0)

Unnamed: 0,one,two,three,four,five,timestamp
a,0.0,-0.907531,0.682221,bar,False,0
c,0.0,1.992182,1.128709,bar,False,0
e,-0.097348,0.269988,-0.536821,bar,False,2012-01-01 00:00:00
f,0.840448,-0.425848,0.021044,bar,True,2012-01-01 00:00:00
h,0.0,0.977869,0.474883,bar,True,0
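Filling every column with the same scalar is rarely what a mixed frame needs. As a sketch on illustrative data, `fillna` also accepts a per-column mapping, and `ffill` carries the last valid value forward:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    {"one": [np.nan, 1.0, np.nan], "two": [np.nan, 2.0, 3.0]}
)

per_column = frame.fillna({"one": 0.0, "two": -1.0})  # a different fill value per column
forward = frame.ffill()                                # propagate the last valid value downward

print(per_column)
print(forward)
```

Leading missing values have nothing to carry forward, so `ffill` leaves them as `NaN`.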


## Filling with a pandas object

In [77]:
dff = pd.DataFrame(np.random.randn(10, 3), columns=list("ABC"))
dff.iloc[3:5, 0] = np.nan

dff.iloc[4:6, 1] = np.nan

dff.iloc[5:8, 2] = np.nan

In [78]:
dff

Unnamed: 0,A,B,C
0,0.488549,-1.462094,-0.4104
1,0.186887,0.609615,-0.322586
2,0.399086,0.42597,1.774583
3,,-0.438308,-0.658296
4,,,1.673235
5,0.016675,,
6,-1.086368,-0.179689,
7,-0.815642,2.119684,
8,-0.202702,0.911234,-0.947015
9,-0.07412,1.150594,-1.242232


In [79]:
dff.fillna(dff.mean())

Unnamed: 0,A,B,C
0,0.488549,-1.462094,-0.4104
1,0.186887,0.609615,-0.322586
2,0.399086,0.42597,1.774583
3,-0.135954,-0.438308,-0.658296
4,-0.135954,0.392126,1.673235
5,0.016675,0.392126,-0.018959
6,-1.086368,-0.179689,-0.018959
7,-0.815642,2.119684,-0.018959
8,-0.202702,0.911234,-0.947015
9,-0.07412,1.150594,-1.242232


In [80]:
dff.fillna(dff.mean()['B': 'C'])

Unnamed: 0,A,B,C
0,0.488549,-1.462094,-0.4104
1,0.186887,0.609615,-0.322586
2,0.399086,0.42597,1.774583
3,,-0.438308,-0.658296
4,,0.392126,1.673235
5,0.016675,0.392126,-0.018959
6,-1.086368,-0.179689,-0.018959
7,-0.815642,2.119684,-0.018959
8,-0.202702,0.911234,-0.947015
9,-0.07412,1.150594,-1.242232


In [81]:
dff.where(pd.notna(dff), dff.mean(), axis="columns")

Unnamed: 0,A,B,C
0,0.488549,-1.462094,-0.4104
1,0.186887,0.609615,-0.322586
2,0.399086,0.42597,1.774583
3,-0.135954,-0.438308,-0.658296
4,-0.135954,0.392126,1.673235
5,0.016675,0.392126,-0.018959
6,-1.086368,-0.179689,-0.018959
7,-0.815642,2.119684,-0.018959
8,-0.202702,0.911234,-0.947015
9,-0.07412,1.150594,-1.242232


In [83]:
dff.where(pd.notna(dff))
# where keeps values where the condition is True and fills the rest (with NaN by default)

Unnamed: 0,A,B,C
0,0.488549,-1.462094,-0.4104
1,0.186887,0.609615,-0.322586
2,0.399086,0.42597,1.774583
3,,-0.438308,-0.658296
4,,,1.673235
5,0.016675,,
6,-1.086368,-0.179689,
7,-0.815642,2.119684,
8,-0.202702,0.911234,-0.947015
9,-0.07412,1.150594,-1.242232
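The semantics of `where` are easy to mix up with those of `mask`, which does the opposite. A minimal sketch on an illustrative Series:

```python
import pandas as pd

s = pd.Series([1.0, -2.0, 3.0])

kept = s.where(s > 0)         # values failing the condition become NaN
filled = s.where(s > 0, 0.0)  # values failing the condition become 0.0
masked = s.mask(s > 0)        # values meeting the condition become NaN

print(kept.tolist())
print(filled.tolist())
print(masked.tolist())
```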


## dropna

In [87]:
df3 = df2[['one', 'two', 'three']]

In [89]:
df3

Unnamed: 0,one,two,three
a,,-0.907531,0.682221
c,,1.992182,1.128709
e,-0.097348,0.269988,-0.536821
f,0.840448,-0.425848,0.021044
h,,0.977869,0.474883


In [90]:
df3.dropna(axis=0)

Unnamed: 0,one,two,three
e,-0.097348,0.269988,-0.536821
f,0.840448,-0.425848,0.021044


In [91]:
df3.dropna(axis=1)

Unnamed: 0,two,three
a,-0.907531,0.682221
c,1.992182,1.128709
e,0.269988,-0.536821
f,-0.425848,0.021044
h,0.977869,0.474883


In [93]:
df3['one'].dropna()

e   -0.097348
f    0.840448
Name: one, dtype: float64
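Beyond the `axis` argument, `dropna` has `how`, `thresh`, and `subset` parameters that control which rows count as droppable. A sketch on a small illustrative frame:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    {
        "one": [np.nan, 1.0, np.nan],
        "two": [np.nan, 2.0, 3.0],
        "three": [np.nan, 4.0, 5.0],
    }
)

any_missing = frame.dropna()              # drop rows with at least one missing value
all_missing = frame.dropna(how="all")     # drop rows only if every value is missing
at_least_two = frame.dropna(thresh=2)     # keep rows with two or more valid values
by_subset = frame.dropna(subset=["two"])  # consider only column "two"

print(any_missing.index.tolist())
print(all_missing.index.tolist())
print(at_least_two.index.tolist())
print(by_subset.index.tolist())
```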

## interpolation
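The notes leave this section empty. As a minimal sketch of what the heading refers to, `Series.interpolate()` fills interior gaps, by default linearly between the neighboring valid values; the data here is illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([0.0, np.nan, np.nan, 3.0])

# Default method is linear: missing points are spaced evenly between neighbors.
linear = s.interpolate()

print(linear.tolist())
```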

## Replacing arbitrary values

In [94]:
ser = pd.Series([0.0, 1.0, 2.0, 3.0, 4.0])
ser

0    0.0
1    1.0
2    2.0
3    3.0
4    4.0
dtype: float64

In [95]:
ser.replace(0, 5)

0    5.0
1    1.0
2    2.0
3    3.0
4    4.0
dtype: float64

In [96]:
ser.replace({0: 10, 1: 100})

0     10.0
1    100.0
2      2.0
3      3.0
4      4.0
dtype: float64

### String and regular expression replacement

In [97]:
d = {"a": list(range(4)), "b": list("ab.."), "c": ["a", "b", np.nan, "d"]}
df = pd.DataFrame(d)

In [98]:
df

Unnamed: 0,a,b,c
0,0,a,a
1,1,b,b
2,2,.,
3,3,.,d


In [99]:
df.replace(["a", "."], ["b", np.nan])

Unnamed: 0,a,b,c
0,0,b,b
1,1,b,b
2,2,,
3,3,,d
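The cell above replaces exact strings. For the regular expression case, `replace` takes `regex=True`. The sketch below assumes an object-dtype column and a pattern that matches a lone dot with optional surrounding whitespace; both are illustrative choices.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"b": pd.Series(["a", "b", " . ", "."], dtype=object)})

# Treat any cell consisting of a single dot (with optional whitespace) as missing.
cleaned = frame.replace(r"^\s*\.\s*$", np.nan, regex=True)

print(cleaned)
```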


### Numeric replacement

In [100]:
df = pd.DataFrame(np.random.randn(10, 2))
df[np.random.rand(df.shape[0]) > 0.5] = 1.5
df

Unnamed: 0,0,1
0,1.5,1.5
1,1.398698,1.824802
2,0.442663,1.81593
3,1.5,1.5
4,1.442578,1.997778
5,1.5,1.5
6,1.5,1.5
7,1.5,1.5
8,-2.272937,-0.451796
9,1.5,1.5


In [101]:
df00 = df.iloc[0, 0]
df00

1.5

In [102]:
df.replace([1.5, df00], [np.nan, "a"])

Unnamed: 0,0,1
0,a,a
1,1.398698,1.824802
2,0.442663,1.81593
3,a,a
4,1.442578,1.997778
5,a,a
6,a,a
7,a,a
8,-2.272937,-0.451796
9,a,a


In [103]:
df.replace(df00, np.nan)

Unnamed: 0,0,1
0,,
1,1.398698,1.824802
2,0.442663,1.81593
3,,
4,1.442578,1.997778
5,,
6,,
7,,
8,-2.272937,-0.451796
9,,


## missing data casting rules and indexing

Once a column contains a missing value, the column's original dtype may be converted.

For example, `np.nan` is itself a float.

In [104]:
s = pd.Series(np.random.randn(5), index=[0, 2, 4, 6, 7])

In [105]:
(s>0).dtype

dtype('bool')

In [107]:
crit = (s>0).reindex(list(range(8)))
crit

0     True
1      NaN
2    False
3      NaN
4    False
5      NaN
6    False
7    False
dtype: object

In [108]:
crit.dtype

dtype('O')

In [109]:
reindexed = s.reindex(list(range(8))).fillna(0)
reindexed

0    0.490246
1    0.000000
2   -0.705880
3    0.000000
4   -0.964510
5    0.000000
6   -0.677250
7   -0.312600
dtype: float64

In [116]:
reindexed[crit]

ValueError: Cannot mask with non-boolean array containing NA / NaN values

In [117]:
reindexed[crit.fillna(False)]

0    0.490246
dtype: float64

In [118]:
reindexed[crit.fillna(True)]

0    0.490246
1    0.000000
3    0.000000
5    0.000000
dtype: float64

In [121]:
crit

0     True
1      NaN
2    False
3      NaN
4    False
5      NaN
6    False
7    False
dtype: object
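An object-dtype mask containing `NaN` has to be filled before indexing, as shown above. As a sketch of an alternative, a mask with the nullable `"boolean"` dtype can be used directly, and its `NA` entries are treated as `False`; the data is illustrative.

```python
import pandas as pd

values = pd.Series([10, 20, 30, 40])
mask = pd.array([True, pd.NA, False, True], dtype="boolean")

# NA entries in a nullable boolean mask select nothing.
selected = values[mask]

print(selected)
```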

## experimental NA scalar to denote missing values



In [123]:
pd.NA
# Missing-value scalar for the nullable integer and boolean dtypes

<NA>

In [124]:
s = pd.Series([1, 2, None], dtype="Int64")
s

0       1
1       2
2    <NA>
dtype: Int64

In [125]:
s[2] is pd.NA

True

pandas does not use this missing-value scalar by default yet, so the dtype has to be specified explicitly.

### Propagation in arithmetic and comparison operations

In [126]:
pd.NA + 1

<NA>

In [127]:
'a' * pd.NA

<NA>

In [128]:
pd.NA ** 0

1

In [129]:
1 ** pd.NA

1

In [130]:
pd.NA == 1

<NA>

In [131]:
pd.NA == pd.NA

<NA>

In [132]:
pd.NA < 2.5

<NA>

Use `isna()` to check whether a value is `pd.NA`:

In [133]:
pd.isna(pd.NA)

True

### Logical operations

Missing values propagate only when the result logically depends on them.

In [134]:
True | False

True

In [135]:
True | pd.NA

True

In [136]:
pd.NA | True

True

In [137]:
False | True

True

In [138]:
False | False

False

In [139]:
False | pd.NA

<NA>

In [140]:
False & True

False

In [141]:
False & False

False

In [142]:
False & pd.NA

False

In [143]:
True & True

True

In [144]:
True & False

False

In [145]:
True & pd.NA

<NA>
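The same three-valued logic applies elementwise to arrays with the nullable `"boolean"` dtype. A minimal sketch with an illustrative array:

```python
import pandas as pd

flags = pd.array([True, False, pd.NA], dtype="boolean")

or_true = flags | True     # True regardless of the unknown value
or_false = flags | False   # result depends on the unknown value, so NA remains
and_true = flags & True    # result depends on the unknown value, so NA remains
and_false = flags & False  # False regardless of the unknown value

print(or_true)
print(or_false)
print(and_true)
print(and_false)
```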

### NA in a boolean context

In [146]:
bool(pd.NA)

TypeError: boolean value of NA is ambiguous

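As the error above shows, the actual value behind `NA` is unknown, so it cannot be converted to a boolean.

`NA` therefore cannot be used directly in a conditional such as an `if` statement.

Use `pd.isna()` to write conditions that involve `NA`.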
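A sketch of the pattern: test for missingness first, and only then evaluate the ordinary condition. The helper `describe` is a hypothetical name used only for this illustration.

```python
import pandas as pd


def describe(value):
    """Return a label without ever evaluating pd.NA in a boolean context."""
    if pd.isna(value):
        return "missing"
    return "positive" if value > 0 else "non-positive"


labels = [describe(v) for v in (3, -1, pd.NA)]
print(labels)

# Evaluating pd.NA directly in a boolean context raises TypeError.
try:
    bool(pd.NA)
    raised = False
except TypeError:
    raised = True
print(raised)
```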

### Converting between the new `NA` and the older `np.nan`

In [147]:
pd.Series([1, 2, np.nan]).convert_dtypes()
# convert_dtypes is a method of Series and DataFrame, not a standalone function

0       1
1       2
2    <NA>
dtype: Int64

# Nullable integer data type

https://pandas.pydata.org/pandas-docs/dev/user_guide/integer_na.html#integer-na



`np.nan`, displayed as `NaN` in pandas, is a float.

In [148]:
type(np.nan)

float

`pd.NA` is the missing-value scalar used by the nullable integer dtypes.

## construction


In [149]:
arr = pd.array([1, 2, None], dtype=pd.Int64Dtype())
arr

<IntegerArray>
[1, 2, <NA>]
Length: 3, dtype: Int64

In [152]:
pd.array([1, 2, None])
# The nullable integer dtype is inferred by default

<IntegerArray>
[1, 2, <NA>]
Length: 3, dtype: Int64

In [153]:
pd.array([1, 2, np.nan, None, pd.NA])

<IntegerArray>
[1, 2, <NA>, <NA>, <NA>]
Length: 5, dtype: Int64

In [154]:
pd.array([1, 2, np.nan, None, pd.NA], dtype="Int64")

<IntegerArray>
[1, 2, <NA>, <NA>, <NA>]
Length: 5, dtype: Int64

In [155]:
pd.Series([1, None])
# A Series converts None to a float NaN instead

0    1.0
1    NaN
dtype: float64

In [156]:
pd.Series([1, 2])

0    1
1    2
dtype: int64

Specifying the dtype explicitly is recommended.

In [157]:
pd.array([1, None], dtype="Int64")

<IntegerArray>
[1, <NA>]
Length: 2, dtype: Int64

In [158]:
pd.Series([1, None], dtype="Int64")

0       1
1    <NA>
dtype: Int64
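When the data already exists as a float Series with `NaN`, the dtype can also be set afterwards. A minimal sketch using `astype("Int64")` on illustrative values; this assumes every non-missing float is a whole number:

```python
import numpy as np
import pandas as pd

floats = pd.Series([1.0, 2.0, np.nan])

# Whole-number floats convert cleanly; NaN becomes <NA>.
ints = floats.astype("Int64")

print(ints)
```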

### operations

In [159]:
s = pd.Series([1, 2, None], dtype="Int64")

In [160]:
s + 1

0       2
1       3
2    <NA>
dtype: Int64

In [161]:
s == 1

0     True
1    False
2     <NA>
dtype: boolean

In [162]:
s + 0.01

0    1.01
1    2.01
2    <NA>
dtype: Float64
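Two further operations that often come up with nullable integers, sketched on the same kind of Series: reductions skip `NA` by default, and `to_numpy` accepts an `na_value` for handing the data to NumPy code.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, None], dtype="Int64")

total = s.sum()  # NA is skipped by default

# Convert to a plain NumPy float array, representing NA as NaN.
as_float = s.to_numpy(dtype="float64", na_value=np.nan)

print(total)
print(as_float)
```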