<center><h1>Chapter 7: Missing Data</h1></center>

In [1]:
import numpy as np
import pandas as pd

## I. Counting and Dropping Missing Values
### 1. Counting missing values

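Missing data can be inspected cell by cell with `isna` or `isnull` (the two functions are identical). Combining the result with `mean` gives the proportion of missing values in each column: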

In [2]:
df = pd.read_csv('../data/learn_pandas.csv', usecols = ['Grade', 'Name', 'Gender', 'Height', 'Weight', 'Transfer'])
df.isna().head()

Unnamed: 0,Grade,Name,Gender,Height,Weight,Transfer
0,False,False,False,False,False,False
1,False,False,False,False,False,False
2,False,False,False,False,False,False
3,False,False,False,True,False,False
4,False,False,False,False,False,False


In [3]:
df.isna().mean() # proportion of missing values per column

Grade       0.000
Name        0.000
Gender      0.000
Height      0.085
Weight      0.055
Transfer    0.060
dtype: float64

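To view the rows in which a particular column is missing or not missing, use `isna` or `notna` on the `Series` as a Boolean index. For example, to view the rows with a missing height: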

In [4]:
df[df.Height.isna()].head()

Unnamed: 0,Grade,Name,Gender,Height,Weight,Transfer
3,Sophomore,Xiaojuan Sun,Female,,41.0,N
12,Senior,Peng You,Female,,48.0,
26,Junior,Yanli You,Female,,48.0,N
36,Freshman,Xiaojuan Qin,Male,,79.0,Y
60,Freshman,Yanpeng Lv,Male,,65.0,N


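To retrieve the rows in which several columns are all missing, at least one is missing, or none is missing, combine `isna`/`notna` with `any`/`all`. For example, the three cases for the height, weight and transfer columns are retrieved as follows: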

In [5]:
sub_set = df[['Height', 'Weight', 'Transfer']]
df[sub_set.isna().all(axis=1)] # all missing

Unnamed: 0,Grade,Name,Gender,Height,Weight,Transfer
102,Junior,Chengli Zhao,Male,,,


In [6]:
df[sub_set.isna().any(axis=1)].head() # at least one missing

Unnamed: 0,Grade,Name,Gender,Height,Weight,Transfer
3,Sophomore,Xiaojuan Sun,Female,,41.0,N
9,Junior,Juan Xu,Female,164.8,,N
12,Senior,Peng You,Female,,48.0,
21,Senior,Xiaopeng Shen,Male,166.0,62.0,
26,Junior,Yanli You,Female,,48.0,N


In [7]:
df[sub_set.notna().all(axis=1)].head() # none missing

Unnamed: 0,Grade,Name,Gender,Height,Weight,Transfer
0,Freshman,Gaopeng Yang,Female,158.9,46.0,N
1,Freshman,Changqiang You,Male,166.5,70.0,N
2,Senior,Mei Sun,Male,188.9,89.0,N
4,Sophomore,Gaojuan You,Male,174.0,74.0,N
5,Freshman,Xiaoli Qian,Female,158.0,51.0,N
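
The three row-wise checks above can be sketched on a small frame. The frame `toy` and its values are hypothetical, made up for illustration rather than taken from `learn_pandas.csv`. The axis is passed by keyword (`axis=1`), a spelling that should also work on newer pandas releases:

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 1 is fully missing, rows 2 and 3 are partly missing.
toy = pd.DataFrame({
    'Height': [158.9, np.nan, np.nan, 174.0],
    'Weight': [46.0, np.nan, 41.0, np.nan],
})
mask = toy.isna()

all_missing = toy[mask.all(axis=1)]           # every column is missing
any_missing = toy[mask.any(axis=1)]           # at least one column is missing
none_missing = toy[toy.notna().all(axis=1)]   # no column is missing

print(all_missing.index.tolist(), any_missing.index.tolist(), none_missing.index.tolist())
```

Note that "none missing" is the complement of "at least one missing", so `toy[~mask.any(axis=1)]` selects the same rows.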


### 2. Dropping missing values

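Data processing often requires dropping rows (samples) or columns (features) according to the number or proportion of their missing values, or some other property of them. `pandas` provides the `dropna` function for this purpose.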

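The main parameters of `dropna` are the axis `axis` (default 0, i.e. drop rows), the deletion rule `how`, the threshold `thresh` on the number of non-missing values (a row or column along the chosen axis is dropped when its count of $\color{red}{\text{non-missing values}}$ does not reach this number), and the candidate subset `subset`. `how` accepts either `any` or `all`.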

For example, drop the rows in which at least one of height and weight is missing:

In [8]:
df.shape

(200, 6)

In [9]:
res = df.dropna(how = 'any', subset = ['Height', 'Weight'])
res.shape

(174, 6)

For example, drop the columns that have more than 15 missing values:

In [12]:
df.head(5)

Unnamed: 0,Grade,Name,Gender,Height,Weight,Transfer
0,Freshman,Gaopeng Yang,Female,158.9,46.0,N
1,Freshman,Changqiang You,Male,166.5,70.0,N
2,Senior,Mei Sun,Male,188.9,89.0,N
3,Sophomore,Xiaojuan Sun,Female,,41.0,N
4,Sophomore,Gaojuan You,Male,174.0,74.0,N


In [11]:
res = df.dropna(axis=1, thresh=df.shape[0]-15) # Height is dropped
res.head()

Unnamed: 0,Grade,Name,Gender,Weight,Transfer
0,Freshman,Gaopeng Yang,Female,46.0,N
1,Freshman,Changqiang You,Male,70.0,N
2,Senior,Mei Sun,Male,89.0,N
3,Sophomore,Xiaojuan Sun,Female,41.0,N
4,Sophomore,Gaojuan You,Male,74.0,N
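
Because `thresh` counts non-missing values, "drop columns with more than k missing values" translates to `thresh = number of rows - k`. A minimal sketch on a hypothetical 4-row frame (the names `toy`, `by_thresh` and `by_mask` are illustrative only):

```python
import numpy as np
import pandas as pd

# Hypothetical frame: column 'a' has 2 missing values, 'b' has 1, 'c' has none.
toy = pd.DataFrame({
    'a': [1.0, np.nan, np.nan, 4.0],
    'b': [1.0, 2.0, np.nan, 4.0],
    'c': [1.0, 2.0, 3.0, 4.0],
})

# Keep columns with at least len(toy) - 1 non-missing values,
# i.e. drop columns with more than 1 missing value.
by_thresh = toy.dropna(axis=1, thresh=len(toy) - 1)

# The same selection written as a Boolean column mask.
by_mask = toy.loc[:, toy.isna().sum() <= 1]

print(by_thresh.columns.tolist())
```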


Of course, the same can be done without `dropna`. Both operations above can also be carried out with Boolean indexing:

In [13]:
res = df.loc[df[['Height', 'Weight']].notna().all(axis=1)]
res.shape
# Dropping rows with at least one missing value is the same as keeping rows with no missing values

(174, 6)

In [16]:
df[['Height', 'Weight']].notna().all(axis=1)

0       True
1       True
2       True
3      False
4       True
       ...  
195     True
196     True
197     True
198     True
199     True
Length: 200, dtype: bool

In [17]:
df.isna().sum()

Grade        0
Name         0
Gender       0
Height      17
Weight      11
Transfer    12
dtype: int64

In [19]:
df.isna().sum(axis=0)

Grade        0
Name         0
Gender       0
Height      17
Weight      11
Transfer    12
dtype: int64

In [20]:
df.isna().sum(axis=1)

0      0
1      0
2      0
3      1
4      0
      ..
195    0
196    0
197    0
198    0
199    0
Length: 200, dtype: int64

In [21]:
res = df.loc[:, ~(df.isna().sum()>15)]
res.head()

Unnamed: 0,Grade,Name,Gender,Weight,Transfer
0,Freshman,Gaopeng Yang,Female,46.0,N
1,Freshman,Changqiang You,Male,70.0,N
2,Senior,Mei Sun,Male,89.0,N
3,Sophomore,Xiaojuan Sun,Female,41.0,N
4,Sophomore,Gaojuan You,Male,74.0,N


## II. Filling and Interpolating Missing Values
### 1. Filling with fillna

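[`fillna`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html) has three commonly used parameters: `value`, `method` and `limit`. `value` is the fill value, which may be a scalar or a dict mapping index labels to values. `method` is the fill method, either `ffill` (fill with the preceding element) or `bfill` (fill with the following element). `limit` is the maximum number of consecutive missing values to fill.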

The following simple `Series` illustrates the usage:

In [22]:
s = pd.Series([np.nan, 1, np.nan, np.nan, 2, np.nan], list('aaabcd'))
s

a    NaN
a    1.0
a    NaN
b    NaN
c    2.0
d    NaN
dtype: float64

In [23]:
s.fillna(method='ffill') # fill forward with the preceding value

a    NaN
a    1.0
a    1.0
b    1.0
c    2.0
d    2.0
dtype: float64

In [24]:
s.fillna(method='ffill', limit=1) # within a run of consecutive missing values, fill at most one

a    NaN
a    1.0
a    1.0
b    NaN
c    2.0
d    2.0
dtype: float64

In [25]:
s.fillna(s.mean()) # value is a scalar

a    1.5
a    1.0
a    1.5
b    1.5
c    2.0
d    1.5
dtype: float64

In [27]:
s.fillna(value=s.mean())

a    1.5
a    1.0
a    1.5
b    1.5
c    2.0
d    1.5
dtype: float64

In [28]:
s

a    NaN
a    1.0
a    NaN
b    NaN
c    2.0
d    NaN
dtype: float64

In [29]:
s.fillna({'a': 100, 'd': 200}) # fill values looked up by index label

a    100.0
a      1.0
a    100.0
b      NaN
c      2.0
d    200.0
dtype: float64
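
In recent pandas releases the `method` argument of `fillna` is deprecated (the deprecation was introduced in 2.1) in favour of the dedicated `ffill` and `bfill` methods, which accept the same `limit` argument. A minimal sketch of the equivalent calls on the `Series` used above:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1, np.nan, np.nan, 2, np.nan], list('aaabcd'))

forward = s.ffill()                # same result as fillna(method='ffill')
forward_once = s.ffill(limit=1)    # fill at most one value per run of NaN
backward = s.bfill()               # fill from the following valid value

print(forward.tolist())
print(forward_once.tolist())
print(backward.tolist())
```

The leading `NaN` survives `ffill` and the trailing `NaN` survives `bfill`, because there is no valid value on the side the fill comes from.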

Sometimes a more reasonable fill requires grouping first. For example, fill height with the mean height of each grade:

In [31]:
df.head()

Unnamed: 0,Grade,Name,Gender,Height,Weight,Transfer
0,Freshman,Gaopeng Yang,Female,158.9,46.0,N
1,Freshman,Changqiang You,Male,166.5,70.0,N
2,Senior,Mei Sun,Male,188.9,89.0,N
3,Sophomore,Xiaojuan Sun,Female,,41.0,N
4,Sophomore,Gaojuan You,Male,174.0,74.0,N


In [32]:
df.groupby('Grade')['Height'].transform(lambda x: x.fillna(x.mean())).head()

0    158.900000
1    166.500000
2    188.900000
3    163.075862
4    174.000000
Name: Height, dtype: float64

In [40]:
df.groupby('Grade')['Height'].transform(lambda x: x.fillna(x.mean())).value_counts()

163.002128    5
163.075862    5
162.744444    5
166.800000    4
158.900000    3
             ..
152.100000    1
168.900000    1
166.700000    1
161.300000    1
175.300000    1
Name: Height, Length: 145, dtype: int64

In [41]:
df.groupby('Grade')['Height'].transform(lambda x: x.mean()).value_counts()

162.744444    59
163.969811    55
163.002128    52
163.075862    34
Name: Height, dtype: int64
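
The last two cells show the difference between filling inside the lambda and broadcasting the group mean to every row. The broadcast version can also be used as the fill value directly. This is a sketch on hypothetical data (`toy` stands in for the `Grade` and `Height` columns), not the author's method:

```python
import numpy as np
import pandas as pd

# Hypothetical data standing in for the Grade and Height columns.
toy = pd.DataFrame({
    'Grade': ['Freshman', 'Freshman', 'Freshman', 'Senior', 'Senior'],
    'Height': [160.0, np.nan, 170.0, 180.0, np.nan],
})

# transform returns one value per original row (the mean of that row's group),
# so the result is aligned with the Height column.
group_mean = toy.groupby('Grade')['Height'].transform('mean')

# fillna with an aligned Series only touches the missing positions.
filled = toy['Height'].fillna(group_mean)

print(filled.tolist())
```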

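#### 【Practice】
Fill the missing values of a sequence by the following rule: an isolated missing value is filled with the mean of its two neighbours, and consecutive missing values are left unfilled, so that `[1, NaN, 3, NaN, NaN]` becomes `[1, 2, 3, NaN, NaN]`. Implement this with the `fillna` function. (Hint: use the `limit` parameter.)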

In [42]:
test = pd.Series([1, np.nan, 3, np.nan, np.nan])

In [44]:
f_1 = test.fillna(method='ffill', limit=1)
f_1

0    1.0
1    1.0
2    3.0
3    3.0
4    NaN
dtype: float64

In [45]:
f_2 = test.fillna(method='bfill', limit=1)
f_2

0    1.0
1    3.0
2    3.0
3    NaN
4    NaN
dtype: float64

In [46]:
f = (f_1 + f_2) / 2
f

0    1.0
1    2.0
2    3.0
3    NaN
4    NaN
dtype: float64

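My take: apply `ffill` once and `bfill` once, each with `limit=1`, then average the two results. An isolated gap is filled from both sides, whereas in a run of gaps each position is reached from at most one side, so the sum stays `NaN` there.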
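
The approach can be wrapped in a small helper. The name `fill_isolated` is hypothetical, and the sketch uses `ffill(limit=1)` and `bfill(limit=1)`, which give the same results as `fillna(method='ffill', limit=1)` and `fillna(method='bfill', limit=1)`:

```python
import numpy as np
import pandas as pd

def fill_isolated(s: pd.Series) -> pd.Series:
    """Fill only isolated NaN values with the mean of their two neighbours."""
    forward = s.ffill(limit=1)    # each gap is reached from the left by one step at most
    backward = s.bfill(limit=1)   # each gap is reached from the right by one step at most
    # NaN + number is NaN, so positions filled from one side only stay missing.
    return (forward + backward) / 2

result = fill_isolated(pd.Series([1, np.nan, 3, np.nan, np.nan]))
two_gap = fill_isolated(pd.Series([1, np.nan, np.nan, 4]))

print(result.tolist())
print(two_gap.tolist())
```

The second call checks the case of a two-element gap between valid values: both positions stay missing, because each is filled from only one side.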

#### 【END】
### 2. Interpolation functions

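The [documentation](https://pandas.pydata.org/docs/reference/api/pandas.Series.interpolate.html#pandas.Series.interpolate) of the `interpolate` function lists many interpolation methods, including a large number taken from `Scipy`. Because many of them rely on fairly advanced mathematics, only three common and simple cases are discussed here: linear interpolation, nearest-neighbour interpolation and index interpolation.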

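Besides the interpolation method (linear interpolation, `linear`, by default), `interpolate` has two commonly used parameters similar to those of `fillna`: `limit_direction`, which controls the direction, and `limit`, which controls the maximum number of consecutive missing values to interpolate. The direction defaults to `forward`, which is analogous to `ffill` in the `method` parameter of `fillna`. To limit the interpolation backward or in both directions, specify `backward` or `both`.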

In [47]:
s = pd.Series([np.nan, np.nan, 1, np.nan, np.nan, np.nan, 2, np.nan, np.nan])
s.values

array([nan, nan,  1., nan, nan, nan,  2., nan, nan])

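For example, under the default linear method, interpolate first with the `backward` limit and then with the two-sided limit, capping the number of consecutive fills at 1: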

In [48]:
res = s.interpolate(limit_direction='backward', limit=1)
res.values

array([ nan, 1.  , 1.  ,  nan,  nan, 1.75, 2.  ,  nan,  nan])

In [49]:
res = s.interpolate(limit_direction='both', limit=1)
res.values

array([ nan, 1.  , 1.  , 1.25,  nan, 1.75, 2.  , 2.  ,  nan])

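The second common case is nearest-neighbour interpolation, in which a missing element takes the value of the closest non-missing element: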

In [50]:
s.interpolate('nearest')

0    NaN
1    NaN
2    1.0
3    1.0
4    1.0
5    2.0
6    2.0
7    NaN
8    NaN
dtype: float64

In [51]:
s.interpolate("nearest").values

array([nan, nan,  1.,  1.,  1.,  2.,  2., nan, nan])

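Finally, index interpolation performs linear interpolation according to the index values. For example, construct an unevenly spaced index to demonstrate: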

In [52]:
s = pd.Series([0,np.nan,10],index=[0,1,10])
s

0      0.0
1      NaN
10    10.0
dtype: float64

In [53]:
s.interpolate() # default linear interpolation, which here gives the midpoint of the two neighbours

0      0.0
1      5.0
10    10.0
dtype: float64

In [54]:
s.interpolate(method='index') # linear interpolation with respect to the index: the value at the given index position

0      0.0
1      1.0
10    10.0
dtype: float64
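
The two results differ because `linear` treats the points as equally spaced by position, while `index` uses the actual index values as the x-coordinates. The following sketch checks the index-based value by hand and with `np.interp`. The variable names are illustrative only:

```python
import numpy as np
import pandas as pd

s = pd.Series([0, np.nan, 10], index=[0, 1, 10])

by_position = s.interpolate()              # positions 0, 1, 2: midpoint of 0 and 10
by_index = s.interpolate(method='index')   # index 1 lies 1/10 of the way from 0 to 10

# Manual check of the index-based value: y0 + (x - x0) / (x1 - x0) * (y1 - y0)
manual = 0 + (1 - 0) / (10 - 0) * (10 - 0)

# np.interp evaluates the same piecewise-linear function through the known points.
check = np.interp(1, [0, 10], [0.0, 10.0])

print(by_position.tolist(), by_index.tolist(), manual, check)
```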

This method also works with a timestamp index. Other time-series topics are discussed in Chapter 10. Here is a simple example:

In [55]:
s = pd.Series([0,np.nan,10], index=pd.to_datetime(['20200101', '20200102', '20200111']))
s

2020-01-01     0.0
2020-01-02     NaN
2020-01-11    10.0
dtype: float64

In [56]:
s.interpolate()

2020-01-01     0.0
2020-01-02     5.0
2020-01-11    10.0
dtype: float64

In [57]:
s.interpolate(method='index')

2020-01-01     0.0
2020-01-02     1.0
2020-01-11    10.0
dtype: float64

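#### 【NOTE】Caveats for polynomial and spline interpolation
When `polynomial` is chosen as the method in `interpolate`, it internally calls `scipy.interpolate.interp1d(*,*,kind=order)`, which in turn calls `make_interp_spline`, so it is actually spline interpolation rather than polynomial-fit interpolation in the style of `polyfit` in `numpy`. When `spline` is chosen, `pandas` calls `scipy.interpolate.UnivariateSpline` rather than ordinary spline interpolation. The documentation for this part is rather confusing and the parameter design is not reasonable either, so when using these two kinds of interpolation, take care to choose the method that matches your actual needs.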
#### 【END】
## III. Nullable Types
### 1. Missing-value markers and their shortcomings

In `python`, a missing value is represented by `None`. It is equal to itself and to nothing else:

In [58]:
None == None

True

In [59]:
None == False

False

In [60]:
None == []

False

In [61]:
None == ''

False

`numpy` uses `np.nan` to represent a missing value. Besides being unequal to every other element, it also returns `False` when compared with itself:

In [62]:
np.nan == np.nan

False

In [63]:
np.nan == None

False

In [64]:
np.nan == False

False

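Note that although an element-wise comparison on a series or table returns `False` at the `np.nan` positions, the `equals` function, which tests whether two tables or two series are identical, treats positions that are missing on both sides as equal and can therefore return `True`: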

In [65]:
s1 = pd.Series([1, np.nan])
s2 = pd.Series([1, 2])
s3 = pd.Series([1, np.nan])
s1 == 1

0     True
1    False
dtype: bool

In [66]:
s1.equals(s2)

False

In [67]:
s1.equals(s3)

True
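
To make the difference concrete, the behaviour of `equals` on these two series can be approximated by hand: two positions match if the values are equal or if both are missing. This is only a sketch of the idea (the real `equals` also compares dtypes and shape), and the variable names are illustrative:

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1, np.nan])
s3 = pd.Series([1, np.nan])

elementwise = (s1 == s3)   # NaN == NaN is False at position 1

# Treat a position as matching when both sides are missing.
same_or_both_missing = (s1 == s3) | (s1.isna() & s3.isna())

print(elementwise.tolist(), same_or_both_missing.tolist(), s1.equals(s3))
```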

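For time-series objects, `pandas` uses `pd.NaT` to denote a missing value. It plays the same role as `np.nan` (time-series objects and their construction are discussed in Chapter 10):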

In [68]:
pd.to_timedelta(['30s', np.nan]) # NaT in a Timedelta

TimedeltaIndex(['0 days 00:00:30', NaT], dtype='timedelta64[ns]', freq=None)

In [69]:
pd.to_datetime(['20200101', np.nan]) # NaT in a Datetime

DatetimeIndex(['2020-01-01', 'NaT'], dtype='datetime64[ns]', freq=None)

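So why introduce `pd.NaT` to represent missing values in time objects? What would go wrong if they were still stored as `np.nan`? `pandas` has an `object` type, which is a mixed type: when elements of several types are stored in the same `Series`, its dtype becomes `object`. For example, a list holding both an integer and a string: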

In [70]:
pd.Series([1, 'two'])

0      1
1    two
dtype: object

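The root of the `NaT` issue is that `np.nan` is itself a float. If floats and time types were stored together without a dedicated built-in missing type, the result would be the ambiguous `object` type, which is clearly undesirable.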

In [71]:
type(np.nan)

float

In [72]:
type(pd.NaT)

pandas._libs.tslibs.nattype.NaTType

Likewise, because `np.nan` is a float, a missing value in an integer `Series` turns its dtype into `float64`, and a missing value in a Boolean series turns its dtype into `object` rather than `bool`:

In [73]:
pd.Series([1, np.nan]).dtype

dtype('float64')

In [75]:
pd.Series([True, False, np.nan]).dtype

dtype('O')

In [76]:
t = pd.Series([True, False, np.nan])
t

0     True
1    False
2      NaN
dtype: object

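Therefore, starting with version `1.0.0`, `pandas` has experimented with a new missing-value type, `pd.NA`, and three `Nullable` series types that address these shortcomings: `Int`, `boolean` and `string`.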

### 2. Properties of Nullable types

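Taken literally, `Nullable` means that the type can hold nulls. In other words, the dtype of the series is unaffected by missing values. For example, missing values stored in any of the three `Nullable` types above are converted to `pd.NA`, the built-in missing value of `pandas`: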

In [77]:
pd.Series([np.nan, 1], dtype = 'Int64') # note the capital "I"

0    <NA>
1       1
dtype: Int64

In [78]:
pd.Series([np.nan, 1])

0    NaN
1    1.0
dtype: float64

In [79]:
pd.Series([np.nan, True], dtype = 'boolean')

0    <NA>
1    True
dtype: boolean

In [80]:
pd.Series([np.nan, 'my_str'], dtype = 'string')

0      <NA>
1    my_str
dtype: string

For an `Int` series, results stay `Nullable` whenever possible:

In [81]:
pd.Series([np.nan, 0], dtype = 'Int64') + 1

0    <NA>
1       1
dtype: Int64

In [82]:
pd.Series([np.nan, 0]) + 1

0    NaN
1    1.0
dtype: float64

In [83]:
pd.Series([np.nan, 0], dtype = 'Int64') == 0

0    <NA>
1    True
dtype: boolean

In [84]:
pd.Series([np.nan, 0], dtype = 'Int64') * 0.5 # the result has to be floating point

0    <NA>
1     0.0
dtype: Float64

A `boolean` series differs from a `bool` series in two main ways:

First, a Boolean list containing missing values cannot be used for selection in an indexer, whereas `boolean` treats missing values as `False`:

In [85]:
s = pd.Series(['a', 'b'])
s_bool = pd.Series([True, np.nan])
s_boolean = pd.Series([True, np.nan]).astype('boolean')
# s[s_bool] # raises an error
s[s_boolean]

0    a
dtype: object

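Second, in logical operations the `bool` type always returns `False` at missing positions, whereas `boolean` returns a value according to whether the operation determines a unique result. What does that mean? A simple example: `True | pd.NA` must return `True` whatever the missing value is. The result of `False | pd.NA` changes with the missing value, so it returns `pd.NA`. And `False & pd.NA` must return `False` whatever the missing value is.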

In [86]:
s_boolean & True

0    True
1    <NA>
dtype: boolean

In [87]:
s_bool & True

0     True
1    False
dtype: bool

In [88]:
s_boolean | True

0    True
1    True
dtype: boolean

In [89]:
s_bool | True

0     True
1    False
dtype: bool

In [90]:
~s_boolean # negation likewise cannot determine the missing result uniquely

0    False
1     <NA>
dtype: boolean
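
The rule described above can be checked directly on scalars. The following sketch evaluates the four combinations of `|` and `&` with `pd.NA`. The variable names are illustrative only:

```python
import pandas as pd

# Scalar logic with pd.NA: the result is definite only when the
# unknown operand cannot change it.
or_true = True | pd.NA      # True whatever the missing value is
or_false = False | pd.NA    # depends on the missing value, so <NA>
and_false = False & pd.NA   # False whatever the missing value is
and_true = True & pd.NA     # depends on the missing value, so <NA>

print(or_true, or_false, and_false, and_true)
```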

The properties of the `string` type are discussed in the next chapter on text data.

In practice, after a data set is read in, it can first be converted to `Nullable` types with `convert_dtypes`:

In [91]:
df = pd.read_csv('../data/learn_pandas.csv')
df.dtypes

School          object
Grade           object
Name            object
Gender          object
Height         float64
Weight         float64
Transfer        object
Test_Number      int64
Test_Date       object
Time_Record     object
dtype: object

In [92]:
df = df.convert_dtypes()
df.dtypes

School          string
Grade           string
Name            string
Gender          string
Height         Float64
Weight           Int64
Transfer        string
Test_Number      Int64
Test_Date       string
Time_Record     string
dtype: object
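
The `Weight` column above becomes `Int64` even though it was read as `float64`, because all of its non-missing values are whole numbers. A minimal sketch of this behaviour on a hypothetical frame (`toy` and its columns are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical frame: whole-number floats with a gap, and booleans with a gap.
toy = pd.DataFrame({
    'n': [1.0, 2.0, np.nan],
    'flag': [True, False, None],
})

before = toy.dtypes.astype(str).to_dict()
after = toy.convert_dtypes().dtypes.astype(str).to_dict()

print(before)
print(after)
```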

### 3. Computing and grouping with missing data

When `sum` and `prod` perform addition and multiplication, missing data is effectively treated as 0 and 1 respectively, so it does not change the result:

In [93]:
s = pd.Series([2,3,np.nan,4,5])
s.sum()

14.0

In [94]:
s.prod()

120.0

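Cumulative functions skip the positions of missing values: the result stays missing there, and the running total carries on from the last valid value: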

In [95]:
s.cumsum()

0     2.0
1     5.0
2     NaN
3     9.0
4    14.0
dtype: float64
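
Skipping is the default rather than the only option. As a hedged sketch, the `skipna` and `min_count` parameters of `sum` control when a missing value should propagate instead. The variable names are illustrative only:

```python
import numpy as np
import pandas as pd

s = pd.Series([2, 3, np.nan, 4, 5])

total = s.sum()                      # the missing value is skipped
strict_total = s.sum(skipna=False)   # any missing value makes the result NaN
guarded = s.sum(min_count=5)         # needs at least 5 valid values, otherwise NaN
running = s.cumsum()                 # NaN stays in place, the total carries on after it

print(total, strict_total, guarded, running.tolist())
```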

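In scalar operations, every result is missing except for `np.nan ** 0` and `1 ** np.nan`, which have definite values (`pd.NA` behaves the same way). In comparisons such as `==` and `>`, `np.nan` returns `False` (the exception is `!=`, which returns `True`), whereas `pd.NA` returns `pd.NA`: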

In [97]:
np.nan ** 0
# anything raised to the power 0 is 1

1.0

In [98]:
1 ** np.nan
# 1 raised to any power is 1

1.0

In [99]:
0 ** 0

1

In [100]:
np.nan == 0

False

In [101]:
pd.NA == 0

<NA>

In [102]:
np.nan > 0

False

In [103]:
pd.NA > 0

<NA>

In [104]:
np.nan + 1

nan

In [105]:
np.log(np.nan)

nan

In [106]:
np.add(np.nan, 1)

nan

In [107]:
np.nan ** 0

1.0

In [108]:
pd.NA ** 0

1

In [109]:
1 ** np.nan

1.0

In [110]:
1 ** pd.NA

1

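Also note that [`diff`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.diff.html) and [`pct_change`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pct_change.html), though similar in purpose, handle missing values differently. The former sets every result that involves a missing value to missing, whereas the latter reports a 0% change at the missing position (in the pandas version used here, the missing value is forward-filled before the ratio is computed):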

In [111]:
s

0    2.0
1    3.0
2    NaN
3    4.0
4    5.0
dtype: float64

In [112]:
s.diff()

0    NaN
1    1.0
2    NaN
3    NaN
4    1.0
dtype: float64

In [113]:
s.pct_change()

0         NaN
1    0.500000
2    0.000000
3    0.333333
4    0.250000
dtype: float64
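
The `pct_change` output above is consistent with forward-filling the series first. The sketch below makes that step explicit, so it does not rely on the default of `pct_change`, which may differ between pandas versions. The variable names are illustrative only:

```python
import numpy as np
import pandas as pd

s = pd.Series([2, 3, np.nan, 4, 5])

# Forward-fill first, then compute the relative change step by step.
filled = s.ffill()                      # [2, 3, 3, 4, 5]
pct = filled.pct_change()               # reproduces the pct_change output above
manual = filled / filled.shift(1) - 1   # the same ratio written out

diff = s.diff()                         # NaN wherever either operand is missing

print(pct.tolist())
print(diff.tolist())
```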

Some functions can treat missing values as a category of their own. For example, `groupby` and `get_dummies` take parameters that add a category for missing values:

In [114]:
df_nan = pd.DataFrame({'category':['a','a','b',np.nan,np.nan], 'value':[1,3,5,7,9]})
df_nan

Unnamed: 0,category,value
0,a,1
1,a,3
2,b,5
3,,7
4,,9


In [115]:
df_nan.groupby('category', dropna=False)['value'].mean() # requires pandas 1.1.0 or later

category
a      2.0
b      5.0
NaN    8.0
Name: value, dtype: float64

In [116]:
pd.get_dummies(df_nan.category, dummy_na=True)
# convert to indicator (dummy) variables
# https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html

Unnamed: 0,a,b,NaN
0,1,0,0
1,1,0,0
2,0,1,0
3,0,0,1
4,0,0,1


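## IV. Exercises
### Ex1: Testing the association between missing values and the class label
In data processing, a column with too many missing values is usually dropped, unless its missingness is strongly associated with the label. Below is a data set for a binary classification problem, where `X_1, X_2` are the feature variables and `y` is the binary label.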

In [117]:
df = pd.read_csv('../data/missing_chi.csv')
df.head()

Unnamed: 0,X_1,X_2,y
0,,,0
1,,,0
2,,,0
3,43.0,,0
4,,,0


In [118]:
df.shape

(1000, 3)

In [120]:
df.isna().mean()

X_1    0.855
X_2    0.894
y      0.000
dtype: float64

In [121]:
df.y.value_counts(normalize=True)

0    0.918
1    0.082
Name: y, dtype: float64

In [122]:
df.y.value_counts()

0    918
1     82
Name: y, dtype: int64

In [126]:
df.isna()

Unnamed: 0,X_1,X_2,y
0,True,True,False
1,True,True,False
2,True,True,False
3,False,True,False
4,True,True,False
...,...,...,...
995,False,True,False
996,True,True,False
997,True,True,False
998,True,True,False


In [127]:
df.isna().value_counts(dropna=False)

X_1    X_2    y    
True   True   False    767
False  True   False    127
True   False  False     88
False  False  False     18
dtype: int64

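In fact, whether a value is missing can itself be a feature, and in some settings it may be related to whether the label is positive or negative. In statistics, the chi-square test can be used to judge whether the presence of missingness and the sign of the label are associated. The samples fall into four cases: feature missing and positive, feature missing and negative, feature present and positive, feature present and negative. Let the corresponding counts be $n_{11}, n_{10}, n_{01}, n_{00}$. If the two are unrelated, the theoretical number of positives among the missing samples should be close to the total number of missing samples $\times$ the overall proportion of positives, that is: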

$$E_{11} = n_{11} \approx (n_{11}+n_{10})\times\frac{n_{11}+n_{01}}{n_{11}+n_{10}+n_{01}+n_{00}} = F_{11}$$

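The other three cases are analogous. Denote the observed and theoretical values by $E_{ij}$ and $F_{ij}$ respectively. The smaller the following statistic, the closer the observed values are to the theoretical values under independence: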

$$S = \sum_{i\in \{0,1\}}\sum_{j\in \{0,1\}} \frac{(E_{ij}-F_{ij})^2}{F_{ij}}$$

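It can be shown that this statistic approximately follows a chi-square distribution with $1$ degree of freedom, i.e. $S\overset{\cdot}{\sim} \chi^2(1)$. The association can therefore be judged from the probability $P(\chi^2(1)>S)$: when it is below $0.05$, the missingness is generally regarded as associated with the label, meaning that the observed values differ substantially from the theoretical values under independence.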

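This probability is the $p$-value of the test on a $2\times2$ contingency table in statistics, and it can be obtained with `scipy.stats.chi2.sf(S, 1)`. Using the material above, test the `X_1` and `X_2` columns separately.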



In [134]:
x1 = df[['X_1', 'y']]
x2 = df[['X_2', 'y']]

In [137]:
x1.value_counts(dropna=False)

X_1   y
NaN   0    785
      1     70
50.0  0      6
44.0  0      6
30.0  0      6
28.0  0      6
17.0  0      5
25.0  0      5
1.0   0      5
23.0  0      4
19.0  0      4
29.0  0      4
34.0  0      4
37.0  0      4
10.0  0      4
32.0  0      3
12.0  0      3
4.0   0      3
9.0   0      3
24.0  0      3
5.0   0      3
47.0  0      3
43.0  0      3
49.0  0      3
15.0  0      3
14.0  0      3
35.0  0      3
13.0  0      3
33.0  0      2
27.0  0      2
31.0  0      2
36.0  0      2
8.0   0      2
2.0   0      2
46.0  0      2
21.0  0      2
7.0   0      2
3.0   1      2
      0      2
41.0  0      2
42.0  0      1
39.0  0      1
48.0  0      1
      1      1
50.0  1      1
28.0  1      1
38.0  0      1
5.0   1      1
26.0  0      1
6.0   0      1
22.0  1      1
20.0  1      1
18.0  0      1
17.0  1      1
16.0  0      1
13.0  1      1
11.0  0      1
9.0   1      1
24.0  1      1
dtype: int64

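Counts for `X_1` (first index: 1 if the feature is missing, 0 otherwise; second index: the label `y`):

- n11 = 70
- n10 = 785
- n01 = 12
- n00 = 133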

In [140]:
x2.y.value_counts()

0    918
1     82
Name: y, dtype: int64

In [138]:
x2.value_counts(dropna=False)

X_2    y
NaN    0    894
431.0  1      2
264.0  1      2
289.0  1      2
153.0  1      2
           ... 
164.0  1      1
162.0  1      1
155.0  1      1
143.0  1      1
248.0  0      1
Length: 98, dtype: int64

In [141]:
x2.isna().value_counts()

X_2    y    
True   False    894
False  False    106
dtype: int64

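Counts for `X_2` (first index: 1 if the feature is missing, 0 otherwise; second index: the label `y`):

- n11 = 0
- n10 = 894
- n01 = 82
- n00 = 24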

In [147]:
np.zeros((2, 2))

array([[0., 0.],
       [0., 0.]])

In [170]:
def chi_s(n11, n10, n01, n00):
    # compute the chi-square test statistic S
    m = np.array([[n00, n01], [n10, n11]])
    ssum = m.sum()
    F = [(m[i, 1] + m[i, 0]) * (m[0, j] + m[1, j]) / ssum for i in [0, 1] for j in [0, 1]]
    F = np.array(F).reshape(2, 2)
    s = ((m - F)**2 / F).sum()
    return s

In [175]:
# x1
n11 = 70
n10 = 785
n01 = 12
n00 = 133

s1 = chi_s(n11, n10, n01, n00)
s1

0.0012965662713972017

In [177]:
# x2
n11 = 0
n10 = 894
n01 = 82
n00 = 24

s2 = chi_s(n11, n10, n01, n00)
s2

753.3604636823281

In [186]:
from scipy.stats import chi2

In [187]:
chi2.sf(s1, 1)

0.9712760884395901

In [189]:
chi2.sf(s2, 1)

7.459641265637543e-166
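
The counting by hand above can also be delegated to `pd.crosstab`. The following is a sketch under stated assumptions, not the reference solution: the helper name `missing_chi2` is hypothetical, the data is synthetic and merely reproduces the four counts found for `X_1`, and the contingency table is assumed to be a full $2\times2$ table. For one degree of freedom the tail probability can be written with the standard library as $P(\chi^2(1)>S)=\operatorname{erfc}(\sqrt{S/2})$, which is used here in place of `scipy.stats.chi2.sf(S, 1)`.

```python
import math

import numpy as np
import pandas as pd

def missing_chi2(feature: pd.Series, label: pd.Series):
    """Chi-square statistic and p-value for missingness of `feature` vs `label`."""
    # Rows: feature present / missing; columns: label 0 / 1.
    observed = pd.crosstab(feature.isna(), label).to_numpy(dtype=float)
    total = observed.sum()
    # Expected count under independence: row total * column total / grand total.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / total
    stat = ((observed - expected) ** 2 / expected).sum()
    # For 1 degree of freedom, P(chi2 > S) = erfc(sqrt(S / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Synthetic data with the same counts as X_1: n11=70, n10=785, n01=12, n00=133.
feature = pd.Series([np.nan] * 855 + [1.0] * 145)
label = pd.Series([1] * 70 + [0] * 785 + [1] * 12 + [0] * 133)

stat, p_value = missing_chi2(feature, label)
print(stat, p_value)
```

Read against the $0.05$ threshold from the exercise, the results computed above say that the $p$-value for `X_1` (about $0.97$) gives no evidence of an association between missingness and the label, while the $p$-value for `X_2` is essentially zero, so its missingness is strongly associated with the label.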

### Ex2: Solving a classification problem with a regression model

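`KNN` is a supervised learning model that can handle both regression and classification problems. For a categorical variable, a `KNN` classifier can be used to impute missing values. The idea is to measure the distance between the features of the sample with the missing value and those of all other samples and, given the model parameter `n_neighbors=n`, to find the most frequent class among the $n$ nearest samples and use it as the predicted class for the missing value. This is illustrated in the figure below, where the unknown class is predicted as yellow: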

<img src="../source/_static/ch7_ex.png" width="50%">

The feature data of the coloured points above is provided as follows:

In [2]:
import pandas as pd
import numpy as np

In [16]:
df = pd.read_excel('../data/color.xlsx')
df.head(3)

Unnamed: 0,X1,X2,Color
0,-2.5,2.8,Blue
1,-1.5,1.8,Blue
2,-0.8,2.8,Blue


In [4]:
df.shape

(23, 3)

Given the sample to predict, $X_1=0.8, X_2=-0.2$, the predicted class can be obtained as follows:

https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

In [6]:
from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier(n_neighbors=6)
clf.fit(df.iloc[:,:2], df.Color)
clf.predict([[0.8, -0.2]])



array(['Yellow'], dtype=object)

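1. For a regression problem the output is a specific number, so the prediction is the mean of the values of the $n$ nearest samples. Recast the classification problem above as a regression problem and reproduce the behaviour of `KNeighborsClassifier` using only `KNeighborsRegressor`.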

In [8]:
df.Color.value_counts()

Yellow    8
Green     8
Blue      7
Name: Color, dtype: int64

In [126]:
# For regression, the labels need to be transformed first
# https://blog.csdn.net/qq_32863549/article/details/106972347

color_trans = {'Yellow': 1, 'Green': 2, 'Blue': 3}
df['Color'] = df['Color'].map(color_trans)
df.head(3)

Unnamed: 0,X1,X2,Color
0,-2.5,2.8,3
1,-1.5,1.8,3
2,-0.8,2.8,3


In [127]:
from sklearn.neighbors import KNeighborsRegressor
clf = KNeighborsRegressor(n_neighbors=6, weights='distance')
clf.fit(df.iloc[:,:2], df.Color)
clf.predict([[0.8, -0.2]])



array([1.77016909])

In [128]:
from sklearn.neighbors import KNeighborsRegressor
clf = KNeighborsRegressor(n_neighbors=6)
clf.fit(df.iloc[:,:2], df.Color)
clf.predict([[0.8, -0.2]])



array([1.66666667])

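With 1.67 rounding to 2, this prediction comes out as green rather than yellow, and judging from the figure green does look somewhat more reasonable.
Then again, it is hard to say: the mean of integer codes depends on the arbitrary order in which the colours are numbered.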

2. Using the method from question 1, impute the missing values of the `Employment` variable in the `audit` data set.

In [91]:
df = pd.read_csv('../data/audit.csv')
df.head(3)

Unnamed: 0,ID,Age,Employment,Marital,Income,Gender,Hours
0,1004641,38,Private,Unmarried,81838.0,Female,72
1,1010229,35,Private,Absent,72099.0,Male,30
2,1024587,32,Private,Divorced,154676.74,Male,40


In [92]:
df.isna().value_counts()

ID     Age    Employment  Marital  Income  Gender  Hours
False  False  False       False    False   False   False    1900
              True        False    False   False   False     100
dtype: int64

In [93]:
df['Employment'] = pd.Categorical(df['Employment']).codes
df['Marital'] = pd.Categorical(df['Marital']).codes
df['Gender'] = pd.Categorical(df['Gender']).codes
df

Unnamed: 0,ID,Age,Employment,Marital,Income,Gender,Hours
0,1004641,38,4,4,81838.00,0,72
1,1010229,35,4,0,72099.00,1,30
2,1024587,32,4,1,154676.74,1,40
3,1038288,45,4,2,27743.82,1,55
4,1044221,60,4,2,7568.23,1,40
...,...,...,...,...,...,...,...
1995,9957280,62,4,2,24080.59,1,40
1996,9964393,35,0,2,57497.30,1,40
1997,9972967,32,4,2,30538.18,1,44
1998,9991103,34,4,4,113425.67,1,45


In [94]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
df = scaler.fit_transform(df)

In [97]:
df = pd.DataFrame(df)
df

Unnamed: 0,0,1,2,3,4,5,6
0,0.000000,0.287671,0.625,0.8,0.168997,0.0,0.724490
1,0.000621,0.246575,0.625,0.0,0.148735,1.0,0.295918
2,0.002218,0.205479,0.625,0.2,0.320539,1.0,0.397959
3,0.003742,0.383562,0.625,0.4,0.056453,1.0,0.551020
4,0.004402,0.589041,0.625,0.4,0.014477,1.0,0.397959
...,...,...,...,...,...,...,...
1995,0.995682,0.616438,0.625,0.4,0.048832,1.0,0.397959
1996,0.996474,0.246575,0.125,0.4,0.118356,1.0,0.397959
1997,0.997427,0.205479,0.625,0.4,0.062267,1.0,0.438776
1998,0.999444,0.232877,0.625,0.8,0.234715,1.0,0.448980


In [99]:
df.iloc[:, 2].value_counts()

0.625    1411
0.125     148
0.375     119
0.000     100
0.750      79
0.500      72
0.250      69
0.875       1
1.000       1
Name: 2, dtype: int64

In [101]:
train = df[df[2]!=0.000]

In [102]:
train.shape

(1900, 7)

In [103]:
train

Unnamed: 0,0,1,2,3,4,5,6
0,0.000000,0.287671,0.625,0.8,0.168997,0.0,0.724490
1,0.000621,0.246575,0.625,0.0,0.148735,1.0,0.295918
2,0.002218,0.205479,0.625,0.2,0.320539,1.0,0.397959
3,0.003742,0.383562,0.625,0.4,0.056453,1.0,0.551020
4,0.004402,0.589041,0.625,0.4,0.014477,1.0,0.397959
...,...,...,...,...,...,...,...
1995,0.995682,0.616438,0.625,0.4,0.048832,1.0,0.397959
1996,0.996474,0.246575,0.125,0.4,0.118356,1.0,0.397959
1997,0.997427,0.205479,0.625,0.4,0.062267,1.0,0.438776
1998,0.999444,0.232877,0.625,0.8,0.234715,1.0,0.448980


In [104]:
train_label = train.iloc[:, [0, 1, 3, 4, 5, 6]]
train_target = train.iloc[:, 2]

In [105]:
pred = df[df[2]==0.000]

In [106]:
pred_label = pred.iloc[:, [0, 1, 3, 4, 5, 6]]
pred_target = pred.iloc[:, 2]

In [110]:
from sklearn.neighbors import KNeighborsRegressor
clf = KNeighborsRegressor(n_neighbors=6)
clf.fit(train_label, train_target)
clf.predict(pred_label)

array([0.4375    , 0.4375    , 0.625     , 0.58333333, 0.58333333,
       0.54166667, 0.625     , 0.54166667, 0.54166667, 0.58333333,
       0.58333333, 0.54166667, 0.58333333, 0.625     , 0.33333333,
       0.47916667, 0.5625    , 0.47916667, 0.47916667, 0.625     ,
       0.58333333, 0.625     , 0.54166667, 0.54166667, 0.625     ,
       0.625     , 0.625     , 0.5625    , 0.5       , 0.60416667,
       0.5625    , 0.625     , 0.625     , 0.5625    , 0.58333333,
       0.58333333, 0.52083333, 0.5625    , 0.625     , 0.64583333,
       0.64583333, 0.54166667, 0.5       , 0.5625    , 0.5625    ,
       0.625     , 0.5625    , 0.5625    , 0.625     , 0.5       ,
       0.58333333, 0.39583333, 0.4375    , 0.52083333, 0.5625    ,
       0.625     , 0.54166667, 0.5625    , 0.5625    , 0.625     ,
       0.66666667, 0.5       , 0.625     , 0.47916667, 0.5625    ,
       0.58333333, 0.4375    , 0.60416667, 0.4375    , 0.52083333,
       0.5       , 0.54166667, 0.64583333, 0.625     , 0.625  

In [111]:
from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier(n_neighbors=6)
clf.fit(train_label, train_target)
clf.predict(pred_label)

ValueError: Unknown label type: 'continuous'

The reference answer's approach

In [112]:
df = pd.read_excel('../data/color.xlsx')

In [113]:
df_dummies = pd.get_dummies(df.Color)

In [138]:
df_dummies.head(5)

Unnamed: 0,Blue,Green,Yellow
0,1,0,0
1,1,0,0
2,1,0,0
3,1,0,0
4,1,0,0


In [120]:
stack_list = []

for col in df_dummies.columns:
    clf = KNeighborsRegressor(n_neighbors=6)
    clf.fit(df.iloc[:, :2], df_dummies[col])
    res = clf.predict([[0.8, -0.2]]).reshape(-1, 1)
    stack_list.append(res)



In [121]:
stack_list

[array([[0.16666667]]), array([[0.33333333]]), array([[0.5]])]

In [139]:
code_res = pd.Series(np.hstack(stack_list).argmax(1))
code_res
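# Why pick the largest value in this step instead of taking a mean? Then again, a mean does not seem right either.
# I asked Jiuyu, a senior labmate: each value reflects how close the prediction is to 1 or to 0. The closer it is to 1, the closer the sample is to the colour in that position, so the largest one is chosen.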

0    2
dtype: int64

In [130]:
df_dummies.columns[code_res[0]]

'Yellow'
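
Why `argmax` is the right step can be seen without fitting a model. With one-hot targets, each regressor's output is the mean of an indicator column over the 6 nearest neighbours, i.e. that class's share of the votes. The sketch below uses a neighbour composition inferred from the outputs above (1/6 Blue, 2/6 Green, 3/6 Yellow). The series `neighbours` is hypothetical, since the actual neighbours are not listed in the notebook. It also reproduces the 1.67 obtained earlier with integer codes:

```python
import pandas as pd

# Labels of the 6 nearest neighbours, as implied by the regressor outputs above
# (1/6 Blue, 2/6 Green, 3/6 Yellow); the order is arbitrary.
neighbours = pd.Series(['Blue', 'Green', 'Green', 'Yellow', 'Yellow', 'Yellow'])

# One-hot targets: the mean of each indicator column is that class's vote share.
shares = pd.get_dummies(neighbours).mean()
winner = shares.idxmax()   # the largest share is the majority vote

# Integer codes: the mean depends on the arbitrary numbering of the classes.
codes = neighbours.map({'Yellow': 1, 'Green': 2, 'Blue': 3})
code_mean = codes.mean()

print(shares.to_dict(), winner, code_mean)
```

The one-hot route picks the majority class, Yellow, exactly as `KNeighborsClassifier` does, while the code mean of about 1.67 rounds to the code for Green only because Green happens to sit between the other two codes.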

Question 2

In [2]:
df = pd.read_csv('../data/audit.csv')
df.head(3)

Unnamed: 0,ID,Age,Employment,Marital,Income,Gender,Hours
0,1004641,38,Private,Unmarried,81838.0,Female,72
1,1010229,35,Private,Absent,72099.0,Male,30
2,1024587,32,Private,Divorced,154676.74,Male,40


In [3]:
res_df = df.copy()

In [4]:
df = pd.concat([pd.get_dummies(df[['Marital', 'Gender']]),
                df[['Age', 'Income', 'Hours']].apply(lambda x: (x-x.min()) / (x.max()-x.min())), df.Employment], axis=1)


In [5]:
df
# Process each kind of variable appropriately: encode the labels of the discrete variables and normalize the continuous ones

Unnamed: 0,Marital_Absent,Marital_Divorced,Marital_Married,Marital_Married-spouse-absent,Marital_Unmarried,Marital_Widowed,Gender_Female,Gender_Male,Age,Income,Hours,Employment
0,0,0,0,0,1,0,1,0,0.287671,0.168997,0.724490,Private
1,1,0,0,0,0,0,0,1,0.246575,0.148735,0.295918,Private
2,0,1,0,0,0,0,0,1,0.205479,0.320539,0.397959,Private
3,0,0,1,0,0,0,0,1,0.383562,0.056453,0.551020,Private
4,0,0,1,0,0,0,0,1,0.589041,0.014477,0.397959,Private
...,...,...,...,...,...,...,...,...,...,...,...,...
1995,0,0,1,0,0,0,0,1,0.616438,0.048832,0.397959,Private
1996,0,0,1,0,0,0,0,1,0.246575,0.118356,0.397959,Consultant
1997,0,0,1,0,0,0,0,1,0.205479,0.062267,0.438776,Private
1998,0,0,0,0,1,0,0,1,0.232877,0.234715,0.448980,Private


In [6]:
X_train = df.query('Employment.notna()')

In [7]:
X_train

Unnamed: 0,Marital_Absent,Marital_Divorced,Marital_Married,Marital_Married-spouse-absent,Marital_Unmarried,Marital_Widowed,Gender_Female,Gender_Male,Age,Income,Hours,Employment
0,0,0,0,0,1,0,1,0,0.287671,0.168997,0.724490,Private
1,1,0,0,0,0,0,0,1,0.246575,0.148735,0.295918,Private
2,0,1,0,0,0,0,0,1,0.205479,0.320539,0.397959,Private
3,0,0,1,0,0,0,0,1,0.383562,0.056453,0.551020,Private
4,0,0,1,0,0,0,0,1,0.589041,0.014477,0.397959,Private
...,...,...,...,...,...,...,...,...,...,...,...,...
1995,0,0,1,0,0,0,0,1,0.616438,0.048832,0.397959,Private
1996,0,0,1,0,0,0,0,1,0.246575,0.118356,0.397959,Consultant
1997,0,0,1,0,0,0,0,1,0.205479,0.062267,0.438776,Private
1998,0,0,0,0,1,0,0,1,0.232877,0.234715,0.448980,Private


In [8]:
X_test = df.query("Employment.isna()")

In [9]:
X_test

Unnamed: 0,Marital_Absent,Marital_Divorced,Marital_Married,Marital_Married-spouse-absent,Marital_Unmarried,Marital_Widowed,Gender_Female,Gender_Male,Age,Income,Hours,Employment
60,1,0,0,0,0,0,1,0,0.356164,0.126044,0.397959,
222,0,0,1,0,0,0,0,1,0.753425,0.031294,0.071429,
226,0,1,0,0,0,0,1,0,0.068493,0.167540,0.397959,
260,1,0,0,0,0,0,0,1,0.013699,0.535839,0.153061,
278,1,0,0,0,0,0,0,1,0.068493,0.190168,0.397959,
...,...,...,...,...,...,...,...,...,...,...,...,...
1952,0,1,0,0,0,0,1,0,0.616438,0.135874,0.397959,
1954,0,0,0,0,0,1,0,1,1.000000,0.242137,0.030612,
1969,0,0,1,0,0,0,1,0,0.054795,0.580429,0.561224,
1975,0,1,0,0,0,0,1,0,0.301370,0.134613,0.193878,


In [10]:
df_dummies = pd.get_dummies(X_train.Employment)
df_dummies.head(5)

Unnamed: 0,Consultant,PSFederal,PSLocal,PSState,Private,SelfEmp,Unemployed,Volunteer
0,0,0,0,0,1,0,0,0
1,0,0,0,0,1,0,0,0
2,0,0,0,0,1,0,0,0
3,0,0,0,0,1,0,0,0
4,0,0,0,0,1,0,0,0


In [11]:
stack_list = []

In [15]:
from sklearn.neighbors import KNeighborsRegressor

In [16]:
for col in df_dummies.columns:
    clf = KNeighborsRegressor(n_neighbors=6)
    clf.fit(X_train.iloc[:, :-1], df_dummies[col])
    res = clf.predict(X_test.iloc[:, :-1]).reshape(-1, 1)
    stack_list.append(res)

In [156]:
code_res = pd.Series(np.hstack(stack_list).argmax(1))

In [157]:
code_res

0     2
1     0
2     4
3     4
4     4
     ..
95    4
96    4
97    4
98    4
99    4
Length: 100, dtype: int64

In [167]:
cat_res = code_res.replace(dict(zip(list(
    range(df_dummies.shape[1])), df_dummies.columns)))

In [None]:
res_df.loc[res_df.Employment.isna(), 'Employment'] = cat_res.values

Note that the assignment cell above has no execution count, so the two outputs below still reflect `res_df` before imputation: `Employment` keeps its 100 missing values.

In [170]:
res_df['Employment'].value_counts()

Private       1411
Consultant     148
PSLocal        119
SelfEmp         79
PSState         72
PSFederal       69
Unemployed       1
Volunteer        1
Name: Employment, dtype: int64

In [171]:
res_df.isna().sum()

ID              0
Age             0
Employment    100
Marital         0
Income          0
Gender          0
Hours           0
dtype: int64