
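### Imputing Missing Values

One of the most common ways to handle missing values is imputation: filling each missing position with a substitute value.

Here are a few common imputation methods: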

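1. Impute the **mean** of the column.

2. If you are working with a categorical variable or a variable with outliers, impute the **mode** instead.

3. Impute 0, a very small value, or a very large value, so that the imputed entries are clearly distinguishable from the other values.

4. Use KNN (k-nearest neighbors) to impute values based on the features that are most similar.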
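As a rough illustration of the first three methods in the list above, here is a minimal pandas sketch. The toy Series `values` and `labels` and the sentinel `-999` are made up for illustration and are not part of the exercise data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, used only for illustration
values = pd.Series([1.0, np.nan, 3.0, 3.0, np.nan])
labels = pd.Series(['a', 'b', np.nan, 'b', np.nan])

# 1. Fill with the mean of the observed values (NaN is skipped by default)
mean_filled = values.fillna(values.mean())

# 2. Fill with the most frequent observed value
mode_filled = labels.fillna(labels.mode()[0])

# 3. Fill with a sentinel that stands out from the observed range
flag_filled = values.fillna(-999)
```

Which of these is appropriate depends on the column: the mean only makes sense for quantitative data, while a sentinel keeps the fact that a value was missing visible to a downstream model.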

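In general, be cautious before you handle missing values: understand what the data mean in the real world and why the values are missing. That said, these approaches are all very quick and let you build a model right away. You can then iterate on your feature engineering and switch to more careful approaches as time allows.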

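Let's look at how to implement these methods. Chris's documentation is also a helpful resource for this material; you can find it [here](https://chrisalbon.com/). He uses the [sklearn.preprocessing library](http://scikit-learn.org/stable/modules/preprocessing.html). Pandas also offers several ways to fill missing values, which you can find [here](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html).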
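For the scikit-learn route mentioned above, the following is a sketch under one assumption: a recent scikit-learn release, where the imputers live in `sklearn.impute` (`SimpleImputer` and `KNNImputer`). The toy matrix `X` is hypothetical.

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

# Hypothetical toy matrix: the second feature is missing in row 1
X = np.array([[1.0, 2.0],
              [2.0, np.nan],
              [3.0, 6.0],
              [4.0, 8.0]])

# Column-mean imputation: the NaN becomes the mean of 2, 6 and 8
mean_imputed = SimpleImputer(strategy='mean').fit_transform(X)

# KNN imputation: the NaN becomes the average of that feature over the
# 2 nearest rows, with distance measured on the features that are observed
knn_imputed = KNNImputer(n_neighbors=2).fit_transform(X)
```

Here the two nearest rows to row 1 are rows 0 and 2, so the KNN estimate follows the local pattern rather than the global column mean.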

Run the cell below to create the dataset you will use in this notebook.


In [1]:
import pandas as pd
import numpy as np
import ImputationMethods as t

df = pd.DataFrame({'A':[np.nan, 2, np.nan, 0, 7, 10, 15],
                   'B':[3, 4, 5, 1, 2, 3, 5],
                   'C':[np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
                   'D':[np.nan, True, np.nan, False, True, False, np.nan],
                   'E':['Yes', 'No', 'Maybe', np.nan, np.nan, 'Yes', np.nan]})

df

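| | A | B | C | D | E |
|---|---|---|---|---|---|
| 0 | NaN | 3 | NaN | NaN | Yes |
| 1 | 2.0 | 4 | NaN | True | No |
| 2 | NaN | 5 | NaN | NaN | Maybe |
| 3 | 0.0 | 1 | NaN | False | NaN |
| 4 | 7.0 | 2 | NaN | True | NaN |
| 5 | 10.0 | 3 | NaN | False | Yes |
| 6 | 15.0 | 5 | NaN | NaN | NaN |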
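Before answering the questions, it can help to quantify how much of each column is missing. A minimal sketch, rebuilding the same DataFrame so the snippet is self-contained (the name `missing_fraction` is just illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [np.nan, 2, np.nan, 0, 7, 10, 15],
                   'B': [3, 4, 5, 1, 2, 3, 5],
                   'C': [np.nan] * 7,
                   'D': [np.nan, True, np.nan, False, True, False, np.nan],
                   'E': ['Yes', 'No', 'Maybe', np.nan, np.nan, 'Yes', np.nan]})

# Share of missing entries per column (isna() gives booleans, mean() averages them)
missing_fraction = df.isna().mean()
print(missing_fraction)

# The dtypes give a first hint about which columns are quantitative
print(df.dtypes)
```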


#### Question 1

**1.** Use the dictionary below to label the data type of each column.

In [2]:
a = 'categorical'
b = 'quantitative'
c = 'we cannot tell'
d = 'boolean - can treat either way'

question1_solution = {'Column A is': b,
                      'Column B is': b,
                      'Column C is': c,
                      'Column D is': d,
                      'Column E is': a
                     }

# Check your answer
t.var_test(question1_solution)

Nice job! That looks right to me!


#### Question 2

**2.** Are there any rows or columns in the data that you can safely drop?

In [3]:
a = "Yes"
b = "No"

should_we_drop = a

#Check your answer
t.can_we_drop(should_we_drop)

That's right! You should feel comfortable dropping any rows or columns that are completely missing values (or if they are all the exact same value).  However, dropping other columns or rows, even if only containing a few values, should go through further consideration.


If there are rows or columns that should be dropped, drop them and store the new dataset as **new_df**.

In [4]:
# Use this cell to drop any columns or rows you feel comfortable dropping based on the above
new_df = df.drop('C', axis=1)  # column C is entirely NaN, so it carries no information
new_df

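| | A | B | D | E |
|---|---|---|---|---|
| 0 | NaN | 3 | NaN | Yes |
| 1 | 2.0 | 4 | True | No |
| 2 | NaN | 5 | NaN | Maybe |
| 3 | 0.0 | 1 | False | NaN |
| 4 | 7.0 | 2 | True | NaN |
| 5 | 10.0 | 3 | False | Yes |
| 6 | 15.0 | 5 | NaN | NaN |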
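Dropping column C by name works here because we already know it is empty. As an alternative sketch, `dropna` with `how='all'` removes every all-missing column without naming it, and `nunique` can flag constant columns. The names `auto_df` and `constant_cols` are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [np.nan, 2, np.nan, 0, 7, 10, 15],
                   'B': [3, 4, 5, 1, 2, 3, 5],
                   'C': [np.nan] * 7,
                   'D': [np.nan, True, np.nan, False, True, False, np.nan],
                   'E': ['Yes', 'No', 'Maybe', np.nan, np.nan, 'Yes', np.nan]})

# Drop columns in which every value is missing
auto_df = df.dropna(axis=1, how='all')

# Columns with at most one distinct observed value are also candidates for dropping
constant_cols = [c for c in auto_df.columns if auto_df[c].nunique(dropna=True) <= 1]
```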


#### Question 3

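**3.** This exercise uses the **new_df** created above. I wrote a lambda function that fills missing values with the column mean; you can apply it to each column of **new_df** with the **apply** method. Use the empty cells to answer the questions in the **impute_q3** dictionary.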

In [16]:
fill_mean = lambda col: col.fillna(col.mean())

try:
    new_df.apply(fill_mean)
except Exception:  # avoid a bare except, which would also swallow KeyboardInterrupt
    print('That broke...')

That broke...


In [10]:
new_df['A'].apply(fill_mean)

AttributeError: 'float' object has no attribute 'fillna'
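The `AttributeError` above comes from how `apply` behaves on a single column: `Series.apply` passes each element (a float) to the function, whereas `DataFrame.apply` passes each whole column. A minimal sketch of two ways to fill one column, using a standalone copy of column A:

```python
import numpy as np
import pandas as pd

fill_mean = lambda col: col.fillna(col.mean())
col_a = pd.Series([np.nan, 2, np.nan, 0, 7, 10, 15], name='A')

# Option 1: call the function directly on the Series
filled_a = fill_mean(col_a)

# Option 2: keep a one-column DataFrame, so apply() hands over the whole column
filled_frame = col_a.to_frame().apply(fill_mean)
```

With `new_df`, the second option corresponds to selecting with double brackets, as in `new_df[['A']].apply(fill_mean)`.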

In [14]:
a = "fills with the mean, but that doesn't actually make sense in this case."
b = "gives an error."
c = "is no problem - it fills the NaN values with the mean as expected."


impute_q3 = {'Filling column A': c,
             'Filling column D': a,
             'Filling column E': b    
}

#Check your answer
t.impute_q3_check(impute_q3)

Nice job! That's right only the first column fills with the mean correctly.  We can't fill the mean of a categorical variable, and the boolean treats the True as 1 and False as 0 to give values that are not 1 or 0.
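To see why the mean is a poor fit for the boolean column, here is a small sketch on a standalone copy of column D. The explicit cast to float is an assumption made for clarity, so that True counts as 1.0 and False as 0.0.

```python
import numpy as np
import pandas as pd

# Column D from the exercise, cast so that True -> 1.0 and False -> 0.0
col_d = pd.Series([np.nan, True, np.nan, False, True, False, np.nan]).astype(float)

d_mean = col_d.mean()            # 2 True values out of 4 observed -> 0.5
d_filled = col_d.fillna(d_mean)  # missing entries become 0.5, neither True nor False
```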


#### Question 4

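**4.** Based on the results above, imputing the mode may make more sense for some columns. Write your own function that fills missing values with the mode, and apply it to the two columns for which mode imputation is more appropriate. Use your findings to answer the questions in the **impute_q4** dictionary.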

In [19]:
# Similar to the above, write a function and apply it to compute the mode for each column
# If you get stuck, here is a helpful resource: https://stackoverflow.com/questions/42789324/pandas-fillna-mode
# e.g. data['Native Country'] = data['Native Country'].fillna(data['Native Country'].mode()[0])
fill_mode = lambda col: col.fillna(col.mode()[0])  # mode() returns a Series; [0] takes its first entry
new_df[['A','B']] = new_df[['A','B']].apply(fill_mean)  # quantitative columns: fill with the mean
new_df[['D','E']] = new_df[['D','E']].apply(fill_mode)  # boolean and categorical columns: fill with the mode


new_df

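| | A | B | D | E |
|---|---|---|---|---|
| 0 | 6.8 | 3 | False | Yes |
| 1 | 2.0 | 4 | True | No |
| 2 | 6.8 | 5 | False | Maybe |
| 3 | 0.0 | 1 | False | Yes |
| 4 | 7.0 | 2 | True | Yes |
| 5 | 10.0 | 3 | False | Yes |
| 6 | 15.0 | 5 | False | Yes |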


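**`mode()` returns a Series containing all of the modes in sorted order. That is, if there are several modes, they are sorted in ascending order and returned together in one Series. Because we use `mode()[0]`, we take only the first of the sorted modes, i.e. the smallest one.**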
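The tie-breaking behavior of `mode()[0]` can be checked on standalone copies of the data. In this sketch `col_a` is the original column A and `col_d` holds only the observed values of column D (`mode()` ignores NaN by default):

```python
import numpy as np
import pandas as pd

col_a = pd.Series([np.nan, 2, np.nan, 0, 7, 10, 15])
col_d = pd.Series([True, False, True, False])

# Every observed value in A appears exactly once, so all of them are modes
a_modes = col_a.mode()
print(a_modes.tolist())  # [0.0, 2.0, 7.0, 10.0, 15.0]

# True and False are tied in D, so both are returned, with False sorted first
d_modes = col_d.mode()
print(d_modes.tolist())  # [False, True]
```

Taking `[0]` therefore yields 0 for column A and False for column D, which is a tie-break rather than a meaningful "most common" value.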

In [20]:
new_df.head()

| | A | B | D | E |
|---|---|---|---|---|
| 0 | 6.8 | 3 | False | Yes |
| 1 | 2.0 | 4 | True | No |
| 2 | 6.8 | 5 | False | Maybe |
| 3 | 0.0 | 1 | False | Yes |
| 4 | 7.0 | 2 | True | Yes |


In [24]:
a = "Did not impute the mode."
b = "Imputes the mode."


impute_q4 = {'Filling column A': a,
             'Filling column D': a,
             'Filling column E': b
            }

#Check your answer
t.impute_q4_check(impute_q4)

Nice job! That's right only one of these columns actually imputed a mode.  None of the values in the first column appeared more than once, and 0 was imputed for all of the NaN values.  There were an even number of True and False values, and False was imputed for all the NaN values.


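In this notebook you saw two ways to impute missing values, and hopefully you also noticed their limitations. Again, these methods let you build a model very quickly, but they also introduce noise into the data.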
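One way to see how imputation changes the data, offered only as an illustration on a standalone copy of column A: mean imputation keeps the column mean the same but shrinks its spread, because every filled entry sits exactly at the mean.

```python
import numpy as np
import pandas as pd

col_a = pd.Series([np.nan, 2, np.nan, 0, 7, 10, 15])
filled = col_a.fillna(col_a.mean())

# Standard deviation of the observed values (NaN is skipped) vs. after filling
std_before = col_a.std()
std_after = filled.std()
print(std_before, std_after)
```

The filled column looks less variable than the observed data actually were, which is one reason to revisit quick imputation choices once a first model is in place.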