### Data Preprocessing
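- Data preprocessing before building features
    - Sometimes the data is spread across several files and needs to be joined.
    - Handle missing data.
    - Handle outliers.
    - Convert the representation of some categorical variables where necessary.
    - Some float variables may have been converted from an unknown int variable. Precision lost in that conversion adds unnecessary noise: two values that were originally identical start to differ at some decimal place. This can hurt the model badly, so try to remove or reduce the noise.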

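The strategy for this step mostly depends on the conclusions drawn and the visualizations created while exploring the dataset in the previous step.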
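
As a minimal sketch of two of the points above (joining files and reducing float noise), the example below merges two small tables and rounds a noisy float column. The frames `users` and `orders`, their columns, and the choice of 6 decimal places are all assumptions made for illustration; the right precision depends on how the original values were produced.

```python
import pandas as pd

# Hypothetical tables that would normally come from separate files
users = pd.DataFrame({"user_id": [1, 2, 3], "age": [25, 31, 47]})
orders = pd.DataFrame({
    "user_id": [1, 1, 3],
    # 0.30000000000000004 and 0.3 were "the same" value before precision loss
    "amount": [0.30000000000000004, 0.3, 1.2000000001],
})

# Left join keeps every user, even those without orders (amount becomes NaN)
merged = users.merge(orders, on="user_id", how="left")

# Round away the spurious trailing digits so equal values compare equal again
merged["amount"] = merged["amount"].round(6)
```

A left join is used here so that rows without a match show up as missing values instead of silently disappearing, which keeps the missing-data step explicit.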

In [1]:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

# First, we'll import pandas, a data processing and CSV file I/O library
import pandas as pd

# We'll also import seaborn, a Python graphing library
import warnings # current version of seaborn generates a bunch of warnings that we'll ignore
warnings.filterwarnings("ignore")
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style="white", color_codes=True)

# Next, we'll load the Iris flower dataset, which is in the "../input/" directory
iris = pd.read_csv("../input/Iris.csv") # the iris dataset is now a Pandas DataFrame

# Let's see what's in the iris data - Jupyter notebooks print the result of the last thing you do
iris.head()


Unnamed: 0,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa


In [4]:
# Check the non-null count and dtype of each feature
iris.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
SepalLengthCm    150 non-null float64
SepalWidthCm     150 non-null float64
PetalLengthCm    150 non-null float64
PetalWidthCm     150 non-null float64
Species          150 non-null object
dtypes: float64(4), object(1)
memory usage: 5.9+ KB


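### Handling features with many missing values
If too much of a feature is missing, the feature is usually dropped outright, because filling it in may introduce more noise than signal. Here we first turn it into a two-class feature (value present or value missing):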
```
def set_salary_change(df):
    # Replace the column with a binary indicator: "Yes" if a value is present, "No" if it is missing
    df['salary_change'] = df['salary_change'].notnull().map({True: "Yes", False: "No"})
    return df
```
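
For the other option mentioned above, dropping a feature that is mostly missing, a minimal sketch might look like the following. The helper name `drop_sparse_columns` and the 0.5 threshold are assumptions, not values from the original notes.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(df, max_missing_ratio=0.5):
    """Return df without the columns whose share of NaN exceeds max_missing_ratio."""
    # isnull().mean() gives the fraction of missing values per column
    missing_ratio = df.isnull().mean()
    keep = missing_ratio[missing_ratio <= max_missing_ratio].index
    return df[keep]

# Hypothetical demo frame: column "b" is 75% missing, column "c" is 25% missing
demo = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [np.nan, np.nan, np.nan, 1.0],
    "c": [1.0, np.nan, 3.0, 4.0],
})
kept = drop_sparse_columns(demo, max_missing_ratio=0.5)
```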

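### Handling features with few missing values
#### If less than 10% of a feature's values are missing, the following approaches can be used:
- Treat NaN as a value of its own, represented here by 0: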
```
# fillna returns a new DataFrame, so assign the result back
data_train = data_train.fillna(0)
```

- Fill with the mean:
```
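# Fill every column with its own mean (each call returns a new DataFrame)
data_train.fillna(data_train.mean(numeric_only=True))
# Fill only the columns from 'browse_his' through 'card_num'
data_train.fillna(data_train.mean(numeric_only=True)['browse_his':'card_num'])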
```
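If the training set has missing values but the test set has none, fill with a conditional median or conditional mean: for each user, take the mean or median of that attribute over all users with the same label.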

- Fill from the neighboring rows:
```
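# Forward fill: carry the previous valid value downward (replaces fillna(method='pad'))
data_train.ffill()
# Backward fill: use the next valid value (replaces fillna(method='bfill'))
data_train.bfill()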
```

- Fill by interpolation:
```
# Interpolation estimates the values between the known points (x0, y0) and (x1, y1)
data_train.interpolate()
```

- Fill with a fitted model:
```
from sklearn.ensemble import RandomForestRegressor

def set_missing_browse_his(df):
    # Take the existing numeric features as input for RandomForestRegressor
    process_df = df[['browse_his', 'gender', 'job', 'edu', 'marriage', 'family_type']]

    # Split into rows where the feature is known and rows where it is unknown
    known = process_df[process_df.browse_his.notnull()].to_numpy()
    unknown = process_df[process_df.browse_his.isnull()].to_numpy()

    # X holds the predictor features
    X = known[:, 1:]

    # y is the target: the known browse_his values
    y = known[:, 0]

    # Fit the RandomForestRegressor
    rfr = RandomForestRegressor(random_state=0, n_estimators=2000, n_jobs=-1)
    rfr.fit(X, y)

    # Predict the unknown feature values with the fitted model
    predicted = rfr.predict(unknown[:, 1:])

    # Fill the originally missing entries with the predictions
    df.loc[df.browse_his.isnull(), 'browse_his'] = predicted

    return df, rfr
```
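Features whose share of missing values is not too large can all be filled by model fitting: use the attributes without missing values to predict the attributes that have them.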
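
The label-conditional fill described under mean filling can be sketched with `groupby` and `transform`. This is only an illustration: the column name `label` and the toy values are assumptions, and `browse_his` is borrowed from the example above.

```python
import numpy as np
import pandas as pd

# Hypothetical training data: "label" is the target class of each user
train = pd.DataFrame({
    "label": [0, 0, 0, 1, 1, 1],
    "browse_his": [10.0, np.nan, 20.0, 100.0, 140.0, np.nan],
})

# Mean of browse_his within each label, aligned row by row with the original frame
group_mean = train.groupby("label")["browse_his"].transform("mean")

# Fill each missing value with the mean of its own label group
train["browse_his"] = train["browse_his"].fillna(group_mean)
```

Because this fill uses the label, it can only be applied to the training set, which matches the situation described above where the test set has no missing values. Passing `"median"` to `transform` gives the conditional median instead.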

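There are currently three kinds of approaches:

1. Replace the missing value with the mean, median, a quantile, the mode, a random value, and so on. Results are mediocre, because this effectively adds noise by hand.

2. Build a predictive model from the other variables to estimate the missing variable. This works slightly better than method 1, but it has a fundamental flaw: if the other variables are unrelated to the missing variable, the predictions are meaningless; if the predictions are very accurate, the variable is redundant and does not need to enter the model. In practice, the result falls somewhere between the two.

3. The most accurate approach is to map the variable into a higher-dimensional space. Gender, for example, has three cases (male, female, missing), so it is mapped to three variables: is male, is female, is missing. Continuous variables can be handled the same way. The CTR prediction models at Google and Baidu, for example, preprocess all variables this way and reach hundreds of millions of dimensions. The benefit is that all information in the raw data is preserved, and there is no need to worry about missing values or problems such as linear inseparability. The downside is a much higher computational cost. It also only works well when the sample size is very large; otherwise the data becomes too sparse and the results are poor.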
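
The gender example in method 3 can be sketched with `pd.get_dummies` and its `dummy_na` option, which adds a separate indicator column for missing values. The toy data and the `gender` prefix are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical column with three cases: male, female, missing
df = pd.DataFrame({"gender": ["male", "female", np.nan, "male"]})

# dummy_na=True adds an extra indicator column for NaN;
# dtype=int produces 0/1 instead of booleans
encoded = pd.get_dummies(df["gender"], prefix="gender", dummy_na=True, dtype=int)
```

Each row ends up with exactly one active indicator, so the missing case is kept as information instead of being imputed.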

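For numeric variables, the mean or nearest neighbors may be a better choice. Dummy variables are better suited to categorical and ordinal variables.

Continuous variables can be discretized; for example, a continuous variable between 1 and 10 can be split into 10 intervals.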
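
A minimal sketch of that discretization, assuming equal-width intervals and made-up sample values, uses `pd.cut` with explicit bin edges:

```python
import numpy as np
import pandas as pd

# Hypothetical continuous values between 1 and 10
values = pd.Series([1.0, 2.5, 4.9, 7.5, 10.0])

# 11 edges from 1 to 10 define 10 equal-width intervals
edges = np.linspace(1, 10, 11)

# labels=False returns the integer index of the interval (0 to 9);
# include_lowest=True makes the first interval include the value 1.0
codes = pd.cut(values, bins=edges, labels=False, include_lowest=True)
```

The resulting interval codes can then be one-hot encoded like any other categorical variable.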

##### Two ways to implement one-hot encoding. Reference:
https://stackoverflow.com/questions/37292872/how-can-i-one-hot-encode-in-python

- **One-hot encoding with pandas:**

```
# Transform a given column into one-hot columns. Use prefix when encoding multiple columns.
>>> import pandas as pd
>>> df = pd.DataFrame({'A': ['a', 'b', 'c'], 'B': ['b', 'a', 'c']})
>>> df
   A  B
0  a  b
1  b  a
2  c  c
>>> # Get the one-hot encoding of column B (dtype=int gives 0/1 instead of booleans)
>>> one_hot = pd.get_dummies(df['B'], dtype=int)
>>> # Drop column B, as it is now encoded
>>> df = df.drop('B', axis=1)
>>> # Join the encoded columns
>>> df = df.join(one_hot)
>>> df
   A  a  b  c
0  a  0  1  0
1  b  1  0  0
2  c  0  0  1
```
- **A demo of dummy-encoding categorical features:**
```
def one_hot(df, cols):
    """
    @param df pandas DataFrame
    @param cols a list of columns to encode 
    @return a DataFrame with one-hot encoding
    """
    for each in cols:
        dummies = pd.get_dummies(df[each], prefix=each, drop_first=False)
        df = pd.concat([df, dummies], axis=1)
    return df
```

- **Dummy-encoding feature variables with sklearn:**
```
>>> from sklearn.preprocessing import OneHotEncoder
>>> enc = OneHotEncoder()
>>> enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])
OneHotEncoder()
>>> # n_values_ and feature_indices_ were removed in newer scikit-learn releases;
>>> # categories_ lists the values found in each column
>>> enc.categories_
[array([0, 1]), array([0, 1, 2]), array([0, 1, 2, 3])]
>>> # transform returns a sparse matrix; toarray() makes it dense
>>> enc.transform([[0, 1, 1]]).toarray()
array([[1., 0., 0., 1., 0., 0., 1., 0., 0.]])
```

- **A demo of a LabelBinarizer kept in global scope:**

```
from sklearn.preprocessing import LabelBinarizer 
label_binarizer = LabelBinarizer()
label_binarizer.fit(all_your_labels_list) # need to be global or remembered to use it later

def one_hot_encode(x):
    """
    One hot encode a list of sample labels. Return a one-hot encoded vector for each label.
    : x: List of sample Labels
    : return: Numpy array of one-hot encoded labels
    """
    return label_binarizer.transform(x)
```

##### Python pandas: check if any value is NaN in DataFrame
```
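# Check whether each column contains NaN:
df.isnull().any(axis=0)
# Check whether each row contains NaN:
df.isnull().any(axis=1)

# The fastest way to check whether the whole DataFrame contains any NaN:
df.isnull().values.any()

In [1]: import numpy as np, pandas as pd

In [2]: df = pd.DataFrame(np.random.randn(1000,1000))

In [3]: df[df > 0.9] = np.nan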

In [4]: %timeit df.isnull().any().any()
100 loops, best of 3: 14.7 ms per loop

In [5]: %timeit df.isnull().values.sum()
100 loops, best of 3: 2.15 ms per loop

In [6]: %timeit df.isnull().sum().sum()
100 loops, best of 3: 18 ms per loop

In [7]: %timeit df.isnull().values.any()
1000 loops, best of 3: 948 µs per loop

# df.isnull().sum().sum() is a bit slower, but of course, has additional information -- the number of NaNs.
```
##### **When to use df.ix, df.loc, and df.iloc in pandas, and how they differ:**
https://stackoverflow.com/questions/31593201/pandas-iloc-vs-ix-vs-loc-explanation
```
# Note: ix was deprecated in pandas 0.20.0 and removed in pandas 1.0; use loc and iloc instead.
# The ix call below only runs on older pandas versions.

# First, a recap:
  ● loc works on labels in the index.
  ● iloc works on the positions in the index (so it only takes integers).
  ● ix usually tries to behave like loc but falls back to behaving like iloc if the label is not in the index.

# Combining position-based and label-based indexing
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(np.nan,
                      index=list('abcde'),
                      columns=['x','y','z', 8, 9])
>>> df
    x   y   z   8   9
a NaN NaN NaN NaN NaN
b NaN NaN NaN NaN NaN
c NaN NaN NaN NaN NaN
d NaN NaN NaN NaN NaN
e NaN NaN NaN NaN NaN

>>> df.ix[:'c', :4]
    x   y   z   8
a NaN NaN NaN NaN
b NaN NaN NaN NaN
c NaN NaN NaN NaN

>>> df.iloc[:df.index.get_loc('c') + 1, :4]
    x   y   z   8
a NaN NaN NaN NaN
b NaN NaN NaN NaN
c NaN NaN NaN NaN

# get_loc() is an index method meaning "get the position of the label in this index".
# Note that since slicing with iloc is exclusive of its endpoint, we must add 1 to this value if we want row 'c' as well

```
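
On pandas versions where `ix` is no longer available, the same mixed selection (rows by label, columns by position) can be written with either `iloc` or `loc`. The sketch below rebuilds the frame from the example above; the variable names `by_iloc` and `by_loc` are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.nan, index=list("abcde"), columns=["x", "y", "z", 8, 9])

# Option 1: convert the row label to a position, then use iloc for both axes.
# iloc slices exclude the endpoint, hence the + 1 to include row 'c'.
by_iloc = df.iloc[: df.index.get_loc("c") + 1, :4]

# Option 2: convert the column positions to labels, then use loc for both axes.
# loc slices include the endpoint, so :'c' already contains row 'c'.
by_loc = df.loc[:"c", df.columns[:4]]
```

Both options select rows `a` through `c` and the first four columns, so which one to use is mostly a matter of which axis is more natural to express by label.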