# 1 What is feature engineering?

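* Data and features determine the upper bound of what machine learning can achieve; models and algorithms merely approach that bound.
* Feature engineering is an engineering activity that extracts as much useful information as possible from raw data, in the form of features that algorithms and models can use.
* Feature engineering covers the following areas: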

![Feature engineering mind map](pic7.jpg)
<center>Fig. 1 Feature engineering mind map</center>


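Feature processing is the core of feature engineering. sklearn provides a fairly complete set of feature processing methods, including data preprocessing, feature selection, and dimensionality reduction.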


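This article uses the IRIS dataset that ships with sklearn to illustrate feature processing. The IRIS dataset was compiled by Fisher in 1936 and contains 4 features: Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width. All feature values are positive floating-point numbers measured in centimeters. The target is the iris species: Iris Setosa, Iris Versicolour, or Iris Virginica. The following code loads the IRIS dataset: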



In [15]:
from sklearn.datasets import load_iris

# Load the IRIS dataset
iris = load_iris()

# Feature matrix
X = iris.data

# Target vector
y = iris.target

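# 2 Data preprocessing

Features obtained through feature extraction, before any processing, often have the following problems:

* Different scales: features are measured in different units or ranges and cannot be compared directly. Scaling (removing units) solves this problem.
* Redundant information: for some quantitative features, the useful information is only which interval the value falls in. For example, if we only care whether an exam score is a pass or a fail, the numeric score should be converted to 1 (pass) or 0 (fail). Binarization solves this problem.
* Qualitative features cannot be used directly: some algorithms and models accept only quantitative input, so qualitative features must be converted to quantitative ones. The simplest approach is to assign a number to each qualitative value, but the assignment is arbitrary and adds tuning work. The usual approach is one-hot (dummy) encoding: if a feature has N qualitative values, expand it into N features; when the original value is the i-th qualitative value, the i-th expanded feature is set to 1 and all others to 0. Compared with assigning numbers directly, one-hot encoding adds no tuning work, and for a linear model the encoded features can produce a nonlinear effect.
* Missing values: missing values need to be filled in.
* Low information utilization: different algorithms and models exploit the information in the data differently. As noted above, one-hot encoding qualitative features lets a linear model achieve a nonlinear effect. Similarly, applying a polynomial or other transformation to quantitative variables can also achieve a nonlinear effect.

**We use the preprocessing module of sklearn for data preprocessing; it provides solutions to all of the problems above.**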

https://scikit-learn.org/stable/modules/preprocessing.html#encoding-categorical-features

https://scikit-learn.org/stable/auto_examples/preprocessing/plot_scaling_importance.html

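## 2.1 Scaling

Scaling converts data of different scales to a common scale. Common methods are standardization and min-max scaling.


### 2.1.1 Standardization
Standardization works best when the feature values are approximately normally distributed. After standardization each feature has zero mean and unit variance (and follows the standard normal distribution if the original feature was normal). The formula is $x' = \frac{x-\bar{x}}{\sigma}$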

class sklearn.preprocessing.StandardScaler(*, copy=True, with_mean=True, with_std=True)
* Parameters:
    * copy: boolean, optional, default True
        * If False, try to avoid a copy and do inplace scaling instead. This is not guaranteed to always work inplace; e.g. if the data is not a NumPy array or scipy.sparse CSR matrix, a copy may still be returned.
    * with_mean: boolean, True by default
        * If True, center the data before scaling. This does not work (and will raise an exception) when attempted on sparse matrices, because centering them entails building a dense matrix which in common use cases is likely to be too large to fit in memory.
    * with_std: boolean, True by default
        * If True, scale the data to unit variance (or equivalently, unit standard deviation).
* Attributes
    * scale_: ndarray or None, shape (n_features,)
        * Per feature relative scaling of the data. This is calculated using np.sqrt(var_). Equal to None when with_std=False.
    * mean_: ndarray or None, shape (n_features,)
        * The mean value for each feature in the training set. Equal to None when with_mean=False.
    * var_: ndarray or None, shape (n_features,)
        * The variance for each feature in the training set. Used to compute scale_. Equal to None when with_std=False.
    * n_samples_seen_: int or array, shape (n_features,)
        * The number of samples processed by the estimator for each feature. If there are not missing samples, the n_samples_seen will be an integer, otherwise it will be an array. Will be reset on new calls to fit, but increments across partial_fit calls.

Note: It is possible to disable either centering or scaling by either passing with_mean=False or with_std=False to the constructor of StandardScaler.

In [16]:
from sklearn.preprocessing import StandardScaler
StandardScaler().fit_transform(X)

array([[-9.00681170e-01,  1.01900435e+00, -1.34022653e+00,
        -1.31544430e+00],
       [-1.14301691e+00, -1.31979479e-01, -1.34022653e+00,
        -1.31544430e+00],
       [-1.38535265e+00,  3.28414053e-01, -1.39706395e+00,
        -1.31544430e+00],
       [-1.50652052e+00,  9.82172869e-02, -1.28338910e+00,
        -1.31544430e+00],
       [-1.02184904e+00,  1.24920112e+00, -1.34022653e+00,
        -1.31544430e+00],
       [-5.37177559e-01,  1.93979142e+00, -1.16971425e+00,
        -1.05217993e+00],
       [-1.50652052e+00,  7.88807586e-01, -1.34022653e+00,
        -1.18381211e+00],
       [-1.02184904e+00,  7.88807586e-01, -1.28338910e+00,
        -1.31544430e+00],
       [-1.74885626e+00, -3.62176246e-01, -1.34022653e+00,
        -1.31544430e+00],
       [-1.14301691e+00,  9.82172869e-02, -1.28338910e+00,
        -1.44707648e+00],
       [-5.37177559e-01,  1.47939788e+00, -1.28338910e+00,
        -1.31544430e+00],
       [-1.26418478e+00,  7.88807586e-01, -1.22655167e+00,
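The formula above can be checked by hand. The following is a minimal sketch, assuming the IRIS data loaded as before; the name `X_manual` is introduced here only for illustration. It reproduces the StandardScaler output with plain NumPy, using the population standard deviation.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X = load_iris().data

scaler = StandardScaler()
X_std = scaler.fit_transform(X)

# Manual z-score per column, with the population standard deviation (ddof=0)
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(X_std, X_manual))   # the two results agree
print(scaler.mean_)                   # per-feature means learned by fit
print(scaler.scale_)                  # per-feature standard deviations
```

Because the statistics are stored in `mean_` and `scale_`, the same fitted scaler can later be applied to new data with `transform`.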
      

### 2.1.2 Min-max scaling
Min-max scaling uses the boundary values of a feature to rescale it to a specific range, such as [0, 1]. The formula is $x' = \frac{x-\min}{\max-\min}$

class sklearn.preprocessing.MinMaxScaler(feature_range=(0, 1), *, copy=True)

* Parameters:
    * feature_range: tuple (min, max), default=(0, 1). 
        * Desired range of transformed data.
    * copy: bool, default=True.
        * Set to False to perform inplace row normalization and avoid a copy (if the input is already a numpy array).

* Attributes:
    * min_: ndarray of shape (n_features,). 
        * Per feature adjustment for minimum. Equivalent to min - X.min(axis=0) * self.scale_
    * scale_: ndarray of shape (n_features,)
        * Per feature relative scaling of the data. Equivalent to (max - min) / (X.max(axis=0) - X.min(axis=0))
    * data_min_: ndarray of shape (n_features,)
        * Per feature minimum seen in the data
    * data_max_: ndarray of shape (n_features,)
        * Per feature maximum seen in the data
    * data_range_: ndarray of shape (n_features,)
        * Per feature range (data_max_ - data_min_) seen in the data
    * n_samples_seen_: int
        * The number of samples processed by the estimator. It will be reset on new calls to fit, but increments across partial_fit calls.


class sklearn.preprocessing.MaxAbsScaler(*, copy=True) works similarly to MinMaxScaler, but divides each feature by its maximum absolute value, so the result lies in [-1, 1].

In [17]:
from sklearn.preprocessing import MinMaxScaler

# Min-max scaling; returns the data rescaled to the [0, 1] range
MinMaxScaler().fit_transform(X)

array([[0.22222222, 0.625     , 0.06779661, 0.04166667],
       [0.16666667, 0.41666667, 0.06779661, 0.04166667],
       [0.11111111, 0.5       , 0.05084746, 0.04166667],
       [0.08333333, 0.45833333, 0.08474576, 0.04166667],
       [0.19444444, 0.66666667, 0.06779661, 0.04166667],
       [0.30555556, 0.79166667, 0.11864407, 0.125     ],
       [0.08333333, 0.58333333, 0.06779661, 0.08333333],
       [0.19444444, 0.58333333, 0.08474576, 0.04166667],
       [0.02777778, 0.375     , 0.06779661, 0.04166667],
       [0.16666667, 0.45833333, 0.08474576, 0.        ],
       [0.30555556, 0.70833333, 0.08474576, 0.04166667],
       [0.13888889, 0.58333333, 0.10169492, 0.04166667],
       [0.13888889, 0.41666667, 0.06779661, 0.        ],
       [0.        , 0.41666667, 0.01694915, 0.        ],
       [0.41666667, 0.83333333, 0.03389831, 0.04166667],
       [0.38888889, 1.        , 0.08474576, 0.125     ],
       [0.30555556, 0.79166667, 0.05084746, 0.125     ],
       [0.22222222, 0.625     ,
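As a quick check of the min-max formula, here is a minimal sketch (the names `col_min`, `col_max`, and `X_manual` are illustrative) that recomputes the scaled values by hand and compares them with MinMaxScaler.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler

X = load_iris().data

scaler = MinMaxScaler(feature_range=(0, 1))
X_scaled = scaler.fit_transform(X)

# Manual version: subtract the column minimum, divide by the column range
col_min, col_max = X.min(axis=0), X.max(axis=0)
X_manual = (X - col_min) / (col_max - col_min)

print(np.allclose(X_scaled, X_manual))          # the two results agree
print(X_scaled.min(axis=0), X_scaled.max(axis=0))
```

Every column of the result has minimum 0 and maximum 1, which also shows why a single outlier can compress all other values of a feature into a narrow band.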

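### 2.1.3 Difference between standardization and normalization

Put simply, standardization processes the feature matrix column by column: it converts each feature to a z-score so that all features are on the same scale.

Normalization processes the feature matrix row by row: it converts every sample to a "unit vector", so that all samples follow one common standard.

Normalization scales each sample to unit norm (the norm of every sample is 1). This is useful when a quadratic form such as the dot product, or any other kernel, is later used to measure the similarity of two samples. The default of sklearn.preprocessing.Normalizer is norm='l2', which divides each sample by its L2 norm.

Normalization should not be confused with regularization, even though both use the names l1 and l2. Regularization adds a penalty on the model weights to the cost function, so that the learning algorithm not only fits the data but also keeps the weights as small as possible. With an l2 penalty this is ridge regression, whose hyperparameter $\alpha$ controls how strongly the model is regularized. If $\alpha=0$, ridge regression reduces to linear regression. If $\alpha$ is very large, all weights end up close to zero and the result is a horizontal line through the mean of the data.

l1 penalty (lasso):
$J(\theta) = MSE(\theta) + \alpha \sum_{i=1}^{n}|\theta_{i}|$ 

l2 penalty (ridge): 
$J(\theta) = MSE(\theta) + \alpha \frac{1}{2}\sum_{i=1}^{n}\theta_{i}^{2}$ 


An important property of lasso regression, which distinguishes it from ridge regression, is that it tends to eliminate the weights of the least important features entirely (setting them to zero), so it can also be used to select individual features. Typical usage is
clf=sklearn.linear_model.Lasso(alpha=0.1), with clf.coef_ showing the weight of each feature.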


---

sklearn.preprocessing.normalize(X, norm='l2', *, axis=1, copy=True, return_norm=False) (a function, not a class)
* Parameters
    * X: {array-like, sparse matrix}, shape [n_samples, n_features]
        * The data to normalize, element by element. scipy.sparse matrices should be in CSR format to avoid an un-necessary copy.

    * norm: ‘l1’, ‘l2’, or ‘max’, optional (‘l2’ by default)
        * The norm to use to normalize each non zero sample (or each non-zero feature if axis is 0).

    * axis: 0 or 1, optional (1 by default)
        * axis used to normalize the data along. If 1, independently normalize each sample, otherwise (if 0) normalize each feature.
    * copy: boolean, optional, default True
        * set to False to perform inplace row normalization and avoid a copy (if the input is already a numpy array or a scipy.sparse CSR matrix and if axis is 1).
    * return_norm: boolean, default False
        * whether to return the computed norms
* Returns
    * X: {array-like, sparse matrix}, shape [n_samples, n_features]
        * Normalized input X.
    * norms: array, shape [n_samples] if axis=1 else [n_features]
        * An array of norms along given axis for X. When X is sparse, a NotImplementedError will be raised for norm ‘l1’ or ‘l2’.

Other commonly used preprocessing tools:
* Scaling sparse data: MaxAbsScaler and maxabs_scale
* Scaling data with outliers: robust_scale and RobustScaler
* Centering kernel matrices: KernelCenterer

Nonlinear preprocessing tools:
* Mapping to a uniform distribution: QuantileTransformer and quantile_transform provide a non-parametric transformation to map the data to a uniform distribution with values between 0 and 1.
* Mapping to a Gaussian distribution: PowerTransformer, with the Yeo-Johnson or Box-Cox transform

---

class sklearn.linear_model.Lasso(alpha=1.0, *, fit_intercept=True, normalize=False, precompute=False, copy_X=True, max_iter=1000, tol=0.0001, warm_start=False, positive=False, random_state=None, selection='cyclic')
* Parameters: 
    * alpha: float, default=1.0
        * Constant that multiplies the L1 term. Defaults to 1.0. alpha = 0 is equivalent to an ordinary least square, solved by the LinearRegression object. For numerical reasons, using alpha = 0 with the Lasso object is not advised. Given this, you should use the LinearRegression object.
    * fit_intercept: bool, default=True
        * Whether to calculate the intercept for this model. If set to False, no intercept will be used in calculations (i.e. data is expected to be centered).
    * normalize: bool, default=False
        * This parameter is ignored when fit_intercept is set to False. If True, the regressors X will be normalized before regression by subtracting the mean and dividing by the l2-norm. If you wish to standardize, please use sklearn.preprocessing.StandardScaler before calling fit on an estimator with normalize=False.
    * precompute: ‘auto’, bool or array-like of shape (n_features, n_features), default=False
        * Whether to use a precomputed Gram matrix to speed up calculations. If set to 'auto' let us decide. The Gram matrix can also be passed as argument. For sparse input this option is always True to preserve sparsity.
    * copy_X: bool, default=True
        * If True, X will be copied; else, it may be overwritten.
    * max_iter: int, default=1000
        * The maximum number of iterations
    * tol: float, default=1e-4
        * The tolerance for the optimization: if the updates are smaller than tol, the optimization code checks the dual gap for optimality and continues until it is smaller than tol.
    * warm_start: bool, default=False
        * When set to True, reuse the solution of the previous call to fit as initialization, otherwise, just erase the previous solution. See the Glossary.
    * positive: bool, default=False
        * When set to True, forces the coefficients to be positive.
    * random_state: int, RandomState instance, default=None
        * The seed of the pseudo random number generator that selects a random feature to update. Used when selection == ‘random’. Pass an int for reproducible output across multiple function calls. See Glossary.
    * selection: {‘cyclic’, ‘random’}, default=’cyclic’
        * If set to ‘random’, a random coefficient is updated every iteration rather than looping over features sequentially by default. This (setting to ‘random’) often leads to significantly faster convergence especially when tol is higher than 1e-4.

* Attributes
    * coef_: ndarray of shape (n_features,) or (n_targets, n_features)
        * parameter vector (w in the cost function formula)
    * sparse_coef_: sparse matrix of shape (n_features, 1) or (n_targets, n_features)
        * sparse representation of the fitted coef_
    * intercept_: float or ndarray of shape (n_targets,)
        * independent term in decision function.
    * n_iter_: int or list of int



In [30]:
from sklearn.preprocessing import Normalizer

# L2 normalization (not ridge regression); returns the data with each sample scaled to unit L2 norm
Normalizer().fit_transform(X)

array([[0.80377277, 0.55160877, 0.22064351, 0.0315205 ],
       [0.82813287, 0.50702013, 0.23660939, 0.03380134],
       [0.80533308, 0.54831188, 0.2227517 , 0.03426949],
       [0.80003025, 0.53915082, 0.26087943, 0.03478392],
       [0.790965  , 0.5694948 , 0.2214702 , 0.0316386 ],
       [0.78417499, 0.5663486 , 0.2468699 , 0.05808704],
       [0.78010936, 0.57660257, 0.23742459, 0.0508767 ],
       [0.80218492, 0.54548574, 0.24065548, 0.0320874 ],
       [0.80642366, 0.5315065 , 0.25658935, 0.03665562],
       [0.81803119, 0.51752994, 0.25041771, 0.01669451],
       [0.80373519, 0.55070744, 0.22325977, 0.02976797],
       [0.786991  , 0.55745196, 0.26233033, 0.03279129],
       [0.82307218, 0.51442011, 0.24006272, 0.01714734],
       [0.8025126 , 0.55989251, 0.20529392, 0.01866308],
       [0.81120865, 0.55945424, 0.16783627, 0.02797271],
       [0.77381111, 0.59732787, 0.2036345 , 0.05430253],
       [0.79428944, 0.57365349, 0.19121783, 0.05883625],
       [0.80327412, 0.55126656,

In [31]:
# L1 normalization (not lasso regression); each sample is scaled to unit L1 norm
Normalizer(norm='l1').fit_transform(X)

array([[0.5       , 0.34313725, 0.1372549 , 0.01960784],
       [0.51578947, 0.31578947, 0.14736842, 0.02105263],
       [0.5       , 0.34042553, 0.13829787, 0.0212766 ],
       [0.4893617 , 0.32978723, 0.15957447, 0.0212766 ],
       [0.49019608, 0.35294118, 0.1372549 , 0.01960784],
       [0.47368421, 0.34210526, 0.14912281, 0.03508772],
       [0.4742268 , 0.35051546, 0.1443299 , 0.03092784],
       [0.4950495 , 0.33663366, 0.14851485, 0.01980198],
       [0.49438202, 0.3258427 , 0.15730337, 0.02247191],
       [0.51041667, 0.32291667, 0.15625   , 0.01041667],
       [0.5       , 0.34259259, 0.13888889, 0.01851852],
       [0.48      , 0.34      , 0.16      , 0.02      ],
       [0.51612903, 0.32258065, 0.15053763, 0.01075269],
       [0.50588235, 0.35294118, 0.12941176, 0.01176471],
       [0.51785714, 0.35714286, 0.10714286, 0.01785714],
       [0.475     , 0.36666667, 0.125     , 0.03333333],
       [0.49090909, 0.35454545, 0.11818182, 0.03636364],
       [0.49514563, 0.33980583,
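To make the row-wise nature of normalization concrete, here is a minimal sketch (variable names are illustrative) confirming that every sample has norm 1 after the transform, and that the l2 case is simply each row divided by its own Euclidean length.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import Normalizer

X = load_iris().data

X_l2 = Normalizer(norm='l2').fit_transform(X)
X_l1 = Normalizer(norm='l1').fit_transform(X)

# Norm of every row (sample) after normalization
l2_norms = np.linalg.norm(X_l2, ord=2, axis=1)
l1_norms = np.abs(X_l1).sum(axis=1)

# Manual l2 normalization: divide each row by its Euclidean length
X_manual = X / np.linalg.norm(X, axis=1, keepdims=True)

print(np.allclose(l2_norms, 1.0), np.allclose(l1_norms, 1.0))
print(np.allclose(X_l2, X_manual))
```

No model is trained and no weights are penalized here, which is the practical difference from ridge or lasso regularization.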

In [32]:
# Select individual features with lasso regression
from sklearn import linear_model
clf = linear_model.Lasso(alpha=0.1)
clf.fit(X,y)
print(clf.coef_)
print(clf.intercept_)

[ 0.         -0.          0.40811896  0.        ]
-0.5337110569441175
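The sparsity of the lasso solution is easier to see next to ridge regression. The sketch below is only an illustration: fitting a regressor to the class labels follows the cell above, and alpha=0.1 for Ridge is an arbitrary choice made here for comparison.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import Lasso, Ridge

iris = load_iris()
X, y = iris.data, iris.target

lasso = Lasso(alpha=0.1).fit(X, y)   # l1 penalty
ridge = Ridge(alpha=0.1).fit(X, y)   # l2 penalty

# Features whose lasso weight is non-zero are the ones "selected"
selected = [name for name, w in zip(iris.feature_names, lasso.coef_) if w != 0]

print("lasso:", lasso.coef_)
print("ridge:", ridge.coef_)
print("selected by lasso:", selected)
```

Ridge shrinks the weights but keeps all of them non-zero, while lasso drives most of them to exactly zero.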


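## 2.2 Binarizing quantitative features

The key to binarizing a quantitative feature is choosing a threshold: values greater than the threshold become 1, and values less than or equal to the threshold become 0.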

---

class sklearn.preprocessing.Binarizer(*, threshold=0.0, copy=True)
* Parameters:
    * threshold: float, optional (0.0 by default)
        * Feature values below or equal to this are replaced by 0, above it by 1. Threshold may not be less than 0 for operations on sparse matrices.
    * copy: boolean, optional, default True
        * set to False to perform inplace binarization and avoid a copy (if the input is already a numpy array or a scipy.sparse CSR matrix).

In [6]:
from sklearn.preprocessing import Binarizer
Binarizer(threshold=3).fit_transform(iris.data)

array([[1., 1., 0., 0.],
       [1., 0., 0., 0.],
       [1., 1., 0., 0.],
       [1., 1., 0., 0.],
       [1., 1., 0., 0.],
       [1., 1., 0., 0.],
       [1., 1., 0., 0.],
       [1., 1., 0., 0.],
       [1., 0., 0., 0.],
       [1., 1., 0., 0.],
       [1., 1., 0., 0.],
       [1., 1., 0., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [1., 1., 0., 0.],
       [1., 1., 0., 0.],
       [1., 1., 0., 0.],
       [1., 1., 0., 0.],
       [1., 1., 0., 0.],
       [1., 1., 0., 0.],
       [1., 1., 0., 0.],
       [1., 1., 0., 0.],
       [1., 1., 0., 0.],
       [1., 1., 0., 0.],
       [1., 1., 0., 0.],
       [1., 0., 0., 0.],
       [1., 1., 0., 0.],
       [1., 1., 0., 0.],
       [1., 1., 0., 0.],
       [1., 1., 0., 0.],
       [1., 1., 0., 0.],
       [1., 1., 0., 0.],
       [1., 1., 0., 0.],
       [1., 1., 0., 0.],
       [1., 1., 0., 0.],
       [1., 1., 0., 0.],
       [1., 1., 0., 0.],
       [1., 1., 0., 0.],
       [1., 0., 0., 0.],
       [1., 1., 0., 0.],
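The thresholding rule can be written as a one-line comparison. This minimal sketch (with illustrative names `X_manual` and `edge`) checks that Binarizer matches it, including the boundary case where a value equals the threshold.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import Binarizer

X = load_iris().data

X_bin = Binarizer(threshold=3).fit_transform(X)

# Manual version: strictly greater than the threshold maps to 1, otherwise 0
X_manual = (X > 3).astype(float)

# Boundary case: a value equal to the threshold maps to 0
edge = Binarizer(threshold=3).fit_transform(np.array([[3.0, 3.1]]))

print(np.array_equal(X_bin, X_manual))
print(edge)
```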


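## 2.3 One-hot encoding qualitative features

Since all features of the IRIS dataset are quantitative, we one-hot encode its target values instead (which is not actually necessary in practice). The following code uses the OneHotEncoder class of the preprocessing module to one-hot encode the data: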

---

class sklearn.preprocessing.OneHotEncoder(*, categories='auto', drop=None, sparse=True, dtype=<class 'numpy.float64'>, handle_unknown='error')
* Parameters:
    * categories: ‘auto’ or a list of array-like, default=’auto’
        * Categories (unique values) per feature:
        * ‘auto’ : Determine categories automatically from the training data.
        * list : categories[i] holds the categories expected in the ith column. The passed categories should not mix strings and numeric values within a single feature, and should be sorted in case of numeric values.
        * The used categories can be found in the categories_ attribute.
    * drop: {‘first’, ‘if_binary’} or a array-like of shape (n_features,), default=None
        * Specifies a methodology to use to drop one of the categories per feature. This is useful in situations where perfectly collinear features cause problems, such as when feeding the resulting data into a neural network or an unregularized regression.
        * However, dropping one category breaks the symmetry of the original representation and can therefore induce a bias in downstream models, for instance for penalized linear classification or regression models.
        * None : retain all features (the default).
        * ‘first’ : drop the first category in each feature. If only one category is present, the feature will be dropped entirely.
        * ‘if_binary’ : drop the first category in each feature with two categories. Features with 1 or more than 2 categories are left intact.

        * array : drop[i] is the category in feature X[:, i] that should be dropped.

    * sparse: bool, default=True
        * Will return sparse matrix if set True else will return an array.
    * dtype: number type, default=np.float64
        * Desired dtype of output.
    * handle_unknown{‘error’, ‘ignore’}, default=’error’
        * Whether to raise an error or ignore if an unknown categorical feature is present during transform (default is to raise). When this parameter is set to ‘ignore’ and an unknown category is encountered during transform, the resulting one-hot encoded columns for this feature will be all zeros. In the inverse transform, an unknown category will be denoted as None.

* Attributes
    * categories_: list of arrays
        * The categories of each feature determined during fitting (in order of the features in X and corresponding with the output of transform). This includes the category specified in drop (if any).
    * drop_idx_: array of shape (n_features,)
        * drop_idx_[i] is the index in categories_[i] of the category to be dropped for each feature.
        * drop_idx_[i] = None if no category is to be dropped from the feature with index i, e.g. when drop='if_binary' and the feature isn’t binary.
        * drop_idx_ = None if all the transformed features will be retained.



In [7]:
from sklearn.preprocessing import OneHotEncoder
 
# One-hot encode the target values of the IRIS dataset; returns the encoded data
print(OneHotEncoder().fit_transform(iris.target.reshape((-1,1))))

  (0, 0)	1.0
  (1, 0)	1.0
  (2, 0)	1.0
  (3, 0)	1.0
  (4, 0)	1.0
  (5, 0)	1.0
  (6, 0)	1.0
  (7, 0)	1.0
  (8, 0)	1.0
  (9, 0)	1.0
  (10, 0)	1.0
  (11, 0)	1.0
  (12, 0)	1.0
  (13, 0)	1.0
  (14, 0)	1.0
  (15, 0)	1.0
  (16, 0)	1.0
  (17, 0)	1.0
  (18, 0)	1.0
  (19, 0)	1.0
  (20, 0)	1.0
  (21, 0)	1.0
  (22, 0)	1.0
  (23, 0)	1.0
  (24, 0)	1.0
  :	:
  (125, 2)	1.0
  (126, 2)	1.0
  (127, 2)	1.0
  (128, 2)	1.0
  (129, 2)	1.0
  (130, 2)	1.0
  (131, 2)	1.0
  (132, 2)	1.0
  (133, 2)	1.0
  (134, 2)	1.0
  (135, 2)	1.0
  (136, 2)	1.0
  (137, 2)	1.0
  (138, 2)	1.0
  (139, 2)	1.0
  (140, 2)	1.0
  (141, 2)	1.0
  (142, 2)	1.0
  (143, 2)	1.0
  (144, 2)	1.0
  (145, 2)	1.0
  (146, 2)	1.0
  (147, 2)	1.0
  (148, 2)	1.0
  (149, 2)	1.0
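The output above is a sparse matrix printed as (row, column) value triples. As a minimal sketch, converting it to a dense array makes the encoding described earlier visible: one column per class, with a single 1 in each row. The name `y_onehot` is illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import OneHotEncoder

y = load_iris().target

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; toarray() makes it dense
y_onehot = encoder.fit_transform(y.reshape(-1, 1)).toarray()

print(encoder.categories_)        # the distinct values found in the column
print(y_onehot[[0, 50, 100]])     # one sample from each of the three classes
# [[1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]]
```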


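## 2.4 Imputing missing values
Since the IRIS dataset has no missing values, we add one new sample whose 4 features are all NaN to represent missing data. The following code uses the SimpleImputer class of the sklearn.impute module to impute the missing values: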

---

class sklearn.impute.SimpleImputer(*, missing_values=nan, strategy='mean', fill_value=None, verbose=0, copy=True, add_indicator=False)

* Parameters:
    * missing_values: number, string, np.nan (default) or None
        * The placeholder for the missing values. All occurrences of missing_values will be imputed. For pandas’ dataframes with nullable integer dtypes with missing values, missing_values should be set to np.nan, since pd.NA will be converted to np.nan.
    * strategy: string, default=’mean’
        * The imputation strategy.
        * If “mean”, then replace missing values using the mean along each column. Can only be used with numeric data.
        * If “median”, then replace missing values using the median along each column. Can only be used with numeric data.
        * If “most_frequent”, then replace missing using the most frequent value along each column. Can be used with strings or numeric data.
        * If “constant”, then replace missing values with fill_value. Can be used with strings or numeric data.
    * fill_value: string or numerical value, default=None
        * When strategy == “constant”, fill_value is used to replace all occurrences of missing_values. If left to the default, fill_value will be 0 when imputing numerical data and “missing_value” for strings or object data types.
    * verbose: integer, default=0
        * Controls the verbosity of the imputer.
    * copy: boolean, default=True
        * If True, a copy of X will be created. If False, imputation will be done in-place whenever possible. Note that, in the following cases, a new copy will always be made, even if copy=False:
        * If X is not an array of floating values;
        * If X is encoded as a CSR matrix;
        * If add_indicator=True.

    * add_indicator: boolean, default=False
        * If True, a MissingIndicator transform will stack onto output of the imputer’s transform. This allows a predictive estimator to account for missingness despite imputation. If a feature has no missing values at fit/train time, the feature won’t appear on the missing indicator even if there are missing values at transform/test time.

* Attributes
    * statistics_: array of shape (n_features,)
        * The imputation fill value for each feature. Computing statistics can result in np.nan values. During transform, features corresponding to np.nan statistics will be discarded.
    * indicator_: sklearn.impute.MissingIndicator
        * Indicator used to add binary indicators for missing values. None if add_indicator is False.

In [33]:
from numpy import vstack, array, nan
from sklearn.impute import SimpleImputer

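# Impute missing values; returns the data with missing values filled in
# missing_values is the placeholder that marks a missing value, NaN by default
# strategy is the imputation strategy, mean by default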
imp = SimpleImputer(missing_values=nan, strategy='mean')
X_new = vstack((array([nan, nan, nan, nan]), X))
imp.fit_transform(X_new)

array([[5.84333333, 3.05733333, 3.758     , 1.19933333],
       [5.1       , 3.5       , 1.4       , 0.2       ],
       [4.9       , 3.        , 1.4       , 0.2       ],
       [4.7       , 3.2       , 1.3       , 0.2       ],
       [4.6       , 3.1       , 1.5       , 0.2       ],
       [5.        , 3.6       , 1.4       , 0.2       ],
       [5.4       , 3.9       , 1.7       , 0.4       ],
       [4.6       , 3.4       , 1.4       , 0.3       ],
       [5.        , 3.4       , 1.5       , 0.2       ],
       [4.4       , 2.9       , 1.4       , 0.2       ],
       [4.9       , 3.1       , 1.5       , 0.1       ],
       [5.4       , 3.7       , 1.5       , 0.2       ],
       [4.8       , 3.4       , 1.6       , 0.2       ],
       [4.8       , 3.        , 1.4       , 0.1       ],
       [4.3       , 3.        , 1.1       , 0.1       ],
       [5.8       , 4.        , 1.2       , 0.2       ],
       [5.7       , 4.4       , 1.5       , 0.4       ],
       [5.4       , 3.9       ,
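The first row of the output above is the imputed sample. The following minimal sketch (with illustrative names `X_missing` and `X_filled`) confirms that the filled values are the column means, which the imputer stores in `statistics_`.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer

X = load_iris().data

# Prepend one sample whose four features are all missing
X_missing = np.vstack((np.full((1, 4), np.nan), X))

imp = SimpleImputer(missing_values=np.nan, strategy='mean')
X_filled = imp.fit_transform(X_missing)

print(imp.statistics_)    # fill value learned for each column
print(X_filled[0])        # the formerly missing row now holds the column means
```

Swapping `strategy='mean'` for `'median'` or `'most_frequent'` changes only how `statistics_` is computed.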

## 2.5 Data transformation

Common data transformations are polynomial, exponential, and logarithmic.

In [34]:
from sklearn.preprocessing import PolynomialFeatures

# Polynomial transformation
# degree is the degree of the polynomial, 2 by default
PolynomialFeatures().fit_transform(X)

array([[ 1.  ,  5.1 ,  3.5 , ...,  1.96,  0.28,  0.04],
       [ 1.  ,  4.9 ,  3.  , ...,  1.96,  0.28,  0.04],
       [ 1.  ,  4.7 ,  3.2 , ...,  1.69,  0.26,  0.04],
       ...,
       [ 1.  ,  6.5 ,  3.  , ..., 27.04, 10.4 ,  4.  ],
       [ 1.  ,  6.2 ,  3.4 , ..., 29.16, 12.42,  5.29],
       [ 1.  ,  5.9 ,  3.  , ..., 26.01,  9.18,  3.24]])
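With four input features the polynomial output is hard to read, so here is a minimal sketch on a single made-up sample with two features (a, b) = (2, 3). For degree 2 the generated columns are 1, a, b, a^2, a*b, b^2.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One illustrative sample with two features: a = 2, b = 3
X_small = np.array([[2.0, 3.0]])

poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X_small)

# Columns: 1, a, b, a^2, a*b, b^2
print(X_poly)   # [[1. 2. 3. 4. 6. 9.]]

# With the 4 IRIS features, degree 2 yields 15 columns
n_columns = PolynomialFeatures(degree=2).fit_transform(np.zeros((1, 4))).shape[1]
print(n_columns)   # 15
```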

In [35]:
from numpy import log1p
from sklearn.preprocessing import FunctionTransformer
 
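# Data transformation with a custom function, here the logarithm log1p
# The first argument is a single-argument function that is applied to the data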
FunctionTransformer(log1p).fit_transform(X)

array([[1.80828877, 1.5040774 , 0.87546874, 0.18232156],
       [1.77495235, 1.38629436, 0.87546874, 0.18232156],
       [1.74046617, 1.43508453, 0.83290912, 0.18232156],
       [1.7227666 , 1.41098697, 0.91629073, 0.18232156],
       [1.79175947, 1.5260563 , 0.87546874, 0.18232156],
       [1.85629799, 1.58923521, 0.99325177, 0.33647224],
       [1.7227666 , 1.48160454, 0.87546874, 0.26236426],
       [1.79175947, 1.48160454, 0.91629073, 0.18232156],
       [1.68639895, 1.36097655, 0.87546874, 0.18232156],
       [1.77495235, 1.41098697, 0.91629073, 0.09531018],
       [1.85629799, 1.54756251, 0.91629073, 0.18232156],
       [1.75785792, 1.48160454, 0.95551145, 0.18232156],
       [1.75785792, 1.38629436, 0.87546874, 0.09531018],
       [1.66770682, 1.38629436, 0.74193734, 0.09531018],
       [1.91692261, 1.60943791, 0.78845736, 0.18232156],
       [1.90210753, 1.68639895, 0.91629073, 0.33647224],
       [1.85629799, 1.58923521, 0.83290912, 0.33647224],
       [1.80828877, 1.5040774 ,
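As a minimal sketch of what the custom transformer does, the code below checks that log1p computes log(1 + x) element-wise and, as an optional extra not used in the original cell, passes `inverse_func=np.expm1` so that the transformation can be undone.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import FunctionTransformer

X = load_iris().data

# log1p(x) = log(1 + x); expm1 is its inverse
transformer = FunctionTransformer(np.log1p, inverse_func=np.expm1)
X_log = transformer.fit_transform(X)
X_back = transformer.inverse_transform(X_log)

print(np.allclose(X_log, np.log(1 + X)))   # matches the manual formula
print(np.allclose(X_back, X))              # the round trip recovers the data
```

An exponential transformation works the same way, for example `FunctionTransformer(np.exp)`.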

## 2.6 Summary

<center> Table. 1 Sklearn Data Preprocessing Tools </center>

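|Class | Purpose | Description |
|----- |---------|-----|
|StandardScaler | Data preprocessing (scaling) | Standardization: column-wise, rescales each feature to zero mean and unit variance|
|MinMaxScaler | Data preprocessing (scaling) | Min-max scaling: uses the minimum and maximum to map feature values to [0, 1]|
|Normalizer | Data preprocessing (normalization) | Row-wise, converts each sample vector to a "unit vector"|
|Binarizer | Data preprocessing (binarization) | Splits a quantitative feature into 0 and 1 at a given threshold|
|OneHotEncoder | Data preprocessing (one-hot encoding) | Encodes qualitative data as quantitative data|
|SimpleImputer | Data preprocessing (missing value imputation) | Fills in missing values, for example with the mean|
|PolynomialFeatures | Data preprocessing (polynomial transformation) | Polynomial data transformation|
|FunctionTransformer | Data preprocessing (custom transformation) | Transforms the data with a single-argument function|
|VarianceThreshold | Feature selection (Filter) | Variance threshold method|
|SelectKBest | Feature selection (Filter) | Scores can be computed with the correlation coefficient, the chi-squared test, or the maximal information coefficient|
|RFE | Feature selection (Wrapper) | Recursively trains a base model and removes the features with the smallest weight coefficients|
|SelectFromModel | Feature selection (Embedded) | Trains a base model and selects the features with the largest weight coefficients|
|PCA | Dimensionality reduction (unsupervised) | Principal component analysis|
|LDA | Dimensionality reduction (supervised) | Linear discriminant analysis|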


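# 3 Feature selection
Once data preprocessing is complete, we need to select meaningful features to feed into the algorithms and models for training. Features are usually chosen by considering two aspects:

* Whether the feature varies: if a feature does not vary, for example its variance is close to 0, the samples barely differ on this feature and it is of little use for telling them apart.
* Correlation between the feature and the target: this is fairly obvious; features that are highly correlated with the target should be preferred. Apart from the variance method, all other methods covered in this article are based on correlation.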

---

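Feature selection methods can further be divided into three types by form:
* Filter: scores each feature by its variance or by its correlation with the target, then selects features using a score threshold or a fixed number of features to keep.
* Wrapper: guided by an objective function (usually a prediction performance score), adds or removes several features at a time.
* Embedded: first trains some algorithm or model to obtain a weight coefficient for each feature, then selects features in descending order of coefficient. It resembles the Filter approach, but the quality of a feature is determined through training.


**We use the feature_selection module of sklearn for feature selection.**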
https://scikit-learn.org/stable/modules/feature_selection.html#feature-selection

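## 3.1 Filter

### 3.1.1 Removing features with low variance
Suppose a feature takes only the values 0 and 1, and 95% of the input samples have the value 1; such a feature is of little use. If 100% of the samples have the value 1, the feature carries no information at all. This proportion-based criterion applies only to discrete features; a continuous feature must be discretized before it can be judged this way (a variance threshold itself can be applied to continuous features directly, as the code below does). In practice, features where more than 95% of the samples share one value are rare, so the method is simple but of limited use on its own. It can serve as a preprocessing step for feature selection: first remove the features with little variation, then pick a suitable method from those described below for further selection.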

class sklearn.feature_selection.VarianceThreshold(threshold=0.0)
* Parameters: 
    * threshold: float, optional
        * Features with a training-set variance lower than this threshold will be removed. The default is to keep all features with non-zero variance, i.e. remove the features that have the same value in all samples.
* Attributes
    * variances_: array, shape (n_features,)
        * Variances of individual features.

In [36]:
from sklearn.feature_selection import VarianceThreshold

# Variance threshold method; returns the data after feature selection
# threshold is the variance threshold
VarianceThreshold(threshold=3).fit_transform(X)

array([[1.4],
       [1.4],
       [1.3],
       [1.5],
       [1.4],
       [1.7],
       [1.4],
       [1.5],
       [1.4],
       [1.5],
       [1.5],
       [1.6],
       [1.4],
       [1.1],
       [1.2],
       [1.5],
       [1.3],
       [1.4],
       [1.7],
       [1.5],
       [1.7],
       [1.5],
       [1. ],
       [1.7],
       [1.9],
       [1.6],
       [1.6],
       [1.5],
       [1.4],
       [1.6],
       [1.6],
       [1.5],
       [1.5],
       [1.4],
       [1.5],
       [1.2],
       [1.3],
       [1.4],
       [1.3],
       [1.5],
       [1.3],
       [1.3],
       [1.3],
       [1.6],
       [1.9],
       [1.4],
       [1.6],
       [1.4],
       [1.5],
       [1.4],
       [4.7],
       [4.5],
       [4.9],
       [4. ],
       [4.6],
       [4.5],
       [4.7],
       [3.3],
       [4.6],
       [3.9],
       [3.5],
       [4.2],
       [4. ],
       [4.7],
       [3.6],
       [4.4],
       [4.5],
       [4.1],
       [4.5],
       [3.9],
       [4.8],
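The single column in the output above does not say which feature survived. This minimal sketch (the name `kept` is illustrative) prints the per-feature variances and maps the selection mask back to feature names.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import VarianceThreshold

iris = load_iris()
X = iris.data

selector = VarianceThreshold(threshold=3)
X_sel = selector.fit_transform(X)

# Variance of each feature, and the names of the features that were kept
print(selector.variances_)
kept = [name for name, keep in zip(iris.feature_names, selector.get_support()) if keep]
print(kept)
```

Only petal length has a variance above 3, so it is the only feature retained. Because variance depends on the unit of measurement, a threshold chosen this way is meaningful only for features on comparable scales.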
      

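### 3.1.2 Univariate feature selection

Univariate feature selection computes a statistic for each variable separately, uses that statistic to judge which features are important, and discards the unimportant ones.

* For classification problems (discrete y), use:
    * the chi-squared test, f_classif, mutual_info_classif (mutual information)
* For regression problems (continuous y), use:
    * the Pearson correlation coefficient, f_regression, mutual_info_regression, the maximal information coefficient

This approach is simple, easy to run, and easy to understand, and it is usually effective for understanding the data (though not necessarily for optimizing the feature set or improving generalization). Many improved versions and variants exist.

Univariate feature selection picks the best features based on univariate statistical tests. It can be seen as a preprocessing step for a predictive model.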

---
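Scikit-learn exposes feature selection routines as objects that implement the transform method:

* SelectKBest:
    * removes all but the k highest-scoring features (keeps the top k)
* SelectPercentile:
    * removes all but a user-specified highest-scoring percentage of features (keeps the top k%)
* SelectFpr, SelectFdr, SelectFwe:
    * apply a common univariate statistical test to each feature: false positive rate (SelectFpr), false discovery rate (SelectFdr), or family-wise error rate (SelectFwe)
* GenericUnivariateSelect:
    * performs univariate feature selection with a configurable strategy, so the best selection strategy can be found with a hyperparameter search

These objects take a scoring function that returns univariate scores (such as F-statistics) and p-values (which are compared against the significance level in hypothesis testing). SelectKBest and SelectPercentile need only the scores, not the p-values. The available scoring functions are: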

* For classification: chi2, f_classif, mutual_info_classif
* For regression: f_regression, mutual_info_regression

--- 

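Chi-squared (Chi2) test

The classic chi-squared test measures the dependence between a qualitative independent variable and a qualitative dependent variable. Suppose the independent variable takes N values and the dependent variable takes M values. Considering the gap between the observed frequency A and the expected frequency E of samples whose independent variable equals i and whose dependent variable equals j, we build the statistic: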
$\chi^{2}=\sum \frac{(A-E)^{2}}{E}$

---
sklearn.feature_selection.chi2(X, y)
* Parameters:
    * X: {array-like, sparse matrix} of shape (n_samples, n_features)
        * Sample vectors.
    * y: array-like of shape (n_samples,)
        * Target vector (class labels).
* Returns:
    * chi2: array, shape = (n_features,)
        * chi2 statistics of each feature.
    * pval: array, shape = (n_features,)
        * p-values of each feature.

In [37]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
X_new = SelectKBest(chi2, k=2).fit_transform(X, y)
print(X.shape)
print(X_new.shape)

(150, 4)
(150, 2)
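To connect the statistic with the code, the sketch below recomputes the chi-squared scores by hand and lists the selected features. It assumes, as the sklearn implementation does, that non-negative feature values are treated like counts: the observed value A is the per-class sum of a feature, and the expected value E is the class proportion times the feature total. The names `Y`, `observed`, `expected`, and `kept` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

iris = load_iris()
X, y = iris.data, iris.target

selector = SelectKBest(chi2, k=2).fit(X, y)
scores, pvalues = chi2(X, y)

# Manual chi-squared statistic per feature
Y = np.eye(3)[y]                                      # one-hot class indicators
observed = Y.T @ X                                    # A: per-class feature sums
expected = np.outer(Y.mean(axis=0), X.sum(axis=0))    # E: class share * feature total
chi2_manual = ((observed - expected) ** 2 / expected).sum(axis=0)

kept = [n for n, keep in zip(iris.feature_names, selector.get_support()) if keep]
print(scores)
print(np.allclose(scores, chi2_manual))
print(kept)
```

The two petal measurements receive the highest scores, so they are the two features SelectKBest keeps.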


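Pearson correlation coefficient

The Pearson correlation coefficient is one of the simplest ways to understand the relationship between a feature and the response variable. It measures the linear correlation between variables and takes values in [-1, 1]: -1 means perfect negative correlation, +1 means perfect positive correlation, and 0 means no linear correlation.

The Pearson correlation is fast and easy to compute, and it is often the first thing run once the data is available (after cleaning and feature extraction).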

---

sklearn.feature_selection.f_regression(X, y, *, center=True)
* Parameters: 
    * X: {array-like, sparse matrix} shape = (n_samples, n_features)
        * The set of regressors that will be tested sequentially.
    * y: array of shape(n_samples).
        * The target vector.
    * center: bool, default=True
        * If true, X and y will be centered.

* Returns
    * F: array, shape=(n_features,)
        * F values of features.
    * pval: array, shape=(n_features,)
        * p-values of F-scores.

In [38]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression

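# Select the K best features; returns the data restricted to the selected features
# The first argument is the scoring function: it takes the feature matrix and the target vector and returns a pair of arrays (scores, p-values), whose i-th entries are the score and p-value of the i-th feature. Here it is f_regression, which is derived from the correlation coefficient
# k is the number of features to select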

X_new = SelectKBest(f_regression, k=2).fit_transform(X, y)
f_value, p_value = f_regression(X,y)
print(X.shape)
print(X_new.shape)
print(f_value)
print(p_value)

(150, 4)
(150, 2)
[ 233.8389959    32.93720748 1341.93578461 1592.82421036]
[2.89047835e-32 5.20156326e-08 4.20187315e-76 4.15531102e-81]


Other functions and classes in sklearn.feature_selection:
* f_classif
    * ANOVA F-value between label/feature for classification tasks.
* mutual_info_classif
    * Mutual information for a discrete target.
* chi2
    * Chi-squared stats of non-negative features for classification tasks.
* f_regression
    * F-value between label/feature for regression tasks.
* mutual_info_regression
    * Mutual information for a continuous target.
* SelectPercentile
    * Select features based on percentile of the highest scores.
* SelectFpr
    * Select features based on a false positive rate test.
* SelectFdr
    * Select features based on an estimated false discovery rate.
* SelectFwe
    * Select features based on family-wise error rate.
* GenericUnivariateSelect
    * Univariate feature selector with configurable mode.
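
A short sketch of two of the selectors listed above, assuming the IRIS data used in this document. `SelectPercentile` keeps a fraction of the features, and `GenericUnivariateSelect` exposes the same strategies through its `mode` and `param` arguments.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import (
    GenericUnivariateSelect,
    SelectPercentile,
    chi2,
    f_classif,
)

X, y = load_iris(return_X_y=True)

# Keep the top 50% of features ranked by the ANOVA F-value.
percentile_selector = SelectPercentile(f_classif, percentile=50)
X_percentile = percentile_selector.fit_transform(X, y)

# The same kind of selection through the configurable selector:
# mode="k_best" with param=2 keeps the two highest-scoring features.
generic_selector = GenericUnivariateSelect(chi2, mode="k_best", param=2)
X_generic = generic_selector.fit_transform(X, y)

print(X_percentile.shape, percentile_selector.get_support())
print(X_generic.shape, generic_selector.get_support())
```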

### 3.1.4 Mutual information

Mutual information and maximal information coefficient (MIC)

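The classical mutual information also measures the dependence between a categorical feature and a categorical target. The mutual information I(X;Y) of two random variables X and Y is the expectation, over all pairs of outcomes, of the pointwise mutual information. It is computed as:

$I(X;Y)=\sum_{x \in X}\sum_{y \in Y} p(x,y)\log\frac{p(x,y)}{p(x)\,p(y)}$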
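
The definition of mutual information for discrete variables, $I(X;Y)=\sum_{x}\sum_{y} p(x,y)\log\frac{p(x,y)}{p(x)p(y)}$, can be sketched directly with NumPy. The helper `mutual_information` and the toy arrays are illustrative, not part of sklearn; `sklearn.metrics.mutual_info_score` is used only as a cross-check.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def mutual_information(x, y):
    """Mutual information (in nats) of two discrete 1-D arrays."""
    x_values, x_idx = np.unique(x, return_inverse=True)
    y_values, y_idx = np.unique(y, return_inverse=True)

    # Joint distribution p(x, y) estimated from co-occurrence counts.
    joint = np.zeros((len(x_values), len(y_values)))
    np.add.at(joint, (x_idx, y_idx), 1)
    joint /= joint.sum()

    # Marginals p(x) and p(y).
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)

    # Sum only over cells with non-zero probability (0 * log 0 is taken as 0).
    nonzero = joint > 0
    ratio = joint[nonzero] / (px @ py)[nonzero]
    return float(np.sum(joint[nonzero] * np.log(ratio)))

# Illustrative data: b copies a, c is unrelated to a.
a = np.array([0, 0, 1, 1, 0, 0, 1, 1])
b = a.copy()
c = np.array([0, 1, 0, 1, 0, 1, 0, 1])

print(mutual_information(a, b))   # identical variables
print(mutual_information(a, c))   # independent variables
print(mutual_info_score(a, b))    # scikit-learn's value for comparison
```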

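Using mutual information directly for feature selection is not very convenient:
* It is not a metric and cannot be normalized, so results on different datasets cannot be compared;
* It is awkward to compute for continuous variables (the formula assumes that X and Y are sets of discrete values x and y), so the variables usually have to be discretized first, and the result is sensitive to how the discretization is done.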

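The maximal information coefficient (MIC) overcomes both problems. It first searches for an optimal discretization and then converts the mutual information into a normalized score in [0, 1]. The minepy package provides an MIC implementation.

For the example y = x^2, MIC returns 1, the maximum value.
The statistical power of MIC has been questioned: when the null hypothesis does not hold, the power of the MIC statistic may suffer. This is not a problem on some datasets, but it is on others.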

In [42]:
import numpy as np
from minepy import MINE
m = MINE()
x = np.random.uniform(-1, 1, 10000)
print(x)
m.compute_score(x, x**2)
print(m.mic())

[ 0.11120806  0.82855225 -0.40183082 ... -0.07893167  0.15397029
  0.06326656]
1.0000000000000009
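
When minepy is not installed, a similar comparison can be sketched with sklearn alone. This is a swapped-in technique, not MIC: `mutual_info_regression` uses a nearest-neighbour estimate of mutual information, and its values are not normalized to [0, 1]. The sample size and seed are arbitrary.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.RandomState(0)
x = rng.uniform(-1, 1, 1000)
y = x ** 2

# Pearson correlation is close to zero because the dependence is not linear.
pearson_r = np.corrcoef(x, y)[0, 1]

# The nearest-neighbour estimate of mutual information is clearly positive.
mi = mutual_info_regression(x.reshape(-1, 1), y, random_state=0)[0]

print(pearson_r)
print(mi)
```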


## 3.2 Wrapper
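### 3.2.1 Recursive feature elimination (RFE)
Recursive feature elimination trains a base model over several rounds. After each round it removes the features with the smallest weights and then trains the next round on the remaining features.

From the sklearn documentation: given an external estimator that assigns weights to features (for example, the coefficients of a linear model), RFE selects features by recursively considering smaller and smaller sets of features. First, the estimator is trained on the initial set of features and a weight is obtained for each feature. Then the features with the smallest absolute weights are pruned from the set. The procedure is repeated on the pruned set until the desired number of features is reached.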

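RFECV runs RFE inside a cross-validation loop to find the best number of features. An exhaustive search is not practical: a set of d features has 2^d - 1 non-empty subsets, and evaluating the validation error of every one of them with an external learning algorithm (an SVM, for example) quickly becomes too expensive. RFECV instead scores only the nested subsets that RFE produces and keeps the number of features with the best cross-validated score.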
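
A minimal sketch of cross-validated recursive feature elimination with `RFECV`, assuming the IRIS data used in this document. The choice of `LogisticRegression`, `max_iter=1000`, and 5 folds is illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# max_iter is raised so the solver converges on unscaled data.
estimator = LogisticRegression(max_iter=1000)

# Remove one feature per round (step=1) and score each nested
# subset with 5-fold cross-validation.
selector = RFECV(estimator=estimator, step=1, cv=5)
selector.fit(X, y)

print(selector.n_features_)   # number of features chosen by cross-validation
print(selector.support_)      # boolean mask of the selected features
print(selector.ranking_)      # 1 = selected; larger = eliminated earlier
```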

https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html#sklearn.feature_selection.RFE

Example:
Recursive feature elimination: an example showing the relevance of pixels in a digit classification task.
https://scikit-learn.org/stable/auto_examples/feature_selection/plot_rfe_digits.html#sphx-glr-auto-examples-feature-selection-plot-rfe-digits-py

In [43]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
 
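# Recursive feature elimination; returns the data restricted to the selected features
# estimator is the base model
# n_features_to_select is the number of features to keep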
RFE(estimator=LogisticRegression(), n_features_to_select=2).fit_transform(X, y)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


array([[1.4, 0.2],
       [1.4, 0.2],
       [1.3, 0.2],
       [1.5, 0.2],
       [1.4, 0.2],
       [1.7, 0.4],
       [1.4, 0.3],
       [1.5, 0.2],
       [1.4, 0.2],
       [1.5, 0.1],
       [1.5, 0.2],
       [1.6, 0.2],
       [1.4, 0.1],
       [1.1, 0.1],
       [1.2, 0.2],
       [1.5, 0.4],
       [1.3, 0.4],
       [1.4, 0.3],
       [1.7, 0.3],
       [1.5, 0.3],
       [1.7, 0.2],
       [1.5, 0.4],
       [1. , 0.2],
       [1.7, 0.5],
       [1.9, 0.2],
       [1.6, 0.2],
       [1.6, 0.4],
       [1.5, 0.2],
       [1.4, 0.2],
       [1.6, 0.2],
       [1.6, 0.2],
       [1.5, 0.4],
       [1.5, 0.1],
       [1.4, 0.2],
       [1.5, 0.2],
       [1.2, 0.2],
       [1.3, 0.2],
       [1.4, 0.1],
       [1.3, 0.2],
       [1.5, 0.2],
       [1.3, 0.3],
       [1.3, 0.3],
       [1.3, 0.2],
       [1.6, 0.6],
       [1.9, 0.4],
       [1.4, 0.3],
       [1.6, 0.2],
       [1.4, 0.2],
       [1.5, 0.2],
       [1.4, 0.2],
       [4.7, 1.4],
       [4.5, 1.5],
       [4.9,

## 3.3 Embedded
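Feature selection using SelectFromModel

Univariate feature selection measures the relationship between each feature and the response variable independently. Another mainstream approach is based on machine learning models. Some models have a built-in mechanism for scoring features, or can easily be applied to feature selection, for example regression models, SVMs, decision trees and random forests. Note that in a simple linear regression with a single predictor, the standardized regression coefficient equals the Pearson correlation coefficient.

SelectFromModel is a meta-transformer that works with any estimator that exposes a coef_ or feature_importances_ attribute after fitting. Features whose coef_ or feature_importances_ value falls below the given threshold are removed. Besides a numeric threshold, a string can be passed to use a built-in heuristic: the mean ("mean"), the median ("median"), or either of these multiplied by a float, such as "0.1*mean".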
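
The string thresholds can be sketched as follows, assuming the IRIS data and an `ExtraTreesClassifier` as the base estimator (both are illustrative choices). With `threshold="median"`, features whose importance is at least the median importance are kept.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

X, y = load_iris(return_X_y=True)

forest = ExtraTreesClassifier(n_estimators=100, random_state=0)

# threshold="median": keep features whose importance is at least the median.
selector = SelectFromModel(forest, threshold="median").fit(X, y)

importances = selector.estimator_.feature_importances_
print(importances)
print(selector.threshold_)      # the numeric value the string resolved to
print(selector.get_support())   # mask of features with importance >= threshold
```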

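Example: Feature selection using SelectFromModel and LassoCV: selects the two most important features of a dataset without knowing the threshold beforehand (the linked version of the example uses the diabetes dataset).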
https://scikit-learn.org/stable/auto_examples/feature_selection/plot_select_from_model_diabetes.html#sphx-glr-auto-examples-feature-selection-plot-select-from-model-diabetes-py

---

class sklearn.feature_selection.SelectFromModel(estimator, *, threshold=None, prefit=False, norm_order=1, max_features=None)
* Parameters
    * estimator: object
        * The base estimator from which the transformer is built. This can be both a fitted (if prefit is set to True) or a non-fitted estimator. The estimator must have either a feature_importances_ or coef_ attribute after fitting.
    * threshold: string, float, optional default None
        * The threshold value to use for feature selection. Features whose importance is greater or equal are kept while the others are discarded. If “median” (resp. “mean”), then the threshold value is the median (resp. the mean) of the feature importances. A scaling factor (e.g., “1.25*mean”) may also be used. If None and if the estimator has a parameter penalty set to l1, either explicitly or implicitly (e.g, Lasso), the threshold used is 1e-5. Otherwise, “mean” is used by default.
    * prefit: bool, default False
        * Whether a prefit model is expected to be passed into the constructor directly or not. If True, transform must be called directly and SelectFromModel cannot be used with cross_val_score, GridSearchCV and similar utilities that clone the estimator. Otherwise train the model using fit and then transform to do feature selection.
    * norm_order: non-zero int, inf, -inf, default 1
        * Order of the norm used to filter the vectors of coefficients below threshold in the case where the coef_ attribute of the estimator is of dimension 2.
    * max_features: int or None, optional
        * The maximum number of features to select. To only select based on max_features, set threshold=-np.inf.
* Attributes
    * estimator_: an estimator
        * The base estimator from which the transformer is built. This is stored only when a non-fitted estimator is passed to the SelectFromModel, i.e when prefit is False.
    * threshold_: float
        * The threshold value used for feature selection.

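### 3.3.1 L1-based feature selection
Linear models penalized with the L1 norm have sparse solutions: many of their estimated coefficients are zero. When the goal is to reduce the dimensionality of the data for use with another classifier, feature_selection.SelectFromModel can be used to select the features with non-zero coefficients. Sparse estimators commonly used for this purpose are linear_model.Lasso for regression, and linear_model.LogisticRegression and svm.LinearSVC for classification. For SVMs and logistic regression, the parameter C controls the sparsity: the smaller C, the fewer features are selected. For Lasso, the larger the parameter $\alpha$, the fewer features are selected.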
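
The effect of the penalty strength on sparsity can be sketched with `Lasso`. The diabetes regression dataset and the list of `alpha` values are assumptions made for this illustration; they do not appear elsewhere in this document.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso

# The diabetes regression dataset is used here only for illustration.
X_reg, y_reg = load_diabetes(return_X_y=True)

alphas = [0.001, 0.1, 1.0, 10.0]
nonzero_counts = []
for alpha in alphas:
    lasso = Lasso(alpha=alpha, max_iter=10000).fit(X_reg, y_reg)
    # A feature counts as selected when its coefficient is not exactly zero.
    nonzero_counts.append(int(np.sum(lasso.coef_ != 0)))

for alpha, count in zip(alphas, nonzero_counts):
    print(f"alpha={alpha}: {count} non-zero coefficients")
```

For a large enough `alpha` every coefficient is driven to zero, so in practice the value is tuned rather than fixed by hand.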

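Lasso: L1-recovery and compressive sensing

For a good choice of alpha, the Lasso can fully recover the exact set of non-zero coefficients from only a few observations, provided certain conditions are met. In particular, the number of samples should be "sufficiently large", or L1 models will perform at random. What counts as "sufficiently large" depends on the number of non-zero coefficients, the logarithm of the number of features, the amount of noise, the smallest absolute value of the non-zero coefficients, and the structure of the design matrix X. In addition, the design matrix must have certain properties, such as not being too correlated. There is no general rule for selecting alpha for the recovery of non-zero coefficients. It can be set by cross-validation (LassoCV or LassoLarsCV), though this may lead to under-penalized models: including a small number of non-relevant variables is not detrimental to the prediction score. BIC (LassoLarsIC), in contrast, tends to set larger values of alpha.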
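
The two ways of choosing alpha mentioned above can be compared with a short sketch. The diabetes dataset is an assumption made for illustration; the values printed depend on the data and on the sklearn version, so no particular ordering is claimed here.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LassoCV, LassoLarsIC

X_reg, y_reg = load_diabetes(return_X_y=True)

# alpha chosen by 5-fold cross-validation.
lasso_cv = LassoCV(cv=5, random_state=0).fit(X_reg, y_reg)

# alpha chosen by minimising the Bayesian information criterion.
lasso_bic = LassoLarsIC(criterion="bic").fit(X_reg, y_reg)

print("LassoCV alpha:", lasso_cv.alpha_)
print("LassoLarsIC (BIC) alpha:", lasso_bic.alpha_)
```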

In [45]:
from sklearn.svm import LinearSVC
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
print(X.shape)
lsvc = LinearSVC(C=0.01, penalty="l1", dual=False).fit(X, y)
model = SelectFromModel(lsvc, prefit=True)
X_new = model.transform(X)
print(X_new.shape)

(150, 4)
(150, 3)


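### 3.3.2 Randomized sparse models

The limitation of L1-based sparse models is that, faced with a group of highly correlated features, they tend to select only one of them. To mitigate this, randomization techniques can be used: re-estimate the sparse model many times, either on perturbed versions of the design matrix or on sub-samples of the data, and count how many times each regressor is selected.

**Stability selection**

RandomizedLasso implements this strategy for the Lasso, and RandomizedLogisticRegression uses logistic regression and is suitable for classification tasks. To obtain the full path of stability scores, lasso_stability_path can be used. Note that these classes and this function were deprecated in scikit-learn 0.19 and removed in 0.21, so they are not available in recent releases.

Note that for randomized sparse models to be more powerful than standard F statistics at detecting non-zero features, the ground truth model should be sparse; in other words, only a small fraction of the features should be non-zero.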
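
The subsampling idea can be sketched by hand with a plain `Lasso`. This is a simplified illustration, not the removed sklearn classes: it only resamples the rows and does not randomly rescale the penalty. The diabetes dataset, `alpha`, and `n_rounds` are arbitrary assumptions.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso

X_reg, y_reg = load_diabetes(return_X_y=True)
n_samples, n_features = X_reg.shape

rng = np.random.RandomState(0)
n_rounds = 100          # number of random subsamples (arbitrary choice)
alpha = 0.5             # L1 strength (arbitrary choice for this sketch)
selected_counts = np.zeros(n_features)

for _ in range(n_rounds):
    # Draw half of the samples without replacement.
    idx = rng.choice(n_samples, size=n_samples // 2, replace=False)
    lasso = Lasso(alpha=alpha, max_iter=10000).fit(X_reg[idx], y_reg[idx])
    # Count the features that received a non-zero coefficient.
    selected_counts += (lasso.coef_ != 0)

# Fraction of rounds in which each feature was selected.
selection_frequency = selected_counts / n_rounds
print(selection_frequency)
```

Features with a selection frequency close to 1 are chosen consistently across subsamples, which is the signal that stability selection relies on.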


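### 3.3.3 Tree-based feature selection
Tree-based estimators (see the sklearn.tree module, and the forests of trees in the sklearn.ensemble module) can compute feature importances, which in turn can be used to discard irrelevant features (when coupled with sklearn.feature_selection.SelectFromModel):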

https://scikit-learn.org/stable/modules/classes.html?highlight=sklearn%20tree#module-sklearn.tree
https://scikit-learn.org/stable/modules/classes.html?highlight=klearn%20ensemble#module-sklearn.ensemble



In [47]:
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel
print(X.shape)
clf = ExtraTreesClassifier().fit(X, y)
print(clf.feature_importances_)  
model = SelectFromModel(clf, prefit=True)
X_new = model.transform(X)
print(X_new.shape)               

(150, 4)
[0.09494829 0.05546731 0.39589694 0.45368746]
(150, 2)


### 3.3.4 Feature selection as part of a pipeline
Feature selection is usually applied as a preprocessing step before the actual learning. The recommended way to do this in scikit-learn is to use sklearn.pipeline.Pipeline:
https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html?highlight=pipeline#sklearn.pipeline.Pipeline  

In [51]:
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

clf = Pipeline([
  ('feature_selection', SelectFromModel(LinearSVC(penalty="l1", dual=False))),
  ('classification', RandomForestClassifier())
])
clf.fit(X, y)



Pipeline(steps=[('feature_selection',
                 SelectFromModel(estimator=LinearSVC(dual=False,
                                                     penalty='l1'))),
                ('classification', RandomForestClassifier())])
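
One reason to put feature selection inside the pipeline is that cross-validation then re-fits the selector on the training part of each fold. A minimal sketch, assuming the IRIS data; `max_iter=5000`, `random_state=0`, and 5 folds are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)

pipeline = Pipeline([
    # Feature selection is re-fitted on the training part of every fold.
    ("feature_selection",
     SelectFromModel(LinearSVC(penalty="l1", dual=False, max_iter=5000))),
    ("classification", RandomForestClassifier(random_state=0)),
])

scores = cross_val_score(pipeline, X, y, cv=5)
print(scores.mean())
```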

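## 3.4 Summary

<center> Table 2. Feature selection methods in sklearn </center>

|Class|Type|Description|
|--|-------|----|
|VarianceThreshold|Filter|Variance threshold (removes low-variance features)|
|SelectKBest|Filter|Keeps the k highest-scoring features; the score can be the correlation coefficient, the chi-squared test, or the maximal information coefficient|
|RFE|Wrapper|Recursively trains the base model and eliminates the features with the smallest weights|
|SelectFromModel|Embedded|Trains the base model and selects the features with the largest weights|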

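# 4 Dimensionality reduction
Once feature selection is done, the model can be trained directly. However, the feature matrix may still be so large that computation is expensive and training is slow, so reducing its dimensionality is often necessary. Besides the L1-penalized models mentioned above, common dimensionality reduction methods include principal component analysis (PCA) and linear discriminant analysis (LDA); LDA is itself also a classifier. PCA and LDA have much in common: both map the original samples into a lower-dimensional space. Their objectives differ, though: PCA seeks the projection with the largest variance, while LDA seeks the projection that separates the classes best. PCA is therefore an unsupervised dimensionality reduction method, and LDA is a supervised one.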
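
The contrast between unsupervised PCA and supervised LDA can be sketched on the IRIS data. In sklearn, linear discriminant analysis is `sklearn.discriminant_analysis.LinearDiscriminantAnalysis`; `sklearn.decomposition.LatentDirichletAllocation` shares the abbreviation LDA but is a topic model and a different technique.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# PCA is unsupervised: only the feature matrix is used.
X_pca = PCA(n_components=2).fit_transform(X)

# LDA is supervised: the class labels guide the projection.
# n_components can be at most min(n_classes - 1, n_features) = 2 here.
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)

print(X_pca.shape)
print(X_lda.shape)
```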

---

class sklearn.decomposition.PCA(n_components=None, *, copy=True, whiten=False, svd_solver='auto', tol=0.0, iterated_power='auto', random_state=None)
* Parameters
    * n_components: int, float, None or str
        * Number of components to keep. If n_components is not set, all components are kept: n_components == min(n_samples, n_features).
        * If n_components == 'mle' and svd_solver == 'full', Minka’s MLE is used to guess the dimension. Use of n_components == 'mle' will interpret svd_solver == 'auto' as svd_solver == 'full'.
        * If 0 < n_components < 1 and svd_solver == 'full', select the number of components such that the amount of variance that needs to be explained is greater than the percentage specified by n_components.
        * If svd_solver == 'arpack', the number of components must be strictly less than the minimum of n_features and n_samples. Hence, the None case results in: n_components == min(n_samples, n_features) - 1.
    * copy: bool, default=True
        * If False, data passed to fit are overwritten and running fit(X).transform(X) will not yield the expected results, use fit_transform(X) instead.
    * whiten: bool, optional (default False)
        * When True (False by default) the components_ vectors are multiplied by the square root of n_samples and then divided by the singular values to ensure uncorrelated outputs with unit component-wise variances.
        * Whitening will remove some information from the transformed signal (the relative variance scales of the components) but can sometime improve the predictive accuracy of the downstream estimators by making their data respect some hard-wired assumptions.
    * svd_solver: str {‘auto’, ‘full’, ‘arpack’, ‘randomized’}
        * If auto : The solver is selected by a default policy based on X.shape and n_components: if the input data is larger than 500x500 and the number of components to extract is lower than 80% of the smallest dimension of the data, then the more efficient ‘randomized’ method is enabled. Otherwise the exact full SVD is computed and optionally truncated afterwards.
        * If full : run exact full SVD calling the standard LAPACK solver via scipy.linalg.svd and select the components by postprocessing
        * If arpack : run SVD truncated to n_components calling ARPACK solver via scipy.sparse.linalg.svds. It requires strictly 0 < n_components < min(X.shape)
        * If randomized : run randomized SVD by the method of Halko et al.
    * tol: float >= 0, optional (default .0)
        * Tolerance for singular values computed by svd_solver == ‘arpack’.
    * iterated_power: int >= 0, or ‘auto’, (default ‘auto’)
        * Number of iterations for the power method computed by svd_solver == ‘randomized’.
    * random_state: int, RandomState instance, default=None
        * Used when svd_solver == ‘arpack’ or ‘randomized’. Pass an int for reproducible results across multiple function calls. See Glossary.

* Attributes
    * components_: array, shape (n_components, n_features)
        * Principal axes in feature space, representing the directions of maximum variance in the data. The components are sorted by explained_variance_.
    * explained_variance_: array, shape (n_components,)
        * The amount of variance explained by each of the selected components.
        * Equal to n_components largest eigenvalues of the covariance matrix of X.
    * explained_variance_ratio_: array, shape (n_components,)
        * Percentage of variance explained by each of the selected components.
        * If n_components is not set then all components are stored and the sum of the ratios is equal to 1.0.

    * singular_values_: array, shape (n_components,)
        * The singular values corresponding to each of the selected components. The singular values are equal to the 2-norms of the n_components variables in the lower-dimensional space.
    * mean_: array, shape (n_features,)
        * Per-feature empirical mean, estimated from the training set.
        * Equal to X.mean(axis=0).
    * n_components_: int
        * The estimated number of components. When n_components is set to ‘mle’ or a number between 0 and 1 (with svd_solver == ‘full’) this number is estimated from input data. Otherwise it equals the parameter n_components, or the lesser value of n_features and n_samples if n_components is None.
    * n_features_: int
        * Number of features in the training data.
    * n_samples_: int
        * Number of samples in the training data.
    * noise_variance_: float
        * The estimated noise covariance following the Probabilistic PCA model from Tipping and Bishop 1999. See “Pattern Recognition and Machine Learning” by C. Bishop, 12.2.1 p. 574 or http://www.miketipping.com/papers/met-mppca.pdf. It is required to compute the estimated data covariance and score samples.

        * Equal to the average of (min(n_features, n_samples) - n_components) smallest eigenvalues of the covariance matrix of X.
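
The relationships stated for `explained_variance_`, `explained_variance_ratio_`, and `noise_variance_` can be checked with a short sketch on the IRIS data. The comparison against `np.cov` is an illustration of the attribute descriptions above, not a description of how PCA is computed internally.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
pca = PCA(n_components=2).fit(X)

# Eigenvalues of the sample covariance matrix, largest first.
eigenvalues = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]

# explained_variance_ matches the two largest eigenvalues ...
print(np.allclose(pca.explained_variance_, eigenvalues[:2]))
# ... and the ratio divides each of them by the total variance.
print(np.allclose(pca.explained_variance_ratio_,
                  eigenvalues[:2] / eigenvalues.sum()))
# noise_variance_ is the mean of the discarded eigenvalues.
print(np.isclose(pca.noise_variance_, eigenvalues[2:].mean()))
```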



In [71]:
from sklearn.decomposition import PCA
 
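# Principal component analysis; fit the model used to reduce the dimensionality
# n_components is the number of principal components to keep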
pca=PCA(n_components=2).fit(X)
pca.components_
pca.explained_variance_ratio_

array([0.92461872, 0.05306648])

In [75]:
# Method 1: choose the n_components that retains 95% of the variance
pca=PCA().fit(X)
cumsum=np.cumsum(pca.explained_variance_ratio_)
d=np.argmax(cumsum>=0.95)+1
print(d)
print(pca.explained_variance_ratio_)

2
[0.92461872 0.05306648 0.01710261 0.00521218]


In [76]:
# Method 2: choose the n_components that retains 95% of the variance
pca=PCA(n_components=0.95).fit(X,y)
print(pca.n_components_)
print(pca.explained_variance_ratio_)

2
[0.92461872 0.05306648]


In [74]:
import sklearn.decomposition
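# n_components is the number of output dimensions.
# Caution: LatentDirichletAllocation is a topic model, not linear discriminant analysis
# (that class is sklearn.discriminant_analysis.LinearDiscriminantAnalysis).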
sklearn.decomposition.LatentDirichletAllocation(n_components=2).fit_transform(X)

array([[0.86982412, 0.13017588],
       [0.84319344, 0.15680656],
       [0.85812854, 0.14187146],
       [0.82949911, 0.17050089],
       [0.87164823, 0.12835177],
       [0.81257228, 0.18742772],
       [0.82701706, 0.17298294],
       [0.85463678, 0.14536322],
       [0.82430237, 0.17569763],
       [0.86728628, 0.13271372],
       [0.87474477, 0.12525523],
       [0.83958899, 0.16041101],
       [0.86985932, 0.13014068],
       [0.88268152, 0.11731848],
       [0.90626008, 0.09373992],
       [0.85841079, 0.14158921],
       [0.85046775, 0.14953225],
       [0.84359055, 0.15640945],
       [0.84102201, 0.15897799],
       [0.84753953, 0.15246047],
       [0.84330927, 0.15669073],
       [0.81624926, 0.18375074],
       [0.89117484, 0.10882516],
       [0.74277017, 0.25722983],
       [0.80340545, 0.19659455],
       [0.82253252, 0.17746748],
       [0.78591754, 0.21408246],
       [0.86319719, 0.13680281],
       [0.8678587 , 0.1321413 ],
       [0.82604362, 0.17395638],
       [0.