# Feature Engineering

## Data Preprocessing

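### Normalization
Center the data on its minimum, then scale by the range (maximum minus minimum), which maps each feature into [0, 1].

$$x^* = \frac{x - \min(x)}{\max(x) - \min(x)}$$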

In [1]:
from sklearn.preprocessing import MinMaxScaler

In [2]:
data = [[-1,2],[-0.5,6],[0,10],[1,18]]

In [3]:
scaler = MinMaxScaler()
scaler = scaler.fit(data)
result = scaler.transform(data)
result

array([[0.  , 0.  ],
       [0.25, 0.25],
       [0.5 , 0.5 ],
       [1.  , 1.  ]])
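
As a sanity check on the formula, the sketch below recomputes the same result column by column in plain Python, without scikit-learn. The helper name `min_max_scale` is made up for illustration and is not part of any library; it assumes no column is constant (otherwise the range is zero).

```python
# Plain-Python check of x* = (x - min) / (max - min), applied per column.
data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]


def min_max_scale(rows):
    """Scale every column of a list-of-rows table into [0, 1]."""
    columns = list(zip(*rows))              # transpose: one tuple per feature
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [(value - lo) / (hi - lo) for value, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


scaled = min_max_scale(data)
print(scaled)   # [[0.0, 0.0], [0.25, 0.25], [0.5, 0.5], [1.0, 1.0]]
```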

Shorthand: `fit_transform` fits and transforms in one call

In [4]:
result_1 = scaler.fit_transform(data)
result_1

array([[0.  , 0.  ],
       [0.25, 0.25],
       [0.5 , 0.5 ],
       [1.  , 1.  ]])

Inverse operation: recover the original, unscaled data

In [5]:
scaler.inverse_transform(result)

array([[-1. ,  2. ],
       [-0.5,  6. ],
       [ 0. , 10. ],
       [ 1. , 18. ]])

Custom target range

In [6]:
scaler = MinMaxScaler(feature_range=(5,10))
result_2 = scaler.fit_transform(data)
result_2

array([[ 5.  ,  5.  ],
       [ 6.25,  6.25],
       [ 7.5 ,  7.5 ],
       [10.  , 10.  ]])
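
A custom range only adds a stretch and a shift on top of the [0, 1] scaling, and the inverse undoes those steps in reverse order. The sketch below works through this for the second feature of the toy data; the variable names `low` and `high` are illustrative stand-ins for the two values of `feature_range`.

```python
# Sketch: min-max scaling into an arbitrary range [low, high], plus its inverse.
column = [2, 6, 10, 18]        # second feature of the toy data
low, high = 5, 10              # plays the role of feature_range=(5, 10)

col_min, col_max = min(column), max(column)

# Forward: map to [0, 1] first, then stretch and shift into [low, high].
scaled = [(x - col_min) / (col_max - col_min) * (high - low) + low for x in column]

# Inverse: undo the shift and stretch, then undo the [0, 1] scaling.
restored = [(s - low) / (high - low) * (col_max - col_min) + col_min for s in scaled]

print(scaled)     # [5.0, 6.25, 7.5, 10.0]
print(restored)   # [2.0, 6.0, 10.0, 18.0]
```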

When the dataset is too large to fit in a single call, use `scaler.partial_fit` to fit it incrementally, batch by batch.
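
Fitting in batches is possible because min-max scaling only needs the running minimum and maximum of each column. The sketch below illustrates that idea in plain Python; `update_min_max` is a hypothetical helper, and this is a conceptual illustration rather than how `MinMaxScaler.partial_fit` is actually implemented.

```python
# Sketch of incremental fitting: only running minima/maxima are kept in memory.
def update_min_max(state, batch):
    """Fold one batch of rows into the running per-column [min, max] state."""
    for row in batch:
        if state is None:
            state = [[v, v] for v in row]       # first row seen so far
        else:
            for bounds, v in zip(state, row):
                bounds[0] = min(bounds[0], v)
                bounds[1] = max(bounds[1], v)
    return state


batches = [
    [[-1, 2], [-0.5, 6]],   # first chunk, e.g. read from disk
    [[0, 10], [1, 18]],     # second chunk
]

state = None
for batch in batches:
    state = update_min_max(state, batch)

print(state)   # [[-1, 1], [2, 18]]
```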

**Min-max normalization is very sensitive to outliers.**
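
A small illustration of that sensitivity, using a made-up outlier of 1000 appended to one column of the toy data: a single extreme value stretches the range and squeezes every other point toward 0.

```python
# One extreme value stretches the range and squeezes every other point.
def min_max(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


clean = [2, 6, 10, 18]
with_outlier = clean + [1000]        # hypothetical outlier

print(min_max(clean))                # [0.0, 0.25, 0.5, 1.0]
print([round(v, 4) for v in min_max(with_outlier)])
```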

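### Standardization
After centering the data by the mean $\mu$ and scaling by the standard deviation $\sigma$, each feature has mean 0 and variance 1. Standardization does not change the shape of the distribution, so the result is standard normal only if the original data was already normally distributed.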

$$x^* = \frac{x-\mu}{\sigma}$$

In [7]:
from sklearn.preprocessing import StandardScaler

In [8]:
scaler = StandardScaler()
result_3 = scaler.fit_transform(data)
result_3

array([[-1.18321596, -1.18321596],
       [-0.50709255, -0.50709255],
       [ 0.16903085,  0.16903085],
       [ 1.52127766,  1.52127766]])

In [9]:
scaler.mean_

array([-0.125,  9.   ])

In [10]:
scaler.var_

array([ 0.546875, 35.      ])

In [11]:
result_3.mean()

0.0

In [12]:
result_3.var()

1.0
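
The numbers above can be reproduced by hand. The sketch below standardizes the first feature of the toy data in plain Python, using the population variance (dividing by n), which is what the `var_` values above correspond to.

```python
import math

column = [-1, -0.5, 0, 1]            # first feature of the toy data

n = len(column)
mu = sum(column) / n                                  # -0.125
var = sum((x - mu) ** 2 for x in column) / n          # population variance: 0.546875
sigma = math.sqrt(var)

z = [(x - mu) / sigma for x in column]                # standardized values

z_mean = sum(z) / n
z_var = sum((v - z_mean) ** 2 for v in z) / n

print(mu, var)
print([round(v, 8) for v in z])
print(round(z_mean, 12), round(z_var, 12))
```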

### Handling Missing Values

In [13]:
import pandas as pd
from sklearn import impute

In [14]:
data = pd.read_csv('data/Narrativedata.csv',index_col=0)

In [15]:
data.head()

Unnamed: 0,Age,Sex,Embarked,Survived
0,22.0,male,S,No
1,38.0,female,C,Yes
2,26.0,female,S,Yes
3,35.0,female,S,Yes
4,35.0,male,S,No


In [16]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 0 to 890
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Age       714 non-null    float64
 1   Sex       891 non-null    object 
 2   Embarked  889 non-null    object 
 3   Survived  891 non-null    object 
dtypes: float64(1), object(3)
memory usage: 34.8+ KB


Extract the Age column to work on

In [17]:
age = data.loc[:,'Age']

At this point, `age` is a Series

In [18]:
type(age)

pandas.core.series.Series

Take its underlying values

In [19]:
age = age.values

In [20]:
type(age)

numpy.ndarray

It also has to be reshaped to 2-D, because scikit-learn transformers expect 2-D input

In [21]:
print(age.shape)
age = age.reshape(-1,1)
print(age.shape)

(891,)
(891, 1)


Instantiate three imputers that fill missing values with the mean, the median, and the constant 0

In [22]:
imp_mean = impute.SimpleImputer()
imp_mid = impute.SimpleImputer(strategy='median')
imp_zero = impute.SimpleImputer(strategy='constant', fill_value=0)

In [23]:
imp_mean = imp_mean.fit_transform(age)
imp_mid = imp_mid.fit_transform(age)
imp_zero = imp_zero.fit_transform(age)

In [24]:
imp_mean[:20,:]

array([[22.        ],
       [38.        ],
       [26.        ],
       [35.        ],
       [35.        ],
       [29.69911765],
       [54.        ],
       [ 2.        ],
       [27.        ],
       [14.        ],
       [ 4.        ],
       [58.        ],
       [20.        ],
       [39.        ],
       [14.        ],
       [55.        ],
       [ 2.        ],
       [29.69911765],
       [31.        ],
       [29.69911765]])

In [25]:
imp_mid[:20,:]

array([[22.],
       [38.],
       [26.],
       [35.],
       [35.],
       [28.],
       [54.],
       [ 2.],
       [27.],
       [14.],
       [ 4.],
       [58.],
       [20.],
       [39.],
       [14.],
       [55.],
       [ 2.],
       [28.],
       [31.],
       [28.]])

In [26]:
imp_zero[:20,:]

array([[22.],
       [38.],
       [26.],
       [35.],
       [35.],
       [ 0.],
       [54.],
       [ 2.],
       [27.],
       [14.],
       [ 4.],
       [58.],
       [20.],
       [39.],
       [14.],
       [55.],
       [ 2.],
       [ 0.],
       [31.],
       [ 0.]])
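
What the three strategies do can be sketched in plain Python on a tiny made-up sample. The `fill` helper is hypothetical; the point is only that the mean and the median are computed from the observed values and then written into the gaps.

```python
import math
import statistics

ages = [22.0, 38.0, math.nan, 35.0, math.nan, 54.0]   # made-up sample with gaps

observed = [a for a in ages if not math.isnan(a)]


def fill(values, replacement):
    """Return a copy of values with every NaN replaced."""
    return [replacement if math.isnan(v) else v for v in values]


filled_mean = fill(ages, statistics.mean(observed))       # mean of observed values
filled_median = fill(ages, statistics.median(observed))   # median of observed values
filled_zero = fill(ages, 0.0)                              # constant fill

print(filled_mean)     # gaps become 37.25
print(filled_median)   # gaps become 36.5
print(filled_zero)     # gaps become 0.0
```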

Here we fill with the median, because the mean is fractional (about 29.70), which is awkward for an age

In [27]:
data.loc[:,'Age'] = imp_mid

In [28]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 0 to 890
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Age       891 non-null    float64
 1   Sex       891 non-null    object 
 2   Embarked  889 non-null    object 
 3   Survived  891 non-null    object 
dtypes: float64(1), object(3)
memory usage: 34.8+ KB


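As shown above, Age no longer has any missing values

Next, handle the Embarked attribute. Because it is a string column, fill it with the mode (the most frequent value)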

In [29]:
imp_mode = impute.SimpleImputer(strategy='most_frequent')
Embarked = data.loc[:,'Embarked'].values.reshape(-1,1)
data.loc[:,'Embarked'] = imp_mode.fit_transform(Embarked)
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 0 to 890
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Age       891 non-null    float64
 1   Sex       891 non-null    object 
 2   Embarked  891 non-null    object 
 3   Survived  891 non-null    object 
dtypes: float64(1), object(3)
memory usage: 34.8+ KB


### Doing the Same with pandas

In [30]:
data_ = pd.read_csv('data/Narrativedata.csv',index_col=0)
data_.head()

Unnamed: 0,Age,Sex,Embarked,Survived
0,22.0,male,S,No
1,38.0,female,C,Yes
2,26.0,female,S,Yes
3,35.0,female,S,Yes
4,35.0,male,S,No


Fill the missing values of Age with the median

In [31]:
data_.loc[:,'Age'] = data_.loc[:,'Age'].fillna(data_.loc[:,'Age'].median())
data_.loc[:,'Age'].head(20)

0     22.0
1     38.0
2     26.0
3     35.0
4     35.0
5     28.0
6     54.0
7      2.0
8     27.0
9     14.0
10     4.0
11    58.0
12    20.0
13    39.0
14    14.0
15    55.0
16     2.0
17    28.0
18    31.0
19    28.0
Name: Age, dtype: float64

Drop the two records where Embarked is missing

In [32]:
data_.dropna(axis=0,inplace=True)
data_.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 889 entries, 0 to 890
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Age       889 non-null    float64
 1   Sex       889 non-null    object 
 2   Embarked  889 non-null    object 
 3   Survived  889 non-null    object 
dtypes: float64(1), object(3)
memory usage: 34.7+ KB


### Encoding and Dummy Variables (for Categorical Variables)

#### Label Encoding

In [33]:
from sklearn.preprocessing import LabelEncoder
import numpy as np

In [34]:
y = data.iloc[:,-1]
np.unique(y)

array(['No', 'Unknown', 'Yes'], dtype=object)

In [35]:
le = LabelEncoder()
le.fit(y)
y_1 = le.transform(y)
y_1

array([0, 2, 2, 2, 0, 0, 0, 0, 2, 2, 1, 2, 0, 0, 0, 1, 0, 2, 0, 2, 1, 2,
       2, 2, 0, 1, 0, 0, 2, 0, 0, 2, 2, 0, 0, 0, 2, 0, 0, 2, 0, 0, 0, 1,
       2, 0, 0, 2, 0, 0, 0, 0, 2, 2, 0, 2, 2, 0, 2, 0, 0, 2, 0, 0, 0, 2,
       2, 0, 2, 0, 0, 0, 0, 0, 2, 1, 0, 1, 2, 2, 0, 2, 2, 0, 2, 2, 0, 0,
       2, 0, 0, 0, 0, 0, 0, 0, 1, 2, 2, 0, 0, 0, 0, 0, 0, 0, 2, 2, 0, 2,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 2, 0, 2, 2, 0, 1, 0,
       0, 2, 0, 0, 2, 0, 0, 0, 1, 1, 2, 0, 0, 0, 2, 0, 0, 1, 0, 1, 1, 0,
       0, 0, 2, 0, 0, 0, 0, 2, 0, 0, 0, 2, 2, 0, 0, 0, 1, 0, 2, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 1, 2, 0, 2, 2, 0, 0, 2, 0, 2, 1, 2, 2, 0, 0,
       1, 0, 0, 0, 0, 0, 2, 0, 0, 2, 2, 2, 1, 2, 1, 0, 0, 2, 2, 0, 2, 0,
       2, 0, 0, 0, 2, 0, 2, 0, 0, 0, 2, 1, 0, 2, 0, 0, 0, 2, 0, 0, 0, 2,
       0, 0, 0, 0, 0, 2, 2, 0, 0, 0, 0, 1, 0, 2, 2, 2, 2, 2, 0, 2, 0, 1,
       0, 0, 1, 2, 2, 1, 0, 2, 2, 0, 2, 2, 0, 0, 1, 1, 0, 0, 0, 2, 0, 0,
       2, 0, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 2, 2, 2,

In [36]:
le.classes_

array(['No', 'Unknown', 'Yes'], dtype=object)
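
The mapping behind these codes is simply the position of each label in the sorted list of distinct classes. A plain-Python sketch on a made-up sample:

```python
labels = ["No", "Yes", "Yes", "Unknown", "No"]    # made-up sample

classes = sorted(set(labels))                      # ['No', 'Unknown', 'Yes']
to_code = {label: code for code, label in enumerate(classes)}

encoded = [to_code[label] for label in labels]     # strings -> integers
decoded = [classes[code] for code in encoded]      # integers -> strings

print(classes)
print(encoded)             # [0, 2, 2, 1, 0]
print(decoded == labels)   # True
```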

In [37]:
le.inverse_transform(y_1)

array(['No', 'Yes', 'Yes', 'Yes', 'No', 'No', 'No', 'No', 'Yes', 'Yes',
       'Unknown', 'Yes', 'No', 'No', 'No', 'Unknown', 'No', 'Yes', 'No',
       'Yes', 'Unknown', 'Yes', 'Yes', 'Yes', 'No', 'Unknown', 'No', 'No',
       'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'No', 'No', 'Yes', 'No',
       'No', 'Yes', 'No', 'No', 'No', 'Unknown', 'Yes', 'No', 'No', 'Yes',
       'No', 'No', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'Yes', 'No',
       'Yes', 'No', 'No', 'Yes', 'No', 'No', 'No', 'Yes', 'Yes', 'No',
       'Yes', 'No', 'No', 'No', 'No', 'No', 'Yes', 'Unknown', 'No',
       'Unknown', 'Yes', 'Yes', 'No', 'Yes', 'Yes', 'No', 'Yes', 'Yes',
       'No', 'No', 'Yes', 'No', 'No', 'No', 'No', 'No', 'No', 'No',
       'Unknown', 'Yes', 'Yes', 'No', 'No', 'No', 'No', 'No', 'No', 'No',
       'Yes', 'Yes', 'No', 'Yes', 'No', 'No', 'No', 'No', 'No', 'No',
       'No', 'No', 'No', 'No', 'No', 'No', 'No', 'Yes', 'No', 'Yes', 'No',
       'Yes', 'Yes', 'No', 'Unknown', 'No', 'No', 'Yes', 'No', 'N

This can also be done in one line

In [38]:
y_2 = LabelEncoder().fit_transform(y)
y_2

array([0, 2, 2, 2, 0, 0, 0, 0, 2, 2, 1, 2, 0, 0, 0, 1, 0, 2, 0, 2, 1, 2,
       2, 2, 0, 1, 0, 0, 2, 0, 0, 2, 2, 0, 0, 0, 2, 0, 0, 2, 0, 0, 0, 1,
       2, 0, 0, 2, 0, 0, 0, 0, 2, 2, 0, 2, 2, 0, 2, 0, 0, 2, 0, 0, 0, 2,
       2, 0, 2, 0, 0, 0, 0, 0, 2, 1, 0, 1, 2, 2, 0, 2, 2, 0, 2, 2, 0, 0,
       2, 0, 0, 0, 0, 0, 0, 0, 1, 2, 2, 0, 0, 0, 0, 0, 0, 0, 2, 2, 0, 2,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 2, 0, 2, 2, 0, 1, 0,
       0, 2, 0, 0, 2, 0, 0, 0, 1, 1, 2, 0, 0, 0, 2, 0, 0, 1, 0, 1, 1, 0,
       0, 0, 2, 0, 0, 0, 0, 2, 0, 0, 0, 2, 2, 0, 0, 0, 1, 0, 2, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 1, 2, 0, 2, 2, 0, 0, 2, 0, 2, 1, 2, 2, 0, 0,
       1, 0, 0, 0, 0, 0, 2, 0, 0, 2, 2, 2, 1, 2, 1, 0, 0, 2, 2, 0, 2, 0,
       2, 0, 0, 0, 2, 0, 2, 0, 0, 0, 2, 1, 0, 2, 0, 0, 0, 2, 0, 0, 0, 2,
       0, 0, 0, 0, 0, 2, 2, 0, 0, 0, 0, 1, 0, 2, 2, 2, 2, 2, 0, 2, 0, 1,
       0, 0, 1, 2, 2, 1, 0, 2, 2, 0, 2, 2, 0, 0, 1, 1, 0, 0, 0, 2, 0, 0,
       2, 0, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 2, 2, 2,

#### Feature Encoding

In [39]:
from sklearn.preprocessing import OrdinalEncoder

In [40]:
data_ = data.copy()

In [41]:
OrdinalEncoder().fit(data_.iloc[:,1:-1]).categories_

[array(['female', 'male'], dtype=object), array(['C', 'Q', 'S'], dtype=object)]

In [42]:
data_.iloc[:,1:-1] = OrdinalEncoder().fit_transform(data_.iloc[:,1:-1])

In [43]:
data_.head()

Unnamed: 0,Age,Sex,Embarked,Survived
0,22.0,1.0,2.0,No
1,38.0,0.0,0.0,Yes
2,26.0,0.0,2.0,Yes
3,35.0,0.0,2.0,Yes
4,35.0,1.0,2.0,No


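But is this encoding suitable for every variable? Integer codes such as 0, 1, 2 imply both an order and a distance, which not every variable has. Variables come in the following kinds:
- Nominal variables: the categories are mutually independent, with no order, and arithmetic on them is meaningless (the Embarked column of the Titanic data)
- Ordinal variables: the categories have an order, but arithmetic on them is meaningless (education level)
- Interval variables: the values have an order and arithmetic on them is meaningful (height, weight)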
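
For a nominal variable, the problem with integer codes shows up in pairwise distances. The sketch below compares integer codes with 0/1 indicator vectors (one column per category) for the three Embarked values: the integer codes claim that C is twice as far from S as from Q, while the indicator vectors keep every pair equally far apart.

```python
from itertools import combinations

categories = ["C", "Q", "S"]          # Embarked values

# Integer codes: C -> 0.0, Q -> 1.0, S -> 2.0
ordinal = {cat: float(i) for i, cat in enumerate(categories)}

# Indicator vectors: C -> [1, 0, 0], Q -> [0, 1, 0], S -> [0, 0, 1]
indicator = {
    cat: [1.0 if j == i else 0.0 for j in range(len(categories))]
    for i, cat in enumerate(categories)
}

distances = {}
for a, b in combinations(categories, 2):
    d_ordinal = abs(ordinal[a] - ordinal[b])
    d_indicator = sum((u - v) ** 2 for u, v in zip(indicator[a], indicator[b])) ** 0.5
    distances[(a, b)] = (d_ordinal, d_indicator)
    print(a, b, d_ordinal, round(d_indicator, 4))
```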

#### One-Hot Encoding

In [44]:
from sklearn.preprocessing import OneHotEncoder

In [45]:
data_ = data.copy()
data_.head()

Unnamed: 0,Age,Sex,Embarked,Survived
0,22.0,male,S,No
1,38.0,female,C,Yes
2,26.0,female,S,Yes
3,35.0,female,S,Yes
4,35.0,male,S,No


In [46]:
x = data_.iloc[:,1:-1]
enc = OneHotEncoder(categories='auto').fit(x)
result = enc.transform(x).toarray()
result

array([[0., 1., 0., 0., 1.],
       [1., 0., 1., 0., 0.],
       [1., 0., 0., 0., 1.],
       ...,
       [1., 0., 0., 0., 1.],
       [0., 1., 1., 0., 0.],
       [0., 1., 0., 1., 0.]])

This encoding can be inverted as well

In [47]:
pd.DataFrame(enc.inverse_transform(result))

Unnamed: 0,0,1
0,male,S
1,female,C
2,female,S
3,female,S
4,male,S
...,...,...
886,male,S
887,female,S
888,female,S
889,male,C


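Which output column corresponds to which category (in scikit-learn 1.0 and later this method is called `get_feature_names_out`; `get_feature_names` was removed in 1.2)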

In [48]:
enc.get_feature_names()

array(['x0_female', 'x0_male', 'x1_C', 'x1_Q', 'x1_S'], dtype=object)
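
The column layout can be reproduced by hand: the categories of the first feature come first, then those of the second, and each name has the form `x<feature index>_<category>`. The sketch below rebuilds the names and encodes single rows in plain Python; `encode_row` is a hypothetical helper, not a library function.

```python
# Categories per feature, as reported by categories_ earlier: Sex, then Embarked.
categories = [["female", "male"], ["C", "Q", "S"]]

# Column names follow the pattern x<feature index>_<category>.
names = [f"x{i}_{cat}" for i, cats in enumerate(categories) for cat in cats]


def encode_row(row):
    """One-hot encode one (Sex, Embarked) pair into a flat list of 0.0/1.0."""
    return [
        1.0 if value == cat else 0.0
        for value, cats in zip(row, categories)
        for cat in cats
    ]


print(names)
print(encode_row(["male", "S"]))      # [0.0, 1.0, 0.0, 0.0, 1.0]
print(encode_row(["female", "C"]))    # [1.0, 0.0, 1.0, 0.0, 0.0]
```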

Merge the one-hot columns into the data

In [49]:
newdata = pd.concat([data, pd.DataFrame(result)], axis=1)
newdata.head()

Unnamed: 0,Age,Sex,Embarked,Survived,0,1,2,3,4
0,22.0,male,S,No,0.0,1.0,0.0,0.0,1.0
1,38.0,female,C,Yes,1.0,0.0,1.0,0.0,0.0
2,26.0,female,S,Yes,1.0,0.0,0.0,0.0,1.0
3,35.0,female,S,Yes,1.0,0.0,0.0,0.0,1.0
4,35.0,male,S,No,0.0,1.0,0.0,0.0,1.0


In [50]:
newdata.drop(['Sex', 'Embarked'], axis=1, inplace=True)
newdata.head()

Unnamed: 0,Age,Survived,0,1,2,3,4
0,22.0,No,0.0,1.0,0.0,0.0,1.0
1,38.0,Yes,1.0,0.0,1.0,0.0,0.0
2,26.0,Yes,1.0,0.0,0.0,0.0,1.0
3,35.0,Yes,1.0,0.0,0.0,0.0,1.0
4,35.0,No,0.0,1.0,0.0,0.0,1.0


In [51]:
newdata.columns = ['Age', 'Survived', 'x0_female', 'x0_male', 'x1_C', 'x1_Q', 'x1_S']
newdata.head()

Unnamed: 0,Age,Survived,x0_female,x0_male,x1_C,x1_Q,x1_S
0,22.0,No,0.0,1.0,0.0,0.0,1.0
1,38.0,Yes,1.0,0.0,1.0,0.0,0.0
2,26.0,Yes,1.0,0.0,0.0,0.0,1.0
3,35.0,Yes,1.0,0.0,0.0,0.0,1.0
4,35.0,No,0.0,1.0,0.0,0.0,1.0


### Handling Continuous Features: Binarization and Binning

#### Binarization

In [52]:
from sklearn.preprocessing import Binarizer

In [53]:
data_ = data.copy()

A 1-D array is not accepted as input, so reshape it to 2-D

In [54]:
x = data_.iloc[:,0].values.reshape(-1,1)

In [55]:
x[:5,:]

array([[22.],
       [38.],
       [26.],
       [35.],
       [35.]])

In [56]:
x_ = Binarizer(threshold=30).fit_transform(x)
x_[:10,:]

array([[0.],
       [1.],
       [0.],
       [1.],
       [1.],
       [0.],
       [1.],
       [0.],
       [0.],
       [0.]])
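
The rule being applied is a strict comparison: values greater than the threshold become 1, and values less than or equal to it become 0. A plain-Python sketch, with a made-up last value placed exactly on the threshold:

```python
threshold = 30
ages = [22.0, 38.0, 26.0, 35.0, 35.0, 30.0]   # last value sits exactly on the threshold

# Strictly greater than the threshold -> 1.0, otherwise -> 0.0.
binary = [1.0 if age > threshold else 0.0 for age in ages]

print(binary)   # [0.0, 1.0, 0.0, 1.0, 1.0, 0.0]
```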

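#### Binning
Parameters of `KBinsDiscretizer`:
- n_bins: number of bins per feature (default 5); it is applied to every feature passed in
- encode: how the result is encoded, default onehot; accepts onehot (sparse one-hot matrix), onehot-dense (dense one-hot array) and ordinal (each bin is encoded as an integer)
- strategy: how the bin edges are chosen
 * uniform: equal-width bins, ignoring the distribution
 * quantile: equal-frequency bins, each holding the same number of samples (default)
 * kmeans: bins from clustering, where the values in each bin share the same nearest center of a 1-D k-means clustering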
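
The difference between the uniform and quantile strategies can be sketched in plain Python on a made-up sample (Python 3.8+ for `statistics.quantiles`). The `assign` helper is hypothetical and only mimics ordinal encoding; the kmeans strategy is left out.

```python
from bisect import bisect_right
from statistics import quantiles

ages = [2, 22, 38, 26, 35, 80]     # made-up sample
n_bins = 3


def assign(values, interior_edges):
    """Ordinal bin index = number of interior edges that are <= the value."""
    return [bisect_right(interior_edges, v) for v in values]


# uniform: equal-width bins between the minimum and the maximum
lo, hi = min(ages), max(ages)
width = (hi - lo) / n_bins
uniform_edges = [lo + width * i for i in range(1, n_bins)]        # [28.0, 54.0]

# quantile: edges at the 1/3 and 2/3 quantiles, so bins hold equal counts
quantile_edges = quantiles(ages, n=n_bins, method="inclusive")

uniform_codes = assign(ages, uniform_edges)
quantile_codes = assign(ages, quantile_edges)

print(uniform_codes)     # [0, 0, 1, 0, 1, 2]
print(quantile_codes)    # [0, 0, 2, 1, 1, 2]
```

With uniform bins the counts per bin are 3, 2 and 1; with quantile bins every bin holds 2 samples.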

In [57]:
from sklearn.preprocessing import KBinsDiscretizer

In [58]:
x = data_.iloc[:,0].values.reshape(-1,1)

In [59]:
est = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')
x = est.fit_transform(x)
x[:5,:]

array([[0.],
       [1.],
       [0.],
       [1.],
       [1.]])

Check how many bins there are

In [60]:
set(np.squeeze(x))

{0.0, 1.0, 2.0}

## Feature Selection
Three families of methods: filter, embedded, and wrapper

In [61]:
digit = pd.read_csv('data/digit.csv')
digit_ = digit.copy()
digit.head()

Unnamed: 0,pixel0,pixel1,pixel2,pixel3,pixel4,pixel5,pixel6,pixel7,pixel8,pixel9,...,pixel774,pixel775,pixel776,pixel777,pixel778,pixel779,pixel780,pixel781,pixel782,pixel783
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [62]:
x = digit_.iloc[:,1:]
y = digit_.iloc[:,0]
x.shape

(100, 783)

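### Filter Methods
All features -> best feature subset -> algorithm -> model evaluation

#### Variance Filtering
Remove features whose variance is very small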

In [63]:
from sklearn.feature_selection import VarianceThreshold

In [64]:
selector = VarianceThreshold(threshold=0)
x_var0 = selector.fit_transform(x)
x_var0.shape

(100, 510)

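As the shape shows, many zero-variance features have been removed

Now keep only half: use the median variance as the threshold, which removes the half of the features with the lowest variance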

In [65]:
x_mid = VarianceThreshold(threshold=np.median(x.var().values)).fit_transform(x)

In [66]:
x_mid.shape

(100, 391)

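When a feature is binary, it follows a Bernoulli distribution, whose variance is:

$$\mathrm{Var}[X] = p(1-p)$$

The threshold below therefore removes any binary feature in which one of the two values accounts for more than 80% of the samples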

In [67]:
x_bvar = VarianceThreshold(threshold=(.8*(1 - .8))).fit_transform(x)
x_bvar.shape

(100, 510)
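
To see how the threshold `.8 * (1 - .8)` acts on truly binary features, the sketch below applies it to two made-up 0/1 columns in plain Python. A feature is kept only if its variance is strictly greater than the threshold.

```python
threshold = 0.8 * (1 - 0.8)            # about 0.16


def variance(values):
    """Population variance, dividing by n."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


features = {
    "mostly_one": [1] * 9 + [0] * 1,   # p = 0.9 -> variance about 0.09
    "mixed":      [1] * 7 + [0] * 3,   # p = 0.7 -> variance about 0.21
}

kept = [name for name, values in features.items() if variance(values) > threshold]
print(kept)   # ['mixed']
```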

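#### Correlation Filtering
- Chi-squared filtering
- F-test
- Mutual information

Chi-squared filtering applies only to discrete labels (classification), and the features must be non-negative

In essence it tests whether two sets of data differ. The null hypothesis is that the two are independent, and the chi-squared test returns two statistics: the chi-squared value and the p-value

When the p-value is <= 0.05 (or 0.01), the null hypothesis is rejected and the two are considered related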
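
To make the statistic concrete, here is a simplified sketch of a chi-squared score between one non-negative feature and a binary label: the feature total observed in each class is compared with the total expected if the feature were independent of the class. The helper `chi2_score` is hypothetical and restricted to two classes so that the p-value has a closed form; it is an illustration, not scikit-learn's code.

```python
import math


def chi2_score(feature, labels):
    """Chi-squared statistic and p-value for one non-negative feature, two classes."""
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError("this sketch only handles exactly two classes")
    total = sum(feature)
    stat = 0.0
    for cls in classes:
        observed = sum(x for x, y in zip(feature, labels) if y == cls)
        class_share = labels.count(cls) / len(labels)
        expected = class_share * total        # what independence would predict
        stat += (observed - expected) ** 2 / expected
    # Two classes give 1 degree of freedom, where the chi-squared survival
    # function has the closed form erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


labels = [1, 1, 1, 0, 0, 0]
related = [3, 2, 4, 0, 1, 0]       # totals: 9 in class 1, 1 in class 0
unrelated = [1, 2, 1, 2, 1, 1]     # totals: 4 in each class

print(chi2_score(related, labels))     # statistic 6.4, p-value below 0.05
print(chi2_score(unrelated, labels))   # (0.0, 1.0)
```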

In [68]:
from sklearn.ensemble import RandomForestClassifier as RFC
from sklearn.model_selection import cross_val_score
from sklearn.feature_selection import chi2
from sklearn.feature_selection import SelectKBest

`SelectKBest` selects the k highest-scoring features

In [69]:
X_fschi = SelectKBest(chi2, k=300).fit_transform(x, y) 
X_fschi.shape

(100, 300)

In [70]:
X_fschi[0]

array([253, 245, 212, 222, 253, 253, 253, 253, 253, 253, 253, 253, 253,
       253, 160,  15,   0,   0,   0,   0,   0,   0,   0,   0, 254, 253,
       253, 253, 189,  99,   0,  32, 202, 253, 253, 253, 240, 122, 122,
       190, 253, 253, 253, 174,   0,   0,   0,   0,   0,   0,   0,   0,
       255, 253, 253, 253, 238, 222, 222, 222, 241, 253, 253, 230,  70,
         0,   0,  17, 175, 229, 253, 253,   0,   0,   0,   0,   0,   0,
         0,   0, 158, 253, 253, 253, 253, 253, 253, 253, 253, 205, 106,
        65,   0,   0,   0,   0,   0,  62, 244, 157,   0,   0,   0,   0,
         0,   0,   0,   0,   6,  26, 179, 179, 179, 179, 179,  30,  15,
        10,   0,   0,   0,   0,   0,   0,   0,   0,  14,   6,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   

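#### F-test
It can be used for regression (`f_regression`) or classification (`f_classif`). The data is usually transformed to be approximately normally distributed before the F-test is run

In essence, the F-test looks for a linear relationship between two sets of data. Its null hypothesis is that no significant linear relationship exists.

It returns two statistics, the F value and the p-value. As with chi-squared filtering, we want to keep the features whose p-value is below 0.05 or 0.01, since these are significantly linearly related to the label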
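
For classification, the F value of a feature is the one-way ANOVA statistic: the between-class mean square divided by the within-class mean square. The sketch below computes it in plain Python for one feature grouped by class, using made-up numbers; `anova_f` is a hypothetical helper. With fewer than two classes the between-class degrees of freedom are zero and the statistic is undefined, so the sketch rejects that case.

```python
def anova_f(groups):
    """One-way ANOVA F statistic; groups holds one list of values per class."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    if k < 2 or n <= k:
        raise ValueError("need at least two classes and more samples than classes")
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    ms_between = ss_between / (k - 1)      # between-class mean square
    ms_within = ss_within / (n - k)        # within-class mean square
    return ms_between / ms_within


# Values of one feature, grouped by class label (made-up numbers).
f_value = anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(round(f_value, 6))   # 13.0
```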

In [71]:
from sklearn.feature_selection import f_classif
from sklearn.feature_selection import f_regression

In [72]:
F,p = f_classif(x_mid, y)

  msb = ssbn / float(dfbn)


In [73]:
x_mid.shape

(100, 391)

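#### Mutual Information
It captures any kind of relationship between each feature and the label, both linear and nonlinear

It returns an estimate of the mutual information between each feature and the target. The estimate is non-negative: 0 means the two variables are independent, and larger values mean stronger dependence. Unlike a correlation coefficient, it is not capped at 1 in general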
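
A quick check of the value range, for the simple case of two discrete variables. The sketch below computes mutual information (in nats) directly from counts; `mutual_information` is a hypothetical helper, and this count-based formula is not the estimator that `mutual_info_classif` uses for continuous features.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    count_x, count_y, count_xy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # Sum over observed pairs of p(x, y) * log(p(x, y) / (p(x) * p(y))).
    return sum(
        (c / n) * math.log(c * n / (count_x[x] * count_y[y]))
        for (x, y), c in count_xy.items()
    )


y = [0, 1, 2, 3] * 2
independent = [0, 0, 0, 0, 1, 1, 1, 1]     # every (x, y) pair appears exactly once
identical = list(y)                         # fully determined by y

print(mutual_information(independent, y))           # 0.0
print(round(mutual_information(identical, y), 4))   # 1.3863, i.e. ln 4
```

The second value exceeds 1: for a variable that fully determines a four-class label, the mutual information equals the label's entropy, ln 4.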

In [74]:
from sklearn.feature_selection import mutual_info_classif

In [75]:
result = mutual_info_classif(x_mid, y)

In [76]:
result

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

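### Embedded Methods
Embedded methods let the algorithm itself decide which features to use, so feature selection and model training happen at the same time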