### What is a feature cross

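A feature cross is a way of synthesizing a new feature from existing ones, which lets a linear model fit nonlinear relationships in a dataset with multiple features. Suppose a dataset has features x1 and x2; we introduce a crossed feature x3 such that: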
```
x3=x1*x2
```
The final model expression then becomes:
```
y=b+w1x1+w2x2+w3x3
```
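
To see why the extra term matters, here is a minimal sketch on a toy XOR-style dataset (the four points and the `fit_linear` helper are illustrative assumptions, not part of the original notebook). A plain linear model cannot separate these points, while adding x3 = x1*x2 lets it fit them exactly.

```python
import numpy as np

# Four points laid out like XOR: y is 1 only when exactly one of x1, x2 is 1.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)


def fit_linear(features, target):
    """Least-squares fit of y = b + w . x; returns (coefficients, predictions)."""
    # Prepend a column of ones so the first coefficient is the bias b.
    design = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef, design @ coef


# Without the cross: y = b + w1*x1 + w2*x2
_, pred_plain = fit_linear(X, y)

# With the cross: x3 = x1 * x2, so y = b + w1*x1 + w2*x2 + w3*x3
x3 = X[:, 0] * X[:, 1]
X_crossed = np.column_stack([X, x3])
coef_crossed, pred_crossed = fit_linear(X_crossed, y)

print(np.round(pred_plain, 3))    # every prediction collapses to the mean
print(np.round(pred_crossed, 3))  # matches y
print(np.round(coef_crossed, 3))  # b, w1, w2, w3
```

The model is still linear in its weights; only the input has been extended with a nonlinear combination of the original features.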

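A feature cross is essentially a Cartesian product of two feature columns. In the product, an entry is 1 only when both conditions hold and 0 otherwise, so the technique suits discrete features best. In practice, continuous data is usually bucketized first and the buckets are then crossed, which tends to yield more useful features. Bucketizing also reduces each feature to a small number of discrete values, which greatly reduces the amount of computation.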
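
The bucketize-then-cross procedure can be sketched as follows. The feature names, values, and bin edges are hypothetical and chosen only for illustration; `np.digitize` does the bucketing and an outer product of the two one-hot vectors produces the cross.

```python
import numpy as np

# Hypothetical continuous features (illustrative values, not from the dataset).
latitude = np.array([10.0, 25.0, 40.0, 25.0])
longitude = np.array([5.0, 5.0, 35.0, 35.0])

# Step 1: bucketize each feature. The bin edges are assumptions.
lat_edges = np.array([20.0, 30.0])  # 3 buckets: <20, [20, 30), >=30
lon_edges = np.array([20.0])        # 2 buckets: <20, >=20
lat_bucket = np.digitize(latitude, lat_edges)
lon_bucket = np.digitize(longitude, lon_edges)

# Step 2: one-hot encode each bucketized feature.
n_lat = len(lat_edges) + 1
n_lon = len(lon_edges) + 1
lat_onehot = np.eye(n_lat, dtype=int)[lat_bucket]
lon_onehot = np.eye(n_lon, dtype=int)[lon_bucket]

# Step 3: cross them. Each output column stands for one (lat bucket, lon bucket)
# pair and is 1 only when the sample falls in both buckets at once.
crossed = (lat_onehot[:, :, None] * lon_onehot[:, None, :]).reshape(len(latitude), -1)

print(crossed.shape)  # (4, 6): 3 latitude buckets x 2 longitude buckets
print(crossed)
```

Each row contains exactly one 1, so the crossed feature is itself a one-hot encoding over all bucket pairs.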
FM (Factorization Machine)

In [43]:
# f(x) = w1*x1 + w2*x2 + w3*x3 + ... + wn*xn
# vector(a) = (1, x1, x2, x3, ..., xn), simplified here to (x1, x2)
import numpy as np
import pandas as pd
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sparrow.tools.path import get_workspace_path
from sklearn.preprocessing import PolynomialFeatures
X = np.arange(2,5).reshape(1, 3)
X

array([[2, 3, 4]])

In [44]:
# vector(a) = (1, x1, x2, x1*x2, pow(x1, 2), pow(x2, 2))
poly = PolynomialFeatures(degree = 2)
poly.fit_transform(X)

array([[ 1.,  2.,  3.,  4.,  4.,  6.,  8.,  9., 12., 16.]])

In [45]:
# get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out().
poly.get_feature_names()

['1', 'x0', 'x1', 'x2', 'x0^2', 'x0 x1', 'x0 x2', 'x1^2', 'x1 x2', 'x2^2']
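
The number of generated columns grows quickly with the number of input features. As a rough sketch (the helper name `n_poly_features` is made up for illustration), a full polynomial expansion of degree d over n features has C(n + d, d) terms, including the bias:

```python
from math import comb


def n_poly_features(n_features, degree=2, include_bias=True):
    """Count the terms of a full polynomial expansion up to `degree`."""
    total = comb(n_features + degree, degree)
    return total if include_bias else total - 1


print(n_poly_features(3))                        # 10, as in the output above
print(n_poly_features(13, include_bias=False))   # 104 columns for 13 input features
```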

In [46]:
# Set interaction_only=True to drop the pure powers of a single feature, pow(x1, 2), pow(x2, 2) (x is a sample with N features)
poly = PolynomialFeatures(degree = 2, interaction_only = True)
poly.fit_transform(X)

array([[ 1.,  2.,  3.,  4.,  6.,  8., 12.]])

In [47]:
# Additionally set include_bias=False to drop the bias term (the column of ones).
# From the docstring: "If True (default), then include a bias column, the feature in which
#        all polynomial powers are zero (i.e. a column of ones - acts as an
#        intercept term in a linear model)."
poly = PolynomialFeatures(degree = 2, interaction_only = True, include_bias=False)
poly.fit_transform(X)


array([[ 2.,  3.,  4.,  6.,  8., 12.]])
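
To make explicit what this last configuration computes, here is a small NumPy-only sketch that rebuilds the same columns by hand. The `interaction_features` helper is hypothetical and only covers the degree-2, interaction-only, no-bias case.

```python
from itertools import combinations

import numpy as np


def interaction_features(X):
    """Original columns followed by every pairwise product x_i * x_j with i < j."""
    X = np.asarray(X, dtype=float)
    pairs = [X[:, i] * X[:, j] for i, j in combinations(range(X.shape[1]), 2)]
    return np.column_stack([X] + pairs)


X_demo = np.arange(2, 5).reshape(1, 3)
print(interaction_features(X_demo))  # [[ 2.  3.  4.  6.  8. 12.]]
```

The pairs are generated in the order (x0, x1), (x0, x2), (x1, x2), which matches the column order in the output above.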

In [48]:
filename = 'dataset/csv/boston_house_prices.csv'
names = ['CRIM','ZN','INDUS','CHAS','NOX','RM','AGE','DIS','RAD','TAX','PRTATIO','B','LSTAT','MEDV']
dataset = pd.read_csv(get_workspace_path("source/sparrow/feature")+filename,names= names,header=2)
dataset.describe()

| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PRTATIO | B | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 505.0 | 505.0 | 505.0 | 505.0 | 505.0 | 505.0 | 505.0 | 505.0 | 505.0 | 505.0 | 505.0 | 505.0 | 505.0 | 505.0 |
| mean | 3.620667 | 11.350495 | 11.154257 | 0.069307 | 0.554728 | 6.284059 | 68.581584 | 3.794459 | 9.566337 | 408.459406 | 18.461782 | 356.594376 | 12.668257 | 22.529901 |
| std | 8.608572 | 23.343704 | 6.855868 | 0.254227 | 0.11599 | 0.703195 | 28.176371 | 2.107757 | 8.707553 | 168.629992 | 2.16252 | 91.367787 | 7.13995 | 9.205991 |
| min | 0.00906 | 0.0 | 0.46 | 0.0 | 0.385 | 3.561 | 2.9 | 1.1296 | 1.0 | 187.0 | 12.6 | 0.32 | 1.73 | 5.0 |
| 25% | 0.08221 | 0.0 | 5.19 | 0.0 | 0.449 | 5.885 | 45.0 | 2.1 | 4.0 | 279.0 | 17.4 | 375.33 | 7.01 | 17.0 |
| 50% | 0.25915 | 0.0 | 9.69 | 0.0 | 0.538 | 6.208 | 77.7 | 3.1992 | 5.0 | 330.0 | 19.1 | 391.43 | 11.38 | 21.2 |
| 75% | 3.67822 | 12.5 | 18.1 | 0.0 | 0.624 | 6.625 | 94.1 | 5.2119 | 24.0 | 666.0 | 20.2 | 396.21 | 16.96 | 25.0 |
| max | 88.9762 | 100.0 | 27.74 | 1.0 | 0.871 | 8.78 | 100.0 | 12.1265 | 24.0 | 711.0 | 22.0 | 396.9 | 37.97 | 50.0 |


In [49]:
validation_size = 0.2
seed = 7
array = dataset.to_numpy()
X = array[:, 0:13]
Y = array[:, 13]
X,Y

(array([[2.7310e-02, 0.0000e+00, 7.0700e+00, ..., 1.7800e+01, 3.9690e+02,
         9.1400e+00],
        [2.7290e-02, 0.0000e+00, 7.0700e+00, ..., 1.7800e+01, 3.9283e+02,
         4.0300e+00],
        [3.2370e-02, 0.0000e+00, 2.1800e+00, ..., 1.8700e+01, 3.9463e+02,
         2.9400e+00],
        ...,
        [6.0760e-02, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9690e+02,
         5.6400e+00],
        [1.0959e-01, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9345e+02,
         6.4800e+00],
        [4.7410e-02, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9690e+02,
         7.8800e+00]]),
 array([21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9, 15. , 18.9,
        21.7, 20.4, 18.2, 19.9, 23.1, 17.5, 20.2, 18.2, 13.6, 19.6, 15.2,
        14.5, 15.6, 13.9, 16.6, 14.8, 18.4, 21. , 12.7, 14.5, 13.2, 13.1,
        13.5, 18.9, 20. , 21. , 24.7, 30.8, 34.9, 26.6, 25.3, 24.7, 21.2,
        19.3, 20. , 16.6, 14.4, 19.4, 19.7, 20.5, 25. , 23.4, 18.9, 35.4,
        24.7, 31.6, 23.3, 19.6, 18.7, 1

In [56]:
polyFeatureHouse=PolynomialFeatures(include_bias=False)
X2=polyFeatureHouse.fit_transform(X)
# On scikit-learn 1.2 and later, call get_feature_names_out() instead.
polyFeatureHouse.get_feature_names()

['x0',
 'x1',
 'x2',
 'x3',
 'x4',
 'x5',
 'x6',
 'x7',
 'x8',
 'x9',
 'x10',
 'x11',
 'x12',
 'x0^2',
 'x0 x1',
 'x0 x2',
 'x0 x3',
 'x0 x4',
 'x0 x5',
 'x0 x6',
 'x0 x7',
 'x0 x8',
 'x0 x9',
 'x0 x10',
 'x0 x11',
 'x0 x12',
 'x1^2',
 'x1 x2',
 'x1 x3',
 'x1 x4',
 'x1 x5',
 'x1 x6',
 'x1 x7',
 'x1 x8',
 'x1 x9',
 'x1 x10',
 'x1 x11',
 'x1 x12',
 'x2^2',
 'x2 x3',
 'x2 x4',
 'x2 x5',
 'x2 x6',
 'x2 x7',
 'x2 x8',
 'x2 x9',
 'x2 x10',
 'x2 x11',
 'x2 x12',
 'x3^2',
 'x3 x4',
 'x3 x5',
 'x3 x6',
 'x3 x7',
 'x3 x8',
 'x3 x9',
 'x3 x10',
 'x3 x11',
 'x3 x12',
 'x4^2',
 'x4 x5',
 'x4 x6',
 'x4 x7',
 'x4 x8',
 'x4 x9',
 'x4 x10',
 'x4 x11',
 'x4 x12',
 'x5^2',
 'x5 x6',
 'x5 x7',
 'x5 x8',
 'x5 x9',
 'x5 x10',
 'x5 x11',
 'x5 x12',
 'x6^2',
 'x6 x7',
 'x6 x8',
 'x6 x9',
 'x6 x10',
 'x6 x11',
 'x6 x12',
 'x7^2',
 'x7 x8',
 'x7 x9',
 'x7 x10',
 'x7 x11',
 'x7 x12',
 'x8^2',
 'x8 x9',
 'x8 x10',
 'x8 x11',
 'x8 x12',
 'x9^2',
 'x9 x10',
 'x9 x11',
 'x9 x12',
 'x10^2',
 'x10 x11',
 'x10 x12',
 '

In [None]:
X_train, X_test,X2_train, X2_test, Y_train, Y_test = train_test_split(X, X2, Y, test_size=validation_size, random_state= seed)

linear_regressor=linear_model.LinearRegression()
linear_regressor.fit(X_train,Y_train)
linear_regressor.score(X_test,Y_test)

In [53]:
linear_regressor2=linear_model.LinearRegression()
linear_regressor2.fit(X2_train,Y_train)
linear_regressor2.score(X2_test,Y_test)




0.8654012079831455
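
The same comparison can be written more compactly with a scikit-learn pipeline, so that the polynomial expansion and the regressor are fitted together on the training split only. The sketch below uses synthetic data with a known interaction term rather than the housing CSV (the data-generating formula, sample size, and noise level are assumptions made for illustration), so its scores are not comparable to the one above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data: the target depends on the product x0 * x1 plus a linear term in x2.
rng = np.random.default_rng(7)
X_syn = rng.uniform(-1.0, 1.0, size=(400, 3))
y_syn = 2.0 * X_syn[:, 0] * X_syn[:, 1] + X_syn[:, 2] + rng.normal(scale=0.05, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=7)

# Baseline: linear regression on the raw features.
baseline = LinearRegression().fit(X_tr, y_tr)

# Crossed: degree-2 polynomial expansion followed by the same regressor.
crossed_model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
).fit(X_tr, y_tr)

r2_baseline = baseline.score(X_te, y_te)
r2_crossed = crossed_model.score(X_te, y_te)
print(f"R^2 without crosses: {r2_baseline:.3f}")
print(f"R^2 with crosses:    {r2_crossed:.3f}")
```

On real data the gain is usually smaller than in this constructed case, and the much larger number of columns makes overfitting more likely, so it is worth checking the score on held-out data as the notebook does.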