### What is a feature cross

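A feature cross is a synthetic feature built by combining existing features. It lets a linear model fit nonlinear relationships in a dataset with multiple features. Suppose a dataset has features x1 and x2, and we introduce a crossed feature x3 such that: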
```
x3=x1*x2
```
The resulting model is then:
```
y=b+w1x1+w2x2+w3x3
```
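To make the effect of the crossed term concrete, here is a minimal sketch on XOR-style toy data. The data values and the helper `fit_and_predict` are illustrative assumptions, not part of the original notebook. It fits the model once with only x1 and x2 and once with the extra feature x3 = x1*x2.

```python
import numpy as np

# XOR-style toy data (assumed for illustration):
# y is 1 only when x1 and x2 have opposite signs.
x1 = np.array([-1.0, -1.0, 1.0, 1.0])
x2 = np.array([-1.0, 1.0, -1.0, 1.0])
y = np.array([0.0, 1.0, 1.0, 0.0])


def fit_and_predict(features, target):
    """Least-squares fit of y = b + w . features; returns the fitted predictions."""
    # Prepend a column of ones so the first weight acts as the bias b.
    design = np.column_stack([np.ones(len(target)), features])
    weights, *_ = np.linalg.lstsq(design, target, rcond=None)
    return design @ weights


# Model without the cross: y = b + w1*x1 + w2*x2
pred_linear = fit_and_predict(np.column_stack([x1, x2]), y)

# Model with the cross: y = b + w1*x1 + w2*x2 + w3*x3, where x3 = x1*x2
x3 = x1 * x2
pred_cross = fit_and_predict(np.column_stack([x1, x2, x3]), y)

print(np.round(pred_linear, 3))
print(np.round(pred_cross, 3))
```

On this data the model without the cross can do no better than predicting 0.5 for every point, while the model with x3 reproduces y exactly.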

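A feature cross is essentially a Cartesian product of two feature columns. Each element of the product is 1 only when both component conditions hold and 0 otherwise, so the technique suits discrete (categorical) features best. In practice, continuous data is usually bucketed first and the bucketed results are then crossed, which tends to give more informative features. Bucketing also limits the number of distinct values each feature can take, which keeps the crossed feature space small and greatly reduces computation.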
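The bucket-then-cross idea can be sketched as follows. The feature names (`lat`, `lon`), the sample values, and the bin edges are all hypothetical and chosen only for illustration. Each row ends up with a single 1 in the column for its (lat bucket, lon bucket) pair.

```python
import numpy as np

# Hypothetical continuous features (illustrative values only).
lat = np.array([10.0, 25.0, 40.0, 25.0])
lon = np.array([5.0, 5.0, 35.0, 35.0])

# Assumed bin edges: lat falls into 3 buckets, lon into 2 buckets.
lat_edges = np.array([20.0, 30.0])   # <20, [20, 30), >=30
lon_edges = np.array([20.0])         # <20, >=20

# Step 1: bucketize each continuous feature into an integer bucket id.
lat_bucket = np.digitize(lat, lat_edges)
lon_bucket = np.digitize(lon, lon_edges)

# Step 2: cross the buckets. Every (lat bucket, lon bucket) pair in the
# Cartesian product gets its own column index.
n_lat = len(lat_edges) + 1
n_lon = len(lon_edges) + 1
cross_index = lat_bucket * n_lon + lon_bucket

# Step 3: one-hot encode the crossed index. A column is 1 only when the
# row falls in that lat bucket AND that lon bucket, otherwise 0.
crossed = np.zeros((len(lat), n_lat * n_lon), dtype=int)
crossed[np.arange(len(lat)), cross_index] = 1

print(cross_index)
print(crossed)
```

With 3 and 2 buckets the cross has only 3 * 2 = 6 columns, which is why coarse bucketing keeps the crossed feature space manageable.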

FM (Factorization Machine)

In [8]:
# Linear model: f(x) = w1*x1 + w2*x2 + w3*x3 + ... + wn*xn
# Feature vector a = (1, x1, x2, x3, ..., xn), simplified here to (x1, x2)
import numpy as np
import pandas as pd
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
X = np.arange(2,5).reshape(1, 3)
X

array([[2, 3, 4]])

In [9]:
# Degree-2 expansion: vector a = (1, x1, x2, x1*x2, x1^2, x2^2)
poly = PolynomialFeatures(degree = 2)
poly.fit_transform(X)

array([[ 1.,  2.,  3.,  4.,  4.,  6.,  8.,  9., 12., 16.]])

In [10]:
poly.get_feature_names_out().tolist()  # get_feature_names() was removed in scikit-learn 1.2

['1', 'x0', 'x1', 'x2', 'x0^2', 'x0 x1', 'x0 x2', 'x1^2', 'x1 x2', 'x2^2']

In [11]:
# With interaction_only=True, pure power terms such as x1^2 and x2^2 are dropped; only products of distinct features among the N inputs are kept
poly = PolynomialFeatures(degree = 2, interaction_only = True)
poly.fit_transform(X)

array([[ 1.,  2.,  3.,  4.,  6.,  8., 12.]])

In [12]:
# Additionally set include_bias=False to drop the bias term (the column of ones)
# If True (default), then include a bias column, the feature in which
#        all polynomial powers are zero (i.e. a column of ones - acts as an
#        intercept term in a linear model)
poly = PolynomialFeatures(degree = 2, interaction_only = True, include_bias=False)
poly.fit_transform(X)


array([[ 2.,  3.,  4.,  6.,  8., 12.]])
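To show what the transformer computes in the last cell (`interaction_only=True`, `include_bias=False`), the same result can be rebuilt by hand. This is only a sketch for degree 2. The helper name `interaction_features` is made up for this example and is not a scikit-learn function.

```python
from itertools import combinations

import numpy as np


def interaction_features(row):
    """Return the original values followed by all pairwise products x_i * x_j with i < j."""
    pairs = [row[i] * row[j] for i, j in combinations(range(len(row)), 2)]
    return np.array(list(row) + pairs, dtype=float)


X = np.arange(2, 5).reshape(1, 3)   # same toy input as above: [[2, 3, 4]]
manual = np.array([interaction_features(r) for r in X])
print(manual)
```

For the row (2, 3, 4) this gives the three original values followed by 2*3, 2*4 and 3*4, matching the transformer output above.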

In [13]:
filename = 'dataset/csv/boston_house_prices.csv'
names = ['CRIM','ZN','INDUS','CHAS','NOX','RM','AGE','DIS','RAD','TAX','PTRATIO','B','LSTAT','MEDV']
# Assumption: the CSV path is relative to the working directory. In the original project the
# file sits under the workspace directory source/sparrow/feature, resolved by a helper
# get_workspace_path() that is not defined in this notebook, so it is not called here.
dataset = pd.read_csv(filename, names=names, header=2)
dataset.describe()

In [14]:
validation_size = 0.2
seed = 7
array = dataset.to_numpy()
X = array[:, 0:13]   # the 13 input features
Y = array[:, 13]     # target column: MEDV
X,Y

In [15]:
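polyFeatureHouse=PolynomialFeatures(include_bias=False)
X2=polyFeatureHouse.fit_transform(X)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() instead.
# For the 13 housing features this yields 104 names: 13 linear terms plus 91 degree-2 terms.
polyFeatureHouse.get_feature_names_out().tolist()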

In [None]:
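# Split the raw features (X), the polynomial features (X2) and the target with the same shuffle
X_train, X_test, X2_train, X2_test, Y_train, Y_test = train_test_split(X, X2, Y, test_size=validation_size, random_state=seed)

# Baseline: linear regression on the original features; score() returns R^2 on the test set
linear_regressor = linear_model.LinearRegression()
linear_regressor.fit(X_train, Y_train)
linear_regressor.score(X_test, Y_test)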

In [53]:
# Same model trained on the degree-2 polynomial (crossed) features
linear_regressor2 = linear_model.LinearRegression()
linear_regressor2.fit(X2_train, Y_train)
linear_regressor2.score(X2_test, Y_test)




0.8654012079831455
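Since the housing CSV may not be available, the same comparison can be tried on synthetic data. This is a sketch under stated assumptions: the data-generating formula, sample size and seed below are invented for illustration and are unrelated to the Boston dataset or to the score shown above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(7)
X = rng.normal(size=(500, 2))

# Assumed ground truth: the target depends mostly on the product x0 * x1.
y = 3.0 * X[:, 0] * X[:, 1] + 0.5 * X[:, 0] + rng.normal(scale=0.1, size=500)

# Add the crossed feature x0*x1 (no squares, no bias column).
crosser = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_cross = crosser.fit_transform(X)

# Split raw and crossed features with the same shuffle.
X_train, X_test, Xc_train, Xc_test, y_train, y_test = train_test_split(
    X, X_cross, y, test_size=0.2, random_state=7
)

# R^2 on the held-out set, without and with the crossed feature.
score_plain = LinearRegression().fit(X_train, y_train).score(X_test, y_test)
score_cross = LinearRegression().fit(Xc_train, y_train).score(Xc_test, y_test)
print(score_plain, score_cross)
```

Because the target here is dominated by the product term, the model with the crossed feature should score far higher than the plain linear model. On real data the gain depends on how much interaction the features actually have.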