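### Feature Engineering

To better understand common feature engineering techniques and how to implement them in Python, we will work with a small dataset. First, we import the necessary libraries and create the dataset.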

In [1]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import sklearn.preprocessing as p

# Toy dataset: one response, two categorical columns (x1, x5),
# and three quantitative columns (x2, x3, x4), two of which have missing values.
df = pd.DataFrame({'response': [2.4, 3.3, -4.2, 5.6, 1.5, 8.7],
                   'x1': ['yes', 'no', 'yes', 'maybe', 'no', 'yes'],
                   'x2': [-1, -3, np.nan, 0, np.nan, 1],
                   'x3': [2.4, 15, 3.3, 2.4, 1.8, 0.4],
                   'x4': [np.nan, np.nan, 1, 1, 1, 1],
                   'x5': ['A', 'B', np.nan, 'A', 'A', 'A']})
df



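| | response | x1 | x2 | x3 | x4 | x5 |
|---|---|---|---|---|---|---|
| 0 | 2.4 | yes | -1.0 | 2.4 | NaN | A |
| 1 | 3.3 | no | -3.0 | 15.0 | NaN | B |
| 2 | -4.2 | yes | NaN | 3.3 | 1.0 | NaN |
| 3 | 5.6 | maybe | 0.0 | 2.4 | 1.0 | A |
| 4 | 1.5 | no | NaN | 1.8 | 1.0 | A |
| 5 | 8.7 | yes | 1.0 | 0.4 | 1.0 | A |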


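`1.` Fit a linear model between the response and the three quantitative x variables in the dataset (`x2`, `x3`, `x4`). Also add an intercept. Use your results to answer the first quiz question below.

`2.` Use the sklearn documentation [here](http://scikit-learn.org/stable/modules/preprocessing.html) and the previous video to fill the missing values in each quantitative column with that column's mean. Then refit the linear model from question `1.` using the new columns, and use the results to answer the second quiz question below.

`3.` Another common way to scale features is to subtract the mean and divide by the standard deviation. For certain machine learning algorithms you should routinely consider this type of scaling (or another form of normalization), as discussed [here](https://stats.stackexchange.com/questions/189652/is-it-a-good-practice-to-always-scale-normalize-data-for-machine-learning). Use the [sklearn documentation](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler) and the previous video to scale the three new quantitative columns in the dataset.

To check that you performed these transformations correctly, answer the third quiz question below.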

In [2]:
df['intercept'] = 1
lm = sm.OLS(df['response'], df[['intercept','x2', 'x3','x4']])

results = lm.fit()
results.summary()

LinAlgError: SVD did not converge
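
The error above most likely comes from the `NaN` values still present in `x2` and `x4`, which the least-squares solver cannot handle. As a minimal sketch, the check below rebuilds only the quantitative columns of the toy dataset and counts the missing entries; the names `missing_counts` and `complete_rows` are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

# Same quantitative columns as the dataset defined at the top of this section.
df = pd.DataFrame({'x2': [-1, -3, np.nan, 0, np.nan, 1],
                   'x3': [2.4, 15, 3.3, 2.4, 1.8, 0.4],
                   'x4': [np.nan, np.nan, 1, 1, 1, 1]})

# Number of missing values in each predictor column.
missing_counts = df.isna().sum()
print(missing_counts)

# Index labels of the rows with no missing predictor at all.
complete_rows = df.dropna().index.tolist()
print(complete_rows)
```

Only two of the six rows are complete, so simply dropping incomplete rows would leave too little data for a model with four parameters. That is the motivation for imputing the missing values instead.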

In [3]:
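# 2. Fill missing values in each quantitative column with the column mean.
# `Imputer` was removed in scikit-learn 0.22; `SimpleImputer` is its replacement.
from sklearn.impute import SimpleImputer

imp = SimpleImputer(missing_values=np.nan, strategy='mean')
df[['x2', 'x3', 'x4']] = imp.fit_transform(df[['x2', 'x3', 'x4']])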
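
To see what mean imputation does to this data, the same fill can be sketched with plain pandas. This is an illustrative equivalent under the assumption that column means are computed over the non-missing entries, not a replacement for the sklearn imputer used above; `col_means` and `df_filled` are names chosen for this example.

```python
import numpy as np
import pandas as pd

# Quantitative columns of the toy dataset, before imputation.
df = pd.DataFrame({'x2': [-1, -3, np.nan, 0, np.nan, 1],
                   'x3': [2.4, 15, 3.3, 2.4, 1.8, 0.4],
                   'x4': [np.nan, np.nan, 1, 1, 1, 1]})

# pandas skips NaN when computing means, so each mean uses only observed values.
col_means = df.mean()

# Replace every NaN with the mean of its own column.
df_filled = df.fillna(col_means)

print(col_means)
print(df_filled)
```

Note that every observed value of `x4` is 1, so after imputation the whole column is the constant 1. This matters when the model is refit with an intercept.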

In [4]:
lm = sm.OLS(df['response'], df[['intercept','x2','x3','x4']])
results = lm.fit()
results.summary()



| | | | |
|---|---|---|---|
| Dep. Variable: | response | R-squared: | 0.489 |
| Model: | OLS | Adj. R-squared: | 0.149 |
| Method: | Least Squares | F-statistic: | 1.438 |
| Date: | Tue, 22 Jan 2019 | Prob (F-statistic): | 0.365 |
| Time: | 11:14:32 | Log-Likelihood: | -14.742 |
| No. Observations: | 6 | AIC: | 35.48 |
| Df Residuals: | 3 | BIC: | 34.86 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| intercept | 1.1507 | 1.111 | 1.036 | 0.376 | -2.384 | 4.685 |
| x2 | 5.1202 | 3.053 | 1.677 | 0.192 | -4.594 | 14.835 |
| x3 | 1.0487 | 0.752 | 1.394 | 0.258 | -1.345 | 3.442 |
| x4 | 1.1507 | 1.111 | 1.036 | 0.376 | -2.384 | 4.685 |

| | | | |
|---|---|---|---|
| Omnibus: | nan | Durbin-Watson: | 2.043 |
| Prob(Omnibus): | nan | Jarque-Bera (JB): | 2.515 |
| Skew: | -1.531 | Prob(JB): | 0.284 |
| Kurtosis: | 3.828 | Cond. No. | 4.21e+16 |
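
In the summary above, `intercept` and `x4` have identical coefficients, `Df Model` is 2 rather than 3, and the condition number is very large. A plausible explanation is that the imputed `x4` column is all ones and therefore duplicates the intercept column. The sketch below checks this with NumPy by rebuilding the design matrix from the imputed values (the missing `x2` entries are filled with the column mean, -0.75).

```python
import numpy as np

# Design matrix after mean imputation: intercept, x2, x3, x4.
intercept = np.ones(6)
x2 = np.array([-1, -3, -0.75, 0, -0.75, 1])
x3 = np.array([2.4, 15, 3.3, 2.4, 1.8, 0.4])
x4 = np.ones(6)  # observed values were all 1, so the imputed values are 1 too

X = np.column_stack([intercept, x2, x3, x4])

# Number of linearly independent columns in the design matrix.
rank = np.linalg.matrix_rank(X)
print(X.shape[1], rank)

# The intercept and x4 columns are exactly the same.
print(np.array_equal(X[:, 0], X[:, 3]))
```

The matrix has four columns but fewer independent ones, so the coefficients for `intercept` and `x4` cannot be separated from each other. As far as I know, statsmodels fits OLS with a pseudo-inverse by default, which would explain why the shared effect is split evenly between the two columns. In practice you would drop either `x4` or the intercept before interpreting the fit.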


In [5]:
#3
norm = p.StandardScaler()
norm.fit(df[['x2','x3','x4']])
norm.transform(df[['x2','x3','x4']])

array([[-0.20701967, -0.3706604 ,  0.        ],
       [-1.86317701,  2.20015852,  0.        ],
       [ 0.        , -0.18703048,  0.        ],
       [ 0.621059  , -0.3706604 ,  0.        ],
       [ 0.        , -0.49308035,  0.        ],
       [ 1.44913767, -0.7787269 ,  0.        ]])
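
The scaled values above can be reproduced by hand, which makes the "subtract the mean, divide by the standard deviation" recipe concrete. This is a minimal sketch that assumes the population standard deviation (`ddof=0`), which is what `StandardScaler` uses; the helper `standardize` is a name introduced for this example.

```python
import numpy as np

# x2 and x3 after mean imputation (missing x2 entries replaced by -0.75).
x2 = np.array([-1, -3, -0.75, 0, -0.75, 1])
x3 = np.array([2.4, 15, 3.3, 2.4, 1.8, 0.4])


def standardize(col):
    """Subtract the mean and divide by the population standard deviation."""
    return (col - col.mean()) / col.std()  # NumPy's std() defaults to ddof=0


z2 = standardize(x2)
z3 = standardize(x3)

print(np.round(z2, 4))
print(np.round(z3, 4))
```

The results match the first two columns of the array above. The third column, `x4`, is constant after imputation, so its standard deviation is zero and this hand-written formula would divide by zero. `StandardScaler` avoids that by leaving such columns unscaled after centering, which is why the `x4` column comes out as all zeros.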