OK, here we'll introduce a few additional regression methods, mainly using sklearn.

Here is a house-price prediction example. We won't plot anything; we'll mainly judge performance by R-squared.
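For reference, R-squared compares a model's squared error with the error of always predicting the mean of the targets: R-squared = 1 - SS_res / SS_tot. The `score` method of scikit-learn regressors returns this same quantity. Below is a minimal sketch on made-up numbers (the arrays and the helper name `r_squared` are illustrative, not part of the notebook) that checks the formula against `sklearn.metrics.r2_score` and shows how the score can drop to zero or below.

```python
import numpy as np
from sklearn.metrics import r2_score


def r_squared(y_true, y_pred):
    """R-squared = 1 - SS_res / SS_tot (hypothetical helper for illustration)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # spread around the mean
    return 1.0 - ss_res / ss_tot


# Made-up targets, only to exercise the formula.
y_true = np.array([200.0, 300.0, 400.0, 500.0])
good_pred = np.array([210.0, 290.0, 410.0, 490.0])
mean_pred = np.full_like(y_true, y_true.mean())    # always predict the mean
bad_pred = np.array([500.0, 400.0, 300.0, 200.0])  # worse than the mean

r2_good = r_squared(y_true, good_pred)
r2_mean = r_squared(y_true, mean_pred)
r2_bad = r_squared(y_true, bad_pred)
print(r2_good, r2_mean, r2_bad)

# The hand-written formula should agree with scikit-learn's implementation.
r2_sklearn = r2_score(y_true, good_pred)
```

A score of 0 means the model is no better than predicting the mean, and a negative score means it is worse.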

We'll use a linear model as the baseline, and then look at SVM, KNeighbors, and RandomForest.

Load the data:

In [2]:
import pandas as pd

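# Skip the first two columns so that price becomes the first column
data = pd.read_csv('kc_house_data.csv').iloc[:, 2:]
# Drop rows with missing values
data.dropna(inplace=True)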
data.head()

Unnamed: 0,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,221900.0,3,1.0,1180,5650,1.0,0,0,3,7,1180,0,1955,0,98178,47.5112,-122.257,1340,5650
1,538000.0,3,2.25,2570,7242,2.0,0,0,3,7,2170,400,1951,1991,98125,47.721,-122.319,1690,7639
2,180000.0,2,1.0,770,10000,1.0,0,0,3,6,770,0,1933,0,98028,47.7379,-122.233,2720,8062
3,604000.0,4,3.0,1960,5000,1.0,0,0,5,7,1050,910,1965,0,98136,47.5208,-122.393,1360,5000
4,510000.0,3,2.0,1680,8080,1.0,0,0,3,8,1680,0,1987,0,98074,47.6168,-122.045,1800,7503


In [3]:
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
# Target is the price; features are all remaining columns
y = data['price'].values
X = data.iloc[:, 1:].values

models = dict()
models['Linear'] = LinearRegression()
models['SVR'] = SVR()
models['KNR'] = KNeighborsRegressor()
models['RFR'] = RandomForestRegressor()
X_train, X_test, y_train, y_test = train_test_split(X, y)
print(len(y))
print(len(y_train))
print(len(y_test))

21613
16209
5404


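The train_test_split call above splits the data into a training set and a test set. By default 25% of the rows go to the test set, which matches the 16209 / 5404 counts printed above.
Since we are not tuning any hyperparameters here, there is no validation set.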
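As a small illustration of the split sizes, the sketch below uses stand-in arrays with the same number of rows as the housing data (`X_demo` and `y_demo` are made up). The `random_state` argument is an addition that the notebook does not use; it fixes the shuffle so that repeated runs give the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in arrays with the same number of rows as the housing data.
n_samples = 21613
X_demo = np.arange(n_samples * 2).reshape(n_samples, 2)
y_demo = np.arange(n_samples)

# test_size defaults to 0.25; random_state (an addition here) fixes the shuffle.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)
print(len(y_tr), len(y_te))

# Same seed, same split.
_, _, _, y_te_again = train_test_split(X_demo, y_demo, random_state=0)
same_split = np.array_equal(y_te, y_te_again)
print(same_split)
```

Without a fixed `random_state`, every run draws a different split, so scores will vary somewhat from run to run.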

In [4]:

predictions = dict()
scores = dict()
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions[name] = model.predict(X_test)
    scores[name] = model.score(X_test, y_test)
    print('The R-squared of model {} is: {}'.format(name, scores[name]))


The R-squared of model Linear is: 0.6827481394862662
The R-squared of model SVR is: -0.05060479823213537
The R-squared of model KNR is: 0.48813878550583556
The R-squared of model RFR is: 0.5250125278505533


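As you can see, none of the three beats the linear baseline in this run: RandomForest does best at about 0.53, while KNeighbors and especially SVR do poorly.
SVR's R-squared is even negative, which means it does worse than always predicting the mean price.
A likely cause is that the features were not scaled. KNeighbors and SVR both work with distances between samples, so features with large numeric ranges, such as sqft_lot, swamp the others.
Feature scaling puts all features on a comparable scale. Min-max normalization maps each feature into a fixed range, usually [0, 1] or [-1, 1].
The StandardScaler we use below is a different kind of scaling: it standardizes each feature to mean 0 and variance 1, and does not confine the values to a fixed range.
The nice thing about it is that it is only a shift and a rescale, so it does not change the shape of each feature's distribution.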
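To make the difference between range scaling and standardization concrete, here is a small sketch on made-up numbers (the array `X_demo` is illustrative only). It uses `MinMaxScaler`, which the notebook itself does not use, alongside `StandardScaler`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Two made-up features on very different scales (think bedrooms and square feet).
X_demo = np.array([[1.0, 800.0],
                   [2.0, 1500.0],
                   [3.0, 2200.0],
                   [4.0, 5000.0]])

X_std = StandardScaler().fit_transform(X_demo)  # per column: (x - mean) / std
X_mm = MinMaxScaler().fit_transform(X_demo)     # per column: (x - min) / (max - min)

print(X_std.mean(axis=0), X_std.std(axis=0))  # approximately 0 and 1 per column
print(X_mm.min(axis=0), X_mm.max(axis=0))     # approximately 0 and 1 per column
print(X_std.min(), X_std.max())               # standardized values are not bounded
```

In the notebook the scaler is fitted on the training set only and then applied to both sets, which keeps information from the test set out of the preprocessing.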

In [4]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scale, X_test_scale = scaler.transform(X_train), scaler.transform(X_test)
for name, model in models.items():
    model.fit(X_train_scale, y_train)
    predictions['scaled_X{}'.format(name)] = model.predict(X_test_scale)
    scores['scaled_X{}'.format(name)] = model.score(X_test_scale, y_test)
    print('The R-squared of model scaled X {} is: {}'.format(name, scores['scaled_X{}'.format(name)]))

The R-squared of model scaled X Linear is: 0.6961039793320598
The R-squared of model scaled X SVR is: -0.05188688166663136
The R-squared of model scaled X KNR is: 0.8002224114292416
The R-squared of model scaled X RFR is: 0.8817434598335361


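You can see that KNeighbors has improved a lot (from about 0.49 to 0.80), but SVR is still no good.
A likely reason is that SVR's default C and epsilon are tied to the scale of the target, and the prices here are in the hundreds of thousands. So let's standardize y as well: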

In [5]:
y_scaler = StandardScaler()
y_scaler.fit(y_train.reshape(-1, 1))
# StandardScaler expects 2-D input, so reshape to a column and flatten back
y_train_scale = y_scaler.transform(y_train.reshape(-1, 1)).ravel()
y_test_scale = y_scaler.transform(y_test.reshape(-1, 1)).ravel()
for name, model in models.items():
    model.fit(X_train_scale, y_train_scale)
    predictions['scaled_y{}'.format(name)] = model.predict(X_test_scale)
    scores['scaled_y{}'.format(name)] = model.score(X_test_scale, y_test_scale)
    print('The R-squared of model scaled y {} is: {}'.format(name, scores['scaled_y{}'.format(name)]))


The R-squared of model scaled y Linear is: 0.696118445091847
The R-squared of model scaled y SVR is: 0.750656768863314
The R-squared of model scaled y KNR is: 0.8002224114292416
The R-squared of model scaled y RFR is: 0.8864932591033934
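
Note that the models fitted on the scaled y predict in scaled units, so predictions need to be mapped back (for example with `y_scaler.inverse_transform`) to read them as prices. One possible way to bundle both scaling steps is sketched below with `Pipeline` and `TransformedTargetRegressor`, neither of which is used in the notebook. The data is synthetic (not the housing data) and all variable names are made up for illustration.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (NOT the housing data): features on different
# scales and a target in the hundreds of thousands, like a price.
rng = np.random.default_rng(0)
Z = rng.normal(size=(400, 3))
X_demo = Z * np.array([1.0, 100.0, 10000.0])
y_demo = 500_000 + 100_000 * (Z @ np.array([3.0, -2.0, 1.0]))
y_demo = y_demo + rng.normal(scale=10_000, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# Features are standardized in both models; only the target handling differs.
svr_raw_y = make_pipeline(StandardScaler(), SVR())
svr_scaled_y = TransformedTargetRegressor(
    regressor=make_pipeline(StandardScaler(), SVR()),
    transformer=StandardScaler(),  # fitted on y_tr, inverted inside predict()
)

r2_raw = svr_raw_y.fit(X_tr, y_tr).score(X_te, y_te)
r2_scaled = svr_scaled_y.fit(X_tr, y_tr).score(X_te, y_te)
print('raw y: {:.3f}, scaled y: {:.3f}'.format(r2_raw, r2_scaled))

# Predictions come back in the original units of y.
pred = svr_scaled_y.predict(X_te[:3])
print(pred)
```

Because the wrapper inverts the target scaling in `predict`, its `score` is computed against the unscaled y, so it can be compared directly with the other models.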


## Homework 1

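Using the house-price data from this lesson, find several regression models
on sklearn that we have not covered and try them out yourself. Use
train_test_split to split the data into a training set and a held-out test set,
give the complete fitting and prediction code, and report the R-squared on the test set.

For reference, see https://scikit-learn.org/stable/supervised_learning.html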