## Machine Learning: Model Selection, Training, and Tuning (Part 4)

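The previous post preprocessed the data: cleaning, feature combination, and categorical encoding, with a pipeline applying all the transformations in one pass.  
The next step is to choose a model and train it on the data.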

In [73]:
# Setup
# Basic imports
import numpy as np
import os

# Plotting
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split

# Ignore warnings
import warnings

# Directory where figures are saved
PROJECT_ROOT_DIR = '../'
CHAPTER_ID = 'end_to_end_project'
IMAGE_PATH = os.path.join(PROJECT_ROOT_DIR, "images", CHAPTER_ID)

def save_fig(fig_id, tight_layout=True, fig_extension='png', resolution=300):
    path = os.path.join(IMAGE_PATH, fig_id + "." + fig_extension)
    print("Saving figure:", fig_id)
    
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)


warnings.filterwarnings(action='ignore', module='scipy', message='internal')

# Load the data
HOUSING_PATH = os.path.join("../datasets", "housing")
def loading_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)

housing = loading_housing_data()

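# Build an income category for stratified sampling; every category above 5 is merged into 5
housing["income_cat"] = np.ceil(housing["median_income"] / 1.5)
housing["income_cat"] = housing["income_cat"].where(housing["income_cat"] < 5, 5.0)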

from sklearn.model_selection import StratifiedShuffleSplit
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42) 
for train_index, test_index in split.split(housing, housing["income_cat"]):
        strat_train_set = housing.loc[train_index]
        strat_test_set = housing.loc[test_index]

for set_ in (strat_train_set, strat_test_set):
    set_.drop(["income_cat"], axis=1, inplace=True)

In [74]:
# Copy the training set, dropping the target column so that predictors and labels are kept separately for later evaluation.
housing = strat_train_set.drop("median_house_value", axis=1)
housing_labels = strat_train_set["median_house_value"].copy()

In [75]:
from sklearn.base import BaseEstimator, TransformerMixin

# column index
rooms_ix, bedrooms_ix, population_ix, household_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room = True): # no *args or **kargs
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self  # nothing else to do
    def transform(self, X, y=None):
        rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
        population_per_household = X[:, population_ix] / X[:, household_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]

In [76]:
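# The encoder below takes a 2D array containing one or more categorical columns, so the input must be reshaped into that form.
# from sklearn.preprocessing import CategoricalEncoder  # expected to be added to scikit-learn later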

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils import check_array
from sklearn.preprocessing import LabelEncoder


from scipy import sparse

# The implementation details can be studied later
class CategoricalEncoder(BaseEstimator, TransformerMixin):
    """Encode categorical features as a numeric array.
    The input to this transformer should be a matrix of integers or strings,
    denoting the values taken on by categorical (discrete) features.
    The features can be encoded using a one-hot aka one-of-K scheme
    (``encoding='onehot'``, the default) or converted to ordinal integers
    (``encoding='ordinal'``).
    This encoding is needed for feeding categorical data to many scikit-learn
    estimators, notably linear models and SVMs with the standard kernels.
    Read more in the :ref:`User Guide <preprocessing_categorical_features>`.
    Parameters
    ----------
    encoding : str, 'onehot', 'onehot-dense' or 'ordinal'
        The type of encoding to use (default is 'onehot'):
        - 'onehot': encode the features using a one-hot aka one-of-K scheme
          (or also called 'dummy' encoding). This creates a binary column for
          each category and returns a sparse matrix.
        - 'onehot-dense': the same as 'onehot' but returns a dense array
          instead of a sparse matrix.
        - 'ordinal': encode the features as ordinal integers. This results in
          a single column of integers (0 to n_categories - 1) per feature.
    categories : 'auto' or a list of lists/arrays of values.
        Categories (unique values) per feature:
        - 'auto' : Determine categories automatically from the training data.
        - list : ``categories[i]`` holds the categories expected in the ith
          column. The passed categories are sorted before encoding the data
          (used categories can be found in the ``categories_`` attribute).
    dtype : number type, default np.float64
        Desired dtype of output.
    handle_unknown : 'error' (default) or 'ignore'
        Whether to raise an error or ignore if an unknown categorical feature is
        present during transform (default is to raise). When this parameter
        is set to 'ignore' and an unknown category is encountered during
        transform, the resulting one-hot encoded columns for this feature
        will be all zeros.
        Ignoring unknown categories is not supported for
        ``encoding='ordinal'``.
    Attributes
    ----------
    categories_ : list of arrays
        The categories of each feature determined during fitting. When
        categories were specified manually, this holds the sorted categories
        (in order corresponding with output of `transform`).
    Examples
    --------
    Given a dataset with three features and two samples, we let the encoder
    find the maximum value per feature and transform the data to a binary
    one-hot encoding.
    >>> from sklearn.preprocessing import CategoricalEncoder
    >>> enc = CategoricalEncoder(handle_unknown='ignore')
    >>> enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])
    ... # doctest: +ELLIPSIS
    CategoricalEncoder(categories='auto', dtype=<... 'numpy.float64'>,
              encoding='onehot', handle_unknown='ignore')
    >>> enc.transform([[0, 1, 1], [1, 0, 4]]).toarray()
    array([[ 1.,  0.,  0.,  1.,  0.,  0.,  1.,  0.,  0.],
           [ 0.,  1.,  1.,  0.,  0.,  0.,  0.,  0.,  0.]])
    See also
    --------
    sklearn.preprocessing.OneHotEncoder : performs a one-hot encoding of
      integer ordinal features. The ``OneHotEncoder`` assumes that input
      features take on values in the range ``[0, max(feature)]`` instead of
      using the unique values.
    sklearn.feature_extraction.DictVectorizer : performs a one-hot encoding of
      dictionary items (also handles string-valued features).
    sklearn.feature_extraction.FeatureHasher : performs an approximate one-hot
      encoding of dictionary items or strings.
    """

    def __init__(self, encoding='onehot', categories='auto', dtype=np.float64,
                 handle_unknown='error'):
        self.encoding = encoding
        self.categories = categories
        self.dtype = dtype
        self.handle_unknown = handle_unknown

    def fit(self, X, y=None):
        """Fit the CategoricalEncoder to X.
        Parameters
        ----------
        X : array-like, shape [n_samples, n_feature]
            The data to determine the categories of each feature.
        Returns
        -------
        self
        """

        if self.encoding not in ['onehot', 'onehot-dense', 'ordinal']:
            template = ("encoding should be either 'onehot', 'onehot-dense' "
                        "or 'ordinal', got %s")
            raise ValueError(template % self.encoding)

        if self.handle_unknown not in ['error', 'ignore']:
            template = ("handle_unknown should be either 'error' or "
                        "'ignore', got %s")
            raise ValueError(template % self.handle_unknown)

        if self.encoding == 'ordinal' and self.handle_unknown == 'ignore':
            raise ValueError("handle_unknown='ignore' is not supported for"
                             " encoding='ordinal'")

        X = check_array(X, dtype=object, accept_sparse='csc', copy=True)
        n_samples, n_features = X.shape

        self._label_encoders_ = [LabelEncoder() for _ in range(n_features)]

        for i in range(n_features):
            le = self._label_encoders_[i]
            Xi = X[:, i]
            if self.categories == 'auto':
                le.fit(Xi)
            else:
                valid_mask = np.in1d(Xi, self.categories[i])
                if not np.all(valid_mask):
                    if self.handle_unknown == 'error':
                        diff = np.unique(Xi[~valid_mask])
                        msg = ("Found unknown categories {0} in column {1}"
                               " during fit".format(diff, i))
                        raise ValueError(msg)
                le.classes_ = np.array(np.sort(self.categories[i]))

        self.categories_ = [le.classes_ for le in self._label_encoders_]

        return self

    def transform(self, X):
        """Transform X using one-hot encoding.
        Parameters
        ----------
        X : array-like, shape [n_samples, n_features]
            The data to encode.
        Returns
        -------
        X_out : sparse matrix or a 2-d array
            Transformed input.
        """
        X = check_array(X, accept_sparse='csc', dtype=object, copy=True)
        n_samples, n_features = X.shape
        X_int = np.zeros_like(X, dtype=int)
        X_mask = np.ones_like(X, dtype=bool)

        for i in range(n_features):
            valid_mask = np.in1d(X[:, i], self.categories_[i])

            if not np.all(valid_mask):
                if self.handle_unknown == 'error':
                    diff = np.unique(X[~valid_mask, i])
                    msg = ("Found unknown categories {0} in column {1}"
                           " during transform".format(diff, i))
                    raise ValueError(msg)
                else:
                    # Set the problematic rows to an acceptable value and
                    # continue. The rows are marked in `X_mask` and will be
                    # removed later.
                    X_mask[:, i] = valid_mask
                    X[:, i][~valid_mask] = self.categories_[i][0]
            X_int[:, i] = self._label_encoders_[i].transform(X[:, i])

        if self.encoding == 'ordinal':
            return X_int.astype(self.dtype, copy=False)

        mask = X_mask.ravel()
        n_values = [cats.shape[0] for cats in self.categories_]
        n_values = np.array([0] + n_values)
        indices = np.cumsum(n_values)

        column_indices = (X_int + indices[:-1]).ravel()[mask]
        row_indices = np.repeat(np.arange(n_samples, dtype=np.int32),
                                n_features)[mask]
        data = np.ones(n_samples * n_features)[mask]

        out = sparse.csc_matrix((data, (row_indices, column_indices)),
                                shape=(n_samples, indices[-1]),
                                dtype=self.dtype).tocsr()
        if self.encoding == 'onehot-dense':
            return out.toarray()
        else:
            return out



In [77]:
# Another transformer: select a subset of columns
from sklearn.base import BaseEstimator, TransformerMixin

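# As shown above, the dataset needs many transformations applied in a fixed order, so scikit-learn provides Pipeline to chain them.
from sklearn.pipeline import Pipeline

# Note: Imputer belongs to older scikit-learn releases; newer releases provide sklearn.impute.SimpleImputer instead.
from sklearn.preprocessing import Imputer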
from sklearn.preprocessing import StandardScaler


# Create a class to select numerical or categorical columns 
# since Scikit-Learn doesn't handle DataFrames yet
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values
    
# The numerical pipeline can only handle numeric values, so the categorical column is dropped here.
housing_num = housing.drop("ocean_proximity", axis=1)
num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]
num_pipeline = Pipeline([
    ('selector', DataFrameSelector(num_attribs)),
    ('imputer', Imputer(strategy="median")),
    ('attr_adder', CombinedAttributesAdder()),
    ('std_scaler', StandardScaler())
])
cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(cat_attribs)),
    ("cat_encoder", CategoricalEncoder(encoding='onehot-dense'))
])
from sklearn.pipeline import FeatureUnion
full_pipeline = FeatureUnion(transformer_list=[
    ("num_pipeline", num_pipeline),
    ("cat_pipeline", cat_pipeline),
])

In [78]:
# Apply the combined pipeline to transform the numerical and categorical attributes together
housing_prepare = full_pipeline.fit_transform(housing)
housing_prepare.shape

(16512, 16)

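The code above covers all the preparation from the previous three posts. The resulting housing_prepare array is the training data fed to the algorithms.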

### Select and Train a Model

With the training set prepared, a model can now be selected, trained, and evaluated.

In [90]:
# Linear model

from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(housing_prepare, housing_labels)

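# Training is done; check the predictions on a few training instances.
# Transform a small sample of the training data with the pipeline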
some_data = housing.iloc[:5]
some_labels = housing_labels.iloc[:5]
some_data_prepare = full_pipeline.transform(some_data)

print("Predictions:", lin_reg.predict(some_data_prepare))  # predicted values
# Compare with the actual values
print("Labels:", list(some_labels))

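# The predictions above are not very accurate. Next, compute the RMSE between the true values and this regression model's predictions on the whole training set.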

from sklearn.metrics import mean_squared_error
housing_predictions = lin_reg.predict(housing_prepare)
lin_mse = mean_squared_error(housing_labels, housing_predictions)
lin_rmse = np.sqrt(lin_mse)
lin_rmse  

Predictions: [210644.60459286 317768.80697211 210956.43331178  59218.98886849
 189747.55849879]
Labels: [286600.0, 340600.0, 196900.0, 46300.0, 254500.0]


68628.19819848923

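As the comparison above suggests, an RMSE of about 68,628 is too large for the predictions to be useful: the model is underfitting the training data.  
Underfitting means either that the features do not carry enough information to make good predictions, or that the model is not powerful enough.  
The main remedies are to select a more powerful model, to feed the algorithm better features, or to reduce the constraints on the model. Since this linear model is not regularized, the last option does not apply here.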
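
For reference, the RMSE used throughout this post is the square root of the mean squared difference between predictions and labels. The sketch below computes it by hand for the five sample predictions printed above. The helper name rmse is an illustrative assumption, not part of the original notebook.

```python
import math

# The five sample predictions and labels printed by the linear model cell
predictions = [210644.60459286, 317768.80697211, 210956.43331178,
               59218.98886849, 189747.55849879]
labels = [286600.0, 340600.0, 196900.0, 46300.0, 254500.0]


def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    squared_errors = [(p - t) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


sample_rmse = rmse(labels, predictions)
print("RMSE on the five samples:", sample_rmse)
```

Because it uses only five instances, this number is not expected to match the RMSE computed on the whole training set.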


In [93]:
from sklearn.tree import DecisionTreeRegressor 
tree_reg = DecisionTreeRegressor()
tree_reg.fit(housing_prepare, housing_labels)

housing_p = tree_reg.predict(housing_prepare)
tree_mse = mean_squared_error(housing_labels, housing_p)
tree_rmse = np.sqrt(tree_mse)
tree_rmse  # zero error: every training sample is predicted exactly, a sign of overfitting

0.0

In [87]:
housing_predictions

array([210644.60459286, 317768.80697211, 210956.43331178, ...,
        95464.57062437, 214353.22541713, 276426.4692067 ])

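A training RMSE of 0.0 means the decision tree has fit the training data perfectly, which almost certainly indicates overfitting.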

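### Better Evaluation Using Cross-Validation

One way to evaluate the decision tree model is to split the training set into a smaller training set and a validation set.    
The code below uses K-fold cross-validation instead: it splits the training set into 10 subsets (folds), trains on 9 of them, and validates on the remaining one.  
This is repeated 10 times, each time with a different fold held out, and the result is an array of 10 evaluation scores.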
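
To make the procedure concrete, here is a minimal sketch of the K-fold loop written out by hand. It runs on synthetic data rather than the housing set, and the choice of shuffle=True and the variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Synthetic regression data (a stand-in for the prepared housing data)
rng = np.random.RandomState(42)
X = rng.rand(200, 3)
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.randn(200)

kfold = KFold(n_splits=10, shuffle=True, random_state=42)
fold_rmse = []
for train_idx, val_idx in kfold.split(X):
    model = LinearRegression()
    model.fit(X[train_idx], y[train_idx])      # train on 9 folds
    preds = model.predict(X[val_idx])          # validate on the held-out fold
    fold_rmse.append(np.sqrt(mean_squared_error(y[val_idx], preds)))

fold_rmse = np.array(fold_rmse)
print("Mean:", fold_rmse.mean(), "Std:", fold_rmse.std())
```

Each instance is used for validation exactly once, which is why the ten scores together give both a performance estimate and a spread.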

In [96]:
from sklearn.model_selection import cross_val_score  # cross-validation
scores = cross_val_score(tree_reg, housing_prepare, housing_labels, 
                       scoring="neg_mean_squared_error", cv=10)

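Scikit-learn's cross-validation expects a utility function (greater is better) rather than a cost function (lower is better).  
The scoring function is therefore the negative of the MSE, so the scores must be negated before taking the square root to obtain RMSE values.  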

In [98]:
tree_rmse_scores = np.sqrt(- scores)
def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviations:", scores.std())
display_scores(tree_rmse_scores)

Scores: [69226.17632204 67694.79874797 70836.47151251 69944.84306415
 71105.57199113 74536.65163847 69698.25274811 70014.21947239
 76485.28736039 71113.75905971]
Mean: 71065.60319168642
Standard deviations: 2458.750570067908


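The mean RMSE (about 71,066) is even higher than the linear model's training RMSE (about 68,628), which suggests that the decision tree is doing no better than the linear model.

Cross-validation yields not only an estimate of the model's performance but also a measure of how precise that estimate is (its standard deviation): roughly 71,066 ± 2,459 here.  
A single validation set would not provide this information. The price is that the model must be trained several times, which is not always affordable.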

In [99]:
# Cross-validate the linear regression model
lin_scores = cross_val_score(lin_reg, housing_prepare, housing_labels, 
                            scoring="neg_mean_squared_error", cv=10)
lin_rmse_scores = np.sqrt(-lin_scores)
display_scores(lin_rmse_scores)

Scores: [66782.73843989 66960.118071   70347.95244419 74739.57052552
 68031.13388938 71193.84183426 64969.63056405 68281.61137997
 71552.91566558 67665.10082067]
Mean: 69052.46136345083
Standard deviations: 2731.6740017983466


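The comparison confirms that the decision tree is overfitting: its cross-validation RMSE (about 71,066) is worse than that of the linear regression model (about 69,052).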

In [100]:
# Try a more powerful model: random forest
from sklearn.ensemble import RandomForestRegressor
forest_reg = RandomForestRegressor()
forest_reg.fit(housing_prepare, housing_labels)

housing_predictions = forest_reg.predict(housing_prepare)
forest_mse = mean_squared_error(housing_labels, housing_predictions)
forest_rmse = np.sqrt(forest_mse)
forest_rmse

21899.2507207247

In [101]:
forest_scores = cross_val_score(forest_reg, housing_prepare, housing_labels, 
                               scoring="neg_mean_squared_error", cv=10)
forest_rmse_scores = np.sqrt(-forest_scores)
display_scores(forest_rmse_scores)  # this takes a while

Scores: [51964.41327624 50609.19017883 52015.56310418 54915.30834399
 51823.54440092 55504.33141367 51933.55364057 50394.88863904
 55194.99947022 51783.09626233]
Mean: 52613.888872999676
Standard deviations: 1783.6375618811246


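Random forest performs much better. Note, however, that the cross-validation RMSE is still far higher than the RMSE on the training set (about 52,614 versus 21,899), so the model is still overfitting the training set.  
The usual remedies are to simplify the model, constrain it (regularization), or get more training data.  
Beyond that, other models are worth trying, such as SVMs with different kernels or neural networks.  
At this stage there is no need to spend much time tuning hyperparameters; the goal is to shortlist two to five promising models.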
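
As a sketch of how such a shortlist might be built, the snippet below cross-validates support vector regressors with two kernels. It uses synthetic data, and the value C=10.0 and the dictionary name results are assumptions chosen for illustration, not tuned settings.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

# Synthetic regression data (a stand-in for the prepared housing data)
rng = np.random.RandomState(0)
X = rng.rand(150, 2)
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.05 * rng.randn(150)

results = {}
for kernel in ("linear", "rbf"):
    svr = SVR(kernel=kernel, C=10.0)
    scores = cross_val_score(svr, X, y,
                             scoring="neg_mean_squared_error", cv=5)
    results[kernel] = np.sqrt(-scores).mean()  # mean RMSE over the 5 folds

for kernel, mean_rmse in results.items():
    print(kernel, mean_rmse)
```

The same loop works for any estimator, so candidate models can be compared on equal terms before any serious tuning.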

### A Side Note

In [104]:
# A full pipeline can include both data preparation and prediction
full_pipeline_with_predictor = Pipeline([
    ("preparation", full_pipeline), 
    ("linear", LinearRegression()),
])
full_pipeline_with_predictor.fit(housing, housing_labels)
full_pipeline_with_predictor.predict(some_data)

array([210644.60459286, 317768.80697211, 210956.43331178,  59218.98886849,
       189747.55849879])

At this point the model should be persisted so that it can be loaded later for further tuning.

In [105]:
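# Persist the trained model with joblib
my_model = full_pipeline_with_predictor
# Note: sklearn.externals.joblib belongs to older scikit-learn releases; with newer releases, use "import joblib" instead.
from sklearn.externals import joblib
joblib.dump(my_model, "my_model.pkl")
# Output: ['my_model.pkl']
# Load it back
my_model = joblib.load('my_model.pkl')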

In [106]:
my_model

Pipeline(memory=None,
     steps=[('preparation', FeatureUnion(n_jobs=1,
       transformer_list=[('num_pipeline', Pipeline(memory=None,
     steps=[('selector', DataFrameSelector(attribute_names=['longitude', 'latitude', 'housing_median_age', 'total_rooms', 'total_bedrooms', 'population', 'households', 'median_income'])), ('...ts=None)), ('linear', LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False))])

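### Fine-Tune the Model

One option is to tune the hyperparameters by hand until a good combination turns up. This is tedious work, and there is rarely enough time to explore many combinations.

A better option is scikit-learn's GridSearchCV. You specify which hyperparameters to experiment with and which values to try, and it evaluates every possible combination of those values using cross-validation.  
The following code runs a grid search for the random forest.  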


### GridSearchCV

In [107]:
from sklearn.model_selection import GridSearchCV
param_grid = [
    # try 3*4=12 combinations of hyperparameters
    {'n_estimators':[3, 10,  30], 'max_features': [2, 4, 6, 8]},
    # then try 2*3=6 combinations with bootstrap set to False
    {'bootstrap': [False], 'n_estimators':[3, 10], 'max_features':[2, 3, 4]}
]
forest_reg = RandomForestRegressor(random_state=42)
# train across 5 folds, that's a total of (12+6)*5=90 rounds of training 
# As stated above, 5-fold cross-validation is used here.
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring='neg_mean_squared_error', return_train_score=True)

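The grid search above exhaustively tries combinations of the random forest hyperparameters n_estimators, max_features, and bootstrap.  
In total it explores 12 + 6 = 18 combinations of hyperparameter values.  
Each combination is trained 5 times, because 5-fold cross-validation is used.  
That makes 18 × 5 = 90 rounds of training. It may take a long time, but when it finishes you get the best combination of parameters.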
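
The count can be checked by expanding the grid by hand. The sketch below uses only the standard library; the helper expand_grid is a hypothetical name introduced here (scikit-learn's ParameterGrid performs a similar expansion).

```python
from itertools import product

param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]


def expand_grid(grid):
    """Yield every hyperparameter combination described by a list of sub-grids."""
    for sub_grid in grid:
        names = list(sub_grid)
        # Cartesian product of the value lists within one sub-grid
        for values in product(*(sub_grid[name] for name in names)):
            yield dict(zip(names, values))


combinations = list(expand_grid(param_grid))
n_folds = 5
print(len(combinations))            # number of candidate settings
print(len(combinations) * n_folds)  # number of training rounds
```

Sub-grids are expanded independently, which is why the totals add (12 + 6) rather than multiply.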

In [109]:
# This step is time-consuming, so for demonstration only part of the training data is used.
housing_prepare_small = housing_prepare[:3000]
housing_labels_small = housing_labels[:3000]

In [110]:
grid_search.fit(housing_prepare_small, housing_labels_small)

GridSearchCV(cv=5, error_score='raise',
       estimator=RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
           oob_score=False, random_state=42, verbose=0, warm_start=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid=[{'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]}, {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring='neg_mean_squared_error', verbose=0)

In [114]:
# 搜索结束
grid_search.best_params_  # best hyperparameter combination; 30 is the largest n_estimators value tried, so larger values are worth trying

{'max_features': 6, 'n_estimators': 30}

In [113]:
grid_search.best_estimator_  # the best model can be retrieved directly

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features=6, max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=30, n_jobs=1, oob_score=False, random_state=42,
           verbose=0, warm_start=False)

In [116]:
# The score of every combination is also available
cvres = grid_search.cv_results_

for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)

71132.8432432034 {'max_features': 2, 'n_estimators': 3}
59984.47727852012 {'max_features': 2, 'n_estimators': 10}
56972.051178962836 {'max_features': 2, 'n_estimators': 30}
65379.00225452585 {'max_features': 4, 'n_estimators': 3}
57368.943967545856 {'max_features': 4, 'n_estimators': 10}
54526.57612637526 {'max_features': 4, 'n_estimators': 30}
62833.71167680876 {'max_features': 6, 'n_estimators': 3}
56202.50853540545 {'max_features': 6, 'n_estimators': 10}
54042.95646340654 {'max_features': 6, 'n_estimators': 30}
63650.99752919532 {'max_features': 8, 'n_estimators': 3}
56587.24638469238 {'max_features': 8, 'n_estimators': 10}
54103.90905116381 {'max_features': 8, 'n_estimators': 30}
67387.35090354811 {'bootstrap': False, 'max_features': 2, 'n_estimators': 3}
58801.14857602752 {'bootstrap': False, 'max_features': 2, 'n_estimators': 10}
64385.90097183371 {'bootstrap': False, 'max_features': 3, 'n_estimators': 3}
57651.480112536025 {'bootstrap': False, 'max_features': 3, 'n_estimators': 

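The best combination is max_features=6, n_estimators=30, with an RMSE of about 54,043. Because this search ran on only the first 3,000 training samples, the score is not directly comparable with the earlier cross-validation result on the full training set. Within the grid, though, the best combination clearly beats the weaker ones, which is what this fine-tuning step is meant to achieve.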

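Do not forget that some data preparation steps can be treated as hyperparameters.  
For example, the grid search can automatically find out whether to add a feature you are unsure about (such as the add_bedrooms_per_room hyperparameter of the CombinedAttributesAdder transformer).  
It can similarly be used to find the best way to handle outliers, missing features, feature selection, and more.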
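
A minimal sketch of this idea follows. It searches over a preprocessing switch by placing the transformer inside a Pipeline and addressing its parameter with the step__parameter naming convention. The RatioAdder class, the step names, and the synthetic data are all hypothetical stand-ins for the housing pipeline.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


class RatioAdder(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: optionally appends column 0 / column 1."""

    def __init__(self, add_ratio=True):
        self.add_ratio = add_ratio

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        if self.add_ratio:
            return np.c_[X, X[:, 0] / X[:, 1]]
        return X


# Synthetic data whose target really is the ratio of the two columns
rng = np.random.RandomState(42)
X = rng.uniform(1.0, 2.0, size=(200, 2))
y = X[:, 0] / X[:, 1] + 0.01 * rng.randn(200)

pipe = Pipeline([("adder", RatioAdder()), ("lin", LinearRegression())])
search = GridSearchCV(pipe, {"adder__add_ratio": [True, False]},
                      scoring="neg_mean_squared_error", cv=5)
search.fit(X, y)
print(search.best_params_)
```

On this toy data the ratio feature carries the signal, so the search should prefer adding it; on real data the answer is exactly what the search is there to discover.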

### Randomized Search

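Grid search works well when there are relatively few combinations to explore. When the range of a hyperparameter is unknown or the search space is large, randomized search is often preferable.  
Instead of trying every combination, it evaluates a given number of random combinations, selecting a random value for each hyperparameter at every iteration. This has two benefits:  
1. With a fixed number of iterations, it explores that many different values for each hyperparameter, rather than only the few values listed in a grid.  
2. By setting the number of iterations, you control the computing budget allocated to the hyperparameter search.  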
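
To see what candidates a randomized search would draw, the sketch below samples ten settings with scikit-learn's ParameterSampler. The distributions mirror the ones used in this post; the variable name samples is an assumption. Note that scipy's randint excludes the upper bound, so max_features ranges from 1 to 7 here.

```python
from scipy.stats import randint
from sklearn.model_selection import ParameterSampler

param_distribs = {
    "n_estimators": randint(low=1, high=200),  # integers 1..199
    "max_features": randint(low=1, high=8),    # integers 1..7
}

# Draw 10 random hyperparameter settings, reproducibly
samples = list(ParameterSampler(param_distribs, n_iter=10, random_state=42))
for params in samples:
    print(params)
```

The number of draws, not the size of the search space, determines the cost of the search.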

In [123]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

param_distribs = {
    "n_estimators": randint(low=1, high=200),
    "max_features": randint(low=1, high=8),
}
forest_reg = RandomForestRegressor(random_state=42)
rnd_search = RandomizedSearchCV(forest_reg, param_distributions=param_distribs, 
                               n_iter=10, cv=5, scoring="neg_mean_squared_error", random_state=42)


In [124]:
rnd_search.fit(housing_prepare_small, housing_labels_small)

RandomizedSearchCV(cv=5, error_score='raise',
          estimator=RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
           oob_score=False, random_state=42, verbose=0, warm_start=False),
          fit_params=None, iid=True, n_iter=10, n_jobs=1,
          param_distributions={'n_estimators': <scipy.stats._distn_infrastructure.rv_frozen object at 0x1a1cef1668>, 'max_features': <scipy.stats._distn_infrastructure.rv_frozen object at 0x1a1cef1be0>},
          pre_dispatch='2*n_jobs', random_state=42, refit=True,
          return_train_score='warn', scoring='neg_mean_squared_error',
          verbose=0)

In [125]:
# The randomized search has finished; the results are shown below:
cvres = rnd_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)

53513.76883786206 {'max_features': 7, 'n_estimators': 180}
55461.56488400293 {'max_features': 5, 'n_estimators': 15}
54868.06214585608 {'max_features': 3, 'n_estimators': 72}
55011.94093003847 {'max_features': 5, 'n_estimators': 21}
53569.799862039145 {'max_features': 7, 'n_estimators': 122}
54864.033006859674 {'max_features': 3, 'n_estimators': 75}
54672.69350280115 {'max_features': 3, 'n_estimators': 88}
53702.08699745629 {'max_features': 5, 'n_estimators': 100}
54455.099383984176 {'max_features': 3, 'n_estimators': 150}
72608.486575064 {'max_features': 5, 'n_estimators': 2}


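The best combination found by the randomized search is max_features=7, n_estimators=180 (RMSE of about 53,514). One practical approach is to use randomized search to find a rough range first, then refine it with grid search.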

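### Ensemble Methods

Another way to fine-tune the system is to combine the models that perform best.  
The group (ensemble) will often perform better than the best individual model, just as a random forest performs better than the individual decision trees it relies on.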
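
One simple way to combine regressors is to average their predictions. The sketch below does this for a linear model and a random forest on synthetic data; plain averaging is only one possible combination scheme, and the data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data with a linear part and a nonlinear part
rng = np.random.RandomState(42)
X = rng.rand(400, 3)
y = 2.0 * X[:, 0] + np.sin(6.0 * X[:, 1]) + 0.1 * rng.randn(400)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=42)

models = [LinearRegression(),
          RandomForestRegressor(n_estimators=30, random_state=42)]
preds = []
for model in models:
    model.fit(X_train, y_train)
    preds.append(model.predict(X_val))

# Ensemble prediction: unweighted mean of the individual predictions
ensemble_pred = np.mean(preds, axis=0)

rmses = [np.sqrt(mean_squared_error(y_val, p)) for p in preds]
ensemble_rmse = np.sqrt(mean_squared_error(y_val, ensemble_pred))
print("individual:", rmses, "ensemble:", ensemble_rmse)
```

An averaged prediction is never worse than the worst individual model, but whether it beats the best one depends on how different the models' errors are.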

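#### Analyze the Models and Their Errors
Inspecting the best models often yields useful insights. For example, RandomForestRegressor can indicate the relative importance of each attribute for making accurate predictions.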


In [119]:
feature_importances = grid_search.best_estimator_.feature_importances_

In [120]:
feature_importances

array([0.06944296, 0.05731171, 0.04242502, 0.02077279, 0.01813785,
       0.02092872, 0.0184342 , 0.32902876, 0.04881488, 0.10831886,
       0.09421353, 0.01132378, 0.15255042, 0.        , 0.00313905,
       0.00515749])

In [122]:

from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
housing_cat = housing["ocean_proximity"]
housing_cat_encoded = encoder.fit_transform(housing_cat) 

extra_attribs = ["rooms_per_hhold", "pop_per_hhold", "bedrooms_per_room"]
cat_one_hot_attribs = list(encoder.classes_)
attributes = num_attribs + extra_attribs + cat_one_hot_attribs
sorted(zip(feature_importances, attributes), reverse=True)

[(0.32902875749672267, 'median_income'),
 (0.15255041700055916, 'INLAND'),
 (0.10831885804623398, 'pop_per_hhold'),
 (0.09421352604555039, 'bedrooms_per_room'),
 (0.06944295804995118, 'longitude'),
 (0.05731170766962609, 'latitude'),
 (0.048814879762369236, 'rooms_per_hhold'),
 (0.0424250223890338, 'housing_median_age'),
 (0.020928720432547652, 'population'),
 (0.020772789007452108, 'total_rooms'),
 (0.018434198917854132, 'households'),
 (0.01813785342906769, 'total_bedrooms'),
 (0.011323775759722297, '<1H OCEAN'),
 (0.005157489087866792, 'NEAR OCEAN'),
 (0.0031390469054428756, 'NEAR BAY'),
 (0.0, 'ISLAND')]

This shows the importance of each feature. With this information, you may want to drop some of the less useful features and train again.  
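
As a sketch of that follow-up step, the snippet below keeps only the k most important columns, using the importance values printed above. The cutoff k = 5 and the placeholder matrix X_prepared are assumptions for illustration; in practice the prepared training array would be sliced instead.

```python
import numpy as np

# Importances and attribute names as printed above, in column order
feature_importances = np.array([
    0.06944296, 0.05731171, 0.04242502, 0.02077279, 0.01813785,
    0.02092872, 0.0184342, 0.32902876, 0.04881488, 0.10831886,
    0.09421353, 0.01132378, 0.15255042, 0.0, 0.00313905, 0.00515749])
attributes = ['longitude', 'latitude', 'housing_median_age', 'total_rooms',
              'total_bedrooms', 'population', 'households', 'median_income',
              'rooms_per_hhold', 'pop_per_hhold', 'bedrooms_per_room',
              '<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN']

k = 5  # hypothetical cutoff
# Indices of the k largest importances, restored to column order
top_k_idx = np.sort(np.argsort(feature_importances)[-k:])
top_k_names = [attributes[i] for i in top_k_idx]
print(top_k_names)

# Placeholder for a prepared feature matrix with 16 columns
X_prepared = np.zeros((3, len(attributes)))
X_reduced = X_prepared[:, top_k_idx]
print(X_reduced.shape)
```

Whether dropping features actually helps should be checked with cross-validation rather than assumed.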

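### Evaluate the Model on the Test Set
After tuning the model, it is time to evaluate the final model on the test set that was set aside at the start.  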

In [126]:
final_model = grid_search.best_estimator_
X_test = strat_test_set.drop("median_house_value", axis=1)
y_test = strat_test_set["median_house_value"].copy()

X_test_prepared = full_pipeline.transform(X_test)
final_predictions = final_model.predict(X_test_prepared)

final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)

final_rmse  # RMSE of the final model on the test set

53813.45378519866

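After extensive hyperparameter tuning, performance on the test set is usually slightly worse than what cross-validation measured, because the system ends up fine-tuned to perform well on the validation data and will likely not perform as well on unknown datasets.

The last step is to deploy the model on real data and maintain it over time.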
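
A single test RMSE does not say how precise the estimate is. One common way to get a rough idea, not used in the original notebook, is a 95% confidence interval on the mean squared error via the t-distribution. The sketch below uses synthetic stand-ins for y_test and final_predictions, which are assumptions for illustration.

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for the test labels and the model's predictions
rng = np.random.RandomState(42)
y_test = rng.uniform(50000, 500000, size=1000)
final_predictions = y_test + rng.normal(0, 50000, size=1000)

confidence = 0.95
squared_errors = (final_predictions - y_test) ** 2

# t-interval on the mean squared error, then square root to get RMSE bounds
mse_interval = stats.t.interval(confidence, len(squared_errors) - 1,
                                loc=squared_errors.mean(),
                                scale=stats.sem(squared_errors))
rmse_interval = np.sqrt(mse_interval)
point_rmse = np.sqrt(squared_errors.mean())
print("RMSE:", point_rmse, "95% interval:", rmse_interval)
```

A wide interval is a hint that a difference of a few hundred between two RMSE values may not be meaningful.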


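## Summary

These four posts gave a brief tour of the main steps in a machine learning project, including but not limited to:
1. Get the data and take a quick look at it.
2. Compute basic statistics and visualize the data.
3. Clean the data, combine features, and visualize the results.
4. Scale features, encode categories, and prepare the data.
5. Prefer a pipeline that chains the transformers so the data is processed in order, ending with a final_model that can make predictions.
6. Train several models and save the two to five most promising ones for later tuning.
7. Compare the models with cross-validation, then use random search and grid search to find better hyperparameters.
8. Evaluate the best model on the test set.

A later post will revisit the general steps of a machine learning project in somewhat more detail.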