Summary: The avocado price data has been prepared, so we can start training models.

1. First, set up the matplotlib plotting environment.

In [31]:
# To support both python 2 and python 3
from __future__ import division, print_function, unicode_literals

# Common imports
import numpy as np
import os

# to make this notebook's output stable across runs
np.random.seed(42)

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

# Ignore useless warnings (see SciPy issue #5998)
import warnings
warnings.filterwarnings(action="ignore", message="^internal gelsd")

2. Load the avocado price data.

In [32]:
import pandas as pd

AVOCADO_PATH = os.path.join("datasets", "avocado")

def load_avocado_data(avocado_path=AVOCADO_PATH):
    csv_path = os.path.join(avocado_path, "avocado.csv")
    return pd.read_csv(csv_path)

avocado = load_avocado_data()
avocado.head()

Unnamed: 0.1,Unnamed: 0,Date,AveragePrice,Total Volume,4046,4225,4770,Total Bags,Small Bags,Large Bags,XLarge Bags,type,year,region
0,0,2015-12-27,1.33,64236.62,1036.74,54454.85,48.16,8696.87,8603.62,93.25,0.0,conventional,2015,Albany
1,1,2015-12-20,1.35,54876.98,674.28,44638.81,58.33,9505.56,9408.07,97.49,0.0,conventional,2015,Albany
2,2,2015-12-13,0.93,118220.22,794.7,109149.67,130.5,8145.35,8042.21,103.14,0.0,conventional,2015,Albany
3,3,2015-12-06,1.08,78992.15,1132.0,71976.41,72.58,5811.16,5677.4,133.76,0.0,conventional,2015,Albany
4,4,2015-11-29,1.28,51039.6,941.48,43838.39,75.78,6183.95,5986.26,197.69,0.0,conventional,2015,Albany


3. Drop the row-number column ("Unnamed: 0"); it is not a feature.

In [33]:
avocado.drop("Unnamed: 0", axis=1, inplace=True)

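4. Split the dataset. The earlier analysis indicated that total volume (Total Volume) has the largest influence on price, so before splitting we bin Total Volume into categories, aiming for bins with reasonably comparable sample counts.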

In [34]:
avocado["Total Volume Cat"] = pd.cut(avocado["Total Volume"],
                               bins=[1., 10000.0, 100000.0, 500000.0, 850000.0, np.inf],
                               labels=[1, 2, 3, 4, 5])
avocado["Total Volume Cat"].value_counts()

3    5159
2    4649
1    4299
5    2733
4    1409
Name: Total Volume Cat, dtype: int64
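
The hand-picked bin edges above give uneven counts (from 1409 to 5159 samples per bin). If equal-sized bins are the goal, quantile-based binning is one option. The following is a minimal sketch on synthetic, skewed data (the lognormal values are an assumption, not the avocado data), using pd.qcut in place of pd.cut:

```python
import numpy as np
import pandas as pd

# Synthetic, heavily skewed "volume" values (an assumption, not the avocado data).
rng = np.random.RandomState(42)
volume = pd.Series(rng.lognormal(mean=10.0, sigma=2.0, size=1000))

# pd.qcut places the bin edges at quantiles, so each bin receives
# (nearly) the same number of samples.
volume_cat = pd.qcut(volume, q=5, labels=[1, 2, 3, 4, 5])
counts = volume_cat.value_counts().sort_index()
print(counts)
```

With pd.cut the edges are fixed and the counts follow the data; with pd.qcut the counts are fixed and the edges follow the data.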

5. Now split the dataset, stratifying on the "Total Volume Cat" feature.

In [35]:
from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(avocado, avocado["Total Volume Cat"]):
    strat_train_set = avocado.loc[train_index]
    strat_test_set = avocado.loc[test_index]
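
One way to confirm that a stratified split did its job is to compare the category proportions in the test set with those in the full data. The sketch below uses a small synthetic frame (the column names "value" and "cat" are hypothetical) rather than the avocado data:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic frame with an imbalanced category column (hypothetical data).
df = pd.DataFrame({
    "value": np.arange(1000, dtype=float),
    "cat": np.repeat([1, 2, 3, 4, 5], [300, 250, 250, 150, 50]),
})

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(split.split(df, df["cat"]))
test_set = df.iloc[test_idx]

# Compare category proportions in the full data and in the test set.
overall = df["cat"].value_counts(normalize=True).sort_index()
in_test = test_set["cat"].value_counts(normalize=True).sort_index()
comparison = pd.DataFrame({"overall": overall, "test": in_test})
comparison["abs_diff"] = (comparison["overall"] - comparison["test"]).abs()
print(comparison)
```

The same comparison can be run on the avocado frame by substituting avocado["Total Volume Cat"] for the synthetic column.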

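6. Use scikit-learn pipelines to combine the preprocessing steps. Unlike the housing price example, no combined attributes are added and there is no data-cleaning step (because there are no missing values).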

In [36]:
for set_ in (strat_train_set, strat_test_set):
    set_.drop("Total Volume Cat", axis=1, inplace=True)
    set_.drop("Date", axis=1, inplace=True)

In [37]:
avocado_data = strat_train_set.drop("AveragePrice", axis=1) # drop labels for training set
avocado_labels = strat_train_set["AveragePrice"].copy()
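
The statement that the data has no missing values can be checked directly before leaving out an imputation step. A minimal sketch, using a tiny hypothetical frame that deliberately contains one missing value:

```python
import numpy as np
import pandas as pd

# Small hypothetical frame with one missing value, for illustration only.
df = pd.DataFrame({
    "AveragePrice": [1.33, 1.35, np.nan, 1.08],
    "Total Volume": [64236.62, 54876.98, 118220.22, 78992.15],
})

# Count missing values per column; all zeros means no imputation is needed.
missing_per_column = df.isnull().sum()
print(missing_per_column)
print("Any missing values:", bool(df.isnull().values.any()))
```

Running the same two calls on the loaded avocado frame would show whether skipping the imputer is safe.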

In [38]:
from sklearn.impute import SimpleImputer 
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# 'type' and 'region' are text columns, so drop them to get the numeric columns
avocado_num = avocado_data.drop(['type', 'region'], axis=1)

# Numeric pipeline: only standardization is applied here (no cleaning, no added attributes)
num_pipeline = Pipeline([
        ('std_scaler', StandardScaler()),
    ])

num_attribs = list(avocado_num)
cat_attribs = ["type", "region"]

full_pipeline = ColumnTransformer([
        ("num", num_pipeline, num_attribs),
        ("cat", OneHotEncoder(), cat_attribs),
    ])

avocado_prepared = full_pipeline.fit_transform(avocado_data)
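
For reference, the standardization step in the numeric pipeline amounts to a per-column z-score. The sketch below reproduces it with plain NumPy on a small hypothetical matrix; it uses the population standard deviation (ddof=0), which is the form StandardScaler uses:

```python
import numpy as np

# Hypothetical numeric feature matrix: two columns on very different scales.
X = np.array([[64236.62, 2015.0],
              [54876.98, 2016.0],
              [118220.22, 2017.0],
              [78992.15, 2018.0]])

# Standardization: subtract the column mean, then divide by the column
# standard deviation (population form, ddof=0).
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_scaled.mean(axis=0))  # close to 0 for each column
print(X_scaled.std(axis=0))   # close to 1 for each column
```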

In [39]:
avocado_prepared.shape

(14599, 65)

In [40]:
avocado_prepared

<14599x65 sparse matrix of type '<class 'numpy.float64'>'
	with 160589 stored elements in Compressed Sparse Row format>

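The prepared training set has 65 attributes: the 9 original numeric attributes, 2 one-hot columns for type, and 54 one-hot columns for region (9 + 2 + 54 = 65).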
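
The column count can be reasoned about as "number of numeric columns plus number of distinct values in each categorical column". A minimal sketch on a tiny hypothetical frame (2 numeric columns, 2 types, 3 regions), not the avocado data:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny hypothetical frame: 2 numeric columns, 2 types, 3 regions.
df = pd.DataFrame({
    "Total Volume": [100.0, 200.0, 300.0, 400.0, 500.0, 600.0],
    "year": [2015.0, 2015.0, 2016.0, 2016.0, 2017.0, 2017.0],
    "type": ["conventional", "organic"] * 3,
    "region": ["Albany", "Boston", "Chicago"] * 2,
})

pipeline = ColumnTransformer([
    ("num", StandardScaler(), ["Total Volume", "year"]),
    ("cat", OneHotEncoder(), ["type", "region"]),
])
prepared = pipeline.fit_transform(df)

# Expected width: 2 numeric + 2 type columns + 3 region columns = 7.
n_type = df["type"].nunique()
n_region = df["region"].nunique()
expected_width = 2 + n_type + n_region
print(prepared.shape, expected_width)
```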

6. Train the model.

In [43]:
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(avocado_prepared, avocado_labels)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [44]:
some_data = avocado_data.iloc[:5]
some_labels = avocado_labels.iloc[:5]
some_data_prepared = full_pipeline.transform(some_data)

print("Predictions:", lin_reg.predict(some_data_prepared))
print("Labels:", list(some_labels))

Predictions: [1.41637713 1.76002455 1.41644597 1.0753166  1.59837264]
Labels: [1.42, 1.86, 1.36, 1.17, 1.29]


The predictions look acceptable, although the last one is off by about 0.31 US dollars.

7. Evaluate the error (RMSE).

In [45]:
from sklearn.metrics import mean_squared_error

avocado_predictions = lin_reg.predict(avocado_prepared)
lin_mse = mean_squared_error(avocado_labels, avocado_predictions)
lin_rmse = np.sqrt(lin_mse)
lin_rmse

0.2676004917633777

The RMSE on the training set is about 0.27 US dollars.
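
To make the metric concrete, the RMSE can be computed by hand for the five predictions and labels printed in the previous step. This is only a sketch of the formula on those five samples (where it comes out at roughly 0.15), not the full training-set figure:

```python
import numpy as np

# The five predictions and labels printed in the previous step.
predictions = np.array([1.41637713, 1.76002455, 1.41644597, 1.0753166, 1.59837264])
labels = np.array([1.42, 1.86, 1.36, 1.17, 1.29])

# RMSE = sqrt(mean((prediction - label)^2))
errors = predictions - labels
rmse = np.sqrt(np.mean(errors ** 2))
print("Per-sample errors:", errors)
print("RMSE on these five samples:", rmse)
```

Because the errors are squared before averaging, the single large miss on the last sample dominates the result.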

8. Split the dataset randomly and observe the error.

In [102]:
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(avocado, test_size=0.2, random_state=42)
# Drop the helper and date columns with non-in-place calls; dropping in place on
# the frames returned by train_test_split triggers pandas' SettingWithCopyWarning.
train_set = train_set.drop(["Total Volume Cat", "Date"], axis=1)
test_set = test_set.drop(["Total Volume Cat", "Date"], axis=1)

avocado_data_r = train_set.drop("AveragePrice", axis=1) # drop labels for training set
avocado_labels_r = train_set["AveragePrice"].copy()

# Note: this refits the shared full_pipeline on the randomly split training set
avocado_prepared_r = full_pipeline.fit_transform(avocado_data_r)

In [103]:
lin_reg_r = LinearRegression()
lin_reg_r.fit(avocado_prepared_r, avocado_labels_r)


LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [104]:

avocado_predictions_r = lin_reg_r.predict(avocado_prepared_r)
lin_mse_r = mean_squared_error(avocado_labels_r, avocado_predictions_r)
lin_rmse_r = np.sqrt(lin_mse_r)
lin_rmse_r

0.2672371424877746

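With a random split the training RMSE is almost unchanged (0.2672 versus 0.2676). This suggests that stratifying on total volume makes little difference here, possibly because total volume influences price less than expected.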

9. Try a decision tree.

In [105]:
from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor(random_state=42)
tree_reg.fit(avocado_prepared, avocado_labels)

DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
                      max_leaf_nodes=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      presort=False, random_state=42, splitter='best')

In [106]:
avocado_predictions = tree_reg.predict(avocado_prepared)
tree_mse = mean_squared_error(avocado_labels, avocado_predictions)
tree_rmse = np.sqrt(tree_mse)
tree_rmse

1.369068451371083e-17

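The training error is essentially zero, which means the tree has overfit the training set. Use cross-validation to get a more realistic estimate.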

In [107]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(tree_reg, avocado_prepared, avocado_labels,
                         scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)

In [108]:
def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())

display_scores(tree_rmse_scores)

Scores: [0.21500414 0.2163622  0.2159961  0.20727092 0.21545571 0.21012847
 0.20730842 0.21207068 0.22064367 0.22260345]
Mean: 0.21428437610508824
Standard deviation: 0.004886816146940346


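The cross-validated RMSE (about 0.214) is lower than the linear regression's training RMSE (about 0.268), so the decision tree performs better.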

10. Next, try a random forest model.

In [109]:
from sklearn.ensemble import RandomForestRegressor

forest_reg = RandomForestRegressor(n_estimators=10, random_state=42)
forest_reg.fit(avocado_prepared, avocado_labels)

avocado_predictions = forest_reg.predict(avocado_prepared)
forest_mse = mean_squared_error(avocado_labels, avocado_predictions)
forest_rmse = np.sqrt(forest_mse)
forest_rmse

0.06893078593939443

In [110]:
forest_scores = cross_val_score(forest_reg, avocado_prepared, avocado_labels,
                                scoring="neg_mean_squared_error", cv=10)
forest_rmse_scores = np.sqrt(-forest_scores)
display_scores(forest_rmse_scores)

Scores: [0.15583854 0.15971812 0.16112383 0.16194843 0.160737   0.1549782
 0.1592748  0.16100592 0.16184411 0.17000036]
Mean: 0.16064693079948336
Standard deviation: 0.003859777160859902


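The cross-validated RMSE (about 0.161) is worse than the training RMSE (about 0.069), so the forest still fits the training set more closely than unseen data, but the gap is far smaller than for the single decision tree and the result is basically acceptable.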
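
For a like-for-like comparison, every model can be scored with the same cross-validation procedure, including the linear model. The following is a sketch only: cv_rmse is a hypothetical helper, and the synthetic nonlinear data is an assumption that should be replaced by the prepared features and labels:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor


def cv_rmse(model, X, y, cv=10):
    """Return the per-fold RMSE of `model` (hypothetical helper)."""
    scores = cross_val_score(model, X, y,
                             scoring="neg_mean_squared_error", cv=cv)
    return np.sqrt(-scores)


# Synthetic nonlinear data (an assumption; substitute the prepared features).
rng = np.random.RandomState(42)
X = rng.uniform(0.0, 6.0, size=(400, 1))
y = np.sin(3.0 * X[:, 0]) + rng.normal(scale=0.1, size=400)

models = {
    "linear": LinearRegression(),
    "tree": DecisionTreeRegressor(random_state=42),
    "forest": RandomForestRegressor(n_estimators=10, random_state=42),
}
results = {name: cv_rmse(model, X, y) for name, model in models.items()}
for name, rmse in results.items():
    print(name, "mean:", rmse.mean(), "std:", rmse.std())
```

Scoring all three models the same way avoids comparing a training-set error for one model against a validation error for another.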