<a href="https://colab.research.google.com/github/sunlight2018/hands_on_ml3_notebooks/blob/main/chapter_02_end_to_end.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Workflow

1. Get the data                     👉 fetch_housing_data()
2. Explore the data                 👉 .head(), .info(), .describe(), .hist()
3. Create a test set                👉 train_test_split() + StratifiedShuffleSplit
4. Clean and preprocess the data    👉 dropna(), SimpleImputer, Pipeline
5. Feature engineering              👉 ColumnTransformer + OneHotEncoder
6. Train a model                    👉 fit() + predict()
7. Evaluate the model               👉 RMSE, cross_val_score, test-set validation
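
Step 3 is the easiest one to get subtly wrong, so here is a minimal sketch of stratified splitting. It assumes a small synthetic income column (not the real housing data) and reuses the bin edges from the notebook below. It uses `train_test_split` with `stratify=`, which is a one-call alternative to `StratifiedShuffleSplit`, and checks that the test set keeps roughly the same category proportions as the full dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for median_income (assumption: not the real housing data).
rng = np.random.default_rng(42)
df = pd.DataFrame({"median_income": rng.lognormal(mean=1.2, sigma=0.5, size=2000)})

# Bucket the continuous income into five categories to stratify on.
df["income_cat"] = pd.cut(df["median_income"],
                          bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                          labels=[1, 2, 3, 4, 5])

# stratify= makes the split preserve the category proportions.
train_set, test_set = train_test_split(
    df, test_size=0.2, stratify=df["income_cat"], random_state=42)

# Compare category proportions in the full data and in the test set.
overall = df["income_cat"].value_counts(normalize=True).sort_index()
in_test = test_set["income_cat"].value_counts(normalize=True).sort_index()
max_gap = np.abs(overall.to_numpy() - in_test.to_numpy()).max()

print(overall.round(3))
print(in_test.round(3))
print("largest proportion gap:", max_gap)
```

With a purely random split the gap can be noticeably larger for small categories, which is why the notebook stratifies on income.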

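Key terms

| Concept / module | Description |
| --- | --- |
| StratifiedShuffleSplit | Stratified sampling; keeps the strata proportions (here, income categories) consistent between the training and test sets |
| SimpleImputer | Fills in missing values (mean / median / most frequent) |
| Pipeline | Chains data-processing steps into a single object |
| ColumnTransformer | Processes numeric and categorical columns separately, then concatenates the results |
| cross_val_score | Cross-validation (a more stable estimate than a single train/validation split) |
| fit_transform() | Fits a transformer and transforms the data in one call; commonly used inside preprocessing chains |
| LinearRegression / RandomForestRegressor | Model estimators (the regressors being trained) |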
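
To make the preprocessing terms concrete, here is a minimal sketch on a toy DataFrame. The column names and values are hypothetical, and `handle_unknown="ignore"` is an optional setting that the notebook itself does not use. It shows `fit_transform()` learning statistics from the training rows and `transform()` reusing them on new rows.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data (hypothetical values): one missing number and one category column.
train = pd.DataFrame({
    "rooms": [3.0, np.nan, 5.0, 4.0],
    "age": [10.0, 20.0, 30.0, 40.0],
    "proximity": ["INLAND", "NEAR BAY", "INLAND", "ISLAND"],
})
new_rows = pd.DataFrame({
    "rooms": [np.nan],
    "age": [40.0],
    "proximity": ["NEAR OCEAN"],  # category never seen during fit
})

num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])
preprocess = ColumnTransformer([
    ("num", num_pipeline, ["rooms", "age"]),
    # handle_unknown="ignore" encodes unseen categories as all zeros.
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["proximity"]),
])

# fit_transform learns medians, means, scales, and categories from train.
train_prepared = preprocess.fit_transform(train)
# transform reuses those learned statistics without refitting.
new_prepared = preprocess.transform(new_rows)

# The median the imputer learned for the first numeric column ("rooms").
learned_median = (preprocess.named_transformers_["num"]
                  .named_steps["imputer"].statistics_[0])
print(train_prepared.shape, new_prepared.shape)
print("median used for missing rooms:", learned_median)
```

The same rule applies to the housing data: call `fit_transform()` on the training set only, and `transform()` on the test set.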


In [None]:
# 🧠 Hands-On ML, 3rd edition: Chapter 2 Colab study template

# ✅ 0. Environment setup
import os
import tarfile
import urllib.request
import numpy as np
import pandas as pd

# ✅ 1. Download and extract the data
def fetch_housing_data():
    DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml3/main/"
    HOUSING_PATH = os.path.join("datasets", "housing")
    HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"

    os.makedirs(HOUSING_PATH, exist_ok=True)
    tgz_path = os.path.join(HOUSING_PATH, "housing.tgz")
    urllib.request.urlretrieve(HOUSING_URL, tgz_path)
    with tarfile.open(tgz_path) as housing_tgz:
        housing_tgz.extractall(path=HOUSING_PATH)

# ✅ 2. Load the data
def load_housing_data():
    csv_path = os.path.join("datasets", "housing", "housing.csv")
    return pd.read_csv(csv_path)

fetch_housing_data()
housing = load_housing_data()
print(housing.head())  # a bare expression is only displayed when it ends the cell

# ✅ 3. First look at the data
housing.info()  # info() prints its own report and returns None
print(housing["ocean_proximity"].value_counts())
print(housing.describe())
housing.hist(bins=50, figsize=(20, 15))

# ✅ 4. Split into training/test sets (stratified sampling)
housing["income_cat"] = pd.cut(housing["median_income"],
                                bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                                labels=[1, 2, 3, 4, 5])

from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
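for train_index, test_index in split.split(housing, housing["income_cat"]):
    # split() yields positional indices, so index with iloc
    strat_train_set = housing.iloc[train_index]
    strat_test_set = housing.iloc[test_index]

# income_cat was only needed for stratification; drop it from both sets
strat_train_set = strat_train_set.drop("income_cat", axis=1)
strat_test_set = strat_test_set.drop("income_cat", axis=1)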

# ✅ 5. Separate features and labels
housing = strat_train_set.drop("median_house_value", axis=1)
housing_labels = strat_train_set["median_house_value"].copy()

# ✅ 6. Numeric preprocessing: impute missing values + standardize
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

housing_num = housing.select_dtypes(include=[np.number])
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])

housing_num_tr = num_pipeline.fit_transform(housing_num)

# ✅ 7. Preprocess numeric + categorical columns together
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

num_attribs = list(housing_num.columns)
cat_attribs = ["ocean_proximity"]

full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", OneHotEncoder(), cat_attribs),
])

housing_prepared = full_pipeline.fit_transform(housing)

# ✅ 8. Train a model (linear regression)
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)

# ✅ 9. Evaluate the model (RMSE on the training set)
from sklearn.metrics import mean_squared_error

housing_predictions = lin_reg.predict(housing_prepared)
lin_mse = mean_squared_error(housing_labels, housing_predictions)
lin_rmse = np.sqrt(lin_mse)
print("Linear regression RMSE:", lin_rmse)

# ✅ 10. A more complex model (random forest)
from sklearn.ensemble import RandomForestRegressor

forest_reg = RandomForestRegressor(random_state=42)  # fixed seed for reproducible results
forest_reg.fit(housing_prepared, housing_labels)

# ✅ 11. Evaluate with cross-validation
from sklearn.model_selection import cross_val_score

scores = cross_val_score(forest_reg, housing_prepared, housing_labels,
                         scoring="neg_mean_squared_error", cv=10)
rmse_scores = np.sqrt(-scores)  # scores are negated MSE, so flip the sign first
print("Random forest cross-validation RMSE mean:", rmse_scores.mean())

# ✅ 12. Final evaluation on the test set
X_test = strat_test_set.drop("median_house_value", axis=1)
y_test = strat_test_set["median_house_value"].copy()

X_test_prepared = full_pipeline.transform(X_test)
final_predictions = forest_reg.predict(X_test_prepared)
final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)
print("Final test-set RMSE:", final_rmse)

# 🎯 Chapter complete! You now have a basic template for an end-to-end ML project.

In [None]:
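This is a Colab study template for Chapter 2 of Hands-On Machine Learning (3rd edition). It includes:
	•	The complete workflow, from fetching the data to evaluating on the test set ✅
	•	A clear structure with a comment on every step ✅
	•	A reusable layout: swap in a new dataset to start a new project ✅

You can copy this template into Colab and run it directly, or add your own notes and exercises module by module.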
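
One way to make the template easier to reuse on a new dataset is to bundle preprocessing and the model into a single estimator. The sketch below is only an illustration under stated assumptions: `build_model` is a hypothetical helper, and the data is synthetic. It uses scikit-learn's `neg_root_mean_squared_error` scorer so no manual square root is needed.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def build_model(regressor, num_cols, cat_cols):
    """Hypothetical helper: preprocessing plus a regressor in one estimator."""
    preprocess = ColumnTransformer([
        ("num", make_pipeline(SimpleImputer(strategy="median"), StandardScaler()),
         num_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ])
    return Pipeline([("preprocess", preprocess), ("regressor", regressor)])


# Synthetic stand-in dataset (assumption): price depends on size and area type.
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "size": rng.uniform(50, 150, n),
    "area": rng.choice(["city", "suburb"], n),
})
X.loc[::25, "size"] = np.nan  # a few missing values for the imputer to fill
y = (1000 * X["size"].fillna(100)
     + 20000 * (X["area"] == "city")
     + rng.normal(0, 5000, n))

model = build_model(LinearRegression(), num_cols=["size"], cat_cols=["area"])
# Preprocessing is refit inside every fold, so no statistics leak across folds.
scores = cross_val_score(model, X, y,
                         scoring="neg_root_mean_squared_error", cv=5)
rmse_scores = -scores
print("cross-validated RMSE mean:", rmse_scores.mean())
```

Because the whole chain is one estimator, `model.fit(X_train, y_train)` followed by `model.predict(X_test)` applies the same preprocessing to raw test data automatically.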
