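The previous article explored and visualized the data in more depth, described the correlations between attributes, and briefly touched on attribute combinations.
This one covers how to prepare the data for machine learning algorithms.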

In [1]:
# Setup
# Import the basic packages
import numpy as np
import os

# Plotting
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split

# Suppress warnings
import warnings

# Directory where figures are saved
PROJECT_ROOT_DIR = '../'
CHAPTER_ID = 'end_to_end_project'
IMAGE_PATH = os.path.join(PROJECT_ROOT_DIR, "images", CHAPTER_ID)

def save_fig(fig_id, tight_layout=True, fig_extension='png', resolution=300):
    path = os.path.join(IMAGE_PATH, fig_id + "." + fig_extension)
    print("Saving figure:", fig_id)
    
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)


warnings.filterwarnings(action='ignore', module='scipy', message='internal')

# Load the data
HOUSING_PATH = os.path.join("../datasets", "housing")
def loading_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)

housing = loading_housing_data()

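# Build an income category attribute: divide by 1.5 to limit the number of categories,
# then merge every category of 5 or above into category 5.
housing["income_cat"] = np.ceil(housing["median_income"] / 1.5)
housing["income_cat"] = housing["income_cat"].where(housing["income_cat"] < 5, 5.0)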

from sklearn.model_selection import StratifiedShuffleSplit
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42) 
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]

# Remove income_cat so both sets return to the original columns
for set_ in (strat_train_set, strat_test_set):
    set_.drop(["income_cat"], axis=1, inplace=True)

### Preparing the data for the algorithms

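This step involves a lot of data scaling and transformation. These operations should be written as functions instead of being carried out by hand each time the data is analyzed.
The benefits: the functions can be applied to other datasets, they can be used in a live system to transform data before feeding it to the algorithm, and they make it easy to try different transformations and see which combinations work.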
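
As a minimal sketch of that idea, the income category step from the setup cell can be wrapped in a function. The helper name `add_income_cat`, its `divisor` and `cap` parameters, and the toy DataFrame are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd


def add_income_cat(df, divisor=1.5, cap=5.0):
    """Return a copy of df with an income_cat column added (hypothetical helper)."""
    result = df.copy()
    income_cat = np.ceil(result["median_income"] / divisor)
    # Keep values below the cap and replace everything else with the cap.
    result["income_cat"] = income_cat.where(income_cat < cap, cap)
    return result


# Toy data, invented for illustration only.
sample = pd.DataFrame({"median_income": [1.2, 2.7, 6.4, 12.0]})
sample_with_cat = add_income_cat(sample)
print(sample_with_cat["income_cat"].tolist())
```

Because the function returns a copy, the same call can be applied to any other DataFrame that has a `median_income` column without changing the input.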

In [4]:
# Copy the training set and drop the target column, keeping predictors and labels separate for later evaluation.
housing = strat_train_set.drop("median_house_value", axis=1)
housing_labels = strat_train_set["median_house_value"].copy()

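#### Data cleaning
Most machine learning algorithms cannot handle missing feature values in an instance, or the missing values can badly hurt the results. The total_bedrooms attribute mentioned earlier is one example.
Missing values are usually handled in one of the following ways:
1. Drop the instances where the feature is missing.
2. Drop the feature entirely.
3. Fill the missing values with the mean, the median, or 0.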


In [15]:
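housing.dropna(subset=["total_bedrooms"])  # Option 1: drop the rows with a missing value
housing.drop("total_bedrooms", axis=1)  # Option 2: drop the whole attribute

# Option 3: fill the missing values with the median.
# Options 1 and 2 above return new DataFrames and leave housing unchanged.
median = housing["total_bedrooms"].median()
housing["total_bedrooms"] = housing["total_bedrooms"].fillna(median)
housing.head()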

index,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,ocean_proximity
17606,-121.89,37.29,38.0,1568.0,351.0,710.0,339.0,2.7042,<1H OCEAN
18632,-121.93,37.05,14.0,679.0,108.0,306.0,113.0,6.4214,<1H OCEAN
14650,-117.2,32.77,31.0,1952.0,471.0,936.0,462.0,2.8621,NEAR OCEAN
3230,-119.61,36.31,25.0,1847.0,371.0,1460.0,353.0,1.8839,INLAND
3555,-118.59,34.23,17.0,6592.0,1525.0,4459.0,1463.0,3.0347,<1H OCEAN
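
When filling with the median, the value is generally computed on the training set and then kept, so the same number can be used for the test set and for new data later. The sketch below illustrates this on two toy DataFrames. All values are invented, and the names `train`, `test` and `train_median` are assumptions rather than variables from the notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the training and test sets (values invented).
train = pd.DataFrame({"total_bedrooms": [100.0, np.nan, 300.0, 500.0]})
test = pd.DataFrame({"total_bedrooms": [np.nan, 50.0]})

# Learn the median from the training set only; NaNs are skipped by default.
train_median = train["total_bedrooms"].median()

# Apply that same value to both sets, returning new DataFrames.
train_filled = train.assign(total_bedrooms=train["total_bedrooms"].fillna(train_median))
test_filled = test.assign(total_bedrooms=test["total_bedrooms"].fillna(train_median))

# Count the missing values that remain in each set.
print(train_filled["total_bedrooms"].isna().sum())
print(test_filled["total_bedrooms"].isna().sum())
```

Using `assign` instead of modifying the columns in place leaves the original DataFrames untouched, which makes it easy to compare different fill strategies side by side.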
