# California Housing Prices

This problem comes from Chapter 2 of “Hands-On Machine Learning with Scikit-Learn, Keras and TensorFlow” by Aurélien Géron (O’Reilly). Copyright 2019 Aurélien Géron, 978-1-492-03264-9. To reinforce my understanding, I reimplement the entire procedure here.

# Preparation

In [None]:
import os
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

In [None]:
housing = pd.read_csv('../input/california-housing-prices/housing.csv')
display(housing.head())

## Frame the Problem

1. It is a **supervised learning** task, because we are given labeled training examples (each instance comes with the expected output, i.e., the district’s median housing price).
2. It is a **regression task**, because we are asked to predict a value.
    - More specifically, this is a **multiple regression** problem since the system will use multiple attributes to make a prediction (it will use the district’s population, the median income, etc.). 
    - It is also a **univariate regression** problem since we are only trying to predict a single value for each district. If we were trying to predict multiple values per district, it would be a **multivariate regression** problem. 
3. There is no continuous flow of data coming into the system, there is no particular need to adjust to changing data rapidly, and the data is small enough to fit in memory, so plain **batch learning** should do just fine.

## Performance Measure

### Root Mean Square Error

$$\text{Euclidean norm } (\ell_2):\quad \text{RMSE}(X,h)=\sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(h(x^{(i)})-y^{(i)}\right)^2}$$
It gives an idea of how much error the system typically makes in its predictions, with a higher weight for large errors.

### Mean Absolute Error

$$\text{Manhattan norm } (\ell_1):\quad \text{MAE}(X,h)=\frac{1}{m}\sum_{i=1}^{m}\left|h(x^{(i)})-y^{(i)}\right|$$

For example, if there are many **outlier districts**, consider using the Mean Absolute Error instead, since it weights large errors less heavily than the RMSE does.
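
To make the difference between the two measures concrete, here is a minimal sketch with made-up numbers (the helper names `rmse` and `mae` and the arrays are illustrative assumptions, not part of the housing data). Both prediction sets have the same total absolute error, but in the second one it is concentrated in a single district:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error: square root of the mean squared error."""
    errors = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return np.sqrt(np.mean(errors ** 2))

def mae(y_true, y_pred):
    """Mean absolute error: mean of the absolute errors."""
    errors = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return np.mean(np.abs(errors))

# Hypothetical labels for four districts (illustrative values only)
y_true = np.array([100.0, 200.0, 300.0, 400.0])

# Case 1: every prediction is off by 10
y_pred_even = y_true + 10.0
# Case 2: three perfect predictions and one prediction off by 40
y_pred_outlier = y_true + np.array([0.0, 0.0, 0.0, 40.0])

rmse_even, mae_even = rmse(y_true, y_pred_even), mae(y_true, y_pred_even)
rmse_outlier, mae_outlier = rmse(y_true, y_pred_outlier), mae(y_true, y_pred_outlier)

print("even errors:   RMSE =", rmse_even, " MAE =", mae_even)
print("one big error: RMSE =", rmse_outlier, " MAE =", mae_outlier)
```

The MAE is the same in both cases, while the RMSE doubles when the error is concentrated in one district, which is why the MAE can be the better choice when outliers are common.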

## Check the Data

In [None]:
housing.info()

Each row represents one district. There are 10 attributes and 20,640 instances in the dataset. Notice that the total_bedrooms attribute has only 20,433 non-null values, meaning that 207 districts are missing this attribute. The only categorical attribute is "ocean_proximity".

In [None]:
sns.barplot(x=housing["ocean_proximity"].value_counts().index,
            y=housing["ocean_proximity"].value_counts().values)
plt.title("Ocean Proximity")
plt.show()

In [None]:
display(housing.describe())

Note that the null values are ignored (so, for example, the count of total_bedrooms is 20,433, not 20,640). The std row shows the standard deviation, which measures how dispersed the values are. The 25%, 50%, and 75% rows show the corresponding percentiles: a percentile indicates the value below which a given percentage of observations in a group of observations falls. For example, 25% of the districts have a housing_median_age lower than 18, while 50% are lower than 29 and 75% are lower than 37. These are often called the 25th percentile (or 1st quartile), the median, and the 75th percentile (or 3rd quartile).
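
As a small illustration of what the percentile rows mean, the sketch below computes the quartiles of a made-up array of nine ages with `np.percentile`. The values are hypothetical, chosen only so that the quartiles are easy to read off; they are not taken from `housing.csv`.

```python
import numpy as np

# Made-up ages for nine hypothetical districts (not real data)
ages = np.array([5, 12, 18, 20, 29, 31, 37, 40, 52])

# 25th percentile (1st quartile), median, 75th percentile (3rd quartile)
q1, median, q3 = np.percentile(ages, [25, 50, 75])
print("quartiles:", q1, median, q3)

# Fraction of districts strictly below each of these values
for q in (q1, median, q3):
    print(q, (ages < q).mean())
```

With so few points the fractions only approximate 25%, 50%, and 75%; on the full dataset they match the percentages much more closely.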

In [None]:
housing.hist(bins=50, figsize=(20, 15))
plt.show()

1. First, the **median income** attribute does not look like it is expressed in US dollars (USD). Instead, the numbers represent roughly tens of thousands of dollars (e.g., 3 actually means about 30,000 USD).
2. The **housing median age** and the **median house value** were capped. The latter may be a serious problem since it is your target attribute (your labels). Your Machine Learning algorithms may learn that prices never go beyond that limit. You need to check with your client team (the team that will use your system’s output) to see if this is a problem or not. If they tell you that they need precise predictions even beyond 500,000 USD, then you have mainly two options:
    - Collect *proper labels* for the districts whose labels were capped.
    - *Remove* those districts from the training set (and also from the test set, since your system should not be evaluated poorly if it predicts values beyond 500,000 USD).
3. These attributes have very *different scales*. We will discuss this later in this chapter when we explore feature scaling.
4. Finally, many histograms are tail heavy: they extend much farther to the right of the median than to the left. This may make it a bit harder for some Machine Learning algorithms to detect patterns. We will try *transforming these attributes later on to have more **bell-shaped** distributions*.
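
One common way to make a tail-heavy attribute more bell-shaped is a log transform. The sketch below is only an illustration under stated assumptions: the `population` array is synthetic (drawn from a log-normal distribution, not read from `housing.csv`), and `np.log1p` is one possible transform among several.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(42)

# Synthetic, tail-heavy stand-in for an attribute such as population (assumption)
population = rng.lognormal(mean=7.0, sigma=0.8, size=5000)

# log1p computes log(1 + x), which is safe for zero values
log_population = np.log1p(population)

# Skewness near 0 indicates a roughly symmetric distribution
skew_before = skew(population)
skew_after = skew(log_population)
print("skewness before:", skew_before)
print("skewness after: ", skew_after)
```

The skewness drops from a clearly positive value to roughly zero, meaning the long right tail has been pulled in.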

## Stratified Sampling to Split Data Set

Suppose that the **median income** is a very important attribute for predicting median housing prices. We then have to ensure that the test set is representative of the various categories of incomes in the whole dataset. Since the median income is a continuous numerical attribute, we first need to create an income category attribute.

In [None]:
# histplot replaces distplot, which is deprecated in recent seaborn releases
sns.histplot(housing['median_income'], kde=True)
plt.title('Median Income')
plt.show()

Most median income values are clustered around 1.5 to 6 (i.e., \$15,000–\$60,000), but some median incomes go far beyond 6. It is important to have a sufficient number of instances in the dataset for each stratum, or else the estimate of the stratum’s importance may be biased. This means that we should not have too many strata, and each stratum should be large enough. The following code uses the `pd.cut()` function to create an income category attribute with 5 categories (labeled from 1 to 5): category 1 ranges from 0 to 1.5 (i.e., less than \$15,000), category 2 from 1.5 to 3, and so on:

In [None]:
housing['income_cat'] = pd.cut(housing["median_income"],
                               bins=[0, 1.5, 3, 4.5, 6, np.inf],
                               labels=[1, 2, 3, 4, 5])
sns.barplot(x=housing["income_cat"].value_counts().index,
            y=housing["income_cat"].value_counts().values)
plt.title("Income Categories")
plt.show()

Do **stratified sampling** based on the income category:

In [None]:
from sklearn.model_selection import StratifiedShuffleSplit
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]

In [None]:
print(housing['income_cat'].value_counts() / len(housing))
print(strat_train_set['income_cat'].value_counts() / len(strat_train_set))
print(strat_test_set['income_cat'].value_counts() / len(strat_test_set))

As you can see, the test set generated using stratified sampling has income category **proportions** almost identical to those in the **full dataset**, whereas a test set generated using purely random sampling would typically deviate more from them. Now we should **remove** the "income_cat" attribute so the data is back to its original state:

In [None]:
for set_ in (strat_train_set, strat_test_set):
    set_.drop("income_cat", axis=1, inplace=True)
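
The comparison between stratified and purely random sampling can be sketched as follows. This is a self-contained illustration, not the notebook's actual data: the `median_income` column is synthetic (a log-normal sample chosen as an assumption), and `train_test_split` stands in for "purely random sampling".

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit, train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the median_income column (assumption, not housing.csv)
df = pd.DataFrame({"median_income": rng.lognormal(mean=1.2, sigma=0.5, size=5000)})
df["income_cat"] = pd.cut(df["median_income"],
                          bins=[0, 1.5, 3, 4.5, 6, np.inf],
                          labels=[1, 2, 3, 4, 5]).astype(int)

def proportions(frame):
    """Share of each income category in the given frame."""
    return frame["income_cat"].value_counts(normalize=True).sort_index()

# Purely random split
_, random_test = train_test_split(df, test_size=0.2, random_state=42)

# Stratified split on the income category
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
_, strat_idx = next(splitter.split(df, df["income_cat"]))
strat_test = df.iloc[strat_idx]

comparison = pd.DataFrame({
    "overall": proportions(df),
    "random": proportions(random_test),
    "stratified": proportions(strat_test),
})
comparison["random_err_%"] = 100 * (comparison["random"] / comparison["overall"] - 1)
comparison["strat_err_%"] = 100 * (comparison["stratified"] / comparison["overall"] - 1)
print(comparison)
```

The stratified column tracks the overall proportions closely by construction; how far the random column drifts depends on the seed and the sample size, so the exact numbers here should not be read as results for the housing data.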

# EDA

Now, we have:

- strat_train_set
- strat_test_set

First, we put the strat_test_set aside and we are only exploring the **training set**.

## Summary of Housing Prices

In [None]:
housing = strat_train_set.copy()

Now let’s look at the **housing prices**. The **radius** of each circle represents the district’s **population** (option s), and the **color** represents the **price** (option c). We will use a predefined color map (option cmap) called jet, which ranges from blue (low values) to red (high prices):

In [None]:
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
             s=housing["population"]/100, 
             label="population",
             c="median_house_value", cmap=plt.get_cmap("jet"), 
             colorbar=True, figsize=(10,7)) 
plt.legend()
plt.show()

## Correlation Analysis

In [None]:
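# Keep only numeric columns: recent pandas versions raise an error on text columns
corr_matrix = housing.select_dtypes(include=[np.number]).corr()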
print(corr_matrix["median_house_value"].sort_values(ascending=False))

In [None]:
from pandas.plotting import scatter_matrix
attributes = ["median_house_value", "median_income", 
              "total_rooms", "housing_median_age"] 
scatter_matrix(housing[attributes], figsize=(12, 8))
plt.show()

The most promising attribute to predict the median house value is the median income, so let’s zoom in on their correlation scatterplot.

In [None]:
housing.plot(kind="scatter", x="median_income", y="median_house_value", alpha=0.1)
plt.show()

The correlation is indeed very strong; we can clearly see the upward trend, and the points are not too dispersed. The price cap that we noticed earlier is also clearly visible as a horizontal line at \$500,000. But this plot reveals other, less obvious straight lines: a horizontal line around \$450,000, another around \$350,000, perhaps one around \$280,000, and a few more below that. We may want to try removing the corresponding districts to prevent our algorithms from learning to reproduce these data quirks.

## Experimenting with Attribute Combinations

Try out various attribute combinations. For example, the total number of rooms in a district is not very useful if we don’t know how many households there are. What we really want is the number of rooms per household. Similarly, the total number of bedrooms by itself is not very useful: we probably want to compare it to the number of rooms. And the population per household also seems like an interesting attribute combination to look at. Let’s create these new attributes:

In [None]:
housing["rooms_per_household"] = housing["total_rooms"]/housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"]/housing["total_rooms"]
housing["population_per_household"]=housing["population"]/housing["households"]

Look at the correlation matrix again.

In [None]:
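# Keep only numeric columns: recent pandas versions raise an error on text columns
corr_matrix = housing.select_dtypes(include=[np.number]).corr()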
print(corr_matrix["median_house_value"].sort_values(ascending=False))

The new bedrooms_per_room attribute is much more correlated with the median house value than the total number of rooms or bedrooms. Apparently houses with a lower bedroom/room ratio tend to be more expensive. The number of rooms per household is also more informative than the total number of rooms in a district—obviously the larger the houses, the more expensive they are.

# Preprocessing the Data for Machine Learning Algorithms

## Data Cleaning

In [None]:
housing = strat_train_set.drop("median_house_value", axis=1) 
housing_labels = strat_train_set["median_house_value"].copy()
display(housing.head())

### Missing Values

In [None]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy="median")
housing_num = housing.drop("ocean_proximity", axis=1)
imputer.fit(housing_num)
print(imputer.statistics_)
print(housing_num.median().values)

The imputer has simply computed the median of **each attribute** and stored the result in its `statistics_` instance variable. Only the total_bedrooms attribute had missing values, but we cannot be sure that there won’t be any missing values in new data after the system goes live, so it is ***safer to apply the imputer to all the numerical attributes***:

In [None]:
# Keep the original row index so housing_tr still lines up with housing
housing_tr = pd.DataFrame(imputer.transform(housing_num),
                          columns=housing_num.columns,
                          index=housing_num.index)

## Categorical Attributes

In [None]:
housing_cat = housing[["ocean_proximity"]]
display(housing_cat.head(10))

### Ordinal Encoder

In [None]:
from sklearn.preprocessing import OrdinalEncoder
ordinal_encoder = OrdinalEncoder()
housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)
print(housing_cat_encoded[:10])
print(ordinal_encoder.categories_)

One issue with this representation is that ML algorithms will assume that two nearby values are more similar than two distant values. This may be fine in some cases (e.g., for ordered categories such as “bad”, “average”, “good”, “excellent”), but it is obviously not the case for the ocean_proximity column (for example, categories 0 and 4 are clearly more similar than categories 0 and 1). To fix this issue, a common solution is to create one binary attribute per category: one attribute equal to 1 when the category is “<1H OCEAN” (and 0 otherwise), another attribute equal to 1 when the category is “INLAND” (and 0 otherwise), and so on. This is called one-hot encoding, because only one attribute will be equal to 1 (hot), while the others will be 0 (cold). The new attributes are sometimes called dummy attributes. Scikit-Learn provides a `OneHotEncoder` class to convert categorical values into one-hot vectors:

### One Hot Encoder (Dummy)

In [None]:
from sklearn.preprocessing import OneHotEncoder
cat_encoder = OneHotEncoder()
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)
housing_cat_1hot

Notice that the output is a SciPy sparse matrix, instead of a NumPy array. This is very useful when you have categorical attributes with thousands of categories.

In [None]:
print(housing_cat_1hot.toarray())

If a categorical attribute has a large number of possible categories (e.g., country code, profession, species, etc.), then one-hot encoding will result in a large number of input features. This may slow down training and degrade performance. If this happens, you may want to replace the categorical input with useful numerical features related to the categories: for example, you could replace the ocean_proximity feature with the distance to the ocean (similarly, a country code could be replaced with the country’s population and GDP per capita). Alternatively, you could replace each category with a learnable low dimensional vector called an embedding. Each category’s representation would be learned during training: this is an example of representation learning (see Chapter 13 and ??? for more details).

## Custom Transformers

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin 
rooms_ix, bedrooms_ix, population_ix, households_ix = 3, 4, 5, 6 
class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room=True): # no *args or **kargs 
        self.add_bedrooms_per_room = add_bedrooms_per_room 
    def fit(self, X, y=None):
        return self # nothing else to do 
    def transform(self, X, y=None):
        rooms_per_household = X[:, rooms_ix] / X[:, households_ix]
        population_per_household = X[:, population_ix] / X[:, households_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household, bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household] 

attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False) 
housing_extra_attribs = attr_adder.transform(housing.values)

In [None]:
display(pd.DataFrame(housing_extra_attribs).head())

More generally, we can add a hyperparameter to gate any data preparation step that we are not 100% sure about. The more we automate these data preparation steps, the more combinations we can automatically try out, making it much more likely that we will find a great combination (and saving us a lot of time).

## Feature Scaling

Machine Learning algorithms don’t perform well when the input numerical attributes have very different scales. This is the case for the housing data: the total number of rooms ranges from about 6 to 39,320, while the median incomes only range from 0 to 15. Note that scaling the target values is generally not required.

- Min-max scaling (many people call this normalization) is quite simple: values are shifted and rescaled so that they end up ranging from 0 to 1. It has a feature_range hyperparameter that lets you change the range if you don’t want 0–1 for some reason.
- Standardization is quite different: first it subtracts the mean value (so standardized values always have a zero mean), and then it divides by the standard deviation so that the resulting distribution has unit variance. Unlike min-max scaling, standardization does not bound values to a specific range, which may be a **problem** for some algorithms (e.g., neural networks often expect an input value ranging from 0 to 1). However, standardization is much less affected by **outliers**. For example, suppose a district had a median income equal to 100 (by mistake). Min-max scaling would then crush all the other values from 0–15 down to 0–0.15, whereas standardization would not be much affected.
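
The outlier example above can be sketched with Scikit-Learn's `MinMaxScaler` and `StandardScaler`. The income values are synthetic (uniform between 0 and 15, an assumption made only for illustration), with one erroneous value of 100 appended:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.default_rng(42)

# Synthetic median incomes between 0 and 15 (assumption, not housing.csv)
incomes = rng.uniform(0.0, 15.0, size=(10_000, 1))
# Same data plus one mistaken value of 100
incomes_bad = np.vstack([incomes, [[100.0]]])

minmax_bad = MinMaxScaler().fit_transform(incomes_bad)
std_clean = StandardScaler().fit_transform(incomes)
std_bad = StandardScaler().fit_transform(incomes_bad)

# [:-1] selects the valid values, leaving out the appended mistake
print("min-max, valid values:      ", minmax_bad[:-1].min(), minmax_bad[:-1].max())
print("standardized, clean fit:    ", std_clean.min(), std_clean.max())
print("standardized, fit with 100: ", std_bad[:-1].min(), std_bad[:-1].max())
```

With min-max scaling the valid values end up squeezed into roughly 0–0.15, while their standardized range changes only slightly. Note that the effect on standardization grows as the dataset gets smaller, since a single extreme value then has more influence on the mean and standard deviation.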

## Transformation Pipelines

When you call the pipeline’s fit() method, it calls fit_transform() sequentially on all transformers, passing the output of each call as the parameter to the next call, until it reaches the final estimator, for which it just calls the fit() method.

### Pipeline of Numerical Attributes

In [None]:
from sklearn.pipeline import Pipeline 
from sklearn.preprocessing import StandardScaler
num_pipeline = Pipeline([('imputer', SimpleImputer(strategy="median")), 
                         ('attribs_adder', CombinedAttributesAdder()), 
                         ('std_scaler', StandardScaler()),])
housing_num_tr = num_pipeline.fit_transform(housing_num)

### Pipeline of All Attributes

In [None]:
from sklearn.compose import ColumnTransformer
num_attribs = list(housing_num) 
cat_attribs = ["ocean_proximity"]
full_pipeline = ColumnTransformer([("num", num_pipeline, num_attribs), 
                                   ("cat", OneHotEncoder(), cat_attribs),])
housing_prepared = full_pipeline.fit_transform(housing)

In [None]:
pd.DataFrame(housing_prepared).head()

# Train a Model

## Training and Evaluating on the Training Set

### Linear Regression

In [None]:
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression() 
lin_reg.fit(housing_prepared, housing_labels)

In [None]:
from sklearn.metrics import mean_squared_error
housing_predictions = lin_reg.predict(housing_prepared)
lin_mse = mean_squared_error(housing_labels, housing_predictions)
lin_rmse = np.sqrt(lin_mse)
print("RMSE:", lin_rmse)

Most districts’ `median_house_value` values range between \$120,000 and \$265,000, so a typical prediction error of \$68,628 is not very satisfying. This is an example of a model underfitting the training data. When this happens it can mean that the features do not provide enough information to make good predictions, or that the model is not powerful enough. The main ways to fix underfitting are to select a more powerful model, to feed the training algorithm with better features, or to reduce the constraints on the model. This model is not regularized, so this rules out the last option. We could try to add more features (e.g., the log of the population), but first let’s try a more complex model to see how it does.

### Decision Tree Regressor

In [None]:
from sklearn.tree import DecisionTreeRegressor
tree_reg = DecisionTreeRegressor() 
tree_reg.fit(housing_prepared, housing_labels)

In [None]:
housing_predictions = tree_reg.predict(housing_prepared)
tree_mse = mean_squared_error(housing_labels, housing_predictions)
tree_rmse = np.sqrt(tree_mse)
print("RMSE:", tree_rmse)

A training RMSE at or near zero does not mean the model is perfect; it is much more likely that the model has badly overfit the data. As we saw earlier, we don’t want to touch the test set until we are ready to launch a model we are confident about, so we need to use part of the training set for training and part for model validation.

#### K-fold Cross-Validation

K-fold cross-validation splits the training set into 10 distinct subsets called folds, then trains and evaluates the Decision Tree model 10 times, picking a different fold for evaluation every time and training on the other 9 folds. The result is an array containing the 10 evaluation scores:

In [None]:
from sklearn.model_selection import cross_val_score 
scores = cross_val_score(tree_reg, housing_prepared, housing_labels, 
                         scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)

def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())
display_scores(tree_rmse_scores)

The Decision Tree seems to perform worse than the Linear Regression model! Notice that cross-validation allows us to get not only an estimate of the performance of our model, but also a measure of how precise this estimate is (i.e., its standard deviation). Let's compute the same scores for the Linear Regression model to be sure.

In [None]:
lin_scores = cross_val_score(lin_reg, housing_prepared, housing_labels,
                             scoring="neg_mean_squared_error", cv=10)
lin_rmse_scores = np.sqrt(-lin_scores)
display_scores(lin_rmse_scores)

### Random Forests

Building a model on top of many other models is called Ensemble Learning, and it is often a great way to push ML algorithms even further.

In [None]:
from sklearn.ensemble import RandomForestRegressor
forest_reg = RandomForestRegressor()
forest_reg.fit(housing_prepared, housing_labels)

In [None]:
housing_predictions = forest_reg.predict(housing_prepared)
forest_mse = mean_squared_error(housing_labels, housing_predictions)
forest_rmse = np.sqrt(forest_mse)
print("RMSE:", forest_rmse)

In [None]:
forest_scores = cross_val_score(forest_reg, housing_prepared, housing_labels,
                                scoring="neg_mean_squared_error", cv=10)
forest_rmse_scores = np.sqrt(-forest_scores)
display_scores(forest_rmse_scores)

Random Forests look very promising. However, note that the score on the training set is still much lower than on the validation sets, meaning that the model is still overfitting the training set.

In [None]:
# # Save the model (joblib is a standalone package; sklearn.externals.joblib was removed)
# import joblib
# joblib.dump(my_model, "my_model.pkl")
# # and later...
# my_model_loaded = joblib.load("my_model.pkl")

# Fine-Tune the Model

## Grid Search

This param_grid tells Scikit-Learn to first evaluate all 3 × 4 = 12 combinations of n_estimators and max_features hyperparameter values specified in the first dict, then try all 2 × 3 = 6 combinations of hyperparameter values in the second dict, but this time with the bootstrap hyperparameter set to False instead of True (which is the default value for this hyperparameter).

In [None]:
from sklearn.model_selection import GridSearchCV
param_grid = [
    {'n_estimators': [3, 10, 30], 
     'max_features': [2, 4, 6, 8]}, 
    {'bootstrap': [False], 
     'n_estimators': [3, 10], 
     'max_features': [2, 3, 4]},
]
forest_reg = RandomForestRegressor()
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring='neg_mean_squared_error', 
                           return_train_score=True)
grid_search.fit(housing_prepared, housing_labels)

In [None]:
print(grid_search.best_params_)
print(grid_search.best_estimator_)

And of course the evaluation scores are also available:

In [None]:
cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)

The RMSE score for the best combination is 50033, which is slightly better than the score we got earlier using the default hyperparameter values.

> Don’t forget that we can treat some of the data preparation steps as hyperparameters. For example, the grid search can automatically find out whether or not to add a feature we were not sure about (e.g., using the add_bedrooms_per_room hyperparameter of the CombinedAttributesAdder transformer).
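
Treating a preparation step as a hyperparameter can be sketched as follows. Everything in this block is an illustrative assumption: `RatioAdder` is a hypothetical, simplified stand-in for `CombinedAttributesAdder`, the data is synthetic, and the step names `adder` and `model` are arbitrary. The key idea is that `GridSearchCV` can reach a transformer's constructor argument inside a `Pipeline` through the `<step>__<parameter>` naming convention.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

class RatioAdder(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: optionally appends column 0 divided by column 1."""
    def __init__(self, add_ratio=True):
        self.add_ratio = add_ratio
    def fit(self, X, y=None):
        return self  # nothing to learn
    def transform(self, X):
        if not self.add_ratio:
            return X
        return np.c_[X, X[:, 0] / X[:, 1]]

# Synthetic data in which the target depends on the ratio of the two columns
rng = np.random.default_rng(0)
X = rng.uniform(1.0, 10.0, size=(500, 2))
y = 3.0 * X[:, 0] / X[:, 1] + rng.normal(scale=0.1, size=500)

pipe = Pipeline([("adder", RatioAdder()), ("model", LinearRegression())])

# The grid switches the extra feature on and off
search = GridSearchCV(pipe, {"adder__add_ratio": [True, False]},
                      cv=5, scoring="neg_mean_squared_error")
search.fit(X, y)
print(search.best_params_)
```

Because the synthetic target is built from the ratio, the search keeps the extra feature; on real data the answer is not known in advance, which is the point of letting the search decide.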

## Randomized Search

When the hyperparameter search space is large, it is often preferable to use RandomizedSearchCV instead. Instead of trying out all possible combinations in grid search, it evaluates a given number of random combinations by selecting a random value for each hyperparameter at every iteration. This approach has two main benefits:
- If we let the randomized search run for, say, 1,000 iterations, this approach will explore 1,000 different values for each hyperparameter (instead of just a few values per hyperparameter with the grid search approach).
- We have more control over the computing budget we want to allocate to hyperparameter search, simply by setting the number of iterations.
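
A minimal sketch of a randomized search is shown below. It runs on synthetic data from `make_regression` rather than on `housing_prepared`, and the distribution bounds and `n_iter=10` are arbitrary choices made for illustration, not tuned values.

```python
import numpy as np
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression data with 8 features (assumption, not the housing data)
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=42)

# Each hyperparameter gets a distribution to sample from instead of a fixed list
param_distributions = {
    "n_estimators": randint(low=5, high=50),   # integers 5..49
    "max_features": randint(low=1, high=9),    # integers 1..8
}

rnd_search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions=param_distributions,
    n_iter=10,                 # the computing budget: number of sampled combinations
    cv=3,
    scoring="neg_mean_squared_error",
    random_state=42,
)
rnd_search.fit(X, y)

cvres = rnd_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)
print("best:", rnd_search.best_params_)
```

Raising `n_iter` widens the exploration at a proportional cost in training time, which is the budget control described above.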

## Ensemble Methods

Another way to fine-tune your system is to try to combine the models that perform best. The **group (or “ensemble”)** will often perform better than the best individual model (just like Random Forests perform better than the individual Decision Trees they rely on), especially if the individual models make very different types of errors.
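
One simple way to combine models is to average their predictions, for instance with Scikit-Learn's `VotingRegressor`. The sketch below is an assumption about how this could look, not the approach of the source: it uses synthetic data and two arbitrary base models, and it makes no claim about which model wins on the housing data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data (assumption, not the housing data)
X, y = make_regression(n_samples=300, n_features=8, noise=20.0, random_state=0)

base_models = {
    "linear": LinearRegression(),
    "forest": RandomForestRegressor(n_estimators=30, random_state=0),
}
# VotingRegressor averages the predictions of its base models
ensemble = VotingRegressor(list(base_models.items()))

rmse_by_model = {}
for name, model in {**base_models, "ensemble": ensemble}.items():
    scores = cross_val_score(model, X, y,
                             scoring="neg_mean_squared_error", cv=5)
    rmse_by_model[name] = np.sqrt(-scores).mean()  # mean RMSE across folds

for name, value in rmse_by_model.items():
    print(name, value)
```

Whether the ensemble beats its best member depends on how different the members' errors are, so the comparison has to be run on the actual data.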

## Analyze the Best Models and Their Errors

In [None]:
feature_importances = grid_search.best_estimator_.feature_importances_
print(feature_importances)

Display these importance scores next to their corresponding attribute names:

In [None]:
extra_attribs = ["rooms_per_hhold", "pop_per_hhold", "bedrooms_per_room"]
cat_encoder = full_pipeline.named_transformers_["cat"]
cat_one_hot_attribs = list(cat_encoder.categories_[0])
attributes = num_attribs + extra_attribs + cat_one_hot_attribs
print(sorted(zip(feature_importances, attributes), reverse=True))

With this information, we may want to try dropping some of the less useful features (e.g., apparently only one ocean_proximity category is really useful, so we could try dropping the others).


We should also look at the specific errors that our system makes, then try to understand why it makes them and what could fix the problem (adding extra features or, on the contrary, getting rid of uninformative ones, cleaning up outliers, etc.).

# Evaluate Model on the Test Set

In [None]:
final_model = grid_search.best_estimator_ 
X_test = strat_test_set.drop("median_house_value", axis=1) 
y_test = strat_test_set["median_house_value"].copy()
X_test_prepared = full_pipeline.transform(X_test)
final_predictions = final_model.predict(X_test_prepared)
final_mse = mean_squared_error(y_test, final_predictions) 
final_rmse = np.sqrt(final_mse)
print("Final RMSE:", final_rmse)

Compute a 95% confidence interval for the generalization error using `scipy.stats.t.interval()`:

In [None]:
from scipy import stats
confidence = 0.95
squared_errors = (final_predictions - y_test) ** 2
ci = np.sqrt(stats.t.interval(confidence, len(squared_errors)-1, 
                              loc=squared_errors.mean(), 
                              scale=stats.sem(squared_errors)))
print(ci)

The performance will usually be slightly worse than what we measured using cross-validation if we did a lot of hyperparameter tuning (because the system ends up fine-tuned to perform well on the validation data, and will likely not perform as well on unknown datasets). It is not the case in this example, but when this happens we must resist the temptation to tweak the hyperparameters to make the numbers look good on the test set; the improvements would be unlikely to generalize to new data.