**DS 301: Applied Data Modeling and Predictive Analysis**

**Lecture 4 – End-to-End Machine Learning Project**

# California Housing Prices Dataset 

Nok Wongpiromsarn, 26 August 2020

**Credit:** A large portion of the code is taken from Chapter 2 of Aurélien Géron, Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow.

## Step 2: Get the Data

**2.1 Load the data using pandas**

In [None]:
import os
import pandas as pd

data_path = os.path.join("datasets", "housing.csv")
data = pd.read_csv(data_path)

**2.2 Take a Quick Look at the Data Structure**

Examine the top 10 rows of the data

In [None]:
data.head(10)

Get a quick description of the data

In [None]:
data.info()

Get the count of each value of the categorical attribute

In [None]:
data["ocean_proximity"].value_counts()

Get a summary of the numerical attributes

In [None]:
data.describe()

Plot a histogram of each numerical attribute

In [None]:
import matplotlib.pyplot as plt
data.hist(bins=50, figsize=(20,15))
plt.show()

**2.3 Create a Test Set**

Split the data into training and test sets

In [None]:
from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(data, test_size=0.2, random_state=42)

Use stratified sampling to ensure that the test set is representative of the various categories of median income

In [None]:
# Create an income category attribute by arranging the median income into bins
# (0, 1.5], (1.5, 3.0], (3.0, 4.5], (4.5, 6.0], (6.0, inf]
import numpy as np
data["income_cat"] = pd.cut(data["median_income"],
                            bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                            labels=[1, 2, 3, 4, 5])

# Do stratified sampling based on income category
from sklearn.model_selection import StratifiedShuffleSplit
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(data, data["income_cat"]):
    # split() yields positional indices, so select rows with iloc;
    # copy() avoids SettingWithCopyWarning when the sets are modified later
    train_set = data.iloc[train_index].copy()
    test_set = data.iloc[test_index].copy()

# Remove the income_cat attribute so the data is back to its original state
# Here, axis=1 indicates dropping labels from columns
for s in (train_set, test_set):
    s.drop("income_cat", axis=1, inplace=True)
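
To see why stratification matters, the sketch below compares income category proportions in a purely random test set and a stratified one. It uses synthetic, log-normally distributed incomes as a stand-in for `median_income` (an assumption made so that the cell runs without `housing.csv`); the names `toy`, `cat_proportions` and `comparison` are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit, train_test_split

# Synthetic stand-in for median_income (assumption: not the real housing data)
rng = np.random.default_rng(42)
toy = pd.DataFrame({"median_income": rng.lognormal(mean=1.2, sigma=0.5, size=5000)})
toy["income_cat"] = pd.cut(toy["median_income"],
                           bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                           labels=[1, 2, 3, 4, 5])

def cat_proportions(df):
    """Fraction of rows that fall in each income category."""
    return df["income_cat"].value_counts(normalize=True).sort_index()

# Purely random 80/20 split
_, random_test = train_test_split(toy, test_size=0.2, random_state=42)

# Stratified 80/20 split on the income category
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(toy, toy["income_cat"]))
strat_test = toy.iloc[test_idx]

comparison = pd.DataFrame({
    "overall": cat_proportions(toy),
    "random": cat_proportions(random_test),
    "stratified": cat_proportions(strat_test),
})
print(comparison)
```

The stratified column should track the overall proportions closely, while the random column is free to drift, especially for the smaller categories.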

## Step 3: Discover and Visualize the Data to Gain Insights

**3.1 Visualize the geographical data**

In [None]:
train_set.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
              s=train_set["population"]/100, label="population", figsize=(10,7),
              c="median_house_value", cmap=plt.get_cmap("jet"), colorbar=True)

**3.2 Look for correlations**

In [None]:
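# Compute the correlations between numerical attributes
# (ocean_proximity is text, and recent pandas versions raise an error on it)
corr_matrix = train_set.select_dtypes(include="number").corr()
print(corr_matrix["median_house_value"].sort_values(ascending=False))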

# Plot pairwise relationships of relevant features.
from pandas.plotting import scatter_matrix

attributes = ["median_house_value",
              "median_income",
              "total_rooms",
              "housing_median_age",
             ]
scatter_matrix(train_set[attributes], figsize=(12,8));
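
Keep in mind that `corr()` computes Pearson's correlation coefficient by default, which only measures linear relationships. The toy example below (made-up data, not taken from the housing set) shows a perfectly linear attribute with a coefficient of 1 and a perfectly quadratic one with a coefficient of about 0, which is one reason to inspect the scatter plots as well.

```python
import pandas as pd

# Made-up data: one linear and one symmetric quadratic function of x
toy = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0]})
toy["linear"] = 2 * toy["x"] + 1
toy["quadratic"] = (toy["x"] - 3) ** 2

# Pearson correlation of every column with x
corr = toy.corr()
print(corr["x"])
```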

**3.3 Experimenting with Attribute Combinations**

In [None]:
train_set["bedrooms_per_room"] = train_set["total_bedrooms"]/train_set["total_rooms"]

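# Restrict to numerical columns; recent pandas versions raise an error on text columns
print(train_set.select_dtypes(include="number").corr()["median_house_value"].sort_values(ascending=False))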
attributes = ["median_house_value",
              "total_bedrooms",
              "total_rooms",
              "bedrooms_per_room",
             ]
scatter_matrix(train_set[attributes], figsize=(12,8));

## Step 4: Prepare the Data for Machine Learning Algorithms

**4.1 Data Cleaning**

In [None]:
train_set.info()

# Drop the experimental bedrooms_per_room column added in 3.3
train_set.drop("bedrooms_per_room", axis=1, inplace=True)
# Drop instances (rows) with a missing total_bedrooms value
train_set.dropna(subset=["total_bedrooms"], inplace=True)

train_set.info()

Separate features and labels

In [None]:
housing_features = train_set.drop("median_house_value", axis=1)
housing_labels = train_set["median_house_value"].copy()

**4.2 Handling Text and Categorical Attributes**

In [None]:
from sklearn.preprocessing import OneHotEncoder

# First, create a binary (one-hot) vector representation of the ocean_proximity feature
housing_cat = housing_features[["ocean_proximity"]]
cat_encoder = OneHotEncoder()
housing_cat_1hot = cat_encoder.fit_transform(housing_cat).toarray()

print(cat_encoder.categories_)
print(housing_cat_1hot)

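# Construct a data frame for the one-hot encoded columns
# (get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() needs scikit-learn >= 1.0)
housing_cat_1hot_df = pd.DataFrame(housing_cat_1hot, columns=cat_encoder.get_feature_names_out())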

# Reset indices to make sure that concat works properly
housing_features.reset_index(drop=True, inplace=True)
housing_cat_1hot_df.reset_index(drop=True, inplace=True)

# Replace the original "ocean_proximity" column with its one hot encoding 
housing_features = pd.concat([housing_features, housing_cat_1hot_df], axis=1).drop(['ocean_proximity'], axis=1)
housing_features.info()
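
To make the encoding concrete, here is a minimal sketch on a four-row toy column (made-up values, not the real data). Each category gets its own column, and each row has a single 1 in the column of its category.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy categorical column (illustrative values only)
toy = pd.DataFrame({"ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "ISLAND"]})

encoder = OneHotEncoder()
encoded = encoder.fit_transform(toy).toarray()  # sparse matrix -> dense array

print(encoder.categories_[0])  # categories are sorted alphabetically
print(encoded)
```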

**4.3 Feature Scaling**

In [None]:
from sklearn.preprocessing import StandardScaler

scale_1hot = True

if scale_1hot:
    # Here we apply scaling to all the columns, including one-hot-encoded ones
    scaled_values = StandardScaler().fit_transform(housing_features.values)
    housing_features = pd.DataFrame(scaled_values, index=housing_features.index, columns=housing_features.columns)
else:
    # Apply scaling only to the first 8 columns, which are the numerical features
    features = housing_features.columns[:8]
    housing_features[features] = StandardScaler().fit_transform(housing_features[features])

pd.options.display.float_format = "{:.2f}".format
housing_features.describe()
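
Standardization subtracts the column mean and divides by the column standard deviation, so every scaled column ends up with mean 0 and standard deviation 1. A minimal sketch on a small made-up array, checking the manual formula against `StandardScaler` (which uses the population standard deviation, i.e. `ddof=0`):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two made-up features on very different scales
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

# z = (x - mean) / std, computed column by column
manual = (X - X.mean(axis=0)) / X.std(axis=0)
scaled = StandardScaler().fit_transform(X)

print(scaled)
print(np.allclose(manual, scaled))
```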

## Step 5: Select and Train a Model

**5.1 Training and Evaluating on the Training Set**

In [None]:
# Train a Linear Regression model
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(housing_features, housing_labels)

# Measure the RMSE
from sklearn.metrics import mean_squared_error

predictions = lin_reg.predict(housing_features)
lin_rmse = np.sqrt(mean_squared_error(housing_labels, predictions))
print(lin_rmse)
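
The RMSE is the square root of the average squared prediction error, so it is expressed in the same unit as the label (here, dollars). A small worked example with made-up numbers:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up labels and predictions; the errors are -10, 10 and -30
y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 330.0])

# sqrt(mean(error^2)) = sqrt((100 + 100 + 900) / 3)
manual_rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
sklearn_rmse = np.sqrt(mean_squared_error(y_true, y_pred))

print(manual_rmse, sklearn_rmse)
```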

**5.2 K-fold cross-validation**

Split the training set into 10 subsets (folds), then train and evaluate the model 10 times, each time holding out a different fold for evaluation and training on the other 9

In [None]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(lin_reg, housing_features, housing_labels, 
                         scoring="neg_mean_squared_error", cv=10)
rmse_scores = np.sqrt(-scores)
np.set_printoptions(formatter={'float_kind':"{:.2f}".format})
print("Scores: {}".format(rmse_scores))
print("Mean: {:.2f}".format(rmse_scores.mean()))
print("Standard deviation: {:.2f}".format(rmse_scores.std()))
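
For intuition, the sketch below spells out roughly what `cross_val_score` does for a regressor when `cv` is an integer: it splits the data with an unshuffled `KFold`, fits a fresh model on 9 folds, and scores it on the held-out fold. The data comes from `make_regression` (synthetic, an assumption made so that the cell runs on its own), and `manual_rmse` and `auto_rmse` are illustrative names.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (assumption: stands in for the housing features)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

# Manual 10-fold loop: train on 9 folds, evaluate on the remaining one
manual_rmse = []
for train_idx, val_idx in KFold(n_splits=10).split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    preds = model.predict(X[val_idx])
    manual_rmse.append(np.sqrt(mean_squared_error(y[val_idx], preds)))
manual_rmse = np.array(manual_rmse)

# The same evaluation in one call
scores = cross_val_score(LinearRegression(), X, y,
                         scoring="neg_mean_squared_error", cv=10)
auto_rmse = np.sqrt(-scores)

print(np.allclose(manual_rmse, auto_rmse))
```

The scoring function returns the negative MSE because scikit-learn expects higher scores to be better, which is why the sign is flipped before taking the square root.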