Learning how to predict California housing prices with linear regression, using the California Housing dataset available through scikit-learn's fetch_california_housing loader.

In [75]:
from sklearn.linear_model import LinearRegression
from sklearn.datasets import fetch_california_housing
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score, mean_squared_error


In [76]:
# Loading the housing dataset
dataset = fetch_california_housing()
# Printing the shape of data
print(dataset.data.shape)

(20640, 8)


In [77]:
# Dataset Description
print(dataset.DESCR)

.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

    :Number of Instances: 20640

    :Number of Attributes: 8 numeric, predictive attributes and the target

    :Attribute Information:
        - MedInc        median income in block group
        - HouseAge      median house age in block group
        - AveRooms      average number of rooms per household
        - AveBedrms     average number of bedrooms per household
        - Population    block group population
        - AveOccup      average number of household members
        - Latitude      block group latitude
        - Longitude     block group longitude

    :Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).

This dataset was derived

In [78]:
print(dataset.data[0])

[   8.3252       41.            6.98412698    1.02380952  322.
    2.55555556   37.88       -122.23      ]


In [79]:
# Splitting the data into training (80%) and test (20%) sets. Using random_state=67 so the split is reproducible.
X = dataset['data']
y = dataset['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=67)
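
As a quick check on the split sizes, the arithmetic can be sketched without loading any data. This assumes a fractional test_size is rounded up to a whole number of test samples and that the training set takes the remainder; the 20640 comes from the shape printed earlier.

```python
import math

n_samples = 20640   # rows reported by dataset.data.shape
test_size = 0.20    # fraction passed to train_test_split

# Assumption: the test count is the ceiling of test_size * n_samples,
# and the training set gets whatever is left.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```

Under these assumptions the model is trained on 16512 rows and evaluated on 4128.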

In [80]:
# Creating the pipeline. Since the features are on very different scales, StandardScaler is the first step of the pipeline.
pipeline = make_pipeline(StandardScaler(), LinearRegression())
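
For readers unfamiliar with StandardScaler, the per-column transformation can be sketched in plain Python: subtract the column mean and divide by the population standard deviation. The `population` values below are made up for illustration, and this is a sketch of the idea rather than scikit-learn's implementation.

```python
from statistics import mean, pstdev

# Hypothetical toy column (not taken from the dataset).
population = [300.0, 1200.0, 2500.0, 800.0]

mu = mean(population)
sigma = pstdev(population)  # population standard deviation (divides by n)

# Standardize: each value becomes its distance from the mean in std units.
scaled = [(x - mu) / sigma for x in population]

print(scaled)
print(mean(scaled), pstdev(scaled))  # approximately 0 and 1
```

Inside the pipeline, the mean and standard deviation are learned from the training data during fit and then reused unchanged when predicting on the test data.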

In [81]:
# Fitting the model
pipeline.fit(X_train, y_train)

In [82]:
# Predicting the housing prices for the Test Data
y_pred = pipeline.predict(X_test)

In [83]:
# Evaluating the model on the test set with R^2 and mean squared error
r2_s = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)

print(r2_s, mse)

0.5891435539852219 0.5472825858911409
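
To make these numbers easier to read, the MSE can be converted to a root mean squared error in dollars. This is a small sketch that reuses the two values printed above and relies on the dataset description's statement that the target is in units of $100,000.

```python
import math

r2 = 0.5891435539852219    # test-set R^2 printed above
mse = 0.5472825858911409   # test-set MSE printed above

# RMSE is in the same units as the target (hundreds of thousands of dollars).
rmse = math.sqrt(mse)
rmse_dollars = rmse * 100_000

print(f"RMSE: {rmse:.4f} (about ${rmse_dollars:,.0f})")
print(f"Share of variance explained: {r2:.1%}")
```

The RMSE comes out at roughly 0.74, so a typical prediction error is on the order of $74,000, and the model explains a little under 59% of the variance in the test targets.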


In [84]:
# Printing the learned weights, one per feature. They apply to the standardized features.
pipeline.named_steps['linearregression'].coef_

array([ 0.83450018,  0.11723744, -0.27429123,  0.31103746, -0.00381503,
       -0.03973839, -0.89942805, -0.87308872])
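
The raw array is easier to interpret when each weight is paired with its feature name. The sketch below copies the coefficients printed above and assumes they follow the attribute order given in the dataset description; it then ranks the features by absolute weight.

```python
# Feature names in the order listed in the dataset description (assumed
# to match the column order of dataset.data).
feature_names = ["MedInc", "HouseAge", "AveRooms", "AveBedrms",
                 "Population", "AveOccup", "Latitude", "Longitude"]

# Coefficients copied from the output above.
coefs = [0.83450018, 0.11723744, -0.27429123, 0.31103746,
         -0.00381503, -0.03973839, -0.89942805, -0.87308872]

# Rank features by the magnitude of their weight, largest first.
ranked = sorted(zip(feature_names, coefs), key=lambda p: abs(p[1]), reverse=True)

for name, weight in ranked:
    print(f"{name:<12}{weight:+.4f}")
```

Because the features were standardized, each weight is the change in predicted median house value (in $100,000s) for a one standard deviation increase in that feature, with the others held fixed. On this ranking Latitude, Longitude, and MedInc carry the largest weights, while Population contributes almost nothing.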