<b>Importing necessary libraries</b>

In [None]:
import pandas as pd     # Data Wrangling & Preprocessing
import numpy as np      # Data Wrangling & Preprocessing
import seaborn as sns   # Plotting charts
import matplotlib.pyplot as plt    # Plotting charts
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.model_selection import train_test_split    #Splitting the data into training & testing set 
from sklearn.preprocessing import OneHotEncoder    #Encoding categorical variables
from sklearn.pipeline import Pipeline    # To create pipelines for preprocessing steps
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression    # Linear Regression Model
from sklearn.ensemble import RandomForestRegressor   # RandomForest Regressor Model
from sklearn.metrics import mean_squared_error    # MSE metric; its square root gives the RMSE
from sklearn.model_selection import cross_val_score    # To Compute validation score
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
import pickle    # To export the trained model

In [None]:
data = pd.read_csv(r'../input/california-housing-prices/housing.csv')

In [None]:
data.head()

Each row represents a district, and there are 10 attributes in the dataset. Now let’s use the info() method, which gives a quick description of the data: the total number of rows, the type of each attribute, and the number of non-null values:

In [None]:
data.info()

There are 20,640 instances in the dataset. Note that the total_bedrooms attribute has only 20,433 non-null values, which means 207 districts are missing this value. We will have to deal with that later.
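As a minimal sketch of how missing values can be counted directly, the snippet below uses a small made-up frame (the name `toy` and its values are assumptions for illustration, not rows from the housing file). It also shows that a zero counts as a non-null value, which is why info() reports non-null rather than non-zero counts.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the housing data
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 0.0, np.nan],
})

# Number of missing (NaN) entries in each column
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# count() returns the number of non-null entries; the 0.0 is included
non_null_bedrooms = toy["total_bedrooms"].count()
print(non_null_bedrooms)
```

On the real dataset the same call would be `data.isnull().sum()`.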

All attributes are numeric except for the ocean_proximity field. Its dtype is object, so it could hold any kind of Python object, but since the data was loaded from a CSV file it is a text attribute. You can find out which categories exist in that column and how many districts belong to each category by using the value_counts() method:

In [None]:
data['ocean_proximity'].value_counts()

Another quick way to get a feel for what kind of data you’re dealing with is to plot a histogram for each numerical attribute:

In [None]:
data.hist(bins=50, figsize=(16,12))
plt.show()

<b> Split the data into Training and Testing set</b><br>
Creating a test set is theoretically straightforward: select some instances at random, typically 20% of the dataset (or less if your dataset is very large), and set them aside:

In [None]:
train_set, test_set = train_test_split(data, test_size=0.2, random_state=42)

Let’s take a closer look at the histogram of median income: most values cluster around 1.5 to 6, but some go well beyond 6.

A purely random split can leave some income ranges under- or over-represented in the test set, so we group median income into categories (strata) and sample from each one. It is important to have a sufficient number of instances in your dataset for each stratum; otherwise, the estimate of a stratum's importance may be biased. This means that you should not have too many strata and that each stratum should be large enough:

In [None]:
data['income_cat'] = pd.cut(data['median_income'], bins=[0., 1.5, 3.0, 4.5, 6., np.inf], labels=[1, 2, 3, 4, 5])
data['income_cat'].hist()
plt.show()
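To make the binning concrete, here is a small sketch that applies the same bin edges to a handful of made-up income values (the `incomes` series is an assumption for illustration). By default pd.cut uses right-closed intervals, so a value of exactly 1.5 falls in category 1 and exactly 4.5 falls in category 3.

```python
import numpy as np
import pandas as pd

# Hypothetical median income values, including two that sit on a bin edge
incomes = pd.Series([0.9, 1.5, 2.7, 4.5, 5.1, 12.0])

# Same edges and labels as the income_cat column above
cats = pd.cut(incomes,
              bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
              labels=[1, 2, 3, 4, 5])

print(pd.DataFrame({"median_income": incomes, "income_cat": cats}))
```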

<b>Stratified Sampling on Dataset</b><br>
Stratified sampling divides the population into homogeneous subgroups called strata and samples from each stratum in proportion to its size, so that the test set is representative of the whole population.

In [None]:
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(data, data["income_cat"]):
    strat_train_set = data.loc[train_index]
    strat_test_set = data.loc[test_index]
print(strat_test_set['income_cat'].value_counts() / len(strat_test_set))
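One way to see what stratification buys you is to compare category proportions in a random test set and a stratified one. The sketch below uses synthetic data (the `toy` frame and its `stratum` column are assumptions for illustration) and swaps in `train_test_split` with the `stratify` argument as a shorter alternative to the explicit StratifiedShuffleSplit loop.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic data with three unevenly sized strata (10%, 30%, 60%)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "value": rng.normal(size=1000),
    "stratum": np.repeat([1, 2, 3], [100, 300, 600]),
})

# Purely random split versus a split stratified on the category column
_, random_test = train_test_split(toy, test_size=0.2, random_state=42)
_, strat_test = train_test_split(toy, test_size=0.2, random_state=42,
                                 stratify=toy["stratum"])

# Proportion of each stratum overall and in each test set
comparison = pd.DataFrame({
    "overall": toy["stratum"].value_counts(normalize=True).sort_index(),
    "random": random_test["stratum"].value_counts(normalize=True).sort_index(),
    "stratified": strat_test["stratum"].value_counts(normalize=True).sort_index(),
})
print(comparison)
```

The stratified column should track the overall proportions closely, while the random column can drift, especially for the smallest stratum.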

In [None]:
# Remove the income_cat attribute we added so the data returns to its original form:
for set_ in (strat_train_set, strat_test_set):
    set_.drop('income_cat', axis=1, inplace=True)
data = strat_train_set.copy()

In [None]:
data.head()

In [None]:
# Before building a model for house price prediction, visualize the districts by longitude and latitude:
data.plot(kind='scatter', x='longitude', y='latitude', alpha=0.4,
                s=data['population']/100, label='population', figsize=(10,7),
                cmap=plt.get_cmap('jet'), colorbar=True)

plt.ylabel("Latitude", fontsize=14)
plt.xlabel("Longitude", fontsize=14)
plt.legend() 
plt.show()

# Note: add the parameter c='median_house_value' if you're working in a Jupyter notebook. It did not work on Kaggle.

The graph shows the districts of California, where larger circles indicate areas with a larger population. When c='median_house_value' is passed, color encodes house price: red is expensive and blue is cheap.

<b> Finding Correlations</b><br>

Since the dataset is not too large, you can easily calculate the standard correlation coefficient between each pair of attributes using the corr() method:

In [None]:
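# numeric_only=True skips the text column ocean_proximity (required on pandas 2.0+)
corr_matrix = data.corr(numeric_only=True)
print(corr_matrix.median_house_value.sort_values(ascending=False))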

The correlation coefficient ranges from -1 to 1. A value close to 1 indicates a strong positive correlation, and a value close to -1 indicates a strong negative correlation. A value close to 0 means there is no linear correlation, although a nonlinear relationship may still exist.
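A short sketch with made-up columns illustrates these three cases (the column names `linear`, `inverse`, and `parabola` are assumptions for illustration). The parabola depends entirely on x, yet its correlation coefficient with x is essentially zero because the relationship is not linear.

```python
import numpy as np
import pandas as pd

# x is symmetric around zero so the parabola has no linear trend
x = np.linspace(-1.0, 1.0, 201)
frame = pd.DataFrame({
    "x": x,
    "linear": 2 * x + 1,   # increases with x
    "inverse": -3 * x,     # decreases with x
    "parabola": x ** 2,    # depends on x, but not linearly
})

# Pearson correlation of every column with x
corr_with_x = frame.corr()["x"]
print(corr_with_x.round(3))
```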

Now let’s look at the correlation matrix again after adding three new columns to the dataset: rooms per household, bedrooms per room, and population per household:

In [None]:
data["rooms_per_household"] = data["total_rooms"]/data["households"]
data["bedrooms_per_room"] = data["total_bedrooms"]/data["total_rooms"]
data["population_per_household"] = data["population"]/data["households"]

corr_matrix = data.corr(numeric_only=True)  # skip the text column ocean_proximity
print(corr_matrix["median_house_value"].sort_values(ascending=False))

In [None]:
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(12, 10))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 20, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr_matrix, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

<b> Data Preparation </b><br>

This is the most important step before we train a machine learning model for the task of house price prediction. Let’s perform all the necessary data transformations:
    

In [None]:
# Data Preparation
housing = strat_train_set.drop("median_house_value", axis=1)
housing_labels = strat_train_set["median_house_value"].copy()

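median = housing["total_bedrooms"].median()
# Assign the result back; fillna(inplace=True) on a selected column is deprecated in recent pandas
housing["total_bedrooms"] = housing["total_bedrooms"].fillna(median)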

housing_num = housing.drop("ocean_proximity", axis=1)

# Column positions of total_rooms, total_bedrooms, population and households in housing_num
rooms_ix, bedrooms_ix, population_ix, households_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room=True): # no *args or **kargs
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self  # nothing else to do
    def transform(self, X):
        rooms_per_household = X[:, rooms_ix] / X[:, households_ix]
        population_per_household = X[:, population_ix] / X[:, households_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]

As you can see, there are many data transformation steps that need to be performed in the correct order. Fortunately, Scikit-Learn provides the Pipeline class to help you with such sequences of transformations. Here is a small pipeline for numeric attributes:

In [None]:
num_pipeline = Pipeline([
    ('imputer',SimpleImputer(strategy="median")),
    ('attribs_adder', CombinedAttributesAdder()),
    ('std_scaler', StandardScaler()),
])
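To see what a pipeline does step by step, the sketch below runs a reduced version (imputer and scaler only, without the custom attribute adder) on a tiny made-up array. The array `X_toy` and the name `toy_pipeline` are assumptions for illustration. The missing entry is first replaced by the column median, and then each column is standardized to zero mean and unit variance.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric data with one missing value in the second column
X_toy = np.array([[1.0, 10.0],
                  [2.0, np.nan],
                  [3.0, 30.0],
                  [4.0, 20.0]])

toy_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),  # NaN -> column median (20.0)
    ("std_scaler", StandardScaler()),               # zero mean, unit variance
])

# fit_transform runs each step in order, feeding its output to the next step
X_toy_tr = toy_pipeline.fit_transform(X_toy)
print(X_toy_tr)
```

Because the imputed value equals the column median, which here is also the column mean, the formerly missing entry becomes 0 after scaling.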

In [None]:
housing_num_tr = num_pipeline.fit_transform(housing_num)

num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]
full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", OneHotEncoder(), cat_attribs),
])
housing_prepared = full_pipeline.fit_transform(housing)

In [None]:
# function to display scores
def display_scores(scores):
    print("Scores: ", scores)
    print("Mean: ", scores.mean())
    print("Standard Deviation: ", scores.std())

<b>Linear Regression for House Price Prediction with Python</b><br>

Now I will use the linear regression algorithm for the task of house price prediction with Python:

In [None]:
# Model Training - LR
lin_reg_model = LinearRegression()
lin_reg_model.fit(housing_prepared, housing_labels)

data = housing.iloc[:5]
labels = housing_labels.iloc[:5]
data_preparation = full_pipeline.transform(data)

In [None]:
# Predictions and RMSE
housing_predictions = lin_reg_model.predict(housing_prepared)
lin_mse = mean_squared_error(housing_labels, housing_predictions)
lin_rmse = np.sqrt(lin_mse)
print('RMSE value for Linear Regression: ', lin_rmse)

In [None]:
# Cross Validation
scores = cross_val_score(lin_reg_model, housing_prepared, housing_labels, scoring="neg_mean_squared_error", cv=10)
# The scores are negative MSE values, so negate them and take the square root to get RMSE
lin_rmse_scores = np.sqrt(-scores)
display_scores(lin_rmse_scores)
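Scikit-Learn's cross-validation expects a utility function where greater is better, which is why the scoring is the negative MSE. A minimal sketch on synthetic data (the arrays `X_demo` and `y_demo` are assumptions for illustration) shows how the raw scores come back non-positive and how to turn them into per-fold RMSE values:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic linear data with a small amount of Gaussian noise
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# One negative-MSE score per fold
neg_mse = cross_val_score(LinearRegression(), X_demo, y_demo,
                          scoring="neg_mean_squared_error", cv=5)

# Flip the sign before taking the square root to obtain RMSE
rmse_per_fold = np.sqrt(-neg_mse)
print("Negative MSE per fold:", neg_mse)
print("RMSE per fold:", rmse_per_fold)
print("Mean RMSE:", rmse_per_fold.mean())
```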

<b> Random Forest Regressor</b>

In [None]:
# Model Training - RFR
forest_reg = RandomForestRegressor(n_estimators=100, random_state=42)
forest_reg.fit(housing_prepared, housing_labels)

In [None]:
# Predictions and RMSE
housing_predictions = forest_reg.predict(housing_prepared)
forest_mse = mean_squared_error(housing_labels, housing_predictions)
forest_rmse = np.sqrt(forest_mse)
print('RMSE value for Random Forest Regressor: ', forest_rmse)

In [None]:
# Cross Validation
forest_scores = cross_val_score(forest_reg, housing_prepared, housing_labels,
                                scoring="neg_mean_squared_error", cv=10)
forest_rmse_scores = np.sqrt(-forest_scores)
display_scores(forest_rmse_scores)