## California Housing Price Prediction

The original dataset appeared in R. Kelley Pace and Ronald Barry, [“Sparse Spatial Autoregressions,” Statistics
& Probability Letters 33, no. 3 (1997): 291–297](http://www.spatial-statistics.com/pace_manuscripts/spletters_ms_dir/statistics_prob_lets/html/ms_sp_lets1.html).

The task is to build a model of housing prices in California using the California census dataset. This data has metrics such as the population, median income, median housing price, and so on for each block group (district) in California.


We need to train a model that predicts a district's median housing price from the district's other attributes. We will use the census data for this purpose.

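So the task at hand is a typical supervised learning task, since each training instance comes with a label (the district's median housing price). Moreover, it is a regression task, since we are asked to predict a value. More specifically, it is a multiple regression task, because the model uses several features, and a univariate one, because it predicts a single value per district. We will be using Root Mean Square Error (RMSE) as our performance measure.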
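
RMSE is the square root of the mean squared difference between the predictions and the labels, so large errors weigh more heavily than small ones. The following is a minimal sketch with NumPy; the function name `rmse` and the sample values are illustrative assumptions and not part of the original notebook.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Square Error between labels and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_pred - y_true) ** 2))

# Hypothetical median house values (USD) and model predictions
labels = [200000.0, 150000.0, 340000.0]
predictions = [210000.0, 140000.0, 310000.0]
error = rmse(labels, predictions)
print(error)
```

The result is expressed in the same unit as the label (USD here), which makes it easy to interpret.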

Please "Upvote" the notebook if you find it interesting and helpful. You can also "Fork" it at the top right of this screen to run the notebook yourself and build each of the examples.

In [None]:
#Importing required libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix

%matplotlib inline

In [None]:
# Loading the dataset
housing  = pd.read_csv('../input/california-housing-prices/housing.csv')
housing.head()

In [None]:
# To get a quick description of the data, in particular the total number of rows, each attribute’s type, and the number of non-null values
housing.info()

There are **20,640** instances in the dataset. Notice that the ***total_bedrooms*** attribute has only 20,433 non-null values, meaning that 207 districts are missing this feature. All attributes are numerical, except the ***ocean_proximity*** field. 

In [None]:
housing["ocean_proximity"].value_counts()

In [None]:
# Summary statistics of the numerical attributes.
housing.describe()

In [None]:
# To plot a histogram to understand the data
housing.hist(bins=50, figsize=(20,15))
plt.savefig("attribute_histogram_plots.png")
plt.show()

### Creating a test set
We randomly select 20% of the dataset as a test set using Scikit-Learn's *train_test_split* function.

In [None]:
from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)
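
Conceptually, a random split shuffles the row positions and sets aside a fixed share of them as the test set. The sketch below illustrates that idea with NumPy on a small made-up DataFrame. `simple_split` is a hypothetical helper, not Scikit-Learn's implementation, and it will not select the same rows as *train_test_split*.

```python
import numpy as np
import pandas as pd

def simple_split(data, test_ratio, seed=42):
    """Shuffle row positions and cut off a test_ratio share as the test set."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(len(data))
    test_size = int(len(data) * test_ratio)
    test_idx = shuffled[:test_size]
    train_idx = shuffled[test_size:]
    return data.iloc[train_idx], data.iloc[test_idx]

# Made-up data standing in for the housing DataFrame
toy = pd.DataFrame({"median_income": np.linspace(1.0, 10.0, 50)})
toy_train, toy_test = simple_split(toy, test_ratio=0.2)
print(len(toy_train), len(toy_test))
```

Fixing the seed keeps the split reproducible across runs, which is the role `random_state=42` plays in the cell above.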

In [None]:
test_set.head()

In [None]:
# The median income is a very important attribute for predicting median housing prices.
# We need to ensure that the test set is representative of the various income categories in the whole dataset.
# Therefore, we create an income category column that divides median_income into 5 categories.

housing["income_cat"] = pd.cut(housing["median_income"],
                               bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                               labels=[1, 2, 3, 4, 5])

housing["income_cat"].value_counts()
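
To see how `pd.cut` assigns categories, here is a small sketch using the same bins on a handful of made-up income values. By default each bin includes its right edge and excludes its left edge, so a value of exactly 1.5 falls into category 1.

```python
import numpy as np
import pandas as pd

# Made-up income values, chosen to land on and between the bin edges
incomes = pd.Series([0.5, 1.5, 2.0, 4.5, 5.0, 12.0])
cats = pd.cut(incomes,
              bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
              labels=[1, 2, 3, 4, 5])
print(cats.tolist())  # [1, 1, 2, 3, 4, 5]
```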

In [None]:
housing["income_cat"].hist()

In [None]:
# Now we need to do stratified sampling based on the income category. 
from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]

In [None]:
# Income-category proportions in the full dataset.
# The stratified train and test sets should show nearly the same proportions.
housing["income_cat"].value_counts() / len(housing)
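
One way to check the benefit of stratification is to compare the income-category proportions of a stratified test set and a purely random test set against the overall proportions. The sketch below does this on synthetic incomes (an assumption made so that the example runs on its own); the names `demo`, `income_cat_proportions`, and `compare` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit, train_test_split

# Synthetic, skewed incomes standing in for housing["median_income"]
rng = np.random.default_rng(42)
demo = pd.DataFrame({"median_income": rng.lognormal(mean=1.2, sigma=0.5, size=5000)})
demo["income_cat"] = pd.cut(demo["median_income"],
                            bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                            labels=[1, 2, 3, 4, 5])

def income_cat_proportions(data):
    """Share of rows in each income category."""
    return data["income_cat"].value_counts(normalize=True).sort_index()

# Stratified test set
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(demo, demo["income_cat"]))
demo_strat_test = demo.iloc[test_idx]

# Purely random test set
_, demo_rand_test = train_test_split(demo, test_size=0.2, random_state=42)

compare = pd.DataFrame({
    "overall": income_cat_proportions(demo),
    "stratified": income_cat_proportions(demo_strat_test),
    "random": income_cat_proportions(demo_rand_test),
})
compare["strat_abs_error"] = (compare["stratified"] - compare["overall"]).abs()
compare["rand_abs_error"] = (compare["random"] - compare["overall"]).abs()
print(compare)
```

The same comparison could be run on the real data by replacing `demo` with `housing` before the *income_cat* column is dropped.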

In [None]:
# Drop the income_cat column from the datasets
for set_ in (strat_train_set, strat_test_set):
    set_.drop("income_cat", axis=1, inplace=True)

### Discover and visualize the data to gain insights

In [None]:
housing = strat_train_set.copy()
housing.plot(kind="scatter", x="longitude", y="latitude", figsize = (8,6), alpha=0.1)
plt.savefig("visualization_plot.png")

The plot shows the geographic density of districts by longitude and latitude; with alpha=0.1, high-density areas appear darker.

In [None]:
# Now let's take housing prices into consideration
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
    s=housing["population"]/100, label="population", figsize=(10,7),
    c="median_house_value", cmap=plt.get_cmap("jet"), colorbar=True,
    sharex=False)
plt.title('California housing prices')
plt.legend()
plt.savefig("housing_prices_scatterplot.png")

The radius of each circle represents the district’s population (s), and the color represents the price (c).

We used a predefined color map (option cmap) called jet, which ranges from blue (low prices) to red (high prices).

In [None]:
# If you are familiar with the map of California, you can see that housing prices are high near the coast.

# Optional
import matplotlib.image as mpimg
california_img=mpimg.imread('../input/california-housing-feature-engineering/california.png')
ax = housing.plot(kind="scatter", x="longitude", y="latitude", figsize=(10,7),
                       s=housing['population']/100, label="Population",
                       c="median_house_value", cmap=plt.get_cmap("jet"),
                       colorbar=False, alpha=0.4,
                      )
plt.imshow(california_img, extent=[-124.55, -113.80, 32.45, 42.05], alpha=0.5,
           cmap=plt.get_cmap("jet"))
plt.ylabel("Latitude", fontsize=14)
plt.xlabel("Longitude", fontsize=14)

prices = housing["median_house_value"]
tick_values = np.linspace(prices.min(), prices.max(), 11)
# Fix 11 tick positions so that they match the 11 labels set below
cbar = plt.colorbar(ticks=tick_values/prices.max())
cbar.ax.set_yticklabels(["$%dk"%(round(v/1000)) for v in tick_values], fontsize=14)
cbar.set_label('Median House Value', fontsize=16)

plt.legend(fontsize=16)
plt.savefig("california_housing_prices_plot.png")
plt.show()

### Let's look for correlations between attributes

In [None]:
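# Restrict to numeric columns: ocean_proximity is text and cannot be correlated
corr_matrix = housing.select_dtypes(include=[np.number]).corr()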
corr_matrix["median_house_value"].sort_values(ascending=False)
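
By default, `corr()` computes the standard correlation coefficient (Pearson's r), which ranges from -1 to 1 and only measures linear relationships. As a sketch, the coefficient can be computed by hand and compared with pandas on made-up values; `pearson_r` is an illustrative helper, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_c = x - x.mean()
    y_c = y - y.mean()
    return (x_c * y_c).sum() / np.sqrt((x_c ** 2).sum() * (y_c ** 2).sum())

# Made-up values for two attributes
toy = pd.DataFrame({
    "median_income": [1.5, 2.5, 3.0, 4.5, 6.0],
    "median_house_value": [90000, 140000, 150000, 260000, 330000],
})
r_manual = pearson_r(toy["median_income"], toy["median_house_value"])
r_pandas = toy["median_income"].corr(toy["median_house_value"])
print(r_manual, r_pandas)
```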

In [None]:
# Using corr_matrix, we can pick the attributes that are most likely to be correlated with the median house value.

attributes = ["median_house_value", "median_income", "total_rooms","housing_median_age"]
scatter_matrix(housing[attributes], figsize=(12, 8))
plt.savefig("scatter_matrix_plot.png")

In [None]:
# The median house value appears to be most strongly correlated with median_income.

housing.plot(kind="scatter", x="median_income", y="median_house_value", alpha=0.1)
plt.axis([0, 16, 0, 550000])
plt.savefig("income_vs_house_value_scatterplot.png")

#### Observations from the scatter plot:
1. The correlation is indeed very strong: you can clearly see the upward trend, and the points are not too dispersed.
2. The price cap that we noticed earlier is clearly visible as a horizontal line at USD 500,000.
3. There are less obvious horizontal lines at USD 450,000 and USD 350,000.

The total number of rooms or bedrooms in a district is not very useful on its own. The number of rooms per household, bedrooms per room, and population per household seem like more useful attributes.

In [None]:
housing["rooms_per_household"] = housing["total_rooms"]/housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"]/housing["total_rooms"]
housing["population_per_household"]=housing["population"]/housing["households"]

In [None]:
# Finding the correlation
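# Restrict to numeric columns: ocean_proximity is text and cannot be correlated
corr_matrix = housing.select_dtypes(include=[np.number]).corr()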
corr_matrix["median_house_value"].sort_values(ascending=False)

## Prepare the Data for Machine Learning Algorithms

In [None]:
housing = strat_train_set.drop("median_house_value", axis=1) # drop labels for training set
housing_labels = strat_train_set["median_house_value"].copy()

In [None]:
sample_incomplete_rows = housing[housing.isnull().any(axis=1)]
sample_incomplete_rows.head()

### Data Cleaning

Only the total_bedrooms attribute has missing values. We can delete those instances, delete the total_bedrooms attribute entirely, or replace the missing values with the median.

This dataset has missing values in only one attribute, but we cannot be sure that new data will not have missing values elsewhere. Therefore, we use Scikit-Learn's SimpleImputer class and fit it on all the numerical attributes.

In [None]:
#housing.dropna(subset=["total_bedrooms"]) # option 1
#housing.drop("total_bedrooms", axis=1) # option 2
#housing["total_bedrooms"].fillna(housing["total_bedrooms"].median()) # option 3

# We are using Scikit-Learn's SimpleImputer class here.

from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="median")

#Remove the text attribute because median can only be calculated on numerical attributes
housing_num = housing.drop('ocean_proximity', axis=1)

imputer.fit(housing_num)

# The imputer has simply computed the median of each attribute and stored the result in its statistics_ instance variable.
imputer.statistics_

In [None]:
#The trained imputer can transform the training set by replacing missing values by the learned medians
X = imputer.transform(housing_num)

In [None]:
# X is a NumPy array; convert it back to a DataFrame with the original column names and index.
housing_tr = pd.DataFrame(X, columns=housing_num.columns,index=housing.index)
housing_tr.head()
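
To make the imputation step concrete, here is a small sketch on made-up data: the imputer learns one median per column during `fit` and uses it to fill the gaps during `transform`. The DataFrame `toy` and its values are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up data with one missing total_bedrooms value
toy = pd.DataFrame({
    "total_rooms": [100.0, 200.0, 300.0, 400.0],
    "total_bedrooms": [20.0, np.nan, 60.0, 40.0],
})
toy_imputer = SimpleImputer(strategy="median")
filled = toy_imputer.fit_transform(toy)

print(toy_imputer.statistics_)  # learned medians: 250.0 and 40.0
print(filled[1, 1])             # the missing value is replaced by 40.0
```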

### Handling Text and Categorical Attributes

The categorical attribute *ocean_proximity* needs to be converted to numbers. One option is Scikit-Learn's OrdinalEncoder class, which encodes the categories as an integer array, but many algorithms would then treat nearby integers as similar categories, which is not the case here.

Instead, we use the OneHotEncoder class to convert the categories into one-hot vectors, creating one binary attribute per category.


In [None]:
from sklearn.preprocessing import OneHotEncoder

housing_cat = housing[["ocean_proximity"]]
cat_encoder = OneHotEncoder()
# fit_transform returns a SciPy sparse matrix by default; convert it to a dense NumPy array
housing_cat_1hot = cat_encoder.fit_transform(housing_cat).toarray()
housing_cat_1hot
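
As a small illustration of one-hot encoding, the sketch below encodes four made-up rows. The encoder sorts the categories it finds and assigns one column to each, so every row contains a single 1.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up rows using three ocean_proximity categories
toy_cat = pd.DataFrame({"ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "ISLAND"]})
toy_encoder = OneHotEncoder()
toy_1hot = toy_encoder.fit_transform(toy_cat).toarray()

print(toy_encoder.categories_)  # column order: INLAND, ISLAND, NEAR BAY
print(toy_1hot)
```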

Let's create a custom transformer to add extra attributes using Scikit-Learn's FunctionTransformer class that lets you easily create a transformer based on a transformation function.

In [None]:
from sklearn.preprocessing import FunctionTransformer

# get the right column indices: safer than hard-coding indices 3, 4, 5, 6
rooms_ix, bedrooms_ix, population_ix, household_ix = [
    list(housing.columns).index(col)
    for col in ("total_rooms", "total_bedrooms", "population", "households")]

def add_extra_features(X, add_bedrooms_per_room=True):
    rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
    population_per_household = X[:, population_ix] / X[:, household_ix]
    if add_bedrooms_per_room:
        bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
        return np.c_[X, rooms_per_household, population_per_household,
                     bedrooms_per_room]
    else:
        return np.c_[X, rooms_per_household, population_per_household]

attr_adder = FunctionTransformer(add_extra_features, validate=False,
                                 kw_args={"add_bedrooms_per_room": False})
housing_extra_attribs = attr_adder.fit_transform(housing.values)
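
The same mechanism can be seen in isolation on a tiny array. In this sketch, `add_ratio` is a hypothetical function that appends one ratio column, and `kw_args` passes its keyword arguments through the transformer; the values are made up.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

def add_ratio(X, numerator_ix=0, denominator_ix=1):
    """Append the ratio of two columns as a new last column."""
    ratio = X[:, numerator_ix] / X[:, denominator_ix]
    return np.c_[X, ratio]

# Columns: total_rooms, households (made-up values)
toy_X = np.array([[880.0, 126.0], [7099.0, 1138.0], [1467.0, 177.0]])
ratio_adder = FunctionTransformer(add_ratio, validate=False,
                                  kw_args={"numerator_ix": 0, "denominator_ix": 1})
toy_extra = ratio_adder.fit_transform(toy_X)
print(toy_extra.shape)  # (3, 3): the two original columns plus the ratio
```

Because the transformer exposes `fit` and `transform`, it can be placed in a Pipeline like any built-in preprocessing step.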

Let's build a pipeline for preprocessing the numerical attributes:

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy="median")),
        ('attribs_adder', FunctionTransformer(add_extra_features, validate=False)),
        ('std_scaler', StandardScaler()),
    ])

housing_num_tr = num_pipeline.fit_transform(housing_num)

In [None]:
housing_num_tr
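
The last pipeline step, StandardScaler, standardizes each column by subtracting its mean and dividing by its standard deviation. The sketch below checks this against a manual computation on made-up values.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two made-up columns on very different scales
toy_X = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0], [4.0, 700.0]])
scaled = StandardScaler().fit_transform(toy_X)

# Manual standardization, using the population standard deviation
manual = (toy_X - toy_X.mean(axis=0)) / toy_X.std(axis=0)
print(scaled.mean(axis=0), scaled.std(axis=0))
```

After scaling, each column has zero mean and unit variance, so attributes with large raw values no longer dominate.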

In [None]:
from sklearn.compose import ColumnTransformer

num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]

full_pipeline = ColumnTransformer([
        ("num", num_pipeline, num_attribs),
        ("cat", OneHotEncoder(), cat_attribs),
    ])

housing_prepared = full_pipeline.fit_transform(housing)
housing_prepared.shape
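
A fitted ColumnTransformer can be reused on new rows with `transform`, which applies the medians, scaling statistics, and categories learned during `fit_transform`. The sketch below shows this on a made-up DataFrame with a reduced pipeline; the `toy_` names and all values are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up training rows with one missing value and three categories
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
    "ocean_proximity": ["NEAR BAY", "INLAND", "INLAND", "NEAR OCEAN"],
})
toy_num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])
toy_pipeline = ColumnTransformer([
    ("num", toy_num_pipeline, ["total_rooms", "total_bedrooms"]),
    ("cat", OneHotEncoder(), ["ocean_proximity"]),
])
toy_prepared = toy_pipeline.fit_transform(toy)

# New rows are transformed with the statistics learned above, without refitting
new_rows = pd.DataFrame({
    "total_rooms": [1000.0],
    "total_bedrooms": [np.nan],
    "ocean_proximity": ["INLAND"],
})
new_prepared = toy_pipeline.transform(new_rows)
print(toy_prepared.shape, new_prepared.shape)  # 2 numeric + 3 one-hot columns each
```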