California Housing

Comparison of Linear Regression and a Decision Tree regressor for predicting Median House Value after some feature engineering

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
data = pd.read_csv("/kaggle/input/california-housing-prices/housing.csv", sep=",")

In [None]:
data.head()

In [None]:
data.info()

In [None]:
data.describe()

In [None]:
data.isna().sum()

The data describes blocks of regions in California, with information about the households in those regions from Census data:
- Longitude and latitude indicating where in CA the region is located
- housing_median_age - median age of the houses in the region, not of the residents (a minimum of 1 is plausible for a newly built area)
- total_rooms - total number of rooms in the region (with a minimum of 2, there are likely some incorrect values here)
- total_bedrooms - total number of bedrooms in the region; likely to be highly correlated with total_rooms, so possibly only one of the two needs to be kept
- population - total population in the region (with a minimum of 3, there are likely some incorrect values here too)
- median_income - the scale is not clear from the data itself; the column is usually documented as being in tens of thousands of USD
- median_house_value - from 14,999 to 500,001 (seems that min and max may be truncated)

total_bedrooms has 207 missing values.

In [None]:
sns.pairplot(data)

Total rooms, total bedrooms, population, households are very highly correlated with each other.

Longitude, latitude, and housing_median_age show no strong correlation with median_house_value.

Median_house_value appears to have a capped upper bound, as seen in the unusually tall top bin of its histogram.
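
The cap on median_house_value can also be checked numerically by counting how many rows sit exactly at the maximum. The following is a minimal sketch on a made-up series standing in for data.median_house_value; the values and variable names are illustrative assumptions, not results from the dataset.

```python
import pandas as pd

# Toy stand-in for data.median_house_value; the values are made up for illustration.
values = pd.Series([120000.0, 340000.0, 500001.0, 500001.0, 87000.0, 500001.0])

cap = values.max()                      # candidate cap: the largest observed value
n_at_cap = int((values == cap).sum())   # number of rows sitting exactly at that value
share_at_cap = n_at_cap / len(values)   # fraction of all rows at the cap

print(cap, n_at_cap, round(share_at_cap, 2))
```

A large share of rows at a single maximum value would support the reading that the target was truncated rather than naturally distributed.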

An interesting extra feature might be some indication of the density of the population, e.g. households per population, and type of housing e.g. number of rooms per household

Ocean proximity is not included in the pairplot because it is categorical, so it should be converted to a numerical value to check its correlation, and to one-hot values if it is included in the features.

Likely useful features for predicting median house value:
- Median_income
- Households / population
- total_rooms / households
- total_bedrooms / households

In [None]:
ocean = pd.factorize(data.ocean_proximity)

In [None]:
ocean

In [None]:
data['ocean'] = pd.factorize(data.ocean_proximity)[0]

In [None]:
data.head()

In [None]:
sns.pairplot(data.iloc[:,3:])

The ocean code doesn't look particularly informative for median_house_value, except possibly when it is 4 (on an island), but it will be included as a one-hot encoding.
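
To illustrate why the integer codes are only useful for a quick look while the one-hot form is what goes into the model, here is a small sketch on a made-up category column. The labels are illustrative and are assumed to resemble those in ocean_proximity.

```python
import pandas as pd

# Toy category column; labels are illustrative, not read from the dataset.
proximity = pd.Series(["NEAR BAY", "INLAND", "NEAR BAY", "ISLAND"])

# factorize assigns integer codes in order of first appearance,
# so the numbers imply an ordering that the categories do not have.
codes, labels = pd.factorize(proximity)

# get_dummies creates one indicator column per category instead.
one_hot = pd.get_dummies(proximity)

print(list(codes))
print(list(labels))
print(one_hot)
```

Because the codes depend on the order in which categories first appear, a linear model would treat them as points on a scale; the indicator columns avoid that.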

In [None]:
a = data.median_income                      # median income
b = data.households / data.population       # households per person
c = data.total_rooms / data.households      # rooms per household
d = data.total_bedrooms / data.households   # bedrooms per household
plt.scatter(x=c, y=d)

total_rooms / households (c) and total_bedrooms / households (d) are correlated enough that it is probably sufficient to include only one of them. The easiest choice is total_rooms, as it has no missing values.
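
The visual impression from the scatter plot can be backed by a correlation coefficient. This is a minimal sketch on a made-up frame that reuses the notebook's column names; the numbers are invented, so no particular coefficient should be read into it.

```python
import pandas as pd

# Toy frame with the same column names as the notebook; the numbers are made up.
toy = pd.DataFrame({
    "total_rooms":    [880.0, 7099.0, 1467.0, 1274.0, 1627.0],
    "total_bedrooms": [129.0, 1106.0, 190.0, None, 280.0],
    "households":     [126.0, 1138.0, 177.0, 219.0, 259.0],
})

rooms_per_household = toy.total_rooms / toy.households
bedrooms_per_household = toy.total_bedrooms / toy.households

# Series.corr computes the Pearson correlation and skips pairs with a missing value.
r = rooms_per_household.corr(bedrooms_per_household)

# The missing total_bedrooms entry propagates into the ratio feature.
n_missing = int(bedrooms_per_household.isna().sum())

print(round(r, 3), n_missing)
```

The missing-value count shows why the bedrooms-based ratio would need imputation or row removal before modelling, while the rooms-based ratio would not.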

In [None]:
e = pd.get_dummies(data.ocean_proximity)

In [None]:
X = pd.DataFrame()
X['median_income'] = a
X['population_density'] = b
X['housing_density'] = c
X = pd.concat([X,e],axis=1)
X.head()

In [None]:
y = np.array(data.median_house_value).reshape(-1,1)

In [None]:
from sklearn.model_selection import train_test_split
X_train,X_test, y_train, y_test = train_test_split(X,y,test_size=0.25)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
scaled_X_train = scaler.transform(X_train)
scaled_X_test = scaler.transform(X_test)
scaler_out = StandardScaler()
scaler_out.fit(y_train)
scaled_y_train = scaler_out.transform(y_train)
scaled_y_test = scaler_out.transform(y_test)

Simple model: see if a linear regression can predict median house value using the features median income, households / population, total_rooms / households, and ocean proximity (one-hot encoding of the classes).

In [None]:
from sklearn.linear_model import LinearRegression
clf = LinearRegression()
clf.fit(scaled_X_train,scaled_y_train)
# Print R^2 on the test set; a bare expression in the middle of a cell is not displayed
print(clf.score(scaled_X_test,scaled_y_test))
pred = scaler_out.inverse_transform(clf.predict(scaled_X_test))
plt.scatter(pred,y_test)

X axis shows the predicted Median House Value, Y axis shows the target Median House Value.

Clearly shows the upper capped values in the target values. Predicted values get a lot higher than the reported Median House Values.

In [None]:
err = pred - y_test
plt.hist(err,bins=50)

In [None]:
((err * err).mean()) ** (0.5)

The histogram of the model's errors has a larger tail at the negative end, showing the skew in the predictions to the higher values, as shown in the previous scatter plot.

RMSE of 69312.60 USD
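
For reference, the RMSE computed by hand above can be cross-checked against scikit-learn's mean squared error. The sketch below uses made-up targets and predictions shaped like y_test and pred; the numbers are illustrative only.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up targets and predictions in USD, shaped (n, 1) like y_test and pred.
y_true = np.array([[100000.0], [200000.0], [300000.0]])
y_pred = np.array([[110000.0], [190000.0], [330000.0]])

# Manual form, matching ((err * err).mean()) ** 0.5 in the notebook.
err = y_pred - y_true
rmse_manual = float(np.sqrt((err ** 2).mean()))

# Same quantity via scikit-learn: the square root of the mean squared error.
rmse_sklearn = float(np.sqrt(mean_squared_error(y_true, y_pred)))

print(rmse_manual, rmse_sklearn)
```

Because the errors are taken in the original units after inverse_transform, the RMSE is directly interpretable in USD.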

In [None]:
from sklearn import tree
clfTree = tree.DecisionTreeRegressor()
clfTree.fit(scaled_X_train,scaled_y_train)
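# predict returns a 1-D array here; reshape to a column because recent scikit-learn scalers expect 2-D input
pred_tree = scaler_out.inverse_transform(clfTree.predict(scaled_X_test).reshape(-1,1))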
plt.scatter(x=pred_tree,y=y_test)

In contrast to the linear regression model, the decision tree regressor's predictions are capped at the upper end of median house values, just as the data is. (One of the disadvantages of decision trees is that they do not extrapolate beyond the range of target values they have seen in training.)
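
The difference in behaviour outside the training range can be reproduced on a tiny synthetic problem. This sketch is not the housing data: it fits both model types to y = 2x on x in [0, 10] and then queries x = 100, with all names chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D data: y = 2x for x = 0, 1, ..., 10.
X_train = np.arange(0, 11, dtype=float).reshape(-1, 1)
y_train = 2.0 * X_train.ravel()

linear = LinearRegression().fit(X_train, y_train)
tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

# Query a point far outside the training range.
X_new = np.array([[100.0]])
linear_pred = float(linear.predict(X_new)[0])
tree_pred = float(tree.predict(X_new)[0])

print(linear_pred)  # follows the fitted line beyond the training range
print(tree_pred)    # a leaf value, so it cannot exceed max(y_train)
```

A tree prediction is always the mean of the training targets in some leaf, which is why the tree's predictions for the housing data cannot exceed the largest median_house_value seen in training.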

In [None]:
err_tree = pred_tree.reshape(-1,1) - y_test
plt.hist(err_tree,bins=50)

In [None]:
((err_tree * err_tree).mean()) ** (0.5)

The histogram of the errors for the decision tree is roughly even on both sides, without the long tail for predicting higher median house values.

RMSE of 89878.89 USD