# California Housing Price Prediction

## Step 1: Frame the problem

- Objective: Predict the median housing price for a given block.
- Problem type: Supervised learning, regression

The dataset contains information about housing in California districts, obtained from the 1990 California census.
There are around 20,000 records and 10 columns in the dataset (9 features plus the target, median_house_value). The column names are self-explanatory: longitude, latitude, housing_median_age, total_rooms, total_bedrooms, population, households, median_income, median_house_value, ocean_proximity

It is important to remember that each record in the dataset describes a block, not an individual house.

1. longitude: A measure of how far west a block is; values are negative in California, so a more negative value is farther west

2. latitude: A measure of how far north a block is; a higher value is farther north

3. housing_median_age: Median age of a house within a block; a lower number is a newer building

4. total_rooms: Total number of rooms within a block

5. total_bedrooms: Total number of bedrooms within a block

6. population: Total number of people residing within a block

7. households: Total number of households (a household is a group of people residing within a home unit) for a block

8. median_income: Median income for households within a block (measured in tens of thousands of US dollars)

9. median_house_value: Median house value for households within a block (measured in US dollars)

10. ocean_proximity: Location of the block relative to the ocean


- A few things to remember about the dataset:
    - The median income, housing median age and median house value are capped.
    - Median income is not expressed in plain US dollars. It has been scaled (roughly tens of thousands of US dollars) and capped at 15 for higher median incomes and at 0.5 for lower median incomes.
    - Capping the median house value may be a serious problem since it is the target attribute (the labels). Machine learning algorithms may learn that prices never go beyond that limit.
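
One quick way to see how much capping affects a column is to count the rows that sit exactly at the column maximum. The sketch below uses a small hypothetical series rather than the real data; the values, including the 500001 cap, are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the target column; 500001 plays the role of the cap value.
values = pd.Series([120000, 350000, 500001, 500001, 275000, 500001],
                   name="median_house_value")

cap = values.max()                       # capped rows pile up at the maximum
n_capped = int((values == cap).sum())    # how many rows sit exactly at the cap
share_capped = n_capped / len(values)

print(f"cap={cap}, capped rows={n_capped}, share={share_capped:.2f}")
```

A large share of rows at the maximum is a hint that the column was clipped rather than naturally bounded.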

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd

import warnings
warnings.filterwarnings("ignore")

In [None]:
data = pd.read_csv("../input/california-housing-prices/housing.csv")
print(data.shape)
data.head()

## Step 2: Data Exploration

In [None]:
data_explore = data.copy()

In [None]:
data_explore.info()

Only the total_bedrooms column contains null values: 207 in total.

In [None]:
data_explore.describe()

Comparing the mean with the median (50%), and the min and max with the 25% and 75% quartiles, we can see that some columns contain outliers.

### Histograms

In [None]:
data_explore.hist(figsize=(15, 8))

- Observations:
    - For many blocks, the median housing price lies between 2.5 and 5.5.
    - Many histograms are tail heavy: they extend much farther to the right of the median than to the left. This may make it a bit harder for some Machine Learning algorithms to detect patterns.
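
To make the tail-heavy observation concrete, skewness can be measured before and after a transform. This is a minimal sketch on synthetic log-normal data, not on the housing columns, and it uses `np.log1p` as a simple stand-in for the Yeo-Johnson transform applied later in this notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed sample standing in for a column such as population.
skewed = pd.Series(rng.lognormal(mean=7.0, sigma=0.8, size=5000))

skew_before = skewed.skew()              # clearly positive for a long right tail
skew_after = np.log1p(skewed).skew()     # close to zero after the log transform

print(round(skew_before, 2), round(skew_after, 2))
```

A skewness near zero after the transform means the distribution is roughly symmetric, which is easier for many models to work with.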

### Outliers

In [None]:
columns = ['households', 'population', 'total_bedrooms', 'total_rooms']
plt.figure(figsize=(15, 8))
sns.boxplot(data=data_explore[columns])
plt.ylim((-100, 7000))

In [None]:
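# Restrict to numeric columns; quantile() fails on the text column ocean_proximity in recent pandas
numeric = data_explore.select_dtypes(include=np.number)
Q1 = numeric.quantile(0.25)
Q3 = numeric.quantile(0.75)
IQR = Q3 - Q1
# Count values outside the 1.5 * IQR fences in each column
((numeric < (Q1 - 1.5 * IQR)) | (numeric > (Q3 + 1.5 * IQR))).sum()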

In [None]:
data_explore['total_bedrooms'].mean(), data_explore['total_bedrooms'].median()

The column has more than 1000 outliers, which distort the mean, so I will fill the missing values with the median instead of the mean.
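
The 1.5 × IQR rule and the reason for preferring the median can be seen on a tiny example. The numbers below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import numpy as np

# Toy data: nine ordinary values and one extreme one.
x = np.array([2, 3, 3, 4, 4, 5, 5, 6, 6, 100], dtype=float)

q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # the outlier fences
outliers = x[(x < lower) | (x > upper)]

print("outliers:", outliers)
print("mean:", x.mean(), "median:", np.median(x))
```

The single extreme value drags the mean far from the bulk of the data, while the median barely moves, which is why the median is the safer fill value for a column with many outliers.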

In [None]:
median = data_explore['total_bedrooms'].median()
data_explore['total_bedrooms'] = data_explore['total_bedrooms'].fillna(median)  # assign back instead of chained inplace fillna
data_explore['total_bedrooms'].isna().sum()

### Median Housing Value Across Different Geo Locations

In [None]:
import matplotlib.image as mpimg
california_img=mpimg.imread('../input/images/calfornia_img.jpg')
california_state=mpimg.imread('../input/images/calfornia_state.jpg')

In [None]:
plt.figure(figsize=(15, 8))
plt.subplot(1, 2, 1)
plt.imshow(california_state)
plt.axis('off')
plt.subplot(1, 2, 2)
ax = plt.gca()
data_explore.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4, s=data_explore["population"]/100,
             label="population", figsize=(15,6), c="median_house_value", cmap=plt.get_cmap("jet"), ax=ax)
ax.imshow(california_img, extent=[-124.55, -113.80, 32.45, 42.05])

- Observations:
    - Density is highest in the Bay Area, Los Angeles and San Diego.
    - Housing prices are higher near the ocean and in highly populated areas. This does not hold true for Northern California.
    - Housing prices decrease as we move away from the ocean. Most housing prices are below $200k.

### Correlation Plot

In [None]:
data_explore["rooms_per_household"] = data_explore["total_rooms"]/data_explore["households"]
data_explore["bedrooms_per_room"] = data_explore["total_bedrooms"]/data_explore["total_rooms"]

In [None]:
data_explore_dummies = pd.get_dummies(data_explore) 
plt.figure(figsize=(18, 10))
corr_matrix = data_explore_dummies.corr(method='pearson')
sns.heatmap(corr_matrix, mask=np.zeros_like(corr_matrix, dtype=bool), square=True, annot=True)  # np.bool was removed from NumPy

In [None]:
corr_matrix["median_house_value"].sort_values(ascending=False)

- Observations:
    - Median house value is highly correlated with median income. No other feature is highly correlated with the target variable.
    - The newly created features are somewhat correlated with the target variable. Their correlation is stronger than that of the individual features they were built from.
    - There is strong correlation among some of the feature variables, such as latitude & longitude, population & households etc.
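
For reference, the Pearson coefficient used in the correlation matrix can be computed by hand and checked against NumPy. The data here is synthetic (a noisy linear relation loosely imitating income versus house value), so the resulting number says nothing about the real dataset.

```python
import numpy as np

rng = np.random.default_rng(1)
income = rng.uniform(0.5, 15.0, size=1000)             # synthetic feature
value = 40000 * income + rng.normal(0, 60000, 1000)    # synthetic target with noise

def pearson(a, b):
    """Covariance of a and b divided by the product of their standard deviations."""
    a_c, b_c = a - a.mean(), b - b.mean()
    return float((a_c * b_c).sum() / np.sqrt((a_c ** 2).sum() * (b_c ** 2).sum()))

r_manual = pearson(income, value)
r_numpy = float(np.corrcoef(income, value)[0, 1])

print(round(r_manual, 3), round(r_numpy, 3))
```

Pearson correlation only captures linear relationships, so a low value does not rule out a non-linear dependence.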

    
Let's explore the relationship between the correlated features and the median house value.

In [None]:
from pandas.plotting import scatter_matrix
attributes = ["median_house_value", "median_income", "total_rooms", "housing_median_age"]
scatter_matrix(data_explore[attributes], figsize=(15, 10))
plt.show()

- Observe the plot of median house value against median income: the correlation is strong and the points are not too dispersed. Interestingly, there is a horizontal line at the top of the chart (at 500,000). This is due to capping.
- Capped data is not good for training because the model may learn that the price of a house never goes above 500k USD.

## Step 3: Data Preprocessing

First, let's get rid of the records whose median house value is capped at $500k.

In [None]:
data_capped = data[data['median_house_value']>=500000]
data = data[data['median_house_value']<500000].copy()  # copy so later column assignments do not act on a slice
data_capped.shape, data.shape

- Since the dataset is not very large and (according to experts) median income is a very important attribute for predicting median housing prices, we have to ensure that the test set is representative of the various income categories in the whole dataset.
- Median income is a continuous numerical attribute, so I create a new income category attribute to use for stratified sampling.
- I will generate the train and test sets using the StratifiedShuffleSplit method.
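
The effect of stratification can be checked by comparing category proportions in the test split with those in the full data. The sketch below runs on synthetic incomes, not the housing data; only the bin edges and the split parameters mirror the ones used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

rng = np.random.default_rng(42)
# Synthetic, right-skewed incomes standing in for median_income.
df = pd.DataFrame({"median_income": rng.lognormal(mean=1.2, sigma=0.5, size=5000)})
df["income_cat"] = pd.cut(df["median_income"],
                          bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                          labels=[1, 2, 3, 4, 5]).astype(int)

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(split.split(df, df["income_cat"]))

# Share of each income category overall and in the test split.
overall = df["income_cat"].value_counts(normalize=True).sort_index()
in_test = df["income_cat"].iloc[test_idx].value_counts(normalize=True).sort_index()
max_gap = float((overall - in_test).abs().max())

print(pd.DataFrame({"overall": overall, "test": in_test}))
print("largest proportion gap:", round(max_gap, 4))
```

With a purely random split the gaps would typically be larger, especially for the small categories at either end.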

In [None]:
data["income_cat"] = pd.cut(data["median_income"], bins=[0., 1.5, 3.0, 4.5, 6., np.inf], labels=[1, 2, 3, 4, 5])

In [None]:
data["income_cat"].hist()

In [None]:
from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)

In [None]:
for train_index, test_index in split.split(data, data["income_cat"]):
    strat_train_set = data.iloc[train_index]
    strat_test_set = data.iloc[test_index]

# Drop the helper column by reassignment rather than in place on a slice of `data`
strat_train_set = strat_train_set.drop("income_cat", axis=1)
strat_test_set = strat_test_set.drop("income_cat", axis=1)
strat_train_set.shape, strat_test_set.shape

In [None]:
X_train = strat_train_set.drop('median_house_value', axis=1)
y_train = strat_train_set['median_house_value'].copy()
X_test = strat_test_set.drop('median_house_value', axis=1)
y_test = strat_test_set['median_house_value'].copy()

During exploration, I performed two operations on the data:
1. Replaced null values with the median
2. Added two new columns: bedrooms_per_room, rooms_per_household

I will define a custom transformer that creates the two extra columns.

In [None]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import PowerTransformer, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.base import BaseEstimator, TransformerMixin

In [None]:
rooms_ix, bedrooms_ix, households_ix = 3, 4, 6    # column ids

class CombineAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        rooms_per_household = X[:, rooms_ix] / X[:, households_ix]
        bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
        return np.c_[X, rooms_per_household, bedrooms_per_room]

In [None]:
num_attrs = list(X_train.columns)
num_attrs.remove('ocean_proximity')
cat_attrs = ['ocean_proximity',]

In [None]:
num_pipeline = Pipeline([('imputer', SimpleImputer(strategy="median")),
                         ('attribs_adder', CombineAttributesAdder()),
                         ('scaler', PowerTransformer(method='yeo-johnson', standardize=True))])

pre_process = ColumnTransformer([("nums", num_pipeline, num_attrs),
                                   ("cat", OneHotEncoder(handle_unknown='ignore'), cat_attrs)], remainder='passthrough')

In [None]:
X_train_transformed = pre_process.fit_transform(X_train)
X_test_transformed = pre_process.transform(X_test)
X_train_transformed.shape, X_test_transformed.shape

In [None]:
feature_columns = list(num_attrs)  # numeric columns, in pipeline order
feature_columns.extend(['rooms_per_household', 'bedrooms_per_room'])
# Take category names from the fitted encoder: it sorts them, whereas unique() returns order of appearance
new_cols = list(pre_process.named_transformers_['cat'].categories_[0])
feature_columns.extend(new_cols)

## Step 4: Select and Train a Model

- I will try out linear as well as ensemble learning techniques.
- These are the 4 models I will use:
    1. Stochastic Gradient Descent with L1 regularization
    2. Decision Tree
    3. Random Forest
    4. XGBoost Regression


- I will use RMSE as the evaluation metric.
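
As a reminder of what the metric measures, RMSE is the square root of the mean squared error, so it is expressed in the same unit as the target (US dollars here). The values below are hypothetical; MAE is shown next to it only for comparison.

```python
import numpy as np

y_true = np.array([200000., 150000., 320000., 410000.])  # hypothetical observed values
y_pred = np.array([210000., 140000., 300000., 450000.])  # hypothetical predictions

errors = y_pred - y_true
rmse = float(np.sqrt(np.mean(errors ** 2)))   # penalises large errors more strongly
mae = float(np.mean(np.abs(errors)))          # treats all errors linearly

print(f"RMSE={rmse:.0f}, MAE={mae:.0f}")
```

Because the errors are squared before averaging, the single 40,000 miss dominates the RMSE, which is why RMSE is sensitive to blocks that are predicted badly.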

In [None]:
from sklearn.model_selection import cross_val_score

results=[]

def cv_results(model, X, y):
    scores = cross_val_score(model, X, y, cv = 7, scoring="neg_root_mean_squared_error", n_jobs=-1)
    rmse_scores = -scores
    rmse_scores = np.round(rmse_scores, 3)
    print('CV Scores: ', rmse_scores)
    print('rmse: {},  S.D.:{} '.format(np.mean(rmse_scores), np.std(rmse_scores)))
    results.append([model.__class__.__name__, np.mean(rmse_scores), np.std(rmse_scores)])

### Stochastic Gradient Descent

In [None]:
from sklearn.linear_model import SGDRegressor

In [None]:
sgd_reg = SGDRegressor(alpha=1, penalty='l1', random_state=42)
sgd_reg.fit(X_train_transformed, y_train)

In [None]:
feature_imp = [ col for col in zip(feature_columns,sgd_reg.coef_)]
feature_imp.sort(key=lambda x:x[1], reverse=True)
feature_imp

- Looking at the coefficient values, the model gives the most weight to the median income attribute. The next most important attributes according to this model are proximity to the ocean and the median age of the houses.

In [None]:
cv_results(sgd_reg, X_train_transformed, y_train)

- RMSE is around $60,000, which is large. This indicates that the model fits the data poorly.
- Either the features do not provide enough information or the model is not powerful enough.
- In other words, the linear model underfits the data.
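
One common way to recognise underfitting is that the error is high on the training data as well as on held-out data. The sketch below illustrates this on a synthetic non-linear target with a plain `LinearRegression`; it is not the SGD model or the housing data used in this notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(600, 1))
y = X[:, 0] ** 2 + rng.normal(0, 0.2, size=600)   # non-linear synthetic target

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)
model = LinearRegression().fit(X_tr, y_tr)

def rmse(m, X, y):
    """Root mean squared error of model m on (X, y)."""
    return float(np.sqrt(np.mean((m.predict(X) - y) ** 2)))

train_rmse, val_rmse = rmse(model, X_tr, y_tr), rmse(model, X_val, y_val)

# Both errors are close to the spread of y itself: the model explains almost nothing.
print(round(train_rmse, 2), round(val_rmse, 2), round(float(np.std(y)), 2))
```

When training and validation errors are both high and similar, a more flexible model or better features are the usual next step, which is what the tree-based models in the following sections try.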

### Decision Tree

In [None]:
from sklearn.tree import DecisionTreeRegressor

In [None]:
tree_reg = DecisionTreeRegressor(random_state=42)  # squared error is the default criterion; the "mse" alias was removed in scikit-learn 1.2
tree_reg.fit(X_train_transformed, y_train)

In [None]:
cv_results(tree_reg, X_train_transformed, y_train)

### Random Forest

In [None]:
from sklearn.ensemble import RandomForestRegressor

In [None]:
forest_reg = RandomForestRegressor(n_estimators=100, n_jobs=-1, random_state=42)  # squared error is the default criterion
forest_reg.fit(X_train_transformed, y_train)

In [None]:
feature_imp = [ col for col in zip(feature_columns,forest_reg.feature_importances_)]
feature_imp.sort(key=lambda x:x[1], reverse=True)
feature_imp

In [None]:
cv_results(forest_reg, X_train_transformed, y_train)

The RMSE is still high, but the result is better than that of the previous two models.

### XGBoost Regression

In [None]:
from xgboost import XGBRegressor

In [None]:
xgb_reg = XGBRegressor(n_estimators=100, max_depth=8, learning_rate=0.1, objective='reg:squarederror', random_state=42)
xgb_reg.fit(X_train_transformed, y_train)

In [None]:
feature_imp = [ col for col in zip(feature_columns,xgb_reg.feature_importances_)]
feature_imp.sort(key=lambda x:x[1], reverse=True)
feature_imp

In [None]:
cv_results(xgb_reg, X_train_transformed, y_train)

In [None]:
result_df = pd.DataFrame(data=results, columns=['Model', 'RMSE', 'S.D'])
result_df

XGBoost gives a better cross-validation result than all the previously tried algorithms. Remember that the Random Forest uses fully grown decision trees, whereas XGBoost uses decision trees with max_depth=8.

Now I will tune the hyperparameters of Random Forest and XGBoost. Once the best version of each is found, I will evaluate both on the test set, and the one that gives the better result will be the final model.

## Step 5: Fine Tune a Model

In [None]:
from sklearn.model_selection import GridSearchCV

First, we will find the best Random Forest regression model.
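
Before running a grid search it helps to know how many models will be trained. The sketch below counts the candidates for a grid with the same shape as the Random Forest grid in this notebook, assuming 5-fold cross-validation and the default `refit=True` of `GridSearchCV`.

```python
from sklearn.model_selection import ParameterGrid

# Same shape as the Random Forest grid in this notebook.
grid = ParameterGrid({"n_estimators": [50, 100, 300], "max_depth": [8, 16, 24]})

n_candidates = len(grid)                 # every combination of the listed values
n_folds = 5
n_fits = n_candidates * n_folds + 1      # +1 for the final refit on all training data

print(n_candidates, n_fits)
```

The cost grows multiplicatively with every added hyperparameter, so a randomized search may be worth considering for larger grids.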

In [None]:
rf_grid_parm=[{'n_estimators':[50, 100, 300], 'max_depth':[8, 16, 24]}]
rf_grid_search = GridSearchCV(RandomForestRegressor(random_state=42, n_jobs=-1), rf_grid_parm, cv=5, scoring="neg_root_mean_squared_error", return_train_score=True, n_jobs=-1)
rf_grid_search.fit(X_train_transformed, y_train)

In [None]:
rf_grid_search.best_params_, -rf_grid_search.best_score_

In [None]:
cvres = rf_grid_search.cv_results_
print("Results for each run of Random Forest Regression...")
for train_mean_score, test_mean_score, params in zip(cvres["mean_train_score"], cvres["mean_test_score"], cvres["params"]):
    print(-train_mean_score, -test_mean_score, params)

In [None]:
best_forest_reg = rf_grid_search.best_estimator_
best_forest_reg

Now let's find the best XGBoost regression model.

In [None]:
xgb_grid_parm=[{'n_estimators':[50, 100, 300], 'max_depth':[6, 8, 12]}]
xgb_grid_search = GridSearchCV(XGBRegressor(objective='reg:squarederror', learning_rate=0.1, n_jobs=-1, random_state=42), xgb_grid_parm, cv=5, scoring="neg_root_mean_squared_error", return_train_score=True, n_jobs=-1)
xgb_grid_search.fit(X_train_transformed, y_train)

In [None]:
xgb_grid_search.best_params_, -xgb_grid_search.best_score_

In [None]:
cvres = xgb_grid_search.cv_results_
print("Results for each run of XGBoost Regression...")
for train_mean_score, test_mean_score, params in zip(cvres["mean_train_score"], cvres["mean_test_score"], cvres["params"]):
    print(-train_mean_score, -test_mean_score, params)

In [None]:
best_xgb_reg = xgb_grid_search.best_estimator_
best_xgb_reg

## Step 6: Model Evaluation

Evaluate Random Forest regression model.

In [None]:
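from sklearn.metrics import mean_squared_error
np.sqrt(mean_squared_error(y_test, best_forest_reg.predict(X_test_transformed)))  # test RMSE of the tuned model; cv_results would refit clones on test folds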

Evaluate XGBoost regression model.

In [None]:
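from sklearn.metrics import mean_squared_error
np.sqrt(mean_squared_error(y_test, best_xgb_reg.predict(X_test_transformed)))  # test RMSE of the tuned model; cv_results would refit clones on test folds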

- The error is still large, but of the two models XGBoost performs slightly better on both the train and test sets, so I will select XGBoost as the final model.

Let's analyse the model's predictions on the overall dataset. This will help find out where the model makes the most mistakes.

In [None]:
combine_data = pd.concat([strat_train_set, strat_test_set], axis=0)
combine_data.shape

In [None]:
y_train_pred = best_xgb_reg.predict(X_train_transformed)
y_test_pred = best_xgb_reg.predict(X_test_transformed)

In [None]:
y_pred = np.concatenate([y_train_pred, y_test_pred], axis=0)
y_pred.shape

In [None]:
combine_data['predicted_value'] = y_pred

In [None]:
combine_data.head()

In [None]:
combine_data.describe()

In [None]:
plt.figure(figsize=(15, 6))
plt.subplot(1, 2, 1)
combine_data['median_house_value'].hist()
plt.title('Observed Median House Value')
plt.subplot(1, 2, 2)
combine_data['predicted_value'].hist()
plt.title('Predicted Median House Value')
plt.show()

In [None]:
plt.figure(figsize=(12, 8))
plt.scatter(combine_data['median_income'], combine_data['median_house_value'], c='green', alpha=0.7, label="Observed")
plt.scatter(combine_data['median_income'], combine_data['predicted_value'], c='red', alpha=0.7, label="Predicted")
plt.xlabel('Median Income')
plt.ylabel('Median House Value')
plt.legend()
plt.show()

- Observations:
    - For many blocks with a median house value above 450000, the model predicts a value lower than the original median house value.
    - For some blocks with a median house value below 100000, the model predicts a value higher than the original value.
    - Look at the top left corner: for blocks where median income is less than 3 and the observed median house value is above 300000, the model predicts a lower value than the original.
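
The same observations can be quantified by grouping residuals by income range. This is a minimal sketch with hypothetical numbers; in this notebook the three columns would come from `combine_data`, and the bin edges are an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical observed and predicted values for six blocks.
df = pd.DataFrame({
    "median_income":      [1.5, 2.5, 2.8, 4.0, 5.5, 7.0],
    "median_house_value": [90000, 320000, 350000, 210000, 280000, 470000],
    "predicted_value":    [110000, 240000, 260000, 205000, 290000, 430000],
})
df["residual"] = df["predicted_value"] - df["median_house_value"]  # negative = under-prediction
df["income_bin"] = pd.cut(df["median_income"], bins=[0, 3, 6, np.inf],
                          labels=["<3", "3-6", ">6"])

# Mean residual and number of blocks per income range.
summary = df.groupby("income_bin", observed=True)["residual"].agg(["mean", "count"])
print(summary)
```

A strongly negative mean residual in one bin points to a region of the feature space where the model systematically under-predicts.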

In [None]:
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(15, 8))  # create the figure once; a separate plt.figure() leaves an empty figure behind
combine_data.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4, figsize=(15,8), c="median_house_value", cmap=plt.get_cmap("jet"), ax=ax[0], colorbar=False)
ax[0].imshow(california_img, extent=[-124.55, -113.80, 32.45, 42.05])
ax[0].set_title('Observed Median House Values')
combine_data.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,figsize=(15,8), c="predicted_value", cmap=plt.get_cmap("jet"), ax=ax[1], colorbar=False)
ax[1].imshow(california_img, extent=[-124.55, -113.80, 32.45, 42.05])
ax[1].set_title('Predicted Median House Values')
plt.show()

- Observe the redness in both graphs. Near the ocean, there is more redness in the observed median house values than in the predicted ones.

Note: Red indicates high house values, whereas blue indicates low house values.