# California Housing Prices
Median house prices for California districts derived from the 1990 census.

## Context
This is the dataset used in the second chapter of Aurélien Géron's recent book 'Hands-On Machine Learning with Scikit-Learn and TensorFlow'. It serves as an excellent introduction to implementing machine learning algorithms because it requires only rudimentary data cleaning, has an easily understandable list of variables, and sits at a comfortable size between being too toyish and too cumbersome.

The data contains information from the 1990 California census. So although it may not help you with predicting current housing prices like the Zillow Zestimate dataset, it does provide an accessible introductory dataset for teaching people about the basics of machine learning.


## Acknowledgements
Please refer to the [Kaggle challenge web page](https://www.kaggle.com/camnugent/california-housing-prices)

## Inspiration
Predict real estate prices.

___

# Exploratory Data Analysis

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os

In [None]:
import folium

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error 
from sklearn.linear_model import Lasso, LinearRegression, Ridge, RANSACRegressor, SGDRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.svm import SVR

In [None]:
df = pd.read_csv('../input/housing.csv')
df.head()

In [None]:
df.shape

## Content
The data pertain to the houses found in a given California district, along with summary statistics about them based on the 1990 census. Be warned: the data aren't cleaned, so some preprocessing steps are required! The columns are as follows; their names are pretty self-explanatory:
* longitude
* latitude
* housing_median_age
* total_rooms
* total_bedrooms
* population
* households
* median_income
* median_house_value
* ocean_proximity

In [None]:
df.info()

There are a few missing values in the 'total_bedrooms' column. Now let's see the basic stats for the numerical columns:

In [None]:
df.describe()

In [None]:
df.ocean_proximity.value_counts()

## Cleaning data

In [None]:
df.duplicated().sum()

In [None]:
df.isnull().sum()

In [None]:
print(f'percentage of missing values: {df.total_bedrooms.isnull().sum() / df.shape[0] * 100 :.2f}%')

In [None]:
# ocean_proximity is text, so compute medians over the numeric columns only
df = df.fillna(df.median(numeric_only=True))
df.isnull().sum()
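
The median fill can be checked on a tiny example. The sketch below uses a made-up frame (`toy` and its values are hypothetical, not taken from the housing data) and only illustrates how `fillna` behaves when given a Series of per-column medians.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one numeric column with a gap, one text column.
toy = pd.DataFrame({
    "total_bedrooms": [100.0, np.nan, 300.0, 500.0],
    "ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "ISLAND"],
})

# Median of the numeric columns only; the text column is skipped.
medians = toy.median(numeric_only=True)

# Passing a Series to fillna fills each column with its own value.
filled = toy.fillna(medians)

print(filled)
```

The missing entry is replaced by 300.0, the median of the three observed values, and the text column is left untouched.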

## Dealing with geospatial information
Visualization of the data in a scatter plot laid out geographically (longitude on the x-axis, latitude on the y-axis).

In [None]:
# Recent seaborn versions require x and y to be passed as keywords
sns.scatterplot(x=df.longitude, y=df.latitude)

Same plot, but this time the size of each point varies with the `population` variable and the color depends on the real estate price (`median_house_value`).

In [None]:
sns.relplot(x="longitude", y="latitude", hue="median_house_value", size="population", alpha=.5,
            sizes=(50, 700), data=df, height=8)
plt.show()

In [None]:
# Create a folium map centered roughly on California (coordinates are hard-coded)
cali_map = folium.Map(location=[35.6, -117], zoom_start=6)

# Display the map
display(cali_map)

In [None]:
# Add one marker per row (this can be slow: there is one marker per district)
for lat, lon in zip(df.latitude, df.longitude):
    folium.Marker((float(lat), float(lon))).add_to(cali_map)
    
# Display the map
display(cali_map)

## Target analysis

In [None]:
plt.figure(figsize=(10, 4))
# distplot is deprecated in recent seaborn; histplot with kde=True is the replacement
sns.histplot(df.median_house_value, kde=True)
plt.show()

Variations depending on the proximity to the ocean

In [None]:
df.ocean_proximity.unique()

In [None]:
plt.figure(figsize=(10, 4))
for prox in df.ocean_proximity.unique():
    # Label each curve so the legend shows one entry per category
    sns.kdeplot(x=df[df.ocean_proximity == prox].median_house_value, label=prox)
plt.legend()
plt.show()

## Other analysis

In [None]:
sns.pairplot(df)
plt.show()

In [None]:
df.hist(figsize=(8, 8))
plt.show()

## Correlations

In [None]:
# ocean_proximity is text, so restrict the correlation to numeric columns
corr = df.corr(numeric_only=True)
corr

In [None]:
# Generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=bool)  # np.bool was removed from recent NumPy
mask[np.triu_indices_from(mask)] = True

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(8, 6))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5}, annot=True)

* latitude and longitude are strongly negatively correlated (California runs from north-west to south-east)
* total_bedrooms, population and households are highly positively correlated
* median_income and median_house_value are also positively correlated

which makes sense.
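
To read the matrix with the target in mind, one option is to sort the correlations with `median_house_value`. The sketch below does this on synthetic data (the `toy` frame, the `unrelated` column and the coefficients are all made up for illustration), so its numbers say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-ins: the target depends on income plus noise.
income = rng.normal(4.0, 1.5, n)
noise = rng.normal(0.0, 1.0, n)

toy = pd.DataFrame({
    "median_income": income,
    "unrelated": rng.normal(0.0, 1.0, n),
    "median_house_value": 40000 * income + 20000 * noise,
})

# Correlation of every feature with the target, strongest first.
target_corr = (
    toy.corr()["median_house_value"]
    .drop("median_house_value")
    .sort_values(ascending=False)
)

print(target_corr)
```

Applied to the housing data, the same pattern would be `corr["median_house_value"].drop("median_house_value").sort_values(ascending=False)`.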

---

# Models training and predictions

## Data preparation

One-hot encoding of the categorical feature (`ocean_proximity`)

In [None]:
df = pd.get_dummies(data=df, columns=['ocean_proximity'], drop_first=False)
df.head()
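
`pd.get_dummies` produces one indicator column per category rather than a single integer-coded column. A minimal sketch on a made-up frame (`toy` and its values are hypothetical) shows the resulting column names:

```python
import pandas as pd

# Hypothetical toy frame with two distinct categories.
toy = pd.DataFrame({
    "median_income": [8.3, 3.1, 5.6],
    "ocean_proximity": ["NEAR BAY", "INLAND", "NEAR BAY"],
})

# One indicator column is created per category, named <column>_<category>.
encoded = pd.get_dummies(data=toy, columns=["ocean_proximity"], drop_first=False)

print(encoded.columns.tolist())
# ['median_income', 'ocean_proximity_INLAND', 'ocean_proximity_NEAR BAY']
```

With `drop_first=False` every category keeps its own column; `drop_first=True` would drop the first one, which removes the redundancy between the indicator columns.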

In [None]:
feat_removed = ['median_house_value']

# Full list of original columns, kept for reference when experimenting with feature removal:
#['longitude', 'latitude', 'housing_median_age', 'total_rooms', 'total_bedrooms', 'population', 'households', 'median_income',
#'median_house_value', 'ocean_proximity']

In [None]:
y = df.median_house_value
X = df.drop(columns=feat_removed)
X.shape, y.shape

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

## Metric: Root Mean Squared Error (RMSE)

Root Mean Square Error (RMSE) is the standard deviation of the residuals (prediction errors). Residuals are a measure of how far from the regression line data points are; RMSE is a measure of how spread out these residuals are. In other words, it tells you how concentrated the data is around the line of best fit. Root mean square error is commonly used in climatology, forecasting, and regression analysis to verify experimental results.

<img src="./input/fig.jpg" style="height:400px">
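
Written out, RMSE is the square root of the mean squared residual, $\sqrt{\frac{1}{n}\sum_i (y_i - \hat{y}_i)^2}$. The sketch below computes it by hand with NumPy on made-up numbers; `rmse` is a hypothetical helper name used only for this illustration.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Square root of the mean squared residual."""
    residuals = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(residuals ** 2)))

# Made-up prices and predictions, in the same unit as the target.
y_true = [200000, 150000, 320000]
y_pred = [210000, 140000, 300000]

print(round(rmse(y_true, y_pred), 2))  # 14142.14
```

The residuals are -10000, 10000 and 20000, so the mean squared error is 2e8 and its square root is about 14142. The result is in the same unit as the target, and it should agree with `np.sqrt(mean_squared_error(y_true, y_pred))` from scikit-learn.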

In [None]:
def calculate_rmse(model, model_name):
    model.fit(X_train, y_train)
    y_pred, y_pred_train = model.predict(X_test), model.predict(X_train)
    rmse_test, rmse_train = np.sqrt(mean_squared_error(y_test, y_pred)), np.sqrt(mean_squared_error(y_train, y_pred_train))
    print(model_name, f' RMSE on train: {rmse_train:.0f}, on test: {rmse_test:.0f}')
    return rmse_test

## Linear Regression

In [None]:
lr = LinearRegression()
lr_err = calculate_rmse(lr, 'Linear Reg')

## RANSAC Regressor

In [None]:
ra = RANSACRegressor()
ra_err = calculate_rmse(ra, 'RANSAC Reg')

## Lasso

In [None]:
la = Lasso()
la_err = calculate_rmse(la, 'Lasso Reg')

## SGD Regressor

In [None]:
sg = SGDRegressor()
sg_err = calculate_rmse(sg, 'SGD Reg')

## Ridge

In [None]:
ri = Ridge()
ri_err = calculate_rmse(ri, 'Ridge')

## AdaBoostRegressor

In [None]:
ad = AdaBoostRegressor()
ad_err = calculate_rmse(ad, 'AdaBoostRegressor')

## SVR

In [None]:
sv = SVR()
sv_err = calculate_rmse(sv, 'SVR')

## Results comparison

In [None]:
df_score = pd.DataFrame({'Model':['Linear Reg', 'RANSAC Reg', 'Lasso Reg', 'AdaBoost', 'SVR'], 
                         'RMSE':[lr_err, ra_err, la_err, ad_err, sv_err]})
ax = df_score.plot.barh(y='RMSE', x='Model')

Lasso and Linear Regression are the winners! Surprisingly, the RMSE is a little lower for the best models when we keep features such as latitude/longitude, 'total_bedrooms' and 'population'.