# California Housing Prices
****

# Importing the libraries
****

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
#%matplotlib inline #Jupyter's own backend
import os
import seaborn as sns


# Walk the Kaggle input directory; csv_path ends up pointing at the last file found
for dirname, _, filenames in os.walk('/kaggle/input/california-housing-prices'):
    for filename in filenames:
        csv_path = os.path.join(dirname, filename)

# About the dataset

The "About the dataset" section gathers first insights about the housing dataset and the patterns in its data.

# Importing the data
****
Each row represents one district.

In [None]:
housing = pd.read_csv(csv_path)
housing.head()

In [None]:
housing.info()

There are 20,640 entries. ocean_proximity has the object data type (which can hold any Python object; here it holds text).

In [None]:
housing.describe()

In [None]:
housing.hist(bins = 50, figsize = (20,15))
plt.show()

Let's take a look at the only non-numerical attribute (ocean_proximity):

In [None]:
housing["ocean_proximity"].value_counts()

# Creating a test set

In [None]:
from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(housing, test_size = 0.2, random_state = 0)

The test set should be representative of the overall population.
Since the median_income attribute is an important predictor of median house prices, we want the test set to reflect the income distribution of the whole dataset.
To check this, we create a new income category attribute (income_cat) that holds the median income categories:

In [None]:
housing["median_income"].hist(bins = 50)

In [None]:
housing["income_cat"] = pd.cut(housing["median_income"],
      bins = [0., 1.5, 3.0, 4.5, 6, np.inf],
      labels = [1, 2, 3, 4, 5])
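
To make the binning concrete, here is a minimal sketch on a handful of made-up income values (the numbers are illustrative and not taken from the dataset). It reuses the bin edges and labels from the cell above. With the default right = True, every bin includes its right edge, so a value of exactly 1.5 still lands in category 1.

```python
import numpy as np
import pandas as pd

# Made-up median incomes, for illustration only
incomes = pd.Series([0.5, 1.5, 2.9, 4.5, 5.0, 15.0])

# Same bin edges and labels as above; the bins are (0, 1.5], (1.5, 3.0], ...
income_cat = pd.cut(incomes,
                    bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                    labels=[1, 2, 3, 4, 5])

print(income_cat.tolist())  # [1, 1, 2, 3, 4, 5]
```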

In [None]:
housing["income_cat"].hist(bins = 50)

**Stratified sampling based on the income category**
****

Creating the classes

In [None]:
from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits = 1, test_size = 0.2, random_state = 0)

for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]

**Income category proportions in the overall dataset vs test set**
****

Income category proportions in the overall dataset:

In [None]:
housing["income_cat"].value_counts()/len(housing)

Income category proportions in the test set:

In [None]:
strat_test_set["income_cat"].value_counts()/len(strat_test_set)

This way we see that the income category proportions in the test set generated using stratified sampling are almost identical to those in the overall dataset.
Now we can remove the income_cat attribute:

In [None]:
for dataset in (strat_train_set, strat_test_set, housing):
    dataset.drop("income_cat", axis = 1, inplace = True)
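
To see what stratification buys, the proportions can also be compared against a purely random split. The sketch below does this on synthetic categories (the proportions are made up, and the helper name proportions is ours). It uses the stratify argument of train_test_split, which is a shortcut for a single stratified shuffle split.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the income categories (proportions are made up)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "income_cat": rng.choice([1, 2, 3, 4, 5], size=10_000,
                             p=[0.04, 0.32, 0.35, 0.18, 0.11])
})

def proportions(frame):
    # Share of each category, ordered by category label
    return frame["income_cat"].value_counts(normalize=True).sort_index()

# Purely random split vs. split stratified on the category
_, random_test = train_test_split(demo, test_size=0.2, random_state=0)
_, strat_test = train_test_split(demo, test_size=0.2, random_state=0,
                                 stratify=demo["income_cat"])

comparison = pd.DataFrame({
    "overall": proportions(demo),
    "random": proportions(random_test),
    "stratified": proportions(strat_test),
})
# Absolute deviation of each test set from the overall proportions
comparison["random_abs_error"] = (comparison["random"] - comparison["overall"]).abs()
comparison["strat_abs_error"] = (comparison["stratified"] - comparison["overall"]).abs()
print(comparison)
```

The stratified column should match the overall column almost exactly, while the random column is free to drift.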

# Data discovery and visualization
****

In [None]:
housing = strat_train_set.copy()

**Visualize the geographical data**
****

* s - the size (marker area) of each circle represents the population of the district
* c - the color of each circle represents the price

In [None]:
housing.plot(kind = "scatter", x = "longitude", y = "latitude", alpha = 0.4,
            s = housing["population"]/100, label = "population",
            c = housing["median_house_value"], cmap = "jet", colorbar = True, figsize = (10,7))
plt.legend()

From this image, we can see that housing prices are related to the location and the population density. However, this is not always the rule: there is housing in the north, close to the ocean, with lower prices.

**Looking for correlations**
****

The dataset isn't too large, so we can compute the standard correlation coefficient (Pearson's r) between every pair of attributes:

In [None]:
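# numeric_only skips the text column ocean_proximity (required on pandas >= 2.0)
sns.heatmap(housing.corr(numeric_only = True))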

It is visible from the correlation matrix that the median house value (target variable) is negatively correlated with latitude and population: the farther north the district, the lower the value. Also, median house value is positively correlated with median income, meaning the higher the median income in the district, the higher the median house value.
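
As a reminder of what each cell of the heatmap encodes, the sketch below computes Pearson's r by hand on made-up arrays and checks it against NumPy. The function name pearson_r is ours and not part of any library. Keep in mind that r only measures linear relationships.

```python
import numpy as np

def pearson_r(x, y):
    # Covariance of x and y divided by the product of their spreads
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_up = 2 * x + 1                                 # perfectly increasing line
y_down = -3 * x + 10                             # perfectly decreasing line
y_noisy = np.array([2.1, 3.9, 6.2, 7.8, 10.1])   # made-up noisy values

r_up = pearson_r(x, y_up)        # 1 up to floating-point rounding
r_down = pearson_r(x, y_down)    # -1 up to floating-point rounding
r_noisy = pearson_r(x, y_noisy)  # close to, but below, 1

print(r_up, r_down, r_noisy)
```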

Next, scatter plots of a few attributes most correlated with the median house value are created with pandas' scatter_matrix.

In [None]:
from pandas.plotting import scatter_matrix

attributes = ["median_house_value", "longitude", "latitude", "population", "median_income"]

scatter_matrix(housing[attributes], figsize = (15,7))

**Median house value and median income**
****

In [None]:
housing.plot(kind = "scatter", x = "median_income", y = "median_house_value", figsize = (15,7))

The horizontal lines are the result of data capping. These values will be removed so that the algorithm doesn't learn to reproduce these data quirks.

In [None]:
# value_counts sorts by count in descending order by default (sort expects a bool)
housing["median_house_value"].value_counts()

In [None]:
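# Values that show up as horizontal lines in the scatter plot
capped_val_remove = [500001.0, 137500.0, 162500.0, 112500.0, 225000.0, 187500.0, 350000.0, 87500.0, 100000.0, 275000.0,
                    150000.0, 175000.0]

# Keep only the districts whose median house value is not in the list
housing = housing[~housing["median_house_value"].isin(capped_val_remove)]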
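
The list above is typed in by hand from the value counts. As a hedged alternative, suspiciously frequent values could be flagged by a count threshold. The sketch below uses made-up numbers, and the threshold of 3 is a hypothetical choice that would need tuning on the real data.

```python
import pandas as pd

# Made-up house values: two values repeat far more often than the rest
values = pd.Series([500001.0] * 5 + [137500.0] * 4 +
                   [210300.0, 98700.0, 345600.0, 120400.0])

counts = values.value_counts()

# Hypothetical threshold: any value seen at least this often is treated as capped
threshold = 3
suspicious = counts[counts >= threshold].index.tolist()

# Drop every row whose value is in the flagged list
filtered = values[~values.isin(suspicious)]

print(sorted(suspicious))
print(len(values), "->", len(filtered))
```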

The scatterplot after removing the capped values:

In [None]:
housing.plot(kind = "scatter", x = "median_income", y = "median_house_value", figsize = (15,7))

**Creating new variables - attribute combinations**
****

There are a couple of new attributes we can create from existing ones **for every district**:
* Number of bedrooms per household
* Number of rooms per household
* Number of bedrooms per room
* Number of people (population) per household

In [None]:
housing["bedrooms_per_household"] = housing["total_bedrooms"]/housing["households"]
housing["rooms_per_household"] = housing["total_rooms"]/housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"]/housing["total_rooms"]
housing["population_per_household"] = housing["population"]/housing["households"]

Now let's look at the correlation matrix again to see whether the new attributes show any stronger correlations:

In [None]:
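# numeric_only skips the text column ocean_proximity (required on pandas >= 2.0)
sns.heatmap(housing.corr(numeric_only = True))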

From the heatmap of the correlation matrix, it is visible that median_house_value:
* has a stronger **negative** correlation with bedrooms_per_household than with total_bedrooms in a district. This suggests that districts with more bedrooms per household tend to have lower prices

* has a strong **negative** correlation with bedrooms_per_room, so houses with a higher bedrooms/room ratio tend to be cheaper

* has a stronger **positive** correlation with rooms_per_household than with total_rooms in a district. Houses with more rooms (bigger houses) tend to cost more.

# Preparing the data for Machine Learning algorithms
****

Creating a clean training set and separating the predictors from the labels:

In [None]:
housing = strat_train_set.drop("median_house_value", axis = 1)
housing_labels = strat_train_set["median_house_value"]

**Data cleaning - missing values**
****

In [None]:
housing.isna().any()

In [None]:
housing[housing.isna().any(axis = 1)]

In [None]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy = "median")

housing_numerical_attributes = housing.drop("ocean_proximity", axis = 1)
imputer.fit(housing_numerical_attributes)

The imputer stores the median value of every attribute in its statistics_ instance variable:

In [None]:
imputer.statistics_
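
A toy illustration of what the median strategy does, on a small made-up array: each NaN is replaced by the median of the non-missing values in its own column.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up data: one missing value in each column
X_toy = np.array([[1.0, 10.0],
                  [np.nan, 20.0],
                  [3.0, np.nan],
                  [5.0, 40.0]])

toy_imputer = SimpleImputer(strategy="median")
X_filled = toy_imputer.fit_transform(X_toy)

print(toy_imputer.statistics_)  # per-column medians: [ 3. 20.]
print(X_filled)
```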

In [None]:
X = imputer.transform(housing_numerical_attributes)
X

In [None]:
housing_tr = pd.DataFrame(X, columns = housing_numerical_attributes.columns, index = housing_numerical_attributes.index)

New data frame with the missing (NaN) values replaced:

In [None]:
housing_tr

**Handling text and categorical variables**
****

In [None]:
housing_categorical = housing[["ocean_proximity"]]

In [None]:
housing_categorical.value_counts()

In [None]:
from sklearn.preprocessing import OrdinalEncoder
ordinal_encoder = OrdinalEncoder()
housing_categorical_encoded = ordinal_encoder.fit_transform(housing_categorical)
housing_categorical_encoded

In [None]:
housing_categorical_encoded[:10]

In [None]:
ordinal_encoder.categories_

The housing_categorical_encoded variable holds the ordinal codes of ocean_proximity. However, an ordinal encoding implies that categories with nearby codes are more similar than distant ones, which is not the case here, so one-hot encoding will be used **instead** of ordinal encoding:

In [None]:
from sklearn.preprocessing import OneHotEncoder
one_hot_encoder = OneHotEncoder()
housing_categorical_1hot =  one_hot_encoder.fit_transform(housing_categorical)
housing_categorical_1hot

List of categories:

In [None]:
one_hot_encoder.categories_
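
The difference between the two encoders is easiest to see on a few rows. The sketch below uses a made-up four-row frame. Note that OneHotEncoder returns a SciPy sparse matrix by default, so toarray() is called to display it as a dense array.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Made-up sample of the categorical column
toy = pd.DataFrame({"ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "<1H OCEAN"]})

# Ordinal encoding: one integer code per category (categories sorted alphabetically)
ordinal = OrdinalEncoder()
codes = ordinal.fit_transform(toy)

# One-hot encoding: one binary column per category
one_hot = OneHotEncoder()
dense = one_hot.fit_transform(toy).toarray()

print(list(one_hot.categories_[0]))  # ['<1H OCEAN', 'INLAND', 'NEAR BAY']
print(codes.ravel())                 # [1. 2. 1. 0.]
print(dense)
```

With one-hot encoding every pair of distinct categories is equally far apart, whereas the ordinal codes would place "<1H OCEAN" closer to "INLAND" than to "NEAR BAY" for no good reason.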

**Custom transformer** that adds the combined attributes (rooms_per_household, population_per_household and, optionally, bedrooms_per_room)
* BaseEstimator - base class that provides get_params() and set_params()
* TransformerMixin - base class that provides fit_transform()
* CombinedAttributesAdder - custom transformer with the add_bedrooms_per_room hyperparameter, used to see whether the algorithm works better with or without that attribute

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin
# Column indices of total_rooms, total_bedrooms, population and households in the input array
rooms, bedrooms, population, households = 3,4,5,6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room = True):
        self.add_bedrooms_per_room = add_bedrooms_per_room
    
    def fit(self, X, y = None):
        return self
    
    def transform(self, X):
        rooms_per_household = X[:, rooms] / X[:, households]
        population_per_household = X[:, population] / X[:, households]
        
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms] / X[:, rooms]
            
            return np.c_[X, rooms_per_household, population_per_household, bedrooms_per_room]
        
        else:
            return np.c_[X, rooms_per_household, population_per_household]

In [None]:
attribute_add = CombinedAttributesAdder(add_bedrooms_per_room = False)
housing_added_attributes = attribute_add.transform(housing.values)
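
The hard-coded indices 3, 4, 5, 6 only work as long as the column order stays the same. A hedged sketch of looking the indices up by name instead is shown below. The column list is our assumption about the order of the numerical columns and should be replaced by list(housing_numerical_attributes.columns) in the notebook.

```python
# Assumed order of the numerical columns (verify against the real data frame)
columns = ["longitude", "latitude", "housing_median_age", "total_rooms",
           "total_bedrooms", "population", "households", "median_income"]

# Look up each index by column name instead of hard-coding it
wanted = ("total_rooms", "total_bedrooms", "population", "households")
rooms, bedrooms, population, households = [columns.index(name) for name in wanted]

print(rooms, bedrooms, population, households)  # 3 4 5 6
```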

**Transformation pipeline**
****

A transformation pipeline chains a sequence of transformations on the columns. This way, steps such as imputing the missing values, combining attributes and scaling are merged into one pipeline.

Transformation pipeline for numerical attributes:
****

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

num_att_pipeline = Pipeline(
[
    ('imputer', SimpleImputer(strategy = "median")),
    ('attributes_adder', CombinedAttributesAdder()),
    ('scaler', StandardScaler())
]
)

In [None]:
housing_numerical_transformator = num_att_pipeline.fit_transform(housing_numerical_attributes)
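
The last step of the pipeline, StandardScaler, subtracts each column's mean and divides by its standard deviation. A minimal sketch on made-up numbers, checked against the manual formula:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data with two columns on very different scales
X_toy = np.array([[1.0, 100.0],
                  [2.0, 200.0],
                  [3.0, 300.0],
                  [4.0, 400.0]])

scaled = StandardScaler().fit_transform(X_toy)

# Same computation by hand; StandardScaler uses the population std (ddof=0)
manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print(scaled.mean(axis=0))  # approximately 0 for every column
print(scaled.std(axis=0))   # approximately 1 for every column
```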

A single transformer for all the columns would be even more useful: the numerical pipeline applied to the numerical attributes and OneHotEncoder applied to the categorical attribute. This is where ColumnTransformer comes into play:

In [None]:
from sklearn.compose import ColumnTransformer

numerical_attributes = list(housing_numerical_attributes)
categorical_attributes = ["ocean_proximity"]

full_pipeline = ColumnTransformer(
[
    ("numerical", num_att_pipeline, numerical_attributes),
    ("categorical", OneHotEncoder(), categorical_attributes)
]
)

housing_data_prepared = full_pipeline.fit_transform(housing)
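
One point worth making explicit: the pipeline should be fitted on the training data only, and new data (for example the test set) should go through transform, not fit_transform, so that the training medians, means and categories are reused. A minimal sketch on a made-up three-column frame (the helper to_dense is ours):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up training rows and one new row with a missing value
train = pd.DataFrame({
    "total_rooms": [4.0, 6.0, np.nan, 8.0],
    "median_income": [2.0, 3.0, 5.0, 4.0],
    "ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "<1H OCEAN"],
})
new_rows = pd.DataFrame({
    "total_rooms": [np.nan],
    "median_income": [3.5],
    "ocean_proximity": ["INLAND"],
})

numeric = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
prep = ColumnTransformer([
    ("numerical", numeric, ["total_rooms", "median_income"]),
    ("categorical", OneHotEncoder(), ["ocean_proximity"]),
])

def to_dense(matrix):
    # ColumnTransformer may return a SciPy sparse matrix, depending on density
    return matrix.toarray() if hasattr(matrix, "toarray") else np.asarray(matrix)

train_prepared = to_dense(prep.fit_transform(train))  # learn statistics on the training rows
new_prepared = to_dense(prep.transform(new_rows))     # reuse them, no refitting

print(train_prepared.shape, new_prepared.shape)  # (4, 5) (1, 5)
print(new_prepared)
```

The new row's missing total_rooms is filled with the training median (6.0), which then scales to 0 because it equals the training mean after imputation.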

# Select and train a model
****

housing_labels holds the values of the target attribute median_house_value that the Linear Regression model is trained to predict.

In [None]:
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()

lin_reg.fit(housing_data_prepared, housing_labels)

Predicting the training set labels:

In [None]:
housing_predictions_lr = lin_reg.predict(housing_data_prepared)

housing_predictions_lr (often named y_pred in other notebooks) is the vector of predictions:

In [None]:
housing_predictions_lr

**Model evaluation**
****

Since the data doesn't have many outliers and this is a regression task, we will use RMSE (Root Mean Squared Error):

In [None]:
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(housing_labels, housing_predictions_lr)
rmse = np.sqrt(mse)

In [None]:
rmse
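
For reference, RMSE is the square root of the mean of the squared prediction errors, so it is expressed in the same unit as the target. A tiny worked sketch on made-up numbers:

```python
import numpy as np

# Made-up true values and predictions
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])

errors = y_pred - y_true          # [-1, 0, 2]
mse_toy = np.mean(errors ** 2)    # (1 + 0 + 4) / 3
rmse_toy = np.sqrt(mse_toy)

print(rmse_toy)  # about 1.291
```

Because the errors are squared before averaging, the single error of 2 contributes four times as much as the error of 1.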

A typical prediction error of $68,284 is not very satisfying: the model is underfitting the training data.
Let's try a more complex machine learning model, a Decision Tree:

In [None]:
from sklearn.tree import DecisionTreeRegressor
tree_reg = DecisionTreeRegressor()
tree_reg.fit(housing_data_prepared, housing_labels)

In [None]:
housing_predictions_tree = tree_reg.predict(housing_data_prepared)

In [None]:
mse_tree = mean_squared_error(housing_labels, housing_predictions_tree)
rmse_tree = np.sqrt(mse_tree)

In [None]:
rmse_tree
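
An unconstrained decision tree can memorize its training data, so an RMSE computed on the same rows it was trained on may come out close to zero and says little about how the model generalizes. One common way to get a more honest estimate is k-fold cross-validation. The sketch below illustrates this on synthetic data (not the housing set), using scikit-learn's cross_val_score.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem: linear signal plus noise
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 3))
y_demo = 3 * X_demo[:, 0] - 2 * X_demo[:, 1] + rng.normal(0, 1, size=200)

# RMSE on the rows the tree was trained on
tree = DecisionTreeRegressor(random_state=0)
tree.fit(X_demo, y_demo)
train_rmse = np.sqrt(np.mean((y_demo - tree.predict(X_demo)) ** 2))

# 5-fold cross-validation: each fold is scored on rows the tree has not seen
scores = cross_val_score(DecisionTreeRegressor(random_state=0), X_demo, y_demo,
                         scoring="neg_mean_squared_error", cv=5)
cv_rmse = np.sqrt(-scores)  # scores are negative MSE, so flip the sign first

print("training RMSE:", train_rmse)
print("cross-validated RMSE:", cv_rmse.mean(), "+/-", cv_rmse.std())
```

On this synthetic data the training RMSE is essentially zero while the cross-validated RMSE is clearly larger, which is the usual signature of overfitting.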