# House prices

## Extract and view data

First we fetch the data from an external URL using the auxiliary function *fetch_housing_data()*.

In [None]:
import sys
sys.path.append('.\\aux_functions')

# print(sys.path)
from aux_functions.fetch_data import fetch_housing_data

fetch_housing_data()
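
The helper module itself is not shown in this notebook. As a rough sketch only, helpers of this kind might download a compressed archive and read the extracted CSV file. The function names, the archive name `housing.tgz`, the file name `housing.csv` and the default folder below are all assumptions made for illustration, not the actual contents of *aux_functions*.

```python
import tarfile
import urllib.request
from pathlib import Path

import pandas as pd


def fetch_tgz_dataset(data_url, data_path="datasets/housing"):
    """Hypothetical helper: download a .tgz archive and unpack it into data_path."""
    target_dir = Path(data_path)
    target_dir.mkdir(parents=True, exist_ok=True)
    tgz_path = target_dir / "housing.tgz"  # assumed archive name
    urllib.request.urlretrieve(data_url, tgz_path)  # download the archive
    with tarfile.open(tgz_path) as archive:
        archive.extractall(path=target_dir)  # unpack next to the archive


def load_csv_dataset(data_path="datasets/housing"):
    """Hypothetical helper: read the extracted CSV file into a DataFrame."""
    return pd.read_csv(Path(data_path) / "housing.csv")  # assumed file name
```

Keeping the download in a function makes it easy to refresh the data later or to run the same step on another machine.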

Now we load the housing data into a DataFrame and take a quick look at its structure.

In [None]:
from aux_functions.fetch_data import load_housing_data
housing = load_housing_data()
housing.head()

Each row represents one district. There are 10 attributes:

In [None]:
housing.info()

*ocean_proximity* is a categorical attribute. Its categories, and the number of districts in each, can be listed with:

In [None]:
housing["ocean_proximity"].value_counts()

The numerical attributes can be explored with:

In [None]:
housing.describe()

We can call the *hist()* method on the whole data set to plot a histogram of each numerical attribute.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.rc('text', usetex=False)
housing.hist(bins=50, figsize=(20,15))
plt.show()

*median_income* is not expressed in dollars, and it has been capped.

*median_house_value* and *housing_median_age* have also been capped.

Many histograms are tail-heavy.

##  Create a Test Set

In [None]:
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(housing, test_size = 0.2, random_state=42)

For stratified sampling, we divide the data into a small number of strata and draw the test set from each stratum in proportion to its size, so that the test set is representative of the whole data set.

We can divide the *median_income* attribute into several strata and store them in a new *income_cat* attribute.

In [None]:
import numpy as np
import pandas as pd

housing["income_cat"] = pd.cut(housing["median_income"],
                               bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                               labels = [1, 2, 3, 4, 5])
housing["income_cat"].hist()
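
As a small illustration of how *pd.cut* assigns the categories, the sketch below applies the same bin edges and labels to a handful of made-up income values (they are not taken from the data set).

```python
import numpy as np
import pandas as pd

# Made-up median incomes, one per district
incomes = pd.Series([0.5, 1.5, 2.0, 5.0, 10.0])

# Same bin edges and labels as in the cell above
income_cat = pd.cut(incomes,
                    bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                    labels=[1, 2, 3, 4, 5])

# Intervals are closed on the right by default, so 1.5 falls in category 1
print(income_cat.tolist())  # → [1, 1, 2, 4, 5]
```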


Now we are ready to do stratified sampling based on the income category. For this we use the *StratifiedShuffleSplit* class.

In [None]:
from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)

for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set  = housing.loc[test_index]

Let's check that it worked as expected by comparing the income category proportions in the test set with those in the full data set.

In [None]:
strat_test_set["income_cat"].value_counts() / len(strat_test_set)

In [None]:
housing["income_cat"].value_counts() / len(housing)

We can see that the fractions are almost the same. Now we remove *income_cat* so that the data is back in its original state.

In [None]:
for set_ in (strat_train_set, strat_test_set):
    set_.drop("income_cat", axis=1, inplace = True)
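
To see what stratification buys, the same kind of check can be run against a purely random split. The sketch below is only illustrative: it uses synthetic income categories with made-up proportions rather than the housing data, and the column names of the comparison table are arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit, train_test_split

# Synthetic stand-in for the income category column (not the real data)
rng = np.random.default_rng(42)
data = pd.DataFrame({
    "income_cat": rng.choice([1, 2, 3, 4, 5], size=10_000,
                             p=[0.04, 0.32, 0.35, 0.18, 0.11]),
})

def proportions(frame):
    """Fraction of rows in each income category."""
    return frame["income_cat"].value_counts(normalize=True).sort_index()

# Purely random split
_, random_test = train_test_split(data, test_size=0.2, random_state=42)

# Stratified split on the income category (split() yields positional indices)
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(data, data["income_cat"]))
strat_test = data.iloc[test_idx]

comparison = pd.DataFrame({
    "overall": proportions(data),
    "random": proportions(random_test),
    "stratified": proportions(strat_test),
})
# Relative error of each test set with respect to the full data, in percent
comparison["random_err_%"] = 100 * (comparison["random"] / comparison["overall"] - 1)
comparison["strat_err_%"] = 100 * (comparison["stratified"] / comparison["overall"] - 1)
print(comparison)
```

The stratified column should track the overall proportions closely, while the random split is free to drift, especially for the smaller categories.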

## Discover and visualize the data to gain insights

In [None]:
housing = strat_train_set.copy()

In [None]:
housing.plot(kind='scatter', x='longitude', y='latitude', alpha=0.1)

In [None]:
housing.plot(kind='scatter', x='longitude', y='latitude', alpha=0.4,
            s=housing["population"]/100, label="population", figsize=(10,7),
            c="median_house_value", cmap=plt.get_cmap("jet"), colorbar = True)
plt.legend()

## Looking for correlations

In [None]:
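# numeric_only=True skips the text column ocean_proximity (required on pandas >= 2.0)
corr_matrix = housing.corr(numeric_only=True)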
corr_matrix["median_house_value"].sort_values(ascending = False)
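
Keep in mind that *corr()* computes the Pearson coefficient by default, which only measures linear relationships. The toy example below (made-up data, unrelated to the housing set) shows a column that is fully determined by *x* yet has a correlation close to zero with it.

```python
import numpy as np
import pandas as pd

x = np.linspace(-1.0, 1.0, 201)  # symmetric around zero
demo = pd.DataFrame({
    "x": x,
    "linear": 3.0 * x + 1.0,   # perfectly linear in x
    "quadratic": x ** 2,       # fully determined by x, but not linearly
})

corr = demo.corr()
print(corr["x"])  # "linear" is about 1, "quadratic" is about 0
```

A coefficient near zero therefore does not prove that two attributes are unrelated.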

Another way to check for correlation between attributes is to use *scatter_matrix*, which plots every numerical attribute against every other one.

In [None]:
from pandas.plotting import scatter_matrix

attributes = ["median_house_value", "median_income", "total_rooms", "housing_median_age"]
scatter_matrix(housing[attributes], figsize = (12,8))

The most promising attribute to predict the median house value is the median income.

In [None]:
housing.plot(kind="scatter", x="median_income", y="median_house_value", alpha = 0.1)

There are horizontal lines around 500,000, 450,000 and 350,000.
We may want to try removing the corresponding districts to prevent the algorithm from learning these lines.
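
One possible way to drop such districts is sketched below on toy data. It assumes that the capped districts all share the largest value in the column, and it handles only the top line; the other lines would need their own filter. The values are made up for illustration.

```python
import pandas as pd

# Toy data; the real notebook would use the housing DataFrame instead
districts = pd.DataFrame({
    "median_income": [2.5, 8.1, 4.0, 9.3, 3.2],
    "median_house_value": [150_000, 500_001, 275_000, 500_001, 180_000],
})

# Assumption: capped districts all share the largest value in the column
cap_value = districts["median_house_value"].max()
uncapped = districts[districts["median_house_value"] < cap_value]
print(len(districts), "->", len(uncapped))  # → 5 -> 3
```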

## Experimenting with Attribute Combinations

Create more interesting attributes to analyze by combining existing ones.

In [None]:
housing["rooms_per_household"] = housing ["total_rooms"] / housing ["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"]/housing["total_rooms"]
housing["population_per_household"]=housing["population"]/housing["households"]

And now let's look at the correlation matrix again

In [None]:
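# numeric_only=True skips the text column ocean_proximity (required on pandas >= 2.0)
corr_matrix = housing.corr(numeric_only=True)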
corr_matrix["median_house_value"].sort_values(ascending = False)

The new *bedrooms_per_room* attribute is much more correlated with the median house value than the total number of rooms or bedrooms.

## Prepare the data for Machine Learning Algorithms

First let's revert to a clean training set and let's separate the predictor attributes and the labels

In [None]:
housing = strat_train_set.drop("median_house_value", axis =1)
housing_labels = strat_train_set["median_house_value"].copy()

### Data cleaning
Now we need to handle the NaN values in *total_bedrooms*.

We have three options: delete the samples with NaN values, delete the whole attribute, or replace the NaN values with some value, such as the median.
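
The three options can be sketched with plain pandas on a toy frame (the numbers are made up, and the variable names are only illustrative):

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value in total_bedrooms
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
})

# Option 1: drop the rows that contain the missing value
option1 = toy.dropna(subset=["total_bedrooms"])

# Option 2: drop the whole attribute
option2 = toy.drop("total_bedrooms", axis=1)

# Option 3: fill the missing value, here with the column median
median = toy["total_bedrooms"].median()
option3 = toy.assign(total_bedrooms=toy["total_bedrooms"].fillna(median))
```

With the third option, the median should be computed on the training set only and stored, so that the same value can later fill missing entries in the test set and in new data.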

In [None]:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="median")

To use the SimpleImputer, we need to create a copy of the data without categorical attributes

In [None]:
housing_num = housing.drop("ocean_proximity", axis=1)
imputer.fit(housing_num)
imputer.statistics_

Now we can use this trained imputer to transform the training set by replacing missing values with the learned medians.

In [None]:
X = imputer.transform(housing_num)
# Keep the original row index so housing_tr stays aligned with housing
housing_tr = pd.DataFrame(X, columns=housing_num.columns,
                          index=housing_num.index)
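
As a quick sanity check of what the imputer learns, the sketch below fits a *SimpleImputer* on a toy frame (made-up numbers) and rebuilds a DataFrame from the result. The non-default index mimics what happens after a shuffled split.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

toy = pd.DataFrame(
    {"total_rooms": [880.0, 7099.0, 1467.0],
     "total_bedrooms": [129.0, np.nan, 190.0]},
    index=[10, 20, 30],  # non-default index, as after a shuffled split
)

toy_imputer = SimpleImputer(strategy="median")
filled = toy_imputer.fit_transform(toy)  # returns a NumPy array by default

# statistics_ holds one median per column, computed while ignoring NaN
print(toy_imputer.statistics_)

# Passing the original index keeps the rows aligned with the source frame
toy_tr = pd.DataFrame(filled, columns=toy.columns, index=toy.index)
```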

### Handling text and categorical attributes


We are going to create one binary attribute per category: *one-hot encoding*

In [None]:
housing_cat = housing[["ocean_proximity"]]

from sklearn.preprocessing import OneHotEncoder
cat_encoder = OneHotEncoder()
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)
housing_cat_1hot

In [None]:
housing_cat_1hot.toarray()
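
The mapping between columns and categories is stored in the encoder's *categories_* attribute. The toy example below (illustrative category values only) shows how the columns of the one-hot matrix line up with it.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy categorical column; the values are only illustrative
toy_cat = pd.DataFrame(
    {"ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "ISLAND"]}
)

toy_encoder = OneHotEncoder()
toy_1hot = toy_encoder.fit_transform(toy_cat)  # SciPy sparse matrix

# One column per category, in the order stored in categories_
print(toy_encoder.categories_)
print(toy_1hot.toarray())
```

The sparse format only stores the positions of the non-zero entries, which saves memory when there are many categories.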

### Custom transformers

For a custom transformer, we need to create a class and implement three methods: *fit()* (returning self), *transform()* and *fit_transform()*.

If we add *TransformerMixin* as a base class, we get the *fit_transform()* method for free.

If we add *BaseEstimator* as a base class we get two extra methods (*get_params()* and *set_params()*) that will be useful

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin

rooms_ix, bedrooms_ix, population_ix, households_ix = 3,4,5,6

class CombinedAttributesAdder (BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room=True):  # no *args or **kwargs
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self # nothing else to do
    def transform(self, X, y=None):
        rooms_per_household = X[:, rooms_ix] / X[:, households_ix]
        population_per_household = X[:, population_ix] / X[:, households_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                        bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]
        
attr_adder = CombinedAttributesAdder(add_bedrooms_per_room= False)
housing_extra_attribs = attr_adder.transform(housing.values)
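
To show what the two base classes contribute, here is a deliberately tiny transformer. *RatioAdder* and its *add_ratio* parameter are hypothetical names used only for this sketch; they are not part of the notebook or of scikit-learn.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class RatioAdder(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: appends column 0 divided by column 1."""

    def __init__(self, add_ratio=True):
        self.add_ratio = add_ratio

    def fit(self, X, y=None):
        return self  # stateless, nothing to learn

    def transform(self, X):
        if not self.add_ratio:
            return X
        return np.c_[X, X[:, 0] / X[:, 1]]


X_demo = np.array([[6.0, 2.0], [9.0, 3.0]])
adder = RatioAdder()

print(adder.get_params())            # provided by BaseEstimator
X_out = adder.fit_transform(X_demo)  # provided by TransformerMixin
adder.set_params(add_ratio=False)    # provided by BaseEstimator
X_same = adder.fit_transform(X_demo)
```

Exposing the switch as a constructor argument is what lets *get_params()* and *set_params()* see it, which is useful later when tuning hyperparameters automatically.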