# Housing price prediction 
This is the second part of the California housing price prediction problem. The first part explained how to build an end-to-end machine learning workflow and make predictions. 



This notebook is mainly about how to automate the preprocessing and test various models to find the best one for our case. Detailed EDA and feature engineering are explained in the previous notebook, so let's jump directly to the dataset and implement the transformers and pipeline that automate the processing after EDA. 



In [81]:
import pandas as pd 
from pandas.plotting import scatter_matrix
import seaborn as sns 
import matplotlib.pyplot as plt 
import numpy as np 
import sklearn 
from sklearn.model_selection import StratifiedShuffleSplit 
from sklearn.impute import SimpleImputer 
from sklearn.preprocessing import StandardScaler, LabelBinarizer, OneHotEncoder


## Load Dataset 

In [12]:
df = pd.read_csv("housing.csv")


#### Stratified splitting 
 - The data is split with stratified sampling from the very beginning, for the reasons explained in the previous notebook. 


In [23]:
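# Bin median income into categories of width 1.5; categories 5 and above are merged into 5
df["income_category"] = np.ceil(df["median_income"] / 1.5)
df["income_category"] = df["income_category"].where(df["income_category"] < 5, 5.0)


split = StratifiedShuffleSplit(n_splits=1,
                               test_size=0.2,
                               random_state=42)

# n_splits=1, so this loop runs exactly once
for train_idx, test_idx in split.split(df, df["income_category"]):
    strat_train = df.loc[train_idx]
    strat_test = df.loc[test_idx]

Y_train, Y_test = strat_train.median_house_value, strat_test.median_house_value

# Drop the helper column and the target; reassigning avoids in-place edits on slices of df
drop_cols = ["income_category", "median_house_value"]
strat_train = strat_train.drop(columns=drop_cols)
strat_test = strat_test.drop(columns=drop_cols)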
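
The effect of the stratified split can be checked by comparing the share of each income category in the full data and in the test set. The sketch below is a minimal illustration on synthetic data: the `toy` frame and its lognormal income column are assumptions standing in for housing.csv, and only the binning and the splitter settings mirror the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic stand-in for housing.csv (assumption): a single skewed income column.
rng = np.random.default_rng(42)
toy = pd.DataFrame({"median_income": rng.lognormal(mean=1.2, sigma=0.5, size=5000)})

# Same binning as in the notebook: width 1.5, categories 5 and above merged into 5.
toy["income_category"] = np.ceil(toy["median_income"] / 1.5)
toy["income_category"] = toy["income_category"].where(toy["income_category"] < 5, 5.0)

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(toy, toy["income_category"]))

# Share of each income category in the full data and in the test set.
overall = toy["income_category"].value_counts(normalize=True).sort_index()
in_test = toy.loc[test_idx, "income_category"].value_counts(normalize=True).sort_index()
comparison = pd.DataFrame({"overall": overall, "test": in_test})
print(comparison)
```

The two columns should agree closely for every category, which is the property a purely random split does not guarantee.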



In [24]:
strat_train.columns

Index(['longitude', 'latitude', 'housing_median_age', 'total_rooms',
       'total_bedrooms', 'population', 'households', 'median_income',
       'ocean_proximity'],
      dtype='object')

## Transformer 
Let's encapsulate the custom cleanup and feature engineering in a custom class that works seamlessly with sklearn functionality such as pipelines. 
 BaseEstimator and TransformerMixin are the two sklearn base classes needed for this: BaseEstimator provides get_params() and set_params(), and TransformerMixin provides fit_transform(). 
 
 - For now, the positions of the columns needed for feature engineering are looked up by hand in the dataframe; an automatic way to do this will be needed in the future. 
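
To make the role of the two base classes concrete, here is a minimal sketch. `ColumnMultiplier` is a hypothetical toy transformer invented for this illustration and is not part of the notebook; it only shows which methods come for free from the base classes.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class ColumnMultiplier(BaseEstimator, TransformerMixin):
    """Hypothetical toy transformer: multiplies every value by a constant factor."""

    def __init__(self, factor=2.0):
        # Hyperparameters are stored under the same name as the __init__ argument.
        self.factor = factor

    def fit(self, X, y=None):
        # Nothing to learn from the data.
        return self

    def transform(self, X):
        return np.asarray(X) * self.factor


multiplier = ColumnMultiplier()
print(multiplier.get_params())        # inherited from BaseEstimator
multiplier.set_params(factor=3.0)     # inherited from BaseEstimator
result = multiplier.fit_transform(np.array([[1.0], [2.0]]))  # inherited from TransformerMixin
print(result)
```

Only `__init__`, `fit` and `transform` are written by hand; the remaining three methods are inherited.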




In [25]:
from sklearn.base import BaseEstimator, TransformerMixin 
rooms_ix, bedrooms_ix, population_ix, household_ix = 3,4,5,6 
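
One possible way to avoid hard-coding these positions is to look them up by column name. This is a sketch under the assumption that the column order matches the `strat_train.columns` output shown earlier; in the notebook itself, `strat_train.columns` could be used directly instead of the copied list.

```python
import pandas as pd

# Column order copied from the strat_train.columns output (assumption: unchanged).
columns = pd.Index(['longitude', 'latitude', 'housing_median_age', 'total_rooms',
                    'total_bedrooms', 'population', 'households', 'median_income',
                    'ocean_proximity'])

# Names of the columns used for feature engineering.
needed = ["total_rooms", "total_bedrooms", "population", "households"]

# Index.get_loc returns the integer position of a uniquely named column.
rooms_ix, bedrooms_ix, population_ix, household_ix = (columns.get_loc(c) for c in needed)
print(rooms_ix, bedrooms_ix, population_ix, household_ix)  # 3 4 5 6
```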



##### Content of pipeline 
Preprocessing requires a series of tasks performed in a specific order. A custom transformer class like the one below handles part of this, in combination with the sklearn pipelines shown in the following cells. 
The main tasks to be performed are: 
 - cleaning the dataset: handling nulls and missing values, etc. 
 - adding more features based on the existing ones: 
   - numerical features 
   - treatment of the categorical features 
 - scaling the columns to normalised values, using MinMaxScaler, StandardScaler, or any other custom scaler. 
 - any other task that needs to be performed on new data also goes into this pipeline. 
 
 Let's implement one simple pipeline for linear regression, then reuse it for various models to see the power of this tool. 
 

In [77]:
class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedroom_per_room=True):  # no *args or **kwargs, so BaseEstimator can discover the hyperparameters
        self.add_bedroom_per_room = add_bedroom_per_room
    def fit(self,X,y=None):
        return self  # nothing to learn from the data, so fit does nothing else
    def transform(self, X, y=None):
        rooms_per_household = X[:,rooms_ix] / X[:,household_ix]
        population_per_household = X[:,population_ix] / X[:,household_ix]
        if self.add_bedroom_per_room:
            bedrooms_per_room = X[:,bedrooms_ix] / X[:,rooms_ix]
            return np.c_[X,rooms_per_household, population_per_household,bedrooms_per_room]
        else:
            return np.c_[X,rooms_per_household, population_per_household]


attr_adder = CombinedAttributesAdder(add_bedroom_per_room=False)
# transform returns a 2D NumPy array, so it is converted back to a DataFrame below
train_df_extra = attr_adder.transform(strat_train.values)

cols = strat_train.columns.tolist() + ["rooms_per_household", "population_per_household"]
train_df_extra_ = pd.DataFrame(train_df_extra, columns=cols)
train_df_extra_


# This transformer selects the given columns of a DataFrame
# and returns their values as a NumPy array.
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attributes_names):
        self.attributes_names = attributes_names
    def fit(self, X, y=None):
        return self 
    def transform(self, X,y=None):
        #print ("selecting", self.attributes_names)
        #print (X[self.attributes_names].values.shape )
        return X[self.attributes_names].values 
    
    

In [78]:
strat_train_cat = strat_train[["ocean_proximity"]] ## only categorical features 
strat_train_num = strat_train.drop(["ocean_proximity"],axis=1) ## need only numerical data 

from sklearn.pipeline import FeatureUnion 
from sklearn.pipeline import Pipeline
num_attr = list(strat_train_num.columns)
cat_attr = list(strat_train_cat.columns)

In [84]:
num_pipeline = Pipeline([
    ('selector', DataFrameSelector(num_attr)),
    ('imputer', SimpleImputer(strategy="median")),
    ('add_features', CombinedAttributesAdder()),
    ('std_scaler', StandardScaler()),
])

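cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(cat_attr)),
    # sparse_output replaces the sparse argument from scikit-learn 1.2 on;
    # use OneHotEncoder(sparse=False) on older versions
    ('dummies', OneHotEncoder(sparse_output=False)),
])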


full_pipeline = FeatureUnion(transformer_list=[
    ('num_pipeline', num_pipeline),
    ('cat_pipeline', cat_pipeline),
])
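
As an aside, scikit-learn also offers `ColumnTransformer`, which routes columns to different transformers by name and so can stand in for the FeatureUnion plus DataFrameSelector combination. The sketch below is an alternative, not the approach used in this notebook. The `toy` frame and its values are made up for illustration, and the custom feature-adding step is left out to keep the example short.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny made-up frame (assumption): two numeric columns, one categorical column.
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
    "ocean_proximity": ["NEAR BAY", "INLAND", "NEAR BAY", "INLAND"],
})
num_cols = ["total_rooms", "total_bedrooms"]
cat_cols = ["ocean_proximity"]

# Numeric branch: fill missing values with the median, then standardise.
toy_num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])

# Each entry is (name, transformer, columns); no selector transformer is needed.
preprocess = ColumnTransformer(
    [
        ("num", toy_num_pipeline, num_cols),
        ("cat", OneHotEncoder(), cat_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)

prepared = preprocess.fit_transform(toy)
print(prepared.shape)  # (4, 4): two scaled numeric columns plus two one-hot columns
```

Because the columns are selected by name, this variant works on the DataFrame directly and does not depend on hard-coded column positions for the selection step.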

In [88]:
full_pipeline.fit_transform(strat_train)

array([[-1.15604281,  0.77194962,  0.74333089, ...,  0.        ,
         0.        ,  0.        ],
       [-1.17602483,  0.6596948 , -1.1653172 , ...,  0.        ,
         0.        ,  0.        ],
       [ 1.18684903, -1.34218285,  0.18664186, ...,  0.        ,
         0.        ,  1.        ],
       ...,
       [ 1.58648943, -0.72478134, -1.56295222, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.78221312, -0.85106801,  0.18664186, ...,  0.        ,
         0.        ,  0.        ],
       [-1.43579109,  0.99645926,  1.85670895, ...,  0.        ,
         1.        ,  0.        ]])