In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from sklearn.model_selection import train_test_split, StratifiedShuffleSplit
from sklearn.preprocessing import FunctionTransformer
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
import os
print(os.listdir("../input/california-housing-prices"))

In [None]:
housing = pd.read_csv('../input/california-housing-prices/housing.csv')
housing.head()

## 1. Identify the business problem and metrics to measure performance

We have been given a dataset of housing prices in California, built from California census data. It contains information such as population, median house value, housing size (in terms of total rooms and bedrooms), and latitude and longitude (to geolocate the districts). <br>
Here, we are asked to create a model to <b>predict a district's median housing price</b>. In this scenario, the request has been given to us by the stakeholders. The next step consists in identifying any relevant solutions that are currently in place. We do this for two reasons: as a reference for performance, and as an insight into how to solve the problem. Following this, we find out that the existing estimates are produced by following complex and costly rules, with a typical error rate of 15%. <br>
This is a typical <b>supervised learning task</b> as we are given labeled data and, more specifically, a <b>regression problem</b> as our target is a continuous feature. Finally, there is no continuous flow of data coming in, so a <b>batch learning</b> approach should work fine. <br>
The typical performance measure for regression problems is the <b>Root Mean Square Error (RMSE)</b>. It is the square root of the mean squared prediction error, so it behaves like a standard deviation of the errors and gives more weight to large errors. Formula to compute: <br>
 $RMSE({X},{h}) = \sqrt{\frac{1}{m}\sum_{i=1}^{m}{\Big({h}({x^{(i)}}) -y^{(i)}\Big)^2}}$ <br> Here $m$ is the number of districts, $x^{(i)}$ the features of district $i$, $y^{(i)}$ its label, and $h$ the prediction function. If there are many outlier districts in our set, we may consider using the <b>Mean Absolute Error</b>: <br>
 $MAE({X},{h})= {\frac{1}{m}\sum_{i=1}^{m}{\Big|{h}({x^{(i)}}) -y^{(i)}\Big|}}$ <br> Both the RMSE and the MAE are ways to measure the distance between two vectors: the vector of predictions and the vector of target values.
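
As a small illustration of the two formulas, the sketch below computes both metrics with NumPy on made-up numbers (the arrays and the helper names `rmse` and `mae` are assumptions for this example, not part of the notebook). It shows how a single large error pulls the RMSE up much more than the MAE.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error: square root of the mean squared difference."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_pred - y_true) ** 2))

def mae(y_true, y_pred):
    """Mean absolute error: mean of the absolute differences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(y_pred - y_true))

# Toy values (made up for illustration, not taken from the housing data)
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 310.0, 390.0])          # every error has size 10
y_pred_outlier = np.array([110.0, 190.0, 310.0, 500.0])  # one error of size 100

print("no outlier:   RMSE =", rmse(y_true, y_pred), " MAE =", mae(y_true, y_pred))
print("with outlier: RMSE =", rmse(y_true, y_pred_outlier), " MAE =", mae(y_true, y_pred_outlier))
```

When all errors have the same size the two metrics coincide; with the outlier the RMSE grows faster, which is why the MAE can be preferable when outliers are common.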

## 2. Exploratory Data Analysis 

Once the dataset is loaded, the first thing to do is familiarise ourselves with it. This means looking at the data types and columns, and checking the distributions of the features, the cardinality of the categorical features, correlations, outliers, and missing values.

In [None]:
print('Number of entries in the dataset: {}.'.format(len(housing)))
print('There are {} features in the dataset.'.format(len(housing.columns)))
print('--------------------')
print('List of categorical features: \n{}'.format([x for x in housing.select_dtypes(include='O').columns]))
print('List of continuous features: \n{}'.format([x for x in housing.select_dtypes(exclude='O').columns]))
print('------------------')
print('Features with missing values include:')
_ = housing.isnull().sum()
for x,y in zip(_.index,_):
    if y>0:
        print('{} with {} missing values.'.format(x,y))
print('------------------')
print('Cardinality of the categorical feature:')
_ = housing.ocean_proximity.value_counts()
for x,y in zip(_.index,_):
    print('{} has {} records.'.format(x,y))

From this initial analysis, we can see that our dataset is made of <b>10 features</b> (9 numerical and 1 categorical) with <b>20,640 districts</b>. Of the numerical features, 'total_bedrooms' is <b>missing 207 values</b>, which will need to be imputed. All other features are complete and numerical, with the exception of 'ocean_proximity'. This feature has a cardinality of 5 ('<1H OCEAN', 'INLAND', 'NEAR OCEAN', 'NEAR BAY', 'ISLAND'). Now let's look at some basic summary statistics. For this we can use the <i>describe()</i> method, which gives us a quick overview of the count, mean, std, min, max, and percentiles.

In [None]:
housing.describe()

A few things stand out from the mean and std in the summary statistics above. 'population' appears to be strongly affected by outliers: its std is almost as large as its mean, so the values are widely spread rather than concentrated around the centre. Hence, if population turns out to be relevant, we might want to summarise it with the median or winsorize the outliers. When we plot the feature later, I expect to see 'population' skewed towards the right. Also, looking at the percentiles, 75% of districts have fewer than 3147 'total_rooms'. Finally, the 50% percentile (which corresponds to the median) shows that the median 'median_house_value' is 179700. We can also explore the distribution of our features with the <i>hist()</i> method.
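
To make the winsorizing idea concrete, here is a minimal sketch on synthetic data. The log-normal sample merely stands in for a right-skewed column such as 'population', and the 1% / 99% cut-offs are an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed "population" column (log-normal), not the real data
population = pd.Series(rng.lognormal(mean=7.0, sigma=0.8, size=10_000), name="population")

# Winsorize: clip everything below the 1st and above the 99th percentile
lower, upper = population.quantile([0.01, 0.99])
winsorized = population.clip(lower=lower, upper=upper)

print("before: mean={:.1f} median={:.1f} std={:.1f}".format(
    population.mean(), population.median(), population.std()))
print("after:  mean={:.1f} median={:.1f} std={:.1f}".format(
    winsorized.mean(), winsorized.median(), winsorized.std()))
```

Clipping the tails leaves the median untouched while reducing the standard deviation, which is the effect we would be after.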

In [None]:
housing.hist(bins=50,figsize=(20,15))
plt.show()

Some of the things we can note from the above histograms:
- 'median_income' is not expressed in USD, and presumably it has been capped at 15.0001.
- 'housing_median_age' and 'median_house_value' have also been capped, at 52 and 500001.0 respectively. The latter could cause issues since it is our target feature: the machine learning model could learn that median_house_value never goes beyond 500k. If the requirements include predicting values beyond that cap, we have two options: find the uncapped label values for those districts, or remove the capped districts altogether from our dataset.
- the majority of our features do not follow a bell curve; instead they have a strong tail towards the right. This could cause issues with some machine learning algorithms (e.g. a plain linear regression can be sensitive to heavily skewed features and the extreme values they contain).<br>

Before continuing with our exploratory data analysis, it is good practice to divide the dataset into a <i>training</i> and a <i>testing set</i>. Sklearn provides the train_test_split() function, which samples at random and is normally a good way to split when the dataset is large enough. For this dataset, a better idea is to follow a stratified sampling approach, which makes sure that our train and test sets are representative of the overall population. In this case, we would ideally want our stratified sampling to be based on 'median_income' (because we are assuming that the household income is a good predictor of the median housing prices). Since this is a continuous feature, we first have to discretize it.

In [None]:
housing['income_cat'] = np.ceil(housing['median_income'] /1.5)
print('Cardinality of median_income before discretization {} and after {} .'.format(len(housing.median_income.value_counts()),len(housing.income_cat.value_counts())))
print('After discretization:\n',housing.income_cat.value_counts())

By discretizing the continuous feature 'median_income' into 'income_cat' we have reduced its cardinality to 11 categories. Since stratified sampling needs enough districts in each stratum, I have decided to merge all categories greater than 5 into one.

In [None]:
housing['income_cat'] = np.where(housing['income_cat']>5,5.0,housing['income_cat'])
housing.income_cat.plot(kind='hist')
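
The two steps above (divide by 1.5 and round up, then cap at 5) can also be written as a single call to `pd.cut`. The sketch below uses made-up income values to check that both versions agree; the bin edges are simply the multiples of 1.5 implied by the original calculation.

```python
import numpy as np
import pandas as pd

# Made-up median incomes, for illustration only
median_income = pd.Series([0.5, 1.5, 1.6, 3.0, 4.4, 5.9, 6.1, 9.0, 15.0])

# Two-step version used in the notebook: ceil, then cap at 5
two_step = np.ceil(median_income / 1.5)
two_step = np.where(two_step > 5, 5.0, two_step)

# One-step version: right-closed bins (0, 1.5], (1.5, 3], (3, 4.5], (4.5, 6], (6, inf]
one_step = pd.cut(median_income,
                  bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
                  labels=[1, 2, 3, 4, 5])

print(two_step.tolist())
print(one_step.astype(float).tolist())
```

`pd.cut` makes the bin boundaries explicit, which can be easier to review than the arithmetic version.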

The majority of our districts seem to fall into income categories 2 and 3. Now we can do a stratified sampling based on the income category.

In [None]:
split = StratifiedShuffleSplit(n_splits=1,test_size=0.2,random_state=42)
for train_index, test_index in split.split(housing,housing['income_cat']):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]

Let's see if this worked as expected. We have to first look at the income category in the full dataset and then compare it with the one generated from the strat_test_set and the test_set (generated from the train_test_split method).

In [None]:
original = pd.Series(housing['income_cat'].value_counts() / len(housing), name='Original')
strat = pd.Series(strat_test_set['income_cat'].value_counts() / len(strat_test_set),name='Stratified')
train_set, test_set = train_test_split(housing,test_size=0.2,random_state=42)
random = pd.Series(test_set['income_cat'].value_counts() / len(test_set), name='Random')
test_sets_comparisons = pd.DataFrame([original,strat,random]).T.sort_index()
test_sets_comparisons['% Error Strat'] = 100 * (test_sets_comparisons['Stratified'] / test_sets_comparisons['Original']) - 100
test_sets_comparisons['% Error Random'] = 100 * (test_sets_comparisons['Random'] / test_sets_comparisons['Original']) - 100
test_sets_comparisons

As we can see, the testing set generated by StratifiedShuffleSplit provides the closest resemblance to the distribution of our original housing set. Hence, we can proceed by using the stratified sampled sets. We can now remove income_cat and return the data to its original state.
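
For a single split, a similar result can be obtained with the `stratify` argument of `train_test_split`. The sketch below is run on synthetic strata (the category probabilities are invented), and compares the category proportions of the full set with those of the stratified test set.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
# Synthetic, imbalanced strata standing in for income_cat
df = pd.DataFrame({
    "income_cat": rng.choice([1, 2, 3, 4, 5], size=5_000, p=[0.04, 0.32, 0.35, 0.18, 0.11]),
    "value": rng.normal(size=5_000),
})

# stratify keeps the proportion of each category (almost) identical in both parts
train, test = train_test_split(df, test_size=0.2, random_state=42, stratify=df["income_cat"])

overall = df["income_cat"].value_counts(normalize=True).sort_index()
in_test = test["income_cat"].value_counts(normalize=True).sort_index()
print(pd.DataFrame({"overall": overall, "stratified_test": in_test}))
```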

In [None]:
for _ in (strat_train_set,strat_test_set):
    _.drop(['income_cat'],axis=1,inplace=True)

In [None]:
housing = strat_train_set.copy()

In [None]:
housing.columns

In [None]:
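# create the axes first and pass them on, otherwise pandas draws on a new figure and the figsize is ignored
fig, ax = plt.subplots(figsize=(20,20))
housing.plot.scatter(x='longitude',y='latitude', alpha=0.1, ax=ax)
plt.show()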

In [None]:
fig, ax = plt.subplots(figsize=(15,10))
housing.plot.scatter(x='longitude',y='latitude'
                     ,alpha=0.3,s=housing['population']/100,label='population'
                     ,c='median_house_value',cmap=plt.get_cmap('jet'),colorbar=True, legend=True, ax=ax)
plt.show()

From the above, we can see that there are several areas with a low population but a high median_house_value. They appear to be between latitude 36,38 and longitude -124,-122 (centre left area of the chart), and between latitude 32,34 and longitude -120,-118 (bottom centre). We can zoom in:

In [None]:
fig, ax = plt.subplots(figsize=(10,5))
# filter once so that the marker sizes have the same length as the plotted rows
expensive = housing[housing['median_house_value']>400000]
expensive.plot.scatter(x='longitude',y='latitude'
                     ,alpha=0.3,s=expensive['population']/100,label='population'
                     ,c='median_house_value',cmap=plt.get_cmap('jet'),colorbar=True, legend=True, ax=ax)
plt.show()

It would be interesting to see which areas/counties these coordinates correspond to. To do that, we can use the geopy library. First we create coordinate_transformer, which formats our latitude and longitude coordinates and then looks up their corresponding area using the library.

In [None]:
from geopy.geocoders import Nominatim

def coordinate_transformer(latitude,longitude):
    """
    Takes latitude and longitude coordinates, pads them with trailing zeros for the geolocator
    request, and returns the name of the county they fall in ('Not Found' if the lookup fails).
    """
    number_rounder_lat,number_rounder_long = (9 - len(str(latitude))),(9 - len(str(longitude)))
    latitude = str(latitude) + '0'*number_rounder_lat
    longitude = str(longitude) + '0'*number_rounder_long
    geolocator = Nominatim(user_agent="california_median_housing_price")
    try:
        location = geolocator.reverse(latitude+", "+longitude)
        return location.raw['address']['county']
    except Exception:
        # covers failed requests, empty results (location is None) and addresses without a county
        return 'Not Found'

If we were to run a lookup request for every single record, it would take us almost 5 hours of run time (assuming that one request is handled per second). We already know that our latitude and longitude values are repeated multiple times. The ideal scenario is to look up only the unique combinations of latitude and longitude and then propagate the results to the remaining records.
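
The "look up once, then propagate" idea can be sketched without any network access. In the example below `fake_lookup` is a hypothetical stand-in for the slow geocoding request and the coordinates are made up; the point is only to show the `drop_duplicates` + left-join pattern.

```python
import pandas as pd

calls = []

def fake_lookup(lat, lon):
    """Hypothetical stand-in for a slow reverse-geocoding request."""
    calls.append((lat, lon))
    return "county_{}_{}".format(lat, lon)

# Made-up coordinates with repeats
df = pd.DataFrame({
    "latitude":  [37.8, 37.8, 34.1, 34.1, 34.1, 32.7],
    "longitude": [-122.3, -122.3, -118.2, -118.2, -118.2, -117.2],
})

# Look up each unique pair only once...
unique_pairs = df[["latitude", "longitude"]].drop_duplicates().reset_index(drop=True)
unique_pairs["county"] = [fake_lookup(lat, lon)
                          for lat, lon in zip(unique_pairs["latitude"], unique_pairs["longitude"])]

# ...then propagate the result to every row with a left join
df = df.merge(unique_pairs, on=["latitude", "longitude"], how="left")
print(len(calls), "lookups for", len(df), "rows")
```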

In [None]:
print('There are {} instances of lat/long in the dataset.'.format(housing.shape[0]))

In [None]:
_ = housing.groupby(['latitude','longitude'])['housing_median_age'].count().reset_index().drop(['housing_median_age'],axis=1)
print('There are {} unique combinations of lat/long in the dataset.'.format(_.shape[0]))

Clearly, even 10k lookups are far too many for us to handle. What if we rounded the values to 1 decimal place?

In [None]:
_['latitude'],_['longitude'] = np.round(_['latitude'],1),np.round(_['longitude'],1)
_ = _.groupby(['latitude','longitude']).count().reset_index()
print('If rounded to 1 decimal point, we have {} unique combinations.'.format(_.shape[0]))

Ok, that seems a little better. Let's see what happens if we also round the longitude to 0 decimal places.

In [None]:
_['longitude'] = np.round(_['longitude'])
_ = _.groupby(['latitude','longitude']).count().reset_index()
print('If rounded to 0 decimal points the longitude, we have {} unique combinations.'.format(_.shape[0]))

380 unique combinations is manageable. Let's find their respective locations.

In [None]:
#from timeit import default_timer as timer
county_list = []
#start = timer()
for lat, long in zip(_.latitude,_.longitude):
    county_list.append(coordinate_transformer(lat,long))
#end = timer()
#county_list = pd.Series(county_list)
#print(end - start)

In [None]:
_['county'] = county_list
housing['latitude_join'] = np.round(housing['latitude'],1)
housing['longitude_join'] = np.round(np.round(housing['longitude'],1))
housing = pd.merge(housing,_,how='left',left_on=['latitude_join','longitude_join'], right_on=['latitude','longitude']).drop(['latitude_join',
       'longitude_join', 'latitude_y', 'longitude_y'],axis=1)
housing.rename(columns={'longitude_x':'longitude','latitude_x':'latitude'},inplace=True)
housing

In [None]:
threshold = 400000
plt.axhline(y=threshold,linewidth=4, color='red')
housing[(housing.median_house_value>350000)].groupby(['county'])['median_house_value'].mean().plot(kind='bar',legend=True,figsize=(10,7),cmap=plt.get_cmap('jet'))
plt.legend(loc='best')
print('List of Counties that exceed the threshold:')
high_valued_houses_counties = []
for x in housing[(housing.median_house_value>400000)].groupby(['county'])['median_house_value'].mean().index:
    high_valued_houses_counties.append(x)
    print(x)

In [None]:
print('The percentage of districts in highly valued counties ($400k and above) is {:.2%}.'.format(housing[housing.county.isin(high_valued_houses_counties)].shape[0]/housing.shape[0]))
(housing[housing.county.isin(high_valued_houses_counties)]['ocean_proximity'].value_counts()/len(housing))

As I suspected, house prices are closely related to the location of the property. We can now start to look into any underlying correlations between our features.

In [None]:
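# restrict to numerical columns: recent pandas versions raise an error on text columns such as 'ocean_proximity' and 'county'
corr_matrix = housing.select_dtypes(include='number').corr()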
corr_matrix['median_house_value'].sort_values(ascending=False)

The correlation coefficient is useful for finding features that are linearly related to each other. In this way, we can use the value of one of our features (x) to infer our target (y). In this case, we are looking at how our target correlates with the remaining features. We can see that there is a strong linear relationship with 'median_income' <b>(0.687160)</b>. Another tool that we can use to investigate the correlation between attributes is the scatter_matrix function of pandas.
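
One caveat worth keeping in mind is that the correlation coefficient only captures linear relationships. The toy example below (synthetic values, unrelated to the housing data) shows a target that is fully determined by x and yet has a correlation of essentially zero with it.

```python
import numpy as np
import pandas as pd

x = pd.Series(np.linspace(-1.0, 1.0, 201))
linear = 3.0 * x + 1.0   # perfectly linear in x
quadratic = x ** 2       # fully determined by x, but not linearly

print("corr(x, linear)    =", round(x.corr(linear), 6))
print("corr(x, quadratic) =", round(x.corr(quadratic), 6))
```

A low coefficient therefore does not rule out a useful non-linear relationship, which is one reason to also look at scatter plots.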

In [None]:
from pandas.plotting import scatter_matrix
best_f = corr_matrix['median_house_value'].sort_values(ascending=False).head(4).index.to_list()
scatter_matrix(housing[best_f],figsize=(12,12))
plt.show()

From this scatter matrix we can note the previously mentioned caps on 'median_house_value' and 'median_income'. In our correlation analysis, we have identified 'median_income' as the strongest indicator for our target. Let's zoom in.

In [None]:
housing.plot(kind='scatter',x='median_income',y='median_house_value',alpha=0.1)

Now we can clearly see the straight capping line at the 500k mark, but also lines at approximately 450k and 350k. These can interfere with the performance of our model, so we may want to remove the affected districts. Let's first check how many districts sit on each line.

In [None]:
# count the districts sitting exactly on the horizontal lines that appear in the scatter plot
# (500001 is the cap value noted in the histograms above)
for value in (350000, 450000, 500000, 500001):
    print(value, housing[housing.median_house_value==value].shape[0])

Features that are skewed and that we might want to transform (e.g. by computing their log): <br>
- population
- median_income
- households
- total_bedrooms
- total_rooms

## 3. Feature Engineering

In [None]:
for f in housing.columns[2:]:
    print(f,housing[f].isnull().sum())

Before extracting any feature, we have to deal with the missing values and decide what to do with our skewed features. In our case, the 'total_bedrooms' feature has 158 missing values and is skewed towards the right. Therefore, a good strategy is to use sklearn to impute the missing values with the median.

In [None]:
import seaborn as sns
# histplot with kde=True replaces distplot, which is deprecated in recent seaborn versions
sns.histplot(housing.total_bedrooms.dropna(), kde=True)

In [None]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='median')
imputer.fit(housing['total_bedrooms'].values.reshape(-1,1))
imputer.statistics_

The median value for total_bedrooms is 433. To get the equivalent in pandas, we would do the following:

In [None]:
housing.total_bedrooms.median()

Ok, now we are sure that the imputer has learned the correct median. First, we save its value (so that we can apply it to our stratified test set), then we can apply the transformation.
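
The reason for saving the fitted value is that the test set should be imputed with the median learned on the training set, not with its own. A minimal sketch on made-up numbers (the two small frames are assumptions for the example):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy train/test columns with missing values (made-up numbers)
train = pd.DataFrame({"total_bedrooms": [100.0, 300.0, np.nan, 500.0, 700.0]})
test = pd.DataFrame({"total_bedrooms": [np.nan, 250.0]})

imputer = SimpleImputer(strategy="median")
imputer.fit(train[["total_bedrooms"]])   # learn the median on the training set only

# Reuse the same fitted imputer for both sets
train["total_bedrooms"] = imputer.transform(train[["total_bedrooms"]]).ravel()
test["total_bedrooms"] = imputer.transform(test[["total_bedrooms"]]).ravel()

print("learned median:", imputer.statistics_[0])
print("imputed test column:", test["total_bedrooms"].tolist())
```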

In [None]:
median_t = imputer.statistics_
# transform returns a 2-D array of shape (n, 1); ravel flattens it back to a single column
housing['total_bedrooms'] = imputer.transform(housing['total_bedrooms'].values.reshape(-1,1)).ravel()

Let's check if the values have been imputed correctly.

In [None]:
housing.isnull().sum()

'total_bedrooms' was not the only skewed feature; let's visualize all of them alongside a possible log transformation.

In [None]:
cols = ['population','median_income','households','total_bedrooms','total_rooms']
# one row per feature: raw distribution on the left, log-transformed distribution on the right
fig, axes = plt.subplots(nrows=len(cols),ncols=2,figsize=(20,20))
for row, col in enumerate(cols):
    # histplot with kde=True replaces distplot, which is deprecated in recent seaborn versions
    sns.histplot(housing[col], kde=True, label=col, ax=axes[row,0])
    axes[row,0].legend()
    sns.histplot(np.log(housing[col]), kde=True, label=col+'_log', ax=axes[row,1])
    axes[row,1].legend()
plt.tight_layout()
plt.show()

We can see that applying np.log reduces the skew of our features, making them closer to normally distributed. Let's apply the transformation.
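
The visual impression can be backed by a number, for example the sample skewness returned by pandas' `skew()`. The sketch below uses a synthetic log-normal column as a stand-in for a right-skewed feature; it is an illustration under that assumption, not a measurement on the housing data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Synthetic right-skewed column (log-normal), standing in for e.g. total_rooms
raw = pd.Series(rng.lognormal(mean=7.5, sigma=0.7, size=20_000))
logged = np.log(raw)

print("skew before log: {:.2f}".format(raw.skew()))
print("skew after log:  {:.2f}".format(logged.skew()))
```

A skewness close to 0 indicates a roughly symmetric distribution. For columns that can contain zeros, `np.log1p` is a common alternative since `np.log(0)` is not finite.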

In [None]:
cols = ['population','median_income','households','total_bedrooms','total_rooms']
for col in cols:
    housing[col] = np.log(housing[col])

Looking at the dataset features, we can also derive some additional ones. Note that, since the columns were log-transformed in the previous cell, the ratios in this exploratory step are computed on the log values. These can include:
- rooms_per_household. Knowing the total number of rooms in a district is not very informative for our prediction. However, having the number of rooms per household could be useful information.
- bedrooms_per_room. Here we calculate the ratio of bedrooms to rooms in each district.
- population_per_household. Here we calculate the number of people per household.
- bedrooms_per_household. Here we calculate the number of bedrooms per household.

In [None]:
housing['rooms_per_household'] = housing['total_rooms'] / housing['households']
housing['bedrooms_per_room'] = housing['total_bedrooms'] / housing['total_rooms']
housing['population_per_household'] = housing['population'] / housing['households']
housing['bedrooms_per_household'] = housing['total_bedrooms'] / housing['households']

We can now check whether any of these new features are correlated with our target value.

In [None]:
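# restrict to numerical columns: recent pandas versions raise an error on text columns
housing_corr = housing.select_dtypes(include='number').corr()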
housing_corr['median_house_value'].sort_values(ascending=False)

The new 'bedrooms_per_room' is much more correlated with our target than the total number of rooms or bedrooms. Apparently, houses with a lower bedroom/room ratio tend to be more expensive. We also find that 'population_per_household' is more informative than the total population (houses with a lower population/household ratio tend to be more expensive). 'bedrooms_per_household' has a lower correlation coefficient, which seems to indicate the lack of a linear relationship with our target.

Before we build our model we have one more thing to do: deal with the categorical features (as scikit-learn estimators expect numerical inputs). List of categorical features:

In [None]:
housing.select_dtypes(include=['O']).columns

'county' is a feature that we have extracted from the latitude and longitude coordinates. It is important to note that we have used approximate lat/long coordinates, so the real location might be slightly different. However, we have seen the trend that the most expensive houses tend to be closer to the ocean. This information is already encoded in 'ocean_proximity', which has a cardinality of 5 possible labels. Hence, we can discard 'county', as using this additional feature would just increase our feature space (which we do not want), and encode ocean_proximity instead.

In [None]:
housing.drop(['county'],axis=1,inplace=True)

In [None]:
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(handle_unknown='ignore')
encoded_f = ohe.fit_transform(housing.ocean_proximity.values.reshape(-1,1)).toarray()
# take the column order from the fitted encoder (categories_ is sorted);
# the order returned by unique() can differ and would mislabel the columns
cols = ['Ocean_prox__{}'.format(c) for c in ohe.categories_[0]]
encoded_f = pd.DataFrame(encoded_f,index=housing.index,columns=cols)
housing = pd.concat([housing,encoded_f],axis=1)
housing.drop(['ocean_proximity'],axis=1,inplace=True)
housing.head()
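
One detail worth checking when labelling one-hot columns by hand: the fitted encoder orders its output columns according to its `categories_` attribute (sorted), which need not match the order in which the labels first appear in the data. A small sketch on a made-up sample:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up sample; the labels mirror those of ocean_proximity
df = pd.DataFrame({"ocean_proximity": ["NEAR BAY", "<1H OCEAN", "INLAND", "NEAR BAY"]})

encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(df[["ocean_proximity"]]).toarray()

order_of_appearance = list(df["ocean_proximity"].unique())
encoder_order = list(encoder.categories_[0])
print("unique():    ", order_of_appearance)
print("categories_: ", encoder_order)

# Label the columns with the encoder's own order
encoded_df = pd.DataFrame(encoded, index=df.index,
                          columns=["Ocean_prox__{}".format(c) for c in encoder_order])
print(encoded_df)
```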

## 4. Build the pipeline

We have imputed the missing values, transformed our skewed features, extracted some additional features, and encoded the categorical attributes. Now let's put them all together inside a pipeline. First, I will start with a fresh copy of the original stratified training set and separate our target labels.

In [None]:
housing = strat_train_set.drop(['median_house_value'],axis=1).copy()
housing_labels = strat_train_set['median_house_value'].copy()

In [None]:
print(housing.shape, housing_labels.shape)
housing.columns

Now we have to create a couple of custom transformers to pass to our pipeline. To do that we can use the FunctionTransformer class. I will create two transformers: add_extra_features (to add the features we have previously extracted back to the training set) and log_transformation (which will transform our skewed features and make them closer to normal).
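
As a warm-up, here is a minimal sketch of how FunctionTransformer wraps a plain function, including how `kw_args` forwards keyword arguments. The helper `add_ratio` and the tiny array are hypothetical and only serve the illustration.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

def add_ratio(X, numerator_ix=0, denominator_ix=1):
    """Append one ratio column to a 2-D array (hypothetical helper for illustration)."""
    ratio = X[:, numerator_ix] / X[:, denominator_ix]
    return np.c_[X, ratio]

X = np.array([[10.0, 2.0],
              [9.0, 3.0]])

# kw_args is passed to the wrapped function on every call to transform
ratio_adder = FunctionTransformer(add_ratio, validate=False,
                                  kw_args={"numerator_ix": 0, "denominator_ix": 1})
X_out = ratio_adder.fit_transform(X)
print(X_out)
```

The wrapped function is stateless: nothing is learned in `fit`, so the same transformation is applied to training and test data.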

In [None]:
from sklearn.preprocessing import FunctionTransformer

def add_extra_features(X, add_bedrooms_per_room=True):
    # here I take the col index of each feature of interest
    rooms_ix, bedrooms_ix, population_ix, household_ix, median_income_ix = [
    list(housing.columns).index(col) for col in ("total_rooms", "total_bedrooms", "population", "households",'median_income')]
    
    # here I replicate the calculations I did before but this time I am using directly the col indexes
    rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
    population_per_household = X[:, population_ix] / X[:, household_ix]
    bedrooms_per_household = X[:,bedrooms_ix] / X[:,household_ix]
    median_income_per_household = X[:,median_income_ix] / X[:,household_ix]
    #I let the user decide if return bedrooms_per_room additional to the above calculate
    if add_bedrooms_per_room:
        bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
        return np.c_[X, rooms_per_household, population_per_household,bedrooms_per_household,
                     median_income_per_household,bedrooms_per_room]
    else:
        return np.c_[X, rooms_per_household, population_per_household,bedrooms_per_household,median_income_per_household]


def log_transformation(X):
    import numpy as np
    # get index cols
    population_ix,median_income_ix,household_ix,bedrooms_ix,rooms_ix =[
        list(housing.columns).index(col) for col in ('population','median_income','households','total_bedrooms','total_rooms')
    ]
    # log tranformation
    population_log = np.log(X[:,population_ix].astype('float64'))
    median_income_log = np.log(X[:,median_income_ix].astype('float64'))
    household_log = np.log(X[:,household_ix].astype('float64'))
    bedrooms_log = np.log(X[:,bedrooms_ix].astype('float64'))
    rooms_log = np.log(X[:,rooms_ix].astype('float64'))
    # return results
    return np.c_[X,population_log,median_income_log,household_log,bedrooms_log,rooms_log]

attr_adder = FunctionTransformer(add_extra_features, validate=False,
                                 kw_args={"add_bedrooms_per_room": True})
log_transformed = FunctionTransformer(log_transformation,validate=False)


In [None]:
housing_extra_attribs = attr_adder.fit_transform(housing.values)
housing_log_transformed = log_transformed.fit_transform(housing.values)

FunctionTransformer will return an array. We can also visualise the results as dataframes.

In [None]:
housing_extra_attribs = pd.DataFrame(
    housing_extra_attribs,
    columns=list(housing.columns)+['rooms_per_household', 'population_per_household','bedrooms_per_household',
                     'median_income_per_household','bedrooms_per_room'],
    index=housing.index)
housing_extra_attribs.head()

In [None]:
housing_log_transformed = pd.DataFrame(
    housing_log_transformed
    ,columns=list(housing.columns) + ['population_log','median_income_log','household_log','bedrooms_log','rooms_log']
    ,index=housing.index
)
housing_log_transformed.head()

Now we can create our transformation pipeline :).

In [None]:
housing.columns

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

numerical_pipeline= Pipeline([
    ('imputer',SimpleImputer(strategy='median',missing_values=np.nan))
    ,('log_transform',FunctionTransformer(log_transformation,validate=False))
    ,('add_features',FunctionTransformer(add_extra_features,validate=False))
    ,('std_scaler',StandardScaler())
])

housing_numerical_transformed = numerical_pipeline.fit_transform(housing.drop(['ocean_proximity'],axis=1))
housing_numerical_transformed

We use the ColumnTransformer because it allows us to apply different transformations to different features.
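
The following sketch shows the same pattern on a tiny made-up frame: numerical columns go through an impute-and-scale pipeline while the text column is one-hot encoded. Column names and values are assumptions for the example; `sparse_threshold=0` is set only to force a dense array so the result is easy to inspect.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny made-up frame with one missing value and one text column
df = pd.DataFrame({
    "rooms": [4.0, 6.0, np.nan, 8.0],
    "income": [2.0, 3.0, 5.0, 6.0],
    "proximity": ["INLAND", "NEAR BAY", "INLAND", "NEAR BAY"],
})

numeric = Pipeline([("imputer", SimpleImputer(strategy="median")),
                    ("scaler", StandardScaler())])
preprocess = ColumnTransformer([
    ("num", numeric, ["rooms", "income"]),      # numerical columns: impute, then scale
    ("cat", OneHotEncoder(), ["proximity"]),    # text column: one-hot encode
], sparse_threshold=0)                          # always return a dense array

out = preprocess.fit_transform(df)
print(out.shape)
print(out)
```

The output columns follow the order of the transformers: first the two scaled numerical columns, then one column per category.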

In [None]:
from sklearn.compose import ColumnTransformer
numerical_f = list(housing.drop(['ocean_proximity'],axis=1))
categorical_f = ["ocean_proximity"]

full_pipeline = ColumnTransformer([
    ('numericals',numerical_pipeline,numerical_f)
    ,('categorical',OneHotEncoder(),categorical_f)
])
housing_completed = full_pipeline.fit_transform(housing)

In [None]:
housing_completed

We can also visualise our array as a dataframe. To do that we pass the index and columns from our original housing set, plus the columns of the transformed features (following the output order of the custom transform functions), dropping the categorical feature and adding one column per category in the order used by the fitted OneHotEncoder.

In [None]:
# read the one-hot column order from the fitted encoder instead of hard-coding it
ohe_categories = full_pipeline.named_transformers_['categorical'].categories_[0]
housing_completed_df = pd.DataFrame(housing_completed, columns=list(housing.drop(['ocean_proximity'],axis=1).columns) +['population_log','median_income_log','household_log','bedrooms_log','rooms_log','rooms_per_household', 'population_per_household','bedrooms_per_household','median_income_per_household','bedrooms_per_room']+['Ocean_prox__{}'.format(c) for c in ohe_categories],index=housing.index)
housing_completed_df.head()

## 5. Model Preparation

Now the fun part, here we can finally start using ML models to find the ones which best fit our data.

#### 5.1 Linear Regression

Since we have standardized our features, we can initially attempt using a linear regression model.

In [None]:
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(housing_completed_df,housing_labels)

Let's check how this model performs. First I take a subset from the original train set and its corresponding target labels from the housing_labels set. I put the former through our pipeline transformation and compare the predictions of the linear model with the actual values.

In [None]:
some_data = housing.iloc[:5]
some_labels = housing_labels.iloc[:5]
some_data_prepared = full_pipeline.transform(some_data)
print('Predictions', np.round(lin_reg.predict(some_data_prepared),1))
print('Labels:',list(some_labels))

Are they close enough? We have to use some metrics before we can answer this question.

In [None]:
from sklearn.metrics import mean_squared_error
housing_predictions = lin_reg.predict(housing_completed_df)
lin_mse = mean_squared_error(housing_labels,housing_predictions)
lin_rmse = np.sqrt(lin_mse)
lin_rmse

The RMSE is the square root of the average squared difference between the estimated values and the actual values (the errors), so it is expressed in the same units as the target. A lower value indicates a better estimator. In this case a prediction error of 65856 with a target median of 179500 is not satisfactory. We can also use another metric, r2_score, which provides a measure of how well unseen samples are likely to be predicted by the model, through the proportion of explained variance.
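
The relationship between MSE, RMSE and R² can be checked on a few made-up numbers. The sketch below also recomputes R² by hand as one minus the ratio of the MSE to the variance of the target.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up targets and predictions (every error has size 10)
y_true = np.array([100.0, 150.0, 200.0, 250.0, 300.0])
y_pred = np.array([110.0, 140.0, 210.0, 240.0, 310.0])

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)                         # same units as the target
r2 = r2_score(y_true, y_pred)
r2_by_hand = 1.0 - mse / np.var(y_true)     # explained share of the target variance

print("MSE:", mse, "RMSE:", rmse)
print("R2:", r2, "R2 by hand:", r2_by_hand)
```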

In [None]:
print('Target Summary Statistics:\nMean: {:.2f}\nMedian: {:.2f}\nStandard Deviation: {:.2f}'.format(housing_labels.mean(),housing_labels.median(),housing_labels.std()))

In [None]:
from sklearn.metrics import r2_score
r_score = r2_score(housing_labels,housing_predictions)
r_score

An R² of about 0.67 means that the model explains roughly 67% of the variance of the target on the training set; it does not mean that 67% of the predictions are correct. This is an example of a model underfitting the training data. To solve this issue we can choose a more powerful model, engineer better features and feed them to the linear model, or reduce the constraints on the model. The last option can be ruled out as the model is not regularized. Therefore we are left with the remaining two. Before spending time on extracting additional features, let's fit a more powerful model on our training set.

#### 5.2 DecisionTree Regressor

In [None]:
from sklearn.tree import DecisionTreeRegressor
tree_reg = DecisionTreeRegressor()
tree_reg.fit(housing_completed_df,housing_labels)

In [None]:
housing_predictions = tree_reg.predict(housing_completed_df)
tree_mse = mean_squared_error(housing_labels,housing_predictions)
tree_rmse = np.sqrt(tree_mse)
tree_rmse

An error of 0 seems to indicate that our model has overfit the data. To get a more honest estimate we can use cross-validation from scikit-learn, which splits the training set into folds, trains the model on all folds but one, and evaluates it on the fold that was held out. In this case I am choosing 10 folds, meaning that the model is trained on 9 folds and evaluated on the remaining one, and this is repeated 10 times so that each fold is used for evaluation once.
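
To make the mechanics explicit, the sketch below runs the 10-fold loop by hand on synthetic data and compares the result with `cross_val_score`. It assumes the behaviour of `cv=10` for a regressor (unshuffled `KFold`), and the data and model settings are invented for the example.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X[:, 0] * 2.0 + rng.normal(scale=0.5, size=200)

# Manual loop: train on 9 folds, evaluate on the held-out fold, repeat 10 times
manual_rmse = []
for train_idx, valid_idx in KFold(n_splits=10).split(X):
    model = DecisionTreeRegressor(random_state=0)
    model.fit(X[train_idx], y[train_idx])
    errors = model.predict(X[valid_idx]) - y[valid_idx]
    manual_rmse.append(np.sqrt(np.mean(errors ** 2)))
manual_rmse = np.array(manual_rmse)

# Same evaluation through the helper used in the notebook
scores = cross_val_score(DecisionTreeRegressor(random_state=0), X, y,
                         scoring="neg_mean_squared_error", cv=10)
helper_rmse = np.sqrt(-scores)

print("manual mean RMSE:", manual_rmse.mean())
print("helper mean RMSE:", helper_rmse.mean())
```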

In [None]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(tree_reg,housing_completed_df,housing_labels,scoring='neg_mean_squared_error',cv=10)
rmse_score = np.sqrt(-scores)

In [None]:
print('Scores:\n{}\nMean: {}\nStandard Deviation:{} '.format(rmse_score,rmse_score.mean(),rmse_score.std()))

The decision tree model has a score of approximately 71471±2268 (meaning that about 95% of values range between 66935 and 76007, i.e. the mean plus or minus two standard deviations). Clearly, this model doesn't score that well. Actually, it is underperforming when compared to our linear regression score (65856.0815), although that figure was measured on the training set. Just to be sure that this is the case, let's cross-validate the linear regression model too.

In [None]:
lin_score = cross_val_score(lin_reg,housing_completed_df,housing_labels,scoring='neg_mean_squared_error',cv=10)
lin_rmse_score = np.sqrt(-lin_score)
print('Scores:\n{}\nMean: {}\nStandard Deviation:{} '.format(lin_rmse_score,lin_rmse_score.mean(),lin_rmse_score.std()))

#### 5.3 RandomForest Regressor

Another model we can apply is the RandomForestRegressor. This model trains many Decision Trees, each on a random sample of the training data and considering random subsets of the features when splitting, and then averages their predictions.
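
The averaging step can be verified directly: the fitted trees are available through the `estimators_` attribute, and the forest's prediction should match the mean of their individual predictions. The data below is synthetic and only serves the illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.3, size=300)

forest = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)

# Predictions of each individual tree, stacked into shape (n_trees, n_samples)
per_tree = np.stack([tree.predict(X) for tree in forest.estimators_])
averaged = per_tree.mean(axis=0)

print("trees in the forest:", len(forest.estimators_))
print("forest equals mean of trees:", np.allclose(averaged, forest.predict(X)))
```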

In [None]:
from sklearn.ensemble import RandomForestRegressor
forest_reg = RandomForestRegressor(n_estimators=10,random_state=0)
forest_reg.fit(housing_completed_df,housing_labels)


In [None]:
housing_predictions = forest_reg.predict(housing_completed_df)
forest_mse = mean_squared_error(housing_labels,housing_predictions)
forest_rmse = np.sqrt(forest_mse)
forest_rmse

In [None]:
forest_scores = cross_val_score(forest_reg,housing_completed_df,housing_labels,scoring='neg_mean_squared_error',cv=10)
forest_rmse_scores = np.sqrt(-forest_scores)
print('Scores:\n{}\nMean: {}\nStandard Deviation:{} '.format(forest_rmse_scores,forest_rmse_scores.mean(),forest_rmse_scores.std()))

This model looks more promising than the previous two. We can save it for future use and move on to the hyperparameter tuning phase (which is essentially where we search for the model settings that give the best cross-validated score).

In [None]:
import joblib
joblib.dump(forest_reg,'forest_reg.pkl')
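
For completeness, a saved model can be restored with `joblib.load`. The sketch below does a full round trip on a small synthetic model written to a temporary directory (both are assumptions for the example; in this notebook the file is `forest_reg.pkl`).

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Small synthetic model, only to demonstrate the save/load round trip (assumption).
X, y = make_regression(n_samples=100, n_features=4, random_state=0)
model = RandomForestRegressor(n_estimators=5, random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'forest_reg.pkl')
    joblib.dump(model, path)        # write the fitted model to disk
    restored = joblib.load(path)    # read it back into memory

# The restored model should predict exactly like the original one.
same_predictions = np.array_equal(model.predict(X), restored.predict(X))
print(same_predictions)
```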

## 6. Model Tuning

A straightforward way to search for good hyperparameters is GridSearchCV, which automatically builds every combination of the parameter values we provide and evaluates each one with cross-validation.

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = [
    {'n_estimators': [3, 10, 30,40,50], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
  ]
forest_reg = RandomForestRegressor(random_state=0)
grid_search = GridSearchCV(forest_reg,param_grid,cv=5,scoring='neg_mean_squared_error', return_train_score=True)
grid_search.fit(housing_completed_df,housing_labels)
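
To get a feel for the cost of this search, we can count the candidates with `ParameterGrid`, which enumerates the same combinations GridSearchCV evaluates. This is only a sketch of the bookkeeping; the grid is copied from the cell above.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the search above.
param_grid = [
    {'n_estimators': [3, 10, 30, 40, 50], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]

n_candidates = len(ParameterGrid(param_grid))  # 5*4 + 1*2*3
cv_folds = 5
total_fits = n_candidates * cv_folds           # one fit per candidate per fold

print(n_candidates, total_fits)  # 26 130
```

On top of these fits, GridSearchCV refits the best candidate once on the whole training set (its default `refit=True`).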

Let's check the best parameters and estimator:

In [None]:
grid_search.best_params_

In [None]:
grid_search.best_estimator_

We can also check the scores of each hyperparameter combination:

In [None]:
grid_results = grid_search.cv_results_
grid_results.keys()

In [None]:
for mean_score,params in zip(grid_results['mean_test_score'],grid_results['params']):
    print(np.sqrt(-mean_score),params)

In this example, we obtain the best solution by setting max_features to 8 and n_estimators to 50. The RMSE for this combination is 51604, slightly better than the first random forest's score (53129). The fact that the grid search chose 50 for n_estimators (the maximum value we provided) may indicate that we should re-run the search with higher values. When the hyperparameter search space is large, it is often preferable to use a RandomizedSearchCV, which samples a fixed number of combinations instead of trying them all.

In [None]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint
param_distributions = {
    # scipy's randint excludes `high`: n_estimators is drawn from 1..199, max_features from 1..7
    'n_estimators': randint(low=1, high=200),
    'max_features': randint(low=1, high=8),
}

forest_reg = RandomForestRegressor(random_state=0)
random_search = RandomizedSearchCV(forest_reg,param_distributions=param_distributions,n_iter=20,cv=5,scoring='neg_mean_squared_error', return_train_score=True)
random_search.fit(housing_completed_df,housing_labels)
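
To see what kind of candidates the randomized search draws, we can sample from the same distributions with `ParameterSampler`. This is a sketch: the `random_state=0` seed is an assumption added for reproducibility and is not part of the search above (RandomizedSearchCV accepts a `random_state` argument for the same purpose).

```python
from scipy.stats import randint
from sklearn.model_selection import ParameterSampler

# Same distributions as in the search above.
param_distributions = {
    'n_estimators': randint(low=1, high=200),
    'max_features': randint(low=1, high=8),
}

# Draw 20 candidate configurations, as n_iter=20 does in the search.
samples = list(ParameterSampler(param_distributions, n_iter=20, random_state=0))
for candidate in samples[:5]:
    print(candidate)

# randint excludes the upper bound, so max_features never reaches 8.
largest_max_features = max(s['max_features'] for s in samples)
```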

In [None]:
random_search.best_params_

In [None]:
for mean_scores,params in zip(random_search.cv_results_['mean_test_score'],random_search.cv_results_['params']):
    print(np.sqrt(-mean_scores),params)

Hey, our randomized search with 7 features and 151 estimators has given us a slightly better error score (51212 versus 51604 for the grid search). Having picked the model with the best hyperparameters, we can check the relative importance of each feature for making accurate predictions:

In [None]:
model = random_search.best_estimator_
model.feature_importances_

They don't mean much, unless we pair them with our corresponding features:

In [None]:
feature_names = housing_completed_df.columns
sorted(zip(model.feature_importances_,feature_names),reverse=True)
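
An alternative way to pair importances with names is a pandas Series, which is easier to sort, filter, and plot. A minimal sketch on synthetic data with hypothetical column names (`feat_a` to `feat_d` are not columns of this notebook):

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data with hypothetical feature names (assumption).
X, y = make_regression(n_samples=300, n_features=4, n_informative=2, random_state=0)
X_df = pd.DataFrame(X, columns=['feat_a', 'feat_b', 'feat_c', 'feat_d'])

model = RandomForestRegressor(n_estimators=20, random_state=0).fit(X_df, y)

# Index the importances by feature name and sort from most to least important.
importances = pd.Series(model.feature_importances_, index=X_df.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

The importances of a fitted forest sum to 1, so each value can be read as a share of the total.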

It looks like only one categorical feature from the one-hot encoder actually contributes to the model's performance. In general, some of our log-transformed features rank higher than their untransformed counterparts. So, we can train another regressor without the weaker features to see the difference (keeping the same hyperparameters).

In [None]:
# Columns that contributed little according to the feature importances.
columns_to_drop = ['Ocean_prox__<1H OCEAN', 'Ocean_prox__INLAND', 'Ocean_prox__NEAR BAY',
                   'Ocean_prox__ISLAND', 'population', 'total_rooms', 'household_log']
housing_completed_less_features = housing_completed_df.drop(columns_to_drop, axis=1)

forest_reg_2 = RandomForestRegressor(n_estimators=151,random_state=0,max_features=7)
forest_reg_2.fit(housing_completed_less_features,housing_labels)


In [None]:
housing_predictions = forest_reg_2.predict(housing_completed_less_features)
forest_mse = mean_squared_error(housing_labels,housing_predictions)
forest_rmse = np.sqrt(forest_mse)
forest_rmse

Not bad, especially when compared to the training RMSE of the first random forest (22279). However, this is again an error measured on the training data, so it's better to cross-validate once more.

In [None]:
forest_scores = cross_val_score(forest_reg_2,housing_completed_less_features,housing_labels,scoring='neg_mean_squared_error',cv=10)
forest_rmse_scores = np.sqrt(-forest_scores)
print('Scores:\n{}\nMean: {}\nStandard Deviation:{} '.format(forest_rmse_scores,forest_rmse_scores.mean(),forest_rmse_scores.std()))

We went from a mean error score of 53129 to 50613. Not bad. In theory we could run another randomized search to find even better parameters for our model. But hey, the r2 score is 80% and this is my second notebook. I call it a win for today. In the future we could also try an SVM or a GradientBoostingRegressor and tune their parameters to search for better scores. For now, this model will do just fine.

In [None]:
forest_r2_scores = cross_val_score(forest_reg_2,housing_completed_less_features,housing_labels,scoring='r2',cv=10)
print('Scores:\n{}\nMean: {}\nStandard Deviation:{} '.format(forest_r2_scores,forest_r2_scores.mean(),forest_r2_scores.std()))

## 7. Evaluation

Okay, in the final part we take our unused stratified test set, transform it via the pipeline, make predictions with the latest random forest regressor, and calculate the RMSE and r2 scores.

In [None]:
final_model = forest_reg_2
final_model

In [None]:
X_test = strat_test_set.drop(['median_house_value'],axis=1)
y_test = strat_test_set['median_house_value'].copy()

X_test.shape,y_test.shape

In [None]:
from sklearn.metrics import mean_squared_error, r2_score

X_test_preprocessed = full_pipeline.transform(X_test)

# The original pipeline still outputs every feature, so keep only the columns
# the final model was trained on.
X_test_preprocessed = pd.DataFrame(X_test_preprocessed,
                                   columns=housing_completed_df.columns,
                                   index=strat_test_set.index)
X_test_preprocessed = X_test_preprocessed[housing_completed_less_features.columns]

final_predictions = final_model.predict(X_test_preprocessed)

final_mse = mean_squared_error(y_test,final_predictions)
final_rmse = np.sqrt(final_mse)
final_r2_score = r2_score(y_test,final_predictions)
print('RMSE Score: {}\nR2 Score: {}'.format(final_rmse,final_r2_score))
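
As a reminder of what these two numbers measure, here is a small sketch that computes RMSE and r2 by hand and compares them with scikit-learn. The four prices are made up for the example and have nothing to do with the test set.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up true and predicted prices (assumption, illustration only).
y_true = np.array([200000.0, 150000.0, 320000.0, 410000.0])
y_pred = np.array([210000.0, 140000.0, 300000.0, 450000.0])

# RMSE: square root of the mean squared error.
rmse_manual = np.sqrt(np.mean((y_pred - y_true) ** 2))

# r2: 1 minus residual sum of squares over total sum of squares.
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))
r2_sklearn = r2_score(y_true, y_pred)
print(rmse_manual, rmse_sklearn)
print(r2_manual, r2_sklearn)
```

RMSE is expressed in the same unit as the target (dollars here), while r2 is the share of the target's variance that the model explains.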

We can also compute a 95% confidence interval for the test RMSE using z-scores:

In [None]:
from scipy import stats
confidence = 0.95
squared_errors = (final_predictions - y_test) ** 2
mean = squared_errors.mean()
m = len(squared_errors)

zscore = stats.norm.ppf((1 + confidence) / 2)
zmargin = zscore * squared_errors.std(ddof=1) / np.sqrt(m)
np.sqrt(mean - zmargin), np.sqrt(mean + zmargin)
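
A common alternative is to use the Student t distribution instead of the normal one. The sketch below compares the two on synthetic squared errors (an assumption, not the real test errors); with a few thousand samples the two intervals are nearly identical, and the t interval is marginally wider.

```python
import numpy as np
from scipy import stats

# Synthetic squared errors standing in for (final_predictions - y_test) ** 2 (assumption).
rng = np.random.default_rng(0)
squared_errors = rng.normal(0, 50000, size=4000) ** 2

confidence = 0.95
m = len(squared_errors)
mean = squared_errors.mean()
sem = stats.sem(squared_errors)  # standard error of the mean: std(ddof=1) / sqrt(m)

# z-based interval, as in the cell above.
z = stats.norm.ppf((1 + confidence) / 2)
z_interval = np.sqrt([mean - z * sem, mean + z * sem])

# t-based interval with m - 1 degrees of freedom.
t_interval = np.sqrt(stats.t.interval(confidence, m - 1, loc=mean, scale=sem))

print(z_interval)
print(t_interval)
```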


That's it folks. We have created a model with an r2 score of 81% and an RMSE of 48842 (roughly 26% lower than the original 65856 given by the linear regressor). I will come back to this notebook and update it from time to time with new things I learn along the way. <br>

Ideas for future improvement:
- add references and expand description of the notebook
- use SVM, GradientBoostingRegressor
- create a single pipeline to do transformation and prediction all at once
- deal with the capped values (e.g. 500k, 450k, etc.) shown by the scatter plot in the EDA section.
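
As a starting point for the single-pipeline idea, here is a rough sketch of how preprocessing and prediction could be chained with `Pipeline` and `ColumnTransformer`. Everything in it is an assumption for illustration: the toy DataFrame is randomly generated, it uses only three columns, and the imputation and encoding steps are simplified compared with the transformations in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Randomly generated toy data with a few housing-like columns (assumption).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    'median_income': rng.uniform(1, 10, n),
    'total_bedrooms': rng.uniform(100, 1000, n),
    'ocean_proximity': rng.choice(['INLAND', 'NEAR BAY', 'NEAR OCEAN'], n),
})
toy.loc[::25, 'total_bedrooms'] = np.nan  # inject some missing values
target = 40000 * toy['median_income'] + rng.normal(0, 10000, n)

# Numeric columns get median imputation, the categorical one is one-hot encoded.
preprocess = ColumnTransformer([
    ('num', SimpleImputer(strategy='median'), ['median_income', 'total_bedrooms']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['ocean_proximity']),
])

# Preprocessing and model in one estimator: fit and predict take raw DataFrames.
model = Pipeline([
    ('preprocess', preprocess),
    ('forest', RandomForestRegressor(n_estimators=20, random_state=0)),
])
model.fit(toy, target)
predictions = model.predict(toy.head())
print(predictions)
```

With this layout the test set would not need a separate transform step, and the whole pipeline could be passed to cross-validation or a hyperparameter search as a single estimator.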