In [None]:
import pandas as pd
import numpy as np
import calendar
import matplotlib.pyplot as plt
import seaborn as sns
import folium
from folium.plugins import HeatMap
from IPython.display import IFrame

import warnings
warnings.filterwarnings('ignore')

pd.set_option('display.max_columns', 500)
%matplotlib inline

In [None]:
df = pd.read_csv('../input/housesalesprediction/kc_house_data.csv')

## Data overview (understanding the data)

In [None]:
df.head(10)

I found descriptions of the column headers in the study linked below.<br><br>
https://www.slideshare.net/PawanShivhare1/predicting-king-county-house-prices

id - Unique ID for each home sold<br>
date - Date of the home sale<br>
price - Price of the home sale<br>
bedrooms - Number of bedrooms<br>
bathrooms - Number of bathrooms<br>
sqft_living - Square footage of the apartment's interior living space<br>
sqft_lot - Square footage of lot (area around the house) space<br>
floors - Number of floors<br>
waterfront - Dummy variable whether house is located next to water body<br>
view - Index from 0 to 4 of describing how good the view of the house is<br>
condition - Index from 1 to 5 describing what the condition of the building (1 is worst)<br>
grade - Index from 1 to 13 describing quality level of construction and design<br>
sqft_above - Square footage of house interior that is above the ground level<br>
sqft_basement - Square footage of house interior that is below the ground level<br>
yr_built - Year in which house was built<br>
yr_renovated - Year of last house renovation<br>
zipcode - Zipcode<br>
lat - Latitude<br>
long - Longitude<br>
sqft_living15 - The average house square footage for the closest 15 houses<br>
sqft_lot15 - The average lot square footage for the closest 15 houses<br><br>
Additional explanation can be found at the link below:<br>
https://info.kingcounty.gov/assessor/esales/Glossary.aspx?type=r

In [None]:
df.info()

## Data preprocessing and initial analysis

This part is devoted to transforming columns of the dataframe into the desired form (or creating new ones from existing ones). It includes an early form of both visualization and feature engineering.

<br>Check whether there are any NA or NULL values in the dataset.

In [None]:
df.isna().any().any()

In [None]:
df.isnull().any().any()

<b>Date</b><br>
Values in the date column are stored as strings in a datetime-like layout. My purpose is to create columns with the year, the month (number and abbreviation) and the full date.

In [None]:
df['year_sale'] = df['date'].str[:4].astype(int)

In [None]:
df['month_sale_num'] = df['date'].str[4:6].astype(int)
df['month_sale_name'] = df['month_sale_num'].apply(lambda x: calendar.month_abbr[x])

In [None]:
df['year_month_day_sale'] = pd.to_datetime(df['date'].str[:4] +"-"+ df['date'].str[4:6] + "-" +  df['date'].str[6:8])

In [None]:
df['month_sale_name'].sample(10)
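
As an alternative to slicing the string by position, the whole value can be parsed once and the parts read from the result. This is only a sketch on toy values; the `YYYYMMDDT000000` layout used in the `format` argument is my assumption about how the raw strings look.

```python
import calendar
import pandas as pd

# Toy values; the 'YYYYMMDDT000000' layout is an assumption about the raw strings.
dates = pd.Series(['20141013T000000', '20150225T000000'])

parsed = pd.to_datetime(dates, format='%Y%m%dT%H%M%S')
year_sale = parsed.dt.year
month_sale_num = parsed.dt.month
month_sale_name = month_sale_num.map(lambda m: calendar.month_abbr[m])

print(year_sale.tolist(), month_sale_name.tolist())
```

Parsing with an explicit format fails loudly on a malformed value, while positional slicing would silently produce a wrong number.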

<br><b>Bathrooms</b>

In [None]:
df['bathrooms'].unique()

The fractional values suggest that bathrooms are not simply counted. The value probably also reflects which facilities a bathroom has (bathtub, shower).

<br><b>Bedrooms</b>

In [None]:
df['bedrooms'].value_counts().sort_index()

In [None]:
df.loc[df['bedrooms'] >= 10]

For modeling purposes it is reasonable to merge the top five categories (8 bedrooms and more) into one.

In [None]:
df['bedrooms'] = df['bedrooms'].apply(lambda x: 8 if x>=8 else x)
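
The same capping can be written without a lambda. A minimal sketch on made-up values, using `Series.clip`:

```python
import pandas as pd

bedrooms = pd.Series([0, 3, 8, 11, 33])  # toy values, not the real column
capped = bedrooms.clip(upper=8)          # everything above 8 becomes 8
print(capped.tolist())
```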

<br><b>Floors</b>

In [None]:
df['floors'].value_counts().sort_index()

In my opinion there should be an alternative floors column with only integer values.

In [None]:
df['floors_int'] = df['floors'].round(0).astype('int')
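
One detail worth knowing about `round(0)`: it follows NumPy's round-half-to-even rule, so 1.5 and 2.5 both end up as 2. The sketch below shows this on toy values, next to a floor-based alternative (my suggestion, not something used elsewhere in this notebook) that counts only full floors.

```python
import numpy as np
import pandas as pd

floors = pd.Series([1.0, 1.5, 2.0, 2.5, 3.0, 3.5])  # toy values
rounded = floors.round(0).astype(int)     # half values go to the nearest even integer
floored = np.floor(floors).astype(int)    # alternative: count only full floors
print(rounded.tolist())
print(floored.tolist())
```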

<br><b>Waterfront</b>

In [None]:
df['waterfront'].value_counts()

<br><b>View</b>

In [None]:
df['view'].value_counts().sort_index()

<br><b>Condition</b>

In [None]:
df['condition'].value_counts().sort_index()

<br><b>Grade</b>

In [None]:
df['grade'].value_counts().sort_index()

According to kingcounty.gov, grades 1-6 are low quality, 7-8 average, 9-11 high and 12-13 very high. Therefore I am going to put these grades into four categories.

In [None]:
def grades_to_categories(col):
    if col in [1,2,3,4,5,6]:
        return 1
    elif col in [7,8]:
        return 2
    elif col in [9,10,11]:
        return 3
    elif col in [12,13]:
        return 4

In [None]:
df['grade_category'] = df['grade'].apply(grades_to_categories)
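
The same binning can be expressed with `pd.cut`. This is a sketch on toy grades; the bin edges mirror the four groups described above.

```python
import pandas as pd

grade = pd.Series([1, 6, 7, 8, 9, 11, 12, 13])  # toy values
# Bins are right-inclusive: (0, 6], (6, 8], (8, 11], (11, 13]
grade_category = pd.cut(grade, bins=[0, 6, 8, 11, 13], labels=[1, 2, 3, 4]).astype(int)
print(grade_category.tolist())
```

Unlike the if/elif function, `pd.cut` yields a missing value for grades outside the bins, so the final `astype(int)` would raise instead of silently passing `None` along.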

Column display just to check...

In [None]:
df['grade_category'].value_counts().sort_index()

<br><b>Year built and year renovated</b>

In [None]:
print("Year built:",df['yr_built'].value_counts().count(),"Year renovated:",df['yr_renovated'].value_counts().count())

So both year built and year renovated have a large number of unique values. A better idea is to display them as histograms.

In [None]:
plt.hist(df['yr_built'], bins=100)
plt.show()

In [None]:
plt.hist(df['yr_renovated'], bins=100)
plt.show()

From the histograms I can draw two conclusions. First, during World War II there was a significant drop in house building in the area. That may not be helpful by itself, but a category indicating houses built before 1945 may help improve the model.<br>
The second is fairly obvious: most houses were never renovated. In my opinion the best use of this column is to derive another column indicating whether the house was ever renovated.

In [None]:
df['built_after_ww2'] = df['yr_built'].map(lambda x: x>1945)
df['house_renovated'] = df['yr_renovated'].map(lambda x: x != 0)

<br><b>Years since construction</b><br>
The dataset description says the data was collected between May 2014 and May 2015. It is reasonable to subtract the year built from the year in the year_sale column to see how many years have passed since construction.

In [None]:
df['years_since_construction'] = df['year_sale'] - df['yr_built']

In [None]:
plt.hist(df['years_since_construction'], bins=100)
plt.show()

As expected, this histogram is a mirrored version of the yr_built histogram.

<br><b>Zipcode</b>

Since the dataset contains geographic coordinates, zipcode is not needed to place households on the map; it may instead reveal better or worse districts.

In [None]:
print(df['zipcode'].unique())
print("Number of unique districts:",df['zipcode'].unique().size)

These numbers alone don't tell much. My idea is simply to try them in the model at a later stage.

### Visualization and Outliers
<br>The reason I put these two issues under one topic is that visualization techniques make it easy to catch outliers. This part may also contain further feature engineering.

<br><b>Price</b><br>


In [None]:
fig, axes = plt.subplots(1, 2,figsize=(15,5))
sns.boxplot(y = df['price'], ax=axes[0])

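# distplot is deprecated in recent seaborn; histplot with a KDE overlay is the closest replacement
sns.histplot(df['price'], kde=True, stat='density', ax=axes[1])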
sns.despine(left=True, bottom=True)

axes[0].set(ylabel='Price')
axes[0].yaxis.tick_left()

axes[1].yaxis.set_label_position("left")
axes[1].yaxis.tick_left()
axes[1].set(xlabel='Price', ylabel='Distribution');

fig, axes = plt.subplots(1,2,figsize=(15,10))
sns.scatterplot(y = df['price'],x=df['sqft_living'], ax=axes[0])
sns.scatterplot(y = df['price'],x=df['sqft_lot'], ax = axes[1])
axes[0].set(xlabel = 'Square foot of living area',ylabel="Price")
axes[1].set(xlabel = 'Square foot of lot',ylabel="Price");

In [None]:
print(df.loc[df['price'] >=4000000].shape[0],df.loc[df['price'] >=3000000].shape[0],df.loc[df['price'] >=2000000].shape[0])

My conclusions are:<br>
Square footage of living area looks much more correlated with price than the area around the house. To look closer, I will create the same scatterplots for each individual zipcode.

There are only 12 properties priced at 4m or more, 50 at 3m or more and 205 at 2m or more. I will keep this in mind and decide later whether to exclude these outliers (this may improve the models).

In [None]:
g = sns.FacetGrid(df, col = "zipcode", height=5,col_wrap=5)
g.map(plt.scatter, "price",'sqft_living', color = 'red');

In [None]:
g = sns.FacetGrid(df, col = "zipcode", height=5,col_wrap=5)
g.map(plt.scatter, "price",'sqft_lot', color = 'blue');

Looking at these scatterplots it is possible to recognize better or worse zipcodes. What's interesting is that I can also pick out zipcodes that are probably downtown (the lot area doesn't increase with the price). I will make another column marking these 'urban' zipcodes.

In [None]:
zipcode_list = [98004,98006,98007,98008,98033,98034,98039,98040,98056,98102,98103,98105,98106,98107,98108,98109,98112,98115,98116,
               98117,98118,98119,98122,98125,98126,98133,98136,98144,98146,98148,98155,98166,98168,98177,98178,98188,98198,98199]

df['urban_zipcode'] = df['zipcode'].map(lambda x: x in zipcode_list)

A different approach is to check how many houses have a lot area equal to 0.

In [None]:
print("sqft_lot equal to zero:",df.loc[df['sqft_lot']==0].shape[0])

There are no such cases.

<br><b>Longitude and Latitude</b>
<br>The first thing to do is to check the boundaries of both columns. The purpose is to check the correctness of the data (a difference of more than two degrees of latitude or longitude would be suspicious).

In [None]:
print("Min:",min(df['lat']), "Max:",max(df['lat']), "Difference:", max(df['lat']) - min(df['lat']))

In [None]:
print("Min:",min(df['long']), "Max:",max(df['long']), "Difference:",max(df['long']) - min(df['long']))

So this data looks okay.
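
To get a feeling for what a difference in degrees means on the ground, it can be converted to kilometres. This is a rough sketch: the 111 km per degree of latitude is an approximation, the helper name `extent_km` is made up, and the bounds passed in are placeholders rather than the values printed above.

```python
import math

KM_PER_DEG_LAT = 111.0   # rough figure, an assumption for this sketch

def extent_km(lat_min, lat_max, long_min, long_max):
    """Approximate north-south and east-west extent of a bounding box in km."""
    mid_lat = math.radians((lat_min + lat_max) / 2)
    north_south = (lat_max - lat_min) * KM_PER_DEG_LAT
    # A degree of longitude shrinks with the cosine of the latitude.
    east_west = (long_max - long_min) * KM_PER_DEG_LAT * math.cos(mid_lat)
    return north_south, east_west

# Placeholder bounds, not the values printed above.
ns, ew = extent_km(47.0, 48.0, -122.5, -121.5)
print(round(ns), round(ew))
```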

In [None]:
import folium
from folium.plugins import HeatMap


m = folium.Map(location=[df['lat'].mean(), df['long'].mean(),],
                        zoom_start=9.4,
                        tiles="CartoDB dark_matter")


# One heat point per unique coordinate pair
HeatMap(data=df[['lat','long']].drop_duplicates().values.tolist(),radius=11.5).add_to(m)


#m.save("map.html")

m

In [None]:
#IFrame(src='map.html', width=700, height=600)

<i>Note: the folium module is needed to generate the html file. It can be installed with pip (pip install folium).</i>
<br><br>The heatmap above displays the density of house sales. If I could find a geojson file with district borders for King County, I could write code for a choropleth map. A choropleth looks like a patchwork: areas get different colors corresponding to a chosen attribute or statistic.
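
Even without the geojson file, the statistic that such a map would color by can be prepared in advance. A minimal sketch on a toy frame, assuming median price per zipcode is the chosen statistic:

```python
import pandas as pd

# Toy rows; zipcodes and prices are made up for illustration.
toy = pd.DataFrame({
    'zipcode': [98004, 98004, 98103, 98103, 98103],
    'price':   [1200000, 1500000, 600000, 650000, 700000],
})
median_price = toy.groupby('zipcode', as_index=False)['price'].median()
print(median_price)
```

A table of this shape (one row per area, one value column) is the usual input for a choropleth once the area borders are available.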

<br><b>Price vs House age</b>

In [None]:
fig, axe = plt.subplots(1, 1,figsize=(15,7))
reg = sns.regplot(y = df['price'],x = df['years_since_construction'], scatter_kws={"s": 0.3})
axes = reg.axes
axes.set_ylim(0,1500000)

axe.yaxis.set_label_position("left")
axe.yaxis.tick_left()
axe.set(xlabel='House age', ylabel='Price');


As seen above, house age does not affect price much. The regression line is nearly parallel to the X axis, descending slightly with house age.

<br><b>Price vs Square footages</b>

In [None]:
def plot_sqft_regplot(outlier_limit,features):
    
    df_copy = df.copy()
    df_copy = df_copy.loc[df_copy['price'] <= outlier_limit]
    
    fig, axes = plt.subplots(len(features), 1,figsize=(20,60))
    
    for i, feature in enumerate(features):
        
        reg = sns.regplot(x=df_copy[feature],y=df_copy['price'], ax=axes[i], fit_reg=True, scatter_kws={"s": 0.5})
        reg.tick_params(labelsize=15)
        ax = reg.axes
        ax.set_xlabel(feature, fontsize = 30)
        ax.set_ylabel('Price',fontsize= 30)
        ax.grid(True)
        



In [None]:
plot_sqft_regplot(3000000,['sqft_living','sqft_lot','sqft_above','sqft_basement','sqft_living15','sqft_lot15'])

As this check was made for sqft_lot, I will also check whether sqft_basement and sqft_lot15 have values equal to 0. If so, a separate column will be needed for such cases.

In [None]:
print("sqft_basement equal to zero:",df.loc[df['sqft_basement']==0].shape[0])
print("sqft_lot15 equal to zero:",df.loc[df['sqft_lot15']==0].shape[0])

Over a half of houses have no basement. That deserves a separate category.

In [None]:
df['no_basement'] = df['sqft_basement'].map(lambda x: int(x==0))

<br><b>Price vs categorical or quasi-categorical features</b><br>
My visualization code is inspired by the work linked below. Boxplots do a great job of presenting a few statistics on one plot when there are only several categories to compare. For better visibility I wrap these box plots in a function with an outlier limit (cutoff) option.
<br>
Link:
<i>https://www.kaggle.com/burhanykiyakoglu/predicting-house-prices</i>

In [None]:
def plot_categorical_features(outlier_limit):
    
    df_copy = df.copy()
    df_copy = df_copy.loc[df_copy['price'] <= outlier_limit]
    fig, axes = plt.subplots(5, 2,figsize=(17,45))

    sns.boxplot(x=df_copy['grade'],y=df_copy['price'], ax=axes[0][0])
    axes[0][0].set(xlabel='Grade', ylabel='Price')
    axes[0][0].yaxis.tick_left()
    axes[0][0].grid(True)

    sns.boxplot(x=df_copy['grade_category'],y=df_copy['price'], ax=axes[0][1])
    axes[0][1].yaxis.set_label_position("right")
    axes[0][1].yaxis.tick_right()
    axes[0][1].set(xlabel='Grade categorized', ylabel='Price')
    axes[0][1].grid(True)


    sns.boxplot(x=df_copy['view'],y=df_copy['price'], ax=axes[1][0])
    axes[1][0].yaxis.tick_right()
    axes[1][0].set(xlabel='View', ylabel='Price')
    axes[1][0].grid(True)

    sns.boxplot(x=df_copy['waterfront'],y=df_copy['price'], ax=axes[1][1])
    axes[1][1].yaxis.set_label_position("right")
    axes[1][1].yaxis.tick_right()
    axes[1][1].set(xlabel='Waterfront', ylabel='Price')
    axes[1][1].grid(True)


    sns.boxplot(x=df_copy['built_after_ww2'],y=df_copy['price'], ax=axes[2][0])
    axes[2][0].yaxis.tick_right()
    axes[2][0].set(xlabel='Built after WW2?', ylabel='Price')
    axes[2][0].grid(True)

    sns.boxplot(x=df_copy['house_renovated'],y=df_copy['price'], ax=axes[2][1])
    axes[2][1].yaxis.set_label_position("right")
    axes[2][1].yaxis.tick_right()
    axes[2][1].set(xlabel='House renovated?', ylabel='Price')
    axes[2][1].grid(True)

    sns.boxplot(x=df_copy['condition'],y=df_copy['price'], ax=axes[3][0])
    axes[3][0].yaxis.tick_right()
    axes[3][0].set(xlabel='Condition', ylabel='Price')
    axes[3][0].grid(True)

    sns.boxplot(x=df_copy['urban_zipcode'],y=df_copy['price'], ax=axes[3][1])
    axes[3][1].yaxis.set_label_position("right")
    axes[3][1].yaxis.tick_right()
    axes[3][1].set(xlabel='Has urban zipcode?', ylabel='Price')
    axes[3][1].grid(True)
    
    sns.boxplot(x=df_copy['month_sale_name'],y=df_copy['price'], ax=axes[4][0])
    axes[4][0].yaxis.tick_right()
    axes[4][0].set(xlabel='Month of sale', ylabel='Price')
    axes[4][0].grid(True)

    sns.boxplot(x=df_copy['no_basement'],y=df_copy['price'], ax=axes[4][1])
    axes[4][1].yaxis.set_label_position("right")
    axes[4][1].yaxis.tick_right()
    axes[4][1].set(xlabel='No basement?', ylabel='Price')  # the column is 1 when there is no basement
    axes[4][1].grid(True)

    fig, axes = plt.subplots(3, 1,figsize=(17,25))

    sns.boxplot(x=df_copy['bathrooms'],y=df_copy['price'], ax=axes[0])
    axes[0].yaxis.tick_left()
    axes[0].set(xlabel='Bathrooms', ylabel='Price')
    axes[0].grid(True)

    sns.boxplot(x=df_copy['bedrooms'],y=df_copy['price'], ax=axes[1])
    axes[1].yaxis.tick_left()
    axes[1].set(xlabel='Bedrooms', ylabel='Price')
    axes[1].grid(True)
    
    sns.boxplot(x=df_copy['floors'],y=df_copy['price'], ax=axes[2])
    axes[2].yaxis.tick_left()
    axes[2].set(xlabel='Floors', ylabel='Price')
    axes[2].grid(True);



In [None]:
plot_categorical_features(2000000)

Conclusions:<br>
Grade, waterfront, view, condition, number of bedrooms, number of bathrooms, house renovation and urban zipcode all lift the price, to different degrees. What is surprising, and contrary to the main house-age trend, is that houses built before WW2 are slightly more expensive.

## Ideas (feature engineering)

In this part I am going to produce some fancy features. Multiplying, adding or dividing two different columns may help improve the model in unexpected ways. Together with the previously created features, I plan to use either the original (not processed) column or the resulting one, but not both, in order not to confuse (overfit) the model. 

In [None]:
def divide_bathrooms_bedrooms(bathrooms,bedrooms):
    if bedrooms != 0:
        return bathrooms/bedrooms
    else:
        return 0

In [None]:
df['bathrooms/bedrooms'] = df.apply(lambda x: divide_bathrooms_bedrooms(x.bathrooms,x.bedrooms),axis=1)
df.loc[df['bedrooms'] == 0].head(1)
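
The row-wise `apply` can be replaced by a vectorized expression that handles the zero-bedroom case the same way. A sketch on a toy frame:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'bathrooms': [1.0, 2.5, 0.75], 'bedrooms': [2, 5, 0]})
# Replace 0 bedrooms with NaN so the division is defined, then map NaN back to 0.
ratio = (toy['bathrooms'] / toy['bedrooms'].replace(0, np.nan)).fillna(0)
print(ratio.tolist())
```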

In [None]:
df['bathrooms*bedrooms'] = df['bathrooms']*df['bedrooms']

In [None]:
df['waterfront+view'] = df['waterfront'] + df['view']

In [None]:
df['over_one_floor'] = df['floors'].map(lambda x: int(x>1.))
df['over_two_floors'] = df['floors'].map(lambda x: int(x>2.))

In [None]:
df['view_over_zero'] = df['view'].map(lambda x: int(x>0))

## Correlation

Correlation between the target variable (price) and the other variables is a good indicator of which features may be worth using in the model. On the other hand, when two features correlate strongly with each other they carry largely the same information, which can make a model unstable or prone to overfitting. So it is reasonable to use only one of them.
I make two correlation heatmaps: one with Spearman correlation for the more categorical variables and one with Pearson (the default) for continuous data. 
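
The difference between the two coefficients is easiest to see on a relationship that is monotonic but not linear. In the toy sketch below Spearman, which works on ranks, reports a perfect relationship, while Pearson, which measures linear association, stays below 1.

```python
import pandas as pd

toy = pd.DataFrame({'x': [1, 2, 3, 4, 5, 6]})
toy['y'] = toy['x'] ** 3   # monotonic but not linear

pearson = toy.corr(method='pearson').loc['x', 'y']
spearman = toy.corr(method='spearman').loc['x', 'y']
print(round(pearson, 3), round(spearman, 3))
```

This is why rank-based correlation is the safer choice for ordinal columns such as grade or condition, where only the order of the values is meaningful.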

In [None]:
df.columns

In [None]:
df_correlation = df[['price','bedrooms','bathrooms', 'over_one_floor','over_two_floors','view_over_zero',
                     'waterfront','view','condition', 'grade','house_renovated',
                    'grade_category', 'built_after_ww2','urban_zipcode','no_basement','waterfront+view']].copy()
plt.rcParams['figure.figsize']=(15,10)
sns.heatmap(df_correlation.corr(method='spearman'), vmax=1., vmin=-1., annot=True, linewidths=.8, cmap="YlGnBu");

In [None]:
df_correlation = df[['price','sqft_living','sqft_lot','sqft_above', 'sqft_basement','sqft_living15', 'sqft_lot15',
                    'year_sale','month_sale_num','years_since_construction','bathrooms/bedrooms',
                     'bathrooms*bedrooms', 'yr_built', 'yr_renovated','floors']].copy()
plt.rcParams['figure.figsize']=(15,10)
sns.heatmap(df_correlation.corr(), vmax=1., vmin=-1., annot=True, linewidths=.8, cmap="YlGnBu");

Finally one big correlation heatmap to see how features correlate with each other.

In [None]:
df_correlation = df[['price','sqft_living','sqft_lot','sqft_above', 'sqft_basement','sqft_living15', 'sqft_lot15',
                    'year_sale','month_sale_num','years_since_construction','bathrooms/bedrooms',
                     'bathrooms*bedrooms', 'yr_built', 'yr_renovated','floors',
                    'bedrooms','bathrooms', 'over_one_floor','over_two_floors','view_over_zero',
                     'waterfront','view','condition', 'grade','house_renovated',
                    'grade_category', 'built_after_ww2','urban_zipcode','no_basement','waterfront+view']].copy()
plt.rcParams['figure.figsize']=(15,10)
sns.heatmap(df_correlation.corr(), vmax=1., vmin=-1., annot=True, linewidths=.8, cmap="YlGnBu");

## Modeling

In [None]:
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score, roc_curve, roc_auc_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV, KFold
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler
import xgboost as xgb

from functools import partial
from hyperopt import hp, fmin, tpe, STATUS_OK, Trials
import random
from math import sqrt

random.seed(100)

Declare target variable

In [None]:
y = df['price'].values

The cross-validation function is declared below. Cross-validation is used to compare different models on one dataset. With a single split into training and test sets, there is a chance that the split is uneven in a way that favors some algorithms. The function uses K-fold with shuffling, so every row lands in the validation part exactly once. (Stratified K-fold, which balances classes between the parts, needs class labels and so does not apply directly to a continuous target like price.)

In [None]:
def train_validate(model, metric, X, y):
    skf = KFold(n_splits = 8, shuffle= True)
    
    scores_metric = []
    for train_idx, test_idx in skf.split(X,y):
        model.fit(X[train_idx],y[train_idx])
        y_pred = model.predict(X[test_idx])
        
        score = metric(y[test_idx],y_pred)
        
        scores_metric.append(score)

        
    result = np.mean(scores_metric)

    return result
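
To make the mechanics of the function above explicit, here is a NumPy-only sketch of shuffled K-fold validation on synthetic data. The data, the helper name `kfold_mse` and the least-squares model inside it are illustrative assumptions, not part of the notebook's pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))                       # synthetic features
y = X @ np.array([2.0, -1.0]) + rng.normal(scale=0.1, size=40)

def kfold_mse(X, y, n_splits=8, seed=0):
    """Mean validation MSE of a least-squares fit over shuffled folds."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, n_splits)          # disjoint validation parts
    scores = []
    for k in range(n_splits):
        test_idx = folds[k]
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != k])
        A_train = np.c_[np.ones(len(train_idx)), X[train_idx]]
        A_test = np.c_[np.ones(len(test_idx)), X[test_idx]]
        coef, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
        scores.append(np.mean((y[test_idx] - A_test @ coef) ** 2))
    return float(np.mean(scores)), folds

score, folds = kfold_mse(X, y)
print(score)
```

Each row is used for validation exactly once and for training in the remaining folds, which is what makes the averaged score less dependent on one lucky or unlucky split.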

My experimental function. It randomly searches for the set of features that gives the best result for a given model and metric. It should be treated only as additional help. I will not run it for every model, as a single run takes a lot of time. This could be developed further in the future.

In [None]:
def best_features(dataframe, model,metric, features, repeats = 20, min_features = 1, max_features = 18):

    best_score = 100000000000000000
    best_feats = []
    np.random.seed(2000)
    
    y = dataframe['price'].values
    
    if max_features > len(features):
        max_features = len(features)
        

    for i in range(min_features,max_features): 
        for a in range(repeats): # repeat n times for this number of features
            feats = np.random.choice(features,i,replace = False).tolist()
            X = dataframe[feats].values
            score = train_validate(model,metric,X,y)
            if score < best_score:
                best_score = score
                best_feats = feats
        # Report the best result so far, not the last random draw
        print("Best score after trying {0} features is {1}".format(i,best_score))
        print(best_feats)


    print('\n\nBest score is {0} with features: {1}'.format(best_score,best_feats))

<br><b>Metrics</b>

In regression problems several metrics are used to estimate how good a model is. The most commonly used are Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE).<br>MAE is the average absolute error (distance) between the model's estimate and the actual value.<br>RMSE squares these errors, averages them, and then takes the square root of that mean.
Because of the squaring, RMSE is more sensitive to large errors than MAE.
<br>In both cases lower is better.
<br><br>
R squared measures how much of the variance of the target the model explains. On the data a least-squares model was fitted to it ranges from 0 to 1 (on held-out data it can drop below 0). If R^2 is equal to 1, the predictions match the actual values exactly, so the model fits the data perfectly.
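
The three metrics can be computed by hand on a tiny made-up example, which makes the definitions concrete:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])    # toy actual values
y_pred = np.array([2.0, 5.0, 8.0, 12.0])   # toy predictions

errors = y_pred - y_true
mae = np.mean(np.abs(errors))               # average absolute error
rmse = np.sqrt(np.mean(errors ** 2))        # root of the mean squared error
r2 = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)
print(mae, rmse, r2)
```

Here the single error of 3 contributes 9 of the 11 units of squared error, so RMSE (about 1.66) ends up noticeably above MAE (1.25).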


<br><b>Multilinear Regression</b>

<b>Standardization</b><br><br>
If a dataset contains features on different scales, rescaling brings them to a comparable range. This usually improves some models (regression, k-nearest neighbours, SVM).<br>
Two techniques are mainly used for this:<br> - normalization scales all values to between 0 and 1, <br> - standardization (z-score) shifts each feature to mean 0 and expresses it in units of its standard deviation.<br>
Standardization tends to cope better with features that contain outliers, so I will use it.<br>
I propose to create a separate pandas dataframe with standardized values.
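
Both rescaling techniques fit in one line each. The sketch below applies them to a toy column with one outlier; the z-score uses the population standard deviation (NumPy's default), which is also what `StandardScaler` uses.

```python
import numpy as np

sqft = np.array([800.0, 1200.0, 1500.0, 2000.0, 9000.0])  # last value is an outlier

min_max = (sqft - sqft.min()) / (sqft.max() - sqft.min())  # normalization to [0, 1]
z_score = (sqft - sqft.mean()) / sqft.std()                # standardization
print(min_max.round(3))
print(z_score.round(3))
```

With min-max scaling the outlier pins the upper end at 1 and squeezes the other values into a narrow band near 0, which is the effect that motivates preferring the z-score here.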

In [None]:
df_copy = df.copy()
df_copy = df_copy.drop(columns=['id','date','month_sale_name','year_month_day_sale','zipcode'])
names = df_copy.columns

scaler = StandardScaler()
standarized_df = pd.DataFrame(scaler.fit_transform(df_copy), columns = names)
standarized_df.head()

In [None]:
standarized_df.columns

In [None]:
features = standarized_df.columns.to_list()
features.remove('price')

In [None]:
#The call below is commented out because it takes several minutes to get the result.
#best_features(standarized_df, LinearRegression(),mean_squared_error, features, max_features=25, repeats = 35)

Best score is 0.2824923891898764 with features: ['bathrooms', 'sqft_above', 'over_one_floor', 'waterfront+view', 'bathrooms*bedrooms', 'lat', 'month_sale_num', 'year_sale', 'sqft_living15', 'years_since_construction', 'sqft_basement', 'condition', 'sqft_living', 'no_basement', 'urban_zipcode', 'grade', 'house_renovated', 'bedrooms', 'over_two_floors', 'waterfront', 'yr_built', 'long', 'floors']

R squared:

In [None]:
feats = ['bathrooms', 'sqft_above', 'over_one_floor', 'waterfront+view', 'bathrooms*bedrooms', 'lat', 
         'month_sale_num', 'year_sale', 'sqft_living15', 'years_since_construction', 'sqft_basement', 
         'condition', 'sqft_living', 'no_basement', 'urban_zipcode', 'grade', 'house_renovated', 'bedrooms', 
         'over_two_floors', 'waterfront', 'yr_built', 'long', 'floors']

y = standarized_df['price'].values
X = standarized_df[feats].values

lin_reg = LinearRegression()
lin_reg.fit(X,y)
lin_reg.score(X,y)

Now do the same steps with the unstandardized dataset.

In [None]:
features = df.columns.to_list()

remove_list = ['price','id','month_sale_name','year_month_day_sale','date','zipcode']

for elem in remove_list:
    features.remove(elem)

In [None]:
#The call below is commented out because it takes several minutes to get the result.
#best_features(df, LinearRegression(),mean_squared_error, features, max_features=25, repeats = 35)

Best score is 38082141181.18445 with features: ['bathrooms', 'sqft_above', 'over_one_floor', 'waterfront+view', 'bathrooms*bedrooms', 'lat', 'month_sale_num', 'year_sale', 'sqft_living15', 'years_since_construction', 'sqft_basement', 'condition', 'sqft_living', 'no_basement', 'urban_zipcode', 'grade', 'house_renovated', 'bedrooms', 'over_two_floors', 'waterfront', 'yr_built', 'long', 'floors']

In [None]:
feats = ['bathrooms', 'sqft_above', 'over_one_floor', 'waterfront+view', 'bathrooms*bedrooms', 'lat', 
         'month_sale_num', 'year_sale', 'sqft_living15', 'years_since_construction', 'sqft_basement', 
         'condition', 'sqft_living', 'no_basement', 'urban_zipcode', 'grade', 'house_renovated', 'bedrooms', 
         'over_two_floors', 'waterfront', 'yr_built', 'long', 'floors']

y = df['price'].values
X = df[feats].values

lin_reg = LinearRegression()
lin_reg.fit(X,y)
lin_reg.score(X,y)

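The function found the same set of features in both cases, and R squared is practically the same. Standardization didn't improve the regression. That is expected: ordinary least squares with an intercept gives the same predictions whether or not the features are linearly rescaled; only the coefficients change.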
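
A small check of why rescaling the features does not change an ordinary least squares fit. The data is synthetic and the helper `r_squared` is written just for this sketch; it fits a linear model with an intercept and returns R squared on the same data.

```python
import numpy as np

rng = np.random.default_rng(1)
# Two synthetic features on very different scales
X = rng.normal(loc=[1000.0, 3.0], scale=[500.0, 1.0], size=(200, 2))
y = 150.0 * X[:, 0] + 20000.0 * X[:, 1] + rng.normal(scale=5000.0, size=200)

def r_squared(X, y):
    A = np.c_[np.ones(len(y)), X]               # add intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1 - resid @ resid / np.sum((y - y.mean()) ** 2)

X_std = (X - X.mean(axis=0)) / X.std(axis=0)    # z-score each column
print(r_squared(X, y), r_squared(X_std, y))
```

The two values agree up to floating point error. Scaling does matter for penalized or distance-based models (Ridge, k-nearest neighbours), where the size of a coefficient or a distance depends on the units.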

RMSE is just square root of MSE.

In [None]:
rmse = round(sqrt(38082141181.18445),None)
rmse

<b>Ridge Regression</b> adds an L2 penalty on the coefficients, controlled by alpha. It introduces some bias in order to lower the variance of the model between datasets.
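
The effect of the alpha penalty can be seen in the closed-form solution, where alpha is added to the diagonal of X^T X. The sketch below uses synthetic, centered data (so no intercept is needed) and a helper `ridge_coef` written for illustration; it is not how scikit-learn's `Ridge` is implemented internally.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))
y = X @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=50)

# Center so that no intercept term is needed in this sketch.
Xc, yc = X - X.mean(axis=0), y - y.mean()

def ridge_coef(Xc, yc, alpha):
    """Solve (X^T X + alpha * I) w = X^T y for w."""
    n_features = Xc.shape[1]
    return np.linalg.solve(Xc.T @ Xc + alpha * np.eye(n_features), Xc.T @ yc)

norms = [np.linalg.norm(ridge_coef(Xc, yc, a)) for a in (0.0, 1.0, 10.0, 100.0)]
print(norms)
```

The coefficient norm shrinks as alpha grows; alpha = 0 reproduces plain least squares.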

In [None]:
ridge = Ridge()
parameters = {'alpha':[0,1e-15,1e-10,1e-8,1e-4,1e-3,1e-2,1,2,3,5,10,15,20]}

In [None]:
ridge_regressor = GridSearchCV(ridge,parameters,scoring='neg_mean_squared_error', cv=10)

In [None]:
ridge_regressor.fit(X,y)

In [None]:
print(ridge_regressor.best_params_)

In [None]:
print(ridge_regressor.best_score_)

In [None]:
rmse = round(sqrt(38214893294.43146),None)
rmse

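Ridge regression has a slightly worse RMSE, which is consistent with adding bias to the model. Note that the two numbers are not strictly comparable: GridSearchCV here uses 10 folds, while train_validate uses 8 shuffled folds.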

<br><b>K Nearest Neighbors</b>

With GPS coordinates (latitude and longitude) available, it makes sense to use them somehow. Maybe an expensive neighborhood lifts the price. This is where the KNeighborsRegressor algorithm comes in.
<br>Let's first try a few features and determine how many neighbors do the best job.
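
The idea behind the algorithm is simple enough to sketch in a few lines of NumPy: predict the price of a house as the average price of its k closest neighbours. The coordinates and prices below are made up, and `knn_predict` is an illustrative helper, not the scikit-learn implementation.

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k):
    """Average the targets of the k training points closest to x_query."""
    dist = np.linalg.norm(X_train - x_query, axis=1)   # Euclidean distance
    nearest = np.argsort(dist)[:k]
    return y_train[nearest].mean()

# Toy (lat, long) pairs and prices, made up for illustration.
X_train = np.array([[47.60, -122.30], [47.61, -122.31],
                    [47.30, -122.00], [47.31, -122.01]])
y_train = np.array([900000.0, 1100000.0, 300000.0, 350000.0])

print(knn_predict(X_train, y_train, np.array([47.605, -122.305]), k=2))
```

Because the distance is Euclidean, a feature measured in thousands (such as sqft_living) will dominate features measured in fractions of a degree unless the columns are rescaled first.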

In [None]:
feats = ['lat','long','sqft_living']
X = df[feats].values

In [None]:
for i in range(1,16):

    KNR = KNeighborsRegressor(n_neighbors=i)
    score = train_validate(KNR,mean_squared_error,X,y)
    rmse = sqrt(score)
    print('Neighbors: {0}, RMSE: {1}'.format(i,rmse))

In [None]:
features = df.columns.to_list()

remove_list = ['price','id','month_sale_name','year_month_day_sale','date','zipcode']

for elem in remove_list:
    features.remove(elem)

In [None]:
#The call below is commented out because it takes several minutes to get the result.
#best_features(df, KNeighborsRegressor(n_neighbors=9),mean_squared_error, features, max_features=25, repeats = 35)

Best score is 29912478092.967567 with features: ['lat', 'grade', 'view_over_zero', 'long']

In [None]:
features = ['lat', 'grade', 'view_over_zero','long']

X = df[features].values
y = df['price'].values

KNR = KNeighborsRegressor(n_neighbors=9)
score = train_validate(KNR,mean_squared_error,X,y)
rmse = round(sqrt(score),None)
print('Neighbors: {0}, RMSE: {1}'.format(9,rmse))

As seen above, this algorithm performs better than linear regression. Selecting features manually could bring better solutions, but it also takes more time.<br>
Finally, there is a chance that the models overfit the data, so they should be run again on a test dataset to verify the results.

<br><b>Random Forest Regressor and XGBoost Regressor</b>

In [None]:
feats = ['lat','long','sqft_living']
X = df[feats].values

In [None]:
# XGB
print(sqrt(train_validate(xgb.XGBRegressor(),mean_squared_error,X,y)))
# Random Forest Regressor
print(sqrt(train_validate(RandomForestRegressor(),mean_squared_error,X,y)))

Even without feature selection and hyperparameter tuning these models do better than the previous algorithms.

<b>XGBoost Regressor</b>

In [None]:
feats = ['sqft_living','sqft_lot','sqft_living15', 'sqft_lot15','years_since_construction',
                     'bathrooms*bedrooms', 'yr_built', 'yr_renovated','floors_int',
                     'over_one_floor','over_two_floors', 'sqft_basement',
                     'waterfront','view','condition', 'grade','house_renovated',
                     'built_after_ww2','urban_zipcode','no_basement', 'lat','long']
X = df[feats].values
print(sqrt(train_validate(xgb.XGBRegressor(),mean_squared_error,X,y)))

Hyperparameter Optimization<br>
<i>It takes a significant amount of time to run the hyperopt loop, so I comment out the code below (the same goes for Random Forest Regressor) and paste the results in markdown</i>

In [None]:
#The code below is commented out because it takes several minutes to get the result.
# def objective(space):
#     params = {
#         'eta':space['eta'],
#         'max_depth':int(space['max_depth']),
#         'min_child_weight':int(space['min_child_weight']),
        
#     }
    
#     model = xgb.XGBRegressor(**params)
    
#     score = sqrt(train_validate(model,mean_squared_error,X,y))
#     print('Score: {0}'.format(score))
#     return {'loss':score,'status':STATUS_OK}


# space = {
#     'eta':hp.uniform('eta',0.1,1),
#     'max_depth':hp.quniform('max_depth',1,70,1),
#     'min_child_weight':hp.quniform('min_child_weight',0,150,1)
# }


# trials = Trials()
# best_params = fmin(fn = objective,
#                   space = space,
#                   algo=partial(tpe.suggest, n_startup_jobs = 10),
#                   max_evals = 20,
#                   trials = trials)

# print('Best params: ', best_params)

best loss: 124122.52545991719
Best params:  {'eta': 0.43924184444710274, 'max_depth': 6.0, 'min_child_weight': 30.0}

<br><b>Random Forest Regressor</b>

In [None]:
feats = ['sqft_living','sqft_lot','sqft_living15', 'sqft_lot15','years_since_construction',
                     'bathrooms*bedrooms', 'yr_built', 'yr_renovated','floors_int',
                     'over_one_floor','over_two_floors', 'sqft_basement',
                     'waterfront','view','condition', 'grade','house_renovated',
                     'built_after_ww2','urban_zipcode','no_basement', 'lat','long']
X = df[feats].values
print(sqrt(train_validate(RandomForestRegressor(),mean_squared_error,X,y)))

Hyperparameter Optimization

In [None]:
#The code below is commented out because it takes several minutes to get the result.
# def objective(space):
#     params = {
#         'max_depth':int(space['max_depth']),
#         'min_samples_split':int(space['min_samples_split']),
#         'max_features':int(space['max_features'])
#     }
    
#     model = RandomForestRegressor(**params)
    
#     score = sqrt(train_validate(model,mean_squared_error,X,y))
#     print('Score: {0}'.format(score))
#     return {'loss':score,'status':STATUS_OK}


# space = {
#     'max_depth':hp.quniform('max_depth',1,35,1),
#     'min_samples_split':hp.quniform('min_samples_split',2,100,1),
#     'max_features':hp.quniform('max_features',1,10,1)
# }


# trials = Trials()
# best_params = fmin(fn = objective,
#                   space = space,
#                   algo=partial(tpe.suggest, n_startup_jobs = 50),
#                   max_evals = 150,
#                   trials = trials)

# print('Best params: ', best_params)


'max_depth': 30.0, 'max_features': 10.0, 'min_samples_split': 2.0

<br><b>Dataset split and testing</b>

Now it is time to plug the hyperparameters into the models and simulate (by splitting the dataset into train and test parts) how they would work on a real-life problem.

In [None]:
def draw_feature_importances(model, features):
    # Size the plot from the feature list itself instead of the global X
    importances = model.feature_importances_
    indices = np.argsort(importances)[::-1]   # most important first
    n_features = len(features)
    plt.figure(figsize=(10, 5))
    plt.title("Feature importances")
    plt.bar(range(n_features), importances[indices],
           color="b", align="center")
    plt.xticks(range(n_features), [features[x] for x in indices], rotation=90)
    plt.xlim([-1, n_features])
    plt.show()

<i>Feature importance function above</i>

In [None]:
feats = ['sqft_living','sqft_lot','sqft_living15', 'sqft_lot15','years_since_construction',
                     'bathrooms*bedrooms', 'yr_built', 'yr_renovated','floors_int',
                     'over_one_floor','over_two_floors', 'sqft_basement',
                     'waterfront','view','condition', 'grade','house_renovated',
                     'built_after_ww2','urban_zipcode','no_basement', 'lat','long']
X = df[feats].values

In [None]:
train_X, test_X, train_y, test_y = train_test_split(X,y,test_size = 0.3, random_state = 2)

<br><b>XGBoost Regressor result model</b>

In [None]:
model_XGBoostRegressor = xgb.XGBRegressor(eta = 0.4392, max_depth = 6, min_child_weight = 30)
model_XGBoostRegressor.fit(train_X,train_y)

In [None]:
y_pred = model_XGBoostRegressor.predict(test_X)
print("RMSE error: {0}".format(sqrt(mean_squared_error(test_y,y_pred))))
print("MAE error: {0}".format(mean_absolute_error(test_y,y_pred)))

In [None]:
draw_feature_importances(model_XGBoostRegressor,feats)

<br><b>Random Forest Regressor result model</b>

In [None]:
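# max_features=1.0 uses all features, which is what 'auto' meant for regressors; 'auto' is rejected by recent scikit-learn
model_RandomForestRegressor = RandomForestRegressor(max_depth=30, max_features=1.0,min_samples_split=2)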
model_RandomForestRegressor.fit(train_X,train_y)

In [None]:
y_pred = model_RandomForestRegressor.predict(test_X)
print("RMSE error: {0}".format(sqrt(mean_squared_error(test_y,y_pred))))
print("MAE error: {0}".format(mean_absolute_error(test_y,y_pred)))

In [None]:
draw_feature_importances(model_RandomForestRegressor,feats)

In [None]:
fig, axe = plt.subplots(1, 1,figsize=(15,7))
scatter = sns.scatterplot(x = y_pred,y = test_y)
axes = scatter.axes
plt.title('Random Forest Price actual vs predicted')
plt.grid(True)
axe.yaxis.set_label_position("left")
axe.yaxis.tick_left()
axe.set(xlabel= 'Price predicted', ylabel='Price actual');


The scatterplot above compares actual house prices with predicted prices. It is a good way to catch badly predicted cases.

## Conclusions and further steps

In the end XGBoost Regressor has a lower RMSE and, by a small margin, MAE. This result does not mean that this model is better for the task; results may change with further tuning. Moreover, XGBoost is much slower than Random Forest (or needs more computing power). <br><br>
This analysis and modeling were made to clearly show the steps an analyst takes when approaching a regression problem. In real-life conditions, analysis and visualization should run in parallel with data preparation, feature engineering and modeling. An analyst keeps only the best solutions, does not use a Jupyter notebook for this purpose, and possibly takes advantage of OOP.
The model I presented here can be developed further. Here are some of my suggestions: <br>

- find real utility and public building locations (schools, hospitals, shopping centres etc.), which may explain price differences,
- compare different models (CatBoost, LightGBM),
- spend more time on feature selection and hyperparameter optimization (running the hyperopt loop takes a significant amount of time),
- combine two or more models to produce the output value.
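
The last suggestion can be sketched in its simplest form: a plain average of the predictions of two models. All numbers below are made up and `pred_a`, `pred_b` stand for the outputs of two hypothetical fitted models; weighting or stacking would be the natural next steps.

```python
import numpy as np

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up prices and predictions from two hypothetical models.
y_true = np.array([300000.0, 450000.0, 600000.0, 1200000.0])
pred_a = np.array([320000.0, 430000.0, 650000.0, 1000000.0])
pred_b = np.array([280000.0, 480000.0, 580000.0, 1300000.0])

blend = (pred_a + pred_b) / 2   # plain average of the two outputs
print(rmse(y_true, pred_a), rmse(y_true, pred_b), rmse(y_true, blend))
```

The RMSE of an average is never worse than the worse of the two models, and it improves most when the models' errors point in different directions, as in this toy case.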