In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns;sns.set(style='whitegrid')
import os
# for dirname, _, filenames in os.walk('/kaggle/input'):
#     for filename in filenames:
#         print(os.path.join(dirname, filename))

line_colors = ["#7CEA9C", '#50B2C0', "rgb(114, 78, 145)", "hsv(348, 66%, 90%)", "hsl(45, 93%, 58%)"]

import warnings
warnings.filterwarnings("ignore")

heval = True  # run the heavier visualisation cells
heval2 = True # run the model cells

![](https://images-wixmp-ed30a86b8c4ca887773594c2.wixmp.com/f/8cc1eeaa-4046-4c4a-ae93-93d656f68688/deelezu-4612bda5-9711-419f-8a13-f9ef7127198d.jpg?token=eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJzdWIiOiJ1cm46YXBwOiIsImlzcyI6InVybjphcHA6Iiwib2JqIjpbW3sicGF0aCI6IlwvZlwvOGNjMWVlYWEtNDA0Ni00YzRhLWFlOTMtOTNkNjU2ZjY4Njg4XC9kZWVsZXp1LTQ2MTJiZGE1LTk3MTEtNDE5Zi04YTEzLWY5ZWY3MTI3MTk4ZC5qcGcifV1dLCJhdWQiOlsidXJuOnNlcnZpY2U6ZmlsZS5kb3dubG9hZCJdfQ.N4zM3kLB9YXHN_tBadKXv-2Gkyg6kVABLzIbrEFJEqc)
<span>Photo by <a href="https://unsplash.com/@fadder8?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Fadzai Saungweme</a> on <a href="https://unsplash.com/s/photos/perth?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></span>

# 1. <span style='color:rgb(205, 0, 153)'> Introduction</span>

- The aim of this notebook is to build models that predict housing prices in [Perth](https://www.australia.com/en/places/perth-and-surrounds/guide-to-perth.html) (located in Western Australia) from a set of scraped features made available in the [Perth Housing Dataset](https://www.kaggle.com/syuzai/perth-house-prices). 
- The dataset has recently been updated, so it's interesting to explore the significance of the new features and their impact on model accuracy.
- The dataset is very interesting to explore since __Perth__ is not a city that is commonly used in data analysis by people outside Australia. When it comes to Australian cities, __Melbourne__ & __Sydney__ are the big two that one might think about. It's also interesting to explore the capabilities of the Plotly library & create interactive choropleth maps, similar to the notebook I wrote about [Australian Geographic Data Plots](https://www.kaggle.com/shtrausslearning/australian-geographic-data-plots).
- Any recommendations to improve the notebook, such as ideas or areas for improvement, are more than welcome.

# 2. <span style='color:rgb(205, 0, 153)'> Perth Housing Dataset</span>

Let's review the features available in the recently updated version (all_perth_310121.csv) of the __Perth Housing Dataset__.
- <code>ADDRESS</code> : Physical address of the property (we will set this as the index)
- <code>SUBURB</code> : Specific locality in Perth; a list of all Perth suburbs can be found [here](https://www.homely.com.au/find-suburb-by-region/perth-greater-western-australia)
- <code>PRICE</code> : Price at which a property was sold (AUD)
- <code>BEDROOMS</code> : Number of bedrooms
- <code>BATHROOMS</code> : Number of bathrooms
- <code>GARAGE</code> : Number of garage places
- <code>LAND_AREA</code> : Total land area (m^2)
- <code>FLOOR_AREA</code> : Internal floor area (m^2)
- <code>BUILD_YEAR</code> : Year in which the property was built
- <code>CBD_DIST</code> : Distance from the centre of Perth (m)
- <code>NEAREST_STN</code> : The nearest public transport station from the property
- <code>NEAREST_STN_DIST</code> : The nearest station distance (m)
- <code>DATE_SOLD</code> : Month & year in which the property was sold
- <code>POSTCODE</code> : Local Area Identifier
- <code>LATITUDE</code> : Geographic Location (lat) of <code>ADDRESS</code>
- <code>LONGITUDE</code> : Geographic Location (long) of <code>ADDRESS</code>
- <code>NEAREST_SCH</code> : Location of the nearest School
- <code>NEAREST_SCH_DIST</code> : Distance to the nearest school
- <code>NEAREST_SCH_RANK</code> : Ranking of the nearest school 

## 2.1. <span style='color:rgb(97, 47, 205)'> New Dataset Additions</span>

- As opposed to a [previous notebook](https://www.kaggle.com/shtrausslearning/perth-housing-price-prediction-eda), the locations of individual addresses are available to us (<code>LONGITUDE</code>, <code>LATITUDE</code>). Having the exact locations is much handier than what we had before, and allows more leeway to experiment with property relations and tune models.
- Postcodes (<code>POSTCODE</code>) are an interesting addition: they allow us to link useful outside data, and there is also likely a relation to property prices due to their ordering. It would be interesting to visualise the distribution of these postcodes as well.
- Nearest school information (__name__, __distance__, __rank__) is an interesting addition as well; good schools are often located in expensive suburbs, and vice versa. Proximity may also have an effect on <code>PRICE</code>. On first impression, however, <code>NEAREST_SCH_RANK</code> could change from year to year, while the properties were sold at different times (<code>DATE_SOLD</code>). Nevertheless, it would be interesting to explore these school-related features. As indicated in the dataset, some data is missing in <code>NEAREST_SCH_RANK</code>, which we have to deal with.

In [None]:
df_perth0 = pd.read_csv('/kaggle/input/perth-house-prices/all_perth_310121.csv')
df_perth0.columns

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import plotly.express as px  # used by the feature importance bar plot in the helper class
import shap
from catboost import CatBoostClassifier,CatBoostRegressor
from sklearn.feature_selection import SelectKBest,f_regression
from xgboost import plot_importance,XGBClassifier,XGBRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn import preprocessing

# Notebook Helper Class 
class transformer(BaseEstimator,TransformerMixin):
    
    def __init__(self,drop_nan=False,select_dtype=False,show_nan=False,title='Title',show_counts=False,
                 figsize=(None,None), feature_importance = False, target = 'PRICE'):
        self.drop_nan = drop_nan
        self.select_dtype = select_dtype
        self.show_nan = show_nan
        self.title = title
        self.show_counts = show_counts
        self.figsize = figsize
        self.feature_importance = feature_importance
        self.target = target  # target variable
        
    # Apply Fit
    def fit(self,X,y=None):
        return self
        
    # Apply Some Transformation to the Feature Matrix
    def transform(self,X):
        
        '''show NaN % in DataFrame'''
        if(self.show_nan):
            
            fig, ax = plt.subplots(figsize = self.figsize)
            nan_val = (X.isnull().sum()/len(X)*100).sort_values(ascending = False)
            cmap = sns.color_palette("plasma")
            for i in ['top', 'right', 'bottom', 'left']:
                ax.spines[i].set_color('black')
            ax.spines['top'].set_visible(True);ax.spines['right'].set_visible(False)
            ax.spines['bottom'].set_visible(False);ax.spines['left'].set_visible(False)
            sns.barplot(x=nan_val,y=nan_val.index, edgecolor='k',palette = 'rainbow')
            plt.title(self.title);ax.grid(ls='--',alpha = 0.9);plt.show()
            return
        
        ''' Plot df.value_counts '''
        if(self.show_counts):
        
            tdf = X.value_counts()
            cmap = sns.color_palette("plasma")
            fig, ax = plt.subplots(figsize = self.figsize)
            for i in ['top', 'right', 'bottom', 'left']:
                ax.spines[i].set_color('black')
            ax.spines['top'].set_visible(True);ax.spines['right'].set_visible(False)
            ax.spines['bottom'].set_visible(False);ax.spines['left'].set_visible(False)
            sns.barplot(x=tdf.index,y=tdf.values,edgecolor='k',palette = 'rainbow',ax=ax);
            plt.title(self.title);ax.grid(ls='--',alpha = 0.9);plt.show()
        
        ''' Drop All NAN values in DataFrame'''
        if(self.drop_nan):
            X = X.dropna();
            return X
            
        ''' Split DataFrame into Numerical/Object features'''
        if(self.select_dtype):
            X1 = X.select_dtypes(include=['float64','int64','uint8'])     # return only numerical features from df
            X2 = X.select_dtypes(exclude=['float64','int64','uint8'])
            return X1,X2
        
        ''' Plot Feature Importance '''
        if(self.feature_importance):
            
             # Plot Correlation to Target Variable only
            def corrMat2(df,target=self.target,figsize=(9,0.5),ret_id=False):

                corr_mat = df.corr().round(2);shape = corr_mat.shape[0]
                corr_mat = corr_mat.transpose()
                corr = corr_mat.loc[:, df.columns == self.target].transpose().copy()

                if(ret_id is False):
                    f, ax = plt.subplots(figsize=figsize)
                    sns.heatmap(corr,vmin=-0.3,vmax=0.3,center=0, 
                                cmap=cmap,square=False,lw=2,annot=True,cbar=False)
                    plt.title(f'Feature Correlation to {self.target}')

                if(ret_id):
                    return corr

            ''' Plot Relative Feature Importance '''
            def feature_importance(tldf,feature=self.target,n_est=500):

                # X : Numerical / Object DataFrame
                ldf0,_ = transformer(select_dtype=True).transform(X=tldf)
                ldf = transformer(drop_nan=True).transform(X=ldf0)  

                # Input dataframe containing feature & target variable
                X = ldf.copy()
                y = ldf[feature].copy()
                del X[feature]

            #   CORRELATION
                imp = corrMat2(ldf,feature,figsize=(15,0.5),ret_id=True)
                del imp[feature]
                s1 = imp.squeeze(axis=0);s1 = abs(s1)
                s1.name = 'Correlation'

            #   SHAP
                model = CatBoostRegressor(silent=True,n_estimators=n_est).fit(X,y)
                explainer = shap.TreeExplainer(model)
                shap_values = explainer.shap_values(X)
                shap_sum = np.abs(shap_values).mean(axis=0)
                s2 = pd.Series(shap_sum,index=X.columns,name='Cat_SHAP').T

            #   RANDOMFOREST
                model = RandomForestRegressor(n_estimators=n_est,random_state=0, n_jobs=-1)
                fit = model.fit(X,y)
                rf_fi = pd.DataFrame(model.feature_importances_,index=X.columns,
                                    columns=['RandForest']).sort_values('RandForest',ascending=False)
                s3 = rf_fi.T.squeeze(axis=0)

            #   XGB 
                model=XGBRegressor(n_estimators=n_est,learning_rate=0.5,verbosity = 0)
                model.fit(X,y)
                data = model.feature_importances_
                s4 = pd.Series(data,index=X.columns,name='XGB').T

            #   KBEST
                model = SelectKBest(k=X.shape[1], score_func=f_regression)
                fit = model.fit(X,y)
                data = fit.scores_
                s5 = pd.Series(data,index=X.columns,name='K_best')

                # Combine Scores
                df0 = pd.concat([s1,s2,s3,s4,s5],axis=1)
                df0.rename(columns={'target':'lin corr'})

                x = df0.values 
                min_max_scaler = preprocessing.MinMaxScaler()
                x_scaled = min_max_scaler.fit_transform(x)
                df = pd.DataFrame(x_scaled,index=df0.index,columns=df0.columns)
                df = df.rename_axis('Feature Importance via', axis=1)
                df = df.rename_axis('Feature', axis=0)
                df['total'] = df.sum(axis=1)
                df = df.sort_values(by='total',ascending=True)
                del df['total']
                fig = px.bar(df,orientation='h',barmode='stack',color_discrete_sequence=line_colors)
                fig.update_layout(template='plotly_white',height=self.figsize[1],width=self.figsize[0],margin={"r":0,"t":60,"l":0,"b":0});
                for data in fig.data:
                    data["width"] = 0.6 #Change this value for bar widths
                fig.show()
                
            feature_importance(X)
            
# Class to Visualise Things Only
class visualise(BaseEstimator,TransformerMixin):
    
    def __init__(self,target=None,option=False):

        self.target = target             # target varable [str]
        self.option = option

    @staticmethod 
    def corrMat2(df,target='demand',figsize=(9,0.5),ret_id=False):

        corr_mat = df.corr().round(2);shape = corr_mat.shape[0]
        corr_mat = corr_mat.transpose()
        corr = corr_mat.loc[:, df.columns == target].transpose().copy()

        if(ret_id is False):
            f, ax = plt.subplots(figsize=figsize)
            sns.heatmap(corr,vmin=-0.3,vmax=0.3,center=0, 
                         cmap='plasma',square=False,lw=2,annot=True,cbar=False)  # named colormap; `cmap` was undefined here
            plt.title(f'Feature Correlation to {target}')

        if(ret_id):
            return corr
        
    def fit(self,X=None,y=None):
        return self
    
    # X -> Numerical (feature matrix + target variable)
    def transform(self,X):
        
        # Pandas Static Histogram
        if(self.option == 'histogram'):
            vdf_perth1_num,_ = transformer(select_dtype=True).transform(X=X)
            vdf_perth1_num.hist(bins=30, figsize=(20,15));plt.show()    
        
        # Seaborn Static Boxplot
        if(self.option == 'boxplot'):
            
            lX,_ = transformer(select_dtype=True).transform(X=X)
            fig,axs = plt.subplots(ncols=5,nrows=4,figsize=(20,10))  # figsize is in inches
            index = 0
            axs = axs.flatten()
            for k,v in lX.items():
                flierprops = dict(marker='o',mfc='k',ls='none',mec='k')
                ax = sns.boxplot(x=k,data=lX,orient='h',flierprops=flierprops,
                                ax=axs[index],width=0.5)
                index += 1
            plt.tight_layout();plt.show()
            
        # Outlier Quantile Information
        if(self.option == 'outliers'):
            
    #       2. Define Outliers
            lX,_ = transformer(select_dtype=True).transform(X=X)
            for k, v in lX.items():
                q1 = v.quantile(0.25); q3 = v.quantile(0.75); irq = q3 - q1
                v_col = v[(v <= q1 - 1.5 * irq) | (v >= q3 + 1.5 * irq)]
                perc = np.shape(v_col)[0] * 100.0 / np.shape(lX)[0]
                print("Column %s outliers = %.2f%%" % (k, perc))

We are interested in property prices, so let's change the index to <code>ADDRESS</code> and remove any duplicates if they exist.

In [None]:
# Some Data Cleaning 
df_perth0.drop_duplicates(subset=['ADDRESS'],inplace=True) # Some addresses actually have multiple entries
df_perth0.index = df_perth0['ADDRESS'] # set dataframe index, since it's not really a useful feature 
del df_perth0['ADDRESS'] # let's also delete the column

## 2.2. <span style='color:rgb(97, 47, 205)'>  Categorical & Ordinal Features </span>
- We have a few __categorical features__, which can be handy for EDA as well as for model features, for example via one-hot encoding (<code>pd.get_dummies</code>).
- <code>SOLD_MONTH</code> & <code>SOLD_YEAR</code> can be extracted from <code>DATE_SOLD</code>.
- <code>SUBURB</code> probably doesn't tell us any more than <code>POSTCODE</code> does, but it is useful for __EDA__.
- Together with <code>NEAREST_STN</code> & <code>NEAREST_SCH</code>, it is possible to create some form of ranking based on the names. Road names could definitely be something we can tie to a ranking. At this stage let's just hang on to these dataframes. 
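
As a minimal sketch of the two ideas above (splitting the sale date and one-hot encoding a categorical feature), the snippet below works on a few made-up rows. The `MM-YYYY` date format and the toy values are assumptions for illustration, not taken from the dataset itself.

```python
import pandas as pd

# Toy rows; the values are made up for illustration only
toy = pd.DataFrame({
    "DATE_SOLD": ["09-2018", "02-2019", "06-2015"],
    "SUBURB": ["South Lake", "Wandi", "Camillo"],
})

# Split "MM-YYYY" into two numeric columns (the format is an assumption)
parts = toy["DATE_SOLD"].str.strip().str.split("-", n=1, expand=True)
toy["SOLD_MONTH"] = parts[0].astype(int)
toy["SOLD_YEAR"] = parts[1].astype(int)

# One-hot encode the categorical column; one indicator column per suburb
encoded = pd.get_dummies(toy.drop(columns="DATE_SOLD"), columns=["SUBURB"])
print(encoded)
```

With thousands of distinct suburbs or stations, one-hot encoding produces a very wide matrix, which is one reason a ranking-style encoding might be worth trying instead.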

In [None]:
df_num,df_cat = transformer(select_dtype=True).transform(X=df_perth0.copy())
df_num[['SOLD_MONTH', 'SOLD_YEAR']] = df_cat['DATE_SOLD'].str.split('-', n=1, expand=True).astype('float64')
df_cat.drop(['DATE_SOLD'],axis=1,inplace=True)
df_EDA = pd.concat([df_num,df_cat],axis=1) # combine 
df_cat.columns

In [None]:
import plotly.express as px

# Plot Histogram, Boxplot using Plotly
def px_stats(df, n_cols=4, to_plot='box',height=800,w=None):
    
    ldf,_ = transformer(select_dtype=True).transform(X=df)
    numeric_cols = ldf.columns
    n_rows = -(-len(numeric_cols) // n_cols)  # math.ceil in a fast way, without import
    row_pos, col_pos = 1, 0
    fig = make_subplots(rows=n_rows, cols=n_cols,subplot_titles=numeric_cols.to_list())
    
    for col in numeric_cols:
        if(to_plot == 'histogram'):
            trace = go.Histogram(x=ldf[col],showlegend=False,autobinx=True,
                                 marker = dict(color = 'rgb(27, 79, 114)',
                                 line=dict(color='white',width=0)))
        else:
            trace = getattr(px, to_plot)(ldf[col],x=ldf[col])["data"][0]
            
        if col_pos == n_cols: 
            row_pos += 1
        col_pos = col_pos + 1 if (col_pos < n_cols) else 1
        fig.add_trace(trace, row=row_pos, col=col_pos)

    fig.update_traces(marker = dict(color = 'rgb(27, 79, 114)',
                     line=dict(color='white',width=0)))
    fig.update_layout(template='plotly_white');fig.update_layout(margin={"r":0,"t":60,"l":0,"b":0})
    fig.update_layout(height=height,width=w);fig.show()

# 3. <span style='color:rgb(205, 0, 153)'>  Quick Dataset Analysis</span>

Without digging too deep, we can make some general observations from a quick preliminary investigation of the dataset. The goal is to better understand the data we are dealing with and to note some things we can try, since there is no particular task set in the dataset other than price prediction.

## 3.1. <span style='color:rgb(97, 47, 205)'> Data Distributions Histograms</span>

- The most common price range for a property is 400-500k AUD, which covers about 10,000 properties.
- We can note how uncommon __1 bedroom apartment__ properties are in Perth, the most common being a __4 bedroom property__, typically having __1 or 2 bathrooms__ & a __garage__ with __two car slots__.
- We can see a very steady __increase in property sales__ & in the number of __properties built__ in the last 6 years or so. I'm guessing real estate agents in Perth are kept busy. 
- Some of the properties have a very large number of garage slots, so it would make sense to just remove them, but let's keep them anyway. Quite a large number of features have __skewed distributions__, which is worth noting for later.
- There seems to be an interesting grouping of postcodes; four groups are visible with larger bin counts.

In [None]:
if(heval):
    px_stats(df_EDA, to_plot='histogram') # interactive

## 3.2. <span style='color:rgb(97, 47, 205)'> Data Distributions Boxplots</span>
- Complementary to histograms, boxplots indicate outliers a little more clearly, and also show useful statistics such as the __min__, __max__, __median__ & __q1/q3__ values.
- We will lean towards using tree-based methods. It is often stated that ensemble approaches such as RF are not sensitive to outliers, as in this article on [medium](https://arsrinevetha.medium.com/ml-algorithms-sensitivity-towards-outliers-f3862a13c94d); however, there are counterarguments that state the complete opposite, as shown on [stackexchange](https://stats.stackexchange.com/questions/187200/how-are-random-forests-not-sensitive-to-outliers). 
- That said, our data contains quite a lot of outliers, which is to be expected given that there is no consistent selling standard for properties, allowing certain properties to be priced above or below similar properties depending on specific circumstances.
- It is interesting to look into __creating models for a specific subset__ of our data (e.g. similar suburbs, low cost suburbs, presold properties and so on) in an attempt to get around these outliers. One model for the entire dataset seems like a huge stretch, and will most definitely have accuracy limits.
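
The boxplot whiskers rely on the usual 1.5 × IQR rule. Here is a small sketch of that rule on made-up prices (in thousands of AUD); the helper name `iqr_outlier_mask` is hypothetical, and it uses strict inequalities, whereas the helper class in this notebook uses `<=` / `>=`.

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Return a boolean mask marking values outside [q1 - k*IQR, q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Made-up prices in thousands of AUD; the last one is deliberately extreme
prices = np.array([410, 450, 470, 480, 500, 520, 530, 2400], dtype=float)
mask = iqr_outlier_mask(prices)
print(prices[mask])          # values flagged as outliers
print(100 * mask.mean())     # percentage of flagged values
```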

In [None]:
if(heval):
    px_stats(df_EDA, to_plot='box',height=550)

## 3.3. <span style='color:rgb(97, 47, 205)'> Target Model Feature Importance Evaluation</span>
- We can use multiple approaches, even early on, to quickly evaluate __which features have most weight__ in a model evaluation. This gives a better understanding of not only their importance but also how different models use these features. An interesting article about the relevance of feature importance: [machinelearningmastery](https://machinelearningmastery.com/calculate-feature-importance-with-python/).
- In addition to previously evaluated features, <code>NEAREST_SCH_RANK</code> & <code>POSTCODE</code> are quite impactful features. <code>LONGITUDE</code> & <code>LATITUDE</code> have a similar importance to what they had previously, so even with the addition of more accurate address geotagging, there is little difference. <code>NEAREST_SCH_DIST</code> was found to be one of the less impactful features.
- Although a number of features have little to no impact, the only irrelevant feature seems to be <code>SOLD_MONTH</code>, which we ought to drop a little later.
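
The idea of combining several importance scores can be sketched as follows: compute each score per feature, min-max scale every score column to [0, 1], then sum across columns to rank the features. To keep the sketch light, it swaps in two correlation-based scores (Pearson and Spearman) on synthetic data rather than the SHAP, random forest, XGBoost and `SelectKBest` scores used in this notebook; the feature names are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic data: the target depends strongly on one feature, weakly on another
rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "strong": rng.normal(size=n),
    "weak": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
y = 3.0 * X["strong"] + 0.5 * X["weak"] + rng.normal(scale=0.5, size=n)

# Two stand-in importance scores per feature
scores = pd.DataFrame({
    "pearson": X.corrwith(y).abs(),
    "spearman": X.rank().corrwith(y.rank()).abs(),  # Pearson on ranks
})

# Min-max scale each score column so that the methods are comparable
scaled = (scores - scores.min()) / (scores.max() - scores.min())

# Sum the scaled scores to get an overall ranking
ranking = scaled.sum(axis=1).sort_values(ascending=False)
print(ranking)
```

Scaling each column separately means every method contributes equally to the total, regardless of the units of its raw score.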

In [None]:
transformer(feature_importance=True,figsize=(800,400),target='PRICE').transform(X=df_EDA)

In [None]:
import geopandas as gpd

''' Plotly Geography Choropleth Plots /w Menu Layout '''
def plot_geo_menu(ldf,feature):
    
    print(ldf.info())
    
    # Load Geometry File
    wa_gdf = gpd.read_file('/kaggle/input/wa-gda2020/WA_LOCALITY_POLYGON_SHP-GDA2020.shp')    # load the WA locality polygons with geopandas
    wa_gdf.drop(['POSTCODE','PRIM_PCODE','LOCCL_CODE','DT_GAZETD','STATE_PID','DT_RETIRE','DT_CREATE','LOC_PID'],axis=1,inplace=True)

    # Display Values
    wa_gdf.index = wa_gdf['NAME']
    median_price = ldf.groupby(['SUBURB']).median(numeric_only=True)      # Suburb Median Groupby (numeric columns only)
    median_price.index = median_price.index.str.upper()
    df_merged = wa_gdf.join(median_price).dropna() 
#     df_merged = wa_gdf.join(median_price)

    # Convert geometry to GeoJSON
    df_merged = df_merged.to_crs(epsg=4326)  # WGS 84 lat/lon, as expected for GeoJSON
    lga_json = df_merged.__geo_interface__

    # Unique Token ID
    MAPBOX_ACCESSTOKEN = 'pk.eyJ1Ijoic2h0cmF1c3NhcnQiLCJhIjoiY2tqcDU2dW56MDVkNjJ6angydDF3NXVvbyJ9.nx2c5XzUH9MwIv4KcWVGLA'
#     lst_val = [df_merged[feature].min(),df_merged[feature].max()]

    trace = []    
    # Set the data for the map
    for i in feature:
        trace.append(go.Choroplethmapbox(geojson = lga_json,locations = df_merged.index,    
                               z = df_merged[i].values,                     
                               text = df_merged.index,
                               hovertemplate = "<b>%{text}</b><br>" +
                                                "%{z}<br>" +
                                                "<extra></extra>",
                               colorbar=dict(thickness=10, ticklen=3,outlinewidth=0),
                               marker_line_width=1, marker_opacity=0.8, colorscale="turbo",
                               visible=False)
                        )
    trace[0]['visible'] = True

    layout = go.Layout(mapbox1 = dict(domain = {'x': [0, 1],'y': [0, 1]},
                                      center = dict(lat=-31.95, lon=115.8),
                       accesstoken = MAPBOX_ACCESSTOKEN,zoom = 8),
                       autosize=True,height=500)
    
    lst = [];ii=-1
    for i in feature:
        ii+=1
        tlist = [False for z in range(len(feature))]
        tlist[ii] = True
        temp = dict(args=['visible',tlist],label=i,method='restyle') 
        lst.append(temp)

    # add a dropdown menu in the layout
    layout.update(updatemenus=list([dict(x=0.8,y=1.1,xanchor='left',yanchor='middle',buttons=lst)]))
    fig=go.Figure(data=trace, layout=layout)
    fig.update_layout(title_text='Suburb Median Values', title_x=0.01)
    fig.update_layout(margin={"r":0,"t":80,"l":0,"b":80},mapbox_style="light")
    fig.show()
    
''' Plotly Geography Choropleth Plots '''
def plot_geo(ldf,feature,title=None,lst_val=None):
    
    # Load Geometry File
    wa_gdf = gpd.read_file('/kaggle/input/wa-gda2020/WA_LOCALITY_POLYGON_SHP-GDA2020.shp')    # load the WA locality polygons with geopandas
    wa_gdf.drop(['POSTCODE','PRIM_PCODE','LOCCL_CODE','DT_GAZETD','STATE_PID','DT_RETIRE','DT_CREATE','LOC_PID'],axis=1,inplace=True)

    wa_gdf.index = wa_gdf['NAME']
    median_price = ldf.groupby(['SUBURB']).median(numeric_only=True)  # numeric columns only
    median_price.index = median_price.index.str.upper()
    df_merged = wa_gdf.join(median_price).dropna() # some perth suburbs don't have data & drop other WA region suburbs to speed up map load

    # Convert geometry to GeoJSON
    df_merged = df_merged.to_crs(epsg=4326)  # WGS 84 lat/lon, as expected for GeoJSON
    lga_json = df_merged.__geo_interface__

    MAPBOX_ACCESSTOKEN = 'pk.eyJ1Ijoic2h0cmF1c3NhcnQiLCJhIjoiY2tqcDU2dW56MDVkNjJ6angydDF3NXVvbyJ9.nx2c5XzUH9MwIv4KcWVGLA'

    if(lst_val is None):
        lst_val = [df_merged[feature].min(),df_merged[feature].max()]

    # Set the data for the map
    data = go.Choroplethmapbox(geojson = lga_json,
                               locations = df_merged.index,    
                               z = df_merged[feature], 
                               text = title,
                               colorbar=dict(thickness=20, ticklen=3,outlinewidth=0),
                               marker_line_width=1, marker_opacity=0.8, colorscale="viridis",
                               zmin=lst_val[0], zmax=lst_val[1])

    layout = go.Layout(mapbox1 = dict(domain = {'x': [0, 1],'y': [0, 1]},center = dict(lat=-31.95, lon=115.8),
                       accesstoken = MAPBOX_ACCESSTOKEN,zoom = 9),
                       autosize=True,height=650)

    # Generate the map
    fig=go.Figure(data=data, layout=layout)
    fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
    fig.show()

## 3.4. <span style='color:rgb(97, 47, 205)'> Target Variable &  Other Features  (Suburb Median)</span>

### Boundary Data

- We'll use the locality boundary file for <b>Western Australia</b> (where Perth is located), which is referenced to the __Geocentric Datum of Australia 2020__ (GDA2020). 
- The geographic data is available from [data.gov.au](https://data.gov.au/dataset/ds-dga-6a0ec945-c880-4882-8a81-4dbcb85e74e5/distribution/dist-dga-9fff5439-7af5-42f4-9102-42c4199c5c1c/details?q=).

### Median Values

- Properties closer to the __Indian Ocean__ tend to have a higher median price than those next to the Perth __CBD__.
- South-eastern & eastern suburbs mostly sit close to the dataset mean value.
- When it comes to beachside properties, almost all suburbs bordering the __Indian Ocean__ to the west, northwest & southwest of Perth are quite expensive; however, suburbs south of __Kwinana Beach__ are more affordable, especially considering their close proximity to the ocean.
- Some notable __suburbs__: __Kwinana Beach & Town__ have quite low median prices, yet the number of crimes committed doesn't quite explain the price range. For example, there were only a handful of drug-related crimes in __Kwinana Town__ in the last year; in fact, the number of offences committed in this suburb has dropped quite significantly since 2016. There are quite a few suburbs with higher crime rates than these two. You can find the data at [WA Police](https://www.police.wa.gov.au/Crime/CrimeStatistics#/). An interesting report on crime data is also available on [ABC](https://www.abc.net.au/news/2018-02-17/crime-data-for-every-perth-suburb-revealed-by-wa-police/9447642?nw=0).
- Other median feature values can also be investigated on the map below, using its dropdown menu.
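
The suburb medians shown on the map come from a group-by on <code>SUBURB</code>, followed by a join onto the boundary table by upper-case locality name. A minimal sketch with made-up rows is given below; the `boundaries` frame and its `AREA_KM2` column are hypothetical stand-ins for the real shapefile attributes.

```python
import pandas as pd

# Made-up sales rows, including a non-numeric column
sales = pd.DataFrame({
    "SUBURB": ["Wandi", "Wandi", "Camillo", "Camillo", "Camillo"],
    "PRICE": [500_000, 600_000, 250_000, 270_000, 400_000],
    "NEAREST_STN": ["A", "A", "B", "B", "B"],
})

# Median per suburb, restricted to numeric columns
suburb_median = sales.groupby("SUBURB").median(numeric_only=True)
suburb_median.index = suburb_median.index.str.upper()  # match boundary naming

# Hypothetical stand-in for the boundary table, indexed by locality name
boundaries = pd.DataFrame({"AREA_KM2": [10.0, 5.0, 7.0]},
                          index=["WANDI", "CAMILLO", "BROOME"])

# Localities without sales data end up with NaN and are dropped
merged = boundaries.join(suburb_median).dropna()
print(merged)
```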

In [None]:
temp,_ = transformer(select_dtype=True).transform(X=df_EDA)
tlist = temp.columns.to_list()
tlist.remove('LONGITUDE');tlist.remove('LATITUDE');
if(heval):
    plot_geo_menu(ldf=df_EDA,feature=tlist)

# 4. <span style='color:rgb(205, 0, 153)'> Missing Data & Cleaning </span>

We have three features with missing data, <code>GARAGE</code>, <code>BUILD_YEAR</code> & <code>NEAREST_SCH_RANK</code>.

In [None]:
transformer(show_nan=True,figsize=(9,5),title='Feature (NaN) %').transform(X=df_EDA)

__GARAGE Imputation__
- <code>GARAGE</code> seems quite straightforward: it probably makes sense to set missing values to zero, since the value was likely left out because no garage was present.

In [None]:
df_EDA['GARAGE'] = df_EDA['GARAGE'].fillna(0)  # fill missing data with 0

### <span style='color:rgb(27, 79, 114)'> Imputation Overview</span>

- In this notebook, let's use a __model based imputation__ approach using the current dataset features. This is one of the main assumptions in this notebook that can affect our model accuracy, so __let's flag the imputed data indices__, so we don't forget which properties had missing data. The flag feature is simply a numerical value recording the order in which the feature was passed to the imputation function.

__BUILD_YEAR Imputation__

- With multiple columns describing the property, <code>BUILD_YEAR</code> seems like a feature we can predict with a model with at least some confidence; the values should be somewhere in the right ballpark. Less than 10% of the values are missing, so the effects will hopefully be minimal.
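
A rough sketch of model based imputation with flagging is shown below. It is not the approach used later in this notebook: it swaps in a plain least-squares linear model on synthetic data, and it uses a simple 0/1 flag column (the hypothetical `BUILD_YEAR_IMPUTED`) rather than an order-based flag value.

```python
import numpy as np
import pandas as pd

# Synthetic data; the relation between the features and BUILD_YEAR is made up
rng = np.random.default_rng(42)
n = 200
df = pd.DataFrame({
    "CBD_DIST": rng.uniform(1_000, 40_000, size=n),
    "FLOOR_AREA": rng.uniform(80, 300, size=n),
})
df["BUILD_YEAR"] = (1960 + 0.001 * df["CBD_DIST"] + 0.05 * df["FLOOR_AREA"]
                    + rng.normal(scale=3, size=n))

# Knock out 10% of the target to mimic missing data
missing_idx = df.sample(frac=0.1, random_state=0).index
df.loc[missing_idx, "BUILD_YEAR"] = np.nan

features = ["CBD_DIST", "FLOOR_AREA"]
known = df["BUILD_YEAR"].notna()

# Flag the rows BEFORE filling them, so the information is not lost
df["BUILD_YEAR_IMPUTED"] = (~known).astype(int)

def design(frame):
    """Feature matrix with an intercept column."""
    return np.column_stack([np.ones(len(frame)), frame[features].to_numpy()])

# Fit on rows with a known target, then predict the missing ones
coef, *_ = np.linalg.lstsq(design(df[known]),
                           df.loc[known, "BUILD_YEAR"].to_numpy(), rcond=None)
df.loc[~known, "BUILD_YEAR"] = design(df[~known]) @ coef

print(df["BUILD_YEAR_IMPUTED"].sum(), "rows imputed")
```

Keeping the flag as its own feature lets a later price model learn whether imputed rows behave differently from the rest.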

__NEAREST_SCH_RANK Imputation__

- In contrast, quite a large portion of <code>NEAREST_SCH_RANK</code> values are missing; however, that is not a reason in itself to drop properties. The lack of a __Better Education Rank__ for some schools is probably the result of them having fewer than 20 students with an __ATAR__ value, as seen on the [website](https://bettereducation.com.au/Results/WA/wace.aspx), among other factors. 
- There are logistical issues associated with predicting missing school ranks using __property__ related features. Nevertheless, what we can do is flag the properties which have missing data, and investigate later the effect this has on the accuracy of __price__ prediction models. 
- If that introduces significant errors, we can use an alternative distance-based search, or perhaps a model (if there are several options nearby), that will __find a property__ in close proximity to the __property with missing school rank data__, which would be more logical.
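
The distance-based alternative can be sketched as a nearest-neighbour lookup: for each property with a missing rank, copy the rank of the closest property (by great-circle distance) that has one. The coordinates and ranks below are made up, and the `NEAREST_SCH_RANK_FILLED` column name is hypothetical.

```python
import numpy as np
import pandas as pd

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

# Made-up properties; two of them have no school rank
props = pd.DataFrame({
    "LATITUDE": [-31.95, -31.96, -32.10, -32.11],
    "LONGITUDE": [115.86, 115.87, 115.80, 115.81],
    "NEAREST_SCH_RANK": [12.0, np.nan, 85.0, np.nan],
})

known = props[props["NEAREST_SCH_RANK"].notna()]
filled = props["NEAREST_SCH_RANK"].copy()

# For every property with a missing rank, copy the rank of the nearest known one
for idx, row in props[props["NEAREST_SCH_RANK"].isna()].iterrows():
    dist = haversine_km(row["LATITUDE"], row["LONGITUDE"],
                        known["LATITUDE"].to_numpy(), known["LONGITUDE"].to_numpy())
    filled.loc[idx] = known["NEAREST_SCH_RANK"].iloc[int(np.argmin(dist))]

props["NEAREST_SCH_RANK_FILLED"] = filled
print(props)
```

The loop is fine for a small example; for tens of thousands of properties a spatial index would likely be needed.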

### <span style='color:rgb(27, 79, 114)'> Build Year Model Feature Importance</span>

__Model based imputation__ will build a model to estimate the missing data, so we can check what features would be most relevant if we wanted to predict <code>BUILD_YEAR</code>:

In [None]:
transformer(feature_importance=True,figsize=(800,400),target='BUILD_YEAR').transform(X=df_EDA) 

### <span style='color:rgb(27, 79, 114)'> NEAREST_SCH_RANK Model Feature Importance</span>

- Let's take a look at <code>NEAREST_SCH_RANK</code> as well; both __feature_importance__ & __bivariate relations__.
- The top 5 features give a good indication that <code>NEAREST_SCH_RANK</code> is tied to some form of suburb ranking/ordering, with <code>CBD_DIST</code> being a key feature. It's quite possible that missing values belong to very small schools that are much further away from the CBD and don't have sufficient statistics to meet the specific ranking criteria.
- The <code>NEAREST_SCH_RANK</code> value tends to reduce (better ranking) with increasing <code>PRICE</code>, up to a rank of about 30. Below this value, the relation is much more widely spread.

In [None]:
transformer(feature_importance=True,figsize=(800,400),target='NEAREST_SCH_RANK').transform(X=df_EDA)

In [None]:
from IPython.display import display_html

# Display Multiple Dataframe in HTML format
def pd_html(dfs, names=None):  # None avoids a mutable default argument
    html_str = ''
    for i in dfs:
        i.style.background_gradient(cmap='viridis')  # note: the returned Styler is not used below
    
    if names:
        html_str += ('<tr>' + 
                     ''.join(f'<td style="text-align:center">{name}</td>' for name in names) + 
                     '</tr>')
    html_str += ('<tr>' + 
                 ''.join(f'<td style="vertical-align:top"> {df.to_html(index=True)}</td>' 
                         for df in dfs) + 
                 '</tr>')
    html_str = f'<table>{html_str}</table>'
    html_str = html_str.replace('table','table style="display:inline"')
    display_html(html_str, raw=True)

In [None]:
%load_ext Cython

In [None]:
%%cython -a 
import numpy as np
cimport numpy as np

# Regularised Model
cdef class Rtg:

    cdef public np.float64_t lamda, gamma, gain
    cdef public int bfeat_id, min_size, max_depth
    cdef public np.float64_t bfeat_val, value
    cdef public Rtg lhs
    cdef public Rtg rhs
    
    def __init__(self, int max_depth=3, np.float64_t lamda=1.0, np.float64_t gamma=0.1, min_size=5):
        self.max_depth = max_depth
        self.gamma = gamma; self.lamda = lamda
        self.lhs = None; self.rhs = None
        self.bfeat_id = -1 
        self.bfeat_val = 0 
        self.value = -7e10
        self.min_size = min_size
        
        return
    
    def fit(self, np.ndarray[np.float64_t, ndim=2] X, np.ndarray[np.float64_t, ndim=1] y):
        
        # local C variables (cpdef only applies to functions/methods, so cdef is used here)
        cdef long ntot = X.shape[0]
        cdef long SL = X.shape[0]
        cdef long SR = 0
        cdef long idx = 0
        cdef long thres = 0
        cdef np.float64_t GL, GR, gain
        cdef np.ndarray[long, ndim=1] idxs
        cdef np.float64_t x = 0.0
        cdef np.float64_t best_gain = -self.gamma
        
        if self.value == -7e10:
            self.value = y.mean()
        if(self.max_depth <= 1):
            return
        
        error0 = ((y - self.value) ** 2).sum()
        error = error0; fid = 0
        n_feat = X.shape[1]
        left_value = 0; right_value = 0
        
        for feat in range(n_feat):
            
            idxs = np.argsort(X[:,feat])
            GL,GR = y.sum(),0.0
            SL,SR, thres = ntot, 0, 0
            
            while thres < ntot - 1:
                
                SL = SL - 1; SR = SR + 1
                idx = idxs[thres]
                x = X[idx, feat]
                
                GL = GL - y[idx]; GR = GR + y[idx]
                gain1 = (GL**2) / (SL + self.lamda)  + (GR**2) / (SR + self.lamda)
                gain2 = - ((GL + GR)**2) / (SL + SR + self.lamda) + self.gamma
                gain = gain1+gain2
                
                if thres < ntot - 1 and x == X[idxs[thres + 1], feat]:
                    thres += 1
                    continue
                
                if (gain > best_gain) and (min(SL,SR) > self.min_size):
                    
                    fid = 1
                    best_gain = gain
                    left_value = -GL / (SL + self.lamda)
                    right_value = -GR / (SR + self.lamda)
                    
                    self.bfeat_id = feat
                    self.bfeat_val = x

                thres += 1
        
        self.gain = best_gain
        if self.bfeat_id == -1:
            return
                
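        # child nodes inherit all hyperparameters, including min_size
        self.lhs = Rtg(max_depth=self.max_depth - 1, gamma=self.gamma, lamda=self.lamda, min_size=self.min_size)
        self.rhs = Rtg(max_depth=self.max_depth - 1, gamma=self.gamma, lamda=self.lamda, min_size=self.min_size)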
        self.lhs.value = left_value
        self.rhs.value = right_value

        idxs_l = (X[:, self.bfeat_id] > self.bfeat_val)
        idxs_r = (X[:, self.bfeat_id] <= self.bfeat_val)
        self.lhs.fit(X[idxs_l, :], y[idxs_l])
        self.rhs.fit(X[idxs_r, :], y[idxs_r])
        
        if (self.lhs.lhs is None or self.rhs.lhs is None):
            if self.gain < 0.0:
                self.lhs = None; self.rhs = None; self.bfeat_id = -1

    def ppredict(self, np.ndarray[np.float64_t, ndim=1] x):
        if self.bfeat_id == -1:
            return self.value
        if x[self.bfeat_id] > self.bfeat_val:
             return self.lhs.ppredict(x)
        else:
            return self.rhs.ppredict(x)
        
    def predict(self, np.ndarray[np.float64_t, ndim=2] X):
        y = np.zeros(X.shape[0])
        
        for i in range(X.shape[0]):
            y[i] = self.ppredict(X[i])
            
        return y
    
# Bagging Model Regularised Model
class RtgBag():
    
    def __init__(self,min_size=5,max_depth=3,n_samples=10):
            
        self.max_depth = max_depth
        self.min_size = min_size
        self.n_samples = n_samples
        self.subsample_size = None
        self.lst_tree = [Rtg(min_size=self.min_size,max_depth=self.max_depth) for _ in range(self.n_samples)]
    
    def get_samples(self,X,y):

        i = np.random.randint(0, len(X), (self.n_samples, self.subsample_size))
        sampX = X[i]; sampy = y[i]
        return sampX, sampy
    
    def fit(self,X,y):
        
        ntot = X.shape[0]
        self.subsample_size = int(ntot)
        sampX, sampy = self.get_samples(X,y)
        for i in range(self.n_samples):
            self.lst_tree[i].fit(sampX[i], sampy[i].reshape(-1))
        return self
        
    def predict(self,X):
        
        mtot = X.shape[0]; pred = []
        for i in range(self.n_samples):
            pred.append(self.lst_tree[i].predict(X))
        pred = np.array(pred).T

        return np.array([np.mean(pred[i]) for i in range(mtot)])

In [None]:
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.metrics import mean_squared_error as mse

# Gradient Boosting Model (XGB/XGB+Bagging)
class GBoost(BaseEstimator,RegressorMixin):
    
    def __init__(self, n_estimators=10, learning_rate=0.5, max_depth=3, 
                 n_samples = 15, min_size = 5, tree_id='xgb_bagging'):
            
        self.n_estimators = n_estimators
        self.max_depth = max_depth
        self.learning_rate = learning_rate
        self.min_size = min_size
        self.treem_id = []
        self.n_samples = n_samples
        self.tree_id = tree_id
        self.dm_depth = 1 
        self.mse_cond = 1.5 
    
    def fit(self, X, y):
        
        if(type(X) is not np.ndarray):
            X = X.values;
            y = y.values
        self.X = X; self.y = y
        self.treem_id = []   # start from an empty ensemble, so refitting doesn't keep stale trees
        
        ntot = X.shape[0]
        y0 = np.mean(y) * np.ones([ntot])
        prediction = y0.copy()
        prm1 = 1.1; prm2 = 1.5
        
        for t in range(self.n_estimators):
                        
            if t == 0:
                resid = y
            else:
                resid = (y - prediction)
                if (mse(temp_resid,resid) < self.mse_cond):
                    self.learning_rate = self.learning_rate/prm1
                    self.mse_cond = self.mse_cond/prm2
                    self.dm_depth = self.dm_depth+1
            
            d0 = self.min_size; d1 = self.max_depth+self.dm_depth
            if self.tree_id == 'xgb':
                submodel = Rtg(min_size=d0,max_depth=d1)
            if self.tree_id == 'xgb_bagging':
                submodel = RtgBag(min_size=d0,max_depth=d1,n_samples=self.n_samples)
                
            submodel.fit(X,-resid.astype('float64'))
            y0 = submodel.predict(X).reshape([ntot])
            self.treem_id.append(submodel)
            prediction += self.learning_rate * y0
            temp_resid = -resid

        return self
    
    def predict(self,X):
        
        if(type(X) is not np.ndarray):
            X = X.values;
        
        mtot = X.shape[0]
        y_pred_gb = np.mean(self.y)*np.ones([mtot])
        for t in range(self.n_estimators):
            y_pred_gb += self.learning_rate * self.treem_id[t].predict(X).reshape([mtot])
            
        return y_pred_gb

### <span style='color:rgb(27, 79, 114)'> Imputation Method</span>

Let's use a __two-model ensemble average__: one prediction comes from a distance-based __k-nearest neighbours__ regressor and the other from a __gradient-boosted tree__ regressor, and we simply average the two. __Bagging XGB__ is used in an attempt not to overfit the data; [previous tests](https://www.kaggle.com/shtrausslearning/xgb-bagging-regressor-tests) have shown that such an approach consistently outperforms the base model.
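
The averaging idea can be sketched on toy data as below. This is a hedged illustration only: scikit-learn's `GradientBoostingRegressor` stands in for the custom `GBoost` class, and the column names (`x1`, `x2`, `target`, `was_imputed`) are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(1)
toy = pd.DataFrame({'x1': rng.normal(size=200), 'x2': rng.normal(size=200)})
toy['target'] = 3 * toy['x1'] - toy['x2'] + rng.normal(scale=0.1, size=200)
toy.loc[toy.sample(20, random_state=1).index, 'target'] = np.nan  # knock out 20 values

features = ['x1', 'x2']
missing = toy['target'].isna()
train, test = toy[~missing], toy[missing]

# Two different regressors trained on the rows where the target is known
knn = KNeighborsRegressor(n_neighbors=15).fit(train[features], train['target'])
gbr = GradientBoostingRegressor(random_state=1).fit(train[features], train['target'])

# Simple ensemble: unweighted mean of the two predictions
pred = 0.5 * (knn.predict(test[features]) + gbr.predict(test[features]))

imputed = toy.copy()
imputed.loc[missing, 'target'] = pred
imputed['was_imputed'] = missing   # flag the filled rows so they can be excluded later
```

Keeping a flag column next to the filled values makes it cheap to drop or re-impute those rows if they later turn out to hurt the price models.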

In [None]:
from sklearn.neighbors import KNeighborsRegressor

# function that imputes a dataframe 
def impute_model(df,cols=None):

    # separate dataframe into numerical/categorical
    ldf = df.select_dtypes(include=[np.number]).copy()    # numerical columns (copy, so df is untouched)
    ldf_putaside = df.select_dtypes(exclude=[np.number])  # non-numerical columns, reattached at the end
    # define columns w/ and w/o missing data
    cols_nan = ldf.columns[ldf.isna().any()].tolist()         # list of features w/ missing data 
    cols_no_nan = ldf.columns.difference(cols_nan).values     # all columns w/o missing data (model inputs)
    
    if(cols is not None):
        cols_nan = cols
    df1 = ldf[cols_nan].describe()   # statistics before imputation
    
    for fill_id, col in enumerate(cols_nan):
        missing = df[col].isna()      # rows with missing data will become our test set
        imp_test = ldf[missing]
        # train on every row where this column is known; a plain dropna() would also
        # drop rows on the NaN-filled 'fill_id' flag column after the first iteration
        imp_train = ldf[~missing]
        xgb = GBoost(n_estimators=10,tree_id='xgb_bagging')   # bagged gradient boosting (custom class)
        knr = KNeighborsRegressor(n_neighbors=15)             # k-nearest neighbours regressor
        xgb.fit(imp_train[cols_no_nan], imp_train[col])
        knr.fit(imp_train[cols_no_nan], imp_train[col])
        xgbP = xgb.predict(imp_test[cols_no_nan])
        knrP = knr.predict(imp_test[cols_no_nan])
        pred = (knrP + xgbP)*0.5              # simple average of the two models
        ldf.loc[missing, col] = pred
        ldf.loc[missing,'fill_id'] = fill_id  # flag imputed rows (NaN = not imputed)
        
    df2 = ldf[cols_nan].describe()   # statistics after imputation
    pd_html([df1,df2],['before imputation','after imputation'])
        
    return pd.concat([ldf,ldf_putaside],axis=1)

In [None]:
df_EDA2 = impute_model(df_EDA,cols=['BUILD_YEAR','NEAREST_SCH_RANK'])

- We've added about __10,000 properties__ without affecting the data statistics too much. That said, this doesn't guarantee we have labelled them correctly, so if these properties are associated with large errors in our __price__ prediction models, we can always reimpute them with some alternative method.
- It is useful to flag these points as well, so they can be removed easily if we need to; this is done via <code>fill_id</code>: 2003 imputed for <code>BUILD_YEAR</code> & 10938 for <code>NEAREST_SCH_RANK</code>.

In [None]:
df_EDA2['fill_id'].value_counts()

In [None]:
df_EDA2.info()

# 5. <span style='color:rgb(205, 0, 153)'> Model Evaluation </span>
- The main goal of all of these models is to predict the target variable <code>PRICE</code>.
- For our __evaluation metric__, let's use the commonly used __RMSLE__ metric; some example applications can be found on [ScienceDirect](https://www.sciencedirect.com/science/article/pii/S1877050920316318) & why it is used over RMSE can be found on [medium](https://medium.com/analytics-vidhya/root-mean-square-log-error-rmse-vs-rmlse-935c6cc1802a). 
- __Cross Validation__ is an important strategy to get an idea of how much the model's performance varies across the dataset. Here we'll use a 6-fold cross validation approach, together with a 70/30 split for training/evaluation model prediction. Whilst we do have access to __test results__, we are not going to tune on them or use them as an indicator to improve our model, since we don't want to overfit to the test data. 
- __eval class__ is used as the main evaluation class. The class <code>.transform()</code> output stores the original features together with the predicted training/test model results & their individual group_id, so we can identify which data was used as __training data__ & which was used for __evaluation prediction__; group_id is kept for __post model data exploration__.
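
To make the metric and the 6-fold setup concrete, here is a minimal sketch on synthetic data. The data, the `LinearRegression` model and the name `rmsle_scorer` are assumptions for illustration. Unlike the class below, this sketch passes `greater_is_better=False`, so scikit-learn negates the error internally and we flip the sign back when reporting.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_score

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error; both inputs must be > -1."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# RMSLE is an error (lower is better), so the scorer returns its negative
rmsle_scorer = make_scorer(rmsle, greater_is_better=False)

rng = np.random.default_rng(0)
X = rng.uniform(50, 300, size=(120, 1))                       # e.g. a floor area
y = 2_000 * X[:, 0] + 100_000 + rng.normal(0, 5_000, 120)     # strictly positive "prices"

# 6-fold cross validation; negate to get positive RMSLE values back
scores = -cross_val_score(LinearRegression(), X, y, cv=6, scoring=rmsle_scorer)
print(scores.mean(), scores.std())
```

Because the error is computed on `log1p` values, it measures relative rather than absolute differences, which is why it suits targets like house prices that span a wide range.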

__Baseline Model__

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.metrics import make_scorer
import seaborn as sns; import time
from sklearn.model_selection import cross_val_score, train_test_split
from tqdm import tqdm_notebook

# Class to Build Model
class eval(BaseEstimator,TransformerMixin):
    
    ''' Input: DataFrame Feature matrix + Target Variable '''
    
    def __init__(self,split_id=None,shuffle=False,verbose=False,time=False,
                 target=None,n_cv=6,cv_yrange=None,hm_vvals=[0.5,1.0,0.75]):
        self.split_id = split_id    # test size fraction [float]
        self.shuffle = shuffle      # shuffle option in train/test split [T/F]
        self.target = target        # target name [str]
        self.models = models        # list of (name, model) tuples, read from the global `models` list
        self.n_cv = n_cv            # number of cross validation folds [int]
        self.cv_yrange=cv_yrange # May need to adjust y-range value of cv plot
        self.hm_vvals = hm_vvals # Heatmap min,max,mid display values
        self.verbose = verbose 
        self.time = time 
        
    def fit(self,X,y=None):
        return self
        
    def transform(self,X):
            
        if(self.time):
            t0 = time.time()
            
        ''' Split input dataframe into Numerical & Categorical Features '''
        lX,lXo = transformer(select_dtype=True).transform(X=X)

        ''' Split Data into Training & Evaluation Datasets '''
        train_df,eval_df = train_test_split(lX,test_size=self.split_id,shuffle=self.shuffle,random_state=32)
        y_train = train_df[self.target]
        X_train = train_df.loc[:, train_df.columns != self.target]
        y_eval = eval_df[self.target]
        X_eval = eval_df.loc[:, eval_df.columns != self.target]
              
        ''' Print Info '''
        print('features:')
        print(f'X_train {X_train.columns}')
        print(f'X_train shape: {X_train.shape}')
        print(f'X_eval shape: {X_eval.shape}')
        
        ''' Use Numpy instead of DataFrame '''
        if(type(X) is not np.ndarray):
            X_train = X_train.values;
            y_train = y_train.values
            X_eval = X_eval.values
            y_eval = y_eval.values
        
        ''' Main Evaluation Loop, cycle though models in global list '''
        lst_res = []; names = []; lst_train = []; lst_eval = []; lst_res_mean = []
        for name, model in self.models:  # cycle through models & evaluate either cv or train/test
            
            if(self.verbose):
                print('')
                print(f'Running Model: {name}')
            names.append(name)
        
            ''' Pipeline Train/Test Split Prediction '''
            model.fit(X_train,y_train)

            ''' I. Standard Cross Validation '''
            
            # RMSLE score
            def rmsle(y, y0):
                assert len(y) == len(y0)
                return np.sqrt(np.mean(np.power(np.log1p(y)-np.log1p(y0), 2)))
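            # RMSLE is an error (lower is better); greater_is_better=True only keeps the reported values positive
            rmsle_score = make_scorer(rmsle, greater_is_better=True)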
            
#             cv_score = np.sqrt(-cross_val_score(pipe,X_train,y_train,cv=self.n_cv,scoring='neg_mean_squared_error'))
            cv_score = cross_val_score(model,X_train,y_train,cv=self.n_cv,scoring=rmsle_score)
    
            # Print & Store in List CV results
            if(self.verbose):
                print("Scores:",cv_score);print("Mean:", cv_score.mean().round(3));print("std:", cv_score.std().round(3)) 
            lst_res.append(cv_score)              # store cross validation scores in list
            lst_res_mean.append(cv_score.mean())  # store mean cross validation score in list

            ''' II. Train/Test Split Evaluation '''
            y_model1 = model.predict(X_train)     # predict on training data
            y_model2 = model.predict(X_eval);     # predict on test data
            
            if(self.time):
                t1 = time.time()
                print(f'{name} - time: {round((t1-t0),4)}')
            
            # Print & Store in List Train/Eval Prediction Scores
            if(self.verbose):
                print(f'Train/Test Scores: {rmsle(y_train,y_model1).round(3)} : {rmsle(y_eval,y_model2).round(3)}') 
            lst_train.append(rmsle(y_train,y_model1).round(3))
            lst_eval.append(rmsle(y_eval,y_model2).round(3))
            
            ''' Store Results '''
            train_df[f'{name}_error'] = abs(y_model1-y_train) # store abs of difference between model & true value (train)
            eval_df[f'{name}_error'] = abs(y_model2-y_eval) # store abs of difference between model & true value (test)
            train_df['group_id'] = 0; eval_df['group_id'] = 1   # define training(0)/test(1) group data identifier (useful for plot)
            train_df[f'{name}'] = y_model1 # store training model prediction
            eval_df[f'{name}'] = y_model2  # store evaluation model prediction
            ldf_out = pd.concat([train_df,eval_df],axis=0) 
            
        ''' Regroup Numerical & Categorical Features '''
        ldf_out_all = pd.concat([ldf_out,lXo],axis=1) # add non numerical features to output
        
        ''' Plot Cross Validation Box Plots & Heatmap of mean CV + Train/Test Results '''
        
        # For Heatmap Output
        s0 = pd.Series(np.array(lst_res_mean),index=names)
        s1 = pd.Series(np.array(lst_train),index=names)
        s2 = pd.Series(np.array(lst_eval),index=names)
        pdf = pd.concat([s0,s1,s2],axis=1)
        pdf.columns = ['cv_average','train','test']
        
        # Plot Results
        sns.set(style="whitegrid")
        fig,ax = plt.subplots(1,2,figsize=(15,4))
        ax[0].set_title(f'{self.n_cv}-Fold Cross Validation Results')
        sns.boxplot(data=lst_res, ax=ax[0], orient="v",width=0.2)
        ax[0].set_xticklabels(names)
        sns.stripplot(data=lst_res,ax=ax[0], orient='v',color=".3",linewidth=1)
        ax[0].set_xticklabels(names)
        ax[0].xaxis.grid(True)
        ax[0].set(xlabel="")
        if(self.cv_yrange is not None):
            ax[0].set_ylim(self.cv_yrange)
        sns.despine(trim=True, left=True)
    
        sns.heatmap(pdf,vmin=self.hm_vvals[0],vmax=self.hm_vvals[1],center=self.hm_vvals[2],
                    ax=ax[1],square=False,lw=2,annot=True,fmt='.4f',cmap='jet')
#         ax[1].set_title(f'{scoring} scores')
        plt.show()
        
            
        return ldf_out_all

In [None]:
print('All Current Dataset Features')
print(df_EDA2.columns)

## 5.1. <span style='color:rgb(97, 47, 205)'> Baseline Models</span>
### Model Exploration
- Let's start by creating a __baseline model__, which will use the features that were [available before the dataset update](https://www.kaggle.com/shtrausslearning/perth-housing-price-prediction-eda).
- The only variation that exists between the current set of features and the older dataset is that <code>LONGITUDE</code> & <code>LATITUDE</code> are more precise in this dataset, which actually improves the accuracy (not shown here).
- Let's try out different models along the way & __pick out some promising ones__, as suggested in the [ML checklist](https://www.kaggle.com/shtrausslearning/machine-learning-project-checklist); the idea is to get an overall picture of how different models tend to perform on this dataset.
- We will output the __cross validation__ results in the form of a box plot, and the cross validation mean, training & test RMSLE results in a seaborn heatmap.
- Blank spots in the __heatmap__ indicate that some or all of the model's results are NaN.

In [None]:
# base column feature list
print('Baseline Feature List')
tdf_base = df_EDA2.drop(['LONGITUDE','LATITUDE','SOLD_MONTH','SOLD_YEAR','fill_id','NEAREST_SCH_DIST','NEAREST_SCH_RANK','POSTCODE'],axis=1)
print(tdf_base.columns)

In [None]:
from catboost import CatBoostRegressor
from xgboost import XGBRegressor
from sklearn.gaussian_process import GaussianProcessRegressor as GPR
from sklearn.gaussian_process.kernels import ConstantKernel, RBF
from sklearn.ensemble import (RandomForestRegressor,GradientBoostingRegressor,
                              ExtraTreesRegressor,AdaBoostRegressor)
from sklearn.linear_model import LinearRegression,Lasso,ElasticNet
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR

''' Evaluate some promising models '''
# let's look at how models perform overall, using default settings

models = [] 
models.append(('LR',  LinearRegression()))
models.append(('LASSO',Lasso()))
models.append(('EN',ElasticNet()))  
models.append(('KNN',KNeighborsRegressor()))        
models.append(('CART',DecisionTreeRegressor()))     
models.append(('SVR',SVR()))                        
models.append(('ABR', AdaBoostRegressor()))
models.append(('GBR', GradientBoostingRegressor()))
models.append(('RFR', RandomForestRegressor()))
models.append(('ETR', ExtraTreesRegressor()))
models.append(('XGB', XGBRegressor(verbosity=0)))   # verbosity=0 silences XGBoost messages
models.append(('CAT', CatBoostRegressor(silent=True))) 

if(heval2):
    # Evaluate Base Model & Return Input DataFrame w/ Results 
    out_1 = eval(target='PRICE',split_id=0.3,shuffle=False,verbose=False,time=False,
                 cv_yrange=(None,None),hm_vvals=[0.0,1.0,0.325]).transform(X=tdf_base)

__Model Performance__
- Whilst we have an evaluation on __test data__, we will be making our observations based on __CV & training__ results only; it is nice to know the __test results__ as well though.
- Among the better performing models, we have <code>RandomForest</code>, <code>ExtraTreesRegressor</code>, <code>XGBRegressor</code> & <code>CatboostRegressor</code>, all of which are powerful ensemble approaches.
- <code>CatboostRegressor</code>, as its developers suggest, is quite good right out of the box, performing quite well in the cross validation.
- <code>RandomForest</code> performed quite well too; its good training score, however, likely indicates an overfit model to some extent.
- <code>XGBoost Regressor</code> performed quite well on the training data (relative to other models), so perhaps we need to tune its hyperparameters a little more. The default XGB doesn't always perform very well with its default hyperparameters, as shown in [this notebook](https://www.kaggle.com/shtrausslearning/xgb-class-bagging-regressor-tests). Let's also use the class shown in the same notebook, over the XGBoost library, in the next test.
- It was interesting to note that, similar to the linear models, <code>XGBoost Regressor</code> on occasion had a __NaN__ result on the test set; other ensemble models did not have this issue.

__Models that didn't work__

<code>Gaussian Process Regressor</code> was also attempted; however, the code used in [this notebook](https://www.kaggle.com/shtrausslearning/airfoil-noise-prediction-modeling-using-gpr) unfortunately does not evaluate the likelihood <code>objective function</code> quickly enough for it to be viable. The same can be said for sklearn's version, which makes you wonder what could be done to make the approach more viable (aside from just using gridsearch). There are some libraries that increase the efficiency of <code>scipy's optimise.</code>

### Baseline XGB + Bagging Model
Let's check out how our baseline XGB class model fares, using two variants, __The Standard XGB Model__ & __The Bagging Ensemble XGB Model__, with the same hyperparameter settings, and compare them to the XGBRegressor.
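
For reference, the split score that the custom `Rtg` tree maximises can be restated in plain NumPy as below. This mirrors the expressions written in `Rtg.fit` (sample counts play the role of the hessian sums, which is what squared-error loss gives); the function names `split_gain` and `leaf_value` are made up for the sketch. Note that the formulation in the XGBoost paper carries a factor of 1/2 and subtracts `gamma` instead of adding it.

```python
import numpy as np

def split_gain(g_left, g_right, lam=1.0, gamma=0.1):
    """Split score as written in Rtg.fit: children score minus parent score, plus gamma."""
    GL, GR = np.sum(g_left), np.sum(g_right)   # gradient sums on each side
    SL, SR = len(g_left), len(g_right)         # sample counts on each side
    children = GL**2 / (SL + lam) + GR**2 / (SR + lam)
    parent = (GL + GR)**2 / (SL + SR + lam)
    return children - parent + gamma

def leaf_value(g, lam=1.0):
    """Regularised leaf weight: -G / (S + lambda)."""
    return -np.sum(g) / (len(g) + lam)

g = np.array([-2.0, -2.0, 2.0, 2.0])
good = split_gain(g[:2], g[2:])             # separates the two groups cleanly
bad = split_gain(g[[0, 2]], g[[1, 3]])      # mixes them, so the sums cancel
print(good, bad, leaf_value(g[:2]))
```

A split that separates samples with similar gradients scores far higher than one that mixes them, and `lam` shrinks both the score and the leaf values towards zero.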

In [None]:
models = [] 
models.append(('XGB-Library', XGBRegressor(n_estimators=25,learning_rate=0.3,max_depth=6,verbosity=0)))
models.append(('BAG-XGB',GBoost(n_estimators=25,tree_id='xgb_bagging',learning_rate=0.3,max_depth=6)))
models.append(('XGB',GBoost(n_estimators=25,tree_id='xgb',learning_rate=0.3,max_depth=6)))

# Evaluate Base Model & Return Input DataFrame w/ Results 
if(heval2):
    out_2 = eval(target='PRICE',split_id=0.3,shuffle=False,verbose=False,time=True,
                 cv_yrange=(None,None),hm_vvals=[0.0,1.0,0.325]).transform(X=tdf_base)

- <code>Cross validation</code> performance is very similar to the XGB library, a little worse for the identical models. The current code is definitely not on par with the XGB library when it comes to efficiency, so there are some things to think about.
- Like [previous tests](https://www.kaggle.com/shtrausslearning/xgb-class-bagging-regressor-tests), the __bagging model performs slightly better than the default model__, even for a more realistic problem, so that was interesting.
- Despite the slightly better cross validation of the __bagging model__, let's stick to the standard XGB library for now, since its efficiency is much better.
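
The bagging step itself is simple enough to sketch in a few lines. Below, scikit-learn's `DecisionTreeRegressor` stands in for the custom `Rtg` tree, and the helper name `bagged_predict` and the toy sine data are assumptions for illustration: each tree is fitted on a bootstrap resample of the training rows and the predictions are averaged.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_predict(X_train, y_train, X_test, n_samples=10, max_depth=3, seed=0):
    """Average the predictions of trees fitted on bootstrap resamples."""
    rng = np.random.default_rng(seed)
    n = len(X_train)
    preds = []
    for _ in range(n_samples):
        idx = rng.integers(0, n, size=n)   # bootstrap: draw n rows with replacement
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X_train[idx], y_train[idx])
        preds.append(tree.predict(X_test))
    return np.mean(preds, axis=0)          # ensemble prediction = mean over trees

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.2, 200)
y_hat = bagged_predict(X[:150], y[:150], X[150:])
print(y_hat[:5])
```

Averaging over resamples reduces the variance of a single tree, which is the motivation for using it inside each boosting round, at the cost of fitting `n_samples` trees per round.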

### Baseline Feature Importance

In [None]:
transformer(feature_importance=True,figsize=(800,300),target='PRICE').transform(X=tdf_base) 

- It was interesting to note that <code>BEDROOMS</code> was not one of the more important features, despite its high correlation to <code>PRICE</code>.
- <code>FLOOR_AREA</code>, <code>CBD_DIST</code> & <code>BATHROOMS</code> are, on the other hand, among the more important features in the model.
- Despite its low correlation to <code>PRICE</code>, in models like <code>CatBoost</code>, <code>LAND_AREA</code> is a relatively important feature. 

## 5.2. <span style='color:rgb(97, 47, 205)'> All Numerical Feature Models</span>
- The next models use __all current numerical features available__ (<code>EDA2 DataFrame</code> w/o <code>fill_id</code>).
- We'll also compare the feature importance with the baseline feature evaluation, and see if there are any variations.
- <code>RandomForest</code> & <code>CatBoost</code> were quite promising models as well, so let's compare their accuracy too.

In [None]:
# base column feature list
tdf_num = df_EDA2.drop(['fill_id'],axis=1)
# tdf_num = df_EDA2.copy()
tdf_num.columns

In [None]:
models = [] 
models.append(('GBR', GradientBoostingRegressor()))
models.append(('RFR', RandomForestRegressor()))
models.append(('ETR', ExtraTreesRegressor()))
models.append(('XGB', XGBRegressor(verbosity=0)))   # verbosity=0 silences XGBoost messages
models.append(('CAT', CatBoostRegressor(silent=True))) 

# Evaluate Base Model & Return Input DataFrame w/ Results
if(heval2):
    out_3 = eval(target='PRICE',split_id=0.3,shuffle=False,verbose=False,
                 cv_yrange=(None,None),hm_vvals=[0.0,1.0,0.325]).transform(X=tdf_num)

- The errors are __significantly lower__ than before, which is quite promising since we need to improve the model as much as possible.
- <code>XGB</code> & <code>CAT</code> models interestingly enough failed on some __inner cross validation__, training & test sets; <code>RandomForest</code>, on the other hand, has no such issues with this set of features. 

In [None]:
transformer(feature_importance=True,figsize=(800,350),target='PRICE').transform(X=tdf_num) 

- <code>NEAREST_SCH_RANK</code> & <code>POSTCODE</code> are __quite important additions__ to the dataset, as shown in the <code>feature importance</code> table.
- Looking at <code>XGBRegressor</code> & <code>CatboostRegressor</code>, it was interesting to note that even for this set of features, <code>bedrooms</code> & <code>bathrooms</code> aren't among the most important features; usually they are quite significant factors when properties are presented to customers.
- <code>SOLD_MONTH</code> is definitely a feature worth dropping. <code>GARAGE</code> also seems to be among the less important features, so we definitely need to do more investigation into __feature transformation__. 

## 5.3. <span style='color:rgb(97, 47, 205)'> Numerical Feature XGB Bagging & Ensemble Models : Reduced Feature Update </span>
- <code>NEAREST_SCH_DIST</code> and <code>SOLD_MONTH</code> are among the least important features as shown in the previous section, let's get rid of these features and see if there is any impact. 
- A reduction of features that have little contribution is quite desirable since they affect the training times.

In [None]:
# Numerical Features w/o SOLD_MONTH & NEAREST_SCH_DIST
tdf_num_red1 = tdf_num.drop(['SOLD_MONTH','NEAREST_SCH_DIST'],axis=1)
tdf_num_red1.columns

In [None]:
models = [] 
models.append(('GBR', GradientBoostingRegressor()))
models.append(('RFR', RandomForestRegressor()))
models.append(('ETR', ExtraTreesRegressor()))
models.append(('XGB', XGBRegressor(verbosity=0)))   # verbosity=0 silences XGBoost messages
models.append(('CAT', CatBoostRegressor(silent=True))) 

# Evaluate Base Model & Return Input DataFrame w/ Results 
# if(heval2):
out_4 = eval(target='PRICE',split_id=0.3,shuffle=False,verbose=False,
             cv_yrange=(None,None),hm_vvals=[0.0,1.0,0.325]).transform(X=tdf_num_red1)

# 6. <span style='color:rgb(205, 0, 153)'> EDA: Presold Housing Properties</span>
- An interesting concept often encountered in housing is that of presold properties, not just in Australia, but worldwide in general. A __presold property__ is one which is sold before it is physically built. 
- Interest in singling out __presold properties__ stems from the fact that it is a slightly different way of selling a property, so they could well be acting as outliers; we have yet to investigate in detail what the outliers in this problem are.
- The underlying commonality between a lot of such properties is that __they generally tend to be cheaper__ than properties that are sold after they are built. One possible reason is that there is a certain risk associated with investing a large amount of savings into something that doesn't yet exist, which opens buyers up to the possibility of fraud from the constructor's side.

__Some things that can interest us:__

- It would be of interest to build a model specifically to predict presold property <code>PRICE</code>.
- And investigate some potentially interesting statistics about these properties.

## 6.1. <span style='color:rgb(97, 47, 205)'> The Numbers </span>
- It's probably interesting to understand the scale we're dealing with: how many properties are actually presold every year?
- __568 out of 32k properties__ isn't a lot; let's check the numbers by <code>SOLD_YEAR</code> as well.

In [None]:
df_EDA_presold = tdf_num.copy()
df_EDA_presold['PRESOLD'] = df_EDA_presold['SOLD_YEAR'].astype('int') < df_EDA_presold['BUILD_YEAR']
df_EDA_presold['PRESOLD'].value_counts()

In [None]:
dfx = df_EDA_presold.groupby(['SOLD_YEAR'])[['PRESOLD']].sum()   # presold count per sold year (skips non-numeric columns)
dfx2 = df_EDA_presold['BUILD_YEAR'].value_counts()

# Plot
fig = go.Figure()
fig.add_trace(go.Bar(x=dfx.index, y=dfx['PRESOLD'],width=1,name='Presold Properties by Year'))
fig.add_trace(go.Bar(x=dfx2.index, y=dfx2.values,width=1,name='Properties Built by Year'))

# Figure Aesthetics
# fig.update_traces(marker_color='rgb(158,202,225)',marker_line_width=1, marker_line_color='rgb(8,48,107)',
fig.update_traces(opacity=0.9)
fig.update_layout(barmode='stack',template='plotly_white',height=300,title='Properties Built by Year')
fig.update_layout(margin={"r":0,"t":60,"l":0,"b":0});fig.show()

- We can note that the __number of presold properties is quite small__ compared to the total number of properties built every year.
- __2013 & 2014__ saw a very high number of presold properties, __exceeding 100 properties__ for the first time.

## 6.2. <span style='color:rgb(97, 47, 205)'> Bivariate Scatter Feature Relation</span>
- Let's take a look at a scatter matrix relation, __differentiating between presold & non presold properties__.
- Since we have too many features, we'll only outline some of the key ones, as per the <code>feature importance</code> table seen in previous sections.

In [None]:
# Plot Scatter Matrix using Plotly Express
def scat_mat(ldf,dim=None,colour=None,hov_name=None,title=None):
    
    fig = px.scatter_matrix(ldf,dimensions=dim,opacity=0.5,color=colour,hover_name=hov_name,height=1000)
    fig.update_traces(marker=dict(size=5,line=dict(width=0.5,color='black')))
    fig.update_layout(template='plotly_white',title=title) # stack/overlay/group
    fig.update_traces(diagonal_visible=False)
    fig.show()

In [None]:
tlist = ['PRICE','NEAREST_SCH_RANK','BATHROOMS','CBD_DIST','FLOOR_AREA']
scat_mat(ldf=df_EDA_presold,dim=tlist,colour='PRESOLD',hov_name=df_EDA_presold.index,title='Presold Property Scatter Matrix Relations')

- First things first, as mentioned at the start of this section, __presold properties__ have tended to be much cheaper than properties that are sold after completion.
- Although less common, there were a number of __presold properties__ near schools ranked within the top 50-60 that were quite affordable; however, as with __non presold properties__, the <code>PRICE</code> tended to go up quite steeply when the nearest school was ranked even higher.
- The number of <code>BATHROOMS</code> & the <code>FLOOR_AREA</code> don't tend to be very different from those of non presold properties.
- Presold properties tend to be slightly further away from the CBD, in the region of 20k-40k (m).
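
The visual impressions above could be backed by a quick numeric summary. The following is a minimal sketch, assuming a dataframe with the same column names as `df_EDA_presold`; the `toy` dataframe and all of its values are made up purely for illustration and are not taken from the dataset:

```python
import pandas as pd

# Toy stand-in for df_EDA_presold (hypothetical values, not real data)
toy = pd.DataFrame({
    'PRESOLD':    [True, True, False, False, False],
    'PRICE':      [350000, 410000, 520000, 780000, 610000],
    'CBD_DIST':   [32000, 27000, 9000, 15000, 21000],
    'BATHROOMS':  [2, 2, 2, 3, 2],
    'FLOOR_AREA': [180, 200, 170, 240, 190],
})

# Median of each key feature, split by the PRESOLD flag
features = ['PRICE', 'CBD_DIST', 'BATHROOMS', 'FLOOR_AREA']
summary = toy.groupby('PRESOLD')[features].median()
print(summary)
```

Medians are used rather than means so that a handful of very expensive properties doesn't dominate the comparison.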

Let's use geospatial maps to visualise their locations as well.

In [None]:
if(heval):
    plot_geo(ldf=df_EDA_presold[df_EDA_presold['PRESOLD']==True],feature='PRICE',title='Presold Properties Median House Price')

## 6.3. <span style='color:rgb(97, 47, 205)'> Build Year & Sold Year </span>
- From the __scatter data__, we can already notice that presold properties tend to be on the lower side of <code>PRICE</code> compared to non-presold properties; however, this doesn't tell us anything about how prices have changed over time.
- Using pivot tables, let's find out __what kind of trends have been occurring every year__, looking at the __minimum__ & __median__ values.
- We'll take a look at two features, <code>BUILD_YEAR</code> & <code>SOLD_YEAR</code>. Values grouped by <code>BUILD_YEAR</code> can oscillate quite a bit, since they span the entire property market and properties may already have changed hands since being built, even <code>presold</code> ones. Even in such a case, presold properties tend to be on the cheaper end of <code>PRICE</code>.

In [None]:
df_EDA2['PRESOLD'] = df_EDA2['SOLD_YEAR'].astype('int') < df_EDA2['BUILD_YEAR']

# Subset of Interest
presold = df_EDA2[(df_EDA2['PRESOLD'] == True) & (df_EDA2['fill_id'] != 0)]  # presold properties w/o some build year data
not_presold = df_EDA2[(df_EDA2['PRESOLD'] == False) & (df_EDA2['fill_id'] != 0)] # non presold properties w/o some build year data

# Presold Properties 
pre_buildyear_min = presold.pivot_table('PRICE',index='BUILD_YEAR',columns='SUBURB').min(axis=1)
pre_soldyear_min = presold.pivot_table('PRICE',index='SOLD_YEAR',columns='SUBURB').min(axis=1)
pre_buildyear_med = presold.pivot_table('PRICE',index='BUILD_YEAR',columns='SUBURB').median(axis=1)
pre_soldyear_med = presold.pivot_table('PRICE',index='SOLD_YEAR',columns='SUBURB').median(axis=1)

# Non Presold Properties
buildyear_min = not_presold.pivot_table('PRICE',index='BUILD_YEAR',columns='SUBURB').min(axis=1)
soldyear_min = not_presold.pivot_table('PRICE',index='SOLD_YEAR',columns='SUBURB').min(axis=1)
buildyear_med = not_presold.pivot_table('PRICE',index='BUILD_YEAR',columns='SUBURB').median(axis=1)
soldyear_med = not_presold.pivot_table('PRICE',index='SOLD_YEAR',columns='SUBURB').median(axis=1)

In [None]:
from plotly.subplots import make_subplots
import plotly.graph_objs as go

fig = make_subplots(rows=1,cols=2,subplot_titles=['BUILD_YEAR','SOLD_YEAR'])

# Build Year Plots

fig.add_trace(go.Scatter(x=pre_buildyear_min.index, y=pre_buildyear_min.values,name='Presold: Min Price by Build Year'),row=1, col=1)
fig.add_trace(go.Scatter(x=buildyear_min.index, y=buildyear_min.values,name='Non-Presold: Min Price by Build Year'),row=1, col=1)

# Sold Year Plots
fig.add_trace(go.Scatter(x=pre_soldyear_min.index, y=pre_soldyear_min.values,name='Presold: Min Price by Sold Year'),row=1, col=2)
fig.add_trace(go.Scatter(x=soldyear_min.index, y=soldyear_min.values,name='Non-Presold: Min Price by Sold Year'),row=1, col=2)

# Plot Aesthetics
fig.update_layout(template='plotly_white',title='Minimum Property PRICE')
fig.update_yaxes(range=[0,500000]);fig.update_xaxes(range=[1990,2020])
fig.update_layout(margin={"r":0,"t":60,"l":0,"b":0},height=300)
fig.show()

In [None]:
fig = make_subplots(rows=1, cols=2,subplot_titles=['BUILD_YEAR','SOLD_YEAR'])

# Build Year Plots
fig.add_trace(go.Scatter(x=pre_buildyear_med.index, y=pre_buildyear_med.values,name='Presold: Median Price by Build Year'),row=1, col=1)
fig.add_trace(go.Scatter(x=buildyear_med.index, y=buildyear_med.values,name='Non-Presold: Median Price by Build Year'),row=1, col=1)

# Sold Year Plots
fig.add_trace(go.Scatter(x=pre_soldyear_med.index, y=pre_soldyear_med.values,name='Presold: Median Price by Sold Year'),row=1, col=2)
fig.add_trace(go.Scatter(x=soldyear_med.index, y=soldyear_med.values,name='Non-Presold: Median Price by Sold Year'),row=1, col=2)

# Plot Aesthetics
fig.update_layout(template='plotly_white',title='Median Property Price')
fig.update_yaxes(range=[0,800000])
fig.update_xaxes(range=[1990,2020])
fig.update_layout(margin={"r":0,"t":60,"l":0,"b":0},height=300)
fig.show()

__Build Year__

- For non-presold properties, the __minimum__ & __median__ values of <code>PRICE</code> can fluctuate a bit, due to market variability & properties having gone through multiple sales by the time the latest <code>PRICE</code> value was recorded.
- Presold properties, on the other hand, tend to sit at the lower end for both __minimum__ & __median__ values when looking at data from 1990 onwards.

__Sold Year__

- <code>SOLD_YEAR</code>, in contrast, tells us how the market has tended to evolve, since the value represents the last sale year. Both the __minimum__ & __median__ <code>PRICE</code> values of the two groups tended to be very similar up to about 2000.
- In the recent decade, there has tended to be quite a large gap in both the __minimum__ & __median__ <code>PRICE</code> values between presold & non-presold properties in the same year.
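
One caveat is worth keeping in mind when reading these curves: `pivot_table` defaults to `aggfunc='mean'`, so the pivot tables built earlier hold the mean `PRICE` per suburb and year, and `.min(axis=1)` / `.median(axis=1)` then take the minimum and median of those suburb means rather than of individual sale prices. A minimal sketch on made-up data (the `toy` dataframe and its values are assumptions for illustration only) shows the difference:

```python
import pandas as pd

# Toy data (hypothetical values, not taken from the dataset)
toy = pd.DataFrame({
    'SOLD_YEAR': [2015, 2015, 2015, 2016, 2016],
    'SUBURB':    ['A', 'A', 'B', 'A', 'B'],
    'PRICE':     [200000, 400000, 500000, 300000, 450000],
})

# Approach used in this section: mean PRICE per (year, suburb), then min across suburbs
min_of_suburb_means = toy.pivot_table('PRICE', index='SOLD_YEAR', columns='SUBURB').min(axis=1)

# Alternative: minimum over the individual sale prices of each year
min_of_sales = toy.groupby('SOLD_YEAR')['PRICE'].min()

print(min_of_suburb_means)
print(min_of_sales)
```

For 2015 the first approach gives 300000 (the mean of suburb A), while the second gives 200000 (the cheapest single sale). Neither is wrong, but they answer slightly different questions.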

## 6.4. <span style='color:rgb(97, 47, 205)'> Presold Property Model </span>
- Having gained some insight into our data, we noted that __presold__ properties do exist in our data, and that they tend to be on the lower end of the <code>PRICE</code> range.
- It would be interesting to figure out what kind of model works for this subset of data. The motivation is that __higher errors__ tend to be associated with higher <code>PRICE</code> values, so if we can narrow down our data and create specific subgroups without simply using <code>PRICE</code>, we might be able to improve our model.
- As a very simple first step, let's create a One-Hot-Encoding of a boolean <code>PRESOLD</code> feature.
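
For a boolean column, `pd.get_dummies` with `drop_first=True` drops the `False` level and keeps a single indicator column, `PRESOLD_True`. A minimal sketch on a made-up dataframe (hypothetical values; `toy` merely stands in for `tdf_num_red1`, and storing `SOLD_YEAR` as strings is an assumption made to mirror the `astype('int')` cast):

```python
import pandas as pd

# Toy stand-in for tdf_num_red1 (hypothetical values, not real data)
toy = pd.DataFrame({
    'SOLD_YEAR':  ['2012', '2015', '2018'],  # strings in this toy example, hence astype('int')
    'BUILD_YEAR': [2013, 2010, 2018],
})

# A property counts as presold when it was sold before its build year
toy['PRESOLD'] = toy['SOLD_YEAR'].astype('int') < toy['BUILD_YEAR']

# Two levels (False, True); drop_first removes the first one (False)
dummy = pd.get_dummies(toy['PRESOLD'], prefix='PRESOLD', drop_first=True)
print(dummy.columns.tolist())
```

Since the flag is already boolean, the encoded column carries the same information as `PRESOLD` itself; the encoding mainly gives it a form consistent with other one-hot features.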

In [None]:
tdf_num_red1['PRESOLD'] = tdf_num_red1['SOLD_YEAR'].astype('int') < tdf_num_red1['BUILD_YEAR']
presold_dummy = pd.get_dummies(tdf_num_red1['PRESOLD'],prefix='PRESOLD',drop_first=True)
tdf_num_red2 = pd.concat([tdf_num_red1,presold_dummy],axis=1)
tdf_num_red2.columns

__Feature Importance__

Aside from the __XGB Model__, the One-Hot-Encoded <code>PRESOLD</code> feature doesn't seem to have any serious influence on feature importance.

In [None]:
transformer(feature_importance=True,figsize=(800,350),target='PRICE').transform(X=tdf_num_red2) 

__Model Evaluation__

- We can note that the addition has had a positive effect: the overall score has gone down in cross-validation.
- The __CAT model__ most notably seems to perform quite well; however, there is still a subset of data that tends to cause the model to output NaN values.

In [None]:
out_5 = eval(target='PRICE',split_id=0.3,shuffle=False,verbose=False,
             cv_yrange=(None,None),hm_vvals=[0.0,1.0,0.325]).transform(X=tdf_num_red2)